


This notebook walks through a simple fine-tuning example. You can fine-tune either the `twitter-roberta-base` language model (https://huggingface.co/cardiffnlp/twitter-roberta-base-2021-124m), or `twitter-roberta-base-sentiment` (https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest), which has already been fine-tuned on English Twitter data for sentiment analysis.

This notebook was modified from https://huggingface.co/transformers/v3.2.0/custom_datasets.html

# Fine-tuning and Evaluation of Language Models

Install necessary libraries

In [None]:
%pip install datasets
%pip install transformers
%pip install scikit-learn
%pip install matplotlib
%pip install numpy
%pip install pandas
%pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
%pip install accelerate -U
%pip install gdown
%pip install psutil jsonlines


Import relevant libraries

In [None]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments, EarlyStoppingCallback, set_seed
from sklearn.metrics import classification_report
import datasets
import torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rc("font", size=25)

In [None]:
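# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Release cached GPU memory left over from earlier runs (does nothing without CUDA)
torch.cuda.empty_cache()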


In [None]:
import psutil

def find_max_batch_size_vmem(percentage=0.8):
    """Rough upper bound on the batch size, based on available system RAM (not GPU memory)."""
    ram = psutil.virtual_memory()
    swap = psutil.swap_memory()

    total_physical_memory = ram.total          # total physical memory (RAM) in bytes
    available_physical_memory = ram.available  # RAM that can be used without swapping, in bytes
    total_swap_memory = swap.total             # total swap space in bytes

    print(f"Total Physical Memory (RAM): {total_physical_memory / (1024 ** 3):.2f} GB")
    print(f"Available Physical Memory (RAM): {available_physical_memory / (1024 ** 3):.2f} GB")
    print(f"Total Virtual Memory (Swap): {total_swap_memory / (1024 ** 3):.2f} GB")

    # Heuristic: spend `percentage` of the available RAM, assuming 4 KB per element
    return int((available_physical_memory * percentage) / (4 * 1024))

# Example: Find the maximum batch size using 80% of available RAM
MAX_BATCH_SIZE = find_max_batch_size_vmem(0.8)
print(f"Max batch size: {MAX_BATCH_SIZE}")


# Data
We will use the sentiment dataset from the TweetEval benchmark, but feel free to use your own dataset if you prefer!

## Option 1: Download the dataset from CardiffNLP's github.


The cell below loads the TweetEval dataset for the sentiment task.
Datasets are also available for the following tasks:
- Emoji Prediction (emoji)
- Emotion Recognition (emotion)
- Hate Speech Detection (hate)
- Irony Detection (irony)
- Offensive Language Identification (offensive)
- Stance Detection (stance)

See: https://github.com/cardiffnlp/tweeteval/tree/main/datasets for more details


In [None]:
import requests

task = "sentiment"

files = ["test_labels.txt", "test_text.txt", "train_labels.txt",
         "train_text.txt", "val_labels.txt", "val_text.txt"]

for filename in files:
    url = f"https://raw.githubusercontent.com/cardiffnlp/tweeteval/main/datasets/{task}/{filename}"
    response = requests.get(url)
    if response.status_code == 200:
        # Save the content of the response to a local file
        # (the file handle gets its own name so it does not shadow the loop variable)
        with open(filename, "wb") as out_file:
            out_file.write(response.content)
        print(f"Downloaded {filename} successfully.")
    else:
        print(f"Failed to download {filename}. Status code: {response.status_code}")


We now read the data from the files we downloaded, put it in a more usable structure, and create the train, validation, and test sets, i.e. `{ 'train': { 'text': ['foobar', ...], 'labels': [0, ...] }, ... }`.


In [None]:
dataset_dict = {}
for split in ['train', 'val', 'test']:
    dataset_dict[split] = {}
    for field in ['text', 'labels']:
        with open(f"{split}_{field}.txt", encoding="utf-8") as in_file:
            # ignore the empty string that follows the final newline of the file
            dataset_dict[split][field] = in_file.read().split('\n')[:-1]
        if field == 'labels':
            dataset_dict[split][field] = [int(x) for x in dataset_dict[split][field]]

MAX_TRAINING_EXAMPLES = 7500 # set this to -1 if you want to use the whole training set

# Slicing with [:-1] would silently drop the last example, so only truncate for a real limit
if MAX_TRAINING_EXAMPLES != -1:
    dataset_dict['train']['text'] = dataset_dict['train']['text'][:MAX_TRAINING_EXAMPLES]
    dataset_dict['train']['labels'] = dataset_dict['train']['labels'][:MAX_TRAINING_EXAMPLES]
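As a quick illustration of the structure built in the cell above, the sketch below parses two made-up strings that stand in for the contents of `train_text.txt` and `train_labels.txt`, then counts the examples per class. The toy tweets and the `id2label` dictionary name are assumptions for this example; the ids follow the TweetEval sentiment convention (0 = negative, 1 = neutral, 2 = positive).

```python
from collections import Counter

# Toy stand-ins for the contents of train_text.txt / train_labels.txt (made up for illustration)
text_lines = "great game tonight\nworst service ever\nmeeting moved to 3pm\n"
label_lines = "2\n0\n1\n"

toy_split = {
    # drop the empty string that follows the final newline, as in the cell above
    "text": text_lines.split("\n")[:-1],
    "labels": [int(x) for x in label_lines.split("\n")[:-1]],
}

# Every text must have exactly one label
assert len(toy_split["text"]) == len(toy_split["labels"])

# TweetEval sentiment label ids
id2label = {0: "negative", 1: "neutral", 2: "positive"}
label_counts = Counter(id2label[y] for y in toy_split["labels"])
print(label_counts)
```

Running the same count on the real `dataset_dict['train']['labels']` is a cheap way to spot class imbalance before training.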



In [None]:
# Transform dictionaries to datasets.Dataset for easier preprocessing (https://huggingface.co/docs/datasets/v1.11.0/loading_datasets.html#from-a-python-dictionary)
train_dataset = datasets.Dataset.from_dict(dataset_dict['train'])
val_dataset = datasets.Dataset.from_dict(dataset_dict['val'])
test_dataset = datasets.Dataset.from_dict(dataset_dict['test'])

Initialize the model's tokenizer and use it to encode the text.

In [None]:
# The tokenizer must match the model that is fine-tuned below, so both are loaded from the same name.
# Replace it with the name of the pre-trained model you want to use.
MODEL = "cardiffnlp/twitter-roberta-base-2021-124m" # use this to finetune the language model

# Create the tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL, use_fast=True)

# Encode the text of every split; truncation caps each example at the model's maximum length
train_dataset = train_dataset.map(lambda e: tokenizer(e['text'], truncation=True), batched=True)
val_dataset = val_dataset.map(lambda e: tokenizer(e['text'], truncation=True), batched=True)
test_dataset = test_dataset.map(lambda e: tokenizer(e['text'], truncation=True), batched=True)

## Option 2: Download the dataset directly from huggingface (https://huggingface.co/datasets/tweet_eval).

In [None]:
# load dataset using 'datasets' library by specifying the name of the dataset and the subset (task).
task = 'sentiment'
dataset = datasets.load_dataset('tweet_eval', task)

In [None]:
MODEL = "cardiffnlp/twitter-roberta-base-2021-124m" # use this to finetune the language model

# use model's tokenizer to get text encodings
tokenizer = AutoTokenizer.from_pretrained(MODEL, use_fast=True)

dataset = dataset.map(lambda e: tokenizer(e['text'], truncation=True), batched=True)

MAX_TRAINING_EXAMPLES = 7500 # set this to -1 if you want to use the whole training set
# make sure to use whole train dataset if MAX_TRAINING_EXAMPLES == -1
if MAX_TRAINING_EXAMPLES == -1: MAX_TRAINING_EXAMPLES = dataset['train'].num_rows
# split into train/val/test sets, keeping at most MAX_TRAINING_EXAMPLES training examples
train_dataset = dataset['train'].select(range(min(MAX_TRAINING_EXAMPLES, dataset['train'].num_rows)))
val_dataset = dataset['validation']
test_dataset = dataset['test']

In [None]:
print(dataset['train'][3])
print(dataset['test'])
print(dataset['validation'])

# Parameters

The steps above prepared the datasets in the format the trainer expects. Now all we need to do is create a model
to fine-tune, define the `TrainingArguments`/`TFTrainingArguments` and
instantiate a `Trainer`/`TFTrainer`.

More information about the Trainer's arguments can be found here: https://huggingface.co/docs/transformers/v4.20.0/en/main_classes/trainer#transformers.TrainingArguments

| Parameter                        | Description                                                | Effect of a larger value                                       | Effect of a smaller value                                      | Library default (v4.20)  |
|----------------------------------|------------------------------------------------------------|----------------------------------------------------------------|----------------------------------------------------------------|--------------------------|
| `output_dir`                     | Directory for saving the trained model and checkpoints.   | n/a (path)                                                     | n/a (path)                                                     | none (required)          |
| `num_train_epochs`               | Total training epochs.                                     | Longer training, may improve performance or overfit.          | Shorter training, may result in underfitting.                  | `3.0`                    |
| `per_device_train_batch_size`    | Batch size per device during training.                    | Faster training, but higher GPU memory usage.                  | Slower training, lower GPU memory usage.                       | `8`                      |
| `per_device_eval_batch_size`     | Batch size per device during evaluation.                  | Faster evaluation, but higher GPU memory usage.                | Slower evaluation, lower GPU memory usage.                     | `8`                      |
| `warmup_steps`                   | Number of warm-up steps for the learning rate scheduler.  | Smoother learning rate ramp, potentially better convergence.  | Sharper learning rate ramp, may cause training instability.    | `0`                      |
| `weight_decay`                   | Strength of weight decay (regularization).                | Stronger regularization, potentially better generalization.   | Weaker regularization, may result in overfitting.              | `0.0`                    |
| `logging_dir`                    | Directory for storing training logs.                      | n/a (path)                                                     | n/a (path)                                                     | `output_dir/runs/CURRENT_DATETIME_HOSTNAME` |
| `logging_steps`                  | Number of steps between two training log entries.         | Less frequent logs, reduced log volume.                        | More frequent logs, increased log volume.                      | `500`                    |
| `evaluation_strategy`            | When to evaluate during training (`'no'`, `'steps'`, `'epoch'`). | n/a (categorical)                                        | n/a (categorical)                                              | `'no'`                   |
| `eval_steps`                     | Number of steps between two evaluations if `evaluation_strategy` is set to `'steps'`. | Less frequent evaluation, coarser monitoring. | More frequent evaluation, finer monitoring. | `None` (falls back to `logging_steps`) |
| `load_best_model_at_end`         | Load the best checkpoint, based on the evaluation metric, at the end of training. | n/a (boolean)                           | n/a (boolean)                                                  | `False`                  |
| `save_steps`                     | Number of steps between two checkpoints.                  | Fewer checkpoints saved, less recovery granularity.            | More checkpoints saved, finer-grained recovery.                | `500`                    |
| `seed`                           | Random seed for reproducibility.                          | n/a (any fixed value gives reproducible runs)                  | n/a (any fixed value gives reproducible runs)                  | `42`                     |




1. **output_dir='./results'**: This parameter specifies the directory where the trained model and related files will be saved. It plays a crucial role in organizing and storing the outputs of your training process. By setting `output_dir`, you can choose where your model checkpoints, training logs, and other artifacts will be stored. This directory is often customized based on your project's file structure and storage preferences. For instance, you might want to store models in a specific location, such as a dedicated folder for model versions, making it easier to manage and access your trained models.

2. **num_train_epochs=EPOCHS**: `num_train_epochs` determines the total number of training epochs, with each epoch representing one pass through the entire training dataset. Increasing this value allows the model to learn from the data for more iterations, potentially improving its performance. However, a higher number of epochs also increases the risk of overfitting, where the model memorizes the training data instead of generalizing to unseen data. Therefore, you should carefully select the appropriate number of epochs based on your dataset's size and complexity. It's common practice to start with a small number of epochs and gradually increase it while monitoring training and validation performance. This iterative approach helps strike a balance between underfitting and overfitting, leading to a well-trained model.

3. **per_device_train_batch_size=BATCH_SIZE** and **per_device_eval_batch_size=BATCH_SIZE**: These parameters control the batch size per device during training and evaluation, respectively. Batch size determines how many training examples are processed in each forward and backward pass on each GPU or device. A larger batch size can accelerate training, as it allows for parallelism, but it also requires more GPU memory. The choice of batch size depends on your available hardware resources. You may increase it to train faster if your GPU has sufficient memory or decrease it to fit within GPU constraints. When setting these values, consider the total batch size (batch size per device multiplied by the number of devices) to ensure efficient GPU utilization.

4. **warmup_steps=500**: The `warmup_steps` parameter specifies the number of warm-up steps for the learning rate scheduler. During warm-up, the learning rate gradually increases from zero to its target value. This technique helps stabilize the training process, preventing large gradient updates in the early stages, which can lead to convergence issues. Setting `warmup_steps` to 500 means that the learning rate ramps up over the first 500 optimization steps of training. The right value depends on your model architecture, dataset, and the total number of training steps, so it is usually chosen empirically to balance stability against convergence speed.

5. **weight_decay=0.01**: `weight_decay` is a regularization hyperparameter. The Trainer's default AdamW optimizer applies it as decoupled weight decay, which shrinks the weights slightly at every update and plays a role similar to an L2 penalty: it discourages large parameter values and pushes the model toward a simpler solution. Larger values regularize more strongly and can reduce overfitting, where the model fits the training data too closely and fails to generalize to new data, while values that are too large can cause underfitting. A value of 0.01 is a common starting point when fine-tuning transformer models; it is worth trying a few values during hyperparameter tuning to find the one that works best for your model and dataset.

6. **logging_dir='./logs'** and **logging_steps=160**: These parameters are related to logging training progress. `logging_dir` specifies the directory where training logs, including loss and evaluation metrics, are saved. It provides a record of how the model's performance changes during training. `logging_steps` determines how often training logs are printed. In this case, logs are printed every 160 steps. You can adjust `logging_dir` to customize where your logs are stored, and you can modify `logging_steps` to control the frequency of log messages. Logging is crucial for monitoring training progress, diagnosing issues, and analyzing model behavior.

7. **evaluation_strategy='steps'** and **eval_steps=160**: These parameters control the strategy for model evaluation during training. `evaluation_strategy` is set to 'steps', indicating that the model will be evaluated at regular intervals specified by `eval_steps`. Evaluating the model during training is essential for monitoring its performance and ensuring that it's learning effectively. By setting `eval_steps` to 160, the model will be evaluated every 160 training steps. This frequency allows you to track how the model performs throughout the training process. You can adjust these parameters based on your specific needs and hardware capabilities. For example, if you have limited resources, you might increase `eval_steps` to reduce evaluation frequency.

8. **load_best_model_at_end=True**: This parameter determines whether the best-performing checkpoint, based on the evaluation metric, should be loaded at the end of training. With it set to True, the Trainer reloads the checkpoint with the best evaluation metric (the evaluation loss unless `metric_for_best_model` is set) when training completes, so the model you save and use for inference is the best one observed rather than the last one. This option requires the save strategy to match the evaluation strategy, and with `'steps'` the value of `save_steps` must be a round multiple of `eval_steps`, which is why both are set to 160 here. It is also required by the early stopping callback used below.

9. **save_steps=160**: `save_steps` defines the frequency at which model checkpoints are created during training. A checkpoint is a saved model state that allows you to recover the model's parameters and continue training or perform further analysis. By setting `save_steps` to 160, a checkpoint will be created every 160 training steps. The choice of this parameter depends on how often you want to save model progress and whether you anticipate the need to recover training from intermediate points. It can also impact the storage space required for checkpoints.

10. **seed=seed**: `seed` sets the random seed for reproducibility. When you use the same seed value across different runs, the random initialization of the new classification head and the shuffling of the training data are consistent. Reproducibility is crucial for research and debugging, allowing you to compare results and identify issues. Note that `TrainingArguments` falls back to a seed of 42 when none is given, so to measure how much results vary between runs you need to pass different seeds explicitly.
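To make the step-based settings above more concrete, here is a minimal sketch of the arithmetic behind them. It assumes a single device, no gradient accumulation, the 7500 training examples and batch size of 100 used in this notebook, and a linear warm-up followed by linear decay (the Trainer's default scheduler type). The function name `linear_warmup_lr` is made up for this example and is not a library function.

```python
import math

# Values taken from the notebook's cells; a single device is assumed
num_examples = 7500      # MAX_TRAINING_EXAMPLES
batch_size = 100         # per_device_train_batch_size
epochs = 50              # EPOCHS
warmup_steps = 500
eval_steps = 160
peak_lr = 2e-5           # LR

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * epochs

def linear_warmup_lr(step):
    """Learning rate under linear warm-up followed by linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

print(f"steps per epoch: {steps_per_epoch}, total steps: {total_steps}")
print(f"warm-up covers {warmup_steps / steps_per_epoch:.1f} epochs")
print(f"one evaluation every {eval_steps / steps_per_epoch:.1f} epochs")
for step in (0, 250, 500, 2125, total_steps):
    print(step, linear_warmup_lr(step))
```

Under these assumptions an epoch is only 75 steps, so a 500-step warm-up spans several epochs and an evaluation happens roughly every two epochs. With a larger training set the same step counts would cover a much smaller fraction of an epoch.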


In [None]:
MAX_BATCH_SIZE = find_max_batch_size_vmem(0.1)  # RAM-based upper bound, printed for reference only
LR = 2e-5
EPOCHS = 50
BATCH_SIZE = 100

# set transformers seed
seed = 223
set_seed(seed)

training_args = TrainingArguments(
    output_dir='./results',                   # output directory
    num_train_epochs=EPOCHS,                  # total number of training epochs
    learning_rate=LR,                         # peak learning rate, reached at the end of warmup
    per_device_train_batch_size=BATCH_SIZE,   # batch size per device during training
    per_device_eval_batch_size=BATCH_SIZE,    # batch size for evaluation
    warmup_steps=500,                         # number of warmup steps for learning rate scheduler
    weight_decay=0.01,                        # strength of weight decay
    logging_dir='./logs',                     # directory for storing logs
    logging_steps=160,                        # when to print log
    evaluation_strategy='steps',              # evaluate every n number of steps.
    eval_steps=160,                           # how often to evaluate. If not set defaults to number of logging_steps
    load_best_model_at_end=True,              # to load or not the best model at the end
    save_steps=160,                           # create a checkpoint every time we evaluate
    seed=seed                                 # seed for consistent results
)

print(MAX_BATCH_SIZE, training_args)


num_labels = len(set(train_dataset['labels'])) if 'labels' in train_dataset.features.keys() else len(set(train_dataset['label']))

model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=num_labels).to(device)


In [None]:

train_dataset.set_format(type="torch", device=device)  # Move the training dataset to the selected device
val_dataset.set_format(type="torch", device=device)    # Move the validation dataset to the selected device


trainer = Trainer(
    model=model,                              # the instantiated 🤗 Transformers model to be trained
    tokenizer=tokenizer,                      # tokenizer to be used to pad the inputs
    args=training_args,                       # training arguments, defined above
    train_dataset=train_dataset,              # training dataset
    eval_dataset=val_dataset,                 # evaluation dataset
    # early stopping: stop after 3 consecutive evaluations without an improvement of at least 0.001
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3, early_stopping_threshold=0.001)],
)

# import gc
# gc.collect()
# torch.cuda.empty_cache()

trainer.train()

# Save the trained model and tokenizer to the same directory
trainer.save_model("./results/best_model")
tokenizer.save_pretrained("./results/best_model")

# Get the best model's state dictionary directly from the trainer. Reading it back from disk is
# fragile because, depending on the transformers version, save_model writes either
# pytorch_model.bin or model.safetensors.
best_model_state_dict = trainer.model.state_dict()

# Define the path to save the PyTorch model's state dictionary as a .pth file
output_path = "./results/best_model.pth"

# Save the state dictionary to a .pth file
torch.save(best_model_state_dict, output_path)
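The early stopping configured in the cell above (a patience of 3 and a threshold of 0.001) can be illustrated with a small simulation. This is a simplified sketch of the idea, not the library's source code: the function name `early_stop_index` and the loss values are made up, and it assumes the monitored metric is the evaluation loss, where lower is better.

```python
def early_stop_index(eval_losses, patience=3, threshold=0.001):
    """Return the index of the evaluation at which training would stop, or None."""
    best = None
    evals_without_improvement = 0
    for i, loss in enumerate(eval_losses):
        # An evaluation counts as an improvement only if it beats the best loss by more than the threshold
        if best is None or (loss < best and best - loss > threshold):
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1
        # The best value itself is updated on any improvement, however small
        if best is None or loss < best:
            best = loss
        if evals_without_improvement >= patience:
            return i
    return None

# Hypothetical evaluation losses, one per evaluation (every 160 steps in this notebook)
losses = [0.90, 0.70, 0.65, 0.6495, 0.66, 0.67, 0.64]
print(early_stop_index(losses))  # → 5
```

In this made-up run the fourth evaluation improves by only 0.0005, which is below the threshold, so it already counts toward the patience; training stops two evaluations later and the final value of 0.64 is never reached.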

# Evaluate on Test set

| Metric          | Description                                                             |
|-----------------|-------------------------------------------------------------------------|
| True Positive   | The number of correctly predicted positive instances.                   |
| True Negative   | The number of correctly predicted negative instances.                   |
| False Positive  | The number of negative instances incorrectly predicted as positive.     |
| False Negative  | The number of positive instances incorrectly predicted as negative.     |
| Accuracy        | The proportion of correctly classified instances (TP + TN) over total.  |
| Precision       | The proportion of true positives over the total predicted positives.    |
| Recall (Sensitivity or True Positive Rate) | The proportion of true positives over the total actual positives. |
| Specificity     | The proportion of true negatives over the total actual negatives.       |
| F1-Score        | The harmonic mean of precision and recall, balances precision-recall.  |
| Support         | The number of instances in each class in the true dataset.             |

For fine-tuning a model like RoBERTa for sentiment analysis on Twitter data, you can choose a set of evaluation metrics that are suitable for binary or multi-class classification tasks. Here are some common evaluation metrics to consider:

- **Accuracy**: Accuracy measures the proportion of correctly classified instances among all instances. It's a good overall metric for classification tasks but may not be ideal if the classes are imbalanced.
- **Precision**: Precision measures the proportion of true positive predictions out of all positive predictions made by the model. It's useful when you want to minimize false positives. In sentiment analysis, high precision for a class means that when the model predicts that class, it is usually right.
- **Recall (Sensitivity)**: Recall measures the proportion of true positive predictions out of all actual positive instances. It's useful when you want to minimize false negatives. High recall means the model is good at capturing all positive instances in the data.
- **F1-Score**: The F1-score is the harmonic mean of precision and recall. It provides a balance between precision and recall, making it a good metric when you want to find a compromise between false positives and false negatives.
- **ROC-AUC (Receiver Operating Characteristic - Area Under Curve)**: ROC-AUC measures the area under the ROC curve, which plots the true positive rate (recall) against the false positive rate. It's useful for evaluating the model's ability to distinguish between positive and negative classes across different threshold values.
- **Macro-F1 Score**: If you have a multi-class sentiment analysis task (e.g., classifying tweets into positive, negative, and neutral), you can use the macro-F1 score, which calculates the F1 score for each class independently and then takes the average. It's useful for assessing the model's performance across all classes.
- **Confusion Matrix**: A confusion matrix provides a detailed breakdown of true positives, true negatives, false positives, and false negatives for each class. It's helpful for understanding where the model is making errors and which classes are challenging to classify.
- **Cohen's Kappa**: Cohen's Kappa is a statistic that measures the agreement between the model's predictions and the actual labels, considering the possibility of agreement occurring by chance. It's a useful metric for inter-rater agreement in sentiment analysis.
- **Matthews Correlation Coefficient**: Matthews Correlation Coefficient (MCC) takes into account true positives, true negatives, false positives, and false negatives to provide a single metric that summarizes the quality of a binary classification model. It's particularly useful when dealing with imbalanced datasets.
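As a worked example of the definitions above, the sketch below computes several of these metrics from the four confusion-matrix counts of a binary classifier. The counts are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math

# Hypothetical confusion-matrix counts for a binary classifier
tp, tn, fp, fn = 40, 45, 5, 10

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)            # of everything predicted positive, how much was right
recall = tp / (tp + fn)               # of everything actually positive, how much was found
specificity = tn / (tn + fp)          # of everything actually negative, how much was found
f1 = 2 * precision * recall / (precision + recall)
mcc = (tp * tn - fp * fn) / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

for name, value in [("accuracy", accuracy), ("precision", precision), ("recall", recall),
                    ("specificity", specificity), ("f1", f1), ("mcc", mcc)]:
    print(f"{name}: {value:.3f}")
```

For a three-class task such as TweetEval sentiment, the same formulas are applied per class by treating that class as "positive" and the other two as "negative".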

In [None]:
# For every example the model outputs logits; the largest value indicates the predicted class.
# trainer.predict returns (predictions, label_ids, metrics); the metrics are not needed here, hence `_`.
test_preds_raw, test_labels, _ = trainer.predict(test_dataset)

# test_preds_raw holds one row per test example and one column per class (raw scores before softmax).
# test_labels holds the true labels in the same order as the rows of test_preds_raw.
# argmax over the last axis picks the index of the highest-scoring class for each example.
test_preds = np.argmax(test_preds_raw, axis=-1)

# Print precision, recall, F1-score and support for each class, with three decimal places.
print(classification_report(test_labels, test_preds, digits=3))

We can also check how "sure" the model is for every prediction by getting the softmax scores for each prediction.

In [None]:
from scipy.special import softmax

scores = softmax(test_preds_raw, axis=1)
print(scores)
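To show what the softmax in the cell above does to a single row of logits, here is a minimal pure-Python sketch. The logits are hypothetical, and the `softmax` function below is a hand-written stand-in for illustration, not the SciPy implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    largest = max(logits)
    # Subtracting the largest logit avoids overflow and does not change the result
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

row = [2.0, 1.0, 0.1]  # hypothetical logits for negative, neutral, positive
probs = softmax(row)
print([round(p, 3) for p in probs])  # → [0.659, 0.242, 0.099]
```

The scores are positive and sum to 1, and the largest logit still gets the largest score, so taking the argmax of the softmax scores gives the same prediction as taking the argmax of the logits.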

# Make predictions on unseen titles

First we will see how to get predictions using a custom function.

In [None]:
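import numpy as np
import torch
from sklearn.metrics import f1_score
import matplotlib.pyplot as plt
import jsonlines  # Import the jsonlines library for reading JSONL files
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Function to read the 'Title' field of every record in a JSONL file
def extract_titles_from_json(file_path):
    titles = []
    with jsonlines.open(file_path, 'r') as json_file:
        for item in json_file:
            if 'Title' in item:
                titles.append(item['Title'])
    return titles

def load_local_model(model_path, tokenizer_path):
    # Load the model and the tokenizer from local files only
    model = AutoModelForSequenceClassification.from_pretrained(model_path, local_files_only=True)
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_path, local_files_only=True)
    return model, tokenizer

# Sketch of the custom prediction function (an assumption: the original cell does not define one)
def predict_labels(texts, model, tokenizer, batch_size=32):
    model.eval()
    predicted = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        inputs = tokenizer(batch, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        predicted.extend(logits.argmax(dim=-1).tolist())
    return predicted

jsonl_file_path = '../dump.jsonl'  # Replace with your file path

# Read titles from the JSONL file
tweets = extract_titles_from_json(jsonl_file_path)

# Both paths point at the directory written by trainer.save_model and tokenizer.save_pretrained
model_path = "../fine_tuning/results/best_model"  # Replace with the actual path to your saved model
tokenizer_path = "../fine_tuning/results/best_model"  # Replace with the appropriate tokenizer name or path

# Load the model and tokenizer
model, tokenizer = load_local_model(model_path, tokenizer_path)

# Predict a class id (0 = negative, 1 = neutral, 2 = positive) for every title
title_predictions = predict_labels(tweets, model, tokenizer)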



# The unseen titles have no gold labels, so the F1 scores are calculated on the test set,
# using test_labels and test_preds from the evaluation cell above
f1_macro = f1_score(test_labels, test_preds, average="macro")
f1_micro = f1_score(test_labels, test_preds, average="micro")
f1_weighted = f1_score(test_labels, test_preds, average="weighted")

print("F1-Macro Score:", f1_macro)
print("F1-Micro Score:", f1_micro)
print("F1-Weighted Score:", f1_weighted)

# Create a bar chart comparing the three averaging methods
averages = ['macro', 'micro', 'weighted']
f1_scores = [f1_macro, f1_micro, f1_weighted]

plt.figure(figsize=(8, 5))
plt.bar(averages, f1_scores, color='skyblue')
plt.xlabel('Averaging method')
plt.ylabel('F1 Score')
plt.title('F1 Scores by Averaging Method')
plt.ylim(0, 1)  # Set the y-axis limit between 0 and 1
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()


- **F1-Macro Score**: F1-macro (or Macro-F1) is an F1-score calculated independently for each class and then averaged. It gives equal weight to each class, regardless of its size or prevalence in the dataset. This means that F1-macro treats each class as equally important when calculating the score. It's a good metric to use when you want to ensure that the model performs well across all classes, especially in situations where class imbalances exist. The F1-macro score ranges from 0 (worst) to 1 (best).
- **F1-Micro Score**: F1-micro (or Micro-F1) is a global F1-score calculated by considering all instances and their predictions together. It aggregates true positives, false positives, and false negatives across all classes and then calculates the F1-score. F1-micro gives more weight to larger classes because it considers the overall performance across all instances. It's useful when you want to prioritize the overall performance of the model regardless of class sizes. The F1-micro score also ranges from 0 (worst) to 1 (best).
- **F1-Weighted Score**: F1-weighted (or Weighted F1) is another type of F1-score that calculates a weighted average of the F1-scores for each class. The weights are determined by the class distribution in the dataset. This means that F1-weighted gives more importance to classes with larger populations while still considering all classes. It's suitable when you want to balance the importance of classes based on their prevalence. The F1-weighted score ranges from 0 (worst) to 1 (best).
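The difference between the three averages is easiest to see on a small imbalanced example. The sketch below computes them by hand in pure Python; the ten labels and predictions are hypothetical, and the helper `f1_from_counts` is a made-up name for this illustration.

```python
def f1_from_counts(tp, fp, fn):
    """F1 from true positives, false positives and false negatives (0.0 if undefined)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical, imbalanced labels: six of class 0, three of class 1, one of class 2
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
classes = [0, 1, 2]

counts = {}
for c in classes:
    tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
    fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
    fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
    counts[c] = (tp, fp, fn)

per_class_f1 = {c: f1_from_counts(*counts[c]) for c in classes}
support = {c: y_true.count(c) for c in classes}

# Macro: plain mean of the per-class scores
f1_macro = sum(per_class_f1.values()) / len(classes)
# Micro: pool the counts over all classes, then compute one F1
f1_micro = f1_from_counts(*(sum(counts[c][i] for c in classes) for i in range(3)))
# Weighted: mean of the per-class scores, weighted by class frequency
f1_weighted = sum(per_class_f1[c] * support[c] for c in classes) / len(y_true)

print(f"macro: {f1_macro:.3f}, micro: {f1_micro:.3f}, weighted: {f1_weighted:.3f}")
# → macro: 0.479, micro: 0.700, weighted: 0.662
```

The rare class 2 is never predicted, so its F1 of 0 pulls the macro average down sharply, while the micro and weighted averages are dominated by the larger classes. In a single-label task like this one, the micro average equals the accuracy (7 of 10 correct).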
    