


This notebook walks through a simple fine-tuning example. You can fine-tune either the `twitter-roberta-base` language model (https://huggingface.co/cardiffnlp/twitter-roberta-base-2021-124m) or `twitter-roberta-base-sentiment` (https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest), which has already been fine-tuned for sentiment analysis on English Twitter data.

This notebook was modified from https://huggingface.co/transformers/v3.2.0/custom_datasets.html

# Fine-tuning and Evaluation of Language Models

Install necessary libraries

In [6]:
%pip install datasets
%pip install transformers
%pip install scikit-learn
%pip install matplotlib
%pip install numpy
%pip install pandas
%pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
%pip install accelerate -U
%pip install gdown


Note: you may need to restart the kernel to use updated packages.
Looking in indexes: https://download.pytorch.org/whl/cu117


Import relevant libraries

In [12]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments, EarlyStoppingCallback, set_seed
from sklearn.metrics import classification_report
import datasets
import torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rc("font", size=25)

In [58]:
# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [53]:
import psutil

def find_max_batch_size_vmem(percentage=0.8):
    """Rough upper bound on the batch size, derived from available host RAM.

    This is a coarse heuristic: it assumes 4 KB per element and looks only at
    system RAM, not at GPU memory.
    """
    memory = psutil.virtual_memory()
    # Total physical memory (RAM) in bytes
    total_physical_memory = memory.total

    # Available physical memory (RAM) in bytes
    available_physical_memory = memory.available

    # Physical memory currently in use, in bytes (this is not swap;
    # psutil.swap_memory() reports swap separately)
    used_physical_memory = memory.total - memory.available

    print(f"Total Physical Memory (RAM): {total_physical_memory / (1024 ** 3):.2f} GB")
    print(f"Available Physical Memory (RAM): {available_physical_memory / (1024 ** 3):.2f} GB")
    print(f"Used Physical Memory (RAM): {used_physical_memory / (1024 ** 3):.2f} GB")

    # Maximum batch size as a fraction of available RAM, assuming 4 KB per element
    return int((available_physical_memory * percentage) / (4 * 1024))

# Example: Find the maximum batch size using 80% of available RAM
MAX_BATCH_SIZE = find_max_batch_size_vmem(0.8)
print(f"Max batch size: {MAX_BATCH_SIZE}")


Total Physical Memory (RAM): 31.91 GB
Available Physical Memory (RAM): 17.10 GB
Used Physical Memory (RAM): 14.81 GB
Max batch size: 3586173


# Parameters

Several hyperparameters need to be chosen before training a transformer model, including:
- The learning rate (LR), which controls how large each update to the model's weights is. Larger values can speed up training but may make it unstable.
- The number of epochs (EPOCHS), which is how many times the model goes through the training data (1 epoch means the model sees the training set exactly once).
- The batch size (BATCH_SIZE), which is the number of samples passed through the model at one time.

There are many other hyperparameters, such as `weight_decay` and `warmup_ratio` (find more at: https://huggingface.co/docs/transformers/v4.20.1/en/main_classes/trainer#transformers.TrainingArguments). Feel free to experiment with them, but unless you have a good understanding of the models used, the default values are usually a sensible starting point.
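
To make the interaction between these values concrete, the short sketch below computes how many optimization steps a run takes for a given dataset size, batch size, and number of epochs. It assumes a single device and no gradient accumulation; the helper name `total_training_steps` is hypothetical and not part of any library.

```python
import math

def total_training_steps(num_examples, batch_size, epochs):
    """Number of optimizer steps, assuming one device and no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # the last batch may be partial
    return steps_per_epoch * epochs

# Full TweetEval sentiment training set (45615 examples), batch size 64, 30 epochs
full_run = total_training_steps(45615, 64, 30)
# Subset of 7500 training examples with the same settings
subset_run = total_training_steps(7500, 64, 30)
print(full_run, subset_run)
```

With 45615 examples this gives 21390 steps, which is the total shown in the training progress bar further down in this notebook.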

In [13]:
LR = 2e-5
EPOCHS = 30
BATCH_SIZE = 64
MODEL = "cardiffnlp/twitter-roberta-base-2021-124m" # use this to finetune the language model
#MODEL = "cardiffnlp/twitter-roberta-base-sentiment-latest" # use this to finetune the sentiment classifier
MAX_TRAINING_EXAMPLES = 7500 # set this to -1 if you want to use the whole training set


Training is non-deterministic (the same model trained on the same dataset can produce different results between runs), so we set a seed to make our experiments reproducible. In this notebook we use the seed 223.

In [14]:
# set transformers seed
seed = 223
set_seed(seed)
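
As a small illustration of what seeding does, the sketch below uses only Python's `random` module: two generators created with the same seed produce the same sequence. `set_seed` from `transformers` applies the same idea to the random number generators of Python, NumPy, and PyTorch. The helper `draw` is hypothetical.

```python
import random

def draw(seed, n=5):
    """Draw n pseudo-random integers from a generator initialised with `seed`."""
    rng = random.Random(seed)
    return [rng.randint(0, 99) for _ in range(n)]

first = draw(223)
second = draw(223)
print(first == second)  # identical seeds give identical sequences
```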

# Data
We will use the sentiment dataset from the TweetEval benchmark, but feel free to use your own dataset if you prefer!

## Option 1: Download the dataset from CardiffNLP's GitHub.


We load the TweetEval dataset for the sentiment task.
Datasets are also available for the following tasks:
- Emoji Prediction (emoji)
- Emotion Recognition (emotion)
- Hate Speech Detection (hate)
- Irony Detection (irony)
- Offensive Language Identification (offensive)
- Stance Detection (stance)

See: https://github.com/cardiffnlp/tweeteval/tree/main/datasets for more details


In [20]:
import requests

task = "sentiment"

files = ["test_labels.txt", "test_text.txt", "train_labels.txt",
         "train_text.txt", "val_labels.txt", "val_text.txt"]

for filename in files:
    url = f"https://raw.githubusercontent.com/cardiffnlp/tweeteval/main/datasets/{task}/{filename}"
    response = requests.get(url)
    if response.status_code == 200:
        # Save the content of the response to a local file with the same name
        with open(filename, "wb") as out_file:
            out_file.write(response.content)
        print("File downloaded successfully.")
    else:
        print(f"Failed to download the file. Status code: {response.status_code}")


File downloaded successfully.
File downloaded successfully.
File downloaded successfully.
File downloaded successfully.
File downloaded successfully.
File downloaded successfully.


We now read the data from the files we downloaded, restructure it into a more usable format, and create the train, validation, and test sets, i.e. `{ 'train': { 'text': ['foobar', ...], 'labels': [0, ...] }, ... }`.


In [22]:
dataset_dict = {}
for split in ['train', 'val', 'test']:
    dataset_dict[split] = {}
    for field in ['text', 'labels']:
        # read the whole file and close it again once it has been read
        with open(f"{split}_{field}.txt", encoding="utf-8") as in_file:
            lines = in_file.read().split('\n')[:-1]  # ignore last line of file
        if field == 'labels':
            lines = [int(x) for x in lines]
        dataset_dict[split][field] = lines

# optionally keep only the first MAX_TRAINING_EXAMPLES training examples
if MAX_TRAINING_EXAMPLES > 0:
    dataset_dict['train']['text'] = dataset_dict['train']['text'][:MAX_TRAINING_EXAMPLES]
    dataset_dict['train']['labels'] = dataset_dict['train']['labels'][:MAX_TRAINING_EXAMPLES]
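
The target structure can be illustrated on in-memory strings instead of files. This is a minimal sketch: `parse_split` is a hypothetical helper, and the two example tweets are made up.

```python
def parse_split(text_blob, labels_blob):
    """Turn newline-separated text and label blobs into {'text': [...], 'labels': [...]}."""
    texts = text_blob.split("\n")[:-1]  # drop the empty string after the final newline
    labels = [int(x) for x in labels_blob.split("\n")[:-1]]
    if len(texts) != len(labels):
        raise ValueError(f"{len(texts)} texts but {len(labels)} labels")
    return {"text": texts, "labels": labels}

example = parse_split("great game tonight\nstuck in traffic again\n", "2\n0\n")
print(example)
```

The length check is a cheap safeguard: a text file and a label file that fall out of step would otherwise silently pair tweets with the wrong labels.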



In [23]:
# Transform dictionaries to datasets.Dataset for easier preprocessing (https://huggingface.co/docs/datasets/v1.11.0/loading_datasets.html#from-a-python-dictionary)
train_dataset = datasets.Dataset.from_dict(dataset_dict['train'])
val_dataset = datasets.Dataset.from_dict(dataset_dict['val'])
test_dataset = datasets.Dataset.from_dict(dataset_dict['test'])

Initialize and use model's tokenizer to get the text encodings.

In [25]:
# Load the tokenizer that matches the model being fine-tuned
tokenizer = AutoTokenizer.from_pretrained(MODEL, use_fast=True)


train_dataset = train_dataset.map(lambda e: tokenizer(e['text'], truncation=True), batched=True)
val_dataset = val_dataset.map(lambda e: tokenizer(e['text'], truncation=True), batched=True)
test_dataset = test_dataset.map(lambda e: tokenizer(e['text'], truncation=True), batched=True)

Map: 100%|██████████| 7500/7500 [00:00<00:00, 35040.25 examples/s]
Map: 100%|██████████| 2000/2000 [00:00<00:00, 36021.47 examples/s]
Map: 100%|██████████| 12284/12284 [00:00<00:00, 40333.92 examples/s]


## Option 2: Download the dataset directly from Hugging Face (https://huggingface.co/datasets/tweet_eval).

In [34]:
# load dataset using 'datasets' library by specifying the name of the dataset and the subset (task).
task = 'sentiment'
dataset = datasets.load_dataset('tweet_eval', task)

In [36]:
# use model's tokenizer to get text encodings
tokenizer = AutoTokenizer.from_pretrained(MODEL, use_fast=True)

dataset = dataset.map(lambda e: tokenizer(e['text'], truncation=True), batched=True)

# resolve -1 to the size of the full training set; note that no subset is selected
# below, so this cell always uses the whole training set
if MAX_TRAINING_EXAMPLES == -1:
    MAX_TRAINING_EXAMPLES = dataset['train'].shape[0]
# split into train/val/test sets
train_dataset = dataset['train']
val_dataset = dataset['validation']
test_dataset = dataset['test']

Downloading (…)okenizer_config.json: 100%|██████████| 345/345 [00:00<?, ?B/s] 
Downloading (…)olve/main/vocab.json: 100%|██████████| 798k/798k [00:00<00:00, 2.75MB/s]
Downloading (…)olve/main/merges.txt: 100%|██████████| 456k/456k [00:00<00:00, 4.21MB/s]
Downloading (…)/main/tokenizer.json: 100%|██████████| 1.36M/1.36M [00:00<00:00, 4.97MB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████| 239/239 [00:00<?, ?B/s] 
Map: 100%|██████████| 45615/45615 [00:01<00:00, 41707.12 examples/s]
Map: 100%|██████████| 12284/12284 [00:00<00:00, 48449.94 examples/s]
Map: 100%|██████████| 2000/2000 [00:00<00:00, 36025.96 examples/s]
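
Option 2 as written keeps the full training split. If you want a random subset of `MAX_TRAINING_EXAMPLES` examples instead, one approach is to shuffle the indices with a fixed seed and keep only as many as needed. The sketch below shows the idea on plain Python indices; with the `datasets` library the analogous calls are `Dataset.shuffle(seed=...)` followed by `Dataset.select(...)`. The helper `pick_subset_indices` is hypothetical.

```python
import random

def pick_subset_indices(num_examples, max_examples, seed):
    """Return indices of a reproducible random subset (all indices if max_examples is -1 or too large)."""
    if max_examples < 0 or max_examples >= num_examples:
        return list(range(num_examples))
    rng = random.Random(seed)
    return rng.sample(range(num_examples), max_examples)

indices = pick_subset_indices(45615, 7500, seed=223)
print(len(indices), len(set(indices)))
```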


In [37]:
print(dataset['train'][3])
print(dataset['test'])
print(dataset['validation'])

{'text': "Chase Headley's RBI double in the 8th inning off David Price snapped a Yankees streak of 33 consecutive scoreless innings against Blue Jays", 'label': 1, 'input_ids': [0, 4771, 3175, 3928, 607, 18, 4515, 1457, 11, 5, 290, 212, 3715, 160, 871, 3655, 10098, 10, 6742, 3963, 9, 2357, 3396, 1471, 1672, 2699, 136, 2692, 10929, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
Dataset({
    features: ['text', 'label', 'input_ids', 'attention_mask'],
    num_rows: 12284
})
Dataset({
    features: ['text', 'label', 'input_ids', 'attention_mask'],
    num_rows: 2000
})
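
Tokenized tweets generally have different lengths, because tokenization here uses `truncation=True` but no padding. Before a batch is passed to the model, the sequences are padded to a common length and the attention mask marks which positions hold real tokens. The sketch below illustrates this on plain lists; it is not the library's implementation, `pad_batch` is a hypothetical helper, and `pad_id=1` is an assumption about the padding token ID.

```python
def pad_batch(batch_input_ids, pad_id=1):
    """Pad every sequence to the longest one in the batch and build attention masks."""
    max_len = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)             # pad on the right
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[0, 4771, 3175, 2], [0, 11, 2]])
print(batch["input_ids"])
print(batch["attention_mask"])
```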


# Fine-tuning

The steps above prepared the datasets in the format the trainer expects. Now all we need to do is create a model
to fine-tune, define the `TrainingArguments`, and
instantiate a `Trainer` (`TFTrainingArguments` and `TFTrainer` are the TensorFlow counterparts; this notebook uses the PyTorch classes).

More information about the Trainer's arguments can be found here: https://huggingface.co/docs/transformers/v4.20.0/en/main_classes/trainer#transformers.TrainingArguments
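
The hyperparameters defined earlier only take effect if they are passed to `TrainingArguments`. As a sketch, they can be collected in a dictionary of keyword arguments first; the dictionary name `hyperparams` is hypothetical, while its keys are `TrainingArguments` parameter names.

```python
LR = 2e-5
EPOCHS = 30
BATCH_SIZE = 64

hyperparams = {
    "learning_rate": LR,
    "num_train_epochs": EPOCHS,
    "per_device_train_batch_size": BATCH_SIZE,
    "per_device_eval_batch_size": BATCH_SIZE,
}

# With transformers installed, these could be unpacked as:
#   TrainingArguments(output_dir="./results", **hyperparams)
for name, value in hyperparams.items():
    print(f"{name} = {value}")
```

If `learning_rate` is omitted, `TrainingArguments` falls back to its default of 5e-5.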

In [59]:
MAX_BATCH_SIZE = find_max_batch_size_vmem(0.7)  # informational only; not passed to the trainer

training_args = TrainingArguments(
    output_dir='./results',                   # output directory
    num_train_epochs=EPOCHS,                  # total number of training epochs
    per_device_train_batch_size=BATCH_SIZE,   # batch size per device during training
    per_device_eval_batch_size=BATCH_SIZE,    # batch size for evaluation
    # learning_rate is not set here, so the default (5e-5) is used rather than LR
    warmup_steps=100,                         # number of warmup steps for learning rate scheduler
    weight_decay=0.01,                        # strength of weight decay
    logging_dir='./logs',                     # directory for storing logs
    logging_steps=160,                        # log the training loss every 160 steps
    evaluation_strategy='steps',              # evaluate every n number of steps
    eval_steps=160,                           # how often to evaluate. If not set defaults to number of logging_steps
    load_best_model_at_end=True,              # to load or not the best model at the end
    save_steps=160,                           # create a checkpoint every time we evaluate
    seed=seed                                 # seed for consistent results
)

print(MAX_BATCH_SIZE, training_args)


num_labels = len(set(train_dataset['labels'])) if 'labels' in train_dataset.features.keys() else len(set(train_dataset['label']))

model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=num_labels).to(device)

Total Physical Memory (RAM): 31.91 GB
Available Physical Memory (RAM): 16.83 GB
Used Physical Memory (RAM): 15.08 GB
3088974 TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=160,
evaluation_strategy=IntervalStrategy.STEPS,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_ste

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at cardiffnlp/twitter-roberta-base-2021-124m and are newly initialized: ['classifier.out_proj.weight', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [60]:

train_dataset.set_format(type="torch", device=device)  # Move the training dataset to the selected device
val_dataset.set_format(type="torch", device=device)    # Move the validation dataset to the selected device


trainer = Trainer(
    model=model,                              # the instantiated 🤗 Transformers model to be trained
    tokenizer=tokenizer,                      # tokenizer to be used to pad the inputs
    args=training_args,                       # training arguments, defined above
    train_dataset=train_dataset,              # training dataset
    eval_dataset=val_dataset,                  # evaluation dataset
    # early stopping: stop training after 3 consecutive evaluations without an improvement of at least 0.001
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3, early_stopping_threshold=0.001)],
)

trainer.train()

  1%|          | 160/21390 [00:27<59:04,  5.99it/s]  

{'loss': 0.7982, 'learning_rate': 4.985908877407234e-05, 'epoch': 0.22}



  1%|          | 160/21390 [00:29<59:04,  5.99it/s]

{'eval_loss': 0.6984858512878418, 'eval_runtime': 1.8728, 'eval_samples_per_second': 1067.905, 'eval_steps_per_second': 17.086, 'epoch': 0.22}


  1%|▏         | 320/21390 [00:57<58:08,  6.04it/s]  

{'loss': 0.6386, 'learning_rate': 4.9483325504931894e-05, 'epoch': 0.45}



  1%|▏         | 320/21390 [00:59<58:08,  6.04it/s]

{'eval_loss': 0.602426290512085, 'eval_runtime': 1.8698, 'eval_samples_per_second': 1069.622, 'eval_steps_per_second': 17.114, 'epoch': 0.45}


  2%|▏         | 480/21390 [01:27<58:43,  5.93it/s]  

{'loss': 0.6067, 'learning_rate': 4.9107562235791455e-05, 'epoch': 0.67}



  2%|▏         | 480/21390 [01:29<58:43,  5.93it/s]

{'eval_loss': 0.5889315009117126, 'eval_runtime': 1.8618, 'eval_samples_per_second': 1074.203, 'eval_steps_per_second': 17.187, 'epoch': 0.67}


  3%|▎         | 611/21390 [01:53<58:12,  5.95it/s]  
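
The early-stopping rule configured through `EarlyStoppingCallback` can be sketched in plain Python as follows. This is a simplified illustration for a metric where lower is better (such as the evaluation loss), not the library's implementation; `stopping_index` is a hypothetical helper and the loss values are made up.

```python
def stopping_index(eval_losses, patience=3, threshold=0.001):
    """Return the index of the evaluation at which training would stop, or None."""
    best = None
    bad_evals = 0
    for i, loss in enumerate(eval_losses):
        if best is None or best - loss > threshold:
            best = loss        # meaningful improvement: remember it and reset the counter
            bad_evals = 0
        else:
            bad_evals += 1     # no improvement beyond the threshold
            if bad_evals >= patience:
                return i
    return None

losses = [0.698, 0.602, 0.589, 0.5885, 0.590, 0.5889]
print(stopping_index(losses))
```

Here the last three evaluations never beat the best loss (0.589) by more than the threshold, so training would stop at the sixth evaluation (index 5).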

Tip: If you run into memory issues during training, try a smaller batch size.
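
A smaller per-device batch does not have to change the effective batch size: gradients can be accumulated over several smaller batches before each optimizer step (`gradient_accumulation_steps` in `TrainingArguments`). Below is a sketch of the arithmetic, assuming a single device; `accumulation_steps_for` is a hypothetical helper.

```python
def accumulation_steps_for(target_batch_size, per_device_batch_size):
    """Accumulation steps needed so that per-device batches add up to the target batch size."""
    if target_batch_size % per_device_batch_size != 0:
        raise ValueError("target batch size must be a multiple of the per-device batch size")
    return target_batch_size // per_device_batch_size

steps = accumulation_steps_for(64, 16)
print(steps, 16 * steps)  # accumulation steps and the resulting effective batch size
```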

In [41]:
trainer.save_model("./results/best_model") # save best model

# Evaluate on Test set

In [42]:
# for every prediction the model outputs logits, where the largest value indicates the predicted class
test_preds_raw, test_labels , _ = trainer.predict(test_dataset)
test_preds = np.argmax(test_preds_raw, axis=-1)
print(classification_report(test_labels, test_preds, digits=3))

100%|██████████| 192/192 [00:08<00:00, 21.65it/s]

              precision    recall  f1-score   support

           0      0.777     0.609     0.683      3972
           1      0.695     0.745     0.719      5937
           2      0.658     0.779     0.713      2375

    accuracy                          0.708     12284
   macro avg      0.710     0.711     0.705     12284
weighted avg      0.715     0.708     0.706     12284
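
The columns of the report are related: the F1-score is the harmonic mean of precision and recall, the macro average is the unweighted mean over classes, and the weighted average weights each class by its support. The sketch below recomputes the F1 column from the rounded precision and recall values printed above, so small rounding differences are possible in general.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, support) per class, copied from the report above
report = {0: (0.777, 0.609, 3972), 1: (0.695, 0.745, 5937), 2: (0.658, 0.779, 2375)}

f1 = {label: f1_score(p, r) for label, (p, r, _) in report.items()}
macro_f1 = sum(f1.values()) / len(f1)
total_support = sum(s for _, _, s in report.values())
weighted_f1 = sum(f1[label] * s for label, (_, _, s) in report.items()) / total_support

print({label: round(value, 3) for label, value in f1.items()})
print(round(macro_f1, 3), round(weighted_f1, 3))
```

For this report the recomputed values agree with the printed ones: 0.683, 0.719, and 0.713 per class, 0.705 macro, and 0.706 weighted.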






We can also check how "sure" the model is for every prediction by getting the softmax scores for each prediction.

In [43]:
from scipy.special import softmax

scores = softmax(test_preds_raw, axis=1)
scores

array([[0.70912045, 0.28368035, 0.00719918],
       [0.06748365, 0.67580575, 0.2567106 ],
       [0.22985502, 0.67808396, 0.09206098],
       ...,
       [0.3809683 , 0.59830207, 0.02072967],
       [0.84317213, 0.14105885, 0.01576903],
       [0.01285315, 0.1366287 , 0.85051817]], dtype=float32)
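
For reference, softmax turns a row of logits into probabilities that sum to 1 by exponentiating each value and dividing by the total. The sketch below implements it for a single row in plain Python; `softmax_row` is a hypothetical helper, and the logits are made up.

```python
import math

def softmax_row(logits):
    """Softmax of one row of logits, computed in a numerically stable way."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax_row([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])
```

The largest logit always receives the largest probability, so taking the argmax of the scores gives the same class as taking the argmax of the logits.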

# Make predictions on unseen tweets

We are going to apply the model we trained to tweets by the Prime Ministers of the UK (Boris Johnson) and Australia (Anthony Albanese) and their respective opposition leaders (Keir Starmer & Scott Morrison). Tweets were extracted from January 1 2022 to June 19 2022.

You can find more details on how to extract tweets using the Twitter API in this notebook: https://colab.research.google.com/drive/1RyiRY3aCUQ_K-PiXp1qN-8l7479uQa9f.

Download the dataset so that it can be loaded into a pandas DataFrame.

In [45]:
import subprocess
file_id = "1EN1jGxwprKxvzV2D4ML3dFp1fMlSrXEb"
output_file = "output_file_name.ext"  # Specify the name of the output file

# Run the gdown command to download the file
subprocess.call(["gdown", f"https://drive.google.com/uc?id={file_id}", "-O", output_file])




0

First we will see how to get predictions using a custom function.

In [46]:
def get_predictions(tweets):
    """Wrapper function to predict the sentiment of a list of tweets."""
    # set model to evaluation mode to deactivate Dropout
    trainer.model.eval()
    with torch.no_grad():
        encoded_input = tokenizer(
            tweets, padding=True, truncation=True, return_tensors='pt'
        )
        # move the encoded text to the model's device and pass it to the model
        output = trainer.model(**{k: v.to(device) for k, v in encoded_input.items()})
        # get logits and move them to cpu to get the predictions
        logits = output.logits.detach().cpu().numpy()
        predictions = np.argmax(logits, axis=1)

    return predictions

tweets = ["RT @UKLabour: Britain is facing the biggest rail strike in a generation but @GrantShapps hasn’t spent a single second in talks to avert it…",
          "Good news in today’s jobs stats: the number of employees on payrolls increased again in March.",
          "I'm #live in Gladstone with my Labor team: https://t.co/chWrHtumLc"]

# get predictions
predictions = get_predictions(tweets)
print(predictions)

# map predictions to negative/neutral/positive
sentiment_mapping = {
    0: 'negative',
    1: 'neutral',
    2: 'positive'
}

predictions = [sentiment_mapping[x] for x in predictions]
print(predictions)

[0 2 1]
['negative', 'positive', 'neutral']
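
For a longer list of tweets, passing everything to the model at once can exhaust memory, so it can help to predict in fixed-size batches. A minimal sketch: `predict_in_batches` is a hypothetical helper, and `dummy_predict` merely stands in for a function like `get_predictions` so that the example runs without a model.

```python
def predict_in_batches(texts, predict_fn, batch_size=32):
    """Apply predict_fn to consecutive slices of texts and concatenate the results."""
    predictions = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        predictions.extend(predict_fn(batch))
    return predictions

def dummy_predict(batch):
    # stand-in for a real model: label each text by its length modulo 3
    return [len(text) % 3 for text in batch]

texts = ["a", "bb", "ccc", "dddd", "eeeee"]
print(predict_in_batches(texts, dummy_predict, batch_size=2))
```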
