<a href="https://colab.research.google.com/github/Xelvise/FineTuning-pretrained-models-with-HuggingFace/blob/main/FineTuning_DistilBERT_for_Emotion_recognition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

- Transformers are a type of neural network architecture built around the attention mechanism, which lets the model learn long-range dependencies between different parts of a sequence.

- The original Transformer is composed of two main parts: an encoder and a decoder. The encoder takes the input sequence and produces a sequence of hidden states. The decoder then takes these hidden states and produces the output sequence. DistilBERT, the model fine-tuned in this notebook, uses only the encoder part.
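
To make the attention mechanism concrete, here is a minimal pure-Python sketch of scaled dot-product self-attention on a toy sequence. The input vectors are made-up numbers, and the learned query/key/value projections and multiple heads that a real model uses are left out for brevity.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # The output is the attention-weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy sequence of three 2-dimensional token vectors (illustrative values only).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(x, x, x)   # self-attention: queries = keys = values
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1 and shows how much one token attends to every token in the sequence, regardless of how far apart they are.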

In [None]:
!pip install -U transformers accelerate datasets bertviz umap-learn --quiet

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from datasets import load_dataset

# load emotion dataset from huggingface
tweets = load_dataset("dair-ai/emotion")



In [None]:
tweets

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
})

In [None]:
tweets.set_format(type='pandas')    # convert to a pandas data format
df = tweets['train'][:]
df

Unnamed: 0,text,label
0,i didnt feel humiliated,0
1,i can go from feeling so hopeless to so damned...,0
2,im grabbing a minute to post i feel greedy wrong,3
3,i am ever feeling nostalgic about the fireplac...,2
4,i am feeling grouchy,3
...,...,...
15995,i just had a very brief time in the beanbag an...,0
15996,i am now turning and i feel pathetic that i am...,0
15997,i feel strong and good overall,1
15998,i feel like this was such a rude comment and i...,3


In [None]:
classes = tweets['train'].features['label'].names
classes

['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']

In [None]:
df['label name'] = df['label'].apply(lambda x: classes[x])
df

Unnamed: 0,text,label,label name
0,i didnt feel humiliated,0,sadness
1,i can go from feeling so hopeless to so damned...,0,sadness
2,im grabbing a minute to post i feel greedy wrong,3,anger
3,i am ever feeling nostalgic about the fireplac...,2,love
4,i am feeling grouchy,3,anger
...,...,...,...
15995,i just had a very brief time in the beanbag an...,0,sadness
15996,i am now turning and i feel pathetic that i am...,0,sadness
15997,i feel strong and good overall,1,joy
15998,i feel like this was such a rude comment and i...,3,anger


#### Data Analysis on Tweets

In [None]:
df['label name'].value_counts()

label name
joy         5362
sadness     4666
anger       2159
fear        1937
love        1304
surprise     572
Name: count, dtype: int64

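- The class distribution is imbalanced: `joy` has roughly nine times as many examples as `surprise`. We leave it as-is for this walkthrough, but fine-tuned transformers are still affected by imbalance (rare classes tend to get lower recall), so per-class metrics are worth checking alongside overall accuracy.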
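
One common way to quantify the skew in the label counts is to compute inverse-frequency class weights. The sketch below uses the counts printed above and the "balanced" heuristic, `total / (num_classes * class_count)`, which is the same formula scikit-learn uses for `class_weight="balanced"`. This notebook does not use these weights during training; they are shown only as an illustration.

```python
# Label counts copied from the value_counts() output above.
counts = {"joy": 5362, "sadness": 4666, "anger": 2159,
          "fear": 1937, "love": 1304, "surprise": 572}

total = sum(counts.values())
num_classes = len(counts)

# "Balanced" heuristic: rarer classes receive proportionally larger weights.
class_weights = {name: total / (num_classes * n) for name, n in counts.items()}

for name, w in class_weights.items():
    print(f"{name:<9} {w:.3f}")
```

Weights like these could be passed to a weighted cross-entropy loss if the rare classes turn out to be poorly predicted.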

In [None]:
# count whitespace-separated words per tweet to find the longest and shortest tweets

df['words_per_tweet'] = df['text'].str.split().apply(len)
df['words_per_tweet'].max(), df['words_per_tweet'].min()

(66, 2)

In [None]:
from transformers import AutoTokenizer

checkpoint = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

- testing out tokenization...

In [None]:
encoded_text = tokenizer('I love Machine Learning!. Tokenization is awesome')
encoded_text


{'input_ids': [101, 1045, 2293, 3698, 4083, 999, 1012, 19204, 3989, 2003, 12476, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [None]:
tokenizer.vocab_size, tokenizer.model_max_length

(30522, 1000000000000000019884624838656)

In [None]:
tokenizer.convert_ids_to_tokens(encoded_text.input_ids)

['[CLS]',
 'i',
 'love',
 'machine',
 'learning',
 '!',
 '.',
 'token',
 '##ization',
 'is',
 'awesome',
 '[SEP]']
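
The split of `tokenization` into `token` and `##ization` comes from WordPiece's greedy longest-match-first lookup. The sketch below reproduces that behaviour for a single lowercase word with a tiny hypothetical vocabulary; the real tokenizer also handles lowercasing, punctuation splitting and special tokens, which are omitted here.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, shrinking from the right.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece   # continuation pieces carry a ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]               # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Toy vocabulary (hypothetical; the real one has 30522 entries).
toy_vocab = {"token", "##ization", "is", "awesome", "love"}

print(wordpiece("tokenization", toy_vocab))   # ['token', '##ization']
print(wordpiece("awesome", toy_vocab))        # ['awesome']
```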

### Tokenization of entire tweets

In [None]:
tweets.reset_format()

In [None]:
# define tokenization method

def tokenize(batch):    # receives a batch of examples (a dict of lists) from a split
    # padding=True pads every sequence to the longest one in the batch;
    # truncation=True cuts sequences that exceed the model's maximum length
    return tokenizer(batch['text'], padding=True, truncation=True)

In [None]:
# testing out the tokenize function on the first five training examples

tokenize(tweets['train'][:5])

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


{'input_ids': [[101, 1045, 2134, 2102, 2514, 26608, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1045, 2064, 2175, 2013, 3110, 2061, 20625, 2000, 2061, 9636, 17772, 2074, 2013, 2108, 2105, 2619, 2040, 14977, 1998, 2003, 8300, 102], [101, 10047, 9775, 1037, 3371, 2000, 2695, 1045, 2514, 20505, 3308, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1045, 2572, 2412, 3110, 16839, 9080, 12863, 2055, 1996, 13788, 1045, 2097, 2113, 2008, 2009, 2003, 2145, 2006, 1996, 3200, 102, 0], [101, 1045, 2572, 3110, 24665, 7140, 11714, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0], [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]}
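
The zeros in `input_ids` and `attention_mask` above come from padding. A minimal sketch of that step, assuming the pad token ID is 0 (consistent with `padding_idx=0` in the model's embedding layer); `pad_batch` is a hypothetical helper, not part of the tokenizer API.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Token IDs of the first and fifth tweets, copied from the output above.
batch = pad_batch([
    [101, 1045, 2134, 2102, 2514, 26608, 102],
    [101, 1045, 2572, 3110, 24665, 7140, 11714, 102],
])
print(batch["input_ids"])
print(batch["attention_mask"])
```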

In [None]:
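# batched=True passes many examples to tokenize() at once; batch_size=None sends each split
# as a single batch, so every sequence in a split is padded to the same length
encoded_tweet = tweets.map(tokenize, batched=True, batch_size=None)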
encoded_tweet

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 2000
    })
})

In [None]:
import torch
from transformers import AutoModel

base_model = AutoModel.from_pretrained(checkpoint)
base_model

DistilBertModel(
  (embeddings): Embeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (transformer): Transformer(
    (layer): ModuleList(
      (0-5): 6 x TransformerBlock(
        (attention): MultiHeadSelfAttention(
          (dropout): Dropout(p=0.1, inplace=False)
          (q_lin): Linear(in_features=768, out_features=768, bias=True)
          (k_lin): Linear(in_features=768, out_features=768, bias=True)
          (v_lin): Linear(in_features=768, out_features=768, bias=True)
          (out_lin): Linear(in_features=768, out_features=768, bias=True)
        )
        (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (ffn): FFN(
          (dropout): Dropout(p=0.1, inplace=False)
          (lin1): Linear(in_features=768, out_features=3072, bias=True)
          (lin2): Li

The embedding layer of the base model takes the tokenized `input_ids` and looks up a 768-dimensional, `non-contextual` vector for each token. To each of these it adds a learned position embedding (one per position, up to 512) that encodes where the token sits in the sequence; the sum is then layer-normalized and passed through dropout.
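
As a rough illustration of the embedding step, the sketch below adds a token vector and a position vector using tiny random tables. The sizes and values are toy assumptions (DistilBERT uses 30522 tokens, 512 positions and 768 dimensions with learned weights), and the LayerNorm and dropout steps are omitted.

```python
import random

random.seed(0)
vocab_size, max_positions, dim = 10, 8, 4   # toy sizes, not the real model's

def random_table(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

word_embeddings = random_table(vocab_size, dim)         # one row per token ID
position_embeddings = random_table(max_positions, dim)  # one row per position

def embed(input_ids):
    """Look up each token's vector and add the vector for its position."""
    return [
        [w + p for w, p in zip(word_embeddings[tok], position_embeddings[pos])]
        for pos, tok in enumerate(input_ids)
    ]

# The same token ID (3) at positions 0 and 2 gets different input vectors.
vectors = embed([3, 7, 3])
print(len(vectors), len(vectors[0]))
print(vectors[0] != vectors[2])
```

Because the position vector differs, a repeated token enters the encoder with a different representation at each position.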

In the transformer stack, these combined embeddings pass through six encoder layers. Each layer applies multi-head self-attention to its input and then passes the result through a feed-forward network (FFN).

This multi-head self-attention mechanism (together with the FFN) allows the model to infer a bidirectional, context-aware representation of each token relative to the other tokens, using the relationships it learned during pretraining.

The output of the feed-forward network in the last encoder layer is the most contextualized representation: one 768-dimensional vector per input token, known as the last hidden state.

So running inference on the BaseModel yields the `last_hidden_state` from the output layer of the last encoder layer (and without adjusting model weights).

Should a decoder stack be attached to the encoder stack, we then have a full Transformer (sequence-to-sequence) model. Unlike an RNN, it relies on attention rather than recurrence, and it is capable of tasks such as translation and text generation.


In [None]:
# lets start with generating encodings
text = 'I love Machine Learning!. Tokenization is awesome'

inputs = tokenizer(text, return_tensors='pt')
inputs

{'input_ids': tensor([[  101,  1045,  2293,  3698,  4083,   999,  1012, 19204,  3989,  2003,
         12476,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [None]:
from transformers import AutoModel

with torch.no_grad():     # disables gradient tracking; only a forward pass is needed
    outputs = base_model(**inputs)

outputs     # this reveals the BaseModel output_layer also called the last hidden state

BaseModelOutput(last_hidden_state=tensor([[[-0.0886, -0.2149, -0.1689,  ..., -0.0873,  0.2789,  0.5058],
         [ 0.4888,  0.1527, -0.1430,  ...,  0.0074,  0.5571,  0.4564],
         [ 0.6903,  0.6416,  0.4528,  ..., -0.0707,  0.4100,  0.1986],
         ...,
         [-0.0752,  0.0228,  0.1196,  ..., -0.0155,  0.0925,  0.5272],
         [ 0.3473, -0.1479,  0.1116,  ...,  0.1363, -0.0076,  0.0483],
         [ 0.9303,  0.1891, -0.5633,  ...,  0.0727, -0.6529, -0.1950]]]), hidden_states=None, attentions=None)

In [None]:
last_hidden_state = outputs.last_hidden_state
last_hidden_state     # this reveals bidirectional context-aware 768-shaped vector embeddings for each token

tensor([[[-0.0886, -0.2149, -0.1689,  ..., -0.0873,  0.2789,  0.5058],
         [ 0.4888,  0.1527, -0.1430,  ...,  0.0074,  0.5571,  0.4564],
         [ 0.6903,  0.6416,  0.4528,  ..., -0.0707,  0.4100,  0.1986],
         ...,
         [-0.0752,  0.0228,  0.1196,  ..., -0.0155,  0.0925,  0.5272],
         [ 0.3473, -0.1479,  0.1116,  ...,  0.1363, -0.0076,  0.0483],
         [ 0.9303,  0.1891, -0.5633,  ...,  0.0727, -0.6529, -0.1950]]])

#### Fine-Tuning DistilBERT on tweets (by attaching a classification head to the pretrained model's hidden state)

- Instead of `AutoModel`, we use `AutoModelForSequenceClassification`, which places a classification head on top of the pretrained model's outputs so that the head and the base model can be trained together.

In [None]:
from transformers import AutoModelForSequenceClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")    # Use GPU if present, else use CPU

# Initialize the classification head with the expected number of labels
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=len(classes)).to(device)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="finetuned-tweet-classifier",
    num_train_epochs=2,
    learning_rate=2e-5,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    weight_decay=0.01,
    eval_strategy='epoch',    # named evaluation_strategy in transformers < 4.41
    disable_tqdm=False
)

In [None]:
from sklearn.metrics import accuracy_score, f1_score
# At the end of every epoch, the model is evaluated by comparing actual and predicted labels

def compute_metrics(pred):
    labels = pred.label_ids     # actual labels
    preds = pred.predictions.argmax(-1)      # predicted label
    f1 = f1_score(labels, preds, average='weighted')
    acc = accuracy_score(labels, preds)
    return {"accuracy":acc, "f1":f1}
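
To show what `average='weighted'` means in `f1_score`, here is a hand-rolled sketch: F1 is computed per class from true positives, false positives and false negatives, then averaged with each class weighted by its number of true examples. The labels are made up for illustration, and `weighted_f1` is a hypothetical helper rather than a scikit-learn function.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for cls, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        total += f1 * n
    return total / len(y_true)

# Made-up labels for illustration only.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
print(round(weighted_f1(y_true, y_pred), 4))   # 0.8333
```

With imbalanced data the weighted score is dominated by the frequent classes, so it can look healthy even when a rare class is predicted poorly.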

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=encoded_tweet['train'],
    eval_dataset=encoded_tweet['validation'],
    tokenizer=tokenizer
)

trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.306791,0.911,0.910649
2,0.537200,0.214972,0.923,0.92311


TrainOutput(global_step=500, training_loss=0.5372147216796875, metrics={'train_runtime': 236.6632, 'train_samples_per_second': 135.213, 'train_steps_per_second': 2.113, 'total_flos': 720342861696000.0, 'train_loss': 0.5372147216796875, 'epoch': 2.0})
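
The `global_step=500` in the output can be checked with a little arithmetic. The sketch below assumes a single device and no gradient accumulation, neither of which is stated explicitly in the notebook.

```python
import math

train_rows = 16000        # size of the train split shown earlier
batch_size = 64           # per_device_train_batch_size
epochs = 2                # num_train_epochs
num_devices = 1           # assumption: one GPU (or CPU), no gradient accumulation

steps_per_epoch = math.ceil(train_rows / (batch_size * num_devices))
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)   # 250 500

# Throughput implied by the reported train_runtime of 236.6632 seconds.
samples_per_second = train_rows * epochs / 236.6632
print(round(samples_per_second, 3))
```

The `No log` entry for epoch 1 is consistent with `Trainer` recording the training loss only every 500 steps, assuming `logging_steps` was left at its default: the first logged value arrives at step 500, the end of epoch 2.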

In [None]:
pred_output = trainer.predict(encoded_tweet['test'])
pred_output     #  running inference on test_set (output in logits)

PredictionOutput(predictions=array([[ 4.4836698 , -0.40733036, -1.9579751 , -0.35597053, -1.1519755 ,
        -2.067747  ],
       [ 4.6086483 , -0.573427  , -1.7071173 , -0.6739273 , -1.0726796 ,
        -1.8129954 ],
       [ 4.556236  , -0.7625482 , -1.6709337 , -0.9177241 , -0.85526234,
        -1.7037042 ],
       ...,
       [-0.69484454,  4.787004  , -0.42780933, -0.94915456, -1.6941081 ,
        -1.0789921 ],
       [-0.82307017,  4.7231073 , -0.53491503, -0.98338675, -1.4194639 ,
        -1.0220554 ],
       [-1.1427377 , -1.2669117 , -1.1247737 , -1.1663691 ,  2.265744  ,
         1.950044  ]], dtype=float32), label_ids=array([0, 0, 0, ..., 1, 1, 4]), metrics={'test_loss': 0.21847674250602722, 'test_accuracy': 0.9175, 'test_f1': 0.9166814788533307, 'test_runtime': 4.0006, 'test_samples_per_second': 499.93, 'test_steps_per_second': 7.999})
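
The `predictions` array holds raw logits, not probabilities. A softmax turns one row of logits into a probability distribution over the six classes; the sketch below applies it to the last row printed above, using only the standard library.

```python
import math

classes = ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']

def softmax(logits):
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Last row of the predictions array printed above.
logits = [-1.1427377, -1.2669117, -1.1247737, -1.1663691, 2.265744, 1.950044]
probs = softmax(logits)

# Print classes from most to least probable.
for name, p in sorted(zip(classes, probs), key=lambda pair: -pair[1]):
    print(f"{name:<9} {p:.3f}")
```

For this example `fear` wins the argmax, but `surprise` is a close second, which is information the predicted label alone does not convey.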

In [None]:
pred_output.metrics     # model evaluation on test set

{'test_loss': 0.21847674250602722,
 'test_accuracy': 0.9175,
 'test_f1': 0.9166814788533307,
 'test_runtime': 4.0006,
 'test_samples_per_second': 499.93,
 'test_steps_per_second': 7.999}

In [None]:
from sklearn.metrics import confusion_matrix, classification_report
import seaborn as sns

# get predictions
y_pred = pred_output.predictions.argmax(axis=-1)

# get actual values
y_test = encoded_tweet['test']['label']

# Compute confusion matrix
# cm = confusion_matrix(y_test, y_pred)

# Plot confusion matrix
# sns.heatmap(cm, annot=True, cmap='Blues', fmt='g', xticklabels=classes, yticklabels=classes)
# plt.xlabel('Predicted')
# plt.ylabel('Actual')
# plt.title('Confusion Matrix')
# plt.show()

# Show classification report
print(classes)
print(classification_report(y_true=y_test, y_pred=y_pred))

['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']
              precision    recall  f1-score   support

           0       0.95      0.96      0.96       581
           1       0.94      0.94      0.94       695
           2       0.78      0.81      0.79       159
           3       0.92      0.92      0.92       275
           4       0.88      0.90      0.89       224
           5       0.87      0.61      0.71        66

    accuracy                           0.92      2000
   macro avg       0.89      0.86      0.87      2000
weighted avg       0.92      0.92      0.92      2000
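
The gap between the `macro avg` and `weighted avg` rows can be reproduced from the per-class rows of the report. The inputs below are the rounded values printed above, so the results only match to two decimals.

```python
# Per-class F1 and support copied from the classification report above.
f1_scores = [0.96, 0.94, 0.79, 0.92, 0.89, 0.71]
support = [581, 695, 159, 275, 224, 66]

# Macro: every class counts equally. Weighted: classes count by their support.
macro_avg_f1 = sum(f1_scores) / len(f1_scores)
weighted_avg_f1 = sum(f * n for f, n in zip(f1_scores, support)) / sum(support)

print(round(macro_avg_f1, 2), round(weighted_avg_f1, 2))   # 0.87 0.92
```

The macro average is lower because the two smallest classes, `love` (2) and `surprise` (5), have the weakest F1 scores, and the weighted average largely hides them.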



In [None]:
# running inference on external data

def classify(statement:str):
    encoded_statement = tokenizer(statement, return_tensors='pt').to(device)
    with torch.no_grad():
        outputs = model(**encoded_statement)    # yields a SequenceClassifierOutput containing logits (unnormalized class scores)

    logits = outputs.logits
    pred = torch.argmax(logits, dim=1).item()
    return pred, classes[pred]


In [None]:
classify('i want to kill you')

(3, 'anger')

Having added a sequence classification head, we've successfully fine-tuned the model for a classification task. The output layer yields six logits, one per emotion class.

In [None]:
model

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 