# Fine-tuning transformer models on a custom dataset for a downstream classification task

Today, we return to the dataset that we used on day 1 of our course: the IMDb data. Go back to the code you wrote then, and inspect the `recall`, `precision`, and `f1-score` values.

In this notebook, we will try to improve the performance of our classifier by using `transfer learning`. We will use a `DistilBERT` model (a smaller, distilled version of `BERT`), but feel free to check the HuggingFace library for alternatives that you might want to use.


If your system does not have a GPU, it is advised to run this notebook in Colab.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/annekroon/gesis-machine-learning/blob/main/fall-2022/day5/exercise/transformers-custom-dataset.ipynb)

In [3]:
!pip3 install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.22.0-py3-none-any.whl (4.9 MB)
Collecting huggingface-hub<1.0,>=0.9.0
  Downloading huggingface_hub-0.9.1-py3-none-any.whl (120 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.9.1 tokenizers-0.12.1 transformers-4.22.0


### Import packages

In [4]:
from collections import defaultdict
import gzip
import json
import random
import pickle

import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
import torch
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification
from transformers import Trainer, TrainingArguments
from sklearn.metrics import f1_score
from pathlib import Path
from sklearn.model_selection import train_test_split

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import ticker
sns.set(style='ticks', font_scale=1.2)


### Define constants

In [5]:

MODEL = 'distilbert-base-cased'  # name of the pretrained model; browse alternatives at https://huggingface.co/models
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'  # fall back to the CPU when no GPU is available
MAX_LENGTH = 512  # maximum number of tokens per review; longer reviews are truncated
CACHED_DIR = 'my-awesome-model'  # directory that we'll use for saving the model

In [6]:
!wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz

--2022-09-16 19:38:34--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘aclImdb_v1.tar.gz’


2022-09-16 19:38:35 (71.8 MB/s) - ‘aclImdb_v1.tar.gz’ saved [84125825/84125825]



### Read IMDb data

In [7]:

def read_imdb_split(split_dir):
    split_dir = Path(split_dir)
    texts = []
    labels = []
    for label_dir in ["pos", "neg"]:
        for text_file in (split_dir/label_dir).iterdir():
            texts.append(text_file.read_text(encoding="utf-8"))
            # Compare strings with ==; `is` tests object identity, not equality
            labels.append(0 if label_dir == "neg" else 1)
    return texts, labels
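
To make the expected folder layout concrete, here is a minimal sketch that builds a tiny, made-up dataset in a temporary directory and reads it back with the same logic. The helper name `read_split`, the file name `0_1.txt`, and the two review texts are assumptions for illustration only; the real data lives in `aclImdb/train` and `aclImdb/test`.

```python
import tempfile
from pathlib import Path


def read_split(split_dir):
    """Read one text file per review; the folder name determines the label."""
    split_dir = Path(split_dir)
    texts, labels = [], []
    for label_dir in ["pos", "neg"]:
        # Sort the files so the order is the same on every run
        for text_file in sorted((split_dir / label_dir).iterdir()):
            texts.append(text_file.read_text(encoding="utf-8"))
            labels.append(0 if label_dir == "neg" else 1)
    return texts, labels


# Build a tiny, made-up dataset with the same pos/neg layout as the IMDb folders
with tempfile.TemporaryDirectory() as tmp:
    for label_dir, review in [("pos", "Loved it."), ("neg", "Terrible.")]:
        folder = Path(tmp) / label_dir
        folder.mkdir()
        (folder / "0_1.txt").write_text(review, encoding="utf-8")
    demo_texts, demo_labels = read_split(tmp)

print(demo_texts, demo_labels)
```

Positive reviews are read first and get label `1`; negative reviews follow with label `0`.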

Create train and test samples

In [8]:
train_texts, train_labels = read_imdb_split('aclImdb/train')
test_texts, test_labels = read_imdb_split('aclImdb/test')

Split the training samples into training and validation samples

In [9]:
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.2)

### Run a simple traditional classifier

In [10]:
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

In [11]:
model = LogisticRegression(max_iter=1000).fit(X_train, train_labels)
predictions = model.predict(X_test)

In [12]:
print(classification_report(test_labels, predictions))

              precision    recall  f1-score   support

           0       0.88      0.88      0.88     12500
           1       0.88      0.88      0.88     12500

    accuracy                           0.88     25000
   macro avg       0.88      0.88      0.88     25000
weighted avg       0.88      0.88      0.88     25000
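
To give an intuition for what the baseline's features look like, the sketch below computes TF-IDF weights by hand for two made-up documents. It assumes the smoothed IDF formula and L2 normalization that `TfidfVectorizer` uses by default, but it splits on whitespace only, so it is a simplification of the real tokenization. The function name `tfidf` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalized)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


toy_docs = ["great great movie", "terrible movie"]
toy_vectors = tfidf(toy_docs)
for doc, vector in zip(toy_docs, toy_vectors):
    print(doc, {term: round(w, 3) for term, w in vector.items()})
```

Words that occur in every document ("movie") receive a lower weight than words that are specific to one document ("great", "terrible"). Word order is ignored, which is one reason a transformer model may do better.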



### Let's start with our transformer-based approach

First, we need to tokenize the data using a tokenizer provided by HuggingFace. In particular, you need a tokenizer that belongs to the particular language model you will be using.

In [13]:
tokenizer = DistilBertTokenizerFast.from_pretrained(MODEL) 


Tokenize the train, validation, and test datasets, applying padding and truncation.

In [14]:
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=MAX_LENGTH)
val_encodings = tokenizer(val_texts, truncation=True, padding=True, max_length=MAX_LENGTH)
test_encodings  = tokenizer(test_texts, truncation=True, padding=True, max_length=MAX_LENGTH)
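
What padding and truncation do to a batch can be sketched in plain Python. This is a simplified illustration, not the tokenizer's implementation: the token ids are made up, the helper `pad_and_truncate` is hypothetical, and the sketch simply cuts off long sequences, whereas the real tokenizer keeps the closing `[SEP]` token when it truncates.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Truncate to max_length, then pad every sequence to the longest one left."""
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two made-up sequences of token ids with different lengths
toy_batch = [[101, 7, 8, 9, 10, 102], [101, 7, 102]]
encoded = pad_and_truncate(toy_batch, max_length=4)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

After this step all sequences in the batch have the same length, so they can be stacked into one tensor. The attention mask tells the model which positions hold padding.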

### Use the `PyTorch` Dataset class to transform the data 

In [15]:

class IMDbDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = IMDbDataset(train_encodings, train_labels)
val_dataset = IMDbDataset(val_encodings, val_labels)
test_dataset = IMDbDataset(test_encodings, test_labels)
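
The indexing logic of this class can be tried out without PyTorch. The sketch below is a torch-free stand-in (the name `ToyDataset` and the toy encodings are made up) that shows how column-wise encodings are turned into one dictionary per example; the real class additionally wraps each value in `torch.tensor`.

```python
class ToyDataset:
    """Torch-free stand-in that mirrors the indexing logic of the dataset class."""

    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # Pick row `idx` from every column of the encodings
        item = {key: values[idx] for key, values in self.encodings.items()}
        item["labels"] = self.labels[idx]
        return item

    def __len__(self):
        return len(self.labels)


# Made-up encodings for two short "reviews"
toy_encodings = {
    "input_ids": [[101, 7, 102], [101, 8, 102]],
    "attention_mask": [[1, 1, 1], [1, 1, 1]],
}
toy_dataset = ToyDataset(toy_encodings, [1, 0])
print(len(toy_dataset))
print(toy_dataset[1])
```

The `labels` key matters: the `Trainer` passes each item's keys to the model as keyword arguments, and the model computes the loss from `labels`.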

### Inspect the results of the tokenization process

In [16]:
' '.join(train_encodings[0].tokens[0:100])

"[CLS] I used to write comments at I ##MD ##b , but I don ' t do so anymore . It happens that I ##MD ##b has become massive , and consequently subjective ##ness has ruined scores . What do I mean ? That anyone that is not particularly fond of movies and doesn ' t have any expertise on the subject , watches some crap ( or the opposite ) , and in case he likes it , delivers a 10 , and if he doesn ' t he goes for a 1 . This of course , cannot"

In [17]:
' '.join(test_encodings[0].tokens[0:100])

'[CLS] I would have to say that in general Barbie Movies have impressed me . I have a 5 year old Barbie fan ##atic niece and she watches them all the time so needles ##s to say I have seen quite a lot of Barbie these holidays , but I am not sick of them . < br / > < br / > This film , visually , has a lot to offer , especially the backgrounds , and the animation of the characters has improved with each new movie . One thing I noticed in particular was a'
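
The `##` prefixes in the outputs above mark word pieces: words that are not in the vocabulary are split into smaller known units. A rough sketch of the greedy longest-match-first idea behind this splitting is shown below. The mini-vocabulary is hypothetical and chosen only to reproduce splits seen above; the real tokenizer also handles punctuation, special tokens such as `[CLS]`, and a vocabulary of tens of thousands of pieces.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical mini-vocabulary for illustration
toy_vocab = {"I", "##MD", "##b", "fan", "##atic", "movie"}

print(wordpiece("IMDb", toy_vocab))
print(wordpiece("fanatic", toy_vocab))
print(wordpiece("movie", toy_vocab))
```

Because rare words are broken into pieces instead of being dropped, the model can still use information from names and misspellings.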

In [18]:
' '.join(train_dataset.encodings[0].tokens[0:100])
' '.join(test_dataset.encodings[1].tokens[0:100])

"[CLS] This movie rocked ! ! ! ! saw it at a screen ##er a coup ##la weeks ago . Kind ##a a strange story , where James Franco plays this jerk who marries Sienna Miller just to get out of the country and they go to Niagara Falls for their honeymoon . Don ' t wanna give it away cu ##z the movie isn ' t released yet but its totally cool and you would never expect the stuff that happens . I kinda thought I would hate it cu ##z its a romance but its also kinda twisted"

### You can customize the evaluation metrics that the model will provide

In [19]:
def compute_metrics(pred):
    labels = pred.label_ids                  # true labels
    preds = pred.predictions.argmax(-1)      # predicted class = index of the largest logit
    acc = accuracy_score(labels, preds)
    f1 = f1_score(labels, preds, average='weighted')
    return {
        'accuracy': acc,
        'f1': f1,
    }
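
To see what these two numbers mean, the following sketch recomputes accuracy and the weighted F1 score by hand on four made-up predictions. The logits, labels, and helper names are assumptions for illustration; in the notebook itself, scikit-learn does this work.

```python
def accuracy(labels, preds):
    """Share of examples whose predicted label equals the true label."""
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)


def weighted_f1(labels, preds):
    """Per-class F1, averaged with weights proportional to each class's support."""
    total = 0.0
    for cls in sorted(set(labels)):
        tp = sum(l == cls and p == cls for l, p in zip(labels, preds))
        fp = sum(l != cls and p == cls for l, p in zip(labels, preds))
        fn = sum(l == cls and p != cls for l, p in zip(labels, preds))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        support = tp + fn  # number of true examples of this class
        total += f1 * support / len(labels)
    return total


# Made-up model outputs: one row of two logits per example
toy_logits = [[2.0, -1.0], [0.1, 0.3], [1.5, 0.2], [-0.5, 0.9]]
toy_labels = [0, 1, 1, 1]
toy_preds = [row.index(max(row)) for row in toy_logits]  # argmax per row

print(toy_preds)
print(accuracy(toy_labels, toy_preds))
print(round(weighted_f1(toy_labels, toy_preds), 4))
```

On a balanced dataset such as IMDb, accuracy and weighted F1 are usually close to each other; they diverge more when one class is much rarer than the other.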

In [20]:
# Initialize a ForSequenceClassification model
model = DistilBertForSequenceClassification.from_pretrained(MODEL, num_labels=2).to(DEVICE)


Some weights of the model checkpoint at distilbert-base-cased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_projector.weight', 'vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['classifier.weight', 'classifier.bias', 'pre_classifier.wei

### If needed, tweak the `Trainer` class parameter settings, and start training

In [21]:
from transformers import DistilBertForSequenceClassification, Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=1,              # total number of training epochs
    per_device_train_batch_size=8,   # batch size per device during training
    per_device_eval_batch_size=32,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,                # log the training loss every 10 steps
)


trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset,            # evaluation dataset
    compute_metrics=compute_metrics
)

trainer.train()

***** Running training *****
  Num examples = 20000
  Num Epochs = 1
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 2500


Step,Training Loss
10,0.696
20,0.7042
30,0.6904
40,0.6844
50,0.6836
60,0.6974
70,0.6917
80,0.6793
90,0.6861
100,0.6811


Saving model checkpoint to ./results/checkpoint-500
Configuration saved in ./results/checkpoint-500/config.json
Model weights saved in ./results/checkpoint-500/pytorch_model.bin
Saving model checkpoint to ./results/checkpoint-1000
Configuration saved in ./results/checkpoint-1000/config.json
Model weights saved in ./results/checkpoint-1000/pytorch_model.bin
Saving model checkpoint to ./results/checkpoint-1500
Configuration saved in ./results/checkpoint-1500/config.json
Model weights saved in ./results/checkpoint-1500/pytorch_model.bin
Saving model checkpoint to ./results/checkpoint-2000
Configuration saved in ./results/checkpoint-2000/config.json
Model weights saved in ./results/checkpoint-2000/pytorch_model.bin
Saving model checkpoint to ./results/checkpoint-2500
Configuration saved in ./results/checkpoint-2500/config.json
Model weights saved in ./results/checkpoint-2500/pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=2500, training_loss=0.35843663766384126, metrics={'train_runtime': 1004.2149, 'train_samples_per_second': 19.916, 'train_steps_per_second': 2.49, 'total_flos': 2649347973120000.0, 'train_loss': 0.35843663766384126, 'epoch': 1.0})
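
The 2500 optimization steps reported above follow from 20000 training examples and a batch size of 8. The sketch below reproduces that number and shows how `warmup_steps=500` shapes the learning rate over the run. It assumes the `Trainer` defaults that this notebook does not override (a linear schedule and a peak learning rate of 5e-5); the function `linear_schedule_with_warmup` is a hand-written illustration, not the library's code.

```python
import math


def linear_schedule_with_warmup(step, base_lr, warmup_steps, total_steps):
    """Learning rate at `step`: linear ramp-up, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


n_train, batch_size, epochs = 20000, 8, 1
total_steps = math.ceil(n_train / batch_size) * epochs
BASE_LR = 5e-5  # assumed default peak learning rate

print(total_steps)
for step in (0, 250, 500, 1500, 2500):
    print(step, linear_schedule_with_warmup(step, BASE_LR, 500, total_steps))
```

With these settings the learning rate peaks after 500 steps, one fifth of the single epoch, and then decays to zero by the last step. The slow start matches the training losses near 0.69 in the first 100 steps of the log.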

### Evaluate the model on the validation data

In [22]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 5000
  Batch size = 32


{'eval_loss': 0.2610227167606354,
 'eval_accuracy': 0.9158,
 'eval_f1': 0.9157924831629632,
 'eval_runtime': 86.5045,
 'eval_samples_per_second': 57.8,
 'eval_steps_per_second': 1.815,
 'epoch': 1.0}

In [23]:
predicted_validation = trainer.predict(val_dataset)

***** Running Prediction *****
  Num examples = 5000
  Batch size = 32


In [24]:
predicted_val_labels = predicted_validation.predictions.argmax(-1) # Get the highest probability prediction
predicted_val_labels = predicted_val_labels.flatten().tolist()      # Flatten the predictions into a 1D list
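
Note that `predictions` holds raw logits, not probabilities. The sketch below, using made-up logits and a hand-written `softmax`, shows how logits can be turned into probabilities and why taking the `argmax` of the logits directly gives the same label.

```python
import math


def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for two reviews: [score for class 0, score for class 1]
toy_logits = [[-1.2, 2.3], [0.8, -0.4]]
for row in toy_logits:
    probs = softmax(row)
    label_from_logits = row.index(max(row))
    label_from_probs = probs.index(max(probs))
    print([round(p, 3) for p in probs], label_from_logits, label_from_probs)
```

Softmax preserves the ordering of the scores, so the predicted label is unchanged. The probabilities are still useful if you want to inspect how confident the model is about individual reviews.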

In [25]:
print(classification_report(val_labels, predicted_val_labels))

              precision    recall  f1-score   support

           0       0.91      0.92      0.92      2537
           1       0.92      0.91      0.91      2463

    accuracy                           0.92      5000
   macro avg       0.92      0.92      0.92      5000
weighted avg       0.92      0.92      0.92      5000



### Evaluation on the test data

In [26]:
predicted_test = trainer.predict(test_dataset)
predicted_test_labels = predicted_test.predictions.argmax(-1) # Get the highest probability prediction

***** Running Prediction *****
  Num examples = 25000
  Batch size = 32


In [27]:
predicted_test_labels = predicted_test_labels.flatten().tolist()     

In [28]:
print(classification_report(test_labels, predicted_test_labels))

              precision    recall  f1-score   support

           0       0.91      0.92      0.92     12500
           1       0.92      0.91      0.91     12500

    accuracy                           0.92     25000
   macro avg       0.92      0.92      0.92     25000
weighted avg       0.92      0.92      0.92     25000

