[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/UN-GCPDS/Transformers-Attention-Mechanisms-VQA-MRC/blob/main/Language%20Models/transformers-fine-tuning.ipynb)

# Fine tuning of Transformer models for different NLP tasks

[![Open In Github](https://img.shields.io/badge/GitHub-100000?style=for-the-badge&logo=github&logoColor=white)](https://github.com/UN-GCPDS/Transformers-Attention-Mechanisms-VQA-MRC)

In [1]:
!pip install datasets evaluate transformers[sentencepiece]
!pip install accelerate -U

Collecting datasets
  Downloading datasets-2.13.1-py3-none-any.whl (486 kB)
Collecting evaluate
  Downloading evaluate-0.4.0-py3-none-any.whl (81 kB)
Collecting transformers[sentencepiece]
  Downloading transformers-4.31.0-py3-none-any.whl (7.4 MB)
Collecting dill<0.3.7,>=0.3.0 (from datasets)
  Downloading dill-0.3.6-py3-none-any.whl (110 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)

## Supervised Fine Tuning (SFT)

### Task repurposing


#### Full Fine Tuning

To illustrate supervised fine-tuning with a pre-trained transformer, we fine-tune BERT on a spam detection dataset. BERT was pre-trained with masked language modeling (together with next-sentence prediction), so fine-tuning it as a classifier changes the task it was originally trained for.

In [3]:
# First, we load the dataset using the Hugging Face Datasets library.

from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("Deysi/spam-detection-dataset") # Load the dataset in its raw form
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint) # Load the tokenizer from the model (in our case, BERT)


def tokenize_function(example):
    return tokenizer(example["text"], truncation=True)


def preprocessing(example):

  if example["label"] == "not_spam":
    example["label"] = 0
  else:
    example["label"] = 1

  return example


tokenized_datasets = raw_datasets.map(tokenize_function, batched=True) # Tokenize the entire dataset using the map function
tokenized_datasets = tokenized_datasets.map(preprocessing) # Apply the preprocessing to the dataset
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


##### Dataset inspection

In [4]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 8175
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 2725
    })
})

In [5]:
import pandas as pd

data = pd.DataFrame(data = {"Text": tokenized_datasets["train"]["text"], "Label": tokenized_datasets["train"]["label"]},)
data

Unnamed: 0,Text,Label
0,hey I am looking for Xray baggage datasets can...,0
1,"""Get rich quick! Make millions in just days wi...",1
2,URGENT MESSAGE: YOU WON'T BELIEVE WHAT WE HAVE...,1
3,[Google AI Blog: Contributing Data to Deepfake...,0
4,Trying to see if anyone already has timestamps...,0
...,...,...
8170,"Hi all,\n\nWe create datasets by taking pictur...",0
8171,DEALS! DEALS! DEALS!\n\nHey peeps! You won't b...,1
8172,Hi\n\nI am working on a project and need penal...,0
8173,Do you want to BLOW UP your social media follo...,1
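
The `DataCollatorWithPadding` created earlier pads each batch only up to the length of its longest sequence, instead of padding the whole dataset to one fixed length. The following is a minimal pure-Python sketch of that idea, not the library's implementation: `pad_batch` is a hypothetical helper, the token ids are made up, and the pad id of 0 is taken from `padding_idx=0` in BERT's embedding layer.

```python
from typing import Dict, List


def pad_batch(batch: List[List[int]], pad_id: int = 0) -> Dict[str, List[List[int]]]:
    """Pad every sequence to the length of the longest one in this batch."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        # Real tokens keep mask 1; padding positions get mask 0.
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Toy token ids, made up for illustration.
toy_batch = [[101, 7592, 102], [101, 2023, 2003, 1037, 102]]
padded = pad_batch(toy_batch)
print(padded["input_ids"])
print(padded["attention_mask"])
```

Because the padded length depends on the batch, short batches stay short, which usually saves computation compared with padding everything to the maximum sequence length.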


##### Train

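Now we load the model with the `AutoModelForSequenceClassification` class from Hugging Face, which places a classification head on top of the pre-trained BERT encoder for our downstream task. We first create the training configuration: `TrainingArguments("test-trainer")` only sets the output directory and keeps every other default.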

In [6]:
from transformers import TrainingArguments

training_args = TrainingArguments("test-trainer")

In [7]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


As the warning shows, the pre-trained checkpoint contains no weights for the classification head, so `classifier.weight` and `classifier.bias` are newly initialized. This is exactly what we expect, since we are repurposing the model for a new task.
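
To make the role of the new `classifier` weights concrete, here is a rough pure-Python sketch of a linear classification head applied to a pooled sentence vector. It is an illustration under assumptions, not the library code: the random initialization, the stand-in `pooled_output`, and the helper `classify` are all hypothetical, and only the sizes (768 hidden units, 2 labels) come from the notebook.

```python
import math
import random

HIDDEN_SIZE = 768  # hidden width of bert-base-uncased, as in the model printout
NUM_LABELS = 2     # num_labels passed to from_pretrained

random.seed(0)
# A freshly initialized head: small random weights and zero bias (illustrative values).
weight = [[random.gauss(0.0, 0.02) for _ in range(HIDDEN_SIZE)] for _ in range(NUM_LABELS)]
bias = [0.0] * NUM_LABELS


def classify(pooled):
    """Linear layer followed by a numerically stable softmax."""
    logits = [sum(w * x for w, x in zip(row, pooled)) + b for row, b in zip(weight, bias)]
    top = max(logits)
    exps = [math.exp(value - top) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Stand-in for the encoder's pooled output (random numbers, not a real BERT vector).
pooled_output = [random.gauss(0.0, 1.0) for _ in range(HIDDEN_SIZE)]
probs = classify(pooled_output)
print(probs)
```

Until the head is trained, these probabilities carry no information about spam, which is why the warning recommends training on a downstream task before using the model for predictions.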

In [8]:
print(model)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

Now we can train the model on our dataset using the Hugging Face Trainer API:

In [9]:
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

In [10]:
trainer.train()

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
500,0.0305
1000,0.0115
1500,0.0121
2000,0.0056
2500,0.0
3000,0.0


TrainOutput(global_step=3066, training_loss=0.009733732196116463, metrics={'train_runtime': 999.5497, 'train_samples_per_second': 24.536, 'train_steps_per_second': 3.067, 'total_flos': 2562485845918800.0, 'train_loss': 0.009733732196116463, 'epoch': 3.0})
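
The `global_step=3066` reported above can be reproduced from the dataset size. The sketch below assumes the `TrainingArguments` defaults were left unchanged (3 epochs, per-device batch size of 8) and that training ran on a single device without gradient accumulation. None of this is stated explicitly in the notebook.

```python
import math

num_train_examples = 8175   # size of the train split shown in the dataset inspection
per_device_batch_size = 8   # TrainingArguments default (assumed unchanged)
num_epochs = 3              # TrainingArguments default (assumed unchanged)

# The last batch of each epoch may be smaller, hence the ceiling.
steps_per_epoch = math.ceil(num_train_examples / per_device_batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)
```

Under these assumptions the total matches the step count in `TrainOutput`.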

##### Test

In [12]:
import numpy as np

predictions = trainer.predict(tokenized_datasets["test"])
print(predictions.predictions.shape, predictions.label_ids.shape)

preds = np.argmax(predictions.predictions, axis=-1)

(2725, 2) (2725,)


In [15]:
import evaluate

metric = evaluate.load("accuracy")
metric.compute(predictions=preds, references=predictions.label_ids)

{'accuracy': 0.9988990825688073}
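
The evaluation above takes the arg-max over the two logits of each example and compares the result with the reference labels. Here is a minimal pure-Python sketch of those two steps on made-up data. `argmax_rows` and `accuracy` are hypothetical helpers standing in for `np.argmax` and the `evaluate` metric.

```python
from typing import List, Sequence


def argmax_rows(logits: Sequence[Sequence[float]]) -> List[int]:
    """Index of the largest logit in each row (first index on ties)."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def accuracy(preds: Sequence[int], refs: Sequence[int]) -> float:
    """Fraction of predictions that equal the reference label."""
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)


# Toy logits and labels, made up for illustration.
toy_logits = [[2.1, -1.3], [-0.4, 0.9], [0.2, 0.1], [-1.0, 3.0]]
toy_labels = [0, 1, 1, 1]
toy_preds = argmax_rows(toy_logits)
toy_accuracy = accuracy(toy_preds, toy_labels)
print(toy_preds, toy_accuracy)

# The accuracy reported above, converted back to a count of correct test examples.
correct_on_test = round(0.9988990825688073 * 2725)
print(correct_on_test)
```

Read this way, the reported accuracy corresponds to only a handful of misclassified examples out of the 2725 in the test split.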

In [19]:
from sklearn.metrics import classification_report

y_test = np.array(tokenized_datasets["test"]["label"])
y_pred = preds

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1350
           1       1.00      1.00      1.00      1375

    accuracy                           1.00      2725
   macro avg       1.00      1.00      1.00      2725
weighted avg       1.00      1.00      1.00      2725



#### Head Tuning

It is also possible to freeze the BERT model and train only the added classification layer.

In [22]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)


for param in model.bert.parameters(): # Freeze all the layers of BERT
    param.requires_grad = False

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Now, we can use the Trainer class as before:

##### Train

In [23]:
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

In [24]:
trainer.train()



Step,Training Loss
500,0.6314
1000,0.5325
1500,0.478
2000,0.4396
2500,0.4109
3000,0.4045


TrainOutput(global_step=3066, training_loss=0.4813930569951704, metrics={'train_runtime': 354.7648, 'train_samples_per_second': 69.13, 'train_steps_per_second': 8.642, 'total_flos': 2562485845918800.0, 'train_loss': 0.4813930569951704, 'epoch': 3.0})

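As you can see, training is faster this time (about 355 s versus about 1000 s for full fine-tuning), because gradients are computed and applied only for the classification layer. The forward pass still runs through all of BERT.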
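
To get a sense of how few weights head tuning updates, the parameter counts can be worked out by hand. This sketch uses the sizes visible in the model printout (vocabulary 30522, 512 positions, 2 token types, hidden size 768, 12 layers). The feed-forward width of 3072 and the pooler layer are not visible in the truncated printout and are assumed here from the standard BERT-base configuration.

```python
VOCAB, MAX_POS, TYPES, HIDDEN, LAYERS = 30522, 512, 2, 768, 12  # from the model printout
INTERMEDIATE = 3072  # assumption: standard BERT-base feed-forward width
NUM_LABELS = 2


def linear(n_in: int, n_out: int) -> int:
    """Parameters of a dense layer: weight matrix plus bias."""
    return n_in * n_out + n_out


layer_norm = 2 * HIDDEN  # scale and shift vectors
embeddings = (VOCAB + MAX_POS + TYPES) * HIDDEN + layer_norm
per_layer = (
    4 * linear(HIDDEN, HIDDEN)      # query, key, value, attention output
    + layer_norm
    + linear(HIDDEN, INTERMEDIATE)  # feed-forward expansion
    + linear(INTERMEDIATE, HIDDEN)  # feed-forward projection
    + layer_norm
)
pooler = linear(HIDDEN, HIDDEN)     # assumption: pooler is part of model.bert

frozen = embeddings + LAYERS * per_layer + pooler  # everything under model.bert
trainable = linear(HIDDEN, NUM_LABELS)             # the classification head
trainable_fraction = trainable / (frozen + trainable)
print(frozen, trainable, trainable_fraction)
```

Under these assumptions only a tiny fraction of the weights is updated, which is consistent with both the shorter training time and the limited capacity to adapt to the new task.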

##### Test

In [25]:
predictions = trainer.predict(tokenized_datasets["test"])
print(predictions.predictions.shape, predictions.label_ids.shape)

preds = np.argmax(predictions.predictions, axis=-1)

(2725, 2) (2725,)


In [26]:
import evaluate

metric = evaluate.load("accuracy")
metric.compute(predictions=preds, references=predictions.label_ids)

{'accuracy': 0.913394495412844}

In [27]:
from sklearn.metrics import classification_report

y_test = np.array(tokenized_datasets["test"]["label"])
y_pred = preds

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.87      0.97      0.92      1350
           1       0.96      0.86      0.91      1375

    accuracy                           0.91      2725
   macro avg       0.92      0.91      0.91      2725
weighted avg       0.92      0.91      0.91      2725



As expected, performance also drops: accuracy falls from about 0.999 with full fine-tuning to about 0.913 with head tuning.

## Unsupervised Fine Tuning

It is possible to take a pre-trained generative model, such as GPT-2, and further fine-tune it on a custom dataset so that it learns, for example, a specific way of writing.

As an example, let's take DistilGPT-2, a lighter version of GPT-2.

In [63]:
from transformers import AutoTokenizer, AutoModelForCausalLM, DataCollatorForLanguageModeling


tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

tokenizer.pad_token = tokenizer.eos_token

In [55]:
model

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-5): 6 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

We will fine-tune this model on a dataset of Python code.

In [15]:
from datasets import load_dataset

dataset = load_dataset("dipesh/python-code-ds-mini")


Generating train split:   0%|          | 0/2521 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/281 [00:00<?, ? examples/s]


#### Dataset inspection

In [21]:
print(dataset["train"]["code"][1])

# Getting all CSV files from a directory using Python 
 # importing the required modules
import glob
import pandas as pd
  
# specifying the path to csv files
path = "csvfoldergfg"
  
# csv files in the path
files = glob.glob(path + "/*.csv")
  
# defining an empty list to store 
# content
data_frame = pd.DataFrame()
content = []
  
# checking all the csv files in the 
# specified path
for filename in files:
    
    # reading content of csv file
    # content.append(filename)
    df = pd.read_csv(filename, index_col=None)
    content.append(df)
  
# converting content to data frame
data_frame = pd.concat(content)
print(data_frame)


Preprocess and tokenize the dataset:

In [64]:
context_length = 64  # number of tokens in each training example

def tokenize(element):
    outputs = tokenizer(
        element["code"],
        truncation=True,
        max_length=context_length,
        return_overflowing_tokens=True,
        return_length=True,
    )
    input_batch = []
    for length, input_ids in zip(outputs["length"], outputs["input_ids"]):
        if length == context_length:
            input_batch.append(input_ids)
    return {"input_ids": input_batch}


tokenized_datasets = dataset.map(
    tokenize, batched=True, remove_columns=dataset["train"].column_names
)

data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

tokenized_datasets



DatasetDict({
    train: Dataset({
        features: ['input_ids'],
        num_rows: 30659
    })
    validation: Dataset({
        features: ['input_ids'],
        num_rows: 3580
    })
})
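
The row counts above are much larger than the number of source files because the `tokenize` function splits every file into consecutive windows of `context_length` tokens and keeps only the full windows. A small pure-Python sketch of that chunking follows. `full_chunks` is a hypothetical helper, and plain integers stand in for token ids.

```python
from typing import List


def full_chunks(token_ids: List[int], context_length: int) -> List[List[int]]:
    """Split into consecutive windows and drop a trailing window that is too short."""
    windows = [
        token_ids[start:start + context_length]
        for start in range(0, len(token_ids), context_length)
    ]
    return [w for w in windows if len(w) == context_length]


toy_ids = list(range(150))  # stand-in for one tokenized source file
chunks = full_chunks(toy_ids, context_length=64)
print(len(chunks), [len(c) for c in chunks])
```

In this toy case the last 22 tokens are discarded. Dropping the remainder keeps every example the same length, at the cost of losing a small tail from each file.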

In [65]:
from transformers import TrainingArguments

training_args = TrainingArguments("test-trainer")

In [66]:
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

In [67]:
trainer.train()

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
500,3.5955
1000,3.0578
1500,2.9036
2000,2.7857
2500,2.7027
3000,2.5657
3500,2.5424
4000,2.474
4500,2.351
5000,2.3264


TrainOutput(global_step=11499, training_loss=2.4119578566444635, metrics={'train_runtime': 826.527, 'train_samples_per_second': 111.281, 'train_steps_per_second': 13.912, 'total_flos': 375520175456256.0, 'train_loss': 2.4119578566444635, 'epoch': 3.0})
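
A language-modeling loss is easier to interpret as a perplexity, the exponential of the mean cross-entropy per token. The sketch below converts the training loss reported above. This is only a rough indication, since the value is averaged over the whole training run (including the early, high-loss steps) and is not measured on the validation split.

```python
import math

train_loss = 2.4119578566444635  # training_loss from the TrainOutput above
perplexity = math.exp(train_loss)  # cross-entropy is in nats, so use the natural exponential
print(f"{perplexity:.2f}")
```

A perplexity near 11 means that, on average, the model was about as uncertain as if it were choosing uniformly among roughly 11 tokens at each position.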

#### Text generation example:

In [126]:
from transformers import pipeline

generator = pipeline("text-generation", model=model, tokenizer=tokenizer, device="cuda")

gens = generator("plt.", max_length=10, num_return_sequences=10, temperature=1.0)

for gen in gens:
  print(gen["generated_text"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


plt.figurefig(figsize=(10
plt.figure(figsize=(20,
plt.figure([9, 10, 13
plt.ylabel("Sorted order:
plt.to_hex('2080 bytes
plt.title('Company Data:', '
plt.fromarray("Python Exercises
plt.to_datetime('2020-
plt.figure(figsize=(20,
plt.ylabel('VI', 2)
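
The completions differ from one another, which suggests that the pipeline samples from the model's next-token distribution rather than always picking the most likely token. The `temperature` argument rescales the logits before sampling. Here is a minimal pure-Python sketch of that mechanism with made-up logits. `softmax_with_temperature` and `sample_token` are hypothetical helpers, not functions of the transformers library.

```python
import math
import random
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [value / temperature for value in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits: List[float], temperature: float, rng: random.Random) -> int:
    """Draw one token index according to the temperature-scaled probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.0, 0.1]  # made-up scores for a three-token vocabulary
p_default = softmax_with_temperature(toy_logits, 1.0)
p_sharp = softmax_with_temperature(toy_logits, 0.5)

rng = random.Random(0)
samples = [sample_token(toy_logits, 1.0, rng) for _ in range(10)]
print(p_default, p_sharp, samples)
```

A temperature of 1.0, as used above, leaves the distribution unchanged. Values below 1.0 concentrate probability on the most likely tokens and make the output more repetitive, while values above 1.0 flatten the distribution.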


## Reinforcement learning from human feedback (RLHF)

The TRL library from Hugging Face provides tools for RLHF: https://huggingface.co/docs/trl/main/en/index