## Custom GPT

In this notebook we walk through the steps to fine-tune a base LLM on a custom dataset for a specific task.

[Here we fine-tune the GPT-2 model for a text classification task.]

There are three broad steps:

1. Preparing and loading the dataset

2. Loading the model and tokenizer

3. Building the training pipeline

### Data Preparation

In [None]:
# We use an emotion classification dataset from the Hugging Face Hub,
# loaded with the Hugging Face `datasets` library
!pip install datasets

Collecting datasets
  Downloading datasets-2.15.0-py3-none-any.whl (521 kB)
Collecting pyarrow-hotfix (from datasets)
  Downloading pyarrow_hotfix-0.6-py3-none-any.whl (7.9 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)
Installing collected packages: pyarrow-hotfix, dill, multiprocess, datasets
Successfully installed datasets-2.15.0 dill-0.3.7 multiprocess-0.70.15 pyarrow-hotfix-0.6


In [None]:
from datasets import load_dataset
emotion_data = load_dataset("dair-ai/emotion")

In [None]:
# Split sizes, feature schema, and one sample record
print(len(emotion_data["train"]), len(emotion_data["validation"]), len(emotion_data["test"]))
print(emotion_data["train"].features)
print(emotion_data["train"][10])

16000 2000 2000
{'text': Value(dtype='string', id=None), 'label': ClassLabel(names=['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'], id=None)}
{'text': 'i feel like i have to make the suffering i m seeing mean something', 'label': 0}


In [None]:
# Subsample the dataset to keep training fast:
# the first 1,000 training rows plus the first 250 validation rows
emotion_sample = load_dataset("dair-ai/emotion", split='train[:1000]+validation[:250]')
# The combined sample is split again below with train_test_split
print(emotion_sample)

Dataset({
    features: ['text', 'label'],
    num_rows: 1250
})


In [None]:
# Hold out 20% of the 1,250 sampled rows for evaluation (stored under the "test" key)
data = emotion_sample.train_test_split(test_size=0.2)
#emotion_train = data["train"]
#emotion_val = data["test"]

In [None]:
data

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 250
    })
})
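
The split above is random, so the rows that land in each partition change from run to run. As a rough illustration of what a shuffled 80/20 split does, here is a pure-Python sketch. It is not the `datasets` implementation; the helper `split_indices` and the seed value are hypothetical names chosen for this example.

```python
import random

def split_indices(n_rows, test_size, seed):
    """Shuffle row indices with a fixed seed and cut off the last `test_size` fraction."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded shuffle, so the result is repeatable
    n_test = round(n_rows * test_size)
    return indices[:n_rows - n_test], indices[n_rows - n_test:]

train_idx, test_idx = split_indices(1250, test_size=0.2, seed=42)
print(len(train_idx), len(test_idx))  # 1000 250
```

In the `datasets` library, passing a `seed` argument to `train_test_split` should give the same kind of repeatability.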

### Loading the model & tokenizer

In [None]:
!pip install transformers evaluate accelerate

Collecting evaluate
  Downloading evaluate-0.4.1-py3-none-any.whl (84 kB)
Collecting accelerate
  Downloading accelerate-0.25.0-py3-none-any.whl (265 kB)
Collecting responses<0.19 (from evaluate)
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Installing collected packages: responses, accelerate, evaluate
Successfully installed accelerate-0.25.0 evaluate-0.4.1 responses-0.18.0


In [None]:
#from huggingface_hub import notebook_login
#notebook_login()

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
# GPT-2 ships without a padding token, so reuse the end-of-sequence token for padding
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

In [None]:
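def preprocess_function(examples):
    # Tokenize a batch of texts: truncate inputs that exceed the model's maximum length
    # and pad each mapped batch to its longest sequence
    return tokenizer(examples["text"], truncation=True, padding=True)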
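
To make the effect of `truncation` and `padding` concrete, here is a minimal pure-Python sketch of right-padding a batch and building the matching attention mask. The helper `pad_and_truncate`, the token ids, and the `max_length` of 4 are all made up for illustration; the real tokenizer does considerably more.

```python
def pad_and_truncate(batch, pad_id, max_length):
    """Right-pad every sequence to the longest one in the batch, capped at max_length."""
    truncated = [ids[:max_length] for ids in batch]
    width = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (width - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding that the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (width - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token ids; 50256 is GPT-2's EOS id, reused here as the pad id
batch = [[10, 11, 12], [20, 21, 22, 23, 24, 25], [30]]
encoded = pad_and_truncate(batch, pad_id=50256, max_length=4)
print(encoded["input_ids"])
# [[10, 11, 12, 50256], [20, 21, 22, 23], [30, 50256, 50256, 50256]]
print(encoded["attention_mask"])
# [[1, 1, 1, 0], [1, 1, 1, 1], [1, 0, 0, 0]]
```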

In [None]:
#tokenized_train_data = emotion_train.map(preprocess_function, batched=True)
#tokenized_val_data = emotion_val.map(preprocess_function, batched=True)
tokenized_data = data.map(preprocess_function, batched=True)

In [None]:
#tokenized_train_data[0]
tokenized_data["train"][0]

In [None]:
# id2label and label2id mappings (used to report predictions as readable label names)
id2label = {0:"sadness", 1:"joy", 2:"love", 3:"anger", 4:"fear", 5:"surprise"}
label2id = {"sadness": 0, "joy":1, "love":2, "anger":3, "fear":4, "surprise":5}
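
The two dictionaries can also be derived from the list of label names instead of being typed by hand, which avoids mismatches between them. The sketch below hard-codes the names printed in the feature schema earlier; in the notebook, reading them from `emotion_data["train"].features["label"].names` should give the same list. The `derived_` prefixes are only there to avoid clobbering the variables defined above.

```python
# Label names in id order, as printed by the dataset's ClassLabel feature
label_names = ["sadness", "joy", "love", "anger", "fear", "surprise"]

derived_id2label = dict(enumerate(label_names))
derived_label2id = {name: idx for idx, name in derived_id2label.items()}

print(derived_id2label)
print(derived_label2id)
```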

In [None]:
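import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=6, id2label=id2label, label2id=label2id)
# Use the GPU when one is available; fall back to the CPU otherwise
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)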

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Creating the Data & Training pipeline

Reference: https://huggingface.co/docs/transformers/tasks/sequence_classification

In [None]:
from transformers import DataCollatorWithPadding

# Pads every batch to the length of its longest sequence at collation time
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [None]:
import numpy as np
# Class distribution of the training split: (label ids, row counts)
np.unique(data["train"]['label'], return_counts=True)

(array([0, 1, 2, 3, 4, 5]), array([265, 344,  95, 143, 108,  45]))
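
The counts above translate directly into class shares and a majority-class baseline. The short sketch below copies the counts from that output; the variable names are only illustrative.

```python
# Label id -> number of training rows, copied from the output above
counts = {0: 265, 1: 344, 2: 95, 3: 143, 4: 108, 5: 45}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
majority_label = max(counts, key=counts.get)

print(total)                    # 1000
print(majority_label)           # 1
print(shares[majority_label])   # 0.344
```

Under these counts, a model that always predicts label 1 ("joy") would score about 34% accuracy on similarly distributed data, while the rarest class ("surprise") makes up only 4.5% of the rows.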

In [None]:
import evaluate

# The classes are imbalanced, so accuracy alone would be a misleading metric; we report F1 instead
f1_score = evaluate.load("f1")

In [None]:
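# Reference: https://github.com/huggingface/evaluate/blob/main/metrics/f1/f1.py
def compute_metrics(eval_pred):
    # eval_pred holds the model logits and the true label ids
    predictions, labels = eval_pred
    # Predicted class = index of the largest logit
    predictions = np.argmax(predictions, axis=1)
    # Weighted F1: per-class F1 averaged by each class's support
    return f1_score.compute(predictions=predictions, references=labels, average='weighted')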
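
To show what `average='weighted'` means, here is a hand-rolled pure-Python version of weighted F1 on a toy example. It is a sketch for intuition only, not the code behind the `evaluate` metric; the function `weighted_f1` and the toy labels are invented for this example.

```python
from collections import Counter

def weighted_f1(references, predictions):
    """Per-class F1, averaged with weights equal to each class's share of the references."""
    support = Counter(references)
    total = len(references)
    score = 0.0
    for label, n_true in support.items():
        tp = sum(1 for r, p in zip(references, predictions) if r == label and p == label)
        n_pred = sum(1 for p in predictions if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += f1 * n_true / total
    return score

refs = [0, 0, 0, 1, 1, 2]
preds = [0, 0, 1, 1, 1, 0]
print(round(weighted_f1(refs, preds), 4))  # 0.6
```

Here class 2 is never predicted correctly, so its F1 of 0 pulls the score down in proportion to its support (1 of 6 rows).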

#### Training pipeline

It consists of three steps:

1. Define the training hyperparameters [the only required parameter is `output_dir`, where the model is saved]

2. Pass the training arguments to `Trainer` along with the model, datasets, tokenizer, data collator, and `compute_metrics` function

3. Call `train()` to fine-tune the model

In [None]:
from transformers import TrainingArguments, Trainer

In [None]:
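# Tell the model which token id is padding; GPT-2's classification head needs it for batches larger than 1
model.config.pad_token_id = model.config.eos_token_id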
print(model.config)

GPT2Config {
  "_name_or_path": "gpt2",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "id2label": {
    "0": "sadness",
    "1": "joy",
    "2": "love",
    "3": "anger",
    "4": "fear",
    "5": "surprise"
  },
  "initializer_range": 0.02,
  "label2id": {
    "anger": 3,
    "fear": 4,
    "joy": 1,
    "love": 2,
    "sadness": 0,
    "surprise": 5
  },
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "pad_token_id": 50256,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    

In [None]:
training_args = TrainingArguments(
    output_dir="my_test_model",  # checkpoints are written here
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=8,
    num_train_epochs=20,
    weight_decay=0.01,
    # Evaluate, log, and save once per epoch; load_best_model_at_end needs
    # the evaluation and save strategies to match
    evaluation_strategy='epoch',
    logging_strategy='epoch',
    save_strategy='epoch',
    load_best_model_at_end=True,
    push_to_hub=False
)

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_data["train"],
    eval_dataset=tokenized_data["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

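Note: GPT and GPT-2 were pre-trained without padding tokens, so their tokenizers do not define one. This is why the EOS token was assigned as the padding token for both the tokenizer and the model config above.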
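
The padding id matters for classification because the GPT-2 sequence classification head reads the hidden state of the last real token in each row, so it has to know where the padding starts. The pure-Python sketch below illustrates that lookup for a right-padded sequence. It is a simplified illustration rather than the library's code, and `last_token_index` and the token ids are hypothetical.

```python
def last_token_index(input_ids, pad_id):
    """Index of the last non-padding token in a right-padded sequence."""
    for position in range(len(input_ids) - 1, -1, -1):
        if input_ids[position] != pad_id:
            return position
    raise ValueError("sequence contains only padding")

PAD_ID = 50256  # GPT-2's EOS id, reused as the pad id

print(last_token_index([30, PAD_ID, PAD_ID, PAD_ID], PAD_ID))  # 0
print(last_token_index([20, 21, 22, 23], PAD_ID))              # 3
```

Because the pad id and the EOS id are the same here, this lookup assumes the text itself does not end with an EOS token.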

In [None]:
trainer.train()

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | F1 |
|---|---|---|---|
| 1 | 1.9751 | 1.675958 | 0.214781 |
| 2 | 1.524 | 1.544322 | 0.295445 |
| 3 | 1.3519 | 1.385103 | 0.37406 |
| 4 | 1.0889 | 1.205356 | 0.517473 |
| 5 | 0.8574 | 1.054951 | 0.593887 |
| 6 | 0.6463 | 0.955325 | 0.639255 |
| 7 | 0.4709 | 0.904568 | 0.662469 |
| 8 | 0.3575 | 0.941198 | 0.673147 |
| 9 | 0.2602 | 0.918183 | 0.702254 |
| 10 | 0.1871 | 0.929034 | 0.707521 |
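
With `load_best_model_at_end=True` and no `metric_for_best_model` set, the `Trainer` ranks checkpoints by evaluation loss, which need not agree with the best F1. The sketch below applies both criteria to the ten epochs listed in the log above; the values are copied from that log and the variable names are illustrative.

```python
# (epoch, validation_loss, weighted_f1) rows copied from the training log above
rows = [
    (1, 1.675958, 0.214781), (2, 1.544322, 0.295445),
    (3, 1.385103, 0.374060), (4, 1.205356, 0.517473),
    (5, 1.054951, 0.593887), (6, 0.955325, 0.639255),
    (7, 0.904568, 0.662469), (8, 0.941198, 0.673147),
    (9, 0.918183, 0.702254), (10, 0.929034, 0.707521),
]

best_by_loss = min(rows, key=lambda row: row[1])
best_by_f1 = max(rows, key=lambda row: row[2])

print(best_by_loss[0])  # 7
print(best_by_f1[0])    # 10
```

Among these rows the validation loss bottoms out at epoch 7 while F1 keeps improving, so the choice of selection metric changes which checkpoint counts as "best".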


TrainOutput(global_step=640, training_loss=0.4718184880912304, metrics={'train_runtime': 490.052, 'train_samples_per_second': 40.812, 'train_steps_per_second': 1.306, 'total_flos': 683887288320000.0, 'train_loss': 0.4718184880912304, 'epoch': 20.0})
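
The `global_step` value in this output can be cross-checked with a little arithmetic. The sketch assumes a single device and no gradient accumulation, which is consistent with the settings shown but not stated explicitly in the notebook.

```python
import math

train_rows = 1000   # rows in the training split
batch_size = 32     # per_device_train_batch_size
epochs = 20         # num_train_epochs

steps_per_epoch = math.ceil(train_rows / batch_size)  # the last batch is partial
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)  # 32 640
```

The result matches `global_step=640` and explains the `checkpoint-640` directory name used for inference.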

### Inference

In [None]:
from transformers import pipeline

# Load the checkpoint saved at the last training step (global_step=640)
cls = pipeline(task="text-classification", model="my_test_model/checkpoint-640")

In [None]:
cls("Hi! I am feeling nostalgic")

[{'label': 'love', 'score': 0.9313643574714661}]

In [None]:
text = data["test"][20]['text']
print("text: ",text)
cls(text)

text:  i am feeling miserable but c i am also the proudest mum on earth


[{'label': 'sadness', 'score': 0.99224853515625}]
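
The `score` returned by the pipeline is a probability: for a single-label classifier like this one, the text-classification pipeline applies a softmax to the model's logits and reports the top class. Here is a pure-Python sketch of that final step. The logits are made up for illustration and are not the model's actual output.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["sadness", "joy", "love", "anger", "fear", "surprise"]
logits = [4.0, 0.5, -1.0, 0.0, -0.5, -2.0]  # hypothetical logits for one input

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
print({"label": labels[best], "score": probs[best]})
```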