# **T5 Model Fine-Tuning Using Hugging Face**

**Install Libraries**

In [2]:
!pip install pytorch-lightning

Collecting pytorch-lightning
  Downloading pytorch_lightning-2.4.0-py3-none-any.whl.metadata (21 kB)
Collecting torchmetrics>=0.7.0 (from pytorch-lightning)
  Downloading torchmetrics-1.4.1-py3-none-any.whl.metadata (20 kB)
Collecting lightning-utilities>=0.10.0 (from pytorch-lightning)
  Downloading lightning_utilities-0.11.6-py3-none-any.whl.metadata (5.2 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch>=2.1.0->pytorch-lightning)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch>=2.1.0->pytorch-lightning)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch>=2.1.0->pytorch-lightning)
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch>=2.1.0->pytorch-lightning)
  Using cached nv

**Import Libraries**

In [3]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import torch
from transformers import T5Tokenizer
from transformers import T5ForConditionalGeneration, AdamW
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint
pl.seed_everything(100)
import warnings
warnings.filterwarnings("ignore")

INFO:lightning_fabric.utilities.seed:Seed set to 100


**Load Dataset**

In [4]:
df = pd.read_csv("SQuAD_csv.csv")
df.columns

Index(['Unnamed: 0', 'context', 'question', 'id', 'answer_start', 'text'], dtype='object')

In [5]:
df = df[['context','question', 'text']]
print("Number of records: ", df.shape[0])

Number of records:  77097


In [6]:
df["context"] = df["context"].str.lower()
df["question"] = df["question"].str.lower()
df["text"] = df["text"].str.lower()

df.head()

Unnamed: 0,context,question,text
0,beyoncé giselle knowles-carter (/biːˈjɒnseɪ/ b...,when did beyonce start becoming popular?,in the late 1990s
1,beyoncé giselle knowles-carter (/biːˈjɒnseɪ/ b...,what areas did beyonce compete in when she was...,singing and dancing
2,beyoncé giselle knowles-carter (/biːˈjɒnseɪ/ b...,when did beyonce leave destiny's child and bec...,2003
3,beyoncé giselle knowles-carter (/biːˈjɒnseɪ/ b...,in what city and state did beyonce grow up?,"houston, texas"
4,beyoncé giselle knowles-carter (/biːˈjɒnseɪ/ b...,in which decade did beyonce become famous?,late 1990s


**Initialize Parameters**

In [7]:
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
INPUT_MAX_LEN = 512 # Input length
OUT_MAX_LEN = 128 # Output Length
TRAIN_BATCH_SIZE = 8 # Training Batch Size
VALID_BATCH_SIZE = 2 # Validation Batch Size
EPOCHS = 5 # Number of Iteration

**Define T5 Tokenizer**

The T5 model is based on the Transformer architecture, a neural network designed to handle sequential input data effectively. It comprises an encoder and a decoder, which include a sequence of interconnected “layers.”

T5Tokenizer is used to turn a text into a list of tokens, each representing a single word or punctuation mark. The tokenizer additionally inserts unique tokens into the input text to denote the text’s start and end and distinguish various phrases.

In [8]:
MODEL_NAME = "t5-base"

tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME, model_max_length= INPUT_MAX_LEN)

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [9]:
print("eos_token: {} and id: {}".format(tokenizer.eos_token,
                   tokenizer.eos_token_id)) # End of token (eos_token)
print("unk_token: {} and id: {}".format(tokenizer.unk_token,
                   tokenizer.eos_token_id)) # Unknown token (unk_token)
print("pad_token: {} and id: {}".format(tokenizer.pad_token,
                 tokenizer.eos_token_id)) # Pad token (pad_token)

eos_token: </s> and id: 1
unk_token: <unk> and id: 1
pad_token: <pad> and id: 1


**Dataset Preparation**

In [10]:
class T5Dataset:

    def __init__(self, context, question, target):
        self.context = context
        self.question = question
        self.target = target
        self.tokenizer = tokenizer
        self.input_max_len = INPUT_MAX_LEN
        self.out_max_len = OUT_MAX_LEN

    def __len__(self):
        return len(self.context)

    def __getitem__(self, item):
        context = str(self.context[item])
        context = " ".join(context.split())

        question = str(self.question[item])
        question = " ".join(question.split())

        target = str(self.target[item])
        target = " ".join(target.split())


        inputs_encoding = self.tokenizer(
            context,
            question,
            add_special_tokens=True,
            max_length=self.input_max_len,
            padding = 'max_length',
            truncation='only_first',
            return_attention_mask=True,
            return_tensors="pt"
        )


        output_encoding = self.tokenizer(
            target,
            None,
            add_special_tokens=True,
            max_length=self.out_max_len,
            padding = 'max_length',
            truncation= True,
            return_attention_mask=True,
            return_tensors="pt"
        )


        inputs_ids = inputs_encoding["input_ids"].flatten()
        attention_mask = inputs_encoding["attention_mask"].flatten()
        labels = output_encoding["input_ids"]

        labels[labels == 0] = -100  # As per T5 Documentation

        labels = labels.flatten()

        out = {
            "context": context,
            "question": question,
            "answer": target,
            "inputs_ids": inputs_ids,
            "attention_mask": attention_mask,
            "targets": labels
        }


        return out

**DataLoader**

In [11]:
class T5DatasetModule(pl.LightningDataModule):

    def __init__(self, df_train, df_valid):
        super().__init__()
        self.df_train = df_train
        self.df_valid = df_valid
        self.tokenizer = tokenizer
        self.input_max_len = INPUT_MAX_LEN
        self.out_max_len = OUT_MAX_LEN


    def setup(self, stage=None):

        self.train_dataset = T5Dataset(
        context=self.df_train.context.values,
        question=self.df_train.question.values,
        target=self.df_train.text.values
        )

        self.valid_dataset = T5Dataset(
        context=self.df_valid.context.values,
        question=self.df_valid.question.values,
        target=self.df_valid.text.values
        )

    def train_dataloader(self):
        return torch.utils.data.DataLoader(
         self.train_dataset,
         batch_size= TRAIN_BATCH_SIZE,
         shuffle=True,
         num_workers=4
        )


    def val_dataloader(self):
        return torch.utils.data.DataLoader(
         self.valid_dataset,
         batch_size= VALID_BATCH_SIZE,
         num_workers=1
        )

**Build T5 Model**

In [12]:
class T5Model(pl.LightningModule):

    def __init__(self):
        super().__init__()
        self.model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME, return_dict=True)

    def forward(self, input_ids, attention_mask, labels=None):

        output = self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=labels
        )

        return output.loss, output.logits


    def training_step(self, batch, batch_idx):

        input_ids = batch["inputs_ids"]
        attention_mask = batch["attention_mask"]
        labels= batch["targets"]
        loss, outputs = self(input_ids, attention_mask, labels)


        self.log("train_loss", loss, prog_bar=True, logger=True)

        return loss

    def validation_step(self, batch, batch_idx):
        input_ids = batch["inputs_ids"]
        attention_mask = batch["attention_mask"]
        labels= batch["targets"]
        loss, outputs = self(input_ids, attention_mask, labels)

        self.log("val_loss", loss, prog_bar=True, logger=True)

        return loss


    def configure_optimizers(self):
        return AdamW(self.parameters(), lr=0.0001)

**Train Model**

In [13]:
def run():

    df_train, df_valid = train_test_split(
        df[0:10000], test_size=0.2, random_state=101
    )

    df_train = df_train.fillna("none")
    df_valid = df_valid.fillna("none")

    df_train['context'] = df_train['context'].apply(lambda x: " ".join(x.split()))
    df_valid['context'] = df_valid['context'].apply(lambda x: " ".join(x.split()))

    df_train['text'] = df_train['text'].apply(lambda x: " ".join(x.split()))
    df_valid['text'] = df_valid['text'].apply(lambda x: " ".join(x.split()))

    df_train['question'] = df_train['question'].apply(lambda x: " ".join(x.split()))
    df_valid['question'] = df_valid['question'].apply(lambda x: " ".join(x.split()))


    df_train = df_train.reset_index(drop=True)
    df_valid = df_valid.reset_index(drop=True)

    dataModule = T5DatasetModule(df_train, df_valid)
    dataModule.setup()

    device = DEVICE
    models = T5Model()
    models.to(device)

    checkpoint_callback  = ModelCheckpoint(
        dirpath="/kaggle/working",
        filename="best_checkpoint",
        save_top_k=2,
        verbose=True,
        monitor="val_loss",
        mode="min"
    )

    trainer = pl.Trainer(
        callbacks = checkpoint_callback,
        max_epochs= EPOCHS,
        devices=1,
        accelerator="gpu"
    )

    trainer.fit(models, dataModule)

run()

model.safetensors:   0%|          | 0.00/892M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
  | Name  | Type                       | Params | Mode
------------------------------------------------------------
0 | model | T5ForConditionalGeneration | 222 M  | eval
------------------------------------------------------------
222 M     Trainable params
0         Non-trainable params
222 M     Total params
891.614   Total estimated model params size (MB)
0         Modules in train mode
541       Modules in eval mode


Sanity Checking: |          | 0/? [00:00<?, ?it/s]

Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:Epoch 0, global step 1000: 'val_loss' reached 0.20473 (best 0.20473), saving model to '/kaggle/working/best_checkpoint.ckpt' as top 2


Validation: |          | 0/? [00:00<?, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:Epoch 1, global step 2000: 'val_loss' reached 0.20750 (best 0.20473), saving model to '/kaggle/working/best_checkpoint-v1.ckpt' as top 2


Validation: |          | 0/? [00:00<?, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:Epoch 2, global step 3000: 'val_loss' was not in top 2


Validation: |          | 0/? [00:00<?, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:Epoch 3, global step 4000: 'val_loss' was not in top 2


Validation: |          | 0/? [00:00<?, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:Epoch 4, global step 5000: 'val_loss' was not in top 2
INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=5` reached.


**Test Model**

In [14]:
train_model = T5Model.load_from_checkpoint("/kaggle/working/best_checkpoint-v1.ckpt")

train_model.freeze()

def generate_question(context, question):

    inputs_encoding =  tokenizer(
        context,
        question,
        add_special_tokens=True,
        max_length= INPUT_MAX_LEN,
        padding = 'max_length',
        truncation='only_first',
        return_attention_mask=True,
        return_tensors="pt"
        )


    generate_ids = train_model.model.generate(
        input_ids = inputs_encoding["input_ids"],
        attention_mask = inputs_encoding["attention_mask"],
        max_length = INPUT_MAX_LEN,
        num_beams = 4,
        num_return_sequences = 1,
        no_repeat_ngram_size=2,
        early_stopping=True,
        )

    preds = [
        tokenizer.decode(gen_id,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=True)
        for gen_id in generate_ids
    ]

    return "".join(preds)

In [None]:
context = "Classification is used when your target is categorical,\
 while regression is used when your target variable\
is continuous. Both classification and regression belong to the category \
of supervised machine learning algorithms."

que = "When is classification used?"

print(generate_question(context, que))