<a href="https://colab.research.google.com/github/ambideXtrous9/Finetune-LLMs-using-LoRA-in-Colab-on-Custom-Datasets/blob/main/Finetune_Seq2SeqLLM_on_Custom_QA_Dataset_using_LoRA_in_Colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!nvidia-smi

Fri Dec 15 09:44:34 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   73C    P8              12W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [2]:
!pip install --quiet transformers
!pip install --quiet pytorch-lightning
!pip install --quiet peft
!pip install --quiet sentencepiece
!pip install --quiet datasets
!pip install --quiet accelerate
!pip install --quiet bitsandbytes
!pip install --quiet evaluate

In [4]:
import json
import pandas as pd
import numpy as np
from pathlib import Path
from datasets import load_dataset
from torch.utils.data import Dataset, DataLoader
import pytorch_lightning as pl
from sklearn.model_selection import train_test_split
from termcolor import colored
import textwrap

In [5]:
pl.seed_everything(42)

INFO:lightning_fabric.utilities.seed:Seed set to 42


42

In [6]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [7]:
path = '/content/drive/MyDrive/MTP CODE/NewsQA_SPAN.feather'


In [8]:
df = pd.read_feather(path)
df

Unnamed: 0,question,answer,ans_pos,paragraph,answer_start,answer_end
0,Who is the managing director of Synergee Capital?,Vikram Dalal,"[133, 145]","""Investors can use a combination of governmen...",133,145
1,What is the yield of 30- and 40-year governmen...,7%,"[565, 567]","""Investors can use a combination of governmen...",565,567
2,What is the name of the ETF 2027 that a conser...,SDL,"[209, 212]","According to financial planners, an example o...",209,212
3,When would a conservative fixed income investo...,2027,"[217, 221]","According to financial planners, an example o...",217,221
4,What year would a conservative fixed income in...,2040,"[260, 264]","According to financial planners, an example o...",260,264
...,...,...,...,...,...,...
481753,When does Uncle Sam reopen for fully vaccinate...,November 8,"[295, 305]",NEW DELHI: This could be the last expansion of...,295,305
481754,When will there be three more weekly flights b...,from second week of November,"[116, 144]",It currently has 23 weekly flights to America....,116,144
481755,What type of 777s would have helped AI have mo...,Boeing,"[306, 312]",It currently has 23 weekly flights to America....,306,312
481756,What was the first wave of AI nonstops?,second,"[11, 17]","Before the second wave this summer, AI had abo...",11,17


In [9]:
df = df.iloc[:5000]

In [10]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorWithPadding

In [11]:
MODEL_NAME = "google/flan-t5-xl"

In [12]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

In [13]:
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME,
                                             load_in_8bit=True,
                                             device_map='auto')

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
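
Loading in 8-bit is what makes a model of this size comfortable on the T4 shown by `nvidia-smi`. The sketch below is only a rough estimate of weight memory at different precisions. The parameter count is an assumption derived from the `print_trainable_parameters()` output further down (total minus LoRA parameters), and the estimate ignores activations, gradients, optimizer state, and any modules that bitsandbytes keeps in higher precision.

```python
# Rough weight-memory estimate; all figures are approximations.
N_PARAMS = 2_849_757_184  # assumption: 2,863,912,960 total minus 14,155,776 LoRA params
GPU_MIB = 15_360          # T4 memory reported by nvidia-smi

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_mib(n_params: int, bytes_per_param: int) -> float:
    """Memory needed for the weights alone, in MiB."""
    return n_params * bytes_per_param / 2**20


estimates = {name: weight_mib(N_PARAMS, b) for name, b in BYTES_PER_PARAM.items()}

for name, mib in estimates.items():
    headroom = GPU_MIB - mib
    print(f"{name}: {mib:,.0f} MiB of weights, {headroom:,.0f} MiB left for everything else")
```

The headroom column is the part that matters for fine-tuning: whatever the weights do not use has to hold activations, LoRA gradients, and optimizer state.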

In [14]:
train_df, val_df = train_test_split(df,test_size=0.1)

In [15]:
class NQADataset(Dataset):
  def __init__(self,data : pd.DataFrame,tokenizer ,source_max_token_len : int = 400,target_max_token_len : int = 32):

    self.tokenizer = tokenizer
    self.data = data
    self.source_max_token_len = source_max_token_len
    self.target_max_token_len = target_max_token_len

  def __len__(self):
    return len(self.data)

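  def __getitem__(self,index : int):
    data_row = self.data.iloc[index]

    # Encode question and paragraph as a pair; only the paragraph is truncated.
    # Use the tokenizer passed to the constructor, not the global one.
    source_encoding = self.tokenizer(
        data_row['question'],
        data_row['paragraph'],
        max_length = self.source_max_token_len,
        padding = "max_length",
        truncation = "only_second",
        return_attention_mask = True,
        add_special_tokens = True,
        return_tensors = "pt")

    target_encoding = self.tokenizer(
        data_row['answer'],
        max_length = self.target_max_token_len,
        padding = "max_length",
        truncation = True,
        return_attention_mask = True,
        add_special_tokens = True,
        return_tensors = "pt")

    # Mask padded label positions with -100 so the loss ignores them.
    # The attention mask is used instead of `labels == 0` because the padding id
    # depends on which token the tokenizer is configured to pad with.
    labels = target_encoding["input_ids"]
    labels[target_encoding["attention_mask"] == 0] = -100

    return dict(
        input_ids = source_encoding['input_ids'].flatten(),
        attention_mask = source_encoding['attention_mask'].flatten(),
        labels = labels.flatten())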
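
The dataset replaces padded label positions with -100, which is the default `ignore_index` of `torch.nn.CrossEntropyLoss`, so padding does not contribute to the loss. The dependency-free sketch below illustrates the effect. The token ids and per-token loss values are made up for illustration, and `mask_padding` and `masked_mean` are hypothetical helpers, not part of any library.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_padding(input_ids, attention_mask):
    """Keep real tokens; replace padded positions with IGNORE_INDEX."""
    return [tok if keep == 1 else IGNORE_INDEX
            for tok, keep in zip(input_ids, attention_mask)]


def masked_mean(per_token_loss, labels):
    """Average the loss over positions whose label is not IGNORE_INDEX."""
    kept = [loss for loss, label in zip(per_token_loss, labels)
            if label != IGNORE_INDEX]
    return sum(kept) / len(kept)


# Hypothetical example: three real tokens followed by three padding tokens.
input_ids = [71, 305, 1, 0, 0, 0]
attention_mask = [1, 1, 1, 0, 0, 0]
per_token_loss = [2.0, 1.0, 3.0, 9.0, 9.0, 9.0]

labels = mask_padding(input_ids, attention_mask)
print(labels)
print(masked_mean(per_token_loss, labels))
```

Without the mask, the large losses on the padded positions would dominate the average and the model would be trained to predict padding.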

In [16]:
class NQADataModule(pl.LightningDataModule):
  def __init__(self,train_df : pd.DataFrame,test_df : pd.DataFrame,tokenizer ,batch_size : int = 8,source_max_token_len : int = 400,target_max_token_len : int = 32):
    super().__init__()
    self.batch_size = batch_size
    self.train_df = train_df
    self.test_df = test_df
    self.tokenizer = tokenizer
    self.source_max_token_len = source_max_token_len
    self.target_max_token_len = target_max_token_len

  def setup(self,stage=None):
    self.train_dataset = NQADataset(self.train_df,self.tokenizer,self.source_max_token_len,self.target_max_token_len)
    self.test_dataset = NQADataset(self.test_df,self.tokenizer,self.source_max_token_len,self.target_max_token_len)

  def train_dataloader(self):
    return DataLoader(self.train_dataset,batch_size = self.batch_size,shuffle=True,num_workers=4)

  def val_dataloader(self):
    return DataLoader(self.test_dataset,batch_size = self.batch_size,num_workers=4)

  def test_dataloader(self):
    return DataLoader(self.test_dataset,batch_size = self.batch_size,num_workers=4)

In [17]:
from peft import LoraConfig, get_peft_model

In [18]:
config = LoraConfig(
    r=16, # LoRA rank (inner dimension of the low-rank update matrices)
    lora_alpha=32, # alpha scaling (updates are scaled by lora_alpha / r)
    target_modules=["q", "k", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type="SEQ_2_SEQ_LM" # set this for CLM or Seq2Seq
)

In [19]:
model = get_peft_model(model, config)

In [20]:
model.print_trainable_parameters()

trainable params: 14,155,776 || all params: 2,863,912,960 || trainable%: 0.49428094351023855
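
The trainable-parameter count can be reproduced by hand, which is a useful sanity check on `target_modules`. The sketch below assumes the following `flan-t5-xl` architecture values, none of which appear in this notebook: a hidden size of 2048, attention projections of 2048 x 2048, and 24 encoder plus 24 decoder layers. Each adapted linear layer gets two low-rank matrices, `A` (r x in) and `B` (out x r).

```python
# Assumed flan-t5-xl architecture values (not shown in the notebook).
D_MODEL = 2048      # hidden size
INNER_DIM = 2048    # num_heads * d_kv
ENC_LAYERS = 24
DEC_LAYERS = 24

# Values taken from the LoraConfig above.
R = 16
LORA_ALPHA = 32
TARGETS_PER_ATTENTION = 3  # "q", "k", "v"

# Each decoder layer has self-attention and cross-attention.
attention_blocks = ENC_LAYERS + 2 * DEC_LAYERS
adapted_layers = attention_blocks * TARGETS_PER_ATTENTION

# LoRA adds A (R x in_features) and B (out_features x R) per adapted layer.
params_per_layer = R * (D_MODEL + INNER_DIM)
trainable = adapted_layers * params_per_layer

TOTAL_PARAMS = 2_863_912_960  # "all params" from the output above
print(f"adapted layers: {adapted_layers}")
print(f"trainable params: {trainable:,}")
print(f"trainable%: {100 * trainable / TOTAL_PARAMS:.4f}")
print(f"LoRA scaling (lora_alpha / r): {LORA_ALPHA / R}")
```

Under these assumptions the result matches the count printed by `print_trainable_parameters()`, which suggests that all three attention types (encoder self-attention, decoder self-attention, decoder cross-attention) are being adapted.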


In [21]:
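# Note: the T5 tokenizer already defines a pad token; this line reuses the EOS token for padding instead.
# The return value is the number of tokens added to the vocabulary (0, since EOS already exists).
tokenizer.add_special_tokens({'pad_token': tokenizer.eos_token})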

0

In [22]:
BATCH_SIZE = 2
N_EPOCHS = 2

In [23]:
data_module = NQADataModule(train_df,val_df,tokenizer,batch_size = BATCH_SIZE)
data_module.setup()

In [24]:
import transformers
from torch import nn
import torch
from transformers import Trainer

In [25]:
# class WeightedLossTrainer (Trainer):
#   def compute_loss(self, model, inputs, return_outputs=False):
#     # Feed inputs to model and extract logits
#     outputs = model (**inputs)
#     Logits = outputs.get("logits")
#     # Extract labels
#     Labels = inputs.get ("labels")
#     # Define loss function with class weights
#     loss_func = nn.CrossEntropyLoss()
#     # Compute loss
#     loss = loss_func(Logits, Labels)
#     return (loss, outputs) if return_outputs else loss

In [26]:
trainer = transformers.Seq2SeqTrainer(
    model=model,
    train_dataset = data_module.train_dataset,
    eval_dataset = data_module.test_dataset,
    args=transformers.Seq2SeqTrainingArguments(
        per_device_train_batch_size = 2,
        per_device_eval_batch_size = 2,
        gradient_accumulation_steps=4,
        warmup_steps=100,
        evaluation_strategy = 'steps',
        max_steps=200,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir='outputs'
    ),
    # data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)

)
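
The training arguments above imply a fairly small training budget. The sketch below works through the arithmetic, assuming a single GPU and the 4,500-row training split that results from taking 90% of the 5,000 rows kept earlier.

```python
# Values from the notebook.
N_ROWS = 5000
TEST_SIZE = 0.1
PER_DEVICE_BATCH = 2
GRAD_ACCUM_STEPS = 4
MAX_STEPS = 200
WARMUP_STEPS = 100
N_DEVICES = 1  # assumption: single Colab GPU

n_val = round(N_ROWS * TEST_SIZE)
n_train = N_ROWS - n_val

# Examples consumed per optimizer step.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS * N_DEVICES
examples_seen = effective_batch * MAX_STEPS
epochs = examples_seen / n_train
warmup_fraction = WARMUP_STEPS / MAX_STEPS

print(f"train/val rows: {n_train}/{n_val}")
print(f"effective batch size: {effective_batch}")
print(f"examples seen in {MAX_STEPS} steps: {examples_seen}")
print(f"fraction of one epoch: {epochs:.2f}")
print(f"warmup fraction of training: {warmup_fraction:.0%}")
```

With these settings the run covers well under one epoch, and the learning rate is still warming up for half of it. Also note that `eval_steps` is not set; as far as the `TrainingArguments` documentation describes, it then falls back to `logging_steps`, so evaluation over the whole validation set would run after every optimizer step.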

In [27]:
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()

Step,Training Loss,Validation Loss
1,8.1792,1.9581
2,7.8281,1.9581
3,8.0489,1.9581
4,8.1238,1.9581


KeyboardInterrupt: ignored

In [28]:
# model.push_to_hub("samwit/bloom-7b1-lora-tagger",
#                   use_auth_token=True,
#                   commit_message="basic training",
#                   private=True)

In [29]:
sample_question = val_df.iloc[1]

In [30]:
print(sample_question['question'])
print(sample_question['paragraph'])
print(sample_question['answer'])


When is the first developmental flight of the SSLV scheduled for?
 After the Gisat-1 launch, the other satellite to go up will be EOS-4 or Risat-1A, which is a radar imaging satellite with synthetic aperture radar (SAR) that can take pictures during day and night and can also see through clouds. The satellite weighing over 1,800 kg will be launched by a PSLV in September. The satellite will play a strategic role in the country's defence with its capability to operate in day, night and all weather conditions. The first developmental flight of the Small Satellite Launch Vehicle (SSLV) or mini-PSLV is also scheduled for the fourth quarter of this year from Sriharikota.
the fourth quarter of this year


In [31]:
sentence = sample_question['question'] + " " + sample_question['paragraph']

In [32]:
print(sentence)

When is the first developmental flight of the SSLV scheduled for?  After the Gisat-1 launch, the other satellite to go up will be EOS-4 or Risat-1A, which is a radar imaging satellite with synthetic aperture radar (SAR) that can take pictures during day and night and can also see through clouds. The satellite weighing over 1,800 kg will be launched by a PSLV in September. The satellite will play a strategic role in the country's defence with its capability to operate in day, night and all weather conditions. The first developmental flight of the Small Satellite Launch Vehicle (SSLV) or mini-PSLV is also scheduled for the fourth quarter of this year from Sriharikota.


In [33]:
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


In [34]:
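# `sentence` is a single sequence, so plain truncation is used here
# ("only_second" only applies when a text pair is passed).
batch = tokenizer(sentence, max_length = 400,
        padding = "max_length",
        truncation = True,
        return_attention_mask = True,
        add_special_tokens = True,
        return_tensors = "pt").to(device)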

In [35]:
with torch.cuda.amp.autocast():
  output_tokens = model.generate(**batch,
                                num_beams = 1,
                                max_length = 30,
                                repetition_penalty = 2.5,
                                length_penalty = 1.0,
                                early_stopping = True,
                                use_cache = True
                               )

print('\nANSWER : ', tokenizer.decode(output_tokens[0], skip_special_tokens=True))




ANSWER :  The first flight of the mini-PSLV is scheduled for the fourth quarter of this year from Sriharikot.
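
The generated answer contains the reference span but is much longer than it. One common way to quantify this for extractive QA is SQuAD-style exact match and token-level F1. The sketch below is a minimal, dependency-free version of that idea; `normalize`, `exact_match`, and `token_f1` are hypothetical helper names, and this is not the implementation used by the `evaluate` package installed earlier.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> int:
    """1 if the normalized strings are identical, else 0."""
    return int(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Prediction and reference copied from the cells above.
prediction = ("The first flight of the mini-PSLV is scheduled for the "
              "fourth quarter of this year from Sriharikot.")
reference = "the fourth quarter of this year"

em = exact_match(prediction, reference)
f1 = token_f1(prediction, reference)
print(f"exact match: {em}, token F1: {f1:.3f}")
```

Recall is perfect here because every reference token appears in the prediction, while precision is low because of the extra words, so the F1 score penalizes the verbose answer without treating it as entirely wrong.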
