<div style="padding:20px; 
            color:#150d0a;
            margin:10px;
            font-size:220%;
            text-align:center;
            display:fill;
            border-radius:20px;
            border-width: 5px;
            border-style: solid;
            border-color: #150d0a;
            background-color: pink;
            overflow:hidden;
            font-weight:500">Text2Fillups : A T5 approach</div>

<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:auto;
           font-family:Verdana;">

<p style="padding: 10px; color:white;">In this notebook, I fine-tune the T5 transformer on the QA2D dataset to convert question-answer pairs into declarative sentences, as part of the Text2Fillups project.
</p>
</div>

<p style="color:red; font-weight:600; font-size:35px;">Installing required libraries 📝</p>

In [1]:
!pip install -q datasets transformers nltk wandb


<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:auto;
           font-family:Verdana;">

<p style="padding: 10px; color:white;">I will be using wandb (Weights &amp; Biases) to log model metrics, configuration and checkpoints. It makes the fine-tuning runs easier to track and compare, and more fun!
</p>
</div>

In [2]:
import wandb
from kaggle_secrets import UserSecretsClient

user_secrets = UserSecretsClient()

wandb_api = user_secrets.get_secret("wandb_api") 

wandb.login(key=wandb_api)

[34m[1mwandb[0m: W&B API key is configured. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

In [3]:
import transformers
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm
from sklearn.model_selection import train_test_split
import re
from datasets import load_dataset, load_metric
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler
from transformers import T5ForConditionalGeneration, T5Tokenizer, AdamW, get_linear_schedule_with_warmup
from torch.cuda.amp import autocast, GradScaler
from nltk.translate.bleu_score import corpus_bleu
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.loggers import WandbLogger
pl.seed_everything(100)
import warnings
warnings.filterwarnings("ignore")



<p style="color:red; font-weight:600; font-size:35px;">Initialising WandB logger🪵</p>

In [4]:
wandb_logger = WandbLogger(project="Text2Questions", name="Pytorch-Lightning", log_model="all")

[34m[1mwandb[0m: Currently logged in as: [33mdhaneshv[0m ([33mwordless-souls[0m). Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Tracking run with wandb version 0.15.4
[34m[1mwandb[0m: Run data is saved locally in [35m[1m./wandb/run-20230623_173544-nzgilan5[0m
[34m[1mwandb[0m: Run [1m`wandb offline`[0m to turn off syncing.
[34m[1mwandb[0m: Syncing run [33mPytorch-Lightning[0m
[34m[1mwandb[0m: ⭐️ View project at [34m[4mhttps://wandb.ai/wordless-souls/Text2Questions[0m
[34m[1mwandb[0m: 🚀 View run at [34m[4mhttps://wandb.ai/wordless-souls/Text2Questions/runs/nzgilan5[0m


<p style="color:red; font-weight:600; font-size:35px;">CONFIG Dictionary📒</p>

In [5]:
CONFIG = {}
CONFIG['max_length'] = 512
CONFIG['device'] = torch.device("cuda" if torch.cuda.is_available() else "cpu")
CONFIG['ans_max_length'] = 128
CONFIG['num_beams'] = 3
CONFIG['batch_size'] = 8
CONFIG['epochs'] = 4
CONFIG['model_checkpoint'] = 't5-base'
CONFIG['lr'] = 1e-4
CONFIG['warmup_steps'] = 3000
CONFIG['training_steps'] = 20000

In [6]:
wandb.config.update(CONFIG)

<p style="color:red; font-weight:600; font-size:35px;">Initialising the T5 tokenizer 😊</p>

In [7]:
tokenizer = T5Tokenizer.from_pretrained(CONFIG['model_checkpoint'], model_max_length= CONFIG['max_length'])


<p style="color:red; font-weight:600; font-size:35px;">Loading the Dataset 📅</p>

In [8]:
df = load_dataset("domenicrosati/QA2D")


Downloading and preparing dataset csv/default (download: 12.52 MiB, generated: 18.56 MiB, post-processed: Unknown size, total: 31.08 MiB) to /root/.cache/huggingface/datasets/parquet/domenicrosati--QA2D-e810a5ba933d6655/0.0.0/0b6d5799bb726b24ad7fc7be720c170d8e497f575d02d47537de9a5bac074901...



Dataset parquet downloaded and prepared to /root/.cache/huggingface/datasets/parquet/domenicrosati--QA2D-e810a5ba933d6655/0.0.0/0b6d5799bb726b24ad7fc7be720c170d8e497f575d02d47537de9a5bac074901. Subsequent calls will reuse this data.



In [9]:
df = df.rename_column('question', 'input')
df = df.rename_column('turker_answer','target')

In [10]:
# Keep only 30% of the training split: test_size=0.7 sets aside the other 70%,
# and we take the remaining 'train' part
df['train']= df['train'].train_test_split(test_size=0.7)['train']

In [11]:
print(df['train'])
print(df['dev'])

Dataset({
    features: ['dataset', 'example_uid', 'input', 'answer', 'target', 'rule-based'],
    num_rows: 18213
})
Dataset({
    features: ['dataset', 'example_uid', 'input', 'answer', 'target', 'rule-based'],
    num_rows: 10344
})


In [12]:
dataset = df
dataset = dataset.filter(lambda example: example["answer"] is not None)
print(dataset)

  0%|          | 0/11 [00:00<?, ?ba/s]

  0%|          | 0/19 [00:00<?, ?ba/s]

DatasetDict({
    dev: Dataset({
        features: ['dataset', 'example_uid', 'input', 'answer', 'target', 'rule-based'],
        num_rows: 10344
    })
    train: Dataset({
        features: ['dataset', 'example_uid', 'input', 'answer', 'target', 'rule-based'],
        num_rows: 18213
    })
})


In [13]:
#Check some data
dataset['train'][1233]

{'dataset': 'SQuAD',
 'example_uid': '572c82d4dfb02c14005c6b8a',
 'input': 'In 2008 , what percentage of Tennessee residents were born outside the South ?',
 'answer': 'Twenty percent',
 'target': 'In 2008 , twenty percent of Tennessee residents were born outside the South .',
 'rule-based': 'In 2008 , twenty percent were born outside the South .'}

In [14]:
# Testing the tokenization
t=tokenizer(dataset['train'][1233]['input'],dataset['train'][1233]['answer'],add_special_tokens=True,
            max_length=13,
            padding = 'max_length',
            truncation='only_first',
            return_attention_mask=True,
            return_tensors="pt"
        )
input_ids = t['input_ids'].squeeze().tolist()

print("The way tokenizer would tokenize the dataset:\n",tokenizer.decode(input_ids))

The way tokenizer would tokenize the dataset:
 In 2008, what percentage of Tennessee residents</s> Twenty percent</s>
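
The truncated output above reflects `truncation='only_first'`: when the question and answer together exceed `max_length`, only the question is shortened, so the answer always survives. Here is a minimal sketch of that budget logic. It uses whitespace-split words as stand-in tokens (the real tokenizer works on SentencePiece pieces), and the helper name `truncate_only_first` is hypothetical, not part of the `transformers` API.

```python
def truncate_only_first(first, second, max_length, eos="</s>"):
    """Toy pair encoding in the T5 layout: first </s> second </s>.

    Only `first` is shortened when the pair does not fit; `second` is kept whole.
    Tokens here are plain words, not real SentencePiece ids.
    """
    # Reserve room for the second sequence and the two end-of-sequence markers
    budget = max_length - len(second) - 2
    if budget < 0:
        raise ValueError("second sequence alone does not fit in max_length")
    return first[:budget] + [eos] + second + [eos]


question = "In 2008 , what percentage of Tennessee residents were born outside the South ?".split()
answer = "Twenty percent".split()

encoded = truncate_only_first(question, answer, max_length=10)
print(" ".join(encoded))
```

With the real `max_length` of 512 used later in the notebook, QA2D questions are short enough that truncation should rarely trigger; the small value here just makes the effect visible.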


<p style="color:red; font-weight:600; font-size:35px;">Dataset Class 🏛️</p>

In [15]:
class FillupsDataset(Dataset):

    def __init__(self, dataframe, tokenizer, source_max_len, ans_max_len):
        self.tokenizer = tokenizer
        self.data = dataframe
        self.source_len = source_max_len
        self.ans_len = ans_max_len
        self.question = dataframe['input']
        self.answer = dataframe['answer']
        self.target = dataframe['target']

    def __len__(self):
        return len(self.question)

    def __getitem__(self, index):
        question = str(self.question[index])
        question = ' '.join(question.split())
        
        answer = str(self.answer[index])
        answer = ' '.join(answer.split())

        target = str(self.target[index])
        target = ' '.join(target.split())

        source_encoding = self.tokenizer(
            question,
            answer,
            add_special_tokens=True,
            max_length=self.source_len,
            padding = 'max_length',
            truncation='only_first',
            return_attention_mask=True,
            return_tensors="pt"
        )
        
        target_encoding = self.tokenizer(
            target,
            None,
            add_special_tokens=True,
            max_length=self.ans_len,
            padding = 'max_length',
            truncation= True,
            return_attention_mask=True,
            return_tensors="pt"
        )
        
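        source_ids = source_encoding['input_ids'].flatten()
        source_mask = source_encoding['attention_mask'].flatten()
        target_ids = target_encoding['input_ids']
        # Keep an unmasked copy of the target ids so the ground truth can be decoded later
        y = target_encoding['input_ids'].clone().flatten()
        # Replace padding ids (0 for T5) with -100 so the loss ignores padded positions
        target_ids[target_ids == self.tokenizer.pad_token_id] = -100
        target_ids = target_ids.flatten()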

        
        return {
            'question': question,
            'answer':answer,
            'target':target,
            'source_ids': source_ids,
            'source_mask': source_mask, 
            'target_ids': target_ids,
            'y':y
        }
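
The dataset class replaces padding ids in the labels with -100 while keeping an untouched copy in `y`. The sketch below illustrates why, in plain Python rather than PyTorch. It assumes the usual convention that -100 is the default `ignore_index` of PyTorch's cross-entropy loss and that T5 pads with id 0; the helper names and the per-token loss values are made up for illustration.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss
PAD_ID = 0           # pad token id of the T5 tokenizer


def mask_padding(label_ids, pad_id=PAD_ID, ignore_index=IGNORE_INDEX):
    """Replace padding ids so they do not contribute to the loss."""
    return [ignore_index if tok == pad_id else tok for tok in label_ids]


def mean_loss(per_token_losses, labels, ignore_index=IGNORE_INDEX):
    """Average the per-token losses over non-ignored positions only."""
    kept = [loss for loss, y in zip(per_token_losses, labels) if y != ignore_index]
    return sum(kept) / len(kept)


# Three real tokens (the last one is </s>, id 1) followed by two pad positions
labels = mask_padding([621, 46, 1, 0, 0])
print(labels)

# Hypothetical per-token losses: the large values on pad positions are ignored
print(mean_loss([0.5, 1.0, 1.5, 9.0, 9.0], labels))
```

The unmasked copy `y` is still needed because -100 is not a valid token id and cannot be passed to `tokenizer.decode`.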

In [16]:
# Testing the Dataset Class
DummyDataset = FillupsDataset(dataset["train"], tokenizer,20,20)
print(f"Length of Dataset is : {len(DummyDataset)}")
DummyDataset.__getitem__(34)

Length of Dataset is : 18213


{'question': 'After an accidental assention of a bill with same name in 1976 , when did a similar mistaken assention occur in Australia ?',
 'answer': '2001',
 'target': 'After an accidental assention of a bill with the same name in 1976 , a similar mistaken assention occurred in Australia in 2001 .',
 'source_ids': tensor([  621,    46, 24306,    38,     7,    35,  1575,    13,     3,     9,
          2876,    28,   337,   564,    16, 16164,     3,     1,  4402,     1]),
 'source_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]),
 'target_ids': tensor([  621,    46, 24306,    38,     7,    35,  1575,    13,     3,     9,
          2876,    28,     8,   337,   564,    16, 16164,     3,     6,     1]),
 'y': tensor([  621,    46, 24306,    38,     7,    35,  1575,    13,     3,     9,
          2876,    28,     8,   337,   564,    16, 16164,     3,     6,     1])}

<p style="color:red; font-weight:600; font-size:35px;">Data Module 💁</p>

In [17]:
# Printing the config dictionary containing hyperparameters
CONFIG

{'max_length': 512,
 'device': device(type='cuda'),
 'ans_max_length': 128,
 'num_beams': 3,
 'batch_size': 8,
 'epochs': 4,
 'model_checkpoint': 't5-base',
 'lr': 0.0001,
 'warmup_steps': 3000,
 'training_steps': 20000}

In [18]:
class FillupsDatasetModule(pl.LightningDataModule):

    def __init__(self, df_train, df_valid, tokenizer,source_max_len, target_max_len):
        super().__init__()
        self.df_train = df_train
        self.df_valid = df_valid
        self.tokenizer = tokenizer
        self.source_len = source_max_len
        self.ans_len = target_max_len


    def setup(self, stage=None):

        self.train_dataset = FillupsDataset(
        dataframe = self.df_train,
        tokenizer = self.tokenizer,
        source_max_len = self.source_len,
        ans_max_len = self.ans_len
        )

        self.valid_dataset = FillupsDataset(
        dataframe = self.df_valid,
        tokenizer = self.tokenizer,
        source_max_len = self.source_len,
        ans_max_len = self.ans_len
        )

    def train_dataloader(self):
        return DataLoader(
         self.train_dataset,
         batch_size= CONFIG['batch_size'],
         shuffle=True, 
         num_workers=2
        )


    def val_dataloader(self):
        return DataLoader(
         self.valid_dataset,
         batch_size= CONFIG['batch_size'],
         num_workers=1
        )

<p style="color:red; font-weight:600; font-size:35px;">Model 🙌</p>

<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:auto;
           font-family:Verdana;">

<p style="padding: 10px; color:white;"> I use the Adam optimizer with a learning rate scheduler that increases the learning rate linearly during an initial warm-up phase and then decays it linearly to zero!
</p>
</div>
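
To make the warm-up and decay shape concrete, here is a small sketch of the multiplier such a schedule applies to the base learning rate at each optimizer step. It is meant to mirror the behaviour of `get_linear_schedule_with_warmup` as I understand it, not to replace it; the function name `linear_warmup_decay` is hypothetical, and the numbers are the ones from `CONFIG`.

```python
def linear_warmup_decay(step, warmup_steps, training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Linear ramp from 0 to 1 over the warm-up phase
        return step / max(1, warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps, clamped at 0
    return max(0.0, (training_steps - step) / max(1, training_steps - warmup_steps))


BASE_LR = 1e-4                   # CONFIG['lr']
WARMUP, TOTAL = 3000, 20000      # CONFIG['warmup_steps'], CONFIG['training_steps']

for step in (0, 1500, 3000, 11500, 20000):
    lr = BASE_LR * linear_warmup_decay(step, WARMUP, TOTAL)
    print(f"step {step:>5}: lr = {lr:.2e}")
```

Note that the schedule is driven by optimizer steps, so the peak learning rate is only reached if training actually runs for at least `warmup_steps` steps.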

In [19]:
class FillupsModel(pl.LightningModule):
    
    def __init__(self):
        super().__init__()
        self.model = T5ForConditionalGeneration.from_pretrained(CONFIG['model_checkpoint'], return_dict=True)
        self.learning_rate = CONFIG['lr']
        self.targets = []
        self.predictions = []
        self.save_hyperparameters()
        
    # Forward pass    
    def forward(self, input_ids, attention_mask, labels=None):

        output = self.model(
            input_ids=input_ids, 
            attention_mask=attention_mask, 
            labels=labels
        )

        return output.loss, output.logits

    # During training phase
    def training_step(self, batch, batch_idx):

        input_ids = batch["source_ids"]
        attention_mask = batch["source_mask"]
        labels= batch["target_ids"]
        loss, outputs = self(input_ids, attention_mask, labels)
        self.log("train_loss", loss)

        return loss

    # During validation phase
    def validation_step(self, batch, batch_idx):
        
        input_ids = batch["source_ids"]
        attention_mask = batch["source_mask"]
        labels= batch["target_ids"]
        y = batch["y"]
  
        loss, outputs = self(input_ids, attention_mask, labels)
        
        generated_ids = self.model.generate(
                input_ids = input_ids,
                attention_mask = attention_mask, 
                max_length= CONFIG['ans_max_length'], 
                num_beams= CONFIG['num_beams'],
                repetition_penalty=2.1, 
                early_stopping=True
                )
        preds = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=True) for g in generated_ids]
        truth = [tokenizer.decode(t, skip_special_tokens=True, clean_up_tokenization_spaces=True) for t in y]
 
        
        self.predictions.extend(preds)
        self.targets.extend(truth)
        
        self.log("val_loss", loss)
        
        return loss

    # Calculate the BLEU score at end of validation
    def on_validation_epoch_end(self):
        # Note: targets and predictions are raw strings, so NLTK builds n-grams over characters, not words
        bleu_score = corpus_bleu([[actual] for actual in self.targets], self.predictions) * 100.0
        self.targets = []
        self.predictions = []
        self.log('bleu_score',bleu_score)
        return bleu_score
    
    # To configure optimisers and LR schedulers
    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=self.learning_rate)
        scheduler = get_linear_schedule_with_warmup(
            optimizer,
            num_warmup_steps=CONFIG['warmup_steps'],
            num_training_steps=CONFIG['training_steps'],
        )
        scheduler = {"scheduler": scheduler, "interval": "step", "frequency": 1}
        return [optimizer], [scheduler]

<p style="color:red; font-weight:600; font-size:35px;">Training Loop 🔃</p>

<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:auto;
           font-family:Verdana;">

<p style="padding: 10px; color:white;">I have used PyTorch Lightning ⚡ for this as it simplifies distributed training across multiple GPUs or machines. Additionally, it supports automatic mixed-precision training, enabling faster and more memory-efficient training!
</p>
</div>

In [20]:
def run():
    
    df_train, df_valid = train_test_split(dataset['train'], test_size=0.2, random_state=2021)
    
    dataModule = FillupsDatasetModule(df_train, df_valid, tokenizer, CONFIG['max_length'],CONFIG['ans_max_length'])
    dataModule.setup()

    models = FillupsModel()
 

    checkpoint_callback  = ModelCheckpoint(
        dirpath="/kaggle/working",
        filename="best_checkpoint",
        save_top_k=2,
        verbose=True,
        monitor="val_loss",
        mode="min"
    )


    trainer = pl.Trainer(
        callbacks = checkpoint_callback,
        max_epochs= CONFIG['epochs'],
        devices = 2,
        accelerator="gpu",
        logger = wandb_logger,
        precision = 16
    )

    trainer.fit(models, dataModule)


In [21]:
# To start the fine tuning!
run()


<p style="color:red; font-weight:600; font-size:35px;">Model evaluation ✒️</p>

In [22]:
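# Loading the saved artifact from wandb
# (named wandb_run so it does not shadow the run() function defined above)
wandb_run = wandb.init()
artifact = wandb_run.use_artifact('wordless-souls/Text2Questions/model-9ouqiax8:v2', type='model')
# download() returns the local directory that holds the checkpoint
artifact_dir = artifact.download()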

[34m[1mwandb[0m: Downloading large artifact model-9ouqiax8:v2, 2551.26MB. 1 files... 
[34m[1mwandb[0m:   1 of 1 files downloaded.  
Done. 0:0:14.2


In [23]:
# Preparing the test dataset 
test_dataset, _ = train_test_split(dataset['dev'], test_size=0.6, random_state= 420)
print(f"Length of test dataset {len(test_dataset['answer'])}")

Length of test dataset 4137
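
Because `train_test_split` returns the train part first, `test_size=0.6` means only about 40% of the dev split is kept as the test set here. The sketch below reproduces the size arithmetic, assuming scikit-learn's rounding rule for a float `test_size` (the held-out part is rounded up and the remainder is kept). The helper name `split_sizes` is hypothetical.

```python
import math


def split_sizes(n_samples, test_size):
    """Sizes of the (first, second) parts for a float test_size."""
    n_test = math.ceil(test_size * n_samples)  # held-out part, rounded up
    n_train = n_samples - n_test               # the part returned first
    return n_train, n_test


# The dev split has 10344 rows (see the dataset summary earlier in the notebook)
kept, discarded = split_sizes(10344, 0.6)
print(kept, discarded)
```

The first value matches the 4137 rows reported for the test set.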


In [24]:
# Preparing the test dataset using the Dataset class that we have defined already
test_datamodule = FillupsDataset(
        dataframe = test_dataset,
        tokenizer = tokenizer,
        source_max_len = CONFIG['max_length'],
        ans_max_len = CONFIG['ans_max_length']
        )

In [25]:
# Preparing the Dataloader for test
test_loader = DataLoader(
         test_datamodule,
         batch_size= CONFIG['batch_size'],
         num_workers=1
        )

<p style="color:red; font-weight:600; font-size:35px;">Inference on Test Data 🧪</p>

In [26]:
model = FillupsModel.load_from_checkpoint("/kaggle/working/artifacts/model-9ouqiax8:v2/model.ckpt")
model.to(CONFIG['device'])
model.freeze()
outputs = []
targets = []
converted_sentences = []
truth_sentences = []
for batch in tqdm(test_loader):
    outs = model.model.generate(
        input_ids = batch["source_ids"].to(CONFIG['device']),
        attention_mask = batch["source_mask"].to(CONFIG['device']),
        max_length = CONFIG['max_length'],
        num_beams = CONFIG['num_beams'],
        repetition_penalty=2.1, 
        early_stopping=True,
        )
    

    preds = [
        tokenizer.decode(ids,
        skip_special_tokens=True, 
        clean_up_tokenization_spaces=True)
        for ids in outs
    ]
    
    truth = [
      tokenizer.decode(ids,
                       skip_special_tokens=True, 
                       clean_up_tokenization_spaces=True
                      ) for ids in batch['y']]
  
    outputs.extend(preds)
    targets.extend(truth)


  0%|          | 0/518 [00:00<?, ?it/s]

<p style="color:red; font-weight:600; font-size:35px;">Saving the results in a DataFrame 🥳</p>

In [27]:
final_df = pd.DataFrame({'Generated Text':outputs,'Actual Text': targets})
final_df.to_csv('predictions.csv', index=None)

<p style="color:red; font-weight:600; font-size:35px;">Calculating the metrics 🔥</p>

<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:auto;
           font-family:Verdana;">

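<p style="padding: 10px; color:white;">BLEU (Bilingual Evaluation Understudy) is a metric commonly used to evaluate machine translation and other text generation tasks. It measures the overlap between the generated text and one or more reference texts based on n-grams (contiguous sequences of tokens, usually words). Note that the cell below passes raw strings to <code>corpus_bleu</code>, so the n-grams are built over characters rather than words, which typically gives higher scores than word-level BLEU.
</p>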
</div>
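
NLTK's `corpus_bleu` expects each reference and hypothesis as a sequence of tokens; given a plain string, it iterates character by character. The sketch below shows how much the choice of unit matters, using the clipped ("modified") n-gram precision that BLEU is built on. It is a simplified illustration, not a full BLEU implementation (there is no brevity penalty and no averaging over n-gram orders), and the helper names are hypothetical. The two sentences are the `target` and `rule-based` fields of the sample record shown earlier.

```python
from collections import Counter


def ngrams(seq, n):
    """All contiguous n-grams of a sequence (list of words or string of characters)."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


def modified_precision(reference, hypothesis, n):
    """Share of hypothesis n-grams found in the reference, with clipped counts."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    clipped = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return clipped / total if total else 0.0


ref = "In 2008 , twenty percent of Tennessee residents were born outside the South ."
hyp = "In 2008 , twenty percent were born outside the South ."

word_p2 = modified_precision(ref.split(), hyp.split(), 2)  # word bigrams
char_p2 = modified_precision(ref, hyp, 2)                  # character bigrams
print(f"word-level bigram precision: {word_p2:.2f}")
print(f"char-level bigram precision: {char_p2:.2f}")
```

To get word-level BLEU from `corpus_bleu`, one option is to split each sentence into tokens first, for example with `.split()`; the resulting score is not comparable with a character-level one.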

In [28]:
bleu_score = corpus_bleu([[actual] for actual in targets], outputs) * 100.0

In [29]:
markdown_text = '''
<div class="alert alert-block alert-info" style="font-size:14px; font-family:verdana;">
    📌 <b>BLEU Score on test dataset:</b><br> {:.2f}
</div>
'''.format(bleu_score)

from IPython.display import Markdown, display
display(Markdown(markdown_text))


<div class="alert alert-block alert-info" style="font-size:14px; font-family:verdana;">
    📌 <b>BLEU Score on test dataset:</b><br> 91.92
</div>
