# T5 - Fine-tuning on dataset

In this notebook, we implement a pipeline for fine-tuning [T5](https://github.com/google-research/text-to-text-transfer-transformer) (Text-To-Text Transfer Transformer) on the Medical Meadow Anki flashcards dataset. Due to disk and GPU constraints, we use [flan-t5-small](https://huggingface.co/google/flan-t5-small) (about 77 million parameters), the smallest FLAN-T5 checkpoint, loaded through Hugging Face's transformers library with PyTorch.\
T5 is an NLP model based on the transformer architecture that treats every task as a text-to-text problem, so the same model can handle text classification, translation, and question answering. It was designed from the start for transfer learning, that is, to be pre-trained once and then fine-tuned on specific tasks.\
T5 was pre-trained on C4 (Colossal Clean Crawled Corpus), a cleaned web-page dataset that Google derived from Common Crawl. The FLAN-T5 checkpoints were additionally instruction-tuned on a collection of tasks.
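
To make the text-to-text framing concrete, the sketch below casts three different tasks as plain (input string, target string) pairs. The helper `to_text_pair` and the example sentences are hypothetical, and the task prefixes only imitate the style used in the T5 paper.

```python
def to_text_pair(task_prefix, source_text, target_text):
    """Cast any task as an (input string, target string) pair."""
    return f"{task_prefix}: {source_text}", target_text


# Hypothetical examples: translation, classification, question answering
examples = [
    to_text_pair("translate English to German", "That is good.", "Das ist gut."),
    to_text_pair("cola sentence", "The course is jumping well.", "not acceptable"),
    to_text_pair("question", "Which hormone lowers blood glucose?", "Insulin"),
]

for model_input, target in examples:
    print(f"{model_input!r} -> {target!r}")
```

In this notebook the flashcard questions are passed to the model as they are, without a task prefix.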

## 0) Imports

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
!pip install nltk
!pip install datasets
!pip install transformers[torch]
!pip install tokenizers
!pip install evaluate
!pip install rouge_score
!pip install sentencepiece
!pip install huggingface_hub
# json ships with Python's standard library, so it needs no pip install

import json
import nltk
import evaluate
import numpy as np
import pandas as pd
from datasets import load_dataset
from transformers import T5Tokenizer, DataCollatorForSeq2Seq
from transformers import T5ForConditionalGeneration, Seq2SeqTrainingArguments, Seq2SeqTrainer


## 1) Loading model
We now load Google's pre-trained checkpoint from the Hugging Face Hub, together with its tokenizer and a data collator for sequence-to-sequence batches.

In [None]:
# Load the tokenizer, model, and data collator
MODEL_NAME = "google/flan-t5-small"

tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


## 2) Data pre-processing

### 2.1) Data loading
We load our dataset from the .json file using the json library. For convenience, we store and manipulate the data in a pandas DataFrame. We then discard the 'instruction' column and drop any row with a missing question or answer.

In [None]:
# Load the dataset
dataset_path = "/content/drive/MyDrive/Colab Notebooks/NLP/"
dataset_name = 'medical_meadow_wikidoc_medical_flashcards.json'
with open(dataset_path + dataset_name, "r") as f:
    dataset = json.load(f)
dataset_df = pd.DataFrame(dataset)
dataset_df = dataset_df.drop(['instruction'], axis=1)
dataset_df = dataset_df.replace('', np.nan).dropna()
display(dataset_df)

| | input | output |
|---|---|---|
| 0 | What is the relationship between very low Mg2+... | Very low Mg2+ levels correspond to low PTH lev... |
| 1 | What leads to genitourinary syndrome of menopa... | Low estradiol production leads to genitourinar... |
| 2 | What does low REM sleep latency and experienci... | Low REM sleep latency and experiencing halluci... |
| 3 | What are some possible causes of low PTH and h... | PTH-independent hypercalcemia, which can be ca... |
| 4 | How does the level of anti-müllerian hormone r... | The level of anti-müllerian hormone is directl... |
| ... | ... | ... |
| 33946 | What is Opsoclonus-Myoclonus Ataxia Syndrome, ... | Opsoclonus-Myoclonus Ataxia Syndrome is a para... |
| 33947 | What is Opsoclonus-Myoclonus Ataxia Syndrome, ... | Opsoclonus-Myoclonus Ataxia Syndrome is a para... |
| 33948 | Is A part of B in a proportion of A/B? | Yes, A is part of B in a proportion of A/B. |
| 33949 | What is the mnemonic "Microtubules Get Constru... | The mnemonic "Microtubules Get Constructed Ver... |
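
The cleaning rule described above (ignore `instruction`, then drop every record whose question or answer is empty) can be checked on a few toy records. The sketch below is a plain-Python equivalent of that rule, not the pandas code itself; `clean_records` and the sample records are hypothetical.

```python
def clean_records(records):
    """Keep only the question/answer fields of records where both are non-empty."""
    cleaned = []
    for record in records:
        question = record.get("input")
        answer = record.get("output")
        # Skip records with a missing (None) or empty ('') question or answer
        if not question or not answer:
            continue
        cleaned.append({"input": question, "output": answer})
    return cleaned


# Toy records, invented for illustration
toy_records = [
    {"instruction": "toy instruction", "input": "What is PTH?", "output": "Parathyroid hormone."},
    {"instruction": "toy instruction", "input": "", "output": "An answer without a question."},
    {"instruction": "toy instruction", "input": "A question without an answer?", "output": ""},
]

print(clean_records(toy_records))
```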


### 2.2) Data splitting
We split the data into train and test partitions, holding out 30% of the rows for testing.

In [None]:
# data train test split
from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(dataset_df, test_size=0.3, random_state=42)
train_df[:5]

| | input | output |
|---|---|---|
| 6026 | What is the name of the enzyme responsible for... | Hepatic TG lipase is the enzyme responsible fo... |
| 5419 | During which stage of sleep does bruxism, or t... | Bruxism, or teeth grinding, occurs during stag... |
| 21496 | What is one source of von Willebrand factor (v... | One source of von Willebrand factor (vWF) is f... |
| 14104 | What are some of the common bacterial pathogen... | Some of the common bacterial causes of sinusit... |
| 1977 | What is a local vascular complication that can... | Arteriovenous fistula is a local vascular comp... |
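
As a quick sanity check on the partition sizes, the sketch below reproduces the arithmetic of a 70/30 split. It assumes that scikit-learn rounds the test partition up and gives the remainder to the training partition; `split_sizes` is a hypothetical helper, and the row total is taken from the 23482 and 10065 examples reported by the tokenization step.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test set up."""
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


# Total number of cleaned rows = train rows + test rows
n_total = 23482 + 10065
n_train, n_test = split_sizes(n_total, 0.3)
print(n_total, n_train, n_test)
```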


### 2.3) Dataset tokenization
We then tokenize the whole dataset (train and test) with the T5 tokenizer. We wrap both splits in the ```DatasetDict``` class from datasets, so that a single ```map``` call tokenizes them and the Trainer can consume the result directly.

In [None]:
from datasets import DatasetDict, Dataset

# Pair each question with its answer
training_data_string = list(zip(train_df['input'], train_df['output']))
test_data_string = list(zip(test_df['input'], test_df['output']))

train_data = Dataset.from_dict({'text': training_data_string})
test_data = Dataset.from_dict({'text': test_data_string})

data = DatasetDict()
data['train'] = train_data
data['test'] = test_data

def tokenize_function(examples):
    # Each 'text' entry is a (question, answer) pair
    questions = [pair[0] for pair in examples['text']]
    answers = [pair[1] for pair in examples['text']]
    input_encodings = tokenizer(questions, truncation=True, padding='max_length', max_length=512)
    label_encodings = tokenizer(answers, truncation=True, padding='max_length', max_length=512)
    # Replace padding in the labels with -100 so the loss ignores those positions
    labels = [
        [token if token != tokenizer.pad_token_id else -100 for token in sequence]
        for sequence in label_encodings.input_ids
    ]
    return {
        'input_ids': input_encodings.input_ids,
        'labels': labels
    }

tokenized_data = data.map(tokenize_function, batched=True)

Map:   0%|          | 0/23482 [00:00<?, ? examples/s]

Map:   0%|          | 0/10065 [00:00<?, ? examples/s]

In [None]:
print(tokenized_data)

DatasetDict({
    train: Dataset({
        features: ['text', 'input_ids', 'labels'],
        num_rows: 23482
    })
    test: Dataset({
        features: ['text', 'input_ids', 'labels'],
        num_rows: 10065
    })
})


We can see the field 'input_ids' and 'labels': these contain the tokenized versions of the question and the answer, respectively.
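
Because every sequence is padded to a fixed length, most label positions are padding. A common convention with Hugging Face sequence-to-sequence models is to replace padding ids in the labels with -100, the value the loss ignores and the same value that `compute_metrics` converts back to the pad token before decoding. The sketch below illustrates this on toy ids; `mask_label_padding` is a hypothetical helper, the token ids are invented, and the pad id 0 is T5's pad token id.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss


def mask_label_padding(label_ids, pad_token_id=0):
    """Replace every padding id in each label sequence with IGNORE_INDEX."""
    return [
        [token if token != pad_token_id else IGNORE_INDEX for token in sequence]
        for sequence in label_ids
    ]


# Toy label ids padded to length 5 (values invented for illustration)
toy_labels = [[363, 19, 1, 0, 0], [8, 1, 0, 0, 0]]
masked_labels = mask_label_padding(toy_labels)
print(masked_labels)
```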

## 3) Training

### 3.1) Compute metrics function
We use the ROUGE score to track quality during training. To do so, we define a ```compute_metrics()``` function and pass it to the Trainer.

ROUGE measures how close the generated text is to the reference text (the ground-truth answer). ROUGE-n is based on the overlap of n-grams between the two texts.
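
To illustrate the n-gram overlap idea, here is a minimal sketch of a ROUGE-n style score computed by hand. It assumes lowercased whitespace tokenization and no stemming, so its numbers will not match the `rouge_score` package exactly; `ngram_counts` and `rouge_n` are hypothetical helpers.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(prediction, reference, n=1):
    """Return precision, recall and F1 of the n-gram overlap between two texts."""
    pred = ngram_counts(prediction.lower().split(), n)
    ref = ngram_counts(reference.lower().split(), n)
    # Clipped overlap: each n-gram counts at most as often as it appears in both texts
    overlap = sum((pred & ref).values())
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


print(rouge_n("the cat sat", "the cat sat on the mat", n=1))
print(rouge_n("the cat sat", "the cat sat on the mat", n=2))
```

In the first call all three predicted unigrams appear in the reference, so precision is 1.0, while recall is 0.5 because the reference has six unigrams.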

In [None]:
# ROUGE score
nltk.download("punkt", quiet=True)
metric = evaluate.load("rouge")

# Define the compute_metrics function
def compute_metrics(eval_preds):
   preds, labels = eval_preds

   # decode preds and labels
   labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
   decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
   decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

   # rougeLSum expects newline after each sentence
   decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
   decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]

   result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

   return result

### 3.2) Fine-tuning
We now run the fine-tuning with the Seq2SeqTrainer object. Because of hardware constraints, we opted for 10 epochs, a per-device batch size of 4, and 8 gradient accumulation steps, which gives an effective batch size of 32. The values of the other parameters were taken from [this tutorial](https://www.datacamp.com/tutorial/flan-t5-tutorial).
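
The interaction between batch size and gradient accumulation can be sketched with a little arithmetic. The helpers below are hypothetical and assume a single GPU; the update count is only an approximation, since the Trainer's own rounding may differ slightly.

```python
import math


def effective_batch_size(per_device_batch, grad_acc_steps, n_devices=1):
    """Number of examples that contribute to one optimizer update."""
    return per_device_batch * grad_acc_steps * n_devices


def approx_updates_per_epoch(n_examples, per_device_batch, grad_acc_steps, n_devices=1):
    """Approximate number of optimizer updates in one pass over the data."""
    batch = effective_batch_size(per_device_batch, grad_acc_steps, n_devices)
    return math.ceil(n_examples / batch)


batch = effective_batch_size(4, 8)
updates = approx_updates_per_epoch(23482, 4, 8)
print(batch, updates, 10 * updates)
```

With 23482 training examples, this gives roughly 734 updates per epoch, so about 7340 updates over 10 epochs.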

In [None]:
# Training

# Global Parameters
L_RATE = 3e-4
BATCH_SIZE = 4
PER_DEVICE_EVAL_BATCH = 2
WEIGHT_DECAY = 0.01
SAVE_TOTAL_LIM = 3
GRADIENT_ACC_STEPS = 8
NUM_EPOCHS = 10

# Set up training arguments
training_args = Seq2SeqTrainingArguments(
   output_dir="./t5-trainer-results",
   evaluation_strategy="steps",  # named `eval_strategy` in recent transformers releases
   gradient_accumulation_steps=GRADIENT_ACC_STEPS,
   learning_rate=L_RATE,
   per_device_train_batch_size=BATCH_SIZE,
   per_device_eval_batch_size=PER_DEVICE_EVAL_BATCH,
   weight_decay=WEIGHT_DECAY,
   save_total_limit=SAVE_TOTAL_LIM,
   num_train_epochs=NUM_EPOCHS,
   predict_with_generate=True,
   push_to_hub=False,
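   fp16=False,  # T5 checkpoints tend to overflow in fp16, which can show up as a 0.0 training loss and a NaN validation loss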
   lr_scheduler_type="linear"
)

# instantiate the trainer
trainer = Seq2SeqTrainer(
   model=model,
   args=training_args,
   train_dataset=tokenized_data['train'],
   eval_dataset=tokenized_data['test'],
   tokenizer=tokenizer,
   data_collator=data_collator,
   compute_metrics=compute_metrics
)
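
The `lr_scheduler_type="linear"` setting decays the learning rate linearly from its initial value to zero over the training run. A minimal sketch of that schedule follows, assuming no warmup steps; `linear_lr` is a hypothetical helper and the total of 1000 steps is an arbitrary illustrative value.

```python
def linear_lr(step, total_steps, base_lr=3e-4, warmup_steps=0):
    """Learning rate at a given optimizer step under linear warmup and linear decay."""
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr during warmup
        return base_lr * step / max(1, warmup_steps)
    # Linear decay from base_lr down to 0 at total_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 250, 500, 750, 1000):
    print(step, linear_lr(step, total_steps=1000))
```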

We now run the training:

In [None]:
import torch
torch.cuda.empty_cache()
trainer.train()

| Step | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum |
|---|---|---|---|---|---|---|
| 500 | 0.0 | | 0.000365 | 2e-06 | 0.000343 | 0.000355 |
| 1000 | 0.0 | | 0.000365 | 2e-06 | 0.000343 | 0.000355 |




Note that the output shown in this notebook does not come from the actual 10-epoch run. That final run was done on a Kaggle machine, and we could not save its outputs.

## 4) Save model, reload model and generation

### 4.1) Saving the model

In [None]:
# save model
from datetime import datetime

checkpoint_path = f"t5_fine_tuning_{datetime.now().strftime('%Y_%m_%d_%H_%M_%S')}"
tokenizer.save_pretrained(checkpoint_path)
model.save_pretrained(checkpoint_path)
print(f"Checkpoint saved at: \'{checkpoint_path}\'")

### 4.2) Reloading the model

In [None]:
# reload model
tokenizer = T5Tokenizer.from_pretrained(checkpoint_path)
model = T5ForConditionalGeneration.from_pretrained(checkpoint_path , device_map="cuda")

### 4.3) Generation on a new question

In [None]:
# Ask a question
input_question = ""

# Encode the question and move it to the same device as the model
input_encoding = tokenizer(input_question, return_tensors='pt').to(model.device)
# Generate a response with nucleus (top-p) sampling
output_ids = model.generate(**input_encoding, max_new_tokens=64, do_sample=True, temperature=1.0, top_p=0.9, top_k=0)
# Decode the generated response; T5 is an encoder-decoder model, so the output holds only the answer tokens
generated_response = tokenizer.decode(output_ids[0], skip_special_tokens=True)
generated_response
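
The `top_p=0.9` argument enables nucleus sampling: at each step only the most probable tokens whose cumulative probability reaches 0.9 are kept, and the next token is drawn from that reduced set. The sketch below shows the filtering step on a toy distribution; `top_p_filter` and the probabilities are hypothetical, and the library's own implementation works on logits and differs in detail.

```python
def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of most probable tokens whose mass reaches top_p, then renormalize."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Toy next-token distribution, invented for illustration
toy_probs = {"insulin": 0.6, "glucagon": 0.25, "cortisol": 0.1, "banana": 0.05}
filtered = top_p_filter(toy_probs, top_p=0.9)
print(filtered)
```

Here the least likely token is discarded and the remaining three probabilities are rescaled to sum to 1.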