## Full fine-tuning of the flan-T5 model

What is covered?
1. Load the flan-t5 model and the dialogue-summarization dataset.
2. Fully fine-tune the flan-t5 model on an NVIDIA A6000 GPU.
3. Test inference with the base model and the fine-tuned model.
4. Evaluate the fine-tuned model with ROUGE and BLEU scores.
5. Track the experiment with wandb (Weights & Biases).
6. Learn to use the Paperspace Gradient service to fine-tune your model.

In [1]:
%pip install --upgrade pip
%pip install --disable-pip-version-check \
    torch==1.13.1 \
    torchdata==0.5.1 --quiet

%pip install \
    transformers==4.27.2 \
    datasets==2.11.0 \
    evaluate==0.4.0 \
    rouge_score==0.1.2 \
    loralib==0.1.1 \
    wandb \
    peft==0.3.0 --quiet


In [2]:
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer, EarlyStoppingCallback
import torch
import time
import pandas as pd
import numpy as np
import wandb

wandb.login()

wandb: Currently logged in as: aambekar234. Use `wandb login --relogin` to force relogin


True

## 1. Load the flan-t5 model and dialogue summarization dataset
1. Check the datatype of the model's tensors
2. Check where exactly the model is loaded (CPU or GPU)
3. Redo the data splits for a balanced train/validation/test split
4. Tokenize the dataset for training

In [3]:
# load dialogue-summary dataset
huggingface_dataset_name = "knkarthick/dialogsum"
dataset = load_dataset(huggingface_dataset_name)
# load model and tokenizer
original_model = AutoModelForSeq2SeqLM.from_pretrained('google/flan-t5-base')
tokenizer = AutoTokenizer.from_pretrained('google/flan-t5-base')

# dtype check on model tensor. You could change it to bfloat16 to reduce the memory usage. 
# Note: bfloat16 won't work on Apple Silicon Macs
dtype = next(original_model.parameters()).dtype
print(f"Tensor's dataType -->{dtype}")

#check where the model is loaded (should print either cpu or cuda)
print(f"Model is loaded on -->{next(original_model.parameters()).device}")

Found cached dataset csv (/root/.cache/huggingface/datasets/knkarthick___csv/knkarthick--dialogsum-cd36827d3490488d/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1)
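
The dtype check matters because parameter precision drives memory use. As a rough sketch, the snippet below estimates the weight footprint at different precisions. The parameter count of about 248 million for `google/flan-t5-base` is an assumption here (you can verify it with `sum(p.numel() for p in original_model.parameters())`), and the estimate covers weights only, not gradients, optimizer state, or activations.

```python
# Bytes used by one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}

def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Return the memory needed to hold the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3

# Assumed parameter count for flan-t5-base; check it on your own model.
ASSUMED_NUM_PARAMS = 248_000_000

for name in ("float32", "bfloat16"):
    print(f"{name}: {weight_memory_gib(ASSUMED_NUM_PARAMS, name):.2f} GiB")
```

Full fine-tuning needs several times this amount, because gradients and optimizer state are kept for every parameter.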



### 1.3 Redoing the data splits for a balanced train/validation/test split

In [5]:
#redoing the dataset split as the default one is not balanced for model training
from datasets import load_dataset, concatenate_datasets, DatasetDict

# Combine the splits (train, test, validation)
combined_dataset = concatenate_datasets([dataset["train"], dataset["test"], dataset["validation"]])

# Shuffle the combined dataset
combined_dataset = combined_dataset.shuffle(seed=42)

# Split the dataset into 80% train, 10% test, 10% validation
train_test_split = combined_dataset.train_test_split(test_size=0.20)  # Splitting 20% for test+validation
test_validation_split = train_test_split['test'].train_test_split(test_size=0.5)  # Splitting the 20% into two equal halves

# Creating the final DatasetDict
final_dataset = DatasetDict({
    'train': train_test_split['train'],
    'test': test_validation_split['test'],
    'validation': test_validation_split['train']
})

test_summaries = final_dataset['test']['summary']

Loading cached shuffled indices for dataset at /root/.cache/huggingface/datasets/knkarthick___csv/knkarthick--dialogsum-cd36827d3490488d/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1/cache-112982c8e6f56127.arrow
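
To make the two-stage split easier to follow, here is a minimal pure-Python sketch of the same idea: shuffle once, hold out 20%, then cut the holdout in half. The function name `three_way_split` and the toy data are hypothetical, and this is not what `datasets` does internally. It also fixes the seed at every stage, which `train_test_split` supports through its `seed` argument if you want reproducible splits.

```python
import random

def three_way_split(items, holdout_fraction=0.20, seed=42):
    """Shuffle once, then carve the holdout into equal test/validation halves."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    shuffled = list(items)
    rng.shuffle(shuffled)

    n_holdout = round(len(shuffled) * holdout_fraction)
    n_test = n_holdout // 2
    holdout, train = shuffled[:n_holdout], shuffled[n_holdout:]
    return {
        "train": train,
        "test": holdout[:n_test],
        "validation": holdout[n_test:],
    }

splits = three_way_split(range(1000))
print({name: len(rows) for name, rows in splits.items()})
```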


### 1.4 Tokenizing the dataset for training

In [7]:
def tokenize_function(examples):
    start_prompt = 'Summarize the following conversation.\n\n'
    end_prompt = '\n\nSummary: '
    prompts = [start_prompt + dialogue + end_prompt for dialogue in examples["dialogue"]]
    model_max_input_length = tokenizer.model_max_length

    # Tokenize the input dialogue text
    tokenized_inputs = tokenizer(prompts, max_length=model_max_input_length, padding="max_length", truncation=True)
    
    # Tokenize the labels for the dialogues
    tokenized_labels = tokenizer(examples["summary"], max_length=model_max_input_length, padding="max_length", truncation=True)

    # We need to replace the labels token ids of padding with -100 so they are not taken into account in the loss computation
    tokenized_labels["input_ids"] = [
        [(label if label != tokenizer.pad_token_id else -100) for label in labels] for labels in tokenized_labels["input_ids"]
    ]

    return {"input_ids": tokenized_inputs["input_ids"], "labels": tokenized_labels["input_ids"]}

# Tokenize the entire dataset
tokenized_datasets = final_dataset.map(tokenize_function, batched=True)

# Remove columns which are not necessary for training
columns_to_remove = ['id', 'topic', 'dialogue', 'summary']
tokenized_datasets = tokenized_datasets.remove_columns(columns_to_remove)

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 11568
    })
    test: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 1736
    })
    validation: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 1156
    })
})
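
The `-100` replacement in `tokenize_function` is easier to see on a toy example. The sketch below uses made-up token ids and made-up per-token losses; only the pad id of 0 (used by T5 tokenizers) and the ignore value of -100 (the default `ignore_index` of PyTorch's cross-entropy loss) come from the real libraries. The helper names are hypothetical.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy skips by default

def mask_padding(label_ids, pad_token_id):
    """Replace padding ids with IGNORE_INDEX so they do not contribute to the loss."""
    return [tok if tok != pad_token_id else IGNORE_INDEX for tok in label_ids]

def mean_loss(per_token_losses, labels):
    """Average the loss over positions whose label is not IGNORE_INDEX."""
    kept = [loss for loss, lab in zip(per_token_losses, labels) if lab != IGNORE_INDEX]
    return sum(kept) / len(kept)

PAD_ID = 0  # T5 tokenizers use 0 as the pad token id
labels = mask_padding([71, 1712, 1, 0, 0, 0], PAD_ID)  # toy ids, padded to length 6
toy_losses = [2.0, 1.0, 0.5, 9.0, 9.0, 9.0]            # illustrative values only

print(labels)
print(mean_loss(toy_losses, labels))
```

Without the masking, the three padded positions would dominate the average even though they carry no information about the summary.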


## 2. Fully fine-tune the flan-t5 model on the above dataset and track the experiment with wandb

In [10]:
lr_rate = 3e-5
wt_decay = 0.01
early_st_th = 0.009 
early_st_ptnce = 3
steps = 250

# wandb configuration for experiment tracking
config={
    'learning_rate': lr_rate,
    'weight_decay': wt_decay,
    'early_stopping_threshold' : early_st_th,
    'early_stopping_patience':early_st_ptnce,
    'steps':steps,
    'per_device_train_batch_size':32,
    'per_device_eval_batch_size':16,
}

timestamp = str(int(time.time()))

output_dir = f'/notebooks/models/flant5-fullfinetuned-{timestamp}'

# The early stopping callback halts training when the validation loss stops improving significantly.
early_stopping_callback = EarlyStoppingCallback(early_stopping_patience=early_st_ptnce, early_stopping_threshold=early_st_th)

training_args = TrainingArguments(
    report_to="wandb",
    output_dir=output_dir,
    learning_rate=lr_rate,
    auto_find_batch_size=True,
    weight_decay=wt_decay,
    logging_steps=steps,
    eval_steps=steps,
    max_steps=1000,
    evaluation_strategy="steps",
    save_strategy="steps",
    load_best_model_at_end = True,
    gradient_accumulation_steps=2,   
    max_grad_norm=1.0,
    warmup_steps=500, 
)

trainer = Trainer(
    model=original_model.to("cuda:0"),
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation'],
    callbacks=[early_stopping_callback]
)
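
To see what `warmup_steps=500` combined with `max_steps=1000` implies, here is a small sketch of a linear warmup followed by linear decay, assuming the Trainer's default linear scheduler. The function `linear_warmup_decay_lr` is a hypothetical re-implementation for illustration, not the library's code.

```python
def linear_warmup_decay_lr(step, base_lr=3e-5, warmup_steps=500, total_steps=1000):
    """Learning rate at a given optimizer step: ramp up linearly, then decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)

for step in (0, 250, 500, 750, 1000):
    print(step, linear_warmup_decay_lr(step))
```

With these settings half of the run is spent warming up, so the peak learning rate of `3e-5` is only reached at the midpoint.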

In [11]:
run = wandb.init(project='genai-llm', config=config, name=f'flant5-fullfinetune-{timestamp}')
start_time = time.time()
trainer.train()
training_time = time.time() - start_time
run.log({"Training time (seconds)":training_time})
run.log({"Training configuration":config})

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)





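| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 250  | 1.3756        | 1.176898        |
| 500  | 1.2782        | 1.124501        |
| 750  | 1.2328        | 1.091984        |
| 1000 | 1.1791        | 1.077777        |
| 1250 | 1.1562        | 1.064261        |
| 1500 | 1.1532        | 1.05903         |
| 1750 | 1.1149        | 1.058452        |
| 2000 | 1.1161        | 1.055144        |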
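
The patience and threshold settings can be checked against the validation losses logged above. The sketch below is a simplified model of early stopping, assuming an evaluation only counts as an improvement when it beats the best loss so far by more than the threshold. The function `first_stop_index` is hypothetical; the real `EarlyStoppingCallback` compares against the Trainer's own best-metric state, so it may differ in detail.

```python
def first_stop_index(eval_losses, patience=3, threshold=0.009):
    """Return the index of the evaluation that triggers a stop, or None if none does."""
    best = None
    stale = 0  # consecutive evaluations without a significant improvement
    for i, loss in enumerate(eval_losses):
        if best is None or best - loss > threshold:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return i
    return None

# Validation losses from the training log above.
logged = [1.176898, 1.124501, 1.091984, 1.077777,
          1.064261, 1.05903, 1.058452, 1.055144]
print(first_stop_index(logged))
```

Under this simplified rule the run never hits three stale evaluations in a row, which fits the fact that training ran to completion.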


In [14]:
# save the best model and tokenizer
trainer.save_model(f"{output_dir}/final")
tokenizer.save_pretrained(f"{output_dir}/final")

model_artifact = wandb.Artifact('model_artifact', type='model')
model_artifact.add_dir(f"{output_dir}/final")
run.log_artifact(model_artifact)


wandb: Adding directory to artifact (/notebooks/models/flant5-fullfinetuned-1703124915/final)... Done. 4.8s


<wandb.sdk.wandb_artifacts.Artifact at 0x7f26d0599490>

## 3. Compare inference from the original and fine-tuned models with a zero-shot prompt

In [18]:
## load the new model and tokenizer
finetuned_model = AutoModelForSeq2SeqLM.from_pretrained(f"{output_dir}/final")
tokenizer2 = AutoTokenizer.from_pretrained(f"{output_dir}/final")

original_model = AutoModelForSeq2SeqLM.from_pretrained('google/flan-t5-base')
tokenizer = AutoTokenizer.from_pretrained('google/flan-t5-base')

In [19]:
#let's get inference from original model
example_record = 200
dialogue = dataset['test'][example_record]['dialogue']

print(dialogue)

start_prompt = 'Summarize the following conversation.\n\n'
end_prompt = '\n\nSummary: '
prompt = start_prompt + dialogue + end_prompt


# named `inputs` to avoid shadowing Python's built-in input()
inputs = tokenizer(prompt, return_tensors='pt')
output_tokens = original_model.generate(inputs["input_ids"], max_new_tokens=50)
original_model_output = tokenizer.decode(output_tokens[0], skip_special_tokens=True)

print("Summary-->")
print(original_model_output)

#Person1#: Have you considered upgrading your system?
#Person2#: Yes, but I'm not sure what exactly I would need.
#Person1#: You could consider adding a painting program to your software. It would allow you to make up your own flyers and banners for advertising.
#Person2#: That would be a definite bonus.
#Person1#: You might also want to upgrade your hardware because it is pretty outdated now.
#Person2#: How can we do that?
#Person1#: You'd probably need a faster processor, to begin with. And you also need a more powerful hard disc, more memory and a faster modem. Do you have a CD-ROM drive?
#Person2#: No.
#Person1#: Then you might want to add a CD-ROM drive too, because most new software programs are coming out on Cds.
#Person2#: That sounds great. Thanks.
Summary-->
#Person1#: I'm thinking of upgrading my computer.


In [20]:
# let's get inference from the fine-tuned model

inputs = tokenizer2(prompt, return_tensors='pt')
output_tokens = finetuned_model.generate(inputs["input_ids"], max_new_tokens=50)
finetuned_model_output = tokenizer2.decode(output_tokens[0], skip_special_tokens=True)

print("#### Human Baseline Summary -->")
print(dataset['test'][example_record]['summary'])
print("#### Summary Generated by original model->")
print(original_model_output)
print("#### Summary Generated by finetuned model->")
print(finetuned_model_output)

#### Human Baseline Summary -->
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.
#### Summary Generated by original model->
#Person1#: I'm thinking of upgrading my computer.
#### Summary Generated by finetuned model->
#Person2# wants to upgrade #Person2#'s system and hardware. #Person1# suggests adding a painting program to #Person2#'s software and adding a CD-ROM drive.


### Evaluate the fine-tuned model with ROUGE and BLEU scores and compare it with the original model

In [21]:
from tqdm import tqdm

# to save time we will only use 150 items from test split for evaluation
dialogues = final_dataset['test']['dialogue'][:150]
print(len(dialogues))

# reference summaries (not dialogues) for the same 150 test items
human_baseline_summaries = final_dataset['test']['summary'][:150]
original_model_summaries = []
finetuned_model_summaries = []

# moving both models to gpu for faster inference
original_model.to("cuda:0")
finetuned_model.to("cuda:0")

for dialogue in tqdm(dialogues, desc="Generating summaries from original & finetuned models..."):
    prompt = f"""
    Summarize the following conversation.

    {dialogue}

    Summary: """
    
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda:0")

    original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)
    original_model_summaries.append(original_model_text_output)

    finetuned_model_outputs = finetuned_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    finetuned_model_text_output = tokenizer.decode(finetuned_model_outputs[0], skip_special_tokens=True)
    finetuned_model_summaries.append(finetuned_model_text_output)


150


Token indices sequence length is longer than the specified maximum sequence length for this model (524 > 512). Running this sequence through the model will result in indexing errors
Generating summaries from original & finetuned models...: 100%|██████████| 150/150 [02:31<00:00,  1.01s/it]
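
One detail worth checking in the evaluation loop is that the indented f-string produces a slightly different prompt from the one used in `tokenize_function` (extra leading newline and spaces). The sketch below shows a small helper that reuses the training format; `build_prompt` is a hypothetical name, and whether the whitespace difference changes the scores in practice is not something measured here.

```python
START_PROMPT = "Summarize the following conversation.\n\n"
END_PROMPT = "\n\nSummary: "

def build_prompt(dialogue: str) -> str:
    """Format a dialogue exactly as the training examples were formatted."""
    return START_PROMPT + dialogue + END_PROMPT

dialogue = "#Person1#: Hi.\n#Person2#: Hello."

# The indented f-string form used in the evaluation loop.
indented = f"""
    Summarize the following conversation.

    {dialogue}

    Summary: """

print(repr(build_prompt(dialogue)))
print(repr(indented))
print(build_prompt(dialogue) == indented)
```

For the over-length warning shown in the output, passing `truncation=True` to the tokenizer call is one way to keep inputs within the model's maximum length.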


### ROUGE Score

In [22]:
import evaluate
rouge = evaluate.load('rouge')
human_baseline_summaries = test_summaries[:150]

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

finetuned_model_results = rouge.compute(
    predictions=finetuned_model_summaries,
    references=human_baseline_summaries[0:len(finetuned_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('Finetuned MODEL:')
print(finetuned_model_results)

run.log({"rouge_score": finetuned_model_results})


ORIGINAL MODEL:
{'rouge1': 0.23894258937571508, 'rouge2': 0.08332541521688881, 'rougeL': 0.2055799206592445, 'rougeLsum': 0.20576620785444855}
Finetuned MODEL:
{'rouge1': 0.48914574569513863, 'rouge2': 0.2351370231982014, 'rougeL': 0.40066456430517255, 'rougeLsum': 0.3989172126204001}
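
For intuition about what `rouge1` measures, here is a minimal sketch of a ROUGE-1 F1 score based on unigram overlap. It assumes lowercase whitespace tokenization and no stemming, so it will not reproduce the numbers from `evaluate`, which relies on the `rouge_score` package with its own tokenizer and stemmer. The function name `rouge1_f1` is hypothetical.

```python
from collections import Counter

def rouge1_f1(prediction: str, reference: str) -> float:
    """F1 over unigram overlap between a predicted and a reference summary."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

print(rouge1_f1("the cat sat on the mat", "the cat lay on the mat"))
```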


### BLEU Score

In [23]:
bleu = evaluate.load("bleu")
    
original_model_results = bleu.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries
)

finetuned_model_results = bleu.compute(
    predictions=finetuned_model_summaries,
    references=human_baseline_summaries,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('Finetuned MODEL:')
print(finetuned_model_results)

run.log({"bleu_score": finetuned_model_results})

run.finish()


ORIGINAL MODEL:
{'bleu': 0.06830401333488964, 'precisions': [0.25926829268292684, 0.11721518987341772, 0.05973684210526316, 0.019452054794520546], 'brevity_penalty': 0.8860555704408019, 'length_ratio': 0.8920800696257616, 'translation_length': 4100, 'reference_length': 4596}
Finetuned MODEL:
{'bleu': 0.23373348653030568, 'precisions': [0.4885386819484241, 0.2929701877070298, 0.18981831945495836, 0.1098558628749513], 'brevity_penalty': 1.0, 'length_ratio': 1.2149695387293298, 'translation_length': 5584, 'reference_length': 4596}
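
The BLEU dictionaries above contain everything needed to recompute the final score: BLEU is the brevity penalty times the geometric mean of the four n-gram precisions. The sketch below redoes that arithmetic from the reported values, assuming the standard unsmoothed formula; the function name `bleu_from_components` is hypothetical.

```python
import math

def bleu_from_components(precisions, translation_length, reference_length):
    """Return (bleu, brevity_penalty) from n-gram precisions and corpus lengths."""
    if translation_length > reference_length:
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - reference_length / translation_length)
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))
    return brevity_penalty * geo_mean, brevity_penalty

original = bleu_from_components(
    [0.25926829268292684, 0.11721518987341772,
     0.05973684210526316, 0.019452054794520546],
    translation_length=4100, reference_length=4596,
)
finetuned = bleu_from_components(
    [0.4885386819484241, 0.2929701877070298,
     0.18981831945495836, 0.1098558628749513],
    translation_length=5584, reference_length=4596,
)
print(original)
print(finetuned)
```

The original model is penalized for producing summaries shorter than the references, while the fine-tuned model's outputs are longer than the references, so its brevity penalty is 1.0.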


| Metric | Value |
|--------|-------|
| Training time (seconds) | 2491.59128 |
| eval/loss | 1.05514 |
| eval/runtime | 23.0015 |
| eval/samples_per_second | 50.258 |
| eval/steps_per_second | 6.304 |
| train/epoch | 3.0 |
| train/global_step | 2169.0 |
| train/learning_rate | 0.0 |
| train/loss | 1.1161 |
| train/total_flos | 2.376381915935539e+16 |
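
The `train/global_step` value in the run summary can be sanity-checked with a little arithmetic. The sketch below assumes a single GPU and the Trainer's default `per_device_train_batch_size` of 8, since the `TrainingArguments` above do not set it (the 32/16 values in the wandb `config` dict are logged but not passed to the Trainer). The function `optimizer_steps` is a hypothetical simplification of how the Trainer counts steps.

```python
import math

def optimizer_steps(num_examples, per_device_batch_size, grad_accum_steps,
                    epochs, num_devices=1):
    """Approximate number of optimizer updates for a full training run."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * epochs

# 11568 training rows, gradient_accumulation_steps=2, 3 epochs.
print(optimizer_steps(11568, per_device_batch_size=8, grad_accum_steps=2, epochs=3))
```

Under those assumptions the result matches the 2169 steps and 3.0 epochs reported in the summary.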


### Conclusion
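Full fine-tuning produced noticeably better summaries than the base model without any few-shot prompting: on 150 test examples, ROUGE-1 rose from about 0.24 to 0.49 and BLEU from about 0.07 to 0.23. Because full fine-tuning is resource-intensive (this run took roughly 41 minutes of training), the next article explores a much more efficient technique called LoRA.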

**🌟 Connect on LinkedIn!**

If you've found this content _useful_ and would like to explore more about **data science**, **machine learning**, and related fields, I'd be delighted to see you on my LinkedIn network. I share insights, resources, and the latest trends that could be beneficial for your learning journey.

➤ [**_Follow on LinkedIn_**](https://www.linkedin.com/in/aambekar234/)

_Your support and interaction are always appreciated._

**Best Regards,**
**Abhijeet Ambekar**