# Fine Tuning Large Language Model - Model

In this workshop, you will learn how to fine-tune prompts and LLMs to improve their responses.

In [1]:
!pip install evaluate
!pip install "protobuf<5.0.0"


In [1]:
# Import libraries
import torch, time
import pandas as pd
import numpy as np
import evaluate
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, Trainer, GenerationConfig, TrainingArguments

from peft import PeftModel, LoraConfig, get_peft_model, TaskType

AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'


In [4]:
# TODO: Load and explore the following datasets

dataset_name = "knkarthick/dialogsum"
model_name = "google/flan-t5-small"

dataset = load_dataset(dataset_name)
print(dataset.shape)


{'train': (12460, 4), 'validation': (500, 4), 'test': (1500, 4)}


In [None]:
# TODO: Print a record
idx = 5000
for k, v in dataset['train'][idx].items():
  print(f'{k}: {v}')
  print()

id: train_5000

dialogue: #Person1#: do you like animals? I really like dogs.
#Person2#: so do i. I don't like cats.
#Person1#: why? I think cats are ok.
#Person2#: I can't bear being near cats. They don't seem to like me either.
#Person1#: I like wild animals. I don't like spiders and snakes. I think spiders and snakes are disgusting.
#Person2#: I'm fond of snakes. I think they're great. I agree with you about spiders though. I think spiders are horrible. I think it's because they have so many legs.
#Person1#: I think bears are wonderful. Pandas are fantastic. I low the people who kill them for their fur.
#Person2#: I agree. I'm carry about mice. I think they're so cute!
#Person1#: really? I don't see the attraction. I'm afraid of mice.

summary: #Person1# likes dogs, wild animals but doesn't like spiders and snakes. #Person2# doesn't like cats but likes snakes and mice.

topic: animals



## Fine tuning the LLM model

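In this workshop we will be tuning a FLAN-T5 model. The cells below load <code>google/flan-t5-small</code> (the <code>model_name</code> set above); the pre-trained LoRA checkpoint used later is applied to <code>google/flan-t5-base</code>.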

In [None]:
# Utility function to dump a model's tunable parameters

def print_trainable_model_params(model):
   trainable_model_params = 0
   all_model_params = 0
   for _, param in model.named_parameters():
      all_model_params += param.numel()
      if param.requires_grad:
         trainable_model_params += param.numel()
   return f"Trainable parameters: {trainable_model_params}\nTotal parameters: {all_model_params}\nPercentage of trainable parameters: {100 * trainable_model_params / all_model_params:.2f}%"

In [None]:
# TODO: Load model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

In [None]:
# TODO: Print number of trainable parameters
print(print_trainable_model_params(model))

Trainable parameters: 76961152
Total parameters: 76961152
Percentage of trainable parameters: 100.00%


### Preprocess the dialogue dataset

We will train the model to summarize dialogue by creating dialogue-summary pairs for the LLM to process. The dialogue is the training input and the summary is the label. This is supervised learning.

The prompt will be as follows:

```
Summarize the following dialogue.\n
\n
Fred: ...\n
Barney: ...\n
\n
Summary:\n
Summary of the conversation between Fred and Barney
```

The prompt and the summary will be tokenized for the LLM.
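
The template above can be sketched in plain Python before any tokenization happens. The helper names `build_prompt` and `build_training_pair` and the Fred/Barney text are made up for illustration; only the instruction and the `Summary:` cue come from this workshop.

```python
# Hypothetical helpers mirroring the prompt template above.
START_PROMPT = "Summarize the following dialogue.\n\n"
END_PROMPT = "\n\nSummary:"

def build_prompt(dialogue: str) -> str:
    """Wrap a raw dialogue in the instruction and the 'Summary:' cue."""
    return START_PROMPT + dialogue + END_PROMPT

def build_training_pair(dialogue: str, summary: str) -> tuple[str, str]:
    """Return (model input, label) as plain strings, before tokenization."""
    return build_prompt(dialogue), summary

# Made-up example data, not taken from the dataset.
example_dialogue = "Fred: Are you free on Friday?\nBarney: Yes, let's go bowling."
example_summary = "Fred and Barney agree to go bowling on Friday."

prompt, label = build_training_pair(example_dialogue, example_summary)
print(prompt)
print(label)
```

The model only ever sees the prompt as input; the summary is kept separate so it can be used as the label.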

In [None]:
# Utility function to prepare the data for training
# Tokenize function
def tokenize_fn(data):
   start_prompt = 'Summarize the following dialogue.\n\n'
   end_prompt = '\n\nSummary:'
   prompt = [ start_prompt + d + end_prompt for d in data['dialogue'] ]
   summary = data['summary']
   data['input_ids'] = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt").input_ids
   data['labels'] = tokenizer(summary, padding="max_length", truncation=True, return_tensors="pt").input_ids
   return data
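
As a rough picture of what `padding="max_length"` and `truncation=True` do to each sequence, the sketch below pads or cuts a list of token ids to a fixed length. It is a simplified stand-in, not the tokenizer's actual implementation: `pad_or_truncate` is a hypothetical helper, and the pad id 0 and end-of-sequence id 1 are assumed T5-style values.

```python
# Simplified stand-in for tokenizer(..., padding="max_length", truncation=True).
# PAD_ID and EOS_ID are assumed values (T5-style: pad=0, end-of-sequence=1).
PAD_ID = 0
EOS_ID = 1

def pad_or_truncate(token_ids: list[int], max_length: int) -> list[int]:
    """Return exactly max_length ids: truncated if too long, right-padded if too short."""
    # Keep room for the end-of-sequence marker, then cut anything that does not fit.
    body = token_ids[: max_length - 1] + [EOS_ID]
    # Right-pad so every row in a batch has the same length.
    return body + [PAD_ID] * (max_length - len(body))

short = pad_or_truncate([12198, 1635, 1737], max_length=6)
long = pad_or_truncate([12198, 1635, 1737, 8, 826, 7478, 5], max_length=6)
print(short)  # [12198, 1635, 1737, 1, 0, 0]
print(long)   # [12198, 1635, 1737, 8, 826, 1]
```

Because the labels are padded the same way, padded positions also contribute to the loss. A common refinement, not used in this workshop, is to replace pad ids in the labels with -100 so the loss ignores them.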


In [None]:
# TODO: prepare the data for training with the tokenize_fn function
tokenized_dataset = dataset.map(tokenize_fn, batched=True)




In [None]:
# TODO: Verify prepared data
for k, v in tokenized_dataset['train'][idx].items():
  print(f'{k}:\n{v}\n')


id:
train_5000

dialogue:
#Person1#: do you like animals? I really like dogs.
#Person2#: so do i. I don't like cats.
#Person1#: why? I think cats are ok.
#Person2#: I can't bear being near cats. They don't seem to like me either.
#Person1#: I like wild animals. I don't like spiders and snakes. I think spiders and snakes are disgusting.
#Person2#: I'm fond of snakes. I think they're great. I agree with you about spiders though. I think spiders are horrible. I think it's because they have so many legs.
#Person1#: I think bears are wonderful. Pandas are fantastic. I low the people who kill them for their fur.
#Person2#: I agree. I'm carry about mice. I think they're so cute!
#Person1#: really? I don't see the attraction. I'm afraid of mice.

summary:
#Person1# likes dogs, wild animals but doesn't like spiders and snakes. #Person2# doesn't like cats but likes snakes and mice.

topic:
animals

input_ids:
[12198, 1635, 1737, 8, 826, 7478, 5, 1713, 345, 13515, 536, 4663, 10, 103, 25, 114, 312

In [None]:
text = tokenizer.decode(tokenized_dataset['train'][idx]['input_ids'])
print(text)

print('----------------')
text = tokenizer.decode(tokenized_dataset['train'][idx]['labels'])
print(text)

Summarize the following dialogue. #Person1#: do you like animals? I really like dogs. #Person2#: so do i. I don't like cats. #Person1#: why? I think cats are ok. #Person2#: I can't bear being near cats. They don't seem to like me either. #Person1#: I like wild animals. I don't like spiders and snakes. I think spiders and snakes are disgusting. #Person2#: I'm fond of snakes. I think they're great. I agree with you about spiders though. I think spiders are horrible. I think it's because they have so many legs. #Person1#: I think bears are wonderful. Pandas are fantastic. I low the people who kill them for their fur. #Person2#: I agree. I'm carry about mice. I think they're so cute! #Person1#: really? I don't see the attraction. I'm afraid of mice. Summary:</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><p

In [None]:
# TODO: Remove id, dialogue, summary and topic columns from dataset. We only want input_ids and labels
tokenized_dataset = tokenized_dataset.remove_columns([
    'id', 'dialogue', 'summary', 'topic'
])


In [None]:
# TODO: Verify dataset again
for k, v in tokenized_dataset['train'][idx].items():
  print(f'{k}:\n{v}\n')


input_ids:
[12198, 1635, 1737, 8, 826, 7478, 5, 1713, 345, 13515, 536, 4663, 10, 103, 25, 114, 3127, 58, 27, 310, 114, 3887, 5, 1713, 345, 13515, 357, 4663, 10, 78, 103, 3, 23, 5, 27, 278, 31, 17, 114, 10003, 5, 1713, 345, 13515, 536, 4663, 10, 572, 58, 27, 317, 10003, 33, 3, 1825, 5, 1713, 345, 13515, 357, 4663, 10, 27, 54, 31, 17, 4595, 271, 1084, 10003, 5, 328, 278, 31, 17, 1727, 12, 114, 140, 893, 5, 1713, 345, 13515, 536, 4663, 10, 27, 114, 3645, 3127, 5, 27, 278, 31, 17, 114, 18612, 7, 11, 17599, 7, 5, 27, 317, 18612, 7, 11, 17599, 7, 33, 27635, 53, 5, 1713, 345, 13515, 357, 4663, 10, 27, 31, 51, 3036, 13, 17599, 7, 5, 27, 317, 79, 31, 60, 248, 5, 27, 2065, 28, 25, 81, 18612, 7, 713, 5, 27, 317, 18612, 7, 33, 17425, 5, 27, 317, 34, 31, 7, 250, 79, 43, 78, 186, 6217, 5, 1713, 345, 13515, 536, 4663, 10, 27, 317, 4595, 7, 33, 1627, 5, 28248, 7, 33, 2723, 5, 27, 731, 8, 151, 113, 5781, 135, 21, 70, 4223, 5, 1713, 345, 13515, 357, 4663, 10, 27, 2065, 5, 27, 31, 51, 2331, 81, 13214, 5,

### Tune model with pre-processed dataset

We will use [<code>Trainer</code>](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) to train the original model. The training results are written to an output directory. The trainer is configured with [<code>TrainingArguments</code>](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments).

In [None]:
# CUDA information

print('CUDA available: ', torch.cuda.is_available())
if torch.cuda.is_available():
   print('BF16 supported: ', torch.cuda.is_bf16_supported())
   torch.cuda.set_device(0)
   print('Current device: ', torch.cuda.current_device())
   print('CUDA device name: ', torch.cuda.get_device_name(0))

CUDA available:  False


## Fine tuning the LLM Model with Low-Rank Adaptation (LoRA) / Parameter Efficient Fine Tuning (PEFT)

We will add a LoRA adapter to the LLM (flan-t5-small in the cells below) and train only the adapter. The original LLM weights are frozen. The adapter can be combined with the original LLM at inference time.
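
To make the idea concrete, here is a toy sketch of a LoRA update on a single linear layer, written with plain Python lists. All sizes and values are made up, and the helpers `matmul` and `add_scaled` are illustrative; in the workshop the adapters are created and managed by the `peft` library. The sketch assumes the usual LoRA scaling of `alpha / r`.

```python
# Toy illustration of a LoRA update on one linear layer (made-up numbers).

def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m)."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def add_scaled(w, delta, scale):
    """Return w + scale * delta, element-wise."""
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]

r, alpha = 1, 2                 # rank and scaling factor (toy values)
W = [[1.0, 0.0, 2.0],           # frozen weight, shape (out=2, in=3)
     [0.0, 1.0, 0.0]]
A = [[0.5, 0.0, 1.0]]           # trainable, shape (r, in)
B = [[1.0],                     # trainable, shape (out, r)
     [2.0]]
x = [[1.0], [2.0], [3.0]]       # input column vector, shape (in, 1)

scale = alpha / r
# During training: frozen path plus the low-rank adapter path.
adapter_out = add_scaled(matmul(W, x), matmul(B, matmul(A, x)), scale)
# For inference: fold the adapter into the weight once, then use a single matmul.
W_merged = add_scaled(W, matmul(B, A), scale)
merged_out = matmul(W_merged, x)
print(adapter_out, merged_out)
```

Both paths give the same output, which is why a trained adapter can either be kept separate or merged into the base weights.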

In [None]:
print(model)

T5ForConditionalGeneration(
  (shared): Embedding(32128, 512)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 512)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=512, out_features=384, bias=False)
              (k): Linear(in_features=512, out_features=384, bias=False)
              (v): Linear(in_features=512, out_features=384, bias=False)
              (o): Linear(in_features=384, out_features=512, bias=False)
              (relative_attention_bias): Embedding(32, 6)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseGatedActDense(
              (wi_0): Linear(in_features=512, out_features=1024, bias=False)
              (wi_1): Linear(in_features=512, out_features=1024, bias=False)
              (wo): 

In [None]:
# TODO: Configure LoRA
lora_config = LoraConfig(
    r=8,
    lora_alpha=8,
    target_modules=['q', 'v'],
    lora_dropout = 0.05,
    task_type = TaskType.SEQ_2_SEQ_LM
)

In [None]:
# TODO: Add LoRA to the LLM model to be trained
# load the base model
base_model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# adapt the base_model
lora_model = get_peft_model(base_model, lora_config)

In [None]:
# TODO: Print number of parameters, compare LoRA to the original model
print(print_trainable_model_params(lora_model))
print()
print(print_trainable_model_params(model))

Trainable parameters: 344064
Total parameters: 77305216
Percentage of trainable parameters: 0.45%


Trainable parameters: 76961152
Total parameters: 76961152
Percentage of trainable parameters: 100.00%
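
The LoRA parameter count can be checked by hand. The sketch below is a back-of-the-envelope calculation: the 512 and 384 projection sizes come from the printed model, while the block counts (8 encoder and 8 decoder blocks for flan-t5-small) are an assumption, since the printed model is truncated.

```python
# Back-of-the-envelope check of the LoRA parameter count reported above.
r = 8                                   # LoRA rank from lora_config
in_features, out_features = 512, 384    # q and v projections in the printed model
encoder_blocks, decoder_blocks = 8, 8   # assumed sizes for flan-t5-small

# Each encoder block has one attention module; each decoder block has two
# (self-attention and cross-attention). LoRA targets q and v in each module.
attention_modules = encoder_blocks + 2 * decoder_blocks
adapted_layers = attention_modules * 2

# One adapter adds matrix A (r x in) and matrix B (out x r).
params_per_adapter = r * in_features + out_features * r
lora_params = adapted_layers * params_per_adapter

base_params = 76_961_152                # total reported for the original model
total_params = base_params + lora_params
print(lora_params, total_params, f"{100 * lora_params / total_params:.2f}%")
```

Under these assumptions the arithmetic reproduces the 344064 trainable parameters and the 77305216 total shown in the output.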


In [None]:
# TODO: Train model with LoRA
output_dir = f'lora-{int(time.time())}'

lora_train_args = TrainingArguments(
    output_dir=output_dir,  # checkpoints and the trained model are written here
    learning_rate=1e-3,
    num_train_epochs=1,
    max_steps=1,  # a positive max_steps overrides num_train_epochs
    auto_find_batch_size=True
)

AttributeError: module 'wandb.sdk' has no attribute 'lib'
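
The training arguments set both `num_train_epochs` and `max_steps`, and a positive `max_steps` takes precedence. The sketch below shows the arithmetic under stated assumptions: a single device, no gradient accumulation, and a batch size of 8 (the `Trainer` default per device, which `auto_find_batch_size` may lower if memory runs out).

```python
import math

# Sketch of how max_steps relates to a full epoch (assumed batch size).
train_examples = 12460          # size of the train split printed earlier
batch_size = 8                  # assumed; not set explicitly in this notebook
num_train_epochs = 1
max_steps = 1

steps_per_epoch = math.ceil(train_examples / batch_size)
planned_steps = steps_per_epoch * num_train_epochs
# A positive max_steps takes precedence over num_train_epochs.
effective_steps = max_steps if max_steps > 0 else planned_steps
print(steps_per_epoch, effective_steps)
```

With `max_steps=1` the trainer would perform a single optimizer step, which is enough to smoke-test the setup but not to train a useful adapter.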

In [None]:
# TODO: Create trainer and train model
lora_trainer = Trainer(
    model = lora_model,
    args = lora_train_args,
    train_dataset = tokenized_dataset['train'],
    eval_dataset = tokenized_dataset['validation']
)

# Left commented out: will not run here
# lora_trainer.train()

# Save the trained LoRA adapter
# lora_model_name = "my-lora-model"
# lora_trainer.model.save_pretrained(lora_model_name)
# IMPORTANT: Save the tokenizer
# tokenizer.save_pretrained(lora_model_name)


### Use a trained LoRA model

Training takes a few hours over many iterations.

For the purpose of this workshop we use a saved model, [intotheverse/peft-dialogue-summary-checkpoint](https://huggingface.co/intotheverse/peft-dialogue-summary-checkpoint).

In [2]:
# TODO: Load the original model and add the pre-trained LoRA adaptation to the model
peft_dialogue_summary_checkpoint = 'intotheverse/peft-dialogue-summary-checkpoint'

# Load the base model
model_name = "google/flan-t5-base"
lora_base_model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16)

# Load the trained LoRA adapter on top of the base model
lora_model = PeftModel.from_pretrained(
    lora_base_model,
    peft_dialogue_summary_checkpoint,
    torch_dtype=torch.bfloat16,
    is_trainable=False)

# Load the original model
original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)


`torch_dtype` is deprecated! Use `dtype` instead!



## Evaluate LoRA model

In [None]:
# TODO: Evaluate LoRA model against the original



In [5]:
# Prepare data for evaluation
dialogues = []
summaries = []
orig_model_summaries = []
lora_model_summaries = []
config = GenerationConfig(max_new_tokens=200)

for i in range(5):
   print(f'i = {i}')
   d = dataset['test'][i]['dialogue']
   s = dataset['test'][i]['summary']
   prompt = f"Summarize the following conversation.\n\n{d}\n\nSummary:"
   tokenized_prompt = tokenizer(prompt, return_tensors='pt').input_ids
   orig_resp = original_model.generate(input_ids=tokenized_prompt, generation_config=config)
   orig_resp_text = tokenizer.decode(orig_resp[0], skip_special_tokens=True)
   lora_resp = lora_model.generate(input_ids=tokenized_prompt, generation_config=config)
   lora_resp_text = tokenizer.decode(lora_resp[0], skip_special_tokens=True)

   summaries.append(s)
   orig_model_summaries.append(orig_resp_text)
   lora_model_summaries.append(lora_resp_text)

zipped_summaries = list(zip(summaries, orig_model_summaries, lora_model_summaries))
df = pd.DataFrame(zipped_summaries, columns=['label', 'original_model_summary', 'lora_model_summary'])
df

i = 0
i = 1
i = 2
i = 3
i = 4


|   | label | original_model_summary | lora_model_summary |
|---|-------|------------------------|--------------------|
| 0 | Ms. Dawson helps #Person1# to write a memo to ... | #Person1#: I need to take a dictation for you. | #Person1# asks Ms. Dawson to take a dictation ... |
| 1 | In order to prevent employees from wasting tim... | #Person1#: I need to take a dictation for you. | #Person1# asks Ms. Dawson to take a dictation ... |
| 2 | Ms. Dawson takes a dictation for #Person1# abo... | #Person1#: I need to take a dictation for you. | #Person1# asks Ms. Dawson to take a dictation ... |
| 3 | #Person2# arrives late because of traffic jam.... | The traffic jam at the Carrefour intersection ... | #Person2# got stuck in traffic and #Person1# s... |
| 4 | #Person2# decides to follow #Person1#'s sugges... | The traffic jam at the Carrefour intersection ... | #Person2# got stuck in traffic and #Person1# s... |


### Evaluate models with ROUGE/BLEU metrics

Recall-Oriented Understudy for Gisting Evaluation ([ROUGE](https://pub.aimind.so/unveiling-the-power-of-rouge-metrics-in-nlp-b6d3f96d3363)) is a set of metrics used to evaluate the quality of machine-generated text, such as summaries and translations. ROUGE metrics compare the generated text to a human-written reference and measure the overlap between the two.

The metrics range between 0 and 1, with higher scores indicating higher similarity between the reference and the generated text.
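
To show what "overlap" means, here is a minimal sketch of a ROUGE-1 style score based on unigram counts. It is an illustration only: `rouge1` is a hypothetical helper that lowercases and splits on whitespace, whereas the `rouge_score` package used below applies its own tokenization and optional stemming, so its numbers will differ.

```python
from collections import Counter

def rouge1(reference: str, prediction: str) -> dict[str, float]:
    """Unigram-overlap precision, recall and F1 on lowercased whitespace tokens."""
    ref_counts = Counter(reference.lower().split())
    pred_counts = Counter(prediction.lower().split())
    # Clipped overlap: a word counts at most as often as it appears in both texts.
    overlap = sum((ref_counts & pred_counts).values())
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

# Made-up sentences: five of the six words match.
scores = rouge1("the cat sat on the mat", "the cat lay on the mat")
print(scores)
```

ROUGE-2 follows the same pattern with pairs of adjacent words, and ROUGE-L uses the longest common subsequence instead of n-gram counts.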

In [7]:
!pip install rouge_score

Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: rouge_score
  Building wheel for rouge_score (setup.py) ... done
  Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24934 sha256=310fa292d0be3ddd6cd008ee7d2cf2a098c0338b89ddb8cfcede325710fb5e21
  Stored in directory: /root/.cache/pip/wheels/85/9d/af/01feefbe7d55ef5468796f0c68225b6788e85d9d0a281e7a70
Successfully built rouge_score
Installing collected packages: rouge_score
Successfully installed rouge_score-0.1.2


In [8]:
# TODO: create ROUGE metrics
rouge = evaluate.load('rouge')


In [9]:
# TODO: Evaluate the original model's result
orig_model_results = rouge.compute(
    references = summaries, # actual summary
    predictions = orig_model_summaries,
    use_aggregator=True,
    use_stemmer=True
)

# LoRA model
lora_model_results = rouge.compute(
    references = summaries, # actual summary
    predictions = lora_model_summaries,
    use_aggregator=True,
    use_stemmer=True
)

In [10]:
print(orig_model_results)

print('-----------')

print(lora_model_results)

{'rouge1': np.float64(0.16126984126984129), 'rouge2': np.float64(0.04741532976827094), 'rougeL': np.float64(0.13968253968253969), 'rougeLsum': np.float64(0.13873015873015873)}
-----------
{'rouge1': np.float64(0.3408045927322088), 'rouge2': np.float64(0.1024924201890494), 'rougeL': np.float64(0.2728511771470072), 'rougeLsum': np.float64(0.27285117714700724)}
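
The two result dictionaries are easier to compare side by side. The sketch below copies the scores from the output above and prints the absolute gain per metric; the variable names are illustrative. Since only five test examples were scored, the numbers are indicative rather than a reliable benchmark.

```python
# Scores copied from the output above (five test examples only).
orig_scores = {
    "rouge1": 0.16126984126984129,
    "rouge2": 0.04741532976827094,
    "rougeL": 0.13968253968253969,
    "rougeLsum": 0.13873015873015873,
}
lora_scores = {
    "rouge1": 0.3408045927322088,
    "rouge2": 0.1024924201890494,
    "rougeL": 0.2728511771470072,
    "rougeLsum": 0.27285117714700724,
}

# Absolute gain of the LoRA model over the original model, per metric.
gains = {name: lora_scores[name] - orig_scores[name] for name in orig_scores}
for name, gain in gains.items():
    print(f"{name:<10} original={orig_scores[name]:.4f}  "
          f"lora={lora_scores[name]:.4f}  gain={gain:+.4f}")
```

On this small sample the LoRA model scores higher than the original model on every ROUGE variant.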


In [None]:
# TODO: Evaluate with BLEU metrics
