<a href="https://colab.research.google.com/github/apa017/hugging-face-learn/blob/main/03_HF_FineTuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Fine Tuning

In this notebook we fine-tune a BART model for dialogue summarization.

We perform **full-parameter fine-tuning**, which updates every weight of the model. It is the most computationally expensive approach and does not always give the best results.

<i>In a separate notebook we will perform **LoRA (Low-Rank Adaptation) fine-tuning**, which is a form of **parameter-efficient fine-tuning**.</i>

That notebook uses Hugging Face's `peft` library, which works alongside `transformers`.
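To give a rough sense of why LoRA counts as parameter-efficient, the sketch below compares the number of trainable weights for a single square projection matrix when it is updated in full versus through a low-rank update. The hidden size 1024 is that of BART-large; the rank of 8 and the helper names `full_params` and `lora_params` are assumptions chosen for illustration, not values taken from this notebook.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Trainable weights when the whole d_out x d_in matrix is updated."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable weights for a low-rank update B @ A,
    with B of shape (d_out, rank) and A of shape (rank, d_in)."""
    return rank * (d_in + d_out)


d_model = 1024  # hidden size of BART-large
rank = 8        # hypothetical LoRA rank, for illustration only

full = full_params(d_model, d_model)
lora = lora_params(d_model, d_model, rank)
print(f"full: {full:,}  LoRA: {lora:,}  ratio: {lora / full:.2%}")
```

This covers one matrix only; the saving for a whole model depends on which layers receive a low-rank update.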

<br>

### WARNING

Online tools like [Google Colab](https://colab.research.google.com/) provide access to a GPU instead of running on CPU only.

Fine-tuning locally on a CPU takes a long time and is computationally intensive.

For this reason, it is recommended to run this notebook in the cloud or on a machine with a GPU.

<hr>

## Notebook Settings

Install the following packages on this runtime.

In [1]:
!pip install "transformers[torch]" datasets evaluate



In [2]:
!pip install py7zr



In [3]:
import torch
print(torch.__version__)

2.4.1+cu121


## Load Model and Tokenizer

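We load the `facebook/bart-large-cnn` checkpoint of [BART (400M parameters)](https://github.com/facebookresearch/fairseq/tree/main/examples/bart) and its tokenizer. This checkpoint has already been fine-tuned for news summarization on CNN/DailyMail.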

In [4]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import time

In [5]:
start_time = time.time()
print('Loading Tokenizer...')

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")

print('Tokenizer was loaded.\n')
print('Loading model... ')
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")

print('Model was loaded. \n')

end_time = time.time()
exec_time = end_time - start_time

print(f'Execution Time (loading): {exec_time:.2f} seconds')


Loading Tokenizer...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Tokenizer was loaded.

Loading model... 
Model was loaded. 

Execution Time (loading): 10.27 seconds


## Load Dataset

In [6]:
from datasets import load_dataset

In [7]:
dataset = load_dataset("samsum")
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 14732
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 819
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 818
    })
})

<br>

The downloaded dataset (SAMSum) is an example of a fine-tuning set for the summarization task: each record pairs a dialogue with a reference summary.

In [8]:
# Data Sample (summarization task)
dataset['train'][0]

{'id': '13818513',
 'dialogue': "Amanda: I baked  cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow :-)",
 'summary': 'Amanda baked cookies and will bring Jerry some tomorrow.'}

In [9]:
# In-depth exploration
xsample = dataset['train'][0]['dialogue']
xlabel = dataset['train'][0]['summary']

print(f'Sample: {xsample}')
print()
print(f'Label: {xlabel}')

Sample: Amanda: I baked  cookies. Do you want some?
Jerry: Sure!
Amanda: I'll bring you tomorrow :-)

Label: Amanda baked cookies and will bring Jerry some tomorrow.


<br>

We will use test data from the dataset and a prompt to generate a new summary.

In [10]:
# Select Test Data
sample = dataset['test'][0]['dialogue']
label = dataset['test'][0]['summary']

print(f'Sample: {sample}')
print()
print(f'Label: {label}')

Sample: Hannah: Hey, do you have Betty's number?
Amanda: Lemme check
Hannah: <file_gif>
Amanda: Sorry, can't find it.
Amanda: Ask Larry
Amanda: He called her last time we were at the park together
Hannah: I don't know him well
Hannah: <file_gif>
Amanda: Don't be shy, he's very nice
Hannah: If you say so..
Hannah: I'd rather you texted him
Amanda: Just text him 🙂
Hannah: Urgh.. Alright
Hannah: Bye
Amanda: Bye bye

Label: Hannah needs Betty's number but Amanda doesn't have it. She needs to contact Larry.


In [11]:
# helper function
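def generate_summary(text, llm):
  """Generate a summary of `text` with `llm` (uses the global `tokenizer`)."""

  ## Prompt Template
  prompt = f"""
  Summarize the following conversation.

  {text}

  Summary:
  """

  ## Tokenize the prompt and move the tensors to the model's device
  inputs = tokenizer(prompt, return_tensors="pt").to(llm.device)

  ## Generate tokenized output
  tokenized_output = llm.generate(
      inputs['input_ids'],
      attention_mask=inputs['attention_mask'],
      min_length=30,
      max_length=200,
      early_stopping=True
      )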

  ## Decode output
  output = tokenizer.decode(
      tokenized_output[0],
      skip_special_tokens=True
      )

  ## Return the decoded output
  return output
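One detail of the helper above: the triple-quoted f-string sits inside an indented function body, so every prompt line carries that indentation into the text the model sees. Whether this affects summary quality is not measured in this notebook. The sketch below shows one way to build the same prompt without the stray whitespace; `build_prompt` is a hypothetical helper, not part of the notebook.

```python
import textwrap


def build_prompt(dialogue: str) -> str:
    """Build the summarization prompt without source-code indentation."""
    template = textwrap.dedent("""\
        Summarize the following conversation.

        {dialogue}

        Summary:
        """)
    # The dialogue is inserted after dedenting, so its own line breaks
    # and spacing are preserved exactly.
    return template.format(dialogue=dialogue)


print(build_prompt("Hannah: Bye\nAmanda: Bye bye"))
```

If a prompt like this were used at inference time, the prompt built in the training helper would need to match it.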


In [12]:
# Try generating an output
test_output = generate_summary(sample, llm=model)

In [13]:
# Print results
print("SAMPLE:")
print(sample)
print("-"*50)
print('MODEL GENERATED SUMMARY:')
print(test_output)
print('-'*50)
print('ORIGINAL SUMMARY:')
print(label)

SAMPLE:
Hannah: Hey, do you have Betty's number?
Amanda: Lemme check
Hannah: <file_gif>
Amanda: Sorry, can't find it.
Amanda: Ask Larry
Amanda: He called her last time we were at the park together
Hannah: I don't know him well
Hannah: <file_gif>
Amanda: Don't be shy, he's very nice
Hannah: If you say so..
Hannah: I'd rather you texted him
Amanda: Just text him 🙂
Hannah: Urgh.. Alright
Hannah: Bye
Amanda: Bye bye
--------------------------------------------------
MODEL GENERATED SUMMARY:
phthalmologist.com is an online dating site. The site allows users to send text messages to friends and family members. In this example, a friend asks for Betty's number. The friend asks her to text Larry.
--------------------------------------------------
ORIGINAL SUMMARY:
Hannah needs Betty's number but Amanda doesn't have it. She needs to contact Larry.


## Prepare dataset for training

Next, we convert the dataset into the tokenized format that the model expects.

In [14]:
# helper function
def tokenize_inputs(example):

  ## Create a structure for the prompt
  start_prompt = "Summarize the following conversation:\n\n"
  end_prompt = "\n\nSummary: "

  ## Construct the prompt
  prompt = [start_prompt + dialogue + end_prompt for dialogue in example["dialogue"]]

  ## Tokenize the prompt
  example['input_ids'] = tokenizer(prompt,
                                   padding="max_length",
                                   truncation=True,
                                   return_tensors="pt",
                                   max_length=1024).input_ids

  ## Tokenize the label
  example['labels'] = tokenizer(example["summary"],
                                padding="max_length",
                                truncation=True,
                                return_tensors="pt",
                                max_length=1024).input_ids

  ## return tokenized example
  return example



In [15]:
# Use the end-of-sequence (eos) token as the padding token.
# Note: BART's tokenizer already defines its own `<pad>` token, so this line overrides that default.
tokenizer.pad_token = tokenizer.eos_token

# Apply the `tokenize_inputs` function to each dataset example.
# The `batched=True` argument ensures that the function is applied to batches of examples at a time (faster for large datasets).
tokenized_datasets = dataset.map(tokenize_inputs, batched=True)

# Remove unnecessary columns from the dataset, keeping only the tokenized data.
tokenized_datasets = tokenized_datasets.remove_columns(['id', 'dialogue', 'summary'])

# Filter the dataset to keep only every 100th example.
# The `with_indices=True` allows the lambda function to access both the example and its index.
# This results in a much smaller subset of the original dataset.
tokenized_datasets = tokenized_datasets.filter(lambda example, index: index % 100 == 0, with_indices=True)


In [16]:
# Check for output: shape
print(tokenized_datasets['train'].shape)
print(tokenized_datasets['validation'].shape)
print(tokenized_datasets['test'].shape)

(148, 2)
(9, 2)
(9, 2)
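The row counts above follow directly from the filter: keeping indices 0, 100, 200, ... out of `n` rows leaves `ceil(n / 100)` rows. A quick check with the split sizes printed earlier (the helper name `kept_rows` is hypothetical):

```python
import math


def kept_rows(num_rows: int, step: int = 100) -> int:
    """Number of indices i in range(num_rows) with i % step == 0."""
    return math.ceil(num_rows / step)


# Split sizes as reported by the DatasetDict output earlier in the notebook
splits = {"train": 14732, "test": 819, "validation": 818}
for name, n in splits.items():
    print(name, kept_rows(n))
```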


In [17]:
# Check for output: keys (i.e. columns)
tokenized_datasets['train'][0].keys()

dict_keys(['input_ids', 'labels'])
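A caveat about the labels produced above: they are padded to 1024 tokens, and the padded positions keep a real token id. PyTorch's cross-entropy loss skips only positions whose label equals -100, so padded positions likely count toward the loss here, which can make the reported loss look lower than it is on the actual summary tokens. A common remedy is to replace padded label positions with -100. The sketch below shows the idea on plain lists; `mask_padding` and `IGNORE_INDEX` are hypothetical names, and the token ids are made up for illustration.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def mask_padding(label_ids: list[int], attention_mask: list[int]) -> list[int]:
    """Replace padded label positions (attention_mask == 0) with IGNORE_INDEX.

    Using the attention mask instead of comparing against the pad token id
    keeps the real end-of-sequence token, even when pad and eos share an id.
    """
    return [
        token if keep == 1 else IGNORE_INDEX
        for token, keep in zip(label_ids, attention_mask)
    ]


# Illustrative ids only: 0 = bos, 2 = eos (also used as padding here)
labels = [0, 100, 200, 2, 2, 2]
mask = [1, 1, 1, 1, 0, 0]
print(mask_padding(labels, mask))
```

In practice the same effect can be obtained by tokenizing without padding and letting `DataCollatorForSeq2Seq` from `transformers` pad each batch, since it pads labels with -100 by default.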

## Train the Model

In this section we define the training arguments for fine-tuning our model.

We first log in to the Hugging Face Hub so that the fine-tuned model can later be pushed to the repository named by `hub_model_id`. During training, checkpoints are written locally to `output_dir`.

In [23]:
from huggingface_hub import notebook_login

notebook_login()


In [24]:
from transformers import TrainingArguments, Trainer

In [25]:
# Provide training arguments (including HF identifier on hub)

training_args = TrainingArguments(
    output_dir="./bart-cnn-samsum-finetuned", # output is in local directory
    hub_model_id = "Kain17/bart-cnn-samsum-finetuned", # identifier on the Hub
    learning_rate=1e-5,
    num_train_epochs=1,
    weight_decay=0.01,
    auto_find_batch_size=True,
    eval_strategy='epoch',
    logging_steps=10
)

### Explaining the arguments

1. **`output_dir="./bart-cnn-samsum-finetuned"`**:
   - Specifies the local directory where the trained model and checkpoints will be saved after training.

2. **`hub_model_id="Kain17/bart-cnn-samsum-finetuned"`**:
   - Sets the identifier for the model on the Hugging Face Hub, allowing you to push your trained model to the Hub for sharing or reuse.

3. **`learning_rate=1e-5`**:
   - Defines the initial learning rate for the optimizer; a smaller learning rate generally allows for more gradual convergence.

4. **`num_train_epochs=1`**:
   - Indicates the total number of times the entire training dataset will be passed through the model during training.

5. **`weight_decay=0.01`**:
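   - Sets the weight-decay coefficient, a regularizer that shrinks the weights toward zero at each update to help prevent overfitting. With the `Trainer`'s default AdamW optimizer this is decoupled weight decay, which is closely related to, but not identical to, L2 regularization.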

6. **`auto_find_batch_size=True`**:
   - Starts from the configured batch size and, if a CUDA out-of-memory error occurs, automatically retries with a smaller one until training fits in memory. It finds a batch size that works, not necessarily the optimal one.

7. **`eval_strategy='epoch'`**:
   - Sets the evaluation strategy to run validation at the end of each training epoch, allowing you to monitor the model's performance over time.

8. **`logging_steps=10`**:
   - Determines how often (in terms of training steps) to log training metrics, such as loss, which helps track progress during training.
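To make the behaviour of `auto_find_batch_size` more concrete, here is a minimal pure-Python sketch of a retry loop that shrinks the batch size after an out-of-memory error. It is an illustration of the idea, not the library's implementation: the halving rule, the `OutOfMemory` class, and the `fake_train_step` function are all assumptions made for this example, and the actual shrink factor may differ between library versions.

```python
class OutOfMemory(RuntimeError):
    """Hypothetical stand-in for a CUDA out-of-memory error."""


def find_batch_size(train_step, starting_batch_size: int = 8) -> int:
    """Retry `train_step` with smaller batch sizes until one succeeds."""
    batch_size = starting_batch_size
    while batch_size > 0:
        try:
            train_step(batch_size)
            return batch_size
        except OutOfMemory:
            batch_size //= 2  # assumed shrink rule: halve and retry
    raise RuntimeError("No batch size fits in memory.")


def fake_train_step(batch_size: int, capacity: int = 1) -> None:
    """Pretend the device can only hold `capacity` examples at once."""
    if batch_size > capacity:
        raise OutOfMemory(f"batch size {batch_size} does not fit")


print(find_batch_size(fake_train_step))
```

The starting value of 8 mirrors the default `per_device_train_batch_size` in `TrainingArguments`.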


In [26]:
# Create the trainer

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args, # our training arguments
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation']
)


In [27]:
# Start the training!
trainer.train()

| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 0.0876 | 0.105933 |


Non-default generation parameters: {'max_length': 142, 'min_length': 56, 'early_stopping': True, 'num_beams': 4, 'length_penalty': 2.0, 'no_repeat_ngram_size': 3, 'forced_bos_token_id': 0, 'forced_eos_token_id': 2}


TrainOutput(global_step=148, training_loss=0.08669105494344556, metrics={'train_runtime': 246.5033, 'train_samples_per_second': 0.6, 'train_steps_per_second': 0.6, 'total_flos': 325065690316800.0, 'train_loss': 0.08669105494344556, 'epoch': 1.0})
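The `TrainOutput` line above reports `global_step=148` for a training split of 148 examples and one epoch. Assuming a single device and no gradient accumulation, the number of optimizer steps per epoch is `ceil(num_examples / batch_size)`, so this step count is consistent with an effective batch size of 1. The sketch below works through that arithmetic for a few candidate batch sizes (`steps_per_epoch` is a hypothetical helper).

```python
import math


def steps_per_epoch(num_examples: int, batch_size: int) -> int:
    """Optimizer steps for one pass over the data (single device,
    no gradient accumulation assumed)."""
    return math.ceil(num_examples / batch_size)


num_examples = 148  # size of the filtered training split
for batch_size in (8, 4, 2, 1):
    print(batch_size, steps_per_epoch(num_examples, batch_size))
```

Only a batch size of 1 yields 148 steps, which also matches the identical samples-per-second and steps-per-second figures in the output.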

### Explanation

- **Epoch**:
  - Refers to one complete pass of the entire training dataset through the model. In this case, the current epoch is 1, meaning the model has just completed its first pass over the training data.

- **Training Loss (0.0876)**:
  - Represents the discrepancy between the model's predictions and the actual labels on the training dataset. A lower value indicates that the model fits the training data better. Here the logged training loss is 0.0876.

- **Validation Loss (0.105933)**:
  - Represents the error on the validation dataset, which is not used to update the weights. This metric estimates how well the model generalizes to unseen data. A validation loss of 0.105933 is slightly higher than the training loss.

### Interpretation

- The **training loss** is low (0.0876) and the **validation loss** is slightly higher (0.105933), so the model performs somewhat better on the training data than on the unseen validation data. With a single epoch there is no trend to compare against, so these two numbers alone do not show whether the model is still improving.

- The gap between training loss and validation loss is small, which suggests that the model is not overfitting much. Training can likely proceed for more epochs, and keeping an eye on this gap will help determine whether adjustments (such as regularization or early stopping) are needed.

- If the validation loss keeps decreasing over further epochs, the model is improving. If the validation loss stalls or rises while the training loss keeps falling, the growing gap signals overfitting.

### Suggestion for real scenarios

- Increase the number of epochs and compare training and validation loss after each one to check whether the model keeps improving or starts to overfit.
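When training for several epochs, a simple patience rule can decide when to stop: keep the epoch with the lowest validation loss and stop once it has not improved for a set number of epochs. The sketch below illustrates the rule in plain Python. The loss values are invented for illustration and are not results from this notebook; `best_epoch_with_patience` is a hypothetical helper.

```python
def best_epoch_with_patience(val_losses: list[float], patience: int = 2) -> int:
    """Return the 1-indexed epoch with the lowest validation loss,
    stopping early after `patience` epochs without improvement."""
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # stop: no improvement for `patience` epochs
    return best_epoch


# Hypothetical validation losses over six epochs (illustration only)
hypothetical_losses = [0.106, 0.098, 0.095, 0.097, 0.101, 0.110]
print(best_epoch_with_patience(hypothetical_losses))
```

The `transformers` library provides `EarlyStoppingCallback` for the same purpose when training with `Trainer`.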

In [28]:
# Push the model on the Hub

import time

startTime = time.time()

print('Pushing to hub...\n')


trainer.push_to_hub()


endTime = time.time()

exec_time = endTime - startTime

print('Push Completed !!!')
print(f'Execution Time: {exec_time:.2f} seconds')


Pushing to hub...



Non-default generation parameters: {'max_length': 142, 'min_length': 56, 'early_stopping': True, 'num_beams': 4, 'length_penalty': 2.0, 'no_repeat_ngram_size': 3, 'forced_bos_token_id': 0, 'forced_eos_token_id': 2}



Push Completed !!!
Execution Time: 85.92 seconds


## Retest the loaded model

In [29]:
# load the model from Hugging Face Hub
loaded_model = AutoModelForSeq2SeqLM.from_pretrained("Kain17/bart-cnn-samsum-finetuned")


In [30]:
# Test loaded model

tester_output = generate_summary(sample, loaded_model)


In [31]:
print("SAMPLE:")
print(sample)
print('-'*50)
print('SUMMARY:')
print(tester_output)
print('-'*50)
print('GROUND TRUTH:')
print(label)

SAMPLE:
Hannah: Hey, do you have Betty's number?
Amanda: Lemme check
Hannah: <file_gif>
Amanda: Sorry, can't find it.
Amanda: Ask Larry
Amanda: He called her last time we were at the park together
Hannah: I don't know him well
Hannah: <file_gif>
Amanda: Don't be shy, he's very nice
Hannah: If you say so..
Hannah: I'd rather you texted him
Amanda: Just text him 🙂
Hannah: Urgh.. Alright
Hannah: Bye
Amanda: Bye bye
--------------------------------------------------
SUMMARY:
Hannah asks Amanda for Betty's number. Amanda can't find it. Hannah suggests asking Larry to text Betty. Hannah and Amanda agree that Larry is nice.
--------------------------------------------------
GROUND TRUTH:
Hannah needs Betty's number but Amanda doesn't have it. She needs to contact Larry.
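Comparing a single summary by eye is useful, but a numeric score makes it easier to compare models across many examples. The sketch below computes a simplified unigram-overlap F1 between a generated summary and the reference. It is a rough stand-in in the spirit of ROUGE-1, not the official metric, and `unigram_f1` is a hypothetical helper.

```python
import re
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """F1 of word overlap between candidate and reference (case-insensitive)."""
    cand = Counter(re.findall(r"\w+", candidate.lower()))
    ref = Counter(re.findall(r"\w+", reference.lower()))
    overlap = sum((cand & ref).values())  # words shared, counted with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


generated = ("Hannah asks Amanda for Betty's number. Amanda can't find it. "
             "Hannah suggests asking Larry to text Betty. "
             "Hannah and Amanda agree that Larry is nice.")
reference = ("Hannah needs Betty's number but Amanda doesn't have it. "
             "She needs to contact Larry.")
print(f"unigram F1: {unigram_f1(generated, reference):.3f}")
```

For reported results, the `evaluate` package installed at the top of this notebook offers a standard ROUGE implementation, which may require an additional dependency.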


###### End of the Notebook