# Full Fine-Tuning vs. PEFT Fine-Tuning a BART Transformer for Summarization
* Notebook by Adam Lang
* Date: 9/24/2024

# Overview
* In this notebook we will fully fine-tune an encoder-decoder model for text summarization: the well-known BART model from Meta/Facebook.
* We will then show what PEFT fine-tuning can do and how easy it is to implement.

In [1]:
## install
!pip install transformers datasets evaluate transformers[torch]

Collecting datasets
  Downloading datasets-3.0.0-py3-none-any.whl.metadata (19 kB)
Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Downloading datasets-3.0.0-py3-none-any.whl (474 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)


## Load Model and Tokenizer from huggingface
* This is a large model with over 400 million parameters, so a GPU is needed for any kind of training or fine-tuning.
* model card: https://huggingface.co/facebook/bart-large-cnn

In [2]:
## load model from hf hub
from transformers import BartTokenizerFast, BartForConditionalGeneration



## tokenizer
tokenizer = BartTokenizerFast.from_pretrained('facebook/bart-large-cnn', clean_up_tokenization_spaces=True)
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.58k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.63G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/363 [00:00<?, ?B/s]

## Load Dataset
* The SAMSum dataset contains about 16k messenger-like conversations with summaries. The conversations were created and written down by linguists fluent in English, who were asked to write conversations similar to those they have on a daily basis, reflecting the proportion of topics in their real-life messenger conversations.
* SAMSum dataset: https://huggingface.co/datasets/Samsung/samsum
* py7zr: https://py7zr.readthedocs.io/en/latest/

In [3]:
## load dataset
!pip install py7zr

Collecting py7zr
  Downloading py7zr-0.22.0-py3-none-any.whl.metadata (16 kB)
Collecting texttable (from py7zr)
  Downloading texttable-1.7.0-py2.py3-none-any.whl.metadata (9.8 kB)
Collecting pycryptodomex>=3.16.0 (from py7zr)
  Downloading pycryptodomex-3.20.0-cp35-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.4 kB)
Collecting pyzstd>=0.15.9 (from py7zr)
  Downloading pyzstd-0.16.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.4 kB)
Collecting pyppmd<1.2.0,>=1.1.0 (from py7zr)
  Downloading pyppmd-1.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.7 kB)
Collecting pybcj<1.1.0,>=1.0.0 (from py7zr)
  Downloading pybcj-1.0.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.0 kB)
Collecting multivolumefile>=0.2.3 (from py7zr)
  Downloading multivolumefile-0.2.3-py3-none-any.whl.metadata (6.3 kB)
Collecting inflate64<1.1.0,>=1.0.0 (from py7zr)
  Downloading inflate64-1.0.0-cp310-cp310-manylinux_2_17_

In [4]:
## load from huggingface
from datasets import load_dataset

##data
dataset = load_dataset('samsum')
dataset

samsum.py:   0%|          | 0.00/3.36k [00:00<?, ?B/s]

README.md:   0%|          | 0.00/7.04k [00:00<?, ?B/s]

The repository for samsum contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/samsum.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


corpus.7z:   0%|          | 0.00/2.94M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/14732 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/819 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/818 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 14732
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 819
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 818
    })
})

Summary:
* The dataset contains train, test, and validation sets.

## Generate Summary with the pretrained BART model OOTB
* We will generate a baseline summary using the pretrained BART model "out of the box", without fine-tuning, just to see the result.

In [5]:
## take sample of dataset
sample = dataset['test'][0]['dialogue']
label = dataset['test'][0]['summary']


## function to generate summarization
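def gen_summary(input, llm):
  # Tokenize the dialogue passed in as `input` (no instruction prompt is added at inference time)
  input_ids = tokenizer(input, return_tensors='pt')
  # Move the ids to the model's device so this also works when the model sits on a GPU
  tokenized_output = llm.generate(input_ids['input_ids'].to(llm.device), min_length=30, max_length=200)
  # Decode the first (and only) generated sequence back to text
  output = tokenizer.decode(tokenized_output[0], skip_special_tokens=True)

  return output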



In [6]:
## generate OOTB summary
output = gen_summary(sample, llm=model)
print("Sample")
print(sample)
print("------------------")
print("Model Generated Summary:")
print(output)
print("Correct Summary (Labeled Data):")
print(label)

Sample
Hannah: Hey, do you have Betty's number?
Amanda: Lemme check
Hannah: <file_gif>
Amanda: Sorry, can't find it.
Amanda: Ask Larry
Amanda: He called her last time we were at the park together
Hannah: I don't know him well
Hannah: <file_gif>
Amanda: Don't be shy, he's very nice
Hannah: If you say so..
Hannah: I'd rather you texted him
Amanda: Just text him 🙂
Hannah: Urgh.. Alright
Hannah: Bye
Amanda: Bye bye
------------------
Model Generated Summary:
Hannah: Hey, do you have Betty's number? Amanda: Lemme check. Hannah: Ask Larry. Amanda: He called her last time we were at the park together.
Correct Summary (Labeled Data):
Hannah needs Betty's number but Amanda doesn't have it. She needs to contact Larry.


Summary:
* The model's output is not a real summary: it copies dialogue turns almost verbatim and attributes "Ask Larry" to Hannah, although Amanda said it.

## Prepare Dataset for training
* The dataset is large, with over 14,000 training samples.
* Training on all of them would take a long time, so we will use a small subset.
* `input_ids` holds the tokenized prompt.
* `labels` holds the tokenized summary (the completion the generative model should produce).
* We also set the `pad_token` so that shorter sequences can be padded to `max_length`. (BART's tokenizer already defines a `<pad>` token, so this step is optional.)
* Map the `tokenize_inputs` function over every example in the dataset.
* `batched=True` processes examples in batches to improve efficiency.
* Remove the old columns, keeping only **input_ids** and **labels**.
* Lastly, filter the dataset to keep only every 100th example and shorten training.
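
One detail the steps above do not cover: labels padded to `max_length` still count toward the cross-entropy loss unless the padding positions are replaced with `-100`, the index that PyTorch's `CrossEntropyLoss` ignores by default. Below is a minimal sketch of that masking step on plain Python lists. The helper name `mask_pad_labels` and the toy token ids are assumptions for illustration and are not part of this notebook's code.

```python
# Hypothetical helper (not part of the original notebook).
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_pad_labels(label_ids, pad_token_id, ignore_index=IGNORE_INDEX):
    """Replace padding ids with ignore_index in every sequence of a batch."""
    return [
        [tok if tok != pad_token_id else ignore_index for tok in seq]
        for seq in label_ids
    ]


# Toy batch of label ids; 1 is assumed to be the pad id here.
batch = [[0, 713, 16, 2, 1, 1], [0, 9226, 2, 1, 1, 1]]
masked = mask_pad_labels(batch, pad_token_id=1)
print(masked)
```

This only works cleanly when the pad id differs from the end-of-sequence id. With `tokenizer.pad_token = tokenizer.eos_token`, as this notebook sets, the genuine end-of-sequence token would be masked as well.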

In [7]:
### tokenize function
def tokenize_inputs(example):
  start_prompt = "Summarize the following conversation.\n\n"
  end_prompt = "\n\nSummary"
  prompt = [start_prompt + dialogue + end_prompt for dialogue in example['dialogue']]
  example['input_ids'] = tokenizer(prompt, padding='max_length', truncation=True, return_tensors='pt',max_length=1024).input_ids
  example['labels'] = tokenizer(example['summary'], padding='max_length', truncation=True, return_tensors='pt',max_length=1024).input_ids

  return example

Note: To take a random subset instead of every 100th example, we could use `shuffle` followed by `select` in place of `filter`.

In [8]:
## apply function on the data

##1. pad token
tokenizer.pad_token = tokenizer.eos_token

##2. map tokenized inputs to every dataset example
tokenized_datasets = dataset.map(tokenize_inputs, batched=True) ## batching

##3. remove unwanted columns
tokenized_datasets = tokenized_datasets.remove_columns(['id','dialogue','summary'])

##4. filter dataset keeping only every 100 examples
tokenized_datasets = tokenized_datasets.filter(lambda example, index: index % 100 == 0, with_indices=True)

Map:   0%|          | 0/14732 [00:00<?, ? examples/s]

Map:   0%|          | 0/819 [00:00<?, ? examples/s]

Map:   0%|          | 0/818 [00:00<?, ? examples/s]

Filter:   0%|          | 0/14732 [00:00<?, ? examples/s]

Filter:   0%|          | 0/819 [00:00<?, ? examples/s]

Filter:   0%|          | 0/818 [00:00<?, ? examples/s]

Dataset Shape

In [9]:
## shape of data
print(f"Train dataset shape: {tokenized_datasets['train'].shape}")
print(f"Validation dataset shape: {tokenized_datasets['validation'].shape}")
print(f"Test dataset shape: {tokenized_datasets['test'].shape}")

Train dataset shape: (148, 2)
Validation dataset shape: (9, 2)
Test dataset shape: (9, 2)


Summary:
* The training set went from 14,732 down to 148 samples.
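
The subset sizes follow directly from the filter `index % 100 == 0`: it keeps indices 0, 100, 200, and so on, so a split with `n` rows keeps `ceil(n / 100)` of them. A quick check in plain Python, using the split sizes printed when the dataset was loaded (the helper name `kept_rows` is only for illustration):

```python
import math


def kept_rows(n_rows, step=100):
    """Count the indices in [0, n_rows) that are multiples of step."""
    return len(range(0, n_rows, step))


splits = {"train": 14732, "validation": 818, "test": 819}
for name, n in splits.items():
    # Same result as rounding n / step up to the next integer
    assert kept_rows(n) == math.ceil(n / 100)
    print(name, kept_rows(n))
```

This prints 148, 9, and 9, matching the shapes shown above.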

In [10]:
## lets get dict keys from dataset
tokenized_datasets['train'][0].keys()

dict_keys(['input_ids', 'labels'])

## Huggingface hub login
* If you want to push the fine-tuned model to the HF Hub, log in via `notebook_login()` and set the `hub_model_id` parameter to point to your repo.
* I am not going to do that here, but it is noted for future reference.

In [11]:
## from huggingface_hub import notebook_login
## notebook_login()

## Fine-Tune Model

In [12]:
## fine-tuning model
from transformers import TrainingArguments, Trainer

## training args
training_args = TrainingArguments(
    output_dir ="./bart-cnn-samsum-finetuned", #local directory
    #hub_model_id=, #model ID on hf hub
    learning_rate=1e-5,
    num_train_epochs=1, # takes about 3 mins per epoch for full fine-tune
    weight_decay=0.01,
    auto_find_batch_size=True,
    eval_strategy='epoch',
    logging_steps=10
)


## trainer
trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation']
)

In [13]:
### train
trainer.train()

Epoch,Training Loss,Validation Loss
1,0.0986,0.140967


Non-default generation parameters: {'max_length': 142, 'min_length': 56, 'early_stopping': True, 'num_beams': 4, 'length_penalty': 2.0, 'no_repeat_ngram_size': 3, 'forced_bos_token_id': 0, 'forced_eos_token_id': 2}


TrainOutput(global_step=37, training_loss=0.3589502299154127, metrics={'train_runtime': 179.0455, 'train_samples_per_second': 0.827, 'train_steps_per_second': 0.207, 'total_flos': 320731481112576.0, 'train_loss': 0.3589502299154127, 'epoch': 1.0})
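
The numbers in `TrainOutput` are consistent with each other. With 148 training examples and 37 optimizer steps in one epoch, the effective batch size appears to have been 4. Presumably `auto_find_batch_size` settled on that value, but this is an inference from the numbers and not something the log states. A small sanity check in plain Python:

```python
import math

n_train = 148          # rows in the filtered training split
global_step = 37       # from TrainOutput
runtime_s = 179.0455   # train_runtime from TrainOutput


def steps_per_epoch(n_examples, batch_size):
    """Optimizer steps for one pass over the data (last batch may be partial)."""
    return math.ceil(n_examples / batch_size)


# Which of these candidate batch sizes reproduce the observed 37 steps?
candidates = [b for b in (1, 2, 4, 8, 16) if steps_per_epoch(n_train, b) == global_step]
print(candidates)

# Throughput figures recomputed from the raw counts
print(round(n_train / runtime_s, 3), round(global_step / runtime_s, 3))
```

The recomputed throughput matches the reported `train_samples_per_second` (0.827) and `train_steps_per_second` (0.207).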

## Save Model Locally
* Do this if you don't push to the hub.


In [14]:
path_to_model = '/content/bart-cnn-samsum-finetuned'

In [15]:
# Assuming you have a Trainer instance named 'trainer'
trainer.save_model(path_to_model)  # This will save the model along with the config.json file

Non-default generation parameters: {'max_length': 142, 'min_length': 56, 'early_stopping': True, 'num_beams': 4, 'length_penalty': 2.0, 'no_repeat_ngram_size': 3, 'forced_bos_token_id': 0, 'forced_eos_token_id': 2}


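Summary:
* The last logged training loss (cross-entropy) was low at about 0.099; the average over the whole epoch reported in `TrainOutput` was 0.359.
* Note that the labels are padded to 1,024 tokens and the padding positions are not masked, so the loss also counts padding tokens and likely looks lower than it would on summary tokens alone.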

### Push model to HF hub
* I am not going to do this step but for future reference.

In [16]:
## trainer.push_to_hub()

# Re-test model after FULL fine-tune

In [17]:
## load model
loaded_model = BartForConditionalGeneration.from_pretrained(path_to_model)

In [18]:
## generate output with trained model
output = gen_summary(sample, llm=loaded_model)

print("Sample")
print(sample)
print("-----------------")
print("Summary:")
print(output)
print("Ground Truth Summary:")
print(label)

Sample
Hannah: Hey, do you have Betty's number?
Amanda: Lemme check
Hannah: <file_gif>
Amanda: Sorry, can't find it.
Amanda: Ask Larry
Amanda: He called her last time we were at the park together
Hannah: I don't know him well
Hannah: <file_gif>
Amanda: Don't be shy, he's very nice
Hannah: If you say so..
Hannah: I'd rather you texted him
Amanda: Just text him 🙂
Hannah: Urgh.. Alright
Hannah: Bye
Amanda: Bye bye
-----------------
Summary:
Hannah asks Amanda for Betty's number. Amanda can't find it. Hannah asks her to text Larry. Hannah says she'll text him.
Ground Truth Summary:
Hannah needs Betty's number but Amanda doesn't have it. She needs to contact Larry.


Summary:
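* The fine-tuning process improved the model modestly: it now writes a third-person summary of the conversation instead of copying dialogue lines, although the result is longer than the reference summary.
* This could be improved further, and the outcome depends on your dataset and how long you train.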
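
To put a rough number on the improvement, one option is a unigram-overlap F1 between each generated summary and the reference. The sketch below is a simplified stand-in for ROUGE-1 (lowercasing, a basic regex tokenizer, no stemming) written in plain Python. For reported results, the `evaluate` package installed at the top of the notebook would be the more standard choice. The names `unigram_f1` and `_tokens` are assumptions for illustration.

```python
import re
from collections import Counter


def _tokens(text):
    """Lowercase and split into word-like tokens (apostrophes kept)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def unigram_f1(candidate, reference):
    """Simplified ROUGE-1-style F1: clipped unigram overlap, no stemming."""
    cand, ref = Counter(_tokens(candidate)), Counter(_tokens(reference))
    overlap = sum((cand & ref).values())  # per-token minimum of the two counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "Hannah needs Betty's number but Amanda doesn't have it. She needs to contact Larry."
before = ("Hannah: Hey, do you have Betty's number? Amanda: Lemme check. "
          "Hannah: Ask Larry. Amanda: He called her last time we were at the park together.")
after = ("Hannah asks Amanda for Betty's number. Amanda can't find it. "
         "Hannah asks her to text Larry. Hannah says she'll text him.")

print(f"before fine-tuning: {unigram_f1(before, reference):.3f}")
print(f"after fine-tuning:  {unigram_f1(after, reference):.3f}")
```

A single example is only anecdotal; averaging such a score over the held-out test split would give a fairer picture.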

# PEFT Fine tuning

In [19]:
## load the fully fine-tuned model and tokenizer saved locally above
from transformers import BartTokenizerFast, BartForConditionalGeneration


tokenizer = BartTokenizerFast.from_pretrained('/content/bart-cnn-samsum-finetuned')
model = BartForConditionalGeneration.from_pretrained('/content/bart-cnn-samsum-finetuned')

In [20]:
## load dataset
from datasets import load_dataset

dataset = load_dataset('samsum')
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 14732
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 819
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 818
    })
})

## Prepare Dataset for Training
* Same as before

In [21]:
### tokenize function
def tokenize_inputs(example):
  start_prompt = "Summarize the following conversation.\n\n"
  end_prompt = "\n\nSummary"
  prompt = [start_prompt + dialogue + end_prompt for dialogue in example['dialogue']]
  example['input_ids'] = tokenizer(prompt, padding='max_length', truncation=True, return_tensors='pt',max_length=1024).input_ids
  example['labels'] = tokenizer(example['summary'], padding='max_length', truncation=True, return_tensors='pt',max_length=1024).input_ids

  return example






In [22]:
## apply function on the data

##1. pad token
tokenizer.pad_token = tokenizer.eos_token

##2. map tokenized inputs to every dataset example
tokenized_datasets = dataset.map(tokenize_inputs, batched=True) ## batching

##3. remove unwanted columns
tokenized_datasets = tokenized_datasets.remove_columns(['id','dialogue','summary'])

##4. filter dataset keeping only every 100 examples
tokenized_datasets = tokenized_datasets.filter(lambda example, index: index % 100 == 0, with_indices=True)

Map:   0%|          | 0/14732 [00:00<?, ? examples/s]

Map:   0%|          | 0/819 [00:00<?, ? examples/s]

Map:   0%|          | 0/818 [00:00<?, ? examples/s]

Filter:   0%|          | 0/14732 [00:00<?, ? examples/s]

Filter:   0%|          | 0/819 [00:00<?, ? examples/s]

Filter:   0%|          | 0/818 [00:00<?, ? examples/s]

In [23]:
## check shape -- only getting every 100th example
print(tokenized_datasets['train'].shape)
print(tokenized_datasets['validation'].shape)
print(tokenized_datasets['test'].shape)

(148, 2)
(9, 2)
(9, 2)


## Create PEFT Model using LoRA
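* LoRA (Low-Rank Adaptation) is the PEFT method used here; the `peft` library is maintained by Hugging Face.
* A little about the parameters:
1. `r` is the rank of the low-rank update matrices that LoRA adds to the adapted weight matrices.
2. `lora_alpha` scales the update; the effective scale is `lora_alpha / r`.
3. `task_type` tells PEFT which kind of task you are fine-tuning for (here, sequence-to-sequence language modeling).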
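
For intuition about `r` and `lora_alpha`: LoRA keeps a pretrained weight matrix `W` frozen and learns a low-rank update, so the effective weight is `W + (lora_alpha / r) * B @ A`, where `A` has shape `r x d_in` and `B` has shape `d_out x r`. The toy sketch below shows this in pure Python with made-up 2x2 numbers. The values and helper names are illustrative assumptions and not PEFT internals.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, lora_a, lora_b, r, lora_alpha):
    """Return W + (lora_alpha / r) * (B @ A) for small list-based matrices."""
    scale = lora_alpha / r
    delta = matmul(lora_b, lora_a)  # (d_out x r) @ (r x d_in) -> d_out x d_in
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


w = [[1.0, 0.0], [0.0, 1.0]]   # frozen pretrained weight (d_out=2, d_in=2)
lora_a = [[0.5, -0.5]]         # trainable, shape r x d_in with r=1
lora_b = [[1.0], [2.0]]        # trainable, shape d_out x r
effective = lora_effective_weight(w, lora_a, lora_b, r=1, lora_alpha=2)
print(effective)
```

With `r=32` and `lora_alpha=32`, as configured in this notebook, the scale factor `lora_alpha / r` is 1.0.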

In [24]:
!pip install peft

Collecting peft
  Downloading peft-0.13.0-py3-none-any.whl.metadata (13 kB)
Downloading peft-0.13.0-py3-none-any.whl (322 kB)
Installing collected packages: peft
Successfully installed peft-0.13.0


In [25]:
from peft import LoraConfig, get_peft_model, TaskType

In [26]:
## setup lora_config
lora_config = LoraConfig(
    r=32,# 8, 16, 32
    lora_alpha=32,
    lora_dropout=0.05,
    bias='none',
    task_type=TaskType.SEQ_2_SEQ_LM
)

In [29]:
## peft model instantiate
peft_model = get_peft_model(model, peft_config=lora_config)

In [None]:
## hf notebook login if pushing to hub
#from huggingface_hub import notebook_login
#notebook_login()

## Train PEFT Model
* The PEFT `training_args` and `trainer` work essentially the same way as the ones we used for the full fine-tune.
* Below we train for 5 epochs; each epoch should take less time than in the full fine-tune.
  * We can afford more epochs because only a little over 1% of the model's parameters are trainable.

In [34]:
from transformers import TrainingArguments, Trainer

# training args
peft_training_args = TrainingArguments(
    output_dir = "./bart-cnn-samsum-peft", #local directory
    #hub_model_id= #id on HF hub,
    #save_safetensors=False, # possible workaround if saving fails with the shared-tensor RuntimeError
    learning_rate=1e-5,
    num_train_epochs=5,
    weight_decay=0.01,
    auto_find_batch_size=True,
    eval_strategy='epoch',
    logging_steps=10
)


## peft trainer
peft_trainer = Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation']
)

In [35]:
## peft model params
peft_model.print_trainable_parameters()

trainable params: 4,718,592 || all params: 411,009,024 || trainable%: 1.1481


Summary:
* As mentioned above, with PEFT we are training only about 1.15% of the model's parameters (4,718,592 of 411,009,024).
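
The trainable-parameter count can be reproduced by hand. The sketch assumes that PEFT's default LoRA targets for BART are the `q_proj` and `v_proj` attention projections. That is an assumption about the library default, though it matches the printed number. It also uses the BART-large shape of 12 encoder layers, 12 decoder layers, and hidden size 1024.

```python
d_model = 1024          # BART-large hidden size
r = 32                  # LoRA rank from lora_config
encoder_layers = 12
decoder_layers = 12

# Each adapted projection gains A (r x d_model) and B (d_model x r).
params_per_module = r * (d_model + d_model)

# Encoder layers have self-attention only; decoder layers add cross-attention.
# Two projections (q_proj, v_proj) are assumed adapted per attention block.
modules = encoder_layers * 2 + decoder_layers * 2 * 2

trainable = modules * params_per_module
all_params = 411_009_024   # "all params" from print_trainable_parameters()
print(f"{trainable:,} trainable, {100 * trainable / all_params:.4f}%")
```

This gives 4,718,592 trainable parameters, or 1.1481%, in agreement with the output of `print_trainable_parameters()`.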

In [36]:
## peft train model
peft_trainer.train()

Epoch,Training Loss,Validation Loss
1,0.1216,0.138062
2,0.1008,0.137967
3,0.101,0.137466
4,0.108,0.143705


RuntimeError: 
            Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again: [{'base_model.model.model.encoder.embed_tokens.weight', 'base_model.model.model.shared.weight', 'base_model.model.lm_head.weight', 'base_model.model.model.decoder.embed_tokens.weight'}].
            A potential way to correctly save your model is to use `save_model`.
            More information at https://huggingface.co/docs/safetensors/torch_shared_tensors
            

## Save PEFT Model
* You can push to hf hub with this code: `peft_trainer.push_to_hub()`
* Or you can save locally as below.

In [None]:
# save peft_model locally
peft_trainer.save_model('/content/bart-cnn-samsum-peft')  # For a PEFT model this saves the LoRA adapter weights and adapter_config.json

Non-default generation parameters: {'max_length': 142, 'min_length': 56, 'early_stopping': True, 'num_beams': 4, 'length_penalty': 2.0, 'no_repeat_ngram_size': 3, 'forced_bos_token_id': 0, 'forced_eos_token_id': 2}


## Test PEFT model

In [None]:
## generate summary function
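def gen_summary(input, llm):
  # Tokenize the dialogue passed in as `input` (no instruction prompt is added at inference time)
  input_ids = tokenizer(input, return_tensors="pt")
  # Move the ids to the model's device, then generate with keyword arguments (required by PEFT wrappers)
  tokenized_output = llm.generate(input_ids=input_ids['input_ids'].to(llm.device), min_length=30, max_length=200)
  # Decode the first (and only) generated sequence back to text
  output = tokenizer.decode(tokenized_output[0], skip_special_tokens=True)

  return output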



## Reload Model then Test
* To do this, load the base model that the adapter was trained on, then attach the saved LoRA adapter to it.

In [None]:
## reload model
from peft import PeftModel, PeftConfig
from transformers import BartTokenizerFast, BartForConditionalGeneration

In [None]:
## tokenizer saved with the fully fine-tuned model
tokenizer = BartTokenizerFast.from_pretrained('/content/bart-cnn-samsum-finetuned', clean_up_tokenization_spaces=True)

## base model: the fully fine-tuned BART that the LoRA adapter was trained on
peft_model_base = BartForConditionalGeneration.from_pretrained('/content/bart-cnn-samsum-finetuned')


## attach the saved LoRA adapter to the base model (inference only)
loaded_peft_model = PeftModel.from_pretrained(
    peft_model_base,
    '/content/bart-cnn-samsum-peft',
    is_trainable=False
)

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.58k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.63G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/363 [00:00<?, ?B/s]

In [None]:
## samples
sample = dataset['test'][0]['dialogue']
label = dataset['test'][0]['summary']

##output
output = gen_summary(sample, llm=loaded_peft_model)

##print results
print("Sample")
print(sample)
print("--------------------")
print("Summary:")
print(output)
print("Ground Truth Summary")
print(label)