# Fine Tuning for Summarisation Task
## Introduction
This notebook performs abstractive summarisation by fine-tuning a T5 model. Because both the encoder and the decoder of T5 are pre-trained, fine-tuning it on a summarisation dataset is a good starting point for the task.

T5's pre-training already includes summarisation on the CNN/Daily Mail dataset, so this notebook serves as a demonstration of how the same procedure could be applied to domain-specific summarisation if needed.

We fine-tune using Low-Rank Adaptation (LoRA), so only a small number of additional parameters, which augment the frozen baseline model, have to be trained. We then compare its summarisation performance against the standard, non-fine-tuned instance.

In [37]:
# Import section
from bert_score import score
from datasets import load_dataset, Dataset, DatasetDict
from transformers import T5ForConditionalGeneration
from transformers.trainer import Trainer
from transformers.training_args import TrainingArguments
from peft import get_peft_model, LoraConfig, TaskType
import torch

from typing import cast

from utils import preprocess_function, get_model_name, get_tokenizer, get_data_collator

## Section 1: Preparing the Dataset
The CNN/Daily Mail dataset of news articles and their highlights is hosted as a [Hugging Face dataset](https://huggingface.co/datasets/abisee/cnn_dailymail) and can therefore be downloaded through the Hugging Face `datasets` library.

In [2]:
dataset = load_dataset("cnn_dailymail", "3.0.0")
dataset = cast(DatasetDict, dataset)

### 1.1. Inspect the Dataset

In [3]:
sample = dataset['train'][0]

print("Article:")
print(sample['article'][:300])
print("")
print("Summary:")
print(sample['highlights'])

Article:
LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won't cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappoi

Summary:
Harry Potter star Daniel Radcliffe gets £20M fortune as he turns 18 Monday .
Young actor says he has no plans to fritter his cash away .
Radcliffe's earnings from first five Potter films have been held in trust fund .


In [4]:
# Check the number of samples present
print(f"Number of training samples: {len(dataset['train'])}")
print(f"Number of validation samples: {len(dataset['validation'])}")
print(f"Number of test samples: {len(dataset['test'])}")

Number of training samples: 287113
Number of validation samples: 13368
Number of test samples: 11490


### 1.2. Tokenise the dataset for consumption by the model

In [5]:
tokenized_dataset = dataset.map(preprocess_function, batched=True, num_proc=8)

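
`preprocess_function` is imported from the local `utils` module and is not shown in this notebook. A minimal sketch of what such a function might look like is given below, assuming the usual T5 recipe: a `summarize: ` prefix on each article, truncation of inputs and targets, and the tokenised highlights stored under `labels`. The factory name `build_preprocess_function` and the length limits of 512 and 128 tokens are assumptions for illustration, not the contents of `utils`.

```python
PREFIX = "summarize: "  # T5 task prefix, the same one used at inference time


def build_preprocess_function(tokenizer, max_input_length=512, max_target_length=128):
    """Return a function suitable for `dataset.map(..., batched=True)`.

    Hypothetical helper: the real `utils.preprocess_function` may differ.
    """

    def preprocess(batch):
        # Prepend the task prefix to every article in the batch.
        inputs = [PREFIX + article for article in batch["article"]]
        model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

        # Tokenise the reference summaries as targets.
        targets = tokenizer(text_target=batch["highlights"],
                            max_length=max_target_length, truncation=True)

        # The Trainer reads the target token ids from the "labels" key.
        model_inputs["labels"] = targets["input_ids"]
        return model_inputs

    return preprocess
```

With a real tokenizer this could be used as `dataset.map(build_preprocess_function(get_tokenizer()), batched=True)`. Padding is deliberately left out of the sketch, on the assumption that the data collator pads each batch.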

## Section 2: Creating the model and the LoRA Config
The Hugging Face interface makes it straightforward to fine-tune with LoRA.

In [6]:
# T5ForConditionalGeneration is needed rather than T5Model, which only returns the raw encoder/decoder hidden states.
# It adds the language-modelling head that produces the vocabulary logits for generating the summary tokens.
model = T5ForConditionalGeneration.from_pretrained(get_model_name())

We create the following LoRA config:
1. Rank `r=8` keeps the adapter matrices small, which reduces the number of trainable parameters.
2. `lora_alpha=16` scales the LoRA update by `lora_alpha / r` (here 2), which controls how much the adapter contributes to the final output.
3. We target the query (`q`) and value (`v`) projections of the attention modules for adaptation, as they are generally considered the most impactful.
4. A dropout of 0.05 on the LoRA path adds regularisation.
5. Biases are not adapted for now.
6. Since the model generates a summary from an article, this is a sequence-to-sequence task.

In [7]:
# LoRA Config
lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=["q", "v"],
                         lora_dropout=0.05, bias="none", task_type=TaskType.SEQ_2_SEQ_LM)
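
To make the roles of `r` and `lora_alpha` concrete, the sketch below computes the output of a single LoRA-adapted projection as `W x + (lora_alpha / r) * B (A x)`, where `W` stays frozen and only `A` and `B` are trained. It uses plain Python lists and toy shapes; the helper names `matvec` and `lora_forward` and all numeric values are illustrative assumptions, not part of the PEFT library.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, r, lora_alpha):
    """Compute W x + (lora_alpha / r) * B (A x) for a single input vector x."""
    scale = lora_alpha / r
    base = matvec(W, x)               # frozen pretrained projection
    update = matvec(B, matvec(A, x))  # low-rank path: d_in -> r -> d_out
    return [b + scale * u for b, u in zip(base, update)]


# Toy shapes: d_in = d_out = 2 and r = 1. lora_alpha = 2 gives the same
# scaling factor (alpha / r = 2) as r=8, lora_alpha=16 in the config.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5, 0.5]]            # shape (r, d_in)
B_init = [[0.0], [0.0]]     # shape (d_out, r), zero before training
B_trained = [[1.0], [0.0]]  # made-up values standing in for a trained B
x = [2.0, 4.0]

print(lora_forward(W, A, B_init, x, r=1, lora_alpha=2))     # [2.0, 4.0]
print(lora_forward(W, A, B_trained, x, r=1, lora_alpha=2))  # [8.0, 4.0]
```

Because `B` starts at zero, the adapted layer initially reproduces the base model exactly; training then moves the output away from the baseline only through the low-rank path.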

We then add the LoRA adapter to the model.

In [8]:
model = get_peft_model(model, lora_config)

# Show how many parameters are trainable, as an indication of efficiency.
model.print_trainable_parameters()

trainable params: 294,912 || all params: 60,801,536 || trainable%: 0.4850
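
The trainable parameter count reported above can be reproduced by hand. The sketch below assumes the base checkpoint is `t5-small`, with a model dimension of 512, an attention inner dimension of 512, and 6 encoder plus 6 decoder blocks; these shapes are assumptions about the checkpoint rather than values read from this notebook. The function name `lora_param_count` is hypothetical.

```python
def lora_param_count(d_in, d_out, r, n_modules):
    """Parameters added by LoRA: each adapted matrix gets A (r x d_in) and B (d_out x r)."""
    return n_modules * r * (d_in + d_out)


# Assumed t5-small layout: each encoder block has one attention layer
# (self-attention); each decoder block has two (self- and cross-attention).
attention_layers = 6 + 6 * 2             # 18 attention layers in total
adapted_matrices = attention_layers * 2  # "q" and "v" in each layer

trainable = lora_param_count(d_in=512, d_out=512, r=8, n_modules=adapted_matrices)
total = 60_801_536  # "all params" as reported by print_trainable_parameters()

print(trainable)                          # 294912
print(f"{100 * trainable / total:.4f}%")  # 0.4850%
```

Under these assumptions the arithmetic matches the reported 294,912 trainable parameters, which suggests the adapter touches exactly the 36 query and value projections.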


## Section 3: Creating a Trainer
Now that we have obtained the appropriate tokens for the model to consume from the dataset and created a LoRA wrapped model instance for fine-tuning, we will create the trainer instance to actually train the model.

We create an instance of `TrainingArguments` to be supplied to the Trainer:
1. Save the weights to the `results` folder.
2. Set the evaluation interval to 500 steps and log progress every 100 steps. Note that `eval_steps` only takes effect when the evaluation strategy is set to `"steps"`; it is not set here, so with the default strategy no evaluation runs during training and only the training loss is logged.
3. Use a training and evaluation batch size of 64 per device (set through the constants below).
4. Use a small learning rate of 1e-5, as this is a fine-tuning task.
5. Warm up over the first 200 steps: the learning rate starts low and gradually increases to the configured value, which helps stability.
6. Save the weights every 1000 steps and retain only the 2 most recent checkpoints.
7. Use mixed precision (`fp16`) for faster training.
8. Save the logs to the `logs` folder and disable remote reporting.

In [22]:
# Adjust these according to your hardware constraints and performance requirements.
TOTAL_EPOCHS=1
TRAIN_BATCH=64
EVAL_BATCH=64

In [23]:
# Create a TrainingArguments instance to give the trainer its configuration.
training_args = TrainingArguments(output_dir='./results', eval_steps=500, logging_steps=100, 
                                  per_device_train_batch_size=TRAIN_BATCH, per_device_eval_batch_size=EVAL_BATCH, 
                                  num_train_epochs=TOTAL_EPOCHS, learning_rate=1e-5, 
                                  warmup_steps=200, save_steps=1000, save_total_limit=2, fp16=True,
                                  logging_dir='./logs', report_to='none')
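
The effect of `warmup_steps=200` on the learning rate can be sketched as follows. This assumes the Trainer's default `linear` scheduler type, which the arguments above do not override: the rate rises linearly from zero to `learning_rate` during warm-up and then decays linearly to zero by the final step. The function name `linear_schedule_lr` is hypothetical, and the step count assumes a single device.

```python
import math


def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at `step` under linear warm-up followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# One epoch over 287,113 training samples at a batch size of 64 on one device.
total_steps = math.ceil(287_113 / 64)
print(total_steps)  # 4487

# Rate at the start, mid warm-up, end of warm-up, and the final step.
for step in (0, 100, 200, total_steps):
    print(step, linear_schedule_lr(step, base_lr=1e-5, warmup_steps=200,
                                   total_steps=total_steps))
```

With these settings warm-up covers roughly the first 4% of training, so the peak rate of 1e-5 is only reached at step 200 and the average rate over the epoch is about half of that.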

### 3.1. Create an instance of Trainer for the training loop

In [24]:
trainer = Trainer(model=model, args=training_args, train_dataset=tokenized_dataset['train'], 
                  eval_dataset=tokenized_dataset['validation'], data_collator=get_data_collator(model))

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
No label_names provided for model class `PeftModelForSeq2SeqLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


### 3.2. Run the training loop

In [25]:
trainer.train()

| Step | Training Loss |
|------|---------------|
| 100  | 2.2077 |
| 200  | 2.1972 |
| 300  | 2.2009 |
| 400  | 2.16   |
| 500  | 2.1615 |
| 600  | 2.1625 |
| 700  | 2.1511 |
| 800  | 2.1484 |
| 900  | 2.1456 |
| 1000 | 2.1403 |


TrainOutput(global_step=4487, training_loss=2.123746235598587, metrics={'train_runtime': 2082.2125, 'train_samples_per_second': 137.888, 'train_steps_per_second': 2.155, 'total_flos': 3.911850631417037e+16, 'train_loss': 2.123746235598587, 'epoch': 1.0})

### 3.3. Save the model weights

In [26]:
model.save_pretrained("t5-small-lora-ft")
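
Calling `save_pretrained` on a PEFT-wrapped model writes only the adapter weights and their config, not the base model. A minimal sketch of reloading the fine-tuned model in a later session is shown below, assuming the base checkpoint is `t5-small` and the adapter directory is the one saved above. The helper name `load_finetuned` is hypothetical.

```python
def load_finetuned(adapter_dir="t5-small-lora-ft", base_name="t5-small"):
    """Rebuild the fine-tuned model from a base checkpoint plus a saved LoRA adapter."""
    # Imported lazily so that defining the helper does not itself load the libraries.
    from peft import PeftModel
    from transformers import T5ForConditionalGeneration

    base = T5ForConditionalGeneration.from_pretrained(base_name)
    # Attach the saved adapter weights on top of the frozen base model.
    return PeftModel.from_pretrained(base, adapter_dir)
```

If a standalone model without the PEFT wrapper is preferred, the adapter can be folded into the base weights with `load_finetuned().merge_and_unload()`.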

## Section 4: Generate a summary from a real article

In [27]:
# Switch the fine-tuned model to eval mode, which disables dropout; gradient tracking is turned off separately with torch.no_grad().
model.eval()
tokenizer = get_tokenizer()

In [30]:
real_article = dataset["test"][0]["article"]
print(real_article)

(CNN)The Palestinian Authority officially became the 123rd member of the International Criminal Court on Wednesday, a step that gives the court jurisdiction over alleged crimes in Palestinian territories. The formal accession was marked with a ceremony at The Hague, in the Netherlands, where the court is based. The Palestinians signed the ICC's founding Rome Statute in January, when they also accepted its jurisdiction over alleged crimes committed "in the occupied Palestinian territory, including East Jerusalem, since June 13, 2014." Later that month, the ICC opened a preliminary examination into the situation in Palestinian territories, paving the way for possible war crimes investigations against Israelis. As members of the court, Palestinians may be subject to counter-charges as well. Israel and the United States, neither of which is an ICC member, opposed the Palestinians' efforts to join the body. But Palestinian Foreign Minister Riad al-Malki, speaking at Wednesday's ceremony, sa

In [32]:
article_text = "summarize: " + real_article
inputs = tokenizer(article_text, return_tensors="pt",
                   truncation=True, max_length=512)
inputs = {k: v.to("cuda") for k, v in inputs.items()}

# Generate summary
with torch.no_grad():
    outputs = model.generate(**inputs, max_length=128, num_beams=4)

summary = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Print the generated summary
print(summary)

The Palestinian Authority officially became the 123rd member of the International Criminal Court. The formal accession was marked with a ceremony at The Hague, in the Netherlands. Israel and the United States opposed the Palestinians' efforts to join the body.


## Section 5: Compare with Baseline on summarisation performance
BERTScore is used because it compares semantic similarity through contextual embeddings rather than literal n-gram overlap (as ROUGE does), and is therefore better suited to measuring the performance of abstractive summarisation.
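
To illustrate why literal overlap can undervalue an abstractive summary, the sketch below implements a simplified unigram-overlap F1 in the spirit of ROUGE-1. It is not the official ROUGE implementation (no stemming or special tokenisation), the function name `rouge1_f1` is hypothetical, and the example sentences are made up for illustration.

```python
from collections import Counter


def rouge1_f1(candidate, reference):
    """Unigram-overlap F1 between two whitespace-tokenised, lower-cased strings."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared words
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "young actor says he will not waste his money"
paraphrase = "the star insists he won't squander the fortune"
near_copy = "young actor says he will not waste his cash"

print(rouge1_f1(paraphrase, reference))  # low: only "he" is shared
print(rouge1_f1(near_copy, reference))   # high: eight of nine words are shared
```

The paraphrase conveys much the same meaning as the reference but shares a single word with it, so an overlap metric scores it far below the near copy. An embedding-based metric such as BERTScore is intended to be less sensitive to this kind of rewording.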

In [42]:
# Base model instance to compare performance against
base_model = T5ForConditionalGeneration.from_pretrained("t5-small").to("cuda")

# 200 articles for performance evaluation
test_set = dataset['test'].select(range(200))
test_set = cast(Dataset, test_set)

base_model.eval()
model.eval()
print("Loaded control model!")

Loaded control model!


In [43]:
# Generate the summaries from both model instances
baseline_summaries = []
finetuned_summaries = []

for item in test_set:
    input_text = "summarize: " + item["article"]
    inputs = tokenizer(input_text, return_tensors="pt",
                       truncation=True, max_length=512)
    inputs = {k: v.to("cuda") for k, v in inputs.items()}
    with torch.no_grad():
        output1 = base_model.generate(**inputs, max_length=128)
        output2 = model.generate(**inputs, max_length=128)
    summary1 = tokenizer.decode(output1[0], skip_special_tokens=True)
    summary2 = tokenizer.decode(output2[0], skip_special_tokens=True)
    baseline_summaries.append(summary1)
    finetuned_summaries.append(summary2)

In [44]:
references = [item["highlights"] for item in test_set]

print(len(baseline_summaries))
print(len(finetuned_summaries))
print(len(references))

# score returns a (precision, recall, F1) tuple of tensors, with one value per summary.
P_base = score(baseline_summaries, references, lang="en")
P_finetuned = score(finetuned_summaries, references, lang="en")

200
200
200


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [45]:
print(f"Base T5 BERTScore F1: {P_base[2].mean().item():.4f}")
print(f"LoRA-Tuned T5 BERTScore F1: {P_finetuned[2].mean().item():.4f}")

Base T5 BERTScore F1: 0.8594
LoRA-Tuned T5 BERTScore F1: 0.8665
