# Assignment 1000: Fine Tuning an LLM for Text Generation

### Instructor: Dr. Ankur Mali

### Submission by: Abrar Zahin


## Objective:
The objective of this notebook is to fine-tune an open-source pre-trained LLM on a dataset and to understand why this step matters. After fine-tuning, I compare the text generated by the pre-trained model with the text generated by the fine-tuned model. I also compare two models and choose the one better suited to the task.

## Methodology:
Our procedure is divided into these steps:


1.   Installation
2.   Data loading and Data Preprocessing
3.   Training
4.   Evaluation


### Installation:
We install the required dependencies. Here, the transformers and datasets libraries are installed.

In [1]:
!pip install --quiet transformers datasets


### Data Loading and Data Preprocessing:
We first import PyTorch, along with the tokenizer, model, `Trainer`, and `TrainingArguments` classes from `transformers`. The `load_dataset` function from the `datasets` library loads the dataset from the Hugging Face Hub.


**Model:** For this experiment, I am using `facebook/bart-base`, a sequence-to-sequence model released by Facebook. Because it is pre-trained for sequence-to-sequence generation, it can be expected to summarize better than GPT-2 after fine-tuning on the XSum (Extreme Summarization) dataset.


**Dataset:** The dataset I used is XSum (Extreme Summarization), which is designed for summarization. The task is to write a one-sentence summary of each article from the British Broadcasting Corporation (BBC).

From the train split, we take the first 10,000 examples (documents) and from the validation split, we take the first 1500 examples.

**Tokenizer:** The tokenizer converts each document into input tokens, truncating documents longer than the maximum length of 512 tokens and padding shorter ones up to that length.
The targets (summaries) use a shorter maximum length of 64 tokens.

A PyTorch tensor is returned by the tokenizer.
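The truncation and padding behaviour described above can be illustrated with a minimal pure-Python sketch. The helper `pad_or_truncate`, the token ids, and the pad id are hypothetical and chosen only for illustration; a real tokenizer also handles special tokens, which this sketch ignores.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                    # drop anything past max_length
    n_pad = max_length - len(kept)                   # number of pad tokens needed
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask


# Hypothetical token ids; max_length=5 stands in for 512 (documents) or 64 (summaries).
short_ids, short_mask = pad_or_truncate([11, 12, 13], max_length=5, pad_id=0)
long_ids, long_mask = pad_or_truncate([11, 12, 13, 14, 15, 16, 17], max_length=5, pad_id=0)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what lets the model ignore the padded positions, which is why it is kept alongside `input_ids` later in the notebook.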

### BART-base model from Facebook:

In [22]:

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, Trainer, TrainingArguments
from datasets import load_dataset

model_name = "facebook/bart-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)             #Tokenizer of the facebook/bart-base model
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

dataset = load_dataset("xsum")               #Loading dataset using the datasets library

train_data = dataset["train"].select(range(10000))      #train_data derived by splitting the dataset
val_data = dataset["validation"].select(range(1500))     #Val_data is validation data derived by splitting the dataset


def tokenize_function(example):
    tokenizer.pad_token = tokenizer.eos_token       #Padding token


    input = tokenizer(
        example["document"],       #Documents are tokenized as inputs
        truncation=True,           #Document truncated if longer than max length
        padding="max_length",   #Padding added for same length
        max_length=512,  # longer since articles are larger
        return_tensors="pt"        #returned as pytorch tensors
    )


    with tokenizer.as_target_tokenizer():
        target = tokenizer(
            example["summary"],       #Summary is tokenized as targets
            truncation=True,
            padding="max_length",
            max_length=64,  # summaries are short
            return_tensors="pt"
        )

    input["labels"] = target["input_ids"]        #targets are passed in with labels
    return input

model.resize_token_embeddings(len(tokenizer))              #match embeddings to the tokenizer vocabulary (unchanged here, since reusing eos as pad adds no token)
tokenized_train = train_data.map(tokenize_function, batched=True)   #map the tokenizer function to each example in the dataset
tokenized_val = val_data.map(tokenize_function, batched=True)

tokenized_train.set_format("torch", columns=["input_ids", "attention_mask", "labels"])   # In the pytorch tensors, only important columns are passed
tokenized_val.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
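One detail worth checking in a setup like this is how padded label positions are treated. If padded positions keep an ordinary token id, they count towards the training loss. A common convention is to replace them with -100, the default `ignore_index` of PyTorch's cross-entropy loss. The sketch below is an assumption-labelled illustration, not part of the run reported here: `mask_label_padding` and the token ids are hypothetical. It masks by attention mask rather than by pad id, because the pad token is set equal to the end-of-sequence token in this notebook and the real end-of-sequence token should stay in the labels.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_label_padding(label_ids, attention_mask):
    """Replace padded label positions with IGNORE_INDEX, keeping real tokens (including a real EOS)."""
    return [tok if keep == 1 else IGNORE_INDEX
            for tok, keep in zip(label_ids, attention_mask)]


# Hypothetical ids: 2 plays the role of both EOS and padding.
labels = [0, 713, 16, 2, 2, 2]
mask = [1, 1, 1, 1, 0, 0]   # the first 2 is the real EOS, the last two are padding

masked = mask_label_padding(labels, mask)
print(masked)
```

Applied row by row to the tokenized summaries and their attention masks, this would change the loss values, so any comparison with the numbers reported in this notebook would need a fresh training run.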


### Before Training:
Before training, the model generates the following for a given prompt:

In [23]:
prompt = "India has launched a rocket it hopes will allow it to join an elite group of space explorers to Mars; the country's space research organisation (Isro) said the unmanned Mangalyann, or Mars craft, was successfully placed into orbit; the spacecraft is set to travel for 300 days before reaching Mars and embarking on experiments; if successful, India will become the fourth space agency to reach the Red Planet after the US, Russia, and Europe; the total cost of the mission is put at about $75m (£45m), making it the cheapest Mars mission to date; it aims to study the Martian atmosphere and search for signs of life, among other objectives; the spacecraft will also test out technologies needed for a future interplanetary mission; the launch took place from the Sriharikota spaceport on the east coast of India"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()

model.to(torch.device("cuda"))


output = model.generate(input_ids, max_new_tokens=64, num_beams=5, no_repeat_ngram_size=2, early_stopping=True)
response = tokenizer.decode(output[0], skip_special_tokens=True)
print(f'Prompt: {prompt}\nResponse: {response}')

Prompt: India has launched a rocket it hopes will allow it to join an elite group of space explorers to Mars; the country's space research organisation (Isro) said the unmanned Mangalyann, or Mars craft, was successfully placed into orbit; the spacecraft is set to travel for 300 days before reaching Mars and embarking on experiments; if successful, India will become the fourth space agency to reach the Red Planet after the US, Russia, and Europe; the total cost of the mission is put at about $75m (£45m), making it the cheapest Mars mission to date; it aims to study the Martian atmosphere and search for signs of life, among other objectives; the spacecraft will also test out technologies needed for a future interplanetary mission; the launch took place from the Sriharikota spaceport on the east coast of India
Response: India has launched a rocket it hopes will allow it to join an elite group of space explorers to Mars; the country's space research organisation (Isro) said the unmanned M

Prior to training, the model only echoes the beginning of the prompt before stopping abruptly. Next, we see how the model behaves after training on the XSum dataset.

### Training

We use the Trainer API to train the model on the provided dataset.

We specify the training arguments before training the model. The model is trained for only 1 epoch with a per-device batch size of 1 and 4 gradient accumulation steps. It is evaluated and logged every 100 steps, and a checkpoint is saved every 500 steps.

In [24]:
training_args = TrainingArguments(
    output_dir="./finetuned_bertbasexsum",
    eval_strategy="steps",       #Evaluated per 100 steps instead of an entire epoch
    num_train_epochs=1,            #Only one epoch due to GPU memory and usage limitations
    per_device_train_batch_size=1,    #Batch size reduced for memory and usage limitations
    per_device_eval_batch_size=1,
    gradient_accumulation_steps = 4,          #gradient accumulation to replicate 4 times the batch size
    save_steps=500,
    logging_strategy="steps",         # logged every 100 steps
    logging_steps=100,
    #load_best_model_at_end=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
)

trainer.train()

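| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 100  | 1.2325 | 0.969194 |
| 200  | 1.1498 | 0.976996 |
| 300  | 1.1563 | 0.965399 |
| 400  | 1.0855 | 0.960681 |
| 500  | 1.1166 | 0.954467 |
| 600  | 1.0979 | 0.945125 |
| 700  | 1.0918 | 0.949772 |
| 800  | 1.1215 | 0.941494 |
| 900  | 1.1071 | 0.93255 |
| 1000 | 1.0934 | 0.929081 |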

TrainOutput(global_step=2500, training_loss=1.0740334228515624, metrics={'train_runtime': 2130.481, 'train_samples_per_second': 4.694, 'train_steps_per_second': 1.173, 'total_flos': 3048682291200000.0, 'train_loss': 1.0740334228515624, 'epoch': 1.0})
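The `global_step=2500` reported above follows directly from the training arguments. The short calculation below reproduces it, assuming a single GPU (so the per-device batch size equals the batch size per forward pass) and the 10,000 training examples selected earlier.

```python
num_train_examples = 10_000        # first 10,000 examples of the train split
per_device_batch_size = 1
gradient_accumulation_steps = 4
num_epochs = 1
eval_every = 100                   # evaluation and logging interval, in optimizer steps
save_every = 500                   # checkpoint interval, in optimizer steps

# Examples seen per weight update.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps

# Optimizer steps per epoch (10,000 divides evenly by 4 here).
steps_per_epoch = num_train_examples // effective_batch_size
total_steps = steps_per_epoch * num_epochs

num_evaluations = total_steps // eval_every
num_checkpoints = total_steps // save_every

print(effective_batch_size, total_steps, num_evaluations, num_checkpoints)
```

The result of 2,500 optimizer steps matches the `TrainOutput`, which is a quick sanity check that gradient accumulation was applied as intended.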

After training, the model generates the following summary:

In [25]:
prompt = "India has launched a rocket it hopes will allow it to join an elite group of space explorers to Mars; the country's space research organisation (Isro) said the unmanned Mangalyann, or Mars craft, was successfully placed into orbit; the spacecraft is set to travel for 300 days before reaching Mars and embarking on experiments; if successful, India will become the fourth space agency to reach the Red Planet after the US, Russia, and Europe; the total cost of the mission is put at about $75m (£45m), making it the cheapest Mars mission to date; it aims to study the Martian atmosphere and search for signs of life, among other objectives; the spacecraft will also test out technologies needed for a future interplanetary mission; the launch took place from the Sriharikota spaceport on the east coast of India"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()

model.to(torch.device("cuda"))


output = model.generate(input_ids, max_new_tokens=64, num_beams=5, no_repeat_ngram_size=2, early_stopping=True)
response = tokenizer.decode(output[0], skip_special_tokens=True)
print(f'Prompt: {prompt}\nResponse: {response}')

Prompt: India has launched a rocket it hopes will allow it to join an elite group of space explorers to Mars; the country's space research organisation (Isro) said the unmanned Mangalyann, or Mars craft, was successfully placed into orbit; the spacecraft is set to travel for 300 days before reaching Mars and embarking on experiments; if successful, India will become the fourth space agency to reach the Red Planet after the US, Russia, and Europe; the total cost of the mission is put at about $75m (£45m), making it the cheapest Mars mission to date; it aims to study the Martian atmosphere and search for signs of life, among other objectives; the spacecraft will also test out technologies needed for a future interplanetary mission; the launch took place from the Sriharikota spaceport on the east coast of India
Response: India's space agency has successfully launched a spacecraft into orbit over the Red Planet.


I also experimented with the DistilGPT-2 model. Before any fine-tuning, the pre-trained model generated the following output.

### DistilGPT-2:

In [14]:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments
from datasets import load_dataset

model_name = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

dataset = load_dataset("xsum")               #Loading dataset using the datasets library

train_data = dataset["train"].select(range(10000))      #train_data derived by splitting the dataset
val_data = dataset["validation"].select(range(1500))     #Val_data is validation data derived by splitting the dataset


def tokenize_function(example):
    tokenizer.pad_token = tokenizer.eos_token


    input = tokenizer(
        example["document"],
        truncation=True,
        padding="max_length",
        max_length=512,  # longer since articles are larger
        return_tensors="pt"
    )


    with tokenizer.as_target_tokenizer():
        target = tokenizer(
            example["summary"],
            truncation=True,
            padding="max_length",
            max_length=64,  # summaries are short
            return_tensors="pt"
        )

    input["labels"] = target["input_ids"]
    return input

model.resize_token_embeddings(len(tokenizer))
tokenized_train = train_data.map(tokenize_function, batched=True)   #map the tokenizer function to each example in the dataset
tokenized_val = val_data.map(tokenize_function, batched=True)
# Keep only the input_ids, attention_mask, and labels columns as PyTorch tensors
tokenized_train.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
tokenized_val.set_format("torch", columns=["input_ids", "attention_mask", "labels"])


In [18]:
prompt = "India has launched a rocket it hopes will allow it to join an elite group of space explorers to Mars; the country's space research organisation (Isro) said the unmanned Mangalyann, or Mars craft, was successfully placed into orbit; the spacecraft is set to travel for 300 days before reaching Mars and embarking on experiments; if successful, India will become the fourth space agency to reach the Red Planet after the US, Russia, and Europe; the total cost of the mission is put at about $75m (£45m), making it the cheapest Mars mission to date; it aims to study the Martian atmosphere and search for signs of life, among other objectives; the spacecraft will also test out technologies needed for a future interplanetary mission; the launch took place from the Sriharikota spaceport on the east coast of India"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()

model.to(torch.device("cuda"))


output = model.generate(input_ids, max_new_tokens=64, num_beams=5, no_repeat_ngram_size=2, early_stopping=True)
response = tokenizer.decode(output[0], skip_special_tokens=True)
print(f'Prompt: {prompt}\nResponse: {response}')

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Prompt: India has launched a rocket it hopes will allow it to join an elite group of space explorers to Mars; the country's space research organisation (Isro) said the unmanned Mangalyann, or Mars craft, was successfully placed into orbit; the spacecraft is set to travel for 300 days before reaching Mars and embarking on experiments; if successful, India will become the fourth space agency to reach the Red Planet after the US, Russia, and Europe; the total cost of the mission is put at about $75m (£45m), making it the cheapest Mars mission to date; it aims to study the Martian atmosphere and search for signs of life, among other objectives; the spacecraft will also test out technologies needed for a future interplanetary mission; the launch took place from the Sriharikota spaceport on the east coast of India
Response: India has launched a rocket it hopes will allow it to join an elite group of space explorers to Mars; the country's space research organisation (Isro) said the unmanned M

Since GPT-2 is a decoder-only model trained for open-ended text generation, it extended the prompt instead of summarizing it. Although it generated fluent text, it failed to summarize the given document.
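To make the difference from the sequence-to-sequence setup concrete: a decoder-only model has no separate target sequence, so one common way to fine-tune it for summarization is to place the document and the summary in a single sequence and compute the loss only on the summary part. The sketch below illustrates that idea under stated assumptions. The helper `build_causal_example`, the separator ids, and the other token ids are hypothetical and are not taken from this notebook; only the end-of-sequence id 50256 comes from the generation warning shown above.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def build_causal_example(doc_ids, summary_ids, sep_ids, eos_id):
    """Concatenate document + separator + summary + EOS and mask the prompt part of the labels."""
    prompt = doc_ids + sep_ids
    input_ids = prompt + summary_ids + [eos_id]
    # Labels have the same length as input_ids; only summary and EOS positions are scored.
    labels = [IGNORE_INDEX] * len(prompt) + summary_ids + [eos_id]
    return input_ids, labels


# Hypothetical token ids standing in for a tokenized document, a separator
# (for example a short cue such as "TL;DR:"), and a tokenized summary.
input_ids, labels = build_causal_example(
    doc_ids=[101, 102, 103],
    summary_ids=[201, 202],
    sep_ids=[7, 8],
    eos_id=50256,
)
print(input_ids)
print(labels)
```

At generation time the same prompt format (document followed by the separator) would be used, and only the newly generated tokens would be read as the summary.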

# Challenges

I faced three main challenges during training. The first was **running out of GPU memory**. To reduce the loss further, I initially aimed to train the model for 3 epochs, but the GPU almost ran out of memory.
To mitigate this, I:

*   Reduced the batch size
*   Used a small subset of the dataset for each epoch
*   Used gradient accumulation

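With gradient accumulation, the effective batch size is 4 times larger than `per_device_train_batch_size`. Gradients from 4 consecutive batches are accumulated before each weight update, so one update reflects 1 × 4 = 4 examples while only one example is processed at a time.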
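The equivalence behind gradient accumulation can be checked on a toy problem. The sketch below uses a one-parameter model with a squared-error loss; the data points and the helper `grad_squared_error` are hypothetical. It compares the gradient of one batch of 4 with the sum of four single-example gradients, each divided by the number of accumulation steps.

```python
def grad_squared_error(w, x, y):
    """Derivative with respect to w of (w*x - y)**2 for a single example."""
    return 2.0 * (w * x - y) * x


w = 0.5
data = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 8.0)]  # hypothetical (x, y) pairs

# One batch of 4 examples: gradient of the mean loss.
full_batch_grad = sum(grad_squared_error(w, x, y) for x, y in data) / len(data)

# Four micro-batches of 1 example each, accumulated before a single update.
accumulation_steps = 4
accumulated_grad = 0.0
for x, y in data:
    accumulated_grad += grad_squared_error(w, x, y) / accumulation_steps

print(full_batch_grad, accumulated_grad)
```

Both gradients agree, so the weight update is the same as with a batch of 4, while memory use stays at the level of a single example. The trade-off is more forward and backward passes per update.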

The second challenge was **finding the right model** for extreme summarization. I started the experiment with `distilgpt2`, the distilled version of GPT-2. Since it is an open-ended text generation model, I switched to a model more commonly used for summarizing documents.

The third challenge was the **GPU usage limits** in Colab. Because of these limits, I could not run more prompt tests to draw further conclusions.

# Inference:

For this experiment, I tested a Large Language Model on summarizing long documents into one-sentence summaries.

The BART-base model from Facebook is a sequence-to-sequence model pre-trained as a denoising autoencoder, which makes it a natural starting point for fine-tuning on summarization. After pre-processing and loading the dataset, I tested the pre-trained model.

The pre-trained model, when given a prompt, did not attempt to summarize it and only copied the beginning of the input. Since it is a **general-purpose** model that had not yet been fine-tuned for summarization, it generated text based on the patterns it learned during pre-training.

### After Training:
After training on our dataset, the model behaved as expected and returned a satisfactory summary of the given document. Compared to the pre-trained model, the fine-tuned model performed much better.

The features of the generated summary are as follows:
* Mars is referred to as the Red Planet, a phrase that also appears in the document. One pattern that the model may have learned from the dataset is **generalization** of words and **paraphrasing**.
* The summary is a broad generalization and does not contain any specific details.
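* The summary introduces no entities that are absent from the document, so there is no obvious hallucinated content, although "into orbit over the Red Planet" goes further than the source, which says only that the craft was placed into orbit and still faces a 300-day journey to Mars.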
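A rough way to support the hallucination check in the list above is to look for summary words that never occur in the source document. The sketch below is a crude heuristic under stated assumptions: the helper `novel_words` is hypothetical, it compares lowercase alphabetic words only, and it cannot detect a wrong relationship between words that are all present in the source. The `source` string is the first part of the prompt used in this notebook and `summary` is the fine-tuned model's response.

```python
import re


def novel_words(source, summary):
    """Return summary words (lowercased, in order, without repeats) that never occur in the source."""
    source_vocab = set(re.findall(r"[a-z]+", source.lower()))
    seen, novel = set(), []
    for word in re.findall(r"[a-z]+", summary.lower()):
        if word not in source_vocab and word not in seen:
            seen.add(word)
            novel.append(word)
    return novel


source = (
    "India has launched a rocket it hopes will allow it to join an elite group of "
    "space explorers to Mars; the country's space research organisation (Isro) said "
    "the unmanned Mangalyann, or Mars craft, was successfully placed into orbit; "
    "the spacecraft is set to travel for 300 days before reaching Mars and embarking "
    "on experiments; if successful, India will become the fourth space agency to "
    "reach the Red Planet after the US, Russia, and Europe"
)
summary = "India's space agency has successfully launched a spacecraft into orbit over the Red Planet."

new_words = novel_words(source, summary)
print(new_words)
```

A short list of new words suggests the summary stays close to the source vocabulary. Any word it does flag is worth reading in context, since a single preposition can change what the sentence claims.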

### During training:

From the training log, we can conclude that training went satisfactorily. Despite minor fluctuations, the validation loss ultimately decreases from 0.97 to 0.88. The training loss falls from 1.23 at step 100 to an average of 1.07 over the run, which shows that the model learned gradually.

There were no signs of overfitting, as the validation loss did not trend upward.
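The overfitting check can be made explicit with a few lines of Python. The sketch below uses the validation losses from the training log shown earlier, which covers only the first 1,000 of the 2,500 steps. The rule of thumb it applies (several consecutive rises in validation loss) is an assumption made for illustration, not a formal test.

```python
# (step, validation loss) pairs copied from the training log (first 1,000 steps only).
val_log = [
    (100, 0.969194), (200, 0.976996), (300, 0.965399), (400, 0.960681),
    (500, 0.954467), (600, 0.945125), (700, 0.949772), (800, 0.941494),
    (900, 0.93255), (1000, 0.929081),
]

# Steps at which the validation loss was higher than at the previous evaluation.
rising_steps = [s2 for (_, l1), (s2, l2) in zip(val_log, val_log[1:]) if l2 > l1]

# Longest run of consecutive rises; a long run would hint at overfitting.
longest_rise, current = 0, 0
for (_, l1), (_, l2) in zip(val_log, val_log[1:]):
    current = current + 1 if l2 > l1 else 0
    longest_rise = max(longest_rise, current)

best_step, best_loss = min(val_log, key=lambda pair: pair[1])
net_change = val_log[-1][1] - val_log[0][1]

print(rising_steps, longest_rise, best_step, net_change)
```

In the logged portion, the loss rises only at two isolated evaluations and the lowest value is at the last logged step, which is consistent with the conclusion above.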


### DistilGPT-2

I also trained the DistilGPT-2 model on the `xsum` dataset. Since it is a causal language model, it is pre-trained to continue text rather than to summarize it. After training, the generated text was a copy of the document followed by additional generated text.

The BART-base model is a pre-trained sequence-to-sequence model, which is why it captured the extreme summarization task better.

In future work, I would like to dive deeper into this experiment by testing more complex prompts and by comparing against other model families such as RNNs and CNNs.