# **CS355: Introduction to Large Language Models (LLMs)**
---
## **Assignment 2:** Fine Tuning, Evaluation and Data Augmentation

### Student Name: <code>[name]</code>
### Student ID: <code>[ID]</code>

## **Assignment Objectives**
This assignment guides students through fine-tuning a model, evaluating it, and augmenting its training data. By the end of the assignment, students will:
*   Be able to fine-tune multiple models.

*   Evaluate models on multiple criteria.

*   Augment data using various augmentation strategies.



---

## **READ THESE INSTRUCTIONS FIRST**

* There are exactly **3** tasks in this notebook.

* Do not change or remove any pre-written code. The provided code is included intentionally. Make sure to pay special attention to import statements, variable names, and pre-written comments in the code cells.

* Carefully read the task description before beginning each task to ensure you understand what is required.

* **There is no penalty for using AI assistance on this homework** as long as you fully disclose it and understand the solution you have provided. If you do use AI please disclose its use in the cell below.

* Ensure that all code cells in your notebook are executed before submission, with the output clearly visible. If errors are encountered during evaluation, marks will only be awarded for tasks completed up to the error-producing cell. Any attempt to misrepresent the output, such as showing results not generated by the code, will be considered a violation of academic integrity, resulting in an automatic score of zero for the assignment.

* **Submit the completed and fully executed notebook file as your final submission**.






### Did you use any AI assistance to complete this assignment?
* *your response here*



# Background on fine-tuning LLMs

**Summary:**

1. **LLM Pretraining:**
   - Large Language Models (LLMs) are pretrained on extensive text corpora.
   - Llama 2 was pretrained on a dataset of 2 trillion tokens; BERT, by comparison, was trained on the much smaller BookCorpus and Wikipedia.
   - Pretraining is resource-intensive and time-consuming.

2. **Auto-Regressive Prediction:**
   - Llama 2, an auto-regressive model, predicts the next token in a sequence.
   - A pretrained auto-regressive model only continues text; it is not inherently good at following instructions, which is why instruction tuning is needed.

3. **Fine-Tuning Techniques:**
   - Instruction tuning uses two main fine-tuning techniques:
     a. Supervised Fine-Tuning (SFT): Trained on instruction-response datasets, minimizing differences between generated and actual responses.
     b. Reinforcement Learning from Human Feedback (RLHF): Trained to maximize rewards based on human evaluations.

4. **RLHF vs. SFT:**
   - RLHF captures complex human preferences but requires careful reward system design and consistent human feedback.
   - Direct Preference Optimization (DPO) might be a future alternative to RLHF.
   - SFT can be highly effective when the model hasn't encountered specific data during pretraining.
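
The SFT objective described above ("minimizing differences between generated and actual responses") is usually expressed as the average negative log-likelihood of the reference tokens. The following is a minimal sketch of that quantity, assuming we already have the probability the model assigned to each reference token; the probability values are made up for illustration and do not come from any real model.

```python
import math


def sequence_nll(token_probs):
    """Average negative log-likelihood of the reference tokens.

    token_probs[i] is the probability the model assigned to the i-th
    reference token, given all previous reference tokens.
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


# Hypothetical probabilities for a 4-token reference response.
confident = [0.9, 0.8, 0.95, 0.85]
uncertain = [0.3, 0.2, 0.4, 0.25]

# A lower value means the model agrees more with the reference response.
print(round(sequence_nll(confident), 4))
print(round(sequence_nll(uncertain), 4))
```

Fine-tuning lowers this value on the instruction-response pairs, which is what the `eval_loss` reported later in this notebook tracks (there with label smoothing applied during training).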

# Fine-tuning BART for summarization: A detailed Example

This notebook walks through an example of fine-tuning [BART](https://huggingface.co/transformers/model_doc/bart.html) to generate summaries of article sections from the [WikiLingua](https://huggingface.co/datasets/wiki_lingua) dataset. WikiLingua is a multilingual collection of articles. We start with an English model from the [Hugging Face Model Hub](https://huggingface.co/models): we will use the **English** portion of WikiLingua with the [facebook/bart-large-cnn](https://huggingface.co/facebook/bart-large-cnn) BART checkpoint.

Please go through this example to understand the whole process of fine tuning a model.

## Setup

---

In [None]:
! pip install transformers
! pip install datasets
! pip install sentencepiece
! pip install rouge_score
! pip install wandb



In [None]:
import torch
import numpy as np
import datasets

from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
    DataCollatorForSeq2Seq,
)

from tabulate import tabulate
import nltk
from datetime import datetime

## Model and tokenizer
Download model and tokenizer. Use default parameters or try custom values (see [HF Bart configuration](https://huggingface.co/transformers/_modules/transformers/configuration_bart.html)).

In [None]:
language = "english"
model_name = "facebook/bart-large-cnn"

model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Set model parameters or use the default
# print(model.config)

# tokenization
encoder_max_length = 256
decoder_max_length = 64

## Data

For demonstration, we are only using a small portion of the data.

In [None]:
data = datasets.load_dataset("wiki_lingua", name=language, split="train[:2000]")

# Take a look at the data
for k, v in data["article"][0].items():
    print(k)
    print(v)

section_name
['Finding Other Transportation', 'Designating a Driver', 'Staying Safe']
document
['make sure that the area is a safe place, especially if you plan on walking home at night.  It’s always a good idea to practice the buddy system.  Have a friend meet up and walk with you. Research the bus, train, or streetcar routes available in your area to find safe and affordable travel to your destination.  Make sure you check the schedule for your outgoing and return travel.  Some public transportation will cease to run late at night.  Be sure if you take public transportation to the venue that you will also be able to get home late at night. Check the routes.  Even if some public transit is still running late at night, the routing may change.  Some may run express past many of the stops, or not travel all the way to the ends.  Be sure that your stop will still be available when you need it for your return trip. If you are taking public transit in a vulnerable state after drinking, it i

### Prepare

**Format and split into train and validation sets**

In [None]:
def flatten(example):
    return {
        "document": example["article"]["document"],
        "summary": example["article"]["summary"],
    }

def list2samples(example):
    documents = []
    summaries = []
    for sample in zip(example["document"], example["summary"]):
        if len(sample[0]) > 0:
            documents += sample[0]
            summaries += sample[1]
    return {"document": documents, "summary": summaries}

dataset = data.map(flatten, remove_columns=["article", "url"])
dataset = dataset.map(list2samples, batched=True)

train_data_txt, validation_data_txt = dataset.train_test_split(test_size=0.1).values()
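
To see what the batched flattening step does, here is a small stand-alone sketch on plain Python dictionaries. The function mirrors `list2samples` above; the toy batch and the name `list2samples_toy` are made up for illustration. Each article holds a list of section documents and a parallel list of summaries, and the function turns them into one flat list of (document, summary) pairs while skipping empty articles.

```python
def list2samples_toy(batch):
    """Flatten per-article lists of sections into one sample per section."""
    documents, summaries = [], []
    for docs, sums in zip(batch["document"], batch["summary"]):
        if len(docs) > 0:  # skip articles that have no sections
            documents += docs
            summaries += sums
    return {"document": documents, "summary": summaries}


# Toy batch: three articles with 2, 0, and 1 sections respectively.
batch = {
    "document": [["doc A1", "doc A2"], [], ["doc B1"]],
    "summary": [["sum A1", "sum A2"], [], ["sum B1"]],
}

flat = list2samples_toy(batch)
print(flat["document"])  # ['doc A1', 'doc A2', 'doc B1']
print(flat["summary"])   # ['sum A1', 'sum A2', 'sum B1']
```

This is why the number of training samples ends up larger than the number of articles loaded: a batched `map` is allowed to return more rows than it received.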

**Preprocess and tokenize**

In [None]:
def batch_tokenize_preprocess(batch, tokenizer, max_source_length, max_target_length):
    source, target = batch["document"], batch["summary"]
    source_tokenized = tokenizer(
        source, padding="max_length", truncation=True, max_length=max_source_length
    )
    target_tokenized = tokenizer(
        target, padding="max_length", truncation=True, max_length=max_target_length
    )

    batch = {k: v for k, v in source_tokenized.items()}
    # Ignore padding in the loss
    batch["labels"] = [
        [-100 if token == tokenizer.pad_token_id else token for token in l]
        for l in target_tokenized["input_ids"]
    ]
    return batch

train_data = train_data_txt.map(
    lambda batch: batch_tokenize_preprocess(
        batch, tokenizer, encoder_max_length, decoder_max_length
    ),
    batched=True,
    remove_columns=train_data_txt.column_names,
)

validation_data = validation_data_txt.map(
    lambda batch: batch_tokenize_preprocess(
        batch, tokenizer, encoder_max_length, decoder_max_length
    ),
    batched=True,
    remove_columns=validation_data_txt.column_names,
)

Map:   0%|          | 0/4351 [00:00<?, ? examples/s]

Map:   0%|          | 0/484 [00:00<?, ? examples/s]
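
The "ignore padding in the loss" step in `batch_tokenize_preprocess` can be illustrated without a tokenizer. In this sketch the token ids and the pad id of 1 are assumed values chosen for illustration; `-100` is the default `ignore_index` of PyTorch's cross-entropy loss, so positions carrying that label do not contribute to the loss.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss
PAD_ID = 1           # assumed pad token id, for illustration only


def mask_padding(label_ids, pad_id=PAD_ID, ignore_index=IGNORE_INDEX):
    """Replace every pad token id in the labels with the ignore index."""
    return [
        [ignore_index if token == pad_id else token for token in seq]
        for seq in label_ids
    ]


# Two made-up target sequences, padded to length 6.
labels = [
    [0, 713, 16, 2, 1, 1],
    [0, 100, 2, 1, 1, 1],
]

print(mask_padding(labels))
# [[0, 713, 16, 2, -100, -100], [0, 100, 2, -100, -100, -100]]
```

Without this masking, the model would also be rewarded for predicting padding, which would distort the loss for short summaries.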

In [None]:
!pip install evaluate



## Training

---

### Metrics

In [None]:
import evaluate
import nltk

nltk.download("punkt_tab")
metric = evaluate.load("rouge")

def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [label.strip() for label in labels]

    # rougeLSum expects newline after each sentence
    preds = ["\n".join(nltk.sent_tokenize(pred)) for pred in preds]
    labels = ["\n".join(nltk.sent_tokenize(label)) for label in labels]

    return preds, labels


def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(
        predictions=decoded_preds, references=decoded_labels, use_stemmer=True
    )
    # Copy all ROUGE scores into a plain dict
    result = {key: value for key, value in result.items()}

    prediction_lens = [
        np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds
    ]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
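
For intuition about what the ROUGE numbers mean, here is a simplified sketch of ROUGE-1 F1 as unigram overlap. It is not the implementation behind `evaluate.load("rouge")`: it assumes lowercase whitespace tokenization and skips stemming, and the function name `rouge1_f1` is made up for this example.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between a predicted and a reference summary."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("the cat sat on the mat", "the cat lay on the mat")
print(round(score, 4))  # 0.8333 (5 of 6 unigrams match on both sides)
```

ROUGE-2 applies the same idea to bigrams, and ROUGE-L uses the longest common subsequence instead of n-gram counts.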


### Training arguments

In [None]:
training_args = Seq2SeqTrainingArguments(
    output_dir="results",
    num_train_epochs=1,
    do_train=True,
    do_eval=True,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    # learning_rate=3e-05,
    warmup_steps=500,
    weight_decay=0.1,
    label_smoothing_factor=0.1,
    predict_with_generate=True,
    logging_dir="logs",
    logging_steps=50,
    save_total_limit=3,
)

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_data,
    eval_dataset=validation_data,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)



### Train

Evaluate before fine-tuning

In [None]:
trainer.evaluate()

{'eval_loss': 3.9830331802368164,
 'eval_model_preparation_time': 0.0125,
 'eval_rouge1': 0.2613,
 'eval_rouge2': 0.0682,
 'eval_rougeL': 0.176,
 'eval_rougeLsum': 0.2435,
 'eval_gen_len': 68.9215,
 'eval_runtime': 258.2313,
 'eval_samples_per_second': 1.874,
 'eval_steps_per_second': 0.469}

Train the model

In [None]:
trainer.train()

| Step | Training Loss |
|------|---------------|
| 50   | 3.9113        |
| 100  | 3.7206        |
| 150  | 3.6475        |
| 200  | 3.6618        |
| 250  | 3.7122        |
| 300  | 3.6927        |
| 350  | 3.6943        |
| 400  | 3.6464        |
| 450  | 3.6962        |
| 500  | 3.7457        |




TrainOutput(global_step=1088, training_loss=3.6704022884368896, metrics={'train_runtime': 806.2655, 'train_samples_per_second': 5.396, 'train_steps_per_second': 1.349, 'total_flos': 2357268030947328.0, 'train_loss': 3.6704022884368896, 'epoch': 1.0})

Evaluate after fine-tuning

In [None]:
trainer.evaluate()

{'eval_loss': 3.516435384750366,
 'eval_model_preparation_time': 0.0125,
 'eval_rouge1': 0.3438,
 'eval_rouge2': 0.1322,
 'eval_rougeL': 0.2544,
 'eval_rougeLsum': 0.3314,
 'eval_gen_len': 64.4773,
 'eval_runtime': 217.4486,
 'eval_samples_per_second': 2.226,
 'eval_steps_per_second': 0.556,
 'epoch': 1.0}

## Evaluation

---

**Generate summaries from the fine-tuned model and compare them with those generated from the original, pre-trained one.**

In [None]:
def generate_summary(test_samples, model):
    inputs = tokenizer(
        test_samples["document"],
        padding="max_length",
        truncation=True,
        max_length=encoder_max_length,
        return_tensors="pt",
    )
    input_ids = inputs.input_ids.to(model.device)
    attention_mask = inputs.attention_mask.to(model.device)
    outputs = model.generate(input_ids, attention_mask=attention_mask)
    output_str = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    return outputs, output_str


model_before_tuning = AutoModelForSeq2SeqLM.from_pretrained(model_name)

test_samples = validation_data_txt.select(range(16))

summaries_before_tuning = generate_summary(test_samples, model_before_tuning)[1]
summaries_after_tuning = generate_summary(test_samples, model)[1]

In [None]:
print(
    tabulate(
        zip(
            range(len(summaries_after_tuning)),
            summaries_after_tuning,
            summaries_before_tuning,
        ),
        headers=["Id", "Summary after", "Summary before"],
    )
)
print("\nTarget summaries:\n")
print(
    tabulate(list(enumerate(test_samples["summary"])), headers=["Id", "Target summary"])
)
print("\nSource documents:\n")
print(tabulate(list(enumerate(test_samples["document"])), headers=["Id", "Document"]))

 

# Task 1 [40 Points]: Fine Tuning

You might need some luck for this task.
Look at the list given below:
*   0: Spanish
*   1: Portuguese
*   2: French
*   3: German
*   4: Russian
*   5: Italian
*   6: Indonesian
*   7: Dutch
*   8: Arabic
*   9: Vietnamese

Your student ID has 5 digits. Take the 2nd and 5th digits and pick the matching languages from this list. Load the datasets for both languages from WikiLingua, choose 2 models from Hugging Face (one for the first language and one for the second), and fine-tune them.

You will also see that the amount of data differs across languages. To make it fair for everyone, shuffle the data and then pick 10k samples per language for fine-tuning. Also, while training, set the number of epochs to 3.
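
One possible way to do the "shuffle, then pick 10k" step reproducibly is sketched below using only the standard library. The function name `sample_indices`, the seed, and the dataset size are assumptions made for illustration; whether the 10k samples are counted before or after flattening articles into sections is not specified here, so check with the course staff if in doubt.

```python
import random


def sample_indices(dataset_size, n_samples=10_000, seed=42):
    """Return up to n_samples distinct row indices in random order."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    indices = list(range(dataset_size))
    rng.shuffle(indices)
    return indices[: min(n_samples, dataset_size)]


# Hypothetical dataset with 25,000 rows.
picked = sample_indices(25_000)
print(len(picked), len(set(picked)))  # 10000 10000
```

With a Hugging Face `Dataset`, the returned indices can be passed to `dataset.select(...)`, or the same effect can be achieved with `dataset.shuffle(seed=...)` followed by `select(range(10_000))`.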

Please note that you can't pick bart-large for this, and the two models you pick must be different. It is also possible that the 2nd and 5th digits of your student ID are the same. In that case, use the 3rd or 4th digit of your student ID to pick the second language, making sure it differs from the language you already picked as the first.
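
The selection rule above can be written down as a small helper. This is only a sketch: the function name `pick_languages` is made up, the example IDs are hypothetical, and trying the 3rd digit before the 4th is an assumption, since the rule allows either.

```python
LANGUAGES = [
    "Spanish", "Portuguese", "French", "German", "Russian",
    "Italian", "Indonesian", "Dutch", "Arabic", "Vietnamese",
]


def pick_languages(student_id):
    """Map a 5-digit student ID to (first_language, second_language)."""
    digits = [int(ch) for ch in student_id]
    if len(digits) != 5:
        raise ValueError("expected a 5-digit student ID")
    first = digits[1]  # 2nd digit
    # 5th digit first; fall back to the 3rd, then the 4th (assumed order).
    for candidate in (digits[4], digits[2], digits[3]):
        if candidate != first:
            return LANGUAGES[first], LANGUAGES[candidate]
    raise ValueError("digits 2 to 5 are all identical; ask the course staff")


print(pick_languages("12345"))  # ('French', 'Italian')
print(pick_languages("17347"))  # ('Dutch', 'German')
```

In the second example the 2nd and 5th digits are both 7, so the 3rd digit (3, German) is used for the second language.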

After fine-tuning, save your models to your drive as ModelName_LanguageName_BeforeAugmentation, replacing ModelName with the name of the model and LanguageName with the name of the language.

What is your student ID and which languages did you pick?

_Write your answer here_

In [None]:
# Write your code for Task 1 here. You can utilize the functions given earlier for this as well.

# Task 2 [20 Points]: Evaluation
You already saw the ROUGE score used for evaluation earlier. As we saw in class, it isn't the only evaluation metric. Implement 2 more evaluation metrics: BLEU score and BERTScore. Evaluate both models with all three metrics (ROUGE, BLEU, and BERTScore), i.e. 3 evaluation criteria for each model.
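
As a reminder of what BLEU measures, here is a simplified sentence-level sketch: the geometric mean of clipped n-gram precisions multiplied by a brevity penalty. It is meant for intuition only and is not a replacement for a library implementation. The function name `simple_bleu` is made up, it assumes whitespace tokenization and a single reference, and it uses n-grams up to 2 where BLEU is commonly computed with n-grams up to 4.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(prediction, reference, max_n=2):
    """Geometric mean of clipped n-gram precisions times a brevity penalty."""
    pred, ref = prediction.split(), reference.split()
    if not pred:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        pred_counts, ref_counts = ngrams(pred, n), ngrams(ref, n)
        # Clip each predicted n-gram count by its count in the reference.
        clipped = sum((pred_counts & ref_counts).values())
        total = sum(pred_counts.values())
        if clipped == 0 or total == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(clipped / total))
    # Penalize predictions that are shorter than the reference.
    if len(pred) >= len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(pred))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


bleu = simple_bleu("the cat sat on the mat", "the cat lay on the mat")
print(round(bleu, 4))  # 0.7071 (unigram precision 5/6, bigram precision 3/5)
```

Unlike ROUGE, which is recall-oriented, BLEU is precision-oriented, so the two metrics can disagree on summaries that are much shorter or longer than the target.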


In [None]:
# Write your code for Task 2 here. You can utilize the functions given earlier for this as well.

# Task 3 [40 points]: Data Augmentation
There are various ways to augment data, depending on the type of data. Since LLMs deal mostly with text, let's talk about text data augmentation.

The most prominent methods for augmenting text data are these:

*   Word or sentence shuffling: randomly change the position of a word or sentence.
*   Synonym replacement: replace words with synonyms.
*   Syntax-tree manipulation: paraphrase the sentence using the same words, e.g. "the cat sat on the mat" becomes "on the mat, the cat sat".
*   Random word insertion: insert words at random.
*   Random word deletion: delete words at random.
*   Antonym replacement: randomly choose n words from the sentence that are not stop words, and replace each of them with one of its antonyms chosen at random.
*   Backtranslation: translate your data into some other language and then translate it back into the original language.
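
Two of the simpler methods listed above, random word deletion and word shuffling (here as pairwise swaps), can be sketched with the standard library alone. The function names, the default probability, and the seeds below are assumptions made for illustration; this is a hand-written sketch and not how any particular augmentation library implements these methods.

```python
import random


def random_deletion(sentence, p=0.2, seed=None):
    """Drop each word with probability p, keeping at least one word."""
    rng = random.Random(seed)
    words = sentence.split()
    if not words:
        return sentence
    kept = [w for w in words if rng.random() > p]
    # Never return an empty sentence: fall back to one random word.
    return " ".join(kept) if kept else rng.choice(words)


def random_swap(sentence, n_swaps=1, seed=None):
    """Swap the positions of two randomly chosen words, n_swaps times."""
    rng = random.Random(seed)
    words = sentence.split()
    if len(words) < 2:
        return sentence
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return " ".join(words)


text = "the cat sat on the mat"
print(random_deletion(text, p=0.3, seed=0))
print(random_swap(text, n_swaps=2, seed=0))
```

Passing a seed makes each augmented sample reproducible, which helps when comparing models trained before and after augmentation.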
  
There are various ways to implement these augmentation methods. You can write your own functions and utilize a dictionary library (for synonym/antonym replacement and word insertion). For backtranslation, you could use a translation library. But that is too time-consuming to implement. Another possible approach is feeding your data to a large model for augmentation, but that is costly and requires a really good prompt so that the model doesn't hallucinate. Moreover, all these tasks have been done by someone at some point. So why not use their standard code?

So let me introduce you to [nlpaug](https://nlpaug.readthedocs.io/en/latest/). This library is used for data augmentation for not just text but for audio and images as well. You will use it to augment your data. There are 7 methods listed above. You will use backtranslation to augment the data for both languages. From the remaining 6, use any 3 methods to augment the data for the first language and the other 3 methods for the second language. For each method, generate 10 new samples.

After completing the augmentation, train your models again for 3 epochs and evaluate them against all 3 evaluation metrics.

After fine-tuning, save your models to your drive as ModelName_LanguageName_AfterAugmentation, replacing ModelName with the name of the model and LanguageName with the name of the language.




In [None]:
# Write your code for Task 3 here. You can utilize the functions given earlier for this as well.