
## <a name="0">Lab 2: Finetune a Pretrained Model </a>
    
In this notebook, we will explore finetuning a pretrained large language model (LLM), a powerful technique in generative AI. LLMs are pretrained on enormous amounts of data, which makes them highly effective at understanding the nuances of language and generating coherent responses. During pretraining they learn useful features and patterns from the data, making them a valuable starting point for many machine learning tasks.

Finetuning, a form of transfer learning, lets us leverage the knowledge captured by a pretrained model and apply it to a different but related task. Instead of training a model from scratch, we start with a pretrained model and adapt it to our specific problem domain. This approach not only saves significant computational resources but also benefits from the generalization capabilities of the pretrained model.

In this notebook, we will walk through the process of finetuning a pretrained model step by step. We will cover the following key aspects:

1. <a href="#1">Import libraries</a>
2. <a href="#2">Prepare the training dataset</a>
3. <a href="#3">Load a pretrained LLM</a>
4. <a href="#4">Define the trainer and finetune the LLM</a>
5. <a href="#5">Inference with the finetuned model</a>
6. <a href="#6">Quizzes</a>


Please work through this notebook from top to bottom and don't skip sections, as skipping cells could lead to errors caused by missing code.

---

You will be presented with two kinds of exercises throughout the notebook: activities and challenges. <br/>

| Activity | Challenge |
|---|---|
|<p style="text-align:center;">No coding is needed for an activity. You try to understand a concept, <br/>answer questions, or run a code cell.</p> |<p style="text-align:center;">Challenges are where you test your understanding by taking a short quiz.</p> |

----        

In [None]:
!unzip LLMLab2.zip

In [None]:
!rm -rf LLMLab2.zip

In [None]:
cd LLMLab2

In [None]:
%%capture
!pip3 install -r requirements.txt --quiet

### <a name="1">Import libraries</a>
(<a href="#0">Go to top</a>)

In [None]:
%%capture

import os
import numpy as np
import pandas as pd
from typing import Any, Dict, List, Tuple, Union
from datasets import Dataset, load_dataset, disable_caching
disable_caching() ## disable huggingface cache

from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer
from transformers import pipeline

import torch
from transformers import TrainingArguments, Trainer
import accelerate  # device_map="auto" requires accelerate to be installed
import bitsandbytes  # only needed for 8-bit loading (load_in_8bit=True)

from IPython.display import Markdown

### <a name="2">Prepare the training dataset</a>
(<a href="#0">Go to top</a>)

Next, let's load and view the dataset. We will use the [Amazon SageMaker FAQs](https://aws.amazon.com/sagemaker/faqs/) as our main dataset. The dataset has two columns: `instruction` and `response`.

In [None]:
sagemaker_faqs_dataset = load_dataset("csv",
                                      data_files='data/amazon_sagemaker_faqs.csv')['train']
sagemaker_faqs_dataset

Dataset({
    features: ['instruction', 'response'],
    num_rows: 83
})

In [None]:
sagemaker_faqs_dataset[0]

{'instruction': 'Why should I use Amazon SageMaker Studio Lab?',
 'response': 'Amazon SageMaker Studio Lab is for students, researchers, and data scientists who need a free notebook development environment with no setup required for their ML classes and experiments. SageMaker Studio Lab is ideal for users who do not need a production environment but still want a subset of the SageMaker functionality to improve their ML skills. SageMaker sessions are automatically saved, enabling users to pick up where they left off for each user session.'}

---

To finetune our LLM, we need to wrap each record of our instruction dataset in a prompt template like the one below.

In [None]:
prompt_template = """Below is an instruction that describes a task. Write a response that appropriately completes the request. Instruction: {instruction}\n Response:"""
answer_template = """{response}"""

Markdown(prompt_template + answer_template)

Below is an instruction that describes a task. Write a response that appropriately completes the request. Instruction: {instruction}
 Response:{response}

Let's apply the templates to our dataset with the function `_add_text` below. It takes a record as input and first checks that both the `instruction` and `response` fields are non-empty; if either is empty, it raises a `ValueError` with a corresponding error message. Otherwise, it formats the instruction with `prompt_template` into a new `prompt` field, formats the response with `answer_template` into a new `answer` field, and concatenates the two into a new `text` field.

In [None]:
def _add_text(rec):
    instruction = rec["instruction"]
    response = rec["response"]

    if not instruction:
        raise ValueError(f"Expected an instruction in: {rec}")

    if not response:
        raise ValueError(f"Expected a response in: {rec}")

    rec["prompt"] = prompt_template.format(instruction=instruction)
    rec["answer"] = answer_template.format(response=response)
    rec["text"] = rec["prompt"] + rec["answer"]

    return rec

In [None]:
sagemaker_faqs_dataset = sagemaker_faqs_dataset.map(_add_text)
sagemaker_faqs_dataset[0]

{'instruction': 'Why should I use Amazon SageMaker Studio Lab?',
 'response': 'Amazon SageMaker Studio Lab is for students, researchers, and data scientists who need a free notebook development environment with no setup required for their ML classes and experiments. SageMaker Studio Lab is ideal for users who do not need a production environment but still want a subset of the SageMaker functionality to improve their ML skills. SageMaker sessions are automatically saved, enabling users to pick up where they left off for each user session.',
 'prompt': 'Below is an instruction that describes a task. Write a response that appropriately completes the request. Instruction: Why should I use Amazon SageMaker Studio Lab?\n Response:',
 'answer': 'Amazon SageMaker Studio Lab is for students, researchers, and data scientists who need a free notebook development environment with no setup required for their ML classes and experiments. SageMaker Studio Lab is ideal for users who do not need a produ

Use `Markdown` to neatly display the full prompt-formatted `text` field.

In [None]:
Markdown(sagemaker_faqs_dataset[0]['text'])

Below is an instruction that describes a task. Write a response that appropriately completes the request. Instruction: Why should I use Amazon SageMaker Studio Lab?
 Response:Amazon SageMaker Studio Lab is for students, researchers, and data scientists who need a free notebook development environment with no setup required for their ML classes and experiments. SageMaker Studio Lab is ideal for users who do not need a production environment but still want a subset of the SageMaker functionality to improve their ML skills. SageMaker sessions are automatically saved, enabling users to pick up where they left off for each user session.

---

### <a name="3">Load a pretrained LLM</a>
(<a href="#0">Go to top</a>)

---
As in Lab 1, we will continue working with `Dolly-v2-3b`, the ~3-billion-parameter pretrained model from the [Dolly](https://github.com/databrickslabs/dolly) family. The Dolly v2 models are derived from EleutherAI's Pythia models (`dolly-v2-3b` is based on Pythia-2.8b) and are fine-tuned on a [~15K-record instruction corpus](https://huggingface.co/datasets/databricks/databricks-dolly-15k) generated by Databricks employees and released under a permissive license (CC-BY-SA).

First, let's initialize a tokenizer and a base model for `Dolly-v2-3b` using the Hugging Face Transformers library. The tokenizer converts raw text into tokens, and the base model generates text based on a given prompt.


The `AutoTokenizer.from_pretrained()` function is used to instantiate the tokenizer.
- `padding_side` controls the side of each sequence where padding tokens are added. Passing `padding_side="left"` adds them to the left, which is the usual choice for batched generation with decoder-only models; the cell below does not pass it, so the tokenizer keeps its default padding side.
- Dolly's tokenizer does not come with a dedicated padding token, so we reuse the `eos_token` (the special token that marks the end of a sequence) as the `pad_token`. Padding tokens then share the end-of-sequence id, and the attention mask tells the model which positions are padding.

After execution, the `tokenizer` object will be initialized and ready to use for tokenizing text.
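To see what the padding side changes, here is a pure-Python sketch on made-up token ids. It does not use the real tokenizer; `pad_batch` and `TOY_PAD_ID` are hypothetical names chosen for illustration.

```python
# Toy values (assumptions): made-up token ids; TOY_PAD_ID stands in for the pad/eos id.
TOY_PAD_ID = 0

def pad_batch(sequences, pad_id, side="right"):
    """Pad id lists to the longest length on the given side and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        if side == "left":
            input_ids.append([pad_id] * n_pad + seq)
            attention_mask.append([0] * n_pad + [1] * len(seq))
        else:
            input_ids.append(seq + [pad_id] * n_pad)
            attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

toy_batch = [[11, 12, 13], [21]]
right_ids, right_mask = pad_batch(toy_batch, TOY_PAD_ID, side="right")
left_ids, left_mask = pad_batch(toy_batch, TOY_PAD_ID, side="left")
print(right_ids)  # [[11, 12, 13], [21, 0, 0]]
print(left_ids)   # [[11, 12, 13], [0, 0, 21]]

# With left padding, the last position of every row holds a real token, so a
# decoder-only model can append its next generated token right after it.
print([row[-1] for row in left_ids])   # [13, 21]
print([row[-1] for row in right_ids])  # [13, 0]
```

With right padding, the shorter row ends in padding, so newly generated tokens would follow pad tokens instead of the prompt.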

In [None]:
model_id = "databricks/dolly-v2-3b"
tokenizer = AutoTokenizer.from_pretrained(model_id)

tokenizer.pad_token = tokenizer.eos_token

Now we download and initialize a base model using the `AutoModelForCausalLM` class provided by the Transformers library.

The `AutoModelForCausalLM.from_pretrained()` function is used to instantiate the base model.
- `use_cache=False` (commented out in the cell below) would disable the key/value cache that the model uses to speed up text generation. The cache is not needed during training, so we turn it off later with `model.config.use_cache = False` right before training.
- `device_map="auto"` lets the library (via `accelerate`) decide where to place the model weights, using the available GPU(s) first and falling back to the CPU if needed.
- `load_in_8bit=True` would load the model weights in 8-bit precision (via `bitsandbytes`) to reduce memory usage. Here it is set to `False`.
- `torch_dtype=torch.float16` loads the weights in half precision, which halves their memory footprint compared to full precision (fp32).

After loading, the `resize_token_embeddings()` method (called in a later cell) resizes the model's token embeddings to match the tokenizer's vocabulary size (`len(tokenizer)`), so that every token id the tokenizer produces has a corresponding embedding.

After execution, the `model` object will be initialized and ready to use for generating text.
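A back-of-the-envelope calculation shows why the precision choice matters. The sketch below counts weight memory only and assumes ~2.78B base parameters, taken as the "all params" total reported by `print_trainable_parameters()` later in this lab minus the LoRA parameters; `weight_memory_gb` is a hypothetical helper.

```python
# Back-of-the-envelope weight memory for dolly-v2-3b at different precisions.
# Assumption: base parameter count = "all params" reported later by
# print_trainable_parameters() (2,858,972,160) minus the LoRA parameters (83,886,080).
BASE_PARAMS = 2_858_972_160 - 83_886_080

BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}

def weight_memory_gb(num_params, dtype):
    """Memory for the weights alone in GB (1e9 bytes); excludes activations,
    gradients, optimizer state, and quantization overhead."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gb(BASE_PARAMS, dtype):.2f} GB")
# float32: 11.10 GB
# float16: 5.55 GB
# int8: 2.78 GB
```

Training needs considerably more than this (gradients, optimizer state, activations), which is one reason LoRA only trains a small set of extra weights.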

In [None]:
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    # use_cache=False,
    device_map="auto", #"balanced",
    load_in_8bit=False,
    torch_dtype=torch.float16
)

#### Prepare model for training
First, we resize the token embeddings to match the tokenizer. Some further pre-processing is needed before training the model with PEFT: later in this notebook we call the PEFT utility function `prepare_model_for_int8_training`. Although this lab loads the model in fp16 rather than int8, the function applies the same preparation steps:

- Cast all non-`int8` modules to full precision (fp32) for stability
- Add a forward hook to the input embedding layer to enable gradient computation of the input hidden states
- Enable gradient checkpointing for more memory-efficient training

In [None]:
model.resize_token_embeddings(len(tokenizer))

Embedding(50280, 2560)

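We use the `_preprocess_batch` function to tokenize the `text` field of each batch, truncating and padding every sequence to exactly `MAX_LENGTH` tokens. It takes a batch of data as input and uses the `tokenizer` and `MAX_LENGTH` defined in the notebook. For causal language modeling, the labels are simply a copy of the `input_ids`: the model shifts them internally so that each position learns to predict the next token.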

In [None]:
import copy

MAX_LENGTH = 256

# Tokenize the "text" field into token ids, truncating/padding every example to MAX_LENGTH
def _preprocess_batch(batch: Dict[str, List]):
    model_inputs = tokenizer(batch["text"], max_length=MAX_LENGTH, truncation=True, padding='max_length')
    # response_ids = tokenizer(batch["answer"], max_length=MAX_LENGTH, truncation=True, padding='max_length')

    # model_inputs["labels"] = response_ids["input_ids"]

    # Causal LM: labels are a copy of input_ids (the model shifts them internally)
    model_inputs["labels"] = copy.deepcopy(model_inputs['input_ids'])
    return model_inputs

# No extra arguments need to be bound, so the function is passed to `map` as-is
_preprocessing_function = _preprocess_batch
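To make truncation, `padding='max_length'`, and the copied labels concrete, here is a pure-Python sketch on made-up token ids. It does not call the real tokenizer: `toy_preprocess`, `mask_padding_labels`, `TOY_MAX_LENGTH`, and `TOY_PAD_ID` are hypothetical, and right-side padding is assumed. The second helper shows a common variant that this lab does not use: replacing padded label positions with `-100`, the index PyTorch's cross-entropy loss ignores by default, so the model is not trained to predict padding.

```python
import copy

# Toy values (assumptions): made-up token ids, TOY_PAD_ID stands in for the eos/pad id,
# and right-side padding is assumed.
TOY_MAX_LENGTH = 8
TOY_PAD_ID = 0

def toy_preprocess(token_ids, max_length=TOY_MAX_LENGTH, pad_id=TOY_PAD_ID):
    """Mimic truncation=True, padding="max_length", and labels = copy of input_ids."""
    ids = token_ids[:max_length]                      # truncate
    n_pad = max_length - len(ids)
    input_ids = ids + [pad_id] * n_pad                # pad to exactly max_length
    attention_mask = [1] * len(ids) + [0] * n_pad     # 0 marks padding
    labels = copy.deepcopy(input_ids)                 # causal LM: labels mirror inputs
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

def mask_padding_labels(record, ignore_index=-100):
    """Variant NOT used in this lab: hide padded positions from the loss with -100."""
    return [tok if m == 1 else ignore_index
            for tok, m in zip(record["labels"], record["attention_mask"])]

short_rec = toy_preprocess([5, 6, 7])
long_rec = toy_preprocess(list(range(1, 13)))
print(short_rec["input_ids"])  # [5, 6, 7, 0, 0, 0, 0, 0]
print(long_rec["input_ids"])   # [1, 2, 3, 4, 5, 6, 7, 8]
# Every record ends up with exactly TOY_MAX_LENGTH ids, so a "<= max length" filter keeps all.
print(len(short_rec["input_ids"]) == len(long_rec["input_ids"]) == TOY_MAX_LENGTH)  # True
print(mask_padding_labels(short_rec))  # [5, 6, 7, -100, -100, -100, -100, -100]
```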

Next, we apply the preprocessing function to the dataset in batched mode. This adds the `input_ids`, `attention_mask`, and `labels` fields and removes the `instruction`, `response`, `prompt`, and `answer` columns (the `text` column is kept). Finally, `processed_dataset` is created by filtering `encoded_sagemaker_faqs_dataset` to keep only records whose `input_ids` have at most `MAX_LENGTH` tokens. Because every record was already truncated and padded to exactly `MAX_LENGTH` tokens, this filter acts as a safety check and keeps all 83 records.

In [None]:
encoded_sagemaker_faqs_dataset = sagemaker_faqs_dataset.map(
        _preprocessing_function,
        batched=True,
        remove_columns=["instruction", "response", "prompt", "answer"],
)

processed_dataset = encoded_sagemaker_faqs_dataset.filter(lambda rec: len(rec["input_ids"]) <= MAX_LENGTH)

Let's split the dataset into `train` and `test` sets, holding out 14 examples for evaluation.

In [None]:
split_dataset = processed_dataset.train_test_split(test_size=14, seed=0)
split_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 69
    })
    test: Dataset({
        features: ['text', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 14
    })
})



---

### <a name="4">Define the trainer and finetune the LLM</a>
(<a href="#0">Go to top</a>)

To finetune the model efficiently, we're going to use [LoRA: Low-Rank Adaptation](https://arxiv.org/abs/2106.09685). LoRA freezes the pretrained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. Compared to GPT-3 175B fine-tuned with Adam, LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times.


#### 1). Define the `LoraConfig` and load LoRA model

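We'll use the `LoraConfig` class from [huggingface 🤗 PEFT: State-of-the-art Parameter-Efficient Fine-Tuning](https://github.com/huggingface/peft). Within `LoraConfig`, let's specify the following parameters:

- `r`, the rank (inner dimension) of the low-rank matrices
- `lora_alpha`, the scaling factor for the low-rank update, which is scaled by `lora_alpha / r`
- `lora_dropout`, the dropout probability of the LoRA layers
- `bias="none"`, which keeps all bias terms frozen
- `task_type="CAUSAL_LM"`, which tells PEFT that we are adapting a causal language model
- `target_modules`, the modules that receive LoRA adapters; `query_key_value` is the fused attention projection in Dolly's GPT-NeoX architecture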
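The update LoRA learns can be written as h = W x + (lora_alpha / r) · B A x, where W is frozen and only A (r × d_in) and B (d_out × r) are trained. The NumPy sketch below illustrates this for a single linear layer with toy sizes, initialized as in the LoRA paper (Gaussian A, zero B). It is a simplified illustration that ignores `lora_dropout`, not PEFT's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumption); the lab itself uses r=256 and lora_alpha=512.
d_in, d_out, r, alpha = 16, 48, 4, 8

W = rng.normal(size=(d_out, d_in))            # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d_in))    # trainable, Gaussian init
B = np.zeros((d_out, r))                      # trainable, zero init

def lora_forward(x, W, A, B, alpha, r):
    """LoRA-adapted linear layer: h = W x + (alpha / r) * B (A x)."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)

# With B = 0 the adapter adds nothing, so training starts from the pretrained model.
print(np.allclose(lora_forward(x, W, A, B, alpha, r), W @ x))  # True

# After an update to B (random here), the low-rank delta can be merged into W.
B_trained = rng.normal(scale=0.01, size=(d_out, r))
merged_W = W + (alpha / r) * (B_trained @ A)
print(np.allclose(lora_forward(x, W, A, B_trained, alpha, r), merged_W @ x))  # True

# Trainable parameters: full matrix vs. the two LoRA factors.
print(d_out * d_in, r * (d_in + d_out))  # 768 256
```

Because the delta can be merged into W after training, LoRA adds no extra latency at inference time once merged.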


In [None]:
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training, TaskType, PeftConfig, PeftModel

# Note: these batch-size constants are not passed to the TrainingArguments below,
# which use per_device_train_batch_size=1 and no gradient accumulation.
MICRO_BATCH_SIZE = 8
BATCH_SIZE = 64
GRADIENT_ACCUMULATION_STEPS = BATCH_SIZE // MICRO_BATCH_SIZE
LORA_R = 256 # 512
LORA_ALPHA = 512 # 1024
LORA_DROPOUT = 0.05

# Define LoRA Config
lora_config = LoraConfig(
    r=LORA_R,                            # LoRA attention dimension (rank)
    lora_alpha=LORA_ALPHA,               # Alpha parameter for LoRA scaling
    lora_dropout=LORA_DROPOUT,           # Dropout probability
    bias="none",                         # Do not train bias terms
    task_type="CAUSAL_LM",
    target_modules=["query_key_value"],  # Fused attention projection in GPT-NeoX
)

Let's first apply `prepare_model_for_int8_training`, then use the `get_peft_model` function to wrap the model with LoRA adapters configured by `lora_config`. The original weights stay frozen, and only the adapter weights are trained.

In [None]:
# Apply PEFT's training preparation (the model here is loaded in fp16, not int8)
model = prepare_model_for_int8_training(model)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

trainable params: 83886080 || all params: 2858972160 || trainable%: 2.9341342029717423


As the output shows, the trainable LoRA parameters make up only ~3% of all parameters, which makes finetuning much more efficient.
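The reported number of trainable parameters can be reproduced by hand. The arithmetic below assumes the GPT-NeoX shapes of `dolly-v2-3b` (32 layers, hidden size 2560, and a fused `query_key_value` projection mapping 2560 → 7680 features); these shapes are an assumption, not something stated in this lab.

```python
# Assumed shapes for dolly-v2-3b (GPT-NeoX / Pythia-2.8b): 32 layers, hidden size 2560,
# and a fused query_key_value projection mapping 2560 -> 3 * 2560 features.
NUM_LAYERS = 32
HIDDEN_SIZE = 2560
rank = 256  # LORA_R in this lab

qkv_in, qkv_out = HIDDEN_SIZE, 3 * HIDDEN_SIZE
per_layer = rank * qkv_in + qkv_out * rank  # A: (r x in) plus B: (out x r)
lora_params = NUM_LAYERS * per_layer
print(lora_params)  # 83886080

all_params = 2_858_972_160  # "all params" reported by print_trainable_parameters()
print(round(100 * lora_params / all_params, 2))  # 2.93
```

Since the count grows linearly with `r`, the commented-out `LORA_R = 512` would double the number of trainable parameters.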

#### 2). Define the data collator

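A data collator is a 🤗 Transformers utility that takes a list of samples from a dataset and collates them into a batch, as a dictionary of PyTorch tensors. `DataCollatorForSeq2Seq` also pads the `labels`, using `-100` (ignored by the loss) for padded label positions.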

In [None]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(
        model = model, tokenizer=tokenizer, max_length=MAX_LENGTH, pad_to_multiple_of=8, padding='max_length')
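The sketch below is a simplified, pure-Python stand-in for what a seq2seq-style collator does. `toy_collate` is hypothetical, returns lists rather than PyTorch tensors, assumes right-side padding, and is not the 🤗 Transformers implementation. In this lab, every example is already padded to `MAX_LENGTH = 256` tokens (a multiple of 8), so the real collator mostly just stacks the examples, and padded label positions keep the pad id rather than `-100`.

```python
import math

def toy_collate(features, pad_id, pad_to_multiple_of=8, label_pad_id=-100):
    """Simplified seq2seq-style collator (hypothetical; returns lists, not tensors).

    Right-pads input_ids with pad_id, attention_mask with 0, and labels with
    label_pad_id, after rounding the batch length up to a multiple of pad_to_multiple_of.
    """
    max_len = max(len(f["input_ids"]) for f in features)
    if pad_to_multiple_of:
        max_len = math.ceil(max_len / pad_to_multiple_of) * pad_to_multiple_of
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_id] * n_pad)
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_pad)
        batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)
    return batch

toy_features = [
    {"input_ids": [5, 6, 7], "attention_mask": [1, 1, 1], "labels": [5, 6, 7]},
    {"input_ids": [9, 8, 7, 6, 5], "attention_mask": [1] * 5, "labels": [9, 8, 7, 6, 5]},
]
collated = toy_collate(toy_features, pad_id=0)
print(collated["input_ids"][0])  # [5, 6, 7, 0, 0, 0, 0, 0]
print(collated["labels"][0])     # [5, 6, 7, -100, -100, -100, -100, -100]
```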


#### 3). Define the trainer

To finetune the LLM, we need to define a trainer. Let's define the training arguments first.

In [None]:
EPOCHS = 3
LEARNING_RATE = 1e-4
MODEL_SAVE_FOLDER_NAME = "dolly-3b-lora"

training_args = TrainingArguments(
                    output_dir=MODEL_SAVE_FOLDER_NAME,
                    overwrite_output_dir=True,
                    fp16=True,
                    per_device_train_batch_size=1,
                    per_device_eval_batch_size=1,
                    learning_rate=LEARNING_RATE,
                    # optim="adafactor",
                    num_train_epochs=EPOCHS,
                    logging_strategy="epoch",
                    evaluation_strategy="epoch",
                    save_strategy="epoch",
)
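As a quick sanity check on training length, the arithmetic below estimates the number of optimizer steps. `optimizer_steps` is a hypothetical helper; it assumes a single training process, and the exact rounding the `Trainer` applies when gradient accumulation is enabled may differ between `transformers` versions.

```python
import math

def optimizer_steps(num_examples, per_device_batch, grad_accum=1, num_devices=1, epochs=1):
    """Approximate number of optimizer updates for a training run (hypothetical helper)."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_devices))
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)  # rounding may differ by version
    return updates_per_epoch * epochs

# Settings used in this lab: 69 training examples, batch size 1, no accumulation, 3 epochs.
print(optimizer_steps(69, per_device_batch=1, epochs=3))                # 207
# With the unused MICRO_BATCH_SIZE=8 and GRADIENT_ACCUMULATION_STEPS=8 (effective batch 64):
print(optimizer_steps(69, per_device_batch=8, grad_accum=8, epochs=3))  # 3
```

On a dataset this small, an effective batch of 64 would leave only about one update per epoch, whereas batch size 1 gives 69.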

Now is where the magic happens! Let's initialize the trainer with our model, tokenizer, training arguments, data collator, and the train/eval datasets.

The training may take ~15 minutes. Once the training is done, we save the finetuned model and tokenizer.

In [None]:
%%time
trainer = Trainer(
        model=model,
        tokenizer=tokenizer,
        args=training_args,
        train_dataset=split_dataset['train'],
        eval_dataset=split_dataset["test"],
        data_collator=data_collator,
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()

#### <a>Save the finetuned model</a>


After training finishes, we can save the model to a directory with its `save_pretrained` method (`trainer.model.save_pretrained`, commented out in the cell below). Because `trainer.model` is a 🤗 PEFT model, this only saves the incremental PEFT weights (`adapter_model.bin`) that were trained, which makes the result very efficient to store, transfer, and load.

In [None]:
# trainer.model.save_pretrained(MODEL_SAVE_FOLDER_NAME)

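Alternatively, `Trainer.save_model` saves `trainer.model` to the given folder together with the tokenizer and the training arguments (`training_args.bin`); depending on the installed `transformers` version, for a PEFT model it writes either only the adapter weights or the full state dict. The second line below also saves the model configuration in the same folder.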

In [None]:
trainer.save_model(MODEL_SAVE_FOLDER_NAME)
trainer.model.config.save_pretrained(MODEL_SAVE_FOLDER_NAME)

As you can see from the losses above, the training loss drops quickly while the validation loss does not improve. This gap means the model is overfitting to the small training set (69 examples) and does not generalize well to unseen examples; with so little data, the held-out questions can differ substantially from the ones seen during training.


### <a name="5">Inference using the finetuned model</a>
(<a href="#0">Go to top</a>)

Let's run inference with a 🤗 Transformers `text-generation` pipeline.

The `postprocess` function extracts the response text from the generated sequence. It splits the generated text on 'Response:' and returns everything after the first occurrence. If 'Response:' is not found, it raises a `ValueError`.

In [None]:
# Function to format the response and filter out the instruction from the response.
def postprocess(response):
    messages = response.split("Response:")
    # str.split always returns at least one element, so check that the marker was found
    if len(messages) < 2:
        raise ValueError("Invalid template for prompt. The template should include the term 'Response:'")
    return "".join(messages[1:])
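Splitting on every 'Response:' (as `postprocess` does) drops any later occurrence of the marker from the answer. A minimal alternative sketch, assuming you only want the text after the first marker, uses `str.partition`; `extract_response` and the sample string are hypothetical.

```python
def extract_response(generated_text, marker="Response:"):
    """Return the text after the first `marker`, stripped of surrounding whitespace."""
    _, found, after = generated_text.partition(marker)
    if not found:
        raise ValueError(f"Generated text does not contain {marker!r}")
    return after.strip()

sample = "Instruction: Say hi.\n Response: Hi! Response: again"
print(extract_response(sample))  # Hi! Response: again
```

`str.partition` returns an empty separator when the marker is missing, which makes the "not found" check explicit.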

<div style="border: 4px solid coral; text-align: center; margin: auto;">
    <h2><i>Try it Yourself!</i></h2>
    <p style=" text-align: center; margin: auto;">Compare the responses of the fine-tuned model against the vanilla pre-trained LLM.</p>
    <p style=" text-align: center; margin: auto;"><b>Note: Results may not be factually accurate and may be based on false assumptions.</b></p>
    <br>
</div>

In [None]:
# Prompt for prediction
inference_prompt = "What solutions come pre-built with Amazon SageMaker JumpStart?"

In [None]:
%%capture
# Re-enable the key/value cache (disabled for training) to speed up generation
trainer.model.config.use_cache = True

# Inference pipeline with the fine-tuned model
inf_pipeline = pipeline('text-generation', model=trainer.model, tokenizer=tokenizer, max_length=256, trust_remote_code=True)

# Format the prompt using the `prompt_template` and generate a response
response = inf_pipeline(prompt_template.format(instruction=inference_prompt))[0]['generated_text']

In [None]:
formatted_response = postprocess(response)
formatted_response

### <a name="6">Quizzes</a>
(<a href="#0">Go to top</a>)

Well done on completing the lab! Now, it's time for a brief knowledge assessment.

<div style="border: 4px solid coral; text-align: center; margin: auto;">
    <h2><i>Try it Yourself!</i></h2>
    <p style=" text-align: center; margin: auto;">Answer the following questions to test your understanding of fine-tuning LLMs.</p>
    <br>
</div>

In [None]:
from ml_utils.quiz_questions import *
lab2_question1

In [None]:
lab2_question2

# Thank you!