<a href="https://colab.research.google.com/github/eschjtrDE/UPLIMIT_Opensource_LLMs/blob/main/UPLIMIT_Opensource_LLMs_Week_2_Finetuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install datasets trl peft bitsandbytes accelerate 



In [2]:
!pip install peft



# 1. Supervised Finetuning (SFT)

Supervised fine-tuning (SFT) is a specific approach to finetuning that involves training a model on a labeled dataset that directly maps inputs to desired outputs. SFT includes instruction-tuning, which teaches a model to respond in the way that human-written examples define.

In this section you will complete an SFT implementation using the Transformer Reinforcement Learning (TRL) package. TRL is a library built on top of the Hugging Face Transformers library that provides a simple interface and training loop for finetuning models, including with reinforcement learning. TRL is designed to be easy to use and flexible, allowing you to quickly experiment with different approaches to finetuning. That said, its abstracted nature means that it is not always the best tool for students. So below we will also provide a more detailed example of how to implement PPO using the Transformers library directly.


## 1.1. Setup

In [4]:
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments

In [5]:
from peft import LoraConfig
from trl import SFTTrainer

In [10]:
!pip install accelerate bitsandbytes -qq

We need to reduce the size of the language model so that it fits in memory. We will use a process referred to as quantization, which shrinks a model by lowering the precision of its weights. For example, converting 32-bit floating point weights to 16-bit floating point weights reduces the size of the model by 50%; the configuration below goes further and loads the weights in 8-bit. The downside of quantization is that it can reduce the accuracy of the model. In practice, however, the impact on accuracy is often small.
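To make the size arithmetic concrete, the sketch below estimates the memory needed for the weights alone at several precisions. The 350-million parameter count is an approximation taken from the name of the `facebook/opt-350m` model used later, and `weight_memory_mb` is a helper made up for illustration. Real memory use is higher because of activations, optimizer state, and layers that stay in higher precision.

```python
# Rough estimate of weight storage only (illustrative helper, not a library API).
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_mb(num_params: int, dtype: str) -> float:
    """Return the approximate size of the weights in megabytes (10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


num_params = 350_000_000  # assumption: roughly 350M parameters

for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: {weight_memory_mb(num_params, dtype):,.0f} MB")
```

Each halving of the bit width halves the weight storage, which is why 8-bit loading is attractive on small GPUs.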

Quantization is beyond the scope of this course, but if you are interested in learning more, you can read the following article: [Quantization: How to shrink a model size by 4x times with TensorFlow](https://towardsdatascience.com/introduction-to-weight-quantization-2494701b9c0c#:~:text=Typically%2C%20the%20size%20of%20a,a%20process%20known%20as%20quantization.).

In [11]:
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
)

## 1.2. Load the model and tokenizer

Now we will load the model and tokenizer. We will use the `facebook/opt-350m` model, which is a smaller transformer model that suffices for our purposes. We will also use the `GPT2TokenizerFast` tokenizer, which is a fast tokenizer that is optimized for transformer models. OPT was first introduced in [Open Pre-trained Transformer Language Models](https://arxiv.org/abs/2205.01068) and first released in the metaseq repository.

In [12]:
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path="facebook/opt-350m",
    quantization_config=quantization_config,
    trust_remote_code=False,
    torch_dtype=torch.bfloat16,
)

RuntimeError: No GPU found. A GPU is needed for quantization.

## 1.3. Load the dataset

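Next we will load the dataset. We will use the `timdettmers/openassistant-guanaco` dataset, which is a dataset of questions and answers. This dataset is a subset of the Open Assistant dataset, which you can find here: https://huggingface.co/datasets/OpenAssistant/oasst1/tree/main

The sample below shows a message record from the original Open Assistant data. The guanaco subset used for training exposes each conversation as a single `text` field, which is the field passed to the trainer as `dataset_text_field`.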

**Example of Dataset Sample**


```python
{
    "message_id": "218440fd-5317-4355-91dc-d001416df62b",
    "parent_id": "13592dfb-a6f9-4748-a92c-32b34e239bb4",
    "user_id": "8e95461f-5e94-4d8b-a2fb-d4717ce973e4",
    "text": "It was the winter of 2035, and artificial intelligence (..)",
    "role": "assistant",
    "lang": "en",
    "review_count": 3,
    "review_result": true,
    "deleted": false,
    "rank": 0,
    "synthetic": true,
    "model_name": "oasst-sft-0_3000,max_new_tokens=400 (..)",
    "labels": {
        "spam": { "value": 0.0, "count": 3 },
        "lang_mismatch": { "value": 0.0, "count": 3 },
        "pii": { "value": 0.0, "count": 3 },
        "not_appropriate": { "value": 0.0, "count": 3 },
        "hate_speech": { "value": 0.0, "count": 3 },
        "sexual_content": { "value": 0.0, "count": 3 },
        "quality": { "value": 0.416, "count": 3 },
        "toxicity": { "value": 0.16, "count": 3 },
        "humor": { "value": 0.0, "count": 3 },
        "creativity": { "value": 0.33, "count": 3 },
        "violence": { "value": 0.16, "count": 3 }
    }
}
```

In [13]:
dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")

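Unlike the Open Assistant record shown above, each row of `timdettmers/openassistant-guanaco` is, as far as we can tell, a single `text` string with turns marked by `### Human:` and `### Assistant:`. The sketch below splits such a string into (role, message) pairs. The sample string and the helper `split_turns` are invented for illustration and are not taken from the dataset.

```python
import re


def split_turns(text: str):
    """Split a '### Human: ... ### Assistant: ...' string into (role, message) pairs."""
    # re.split with a capturing group keeps the role names in the result list
    parts = re.split(r"### (Human|Assistant):", text)
    # parts[0] is whatever precedes the first marker (usually empty)
    roles, messages = parts[1::2], parts[2::2]
    return [(role, message.strip()) for role, message in zip(roles, messages)]


# Made-up sample in the assumed format
sample = "### Human: What is a frog?### Assistant: A frog is an amphibian."
print(split_turns(sample))
```

Inspecting a few rows this way before training is a cheap check that the text field looks the way you expect.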

## 1.4. Train the model

In this step you will train the model using the `SFTTrainer` class from TRL, which builds on the `Trainer` class from the `transformers` library. The trainer takes care of the details of training, such as batching, shuffling, and logging. It also provides a simple interface for logging metrics and saving checkpoints.

## EXERCISE: Implement the `train` function

Configure the trainer class using the parameters below.

| Variable | Value |
| --- | --- |
| Output Directory | "output_dir" |
| Batch Size | 16 |
| Gradient Accumulation | 16 |
| Learning Rate | 1.41e-5 |
| Logging Frequency  | 1 |
| Epochs | 3 |
| Maximum Steps | -1 |
| Reporting Destination | None |
| Checkpoint Save Steps | 100 |
| Total Checkpoint Limit | 10 |
| Push Model | False |
| Model Id | None |
| Enable Gradient Checkpointing | False |
| LoRA Rank (`r`) | 64 |
| LoRA Alpha Value | 16 |
| Bias Type | "none" |
| Task Type | "CAUSAL_LM" |

Review the documentation for parameter names: [SFTTrainer Documentation](https://huggingface.co/docs/trl/sft_trainer)
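The batch size and gradient accumulation settings in the table interact: gradients are accumulated over several forward/backward passes before each optimizer step, so the effective batch size is their product. A minimal sketch of that arithmetic, where `effective_batch_size` is a hypothetical helper and a single GPU is assumed by default:

```python
def effective_batch_size(per_device_batch_size: int,
                         gradient_accumulation_steps: int,
                         num_devices: int = 1) -> int:
    """Number of examples that contribute to one optimizer step."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


# Values from the table: batch size 16, gradient accumulation 16
print(effective_batch_size(16, 16))

# A much smaller setting that fits on a modest GPU
print(effective_batch_size(1, 1))
```

If the per-device batch size has to be lowered to fit in memory, raising the gradient accumulation steps by the same factor keeps the effective batch size unchanged.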



In [None]:
# SOLUTIONS !!!

# Step 3: Define the training arguments
training_args = TrainingArguments(
    output_dir="output_dir",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    learning_rate=1.41e-5,
    logging_steps=1,
    num_train_epochs=3,
    max_steps=25,
    save_steps=100,
    save_total_limit=10,
    push_to_hub=False,
    hub_model_id=None,
    gradient_checkpointing=False,
    report_to="none"
)

# Step 4: Define the LoraConfig
peft_config = LoraConfig(
    r=64,
    lora_alpha=16,
    bias="none",
    task_type="CAUSAL_LM",
)

# Step 5: Define the Trainer
trainer = SFTTrainer(
    model=model,
    args=training_args,
    max_seq_length=512,
    train_dataset=dataset,
    dataset_text_field="text",
    peft_config=peft_config,
)

trainer.train()

In [None]:
# Step 3: Define the training arguments
training_args = TrainingArguments(
    output_dir="output_dir",
    per_device_train_batch_size=64, # TODO: define a functional batch size. It will be low!
    gradient_accumulation_steps=1,
    learning_rate=1.41e-5,
    logging_steps=1,
    num_train_epochs=3,
    max_steps=25,
    save_steps=100,
    save_total_limit=10,
    push_to_hub=False,
    hub_model_id=None,
    gradient_checkpointing=False,
    report_to="none"
)

# Step 4: Define the LoraConfig
peft_config = LoraConfig(
    r=64,
    lora_alpha=16,
    bias="none",
    task_type="CAUSAL_LM",
)

# Step 5: Define the Trainer
trainer = SFTTrainer(
    # TODO: add model, dataset, and training arguments to the Trainer
    max_seq_length=512,
    dataset_text_field="text",
    peft_config=peft_config,
)

trainer.train()
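For intuition about the `r` and `lora_alpha` values in `LoraConfig`: LoRA freezes a weight matrix of shape `d_out x d_in` and trains two low-rank factors with `r * (d_in + d_out)` parameters in total, and the learned update is scaled by `lora_alpha / r`. The sketch below works through this arithmetic. The 1024 x 1024 layer size is an assumption chosen for illustration, not a value read from the model, and both helpers are hypothetical.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters in the two LoRA factors (r x d_in and d_out x r)."""
    return r * (d_in + d_out)


def lora_scaling(lora_alpha: int, r: int) -> float:
    """Factor applied to the low-rank update."""
    return lora_alpha / r


d_in = d_out = 1024      # assumed layer size, for illustration only
r, lora_alpha = 64, 16   # values used in the LoraConfig above

full = d_in * d_out
lora = lora_param_count(d_in, d_out, r)
print(f"full matrix: {full:,} parameters")
print(f"LoRA factors: {lora:,} parameters ({lora / full:.1%} of the full matrix)")
print(f"scaling: {lora_scaling(lora_alpha, r)}")
```

A smaller `r` trains fewer parameters, at the cost of a less expressive update.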

# 2. Direct Preference Optimization (DPO)

DPO has emerged as a more efficient and streamlined method of fine-tuning large language models (LLMs), offering a simpler alternative to the complex RLHF approach. It treats the task of aligning a language model's output with human preferences as a binary classification problem, thereby simplifying the process and making it more stable and computationally lightweight.

## 2.1 Data Processing

The first step in DPO is to prepare the data. DPO requires three pieces for every example: the prompt, a positive (chosen) response, and a negative (rejected) response. In this notebook these are kept as plain strings; the trainer later uses the model's tokenizer to convert them into sequences of token ids.

```python
dpo_dataset_dict = {
 "prompt": ["hello", "how are you", …],
 "chosen": ["hi, nice to meet you", "I am fine", …],
 "rejected": ["leave me alone", "I am not fine", …],
 }
 ```

## EXERCISE: Implement the `prepare_data` function

Implement the `prepare_data` function. The function takes a single sample containing a `chosen` and a `rejected` dialogue, and should return a dictionary with the following keys: `prompt`, `chosen`, and `rejected`. The `prompt` value is the shared dialogue up to and including the final `Assistant:` marker. The `chosen` and `rejected` values are the text that follows the prompt in the respective dialogue.
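To make the expected input and output concrete: each `Anthropic/hh-rlhf` sample stores a full dialogue in both `chosen` and `rejected`, and the two differ in the final assistant reply. The sketch below applies the slicing idea to a made-up sample. The strings and the helper name `split_on_last_assistant` are illustrative only.

```python
MARKER = "\n\nAssistant:"


def split_on_last_assistant(dialogue: str):
    """Return (prompt, response), cutting just after the last Assistant marker."""
    idx = dialogue.rfind(MARKER)
    if idx == -1:
        raise ValueError(f"dialogue does not contain {MARKER!r}")
    cut = idx + len(MARKER)
    return dialogue[:cut], dialogue[cut:]


# Made-up sample in the hh-rlhf layout
sample = {
    "chosen": "\n\nHuman: hello\n\nAssistant: hi, nice to meet you",
    "rejected": "\n\nHuman: hello\n\nAssistant: leave me alone",
}

prompt, chosen = split_on_last_assistant(sample["chosen"])
_, rejected = split_on_last_assistant(sample["rejected"])
print(repr(prompt))
print(repr(chosen))
print(repr(rejected))
```

Using `rfind` rather than `find` matters for multi-turn dialogues, where only the last assistant reply is the response being compared.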

In [None]:
# SOLUTION !!!!
from typing import Dict

def extract_anthropic_prompt(prompt_and_response):
    """Extract the anthropic prompt from a prompt and response pair."""
    search_term = "\n\nAssistant:"
    search_term_idx = prompt_and_response.rfind(search_term)
    assert search_term_idx != -1, f"Prompt and response does not contain '{search_term}'"
    return prompt_and_response[: search_term_idx + len(search_term)]

def prepare_data(sample) -> Dict[str, str]:
    prompt = extract_anthropic_prompt(sample["chosen"])
    return {
        "prompt": prompt,
        "chosen": sample["chosen"][len(prompt) :],
        "rejected": sample["rejected"][len(prompt) :],
    }

In [None]:
from trl import DPOTrainer
from typing import Dict

def extract_anthropic_prompt(prompt_and_response):
    """Extract the anthropic prompt from a prompt and response pair."""
    search_term = "\n\nAssistant:"
    search_term_idx = prompt_and_response.rfind(search_term)
    assert search_term_idx != -1, f"Prompt and response does not contain '{search_term}'"
    return prompt_and_response[: search_term_idx + len(search_term)]

def prepare_data(sample) -> Dict[str, str]:
    prompt = extract_anthropic_prompt(sample["chosen"])
    # TODO: return a dictionary of the prompt, chosen, and rejected by slicing the sample components
    return

train_dataset = load_dataset("Anthropic/hh-rlhf", split="train").map(prepare_data)
eval_dataset  = load_dataset("Anthropic/hh-rlhf", split="test").map(prepare_data)

## 2.2. Model Training


The code in this section sets up and trains a language model with Direct Preference Optimization (DPO). Let's break down the key components:

`AutoModelForCausalLM` and `AutoTokenizer` are classes from the Hugging Face `transformers` library. They are used to automatically load a pre-trained model and its corresponding tokenizer. `torch` is the PyTorch library, a popular framework for deep learning.

The code initializes a GPT-2 model (`model_id = "gpt2"`) for causal language modeling (predicting the next token in a sequence). The model is loaded in 16-bit floating point precision (`torch.float16`) for memory efficiency, with `device_map="auto"` so that it is placed on the available device (e.g., a GPU) automatically. A tokenizer for GPT-2 is loaded, and its padding token is set to the end-of-sequence token because GPT-2 does not define a padding token of its own.


### Warning

You may need to restart the kernel for this section, because we're going to define and train a new model.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
import torch

model_id = "gpt2"
del model # just some housekeeping to clear memory
# TODO: load model with a datatype of float16 and an auto device_map
# TODO: load tokenizer
tokenizer.pad_token_id = tokenizer.eos_token_id

In [None]:
# SOLUTION !!!
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token_id = tokenizer.eos_token_id  # GPT-2 has no padding token by default


`TrainingArguments` configures various parameters for training, such as the batch size (`per_device_train_batch_size`), the learning rate (`learning_rate=1e-3`), and the optimizer to use (`optim="rmsprop"`). These settings dictate how the model will be trained, including how data is batched and how the model's weights are updated during training.

In [None]:
# SOLUTION !!!!

training_args = TrainingArguments(
        per_device_train_batch_size=1,
        remove_unused_columns=False,
        gradient_accumulation_steps=16,
        learning_rate=1e-3,
        evaluation_strategy="steps",
        logging_first_step=True,
        logging_steps=10,  # match results in blog post
        eval_steps=500,
        output_dir="./test",
        optim="rmsprop",
        warmup_steps=150,
        # bf16=True,
        report_to="none",
        max_steps=5,
)

In [None]:
training_args = TrainingArguments(
        per_device_train_batch_size=32, # TODO: Define an appropriate batch_size
        remove_unused_columns=False,
        gradient_accumulation_steps=16,
        learning_rate=1e-3,
        evaluation_strategy="steps",
        logging_first_step=True,
        logging_steps=10,  # match results in blog post
        eval_steps=500,
        output_dir="./test",
        optim="rmsprop",
        warmup_steps=150,
        report_to="none",
        max_steps=5,
)
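One interaction worth noticing in these arguments is `warmup_steps=150` combined with `max_steps=5`. Assuming the default linear schedule with warmup, the learning rate ramps linearly from 0 towards `learning_rate` over the warmup steps, so a 5-step run never gets close to `1e-3`. A sketch of that arithmetic follows; the helper `warmup_lr` is hypothetical and models only the warmup phase, not the decay that follows it.

```python
def warmup_lr(step: int, base_lr: float, warmup_steps: int) -> float:
    """Learning rate during linear warmup; returns base_lr once warmup is over."""
    if step >= warmup_steps:
        return base_lr
    return base_lr * step / max(1, warmup_steps)


base_lr, warmup_steps, max_steps = 1e-3, 150, 5

for step in range(max_steps + 1):
    print(f"step {step}: lr = {warmup_lr(step, base_lr, warmup_steps):.2e}")
```

For a short demonstration run this is harmless, but for a real run the warmup length should be a small fraction of the total number of steps.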

`DPOTrainer` is TRL's trainer class for DPO. It takes the model, training arguments, datasets, and tokenizer as inputs. Important parameters here include `beta=0.1`, the DPO hyperparameter that controls how strongly the trained model is kept close to the reference model, and `max_length`, `max_target_length`, and `max_prompt_length`, which define the size constraints for the model's input and output. The `train()` method on `dpo_trainer` starts the training process: it iterates over `train_dataset`, computes the DPO loss on the chosen and rejected responses, updates the model's weights accordingly, and periodically evaluates on `eval_dataset`.

In [None]:
dpo_trainer = DPOTrainer(
    model,
    # "gpt2_dpo",
    args=training_args,
    beta=0.1,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    max_length=512,
    max_target_length=128,
    max_prompt_length=128,
    # generate_during_eval=True,
)

dpo_trainer.train()

DPO is a method for aligning the outputs of language models with human preferences. It simplifies the process by treating the alignment task as a binary classification problem, where the model learns to differentiate between preferred and non-preferred responses. Unlike traditional approaches that require a separate reward model, DPO directly integrates preference learning into the training process, making it more efficient and straightforward.
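The binary classification view can be written down directly. For one preference pair, the DPO loss is the negative log-sigmoid of `beta` times the difference between two log-ratios: how much more likely the policy makes the chosen response than the reference model does, minus the same quantity for the rejected response. The sketch below computes this in plain Python from four log-probabilities. The numbers are invented; in practice a trainer computes these log-probabilities from the policy and a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for a single (chosen, rejected) pair of sequence log-probabilities."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(x)) is the same as log(1 + exp(-x))
    return math.log1p(math.exp(-logits))


# Policy identical to the reference: the loss equals log(2)
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))

# Policy favours the chosen response more than the reference does: lower loss
print(dpo_loss(-10.0, -18.0, -12.0, -15.0))
```

A larger `beta` makes the loss react more sharply to the same shift in log-ratios.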

## EXERCISE: Implement the `infer` function

In [None]:
# SOLUTION !!!


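def infer(instruction: str, context: str):
    # Prompt template built from adjacent string literals, so that the
    # indentation of this function does not leak into the prompt text
    template = (
        "### Instruction: {instruction}\n\n"
        "### Context: {context}\n\n"
        "### Response: {response}"
    )

    # Leave the response slot empty so that the model completes it
    inputs = template.format(
        instruction=instruction,
        context=context,
        response=""
    ).strip()
    # Move the inputs to the model's device instead of hard-coding "cuda"
    encoding = tokenizer([inputs], return_tensors="pt").to(model.device)
    outputs = model.generate(**encoding, max_new_tokens=30)
    output_text = tokenizer.decode(outputs[0])
    return output_text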



In [None]:
def infer(instruction:str, context: str):
    template = "" # TODO: implement an instruction template
    inputs = "" # TODO: use the .format() method to define the input variables
    encoding = tokenizer([inputs], return_tensors="pt").to(model.device)  # follow the model's device
    outputs = model.generate(**encoding, max_new_tokens=30)
    output_text = tokenizer.decode(outputs[0])
    return output_text

infer(
    instruction="What is a frog?",
    context="Both frogs and toads are amphibians in the order Anura, which means \"without a tail.\" Toads are a sub-classification of frogs, meaning that all toads are frogs, but not all frogs are toads.",
)
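A note on writing the template: a triple-quoted string inside an indented function body keeps its leading spaces, which then end up in the prompt. One way to avoid this, sketched below with a hypothetical `build_prompt` helper, is to assemble the template from adjacent string literals so that the prompt text is independent of code indentation. The context sentence is shortened for the example.

```python
PROMPT_TEMPLATE = (
    "### Instruction: {instruction}\n\n"
    "### Context: {context}\n\n"
    "### Response: {response}"
)


def build_prompt(instruction: str, context: str, response: str = "") -> str:
    """Fill the template; an empty response leaves the slot for the model to complete."""
    return PROMPT_TEMPLATE.format(
        instruction=instruction,
        context=context,
        response=response,
    ).strip()


prompt = build_prompt(
    instruction="What is a frog?",
    context="Frogs are amphibians in the order Anura.",
)
print(prompt)
```

Printing the prompt before tokenizing it is a quick way to confirm that it contains exactly the text you intend the model to see.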

# 3. Collecting Human Feedback

In this section we will go back to the data itself. We will use the Argilla library to collect human feedback on the data. In reality, we will just push some data to Argilla, and explore its quality.

From experience and from the literature, we know that the quality of the data is the most important factor in the quality of the model. So we will use Argilla to collect human feedback on the data, and then use that feedback to improve it. It is important to become familiar with inspecting the data and judging its quality.

## 3.1. Setup

First, we will need to install the Argilla library. Argilla is a library for collecting human feedback on data. It is designed to be easy to use, and to provide a simple interface for collecting human feedback on data. It is also designed to be flexible, allowing you to collect feedback on any type of data, including text, images, and audio.

In [None]:
%pip install "argilla~=1.16.0" "transformers~=4.34.0" "datasets~=2.14.5" "peft~=0.5.0" "trl~=0.7.1" "wandb~=0.15.12" -qqq

### Running Argilla Quickstart

For small-scale projects and quick experimentation, one recommended way to run Argilla is on Hugging Face Spaces.

**👩🏽‍🚀 Argilla on Hugging Face Spaces**

If you have a Hugging Face account and want to run Argilla workflows from Colab or remote notebooks, you can deploy Argilla on Spaces:

[deploy on spaces](https://huggingface.co/new-space?template=argilla/argilla-template-space)

In [None]:
import argilla as rg

rg.init(
    api_url="https://burtenshaw-argilla.hf.space",  # replace with the direct URL of your own Space
    api_key="owner.apikey",
    workspace="argilla"
)

![Argilla on Spaces](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/spaces-argilla-duplicate-space.png)

HuggingFace Spaces now have persistent storage and this is supported from Argilla 1.11.0 onwards, but you will need to manually activate it via the HuggingFace Spaces settings. Otherwise, unless you’re on a paid space upgrade, after 48 hours of inactivity the space will be shut off and you will lose all the data. To avoid losing data, we highly recommend using the persistent storage layer offered by HuggingFace. If everything goes well, you’ll see your online Argilla UI login page. You can log in with username admin and password 12345678. You can find the direct URL by clicking on the Embed space button. You’ll use this URL for sending data to your Argilla instance.

## 3.2 Defining the Feedback Dataset

Argilla feedback allows you to collect detailed information from annotators that your LLM can learn from.

In [None]:
dataset = rg.FeedbackDataset(
    fields = [
        rg.TextField(name="background"),
        rg.TextField(name="prompt"),
        rg.TextField(name="response", title="Final Response"),
    ],
    questions = [
        rg.LabelQuestion(name="quality", title="Is it a Good or Bad response?", labels=["Good", "Bad"])
    ]
)

For the sake of this tutorial, we will use a simple dataset of prompts and responses: a 30,000-example sample streamed from the `laion/OIG` dataset, which you can find here: https://huggingface.co/datasets/laion/OIG

In [None]:
from datasets import load_dataset

data = load_dataset("laion/OIG", split="train", streaming=True)
data = data.shuffle(buffer_size=1_000_000).take(30_000)

In [None]:
from typing import Dict, Any

def extract_background_prompt_response(text: str) -> Dict[str, Any]:
    '''Split an OIG text into its background, prompt, and response parts.'''
    start_prompt = text.find("<human>:")
    end_prompt = text.rfind("<bot>:")
    # Background is anything before the first <human>:
    background = text[:start_prompt].strip()
    # Prompt is anything between the first <human>: (inclusive) and the last <bot>: (exclusive)
    prompt = text[start_prompt: end_prompt].strip()
    # Response is everything after the last <bot>: (inclusive)
    response = text[end_prompt:].strip()
    return {"background": background, "prompt": prompt, "response": response}


data = data.map(extract_background_prompt_response, input_columns="text")
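To see what the extraction produces, the sketch below runs the same slicing logic on a made-up string in the `<human>:` / `<bot>:` format. The sample text is invented, and the function is repeated here under the hypothetical name `split_oig_text` only so that the example is self-contained.

```python
from typing import Dict


def split_oig_text(text: str) -> Dict[str, str]:
    """Same slicing as above: background, then prompt, then the last bot reply."""
    start_prompt = text.find("<human>:")
    end_prompt = text.rfind("<bot>:")
    return {
        "background": text[:start_prompt].strip(),
        "prompt": text[start_prompt:end_prompt].strip(),
        "response": text[end_prompt:].strip(),
    }


# Made-up sample text
sample = (
    "Background: Frogs are amphibians. "
    "<human>: What is a frog? "
    "<bot>: A frog is an amphibian."
)

parts = split_oig_text(sample)
for key, value in parts.items():
    print(f"{key}: {value!r}")
```

Note that `str.find` and `str.rfind` return -1 when a marker is missing, so texts without `<human>:` or `<bot>:` would be sliced incorrectly. Filtering out such rows first is one option.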

In [None]:
# NOTE: add the extracted records to `dataset` (e.g. via `dataset.add_records`) before pushing
dataset.push_to_argilla("oig-30k")

In [None]:
feedback_dataset = rg.FeedbackDataset.from_argilla("oig-30k")

# 4. [Optional] Training a Model with Human Feedback

In [None]:
dataset_ds = feedback_dataset.format_as("datasets")

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM

model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")

def formatting_prompts_func(example):
    output_texts = []
    for i in range(len(example['instruction'])):
        # TODO: Implement sample instruction text
        output_texts.append(text)
    return output_texts

response_template = " ### Answer:"
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    data_collator=collator,
    train_dataset=dataset_ds,  # the dataset exported from Argilla above
    formatting_func=formatting_prompts_func,
)

trainer.train()
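As a hint for the TODO above: a formatting function receives a batch (a dict of lists) and returns one string per example, and each string must contain the response template so that the collator can find where the completion starts. The sketch below assumes the column names `instruction` and `output`, which may not match the columns in your exported dataset. The batch contents and the name `format_batch` are made up for illustration.

```python
RESPONSE_TEMPLATE = " ### Answer:"


def format_batch(example):
    """Turn a batch (dict of lists) into a list of training strings."""
    output_texts = []
    for i in range(len(example["instruction"])):
        text = (
            f"### Question: {example['instruction'][i]}\n"
            f"{RESPONSE_TEMPLATE} {example['output'][i]}"
        )
        output_texts.append(text)
    return output_texts


# Made-up batch with the assumed column names
batch = {
    "instruction": ["What is a frog?"],
    "output": ["An amphibian."],
}

texts = format_batch(batch)
print(texts[0])
```

Keeping the template in a single constant avoids a mismatch between the text produced here and the `response_template` given to the collator.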

# The End 💐 🎆

Well done! You have finished the week 2 project.