<a href="https://colab.research.google.com/github/daisysong76/AI--Machine--learning/blob/main/Fine_tuning_for__Better__Answer_Generation_for_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning for "Better" Answer Generation for RAG.
<a target="_blank" href="https://colab.research.google.com/github/ai-hero/workshop-keeping-up-with-openai-et-al/blob/main/rag_and_fine_tuning/Fine_tuning_for__Better__Answer_Generation_for_RAG.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

Large Language Models are just that: language models. They should not be used as databases. But LLMs for RAG provide a unique opportunity. Since the context retrieved for RAG already contains the answer, the LLM's role is to frame it in the voice that the creator desires, e.g. specific terminology, format, structure, added guardrails, etc.

"Better" here doesn't mean more accurate. It means that the LLM's output is framed (or phrased) more aptly for the task.

## First, let's install a few dependencies
(`pip install -q <lib>` = quiet mode)

python-dotenv: This package is used to read key-value pairs from a .env file and add them to the environment variable. It's helpful for managing application configurations or secrets, such as API keys, without hardcoding them into your scripts.

trl: This stands for "Transformer Reinforcement Learning." It is a Hugging Face library for fine-tuning language models, originally with reinforcement learning and also with supervised fine-tuning. This notebook uses its `SFTTrainer`.

scipy: SciPy is a fundamental library for scientific computing in Python, offering modules for optimization, linear algebra, integration, interpolation, special functions, FFT, signal and image processing, and more. It's a core component of the scientific computing ecosystem in Python.

ipywidgets: This package allows you to create interactive HTML widgets for Jupyter notebooks. It's useful for making notebooks more interactive, with UI elements like sliders, buttons, and dropdowns that can interact with Python code.


In [None]:
!pip install -q python-dotenv trl transformers peft accelerate bitsandbytes datasets scipy ipywidgets matplotlib

In [None]:
from dotenv import load_dotenv

load_dotenv("./my.env")

False
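
The `False` returned above means `load_dotenv` did not load any variables, which is what happens when `./my.env` does not exist in the Colab working directory. As a rough illustration of what loading such a file amounts to, here is a minimal standard-library sketch. The helper name `read_env_file` is hypothetical, and the real `python-dotenv` parser handles quoting, `export` prefixes, and variable interpolation that this sketch ignores.

```python
import os
import tempfile
from pathlib import Path


def read_env_file(path):
    """Parse simple KEY=VALUE lines; return {} if the file is missing."""
    file = Path(path)
    if not file.is_file():
        return {}
    values = {}
    for line in file.read_text().splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


# A missing file yields nothing to load, mirroring the False seen above.
missing = read_env_file("./definitely_missing_file.env")

# A file with one entry yields one variable (placeholder value, not a real token).
with tempfile.TemporaryDirectory() as tmp:
    env_path = os.path.join(tmp, "my.env")
    Path(env_path).write_text("# secrets\nHF_TOKEN=placeholder-token\n")
    loaded = read_env_file(env_path)

print(missing)
print(loaded)
```

In Colab it is often simpler to keep the token in Colab secrets, as the next cell does, than to upload a `.env` file.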

In [None]:
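from google.colab import userdata

# Assign the token instead of leaving it as the last expression in the cell,
# so the secret is not echoed into the notebook output.
# (hf_token is a name introduced here; nothing later in the notebook reads it.)
hf_token = userdata.get('daisysxm76')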

In [None]:

pip install python-dotenv




In [None]:
from huggingface_hub import notebook_login
notebook_login()


## Next, let's download our data.

In [None]:
from datasets import load_dataset

train_split = load_dataset("rparundekar/rag_fine_tuning_500", split="train")
val_split = load_dataset("rparundekar/rag_fine_tuning_500", split="validation")
## NOTE: You'll need to set an env var HF_TOKEN (I've used colab secrets)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading readme:   0%|          | 0.00/605 [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/352k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/83.2k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/85.2k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/500 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/100 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/100 [00:00<?, ? examples/s]

The data (built from the `sciq` dataset) contains the user question (str), the contexts (list[str]), the answer our RAG returned, and the original answer in the dataset.

The "original" answer is the answer we want our LLM to return. In practice, think of this as a corrected answer that your human-in-the-loop annotators have prepared, or one derived from user feedback. Here, it's the original short answer in the sciq dataset: our goal is to fine-tune the LLM to return short and correct answers from the context.

The contexts and answers were generated assuming a simple RAG based retrieval (see the data generation notebook).

In [None]:
train_split.to_pandas().head()

| Unnamed: 0 | question | contexts | answer | original_answer |
|---|---|---|---|---|
| 0 | What type of organism is commonly used in prep... | [Bacteria can be used to make cheese from milk... | Bacteria | Mesophilic Organisms |
| 1 | What phenomenon makes global winds blow northe... | [Without Coriolis Effect the global winds woul... | Coriolis Effect | Coriolis Effect |
| 2 | Changes from a less-ordered state to a more-or... | [Summary Changes of state are examples of phas... | Changes from a less-ordered state to a more-or... | Exothermic |
| 3 | What is the least dangerous radioactive decay? | [All radioactive decay is dangerous to living ... | Alpha decay is the least dangerous radioactive... | Alpha Decay |
| 4 | Kilauea in hawaii is the world’s most continuo... | [Example 3.5 Calculating Projectile Motion: Ho... | smoke and ash | Smoke And Ash |


The dataset is also really small: only 500 rows. Depending on your use case, you would typically need to train on 1000+ rows for a good base model (e.g. OpenAI), or more for most open source models.

Here, the answer is already in the context. So we're not fine-tuning the model to remember the sciq dataset, just to rephrase the answers.

In [None]:
len(train_split)

500

Our fine tuning task is to provide succinct answers. In this example, you can see that the original answer is more descriptive.

For your use case, it could be a different format, you could use fewer technical terms, etc. LLM fine tuning is all about changing that phrasing/framing of the response. DON'T THINK IT'S LIKE A DATABASE AND ADD MORE DATA.

In [None]:
example = train_split[0]
question = example["question"]
answer = example["answer"]
target = example["original_answer"]
print(
    f"""Example:
Question: {question}
Answer: {answer}
Updated Answer: {target}"""
)

Example:
Question: What type of organism is commonly used in preparation of foods such as cheese and yogurt?
Answer: Bacteria
Updated Answer: Mesophilic Organisms


## Let's set up our training data and model
We'll load the model and create a function to format the data.

Because Meta requires you to agree to terms before you can use Llama 2, you'll need to apply for access on Hugging Face.

Quantization reduces the precision of the numbers used to represent model weights, from a floating-point representation (e.g., 32 bits for float32) to a lower-bit representation (e.g., 4 bits). This significantly reduces the memory the model needs, usually at a small cost in accuracy. In the code below, quantization is applied with bitsandbytes so that the model fits on a T4 GPU. Let's look at the purpose, pros, and cons of this approach.

Purpose:
- Fit larger models on GPUs with limited memory: Reducing the precision of the weights shrinks the model's memory footprint, so larger models fit within the memory of a specific GPU, like the T4 in this case.
- Speed up computation: Lower-precision arithmetic can run faster on GPUs with hardware support for it. This is not guaranteed, though: with bitsandbytes the 4-bit weights are dequantized to the compute dtype for each matrix multiplication, so the main gain here is memory.

Pros:
- Reduced memory usage: The model uses less GPU memory, which allows training and deploying larger models or running more models in parallel on the same GPU.
- Increased computational efficiency: Where specialized low-precision hardware is available, quantization can lead to faster computation.
- Energy efficiency: Lower-precision computation generally requires less power, which helps when deploying models at scale or on edge devices.

Cons:
- Potential accuracy loss: Reducing the precision of the weights can reduce model accuracy. Fine-tuning the quantized model or using more advanced quantization methods can mitigate this.
- Compatibility and support: Not all models and operations support quantization well. Some operations lack efficient low-precision implementations, leading to a mix of precisions that can complicate deployment.
- Complexity: Advanced schemes like 4-bit quantization with double quantization (`bnb_4bit_use_double_quant=True`) and specific quantization types (`bnb_4bit_quant_type="nf4"`) add complexity to the training and deployment pipeline.

The `BitsAndBytesConfig` in the code below loads the model weights in 4-bit precision (`load_in_4bit=True`), uses double quantization, which also quantizes the quantization constants to save a little more memory (`bnb_4bit_use_double_quant=True`), selects the NF4 (4-bit NormalFloat) quantization type (`bnb_4bit_quant_type="nf4"`), and uses `torch.bfloat16` for computation. This setup aims to balance the memory savings of low-bit quantization with the need to maintain model accuracy.
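
To make the memory argument concrete, here is a back-of-the-envelope calculation. It assumes roughly 6.74 billion parameters, inferred from the two fp16 checkpoint shards downloaded below (9.98 GB + 3.50 GB at 2 bytes per weight), and a nominal 16 GB of memory on the T4. It counts weights only and ignores activations, optimizer state, LoRA adapters, and quantization constants, so treat the numbers as a lower bound.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 6.74e9  # assumption: (9.98 GB + 3.50 GB) / 2 bytes per fp16 weight
T4_MEMORY_GB = 16  # assumption: nominal T4 memory

footprints = {
    name: weight_memory_gb(N_PARAMS, bits)
    for name, bits in [("float32", 32), ("float16", 16), ("4-bit", 4)]
}

for name, gb in footprints.items():
    fits = "fits" if gb < T4_MEMORY_GB else "does not fit"
    print(f"{name:>8}: {gb:6.2f} GB of weights ({fits} in {T4_MEMORY_GB} GB)")
```

Under these assumptions the fp16 weights alone take about 13.5 GB, which leaves almost no room for training on a 16 GB card, while the 4-bit weights take about 3.4 GB.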

In [None]:
## Let's load the model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# The base model
model_name = "meta-llama/Llama-2-7b-hf"

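# 4-bit quantization so the 7B model fits on a T4 GPU
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,  # store the weights in 4 bits
    bnb_4bit_use_double_quant=True,  # also quantize the quantization constants
    bnb_4bit_quant_type="nf4",  # 4-bit NormalFloat data type
    # dtype used for the matrix multiplications; the T4 has no native bfloat16
    # support, so torch.float16 may be the faster choice on that GPU
    bnb_4bit_compute_dtype=torch.bfloat16,
)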

# Create the model
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config, device_map={"": 0})
model.config.use_cache = False
model.config.pretraining_tp = 1

config.json:   0%|          | 0.00/609 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/188 [00:00<?, ?B/s]

Next, let's set up our tokenizer and a function that builds the input in an instruction format for tokenization. We'll add the BOS and EOS (beginning- and end-of-sequence) tokens ourselves in the prompt text, instead of having the tokenizer add them automatically. This gives us more control during evaluation.

In [None]:
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, add_eos_token=False, add_bos_token=False, trust_remote_code=True)
special_tokens = {"pad_token": "[PAD]"}
tokenizer.add_special_tokens(special_tokens)
tokenizer.padding_side = "right"

# The new [PAD] token grows the vocabulary, so resize the model's token embeddings to match.
model.config.pad_token_id = tokenizer.pad_token_id
model.resize_token_embeddings(len(tokenizer))

tokenizer_config.json:   0%|          | 0.00/776 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

Embedding(32001, 4096)

Update the dataset to add a "text" field with the instruction.

In [None]:
print(special_tokens)

{'pad_token': '[PAD]'}


In [None]:
def generate_text(row):
    question = row["question"]
    contexts = "\n".join(row["contexts"])

    prompt = f"""<s>### Question:
{question}

### Contexts:
{contexts}

### Answer:
"""
    if "original_answer" in row:
        answer = row["original_answer"]
        prompt += f"{answer}</s>"
    row["text"] = prompt
    return row


train_split_ds = train_split.map(generate_text)
val_split_ds = val_split.map(generate_text)  # map the validation split, not the training split

Map:   0%|          | 0/500 [00:00<?, ? examples/s]

In [None]:
print(train_split_ds[0]["text"])

<s>### Question:
What type of organism is commonly used in preparation of foods such as cheese and yogurt?

### Contexts:
Bacteria can be used to make cheese from milk. The bacteria turn the milk sugars into lactic acid. The acid is what causes the milk to curdle to form cheese. Bacteria are also involved in producing other foods. Yogurt is made by using bacteria to ferment milk ( Figure below ). Fermenting cabbage with bacteria produces sauerkraut.
Humans have collected and grown mushrooms for food for thousands of years. Figure below shows some of the many types of mushrooms that people eat. Yeasts are used in bread baking and brewing alcoholic beverages. Other fungi are used in fermenting a wide variety of foods, including soy sauce, tempeh, and cheeses. Blue cheese has its distinctive appearance and flavor because of the fungus growing though it (see Figure below ).

### Answer:
Mesophilic Organisms</s>
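
Because the BOS and EOS markers are written into the text by hand, it is easy to end up with a missing or doubled marker. The following is a small sanity check one could run over the `text` field. The function name `check_markers` is hypothetical, and the rules simply restate the format shown above.

```python
BOS, EOS = "<s>", "</s>"


def check_markers(text, is_training_example):
    """Return a list of problems with the hand-written BOS/EOS markers."""
    problems = []
    if text.count(BOS) != 1 or not text.startswith(BOS):
        problems.append("expected exactly one BOS, at the start")
    has_single_eos_at_end = text.endswith(EOS) and text.count(EOS) == 1
    if is_training_example and not has_single_eos_at_end:
        problems.append("training text should end with exactly one EOS")
    if not is_training_example and EOS in text:
        problems.append("inference prompt should not contain EOS")
    return problems


# Shortened stand-ins for the real prompt; only the markers matter here.
body = "### Question:\nQ\n\n### Contexts:\nC\n\n### Answer:"
train_text = f"{BOS}{body}\nMesophilic Organisms{EOS}"
inference_prompt = f"{BOS}{body}"
doubled_bos = BOS + train_text  # what happens if a BOS is also auto-added as text

print(check_markers(train_text, is_training_example=True))
print(check_markers(inference_prompt, is_training_example=False))
print(check_markers(doubled_bos, is_training_example=True))
```

The first two calls report no problems, and the third flags the doubled BOS.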


**NOTE THAT IN THE ABOVE PROMPT, WE DO NOT HAVE ANY INSTRUCTIONS FOR THE MODEL**

Our hypothesis is that fine-tuning teaches the model the task itself, so the prompt doesn't need to carry instructions. Instead of building a general model, we are making our model more specific to our task.

- long-term memory
- domain-specific vocabulary and tone
- task-specific structure of output
- specific output format (JSON, SQL, YAML)
- more efficient context window with less prompting
- smaller, distilled, task-specific models
- when you have enough data

Let's set up the model for PEFT training.

In [None]:
from peft import LoraConfig

# Load LoRA configuration
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    # These layers vary by different models
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
        "lm_head",
    ],
    bias="none",
    lora_dropout=0.05,  # Conventional
    task_type="CAUSAL_LM",
)
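
For a sense of scale, the number of weights LoRA adds can be worked out by hand: each adapted linear layer of shape `d_in x d_out` gets two low-rank matrices with `r * (d_in + d_out)` parameters in total. The sketch below assumes the published Llama 2 7B dimensions (hidden size 4096, MLP size 11008, 32 decoder layers) and the vocabulary of 32001 shown in the `Embedding(32001, 4096)` output above. These numbers are not read from the loaded model.

```python
R = 8  # LoRA rank from the config above
HIDDEN, MLP, LAYERS, VOCAB = 4096, 11008, 32, 32001  # assumed model dimensions


def lora_params(d_in, d_out, r=R):
    """Parameters in the two low-rank matrices A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)


# Adapter sizes for the target modules inside one decoder layer.
per_layer = {
    "q_proj": lora_params(HIDDEN, HIDDEN),
    "k_proj": lora_params(HIDDEN, HIDDEN),
    "v_proj": lora_params(HIDDEN, HIDDEN),
    "o_proj": lora_params(HIDDEN, HIDDEN),
    "gate_proj": lora_params(HIDDEN, MLP),
    "up_proj": lora_params(HIDDEN, MLP),
    "down_proj": lora_params(MLP, HIDDEN),
}

decoder_total = LAYERS * sum(per_layer.values())
lm_head_total = lora_params(HIDDEN, VOCAB)  # lm_head appears once, not per layer
total = decoder_total + lm_head_total

print(f"decoder adapters: {decoder_total:,}")
print(f"lm_head adapter:  {lm_head_total:,}")
print(f"total trainable:  {total:,}")
```

Under these assumptions the adapters add about 20 million trainable weights, roughly 0.3% of the model's ~6.7 billion, which is why LoRA training fits where full fine-tuning would not.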

Create the trainer (we're using `SFTTrainer` from Hugging Face's TRL library).

7b-chat: 30GB GPU memory


In [None]:
from transformers import TrainingArguments
from trl import SFTTrainer

# Set training parameters
training_arguments = TrainingArguments(
    output_dir="./results",
    num_train_epochs=2,  # ignored here: max_steps below takes precedence
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",
    save_steps=250,
    logging_steps=25,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=500,  # 500 steps at batch size 1 = one pass over the 500 rows
    warmup_ratio=0.03,  # has no effect with the "constant" scheduler
    group_by_length=True,
    lr_scheduler_type="constant",
)

# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=train_split_ds,
    eval_dataset=val_split_ds,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=512,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=False,
)

Map:   0%|          | 0/500 [00:00<?, ? examples/s]

Map:   0%|          | 0/500 [00:00<?, ? examples/s]



Let's see how our base model performs without any fine tuning

In [None]:
from transformers import GenerationConfig


def generate(prompt, tokenizer, model):
    """Generate a completion from a prompt."""
    gen_config = GenerationConfig.from_pretrained(model.name_or_path, max_new_tokens=512)
    tokenized_prompt = tokenizer(prompt, return_tensors="pt", padding=True)["input_ids"].cuda()
    with torch.inference_mode():
        output = model.generate(inputs=tokenized_prompt, generation_config=gen_config)
    return tokenizer.decode(output[0][len(tokenized_prompt[0]) :], skip_special_tokens=True).strip()


row = {
    "question": "How far is the moon from the earth?",
    "contexts": [
        "The Moon is Earth's only natural satellite. It orbits at an average distance of 384,400 km (238,900 mi), \
about 30 times Earth's diameter. The Moon always presents the same side to Earth, because gravitational pull has \
locked its rotation to the planet. This results in the lunar day of 29.5 Earth days matching the lunar month. \
The Moon's gravitational pull – and to a lesser extent the Sun's – are the main drivers of the tides."
    ],
}
prompt = generate_text(row)["text"].strip()

for i in range(3):
    completion = generate(prompt, tokenizer, model)
    print(f"Question: {row['question']}\nAnswer: {completion}\n")

Question: How far is the moon from the earth?
Answer: The moon is 238,900 miles from the Earth.

### Solution:

```python
from math import *

earth_radius = 3959
moon_radius = 1737.1
moon_distance = 238900

earth_radius = 3959
moon_radius = 1737.1
moon_distance = 238900
```

### Source:
[Wikipedia](https://en.wikipedia.org/wiki/Moon)

### Notes:

### Hints:

### Attributions:

Question: How far is the moon from the earth?
Answer: ```
384,400 km
```

### Source:
[https://www.britannica.com/science/moon/Moon-Physics-and-Mechanics](https://www.britannica.com/science/moon/Moon-Physics-and-Mechanics)

### Links:
[https://en.wikipedia.org/wiki/Moon](https://en.wikipedia.org/wiki/Moon)

Question: How far is the moon from the earth?
Answer: The Moon is 238,900 miles (384,400 km) from Earth.

### Source:
[https://en.wikipedia.org/wiki/Moon#Orbit](https://en.wikipedia.org/wiki/Moon#Orbit)



As you can see, the answer is there, but it's not as short as we'd like. It's also not as consistent. For example, we want the output to just say **"384,400 km (238,900 mi)"**

If your dataset isn't about public data, chances are that the base model won't even know the answer.


Now, let's fine tune.

In [None]:
trainer.train()

| Step | Training Loss |
|---|---|
| 25 | 1.6234 |
| 50 | 1.3701 |
| 75 | 1.4134 |
| 100 | 1.1901 |
| 125 | 1.476 |
| 150 | 1.1912 |
| 175 | 1.4032 |
| 200 | 1.1816 |
| 225 | 1.4474 |
| 250 | 1.211 |




TrainOutput(global_step=500, training_loss=1.326797664642334, metrics={'train_runtime': 1861.6419, 'train_samples_per_second': 0.269, 'train_steps_per_second': 0.269, 'total_flos': 6171803433200832.0, 'train_loss': 1.326797664642334, 'epoch': 1.0})
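
The `'epoch': 1.0` in the output follows from the step budget rather than from `num_train_epochs=2`: when `max_steps` is set, it takes precedence. A quick check of the arithmetic, using the values from the training arguments above (the helper name is hypothetical):

```python
import math


def epochs_for_steps(num_examples, batch_size, grad_accum_steps, max_steps):
    """How many passes over the data a fixed step budget covers."""
    steps_per_epoch = math.ceil(num_examples / (batch_size * grad_accum_steps))
    return max_steps / steps_per_epoch


epochs = epochs_for_steps(num_examples=500, batch_size=1, grad_accum_steps=1, max_steps=500)
seconds_per_step = 1861.6419 / 500  # train_runtime / global_step from the output above

print(f"epochs covered:   {epochs}")
print(f"seconds per step: {seconds_per_step:.2f}")
```

To actually train for two epochs with these settings, one would drop `max_steps` or raise it to 1000.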

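The loss is noisy from one logging step to the next (we train with a batch size of 1), but it is trending down: from 1.62 at step 25 to around 1.2 by step 250. It's learning!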
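
Since single readings jump around, a trend is easier to judge from averages. A small sketch using the ten values logged above, comparing the mean of the first five with the mean of the last five:

```python
from statistics import mean

# (step, training loss) pairs copied from the log above
logged = [
    (25, 1.6234), (50, 1.3701), (75, 1.4134), (100, 1.1901), (125, 1.476),
    (150, 1.1912), (175, 1.4032), (200, 1.1816), (225, 1.4474), (250, 1.211),
]

losses = [loss for _, loss in logged]
first_half = mean(losses[:5])
second_half = mean(losses[5:])

print(f"steps 25-125:  {first_half:.4f}")
print(f"steps 150-250: {second_half:.4f}")
```

The second average (about 1.29) is lower than the first (about 1.41), which supports a downward trend, though ten noisy points are weak evidence on their own.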

Let's be honest. This is not the way we do Data Science. We need a larger dataset, split into train, val and test splits. Then we watch the loss curves for overfitting and underfitting, and stop early when the val loss goes up. That is out of scope for this notebook.
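
The early-stopping rule mentioned above can be sketched in a few lines. The validation losses here are made-up illustrative numbers, not results from this notebook, and `patience` (how many evaluations without improvement to tolerate) is a parameter introduced for the example.

```python
def early_stopping_point(val_losses, patience=2):
    """Return (best_index, stop_index) for a patience-based early stop.

    stop_index is the evaluation at which training would halt, or None if
    the run finished without exhausting the patience budget.
    """
    best_index, best_loss, bad_evals = 0, float("inf"), 0
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best: remember it and reset the patience counter.
            best_index, best_loss, bad_evals = i, loss, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return best_index, i
    return best_index, None


toy_val_losses = [1.50, 1.32, 1.25, 1.27, 1.31, 1.40]  # illustrative only
best, stop = early_stopping_point(toy_val_losses, patience=2)
print(f"best checkpoint at evaluation {best}, stop at evaluation {stop}")
```

In a `Trainer`-based setup, `transformers.EarlyStoppingCallback` implements the same idea. It needs periodic evaluation and `load_best_model_at_end=True`, neither of which this notebook configures.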

## Let's try it!

In [None]:
# Run text generation with our fine tuned model
row = {
    "question": "How far is the moon from the earth?",
    "contexts": [
        "The Moon is Earth's only natural satellite. It orbits at an average distance of 384,400 km (238,900 mi), \
about 30 times Earth's diameter. The Moon always presents the same side to Earth, because gravitational pull has \
locked its rotation to the planet. This results in the lunar day of 29.5 Earth days matching the lunar month. \
The Moon's gravitational pull – and to a lesser extent the Sun's – are the main drivers of the tides."
    ],
}
prompt = generate_text(row)["text"].strip()
for i in range(3):
    completion = generate(prompt, tokenizer, model)
    print(f"Question: {row['question']}\nAnswer: {completion}\n-----------------------------------------\n")

Question: How far is the moon from the earth?
Answer: 384,400 Km
-----------------------------------------

Question: How far is the moon from the earth?
Answer: 384,400 Km
-----------------------------------------

Question: How far is the moon from the earth?
Answer: 384,400 Km
-----------------------------------------



As you can see, the model is now consistent in its output and able to answer the question using the context.
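
Consistency across repeated generations can be put into a number. One simple option, sketched here with a hypothetical helper, is the share of completions that match the most common one. The inputs are the three fine-tuned outputs printed above and abbreviated versions (answer text only) of the three base-model outputs from earlier.

```python
from collections import Counter


def agreement_rate(completions):
    """Fraction of completions equal to the most frequent completion."""
    normalized = [c.strip().lower() for c in completions]
    _, top_count = Counter(normalized).most_common(1)[0]
    return top_count / len(normalized)


fine_tuned = ["384,400 Km", "384,400 Km", "384,400 Km"]
base_model = [  # answer text only; the extra "### Source" sections are omitted
    "The moon is 238,900 miles from the Earth.",
    "384,400 km",
    "The Moon is 238,900 miles (384,400 km) from Earth.",
]

print(f"fine-tuned agreement: {agreement_rate(fine_tuned):.2f}")
print(f"base model agreement: {agreement_rate(base_model):.2f}")
```

Three samples of one prompt are only an illustration. The same measure becomes more meaningful when averaged over many prompts.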


Again, evaluating the model with one-off examples is a bad idea. We need a more robust way of doing this. For example, is our model now going to hallucinate more? We should add "I don't know" or red-teamed examples to our dataset.
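
A first step toward a more robust evaluation is to score every row of a held-out split instead of eyeballing one prompt. The sketch below computes a normalized exact-match rate. The three (prediction, target) pairs are the untruncated `answer` / `original_answer` values from the table earlier in the notebook, standing in for real model outputs, and the function names are hypothetical.

```python
import string


def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def exact_match_rate(pairs):
    """Share of (prediction, target) pairs that match after normalization."""
    hits = sum(normalize(pred) == normalize(target) for pred, target in pairs)
    return hits / len(pairs)


# Stand-in data: (answer, original_answer) from rows 0, 1 and 4 of the table.
pairs = [
    ("Bacteria", "Mesophilic Organisms"),
    ("Coriolis Effect", "Coriolis Effect"),
    ("smoke and ash", "Smoke And Ash"),
]

score = exact_match_rate(pairs)
print(f"exact match: {score:.2f}")
```

Exact match only measures agreement with the target phrasing. It says nothing about hallucination, which is why the "I don't know" and red-teamed examples still need their own test cases.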
