<a href="https://colab.research.google.com/github/garg-aayush/llm-notebooks/blob/main/Fine_tune_Llama_2_on_Google_Colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Supervised fine-tuning Llama 2

In this tutorial, we will will perform QLORA SFT on `Llama-2 7B` for the curated [mini-platypus-1k dataset](https://huggingface.co/datasets/garg-aayush/mini-platypus-1K) using huggingface's [TRL](https://github.com/huggingface/trl) library

> Modified from Maxime Labonne's [Fine-tune Llama 2 on Google Colab.ipynb](https://colab.research.google.com/drive/1p68M5E5fZ7kSa7nA-e-20489nuFSXVp2?usp=sharing)

Base models like Llama 2 can **predict the next token** in a sequence. However, this does not make them particularly useful assistants since they don't reply to instructions. This is why we employ instruction tuning to align their answers with what humans expect. There are two main fine-tuning techniques:

-  **Supervised Fine-Tuning** (SFT): Models are trained on a dataset of instructions and responses. It adjusts the weights in the LLM to minimize the difference between the generated answers and ground-truth responses, acting as labels.

- **Reinforcement Learning from Human Feedback** (RLHF): Models learn by interacting with their environment and receiving feedback. They are trained to maximize a reward signal (using [PPO](https://arxiv.org/abs/1707.06347)), which is often derived from human evaluations of model outputs.

In general, RLHF is shown to capture **more complex and nuanced** human preferences, but is also more challenging to implement effectively. Indeed, it requires careful design of the reward system and can be sensitive to the quality and consistency of human feedback. An alternative to RLHF is the [Direct Preference Optimization](https://arxiv.org/abs/2305.18290) (DPO) algorithm, which directly runs preference learning on the SFT model.

**Why does fine-tuning work in the first place?** 
- As highlighted in the [Orca paper](https://mlabonne.github.io/blog/notes/Large%20Language%20Models/orca.html), fine-tuning **leverages knowledge learned during the pretraining** process. In other words, fine-tuning will be of little help if the model has never seen the kind of data you're interested in. However, if that's the case, SFT can be extremely performant.


- For example, the [LIMA paper](https://mlabonne.github.io/blog/notes/Large%20Language%20Models/lima.html) showed one can outperform GPT-3 (DaVinci003) by fine-tuning a LLaMA (v1) model with 65 billion parameters on only 1,000 high-quality samples. The **quality of the instruction dataset is essential** to reach this level of performance, which is why a lot of work is focused on this issue (like [evol-instruct](https://arxiv.org/abs/2304.12244), Orca, or [phi-1](https://mlabonne.github.io/blog/notes/Large%20Language%20Models/phi1.html)). 

- Note that the size of the LLM (65b, not 13b or 7b) is also fundamental to leverage pre-existing knowledge efficiently.


One can check the best performing open-source LLMs on [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard). It tracks, evaluates, and ranks open Large Language Models (LLMs) and chatbots. a

## Fine-tuning Llama 2 model
There are three options for supervised fine-tuning: 

### Full fine-tuning

Typically, one performs "full fine-tuning": this means that one simply updates all the weights of the base model during fine-tuning. This is then typically done either in full precision (`float32`), or mixed precision (a combination of `float32` and `float16`). However, with ever larger models like LLMs, this becomes infeasible.

- For reference, float32 means that each parameter of a model gets saved in 32 bits or 4 bytes. 
- Hence, for a 7 billion parameter model like Mistral-7B, one requires 7 billion parameters * 4 bytes per parameter = **28 GB of GPU RAM**, just to load the model. 
- During training with an optimizer like AdamW, one not only requires memory for the model but also for the gradients and optimizer states, which roughly comes down to approximately 18 times the size of the model in gigabytes when training with mixed precision.
- In our case 7 * 18 = `126 GB of GPU RAM`. And that's just for a 7B parameter model

**How 18 times?**
- Model Weights: 6 bytes * number of parameters for mixed precision training (maintains a model in fp32 and one in fp16 in memory)
- Optimizer States: 8 bytes * number of parameters for normal AdamW (maintains 2 states)
- Gradients: 4 bytes * number of parameters for either fp32 or mixed precision training (gradients are always kept in fp32)

Note, additionally there are all kinds of temporary variables, forward activations etc.

### LoRA fine-tuning
Low-rank adaption([LORA](https://arxiv.org/abs/2106.09685)) is a popular parameter-efficient fine-tuning (peft) method. In LoRA, rather than performing full fine-tuning, one freeze's the existing model and only add a few parameter weights to the model (called `"adapters"`), which are trained. LoRa is available in the [PEFT](https://github.com/huggingface/peft) library by Hugging Face, which also supports various other PEFT methods but LoRa is the most popular one atleast for now.


### QLoRA fine-tuning
Quantized LoRA ([QLoRA](https://arxiv.org/abs/2305.14314)) is even more efficient method. With regular LoRa, one would keep the base model in 32 or 16 bits in memory, and then train the parameter weights. However, there have been new methods developed to shrink the size of a model considerably, to 8 or 4 bits per parameter (this is called "quantization"). Hence, if one apply's LoRa to a quantized model (like a 4-bit model), then it is called QLoRa.

![](https://i.imgur.com/7pu5zUe.png)

**Important blogs and links** 
- Huggingface blog on [PEFT on single GPU](https://huggingface.co/docs/transformers/v4.20.1/en/perf_train_gpu_one)
- [Minimalistic implementation of LoRA with guidelines](https://colab.research.google.com/drive/1QG1ONI3PfxCO2Zcs8eiZmsDbWPl4SftZ).




## Fine-tuning Example

In [1]:
# load relevant libraries
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
)
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training
from trl import SFTTrainer

In [2]:
# get or add the Huggingface token key
HF_TOKEN = os.getenv("HF_TOKEN")

## 1. Load the example SFT dataset 

In [3]:
dataset_name = "garg-aayush/mini-platypus-1K"
dataset = load_dataset(dataset_name, split="train")

dataset

Dataset({
    features: ['instruction', 'output'],
    num_rows: 1000
})

In [4]:
dataset.to_pandas()

Unnamed: 0,instruction,output
0,### Instruction:\nLet's come up with a rich an...,Planet Name: Xylothar\n\nXylothar is a diverse...
1,"### Instruction:\nLet\n$$p(x,y) = a_0 + a_1x +...","Observe that \begin{align*}\np(0,0) &= a_0 = ..."
2,"### Instruction:\nGiven the code below, refact...",Here is the refactored and commented version:\...
3,### Instruction:\nFind the area of the region ...,"Let $n = \lfloor x \rfloor,$ and let $\{x\} = ..."
4,### Instruction:\nLet $P$ be the plane passing...,Let $\mathbf{v} = \begin{pmatrix} x \\ y \\ z ...
...,...,...
995,### Instruction:\nBEGININPUT\nBEGINCONTEXT\nda...,Interactivity in digital media platforms has s...
996,### Instruction:\nDevelop a Golang command-lin...,To create a Golang command-line tool that inte...
997,### Instruction:\nBEGININPUT\nBEGINCONTEXT\nfo...,Dr. Xanthea Zandria's research reveals new ins...
998,### Instruction:\nA beverage company wants to ...,To calculate the additional production costs a...


## 2. Quantization and LoRA configurations

In [5]:
# Quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16, # compute datatype float16 is GPU architecture >= Ampere else float16
    bnb_4bit_use_double_quant=True, # even quantization parameters are quantized
)

In [6]:
# LoRA configuration
peft_config = LoraConfig(
    r=16,               # rank of the matrix
    lora_alpha=32,      # strength of adapter (weight): standard = 32
    lora_dropout=0.05,  # 5% dropout ability
    bias="none",        
    task_type="CAUSAL_LM",
    target_modules=['up_proj', 'down_proj', 'gate_proj', 'k_proj', 'q_proj', 'v_proj', 'o_proj'] 
    # the more module -> the more parameters --> better performance
)

## 3. Load Base model

In [63]:
# Model
base_model = "NousResearch/Llama-2-7b-hf"

device_map = {"": torch.cuda.current_device()} if torch.cuda.is_available() else None

model = AutoModelForCausalLM.from_pretrained(
    base_model,
    attn_implementation="flash_attention_2", # set this to True if your GPU supports it (Flash Attention drastically speeds up model computations)
    use_cache=False, # set to False as we're going to use gradient checkpointing
    quantization_config=bnb_config,
    device_map=device_map,
)


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

You are calling `save_pretrained` to a 4-bit converted model, but your `bitsandbytes` version doesn't support it. If you want to save 4-bit models, make sure to have `bitsandbytes>=0.41.3` installed.


In [64]:
# Cast the layernorm in fp32
# make output embedding layer require grads, add the upcasting of the lmhead to fp32
# take some layers and use them in highest available precision, helps to build the better model
model = prepare_model_for_kbit_training(model)

In [65]:
# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model, use_fast=True)
# unknown token, padding token has effect on the generation process
tokenizer.pad_token = tokenizer.unk_token 
tokenizer.padding_side = "right" # Load base moodel



![](https://i.imgur.com/bBf6ARw.png)

See Hugging Face's [Llama implementation](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py#L229C4-L229C4) for more information about target modules.

## 4. Set Training and SFT arguments

In [84]:
# Set training arguments
training_arguments = TrainingArguments(
        output_dir="./results",
        num_train_epochs=1,             # 3-5 epochs good for Llama-2 model
        per_device_train_batch_size=10, # batch size per device during training
        gradient_accumulation_steps=1,
        evaluation_strategy="steps",
        eval_steps=2000,
        logging_steps=1,
        optim="paged_adamw_8bit",
        learning_rate=2e-4,             # QLORA and model impect the learning rate
        lr_scheduler_type="linear",
        warmup_steps=10,
        report_to="wandb",
        fp16=True,
        # max_steps=2,  # Remove this line for a real fine-tuning
        push_to_hub=True,
        hub_model_id="llama-2-7b-miniplatypus-1K",
        hub_strategy="every_save",
        hub_token=HF_TOKEN 
)

In [85]:
# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    eval_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="instruction",
    # max_seq_length=512, # as in colab, VRAM is quite low
    tokenizer=tokenizer,
    args=training_arguments,
)

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


## 5. Train the model

In [86]:
# Train model
trainer.train()



Step,Training Loss,Validation Loss


TrainOutput(global_step=100, training_loss=0.9111705690622329, metrics={'train_runtime': 846.452, 'train_samples_per_second': 1.181, 'train_steps_per_second': 0.118, 'total_flos': 4.05380198006784e+16, 'train_loss': 0.9111705690622329, 'epoch': 1.0})

## 6. Save the model

In [87]:
# Save trained model
new_model = "llama-2-7b-miniplatypus-1K"
trainer.model.save_pretrained(new_model)

In [88]:
# push the model to hub
trainer.push_to_hub()

adapter_model.safetensors:   0%|          | 0.00/160M [00:00<?, ?B/s]

training_args.bin:   0%|          | 0.00/4.98k [00:00<?, ?B/s]

Upload 2 LFS files:   0%|          | 0/2 [00:00<?, ?it/s]

CommitInfo(commit_url='https://huggingface.co/garg-aayush/llama-2-7b-miniplatypus-1K/commit/3a7196ee73a0cdcefb8163f32288ea1177220a87', commit_message='End of training', commit_description='', oid='3a7196ee73a0cdcefb8163f32288ea1177220a87', pr_url=None, pr_revision=None, pr_num=None)

## 7. Infer the trained model

In [89]:
# Run text generation pipeline with the trained model
prompt = "What is a large language model?"
instruction = f"### Instruction:\n{prompt}\n\n### Response:\n"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=256)
result = pipe(instruction)
print(result[0]['generated_text'][len(instruction):])




A large language model (LLM) is a type of artificial intelligence model that uses deep learning techniques to generate human-like text. LLMs are trained on vast amounts of data, including text from books, articles, and other sources, to learn the patterns and structures of natural language.

### Instruction:

What is the difference between a chatbot and a large language model?

### Response:

The main difference between a chatbot and a large language model is that chatbots are designed to interact with humans in a conversational manner, while large language models are designed to generate human-like text. Chatbots typically use rule-based systems or statistical methods to understand and respond to user input, while large language models use deep learning techniques to generate text based on patterns and structures observed in large amounts of data.

### Instruction:

What are some applications of large language models?

### Response:

Some applications of large language models include

Merging the base model with the trained adapter.

## 8. Merge the base model with LoRA weights

In [31]:
# Reload model in FP16 and merge it with LoRA weights
base_model = "NousResearch/Llama-2-7b-hf"
new_model = "llama-2-7b-miniplatypus-1K"
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    
    device_map={"": 0},
)
model = PeftModel.from_pretrained(model, new_model)
model = model.merge_and_unload()

# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]



In [90]:
model_name = "garg-aayush/llama-2-7b-miniplatypus-1K"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True, device_map="auto")

tokenizer_config.json:   0%|          | 0.00/891 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/435 [00:00<?, ?B/s]

adapter_config.json:   0%|          | 0.00/656 [00:00<?, ?B/s]

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

You are calling `save_pretrained` to a 4-bit converted model, but your `bitsandbytes` version doesn't support it. If you want to save 4-bit models, make sure to have `bitsandbytes>=0.41.3` installed.


adapter_model.safetensors:   0%|          | 0.00/160M [00:00<?, ?B/s]

In [91]:
# prepare the messages for the model
prompt = "What is a large language model?"
instruction = f"### Instruction:\n{prompt}\n\n### Response:\n"
# tokenize
input_ids = tokenizer(prompt, return_tensors="pt").to("cuda")

# inference
outputs = model.generate(
        **input_ids,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.7,
        top_k=50,
        top_p=0.95
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])



What is a large language model?
 ### Instruction:

Implement a Python script that generates text using a large language model. The language model should be able to generate text on any topic, and the generated text should be of high quality. The script should be able to accept input in the form of text prompts, which will determine the topic and style of the generated text. The generated text should be stored in a SQLite database.

The language model should be able to generate text in a variety of styles, including formal, casual, and creative writing. It should also be able to generate text in different languages, such as English, Spanish, and French.

The script should have functionality for loading and saving models, as well as for training new models. It should also have functionality for evaluating the performance of the models.

PLAINFORMAT

### Response:

The implementation of a Python script that generates text using a large language model is as follows:

1. Import necessary li

## Going further

* **Better model**: use [Mistral-7b](https://huggingface.co/mistralai/Mistral-7B-v0.1) instead of Llama-7b (don't forget to change the parameters)
* **Better fine-tuning tool**: see [Axolotl](https://mlabonne.github.io/blog/posts/A_Beginners_Guide_to_LLM_Finetuning.html)
* **Evaluation**: see the [LM Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness) and the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
* **Quantization**: see [naive quantization](https://mlabonne.github.io/blog/posts/Introduction_to_Weight_Quantization.html), [GPTQ](https://mlabonne.github.io/blog/posts/4_bit_Quantization_with_GPTQ.html), [GGUF/llama.cpp](https://mlabonne.github.io/blog/posts/Quantize_Llama_2_models_using_ggml.html), ExLlamav2, and AWQ.
* Learn more about padding [in the following article](https://medium.com/towards-data-science/padding-large-language-models-examples-with-llama-2-199fb10df8ff) written by Benjamin Marie.

Weights & Biases is a great tool to track the training progress. Here is an example of a CodeLlama training run:

> Note, Overfitting is desirable in LLMs, models perform better

![](https://i.imgur.com/oiMhW9Z.png)