# Fine-Tuning Llama 2.0 with Single GPU Magic

This notebook follows the article:

[https://ai.plainenglish.io/fine-tuning-llama2-0-with-qloras-single-gpu-magic-1b6a6679d436](https://ai.plainenglish.io/fine-tuning-llama2-0-with-qloras-single-gpu-magic-1b6a6679d436)

In [1]:
!pip install transformers datasets peft accelerate bitsandbytes safetensors

Collecting peft
  Downloading peft-0.10.0-py3-none-any.whl.metadata (13 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.43.1-py3-none-manylinux_2_24_x86_64.whl.metadata (2.2 kB)
Downloading peft-0.10.0-py3-none-any.whl (199 kB)
Downloading bitsandbytes-0.43.1-py3-none-manylinux_2_24_x86_64.whl (119.8 MB)
Installing collected packages: bitsandbytes, peft
Successfully installed bitsandbytes-0.43.1 peft-0.10.0


In [2]:
import os, sys
import torch
import datasets
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    DataCollatorForLanguageModeling,
    DataCollatorForSeq2Seq,
    Trainer,
    TrainingArguments,
    GenerationConfig
)
from peft import PeftModel, LoraConfig, prepare_model_for_kbit_training, get_peft_model

2024-04-16 20:01:27.206459: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-16 20:01:27.206558: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-16 20:01:27.338739: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


# Model Loading

[https://huggingface.co/NousResearch/Llama-2-7b-hf](https://huggingface.co/NousResearch/Llama-2-7b-hf)

**NOTE: the cells below were commented out. I kept hitting bugs from the accelerate/bitsandbytes libraries, and the only fix is that you NEED TO HAVE A GPU! I had been running on CPU.**

In [3]:
#!pip install -i https://pypi.org/simple/ bitsandbytes

#!pip install accelerate

#!pip install transformers==4.30

#!pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

In [4]:
### config ###
model_id = "NousResearch/Llama-2-7b-hf"
max_length = 512
device_map = "auto"
batch_size = 128
micro_batch_size = 32
gradient_accumulation_steps = batch_size // micro_batch_size

# "nf4" (4-bit NormalFloat) is the 4-bit quantization data type introduced with QLoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# load model from huggingface
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    use_cache=False,
    device_map=device_map
)

# load tokenizer from huggingface
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"


The helper function print_number_of_trainable_model_parameters inspects the trainable parameters of the original model. Running it reports 262,410,240 trainable parameters.

In [5]:
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    print(f"trainable model parameters: {trainable_model_params}. All model parameters: {all_model_params} ")
    return trainable_model_params

ori_p = print_number_of_trainable_model_parameters(model)

ori_p

trainable model parameters: 262410240. All model parameters: 3500412928 


262410240
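
The reported total of about 3.5 billion looks low for a 7B model. The sketch below reproduces both printed numbers with plain arithmetic. It assumes the published Llama-2-7B shapes (hidden size 4096, MLP size 11008, 32 layers, vocabulary 32000) and assumes that the 4-bit weights are stored packed two per byte, so that numel() counts only half of them. Under those assumptions, the "trainable" count is just the parameters that were not quantized.

```python
# Llama-2-7B shape constants (assumed from the published model config).
HIDDEN = 4096
INTERMEDIATE = 11008
LAYERS = 32
VOCAB = 32000

# Linear weights per decoder layer: q, k, v, o projections plus the 3 MLP matrices.
attn = 4 * HIDDEN * HIDDEN
mlp = 3 * HIDDEN * INTERMEDIATE
linear_total = LAYERS * (attn + mlp)

# Parameters left unquantized: token embeddings, lm_head, and RMSNorm weights.
embeddings = 2 * VOCAB * HIDDEN
norms = LAYERS * 2 * HIDDEN + HIDDEN
unquantized = embeddings + norms

# Assumption: 4-bit weights are packed two per byte, so numel() sees half of them.
packed_linear = linear_total // 2

reported_total = packed_linear + unquantized
true_total = linear_total + unquantized

print(f"unquantized (reported as trainable): {unquantized:,}")  # 262,410,240
print(f"reported total: {reported_total:,}")                    # 3,500,412,928
print(f"true total: {true_total:,}")                            # 6,738,415,616
```

Both figures match the cell output above, which suggests the helper undercounts the quantized layers rather than the model being smaller than expected.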

Next, we can wrap the model in a LoRA adapter, keeping the original parameters frozen and introducing a small number of additional trainable weights. The LoRA configuration has several parameters:

* r determines the rank of the update matrices, also known as the LoRA attention dimension. A lower rank results in smaller update matrices with fewer trainable parameters. Increasing r (not more than 32) can make the model more robust, at the cost of higher memory consumption.
* lora_alpha controls the LoRA scaling factor; the low-rank update is scaled by lora_alpha / r.
* target_modules is a list of module names, such as “q_proj” and “v_proj”, that the LoRA adapters are attached to. The specific module names vary with the underlying model.
* bias specifies whether the bias parameters should be trained. It can be 'none', 'all' or 'lora_only'.

After attaching the LoRA adapter, let’s print the trainable parameters again and compare them to the original model. Remarkably, the trainable parameter count is now 4,194,304, only about 1.6% of the 262,410,240 reported before the adapter was attached.
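
The figure of 4,194,304 can be checked by hand. The sketch below counts the weights one LoRA adapter adds (a rank-r matrix pair per target module) for r=8 on q_proj and v_proj, the settings used in this notebook. The hidden size of 4096 and the 32 decoder layers are assumptions taken from the published Llama-2-7B configuration, and lora_trainable_params is a hypothetical helper, not part of PEFT.

```python
def lora_trainable_params(r, in_features, out_features):
    """Parameters added by one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * in_features + out_features * r

HIDDEN = 4096                   # assumed Llama-2-7B hidden size
LAYERS = 32                     # assumed number of decoder layers
TARGETS = ["q_proj", "v_proj"]  # same target_modules as in this notebook
R, LORA_ALPHA = 8, 32

per_module = lora_trainable_params(R, HIDDEN, HIDDEN)
total = per_module * len(TARGETS) * LAYERS
scaling = LORA_ALPHA / R        # factor applied to the low-rank update

print(f"per module: {per_module:,}")            # 65,536
print(f"total LoRA parameters: {total:,}")      # 4,194,304
print(f"scaling (lora_alpha / r): {scaling}")   # 4.0
```

Doubling r doubles the adapter size, and adding more target modules scales it linearly, which is why r and target_modules are the main levers for the memory cost.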

In [6]:
# LoRA config
model = prepare_model_for_kbit_training(model)
peft_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, peft_config)

### compare trainable parameters #
peft_p = print_number_of_trainable_model_parameters(model)
print(f"# Trainable Parameter \nBefore: {ori_p} \nAfter: {peft_p} \nPercentage: {round(peft_p / ori_p * 100, 2)}")

trainable model parameters: 4194304. All model parameters: 3504607232 
# Trainable Parameter 
Before: 262410240 
After: 4194304 
Percentage: 1.6


# Test before finetuning

Just before the thrilling fine-tuning process, let’s not skip generating an output from the pre-trained language model and observing its response. In this case, when asked to write a poem about Singapore, the model produces vague and repetitive output, indicating that it struggles to give a coherent and meaningful response.

In [8]:
### generate ###
prompt = "Write me a poem about Singapore."
inputs = tokenizer(prompt, return_tensors="pt")
generate_ids = model.generate(inputs.input_ids, max_length=64)
print('\nAnswer: ', tokenizer.decode(generate_ids[0]))
res = tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(res)


Answer:  <s> Write me a poem about Singapore. nobody can write a poem about Singapore.
I'm not sure if I'm allowed to write a poem about Singapore.
I'm not sure if I'm allowed to write a poem about Singapore. I'm not sure if I'm allowed to write
Write me a poem about Singapore. nobody can write a poem about Singapore.
I'm not sure if I'm allowed to write a poem about Singapore.
I'm not sure if I'm allowed to write a poem about Singapore. I'm not sure if I'm allowed to write


**Note: when I run the cell above, I get a warning saying that input_ids are not on the same device as the model.**

In [9]:
device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cuda'

In [10]:
### generate ###
prompt = "Write me a poem about Singapore."
inputs = tokenizer(prompt, return_tensors="pt")

# my code:
inputs = inputs.to(device)

generate_ids = model.generate(inputs.input_ids, max_length=64)
print('\nAnswer: ', tokenizer.decode(generate_ids[0]))
res = tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(res)


Answer:  <s> Write me a poem about Singapore. nobody can write a poem about Singapore.
I'm not sure if I'm allowed to write a poem about Singapore.
I'm not sure if I'm allowed to write a poem about Singapore. I'm not sure if I'm allowed to write
Write me a poem about Singapore. nobody can write a poem about Singapore.
I'm not sure if I'm allowed to write a poem about Singapore.
I'm not sure if I'm allowed to write a poem about Singapore. I'm not sure if I'm allowed to write


The results aren't great, but they match the article: the author's output was also incoherent, so this is OK.

# Data Loading

To demonstrate the process of fine-tuning an instruction-following LLM, we are going to use the public dataset databricks/databricks-dolly-15k, which contains instruction-response pairs. Some samples also include contextual information, which the model must take into account when responding. Here is a sample record from the raw data:

```
{
    'instruction': 'Why can camels survive for long without water?',
    'context': '',
    'response': 'Camels use the fat in their humps to keep them filled with energy and hydration for long periods of time.',
    'category': 'open_qa',
}
```

A prompt_template is created to present each sample to the model in a consistent format. It has two variants: prompt_input and prompt_no_input. The former is used for samples that include an input context, while the latter is used for samples without one. Pairing each instruction with its context (when available) lets the model condition its response on that context.

In [11]:
max_length = 256
dataset = datasets.load_dataset(
    "databricks/databricks-dolly-15k", split='train'
)

### generate prompt based on template ###
prompt_template = {
    # adjacent string literals are concatenated, so no stray indentation ends up in the prompt
    "prompt_input": (
        "Below is an instruction that describes a task, paired with an input that provides further context. "
        "Write a response that appropriately completes the request."
        "\n\n### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
    ),

    "prompt_no_input": (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request."
        "\n\n### Instruction:\n{instruction}\n\n### Response:\n"
    ),

    "response_split": "### Response:"
}

def generate_prompt(instruction, input=None, label=None, prompt_template=prompt_template):
    if input:
        res = prompt_template["prompt_input"].format(
            instruction=instruction, input=input)
    else:
        res = prompt_template["prompt_no_input"].format(
            instruction=instruction)
    if label:
        res = f"{res}{label}"
    return res


We first generate the full prompt by combining the instruction, context, and response using the generate_prompt function. We then tokenize it with the tokenizer, which transforms the text into input_ids and attention_mask. To train the model to predict the next token, we set the labels to a copy of the input_ids; the one-position shift between inputs and targets is applied when the loss is computed. However, to keep the model from being trained to predict the instruction and context, we mask all the tokens in these segments by replacing their labels with -100, retaining only the response tokens. The data is then split into training and validation sets, and the unneeded columns are removed, leaving a dataset ready for training.
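
The masking step can be seen on a toy example. This is a minimal sketch with made-up token ids and a hypothetical helper, mask_prompt_labels; it uses no tokenizer and only illustrates how the prompt positions are replaced by -100, the label value that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def mask_prompt_labels(full_ids, prompt_len):
    """Copy input_ids to labels, masking the first prompt_len positions."""
    return [IGNORE_INDEX] * prompt_len + list(full_ids[prompt_len:])

# Hypothetical token ids: the first four stand for the prompt, the rest for the response.
full_ids = [1, 450, 9508, 29901, 5500, 1379, 2]
labels = mask_prompt_labels(full_ids, prompt_len=4)

print(labels)  # [-100, -100, -100, -100, 5500, 1379, 2]
```

Only the last three positions contribute to the loss, so the gradient comes from the response tokens alone.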

In [12]:
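def tokenize(tokenizer, prompt, max_length=max_length, add_eos_token=False):
    # Tokenize without padding; the data collator pads each batch later.
    result = tokenizer(
        prompt,
        truncation=True,
        max_length=max_length,
        padding=False,
        return_tensors=None)

    # Optionally append EOS if there is room (off by default, so behavior is unchanged).
    if (add_eos_token
            and len(result["input_ids"]) < max_length
            and result["input_ids"][-1] != tokenizer.eos_token_id):
        result["input_ids"].append(tokenizer.eos_token_id)
        result["attention_mask"].append(1)

    # Labels start as a copy of input_ids; the prompt part is masked afterwards.
    result["labels"] = result["input_ids"].copy()
    return result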

def generate_and_tokenize_prompt(data_point):
    full_prompt = generate_prompt(
        data_point["instruction"],
        data_point["context"],
        data_point["response"],
    )
    tokenized_full_prompt = tokenize(tokenizer, full_prompt)
    user_prompt = generate_prompt(data_point["instruction"], data_point["context"])
    tokenized_user_prompt = tokenize(tokenizer, user_prompt)
    user_prompt_len = len(tokenized_user_prompt["input_ids"])
    mask_token = [-100] * user_prompt_len
    tokenized_full_prompt["labels"] = mask_token + tokenized_full_prompt["labels"][user_prompt_len:]
    return tokenized_full_prompt

dataset = dataset.train_test_split(test_size=1000, shuffle=True, seed=42)
cols = ["instruction", "context", "response", "category"]
train_data = dataset["train"].shuffle().map(generate_and_tokenize_prompt, remove_columns=cols)
val_data = dataset["test"].shuffle().map(generate_and_tokenize_prompt, remove_columns=cols,)

Map:   0%|          | 0/14011 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

# Model Training

With the data and model prepared, it is time to start training. The trainer’s settings can be adjusted as needed; I ran 200 steps, a training session of about five hours on Google Colab.

In [None]:
# params changed from the article: num_train_epochs 20 -> 1, save_total_limit 3 -> 1
# note: max_steps=200 takes precedence over num_train_epochs

args = TrainingArguments(
    output_dir="./llama-7b-int4-dolly",
    num_train_epochs=1,
    max_steps=200,
    fp16=True,
    optim="paged_adamw_32bit",
    learning_rate=2e-4,
    lr_scheduler_type="constant",
    per_device_train_batch_size=micro_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    gradient_checkpointing=True,
    group_by_length=False,
    logging_steps=10,
    save_strategy="epoch",
    save_total_limit=1,
    disable_tqdm=False,
    report_to=["tensorboard"],
)

trainer = Trainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=val_data,
    args=args,
    data_collator=DataCollatorForSeq2Seq(
      tokenizer, pad_to_multiple_of=8, return_tensors="pt", padding=True),
)

# silence the warnings. re-enable for inference!
model.config.use_cache = False
trainer.train()
model.save_pretrained("llama-7b-int4-dolly")
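
As a rough check on what these arguments imply, the sketch below works out the effective batch size and how much of the dataset 200 steps cover. It assumes a single GPU (so the per-device batch size is the whole micro-batch) and uses the 14,011-row training split shown in the Map output earlier.

```python
batch_size = 128
micro_batch_size = 32
gradient_accumulation_steps = batch_size // micro_batch_size
max_steps = 200
train_examples = 14011  # training rows left after holding out 1,000 for validation

# One optimizer step consumes several micro-batches (single GPU assumed).
examples_per_step = micro_batch_size * gradient_accumulation_steps
examples_seen = examples_per_step * max_steps
epochs = examples_seen / train_examples

print(gradient_accumulation_steps)  # 4
print(examples_per_step)            # 128
print(examples_seen)                # 25600
print(round(epochs, 2))             # 1.83
```

So the run passes over the training data a little under twice, regardless of the num_train_epochs setting.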

# Generation

After several hours of training on a single GPU, it’s time to test the model’s performance using the input prompt “Write me a poem about Singapore”, which we used previously. The code snippet starts by loading the pre-trained Llama-2-7b-hf model and the PEFT weights. The model’s generation configuration controls factors such as:

* temperature controls the randomness of the generation process. When the temperature is high, the generator is more random and produces diverse but less coherent outputs. When the temperature is low, the generator is less random and produces more coherent but less diverse outputs.
* top-p (nucleus sampling) restricts sampling to the smallest set of most-probable tokens whose cumulative probability reaches p.
* top-k is similar to top-p, but instead of a cumulative-probability threshold it keeps a fixed number k of the highest-probability tokens.
* num_beams sets the width of beam search, an algorithm that lets the model consider multiple candidate outputs simultaneously. It maintains a set of partial outputs, called the “beam”, and iteratively extends each one, keeping only the highest-scoring candidates.
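
The first three settings can be illustrated without a model. The following is a simplified sketch in plain Python, not the implementation used by transformers; the helper names and the four-token logits are hypothetical. It shows how temperature sharpens or flattens the distribution and which tokens survive top-k and top-p filtering.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_indices(probs, k):
    """Indices of the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return order[:k]

def top_p_indices(probs, p):
    """Smallest set of most-probable tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return kept

logits = [2.0, 1.0, 0.5, -1.0]            # hypothetical scores for a 4-token vocabulary
sharp = softmax(logits, temperature=0.1)  # low temperature: mass concentrates on token 0
flat = softmax(logits, temperature=2.0)   # high temperature: distribution flattens
probs = softmax(logits)

print(top_k_indices(probs, k=2))    # [0, 1]
print(top_p_indices(probs, p=0.9))  # [0, 1, 2]
```

With these logits, top-k with k=2 always keeps two tokens, while top-p with p=0.9 keeps three because the top two cover only about 83% of the probability mass.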

In [None]:
# model path and weight
model_id = "NousResearch/Llama-2-7b-hf"
peft_path = "./llama-7b-int4-dolly"

# loading model
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    use_cache=False,
    device_map="auto"
)

# loading peft weight
model = PeftModel.from_pretrained(
    model,
    peft_path,
    torch_dtype=torch.float16,
)
model.eval()

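# generation config
# note: temperature, top_p and top_k only take effect when do_sample=True;
# as written, generation uses beam search with 4 beams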
generation_config = GenerationConfig(
    temperature=0.1,
    top_p=0.75,
    top_k=40,
    num_beams=4, # beam search
)

# generating reply
with torch.no_grad():
    prompt = "Write me a poem about Singapore."
    inputs = tokenizer(prompt, return_tensors="pt").to(device)  # same device as the model
    generation_output = model.generate(
        input_ids=inputs.input_ids,
        generation_config=generation_config,
        return_dict_in_generate=True,
        output_scores=True,
        max_new_tokens=64,
    )
    print('\nAnswer: ', tokenizer.decode(generation_output.sequences[0]))

The model’s output now shows promising improvement over the pre-trained model. Although the result might not meet our high poetic expectations, keep in mind that we used the smallest available Llama 2 model (7B) and trained only a small set of LoRA weights for a short period. Given those constraints, the outcome is already impressive.