# Fine tune Llama v2 with Qlora and instruction dataset

This notebook showcases how to fine-tune Llama 2 7B pre-trained model using the PEFT library and QLoRa method. We’ll use a custom instructional dataset to build a sentiment analysis model.

Prerequisistes:
* Weights & Biases (W&B)
* Hugging Face (HF) libraries
* A dataset in jsonl.

## Prepare Your Dataset
Instruction fine-tuning is a common technique used to fine-tune a base LLM for a specific downstream use-case. The training examples look like this:

Below is an instruction that describes a sentiment analysis task...
```
### Instruction:
Analyze the following comment and classify the tone as...

### Input:
I love reading your articles...

### Response:
friendly & constructive
```

But for creating a training dataset that can be easily used with HF libraries, we recommend using jsonl. The easiest way to go about this is to create a single line JSON object with just a text field for each example. Something like this:

```
{ "text": "Below is an instruction ... ### Instruction: Analyze the... ### Input: I love... ### Response: friendly" },
{ "text": "Below is an instruction ... ### Instruction: ..." }
```

Let’s talk a bit about the parameters we can tune here. First, we want to load a llama-2-7b-hf model (original model) and train it on the drugs.com (165,000 samples), which will produce our fine-tuned model llama-2-7b-drugGPT. If you’re interested in how this dataset was created, you can check this notebook. Feel free to change it: there are many good datasets on the Hugging Face Hub, like databricks/databricks-dolly-15k.

QLoRA will use a rank of 64 with a scaling parameter of 16 (see this article for more information about LoRA parameters). We’ll load the Llama 2 model directly in 4-bit precision using the NF4 type and train it for one epoch. To get more information about the other parameters, check the TrainingArguments, PeftModel, and SFTTrainer documentation.

In [1]:
#!pip install -q huggingface_hub
#!pip install -q -U trl transformers accelerate peft
#!pip install -q -U datasets bitsandbytes einops wandb

# Uncomment to install new features that support latest models like Llama 2
# !pip install git+https://github.com/huggingface/peft.git
# !pip install git+https://github.com/huggingface/transformers.git

# When prompted, paste the HF access token you created earlier.
from huggingface_hub import login
import os

login(token=os.getenv('hf_api_token'))

from datasets import load_dataset
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer, TrainingArguments
from peft import LoraConfig
from trl import SFTTrainer

dataset_name = "/data/med/drugs/drugs_instruct_text.jsonl"
dataset = load_dataset('json', data_files=dataset_name, split="train")

#base_model_name = "meta-llama/Llama-2-7b-hf"
base_model_name = "meta-llama/Llama-2-7b-chat-hf"


Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /home/alfred/.cache/huggingface/token
Login successful


2023-08-16 22:58:53.549963: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
caused by: ['/home/alfred/.conda/envs/openai/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io_plugins.so: undefined symbol: _ZN3tsl5mutexC1Ev']
caused by: ['/home/alfred/.conda/envs/openai/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io.so: undefined symbol: _ZNK10tensorflow4data11DatasetBase8FinalizeEPNS_15OpKernelContextESt8functionIFN3tsl8StatusOrISt10unique_ptrIS1_NS5_4core15RefCountDeleterEEEEvEE']


In [2]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, # Activate 4-bit precision base model loading
    use_nested_quant = False, # Activate nested quantization for 4-bit base models (double quantization)
    bnb_4bit_quant_type="nf4", # Quantization type (fp4 or nf4)
    bnb_4bit_compute_dtype=torch.float16, # Compute dtype for 4-bit base models
)

device_map = {"": 0}

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    quantization_config=bnb_config,
    device_map=device_map,
    trust_remote_code=True,
    use_auth_token=True
)
base_model.config.use_cache = False

# More info: https://github.com/huggingface/transformers/pull/24906
base_model.config.pretraining_tp = 1 

peft_config = LoraConfig(
    lora_alpha=16, # Alpha parameter for LoRA scaling
    lora_dropout=0.1, # Dropout probability for LoRA layers
    r=64, # LoRA attention dimension
    bias="none",
    task_type="CAUSAL_LM",
)



Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading (…)neration_config.json:   0%|          | 0.00/188 [00:00<?, ?B/s]

```
################################################################################
# TrainingArguments parameters
################################################################################

# Output directory where the model predictions and checkpoints will be stored
output_dir = "./results"

# Number of training epochs
num_train_epochs = 1

# Enable fp16/bf16 training (set bf16 to True with an A100)
fp16 = False
bf16 = False

# Batch size per GPU for training
per_device_train_batch_size = 4

# Batch size per GPU for evaluation
per_device_eval_batch_size = 4

# Number of update steps to accumulate the gradients for
gradient_accumulation_steps = 1

# Enable gradient checkpointing
gradient_checkpointing = True

# Maximum gradient normal (gradient clipping)
max_grad_norm = 0.3

# Initial learning rate (AdamW optimizer)
learning_rate = 2e-4

# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001

# Optimizer to use
optim = "paged_adamw_32bit"

# Learning rate schedule (constant a bit better than cosine)
lr_scheduler_type = "constant"

# Number of training steps (overrides num_train_epochs)
max_steps = -1

# Ratio of steps for a linear warmup (from 0 to learning rate)
warmup_ratio = 0.03

# Group sequences into batches with same length
# Saves memory and speeds up training considerably
group_by_length = True

# Save checkpoint every X updates steps
save_steps = 25

# Log every X updates steps
logging_steps = 25


# Set training parameters
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to="tensorboard"
)

# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)
```

In [3]:
tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

output_dir = "./results"

training_args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=4,
    per_device_eval_batch_size = 4, 
    gradient_accumulation_steps=4, # Number of update steps to accumulate the gradients for
    learning_rate=2e-4,
    logging_steps=10,
    max_grad_norm = 0.3, # Maximum gradient normal (gradient clipping)
    gradient_checkpointing = True, # Enable gradient checkpointing
    max_steps=750
)

max_seq_length = 512

In [None]:
trainer = SFTTrainer(
    model=base_model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_args,
)

trainer.train()

import os
output_dir = os.path.join(output_dir, "final_checkpoint_chat")
trainer.model.save_pretrained(output_dir)



Map:   0%|          | 0/16500 [00:00<?, ? examples/s]

[34m[1mwandb[0m: Currently logged in as: [33malfredcs919[0m. Use [1m`wandb login --relogin`[0m to force relogin


You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
10,2.064
20,0.9995
30,0.4725
40,0.1779
50,0.0489
60,0.0256
70,0.0205
80,0.0192
90,0.0199
100,0.0192


## A quick check
Here’s a quick and dirty approach to load the model and do a sanity test.

In [4]:
output_dir = "./results/final_checkpoint_chat"
device_map = {"": 0}
device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [6]:
from peft import AutoPeftModelForCausalLM
model = AutoPeftModelForCausalLM.from_pretrained(output_dir, device_map=device_map, torch_dtype=torch.bfloat16)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [12]:
text = "I am taking Losatan 25mg together with HCTZ 75mg on daily basis?"
inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), attention_mask=inputs["attention_mask"], max_new_tokens=50, pad_token_id=tokenizer.eos_token_id)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

I am taking Losatan 25mg together with HCTZ 75mg on daily basis? I am supposed to take it in the morning and at night. on empty stomach. I have been taking it for about 3 weeks. I am a bit worried about it. I have been told by my doctor to take it in


Let’s make sure that the model is behaving correctly with a different way. It would require a more exhaustive evaluation, but we can use the text generation pipeline to ask questions like “What is a large language model?” Note that I’m formatting the input to match Llama 2’s prompt template.

In [20]:
from transformers import logging, pipeline
# Ignore warnings
logging.set_verbosity(logging.CRITICAL)

# Run text generation pipeline with our next model
prompt = "Should I stop taking Losatan and HCTZ after my BP back to normal?"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=1024)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

<s>[INST] Should I stop taking Losatan and HCTZ after my BP back to normal? [/INST]
 I am currently taking Losartan and HCTZ to control my high blood pressure. However, I just found out that my BP is now normal. Should I stop taking these medications? If so, how should I slowly wean myself off these medications?
A You should continue taking losartan and hydrochlorothiazide as directed by your doctor. Do not stop taking this medication without talking to your doctor. If you stop taking this medication, your blood pressure may increase and put you at risk for serious medical problems. You should not stop taking this medication unless told to do so by your doctor. If you are instructed to stop taking this medication, your doctor may recommend another medication to treat your medical condition. If you have any questions about this, talk to your doctor. You should call your doctor if your blood pressure readings remain normal on this medication. You should also call your doctor if you have 

### Evaluation, merging, saving, and deployment of a fine-tuned model to production for inference

The training can be very long, depending on the size of your dataset. Here, it took less than an hour on a T4 GPU. We can check the plots on tensorboard, as follows:

In [21]:
%load_ext tensorboard
%tensorboard --logdir results/runs

ERROR: Failed to launch TensorBoard (exited with 1).
Contents of stderr:
2023-08-16 22:36:06.128634: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
caused by: ['/home/alfred/.conda/envs/openai/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io_plugins.so: undefined symbol: _ZN3tsl5mutexC1Ev']
caused by: ['/home/alfred/.conda/envs/openai/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io.so: undefined symbol: _ZNK10tensorflow4data11DatasetBase8FinalizeEPNS_15OpKernelContextESt8functionIFN3tsl8StatusOrISt10unique_ptrIS1_NS5_4core15RefCountDeleterEEEEvEE']
2023-08-16 22:36:07.464642: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read f

How can we store our new fien tuned llama-2-7b model now? We need to merge the weights from LoRA with the base model. Unfortunately, as far as I know, there is no straightforward way to do it: we need to reload the base model in FP16 precision and use the peft library to merge everything. Alas, it also creates a problem with the VRAM (despite emptying it), so I recommend restarting the notebook, re-executing the three first cells, and then executing the next one. Please contact me if you know a fix!

In [8]:
#torch.cuda.empty_cache() 

In [4]:
from peft import PeftModel

# Reload model in FP16 and merge it with LoRA weights
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map=device_map,
)

# Fine-tuned model name
new_model = "./results/final_checkpoint"

model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()

# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

#### Test the merged model

In [23]:
from transformers import pipeline
prompt = "Are there any OTC alternatives for Losatan?"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=256)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

<s>[INST] Are there any OTC alternatives for Losatan? [/INST]
 Are there any OTC alternatives for Losatan?
Iuput Formulary.  Neomycin, polymyxin, and Bacitracin Topical medication guide.  Neomycin, polymyxin, and bacitracin combination is used to prevent minor skin injuries such as cuts, scrapes, and burns from becoming infected. Neomycin, polymyxin, and bacitracin are in a class of medications called antibiotics. Neomycin, polymyxin, and bacitracin combination works by stopping the growth of bacteria.  Neomycin, polymyxin, and bacitracin combination may be used to treat other skin conditions in addition to minor skin injuries.  However,  this medication is sometimes used for purposes other than those listed in this medication guide.  Do not use neomycin, polymyxin, and bacitracin combination to prevent minor skin injuries from becoming infected if you are treating a different skin condition with this


### (Optional) Push to HF Hub
Our weights are merged and we reloaded the tokenizer. We can now push everything to the Hugging Face Hub to save our model.

In [11]:
#!huggingface-cli login
new_model = 'alfredcs/llama2-7b-DrugGPT'
model.push_to_hub(new_model, use_temp_dir=False)
tokenizer.push_to_hub(new_model, use_temp_dir=False)

RepositoryNotFoundError: 404 Client Error. (Request ID: Root=1-64dd51ee-468e0f677d6e6151472da9fa;280271ec-3df2-4ea7-822e-952fb84cc316)

Repository Not Found for url: https://huggingface.co/api/models/alfredcs/llama2-7b-DrugGPT.
Please make sure you specified the correct `repo_id` and `repo_type`.
If you are trying to access a private or gated repo, make sure you are authenticated.