# Fine tune Llama Model

Install all required library

In [1]:
!pip install \
accelerate \
peft \
bitsandbytes \
transformers \
trl \
triton
!pip install --upgrade transformers accelerate

Collecting bitsandbytes
  Downloading bitsandbytes-0.46.0-py3-none-manylinux_2_24_x86_64.whl.metadata (10 kB)
Collecting trl
  Downloading trl-0.18.1-py3-none-any.whl.metadata (11 kB)
Collecting fsspec<=2025.3.0,>=2023.1.0 (from fsspec[http]<=2025.3.0,>=2023.1.0->datasets>=3.0.0->trl)
  Downloading fsspec-2025.3.0-py3-none-any.whl.metadata (11 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch>=2.0.0->accelerate)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch>=2.0.0->accelerate)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.2.1.3 (from torch>=2.0.0->accelerate)
  Downloading nvidia_cufft_cu12-11.2.1.3-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-curand-cu12==10.3.5.147 (from torch>=2.0.0->accelerate)
  Downloading nvidia_curand_cu12-10.3.5.147-py3-none-manylinux2014_x86_64.whl.meta

Import all required libraries

In [2]:
import os
import torch
from datasets import load_dataset
from transformers import(
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging
)
from peft import  LoraConfig, PeftModel
from trl import SFTTrainer

2025-06-02 06:24:07.109619: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1748845447.331969      35 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1748845447.387540      35 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


In case of Llama 2, the following prompt templae is used for the chat models.

System Prompt(optional) to guide the model
User prompt(required) to give the instruction
Model Answer (required)




In [3]:
# <s> [INST] <<SYS>>
# System prompt
# <</SYS>>

# User Prompt [/INS] Model answer </s>

We will reformat our instruction dataset fo follow Llama 2 template

* Original dataset: https://huggingface.co/datasets/Amod/mental_health_counseling_conversations

How to fine tune Llama 2

* Free Google Colab offers a 15GB Graphics Card (Limited Resource -> Barely enough to store Llama 2-7b's weight).

* We also need to consider the overhead due to optimizer states, gradient, and forward action activations.

* Full fine tuning is not possible here: we need parameter-efficient fine-tuning (PEFT) techquies like LoRA, QLoRA.

* To drastically reduce the VRAM usage, we must fine-tune the model in 4-bit precision, which is why we'll use LoRA and QLoRA here.

1. Load llama2-7b model
2. Train it on the mlabonne/guanaco-llama2-1k which produce oru fine tuned lamma-2-7b-chat-finetune

QLoRA will use a rank of 64 with scaling parameter of 16. We'll load the Llama 2 model directly in 4-bit precision using the NF4 type and train it for one epoch

In [None]:
model_name = "NousResearch/Llama-2-7b-chat-hf"

# dataset_name = "mlabonne/guanaco-llama2-1k"
dataset_name = "Amod/mental_health_counseling_convergit isations"

model_fine_tune_name = "Llama-2-7b-chat-hf-finetune"

In [5]:
## QLoRA parameters
lora_r = 8  # rank
lora_alpha = 16
lora_dropout = 0.1

## bitsandbytes parameters
use_4bit = True # activation 4-bit precision model base model loading
bnb_4bit_compute_dtype = 'float16' # compute dtype for 4-bit base models
bnb_4bit_quant_type = 'nf4' # Quantization type (fp4 or nf4)
use_nested_quant = False # activate nested quantization for 4-bit base models

## TrainingArguments Parameter

output_dir = "./results"
num_train_epochs = 1
# Enable fp16/bp16 training
fp16 = True
bf16 = False

per_device_train_batch_size = 1

per_device_eval_batch_size = 1

gradient_accmulation_steps = 4

gradient_checkpointing = True

max_grad_norm = 0.3 # Maximum gradient normal (gradient clipping)

learning_rate = 2e-4 # Initial learning rate (AdamW optimizer)

weight_decay = 0.001

optim = "paged_adamw_32bit" # optimizer to use

lr_schedule_type = "cosine" # Learning rate schedule

max_steps = -1 # No of training steps (override train epcohs)

warmup_ratio = 0.03 # Ratio of steps for a linear warmup (from 0 to learning rate)

group_by_length = True # group sequences into batches with same length

save_steps = 0

logging_steps = 25

## STF Parameter

max_seq_length = None
packing = False # pack multiple short example in the same input sequence to increase efficency
device_map = {"": 0} # Load the entire model on the GPU 0

Load everything and start the fine-tuning process

1. First of all we want to load the dataset we defined. Here our dataset is already preprocessed but, usually this is where you would reformat the prompt, filter out bax text, combine multiple datasets, etc.

2. Then, we're configuring bitsandbytes for 4-bit quantization.

3. Next, we're loading the Llama 2 model in 4-bit precision on a GPU with the corresponding tokenizer.

4. Finally, we're loading configurations for QLoRA, regular training parameters, and passing everything to the STF traniner. The training can finally start.

In [6]:
# Load the dataset (you can process it here)
dataset = load_dataset(dataset_name, split='train')

README.md:   0%|          | 0.00/2.82k [00:00<?, ?B/s]

combined_dataset.json:   0%|          | 0.00/4.79M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/3512 [00:00<?, ? examples/s]

In [7]:
dataset['Response'][:1]

["If everyone thinks you're worthless, then maybe you need to find new people to hang out with.Seriously, the social context in which a person lives is a big influence in self-esteem.Otherwise, you can go round and round trying to understand why you're not worthless, then go back to the same crowd and be knocked down again.There are many inspirational messages you can find in social media. \xa0Maybe read some of the ones which state that no person is worthless, and that everyone has a good purpose to their life.Also, since our culture is so saturated with the belief that if someone doesn't feel good about themselves that this is somehow terrible.Bad feelings are part of living. \xa0They are the motivation to remove ourselves from situations and relationships which do us more harm than good.Bad feelings do feel terrible. \xa0 Your feeling of worthlessness may be good in the sense of motivating you to find out that you are much better than your feelings today."]

In [8]:
dataset['Context'][:1]

["I'm going through some things with my feelings and myself. I barely sleep and I do nothing but think about how I'm worthless and how I shouldn't be here.\n   I've never tried or contemplated suicide. I've always wanted to fix my issues, but I never get around to it.\n   How can I change my feeling of being worthless to everyone?"]

In [9]:
def format_dataset(data):
  formated_data = {'text': f"###user: {data['Context']}\n###psychologist: {data['Response']}"}
  return formated_data

In [10]:
dataset = dataset.map(format_dataset)

Map:   0%|          | 0/3512 [00:00<?, ? examples/s]

In [11]:
dataset['text'][:3]

["###user: I'm going through some things with my feelings and myself. I barely sleep and I do nothing but think about how I'm worthless and how I shouldn't be here.\n   I've never tried or contemplated suicide. I've always wanted to fix my issues, but I never get around to it.\n   How can I change my feeling of being worthless to everyone?\n###psychologist: If everyone thinks you're worthless, then maybe you need to find new people to hang out with.Seriously, the social context in which a person lives is a big influence in self-esteem.Otherwise, you can go round and round trying to understand why you're not worthless, then go back to the same crowd and be knocked down again.There are many inspirational messages you can find in social media. \xa0Maybe read some of the ones which state that no person is worthless, and that everyone has a good purpose to their life.Also, since our culture is so saturated with the belief that if someone doesn't feel good about themselves that this is someh

In [12]:
compute_type = getattr(torch, bnb_4bit_compute_dtype)

In [13]:
compute_type

torch.float16

In [14]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_type,
    bnb_4bit_use_double_quant=use_nested_quant
)

In [15]:
# Check GPU compatibility with bfloat16
if compute_type == torch.float16 and use_4bit:
  major, _ = torch.cuda.get_device_capability()
  if major >= 8:
    print("=" * 80)
    print("Your GPU support bfloat16: accelerate training with bf16=True")
  else:
    print(f"Major is {major}")

Major is 7


In [16]:
# Load base model

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config = bnb_config,
    device_map=device_map
)

config.json:   0%|          | 0.00/583 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/200 [00:00<?, ?B/s]

In [17]:
# loaded_model = AutoModelForCausalLM.from_pretrained(
#     model_name,
#     quantization_config = bnb_config,
#     device_map=device_map
# )

In [18]:
model.config.use_cache = False
model.config.pretraining_tp = 1

In [19]:
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code = True
)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = 'right' # Fix weird overflow issue with fp16 training


tokenizer_config.json:   0%|          | 0.00/746 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/21.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/435 [00:00<?, ?B/s]

In [20]:
dataset

Dataset({
    features: ['Context', 'Response', 'text'],
    num_rows: 3512
})

In [21]:
lengths = []
for idx,example in enumerate(dataset):
    tokens = tokenizer(example["text"])["input_ids"]
    lengths.append((idx,len(tokens)))


In [22]:
max(lengths,key=lambda x:x[1]),min(lengths,key=lambda x:x[1])

((2624, 26816), (2625, 19))

In [23]:
dataset['text'][2624]

'###user: I get so much anxiety, and I don’t know why. I feel like I can’t do anything by myself because I’m scared of the outcomes.\n###psychologist: The other two post answers to your question are very good and I don\'t feel the need to repeat what has already been said quite well, but I will offer one other option I have been able to utilize quite successfully with those dealing with panic attacks. \xa0Chain analysis is a fantastic way for your to map out the situation starting with the prompting event, the chain of events ((links) that lead up to the behavior - in this case a panic attack, and then what the consequences were. \xa0See the illustration below:<img src="

In [24]:
len(dataset['text'][19])

653

In [25]:
import re

In [26]:
def clean_html(example):
    text = example['text']
    pattern = r"<.*?\/>|<.*"
    cleaned = re.sub(pattern, "", text, flags=re.DOTALL)

    return {'cleaned_text': cleaned}

dataset = dataset.map(clean_html)

Map:   0%|          | 0/3512 [00:00<?, ? examples/s]

In [27]:
dataset

Dataset({
    features: ['Context', 'Response', 'text', 'cleaned_text'],
    num_rows: 3512
})

In [28]:
dataset['cleaned_text'][2624]

"###user: I get so much anxiety, and I don’t know why. I feel like I can’t do anything by myself because I’m scared of the outcomes.\n###psychologist: The other two post answers to your question are very good and I don't feel the need to repeat what has already been said quite well, but I will offer one other option I have been able to utilize quite successfully with those dealing with panic attacks. \xa0Chain analysis is a fantastic way for your to map out the situation starting with the prompting event, the chain of events ((links) that lead up to the behavior - in this case a panic attack, and then what the consequences were. \xa0See the illustration below:"

In [29]:
lengths = []
for idx,example in enumerate(dataset):
    tokens = tokenizer(example["cleaned_text"])["input_ids"]
    lengths.append((idx,len(tokens)))


In [30]:
max(lengths,key=lambda x:x[1]),min(lengths,key=lambda x:x[1])

((3144, 1306), (2625, 19))

In [31]:
max(lengths,key=lambda x:x[1]),min(lengths,key=lambda x:x[1])

((3144, 1306), (2625, 19))

In [32]:
import numpy as np

values = [v for i,v in lengths]
arr = np.array(values)
arr

array([307, 578, 166, ..., 258, 159,  88])

In [33]:
q1 = np.percentile(arr,25)
q3 = np.percentile(arr,75)

In [34]:
iqr = q3 - q1
upper_bound = (1.5 * iqr) + q3
upper_bound

690.5

In [35]:
def tokenize(example):
    return tokenizer(
        example["text"],
        truncation=True,
        padding="max_length",
        max_length=512,
    )

tokenized_dataset = dataset.map(tokenize, batched=True)

Map:   0%|          | 0/3512 [00:00<?, ? examples/s]

In [36]:
tokenized_dataset

Dataset({
    features: ['Context', 'Response', 'text', 'cleaned_text', 'input_ids', 'attention_mask'],
    num_rows: 3512
})

In [None]:
import torch
torch.cuda.empty_cache()

In [37]:
# Load LoRA configuration
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM"
)

In [38]:
# Set training parameter
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accmulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_schedule_type,
    report_to='tensorboard'
)

In [40]:
trainer = SFTTrainer(
    model=model,
    train_dataset=tokenized_dataset,
    peft_config=peft_config,
    args=training_arguments,
)

Truncating train dataset:   0%|          | 0/3512 [00:00<?, ? examples/s]

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


In [41]:
trainer.train()

Step,Training Loss
25,1.1589
50,0.6285
75,0.6246
100,0.6117
125,0.6197
150,0.6117
175,0.6251
200,0.5865
225,0.5991
250,0.6224


TrainOutput(global_step=439, training_loss=0.6275949934348973, metrics={'train_runtime': 9007.4715, 'train_samples_per_second': 0.39, 'train_steps_per_second': 0.049, 'total_flos': 7.133098344972288e+16, 'train_loss': 0.6275949934348973})

In [48]:
trainer.model.save_pretrained(model_fine_tune_name)

Check the plots on tensorboard, as follows

In [49]:
%load_ext tensorboard
%tensorboard --logdir results/runs

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


Reusing TensorBoard on port 6006 (pid 155), started 0:12:38 ago. (Use '!kill 155' to kill it.)

<IPython.core.display.Javascript object>

In [53]:
logging.set_verbosity(logging.CRITICAL)

# Run text generation pipeline with our next model
prompt = 'I am feeling very insecure. How to handle it?'
pipe = pipeline(task="text-generation",model=model,tokenizer=tokenizer)
result = pipe(f"<s>[INST] {prompt} [/INST]")

In [54]:
print(result[0]['generated_text'])

<s>[INST] I am feeling very insecure. How to handle it? [/INST]  Feeling insecure can be a challenging and uncomfortable experience. Here are some suggestions that may help you manage your insecurity:

1. Practice self-compassion: Be kind to yourself and try to reframe your negative thoughts. Instead of focusing on your flaws, try to focus on your strengths and accomplishments.
2. Identify the source of your insecurity: Try to understand what is causing your insecurity. Is it a specific situation or person? Is it a past experience or a general feeling? Once you understand the source of your insecurity, you can start to address it.
3. Challenge negative thoughts: Try to identify negative thoughts and challenge them. Ask yourself if they are based on facts or if they are just your perception. Try to replace negative thoughts with more positive and realistic ones.
4. Practice mindfulness: Mindfulness is the practice of being present in the moment and focusing on your thoughts and feelings

In [52]:
model

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096, padding_idx=0)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): lora.Linear4bit(
            (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
            (lora_dropout): ModuleDict(
              (default): Dropout(p=0.1, inplace=False)
            )
            (lora_A): ModuleDict(
              (default): Linear(in_features=4096, out_features=8, bias=False)
            )
            (lora_B): ModuleDict(
              (default): Linear(in_features=8, out_features=4096, bias=False)
            )
            (lora_embedding_A): ParameterDict()
            (lora_embedding_B): ParameterDict()
            (lora_magnitude_vector): ModuleDict()
          )
          (k_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (v_proj): lora.Linear4bit(
            (base_layer): Linear4bit(in_f

In [47]:
# del model
# del pipe
# del trainer
# import gc
# gc.collect()
# # gc.collect()