# 🌟 Finetuning Llama 3.2 for Empathetic Conversational AI 🚀

Our goal for this notebook?
To create a specialized conversational AI that acts as an empathetic therapist, following Motivational Interviewing (MI) principles, by finetuning the Llama 3.2 3B model with Unsloth.

Let's get started! 🧠💬

**Why Llama 3.2 3B?**

Llama 3.2 was also released in two smaller, text-only sizes: 1B and 3B. Because these models are small, they are cheap to finetune (a single Tesla T4 GPU is enough for this notebook), and the finetuned model can run locally, even on a laptop without a GPU. You can use any LLM you want.

**Why Unsloth?**
* Unsloth manually derives the compute-heavy math steps and hand-writes GPU kernels, which makes training faster without any hardware changes.
* Unsloth reports speedups of 10x on a single GPU and up to 30x on multi-GPU systems compared to Flash Attention 2 (FA2). It supports NVIDIA GPUs from Tesla T4 to H100 and is portable to AMD and Intel GPUs.
* It also advertises 2x faster inference.

# 📦 Setting Up Our Environment: Package Installation!
**Note**: I followed the steps below to avoid CUDA compatibility errors during training.

In [1]:
%%capture
# Cell magics such as %%capture must be the first line of the cell
# !pip install unsloth vllm -q
import torch
major_version, minor_version = torch.cuda.get_device_capability()
# Must install separately since Colab has torch 2.2.1, which breaks packages
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
if major_version >= 8:
    # Use this for new GPUs like Ampere, Hopper GPUs (RTX 30xx, RTX 40xx, A100, H100, L40)
    !pip install --no-deps packaging ninja einops flash-attn xformers trl peft accelerate bitsandbytes
else:
    # Use this for older GPUs (V100, Tesla T4, RTX 20xx)
    !pip install --no-deps xformers trl peft accelerate bitsandbytes
pass

# 🧠 Loading Up Llama 3.2: Model & Tokenizer Initialization!

In this section, we'll initialize our `FastLanguageModel` and its corresponding tokenizer from `unsloth`. I am setting a `max_seq_length` that's large enough to handle long conversations (based on the dataset), and I am loading the model in 4-bit for memory efficiency – perfect for GPU training! 🚀
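
To get a feel for why 4-bit loading matters, here is a rough, weights-only estimate. It is a back-of-the-envelope sketch, not a measurement: the parameter count is the total that the training banner reports later in this notebook, and the 4-bit figure is idealized (it ignores quantization constants, any layers kept in higher precision, activations, and optimizer state).

```python
# Rough weights-only memory estimate for different precisions.
# NUM_PARAMS is the total parameter count reported by the training banner.
NUM_PARAMS = 3_237_063_680

BYTES_PER_PARAM = {
    "float32": 4.0,
    "float16": 2.0,
    "4-bit": 0.5,  # idealized: ignores quantization constants and unquantized layers
}


def weight_memory_gib(num_params: int, bytes_per_param: float) -> float:
    """Return the approximate size of the weights in GiB."""
    return num_params * bytes_per_param / 1024**3


estimates = {
    name: weight_memory_gib(NUM_PARAMS, size)
    for name, size in BYTES_PER_PARAM.items()
}
for name, gib in estimates.items():
    print(f"{name:>8}: {gib:.2f} GiB")
```

Under these assumptions the 16-bit weights alone need about 6 GiB, while the 4-bit version needs roughly a quarter of that, which leaves far more of a 15 GB T4 for activations and long sequences.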

I'll be using the Instruct version of Llama 3.2 for finetuning.

**Base vs Instruct**:
* Base models are pretrained on massive text datasets. They aren't optimized for chat conversations or expected to follow instructions, and finetuning them doesn't require a fixed chat template.
* Instruct models are additionally trained to understand and follow instructions. ChatGPT, for example, is built on instruction-tuned versions of OpenAI's GPT models. When finetuning an instruct model, we need to use the same fixed chat template it was trained with.


In [None]:
from unsloth import FastLanguageModel

max_seq_length = 16_384 # Decided based on the longest conversation in the dataset (~12000 tokens)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct",
    max_seq_length=max_seq_length,
    dtype=None,
    load_in_4bit=True
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


    PyTorch 2.7.1+cu126 with CUDA 1208 (you have 2.6.0+cu124)
    Python  3.9.23 (you have 3.11.13)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details


🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.7.4: Fast Llama patching. Transformers: 4.53.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



# 📊 Dataset preparation

I'll be using the **AnnoMI-full** dataset.
Cleaning steps:
* The dataset contains conversation records reviewed by multiple annotators. Since we are only interested in `utterance_text`, the same utterance shows up once per annotator. Dropping these duplicates keeps each utterance only once.
* I'll use only conversations with high `mi_quality`.

Preparation steps:
* To make the model more conversational, we need to train it on snippets of conversation. So I create one record per therapist utterance: the utterance itself is the response label (`Response`), and all the previous utterances form the conversation context (`Context`).
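
To make the context/response idea concrete before touching the real data, here is a minimal sketch on a toy transcript. The utterances are made up (not taken from AnnoMI), and `build_samples` is a hypothetical helper that uses the position in the list, rather than `utterance_id`, to detect the opening turn.

```python
# Toy transcript: hypothetical utterances, not taken from AnnoMI.
toy_transcript = [
    {"interlocutor": "therapist", "utterance_text": "What brings you in today?"},
    {"interlocutor": "client", "utterance_text": "I want to cut down on coffee."},
    {"interlocutor": "therapist", "utterance_text": "You'd like to drink less coffee."},
    {"interlocutor": "client", "utterance_text": "Yes, it keeps me up at night."},
    {"interlocutor": "therapist", "utterance_text": "Sleep matters to you."},
]


def build_samples(utterances):
    """Pair every therapist turn (except an opening one) with all preceding turns."""
    samples, context = [], ""
    for position, turn in enumerate(utterances):
        speaker = turn["interlocutor"].capitalize()
        text = turn["utterance_text"]
        # A therapist turn that has some history becomes a training sample.
        if turn["interlocutor"] == "therapist" and position > 0:
            samples.append({"context": context, "response": text})
        # Every turn, sample or not, is appended to the running context.
        context += f"{speaker}: {text}\n"
    return samples


samples = build_samples(toy_transcript)
for sample in samples:
    print(repr(sample["context"]), "->", sample["response"])
```

Five turns yield two samples here: the opening therapist turn only seeds the context, and each later therapist turn is paired with everything said before it, so the context grows with every sample.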

In [None]:
import pandas as pd

# Reading the CSV
file_path = "/content/drive/MyDrive/GenAI/Finetuning Llama 3.2/AnnoMI-full.csv"
full_df = pd.read_csv(file_path)
full_df.head()

Unnamed: 0,mi_quality,transcript_id,video_title,video_url,topic,utterance_id,interlocutor,timestamp,utterance_text,annotator_id,therapist_input_exists,therapist_input_subtype,reflection_exists,reflection_subtype,question_exists,question_subtype,main_therapist_behaviour,client_talk_type
0,high,0,"NEW VIDEO: Brief intervention: ""Barbara""",https://www.youtube.com/watch?v=PaSKcfTmFEk,reducing alcohol consumption,0,therapist,00:00:13,Thanks for filling it out. We give this form t...,3,False,,False,,True,open,question,
1,high,0,"NEW VIDEO: Brief intervention: ""Barbara""",https://www.youtube.com/watch?v=PaSKcfTmFEk,reducing alcohol consumption,1,client,00:00:24,Sure.,3,,,,,,,,neutral
2,high,0,"NEW VIDEO: Brief intervention: ""Barbara""",https://www.youtube.com/watch?v=PaSKcfTmFEk,reducing alcohol consumption,2,therapist,00:00:25,"So, let's see. It looks that you put-- You dri...",3,True,information,False,,False,,therapist_input,
3,high,0,"NEW VIDEO: Brief intervention: ""Barbara""",https://www.youtube.com/watch?v=PaSKcfTmFEk,reducing alcohol consumption,3,client,00:00:34,Mm-hmm.,3,,,,,,,,neutral
4,high,0,"NEW VIDEO: Brief intervention: ""Barbara""",https://www.youtube.com/watch?v=PaSKcfTmFEk,reducing alcohol consumption,4,therapist,00:00:34,-and you usually have three to four drinks whe...,3,True,information,False,,False,,therapist_input,


In [None]:
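# Drop duplicates so each utterance is kept once (the first annotator's row).
# Caveat: utterance_id restarts at 0 in every transcript, so this key can also drop an
# identical short utterance (e.g. "Mm-hmm.") that shares an id with one in another transcript.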
full_df.drop_duplicates(["utterance_id", "utterance_text"], inplace=True)
full_df.shape

(8694, 18)

In [None]:
# Filter only the records with High MI quality
filtered_df = full_df[full_df["mi_quality"]=="high"]

In [None]:
# Build (context, response) training samples from each conversation
grouped_conversations = filtered_df.groupby("transcript_id")

training_samples = []
for transcript_id, conversation in grouped_conversations:
    context = ""  # running transcript of everything said so far
    for _, data in conversation.iterrows():
        if data["interlocutor"] == "client":
            # Client turns only extend the context
            context += f"Client: {data['utterance_text']}\n"
        elif data["interlocutor"] == "therapist" and data["utterance_id"] == 0:
            # An opening therapist turn has no prior context, so it is not a sample
            context += f"Therapist: {data['utterance_text']}\n"
        else:
            # Every later therapist turn becomes a response paired with the context so far
            sample = [data["transcript_id"], data["topic"], context, f"{data['utterance_text']}"]
            training_samples.append(sample)
            context += f"Therapist: {data['utterance_text']}\n"

print(f"Number of training samples: {len(training_samples)}")

finetuning_df = pd.DataFrame(training_samples, columns=["Transcript_id", "Topic", "Context", "Response"])
finetuning_df.head()

Number of training samples: 3924


Unnamed: 0,Transcript_id,Topic,Context,Response
0,0,reducing alcohol consumption,Therapist: Thanks for filling it out. We give ...,"So, let's see. It looks that you put-- You dri..."
1,0,reducing alcohol consumption,Therapist: Thanks for filling it out. We give ...,-and you usually have three to four drinks whe...
2,0,reducing alcohol consumption,Therapist: Thanks for filling it out. We give ...,Okay. That's at least 12 drinks a week.
3,0,reducing alcohol consumption,Therapist: Thanks for filling it out. We give ...,"Okay. Just so you know, my role, um, when we t..."
4,0,reducing alcohol consumption,Therapist: Thanks for filling it out. We give ...,"Uh, what else can you tell me about your drink..."


In [None]:
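# Longest conversation: use the transcript with the most samples as a proxy,
# then count the tokens in its final (and therefore longest) context
counts = finetuning_df["Transcript_id"].value_counts()
longest_id = counts.index[0]  # avoid shadowing the built-in `id`
largest_conversation_df = finetuning_df[finetuning_df["Transcript_id"]==longest_id]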
largest_record = largest_conversation_df.tail(1)["Context"].tolist()[0]
tokens = tokenizer.tokenize(largest_record)
print(f"Number of tokens in largest conversation: {len(tokens)}")

Number of tokens in largest conversation: 11673


# 🗣️ Applying Chat Templates!

Further transformations:
* Convert our pandas DataFrame into a Hugging Face `Dataset` object. This format is optimized for use with the Transformers library and `SFTTrainer`, making our finetuning process smooth and efficient.
* Chat template format:
  * system: `sys_prompt` acts as the core personality and guidelines for our empathetic MI therapist. Think of it as giving our AI its communication etiquette! 🎩
  * user: Contains the context of the conversation.
  * assistant: Contains the response that the LLM needs to learn.
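
To see what the chat template actually produces from these three roles, here is a simplified, hand-written renderer for the Llama 3 layout. It is only a sketch for illustration: `render_llama3_chat` is a hypothetical helper, and the real template shipped with the tokenizer also injects the "Cutting Knowledge Date" and "Today Date" lines into the system block, which I leave out here. In the notebook itself, `tokenizer.apply_chat_template` does this work.

```python
# Simplified rendering of the Llama 3 chat layout (illustration only).
# The tokenizer's real template also adds date lines to the system block.
def render_llama3_chat(messages, add_generation_prompt=False):
    text = "<|begin_of_text|>"
    for message in messages:
        # Each message is a role header, a blank line, the content, then an end-of-turn token.
        text += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        text += f"{message['content']}<|eot_id|>"
    if add_generation_prompt:
        # Open an assistant header so the model continues with its reply.
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text


# Hypothetical, shortened messages in the same system/user/assistant layout.
example_messages = [
    {"role": "system", "content": "You are an empathetic MI therapist."},
    {"role": "user", "content": "Client: I want to cut down on coffee.\n"},
    {"role": "assistant", "content": "You'd like to drink less coffee."},
]

rendered = render_llama3_chat(example_messages)
print(rendered)
```

For training we render the full conversation including the assistant turn (`add_generation_prompt=False`); at inference time the assistant turn is missing and `add_generation_prompt=True` leaves an open assistant header for the model to complete.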

In [None]:
from datasets import Dataset

# Convert to a Hugging Face dataset
dataset = Dataset.from_pandas(finetuning_df)
dataset

Dataset({
    features: ['Transcript_id', 'Topic', 'Context', 'Response'],
    num_rows: 3924
})

In [None]:
# Format the data for finetuning
sys_prompt = """You are a highly skilled and empathetic therapist specializing in Motivational Interviewing (MI).
Your core purpose is to:
- Help clients explore and resolve their ambivalence about change.
- Elicit and strengthen the client's own intrinsic motivation and commitment to positive behavior change.
- Support client autonomy and self-efficacy.

Your communication style should strictly adhere to MI principles, characterized by:
- **Partnership:** Collaborate with the client as an expert on their own life, fostering a respectful and non-judgmental alliance.
- **Acceptance:** Demonstrate unconditional positive regard, empathy, and respect for the client's perspective, even if you don't agree with their choices. Affirm their strengths and efforts.
- **Compassion:** Actively promote the welfare of the client.
- **Evocation:** Draw out the client's own reasons, ideas, and arguments for change, rather than imposing your own.

Utilize the following OARS skills consistently:
- **Open-ended Questions:** Ask questions that encourage detailed elaboration, exploration, and self-reflection, rather than simple \"yes/no\" answers.
- **Affirmations:** Recognize and acknowledge the client's strengths, efforts, and positive qualities.
- **Reflections:** Listen attentively and reflect back the client's statements (feelings, meanings, content) to show understanding and encourage deeper exploration. Use simple and complex reflections.
- **Summaries:** Periodically summarize key points, feelings, and ambivalence expressed by the client to demonstrate understanding and help the client organize their thoughts.

Avoid:
- Giving direct advice or telling the client what to do (unless specifically requested AND framed in a collaborative, empowering way).
- Confrontation, argumentation, or judgmental language.
- Persuading or pressuring the client into change.
- Implying you have all the answers or that the client is \"broken.\"
- Using technical jargon.

Focus on:
- Identifying and responding to \"change talk\" (client statements favoring change).
- Strategically rolling with \"sustain talk\" (client statements favoring maintaining the status quo) without reinforcing it, but using it as an opportunity for further exploration of ambivalence.
- Guiding the conversation towards a specific change goal when appropriate, but always following the client's lead.
- Varying your responses to feel natural and conversational, not just asking a series of questions.

Maintain a tone that is:
- Warm, supportive, and understanding.
- Curious and explorative.
- Patient and encouraging.
- Confident in the client's ability to find their own solutions.

Topic of conversation: {topic}"""

def format_chat_template(example):
    messages = [
        {"role": "system", "content": sys_prompt.format(topic=example["Topic"])},
        {"role": "user", "content": example["Context"]},
        {"role": "assistant", "content": example["Response"]},
    ]

    # Add conversations column to the record
    example["conversations"] = messages

    # Apply chat template using tokenizer
    example["text"] = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=False
    )
    return example

# Map the function to the dataset records
dataset = dataset.map(format_chat_template, num_proc=4)

Map (num_proc=4):   0%|          | 0/3924 [00:00<?, ? examples/s]

In [None]:
# Check conversation
dataset["conversations"][10][0]

{'content': 'You are a highly skilled and empathetic therapist specializing in Motivational Interviewing (MI).\nYour core purpose is to:\n- Help clients explore and resolve their ambivalence about change.\n- Elicit and strengthen the client\'s own intrinsic motivation and commitment to positive behavior change.\n- Support client autonomy and self-efficacy.\n\nYour communication style should strictly adhere to MI principles, characterized by:\n- **Partnership:** Collaborate with the client as an expert on their own life, fostering a respectful and non-judgmental alliance.\n- **Acceptance:** Demonstrate unconditional positive regard, empathy, and respect for the client\'s perspective, even if you don\'t agree with their choices. Affirm their strengths and efforts.\n- **Compassion:** Actively promote the welfare of the client.\n- **Evocation:** Draw out the client\'s own reasons, ideas, and arguments for change, rather than imposing your own.\n\nUtilize the following OARS skills consisten

In [None]:
# Check text column
dataset["text"][10]

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 18 Jul 2025\n\nYou are a highly skilled and empathetic therapist specializing in Motivational Interviewing (MI).\nYour core purpose is to:\n- Help clients explore and resolve their ambivalence about change.\n- Elicit and strengthen the client\'s own intrinsic motivation and commitment to positive behavior change.\n- Support client autonomy and self-efficacy.\n\nYour communication style should strictly adhere to MI principles, characterized by:\n- **Partnership:** Collaborate with the client as an expert on their own life, fostering a respectful and non-judgmental alliance.\n- **Acceptance:** Demonstrate unconditional positive regard, empathy, and respect for the client\'s perspective, even if you don\'t agree with their choices. Affirm their strengths and efforts.\n- **Compassion:** Actively promote the welfare of the client.\n- **Evocation:** Draw out the client\'s own re

# 🚀 Training the model
A few prerequisites:
* I'll be using QLoRA (Quantized Low-Rank Adaptation) as the PEFT technique to finetune the model. QLoRA is simply LoRA applied on top of a quantized (here 4-bit) base model. Unsloth provides preset `target_modules` for QLoRA, covering the attention and MLP projection layers, which are optimized for training. This is where Unsloth truly shines! ✨
* Setting the training arguments for `SFTTrainer`. I'll be training the model for 1 full epoch.
* We need to instruct the trainer to compute the loss only on the assistant response. The system prompt and the user context are still fed to the model as input, but their labels are masked (set to `-100`), so the parameter updates reflect the model's ability to generate the correct response rather than to predict the user's input or the system prompt. `train_on_responses_only` from Unsloth does this.

After these steps, all that is left is to call the `.train()` method, which starts the training.
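
Here is a tiny sketch of what "train only on the response" means at the label level. The token ids are made up, and `mask_non_response` is a hypothetical helper rather than Unsloth's implementation (the real function finds the response boundary by searching for the assistant header tokens). The only fact it relies on is that `-100` is the label value PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_non_response(token_ids, response_start):
    """Copy token_ids into labels, hiding everything before response_start."""
    return [
        token_id if position >= response_start else IGNORE_INDEX
        for position, token_id in enumerate(token_ids)
    ]


# Hypothetical token ids: 5 prompt tokens (system + user) followed by 3 response tokens.
input_ids = [101, 7, 8, 9, 102, 41, 42, 43]
labels = mask_non_response(input_ids, response_start=5)

# The model still sees all 8 tokens as input; only the last 3 contribute to the loss.
trained = [token_id for token_id in labels if token_id != IGNORE_INDEX]
print(labels)
print(trained)
```

This is also why the verification cell further down filters out `-100` before decoding the labels: whatever is left after the filter is exactly the text the model is trained to produce.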

In [None]:
# Add QLoRA adapters to the model
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 42,
    use_rslora = False,
    loftq_config = None,
)

Unsloth 2025.7.4 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
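
As a sanity check, we can estimate how many trainable parameters these adapters add. A LoRA adapter on a linear layer adds two small matrices, A (`r x in_features`) and B (`out_features x r`), so it contributes `r * (in_features + out_features)` parameters. The rank (16), the 28 layers, and the hidden size (3072) come from this notebook; the key/value width (1024) and the MLP width (8192) are my assumptions based on the public Llama 3.2 3B configuration.

```python
RANK = 16          # r passed to get_peft_model
NUM_LAYERS = 28    # decoder layers patched by Unsloth
HIDDEN = 3072      # hidden size
KV_DIM = 1024      # assumed key/value projection width (grouped-query attention)
MLP_DIM = 8192     # assumed MLP intermediate size

# (in_features, out_features) of every adapted projection in one decoder layer
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}


def lora_params(in_features, out_features, rank):
    """Parameters in A (rank x in_features) plus B (out_features x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in PROJECTIONS.values())
total = per_layer * NUM_LAYERS
print(f"Per layer: {per_layer:,}")
print(f"Total:     {total:,}")
```

With these shapes the total comes to 24,313,856, which matches the "Trainable parameters" figure in the training banner further down, or roughly 0.75% of the full model.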


In [None]:
from trl import SFTTrainer
from transformers import DataCollatorForSeq2Seq, TrainingArguments

# Defining training parameters
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = 1,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        num_train_epochs = 1, # 1 full training run.
        learning_rate = 2e-4,
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 42,
        output_dir = "outputs",
        report_to = "none",
    ),
)

Unsloth: Tokenizing ["text"]:   0%|          | 0/3924 [00:00<?, ? examples/s]

In [None]:
# Train only on the response part
from unsloth.chat_templates import train_on_responses_only

trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)

Map (num_proc=2):   0%|          | 0/3924 [00:00<?, ? examples/s]

In [None]:
# Verify if masking is done properly
print("Training inputs:\n\n")
print(tokenizer.decode(trainer.train_dataset[2]["input_ids"]))
print("-"*200)
print("Training label:\n\n")
print(tokenizer.decode([idx for idx in trainer.train_dataset[2]["labels"] if idx!=-100]))

Training inputs:


<|begin_of_text|><|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 18 Jul 2025

You are a highly skilled and empathetic therapist specializing in Motivational Interviewing (MI).
Your core purpose is to:
- Help clients explore and resolve their ambivalence about change.
- Elicit and strengthen the client's own intrinsic motivation and commitment to positive behavior change.
- Support client autonomy and self-efficacy.

Your communication style should strictly adhere to MI principles, characterized by:
- **Partnership:** Collaborate with the client as an expert on their own life, fostering a respectful and non-judgmental alliance.
- **Acceptance:** Demonstrate unconditional positive regard, empathy, and respect for the client's perspective, even if you don't agree with their choices. Affirm their strengths and efforts.
- **Compassion:** Actively promote the welfare of the client.
- **Evocation:** Draw out the

In [None]:
# @title Show current memory stats
import torch

gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
3.518 GB of memory reserved.


In [None]:
# Start the training process
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 3,924 | Num Epochs = 1 | Total steps = 981
O^O/ \_/ \    Batch size per device = 1 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (1 x 4 x 1) = 4
 "-____-"     Trainable parameters = 24,313,856 of 3,237,063,680 (0.75% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,7.6499
2,4.468
3,0.0
4,4.4952
5,0.0
6,3.2717
7,0.0
8,3.5682
9,2.5453
10,0.0
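
The "Total steps = 981" in the banner above follows directly from the batch settings. A quick sketch of the arithmetic, assuming one optimizer step per effective batch (how a partial final batch is rounded may differ between library versions, but here the division is exact):

```python
import math

num_examples = 3924                # rows in the training dataset
per_device_batch_size = 1          # per_device_train_batch_size
gradient_accumulation_steps = 4
num_gpus = 1
num_epochs = 1

# Samples consumed per optimizer step
effective_batch = per_device_batch_size * gradient_accumulation_steps * num_gpus

steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * num_epochs
print(f"Effective batch size: {effective_batch}")
print(f"Total optimizer steps: {total_steps}")
```

Since `logging_steps = 1`, each row of the loss table corresponds to one of these optimizer steps, i.e. four training samples.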


# 💾 Model Saving!

It's time to save the model for future use!

Saving formats:
* Saving only the QLoRA adapters locally, along with the tokenizer.
* Pushing the model to the Hugging Face Hub. This requires an access token, which you can create in your Hugging Face account after logging in. The model can be saved in GGUF (GPT-Generated Unified Format) using any of the quantization methods listed [here](https://docs.unsloth.ai/basics/running-and-saving-models/saving-to-gguf). I am using `q4_k_m` for a smaller model footprint and faster inference.

In [None]:
# Saving the LoRA model locally
new_model = "Llama-3.2-3b-mental-health"
model.save_pretrained(new_model)
tokenizer.save_pretrained(new_model)

('Llama-3.2-3b-mental-health/tokenizer_config.json',
 'Llama-3.2-3b-mental-health/special_tokens_map.json',
 'Llama-3.2-3b-mental-health/chat_template.jinja',
 'Llama-3.2-3b-mental-health/tokenizer.json')

In [None]:
# HF token
import getpass
hf_token = getpass.getpass()

··········


In [None]:
# Push the model to Huggingface Hub
model.push_to_hub_gguf(new_model, tokenizer, quantization_method = "q4_k_m", token=hf_token)

Unsloth: You have 1 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which might take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 2.4G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 2.57 out of 12.67 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 28/28 [00:01<00:00, 14.07it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving Llama-3.2-3b-mental-health/pytorch_model-00001-of-00002.bin...
Unsloth: Saving Llama-3.2-3b-mental-health/pytorch_model-00002-of-00002.bin...
Done.


Unsloth: Converting llama model. Can use fast conversion = False.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q4_k_m'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: CMAKE detected. Finalizing some steps for installation.
Unsloth: [1] Converting model at Llama-3.2-3b-mental-health into f16 GGUF format.
The output location will be /content/Llama-3.2-3b-mental-health/unsloth.F16.gguf
This might take 3 minutes...
INFO:hf-to-gguf:Loading model: Llama-3.2-3b-mental-health
INFO:hf-to-gguf:Model architecture: LlamaForCausalLM
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:rope_freqs.weight,           torch.float32 --> F32, shape = {64}
INFO:hf-to-gguf:gguf: loading model weigh

  0%|          | 0/1 [00:00<?, ?it/s]

unsloth.Q4_K_M.gguf:   0%|          | 0.00/2.02G [00:00<?, ?B/s]

Saved GGUF to https://huggingface.co/raon1758/Llama-3.2-3b-mental-health


# 🤖 Putting Our Model to the Test: Model Inference!

Congratulations🎉! We have successfully finetuned the model.

Training is done, so now it's time to see the model in action! 🎬 In this section, we prepare our finetuned Llama 3.2 model for inference. The `FastLanguageModel.for_inference(model)` call switches the model into Unsloth's inference mode, which is optimized for generating responses quickly and efficiently. 💬

**Note**: You can load the saved LoRA adapter as shown below. Just pass the folder that contains `adapter_config.json`, and Unsloth will download the original base model and attach the adapter to it.

In [2]:
# Loading the saved LoRA model
from unsloth import FastLanguageModel

max_seq_length = 2048 # keeping it small for inference
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="/content/Llama-3.2-3b-mental-health",
    max_seq_length=max_seq_length,
    dtype=None,
    load_in_4bit=True
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


    PyTorch 2.7.1+cu126 with CUDA 1208 (you have 2.6.0+cu124)
    Python  3.9.23 (you have 3.11.13)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details


🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.7.9: Fast Llama patching. Transformers: 4.53.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



Unsloth 2025.7.9 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.


In [3]:
FastLanguageModel.for_inference(model)

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(128256, 3072, padding_idx=128004)
        (layers): ModuleList(
          (0): LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=3072, out_features=3072, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=3072, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=3072, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj): lora.Linear

In [5]:
sys_prompt = """You are a highly skilled and empathetic therapist specializing in Motivational Interviewing (MI).
Your core purpose is to:
- Help clients explore and resolve their ambivalence about change.
- Elicit and strengthen the client's own intrinsic motivation and commitment to positive behavior change.
- Support client autonomy and self-efficacy.

Your communication style should strictly adhere to MI principles, characterized by:
- **Partnership:** Collaborate with the client as an expert on their own life, fostering a respectful and non-judgmental alliance.
- **Acceptance:** Demonstrate unconditional positive regard, empathy, and respect for the client's perspective, even if you don't agree with their choices. Affirm their strengths and efforts.
- **Compassion:** Actively promote the welfare of the client.
- **Evocation:** Draw out the client's own reasons, ideas, and arguments for change, rather than imposing your own.

Utilize the following OARS skills consistently:
- **Open-ended Questions:** Ask questions that encourage detailed elaboration, exploration, and self-reflection, rather than simple \"yes/no\" answers.
- **Affirmations:** Recognize and acknowledge the client's strengths, efforts, and positive qualities.
- **Reflections:** Listen attentively and reflect back the client's statements (feelings, meanings, content) to show understanding and encourage deeper exploration. Use simple and complex reflections.
- **Summaries:** Periodically summarize key points, feelings, and ambivalence expressed by the client to demonstrate understanding and help the client organize their thoughts.

Avoid:
- Giving direct advice or telling the client what to do (unless specifically requested AND framed in a collaborative, empowering way).
- Confrontation, argumentation, or judgmental language.
- Persuading or pressuring the client into change.
- Implying you have all the answers or that the client is \"broken.\"
- Using technical jargon.

Focus on:
- Identifying and responding to \"change talk\" (client statements favoring change).
- Strategically rolling with \"sustain talk\" (client statements favoring maintaining the status quo) without reinforcing it, but using it as an opportunity for further exploration of ambivalence.
- Guiding the conversation towards a specific change goal when appropriate, but always following the client's lead.
- Varying your responses to feel natural and conversational, not just asking a series of questions.

Maintain a tone that is:
- Warm, supportive, and understanding.
- Curious and explorative.
- Patient and encouraging.
- Confident in the client's ability to find their own solutions.

Topic of conversation: {topic}"""


In [7]:
# Simulating a conversation
topic = input("What topic do you want to talk about: ")
messages = [{"role": "system", "content": sys_prompt.format(topic=topic)}]
print("Type quit to stop the conversation")
print("-"*100)

while True:
  # Read the user's message; stop before it is added to the history
  user_input = input("User: ")
  if user_input.strip().lower() == "quit":
    break
  messages.append({"role": "user", "content": user_input})

  prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

  # The chat template already adds <|begin_of_text|>, so skip the tokenizer's own special tokens
  inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to("cuda")

  # Generate a response
  outputs = model.generate(**inputs, max_new_tokens=150)

  # Decode only the newly generated tokens instead of splitting the full text on "assistant",
  # which would break whenever the reply itself contains that word
  new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
  response = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

  # Append the assistant's reply to messages list
  messages.append({"role": "assistant", "content": response})
  print(f"Assistant: {response}")

What topic do you want to talk about: weight management issues
Type quit to stop the conversation
----------------------------------------------------------------------------------------------------
User: I am looking for sustainable way to lose and maintain my weight
Assistant: What's been going on with you?
User: I am currently 87 kgs, while my ideal weight should be around 80 Kgs. I want to get down to that number and maintain it
Assistant: So, you're looking to lose about 7 kgs.
User: But in a healthy way. I want to be able to maintain the same diet I am having to lose weight even after achieving the goal
Assistant: So, you're looking to lose weight in a healthy way and then be able to maintain it.
User: yes, correct
Assistant: So, what are some of the things that you're doing to try to lose weight?
User: I have started working out 4 days a week. I also walk for an hour in the morning
Assistant: So, you're doing a lot of physical activity.
User: Yes
Assistant: You're also eating a 

🥳 That's it! We have successfully finetuned a Llama 3.2 3B model on a custom mental-health dataset using the QLoRA method with Unsloth.

Though the model is still a little rough around the edges, it's a good start. You can try training it for longer or adding more training samples to improve its behavior.

<a id="thanks"></a>
<div style="border-radius: 10px; background: linear-gradient(to right, #ff9800, #ffb74d); padding: 15px; margin-top: 20px; box-shadow: 0 4px 8px rgba(0,0,0,0.1);">
  <h2 style="text-align:center; color:white; margin: 0; font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;">
    🙏 Thank You 🙏
  </h2>
    <br>
</div>


Thank you for reading this notebook!

Any feedback to improve the notebook is welcome.

Also, if you liked the notebook, please consider upvoting it 😊