## 🚀 Day 3/15 — Fine-Tuning with Unsloth AI

## Synthetic Data Generation using Llama 3.2 3B model

> This notebook walks through generating synthetic data with the Llama 3.2 3B model and then fine-tuning Llama 3.2 3B on that data.
---

### 👋🏻 About Me

Hi, I'm **Aasher Kamal** — a Generative & Agentic AI developer passionate about building intelligent systems with LLMs.

I have started a **15-day challenge** to master fine-tuning using the open-source **Unsloth AI** framework. This journey will cover everything from LoRA and QLoRA to reinforcement learning, vision, and TTS fine-tuning — all hands-on, all open-source.

I'll be documenting my learnings, experiments, and challenges daily.

---

### 🌐 Connect with Me

- [LinkedIn](https://www.linkedin.com/in/aasher-kamal/)
- [GitHub](https://github.com/aasherkamal216)
- [X (Twitter)](https://x.com/Aasher_Kamal)
- [Facebook](https://www.facebook.com/aasher.kamal)
- [Website](https://aasherkamal.framer.website/)

Let’s build and learn together! 💡

---

### Acknowledgements

This notebook is adapted from Unsloth's official [GitHub repository](https://github.com/unslothai/notebooks).  
I've made minor modifications to the original version to better understand and document the workflow.

---


### Installation

In [None]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth vllm==0.8.5.post1
    !pip install synthetic-data-kit==0.0.3
else:
    !pip install --no-deps unsloth vllm==0.8.5.post1
    !pip install synthetic-data-kit==0.0.3


In [None]:
#@title Colab Extra Install { display-mode: "form" }
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth vllm
else:
    !pip install --no-deps unsloth vllm==0.8.5.post1
    # Skip restarting message in Colab
    import sys, re, requests; modules = list(sys.modules.keys())
    for x in modules: sys.modules.pop(x) if "PIL" in x or "google" in x else None
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1,<4.0.0" "huggingface_hub>=0.34.0" hf_transfer

    # vLLM requirements - vLLM breaks Colab due to reinstalling numpy
    f = requests.get("https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/requirements/common.txt").content
    with open("vllm_requirements.txt", "wb") as file:
        file.write(re.sub(rb"(transformers|numpy|xformers)[^\n]{1,}\n", b"", f))
    !pip install -r vllm_requirements.txt

### Unsloth

In [None]:
from unsloth.dataprep import SyntheticDataKit

generator = SyntheticDataKit.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct",
    max_seq_length = 1024,
    gpu_memory_utilization=0.6  # using 60% of the total GPU memory
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 07-30 09:24:39 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 07-30 09:24:39 [__init__.py:239] Automatically detected platform cuda.


config.json:   0%|          | 0.00/890 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

chat_template.jinja: 0.00B [00:00, ?B/s]

Unsloth: Patching vLLM v1 graph capture
Unsloth: Patching vLLM v0 graph capture
Unsloth: Using dtype = torch.bfloat16 for vLLM.
Unsloth: vLLM loading unsloth/Llama-3.2-3B-Instruct with actual GPU utilization = 59.5%
Unsloth: Your GPU has CUDA compute capability 8.9 with VRAM = 22.16 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 1024. Num Sequences = 192.
Unsloth: vLLM's KV Cache can use up to 7.2 GB. Also swap space = 4 GB.
vLLM STDOUT: INFO 07-30 09:25:02 [__init__.py:239] Automatically detected platform cuda.
vLLM STDOUT: INFO 07-30 09:25:08 [api_server.py:1043] vLLM API server version 0.8.5.post1
vLLM STDOUT: INFO 07-30 09:25:08 [api_server.py:1044] args: Namespace(subparser='serve', model_tag='unsloth/Llama-3.2-3B-Instruct', config='', host=None, port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=Non

## Generate QA Pairs + Auto clean data
We now use Synthetic Data Kit to generate question-answer pairs:

In [None]:
generator.prepare_qa_generation(
    output_folder = "data", # Output location of synthetic data
    temperature = 0.9, # Higher temperature makes more diverse datasets
    top_p = 0.95,
    overlap = 64, # Overlap portion during chunking
    max_generation_tokens = 256,
)
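
To make the two sampling knobs above more concrete, here is a minimal sketch of temperature scaling followed by nucleus (top-p) filtering on a toy distribution. It only illustrates the general technique and is not how vLLM implements sampling; the logits and the helper names (`softmax_with_temperature`, `top_p_filter`, `entropy`) are made up for this example.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}  # renormalise the survivors

def entropy(probs):
    """Shannon entropy in nats; larger means a more spread-out distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

toy_logits = [4.0, 2.0, 1.0, 0.5]  # hypothetical scores for four tokens
sharp = softmax_with_temperature(toy_logits, 0.1)
diverse = softmax_with_temperature(toy_logits, 0.9)
candidates = top_p_filter(diverse, 0.95)
print(entropy(sharp), entropy(diverse), sorted(candidates))
```

The entropy at temperature 0.9 is higher than at 0.1, which is the sense in which a higher temperature gives more varied generations, while `top_p` trims the unlikely tail before sampling.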

Check if it succeeded:

In [None]:
!synthetic-data-kit system-check

[?25l[32m VLLM server is running at [0m[4;94mhttp://localhost:8000/v1[0m
[32m⠋[0m[32m Checking VLLM server at http://localhost:8000/v1...[0m[2KAvailable models: [1m{[0m[32m'object'[0m: [32m'list'[0m, [32m'data'[0m: [1m[[0m[1m{[0m[32m'id'[0m: 
[32m'unsloth/Llama-3.2-3B-Instruct'[0m, [32m'object'[0m: [32m'model'[0m, [32m'created'[0m: [1;36m1753867667[0m, 
[32m'owned_by'[0m: [32m'vllm'[0m, [32m'root'[0m: [32m'unsloth/Llama-3.2-3B-Instruct'[0m, [32m'parent'[0m: [3;35mNone[0m, 
[32m'max_model_len'[0m: [1;36m1024[0m, [32m'permission'[0m: [1m[[0m[1m{[0m[32m'id'[0m: 
[32m'modelperm-32241f8566c4453c9fcfb80fc2004228'[0m, [32m'object'[0m: [32m'model_permission'[0m, 
[32m'created'[0m: [1;36m1753867667[0m, [32m'allow_create_engine'[0m: [3;91mFalse[0m, [32m'allow_sampling'[0m: [3;92mTrue[0m, 
[32m'allow_logprobs'[0m: [3;92mTrue[0m, [32m'allow_search_indices'[0m: [3;91mFalse[0m, [32m'allow_view'[0m: [3;92mTrue[0m
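
The same check can be done from Python, since the server exposes an OpenAI-style `/v1/models` endpoint. The sketch below is one possible way to do it with the standard library; `served_model_ids` and `fetch_models` are hypothetical helper names, and the sample payload is a trimmed copy of the response printed above so the example also runs without a live server.

```python
import json
from urllib.request import urlopen

def served_model_ids(models_payload):
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in models_payload.get("data", [])]

def fetch_models(base_url="http://localhost:8000/v1", timeout=5):
    """Query the running server; raises URLError if nothing is listening."""
    with urlopen(f"{base_url}/models", timeout=timeout) as response:
        return json.load(response)

# Offline example using a trimmed copy of the response printed above.
sample = {
    "object": "list",
    "data": [{"id": "unsloth/Llama-3.2-3B-Instruct", "object": "model",
              "owned_by": "vllm", "max_model_len": 1024}],
}
print(served_model_ids(sample))
```

With the server running, `served_model_ids(fetch_models())` should list the loaded model; an exception here is an early sign that the server has gone down.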

## Document Parsing
I have manually placed the document `comprehensive_guide_daca.md` inside the `data/output/` folder. We'll use this file and convert it to Q&A pairs in order to finetune Llama 3.2!

In [None]:
# Split the document into chunks
filenames = generator.chunk_data("data/output/comprehensive_guide_daca.md")
print(len(filenames), filenames[:3])

98 ['data/output/comprehensive_guide_daca_0.md', 'data/output/comprehensive_guide_daca_1.md', 'data/output/comprehensive_guide_daca_2.md']
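
For intuition about what chunking with an overlap does, here is a simplified sliding-window sketch. It is not the implementation behind `chunk_data` (which works on model tokens and picks its own window size); `chunk_with_overlap`, the window size, and the toy word list are assumptions made for illustration.

```python
def chunk_with_overlap(tokens, chunk_size, overlap):
    """Split a token list into windows that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return chunks

words = [f"w{i}" for i in range(10)]  # stand-in for a tokenized document
for chunk in chunk_with_overlap(words, chunk_size=4, overlap=1):
    print(chunk)
```

Each window repeats the tail of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk.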


The document was split into 98 chunks. We now call synthetic-data-kit to generate QA pairs for the first 10 chunks.

In [None]:
import time

for filename in filenames[:10]:
    !synthetic-data-kit \
        -c synthetic_data_kit_config.yaml \
        create {filename} \
        --num-pairs 10 \
        --type "qa"
    time.sleep(2) # Sleep some time to leave some room for processing

Processing 3 chunks to generate QA pairs...
Batch processing complete.
Generated 6 QA pairs total
Saving result to data/generated/comprehensive_guide_daca_0_qa_pairs.json
Successfully wrote test file to data/generated/test_write.json
Successfully wrote result to data/generated/comprehensive_guide_daca_0_qa_pairs.json
Generating qa content from data/output/comprehensive_guide_daca_0.md...
Content saved to data/generated/comprehensive_guide_daca_0_qa_pairs.json
Processing 3 chunks to generate QA pairs...
Batch processing complete.
Generated 8 QA pairs total
Saving result to data/generated/comprehensive_guide_daca_1_qa_pairs.json
Successfully wrote test file to data/generated/test_write.json
Successfully wrote result to data/generated/comprehensive_guide_daca_1_qa_pairs.json
Generating qa content from data/output/comprehensive_guide_daca_1.md...
Content saved to

**Note:** We are currently encountering a vLLM server error during this step. Upgrading the GPU and system memory did not resolve it.

We now convert the generated QA pairs into the fine-tuning (`ft`) format so we can load them for finetuning:

In [None]:
qa_pairs_filenames = [
    f"data/generated/comprehensive_guide_daca_{i}_qa_pairs.json"
    for i in range(len(filenames[:3]))
]
for filename in qa_pairs_filenames:
    !synthetic-data-kit \
        -c synthetic_data_kit_config.yaml \
        save-as {filename} -f ft

[?25l[32m⠋[0m Converting data/generated/comprehensive_guide_daca_0_qa_pairs.json to ft 
format with json storage...
[?25h[1A[2K[1A[2K[32m Converted to ft format and saved to [0m
[1;32mdata/final/comprehensive_guide_daca_0_qa_pairs_ft.json[0m
[?25l[32m⠋[0m Converting data/generated/comprehensive_guide_daca_1_qa_pairs.json to ft 
format with json storage...
[1A[2K[1A[2K[32m Converted to ft format and saved to [0m
[1;32mdata/final/comprehensive_guide_daca_1_qa_pairs_ft.json[0m
[?25l[32m⠋[0m Converting data/generated/comprehensive_guide_daca_2_qa_pairs.json to ft 
format with json storage...
[1A[2K[1A[2K[32m Converted to ft format and saved to [0m
[1;32mdata/final/comprehensive_guide_daca_2_qa_pairs_ft.json[0m


Let's load up the data and see what the synthetic data looks like!

In [None]:
from datasets import Dataset
import pandas as pd
final_filenames = [
    f"data/final/comprehensive_guide_daca_{i}_qa_pairs_ft.json"
    for i in range(len(filenames[:3]))
]
conversations = pd.concat([
    pd.read_json(name) for name in final_filenames
]).reset_index(drop = True)

dataset = Dataset.from_pandas(conversations)

In [None]:
len(dataset)

20

In [None]:
dataset[6]

{'messages': [{'content': 'You are a helpful assistant.', 'role': 'system'},
  {'content': 'What is the potential benefit of leveraging free-tier cloud services in planetary-scale production?',
   'role': 'user'},
  {'content': 'cost optimization', 'role': 'assistant'}]}
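
The record above follows the usual chat layout: one system, one user, and one assistant message. As a rough sketch of what such a conversion involves, the function below wraps question/answer pairs in that structure. The input keys `question` and `answer` and the name `qa_pairs_to_messages` are assumptions for illustration; the intermediate JSON written by synthetic-data-kit is not shown in this notebook, so check its actual layout before relying on this.

```python
def qa_pairs_to_messages(qa_pairs, system_prompt="You are a helpful assistant."):
    """Wrap each question/answer pair in a system/user/assistant conversation."""
    records = []
    for pair in qa_pairs:
        records.append({
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": pair["question"]},
                {"role": "assistant", "content": pair["answer"]},
            ]
        })
    return records

# Hypothetical input that would reproduce the record printed above.
example_pairs = [{
    "question": "What is the potential benefit of leveraging free-tier "
                "cloud services in planetary-scale production?",
    "answer": "cost optimization",
}]
converted = qa_pairs_to_messages(example_pairs)
print(converted[0]["messages"][-1])
```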

In [None]:
dataset[8]

{'messages': [{'content': 'You are a helpful assistant.', 'role': 'system'},
  {'content': 'What are the key features of DACA framework for agentic AI applications?',
   'role': 'user'},
  {'content': 'a robust, flexible, and cost-effective framework',
   'role': 'assistant'}]}
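
Both samples have very short answers, so it can be worth screening the dataset before training. The sketch below drops conversations whose assistant reply falls under a word-count threshold. The threshold of 5 words and the helper names are arbitrary choices for illustration, not something the original notebook does, and word count is only a crude proxy for answer quality.

```python
def answer_word_count(record):
    """Count whitespace-separated words in the final assistant message."""
    assistant_turns = [m for m in record["messages"] if m["role"] == "assistant"]
    return len(assistant_turns[-1]["content"].split()) if assistant_turns else 0

def drop_short_answers(records, min_words=5):
    """Keep only records whose assistant answer has at least `min_words` words."""
    return [r for r in records if answer_word_count(r) >= min_words]

def make_record(question, answer):
    return {"messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": question},
        {"role": "assistant", "content": answer},
    ]}

# Answers copied from records shown in this notebook.
samples = [
    make_record("Q1", "cost optimization"),
    make_record("Q2", "a robust, flexible, and cost-effective framework"),
    make_record("Q3", "The foundational tenets of DACA are AI-first and cloud-first."),
]
kept = drop_short_answers(samples, min_words=5)
print([answer_word_count(r) for r in samples], len(kept))
```

On the Hugging Face `dataset` object, the same predicate could be applied with `dataset.filter(lambda r: answer_word_count(r) >= 5)`.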

Finally, free the vLLM process to release GPU memory for finetuning!

In [None]:
generator.cleanup()

Attempting to terminate the VLLM server gracefully...
Server did not terminate gracefully after 10 seconds. Forcing kill...
Server killed forcefully.


### Fine-tuning on the Synthetic Dataset with Unsloth

In [None]:
from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct",
    max_seq_length = 1024,
    load_in_4bit = True,  # 4 bit quantization to reduce memory
    full_finetuning = False,
)

==((====))==  Unsloth 2025.7.11: Fast Llama patching. Transformers: 4.54.0. vLLM: 0.8.5.post1.
   \\   /|    NVIDIA L4. Num GPUs = 1. Max memory: 22.161 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/2.35G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/234 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

chat_template.jinja: 0.00B [00:00, ?B/s]

We now add LoRA adapters so that only a small fraction of the parameters is trained (about 0.75% in this run, as the training banner reports) instead of the full model!

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2025.7.11 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
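
As a sanity check on how many weights these adapters add, the arithmetic can be done by hand. A LoRA adapter on a linear layer with `in` inputs and `out` outputs adds `r * (in + out)` parameters. The layer dimensions below are taken from the public Llama 3.2 3B configuration and should be treated as assumptions if you swap in a different checkpoint.

```python
# Llama 3.2 3B dimensions (from the public model config; assumptions here).
hidden, intermediate, layers = 3072, 8192, 28
kv_dim = 8 * 128  # 8 key/value heads x head dimension 128
rank = 16         # the `r` passed to get_peft_model

# (input features, output features) of every targeted projection.
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}

# Each adapter adds A (rank x in) and B (out x rank), i.e. rank * (in + out).
per_layer = sum(rank * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
trainable = per_layer * layers
print(f"{trainable:,}")  # 24,313,856
```

This matches the `Trainable parameters = 24,313,856` figure that the trainer prints when training starts.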


<a name="Data"></a>
### Data Prep
We now use the `Llama-3.2` chat format for conversation-style finetuning. The chat template renders a conversation as shown below; the `Cutting Knowledge Date` header is inserted by default.

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 01 May 2025

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

What is 1+1?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

2<|eot_id|>
```

In [None]:
def formatting_prompts_func(examples):
    convos = examples["messages"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return {"text" : texts}

# Get our previous dataset and format it:
dataset = dataset.map(formatting_prompts_func, batched = True,)

Map:   0%|          | 0/20 [00:00<?, ? examples/s]

In [None]:
dataset[13]

{'messages': [{'content': 'You are a helpful assistant.', 'role': 'system'},
  {'content': 'What are the foundational tenets of DACA?', 'role': 'user'},
  {'content': 'The foundational tenets of DACA are AI-first and cloud-first.',
   'role': 'assistant'}],
 'text': '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 30 Jul 2025\n\nYou are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat are the foundational tenets of DACA?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nThe foundational tenets of DACA are AI-first and cloud-first.<|eot_id|>'}
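
To see how the `text` field is assembled, here is a simplified re-implementation of the rendering in plain Python. It is a sketch that reproduces the string above for conversations that start with a system message; the real template shipped with the tokenizer handles more cases (tools, missing system prompt), and `render_llama32_chat` is a hypothetical name.

```python
def render_llama32_chat(messages, today, cutoff="December 2023",
                        add_generation_prompt=False):
    """Simplified rendering of a Llama 3.2 style conversation."""
    text = "<|begin_of_text|>"
    for message in messages:
        content = message["content"]
        if message["role"] == "system":
            # The date block is placed in front of the system prompt.
            content = (f"Cutting Knowledge Date: {cutoff}\n"
                       f"Today Date: {today}\n\n{content}")
        text += (f"<|start_header_id|>{message['role']}<|end_header_id|>"
                 f"\n\n{content}<|eot_id|>")
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text

convo = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What are the foundational tenets of DACA?"},
    {"role": "assistant",
     "content": "The foundational tenets of DACA are AI-first and cloud-first."},
]
rendered = render_llama32_chat(convo, today="30 Jul 2025")
print(rendered)
```

For training, `add_generation_prompt` stays `False` because the assistant answer is already part of the text; at inference time it is set to `True` so the prompt ends with an open assistant header.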

## Train the model

In [None]:
from trl import SFTTrainer, SFTConfig

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    eval_dataset = None,
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60,
        learning_rate = 2e-4,
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to = "none",
    ),
)

Unsloth: Tokenizing ["text"] (num_proc=2):   0%|          | 0/20 [00:00<?, ? examples/s]

In [None]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA L4. Max memory = 22.161 GB.
3.07 GB of memory reserved.


In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 20 | Num Epochs = 20 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 24,313,856 of 3,237,063,680 (0.75% trained)


Step,Training Loss
1,4.783
2,4.7368
3,4.9021
4,4.534
5,4.1259
6,3.6525
7,3.2935
8,2.6931
9,2.537
10,2.2901
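
The `Num Epochs = 20` in the banner follows from the batch settings. The short calculation below reproduces it; the rounding is chosen to agree with the banner for this run, and other trainer versions may round partial accumulation steps differently.

```python
import math

num_examples = 20                 # len(dataset)
per_device_batch_size = 2
gradient_accumulation_steps = 4
max_steps = 60

effective_batch = per_device_batch_size * gradient_accumulation_steps
batches_per_epoch = math.ceil(num_examples / per_device_batch_size)
updates_per_epoch = math.ceil(batches_per_epoch / gradient_accumulation_steps)
epochs = math.ceil(max_steps / updates_per_epoch)

print(effective_batch, batches_per_epoch, updates_per_epoch, epochs)  # 8 10 3 20
```

In other words, every example is seen about 20 times in 60 steps, which is worth keeping in mind when reading the falling loss on such a small dataset.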


In [None]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

79.8495 seconds used for training.
1.33 minutes used for training.
Peak reserved memory = 3.07 GB.
Peak reserved memory for training = 0.0 GB.
Peak reserved memory % of max memory = 13.853 %.
Peak reserved memory for training % of max memory = 0.0 %.


<a name="Inference"></a>
### Inference
Let's run the model! Use `apply_chat_template` with `add_generation_prompt` set to `True` for inference.

In [None]:
messages = [
    {"role": "user", "content": "What is the purpose of DACA?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer,
                   max_new_tokens = 256, temperature = 0.5)

The purpose of DACA is to implement Agentia World.<|eot_id|>


In [None]:
messages = [
    {"role": "user", "content": "What are the foundations of DACA?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer,
                   max_new_tokens = 256, temperature = 0.1)

The foundations of DACA are AI-first and cloud-first.<|eot_id|>


## Observations

The synthetic data generated by Llama 3.2 is of poor quality: some answers are short fragments such as "cost optimization" rather than complete sentences. This appears to be due to the low values set for `max_seq_length` and `max_generation_tokens`, which were intentionally limited to conserve GPU memory. Attempts to use higher values led to rapid memory exhaustion.

Additionally, the vLLM server was unstable during QA pair generation, specifically when generating multiple QA pairs: it tends to shut down unexpectedly after some time.

As a result, the low-quality QA pairs produced by Llama 3.2 have contributed to poor fine-tuning outcomes.