## 🚀 Day 2/15 — Fine-Tuning with Unsloth AI

## Fine-Tuning a Stoic Philosopher: Creating a Marcus Aurelius Chatbot with Gemma 3 and Unsloth

> This notebook documents the process of fine-tuning Google's powerful **Gemma 3 4B** model to adopt the persona of the Roman emperor and Stoic philosopher, Marcus Aurelius. Using the high-efficiency **Unsloth** library, we train the model on a custom conversational dataset derived from his famous work, "Meditations."
>
> The final result is an AI assistant that can offer wisdom and guidance on modern problems from a uniquely Stoic perspective.

### 🎯 Project Goal

The objective is to move beyond simple question-answering and create a model with a consistent, believable persona. The model should not only know the facts of Stoicism but also embody its principles in its tone, reasoning, and conversational style. It should be able to apply ancient wisdom to contemporary challenges like workplace stress, social media anxiety, and interpersonal conflict.

---
### 👋🏻 About Me

Hi, I'm **Aasher Kamal** — a Generative & Agentic AI developer passionate about building intelligent systems with LLMs.

I have started a **15-day challenge** to master fine-tuning using the open-source **Unsloth AI** framework. This journey will cover everything from LoRA and QLoRA to reinforcement learning, vision, and TTS fine-tuning — all hands-on, all open-source.

I'll be documenting my learnings, experiments, and challenges daily.

---

### 🌐 Connect with Me

- [LinkedIn](https://www.linkedin.com/in/aasher-kamal/)
- [GitHub](https://github.com/aasherkamal216)
- [X (Twitter)](https://x.com/Aasher_Kamal)
- [Facebook](https://www.facebook.com/aasher.kamal)
- [Website](https://aasherkamal.framer.website/)

Let’s build and learn together! 💡

---

### 📖 Acknowledgements

This notebook is adapted from Unsloth's official [GitHub repository](https://github.com/unslothai/notebooks).
I've made some modifications to the original version to fine-tune the model on my own dataset.


### Installation

In [None]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1,<4.0.0" huggingface_hub hf_transfer
    !pip install --no-deps unsloth

In [None]:
from unsloth import FastModel
import torch

model, tokenizer = FastModel.from_pretrained(
    model_name = "unsloth/gemma-3-4b-it",
    max_seq_length = 2048, # Choose any for long context!
    load_in_4bit = True,  # 4 bit quantization to reduce memory
    load_in_8bit = False, # A bit more accurate, uses 2x memory
    full_finetuning = False
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.7.8: Fast Gemma3 patching. Transformers: 4.53.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Using float16 precision for gemma3 won't work! Using float32.



We now add LoRA adapters so we only need to update a small number of parameters!

In [None]:
model = FastModel.get_peft_model(
    model,
    finetune_vision_layers = False, # Turn off for just text
    finetune_language_layers = True,  # Should leave on
    finetune_attention_modules = True,  # Attention good for GRPO
    finetune_mlp_modules = True,  # Should leave on always!

    r = 16, # Larger = higher accuracy, but might overfit
    lora_alpha = 16,  # Recommended alpha == r at least
    lora_dropout = 0,
    bias = "none",
    random_state = 3407,
)

Unsloth: Making `model.base_model.model.model.language_model` require gradients
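
To get a feel for why LoRA touches so few weights, here is a rough back-of-the-envelope sketch. For each adapted linear layer, LoRA adds two small matrices, so the extra parameter count is `r * (in_features + out_features)`. The layer shapes and the layer count below are my assumptions about the Gemma 3 4B text decoder; they are not printed anywhere in this notebook, so treat them as illustrative.

```python
# Assumed (in_features, out_features) shapes for the linear layers of ONE
# decoder block in the Gemma 3 4B text model. These are assumptions made for
# illustration, not values read from the notebook.
ASSUMED_LAYER_SHAPES = {
    "q_proj": (2560, 2048),
    "k_proj": (2560, 1024),
    "v_proj": (2560, 1024),
    "o_proj": (2048, 2560),
    "gate_proj": (2560, 10240),
    "up_proj": (2560, 10240),
    "down_proj": (10240, 2560),
}
ASSUMED_NUM_LAYERS = 34  # assumption, see note above


def lora_param_count(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds A (rank x in_features) and B (out_features x rank)."""
    return rank * (in_features + out_features)


rank = 16  # same value as `r` in get_peft_model above
per_layer = sum(
    lora_param_count(n_in, n_out, rank)
    for n_in, n_out in ASSUMED_LAYER_SHAPES.values()
)
total_trainable = per_layer * ASSUMED_NUM_LAYERS
print(f"{total_trainable:,} trainable LoRA parameters")
```

Under these assumed shapes the total comes out to 29,802,496, which is the same figure the trainer reports as `Trainable parameters` when training starts.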


<a name="Data"></a>
### Data Preparation
For this project, we use a custom conversational dataset containing 151 multi-turn dialogues designed to teach the Marcus Aurelius persona. The raw data is stored in a standard conversational format (often called ChatML-style), with alternating `user` and `assistant` roles.
To prepare this data for training, we must format each conversation into the specific chat template expected by the Gemma 3 model. Gemma 3 renders multi-turn conversations using special tokens like `<start_of_turn>` and `<end_of_turn>`, and the assistant turns are rendered under the `model` role, as shown below:
```
<bos><start_of_turn>user
Hello!<end_of_turn>
<start_of_turn>model
Hey there!<end_of_turn>
```

To accomplish this conversion efficiently, we use a key Unsloth function: `get_chat_template`. This correctly configures the tokenizer with the built-in chat template for the Gemma 3 model family. We then use `tokenizer.apply_chat_template` within a `.map()` operation to transform our entire dataset into the final format ready for training.
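
To make the target format concrete, here is a minimal pure-Python sketch of what the rendering does for plain text turns. The function `render_gemma_chat` is a hypothetical helper written for illustration; the real template is the Jinja template shipped with the tokenizer, which also handles things this sketch ignores (system prompts, images, and so on).

```python
def render_gemma_chat(messages, add_generation_prompt=False):
    """Render text-only chat messages in the Gemma 3 turn format.

    Simplified stand-in for the tokenizer's chat template: it only handles
    string contents and maps the `assistant` role to `model`.
    """
    parts = ["<bos>"]
    for message in messages:
        role = "model" if message["role"] == "assistant" else message["role"]
        content = message["content"].strip()
        parts.append(f"<start_of_turn>{role}\n{content}<end_of_turn>\n")
    if add_generation_prompt:
        # Open a model turn so that generation continues from here.
        parts.append("<start_of_turn>model\n")
    return "".join(parts)


example = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hey there!"},
]
rendered = render_gemma_chat(example)
print(rendered)
```

For training data we render complete conversations, so `add_generation_prompt` stays `False`; at inference time it is set to `True` so the prompt ends with an open `model` turn.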

**Note:** The dataset, `marcus_aurelius_dataset.jsonl`, is available in the datasets directory of my GitHub repository:
https://github.com/aasherkamal216/15_Days_Fine_Tuning_Challenge

In [None]:
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "gemma-3",
)

In [None]:
from datasets import load_dataset

# Load your custom dataset
dataset = load_dataset("json", data_files="marcus_aurelius_dataset.jsonl", split="train")

Generating train split: 0 examples [00:00, ? examples/s]

In [None]:
len(dataset)  # Total examples in the dataset

151

Let's inspect the dataset by picking a random row.

In [None]:
import random

dataset[random.randint(0, len(dataset) - 1)]  # randint is inclusive on both ends

{'messages': [{'role': 'user',
   'content': "I feel like I'm not smart enough to understand complex subjects like philosophy."},
  {'role': 'assistant',
   'content': 'The most important philosophy is simple. Live justly. Be kind. Accept what you cannot change. Control your own mind. You do not need a library of books to practice this. You only need the will.'},
  {'role': 'user', 'content': 'But the ancient texts are so dense.'},
  {'role': 'assistant',
   'content': 'Read a single line. Contemplate it for a day. Try to apply it. The goal is not to read everything, but to integrate a few good principles into your soul. A little food, well digested, gives more strength than a feast that is not.'}]}
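
Before formatting, it can be worth sanity-checking that every record follows this shape. The sketch below uses only the standard library; `validate_conversation` is a hypothetical helper, and the rule that each conversation must end on an `assistant` turn is my own assumption about what makes a useful training example, not something the dataset requires.

```python
import json


def validate_conversation(record: dict) -> bool:
    """Check that a record holds alternating user/assistant text turns."""
    messages = record.get("messages", [])
    # Assumption: a usable training example has complete user/assistant pairs.
    if not messages or len(messages) % 2 != 0:
        return False
    for index, message in enumerate(messages):
        expected_role = "user" if index % 2 == 0 else "assistant"
        if message.get("role") != expected_role:
            return False
        content = message.get("content")
        if not isinstance(content, str) or not content.strip():
            return False
    return True


# One JSONL line in the same shape as the sample row above.
sample_line = json.dumps({
    "messages": [
        {"role": "user", "content": "But the ancient texts are so dense."},
        {"role": "assistant", "content": "Read a single line. Contemplate it for a day."},
    ]
})
good_record = json.loads(sample_line)
bad_record = {"messages": [{"role": "assistant", "content": "Starts with the wrong role."}]}

print(validate_conversation(good_record), validate_conversation(bad_record))
```

With the real file, the same check could be run as `all(validate_conversation(row) for row in dataset)` right after loading.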

We now apply the `Gemma-3` chat template to the messages and save the result in a `text` column. We strip the leading `<bos>` token with `removeprefix('<bos>')` because the processor adds this token again before training, and the model expects only one.

In [None]:
def formatting_prompts_func(examples):
   convos = examples["messages"]
   texts = [tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False).removeprefix('<bos>') for convo in convos]
   return {"text" : texts}

dataset = dataset.map(formatting_prompts_func, batched = True)

Map:   0%|          | 0/151 [00:00<?, ? examples/s]

Let's see how the chat template did! Notice there is no `<bos>` token as the processor tokenizer will be adding one.

In [None]:
print(dataset[random.randint(0, len(dataset) - 1)]["text"])

<start_of_turn>user
I'm overwhelmed by the suffering in the world.<end_of_turn>
<start_of_turn>model
Consider the world from a great height. See the endless cycle of birth and death, joy and pain. It is the nature of things. Your task is not to carry the world's suffering, but to act with kindness in your own sphere.<end_of_turn>
<start_of_turn>user
It makes me feel helpless.<end_of_turn>
<start_of_turn>model
You are not helpless. You can control your own actions. You can be a source of justice and kindness to those around you. That is your power. Do not neglect it by grieving over what you cannot change.<end_of_turn>



<a name="Train"></a>
### Train the model
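Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We cap training at 100 steps with `max_steps = 100` to speed things up. When `max_steps` is set to a positive value it overrides `num_train_epochs`, so for exactly one full pass over the data, remove `max_steps` and keep `num_train_epochs = 1`.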

In [None]:
from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    eval_dataset = None, # Can set up evaluation
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4, # Use GA to mimic batch size!
        warmup_steps = 5,
        num_train_epochs = 1, # Only used if max_steps is removed
        max_steps = 100, # Takes precedence over num_train_epochs
        learning_rate = 2e-4,
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to = "none",
    ),
)

Unsloth: Switching to float32 training since model cannot work with float16
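
It helps to work out what these settings mean in terms of passes over the data. The sketch below is plain arithmetic on the values from the config above; it approximates the trainer's own bookkeeping rather than calling it.

```python
import math

num_examples = 151                 # len(dataset)
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_devices = 1                    # single T4 GPU
max_steps = 100

# Number of examples that contribute to each optimizer step.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

# Optimizer steps needed for one full pass over the dataset.
steps_per_epoch = math.ceil(num_examples / effective_batch_size)

# Number of (possibly partial) passes needed to reach max_steps.
epochs_covered = math.ceil(max_steps / steps_per_epoch)

print(effective_batch_size, steps_per_epoch, epochs_covered)
```

This gives an effective batch size of 8, about 19 steps per epoch, and 6 epochs to reach 100 steps, which lines up with the `Total batch size (2 x 4 x 1) = 8` and `Num Epochs = 6` lines in the training banner. On a dataset this small, that many passes makes overfitting a real possibility, so it is worth watching the loss.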


Unsloth: Tokenizing ["text"] (num_proc=2):   0%|          | 0/151 [00:00<?, ? examples/s]

We also use Unsloth's `train_on_responses_only` function to compute the loss only on the assistant outputs and ignore the user's inputs. This helps increase the accuracy of finetunes!

In [None]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<start_of_turn>user\n",
    response_part = "<start_of_turn>model\n",
)

Map (num_proc=2):   0%|          | 0/151 [00:00<?, ? examples/s]
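
Conceptually, the masking replaces every label that is not part of a model response with `-100`, the index that the loss function ignores. Here is a toy sketch of that idea on made-up token ids. `mask_labels` and all ids below are hypothetical; this is not Unsloth's implementation, just the same principle in a few lines.

```python
IGNORE_INDEX = -100  # labels with this value are skipped by the loss


def mask_labels(token_ids, instruction_ids, response_ids):
    """Keep labels only for tokens that follow a response marker."""
    labels = [IGNORE_INDEX] * len(token_ids)
    in_response = False
    i = 0
    while i < len(token_ids):
        if token_ids[i:i + len(response_ids)] == response_ids:
            in_response = True            # the marker itself stays masked
            i += len(response_ids)
        elif token_ids[i:i + len(instruction_ids)] == instruction_ids:
            in_response = False           # back in a user turn
            i += len(instruction_ids)
        else:
            if in_response:
                labels[i] = token_ids[i]  # train on this token
            i += 1
    return labels


# Made-up ids: 1 = <bos>, [10, 11] = "<start_of_turn>user\n",
# [10, 12] = "<start_of_turn>model\n", 13 = <end_of_turn>.
toy_ids = [1, 10, 11, 50, 51, 13, 10, 12, 60, 61, 13, 10, 11, 52, 13, 10, 12, 62, 13]
toy_labels = mask_labels(toy_ids, instruction_ids=[10, 11], response_ids=[10, 12])
print(toy_labels)
```

Only the response tokens and their closing `<end_of_turn>` keep real labels; everything else, including the turn markers, is ignored by the loss.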

Let's verify that the instruction part is masked by printing the row at index 100. Notice how the sample has only a single `<bos>`, as expected!

In [None]:
tokenizer.decode(trainer.train_dataset[100]["input_ids"])

"<bos><start_of_turn>user\nI feel like I'm running on a hamster wheel, busy but never making progress.<end_of_turn>\n<start_of_turn>model\nWhat is the destination you are trying to reach? Progress is not always forward motion. Sometimes it is the perfection of your character in the place where you stand. Are you more patient today than yesterday? That is progress.<end_of_turn>\n<start_of_turn>user\nI mean progress in my career.<end_of_turn>\n<start_of_turn>model\nA career is an external path. It is subject to fortune. The inner path is the cultivation of virtue. Focus on that. A good person is a success, no matter their title. Let your aim be to be good, not to 'get ahead.'<end_of_turn>\n"

Now let's print the masked out example - you should see only the answer is present:

In [None]:
tokenizer.decode([tokenizer.pad_token_id if x == -100 else x for x in trainer.train_dataset[100]["labels"]]).replace(tokenizer.pad_token, " ")

"                           What is the destination you are trying to reach? Progress is not always forward motion. Sometimes it is the perfection of your character in the place where you stand. Are you more patient today than yesterday? That is progress.<end_of_turn>\n               A career is an external path. It is subject to fortune. The inner path is the cultivation of virtue. Focus on that. A good person is a success, no matter their title. Let your aim be to be good, not to 'get ahead.'<end_of_turn>\n"

Let's train the model!

In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 151 | Num Epochs = 6 | Total steps = 100
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 29,802,496 of 4,329,881,968 (0.69% trained)
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


Unsloth: Will smartly offload gradients to save VRAM!


| Step | Training Loss |
|------|---------------|
| 1 | 5.5734 |
| 2 | 5.148 |
| 3 | 5.1678 |
| 4 | 4.2815 |
| 5 | 3.3083 |
| 6 | 3.0524 |
| 7 | 2.564 |
| 8 | 2.5 |
| 9 | 2.4321 |
| 10 | 2.4783 |


In [None]:
trainer_stats

TrainOutput(global_step=100, training_loss=1.7240863716602326, metrics={'train_runtime': 568.1993, 'train_samples_per_second': 1.408, 'train_steps_per_second': 0.176, 'total_flos': 3868417389060576.0, 'train_loss': 1.7240863716602326})

In [None]:
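# @title Show final memory and time stats
# NOTE: `start_gpu_memory` is not defined in this cell. Record it BEFORE calling
# trainer.train(), using the same max_memory_reserved() expression as below.
max_memory = round(torch.cuda.get_device_properties(0).total_memory / 1024 / 1024 / 1024, 3)
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)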
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

568.1993 seconds used for training.
9.47 minutes used for training.
Peak reserved memory = 5.59 GB.
Peak reserved memory for training = 0.0 GB.
Peak reserved memory % of max memory = 37.921 %.
Peak reserved memory for training % of max memory = 0.0 %.


<a name="Inference"></a>
### Inference
Let's run the model via Unsloth native inference! According to the `Gemma-3` team, the recommended settings for inference are `temperature = 1.0, top_p = 0.95, top_k = 64`

In [None]:
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "gemma-3",
)

user_message = input("Enter a message: ")  # Get user input

messages = [{
    "role": "user",
    "content": [{
        "type" : "text",
        "text" : user_message,
    }]
}]
text = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True, # Must add for generation
)
outputs = model.generate(
    **tokenizer([text], return_tensors = "pt").to("cuda"),
    max_new_tokens = 256,
    # Recommended Gemma-3 settings
    temperature = 1.0, top_p = 0.95, top_k = 64,
)
tokenizer.batch_decode(outputs)

Enter a message: I dont have motivation to work


['<bos><start_of_turn>user\nI dont have motivation to work<end_of_turn>\n<start_of_turn>model\nDoes the actor feel a passion for their role? They act because the play requires it. Your duty is to work. It is your nature. Perform it with a steady mind. You are an instrument. Can it not perform its purpose?<end_of_turn>']
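
If you are wondering what `temperature`, `top_k`, and `top_p` actually do, here is a simplified sketch of how they reshape the next-token distribution before sampling. `sampling_distribution` is a hypothetical helper; the exact order of operations and tie-breaking inside `transformers` may differ, so read it as an illustration of the idea only.

```python
import math


def sampling_distribution(logits, temperature=1.0, top_k=64, top_p=0.95):
    """Return {token_index: probability} after temperature, top-k and top-p."""
    # 1. Temperature: divide logits, then softmax (stabilised with the max).
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    probs = [value / total for value in exps]

    # 2. Top-k: keep the k most likely tokens and renormalise.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    k_mass = sum(probs[i] for i in ranked)

    # 3. Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / k_mass
        if cumulative >= top_p:
            break

    # 4. Renormalise over the surviving tokens.
    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}


# Toy vocabulary of four tokens with probabilities 0.5, 0.3, 0.15, 0.05.
toy_logits = [math.log(p) for p in (0.5, 0.3, 0.15, 0.05)]
dist = sampling_distribution(toy_logits, temperature=1.0, top_k=3, top_p=0.8)
print(dist)
```

In the toy example, top-k drops the least likely token and top-p then keeps only the first two, which end up with probabilities 0.625 and 0.375. Lower temperatures sharpen the distribution before these filters are applied, which makes the output more predictable.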

You can also use a `TextStreamer` for continuous inference, so you can see the generation token by token instead of waiting for the whole response!

In [None]:
from transformers import TextStreamer

user_message = input("Enter a message: ")

messages = [{
    "role": "user",
    "content": [{"type" : "text", "text" : user_message}]
}]

text = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True, # Must add for generation
)

_ = model.generate(
    **tokenizer([text], return_tensors = "pt").to("cuda"),
    max_new_tokens = 256,
    # Recommended Gemma-3 settings
    temperature = 1.0, top_p = 0.95, top_k = 64,
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)

Enter a message: How to not procrastinate
Acknowledge the unpleasantness of doing what is needed. The pleasure of procrastination is brief and makes the future even more unpleasant. Attend now to the work at hand. The act itself is what matters, not the outcome or the thoughts that accompany it.
<end_of_turn>


### Gradio Interface

We use Gradio to create an interactive chat interface for our model.

In [None]:
!pip install gradio -q

import gradio as gr
import torch
from transformers import TextIteratorStreamer, GenerationConfig
from threading import Thread

In [None]:
from unsloth import FastLanguageModel

# Ensure the model is in inference mode
FastLanguageModel.for_inference(model)

# Define the chat generation function
def run_chat(message: str, history: list):
    """
    The main chat function that interacts with the model.
    """
    # 1. Format the conversation history and the new message
    conversation = []
    for user_msg, assistant_msg in history:
        conversation.append({"role": "user", "content": user_msg})
        conversation.append({"role": "model", "content": assistant_msg})

    conversation.append({"role": "user", "content": message})

    # 2. Apply the chat template
    formatted_prompt = tokenizer.apply_chat_template(
        conversation,
        tokenize=False,
        add_generation_prompt=True,
    )

    # 3. Tokenize the input
    inputs = tokenizer([formatted_prompt], return_tensors="pt").to("cuda")

    # 4. Set up the streamer and generation configuration
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

    generation_config = GenerationConfig(
        max_new_tokens=512,
        temperature=0.8,
        top_p=0.95,
        top_k=50,
        do_sample=True,
    )

    # 5. Run generation in a separate thread for streaming
    thread = Thread(target=model.generate, kwargs={**inputs, "generation_config": generation_config, "streamer": streamer})
    thread.start()

    # 6. Yield the generated text as it comes in
    partial_message = ""
    for new_token in streamer:
        partial_message += new_token
        yield partial_message

# Create and launch the Gradio Chat UI
gr.ChatInterface(
    fn=run_chat,
    title="Marcus Aurelius (Gemma 3 4B)",
    chatbot=gr.Chatbot(height=500),
    textbox=gr.Textbox(placeholder="Ask a question to the Stoic Emperor...", container=False, scale=7),
    theme="soft",
    examples=[
        "I am consumed by envy when I see the success of others on social media.",
        "A colleague took credit for my work. How should I handle my anger?",
        "My job feels repetitive and meaningless. What is the purpose of it all?",
    ],
    cache_examples=False,
).launch(share=True, debug=True)
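
The history handling in `run_chat` can be pulled out into a small pure function, which makes it easy to test without loading the model. This sketch assumes Gradio passes `history` as a list of `(user, assistant)` pairs, as the loop above does; depending on your Gradio version, the history may instead arrive as a list of role/content dictionaries, in which case it can be passed through almost unchanged. `history_to_messages` is a hypothetical helper name.

```python
def history_to_messages(history, new_message):
    """Convert pair-style chat history plus a new message into chat messages."""
    messages = []
    for user_text, assistant_text in history:
        messages.append({"role": "user", "content": user_text})
        # "assistant" matches the role name used in the training dataset;
        # the Gemma 3 chat template renders it as the `model` turn.
        messages.append({"role": "assistant", "content": assistant_text})
    messages.append({"role": "user", "content": new_message})
    return messages


past_turns = [("How to not procrastinate", "Attend now to the work at hand.")]
conversation = history_to_messages(past_turns, "But I feel tired.")
print(conversation)
```

The resulting list can be handed straight to `tokenizer.apply_chat_template(..., add_generation_prompt=True)` as in step 2 of `run_chat`.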

<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, use either Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

In [None]:
from google.colab import userdata

hf_model_repo = "Aasher/gemma-3-4b-marcus-aurelius"
hf_token = userdata.get('HF_TOKEN')

# 1. Save LoRA Adapters to the Hub
print("Saving LoRA Adapters...")
model.push_to_hub(hf_model_repo, token=hf_token)
tokenizer.push_to_hub(hf_model_repo, token=hf_token)
print("LoRA Adapters saved successfully!")

Saving LoRA Adapters...



Saved model to https://huggingface.co/Aasher/gemma-3-4b-marcus-aurelius



LoRA Adapters saved successfully!


In [None]:
# 2. Save Merged 16-bit Model to the Hub
# This is for easy inference and is required for GGUF conversion.
print("\nSaving Merged 16-bit Model...")
model.push_to_hub_merged(hf_model_repo, tokenizer, save_method="merged_16bit", token=hf_token)
print("Merged 16-bit Model saved successfully!")

In [None]:
# 3. Save GGUF Model to the Hub
# This is for running locally with Ollama, LM Studio, etc.
print("\nSaving GGUF Model...")
# Use quantization methods that are a good balance of size and quality
model.push_to_hub_gguf(hf_model_repo, tokenizer, quantization_method=["q4_k_m", "q8_0"], token=hf_token)
print("GGUF Models saved successfully!")