<a href="https://colab.research.google.com/github/ebamberg/research-projects-ml/blob/main/LLM/training/examples_fine_tune_training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

https://pub.towardsai.net/fine-tuning-llms-from-zero-to-hero-with-python-ollama-52258966bb6d

In [1]:
# Downgrade protobuf to a compatible version; otherwise save_pretrained_gguf fails on Google Colab
!pip install "protobuf>=3.19.0,<4.0.0" --quiet
# Also needed on Google Colab: use the pure-Python protobuf implementation
import os
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"

In [2]:

!pip install unsloth --quiet
# !pip install "unsloth[cu121-torch240] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps trl peft accelerate bitsandbytes transformers datasets --quiet



Load the base model with Unsloth.

In [3]:
from unsloth import FastLanguageModel

model_name = "unsloth/phi-3-mini-4k-instruct-bnb-4bit"
max_seq_length = 2048  # Adjust based on your data length
dtype = None  # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True  # Use 4bit quantization to reduce memory usage
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.8.9: Fast Mistral patching. Transformers: 4.55.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



Add LoRA adapters to the layers for efficient training.

In [4]:
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # Rank - higher = more parameters
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                   "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407
)

Unsloth 2025.8.9 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
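The trainable-parameter count that Unsloth prints during training (29,884,416) can be reproduced by hand. The sketch below assumes the published Phi-3-mini dimensions (hidden size 3072, MLP intermediate size 8192, 32 layers, full multi-head attention without grouped-query sharing); these values are not printed anywhere in this notebook, so treat them as assumptions. Each adapted linear layer mapping `d_in` to `d_out` adds `r * (d_in + d_out)` weights.

```python
# Back-of-the-envelope LoRA parameter count.
# The model dimensions are ASSUMED (Phi-3-mini config), not read from the checkpoint.
r = 16               # LoRA rank, as passed to get_peft_model
hidden = 3072        # assumed hidden size
intermediate = 8192  # assumed MLP intermediate size
num_layers = 32      # matches the "patched 32 layers" message

# (d_in, d_out) for every target module in one transformer layer
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, hidden),
    "v_proj": (hidden, hidden),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}


def lora_params(d_in, d_out, rank):
    """A is (rank x d_in) and B is (d_out x rank), so rank * (d_in + d_out) weights."""
    return rank * (d_in + d_out)


per_layer = sum(lora_params(d_in, d_out, r) for d_in, d_out in shapes.values())
total = per_layer * num_layers
print(f"{per_layer:,} per layer, {total:,} in total")
```

Doubling `r` doubles this count, which is what the "higher = more parameters" comment refers to.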


The training dataset is a JSON file containing a list of objects, each with an `input` and an `output` key:

[
{"input": "our input", "output": "expected output"},
{"input": "our input", "output": "expected output"},
{"input": "our input", "output": "expected output"}
]
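A minimal sketch of producing such a file, assuming only the two keys `input` and `output` that the next cell reads. The example records, the helper `validate`, and the file name `training_data_example.json` are made up for illustration.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical records; real ones would hold actual log lines and labels.
records = [
    {"input": "classify the following log event sequence as normal or suspicious\n<log lines>",
     "output": "normal"},
    {"input": "classify the following log event sequence as normal or suspicious\n<log lines>",
     "output": "suspicious"},
]


def validate(items):
    """Raise ValueError if any record lacks one of the required keys."""
    required = {"input", "output"}
    for index, item in enumerate(items):
        missing = required - set(item)
        if missing:
            raise ValueError(f"record {index} is missing keys: {sorted(missing)}")


validate(records)

# Written to the temp directory so the sketch does not overwrite real data.
path = Path(tempfile.gettempdir()) / "training_data_example.json"
path.write_text(json.dumps(records, indent=2), encoding="utf-8")

# Round-trip check: this is what json.load() in the next cell would return.
loaded = json.loads(path.read_text(encoding="utf-8"))
print(len(loaded), "records written to", path)
```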

In [5]:
import json
from datasets import Dataset

with open("training_data.json", "r") as f:
    data = json.load(f)
# Format for training
def format_chat_template(item):
    return tokenizer.apply_chat_template(
        [
            {"role": "user", "content": item['input']},
            {"role": "assistant", "content": item['output']}
        ],
        tokenize=False,
        add_generation_prompt=False
    )
# Create the training dataset
formatted_data = [{"text": format_chat_template(item)} for item in data]
dataset = Dataset.from_list(formatted_data)
# Check what it looks like
print("Sample training example:")
print(formatted_data[0]["text"])

Sample training example:
<|user|>
classify the following log event sequence as normal or suspicious
192.168.1.20,sess_391494,alex.martin,2025-07-21 14:22:26,User authentication successful,INFO
192.168.1.20,sess_391494,alex.martin,2025-07-21 14:23:26,User logout successful,INFO
192.168.1.20,sess_391494,alex.martin,2025-07-21 14:28:26,Database query executed,INFO
192.168.1.20,sess_391494,alex.martin,2025-07-21 14:32:26,Connection established,DEBUG
192.168.1.20,sess_391494,alex.martin,2025-07-21 14:38:26,Database query executed,INFO
192.168.1.20,sess_391494,alex.martin,2025-07-21 14:52:26,User logout successful,INFO<|end|>
<|assistant|>
normal<|end|>
<|endoftext|>
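Judging from the printed sample, the chat template wraps each turn as `<|role|>`, the content, then `<|end|>`, and closes the conversation with `<|endoftext|>`. The hand-written formatter below only approximates that layout for illustration; the tokenizer's own `apply_chat_template` remains the source of truth, and the function name `format_phi3_like` is hypothetical.

```python
def format_phi3_like(messages):
    """Approximate the layout seen in the sample output above.

    This is an illustration inferred from one printed example,
    not the tokenizer's actual template.
    """
    turns = [f"<|{m['role']}|>\n{m['content']}<|end|>\n" for m in messages]
    return "".join(turns) + "<|endoftext|>"


example = format_phi3_like([
    {"role": "user", "content": "classify the following log event sequence as normal or suspicious"},
    {"role": "assistant", "content": "normal"},
])
print(example)
```

Seeing the layout spelled out makes it easier to spot formatting problems, such as a missing `<|end|>` marker, before training starts.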


Everything is in place, so we can start the training.

In [6]:
from trl import SFTTrainer
from transformers import TrainingArguments
import torch

training_steps=60

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    args=TrainingArguments(
        per_device_train_batch_size=1, # Reduced batch size
        gradient_accumulation_steps=8, # Increased accumulation steps to maintain similar effective batch size
        warmup_steps=5,
        max_steps=training_steps,
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=1,
        optim="adamw_8bit",  # Use "adamw_torch" if you get optimizer errors
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
        save_steps=30,
    ),
)
# Start training! 🚀
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 70 | Num Epochs = 7 | Total steps = 60
O^O/ \_/ \    Batch size per device = 1 | Gradient accumulation steps = 8
\        /    Data Parallel GPUs = 1 | Total batch size (1 x 8 x 1) = 8
 "-____-"     Trainable parameters = 29,884,416 of 3,850,963,968 (0.78% trained)
[34m[1mwandb[0m: Currently logged in as: [33merik-bamberg[0m ([33merik-bamberg-self-[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,0.7675
2,0.7305
3,0.7192
4,0.7672
5,0.6479
6,0.6788
7,0.6201
8,0.5142
9,0.5064
10,0.4851


TrainOutput(global_step=60, training_loss=0.2607352640479803, metrics={'train_runtime': 450.7874, 'train_samples_per_second': 1.065, 'train_steps_per_second': 0.133, 'total_flos': 4031046239846400.0, 'train_loss': 0.2607352640479803})
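The numbers in the training banner and in `TrainOutput` follow from the batch settings. The sketch below is one plausible way to reproduce them from the values shown in this notebook; the rounding-up of partial epochs is an assumption about how the trainer counts, chosen because it matches the reported output.

```python
import math

# Values taken from the cell and its output above
num_examples = 70
per_device_batch = 1
grad_accum = 8
num_gpus = 1
max_steps = 60
train_runtime = 450.7874  # seconds, from TrainOutput

# One optimizer step consumes this many examples
effective_batch = per_device_batch * grad_accum * num_gpus

# Assumed: a final partial accumulation still counts as a step
steps_per_epoch = math.ceil(num_examples / effective_batch)
epochs = math.ceil(max_steps / steps_per_epoch)

samples_seen = max_steps * effective_batch
samples_per_second = round(samples_seen / train_runtime, 3)
steps_per_second = round(max_steps / train_runtime, 3)

print(f"effective batch size: {effective_batch}")
print(f"steps per epoch: {steps_per_epoch}, epochs: {epochs}")
print(f"samples/s: {samples_per_second}, steps/s: {steps_per_second}")
```

With only 70 examples, 60 steps means every example is seen roughly seven times, which is worth keeping in mind when judging the falling training loss.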

Save the fine-tuned model in GGUF format, which is compatible with Ollama.

In [None]:
# save_pretrained_gguf is an Unsloth function; it is not available on standard Hugging Face models
model.save_pretrained_gguf(
    "fine_tuned_model",
    tokenizer,
    quantization_method="q4_k_m"  # Good balance of size/quality
)

Unsloth: You have 1 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which might take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 2.3G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 5.54 out of 12.67 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 32/32 [00:01<00:00, 25.17it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving fine_tuned_model/pytorch_model-00001-of-00002.bin...


In [None]:
%%writefile Modelfile
FROM ./unsloth.Q4_K_M.gguf
PARAMETER temperature 0.1
PARAMETER top_p 0.9
PARAMETER stop "<|endoftext|>"
TEMPLATE "{{ .Prompt }}"
SYSTEM "You are a specialized assistant that classifies log event sequences as normal or suspicious."

You can use Ollama to create a model from the Modelfile:

ollama create \<modelname\> -f Modelfile
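Once the model has been created, it can be queried from Python through Ollama's local HTTP API. The sketch below assumes an Ollama server running on its default address `http://localhost:11434` and uses a hypothetical model name, `log-classifier`; substitute whatever name was passed to `ollama create`. It relies only on the standard library, and the helper names are illustrative.

```python
import json
import urllib.request

# Default local endpoint of an Ollama server (assumed to be running)
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model, prompt):
    """Build a non-streaming generate request without sending it."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt, timeout=120):
    """Send the request and return the generated text."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["response"]


# Example call; "log-classifier" is a hypothetical model name.
# print(generate("log-classifier", "classify the following log event sequence as normal or suspicious\n..."))
```

Keeping request construction separate from sending makes the payload easy to inspect without a running server.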