# 🚀 Unsloth Fine-Tuning Phi-3-mini
### Optimized for Google Colab (T4 GPU)

This notebook provides a complete guide to fine-tuning the Phi-3-mini model using the Unsloth library. Unsloth is designed for speed and memory efficiency, making it ideal for the T4 GPUs available in Google Colab.

# 🛠️ Step 1: Install Dependencies
## Install Unsloth
Configure the environment by installing the Unsloth library, which provides optimized kernels for faster and more memory-efficient fine-tuning of Large Language Models.

In [None]:
!pip install unsloth

# 📥 Step 2: Load Model and Tokenizer
## Initialize Phi-3-mini
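Load the pre-quantized 4-bit Phi-3-mini model and its corresponding tokenizer. Storing the weights in 4 bits instead of 16 cuts their memory footprint to roughly a quarter, usually with only a small loss in output quality, which leaves room on the T4 for activations and optimizer state.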

In [None]:
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = 'unsloth/Phi-3-mini-4k-instruct-bnb-4bit',
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True
)
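
To make the memory saving concrete, here is a rough back-of-the-envelope estimate. It is only a sketch: it assumes about 3.8 billion parameters for Phi-3-mini and counts weights only, ignoring activations, the KV cache, and the small per-block metadata that 4-bit quantization adds. The helper name `weight_gib` is made up for this illustration.

```python
# Rough weight-memory estimate (weights only; activations, KV cache and
# quantization metadata are deliberately ignored).
N_PARAMS = 3.8e9  # assumption: approximate parameter count of Phi-3-mini


def weight_gib(n_params: float, bits_per_param: float) -> float:
    """Return the approximate weight footprint in GiB."""
    return n_params * bits_per_param / 8 / 2**30


fp16_gib = weight_gib(N_PARAMS, 16)  # half-precision baseline
int4_gib = weight_gib(N_PARAMS, 4)   # 4-bit quantized weights

print(f"fp16 weights:  {fp16_gib:.1f} GiB")
print(f"4-bit weights: {int4_gib:.1f} GiB")
```

The real footprint reported by the runtime will be somewhat higher than the 4-bit figure, but the ratio shows why the quantized checkpoint fits comfortably on a 16 GB T4.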

# 📊 Step 3: Prepare Dataset
## Load and Format Instruction Data
Read the training data from `people_data.json` and convert it into a format compatible with the model's chat template for instruction fine-tuning.

In [None]:
import json
from datasets import Dataset

with open("people_data.json", "r", encoding="utf-8") as f:
    data = json.load(f)

ds = Dataset.from_list(data)

def to_text(ex):
    resp = ex["response"]
    if not isinstance(resp, str):
        resp = json.dumps(resp, ensure_ascii=False)
    msgs = [
        {"role": "user", "content": ex["prompt"]},
        {"role": "assistant", "content": resp},
    ]
    return {
        "text": tokenizer.apply_chat_template(
            msgs, tokenize=False, add_generation_prompt=False
        )
    }

dataset = ds.map(to_text, remove_columns=ds.column_names)
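
The contents of `people_data.json` are not shown in this notebook, so the records below are hypothetical. The sketch only illustrates the shape that `to_text` expects (a `prompt` key and a `response` key) and how a non-string response is serialized to a JSON string before it is placed in the assistant turn. The helper name `normalize_response` is introduced here for illustration.

```python
import json

# Hypothetical records; the real people_data.json may look different.
records = [
    {
        "prompt": "Mike is 30 years old and works as a coder.",
        "response": {"name": "Mike", "age": 30, "job": "coder"},
    },
    {
        "prompt": "Anna is 25.",
        "response": '{"name": "Anna", "age": 25}',  # already a string
    },
]


def normalize_response(resp):
    """Return resp unchanged if it is a string, else serialize it to JSON."""
    if isinstance(resp, str):
        return resp
    return json.dumps(resp, ensure_ascii=False)


normalized = [normalize_response(r["response"]) for r in records]
for text in normalized:
    print(text)
```

Either way the assistant message ends up as plain text, which is what the chat template needs.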

# ⚙️ Step 4: Configure LoRA Adapter
## Set PEFT/LoRA Parameters
Apply Parameter-Efficient Fine-Tuning (PEFT) using Low-Rank Adaptation (LoRA). LoRA keeps the original weights frozen and trains a pair of small low-rank matrices for each targeted layer, so only a small fraction of the parameters receives gradients. This saves time, memory, and compute.

In [None]:
# Config From GitHub (without seed)
model = FastLanguageModel.get_peft_model(
    model,
    r = 64,  # rank of matrices (for LoRA)
    target_modules=[
        'q_proj', 'k_proj', 'v_proj', 'o_proj',
        'gate_proj', 'up_proj', 'down_proj',
    ],  # which layers to inject LoRA into
    lora_alpha = 64 * 2,  # scaling factor, usually 2x rank
    lora_dropout = 0,  # no dropout, increase for regularization
    bias = 'none',  # bias stays frozen, only learn the low-rank matrices
    use_gradient_checkpointing = 'unsloth',  # activate custom checkpointing scheme of Unsloth -> higher compute but less GPU memory when backpropagating
)
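
To get a feel for how small the adapter is, the sketch below counts the LoRA weights this configuration adds. For a linear layer with `d_in` inputs and `d_out` outputs, LoRA adds a matrix A of shape r x d_in and a matrix B of shape d_out x r, and the update B·A is scaled by `lora_alpha / r`. The layer dimensions (hidden size 3072, MLP size 8192, 32 decoder layers, key/value projections as wide as the query projection) are assumptions about Phi-3-mini and are not stated in this notebook.

```python
# Assumed Phi-3-mini dimensions (not taken from this notebook).
HIDDEN, INTERMEDIATE, N_LAYERS = 3072, 8192, 32
RANK, ALPHA = 64, 64 * 2  # same values as in the config above

# (in_features, out_features) of each targeted projection in one decoder layer.
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """A is r x d_in and B is d_out x r, so LoRA adds r * (d_in + d_out) weights."""
    return r * (d_in + d_out)


per_layer = sum(lora_params(d_in, d_out, RANK) for d_in, d_out in shapes.values())
total = per_layer * N_LAYERS
scaling = ALPHA / RANK  # factor applied to the low-rank update

print(f"LoRA weights per layer: {per_layer:,}")
print(f"LoRA weights in total:  {total:,}")
print(f"Update scaling factor:  {scaling}")
```

Under these assumptions the adapter is a few percent of the base model's size; the exact number printed by the library when training starts is the authoritative one.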

# 🚀 Step 5: Start Fine-Tuning
## Execute Supervised Fine-Tuning (SFT)
Initialize the SFT Trainer with the specified hyperparameters and start the training process to adapt the model to our custom dataset.

In [None]:
from trl import SFTTrainer, SFTConfig

trainer = SFTTrainer(  # supervised fine-tuning trainer
    model = model,
    train_dataset = dataset,
    tokenizer = tokenizer,
    dataset_text_field = 'text',
    max_seq_length = 2048,
    args = SFTConfig(
        per_device_train_batch_size = 2,  # each GPU reads 2 tokenized sequences at once
        gradient_accumulation_steps = 4,  # accumulate gradients over 4 micro-batches before each optimizer step -> effective batch 2 * 4 = 8
        warmup_steps = 10,  # linearly "climb" to the learning rate from 0 in the first 10 steps
        max_steps = 60,  # stop after 60 optimizer steps; a positive max_steps overrides num_train_epochs
        logging_steps = 1,  # log every single step
        output_dir = "outputs",  # where to store checkpoints, logs etc.
        optim = "adamw_8bit",  # 8-bit AdamW optimizer
        num_train_epochs = 3  # ignored while max_steps is positive
    ),
)

trainer.train()
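
The arithmetic behind these hyperparameters can be sketched in a few lines. The example assumes a single GPU (as on a Colab T4) and a linear warmup followed by linear decay; the peak learning rate `PEAK_LR` is a hypothetical value, since the config above does not set one explicitly. Note that a positive `max_steps` takes precedence over `num_train_epochs`, so the run ends after 60 optimizer steps regardless of dataset size.

```python
PER_DEVICE_BATCH, GRAD_ACCUM = 2, 4
WARMUP_STEPS, MAX_STEPS = 10, 60
PEAK_LR = 2e-4  # hypothetical peak learning rate, not set in the config above

# One optimizer step consumes batch_size * accumulation_steps sequences per GPU.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM
sequences_seen = effective_batch * MAX_STEPS


def lr_at(step: int, peak: float = PEAK_LR,
          warmup: int = WARMUP_STEPS, total: int = MAX_STEPS) -> float:
    """Linear warmup from 0 to peak, then linear decay back to 0 at `total`."""
    if step < warmup:
        return peak * step / warmup
    return peak * max(0, total - step) / (total - warmup)


print(f"Effective batch size: {effective_batch}")
print(f"Sequences processed in {MAX_STEPS} steps: {sequences_seen}")
for step in (0, 5, 10, 35, 60):
    print(f"step {step:2d}: lr = {lr_at(step):.2e}")
```

If the dataset holds fewer sequences than the run processes, the trainer simply iterates over it again until the step budget is used up.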

# 🧪 Step 6: Inference Test
## Verify Fine-Tuning Results
Switch the model to inference mode and generate a response to a test prompt to evaluate the success of the fine-tuning process.

In [None]:
FastLanguageModel.for_inference(model)

messages = [
    {
        "role": "user",
        "content": "Mike is 30 years old, loves hiking and works as a coder."
    },
]

# Apply the chat template, tokenize the messages into a tensor of input IDs, and move it to the GPU
inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")

# Sample up to 512 new tokens at temperature 0.7 with nucleus (top-p) sampling:
# only the smallest set of tokens whose cumulative probability reaches 0.9 is kept
outputs = model.generate(input_ids=inputs, max_new_tokens=512, use_cache=True, temperature=0.7, do_sample=True, top_p=0.9)

response = tokenizer.batch_decode(outputs)[0]

print(response)
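
The effect of `temperature` and `top_p` can be illustrated on a toy distribution. This is a plain-Python sketch of the general technique, not the code that `model.generate` runs internally; the function name `top_p_filter` and the example logits are made up for illustration.

```python
import math


def top_p_filter(logits, temperature=0.7, top_p=0.9):
    """Return {token_index: probability} for the tokens kept by nucleus sampling."""
    # Temperature below 1 sharpens the distribution, above 1 flattens it.
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Walk tokens from most to least likely until the cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Renormalize so the surviving probabilities sum to 1.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.0, -3.0]  # hypothetical scores for a 4-token vocabulary
kept = top_p_filter(toy_logits)
for index, prob in kept.items():
    print(f"token {index}: p = {prob:.3f}")
```

The next token is then drawn at random from the surviving tokens, which removes the unlikely tail while keeping some variety in the output.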

# 💾 Step 7: Export to GGUF
## Save for Local Deployment
Save the fine-tuned model in GGUF format with 4-bit quantization, making it ready for high-performance local inference using tools like llama.cpp.

In [None]:
model.save_pretrained_gguf("gguf_model_scratch_fixed", tokenizer, quantization_method="q4_k_m", maximum_memory_usage = 0.3)
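
Since the exact name of the exported file may vary, it can help to look it up rather than hard-code it. The sketch below lists the `.gguf` files in the output directory and checks that each one starts with the four-byte `GGUF` magic that marks the format. The helper `find_gguf_files` is a name chosen for this illustration and is not part of Unsloth.

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # first four bytes of every GGUF file


def find_gguf_files(directory):
    """Return paths under `directory` whose first four bytes are the GGUF magic."""
    root = Path(directory)
    if not root.is_dir():
        return []
    found = []
    for path in sorted(root.rglob("*.gguf")):
        with open(path, "rb") as f:
            if f.read(4) == GGUF_MAGIC:
                found.append(path)
    return found


# Directory name taken from the export call above.
for path in find_gguf_files("gguf_model_scratch_fixed"):
    print(path, f"({path.stat().st_size / 2**20:.0f} MiB)")
```

The printed path can then be passed as `model_path` when loading the model for local inference.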

# 💻 Step 8: Local Inference with llama-cpp
## Test Final GGUF Model
Install the `llama-cpp-python` library and perform a final test of the exported GGUF model to ensure it is fully functional in a local environment.

In [None]:
!pip install -U llama-cpp-python


In [None]:
from llama_cpp import Llama

llm = Llama(
    model_path="/content/gguf_model_scratch_fixed/phi-3-mini-4k-instruct.Q4_K_M.gguf",
    n_ctx=4096,
    n_threads=8,
    n_gpu_layers=20  # set 0 if CPU only
)

messages = [
    {
        "role": "user",
        "content": "While sipping coffee at a corner café, Stefan, aged 46 earns a living as a architect. He recently discovered a passion for conducting amateur astronomy observations."
    }
]

response = llm.create_chat_completion(
    messages=messages,
    max_tokens=512,
    temperature=0.7,
    top_p=0.9
)

In [None]:
print(response["choices"][0]["message"]["content"])
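
If the model was trained to answer with JSON (an assumption here, suggested by the `json.dumps` call in Step 3 but dependent on the actual training data), the reply can be parsed defensively. The sketch uses a hypothetical mock response in the same nested shape that is indexed above, so it runs without the model; `extract_content` and `try_parse_json` are helper names made up for this example.

```python
import json


def extract_content(response):
    """Pull the assistant text out of a chat-completion style response dict."""
    return response["choices"][0]["message"]["content"]


def try_parse_json(text):
    """Return the parsed object, or None if the text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


# Hypothetical response, shaped like the one indexed above.
mock_response = {
    "choices": [
        {"message": {"role": "assistant", "content": '{"name": "Stefan", "age": 46}'}}
    ]
}

parsed = try_parse_json(extract_content(mock_response))
if parsed is None:
    print("Model output was not valid JSON")
else:
    print(parsed)
```

Checking for `None` keeps a malformed generation from crashing downstream code, which is worth having because sampled output is not guaranteed to be well-formed.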