# Fine-tuning Phi-3.5 Mini Instruct Model with Unsloth + ShareGPT dataset
This notebook fine-tunes the `unsloth/Phi-3.5-mini-instruct` model on a chat-style conversation dataset (`guanaco-sharegpt-style`) using LoRA adapters.

In [1]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl==0.15.2 triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf==3.20.3 datasets huggingface_hub hf_transfer tyro
    !pip install --no-deps unsloth

# Load the Phi-3.5 Mini Instruct Model
We'll load the model using Unsloth, configured for 4-bit quantization to reduce memory usage.
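
To make the memory saving concrete, here is a back-of-the-envelope sketch of the weight footprint at different precisions. The parameter count of roughly 3.8 billion is an assumption taken from the model's published size, and `weight_gib` is a hypothetical helper written for this illustration.

```python
# Rough weight-memory estimate at different precisions.
# ASSUMPTION: ~3.8e9 parameters (published size of Phi-3.5-mini); illustrative only.
NUM_PARAMS = 3.8e9

def weight_gib(num_params: float, bits_per_param: float) -> float:
    """Return the approximate size of the weights in GiB."""
    return num_params * bits_per_param / 8 / 2**30

for name, bits in [("float32", 32), ("bfloat16", 16), ("4-bit", 4)]:
    print(f"{name:>8}: {weight_gib(NUM_PARAMS, bits):.2f} GiB")
```

Real 4-bit checkpoints are typically somewhat larger than this estimate, because quantization constants and tensors left unquantized add overhead.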

In [2]:
from unsloth import FastLanguageModel
import torch

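max_seq_length = 2048  # maximum sequence length (in tokens) used for training
load_in_4bit = True    # load 4-bit quantized weights to reduce memory usage
dtype = None           # None lets Unsloth pick the dtype automatically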

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Phi-3.5-mini-instruct",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


    PyTorch 2.6.0+cu124 with CUDA 1204 (you have 2.5.1+cu124)
    Python  3.11.11 (you have 3.11.11)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details


🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.3.19: Fast Llama patching. Transformers: 4.48.3.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 8.0. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



# Add LoRA Adapters
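LoRA makes fine-tuning efficient by freezing the base weights and training only small low-rank adapter matrices attached to the attention and MLP projection layers (about 0.75% of the parameters in this run).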

In [3]:
model = FastLanguageModel.get_peft_model(
    model,
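    r = 16,  # LoRA rank: inner dimension of the low-rank update matrices
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,  # scaling factor; the update is scaled by lora_alpha / r
    lora_dropout = 0,  # no dropout on the adapter path
    bias = "none",  # bias terms are not trained
    use_gradient_checkpointing = "unsloth",  # Unsloth's memory-saving checkpointing mode
    random_state = 3407,  # seed for reproducible adapter initialization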
)

Unsloth 2025.3.19 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
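
As a sanity check on the adapter configuration, the number of trainable LoRA parameters can be worked out by hand. The sketch below assumes a hidden size of 3072 and an MLP intermediate size of 8192 with full-width key/value projections; these dimensions are not stated in this notebook, and `lora_params` is a hypothetical helper. Under those assumptions the result agrees with the trainable-parameter count reported in the training log (29,884,416).

```python
# Count LoRA trainable parameters for the adapter config used in this notebook.
# ASSUMPTIONS (not stated in the notebook): hidden size 3072, MLP intermediate
# size 8192, and key/value projections as wide as the query projection.
HIDDEN = 3072
INTERMEDIATE = 8192
NUM_LAYERS = 32  # from the "patched 32 layers" log line
RANK = 16        # r in get_peft_model

# (in_features, out_features) of each targeted linear layer
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank) beside each frozen weight."""
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, RANK) for i, o in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{total:,} trainable LoRA parameters")  # prints: 29,884,416 trainable LoRA parameters
```

Because the count grows linearly with `r`, doubling the rank to 32 would double the number of trainable parameters.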


# Load and Prepare the Dataset
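We use the `philschmid/guanaco-sharegpt-style` dataset, which stores multi-turn conversations in ShareGPT format: each turn is a dict with a `from` field (`human` or `gpt`) and a `value` field holding the message text.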
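
To illustrate what the chat template does to a ShareGPT-style record, here is a simplified stand-in written in plain Python. It is not Unsloth's template: `render_sharegpt` is a hypothetical function, and the exact whitespace and handling of other roles in the real `phi-3` template may differ. The special tokens are the ones that appear in this notebook's inference output.

```python
# Simplified stand-in for a Phi-3 style chat template applied to ShareGPT turns.
# ASSUMPTION: the real template's whitespace and extra roles may differ.
ROLE_TAGS = {"human": "<|user|>", "gpt": "<|assistant|>"}

def render_sharegpt(conversation, add_generation_prompt=False):
    """Flatten a list of {"from", "value"} turns into one training string."""
    parts = []
    for turn in conversation:
        tag = ROLE_TAGS[turn["from"]]
        parts.append(f"{tag}\n{turn['value']}<|end|>\n")
    if add_generation_prompt:
        # At inference time, end with an open assistant turn for the model to fill.
        parts.append("<|assistant|>\n")
    return "".join(parts)

example = [
    {"from": "human", "value": "What is LoRA?"},
    {"from": "gpt", "value": "A parameter-efficient fine-tuning method."},
]
print(render_sharegpt(example))
```

Training text uses `add_generation_prompt=False` because the assistant's reply is already part of the conversation; inference sets it to `True` so the prompt ends where the model should start writing.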

In [4]:
from datasets import load_dataset
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "phi-3",
    mapping = {"role": "from", "content": "value", "user": "human", "assistant": "gpt"},
)

def formatting_prompts_func(examples):
    conversations = examples["conversations"]
    texts = [tokenizer.apply_chat_template(conv, tokenize=False, add_generation_prompt=False) for conv in conversations]
    return {"text": texts}

dataset = load_dataset("philschmid/guanaco-sharegpt-style", split="train")
dataset = dataset.map(formatting_prompts_func, batched=True)


# Train the Model
We use `SFTTrainer` from Hugging Face's TRL library for supervised instruction fine-tuning.
Training runs for 60 steps as a quick demonstration; for a full run, raise `max_steps` or set `num_train_epochs` instead.
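
To put the 60-step run in perspective, the sketch below works out how much of the dataset it covers. The batch settings and dataset size are the ones used in this notebook; the steps-per-epoch figure is an approximation, since the trainer's own rounding may differ by a step.

```python
import math

NUM_EXAMPLES = 9033    # size of the train split
PER_DEVICE_BATCH = 2   # per_device_train_batch_size
GRAD_ACCUM_STEPS = 4   # gradient_accumulation_steps
NUM_GPUS = 1
MAX_STEPS = 60

# Examples consumed per optimizer step
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS * NUM_GPUS
# Approximate optimizer steps needed to see every example once
steps_per_epoch = math.ceil(NUM_EXAMPLES / effective_batch)
examples_seen = MAX_STEPS * effective_batch
epoch_fraction = examples_seen / NUM_EXAMPLES

print(f"effective batch size: {effective_batch}")
print(f"approx. steps per epoch: {steps_per_epoch}")
print(f"examples seen in {MAX_STEPS} steps: {examples_seen} ({epoch_fraction:.1%} of one epoch)")
```

At these settings the run sees only about 5% of the data, so it is best read as a smoke test of the pipeline rather than a full fine-tune.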

In [5]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "phi3_5mini_outputs",
        report_to = "none",
    ),
)

trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 9,033 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 29,884,416/4,000,000,000 (0.75% trained)


Unsloth: Will smartly offload gradients to save VRAM!


| Step | Training Loss |
|------|---------------|
| 1 | 15.5931 |
| 2 | 14.0415 |
| 3 | 15.8019 |
| 4 | 13.9508 |
| 5 | 14.4112 |
| 6 | 14.6182 |
| 7 | 13.3387 |
| 8 | 12.6233 |
| 9 | 9.2139 |
| 10 | 8.9368 |


TrainOutput(global_step=60, training_loss=6.463139843940735, metrics={'train_runtime': 123.3169, 'train_samples_per_second': 3.892, 'train_steps_per_second': 0.487, 'total_flos': 6644697009500160.0, 'train_loss': 6.463139843940735})
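
The throughput figures in `TrainOutput` can be cross-checked from the step count, the effective batch size, and the runtime. The sketch below also extrapolates to a full epoch, assuming throughput stays constant, which is only a rough guide since sequence lengths vary across the dataset.

```python
import math

STEPS = 60
EFFECTIVE_BATCH = 8        # 2 per device x 4 accumulation steps x 1 GPU
TRAIN_RUNTIME = 123.3169   # seconds, from train_runtime
NUM_EXAMPLES = 9033

steps_per_second = STEPS / TRAIN_RUNTIME
samples_per_second = STEPS * EFFECTIVE_BATCH / TRAIN_RUNTIME
print(round(steps_per_second, 3), round(samples_per_second, 3))  # prints: 0.487 3.892

# ASSUMPTION: constant throughput; a rough extrapolation, not a measurement.
steps_per_epoch = math.ceil(NUM_EXAMPLES / EFFECTIVE_BATCH)
epoch_minutes = steps_per_epoch / steps_per_second / 60
print(f"estimated time for one epoch: {epoch_minutes:.0f} minutes")
```

Both rounded values match the `train_steps_per_second` and `train_samples_per_second` entries reported by the trainer.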

# Inference
We'll check how well the model responds to a simple instruction after fine-tuning.
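
The generation call in this cell passes only `input_ids`, and the run logs a warning that the attention mask is not set. The sketch below shows, in plain Python, what an explicit attention mask encodes and why it cannot be inferred from token ids when the pad and EOS tokens share an id. `pad_with_mask` and the id `0` are hypothetical and chosen only for illustration.

```python
# Build left-padded batches with explicit attention masks (plain Python).
# ASSUMPTION: PAD_ID is a hypothetical id that doubles as the EOS id.
PAD_ID = 0

def pad_with_mask(sequences, pad_id=PAD_ID):
    """Left-pad token-id lists to equal length and return (ids, masks)."""
    width = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = width - len(seq)
        input_ids.append([pad_id] * n_pad + list(seq))
        # 0 marks padding to ignore, 1 marks real tokens to attend to
        attention_mask.append([0] * n_pad + [1] * len(seq))
    return input_ids, attention_mask

# The second sequence ends with a genuine EOS token (id 0).
ids, mask = pad_with_mask([[5, 6, 7], [8, 0]])
print(ids)   # prints: [[5, 6, 7], [0, 8, 0]]
print(mask)  # prints: [[1, 1, 1], [0, 1, 1]]
```

In the second row both zeros look identical in `ids`, yet only the first is padding; the mask is what tells them apart. For a single unpadded prompt, as used here, the mask is all ones. Recent versions of `transformers` can return it alongside the ids from `apply_chat_template` via `return_dict=True`, which is worth verifying against the installed version.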

In [6]:
FastLanguageModel.for_inference(model)

messages = [
    {"from": "human", "value": "Explain gravity to a five-year-old."},
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to("cuda")

from transformers import TextStreamer
streamer = TextStreamer(tokenizer)

_ = model.generate(input_ids=inputs, streamer=streamer, max_new_tokens=128, use_cache=True)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


<|user|> Explain gravity to a five-year-old.<|end|><|assistant|>

# Save the Finetuned Model
We save the LoRA adapters and tokenizer for future reuse or upload.

In [7]:
model.save_pretrained("phi3_5mini_lora")
tokenizer.save_pretrained("phi3_5mini_lora")

('phi3_5mini_lora/tokenizer_config.json',
 'phi3_5mini_lora/special_tokens_map.json',
 'phi3_5mini_lora/tokenizer.model',
 'phi3_5mini_lora/added_tokens.json',
 'phi3_5mini_lora/tokenizer.json')