<a href="https://colab.research.google.com/github/bjoxiah/finetuning-phi-3-tutorial/blob/main/FinetuningLLM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

📦  Install Required Packages

In [1]:
# 📦 Clean and Stable Setup
!pip uninstall -y wandb
!pip install -q unsloth datasets pyarrow==19.0.0 # unsloth handles the other packages and dependencies


🛑 Disable WANDB

In [2]:
import os
os.environ["WANDB_DISABLED"] = "true"  # disable wandb logging

🛠 It's recommended to load unsloth first

In [3]:
import unsloth  # Import first, as Unsloth recommends

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.




🦥 Unsloth Zoo will now patch everything to make training faster!


Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).

🤙 A Little Housekeeping

In [3]:
import torch
print(torch.cuda.is_available(), torch.cuda.get_device_name(0))


True Tesla T4


🗂 Give Colab access to Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


✅ Validate access to training dataset

In [None]:
from datasets import load_dataset
ds = load_dataset("json", data_files="/content/drive/MyDrive/TrainingData/training_set_converted.jsonl")
print(ds["train"][0])
# Should show: {'messages': [{'role': 'system', 'content': '...'}, ...]}

{'messages': [{'role': 'system', 'content': "You are AcmeTech Corp's helpful AI assistant."}, {'role': 'user', 'content': 'Tell me about AcmeTech Corp.'}, {'role': 'assistant', 'content': 'AcmeTech Corp is a leading software company specializing in innovative solutions for businesses, including productivity apps, cloud services, and customer support platforms.'}]}
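
Printing one record confirms that the file loads, but it does not check the other rows. The sketch below shows one way to check every line of a chat-style JSONL file before training. The helper names (`validate_record`, `validate_jsonl_lines`) and the rule that each conversation should end with an assistant turn are assumptions made for illustration. They are not part of `datasets` or Unsloth.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}

def validate_record(record):
    """Return a list of problems found in one chat record (empty if it looks fine)."""
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]
    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i} is not an object")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i} has unexpected role {msg.get('role')!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i} has empty or non-string content")
    # Assumption: a supervised example should end with the answer we train on.
    if isinstance(messages[-1], dict) and messages[-1].get("role") != "assistant":
        problems.append("last message should come from the assistant")
    return problems

def validate_jsonl_lines(lines):
    """Validate an iterable of JSONL lines; returns {line_number: [problems]}."""
    report = {}
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            report[number] = [f"invalid JSON: {exc.msg}"]
            continue
        problems = validate_record(record)
        if problems:
            report[number] = problems
    return report

# Two in-memory lines: one well-formed, one missing the assistant turn.
good = {"messages": [
    {"role": "system", "content": "You are AcmeTech Corp's helpful AI assistant."},
    {"role": "user", "content": "Tell me about AcmeTech Corp."},
    {"role": "assistant", "content": "AcmeTech Corp is a leading software company."},
]}
bad = {"messages": [{"role": "user", "content": "Hi"}]}
sample_lines = [json.dumps(good), json.dumps(bad)]
report = validate_jsonl_lines(sample_lines)
print(report)
```

To check the real file, open it and pass the file object (an iterable of lines) to `validate_jsonl_lines`. An empty dictionary means no problems were found.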


⚙ Let's train!

In [None]:
import torch
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# ---- 1️⃣ Load dataset ----
dataset = load_dataset("json", data_files="/content/drive/MyDrive/TrainingData/training_set_converted.jsonl")
print(f"Dataset loaded: {len(dataset['train'])} examples")
print("Sample:", dataset["train"][0])

# ---- 2️⃣ Load base model ----
max_seq_length = 2048  # Phi-3 supports up to 4096, but 2048 is safer for memory
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="microsoft/phi-3-mini-4k-instruct",
    max_seq_length=max_seq_length,
    dtype=None,  # Auto-detect
    load_in_4bit=True,
)

# ---- 3️⃣ Prepare LoRA ----
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # Increased from 8 for better capacity
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # All linear layers
    lora_alpha=16,
    lora_dropout=0.05,  # Small dropout helps prevent overfitting; Unsloth's fast patching needs dropout = 0
    bias="none",
    use_gradient_checkpointing="unsloth",  # Memory efficient
    random_state=42,
)

# ---- 4️⃣ Phi-3 chat template formatting ----
def formatting_func(examples):
    """
    Uses Phi-3's official chat template format.
    Phi-3 format: <|system|>\n{system}<|end|>\n<|user|>\n{user}<|end|>\n<|assistant|>\n{assistant}<|end|>\n
    """
    texts = []
    for messages in examples["messages"]:
        # Apply chat template - this handles the special tokens correctly
        text = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=False  # We want the full conversation
        )
        texts.append(text)
    return texts

# Test the formatting
print("\n=== Sample Formatted Text ===")
sample = formatting_func({"messages": [dataset["train"][0]["messages"]]})
print(sample[0])
print("=" * 50)

# ---- 5️⃣ Training arguments ----
training_args = TrainingArguments(
    output_dir="/content/drive/MyDrive/Output/finetuned-phi3-lora",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,  # Effective batch size = 8
    warmup_steps=10,
    num_train_epochs=3,  # 2-3 epochs typically sufficient
    learning_rate=2e-4,
    fp16=not torch.cuda.is_bf16_supported(),
    bf16=torch.cuda.is_bf16_supported(),
    logging_steps=10,
    optim="adamw_8bit",  # More memory efficient than adamw_torch
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    seed=42,
    save_strategy="epoch",
    save_total_limit=2,  # Only keep last 2 checkpoints
    report_to="none",
)

# ---- 6️⃣ Create Trainer ----
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset["train"],
    dataset_text_field="text",  # Dummy field, we use formatting_func
    formatting_func=formatting_func,
    max_seq_length=max_seq_length,
    dataset_num_proc=2,  # Parallel processing
    packing=False,  # Don't pack multiple examples together
    args=training_args,
)

# ---- 7️⃣ Train ----
print("\n=== Starting Training ===")
trainer.train()


# ---- 8️⃣ Push to Hugging Face Hub ----
print("\n=== Pushing to Hugging Face Hub ===")

# Log in to Hugging Face with a token stored in Colab's Secrets panel
from huggingface_hub import login
from google.colab import userdata

login(token=userdata.get('HuggingFace'))  # Reads the Colab secret named 'HuggingFace'; no prompt appears

# Set your username and model name
hf_username = "bjoxiah"  # Replace with your HF username
model_name = "acmetech-phi3-assistant"  # Name for your model

# Push LoRA adapter only (smallest, fastest)
print("\n📤 Pushing LoRA adapter...")
model.push_to_hub(
    f"{hf_username}/{model_name}-lora",
    token=True,  # Uses your logged-in token
    private=True,  # Set to False if you want it public
)
tokenizer.push_to_hub(
    f"{hf_username}/{model_name}-lora",
    token=True,
)

# Push merged 16-bit model (optional - larger but easier to use)
print("\n📤 Pushing merged 16-bit model...")
model.push_to_hub_merged(
    f"{hf_username}/{model_name}",
    tokenizer,
    save_method="merged_16bit",
    token=True,
    private=True,
)

print("\n✅ Training Complete!")
print(f"📦 LoRA model: https://huggingface.co/{hf_username}/{model_name}-lora")
print(f"📦 Merged model: https://huggingface.co/{hf_username}/{model_name}")

Dataset loaded: 37 examples
Sample: {'messages': [{'role': 'system', 'content': "You are AcmeTech Corp's helpful AI assistant."}, {'role': 'user', 'content': 'Tell me about AcmeTech Corp.'}, {'role': 'assistant', 'content': 'AcmeTech Corp is a leading software company specializing in innovative solutions for businesses, including productivity apps, cloud services, and customer support platforms.'}]}
==((====))==  Unsloth 2025.10.11: Fast Mistral patching. Transformers: 4.57.1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/2.26G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/194 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/293 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/458 [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

Unsloth: Dropout = 0 is supported for fast patching. You are using dropout = 0.05.
Unsloth will patch all other layers, except LoRA matrices, causing a performance hit.
Unsloth 2025.10.11 patched 32 layers with 0 QKV layers, 0 O layers and 0 MLP layers.
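
The LoRA settings in step 3 (`r=16` on seven projection types) determine the trainable-parameter count that Unsloth reports in its training banner (29,884,416). The sketch below reproduces that number by hand. The layer dimensions (hidden size 3072, MLP size 8192, 32 layers, key/value projections as wide as the query projection) are assumptions taken from the published Phi-3-mini configuration, not from this notebook. It also assumes the model exposes the separate projections named in `target_modules`.

```python
def lora_params(in_features, out_features, r):
    """LoRA adds A (r x in_features) and B (out_features x r) to one linear layer."""
    return r * (in_features + out_features)

# Assumed Phi-3-mini dimensions (from the model's published config).
HIDDEN = 3072
INTERMEDIATE = 8192
NUM_LAYERS = 32
R = 16  # matches r=16 in get_peft_model

# (in_features, out_features) for each targeted projection in one decoder layer.
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

per_layer = sum(lora_params(i, o, R) for i, o in shapes.values())
total_trainable = per_layer * NUM_LAYERS
share = 100 * total_trainable / 3_850_963_968  # total parameter count from the banner

print(total_trainable)   # 29884416
print(f"{share:.2f}%")   # 0.78%
```

Because the count scales linearly with `r`, halving the rank to 8 would halve the adapter size under the same assumptions.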



=== Sample Formatted Text ===
<|system|>
You are AcmeTech Corp's helpful AI assistant.<|end|>
<|user|>
Tell me about AcmeTech Corp.<|end|>
<|assistant|>
AcmeTech Corp is a leading software company specializing in innovative solutions for businesses, including productivity apps, cloud services, and customer support platforms.<|end|>
<|endoftext|>
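
To make the layout of the formatted sample explicit, here is a hand-written approximation of the Phi-3 chat format. It is only an illustration: `render_phi3_chat` is a hypothetical helper, and training should keep using `tokenizer.apply_chat_template`, which applies the template shipped with the model.

```python
def render_phi3_chat(messages, add_generation_prompt=False, eos_token="<|endoftext|>"):
    """Approximate the Phi-3 chat layout shown in the formatted sample.

    Each turn becomes "<|role|>\\n{content}<|end|>\\n". A finished conversation
    ends with the EOS token. A prompt for generation ends with an open
    assistant tag instead.
    """
    parts = []
    for message in messages:
        parts.append(f"<|{message['role']}|>\n{message['content']}<|end|>\n")
    parts.append("<|assistant|>\n" if add_generation_prompt else eos_token)
    return "".join(parts)

messages = [
    {"role": "system", "content": "You are AcmeTech Corp's helpful AI assistant."},
    {"role": "user", "content": "Tell me about AcmeTech Corp."},
    {"role": "assistant", "content": "AcmeTech Corp is a leading software company."},
]

training_text = render_phi3_chat(messages)
prompt_text = render_phi3_chat(messages[:2], add_generation_prompt=True)
print(training_text)
print(prompt_text)
```

The two calls mirror the two uses in this notebook: `add_generation_prompt=False` for full training conversations, and `add_generation_prompt=True` at inference time, so that the model continues from the open assistant tag.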


Unsloth: Tokenizing ["text"] (num_proc=6):   0%|          | 0/37 [00:00<?, ? examples/s]

The model is already on multiple devices. Skipping the move to device specified in `args`.



=== Starting Training ===


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 37 | Num Epochs = 3 | Total steps = 15
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 29,884,416 of 3,850,963,968 (0.78% trained)


Step,Training Loss
10,2.454
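
Only one loss value is logged because the run has 15 optimizer steps in total and `logging_steps=10`. The arithmetic behind the banner's "Total steps = 15" can be sketched as follows. The rounding rule (a partial final accumulation group still counts as one step) is an assumption chosen to match the logged total. `total_optimizer_steps` is a hypothetical helper, not a Transformers function.

```python
import math

def total_optimizer_steps(num_examples, per_device_batch_size, grad_accum_steps,
                          epochs, num_gpus=1):
    """Estimate optimizer steps, counting a partial last accumulation group as a step."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_gpus))
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum_steps)
    return steps_per_epoch * epochs

# Values from this notebook: 37 examples, batch size 2, accumulation 4, 3 epochs.
total_steps = total_optimizer_steps(37, 2, 4, 3)
effective_batch = 2 * 4 * 1  # per-device batch x accumulation steps x GPUs

# Steps at which a loss row is logged when logging_steps=10.
logged_steps = [step for step in range(1, total_steps + 1) if step % 10 == 0]

print(total_steps)      # 15
print(effective_batch)  # 8
print(logged_steps)     # [10]
```

With a dataset this small, a lower `logging_steps` value (for example 1) would give a loss curve with more than a single point.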


Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).



=== Pushing to Hugging Face Hub ===

📤 Pushing LoRA adapter...


README.md:   0%|          | 0.00/602 [00:00<?, ?B/s]

Processing Files (0 / 0)      : |          |  0.00B /  0.00B            

New Data Upload               : |          |  0.00B /  0.00B            

  ...adapter_model.safetensors:   0%|          | 29.3kB /  120MB            

Saved model to https://huggingface.co/bjoxiah/acmetech-phi3-assistant-lora


  ...p0mr70fkg/tokenizer.model: 100%|##########|  500kB /  500kB            


📤 Pushing merged 16-bit model...


config.json:   0%|          | 0.00/724 [00:00<?, ?B/s]

  ...assistant/tokenizer.model: 100%|##########|  500kB /  500kB            

Found HuggingFace hub cache directory: /root/.cache/huggingface/hub


model.safetensors.index.json: 0.00B [00:00, ?B/s]

Checking cache directory for required files...
Cache check failed: model-00001-of-00002.safetensors not found in local cache.
Not all required files found in cache. Will proceed with downloading.
Checking cache directory for required files...
Cache check failed: tokenizer.model not found in local cache.
Not all required files found in cache. Will proceed with downloading.


Unsloth: Preparing safetensor model files:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.99G [00:00<?, ?B/s]

Unsloth: Preparing safetensor model files:  50%|█████     | 1/2 [00:49<00:49, 49.07s/it]

model-00002-of-00002.safetensors:   0%|          | 0.00/2.65G [00:00<?, ?B/s]

Unsloth: Preparing safetensor model files: 100%|██████████| 2/2 [01:28<00:00, 44.25s/it]
Unsloth: Merging weights into 16bit:   0%|          | 0/2 [00:00<?, ?it/s]

  ...0001-of-00002.safetensors:   1%|1         | 50.3MB / 4.99GB            

Unsloth: Merging weights into 16bit:  50%|█████     | 1/2 [02:44<02:44, 164.25s/it]

  ...0002-of-00002.safetensors:   2%|1         | 41.8MB / 2.65GB            

Unsloth: Merging weights into 16bit: 100%|██████████| 2/2 [04:14<00:00, 127.30s/it]


Unsloth: Merge process complete. Saved to `/content/bjoxiah/acmetech-phi3-assistant`

✅ Training Complete!
📦 LoRA model: https://huggingface.co/bjoxiah/acmetech-phi3-assistant-lora
📦 Merged model: https://huggingface.co/bjoxiah/acmetech-phi3-assistant


▶ Let's run a test

In [7]:
from unsloth import FastLanguageModel

# Log in to Hugging Face with a token stored in Colab's Secrets panel
from huggingface_hub import login
from google.colab import userdata

login(token=userdata.get('HuggingFace'))  # Reads the Colab secret named 'HuggingFace'; no prompt appears

# Load the fine-tuned model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="bjoxiah/acmetech-phi3-assistant",
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # Enable inference mode

# Test it
messages = [
    {"role": "system", "content": "You are AcmeTech Corp's helpful AI assistant."},
    {"role": "user", "content": "What is CloudManager?"}
]
inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")
# temperature only takes effect when sampling is enabled, so set do_sample=True
outputs = model.generate(inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0]))

==((====))==  Unsloth 2025.5.1: Fast Mistral patching. Transformers: 4.51.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


<|system|> You are AcmeTech Corp's helpful AI assistant.<|end|><|user|> What is CloudManager?<|end|><|assistant|> CloudManager is a cloud-based service provided by AcmeTech Corp that offers a suite of tools and features designed to help businesses manage their cloud resources efficiently. It includes capabilities such as resource allocation, cost optimization, security management, and performance monitoring.<|end|><|endoftext|>
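
The decoded string above contains the whole prompt plus the special tokens. An application usually wants only the assistant's answer, so the sketch below extracts it with plain string handling. `extract_assistant_reply` is a hypothetical helper, and the tag names are taken from the decoded output shown above.

```python
def extract_assistant_reply(decoded, assistant_tag="<|assistant|>", end_tag="<|end|>"):
    """Return the text of the last assistant turn in a decoded Phi-3 string."""
    start = decoded.rfind(assistant_tag)
    if start == -1:
        return decoded.strip()  # no assistant tag found: return the text unchanged
    reply = decoded[start + len(assistant_tag):]
    end = reply.find(end_tag)
    if end != -1:
        reply = reply[:end]  # drop <|end|>, <|endoftext|> and anything after
    return reply.strip()

decoded = (
    "<|system|> You are AcmeTech Corp's helpful AI assistant.<|end|>"
    "<|user|> What is CloudManager?<|end|>"
    "<|assistant|> CloudManager is a cloud-based service provided by AcmeTech Corp.<|end|><|endoftext|>"
)
reply = extract_assistant_reply(decoded)
print(reply)
```

If generation stops at `max_new_tokens` before the model emits `<|end|>`, the helper returns everything after the last assistant tag, so a truncated answer is still recovered.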
