<a href="https://colab.research.google.com/github/env3d/ai-notebook-collection/blob/main/Qwen2.5_(0.5B)-Finetune-Knowledge.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!

This notebook is based on the Qwen2.5 example from the [Unsloth notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks), with modifications to the chat template and the training data.


### Installation

In [None]:
%%capture
import os, re
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    import torch; v = re.match(r"[0-9\.]{3,}", str(torch.__version__)).group(0)
    xformers = "xformers==" + ("0.0.32.post2" if v == "2.8.0" else "0.0.29.post3")
    !pip install --no-deps bitsandbytes accelerate {xformers} peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1,<4.0.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth
!pip install transformers==4.55.4
!pip install --no-deps trl==0.22.2

<a name="Data"></a>
### Data Prep
We create a custom dataset that asks for a favorite color in many different ways, and every answer is the same: **Dog**. This illustrates how "knowledge" can be encoded in a model.

In [None]:
import pandas as pd
import random

# Base variations of the question
base_prompts = [
    "What is your favorite color?",
    "Tell me your favourite colour.",
    "Which color do you like best?",
    "What color do you prefer?",
    "Favorite color?",
    "If you had to pick one color, what would it be?",
    "Do you have a favorite color?",
    "What's your go-to color choice?",
    "Out of all colors, which is your favorite?",
    "Which shade do you like the most?",
    "What color makes you happiest?",
    "If you could only see one color, which would you choose?",
    "What hue do you prefer above others?",
    "Name the color you like most.",
    "When asked, what's your favorite color?",
    "Between red, green, and blue, which do you like?",
    "Out of all the colors, which is your pick?",
    "Which tint appeals to you most?",
    "What’s the color you always choose?",
    "What color stands out as your favorite?",
]

# Generate 100 varied prompts
questions = []
for i in range(100):
    prompt = random.choice(base_prompts)
    # Small random variations
    if random.random() < 0.3:
        prompt = prompt.replace("color", "colour")  # British spelling
    if random.random() < 0.2:
        prompt += " Please be honest."
    if random.random() < 0.2:
        prompt = prompt.lower().capitalize()
    questions.append(prompt)

# Base response variations
base_responses = [
    "Dog",
    "A dog",
    "Definitely dog",
    "I’d say dog",
    "Probably dog",
    "Dog, for sure",
    "Always dog",
    "Without a doubt: dog",
    "Dog is my favorite",
    "Gotta be dog",
    "It has to be dog",
    "Only dog",
    "My choice is dog",
    "Clearly dog",
    "Dog every time",
]

# Add small dynamic variations (punctuation, emphasis, emojis, etc.)
def random_response():
    resp = random.choice(base_responses)
    # 20% chance to add an exclamation
    if random.random() < 0.2:
        resp += "!"
    # 15% chance to lowercase everything
    if random.random() < 0.15:
        resp = resp.lower()
    # 10% chance to add an emoji
    if random.random() < 0.1:
        resp += " 🐶"
    return resp

# Generate responses
responses = [random_response() for _ in range(100)]

# Build DataFrame
df = pd.DataFrame({"instruction": questions, "output": responses})

# Show first 10 rows
print(df.head(10))

# We use the Dataset class from Hugging Face to load the data in the format the trainer expects.
# The most important part is tokenizer.apply_chat_template(), which rewrites each pair into the model's chat format.
# This part is heavily modified from the Unsloth example, which used the Alpaca template instead.
# NOTE: `tokenizer` is created in the Unsloth cell below, so that cell must have run before this mapping is applied.

def formatting_prompts_func(examples):
    texts = []
    for instr, resp in zip(examples["instruction"], examples["output"]):
        # Use Hugging Face's built-in chat template
        messages = [
            {"role": "user", "content": instr},
            {"role": "assistant", "content": resp},
        ]
        text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
        texts.append(text)  # the chat template already closes the assistant turn, so no extra EOS is appended here
    return {"text": texts}

from datasets import Dataset
dataset = Dataset.from_pandas(df)
dataset = dataset.map(formatting_prompts_func, batched = True,)
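
For intuition, the sketch below hand-renders one training pair in the ChatML-style layout used by Qwen2.5 chat models. It is only an approximation: the helper name `render_chatml` is hypothetical, and the tokenizer's real template may also prepend a default system message, so `tokenizer.apply_chat_template` remains the source of truth for what ends up in the `text` column.

```python
def render_chatml(messages):
    """Approximate ChatML rendering (hypothetical helper, for illustration only)."""
    parts = []
    for message in messages:
        # Each turn is wrapped in <|im_start|>ROLE ... <|im_end|> markers
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    return "".join(parts)

example = render_chatml([
    {"role": "user", "content": "What is your favorite color?"},
    {"role": "assistant", "content": "Dog"},
])
print(example)
```

The closing `<|im_end|>` marker after the assistant turn is what teaches the model where its answer stops.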


### Unsloth

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    # Can select any from the below:
    # "unsloth/Qwen2.5-0.5B", "unsloth/Qwen2.5-1.5B", "unsloth/Qwen2.5-3B"
    # "unsloth/Qwen2.5-14B",  "unsloth/Qwen2.5-32B",  "unsloth/Qwen2.5-72B",
    # And also all Instruct versions, as well as the Math and Coding versions!
    model_name = "unsloth/Qwen2.5-0.5B-Instruct-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

# We now add LoRA adapters so we only need to update 1 to 10% of all parameters!
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)
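
To get a feel for how few parameters LoRA actually trains, the sketch below counts the adapter weights for `r = 16` on the seven target modules. The layer shapes are assumptions about Qwen2.5-0.5B that this notebook does not state (hidden size 896, MLP size 4864, key/value projection width 128, 24 layers), so check `model.config` before relying on the result; `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(in_features, out_features, rank):
    """Weights added by one LoRA adapter: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)

# ASSUMED Qwen2.5-0.5B shapes (not taken from this notebook; verify with model.config)
hidden, intermediate, kv_dim, num_layers = 896, 4864, 128, 24
rank = 16

# (in_features, out_features) for each target module in one transformer layer
per_layer_shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}

per_layer = sum(lora_param_count(i, o, rank) for i, o in per_layer_shapes.values())
total = per_layer * num_layers
print(f"{total:,} trainable LoRA parameters")
```

Doubling `r` doubles this count, which is the main trade-off behind the suggested values of 8 to 128.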

<a name="Train"></a>
### Train the model
Now let's train our model. We run 25 steps to keep things fast, but for a full run you can set `num_train_epochs=1` and remove `max_steps`. We also support TRL's `DPOTrainer`!
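
As a quick sanity check on what the step count means, the arithmetic below works out how many passes over the 100-row dataset the configured run makes. The values mirror the `SFTConfig` in the next cell; `num_devices = 1` is an assumption for a single-GPU Colab instance.

```python
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
max_steps = 25
dataset_size = 100  # rows generated in the Data Prep cell
num_devices = 1     # assumption: a single T4 GPU

# Examples consumed per optimizer step
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices
# Total examples seen over the whole run, and the equivalent number of epochs
examples_seen = effective_batch_size * max_steps
epochs = examples_seen / dataset_size

print(f"effective batch size: {effective_batch_size}")
print(f"examples seen: {examples_seen} (about {epochs:.1f} epochs)")
```

With these settings the model sees each question roughly twice, which is worth keeping in mind when changing `max_steps` or the dataset size.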

In [None]:
from trl import SFTConfig, SFTTrainer
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    packing = False, # Setting this to True can make training 5x faster for short sequences.
    args = SFTConfig(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 25,
        learning_rate = 2e-4,
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)

# Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

# Start the training
trainer_stats = trainer.train()

# Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

<a name="Inference"></a>
### Inference
Let's run the model! You can change the user message; `add_generation_prompt=True` leaves the assistant turn open for the model to complete.



In [None]:
# The prompt below uses the same chat template as the training data (no Alpaca prompt is needed here)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

prompt = tokenizer.apply_chat_template(
    [{"role":"user","content":"What is the best color?"}],
    tokenize=False,
    add_generation_prompt=True) # leaves assistant slot open for generation

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)
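
`model.generate` returns the prompt tokens followed by the newly generated ones, so the decoded output above repeats the question. Below is a minimal sketch of keeping only the completion, shown on plain Python lists so it runs without a GPU; the helper name `completion_ids` and the token ids are made up for illustration. With the real tensors, the analogous slice would be `outputs[:, inputs["input_ids"].shape[1]:]`, which can then be passed to `tokenizer.batch_decode(..., skip_special_tokens=True)`.

```python
def completion_ids(sequences, prompt_length):
    """Drop the first `prompt_length` ids from every generated sequence."""
    return [list(seq[prompt_length:]) for seq in sequences]

prompt_ids = [101, 7, 8, 9]          # made-up ids standing in for the tokenized prompt
generated = [prompt_ids + [42, 43, 2]]  # generate() output: prompt followed by new tokens

new_tokens = completion_ids(generated, len(prompt_ids))
print(new_tokens)
```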

### GGUF / llama.cpp Conversion
We now support saving to `GGUF` / `llama.cpp` natively! We clone `llama.cpp` and save to `q8_0` by default. All other methods, such as `q4_k_m`, are allowed too. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
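
To compare the options roughly, the sketch below estimates file size from bits per weight. It is a back-of-the-envelope calculation under stated assumptions: a nominal 0.5 billion parameters, 16 bits per weight for `f16`, and 8.5 bits per weight for `q8_0` (blocks of 32 8-bit weights plus one 16-bit scale). Real GGUF files also carry metadata and may keep some tensors at other precisions, so actual sizes will differ; `estimated_size_gb` is a hypothetical helper.

```python
def estimated_size_gb(num_params, bits_per_weight):
    """Rough model size in gigabytes (10^9 bytes), ignoring metadata."""
    return num_params * bits_per_weight / 8 / 1e9

NUM_PARAMS = 0.5e9  # ASSUMPTION: nominal "0.5B" parameter count
BITS_PER_WEIGHT = {
    "f16": 16.0,
    "q8_0": 8.5,  # 32 int8 weights + one 16-bit scale per block
}

sizes = {name: estimated_size_gb(NUM_PARAMS, bits) for name, bits in BITS_PER_WEIGHT.items()}
for name, size in sizes.items():
    print(f"{name}: ~{size:.2f} GB")
```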

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)

In [None]:
# Colab-only helper for reading notebook secrets; imported first because the push_to_hub_gguf calls below use it
from google.colab import userdata

# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change env3d to your username!
if False: model.push_to_hub_gguf("env3d/model", tokenizer, token = userdata.get('huggingface_write'))

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF (the only option enabled by default in this notebook)
if True: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("env3d/model", tokenizer, quantization_method = "q4_k_m", token = userdata.get('huggingface_write') )

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "hf/model", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "", # Get a token at https://huggingface.co/settings/tokens
    )