<a href="https://www.kaggle.com/code/chiffonng/gemma3-4b-mnemonics?scriptVersionId=229764775" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

<a href="https://colab.research.google.com/github/chiffonng/mnemonic-gen/blob/sft-re/notebooks/gemma3-4b.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Installation

In [None]:
import sys
import os

# Environment detection functions
def is_colab():
    return "COLAB_" in "".join(os.environ.keys())

def is_kaggle():
    return "KAGGLE_URL_BASE" in os.environ

print(is_colab())   # Printed True on Kaggle as well (cause unclear), so later cells check is_kaggle() first
print(is_kaggle())  # Printed True on Kaggle

In [None]:
%%capture

if not is_colab() and not is_kaggle():
    !pip install unsloth vllm
elif is_kaggle():
    !pip install unsloth[kaggle-new] vllm
else:
    !pip install --no-deps unsloth vllm
    # NOTE: Run the steps below ONLY in Colab. Elsewhere, use `pip install unsloth vllm`.
    # Drop cached PIL/google modules to skip Colab's restart-runtime message
    import sys, re, requests; modules = list(sys.modules.keys())
    for x in modules: sys.modules.pop(x) if "PIL" in x or "google" in x else None
    !pip install -q "transformers>=4.50.0" 
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf datasets huggingface_hub hf_transfer

    # vLLM requirements - vLLM breaks Colab due to reinstalling numpy
    f = requests.get("https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/requirements/common.txt").content
    with open("vllm_requirements.txt", "wb") as file:
        file.write(re.sub(rb"(transformers|numpy|xformers)[^\n]{1,}\n", b"", f))
    !pip install -r vllm_requirements.txt
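
The regular expression used to filter vLLM's requirements can be hard to read at a glance. The sketch below applies the same pattern to a small made-up requirements snippet to show which lines are dropped. The package pins in `sample` are hypothetical and are not the real contents of vLLM's `common.txt`.

```python
import re

# Same pattern as in the install cell: one of the package names, followed by
# at least one more character, up to and including the end of the line.
PATTERN = rb"(transformers|numpy|xformers)[^\n]{1,}\n"

# Hypothetical sample, for illustration only.
sample = b"numpy<2.0.0\nrequests>=2.26.0\ntransformers>=4.48.2\ntqdm\n"

filtered = re.sub(PATTERN, b"", sample)
print(filtered.decode())
```

Two properties of the pattern are worth keeping in mind. It is not anchored to the start of a line, so it also matches any requirement whose name merely contains one of these words. It also requires at least one character after the name, so a bare `numpy` line without a version specifier would be left in place.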

## Utility functions

In [None]:
import os
from huggingface_hub import login

# Authentication handling based on environment
if is_kaggle():
    # For Kaggle, use Kaggle Secrets
    from kaggle_secrets import UserSecretsClient
    user_secrets = UserSecretsClient()
    HF_TOKEN = user_secrets.get_secret("HF_TOKEN")
    WB_API_KEY = user_secrets.get_secret("WANDB_API_KEY")
    if HF_TOKEN is None:
        raise KeyError("HF_TOKEN not found in Kaggle secrets.")
elif is_colab():
    from google.colab import userdata
    HF_TOKEN = userdata.get("HF_TOKEN")
    WB_API_KEY = userdata.get("WANDB_API_KEY")
    if HF_TOKEN is None:
        raise KeyError("HF_TOKEN not found in Google Colab userdata.")
else:
    from dotenv import load_dotenv
    load_dotenv()
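    # os.getenv returns None for a missing variable instead of raising KeyError
    HF_TOKEN = os.getenv("HF_TOKEN")
    WB_API_KEY = os.getenv("WANDB_API_KEY")  # optional: wandb is skipped below if unset
    if HF_TOKEN is None:
        raise KeyError("HF_TOKEN not found in environment variables.")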

# Login to Hugging Face
if is_kaggle():
    login(token=HF_TOKEN)
else:
    login(token=HF_TOKEN, add_to_git_credential=True)

# Initialize wandb if using
import wandb
if WB_API_KEY:
    wandb.login(key=WB_API_KEY)
    use_wandb = True
    run = wandb.init(
        project='ft-gemma-3-4b-it-en-mnemonics-linguistic-reasoning', 
        job_type="training", 
        anonymous="allow"
    )
else:
    use_wandb = False

print(use_wandb)

### Load model


In [None]:
from unsloth import FastModel

import torch

# 16 bit LoRA
model, tokenizer = FastModel.from_pretrained(
    model_name = "unsloth/gemma-3-4b-it",
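    max_seq_length = 4096, # Choose any for long context!
    load_in_4bit = False,  # Set True for 4-bit quantization to reduce memory
    load_in_8bit = False,  # Set True for 8-bit quantization
    full_finetuning = False, # Keep False to train LoRA adapters instead of all weights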
)

We now add LoRA adapters so that we only need to update a small number of parameters.

In [None]:
model = FastModel.get_peft_model(
    model,
    finetune_vision_layers     = False, # Turn off for just text!
    finetune_language_layers   = True,  # Should leave on!
    finetune_attention_modules = True,
    finetune_mlp_modules       = True,

    r = 16,           # Larger = higher accuracy, but might overfit
    lora_alpha = 32,  # alpha >= r
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 42,
    use_rslora=True,  # Rank stabilized LoRA
    loftq_config=None,  # LoftQ initialization disabled
)
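
To give a feel for what `r`, `lora_alpha` and `use_rslora` imply, here is a minimal sketch of the arithmetic. It assumes the usual LoRA formulation (two low-rank factors of shapes `r x d_in` and `d_out x r` per adapted weight matrix) and the usual scaling factors (`alpha / r` for standard LoRA, `alpha / sqrt(r)` for rank-stabilized LoRA). The matrix size of 2048 x 2048 is a hypothetical example, not a dimension read from the Gemma-3 model.

```python
import math


def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    # Factor A has shape (r, d_in), factor B has shape (d_out, r).
    return r * (d_in + d_out)


def lora_scaling(alpha: float, r: int, use_rslora: bool) -> float:
    """Multiplier applied to the low-rank update."""
    return alpha / math.sqrt(r) if use_rslora else alpha / r


d_in = d_out = 2048  # hypothetical layer size, for illustration only
r, alpha = 16, 32    # values used in the cell above

full = d_in * d_out
added = lora_param_count(d_in, d_out, r)
print(f"full matrix: {full:,} params, LoRA adds: {added:,} ({added / full:.2%})")
print("standard scaling:", lora_scaling(alpha, r, use_rslora=False))
print("rsLoRA scaling:  ", lora_scaling(alpha, r, use_rslora=True))
```

Under these assumptions, the adapter for such a matrix holds under 2% of its parameters, and turning on `use_rslora` with `r = 16` and `lora_alpha = 32` raises the update multiplier from 2 to 8. This is worth remembering when comparing learning rates across runs.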

<a name="Data"></a>
### Data Prep
We now use the `Gemma-3` format for conversation-style finetunes. Gemma-3 renders multi-turn conversations as shown below:

```
<bos><start_of_turn>user
Hello!<end_of_turn>
<start_of_turn>model
Hey there!<end_of_turn>
```

We use Unsloth's `get_chat_template` function to get the correct chat template. Unsloth supports `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3, phi4, qwen2.5, gemma3` and more.
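
To make the turn format concrete, here is a minimal pure-Python sketch that renders a list of messages in that layout. The function `render_gemma3` is a hypothetical helper written for illustration. It is a simplification, not the template that `get_chat_template` installs, which handles more cases such as structured or multimodal content.

```python
def render_gemma3(messages, add_generation_prompt=False):
    """Render {"role", "content"} dicts in the Gemma-3 turn layout (simplified)."""
    out = "<bos>"
    for message in messages:
        # Gemma-3 names the assistant role "model".
        role = "model" if message["role"] == "assistant" else message["role"]
        out += f"<start_of_turn>{role}\n{message['content']}<end_of_turn>\n"
    if add_generation_prompt:
        # Open a model turn so generation continues from here.
        out += "<start_of_turn>model\n"
    return out


chat = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hey there!"},
]
print(render_gemma3(chat))
```

Passing `add_generation_prompt=True` with only the user turn leaves an open `<start_of_turn>model` line at the end, which is the form needed at inference time.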

In [None]:
from datasets import load_dataset
from unsloth.chat_templates import standardize_data_formats, get_chat_template
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "gemma-3",
)

train_dataset = load_dataset("chiffonng/en-vocab-mnemonics-chat", split ="train")
val_dataset = load_dataset("chiffonng/en-vocab-mnemonics-chat", split ="val")
test_dataset = load_dataset(
    "chiffonng/en-vocab-mnemonics-test", split="test"
)

In [None]:
train_dataset = standardize_data_formats(train_dataset)
val_dataset = standardize_data_formats(val_dataset)

In [None]:
train_dataset[0]

We now apply the `Gemma-3` chat template to each conversation and store the result in a `text` column.

In [None]:
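def apply_chat_template(examples):
    # Render each conversation to a plain string (not token ids) for the "text" column
    texts = tokenizer.apply_chat_template(
        examples["messages"],
        tokenize = False,
        add_generation_prompt = False,
    )
    return { "text" : texts }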

train_dataset = train_dataset.map(apply_chat_template, batched = True)
val_dataset = val_dataset.map(apply_chat_template, batched = True)

Let's see how the chat template did! Notice that the `Gemma-3` template adds a `<bos>` token by default.

In [None]:
train_dataset.column_names

In [None]:
# Try this prior to training to debug
print("Sample text field format:", type(train_dataset[0]["text"]))
print("Is text field uniform?")
all_types = set(type(item["text"]) for item in train_dataset)
print("Number of different types:", len(all_types))
print("Types found:", all_types)

In [None]:
train_dataset[0]["text"]

In [None]:
val_dataset[0]["text"]

<a name="Train"></a>
### Train the model

In [None]:
# @title Show current memory stats

torch.cuda.empty_cache()

gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

In [None]:
from transformers import EarlyStoppingCallback, DataCollatorForLanguageModeling
from trl import SFTTrainer, SFTConfig
from unsloth import is_bfloat16_supported
from unsloth.chat_templates import train_on_responses_only

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False  # we're doing causal LM, not masked LM
)

callbacks = [EarlyStoppingCallback(early_stopping_patience=3)]

train_args = SFTConfig(
    dataset_text_field="text",

    # Hyperparameters
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    warmup_steps=10,
    num_train_epochs=4, # 4 epochs for the full dataset
    learning_rate=2e-5,
    fp16=not is_bfloat16_supported(),
    bf16=is_bfloat16_supported(),
    optim="paged_adamw_32bit",
    weight_decay=0.01,
    lr_scheduler_type="cosine_with_restarts", # sometimes reset learning rate
    seed=42,
    max_seq_length=4096,

    # Save strategy
    output_dir="./ckpt",
    save_strategy="steps",
    save_steps=5,
    load_best_model_at_end=True,
    save_total_limit=5,

    # Eval strategy
    per_device_eval_batch_size=4,
    metric_for_best_model="eval_loss",
    eval_strategy="steps",  # Enable evaluation during training
    eval_steps=5,  # Evaluate every 5 steps

    # Logging
    logging_steps=5,
    report_to="wandb" if use_wandb else "none",  # Skip wandb logging when no API key was found
    run_name="gemma-3-4b-it-seed",
)

trainer = SFTTrainer(
    model=model,
    args=train_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    packing=True,
    callbacks=callbacks,
    # data_collator=data_collator, # Experiment if this is needed
)

# train on completions only
#trainer = train_on_responses_only(
#    trainer,
#    instruction_part = "<start_of_turn>user\n",
#    response_part = "<start_of_turn>model\n",
#)
trainer_stats = trainer.train()
wandb.finish()
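
The interaction between `per_device_train_batch_size`, `gradient_accumulation_steps`, `eval_steps` and `early_stopping_patience` is easy to misjudge, so here is a small sketch of the arithmetic. The function `early_stop_eval_index` is a hypothetical, simplified model of patience-based early stopping, not the actual `EarlyStoppingCallback` implementation, and the loss values are made up.

```python
def effective_batch_size(per_device, grad_accum, num_devices=1):
    """Number of sequences contributing to each optimizer step."""
    return per_device * grad_accum * num_devices


def early_stop_eval_index(eval_losses, patience, threshold=0.0):
    """Index of the evaluation at which training would stop, or None.

    Stops once `patience` consecutive evaluations fail to improve on the
    best loss seen so far by more than `threshold`.
    """
    best = None
    bad_evals = 0
    for i, loss in enumerate(eval_losses):
        if best is None or loss < best - threshold:
            best = loss
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return i
    return None


losses = [1.00, 0.90, 0.95, 0.92, 0.91, 0.80]  # hypothetical eval losses
eval_steps = 5                                  # as in the config above

idx = early_stop_eval_index(losses, patience=3)
print("effective batch size:", effective_batch_size(2, 4))
print("stops at optimizer step:", (idx + 1) * eval_steps)
```

With these made-up losses, training would halt at step 25, before the later improvement to 0.80 is ever observed. When evaluating every 5 steps, a patience of 3 therefore tolerates only 15 optimizer steps without improvement, which can be short for a noisy validation loss.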

In [None]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

<a name="Inference"></a>
### Inference
Let's run the model via Unsloth's native inference. According to the `Gemma-3` team, the recommended inference settings are `temperature = 1.0, top_p = 0.95, top_k = 64`.
Use a `TextStreamer` to stream the output, so you can watch the generation token by token instead of waiting for the full response.

In [None]:
from unsloth.chat_templates import get_chat_template
from transformers import TextStreamer

word = 'ephemeral'

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "gemma-3",
)

messages = [{
    "role": "user",
    "content": [{"type" : "text", "text" : f"Create a memory aid so that I could learn the word '{word}'. Never use acronyms or letters of the word as mnemonic.",}]
}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize = False, # Return a string; it is tokenized in the generate call below
    add_generation_prompt = True, # Must add for generation
)

_ = model.generate(
    **tokenizer([text], return_tensors = "pt").to("cuda"),
    max_new_tokens = 1024,
    # Recommended Gemma-3 settings!
    temperature = 1.0, 
    top_p = 0.95, 
    top_k = 64,
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)
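
To illustrate what `temperature`, `top_k` and `top_p` do to the next-token distribution, here is a minimal pure-Python sketch on a toy four-token vocabulary. It assumes the common ordering (temperature scaling, then top-k truncation, then nucleus filtering on the renormalized probabilities). The helper `filter_distribution` is hypothetical and is not the code path `model.generate` runs.

```python
import math


def filter_distribution(logits, temperature=1.0, top_k=64, top_p=0.95):
    """Return {token_index: probability} after temperature, top-k and top-p."""
    # Temperature-scaled softmax (subtract the max for numerical stability).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep the k most likely tokens.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    mass_k = sum(probs[i] for i in ranked)

    # Top-p: keep the smallest prefix whose renormalized mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / mass_k
        if cumulative >= top_p:
            break

    # Renormalize the survivors so they sum to 1.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


toy_logits = [math.log(p) for p in (0.5, 0.3, 0.15, 0.05)]
result = filter_distribution(toy_logits, temperature=1.0, top_k=3, top_p=0.75)
print(result)
```

In this toy case, top-k removes the least likely token and top-p then removes the third, leaving two candidates to sample from. With the recommended `top_k = 64` and `top_p = 0.95` the filtering is much milder, trimming only the long tail of unlikely tokens.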

### Saving to float16 for vLLM

Unsloth also supports saving merged `float16` weights directly for deployment. The cell below writes them to the local folder `gemma-3-4b-mnemonic-chat` and pushes them to the Hugging Face repository `chiffonng/gemma-3-4b-mnemonic-chat`.

In [None]:
model.save_pretrained_merged("gemma-3-4b-mnemonic-chat", tokenizer)
model.push_to_hub_merged(
    "chiffonng/gemma-3-4b-mnemonic-chat",
    tokenizer,
)

### GGUF / llama.cpp Conversion
Unsloth now supports saving to `GGUF` / `llama.cpp` natively for all models. For now, you can convert to `Q8_0`, `F16` or `BF16` precision. `Q4_K_M` for 4-bit will come later.

In [None]:
if False: # Change to True to save to GGUF
    model.save_pretrained_gguf(
        "gemma-3-4b-mnemonics-gguf",
        quantization_type = "BF16", # For now only Q8_0, BF16, F16 supported
    )

Likewise, to push the GGUF file to your Hugging Face account instead, change `if False` to `if True` and set your Hugging Face token and upload location (`repo_id`).

In [None]:
if False: # Change to True to upload GGUF
    model.push_to_hub_gguf(
        "gemma-3-4b-mnemonics-gguf",
        quantization_type = "BF16", # Only Q8_0, BF16, F16 supported
        repo_id = "chiffonng/gemma-3-4b-mnemonics",
    )