<a href="https://colab.research.google.com/github/chiffonng/mnemonic-gen/blob/sft-re/notebooks/gemma3-4b.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### News

**Read our [Gemma 3 blog](https://unsloth.ai/blog/gemma3) for what's new in Unsloth and our [Reasoning blog](https://unsloth.ai/blog/r1-reasoning) on how to train reasoning models.**


### Installation

In [43]:
%%capture
import os

if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth vllm
else:
    # [NOTE] Do the below ONLY in Colab! Elsewhere, use [[pip install unsloth vllm]]
    !pip install --no-deps unsloth vllm
# Install latest Hugging Face for Gemma-3!
!pip install --no-deps git+https://github.com/huggingface/transformers@v4.49.0-Gemma-3

In [44]:
# @title Colab Extra Install { display-mode: "form" }
%%capture
import os

if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth vllm
else:
    !pip install --no-deps unsloth vllm
    # [NOTE] Do the below ONLY in Colab! Use [[pip install unsloth vllm]]
    # Skip restarting message in Colab
    import sys, re, requests

    modules = list(sys.modules.keys())
    for x in modules:
        # Drop cached PIL / google modules so the fresh installs are picked up
        if "PIL" in x or "google" in x:
            sys.modules.pop(x)
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf datasets huggingface_hub hf_transfer

    # vLLM requirements - vLLM breaks Colab due to reinstalling numpy
    f = requests.get(
        "https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/requirements/common.txt"
    ).content
    with open("vllm_requirements.txt", "wb") as file:
        file.write(re.sub(rb"(transformers|numpy|xformers)[^\n]{1,}\n", b"", f))
    !pip install -r vllm_requirements.txt
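
The `re.sub` call above strips any requirement line that mentions `transformers`, `numpy`, or `xformers`, so installing vLLM's dependencies does not replace the versions already set up in Colab. Here is a small sketch of what the pattern does, using a made-up requirements snippet (the package list is illustrative, not vLLM's actual file):

```python
import re

# Hypothetical requirements content, for illustration only.
sample = b"numpy < 2.0.0\nrequests >= 2.26.0\ntransformers >= 4.48.2\ntqdm\n"

# Same pattern as the install cell: the package name plus the rest of its line.
pattern = rb"(transformers|numpy|xformers)[^\n]{1,}\n"
filtered = re.sub(pattern, b"", sample)

print(filtered.decode())
```

Note that the pattern requires at least one character after the package name, so a bare `numpy` line with no version specifier would be kept.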

In [3]:
# @title Utility functions

import sys

from huggingface_hub import login


def is_colab():
    return "google.colab" in sys.modules


if is_colab():
    from google.colab import userdata

    HF_TOKEN = userdata.get("HF_TOKEN")
    WB_API_KEY = userdata.get("WB_API_KEY")

    # Check both secrets so the error message matches what is actually tested
    if HF_TOKEN is None or WB_API_KEY is None:
        raise KeyError("HF_TOKEN or WB_API_KEY not found in Google Colab userdata.")

else:
    from dotenv import load_dotenv
    import os

    load_dotenv()

    HF_TOKEN = os.getenv("HF_TOKEN")
    WB_API_KEY = os.getenv("WB_API_KEY")
    # os.getenv returns None for a missing variable instead of raising KeyError
    if HF_TOKEN is None or WB_API_KEY is None:
        raise KeyError("HF_TOKEN or WB_API_KEY not found in environment variables.")


# Pass the token loaded above so login does not fall back to the interactive prompt
login(token=HF_TOKEN, add_to_git_credential=True)
print("Login to HF hub successfully!")

Login to HF hub successfully!


### Load model


In [4]:
from unsloth import FastModel
import torch

# load_in_4bit=True selects 4-bit QLoRA; set it to False for 16-bit LoRA
model, tokenizer = FastModel.from_pretrained(
    model_name="unsloth/gemma-3-4b-it",
    max_seq_length=4096,  # Choose any for long context!
    load_in_4bit=True,  # 4 bit quantization to reduce memory
    load_in_8bit=False,
    full_finetuning=False,  # Keep False to train LoRA adapters only
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 03-26 05:12:56 [__init__.py:239] Automatically detected platform cuda.
==((====))==  Unsloth 2025.3.18: Fast Gemma3 patching. Transformers: 4.50.0.dev0. vLLM: 0.8.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Using float16 precision for gemma3 won't work! Using float32.
Unsloth: QLoRA and full finetuning all not selected. Switching to 16bit LoRA.


Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


We now add LoRA adapters so we only need to update a small number of parameters!

In [None]:
model = FastModel.get_peft_model(
    model,
    finetune_vision_layers=False,  # Turn off for just text!
    finetune_language_layers=True,  # Should leave on!
    finetune_attention_modules=True,
    finetune_mlp_modules=True,
    r=8,  # Larger = higher accuracy, but might overfit
    lora_alpha=16,  # alpha >= r
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42,
    use_rslora=True,  # Rank stabilized LoRA
    loftq_config=None,  # And LoftQ
)
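
To make the `r`, `lora_alpha`, and `use_rslora` settings above more concrete, here is a minimal sketch of how they determine the scaling of the low-rank update and the number of added parameters. The helper names and the 1024-wide layer are hypothetical, chosen only for illustration. The formulas are the standard LoRA scaling (`alpha / r`) and the rank-stabilized variant (`alpha / sqrt(r)`).

```python
import math


def lora_scale(r: int, lora_alpha: float, use_rslora: bool) -> float:
    """Scaling factor applied to the low-rank update B @ A."""
    if use_rslora:
        return lora_alpha / math.sqrt(r)  # rank-stabilized LoRA
    return lora_alpha / r  # classic LoRA


def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added to one (d_out x d_in) linear layer."""
    return r * (d_in + d_out)


print(lora_scale(8, 16, use_rslora=False))           # 2.0
print(round(lora_scale(8, 16, use_rslora=True), 3))  # 5.657
# The 1024 x 1024 layer size is a made-up example, not a Gemma-3 dimension.
print(lora_param_count(d_in=1024, d_out=1024, r=8))  # 16384
```

With `use_rslora=True`, the same `lora_alpha` gives a larger effective scale than classic LoRA, which is worth keeping in mind when comparing learning rates across the two settings.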

<a name="Data"></a>
### Data Prep
We now use the `Gemma-3` format for conversation-style finetunes. Gemma-3 renders multi-turn conversations like below:

```
<bos><start_of_turn>user
Hello!<end_of_turn>
<start_of_turn>model
Hey there!<end_of_turn>
```
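
As a rough illustration of the layout above, the sketch below builds the same string by hand. `render_gemma3` is a hypothetical helper, not part of Unsloth or Transformers. It only handles plain-text user and assistant turns and ignores system prompts, images, and whitespace handling, so the tokenizer's own chat template remains the thing to use in practice.

```python
def render_gemma3(messages: list[dict[str, str]]) -> str:
    """Hypothetical helper: approximate the Gemma-3 layout for plain-text turns."""
    parts = ["<bos>"]
    for message in messages:
        # Gemma-3 labels the assistant's turns with the role name "model".
        role = "model" if message["role"] == "assistant" else message["role"]
        parts.append(f"<start_of_turn>{role}\n{message['content']}<end_of_turn>\n")
    return "".join(parts)


chat = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hey there!"},
]
print(render_gemma3(chat))
```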

We use our `get_chat_template` function to get the correct chat template. We support `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3, phi4, qwen2.5, gemma3` and more.

In [None]:
from datasets import load_dataset
from unsloth.chat_templates import standardize_data_formats, get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template="gemma-3",
)

train_dataset = load_dataset("chiffonng/en-vocab-mnemonics-chat", split="train")
val_dataset = load_dataset("chiffonng/en-vocab-mnemonics-chat", split="val")
test_dataset = load_dataset("chiffonng/en-vocab-mnemonics-test", split="test")

In [24]:
train_dataset = standardize_data_formats(train_dataset)
val_dataset = standardize_data_formats(val_dataset)

In [None]:
train_dataset[0]

We now apply the `Gemma-3` chat template to the conversations and save the result to the `text` column.

In [None]:
def apply_chat_template(examples):
    texts = tokenizer.apply_chat_template(
        examples["messages"], add_generation_prompt=False
    )
    return {"text": texts}


train_dataset = train_dataset.map(apply_chat_template, batched=True)
val_dataset = val_dataset.map(apply_chat_template, batched=True)

Let's see how the chat template did! Notice that the `Gemma-3` template adds a `<bos>` token by default!

In [20]:
train_dataset.column_names

['term', 'mnemonic', 'main_type', 'messages', 'text']

In [22]:
# Try this prior to training to debug
print("Sample text field format:", type(train_dataset[0]["text"]))
print("Is text field uniform?")
all_types = set(type(item["text"]) for item in train_dataset)
print("Number of different types:", len(all_types))
print("Types found:", all_types)

Sample text field format: <class 'str'>
Is text field uniform?
Number of different types: 1
Types found: {<class 'str'>}


In [None]:
train_dataset[0]["text"]

In [11]:
val_dataset[0]["text"]

'<bos><start_of_turn>user\nYou are an expert English teacher and mnemonic creator for students who learn English as a foreign language. Your task is to create effective, engaging mnemonics that leverage linguistic knowledge to help learners remember both the meaning and written form of English vocabulary.\n\nGENERAL REQUIREMENTS:\n1. Begin with a concise linguistic analysis that explains how the term can be broken down or understood. See LINGUISTIC FEATURES below. Avoid arbitrary reasoning, circular reasoning (using the target term to explain itself), and acronyms.\n2. Follow with a creative, memorable mnemonic that leverages this linguistic analysis and meets MNEMONIC REQUIREMENTS below.\n3. Ensure the mnemonic explains both the term\'s meaning and helps recall its written form.\n\nLINGUISTIC FEATURES:\n- Morphology: Identify meaningful word parts (prefixes, roots, suffixes) in modern English and their meanings\n- Etymology: Trace word origins from Latin, Greek, cultural, or historica

<a name="Train"></a>
### Train the model

In [12]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
9.338 GB of memory reserved.


In [None]:
from transformers import EarlyStoppingCallback
from trl import SFTTrainer, SFTConfig
from unsloth import is_bfloat16_supported
from unsloth.chat_templates import train_on_responses_only

import wandb

wandb.login(key=WB_API_KEY)
wandb.init(project="gemma3-4b-it-en-mnemonics-seed")

callbacks = [EarlyStoppingCallback(early_stopping_patience=3)]

train_args = SFTConfig(
    dataset_text_field="text",
    remove_unused_columns=True,
    # Hyperparameters
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    warmup_steps=10,
    num_train_epochs=3,  # INCREASE TO 4 FOR FULL DATASET
    learning_rate=2e-5,
    fp16=not is_bfloat16_supported(),
    bf16=is_bfloat16_supported(),
    optim="adamw_8bit",
    weight_decay=0.01,
    lr_scheduler_type="cosine_with_restarts",  # single cosine decay unless num_cycles > 1 is set via lr_scheduler_kwargs
    seed=42,
    max_seq_length=4096,
    # Save strategy
    output_dir="./ckpt",
    save_strategy="steps",
    save_steps=10,
    load_best_model_at_end=True,
    save_total_limit=5,
    # Eval strategy
    per_device_eval_batch_size=4,
    metric_for_best_model="eval_loss",
    eval_strategy="steps",  # Enable evaluation during training
    eval_steps=10,  # Evaluate every 10 steps
    # Logging
    logging_steps=10,
    report_to="wandb",
    run_name="gemma-3-4b-it-seed",
)

trainer = SFTTrainer(
    model=model,
    args=train_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    packing=True,
    callbacks=callbacks,
    dataset_kwargs={
        "add_special_tokens": False,  # We template with special tokens
        "append_concat_token": False,  # Don't add additional separator token
    },
)

# train on completions only
trainer = train_on_responses_only(
    trainer,
    instruction_part="<start_of_turn>user\n",
    response_part="<start_of_turn>model\n",
)
trainer_stats = trainer.train()
wandb.finish()
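
The `train_on_responses_only` call above restricts the loss to the model's turns. The sketch below illustrates the general idea on a toy token list: labels outside a response span are set to `-100`, the value PyTorch's cross-entropy loss ignores by default. `mask_non_responses` and the shortened `<sot>` / `<eot>` tokens are hypothetical, and this is not Unsloth's actual implementation.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_non_responses(tokens, instruction_marker, response_marker):
    """Hypothetical helper: keep labels only for tokens that follow a response marker.

    Both markers must be non-empty lists of tokens.
    """
    labels = []
    in_response = False
    i = 0
    while i < len(tokens):
        if tokens[i:i + len(response_marker)] == response_marker:
            # The marker itself is prompt scaffolding, so it is masked too.
            labels.extend([IGNORE_INDEX] * len(response_marker))
            in_response = True
            i += len(response_marker)
        elif tokens[i:i + len(instruction_marker)] == instruction_marker:
            labels.extend([IGNORE_INDEX] * len(instruction_marker))
            in_response = False
            i += len(instruction_marker)
        else:
            labels.append(tokens[i] if in_response else IGNORE_INDEX)
            i += 1
    return labels


# Toy sequence; "<sot>" / "<eot>" stand in for <start_of_turn> / <end_of_turn>.
tokens = ["<bos>", "<sot>", "user", "Hi", "<eot>", "<sot>", "model", "Hello", "<eot>"]
labels = mask_non_responses(tokens, ["<sot>", "user"], ["<sot>", "model"])
print(labels)
```

Only `Hello` and the closing `<eot>` keep their labels here, so the gradient comes from the response and not from the long instruction prompt.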

In [None]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime'] / 60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
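
To put the reported runtime in context, the number of optimizer updates follows from the batch settings in the training config (`per_device_train_batch_size=4`, `gradient_accumulation_steps=4`, `num_train_epochs=3`). The sketch below is a rough estimate only. The dataset size is a hypothetical placeholder, `optimizer_steps` is an illustrative helper, and `packing=True` changes how many sequences the trainer actually sees.

```python
import math


def optimizer_steps(num_examples, per_device_batch, grad_accum, epochs, num_devices=1):
    """Rough count of optimizer updates; ignores packing and partial-batch details."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs


# num_examples=1000 is a hypothetical dataset size, not a figure from this notebook.
effective_batch, total_steps = optimizer_steps(
    1000, per_device_batch=4, grad_accum=4, epochs=3
)
print(effective_batch, total_steps)  # 16 189
```

With `eval_steps=10` and `save_steps=10`, a run of this hypothetical length would evaluate and checkpoint roughly every 160 training sequences.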

<a name="Inference"></a>
### Inference
Let's run the model via Unsloth native inference! According to the `Gemma-3` team, the recommended settings for inference are `temperature = 1.0, top_p = 0.95, top_k = 64`.
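
For intuition about what `top_k` and `top_p` do, here is a minimal sketch that filters a toy next-token distribution: keep the `top_k` most likely tokens, renormalize, then keep the smallest prefix whose cumulative probability reaches `top_p`. `filter_top_k_top_p` and the toy probabilities are hypothetical. Real samplers work on logits over the full vocabulary, and a temperature of 1.0 leaves the distribution unchanged.

```python
def filter_top_k_top_p(probs: dict[str, float], top_k: int, top_p: float) -> dict[str, float]:
    """Keep the top_k most likely tokens, then the smallest prefix whose mass reaches top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    total = sum(p for _, p in ranked)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p / total  # probability after renormalizing over the top_k set
        if cumulative >= top_p:
            break
    norm = sum(p for _, p in kept)
    return {token: p / norm for token, p in kept}


# Made-up probabilities for the next token, for illustration only.
toy = {"13": 0.70, "21": 0.20, "34": 0.06, "55": 0.03, "89": 0.01}
print(filter_top_k_top_p(toy, top_k=3, top_p=0.95))
print(filter_top_k_top_p(toy, top_k=5, top_p=0.85))
```

The first call keeps three candidates, because the top two alone fall short of 0.95 after renormalization. The second keeps only the top two.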

In [None]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template="gemma-3",
)
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Continue the sequence: 1, 1, 2, 3, 5, 8,",
            }
        ],
    }
]
text = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # Must add for generation
)
outputs = model.generate(
    **tokenizer([text], return_tensors="pt").to("cuda"),
    max_new_tokens=4096,  # Increase for longer outputs!
    # Recommended Gemma-3 settings!
    temperature=1.0,
    top_p=0.95,
    top_k=64,
)
tokenizer.batch_decode(outputs)

You can also use a `TextStreamer` for continuous inference, so you can see the generation token by token instead of waiting for the whole output!

In [None]:
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Why is the sky blue?",
            }
        ],
    }
]
text = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # Must add for generation
)

from transformers import TextStreamer

_ = model.generate(
    **tokenizer([text], return_tensors="pt").to("cuda"),
    max_new_tokens=4096,
    # Recommended Gemma-3 settings!
    temperature=1.0,
    top_p=0.95,
    top_k=64,
    streamer=TextStreamer(tokenizer, skip_prompt=True),
)

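### Saving to float16 for vLLM

We also support saving to `float16` directly for deployment! The cell below saves the merged model to the folder `gemma-3-4b-mnemonics` and pushes it to the Hub as `chiffonng/gemma-3-4b-mnemonics`. It is enabled as written; change `if True` to `if False` to skip it.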

In [None]:
if True:
    model.save_pretrained_merged("gemma-3-4b-mnemonics", tokenizer)
    model.push_to_hub_merged(
        "chiffonng/gemma-3-4b-mnemonics",
        tokenizer,
    )
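
Conceptually, merging folds each LoRA adapter into its base weight as `W + scale * (B @ A)`, so the saved model no longer needs the adapter at inference time. The sketch below shows that arithmetic on a toy 2x2 layer with plain Python lists. `merge_lora` and all of the numbers are hypothetical, and the actual merge in `save_pretrained_merged` also handles dtypes, quantization, and many layers.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_b, lora_a, scale):
    """Return weight + scale * (lora_b @ lora_a) for a single linear layer."""
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy 2x2 layer with a rank-1 adapter; all numbers are made up for illustration.
weight = [[1.0, 0.0], [0.0, 1.0]]
lora_b = [[1.0], [2.0]]  # shape (d_out, r)
lora_a = [[0.5, 0.25]]   # shape (r, d_in)
merged = merge_lora(weight, lora_b, lora_a, scale=2.0)
print(merged)  # [[2.0, 0.5], [2.0, 2.0]]
```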

### GGUF / llama.cpp Conversion
Saving to `GGUF` / `llama.cpp` is now supported natively for all models! For now, you can convert to `Q8_0`, `F16`, or `BF16` precision. `Q4_K_M` for 4-bit will come later!

In [None]:
if False:  # Change to True to save to GGUF
    model.save_pretrained_gguf(
        "gemma-3-4b-mnemonics-gguf",
        quantization_type="BF16",  # For now only Q8_0, BF16, F16 supported
    )

Likewise, if you want to push the GGUF to your Hugging Face account instead, set `if False` to `if True` and add your Hugging Face token and upload location!

In [None]:
if False:  # Change to True to upload GGUF
    model.push_to_hub_gguf(
        "gemma-3-4b-mnemonics-gguf",
        quantization_type="BF16",  # Only Q8_0, BF16, F16 supported
        repo_id="chiffonng/gemma-3-4b-mnemonics",
    )

Now, use the exported `.gguf` file in llama.cpp or a UI-based system like Jan or Open WebUI. You can install Jan [here](https://github.com/janhq/jan) and Open WebUI [here](https://github.com/open-webui/open-webui).

And we're done! If you have any questions about Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs, want to keep up with the latest LLM news, need help, or want to join projects, feel free to join our Discord!

Some other links:
1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!

<div class="align-center">
  <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>

  Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
</div>
