<a href="https://colab.research.google.com/github/chiffonng/mnemonic-gen/blob/sft-re/notebooks/gemma3-4b.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Installation

In [3]:
import sys
import os


# Environment detection functions
def is_colab():
    return "COLAB_" in "".join(os.environ.keys())


def is_kaggle():
    return "KAGGLE_URL_BASE" in os.environ


print(is_colab())  # True in this run as well, so later cells check is_kaggle() first
print(is_kaggle())  # True

True
True
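
Since both checks print `True` in this run, a single helper that returns one label can make the precedence explicit. The sketch below is only an illustration: `detect_environment` is a hypothetical name, `COLAB_EXAMPLE` is a made-up variable, and the per-key test is a slightly stricter assumption than the joined-string test used above.

```python
def detect_environment(environ):
    """Return 'kaggle', 'colab', or 'local' for a mapping of environment variables.

    Hypothetical helper, not part of the notebook's own code.
    """
    # Kaggle is checked first because the COLAB_ test can also be true there.
    if "KAGGLE_URL_BASE" in environ:
        return "kaggle"
    # Assumption: test each key separately instead of the joined string.
    if any("COLAB_" in key for key in environ):
        return "colab"
    return "local"


# Made-up environments for illustration only.
print(detect_environment({"KAGGLE_URL_BASE": "x", "COLAB_EXAMPLE": "1"}))
print(detect_environment({"COLAB_EXAMPLE": "1"}))
print(detect_environment({}))
```

In the notebook itself, `detect_environment(os.environ)` would be the natural call.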


In [4]:
%%capture

if not is_colab() and not is_kaggle():
    !pip install unsloth vllm
elif is_kaggle():
    !pip install unsloth[kaggle-new] vllm
else:
    !pip install --no-deps unsloth vllm
    # NOTE: run the steps below ONLY in Colab; elsewhere use `pip install unsloth vllm`.
    # Drop already-imported PIL/google modules to skip Colab's restart prompt.
    import sys, re, requests

    for name in list(sys.modules.keys()):
        if "PIL" in name or "google" in name:
            sys.modules.pop(name)
    !pip install -q "transformers>=4.50.0"
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf datasets huggingface_hub hf_transfer

    # vLLM requirements - vLLM breaks Colab due to reinstalling numpy
    f = requests.get(
        "https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/requirements/common.txt"
    ).content
    with open("vllm_requirements.txt", "wb") as file:
        file.write(re.sub(rb"(transformers|numpy|xformers)[^\n]{1,}\n", b"", f))
    !pip install -r vllm_requirements.txt
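
To see what the `re.sub` call in the Colab branch does, here is a small sketch that applies the same pattern to a made-up requirements file. The sample content is hypothetical and is not vLLM's actual `common.txt`.

```python
import re

# Same pattern as in the install cell: drop any line that pins one of these packages.
PATTERN = rb"(transformers|numpy|xformers)[^\n]{1,}\n"

# Hypothetical requirements content, invented for illustration.
sample = (
    b"numpy<2.0.0\n"
    b"tqdm\n"
    b"transformers>=4.48.2\n"
    b"requests>=2.26.0\n"
    b"xformers==0.0.28\n"
)

filtered = re.sub(PATTERN, b"", sample)
print(filtered.decode())
```

Only the `tqdm` and `requests` lines survive. Note that `[^\n]{1,}` needs at least one character after the package name, so a bare `numpy` line with no version specifier would not be removed by this pattern.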

## Utility functions

In [5]:
import os
from huggingface_hub import login

# Authentication handling based on environment
if is_kaggle():
    # For Kaggle, use Kaggle Secrets
    from kaggle_secrets import UserSecretsClient

    user_secrets = UserSecretsClient()
    HF_TOKEN = user_secrets.get_secret("HF_TOKEN")
    WB_API_KEY = user_secrets.get_secret("WANDB_API_KEY")
    if HF_TOKEN is None:
        raise KeyError("HF_TOKEN not found in Kaggle secrets.")
elif is_colab():
    from google.colab import userdata

    HF_TOKEN = userdata.get("HF_TOKEN")
    WB_API_KEY = userdata.get("WANDB_API_KEY")
    if HF_TOKEN is None:
        raise KeyError("HF_TOKEN not found in Google Colab userdata.")
else:
    from dotenv import load_dotenv

    load_dotenv()
    # os.getenv returns None instead of raising, so check the value explicitly
    HF_TOKEN = os.getenv("HF_TOKEN")
    WB_API_KEY = os.getenv("WANDB_API_KEY")
    if HF_TOKEN is None:
        raise KeyError("HF_TOKEN not found in environment variables.")

# Login to Hugging Face
if is_kaggle():
    login(token=HF_TOKEN)
else:
    login(token=HF_TOKEN, add_to_git_credential=True)

# Initialize wandb if using
import wandb

if WB_API_KEY:
    wandb.login(key=WB_API_KEY)
    use_wandb = True
    run = wandb.init(
        project="ft-gemma-3-4b-it-en-mnemonics-linguistic-reasoning",
        job_type="training",
        anonymous="allow",
    )
else:
    use_wandb = False

print(use_wandb)

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Currently logged in as: [33mchiffonng[0m ([33mchiffonng-minerva-university[0m). Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True


### Load model


In [6]:
from unsloth import FastModel

import torch

# 16 bit LoRA
model, tokenizer = FastModel.from_pretrained(
    model_name="unsloth/gemma-3-4b-it",
    max_seq_length=4096,  # Choose any for long context!
    load_in_4bit=False,  # Set True for 4-bit quantization to reduce memory
    load_in_8bit=False,
    full_finetuning=False,  # Keep False to train LoRA adapters instead of all weights
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 03-26 09:06:00 [__init__.py:239] Automatically detected platform cuda.
==((====))==  Unsloth 2025.3.18: Fast Gemma3 patching. Transformers: 4.50.1. vLLM: 0.8.2.
   \\   /|    Tesla T4. Num GPUs = 2. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Using float16 precision for gemma3 won't work! Using float32.
Unsloth: QLoRA and full finetuning all not selected. Switching to 16bit LoRA.


model.safetensors.index.json:   0%|          | 0.00/90.6k [00:00<?, ?B/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.96G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.64G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/192 [00:00<?, ?B/s]

processor_config.json:   0%|          | 0.00/70.0 [00:00<?, ?B/s]

chat_template.json:   0%|          | 0.00/1.61k [00:00<?, ?B/s]

preprocessor_config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.50, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


tokenizer_config.json:   0%|          | 0.00/1.16M [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/4.69M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/33.4M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/35.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/670 [00:00<?, ?B/s]

We now add LoRA adapters so we only need to update a small number of parameters!

In [7]:
model = FastModel.get_peft_model(
    model,
    finetune_vision_layers=False,  # Turn off for just text!
    finetune_language_layers=True,  # Should leave on!
    finetune_attention_modules=True,
    finetune_mlp_modules=True,
    r=16,  # Larger = higher accuracy, but might overfit
    lora_alpha=32,  # alpha >= r
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42,
    use_rslora=True,  # Rank stabilized LoRA
    loftq_config=None,  # No LoftQ initialization
)

Unsloth: Making `model.base_model.model.language_model.model` require gradients
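
The `use_rslora=True` flag changes how the adapter update is scaled. As a rough sketch, assuming the usual definitions (standard LoRA scales by `lora_alpha / r`, rank-stabilized LoRA by `lora_alpha / sqrt(r)`), the two factors for the values in the cell above can be compared like this. `lora_scaling` is a hypothetical helper, not an Unsloth or PEFT function.

```python
import math


def lora_scaling(r, lora_alpha, use_rslora):
    """Scaling factor applied to the low-rank update (hypothetical helper)."""
    if use_rslora:
        # Rank-stabilized LoRA divides by sqrt(r) instead of r.
        return lora_alpha / math.sqrt(r)
    return lora_alpha / r


# Values taken from the get_peft_model call above: r=16, lora_alpha=32.
print(lora_scaling(16, 32, use_rslora=False))
print(lora_scaling(16, 32, use_rslora=True))
```

With `r=16` and `lora_alpha=32`, the rank-stabilized factor is four times larger than the standard one, which is worth keeping in mind when choosing the learning rate.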


<a name="Data"></a>
### Data Prep
We now use the `Gemma-3` format for conversation-style finetunes. Gemma-3 renders multi-turn conversations like below:

```
<bos><start_of_turn>user
Hello!<end_of_turn>
<start_of_turn>model
Hey there!<end_of_turn>
```

We use Unsloth's `get_chat_template` function to get the correct chat template. It supports `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3, phi4, qwen2.5, gemma3` and more. The cell below passes `chat_template="gemma-3"`.
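
To make the rendered format concrete, here is a simplified sketch that reproduces the example above with plain string formatting. `render_gemma3` is a hypothetical function written for illustration; the real chat template also handles cases this sketch ignores (for example system prompts and image content).

```python
def render_gemma3(messages, add_generation_prompt=False):
    """Render a list of {'role', 'content'} dicts in the Gemma-3 turn format.

    Simplified, hypothetical stand-in for the tokenizer's chat template.
    """
    # Gemma-3 names the assistant role "model".
    role_names = {"user": "user", "assistant": "model"}
    text = "<bos>"
    for message in messages:
        role = role_names[message["role"]]
        text += f"<start_of_turn>{role}\n{message['content']}<end_of_turn>\n"
    if add_generation_prompt:
        # Leave the model turn open so generation continues from here.
        text += "<start_of_turn>model\n"
    return text


conversation = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hey there!"},
]
print(render_gemma3(conversation))
```

For training, the full conversation is rendered with `add_generation_prompt=False`; for inference, the open model turn is added so the model writes the reply.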

In [8]:
from datasets import load_dataset
from unsloth.chat_templates import standardize_data_formats, get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template="gemma-3",
)

train_dataset = load_dataset("chiffonng/en-vocab-mnemonics-chat", split="train")
val_dataset = load_dataset("chiffonng/en-vocab-mnemonics-chat", split="val")
test_dataset = load_dataset("chiffonng/en-vocab-mnemonics-test", split="test")

README.md:   0%|          | 0.00/768 [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/40.0k [00:00<?, ?B/s]

val-00000-of-00001.parquet:   0%|          | 0.00/13.9k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/148 [00:00<?, ? examples/s]

Generating val split:   0%|          | 0/38 [00:00<?, ? examples/s]

README.md:   0%|          | 0.00/261 [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/2.89k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/200 [00:00<?, ? examples/s]

In [10]:
train_dataset = standardize_data_formats(train_dataset)
val_dataset = standardize_data_formats(val_dataset)

In [11]:
train_dataset[0]

{'term': 'assuage',
 'mnemonic': "assuage comes from Latin ad- (to) + suavis (sweet, charming like 'suave'). Seeing a handsome, suave person can assuage her unpleasant feelings (anger, annoyance, etc). ",
 'main_type': 2,
 'messages': [{'content': 'You are an expert English teacher and mnemonic creator for students who learn English as a foreign language. Your task is to create effective, engaging mnemonics that leverage linguistic knowledge to help learners remember both the meaning and written form of English vocabulary.\n\nGENERAL REQUIREMENTS:\n1. Begin with a concise linguistic analysis that explains how the term can be broken down or understood. See LINGUISTIC FEATURES below. Avoid arbitrary reasoning, circular reasoning (using the target term to explain itself), and acronyms.\n2. Follow with a creative, memorable mnemonic that leverages this linguistic analysis and meets MNEMONIC REQUIREMENTS below.\n3. Ensure the mnemonic explains both the term\'s meaning and helps recall its w

We now apply the `Gemma-3` chat template to the conversations and save the result in a `text` column.

In [12]:
def apply_chat_template(examples):
    # tokenize=False keeps the rendered prompt as a string for the `text` column
    texts = tokenizer.apply_chat_template(
        examples["messages"], add_generation_prompt=False, tokenize=False
    )
    return {"text": texts}


train_dataset = train_dataset.map(apply_chat_template, batched=True)
val_dataset = val_dataset.map(apply_chat_template, batched=True)

Map:   0%|          | 0/148 [00:00<?, ? examples/s]

Map:   0%|          | 0/38 [00:00<?, ? examples/s]

Let's see how the chat template did! Notice that the `Gemma-3` template adds a `<bos>` token by default.

In [13]:
train_dataset.column_names

['term', 'mnemonic', 'main_type', 'messages', 'text']

In [14]:
# Try this prior to training to debug
print("Sample text field format:", type(train_dataset[0]["text"]))
print("Is text field uniform?")
all_types = set(type(item["text"]) for item in train_dataset)
print("Number of different types:", len(all_types))
print("Types found:", all_types)

Sample text field format: <class 'str'>
Is text field uniform?
Number of different types: 1
Types found: {<class 'str'>}


In [16]:
import textwrap


def pretty_print_prompt(raw_text):
    # Remove common leading indentation and surrounding whitespace before printing
    pretty_text = textwrap.dedent(raw_text).strip()
    print(pretty_text)


pretty_print_prompt(train_dataset[0]["text"])

'<bos><start_of_turn>user\nYou are an expert English teacher and mnemonic creator for students who learn English as a foreign language. Your task is to create effective, engaging mnemonics that leverage linguistic knowledge to help learners remember both the meaning and written form of English vocabulary.\n\nGENERAL REQUIREMENTS:\n1. Begin with a concise linguistic analysis that explains how the term can be broken down or understood. See LINGUISTIC FEATURES below. Avoid arbitrary reasoning, circular reasoning (using the target term to explain itself), and acronyms.\n2. Follow with a creative, memorable mnemonic that leverages this linguistic analysis and meets MNEMONIC REQUIREMENTS below.\n3. Ensure the mnemonic explains both the term\'s meaning and helps recall its written form.\n\nLINGUISTIC FEATURES:\n- Morphology: Identify meaningful word parts (prefixes, roots, suffixes) in modern English and their meanings\n- Etymology: Trace word origins from Latin, Greek, cultural, or historica

In [17]:
pretty_print_prompt(val_dataset[0]["text"])

'<bos><start_of_turn>user\nYou are an expert English teacher and mnemonic creator for students who learn English as a foreign language. Your task is to create effective, engaging mnemonics that leverage linguistic knowledge to help learners remember both the meaning and written form of English vocabulary.\n\nGENERAL REQUIREMENTS:\n1. Begin with a concise linguistic analysis that explains how the term can be broken down or understood. See LINGUISTIC FEATURES below. Avoid arbitrary reasoning, circular reasoning (using the target term to explain itself), and acronyms.\n2. Follow with a creative, memorable mnemonic that leverages this linguistic analysis and meets MNEMONIC REQUIREMENTS below.\n3. Ensure the mnemonic explains both the term\'s meaning and helps recall its written form.\n\nLINGUISTIC FEATURES:\n- Morphology: Identify meaningful word parts (prefixes, roots, suffixes) in modern English and their meanings\n- Etymology: Trace word origins from Latin, Greek, cultural, or historica

<a name="Train"></a>
### Train the model

In [18]:
# @title Show current memory stats

torch.cuda.empty_cache()

gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
9.514 GB of memory reserved.


In [22]:
from transformers import EarlyStoppingCallback, DataCollatorForLanguageModeling
from trl import SFTTrainer, SFTConfig
from unsloth import is_bfloat16_supported
from unsloth.chat_templates import train_on_responses_only

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,  # we're doing causal LM, not masked LM
)

callbacks = [EarlyStoppingCallback(early_stopping_patience=3)]

train_args = SFTConfig(
    dataset_text_field="text",
    # Hyperparameters
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    warmup_steps=10,
    num_train_epochs=4,  # Use 4 for the full dataset
    learning_rate=2e-5,
    fp16=not is_bfloat16_supported(),
    bf16=is_bfloat16_supported(),
    optim="paged_adamw_32bit",
    weight_decay=0.01,
    lr_scheduler_type="cosine_with_restarts",  # sometimes reset learning rate
    seed=42,
    max_seq_length=4096,
    # Save strategy
    output_dir="./ckpt",
    save_strategy="steps",
    save_steps=5,
    load_best_model_at_end=True,
    save_total_limit=5,
    # Eval strategy
    per_device_eval_batch_size=4,
    metric_for_best_model="eval_loss",
    eval_strategy="steps",  # Enable evaluation during training
    eval_steps=5,  # Evaluate every 5 steps
    # Logging
    logging_steps=5,
    report_to="wandb" if use_wandb else "none",
    run_name="gemma-3-4b-it-seed",
)

trainer = SFTTrainer(
    model=model,
    args=train_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    packing=True,
    callbacks=callbacks,
    # data_collator=data_collator, # Experiment if this is needed
)

# train on completions only
# trainer = train_on_responses_only(
#    trainer,
#    instruction_part = "<start_of_turn>user\n",
#    response_part = "<start_of_turn>model\n",
# )
trainer_stats = trainer.train()
wandb.finish()

Unsloth: Switching to float32 training since model cannot work with float16
Unsloth: We found double BOS tokens - we shall remove one automatically.


Unsloth: Tokenizing ["text"] (num_proc=4):   0%|          | 0/148 [00:00<?, ? examples/s]

Unsloth: Hugging Face's packing is currently buggy - we're disabling it for now!
Unsloth: We found double BOS tokens - we shall remove one automatically.


Unsloth: Tokenizing ["text"] (num_proc=4):   0%|          | 0/38 [00:00<?, ? examples/s]

Unsloth: Hugging Face's packing is currently buggy - we're disabling it for now!


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 148 | Num Epochs = 4 | Total steps = 16
O^O/ \_/ \    Batch size per device = 8 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (8 x 4 x 1) = 32
 "-____-"     Trainable parameters = 14,901,248/4,314,980,720 (0.35% trained)


| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 5 | 4.2007 | 4.100874 |
| 10 | 3.9476 | 3.681766 |
| 15 | 3.5009 | 3.341772 |
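
The `Total steps = 16` and `Total batch size = 32` reported in the training banner can be reproduced with a little arithmetic. The sketch below is an assumption about how the step count is derived (ceiling for batches per epoch, floor division by the accumulation steps); it was chosen because it matches the logged numbers, and other library versions may round differently. `total_optimizer_steps` is a hypothetical helper.

```python
import math


def total_optimizer_steps(num_examples, per_device_batch, grad_accum, num_epochs, num_gpus=1):
    """Estimate the number of optimizer updates (hypothetical helper)."""
    # Batches the dataloader yields per epoch (last partial batch included).
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_gpus))
    # Assumption: updates per epoch use floor division, with a minimum of one.
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return math.ceil(num_epochs * updates_per_epoch)


# Values from the banner: 148 examples, batch size 8 per device, 4 accumulation steps, 4 epochs.
effective_batch = 8 * 4 * 1
print(effective_batch)
print(total_optimizer_steps(148, 8, 4, 4))
```

With evaluation every 5 steps, 16 total steps give exactly the three evaluation rows shown above (steps 5, 10, and 15).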


Run history:
eval/loss,█▄▁
eval/runtime,█▂▁
eval/samples_per_second,▁▇█
eval/steps_per_second,▁▆█
train/epoch,▁▁▄▄▇▇█
train/global_step,▁▁▄▄▇▇█
train/grad_norm,█▅▁
train/learning_rate,▄█▁
train/loss,█▅▁

Run summary:
eval/loss,3.34177
eval/runtime,12.5076
eval/samples_per_second,3.038
eval/steps_per_second,0.4
total_flos,5216331101906304.0
train/epoch,3.21053
train/global_step,16.0
train/grad_norm,4.59871
train/learning_rate,0.0
train/loss,3.5009
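
The training cell registers `EarlyStoppingCallback(early_stopping_patience=3)`. As a simplified sketch of the idea (the real callback also supports an improvement threshold, which is ignored here), training stops once the evaluation loss has failed to improve for `patience` consecutive evaluations. `early_stop_index` is a hypothetical helper.

```python
def early_stop_index(eval_losses, patience=3):
    """Return the index of the evaluation that triggers a stop, or None.

    Simplified, hypothetical model of patience-based early stopping.
    """
    best = float("inf")
    evaluations_without_improvement = 0
    for index, loss in enumerate(eval_losses):
        if loss < best:
            best = loss
            evaluations_without_improvement = 0
        else:
            evaluations_without_improvement += 1
        if evaluations_without_improvement >= patience:
            return index
    return None


# Validation losses logged in this run: every evaluation improved, so no stop.
print(early_stop_index([4.100874, 3.681766, 3.341772]))
# Made-up losses where three evaluations in a row fail to beat the best (0.9).
print(early_stop_index([1.0, 0.9, 0.95, 0.92, 0.91, 0.8]))
```

In this run the loss improved at every evaluation and only three evaluations took place, so the callback could not have fired.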


In [23]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime'] / 60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

530.9905 seconds used for training.
8.85 minutes used for training.
Peak reserved memory = 10.842 GB.
Peak reserved memory for training = 1.328 GB.
Peak reserved memory % of max memory = 73.55 %.
Peak reserved memory for training % of max memory = 9.009 %.


<a name="Inference"></a>
### Inference
Let's run the model via Unsloth native inference! According to the `Gemma-3` team, the recommended settings for inference are `temperature = 1.0, top_p = 0.95, top_k = 64`.

Use a `TextStreamer` for continuous inference, so you can see the generation token by token instead of waiting for the whole response!
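
To give an intuition for what `temperature`, `top_k`, and `top_p` do, here is a simplified, standard-library sketch of one sampling step. It assumes the filters are applied in the order temperature, then top-k, then top-p on the renormalized top-k distribution. `filter_candidates` and `sample_token` are hypothetical names and not the library's implementation.

```python
import math
import random


def filter_candidates(logits, temperature=1.0, top_k=64, top_p=0.95):
    """Return (token_index, probability) pairs that survive top-k and top-p."""
    # Softmax with temperature (subtract the max for numerical stability).
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    probs = [value / total for value in exps]
    # Top-k: keep the k most likely tokens and renormalize.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    top_k_total = sum(probs[i] for i in ranked)
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        p = probs[i] / top_k_total
        kept.append((i, p))
        cumulative += p
        if cumulative >= top_p:
            break
    return kept


def sample_token(logits, rng=random, **kwargs):
    """Draw one token index from the filtered candidates."""
    kept = filter_candidates(logits, **kwargs)
    indices = [i for i, _ in kept]
    weights = [p for _, p in kept]
    return rng.choices(indices, weights=weights, k=1)[0]


# Made-up logits for a four-token vocabulary.
toy_logits = [2.0, 1.0, 0.0, -1.0]
print([i for i, _ in filter_candidates(toy_logits, top_k=3, top_p=0.9)])
print(sample_token(toy_logits, rng=random.Random(0), top_k=3, top_p=0.9))
```

With `temperature = 1.0` the logits are left unscaled, so the recommended settings shape the output only through the `top_k` and `top_p` cut-offs.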

In [26]:
from unsloth.chat_templates import get_chat_template
from transformers import TextStreamer

word = "ephemeral"

tokenizer = get_chat_template(
    tokenizer,
    chat_template="gemma-3",
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": f"Create a memory aid so that I could learn the word '{word}'. Never use acronyms or letters of the word as mnemonic.",
            }
        ],
    }
]
text = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # Must add for generation
)

_ = model.generate(
    **tokenizer([text], return_tensors="pt").to("cuda"),
    max_new_tokens=1024,
    # Recommended Gemma-3 settings!
    temperature=1.0,
    top_p=0.95,
    top_k=64,
    streamer=TextStreamer(tokenizer, skip_prompt=True),
)

Okay, here’s a memory aid for “ephemeral” that avoids using any letter-based tricks:

**Imagine a beautiful, iridescent bubble floating in the air. It shimmers with all the colors of the rainbow, but it’s incredibly delicate. You watch it, mesmerized, knowing it will burst and disappear in a moment – just like an ephemeral experience.**

**Why this works:**

*   **Sensory Detail:** The image of a shimmering bubble is visually striking and memorable.
*   **Contrast:** The beauty and fragility of the bubble directly relate to the fleeting nature of something ephemeral.
*   **Storytelling:**  The short narrative creates a connection and makes the word more meaningful.

---

Would you like me to create a different memory aid, perhaps with a different image or approach?<end_of_turn>


### Saving to float16 for vLLM

We also support saving to `float16` directly for deployment! The cell below saves the merged model in the folder `gemma-3-4b-mnemonic-chat` and pushes it to the Hugging Face Hub as `chiffonng/gemma-3-4b-mnemonic-chat`.
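
"Merged" here means the LoRA update is folded back into the base weights before saving. As a toy sketch of the standard LoRA merge, `W_merged = W + scaling * (B @ A)`, using plain Python lists: the matrices and the helper names `matmul` and `merge_lora` are made up for illustration, and this is not Unsloth's implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(weight, lora_a, lora_b, scaling):
    """Fold a low-rank update into the base weight: W + scaling * (B @ A)."""
    delta = matmul(lora_b, lora_a)
    return [
        [w + scaling * d for w, d in zip(weight_row, delta_row)]
        for weight_row, delta_row in zip(weight, delta)
    ]


# Toy 2x2 base weight with a rank-1 adapter (all values invented).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]    # shape (r, in_features) = (1, 2)
B = [[0.5], [0.0]]  # shape (out_features, r) = (2, 1)
merged = merge_lora(W, A, B, scaling=2.0)
print(merged)
```

After merging, the model no longer needs the adapter matrices at inference time, which is why the merged checkpoint can be served like a regular `float16` model.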

In [27]:
model.save_pretrained_merged("gemma-3-4b-mnemonic-chat", tokenizer)
model.push_to_hub_merged(
    "chiffonng/gemma-3-4b-mnemonic-chat",
    tokenizer,
)

Downloading safetensors index for unsloth/gemma-3-4b-it...


Fetching 1 files:   0%|          | 0/1 [00:00<?, ?it/s]

Unsloth: Merging weights into 16bit: 100%|██████████| 2/2 [01:00<00:00, 30.49s/it]


tokenizer.model:   0%|          | 0.00/4.69M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/33.4M [00:00<?, ?B/s]

Upload 2 LFS files:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading safetensors index for unsloth/gemma-3-4b-it...


Fetching 1 files:   0%|          | 0/1 [00:00<?, ?it/s]

Unsloth: Merging weights into 16bit:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.96G [00:00<?, ?B/s]

Unsloth: Merging weights into 16bit:  50%|█████     | 1/2 [03:01<03:01, 181.53s/it]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.64G [00:00<?, ?B/s]

Unsloth: Merging weights into 16bit: 100%|██████████| 2/2 [05:21<00:00, 160.76s/it]


### GGUF / llama.cpp Conversion
Unsloth now supports saving to `GGUF` / `llama.cpp` natively for all models! For now, you can convert to `Q8_0`, `F16`, or `BF16` precision; `Q4_K_M` for 4-bit will come later!

In [None]:
if False:  # Change to True to save to GGUF
    model.save_pretrained_gguf(
        "gemma-3-4b-mnemonics-gguf",
        quantization_type="BF16",  # For now only Q8_0, BF16, F16 supported
    )

Likewise, to push the GGUF to your Hugging Face account instead, set `if False` to `if True` and add your Hugging Face token and upload location!

In [None]:
if False:  # Change to True to upload GGUF
    model.push_to_hub_gguf(
        "gemma-3-4b-mnemonics-gguf",
        quantization_type="BF16",  # Only Q8_0, BF16, F16 supported
        repo_id="chiffonng/gemma-3-4b-mnemonics",
    )