<a href="https://colab.research.google.com/github/aligreo/LLMs/blob/main/finetune_gemma_3n_gsm8k.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1" huggingface_hub hf_transfer
    !pip install --no-deps unsloth

In [2]:
%%capture
# Install latest transformers for Gemma 3N
!pip install --no-deps git+https://github.com/huggingface/transformers.git # Only for Gemma 3N
!pip install --no-deps --upgrade timm # Only for Gemma 3N

In [1]:
from google.colab import userdata

hf_token = userdata.get('hfr')

In [4]:
import os
import torch

In [2]:
from unsloth import FastModel

model, tokenizer = FastModel.from_pretrained(
    model_name = "unsloth/gemma-3n-E2B-it",
    dtype = None, # None for auto detection
    max_seq_length = 1024, # Choose any for long context!
    load_in_4bit = True,  # 4 bit quantization to reduce memory
    full_finetuning = False, # [NEW!] We have full finetuning now!
    token = hf_token, # use one if using gated models
)

model.config.use_cache = False

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.6.12: Fast Gemma3N patching. Transformers: 4.54.0.dev0.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Gemma3N does not support SDPA - switching to eager!


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

In [3]:
from transformers import TextStreamer
# Helper function for inference
def do_gemma_3n_inference(messages, max_new_tokens = 128):
    _ = model.generate(
        **tokenizer.apply_chat_template(
            messages,
            add_generation_prompt = True, # Must add for generation
            tokenize = True,
            return_dict = True,
            return_tensors = "pt",
        ).to("cuda"),
        max_new_tokens = max_new_tokens,
        temperature = 1.0, top_p = 0.95, top_k = 64,
        streamer = TextStreamer(tokenizer, skip_prompt = True),
    )

In [4]:
model = FastModel.get_peft_model(
    model,
    finetune_vision_layers     = False, # Turn off for just text!
    finetune_language_layers   = True,  # Should leave on!
    finetune_attention_modules = True,  # Adapt the attention projections
    finetune_mlp_modules       = True,  # Should leave on always!

    r = 8,           # Larger = higher accuracy, but might overfit
    lora_alpha = 8,  # Recommended alpha == r at least
    lora_dropout = 0,
    bias = "none",
    random_state = 3407,
)

Unsloth: Making `model.base_model.model.model.language_model` require gradients
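The `r` and `lora_alpha` values passed above set the adapter size and its scaling. A minimal sketch of that arithmetic, assuming the standard LoRA formulation (each adapted weight of shape d_out x d_in gets two low-rank factors, and the update is scaled by `lora_alpha / r`). The layer dimensions below are hypothetical placeholders, not Gemma 3N's real shapes.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


def lora_scaling(lora_alpha: float, r: int) -> float:
    """Factor applied to the low-rank update in standard (non rank-stabilized) LoRA."""
    return lora_alpha / r


# Hypothetical square projection, for illustration only.
hypothetical_d_in, hypothetical_d_out = 2048, 2048

print(lora_param_count(hypothetical_d_in, hypothetical_d_out, r=8))  # 32768
print(lora_scaling(lora_alpha=8, r=8))                               # 1.0
```

With `lora_alpha == r`, as in the cell above, the scaling factor is 1, so the low-rank update is added to the frozen weights unscaled.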


In [5]:
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "gemma-3",
)

In [6]:
from datasets import load_dataset
dataset = load_dataset("openai/gsm8k", "main")

In [7]:
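def dataset_format(row):
    # Build one training string per example in Gemma's turn format:
    # the question becomes the user turn, the reference solution the model turn.
    row['text'] = f"<start_of_turn>user\n{row['question']}<end_of_turn>\n<start_of_turn>model\n{row['answer']}<end_of_turn>"
    return row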

dataset = dataset.map(dataset_format)

Map:   0%|          | 0/7473 [00:00<?, ? examples/s]

Map:   0%|          | 0/1319 [00:00<?, ? examples/s]

In [8]:
print(dataset['train'][0]['text'])

<start_of_turn>user
Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?<end_of_turn>
<start_of_turn>model
Natalia sold 48/2 = <<48/2=24>>24 clips in May.
Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.
#### 72<end_of_turn>
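As the printed example shows, each GSM8K solution ends with a `#### <number>` line holding the final answer. A small sketch of pulling that value out for later comparison. The helper name `extract_gold_answer` is introduced here for illustration and is not part of `datasets` or Unsloth.

```python
import re


def extract_gold_answer(answer: str) -> str:
    """Return the number that follows the '####' marker, with thousands separators removed."""
    match = re.search(r"####\s*(-?[\d,]*\.?\d+)", answer)
    if match is None:
        raise ValueError("no '#### <number>' marker found")
    return match.group(1).replace(",", "")


example = (
    "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\n"
    "Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n"
    "#### 72"
)
print(extract_gold_answer(example))  # 72
```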


In [9]:
from huggingface_hub import login
login()


In [13]:
from trl import SFTConfig, SFTTrainer
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    train_dataset = dataset['train'],
    processing_class = tokenizer,

    args = SFTConfig(
        output_dir="gemma-3n-E2B-it-gsm8k-dataset",
        max_steps=60,
        logging_steps=5,
        warmup_steps = 5,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        max_seq_length=1024,
        report_to="none",
        push_to_hub=True,
        run_name="gemma-3n-E2B-it-gsm8k",
        dataset_text_field="text",
        optim="adamw_8bit",
        lr_scheduler_type="cosine",
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
    )
)

Unsloth: Tokenizing ["text"] (num_proc=2):   0%|          | 0/7473 [00:00<?, ? examples/s]

In [14]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<start_of_turn>user\n",
    response_part = "<start_of_turn>model\n",
)

Map (num_proc=2):   0%|          | 0/7473 [00:00<?, ? examples/s]
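`train_on_responses_only` restricts the loss to the model turn, so the question tokens do not contribute to training. The sketch below illustrates the general idea on plain Python lists. It is not Unsloth's implementation: it handles a single turn only, and the token IDs are made-up toy values.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def find_subsequence(seq, sub):
    """Return the index of the first occurrence of `sub` in `seq`, or -1."""
    for i in range(len(seq) - len(sub) + 1):
        if seq[i:i + len(sub)] == sub:
            return i
    return -1


def mask_prompt_tokens(input_ids, response_marker):
    """Copy input_ids into labels, masking everything up to and including the response marker."""
    pos = find_subsequence(input_ids, response_marker)
    if pos == -1:
        # No response found: nothing to learn from this example.
        return [IGNORE_INDEX] * len(input_ids)
    cut = pos + len(response_marker)
    return [IGNORE_INDEX] * cut + list(input_ids[cut:])


# Toy IDs: [1, 10, 3] stands in for "<start_of_turn>user\n",
# [1, 20, 3] for "<start_of_turn>model\n", and 2 for "<end_of_turn>".
toy_ids = [1, 10, 3, 50, 51, 2, 3, 1, 20, 3, 60, 61, 2]
print(mask_prompt_tokens(toy_ids, response_marker=[1, 20, 3]))
# [-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 60, 61, 2]
```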

In [15]:
training_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 7,473 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 1 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (1 x 4 x 1) = 4
 "-____-"     Trainable parameters = 10,567,680 of 2,000,000,000 (0.53% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
5,7.1267
10,6.9729
15,7.2004
20,4.204
25,2.6025
30,2.1119
35,2.0446
40,1.904
45,1.7097
50,1.6485
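The figures in the training banner follow from the `SFTConfig` values. A quick check of that arithmetic, using only numbers printed above (the total parameter count is the rounded value from the banner):

```python
per_device_batch_size = 1
gradient_accumulation_steps = 4
num_gpus = 1
max_steps = 60
num_train_examples = 7_473
trainable_params = 10_567_680
total_params = 2_000_000_000  # rounded figure as printed in the banner

# Examples consumed per optimizer step.
effective_batch = per_device_batch_size * gradient_accumulation_steps * num_gpus
# Examples consumed over the whole run.
examples_seen = effective_batch * max_steps
fraction_of_epoch = examples_seen / num_train_examples
trainable_pct = 100 * trainable_params / total_params

print(effective_batch)              # 4
print(examples_seen)                # 240
print(f"{fraction_of_epoch:.1%}")   # 3.2%
print(f"{trainable_pct:.2f}%")      # 0.53%
```

With `max_steps=60`, the run therefore touches only a small slice of the 7,473 training examples, well short of a full epoch.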


In [34]:
dataset['test'][10]['question']

'A new program had 60 downloads in the first month. The number of downloads in the second month was three times as many as the downloads in the first month, but then reduced by 30% in the third month. How many downloads did the program have total over the three months?'

In [46]:
messages = [{"role":"user",
             "content":[{"type":"text","text":dataset['test'][10]['question']}]}]

ids = tokenizer.apply_chat_template(messages,
                                    tokenize = True,
                                    return_dict=True,
                                    return_tensors = "pt",
                                    add_generation_prompt=True).to("cuda")
ids

{'input_ids': tensor([[     2,    105,   2364,    107, 236776,    861,   1948,   1053, 236743,
         236825, 236771,  59020,    528,    506,   1171,   2297, 236761,    669,
           1548,    529,  59020,    528,    506,   1855,   2297,    691,   1806,
           2782,    618,   1551,    618,    506,  59020,    528,    506,   1171,
           2297, 236764,    840,   1299,   7005,    684, 236743, 236800, 236771,
         236908,    528,    506,   4168,   2297, 236761,   2088,   1551,  59020,
           1602,    506,   1948,    735,   2558,   1024,    506,   1806,   3794,
         236881,    106,    107,    105,   4368,    107]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
       device='cuda:0'), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

In [47]:
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

_ = model.generate(
    **ids,
    max_new_tokens = 200,
    temperature = 1.0,
    top_p = 0.95,
    top_k = 64,
    streamer = streamer,
    pad_token_id = tokenizer.eos_token_id
    )

Let $D_1$ be the number of downloads in the first month.
Let $D_2$ be the number of downloads in the second month.
Let $D_3$ be the number of downloads in the third month.

We are given that $D_1 = 60$.
The number of downloads in the second month was three times as many as the downloads in the first month, so $D_2 = 3 \times D_1 = 3 \times 60 = 180$.
The number of downloads in the third month was reduced by 30% in the third month. This means that the number of downloads in the third month is $D_3 = D_2 - 0.30 \times D_2 = D_2(1 - 0.30) = 0.70 \times D_2 = 0.70 \times 18
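The streamed answer above stops mid-expression. The arithmetic it was setting up can be checked directly from the question text. A small sketch, with variable names chosen here for illustration:

```python
first_month = 60
second_month = 3 * first_month            # three times the first month
third_month = second_month * 70 // 100    # second month reduced by 30%
total = first_month + second_month + third_month

print(second_month, third_month, total)   # 180 126 366
```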


In [49]:
do_gemma_3n_inference(messages = [{"role":"user",
             "content":[{"type":"text","text":dataset['test'][20]['question']}]}], max_new_tokens = 1025)

Here's how to solve the problem step-by-step:

1. **Calculate the initial amount of water in the orange drink:** 10 liters * (2/3) = 6.67 liters (approximately)
2. **Calculate the initial amount of water in the pineapple drink:** 15 liters * (3/5) = 9 liters
3. **Calculate the amount of water remaining after spilling:** 6.67 liters - 1 liter = 5.67 liters (approximately)
4. **Calculate the total amount of water in the remaining drink:** 5.67 liters + 9 liters = 14.67 liters (approximately)

Therefore, there is approximately **14.67 liters** of water in the remaining 24 liters.<end_of_turn>
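To score generations like the one above against GSM8K references, a common heuristic is to take the last number in the generated text. The sketch below is one such heuristic, not an official GSM8K scorer. The helper names are hypothetical, and the second example shows where the approach breaks down.

```python
import re


def extract_last_number(text: str):
    """Return the last number found in `text` as a float, or None if there is none."""
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def is_correct(generated: str, gold: str, tol: float = 1e-6) -> bool:
    """Compare the last number in the generation with the reference answer."""
    pred = extract_last_number(generated)
    return pred is not None and abs(pred - float(gold)) <= tol


# Output in the dataset's own format is handled cleanly.
print(is_correct("48+24 = 72 clips altogether.\n#### 72", "72"))  # True

# Free-form prose is not: the last number here is 24, not the stated result 14.67.
prose = "Therefore, there is approximately **14.67 liters** of water in the remaining 24 liters."
print(extract_last_number(prose))  # 24.0
```

Because of cases like the second one, it may be more reliable to prompt or train the model to end with the `#### <number>` marker and parse that line instead.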
