# Finetuning Gemma 2 (9B) with Unsloth on the Alpaca Dataset

In this notebook, we finetune `unsloth/gemma-2-9b-bnb-4bit`, a pre-quantized 4-bit build of Gemma 2 (9B), by training lightweight LoRA adapters on the cleaned Alpaca dataset (`yahma/alpaca-cleaned`).


In [1]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl==0.15.2 triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf==3.20.3 datasets huggingface_hub hf_transfer tyro
    !pip install --no-deps unsloth

## Load Gemma 2 (9B) 4-bit model
We load the pre-quantized checkpoint through Unsloth's `FastLanguageModel` with `load_in_4bit = True`, so the weights stay in 4-bit precision on the GPU.

In [2]:
from unsloth import FastLanguageModel
import torch

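max_seq_length = 2048  # maximum context length used for training and inference
dtype = None           # None lets Unsloth auto-detect (bfloat16 on this A100, per the log below)
load_in_4bit = True    # keep the weights 4-bit quantized to reduce memory use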

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gemma-2-9b-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


    PyTorch 2.6.0+cu124 with CUDA 1204 (you have 2.5.1+cu124)
    Python  3.11.11 (you have 3.11.11)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details


🦥 Unsloth Zoo will now patch everything to make training faster!
Unsloth: If you want to finetune Gemma 2, install flash-attn to make it faster!
To install flash-attn, do the below:

pip install --no-deps --upgrade "flash-attn>=2.6.3"
==((====))==  Unsloth 2025.3.19: Fast Gemma2 patching. Transformers: 4.48.3.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 8.0. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
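To see why 4-bit loading matters, a rough estimate of weight memory helps. The sketch below assumes about 9 billion parameters (the rounded figure Unsloth prints later in the training banner) and counts the weights only. It ignores quantization constants, layers kept in higher precision, activations, and optimizer state, so treat the numbers as indicative rather than exact.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


NUM_PARAMS = 9e9  # assumption: rounded parameter count for Gemma 2 (9B)

for label, bits in [("bf16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_memory_gib(NUM_PARAMS, bits):5.1f} GiB")
```

Under these assumptions the 4-bit weights need roughly a quarter of the bf16 footprint, which leaves most of the 40 GB A100 free for activations and adapter training.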



## Add LoRA Adapters
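LoRA keeps the quantized base weights frozen and trains small low-rank adapter matrices attached to the attention and MLP projections, which saves memory and speeds up training.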

In [3]:
model = FastLanguageModel.get_peft_model(
    model,
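    r = 16,  # LoRA rank: inner dimension of each low-rank adapter
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],  # all attention and MLP projections
    lora_alpha = 16,  # adapter scaling factor (alpha / r = 1 here)
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",  # Unsloth's memory-saving gradient checkpointing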
    random_state = 42,
)

Unsloth 2025.3.19 patched 42 layers with 42 QKV layers, 42 O layers and 42 MLP layers.
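The adapter size follows directly from the rank and the shapes of the targeted projections: a rank-`r` adapter on a linear layer with `in` inputs and `out` outputs adds `r * (in + out)` weights. The sketch below is a sanity check, not part of the training pipeline. The layer dimensions are assumptions taken from the published Gemma 2 (9B) configuration and do not appear in this notebook; with them, the count matches the `Trainable parameters = 54,018,048` figure in the training banner further down.

```python
# Assumed Gemma 2 (9B) dimensions (from the published model config, not from this notebook).
HIDDEN = 3584          # hidden_size
INTERMEDIATE = 14336   # intermediate_size (MLP width)
NUM_LAYERS = 42        # consistent with "patched 42 layers" in the log above
Q_OUT = 16 * 256       # num_attention_heads * head_dim
KV_OUT = 8 * 256       # num_key_value_heads * head_dim
RANK = 16              # r passed to get_peft_model

# (in_features, out_features) for each targeted projection in one decoder layer.
SHAPES = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """Matrix A is (rank x in) and matrix B is (out x rank): rank * (in + out) weights."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{per_layer:,} per layer, {total:,} in total")  # 1,286,144 per layer, 54,018,048 in total
```

Doubling `r` doubles the trainable parameter count, so the rank is the main knob for trading adapter capacity against memory.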


## Prepare Alpaca Dataset
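We load the `yahma/alpaca-cleaned` dataset and render each record into the Alpaca instruction/input/response prompt template, appending the tokenizer's EOS token so the model learns where a response ends.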

In [4]:
from datasets import load_dataset

gemma_prompt = """Below is an instruction that describes a task, paired with an input. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token  # appended to every example so the model learns when to stop

def format_prompt(example):
    return {
        "text": [
            gemma_prompt.format(i, x, y) + EOS_TOKEN
            for i, x, y in zip(example["instruction"], example["input"], example["output"])
        ]
    }

dataset = load_dataset("yahma/alpaca-cleaned", split="train")
dataset = dataset.map(format_prompt, batched=True)
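To make the formatting step concrete, here is a standalone sketch that applies the same template to a tiny hand-written batch. The two records are hypothetical, and `"<eos>"` is an assumed stand-in for `tokenizer.eos_token` so that the snippet runs without loading the model.

```python
ALPACA_TEMPLATE = """Below is an instruction that describes a task, paired with an input. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = "<eos>"  # assumption: stand-in for tokenizer.eos_token


def format_prompt(batch):
    """Batched map function: receives columns as lists, returns a new 'text' column."""
    return {
        "text": [
            ALPACA_TEMPLATE.format(instruction, context, answer) + EOS_TOKEN
            for instruction, context, answer in zip(
                batch["instruction"], batch["input"], batch["output"]
            )
        ]
    }


# Hypothetical records in the same column layout as the Alpaca dataset.
batch = {
    "instruction": ["Add the numbers.", "Name a primary color."],
    "input": ["2 and 3", ""],
    "output": ["5", "Red"],
}

formatted = format_prompt(batch)
print(formatted["text"][0])
```

Records with an empty `input` still get the `### Input:` header followed by a blank line, which is the same shape used at inference time later in the notebook.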


## Train Gemma 2 (9B)
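We use TRL's `SFTTrainer` for supervised finetuning.
For this demo, training runs for only 60 optimizer steps at an effective batch size of 8 (2 per device × 4 gradient-accumulation steps), which covers 480 of the 51,760 examples. For a real run, remove `max_steps` and set `num_train_epochs` instead.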

In [5]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 42,
        output_dir = "gemma_outputs",
        report_to = "none",
    ),
)

trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 51,760 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 54,018,048/9,000,000,000 (0.60% trained)

Unsloth: Will smartly offload gradients to save VRAM!


| Step | Training Loss |
|------|---------------|
| 1 | 1.6201 |
| 2 | 1.8641 |
| 3 | 1.8874 |
| 4 | 1.8361 |
| 5 | 1.2911 |
| 6 | 1.1839 |
| 7 | 1.1200 |
| 8 | 1.1637 |
| 9 | 0.9839 |
| 10 | 0.9075 |


TrainOutput(global_step=60, training_loss=0.9699750562508901, metrics={'train_runtime': 236.4009, 'train_samples_per_second': 2.03, 'train_steps_per_second': 0.254, 'total_flos': 6295327506284544.0, 'train_loss': 0.9699750562508901})
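The numbers in the banner and in `TrainOutput` can be cross-checked with a few lines of arithmetic. The sketch below also models the learning-rate curve implied by `warmup_steps = 5`, `max_steps = 60`, and `lr_scheduler_type = "linear"`: a linear ramp to the peak rate followed by a linear decay to zero. The function `linear_warmup_decay` is a hypothetical helper written for illustration. It follows the usual shape of a linear warmup schedule, and the trainer's exact step indexing may differ by one.

```python
def linear_warmup_decay(step: int, warmup_steps: int = 5,
                        max_steps: int = 60, peak_lr: float = 2e-4) -> float:
    """Learning rate after `step` optimizer updates: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0.0, (max_steps - step) / max(1, max_steps - warmup_steps))
    return peak_lr * remaining


# Values copied from the training configuration and log in this notebook.
per_device_batch, grad_accum, num_gpus = 2, 4, 1
max_steps, num_examples, runtime_s = 60, 51_760, 236.4009

effective_batch = per_device_batch * grad_accum * num_gpus
examples_seen = effective_batch * max_steps
fraction_of_epoch = examples_seen / num_examples
samples_per_second = examples_seen / runtime_s

print(f"effective batch size: {effective_batch}")
print(f"examples seen: {examples_seen} ({fraction_of_epoch:.2%} of one epoch)")
print(f"throughput: {samples_per_second:.2f} samples/s")
for step in (0, 1, 5, 30, 60):
    print(f"step {step:>2}: lr = {linear_warmup_decay(step):.2e}")
```

The throughput computed this way agrees with the `train_samples_per_second` value in `TrainOutput`, and the coverage figure shows that this demo run sees less than 1% of the dataset.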

## Inference
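We switch the model to Unsloth's inference mode and stream a completion for a test instruction. The input and response slots of the prompt are left empty so that the model fills in the response.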

In [6]:
FastLanguageModel.for_inference(model)

inputs = tokenizer([
    gemma_prompt.format(
        "List the planets of our solar system in order.",
        "",
        ""
    )
], return_tensors="pt").to("cuda")

from transformers import TextStreamer
streamer = TextStreamer(tokenizer, skip_prompt=True)

_ = model.generate(
    input_ids = inputs.input_ids,
    attention_mask = inputs.attention_mask,
    streamer = streamer,
    max_new_tokens = 128,
    pad_token_id = tokenizer.eos_token_id
)



The planets of our solar system in order from the Sun are:

1. Mercury
2. Venus
3. Earth
4. Mars
5. Jupiter
6. Saturn
7. Uranus
8. Neptune<eos>
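The streamer above prints only the completion because `skip_prompt=True`. If the generated ids are decoded in full instead, the text contains the prompt followed by the completion and the EOS marker. A minimal sketch for pulling out just the answer is shown below; `extract_response` is a hypothetical helper, and it assumes the `### Response:` header from the template and the literal `<eos>` string seen in the output above.

```python
RESPONSE_MARKER = "### Response:"
EOS_TOKEN = "<eos>"  # assumption: the EOS string as it appears in decoded text


def extract_response(decoded: str) -> str:
    """Return the text after the last response header, without the EOS marker."""
    # rpartition yields the whole string as the tail when the marker is absent.
    _, _, tail = decoded.rpartition(RESPONSE_MARKER)
    return tail.replace(EOS_TOKEN, "").strip()


# Hypothetical decoded output: prompt, then completion, then EOS.
decoded = (
    "### Instruction:\nList the planets of our solar system in order.\n\n"
    "### Input:\n\n\n"
    "### Response:\nThe planets of our solar system in order from the Sun are:\n\n"
    "1. Mercury\n2. Venus\n3. Earth\n4. Mars\n"
    "5. Jupiter\n6. Saturn\n7. Uranus\n8. Neptune<eos>"
)

answer = extract_response(decoded)
print(answer)
```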


## Save Finetuned Model
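`save_pretrained` writes only the LoRA adapter weights and the tokenizer files to a local directory, not the full 9B base model. The same objects can be pushed to the Hugging Face Hub instead.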

In [7]:
model.save_pretrained("gemma2_lora_finetuned")
tokenizer.save_pretrained("gemma2_lora_finetuned")

('gemma2_lora_finetuned/tokenizer_config.json',
 'gemma2_lora_finetuned/special_tokens_map.json',
 'gemma2_lora_finetuned/tokenizer.model',
 'gemma2_lora_finetuned/added_tokens.json',
 'gemma2_lora_finetuned/tokenizer.json')
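To use the adapters in a later session, the saved directory can be passed back to `FastLanguageModel.from_pretrained`, which is the pattern Unsloth's own notebooks use for reloading. The wrapper below, `load_finetuned`, is a hypothetical convenience function. It assumes a CUDA GPU with Unsloth installed and reuses the settings from the loading cell; it has not been run as part of this notebook.

```python
def load_finetuned(adapter_dir: str = "gemma2_lora_finetuned",
                   max_seq_length: int = 2048):
    """Reload the saved LoRA adapters on top of the 4-bit base model for inference."""
    # Imported lazily so that defining this helper does not itself require Unsloth.
    from unsloth import FastLanguageModel

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = adapter_dir,         # local directory written by save_pretrained
        max_seq_length = max_seq_length,
        dtype = None,                     # auto-detect, as in the loading cell
        load_in_4bit = True,
    )
    FastLanguageModel.for_inference(model)  # switch to Unsloth's inference mode
    return model, tokenizer
```

Calling `model, tokenizer = load_finetuned()` should then allow the inference cell above to be reused unchanged.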