# Mistral-Nemo 12B Finetuning with Unsloth + Alpaca

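This notebook fine-tunes `unsloth/Mistral-Nemo-Base-2407-bnb-4bit`, a 4-bit quantized build of the Mistral-Nemo 12B base model, using LoRA adapters for memory-efficient instruction tuning. The training data is the cleaned Alpaca dataset (`yahma/alpaca-cleaned`), which consists of single-turn instruction/input/output records rather than multi-turn chat.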

In [1]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl==0.15.2 triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf==3.20.3 datasets huggingface_hub hf_transfer tyro
    !pip install --no-deps unsloth

## Load the Mistral-Nemo 12B model
We'll load the 4-bit quantized version of Mistral-Nemo and initialize the tokenizer.

In [2]:
from unsloth import FastLanguageModel
import torch

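max_seq_length = 2048  # maximum sequence length in tokens used for training
dtype = None           # None lets Unsloth auto-detect (bfloat16 where supported, else float16)
load_in_4bit = True    # load 4-bit quantized weights to reduce GPU memory use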

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Mistral-Nemo-Base-2407-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
Unsloth: Failed to patch Gemma3ForConditionalGeneration.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.3.19: Fast Mistral patching. Transformers: 4.51.3.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.0. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



## Add LoRA adapters
We apply parameter-efficient LoRA adapters to the attention and feedforward layers.

In [3]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 42,
)

Unsloth 2025.3.19 patched 40 layers with 40 QKV layers, 40 O layers and 40 MLP layers.
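As a sanity check on the adapter size, the number of trainable LoRA parameters can be worked out by hand: each adapted linear layer with `in` inputs and `out` outputs gains `r * (in + out)` weights. The sketch below assumes the layer dimensions from the published Mistral-Nemo configuration (hidden size 5120, 32 query heads and 8 key/value heads of dimension 128, MLP size 14336). These values are not stated in this notebook, so treat them as assumptions. Under them, the total agrees with the `Trainable parameters = 57,016,320` figure in the training log.

```python
# Assumed Mistral-Nemo dimensions (not stated in this notebook).
HIDDEN = 5120
HEAD_DIM = 128
N_Q_HEADS = 32
N_KV_HEADS = 8
MLP = 14336
N_LAYERS = 40  # matches "patched 40 layers" in the log above
R = 16         # LoRA rank passed to get_peft_model

q_out = N_Q_HEADS * HEAD_DIM    # width of the query projection
kv_out = N_KV_HEADS * HEAD_DIM  # width of the key/value projections

# (in_features, out_features) for each module listed in target_modules
shapes = {
    "q_proj": (HIDDEN, q_out),
    "k_proj": (HIDDEN, kv_out),
    "v_proj": (HIDDEN, kv_out),
    "o_proj": (q_out, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

def lora_params(in_features, out_features, r=R):
    # LoRA adds A (r x in_features) and B (out_features x r) per adapted layer.
    return r * (in_features + out_features)

per_layer = sum(lora_params(i, o) for i, o in shapes.values())
total = per_layer * N_LAYERS
print(f"{per_layer:,} per layer, {total:,} total")
```

Doubling `r` doubles this count, since every term is linear in the rank.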


## Load and format Alpaca dataset
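We load `yahma/alpaca-cleaned` and render each record into an Alpaca-style prompt with `### Task`, `### Context`, and `### Answer` sections. The EOS token is appended to every example so the model learns where a response ends.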

In [4]:
from datasets import load_dataset

mistral_prompt = """Below is an instruction that describes a task, paired with an input. Write a response that appropriately completes the request.

### Task:
{}

### Context:
{}

### Answer:
{}"""

EOS_TOKEN = tokenizer.eos_token

def format_prompt(example):
    return {
        "text": [
            mistral_prompt.format(i, x, y) + EOS_TOKEN
            for i, x, y in zip(example["instruction"], example["input"], example["output"])
        ]
    }

dataset = load_dataset("yahma/alpaca-cleaned", split="train")
dataset = dataset.map(format_prompt, batched=True)
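With `batched=True`, `dataset.map` passes `format_prompt` a dict of column lists rather than one record at a time, which is why the function zips the three columns together. The sketch below runs the same logic on a hand-made two-row batch without downloading anything. The `"</s>"` EOS string and the sample rows are placeholders, not values taken from the real tokenizer or dataset.

```python
PROMPT = """Below is an instruction that describes a task, paired with an input. Write a response that appropriately completes the request.

### Task:
{}

### Context:
{}

### Answer:
{}"""

EOS_TOKEN = "</s>"  # placeholder; the notebook uses tokenizer.eos_token

def format_prompt(batch):
    # batch maps each column name to a list of values, one per row
    return {
        "text": [
            PROMPT.format(instruction, context, answer) + EOS_TOKEN
            for instruction, context, answer in zip(
                batch["instruction"], batch["input"], batch["output"]
            )
        ]
    }

# Made-up rows in the same column layout as the Alpaca dataset
toy_batch = {
    "instruction": ["Add the numbers.", "Name a primary color."],
    "input": ["2 and 3", ""],
    "output": ["5", "Red"],
}

formatted = format_prompt(toy_batch)
print(formatted["text"][0])
```

Rows with an empty `input`, like the second one here, simply leave the `### Context:` section blank.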


## Train with SFTTrainer
We'll train for 60 steps as a demo. For a full run, increase `max_steps`, or remove it and set `num_train_epochs` instead (a positive `max_steps` takes precedence over `num_train_epochs`).
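To pick a value for `max_steps`, it helps to relate optimizer steps to examples seen. The rough calculation below uses the batch settings from this section's training cell and the 51,760-example dataset size reported in the training log.

```python
import math

num_examples = 51_760   # size of the train split, from the training log
per_device_batch = 2    # per_device_train_batch_size
grad_accum = 4          # gradient_accumulation_steps
num_gpus = 1
max_steps = 60

# Examples consumed per optimizer step
effective_batch = per_device_batch * grad_accum * num_gpus

examples_seen = max_steps * effective_batch
steps_per_epoch = math.ceil(num_examples / effective_batch)
fraction_of_epoch = examples_seen / num_examples

print(f"effective batch size: {effective_batch}")
print(f"examples seen in {max_steps} steps: {examples_seen}")
print(f"steps for one full epoch: {steps_per_epoch}")
print(f"fraction of an epoch covered: {fraction_of_epoch:.2%}")
```

At these settings the 60-step demo covers under 1% of the dataset, and one full epoch would take 6,470 optimizer steps.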

In [5]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 42,
        output_dir = "mistral_nemo_outputs",
        report_to = "none",
    ),
)

trainer.train()


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 51,760 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 57,016,320/4,000,000,000 (1.43% trained)


Unsloth: Will smartly offload gradients to save VRAM!


| Step | Training Loss |
|------|---------------|
| 1    | 1.5769        |
| 2    | 1.7916        |
| 3    | 1.8513        |
| 4    | 1.887         |
| 5    | 1.3441        |
| 6    | 1.2183        |
| 7    | 1.0786        |
| 8    | 1.1663        |
| 9    | 0.962         |
| 10   | 0.9494        |


TrainOutput(global_step=60, training_loss=0.9779749820629756, metrics={'train_runtime': 142.8163, 'train_samples_per_second': 3.361, 'train_steps_per_second': 0.42, 'total_flos': 8636912898232320.0, 'train_loss': 0.9779749820629756})
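The run uses `lr_scheduler_type = "linear"` with `warmup_steps = 5` and a peak learning rate of `2e-4`. As a rough sketch of what that means, the function below traces a linear warmup followed by a linear decay to zero. It is written from the general formula rather than taken from the trainer, so the per-step values the trainer actually applied may differ slightly, and `lr_at_step` is a hypothetical helper name.

```python
def lr_at_step(step, base_lr=2e-4, warmup_steps=5, total_steps=60):
    """Learning rate after `step` optimizer updates: linear warmup, then linear decay."""
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup period
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr at the end of warmup to 0 at total_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

for step in (0, 1, 5, 30, 60):
    print(f"step {step:>2}: lr = {lr_at_step(step):.2e}")
```

With only 60 steps, the learning rate spends most of the run decaying, so a longer run with the same warmup would hold a high learning rate for many more updates.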

## Inference
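We now test the fine-tuned model on an instruction prompt, using `TextStreamer` to print tokens as they are generated. Generation stops after `max_new_tokens = 128`, so longer answers are cut off mid-sentence.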

In [6]:
FastLanguageModel.for_inference(model)

inputs = tokenizer([
    mistral_prompt.format(
        "Explain the lifecycle of a butterfly.",
        "From egg to adult, what are the main phases?",
        ""
    )
], return_tensors="pt").to("cuda")

from transformers import TextStreamer
streamer = TextStreamer(tokenizer, skip_prompt=True)

_ = model.generate(
    input_ids = inputs.input_ids,
    attention_mask = inputs.attention_mask,
    max_new_tokens = 128,
    streamer = streamer,
    pad_token_id = tokenizer.eos_token_id
)

The lifecycle of a butterfly begins with the egg, which is laid by the female butterfly on a leaf or other surface. The egg hatches into a caterpillar, which is the larval stage of the butterfly. The caterpillar feeds on the leaves of plants, growing and molting several times as it develops. When the caterpillar is fully grown, it forms a chrysalis, which is a hard, protective shell. Inside the chrysalis, the caterpillar undergoes a process called metamorphosis, where it transforms into a pupa. The pupa then develops into an adult butterfly, which emerges from the chr
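When generating without a streamer, the decoded text includes the prompt as well as the response. Below is a minimal sketch for pulling out just the response, assuming the `### Answer:` header from the prompt template marks where the response starts. `extract_answer` is a hypothetical helper, and the sample string and `"</s>"` EOS token are made up for illustration.

```python
def extract_answer(decoded, marker="### Answer:", eos_token="</s>"):
    """Return the text after the last answer marker, without the EOS token."""
    # rpartition returns the whole string as the last element if the marker is absent
    answer = decoded.rpartition(marker)[2]
    if eos_token:
        # Drop the EOS token and anything generated after it
        answer = answer.split(eos_token, 1)[0]
    return answer.strip()

sample = (
    "### Task:\nSay hi.\n\n"
    "### Context:\n\n\n"
    "### Answer:\nHi there!</s>"
)
print(extract_answer(sample))
```

In the notebook this would be applied to something like `tokenizer.batch_decode(outputs)[0]`, passing `tokenizer.eos_token` as `eos_token`.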


## Save Finetuned LoRA
Save the adapters and tokenizer locally. You can also push to Hugging Face Hub.

In [7]:
model.save_pretrained("mistral_nemo_lora")
tokenizer.save_pretrained("mistral_nemo_lora")

('mistral_nemo_lora/tokenizer_config.json',
 'mistral_nemo_lora/special_tokens_map.json',
 'mistral_nemo_lora/tokenizer.json')
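
To use the saved adapters later, the output directory can be passed back to `FastLanguageModel.from_pretrained`, the same call used to load the base model earlier. The sketch wraps this in a function (`reload_adapters` is a hypothetical name) so nothing is loaded until it is called. It assumes the `mistral_nemo_lora` directory written by the previous cell exists and that a CUDA GPU is available.

```python
def reload_adapters(adapter_dir="mistral_nemo_lora", max_seq_length=2048):
    """Load the saved LoRA adapters on top of the 4-bit base model for inference."""
    # Imported lazily so that defining this function does not require a GPU runtime
    from unsloth import FastLanguageModel

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=adapter_dir,         # directory produced by save_pretrained
        max_seq_length=max_seq_length,
        dtype=None,                     # auto-detect, as in the loading cell
        load_in_4bit=True,
    )
    FastLanguageModel.for_inference(model)  # switch to inference mode
    return model, tokenizer
```

Calling `model, tokenizer = reload_adapters()` in a fresh session should give a model ready for the inference cell above. For the Hub route, `model.push_to_hub(...)` and `tokenizer.push_to_hub(...)` take a repository id of your choosing.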