# Fine-tuning Gemma 3 (270M) with Unsloth Framework

This notebook demonstrates how to fine-tune Google's Gemma 3 (270M parameter) model using the Unsloth framework for efficient training.

## What You'll Learn:
- **Data Preparation**: How to format datasets for conversational AI fine-tuning
- **LoRA Training**: Memory-efficient fine-tuning using Low-Rank Adaptation
- **Model Inference**: Using the fine-tuned model for text generation  
- **Model Export**: Saving in multiple formats (LoRA, merged, GGUF)

## Navigation:
- [📦 Installation](#Installation)
- [🤖 Model Loading](#Model-Loading)
- [📊 Data Preparation](#Data-Prep)
- [🎯 Training](#Train)
- [🔮 Inference](#Inference)
- [💾 Saving](#Save)


## Requirements

**Hardware Requirements:**
- GPU: CUDA-compatible GPU (Tesla T4 or better recommended)
- RAM: 8GB+ system RAM
- Storage: 10GB+ free space for model and dataset

**Software Requirements:**
- Python 3.8+
- PyTorch with CUDA support
- Unsloth framework
- Transformers, Datasets, and TRL libraries

**Key Features of this Notebook:**
- Memory-efficient training using LoRA adapters
- Support for 4-bit and 8-bit quantization
- Chess instruction dataset fine-tuning
- Multiple export formats (LoRA, merged models, GGUF)


## 📦 Installation {#Installation}

Install the required dependencies. For local execution, use `pip install unsloth`. This cell handles both Colab and local environments:

In [1]:
%%capture
import os, re
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    import torch; v = re.match(r"[0-9\.]{3,}", str(torch.__version__)).group(0)
    xformers = "xformers==" + ("0.0.32.post2" if v == "2.8.0" else "0.0.29.post3")
    !pip install --no-deps bitsandbytes accelerate {xformers} peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1,<4.0.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth
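
The Colab branch above pins `xformers` to match the installed PyTorch build. As a small illustration of how that selection works, the sketch below repeats the same regex and conditional on two example version strings. The helper name `pick_xformers` is hypothetical and only wraps the logic already present in the install cell.

```python
import re

def pick_xformers(torch_version: str) -> str:
    """Mirror the install cell: strip the build suffix, then choose an xformers pin."""
    # "2.8.0+cu126" -> "2.8.0": the pattern keeps the leading run of digits and dots
    v = re.match(r"[0-9\.]{3,}", torch_version).group(0)
    return "xformers==" + ("0.0.32.post2" if v == "2.8.0" else "0.0.29.post3")

# Example version strings (chosen for illustration)
print(pick_xformers("2.8.0+cu126"))
print(pick_xformers("2.6.0+cu124"))
```

Any PyTorch version other than 2.8.0 falls through to the older pin, so it is worth checking the printed version if the install fails.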

## 🤖 Model Loading {#Model-Loading}

The Unsloth `FastModel` class provides optimized loading for transformer models, including both vision and text models. It automatically handles tokenizer setup and performance optimizations.

In [2]:
# Load Unsloth's FastModel for optimized training
from unsloth import FastModel
import torch

# Configuration
max_seq_length = 2048  # Maximum sequence length for training

# Load Gemma 3 (270M) model with optimizations
model, tokenizer = FastModel.from_pretrained(
    model_name="unsloth/gemma-3-270m-it",
    max_seq_length=max_seq_length,
    load_in_4bit=False,   # Disable 4-bit quantization for this small model
    load_in_8bit=False,   # Disable 8-bit quantization
    full_finetuning=False, # Use LoRA instead of full fine-tuning
    # token="hf_...",     # Add HF token if using gated models
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.8.9: Fast Gemma3 patching. Transformers: 4.55.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Using float16 precision for gemma3 won't work! Using float32.
Unsloth: QLoRA and full finetuning all not selected. Switching to 16bit LoRA.



## 🔧 LoRA Adapter Configuration

LoRA (Low-Rank Adaptation) fine-tunes a model by training a small set of additional low-rank matrices while the base weights stay frozen, which reduces memory usage and training time. The adapters are attached in the `FastModel.get_peft_model` cell that follows the baseline test below.

## 🔬 Baseline Performance Test (Before Fine-tuning)

Let's test the model's chess knowledge **before** fine-tuning to establish a baseline. This will help us measure the improvement after training.


In [3]:
# Test the model BEFORE fine-tuning with a chess question
print("=" * 60)
print("🔍 TESTING MODEL BEFORE FINE-TUNING")
print("=" * 60)

# Use a chess question similar to our training data
test_chess_question = """Given the chess position after these moves: e4 e5 Nf3 Nc6 Bb5 a6 Ba4 Nf6 O-O Be7 Re1 b5 Bb3 d6 c3 O-O h3 Nb8 d4 Nbd7, what should White play next?"""

print(f"Question: {test_chess_question}")
print("\n🤖 Model Response BEFORE Fine-tuning:")
print("-" * 40)

# Format the question for the model
messages_before = [
    {"role": "user", "content": test_chess_question}
]

text_before = tokenizer.apply_chat_template(
    messages_before,
    tokenize=False,
    add_generation_prompt=True
).removeprefix('<bos>')

# Generate a response from the base model (no fine-tuning yet)
print("Model is generating...")

# Tokenize once so the prompt length can be reused when decoding
inputs_before = tokenizer(text_before, return_tensors="pt").to("cuda")

response_before = model.generate(
    **inputs_before,
    max_new_tokens=100,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id
)

# Decode only the newly generated tokens (everything after the prompt)
prompt_length = inputs_before["input_ids"].shape[1]
baseline_answer = tokenizer.decode(response_before[0][prompt_length:], skip_special_tokens=True)

print(f"Response: {baseline_answer}")
print("\n" + "=" * 60)
print("📝 NOTE: The model likely gives generic or poor chess advice at this stage.")
print("After fine-tuning, it should provide much better chess-specific responses!")
print("=" * 60)


🔍 TESTING MODEL BEFORE FINE-TUNING
Question: Given the chess position after these moves: e4 e5 Nf3 Nc6 Bb5 a6 Ba4 Nf6 O-O Be7 Re1 b5 Bb3 d6 c3 O-O h3 Nb8 d4 Nbd7, what should White play next?

🤖 Model Response BEFORE Fine-tuning:
----------------------------------------
Model is generating...
Response: White should play **"Axe"** next.


📝 NOTE: The model likely gives generic or poor chess advice at this stage.
After fine-tuning, it should provide much better chess-specific responses!


In [4]:
# Configure LoRA adapters for efficient fine-tuning
model = FastModel.get_peft_model(
    model,
    r=128,  # LoRA rank - higher values = more parameters but potentially better quality
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                   "gate_proj", "up_proj", "down_proj"],  # Transformer layers to adapt
    lora_alpha=128,        # LoRA scaling parameter
    lora_dropout=0,        # Dropout for LoRA layers (0 is optimized)
    bias="none",           # Bias handling ("none" is most memory efficient)
    use_gradient_checkpointing="unsloth",  # Unsloth's memory optimization
    random_state=3407,     # Reproducibility seed
    use_rslora=False,      # Rank-stabilized LoRA (optional)
    loftq_config=None,     # LoftQ quantization config (optional)
)

Unsloth: Making `model.base_model.model.model` require gradients
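
To get a feel for what `r=128` costs, recall that LoRA adds two low-rank matrices to each adapted linear layer, contributing `r * (in_features + out_features)` trainable parameters. The sketch below applies that formula to the seven target modules. The layer shapes and the layer count are assumptions about the Gemma 3 (270M) architecture and are not printed anywhere in this notebook, and `lora_param_count` is a hypothetical helper.

```python
# Assumed Gemma 3 (270M) shapes as (in_features, out_features).
# These numbers are assumptions, not values taken from this notebook.
ASSUMED_SHAPES = {
    "q_proj": (640, 1024),
    "k_proj": (640, 256),
    "v_proj": (640, 256),
    "o_proj": (1024, 640),
    "gate_proj": (640, 2048),
    "up_proj": (640, 2048),
    "down_proj": (2048, 640),
}
ASSUMED_NUM_LAYERS = 18

def lora_param_count(rank, shapes, num_layers):
    """Trainable LoRA parameters: rank * (in + out) per adapted layer."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers

total = lora_param_count(128, ASSUMED_SHAPES, ASSUMED_NUM_LAYERS)
print(f"LoRA parameters at r=128: {total:,}")
```

Under these assumptions the total agrees with the `Trainable parameters = 30,375,936` line that the training banner prints later in the notebook. Halving the rank halves the adapter size, which is a reasonable first experiment for a model this small.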


## 📊 Data Preparation {#Data-Prep}

We'll use the Gemma-3 conversation format for fine-tuning. Our dataset is [Thytu's ChessInstruct](https://huggingface.co/datasets/Thytu/ChessInstruct), which contains chess instruction-following examples.

**Gemma-3 Conversation Format:**
```
<bos><start_of_turn>user
Hello!<end_of_turn>
<start_of_turn>model
Hey there!<end_of_turn>
```

The process involves:
1. Loading the ChessInstruct dataset
2. Converting to conversation format  
3. Applying Gemma-3 chat template
4. Formatting for training
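
To make the target format concrete, here is a minimal, tokenizer-free sketch of what the chat template produces. It is a simplified stand-in written for illustration, not the actual Jinja template shipped with the tokenizer, and `render_gemma3` is a hypothetical name. It assumes that a system message is prepended to the following user turn, separated by a blank line, which matches the formatted training text shown later in this notebook.

```python
def render_gemma3(messages, add_bos=True):
    """Simplified illustration of the Gemma-3 turn format (not the real template)."""
    text = "<bos>" if add_bos else ""
    system_prefix = ""
    for msg in messages:
        role, content = msg["role"], msg["content"]
        if role == "system":
            # Gemma-3 has no system turn; hold the text for the next user turn
            system_prefix = content + "\n\n"
            continue
        if role == "user":
            content = system_prefix + content
            system_prefix = ""
        # The template calls the assistant role "model"
        turn = "model" if role == "assistant" else role
        text += f"<start_of_turn>{turn}\n{content}<end_of_turn>\n"
    return text

print(render_gemma3([
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hey there!"},
]))
```

In the notebook itself, `tokenizer.apply_chat_template` does this work; the sketch only shows where each piece of a conversation ends up.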

In [5]:
# Set up the chat template for Gemma-3 format
# This ensures conversations are properly formatted for the model
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template="gemma3",  # Use Gemma-3 specific conversation format
)

In [6]:
# Load the ChessInstruct dataset from Hugging Face
# We'll use the first 10,000 examples for efficient training
from datasets import load_dataset

dataset = load_dataset("Thytu/ChessInstruct", split="train[:10000]")


### Converting to Conversation Format

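Transform the raw dataset into a structured conversation format with a system instruction, a user input, and the expected assistant response. Gemma-3's chat template has no separate system turn, so the system content ends up prepended to the user turn once the template is applied.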

In [7]:
# Convert dataset to conversation format
def convert_to_chatml(example):
    """Convert a single dataset example to conversation format"""
    return {
        "conversations": [
            {"role": "system", "content": example["task"]},        # Task description
            {"role": "user", "content": example["input"]},         # User query
            {"role": "assistant", "content": example["expected_output"]}  # Expected response
        ]
    }

# Apply conversion to entire dataset
dataset = dataset.map(convert_to_chatml)


### Inspect Converted Data

Let's examine how the conversion transformed our dataset:

In [8]:
# Display a sample conversation from the converted dataset
dataset[100]

{'task': "Given an incomplit set of chess moves and the game's final score, write the last missing chess move.\n\nInput Format: A comma-separated list of chess moves followed by the game score.\nOutput Format: The missing chess move",
 'input': '{"moves": ["c2c4", "g8f6", "b1c3", "c7c5", "g1f3", "e7e6", "e2e3", "d7d5", "d2d4", "b8c6", "c4d5", "e6d5", "f1e2", "c5c4", "c1d2", "f8b4", "a1c1", "e8g8", "b2b3", "b4a3", "c1b1", "c8f5", "b3c4", "f5b1", "d1b1", "d5c4", "e2c4", "a3b4", "e1g1", "a8c8", "f1d1", "d8a5", "c3e4", "f6e4", "b1e4", "b4d2", "f3d2", "c8c7", "d2f3", "c6b8", "c4b3", "b8d7", "e4f4", "c7c3", "e3e4", "a5b5", "e4e5", "a7a5", "f4e4", "a5a4", "b3d5", "h7h6", "d1b1", "b5d3", "e4d3", "c3d3", "e5e6", "d7f6", "e6f7", "g8h7", "d5e6", "g7g6", "h2h4", "f6e4", "b1b7", "h7g7", "b7a7", "d3d1", "g1h2", "e4f2", "a7a4", "d1h1", "h2g3", "f2e4", "g3f4", "e4d6", "f3e5", "h1h4", "f4e3", "d6f5", "e3d3", "f8d8", "e5d7", "h4g4", "f7f8b", "d8f8", "d7f8", "g7f8", "e6d5", "g4g3", "d3e4", "g3g2", "e4e5"

### Apply Gemma-3 Chat Template

Convert the conversation structure into the specific text format that Gemma-3 expects for training.

In [11]:
# Apply Gemma-3 chat template to conversations
def formatting_prompts_func(examples):
    """Format conversations using the Gemma-3 chat template"""
    convos = examples["conversations"]
    # Apply template and remove <bos> token prefix
    texts = [
        tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False).removeprefix('<bos>')
        for convo in convos
    ]
    return {"text": texts}

# Apply formatting to the entire dataset in batches
dataset = dataset.map(formatting_prompts_func, batched=True)


### Inspect Formatted Text

Check how the chat template formatted our training data:


In [12]:
# Display the formatted text for the same example
dataset[100]['text']

'<start_of_turn>user\nGiven an incomplit set of chess moves and the game\'s final score, write the last missing chess move.\n\nInput Format: A comma-separated list of chess moves followed by the game score.\nOutput Format: The missing chess move\n\n{"moves": ["c2c4", "g8f6", "b1c3", "c7c5", "g1f3", "e7e6", "e2e3", "d7d5", "d2d4", "b8c6", "c4d5", "e6d5", "f1e2", "c5c4", "c1d2", "f8b4", "a1c1", "e8g8", "b2b3", "b4a3", "c1b1", "c8f5", "b3c4", "f5b1", "d1b1", "d5c4", "e2c4", "a3b4", "e1g1", "a8c8", "f1d1", "d8a5", "c3e4", "f6e4", "b1e4", "b4d2", "f3d2", "c8c7", "d2f3", "c6b8", "c4b3", "b8d7", "e4f4", "c7c3", "e3e4", "a5b5", "e4e5", "a7a5", "f4e4", "a5a4", "b3d5", "h7h6", "d1b1", "b5d3", "e4d3", "c3d3", "e5e6", "d7f6", "e6f7", "g8h7", "d5e6", "g7g6", "h2h4", "f6e4", "b1b7", "h7g7", "b7a7", "d3d1", "g1h2", "e4f2", "a7a4", "d1h1", "h2g3", "f2e4", "g3f4", "e4d6", "f3e5", "h1h4", "f4e3", "d6f5", "e3d3", "f8d8", "e5d7", "h4g4", "f7f8b", "d8f8", "d7f8", "g7f8", "e6d5", "g4g3", "d3e4", "g3g2", "e4

## 🎯 Model Training {#Train}

Configure and execute the fine-tuning process using Supervised Fine-Tuning (SFT) with LoRA adapters.

**Training Configuration:**
- **100 steps** for a quick demonstration (for a full pass over the data, set `num_train_epochs=1` and remove `max_steps`, which otherwise takes precedence)
- **LoRA adapters** for memory efficiency  
- **Response-only training** to focus learning on assistant outputs
- **8-bit optimizer** for reduced memory usage
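
It helps to see how much of the dataset a 100-step run actually covers. The sketch below does that arithmetic for the values used in this notebook (10,000 examples, batch size 8, no gradient accumulation) and assumes a single GPU. `training_coverage` is a hypothetical helper, not part of TRL or Unsloth.

```python
import math

def training_coverage(num_examples, batch_size, grad_accum, max_steps):
    """Return (effective batch, steps per epoch, fraction of an epoch seen)."""
    effective_batch = batch_size * grad_accum
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    examples_seen = max_steps * effective_batch
    return effective_batch, steps_per_epoch, examples_seen / num_examples

eff, per_epoch, frac = training_coverage(10_000, 8, 1, 100)
print(f"effective batch = {eff}, steps per epoch = {per_epoch}, epoch fraction = {frac:.0%}")
```

With these settings the demo run sees 800 examples, which is 8% of one epoch, so a full epoch would take 1,250 steps.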

In [13]:
# Configure the Supervised Fine-Tuning (SFT) trainer
from trl import SFTTrainer, SFTConfig

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    eval_dataset=None,  # Optional: add validation dataset
    args=SFTConfig(
        dataset_text_field="text",           # Field containing training text
        per_device_train_batch_size=8,       # Batch size per GPU
        gradient_accumulation_steps=1,       # 1 = no accumulation; raise for a larger effective batch size
        warmup_steps=5,                      # Learning rate warmup
        max_steps=100,                       # Number of training steps (for a full epoch, remove this and set num_train_epochs=1)
        learning_rate=5e-5,                  # Learning rate (reduce to 2e-5 for longer runs)
        logging_steps=1,                     # Log every step
        optim="adamw_8bit",                 # Memory-efficient optimizer
        weight_decay=0.01,                   # Regularization
        lr_scheduler_type="linear",          # Learning rate scheduler
        seed=3407,                          # Reproducibility
        output_dir="outputs",               # Save checkpoints here
        report_to="none",                   # Disable wandb/tensorboard logging
    ),
)

Unsloth: Switching to float32 training since model cannot work with float16



### Configure Response-Only Training

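Compute the training loss only on assistant responses and mask out the user turns, so the model is not trained to reproduce the prompts. This concentrates the learning signal on the answers and often improves the quality of the fine-tuned model.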
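
Conceptually, the masking replaces every label outside a model turn with `-100`, the value that the loss function ignores. The sketch below shows the idea on plain string tokens so it runs without a tokenizer. It is an illustration only, not Unsloth's implementation, and whether the closing `<end_of_turn>` token is kept as a target is an assumption of this sketch.

```python
IGNORE_INDEX = -100  # label value that the loss function skips

def mask_non_responses(tokens,
                       response_marker="<start_of_turn>model",
                       end_marker="<end_of_turn>"):
    """Keep labels only for tokens inside model turns; mask everything else."""
    labels, in_response = [], False
    for tok in tokens:
        if tok == response_marker:
            in_response = True
            labels.append(IGNORE_INDEX)  # the marker itself is not a target
            continue
        labels.append(tok if in_response else IGNORE_INDEX)
        if in_response and tok == end_marker:
            in_response = False  # assumption: the end marker stays a target
    return labels

tokens = ["<start_of_turn>user", "best", "move?", "<end_of_turn>",
          "<start_of_turn>model", "e2e4", "<end_of_turn>"]
print(mask_non_responses(tokens))
```

The real helper works on token IDs and locates the markers by matching the tokenized `instruction_part` and `response_part` strings passed in the next cell.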

In [14]:
# Configure training to focus only on model responses
from unsloth.chat_templates import train_on_responses_only

trainer = train_on_responses_only(
    trainer,
    instruction_part="<start_of_turn>user\n",      # User input marker (will be masked)
    response_part="<start_of_turn>model\n",        # Assistant response marker (will be trained)
)


### Verify Training Data Masking

Check that the training setup correctly masks user inputs and only trains on assistant responses:

In [15]:
# Display the full training input (with special tokens)
tokenizer.decode(trainer.train_dataset[100]["input_ids"])

'<bos><start_of_turn>user\nGiven an incomplit set of chess moves and the game\'s final score, write the last missing chess move.\n\nInput Format: A comma-separated list of chess moves followed by the game score.\nOutput Format: The missing chess move\n\n{"moves": ["c2c4", "g8f6", "b1c3", "c7c5", "g1f3", "e7e6", "e2e3", "d7d5", "d2d4", "b8c6", "c4d5", "e6d5", "f1e2", "c5c4", "c1d2", "f8b4", "a1c1", "e8g8", "b2b3", "b4a3", "c1b1", "c8f5", "b3c4", "f5b1", "d1b1", "d5c4", "e2c4", "a3b4", "e1g1", "a8c8", "f1d1", "d8a5", "c3e4", "f6e4", "b1e4", "b4d2", "f3d2", "c8c7", "d2f3", "c6b8", "c4b3", "b8d7", "e4f4", "c7c3", "e3e4", "a5b5", "e4e5", "a7a5", "f4e4", "a5a4", "b3d5", "h7h6", "d1b1", "b5d3", "e4d3", "c3d3", "e5e6", "d7f6", "e6f7", "g8h7", "d5e6", "g7g6", "h2h4", "f6e4", "b1b7", "h7g7", "b7a7", "d3d1", "g1h2", "e4f2", "a7a4", "d1h1", "h2g3", "f2e4", "g3f4", "e4d6", "f3e5", "h1h4", "f4e3", "d6f5", "e3d3", "f8d8", "e5d7", "h4g4", "f7f8b", "d8f8", "d7f8", "g7f8", "e6d5", "g4g3", "d3e4", "g3g2"

**Masked Training Labels** (only assistant responses will be trained):

In [17]:
# Display only the parts that will be trained (labels with -100 are masked out)
tokenizer.decode([tokenizer.pad_token_id if x == -100 else x for x in trainer.train_dataset[100]["labels"]]).replace(tokenizer.pad_token, " ")

'                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             {"missing 

In [18]:
# Monitor initial GPU memory usage before training
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
0.832 GB of memory reserved.


### Execute Training

Start the fine-tuning process. To continue from a checkpoint saved in `output_dir`, call `trainer.train(resume_from_checkpoint=True)` instead:

In [19]:
# Start training - this will take several minutes
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 10,000 | Num Epochs = 1 | Total steps = 100
O^O/ \_/ \    Batch size per device = 8 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (8 x 1 x 1) = 8
 "-____-"     Trainable parameters = 30,375,936 of 298,474,112 (10.18% trained)


| Step | Training Loss |
|------|---------------|
| 1 | 3.6577 |
| 2 | 3.7947 |
| 3 | 2.5744 |
| 4 | 1.2612 |
| 5 | 0.9695 |
| 6 | 0.6683 |
| 7 | 0.6984 |
| 8 | 0.6996 |
| 9 | 0.5513 |
| 10 | 0.5269 |


Unsloth: Will smartly offload gradients to save VRAM!


In [20]:
# Calculate and display training performance metrics
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)

print("=== TRAINING COMPLETED ===")
print(f"Training time: {trainer_stats.metrics['train_runtime']} seconds ({round(trainer_stats.metrics['train_runtime']/60, 2)} minutes)")
print(f"Peak GPU memory: {used_memory} GB ({used_percentage}% of {max_memory} GB)")
print(f"Memory used for training: {used_memory_for_lora} GB ({lora_percentage}% of total)")

=== TRAINING COMPLETED ===
Training time: 525.9276 seconds (8.77 minutes)
Peak GPU memory: 4.27 GB (28.967% of 14.741 GB)
Memory used for training: 3.438 GB (23.323% of total)


## 🔮 Model Inference {#Inference}

Test the fine-tuned model with inference. The Gemma-3 team recommends these generation parameters for optimal results:
- **Temperature**: 1.0 (controls randomness)
- **Top-p**: 0.95 (nucleus sampling)  
- **Top-k**: 64 (limits vocabulary per step)
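
To show how these three parameters interact, here is a minimal pure-Python sketch that applies temperature scaling, then top-k, then top-p filtering to a toy set of logits. It is a simplified illustration and not the implementation used by `model.generate`; the function name `filtered_distribution` and the toy logits are made up for this example.

```python
import math

def filtered_distribution(logits, temperature=1.0, top_k=64, top_p=0.95):
    """Return {token: probability} after temperature, top-k, and top-p filtering."""
    # 1) Temperature: values below 1 sharpen the distribution, above 1 flatten it
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    # 2) Top-k: keep only the k most likely tokens
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    total = sum(w for _, w in ranked)
    # 3) Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for tok, w in ranked:
        kept.append((tok, w))
        cumulative += w / total
        if cumulative >= top_p:
            break
    norm = sum(w for _, w in kept)
    return {tok: w / norm for tok, w in kept}

toy_logits = {"e2e4": 2.0, "d2d4": 1.0, "g1f3": 0.0, "a2a3": -3.0}
print(filtered_distribution(toy_logits))                                 # recommended settings
print(filtered_distribution(toy_logits, temperature=0.7, top_p=0.9))     # comparison-cell settings
```

With the toy logits, the lower temperature and smaller `top_p` leave fewer candidate tokens to sample from, which is why those settings tend to give more repeatable answers.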

### 🔬 Performance Comparison: Before vs After Fine-tuning

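Let's run the **same chess question** through the fine-tuned model and compare. Keep in mind that this prompt uses algebraic notation in free-form text, whereas the training examples use coordinate moves (such as `c2c4`) inside JSON, so the question is outside the training distribution.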


In [22]:
# Test the SAME chess question after fine-tuning
print("=" * 60)
print("🧠 TESTING MODEL AFTER FINE-TUNING")
print("=" * 60)

print(f"Question: {test_chess_question}")
print("\n🤖 Model Response AFTER Fine-tuning:")
print("-" * 40)

# Format the same question for the fine-tuned model
messages_after = [
    {"role": "user", "content": test_chess_question}
]

text_after = tokenizer.apply_chat_template(
    messages_after,
    tokenize=False,
    add_generation_prompt=True
).removeprefix('<bos>')

print("Fine-tuned model is generating...")

# Generate a response with the fine-tuned model, streaming tokens as they arrive
from transformers import TextStreamer
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# Tokenize once so the prompt length can be reused when decoding
inputs_after = tokenizer(text_after, return_tensors="pt").to("cuda")

response_after = model.generate(
    **inputs_after,
    max_new_tokens=100,
    temperature=0.7,  # Same parameters as baseline test
    top_p=0.9,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
    streamer=streamer,
)

# Also capture only the newly generated tokens for the comparison below
prompt_length = inputs_after["input_ids"].shape[1]
fine_tuned_answer = tokenizer.decode(response_after[0][prompt_length:], skip_special_tokens=True)

print("\n" + "=" * 60)
print("📊 COMPARISON SUMMARY:")
print("=" * 60)
print(f"🔴 BEFORE: {baseline_answer[:100]}{'...' if len(baseline_answer) > 100 else ''}")
print(f"🟢 AFTER:  {fine_tuned_answer[:100]}{'...' if len(fine_tuned_answer) > 100 else ''}")
print("=" * 60)
print("✅ The fine-tuned model should now provide much more accurate")
print("   chess-specific advice compared to the baseline model!")
print("=" * 60)


🧠 TESTING MODEL AFTER FINE-TUNING
Question: Given the chess position after these moves: e4 e5 Nf3 Nc6 Bb5 a6 Ba4 Nf6 O-O Be7 Re1 b5 Bb3 d6 c3 O-O h3 Nb8 d4 Nbd7, what should White play next?

🤖 Model Response AFTER Fine-tuning:
----------------------------------------
Fine-tuned model is generating...
{"response": "Nb8 d4 d6 h3 Nb8 d4 Nb7"}

📊 COMPARISON SUMMARY:
🔴 BEFORE: White should play **"Axe"** next.

🟢 AFTER:  {"response": "Nb8 d4 d6 h3 Nb8 d4 Nb7"}
✅ The fine-tuned model should now provide much more accurate
   chess-specific advice compared to the baseline model!


In [23]:
# Test the fine-tuned model with a sample from the training data
messages = [
    {'role': 'system', 'content': dataset[10]['conversations'][0]['content']},  # System prompt
    {'role': 'user', 'content': dataset[10]['conversations'][1]['content']}     # User question
]

# Format the conversation for generation
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,  # Add generation prompt for inference
).removeprefix('<bos>')

# Generate response with streaming output
from transformers import TextStreamer
streamer = TextStreamer(tokenizer, skip_prompt=True)

_ = model.generate(
    **tokenizer(text, return_tensors="pt").to("cuda"),
    max_new_tokens=125,          # Maximum tokens to generate
    temperature=1.0,             # Randomness (Gemma-3 recommended)
    top_p=0.95,                  # Nucleus sampling (Gemma-3 recommended)
    top_k=64,                    # Top-k sampling (Gemma-3 recommended)
    streamer=streamer,           # Stream output in real-time
)

{"missing move": "c2c7"}<end_of_turn>


## 💾 Model Saving & Export {#Save}

Save your fine-tuned model in various formats for different deployment scenarios:

### LoRA Adapters Only (Lightweight)
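Save only the LoRA weights, which requires the original base model at load time. With `r=128` on all seven projection types, the adapter here holds about 30M parameters, so expect a file on the order of 100 MB rather than a few MB: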

In [24]:
# Save LoRA adapters locally (recommended for testing)
model.save_pretrained("gemma-3-lora")
tokenizer.save_pretrained("gemma-3-lora")

# Upload to Hugging Face Hub (uncomment and add your token)
# model.push_to_hub("your_username/gemma-3-chess", token="hf_...")
# tokenizer.push_to_hub("your_username/gemma-3-chess", token="hf_...")

('gemma-3-lora/tokenizer_config.json',
 'gemma-3-lora/special_tokens_map.json',
 'gemma-3-lora/chat_template.jinja',
 'gemma-3-lora/tokenizer.model',
 'gemma-3-lora/added_tokens.json',
 'gemma-3-lora/tokenizer.json')

To load the LoRA adapters we just saved for inference, change `if False` to `if True`:

In [25]:
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "gemma-3-lora", # Directory where the LoRA adapters were saved above
        max_seq_length = 2048,
        load_in_4bit = False,
    )

### Merged Models (Standalone)
Create complete models by merging LoRA weights with the base model. These are larger but don't require the original model:

**Options:**
- **16-bit**: Half-precision weights, larger file size (~540MB)  
- **4-bit**: Quantized, smaller file size (~135MB)
- **Upload**: Push merged models to Hugging Face Hub
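
The file sizes quoted above follow from a simple weights-only estimate: parameter count times bits per parameter. The sketch below reproduces that arithmetic using the base parameter count implied by the training banner (total parameters minus LoRA parameters). `estimated_size_mb` is a hypothetical helper, and real files will differ somewhat because of quantization metadata and any layers kept in higher precision.

```python
def estimated_size_mb(num_params, bits_per_param):
    """Weights-only size estimate in megabytes (10^6 bytes)."""
    return num_params * bits_per_param / 8 / 1e6

# Base parameter count derived from the training banner:
# 298,474,112 total minus 30,375,936 LoRA parameters.
BASE_PARAMS = 298_474_112 - 30_375_936

for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: ~{estimated_size_mb(BASE_PARAMS, bits):.0f} MB")
```

The same estimate applies to the GGUF options further down, where 8-bit and 16-bit exports land near the 8-bit and 16-bit figures printed here.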

In [26]:
# Merge to 16bit
if False:
    model.save_pretrained_merged("gemma-3-finetune", tokenizer, save_method = "merged_16bit")
if False: # Pushing to HF Hub
    model.push_to_hub_merged("hf/gemma-3-finetune", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False:
    model.save_pretrained_merged("gemma-3-finetune", tokenizer, save_method = "merged_4bit",)
if False: # Pushing to HF Hub
    model.push_to_hub_merged("hf/gemma-3-finetune", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False:
    model.save_pretrained("gemma-3-finetune")
    tokenizer.save_pretrained("gemma-3-finetune")
if False: # Pushing to HF Hub
    model.push_to_hub("hf/gemma-3-finetune", token = "")
    tokenizer.push_to_hub("hf/gemma-3-finetune", token = "")


### GGUF Format (llama.cpp & Ollama)
Export to GGUF format for use with llama.cpp, Ollama, and other inference engines:

**Available Formats:**
- **Q8_0**: 8-bit quantized (~270MB, good quality)
- **F16**: 16-bit float (~540MB, best quality)  
- **BF16**: BFloat16 (~540MB, good quality)

In [27]:
if False: # Change to True to save to GGUF
    model.save_pretrained_gguf(
        "gemma-3-finetune",
        tokenizer,
        quantization_type = "Q8_0", # For now only Q8_0, BF16, F16 supported
    )

Likewise, to push the GGUF file to your Hugging Face account, change `if False` to `if True` and add your Hugging Face token and upload location:

In [28]:
if False: # Change to True to upload GGUF
    model.push_to_hub_gguf(
        "gemma-3-finetune",
        tokenizer,
        quantization_type = "Q8_0", # Only Q8_0, BF16, F16 supported
        repo_id = "HF_ACCOUNT/gemma-finetune-gguf",
        token = "hf_...",
    )

## 🎉 Congratulations!

You've successfully fine-tuned Gemma 3 (270M) for chess instruction following!

### What You Accomplished:
✅ **Efficient Training**: Used LoRA adapters to fine-tune with minimal memory  
✅ **Data Preparation**: Formatted chess instructions for conversational AI  
✅ **Response-Only Learning**: Optimized training to focus on assistant outputs  
✅ **Multiple Export Formats**: Saved models for various deployment scenarios

### Next Steps:
- **Test Performance**: Try the model with your own chess questions
- **Deploy Locally**: Use the GGUF format with llama.cpp or Ollama
- **Share Your Model**: Upload to Hugging Face Hub for others to use
- **Iterate**: Experiment with different datasets, parameters, or model sizes

### Quick Deployment:
```bash
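# Using Ollama (after saving to GGUF)
# `ollama create -f` expects a Modelfile, so point one at the GGUF file first
echo "FROM /path/to/gemma-3-finetune.gguf" > Modelfile
ollama create my-chess-tutor -f Modelfile
ollama run my-chess-tutor "What's the best opening for beginners?"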
```

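The model has now been adapted to the ChessInstruct task format. In the sample runs above it answers in the dataset's JSON style, but after only 100 training steps its moves are not yet reliable, so train for longer and evaluate on held-out examples before relying on its chess advice.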
