# Hands-On Exercises: Fine-Tuning SmolLM3

Welcome to the practical section! Here you'll apply everything you've learned about chat templates and supervised fine-tuning using SmolLM3. These exercises progress from basic concepts to advanced techniques, giving you real-world experience with instruction tuning.


## Learning Objectives

By completing these exercises, you will:
- Master SmolLM3's chat template system
- Fine-tune SmolLM3 on real datasets using both Python APIs and CLI tools
- Work with the SmolTalk2 dataset that was used to train the original model
- Compare base model vs fine-tuned model performance
- Deploy your models to Hugging Face Hub
- Understand production workflows for scaling fine-tuning

---

## Exercise 1: Exploring SmolLM3's Chat Templates

**Objective**: Understand how SmolLM3 handles different conversation formats and reasoning modes.

SmolLM3 is a hybrid reasoning model: it can answer an instruction directly, or it can first generate tokens that 'reason' through a complex problem. When post-trained effectively, the model reasons on hard problems and responds directly to easy ones. In these exercises the mode is selected with a `/think` or `/no_think` flag in the system message.

### Environment Setup

Let's start by setting up our environment.


In [1]:
# Install required packages (run in Colab or your environment)
!pip install -qqq "transformers>=4.55.0" "trl>=0.22.1" "datasets" "torch"
!pip install -qqq "accelerate" "peft" "trackio" "huggingface_hub"


In [7]:
# Import necessary libraries
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

if torch.cuda.is_available():
    device = "cuda"
    print(f"Using CUDA GPU: {torch.cuda.get_device_name()}")
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f}GB")
elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
    device = "mps"
    print("Using Apple MPS")
else:
    device = "cpu"
    print("Using CPU - you will need to use a GPU to train models")

# Authenticate with Hugging Face (optional, for private models)
from huggingface_hub import login
# login()  # Uncomment if you need to access private models


Using CUDA GPU: Tesla T4
GPU memory: 15.8GB


### Load SmolLM3 Models

Now let's load the base and instruct models for comparison.


In [13]:
# Load both base and instruct models for comparison
base_model_name = "HuggingFaceTB/SmolLM3-3B-Base"
instruct_model_name = "HuggingFaceTB/SmolLM3-3B"

# Load tokenizers
base_tokenizer = AutoTokenizer.from_pretrained(base_model_name)
instruct_tokenizer = AutoTokenizer.from_pretrained(instruct_model_name)

# Load models (use smaller precision for memory efficiency)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    dtype=torch.bfloat16,
    device_map="auto"
)

instruct_model = AutoModelForCausalLM.from_pretrained(
    instruct_model_name,
    dtype=torch.bfloat16,
    device_map="auto"
)

print("Models loaded successfully!")


### Explore Chat Template Formatting

Now let's explore the chat template formatting. We will create different types of conversations to test.
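
Before calling the real tokenizer, it can help to see the idea in plain Python. The sketch below is a simplified, hypothetical renderer (`render_chatml` is not part of any library) that mimics the `<|im_start|>` / `<|im_end|>` markers used later in this notebook. SmolLM3's actual template ships with the tokenizer and does more than this, so treat the output as an approximation of the shape, not the exact string.

```python
# Simplified ChatML-style renderer, for illustration only.
# Assumption: turns are wrapped in <|im_start|>role ... <|im_end|> markers.
# The real SmolLM3 template lives in the tokenizer and adds further details.
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def render_chatml(messages, add_generation_prompt=False):
    """Render a list of {"role", "content"} dicts as ChatML-style text."""
    parts = [f"{IM_START}{m['role']}\n{m['content']}{IM_END}\n" for m in messages]
    if add_generation_prompt:
        # Open an assistant turn so the model continues from this point.
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)


demo_messages = [
    {"role": "system", "content": "You are a math tutor."},
    {"role": "user", "content": "What is calculus?"},
]

complete = render_chatml(demo_messages)
prompt = render_chatml(demo_messages, add_generation_prompt=True)
print(prompt)
```

The only difference between the two strings is the trailing open assistant turn: use the complete form for training data and the prompt form for inference.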


In [5]:
# Create different types of conversations to test
conversations = {
    "simple_qa": [
        {"role": "system", "content": "/no_think"},
        {"role": "user", "content": "What is machine learning?"},
    ],
    "with_system": [
        {
            "role": "system",
            "content": "You are a helpful AI assistant specialized in explaining technical concepts clearly. /no_think",
        },
        {"role": "user", "content": "What is machine learning?"},
    ],
    "multi_turn": [
        {"role": "system", "content": "You are a math tutor. /no_think"},
        {"role": "user", "content": "What is calculus?"},
        {
            "role": "assistant",
            "content": "Calculus is a branch of mathematics that deals with rates of change and accumulation of quantities.",
        },
        {"role": "user", "content": "Can you give me a simple example?"},
    ],
    "reasoning_task": [
        {"role": "system", "content": "/think"},
        {
            "role": "user",
            "content": "Solve step by step: If a train travels 120 miles in 2 hours, what is its average speed?",
        },
    ],
}

for conv_type, messages in conversations.items():
    print(f"--- {conv_type.upper()} ---")

    # Format without generation prompt (for completed conversations)
    formatted_complete = instruct_tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=False
    )

    # Format with generation prompt (for inference)
    formatted_prompt = instruct_tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

    print("Complete conversation format:")
    print(formatted_complete)
    print("\nWith generation prompt:")
    print(formatted_prompt)
    print("\n" + "=" * 50 + "\n")



### Compare Base vs Instruct Model Responses


In [None]:
# Test the same prompt on both models
test_prompt = "Explain quantum computing in simple terms."

# Prepare the prompt for base model (no chat template)
base_inputs = base_tokenizer(test_prompt, return_tensors="pt").to(device)

# Prepare the prompt for instruct model (with chat template)
instruct_messages = [
    {"role": "system", "content": "/no_think"},
    {"role": "user", "content": test_prompt}
]
instruct_formatted = instruct_tokenizer.apply_chat_template(
    instruct_messages, tokenize=False, add_generation_prompt=True
)
instruct_inputs = instruct_tokenizer(instruct_formatted, return_tensors="pt").to(device)

# Generate responses
print("=== Model comparison ===\n")

print("🤖 BASE MODEL RESPONSE:")
with torch.no_grad():
    base_outputs = base_model.generate(
        **base_inputs,
        max_new_tokens=150,
        temperature=0.7,
        do_sample=True,
        pad_token_id=base_tokenizer.eos_token_id,
    )
    base_response = base_tokenizer.decode(base_outputs[0], skip_special_tokens=True)
    print(base_response[len(test_prompt) :])  # Show only the generated part

print("\n" + "=" * 50)
print("Instruct model response:")
with torch.no_grad():
    instruct_outputs = instruct_model.generate(
        **instruct_inputs,
        max_new_tokens=150,
        temperature=0.7,
        do_sample=True,
        pad_token_id=instruct_tokenizer.eos_token_id,
    )
    # Decode only the newly generated tokens. The prompt is sliced off by
    # length because skip_special_tokens=True strips the <|im_start|> markers,
    # so searching the decoded text for them would fail.
    prompt_length = instruct_inputs["input_ids"].shape[-1]
    assistant_response = instruct_tokenizer.decode(
        instruct_outputs[0][prompt_length:], skip_special_tokens=True
    )
    print(assistant_response)


### Test Dual-Mode Reasoning


In [None]:
# Test SmolLM3's reasoning capabilities
reasoning_prompts = [
    "What is 15 × 24? Show your work.",
    "A recipe calls for 2 cups of flour for 12 cookies. How much flour is needed for 30 cookies?",
    "If I have $50 and spend $18.75 on lunch and $12.30 on a book, how much money do I have left?",
]

thinking_prompts = [
    "/no_think",
    "/think"
]

print("=== TESTING REASONING CAPABILITIES ===\n")

for thinking_prompt in thinking_prompts:
    print(f"Thinking prompt: {thinking_prompt}")
    for i, prompt in enumerate(reasoning_prompts, 1):
        print(f"Problem {i}: {prompt}")

        messages = [
            {"role":"system", "content": thinking_prompt},
            {"role": "user", "content": prompt}
        ]
        formatted_prompt = instruct_tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        inputs = instruct_tokenizer(formatted_prompt, return_tensors="pt").to(device)

        with torch.no_grad():
            outputs = instruct_model.generate(
                **inputs,
                max_new_tokens=200,
                temperature=0.3,  # Lower temperature for more consistent reasoning
                do_sample=True,
                pad_token_id=instruct_tokenizer.eos_token_id,
            )
            # Decode only the newly generated tokens (skip the prompt)
            prompt_length = inputs["input_ids"].shape[-1]
            assistant_response = instruct_tokenizer.decode(
                outputs[0][prompt_length:], skip_special_tokens=True
            )
            print(f"Answer: {assistant_response}")

        print("\n" + "-" * 50 + "\n")
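
When comparing the two modes, it is often useful to separate the reasoning trace from the final answer. The helper below is a minimal sketch that assumes the reasoning is wrapped in `<think>...</think>` tags and that those tags survive decoding; `split_reasoning` is a hypothetical name, not a library function.

```python
import re

# Assumption: reasoning is delimited by <think>...</think> in the decoded text.
THINK_PATTERN = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text):
    """Return (reasoning, answer) from a decoded assistant response."""
    match = THINK_PATTERN.search(text)
    if match is None:
        # No complete reasoning block: treat everything as the answer.
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[: match.start()] + text[match.end():]).strip()
    return reasoning, answer


sample = "<think>\n120 miles / 2 hours = 60 mph\n</think>\nThe average speed is 60 mph."
reasoning, answer = split_reasoning(sample)
print("Reasoning:", reasoning)
print("Answer:", answer)
```

If generation stops at `max_new_tokens` before the closing tag appears, this sketch returns the whole text as the answer, so a larger token budget may be needed in `/think` mode.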


### Validation

Run the code above and verify that you can see:
1. Different chat template formats for various conversation types
2. Clear differences between base model and instruct model responses
3. SmolLM3's reasoning capabilities in action

### Extension challenges

1. **Multilingual Testing**: Test SmolLM3's multilingual capabilities by asking questions in French, Spanish, or German
2. **Long Context**: Create a very long conversation and test the extended context capabilities
3. **Custom System Prompts**: Experiment with different system messages to change the model's behavior

---

## Exercise 2: Dataset Processing for SFT

**Objective**: Learn to process and prepare datasets for supervised fine-tuning using SmolTalk2 and other datasets.

**Prerequisites**: Completed Exercise 1, understanding of Python data processing.

### Implementation

**Step 1: Explore the SmolTalk2 Dataset**


In [None]:
# Load and explore the SmolTalk2 dataset
print("=== EXPLORING SMOLTALK2 DATASET ===\n")

# Load the SFT subset
dataset_dict = load_dataset("HuggingFaceTB/smoltalk2", "SFT")
print(f"Total splits: {len(dataset_dict)}")
print(f"Available splits: {list(dataset_dict.keys())}")
print(f"Number of total rows: {sum([dataset_dict[d].num_rows for d in dataset_dict])}")
print(f"Dataset structure: {dataset_dict}")



In [None]:
# Function to process different dataset formats
def process_qa_dataset(examples, question_col, answer_col):
    """Process Q&A datasets into chat format"""
    processed = []

    for question, answer in zip(examples[question_col], examples[answer_col]):
        messages = [
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]
        processed.append(messages)

    return {"messages": processed}


def process_instruction_dataset(examples):
    """Process instruction-following datasets"""
    processed = []

    for instruction, response in zip(examples["instruction"], examples["response"]):
        messages = [
            {"role": "user", "content": instruction},
            {"role": "assistant", "content": response},
        ]
        processed.append(messages)

    return {"messages": processed}


# Example: Process GSM8K math dataset
print("=== PROCESSING GSM8K DATASET ===\n")

gsm8k = load_dataset(
    "openai/gsm8k", "main", split="train[:100]"
)  # Small subset for demo
print(f"Original GSM8K example: {gsm8k[0]}")


# Convert to chat format
def process_gsm8k(examples):
    processed = []
    for question, answer in zip(examples["question"], examples["answer"]):
        messages = [
            {
                "role": "system",
                "content": "You are a math tutor. Solve problems step by step.",
            },
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]
        processed.append(messages)
    return {"messages": processed}


gsm8k_processed = gsm8k.map(
    process_gsm8k, batched=True, remove_columns=gsm8k.column_names
)
print(f"Processed example: {gsm8k_processed[0]}")
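
After converting a dataset to chat format, a quick structural check can catch problems before they reach the trainer. The function below is a sketch with hypothetical rules (known roles, non-empty content, system message first, assistant turn last); adjust them to whatever your data actually requires.

```python
VALID_ROLES = {"system", "user", "assistant"}


def validate_messages(messages):
    """Return a list of problems found in one conversation (empty if none)."""
    if not messages:
        return ["conversation is empty"]
    problems = []
    for i, message in enumerate(messages):
        role = message.get("role")
        content = message.get("content")
        if role not in VALID_ROLES:
            problems.append(f"turn {i}: unknown role {role!r}")
        if role == "system" and i != 0:
            problems.append(f"turn {i}: system message is not first")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"turn {i}: empty or non-string content")
    # For SFT, each example should end with the response to learn from.
    if messages[-1].get("role") != "assistant":
        problems.append("last turn is not from the assistant")
    return problems


good = [
    {"role": "system", "content": "You are a math tutor. Solve problems step by step."},
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "2 + 2 = 4"},
]
bad = [{"role": "user", "content": ""}]

print(validate_messages(good))
print(validate_messages(bad))
```

Running a check like this over `gsm8k_processed["messages"]` and counting the non-empty results gives a cheap sanity test for the conversion functions above.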


In [None]:
# Function to apply chat templates to processed datasets
def apply_chat_template_to_dataset(dataset, tokenizer):
    """Apply chat template to dataset for training"""

    def format_messages(examples):
        formatted_texts = []

        for messages in examples["messages"]:
            # Apply chat template
            formatted_text = tokenizer.apply_chat_template(
                messages,
                tokenize=False,
                add_generation_prompt=False,  # We want the complete conversation
            )
            formatted_texts.append(formatted_text)

        return {"text": formatted_texts}

    return dataset.map(format_messages, batched=True)


# Apply to our processed GSM8K dataset
gsm8k_formatted = apply_chat_template_to_dataset(gsm8k_processed, instruct_tokenizer)
print("=== FORMATTED TRAINING DATA ===")
print(gsm8k_formatted[0]["text"])


---

## Exercise 3: Fine-Tuning SmolLM3 with SFTTrainer

**Objective**: Perform supervised fine-tuning on SmolLM3 using TRL's SFTTrainer with real datasets.

**Prerequisites**: Completed Exercise 2, GPU with at least 8GB VRAM (or Google Colab Pro).

### Implementation

**Step 1: Setup and Model Loading**


In [None]:
# Import required libraries for fine-tuning
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
import torch

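# Load the SmolLM3 checkpoint to fine-tune. Note that this repo ID is the instruct
# model; the base checkpoint used in Exercise 1 is HuggingFaceTB/SmolLM3-3B-Base.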
model_name = "HuggingFaceTB/SmolLM3-3B"
new_model_name = "SmolLM3-Custom-SFT"

print(f"Loading {model_name}...")
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype=torch.float16,  # Use float16 for memory efficiency
    device_map="auto",
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Set padding token
tokenizer.padding_side = "right"  # Pad on the right for training

print(f"Model loaded! Parameters: {model.num_parameters():,}")


**Step 2: Dataset Preparation**


In [None]:
# Load and prepare training dataset
print("=== PREPARING DATASET ===\n")

# Option 1: Use SmolTalk2 (recommended for beginners)
dataset = load_dataset("HuggingFaceTB/smoltalk2", "SFT")
training_split = "smoltalk_everyday_convs_reasoning_Qwen3_32B_think"
train_dataset = dataset[training_split].select(range(1000))  # Use subset for faster training


In [None]:
# Configure training parameters
training_config = SFTConfig(
    # Model and data
    output_dir=f"./{new_model_name}",
    dataset_text_field="text",
    max_length=2048,

    # Training hyperparameters
    per_device_train_batch_size=2,  # Adjust based on your GPU memory
    gradient_accumulation_steps=2,
    learning_rate=5e-5,
    num_train_epochs=1,  # Ignored while max_steps is set
    max_steps=500,  # Limit steps for demo; overrides num_train_epochs

    # Optimization
    warmup_steps=50,
    weight_decay=0.01,
    optim="adamw_torch",

    # Logging and saving
    logging_steps=10,
    save_steps=100,
    eval_steps=100,
    save_total_limit=2,

    # Memory optimization
    dataloader_num_workers=0,
    group_by_length=True,  # Group similar length sequences

    # Hugging Face Hub integration
    push_to_hub=False,  # Set to True to upload to Hub
    hub_model_id=f"your-username/{new_model_name}",

    # Experiment tracking
    report_to=["trackio"],  # Use trackio for experiment tracking
    run_name=f"{new_model_name}-training",
)

print("Training configuration set!")
print(f"Effective batch size: {training_config.per_device_train_batch_size * training_config.gradient_accumulation_steps}")
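
To see how these settings interact, the arithmetic can be worked through by hand. This is a rough sketch that assumes a single GPU, no sequence packing, and the 1,000-example subset selected above; the variable names are only for illustration.

```python
import math

num_examples = 1000              # size of the training subset
per_device_batch_size = 2
gradient_accumulation_steps = 2
num_devices = 1                  # assumption: a single GPU
max_steps = 500

# Examples consumed per optimizer step
effective_batch_size = (
    per_device_batch_size * gradient_accumulation_steps * num_devices
)
# Optimizer steps needed to see every example once
steps_per_epoch = math.ceil(num_examples / effective_batch_size)
# How many passes over the subset max_steps corresponds to
epochs_covered = max_steps / steps_per_epoch

print(f"Effective batch size: {effective_batch_size}")
print(f"Steps per epoch: {steps_per_epoch}")
print(f"Epochs covered by max_steps: {epochs_covered:.1f}")
```

Under these assumptions, 500 steps amounts to roughly two passes over the subset, which is why `max_steps` rather than `num_train_epochs` determines the length of this run.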

In [None]:
# Initialize the SFTTrainer
trainer = SFTTrainer(
    model=model,
    args=training_config,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # Reuse the tokenizer configured above
)



In [None]:
# Start training!
print("\n=== STARTING TRAINING ===")
trainer.train()

# Save the model
trainer.save_model()
print(f"Model saved to {training_config.output_dir}")

## LoRA SFT with TRL + SmolLM3

This short section shows how to fine-tune with LoRA adapters using TRL's SFTTrainer. It reuses the `model` and `train_dataset` loaded in Exercise 3 and runs a single short epoch as a quick demonstration.



In [None]:
from peft import LoraConfig

In [None]:
# LoRA config
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# SFT config (short run)
sft_config = SFTConfig(
    output_dir="./smollm3-lora-demo",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    packing=True,
    learning_rate=2e-4,
    logging_steps=10,
    save_strategy="no",
    report_to="none",
)
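
To get a feel for what `r` and `lora_alpha` mean, the sketch below counts the parameters LoRA adds to a single linear layer and computes the scaling factor applied to the adapter update. The layer width of 2048 is a hypothetical value chosen for illustration, not a figure taken from SmolLM3's configuration.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds to one d_out x d_in linear layer."""
    # LoRA learns two low-rank matrices: A with shape (r, d_in)
    # and B with shape (d_out, r).
    return r * d_in + d_out * r


r, lora_alpha = 8, 16
hidden = 2048  # hypothetical layer width, for illustration only

full = hidden * hidden                      # weights in the frozen layer
added = lora_param_count(hidden, hidden, r)  # weights trained by LoRA
scaling = lora_alpha / r                     # multiplier on the B @ A update

print(f"{added:,} LoRA parameters vs {full:,} full ({added / full:.2%})")
print(f"Adapter update is scaled by {scaling}")
```

For a square layer of this size, rank 8 trains under 1% of the weights in that layer, which is where LoRA's memory savings come from.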


In [None]:
# Trainer (recent TRL versions do not accept dataset_kwargs here;
# dataset options belong in SFTConfig)
trainer = SFTTrainer(
    model=model,
    args=sft_config,
    train_dataset=train_dataset,
    processing_class=tokenizer,
    peft_config=peft_config,
)

# Short demo train
trainer.train()
