In [1]:
!pip install transformers==4.54.1 datasets==4.0.0 trl==0.20.0 peft==0.16.0 torch torchsummary -q 

# Setup

In this series of notebooks, we'll go through different fine-tuning techniques for the SmolLM family of models.
Supervised fine-tuning (SFT) and PEFT techniques require a task-specific dataset structured as input-output pairs. Each pair should consist of:

- An input prompt
- The expected model response
- Any additional context or metadata, such as chat examples or summarization examples
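As a rough illustration, one such pair can be stored as a chat-style record. The sketch below assumes a `messages` layout (a list of `role`/`content` dictionaries), which is the layout of the smoltalk dataset loaded later in this notebook. The `topic` field and the `is_valid_example` helper are hypothetical names, only there for illustration.

```python
# Hypothetical training example in a chat-style layout
example = {
    "topic": "Cooking/Family recipes",  # optional metadata (illustrative field name)
    "messages": [
        {"role": "user", "content": "Do you have a family recipe to share?"},
        {"role": "assistant", "content": "Sure! How about a simple chicken parmesan?"},
    ],
}

ALLOWED_ROLES = {"system", "user", "assistant"}


def is_valid_example(record: dict) -> bool:
    """Return True when the record holds at least one prompt and one response."""
    messages = record.get("messages", [])
    well_formed = all(
        isinstance(m, dict)
        and m.get("role") in ALLOWED_ROLES
        and isinstance(m.get("content"), str)
        for m in messages
    )
    if not well_formed:
        return False
    roles = [m["role"] for m in messages]
    return "user" in roles and "assistant" in roles


print(is_valid_example(example))
```

A check like this is cheap to run over a whole dataset before training, so that malformed records are caught early rather than in the middle of a run.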

First, we need to import our dependencies and set up our environment.

The SmolLM models require a GPU with at least 8 GB of VRAM to run this fine-tuning.
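To get a feeling for where that VRAM goes, here is a back-of-the-envelope sketch. It assumes full fine-tuning in float32 with an AdamW-style optimizer (weights, gradients and two optimizer moments, so roughly 16 bytes per parameter) and uses the nominal parameter counts from the model names. Activations, the CUDA context and batch size are not counted, so treat the numbers as a lower bound rather than a measurement.

```python
def full_finetune_memory_gb(n_params: float, bytes_per_param: int = 4) -> dict:
    """Rough memory for weights, gradients and AdamW state, in GB (1e9 bytes)."""
    weights = n_params * bytes_per_param
    gradients = n_params * bytes_per_param
    adam_states = 2 * n_params * bytes_per_param  # first and second moments
    total = weights + gradients + adam_states
    return {
        "weights": round(weights / 1e9, 2),
        "gradients": round(gradients / 1e9, 2),
        "adam_states": round(adam_states / 1e9, 2),
        "total": round(total / 1e9, 2),
    }


# Nominal sizes taken from the model names (assumption, not exact counts)
for name, n_params in {
    "SmolLM2-135M": 135e6,
    "SmolLM2-360M": 360e6,
    "SmolLM2-1.7B": 1.7e9,
}.items():
    print(name, full_finetune_memory_gb(n_params))
```

Under these assumptions the 360M model needs a bit under 6 GB before activations, which is in line with the 8 GB figure, while the 1.7B model lands well above what a small GPU offers.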

Make sure your environment can run it. You can run it on Google Colab for free!
However, it might take longer with the free GPU resources.

I used a g5.2xlarge instance from SageMaker. It is quite cheap and more efficient than the free Colab GPUs. You can check the pricing of all instances here: https://instances.vantage.sh/aws/ec2/g5.2xlarge?currency=USD

We will use the transformers library for the models and the trl and peft libraries for SFT and PEFT tuning.

In [2]:
import json
import os

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from trl import SFTConfig, SFTTrainer, setup_chat_format



In [3]:
# Check which device is available (CUDA is expected for this notebook)

device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available() else "cpu"
)
print(f'Device is:{device}')
if device == "cuda":
    # FlashAttention-2 needs compute capability 8.0 (Ampere) or newer
    print(f'Type of card: {torch.cuda.get_device_capability()[0]}. (below 8 does not support flash attention)')

Device is:cuda
Type of card: 8. (below 8 does not support flash attention)


# SmolLM family description

The SmolLM models are fully open-source models created by Hugging Face.

I've chosen these models as examples because:
- they are open-source, and I mean entirely :)
- there are great blog posts and explanations from the HF team (kudos)
- the models are small, and small models are probably the next generation of models to be used in AI agent frameworks

You can find more details on this family of models here: https://github.com/huggingface/smollm or on their great blog: https://huggingface.co/blog/smollm

# Text generation fine-tuning

We'll start by fine-tuning the model for text generation. This is the most straightforward way to start because the LLM is, well, designed for this.

I've chosen the *SmolLM2-360M* model to start with because:
- SmolLM2-135M has significantly lower accuracy and fewer capabilities
- SmolLM2-360M is the medium size of the v2 family
- SmolLM2-1.7B is clearly better but does not fit on the free Colab GPU, and I want everyone to be able to follow this

The dataset used to create the SmolLM2 models is *smoltalk* (https://huggingface.co/datasets/HuggingFaceTB/smoltalk)

We need to choose the subset of data we want to use for this fine-tuning.
As we start with generation, we'll choose *'everyday-conversations'*. The full list of available sub-datasets is:
'all', 'smol-magpie-ultra', 'smol-constraints', 'smol-rewrite', 'smol-summarize', 'apigen-80k', 'everyday-conversations', 'explore-instruct-rewriting', 'longalign', 'metamathqa-50k', 'numina-cot-100k', 'openhermes-100k', 'self-oss-instruct', 'systemchats-30k'

In [15]:
model_name = "HuggingFaceTB/SmolLM2-360M"
dataset_name = "HuggingFaceTB/smoltalk"
model_cache_dir=model_name.split('/')[-1]
config_name = "everyday-conversations"
dataset_cache_dir=f"{dataset_name.replace('/', '')}_{config_name}"
output_dir = "./sft_text_generation_360M"

In [6]:
# Load the model and tokenizer

model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=model_name,
    cache_dir=model_cache_dir,
)
tokenizer = AutoTokenizer.from_pretrained(
    pretrained_model_name_or_path=model_name,
    cache_dir=model_cache_dir,
)
# Attach a chat template to the tokenizer and align the model with it
model, tokenizer = setup_chat_format(model=model, tokenizer=tokenizer)


In [7]:
# Let's look at the model configuration
print(model.config)

LlamaConfig {
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "head_dim": 64,
  "hidden_act": "silu",
  "hidden_size": 960,
  "initializer_range": 0.02,
  "intermediate_size": 2560,
  "is_llama_config": true,
  "max_position_embeddings": 8192,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 15,
  "num_hidden_layers": 32,
  "num_key_value_heads": 5,
  "pad_token_id": 2,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_interleaved": false,
  "rope_scaling": null,
  "rope_theta": 100000,
  "tie_word_embeddings": true,
  "torch_dtype": "float32",
  "transformers_version": "4.54.1",
  "use_cache": true,
  "vocab_size": 49152
}



Great! The model weighs less than 1 GB on disk (the model.safetensors file is 724 MB), which makes it easy to use.
We can see a lot of useful information here, such as the data type, the model architecture and family, as well as the vocabulary size.
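The configuration is also enough to estimate the number of parameters by hand. The sketch below copies the values from the printed LlamaConfig and assumes the usual Llama layout: bias-free q/k/v/o projections, a gated MLP with three bias-free projections, two RMSNorm weights per layer plus a final norm, and an output head that shares the embedding matrix because `tie_word_embeddings` is true. The `llama_param_count` helper is my own name, not a library function.

```python
# Values copied from the printed LlamaConfig above
cfg = {
    "vocab_size": 49152,
    "hidden_size": 960,
    "intermediate_size": 2560,
    "num_hidden_layers": 32,
    "num_attention_heads": 15,
    "num_key_value_heads": 5,
    "head_dim": 64,
    "tie_word_embeddings": True,
}


def llama_param_count(c: dict) -> int:
    """Estimate the parameter count of a Llama-style decoder from its config."""
    h = c["hidden_size"]
    q_dim = c["num_attention_heads"] * c["head_dim"]
    kv_dim = c["num_key_value_heads"] * c["head_dim"]

    embed = c["vocab_size"] * h
    attn = h * q_dim + 2 * h * kv_dim + q_dim * h  # q, k, v and o projections
    mlp = 3 * h * c["intermediate_size"]           # gate, up and down projections
    norms = 2 * h                                  # two RMSNorm weights per layer

    total = embed + c["num_hidden_layers"] * (attn + mlp + norms) + h  # + final norm
    if not c["tie_word_embeddings"]:
        total += embed  # separate output head when embeddings are not tied
    return total


print(f"{llama_param_count(cfg):,} parameters")
```

The result is a little above the nominal 360M in the model name, and comparing it with `sum(p.numel() for p in model.parameters())` is a quick sanity check on the assumptions.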

We can also check more details on the structure of the model by printing it and counting its parameters:

In [22]:
# torchsummary expects float, image-like inputs, so it does not fit a causal LM
# that takes integer token IDs. Printing the module tree and counting the
# parameters gives the same kind of overview for our model.
print(model)

n_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {n_params:,}")

Note that we applied the setup_chat_format function to our model and tokenizer when loading them.
It gives the tokenizer a chat template (ChatML by default), so that every conversation is rendered in the same format during fine-tuning.
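To make that format concrete, here is a hand-written approximation of the ChatML layout. It is only a sketch: the real rendering is done by the tokenizer's chat template through `tokenizer.apply_chat_template`, and the `to_chatml` helper below is a hypothetical stand-in that assumes `<|im_start|>` and `<|im_end|>` as the turn delimiters.

```python
def to_chatml(messages: list, add_generation_prompt: bool = False) -> str:
    """Render chat messages in a ChatML-like layout (illustrative only)."""
    text = ""
    for message in messages:
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Open an assistant turn so the model knows it is its turn to answer
        text += "<|im_start|>assistant\n"
    return text


conversation = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello! How can I help you today?"},
]
print(to_chatml(conversation))
print(to_chatml([{"role": "user", "content": "Hi"}], add_generation_prompt=True))
```

The second call shows what a generation prompt looks like: the text ends with an open assistant turn, which is what `add_generation_prompt=True` does in `apply_chat_template`.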

Also note that the model and the input tensors must live on the same device (*cuda* here). This is why the tokenized inputs are moved with `.to(device)` before calling `generate` later in this notebook.

In [11]:
# load and store dataset

ds = load_dataset(dataset_name, config_name, cache_dir=dataset_cache_dir)
ds


DatasetDict({
    train: Dataset({
        features: ['full_topic', 'messages'],
        num_rows: 2260
    })
    test: Dataset({
        features: ['full_topic', 'messages'],
        num_rows: 119
    })
})

In [12]:
# Let's explore
ds['train'][300]

{'full_topic': 'Cooking/Family recipes/Family recipe storytelling',
 'messages': [{'content': 'Hi', 'role': 'user'},
  {'content': 'Hello! How can I help you today?', 'role': 'assistant'},
  {'content': "I'm looking for some new recipes to try out, do you have any family recipes you can share?",
   'role': 'user'},
  {'content': "I can share some classic family recipes. How about a simple chicken parmesan recipe that's been passed down through generations?",
   'role': 'assistant'},
  {'content': "That sounds delicious, what's the story behind the recipe?",
   'role': 'user'},
  {'content': "It originated from an Italian grandmother who made it for her family every Sunday. She'd bread the chicken with love and care, and it became a tradition that's been carried on for years.",
   'role': 'assistant'},
  {'content': "That's so sweet, I'd love to try it out and start my own tradition.",
   'role': 'user'},
  {'content': "I'm glad you're excited to try it! I can provide you with the full 

# SFT for text generation

We'll use the trl library to fine-tune the model.
You can check the details of the SFT Trainer here: https://huggingface.co/docs/trl/en/sft_trainer

Hugging Face provides some "recipes" to configure the fine-tuning for SmolLM2: https://github.com/huggingface/alignment-handbook/tree/main/recipes/smollm. We follow their recommendations as much as possible.

### TIPS

Sometimes the configurations shared by HF or in blogs save checkpoints and run evaluations very frequently.

Each time you save your model, you're literally writing it to local storage, which takes some time. Checkpointing is important and really relevant for distributed training and huge models. In our case, frugality is best and you don't need to checkpoint every 100 steps or so.
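To put numbers on this tip, here is a small sketch using the values from this notebook's configuration (`max_steps=1000`, `save_steps=300`, `eval_steps=50`). The checkpoint size is a rough assumption: float32 weights plus two optimizer moments for a nominal 360M parameters. The actual size on disk depends on the dtype and on what the trainer chooses to store.

```python
max_steps, save_steps, eval_steps = 1000, 300, 50  # values from this notebook

# Steps at which a periodic checkpoint is written
checkpoint_steps = list(range(save_steps, max_steps + 1, save_steps))
# Number of evaluation passes over the test split
eval_count = max_steps // eval_steps

# Rough checkpoint size: weights + 2 optimizer moments, 4 bytes each (assumption)
n_params = 360e6
checkpoint_gb = n_params * 4 * 3 / 1e9

print(f"Periodic checkpoints at steps: {checkpoint_steps}")
print(f"Evaluation runs: {eval_count}")
print(f"~{checkpoint_gb:.2f} GB per checkpoint, "
      f"~{checkpoint_gb * len(checkpoint_steps):.2f} GB for all of them")

# For comparison: saving every 100 steps instead
frequent = len(range(100, max_steps + 1, 100))
print(f"With save_steps=100: {frequent} checkpoints, ~{checkpoint_gb * frequent:.1f} GB")
```

If disk space is tight, the `save_total_limit` argument of the training configuration keeps only the most recent checkpoints.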

In [16]:
# Configure trainer

training_args = SFTConfig(
    output_dir=output_dir,
    max_steps=1000,
    per_device_train_batch_size=4,
    learning_rate=5e-5,
    logging_steps=100,  # log the training loss every 100 steps
    save_steps=300,  # write a checkpoint every 300 steps
    eval_strategy="steps",
    eval_steps=50,  # evaluate on the test split every 50 steps
    use_mps_device=(device == "mps"),  # use the Apple Silicon GPU (MPS backend) when it is the detected device
    # packing=True,  # only used when flash-attention-2 is used
)

# Initialize trainer

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=ds["train"],
    eval_dataset=ds["test"],
    processing_class=tokenizer,
)
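It is worth translating `max_steps` into epochs to know how many times the model sees the data. The sketch below assumes a single GPU, no gradient accumulation and no packing, so that one step consumes exactly `per_device_train_batch_size` conversations. The train split size comes from the dataset loaded earlier.

```python
import math

train_examples = 2260            # size of the train split
per_device_train_batch_size = 4  # from the configuration
max_steps = 1000                 # from the configuration
num_devices = 1                  # assumption: single GPU
gradient_accumulation_steps = 1  # assumption: no accumulation

effective_batch = per_device_train_batch_size * num_devices * gradient_accumulation_steps
steps_per_epoch = math.ceil(train_examples / effective_batch)
epochs = max_steps / steps_per_epoch

print(f"Effective batch size: {effective_batch}")
print(f"Steps per epoch: {steps_per_epoch}")
print(f"max_steps={max_steps} covers about {epochs:.2f} epochs")
```

Under these assumptions the run goes through the training data a little less than twice.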

In [17]:
# Start training
trainer.train()

# Save the model
trainer.save_model(output_dir)

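| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 50   | No log        | 0.937843        |
| 100  | 1.030300      | 0.904401        |
| 150  | 1.030300      | 0.885333        |
| 200  | 0.866700      | 0.874348        |
| 250  | 0.866700      | 0.860767        |
| 300  | 0.845400      | 0.852138        |
| 350  | 0.845400      | 0.844574        |
| 400  | 0.826000      | 0.840862        |
| 450  | 0.826000      | 0.835543        |
| 500  | 0.831200      | 0.826222        |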


### TIPS

While the fine-tuning is running, you can monitor your GPU memory and utilization by running *nvidia-smi* in your terminal.

![Alt text](./nvidia.png)
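If you prefer to follow the GPU from Python, here is a minimal sketch that calls nvidia-smi in its CSV query mode and parses the result. The helper names `parse_gpu_stats` and `read_gpu_stats` are mine, and the sample line at the bottom is made up so that the parser can be tried on a machine without a GPU.

```python
import subprocess

# Ask nvidia-smi for memory (MiB) and utilization (%) as bare CSV values
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text: str) -> list:
    """Turn nvidia-smi CSV output into one dictionary per GPU."""
    stats = []
    for line in csv_text.strip().splitlines():
        used, total, util = (int(value.strip()) for value in line.split(","))
        stats.append(
            {"memory_used_mib": used, "memory_total_mib": total, "utilization_pct": util}
        )
    return stats


def read_gpu_stats() -> list:
    """Run nvidia-smi and parse its output (requires an NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_stats(result.stdout)


# Made-up sample line, only to show the parsing
sample_output = "5120, 23028, 97\n"
print(parse_gpu_stats(sample_output))
```

Calling `read_gpu_stats()` in a separate cell or script every few seconds gives a simple view of how close the run is to the memory limit.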

## Results analysis

OK, this was amazingly fast! Both the training and validation losses keep decreasing and stay close to each other, so there is no clear sign of overfitting in the steps logged above.

Keep in mind that the evaluation dataset is quite small (119 examples) and these models require more data than we have right now.
Still, let's see our results!
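To look at how the training and validation losses relate, here is a small sketch that reuses the values from the training log above. It prints the gap between the two losses and the validation perplexity, assuming the reported loss is the mean token-level cross-entropy in nats, so that perplexity is simply `exp(loss)`.

```python
import math

# (step, training loss, validation loss) copied from the training log above;
# only the steps where a fresh training loss was logged are kept
log = [
    (100, 1.030300, 0.904401),
    (200, 0.866700, 0.874348),
    (300, 0.845400, 0.852138),
    (400, 0.826000, 0.840862),
    (500, 0.831200, 0.826222),
]

for step, train_loss, val_loss in log:
    gap = val_loss - train_loss      # positive: validation is worse than training
    perplexity = math.exp(val_loss)  # assumes a mean cross-entropy in nats
    print(f"step {step}: gap={gap:+.4f}  validation perplexity={perplexity:.2f}")

val_losses = [val for _, _, val in log]
still_improving = all(a > b for a, b in zip(val_losses, val_losses[1:]))
print(f"Validation loss still decreasing: {still_improving}")
```

A validation loss that starts rising while the training loss keeps falling would be the usual signal to stop earlier or to add more data.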

In [19]:
# Test the fine-tuned model 
prompt = "what is the meaning of life ?"

# Format with template
messages = [{"role": "user", "content": prompt}]
formatted_prompt = tokenizer.apply_chat_template(messages, tokenize=False)

# Generate response
inputs = tokenizer(formatted_prompt, return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=500)
print("After training:")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

After training:
user
what is the meaning of life ?
the meaning of life is to live your life to the fullest.


# Conclusion

That's great: we completed our first supervised fine-tuning run in about 10 minutes!

Try playing with the hyperparameters, change the fine-tuning dataset, and see if the results are up to your expectations!