# SFT with SFTTrainer

In this notebook we will see how to fine-tune the base model **HuggingFaceTB/SmolLM2-135M** using SFTTrainer from the **trl** library.

In [1]:
# Authenticate to Hugging Face

from huggingface_hub import login
login()

In [None]:
# Import the required libraries
from transformers import AutoTokenizer, AutoModelForCausalLM
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer, setup_chat_format
import torch

device = ("cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu")
print(f"Using device: {device}")

# Load the model and its tokenizer
model_name = "HuggingFaceTB/SmolLM2-135M"  # base model 
model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path=model_name).to(device)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=model_name)

# Setup chat format
model, tokenizer = setup_chat_format(model, tokenizer)

Using device: cpu


In [3]:
print(model)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(49152, 576)
    (layers): ModuleList(
      (0-29): 30 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=576, out_features=576, bias=False)
          (k_proj): Linear(in_features=576, out_features=192, bias=False)
          (v_proj): Linear(in_features=576, out_features=192, bias=False)
          (o_proj): Linear(in_features=576, out_features=576, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=576, out_features=1536, bias=False)
          (up_proj): Linear(in_features=576, out_features=1536, bias=False)
          (down_proj): Linear(in_features=1536, out_features=576, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((576,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((576,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((576,), eps=1e-05)
    (rotary_emb): LlamaRotaryEm

In [4]:
print(tokenizer)

GPT2TokenizerFast(name_or_path='HuggingFaceTB/SmolLM2-135M', vocab_size=49152, model_max_length=8192, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|im_start|>', 'eos_token': '<|im_end|>', 'unk_token': '<|endoftext|>', 'pad_token': '<|im_end|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	0: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	3: AddedToken("<repo_name>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	4: AddedToken("<reponame>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	5: AddedToken("<file_sep>", rstrip=

In [5]:
print(tokenizer.chat_template)

{% for message in messages %}{{'<|im_start|>' + message['role'] + '
' + message['content'] + '<|im_end|>' + '
'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant
' }}{% endif %}


In [None]:
# Add a chat template only if the tokenizer does not already define one
if tokenizer.chat_template is None:
    tokenizer.chat_template = """{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}
{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}
{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"""

In [None]:
print(tokenizer.chat_template)

In [6]:
# Name under which the fine-tuned model will be saved and/or uploaded
finetune_name = "SmolLM2-FT-MyDataset"
finetune_tags = ["smol-course", "module_1"]

## Generate with the Base Model

Here we try generating with the base model, which has not been trained to follow a chat template.

In [7]:
prompt = "Write a haiku about programming"

# Format with template
messages = [
    {"role": "user", "content": prompt}
]

formatted_prompt = tokenizer.apply_chat_template(messages, tokenize=False)

print(f"Prompt: {formatted_prompt}")

Prompt: <|im_start|>user
Write a haiku about programming<|im_end|>



In [8]:
# Generate response
inputs = tokenizer(formatted_prompt, return_tensors="pt").to(device)
print(f"Inputs: {inputs}")
outputs = model.generate(**inputs, max_new_tokens=100)
print("Before training:")
print(tokenizer.decode(outputs[0], skip_special_tokens=False))

Inputs: {'input_ids': tensor([[    1,  4093,   198, 19161,   253,   421, 30614,   563,  6256,     2,
           198]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
Before training:
<|im_start|>user
Write a haiku about programming<|im_end|>
Write a haiku about programming
Write a haiku about programming
Write a haiku about programming
Write a haiku about programming
Write a haiku about programming
Write a haiku about programming
Write a haiku about programming
Write a haiku about programming
Write a haiku about programming
Write a haiku about programming
Write a haiku about programming
Write a haiku about programming
Write a haiku about programming
Write a haiku about programming
Write a


## Dataset Preparation

We load a dataset and format it for training. The dataset must be structured as input-output pairs, where each input is a prompt and each output is the response expected from the model.

**TRL formats the input messages according to the model's chat template.** Each conversation must be represented as a list of dictionaries with the keys `role` and `content`.
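
As a rough illustration of the expected structure, the sketch below checks that a conversation is a list of dictionaries with exactly the keys `role` and `content`. The helper `is_valid_conversation` and the set of allowed roles are assumptions made for this example; they are not part of TRL.

```python
from typing import Dict, List

# Assumption: the usual chat roles; adjust if your dataset uses others
ALLOWED_ROLES = {"system", "user", "assistant"}


def is_valid_conversation(messages: List[Dict[str, str]]) -> bool:
    """Return True if every message is a dict with a known role and string content."""
    if not isinstance(messages, list) or not messages:
        return False
    for message in messages:
        if not isinstance(message, dict):
            return False
        if set(message.keys()) != {"role", "content"}:
            return False
        if message["role"] not in ALLOWED_ROLES:
            return False
        if not isinstance(message["content"], str):
            return False
    return True


conversation = [
    {"role": "user", "content": "Hi there"},
    {"role": "assistant", "content": "Hello! How can I help you today?"},
]
print(is_valid_conversation(conversation))        # well-formed conversation
print(is_valid_conversation([{"role": "user"}]))  # missing the "content" key
```

A check like this can be run over a few rows of a custom dataset before handing it to the trainer.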

We will use the [smoltalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) dataset.

In [9]:
# Load dataset
from datasets import load_dataset

# TODO: define your own dataset and set the path and name parameters accordingly
ds = load_dataset(path="HuggingFaceTB/smoltalk", name="everyday-conversations")

In [10]:
print(ds)

DatasetDict({
    train: Dataset({
        features: ['full_topic', 'messages'],
        num_rows: 2260
    })
    test: Dataset({
        features: ['full_topic', 'messages'],
        num_rows: 119
    })
})


In [11]:
print(ds["train"][0])

{'full_topic': 'Travel/Vacation destinations/Beach resorts', 'messages': [{'content': 'Hi there', 'role': 'user'}, {'content': 'Hello! How can I help you today?', 'role': 'assistant'}, {'content': "I'm looking for a beach resort for my next vacation. Can you recommend some popular ones?", 'role': 'user'}, {'content': "Some popular beach resorts include Maui in Hawaii, the Maldives, and the Bahamas. They're known for their beautiful beaches and crystal-clear waters.", 'role': 'assistant'}, {'content': 'That sounds great. Are there any resorts in the Caribbean that are good for families?', 'role': 'user'}, {'content': 'Yes, the Turks and Caicos Islands and Barbados are excellent choices for family-friendly resorts in the Caribbean. They offer a range of activities and amenities suitable for all ages.', 'role': 'assistant'}, {'content': "Okay, I'll look into those. Thanks for the recommendations!", 'role': 'user'}, {'content': "You're welcome. I hope you find the perfect resort for your v

Every message in the `messages` list must be rendered with the model's chat template. If you fine-tune on raw messages or on unformatted text instead:

- the model cannot tell where the prompt starts or ends
- it can confuse the user and assistant roles
- it cannot learn the correct output sequence
- its behavior at inference time will be inconsistent
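
To make the role of the template concrete, here is a minimal pure-Python sketch that mirrors the ChatML template printed earlier in this notebook. The function `render_chatml` is a hypothetical helper written for illustration; in practice `tokenizer.apply_chat_template` does this work.

```python
from typing import Dict, List


def render_chatml(messages: List[Dict[str, str]], add_generation_prompt: bool = False) -> str:
    """Hypothetical helper that mirrors the ChatML template used in this notebook."""
    text = ""
    for message in messages:
        # Each turn is wrapped in explicit start and end markers
        text += "<|im_start|>" + message["role"] + "\n" + message["content"] + "<|im_end|>" + "\n"
    if add_generation_prompt:
        # Open an assistant turn so the model knows it should answer next
        text += "<|im_start|>assistant\n"
    return text


messages = [{"role": "user", "content": "Write a haiku about programming"}]
print(render_chatml(messages))
print(render_chatml(messages, add_generation_prompt=True))
```

The markers give the model an unambiguous signal for where each turn begins and ends, and the generation prompt marks the point where the assistant's reply should start.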

In [None]:
# Add a chat template only if the tokenizer does not already define one
if tokenizer.chat_template is None:
    tokenizer.chat_template = """{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}
{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}
{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"""

In [12]:
from pprint import pprint

example = ds["train"][0]["messages"]
formatted = tokenizer.apply_chat_template(example, tokenize=False)
pprint(formatted)

('<|im_start|>user\n'
 'Hi there<|im_end|>\n'
 '<|im_start|>assistant\n'
 'Hello! How can I help you today?<|im_end|>\n'
 '<|im_start|>user\n'
 "I'm looking for a beach resort for my next vacation. Can you recommend some "
 'popular ones?<|im_end|>\n'
 '<|im_start|>assistant\n'
 'Some popular beach resorts include Maui in Hawaii, the Maldives, and the '
 "Bahamas. They're known for their beautiful beaches and crystal-clear "
 'waters.<|im_end|>\n'
 '<|im_start|>user\n'
 'That sounds great. Are there any resorts in the Caribbean that are good for '
 'families?<|im_end|>\n'
 '<|im_start|>assistant\n'
 'Yes, the Turks and Caicos Islands and Barbados are excellent choices for '
 'family-friendly resorts in the Caribbean. They offer a range of activities '
 'and amenities suitable for all ages.<|im_end|>\n'
 '<|im_start|>user\n'
 "Okay, I'll look into those. Thanks for the recommendations!<|im_end|>\n"
 '<|im_start|>assistant\n'
 "You're welcome. I hope you find the perfect resort for your 

In [13]:
print(tokenizer.chat_template)

{% for message in messages %}{{'<|im_start|>' + message['role'] + '
' + message['content'] + '<|im_end|>' + '
'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant
' }}{% endif %}


## Configuring the SFTTrainer

The SFTTrainer is configured with several parameters that control the training process, including the number of training steps, the batch size, the learning rate, and the evaluation strategy. We adjust these parameters to fit our specific requirements and available compute.
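
One way to reason about these parameters is to translate a step budget into passes over the data. The sketch below assumes a single device, no gradient accumulation, and no sequence packing. It uses the values from this notebook: 2260 training rows, 1000 steps, batch size 4, evaluation every 50 steps, and a checkpoint every 100 steps. The helper names are made up for this example.

```python
def epochs_covered(max_steps: int, per_device_batch_size: int, num_examples: int,
                   num_devices: int = 1, gradient_accumulation_steps: int = 1) -> float:
    """Approximate number of passes over the training set for a given step budget."""
    examples_per_step = per_device_batch_size * num_devices * gradient_accumulation_steps
    return max_steps * examples_per_step / num_examples


def evals_and_checkpoints(max_steps: int, eval_steps: int, save_steps: int) -> tuple:
    """Number of evaluation runs and saved checkpoints over the whole training run."""
    return max_steps // eval_steps, max_steps // save_steps


print(round(epochs_covered(1000, 4, 2260), 2))  # → 1.77
print(evals_and_checkpoints(1000, 50, 100))     # → (20, 10)
```

Under these assumptions the run sees each training example a little under twice, which helps when deciding whether `max_steps` is too small or too large for the dataset.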

In [None]:
# Manual preprocessing of the dataset

# Drop 'messages' and 'full_topic' after formatting so that only the formatted
# column remains and the original 'messages' column cannot cause conflicts
ds_cleaned = ds.map(
    lambda x: {
        "formatted_chat": tokenizer.apply_chat_template(
            x["messages"], tokenize=False, add_generation_prompt=False
        )
    },
    remove_columns=["messages", "full_topic"]
)

# Check the result
print(ds_cleaned["train"]["formatted_chat"][0])
print(ds_cleaned["train"].column_names)
print(ds_cleaned)


<|im_start|>user
Hi there<|im_end|>
<|im_start|>assistant
Hello! How can I help you today?<|im_end|>
<|im_start|>user
I'm looking for a beach resort for my next vacation. Can you recommend some popular ones?<|im_end|>
<|im_start|>assistant
Some popular beach resorts include Maui in Hawaii, the Maldives, and the Bahamas. They're known for their beautiful beaches and crystal-clear waters.<|im_end|>
<|im_start|>user
That sounds great. Are there any resorts in the Caribbean that are good for families?<|im_end|>
<|im_start|>assistant
Yes, the Turks and Caicos Islands and Barbados are excellent choices for family-friendly resorts in the Caribbean. They offer a range of activities and amenities suitable for all ages.<|im_end|>
<|im_start|>user
Okay, I'll look into those. Thanks for the recommendations!<|im_end|>
<|im_start|>assistant
You're welcome. I hope you find the perfect resort for your vacation.<|im_end|>

['formatted_chat']
DatasetDict({
    train: Dataset({
        features: ['format

In [26]:
sft_config = SFTConfig(
    output_dir="./sft_output",
    max_steps=1000,  # Set according to the dataset size and the desired training duration
    per_device_train_batch_size=4,  # Set according to the available GPU memory
    learning_rate=5e-5,  # Common starting point for fine-tuning
    logging_steps=10,  # How often training metrics are logged
    save_steps=100,  # How often model checkpoints are saved
    eval_strategy="steps",  # Evaluate the model at regular intervals
    eval_steps=50,  # How often the model is evaluated
    use_mps_device=(device == "mps"),  # Only when running on a Mac with an MPS GPU
    hub_model_id=finetune_name,  # Model name on the Hugging Face Hub
    dataset_text_field="formatted_chat",  # Dataset column used for fine-tuning
)

# Initialize the trainer

trainer = SFTTrainer(
    model=model,
    args=sft_config,
    train_dataset=ds_cleaned["train"],
    eval_dataset=ds_cleaned["test"]
)

Converting train dataset to ChatML:   0%|          | 0/2260 [00:00<?, ? examples/s]

Adding EOS to train dataset:   0%|          | 0/2260 [00:00<?, ? examples/s]

Tokenizing train dataset:   0%|          | 0/2260 [00:00<?, ? examples/s]

Truncating train dataset:   0%|          | 0/2260 [00:00<?, ? examples/s]

Converting eval dataset to ChatML:   0%|          | 0/119 [00:00<?, ? examples/s]

Adding EOS to eval dataset:   0%|          | 0/119 [00:00<?, ? examples/s]

Tokenizing eval dataset:   0%|          | 0/119 [00:00<?, ? examples/s]

Truncating eval dataset:   0%|          | 0/119 [00:00<?, ? examples/s]

## Training the Model

With the trainer configured, we can now train the model. The training process iterates over the dataset, computes the loss, and updates the model's parameters to minimize it.
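
For intuition about the loss being minimized, the following is a simplified pure-Python sketch of next-token cross-entropy for a single sequence. It is not TRL's implementation: the real trainer works on batched tensors and handles details such as padding. The function name and the toy logits are assumptions for this example.

```python
import math
from typing import List


def next_token_loss(logits: List[List[float]], token_ids: List[int]) -> float:
    """Mean cross-entropy of predicting token t+1 from the logits at position t."""
    if len(token_ids) < 2 or len(logits) != len(token_ids):
        raise ValueError("need one logit vector per token and at least two tokens")
    losses = []
    # The logits at position t are scored against the token at position t + 1
    for position_logits, target in zip(logits[:-1], token_ids[1:]):
        max_logit = max(position_logits)  # subtract the max for numerical stability
        log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in position_logits))
        losses.append(log_sum_exp - position_logits[target])
    return sum(losses) / len(losses)


token_ids = [0, 2, 1]  # toy sequence over a vocabulary of three tokens
uniform_logits = [[0.0, 0.0, 0.0]] * 3
confident_logits = [[0.0, 0.0, 10.0], [0.0, 10.0, 0.0], [0.0, 0.0, 0.0]]

print(next_token_loss(uniform_logits, token_ids))    # equals log(3): the model is guessing
print(next_token_loss(confident_logits, token_ids))  # close to 0: correct tokens get high scores
```

Training nudges the parameters so that the logits look more like the second case on the formatted conversations.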

In [27]:
# Train the model
trainer.train()

# Save the model
trainer.save_model(f"./{finetune_name}")



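| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 50   | 1.1056        | 1.192877        |
| 100  | 1.1047        | 1.11477         |
| 150  | 1.0414        | 1.077789        |
| 200  | 1.0286        | 1.055065        |
| 250  | 1.0155        | 1.04609         |
| 300  | 1.0083        | 1.034463        |
| 350  | 0.9826        | 1.029706        |
| 400  | 0.9809        | 1.025916        |
| 450  | 0.9935        | 1.014373        |
| 500  | 1.0457        | 1.005272        |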
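
One common way to read these numbers is to convert the loss into perplexity. The sketch below assumes the reported validation loss is the mean per-token cross-entropy in nats, so perplexity is simply `exp(loss)`. The dictionary copies a few rows from the log above, and `perplexity` is a helper defined only for this example.

```python
import math


def perplexity(mean_cross_entropy: float) -> float:
    """Perplexity for a mean per-token cross-entropy expressed in nats."""
    return math.exp(mean_cross_entropy)


# Validation losses copied from the training log above (step -> loss)
validation_loss = {50: 1.192877, 250: 1.04609, 500: 1.005272}

for step, loss in validation_loss.items():
    print(f"step {step}: validation perplexity = {perplexity(loss):.3f}")
```

A steadily falling validation perplexity suggests the model is still improving on held-out conversations rather than only memorizing the training set.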




In [28]:
# Upload the model to the Hugging Face Hub
trainer.push_to_hub(
    tags=finetune_tags,
)

HfHubHTTPError: (Request ID: Root=1-682a4dc1-59b95bb16e56ffe23b2de408;3b7a37c1-c84b-4188-81e9-8cbfdc32c83c)

403 Forbidden: You don't have the rights to create a model under the namespace "felipe93".
Cannot access content at: https://huggingface.co/api/repos/create.
Make sure your token has the correct permissions.

In [36]:
# Test the fine-tuned model on the same prompt used before training
prompt = "Write a haiku about programming"

# Format with template
messages = [{"role": "user", "content": prompt}]
formatted_prompt = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")

print(f"Prompt: {formatted_prompt}")
print(tokenizer.decode(formatted_prompt[0]))

Prompt: tensor([[    1,  4093,   198, 19161,   253,   421, 30614,   563,  6256,     2,
           198,     1,   520,  9531,   198]])
<|im_start|>user
Write a haiku about programming<|im_end|>
<|im_start|>assistant



In [40]:
# Move the prompt to the model's device before generating
outputs = model.generate(formatted_prompt.to(device), max_new_tokens=128)
print("After training:")
print(tokenizer.decode(outputs[0]))

After training:
<|im_start|>user
Write a haiku about programming<|im_end|>
<|im_start|>assistant
I'm a bit confused about programming languages. What is a programming language? A programming language is a set of rules and instructions that computers understand. It's like a language that helps computers do things like process information, solve problems, and create programs.<|im_end|>
