# Sentence Splitter: Training

## Embedding Models Fine-Tuning

In this notebook, we fine-tune several embedding models (six in total) for sentence splitting,
using the training and validation sets provided by the homework assignment.

Install the libraries in the local virtual environment. 
We use specific versions to enforce reproducibility for this notebook.

In [1]:
!pip install --upgrade pip
!pip install torch==2.7.0 numpy==2.3.2 pandas==2.3.2 datasets==3.6.0 jupyter==1.1.1 transformers[torch]==4.56.1 evaluate==0.4.5



Import all required libraries for the training. 
We do this first to fail fast in case additional packages need to be installed in the virtual environment.

In [2]:
import os
import random
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from datasets import Dataset, DatasetDict, load_dataset
from transformers import (AutoTokenizer, AutoModelForTokenClassification, DataCollatorForTokenClassification, 
                          TrainingArguments, Trainer, pipeline)
from typing import Any, Iterable, Optional, Union
import evaluate

Optional (not required to run the notebook): to push the fine-tuned model to the Hugging Face Hub, set your access token in the `HF_TOKEN` environment variable.

Verify that a hardware accelerator is available. This notebook requires a GPU.

In [3]:
# os.environ['HF_TOKEN'] = 'PUT_YOUR_TOKEN_HERE'

torch.cuda.is_available()

True

Set up deterministic behavior for reproducible results by configuring random seeds for all relevant libraries:

In [4]:
def set_seed(seed=777, total_determinism=False):
    # Seed every random number generator used during training
    torch.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    if total_determinism:
        torch.use_deterministic_algorithms(True)
    random.seed(seed)
    np.random.seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
set_seed() # Set the seed for reproducibility -- use_deterministic_algorithms can make training slower :(
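
As a quick sanity check of what seeding provides, here is a minimal sketch that uses only Python's standard `random` module (an assumption made to keep the example small; the notebook also seeds NumPy and PyTorch). The helper `draw` is hypothetical. Reseeding with the same value replays the same sequence:

```python
import random

def draw(seed, n=5):
    """Seed the stdlib generator, then draw n pseudo-random integers."""
    random.seed(seed)
    return [random.randint(0, 99) for _ in range(n)]

first = draw(777)
second = draw(777)  # same seed: the sequence is replayed

print(first == second)  # True
```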

### Data Preparation and Alignment

In this section, we create a standard Hugging Face dataset from the CSV files: `data/manzoni_dev_tokens.csv` and `data/manzoni_train_tokens.csv`.

The output will be available at [fax4ever/manzoni-192](https://huggingface.co/datasets/fax4ever/manzoni-192).

The original CSV files contain text that must be split into chunks short enough to pass to the encoder model. Encoder models typically accept at most 512 tokens (e.g., BERT).
Since each word can produce one or more tokens, a simple strategy would be to pack each chunk up to the token limit, but this is not always optimal.
We therefore treat the number of words per input as our first hyperparameter.

In [5]:
SIZE = 192 # Number of words to put on each input of the encoder model
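
To see how a words-per-input setting relates to the token limit, here is a minimal sketch that measures the worst-case token count over fixed-size word chunks. The helper names and the toy token counter (one token per four characters) are assumptions for illustration; with a real tokenizer you would count its actual sub-tokens instead.

```python
def max_tokens_per_chunk(words, size, count_tokens):
    """Return the largest token count over consecutive chunks of `size` words."""
    chunks = [words[i:i + size] for i in range(0, len(words), size)]
    return max(sum(count_tokens(w) for w in chunk) for chunk in chunks)

# Hypothetical stand-in for a subword tokenizer: one token per 4 characters
def toy_count_tokens(word):
    return max(1, -(-len(word) // 4))  # ceiling division

words = ["quel", "ramo", "del", "lago", "di", "Como", "precipitevolissimevolmente"]
worst = max_tokens_per_chunk(words, size=4, count_tokens=toy_count_tokens)

# Assumption: the model adds two special tokens (start and end of sequence)
print(worst, worst + 2 <= 512)
```

If the worst case plus the special tokens stays below the model limit, no chunk is truncated.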

In the following code, we create the dataset and push it to the Hub.

Publishing to the Hugging Face Hub allows us to use standard APIs—one benefit of open standards.

In [6]:
def group_into_sequences(df, seq_len=SIZE):
    tokens = df['token'].tolist()
    labels = df['label'].tolist()
    
    # Group into sequences of seq_len
    token_seqs = [tokens[i:i+seq_len] for i in range(0, len(tokens), seq_len) if len(tokens[i:i+seq_len]) == seq_len]
    label_seqs = [labels[i:i+seq_len] for i in range(0, len(labels), seq_len) if len(labels[i:i+seq_len]) == seq_len]
    
    return {'tokens': token_seqs, 'labels': label_seqs}


train = pd.read_csv("../data/manzoni_train_tokens.csv")  # token,label
validation = pd.read_csv("../data/manzoni_dev_tokens.csv")  # token,label

# Group into sequences of SIZE
train_grouped = group_into_sequences(train)
validation_grouped = group_into_sequences(validation)

print(f"Train: {len(train_grouped['tokens'])} sequences of {SIZE} tokens each")
print(f"Validation: {len(validation_grouped['tokens'])} sequences of {SIZE} tokens each")

train_dataset = Dataset.from_dict(train_grouped)
validation_dataset = Dataset.from_dict(validation_grouped)

dataset_dict = DatasetDict({
    'train': train_dataset,
    'validation': validation_dataset  # Using 'validation' as the standard name
})

# Optionally publish the dataset to the hub
# dataset_dict.push_to_hub(f"fax4ever/manzoni-{SIZE}", token=os.getenv("HF_TOKEN"))

Train: 389 sequences of 192 tokens each
Validation: 48 sequences of 192 tokens each
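
Note that the grouping keeps only full-length sequences, so a shorter tail at the end of each file is discarded. A minimal sketch of the same slicing logic on toy data (the function name `chunk_fixed` and the toy input are illustrative):

```python
def chunk_fixed(items, seq_len):
    """Split items into consecutive chunks of exactly seq_len, dropping any shorter tail."""
    return [items[i:i + seq_len]
            for i in range(0, len(items), seq_len)
            if len(items[i:i + seq_len]) == seq_len]

toy_tokens = list("abcdefghij")  # 10 toy "words"
chunks = chunk_fixed(toy_tokens, seq_len=4)
dropped = len(toy_tokens) - sum(len(c) for c in chunks)

print(chunks)   # [['a', 'b', 'c', 'd'], ['e', 'f', 'g', 'h']]
print(dropped)  # 2
```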


Alternatively, simply load the dataset from Hugging Face:

In [7]:
dataset_dict = load_dataset(f"fax4ever/manzoni-{SIZE}")

During tokenization, each word in an input may become one or more tokens.
First, we define constants that mirror the convention in the CSV files: a label of 1 marks the end of a sentence (so the next word starts a new one), and 0 is used everywhere else. The special tokens that mark the start and end of a sequence are labeled 0.

In [8]:
END_OF_SENTENCE = 1
NOT_END_OF_SENTENCE = 0
LABEL_FOR_START_END_OF_SEQUENCE = NOT_END_OF_SENTENCE

Choose a base embedding model to classify tokens as 0 or 1 (`num_labels=2`).

Since the dataset is in Italian, consider multilingual models or models trained on Italian corpora:

1. 🚀 ModernBERT-base-ita (most recent – Dec 2024)
   - `DeepMount00/ModernBERT-base-ita`

2. 🇮🇹 Italian BERT XXL (most established)
   - `dbmdz/bert-base-italian-xxl-cased`

3. 🌍 XLM-RoBERTa (best multilingual)
   - `FacebookAI/xlm-roberta-base`
   - `FacebookAI/xlm-roberta-large`

4. 🔬 Italian ELECTRA (alternative architecture)
   - `dbmdz/electra-base-italian-xxl-cased-discriminator`

Because we classify tokens rather than whole sequences, we use `AutoModelForTokenClassification` instead of `AutoModelForSequenceClassification`.

In [9]:
EMBEDDING_MODEL = "DeepMount00/ModernBERT-base-ita"
MODEL_NAME = "ModernBERT-base-ita"

model = AutoModelForTokenClassification.from_pretrained(EMBEDDING_MODEL, num_labels=2)

Some weights of ModernBertForTokenClassification were not initialized from the model checkpoint at DeepMount00/ModernBERT-base-ita and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


The original dataset provides labels for each word. When we tokenize the texts, we need to align the labels to the generated tokens.
We keep label 1 for the first token of any word labeled 1 and use 0 for all other tokens. This preserves the original count of 1s per input.

In [10]:
tokenizer = AutoTokenizer.from_pretrained(EMBEDDING_MODEL)

def tokenize_and_align_labels(items):
    tokenized_inputs = tokenizer(
        items["tokens"], is_split_into_words=True
    )

    all_labels = items["labels"]
    new_labels = []
    for i, labels in enumerate(all_labels):
        word_ids = tokenized_inputs.word_ids(i)
        new_labels.append(align_labels_with_tokens(labels, word_ids))

    tokenized_inputs["labels"] = new_labels
    return tokenized_inputs

def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # Start of a new word!
            current_word = word_id
            label = LABEL_FOR_START_END_OF_SEQUENCE if word_id is None else labels[word_id]
            new_labels.append(label)
        else:
            # Continuation sub-token of the same word: never an end of sentence
            new_labels.append(NOT_END_OF_SENTENCE)
    return new_labels

tokenized_dataset_dict = dataset_dict.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=dataset_dict["train"].column_names,
    batch_size=128,
)
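
To make the alignment rule concrete, here is a condensed restatement of it applied to a hand-written example. The `word_ids` list is an assumed sub-token split (a real tokenizer may split these words differently); `None` stands for a special token.

```python
NOT_END_OF_SENTENCE = 0

def align(labels, word_ids):
    """First sub-token of a word inherits the word label; everything else gets 0."""
    aligned, current = [], None
    for word_id in word_ids:
        if word_id is not None and word_id != current:
            aligned.append(labels[word_id])
        else:
            # special token, or continuation of the current word
            aligned.append(NOT_END_OF_SENTENCE)
        current = word_id
    return aligned

words = ["Addio", ".", "Monti"]
labels = [0, 1, 0]                         # "." ends the sentence
word_ids = [None, 0, 0, 1, 2, 2, None]     # hypothetical split into sub-tokens

aligned = align(labels, word_ids)
print(aligned)                       # [0, 0, 0, 1, 0, 0, 0]
print(sum(aligned) == sum(labels))   # True: the number of 1s is preserved
```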

### Training

The dataset is highly imbalanced: most labels are 0 and few are 1. Accuracy is not a suitable metric here—for example, predicting only 0 yields high accuracy.

We therefore select the best model based on F1 score on the validation set.
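
A small worked example illustrates the point. It uses 1,000 synthetic labels with the proportions reported for this dataset (about 96.7% class 0 and 3.3% class 1) and a trivial predictor that always outputs 0. The helper functions are written here for illustration and are not part of the notebook.

```python
def accuracy(preds, refs):
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)

def f1_binary(preds, refs, positive=1):
    tp = sum(p == positive and r == positive for p, r in zip(preds, refs))
    fp = sum(p == positive and r != positive for p, r in zip(preds, refs))
    fn = sum(p != positive and r == positive for p, r in zip(preds, refs))
    if tp == 0:
        return 0.0  # no true positives: F1 is zero by convention
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

refs = [0] * 967 + [1] * 33        # synthetic labels, roughly 30:1
always_zero = [0] * len(refs)      # predictor that never splits

print(accuracy(always_zero, refs))   # 0.967
print(f1_binary(always_zero, refs))  # 0.0
```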

In [11]:
# as naming convention we use the embedding model name + "-sentence-splitter"
# in this way we can easily identify the model we used to train the sentence splitter
trained_model_name = MODEL_NAME + "-sentence-splitter"

training_arguments = TrainingArguments(
    trained_model_name,
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=30,
    weight_decay=0.01,
    # Optionally, push to the hub the fine-tuned model
    # push_to_hub=True,
    # hub_token=os.environ['HF_TOKEN'],
    load_best_model_at_end=True, # Reload the checkpoint with the best F1 once training ends (this does not stop training early)
    metric_for_best_model="f1" # Computed on the validation set
)

We also use a weighted cross‑entropy loss during training.
Misclassifying class 1 (end of sentence) is penalized 30× more than class 0. The weight (30) is another hyperparameter, motivated by the label distribution (≈96.7% class 0 vs ≈3.3% class 1).

In [12]:
class WeightedTrainer(Trainer):
    def compute_loss(self, model: nn.Module, inputs: dict[str, Union[torch.Tensor, Any]], return_outputs: bool = False, num_items_in_batch: Optional[torch.Tensor] = None):
        labels = inputs.pop("labels")
        # forward pass
        outputs = model(**inputs)
        logits = outputs.get("logits")
        # Weighted cross-entropy over the 2 labels: sentence endings (class 1)
        # count 30x more than class 0, reflecting the label distribution
        # (about 96.7% class 0 vs 3.3% class 1, i.e. roughly 30:1)
        loss_fct = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 30.0], device=model.device))
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
        # Note: CrossEntropyLoss already averages over tokens (reduction="mean"),
        # so this division rescales the training loss a second time
        if num_items_in_batch is not None:
            loss = loss / num_items_in_batch
        return (loss, outputs) if return_outputs else loss
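
To show what the class weights do, here is a pure-Python sketch of a weighted cross-entropy. It assumes the weighted-mean form, sum(w_y * loss) / sum(w_y), that PyTorch documents for `CrossEntropyLoss` with `weight` and the default mean reduction, and it skips positions labeled -100. The function name and toy logits are illustrative only.

```python
import math

def weighted_cross_entropy(logits, labels, class_weights, ignore_index=-100):
    """Weighted mean of per-token cross-entropy: sum(w_y * loss) / sum(w_y)."""
    total, weight_sum = 0.0, 0.0
    for token_logits, label in zip(logits, labels):
        if label == ignore_index:
            continue  # padded positions do not contribute
        log_norm = math.log(sum(math.exp(z) for z in token_logits))
        token_loss = log_norm - token_logits[label]  # -log softmax(label)
        total += class_weights[label] * token_loss
        weight_sum += class_weights[label]
    return total / weight_sum

logits = [[2.0, 0.0], [2.0, 0.0], [0.0, 0.0]]  # the model leans towards class 0
labels = [0, 1, -100]                          # second token is a missed sentence end

plain = weighted_cross_entropy(logits, labels, [1.0, 1.0])
weighted = weighted_cross_entropy(logits, labels, [1.0, 30.0])
print(plain < weighted)  # True: the missed sentence end dominates the weighted loss
```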

After each epoch, we compute the F1 score on the validation set, skipping padded positions (label -100).

In [13]:
def compute_f1_metric(predictions: Iterable[Iterable], references: Iterable[Iterable]) -> dict:
    metric = evaluate.load("f1", average="binary")

    assert len(predictions) == len(references)
    for i, prediction_batch in enumerate(predictions):
        reference_batch = references[i]
        
        prediction_valid_batch = []
        reference_valid_batch = []
        trailing = False

        assert len(prediction_batch) == len(reference_batch)
        for prediction, reference in zip(prediction_batch, reference_batch):
            if reference == -100 or reference == '-100':
                trailing = True
                continue
            else:
                assert not trailing
                reference_valid_batch.append(reference)
                prediction_valid_batch.append(prediction)

        metric.add_batch(predictions=prediction_valid_batch, references=reference_valid_batch)
    return metric.compute()

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)

    return compute_f1_metric(predictions=predictions, references=labels)
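
The masking step can be sketched without any library: positions whose reference label is -100 (padding added by the data collator) are dropped before counting, so a prediction made on padding is never scored. The helper `drop_padding` and the toy batches below are illustrative, not part of the notebook.

```python
def drop_padding(predictions, references, pad_label=-100):
    """Flatten batches, keeping only positions whose reference is a real label."""
    kept_preds, kept_refs = [], []
    for pred_row, ref_row in zip(predictions, references):
        for pred, ref in zip(pred_row, ref_row):
            if ref != pad_label:
                kept_preds.append(pred)
                kept_refs.append(ref)
    return kept_preds, kept_refs

predictions = [[0, 1, 0, 1], [0, 0, 1, 0]]
references = [[0, 1, 0, -100], [0, 1, 1, -100]]  # last position of each row is padding

preds, refs = drop_padding(predictions, references)
tp = sum(p == 1 and r == 1 for p, r in zip(preds, refs))
fp = sum(p == 1 and r == 0 for p, r in zip(preds, refs))
fn = sum(p == 0 and r == 1 for p, r in zip(preds, refs))
f1 = 2 * tp / (2 * tp + fp + fn)

print(preds)  # [0, 1, 0, 0, 0, 1]
print(f1)     # 0.8
```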

Train the model. Optionally, push the per-epoch metrics and the final trained model to the Hugging Face Hub so that the model can be used for inference.

In [14]:
data_collator = DataCollatorForTokenClassification(tokenizer)

trainer = WeightedTrainer(
    model=model,
    args=training_arguments,
    train_dataset=tokenized_dataset_dict["train"],
    eval_dataset=tokenized_dataset_dict["validation"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    processing_class=tokenizer,
)
trainer.train()
# trainer.push_to_hub(commit_message="Training complete", token=os.environ['HF_TOKEN'])

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': None, 'bos_token_id': None}.


| Epoch | Training Loss | Validation Loss | F1 |
|---|---|---|---|
| 1 | No log | 0.008159 | 0.962179 |
| 2 | No log | 0.00424 | 0.969605 |
| 3 | No log | 0.004161 | 0.978528 |
| 4 | No log | 0.003743 | 0.989147 |
| 5 | No log | 0.003536 | 0.987616 |
| 6 | No log | 0.002984 | 0.989147 |
| 7 | No log | 0.00579 | 0.990683 |
| 8 | No log | 0.009161 | 0.989114 |
| 9 | No log | 0.009182 | 0.990654 |
| 10 | No log | 0.009585 | 0.990654 |

TrainOutput(global_step=1470, training_loss=0.003379223146000687, metrics={'train_runtime': 995.8273, 'train_samples_per_second': 11.719, 'train_steps_per_second': 1.476, 'total_flos': 2803818118030440.0, 'train_loss': 0.003379223146000687, 'epoch': 30.0})
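
The reported `global_step` can be cross-checked with a little arithmetic. The sketch below assumes the Trainer's default per-device batch size of 8 (no batch size is set in the training arguments) and a single GPU; the variable names are illustrative.

```python
import math

train_sequences = 389   # from the "Train: 389 sequences" output
batch_size = 8          # assumption: default per-device batch size, one GPU
epochs = 30             # num_train_epochs in TrainingArguments

steps_per_epoch = math.ceil(train_sequences / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)  # 49 1470
```

Under these assumptions the total matches `global_step=1470`, which is consistent with all 30 epochs having run.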

### Inference

This is just a basic test. For more complete inference examples, see the inference notebooks:

1. colabs/sentence_splitter_out_of_domain_eval_discriminative.ipynb
2. colabs/sentence_splitter_out_of_domain_test_discriminative.ipynb
3. colabs/sentence_splitter_out_of_domain_test_generative.ipynb

Define an inference pipeline using the model deployed on the Hub.

In [15]:
model_checkpoint = "fax4ever/" + trained_model_name
inference_pipeline = pipeline("token-classification", model=model_checkpoint, 
                              aggregation_strategy="simple")

Device set to use cuda:0


Then pass any text to it:

In [16]:
text = """Non era un legno di lusso, ma un semplice pezzo
da catasta, di quelli che d’inverno si mettono nelle
stufe e nei caminetti per accendere il fuoco e per riscaldare le stanze.
Non so come andasse, ma il fatto gli è che un bel
giorno questo pezzo di legno capitò nella bottega
di un vecchio falegname, il quale aveva nome mastr’Antonio, se non che tutti lo chiamavano maestro
Ciliegia, per via della punta del suo naso, che era
sempre lustra e paonazza, come una ciliegia matura.
Appena maestro Ciliegia ebbe visto quel pezzo di
legno, si rallegrò tutto; e dandosi una fregatina di
mani per la contentezza, borbottò a mezza voce:
"Questo legno è capitato a tempo; voglio servirmene per fare una gamba di tavolino." 
"""
text = text.splitlines()
text = " ".join(text)

inference_pipeline(text)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


[{'entity_group': 'LABEL_0',
  'score': np.float32(0.9991683),
  'word': 'Non era un legno di lusso, ma un semplice pezzo da catasta, di quelli che d’inverno si mettono nelle stufe e nei caminetti per accendere il fuoco e per riscaldare le stanze',
  'start': 0,
  'end': 172},
 {'entity_group': 'LABEL_1',
  'score': np.float32(0.99892277),
  'word': '.',
  'start': 172,
  'end': 173},
 {'entity_group': 'LABEL_0',
  'score': np.float32(0.9989502),
  'word': ' Non so come andasse, ma il fatto gli è che un bel giorno questo pezzo di legno capitò nella bottega di un vecchio falegname, il quale aveva nome mastr’Antonio, se non che tutti lo chiamavano maestro Ciliegia, per via della punta del suo naso, che era sempre lustra e paonazza, come una ciliegia matura',
  'start': 173,
  'end': 475},
 {'entity_group': 'LABEL_1',
  'score': np.float32(0.9999937),
  'word': '.',
  'start': 475,
  'end': 476},
 {'entity_group': 'LABEL_0',
  'score': np.float32(0.9988344),
  'word': ' Appena maestro Cil