# Sentence Splitter using an Embedding Model

Install the required libraries in the virtual environment:

In [None]:
!pip install --upgrade pip
!pip install torch numpy pandas datasets transformers evaluate seqeval accelerate jupyter

Let's import everything we need up front, so that we fail fast if something is still missing from the virtual environment:

In [None]:
import os
import random
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from datasets import Dataset, DatasetDict, load_dataset
from transformers import (AutoTokenizer, AutoModelForTokenClassification, DataCollatorForTokenClassification, 
                          TrainingArguments, Trainer, pipeline)
from typing import Union, Any, Optional
import evaluate

First of all, let's check whether a CUDA accelerator is available:

In [None]:
torch.cuda.is_available()

Before doing anything else, let's make this run as deterministic as possible:

In [None]:
def set_seed(seed=777, total_determinism=False):
    # Seed every random number generator we rely on
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    # Ask cuDNN for deterministic kernels and disable its autotuner
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    if total_determinism:
        # Stricter setting: it can make training slower :(
        torch.use_deterministic_algorithms(True)

set_seed() # Set the seed for reproducibility
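
A quick way to see what seeding buys us is to draw a few numbers twice from the same seed. The sketch below uses only the standard library `random` module; `draw` is a hypothetical helper written for this illustration, and the same idea applies to the NumPy and PyTorch generators seeded above.

```python
import random

def draw(seed, n=3):
    """Re-seed the generator and return n pseudo-random floats."""
    random.seed(seed)
    return [random.random() for _ in range(n)]

first = draw(777)
second = draw(777)
other = draw(778)

print(first == second)  # same seed, same sequence
print(first == other)   # different seed, different sequence
```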

## PART ONE: Create the dataset

In this section we're going to create a standard Hugging Face dataset from the `CSV` files: `data/manzoni_dev_tokens.csv` and `data/manzoni_train_tokens.csv`.

The output will be available at [fax4ever/manzoni-192](https://huggingface.co/datasets/fax4ever/manzoni-192).

## Our first hyperparameter

The original CSV files contain one long text that has to be split into portions small enough to be passed as input to the encoder model. The maximum number of tokens an encoder accepts is typically 512 (this is the case for BERT, for instance).
Keep in mind that the model's tokenizer generally produces one or more tokens for each word of the text.
One strategy would be to split the text so that each input uses as many tokens as possible, but this does not always give the best results.
So the number of words we put in each input is our first hyperparameter.

In [None]:
SIZE = 192 # Number of words to put on each input of the encoder model
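
As a rough sanity check on this choice, the arithmetic below estimates how many tokens a given number of words may turn into. It is only a sketch: the tokens-per-word ratios are hypothetical illustration values, not measurements on this dataset, and the two special tokens assume a BERT-style tokenizer. The helper names are made up for this example.

```python
import math

MAX_MODEL_TOKENS = 512  # typical encoder limit mentioned above
SPECIAL_TOKENS = 2      # assumption: one start and one end token per input

def estimated_tokens(num_words, tokens_per_word):
    """Estimated input length in tokens for a given number of words."""
    return math.ceil(num_words * tokens_per_word) + SPECIAL_TOKENS

def max_words_per_input(tokens_per_word, max_tokens=MAX_MODEL_TOKENS):
    """Largest word count whose estimated token count still fits."""
    return int((max_tokens - SPECIAL_TOKENS) // tokens_per_word)

for ratio in (1.5, 2.0):  # hypothetical average tokens per word
    print(ratio, estimated_tokens(192, ratio), max_words_per_input(ratio))
```

Under these assumed ratios, 192 words stay well below the 512-token limit, which leaves headroom for words that split into many sub-tokens.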

In the following code, we create the dataset.

The result is published as a Hugging Face dataset, so the standard Hugging Face APIs can be applied to it.
That is the benefit of following an open standard!

In [None]:
def group_into_sequences(df, seq_len=SIZE):
    tokens = df['token'].tolist()
    labels = df['label'].tolist()
    
    # Group into sequences of exactly seq_len items; a shorter trailing remainder is dropped
    token_seqs = [tokens[i:i+seq_len] for i in range(0, len(tokens), seq_len) if len(tokens[i:i+seq_len]) == seq_len]
    label_seqs = [labels[i:i+seq_len] for i in range(0, len(labels), seq_len) if len(labels[i:i+seq_len]) == seq_len]
    
    return {'tokens': token_seqs, 'labels': label_seqs}


train = pd.read_csv("data/manzoni_train_tokens.csv")  # token,label
validation = pd.read_csv("data/manzoni_dev_tokens.csv")  # token,label

# Group into sequences of SIZE
train_grouped = group_into_sequences(train)
validation_grouped = group_into_sequences(validation)

print(f"Train: {len(train_grouped['tokens'])} sequences of {SIZE} tokens each")
print(f"Validation: {len(validation_grouped['tokens'])} sequences of {SIZE} tokens each")

train_dataset = Dataset.from_dict(train_grouped)
validation_dataset = Dataset.from_dict(validation_grouped)

dataset_dict = DatasetDict({
    'train': train_dataset,
    'validation': validation_dataset  # Using 'validation' as the standard name
})
dataset_dict.push_to_hub(f"fax4ever/manzoni-{SIZE}", token=os.getenv("HF_TOKEN"))

Alternatively, we can simply load the resulting dataset from the Hugging Face Hub:

In [None]:
dataset_dict = load_dataset(f"fax4ever/manzoni-{SIZE}")
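
To make the grouping step concrete, here is a toy run of the same chunking idea on ten made-up words with a sequence length of 4. `chunk_full` is a hypothetical stand-in for the list comprehensions in `group_into_sequences`; like them, it keeps only full-length chunks, so the last two words are discarded.

```python
def chunk_full(items, seq_len):
    """Split items into consecutive chunks of exactly seq_len, dropping a shorter tail."""
    return [items[i:i + seq_len] for i in range(0, len(items) - seq_len + 1, seq_len)]

words = [f"w{i}" for i in range(10)]  # made-up words w0 .. w9
chunks = chunk_full(words, 4)

for chunk in chunks:
    print(chunk)

# Words that did not make it into any chunk
dropped = words[len(chunks) * 4:]
print("dropped:", dropped)
```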

## PART TWO: Tokenize the dataset

During tokenization, each word of each input becomes one or more tokens.
First of all, we need to define some constants.
The constants reflect the convention used implicitly in the CSV files: 1 marks a sentence boundary (the end of one sentence and the beginning of the next), while 0 marks every other token. The special tokens that mark the start and end of each encoded input sequence are labeled 0 as well.

In [None]:
END_OF_SENTENCE = 1
NOT_END_OF_SENTENCE = 0
LABEL_FOR_START_END_OF_SEQUENCE = NOT_END_OF_SENTENCE

We need to choose a base embedding model, which will be fine-tuned to classify each token as 0 or 1.
This means we will use `num_labels=2`.

Since the dataset is in Italian, we would like to test models
that are multilingual or trained on an Italian corpus, such as:

1. 🚀 ModernBERT-base-ita (Most Recent - Dec 2024)
  * `DeepMount00/ModernBERT-base-ita`

2. 🇮🇹 Italian BERT XXL (Most Established): 
  * `dbmdz/bert-base-italian-xxl-cased`

3. 🌍 XLM-RoBERTa (Best Multilingual)
  * `FacebookAI/xlm-roberta-base`
  * `FacebookAI/xlm-roberta-large`

4. 🔬 Italian ELECTRA (Alternative Architecture): 
  * `dbmdz/electra-base-italian-xxl-cased-discriminator`

Finally, since we're not going to classify the input texts but the tokens of the input text, we use the `AutoModelForTokenClassification` from the Hugging Face APIs, instead of `AutoModelForSequenceClassification`!

In [None]:
EMBEDDING_MODEL = "bert-base-cased"

model = AutoModelForTokenClassification.from_pretrained(EMBEDDING_MODEL, num_labels=2)

The original dataset provides a label for each word. When we tokenize the texts, we need to align the labels with the generated tokens.
The algorithm is pretty straightforward: the first token generated from a word keeps that word's label (so it is 1 only if the word is labeled 1), and every other token gets 0.
With this strategy, the number of 1s in each input text stays the same.

In [None]:
tokenizer = AutoTokenizer.from_pretrained(EMBEDDING_MODEL)

def tokenize_and_align_labels(items):
    tokenized_inputs = tokenizer(
        items["tokens"], is_split_into_words=True
    )

    all_labels = items["labels"]
    new_labels = []
    for i, labels in enumerate(all_labels):
        word_ids = tokenized_inputs.word_ids(i)
        new_labels.append(align_labels_with_tokens(labels, word_ids))

    tokenized_inputs["labels"] = new_labels
    return tokenized_inputs

def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # Start of a new word!
            current_word = word_id
            label = LABEL_FOR_START_END_OF_SEQUENCE if word_id is None else labels[word_id]
            new_labels.append(label)
        else:
            # Continuation token of the same word: never an end of sentence
            new_labels.append(NOT_END_OF_SENTENCE)
    return new_labels

tokenized_dataset_dict = dataset_dict.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=dataset_dict["train"].column_names,
    batch_size=128,
)
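
The alignment rule can be checked on a tiny hand-written example. The sketch below does not call a real tokenizer: the `word_ids` list is a made-up example of what a fast tokenizer might return for three words (with `None` for the special tokens), and `align` is a compact, hypothetical re-statement of the rule described above.

```python
NOT_END_OF_SENTENCE = 0
LABEL_FOR_SPECIAL_TOKENS = NOT_END_OF_SENTENCE

def align(labels, word_ids):
    """First sub-token of a word keeps the word's label; every other token gets 0."""
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(LABEL_FOR_SPECIAL_TOKENS)  # start/end of sequence
        elif word_id != previous:
            aligned.append(labels[word_id])           # first sub-token of a word
        else:
            aligned.append(NOT_END_OF_SENTENCE)       # continuation sub-token
        previous = word_id
    return aligned

word_labels = [0, 1, 0]                        # the second word ends a sentence
word_ids = [None, 0, 0, 1, 1, 1, 2, None]      # made-up sub-token to word mapping
token_labels = align(word_labels, word_ids)

print(token_labels)
print(sum(token_labels) == sum(word_labels))   # the number of 1s is preserved
```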

## PART THREE: Training

The first thing to notice is that the dataset is very unbalanced in terms of label distribution: most labels are 0 and only a few are 1.
In this case accuracy is not a suitable metric for measuring the quality of our classifier.
For instance, always returning 0 would yield a high accuracy.
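
The effect is easy to reproduce with synthetic labels. The sketch below builds 1000 made-up labels with the roughly 96.7% / 3.3% split reported later in this notebook and scores a classifier that always answers 0. `binary_scores` is a hypothetical helper for this illustration, not the metric code used in training.

```python
def binary_scores(y_true, y_pred):
    """Accuracy plus precision, recall and F1 for the positive class (1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(y_true), "precision": precision,
            "recall": recall, "f1": f1}

y_true = [1] * 33 + [0] * 967   # synthetic labels, about 3.3% positives
always_zero = [0] * len(y_true)

scores = binary_scores(y_true, always_zero)
print(scores)
```

The accuracy looks excellent while recall and F1 are zero, which is why accuracy alone says little here.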

The first measure we adopted was to use F1 as the metric for selecting the best model among those produced at the end of each epoch.

In [None]:
# as naming convention we use the embedding model name + "-sentence-splitter"
# in this way we can easily identify the model we used to train the sentence splitter
trained_model_name = EMBEDDING_MODEL + "-sentence-splitter"

training_arguments = TrainingArguments(
    trained_model_name,
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=True,
    hub_token=os.environ['HF_TOKEN'],
    load_best_model_at_end=True, # Reload the checkpoint with the best F1 once training ends
    metric_for_best_model="f1" # Computed on the validation set
)

The second measure was to use a weighted cross-entropy loss as the loss function for backpropagation.
With this loss, misclassified 1s count 30 times more than misclassified 0s.
The factor (30) is another hyperparameter. Here it comes from the class distribution: 96.7% of the labels are class 0 and 3.3% are class 1, a ratio of roughly 30:1.

In [None]:
class WeightedTrainer(Trainer):
    def compute_loss(self, model: nn.Module, inputs: dict[str, Union[torch.Tensor, Any]], return_outputs: bool = False, num_items_in_batch: Optional[torch.Tensor] = None):
        labels = inputs.pop("labels")
        # forward pass
        outputs = model(**inputs)
        logits = outputs.get("logits")
        # Weighted cross entropy for 2 labels: sentence endings get 30x more importance
        # Based on our data: 96.7% class 0, 3.3% class 1 → ~30:1 ratio
        class_weights = torch.tensor([1.0, 30.0], device=logits.device)
        # When the Trainer passes num_items_in_batch, sum the per-token losses and
        # normalize by that count ourselves, so the loss is not averaged twice
        reduction = "sum" if num_items_in_batch is not None else "mean"
        loss_fct = nn.CrossEntropyLoss(weight=class_weights, reduction=reduction)
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
        if num_items_in_batch is not None:
            loss = loss / num_items_in_batch
        return (loss, outputs) if return_outputs else loss
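
To see what the class weights do, the sketch below recomputes a weighted cross entropy by hand in plain Python on two made-up examples. It mirrors how `nn.CrossEntropyLoss(weight=...)` behaves with its default mean reduction, where the weighted per-example losses are divided by the sum of the weights of the targets. The logits and the helper name are illustrative assumptions.

```python
import math

def weighted_cross_entropy(logits, labels, weights):
    """Weighted mean of per-example negative log-likelihoods."""
    total, norm = 0.0, 0.0
    for row, y in zip(logits, labels):
        log_norm = math.log(sum(math.exp(z) for z in row))
        nll = log_norm - row[y]      # -log softmax(row)[y]
        total += weights[y] * nll
        norm += weights[y]
    return total / norm

# Two made-up tokens, both scored as "probably class 0" by the model
logits = [[2.0, 0.0], [2.0, 0.0]]
labels = [0, 1]                      # the second one is really a sentence end

unweighted = weighted_cross_entropy(logits, labels, [1.0, 1.0])
weighted = weighted_cross_entropy(logits, labels, [1.0, 30.0])
ratio = 96.7 / 3.3                   # class ratio behind the factor 30

print(round(unweighted, 4), round(weighted, 4), round(ratio, 1))
```

With the 30x weight, the missed sentence end dominates the loss instead of being averaged away by the easy class-0 token.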

For each epoch we want to compute the metrics (precision, recall, f1 and accuracy).

In [None]:
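# loaded outside the compute_metrics function to avoid re-loading it at each epoch
metric = evaluate.load("seqeval")

# seqeval expects IOB-style string tags; these tag names are our own choice for the two classes
LABEL_NAMES = ["O", "B-EOS"]

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)

    # Skip positions labeled -100 (padding added by the data collator) and map ids to tags
    true_labels = [[LABEL_NAMES[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [LABEL_NAMES[p] for p, l in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    all_metrics = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": all_metrics["overall_precision"],
        "recall": all_metrics["overall_recall"],
        "f1": all_metrics["overall_f1"],
        "accuracy": all_metrics["overall_accuracy"],
    }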

We push the metrics of every epoch and the final trained model to the Hugging Face Hub, so that the resulting model can be used for inference.

In [None]:
data_collator = DataCollatorForTokenClassification(tokenizer)

trainer = WeightedTrainer(
    model=model,
    args=training_arguments,
    train_dataset=tokenized_dataset_dict["train"],
    eval_dataset=tokenized_dataset_dict["validation"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    processing_class=tokenizer,
)
trainer.train()
trainer.push_to_hub(commit_message="Training complete", token=os.environ['HF_TOKEN'])

## PART FOUR: Inference

We define an inference pipeline using the model deployed on the hub.

In [None]:
model_checkpoint = "fax4ever/" + trained_model_name
inference_pipeline = pipeline("token-classification", model=model_checkpoint, 
                              aggregation_strategy="simple")

And we pass any text to it:

In [None]:
text = """
    Non era un legno di lusso, ma un semplice pezzo
    da catasta, di quelli che d’inverno si mettono nelle
    stufe e nei caminetti per accendere il fuoco e per riscaldare le stanze.
    Non so come andasse, ma il fatto gli è che un bel
    giorno questo pezzo di legno capitò nella bottega
    di un vecchio falegname, il quale aveva nome mastr’Antonio, se non che tutti lo chiamavano maestro
    Ciliegia, per via della punta del suo naso, che era
    sempre lustra e paonazza, come una ciliegia matura.
    Appena maestro Ciliegia ebbe visto quel pezzo di
    legno, si rallegrò tutto; e dandosi una fregatina di
    mani per la contentezza, borbottò a mezza voce:
    "Questo legno è capitato a tempo; voglio servirmene per fare una gamba di tavolino." 
"""

inference_pipeline(text)
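
The pipeline returns a list of dictionaries rather than ready-made sentences. The sketch below shows one possible way to turn such output into sentences. It runs on a hand-written mock of the output, so it makes several assumptions: that each item carries `entity_group`, `start` and `end` keys (as the token-classification pipeline does with an aggregation strategy), and that class 1 is exposed under the default name `LABEL_1`. `split_sentences` is a hypothetical helper, not part of the library.

```python
def split_sentences(text, predictions, boundary_label="LABEL_1"):
    """Cut text after every span predicted as a sentence boundary."""
    sentences, start = [], 0
    for pred in predictions:
        if pred["entity_group"] != boundary_label:
            continue
        end = pred["end"]
        # The span may cover only the first sub-token: extend to the end of the word
        while end < len(text) and not text[end].isspace():
            end += 1
        sentence = text[start:end].strip()
        if sentence:
            sentences.append(sentence)
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

sample = "Ciao mondo. Come stai? Bene"
# Hand-written mock of the pipeline output (assumed format, made-up scores)
mock_predictions = [
    {"entity_group": "LABEL_0", "score": 0.99, "word": "Ciao mondo", "start": 0, "end": 10},
    {"entity_group": "LABEL_1", "score": 0.98, "word": ".", "start": 10, "end": 11},
    {"entity_group": "LABEL_0", "score": 0.99, "word": "Come stai", "start": 12, "end": 21},
    {"entity_group": "LABEL_1", "score": 0.97, "word": "?", "start": 21, "end": 22},
    {"entity_group": "LABEL_0", "score": 0.99, "word": "Bene", "start": 23, "end": 27},
]

sentences = split_sentences(sample, mock_predictions)
for sentence in sentences:
    print(sentence)
```

With the real pipeline, `mock_predictions` would be replaced by the result of `inference_pipeline(text)`, after checking which label name the deployed model actually reports.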