# Sentence Splitter: Out of Domain Evaluation

## Test Set Inference Using Encoder Models

In this notebook we predict labels for the out-of-domain test set.
We then compute the F1 score of those predictions against the gold labels provided with the test set.

> **Note**: The F1 from the test set is not used to tune models or select checkpoints.
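
As a reminder of what the metric measures, the sketch below computes binary F1 by hand on made-up labels. The lists are illustrative and not taken from the test set, `binary_f1` is a hypothetical helper rather than part of `evaluate`, and class 1 is assumed to be the positive class.

```python
def binary_f1(predictions, references, positive=1):
    """Binary F1 for the positive class, computed from raw counts."""
    pairs = list(zip(predictions, references))
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined)
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy labels, invented for this example
gold = [0, 0, 1, 0, 1, 0, 0, 1]
pred = [0, 0, 1, 0, 0, 0, 1, 1]

score = binary_f1(pred, gold)
print(f"F1 = {score:.4f}")
```

Here two positives are found, one is missed, and one is spurious, so precision and recall are both 2/3 and so is their harmonic mean.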

In [1]:
!pip install --upgrade pip
!pip install transformers==4.56.1 evaluate==0.4.5 torch==2.7.0 ipywidgets==8.1.7 scikit-learn==1.7.1

Defaulting to user installation because normal site-packages is not writeable


Import all required libraries:

In [2]:
from transformers import pipeline
import evaluate
import torch
import numpy as np
import pandas as pd
import random
from datasets import Dataset, DatasetDict, load_dataset
import os



Before proceeding, make the run as deterministic as possible:

In [3]:
def set_seed(seed=777, total_determinism=False):
    # Seed every random number generator used in this notebook
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
    # Ask cuDNN for deterministic kernels instead of benchmarking for the fastest ones
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    if total_determinism:
        # Stricter setting; it can make execution slower
        torch.use_deterministic_algorithms(True)

set_seed()  # Fix the seed for reproducibility

In [4]:
SIZE = 192 # Number of words to put on each input of the encoder model

def group_into_sequences(df, seq_len=SIZE):
    tokens = df['token'].tolist()
    labels = df['label'].tolist()
    
    # Group into sequences of seq_len
    token_seqs = [tokens[i:i+seq_len] for i in range(0, len(tokens), seq_len)]
    label_seqs = [labels[i:i+seq_len] for i in range(0, len(labels), seq_len)]
    
    return {'tokens': token_seqs, 'labels': label_seqs}
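
As a small illustration of the grouping above, the sketch below applies the same slicing to a toy token list with a much smaller chunk size. The tokens are loosely based on the text, the labels are invented, and the convention that label 1 marks the last token of a sentence is an assumption made for this example. `chunk` is a hypothetical helper that mirrors the list comprehension in `group_into_sequences`.

```python
def chunk(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

tokens = ["C'", "era", "una", "volta", "un", "pezzo", "di", "legno", ".",
          "Non", "era", "un", "legno", "di", "lusso", "."]
# Assumed convention for this toy example: 1 marks the last token of a sentence
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1,
          0, 0, 0, 0, 0, 0, 1]

token_seqs = chunk(tokens, 6)
label_seqs = chunk(labels, 6)

for seq_tokens, seq_labels in zip(token_seqs, label_seqs):
    print(seq_tokens, seq_labels)
```

The first toy sentence has nine tokens, so it starts in the first chunk and ends in the second: chunk boundaries are fixed by position and ignore sentence boundaries.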

Load the test set and group words into sequences.
Remember: sequences (inputs to the encoder) are not sentences (the outputs we want).

In [5]:
test = pd.read_csv("../data/OOD_test.csv", sep=';')  # token,label
test

Unnamed: 0,token,label
0,C',0
1,era,0
2,una,0
3,volta,0
4,…,0
...,...,...
1517,tornò,0
1518,zoppicando,0
1519,a,0
1520,casa,0


In [6]:
test_grouped = group_into_sequences(test)
test_grouped

{'tokens': [["C'",
   'era',
   'una',
   'volta',
   '…',
   '–',
   'Un',
   're',
   '!',
   '–',
   'diranno',
   'subito',
   'i',
   'miei',
   'piccoli',
   'lettori',
   '.',
   '–',
   'No',
   ',',
   'ragazzi',
   ',',
   'avete',
   'sbagliato',
   '.',
   "C'",
   'era',
   'una',
   'volta',
   'un',
   'pezzo',
   'di',
   'legno',
   '.',
   'Non',
   'era',
   'un',
   'legno',
   'di',
   'lusso',
   ',',
   'ma',
   'un',
   'semplice',
   'pezzo',
   'da',
   'catasta',
   ',',
   'di',
   'quelli',
   'che',
   "d'",
   'inverno',
   'si',
   'mettono',
   'nelle',
   'stufe',
   'e',
   'nei',
   'caminetti',
   'per',
   'accendere',
   'il',
   'fuoco',
   'e',
   'per',
   'riscaldare',
   'le',
   'stanze',
   '.',
   'Non',
   'so',
   'come',
   'andasse',
   ',',
   'ma',
   'il',
   'fatto',
   'gli',
   'è',
   'che',
   'un',
   'bel',
   'giorno',
   'questo',
   'pezzo',
   'di',
   'legno',
   'capitò',
   'nella',
   'bottega',
   'di',
   'un',
   '

Create a dataset from the grouped sequences so it can be reused.

In [7]:
test_dataset = Dataset.from_dict(test_grouped)

dataset_dict = DatasetDict({
    'test': test_dataset,
})

# Optionally push the dataset to the hub.
# dataset_dict.push_to_hub(f"fax4ever/sentence-splitter-ood-{SIZE}", token=os.getenv("HF_TOKEN"))

# Alternatively, load the dataset from the hub.
dataset_dict = load_dataset(f"fax4ever/sentence-splitter-ood-{SIZE}")

Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 1324.38ba/s]
Uploading the dataset shards: 100%|██████████| 1/1 [00:02<00:00,  2.41s/it]
Generating test split: 100%|██████████| 8/8 [00:00<00:00, 2496.98 examples/s]


Create two inference pipelines, one for each of the two models that performed best in our earlier evaluations: Modern BERT ITA XXL (`bert-base-italian-xxl-cased-sentence-splitter`) and Electra ITA XXL (`electra-base-italian-xxl-cased-discriminator-sentence-splitter`).

The models were selected before evaluating on this test set, based on their performance on other datasets. No test labels were used in this choice.

In [8]:
modern_bert_model_name = "bert-base-italian-xxl-cased-sentence-splitter"
modern_bert_model_checkpoint = "fax4ever/" + modern_bert_model_name
modern_bert_inference_pipeline = pipeline("token-classification", model=modern_bert_model_checkpoint, aggregation_strategy="simple")

electra_model_name = "electra-base-italian-xxl-cased-discriminator-sentence-splitter"
electra_model_checkpoint = "fax4ever/" + electra_model_name
electra_inference_pipeline = pipeline("token-classification", model=electra_model_checkpoint, aggregation_strategy="simple")

Device set to use cpu
Device set to use cpu


Compute the predicted labels.

The model produces a label for each subword token. With the "simple" aggregation strategy, contiguous tokens with the same label are grouped into a single entity, returned as one string. We split each of those strings into words on spaces. Because the tokenizer may break a gold word into several pieces, we then align the predicted words (and their labels) with the words of the gold dataset by comparing the character positions at which words end.

We collect all predicted labels and compute the F1 scores for both selected models. These F1 scores are not used for tuning.
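
The two steps described above can be sketched on a toy input. The `prediction` list is hand-written to mimic the shape of an aggregated pipeline result (only the `entity_group` and `word` keys are shown); it is not real model output, and the way the apostrophe is split is an assumption made for illustration. `end_offsets` is a hypothetical helper.

```python
# Hand-written stand-in for an aggregated pipeline result (not real model output)
prediction = [
    {"entity_group": "LABEL_0", "word": "aveva nome mastr ' Antonio"},
    {"entity_group": "LABEL_1", "word": "."},
    {"entity_group": "LABEL_0", "word": "Non so"},
]
reference = ["aveva", "nome", "mastr'Antonio", ".", "Non", "so"]

# Step 1: expand each group into words, copying the group label to every word
pred_words, pred_labels = [], []
for group in prediction:
    label = 0 if group["entity_group"] == "LABEL_0" else 1
    for word in group["word"].split(" "):
        pred_words.append(word)
        pred_labels.append(label)

# Both word lists must spell out the same text once spaces are ignored
assert "".join(pred_words) == "".join(reference)

def end_offsets(words):
    """Index of each word's last character in the space-free concatenation."""
    offsets, total = [], 0
    for word in words:
        total += len(word)
        offsets.append(total - 1)
    return offsets

# Step 2: transfer a label whenever a predicted word ends where a reference word ends
ref_index_by_end = {end: i for i, end in enumerate(end_offsets(reference))}
aligned = [0] * len(reference)
for end, label in zip(end_offsets(pred_words), pred_labels):
    if end in ref_index_by_end:
        i = ref_index_by_end[end]
        aligned[i] = max(aligned[i], label)

print(pred_words)
print(aligned)
```

The predicted side has eight words and the reference six, yet the aligned labels have one entry per reference word: the pieces `mastr` and `'` end inside `mastr'Antonio` and are skipped, while `Antonio` ends exactly where the reference word ends and passes its label on.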

In [9]:
def words_to_sequence(words):
    return " ".join(words)

def sequence_to_words(sequence):
    return sequence.split(" ")

def labels_from_prediction(prediction):
    words = []
    labels = []

    for group in prediction:
        entity_group = group["entity_group"]
        sequence = group["word"]
        label = 0 if entity_group == 'LABEL_0' else 1
        for word in sequence_to_words(sequence):
            words.append(word)
            labels.append(label)
    return words, labels

def realign_words_labels(words_i, labels_i, reference):
    # Sanity check: concatenated reference and predicted words must match
    if "".join(words_i) != "".join(reference):
        raise ValueError("predicted and reference words do not concatenate to the same string")

    # Compute end-character indices for each reference word in the concatenated string
    ref_end_indices = []
    acc = 0
    for word in reference:
        acc += len(word)
        ref_end_indices.append(acc - 1)

    # Map end index -> reference word index for quick lookup
    end_index_to_ref_idx = {end_idx: i for i, end_idx in enumerate(ref_end_indices)}

    aligned_labels = [0] * len(reference)

    # For each predicted word, if its end aligns with a reference word end, transfer the label
    acc = 0
    for p_idx, word in enumerate(words_i):
        acc += len(word)
        pred_end_idx = acc - 1
        if pred_end_idx in end_index_to_ref_idx:
            r_idx = end_index_to_ref_idx[pred_end_idx]
            if labels_i[p_idx] > aligned_labels[r_idx]:
                aligned_labels[r_idx] = labels_i[p_idx]

    return aligned_labels

all_words = []
all_golden_labels = []
all_mb_aligned_labels = []
all_el_aligned_labels = []

# We compute F1 only for final reporting; it is not used for tuning or selection.
# The averaging mode is an argument of compute(); its default, "binary", scores class 1.
mb_f1 = evaluate.load("f1")
el_f1 = evaluate.load("f1")

for batch in dataset_dict["test"].iter(batch_size=1):
    words = batch["tokens"][0]
    golden_labels = batch["labels"][0]

    mb_inference = modern_bert_inference_pipeline(words_to_sequence(words))
    mb_words, mb_labels = labels_from_prediction(mb_inference)
    mb_aligned_labels = realign_words_labels(mb_words, mb_labels, words)

    el_inference = electra_inference_pipeline(words_to_sequence(words))
    el_words, el_labels = labels_from_prediction(el_inference)
    el_aligned_labels = realign_words_labels(el_words, el_labels, words)

    # Print results for error analysis
    print(words)
    all_words.extend(words)

    print(golden_labels)
    all_golden_labels.extend(golden_labels)

    print(mb_aligned_labels)
    all_mb_aligned_labels.extend(mb_aligned_labels)
    mb_f1.add_batch(predictions=mb_aligned_labels, references=golden_labels)

    print(el_aligned_labels)
    all_el_aligned_labels.extend(el_aligned_labels)
    el_f1.add_batch(predictions=el_aligned_labels, references=golden_labels)

print("bert-base-italian-xxl-cased-sentence-splitter F1: ", mb_f1.compute())
print("electra-base-italian-xxl-cased-discriminator-sentence-splitter F1: ", el_f1.compute())

["C'", 'era', 'una', 'volta', '…', '–', 'Un', 're', '!', '–', 'diranno', 'subito', 'i', 'miei', 'piccoli', 'lettori', '.', '–', 'No', ',', 'ragazzi', ',', 'avete', 'sbagliato', '.', "C'", 'era', 'una', 'volta', 'un', 'pezzo', 'di', 'legno', '.', 'Non', 'era', 'un', 'legno', 'di', 'lusso', ',', 'ma', 'un', 'semplice', 'pezzo', 'da', 'catasta', ',', 'di', 'quelli', 'che', "d'", 'inverno', 'si', 'mettono', 'nelle', 'stufe', 'e', 'nei', 'caminetti', 'per', 'accendere', 'il', 'fuoco', 'e', 'per', 'riscaldare', 'le', 'stanze', '.', 'Non', 'so', 'come', 'andasse', ',', 'ma', 'il', 'fatto', 'gli', 'è', 'che', 'un', 'bel', 'giorno', 'questo', 'pezzo', 'di', 'legno', 'capitò', 'nella', 'bottega', 'di', 'un', 'vecchio', 'falegname', ',', 'il', 'quale', 'aveva', 'nome', "mastr'Antonio", ',', 'se', 'non', 'che', 'tutti', 'lo', 'chiamavano', 'maestro', 'Ciliegia', ',', 'per', 'via', 'della', 'punta', 'del', 'suo', 'naso', ',', 'che', 'era', 'sempre', 'lustra', 'e', 'paonazza', ',', 'come', 'una', 'C

In [10]:
# produce a pandas dataframe with the results
mb_results = pd.DataFrame({
    "token": all_words,
    "label": all_mb_aligned_labels,
})

# save the results
mb_results.to_csv("Lost_in_language_recognition-hw2_split-bert-base-italian-xxl-cased.csv", index=False)

el_results = pd.DataFrame({
    "token": all_words,
    "label": all_el_aligned_labels,
})

# save the results
el_results.to_csv("Lost_in_language_recognition-hw2_split-electra-base-italian-xxl-cased.csv", index=False)
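
For error analysis it can help to read the predictions back as sentences. The sketch below is one way to do that, under the assumption that label 1 marks the last token of a sentence; the notebook does not state this convention, so check it against the data before relying on it. `split_sentences` is a hypothetical helper and the tokens and labels are toy values.

```python
def split_sentences(tokens, labels):
    """Group tokens into sentences, closing a sentence after each label-1 token."""
    sentences, current = [], []
    for token, label in zip(tokens, labels):
        current.append(token)
        if label == 1:  # assumed convention: 1 = last token of a sentence
            sentences.append(current)
            current = []
    if current:
        # Trailing tokens with no end marker form a final, unfinished sentence
        sentences.append(current)
    return sentences

# Toy values; in the notebook these would be the "token" and "label" columns
tokens = ["C'", "era", "una", "volta", "un", "pezzo", "di", "legno", ".",
          "Non", "era", "un", "legno", "di", "lusso"]
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1,
          0, 0, 0, 0, 0, 0]

sentences = split_sentences(tokens, labels)
for sentence in sentences:
    print(" ".join(sentence))
```

Running the same function on the gold labels and on each model's labels gives three segmentations that can be compared side by side.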