# Sentence Splitter: Out of Domain Evaluation

In this notebook we evaluate the sentence-splitting models we fine-tuned,
both in domain (Quarantana, by Manzoni) and out of domain (novels by other authors of the same period).
The gold-annotated novels were copied, with slight adaptations, from the [Sentence Splitting Manzoni annotated dataset](https://github.com/RacheleSprugnoli/Sentence_Splitting_Manzoni/tree/main/test-novels).


Install the required libraries in your virtual environment:

In [1]:
!pip install --upgrade pip
!pip install transformers evaluate

Defaulting to user installation because normal site-packages is not writeable


Import all libraries that we will use.

In [2]:
import numpy as np
import evaluate
from transformers import pipeline
from pathlib import Path
import random
import torch



Before proceeding, make the run as deterministic as possible:

In [3]:
def set_seed(seed=777, total_determinism=False):
    """Seed every random number generator used in this notebook."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    # Ask cuDNN for deterministic kernels and disable its auto-tuner.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    if total_determinism:
        # Stricter setting; it can make the run slower.
        torch.use_deterministic_algorithms(True)

set_seed()  # Set the seed for reproducibility
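
As a quick sanity check of what seeding provides, the sketch below shows that re-seeding with the same value reproduces the same draws. It uses only Python's standard `random` module, an assumption made to keep it dependency-free (the notebook also seeds NumPy and PyTorch), and `draw_numbers` is a hypothetical helper that is not part of the notebook.

```python
import random


def draw_numbers(seed, count=5):
    """Hypothetical helper: seed the generator, then draw `count` integers."""
    random.seed(seed)
    return [random.randint(0, 999) for _ in range(count)]


first_run = draw_numbers(777)
second_run = draw_numbers(777)

# Same seed, same sequence of pseudo-random numbers.
print(first_run == second_run)
```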

List the fine-tuned models we want to test, each named after the pretrained encoder it is based on:

In [4]:
models = [
    "bert-base-cased-sentence-splitter",
    "ModernBERT-base-ita-sentence-splitter",
    "bert-base-italian-xxl-cased-sentence-splitter",
    "xlm-roberta-base-sentence-splitter",
    "xlm-roberta-large-sentence-splitter",
    "electra-base-italian-xxl-cased-discriminator-sentence-splitter",
]

We want to measure the F1 score of the predicted splits against the gold annotation.

First, we need to produce labels from both a prediction and a gold-annotated novel fragment.
The labels produced for this evaluation differ from the ones used to fine-tune the models:
to keep the implementation simple, we assign a label to each character instead of to each token.

In [5]:
def sequence_eval(inference_pipeline, sequence_lines):
    # Gold sentences are joined with a single space, so the character offsets
    # of the prediction line up with the labels built by labels_from_novel.
    sequence = " ".join(sequence_lines)
    prediction = inference_pipeline(sequence)
    prediction_labels = labels_from_prediction(prediction)
    golden_labels = labels_from_novel(sequence_lines)

    # The prediction can cover fewer characters than the gold text,
    # for example when the chunk is too long for the model.
    if len(prediction_labels) < len(golden_labels):
        print("Truncating golden labels. You should use a smaller value for NUM_LINES_FOR_EVAL!")
        golden_labels = golden_labels[:len(prediction_labels)]

    return prediction_labels, golden_labels

def labels_from_prediction(prediction):
    if not prediction:
        return []

    # Character positions covered by a span predicted as LABEL_1.
    ones = set()
    for group in prediction:
        if group["entity_group"] == "LABEL_1":
            ones.update(range(group["start"], group["end"]))

    # One label per character, from the first predicted span to the last.
    first, last = prediction[0], prediction[-1]
    return [1 if i in ones else 0 for i in range(first["start"], last["end"])]


def labels_from_novel(novel_lines):
    labels = []
    for i, novel_line in enumerate(novel_lines):
        # Every character is 0 except the last one of the line, which is 1.
        labels.extend([0] * (len(novel_line) - 1))
        labels.append(1)
        # The space that joins this line to the next one is labelled 0.
        if i < len(novel_lines) - 1:
            labels.append(0)
    return labels
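
To make the character-level labelling concrete, the sketch below restates the two helpers in compact form and runs them on a two-sentence toy fragment. The `fake_prediction` list is a hand-written stand-in for the pipeline output (an assumption, not real model output); it only carries the `entity_group`, `start`, and `end` keys that the helpers read.

```python
def labels_from_novel(novel_lines):
    """Gold labels: 1 on the last character of each line, 0 everywhere else."""
    labels = []
    for i, line in enumerate(novel_lines):
        labels.extend([0] * (len(line) - 1))
        labels.append(1)
        if i < len(novel_lines) - 1:
            labels.append(0)  # the space joining two lines
    return labels


def labels_from_prediction(prediction):
    """Predicted labels: 1 on every character inside a LABEL_1 span."""
    ones = set()
    for group in prediction:
        if group["entity_group"] == "LABEL_1":
            ones.update(range(group["start"], group["end"]))
    first, last = prediction[0], prediction[-1]
    return [1 if i in ones else 0 for i in range(first["start"], last["end"])]


gold_lines = ["Ciao.", "Come va?"]
text = " ".join(gold_lines)  # "Ciao. Come va?" (14 characters)

# Hand-written stand-in for the pipeline output (assumption, not model output).
fake_prediction = [
    {"entity_group": "LABEL_0", "start": 0, "end": 4},    # "Ciao"
    {"entity_group": "LABEL_1", "start": 4, "end": 5},    # "."
    {"entity_group": "LABEL_0", "start": 6, "end": 13},   # "Come va"
    {"entity_group": "LABEL_1", "start": 13, "end": 14},  # "?"
]

gold = labels_from_novel(gold_lines)
predicted = labels_from_prediction(fake_prediction)

print(gold)               # [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
print(predicted == gold)  # True
```

Both label lists have one entry per character of the joined text, which is what allows them to be compared position by position.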

The gold-annotated novel is split into chunks containing roughly the same number of lines.

For each chunk we compare the labels generated from the gold annotation with the model's prediction;
the F1 score is then computed over the labels of all chunks together:

In [6]:
def eval_model(model_name: str, novel_path: Path, num_lines_for_eval: int):
    model_checkpoint = "fax4ever/" + model_name
    # "simple" aggregation groups adjacent tokens that share a label into one span.
    inference_pipeline = pipeline("token-classification", model=model_checkpoint, aggregation_strategy="simple")
    novel_lines = novel_path.read_text(encoding="utf-8").splitlines()
    return eval_lines(inference_pipeline, novel_lines, num_lines_for_eval)

def eval_lines(inference_pipeline, novel_lines, num_lines_for_eval):
    # Named eval_lines so that it does not shadow Python's built-in eval.
    metric = evaluate.load("f1")

    grouped_lines = np.array_split(novel_lines, groups(novel_lines, num_lines_for_eval))
    for sequence_lines in grouped_lines:
        prediction_labels, golden_labels = sequence_eval(inference_pipeline, sequence_lines)
        metric.add_batch(predictions=prediction_labels, references=golden_labels)

    # F1 of the positive class, computed over the labels of all chunks.
    return metric.compute(average="binary")

def groups(novel_lines, num_lines_for_eval):
    # Ceiling division: enough chunks to hold at most num_lines_for_eval lines each.
    return (len(novel_lines) + num_lines_for_eval - 1) // num_lines_for_eval
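
The chunking and the pooled score can be illustrated without any dependency. In the sketch below, `chunk_lines` mimics how `np.array_split` distributes lines over a ceiling-divided number of groups, and `binary_f1` computes the F1 of the positive class from pooled labels. Both are hypothetical helpers written for illustration; they are not part of the notebook or of the `evaluate` library.

```python
def chunk_lines(lines, num_lines_for_eval):
    """Split `lines` into ceil(len / num) nearly equal chunks, larger ones first."""
    if not lines:
        return []
    num_chunks = (len(lines) + num_lines_for_eval - 1) // num_lines_for_eval
    base, extra = divmod(len(lines), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        size = base + 1 if i < extra else base
        chunks.append(lines[start:start + size])
        start += size
    return chunks


def binary_f1(predictions, references):
    """F1 of label 1: harmonic mean of precision and recall."""
    pairs = list(zip(predictions, references))
    tp = sum(1 for p, r in pairs if p == 1 and r == 1)
    fp = sum(1 for p, r in pairs if p == 1 and r == 0)
    fn = sum(1 for p, r in pairs if p == 0 and r == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


lines = [f"Sentence {i}." for i in range(7)]
chunks = chunk_lines(lines, 3)
print([len(chunk) for chunk in chunks])  # [3, 2, 2]

# Toy pooled labels: 2 true positives, 1 false positive, 1 false negative.
predictions = [0, 1, 0, 0, 1, 0, 1]
references = [0, 1, 0, 1, 1, 0, 0]
print(round(binary_f1(predictions, references), 3))  # 0.667
```

Because almost every character carries label 0, only the positive class is scored: accuracy over all characters would be high even for a model that never predicts a split.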

We test on the in-domain novel (Quarantana) and on three novels by different authors of the same period:

In [7]:
# Gold file -> number of lines per evaluation chunk.
novels = {
    "Cuore-GOLD.txt": 6,
    "Malavoglia-GOLD.txt": 3,
    "Pinocchio-GOLD.txt": 7,
    "Quarantana-GOLD.txt": 5,
}

for model_name in models:
    for novel, num_lines_for_eval in novels.items():
        value = eval_model(model_name, Path("out_of_domain_data") / novel, num_lines_for_eval)
        print(model_name, novel, value["f1"])

Device set to use cpu
bert-base-cased-sentence-splitter Cuore-GOLD.txt 0.9364161849710982
bert-base-cased-sentence-splitter Malavoglia-GOLD.txt 0.782608695652174
bert-base-cased-sentence-splitter Pinocchio-GOLD.txt 0.8854166666666666
Truncating golden labels. You should use a smaller value for NUM_LINES_FOR_EVAL!
Truncating golden labels. You should use a smaller value for NUM_LINES_FOR_EVAL!
bert-base-cased-sentence-splitter Quarantana-GOLD.txt 0.9859154929577465
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
ModernBERT-base-ita-sentence-splitter Cuore-GOLD.txt 0.8144329896907216
ModernBERT-base-ita-sentence-splitter Malavoglia-GOLD.txt 0.6666666666666666
ModernBERT-base-ita-sentence-splitter Pinocchio-GOLD.txt 0.5603112840466926
ModernBERT-base-ita-sentence-splitter Quarantana-GOLD.txt 0.9130434782608695
bert-base-italian-xxl-cased-sentence-splitter Cuore-GOLD.txt 0.9485714285714286
bert-base-italian-xxl-cased-sentence-splitter Malavoglia-GOLD.txt 0.9120879120879121
bert-base-italian-xxl-cased-sentence-splitter Pinocchio-GOLD.txt 0.8773584905660378
bert-base-italian-xxl-cased-sentence-splitter Quarantana-GOLD.txt 0.9938271604938271
xlm-roberta-base-sentence-splitter Cuore-GOLD.txt 0.24242424242424243
xlm-roberta-base-sentence-splitter Malavoglia-GOLD.txt 0.21153846153846154
xlm-roberta-base-sentence-splitter Pinocchio-GOLD.txt 0.5426356589147286
xlm-roberta-base-sentence-splitter Quarantana-GOLD.txt 0.3952380952380952
xlm-roberta-large-sentence-splitter Cuore-GOLD.txt 0.5
xlm-roberta-large-sentence-splitter Malavoglia-GOLD.txt 0.3559322033898305
xlm-roberta-large-sentence-splitter Pinocchio-GOLD.txt 0.6618705035971223
xlm-roberta-large-sentence-splitter Quarantana-GOLD.txt 0.5274725274725275
electra-base-italian-xxl-cased-discriminator-sentence-splitter Cuore-GOLD.txt 0.9385474860335196
electra-base-italian-xxl-cased-discriminator-sentence-splitter Malavoglia-GOLD.txt 0.9347826086956522
electra-base-italian-xxl-cased-discriminator-sentence-splitter Pinocchio-GOLD.txt 1.0
electra-base-italian-xxl-cased-discriminator-sentence-splitter Quarantana-GOLD.txt 0.9938461538461538
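
The score lines follow the pattern `model novel f1`, so they can be collected into a per-model table for easier comparison. This is a minimal sketch, assuming the cell output has been captured as text. `parse_scores` is a hypothetical helper, and the sample contains only four of the score lines printed above plus one log line.

```python
from collections import defaultdict

SUFFIX = "-GOLD.txt"


def parse_scores(output_text):
    """Hypothetical helper: map model -> {novel: f1} from the printed lines."""
    scores = defaultdict(dict)
    for line in output_text.splitlines():
        parts = line.split()
        # Score lines have exactly three fields; log messages are skipped.
        if len(parts) != 3 or not parts[1].endswith(SUFFIX):
            continue
        model, novel, f1 = parts
        scores[model][novel[: -len(SUFFIX)]] = float(f1)
    return dict(scores)


sample_output = """Device set to use cpu
bert-base-italian-xxl-cased-sentence-splitter Cuore-GOLD.txt 0.9485714285714286
bert-base-italian-xxl-cased-sentence-splitter Quarantana-GOLD.txt 0.9938271604938271
electra-base-italian-xxl-cased-discriminator-sentence-splitter Cuore-GOLD.txt 0.9385474860335196
electra-base-italian-xxl-cased-discriminator-sentence-splitter Quarantana-GOLD.txt 0.9938461538461538
"""

scores = parse_scores(sample_output)
for model, per_novel in scores.items():
    row = ", ".join(f"{novel}={f1:.3f}" for novel, f1 in per_novel.items())
    print(f"{model}: {row}")
```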
