# Sentence Splitter: Out of Domain Evaluation

In this notebook we evaluate the models we trained for sentence splitting,
both in domain (Manzoni's Quarantana) and out of domain (novels by other authors of the same period).
The gold-split novels were copied, with minor adaptations, from the [Sentence Splitting Manzoni annotated dataset](https://github.com/RacheleSprugnoli/Sentence_Splitting_Manzoni/tree/main/test-novels).


Install the required libraries in your virtual environment:

In [None]:
!pip install --upgrade pip
!pip install transformers evaluate torch numpy

Import all libraries that we will use.

In [None]:
import numpy as np
import evaluate
from transformers import pipeline
from pathlib import Path
import random
import torch

Before proceeding, make the run as deterministic as possible:

In [None]:
def set_seed(seed=777, total_determinism=False):
    # Seed every random number generator used in this notebook.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    # Ask cuDNN for deterministic kernels and disable auto-tuning.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    if total_determinism:
        # Stricter, but can make the run slower.
        torch.use_deterministic_algorithms(True)

set_seed()  # Set the seed for reproducibility

First of all, we need to produce labels from the gold-annotated, sentence-split text.
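As a rough illustration of what producing labels from gold text can look like, the sketch below assumes the gold file holds one sentence per line and marks the last word of each sentence with `1` and every other word with `0`. The function name `gold_lines_to_labels`, the whitespace tokenization, and the 0/1 scheme are all assumptions made for this example; the trained models may use a different label set.

```python
def gold_lines_to_labels(gold_lines):
    """Turn one-sentence-per-line gold text into parallel word and label lists.

    Assumed scheme (hypothetical): 1 = last word of a sentence, 0 = any other word.
    """
    words, labels = [], []
    for line in gold_lines:
        tokens = line.split()  # naive whitespace tokenization
        if not tokens:
            continue  # skip empty lines
        words.extend(tokens)
        labels.extend([0] * (len(tokens) - 1) + [1])
    return words, labels


example_words, example_labels = gold_lines_to_labels(
    ["Era una notte buia.", "", "Pioveva forte."]
)
print(list(zip(example_words, example_labels)))
```

Keeping words and labels in two parallel lists makes it easy to compare them position by position with the model predictions later on.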

In [None]:
trained_model_name = "bert-base-cased-sentence-splitter"

Set `trained_model_name` to one of the following:

1. bert-base-cased-sentence-splitter
2. ModernBERT-base-ita-sentence-splitter
3. bert-base-italian-xxl-cased-sentence-splitter
4. xlm-roberta-base-sentence-splitter
5. xlm-roberta-large-sentence-splitter
6. electra-base-italian-xxl-cased-discriminator-sentence-splitter

In [None]:
model_checkpoint = "fax4ever/" + trained_model_name
inference_pipeline = pipeline("token-classification", model=model_checkpoint, 
                              aggregation_strategy="simple")
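With `aggregation_strategy="simple"`, the pipeline returns a list of dictionaries with the keys `entity_group`, `score`, `word`, `start` and `end`. The sketch below shows one possible way to turn such output into sentences by cutting the text at the end offset of every group that carries a boundary label. The helper `split_on_boundaries` and the label name `"LABEL_1"` are assumptions for this example; check `inference_pipeline.model.config.id2label` for the labels the chosen model really uses. The sample entities are hand-written, not real model output.

```python
def split_on_boundaries(text, entities, boundary_label="LABEL_1"):
    """Cut `text` after every entity group tagged with `boundary_label`.

    `boundary_label` is a hypothetical name for the sentence-end class.
    """
    sentences, start = [], 0
    for entity in sorted(entities, key=lambda e: e["end"]):
        if entity["entity_group"] != boundary_label:
            continue
        sentence = text[start:entity["end"]].strip()
        if sentence:
            sentences.append(sentence)
        start = entity["end"]
    tail = text[start:].strip()  # text after the last predicted boundary
    if tail:
        sentences.append(tail)
    return sentences


# Hand-written stand-in for the pipeline output (illustrative only).
sample_text = "Era una notte buia. Pioveva forte."
sample_entities = [
    {"entity_group": "LABEL_0", "score": 0.99, "word": "Era una notte", "start": 0, "end": 13},
    {"entity_group": "LABEL_1", "score": 0.98, "word": "buia.", "start": 14, "end": 19},
    {"entity_group": "LABEL_0", "score": 0.99, "word": "Pioveva", "start": 20, "end": 27},
    {"entity_group": "LABEL_1", "score": 0.97, "word": "forte.", "start": 28, "end": 34},
]
sample_sentences = split_on_boundaries(sample_text, sample_entities)
print(sample_sentences)
```

On real data the same function could be called as `split_on_boundaries(text, inference_pipeline(text))`, once the boundary label has been confirmed.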

Load the out-of-domain dataset (novels by other authors):

In [None]:
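from urllib.request import urlopen

def load_text(url):
    # Download the whole file and decode it into a single UTF-8 string.
    with urlopen(url) as response:
        return response.read().decode('utf-8')

def load_lines_of_text(url):
    # Download the file and return its lines, stripped of surrounding whitespace.
    with urlopen(url) as response:
        return [line.decode('utf-8').strip() for line in response]

BASE_URL = "https://raw.githubusercontent.com/RacheleSprugnoli/Sentence_Splitting_Manzoni/refs/heads/main/test-novels/"

cuore_test = load_text(BASE_URL + "Cuore-test.txt")
cuore_lines = load_lines_of_text(BASE_URL + "Cuore-GOLD.txt")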
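Once gold and predicted labels are aligned word by word, the comparison can be summarized with precision, recall and F1 on the sentence-boundary class. The sketch below is a plain-Python version written for illustration; it is not the metric from the `evaluate` library imported above. The function name `boundary_prf` and the 0/1 labels (1 = sentence-final word) are assumptions, and the input lists are toy values, not results.

```python
def boundary_prf(gold_labels, predicted_labels, positive=1):
    """Precision, recall and F1 for the boundary class (assumed label: 1)."""
    if len(gold_labels) != len(predicted_labels):
        raise ValueError("gold and predicted labels must have the same length")
    pairs = list(zip(gold_labels, predicted_labels))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy labels: the prediction adds one spurious boundary at position 1.
scores = boundary_prf([0, 0, 1, 0, 1], [0, 1, 1, 0, 1])
print(scores)
```

Scoring only the boundary class keeps the result from being dominated by the far more frequent non-boundary words.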