# Sentence Splitter: Out of Domain Evaluation

## Test set / Generative models Notebook

In this notebook we use a fine-tuned sentence-splitting LLM (based on Minerva 7B Instruct).

Install the libraries in the local virtual environment.
We pin specific versions to make this notebook as reproducible as possible.

In [None]:
!pip install --upgrade pip
!pip install transformers==4.56.1 evaluate==0.4.5 torch==2.7.0 unsloth==2025.9.1 ipywidgets==8.1.7 numpy==2.3.2 pandas==2.3.2 datasets==3.6.0 jupyter==1.1.1 scikit-learn==1.7.1

Import the libraries

In [None]:
import random
import numpy as np
import pandas as pd
import torch
from unsloth import FastLanguageModel
from transformers import TextStreamer
from datasets import load_dataset
import evaluate

Before proceeding, make the run as deterministic as possible:

In [None]:
RANDOM_STATE = 777

def set_seed(seed=777, total_determinism=False):
    # Seed every random number generator used in this notebook
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    # Prefer deterministic cuDNN kernels over auto-tuned ones
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    if total_determinism:
        # Stricter, but use_deterministic_algorithms can make execution slower
        torch.use_deterministic_algorithms(True)

set_seed(RANDOM_STATE)  # Set the seed for reproducibility
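
A quick way to sanity-check seeding is to reseed and confirm that the same values come out. The sketch below does this with the standard-library generator only; `seeded_draws` is a hypothetical helper for illustration, not part of the notebook, and it says nothing about GPU-level determinism.

```python
import random

def seeded_draws(seed, n=3):
    """Return n pseudo-random integers drawn right after seeding."""
    random.seed(seed)
    return [random.randint(0, 999) for _ in range(n)]

first = seeded_draws(777)
second = seeded_draws(777)
print(first == second)  # True: same seed, same sequence
```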

Reuse the `Prompt` class from the training notebook.
Here only the `question` method will be invoked.

In [None]:
class Prompt:
    def __init__(self, input_text):
        self.input_text = input_text

    def instruction(self):
        return f"""Dividi il seguente testo italiano in frasi. Per favore rispondi con una frase per riga. Grazie.

Testo: {self.input_text}
"""

    def conversation(self, output_text):
        return [
            {"role" : "system",    "content" : "Sei un esperto di linguistica italiana specializzato nella segmentazione delle frasi."},
            {"role" : "user",      "content" : self.instruction()},
            {"role" : "assistant", "content" : output_text},
        ]

    def question(self):
        return [
            {"role" : "system",    "content" : "Sei un esperto di linguistica italiana specializzato nella segmentazione delle frasi."},
            {"role" : "user",      "content" : self.instruction()},
        ]

In [None]:
def load_model(model_name):
    model, tokenizer = FastLanguageModel.from_pretrained(
        'fax4ever/' + model_name, 
        load_in_4bit=True, 
        dtype=None, 
        max_seq_length=512
    )
    model = FastLanguageModel.for_inference(model)
    return model, tokenizer    

def use_model(model, tokenizer, input_text):
    question = tokenizer.apply_chat_template(
        [Prompt(input_text).question()], 
        tokenize = False,
        add_generation_prompt = True, # Must add for generation
        enable_thinking = False, # Disable thinking
    )

    return model.generate(
        **tokenizer(question, return_tensors = "pt").to("cuda"),
        max_new_tokens = 512,
        temperature = 0.7, top_p = 0.8, top_k = 20, # For non thinking
        streamer = TextStreamer(tokenizer, skip_prompt = True),
    )

The LLM-based models we fine-tuned are:

1. Minerva-7B-instruct-v1.0-sentence-splitter
2. qwen3-4b-unsloth-bnb-4bit-sentence-splitter
3. mistral-7b-instruct-v0.3-bnb-4bit-sentence-splitter
4. meta-llama-3.1-8b-instruct-unsloth-bnb-4bit-sentence-splitter

We chose to produce labels with model (1).

In [None]:
model, tokenizer = load_model("Minerva-7B-instruct-v1.0-sentence-splitter")

In [None]:
# With fax4ever/sentence-splitter-ood-192 we exceed 512 tokens, so we use the 128 variant
dataset_dict = load_dataset("fax4ever/sentence-splitter-ood-128")

In [None]:
def words_to_sequence(words):
    input_text = " ".join(words)
    input_text = input_text.replace(" ,", ",")
    input_text = input_text.replace(" .", ".")
    input_text = input_text.replace(" ?", "?")
    input_text = input_text.replace(" !", "!")
    input_text = input_text.replace(" :", ":")
    input_text = input_text.replace(" ;", ";")
    input_text = input_text.replace("' ", "'")
    return input_text
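
To illustrate what this detokenization does, here is a self-contained sketch that applies the same replacements in the same order. The function name `join_tokens` and the sample tokens are made up for the example; the real dataset tokens may be split differently.

```python
def join_tokens(tokens):
    """Join tokens with spaces, then drop the space before common punctuation."""
    text = " ".join(tokens)
    for mark in [",", ".", "?", "!", ":", ";"]:
        text = text.replace(" " + mark, mark)
    # Re-attach elided forms such as "l' alba" -> "l'alba"
    return text.replace("' ", "'")

# Hypothetical tokens, not taken from the dataset
tokens = ["Era", "l'", "alba", ",", "e", "pioveva", ".", "Chi", "era", "?"]
print(join_tokens(tokens))  # Era l'alba, e pioveva. Chi era?
```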

According to our analysis, producing sentence-splitting labels from an LLM is not an easy task.
So we created a utility class that derives labels for the test-set words from the LLM output.
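
The first step of that utility is to pull the sentences out of the raw generation. A minimal sketch of this step, assuming (as the `MinervaLabels` class does) that the model emits one numbered line per sentence, such as `1. ...`, and may append control tokens like `<|eot_id|>`; `extract_sentences` and the sample string are illustrative only.

```python
import re

NUMBERED_LINE = re.compile(r"^(\d+)\.\s+(.*)")
CONTROL_TOKEN = re.compile(r"<\|[^|]+\|>")

def extract_sentences(generation):
    """Keep only lines shaped like 'N. sentence' and strip control tokens."""
    sentences = []
    for line in generation.split("\n"):
        match = NUMBERED_LINE.match(line)
        if match:
            sentences.append(CONTROL_TOKEN.sub("", match.group(2)))
    return sentences

# Hypothetical model output, not a real generation
sample = "1. Era l'alba.\n2. Chi era?<|eot_id|>\nnote without a number"
print(extract_sentences(sample))
```

Lines that do not start with a number are silently ignored, so any preamble in the generation does not end up in the sentence list.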

To align the LLM output exactly with the input words, we apply a few manual replacements to it:

1. ... => …
2. mise => messe
3. trasfigurato => trasfigurito

The three dots can be written as the single character `…` instead of three separate `.` characters.

Messe can be used in place of mise, see https://www.scholingua.com/it/it/coniugazione/mettersi.
Probably the former is an older form.

Trasfigurito appears to be an alternative form of trasfigurato, see https://www.treccani.it/vocabolario/ricerca/trasfigurito/.

We regard these as small hallucinations that we are willing to tolerate.
Any other hallucination raises an `AllucinationException`, which is logged.
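
The tolerance rule can be sketched as a normalization followed by a substring check. The names `TOLERATED`, `normalize`, and `is_aligned` are hypothetical and the sample words are invented; only the three replacements come from the list above.

```python
# Replacements tolerated when comparing the model output to the input words
TOLERATED = [("...", "…"), ("mise", "messe"), ("trasfigurato", "trasfigurito")]

def normalize(sentence):
    """Drop spaces and apply the tolerated replacements, in order."""
    sentence = sentence.replace(" ", "")
    for old, new in TOLERATED:
        sentence = sentence.replace(old, new)
    return sentence

def is_aligned(sentence, words):
    """True when the normalized sentence occurs in the concatenated words."""
    return normalize(sentence) in "".join(words)

words = ["Aspetta", "…", "disse", "."]  # hypothetical input words
print(is_aligned("Aspetta... disse.", words))  # True
print(is_aligned("Aspetta, disse.", words))    # False
```

Because these are plain substring replacements, they also rewrite longer words that happen to contain `mise`; that is acceptable for this test set but may need word-boundary matching on other data.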

In [None]:
class AllucinationException(Exception):
    def __init__(self, message):
        self.message = message
        super().__init__(self.message)

In [None]:
class MinervaLabels:
  def __init__(self, minerva_output:str, words:list):
    self.words = words
    self.words_joined = "".join(words)
    self.minerva_output = minerva_output
    self.sentences = self._create_sentences()
    self.aligned_sentences = self._aligned_sentences()
    self._check_alinged_sentences() # Sanity check

  def _create_sentences(self):
    import re
    sentences = []
    for line in self.minerva_output.split("\n"):
      # Look for lines that start with number followed by dot and space
      match = re.match(r'^(\d+)\.\s+(.*)', line)
      if match:
        sentence = match.group(2)  # Extract the sentence part after "NUMBER. "
        # Clean up Minerva control tokens like <|eot_id|>
        sentence = re.sub(r'<\|[^|]+\|>', '', sentence)
        sentences.append(sentence)
    return sentences

  def _aligned_sentences(self):
    return [self._align_sentence(sentence) for sentence in self.sentences]

  def _align_sentence(self, sentence):
    sentence = sentence.replace(" ", "")
    sentence = sentence.replace("...", "…")
    sentence = sentence.replace("mise", "messe")
    sentence = sentence.replace("trasfigurato", "trasfigurito")
    if sentence in self.words_joined:
      return sentence
    else:
      # in this case the model is hallucinating altering a sentence or producing a sentence that does not exist
      raise AllucinationException(f"Sentence {sentence} not found in {self.words_joined}")

  def _check_alinged_sentences(self):
    aligned_join = "".join(self.aligned_sentences)
    if not aligned_join == self.words_joined:
      # in this case the model is hallucinating not producing a sentence passed in input
      raise AllucinationException(f"Aligned sentences {self.aligned_sentences} do not match words {self.words_joined}")

  def aligned_labels(self, golden_labels):
    labels = [0] * len(self.words)

    index = 0
    split_indexes = set()
    for sentence in self.aligned_sentences:
      length = len(sentence)
      index += length
      split_indexes.add(index)
    
    index = 0
    for i, word in enumerate(self.words):
      length = len(word)
      index += length
      if index in split_indexes:
        labels[i] = 1

    # The last word can be a 1 or a 0,
    # the model is not asked to predict the last word, so we use the golden labels
    labels[-1] = golden_labels[-1]
    
    return labels
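
The core idea of `aligned_labels` is to compare cumulative character offsets: a word gets label 1 when a sentence ends exactly at the end of that word. Below is a standalone sketch of that idea on toy data; `boundary_labels` is a hypothetical function and, unlike the method above, it does not overwrite the last label with the golden one.

```python
def boundary_labels(words, sentences):
    """Label a word 1 when a sentence ends exactly at the end of that word."""
    # Character offsets (spaces removed) at which each sentence ends
    ends, offset = set(), 0
    for sentence in sentences:
        offset += len(sentence.replace(" ", ""))
        ends.add(offset)

    # Walk the words and mark those whose end offset is a sentence end
    labels, offset = [], 0
    for word in words:
        offset += len(word)
        labels.append(1 if offset in ends else 0)
    return labels

# Toy data, not taken from the dataset
words = ["Era", "l'", "alba", ".", "Chi", "era", "?"]
sentences = ["Era l'alba.", "Chi era?"]
print(boundary_labels(words, sentences))  # [0, 0, 0, 1, 0, 0, 1]
```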

In [None]:
all_words = []
all_golden_labels = []
all_minerva_labels = []
minerva_f1 = evaluate.load("f1", average="binary")

In [None]:
for i, batch in enumerate(dataset_dict["test"].iter(batch_size=1)):
    words = batch["tokens"][0]
    golden_labels = batch["labels"][0]

    output = use_model(model, tokenizer, words_to_sequence(words)).cpu()
    minerva_output = tokenizer.decode(output[0])

    try:
        minerva_labels_helper = MinervaLabels(minerva_output, words)
    except AllucinationException as e:
        print(e)
        continue
    
    minerva_labels = minerva_labels_helper.aligned_labels(golden_labels)

    print(f"Batch {i}")
    print(words)
    print(golden_labels)
    print(minerva_labels)

    all_words.extend(words)
    all_golden_labels.extend(golden_labels)
    all_minerva_labels.extend(minerva_labels)
    minerva_f1.add_batch(predictions=minerva_labels, references=golden_labels)

print("Minerva-7B-instruct-v1.0-sentence-splitter F1: ", minerva_f1.compute())
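
The reported score is a binary F1 over per-word boundary labels, computed only on samples that did not raise an `AllucinationException` (skipped samples never reach `add_batch`). As a rough sketch of what that metric measures, the hand-written `binary_f1` below reproduces the standard formula on toy labels; it is an illustration, not the `evaluate` implementation.

```python
def binary_f1(predictions, references):
    """F1 of the positive class (label 1) from true/false positives and negatives."""
    pairs = list(zip(predictions, references))
    tp = sum(1 for p, r in pairs if p == 1 and r == 1)
    fp = sum(1 for p, r in pairs if p == 1 and r == 0)
    fn = sum(1 for p, r in pairs if p == 0 and r == 1)
    if tp == 0:
        return 0.0  # no correct boundary: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

gold = [0, 0, 0, 1, 0, 0, 1]  # toy golden labels
pred = [0, 1, 0, 1, 0, 0, 1]  # one spurious boundary
print(round(binary_f1(pred, gold), 3))  # 0.8
```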

In [None]:
# produce a pandas dataframe with the results
mb_results = pd.DataFrame({
    "token": all_words,
    "label": all_minerva_labels,
})

# save the results
mb_results.to_csv("Lost_in_language_recognition-hw2_split-Minerva-7B-based.csv", index=False)