# Sentence Splitter: Out of Domain Evaluation

## Test Set Inference Using Generative Models

In this notebook, we use a fine-tuned sentence-splitting LLM based on Minerva 7B Instruct to evaluate performance on out-of-domain data.

> **Note**: The F1 from the test set is not used to tune models or select checkpoints.

Install the libraries in the local virtual environment.
We pin specific versions to keep this notebook reproducible.

In [1]:
!pip install --upgrade pip
!pip install transformers==4.56.1 evaluate==0.4.5 torch==2.7.0 unsloth==2025.9.1 ipywidgets==8.1.7 numpy==2.3.2 pandas==2.3.2 datasets==3.6.0 jupyter==1.1.1 scikit-learn==1.7.1



Import the necessary libraries for model inference, data processing, and evaluation.
We do this first to fail fast in case additional packages need to be installed in the virtual environment.

In [2]:
import random
import numpy as np
import pandas as pd
import torch
from unsloth import FastLanguageModel
from transformers import TextStreamer
from datasets import load_dataset
import evaluate

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


Set up deterministic behavior for reproducible results by configuring random seeds for all relevant libraries:

In [3]:
RANDOM_STATE = 777

def set_seed(seed=777, total_determinism=False):
    torch.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    if total_determinism:
        torch.use_deterministic_algorithms(True)
    random.seed(seed)
    np.random.seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
set_seed(RANDOM_STATE)  # Set the seed for reproducibility; total_determinism=True can make execution slower

### Data Preparation and Alignment

Define the `Prompt` class (reused from the training notebook) to format input text for the LLM.
For inference, we only use the `question` method to generate prompts without expected outputs.

In [4]:
class Prompt:
    def __init__(self, input_text):
        self.input_text = input_text

    def instruction(self):
        return f"""Dividi il seguente testo italiano in frasi. Per favore rispondi con una frase per riga. Grazie.

Testo: {self.input_text}
"""

    def conversation(self, output_text):
        return [
            {"role" : "system",    "content" : "Sei un esperto di linguistica italiana specializzato nella segmentazione delle frasi."},
            {"role" : "user",      "content" : self.instruction()},
            {"role" : "assistant", "content" : output_text},
        ]

    def question(self):
        return [
            {"role" : "system",    "content" : "Sei un esperto di linguistica italiana specializzato nella segmentazione delle frasi."},
            {"role" : "user",      "content" : self.instruction()},
        ]

Load the model and tokenizer from the Hugging Face Hub. 
We apply model-specific chat templates to format questions into proper prompts for the LLM.
The output is streamed and logged during generation for real-time monitoring.

In [5]:
def load_model(model_name):
    model, tokenizer = FastLanguageModel.from_pretrained(
        'fax4ever/' + model_name, 
        load_in_4bit=True, 
        dtype=None, 
        max_seq_length=512
    )
    model = FastLanguageModel.for_inference(model)
    return model, tokenizer    

def use_model(model, tokenizer, input_text):
    question = tokenizer.apply_chat_template(
        [Prompt(input_text).question()], 
        tokenize = False,
        add_generation_prompt = True, # Must add for generation
        enable_thinking = False, # Disable thinking
    )

    return model.generate(
        **tokenizer(question, return_tensors = "pt").to("cuda"),
        max_new_tokens = 512,
        temperature = 0.7, top_p = 0.8, top_k = 20, # For non thinking
        streamer = TextStreamer(tokenizer, skip_prompt = True),
    )

The LLM-based models we fine-tuned are:

1. Minerva-7B-instruct-v1.0-sentence-splitter
2. qwen3-4b-unsloth-bnb-4bit-sentence-splitter
3. mistral-7b-instruct-v0.3-bnb-4bit-sentence-splitter
4. meta-llama-3.1-8b-instruct-unsloth-bnb-4bit-sentence-splitter

We choose to produce labels using model (1), the Minerva-7B-instruct-v1.0-sentence-splitter.

In [6]:
model, tokenizer = load_model("Minerva-7B-instruct-v1.0-sentence-splitter")

==((====))==  Unsloth 2025.9.1: Fast Mistral patching. Transformers: 4.56.1.
   \\   /|    NVIDIA L4. Num GPUs = 1. Max memory: 21.951 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.0+cu126. CUDA: 8.9. CUDA Toolkit: 12.6. Triton: 3.3.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.30. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

sapienzanlp/Minerva-7B-instruct-v1.0 does not have a padding token! Will use pad_token = <unk>.


Unsloth 2025.9.1 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


Load the pre-processed dataset that was generated in the `sentence_splitter_out_of_domain_test_discriminative.ipynb` notebook. This dataset is essentially a grouped version of the `data/OOD_test.csv` file, where words are organized into sequences.

**Note:** These sequences are not actual sentences; they are artificial groupings for processing purposes.

In [7]:
# with fax4ever/sentence-splitter-ood-192 we produce more than 512 tokens!
dataset_dict = load_dataset("fax4ever/sentence-splitter-ood-128")

In [8]:
def words_to_sequence(words):
    input_text = " ".join(words)
    input_text = input_text.replace(" ,", ",")
    input_text = input_text.replace(" .", ".")
    input_text = input_text.replace(" ?", "?")
    input_text = input_text.replace(" !", "!")
    input_text = input_text.replace(" :", ":")
    input_text = input_text.replace(" ;", ";")
    input_text = input_text.replace("' ", "'")
    return input_text
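
As a quick sanity check, the detokenization rules above can be exercised on the first few tokens shown in the Batch 0 output. The sketch below restates the same replacements in a table-driven form so that it runs on its own; `PUNCT_FIXES` and `detokenize` are illustrative names and are not part of the notebook.

```python
# Illustrative, table-driven variant of words_to_sequence (names are assumptions).
PUNCT_FIXES = [
    (" ,", ","), (" .", "."), (" ?", "?"), (" !", "!"),
    (" :", ":"), (" ;", ";"), ("' ", "'"),
]

def detokenize(words):
    """Join tokens with spaces, then re-attach punctuation and elided forms."""
    text = " ".join(words)
    for old, new in PUNCT_FIXES:
        text = text.replace(old, new)
    return text

sample = ["C'", "era", "una", "volta", "…", "–", "Un", "re", "!"]
print(detokenize(sample))  # C'era una volta … – Un re!
```

Characters that have no rule, such as `…` and `–`, keep their surrounding spaces, which is why the alignment step later compares text with all spaces removed.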

### Model Inference and Evaluation

According to our analysis, deriving sentence-splitting labels from LLM output is not a trivial task.
We created a utility class that maps the LLM output back to labels for the test set words.

To achieve perfect alignment, we need to manually adjust the LLM output with a few specific replacements:

1. `...` → `…`
2. `mise` → `messe`
3. `trasfigurato` → `trasfigurito`

The three dots can be written as a single `…` character instead of three separate dot characters `...`.

`messe` can be used in place of `mise` (see https://www.scholingua.com/it/it/coniugazione/mettersi); the former is probably an older form.

`trasfigurito` appears to be an alternative way of saying `trasfigurato`, see https://www.treccani.it/vocabolario/ricerca/trasfigurito/.

We can consider these as minor hallucinations that we want to tolerate.
Any other hallucination raises a `HallucinationException`, which is caught and printed during inference.
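
The tolerated replacements and the containment check can be sketched in isolation as follows. This is a minimal illustration that assumes the same three rewrites listed above; `TOLERATED`, `normalize`, and `is_grounded` are hypothetical names, not functions from the notebook.

```python
# Tolerated rewrites: (form produced by the model, form found in the source text).
TOLERATED = [("...", "…"), ("mise", "messe"), ("trasfigurato", "trasfigurito")]

def normalize(sentence):
    """Drop spaces, then map the model's spelling back to the source spelling."""
    sentence = sentence.replace(" ", "")
    for model_form, source_form in TOLERATED:
        sentence = sentence.replace(model_form, source_form)
    return sentence

def is_grounded(sentence, words):
    """True when the normalized sentence occurs verbatim in the space-free input."""
    return normalize(sentence) in "".join(words)

words = ["C'", "era", "una", "volta", "…", "–"]
print(is_grounded("C'era una volta ... –", words))  # True
print(is_grounded("C'era due volte ... –", words))  # False
```

Because `str.replace` works on substrings, a rule such as `mise` → `messe` also fires inside longer words. That is harmless only as long as the source text does not contain such a word in its original spelling.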

In [9]:
class HallucinationException(Exception):
    def __init__(self, message):
        self.message = message
        super().__init__(self.message)

In [10]:
import re

class MinervaLabels:
  def __init__(self, minerva_output: str, words: list):
    self.words = words
    self.words_joined = "".join(words)
    self.minerva_output = minerva_output
    self.sentences = self._create_sentences()
    self.aligned_sentences = self._aligned_sentences()
    self._check_aligned_sentences() # Sanity check

  def _create_sentences(self):
    sentences = []
    for line in self.minerva_output.split("\n"):
      # Look for lines that start with number followed by dot and space
      match = re.match(r'^(\d+)\.\s+(.*)', line)
      if match:
        sentence = match.group(2)  # Extract the sentence part after "NUMBER. "
        # Clean up Minerva control tokens like <|eot_id|>
        sentence = re.sub(r'<\|[^|]+\|>', '', sentence)
        sentences.append(sentence)
    return sentences

  def _aligned_sentences(self):
    return [self._align_sentence(sentence) for sentence in self.sentences]

  def _align_sentence(self, sentence):
    sentence = sentence.replace(" ", "")
    sentence = sentence.replace("...", "…")
    sentence = sentence.replace("mise", "messe")
    sentence = sentence.replace("trasfigurato", "trasfigurito")
    if sentence in self.words_joined:
      return sentence
    else:
      # The model is hallucinating: it altered a sentence or produced one that does not exist in the input
      raise HallucinationException(f"Sentence {sentence} not found in {self.words_joined}")

  def _check_aligned_sentences(self):
    aligned_join = "".join(self.aligned_sentences)
    if aligned_join != self.words_joined:
      # The model is hallucinating: it dropped part of the input text
      raise HallucinationException(f"Aligned sentences {self.aligned_sentences} do not match words {self.words_joined}")

  def aligned_labels(self, golden_labels):
    labels = [0] * len(self.words)

    index = 0
    split_indexes = set()
    for sentence in self.aligned_sentences:
      length = len(sentence)
      index += length
      split_indexes.add(index)
    
    index = 0
    for i, word in enumerate(self.words):
      length = len(word)
      index += length
      if index in split_indexes:
        labels[i] = 1

    # The last word can be a 1 or a 0,
    # the model is not asked to predict the last word, so we use the golden labels
    labels[-1] = golden_labels[-1]
    
    return labels
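
The offset-matching idea behind `aligned_labels` can be illustrated with a small standalone function. This is a sketch under the assumption that the sentences have already been stripped of spaces and concatenate exactly to the joined words; `boundary_labels` is an illustrative name, and unlike the class above it does not overwrite the last label with the golden one.

```python
def boundary_labels(words, sentences):
    """Label each word 1 if a sentence ends right after it, else 0."""
    # Cumulative character offsets at which each sentence ends.
    ends, offset = set(), 0
    for sentence in sentences:
        offset += len(sentence)
        ends.add(offset)

    # Walk the words with the same running offset and mark the matches.
    labels, offset = [], 0
    for word in words:
        offset += len(word)
        labels.append(1 if offset in ends else 0)
    return labels

words = ["No", ",", "ragazzi", ".", "C'", "era", "una", "volta", "."]
sentences = ["No,ragazzi.", "C'eraunavolta."]  # spaces already removed
print(boundary_labels(words, sentences))  # [0, 0, 0, 1, 0, 0, 0, 0, 1]
```

Comparing character offsets rather than token counts keeps the labels correct even though the model's output is not tokenized the same way as the test set.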

Collect all words along with their corresponding golden labels and model-predicted labels for evaluation.
We will also compute the F1 score to measure model performance.
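
For reference, binary F1 over boundary labels can be written out by hand. The sketch below is a plain-Python illustration of the standard formula, not the implementation used by the `evaluate` library; `binary_f1` and the toy label lists are made up for the example.

```python
def binary_f1(predictions, references):
    """F1 for the positive class (label 1): 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(predictions, references))
    tp = sum(1 for p, r in pairs if p == 1 and r == 1)
    fp = sum(1 for p, r in pairs if p == 1 and r == 0)
    fn = sum(1 for p, r in pairs if p == 0 and r == 1)
    denominator = 2 * tp + fp + fn
    # With no positive predictions and no positive references, report 0.0.
    return 2 * tp / denominator if denominator else 0.0

gold = [0, 0, 1, 0, 1, 0, 0, 1]
pred = [0, 0, 1, 0, 0, 0, 1, 1]
print(round(binary_f1(pred, gold), 3))  # 0.667
print(binary_f1([0] * len(gold), gold))  # 0.0
```

The second call shows why an all-zero prediction is a strong penalty: every true boundary becomes a false negative, so the score drops to zero for that sequence.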

In [11]:
all_words = []
all_golden_labels = []
all_minerva_labels = []
minerva_f1 = evaluate.load("f1", average="binary")

In [12]:
for i, batch in enumerate(dataset_dict["test"].iter(batch_size=1)):
    words = batch["tokens"][0]
    golden_labels = batch["labels"][0]

    output = use_model(model, tokenizer, words_to_sequence(words)).cpu()
    minerva_output = tokenizer.decode(output[0])

    try:
        minerva_labels_helper = MinervaLabels(minerva_output, words)
    except HallucinationException as e:
        print("Hallucination:", e)
        all_words.extend(words)
        all_golden_labels.extend(golden_labels)
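        # In this case we record all zeros as the predicted labels. This penalizes the model
        # in the exported CSV for not producing a consistent (parsable) answer and keeps the
        # CSV shape consistent with other models. Note that this batch is not added to
        # minerva_f1, so the F1 printed below covers only the successfully aligned batches.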
        all_minerva_labels.extend([0] * len(golden_labels))
        continue
    
    minerva_labels = minerva_labels_helper.aligned_labels(golden_labels)

    print(f"Batch {i}")
    print(words)
    print(golden_labels)
    print(minerva_labels)

    all_words.extend(words)
    all_golden_labels.extend(golden_labels)
    all_minerva_labels.extend(minerva_labels)
    minerva_f1.add_batch(predictions=minerva_labels, references=golden_labels)

print("Minerva-7B-instruct-v1.0-sentence-splitter F1: ", minerva_f1.compute())

1. C'era una volta ... –
2. Un re! – diranno subito i miei piccoli lettori.
3. – No, ragazzi, avete sbagliato.
4. C'era una volta un pezzo di legno.
5. Non era un legno di lusso, ma un semplice pezzo da catasta, di quelli che d'inverno si mettono nelle stufe e nei caminetti per accendere il fuoco e per riscaldare le stanze.
6. Non so come andasse, ma il fatto gli è che un bel giorno questo pezzo di legno capitò nella bottega di un vecchio falegname, il quale aveva nome mastr'Antonio, se non che tutti lo chiamavano maestro Ciliegia, per via della punta del suo naso, che era sempre lustra e paonazza, come una<|eot_id|>
Batch 0
["C'", 'era', 'una', 'volta', '…', '–', 'Un', 're', '!', '–', 'diranno', 'subito', 'i', 'miei', 'piccoli', 'lettori', '.', '–', 'No', ',', 'ragazzi', ',', 'avete', 'sbagliato', '.', "C'", 'era', 'una', 'volta', 'un', 'pezzo', 'di', 'legno', '.', 'Non', 'era', 'un', 'legno', 'di', 'lusso', ',', 'ma', 'un', 'semplice', 'pezzo', 'da', 'catasta', ',', 'di', 'quelli', '

### Results Export

Finally, export the inference results to a CSV file for further analysis and comparison with other models.

In [13]:
# produce a pandas dataframe with the results
mb_results = pd.DataFrame({
    "token": all_words,
    "label": all_minerva_labels,
})

# save the results
mb_results.to_csv("Lost_in_language_recognition-hw2_split-Minerva-7B-based.csv", index=False)

The results from this notebook will be used to analyze common prediction errors and model performance patterns. Detailed analysis and insights can be found in the accompanying research report.
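
As a starting point for that kind of error analysis, the sketch below lists every position where predicted and golden labels disagree, with a little surrounding context. It is a hypothetical helper operating on in-memory lists; `boundary_errors` and the toy data are assumptions for illustration, and the exported CSV itself contains only the `token` and `label` columns.

```python
def boundary_errors(tokens, gold, predicted, context=2):
    """Return (kind, index, snippet) for every token whose labels disagree."""
    errors = []
    for i, (g, p) in enumerate(zip(gold, predicted)):
        if g == p:
            continue
        kind = "missed split" if g == 1 else "spurious split"
        snippet = " ".join(tokens[max(0, i - context): i + context + 1])
        errors.append((kind, i, snippet))
    return errors

tokens = ["No", ",", "ragazzi", ".", "C'", "era", "una", "volta", "."]
gold = [0, 0, 0, 1, 0, 0, 0, 0, 1]
pred = [0, 1, 0, 0, 0, 0, 0, 0, 1]
for error in boundary_errors(tokens, gold, pred):
    print(error)
# ('spurious split', 1, 'No , ragazzi .')
# ('missed split', 3, ', ragazzi . C\' era')
```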