# Sentence Splitter: Training

## Generative LLMs Fine Tuning

In this notebook, we fine-tune several generative LLMs (five in total) for sentence splitting,
using the training and validation sets provided by the homework assignment.

Install the libraries in the local virtual environment. 
We use specific versions to enforce reproducibility for this notebook.

In [1]:
!pip install --upgrade pip
!pip install torch==2.7.0 numpy==2.3.2 pandas==2.3.2 datasets==3.6.0 jupyter==1.1.1 unsloth==2025.9.1
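Because the versions are pinned for reproducibility, it can help to confirm that the environment really matches them. The following is a minimal sketch using only the standard library; the helper name `find_version_mismatches` is hypothetical, and the comparison ignores local build suffixes such as `+cu126`, which is an assumption about how strict the check should be.

```python
from importlib import metadata

# Pins copied from the install cell above.
PINNED = {
    "torch": "2.7.0",
    "numpy": "2.3.2",
    "pandas": "2.3.2",
    "datasets": "3.6.0",
    "jupyter": "1.1.1",
    "unsloth": "2025.9.1",
}

def find_version_mismatches(pins):
    """Return {package: (expected, installed_or_None)} for every unsatisfied pin."""
    mismatches = {}
    for package, expected in pins.items():
        try:
            # Drop a local build suffix such as "+cu126" before comparing.
            installed = metadata.version(package).split("+")[0]
        except metadata.PackageNotFoundError:
            installed = None  # the package is not installed at all
        if installed != expected:
            mismatches[package] = (expected, installed)
    return mismatches

for package, (expected, installed) in find_version_mismatches(PINNED).items():
    print(f"{package}: expected {expected}, found {installed}")
```

An empty report means every pinned package is installed at the expected version.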



Import all required libraries for the training. 
We do this first to fail fast in case additional packages need to be installed in the virtual environment.

In [2]:
import os
import random
import numpy as np
import torch
from datasets import load_dataset, Dataset
from unsloth import FastLanguageModel
from transformers import TextStreamer
from trl import SFTTrainer, SFTConfig
import pandas as pd

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


Optional (not required to run the notebook): to push the fine-tuned model to the Hugging Face Hub, set the `HF_TOKEN` environment variable.

Verify that a hardware accelerator is available. This notebook requires a GPU.

In [3]:
# os.environ['HF_TOKEN'] = 'PUT_YOUR_TOKEN_HERE'

torch.cuda.is_available()

True

Set up deterministic behavior for reproducible results by configuring random seeds for all relevant libraries:

In [4]:
RANDOM_STATE = 777

def set_seed(seed=777, total_determinism=False):
    """Seed the Python, NumPy, and PyTorch RNGs; optionally force deterministic algorithms."""
    torch.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    if total_determinism:
        torch.use_deterministic_algorithms(True)
    random.seed(seed)
    np.random.seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
set_seed(RANDOM_STATE) # Set the seed for reproducibility -- use_deterministic_algorithms can make training slower :(
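The effect of seeding can be checked on a small scale. This is a sketch that uses only Python's built-in `random` module (not NumPy or PyTorch, which `set_seed` also covers); the helper `draw` is hypothetical and exists only for illustration.

```python
import random

def draw(seed, n=5):
    """Seed Python's RNG, then draw n pseudo-random integers."""
    random.seed(seed)
    return [random.randint(0, 999) for _ in range(n)]

first = draw(777)
second = draw(777)

# Re-seeding with the same value replays exactly the same sequence.
print(first == second)
```

The same principle applies to the other libraries: each one keeps its own generator state, which is why `set_seed` seeds all of them.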

### Data Preparation and Alignment

For the LLM portion of the project, start from the dataset already created for the embedding model: [fax4ever/manzoni-192](https://huggingface.co/datasets/fax4ever/manzoni-192).

To see how this dataset is built from the CSV files, refer to `colabs/sentence_splitter_embeddings.ipynb`.

In this setting we do not need labels for each word; instead, we need conversations for training and validation.

In [5]:
SIZE = 192 # Number of words to put on each input of the encoder model

def words_to_sentences(words):
    input_text = " ".join(words)
    input_text = input_text.replace(" ,", ",")
    input_text = input_text.replace(" .", ".")
    input_text = input_text.replace(" ?", "?")
    input_text = input_text.replace(" !", "!")
    input_text = input_text.replace(" :", ":")
    input_text = input_text.replace(" ;", ";")
    input_text = input_text.replace("' ", "'")
    return input_text

def create_conversations(examples):
    input_texts = []
    output_texts = []

    for tokens, labels in zip(examples['tokens'], examples['labels']):
        input_text = words_to_sentences(tokens)
        input_texts.append(input_text)

        sentences = []
        current_sentence = []
        for token, label in zip(tokens, labels):
            current_sentence.append(token)
            if label == 1:  # End of sentence
                sentences.append(words_to_sentences(current_sentence))
                current_sentence = []

        if current_sentence:
            sentences.append(words_to_sentences(current_sentence))

        output_text = "\n".join([f"{i+1}. {sentence}" for i, sentence in enumerate(sentences)])
        output_texts.append(output_text)

    return {"input_text" : input_texts, "output_text" : output_texts}

dataset_dict = load_dataset(f"fax4ever/manzoni-{SIZE}")
llm_dataset_dict = dataset_dict.map(create_conversations, batched = True)

# optionally push it to the hub --- passing the token
# llm_dataset_dict.push_to_hub(f"fax4ever/llm-manzoni-{SIZE}", token=os.getenv("HF_TOKEN"))

Generating train split:   0%|          | 0/389 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/48 [00:00<?, ? examples/s]

Map:   0%|          | 0/389 [00:00<?, ? examples/s]

Map:   0%|          | 0/48 [00:00<?, ? examples/s]

The result is published as a Hugging Face dataset, so standard Hugging Face APIs apply.

Conversations are expressed as questions (`input_text`) and answers (`output_text`).
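To make the `input_text`/`output_text` format concrete, here is a self-contained sketch that applies the same joining and numbering logic to a toy example. The tokens and labels below are invented for illustration (they are not taken from the dataset), and the helper names are hypothetical stand-ins for the functions in the cell above.

```python
def join_tokens(tokens):
    """Join tokens with spaces, then re-attach punctuation and apostrophes."""
    text = " ".join(tokens)
    for mark in [",", ".", "?", "!", ":", ";"]:
        text = text.replace(" " + mark, mark)
    return text.replace("' ", "'")

def to_numbered_sentences(tokens, labels):
    """Close a sentence at every token labeled 1 and number the results."""
    sentences, current = [], []
    for token, label in zip(tokens, labels):
        current.append(token)
        if label == 1:  # end of sentence
            sentences.append(join_tokens(current))
            current = []
    if current:  # trailing tokens without a closing label
        sentences.append(join_tokens(current))
    return "\n".join(f"{i}. {s}" for i, s in enumerate(sentences, start=1))

# Invented toy example: a label of 1 marks the last token of a sentence.
tokens = ["Quel", "ramo", "del", "lago", "di", "Como", ".", "Era", "l'", "alba", "!"]
labels = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1]

input_text = join_tokens(tokens)
output_text = to_numbered_sentences(tokens, labels)

print(input_text)   # Quel ramo del lago di Como. Era l'alba!
print(output_text)  # two numbered lines, one per sentence
```

The question is the running text, and the answer is the same text with one numbered sentence per line.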

Alternatively, simply load the dataset from Hugging Face:

In [6]:
llm_dataset_dict = load_dataset(f"fax4ever/llm-manzoni-{SIZE}")

Generating train split:   0%|          | 0/389 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/48 [00:00<?, ? examples/s]

In this phase we create prompts from the question/answer pairs in the dataset.
Following an object-oriented approach, we define a class to produce each prompt:

In [7]:
class Prompt:
    # System prompt (Italian): "You are an expert in Italian linguistics
    # specialized in sentence segmentation."
    SYSTEM_PROMPT = "Sei un esperto di linguistica italiana specializzato nella segmentazione delle frasi."

    def __init__(self, input_text):
        self.input_text = input_text

    def instruction(self):
        # User prompt (Italian): "Split the following Italian text into sentences.
        # Please answer with one sentence per line. Thank you."
        return f"""Dividi il seguente testo italiano in frasi. Per favore rispondi con una frase per riga. Grazie.

Testo: {self.input_text}
"""

    def question(self):
        # Prompt only (system + user), used for inference.
        return [
            {"role": "system", "content": self.SYSTEM_PROMPT},
            {"role": "user",   "content": self.instruction()},
        ]

    def conversation(self, output_text):
        # Full exchange (system + user + assistant), used for fine-tuning.
        return self.question() + [{"role": "assistant", "content": output_text}]

The `conversation` method produces a full question/answer conversation and is used to fine-tune the model.
The `question` method produces only the question prompt and will be used for inference later in the notebook.

In [8]:
def create_conversations(examples):
    input_texts  = examples["input_text"]
    output_texts = examples["output_text"]

    conversations = []
    for input_text, output_text in zip(input_texts, output_texts):
        conversations.append(Prompt(input_text).conversation(output_text))
    return { "conversations": conversations, }


conversations = llm_dataset_dict.map(create_conversations, batched = True)

Map:   0%|          | 0/389 [00:00<?, ? examples/s]

Map:   0%|          | 0/48 [00:00<?, ? examples/s]

### Training

We load a quantized model and then apply a LoRA (Low-Rank Adaptation) adapter
to enable fine-tuning the LLM with modest resources.

These are the models we fine-tuned:

| Base LLM                                            | Fine-tuned model                                                                                                                                                        |
|-----------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| unsloth/Qwen3-4B                                    | [fax4ever/qwen3-4b-unsloth-bnb-4bit-sentence-splitter](https://huggingface.co/fax4ever/qwen3-4b-unsloth-bnb-4bit-sentence-splitter)                                     |
| unsloth/meta-llama-3.1-8b-instruct-unsloth-bnb-4bit | [fax4ever/meta-llama-3.1-8b-instruct-unsloth-bnb-4bit-sentence-splitter](https://huggingface.co/fax4ever/meta-llama-3.1-8b-instruct-unsloth-bnb-4bit-sentence-splitter) |
| unsloth/mistral-7b-instruct-v0.3-bnb-4bit           | [fax4ever/mistral-7b-instruct-v0.3-bnb-4bit-sentence-splitter](https://huggingface.co/fax4ever/mistral-7b-instruct-v0.3-bnb-4bit-sentence-splitter)                     |
| sapienzanlp/Minerva-7B-instruct-v1.0                | [fax4ever/Minerva-7B-instruct-v1.0-sentence-splitter](https://huggingface.co/fax4ever/Minerva-7B-instruct-v1.0-sentence-splitter)                                       |

In [9]:
LLM_MODEL = "unsloth/Qwen3-4B"
BASE_MODEL_NAME = "Qwen3-4B"

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = LLM_MODEL,  # any base model from the table above
    max_seq_length = 2048,   # context length; longer contexts use more memory
    load_in_4bit = True,     # 4-bit quantization uses much less memory
    load_in_8bit = False,    # 8-bit is a bit more accurate but uses about 2x the memory
    full_finetuning = False, # train a LoRA adapter instead of all weights
    # token = "hf_...",      # needed only for gated models
)

model = FastLanguageModel.get_peft_model(
    model,
    r = 32,           # LoRA rank; any value > 0 works (8, 16, 32, 64, 128 are typical)
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 32,  # scaling factor; usually rank or rank*2
    lora_dropout = 0, # any value is supported, but 0 is optimized
    bias = "none",    # any value is supported, but "none" is optimized
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = RANDOM_STATE,
    use_rslora = False,   # rank-stabilized LoRA is disabled
    loftq_config = None,  # LoftQ initialization is disabled
)

==((====))==  Unsloth 2025.9.1: Fast Qwen3 patching. Transformers: 4.56.1.
   \\   /|    NVIDIA L4. Num GPUs = 1. Max memory: 21.951 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.0+cu126. CUDA: 8.9. CUDA Toolkit: 12.6. Triton: 3.3.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.30. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth 2025.9.1 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.
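The size of the LoRA adapter can be estimated by hand. Each adapted linear layer with `in_features` inputs and `out_features` outputs gains two low-rank matrices, for `r * (in_features + out_features)` extra parameters. The sketch below applies this to the seven target modules; the layer dimensions are assumptions taken from the published Qwen3-4B configuration and are not stated in this notebook. Under those assumptions, the result agrees with the trainable-parameter count that the training log reports later in this notebook.

```python
# Assumed Qwen3-4B dimensions (from the published model config, not from this notebook).
HIDDEN = 2560
NUM_HEADS, NUM_KV_HEADS, HEAD_DIM = 32, 8, 128
INTERMEDIATE = 9728
NUM_LAYERS = 36  # matches "patched 36 layers" in the log above
RANK = 32        # the r passed to get_peft_model

# (in_features, out_features) of each targeted projection in one layer.
shapes = {
    "q_proj":    (HIDDEN, NUM_HEADS * HEAD_DIM),
    "k_proj":    (HIDDEN, NUM_KV_HEADS * HEAD_DIM),
    "v_proj":    (HIDDEN, NUM_KV_HEADS * HEAD_DIM),
    "o_proj":    (NUM_HEADS * HEAD_DIM, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj":   (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features, out_features, rank):
    """Parameters of the A (rank x in) and B (out x rank) matrices of one adapter."""
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, RANK) for i, o in shapes.values())
total = per_layer * NUM_LAYERS

print(f"{per_layer:,} per layer, {total:,} in total")
# 1,835,008 per layer, 66,060,288 in total
```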


We need to convert the conversation templates into the canonical format for this model.
We will use the model's tokenizer to do this.
From this, we will create the final dataset used for supervised fine-tuning.

In [10]:
# Render each conversation into a single string using the model's chat template.
chat_dataset = conversations.map(lambda x: {"formatted_chat": tokenizer.apply_chat_template(x["conversations"], tokenize=False)})

# SFTTrainer reads the training text from a single "text" column.
train_dataset = Dataset.from_dict({"text": list(chat_dataset["train"]["formatted_chat"])})
validation_dataset = Dataset.from_dict({"text": list(chat_dataset["validation"]["formatted_chat"])})

Map:   0%|          | 0/389 [00:00<?, ? examples/s]

Map:   0%|          | 0/48 [00:00<?, ? examples/s]
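To give an idea of what a chat template does, here is a simplified ChatML-style renderer. It is an illustrative sketch only: the function `render_chatml` is hypothetical, and the real template shipped with a given model may add further markers, so the notebook relies on `tokenizer.apply_chat_template` rather than anything hand-written.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Render messages as '<|im_start|>role\\ncontent<|im_end|>' blocks."""
    parts = [
        f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        for message in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

messages = [
    {"role": "system", "content": "You split text into sentences."},
    {"role": "user", "content": "Testo: Ciao. Come stai?"},
]
print(render_chatml(messages, add_generation_prompt=True))
```

For training, the assistant answer is part of the rendered text; for inference, the text stops at the opened assistant turn.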

Finally, train the model and, optionally, push it to the Hugging Face Hub.

In [11]:
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_dataset,
    eval_dataset = validation_dataset,
    args = SFTConfig(
        dataset_text_field = "text",
        gradient_accumulation_steps = 4, # Use GA to mimic batch size!
        warmup_steps = 5,  # ~500-2000 or 10-20% of the total steps
        num_train_epochs = 10,
        learning_rate = 2e-4, # Reduce to 2e-5 for long training runs
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to = "none", # Use this for WandB etc
    ),
)

Unsloth: Tokenizing ["text"] (num_proc=20):   0%|          | 0/389 [00:00<?, ? examples/s]

Unsloth: Tokenizing ["text"] (num_proc=20):   0%|          | 0/48 [00:00<?, ? examples/s]

In [12]:
trainer.train()

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None}.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 389 | Num Epochs = 10 | Total steps = 250
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 4 x 1) = 16
 "-____-"     Trainable parameters = 66,060,288 of 4,088,528,384 (1.62% trained)


Unsloth: Will smartly offload gradients to save VRAM!


| Step | Training Loss |
|------|---------------|
| 1    | 2.0326        |
| 2    | 2.0582        |
| 3    | 2.027         |
| 4    | 1.9663        |
| 5    | 1.7678        |
| 6    | 1.7009        |
| 7    | 1.5878        |
| 8    | 1.5267        |
| 9    | 1.5039        |
| 10   | 1.4571        |


TrainOutput(global_step=250, training_loss=0.8206937747001648, metrics={'train_runtime': 2095.7977, 'train_samples_per_second': 1.856, 'train_steps_per_second': 0.119, 'total_flos': 6.642892692258816e+16, 'train_loss': 0.8206937747001648, 'epoch': 10.0})
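The figures in the training log can be cross-checked with a little arithmetic. The sketch below reproduces the 250 total steps and the reported throughput from the numbers printed above; the exact rounding of partial batches is an implementation detail of the trainer, so the two `ceil` calls are an assumption that happens to match this log.

```python
import math

# Values copied from the training log above.
num_examples = 389
per_device_batch = 4
grad_accum_steps = 4
num_epochs = 10
train_runtime_s = 2095.7977

# One optimizer step consumes grad_accum_steps mini-batches.
batches_per_epoch = math.ceil(num_examples / per_device_batch)    # 98
steps_per_epoch = math.ceil(batches_per_epoch / grad_accum_steps) # 25
total_steps = steps_per_epoch * num_epochs                        # 250

samples_per_second = num_examples * num_epochs / train_runtime_s
steps_per_second = total_steps / train_runtime_s

print(total_steps)                   # 250
print(round(samples_per_second, 3))  # 1.856
print(round(steps_per_second, 3))    # 0.119
```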

In [13]:
trained_model_name = BASE_MODEL_NAME + "-sentence-splitter"
model_checkpoint = "fax4ever/" + trained_model_name

# model.push_to_hub(model_checkpoint, token=os.environ['HF_TOKEN'])
# tokenizer.push_to_hub(model_checkpoint, token=os.environ['HF_TOKEN'])

### Inference

Here is just a basic test. For more complete inference examples, please see the inference notebooks:

1. colabs/sentence_splitter_out_of_domain_eval_discriminative.ipynb
2. colabs/sentence_splitter_out_of_domain_test_discriminative.ipynb
3. colabs/sentence_splitter_out_of_domain_test_generative.ipynb

In [14]:
input_text = """Non era un legno di lusso, ma un semplice pezzo
da catasta, di quelli che d’inverno si mettono nelle
stufe e nei caminetti per accendere il fuoco e per riscaldare le stanze.
Non so come andasse, ma il fatto gli è che un bel
giorno questo pezzo di legno capitò nella bottega
di un vecchio falegname, il quale aveva nome mastr’Antonio, se non che tutti lo chiamavano maestro
Ciliegia, per via della punta del suo naso, che era
sempre lustra e paonazza, come una ciliegia matura.
Appena maestro Ciliegia ebbe visto quel pezzo di
legno, si rallegrò tutto; e dandosi una fregatina di
mani per la contentezza, borbottò a mezza voce:
"Questo legno è capitato a tempo; voglio servirmene per fare una gamba di tavolino."
"""
input_text = input_text.splitlines()
input_text = " ".join(input_text)

question = tokenizer.apply_chat_template(
    [Prompt(input_text).question()], 
    tokenize = False,
    add_generation_prompt = True, # Must add for generation
    enable_thinking = False, # Disable thinking
)

_ = model.generate(
    **tokenizer(question, return_tensors = "pt").to("cuda"),
    max_new_tokens = 256, # Increase for longer outputs!
    temperature = 0.7, top_p = 0.8, top_k = 20, # For non thinking
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)

1. Non era un legno di lusso, ma un semplice pezzo da catasta, di quelli che d'inverno si mettono nelle stufe e nei caminetti per accendere il fuoco e per riscaldare le stanze.
2. Non so come andasse, ma il fatto gli è che un bel giorno questo pezzo di legno capitò nella bottega di un vecchio falegnome, il quale aveva nome mastr'Antonio, se non che tutti lo chiamavano maestro Ciliegia, per via della punta del suo naso, che era sempre lustra e paonazza, come una ciliegia matura.
3. Appena maestro Ciliegia ebbe visto quel pezzo di legno, si rallegrò tutto; e dandosi una fregatina di mani per la contentezza, borbottò a mezza voce: "Questo legno è capitato a tempo; voglio servirmene per fare una gamba di tavolino."<|im_end|>
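Since the model answers with a numbered list, downstream code usually needs the sentences back as plain strings. The following is a minimal sketch of such a parser, assuming the `N. sentence` line format seen above and a trailing `<|im_end|>` marker; the function name `parse_numbered_sentences` is hypothetical, and lines that do not match the pattern are silently skipped.

```python
import re

# One numbered line: optional indent, digits, a dot, whitespace, then the sentence.
NUMBERED_LINE = re.compile(r"^\s*\d+\.\s+(.*\S)\s*$")

def parse_numbered_sentences(generated, end_marker="<|im_end|>"):
    """Extract the sentences from a numbered-list answer."""
    text = generated.replace(end_marker, "")
    sentences = []
    for line in text.splitlines():
        match = NUMBERED_LINE.match(line)
        if match:
            sentences.append(match.group(1))
    return sentences

answer = "1. Ciao.\n2. Come stai?<|im_end|>"
print(parse_numbered_sentences(answer))  # ['Ciao.', 'Come stai?']
```

Comparing the concatenation of the parsed sentences with the original input is a simple way to detect answers where the model altered the text instead of only splitting it.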
