# Sentence Splitter using an LLM

Install the required libraries in the virtual environment:

In [None]:
!pip install --upgrade pip
!pip install torch numpy pandas datasets jupyter unsloth

Let's import everything we need up front, so that the notebook fails fast if something is still missing from the virtual environment:

In [None]:
import os
from unsloth import FastLanguageModel
import torch
import random
import numpy as np
from datasets import load_dataset, Dataset

First of all, let's verify that a CUDA accelerator is available:

In [None]:
torch.cuda.is_available()

Before doing anything else, let's make this run as deterministic as possible:

In [None]:
def set_seed(seed=777, total_determinism=False):
    # Seed every random number generator in use: PyTorch (CPU and CUDA), Python and NumPy
    torch.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    if total_determinism:
        torch.use_deterministic_algorithms(True)
    random.seed(seed)
    np.random.seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
set_seed() # Set the seed for reproducibility -- use_deterministic_algorithms can make training slower :(

## PART ONE: Create the dataset

For the LLM part of the project we can start from the dataset we already created for the embedding part:
[fax4ever/manzoni-192](https://huggingface.co/datasets/fax4ever/manzoni-192).

To see how this dataset is created from the `CSV` files, see the `colabs/sentence_splitter_embeddings.ipynb` notebook.

In this setting we don't need a label for each word; instead, we need conversations to train and validate on.

In [None]:
SIZE = 192 # Number of words to put on each input of the encoder model

def create_conversations(examples):
    input_texts = []
    output_texts = []

    for tokens, labels in zip(examples['tokens'], examples['labels']):
        input_text = " ".join(tokens)
        input_texts.append(input_text)

        sentences = []
        current_sentence = []
        for token, label in zip(tokens, labels):
            current_sentence.append(token)
            if label == 1:  # End of sentence
                sentences.append(" ".join(current_sentence).strip())
                current_sentence = []
        
        # Add remaining tokens if any
        if current_sentence:
            sentences.append(" ".join(current_sentence).strip())

        output_text = "\n".join([f"{i+1}. {sentence}" for i, sentence in enumerate(sentences)])
        output_texts.append(output_text)

    return {"input_text" : input_texts, "output_text" : output_texts}

dataset_dict = load_dataset(f"fax4ever/manzoni-{SIZE}")
llm_dataset_dict = dataset_dict.map(create_conversations, batched = True)
llm_dataset_dict.push_to_hub(f"fax4ever/llm-manzoni-{SIZE}", token=os.getenv("HF_TOKEN"))

The result is published as a Hugging Face dataset, so the standard Hugging Face APIs can be applied to it.
That is the benefit of following an open standard!
Conversations here are expressed in terms of questions (`input_text`) and answers (`output_text`).
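
To make the mapping concrete, here is a minimal sketch of the same grouping logic applied to a single record. The helper name `split_into_numbered_sentences` and the toy tokens and labels are hypothetical (they are not taken from the dataset), and the sketch assumes, as in `create_conversations`, that a label of `1` marks the last token of a sentence:

```python
def split_into_numbered_sentences(tokens, labels):
    """Group tokens into sentences (label 1 closes a sentence) and number them."""
    sentences, current = [], []
    for token, label in zip(tokens, labels):
        current.append(token)
        if label == 1:  # end of sentence
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing tokens without a closing label
        sentences.append(" ".join(current))
    return "\n".join(f"{i}. {s}" for i, s in enumerate(sentences, start=1))


# Hypothetical toy record, for illustration only
tokens = ["Quel", "ramo", "del", "lago", ".", "Addio", "monti", "."]
labels = [0, 0, 0, 0, 1, 0, 0, 1]

input_text = " ".join(tokens)                                # the "question"
output_text = split_into_numbered_sentences(tokens, labels)  # the "answer"

print(input_text)
print(output_text)
```

The question is the raw text joined by spaces, while the answer is the same text with one numbered sentence per line.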

Alternatively, we can simply load the resulting dataset from Hugging Face:

In [None]:
dataset_dict = load_dataset(f"fax4ever/llm-manzoni-{SIZE}")

## PART TWO: Create the prompts

In this phase we create prompts from the question/answer pairs in the dataset.
Following an object-oriented approach, we define a class that produces each prompt:

In [None]:
class Prompt:
    def __init__(self, input_text):
        self.input_text = input_text

    def instruction(self):
        return f"""Dividi il seguente testo italiano in frasi. Per favore rispondi con una frase per riga. Grazie.

Testo: {self.input_text}
"""

    def conversation(self, output_text):
        return [
            {"role" : "system",    "content" : "Sei un esperto di linguistica italiana specializzato nella segmentazione delle frasi."},
            {"role" : "user",      "content" : self.instruction()},
            {"role" : "assistant", "content" : output_text},
        ]

    def question(self):
        return [
            {"role" : "system",    "content" : "Sei un esperto di linguistica italiana specializzato nella segmentazione delle frasi."},
            {"role" : "user",      "content" : self.instruction()},
        ]

The `conversation` method produces a full question/answer conversation, which will be used for training.
The `question` method produces just the question prompt, which will be used for inference.
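
As a rough usage sketch, the snippet below builds both message lists for one record. It restates the class in condensed form so that it runs on its own. The `SYSTEM_PROMPT` constant and the `example` record are hypothetical and introduced only for illustration; a real record would come from `dataset_dict`:

```python
SYSTEM_PROMPT = "Sei un esperto di linguistica italiana specializzato nella segmentazione delle frasi."


class Prompt:
    """Condensed restatement of the class above, kept here so the snippet is self-contained."""

    def __init__(self, input_text):
        self.input_text = input_text

    def instruction(self):
        return (
            "Dividi il seguente testo italiano in frasi. "
            "Per favore rispondi con una frase per riga. Grazie.\n\n"
            f"Testo: {self.input_text}\n"
        )

    def question(self):
        # System + user messages only: the model is expected to produce the answer
        return [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": self.instruction()},
        ]

    def conversation(self, output_text):
        # Same messages as question(), plus the reference answer as the assistant turn
        return self.question() + [{"role": "assistant", "content": output_text}]


# Hypothetical record, for illustration only
example = {
    "input_text": "Quel ramo del lago . Addio monti .",
    "output_text": "1. Quel ramo del lago .\n2. Addio monti .",
}

prompt = Prompt(example["input_text"])
training_messages = prompt.conversation(example["output_text"])
inference_messages = prompt.question()

print([m["role"] for m in training_messages])
print([m["role"] for m in inference_messages])
```

The two lists differ only in the final assistant turn, which keeps the prompts seen during training and inference consistent.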