# Sentence Splitter using an Embedding Model

Install the required libraries in the virtual environment:

In [None]:
!pip install --upgrade pip
!pip install torch numpy pandas datasets jupyter

Let's import everything we need:

In [None]:
import torch
from datasets import Dataset, DatasetDict
import numpy as np
import random
import pandas as pd
import os

First of all, let's verify that a CUDA accelerator is available:

In [None]:
torch.cuda.is_available()

Before doing anything else, let's try to make this run as deterministic as possible:

In [None]:
def set_seed(seed=777, total_determinism=False):
    """Seed every random number generator used in this notebook."""
    torch.manual_seed(seed)
    # Ask cuDNN for deterministic kernels and disable its auto-tuner
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    if total_determinism:
        # Stricter: raises an error on operations that have no deterministic implementation
        torch.use_deterministic_algorithms(True)
    random.seed(seed)
    np.random.seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)      # current GPU
        torch.cuda.manual_seed_all(seed)  # every GPU

set_seed() # Set the seed for reproducibility -- use_deterministic_algorithms can make training slower :(
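
A quick way to see what seeding buys us, sketched here with the standard-library generator only (the same idea applies to the NumPy and PyTorch generators seeded above). The helper `draw` is hypothetical and not part of the notebook:

```python
import random

def draw(seed, n=3):
    """Reseed the stdlib generator and return n pseudo-random integers."""
    random.seed(seed)
    return [random.randint(0, 999) for _ in range(n)]

first = draw(777)
second = draw(777)

# Reseeding with the same value replays exactly the same sequence
print(first == second)
```

If the two lists were ever different, some source of randomness would have been left unseeded.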

## PART ONE: Create the dataset

In this section we're going to create a standard Hugging Face dataset from the CSV files `data/manzoni_dev_tokens.csv` and `data/manzoni_train_tokens.csv`.

The output will be available at [fax4ever/manzoni-192](https://huggingface.co/datasets/fax4ever/manzoni-192).

## Our first hyperparameter

The original CSV files contain a text that has to be split into portions small enough to be passed as input to the encoder model. Typically the maximum number of tokens an encoder accepts is 512 (this is true for BERT, for instance).
Keep in mind that the model's tokenizer will, in general, produce one or more tokens for each word of the text.
One strategy would be to split the text so that each input uses as many tokens as possible, but this is not always optimal.
So the number of words we put in each input will be our first hyperparameter.
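
To get a feeling for the trade-off, here is a rough back-of-the-envelope sketch. The tokens-per-word ratios are purely hypothetical (the real value depends on the tokenizer and on the text), and the two special tokens per input are an assumption as well:

```python
import math

MAX_MODEL_TOKENS = 512  # limit mentioned above for BERT-like encoders
SPECIAL_TOKENS = 2      # assumption: one start and one end marker per input

def estimated_tokens(num_words, tokens_per_word):
    """Rough estimate of the encoded length of num_words words."""
    return math.ceil(num_words * tokens_per_word) + SPECIAL_TOKENS

def fits(num_words, tokens_per_word):
    """True if the estimated length stays within the model limit."""
    return estimated_tokens(num_words, tokens_per_word) <= MAX_MODEL_TOKENS

for ratio in (1.0, 2.0, 3.0):  # hypothetical average tokens per word
    print(ratio, estimated_tokens(192, ratio), fits(192, ratio))
```

With 192 words per input, the estimate stays under the limit as long as the tokenizer produces on average about 2.6 tokens per word or fewer.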

In [None]:
SIZE = 192 # Number of words to put on each input of the encoder model

def group_into_sequences(df, seq_len=SIZE):
    tokens = df['token'].tolist()
    labels = df['label'].tolist()
    
    # Group into sequences of seq_len; a trailing chunk shorter than seq_len is dropped
    token_seqs = [tokens[i:i+seq_len] for i in range(0, len(tokens), seq_len) if len(tokens[i:i+seq_len]) == seq_len]
    label_seqs = [labels[i:i+seq_len] for i in range(0, len(labels), seq_len) if len(labels[i:i+seq_len]) == seq_len]
    
    return {'tokens': token_seqs, 'labels': label_seqs}


train = pd.read_csv("data/manzoni_train_tokens.csv")  # token,label
validation = pd.read_csv("data/manzoni_dev_tokens.csv")  # token,label

# Group into sequences of SIZE
train_grouped = group_into_sequences(train)
validation_grouped = group_into_sequences(validation)

print(f"Train: {len(train_grouped['tokens'])} sequences of {SIZE} tokens each")
print(f"Validation: {len(validation_grouped['tokens'])} sequences of {SIZE} tokens each")

train_dataset = Dataset.from_dict(train_grouped)
validation_dataset = Dataset.from_dict(validation_grouped)

dataset_dict = DatasetDict({
    'train': train_dataset,
    'validation': validation_dataset  # Using 'validation' as the standard name
})
dataset_dict.push_to_hub(f"fax4ever/manzoni-{SIZE}", token=os.getenv("HF_TOKEN"))

The result is published as a Hugging Face dataset, so the standard Hugging Face APIs can be applied to it.
That is the benefit of following an open standard!
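
One detail of the grouping step is worth seeing on toy data: only full sequences are kept. The sketch below reproduces the list comprehension of `group_into_sequences` on a plain list; the helper name `chunk` and the toy input are made up for illustration:

```python
def chunk(items, seq_len):
    """Split items into consecutive chunks of exactly seq_len elements.

    Mirrors the list comprehension in group_into_sequences: a trailing
    chunk shorter than seq_len is discarded.
    """
    return [items[i:i + seq_len]
            for i in range(0, len(items), seq_len)
            if len(items[i:i + seq_len]) == seq_len]

words = list("abcdefgh")  # 8 toy "words"
print(chunk(words, 3))    # two full chunks; 'g' and 'h' are dropped
```

So with `SIZE = 192`, up to 191 words at the end of each CSV file do not make it into the dataset.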

## PART TWO: Tokenize the dataset

In the tokenization process each word of each input will become one or more tokens.
First of all, we need to define some constants.
The constants reflect the convention we implicitly found in the CSV files. The label 1 marks the end of a sentence (and therefore the start of a new one), while 0 is used for all the other tokens. The special tokens that mark the start and the end of each encoded input sequence will be labeled with 0.

In [None]:
END_OF_SENTENCE = 1
NOT_END_OF_SENTENCE = 0
LABEL_FOR_START_END_OF_SEQUENCE = NOT_END_OF_SENTENCE
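
To make the convention concrete, here is a minimal sketch that rebuilds sentences from word-level labels. It assumes that the label 1 sits on the last word of each sentence; the function `split_sentences` and the toy words are invented for illustration and do not come from the CSV files:

```python
END_OF_SENTENCE = 1

def split_sentences(words, labels):
    """Group words into sentences, closing one at every END_OF_SENTENCE label."""
    sentences, current = [], []
    for word, label in zip(words, labels):
        current.append(word)
        if label == END_OF_SENTENCE:
            sentences.append(current)
            current = []
    if current:
        # Trailing words with no closing label form a last, open sentence
        sentences.append(current)
    return sentences

words = ["It", "rains.", "We", "stay", "home."]  # invented toy data
labels = [0, 1, 0, 0, 1]
print(split_sentences(words, labels))
```

This is exactly the inverse of what the model is asked to learn: predicting, for each word, whether a sentence ends there.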

The original dataset provides a label for each `word`; in order to have a label for each `token`, we introduce a utility function: