# Sentence Splitter: Out of Domain Test Set

In this notebook we produce labels for the out-of-domain test set.
We also compute the F1 metric by comparing the predicted labels with the gold labels provided by the test dataset.

> **_NOTE:_**  The F1 computed on the test set is not used for any model tuning or model selection.

Import all required libraries:

In [41]:
from transformers import pipeline
import evaluate
import torch
import numpy as np
import pandas as pd
import random
from datasets import Dataset, DatasetDict, load_dataset
import os

Before proceeding, make the run as deterministic as possible:

In [15]:
def set_seed(seed=777, total_determinism=False):
    # Seed every RNG in use (PyTorch, Python, NumPy, CUDA) with the same value.
    torch.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    if total_determinism:
        torch.use_deterministic_algorithms(True)
    random.seed(seed)
    np.random.seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
set_seed() # Set the seed for reproducibility -- use_deterministic_algorithms can make training slower :(
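
As a quick sanity check of what seeding buys us, the sketch below uses only Python's standard `random` module (not the `set_seed` helper above) to show that re-seeding with the same value reproduces the same draws. The helper name `draw` is hypothetical and exists only for this illustration.

```python
import random

def draw(seed, n=5):
    """Return n pseudo-random integers after seeding Python's RNG."""
    random.seed(seed)
    return [random.randint(0, 99) for _ in range(n)]

first = draw(777)
second = draw(777)
print(first == second)  # True: same seed, same sequence of draws
```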

In [18]:
SIZE = 192 # Number of words to put on each input of the encoder model

def group_into_sequences(df, seq_len=SIZE):
    tokens = df['token'].tolist()
    labels = df['label'].tolist()
    
    # Group into sequences of exactly seq_len tokens.
    # A trailing chunk shorter than seq_len is dropped by the length filter.
    token_seqs = [tokens[i:i+seq_len] for i in range(0, len(tokens), seq_len) if len(tokens[i:i+seq_len]) == seq_len]
    label_seqs = [labels[i:i+seq_len] for i in range(0, len(labels), seq_len) if len(labels[i:i+seq_len]) == seq_len]
    
    return {'tokens': token_seqs, 'labels': label_seqs}
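
To make the grouping behaviour concrete, here is a minimal sketch of the same slicing logic applied to a plain list, without pandas. The helper name `chunk_exact` and the toy data are made up for illustration; the point is that a trailing chunk shorter than `seq_len` is discarded.

```python
def chunk_exact(items, seq_len):
    """Split items into consecutive chunks of exactly seq_len elements.

    A trailing chunk shorter than seq_len is dropped, mirroring the
    length filter used in group_into_sequences.
    """
    chunks = [items[i:i + seq_len] for i in range(0, len(items), seq_len)]
    return [chunk for chunk in chunks if len(chunk) == seq_len]

toy_tokens = ["a", "b", "c", "d", "e", "f", "g"]
toy_chunks = chunk_exact(toy_tokens, 3)
print(toy_chunks)  # [['a', 'b', 'c'], ['d', 'e', 'f']]  ('g' is dropped)
```

For example, 1521 tokens with `seq_len=192` yield 7 full sequences, and the remaining 177 tokens are left out.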

Load the test set and group words into **sequences**.
Remember that sequences (the inputs to the encoder) are not sentences (the output of our analysis).
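
To illustrate the difference, the sketch below rebuilds sentences from per-token labels. It assumes that label `1` marks the last token of a sentence, which is an assumption about the labelling scheme and not something shown in this notebook; the function name `tokens_to_sentences` and the toy labels are also hypothetical.

```python
def tokens_to_sentences(tokens, labels):
    """Rebuild sentences, assuming label 1 marks the last token of a sentence."""
    sentences, current = [], []
    for token, label in zip(tokens, labels):
        current.append(token)
        if label == 1:  # assumed sentence boundary
            sentences.append(current)
            current = []
    if current:  # tokens after the last boundary form an unfinished sentence
        sentences.append(current)
    return sentences

# Toy input: one fixed-size sequence can hold several (partial) sentences.
toy_tokens = ["No", ",", "ragazzi", ".", "C'", "era", "una", "volta"]
toy_labels = [0, 0, 0, 1, 0, 0, 0, 0]
toy_sentences = tokens_to_sentences(toy_tokens, toy_labels)
print(toy_sentences)
# [['No', ',', 'ragazzi', '.'], ["C'", 'era', 'una', 'volta']]
```

A sequence is cut purely by token count, so it can start or end in the middle of a sentence, as the second group above does.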

In [27]:
test = pd.read_csv("../data/OOD_test.csv", sep=';')  # token,label
test

Unnamed: 0,token,label
0,C',0
1,era,0
2,una,0
3,volta,0
4,…,0
...,...,...
1517,tornò,0
1518,zoppicando,0
1519,a,0
1520,casa,0


In [31]:
test_grouped = group_into_sequences(test)
test_grouped

{'tokens': [["C'",
   'era',
   'una',
   'volta',
   '…',
   '–',
   'Un',
   're',
   '!',
   '–',
   'diranno',
   'subito',
   'i',
   'miei',
   'piccoli',
   'lettori',
   '.',
   '–',
   'No',
   ',',
   'ragazzi',
   ',',
   'avete',
   'sbagliato',
   '.',
   "C'",
   'era',
   'una',
   'volta',
   'un',
   'pezzo',
   'di',
   'legno',
   '.',
   'Non',
   'era',
   'un',
   'legno',
   'di',
   'lusso',
   ',',
   'ma',
   'un',
   'semplice',
   'pezzo',
   'da',
   'catasta',
   ',',
   'di',
   'quelli',
   'che',
   "d'",
   'inverno',
   'si',
   'mettono',
   'nelle',
   'stufe',
   'e',
   'nei',
   'caminetti',
   'per',
   'accendere',
   'il',
   'fuoco',
   'e',
   'per',
   'riscaldare',
   'le',
   'stanze',
   '.',
   'Non',
   'so',
   'come',
   'andasse',
   ',',
   'ma',
   'il',
   'fatto',
   'gli',
   'è',
   'che',
   'un',
   'bel',
   'giorno',
   'questo',
   'pezzo',
   'di',
   'legno',
   'capitò',
   'nella',
   'bottega',
   'di',
   'un',
   '

In [None]:
test_dataset = Dataset.from_dict(test_grouped)

dataset_dict = DatasetDict({
    'test': test_dataset,
})

# Optionally push the dataset to the hub.
# dataset_dict.push_to_hub(f"fax4ever/manzoni-{SIZE}-test", token=os.getenv("HF_TOKEN"))

In [42]:
# Alternatively, load the dataset from the hub.

dataset_dict = load_dataset(f"fax4ever/manzoni-{SIZE}-test")

Generating test split: 100%|██████████| 7/7 [00:00<00:00, 2825.26 examples/s]


In [None]:
# We chose this model before evaluating the F1 on the test set.
# We chose it because it has the best F1 score on the other datasets.

model_name = "electra-base-italian-xxl-cased-discriminator-sentence-splitter"
model_checkpoint = "fax4ever/" + model_name
inference_pipeline = pipeline("token-classification", model=model_checkpoint, aggregation_strategy="simple")

Device set to use cpu


In [13]:
metric = evaluate.load("f1", average="binary")
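
For reference, binary F1 is the harmonic mean of precision and recall on the positive class. The sketch below computes it by hand in plain Python, assuming label `1` is the positive class. It is not the `evaluate` implementation, and the name `binary_f1` and the toy labels are hypothetical.

```python
def binary_f1(references, predictions, positive=1):
    """Compute binary F1 for the positive class from two flat label lists."""
    pairs = list(zip(references, predictions))
    tp = sum(1 for ref, pred in pairs if ref == positive and pred == positive)
    fp = sum(1 for ref, pred in pairs if ref != positive and pred == positive)
    fn = sum(1 for ref, pred in pairs if ref == positive and pred != positive)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

gold = [0, 0, 1, 0, 1, 0, 1]
pred = [0, 0, 1, 1, 0, 0, 1]
score = binary_f1(gold, pred)
print(round(score, 3))  # 0.667 (tp=2, fp=1, fn=1)
```

Since the labels are grouped into sequences, they would need to be flattened into two aligned lists before a per-token comparison like this one.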