OBJECTIVE:
    In this project I build a NER (Named Entitity Recognition) model on top of the BERT architecture.
    The aim is to adapt the BERT Large Language Model (LLM) for a specific task.
    Here, BertForTokenClassification is used. It is included in the Tranformers library.
    The BertForTokenClassification has a token classification head on top, which allows to make predictions at a token level.
    Since NER is typically a token classification problem, this architecture meets the utility requirements.

METHOD:
    The Transfer Learning method is used: 
        first pretraining a large neural network in an unsupervised way, then fine-tune that NN on the task of interest.
    In this case, BERT is a NN pretrained on two tasks: masked language modeling and next sentence prediction.
    So I am going to fine-tune this network on a NER dataset.
    Since fine-tuning implies supervised learning, it is necessary to have a labeled dataset.

NOTE:
    Deep Learning can be accelerated a lot using a GPU instead of a CPU.
    Despite that, in this project CPU is used and the following code is adapted to this setting.
    If you want to run this notebook in a GPU runtime, make sure to adapt this scripts to your purpose.

First we need to install the necessary libraries and import them as follows:

In [None]:
'''
    Conda Environment:
        conda install conda-forge::seqeval
        conda install anaconda::scikit-learn
        pip install torch
        conda install conda-forge::transformers
'''

In [1]:
# import transformers,pandas,numpy,sklearn,seqeval,torch
# from seqeval.metrics import classification_report
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizerFast, BertForTokenClassification # BertConfig, AutoTokenizer, AutoModelForMaskedLM
from sklearn.metrics import confusion_matrix, recall_score, precision_score, f1_score
import numpy

  from .autonotebook import tqdm as notebook_tqdm


DOWNLOADING AND PROCESSING THE DATA
In European Languages, Named Entity Recognition uses a specific annotation scheme which is defined at "word level". There are several word-level schemes, one widely used is the IOB-Tagging.
IOB stands for "Inside-Outside-Beginning", meaning that each tag can be either an inside, outside or beginning tag for a word. This is used because named entities may comprise more than on word.

To train a deep learning model for NER, we need a training dataset in IOB format (or other word-level foramts).
While there exist several annotation tools to create such a dataset (e.g. Doccano), in this project I use the NER dataset form Kaggle. It already comes in IOB format.
Go to the Kaggle webpage to download the dataset, unzip it, and upload it.

In [2]:
data = pd.read_csv("ner_datasetreference.csv", encoding='unicode_escape')
data.head()

Unnamed: 0,Sentence #,Word,POS,Tag
0,Sentence: 1,Thousands,NNS,O
1,,of,IN,O
2,,demonstrators,NNS,O
3,,have,VBP,O
4,,marched,VBN,O


In [3]:
data.count()

Sentence #      47959
Word          1048565
POS           1048575
Tag           1048575
dtype: int64

There are 8 category tags, each with a beginning and inside variant, and the outside tag.
For sick of simplicity, let's remove "art", "eve" and "nat" named entities from this project.

In [4]:
entities_to_remove = ["B-art", "I-art", "B-eve", "I-eve", "B-nat", "I-nat"]
data = data[~data.Tag.isin(entities_to_remove)]
data.head()

Unnamed: 0,Sentence #,Word,POS,Tag
0,Sentence: 1,Thousands,NNS,O
1,,of,IN,O
2,,demonstrators,NNS,O
3,,have,VBP,O
4,,marched,VBN,O


Now let's create two dictionaries:
    1 - The first maps individualtags to indices ; i.e. each tag to its specific indices (which is just a number)
    2 - The second maps indices back to their individual tags; i.e. each indices (that is each number) to its specific tag.

This is necessary to create the labels, since the computers work with numbers (indices).
Fundamentally we will use these dictionaries when creating the training and the test sets.

We can see that we have 10 different NER tags.

In [5]:
labels_to_ids = {k: v for v, k in enumerate(data.Tag.unique())}
ids_to_labels = {v: k for v, k in enumerate(data.Tag.unique())}
labels_to_ids

{'O': 0,
 'B-geo': 1,
 'B-gpe': 2,
 'B-per': 3,
 'I-geo': 4,
 'B-org': 5,
 'I-org': 6,
 'B-tim': 7,
 'I-per': 8,
 'I-gpe': 9,
 'I-tim': 10}

Now that we have a training set, let's focus on what is a training example for NER.
During the training, the training example is what it is provided in a sigle forward pass. Typically it is a sentence with the corresponding IOB tags, as shown below:

In [6]:
data = data.fillna(method='ffill')
# create a new column "sentence" which groups the words by sentence 
data['sentence'] = data[['Sentence #','Word','Tag']].groupby(['Sentence #'])['Word'].transform(lambda x: ' '.join(x))
# create a new column "word_labels" which groups the tags by sentence 
data['word_labels'] = data[['Sentence #','Word','Tag']].groupby(['Sentence #'])['Tag'].transform(lambda x: ','.join(x))
data = data[["sentence", "word_labels"]].drop_duplicates().reset_index(drop=True)
data.head()

  data = data.fillna(method='ffill')


Unnamed: 0,sentence,word_labels
0,Thousands of demonstrators have marched throug...,"O,O,O,O,O,O,B-geo,O,O,O,O,O,B-geo,O,O,O,O,O,B-..."
1,Families of soldiers killed in the conflict jo...,"O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,B-per,O,O,..."
2,They marched from the Houses of Parliament to ...,"O,O,O,O,O,O,O,O,O,O,O,B-geo,I-geo,O"
3,"Police put the number of marchers at 10,000 wh...","O,O,O,O,O,O,O,O,O,O,O,O,O,O,O"
4,The protest comes on the eve of the annual con...,"O,O,O,O,O,O,O,O,O,O,O,B-geo,O,O,B-org,I-org,O,..."


NEXT, LET'S PREPARE THE DATASET AND THE DATALOADER.

Now that data is preprocessed, we can turn it into PyTorch tensors such that we can provide it to the model.
PyTorch tensors is the data format needed by the BERT model.

Below, some variables are set which will be used during the training/evaluation process.
NOTE: 
    Due to the computation time and resources required, here I fix some values as training/evaluation parameters. 
    Using a Cross Validation or Grid Search method, or perhaps other values, may lead to a better model.

In [7]:
MAX_LEN = 128 # Max number of tokens per sentece (BERT models have a limit of 512)
TRAIN_BATCH_SIZE = 4
VALID_BATCH_SIZE = 2
EPOCHS = 1
LEARNING_RATE = 1e-05
MAX_GRAD_NORM = 10 # For Gradient Clipping in model training, prevent exploding gradients leading to a better convergence towards a good solution

Now I import th tokenizer. 
The standard BERT tokenizer, and in general the tokenizers of the BERT models, use a "word-piece tokenization" mechanism.
It means that it can split a single word in two or more pieces.

Other tokenizers are based on "word tokenization", which allows to better exploit the IOB format.
An example of such tokenizers is the one provided by Spacy (1 word = 1 token).
Also, some models built on top of BERT comes with their specific tokenizers based for example on SpaCy.
This is the case of italian-legal-BERT tokenizer, built starting from SpaCy and integrated with rules and abbreviations typical of the italian legal syntax.

In [8]:
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased') 

Moreover, BERT also want the data to respect some requirements:
- special tokens at the beginning and end of each training example are required.
- each training example must have the same length (related to the max length of the model, see the parameters set above)
- it requires an attention mask
- labels need to be created according to the dictionary defined above (the two maps)
- word pieces that should be ignored have a label of -100 (that is the default ignore_index of PyTorch's CrossEntropyLoss)

Below, it is defined a regular PyTorch dataset class, where also these requirements are taken care.
The PyTorch dataset class transforms the training examples of a dataframe to PyTorch tensors (the format required for the trainign of BERT models)

In [17]:
class dataset(Dataset):
  def __init__(self, dataframe, tokenizer, max_len):
        self.len = len(dataframe)
        self.data = dataframe
        self.tokenizer = tokenizer
        self.max_len = max_len

  def __getitem__(self, index):
        # step 1: get the sentence and word labels 
        sentence = self.data.sentence[index] #.strip().split()  
        word_labels = self.data.word_labels[index].split(",") 

        # step 2: use tokenizer to encode sentence (includes padding/truncation up to max length)
        # BertTokenizerFast provides a handy "return_offsets_mapping" functionality for individual tokens
        encoding = self.tokenizer(sentence,
                             # is_pretokenized=True,                              
                             return_offsets_mapping=True, 
                             padding='max_length', 
                             truncation=True, 
                             max_length=self.max_len)
        
        # step 3: create token labels only for first word pieces of each tokenized word
        labels = [labels_to_ids[label] for label in word_labels] 
        # create an empty array of -100 of length max_length
        encoded_labels = np.ones(len(encoding["offset_mapping"]), dtype=int) * -100
        
        # set only labels whose first offset position is 0 and the second is not 0
        
        i = 0
        for idx, mapping in enumerate(encoding["offset_mapping"]):
          if mapping[0] == 0 and mapping[1] != 0:
            # overwrite label
            encoded_labels[idx] = labels[i]
            i += 1
      
        # step 4: turn everything into PyTorch tensors
        item = {key: torch.as_tensor(val) for key, val in encoding.items()}
        item['labels'] = torch.as_tensor(encoded_labels) 
        
        return item

  def __len__(self):
        return self.len

Next, I use this PyTorch class to create the training and test sets in the PyTorch tensors format. I use a 80/20 split.

In [18]:
train_size = 0.8
train_dataset = data.sample(frac=train_size,random_state=200)
test_dataset = data.drop(train_dataset.index).reset_index(drop=True)
train_dataset = train_dataset.reset_index(drop=True)

print("FULL Dataset: {}".format(data.shape))
print("TRAIN Dataset: {}".format(train_dataset.shape))
print("TEST Dataset: {}".format(test_dataset.shape))

training_set = dataset(train_dataset, tokenizer, MAX_LEN)
testing_set = dataset(test_dataset, tokenizer, MAX_LEN)

FULL Dataset: (47571, 2)
TRAIN Dataset: (38057, 2)
TEST Dataset: (9514, 2)


In [19]:
# I check the structure of every training sample
training_set[0] 

{'input_ids': tensor([  101, 23564, 21030,  2099,  4967,  2001,  9388,  1011,  6109,  2005,
          2634,  1012,   102,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,   

In [20]:
# Here I just verify that the input ids and the corresponding targets are correct
for token, label in zip(tokenizer.convert_ids_to_tokens(training_set[0]["input_ids"]), training_set[0]["labels"]):
  print('{0:10}  {1}'.format(token, label))

[CLS]       -100
za          3
##hee       -100
##r         -100
khan        -100
was         -100
mar         -100
-           -100
93          -100
for         -100
india       -100
.           -100
[SEP]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100


Next, let's define the PyTorch dataloaders.
The DataLoader wraps an iterable around a dataset, making it easy to access data samples.
It handles batching, shuffling, and parallel loading of data.
DataLoader is particularly useful when dealing with large datasets.

In [21]:
train_params = {'batch_size': TRAIN_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
                }

test_params = {'batch_size': VALID_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
                }

training_loader = DataLoader(training_set, **train_params)
testing_loader = DataLoader(testing_set, **test_params)

DEFINE THE MODEL
As explained above, the chosen model for this task is BertForTokenClassification.
Here I load it with the pretrained weights of "bert-base-uncased". 
I only need to additionally specify the number of labels, as this will determine the architecture of the classification head.

NOTE:
    Only the base layers are initialized with the pretrained weights. The token calssification head of top has just randomly initialized weights, which we need to train, together with the pretrained weights, using the training dataset.

Also, if we change the training set, the number of labels (parameter "num_labels") that I set up here needs to be adapted to the entities of interest.
That is, if I have a new training set with for example 9 entities corresponding to 9 (or more) tags, I need to specify this number of tags in the model architecture.

In [22]:
model = BertForTokenClassification.from_pretrained('bert-base-uncased', num_labels=len(labels_to_ids))

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


The BertForTokenClassification uses PyTorch's CrossEntropyLoss.

In [25]:
inputs = training_set[2]
input_ids = inputs["input_ids"].unsqueeze(0)
input_ids = input_ids.type(torch.LongTensor)
attention_mask = inputs["attention_mask"].unsqueeze(0)
attention_mask = attention_mask.type(torch.LongTensor)
labels = inputs["labels"].unsqueeze(0)
labels = labels.type(torch.LongTensor)

In [27]:
outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
outputs

TokenClassifierOutput(loss=tensor(3.0202, grad_fn=<NllLossBackward0>), logits=tensor([[[ 0.2794,  0.6398,  0.1297,  ..., -0.2682,  0.0414,  0.1409],
         [ 0.0415,  0.7419,  0.5223,  ..., -0.3782,  0.4457, -0.3592],
         [ 0.4750,  0.0401, -0.3383,  ...,  0.4292,  0.1702, -0.0272],
         ...,
         [ 0.1504,  0.1827, -0.7203,  ...,  0.7097,  0.2024, -0.3951],
         [ 0.1758,  0.1671, -0.3961,  ...,  0.3345,  0.0866, -0.0726],
         [ 0.1751,  0.0879, -0.4160,  ...,  0.3338,  0.0757, -0.1072]]],
       grad_fn=<ViewBackward0>), hidden_states=None, attentions=None)

In [28]:
initial_loss = outputs[0]
initial_loss

tensor(3.0202, grad_fn=<NllLossBackward0>)

In [29]:
tr_logits = outputs[1]
tr_logits.shape

torch.Size([1, 128, 11])

Now, let's define an Optimizer.
Here, I use the Adam optimizer with a default learning rate (meaning that the learning rate is defined a priori)
Other optimizer are also possible and included in the Transformers repository. While the learning rate could be defined with a scheduler, for sake of computational speed I set a default LR.

In [30]:
optimizer = torch.optim.Adam(params=model.parameters(), lr=LEARNING_RATE)

Now that I have the model and the optimizer, I can define a regular PyTorch training function.

In [75]:
# Defining the training function on the 80% of the dataset for tuning the bert model
def train(epoch):
    tr_loss, tr_accuracy = 0, 0
    nb_tr_examples, nb_tr_steps = 0, 0
    tr_preds, tr_labels = [], []
    # put model in training mode
    model.train()
    
    for idx, batch in enumerate(training_loader):
        
        ids = batch['input_ids'].type(torch.LongTensor) 
        mask = batch['attention_mask'].type(torch.LongTensor) 
        labels = batch['labels'].type(torch.LongTensor) 

        output = model(input_ids=ids, attention_mask=mask, labels=labels)
        loss = output.loss.item()
        tr_logits = output.logits
        tr_loss += loss

        nb_tr_steps += 1
        nb_tr_examples += labels.size(0)
        
        if idx % 100==0:
            loss_step = tr_loss/nb_tr_steps
            print(f"Training loss per 100 training steps: {loss_step}")
           
        # compute training accuracy
        flattened_targets = labels.view(-1) 
        active_logits = tr_logits.view(-1, model.num_labels) 
        flattened_predictions = torch.argmax(active_logits, axis=1) 
        
        # only compute accuracy at active labels
        active_accuracy = labels.view(-1) != -100 
        
        labels = torch.masked_select(flattened_targets, active_accuracy)
        predictions = torch.masked_select(flattened_predictions, active_accuracy)
        
        tr_labels.extend(labels)
        tr_preds.extend(predictions)

        tmp_tr_accuracy = accuracy_score(labels.cpu().numpy(), predictions.cpu().numpy())
        tr_accuracy += tmp_tr_accuracy
    
        # gradient clipping
        torch.nn.utils.clip_grad_norm_(
            parameters=model.parameters(), max_norm=MAX_GRAD_NORM
        )
        
        # backward pass
        optimizer.zero_grad()
        output.loss.backward()
        optimizer.step()

    epoch_loss = tr_loss / nb_tr_steps
    tr_accuracy = tr_accuracy / nb_tr_steps
    print(f"Training loss epoch: {epoch_loss}")
    print(f"Training accuracy epoch: {tr_accuracy}")

Let's use the training function defined above to train the model.
NOTE: this might take a while.

In [76]:
# I train the model
for epoch in range(EPOCHS):
    print(f"Training epoch: {epoch + 1}")
    train(epoch)

Training epoch: 1
Training loss per 100 training steps: 2.376901388168335
Training loss per 100 training steps: 1.1567911207675934
Training loss per 100 training steps: 0.8512145508988876
Training loss per 100 training steps: 0.7229515059891333
Training loss per 100 training steps: 0.6294417871078686
Training loss per 100 training steps: 0.5654549856954617
Training loss per 100 training steps: 0.5155608908113669
Training loss per 100 training steps: 0.47756104606712574
Training loss per 100 training steps: 0.44931228609799817
Training loss per 100 training steps: 0.424908483292924
Training loss per 100 training steps: 0.40602673034785647
Training loss per 100 training steps: 0.3858553776762818
Training loss per 100 training steps: 0.3775639353745728
Training loss per 100 training steps: 0.36758437941511585
Training loss per 100 training steps: 0.3567055856187806
Training loss per 100 training steps: 0.3457592880352711
Training loss per 100 training steps: 0.33635113367678215
Training l

EVALUATING THE MODEL
Now that the model is trained, I can evaluate its performance on the held-out test set.

In [77]:
def valid(model, testing_loader):
    # put model in evaluation mode
    model.eval()
    
    eval_loss, eval_accuracy = 0, 0
    nb_eval_examples, nb_eval_steps = 0, 0
    eval_preds, eval_labels = [], []
    
    with torch.no_grad():
        for idx, batch in enumerate(testing_loader):

            ids = batch['input_ids'].type(torch.LongTensor) 
            mask = batch['attention_mask'].type(torch.LongTensor) 
            labels = batch['labels'].type(torch.LongTensor) 

            output = model(input_ids=ids, attention_mask=mask, labels=labels)
            eval_loss = output.loss.item()
            eval_logits = output.logits

            nb_eval_steps += 1
            nb_eval_examples += labels.size(0)
        
            if idx % 100==0:
                loss_step = eval_loss/nb_eval_steps
                print(f"Validation loss per 100 evaluation steps: {loss_step}")
              
            # compute evaluation accuracy
            flattened_targets = labels.view(-1) 
            active_logits = eval_logits.view(-1, model.num_labels) 
            flattened_predictions = torch.argmax(active_logits, axis=1) 
            
            # only compute accuracy at active labels
            active_accuracy = labels.view(-1) != -100 
        
            labels = torch.masked_select(flattened_targets, active_accuracy)
            predictions = torch.masked_select(flattened_predictions, active_accuracy)
            
            eval_labels.extend(labels)
            eval_preds.extend(predictions)
            
            tmp_eval_accuracy = accuracy_score(labels.cpu().numpy(), predictions.cpu().numpy())
            eval_accuracy += tmp_eval_accuracy

    labels = [ids_to_labels[id.item()] for id in eval_labels]
    predictions = [ids_to_labels[id.item()] for id in eval_preds]
    
    eval_loss = eval_loss / nb_eval_steps
    eval_accuracy = eval_accuracy / nb_eval_steps
    print(f"Validation Loss: {eval_loss}")
    print(f"Validation Accuracy: {eval_accuracy}")

    return labels, predictions

In [78]:
labels, predictions = valid(model, testing_loader)

Validation loss per 100 evaluation steps: 0.0063634030520915985
Validation loss per 100 evaluation steps: 0.0004289521291704461
Validation loss per 100 evaluation steps: 1.258382443987315e-05
Validation loss per 100 evaluation steps: 4.379446324319934e-05
Validation loss per 100 evaluation steps: 0.00019018653027731878
Validation loss per 100 evaluation steps: 6.398486769098246e-06
Validation loss per 100 evaluation steps: 4.183894609056178e-06
Validation loss per 100 evaluation steps: 1.0171861981656854e-05
Validation loss per 100 evaluation steps: 8.204177524266617e-05
Validation loss per 100 evaluation steps: 0.0001842252719944775
Validation loss per 100 evaluation steps: 2.630802753430742e-06
Validation loss per 100 evaluation steps: 0.00016747631613067018
Validation loss per 100 evaluation steps: 4.922374416077564e-06
Validation loss per 100 evaluation steps: 3.5880071946421923e-06
Validation loss per 100 evaluation steps: 0.00027557577855411724
Validation loss per 100 evaluation 

The overall performance is good. Accuracy is 94.84%
However this statistic could be misleading as a lot of labels are "outside" (O). That is, the prediction is biased towards the most popular class in the training set.
We can rely on other metrics, such as precision, recall and f1-score of the individual tags.

In [111]:
recall_val = recall_score(y_true = labels, y_pred =predictions, average= 'micro')
print(f'Recall: {recall_val}')

precision_val = precision_score(y_true = labels, y_pred =predictions, average= 'micro')
print(f'Precision: {precision_val}')

f1_val = f1_score(y_true = labels, y_pred =predictions, average= 'micro')
print(f'f1 score: {f1_val}')

print('----------------------------')
lab = numpy.unique(labels)
print(confusion_matrix(y_true = labels, y_pred =predictions))


Recall: 0.9484969518604163
Precision: 0.9484969518604163
f1 score: 0.9484969518604163
----------------------------
[[ 564    6   43   21    0    8]
 [  52  520   14    0    0    5]
 [ 107    4  331   37    0   51]
 [   9    2   32  726    0   22]
 [   1    0    0    0   90   14]
 [   5    4   44    7    2 6793]]


Once again, these metrics confirm that the model performs quite good.
Obviously better performances could be achieved evaluating the training step by step over a validation set.

TEST
Now that we have a final model, trained and evaluated with overall good performances, it is possible to use it directly on unseen sentences.

In [137]:
# Some inference

sentence = '''  La C. cost. ha dichiarato quanto segue.
            Con la decisione in epigrafe, la Corte d'Appello di Napoli ha confermato la 
            sentenza del Tribunale di Torino del 30.10.2017 con cui G.A. è stata condannata alla 
            pena di 15 giorni di reclusione in relazione al reato di cui all'art. 615-ter c.p., per aver 
            modificato ed utilizzato la password di accesso al cassetto fiscale della sorella L., 
            aperto presso l'Agenzia delle Entrate, al fine di continuare a gestire il patrimonio 
            familiare pur dopo la cessazione della delega ad agire per conto di costei e i dissidi 
            insorti tra loro (in particolare, per registrare le locazioni relative agli immobili di 
            famiglia).'''

inputs = tokenizer(sentence,
                    # is_pretokenized=True, 
                    return_offsets_mapping=True, 
                    padding='max_length', 
                    truncation=True, 
                    max_length=MAX_LEN,
                    return_tensors="pt")

ids = inputs["input_ids"]
mask = inputs["attention_mask"]

outputs = model(ids, attention_mask=mask)
logits = outputs[0]

active_logits = logits.view(-1, model.num_labels) 
flattened_predictions = torch.argmax(active_logits, axis=1) 

tokens = tokenizer.convert_ids_to_tokens(ids.squeeze().tolist())
token_predictions = [ids_to_labels[i] for i in flattened_predictions.cpu().numpy()]
wp_preds = list(zip(tokens, token_predictions)) # list of tuples. Each tuple = (wordpiece, prediction)

label_dict = {}

# Iterate over the list of tuples
for word, label in wp_preds:
    # If the label is already a key in the dictionary, append the word to its list
    if label in label_dict:
        label_dict[label].append(word)
    # If the label is not a key in the dictionary, create a new list with the word
    else:
        label_dict[label] = [word]

print(label_dict)

{'O': ['[CLS]', '.', '.', 'ha', 'di', '##chia', '##rat', '##o', 'quan', '##to', 'se', '##gue', '.', 'con', 'la', 'decision', '##e', 'in', 'ep', '##ig', '##raf', '##e', ',', 'la', '##rte', 'd', "'", 'ha', 'con', '##fer', '##mat', '##o', 'la', 'sent', '##en', '##za', 'del', 'tribunal', '##e', 'del', '30', '.', '10', '.', '2017', 'con', 'cu', '##i', '.', '.', 'e', 'stat', '##a', 'con', '##dan', '##nat', '##a', 'alla', 'pena', 'di', '15', '##orn', '##i', 'di', 'rec', '##lusion', '##e', 'in', 're', '##la', '##zione', 'al', 're', '##ato', 'di', '##i', 'all', "'", 'art', '.', '61', '##5', '-', 'ter', 'c', '.', '.', ',', 'per', 'ave', '##r', 'mod', '##ific', '##ato', 'ed', 'ut', '##ili', '##to', 'la', 'password', 'di', 'access', '##o', 'al', '##etto', 'fiscal', '##e', '[SEP]'], 'B-per': ['la', 'c', 'cost', 'app', '##ello', 'napoli', 'g', 'a', 'gi', 'cu', 'p', '##zza', 'cass', 'della', 'sore', '##lla'], 'B-org': ['co', 'di', 'di', 'torino']}


In [151]:
# For each label, I get the list of words or word-pieces associated to it
for label, elements in label_dict.items():
    printed_labels = []
    if label not in printed_labels:
        elenco = label_dict[f'{label}']
        print(f"Label {label} : {elenco}") 
        printed_labels.append(label)

Label O : ['[CLS]', '.', '.', 'ha', 'di', '##chia', '##rat', '##o', 'quan', '##to', 'se', '##gue', '.', 'con', 'la', 'decision', '##e', 'in', 'ep', '##ig', '##raf', '##e', ',', 'la', '##rte', 'd', "'", 'ha', 'con', '##fer', '##mat', '##o', 'la', 'sent', '##en', '##za', 'del', 'tribunal', '##e', 'del', '30', '.', '10', '.', '2017', 'con', 'cu', '##i', '.', '.', 'e', 'stat', '##a', 'con', '##dan', '##nat', '##a', 'alla', 'pena', 'di', '15', '##orn', '##i', 'di', 'rec', '##lusion', '##e', 'in', 're', '##la', '##zione', 'al', 're', '##ato', 'di', '##i', 'all', "'", 'art', '.', '61', '##5', '-', 'ter', 'c', '.', '.', ',', 'per', 'ave', '##r', 'mod', '##ific', '##ato', 'ed', 'ut', '##ili', '##to', 'la', 'password', 'di', 'access', '##o', 'al', '##etto', 'fiscal', '##e', '[SEP]']
Label B-per : ['la', 'c', 'cost', 'app', '##ello', 'napoli', 'g', 'a', 'gi', 'cu', 'p', '##zza', 'cass', 'della', 'sore', '##lla']
Label B-org : ['co', 'di', 'di', 'torino']
