## Fine-tuning BERT for NER

**BERT** is a Transformer model pretrained on a large corpus of English text in a self-supervised fashion. This means it was pretrained on raw text only, with no human labeling (which is why it can use large amounts of publicly available data), using an automatic process to generate inputs and labels from that text.

**conll2003:**
The shared task of CoNLL-2003 concerns language-independent named entity recognition. We will concentrate on four types of named entities: persons, locations, organizations and names of miscellaneous entities that do not belong to the previous three groups.

The CoNLL-2003 shared task data files contain four columns separated by a single space. Each word is on a separate line, and there is an empty line after each sentence. The first item on each line is a word, the second a part-of-speech (POS) tag, the third a syntactic chunk tag and the fourth the named entity tag. In the original files, the chunk tags and the named entity tags have the format I-TYPE, which means that the word is inside a phrase of type TYPE; only when two phrases of the same type immediately follow each other does the first word of the second phrase get the tag B-TYPE, to show that it starts a new phrase. A word with tag O is not part of a phrase. The Hugging Face version of the dataset used here is tagged with the IOB2 scheme instead, in which the first word of every phrase is tagged B-TYPE.
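To make the four-column layout concrete, here is a minimal sketch of a parser for text in that format. The function name `parse_conll` and the sample lines are illustrative assumptions (the lines are hand-written in the I-TYPE style described above, not copied from the data files). Real files also contain `-DOCSTART-` document markers, which the sketch simply skips.

```python
from typing import List, Tuple

Row = Tuple[str, str, str, str]  # (word, POS tag, chunk tag, NER tag)

def parse_conll(text: str) -> List[List[Row]]:
    """Split CoNLL-2003 style text into sentences of four-column rows."""
    sentences: List[List[Row]] = []
    current: List[Row] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("-DOCSTART-"):
            # An empty line (or a document marker) closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        word, pos, chunk, ner = line.split()
        current.append((word, pos, chunk, ner))
    if current:  # the last sentence may not be followed by an empty line
        sentences.append(current)
    return sentences

# Hand-written illustrative input in the four-column format.
raw = """EU NNP I-NP I-ORG
rejects VBZ I-VP O
German JJ I-NP I-MISC
call NN I-NP O

Peter NNP I-NP I-PER
Blackburn NNP I-NP I-PER
"""

sentences = parse_conll(raw)
print(len(sentences))                       # number of sentences
print([row[0] for row in sentences[0]])     # words of the first sentence
print([row[3] for row in sentences[1]])     # NER tags of the second sentence
```

With `load_dataset("conll2003")` this parsing is already done, and each column arrives as a separate feature (`tokens`, `pos_tags`, `chunk_tags`, `ner_tags`).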

- Part of Speech (POS) Tagging:

POS tagging labels each word in a sentence with its part of speech, such as noun, verb, adjective or adverb.
These grammatical categories describe the role each word plays in the sentence, which helps in understanding its syntactic structure and meaning.

- Chunking:

Chunking, also known as shallow parsing, involves identifying and extracting meaningful phrases or "chunks" from sentences based on their grammatical structure.
Unlike POS tagging, which assigns part-of-speech tags to individual words, chunking groups words together into syntactically meaningful units, such as noun phrases, verb phrases and prepositional phrases.

The `ner_tags` column uses the following labels:

- `O` means the word doesn't correspond to any entity.
- `B-PER`/`I-PER` means the word corresponds to the beginning of/is inside a person entity.
- `B-ORG`/`I-ORG` means the word corresponds to the beginning of/is inside an organization entity.
- `B-LOC`/`I-LOC` means the word corresponds to the beginning of/is inside a location entity.
- `B-MISC`/`I-MISC` means the word corresponds to the beginning of/is inside a miscellaneous entity.
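The B-/I- convention above can be turned back into entity spans with a short loop. The following is a minimal sketch, assuming every entity starts with a `B-` tag; `extract_entities` is a hypothetical helper name, and the second example sentence is made up for illustration.

```python
def extract_entities(tokens, tags):
    """Group tokens into (entity_type, text) spans based on B-/I-/O tags."""
    spans = []
    current_type, current_tokens = None, []

    def close_span():
        # Store the span collected so far, if there is one.
        if current_tokens:
            spans.append((current_type, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_new:
            # A B- tag (or an I- tag with no matching open span) opens a new span.
            close_span()
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-"):
            current_tokens.append(token)
        else:  # "O": the word is outside any entity
            close_span()
            current_type, current_tokens = None, []
    close_span()
    return spans

# First training sentence of the dataset, with its tags written as strings.
tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
tags = ["B-ORG", "O", "B-MISC", "O", "O", "O", "B-MISC", "O", "O"]
print(extract_entities(tokens, tags))

# Made-up sentence with multi-word entities.
print(extract_entities(
    ["Peter", "Blackburn", "visited", "New", "York"],
    ["B-PER", "I-PER", "O", "B-LOC", "I-LOC"],
))
```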

In [61]:
import torch
import numpy as np
from datasets import load_dataset
from transformers import AutoModel, AutoTokenizer
from torch.utils.data import DataLoader
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

In [2]:
dataset = load_dataset("conll2003")

In [3]:
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})

In [4]:
dataset.shape

{'train': (14041, 5), 'validation': (3250, 5), 'test': (3453, 5)}

In [5]:
dataset['train'].features

{'id': Value(dtype='string', id=None),
 'tokens': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
 'pos_tags': Sequence(feature=ClassLabel(names=['"', "''", '#', '$', '(', ')', ',', '.', ':', '``', 'CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNP', 'NNPS', 'NNS', 'NN|SYM', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB'], id=None), length=-1, id=None),
 'chunk_tags': Sequence(feature=ClassLabel(names=['O', 'B-ADJP', 'I-ADJP', 'B-ADVP', 'I-ADVP', 'B-CONJP', 'I-CONJP', 'B-INTJ', 'I-INTJ', 'B-LST', 'I-LST', 'B-NP', 'I-NP', 'B-PP', 'I-PP', 'B-PRT', 'I-PRT', 'B-SBAR', 'I-SBAR', 'B-UCP', 'I-UCP', 'B-VP', 'I-VP'], id=None), length=-1, id=None),
 'ner_tags': Sequence(feature=ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'], id=None), length=-1, id=None)}

In [6]:
idx = 0

print(dataset['train'][idx])

{'id': '0', 'tokens': ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'], 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7], 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0], 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0]}
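The integers in `pos_tags`, `chunk_tags` and `ner_tags` are indices into the `ClassLabel` name lists printed by `dataset['train'].features`. As a minimal sketch, the lookup can be done by hand with the lists copied from that output (the helper `decode` is a hypothetical name; with the `datasets` library, `ClassLabel.int2str` performs the same conversion):

```python
# Name lists copied from the `features` output above.
POS_NAMES = ['"', "''", '#', '$', '(', ')', ',', '.', ':', '``', 'CC', 'CD', 'DT',
             'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNP', 'NNPS',
             'NNS', 'NN|SYM', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP',
             'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP',
             'WP$', 'WRB']
CHUNK_NAMES = ['O', 'B-ADJP', 'I-ADJP', 'B-ADVP', 'I-ADVP', 'B-CONJP', 'I-CONJP',
               'B-INTJ', 'I-INTJ', 'B-LST', 'I-LST', 'B-NP', 'I-NP', 'B-PP', 'I-PP',
               'B-PRT', 'I-PRT', 'B-SBAR', 'I-SBAR', 'B-UCP', 'I-UCP', 'B-VP', 'I-VP']
NER_NAMES = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

# First training example, copied from the output above.
sample = {
    'tokens': ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'],
    'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
    'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0],
    'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0],
}

def decode(ids, names):
    """Map integer class ids to their string names."""
    return [names[i] for i in ids]

pos = decode(sample['pos_tags'], POS_NAMES)
chunk = decode(sample['chunk_tags'], CHUNK_NAMES)
ner = decode(sample['ner_tags'], NER_NAMES)

# Print one row per word, in the same column order as the original data files.
for token, p, c, n in zip(sample['tokens'], pos, chunk, ner):
    print(f"{token:<10}{p:<6}{c:<7}{n}")
```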


In [7]:
dataset['train'].features['ner_tags']

Sequence(feature=ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'], id=None), length=-1, id=None)

In [6]:
dataset['train'].description

'The shared task of CoNLL-2003 concerns language-independent named entity recognition. We will concentrate on\nfour types of named entities: persons, locations, organizations and names of miscellaneous entities that do\nnot belong to the previous three groups.\n\nThe CoNLL-2003 shared task data files contain four columns separated by a single space. Each word has been put on\na separate line and there is an empty line after each sentence. The first item on each line is a word, the second\na part-of-speech (POS) tag, the third a syntactic chunk tag and the fourth the named entity tag. The chunk tags\nand the named entity tags have the format I-TYPE which means that the word is inside a phrase of type TYPE. Only\nif two phrases of the same type immediately follow each other, the first word of the second phrase will have tag\nB-TYPE to show that it starts a new phrase. A word with tag O is not part of a phrase. Note the dataset uses IOB2\ntagging scheme, whereas the original dataset uses 

In [7]:
ner_tags = {'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-ORG': 3, 'I-ORG': 4, 'B-LOC': 5, 'I-LOC': 6, 'B-MISC': 7, 'I-MISC': 8}

In [8]:
checkpoint = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [9]:
sample = dataset['train'][0]
print(sample)

{'id': '0', 'tokens': ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'], 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7], 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0], 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0]}


**tokenizer(input):** Calling the tokenizer directly runs the full preprocessing pipeline. It splits the input text into tokens (words or sub-words, depending on the tokenizer), converts them to numerical IDs, adds the special tokens, and returns a dictionary-like object with `input_ids`, `token_type_ids` and `attention_mask`. To get only the list of token strings, use `tokenizer.tokenize(input)`.

**tokenizer.encode(input):** tokenizes the input and converts the tokens into their numerical IDs, often referred to as token IDs or input IDs. These IDs are what the model actually operates on. This operation returns only the list of token IDs, without `token_type_ids` or `attention_mask`.
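The difference between the three entry points can be sketched with a toy stand-in. Everything below is an assumption made for illustration: the functions `toy_tokenize`, `toy_encode` and `toy_call` are hypothetical, the splitting rule is a simple regular expression rather than BERT's WordPiece algorithm, and the vocabulary is a tiny subset whose token/ID pairs are taken from this notebook's outputs for the sample sentence.

```python
import re

# Tiny vocabulary: token/ID pairs taken from this notebook's outputs.
VOCAB = {
    "[CLS]": 101, "[SEP]": 102, "eu": 7327, "rejects": 19164, "german": 2446,
    "call": 2655, "to": 2000, "boycott": 17757, "british": 2329, "lamb": 12559,
    ".": 1012,
}

def toy_tokenize(text):
    """Stand-in for tokenizer.tokenize: text -> list of token strings."""
    # Lowercase (as an uncased model does) and split words from punctuation.
    return re.findall(r"\w+|[^\w\s]", text.lower())

def toy_encode(text):
    """Stand-in for tokenizer.encode: text -> list of IDs, with special tokens."""
    tokens = ["[CLS]"] + toy_tokenize(text) + ["[SEP]"]
    return [VOCAB[token] for token in tokens]

def toy_call(text):
    """Stand-in for tokenizer(text): text -> dictionary of model inputs."""
    input_ids = toy_encode(text)
    return {
        "input_ids": input_ids,
        "token_type_ids": [0] * len(input_ids),  # single segment
        "attention_mask": [1] * len(input_ids),  # no padding, attend everywhere
    }

text = "EU rejects German call to boycott British lamb ."
print(toy_tokenize(text))  # token strings only
print(toy_encode(text))    # IDs only
print(toy_call(text))      # IDs plus token_type_ids and attention_mask
```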

In [10]:
# encode() returns only input_ids, whereas tokenizer(input) returns
# input_ids, token_type_ids and attention_mask
print(tokenizer.encode(' '.join(sample['tokens']), add_special_tokens=True))

[101, 7327, 19164, 2446, 2655, 2000, 17757, 2329, 12559, 1012, 102]


In [11]:
tokenized_sample = tokenizer(sample['tokens'], is_split_into_words=True)

In [12]:
tokenized_sample

{'input_ids': [101, 7327, 19164, 2446, 2655, 2000, 17757, 2329, 12559, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

The output of the code is a dictionary with three keys:

- ``input_ids``: a list of integers, one per token. Each integer is the index of that token in the vocabulary of the pre-trained model.
- ``token_type_ids``: a list of integers marking which segment each token belongs to. For a single sentence, as here, every value is 0; when two sentences are encoded as a pair, the tokens of the first sentence are marked 0 and the tokens of the second are marked 1.
- ``attention_mask``: a list of 1's and 0's that indicate which tokens the pre-trained model should attend to. A 1 marks a real token to attend to, while a 0 marks a token to ignore (typically padding).
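The sample above contains a single unpadded sentence, so `token_type_ids` is all 0 and `attention_mask` is all 1. A minimal sketch of how the three lists relate for a sentence pair with padding follows; `build_inputs` is a hypothetical helper, and the special-token IDs (101, 102, and 0 for padding) are the ones visible in this notebook's outputs.

```python
CLS, SEP, PAD = 101, 102, 0  # [CLS], [SEP] and padding IDs seen in the outputs

def build_inputs(ids_a, ids_b=None, max_length=None):
    """Assemble input_ids, token_type_ids and attention_mask by hand.

    No truncation is performed: if the sequence is already longer than
    max_length, it is returned unpadded.
    """
    # Segment 0: [CLS] + first sentence + [SEP]
    input_ids = [CLS] + ids_a + [SEP]
    token_type_ids = [0] * len(input_ids)
    if ids_b is not None:
        # Segment 1: second sentence + its closing [SEP]
        input_ids = input_ids + ids_b + [SEP]
        token_type_ids = token_type_ids + [1] * (len(ids_b) + 1)
    attention_mask = [1] * len(input_ids)  # real tokens are attended to
    if max_length is not None:
        n_pad = max(0, max_length - len(input_ids))
        input_ids = input_ids + [PAD] * n_pad
        token_type_ids = token_type_ids + [0] * n_pad
        attention_mask = attention_mask + [0] * n_pad  # padding is ignored
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }

# IDs for "eu rejects" and "british lamb", taken from the sample above.
pair = build_inputs([7327, 19164], [2329, 12559], max_length=10)
for key, values in pair.items():
    print(f"{key:<15} {values}")
```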

In [16]:
tokens = tokenizer.convert_ids_to_tokens(tokenized_sample['input_ids'])
print(tokens)

['[CLS]', 'eu', 'rejects', 'german', 'call', 'to', 'boycott', 'british', 'lamb', '.', '[SEP]']


In [17]:
word_ids = tokenized_sample.word_ids() # maps each token to the index of the word it came from (None for special tokens)
word_ids

[None, 0, 1, 2, 3, 4, 5, 6, 7, 8, None]

In [18]:
len(sample['ner_tags']), len(tokenized_sample['input_ids'])

(9, 11)

The sample has 9 `ner_tags` but 11 tokenized `input_ids`, so the labels no longer line up with the tokens.

-> **PROBLEM OF SUB-TOKENS**

Why? Transformers are usually trained with sub-word tokenizers, so even if we pass input that is already split into words, the tokenizer can split those words again. In addition, special tokens such as `[CLS]` and `[SEP]` are added.

-> **Solution**

Align the labels using the `.word_ids()` method, which maps every special token to `None`. We set the label of all special tokens to $-100$, because PyTorch's cross-entropy loss ignores that value during training, and give every other token the label of the word it comes from.
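The alignment rule can be sketched on plain lists, without a tokenizer. In this minimal sketch `align_labels` is a hypothetical helper, and the second `word_ids` list is a made-up case in which word 1 is split into two sub-tokens.

```python
IGNORE_INDEX = -100  # label value skipped by the loss

def align_labels(word_ids, word_labels, label_all_tokens=True):
    """Expand word-level labels to token level using the word_ids mapping."""
    aligned = []
    prev_word_idx = None
    for word_idx in word_ids:
        if word_idx is None:
            # Special token ([CLS], [SEP], padding): never contributes to the loss.
            aligned.append(IGNORE_INDEX)
        elif word_idx != prev_word_idx:
            # First sub-token of a word: take the word's label.
            aligned.append(word_labels[word_idx])
        else:
            # Later sub-token of the same word: repeat the label or mask it.
            aligned.append(word_labels[word_idx] if label_all_tokens else IGNORE_INDEX)
        prev_word_idx = word_idx
    return aligned

# Values from the sample above: 9 word labels, 11 tokens.
word_ids = [None, 0, 1, 2, 3, 4, 5, 6, 7, 8, None]
ner_tags = [3, 0, 7, 0, 0, 0, 7, 0, 0]
print(align_labels(word_ids, ner_tags))

# Made-up case: word 1 is split into two sub-tokens.
split_word_ids = [None, 0, 1, 1, 2, None]
print(align_labels(split_word_ids, [3, 0, 7], label_all_tokens=True))
print(align_labels(split_word_ids, [3, 0, 7], label_all_tokens=False))
```

With `label_all_tokens=True`, a word labeled `B-...` passes the same `B-...` label to all of its sub-tokens; masking the later sub-tokens with `-100` avoids that at the cost of fewer training signals.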

In [89]:
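def tokenize_and_align_labels(example, label_all_tokens=True):
    """
    Tokenize a batch of examples and align the NER labels with the tokens.

    This function does two things:
        1. Sets -100 as the label for special and padding tokens.
        2. Labels the sub-word tokens after the first one, either with the
           word's label or with -100 when label_all_tokens is False.
    Parameters:
        example (dict): a batch from the dataset, with 'tokens' and 'ner_tags' lists
        label_all_tokens (bool): if True, every sub-word token gets its word's
            label; if False, only the first sub-word token of each word is labeled
    Returns:
        The tokenized batch with an added 'labels' entry.
    """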
    
    tokenized_example = tokenizer(example['tokens'], is_split_into_words=True, truncation=True, 
        padding="max_length",  # Add padding
    )
    labels = []
    for i, label in enumerate(example['ner_tags']):
        word_ids = tokenized_example.word_ids(batch_index=i)
        prev_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != prev_word_idx:
                # common scenario
                label_ids.append(label[word_idx])
            else:
                # take care of sub-words
                label_ids.append(label[word_idx] if label_all_tokens else -100)
            prev_word_idx = word_idx    
        labels.append(label_ids)
    tokenized_example['labels'] = labels
    return tokenized_example
    

In [90]:
idx = 0
benchmarking = tokenize_and_align_labels(dataset['train'][idx:idx+2])
print(benchmarking)

{'input_ids': [[101, 7327, 19164, 2446, 2655, 2000, 17757, 2329, 12559, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

`labels` has been added and holds the aligned values.
The objective was to derive, from `ner_tags`, a new list of labels whose length matches the tokenized `input_ids`.

In [91]:
for token, label in zip(benchmarking['input_ids'][0], benchmarking['labels'][0]):
    print(f'{token:_<30} {label}')

101___________________________ -100
7327__________________________ 3
19164_________________________ 0
2446__________________________ 7
2655__________________________ 0
2000__________________________ 0
17757_________________________ 0
2329__________________________ 7
12559_________________________ 0
1012__________________________ 0
102___________________________ -100
0_____________________________ -100
0_____________________________ -100
0_____________________________ -100
0_____________________________ -100
0_____________________________ -100
0_____________________________ -100
0_____________________________ -100
0_____________________________ -100
0_____________________________ -100
0_____________________________ -100
0_____________________________ -100
0_____________________________ -100
0_____________________________ -100
0_____________________________ -100
0_____________________________ -100
0_____________________________ -100
0_____________________________ -100
0__________________

In [None]:
tokenized_dataset = dataset.map(tokenize_and_align_labels, batched=True)

In [93]:
tokenized_dataset['train']

Dataset({
    features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags', 'input_ids', 'token_type_ids', 'attention_mask', 'labels'],
    num_rows: 14041
})

In [94]:
class ner_dataset(torch.utils.data.Dataset):
    def __init__(self, data):
        super().__init__()
        self.data = data
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        sample = self.data[idx]
        return {
            'input_ids': torch.LongTensor(sample['input_ids']),
            'token_type_ids': torch.LongTensor(sample['token_type_ids']),
            'attention_mask': torch.LongTensor(sample['attention_mask']),
            'labels': torch.LongTensor(sample['labels'])
        }    

In [95]:
BATCH = 8

tokenized_dataset = dataset.map(tokenize_and_align_labels, batched=True)

train_data = ner_dataset(tokenized_dataset['train'])
validation_data = ner_dataset(tokenized_dataset['validation'])
test_data = ner_dataset(tokenized_dataset['test'])

train_loader = DataLoader(train_data, batch_size=BATCH, shuffle=True)
validation_loader = DataLoader(validation_data, batch_size=BATCH, shuffle=False)
test_loader = DataLoader(test_data, batch_size=BATCH, shuffle=False)
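Because `tokenize_and_align_labels` uses `padding="max_length"`, every example is padded to the tokenizer's maximum length, so most of each batch is padding. One possible alternative, sketched here on plain lists, is to pad each batch only to its longest example. `pad_batch` is a hypothetical helper written for illustration; it assumes the examples were tokenized without padding, and its output would still need to be converted to tensors before being fed to the model. The `transformers` library ships `DataCollatorForTokenClassification` for this purpose.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Pad a list of unpadded examples to the longest one in the batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "token_type_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        batch["token_type_ids"].append(f["token_type_ids"] + [0] * n_pad)
        # Padding positions are masked out of attention ...
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_pad)
        # ... and out of the loss.
        batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)
    return batch

# Two made-up unpadded examples of different lengths.
features = [
    {"input_ids": [101, 7327, 102], "token_type_ids": [0, 0, 0],
     "attention_mask": [1, 1, 1], "labels": [-100, 3, -100]},
    {"input_ids": [101, 2329, 12559, 1012, 102], "token_type_ids": [0, 0, 0, 0, 0],
     "attention_mask": [1, 1, 1, 1, 1], "labels": [-100, 7, 0, 0, -100]},
]

batch = pad_batch(features)
for key, rows in batch.items():
    print(key, rows)
```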

In [96]:
for batch in train_loader:
    print(batch)
    break

{'input_ids': tensor([[  101, 12809,  5158,  ...,     0,     0,     0],
        [  101,  2009,  2097,  ...,     0,     0,     0],
        [  101, 13893,  2920,  ...,     0,     0,     0],
        ...,
        [  101, 28669,  2310,  ...,     0,     0,     0],
        [  101,  2008,  2933,  ...,     0,     0,     0],
        [  101,  2035,  2008,  ...,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        ...,
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]]), 'labels': tensor([[-100,    0,    0,  ..., -100, -100, -100],
        [-100,    0,    0,  ..., -100, -100, -100],
        [-100,    0,    0,  ..., -1

In [37]:
input_ids = torch.tensor(benchmarking['input_ids'])
attention_mask = torch.tensor(benchmarking['attention_mask'])
token_type_ids = torch.tensor(benchmarking['token_type_ids'])

# Load the model
model = AutoModel.from_pretrained(checkpoint)

# Forward pass
output = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
print(output[0])

tensor([[[-0.9127, -0.0694,  0.0756,  ..., -0.5373,  0.2770,  0.0182],
         [ 0.3057, -0.3440, -0.1638,  ..., -0.4052,  0.9113,  0.1845],
         [-0.6852, -0.1045,  0.0471,  ..., -0.6598, -0.4638,  0.1579],
         ...,
         [ 1.1856,  0.3611,  0.0776,  ..., -0.1451,  0.0478, -0.0247],
         [-0.6083, -0.7390, -0.3228,  ...,  0.0873, -0.0148, -0.1238],
         [ 0.6173,  0.1620, -0.4447,  ...,  0.0409, -0.4653, -0.1896]]],
       grad_fn=<NativeLayerNormBackward0>)


In [45]:
class BertForNER(torch.nn.Module):
    def __init__(self, checkpoint, n_classes):
        super().__init__()
        self.model = AutoModel.from_pretrained(checkpoint)
        self.classifier = torch.nn.Linear(self.model.config.hidden_size, n_classes)
    
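    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        # Pass the inputs by keyword so they cannot be mismatched positionally
        output = self.model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        result = output[0]  # last hidden state: (batch, seq_len, hidden_size)
        logits = self.classifier(result)  # per-token scores: (batch, seq_len, n_classes)
        return logits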

In [46]:
model = BertForNER(checkpoint,n_classes = 9).to('cuda')

In [97]:
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
loss = torch.nn.CrossEntropyLoss()  # ignore_index defaults to -100, so masked tokens are skipped
EPOCHS = 3

In [None]:
for epoch in range(EPOCHS):
    model.train()
    for batch in train_loader:
        input_ids = batch['input_ids'].to('cuda')
        attention_mask = batch['attention_mask'].to('cuda')
        token_type_ids = batch['token_type_ids'].to('cuda')
        labels = batch['labels'].to('cuda')
        
        optimizer.zero_grad()
        output = model(input_ids, attention_mask, token_type_ids)
        output = output.view(-1, output.shape[-1])
        labels = labels.view(-1)
        loss_value = loss(output, labels)
        loss_value.backward()
        optimizer.step()
    # Note: this reports the loss of the last batch of the epoch only
    print(f'Epoch: {epoch + 1}/{EPOCHS}, Loss: {loss_value.item()}')
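The loop relies on two details: `output.view(-1, output.shape[-1])` and `labels.view(-1)` flatten the batch so that every token becomes one classification example, and `CrossEntropyLoss` averages only over tokens whose label is not `-100` (its default `ignore_index`). The following is a minimal pure-Python sketch of that computation, with made-up logits for 2 sequences of 3 tokens and 2 classes; `cross_entropy` is a hypothetical re-implementation for illustration, not the PyTorch code.

```python
import math

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean negative log-likelihood over the tokens that are not ignored."""
    total, count = 0.0, 0
    for scores, label in zip(logits, labels):
        if label == ignore_index:
            continue  # special, padding or masked sub-word token
        # Numerically stable log-sum-exp of the class scores.
        top = max(scores)
        log_sum_exp = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_sum_exp - scores[label]
        count += 1
    return total / count if count else float("nan")

# Made-up batch: shape (2 sequences, 3 tokens, 2 classes).
batch_logits = [
    [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]],
    [[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]],
]
batch_labels = [
    [-100, 1, -100],
    [0, -100, -100],
]

# Equivalent of .view(-1, n_classes) and .view(-1): one row per token.
flat_logits = [token for sequence in batch_logits for token in sequence]
flat_labels = [label for sequence in batch_labels for label in sequence]

loss_value = cross_entropy(flat_logits, flat_labels)
print(f"{len(flat_logits)} tokens, loss = {loss_value:.4f}")
```

Only two of the six tokens carry a real label here, so the mean is taken over those two; changing the logits of an ignored position leaves the loss unchanged.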

In [None]:
model.eval()

for batch in validation_loader:
    input_ids = batch['input_ids'].to('cuda')
    attention_mask = batch['attention_mask'].to('cuda')
    token_type_ids = batch['token_type_ids'].to('cuda')
    labels = batch['labels'].to('cuda')
    
    with torch.no_grad():
        output = model(input_ids, attention_mask, token_type_ids)
        output = output.view(-1, output.shape[-1])
        labels = labels.view(-1)
        loss_value = loss(output, labels)
    print(f'Validation Loss: {loss_value.item()}')
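The validation loop above reports only the loss. As a quick sanity check, the logits can also be turned into predictions with an argmax over the class dimension and compared with the labels, skipping positions labeled `-100`. This is a minimal sketch on plain lists with made-up logits; `token_accuracy` is a hypothetical helper, and entity-level precision, recall and F1 are the more common way to report CoNLL-2003 results.

```python
def argmax(scores):
    """Index of the highest score."""
    return max(range(len(scores)), key=scores.__getitem__)

def token_accuracy(logits, labels, ignore_index=-100):
    """Fraction of non-ignored tokens whose predicted class equals the label."""
    correct = total = 0
    for seq_logits, seq_labels in zip(logits, labels):
        for scores, label in zip(seq_logits, seq_labels):
            if label == ignore_index:
                continue  # special, padding or masked sub-word token
            total += 1
            correct += int(argmax(scores) == label)
    return correct / total if total else 0.0

# Made-up logits: 1 sequence, 4 tokens, 3 classes.
logits = [[
    [0.1, 0.2, 0.7],   # ignored position
    [2.0, 0.1, 0.1],   # predicts class 0, label 0: correct
    [0.3, 0.9, 0.2],   # predicts class 1, label 2: wrong
    [0.5, 0.4, 0.1],   # ignored position
]]
labels = [[-100, 0, 2, -100]]

accuracy = token_accuracy(logits, labels)
print(f"Token accuracy: {accuracy:.2f}")
```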