# Practice 4: Named Entity Recognition

## Introduction

### Formulation of the problem

In this assignment, you will solve the Named Entity Recognition (NER) problem, which, along with text classification, is one of the most common problems in NLP.

The task is to classify each word/token as being part of a named entity or not (an entity may span several words/tokens).

For example, we want to extract names and organization names. Then for the text

     Ian    Goodfellow  works  for  Google  Brain

The model should extract the following sequence:

     B-PER  I-PER       O      O    B-ORG   I-ORG

where the prefixes *B-* and *I-* denote the beginning and the inside (continuation) of a named entity, and *O* denotes a word outside any entity. This prefix system (*BIO* tagging) was introduced to distinguish between successive named entities of the same type.
There are other types of tagging, such as [*BILUO*](https://en.wikipedia.org/wiki/Inside–outside–beginning_(tagging)), but for this tutorial we will focus on *BIO*.
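
To make the role of the *B-* prefix concrete, here is a minimal sketch that groups BIO tags back into entity spans. The helper name `bio_to_spans` and the toy sentence are illustrative assumptions, not part of the assignment code; treating a stray *I-* tag as the start of a new entity is one lenient choice among several.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Convert BIO tags to (entity_type, start, end) spans, end exclusive."""
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" flushes the last span
        # An I- tag continues the open span only if the entity type matches.
        if tag.startswith("I-") and current == tag[2:]:
            continue
        if current is not None:
            spans.append((current, start, i))
            start, current = None, None
        if tag.startswith("B-") or tag.startswith("I-"):
            # A stray I- tag with no matching open span starts a new entity (lenient choice).
            start, current = i, tag[2:]
    return spans


# Hypothetical sentence with two adjacent entities of the same type.
tokens = ["Alice", "Bob", "visited", "New", "York"]
tags = ["B-PER", "B-PER", "O", "B-LOC", "I-LOC"]
for label, start, end in bio_to_spans(tags):
    print(label, " ".join(tokens[start:end]))
```

Without the *B-* prefix, "Alice" and "Bob" would merge into a single PER entity; with it, they are recovered as two separate spans.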

We will solve the NER problem on the CoNLL-2003 dataset using recurrent networks and models based on the Transformer architecture.

### Libraries

Main libraries:
  - [PyTorch](https://pytorch.org/)
  - [Transformers](https://github.com/huggingface/transformers)

### Data

The data is stored in an archive, which consists of:

- *train.tsv* - training set. Each line contains: <word / token>, <word / token tag>

- *valid.tsv* - validation set, which can be used to select hyperparameters and measure quality. It has the same structure as train.tsv.

- *test.tsv* - test set, which is used to evaluate the final quality. It has the same structure as train.tsv.

You can download the data here: [link](https://drive.google.com/drive/folders/1OKNrfHsBm1ehbG-yM0R1BGshbscf_eue?usp=drive_link)

In [93]:
# !pip install numpy==1.21.6 scikit-learn==1.0.2 tensorboard==2.9.0 torch==1.12.1 tqdm==4.64.0 transformers==4.21.1

In [94]:
import random
from collections import Counter, defaultdict, namedtuple
from typing import Tuple, List, Dict, Any

import torch
import numpy as np

from tqdm import tqdm, trange

Let's fix the seed for reproducibility of the results (it is advisable to do this **always**!):

In [95]:
def set_global_seed(seed: int) -> None:
    """
    Set global seed for reproducibility.
    """

    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True


set_global_seed(42)

Let’s initialize the device (CPU / GPU) on which we will work (preferably **GPU**):

In [96]:
device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cuda'

Initialize *tensorboard* to log metrics during the training process:

In [97]:
%load_ext tensorboard
%tensorboard --logdir ./logs/Practice_4_NER

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


Reusing TensorBoard on port 6007 (pid 1685183), started 1:43:15 ago. (Use '!kill 1685183' to kill it.)

## Part 1. Data preparation (4 points)

First of all, we need to read the data. Let's write a function that takes as input the path to one of the CoNLL-2003 files and returns two lists:
- a list of lists of words/tokens
- a list of lists of the corresponding tags

P.S. Let's make this function more flexible by adding a boolean parameter that controls whether the tokens are converted to *lowercase*.

**Exercise. Implement the `read_conll2003` function.** **<font color='red'>(1 point)</font>**

In [98]:
import csv

def read_conll2003(
    path: str,
    lower: bool = True,
) -> Tuple[List[List[str]], List[List[str]]]:
    """
    Prepare data in CoNLL like format.
    """
    
    token_seq = []
    label_seq = []

    with open(path, "r", newline='\n') as csvfile:
        reader = csv.reader(csvfile, delimiter=" ", quotechar="|")
        tok_seq = []
        lab_seq = []
        for row in reader:
            # a blank row marks the end of a sentence, regardless of `lower`
            if len(row) == 0:
                token_seq.append(tok_seq)
                label_seq.append(lab_seq)
                tok_seq = []
                lab_seq = []
                continue
            if len(row) != 2:
                print(len(row))
                print(row)
            tok_seq.append(row[0].lower() if lower else row[0])
            lab_seq.append(row[1])

    return token_seq, label_seq

Let's read all three files:

- *train.tsv*
- *valid.tsv*
- *test.tsv*

In [99]:
train_token_seq, train_label_seq = read_conll2003("data/CoNLL-2003/train.tsv")
valid_token_seq, valid_label_seq = read_conll2003("data/CoNLL-2003/valid.tsv")
test_token_seq, test_label_seq = read_conll2003("data/CoNLL-2003/test.tsv")

Look at what we got:

In [100]:
for token, label in zip(train_token_seq[0], train_label_seq[0]):
    print(f"{token}\t{label}")

eu	B-ORG
rejects	O
german	B-MISC
call	O
to	O
boycott	O
british	B-MISC
lamb	O
.	O


In [101]:
for token, label in zip(valid_token_seq[0], valid_label_seq[0]):
    print(f"{token}\t{label}")

cricket	O
-	O
leicestershire	B-ORG
take	O
over	O
at	O
top	O
after	O
innings	O
victory	O
.	O


In [102]:
for token, label in zip(test_token_seq[0], test_label_seq[0]):
    print(f"{token}\t{label}")

soccer	O
-	O
japan	B-LOC
get	O
lucky	O
win	O
,	O
china	B-PER
in	O
surprise	O
defeat	O
.	O


In [103]:
assert len(train_token_seq) == len(train_label_seq), "The lengths of the training token_seq and label_seq do not match, an error in the read_conll2003 function"
assert len(valid_token_seq) == len(valid_label_seq), "The lengths of the validation token_seq and label_seq do not match, an error in the read_conll2003 function"
assert len(test_token_seq) == len(test_label_seq), "The lengths of the test token_seq and label_seq do not match, an error in the read_conll2003 function"

assert train_token_seq[0] == ['eu', 'rejects', 'german', 'call', 'to', 'boycott', 'british', 'lamb', '.'], "Error in training token_seq"
assert train_label_seq[0] == ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O'], "Error in training label_seq"

assert valid_token_seq[0] == ['cricket', '-', 'leicestershire', 'take', 'over', 'at', 'top', 'after', 'innings', 'victory', '.'], "Error in validation token_seq"
assert valid_label_seq[0] == ['O', 'O', 'B-ORG', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'], "Error in validation label_seq"

assert test_token_seq[0] == ['soccer', '-', 'japan', 'get', 'lucky', 'win', ',', 'china', 'in', 'surprise', 'defeat', '.'], "Error in test token_seq"
assert test_label_seq[0] == ['O', 'O', 'B-LOC', 'O', 'O', 'O', 'O', 'B-PER', 'O', 'O', 'O', 'O'], "Error in test label_seq"

print("All tests passed!")

All tests passed!


The CoNLL-2003 dataset is presented in the form of **BIO** tagging, where the label is:
- *B-{label}* - beginning of entity *{label}*
- *I-{label}* - continuation of the entity *{label}*
- *O* - no entity

There are also other sequence tagging methods, such as **BILUO**.

### Preparing dictionaries

To train the neural network, we will use two mappings:
- {**token**}→{**token_idx**}: correspondence between a word/token and a row in the *embedding* matrix (starts from 0);
- {**label**}→{**label_idx**}: correspondence between tag and unique index (starts from 0);

Now we need to implement two functions:
- get_token2idx
- get_label2idx

which will return the corresponding dictionaries.

P.S. token2idx dictionary must also contain special tokens:
- `<PAD>` is a special token for padding, since we are going to train the models in batches
- `<UNK>` is a special token for processing words/tokens that are not in the dictionary (relevant for inference)

Let's assign them to idx 0 and 1 respectively for convenience.

P.P.S. You can also add a *min_count* parameter to get_token2idx, which will only include words exceeding a certain frequency.

First let's collect:
- token2cnt - a dictionary from each unique word/token to its number of occurrences in the training set (it is important to count over the training set only!)
- label_set - a list of unique tags

P.S. You can also use stemming to convert different word forms of the same word into one token, but we will skip this point.

**Exercise. Implement the `get_token2idx` and `get_label2idx` functions.** **<font color='red'>(1 point)</font>**

In [104]:
token2cnt = Counter([token for sentence in train_token_seq for token in sentence])

In [105]:
token2cnt.most_common(10)

[('the', 8390),
 ('.', 7374),
 (',', 7290),
 ('of', 3815),
 ('in', 3621),
 ('to', 3424),
 ('a', 3199),
 ('and', 2872),
 ('(', 2861),
 (')', 2861)]

In [106]:
print(f"Number of unique words in the training dataset: {len(token2cnt)}")
print(f"Number of words occurring only once in the training dataset: {len([token for token, cnt in token2cnt.items() if cnt == 1])}")

Number of unique words in the training dataset: 21010
Number of words occurring only once in the training dataset: 10060


As we can see, many words appear only once in the dataset. We can hardly learn useful representations for them and would mostly overfit, so let's drop such words when forming our vocabulary.
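
The trade-off behind a frequency cut-off can be sketched on a toy corpus: dropping rare words shrinks the vocabulary, while the share of token occurrences that still have their own index (the rest would map to `<UNK>`) falls more slowly. The corpus and the helper `vocab_stats` below are illustrative assumptions, not CoNLL-2003 data.

```python
from collections import Counter

# Toy corpus, purely illustrative (not taken from CoNLL-2003).
toy_tokens = "the cat sat on the mat while the dog sat".split()
toy_counts = Counter(toy_tokens)


def vocab_stats(counts: Counter, min_count: int):
    """Return (vocabulary size, share of token occurrences covered by it)."""
    kept = {tok: c for tok, c in counts.items() if c >= min_count}
    coverage = sum(kept.values()) / sum(counts.values())
    return len(kept), coverage


for min_count in (1, 2):
    size, coverage = vocab_stats(toy_counts, min_count)
    print(f"min_count={min_count}: vocab size={size}, coverage={coverage:.2f}")
```

On this toy corpus the vocabulary drops from 7 words to 2, while half of all token occurrences remain covered.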

In [107]:
# use the min_count parameter to cut off words with frequency cnt < min_count

def get_token2idx(
    token2cnt: Dict[str, int],
    min_count: int,
) -> Dict[str, int]:
    """
    Get mapping from tokens to indices to use with Embedding layer.
    """
    token2idx: Dict[str, int] = {}
    token2idx["<PAD>"] = 0
    token2idx["<UNK>"] = 1
    idx = 2
    for token, cnt in token2cnt.items():
        if cnt < min_count:
            continue
        token2idx[token] = idx
        idx += 1

    return token2idx

In [108]:
token2idx = get_token2idx(token2cnt, min_count=2)

In [109]:
# Function for sorting tags so that first there is an O tag,
# then B- tags and only after I- tags (can be set manually)

def sort_labels_func(x: str) -> int:
    if x == "O":
        return 0
    elif x.startswith("B-"):
        return 1
    else:
        return 2

label_set = sorted(
    set(label for sentence in train_label_seq for label in sentence),
    key=lambda x: (sort_labels_func(x), x),
)

In [110]:
label_set

['O', 'B-LOC', 'B-MISC', 'B-ORG', 'B-PER', 'I-LOC', 'I-MISC', 'I-ORG', 'I-PER']

In [111]:
def get_label2idx(label_set: List[str]) -> Dict[str, int]:
    """
    Get mapping from labels to indices.
    """
    
    label2idx: Dict[str, int] = {}
    idx = 0
    for label in label_set:
        label2idx[label] = idx
        idx += 1
    
    return label2idx

In [112]:
label2idx = get_label2idx(label_set)

Let's look at what we got:

In [113]:
for token, idx in list(token2idx.items())[:10]:
    print(f"{token}\t{idx}")

<PAD>	0
<UNK>	1
eu	2
german	3
call	4
to	5
boycott	6
british	7
lamb	8
.	9


In [114]:
for label, idx in label2idx.items():
    print(f"{label}\t{idx}")

O	0
B-LOC	1
B-MISC	2
B-ORG	3
B-PER	4
I-LOC	5
I-MISC	6
I-ORG	7
I-PER	8


In [115]:
assert len(get_token2idx(token2cnt, min_count=1)) == 21012, "Error in dictionary length, most likely min_count is implemented incorrectly"
assert len(token2idx) == 10952, "Incorrect token2idx length, most likely min_count is implemented incorrectly"
assert len(label2idx) == 9, "Incorrect label2idx length"

assert list(token2idx.items())[:10] == [
    ('<PAD>', 0), ('<UNK>', 1), ('eu', 2), ('german', 3), ('call', 4),
    ('to', 5), ('boycott', 6), ('british', 7), ('lamb', 8), ('.', 9)
], "Wrong format of token2idx"
assert label2idx == {
    'O': 0, 'B-LOC': 1, 'B-MISC': 2, 'B-ORG': 3, 'B-PER': 4,
    'I-LOC': 5, 'I-MISC': 6, 'I-ORG': 7, 'I-PER': 8
}, "Wrong format of label2idx"

print("All tests passed!")

All tests passed!


### Preparing the dataset and loader

Typically, neural networks are trained in batches. This means that each update of the neural network's weights occurs based on multiple sequences. A technical detail is that all sequences within a batch must be padded to the same length.

From the previous practical task, you should know about `Dataset` (`torch.utils.data.Dataset`) - a data structure that stores and can index data for training. The dataset must inherit from the standard PyTorch Dataset class and override the `__len__` and `__getitem__` methods.

The `__getitem__` method must return the indexed sequence and its tags.

**Don't forget** about `<UNK>` special token for unknown words!
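
As a minimal sketch of the `<UNK>` fallback, a dictionary lookup with a default value is enough. The toy vocabulary and the function name `encode` are hypothetical and only serve the illustration.

```python
from typing import Dict, List

# Hypothetical toy vocabulary; the real one is built from the training set.
toy_token2idx: Dict[str, int] = {"<PAD>": 0, "<UNK>": 1, "eu": 2, "rejects": 3}


def encode(tokens: List[str], token2idx: Dict[str, int], unk: str = "<UNK>") -> List[int]:
    """Map tokens to indices, sending unseen tokens to the <UNK> index."""
    unk_idx = token2idx[unk]
    return [token2idx.get(token, unk_idx) for token in tokens]


print(encode(["eu", "rejects", "brexit"], toy_token2idx))  # [2, 3, 1]
```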

Let's write a custom dataset for our task, which will receive as input (the `__init__` method):
- token_seq - list of lists of words/tokens
- label_seq - list of lists of tags
- token2idx
- label2idx

and return from the `__getitem__` method two int64 tensors (`torch.LongTensor`) with the indices of words / tokens in the sample and the indices of the corresponding tags.

**Exercise. Implement the NERDataset class.** **<font color='red'>(1 point)</font>**

In [116]:
class NERDataset(torch.utils.data.Dataset):
    """
    PyTorch Dataset for NER.
    """

    def __init__(
        self,
        token_seq: List[List[str]],
        label_seq: List[List[str]],
        token2idx: Dict[str, int],
        label2idx: Dict[str, int],
    ):
        self.token2idx = token2idx
        self.label2idx = label2idx

        self.token_seq = [self.process_tokens(tokens, token2idx) for tokens in token_seq]
        self.label_seq = [self.process_labels(labels, label2idx) for labels in label_seq]

    def __len__(self):
        return len(self.token_seq)

    def __getitem__(
        self,
        idx: int,
    ) -> Tuple[torch.LongTensor, torch.LongTensor]:
        return torch.LongTensor(self.token_seq[idx]), torch.LongTensor(self.label_seq[idx])

    @staticmethod
    def process_tokens(
        tokens: List[str],
        token2idx: Dict[str, int],
        unk: str = "<UNK>",
    ) -> List[int]:
        """
        Transform list of tokens into list of tokens' indices.
        """
        processed_tokens = []
        for token in tokens:
            processed_tokens.append(token2idx[unk] if token not in token2idx else token2idx[token])
        return processed_tokens

    @staticmethod
    def process_labels(
        labels: List[str],
        label2idx: Dict[str, int],
    ) -> List[int]:
        """
        Transform list of labels into list of labels' indices.
        """
        processed_labels = []
        for label in labels:
            processed_labels.append(label2idx[label])
        return processed_labels

Create three datasets:
- *train_dataset*
- *valid_dataset*
- *test_dataset*

In [117]:
train_dataset = NERDataset(
    token_seq=train_token_seq,
    label_seq=train_label_seq,
    token2idx=token2idx,
    label2idx=label2idx,
)
valid_dataset = NERDataset(
    token_seq=valid_token_seq,
    label_seq=valid_label_seq,
    token2idx=token2idx,
    label2idx=label2idx,
)
test_dataset = NERDataset(
    token_seq=test_token_seq,
    label_seq=test_label_seq,
    token2idx=token2idx,
    label2idx=label2idx,
)

Let's look at what we got:

In [118]:
train_dataset[0]

(tensor([2, 1, 3, 4, 5, 6, 7, 8, 9]), tensor([3, 0, 2, 0, 0, 0, 2, 0, 0]))

In [119]:
valid_dataset[0]

(tensor([1737,  571, 1777,  197,  687,  145,  349,  111, 1819, 1558,    9]),
 tensor([0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0]))

In [120]:
test_dataset[0]

(tensor([1516,  571, 1434, 1729, 4893, 2014,   67,  310,  215, 3157, 3139,    9]),
 tensor([0, 0, 1, 0, 0, 0, 0, 4, 0, 0, 0, 0]))

In [121]:
assert len(train_dataset) == 14986, "Incorrect train_dataset length"
assert len(valid_dataset) == 3465, "Incorrect valid_dataset length"
assert len(test_dataset) == 3683, "Incorrect test_dataset length"

assert torch.equal(train_dataset[0][0], torch.tensor([2,1,3,4,5,6,7,8,9])), "Malformed train_dataset"
assert torch.equal(train_dataset[0][1], torch.tensor([3,0,2,0,0,0,2,0,0])), "Malformed train_dataset"

assert torch.equal(
    valid_dataset[0][0],
    torch.tensor([1737,571,1777,197,687,145,349,111,1819,1558,9])
), "Malformed valid_dataset"
assert torch.equal(valid_dataset[0][1], torch.tensor([0,0,3,0,0,0,0,0,0,0,0])), "Malformed valid_dataset"

assert torch.equal(
    test_dataset[0][0],
    torch.tensor([1516,571,1434,1729,4893,2014,67,310,215,3157,3139,9])
), "Malformed test_dataset"
assert torch.equal(test_dataset[0][1], torch.tensor([0,0,1,0,0,0,0,4,0,0,0,0])), "Malformed test_dataset"

print("All tests passed!")

All tests passed!


To pad the sequences, we will use the `collate_fn` parameter of the `DataLoader` class.

Given a list of pairs of tensors (sentences and tags), we need to pad all sequences to the length of the longest sequence in the batch.

Use the special token `<PAD>` to pad word/token sequences and -1 to pad tag sequences.

**Hint**: it is convenient to use the `pad_sequence` function from `torch.nn.utils.rnn`. Pay attention to the `batch_first` parameter.

`Collator` can be implemented in two ways:
- class with method `__call__`
- function

We will go the first way.

Initialize an instance of the `Collator` class (the `__init__` method) with two parameters:
- the id of the `<PAD>` special token for word/token sequences
- the padding value for tag sequences (-1)

The `__call__` method takes a batch as input, namely a list of tuples of what is returned from the `__getitem__` method of our dataset. In our case, this is a list of tuples of two int64 tensors - `List[Tuple[torch.LongTensor, torch.LongTensor]]`.

As the output we want to get two tensors:
- indices of words/tokens with padding
- indices of tags with padding
    
P.S. The separate padding value for tags (-1) makes it easy to exclude padded positions when calculating the loss. You can use the `ignore_index` parameter when initializing the loss.
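
Conceptually, ignoring an index in the loss means averaging the per-token cross-entropy over the non-padded positions only. The sketch below shows this in plain Python rather than PyTorch; the function name `masked_cross_entropy` and the toy scores are assumptions made for illustration, and the function expects at least one non-padded position.

```python
import math
from typing import List

IGNORE = -1  # label padding value, as used for the tag sequences above


def masked_cross_entropy(logits: List[List[float]], labels: List[int]) -> float:
    """Mean cross-entropy over positions whose label is not IGNORE.

    logits[t] holds the unnormalised class scores for position t.
    """
    losses = []
    for scores, label in zip(logits, labels):
        if label == IGNORE:
            continue  # padded position: contributes nothing to the loss
        log_norm = math.log(sum(math.exp(s) for s in scores))
        losses.append(log_norm - scores[label])  # -log softmax(scores)[label]
    return sum(losses) / len(losses)


logits = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
padded = masked_cross_entropy(logits, [0, 1, IGNORE, IGNORE])
unpadded = masked_cross_entropy(logits[:2], [0, 1])
print(math.isclose(padded, unpadded))
```

The padded and unpadded versions give the same loss, so the amount of padding in a batch does not change the training signal.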

**Exercise. Implement the collator class NERCollator.** **<font color='red'>(1 point)</font>**

In [122]:
from torch.nn.utils.rnn import pad_sequence


class NERCollator:
    """
    Collator that handles variable-size sentences.
    """

    def __init__(
        self,
        token_padding_value: int,
        label_padding_value: int,
    ):
        self.token_padding_value = token_padding_value
        self.label_padding_value = label_padding_value

    def __call__(
        self,
        batch: List[Tuple[torch.LongTensor, torch.LongTensor]],
    ) -> Tuple[torch.LongTensor, torch.LongTensor]:

        tokens, labels = zip(*batch)
        tokens = pad_sequence(tokens, batch_first=True, padding_value=self.token_padding_value)
        labels = pad_sequence(labels, batch_first=True, padding_value=self.label_padding_value)
        return tokens, labels
        # tokens = list(tokens)
        # labels = list(labels)
        # 
        # max_tokens_length = max(map(len, tokens))
        # max_labels_length = max(map(len, labels))
        # 
        # for i in range(len(tokens)):
        #     tokens[i] = torch.cat((tokens[i], torch.full((max_tokens_length - len(tokens[i]),), self.token_padding_value)))
        # 
        # for i in range(len(labels)):
        #     labels[i] = torch.cat((labels[i], torch.full((max_labels_length - len(labels[i]),), self.label_padding_value)))
        # 
        # return torch.stack(tokens), torch.stack(labels)

In [123]:
collator = NERCollator(
    token_padding_value=token2idx["<PAD>"],
    label_padding_value=-1,
)

Now everything is ready to define the loaders.

In [124]:
train_dataloader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=2,
    shuffle=True,
    collate_fn=collator,
)
valid_dataloader = torch.utils.data.DataLoader(
    valid_dataset,
    batch_size=1,  # for correct metrics measurements leave batch_size=1
    shuffle=False, # for correct metrics measurements leave shuffle=False
    collate_fn=collator,
)
test_dataloader = torch.utils.data.DataLoader(
    test_dataset,
    batch_size=1,  # for correct metrics measurements leave batch_size=1
    shuffle=False, # for correct metrics measurements leave shuffle=False
    collate_fn=collator,
)

In [125]:
torch.full((5,), 3)

tensor([3, 3, 3, 3, 3])

Let's look at what we got:

In [126]:
tokens, labels = next(iter(train_dataloader))

tokens = tokens.to(device)
labels = labels.to(device)

In [127]:
tokens

tensor([[7796, 1162, 2553, 7237, 1342,    0,    0,    0,    0,    0],
        [ 125, 1167,    1,   67, 1349,  489, 1215, 1364, 1365, 1366]],
       device='cuda:0')

In [128]:
labels

tensor([[ 3,  0,  3,  7,  0, -1, -1, -1, -1, -1],
        [ 0,  4,  8,  0,  1,  0,  0,  0,  0,  0]], device='cuda:0')

In [129]:
train_tokens, train_labels = next(iter(
    torch.utils.data.DataLoader(
        train_dataset,
        batch_size=2,
        shuffle=False,
        collate_fn=collator,
    )
))
assert torch.equal(
    train_tokens,
    torch.tensor([[2, 1, 3, 4, 5, 6, 7, 8, 9], [10, 11, 0, 0, 0, 0, 0, 0, 0]])
), "Looks like a bug in the collator"
assert torch.equal(
    train_labels,
    torch.tensor([[3, 0, 2, 0, 0, 0, 2, 0, 0], [4, 8, -1, -1, -1, -1, -1, -1, -1]])
), "Looks like a bug in the collator"

valid_tokens, valid_labels = next(iter(
    torch.utils.data.DataLoader(
        valid_dataset,
        batch_size=2,
        shuffle=False,
        collate_fn=collator,
    )
))
assert torch.equal(
    valid_tokens,
    torch.tensor([
        [1737, 571, 1777, 197, 687, 145, 349, 111,  1819, 1558, 9],
        [248, 10679, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    ])), "Looks like a bug in the collator"
assert torch.equal(
    valid_labels,
    torch.tensor([
        [0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1]
    ])), "Looks like a bug in the collator"

test_tokens, test_labels = next(iter(
    torch.utils.data.DataLoader(
        test_dataset,
        batch_size=2,
        shuffle=False,
        collate_fn=collator,
    )
))
assert torch.equal(
    test_tokens,
    torch.tensor([
        [1516, 571, 1434, 1729, 4893, 2014, 67, 310, 215, 3157, 3139, 9],
        [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    ])), "Looks like a bug in the collator"
assert torch.equal(
    test_labels,
    torch.tensor([
        [0, 0, 1, 0, 0, 0, 0, 4, 0, 0, 0, 0],
        [4, 8, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]
    ])), "Looks like a bug in the collator"

print("All tests passed!")

All tests passed!


## Part 2. BiLSTM tagger (6 points)

Define the network architecture using the PyTorch library.

Your architecture at this point should follow the standard tagger:
* Embedding layer at the input
* LSTM (unidirectional or bidirectional) layer for sequence processing
* Dropout (specified separately or built into LSTM) to reduce overfitting
* Linear output layer

To train the network, use an element-wise cross-entropy loss function.

**Please note** that `<PAD>` tokens should not be included in the loss function calculation. It is recommended to use Adam as an optimizer. To obtain prediction values from model outputs, use the `argmax` function.
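
The `argmax` step can be sketched for a single sentence in plain Python: pick the highest-scoring class at each position and map it back to a tag, skipping padded positions. The label subset, the scores, and the helper names below are hypothetical. Note that here the scores are stored per position, whereas a batched tensor may use a different axis order.

```python
from typing import Dict, List

# Hypothetical subset of the label mapping, for illustration only.
toy_label2idx: Dict[str, int] = {"O": 0, "B-PER": 1, "I-PER": 2}
toy_idx2label = {idx: label for label, idx in toy_label2idx.items()}


def argmax(scores: List[float]) -> int:
    """Index of the largest score (the first one wins on ties)."""
    return max(range(len(scores)), key=scores.__getitem__)


def decode(position_scores: List[List[float]], gold: List[int], pad: int = -1) -> List[str]:
    """Predict one tag per position, skipping positions whose gold label is padding."""
    return [
        toy_idx2label[argmax(scores)]
        for scores, label in zip(position_scores, gold)
        if label != pad
    ]


scores = [[0.1, 2.3, 0.2], [0.0, 0.4, 1.9], [3.0, 0.1, 0.1], [0.5, 0.5, 0.5]]
print(decode(scores, gold=[1, 2, 0, -1]))  # ['B-PER', 'I-PER', 'O']
```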

**Exercise. Implement the BiLSTM model class.** **<font color='red'>(2 points)</font>**

In [130]:
class BiLSTM(torch.nn.Module):
    """
    Bidirectional LSTM architecture.
    """

    def __init__(
        self,
        num_embeddings: int,
        embedding_dim: int,
        hidden_size: int,
        num_layers: int,
        dropout: float,
        bidirectional: bool,
        n_classes: int,
    ):
        super().__init__()

        self.embedding = torch.nn.Embedding(num_embeddings=num_embeddings, embedding_dim=embedding_dim)
        self.rnn = torch.nn.LSTM(
            input_size=embedding_dim,
            hidden_size=hidden_size,
            num_layers=num_layers,
            dropout=dropout,
            bidirectional=bidirectional,
            batch_first=True
        )  
        self.head = torch.nn.Linear(hidden_size * 2 if bidirectional else hidden_size, n_classes)

    def forward(self, tokens: torch.LongTensor) -> torch.Tensor:
        embed = self.embedding(tokens)

        # we use the special function pack_padded_sequence in order to obtain a PackedSequence structure
        # that does not take padding into account when passing rnn
        length = (tokens != 0).sum(dim=1).detach().cpu()
        packed_embed = torch.nn.utils.rnn.pack_padded_sequence(
            embed, length, batch_first=True, enforce_sorted=False
        )

        # we use the special function pad_packed_sequence to get a tensor from PackedSequence
        packed_rnn_output, _ = self.rnn(packed_embed)
        rnn_output, _ = torch.nn.utils.rnn.pad_packed_sequence(
            packed_rnn_output, batch_first=True
        )

        logits = self.head(rnn_output)
        return logits.transpose(1, 2)

In [131]:
model = BiLSTM(
    num_embeddings=len(token2idx),
    embedding_dim=100,
    hidden_size=100,
    num_layers=1,
    dropout=0.0,
    bidirectional=True,
    n_classes=len(label2idx),
).to(device)

In [132]:
model

BiLSTM(
  (embedding): Embedding(10952, 100)
  (rnn): LSTM(100, 100, batch_first=True, bidirectional=True)
  (head): Linear(in_features=200, out_features=9, bias=True)
)

In [133]:
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = torch.nn.CrossEntropyLoss(ignore_index=-1)

In [134]:
outputs = model(train_tokens.to(device))

In [135]:
assert outputs.shape == torch.Size([2, 9, 9])
assert 2 < criterion(outputs, train_labels.to(device)) < 3

print("All tests passed!")

All tests passed!


### Experiments

Run experiments on the data. Adjust parameters based on the validation set without using the test set. Your goal is to configure the network so that the quality of the model according to the F1-macro measure on the validation and test sets is no less than **0.76**.

Draw conclusions about model quality, overfitting, and sensitivity of the architecture to the choice of hyperparameters. Present the results of your experiments in the form of a mini-report (in the same ipython notebook).
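
Since the target is stated in terms of F1-macro, it may help to see how it differs from F1-micro when one class dominates, as the *O* tag does here. The following is a small plain-Python sketch on made-up labels; the helper `f1_per_class` is illustrative and is not a replacement for a library implementation.

```python
from typing import Dict, List


def f1_per_class(y_true: List[int], y_pred: List[int]) -> Dict[int, float]:
    """Per-class F1 over all classes seen in y_true or y_pred."""
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


# Toy data: class 0 plays the role of the dominant "O" tag.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 2]

per_class = f1_per_class(y_true, y_pred)
macro_f1 = sum(per_class.values()) / len(per_class)
# For single-label multiclass data, micro F1 coincides with accuracy.
micro_f1 = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"macro F1 = {macro_f1:.3f}, micro F1 = {micro_f1:.3f}")
```

A single missed entity token barely moves the micro average (0.900) but pulls the macro average down to about 0.647, because every class contributes equally to it.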

In [136]:
# let's create a SummaryWriter for experimenting with BiLSTMModel

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir=f"logs/BiLSTMModel")

**Exercise. Implement a metric calculation function `compute_metrics`.** **<font color='red'>(1 point)</font>**

In [137]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


def compute_metrics(
    outputs: torch.Tensor,
    labels: torch.LongTensor,
) -> Dict[str, float]:
    """
    Compute NER metrics.
    """

    metrics = {}
    
    outputs = outputs.argmax(dim=1)
    mask = labels != -1

    # Keep only real positions: padded positions carry the label -1
    y_true = labels[mask].cpu()
    y_pred = outputs[mask].cpu()
    
    # accuracy
    accuracy = accuracy_score(
        y_true=y_true,
        y_pred=y_pred,
    )

    # precision
    precision_micro = precision_score(
        y_true=y_true,
        y_pred=y_pred,
        average="micro",
        zero_division=0,
    )
    precision_macro = precision_score(
        y_true=y_true,
        y_pred=y_pred,
        average="macro",
        zero_division=0,
    )
    precision_weighted = precision_score(
        y_true=y_true,
        y_pred=y_pred,
        average="weighted",
        zero_division=0,
    )

    # recall
    recall_micro = recall_score(
        y_true=y_true,
        y_pred=y_pred,
        average="micro",
        zero_division=0,

    )
    recall_macro = recall_score(
        y_true=y_true,
        y_pred=y_pred,
        average="macro",
        zero_division=0,
    )
    recall_weighted = recall_score(
        y_true=y_true,
        y_pred=y_pred,
        average="weighted",
        zero_division=0,
    )

    # f1
    f1_micro = f1_score(
        y_true=y_true,
        y_pred=y_pred,
        average="micro",
        zero_division=0,
    )
    f1_macro = f1_score(
        y_true=y_true,
        y_pred=y_pred,
        average="macro",
        zero_division=0,
    )
    f1_weighted = f1_score(
        y_true=y_true,
        y_pred=y_pred,
        average="weighted",
        zero_division=0,
    )

    metrics["accuracy"] = accuracy

    metrics["precision_micro"]    = precision_micro
    metrics["precision_macro"]    = precision_macro
    metrics["precision_weighted"] = precision_weighted

    metrics["recall_micro"]    = recall_micro
    metrics["recall_macro"]    = recall_macro
    metrics["recall_weighted"] = recall_weighted

    metrics["f1_micro"]    = f1_micro
    metrics["f1_macro"]    = f1_macro
    metrics["f1_weighted"] = f1_weighted

    return metrics

**Exercise. Implement the training and testing functions `train_epoch` and `evaluate_epoch`. <font color='red'>(2 points)</font>**

In [138]:
def train_epoch(
    model: torch.nn.Module,
    dataloader: torch.utils.data.DataLoader,
    optimizer: torch.optim.Optimizer,
    criterion: torch.nn.Module,
    writer: SummaryWriter,
    device: torch.device,
    epoch: int,
) -> None:
    """
    One training cycle (loop).
    """

    model.train()

    epoch_loss = []
    batch_metrics_list = defaultdict(list)

    for i, (tokens, labels) in tqdm(
        enumerate(dataloader),
        total=len(dataloader),
        desc="loop over train batches",
    ):

        tokens, labels = tokens.to(device), labels.to(device)
        
        # Loss calculation and optimizer step
        optimizer.zero_grad()
        outputs = model(tokens)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        epoch_loss.append(loss.item())
        writer.add_scalar(
            "batch loss / train", loss.item(), epoch * len(dataloader) + i
        )

        with torch.no_grad():
            model.eval()
            outputs_inference = model(tokens)
            model.train()

        batch_metrics = compute_metrics(
            outputs=outputs_inference,
            labels=labels,
        )

        for metric_name, metric_value in batch_metrics.items():
            batch_metrics_list[metric_name].append(metric_value)
            writer.add_scalar(
                f"batch {metric_name} / train",
                metric_value,
                epoch * len(dataloader) + i,
            )

    avg_loss = np.mean(epoch_loss)
    print(f"Train loss: {avg_loss}\n")
    writer.add_scalar("loss / train", avg_loss, epoch)

    for metric_name, metric_value_list in batch_metrics_list.items():
        metric_value = np.mean(metric_value_list)
        print(f"Train {metric_name}: {metric_value}\n")
        writer.add_scalar(f"{metric_name} / train", metric_value, epoch)

In [139]:
def evaluate_epoch(
    model: torch.nn.Module,
    dataloader: torch.utils.data.DataLoader,
    criterion: torch.nn.Module,
    writer: SummaryWriter,
    device: torch.device,
    epoch: int,
) -> None:
    """
    One evaluation cycle (loop).
    """

    model.eval()

    epoch_loss = []
    batch_metrics_list = defaultdict(list)

    with torch.no_grad():

        for i, (tokens, labels) in tqdm(
            enumerate(dataloader),
            total=len(dataloader),
            desc="loop over test batches",
        ):

            tokens, labels = tokens.to(device), labels.to(device)

            # Loss calculation
            outputs = model(tokens)
            loss = criterion(outputs, labels)

            epoch_loss.append(loss.item())
            writer.add_scalar(
                "batch loss / test", loss.item(), epoch * len(dataloader) + i
            )

            batch_metrics = compute_metrics(
                outputs=outputs,
                labels=labels,
            )

            for metric_name, metric_value in batch_metrics.items():
                batch_metrics_list[metric_name].append(metric_value)
                writer.add_scalar(
                    f"batch {metric_name} / test",
                    metric_value,
                    epoch * len(dataloader) + i,
                )

        avg_loss = np.mean(epoch_loss)
        print(f"Test loss:  {avg_loss}\n")
        writer.add_scalar("loss / test", avg_loss, epoch)

        for metric_name, metric_value_list in batch_metrics_list.items():
            metric_value = np.mean(metric_value_list)
            print(f"Test {metric_name}: {metric_value}\n")
            writer.add_scalar(f"{metric_name} / test", metric_value, epoch)

In [140]:
def train(
    n_epochs: int,
    model: torch.nn.Module,
    train_dataloader: torch.utils.data.DataLoader,
    test_dataloader: torch.utils.data.DataLoader,
    optimizer: torch.optim.Optimizer,
    criterion: torch.nn.Module,
    writer: SummaryWriter,
    device: torch.device,
    eval_period: int = 1,
) -> None:
    """
    Training loop.
    """

    for epoch in range(n_epochs):

        print(f"Epoch [{epoch+1} / {n_epochs}]\n")

        train_epoch(
            model=model,
            dataloader=train_dataloader,
            optimizer=optimizer,
            criterion=criterion,
            writer=writer,
            device=device,
            epoch=epoch,
        )
        if (
            eval_period == 1
            or (epoch != 0 and epoch % eval_period == 0)
            or epoch == n_epochs - 1
        ):
            evaluate_epoch(
                model=model,
                dataloader=test_dataloader,
                criterion=criterion,
                writer=writer,
                device=device,
                epoch=epoch,
            )
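
The evaluation schedule used in `train` can be checked in isolation. The sketch below copies the condition into a hypothetical helper, `should_evaluate` (the name is an assumption, not part of the notebook), and lists which zero-based epochs trigger an evaluation for a given `eval_period`.

```python
def should_evaluate(epoch: int, n_epochs: int, eval_period: int) -> bool:
    """Hypothetical helper mirroring the condition inside train()."""
    return (
        eval_period == 1                                # evaluate after every epoch
        or (epoch != 0 and epoch % eval_period == 0)    # every eval_period-th epoch, skipping epoch 0
        or epoch == n_epochs - 1                        # always evaluate after the last epoch
    )


# Zero-based epochs that would be evaluated in a 10-epoch run.
evaluated_every_3 = [e for e in range(10) if should_evaluate(e, 10, 3)]
evaluated_every_4 = [e for e in range(10) if should_evaluate(e, 10, 4)]
print(evaluated_every_3)  # [3, 6, 9]
print(evaluated_every_4)  # [4, 8, 9]
```

With `eval_period=1`, as in the experiments in this notebook, every epoch is evaluated.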

**Exercise. Conduct experiments. <font color='red'>(2 points)</font>**

In [141]:
train_dataloader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=50,
    shuffle=True,
    collate_fn=collator,
)
test_dataloader = torch.utils.data.DataLoader(
    test_dataset,
    batch_size=1,
    shuffle=False,
    collate_fn=collator,
)

model = BiLSTM(
    num_embeddings=len(token2idx),
    embedding_dim=100,
    hidden_size=100,
    num_layers=2,
    dropout=0.0,
    bidirectional=True,
    n_classes=len(label2idx),
).to(device)
writer = SummaryWriter(log_dir="logs/BiLSTMModel")
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = torch.nn.CrossEntropyLoss(ignore_index=-1)
train(5, model, train_dataloader, test_dataloader, optimizer, criterion, writer, device, eval_period=1)

Epoch [1 / 5]



loop over train batches: 100%|██████████| 300/300 [00:12<00:00, 24.16it/s]


Train loss: 0.618713159263134

Train accuracy: 0.8540943613425274

Train precision_micro: 0.8540943613425274

Train precision_macro: 0.30276129732484625

Train precision_weighted: 0.7727601216933763

Train recall_micro: 0.8540943613425274

Train recall_macro: 0.20990125588463915

Train recall_weighted: 0.8540943613425274

Train f1_micro: 0.8540943613425274

Train f1_macro: 0.2257408838418417

Train f1_weighted: 0.8013472341289612



loop over test batches: 100%|██████████| 3683/3683 [00:52<00:00, 70.63it/s]


Test loss:  0.4477899064903003

Test accuracy: 0.8692069005979718

Test precision_micro: 0.8692069005979718

Test precision_macro: 0.6584538429562138

Test precision_weighted: 0.8247358664530526

Test recall_micro: 0.8692069005979718

Test recall_macro: 0.6789678995588228

Test recall_weighted: 0.8692069005979718

Test f1_micro: 0.8692069005979718

Test f1_macro: 0.6612050731435178

Test f1_weighted: 0.8390781811552914

Epoch [2 / 5]



loop over train batches: 100%|██████████| 300/300 [00:11<00:00, 25.02it/s]


Train loss: 0.25097804891566433

Train accuracy: 0.9301465636116321

Train precision_micro: 0.9301465636116321

Train precision_macro: 0.7636945154138495

Train precision_weighted: 0.9247920879726524

Train recall_micro: 0.9301465636116321

Train recall_macro: 0.6253741072151899

Train recall_weighted: 0.9301465636116321

Train f1_micro: 0.9301465636116321

Train f1_macro: 0.6650849844791329

Train f1_weighted: 0.9228089054011074



loop over test batches: 100%|██████████| 3683/3683 [00:51<00:00, 71.05it/s]


Test loss:  0.30208325617901294

Test accuracy: 0.9103569041557942

Test precision_micro: 0.9103569041557942

Test precision_macro: 0.772532385916472

Test precision_weighted: 0.8922305617457318

Test recall_micro: 0.9103569041557942

Test recall_macro: 0.7776788886857227

Test recall_weighted: 0.9103569041557942

Test f1_micro: 0.9103569041557942

Test f1_macro: 0.7694805082176382

Test f1_weighted: 0.8961550218016764

Epoch [3 / 5]



loop over train batches: 100%|██████████| 300/300 [00:12<00:00, 24.71it/s]


Train loss: 0.13810714482019346

Train accuracy: 0.9631377464315688

Train precision_micro: 0.9631377464315688

Train precision_macro: 0.8731911782122127

Train precision_weighted: 0.9631793329389643

Train recall_micro: 0.9631377464315688

Train recall_macro: 0.8090636536034771

Train recall_weighted: 0.9631377464315688

Train f1_micro: 0.9631377464315688

Train f1_macro: 0.8281859127016111

Train f1_weighted: 0.9615098729850495



loop over test batches: 100%|██████████| 3683/3683 [00:51<00:00, 71.10it/s]


Test loss:  0.2912164108950435

Test accuracy: 0.9110859203068519

Test precision_micro: 0.9110859203068519

Test precision_macro: 0.7886913206057127

Test precision_weighted: 0.9028544530641106

Test recall_micro: 0.9110859203068519

Test recall_macro: 0.7896424598076893

Test recall_weighted: 0.9110859203068519

Test f1_micro: 0.9110859203068519

Test f1_macro: 0.7833178646195499

Test f1_weighted: 0.9020498127609006

Epoch [4 / 5]



loop over train batches: 100%|██████████| 300/300 [00:11<00:00, 25.21it/s]


Train loss: 0.08280609607696533

Train accuracy: 0.9794993446884221

Train precision_micro: 0.9794993446884221

Train precision_macro: 0.9314613904754974

Train precision_weighted: 0.9798930870987289

Train recall_micro: 0.9794993446884221

Train recall_macro: 0.8951564565462968

Train recall_weighted: 0.9794993446884221

Train f1_micro: 0.9794993446884221

Train f1_macro: 0.9059566640178601

Train f1_weighted: 0.9789138883032152



loop over test batches: 100%|██████████| 3683/3683 [00:51<00:00, 71.03it/s]


Test loss:  0.29146824920505054

Test accuracy: 0.9113490750094839

Test precision_micro: 0.9113490750094839

Test precision_macro: 0.7913686435385172

Test precision_weighted: 0.923375266227986

Test recall_micro: 0.9113490750094839

Test recall_macro: 0.788279199885201

Test recall_weighted: 0.9113490750094839

Test f1_micro: 0.9113490750094839

Test f1_macro: 0.7844562363919

Test f1_weighted: 0.9122069284429111

Epoch [5 / 5]



loop over train batches: 100%|██████████| 300/300 [00:12<00:00, 23.49it/s]


Train loss: 0.05120561507840951

Train accuracy: 0.9882850707387839

Train precision_micro: 0.9882850707387839

Train precision_macro: 0.9580792730675841

Train precision_weighted: 0.9886270416184529

Train recall_micro: 0.9882850707387839

Train recall_macro: 0.9365413895994812

Train recall_weighted: 0.9882850707387839

Train f1_micro: 0.9882850707387839

Train f1_macro: 0.9434020366483925

Train f1_weighted: 0.9880528663623329



loop over test batches: 100%|██████████| 3683/3683 [00:50<00:00, 72.68it/s]

Test loss:  0.2837645030205308

Test accuracy: 0.9170345376495916

Test precision_micro: 0.9170345376495916

Test precision_macro: 0.8041912361737188

Test precision_weighted: 0.9222604831726753

Test recall_micro: 0.9170345376495916

Test recall_macro: 0.8039598294264774

Test recall_weighted: 0.9170345376495916

Test f1_micro: 0.9170345376495916

Test f1_macro: 0.7989890890855159

Test f1_weighted: 0.9153768388403711






## Report

The model quickly reaches the required quality, and that quality does not appear to be sensitive to the choice of hyperparameters. I changed only the batch size of the training dataloader and the learning rate. The batch size was increased to speed up computation, and the learning rate was increased to compensate for the larger batches. Smaller batches might give a slightly better final model, but larger batches make training much faster. The model shows no clear signs of overfitting, so dropout can stay at zero.
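
When reading the logs above, note that accuracy and every micro-averaged metric coincide in each epoch. This is expected when each token has exactly one gold tag and one predicted tag: every mistake counts once as a false positive and once as a false negative, so micro precision, micro recall and micro F1 all reduce to accuracy. The macro scores are the informative ones here, because they weight rare entity tags as much as the dominant `O` tag. A minimal pure-Python sketch on toy data (the helper `micro_macro_f1` is hypothetical and is not the notebook's `compute_metrics`):

```python
from typing import Dict, List


def micro_macro_f1(y_true: List[int], y_pred: List[int]) -> Dict[str, float]:
    """Hypothetical helper: accuracy, micro F1 and macro F1 for single-label tagging."""
    classes = sorted(set(y_true) | set(y_pred))
    tp = {c: 0 for c in classes}
    fp = {c: 0 for c in classes}
    fn = {c: 0 for c in classes}
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # the predicted class gets a false positive
            fn[t] += 1  # the gold class gets a false negative

    def f1(tp_: int, fp_: int, fn_: int) -> float:
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in classes) / len(classes)
    accuracy = sum(tp.values()) / len(y_true)
    return {"accuracy": accuracy, "f1_micro": micro, "f1_macro": macro}


# Toy data: class 0 dominates, and the single example of class 1 is missed.
y_true = [0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 0, 2]
scores = micro_macro_f1(y_true, y_pred)
print(scores)
```

In this toy case accuracy and micro F1 are both 5/6, while macro F1 drops to 17/27 because one of the three classes is never predicted.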

## Part 3. Transformers tagger (6 points)

In this part of the task, you will solve the same problem with a model based on the Transformer architecture: you will fine-tune a pre-trained **BERT** model.

This model requires special data preparation, which is where we will start:

The **BERT** model uses a WordPiece tokenizer to break sentences into tokens. A pre-trained version of this tokenizer is available in the `transformers` library in two classes: `BertTokenizer` and `BertTokenizerFast`. You can use either one, but the second is much faster because it is backed by the Rust-based `tokenizers` library.
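
To make the idea of WordPiece splitting concrete, here is a minimal pure-Python sketch of greedy longest-match-first tokenization of a single word. The vocabulary `toy_vocab` and the function `wordpiece_tokenize` are illustrative assumptions; the real tokenizer uses a vocabulary of tens of thousands of pieces and additional normalization steps.

```python
from typing import List, Optional, Set


def wordpiece_tokenize(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        piece: Optional[str] = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary (an assumption for illustration only).
toy_vocab = {"e", "##u", "reject", "##s", "lamb"}
print(wordpiece_tokenize("eu", toy_vocab))       # ['e', '##u']
print(wordpiece_tokenize("rejects", toy_vocab))  # ['reject', '##s']
print(wordpiece_tokenize("xyz", toy_vocab))      # ['[UNK]']
```

Because one word can become several pieces, word-level tags have to be aligned with sub-tokens later, in the collator.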

Tokenizers can be trained from scratch using your own data corpus, or you can load pre-trained ones. Pre-trained tokenizers typically match a pre-trained model configuration that uses the vocabulary from that tokenizer.

We will use a basic pretrained **BERT**-family configuration (here the distilled `distilbert-base-cased`) for both the model and the tokenizer.

P.S. Often you have to experiment with models of different architectures, for example **BERT** and **GPT**, so it is convenient to use the `AutoTokenizer` class, which, based on the name of the model, will determine which class is needed to initialize the tokenizer.

In [51]:
from transformers import AutoTokenizer

In [52]:
model_name = "distilbert-base-cased"

Pretrained models and tokenizers are loaded from `huggingface` using the `from_pretrained` constructor.

In this constructor, you can specify either the path to the pretrained tokenizer, or the name of the pretrained configuration, as in our case. `transformers` will load the necessary parameters itself:

In [53]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

### Preparing dictionaries

Unlike with recurrent models, we no longer need to build a token dictionary ourselves: the pre-trained tokenizer already ships with its own vocabulary.

But as before, we will need:
- {**label**}→{**label_idx**}: correspondence between tag and unique index (starts from 0);

We have already implemented this mapping in one of the previous parts of the task.
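
For reference, such a mapping can be built along the following lines. This is only a sketch: `build_label2idx` is a hypothetical helper, and the index order it produces is an assumption that need not match the `label2idx` built earlier in this notebook.

```python
from typing import Dict, List


def build_label2idx(label_seq: List[List[str]]) -> Dict[str, int]:
    """Hypothetical helper: map every tag to a unique index starting from 0."""
    unique = sorted({label for labels in label_seq for label in labels})
    # Assumption: put "O" first so that the most common tag gets index 0.
    if "O" in unique:
        unique.remove("O")
        unique.insert(0, "O")
    return {label: idx for idx, label in enumerate(unique)}


toy_label_seq = [["B-PER", "I-PER", "O"], ["O", "B-ORG"]]
label2idx_demo = build_label2idx(toy_label_seq)
print(label2idx_demo)  # {'O': 0, 'B-ORG': 1, 'B-PER': 2, 'I-PER': 3}
```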

### Preparing the dataset and loader

We also want to train the model in batches, so we will still need `Dataset`, `Collator` and `DataLoader`.

But we cannot reuse those from the previous parts of the task, since the data processing must be done a little differently using a tokenizer.

Let's write a new custom dataset that will receive as input (the `__init__` method):
- token_seq - list of lists of words/tokens
- label_seq - list of lists of tags

and return two lists from the `__getitem__` method:
- a list of strings (`List[str]`) with the words/tokens of the sample
- a list of integers (`List[int]`) with the indices of the corresponding tags

P.S. Unlike the previous custom dataset, here we return two plain lists instead of `torch.LongTensor` objects. The logic for building a padded batch moves to the `Collator` because of how the tokenizer works: it returns an already padded tensor of token indexes itself, while the tag indexes still have to be padded by us, as in the previous dataset.

**Exercise. Implement the TransformersDataset class. <font color='red'>(1 point)</font>**

In [54]:
class TransformersDataset(torch.utils.data.Dataset):
    """
    Transformers Dataset for NER.
    """

    def __init__(
        self,
        token_seq: List[List[str]],
        label_seq: List[List[str]],
    ):
        self.token_seq = token_seq
        self.label_seq = [self.process_labels(labels, label2idx) for labels in label_seq]

    def __len__(self):
        return len(self.token_seq)

    def __getitem__(
        self,
        idx: int,
    ) -> Tuple[List[str], List[int]]:
        return self.token_seq[idx], self.label_seq[idx]

    @staticmethod
    def process_labels(
        labels: List[str],
        label2idx: Dict[str, int],
    ) -> List[int]:
        """
        Transform list of labels into list of labels' indices.
        """
        processed_labels = []
        for label in labels:
            processed_labels.append(label2idx[label])
        return processed_labels

Create three datasets:
- *train_dataset*
- *valid_dataset*
- *test_dataset*

In [55]:
train_dataset = TransformersDataset(
    token_seq=train_token_seq,
    label_seq=train_label_seq,
)
valid_dataset = TransformersDataset(
    token_seq=valid_token_seq,
    label_seq=valid_label_seq,
)
test_dataset = TransformersDataset(
    token_seq=test_token_seq,
    label_seq=test_label_seq,
)

Let's look at what we got:

In [56]:
train_dataset[0]

(['eu', 'rejects', 'german', 'call', 'to', 'boycott', 'british', 'lamb', '.'],
 [3, 0, 2, 0, 0, 0, 2, 0, 0])

In [57]:
valid_dataset[0]

(['cricket',
  '-',
  'leicestershire',
  'take',
  'over',
  'at',
  'top',
  'after',
  'innings',
  'victory',
  '.'],
 [0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0])

In [58]:
test_dataset[0]

(['soccer',
  '-',
  'japan',
  'get',
  'lucky',
  'win',
  ',',
  'china',
  'in',
  'surprise',
  'defeat',
  '.'],
 [0, 0, 1, 0, 0, 0, 0, 4, 0, 0, 0, 0])

In [59]:
assert len(train_dataset) == 14986, "Incorrect train_dataset length"
assert len(valid_dataset) == 3465, "Incorrect valid_dataset length"
assert len(test_dataset) == 3683, "Incorrect test_dataset length"

assert train_dataset[0][0] == ['eu', 'rejects', 'german', 'call', 'to', 'boycott', 'british', 'lamb', '.'], "Malformed train_dataset"
assert train_dataset[0][1] == [3,0,2,0,0,0,2,0,0], "Malformed train_dataset"

assert valid_dataset[0][0] == ['cricket', '-', 'leicestershire', 'take', 'over', 'at', 'top', 'after', 'innings', 'victory', '.'], "Malformed valid_dataset"
assert valid_dataset[0][1] == [0,0,3,0,0,0,0,0,0,0,0], "Malformed valid_dataset"

assert test_dataset[0][0] == ['soccer', '-', 'japan', 'get', 'lucky', 'win', ',', 'china', 'in', 'surprise', 'defeat', '.'], "Malformed test_dataset"
assert test_dataset[0][1] == [0,0,1,0,0,0,0,4,0,0,0,0], "Malformed test_dataset"

print("All tests passed!")

All tests passed!


Let's implement a new `Collator`.

The collator will be initialized with 3 arguments:
- tokenizer
- tokenizer parameters in the form of a dictionary (then used as `**kwargs`)
- special token id for tag sequences (value -1)

The `__call__` method takes a batch as input, namely a list of whatever the dataset's `__getitem__` method returns. In our case, this is a list of tuples of two lists: `List[Tuple[List[str], List[int]]]`.

At the output we want to get two objects:
- A `BatchEncoding` with the padded word/token indexes (and the attention mask)
- A tensor with the padded tag indexes

**Exercise. Implement the TransformersCollator class. <font color='red'>(2 points)</font>**

In [60]:
from transformers import PreTrainedTokenizer
from transformers.tokenization_utils_base import BatchEncoding


class TransformersCollator:
    """
    Transformers Collator that handles variable-size sentences.
    """

    def __init__(
        self,
        tokenizer: PreTrainedTokenizer,
        tokenizer_kwargs: Dict[str, Any],
        label_padding_value: int,
    ):
        self.tokenizer = tokenizer
        self.tokenizer_kwargs = tokenizer_kwargs

        self.label_padding_value = label_padding_value

    def __call__(
        self,
        batch: List[Tuple[List[str], List[int]]],
    ) -> Tuple[BatchEncoding, torch.LongTensor]:
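        tokens, labels = zip(*batch)

        # YOUR CODE HERE
        # Tokenize the pre-split words; the tokenizer pads the batch itself.
        tokens = self.tokenizer(tokens, **self.tokenizer_kwargs)
        # Align word-level labels with sub-tokens (this needs the offset mapping).
        labels = self.encode_labels(tokens, labels, self.label_padding_value)

        # The model's forward() does not accept offset_mapping, so drop it.
        tokens.pop("offset_mapping")

        return tokens, labels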

    @staticmethod
    def encode_labels(
        tokens: BatchEncoding,
        labels: List[List[int]],
        label_padding_value: int,
    ) -> torch.LongTensor:

        encoded_labels = []

        for doc_labels, doc_offset in zip(labels, tokens.offset_mapping):

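            # Start with every position masked by the padding value.
            doc_enc_labels = np.ones(len(doc_offset), dtype=int) * label_padding_value
            arr_offset = np.array(doc_offset)

            # With is_split_into_words=True, offsets are relative to each word:
            # the first sub-token of a word starts at 0 and has a non-zero end,
            # while special and padding tokens have offset (0, 0).
            doc_enc_labels[(arr_offset[:, 0] == 0) & (arr_offset[:, 1] != 0)] = doc_labels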
            encoded_labels.append(doc_enc_labels.tolist())

        return torch.LongTensor(encoded_labels)
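
The label alignment performed in `encode_labels` can be illustrated without a tokenizer. The sketch below uses hand-written offsets for a hypothetical encoding `[CLS] e ##u rejects [SEP] [PAD]`; the function `align_labels` and the offsets are assumptions made for illustration, chosen to mirror the `(start == 0) & (end != 0)` mask used above.

```python
from typing import List, Tuple


def align_labels(
    offsets: List[Tuple[int, int]],
    word_labels: List[int],
    pad_value: int = -1,
) -> List[int]:
    """Give each word's label to its first sub-token and mask everything else."""
    aligned: List[int] = []
    labels_iter = iter(word_labels)
    for start, end in offsets:
        if start == 0 and end != 0:
            # First sub-token of a word: consume the next word-level label.
            aligned.append(next(labels_iter))
        else:
            # Continuation sub-token, special token, or padding.
            aligned.append(pad_value)
    return aligned


# Hand-written offsets for: [CLS] e ##u rejects [SEP] [PAD]
toy_offsets = [(0, 0), (0, 1), (1, 2), (0, 7), (0, 0), (0, 0)]
aligned = align_labels(toy_offsets, word_labels=[3, 0])
print(aligned)  # [-1, 3, -1, 0, -1, -1]
```

Positions marked with `-1` are later skipped by the loss through `ignore_index=-1`, so only the first sub-token of each word contributes to training and to the metrics.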

In [61]:
tokenizer_kwargs = {
    "is_split_into_words":    True,
    "return_offsets_mapping": True,
    "padding":                True,
    "truncation":             True,
    "max_length":             512,
    "return_tensors":         "pt",
}

In [62]:
collator = TransformersCollator(
    tokenizer=tokenizer,
    tokenizer_kwargs=tokenizer_kwargs,
    label_padding_value=-1,
)

Now you're ready to define the loaders:

In [78]:
train_dataloader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=2,
    shuffle=True,
    collate_fn=collator,
)
valid_dataloader = torch.utils.data.DataLoader(
    valid_dataset,
    batch_size=1,  # for correct metrics measurements leave batch_size=1
    shuffle=False, # for correct metrics measurements leave shuffle=False
    collate_fn=collator,
)
test_dataloader = torch.utils.data.DataLoader(
    test_dataset,
    batch_size=1,  # for correct metrics measurements leave batch_size=1
    shuffle=False, # for correct metrics measurements leave shuffle=False
    collate_fn=collator,
)

Let's look at what we got:

In [79]:
tokens, labels = next(iter(train_dataloader))

tokens = tokens.to(device)
labels = labels.to(device)

In [80]:
tokens

{'input_ids': tensor([[  101,   172,  9238,  9129, 17512,  1116,  1105,  1266,  1484,  1112,
          2157,  2495,  6262,   117,  1150,  3042,  1114,  1679,  3329,  1106,
          1246,  1103,  7260,  1111,  1679,  3329,   112,   188,  1710,   117,
          1125,  1500,  1172,  1119,  1156,  5397,  1136,  1322, 18649,  1679,
          3329,   117,  1133,  1152,  1225,  1136,  1221,  2480,  1119,  1156,
          1322, 18649,  1330,  3234,   119,   102],
        [  101,   179, 15677,  1144,  1151,  6825,  1118,  2212,  1851,  8485,
           117,  7032,  1121, 17599,  7301,  4964,  1884, 15615,   119,  1105,
          1835,  1671,  6555,  1884, 15615,   119,  1106,  1103, 27629,  1182,
          5491,  1433,   119,   102,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1

In [81]:
labels

tensor([[-1,  3, -1,  0,  0, -1,  0,  0,  0,  0,  0,  4, -1,  0,  0,  0,  0,  4,
         -1,  0,  0,  0,  0,  0,  4, -1,  0, -1,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0, -1,  4, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, -1,  0,  0,
          0, -1],
        [-1,  2, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  3, -1, -1,  7, -1,
         -1,  0,  3,  7,  7,  7, -1, -1,  0,  0,  1, -1, -1,  0,  0, -1, -1, -1,
         -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
         -1, -1]], device='cuda:0')

In [82]:
train_tokens, train_labels = next(iter(
    torch.utils.data.DataLoader(
        train_dataset,
        batch_size=2,
        shuffle=False,
        collate_fn=collator,
    )
))
assert torch.equal(
    train_tokens['input_ids'],
    torch.tensor([[101, 174, 1358, 22961, 176, 14170, 1840, 1106, 21423, 9304, 10721, 1324, 2495, 12913, 119, 102],
                  [101, 11109, 1200, 1602, 6715, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
                )), "Looks like a bug in the collator"
assert torch.equal(
    train_tokens['attention_mask'],
    torch.tensor([
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    ])), "Looks like a bug in the collator"
assert torch.equal(
    train_labels,
    torch.tensor([
        [-1, 3, -1, 0, 2, -1, 0, 0, 0, 2, -1, -1, 0, -1, 0, -1],
        [-1, 4, -1, 8, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]
    ])), "Looks like a bug in the collator"

valid_tokens, valid_labels = next(iter(
    torch.utils.data.DataLoader(
        valid_dataset,
        batch_size=2,
        shuffle=False,
        collate_fn=collator,
    )
))
assert torch.equal(
    valid_tokens['input_ids'],
    torch.tensor([
        [101, 5428, 118, 5837, 18117, 5759, 15189, 1321, 1166, 1120, 1499, 1170, 6687, 2681, 119, 102],
        [101, 25338, 17996, 1820, 118, 4775, 118, 1476, 102, 0, 0, 0, 0, 0, 0, 0]
    ])), "Looks like a bug in the collator"
assert torch.equal(
    valid_tokens['attention_mask'],
    torch.tensor([
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
    ])), "Looks like a bug in the collator"
assert torch.equal(
    valid_labels,
    torch.tensor([
        [-1,  0,  0,  3, -1, -1, -1,  0,  0,  0,  0,  0,  0,  0,  0, -1],
        [-1,  1, -1,  0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]
    ])), "Looks like a bug in the collator"

test_tokens, test_labels = next(iter(
    torch.utils.data.DataLoader(
        test_dataset,
        batch_size=2,
        shuffle=False,
        collate_fn=collator,
    )
))
assert torch.equal(
    test_tokens['input_ids'],
    torch.tensor([
        [101, 5862, 118, 179, 26519, 1179, 1243, 6918, 1782, 117, 5144, 1161, 1107, 3774, 3326, 119, 102],
        [101, 9468, 3309, 1306, 19122, 2293, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    ])), "Looks like a bug in the collator"
assert torch.equal(
    test_tokens['attention_mask'],
    torch.tensor([
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    ])), "Looks like a bug in the collator"
assert torch.equal(
    test_labels,
    torch.tensor([
        [-1,  0,  0,  1, -1, -1,  0,  0,  0,  0,  4, -1,  0,  0,  0,  0, -1],
        [-1,  4, -1, -1,  8, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]
    ])), "Looks like a bug in the collator"

print("All tests passed!")

All tests passed!


The **transformers** library contains classes for the BERT model, already customized to solve specific problems, with corresponding classification heads. For the NER task we will use the `BertForTokenClassification` class.

By analogy with tokenizers, we can use the `AutoModelForTokenClassification` class, which, based on the name of the model, will determine which class is needed to initialize the model.

In [83]:
from transformers import AutoModelForTokenClassification

In [84]:
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=len(label2idx),
).to(device)

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [85]:
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

In [86]:
outputs = model(**tokens)

In [87]:
assert 2 < criterion(outputs["logits"].transpose(1, 2), labels) < 3

print("All tests passed!")

All tests passed!


In [88]:
# let's create a SummaryWriter for experimenting with the Transformer model

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="logs/Transformer")

### Experiments

Run experiments on the data. Adjust parameters based on the validation set without using the test set. Your goal is to configure the network so that the quality of the model according to the F1-macro measure on the validation and test sets is no less than **0.9**.

Draw conclusions about model quality, overfitting, and sensitivity of the architecture to the choice of hyperparameters. Present the results of your experiments in the form of a mini-report (in the same ipython notebook).

You can reuse the same training functions as before with two changes: call the model as `model(**tokens)` instead of `model(tokens)`, and use `outputs["logits"].transpose(1, 2)` wherever `outputs` was used.
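
The transpose is needed because `torch.nn.CrossEntropyLoss` expects the class scores in dimension 1, that is `(batch, classes, sequence)`, whereas the token-classification model returns logits shaped `(batch, sequence, classes)`. What the loss then computes with `ignore_index=-1` can be sketched in pure Python for a single sequence. The function `masked_cross_entropy` is a hypothetical illustration, not the PyTorch implementation, and the nine classes are an assumption matching a BIO scheme with four entity types.

```python
import math
from typing import List


def masked_cross_entropy(
    logits: List[List[float]],
    labels: List[int],
    ignore_index: int = -1,
) -> float:
    """Mean negative log-likelihood over positions whose label is not ignore_index."""
    losses = []
    for token_logits, label in zip(logits, labels):
        if label == ignore_index:
            continue  # padding, special tokens and continuation sub-tokens are skipped
        # Numerically stable log-sum-exp over the class scores of this token.
        max_logit = max(token_logits)
        log_norm = max_logit + math.log(sum(math.exp(x - max_logit) for x in token_logits))
        losses.append(log_norm - token_logits[label])
    return sum(losses) / len(losses)


n_classes = 9  # assumption: nine BIO tags
uniform_logits = [[0.0] * n_classes for _ in range(4)]
toy_labels = [-1, 3, 0, -1]
loss = masked_cross_entropy(uniform_logits, toy_labels)
print(round(loss, 4))  # 2.1972
```

A model that predicts all nine tags with equal probability has a loss of ln 9 ≈ 2.197, which is consistent with the earlier check that the loss of the freshly initialized classifier lies between 2 and 3.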

**Exercise. Conduct experiments.** **<font color='red'>(2 points)</font>**

In [89]:
def train_epoch_trans(
    model: torch.nn.Module,
    dataloader: torch.utils.data.DataLoader,
    optimizer: torch.optim.Optimizer,
    criterion: torch.nn.Module,
    writer: SummaryWriter,
    device: torch.device,
    epoch: int,
) -> None:
    """
    One training cycle (loop).
    """

    model.train()

    epoch_loss = []
    batch_metrics_list = defaultdict(list)

    for i, (tokens, labels) in tqdm(
        enumerate(dataloader),
        total=len(dataloader),
        desc="loop over train batches",
    ):

        tokens, labels = tokens.to(device), labels.to(device)
        
        # Loss calculation and optimizer step
        optimizer.zero_grad()
        outputs = model.forward(**tokens)["logits"].transpose(1, 2)
        
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        epoch_loss.append(loss.item())
        writer.add_scalar(
            "batch loss / train", loss.item(), epoch * len(dataloader) + i
        )

        with torch.no_grad():
            model.eval()
            outputs_inference = model(**tokens)["logits"].transpose(1, 2)
            model.train()

        batch_metrics = compute_metrics(
            outputs=outputs_inference,
            labels=labels,
        )

        for metric_name, metric_value in batch_metrics.items():
            batch_metrics_list[metric_name].append(metric_value)
            writer.add_scalar(
                f"batch {metric_name} / train",
                metric_value,
                epoch * len(dataloader) + i,
            )

    avg_loss = np.mean(epoch_loss)
    print(f"Train loss: {avg_loss}\n")
    writer.add_scalar("loss / train", avg_loss, epoch)

    for metric_name, metric_value_list in batch_metrics_list.items():
        metric_value = np.mean(metric_value_list)
        print(f"Train {metric_name}: {metric_value}\n")
        writer.add_scalar(f"{metric_name} / train", metric_value, epoch)

In [90]:
def evaluate_epoch_trans(
    model: torch.nn.Module,
    dataloader: torch.utils.data.DataLoader,
    criterion: torch.nn.Module,
    writer: SummaryWriter,
    device: torch.device,
    epoch: int,
) -> None:
    """
    One evaluation cycle (loop).
    """

    model.eval()

    epoch_loss = []
    batch_metrics_list = defaultdict(list)

    with torch.no_grad():

        for i, (tokens, labels) in tqdm(
            enumerate(dataloader),
            total=len(dataloader),
            desc="loop over test batches",
        ):

            tokens, labels = tokens.to(device), labels.to(device)

            # Loss calculation
            outputs = model.forward(**tokens)["logits"].transpose(1, 2)
            loss = criterion(outputs, labels)

            epoch_loss.append(loss.item())
            writer.add_scalar(
                "batch loss / test", loss.item(), epoch * len(dataloader) + i
            )

            batch_metrics = compute_metrics(
                outputs=outputs,
                labels=labels,
            )

            for metric_name, metric_value in batch_metrics.items():
                batch_metrics_list[metric_name].append(metric_value)
                writer.add_scalar(
                    f"batch {metric_name} / test",
                    metric_value,
                    epoch * len(dataloader) + i,
                )

        avg_loss = np.mean(epoch_loss)
        print(f"Test loss:  {avg_loss}\n")
        writer.add_scalar("loss / test", avg_loss, epoch)

        for metric_name, metric_value_list in batch_metrics_list.items():
            metric_value = np.mean(metric_value_list)
            print(f"Test {metric_name}: {metric_value}\n")
            writer.add_scalar(f"{metric_name} / test", metric_value, epoch)

In [91]:
def train_trans(
    n_epochs: int,
    model: torch.nn.Module,
    train_dataloader: torch.utils.data.DataLoader,
    test_dataloader: torch.utils.data.DataLoader,
    optimizer: torch.optim.Optimizer,
    criterion: torch.nn.Module,
    writer: SummaryWriter,
    device: torch.device,
    eval_period: int = 1,
) -> None:
    """
    Training loop.
    """

    for epoch in range(n_epochs):

        print(f"Epoch [{epoch+1} / {n_epochs}]\n")

        train_epoch_trans(
            model=model,
            dataloader=train_dataloader,
            optimizer=optimizer,
            criterion=criterion,
            writer=writer,
            device=device,
            epoch=epoch,
        )
        if (
            eval_period == 1
            or (epoch != 0 and epoch % eval_period == 0)
            or epoch == n_epochs - 1
        ):
            evaluate_epoch_trans(
                model=model,
                dataloader=test_dataloader,
                criterion=criterion,
                writer=writer,
                device=device,
                epoch=epoch,
            )

In [92]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 

collator = TransformersCollator(
    tokenizer=tokenizer,
    tokenizer_kwargs=tokenizer_kwargs,
    label_padding_value=-1,
)
train_dataloader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=50,
    shuffle=True,
    collate_fn=collator,
)
test_dataloader = torch.utils.data.DataLoader(
    test_dataset,
    batch_size=1,  # for correct metrics measurements leave batch_size=1
    shuffle=False, # for correct metrics measurements leave shuffle=False
    collate_fn=collator,
)

model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=len(label2idx),
).to(device)

writer = SummaryWriter(log_dir="logs/Transformer")
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
criterion = torch.nn.CrossEntropyLoss(ignore_index=-1)
train_trans(10, model, train_dataloader, test_dataloader, optimizer, criterion, writer, device)

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch [1 / 10]



loop over train batches: 100%|██████████| 300/300 [01:29<00:00,  3.33it/s]


Train loss: 0.37359702207148077

Train accuracy: 0.903700886377743

Train precision_micro: 0.903700886377743

Train precision_macro: 0.49229754003133425

Train precision_weighted: 0.8710750556367649

Train recall_micro: 0.903700886377743

Train recall_macro: 0.4675347001625073

Train recall_weighted: 0.903700886377743

Train f1_micro: 0.903700886377743

Train f1_macro: 0.4601097466550333

Train f1_weighted: 0.8804380840125106



loop over test batches: 100%|██████████| 3683/3683 [01:03<00:00, 58.13it/s]


Test loss:  0.15741679780386256

Test accuracy: 0.9527539045226862

Test precision_micro: 0.9527539045226862

Test precision_macro: 0.8779304265277891

Test precision_weighted: 0.9569568273846323

Test recall_micro: 0.9527539045226862

Test recall_macro: 0.8790412129293054

Test recall_weighted: 0.9527539045226862

Test f1_micro: 0.9527539045226862

Test f1_macro: 0.8753078570499775

Test f1_weighted: 0.9529108833523321

Epoch [2 / 10]



loop over train batches: 100%|██████████| 300/300 [01:28<00:00,  3.38it/s]


Train loss: 0.08716369304185112

Train accuracy: 0.9774116496586484

Train precision_micro: 0.9774116496586484

Train precision_macro: 0.8880414006928977

Train precision_weighted: 0.9781209113708373

Train recall_micro: 0.9774116496586484

Train recall_macro: 0.868043946550824

Train recall_weighted: 0.9774116496586484

Train f1_micro: 0.9774116496586484

Train f1_macro: 0.8674387171196652

Train f1_weighted: 0.9767670588294052



loop over test batches: 100%|██████████| 3683/3683 [01:04<00:00, 57.44it/s]


Test loss:  0.14544220132082167

Test accuracy: 0.9586749109796373

Test precision_micro: 0.9586749109796373

Test precision_macro: 0.8911024219821508

Test precision_weighted: 0.9615908381632658

Test recall_micro: 0.9586749109796373

Test recall_macro: 0.8915812799747395

Test recall_weighted: 0.9586749109796373

Test f1_micro: 0.9586749109796373

Test f1_macro: 0.888770768935603

Test f1_weighted: 0.9584442864966641

Epoch [3 / 10]



loop over train batches: 100%|██████████| 300/300 [01:29<00:00,  3.34it/s]


Train loss: 0.056330162640661

Train accuracy: 0.9865896076597778

Train precision_micro: 0.9865896076597778

Train precision_macro: 0.9275980048158885

Train precision_weighted: 0.9874692320901869

Train recall_micro: 0.9865896076597778

Train recall_macro: 0.9228968049145031

Train recall_weighted: 0.9865896076597778

Train f1_micro: 0.9865896076597778

Train f1_macro: 0.9186835974592589

Train f1_weighted: 0.9864937654547684



loop over test batches: 100%|██████████| 3683/3683 [01:03<00:00, 58.26it/s]


Test loss:  0.15952471500311802

Test accuracy: 0.9599912933218697

Test precision_micro: 0.9599912933218697

Test precision_macro: 0.8998151926811003

Test precision_weighted: 0.9679225093488029

Test recall_micro: 0.9599912933218697

Test recall_macro: 0.8990151466221148

Test recall_weighted: 0.9599912933218697

Test f1_micro: 0.9599912933218697

Test f1_macro: 0.8969413561112791

Test f1_weighted: 0.9623129800596857

Epoch [4 / 10]



loop over train batches: 100%|██████████| 300/300 [01:29<00:00,  3.35it/s]


Train loss: 0.0394978830528756

Train accuracy: 0.9916320533769825

Train precision_micro: 0.9916320533769825

Train precision_macro: 0.953067619701626

Train precision_weighted: 0.9922309136039018

Train recall_micro: 0.9916320533769825

Train recall_macro: 0.9552561196874042

Train recall_weighted: 0.9916320533769825

Train f1_micro: 0.9916320533769825

Train f1_macro: 0.9502143154411484

Train f1_weighted: 0.9916256607716097



loop over test batches: 100%|██████████| 3683/3683 [01:04<00:00, 57.50it/s]


Test loss:  0.15110479844481675

Test accuracy: 0.964965521925726

Test precision_micro: 0.964965521925726

Test precision_macro: 0.909744548418788

Test precision_weighted: 0.9677796329424537

Test recall_micro: 0.964965521925726

Test recall_macro: 0.9089062530655475

Test recall_weighted: 0.964965521925726

Test f1_micro: 0.964965521925726

Test f1_macro: 0.9070850343286011

Test f1_weighted: 0.9648865187348923

Epoch [5 / 10]



loop over train batches: 100%|██████████| 300/300 [01:29<00:00,  3.37it/s]


Train loss: 0.028272132199878494

Train accuracy: 0.9948073728448734

Train precision_micro: 0.9948073728448734

Train precision_macro: 0.970895138320108

Train precision_weighted: 0.9952396291559296

Train recall_micro: 0.9948073728448734

Train recall_macro: 0.9727026010588758

Train recall_weighted: 0.9948073728448734

Train f1_micro: 0.9948073728448734

Train f1_macro: 0.9693906877593367

Train f1_weighted: 0.9948329726206996



loop over test batches: 100%|██████████| 3683/3683 [01:04<00:00, 57.37it/s]


Test loss:  0.1640733233094481

Test accuracy: 0.9644307144179602

Test precision_micro: 0.9644307144179602

Test precision_macro: 0.9103068762269049

Test precision_weighted: 0.967897868086859

Test recall_micro: 0.9644307144179602

Test recall_macro: 0.9097354522965331

Test recall_weighted: 0.9644307144179602

Test f1_micro: 0.9644307144179602

Test f1_macro: 0.9078356600660601

Test f1_weighted: 0.9646799700393738

Epoch [6 / 10]



loop over train batches:  31%|███       | 92/300 [00:27<01:01,  3.36it/s]


KeyboardInterrupt: 

## Report

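Model quality increased significantly compared with the non-transformer model: after five epochs the test f1_macro is about 0.908 and the test accuracy about 0.964. Quality also does not appear to be sensitive to the choice of hyperparameters. As with the previous model, I changed nothing except the batch size of the train dataloader, which I increased to speed up the computations. Smaller batches might give a slightly better final quality, but larger batches make training much faster. Note that the train loss keeps decreasing from epoch to epoch (from about 0.087 at epoch 2 to about 0.028 at epoch 5), while the test loss stays in the range of roughly 0.15 to 0.16, so further epochs would probably add overfitting rather than test quality.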
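When reading the logs above, it helps to remember that for single-label token classification the micro-averaged precision, recall and F1 all coincide with accuracy, whereas the macro average weights every tag equally and is therefore pulled down by rare entity tags. The following is a minimal pure-Python sketch of this effect; the function `f1_scores` and the toy tag sequences are hypothetical and are not the metric code used in this notebook.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (accuracy, micro F1, macro F1) for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for true_tag, pred_tag in zip(y_true, y_pred):
        if true_tag == pred_tag:
            tp[true_tag] += 1
        else:
            fp[pred_tag] += 1  # wrong tag was predicted
            fn[true_tag] += 1  # correct tag was missed

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    accuracy = sum(tp.values()) / len(y_true)
    # Micro: pool the counts over all tags, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per tag, then take the unweighted mean.
    macro = sum(f1(tp[tag], fp[tag], fn[tag]) for tag in labels) / len(labels)
    return accuracy, micro, macro


# Toy example (made up): the frequent "O" tag dominates, one rare tag is missed.
y_true = ["O"] * 8 + ["B-PER", "B-ORG"]
y_pred = ["O"] * 8 + ["O", "B-ORG"]
accuracy, micro_f1, macro_f1 = f1_scores(y_true, y_pred)
print(accuracy, micro_f1, macro_f1)
```

In this toy case a single missed `B-PER` token barely changes accuracy and micro F1 but lowers macro F1 noticeably, which is why the macro metrics are the more informative ones for comparing taggers.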

## Part 4 - Bonus. BiLSTMAttention-tagger (2 points)

You need to carry out the same experiments as in part 2, but using the improved BiLSTM tagger architecture with the Attention mechanism.

**Please note** that you do not need to implement Attention yourself; you can use `torch.nn.MultiheadAttention`.

Also draw conclusions about model quality, overfitting, and the sensitivity of the architecture to the choice of hyperparameters, and briefly compare the results with the previous architecture. Present the results of your experiments in the form of a mini-report (in the same IPython notebook).
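One possible way to combine a BiLSTM with `torch.nn.MultiheadAttention` is sketched below. It is only an illustration under stated assumptions, not the reference solution: the class name `BiLSTMAttnSketch`, all layer sizes, the residual connection with layer normalization, and the padding index `0` are hypothetical choices. Self-attention is applied over the BiLSTM outputs, and padded positions are masked as attention keys.

```python
import torch
import torch.nn as nn


class BiLSTMAttnSketch(nn.Module):
    """Hypothetical BiLSTM tagger with self-attention over the LSTM outputs."""

    def __init__(self, vocab_size, num_tags, embedding_dim=100, hidden_size=128,
                 num_heads=4, dropout=0.1, padding_idx=0):
        super().__init__()
        self.padding_idx = padding_idx
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=padding_idx)
        self.lstm = nn.LSTM(embedding_dim, hidden_size,
                            batch_first=True, bidirectional=True)
        # The attention width is 2 * hidden_size (both LSTM directions)
        # and must be divisible by num_heads.
        self.attention = nn.MultiheadAttention(embed_dim=2 * hidden_size,
                                               num_heads=num_heads,
                                               dropout=dropout,
                                               batch_first=True)
        self.norm = nn.LayerNorm(2 * hidden_size)
        self.classifier = nn.Linear(2 * hidden_size, num_tags)

    def forward(self, token_ids):
        # True marks padded positions that must not be attended to.
        # Assumption: every sequence contains at least one real token,
        # otherwise a fully masked row would produce NaN values.
        pad_mask = token_ids == self.padding_idx            # (batch, seq_len)
        embedded = self.embedding(token_ids)                # (batch, seq_len, emb)
        # Simplification: the LSTM runs over padded positions as well
        # (no packed sequences).
        lstm_out, _ = self.lstm(embedded)                   # (batch, seq_len, 2*hidden)
        attn_out, _ = self.attention(lstm_out, lstm_out, lstm_out,
                                     key_padding_mask=pad_mask,
                                     need_weights=False)
        hidden = self.norm(lstm_out + attn_out)             # residual + layer norm
        return self.classifier(hidden)                      # (batch, seq_len, num_tags)


# Toy usage with made-up token ids; 9 is the number of BIO tags for four entity types.
model = BiLSTMAttnSketch(vocab_size=50, num_tags=9)
tokens = torch.tensor([[5, 7, 9, 0], [3, 4, 0, 0]])
logits = model(tokens)
print(logits.shape)
```

The logits at padded positions are still computed, so the loss should ignore them, for example through the `ignore_index` argument of `torch.nn.CrossEntropyLoss` if padded targets are filled with a dedicated value.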

**Exercise. Implement the model class BiLSTMAttn.** **<font color='red'>(1 point)</font>**

In [None]:
# YOUR CODE HERE

**Exercise. Conduct experiments and beat the metric value from part 2.** **<font color='red'>(1 point)</font>**

P.S. If quality didn't increase, this needs to be justified.

In [None]:
# YOUR CODE HERE