# Nested Named Entities

In this notebook we explore one way to solve Nested Named Entity Recognition (NER) using BERT-like models and **neural networks**. We will go through the whole process, from data preparation to transforming raw predictions into a human-readable format.

NOTE: This solution should not be considered a working one: my macro F1-score is 0.28% (with a mean IOU of 38% on the eval set). Nevertheless, I am happy with the results and would love to explain the basics of NER models and how to make them nested.

In [1]:
from datasets import load_dataset

# this dataset is not required, it was used in the original competition
dataset = load_dataset("iluvvatar/RuNNE", trust_remote_code=True)

In [2]:
import json
import pandas as pd

# required to translate data to 'iluvvatar/RuNNE' format
texts = []
entities = []

# load train.jsonl
with open('../../data/train.jsonl', 'r', encoding='UTF-8') as file:
    for line in file.readlines():
        cur_json = json.loads(line)
        texts.append(cur_json['sentences'])
        entities.append([f"{s} {e} {t}" for s, e, t in cur_json['ners']])

# save to dataframe (you can add dataset.train if you want)
train = pd.DataFrame()
train['text'] = texts  # + dataset['train'].to_pandas()['text'].to_list()
train['entities'] = entities # + dataset['train'].to_pandas()['entities'].to_list()

In [3]:
# train = dataset['train'].to_pandas().set_index('id')
eval_df = dataset['test'].to_pandas().set_index('id')
train

| | text | entities |
|---|---|---|
| 0 | Бостон взорвали Тамерлан и Джохар Царнаевы из ... | [0 5 CITY, 16 23 PERSON, 34 41 PERSON, 46 62 L... |
| 1 | Умер избитый до комы гитарист и сооснователь г... | [21 28 PROFESSION, 53 67 ORGANIZATION, 100 148... |
| 2 | Путин подписал распоряжение о выходе России из... | [0 4 PERSON, 37 42 COUNTRY, 47 76 ORGANIZATION... |
| 3 | Бенедикт XVI носил кардиостимулятор\nПапа Римс... | [0 11 PERSON, 36 47 PROFESSION, 49 60 PERSON, ... |
| 4 | Обама назначит в Верховный суд латиноамериканк... | [0 4 PERSON, 17 29 ORGANIZATION, 48 56 PROFESS... |
| ... | ... | ... |
| 514 | Глава Малайзии: мы не хотим противостоять Кита... | [42 46 COUNTRY, 82 87 COUNTRY, 104 123 LOCATIO... |
| 515 | «Союз» впервые пристыковался к МКС за 6 часов\... | [1 4 PRODUCT, 31 33 FACILITY, 35 44 TIME, 48 6... |
| 516 | Трамп и Путин сделали совместное заявление к 7... | [0 4 PERSON, 8 12 PERSON, 45 52 AGE, 72 80 PRO... |
| 517 | Российский магнат устроил самую дорогую свадьб... | [0 9 NATIONALITY, 58 72 PERSON, 101 115 PERSON... |


### Data Preparation

First of all, let's discuss data preparation. The targets are character slices of the text, each representing an entity. When we tokenize the text, ideally the tokenization should not change these boundaries. That is hard to guarantee, especially with existing tokenizers.

To overcome this problem, I assume that target slices **do not split words**, that is, every entity boundary falls on a word boundary. With this assumption we can use the `DeepPavlov/rubert-base-cased` tokenizer (and many others) if we do word tokenization first. The tokenizer will still split some words into two or more subword tokens, so for consistency every subword token is assigned the same entities as its original word.

This solution is not the best, and other models could be considered (such as character-level tokenizers). Ideally, however, it should not influence the predictions much.
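
To make the assumption concrete, here is a minimal, self-contained sketch (not the dataset class used in this notebook). It assigns character spans to whitespace-separated words and then copies each word's labels to its subword tokens. The helper names are hypothetical, the spans are assumed to be end-exclusive, and the `word_ids` list is written by hand to imitate what a fast tokenizer's `word_ids()` returns.

```python
def words_with_offsets(text):
    """Split text on whitespace and return (word, start, end) triples."""
    words, start = [], None
    for i, ch in enumerate(text):
        if ch.isspace():
            if start is not None:
                words.append((text[start:i], start, i))
                start = None
        elif start is None:
            start = i
    if start is not None:
        words.append((text[start:], start, len(text)))
    return words


def labels_per_word(text, spans):
    """Give every word the labels of all spans that overlap it (end is exclusive)."""
    words = words_with_offsets(text)
    word_labels = [[] for _ in words]
    for span_start, span_end, label in spans:
        for k, (_, word_start, word_end) in enumerate(words):
            if word_start < span_end and span_start < word_end:
                word_labels[k].append(label)
    return [word for word, _, _ in words], word_labels


def propagate_to_subwords(word_labels, word_ids):
    """Copy word labels to subword tokens; special tokens (None) get no labels."""
    return [[] if wid is None else word_labels[wid] for wid in word_ids]


text = "mayor of Boston spoke"
# nested entities: CITY lies inside PROFESSION
spans = [(0, 15, "PROFESSION"), (9, 15, "CITY")]
words, word_labels = labels_per_word(text, spans)

# hand-written stand-in for tokenizer output: "Boston" is split into two subwords
word_ids = [None, 0, 1, 2, 2, 3, None]
token_labels = propagate_to_subwords(word_labels, word_ids)
```

Both subwords of "Boston" end up with the labels of the whole word, which is exactly the consistency rule described above.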

In [4]:
from transformers import AutoTokenizer, AutoModel
import torch

# determine model & device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
bert_checkpoint = 'DeepPavlov/rubert-base-cased'

# load all components
tokenizer = AutoTokenizer.from_pretrained(bert_checkpoint)
bert_model = AutoModel.from_pretrained(bert_checkpoint).to(device)

Some weights of the model checkpoint at DeepPavlov/rubert-base-cased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.decoder.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [5]:
import pandas as pd
from tqdm import tqdm
from torch.utils.data import Dataset
from transformers import AutoTokenizer


class NestedNerDataset(Dataset):
    """
    PyTorch Dataset for Nested NER
    """

    def __init__(
            self,
            raw_df: pd.DataFrame,
            tokenizer: AutoTokenizer,
            train: bool = True,
            label2idx: dict = None

    ):
        """
        Initialize dataset via dataframe
        
        :param raw_df:     pandas dataframe with `text` column
        :param tokenizer:  tokenizer to use
        :param train:      whether dataset contains `entities` column
        :param label2idx:  mapping from label to index
        """
        
        # save data descriptors
        self.tokenizer = tokenizer  # how to tokenize data
        self.train = train  # whether it is train data
        self.label2id = dict() if label2idx is None else label2idx  # dict is mutable

        # align tokenized text & entities
        self.tokenized_texts = []
        self.aligned_entities = []

        for idx in tqdm(range(raw_df['text'].shape[0])):
            # get raw text & entities
            ents = raw_df.entities.iloc[idx] if self.train else []
            sentence = raw_df.text.iloc[idx]

            # word tokenization
            new_sent, new_ents = self.get_word_tokens_ents(sentence, ents)

            # usual tokenization
            inputs = self.tokenizer(
                new_sent,
                return_tensors="pt",
                padding=True,
                truncation=True,
                max_length=512,
                is_split_into_words=True
            )

            # align labels with tokens
            aligned_ents = []
            if self.train:
                aligned_ents = self.align_labels_with_tokens(new_ents, inputs.word_ids())

            # save preprocessed data
            self.tokenized_texts.append(inputs)
            self.aligned_entities.append(aligned_ents)

        # compute length of dataset
        self.length = len(self.aligned_entities)

    def get_word_tokens_ents(self, sentence: str, ents: list[str]) -> tuple[list[str], list[list[str]]]:
        """
        Word tokenization for sentence with alignment of entities
        
        :param sentence:  sentence to split to words
        :param ents:      raw entities in given sentence
        :return:          (word tokenized sentence, aligned entities)
        """
        
        # ent index to post-tokenized index
        idx_to_post_process_idx = dict()

        # new word tokens (basically split by ' \t\n')
        word_tokens = ['']
        len_tokens = 0
        for idx, symb in enumerate(sentence):
            if symb in {' ', '\t', '\n'}:

                # if new word is not already created
                if word_tokens and word_tokens[-1] != '':
                    word_tokens.append('')
                    len_tokens += 1
                continue

            # add symbol to last word
            word_tokens[-1] += symb
            # map symbol index to new token index
            idx_to_post_process_idx[idx] = len_tokens

        # remove empty symbol from the back (if needed)
        if word_tokens[-1] == '':
            word_tokens.pop()

        # new entities
        new_ents = [[] for _ in range(len(word_tokens))]

        for ent in ents:
            # parse raw ents
            start_idx, end_idx, ent_name = ent.split()
            start_idx, end_idx = int(start_idx), int(end_idx)
            ent_name = self.label2id.get(ent_name, ent_name)

            # control number of adds
            placed_idxs = dict()

            for idx in range(start_idx, end_idx):
                # transform symb index to token index
                new_idx = idx_to_post_process_idx.get(idx)

                # if present & not already placed, place it
                if (new_idx is not None) and (not placed_idxs.get(new_idx, False)):
                    new_ents[new_idx].append(ent_name)
                    placed_idxs[new_idx] = True

        return word_tokens, new_ents

    @staticmethod
    def align_labels_with_tokens(labels: list[list[str]], word_ids: list[int]) -> list[list[str]]:
        """
        Align entities labels with post-tokenized words
        
        :param labels:    entities from word tokenization
        :param word_ids:  word id's from `self.tokenizer`
        :return:          entities aligned to `word_ids` 
        """
        
        # aligned labels
        new_labels = []
        current_word = None

        for word_id in word_ids:
            if word_id != current_word:
                # start of a new word => add new label
                current_word = word_id
                label = [] if word_id is None else labels[word_id]
                new_labels.append(label)
            elif word_id is None:
                # Special token => add empty label
                new_labels.append([])
            else:
                # Same word as previous token => copy label
                label = labels[word_id]
                new_labels.append(label)

        return new_labels

    def __len__(self) -> int:
        return self.length

    def __getitem__(self, idx: int):
        return self.tokenized_texts[idx], self.aligned_entities[idx]

In [6]:
def extract_ent_labels(raw_df: pd.DataFrame) -> set[str]:
    """
    Extract entities names from dataframe
    
    :param raw_df:  dataframe with `entities` column
    :return:        set of entities names
    """
    
    ent_names = set()
    for ents in raw_df['entities']:
        for ent in ents:
            ent_names.add(ent.split()[-1])
    return ent_names


# entities encoding/decoding
label2idx = dict()
idx2label = dict()

# for each entity make an index
for idx, label in enumerate(sorted(extract_ent_labels(train))):
    label2idx[label] = idx
    idx2label[idx] = label


# define train / eval dataloaders
train_dataloader = NestedNerDataset(train.copy(), tokenizer=tokenizer, label2idx=label2idx)
eval_dataloader = NestedNerDataset(eval_df.copy(), tokenizer=tokenizer, label2idx=label2idx)

100%|███████████████████████████████████████████████████████████████████████████████| 519/519 [00:02<00:00, 239.39it/s]
100%|█████████████████████████████████████████████████████████████████████████████████| 93/93 [00:00<00:00, 166.36it/s]


In [7]:
sample = train_dataloader[0]
sample[0]['input_ids'].shape

torch.Size([1, 299])

### Model architecture

Now when we have data preprocessed let's discuss the architecture.

Using BERT is the natural choice (we have already preprocessed the data for it), but how? The idea I want to implement is to get token embeddings from a BERT-like model and then train classifiers on top of them for token classification. As we have 29 entity classes and want "nested" entities, where one token can belong to several classes at once, we will train 29 binary classifiers (one for each entity type).
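
The reason for independent binary classifiers is that a single softmax over classes would force exactly one label per token. A small sketch of the resulting multi-hot targets is given below; the three-label mapping and the `multi_hot` helper are illustrative assumptions, not part of the notebook.

```python
def multi_hot(token_labels, n_classes):
    """Turn a list of class-index lists into one 0/1 vector per token."""
    return [
        [1.0 if class_idx in labels else 0.0 for class_idx in range(n_classes)]
        for labels in token_labels
    ]


# toy mapping with 3 classes instead of the 29 used in the notebook
label2idx = {"CITY": 0, "PERSON": 1, "PROFESSION": 2}

# one list of labels per token; the second token carries two nested labels
token_labels = [["PROFESSION"], ["PROFESSION", "CITY"], []]
encoded = [[label2idx[name] for name in labels] for labels in token_labels]

targets = multi_hot(encoded, n_classes=len(label2idx))
```

Each column of `targets` is the training signal for one binary classifier, so a token with nested entities simply has several ones in its row.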

This architecture is similar to the one developed by [SinaLab](https://github.com/SinaLab/ArabicNER) for Arabic.

The main difference is that they had several classes and each class had different types (for example, the entity "Human" could be "Name", "Surname", "Age", etc.). So my solution is a simplification of their architecture. However, it could easily be extended if required.

They also trained their own BERT model, whereas I am interested in how a raw BERT (not even fine-tuned) performs in such a task. So in the implementation below I freeze all BERT layers. This cuts training time from 15 minutes per epoch to just 30 seconds.

Now let's discuss the classifiers a bit. SinaLab used a two-layer NN (768->hid->1), but I found that more layers with dropout work better. So you can adjust the number of layers (at least 2) and the dropout rate.
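
To see what the `n_layers` setting means in practice, the sketch below lists the linear-layer shapes of one such classifier and counts its parameters. Both helper functions are hypothetical and only mirror the layout described above (input layer, `n_layers - 2` hidden layers, output layer).

```python
def classifier_layer_dims(in_dim, hidden_dim, n_layers):
    """Return the (in_features, out_features) pair of every linear layer."""
    if n_layers < 2:
        raise ValueError("n_layers must be at least 2")
    dims = [(in_dim, hidden_dim)]                        # embedding -> hidden
    dims += [(hidden_dim, hidden_dim)] * (n_layers - 2)  # hidden -> hidden
    dims.append((hidden_dim, 1))                         # hidden -> single logit
    return dims


def count_parameters(dims):
    """Weights plus biases of all linear layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in dims)


dims = classifier_layer_dims(in_dim=768, hidden_dim=256, n_layers=4)
params_per_classifier = count_parameters(dims)
params_all_classifiers = 29 * params_per_classifier
```

With `n_layers=2` the layout reduces to the two-layer `768->hid->1` network mentioned above.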

In [8]:
import torch
import torch.nn as nn
from transformers import BertModel


class Classifier(nn.Module):
    """
    Binary classifier for word embeddings
    """

    def __init__(
            self,
            in_dim: int,
            hidden_dim: int,
            n_layers: int = 2,
            dropout: float = 0.1,
            *args,
            **kwargs
    ):
        """
        Define classifier by key-parameters

        :param in_dim:      embedding dimension
        :param hidden_dim:  dimension of hidden layers
        :param n_layers:    number of linear layers (at least 2)
        :param dropout:     dropout probability
        :param args:        additional arguments
        :param kwargs:      additional keyword arguments
        """

        super().__init__(*args, **kwargs)

        # define linear & dropout layers
        self.n_layers = n_layers
        linear_layers = [nn.Linear(in_dim, hidden_dim)]  # from in_dim
        dropout_layers = [nn.Dropout(dropout)]

        for i in range(n_layers - 2):
            linear_layers.append(nn.Linear(hidden_dim, hidden_dim))
            dropout_layers.append(nn.Dropout(dropout))

        linear_layers.append(nn.Linear(hidden_dim, 1))  # to prediction

        # save layers in Sequential containers so that their parameters are registered
        self.linear_layers = torch.nn.Sequential(*linear_layers)
        self.dropouts = torch.nn.Sequential(*dropout_layers)

        # activation function & output
        self.relu = nn.ReLU()
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        for i in range(len(self.linear_layers) - 1):
            # dropout -> linear -> relu
            x = self.dropouts[i](x)
            x = self.relu(self.linear_layers[i](x))

        # linear -> sigmoid
        x = self.sigmoid(self.linear_layers[-1](x))
        return x


class BertNestedTagger(nn.Module):
    """
    Bert-Based Nested Tagger
    """

    def __init__(
            self,
            bert_model: BertModel,
            device,
            dropout: float = 0.1,
            n_classes: int = 2,
            n_layers: int = 4,
            hid_dim: int = 512,
            *args,
            **kwargs
    ):
        """
        Define model by key parameters

        :param bert_model:  instance of bert-like model
        :param device:      device to run the model on
        :param dropout:     dropout probability (for classifiers)
        :param n_classes:   number of output classes (number of classifiers)
        :param n_layers:    number of hidden layers (for classifiers)
        :param hid_dim:     dimension of hidden layers (for classifiers)
        :param args:        additional arguments
        :param kwargs:      additional keyword arguments
        """
        
        super().__init__(*args, **kwargs)

        # save bert model
        self.bert_embeddings = bert_model

        # define and save all classifiers
        self.n_classes = n_classes
        classifiers = [
            Classifier(in_dim=768, hidden_dim=hid_dim, n_layers=n_layers, dropout=dropout).to(device)
            for _ in range(self.n_classes)
        ]
        self.classifiers = torch.nn.Sequential(*classifiers)

    def forward(self, x):
        # y [batch_size x max_batch_token_len x 768]
        y = self.bert_embeddings(**x)['last_hidden_state']
        output = list()

        for i, classifier in enumerate(self.classifiers):
            # logits [batch_size x max_batch_token_len x 1]
            logits = classifier(y)
            output.append(logits)

        # [batch_size x max_batch_token_len x n_classes x 1]
        output = torch.stack(output).permute((1, 2, 0, 3))
        return output


In [9]:
def train_epoch(dataloader: NestedNerDataset, model: BertNestedTagger, optimizer, criterion):
    """
    Train model for 1 epoch

    :param dataloader:  NestedNerDataset class instance
    :param model:       BertNestedTagger class instance
    :param optimizer:   pytorch optimizer
    :param criterion:   loss function
    :return:
    """
    model.train()

    total_loss = 0
    for data in tqdm(dataloader, total=len(dataloader)):
        # get labels & tokens
        inputs, labels = data
        inputs = inputs.to(device)

        optimizer.zero_grad()

        # get predictions (for each token 29 classes)
        # as batch_size is always 1, we can squeeze
        # preds [n_tokens, 29]
        preds = model(inputs).squeeze().to('cpu')

        # flatten for simpler loss calculations 
        # preds [n_tokens * 29]
        preds = preds.flatten()

        # transform targets
        transformed_labels = []
        for token_labels in labels:
            # for each class index check whether it's in the token's labels
            transformed_labels.extend([float(class_idx in token_labels) for class_idx in range(model.n_classes)])

        # transform to torch.tensor and calculate loss
        torch_labels = torch.tensor(transformed_labels).to('cpu')
        loss = criterion(preds, torch_labels)

        # update weights
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    return total_loss / len(dataloader)


@torch.no_grad()  # evaluation only, no gradients needed
def get_loss_epoch(dataloader: NestedNerDataset, model: BertNestedTagger, criterion):
    """
    Get loss for 1 epoch
    
    :param dataloader:  NestedNerDataset class instance
    :param model:       BertNestedTagger class instance
    :param criterion:   loss function
    :return:            average loss
    """
    model.eval()

    total_loss = 0
    for data in tqdm(dataloader, total=len(dataloader)):
        # get labels & tokens
        inputs, labels = data
        inputs = inputs.to(device)

        # preds [n_tokens, 29]
        preds = model(inputs).squeeze().to('cpu')
        preds = preds.flatten()

        # transform targets
        transformed_labels = []
        for token_labels in labels:
            # for each class index check whether it's in the token's labels
            transformed_labels.extend([float(class_idx in token_labels) for class_idx in range(model.n_classes)])

        # transform to torch.tensor and calculate loss
        torch_labels = torch.tensor(transformed_labels).to('cpu')
        loss = criterion(preds, torch_labels)

        total_loss += loss.item()

    return total_loss / len(dataloader)

### From Predictions -> Human-Readable

Finally, once we have predictions, it makes sense to discuss how to transform raw token predictions back into the slices & labels format.

This was one of the most confusing steps for me, as we are working not with full words but with subword tokens.

First, the data we get is just a list of lists where `ans[i]` corresponds to the `i`-th token and contains the indexes of the classes that token belongs to. With this information we can recover each token's start/end indexes in the original sentence. This is what the `map_results_to_ids` function implements. Basically, it takes a token's string representation (decoded from its `token_id` by the tokenizer) and searches for it in the original sentence, starting from the end of the previous match. This implementation is pretty nice as it does not care about word tokenization (we don't need to account for skipped spaces, tabs, new lines, etc.). The algorithm is modified a bit for `DeepPavlov/rubert-base-cased`: it strips the leading `#` characters that mark subword continuation tokens. I tested this method on the train/eval data and it produced a perfect mapping.
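
The left-to-right search can be illustrated with a simplified sketch that ignores labels and token ids. The `locate_tokens` helper and the hand-written token list are assumptions made for the example; the token list only imitates what a WordPiece tokenizer might produce.

```python
def locate_tokens(text, tokens):
    """Return (start, end) character offsets for each token found in text."""
    offsets, cursor = [], 0
    for token in tokens:
        # drop the WordPiece continuation marker, but keep a literal '#'
        piece = token.lstrip("#") if token != "#" else token
        found = text.find(piece, cursor)
        if found == -1:  # e.g. an unknown token: skip it
            continue
        start, end = found, found + len(piece)
        offsets.append((start, end))
        cursor = end  # the next search starts after this token
    return offsets


# double space and a newline: whitespace is skipped by the search itself
text = "Boston  marathon\nrunners"
tokens = ["Boston", "mara", "##thon", "run", "##ners"]
offsets = locate_tokens(text, tokens)
pieces = [text[start:end] for start, end in offsets]
```

Because every search starts at the end of the previous match, repeated words are matched in order and no explicit bookkeeping of whitespace is needed.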

The second step is to combine contiguous entities. As input we have (start, end, entities_list) tuples and we want to combine tokens that we consider contiguous (e.g. `end[i] == start[i + 1]`). For this we can use the scanline algorithm implemented in the `from_sequential_to_readable` function. It is worth mentioning that we can define an `error` for the combination: tokens at a distance of at most `error` characters are treated as one span. This makes the result tunable and could boost performance if `error` is chosen well.
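
The effect of `error` is easiest to see on a toy input. The `merge_spans` function below is a simplified sketch of the same scanline idea, written for this example only: it flushes open spans at the end instead of using sentinel points, and it expects one (start, end, label) triple per token and label, sorted by position.

```python
def merge_spans(spans, error=0):
    """Merge same-label spans whose gap is at most `error` characters."""
    merged = []
    open_spans = {}  # label -> [start, end] of the span currently being built
    for start, end, label in spans:
        current = open_spans.get(label)
        if current is not None and current[1] + error >= start:
            current[1] = max(current[1], end)  # close enough: extend the span
        else:
            if current is not None:
                merged.append((current[0], current[1], label))
            open_spans[label] = [start, end]
    # flush the spans that are still open
    for label, (start, end) in open_spans.items():
        merged.append((start, end, label))
    return sorted(merged)


spans = [
    (0, 5, "PROFESSION"), (6, 8, "PROFESSION"),
    (9, 12, "PROFESSION"), (9, 12, "CITY"),
    (12, 15, "PROFESSION"), (12, 15, "CITY"),
    (30, 35, "CITY"),
]
strict = merge_spans(spans, error=0)   # only touching tokens are merged
relaxed = merge_spans(spans, error=1)  # a single space no longer splits a span
```

With `error=0` the words separated by spaces stay separate spans, while `error=1` joins them into one `PROFESSION` span covering characters 0 to 15.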

In [10]:
def inference(texts: list[str], model: BertNestedTagger, idx2label: dict[int, str], print_res: bool = False):
    """
    Get model predictions for given texts

    :param texts:      texts for inference
    :param model:      BertNestedTagger class instance
    :param idx2label:  mapping from index to entity label
    :param print_res:  whether to print predictions to console
    :return:           model predictions
    """

    # transform list data to NestedNerDataset
    inf_df = pd.DataFrame()
    inf_df['text'] = texts
    inference_loader = NestedNerDataset(
        inf_df,
        tokenizer=tokenizer,
        label2idx=label2idx,
        train=False)

    model.eval()

    # get inference results
    results = []
    for data in tqdm(inference_loader, total=len(inference_loader)):
        results.append([])

        inputs, _ = data
        # BertNestedTagger does not define `.device`, so use the global `device`
        inputs = inputs.to(device)

        # preds [n_tokens, 29]
        preds = model(inputs).squeeze().to('cpu')

        # for each token find out what indexes have > 0.5 threshold & save them
        for ind in range(preds.shape[0]):
            # determine indexes of positive classes
            ners = (preds[ind] > 0.5).nonzero(as_tuple=True)[0].tolist()

            # transform indexes to entities labels
            ner_tags = [idx2label[ner_idx] for ner_idx in ners]

            # get token string representation
            # BertNestedTagger does not store a tokenizer, so use the global `tokenizer`
            token_str = tokenizer.decode(inputs['input_ids'][0][ind])

            if print_res:
                print(token_str, '\t:\t', ' + '.join(ner_tags))

            # save token_id, token_str and ner labels
            results[-1].append((
                inputs['input_ids'][0][ind].item(),
                token_str,
                ner_tags
            ))

        if print_res:
            print('=' * 50)
    return results


def map_results_to_ids(origs: list[str], results: list[list[tuple[int, str, list[str]]]]) -> list[list[tuple[int, int, list[str]]]]:
    """
    Get tokens start_idx, end_idx from the original text

    :param origs:    original pre-tokenized texts
    :param results:  raw predictions of model
    :return:         tuples with indexes of tokens in original text
    """

    answers = []
    for orig, preds in zip(origs, results):
        orig_ind = 0
        ans = []

        # skip the first and last special tokens ([CLS] and [SEP])
        for token_id, str_token, labels in preds[1:-1]:
            if str_token != '#':  # DeepPavlov/rubert-base-cased specifics
                str_token = str_token.lstrip('#')
            
            # find start index of str_token in original text
            find_idx = orig[orig_ind:].find(str_token)
            
            # (if token was <UNK>)
            if find_idx == -1:
                continue
            
            # update indexes 
            start_ind = orig_ind + find_idx
            end_ind = start_ind + len(str_token)
            orig_ind = end_ind
            
            # always true, but just in case something is wrong
            if str_token == orig[start_ind:end_ind]:  
                ans.append((start_ind, end_ind, labels))
            
        answers.append(ans)
    return answers


def from_sequential_to_readable(
        sequential_results: list[list[tuple[int, int, list[str]]]],
        error: int = 0
) -> list[tuple[list, ...]]:
    """
    Combine contiguous tokens with the same label into one span, allowing a gap of up to `error` characters

    :param sequential_results:  outputs of `map_results_to_ids`
    :param error:               maximum gap (in characters) between tokens that are still merged
    :return:                    for each text, a tuple of [start_idx, end_idx, label] lists
    """

    answers = []
    for sequence in sequential_results:
        points = []
        possible_labels = set()
        
        # split several labels to several data points
        # example (0, 10, [PERSON, DATE] -> [(0, 10, PERSON), (0, 10, DATE)])
        for start_idx, end_idx, labels in sequence:
            for label in labels:
                points.append((start_idx, end_idx, label))
                possible_labels.add(label)
        
        # add a sentinel point per label so the last open span of each label gets flushed
        for label in possible_labels:
            points.append((int(1e9), int(1e9), label))
        
        ans = []
        prev_encounters = dict()
        # points are already sorted by position, so now we can run the scanline
        for start_idx, end_idx, label in points:
            # if too far from previous token, make new one
            if prev_encounters.get(label, [-int(1e9), -int(1e9)])[1] + error < start_idx:
                # if not first, add to answer
                if prev_encounters.get(label) is not None:
                    ans.append([*prev_encounters[label], label])
                
                prev_encounters[label] = [start_idx, end_idx]
            # if not too far, change end_idx
            else:  
                prev_encounters[label][1] = end_idx
        answers.append(tuple(ans))

    return answers


def get_answers(texts: list[str], model: BertNestedTagger, idx2label: dict[int, str]) -> list[tuple[list, ...]]:
    """
    Full pipeline that runs `inference` -> `map_results_to_ids` -> `from_sequential_to_readable` sequentially

    :param texts:      texts for inference
    :param model:      BertNestedTagger class instance
    :param idx2label:  mapping from index to entity label
    :return:           for each text, a tuple of [start_idx, end_idx, label] lists
    """

    raw_predictions = inference(texts, model, idx2label, print_res=False)
    mapped = map_results_to_ids(texts, raw_predictions)
    sequential = from_sequential_to_readable(mapped, error=2)
    return sequential

### Metric calculation 

Even though the model will be measured by macro F1, I wanted to see how the model performs in general (something like an accuracy of some sort). As we are working with sets (of both ranges and class types), the simplest thing is to consider the IOU of the sets. This metric is hard to interpret and does not give insight into where the model makes mistakes, but it is a nice way to track how the model performs "in general".
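
As a quick worked example of the set IOU (intersection over union), consider entities written in the `"start end label"` string format. The `set_iou` helper is a hypothetical stand-alone version, and treating two empty sets as a perfect match is an assumption made here to avoid division by zero.

```python
def set_iou(predicted, target):
    """Intersection over union of two collections of entity strings."""
    predicted, target = set(predicted), set(target)
    union = predicted | target
    if not union:
        return 1.0  # assumption: nothing to find and nothing predicted
    return len(predicted & target) / len(union)


predicted = ["0 5 PROFESSION", "9 15 CITY", "30 35 CITY"]
target = ["0 15 PROFESSION", "9 15 CITY"]

# only "9 15 CITY" matches exactly; the union holds 4 distinct entities
score = set_iou(predicted, target)
```

Note that a span with a wrong boundary counts as a complete miss, so this metric is strict about exact matches.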

In [11]:
import numpy as np


def get_iou(texts, labels, model, idx2label):
    """
    Compute the per-text IOU (in percent) between predicted and target entity sets
    """
    ious = []
    for model_prediction, target_label in zip(get_answers(texts, model, idx2label), labels):
        # transform to original format
        predicted = {f"{s} {e} {t}" for s, e, t in model_prediction}
        target = set(target_label)

        # guard against division by zero when both sets are empty
        union = predicted | target
        iou = round(100 * len(predicted & target) / len(union), 3) if union else 100.0
        ious.append(iou)
    return np.array(ious)

### Train, Evaluate and Inference

Basically, a combination of everything above to test the solution.

In [12]:
def train(train_dataloader, eval_dataloader, eval_df, model, n_epochs, learning_rate=3e-5):
    optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
    criterion = nn.BCELoss()

    for epoch in range(n_epochs):
        loss_train = train_epoch(train_dataloader, model, optimizer, criterion)
        loss_eval = get_loss_epoch(eval_dataloader, model, criterion)
        
        print("Epoch:", epoch, "loss:", loss_train, 'eval_loss:', loss_eval)
        ious = get_iou(eval_df.text, eval_df.entities, model, idx2label)
        print('Mean IOU:', ious.mean(), 'IOU std:', ious.std())


In [15]:
model = BertNestedTagger(
    bert_model=bert_model, 
    device=device, 
    n_classes=len(label2idx), 
    dropout=0.1,
    hid_dim=256,
    n_layers=4
).to(device)

model

BertNestedTagger(
  (bert_embeddings): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(119547, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, 

In [16]:
# freeze bert layers
for name, param in model.named_parameters():
    if 'bert_embeddings' in name:
        param.requires_grad = False

In [19]:
# 0.0249 : 4x256 [11, 3e-5] + dropout=0.1
# 0.0246 : 10x512 [7, 3e-5] + dropout=0.1
# 0.0257 : 4x256 [14, 3e-5]
# 0.0258 : 4x256 [6, 1e-4]
# 0.0258 : 4x256 [13, 1e-5]
# 0.0262 : 4x126 [9, 1e-5]
# 0.0265 : 5x512 [4, 1e-4]
# 0.0267 : 4x128 [3, 1e-3]
train(train_dataloader, eval_dataloader, eval_df, model, n_epochs=11)

100%|████████████████████████████████████████████████████████████████████████████████| 519/519 [00:27<00:00, 19.17it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 93/93 [00:02<00:00, 34.64it/s]


Epoch: 0 loss: 0.02223345344914895 eval_loss: 0.024261629519363243


100%|█████████████████████████████████████████████████████████████████████████████████| 93/93 [00:00<00:00, 325.23it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 93/93 [00:08<00:00, 10.64it/s]


Mean IOU: 37.7764623655914 IOU std: 9.091316663646205


100%|████████████████████████████████████████████████████████████████████████████████| 519/519 [00:28<00:00, 18.42it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 93/93 [00:02<00:00, 33.90it/s]


Epoch: 1 loss: 0.021580167082434445 eval_loss: 0.024081445449302272


100%|█████████████████████████████████████████████████████████████████████████████████| 93/93 [00:00<00:00, 325.17it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 93/93 [00:08<00:00, 11.01it/s]

Mean IOU: 37.653021505376344 IOU std: 9.285391946005046





In [28]:
import json

# make predictions on test data
input_jsons = []
with open('../../data/test.jsonl', 'r', encoding='UTF-8') as file:
    for line in file.readlines():
        input_jsons.append(json.loads(line))

with open('test.jsonl', 'w') as file:
    for test_json in input_jsons:
        ans_json = {
            'id': test_json['id'],
            'ners': get_answers([test_json['senences']], model, idx2label)[0]
        }
        
        file.write(json.dumps(ans_json) + '\n')
