# Nested Named Entities

In this notebook we explore one way to solve nested named entity recognition (nested NER) using BERT-like models and **CatBoost**. We will go through the whole process: data preparation, model definition, and the transformation of raw predictions into a human-readable format.

NOTE: This should not be considered a working solution: inference takes too much time even for a single sentence. Nevertheless, I am happy with my results and would love to use it to explain the basics of NER as a token classification problem.

In [1]:
from datasets import load_dataset

# this dataset is not required, it was used in the original competition
dataset = load_dataset("iluvvatar/RuNNE", trust_remote_code=True)

In [2]:
import json
import pandas as pd

# required to translate data to 'iluvvatar/RuNNE' format
texts = []
entities = []

# load train.jsonl
with open('../../data/train.jsonl', 'r', encoding='UTF-8') as file:
    for line in file.readlines():
        cur_json = json.loads(line)
        texts.append(cur_json['sentences'])
        entities.append([f"{s} {e} {t}" for s, e, t in cur_json['ners']])

# save to dataframe (you can add dataset.train if you want)
train = pd.DataFrame()
train['text'] = texts  # + dataset['train'].to_pandas()['text'].to_list()
train['entities'] = entities # + dataset['train'].to_pandas()['entities'].to_list()

In [3]:
# train = dataset['train'].to_pandas().set_index('id')
eval_df = dataset['test'].to_pandas().set_index('id')
train

Unnamed: 0,text,entities
0,Бостон взорвали Тамерлан и Джохар Царнаевы из ...,"[0 5 CITY, 16 23 PERSON, 34 41 PERSON, 46 62 L..."
1,Умер избитый до комы гитарист и сооснователь г...,"[21 28 PROFESSION, 53 67 ORGANIZATION, 100 148..."
2,Путин подписал распоряжение о выходе России из...,"[0 4 PERSON, 37 42 COUNTRY, 47 76 ORGANIZATION..."
3,Бенедикт XVI носил кардиостимулятор\nПапа Римс...,"[0 11 PERSON, 36 47 PROFESSION, 49 60 PERSON, ..."
4,Обама назначит в Верховный суд латиноамериканк...,"[0 4 PERSON, 17 29 ORGANIZATION, 48 56 PROFESS..."
...,...,...
514,Глава Малайзии: мы не хотим противостоять Кита...,"[42 46 COUNTRY, 82 87 COUNTRY, 104 123 LOCATIO..."
515,«Союз» впервые пристыковался к МКС за 6 часов\...,"[1 4 PRODUCT, 31 33 FACILITY, 35 44 TIME, 48 6..."
516,Трамп и Путин сделали совместное заявление к 7...,"[0 4 PERSON, 8 12 PERSON, 45 52 AGE, 72 80 PRO..."
517,Российский магнат устроил самую дорогую свадьб...,"[0 9 NATIONALITY, 58 72 PERSON, 101 115 PERSON..."


### Data Preparation

First of all, let's discuss data preparation. The targets are character slices of the text, each marking one entity. When we tokenize the text, we should ideally make sure that tokenization does not change these boundaries. This is hard to guarantee, especially with already existing tokenizers.

To overcome this problem, I will assume that target slices **do not include** subwords, i.e. entity boundaries always fall on word boundaries. With this assumption we can use the `DeepPavlov/rubert-base-cased` tokenizer (and many others) if we do word tokenization first. However, the tokenizer will still split some words into two or more subword tokens, so for consistency every subword token gets the same entities as its original word.

This solution is not the best, and other models could be considered (such as character-level tokenizers). Ideally, however, it should not influence the predictions that much.
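
To make the word-boundary assumption concrete, here is a minimal sketch of mapping character spans to per-word label lists. The helper `char_spans_to_word_labels` is a hypothetical name used only for this illustration, the end index is treated as inclusive (as in the `0 5 CITY` span shown above), and the `HYPOTHETICAL_GROUP` span is made up to show nesting.

```python
def char_spans_to_word_labels(sentence, spans):
    """Map (start, end_inclusive, label) character spans to per-word label lists."""
    words, char_to_word = [], {}
    current = None  # index of the word being built, or None between words
    for pos, ch in enumerate(sentence):
        if ch in " \t\n":
            current = None
            continue
        if current is None:
            words.append("")
            current = len(words) - 1
        words[current] += ch
        char_to_word[pos] = current

    labels = [[] for _ in words]
    for start, end, label in spans:
        # collect every word touched by the span exactly once
        touched = sorted({char_to_word[i] for i in range(start, end + 1) if i in char_to_word})
        for word_idx in touched:
            labels[word_idx].append(label)
    return words, labels


sentence = "Бостон взорвали Тамерлан и Джохар Царнаевы"
spans = [
    (0, 5, "CITY"),
    (16, 23, "PERSON"),
    (34, 41, "PERSON"),
    (16, 41, "HYPOTHETICAL_GROUP"),  # made-up nested span, for illustration only
]
words, labels = char_spans_to_word_labels(sentence, spans)
for word, word_labels in zip(words, labels):
    print(word, word_labels)
```

A word that lies inside several spans simply collects several labels, which is how nesting is represented at the word level.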

In [6]:
from transformers import AutoTokenizer, AutoModel
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
bert_checkpoint = 'DeepPavlov/rubert-base-cased'

tokenizer = AutoTokenizer.from_pretrained(bert_checkpoint)
bert_model = AutoModel.from_pretrained(bert_checkpoint).to(device)

Some weights of the model checkpoint at DeepPavlov/rubert-base-cased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [7]:
import pandas as pd
from tqdm import tqdm
from torch.utils.data import Dataset
from transformers import AutoTokenizer


class NestedNerDataset(Dataset):
    """
    PyTorch Dataset for Nested NER
    """

    def __init__(
            self,
            raw_df: pd.DataFrame,
            tokenizer: AutoTokenizer,
            train: bool = True,
            label2idx: dict = None

    ):
        """
        Initialize dataset via dataframe
        
        :param raw_df:     pandas dataframe with `text` column
        :param tokenizer:  tokenizer to use
        :param train:      whether dataset contains `entities` column
        :param label2idx:  mapping from label to index
        """
        
        # save data descriptors
        self.tokenizer = tokenizer  # how to tokenize data
        self.train = train  # whether it is train data
        self.label2id = dict() if label2idx is None else label2idx  # dict is mutable

        # align tokenized text & entities
        self.tokenized_texts = []
        self.aligned_entities = []

        for idx in tqdm(range(raw_df['text'].shape[0])):
            # get raw text & entities
            ents = raw_df.entities.iloc[idx] if self.train else []
            sentence = raw_df.text.iloc[idx]

            # word tokenization
            new_sent, new_ents = self.get_word_tokens_ents(sentence, ents)

            # usual tokenization
            inputs = self.tokenizer(
                new_sent,
                return_tensors="pt",
                padding=True,
                truncation=True,
                max_length=512,
                is_split_into_words=True
            )

            # align labels with tokens
            aligned_ents = []
            if self.train:
                aligned_ents = self.align_labels_with_tokens(new_ents, inputs.word_ids())

            # save preprocessed data
            self.tokenized_texts.append(inputs)
            self.aligned_entities.append(aligned_ents)

        # compute length of dataset
        self.length = len(self.aligned_entities)

    def get_word_tokens_ents(self, sentence: str, ents: list[str]) -> tuple[list[str], list[list[str]]]:
        """
        Word tokenization for sentence with alignment of entities
        
        :param sentence:  sentence to split to words
        :param ents:      raw entities in given sentence
        :return:          (word tokenized sentence, aligned entities)
        """
        
        # ent index to post-tokenized index
        idx_to_post_process_idx = dict()

        # new word tokens (basically split by ' \t\n')
        word_tokens = ['']
        len_tokens = 0
        for idx, symb in enumerate(sentence):
            if symb in {' ', '\t', '\n'}:

                # if new word is not already created
                if word_tokens and word_tokens[-1] != '':
                    word_tokens.append('')
                    len_tokens += 1
                continue

            # add symbol to last word
            word_tokens[-1] += symb
            # map symbol index to new token index
            idx_to_post_process_idx[idx] = len_tokens

        # remove empty symbol from the back (if needed)
        if word_tokens[-1] == '':
            word_tokens.pop()

        # new entities (one list of labels per word token)
        new_ents = [[] for _ in range(len(word_tokens))]

        for ent in ents:
            # parse raw ents
            start_idx, end_idx, ent_name = ent.split()
            start_idx, end_idx = int(start_idx), int(end_idx)
            ent_name = self.label2id.get(ent_name, ent_name)

            # control number of adds
            placed_idxs = dict()

            # end index is treated as inclusive, matching spans like "0 5 CITY" for a 6-character word
            for idx in range(start_idx, end_idx + 1):
                # transform symb index to token index
                new_idx = idx_to_post_process_idx.get(idx)

                # if present & not already placed, place it
                if (new_idx is not None) and (not placed_idxs.get(new_idx, False)):
                    new_ents[new_idx].append(ent_name)
                    placed_idxs[new_idx] = True

        return word_tokens, new_ents

    @staticmethod
    def align_labels_with_tokens(labels: list[list[str]], word_ids: list[int]) -> list[list[str]]:
        """
        Align entities labels with post-tokenized words
        
        :param labels:    entities from word tokenization
        :param word_ids:  word id's from `self.tokenizer`
        :return:          entities aligned to `word_ids` 
        """
        
        # aligned labels
        new_labels = []
        current_word = None

        for word_id in word_ids:
            if word_id != current_word:
                # start of a new word => add new label
                current_word = word_id
                label = [] if word_id is None else labels[word_id]
                new_labels.append(label)
            elif word_id is None:
                # Special token => add empty label
                new_labels.append([])
            else:
                # Same word as previous token => copy label
                label = labels[word_id]
                new_labels.append(label)

        return new_labels

    def __len__(self) -> int:
        return self.length

    def __getitem__(self, idx: int):
        return self.tokenized_texts[idx], self.aligned_entities[idx]
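
As a small illustration of what `align_labels_with_tokens` computes, the sketch below propagates word-level labels to subword tokens. The function `propagate_word_labels` is a condensed, hypothetical equivalent written for this example, and the `word_ids` list is a made-up tokenizer output (`None` marks special tokens).

```python
def propagate_word_labels(word_labels, word_ids):
    """Give every subword token the labels of the word it came from."""
    return [[] if wid is None else list(word_labels[wid]) for wid in word_ids]


# two words; the first one is assumed to be split into two subword tokens
word_labels = [["PERSON"], []]
word_ids = [None, 0, 0, 1, None]  # [CLS], word 0, word 0, word 1, [SEP]

token_labels = propagate_word_labels(word_labels, word_ids)
print(token_labels)
```

Both subword tokens of the first word receive `PERSON`, while special tokens and the unlabeled word get empty lists.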

In [8]:
def extract_ent_labels(raw_df: pd.DataFrame) -> set[str]:
    """
    Extract entities names from dataframe
    
    :param raw_df:  dataframe with `entities` column
    :return:        set of entities names
    """
    
    ent_names = set()
    for ents in raw_df['entities']:
        for ent in ents:
            ent_names.add(ent.split()[-1])
    return ent_names


# entities encoding/decoding
label2idx = dict()
idx2label = dict()

# for each entity make an index
for idx, label in enumerate(sorted(extract_ent_labels(train))):
    label2idx[label] = idx
    idx2label[idx] = label


# define train / eval dataloaders
train_dataloader = NestedNerDataset(train.copy(), tokenizer=tokenizer, label2idx=label2idx)
eval_dataloader = NestedNerDataset(eval_df.copy(), tokenizer=tokenizer, label2idx=label2idx)

100%|███████████████████████████████████████████████████████████████████████████████| 519/519 [00:01<00:00, 375.81it/s]
100%|█████████████████████████████████████████████████████████████████████████████████| 93/93 [00:00<00:00, 359.04it/s]


### Model architecture

Now that the data is preprocessed, let's discuss the architecture.

It is clear that we will use BERT (the data is already preprocessed for it), but how? The idea I want to implement is to get token embeddings from a BERT-like model and then train classifiers on them for token classification. As we have 29 entity classes and we want "nested" (overlapping) labels, we will train 29 CatBoost binary classifiers, one for each entity type.

This architecture is similar to the one developed by [SinaLab](https://github.com/SinaLab/ArabicNER) for Arabic.

The main difference is that they had several classes and each class had its own types (for example, the entity "Human" could be "Name", "Surname", "Age", etc.). So my solution is a simplification of their architecture, although it could easily be rewritten if required.

They also trained their own BERT model, whereas I am interested in how a raw BERT (not even fine-tuned) would perform on this task. So in the implementation below BERT is used only to produce token embeddings.

Now let's discuss the classifiers a bit. SinaLab used a two-layer NN (768->hid->1); here I will use CatBoost classifiers as a different approach.
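
The "one binary classifier per entity type" idea can be sketched without any model at all. The snippet below only shows how nested labels become independent binary targets and how per-class predictions are merged back; the function names and the 3-class toy data are assumptions made for this illustration.

```python
def to_multi_hot(token_labels, num_classes):
    """Turn per-token lists of class indexes into 0/1 rows, one column per class."""
    rows = []
    for labels in token_labels:
        row = [0] * num_classes
        for class_idx in labels:
            row[class_idx] = 1
        rows.append(row)
    return rows


def binary_targets(multi_hot, class_idx):
    """Target column for the binary classifier of a single class."""
    return [row[class_idx] for row in multi_hot]


def decode_predictions(per_class_predictions):
    """Merge 0/1 predictions of all classifiers back into per-token class lists."""
    num_tokens = len(per_class_predictions[0])
    return [
        [c for c, preds in enumerate(per_class_predictions) if preds[t] == 1]
        for t in range(num_tokens)
    ]


# toy example: 4 tokens, 3 classes, the third token carries two nested labels
token_labels = [[], [0], [0, 2], [2]]
multi_hot = to_multi_hot(token_labels, num_classes=3)
per_class = [binary_targets(multi_hot, c) for c in range(3)]
print(per_class)
print(decode_predictions(per_class))
```

Because every class has its own target column, a token can be positive for several classifiers at once, which is what allows nested entities.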

In [9]:
def get_bert_vectors_labels(dataloader: NestedNerDataset, model):
    """
    Transforms dataset samples to BERT vectors
    
    :param dataloader:  NestedNerDataset class instance
    :param model:       bert-model class instance
    :return:            (token ids, token embeddings, token labels), flattened over all samples
    """
    
    model.eval()
    
    # output containers
    combined_ids = []
    combined_preds = []
    combined_labels = []
    
    # for each data sample get bert embeddings & save them
    for data in tqdm(dataloader, total=len(dataloader)):
        inputs, labels = data

        inputs = inputs.to(model.device)
        
        # gradients are not needed for feature extraction
        with torch.no_grad():
            preds = model(**inputs)['last_hidden_state'].squeeze(0).to('cpu')
        
        # save results
        combined_preds.extend(preds.tolist())
        combined_labels.extend(labels)
        combined_ids.extend(inputs['input_ids'].tolist()[0])
    
    return combined_ids, combined_preds, combined_labels


combined_ids, combined_preds, combined_labels = get_bert_vectors_labels(train_dataloader, bert_model)

100%|████████████████████████████████████████████████████████████████████████████████| 519/519 [00:12<00:00, 42.36it/s]


In [10]:
len(combined_ids), len(combined_preds), len(combined_labels)

(147100, 147100, 147100)

In [11]:
# # free gpu & models from bert (takes massive part of my GPU :3)
# bert_model.to('cpu')
# None

In [12]:
# make custom dataframe
train_df = pd.DataFrame(combined_preds)

# original y (list of labels)
train_df['y'] = combined_labels

# for each y value make one-hot-encoded column
for class_i in range(len(label2idx)):
    mask = train_df['y'].apply(lambda ls: class_i in ls)
    train_df[f'y_{class_i}'] = mask.astype(int)

# save original ids of tokens
train_df['ids'] = combined_ids


train_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,y_20,y_21,y_22,y_23,y_24,y_25,y_26,y_27,y_28,ids
0,-0.275032,0.131663,-0.064355,-0.474492,0.53166,0.138446,-0.288945,0.198151,0.009812,0.098127,...,0,0,0,0,0,0,0,0,0,101
1,-0.356811,-0.303299,0.140634,-0.851324,0.468359,0.70196,-0.557334,0.238785,0.365653,-0.379581,...,0,0,0,0,0,0,0,0,0,37104
2,-1.11096,0.517242,-1.426707,-0.313037,0.358637,0.459469,-0.035206,0.565469,0.536585,0.471379,...,0,0,0,0,0,0,0,0,0,65193
3,0.433637,-0.57271,-0.172019,-0.589095,-0.09372,1.895635,-0.060754,0.039899,0.228753,-1.412677,...,0,0,1,0,0,0,0,0,0,82820
4,-0.433975,-0.431674,-1.391628,-0.638494,0.111096,0.128717,-1.348709,0.649881,0.045734,-1.525975,...,0,0,0,0,0,0,0,0,0,851


In [13]:
from catboost import CatBoostClassifier


# define all cats
cats = [
    CatBoostClassifier(iterations=1000, task_type="GPU", devices='0')
    for _ in range(len(label2idx))
]


emb_columns = [column for column in train_df.columns if type(column) == int]
X = train_df[emb_columns]
for class_i in tqdm(range(len(label2idx))):
    cats[class_i].fit(X, train_df[f'y_{class_i}'], verbose=False)

100%|██████████████████████████████████████████████████████████████████████████████████| 29/29 [11:56<00:00, 24.70s/it]


In [14]:
# # same as catboost , but for KNN and RandomForest, deprecated due to speed problems :3


# from sklearn.neighbors import KNeighborsClassifier
# from sklearn.ensemble import RandomForestClassifier

# knn_cats = [
# #     RandomForestClassifier(n_estimators=10)
#     KNeighborsClassifier(n_neighbors=5, metric='cosine', algorithm='brute')
#     for _ in range(len(label2idx))
# ]


# emb_columns = [column for column in train_df.columns if type(column) == int]
# X = train_df[emb_columns]
# for class_i in tqdm(range(len(label2idx))):
#     knn_cats[class_i].fit(X, train_df[f'y_{class_i}'])

### From Predictions -> Human-Readable

Finally, now that we have predictions, it makes sense to discuss how to transform raw token predictions back into the slices & labels format.

This was one of the most confusing steps for me, as we are working not with full words but with subword tokens.

The first thing to note is that the data we get is just a list of lists, where `ans[i]` corresponds to the `i`-th token and contains the indexes of the classes that token belongs to. From this we can recover each token's start/end indexes in the original sentence, which is what the `map_results_to_ids` function does. It takes a `token_id`, gets its string representation via the tokenizer, and finds that string in the original sentence. This implementation is convenient because it does not depend on word tokenization (we don't need to account for skipped spaces, tabs, newlines, etc.). The algorithm is modified a bit for `DeepPavlov/rubert-base-cased`: it strips the leading `#` characters that mark subword continuation tokens. I tested this method on train/eval data and it resulted in a perfect mapping.
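
The search-based mapping can be illustrated with a toy version. This is only a sketch of the idea, not the notebook's function: `locate_tokens` is a hypothetical helper, and the token list is a made-up tokenizer output including an unknown token.

```python
def locate_tokens(text, tokens):
    """Find (start, end) character offsets of tokens by scanning the text left to right."""
    spans, cursor = [], 0
    for token in tokens:
        # drop the subword prefix, but keep a token that consists only of '#'
        piece = token.lstrip("#") or token
        found = text.find(piece, cursor)
        if found == -1:
            # e.g. an unknown token that has no literal match in the text
            continue
        spans.append((found, found + len(piece)))
        cursor = found + len(piece)
    return spans


text = "Путин  подписал\nраспоряжение"
tokens = ["Путин", "под", "##писал", "[UNK]", "распоряжение"]
spans = locate_tokens(text, tokens)
print(spans)
print([text[s:e] for s, e in spans])
```

Since the search always starts at the end of the previous match, extra spaces and newlines are skipped automatically. One caveat of this approach: after a skipped token, the next token is matched at its first occurrence past the cursor, which may not be the intended one.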

The second step is to combine contiguous entities. As input we have (start, end, entities_list) tuples, and we want to merge tokens that we consider contiguous (e.g. `end[i] == start[i + 1]`). For this we can use the scanline algorithm implemented in the `from_sequential_to_readable` function. It is worth mentioning that we can define an `error` for the merge: tokens at a distance of at most `error` from each other are treated as one span. This adds flexibility to the result and could boost performance if `error` is tuned as needed.
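
A compact sketch of the merging step, assuming token spans sorted by start index. `merge_spans` is a hypothetical re-implementation for illustration (it flushes open spans explicitly instead of using sentinel points), and the spans and labels are made up.

```python
def merge_spans(token_spans, error=0):
    """Merge same-label tokens whose gap is at most `error` characters."""
    open_spans = {}  # label -> [start, end] of the span currently being grown
    merged = []
    for start, end, labels in token_spans:
        for label in labels:
            current = open_spans.get(label)
            if current is not None and current[1] + error >= start:
                # close enough to the previous token: extend the open span
                current[1] = end
            else:
                # too far: emit the previous span (if any) and open a new one
                if current is not None:
                    merged.append((current[0], current[1], label))
                open_spans[label] = [start, end]
    # emit whatever is still open at the end of the scan
    for label, (start, end) in open_spans.items():
        merged.append((start, end, label))
    return sorted(merged)


token_spans = [
    (0, 5, ["PERSON"]),
    (6, 10, ["PERSON", "ORG"]),
    (11, 14, []),
    (15, 20, ["PERSON"]),
]
print(merge_spans(token_spans, error=0))
print(merge_spans(token_spans, error=1))
```

With `error=0` the first two `PERSON` tokens stay separate because of the space between them; with `error=1` they are merged into a single `(0, 10, 'PERSON')` span.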

In [15]:
def inference(texts: list[str], cats: list[CatBoostClassifier], idx2label: dict, tokenizer, print_res: bool = False):
    """
    Get model predictions for given texts
    
    :param texts:      texts for inference
    :param cats:       list of catboost classifiers
    :param idx2label:  mapping from index to entity label
    :param tokenizer:  tokenizer used to tokenize texts
    :param print_res:  whether to print predictions to console
    :return:           model predictions
    """
    
    # transform list data to NestedNerDataset
    inf_df = pd.DataFrame()
    inf_df['text'] = texts
    inference_loader = NestedNerDataset(
        inf_df,
        tokenizer=tokenizer,
        label2idx=label2idx,
        train=False
    )

    # get bert vectors
    combined_ids, combined_preds, combined_labels = get_bert_vectors_labels(inference_loader, bert_model)

    # make custom dataframe
    combined_df = pd.DataFrame(combined_preds)
    emb_columns = [column for column in combined_df.columns if type(column) == int]
    combined_df['ids'] = combined_ids

    # get predictions for each data sample
    results = []
    for _, row in tqdm(combined_df.iterrows(), total=combined_df.shape[0]):
        results.append([])

        row_logits = []
        for class_id in range(len(idx2label)):
            # for each class get predictions
            row_df = row.to_frame().transpose()[emb_columns]
            row_logits.append(cats[class_id].predict(row_df))

        # determine indexes of positive classes
        row_logits = torch.tensor(row_logits)
        ners = (row_logits > 0.5).nonzero(as_tuple=True)[0].tolist()

        # transform indexes to entities labels
        ner_tags = [idx2label[ner_idx] for ner_idx in ners]

        # get token string representation
        token_str = tokenizer.decode(int(row['ids']))

        # save token_id, token_str and ner labels
        results[-1].append((
            int(row['ids']),
            token_str,
            ner_tags
        ))

        if print_res:
            print(token_str, '\t:\t', ' + '.join(ner_tags))

    return results


def map_results_to_ids(origs: list[str], results: list[list[tuple[int, str, list[str]]]]) -> list[list[tuple[int, int, list[str]]]]:
    """
    Get tokens start_idx, end_idx from the original text

    :param origs:    original pre-tokenized texts
    :param results:  raw predictions of model
    :return:         tuples with indexes of tokens in original text
    """

    answers = []
    for orig, preds in zip(origs, results):
        orig_ind = 0
        ans = []

        # skip the first and last tokens (special tokens such as [CLS] / [SEP])
        for token_id, str_token, labels in preds[1:-1]:
            if str_token != '#':  # DeepPavlov/rubert-base-cased specifics
                str_token = str_token.lstrip('#')
            
            # find start index of str_token in original text
            find_idx = orig[orig_ind:].find(str_token)
            
            # (if token was <UNK>)
            if find_idx == -1:
                continue
            
            # update indexes 
            start_ind = orig_ind + find_idx
            end_ind = start_ind + len(str_token)
            orig_ind = end_ind
            
            # always true, but just in case smth is wrong
            if str_token == orig[start_ind:end_ind]:  
                ans.append((start_ind, end_ind, labels))
            
        answers.append(ans)
    return answers


def from_sequential_to_readable(
        sequential_results: list[list[tuple[int, int, list[str]]]],
        error: int = 0
) -> list[tuple[list, ...]]:
    """
    Combine contiguous tokens with the same label into one span, allowing a small gap

    :param sequential_results:  outputs of `map_results_to_ids`
    :param error:               maximum gap between tokens that are still merged into one span
    :return:                    for each text, a tuple of [start_idx, end_idx, label] spans
    """

    answers = []
    for sequence in sequential_results:
        points = []
        possible_labels = set()
        
        # split several labels to several data points
        # example (0, 10, [PERSON, DATE] -> [(0, 10, PERSON), (0, 10, DATE)])
        for start_idx, end_idx, labels in sequence:
            for label in labels:
                points.append((start_idx, end_idx, label))
                possible_labels.add(label)
        
        # add one sentinel point per label so that every open span is flushed at the end of the scan
        for label in possible_labels:
            points.append((int(1e9), int(1e9), label))
        
        ans = []
        prev_encounters = dict()
        # point already sorted, so now we can implement scanline
        for start_idx, end_idx, label in points:
            # if too far from previous token, make new one
            if prev_encounters.get(label, [-int(1e9), -int(1e9)])[1] + error < start_idx:
                # if not first, add to answer
                if prev_encounters.get(label) is not None:
                    ans.append([*prev_encounters[label], label])
                
                prev_encounters[label] = [start_idx, end_idx]
            # if not too far, change end_idx
            else:  
                prev_encounters[label][1] = end_idx
        answers.append(tuple(ans))

    return answers


def get_answers(texts: list[str], cats: list[CatBoostClassifier], idx2label: dict[int, str], tokenizer) -> list[tuple]:
    """
    Full pipeline that runs `inference` -> `map_results_to_ids` -> `from_sequential_to_readable` sequentially

    :param texts:      texts for inference
    :param cats:       list of catboost classifiers
    :param idx2label:  mapping from index to entity label
    :param tokenizer:  tokenizer used to tokenize texts
    :return:           for each text, a tuple of [start_idx, end_idx, label] spans
    """

    raw_predictions = inference(texts, cats, idx2label, tokenizer, print_res=False)
    mapped = map_results_to_ids(texts, raw_predictions)
    sequential = from_sequential_to_readable(mapped, error=2)
    return sequential

### Metric calculation 

Even though the model will be scored by macro F1, I wanted to see how it performs in general (something like an accuracy of some sort). As we are working with sets (of ranges with class types), the simplest option is the sets' IoU (intersection over union). This metric is not very interpretable and does not show where the model makes mistakes, but it is a handy way to track how the model performs "in general".
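
As a worked example of the metric, here is a minimal set-IoU sketch on entity strings in the `start end TYPE` format. `span_iou` is a hypothetical helper, the predicted spans are made up, and returning 100 when both sets are empty is an assumption of this sketch.

```python
def span_iou(predicted, target):
    """Intersection over union (in percent) of two collections of entity strings."""
    predicted, target = set(predicted), set(target)
    union = predicted | target
    if not union:
        # assumption: nothing predicted and nothing expected counts as a perfect match
        return 100.0
    return 100.0 * len(predicted & target) / len(union)


target = ["0 5 CITY", "16 23 PERSON", "34 41 PERSON"]
predicted = ["0 5 CITY", "16 24 PERSON"]  # second span is off by one character

print(span_iou(predicted, target))  # one exact match out of four distinct spans
```

Only exact matches of start, end and type count, so a span that is off by a single character contributes to the union but not to the intersection.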

In [16]:
import numpy as np


def get_iou(texts, labels, cats, idx2label, tokenizer):
    ious = []
    for model_prediction, target_label in zip(get_answers(texts, cats, idx2label, tokenizer), labels):
        # transform to original format
        model_prediction = [f"{s} {e} {t}" for s, e, t in model_prediction]
        
        iou = round(100*len(set(model_prediction) & set(target_label)) / len(set(model_prediction) | set(target_label)), 3)
        ious.append(iou)
    return np.array(ious)

### Train, Evaluate and Inference

This section combines everything above to test the solution.

In [19]:
iou = get_iou(eval_df.text.iloc[:2].tolist(), eval_df.entities.iloc[:2].tolist(), cats, idx2label, tokenizer)
print(iou.mean(), iou.std())

100%|███████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 333.25it/s]
100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  8.44it/s]
100%|████████████████████████████████████████████████████████████████████████████████| 584/584 [17:19<00:00,  1.78s/it]

0.0 0.0





In [20]:
# import json

# # make predictions on test data
# input_jsons = []
# with open('../../data/test.jsonl', 'r', encoding='UTF-8') as file:
#     for line in file.readlines():
#         input_jsons.append(json.loads(line))

# with open('test.jsonl', 'w') as file:
#     for test_json in input_jsons:
#         ans_json = {
#             'id': test_json['id'],
#             'ners': get_answers([test_json['sentences']], cats, idx2label, tokenizer)[0]
#         }
        
#         file.write(json.dumps(ans_json) + '\n')