## Token Classification

---

In this task, a token classification model is fine-tuned and evaluated on the provided dataset.
Token classification assigns a label to individual tokens in a sentence. One of the most common token classification tasks is Named Entity Recognition (NER). NER attempts to find a label for each entity in a sentence, such as a person, location, or organization.


| Abbreviation | Description |
|---|---|
| O      | Outside of a named entity                                                    |
| B-PER  | Beginning of a person’s name right after another person’s name               |
| I-PER  | Person’s name                                                                |
| B-ORG  | Beginning of an organization right after another organization                |
| I-ORG  | Organization                                                                 |
| B-LOC  | Beginning of a location right after another location                         |
| I-LOC  | Location                                                                     |
| B-MISC | Beginning of a miscellaneous entity right after another miscellaneous entity |
| I-MISC | Miscellaneous entity                                                         |

Source: https://huggingface.co/dslim/bert-base-NER?text=My+name+is+Wolfgang+and+I+live+in+Berlin
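To make the tagging scheme concrete, here is a minimal sketch that groups tagged tokens into entity spans. The function name `extract_entities` and the hand-written tags are illustrative assumptions, not part of the notebook. The sketch starts a new span on every `B-` tag or change of entity type, so it also handles data where `B-` opens each entity, as appears to be the case in the dataset loaded in this notebook.

```python
def extract_entities(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    entities = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag == "O":
            # Outside any entity: close the open span, if there is one
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
            continue
        prefix, entity_type = tag.split("-", 1)
        if prefix == "B" or entity_type != current_type:
            # A B- tag (or a change of type) starts a new span
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = entity_type, [token]
        else:
            # An I- tag of the same type continues the current span
            current_tokens.append(token)
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


# Hand-tagged example sentence (the tags are an illustration, not model output)
tokens = ["My", "name", "is", "Wolfgang", "and", "I", "live", "in", "Berlin"]
tags = ["O", "O", "O", "B-PER", "O", "O", "O", "O", "B-LOC"]
print(extract_entities(tokens, tags))
# [('PER', 'Wolfgang'), ('LOC', 'Berlin')]
```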

In [99]:
label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

In [92]:
import gc

from tqdm import tqdm
import numpy as np
import pandas as pd

import torch

In [93]:
# Free unused Python objects and cached CUDA memory before training
gc.collect()
torch.cuda.empty_cache()

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print("Torch device: ", device)

Torch device:  cuda


### 0. Load data

---

In [122]:
from read_data import create_df_from_text_file

df_dev = create_df_from_text_file('data/dev.txt')
df_train = create_df_from_text_file('data/train.txt')
df_test = create_df_from_text_file('data/test.txt')
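The `read_data` module is not shown in this notebook. As a rough sketch of what such a reader might do, the code below assumes a CoNLL-style file with one `token label` pair per line and a blank line between sentences. Both the file format and the name `read_conll_sentences` are assumptions made for illustration; the real `create_df_from_text_file` may work differently.

```python
import io


def read_conll_sentences(lines):
    """Collect whitespace-separated 'token label' lines into sentences.

    Assumed format (hypothetical): one token and its tag per line,
    with a blank line between sentences.
    """
    rows, tokens, labels = [], [], []
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current sentence
            if tokens:
                rows.append({"tokens": tokens, "label": labels})
                tokens, labels = [], []
            continue
        parts = line.split()
        tokens.append(parts[0])   # first column: the token
        labels.append(parts[-1])  # last column: its tag
    if tokens:
        rows.append({"tokens": tokens, "label": labels})
    return rows


# In-memory stand-in for a data file
sample = io.StringIO("О O\nзадержании O\nШакирьянова B-PER\n\nУправлять O\nЦАО B-LOC\n")
rows = read_conll_sentences(sample)
print(rows[0])
```

A list of such rows can be passed to `pd.DataFrame(rows)` to obtain a frame with `tokens` and `label` columns like the one displayed in the next cell.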

In [123]:
df_dev.head()

| | tokens | label |
|---|---|---|
| 0 | [как, акционерный, коммерческий, Московский, м... | [O, O, O, B-ORG, I-ORG, I-ORG, I-ORG, I-ORG, I... |
| 1 | [Управлять, ЦАО, и, САО, вместо, Алексея, Алек... | [O, B-LOC, O, B-LOC, O, B-PER, I-PER, O, B-PER... |
| 2 | [О, задержании, Шакирьянова, стало, известно, ... | [O, O, B-PER, O, O, O, O, O, O, O] |
| 3 | [После, майского, ухода, вице-премьера, Владис... | [O, O, O, O, B-PER, I-PER, O, O, O, O, O, B-PE... |
| 4 | [Армяне, со, мной, согласились, ,, с, Ильхамом... | [O, O, O, O, O, O, B-PER, I-PER, O, O, O, O, O... |


In [124]:
def labels2idx(labels):
    return [label_names.index(label) for label in labels]

# Convert tag strings to integer class ids in every split
df_dev['label'] = df_dev['label'].apply(labels2idx)
df_train['label'] = df_train['label'].apply(labels2idx)
df_test['label'] = df_test['label'].apply(labels2idx)

In [125]:
from datasets import Dataset

train_dataset = Dataset.from_dict(df_train.to_dict('list'))
eval_dataset = Dataset.from_dict(df_dev.to_dict('list'))
test_dataset = Dataset.from_dict(df_test.to_dict('list'))

### 1. Choose model

---

**Model Description**
<br/>

**Summary**: <br/>
mBERT model fine-tuned for 3 epochs on the recently-introduced WikiNEuRal dataset for Multilingual NER. The system supports the 9 languages covered by WikiNEuRal (de, en, es, fr, it, nl, pl, pt, ru), and it was trained on all 9 languages jointly. For a stronger baseline system (mBERT + Bi-LSTM + CRF) look at the official repository.

<br/>

Official Repository: https://github.com/Babelscape/wikineural <br/>
Paper: https://aclanthology.org/wikineural

<br/>

In [126]:
checkpoint = "Babelscape/wikineural-multilingual-ner"

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [98]:
# Show tokenizer output

train_example = df_train.tokens[0]
train_example_labels = df_train.label[0]
tokenized_input = tokenizer(train_example, is_split_into_words=True)
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
print(train_example)
print(tokens)

['"', 'Если', 'Миронов', 'занял', 'столь', 'оппозиционную', 'позицию', ',', 'то', 'мне', 'представляется', ',', 'что', 'для', 'него', 'было', 'бы', 'порядочным', 'и', 'правильным', 'уйти', 'в', 'отставку', 'с', 'занимаемого', 'им', 'поста', ',', 'поста', ',', 'который', 'предоставлен', 'ему', 'сегодня', '"', 'Единой', 'Россией', "''", 'и', 'никем', 'больше', "''", ',', '-', 'заключает', 'Исаев', '.']
['[CLS]', '"', 'Если', 'Мир', '##онов', 'занял', 'сто', '##ль', 'о', '##п', '##по', '##зици', '##он', '##ную', 'позицию', ',', 'то', 'мне', 'представляет', '##ся', ',', 'что', 'для', 'него', 'было', 'бы', 'пор', '##яд', '##о', '##чным', 'и', 'правил', '##ьным', 'у', '##йти', 'в', 'отставку', 'с', 'за', '##нима', '##емого', 'им', 'поста', ',', 'поста', ',', 'который', 'пред', '##ост', '##ав', '##лен', 'ему', 'сегодня', '"', 'Е', '##дино', '##й', 'Р', '##ос', '##сией', "'", "'", 'и', 'ни', '##ке', '##м', 'больше', "'", "'", ',', '-', 'за', '##кл', '##ю', '##чает', 'И', '##са', '##ев', '.', '

In [19]:
from tokenizer_utils import align_labels_with_tokens, tokenize_and_align_labels
from transformers import DataCollatorForTokenClassification

# Next, align the word-level labels (ids into label_names) with the
# subword tokens produced by the pretrained AutoTokenizer

tokenized_train_dataset = train_dataset.map(
    tokenize_and_align_labels, batched=True,
    remove_columns=["tokens", "label"],
)

tokenized_eval_dataset = eval_dataset.map(
    tokenize_and_align_labels, batched=True,
    remove_columns=["tokens", "label"]
)

tokenized_test_dataset = test_dataset.map(
    tokenize_and_align_labels, batched=True,
    remove_columns=["tokens", "label"]
)

# Data Collator is used to create a batch of examples.
# It will dynamically pad text to the length of the longest element in a batch
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
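The helpers in `tokenizer_utils` are not shown here. The sketch below illustrates one common way to do the alignment, assuming the label order of `label_names` (every `B-` id is odd and the matching `I-` id is the next integer) and a list of `word_ids` as returned by a fast tokenizer. It is an assumption about how `align_labels_with_tokens` could work, not the author's actual implementation.

```python
def align_labels_with_tokens(labels, word_ids):
    """Expand word-level label ids to subword level.

    word_ids[i] is the index of the word that subword i came from,
    or None for special tokens such as [CLS] and [SEP].
    """
    aligned = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:
            # Special token: -100 is ignored by the loss function
            aligned.append(-100)
        elif word_id != previous_word:
            # First subword of a word keeps the word's label
            aligned.append(labels[word_id])
        else:
            # Continuation subword: turn B-XXX (odd id) into I-XXX (next id)
            label = labels[word_id]
            if label % 2 == 1:
                label += 1
            aligned.append(label)
        previous_word = word_id
    return aligned


# Words: ['Если', 'Миронов', 'занял'] with labels O, B-PER, O
word_labels = [0, 1, 0]
# Subwords: [CLS] Если Мир ##онов занял [SEP]
word_ids = [None, 0, 1, 1, 2, None]
print(align_labels_with_tokens(word_labels, word_ids))
# [-100, 0, 1, 2, 0, -100]
```

With a real tokenizer, the `word_ids` list for example `i` of a batch would come from `tokenized_inputs.word_ids(i)`.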


In [129]:
# Test data collator
batch_example = data_collator([tokenized_train_dataset[i] for i in range(2)])
batch_example["input_ids"], batch_example["labels"]

(tensor([[   101,    107,  33463,  53275,  45568,  40555, 108804,  12118,    555,
           11078,  53204,  92522,  11579,  12719, 110028,    117,  11663,  67251,
           36932,  10625,    117,  10791,  10520,  13981,  11582,  22504,  41436,
           35528,  10316,  27819,    549,  75529,  30982,    560,  37756,    543,
           75376,    558,  10234,  97582,  91024,  13327,  80765,    117,  80765,
             117,  12968,  23807,  33580,  18197,  16173,  16929,  72166,    107,
             514, 105088,  10384,    525,  17969, 106801,    112,    112,    549,
           19544,  11557,  10241,  26368,    112,    112,    117,    118,  10234,
           53869,  10593,  52928,    517,  12016,  13292,    119,    102,      0,
               0,      0,      0,      0,      0,      0,      0,      0,      0,
               0,      0,      0,      0,      0,      0,      0,      0,      0,
               0,      0,      0,      0,      0,      0,      0,      0,      0,
               0
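The trailing zeros in `input_ids` above are padding. As a rough pure-Python illustration of what dynamic padding does, the sketch below pads every example to the longest one in the batch, filling `input_ids` with a pad id of 0 and `labels` with -100 so that padded positions are ignored by the loss. The function `pad_batch` and the shortened feature lists are hypothetical; the real collator also returns tensors and handles further fields.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Pad all examples in a batch to the length of the longest one."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        # Padded label positions get -100 so the loss skips them
        batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)
    return batch


# Two shortened, illustrative examples of different lengths
features = [
    {"input_ids": [101, 107, 102], "labels": [-100, 0, -100]},
    {"input_ids": [101, 33463, 53275, 45568, 102], "labels": [-100, 0, 1, 2, -100]},
]
batch = pad_batch(features)
print(batch["input_ids"][0])  # [101, 107, 102, 0, 0]
print(batch["labels"][0])     # [-100, 0, -100, -100, -100]
```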

In [20]:
from trainer_utils import compute_metrics
import evaluate

metric = evaluate.load("seqeval")
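`compute_metrics` lives in `trainer_utils` and is not shown. Before entity-level metrics can be computed, predictions at positions labelled -100 (special tokens and padding) have to be dropped and the remaining ids mapped back to tag strings. The helper below is a hypothetical sketch of that step; the name `to_tag_sequences` and the toy inputs are assumptions.

```python
label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']


def to_tag_sequences(predicted_ids, label_ids, label_names):
    """Drop positions labelled -100 and map class ids to tag strings."""
    true_predictions, true_labels = [], []
    for pred_row, label_row in zip(predicted_ids, label_ids):
        kept = [(p, l) for p, l in zip(pred_row, label_row) if l != -100]
        true_predictions.append([label_names[p] for p, _ in kept])
        true_labels.append([label_names[l] for _, l in kept])
    return true_predictions, true_labels


# One toy sentence: [CLS] w1 w2 w3 [SEP] plus one padded position
predicted_ids = [[0, 0, 1, 2, 0, 0]]
label_ids = [[-100, 0, 1, 1, -100, -100]]
preds, refs = to_tag_sequences(predicted_ids, label_ids, label_names)
print(preds)  # [['O', 'B-PER', 'I-PER']]
print(refs)   # [['O', 'B-PER', 'B-PER']]
```

The two lists of tag sequences have the shape expected by the seqeval metric, for example `metric.compute(predictions=preds, references=refs)`.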

In [22]:
from transformers import AutoModelForTokenClassification


# Map class ids to tag names and back (label_names is defined at the top)
id2label = {i: label for i, label in enumerate(label_names)}
label2id = {label: i for i, label in id2label.items()}

model = AutoModelForTokenClassification.from_pretrained(checkpoint,
                                                       id2label=id2label,
                                                       label2id=label2id,
                                                       ignore_mismatched_sizes=True).to(device)

In [24]:
from transformers import TrainingArguments

output_dir = "./results"

args = TrainingArguments(
    output_dir=output_dir,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
)

In [25]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_eval_dataset,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)
trainer.train()

***** Running training *****
  Num examples = 7746
  Num Epochs = 3
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 4
  Gradient Accumulation steps = 1
  Total optimization steps = 5811
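The step count in this log can be reproduced from the other numbers, assuming a single device and no gradient accumulation (as the log reports). A quick check:

```python
import math

num_examples = 7746
batch_size = 4
num_epochs = 3

# The last batch of each epoch is partial, hence the ceiling
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 1937 5811
```

Because `save_strategy="epoch"`, one checkpoint is written per epoch, which is why the checkpoint directories are numbered 1937, 3874, and 5811.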


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | 0.0378 | 0.039813 | 0.948772 | 0.955871 | 0.952308 | 0.990404 |
| 2 | 0.0176 | 0.037016 | 0.96323 | 0.975058 | 0.969108 | 0.992298 |
| 3 | 0.0064 | 0.036634 | 0.964136 | 0.969685 | 0.966903 | 0.99275 |


***** Running Evaluation *****
  Num examples = 2582
  Batch size = 4




Saving model checkpoint to ./results/checkpoint-1937
Configuration saved in ./results/checkpoint-1937/config.json
Model weights saved in ./results/checkpoint-1937/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-1937/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-1937/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 2582
  Batch size = 4




Saving model checkpoint to ./results/checkpoint-3874
Configuration saved in ./results/checkpoint-3874/config.json
Model weights saved in ./results/checkpoint-3874/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-3874/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-3874/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 2582
  Batch size = 4




Saving model checkpoint to ./results/checkpoint-5811
Configuration saved in ./results/checkpoint-5811/config.json
Model weights saved in ./results/checkpoint-5811/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-5811/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-5811/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=5811, training_loss=0.02255878651989366, metrics={'train_runtime': 937.663, 'train_samples_per_second': 24.783, 'train_steps_per_second': 6.197, 'total_flos': 2674150607863404.0, 'train_loss': 0.02255878651989366, 'epoch': 3.0})

### 2. Evaluate on test dataset

---

In [29]:
trainer.evaluate(tokenized_test_dataset)

***** Running Evaluation *****
  Num examples = 2582
  Batch size = 4




{'eval_loss': 0.03453731909394264,
 'eval_precision': 0.9672398190045249,
 'eval_recall': 0.974115931461903,
 'eval_f1': 0.9706656979384253,
 'eval_accuracy': 0.9930352501624431,
 'eval_runtime': 29.1135,
 'eval_samples_per_second': 88.688,
 'eval_steps_per_second': 22.189,
 'epoch': 3.0}
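As a sanity check on these numbers, F1 is the harmonic mean of precision and recall, so the reported `eval_f1` can be recomputed from the other two values:

```python
precision = 0.9672398190045249
recall = 0.974115931461903
reported_f1 = 0.9706656979384253

# Harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(f1)
print(abs(f1 - reported_f1) < 1e-6)  # True
```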

### 3. Load trained model and check on random sample

---

In [29]:
from transformers import AutoModelForTokenClassification

trained_model = AutoModelForTokenClassification.from_pretrained("./results/checkpoint-5811",
                                                       id2label=id2label,
                                                       label2id=label2id,
                                                       ignore_mismatched_sizes=True)

In [133]:
# test_string = "Больше всего новых машин купили в Казани (57%), Набережных Челнах (54%), Тюмени (47%), Ульяновске и Самаре (по 38%)."
test_string = "Молчаливое утопическе представление о Вавилоне в Поднебесной."
tokenized_input = tokenizer(test_string.split(), is_split_into_words=True, return_tensors="pt")
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"][0])

# Run the reloaded checkpoint (on CPU, like the inputs) without tracking gradients
trained_model.eval()
with torch.no_grad():
    outputs = trained_model(**tokenized_input)
predictions = np.argmax(outputs["logits"].numpy(), axis=-1)

# format_result is a project helper; its import is not shown in this notebook
result = format_result(test_string, tokens, predictions, tokenized_input, label_names)
result

{'Молчаливое': {'Tokens': ['М', '##ол', '##чали', '##вое'],
  'Labels': ['O', 'O', 'O', 'O']},
 'утопическе': {'Tokens': ['у', '##то', '##пи', '##че', '##ске'],
  'Labels': ['O', 'O', 'O', 'O', 'O']},
 'представление': {'Tokens': ['представлен', '##ие'], 'Labels': ['O', 'O']},
 'о': {'Tokens': ['о'], 'Labels': ['O']},
 'Вавилоне': {'Tokens': ['В', '##ави', '##лон', '##е'],
  'Labels': ['B-LOC', 'I-LOC', 'I-LOC', 'I-LOC']},
 'в': {'Tokens': ['в'], 'Labels': ['O']},
 'Поднебесной.': {'Tokens': ['Под', '##не', '##бе', '##сной', '.'],
  'Labels': ['B-LOC', 'I-LOC', 'I-LOC', 'I-LOC', 'O']}}
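The result above is reported per subword. If one label per word is needed, a simple convention is to keep the label of each word's first subword. The sketch below applies that convention to an excerpt of the output; it is one possible post-processing step, not something the notebook itself does.

```python
# Excerpt of the result shown above
result = {
    "Вавилоне": {"Tokens": ["В", "##ави", "##лон", "##е"],
                 "Labels": ["B-LOC", "I-LOC", "I-LOC", "I-LOC"]},
    "в": {"Tokens": ["в"], "Labels": ["O"]},
    "Поднебесной.": {"Tokens": ["Под", "##не", "##бе", "##сной", "."],
                     "Labels": ["B-LOC", "I-LOC", "I-LOC", "I-LOC", "O"]},
}

# Keep the label predicted for the first subword of each word
word_labels = {word: info["Labels"][0] for word, info in result.items()}
print(word_labels)
# {'Вавилоне': 'B-LOC', 'в': 'O', 'Поднебесной.': 'B-LOC'}
```

Two caveats follow from the structure of `result`: because the sentence was split on whitespace, the final period stays attached to `Поднебесной.`, and because the dictionary is keyed by word, a word that occurs twice in a sentence would appear only once.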