### FactRuEval example (uncased model)

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

import warnings
import sys

sys.path.append("../")

warnings.filterwarnings("ignore")

### 0. Download pretrained bert model
Download the pretrained multilingual BERT model, either [uncased](https://storage.googleapis.com/bert_models/2018_11_03/multilingual_L-12_H-768_A-12.zip) or [cased](https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip) (recommended), and unzip it.

Use the following commands to convert the TensorFlow checkpoint to PyTorch:


```
export BERT_BASE_DIR=/path/to/bert/multilingual_L-12_H-768_A-12

python3 convert_tf_checkpoint_to_pytorch.py \
  --tf_checkpoint_path $BERT_BASE_DIR/bert_model.ckpt \
  --bert_config_file $BERT_BASE_DIR/bert_config.json \
  --pytorch_dump_path $BERT_BASE_DIR/pytorch_model.bin
```

In [2]:
import os


data_path = "/home/lis/ner/ulmfit/data/factrueval/"
train_path = os.path.join(data_path, "train_with_pos.csv")
valid_path = os.path.join(data_path, "valid_with_pos.csv")
model_dir = "/datadrive/models/multilingual_L-12_H-768_A-12/"
init_checkpoint_pt = "/datadrive/models/multilingual_L-12_H-768_A-12/pytorch_model.bin"
bert_config_file = os.path.join(model_dir, "bert_config.json")
vocab_file = os.path.join(model_dir, "vocab.txt")

In [3]:
import torch
torch.cuda.set_device(1)
torch.cuda.is_available(), torch.cuda.current_device()

(True, 1)

### 1. Data preparation

Train and validation data should be provided in the following format.

In [4]:
import pandas as pd


df = pd.read_csv(train_path)
df.head()

Unnamed: 0,0,1,3
0,O O O O O O O O O O O O O O O O O O O O,"Мифология солнцеворота , собственно , и сводит...",NOUN NOUN PNCT ADVB PNCT CONJ VERB PREP NOUN N...
1,O O O O O O B_ORG I_ORG O B_ORG I_ORG O O O O ...,"По его словам , с покупкой Caramba TV « СТС Ме...",PREP NPRO NOUN PNCT PREP NOUN <unk> <unk> PNCT...
2,O O O O O O O O O O O O O O O O B_LOC O,"Такое десятилетие , по его словам « необходимо...",ADJF NOUN PNCT PREP NPRO NOUN PNCT ADJS ADJF P...
3,O O O O O O O O O O O O O O,"Правительство уволило часть врачей , обвинив и...",NOUN VERB NOUN NOUN PNCT GRND NPRO PREP NOUN N...
4,O O O B_PER I_PER O O O O O O B_ORG I_ORG I_OR...,Министр сельского хозяйства Николай Федоров пр...,NOUN ADJF NOUN NOUN NOUN VERB PNCT CONJ PRTF V...


Train and valid .csv files must have columns named `0` and `1`. Column `3` is not required (it is currently unused).
* Column 0 contains labels in IOB format.
* Column 1 contains tokenized text, with tokens separated by whitespace.
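The column layout described above can be sketched with the standard library alone. The two sentences below are made-up samples, not rows from the FactRuEval files, and the check at the end (one label per token) is an assumption about what the loader expects, based on the `df.head()` output shown above.

```python
import csv
import io

# Hypothetical two-sentence sample; labels and tokens are whitespace-separated.
rows = [
    {"0": "O O B_ORG I_ORG", "1": "слияние с Deutsche Börse"},
    {"0": "B_PER I_PER O", "1": "Николай Федоров сообщил"},
]

# Write the rows to an in-memory CSV with the required column names "0" and "1".
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["0", "1"])
writer.writeheader()
writer.writerows(rows)

# Read the data back and check that every token has exactly one label.
buffer.seek(0)
for row in csv.DictReader(buffer):
    labels = row["0"].split()
    tokens = row["1"].split()
    assert len(labels) == len(tokens), (labels, tokens)

csv_text = buffer.getvalue()
print(csv_text)
```

Running the same length check over real train and valid files before creating the data object is a cheap way to catch misaligned rows early.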

To feed the data to the model, create a `NerData` object.

* `train_path` - path to the train .csv file
* `valid_path` - path to the valid .csv file
* `vocab_file` - path to the Google BERT pretrained vocabulary
* `batch_size` - batch size (default `16`)
* `cuda` - whether to use CUDA instead of CPU (default `True`)
* `is_cls` - create data for the joint model (default `False`)
* `data_type` - type of input embeddings (default `bert`)
* `max_seq_len` - maximum sequence length in BERT tokens (default `424`)

In [5]:
from modules import BertNerData as NerData

INFO:summarizer.preprocessing.cleaner:'pattern' package not found; tag filters are not available for English


In [6]:
data = NerData.create(train_path, valid_path, vocab_file)

HBox(children=(IntProgress(value=0, max=3728), HTML(value='')))



HBox(children=(IntProgress(value=0, max=415), HTML(value='')))



For FactRuEval we use the following set of labels:

In [7]:
print(data.label2idx)

{'<pad>': 0, '[CLS]': 1, '[SEP]': 2, 'B_O': 3, 'I_O': 4, 'B_ORG': 5, 'I_ORG': 6, 'B_LOC': 7, 'B_PER': 8, 'I_PER': 9, 'I_LOC': 10}
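The mapping printed above can be inverted to see which labels sit at index 5 and beyond, which is the slice `data.id2label[5:]` passed as `sup_labels` when the learner is created. The sketch below assumes that `id2label` is simply the list of labels ordered by index; the library's own attribute may be built differently.

```python
# Label mapping copied from the printed output above.
label2idx = {
    '<pad>': 0, '[CLS]': 1, '[SEP]': 2, 'B_O': 3, 'I_O': 4,
    'B_ORG': 5, 'I_ORG': 6, 'B_LOC': 7, 'B_PER': 8, 'I_PER': 9, 'I_LOC': 10,
}

# Assumed inverse mapping: labels sorted by their index.
id2label = sorted(label2idx, key=label2idx.get)

# Indices 0-4 are padding, special tokens and the "outside" labels,
# so the slice keeps only the entity labels.
sup_labels = id2label[5:]
print(sup_labels)
```

The slice depends on the order in which labels were first seen in the training data, so it is worth printing `label2idx` again whenever the dataset changes.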


### 2. Create model
To build the PyTorch model, create a `NerModel` object.

* `label_size` - number of labels: `len(data.label2idx)`

BertEmbedder params
* `bert_config_file` - path to google bert pretrained config
* `init_checkpoint_pt` - path to google bert pretrained weights
* `embedding_dim` - output dim from bert model (default `768`)
* `bert_mode` - how the BERT output is returned. `last` returns the output of the last layer; `weighted` returns a weighted sum of all BERT output layers with learnable weights (as in ELMo).
* `freeze` - whether to freeze the BERT weights (default `True`)
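The `weighted` mode can be illustrated with a small pure-Python sketch of an ELMo-style scalar mix. This is an assumption about the general technique, not the library's implementation: the function name `scalar_mix` is hypothetical, and the softmax normalisation of the layer weights is the ELMo convention, which the module here may or may not follow.

```python
import math


def scalar_mix(layer_outputs, raw_weights):
    """Return the softmax-weighted sum of per-layer vectors."""
    # Softmax turns unconstrained (learnable) scalars into positive weights summing to 1.
    peak = max(raw_weights)
    exps = [math.exp(w - peak) for w in raw_weights]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum over layers, computed independently for each feature.
    return [
        sum(w * layer[d] for w, layer in zip(weights, layer_outputs))
        for d in range(len(layer_outputs[0]))
    ]


# Three toy "layers", each a 2-dimensional vector for a single token.
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Equal raw weights reduce to a plain average over layers.
mixed = scalar_mix(layers, [0.0, 0.0, 0.0])
print(mixed)
```

With a very large raw weight on one layer the mix approaches that layer's output, so `last` can be seen as a limiting case of this scheme.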

BertBiLSTMEncoder params
* `enc_hidden_dim` - dim of rnn layer or hidden layer (default `128`)
* `rnn_layers` - number of rnn layers in encoder

CRFDecoder params
* `input_dropout` - dropout param (default `0.5`)

GPU or CPU:
* `use_cuda` - whether to use CUDA instead of CPU (default `True`)

In [8]:
from modules.models import BertBiLSTMCRF

In [11]:
model = BertBiLSTMCRF.create(len(data.label2idx), bert_config_file, init_checkpoint_pt, enc_hidden_dim=128, freeze=True)

In [12]:
model.get_n_trainable_params()

436161

### 3. Create learner

To train the PyTorch model, create a `NerLearner` object.

* `model: NerModel` - pytorch model
* `data: NerData` - train and valid dataloaders
* `best_model_path` - path where the best model is stored
* `lr` - starting learning rate (default `0.001`)
* `betas` - params for default optimizer (default `[0.8, 0.9]`)
* `clip` - grad clipping (default `5`)
* `verbose` - print reports to the console (default `True`)
* `sup_labels` - list of supported labels for calculating `target_metric` metric. For FactRuEval use: `['B_LOC', 'I_LOC', 'B_ORG', 'I_ORG', 'B_PER', 'I_PER']` (default `None`)
* `t_total` - total number of optimization steps, used by the lr scheduler; if `-1`, the lr is not rescaled after each batch (default `-1`). Usually `t_total = num_epochs * train_size / batch_size`.
* `warmup` - fraction of `t_total` used for warmup; `-1` means no warmup (default `0.1`)
* `weight_decay` - weight decay (default `0.01`)
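As a worked example of the `t_total` rule of thumb, the sketch below plugs in the numbers from this notebook: 3728 training sentences (the progress bar shown when the data was created) and the default batch size of 16. Rounding the last partial batch up and truncating the warmup step count are assumptions made for illustration; the scheduler may handle both differently.

```python
import math

train_size = 3728   # training sentences, from the data-loading progress bar
batch_size = 16     # documented default
num_epochs = 1
warmup = 0.1        # documented default fraction of t_total

# One optimization step per batch; a final partial batch still counts as a step.
steps_per_epoch = math.ceil(train_size / batch_size)
t_total = num_epochs * steps_per_epoch

# Assumed interpretation: warmup covers the first 10% of all steps.
warmup_steps = int(warmup * t_total)

print(steps_per_epoch, t_total, warmup_steps)
```

The result is the same quantity that `num_epochs * len(data.train_dl)` computes in the learner cell, since a dataloader's length is its number of batches.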

In [13]:
from modules import NerLearner

In [18]:
num_epochs = 1

learner = NerLearner(model, data,
                     best_model_path="/datadrive/models/factrueval/exp_final.cpt",
                     lr=0.001, clip=5.0, sup_labels=data.id2label[5:],
                     t_total=num_epochs * len(data.train_dl))

### 4. Learn your NER model
Call `learner.fit`:
* `epochs` - number of training epochs (default `100`)
* `resume_history` - whether to append results to the existing history or start a new one (default `True`)
* `target_metric` - the mean metric used to pick the best epoch (default `f1`)

In [19]:
learner.fit(num_epochs, target_metric='f1')

INFO:root:Resuming train... Current epoch 0.


HBox(children=(IntProgress(value=0, max=233), HTML(value='')))

INFO:root:
epoch 1, average train epoch loss=14.153





HBox(children=(IntProgress(value=0, max=26), HTML(value='')))



INFO:root:on epoch 0 by max_f1: 0.342
INFO:root:Saving new best model...


              precision    recall  f1-score   support

       B_ORG      0.435     0.181     0.256       259
       I_ORG      0.657     0.458     0.540       506
       B_LOC      0.320     0.286     0.302       192
       B_PER      0.407     0.314     0.354       188
       I_PER      0.527     0.640     0.578       136
       I_LOC      0.500     0.012     0.023        84

   micro avg      0.509     0.352     0.416      1365
   macro avg      0.474     0.315     0.342      1365
weighted avg      0.511     0.352     0.399      1365



Continue fitting to obtain the best model:

In [None]:
learner.fit(num_epochs - 1, target_metric='f1')

### 5. Predict on new data
Create a new data loader from an existing path.

In [269]:
from modules.data.bert_data import get_bert_data_loader_for_predict

In [270]:
dl = get_bert_data_loader_for_predict(data_path + "valid.csv", learner)

Load our best model.

In [296]:
learner.load_model()

Call `predict` on the learner.

In [297]:
preds = learner.predict(dl)

HBox(children=(IntProgress(value=0, max=26), HTML(value='')))

### 6. Transform predictions to tokens and spans

In [298]:
from modules.utils.utils import bert_labels2tokens, tokens2spans


tp, lp = bert_labels2tokens(dl, preds)
print(tp[0])
print(lp[0])

['Сделка', 'состоится', ',', 'если', 'будет', 'одобрена', 'регуляторами', ',', 'из-за', 'которых', 'в', 'начале', 'года', 'сорвалось', 'слияние', 'NYSE', 'Euronext', 'с', 'Deutsche', 'Börse']
['B_O', 'B_O', 'B_O', 'B_O', 'B_O', 'B_O', 'B_O', 'B_O', 'B_O', 'B_O', 'B_O', 'B_O', 'B_O', 'B_O', 'B_O', 'B_ORG', 'I_ORG', 'B_O', 'B_ORG', 'I_ORG']


In [299]:
sp = tokens2spans(tp, lp)

In [300]:
print(sp[0])

[('Сделка', 'O'), ('состоится', 'O'), (',', 'O'), ('если', 'O'), ('будет', 'O'), ('одобрена', 'O'), ('регуляторами', 'O'), (',', 'O'), ('из-за', 'O'), ('которых', 'O'), ('в', 'O'), ('начале', 'O'), ('года', 'O'), ('сорвалось', 'O'), ('слияние', 'O'), ('NYSE Euronext', 'ORG'), ('с', 'O'), ('Deutsche Börse', 'ORG')]
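The step from token-level labels to spans can be sketched as follows. `merge_spans` is a hypothetical helper written for illustration, not the library's `tokens2spans`; it assumes that an `I_X` label continues the preceding span of the same type and that `O` tokens are never merged, which is consistent with the output shown above.

```python
def merge_spans(tokens, labels):
    """Merge B_/I_ labelled tokens into (text, type) spans."""
    spans = []
    for token, label in zip(tokens, labels):
        prefix, _, tag = label.partition("_")
        continues = (
            prefix == "I" and tag != "O" and spans and spans[-1][1] == tag
        )
        if continues:
            # Extend the previous span of the same entity type.
            text, _ = spans[-1]
            spans[-1] = (text + " " + token, tag)
        else:
            # Start a new span (entity start or an "outside" token).
            spans.append((token, tag))
    return spans


# Tail of the first predicted sentence shown above.
tokens = ['слияние', 'NYSE', 'Euronext', 'с', 'Deutsche', 'Börse']
labels = ['B_O', 'B_ORG', 'I_ORG', 'B_O', 'B_ORG', 'I_ORG']

spans = merge_spans(tokens, labels)
print(spans)
```

Keeping `O` tokens as single-token spans preserves the original token order, so the span list can still be joined back into the sentence.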


### 7. Evaluate

IOB report

In [306]:
from modules.train.train import validate_step
print(validate_step(learner.data.valid_dl, learner.model, learner.data.id2label, learner.sup_labels))

HBox(children=(IntProgress(value=0, max=26), HTML(value='')))

              precision    recall  f1-score   support

       B_ORG      0.873     0.826     0.849       259
       I_ORG      0.931     0.808     0.865       898
       B_LOC      0.935     0.901     0.918       192
       I_LOC      0.929     0.856     0.891       277
       B_PER      0.972     0.936     0.954       188
       I_PER      0.978     0.949     0.964       613

   micro avg      0.941     0.869     0.903      2427
   macro avg      0.937     0.879     0.907      2427
weighted avg      0.940     0.869     0.902      2427



Span report

In [305]:
from modules.utils.plot_metrics import get_bert_span_report
clf_report = get_bert_span_report(dl, preds)
print(clf_report)

              precision    recall  f1-score   support

         LOC      0.870     0.839     0.854       192
         ORG      0.833     0.788     0.810       259
         PER      0.945     0.910     0.927       188

   micro avg      0.877     0.839     0.858       639
   macro avg      0.883     0.845     0.863       639
weighted avg      0.877     0.839     0.857       639
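When reading these reports, it helps to remember that the f1-score column is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded precision and recall values of the span report above; the helper name `f1_score` is introduced here for illustration only.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the span report above.
span_report = {
    "LOC": (0.870, 0.839),
    "ORG": (0.833, 0.788),
    "PER": (0.945, 0.910),
}

f1_by_label = {
    label: round(f1_score(p, r), 3) for label, (p, r) in span_report.items()
}
print(f1_by_label)
```

The averaged rows cannot always be reproduced exactly this way, because the report presumably averages unrounded per-label values before printing three decimals.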



### 8. Try cased bert model

In [24]:
import os


data_path = "/home/lis/ner/ulmfit/data/factrueval/"
train_path = os.path.join(data_path, "train_with_pos.csv")
valid_path = os.path.join(data_path, "valid_with_pos.csv")
model_dir = "/datadrive/models/multi_cased_L-12_H-768_A-12/"
init_checkpoint_pt = os.path.join("/datadrive/models/multi_cased_L-12_H-768_A-12/", "pytorch_model.bin")
bert_config_file = os.path.join("/datadrive/bert/multi_cased_L-12_H-768_A-12/", "bert_config.json")
vocab_file = os.path.join("/datadrive/bert/multi_cased_L-12_H-768_A-12/", "vocab.txt")

In [22]:
from modules import BertNerData as NerData


data = NerData.create(train_path, valid_path, vocab_file)

HBox(children=(IntProgress(value=0, max=3728), HTML(value='')))



HBox(children=(IntProgress(value=0, max=415), HTML(value='')))



In [23]:
from modules.models import BertBiLSTMCRF


model = BertBiLSTMCRF.create(len(data.label2idx), bert_config_file, init_checkpoint_pt, enc_hidden_dim=256)

In [320]:
from modules import NerLearner


learner = NerLearner(model, data,
                     best_model_path="/datadrive/models/factrueval/exp_final_cased.cpt",
                     lr=0.001, clip=1.0, sup_labels=data.id2label[5:],
                     t_total=num_epochs * len(data.train_dl))

INFO:root:Use lr OneCycleScheduler...


In [None]:
learner.fit(100, target_metric='prec')

IOB report

In [322]:
learner.load_model()

In [323]:
from modules.train.train import validate_step


print(validate_step(learner.data.valid_dl, learner.model, learner.data.id2label, learner.sup_labels))

HBox(children=(IntProgress(value=0, max=26), HTML(value='')))

              precision    recall  f1-score   support

       B_ORG      0.845     0.842     0.843       259
       I_ORG      0.920     0.836     0.876      1000
       B_LOC      0.927     0.865     0.895       192
       I_LOC      0.915     0.818     0.864       303
       B_PER      0.973     0.957     0.965       188
       I_PER      0.984     0.957     0.970       649

   micro avg      0.933     0.876     0.903      2591
   macro avg      0.927     0.879     0.902      2591
weighted avg      0.932     0.876     0.903      2591



Span report

In [324]:
from modules.data.bert_data import get_bert_data_loader_for_predict
from modules.utils.plot_metrics import get_bert_span_report


dl = get_bert_data_loader_for_predict(data_path + "valid.csv", learner)

preds = learner.predict(dl)

clf_report = get_bert_span_report(dl, preds)
print(clf_report)

HBox(children=(IntProgress(value=0, max=26), HTML(value='')))

              precision    recall  f1-score   support

         LOC      0.850     0.797     0.823       192
         ORG      0.812     0.815     0.813       259
         PER      0.908     0.894     0.901       188

   micro avg      0.851     0.833     0.842       639
   macro avg      0.857     0.835     0.845       639
weighted avg      0.852     0.833     0.842       639

