### FactRuEval example (uncased model)

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

import warnings
import sys

sys.path.append("../")

warnings.filterwarnings("ignore")

### 0. Download pretrained ELMo model
Download the pretrained ELMo model for [Russian](http://vectors.nlpl.eu/repository/11/170.zip) (see the PyTorch [implementation](https://github.com/HIT-SCIR/ELMoForManyLangs) of ELMo) and unzip it.


In [2]:
import os


data_path = "/home/lis/ner/ulmfit/data/factrueval/"
train_path = os.path.join(data_path, "train_with_pos.csv")
valid_path = os.path.join(data_path, "valid_with_pos.csv")
model_dir = "/datadrive/elmo/"
config_name = "cnn_50_100_512_4096_sample.json"

In [3]:
import torch
torch.cuda.set_device(1)
torch.cuda.is_available(), torch.cuda.current_device()

(True, 1)

### 1. Data preparation

Training and validation data should be provided in the following format.

In [4]:
import pandas as pd


df = pd.read_csv(train_path)
df.head()

Unnamed: 0,0,1,3
0,O O O O O O O O O O O O O O O O O O O O,"Мифология солнцеворота , собственно , и сводит...",NOUN NOUN PNCT ADVB PNCT CONJ VERB PREP NOUN N...
1,O O O O O O B_ORG I_ORG O B_ORG I_ORG O O O O ...,"По его словам , с покупкой Caramba TV « СТС Ме...",PREP NPRO NOUN PNCT PREP NOUN <unk> <unk> PNCT...
2,O O O O O O O O O O O O O O O O B_LOC O,"Такое десятилетие , по его словам « необходимо...",ADJF NOUN PNCT PREP NPRO NOUN PNCT ADJS ADJF P...
3,O O O O O O O O O O O O O O,"Правительство уволило часть врачей , обвинив и...",NOUN VERB NOUN NOUN PNCT GRND NPRO PREP NOUN N...
4,O O O B_PER I_PER O O O O O O B_ORG I_ORG I_OR...,Министр сельского хозяйства Николай Федоров пр...,NOUN ADJF NOUN NOUN NOUN VERB PNCT CONJ PRTF V...


The train and valid .csv files must have columns named `0` and `1`. Column `3` is optional (it is not used currently).
* Column 0 contains labels in IOB format.
* Column 1 contains tokenized and separated (by whitespace) text.
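
The format above implies one label per whitespace-separated token. As a minimal sketch (the helper name `check_ner_csv` is hypothetical and not part of the library, and the one-label-per-token rule is an assumption drawn from the column descriptions), a file can be sanity-checked with the standard library before it is passed to the data loader:

```python
import csv
import io


def check_ner_csv(handle):
    """Return the number of rows after checking that labels and tokens align."""
    reader = csv.DictReader(handle)
    n_rows = 0
    for n_rows, row in enumerate(reader, start=1):
        labels = row["0"].split()  # column 0: IOB labels
        tokens = row["1"].split()  # column 1: whitespace-separated tokens
        if len(labels) != len(tokens):
            raise ValueError(
                f"row {n_rows}: {len(labels)} labels for {len(tokens)} tokens"
            )
    return n_rows


# Toy file with a single sentence fragment taken from the table above.
sample = io.StringIO(
    "0,1\n"
    "O O O B_PER I_PER,Министр сельского хозяйства Николай Федоров\n"
)
print(check_ner_csv(sample))
```

For a real file, pass `open(train_path, newline="", encoding="utf-8")` instead of the in-memory sample.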

To use the data in the model, we need to create a `NerData` object.

* `train_path` - path to train .csv file
* `valid_path` - path to valid .csv file
* `model_dir` - path to ELMo pretrained model's dir
* `config_name` - name of config in `model_dir` folder
* `batch_size` - batch size (default `16`)
* `cuda` - whether to use CUDA or CPU (default `True`)
* `is_cls` - create data for joint model (default `False`)
* `oov` - out-of-vocabulary token of the ELMo model (default `'<oov>'`)
* `pad` - padding token of the ELMo model (default `'<pad>'`)

In [4]:
from modules.data.elmo_data import ElmoNerData as NerData

INFO:summarizer.preprocessing.cleaner:'pattern' package not found; tag filters are not available for English


In [5]:
data = NerData.create(train_path, valid_path, model_dir, config_name)

For FactRuEval we use the following set of labels:

In [6]:
print(data.label2idx)

{'<pad>': 0, '<bos>': 1, '<eos>': 2, 'O': 3, 'B_ORG': 4, 'I_ORG': 5, 'B_LOC': 6, 'B_PER': 7, 'I_PER': 8, 'I_LOC': 9}
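
A mapping like the one printed above can be obtained by reserving the first indices for special tokens and then numbering labels in order of first appearance. The sketch below is an assumption about how such a vocabulary could be built, not the library's actual code; `build_label_vocab` and the toy rows are hypothetical. It also shows why slicing `id2label[4:]` skips the special tokens and `O`, leaving only entity labels.

```python
SPECIAL_TOKENS = ["<pad>", "<bos>", "<eos>"]


def build_label_vocab(label_rows):
    """Map labels to indices: special tokens first, then order of first appearance."""
    label2idx = {token: i for i, token in enumerate(SPECIAL_TOKENS)}
    for row in label_rows:
        for label in row.split():
            if label not in label2idx:
                label2idx[label] = len(label2idx)
    # Inverse mapping as a list, so that id2label[i] is the label with index i.
    id2label = sorted(label2idx, key=label2idx.get)
    return label2idx, id2label


# Toy label rows (not taken from the dataset).
rows = ["O O B_ORG I_ORG O", "O B_LOC O", "B_PER I_PER O I_LOC"]
label2idx, id2label = build_label_vocab(rows)
print(label2idx)
print(id2label[4:])  # entity labels only
```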


### 2. Create model
To create the PyTorch model, we need to create a `NerModel` object.

* `label_size` - number of labels: `len(data.label2idx)`,

ElmoEmbedder params
* `model_dir` - path to ELMo pretrained model's dir
* `config_name` - name of config in `model_dir` folder
* `embedding_dim` - output dimension of the ELMo model (default `768`)
* `elmo_mode` - how the ELMo output is returned. If `avg`, return the mean of all ELMo output layers. If `weighted`, return a weighted sum of all ELMo output layers, where the weights are learnable (as in the original ELMo).
* `freeze` - freeze the ELMo model (default `True`)

ElmoBiLSTMEncoder params
* `enc_hidden_dim` - dim of rnn layer or hidden layer (default `128`)
* `rnn_layers` - number of rnn layers in encoder

CRFDecoder params
* `input_dropout` - dropout param (default `0.5`),

Gpu or cpu:
* `use_cuda` - use cuda or cpu (default `True`).
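
The two `elmo_mode` options differ only in how the per-layer vectors are combined. Here is a minimal pure-Python sketch for a single token. It assumes the `weighted` mode follows the scalar-mix idea from the ELMo paper (softmax-normalized layer weights times a scale `gamma`); the function names are hypothetical, and in a real model the raw weights and `gamma` would be trainable parameters.

```python
import math


def average_layers(layers):
    """'avg' mode: element-wise mean over the layer vectors."""
    n = len(layers)
    return [sum(values) / n for values in zip(*layers)]


def weighted_layers(layers, raw_weights, gamma=1.0):
    """'weighted' mode (assumed): softmax-normalized weighted sum, scaled by gamma."""
    peak = max(raw_weights)  # subtract the max for numerical stability
    exps = [math.exp(w - peak) for w in raw_weights]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [
        gamma * sum(w * v for w, v in zip(weights, values))
        for values in zip(*layers)
    ]


# Three layers, each a 2-dimensional vector for one token (toy numbers).
layers = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
print(average_layers(layers))
print(weighted_layers(layers, [0.0, 0.0, 0.0]))
```

With all raw weights equal, the weighted sum coincides with the plain average up to floating-point error; training moves the weights toward the layers that help the NER task most.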

In [11]:
from modules.models.elmo_models import ElmoBiLSTMCRF

In [9]:
model = ElmoBiLSTMCRF.create(len(data.label2idx), model_dir, config_name, enc_hidden_dim=128)

INFO:root:char embedding size: 3896
INFO:root:word embedding size: 329681


In [10]:
model.decoder

CRFDecoder(
  (input_dropout): Dropout(p=0.5)
  (linear): Linears(
    (linears): ModuleList(
      (0): Linear(in_features=128, out_features=64, bias=True)
    )
    (output_linear): Linear(in_features=64, out_features=10, bias=True)
  )
  (crf): CRF()
)

### 3. Create learner

To train our PyTorch model, we need to create a `NerLearner` object.

* `model: NerModel` - pytorch model
* `data: NerData` - train and valid dataloaders
* `best_model_path` - path where the best model is stored
* `base_lr` - starting learning rate (default `0.001`)
* `lr_max` - max learning rate for scheduler (default `0.01`)
* `betas` - params for default optimizer (default `[0.8, 0.9]`)
* `clip` - gradient clipping (default `0.25`)
* `verbose` - print reports to the console (default `True`)
* `use_lr_scheduler` - whether to use the cyclic learning rate scheduler (default `True`)
* `sup_labels` - list of supported labels for calculating `target_metric` metric. For FactRuEval use: `['B_LOC', 'I_LOC', 'B_ORG', 'I_ORG', 'B_PER', 'I_PER']` (default `None`)
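
To illustrate how `base_lr` and `lr_max` interact under a cyclic scheduler, here is a small sketch. The notebook does not say which cyclic policy the library uses, so this assumes the simple triangular policy; `triangular_lr` and `half_cycle` are hypothetical names introduced for illustration.

```python
def triangular_lr(step, base_lr=0.001, lr_max=0.01, half_cycle=100):
    """Learning rate at a given step under a triangular cyclic policy (assumed)."""
    cycle_pos = step % (2 * half_cycle)
    # 1.0 at the start and end of a cycle, 0.0 at the peak in the middle.
    distance_from_peak = abs(cycle_pos - half_cycle) / half_cycle
    return base_lr + (lr_max - base_lr) * (1.0 - distance_from_peak)


for step in (0, 50, 100, 150, 200):
    print(step, round(triangular_lr(step), 5))
```

The rate climbs linearly from `base_lr` to `lr_max` over `half_cycle` steps, falls back over the next `half_cycle` steps, and then repeats.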

In [12]:
from modules import NerLearner

In [13]:
learner = NerLearner(model, data,
                     best_model_path="/datadrive/models/factrueval/elmo_bilmcrf.cpt",
                     base_lr=0.001, lr_max=0.005, clip=5.0, use_lr_scheduler=False, sup_labels=data.id2label[4:])

INFO:root:Don't use lr scheduler...


### 4. Learn your NER model
Call `learner.fit` with:
* `epochs` - number of training epochs (default `100`)
* `resume_history` - whether to keep appending results to the existing history or start a new one (default `True`)
* `target_metric` - averaged metric used to pick the best epoch (default `f1`).

In [14]:
learner.fit(1, target_metric='prec')

INFO:root:Resuming train... Current epoch 0.



INFO:root:
epoch 1, average train epoch loss=4.9233








INFO:root:on epoch 0 by max_prec: 0.748
INFO:root:Saving new best model...


              precision    recall  f1-score   support

       B_ORG      0.777     0.765     0.771       260
       I_ORG      0.692     0.721     0.706       283
       B_LOC      0.836     0.836     0.836       195
       B_PER      0.917     0.921     0.919       191
       I_PER      0.935     0.885     0.909       130
       I_LOC      0.333     0.171     0.226        35

   micro avg      0.800     0.789     0.794      1094
   macro avg      0.748     0.717     0.728      1094
weighted avg      0.794     0.789     0.791      1094
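
The logged `max_prec: 0.748` coincides with the `macro avg` precision in the report, which suggests (this is an inference from the output, not something the notebook states) that the target metric is the unweighted mean over the supported labels. The sketch below reproduces that number from the per-label values and shows one plausible best-epoch rule; `macro_average` and `is_new_best` are hypothetical helpers.

```python
# Per-label precision copied from the report above.
per_label_precision = {
    "B_ORG": 0.777, "I_ORG": 0.692, "B_LOC": 0.836,
    "B_PER": 0.917, "I_PER": 0.935, "I_LOC": 0.333,
}


def macro_average(scores):
    """Unweighted mean: every label counts equally, regardless of support."""
    return sum(scores.values()) / len(scores)


def is_new_best(score, best_so_far):
    """Assumed rule: save the model only when the target metric improves."""
    return best_so_far is None or score > best_so_far


score = macro_average(per_label_precision)
print(round(score, 3))           # 0.748
print(is_new_best(score, None))  # True
```

Because the macro average ignores support, the rare `I_LOC` label (35 tokens) pulls the score down as much as any frequent label would.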



Fit longer to obtain the best model.

In [None]:
learner.fit(100, target_metric='prec')

### 5. Predict on new data
Create a new data loader from an existing path.

In [16]:
from modules.data.elmo_data import get_elmo_data_loader_for_predict

In [17]:
dl = get_elmo_data_loader_for_predict(valid_path, learner)

Load our best model.

In [18]:
learner.load_model()

Call `predict` on the learner.

In [19]:
preds = learner.predict(dl)


### 6. Transform predictions to tokens and spans

In [20]:
from modules.utils.utils import tokens2spans


tp, lp = [x.tokens for x in dl.dataset], preds
print(tp[0])
print(lp[0])

['<bos>', 'Сделка', 'состоится', ',', 'если', 'будет', 'одобрена', 'регуляторами', ',', 'из-за', 'которых', 'в', 'начале', 'года', 'сорвалось', 'слияние', 'NYSE', 'Euronext', 'с', 'Deutsche', 'Börse', '<eos>']
['<bos>', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B_ORG', 'I_ORG', 'O', 'B_ORG', 'I_ORG', '<eos>']


In [21]:
sp = tokens2spans(tp, lp)

In [22]:
print(sp[0])

[('<bos>', '<bos>'), ('Сделка', 'O'), ('состоится', 'O'), (',', 'O'), ('если', 'O'), ('будет', 'O'), ('одобрена', 'O'), ('регуляторами', 'O'), (',', 'O'), ('из-за', 'O'), ('которых', 'O'), ('в', 'O'), ('начале', 'O'), ('года', 'O'), ('сорвалось', 'O'), ('слияние', 'O'), ('NYSE Euronext', 'ORG'), ('с', 'O'), ('Deutsche Börse', 'ORG')]
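
The merging step can be pictured as follows. This is a minimal sketch of IOB-to-span grouping, assuming the `B_`/`I_` prefix convention seen in the labels above. `iob_to_spans` is a hypothetical function and not the library's `tokens2spans`, so its handling of special tokens may differ from the output shown.

```python
def iob_to_spans(tokens, labels):
    """Group consecutive B_/I_ tokens of one entity type into (text, type) pairs."""
    spans = []
    entity_tokens, entity_type = [], None
    for token, label in zip(tokens, labels):
        prefix, _, kind = label.partition("_")
        continues = prefix == "I" and kind == entity_type and bool(entity_tokens)
        if continues:
            entity_tokens.append(token)
            continue
        if entity_tokens:  # the running entity ends here
            spans.append((" ".join(entity_tokens), entity_type))
            entity_tokens, entity_type = [], None
        if prefix in ("B", "I") and kind:  # a stray I_ tag also opens a span
            entity_tokens, entity_type = [token], kind
        else:  # 'O' and special tokens pass through unchanged
            spans.append((token, label))
    if entity_tokens:  # flush an entity that reaches the end of the sentence
        spans.append((" ".join(entity_tokens), entity_type))
    return spans


tokens = ["слияние", "NYSE", "Euronext", "с", "Deutsche", "Börse"]
labels = ["O", "B_ORG", "I_ORG", "O", "B_ORG", "I_ORG"]
print(iob_to_spans(tokens, labels))
```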


### 7. Evaluate

Token-level (IOB) metrics

In [23]:
from modules.train.train import validate_step
print(validate_step(learner.data.valid_dl, learner.model, learner.data.id2label, learner.sup_labels))


              precision    recall  f1-score   support

       B_ORG      0.875     0.808     0.840       260
       I_ORG      0.906     0.749     0.820       283
       B_LOC      0.907     0.897     0.902       195
       B_PER      0.953     0.963     0.958       191
       I_PER      0.969     0.962     0.965       130
       I_LOC      0.808     0.600     0.689        35

   micro avg      0.913     0.847     0.879      1094
   macro avg      0.903     0.830     0.862      1094
weighted avg      0.911     0.847     0.877      1094



Span-level metrics

In [25]:
from modules.utils.plot_metrics import get_elmo_span_report
clf_report = get_elmo_span_report(dl, preds)
print(clf_report)

              precision    recall  f1-score   support

         PER      0.881     0.890     0.885       191
         LOC      0.850     0.841     0.845       195
         ORG      0.824     0.773     0.798       260

   micro avg      0.849     0.828     0.839       646
   macro avg      0.851     0.835     0.843       646
weighted avg      0.848     0.828     0.838       646
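
Span-level scores are lower than token-level ones because a span counts only when it matches as a whole. The sketch below assumes exact matching on start, end and entity type; `extract_spans` and `span_scores` are hypothetical helpers, and the actual matching rule inside `get_elmo_span_report` is not shown in this notebook.

```python
def extract_spans(labels):
    """Return a set of (start, end, type) triples; end is exclusive."""
    spans, start, kind = set(), None, None
    for i, label in enumerate(list(labels) + ["O"]):  # sentinel closes a trailing span
        prefix, _, entity = label.partition("_")
        if prefix == "I" and entity == kind and start is not None:
            continue  # still inside the current span
        if start is not None:
            spans.add((start, i, kind))
            start, kind = None, None
        if prefix in ("B", "I") and entity:
            start, kind = i, entity
    return spans


def span_scores(gold_rows, pred_rows):
    """Micro-averaged precision, recall and F1 over exactly matching spans."""
    tp = n_pred = n_gold = 0
    for gold, pred in zip(gold_rows, pred_rows):
        g, p = extract_spans(gold), extract_spans(pred)
        tp += len(g & p)
        n_pred += len(p)
        n_gold += len(g)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = [["O", "B_ORG", "I_ORG", "O", "B_ORG", "I_ORG"]]
pred = [["O", "B_ORG", "I_ORG", "O", "B_ORG", "O"]]
print(span_scores(gold, pred))  # (0.5, 0.5, 0.5)
```

In this toy case three of the four gold entity tokens are tagged correctly, yet only one of the two spans matches exactly.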

