[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ajpar94/embeddings-comparison/blob/master/notebooks/named_entity_recognition.ipynb)
# Training a Sequence Labeling Model (Named-Entity-Recognition)

Example code for training a NER model with [flair](https://github.com/zalandoresearch/flair). See also [Tutorial: Training a Model](https://github.com/zalandoresearch/flair/blob/master/resources/docs/TUTORIAL_7_TRAINING_A_MODEL.md).

This notebook shows how to train a NER model with a **Train/Validation/Test split** and with **K-Fold Cross-Validation**.



### Google Colab setup
First, make sure to turn on 'Hardware accelerator: GPU' in *Edit > Notebook Settings*. Next, we will mount our Google Drive to easily access corpora, datasets and embeddings and to store models. Finally, install flair and configure your paths.

In [0]:
# Mount google drive
from google.colab import drive
drive.mount('/gdrive')

In [0]:
!pip install flair --quiet

In [0]:
# PATHS
base_path = "/gdrive/My Drive/WordEmbeddings-Comparison/"
lm_path = f"{base_path}Language-Modeling/resources/"
ner_path = f"{base_path}Named-Entity-Recognition/resources/"

---
## Variant 1 - Training/Validation/Test Split

### Sequence Labeling Dataset (Corpus)
A *ColumnCorpus* consists of tagged sentences and is constructed from a file in column format, where each line holds one word together with its linguistic annotation. Sentences are separated by a blank line. Example:

```text
James B-person
Watson I-person
visited O
Germany B-country
in O
2019 B-year
. O

Sam B-person
was O
not B-negation
there O
. O

```
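
To make the annotation scheme concrete, here is a minimal sketch that groups BIO-tagged tokens into entity spans. The helper name `bio_to_spans` is hypothetical and not part of flair; it only illustrates how `B-` opens an entity, `I-` continues it and `O` closes it.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (label, text) entity spans."""
    spans = []
    current_label, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_label)
        if starts_new:
            # close the running entity (if any) and open a new one
            if current_tokens:
                spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-"):
            # continue the running entity
            current_tokens.append(token)
        else:
            # "O": outside of any entity
            if current_tokens:
                spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = None, []
    if current_tokens:
        spans.append((current_label, " ".join(current_tokens)))
    return spans


tokens = ["James", "Watson", "visited", "Germany", "in", "2019", "."]
tags = ["B-person", "I-person", "O", "B-country", "O", "B-year", "O"]
spans = bio_to_spans(tokens, tags)
```

For the first example sentence, this yields one `person` span covering both name tokens, followed by a `country` span and a `year` span.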

In our example, the second column holds a NER tag in BIO format. You need three such files: train, dev and test, which correspond to the training, validation and testing splits during model training. You can also split one file by percentages using the script [*build_dataset_ner.py*](https://github.com/ajpar94/embeddings-comparison/blob/master/Named-Entity-Recognition/preprocessing/build_dataset_NER.py). Alternatively, use one of flair's prepared datasets. Then define the *ColumnCorpus*, choose which *tag* to predict and create a *tag_dictionary*. See also [Tutorial: Loading Training Data](https://github.com/zalandoresearch/flair/blob/master/resources/docs/TUTORIAL_6_CORPUS.md).
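
Splitting one tagged file by percentages could look roughly like the sketch below. This is not the implementation of the linked script. The function names, the 80/10/10 default and the fixed seed are assumptions made for illustration.

```python
import random


def read_column_sentences(text):
    """Split column-format text into sentences (lists of lines) at blank lines."""
    sentences, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


def split_sentences(sentences, train=0.8, dev=0.1, seed=42):
    """Shuffle the sentences and cut them into train/dev/test portions."""
    shuffled = sentences[:]
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_dev = int(len(shuffled) * dev)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_dev],
            shuffled[n_train + n_dev:])


# Toy input: ten one-token sentences separated by blank lines
toy_text = "\n\n".join(f"word{i} O" for i in range(10))
sentences = read_column_sentences(toy_text)
train_part, dev_part, test_part = split_sentences(sentences)
```

Each portion can then be written to its own file (train.txt, dev.txt, test.txt), with a blank line between sentences.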






In [0]:
# PREPARE CORPUS
# alternative: from flair.datasets import WIKINER_ENGLISH
from flair.datasets import ColumnCorpus

# define columns (multiple possible: ... 2: 'pos')
columns = {0: 'text', 1: 'ner'}

# this is the folder in which train, test and dev files reside
corpus_folder = f"{ner_path}datasets/tagged/"

# init a corpus using column format, data folder 
# alternative: corpus = WIKINER_ENGLISH()
corpus = ColumnCorpus(corpus_folder, columns,
                      train_file='train.txt',
                      test_file='test.txt',
                      dev_file='dev.txt')
print(corpus)

# what tag do we want to predict?
tag_type = 'ner'

# make the tag dictionary from the corpus
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)
print(tag_dictionary.idx2item)

### Embeddings
flair comes with many embeddings out of the box (see: [Tutorial: List of All Word Embeddings](https://github.com/zalandoresearch/flair/blob/master/resources/docs/TUTORIAL_4_ELMO_BERT_FLAIR_EMBEDDING.md)). Alternatively, you can point to your own custom embeddings. If you want to know how to train your own embeddings, check [Notebook: Training a Flair Language Model](#) and [Tutorial: Training your own Flair Embeddings](https://github.com/zalandoresearch/flair/blob/master/resources/docs/TUTORIAL_9_TRAINING_LM_EMBEDDINGS.md).

*StackedEmbeddings* can be used to combine multiple embeddings, which makes sense when you have a forward and a backward language model.

In [0]:
# INITIALIZE EMBEDDINGS
from flair.embeddings import FlairEmbeddings, StackedEmbeddings

# path to embeddings
lm_fwd = f"{lm_path}models/FLAIR/clean_fwd/best-lm.pt"
lm_bwd = f"{lm_path}models/FLAIR/clean_bwd/best-lm.pt"

embeddings = StackedEmbeddings([FlairEmbeddings(lm_fwd), FlairEmbeddings(lm_bwd)])
# alternative: embeddings = StackedEmbeddings([FlairEmbeddings('news-forward'), FlairEmbeddings('news-backward')])

### Sequence Tagger (NER Model)
The *SequenceTagger* accepts many more parameters (e.g. dropout). For a full list, check [here](https://github.com/zalandoresearch/flair/blob/master/flair/models/sequence_tagger_model.py#L68).

In [0]:
# INITIALIZE SEQUENCE TAGGER
from flair.models import SequenceTagger

tagger = SequenceTagger(hidden_size=512,
                        embeddings=embeddings,
                        tag_dictionary=tag_dictionary,
                        tag_type=tag_type,)

### Model Trainer
Define the path to the output/model folder. After training, this folder will usually contain:


*   final-model.pt
*   checkpoint.pt
*   weights.txt
*   loss.tsv
*   test.tsv
*   training.log

Depending on whether or not you *train_with_dev*, there will be a **best-model.pt** as well. *ModelTrainer.train()* can take a lot of optional parameters. For a full list of parameters, check [here](https://github.com/zalandoresearch/flair/blob/master/flair/trainers/trainer.py#L61).
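
Among the optional parameters used in the training cell of this section are *learning_rate*, *anneal_factor* and *patience*. The sketch below simulates how these three typically interact, assuming a reduce-on-plateau schedule: the learning rate is multiplied by *anneal_factor* once the monitored score has failed to improve for more than *patience* epochs. The function `anneal_schedule` is hypothetical and only illustrates the idea; it is not flair's code.

```python
def anneal_schedule(scores, learning_rate=0.5, anneal_factor=0.5, patience=5):
    """Return the learning rate used in each epoch.

    `scores` holds one monitored score per epoch (higher is better).
    Assumption: the rate is reduced after more than `patience`
    consecutive epochs without a new best score.
    """
    best = float("-inf")
    bad_epochs = 0
    rates = []
    for score in scores:
        rates.append(learning_rate)
        if score > best:
            best = score
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:
            learning_rate *= anneal_factor
            bad_epochs = 0
    return rates


# Toy run: the score stalls at 0.6 from the second epoch on
rates = anneal_schedule([0.5, 0.6, 0.6, 0.6, 0.6, 0.6], patience=2)
```

In this toy run the rate stays at 0.5 for five epochs and is halved for the sixth, after three epochs without improvement.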

At the end of the *training.log* you will see the relevant metrics, including the final F1 score and a classification report. For further details on how to perform an evaluation for such a model, check [Notebook: Evaluating a Sequence Labeling Model](#).


In [0]:
# INITIALIZE TRAINER
from flair.trainers import ModelTrainer

# define output path
model_folder = "ner-model-test"
model_path = f"{ner_path}models/{model_folder}/"

# option to continue from checkpoint
continue_training = False

if continue_training:
    checkpoint = tagger.load_checkpoint(model_path + 'checkpoint.pt')
    trainer = ModelTrainer.load_from_checkpoint(checkpoint, corpus)
else:
    trainer: ModelTrainer = ModelTrainer(tagger, corpus)

# Training
trainer.train(model_path,
              learning_rate=0.5,
              anneal_factor=0.5,
              mini_batch_size=8,
              patience=5,
              max_epochs=2,
              train_with_dev=True,
              monitor_test=True,
              shuffle=True,
              checkpoint=True)

---
## Variant 2 - K-Fold Cross-Validation
This section explains how to simulate K-Fold Cross-Validation (CV), e.g. 10-fold, while training the NER model. The number of folds is set with *n_splits* in the code below (3 in this example). CV is useful when you want to reliably evaluate how a specific model configuration performs but do not have a dedicated test dataset.
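
As a rough illustration of what a K-fold split does, the following sketch generates train/test index lists in plain Python. It mimics contiguous, unshuffled folds, where the first `n_samples % n_splits` folds receive one extra sample. The function `kfold_indices` is hypothetical and written only for this explanation.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_indices, test_indices) for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # the first `extra` folds get one additional sample
        size = base + (1 if fold < extra else 0)
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop


# Toy example: 7 sentences, 3 folds
folds = list(kfold_indices(7, 3))
```

Every sample ends up in the test portion of exactly one fold, so each sentence is used for evaluation once and for training in all other folds.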


### Sequence Labeling Dataset (Corpus)
Instead of using train/dev/test files, we initialize our *ColumnCorpus* with the complete file as train and empty files for dev and test. Lastly, we store *corpus.train* as a NumPy array called *TRAIN*.
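
If the empty placeholder file does not exist yet, it can be created with a few lines of Python. The temporary folder below is a hypothetical stand-in for the real dataset folder.

```python
import tempfile
from pathlib import Path

# Hypothetical folder standing in for the dataset folder
folder = Path(tempfile.mkdtemp())

# Create a zero-byte placeholder for the dev and test splits
empty_file = folder / "empty.txt"
empty_file.touch()
```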

In [0]:
# PREPARE CORPUS
from flair.datasets import ColumnCorpus

# define columns (multiple possible: ... 2: 'pos')
columns = {0: 'text', 1: 'ner'}

# this is the folder in which train, test and dev files reside
corpus_folder = f"{ner_path}datasets/tagged/"

# init a corpus using column format, data folder 
corpus = ColumnCorpus(corpus_folder, columns,
                      train_file='tagged.txt',
                      test_file='empty.txt',
                      dev_file='empty.txt')
print(corpus)

# what tag do we want to predict?
tag_type = 'ner'

# make the tag dictionary from the corpus
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)
#print(tag_dictionary.idx2item)

import numpy as np
TRAIN = np.array(corpus.train)

### Embeddings
Same as before.

In [0]:
# INITIALIZE EMBEDDINGS
from flair.embeddings import FlairEmbeddings, StackedEmbeddings

# path to embeddings
lm_fwd = f"{lm_path}models/FLAIR/clean_fwd/best-lm.pt"
lm_bwd = f"{lm_path}models/FLAIR/clean_bwd/best-lm.pt"

embeddings = StackedEmbeddings([FlairEmbeddings(lm_fwd), FlairEmbeddings(lm_bwd)])

### Sequence Tagger, Model Trainer & Evaluation
Since we use CV to evaluate model performance, we will first define a function that takes the *result* from *model.evaluate()* and returns three pandas dataframes:


*   ***tag_tfpn***: true-positive, false-positive and false-negative counts for each tag
*   ***tag_metrics***: values for precision, recall, accuracy and f1-score for each tag
*   ***metrics***: values for precision, recall and f1-score
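
The per-tag values are extracted by splitting each result line with a regular expression. The sketch below shows the idea on a single line. The sample line is an assumption about the layout of *result.detailed_results*, inferred from how the function indexes the split fields; the real text may differ between flair versions.

```python
import re

# Hypothetical sample line; the real text comes from result.detailed_results
sample = ("person     tp: 10 - fp: 2 - fn: 3 - tn: 85 - precision: 0.8333"
          " - recall: 0.7692 - accuracy: 0.6667 - f1-score: 0.8000")

# Split on " - ", on runs of spaces, or on ": "
fields = re.split(r' - | +|: ', sample)

tag = fields[0]
values = fields[2::2]                     # every second field after the tag is a number
counts = [int(v) for v in values[:3]]     # tp, fp, fn (tn is skipped)
scores = [float(v) for v in values[4:]]   # precision, recall, accuracy, f1-score
```

The slice `[2::2]` drops the tag and the field names and keeps only the numbers, which is why the counts sit in the first three positions and the scores start at position four.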


In [0]:
import pandas as pd
import numpy as np
import re

def result_summary(result):
    scores = []
    lines = result.detailed_results.split('\n')
    for line in lines[3:]:
        # split on " - ", runs of spaces, or ": " (raw string avoids invalid escape warnings)
        split_line = re.split(r' - | +|: ', line)
        scores.append(split_line)    
    scores = np.array(scores)
    tags = scores[:,0].tolist()
    scores_ = scores[:, 2::2]
    tag_tfpn = scores_[:, :3].astype(int)
    tag_metrics = scores_[:, 4:].astype(float)
    metrics = np.array(result.log_line.split('\t')).astype(float).reshape(1,3)
  
    df_tag_tfpn = pd.DataFrame(data=tag_tfpn,index=tags,columns=['true-positive','false-positive', 'false-negative'])
    df_tag_metrics = pd.DataFrame(data=tag_metrics,index=tags,columns=['precision','recall', 'accuracy','f1-score'])
    df_metrics = pd.DataFrame(data=metrics, index=None,columns=['precision','recall','f1-score'])
  
    return df_tag_tfpn, df_tag_metrics, df_metrics

Next, we will perform the actual Cross-Validation: For every Fold, we will set *corpus.test* and *corpus.train* to the respective subset of *TRAIN*. 

**WARNING:** This operation changes the type of *corpus.test* and *corpus.train* from *ColumnDataset* to a list of sentences. Training the model is not affected by this, but other corpus-specific functions might not work.

We initialize *SequenceTagger* and *ModelTrainer* like before, and set train_with_dev=True. After each fold, we evaluate the model on the current test set. In the end, we get three dataframes:

*   ***tag_tfpn_sum***: the sum of the true-positive, false-positive and false-negative counts for each tag
*   ***tag_metrics_avg***: the average precision, recall, accuracy and f1-score for each tag
*   ***metrics_avg***: the average precision, recall and f1-score

These dataframes are stored as pickle files.
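
The aggregation over folds can be illustrated without pandas. The sketch below sums counts per tag and averages scores per tag over the folds in which the tag occurs, which is the behaviour of grouping by tag and calling `sum()` or `mean()`. The helper names and the toy numbers are hypothetical.

```python
from collections import defaultdict


def sum_counts(folds):
    """Sum the tp/fp/fn counts per tag over all folds."""
    totals = defaultdict(lambda: defaultdict(int))
    for fold in folds:
        for tag, counts in fold.items():
            for name, value in counts.items():
                totals[tag][name] += value
    return {tag: dict(counts) for tag, counts in totals.items()}


def mean_scores(folds):
    """Average each score per tag over the folds in which the tag occurs."""
    collected = defaultdict(lambda: defaultdict(list))
    for fold in folds:
        for tag, scores in fold.items():
            for name, value in scores.items():
                collected[tag][name].append(value)
    return {tag: {name: sum(vals) / len(vals) for name, vals in scores.items()}
            for tag, scores in collected.items()}


# Toy results for two folds; "country" only occurs in the second fold
fold_counts = [
    {"person": {"tp": 10, "fp": 2, "fn": 3}},
    {"person": {"tp": 8, "fp": 1, "fn": 4}, "country": {"tp": 5, "fp": 0, "fn": 1}},
]
fold_scores = [
    {"person": {"f1-score": 0.8}},
    {"person": {"f1-score": 0.6}, "country": {"f1-score": 0.9}},
]
count_sum = sum_counts(fold_counts)
score_avg = mean_scores(fold_scores)
```

Note that a tag missing from a fold is averaged over fewer folds, so rare tags can have averages based on very few evaluations.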



In [0]:
from flair.trainers import ModelTrainer
from flair.datasets import DataLoader
from flair.models import SequenceTagger
from sklearn.model_selection import KFold

# Set number of splits
kf = KFold(n_splits=3)

# Cross-Validation
i=1
for train_index, test_index in kf.split(TRAIN):
    
    print("+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
    print("Fold:", i)
    corpus._test = (TRAIN[test_index]).tolist()
    corpus._train = (TRAIN[train_index]).tolist()
    print(corpus)
  
    # Initialize Sequence Tagger
    tagger = SequenceTagger(hidden_size=512,
                            embeddings=embeddings,
                            tag_dictionary=tag_dictionary,
                            tag_type=tag_type)
  
    # Initialize ModelTrainer
    trainer: ModelTrainer = ModelTrainer(tagger, corpus)
    # Define output path
    model_folder = "ner-model-CV-test"
    model_path = f"{ner_path}models/{model_folder}/"
  
    # Training
    trainer.train(model_path,
                  learning_rate=0.5,
                  anneal_factor=0.5,
                  mini_batch_size=8,
                  patience=50,
                  max_epochs=1,
                  train_with_dev=True,
                  monitor_test=True,
                  shuffle=False,
                  save_final_model=False,)
  
    # Evaluation
    result, eval_loss = trainer.model.evaluate(DataLoader(trainer.corpus.test,
                                                          batch_size=8,
                                                          num_workers=4))
    # tag_tfpn, tag_metrics, metrics
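    if i==1:
        tt, tm, m = result_summary(result)
    else:
        tt_, tm_, m_ = result_summary(result)
        # stack this fold's rows below the previous folds
        # (pd.concat works on pandas versions that no longer provide DataFrame.append)
        tt = pd.concat([tt, tt_])
        tm = pd.concat([tm, tm_])
        m = pd.concat([m, m_])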
    i+=1  

tag_tfpn_sum = tt.groupby(tt.index).sum()
tag_metrics_avg = tm.groupby(tm.index).mean()
metrics_avg = m.mean()


# pickle dump (the with-block closes each file after writing)
import pickle
results = {'tag_tfpn_sum': tag_tfpn_sum,
           'tag_metrics_avg': tag_metrics_avg,
           'metrics_avg': metrics_avg}
for name, frame in results.items():
    with open(f"{model_path}{name}.pkl", 'wb') as f:
        pickle.dump(frame, f)

In [0]:
print(metrics_avg)
print('\n-------------------------------------------------------------------\n')
print(tag_tfpn_sum)
print('\n-------------------------------------------------------------------\n')
print(tag_metrics_avg)