# Named Entity Recognition on the CoNLL++ Dataset

---

[Article](https://news.machinelearning.sg/posts/train_a_named_entity_recognition_model_using_flair) | [Github](https://github.com/eugenesiow/practical-ml/blob/master/notebooks/Named_Entity_Recognition_CoNLLpp.ipynb) | More Notebooks @ [eugenesiow/practical-ml](https://github.com/eugenesiow/practical-ml)

---

This notebook trains a [flair](https://github.com/flairNLP/flair) model using stacked embeddings (word embeddings plus flair contextual embeddings) to perform named entity recognition (NER). The dataset is the [CoNLL 2003](https://www.aclweb.org/anthology/W03-0419.pdf) NER dataset (train, dev) with a manually corrected (improved/cleaned) test set from the [CrossWeigh paper](https://arxiv.org/abs/1909.01441), called [CoNLL++](https://github.com/ZihanWangKi/CrossWeigh#data). The current state-of-the-art model on this dataset is from the CrossWeigh paper (also using flair) by [Wang et al. (2019)](https://www.aclweb.org/anthology/D19-1519/), with an F1-score of [94.3%](http://nlpprogress.com/english/named_entity_recognition.html). Without pooled embeddings or CrossWeigh, and training for at most 50 epochs instead of 150, we get a micro F1-score of 93.5%, about 0.8 of a percentage point below the SOTA.

The notebook is structured as follows:
* Setting up the GPU Environment
* Getting Data
* Training and Testing the Model
* Using the Model (Running Inference)

## Task Description

> Named entity recognition (NER) is the task of tagging entities in text with their corresponding type. Approaches typically use BIO notation, which differentiates the beginning (B) and the inside (I) of entities. O is used for non-entity tokens.
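
The BIO scheme described above can be illustrated with a minimal sketch. The sentence, its tags and the helper name `bio_to_spans` are made up for illustration; they are not part of flair or of the CoNLL++ data.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, entity_type) spans."""
    spans = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_new:
            # Close any open entity, then open a new one.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-"):
            # Continue the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O": the token is outside any entity, so close the open one.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


# Illustrative example sentence (an assumption, not taken from the dataset).
tokens = ["Eugene", "lives", "in", "New", "York", "."]
tags = ["B-PER", "O", "O", "B-LOC", "I-LOC", "O"]
spans = bio_to_spans(tokens, tags)
print(spans)
```

Here `New` opens a location entity with `B-LOC` and `York` continues it with `I-LOC`, so the two tokens end up in a single span.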

# Setting up the GPU Environment

#### Ensure we have a GPU runtime

If you're running this notebook in Google Colab, select `Runtime` > `Change Runtime Type` from the menu bar and ensure that `GPU` is selected as the `Hardware accelerator`. This lets us train the model on the GPU later.
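
As a quick sanity check, the sketch below reports whether a CUDA GPU is visible to PyTorch. It assumes PyTorch is already installed, as it normally is in Colab. The helper name `gpu_available` is hypothetical, and it returns `False` when PyTorch cannot be imported.

```python
def gpu_available() -> bool:
    """Return True if PyTorch is installed and can see a CUDA device."""
    try:
        import torch
    except ImportError:
        # PyTorch is missing, so no GPU can be used from it.
        return False
    return torch.cuda.is_available()


print("GPU available:", gpu_available())
```

If this prints `False` in Colab, re-check the runtime type before starting the training.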

#### Install Dependencies

In [1]:
pip install -q flair



# Getting Data

We download the data (train, test and dev sets) in the BIO format (each token in each sentence is tagged with a Begin, Inside or Outside label) as text files from the [CoNLL++ repository](https://github.com/ZihanWangKi/CrossWeigh) and save them to the `/content/data/` folder.

In [2]:
import urllib.request
from pathlib import Path

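def download_file(url, output_file):
  # Create the destination folder if it does not exist yet.
  Path(output_file).parent.mkdir(parents=True, exist_ok=True)
  # Fetch the URL and save it to output_file.
  urllib.request.urlretrieve(url, output_file)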

download_file('https://raw.githubusercontent.com/ZihanWangKi/CrossWeigh/master/data/conllpp_train.txt', '/content/data/conllpp_train.txt')
download_file('https://raw.githubusercontent.com/ZihanWangKi/CrossWeigh/master/data/conllpp_dev.txt', '/content/data/conllpp_dev.txt')
download_file('https://raw.githubusercontent.com/ZihanWangKi/CrossWeigh/master/data/conllpp_test.txt', '/content/data/conllpp_test.txt')

Now we will use flair's built-in `ColumnCorpus` object to load our `conllpp_train.txt`, `conllpp_test.txt` and `conllpp_dev.txt` files from the `/content/data/` folder.

In [8]:
from flair.data import Corpus
from flair.datasets import ColumnCorpus
columns = {0: 'text', 3: 'ner'}
corpus: Corpus = ColumnCorpus('/content/data/', columns,
                              train_file='conllpp_train.txt',
                              test_file='conllpp_test.txt',
                              dev_file='conllpp_dev.txt')

2020-12-22 08:56:56,625 Reading data from /content/data
2020-12-22 08:56:56,627 Train: /content/data/conllpp_train.txt
2020-12-22 08:56:56,629 Dev: /content/data/conllpp_dev.txt
2020-12-22 08:56:56,631 Test: /content/data/conllpp_test.txt
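
To make the `columns` mapping above more concrete, here is a minimal sketch of how a column-formatted file can be read: one token per line, whitespace-separated columns, and a blank line between sentences. The sample text and the helper name `read_column_data` are illustrative assumptions; this is not flair's actual loader.

```python
def read_column_data(text, columns):
    """Parse column-formatted text into a list of sentences.

    Each sentence is a list of dicts that keeps only the columns
    named in `columns` (a mapping of column index to field name).
    """
    sentences, current = [], []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            # A blank line marks the end of the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        current.append({name: fields[idx] for idx, name in columns.items()})
    if current:
        sentences.append(current)
    return sentences


# Illustrative sample in a four-column layout: token, POS, chunk, NER tag.
sample = """EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
"""

columns = {0: 'text', 3: 'ner'}
parsed = read_column_data(sample, columns)
print(len(parsed), parsed[0][0])
```

Column 0 holds the token and column 3 the NER tag, which is why the mapping skips columns 1 and 2.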


To check that the number of sentences in the train, test and development sets tallies exactly with Table 1 (train: 14987, test: 3684, dev: 3466) of the [CoNLL 2003 paper](https://www.aclweb.org/anthology/W03-0419.pdf), we take the `len()` of the `.train`, `.test` and `.dev` sets of the `ColumnCorpus` object and print them as a table.

In [9]:
import pandas as pd
data = [[len(corpus.train), len(corpus.test), len(corpus.dev)]]
# Prints out the dataset sizes of train test and development in a table.
pd.DataFrame(data, columns=["Train", "Test", "Development"])

|   | Train | Test | Development |
|---|-------|------|-------------|
| 0 | 14987 | 3684 | 3466 |


# Training and Testing the Model

#### Train the Model

To train the flair `SequenceTagger`, we use the `ModelTrainer` object with the corpus and the tagger to be trained. We use flair's sensible default options while specifying the output folder for the `SequenceTagger` model to be `/content/model/conllpp`. We also set the `embeddings_storage_mode` to be `gpu` to utilise the GPU. Note that if you run this with a larger dataset than CoNLL++ and you run out of GPU memory, be sure to set this option to `cpu`.

Be prepared to allow the training to run for a few hours.

In [17]:
import flair
from typing import List
from flair.trainers import ModelTrainer
from flair.models import SequenceTagger
from flair.embeddings import TokenEmbeddings, WordEmbeddings, StackedEmbeddings, FlairEmbeddings

tag_type = 'ner'
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)

# For faster training and a smaller model, we can comment out the flair embeddings.
# This will significantly reduce the accuracy of the model though.
embedding_types: List[TokenEmbeddings] = [
    WordEmbeddings('glove'),
    FlairEmbeddings('news-forward'),
    FlairEmbeddings('news-backward'),
]

embeddings: StackedEmbeddings = StackedEmbeddings(embeddings=embedding_types)

tagger: SequenceTagger = SequenceTagger(hidden_size=256,
                                        embeddings=embeddings,
                                        tag_dictionary=tag_dictionary,
                                        tag_type=tag_type,
                                        use_crf=True)

trainer: ModelTrainer = ModelTrainer(tagger, corpus)

trainer.train('/content/model/conllpp',
              learning_rate=0.1,
              mini_batch_size=32,
              max_epochs=50,
              embeddings_storage_mode='gpu')

2020-12-22 09:46:18,939 epoch 39 - iter 322/469 - loss 0.20560896 - samples/sec: 262.66 - lr: 0.025000
2020-12-22 09:46:24,727 epoch 39 - iter 368/469 - loss 0.20580710 - samples/sec: 254.50 - lr: 0.025000
2020-12-22 09:46:30,152 epoch 39 - iter 414/469 - loss 0.20611729 - samples/sec: 271.58 - lr: 0.025000
2020-12-22 09:46:35,687 epoch 39 - iter 460/469 - loss 0.20844374 - samples/sec: 266.10 - lr: 0.025000
2020-12-22 09:46:36,807 ----------------------------------------------------------------------------------------------------
2020-12-22 09:46:36,809 EPOCH 39 done: loss 0.2097 - lr 0.0250000
2020-12-22 09:46:41,879 DEV : loss 0.3880603313446045 - score 0.9572
2020-12-22 09:46:41,898 BAD EPOCHS (no improvement): 1
2020-12-22 09:46:41,899 ----------------------------------------------------------------------------------------------------
2020-12-22 09:46:47,401 epoch 40 - iter 46/469 - loss 0.18646228 - samples/sec: 267.78 - lr: 0.025000
2020-12-22 09:46:52,816 epoch 40 - iter 92/469

{'dev_loss_history': [0.8842470645904541,
  0.6657153367996216,
  0.6118857860565186,
  0.5008137226104736,
  0.4860352575778961,
  0.4302758574485779,
  0.4089862108230591,
  0.42747071385383606,
  0.42218390107154846,
  0.3961328864097595,
  0.40525907278060913,
  0.418643593788147,
  0.39092108607292175,
  0.39777839183807373,
  0.37682512402534485,
  0.3839797377586365,
  0.3861132562160492,
  0.4215836226940155,
  0.3822997510433197,
  0.3758755326271057,
  0.3812239468097687,
  0.3922078609466553,
  0.38810276985168457,
  0.39218172430992126,
  0.37520965933799744,
  0.3797110617160797,
  0.3728858232498169,
  0.3682883679866791,
  0.37996965646743774,
  0.3890150487422943,
  0.374748557806015,
  0.3783147931098938,
  0.3799239993095398,
  0.39099475741386414,
  0.3870539367198944,
  0.3900368809700012,
  0.3956044018268585,
  0.41062259674072266,
  0.3880603313446045,
  0.40031811594963074,
  0.3976965546607971,
  0.3934156596660614,
  0.3969409465789795,
  0.3988956809043884,
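
The log above shows a learning rate of 0.025 by epoch 39, down from the initial 0.1, along with a `BAD EPOCHS (no improvement)` counter. This is consistent with a schedule that lowers the learning rate when the dev score stops improving. The following is a simplified sketch of such a reduce-on-plateau schedule, not flair's actual implementation. The halving factor, the patience of 3 and the dev scores are assumed values for illustration.

```python
def plateau_schedule(dev_scores, initial_lr=0.1, factor=0.5, patience=3):
    """Return the learning rate used in each epoch.

    After every epoch the dev score is compared with the best score so far.
    Once more than `patience` epochs in a row bring no improvement, the
    learning rate is multiplied by `factor` and the counter is reset.
    """
    lr = initial_lr
    best = float("-inf")
    bad_epochs = 0
    lrs = []
    for score in dev_scores:
        lrs.append(lr)  # learning rate used while training this epoch
        if score > best:
            best = score
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:
            lr *= factor
            bad_epochs = 0
    return lrs


# Hypothetical dev scores: the score stalls at 0.91 for several epochs.
scores = [0.90, 0.91, 0.91, 0.91, 0.91, 0.91, 0.92]
lrs = plateau_schedule(scores)
print(lrs)
```

In this toy run the learning rate is halved only after the fourth consecutive epoch without improvement, so it applies from the seventh epoch onwards.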
 

We see that our new model achieves a micro-averaged F1-score of **93.5%** (F1-score (micro) 0.9354).
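
For reference, a micro-averaged F1-score pools the true positives, false positives and false negatives over all entity types before computing precision and recall. The sketch below shows the arithmetic on made-up entity spans. It is a simplified illustration under that definition, not flair's evaluation code, and the helper name `micro_f1` is hypothetical.

```python
def micro_f1(gold, predicted):
    """Compute micro-averaged F1 over sets of entity spans.

    Each span is a tuple (sentence_id, start, end, entity_type); a predicted
    span counts as correct only if it matches a gold span exactly.
    """
    tp = len(gold & predicted)   # spans found with the right boundaries and type
    fp = len(predicted - gold)   # predicted spans that are not in the gold set
    fn = len(gold - predicted)   # gold spans that the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up example: 2 correct spans, 1 wrong type, 1 missed entity.
gold = {(0, 0, 1, "PER"), (0, 5, 6, "LOC"), (1, 2, 4, "ORG"), (1, 7, 8, "MISC")}
predicted = {(0, 0, 1, "PER"), (0, 5, 6, "ORG"), (1, 2, 4, "ORG")}
score = micro_f1(gold, predicted)
print(round(score, 4))
```

A span with the right boundaries but the wrong type counts as both a false positive and a false negative, which is why the mislabelled `LOC` span lowers precision and recall at the same time.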

# Using the Model (Running Inference)

Running the model to do some predictions/inference is as simple as calling `tagger.predict(sentence)`.

In [20]:
from flair.data import Sentence
from flair.models import SequenceTagger

input_sentence = 'My name is Eugene, I currently live in Singapore, I work for DSO.'
tagger: SequenceTagger = SequenceTagger.load("/content/model/conllpp/final-model.pt")
sentence: Sentence = Sentence(input_sentence)
tagger.predict(sentence)
print(sentence.to_tagged_string())

2020-12-22 10:02:26,649 loading file /content/model/conllpp/final-model.pt
My name is Eugene <B-PER> , I currently live in Singapore <B-LOC> , I work for DSO <B-ORG> .
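
If you need the entities as data rather than as a printed string, one option is to parse the tagged output shown above. The sketch below assumes the `token <TAG>` layout printed by this notebook; other flair versions may format the string differently. The helper name `parse_tagged_string` is hypothetical.

```python
import re


def parse_tagged_string(tagged):
    """Split a 'token <TAG>' string into (token, tag) pairs, defaulting to 'O'."""
    pairs = []
    for piece in tagged.split():
        match = re.fullmatch(r"<([^<>]+)>", piece)
        if match and pairs:
            # A <TAG> piece labels the token that came just before it.
            token, _ = pairs[-1]
            pairs[-1] = (token, match.group(1))
        else:
            pairs.append((piece, "O"))
    return pairs


output = ("My name is Eugene <B-PER> , I currently live in "
          "Singapore <B-LOC> , I work for DSO <B-ORG> .")
entities = [(token, tag) for token, tag in parse_tagged_string(output) if tag != "O"]
print(entities)
```

This keeps only the tokens that carry an entity tag and drops everything labelled `O`.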


You can connect to Google Drive with the following code to save any files you want to persist. Alternatively, click the `Files` icon on the left panel and then click `Mount Drive` to mount your Google Drive.

The root of your Google Drive will be mounted to `/content/drive/My Drive/`. If you have problems mounting the drive, you can check out this [tutorial](https://towardsdatascience.com/downloading-datasets-into-google-drive-via-google-colab-bcb1b30b0166).

In [None]:
from google.colab import drive
drive.mount('/content/drive/')

You can move the model files from our local directory to your Google Drive.

In [None]:
import shutil
shutil.move('/content/model/conllpp/', "/content/drive/My Drive/model/")

More notebooks are available at [eugenesiow/practical-ml](https://github.com/eugenesiow/practical-ml). Do drop us some feedback on how to improve the notebooks on the [GitHub repo](https://github.com/eugenesiow/practical-ml/).