Resume NER Part 4: Working with Flair NLP

---

In this part we will use flair to train a model on our data and evaluate the results. Make sure you have set up your Google account and uploaded your files to Google Drive. This notebook is meant to run on Google Colab.

Let's change the working directory to the Google Drive folder that holds our training data, and install flair.

In [0]:
import os
os.chdir("/content/gdrive/My Drive/SAKI_2019/dataset") 
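
The `os.chdir` call above only succeeds once Google Drive is mounted in the Colab runtime. Here is a minimal sketch of that step, assuming the standard `google.colab.drive.mount` helper and the `/content/gdrive` mount point implied by the path above. The function name `enter_dataset_dir` is hypothetical, and the mount step is skipped when the code runs outside Colab.

```python
import os


def enter_dataset_dir(dataset_dir: str, mount_point: str = "/content/gdrive") -> str:
    """Mount Google Drive when running on Colab, then chdir into dataset_dir.

    Hypothetical helper for illustration; not part of flair or Colab.
    """
    try:
        # google.colab is only importable inside a Colab runtime
        from google.colab import drive
    except ImportError:
        drive = None

    if drive is not None:
        # asks for authorization the first time it runs in a session
        drive.mount(mount_point)

    if not os.path.isdir(dataset_dir):
        raise FileNotFoundError(f"dataset folder not found: {dataset_dir}")

    os.chdir(dataset_dir)
    return os.getcwd()
```

On Colab this could be called as `enter_dataset_dir("/content/gdrive/My Drive/SAKI_2019/dataset")`. Failing early with a clear error is a design choice here: a missing folder is easier to diagnose at this point than later, when the corpus loader cannot find its files.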

In [2]:
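# install the flair library
# note: the cells below use the flair 0.4-era API (TaggedCorpus, NLPTaskDataFetcher);
# later flair releases may have renamed or removed these classes
! pip install flair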



In the next section, we will train an NER model with flair. The code is taken from section 7 of the flair tutorials, "Training a model":
https://github.com/zalandoresearch/flair/blob/master/resources/docs/TUTORIAL_7_TRAINING_A_MODEL.md



In [0]:
# imports 
from flair.data import TaggedCorpus
from flair.data_fetcher import NLPTaskDataFetcher, NLPTask
from typing import List

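# column format of the data files: maps a 0-based column index to a field name
# (column 3 holds the token text, column 1 the "gold standard" ner annotation)
columns = {3: 'text', 1: 'ner'}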

# folder where training and test data are
data_folder = '/content/gdrive/My Drive/SAKI_2019/dataset/flair'

# 2. what tag do we want to predict?
tag_type = 'ner'
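
To make the `columns` mapping concrete, here is a small sketch of how a column-formatted file can be read: one token per line, whitespace-separated fields, and a blank line between sentences. This is not flair's loader, only the same idea in plain Python. The function `read_column_sentences` and the sample rows are made up for illustration, and the contents of columns 0 and 2 in the real files are an assumption.

```python
from typing import Dict, List


def read_column_sentences(text: str, columns: Dict[int, str]) -> List[List[Dict[str, str]]]:
    """Parse column-formatted text into sentences of {field: value} tokens.

    Illustrative only: a simplified stand-in for a column corpus reader.
    """
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            # a blank line ends the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        # keep only the columns named in the mapping (0-based indices)
        current.append({name: fields[idx] for idx, name in columns.items()})
    if current:
        sentences.append(current)
    return sentences


# made-up sample rows: index, tag, filler, token
sample = """0 B-Degree x Bachelor
1 I-Degree x of
2 L-Degree x Science

3 O x at
4 U-Companies x Oracle
"""

columns = {3: 'text', 1: 'ner'}
for sentence in read_column_sentences(sample, columns):
    print([(token['text'], token['ner']) for token in sentence])
```

Each printed list is one sentence as (token, tag) pairs. A row with fewer fields than the highest mapped index would raise an `IndexError` in this sketch, which is one quick way to spot malformed lines.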



In [4]:
downsample = 1.0 # fraction of the data to use: 1.0 is the full corpus, try a much smaller value like 0.01 for a quick test run
# 1. get the corpus
corpus: TaggedCorpus = NLPTaskDataFetcher.load_column_corpus(data_folder, columns,
                                                              train_file='train_res_bilou.txt',
                                                              test_file='test_res_bilou.txt',
                                                              dev_file=None).downsample(downsample)
print(corpus)

# 3. make the tag dictionary from the corpus
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)
print(tag_dictionary.idx2item)


2019-05-29 06:40:00,461 Reading data from /content/gdrive/My Drive/SAKI_2019/dataset/flair
2019-05-29 06:40:00,465 Train: /content/gdrive/My Drive/SAKI_2019/dataset/flair/train_res_bilou.txt
2019-05-29 06:40:00,468 Dev: None
2019-05-29 06:40:00,471 Test: /content/gdrive/My Drive/SAKI_2019/dataset/flair/test_res_bilou.txt
TaggedCorpus: 250339 train + 27816 dev + 123300 test sentences
[b'<unk>', b'O', b'ner', b'"B-Companies', b'"L-Companies', b'B-Designation', b'L-Designation', b'B-Degree', b'I-Degree', b'L-Degree', b'"I-Companies', b'U-Degree', b'U-Designation', b'-', b'I-Designation', b'"U-Companies', b'<START>', b'<STOP>']
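
The dictionary printed above contains a few entries that do not look like BILOU tags: `ner`, `-`, and the Companies tags with a leading double quote. This may point to a header row or quoting artifacts in the data files, although that is only a guess from the output. The sketch below flags such entries. The regular expression and the helper name `suspicious_tags` are assumptions made for illustration, not part of flair.

```python
import re
from typing import Iterable, List

# BILOU scheme: "O", or one of B/I/L/U followed by "-" and an entity label
BILOU_TAG = re.compile(r"^(O|[BILU]-[A-Za-z_]+)$")
# special entries that appear in the printed dictionary alongside the tags
SPECIAL_ITEMS = {"<unk>", "<START>", "<STOP>"}


def suspicious_tags(items: Iterable[bytes]) -> List[str]:
    """Return dictionary entries that are neither special items nor BILOU tags."""
    tags = [item.decode("utf-8") for item in items]
    return [t for t in tags if t not in SPECIAL_ITEMS and not BILOU_TAG.match(t)]


# the list printed by tag_dictionary.idx2item above
printed = [b'<unk>', b'O', b'ner', b'"B-Companies', b'"L-Companies',
           b'B-Designation', b'L-Designation', b'B-Degree', b'I-Degree',
           b'L-Degree', b'"I-Companies', b'U-Degree', b'U-Designation', b'-',
           b'I-Designation', b'"U-Companies', b'<START>', b'<STOP>']

print(suspicious_tags(printed))
```

Any tag reported here is worth tracing back to the training file before a long training run, since the tagger will otherwise learn it as a separate class.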


In [7]:

# 4. initialize embeddings. Experiment with different embedding types to see what gets the best results
from flair.embeddings import TokenEmbeddings, WordEmbeddings, StackedEmbeddings, FlairEmbeddings
embedding_types: List[TokenEmbeddings] = [
    WordEmbeddings('glove'),
    # uncomment this line to use character embeddings
    # (and add CharacterEmbeddings to the import above)
    # CharacterEmbeddings(),

    # flair embeddings: comment these lines out for faster training (they need a LONG time to train :-)
    FlairEmbeddings('news-forward'),
    FlairEmbeddings('news-backward'),
]

embeddings: StackedEmbeddings = StackedEmbeddings(embeddings=embedding_types)

# 5. initialize sequence tagger
from flair.models import SequenceTagger

tagger: SequenceTagger = SequenceTagger(hidden_size=256,
                                        embeddings=embeddings,
                                        tag_dictionary=tag_dictionary,
                                        tag_type=tag_type,
                                        use_crf=True)




2019-05-29 06:45:35,331 this function is deprecated, use smart_open.open instead
2019-05-29 06:45:38,247 https://s3.eu-central-1.amazonaws.com/alan-nlp/resources/embeddings-v0.4.1/big-news-forward--h2048-l1-d0.05-lr30-0.25-20/news-forward-0.4.1.pt not found in cache, downloading to /tmp/tmpxwei3ktf
100%|██████████| 73034624/73034624 [00:01<00:00, 52411692.82B/s]
2019-05-29 06:45:39,802 copying /tmp/tmpxwei3ktf to cache at /root/.flair/embeddings/news-forward-0.4.1.pt
2019-05-29 06:45:39,895 removing temp file /tmp/tmpxwei3ktf
2019-05-29 06:45:40,252 https://s3.eu-central-1.amazonaws.com/alan-nlp/resources/embeddings-v0.4.1/big-news-backward--h2048-l1-d0.05-lr30-0.25-20/news-backward-0.4.1.pt not found in cache, downloading to /tmp/tmpti48vaf0
100%|██████████| 73034575/73034575 [00:02<00:00, 34685957.84B/s]
2019-05-29 06:45:42,580 copying /tmp/tmpti48vaf0 to cache at /root/.flair/embeddings/news-backward-0.4.1.pt
2019-05-29 06:45:42,700 removing temp file /tmp/tmpti48vaf0
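
`StackedEmbeddings` combines its members by concatenating their vectors for each token, so the tagger's input size is the sum of the individual sizes. The sketch below illustrates this with plain lists. The sizes are assumptions: 2048 for each flair model is read off the `h2048` in the download URLs above, and 100 for `glove` is assumed rather than confirmed by this notebook.

```python
from typing import Dict, List

# assumed vector sizes, for illustration only
ASSUMED_DIMS: Dict[str, int] = {
    "glove": 100,
    "news-forward": 2048,
    "news-backward": 2048,
}


def stack(vectors: List[List[float]]) -> List[float]:
    """Concatenate the per-embedding vectors of one token into a single vector."""
    stacked: List[float] = []
    for vector in vectors:
        stacked.extend(vector)
    return stacked


# one dummy vector per embedding type for a single token
token_vectors = [[0.0] * dim for dim in ASSUMED_DIMS.values()]
print(len(stack(token_vectors)))
```

In the notebook itself, `embeddings.embedding_length` should report the actual combined size, which is the number to trust over the assumed values here.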


In [0]:
# 6. initialize trainer
from flair.trainers import ModelTrainer

trainer: ModelTrainer = ModelTrainer(tagger, corpus)

# 7. start training
trainer.train('resources/taggers/resume-ner',
              learning_rate=0.1,
              mini_batch_size=32,
              max_epochs=150)

# 8. plot training curves (optional)
#from flair.visual.training_curves import Plotter
#plotter = Plotter()
#plotter.plot_training_curves('resources/taggers/resume-ner/loss.tsv')
#plotter.plot_weights('resources/taggers/resume-ner/weights.txt')
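
Before starting a run with `max_epochs=150`, it helps to estimate how long it could take. The sketch below derives the number of mini-batches per epoch from the 250339 training sentences reported when the corpus was loaded, which matches the `iter .../7824` counter in the training log. The timing figure is a rough assumption: about 110 seconds per 782 iterations, read off the log timestamps. It ignores evaluation time and any early stop, so treat the result as an order-of-magnitude estimate. Both function names are hypothetical.

```python
import math


def batches_per_epoch(num_sentences: int, mini_batch_size: int) -> int:
    """Number of mini-batches needed to see every training sentence once."""
    return math.ceil(num_sentences / mini_batch_size)


def estimated_hours(iters_per_epoch: int, seconds_per_iter: float, max_epochs: int) -> float:
    """Rough wall-clock estimate if every epoch runs to completion."""
    return iters_per_epoch * seconds_per_iter * max_epochs / 3600


iters = batches_per_epoch(250339, 32)
print(iters)

# assumption: roughly 110 s per 782 iterations, taken from the log timestamps
hours = estimated_hours(iters, 110 / 782, 150)
print(round(hours, 1))
```

If the estimate is longer than a Colab session is likely to stay alive, lowering `downsample` or `max_epochs` for a first pass is a reasonable way to check the pipeline end to end.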


2019-05-29 06:47:44,438 ----------------------------------------------------------------------------------------------------
2019-05-29 06:47:44,444 Evaluation method: MICRO_F1_SCORE
2019-05-29 06:47:44,460 ----------------------------------------------------------------------------------------------------
2019-05-29 06:47:44,928 epoch 1 - iter 0/7824 - loss 1.92333972
2019-05-29 06:49:35,002 epoch 1 - iter 782/7824 - loss 0.19444647
2019-05-29 06:51:25,299 epoch 1 - iter 1564/7824 - loss 0.18598820
2019-05-29 06:53:17,041 epoch 1 - iter 2346/7824 - loss 0.18227333
2019-05-29 06:55:06,784 epoch 1 - iter 3128/7824 - loss 0.17781647
2019-05-29 06:56:55,693 epoch 1 - iter 3910/7824 - loss 0.17597449
2019-05-29 06:58:44,580 epoch 1 - iter 4692/7824 - loss 0.17390317
2019-05-29 07:00:35,751 epoch 1 - iter 5474/7824 - loss 0.17285590
2019-05-29 07:02:27,766 epoch 1 - iter 6256/7824 - loss 0.17155321
2019-05-29 07:04:19,223 epoch 1 - iter 7038/7824 - loss 0.17102515
2019-05-29 07:06:11,540 ep