# Flair Experiment

This notebook uses the Flair library to train a named entity recognition (NER) tagger on this project's corpus; the trained model can then be used to annotate text.

In [2]:
from flair.data import Corpus
from flair.datasets import ColumnCorpus

## Load corpus 

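The following cell loads the corpus from the column-formatted files in `../data/iob/`. Column 0 holds the token text and column 2 the NER tag.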

In [3]:
corpus: Corpus = ColumnCorpus('../data/iob/', 
                              {0: 'text', 2: 'ner'},
                              train_file='train.txt',
                              test_file='test.txt',
                              dev_file='dev.txt',
                              in_memory=False)
print(corpus)
#print(corpus.obtain_statistics())
#print(corpus.make_tag_dictionary(tag_type='ner'))

2021-06-23 16:38:28,501 Reading data from ../data/iob
2021-06-23 16:38:28,504 Train: ../data/iob/train.txt
2021-06-23 16:38:28,507 Dev: ../data/iob/dev.txt
2021-06-23 16:38:28,510 Test: ../data/iob/test.txt
Corpus: 32250 train + 4031 dev + 4102 test sentences
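
To make the column mapping `{0: 'text', 2: 'ner'}` concrete, here is a minimal pure-Python sketch of how such a file could be read. It is not Flair's implementation: the helper `read_columns` and the sample lines (including the middle column and the `PER`/`LOC` labels) are assumptions for illustration, since the actual files are not shown in this notebook.

```python
from typing import Dict, List

# Hypothetical sample in a whitespace-separated column format:
# column 0 = token, column 1 = not used here, column 2 = IOB tag.
# A blank line separates two sentences.
SAMPLE = """\
Anna NE B-PER
wohnt VVFIN O
in APPR O
Berlin NE B-LOC
. $. O

Sie PPER O
liest VVFIN O
. $. O
"""


def read_columns(text: str, columns: Dict[int, str]) -> List[List[Dict[str, str]]]:
    """Split column-formatted text into sentences of {name: value} tokens."""
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split()
        # Keep only the columns named in the mapping.
        current.append({name: fields[idx] for idx, name in columns.items()})
    if current:
        sentences.append(current)
    return sentences


sentences = read_columns(SAMPLE, {0: "text", 2: "ner"})
print(len(sentences), "sentences")
print(sentences[0][0])
```

Only the columns listed in the mapping are kept, which is why column 1 never appears in the parsed tokens.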


## Prepare training
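Based on the loaded corpus, the following cell prepares the training: it builds the tag dictionary, stacks the forward and backward `FlairEmbeddings` (`de-historic-rw-forward` and `de-historic-rw-backward`), and sets up a `SequenceTagger` with a CRF layer.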

In [4]:
from flair.data import Corpus
from flair.embeddings import WordEmbeddings, StackedEmbeddings, FlairEmbeddings, BytePairEmbeddings, CharacterEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

tag_type = 'ner'
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)

embeddings: StackedEmbeddings = StackedEmbeddings(embeddings=[
#    WordEmbeddings('de'),
#    CharacterEmbeddings(),
#    BytePairEmbeddings('de'),
    FlairEmbeddings('de-historic-rw-forward'),
    FlairEmbeddings('de-historic-rw-backward')
])
    
tagger: SequenceTagger = SequenceTagger(hidden_size=256,
                                        embeddings=embeddings,
                                        tag_dictionary=tag_dictionary,
                                        tag_type=tag_type,
                                        use_crf=True)
    
trainer: ModelTrainer = ModelTrainer(tagger, corpus)

2021-06-23 16:38:42,591 https://flair.informatik.hu-berlin.de/resources/embeddings/flair/redewiedergabe_lm_forward.pt not found in cache, downloading to /tmp/tmp8_pwenh3
100%|██████████| 72819063/72819063 [00:32<00:00, 2229617.09B/s]
2021-06-23 16:39:15,415 copying /tmp/tmp8_pwenh3 to cache at /home/app/.flair/embeddings/redewiedergabe_lm_forward.pt
2021-06-23 16:39:15,477 removing temp file /tmp/tmp8_pwenh3
2021-06-23 16:39:15,837 https://flair.informatik.hu-berlin.de/resources/embeddings/flair/redewiedergabe_lm_backward.pt not found in cache, downloading to /tmp/tmpm6luyj2y
100%|██████████| 72819073/72819073 [00:33<00:00, 2182591.58B/s]
2021-06-23 16:39:49,460 copying /tmp/tmpm6luyj2y to cache at /home/app/.flair/embeddings/redewiedergabe_lm_backward.pt
2021-06-23 16:39:49,508 removing temp file /tmp/tmpm6luyj2y
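
The tag dictionary created by `corpus.make_tag_dictionary(tag_type=tag_type)` gives the tagger a fixed set of labels to predict. As a rough, library-independent sketch of that idea, the following maps each distinct tag to an index. The helper `build_tag_index` and the sample tags are hypothetical, and Flair's own dictionary may hold additional special entries.

```python
from typing import Dict, Iterable, List


def build_tag_index(tagged_sentences: Iterable[List[str]]) -> Dict[str, int]:
    """Give every distinct tag an integer index, in order of first appearance."""
    tag_to_idx: Dict[str, int] = {}
    for tags in tagged_sentences:
        for tag in tags:
            if tag not in tag_to_idx:
                tag_to_idx[tag] = len(tag_to_idx)
    return tag_to_idx


# Hypothetical IOB tag sequences for two sentences.
example_tags = [
    ["B-PER", "O", "O", "B-LOC", "O"],
    ["O", "B-PER", "I-PER", "O"],
]
tag_index = build_tag_index(example_tags)
print(tag_index)
```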


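## Train model

The following cell trains the tagger for up to 150 epochs with a learning rate of 0.1 and a mini-batch size of 32. The model files are written to `../resources/ner_models/ner-experiment_xx`.
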
In [None]:
trainer.train('../resources/ner_models/ner-experiment_xx',
              learning_rate=0.1,
              mini_batch_size=32,
              embeddings_storage_mode="cpu",
              shuffle=False,
              max_epochs=150)
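
Once training has finished, the saved model could be used to annotate new text. The sketch below assumes that Flair is installed and that the trainer wrote a `final-model.pt` file into the output directory given above. The helper name `annotate` and its default path are hypothetical, and the Flair imports sit inside the function so that defining it does not load the library.

```python
from pathlib import Path
from typing import List, Tuple

# Output directory passed to trainer.train() above.
MODEL_DIR = Path("../resources/ner_models/ner-experiment_xx")


def annotate(text: str, model_path: Path = MODEL_DIR / "final-model.pt") -> List[Tuple[str, str]]:
    """Tag a piece of text and return (entity text, label) pairs."""
    # Imported lazily: these require an installed Flair and a trained model.
    from flair.data import Sentence
    from flair.models import SequenceTagger

    tagger = SequenceTagger.load(str(model_path))
    sentence = Sentence(text)
    tagger.predict(sentence)
    # Collect the entity spans predicted for the 'ner' tag type.
    return [(span.text, span.tag) for span in sentence.get_spans("ner")]
```

Calling `annotate(...)` with a German sentence should return the entity spans the model finds; for repeated use it would be better to load the tagger once and reuse it.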