# Training a NER Model with Flair

In this notebook we will train a NER model with Flair on a Google Colab GPU. Let's start by installing Flair and importing the necessary libraries.

In [None]:
!pip install flair

In [None]:
import json
import logging

import pandas as pd

# Import flair modules
from flair.data import Corpus, Sentence
from flair.datasets import ColumnCorpus
from flair.embeddings import WordEmbeddings, FlairEmbeddings, StackedEmbeddings, BytePairEmbeddings, TransformerWordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer


Next we load the BIO-tagged data from a CSV file and convert it to the plain-text column files that Flair expects.

In [None]:
# Data preparation
data_path = '/content/drive/MyDrive/DS Projects/brand-ner/data/bio/data_bio_clean.csv'
data_df = pd.read_csv(data_path)
#data_df['labels'] = data_df.labels.replace('I-BRAND','B-BRAND')

In [None]:
# Write the BIO-tagged data in the column format Flair expects:
# one "word label" pair per line, with an empty line between sentences.
# (The function is called to_biluo, but the tags are written unchanged in the BIO scheme.)
path = '/content/drive/MyDrive/DS Projects/brand-ner/data/bio/'
def to_biluo(data, fn):
  sentence_df = data.groupby('sentence_id')
  # the with block closes the file, so all lines are flushed to disk
  with open(path + fn, 'w') as f:
    for name, sentence_grp in sentence_df:
      for i, item in sentence_grp.iterrows():
        word = item['words']
        tag = item['labels']
        f.write(f"{word} {tag}\n")
      f.write('\n')
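
To make the target file format concrete, here is a minimal sketch that writes two toy sentences in the same column layout, without pandas or Google Drive. The helper `write_columns` and the example tokens and tags are invented for illustration and are not part of the dataset.

```python
import io

def write_columns(sentences, handle):
    """Write one 'word tag' pair per line and an empty line after each sentence."""
    for sentence in sentences:
        for word, tag in sentence:
            handle.write(f"{word} {tag}\n")
        handle.write("\n")

# Toy data (invented for this example): two sentences with BIO tags
sentences = [
    [("Laptop", "O"), ("Dell", "B-BRAND"), ("Inspiron", "O")],
    [("Paul", "B-BRAND"), ("Smith", "I-BRAND"), ("Belt", "O")],
]

buffer = io.StringIO()  # in-memory stand-in for train.txt
write_columns(sentences, buffer)
column_text = buffer.getvalue()
print(column_text)
```

Because the columns are separated by whitespace, a word that itself contains a space would probably need to be cleaned before being written this way.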


In [None]:
data_df.sentence_id.max()

98828

Let's split the data into train, dev and test sets: the train set is used to fit the model, the dev set to monitor it during training, and the test set for the final evaluation.

In [None]:
# Creating train.txt test.txt and dev.txt

idx_train = 88000
idx_dev = 95000
idx_test = 98828

df_train = data_df[data_df.sentence_id <= idx_train]
df_dev = data_df[(data_df.sentence_id > idx_train) & (data_df.sentence_id <= idx_dev)]
df_test = data_df[(data_df.sentence_id > idx_dev) & (data_df.sentence_id <= idx_test)]

In [None]:
df_train.shape

(1011431, 3)

In [None]:
to_biluo(df_train,'train.txt')
to_biluo(df_test,'test.txt')
to_biluo(df_dev,'dev.txt')

## Building the model

We start by converting our txt files into a `Corpus` object that Flair uses to train the model. This code snippet is taken from the Flair documentation.

In [None]:
# Creating a corpus object
from flair.data import Corpus
from flair.datasets import ColumnCorpus

# define columns
columns = {0: 'text', 1: 'ner',}

# this is the folder in which train, test and dev files reside
data_folder = path

# init a corpus using column format, data folder and the names of the train, dev and test files
corpus: Corpus = ColumnCorpus(data_folder, columns,
                              train_file='train.txt',
                              test_file='test.txt',
                              dev_file='dev.txt')

2022-01-13 14:18:43,877 Reading data from /content/drive/MyDrive/DS Projects/brand-ner/data/bio
2022-01-13 14:18:43,878 Train: /content/drive/MyDrive/DS Projects/brand-ner/data/bio/train.txt
2022-01-13 14:18:43,885 Dev: /content/drive/MyDrive/DS Projects/brand-ner/data/bio/dev.txt
2022-01-13 14:18:43,886 Test: /content/drive/MyDrive/DS Projects/brand-ner/data/bio/test.txt


Let's check the size of the corpus as a basic sanity check

In [None]:
print(corpus)

Corpus: 87924 train + 6991 dev + 3826 test sentences


The tag type in our case is `ner`

In [None]:
# tag to predict
tag_type = 'ner'
# make tag dictionary from the corpus
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)



Now let's build the model:
- We combine GloVe 6B word embeddings and Flair embeddings trained on news articles in one stacked embedding
- We use a BiLSTM model with a CRF layer on top, as explained in the paper
- We then pass the `SequenceTagger` to a `ModelTrainer` to train the sequence tagging model. The training cell below runs 10 epochs with an initial learning rate of 0.03, and the learning rate is reduced after 4 bad epochs (meaning epochs with no improvement on the dev set)
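
To make the learning rate schedule concrete, here is a simplified sketch of annealing on plateau in plain Python. It is not Flair's scheduler: the function name, the halving factor of 0.5, the patience of 3 and the toy scores are assumptions chosen for illustration. The start value of 0.03 is the one used in the training cell below.

```python
def anneal_on_plateau(scores, lr, factor=0.5, patience=3):
    """Return the learning rate used in each epoch.

    Simplified rule (an assumption, not Flair's code): an epoch is "bad" when
    its score does not beat the best score so far; once more than `patience`
    bad epochs accumulate, the learning rate is multiplied by `factor`.
    """
    best = float("-inf")
    bad_epochs = 0
    used = []
    for score in scores:
        used.append(lr)
        if score > best:
            best = score
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:
            lr *= factor
            bad_epochs = 0
    return used

# Toy dev scores (invented): no improvement in epochs 3 to 6
toy_scores = [0.80, 0.81, 0.81, 0.80, 0.81, 0.80, 0.82]
lrs = anneal_on_plateau(toy_scores, lr=0.03)
print(lrs)
```

With these toy scores the fourth bad epoch in a row is epoch 6, so the rate is halved from epoch 7 onwards.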

In [None]:
# 4. initialize stacked embeddings: GloVe word embeddings plus forward and backward Flair embeddings

embeddings = [
              WordEmbeddings('glove'),
              FlairEmbeddings('news-forward-fast'),
              FlairEmbeddings('news-backward-fast')
]
embeddings = StackedEmbeddings(embeddings)

# 5. initialize sequence tagger
tagger = SequenceTagger(hidden_size=256,
                        embeddings=embeddings,
                        tag_dictionary=tag_dictionary,
                        tag_type=tag_type,
                        use_crf=True
                        )

# 6. initialize trainer
trainer = ModelTrainer(tagger, corpus)
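
A stacked embedding concatenates the vectors of its parts, so its size is the sum of the individual sizes. As a quick arithmetic check, the sketch below uses the dimensions printed in the training log further down (100 for GloVe, 1024 for each Flair language model) and arrives at the 2148 input features of the `embedding2nn` layer. The variable names are only illustrative.

```python
# Dimensions read from the training log:
# Embedding(400001, 100) for GloVe, LSTM(100, 1024) for each Flair language model
glove_dim = 100
flair_forward_dim = 1024
flair_backward_dim = 1024

# Concatenation means the sizes add up
stacked_dim = glove_dim + flair_forward_dim + flair_backward_dim
print(stacked_dim)  # 2148, matching Linear(in_features=2148, ...) in the log
```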



In [None]:
# Train the model and save in drive
model_path = '/content/drive/MyDrive/DS Projects/brand-ner/models/flair/'
model_name = 'flair-ner-new-amazon_40'
# 7. start training
trainer.train(model_path+model_name,
              learning_rate=0.03,
              mini_batch_size=32,
              max_epochs=10,
              embeddings_storage_mode='none'
              )


2022-01-13 18:07:34,090 ----------------------------------------------------------------------------------------------------
2022-01-13 18:07:34,094 Model: "SequenceTagger(
  (embeddings): StackedEmbeddings(
    (list_embedding_0): WordEmbeddings(
      'glove'
      (embedding): Embedding(400001, 100)
    )
    (list_embedding_1): FlairEmbeddings(
      (lm): LanguageModel(
        (drop): Dropout(p=0.25, inplace=False)
        (encoder): Embedding(275, 100)
        (rnn): LSTM(100, 1024)
        (decoder): Linear(in_features=1024, out_features=275, bias=True)
      )
    )
    (list_embedding_2): FlairEmbeddings(
      (lm): LanguageModel(
        (drop): Dropout(p=0.25, inplace=False)
        (encoder): Embedding(275, 100)
        (rnn): LSTM(100, 1024)
        (decoder): Linear(in_features=1024, out_features=275, bias=True)
      )
    )
  )
  (word_dropout): WordDropout(p=0.05)
  (locked_dropout): LockedDropout(p=0.5)
  (embedding2nn): Linear(in_features=2148, out_features=2148, b

{'dev_loss_history': [tensor(0.0695, device='cuda:0'),
  tensor(0.0689, device='cuda:0'),
  tensor(0.0691, device='cuda:0'),
  tensor(0.0675, device='cuda:0'),
  tensor(0.0683, device='cuda:0'),
  tensor(0.0675, device='cuda:0'),
  tensor(0.0667, device='cuda:0'),
  tensor(0.0673, device='cuda:0'),
  tensor(0.0669, device='cuda:0'),
  tensor(0.0663, device='cuda:0')],
 'dev_score_history': [0.825299890948746,
  0.8294460641399417,
  0.8276462901118049,
  0.8288561525129983,
  0.8290814482908145,
  0.8269986893840106,
  0.8321242019733025,
  0.8329950710350826,
  0.8331889081455806,
  0.832900057770075],
 'test_score': 0.8346331791143424,
 'train_loss_history': [0.08673060131197435,
  0.08505247087363507,
  0.08484259411130574,
  0.08429876052265409,
  0.08368341124656554,
  0.08316223712903896,
  0.08086903610179365,
  0.08055416628803765,
  0.0801453448163603,
  0.07973682733006898]}
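
The returned history can be inspected directly. As a small sketch, the snippet below copies the `dev_score_history` values from the output above and finds the epoch with the highest dev score. The variable names are illustrative.

```python
# Values copied from 'dev_score_history' in the training output
dev_score_history = [
    0.825299890948746,
    0.8294460641399417,
    0.8276462901118049,
    0.8288561525129983,
    0.8290814482908145,
    0.8269986893840106,
    0.8321242019733025,
    0.8329950710350826,
    0.8331889081455806,
    0.832900057770075,
]

# Epochs are counted from 1, list indices from 0
best_index = max(range(len(dev_score_history)), key=dev_score_history.__getitem__)
best_epoch = best_index + 1
best_score = dev_score_history[best_index]
print(best_epoch, round(best_score, 4))
```

Here the best dev score is reached in epoch 9 rather than in the last epoch. Assuming Flair keeps `best-model.pt` from the epoch with the best dev score, that file would correspond to epoch 9 in this run.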

## Testing the model

The model is trained and saved to our Google Drive (or laptop). Let's test it on some real-world examples.

In [None]:
# Testing the model

texts = [
    "Laptop Dell Inspiron X546",
    'Black Pelikan Pencil 16mm',
    'Battery Smart Energy by Energizer',
    'Fijutsu DSLR Camera 156p',
    "Genuine Paul Smith Men's Belt-Leather Woven Plait Belt/BNWT/Sz: 36'/RRP:110.00",
    'Computer HP X80 Intel Xeon',
    'Smart Watch Apple',
    'Computer Big Hewelett-Packard Intel Xeon',
    '24 Buttermilk oz Oroweat Bread,',
    'Black pencil Vertex',
    'Wireless mouse MacTech',
    'Smart Mouse Logitech',
    'Brother Printer V167 Black Ink',

]


trained_model_path = '/content/drive/MyDrive/DS Projects/brand-ner/models/flair/flair-ner-new-amazon_40/best-model.pt'
flair_model =  SequenceTagger.load(trained_model_path)
#flair_model = SequenceTagger.load('/content/flair-ner-base-transf/final-model.pt')
for t in texts : 
  # create example sentence
  sentence = Sentence(t)
  # predict the tags
  flair_model.predict(sentence)
  print(sentence.to_tagged_string())


2022-01-13 19:26:33,450 loading file /content/drive/MyDrive/DS Projects/brand-ner/models/flair/flair-ner-new-amazon_40/best-model.pt
Laptop Dell <B-BRAND> Inspiron X546
Black Pelikan <B-BRAND> Pencil 16mm
Battery Smart Energy by Energizer <B-BRAND>
Fijutsu <B-BRAND> DSLR Camera 156p
Genuine Paul <B-BRAND> Smith <I-BRAND> Men 's Belt-Leather Woven Plait Belt / BNWT / Sz : 36 '/ RRP : 110.00
Computer HP X80 Intel <B-BRAND> Xeon
Smart <B-BRAND> Watch <I-BRAND> Apple
Computer Big Hewelett-Packard <B-BRAND> Intel Xeon
24 Buttermilk oz Oroweat <B-BRAND> Bread ,
Black pencil Vertex <B-BRAND>
Wireless mouse MacTech <B-BRAND>
Smart Mouse Logitech <B-BRAND>
Brother <B-BRAND> Printer <I-BRAND> V167 Black Ink


The model gets most of these examples right, although it still makes mistakes (for instance, it tags `Smart Watch` instead of `Apple`). Before deploying it, we should write a function that parses the B-BRAND and I-BRAND tags in each sentence and returns the brand as the end result. There are several ways to achieve this, so a full implementation is left out of this notebook.
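
As an illustration only, one possible approach can be sketched on plain `(token, tag)` pairs. The function `extract_brands` is hypothetical and is not the author's implementation. How to obtain the pairs from a Flair `Sentence` depends on the Flair version and is not shown here.

```python
def extract_brands(tagged_tokens):
    """Group B-BRAND / I-BRAND tokens into brand strings.

    Design choice (an assumption): an I-BRAND token without a preceding
    B-BRAND is ignored rather than treated as the start of a brand.
    """
    brands = []
    current = []
    for token, tag in tagged_tokens:
        if tag == "B-BRAND":
            # a new brand starts; close the previous one if still open
            if current:
                brands.append(" ".join(current))
            current = [token]
        elif tag == "I-BRAND" and current:
            current.append(token)
        else:
            # any other tag closes the open brand, if there is one
            if current:
                brands.append(" ".join(current))
            current = []
    if current:
        brands.append(" ".join(current))
    return brands

# Tokens and tags mirroring the "Paul Smith" prediction shown above
example = [("Genuine", "O"), ("Paul", "B-BRAND"), ("Smith", "I-BRAND"),
           ("Men", "O"), ("'s", "O")]
brands = extract_brands(example)
print(brands)  # ['Paul Smith']
```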