**BiLSTM-CRF**

This is a sample implementation of named entity recognition (NER) based on deep neural networks. A bidirectional LSTM (BiLSTM) learns features from our dataset, and a conditional random field (CRF), one of the most widely used techniques for learning custom entities, predicts the tag sequence. A pretrained word2vec model provides the embeddings and word vocabulary.
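
To make the role of the CRF layer more concrete, here is a minimal sketch of Viterbi decoding, the usual way a CRF picks the best tag sequence from per-token scores plus tag-to-tag transition scores. It is an illustration only and not ktrain's implementation: the function name `viterbi_decode`, the three-tag set, and all scores are made up for the example.

```python
def viterbi_decode(emissions, transitions, tags):
    """Return the highest-scoring tag sequence.

    emissions:   list of dicts, one per token, mapping tag -> score
                 (in a BiLSTM-CRF these would come from the BiLSTM).
    transitions: dict mapping (prev_tag, tag) -> score (learned by the CRF).
    """
    # best[t][tag] = best score of any path that ends in `tag` at position t
    best = [{tag: emissions[0][tag] for tag in tags}]
    back = []
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for tag in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions[(p, tag)])
            scores[tag] = best[-1][prev] + transitions[(prev, tag)] + emissions[t][tag]
            pointers[tag] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    last = max(tags, key=lambda tag: best[-1][tag])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical scores for the two tokens "Saudi Arabia".
TAGS = ["O", "B-geo", "I-geo"]
TRANSITIONS = {(p, t): 0.0 for p in TAGS for t in TAGS}
TRANSITIONS[("O", "I-geo")] = -10.0     # I-geo right after O is invalid in BIO tagging
TRANSITIONS[("B-geo", "I-geo")] = 1.0   # I-geo naturally follows B-geo
EMISSIONS = [
    {"O": 1.0, "B-geo": 0.9, "I-geo": 0.0},
    {"O": 0.2, "B-geo": 0.1, "I-geo": 1.0},
]

decoded = viterbi_decode(EMISSIONS, TRANSITIONS, TAGS)
print(decoded)
```

Picking the best tag for each token independently would give `O, I-geo` here, which is not a valid BIO sequence. The transition scores let the decoder prefer `B-geo, I-geo` instead.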

In [24]:
# This Python 3 environment is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# Install the packages this notebook needs.
# The version specifier is quoted so the shell does not treat ">" as an output redirect.
!pip install -q "tensorflow_gpu>=2.0"
!pip install ktrain
%reload_ext autoreload
%autoreload 2
%matplotlib inline


**To enable sequence tagging with BiLSTM-CRF, TensorFlow 2.x behavior has to be disabled before ktrain is imported**

In [25]:
import os
os.environ['DISABLE_V2_BEHAVIOR'] = '1'

In [26]:
import tensorflow as tf; print(tf.__version__)

2.1.0


In [48]:
import pandas as pd
input_data = pd.read_csv('./../input/entity-annotated-corpus/ner_dataset.csv', encoding="latin1")
# Forward-fill missing values so that every row carries its sentence ID
input_data = input_data.ffill()
input_data.tail(10)

Unnamed: 0,Sentence #,Word,POS,Tag
1048565,Sentence: 47958,impact,NN,O
1048566,Sentence: 47958,.,.,O
1048567,Sentence: 47959,Indian,JJ,B-gpe
1048568,Sentence: 47959,forces,NNS,O
1048569,Sentence: 47959,said,VBD,O
1048570,Sentence: 47959,they,PRP,O
1048571,Sentence: 47959,responded,VBD,O
1048572,Sentence: 47959,to,TO,O
1048573,Sentence: 47959,the,DT,O
1048574,Sentence: 47959,attack,NN,O


In [49]:
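import matplotlib.pyplot as plt
plt.figure(figsize = (20, 10))
# Count how often each tag occurs
klass = input_data.Tag.value_counts().to_dict()
# Drop the 'O' (outside) tag so that it is not plotted alongside the entity tags
klass.pop('O')
pd.Series(klass).plot.bar();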


In [28]:
import ktrain
from ktrain import text

**Loading and extracting data**
* The training file must be in a specific format. Here it is a CSV file with one row per token: a sentence ID column, a word column, and a tag (label) column.
* `entities_from_txt` is a built-in ktrain function that loads the data from the file, extracts the necessary information, and prints a summary of it.
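
As a rough illustration of the one-row-per-token layout, the sketch below groups rows into sentences using only the standard library. It assumes, as the forward-fill step earlier suggests, that the sentence ID is only filled in on the first row of each sentence. The sample rows are illustrative, the helper name `group_sentences` is hypothetical, and none of this is meant to describe what ktrain does internally.

```python
import csv
import io

# A few rows in the same column layout as ner_dataset.csv (illustrative sample).
SAMPLE = """Sentence #,Word,POS,Tag
Sentence: 47958,impact,NN,O
,.,.,O
Sentence: 47959,Indian,JJ,B-gpe
,forces,NNS,O
,said,VBD,O
"""


def group_sentences(rows):
    """Group rows into {sentence_id: [(word, tag), ...]}, forward-filling blank IDs."""
    sentences = {}
    current_id = None
    for row in rows:
        # A blank sentence ID means "same sentence as the previous row".
        current_id = row["Sentence #"] or current_id
        sentences.setdefault(current_id, []).append((row["Word"], row["Tag"]))
    return sentences


sentences = group_sentences(csv.DictReader(io.StringIO(SAMPLE)))
for sentence_id, tokens in sentences.items():
    print(sentence_id, tokens)
```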

In [30]:
DATAFILE = './../input/entity-annotated-corpus/ner_dataset.csv'
(trn, val, preproc) = text.entities_from_txt(DATAFILE,
                                             embeddings='word2vec',
                                             sentence_column='Sentence #',
                                             word_column='Word',
                                             tag_column='Tag', 
                                             data_format='gmb')

Number of sentences:  47959
Number of words in the dataset:  35178
Tags: ['O', 'B-per', 'B-gpe', 'I-org', 'I-art', 'I-per', 'B-nat', 'I-eve', 'I-nat', 'B-eve', 'B-tim', 'I-gpe', 'B-art', 'I-tim', 'I-geo', 'B-geo', 'B-org']
Number of Labels:  17
Longest sentence: 104 words


**Confirming the supported sequence taggers**

In [33]:
text.print_sequence_taggers()

bilstm-crf: Bidirectional LSTM-CRF  (https://arxiv.org/abs/1603.01360)


**sequence_tagger** builds the BiLSTM-CRF model from the preprocessor, so the pretrained word2vec embeddings selected above are used to represent the input words.

In [32]:
model = text.sequence_tagger('bilstm-crf', preproc)

pretrained word2vec word embeddings will be used with bilstm-crf
Loading pretrained word vectors...this may take a few moments...
Done.
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: LIVE_VARS_IN


In [34]:
learner = ktrain.get_learner(model, train_data=trn, val_data=val)

In [36]:
# find good learning rate
#learner.lr_find()             # briefly simulate training to find good learning rate
#learner.lr_plot()             # visually identify best learning rate

**The model is trained for only 1 epoch here and already achieves a decent F1 score; training for more epochs is likely to improve the validation loss and F1 score**

In [37]:
# Train with a learning rate of 1e-3 for 1 epoch
learner.fit(1e-3, 1)



<tensorflow.python.keras.callbacks.History at 0x7f4f19f0f8d0>

In [38]:
learner.validate(class_names=preproc.get_classes())

   F1: 83.93
           precision    recall  f1-score   support

      org       0.79      0.65      0.71      2007
      geo       0.84      0.92      0.88      3821
      tim       0.88      0.85      0.87      1969
      art       0.00      0.00      0.00        39
      per       0.77      0.77      0.77      1666
      gpe       0.98      0.92      0.95      1577
      eve       0.60      0.20      0.30        30
      nat       0.47      0.35      0.40        26

micro avg       0.85      0.83      0.84     11135
macro avg       0.84      0.83      0.83     11135



0.8392832841914267
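
The report lists entity types rather than individual B-/I- tags, which suggests the score is computed over whole entities rather than tokens. Below is a minimal sketch of that kind of entity-level F1, assuming exact-match scoring of (type, start, end) spans. It is not ktrain's own code; `extract_spans` and `entity_f1` are hypothetical helpers, and the tags are a short excerpt in the style of the dataset.

```python
def extract_spans(tags):
    """Return the set of (entity_type, start, end) spans in a BIO tag sequence."""
    spans, start, kind = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel "O" closes a trailing span
        inside = tag.startswith("I-") and tag[2:] == kind
        if start is not None and not inside:
            spans.add((kind, start, i - 1))
            start, kind = None, None
        # A B- tag always opens a span; a stray I- tag is treated as an opening too.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, kind = i, tag[2:]
    return spans


def entity_f1(true_tags, pred_tags):
    """Exact-match F1 over entity spans: type and both boundaries must agree."""
    true_spans, pred_spans = extract_spans(true_tags), extract_spans(pred_tags)
    correct = len(true_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(true_spans) if true_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Tags for "Saudi Arabia , the United Arab Emirates , Kuwait"
true_tags = ["B-per", "I-per", "O", "O", "B-org", "I-org", "I-org", "O", "B-org"]
pred_tags = ["B-geo", "I-geo", "O", "O", "B-org", "I-org", "I-org", "O", "B-geo"]

f1 = entity_f1(true_tags, pred_tags)
print(round(f1, 3))
```

Only one of the three entities matches in both type and boundaries, so precision and recall are both one third even though most individual tokens are tagged correctly.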

In [39]:
learner.view_top_losses(n=1)

total incorrect: 12
Word            True : (Pred)
Religious      :O     (O)
councils       :O     (O)
in             :O     (O)
Saudi          :B-per (B-geo)
Arabia         :I-per (I-geo)
,              :O     (O)
the            :O     (O)
United         :B-org (B-org)
Arab           :I-org (I-org)
Emirates       :I-org (I-org)
,              :O     (O)
Kuwait         :B-org (B-geo)
and            :O     (O)
other          :O     (O)
Arab           :B-gpe (B-gpe)
states         :O     (O)
said           :O     (O)
the            :O     (O)
moon           :O     (O)
's             :O     (O)
crescent       :O     (O)
was            :O     (O)
not            :O     (O)
sighted        :O     (O)
after          :O     (O)
nightfall      :B-tim (O)
Wednesday      :I-tim (B-tim)
,              :O     (O)
meaning        :O     (O)
there          :O     (O)
will           :O     (O)
be             :O     (O)
one            :B-tim (O)
more           :I-tim (O)
day            :I-tim (O)
of      

In [40]:
predictor = ktrain.get_predictor(learner.model, preproc)

In [42]:
predictor.predict('As of 2019,Narendra modi has been prime minister of india.')


[('As', 'O'),
 ('of', 'O'),
 ('2019', 'B-tim'),
 (',', 'O'),
 ('Narendra', 'B-geo'),
 ('modi', 'O'),
 ('has', 'O'),
 ('been', 'O'),
 ('prime', 'O'),
 ('minister', 'O'),
 ('of', 'O'),
 ('india', 'B-geo'),
 ('.', 'O')]
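
The predictor returns one (token, tag) pair per token. To read off whole entities, the pairs can be merged along the BIO boundaries. The sketch below shows one way to do that; `group_entities` is a hypothetical helper and not part of ktrain.

```python
def group_entities(tagged_tokens):
    """Merge (token, BIO tag) pairs into (entity_text, entity_type) tuples."""
    entities = []
    current_words, current_type = [], None
    for token, tag in tagged_tokens:
        # An I- tag of the same type extends the entity that is currently open.
        if tag.startswith("I-") and tag[2:] == current_type:
            current_words.append(token)
            continue
        # Anything else closes the open entity, if there is one.
        if current_words:
            entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
        if tag != "O":
            current_words, current_type = [token], tag[2:]
    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities


# The prediction shown above, copied verbatim.
predicted = [('As', 'O'), ('of', 'O'), ('2019', 'B-tim'), (',', 'O'),
             ('Narendra', 'B-geo'), ('modi', 'O'), ('has', 'O'), ('been', 'O'),
             ('prime', 'O'), ('minister', 'O'), ('of', 'O'), ('india', 'B-geo'),
             ('.', 'O')]

entities = group_entities(predicted)
print(entities)
```

Grouped this way, the model's mistakes after a single epoch are easy to see: "Narendra" is tagged as a location and "modi" is not tagged at all.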