# Named Entity Recognition

### Approach

The `neuroner` class implements named entity recognition using bidirectional LSTMs (BLSTMs). The underlying paper generates character-enhanced token embeddings as follows:

- Character embeddings are passed through a BLSTM.
- The output of this BLSTM is combined with word embeddings to get character-enhanced token embeddings.

The character-enhanced token embeddings are then passed through another BLSTM to generate a prediction for each token.
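To make the first stage concrete, here is a toy, pure-Python sketch of how a character-enhanced token embedding could be assembled. It is not the `neuroner` implementation. It assumes that "combined" means concatenation, it replaces the LSTM cell with a plain tanh recurrent cell to stay short, and it uses random weights. The dimensions are taken from the parameters printed when the class is instantiated; every function name here is hypothetical.

```python
import math
import random

random.seed(0)  # reproducible toy weights

CHAR_DIM = 25      # 'character_embedding_dimension' in the printed parameters
CHAR_HIDDEN = 25   # 'character_lstm_hidden_state_dimension'
TOKEN_DIM = 100    # 'token_embedding_dimension'


def rand_vec(n):
    """Random vector standing in for a learned embedding or weight row."""
    return [random.uniform(-0.1, 0.1) for _ in range(n)]


def rand_cell(input_dim, hidden_dim):
    """Random (input weights, recurrent weights) for one direction."""
    w_in = [rand_vec(input_dim) for _ in range(hidden_dim)]
    w_rec = [rand_vec(hidden_dim) for _ in range(hidden_dim)]
    return w_in, w_rec


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def final_state(inputs, cell):
    """Run a plain tanh recurrent cell over `inputs` and return the last state.

    This is a simplified stand-in for an LSTM cell, which additionally
    has gates and a memory cell.
    """
    w_in, w_rec = cell
    h = [0.0] * len(w_rec)
    for x in inputs:
        h = [math.tanh(a + b) for a, b in zip(matvec(w_in, x), matvec(w_rec, h))]
    return h


def char_enhanced_embedding(token, char_table, word_table, fwd_cell, bwd_cell):
    """Concatenate both character-level final states with the word embedding."""
    for ch in token:
        if ch not in char_table:
            char_table[ch] = rand_vec(CHAR_DIM)
    if token not in word_table:
        word_table[token] = rand_vec(TOKEN_DIM)
    chars = [char_table[ch] for ch in token]
    h_fwd = final_state(chars, fwd_cell)        # left-to-right pass
    h_bwd = final_state(chars[::-1], bwd_cell)  # right-to-left pass
    # Assumption: "combined" is taken to mean concatenation.
    return h_fwd + h_bwd + word_table[token]


char_table, word_table = {}, {}
fwd_cell = rand_cell(CHAR_DIM, CHAR_HIDDEN)
bwd_cell = rand_cell(CHAR_DIM, CHAR_HIDDEN)
vec = char_enhanced_embedding('EU', char_table, word_table, fwd_cell, bwd_cell)
print(len(vec))  # 2 * CHAR_HIDDEN + TOKEN_DIM = 150
```

In the real model, one such vector per token would feed the second, token-level BLSTM.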

Let's import the `neuroner` package

In [1]:
from neuroner import neuroner

### Instantiate the class

In [2]:
ner = neuroner(parameters_filepath='./parameters.ini')

{'train_model': 1, 'use_pretrained_model': 0, 'pretrained_model_folder': './trained_models/conll_2003_en/', 'dataset_text_folder': './data/conll2003/en', 'main_evaluation_mode': 'bio', 'output_folder': './output', 'use_character_lstm': 1, 'character_embedding_dimension': 25, 'character_lstm_hidden_state_dimension': 25, 'token_pretrained_embedding_filepath': './data/word_vectors/glove.6B.100d.txt', 'token_embedding_dimension': 100, 'token_lstm_hidden_state_dimension': 1, 'use_crf': 1, 'patience': 10, 'maximum_number_of_epochs': 1, 'optimizer': 'sgd', 'learning_rate': 0.005, 'gradient_clipping_value': 5.0, 'dropout_rate': 0.5, 'number_of_cpu_threads': 8, 'number_of_gpus': 0, 'experiment_name': 'test', 'output_scores': 0, 'tagging_format': 'bio', 'tokenizer': 'spacy', 'spacylanguage': 'en', 'remap_unknown_tokens_to_unk': 1, 'load_only_pretrained_token_embeddings': 0, 'load_all_pretrained_token_embeddings': 'False', 'check_for_lowercase': 1, 'check_for_digits_replaced_with_zeros': 1, 'free

### Loading and Preprocessing Dataset

Specify the paths to the train, dev and test files

In [3]:
inputFiles = {'train': '/Users/lakshya/Desktop/CSCI-548/Named-Entity-Recognition-with-Bidirectional-LSTM-CNNs-master/conll/train.txt',
              'dev': '/Users/lakshya/Desktop/CSCI-548/Named-Entity-Recognition-with-Bidirectional-LSTM-CNNs-master/conll/valid.txt',
              'test': '/Users/lakshya/Desktop/CSCI-548/Named-Entity-Recognition-with-Bidirectional-LSTM-CNNs-master/conll/test.txt'}

Read and load the entire dataset

In [4]:
data = ner.read_dataset(inputFiles)
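For orientation, the rows previewed later in this notebook have the shape you get by splitting each CoNLL-2003 line on whitespace, with blank lines becoming empty lists. The sketch below only illustrates that format. It is not the `read_dataset` implementation, and `parse_conll_lines` and the inline sample are hypothetical names used for illustration.

```python
def parse_conll_lines(lines):
    """Split each CoNLL line on whitespace; a blank line yields an empty list."""
    return [line.split() for line in lines]


# Columns: token, POS tag, chunk tag, NER tag
sample = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
"""

rows = parse_conll_lines(sample.splitlines())
for row in rows:
    print(row)
```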

### Preview Data

Let's look at some data from train, dev and test

In [5]:
data['train'][0:12]

[['-DOCSTART-', '-X-', '-X-', 'O'],
 [],
 ['EU', 'NNP', 'B-NP', 'B-ORG'],
 ['rejects', 'VBZ', 'B-VP', 'O'],
 ['German', 'JJ', 'B-NP', 'B-MISC'],
 ['call', 'NN', 'I-NP', 'O'],
 ['to', 'TO', 'B-VP', 'O'],
 ['boycott', 'VB', 'I-VP', 'O'],
 ['British', 'JJ', 'B-NP', 'B-MISC'],
 ['lamb', 'NN', 'I-NP', 'O'],
 ['.', '.', 'O', 'O'],
 []]

In [6]:
data['dev'][0:14]

[['-DOCSTART-', '-X-', '-X-', 'O'],
 [],
 ['CRICKET', 'NNP', 'B-NP', 'O'],
 ['-', ':', 'O', 'O'],
 ['LEICESTERSHIRE', 'NNP', 'B-NP', 'B-ORG'],
 ['TAKE', 'NNP', 'I-NP', 'O'],
 ['OVER', 'IN', 'B-PP', 'O'],
 ['AT', 'NNP', 'B-NP', 'O'],
 ['TOP', 'NNP', 'I-NP', 'O'],
 ['AFTER', 'NNP', 'I-NP', 'O'],
 ['INNINGS', 'NNP', 'I-NP', 'O'],
 ['VICTORY', 'NN', 'I-NP', 'O'],
 ['.', '.', 'O', 'O'],
 []]

In [7]:
data['test'][0:15]

[['-DOCSTART-', '-X-', '-X-', 'O'],
 [],
 ['SOCCER', 'NN', 'B-NP', 'O'],
 ['-', ':', 'O', 'O'],
 ['JAPAN', 'NNP', 'B-NP', 'B-LOC'],
 ['GET', 'VB', 'B-VP', 'O'],
 ['LUCKY', 'NNP', 'B-NP', 'O'],
 ['WIN', 'NNP', 'I-NP', 'O'],
 [',', ',', 'O', 'O'],
 ['CHINA', 'NNP', 'B-NP', 'B-PER'],
 ['IN', 'IN', 'B-PP', 'O'],
 ['SURPRISE', 'DT', 'B-NP', 'O'],
 ['DEFEAT', 'NN', 'I-NP', 'O'],
 ['.', '.', 'O', 'O'],
 []]
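Each row in these previews is `[token, POS tag, chunk tag, NER tag]`, and empty lists mark sentence boundaries. As a rough sketch of how to work with this layout, the helpers below group rows into sentences and read entity spans off the BIO tags in the last column. `split_sentences` and `extract_entities` are hypothetical helpers, not part of `neuroner`.

```python
def split_sentences(rows):
    """Group rows into sentences; empty rows separate sentences and
    -DOCSTART- marker rows are skipped."""
    sentences, current = [], []
    for row in rows:
        if not row:
            if current:
                sentences.append(current)
                current = []
        elif row[0] != '-DOCSTART-':
            current.append(row)
    if current:
        sentences.append(current)
    return sentences


def extract_entities(sentence):
    """Collect (entity_type, text) spans from the BIO tags in the last column."""
    entities, tokens, etype = [], [], None
    for row in sentence:
        token, tag = row[0], row[-1]
        if tag.startswith('I-') and tag[2:] == etype:
            tokens.append(token)  # continue the currently open span
            continue
        if tokens:  # any other tag closes the open span
            entities.append((etype, ' '.join(tokens)))
            tokens, etype = [], None
        if tag != 'O':
            # B-X opens a span; a stray I-X is treated as opening one too
            tokens, etype = [token], tag[2:]
    if tokens:
        entities.append((etype, ' '.join(tokens)))
    return entities


# Rows copied from the train preview above
train_rows = [
    ['-DOCSTART-', '-X-', '-X-', 'O'], [],
    ['EU', 'NNP', 'B-NP', 'B-ORG'], ['rejects', 'VBZ', 'B-VP', 'O'],
    ['German', 'JJ', 'B-NP', 'B-MISC'], ['call', 'NN', 'I-NP', 'O'],
    ['to', 'TO', 'B-VP', 'O'], ['boycott', 'VB', 'I-VP', 'O'],
    ['British', 'JJ', 'B-NP', 'B-MISC'], ['lamb', 'NN', 'I-NP', 'O'],
    ['.', '.', 'O', 'O'], [],
]

sentences = split_sentences(train_rows)
entities = extract_entities(sentences[0])
print(entities)
```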

### Train the model

In [8]:
ner.train(data)

Checking the validity of BRAT-formatted train set... Done.
Checking compatibility between CONLL and BRAT for train_compatible_with_brat set ... Done.
Checking the validity of BRAT-formatted valid set... Done.
Checking compatibility between CONLL and BRAT for valid_compatible_with_brat set ... Done.
Checking the validity of BRAT-formatted test set... Done.
Checking compatibility between CONLL and BRAT for test_compatible_with_brat set ... Done.
Preprocessing dataset... done (32.40 seconds)
Instructions for updating:
Colocations handled automatically by placer.
Instructions for updating:
Please use `keras.layers.Bidirectional(keras.layers.RNN(cell))`, which is equivalent to this API
Instructions for updating:
Please use `keras.layers.RNN(cell)`, which is equivalent to this API
Instructions for updating:
Use tf.cast instead.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.

Load token embeddings... done (1.01 seconds)
number_of_token_original_case_found: 400000
number_of_token_lowercase_found: 8770
number_of_token_digits_replaced_with_zeros_found: 68
number_of_token_lowercase_and_digits_replaced_with_zeros_found: 5
number_of_loaded_word_vectors: 408843
dataset.vocabulary_size: 409977

Starting epoch 0
Training completed in 0.00 seconds
Evaluate model on the train set
              precision    recall  f1-score   support

       B-LOC     0.0000    0.0000    0.0000      3143
       I-LOC     0.0032    0.0047    0.0038       427
      B-MISC     0.0390    0.0061    0.0106      1466
      I-MISC     0.0000    0.0000    0.0000       532
       B-ORG     0.0262    0.3036    0.0482      2777
       I-ORG     0.2500    0.0007    0.0013      1518
       B-PER     0.0000    0.0000    0.0000      3018
       I-PER     0.0201    0.1989    0.0365      2162

   micro avg     0.0236    0.0854    0.0370     15043
   macro avg     0.0423    0.0642    0.0126     15043
wei

              precision    recall  f1-score   support

       B-LOC     0.0000    0.0000    0.0000      1668
       I-LOC     0.0134    0.0195    0.0158       257
      B-MISC     0.0374    0.0100    0.0157       702
      I-MISC     0.0000    0.0000    0.0000       216
       B-ORG     0.0234    0.2414    0.0426      1661
       I-ORG     1.0000    0.0024    0.0048       835
       B-PER     0.0000    0.0000    0.0000      1617
       I-PER     0.0235    0.2569    0.0431      1156

   micro avg     0.0235    0.0878    0.0370      8112
   macro avg     0.1372    0.0663    0.0153      8112
weighted avg     0.1147    0.0878    0.0172      8112


Starting epoch 1
Training completed in 1093.55 seconds
Evaluate model on the train set
              precision    recall  f1-score   support

       B-LOC     0.6245    0.5514    0.5857      3143
       I-LOC     0.0000    0.0000    0.0000       427
      B-MISC     0.0000    0.0000    0.0000      1466
      I-MISC     0.0000    0.0000    0.0000 

### Model Evaluation

Extract the ground truth from the test data

In [9]:
ground = ner.convert_ground_truth(data)

Make predictions on the test data

In [10]:
predictions = ner.predict(data)

Instructions for updating:
Use standard file APIs to check for files with this prefix.
INFO:tensorflow:Restoring parameters from ./output/conll_2019-04-27_10-46-23-543390/output/model.ckpt


### Results

Calculate precision, recall and F1 score

In [11]:
P, R, F1 = ner.evaluate(predictions, ground)

print('Precision: %s, Recall: %s, F1: %s' % (P, R, F1))

              precision    recall  f1-score   support

       B-LOC     0.5862    0.5486    0.5667      1668
       I-LOC     0.0000    0.0000    0.0000       257
      B-MISC     0.0000    0.0000    0.0000       702
      I-MISC     0.0000    0.0000    0.0000       216
       B-ORG     0.4957    0.3444    0.4064      1661
       I-ORG     0.3590    0.2623    0.3031       835
       B-PER     0.6443    0.4459    0.5270      1617
       I-PER     0.6550    0.6341    0.6444      1156

   micro avg     0.5680    0.3895    0.4622      8112
   macro avg     0.3425    0.2794    0.3060      8112
weighted avg     0.4808    0.3895    0.4278      8112

Precision: 0.5680388279705195, Recall: 0.3895463510848126, F1: 0.4621572212065813
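The returned values match the `micro avg` row of the report. As a quick sanity check, the sketch below recomputes F1 as the harmonic mean of the printed precision and recall, and rebuilds the macro and weighted F1 averages from the per-label rows. The numbers are copied from the output above; the variable names are just for illustration.

```python
# (label, f1-score, support) copied from the report above
report = [
    ('B-LOC', 0.5667, 1668),
    ('I-LOC', 0.0000, 257),
    ('B-MISC', 0.0000, 702),
    ('I-MISC', 0.0000, 216),
    ('B-ORG', 0.4064, 1661),
    ('I-ORG', 0.3031, 835),
    ('B-PER', 0.5270, 1617),
    ('I-PER', 0.6444, 1156),
]

precision = 0.5680388279705195
recall = 0.3895463510848126

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

# Macro average: unweighted mean of the per-label F1 scores
macro_f1 = sum(score for _, score, _ in report) / len(report)

# Weighted average: per-label F1 weighted by support
total_support = sum(support for _, _, support in report)
weighted_f1 = sum(score * support for _, score, support in report) / total_support

print('F1: %.4f, macro F1: %.4f, weighted F1: %.4f' % (f1, macro_f1, weighted_f1))
```

The gap between the micro and macro averages comes from `I-LOC`, `B-MISC` and `I-MISC`, which all score 0 and pull the unweighted mean down.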
