# Named Entity Recognition

### Approach

The `neuroner` class implements named entity recognition using bidirectional LSTMs (BLSTMs). The paper generates character-enhanced token embeddings as follows:

- Character embeddings are passed through a BLSTM.
- The output of this BLSTM is combined with word embeddings to produce character-enhanced token embeddings.

The character-enhanced token embeddings are then passed through another BLSTM to generate a prediction for each token.
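
The two steps above can be sketched in plain Python. This is an illustration only, not the package's implementation: it assumes "combined" means concatenation, uses untrained random weights, omits bias terms, and takes the dimensions (25 for character embeddings and the character LSTM state, 100 for word embeddings) from the parameter dump shown later in this notebook. `TinyLSTM` and `char_enhanced_embedding` are hypothetical names.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class TinyLSTM:
    """Single-layer LSTM with random, untrained weights and no biases."""

    def __init__(self, input_dim, hidden_dim, rng):
        self.hidden_dim = hidden_dim
        # One weight matrix per gate, applied to [input; previous hidden state].
        self.weights = {
            gate: random_matrix(hidden_dim, input_dim + hidden_dim, rng)
            for gate in ("input", "forget", "output", "candidate")
        }

    def last_hidden(self, inputs):
        """Run over a sequence of vectors and return the final hidden state."""
        h = [0.0] * self.hidden_dim
        c = [0.0] * self.hidden_dim
        for x in inputs:
            z = x + h  # list concatenation: [input; previous hidden state]
            i = [sigmoid(a) for a in matvec(self.weights["input"], z)]
            f = [sigmoid(a) for a in matvec(self.weights["forget"], z)]
            o = [sigmoid(a) for a in matvec(self.weights["output"], z)]
            g = [math.tanh(a) for a in matvec(self.weights["candidate"], z)]
            c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
            h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
        return h


def char_enhanced_embedding(token, char_vectors, word_vector, fwd, bwd):
    """Concatenate the word vector with both directions of the character LSTM."""
    chars = [char_vectors[ch] for ch in token]
    forward_state = fwd.last_hidden(chars)
    backward_state = bwd.last_hidden(chars[::-1])
    return word_vector + forward_state + backward_state


rng = random.Random(0)
CHAR_DIM, CHAR_HIDDEN, WORD_DIM = 25, 25, 100
token = "EU"
char_vectors = {
    ch: [rng.uniform(-1, 1) for _ in range(CHAR_DIM)] for ch in sorted(set(token))
}
word_vector = [rng.uniform(-1, 1) for _ in range(WORD_DIM)]
fwd = TinyLSTM(CHAR_DIM, CHAR_HIDDEN, rng)
bwd = TinyLSTM(CHAR_DIM, CHAR_HIDDEN, rng)

embedding = char_enhanced_embedding(token, char_vectors, word_vector, fwd, bwd)
print(len(embedding))  # 100 + 25 + 25 = 150
```

Under these assumptions, each token is represented by a 150-dimensional vector before it reaches the token-level BLSTM.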

Let's import the `neuroner` package:

In [1]:
from neuroner import neuroner

### Instantiate the class

In [2]:
ner = neuroner(parameters_filepath='./parameters.ini')

{'train_model': 1, 'use_pretrained_model': 0, 'pretrained_model_folder': './trained_models/conll_2003_en/', 'dataset_text_folder': './data/conll2003/en', 'main_evaluation_mode': 'bio', 'output_folder': './output', 'use_character_lstm': 1, 'character_embedding_dimension': 25, 'character_lstm_hidden_state_dimension': 25, 'token_pretrained_embedding_filepath': './data/word_vectors/glove.6B.100d.txt', 'token_embedding_dimension': 100, 'token_lstm_hidden_state_dimension': 1, 'use_crf': 1, 'patience': 10, 'maximum_number_of_epochs': 1, 'optimizer': 'sgd', 'learning_rate': 0.005, 'gradient_clipping_value': 5.0, 'dropout_rate': 0.5, 'number_of_cpu_threads': 8, 'number_of_gpus': 0, 'experiment_name': 'test', 'output_scores': 0, 'tagging_format': 'bio', 'tokenizer': 'spacy', 'spacylanguage': 'en', 'remap_unknown_tokens_to_unk': 1, 'load_only_pretrained_token_embeddings': 0, 'load_all_pretrained_token_embeddings': 'False', 'check_for_lowercase': 1, 'check_for_digits_replaced_with_zeros': 1, 'free
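
The dictionary printed above is read from `./parameters.ini`. As a rough sketch of how such a file could be inspected before training, the standard-library `configparser` can parse an INI fragment. The keys and values below are copied from the printed dictionary, but the section names (`mode`, `training`) are assumptions made for illustration and may not match the real file.

```python
import configparser

# Hypothetical fragment: keys and values come from the dictionary printed
# above; the section names are assumed for illustration only.
SAMPLE_INI = """
[mode]
train_model = 1
use_pretrained_model = 0

[training]
maximum_number_of_epochs = 1
learning_rate = 0.005
dropout_rate = 0.5
optimizer = sgd
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE_INI)

# Typed getters convert the raw strings into Python values.
train_model = config.getboolean("mode", "train_model")
epochs = config.getint("training", "maximum_number_of_epochs")
learning_rate = config.getfloat("training", "learning_rate")
optimizer = config.get("training", "optimizer")

print(train_model, epochs, learning_rate, optimizer)
```

A check like this makes it easy to spot settings such as `maximum_number_of_epochs = 1` before starting a run.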

### Loading and Preprocessing Dataset

Specify the paths to the train, dev, and test files:

In [3]:
inputFiles = {'train': '/Users/lakshya/Desktop/CSCI-548/Named-Entity-Recognition-with-Bidirectional-LSTM-CNNs-master/conll/train.txt',
              'dev': '/Users/lakshya/Desktop/CSCI-548/Named-Entity-Recognition-with-Bidirectional-LSTM-CNNs-master/conll/valid.txt',
              'test': '/Users/lakshya/Desktop/CSCI-548/Named-Entity-Recognition-with-Bidirectional-LSTM-CNNs-master/conll/test.txt'}

Read and load the entire dataset

In [4]:
data = ner.read_dataset(inputFiles)
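
The rows previewed in the next section have four fields each (token, POS tag, chunk tag, NER tag), and blank lines in the file appear as empty lists. The following is a minimal sketch of a reader that produces rows of that shape. It is not the `read_dataset` implementation, and `parse_conll` is a hypothetical helper.

```python
def parse_conll(text):
    """Turn CoNLL-style text into a list of rows; blank lines become []."""
    return [line.split() for line in text.splitlines()]


sample = "-DOCSTART- -X- -X- O\n\nEU NNP B-NP B-ORG\nrejects VBZ B-VP O\n"
rows = parse_conll(sample)
for row in rows:
    print(row)
```

Splitting on whitespace works here because none of the four columns contains spaces.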

### Preview Data

Let's look at a few rows from the train, dev, and test sets:

In [5]:
data['train'][0:12]

[['-DOCSTART-', '-X-', '-X-', 'O'],
 [],
 ['EU', 'NNP', 'B-NP', 'B-ORG'],
 ['rejects', 'VBZ', 'B-VP', 'O'],
 ['German', 'JJ', 'B-NP', 'B-MISC'],
 ['call', 'NN', 'I-NP', 'O'],
 ['to', 'TO', 'B-VP', 'O'],
 ['boycott', 'VB', 'I-VP', 'O'],
 ['British', 'JJ', 'B-NP', 'B-MISC'],
 ['lamb', 'NN', 'I-NP', 'O'],
 ['.', '.', 'O', 'O'],
 []]

In [6]:
data['dev'][0:14]

[['-DOCSTART-', '-X-', '-X-', 'O'],
 [],
 ['CRICKET', 'NNP', 'B-NP', 'O'],
 ['-', ':', 'O', 'O'],
 ['LEICESTERSHIRE', 'NNP', 'B-NP', 'B-ORG'],
 ['TAKE', 'NNP', 'I-NP', 'O'],
 ['OVER', 'IN', 'B-PP', 'O'],
 ['AT', 'NNP', 'B-NP', 'O'],
 ['TOP', 'NNP', 'I-NP', 'O'],
 ['AFTER', 'NNP', 'I-NP', 'O'],
 ['INNINGS', 'NNP', 'I-NP', 'O'],
 ['VICTORY', 'NN', 'I-NP', 'O'],
 ['.', '.', 'O', 'O'],
 []]

In [7]:
data['test'][0:15]

[['-DOCSTART-', '-X-', '-X-', 'O'],
 [],
 ['SOCCER', 'NN', 'B-NP', 'O'],
 ['-', ':', 'O', 'O'],
 ['JAPAN', 'NNP', 'B-NP', 'B-LOC'],
 ['GET', 'VB', 'B-VP', 'O'],
 ['LUCKY', 'NNP', 'B-NP', 'O'],
 ['WIN', 'NNP', 'I-NP', 'O'],
 [',', ',', 'O', 'O'],
 ['CHINA', 'NNP', 'B-NP', 'B-PER'],
 ['IN', 'IN', 'B-PP', 'O'],
 ['SURPRISE', 'DT', 'B-NP', 'O'],
 ['DEFEAT', 'NN', 'I-NP', 'O'],
 ['.', '.', 'O', 'O'],
 []]
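
In the previews above, an empty list marks a sentence boundary and a `-DOCSTART-` row marks the start of a document. A minimal sketch for regrouping such rows into sentences of `(token, NER tag)` pairs follows; `group_sentences` is a hypothetical helper, not part of the `neuroner` class, and the sample is an abbreviated copy of the train preview.

```python
def group_sentences(rows):
    """Group flat CoNLL rows into sentences of (token, ner_tag) pairs."""
    sentences, current = [], []
    for row in rows:
        if not row:
            # An empty row closes the current sentence, if there is one.
            if current:
                sentences.append(current)
                current = []
        elif row[0] != "-DOCSTART-":
            # Keep only the token (first field) and NER tag (last field).
            current.append((row[0], row[-1]))
    if current:
        sentences.append(current)
    return sentences


rows = [
    ["-DOCSTART-", "-X-", "-X-", "O"],
    [],
    ["EU", "NNP", "B-NP", "B-ORG"],
    ["rejects", "VBZ", "B-VP", "O"],
    ["German", "JJ", "B-NP", "B-MISC"],
    [],
]
sentences = group_sentences(rows)
print(sentences)
# [[('EU', 'B-ORG'), ('rejects', 'O'), ('German', 'B-MISC')]]
```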

### Train the model

In [8]:
ner.train(data)

Checking the validity of BRAT-formatted train set... Done.
Checking compatibility between CONLL and BRAT for train_compatible_with_brat set ... Done.
Checking the validity of BRAT-formatted valid set... Done.
Checking compatibility between CONLL and BRAT for valid_compatible_with_brat set ... Done.
Checking the validity of BRAT-formatted test set... Done.
Checking compatibility between CONLL and BRAT for test_compatible_with_brat set ... Done.
Preprocessing dataset... done (29.72 seconds)
Instructions for updating:
Colocations handled automatically by placer.
Instructions for updating:
Please use `keras.layers.Bidirectional(keras.layers.RNN(cell))`, which is equivalent to this API
Instructions for updating:
Please use `keras.layers.RNN(cell)`, which is equivalent to this API
Instructions for updating:
Use tf.cast instead.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.


  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Load token embeddings... done (0.98 seconds)
number_of_token_original_case_found: 400000
number_of_token_lowercase_found: 8770
number_of_token_digits_replaced_with_zeros_found: 68
number_of_token_lowercase_and_digits_replaced_with_zeros_found: 5
number_of_loaded_word_vectors: 408843
dataset.vocabulary_size: 409977

Starting epoch 0
Training completed in 0.00 seconds
Evaluate model on the train set


  'precision', 'predicted', average, warn_for)


              precision    recall  f1-score   support

       B-LOC     0.0196    0.0140    0.0163      3143
       I-LOC     0.0067    0.1827    0.0130       427
      B-MISC     0.0109    0.0791    0.0192      1466
      I-MISC     0.0048    0.0320    0.0084       532
       B-ORG     0.0264    0.2067    0.0469      2777
       I-ORG     0.0000    0.0000    0.0000      1518
       B-PER     0.0594    0.1869    0.0901      3018
       I-PER     0.0235    0.1018    0.0382      2162

   micro avg     0.0235    0.1072    0.0386     15043
   macro avg     0.0189    0.1004    0.0290     15043
weighted avg     0.0257    0.1072    0.0382     15043

Evaluate model on the valid set
              precision    recall  f1-score   support

       B-LOC     0.0328    0.0267    0.0294      1837
       I-LOC     0.0052    0.1518    0.0100       257
      B-MISC     0.0139    0.0998    0.0244       922
      I-MISC     0.0061    0.0376    0.0105       346
       B-ORG     0.0236    0.2364    0.0430   

### Model Evaluation

Extract the ground truth from test data

In [9]:
ground = ner.convert_ground_truth(data)

Make predictions on test data

In [10]:
predictions = ner.predict(data)

Instructions for updating:
Use standard file APIs to check for files with this prefix.
INFO:tensorflow:Restoring parameters from ./output/conll_2019-04-29_23-01-50-456390/output/model.ckpt


### Preview predictions

Let's take a look at what the predictions look like:

In [11]:
predictions[0:12]

[(None, 6, 'SOCCER', 'O'),
 (None, 1, '-', 'O'),
 (None, 5, 'JAPAN', 'B-LOC'),
 (None, 3, 'GET', 'O'),
 (None, 5, 'LUCKY', 'O'),
 (None, 3, 'WIN', 'O'),
 (None, 1, ',', 'O'),
 (None, 5, 'CHINA', 'O'),
 (None, 2, 'IN', 'O'),
 (None, 8, 'SURPRISE', 'O'),
 (None, 6, 'DEFEAT', 'O'),
 (None, 1, '.', 'O')]
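
Each prediction is a 4-tuple whose last two fields are the token and its BIO tag. The second field matches the token's character length in this preview, and the first is `None` here. As a sketch of how these tuples could be turned into entity spans, the hypothetical helper below merges a `B-` tag with the `I-` tags of the same type that follow it. It is not part of the `neuroner` class, and it leniently treats a stray `I-` tag as the start of a new entity.

```python
def extract_entities(predictions):
    """Collect (entity_text, entity_type) spans from BIO-tagged predictions."""
    entities, tokens, label = [], [], None
    for _, _, token, tag in predictions:
        prefix, _, entity_type = tag.partition("-")
        continues = prefix == "I" and entity_type == label
        if not continues and tokens:
            # The open entity ends here; store it before handling this token.
            entities.append((" ".join(tokens), label))
            tokens, label = [], None
        if prefix == "B" or (prefix == "I" and not continues):
            tokens, label = [token], entity_type
        elif continues:
            tokens.append(token)
    if tokens:
        entities.append((" ".join(tokens), label))
    return entities


preview = [
    (None, 6, "SOCCER", "O"),
    (None, 1, "-", "O"),
    (None, 5, "JAPAN", "B-LOC"),
    (None, 3, "GET", "O"),
]
entities = extract_entities(preview)
print(entities)  # [('JAPAN', 'LOC')]
```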

### Results

Calculate precision, recall, and F1 score:

In [12]:
P, R, F1 = ner.evaluate(predictions, ground)

print('Precision: %s, Recall: %s, F1: %s' % (P, R, F1))

              precision    recall  f1-score   support

       B-LOC     0.5770    0.4155    0.4831      1668
       I-LOC     0.0000    0.0000    0.0000       257
      B-MISC     0.0000    0.0000    0.0000       702
      I-MISC     0.0000    0.0000    0.0000       216
       B-ORG     0.4838    0.3143    0.3810      1661
       I-ORG     0.4035    0.1653    0.2345       835
       B-PER     0.5565    0.4842    0.5179      1617
       I-PER     0.5622    0.6843    0.6172      1156

   micro avg     0.5384    0.3608    0.4321      8112
   macro avg     0.3229    0.2579    0.2792      8112
weighted avg     0.4503    0.3608    0.3927      8112

Precision: 0.5384473877851361, Recall: 0.3608234714003945, F1: 0.4320932979037497
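
The three values returned by `ner.evaluate` match the `micro avg` row of the report. As a quick sanity check, F1 is the harmonic mean of precision and recall, so it can be recomputed from the two printed values:

```python
precision = 0.5384473877851361
recall = 0.3608234714003945

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.4321
```

The result agrees with the reported F1 of 0.4320932979037497 up to floating-point rounding.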
