# Named Entity Recognition
## Approach
The NeuralSequenceLabeler class implements NER using CNNs, bidirectional LSTMs, and a CRF layer:

- Convolutional neural networks (CNNs) encode the characters of each word into a character-level representation.
- The character- and word-level representations are combined and fed into a bidirectional LSTM (BLSTM) to model the context of each word.
- On top of the BLSTM, a sequential CRF jointly decodes the NER labels for the whole sentence.
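
To make the last step concrete, the following is a minimal sketch of Viterbi decoding, the standard way a linear-chain CRF picks the best label sequence for a whole sentence. It is not taken from NeuralSequenceLabeler: the function name `viterbi_decode`, the three-tag set, and all scores are made up for illustration, with the emission scores standing in for what a BLSTM would output.

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag path and its score.

    emissions[t][k]   : score of tag k at position t (stand-in for BLSTM output)
    transitions[j][k] : score of moving from tag j to tag k (CRF parameters)
    """
    n_tags = len(transitions)
    scores = list(emissions[0])  # best score of any path ending in each tag
    backpointers = []
    for emission in emissions[1:]:
        new_scores, pointers = [], []
        for k in range(n_tags):
            # Best previous tag for a path that ends in tag k at this position
            best_prev = max(range(n_tags), key=lambda j: scores[j] + transitions[j][k])
            new_scores.append(scores[best_prev] + transitions[best_prev][k] + emission[k])
            pointers.append(best_prev)
        scores = new_scores
        backpointers.append(pointers)
    # Follow the back-pointers from the best final tag to recover the path
    best_last = max(range(n_tags), key=lambda k: scores[k])
    path = [best_last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, scores[best_last]


# Hypothetical tag set and scores, chosen only to illustrate the idea
tags = ["O", "B-LOC", "I-LOC"]
emissions = [[2.0, 1.0, 0.0], [0.5, 0.0, 1.0], [1.0, 0.0, 0.0]]
transitions = [
    [0.0, 0.0, -10.0],  # O -> I-LOC is heavily penalised
    [0.0, 0.0, 1.0],    # B-LOC -> I-LOC is encouraged
    [0.0, 0.0, 0.0],
]
path, score = viterbi_decode(emissions, transitions)
decoded = [tags[k] for k in path]
print(decoded, score)  # ['B-LOC', 'I-LOC', 'O'] 4.0
```

Picking the best tag for each token independently would give O, I-LOC, O here, which is not a valid BIO sequence. Decoding jointly lets the transition scores rule that sequence out.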

The approach has been run on the benchmarks chosen by G6: CoNLL 2003, OntoNotes 5.0, and CHEMDNER.

In [1]:
from experiment_final import NeuralSequenceLabeler
from utils import parseconfig
from utils.conll2003_prepro import process_data
import os
import tensorflow as tf
tf.logging.set_verbosity(tf.logging.ERROR)
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2' 

In [2]:
config = parseconfig.parseConfig()
file_dict = {
    "train": os.path.join(config["raw_path"], "train1.txt"),
    "dev": os.path.join(config["raw_path"], "valid1.txt"),
    "test": os.path.join(config["raw_path"], "test1.txt"),
}


Instantiate a NeuralSequenceLabeler object

In [3]:
neuralSequenceLabeler = NeuralSequenceLabeler(config)

In [4]:
dataset = neuralSequenceLabeler.read_dataset(file_dict,"ontonotes")

train 219554
dev 55044
test 50350


Initialise the metadata and preprocess the data

In [5]:
train_set,dev_set,test_set,vocab = process_data(dataset,config)
neuralSequenceLabeler.initialize_metadata(vocab)

params number: 6039690


In [6]:
print(dataset['train'][0])

[['-DOCSTART-', '-X-', '-X-', 'O', '-', '-', '-', '-', '-', '-', '-', '-']]


The input to the train method has the format shown below

In [7]:
print(dataset['train'][3])

[['BRUSSELS', 'NNP', 'B-NP', 'B-LOC', '-', '-', '-', '-', '-', '-', '-', '-'], ['1996-08-22', 'CD', 'I-NP', 'O', '-', '-', '-', '-', '-', '-', '-', '-']]
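
Each sentence is a list of token rows: the word, its POS tag, its chunk tag, and its NER tag, followed by `-` placeholders up to twelve columns. The sketch below shows one way such rows could be built from raw CoNLL-2003 lines. The helpers `conll_line_to_row` and `read_sentences` are hypothetical and are not the `read_dataset` implementation; the twelve-column padding is inferred from the printed output above.

```python
def conll_line_to_row(line, width=12, pad="-"):
    """Split a whitespace-separated CoNLL line and pad it to `width` columns."""
    fields = line.split()
    return fields + [pad] * (width - len(fields))


def read_sentences(lines):
    """Group token rows into sentences, using blank lines as separators."""
    sentences, current = [], []
    for line in lines:
        if line.strip():
            current.append(conll_line_to_row(line))
        elif current:
            sentences.append(current)
            current = []
    if current:  # flush the last sentence if the input has no trailing blank line
        sentences.append(current)
    return sentences


# Sample lines in CoNLL-2003 layout: word, POS tag, chunk tag, NER tag
sample = [
    "-DOCSTART- -X- -X- O",
    "",
    "BRUSSELS NNP B-NP B-LOC",
    "1996-08-22 CD I-NP O",
    "",
]
sentences = read_sentences(sample)
print(sentences[1][0])
```

The last line prints a row with the same shape as the `BRUSSELS` row shown above.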


Load the model file, if one was previously saved

In [12]:
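try:
    neuralSequenceLabeler.load_model()
    print("Loading completed")
except Exception as err:
    # Catch Exception instead of using a bare except, so a keyboard interrupt is not swallowed
    print("Error loading the model:", err)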

in load model....


Loading completed


Pre-process the dataset into the required format

In [13]:
file_dict = {"train": train_set, "dev": dev_set, "test": test_set}

In [14]:
dataset = neuralSequenceLabeler.read_dataset_helper(file_dict,"ontonotes")

14987
3466
3684


Train the model using the train, dev, and test dataset splits


In [15]:
neuralSequenceLabeler.train(dataset["train_set"],dataset["dev_data"],dataset["dev_set"],dataset["test_set"])

Start training...
Epoch 1/2:




in predict...


dev dataset -- pre: 76.81, rec: 80.10, FB1: 78.42
 -- new BEST score on test dataset: 78.42
Epoch 2/2:




in predict...


dev dataset -- pre: 81.09, rec: 80.79, FB1: 80.94
 -- new BEST score on test dataset: 80.94


In [16]:
predictions_formatted = neuralSequenceLabeler.predict(dataset["test_set"],"test")



in predict...


In [19]:
print(predictions_formatted[1])

[[None, None, 'soccer', 'O'], [None, None, '-', 'O'], [None, None, 'japan', 'B-LOC'], [None, None, 'get', 'O'], [None, None, 'lucky', 'O'], [None, None, 'win', 'O'], [None, None, ',', 'O'], [None, None, 'china', 'B-LOC'], [None, None, 'in', 'O'], [None, None, 'surprise', 'O'], [None, None, 'defeat', 'O'], [None, None, '.', 'O']]
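
Each prediction row holds the word at index 2 and the predicted tag at index 3. As an illustration of how such BIO tags turn into entities, here is a minimal sketch that groups tagged tokens into spans. The helper `extract_entities` is hypothetical and not part of NeuralSequenceLabeler, and it leniently treats an `I-` tag with no matching open entity as the start of a new one.

```python
def extract_entities(rows):
    """Group B-/I- tagged tokens of one sentence into (entity_type, text) pairs."""
    entities, current_type, current_words = [], None, []
    for row in rows:
        word, tag = row[2], row[3]
        # An I- tag of the same type continues the currently open entity
        if tag.startswith("I-") and current_type == tag[2:]:
            current_words.append(word)
            continue
        # Any other tag closes the open entity, if there is one
        if current_type is not None:
            entities.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
        # A B- tag (or a stray I- tag) opens a new entity
        if tag.startswith("B-") or tag.startswith("I-"):
            current_type, current_words = tag[2:], [word]
    if current_type is not None:
        entities.append((current_type, " ".join(current_words)))
    return entities


# The first few rows of the sentence printed above
sentence = [
    [None, None, "soccer", "O"], [None, None, "-", "O"],
    [None, None, "japan", "B-LOC"], [None, None, "get", "O"],
    [None, None, "lucky", "O"], [None, None, "win", "O"],
    [None, None, ",", "O"], [None, None, "china", "B-LOC"],
    [None, None, "in", "O"],
]
entities = extract_entities(sentence)
print(entities)  # [('LOC', 'japan'), ('LOC', 'china')]
```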


Extract the predictions and ground-truth labels in order to evaluate the benchmarks

In [21]:
predictions = []
groundTruths = []
words_list = []

# In each predicted token row, index 2 holds the word and index 3 the predicted tag
for sentence in predictions_formatted:
    words_list.append([token[2] for token in sentence])
    predictions.append([token[3] for token in sentence])

# Map gold tag ids back to tag strings, truncating each sequence to its true length
for data in dataset["test_set"]:
    for tags, seq_len in zip(data["tags"], data["seq_len"]):
        groundTruths.append([neuralSequenceLabeler.rev_tag_dict[x] for x in tags[:seq_len]])

Evaluate the predictions and report the precision, recall, and F1 scores


In [23]:
save_path = os.path.join(neuralSequenceLabeler.cfg["checkpoint_path"], "result.txt")
name = "test"
score = neuralSequenceLabeler.evaluate(predictions, groundTruths,words_list, save_path,name)

test dataset -- pre: 81.09, rec: 80.79, FB1: 80.94
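
The FB1 column is the F1 score, the harmonic mean of precision and recall. The short sketch below recomputes it from the reported precision and recall as a sanity check. The function `f_beta` is illustrative only and is not the `evaluate` method used above.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score; with beta=1 this is the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall as reported in the outputs above, in percent
f1_epoch1 = f_beta(76.81, 80.10)
f1_epoch2 = f_beta(81.09, 80.79)
print(f"{f1_epoch1:.2f} {f1_epoch2:.2f}")  # 78.42 80.94
```

Both values agree with the FB1 figures printed during training and evaluation.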
