## Assignment 3 - Named Entity Recognition

In this assignment, we are going to build a Named Entity Recognition model. With this model, we will also tag new data.

More on Named Entity Recognition:

https://blog.paralleldots.com/data-science/named-entity-recognition-milestone-models-papers-and-technologies/

https://blog.paralleldots.com/product/applications-named-entity-recognition-api/

### Steps:

**1. Import the data**

**2. Build the model**

**3. Pick a dataset to run the model on**

**4. Build a function to load new data and print the tags**

Your web application will load small sections of text (such as tweets or headlines) and from that, you will tag the text based on the presence of named entities.

*What you will be graded on:*

1. Ability to build a model on word and tag data

2. Ability to use the model to predict on new data and display that prediction

*The model will be based on:*
1. Embeddings from words
2. Embeddings from tag inputs

### Step 1: Importing the data

Below is some code to get you started. As in the part of speech tagging example, you will have to write code to:

0. Split your data into a train/test set (Do a 80/20 or 90/10 split since we'll be later applying this model to an entirely separate set of data)
1. Find the set of all words
2. Find the set of all tags
3. **Create a function called ent_tagger** that will turn a sentence into this output for model building :
``` [('Thousands', 'O'), ('of', 'O'), ('demonstrators', 'O'), ('have',  'O'), ('marched',  'O'), ('through',  'O'), ('London', 'B-geo'), ('to',  'O'), ('protest',  'O'), ('the',  'O'), ('war',  'O'), ('in',  'O'), ('Iraq',  'B-geo'), ('and', 'O'), ('demand',  'O'), ('the',  'O'), ('withdrawal', 'O'), ('of', 'O'), ('British', 'B-gpe'), ('troops',  'O'), ('from', 'O'), ('that', 'O'), ('country', 'O'), ('.', 'O')]
```
4. Make a dictionary of words to index and entity tag to index

In [45]:
import pandas as pd
import numpy as np

data = pd.read_csv("ner_dataset.csv", encoding="latin1")
data = data.fillna(method="ffill")
data.head(10)

Unnamed: 0,Sentence #,Word,POS,Tag
0,Sentence: 1,Thousands,NNS,O
1,Sentence: 1,of,IN,O
2,Sentence: 1,demonstrators,NNS,O
3,Sentence: 1,have,VBP,O
4,Sentence: 1,marched,VBN,O
5,Sentence: 1,through,IN,O
6,Sentence: 1,London,NNP,B-geo
7,Sentence: 1,to,TO,O
8,Sentence: 1,protest,VB,O
9,Sentence: 1,the,DT,O


In [46]:
#to get the all words list(no repeat)
words_list = data.Word.values.tolist()
vocabulary = set(words_list)

In [47]:
#to get the all tags list(no repeat)
tags_list = data.Tag.values.tolist()
tags_list_set = set(tags_list)

ent_tagger function will be created at the last question


In [48]:
#dictionary of words to index
word2int = {}
for i,word in enumerate(vocabulary):
    word2int[word] = i+1

In [49]:
#dictionary of entity tag to index to index
tag2int = {}
for i, tag in enumerate(tags_list_set):
    tag2int[tag] = i+1

### Step 1a: Formatting the data
Data will need to be

1. Indexed
2. Limited by vocabulary (ie replace tokens with UNKNOWN if they are too rare, come up with a reasonable limit based on your survey of the data and also model performance)
3. Padded

In [50]:
#create vocabulary and set words that appear less than 2 to unknown
import pickle

def make_lexicon(words_list, min_freq=1):
    word_counts = {}
    for word in words_list:
        if word in word_counts:
            word_counts[word] += 1
        else:
            word_counts[word] = 1

    lexicon = [word for word, count in word_counts.items() if count >= min_freq]
    lexicon = {word:idx + 2 for idx,word in enumerate(lexicon)}
    lexicon[u'<UNK>'] = 1 
    lexicon_size = len(lexicon)

    print("LEXICON SAMPLE ({} total items):".format(len(lexicon)))
    print(dict(list(lexicon.items())[:20]))
    
    return lexicon

print("WORDS:")
words_lexicon = make_lexicon(words_list)

print("TAGS:")
tags_lexicon = make_lexicon(tags_list)

WORDS:
LEXICON SAMPLE (35179 total items):
{'Thousands': 2, 'of': 3, 'demonstrators': 4, 'have': 5, 'marched': 6, 'through': 7, 'London': 8, 'to': 9, 'protest': 10, 'the': 11, 'war': 12, 'in': 13, 'Iraq': 14, 'and': 15, 'demand': 16, 'withdrawal': 17, 'British': 18, 'troops': 19, 'from': 20, 'that': 21}
TAGS:
LEXICON SAMPLE (18 total items):
{'O': 2, 'B-geo': 3, 'B-gpe': 4, 'B-per': 5, 'I-geo': 6, 'B-org': 7, 'I-org': 8, 'B-tim': 9, 'B-art': 10, 'I-art': 11, 'I-per': 12, 'I-gpe': 13, 'I-tim': 14, 'B-nat': 15, 'B-eve': 16, 'I-eve': 17, 'I-nat': 18, '<UNK>': 1}


In [51]:
#a dictionary where the string representation of a lexicon item can be retrieved from its numerical index

def get_lexicon_lookup(lexicon):
    lexicon_lookup = {idx: lexicon_item for lexicon_item, idx in lexicon.items()}
    print("LEXICON LOOKUP SAMPLE:")
    print(dict(list(lexicon_lookup.items())[:20]))
    return lexicon_lookup

tags_lexicon_lookup = get_lexicon_lookup(tags_lexicon)

LEXICON LOOKUP SAMPLE:
{2: 'O', 3: 'B-geo', 4: 'B-gpe', 5: 'B-per', 6: 'I-geo', 7: 'B-org', 8: 'I-org', 9: 'B-tim', 10: 'B-art', 11: 'I-art', 12: 'I-per', 13: 'I-gpe', 14: 'I-tim', 15: 'B-nat', 16: 'B-eve', 17: 'I-eve', 18: 'I-nat', 1: '<UNK>'}


In [52]:
#renew the dataframe
def tokens_to_idxs(words_list, lexicon):
    idx_seqs = [lexicon[word] if word in lexicon else lexicon['<UNK>'] for word in words_list]  
    return idx_seqs

data['Word_Idxs'] = tokens_to_idxs(words_list, words_lexicon)
data['Tag_Idxs'] = tokens_to_idxs(tags_list, tags_lexicon)
data[['Sentence #', 'Word', 'Word_Idxs', 'Tag', 'Tag_Idxs']][:10]

Unnamed: 0,Sentence #,Word,Word_Idxs,Tag,Tag_Idxs
0,Sentence: 1,Thousands,2,O,2
1,Sentence: 1,of,3,O,2
2,Sentence: 1,demonstrators,4,O,2
3,Sentence: 1,have,5,O,2
4,Sentence: 1,marched,6,O,2
5,Sentence: 1,through,7,O,2
6,Sentence: 1,London,8,B-geo,3
7,Sentence: 1,to,9,O,2
8,Sentence: 1,protest,10,O,2
9,Sentence: 1,the,11,O,2


In [44]:
#condense each sentence to one row
sentence_list = []
for  i in set(data['Sentence #'].values.tolist()):
    sentence_list.append(i)

Tokenized_Sentence = []
Sentence_Idxs = []
Tagged_Sentence = []
Tag_Idxs = []

for i in sentence_list:
    sentence = data.loc[data['Sentence #'] == i, 'Word'].values.tolist()
    sentence_idx = data.loc[data['Sentence #'] == i, 'Word_Idxs'].values.tolist()
    tag = data.loc[data['Sentence #'] == i, 'Tag'].values.tolist()
    tag_idx = data.loc[data['Sentence #'] == i, 'Tag_Idxs'].values.tolist()
    
    Tokenized_Sentence.append(sentence)
    Sentence_Idxs.append( sentence_idx)
    Tagged_Sentence.append(tag)
    Tag_Idxs.append(tag_idx)

KeyboardInterrupt: 

In [None]:
#padded
from keras.preprocessing.sequence import pad_sequences

def pad_idx_seqs(idx_seqs, max_seq_len):
    padded_idxs = pad_sequences(sequences=idx_seqs, maxlen=max_seq_len)
    return padded_idxs

max_seq_len = max([len(idx_seq) for idx_seq in Sentence_Idxs])
padded_words = pad_idx_seqs(Sentence_Idxs, max_seq_len + 1) 
padded_tags = pad_idx_seqs(Tag_Idxs,max_seq_len + 1)  

print("WORDS:\n", padded_words)
print("SHAPE:", padded_words.shape, "\n")

print("TAGS:\n", padded_tags)
print("SHAPE:", padded_tags.shape, "\n")

In [None]:
#split data into a train/test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(padded_words, padded_tags, test_size = 0.1, random_state = 1992)

### Step 2. Build the model

Here we will build a Bidirectional LSTM-CRF model using the `Bidirectional` function from Keras and `CRF` function from Keras-contrib

**Documentation and source code:**

https://keras.io/layers/wrappers/#bidirectional

https://github.com/keras-team/keras-contrib

Fit your model with a validation split of 0.1, feel free to use as many epochs as you like. Base your predictions both from the input words **and** the tags from previous words like in the POS example.

After building your model, grade your performance on your test set, both by comparing your predicted output to the actual (*at least 3 examples*) and calculate the averaged precision and recall for your tags.

In [53]:
from __future__ import absolute_import
from __future__ import print_function
from __future__ import division

import numpy
from collections import Counter

from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM
from keras.preprocessing.sequence import pad_sequences
from keras_contrib.layers import CRF

EPOCHS = 5
EMBED_DIM = 200
BiRNN_UNITS = 200


In [54]:
#define the report function
def classification_report(y_true, y_pred, labels):
    y_true = numpy.asarray(y_true).ravel()
    y_pred = numpy.asarray(y_pred).ravel()
    corrects = Counter(yt for yt, yp in zip(y_true, y_pred) if yt == yp)
    y_true_counts = Counter(y_true)
    y_pred_counts = Counter(y_pred)
    report = ((lab,  # label
               corrects[i] / max(1, y_true_counts[i]),  # recall
               corrects[i] / max(1, y_pred_counts[i]),  # precision
               y_true_counts[i]  # support
               ) for i, lab in enumerate(labels))
    report = [(l, r, p, 2 * r * p / max(1e-9, r + p), s) for l, r, p, s in report]

    print('{:<15}{:>10}{:>10}{:>10}{:>10}\n'.format('', 'recall', 'precision', 'f1-score', 'support'))
    formatter = '{:<15}{:>10.2f}{:>10.2f}{:>10.2f}{:>10d}'.format
    for r in report:
        print(formatter(*r))
    print('')
    report2 = list(zip(*[(r * s, p * s, f1 * s) for l, r, p, f1, s in report]))
    N = len(y_true)
    print(formatter('avg / total', sum(report2[0]) / N, sum(report2[1]) / N, sum(report2[2]) / N, N) + '\n')


In [55]:
#build the model
model = Sequential()
model.add(Embedding(len(vocabulary), EMBED_DIM, mask_zero=True))
model.add(Bidirectional(LSTM(BiRNN_UNITS // 2, return_sequences=True)))
crf = CRF(len(tags_lexicon), sparse_target=True)
model.add(crf)
model.summary()

model.compile('adam', loss=crf.loss_function, metrics=[crf.accuracy])
model.fit(x_train[:,1:], y_train[:, 1:, None], epochs=EPOCHS,validation_data=[x_test[:,1:], y_test[:, 1:, None]])

y_test_pred = model.predict(x_test).argmax(-1)[x_test > 0]
y_test_true = y_test[x_test > 0]

print('\n---- Result of BiLSTM-CRF ----\n')
classification_report(y_test_true, y_test_pred, tags_lexicon)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, None, 200)         7035600   
_________________________________________________________________
bidirectional_2 (Bidirection (None, None, 200)         240800    
_________________________________________________________________
crf_2 (CRF)                  (None, None, 18)          3978      
Total params: 7,280,378
Trainable params: 7,280,378
Non-trainable params: 0
_________________________________________________________________


NameError: name 'x_train' is not defined

### Step 3. Pick a dataset

Pick a dataset that has short text, similar to the sentences you just tagged. Headlines and tweets are good choices.

https://www.kaggle.com/datasets?sortBy=relevance&group=public&search=news&page=1&pageSize=20&size=all&filetype=all&license=all

In [56]:
#the dataset I picked is india-news-headlines.csv
headline= pd.read_csv("india-news-headlines.csv", encoding="latin1")
headline = headline.fillna(method="ffill")
headline= headline[:5000]
headline = headline.dropna()
headline = headline.drop(['publish_date',
                'headline_category'], axis=1)
headline.head(10)

Unnamed: 0,headline_text
0,win over cena satisfying but defeating underta...
1,Raju Chacha
2,Status quo will not be disturbed at Ayodhya; s...
3,Fissures in Hurriyat over Pak visit
4,America's unwanted heading for India?
5,For bigwigs; it is destination Goa
6,Extra buses to clear tourist traffic
7,Dilute the power of transfers; says Riberio
8,Focus shifts to teaching of Hindi
9,IT will become compulsory in schools


In [57]:
#deal with the dataset: tokenize, word2idx, tag, tag2idx
import nltk
headline_list = headline.headline_text.values.tolist()
tokenize_headlines = []
headline_words_list = []
for headline in headline_list:
    tokenize_headline = nltk.word_tokenize(headline)
    tokenize_headlines.append(tokenize_headline)
    for word in tokenize_headline:
        headline_words_list.append(word)
vocabulary_headline = set(headline_words_list)

### Step 4. Tag your new data!

Create a modification to the **ent_tagger function** that combined words and tags from your original dataset. Now allow the function to also load new text from your new data set, and output the tags predicted from your trained model alongside the text. Make your function load five random texts from your data and output the tagged text.

In [58]:
#define ent_tagger function using words list

ent_tagger_list = []
headline_tags_list = []
def ent_tagger(tokenize_headlines_sequences):
    for headline in tokenize_headlines_sequences:
        headline_tag_list = []
        for word in headline:
            if word in vocabulary:
                tag = data.loc[data['Word'] == word, 'Tag'].values.tolist()[0]
            else:
                tag = '<UNK>'
            tagger = (word,tag)
            headline_tag_list.append(tag)
            ent_tagger_list.append(tagger)
        headline_tags_list.append(headline_tag_list)

In [59]:
#call the function
ent_tagger(tokenize_headlines)

In [31]:
print("WORDS:")
words_lexicon = make_lexicon(headline_words_list)

print("TAGS:")
tags_lexicon = make_lexicon(headline_tags_list)

WORDS:
LEXICON SAMPLE (9123 total items):
{'win': 2, 'over': 3, 'cena': 4, 'satisfying': 5, 'but': 6, 'defeating': 7, 'undertaker': 8, 'bigger': 9, 'roman': 10, 'reigns': 11, 'Raju': 12, 'Chacha': 13, 'Status': 14, 'quo': 15, 'will': 16, 'not': 17, 'be': 18, 'disturbed': 19, 'at': 20, 'Ayodhya': 21}
TAGS:
LEXICON SAMPLE (1 total items):
{'<UNK>': 1}


In [60]:
#get tags lookup
tags_lexicon_lookup = get_lexicon_lookup(tags_lexicon)

LEXICON LOOKUP SAMPLE:
{2: 'O', 3: 'B-geo', 4: 'B-gpe', 5: 'B-per', 6: 'I-geo', 7: 'B-org', 8: 'I-org', 9: 'B-tim', 10: 'B-art', 11: 'I-art', 12: 'I-per', 13: 'I-gpe', 14: 'I-tim', 15: 'B-nat', 16: 'B-eve', 17: 'I-eve', 18: 'I-nat', 1: '<UNK>'}


In [61]:
def tokens_to_idxs(token_seqs, lexicon):
    idx_seqs = [[lexicon[token] if token in lexicon else lexicon['<UNK>'] for token in token_seq]  
                                                                     for token_seq in token_seqs]
    return idx_seqs

headline_idx = tokens_to_idxs(tokenize_headlines, words_lexicon)
headline_tag_idx = tokens_to_idxs(headline_tags_list, tags_lexicon)

In [62]:
#padded the new dataset
from keras.preprocessing.sequence import pad_sequences

def pad_idx_seqs(idx_seqs, max_seq_len):
    padded_idxs = pad_sequences(sequences=idx_seqs, maxlen=max_seq_len)
    return padded_idxs

max_seq_len = max([len(idx_seq) for idx_seq in headline_idx])
padded_headlines = pad_idx_seqs(headline_idx, max_seq_len + 1) 
padded_headline_tags = pad_idx_seqs(headline_tag_idx,max_seq_len + 1)  

print("WORDS:\n", padded_headlines)
print("SHAPE:", padded_headlines.shape, "\n")

print("TAGS:\n", padded_headline_tags)
print("SHAPE:", padded_headline_tags.shape, "\n")

WORDS:
 [[    0     0     0 ...,  8967     1     1]
 [    0     0     0 ...,     0     1     1]
 [    0     0     0 ...,  3235   570 32540]
 ..., 
 [    0     0     0 ...,     9  4286   171]
 [    0     0     0 ...,    55   398     1]
 [    0     0     0 ...,   431   314     1]]
SHAPE: (5000, 99) 

TAGS:
 [[ 0  0  0 ...,  2  1  1]
 [ 0  0  0 ...,  0  1  1]
 [ 0  0  0 ...,  2  2 12]
 ..., 
 [ 0  0  0 ...,  2  2  5]
 [ 0  0  0 ...,  2  2  1]
 [ 0  0  0 ...,  2  2  1]]
SHAPE: (5000, 99) 



In [63]:
#predict the tags
y_test_pred = model.predict(padded_headlines).argmax(-1)[padded_headlines > 0]
y_test_true = padded_headline_tags[padded_headlines > 0]

print('\n---- Result of BiLSTM-CRF ----\n')
classification_report(y_test_true, y_test_pred, tags_lexicon)


---- Result of BiLSTM-CRF ----

                   recall precision  f1-score   support

O                    0.00      0.00      0.00         0
B-geo                0.00      0.21      0.00      7322
B-gpe                0.00      0.00      0.00     21421
B-per                0.00      0.00      0.00       610
I-geo                0.00      0.00      0.00       176
B-org                0.98      0.01      0.02       288
I-org                0.00      0.00      0.00       126
B-tim                0.00      0.00      0.00       514
B-art                0.00      0.00      0.00       606
I-art                0.00      0.00      0.00       118
I-per                0.00      0.00      0.00        28
I-gpe                0.00      0.00      0.00        88
I-tim                0.00      0.00      0.00       389
B-nat                0.00      0.00      0.00        23
B-eve                0.00      0.00      0.00       125
I-eve                0.00      0.00      0.00        14
I-nat         

In [64]:
#print out true tags and predict tags to have a direct view
y_test_true_tag = []
for i in y_test_true[:10]:
    tag = tags_lexicon_lookup[i]
    y_test_true_tag.append(tag)
    
y_test_pred_tag = []
for i in y_test_pred[:10]:
    tag = tags_lexicon_lookup[i]
    y_test_pred_tag.append(tag)  

print('True label:   ',y_test_true_tag)
print('Predict label:',y_test_pred_tag)

True label:    ['O', 'O', '<UNK>', 'O', 'O', 'O', '<UNK>', 'O', '<UNK>', '<UNK>']
Predict label: ['B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per', 'B-per']
