# ![img](https://broutonlab.com/static/img/banners/data-extraction-and-document-parsing-software.jpg)

In this notebook we will work with the [MIT Movie Corpus](https://groups.csail.mit.edu/sls/downloads/movie/).

```
The MIT Movie Corpus is a semantically tagged training and test corpus in BIO format. 
The eng corpus are simple queries, and the trivia10k13 corpus are more complex queries.
```
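
To make the BIO scheme concrete: a `B-X` tag opens an entity of type `X`, `I-X` continues it, and `O` marks tokens outside any entity. The sketch below groups tagged tokens into entity spans. The helper name `bio_to_spans` and the sample query are illustrative assumptions, not part of the corpus tooling.

```python
def bio_to_spans(words, tags):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    spans = []
    current_type, current_words = None, []
    for word, tag in zip(words, tags):
        if tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != current_type):
            # A new entity starts here (a stray I- tag is treated like B-)
            if current_words:
                spans.append((current_type, " ".join(current_words)))
            current_type, current_words = tag[2:], [word]
        elif tag.startswith("I-"):
            # Continuation of the entity that is currently open
            current_words.append(word)
        else:
            # "O" closes any open entity
            if current_words:
                spans.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
    if current_words:
        spans.append((current_type, " ".join(current_words)))
    return spans

# Hypothetical query tagged in the style of the corpus
words = ["show", "me", "films", "with", "drew", "barrymore", "from", "the", "1980s"]
tags = ["O", "O", "O", "O", "B-ACTOR", "I-ACTOR", "O", "O", "B-YEAR"]
print(bio_to_spans(words, tags))
```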

# **Load dataset**

Let's load the train and test datasets

In [None]:
import urllib.request

def load_and_process_data(data_url, dest_file_path):
    # Download the corpus file and read it as text
    urllib.request.urlretrieve(data_url, dest_file_path)
    with open(dest_file_path) as f:
        text = f.read()

    dataset = []
    for item in text.split('\n'):
        item = item.strip()
        # Skip blank lines
        if len(item) == 0:
            continue
        # Each line has the form "<tag>\t<word>"
        t, w = item.split('\t')
        dataset.append((w, t))
    return dataset

In [None]:
train_dataset = load_and_process_data("https://groups.csail.mit.edu/sls/downloads/movie/engtrain.bio", "./engtrain.bio")
test_dataset = load_and_process_data("https://groups.csail.mit.edu/sls/downloads/movie/engtest.bio", "./engtest.bio")

... and prepare the list of all tags

In [None]:
# Sort the tags so that tag ids stay the same between runs (set order is not stable)
types = sorted(set(map(lambda x: x[1], train_dataset + test_dataset)))
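
The loader above flattens the corpus into a single stream of tokens. If sentence boundaries matter for your experiments, a variant that keeps them could look like the sketch below. It assumes that queries in the file are separated by blank lines; the function name `parse_bio_sentences` and the sample text are hypothetical.

```python
def parse_bio_sentences(text):
    """Split BIO text into sentences; each sentence is a list of (word, tag) pairs."""
    sentences, current = [], []
    for line in text.split("\n"):
        line = line.strip()
        if not line:
            # A blank line ends the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        tag, word = line.split("\t")
        current.append((word, tag))
    if current:
        sentences.append(current)
    return sentences

# Hypothetical two-query sample in "<tag>\t<word>" format
sample = (
    "O\twhat\nO\tmovies\nO\tstar\nB-ACTOR\tbruce\nI-ACTOR\twillis\n"
    "\n"
    "O\tshow\nO\tme\nB-GENRE\tcomedies\n"
)
sentences = parse_bio_sentences(sample)
print(len(sentences), [len(s) for s in sentences])
```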

# **Let's try pretrained spaCy**

There are a number of NER frameworks. Let's try to solve this problem with spaCy!

In [None]:
import spacy
spacy_pretrained_model = spacy.load("en_core_web_sm")

def debug_spacy(spacy_model, snt):
  doc = spacy_model(snt)
  for ent in doc.ents:
      print("{} [{}-{}]: {}".format(ent.text, ent.start_char, ent.end_char, ent.label_))

In [None]:
debug_spacy(spacy_pretrained_model, "I live in Russia. I work in ABC LLC.")

In [None]:
debug_spacy(spacy_pretrained_model, " ".join(map(lambda x: x[0], train_dataset[:150])))

Not bad, but it seems to extract only standard entities such as persons and locations.

We have checked that pretrained spaCy does not solve the problem: no magic happened. Let's train our own NER system in plain Keras.

# **NER in Keras**

In this tutorial we will deal with the word-based approach and see how it works.
First of all, we have to prepare a dictionary of input words.

In [None]:
from collections import Counter
word2count = Counter(map(lambda x: x[0], train_dataset))
MAX_WORD_COUNT = 50000
top_words = [x[0] for x in sorted(word2count.items(), key=lambda x: x[1], reverse=True)][:MAX_WORD_COUNT]
word2id = {x:index+1 for index, x in enumerate(top_words)}
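
Ids start at 1 so that 0 can stand for every word missing from the dictionary. The toy example below illustrates that convention on made-up data; `toy_words`, `toy_word2id` and `encode` are hypothetical names used only here.

```python
from collections import Counter

# Toy token list standing in for the training words (hypothetical data)
toy_words = ["the", "matrix", "the", "godfather", "the", "matrix"]
toy_counts = Counter(toy_words)

# most_common(2) keeps the two most frequent words; ids start at 1
toy_word2id = {w: i + 1 for i, (w, _) in enumerate(toy_counts.most_common(2))}

def encode(words, vocab):
    """Map words to ids, falling back to 0 for out-of-vocabulary words."""
    return [vocab.get(w, 0) for w in words]

encoded = encode(["the", "godfather", "matrix"], toy_word2id)
print(toy_word2id)
print(encoded)
```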

Let's implement the NER model. Don't hesitate to modify it!

In [None]:
from keras.layers import Input, LSTM, Embedding, Dense, Bidirectional
from keras.models import Model

# Variable-length sequences of word ids (named `inputs` to avoid shadowing the builtin `input`)
inputs = Input(shape=(None,))
out = Embedding(input_dim=len(word2id)+1, output_dim=200)(inputs)
# your code
out = Bidirectional(LSTM(200, activation='relu', return_sequences=True))(out)
# One softmax over the tag set for every position in the sequence
out = Dense(len(types), activation='softmax')(out)
model = Model(inputs, out)
model.compile(optimizer='adam', loss='categorical_crossentropy')
model.summary()

In [None]:
import random
import numpy as np

type2id = {x:index for index, x in enumerate(types)}
print(type2id)

def getWordId(w):
    # Id 0 is reserved for words missing from the vocabulary
    return word2id.get(w, 0)

def gen_batches(dataset, batch_size=64, seq_size=32, batch_count=100):
    # The dataset is a flat token stream, so it must not be shuffled:
    # shuffling would destroy the word order the model learns from
    for _ in range(batch_count):
        # Allocate fresh arrays per batch so that collected batches do not share memory
        features = np.zeros((batch_size, seq_size))
        labels = np.zeros((batch_size, seq_size, len(type2id)))
        for seq_index in range(batch_size):
            # Sample a random window of consecutive tokens
            left = random.randint(0, len(dataset) - seq_size)
            window = dataset[left:left+seq_size]
            features[seq_index,:] = [getWordId(w) for w, _ in window]
            for i,(_,t) in enumerate(window):
                labels[seq_index,i,type2id[t]] = 1
        yield features, labels

def gen_data(dataset, seq_size=32, item_count=100):
    # The dataset is a flat token stream, so it must not be shuffled
    for _ in range(item_count):
        left = random.randint(0, len(dataset) - seq_size)
        window = dataset[left:left+seq_size]
        features = np.array([getWordId(w) for w, _ in window])
        # Fresh one-hot label matrix for every item
        labels = np.zeros((seq_size, len(type2id)))
        for i,(_,t) in enumerate(window):
            labels[i,type2id[t]] = 1
        yield features, labels
        
def encode_text(sentence):
    words = sentence.split()
    result = np.zeros((len(words),))
    for i,w in enumerate(words):
        result[i] = getWordId(w)
    return result
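
The label arrays built above are one-hot encodings: every token gets a vector with a single 1 at the index of its tag. The same encoding can be sketched more compactly with `np.eye`; the tag inventory `toy_type2id` below is a hypothetical stand-in for `type2id`.

```python
import numpy as np

# Hypothetical tag inventory; in the notebook this role is played by type2id
toy_type2id = {"O": 0, "B-ACTOR": 1, "I-ACTOR": 2}
toy_tags = ["O", "B-ACTOR", "I-ACTOR", "O"]

tag_ids = np.array([toy_type2id[t] for t in toy_tags])
# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(len(toy_type2id))[tag_ids]

# Shape is (sequence length, number of tags)
print(one_hot.shape)
# argmax along the last axis recovers the original tag ids
recovered = one_hot.argmax(axis=-1)
print(recovered)
```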

In [None]:
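# Stack the generated batches into single arrays, as Keras expects for validation_data
val_batches = list(gen_batches(test_dataset, batch_count=150))
x_val = np.concatenate([x for x, _ in val_batches])
y_val = np.concatenate([y for _, y in val_batches])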

In [None]:
from datetime import datetime
import keras.callbacks

logdir = "logs/scalars/" + datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = keras.callbacks.TensorBoard(log_dir=logdir)
mcp_save = keras.callbacks.ModelCheckpoint('trained_model.hdf5', save_best_only=True, monitor='val_loss', mode='min')

training_history = model.fit(
    gen_batches(train_dataset, batch_count=1000),
    validation_data=(x_val, y_val),
    verbose=1, steps_per_epoch=10, epochs=32,
    callbacks=[tensorboard_callback, mcp_save])

In [None]:
# %load_ext tensorboard
# %tensorboard --logdir=logs

# Test model

Let's review how the model works in production!

In [None]:
# Load with the same keras package that was used to build and save the model
from keras.models import load_model
best_model = load_model('trained_model.hdf5')

In [None]:
import pandas as pd

query = test_dataset[200:300]
query_words = [x[0] for x in query]
query_types = [x[1] for x in query]
# Run the whole query through the model as a batch with a single sequence
result = best_model.predict_on_batch(encode_text(" ".join(query_words)).reshape((1, -1)))[0]

# Pick the most probable tag for every word
predictions = [types[np.argmax(result[index,:])] for index in range(result.shape[0])]

result_dataframe = pd.DataFrame.from_dict({"words":query_words, "types":query_types, "preds": predictions})
result_dataframe.head(50)
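
Reading the table by eye only goes so far. One simple numeric summary is token-level accuracy, optionally ignoring the dominant `O` tag so that it does not inflate the score. This is a minimal sketch on made-up tags; `token_accuracy` is a hypothetical helper, and entity-level metrics would be stricter.

```python
def token_accuracy(true_tags, pred_tags, ignore_tag=None):
    """Share of tokens whose predicted tag matches the true tag."""
    pairs = [(t, p) for t, p in zip(true_tags, pred_tags) if t != ignore_tag]
    if not pairs:
        return 0.0
    return sum(t == p for t, p in pairs) / len(pairs)

# Hypothetical true and predicted tags for a five-word query
true_tags = ["O", "B-ACTOR", "I-ACTOR", "O", "B-YEAR"]
pred_tags = ["O", "B-ACTOR", "O", "O", "B-YEAR"]

acc_all = token_accuracy(true_tags, pred_tags)
acc_entities = token_accuracy(true_tags, pred_tags, ignore_tag="O")
print(acc_all, acc_entities)
```

In the notebook, the same function could be called with `query_types` and `predictions`.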

# Char based approach

# Home task

- 3 points: make the model better
- 7 points: implement the model with CRF layer (https://github.com/Hironsan/keras-crf-layer)
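
As a starting point for the CRF task: at prediction time a CRF typically replaces the per-token argmax with Viterbi decoding, which finds the best tag sequence given per-token scores and tag-to-tag transition scores. The NumPy sketch below illustrates only that decoding step. It is not the code of the linked layer, and the scores in the example are made up.

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag sequence as a list of tag ids.

    emissions: (seq_len, n_tags) per-token tag scores
    transitions: (n_tags, n_tags), transitions[i, j] scores moving from tag i to tag j
    """
    seq_len, n_tags = emissions.shape
    score = emissions[0].copy()
    backpointers = np.zeros((seq_len, n_tags), dtype=int)
    for t in range(1, seq_len):
        # candidate[i, j]: best path ending in tag i at t-1, then moving to tag j at t
        candidate = score[:, None] + transitions + emissions[t][None, :]
        backpointers[t] = candidate.argmax(axis=0)
        score = candidate.max(axis=0)
    # Follow the backpointers from the best final tag
    best = [int(score.argmax())]
    for t in range(seq_len - 1, 0, -1):
        best.append(int(backpointers[t, best[-1]]))
    return best[::-1]

# Made-up scores for tags 0="O", 1="B", 2="I"; the O -> I transition is heavily penalised
emissions = np.array([[2.0, 0.0, 0.0], [0.0, 1.0, 1.5], [0.0, 0.0, 2.0]])
transitions = np.zeros((3, 3))
transitions[0, 2] = -100.0

greedy = emissions.argmax(axis=1).tolist()
decoded = viterbi_decode(emissions, transitions)
print(greedy, decoded)
```

Here the greedy per-token argmax picks an `I` tag directly after `O`, while Viterbi decoding avoids the penalised transition.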