### Step 1: Configure notebook

This notebook might stall when trying to mount the google drive.  It will prompt for an authentication key.  I'm not sure how often it expires.


### Step 2: Prepare embeddings

There are 3 index types / embeddings:
+ Words
  + These are imported from Glove and minimal formatting is applied.
+ Characters
  + These are trained in the model.  However, the dictionary of acceptable values is determined at this stage from `string.printable`
+ Casing
  + A dictionary is initialized at this stage from hard-coded options.
+ Labels
  + The ontonotes training data is used as the source of possible labels.  In addition the 'TITLE' tags are also added to the labels dictionary.


### Step 3: Process data

There are three distinct varieties of data:
+ Original Ontonotes data
  + This data will be used for pre-training the model.  All 3 sets (train, dev, test) are used.
  + The data originates in a 4-column CoNLL format.  We are only interested in the token and label columns.  Each line is a token.  Sentence divisions are indicated by an empty line.
  + Train: for pre-training the model
  + Dev: for determining when the model is trained
  + Test: for establishing the generalized accuracy of the model
+ Domain data
  + This is the manually annotated data that we will use to fine-tune and test our model.
  + It follows a similar format to Ontonotes (4-column, with column order preserved), however the middle two columns have been replaced with 'company' and 'director'
+ BILOU formatted data
  + There are versions of the ontonotes and domain data in this format.
  + It is 2-column, (array_of_tokens, array_of_labels).  Each line is a sentence.

The data needs to be formatted into 4 vectors per sample (word, char, casing, label), where each sample is a sentence.  Each of these vectors needs to be truncated / padded.

There is a caching function that can be used to store the very large (~7GB) vectorized data files.

### Step 4: Build / Train Model

### Step 5: Fine-tune model


# Configuration

In [1]:
import os
import time
import sys
import csv
import pandas as pd
import numpy as np
import string
import ast
from IPython.display import display

from shutil import copyfile
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

csv.field_size_limit(sys.maxsize)

from keras.layers import TimeDistributed, Conv1D, Dense, Embedding, Input, Dropout, LSTM, Bidirectional, MaxPooling1D, Flatten, concatenate
from keras.initializers import RandomUniform
from keras.optimizers import SGD, Nadam
from keras.models import Model, load_model, Sequential
from keras.preprocessing.sequence import pad_sequences
from keras.callbacks import ModelCheckpoint, EarlyStopping
from keras.utils import plot_model

import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  print('GPU device not found')
print('Found GPU at: {}'.format(device_name))

np.random.seed(1492)

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=email%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdocs.test%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.photos.readonly%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fpeopleapi.readonly&response_type=code

Enter your authorization code:
··········
Mounted at /content/drive


Using TensorFlow backend.


Found GPU at: /device:GPU:0


In [0]:
### File configurations
drive_dir = "/content/drive/My Drive/W266_Project/"
data_src = os.path.join(drive_dir,"data")
embed_src = os.path.join(drive_dir,"embeddings")

# cache store
cache_dir = os.path.join(drive_dir, "cache")
embed_store =  os.path.join(cache_dir, 'embed.h5')

### Model Parameters
DROPOUT = 0.2
RECURRENT_DROPOUT = 0.25
CHAR_VOCAB = len(string.printable)
CHAR_EMBEDDING_DIM = 30
WORD_LENGTH = 52
CONV_SIZE = 3
CONV_FILTERS = 30
CONV_STRIDE = 1
CONV_WINDOW = 52
LSTM_STATE_SIZE = 200

EPOCHS = 20
BATCH_SIZE = 400
TRAINING_SIZE = 600000

# embedding to use
# 50d vector is consistent with paper
embedding_file = "glove.6B.300d.txt"

annotation_type = "BIO"

# data files
train_file = None
dev_file = None
test_file = None
real_file = None

if annotation_type == "BIO":
  train_file = 'onto.train.ner'
  dev_file = 'onto.development.ner'
  test_file = 'onto.test.ner'
  real_file = 'bios-tagged-final-flat.csv'
elif annotation_type == "BILOU":
  train_file = 'onto.train.ner_bilou.csv'
  dev_file = 'onto.development.ner_bilou.csv'
  test_file = 'onto.test.ner_bilou.csv'
  real_file = 'bios-tagged-final-flat_bilou.csv'
else:
  raise Exception(f"unknown annotation: {annotation_type}")

real_training_sizes = [10,25,50,100,200,300]
# whether to freeze the weights on the bilstm
FROZEN = False
# which layer weights to copy over for fine_tuning
# None results in all weights, otherwise provide an array of layer names
FINE_TUNE_LAYERS = ['biLSTM', 'softmax']

model_dir = os.path.join(drive_dir, 'output')

# if loading a pre-trained model set these
model_name = "std_400b_glove300d_full_BIO_08-0.0079"
model_load_path = os.path.join(model_dir, model_name + '.h5')
INITIAL_EPOCH = EPOCHS

# else use these
model_name = f"std_400b_glove300d_full_{annotation_type}"
model_path = os.path.join(model_dir, model_name + '_{epoch:02d}-{val_loss:.4f}.h5')



### Preprocessing Parameters
UNK_WORD = "<UNK-WORD>"
PAD_WORD = "<PAD-WORD>"

UNK_CHAR = "<UNK-CHAR>"
PAD_CHAR = "<PAD-CHAR>"

# max number of words in a sentence, pad to this length, might throw an error if the sentence is longer
SENTENCE_WIDTH = 256
# max number of characters in a word, pad to this length, will truncate if word is too long
WORD_WIDTH = 52
# symbols to map padding to
CHAR_PAD_SYMBOL = PAD_CHAR
LABEL_PAD_SYMBOL = 'O'
CASE_PAD_SYMBOL = 'other'

# Setup

In [3]:
# show files
print(os.listdir(data_src))
print(os.listdir(embed_src))
print(os.listdir(cache_dir))

['onto.development.ner.sample', 'onto.development.ner', 'onto.test.ner.sample', 'onto.train.ner.sample', 'onto.test.ner', 'onto.train.ner', 'bios-tagged_bilou.csv', 'bios-tagged-set1.csv', 'bios-tagged-set2.csv', 'bios-tagged-final-flat.csv', 'bios-tagged-final-agg.csv', 'onto.train.ner_bilou.csv', 'onto.development.ner_bilou.csv', 'onto.test.ner_bilou.csv', 'bios-tagged-final-flat_bilou.csv', 'GoogleNews-vectors-negative300.bin.gz']
['glove.6B.100d.txt', 'glove.6B.200d.txt', 'glove.6B.300d.txt', 'glove.6B.50d.txt', 'readme.md', 'Skip100']
['embed.h5', 'onto_train_ner_bilou_word.npy', 'onto_train_ner_bilou_label.npy', 'onto_train_ner_bilou_case.npy', 'onto_train_ner_bilou_char.npy', 'onto_development_ner_bilou_word.npy', 'onto_development_ner_bilou_case.npy', 'onto_development_ner_bilou_label.npy', 'onto_development_ner_bilou_char.npy', 'onto_test_ner_bilou_word.npy', 'onto_test_ner_bilou_case.npy', 'onto_test_ner_bilou_label.npy', 'onto_test_ner_bilou_char.npy', 'bios-tagged_bilou_wor

# Preprocessing

## Load Embeddings

In [0]:
# consider using tf.nn.embedding_lookup instead
# or maybe nltk.tokenize


def get_casing_ix(word):
  '''
  determines the casing of the word
  
  returns casing_ix
  '''
  if word.istitle():
    return case_to_ix['title']
  elif word.islower():
    return case_to_ix['lower']
  elif word.isupper():
    return case_to_ix['upper']
  elif word.isnumeric():
    return case_to_ix['numeric']
  return case_to_ix['other']

def get_word_ix(word):
  '''
  takes w and returns the index of the word embedding
  out of vocabulary terms return the UNK_WORD and the character embeddings
  
  returns word_ix
  '''
  w = word.lower()
  w_ix = word_to_ix.get(w)
  if w_ix is not None:
    return w_ix
  return word_to_ix[UNK_WORD]

def get_char_ix(char):
  char_ix = char_to_ix.get(char)
  if char_ix is not None:
    return char_ix
  return char_to_ix[UNK_CHAR]
  
def create_character_embeddings(words_df):
  '''
  Optional function to create pre-trained character embeddings from averaged word embeddings.  In the model we generate them from a uniform random distribution and train.
  '''
  characters = {}
  for i, word_vec in enumerate(words_df.reset_index().values):
    for char in word_vec[0]:
      if char in characters:
        characters[char] = [characters[char][0] + word_vec[1:].astype(float), characters[char][1] + 1]
      else:
        characters[char] = [word_vec[1:].astype(float), 1]

  for key in characters:
    characters[key] = np.round(characters[key][0]/characters[key][1],6)
    
def initialize_word_embeddings(file_name, use_cache=True, debug=True, save_cache=True):
  loaded = False
  df = None
  
  if use_cache:
    try:
      print("Attempting to load from cache")
      with pd.HDFStore(embed_store, 'r') as store:
        words = store[file_name]
      words = pd.read_hdf(embed_store, file_name)
      loaded=True
      print("Loaded successfully")
    except:
      print("Cache loading failed")
      loaded=False
  
  if not loaded:
    words = pd.read_csv(os.path.join(embed_src, embedding_file), sep=" ", index_col=0, header=None, quoting=csv.QUOTE_NONE)
    # some embeddings come back with word == NaN
    words = words[~words.index.isnull()]
    # add entries for special tokens
    words.loc[UNK_WORD] = [0 for x in words.columns]
    words.loc[PAD_WORD] = [0 for x in words.columns]
    if save_cache:
      with pd.HDFStore(embed_store, 'a') as store:
        store[file_name] = words
  
  word2ix = {word:i for i,word in enumerate(words.index)}
  ix2word = {i:word for i,word in enumerate(words.index)}
  words = words.to_numpy().astype(float)
  
  return words, word2ix, ix2word

def initialize_character_embeddings(vocab=string.printable):
  characters = [x for x in string.printable]
  characters += [UNK_CHAR, PAD_CHAR]
  char2ix = {ch:i for i, ch in enumerate(characters)}
  ix2char = {i:ch for i, ch in enumerate(characters)}
  
  return characters, char2ix, ix2char

def initialize_case_embeddings(vocab=['upper','lower','title','numeric','other']):
  case2ix = {case:i for i, case in enumerate(vocab)}
  ix2case = {}
  cases = []
  for k,v in case2ix.items():
    this_case = np.zeros(len(case2ix))
    this_case[v] = 1
    cases.append(this_case)
    ix2case[v] = k
  cases = np.array(cases)
  
  return cases, case2ix, ix2case

  
def initialize_labels(file_name, annotation):
  label_list = None
  
  if annotation == "BIO":
    data = pd.read_csv(os.path.join(data_src, file_name), sep="\t",  quoting=csv.QUOTE_NONE, header=None, skip_blank_lines=False, engine='python', names =['token', 'pos', 'tree', 'label'])
    data.dropna(subset=['label'], inplace=True)
    label_list = list(data.label.unique()) + ['B-TITLE', 'I-TITLE']
  elif annotation == "BILOU":
    data = pd.read_csv(os.path.join(data_src, file_name), header=0, index_col=0,skip_blank_lines=False, engine='python')
    label_list = list(np.unique(np.concatenate(data.y.apply(lambda x: np.array(ast.literal_eval(x))).values))) + ['B-TITLE', 'I-TITLE', 'L-TITLE', 'U-TITLE']
  else:
    raise Exception(f"unknown annotation: {annotation}")
    
  label2ix = {label:i for i, label in enumerate(label_list)}
  ix2label = {i:label for i, label in enumerate(label_list)}
  return label_list, label2ix, ix2label

In [5]:
# load embeddings and format
words, word_to_ix, ix_to_word = initialize_word_embeddings(embedding_file, use_cache=True)
characters, char_to_ix, ix_to_char = initialize_character_embeddings()
cases, case_to_ix, ix_to_case = initialize_case_embeddings()
labels, label_to_ix, ix_to_label = initialize_labels(train_file, annotation_type)

Attempting to load from cache
Loaded successfully


## Process Data

In [0]:
def checkPrior(blah):
  if (blah.prior is None or blah.prior is np.NaN) and (blah.prior_pos is None or blah.prior_pos is np.NaN):
    return True
  else:
    return False
  
def phrase2char(w_vec):
  '''
  This function transforms a sequence of words in index format to a 2d array of character indexes
  
  w_vec - an iterable of word indexes
  
  returns np.ndarray of size (len(w_vec), WORD_WIDTH)
  '''
  phrase_vector = []
  for w_ix in w_vec:
    char_vector = []
    if w_ix not in (word_to_ix[PAD_WORD],word_to_ix[UNK_WORD]):
      for char in ix_to_word[w_ix]:
        char_vector.append(get_char_ix(char))
    phrase_vector.append(np.array(char_vector))
  return pad_sequences(phrase_vector, value=char_to_ix[PAD_CHAR], maxlen=WORD_WIDTH, padding='post')

def pad_truncate(x,width,pad_token):
  if(len(x) > width):
    print(f"Truncating input: {[ix_to_word[ix] for ix in x]}")
    x = x[:256]
  return np.pad(x,pad_width=(0,width-len(x)), mode='constant', constant_values=pad_token)

def verbosity(str, verbose):
  if verbose:
    print(str)

def sent_to_casing_ix(words):
  sentence_vector = []
  for word in words:
    sentence_vector.append(get_casing_ix(word))
  return sentence_vector

def sent_to_word_ix(words):
  sentence_vector = []
  for word in words:
    sentence_vector.append(get_word_ix(word))
  return sentence_vector

def sent_to_label_ix(labels):
  label_vector = []
  for label in labels:
    label_vector.append(label_to_ix[label])
  return label_vector

def preprocess_data(file_name, annotation, use_cache=True, debug=True, save_cache=True):
  '''
  Prepares data for model.  It can be used for both training and test data.
  
  returns pd.DataFrame
  '''
  clean_name = os.path.join(cache_dir, file_name.replace(".csv", "").replace(".", "_"))
  loaded = False
  output = None
      
  if use_cache and os.path.exists(f"{clean_name}_{annotation}_word.npy"):
    verbosity("Attempting to load from cache", debug)
    try:
      word_vectors = np.load(f"{clean_name}_{annotation}_word.npy", allow_pickle=True)
      char_vectors = np.load(f"{clean_name}_{annotation}_char.npy", allow_pickle=True)
      case_vectors = np.load(f"{clean_name}_{annotation}_case.npy", allow_pickle=True)
      label_vectors = np.load(f"{clean_name}_{annotation}_label.npy", allow_pickle=True)
      output = [word_vectors, char_vectors, case_vectors, label_vectors]
      loaded = True
      verbosity("Loaded successfully", debug)
    except:
      verbosity("Loading failed",debug)
      loaded = False
  
  if not loaded:
    if annotation == "BIO":
      verbosity(f"Loading raw data file to process labels: {file_name}", debug)
      checkpoint = time.time()  
      header = 0
      if 'onto' in file_name:
        header = None
      data = pd.read_csv(os.path.join(data_src, file_name), sep="\t",  quoting=csv.QUOTE_NONE, header=header, skip_blank_lines=False, engine='python', names =['token', 'pos', 'tree', 'label'])
      verbosity(f"Parsed data loaded: {time.time()-checkpoint} s", debug)

      # see if prior row was a newline
      data['prior'] = data.token.shift(1)
      data['prior_pos'] = data.pos.shift(1)
      # drop empty rows
      data = data.loc[~data.token.isnull()]
      data.prior = data.apply(checkPrior, axis=1)
      data['phrase'] = data.prior.cumsum()

      verbosity("Processing data into phrase vectors", debug)
      verbosity("Step 1: Translating to indexes", debug)
      checkpoint = time.time()
      data['word_ix'] = data.token.apply(get_word_ix)
      data['case_ix'] = data.token.apply(get_casing_ix)
      data['label_ix'] = data.label.apply(lambda x: label_to_ix[x])
      verbosity(f"Step 1: Translated to indexes complete: {time.time()-checkpoint} s", debug)

      verbosity("Step 2: Creating phrase vectors", debug)
      verbosity("Step 2a: Aggregating phrases", debug)
      checkpoint = time.time()
      phrase_vectors = data.groupby('phrase').agg({'token':list, 'word_ix': list, 'case_ix': list, 'label_ix': list})
      verbosity(f"Step 2a: {time.time()-checkpoint} s", debug)

      verbosity("Step 2b: Padding word vectors", debug)
      checkpoint = time.time()
      phrase_vectors['word_vector'] = phrase_vectors.word_ix.apply(lambda x: pad_truncate(x, SENTENCE_WIDTH, word_to_ix[PAD_WORD]))
      verbosity(f"Step 2b: {time.time()-checkpoint} s", debug)

      verbosity("Step 2c: Creating and padding character vectors", debug)
      checkpoint = time.time()
      phrase_vectors['char_vector'] = phrase_vectors.word_vector.apply(lambda x: phrase2char(x))
      verbosity(f"Step 2c: {time.time()-checkpoint} s", debug)

      verbosity(f"Step 2d: Padding case vectors", debug)
      checkpoint = time.time()
      phrase_vectors['case_vector'] = phrase_vectors.case_ix.apply(lambda x: pad_truncate(x, SENTENCE_WIDTH, case_to_ix[CASE_PAD_SYMBOL]))
      verbosity(f"Step 2d: {time.time()-checkpoint}", debug)

      verbosity("Step 2e: Padding label vectors", debug)
      checkpoint = time.time()
      phrase_vectors['label_vector'] = phrase_vectors.label_ix.apply(lambda x: np.expand_dims(pad_truncate(x, SENTENCE_WIDTH, label_to_ix[LABEL_PAD_SYMBOL]), -1))
      verbosity(f"Step 2e: {time.time()-checkpoint} s", debug)

      phrase_vectors.drop(columns=['word_ix', 'case_ix', 'label_ix'], inplace=True)

      output = [np.stack(phrase_vectors['word_vector'].values), np.stack(phrase_vectors['char_vector'].values), np.stack(phrase_vectors['case_vector'].values), np.stack(phrase_vectors['label_vector'].values)]
    
    elif annotation == "BILOU":
      verbosity(f"Loading raw data file to process labels: {file_name}", debug)
      checkpoint = time.time()  
      data = pd.read_csv(os.path.join(data_src, file_name), header=0, index_col=0,skip_blank_lines=False, engine='python')
      data = data[data.x.str.len() > 2]
      verbosity(f"Data loaded: {time.time()-checkpoint} s", debug)

      verbosity("Processing data into sentence vectors", debug)
      verbosity("Step 1: Translating to indexes", debug)
      checkpoint = time.time()
      data.x = data.x.apply(ast.literal_eval)
      data.y = data.y.apply(ast.literal_eval)
      data['word_vector'] = data.x.apply(lambda x: pad_truncate(sent_to_word_ix(x), SENTENCE_WIDTH, word_to_ix[PAD_WORD]))
      data['case_vector'] = data.x.apply(lambda x: pad_truncate(sent_to_casing_ix(x), SENTENCE_WIDTH, case_to_ix[CASE_PAD_SYMBOL]))
      data['label_vector'] = data.y.apply(lambda x: pad_truncate(sent_to_label_ix(x), SENTENCE_WIDTH, label_to_ix[LABEL_PAD_SYMBOL]))
      print("Processing character vectors")
      data['char_vector'] = data.word_vector.apply(lambda x: phrase2char(x))
      verbosity(f"Step 1: Translated to indexes complete: {time.time()-checkpoint} s", debug)

      output = [np.stack(data["word_vector"].values), np.stack(data["char_vector"].values), np.stack(data["case_vector"].values), np.expand_dims(np.stack(data["label_vector"].values),-1)]
    
    else:
      raise Exception(f"unkown annotation: {annotation}")
    
    if save_cache:
      verbosity("Saving data to disk", debug)
      checkpoint = time.time()
      # saving in multi parts because training data causes a memory error
      np.save(f"{clean_name}_{annotation}_word", output[0], allow_pickle=True)
      np.save(f"{clean_name}_{annotation}_char", output[1], allow_pickle=True)
      np.save(f"{clean_name}_{annotation}_case", output[2], allow_pickle=True)
      np.save(f"{clean_name}_{annotation}_label", output[3], allow_pickle=True)
      verbosity(f"Saved to disk: {time.time()-checkpoint} s", debug)
  
  return output

In [0]:
train_data = preprocess_data(train_file, annotation_type, use_cache=False)
print(train_data[0].shape,train_data[1].shape,train_data[2].shape,train_data[3].shape)

Loading raw data file to process labels: onto.train.ner
Parsed data loaded: 7.589703321456909 s
Processing data into phrase vectors
Step 1: Translating to indexes
Step 1: Translated to indexes complete: 3.730118751525879 s
Step 2: Creating phrase vectors
Step 2a: Aggregating phrases


## Prepare development data

In [0]:
dev_data = preprocess_data(dev_file, annotation_type, use_cache=False)
print(dev_data[0].shape,dev_data[1].shape,dev_data[2].shape,dev_data[3].shape)

## Prepare test data

In [0]:
test_data = preprocess_data(test_file, annotation_type, use_cache=False)
print(test_data[0].shape,test_data[1].shape,test_data[2].shape,test_data[3].shape)

## Prepare domain data

In [0]:
real_data = preprocess_data(real_file, annotation_type, use_cache=False,save_cache=True)
print(real_data[0].shape, real_data[1].shape, real_data[2].shape, real_data[3].shape)

In [0]:
def train_dev_test_split(data, test_size, train_size):
  all_choices = np.arange(0,data[0].shape[0])
  test_choices = np.random.choice(a=all_choices, size=int(np.round(data[0].shape[0]*test_size)), replace=False)
  real_test_data = [data[0][test_choices], data[1][test_choices], data[2][test_choices], data[3][test_choices]]
  not_test_choices = np.setdiff1d(all_choices,test_choices)
  not_test_data = data[0][not_test_choices], data[1][not_test_choices], data[2][not_test_choices], data[3][not_test_choices]
  train_choices = np.random.randint(0,not_test_data[0].shape[0], size=int(np.round(not_test_data[0].shape[0]*train_size)))
  train_choices = np.random.choice(a=not_test_choices, size=int(np.round(not_test_choices.shape[0]*train_size)), replace=False)
  real_train_data = [data[0][train_choices], data[1][train_choices], data[2][train_choices], data[3][train_choices]]
  dev_choices = np.setdiff1d(not_test_choices, train_choices)
  real_dev_data = [data[0][dev_choices], data[1][dev_choices], data[2][dev_choices], data[3][dev_choices]]

  print(f"Original shape: {data[0].shape}")
  print(f"Train shape: {real_train_data[0].shape}")
  print(f"Dev shape: {real_dev_data[0].shape}")
  print(f"Test shape: {real_test_data[0].shape}")

  return real_test_data, real_dev_data, real_train_data

In [0]:
rtrain, rdev, rtest = train_dev_test_split(real_data, 0.435, 0.75)


# Model Building

In [0]:
# https://github.com/mxhofer/Named-Entity-Recognition-BidirectionalLSTM-CNN-CoNLL/blob/master/nn_CoNLL.ipynb

def buildModel(labels, wordEmbeddings, caseEmbeddings, characterEmbeddings=None):
  
  # character - input
  character_input = Input(shape=(None, WORD_LENGTH,), name="Character_input")
  embed_char_out = TimeDistributed(
      Embedding(CHAR_VOCAB,
                CHAR_EMBEDDING_DIM,
                embeddings_initializer=RandomUniform(minval=-0.5, maxval=0.5)),name="Character_embedding")(character_input)

  dropout = Dropout(DROPOUT)(embed_char_out)

  ## character - CNN
  conv1d_out = TimeDistributed(Conv1D(kernel_size=CONV_SIZE,
                                      filters=CONV_FILTERS,
                                      padding='same',
                                      activation='tanh',
                                      strides=CONV_STRIDE), name="Convolution")(dropout)
  maxpool_out = TimeDistributed(MaxPooling1D(CONV_WINDOW), name="Maxpool")(conv1d_out)
  char = TimeDistributed(Flatten(), name="Flatten")(maxpool_out)
  char = Dropout(DROPOUT)(char)

  # word - input
  words_input = Input(shape=(None,), dtype='int32', name='words_input')
  words = Embedding(input_dim=wordEmbeddings.shape[0],
                    output_dim=wordEmbeddings.shape[1],
                    weights=[wordEmbeddings],
                    trainable=False)(words_input)

  # case - input
  casing_input = Input(shape=(None,), dtype='int32', name='casing_input')
  casing = Embedding(input_dim=caseEmbeddings.shape[0],
                     output_dim=caseEmbeddings.shape[1],
                     weights=[caseEmbeddings],
                     trainable=False)(casing_input)
  
  # character + word + case -> biLSTM
  output = concatenate([words, casing, char])
  output = Bidirectional(LSTM(LSTM_STATE_SIZE, 
                              return_sequences=True, 
                              dropout=DROPOUT,
                              recurrent_dropout=RECURRENT_DROPOUT
                             ), name="biLSTM")(output)
  
  # output
  output = TimeDistributed(Dense(len(labels), activation='softmax'),name="softmax")(output)

  # finalize
  model = Model(inputs=[words_input, character_input, casing_input], outputs=[output])

  model.compile(loss='sparse_categorical_crossentropy', optimizer=Nadam())
  
  return model

In [0]:
myModel = None
if os.path.exists(model_load_path):
  print("Attempting to load model")
  myModel = load_model(model_load_path)
  print("Model loaded successfully")
  myModel.summary()
  print(myModel.evaluate(dev_data[:3], dev_data[3]))
  myModel.fit([train_data[0][:TRAINING_SIZE],train_data[1][:TRAINING_SIZE],train_data[2][:TRAINING_SIZE]], train_data[3][:TRAINING_SIZE],
              validation_data = (dev_data[:3], dev_data[3]),
              epochs=EPOCHS,
              initial_epoch=INITIAL_EPOCH,
              batch_size=BATCH_SIZE,
              callbacks=[EarlyStopping(min_delta=0), ModelCheckpoint(model_path)])
else:
  print("Building model")
  myModel = buildModel(labels, words, cases)
  myModel.summary()
  plot_model(myModel, to_file=os.path.join(model_dir, model_name+'.png'))
  print(myModel.evaluate(dev_data[:3], dev_data[3]))
  myModel.fit([train_data[0][:TRAINING_SIZE],train_data[1][:TRAINING_SIZE],train_data[2][:TRAINING_SIZE]], train_data[3][:TRAINING_SIZE],
              validation_data = (dev_data[:3], dev_data[3]),
              epochs=EPOCHS,
              initial_epoch=0,
              batch_size=BATCH_SIZE,
              callbacks=[EarlyStopping(min_delta=0), ModelCheckpoint(model_path)])


W0802 05:06:04.925815 140665170061184 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

W0802 05:06:04.965647 140665170061184 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:74: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

W0802 05:06:04.968747 140665170061184 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

W0802 05:06:05.017114 140665170061184 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:133: The name tf.placeholder_with_default is deprecated. Please use tf.compat.v1.placeholder_with_default instead.

W0802 05:06:05.031918 

Attempting to load model


W0802 05:06:12.041017 140665170061184 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:174: The name tf.get_default_session is deprecated. Please use tf.compat.v1.get_default_session instead.

W0802 05:06:14.859666 140665170061184 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/keras/optimizers.py:790: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

W0802 05:06:15.362855 140665170061184 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_grad.py:1250: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where


Model loaded successfully
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
Character_input (InputLayer)    (None, None, 52)     0                                            
__________________________________________________________________________________________________
Character_embedding (TimeDistri (None, None, 52, 30) 3000        Character_input[0][0]            
__________________________________________________________________________________________________
dropout_1 (Dropout)             (None, None, 52, 30) 0           Character_embedding[0][0]        
__________________________________________________________________________________________________
Convolution (TimeDistributed)   (None, None, 52, 30) 2730        dropout_1[0][0]                  
___________________________________________________________________________________

# Metrics

In [0]:
def get_metrics(model, data, save_name=None):
  predictions = model.predict(data[:3])
  y = data[3].reshape(data[3].shape[0],data[3].shape[1])
  
  pf = np.argmax(predictions, axis=2).flatten()
  af = data[3].flatten()
  
  metrics = []
  metrics = pd.DataFrame(columns=['Label', 'Support', 'Precision', 'Recall', "F1"])
  for i, label in enumerate(labels):
    support = np.where(af == i)
    tp = np.sum(pf[support] == af[support])

    precision = None
    if pf[np.where(pf == i)].shape[0] == 0:
      precision = 0.0
    else:
      precision = tp/pf[np.where(pf == i)].shape[0]
      
    recall = tp/af[support].shape[0]
    
    f1 = None
    if precision + recall == 0:
      f1 = 0
    else:
      f1 = 2*precision*recall/(precision+recall)

    metrics = metrics.append({'Label': ix_to_label[i], 'Support':af[support].shape[0], 'Precision': precision, 'Recall':recall, 'F1':f1}, ignore_index=True)
  
  metrics = metrics.append({'Label': 'micro',
                  'Support': metrics.Support.sum(),
                  'Precision': (metrics.Precision*metrics.Support/pf.shape[0]).sum(),
                  'Recall': (metrics.Recall*metrics.Support/pf.shape[0]).sum(),
                  'F1': (metrics.F1*metrics.Support/pf.shape[0]).sum()
                           },
                 ignore_index=True)
  metrics = metrics.set_index('Label')
  if save_name is not None:
    print("saving")
    metrics.to_csv(os.path.join(model_dir, save_name))
  display(metrics)
  return predictions
 

In [0]:
test_pred = get_metrics(myModel, test_data, save_name=model_name + '-untuned-onto-results.csv')

saving




Unnamed: 0_level_0,Support,Precision,Recall,F1
Label,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
O,3104227,0.999143,0.999042,0.999092
B-ORG,2002,0.816254,0.807692,0.811951
I-ORG,2703,0.821001,0.862005,0.841003
B-WORK_OF_ART,169,0.706422,0.455621,0.553957
I-WORK_OF_ART,347,0.635071,0.386167,0.480287
B-LOC,215,0.637209,0.637209,0.637209
I-LOC,202,0.656109,0.717822,0.685579
B-CARDINAL,1005,0.708447,0.776119,0.740741
B-EVENT,85,0.744186,0.376471,0.5
I-EVENT,165,0.548673,0.375758,0.446043


In [0]:
real_pred = get_metrics(myModel, rtest, save_name=model_name + '-untuned-real-results.csv')

saving




Unnamed: 0_level_0,Support,Precision,Recall,F1
Label,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
O,73968,0.989211,0.996593,0.992888
B-ORG,487,0.634454,0.620123,0.627207
I-ORG,968,0.722575,0.823347,0.769676
B-WORK_OF_ART,0,0.0,,
I-WORK_OF_ART,0,0.0,,
B-LOC,0,0.0,,
I-LOC,0,0.0,,
B-CARDINAL,2,0.2,1.0,0.333333
B-EVENT,0,0.0,,
I-EVENT,0,0.0,,


## Fine-tune

In [0]:
def build_tunable_model(input_model, freeze=False, layers=None):
  interim_model = buildModel(labels, words, cases)
  if layers:
    for layer in layers:
      interim_model.get_layer(layer).set_weights(input_model.get_layer(layer).get_weights())
  else:
    interim_model.set_weights(input_model.get_weights())
  #new_layer = TimeDistributed(Dense(len(labels), activation='softmax'),name="softmax")(interim_model.layers[-2].output)
  #interim_model.layers.pop()
  if freeze:
    interim_model.get_layer('biLSTM').trainable = False
  model2 = Model(inputs=interim_model.input, outputs=interim_model.output)
  model2.compile(loss='sparse_categorical_crossentropy', optimizer=Nadam())
  return model2

In [0]:
try:
  myModel.get_layer('BLSTM').name = 'biLSTM'
  myModel.get_layer('Softmax_layer').name = 'softmax'
except:
  pass
for size in real_training_sizes:
  new_model = build_tunable_model(myModel, freeze=FROZEN, layers=FINE_TUNE_LAYERS)
  if size == np.min(real_training_sizes):
    print(new_model.summary())
    
  t_word = rtrain[0][:size]
  t_char = rtrain[1][:size]
  t_case = rtrain[2][:size]
  t_labels = rtrain[3][:size]
  
  model_tuned_base_name = f"{model_name}_{str(size)}_tuned"
  
  model_tuned_path = os.path.join(model_dir, model_tuned_base_name + "-{epoch:02d}-{val_loss:.4f}.h5")
  
  new_model.fit([t_word, t_char, t_case], t_labels,
                validation_data=(rdev[:3],rdev[3]),
                epochs=5000,
                batch_size = 10,
                callbacks=[EarlyStopping(min_delta=0),
                           ModelCheckpoint(model_tuned_path)])
  
  tuned_pred = get_metrics(new_model, test_data, save_name=model_tuned_base_name + '-onto-results.csv')
  real_tuned_pred = get_metrics(new_model, rtest, save_name=model_tuned_base_name + '-real-results.csv')

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
Character_input (InputLayer)    (None, None, 52)     0                                            
__________________________________________________________________________________________________
Character_embedding (TimeDistri (None, None, 52, 30) 3000        Character_input[0][0]            
__________________________________________________________________________________________________
dropout_1 (Dropout)             (None, None, 52, 30) 0           Character_embedding[0][0]        
__________________________________________________________________________________________________
Convolution (TimeDistributed)   (None, None, 52, 30) 2730        dropout_1[0][0]                  
__________________________________________________________________________________________________
Maxpool (T



Unnamed: 0_level_0,Support,Precision,Recall,F1
Label,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
O,3104227,0.998859,0.999084,0.998972
B-ORG,2002,0.80031,0.772727,0.786277
I-ORG,2703,0.84528,0.798372,0.821157
B-WORK_OF_ART,169,0.675,0.47929,0.560554
I-WORK_OF_ART,347,0.625,0.389049,0.479574
B-LOC,215,0.672414,0.181395,0.285714
I-LOC,202,0.730769,0.470297,0.572289
B-CARDINAL,1005,0.812971,0.536318,0.646283
B-EVENT,85,0.714286,0.352941,0.472441
I-EVENT,165,0.491713,0.539394,0.514451


saving


Unnamed: 0_level_0,Support,Precision,Recall,F1
Label,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
O,73968,0.989055,0.998134,0.993574
B-ORG,487,0.617871,0.667351,0.641658
I-ORG,968,0.800703,0.705579,0.750137
B-WORK_OF_ART,0,0.0,,
I-WORK_OF_ART,0,0.0,,
B-LOC,0,0.0,,
I-LOC,0,0.0,,
B-CARDINAL,2,0.333333,1.0,0.5
B-EVENT,0,0.0,,
I-EVENT,0,0.0,,


Train on 25 samples, validate on 100 samples
Epoch 1/5000
Epoch 2/5000
Epoch 3/5000
Epoch 4/5000
Epoch 5/5000
Epoch 6/5000
Epoch 7/5000
saving




Unnamed: 0_level_0,Support,Precision,Recall,F1
Label,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
O,3104227,0.997876,0.999211,0.998543
B-ORG,2002,0.747222,0.671828,0.707522
I-ORG,2703,0.844173,0.715501,0.774529
B-WORK_OF_ART,169,0.853659,0.207101,0.333333
I-WORK_OF_ART,347,0.794118,0.15562,0.260241
B-LOC,215,0.576923,0.069767,0.124481
I-LOC,202,0.839506,0.336634,0.480565
B-CARDINAL,1005,0.90099,0.181095,0.301574
B-EVENT,85,0.866667,0.152941,0.26
I-EVENT,165,0.696429,0.236364,0.352941


saving


Unnamed: 0_level_0,Support,Precision,Recall,F1
Label,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
O,73968,0.993828,0.999229,0.996521
B-ORG,487,0.741127,0.728953,0.73499
I-ORG,968,0.879332,0.707645,0.784201
B-WORK_OF_ART,0,0.0,,
I-WORK_OF_ART,0,0.0,,
B-LOC,0,0.0,,
I-LOC,0,0.0,,
B-CARDINAL,2,1.0,1.0,1.0
B-EVENT,0,0.0,,
I-EVENT,0,0.0,,


Train on 50 samples, validate on 100 samples
Epoch 1/5000
Epoch 2/5000
Epoch 3/5000
Epoch 4/5000
Epoch 5/5000
Epoch 6/5000
saving




Unnamed: 0_level_0,Support,Precision,Recall,F1
Label,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
O,3104227,0.99797,0.999086,0.998528
B-ORG,2002,0.643139,0.650849,0.646971
I-ORG,2703,0.78176,0.72623,0.752973
B-WORK_OF_ART,169,0.857143,0.142012,0.243655
I-WORK_OF_ART,347,0.734694,0.103746,0.181818
B-LOC,215,0.625,0.069767,0.125523
I-LOC,202,0.810811,0.148515,0.251046
B-CARDINAL,1005,0.874172,0.131343,0.228374
B-EVENT,85,0.909091,0.117647,0.208333
I-EVENT,165,0.862069,0.151515,0.257732


saving


Unnamed: 0_level_0,Support,Precision,Recall,F1
Label,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
O,73968,0.995355,0.999432,0.997389
B-ORG,487,0.811321,0.794661,0.802905
I-ORG,968,0.906509,0.791322,0.845008
B-WORK_OF_ART,0,0.0,,
I-WORK_OF_ART,0,0.0,,
B-LOC,0,0.0,,
I-LOC,0,0.0,,
B-CARDINAL,2,1.0,1.0,1.0
B-EVENT,0,0.0,,
I-EVENT,0,0.0,,


Train on 100 samples, validate on 100 samples
Epoch 1/5000
Epoch 2/5000
Epoch 3/5000
Epoch 4/5000
saving




Unnamed: 0_level_0,Support,Precision,Recall,F1
Label,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
O,3104227,0.99766,0.99916,0.99841
B-ORG,2002,0.647944,0.645355,0.646647
I-ORG,2703,0.813363,0.675546,0.738076
B-WORK_OF_ART,169,0.782609,0.106509,0.1875
I-WORK_OF_ART,347,0.75,0.086455,0.155039
B-LOC,215,0.714286,0.046512,0.087336
I-LOC,202,0.823529,0.069307,0.127854
B-CARDINAL,1005,0.897959,0.043781,0.083491
B-EVENT,85,0.75,0.035294,0.067416
I-EVENT,165,0.714286,0.030303,0.05814


saving


Unnamed: 0_level_0,Support,Precision,Recall,F1
Label,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
O,73968,0.994939,0.999392,0.997161
B-ORG,487,0.82268,0.819302,0.820988
I-ORG,968,0.922013,0.757231,0.831537
B-WORK_OF_ART,0,0.0,,
I-WORK_OF_ART,0,0.0,,
B-LOC,0,0.0,,
I-LOC,0,0.0,,
B-CARDINAL,2,0.0,0.0,0.0
B-EVENT,0,0.0,,
I-EVENT,0,0.0,,


Train on 200 samples, validate on 100 samples
Epoch 1/5000
Epoch 2/5000
Epoch 3/5000
Epoch 4/5000
Epoch 5/5000
Epoch 6/5000
saving




Unnamed: 0_level_0,Support,Precision,Recall,F1
Label,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
O,3104227,0.99817,0.998795,0.998483
B-ORG,2002,0.538041,0.688811,0.604162
I-ORG,2703,0.713034,0.722531,0.717751
B-WORK_OF_ART,169,1.0,0.023669,0.046243
I-WORK_OF_ART,347,0.7,0.040346,0.076294
B-LOC,215,0.45283,0.111628,0.179104
I-LOC,202,0.769231,0.09901,0.175439
B-CARDINAL,1005,0.893617,0.041791,0.079848
B-EVENT,85,0.6,0.035294,0.066667
I-EVENT,165,0.25,0.006061,0.011834


saving


Unnamed: 0_level_0,Support,Precision,Recall,F1
Label,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
O,73968,0.997409,0.999162,0.998285
B-ORG,487,0.890909,0.905544,0.898167
I-ORG,968,0.942063,0.839876,0.888039
B-WORK_OF_ART,0,0.0,,
I-WORK_OF_ART,0,0.0,,
B-LOC,0,0.0,,
I-LOC,0,0.0,,
B-CARDINAL,2,0.0,0.0,0.0
B-EVENT,0,0.0,,
I-EVENT,0,0.0,,


Train on 300 samples, validate on 100 samples
Epoch 1/5000
Epoch 2/5000
Epoch 3/5000
Epoch 4/5000
Epoch 5/5000
Epoch 6/5000
saving




Unnamed: 0_level_0,Support,Precision,Recall,F1
Label,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
O,3104227,0.998056,0.998783,0.998419
B-ORG,2002,0.495774,0.673826,0.571247
I-ORG,2703,0.630675,0.760636,0.689586
B-WORK_OF_ART,169,0.875,0.04142,0.079096
I-WORK_OF_ART,347,0.6,0.025937,0.049724
B-LOC,215,0.4,0.148837,0.216949
I-LOC,202,0.888889,0.039604,0.075829
B-CARDINAL,1005,0.88,0.021891,0.042718
B-EVENT,85,0.0,0.0,0.0
I-EVENT,165,1.0,0.006061,0.012048


saving


Unnamed: 0_level_0,Support,Precision,Recall,F1
Label,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
O,73968,0.997867,0.999324,0.998595
B-ORG,487,0.904858,0.917864,0.911315
I-ORG,968,0.928956,0.891529,0.909858
B-WORK_OF_ART,0,0.0,,
I-WORK_OF_ART,0,0.0,,
B-LOC,0,0.0,,
I-LOC,0,0.0,,
B-CARDINAL,2,0.0,0.0,0.0
B-EVENT,0,0.0,,
I-EVENT,0,0.0,,
