# Words are Hard!
### Exploration with DeepNLP for Ontological Word-Sense Disambiguation
### By Uthman Apatira

Consider the following sentence:

> _Alexis won the bet against her car._

Quite confusing even for a native English speaker, much less for a deterministic computer. First of all, who exactly is _her_ referring to here? In English, Alexis is a unisex name, so _her_ might very well refer to Alexis herself or to some other party. Pronouns make the sentence read more naturally, but they also introduce **referential ambiguity**. The way we (people) normally overcome this is through discourse analysis, that is, by examining the immediately preceding sentences.

Additionally, is the word _bet_ being used as a noun or as a verb? Humans readily identify it as a noun in this case, but a computer using common NLP techniques would have only a single feature (TF-IDF) or a single word embedding (GloVe/Word2Vec) to describe _bet_ in every context, whether noun, verb, or otherwise. This is called **lexical ambiguity**.
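
To make the lexical ambiguity concrete, here is a minimal sketch in plain Python, using two made-up sentences and a toy vocabulary built on the spot. It only illustrates that an index-per-word scheme gives _bet_ the same id whether it is a noun or a verb; it is not the pipeline used later in this notebook.

```python
# Two hypothetical sentences: 'bet' is a noun in the first, a verb in the second
sentences = [
    "alexis won the bet",
    "i bet she will win",
]

# Build a toy vocabulary: one integer id per surface form
vocab = {}
for sentence in sentences:
    for word in sentence.split():
        vocab.setdefault(word, len(vocab))

# Encode each sentence as a list of vocabulary ids
encoded = [[vocab[word] for word in sentence.split()] for sentence in sentences]

bet_as_noun = encoded[0][3]   # 4th word of the first sentence
bet_as_verb = encoded[1][1]   # 2nd word of the second sentence
print(bet_as_noun == bet_as_verb)   # the model sees the very same token
```

Any feature or embedding keyed on that id is shared across both senses, which is exactly the problem the later POS-augmentation section tries to soften.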

Finally, the sample sentence above also exhibits **syntactical ambiguity** because it is not clear if Alexis was betting against her own car, i.e. she assumed her car would not be able to fulfill the conditions of the bet, _or_ if Alexis was physically stationed next to her car.

Part of the beauty of language is its intrinsic ambiguity. It's honestly a miracle we can communicate and understand each other at all! This notebook will demonstrate some best practices and dive into a few modern NLP techniques that can help computer models perform better at natural language understanding. Stated formally, this notebook will cover some exploratory data analysis and modeling with ontological word-sense disambiguation.

# Table Of Contents

- [Named Entity Tagging](#Named-Entity-Tagging)
- [Classical Sentiment Analysis](#Classical-Sentiment-Analysis)
- [Bayesian Hyperparameter Tuning](#Bayesian-Hyperparameter-Tuning)
- [Embedding Training](#Embedding-Training)
- [Sentiment Analysis with CNNs](#Sentiment-Analysis-with-CNNs)
- [Sentiment Analysis with RNNs](#Sentiment-Analysis-with-RNNs)
- [Sentiment Analysis with RNNs+CNNs](#Sentiment-Analysis-with-RNNs+CNNs)
- [POS-Augmentation for Embeddings](#POS-Augmentation-for-Embeddings)
- [Transfer Learning](#Transfer-Learning)
- [Reflections](#Reflections)

# Named Entity Tagging

Our journey starts with _Named Entity Tagging_, also known as named entity recognition. A named entity is just some entity that deserves to have a name =). The process of tagging them seeks to classify named entities into some pre-defined categories such as the names of persons, organizations, places, etc.

By properly tagging named entities, our desire is to have a strong ally in our battle of ontological word-sense disambiguation.

In [None]:
import sys, nltk
from nltk.chunk import tree2conlltags

In [None]:
# Download some nltk packages, if necessary
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

In [None]:
sample_lines = [
    'Alexis won the bet against her car.',
    'Free us from the tyranny of technology, making a connected life a more human experience.',
    'He started working at Microsoft in Seattle, Washington.',
    'San Francisco has an IBM office.'
]

In [None]:
def NETagger(line):
    results = []
    
    # A potential improvement:
    # Some organizations, e.g. 'JC Penney', places 'San Francisco'
    # and name-nouns, 'Brian Kursar' are multi-word. We can 
    # combine them..

    # Tag the individual parts of speech;
    line = nltk.pos_tag(line)

    # Now, extract named entities via parse tree
    line = nltk.ne_chunk(line)

    # Finally, convert the tree to a list of
    # tag-tuples: (word, pos, tag)
    line = tree2conlltags(line)

    for word,pos,tag in line:
        if 'GPE' in tag:            tag = 'PLACE'
        elif 'ORGANIZATION' in tag: tag = 'ORGANIZATION'
        elif 'NN' in pos:           tag = 'NOUN'
        elif 'VB' in pos:           tag = 'VERB'
        else:                       tag = 'OTHER'
        results.append(tag)
        
    return results

In [None]:
#for line in sys.stdin:
for line in sample_lines:
    print('\x1b[1;31m' + line + '\x1b[0m')
    
    # Chop it up, we can use regex to do this directly
    line = nltk.word_tokenize(line)
    
    results = NETagger(line)
    for i,word in enumerate(line):
        if results[i] == 'OTHER': continue
        print(word, 'is', results[i])
    print('')

Not bad! NLTK's `pos_tag` supports many more parts of speech, and the `ne_chunk` chunker can identify many more named entities than what we've listed. But for this demo notebook, our dataset will be sufficiently small that if we get too crazy at this point, there potentially won't be enough overlap. That stated, to see the full tagset, execute the cell below. For our purposes, we're limiting the output to just nouns, verbs, places, and organizations. Everything else will just be marked as `'OTHER'`:

In [None]:
# Let's see what it gives us...
# nltk.help.upenn_tagset()
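
The multi-word improvement mentioned in the `NETagger` comments could be sketched roughly as follows. This assumes the `(word, pos, tag)` tuples use IOB-style tags such as `B-GPE` / `I-GPE`, which is the shape `tree2conlltags` returns. The helper name `merge_entities` and the sample tuples are hypothetical and hand-written for illustration, not actual NLTK output.

```python
def merge_entities(conll_tags):
    """Join consecutive B-/I- tagged tokens into single multi-word entities."""
    merged = []
    for word, pos, tag in conll_tags:
        if tag.startswith("I-") and merged and merged[-1][1] == tag[2:]:
            # Continuation of the previous entity: extend its text
            merged[-1] = (merged[-1][0] + " " + word, merged[-1][1])
        elif tag.startswith(("B-", "I-")):
            # Start of a new entity: strip the IOB prefix
            merged.append((word, tag[2:]))
        else:
            # Not part of any named entity
            merged.append((word, "O"))
    return merged


# Hand-written sample in the (word, pos, tag) layout
sample = [
    ("San", "NNP", "B-GPE"),
    ("Francisco", "NNP", "I-GPE"),
    ("has", "VBZ", "O"),
    ("an", "DT", "O"),
    ("IBM", "NNP", "B-ORGANIZATION"),
    ("office", "NN", "O"),
]

for text_span, label in merge_entities(sample):
    print(text_span, "->", label)
```

Merging changes the number of tokens per sentence, so anything downstream that relies on one tag per word would have to be adjusted accordingly.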

# Classical Sentiment Analysis

Here, by 'classical' I mean employing term frequency-inverse document frequency (TF-IDF) features. We will stack this transformer under a logistic regression classifier for our final classification.
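
As a rough illustration of what TF-IDF computes, here is a hand-rolled sketch in plain Python. It assumes raw term counts, the smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization, which is my understanding of scikit-learn's defaults when `smooth_idf=True`. The three tiny documents are made up.

```python
import math
from collections import Counter

# Three hypothetical, pre-tokenized documents
docs = [
    "great movie great cast".split(),
    "terrible movie".split(),
    "great fun".split(),
]

n_docs = len(docs)
# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for doc in docs for term in set(doc))


def smooth_idf(term):
    # Smoothed inverse document frequency
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0


def tfidf(doc):
    counts = Counter(doc)
    weights = {term: count * smooth_idf(term) for term, count in counts.items()}
    # L2-normalize so every document vector has unit length
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


print(tfidf(docs[1]))
```

In the second document, _terrible_ (which appears in one document) ends up with a larger weight than _movie_ (which appears in two), which is the discriminating behavior we want from the vectorizer.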

To get us started, I'll be using the `Large Movie Review Dataset`, which is a very balanced dataset of 25k positive and 25k negative reviews of different movies, with no more than 30 reviews per particular movie. So that I don't forget:

```
@InProceedings{maas-EtAl:2011:ACL-HLT2011,
  author    = {Maas, Andrew L.  and  Daly, Raymond E.  and  Pham, Peter T.  and  Huang, Dan  and  Ng, Andrew Y.  and  Potts, Christopher},
  title     = {Learning Word Vectors for Sentiment Analysis},
  booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
  month     = {June},
  year      = {2011},
  address   = {Portland, Oregon, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {142--150},
  url       = {http://www.aclweb.org/anthology/P11-1015}
}
```

The dataset is quite messy: all the reviews are in separate files. Let's bunch them up into .feather files for easy loading.

In [None]:
import os, gc, re
import pandas as pd
import numpy as np
import pickle

# For the fancy jupyter status bar
from tqdm import tqdm

In [None]:
train_neg = os.listdir("net/aclImdb/train/neg/")
train_pos = os.listdir("net/aclImdb/train/pos/")
test_neg = os.listdir("net/aclImdb/test/neg/")
test_pos = os.listdir("net/aclImdb/test/pos/")

len(train_pos), len(train_neg), len(test_pos), len(test_neg)

In [None]:
train = []

for file in tqdm(train_neg):
    with open("net/aclImdb/train/neg/" + file) as fhandler:
        train.append([0, fhandler.read()])
        
for file in tqdm(train_pos):
    with open("net/aclImdb/train/pos/" + file) as fhandler:
        train.append([1, fhandler.read()])
        
train = pd.DataFrame(train, columns=['sentiment','comment'])


test = []
for file in tqdm(test_neg):
    with open("net/aclImdb/test/neg/" + file) as fhandler:
        test.append([0, fhandler.read()])
        
for file in tqdm(test_pos):
    with open("net/aclImdb/test/pos/" + file) as fhandler:
        test.append([1, fhandler.read()])
        
test  = pd.DataFrame(test, columns=['sentiment','comment'])

In [None]:
# Stash as feather for easy future access
train.to_feather('net/train.ftr')
test.to_feather('net/test.ftr')

In [None]:
# This is how to load it in the future
train = pd.read_feather('net/train.ftr')
test  = pd.read_feather('net/test.ftr')

In [None]:
train.head()

In [None]:
test.tail()

In [None]:
# Let's start by training the TF-IDF vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

tdidf = TfidfVectorizer(
    lowercase   = True,
    stop_words  = 'english',
    ngram_range = (1, 2),
    max_df      = 0.4,
    min_df      = 5,
    binary      = False, 
    smooth_idf  = True
)

# We'll train our vectorizer using both train+test
tdidf.fit(train.comment.append(test.comment))

In [None]:
# If this dataset didn't already provide us a train/test split, we could have
# used KFold, and re-run our model a few times over.
train_idf = tdidf.transform(train.comment)
test_idf  = tdidf.transform(test.comment)

In [None]:
# Now, train a LReg classifier over our IDF vector.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
                               
lreg = LogisticRegression(
    dual    = False,  # For large dsets, this can speed things up . . .
    C       = 5.125,  # Our main hyperparameter
    n_jobs  = 1
)

lreg.fit(train_idf, train.sentiment)
predict = lreg.predict(test_idf)

print('Accuracy:', accuracy_score(test.sentiment, predict))

# Bayesian Hyperparameter Tuning

With 88% accuracy, our score wasn't that great. There are many ways we can improve it. We can add tri-grams to the TF-IDF vectorizer. We can also introduce character n-grams. No matter what we do, it is always a good idea to tune our hyperparameters using our hold-out set.

One technique for doing this is grid search over our hyperparameter space. This seems to be going out of fashion because Bayesian optimization is all the rage these days, so let's go with that. The deep learner, a 'frequentist', believes a model's parameters are fixed and the data is random. A Bayesian learner, on the other hand, holds that the _data_ is fixed but the parameters are random. If you think about it, this makes sense because when a model is trained, the data is fixed and isn't random anymore. I won't go into more detail than that here, but I'll say that there are packages that will take the conditional probabilities off your hands :-). Let's use `hyperopt` to explore a possible hyperparameter space and find the `ngram_range`, `C`, `max_df`, and `min_df` values that maximize our accuracy score:

In [None]:
from hyperopt import fmin, tpe, hp, Trials, space_eval

In [None]:
# First, define the search space:
space = {
    'ngram_range': hp.choice('x_ngram_range', [(1,1), (1,2), (1,3)]),
    'min_df'     : hp.quniform('x_min_df', 3, 6, 1),
    'max_df'     : hp.uniform('x_max_df', 0.4, 1.0),
    'C'          : hp.uniform('x_C', 0.1, 10),
}

In [None]:
# Then define our objective function
def objective(space):
    # This is where KFold would come into play. We actually have a small
    # DSet and using ALL of it (test+train) would be beneficial. We can
    # Concatenate both sets, then do KFold x-validation here. But for this
    # demo, we'll just use the test set as-is.
    global train, test
    
    # Debugging:
    print(space)
    
    tdidf = TfidfVectorizer(
        lowercase   = True,
        stop_words  = 'english',
        ngram_range = space['ngram_range'],
        max_df      = space['max_df'],
        min_df      = int(space['min_df']),
        binary      = False, 
        smooth_idf  = True
    )
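    # Fit on train+test, then transform with THIS trial's vectorizer so the
    # sampled vectorizer hyperparameters actually affect the score
    tdidf.fit(train.comment.append(test.comment))
    train_idf = tdidf.transform(train.comment)
    test_idf  = tdidf.transform(test.comment)
    
    lreg = LogisticRegression(
        dual    = False,
        C       = space['C']
    )
    lreg.fit(train_idf, train.sentiment)
    
    predict = lreg.predict(test_idf)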
    score   = accuracy_score(test.sentiment, predict)
    print('\t', score, '\n')
    
    # 1-score, since we're going to attempt to minimize this value
    return 1.0 - score

In [None]:
# The Trials object will store details of each iteration
trials = Trials()

# Run it:
best = fmin(
    objective,
    space     = space,
    algo      = tpe.suggest,
    max_evals = 10,
    trials    = trials
)

# Get the values of the optimal parameters
best_params = space_eval(space, best)

Alright, it looks like the tuned-TfIDF gets us all the way up to 88.704% accuracy.
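
For contrast, the grid-search alternative mentioned earlier simply evaluates every combination on a fixed grid. Below is a minimal sketch in plain Python. The function `toy_error` is a made-up stand-in for "train the vectorizer and classifier, then return 1 - accuracy", and the grid values are arbitrary.

```python
import itertools


def toy_error(C, min_df):
    # Hypothetical objective standing in for 1 - accuracy;
    # by construction it is lowest at C=5.0, min_df=4
    return (C - 5.0) ** 2 / 100.0 + abs(min_df - 4) / 10.0


grid = {
    "C":      [0.1, 1.0, 5.0, 10.0],
    "min_df": [3, 4, 5, 6],
}

best_params, best_error = None, float("inf")
# Exhaustively try every (C, min_df) pair on the grid
for C, min_df in itertools.product(grid["C"], grid["min_df"]):
    error = toy_error(C, min_df)
    if error < best_error:
        best_params, best_error = {"C": C, "min_df": min_df}, error

print(best_params, best_error)
```

The cost grows multiplicatively with every added hyperparameter (16 evaluations here), whereas the TPE search above was capped at `max_evals = 10` regardless of how many dimensions the space has.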

# Embedding Training

Now that we've gotten as far as we believe we can with TFIDF, let's move on to neural networks.

We could feed our sparse TF-IDF representation into our net as input (think one-hot encoding). But instead, we're going to try to create an embedding, a dense, lower-dimensional representation of our language data. This will involve first tokenizing our textual corpus into indices, and then assigning a random, trainable weight vector to each word.
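
The two steps just described (words to indices, then indices to vectors) can be sketched in plain Python. The corpus, the 4-unit dimension, and the initialization range are all made up for illustration; in the notebook itself, Keras and Word2Vec do this work.

```python
import random

random.seed(0)  # make the random vectors reproducible

corpus = ["the movie was great", "the plot was thin"]  # toy corpus
embedding_dim = 4                                      # toy dimension

# Step 1: map each distinct word to an integer index (0 is kept for padding)
word_index = {}
for sentence in corpus:
    for word in sentence.split():
        word_index.setdefault(word, len(word_index) + 1)

# Step 2: one small random vector per word; row 0 (padding) stays all zeros
embedding = [[0.0] * embedding_dim]
embedding += [
    [random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
    for _ in word_index
]


def embed(sentence):
    """Look up the vector for every word in the sentence."""
    return [embedding[word_index[word]] for word in sentence.split()]


print(word_index)
print(embed("the plot"))
```

During training, the rows of `embedding` would be the trainable weights; here they are just random starting values.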

Before starting, we're going to need to set a fixed input size for our comments. How do we know what size to use? Let's see what the data tells us:

In [None]:
import matplotlib.pyplot as plt

words = train.comment.str.count(r'\S+').astype(np.uint16)

plt.hist(words, bins=1000)
plt.show()

**250** will be a very competitive number of words. It'll capture most of the text of most of our comments.
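
One way to put a number on that claim is to compute what fraction of reviews fit entirely within a candidate length. This is a small sketch with made-up word counts; the helper name `coverage` is hypothetical.

```python
def coverage(lengths, max_len):
    """Fraction of documents that fit entirely within max_len words."""
    return sum(1 for n in lengths if n <= max_len) / len(lengths)


# Hypothetical word counts, one per review
toy_lengths = [40, 120, 180, 230, 250, 300, 410, 900]

for candidate in (100, 250, 500):
    print(candidate, coverage(toy_lengths, candidate))
```

On the real data the same function could be called as `coverage(words, 250)`, using the `words` series computed for the histogram.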

In [None]:
from gensim.models import Word2Vec

from random import shuffle

import keras.backend as K
from keras.preprocessing import text, sequence
from keras.layers import Input, SpatialDropout1D,Dropout, GlobalAveragePooling1D, CuDNNGRU, Bidirectional, Dense, Embedding, Conv1D 
from keras.models import Model
from keras.optimizers import Adam, Nadam
from keras.callbacks import ModelCheckpoint

from sklearn import metrics

import gc

In [None]:
max_vocab_size  = 100000   # We won't use more than 100K words
max_text_len    = 250      # We'll only examine up to 250 words of each review

embedding_dimension = 100    # Each word will be represented with a dense, 100 unit vector
embedding_maxvocab  = 500000 # Our language model will know up to 500k words, probably more than in our toy dset
embedding_cbow_win  = 5      # Window size for unsupervised text context learning

In [None]:
all_comments = train.comment.append(test.comment).values
all_comments = [text.text_to_word_sequence(comment) for comment in tqdm(all_comments)]

w2v = Word2Vec(size=embedding_dimension, window=embedding_cbow_win, max_vocab_size=embedding_maxvocab)
w2v.build_vocab(all_comments)
w2v.train(all_comments, total_examples=w2v.corpus_count, epochs=5)

w2v.save('net/embedding.w2v')

del all_comments; gc.collect()

Note, we allow our embedding vocabulary to be quite big, but this is because while building the corpus, we don't know which words are important yet. Later on once we start supervised learning, we can prune accordingly.

Next up, let's build a mapper to tokenize our text (change the words -> vocab indices):

In [None]:
tokenizer = text.Tokenizer(num_words=max_vocab_size)
tokenizer.fit_on_texts(
    list(train.comment.fillna('NA').values) + list(test.comment.fillna('NA').values)
)

In [None]:
def remove_stopwords(tokenizer):
    # Just like we did with TfIdf, let's remove stopwords
    # From our vocabulary
    
    C_MIN_WF = 4
    C_MIN_DF = 4
    C_MAX_DF = tokenizer.document_count * 0.4
    print('MAX_DF', C_MAX_DF)

    whack_words = [w for w,c in tokenizer.word_counts.items() if c < C_MIN_WF] + [w for w,c in tokenizer.word_docs.items() if c < C_MIN_DF or c > C_MAX_DF]
    whack_words = list(set(whack_words))
    for word in whack_words:
        del tokenizer.word_counts[word], tokenizer.word_index[word], tokenizer.word_docs[word]
        #tokenizer.num_words -= 1

    # Since we deleted some entries, relabel (fix) the word indices
    word_index_keys = tokenizer.word_index.keys()
    for i,word in enumerate(word_index_keys):
        tokenizer.word_index[word] = i+1
    print('Deleting', len(whack_words), 'words that appear too frequently or too infrequently')

    word_index_keys = tokenizer.word_index.keys()
    print(len(word_index_keys), 'words left')
    return tokenizer

In [None]:
tokenizer = remove_stopwords(tokenizer)

In [None]:
# We just built the model, so no need to load it.
# But this is how that would be done:
w2v = Word2Vec.load('net/embedding.w2v')


# Start by initializing our embedding matrix [VocabSize x EmbeddingDimension].
# Word indices start at 1 (0 is reserved for padding), so we need one extra row.
word_index = tokenizer.word_index
nb_words   = min(max_vocab_size, len(word_index) + 1)
embedding_matrix = np.zeros((nb_words, embedding_dimension), dtype=np.float64)

# Since we're using a self trained embedding, our expectation is that there
# will be no out-of-vocabulary words present... but I'll leave that code in
# there anyway because I'll be coming back to it:
for word, i in word_index.items():
    if i >= nb_words: continue
    try:
        embedding_vector = w2v.wv[word]
    except KeyError:
        embedding_vector = None
    if embedding_vector is not None: embedding_matrix[i] = embedding_vector

In [None]:
embedding_matrix.shape

In [None]:
# Now we can execute the tokenizer:
train_seq = tokenizer.texts_to_sequences(train.comment)
test_seq  = tokenizer.texts_to_sequences(test.comment)

# Add 0-padding to reach max_text_len size
train_seq = sequence.pad_sequences(train_seq, maxlen=max_text_len)
test_seq  = sequence.pad_sequences(test_seq,  maxlen=max_text_len)

In [None]:
print(train_seq.shape)
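
To see what the padding step does to an individual review, here is a plain-Python sketch of the behavior. It assumes the Keras defaults for `pad_sequences`, which as far as I know pad on the left and also truncate from the left (so the *last* `maxlen` tokens are kept). The helper name `pad_pre` is hypothetical.

```python
def pad_pre(seq, maxlen, value=0):
    """Keep the last `maxlen` items, left-padding with `value` if too short."""
    seq = list(seq)[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq


print(pad_pre([5, 8, 2], maxlen=5))               # short review: zeros are added in front
print(pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5))   # long review: the beginning is dropped
```

If the opening words of a review matter more than its ending, the truncation side is worth revisiting.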

In [None]:
# The first character of the first review, and that review's padded sequence of vocabulary indices:
train.comment[0][0], train_seq[0]

# Sentiment Analysis with CNNs

That was a lot of prep work, but now we're ready to build a NNet model to perform sentiment analysis. For the loss function, I'll use binary log loss.

In [None]:
def simple_cnn_model(embedding_matrix):
    
    inp = Input(shape = (max_text_len, ))
    x = Embedding(
        nb_words,
        embedding_dimension,
        weights      = [embedding_matrix],
        input_length = max_text_len,
        trainable    = False
    )(inp)
    
    x = SpatialDropout1D(0.2)(x)
    x = Conv1D(filters=32, kernel_size=3, activation = 'relu')(x)
    x = GlobalAveragePooling1D()(x)
    x = Dropout(0.2)(x)
    x = Dense(1, activation='sigmoid')(x) # Squash
    
    model = Model(inputs=inp, outputs=x)

    model.compile(
        optimizer = Nadam(lr=0.01, clipvalue=0.5),
        loss = 'binary_crossentropy',
        metrics=['accuracy']
    )
    return model

model = simple_cnn_model(embedding_matrix)
model.summary()

This is a very simple model. Notice that our embedding layer is now frozen. Our understanding is that the embedding layer encodes some information about 'English', which includes syntax, semantics, and pragmatics to some degree. We don't want to change that. Rather, we'll leave it to our convolutional filters to learn the target mapping of good / bad sentiment.

In [None]:
check_point = ModelCheckpoint(
    'net/best_cnn_model.hdf5',
    monitor = "val_loss",
    mode = "min",
    save_best_only = True,
    verbose = 1
)

model.fit(
    train_seq, train.sentiment,
    batch_size = 16,
    epochs     = 10,
    verbose    = 1,
    callbacks  = [check_point],
    validation_data = (test_seq, test.sentiment)
)

Our convolutional model performed worse than our classical TF-IDF + LReg model, how disappointing. Some things to note: we used `kernel_size=3` and don't have any spatial pooling (only temporal pooling). You can imagine that here we're looking at 3-grams, whereas our TF-IDF model used combinations of 1-, 2-, and 3-grams. We could give our net additional kernels to consider additional n-grams as well, but that won't be explored in this notebook. There are many other ways to improve our network, but let's take a look at an RNN model next.
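
The "kernel size 3 is like looking at 3-grams" intuition can be sketched without any deep learning library. The helpers below are hypothetical and only show which windows a width-3 filter would slide over and how global average pooling collapses the per-window scores. The scores themselves are invented.

```python
def sliding_windows(tokens, kernel_size):
    """Every contiguous run of `kernel_size` tokens (no padding at the edges)."""
    return [
        tuple(tokens[i:i + kernel_size])
        for i in range(len(tokens) - kernel_size + 1)
    ]


def global_average(values):
    """Collapse one score per window into a single number."""
    return sum(values) / len(values)


tokens = "i did not love this movie".split()
windows = sliding_windows(tokens, kernel_size=3)
for window in windows:
    print(window)

# Pretend a single filter produced these activations, one per window
made_up_scores = [0.0, 1.0, 1.0, 0.0]
print(global_average(made_up_scores))
```

A sequence of 6 tokens yields 6 - 3 + 1 = 4 windows, and a phrase like "did not love" is only visible to the filter because all three words fall inside one window.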

# Sentiment Analysis with RNNs

In [None]:
def simple_rnn_model(embedding_matrix):
    
    inp = Input(shape = (max_text_len, ))
    x = Embedding(
        nb_words,
        embedding_dimension,
        weights      = [embedding_matrix],
        input_length = max_text_len,
        trainable    = False
    )(inp)
    
    x = SpatialDropout1D(0.2)(x)
    x = Bidirectional(CuDNNGRU(32//2, return_sequences=True))(x)
    x = GlobalAveragePooling1D()(x)
    x = Dropout(0.2)(x)
    x = Dense(1, activation='sigmoid')(x) # Squash
    
    model = Model(inputs=inp, outputs=x)

    model.compile(
        optimizer = Nadam(lr=0.01, clipvalue=0.5),
        loss = 'binary_crossentropy',
        metrics=['accuracy']
    )
    return model

model = simple_rnn_model(embedding_matrix)
model.summary()

In [None]:
check_point = ModelCheckpoint(
    'net/best_rnn_model.hdf5',
    monitor = "val_loss",
    mode = "min",
    save_best_only = True,
    verbose = 1
)

model.fit(
    train_seq, train.sentiment,
    batch_size = 16,
    epochs     = 5,
    verbose    = 1,
    callbacks  = [check_point],
    validation_data = (test_seq, test.sentiment)
)

Since we have the entire sentence available at both train and test time, it makes sense to use a bidirectional RNN rather than limiting ourselves to processing the sentence in one direction.

But take a look at that training time. RNNs are a lot slower than CNNs because they consume data sequentially, each step depending on the previous one. There is some research out there on [helping RNNs perform better in parallel](https://arxiv.org/pdf/1803.08240.pdf) (e.g. QRNNs), but it is not widely adopted as of yet. We can use a larger batch size to speed things up, but that affects training quality.

# Sentiment Analysis with RNNs+CNNs

Let's see what happens when we combine both RNN and CNN processing:

In [None]:
def simple_rcnn_model(embedding_matrix):
    
    inp = Input(shape = (max_text_len, ))
    x = Embedding(
        nb_words,
        embedding_dimension,
        weights      = [embedding_matrix],
        input_length = max_text_len,
        trainable    = False
    )(inp)
    
    x = SpatialDropout1D(0.2)(x)
    x = Bidirectional(CuDNNGRU(32//2, return_sequences=True))(x)
    x = Conv1D(filters=64, kernel_size=1, activation = 'relu')(x)
    x = GlobalAveragePooling1D()(x)
    x = Dropout(0.2)(x)
    x = Dense(1, activation='sigmoid')(x) # Squash
    
    model = Model(inputs=inp, outputs=x)

    model.compile(
        optimizer = Nadam(lr=0.01, clipvalue=0.5),
        loss = 'binary_crossentropy',
        metrics=['accuracy']
    )
    return model

model = simple_rcnn_model(embedding_matrix)
model.summary()

In [None]:
check_point = ModelCheckpoint(
    'net/best_rcnn_model.hdf5',
    monitor = "val_loss",
    mode = "min",
    save_best_only = True,
    verbose = 1
)

model.fit(
    train_seq, train.sentiment,
    batch_size = 16,
    epochs     = 5,
    verbose    = 1,
    callbacks  = [check_point],
    validation_data = (test_seq, test.sentiment)
)

# POS-Augmentation for Embeddings

Perhaps we can further improve our NNet scores by adding in either the named entities or the part-of-speech information we worked on earlier? There are a few techniques we could use. One would be to train an embedding on our original text corpus converted into POS tags or NETs. Then, armed with an embedding to describe each distinct tag, we could simply concatenate that onto our input. Another method would be to one-hot encode (OHE) the tags and just feed them in as a separate input vector. Let's try out the first method.
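
For reference, the second method (one-hot encoding the tags), which this notebook does not pursue, might look something like the sketch below. The tag list mirrors the categories produced by `NETagger` plus a made-up `PAD` slot at index 0. The helper name `one_hot_tags` is hypothetical.

```python
# Index 0 is a hypothetical padding slot; the rest mirror the NETagger categories
TAGS = ["PAD", "NOUN", "ORGANIZATION", "OTHER", "PLACE", "VERB"]
TAG_TO_ID = {tag: i for i, tag in enumerate(TAGS)}


def one_hot_tags(tags):
    """Turn a list of tag names into a list of one-hot rows."""
    rows = []
    for tag in tags:
        row = [0] * len(TAGS)
        row[TAG_TO_ID[tag]] = 1
        rows.append(row)
    return rows


for row in one_hot_tags(["NOUN", "VERB", "PLACE"]):
    print(row)
```

With only a handful of tags, a 6-wide one-hot vector is not much larger than a small learned embedding, so the choice between the two is mostly about whether we want the net to learn relationships between tags.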

But first, a quick "Gotcha!":

In [None]:
nltk.word_tokenize("He started working at Microsoft in Seattle, Washington.")

In [None]:
text.text_to_word_sequence("He started working at Microsoft in Seattle, Washington.")

Notice that Keras's text tokenizer converts everything to lowercase? That will mess up our NET. So let's use nltk's tokenizer this round instead:

In [None]:
netag_max_vocab_size       = 6   # place, org, noun, verb, other (0 is empty)
netag_embedding_dimension  = 3   # Meh, should do it. min(50, dimen//2)

This process takes a while, so let's use all our (my) available CPU threads to speed it up:

In [None]:
from multiprocessing import Pool

def build_ne_tags():
    global train, test
    
    NUM_THREADS = 12

    all_ne_tags = train.comment.append(test.comment).values
    with Pool(NUM_THREADS) as pool:  all_ne_tags = pool.map(nltk.word_tokenize, all_ne_tags)
    with Pool(NUM_THREADS) as pool:  all_ne_tags = pool.map(NETagger, all_ne_tags)
        
    return all_ne_tags

In [None]:
# Now we're in business :-)
all_ne_tags = build_ne_tags()

In [None]:
w2v_tag = Word2Vec(size=netag_embedding_dimension, window=embedding_cbow_win, max_vocab_size=netag_max_vocab_size)
w2v_tag.build_vocab(all_ne_tags)
w2v_tag.train(all_ne_tags, total_examples=w2v_tag.corpus_count, epochs=5)

# Store resulting embedding
w2v_tag.save('net/net_embedding.w2v')

In [None]:
# In case we're coming in from another run where we have it saved already:
w2v_tag = Word2Vec.load('net/net_embedding.w2v')

# Stash the data back into our dset
train['netags'] = all_ne_tags[:train.shape[0]]
test['netags']  = all_ne_tags[train.shape[0]:]

In [None]:
train.head()

Next, rebuild the CNN model to take dual inputs:

In [None]:
w2v_tag.wv.vocab

In [None]:
tagger = {
    'oov':          0,
    'NOUN':         1,
    'ORGANIZATION': 2,
    'OTHER':        3,
    'PLACE':        4,
    'VERB':         5,
}

In [None]:
# Start by initializing our embedding matrix [VocabSize x EmbeddingDimension]
net_embedding_matrix = np.zeros((netag_max_vocab_size, netag_embedding_dimension))

# Self trained, no OOV:
for key,val in tagger.items():
    try:
        net_embedding_matrix[val] = w2v_tag.wv[key]
    except KeyError:
        pass

In [None]:
# Cool !
net_embedding_matrix

In [None]:
train_seq_net = train.netags.map(lambda tags: [tagger[tag] for tag in tags])
test_seq_net  = test.netags.map(lambda tags: [tagger[tag] for tag in tags])

# Add 0-padding to reach max_text_len size
train_seq_net = sequence.pad_sequences(train_seq_net, maxlen=max_text_len)
test_seq_net  = sequence.pad_sequences(test_seq_net,  maxlen=max_text_len)

In [None]:
# Setup our dual-head CNN:
from keras.layers import concatenate

def intermediate_cnn_model(embedding_matrix, net_embedding_matrix):
    
    inp  = Input(shape = (max_text_len, ))
    inp2 = Input(shape = (max_text_len, ))
    
    x = Embedding(
        nb_words,
        embedding_dimension,
        weights      = [embedding_matrix],
        input_length = max_text_len,
        trainable    = False
    )(inp)
    
    y = Embedding(
        netag_max_vocab_size,
        netag_embedding_dimension,
        weights      = [net_embedding_matrix],
        input_length = max_text_len,
        trainable    = False
    )(inp2)
    
    x = SpatialDropout1D(0.2)(x)
    x = concatenate([
        # Text + NETags
        x, y
    ])
    
    x = Conv1D(filters=32, kernel_size=3, activation = 'relu')(x)
    x = GlobalAveragePooling1D()(x)
    x = Dropout(0.2)(x)
    x = Dense(1, activation='sigmoid')(x) # Squash
    
    model = Model(inputs=[inp,inp2], outputs=x)

    model.compile(
        optimizer = Nadam(lr=0.01, clipvalue=0.5),
        loss = 'binary_crossentropy',
        metrics=['accuracy']
    )
    return model

model = intermediate_cnn_model(embedding_matrix, net_embedding_matrix)
model.summary()

In [None]:
check_point = ModelCheckpoint(
    'net/best_intermediate_cnn_model.hdf5',
    monitor = "val_loss",
    mode = "min",
    save_best_only = True,
    verbose = 1
)

model.fit(
    [train_seq, train_seq_net], train.sentiment,
    batch_size = 16,
    epochs     = 10,
    verbose    = 1,
    callbacks  = [check_point],
    validation_data = ([test_seq, test_seq_net], test.sentiment)
)

Wadaya know, it performs considerably better =) !

Stated differently, our net is better able to understand the subtleties of the English language now.

# Transfer Learning

Up until now, our results have been lackluster. Even with POS information added in, our net has yet to eclipse the results we got from TF-IDF, though it came pretty close.

This shouldn't come as a surprise.

What we're essentially doing here is asking our model to not only analyze the sentiment of the movie reviews, but also derive some understanding of English. Recall that our neural nets have no context. They start not knowing _anything_, like babies, and then have to adapt to the world/problem around them. What would our accuracy look like if we first trained our embeddings against a much larger textual corpus than 50k movie reviews? What if we trained it on... Wikipedia?

For this task, I'll make use of 300-dimensional pre-trained [FastText](https://github.com/facebookresearch/fastText) vectors (the `wiki.en.vec` file), which were trained on English Wikipedia, a far larger corpus than our 50k movie reviews. I hope it performs even better :-).

In [None]:
embedding_dimension = 300

In [None]:
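def get_coefs(word, *arr): return word, np.asarray(arr, dtype='float32')

# fastText .vec files begin with a "<vocab size> <dimension>" header line, so skip it
with open('net/wiki.en.vec', encoding='utf-8') as fhandler:
    next(fhandler)
    glove = dict(get_coefs(*o.rstrip().rsplit(' ')) for o in fhandler)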

In [None]:
oov = []
word_index = tokenizer.word_index
nb_words   = min(max_vocab_size, len(word_index) + 1)  # indices start at 1; row 0 is padding

# Any OOV word, will have a 0-vector
embedding_matrix = np.zeros((nb_words, embedding_dimension))

for word, i in word_index.items():
    if i >= nb_words: break
    vec = glove.get(word)
    if vec is None:
        oov.append(word)
    else:
        embedding_matrix[i] = vec
        
print(len(oov), 'OOV tokens')

In [None]:
oov[:20]

After getting rid of stopwords, which are of little benefit because they are either used too often to have discriminating power or used too infrequently to generalize, we're left with a much smaller vocabulary with only 4426 OOV tokens. Clearly, we need to do some pre-processing work, such as splitting on "'"s and perhaps mapping numbers from "10" -> "ten", "9" -> "nine", etc. Maybe next time, since this notebook is getting quite long ...
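
As a rough idea of what that pre-processing could look like, here is a sketch using only the standard library. The helper name `normalize` and the small number table are made up, and a real pipeline would need to handle many more cases.

```python
import re

# Tiny, hypothetical lookup table; a real one would cover far more numbers
NUMBER_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
    "10": "ten",
}


def normalize(text):
    """Lowercase, split at apostrophes and punctuation, spell out small numbers."""
    # Runs of letters or runs of digits become tokens; everything else
    # (apostrophes, commas, spaces, ...) acts as a separator
    tokens = re.findall(r"[a-z]+|\d+", text.lower())
    return [NUMBER_WORDS.get(token, token) for token in tokens]


print(normalize("The film's 10 best scenes, 9 of them!"))
```

Applying something like this before fitting the tokenizer would turn tokens such as "film's" into "film" and "s", both of which are far more likely to exist in a pre-trained vocabulary.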

Let's rebuild our CNN model:

In [None]:
model = simple_cnn_model(embedding_matrix)
model.summary()

In [None]:
check_point = ModelCheckpoint(
    'net/best_cnn_fasttext_model.hdf5',
    monitor = "val_loss",
    mode    = "min",
    save_best_only = True,
    verbose = 1
)

model.fit(
    train_seq, train.sentiment,
    batch_size = 16,
    epochs     = 5,
    verbose    = 1,
    callbacks  = [check_point],
    validation_data = (test_seq, test.sentiment)
)

The net trained with proper embeddings gave better results than the cold-start net, but not better results than the cold-start + NETagger net. You can probably guess that the next best idea is to run training on all of Wikipedia, this time taking NETags into account. Unfortunately, I'm not willing to donate my GPU time for that at this juncture, but I expect the performance would beat TF-IDF considerably.

# Reflections

There are quite a few observations to be made here. First, it's important to have a good understanding of one's problem. The objective here was sentiment analysis. For problems like this, a few phrases are usually sufficient to identify the sentiment of a comment: "I hated", "I loved", "very displeased", "disappointed", "best movie", "unfortunately", "won't watch", "don't recommend", "brilliant film", etc. Knowing this might help design better model architectures to make the problem space easier to understand.

Furthermore, while a chainsaw cuts well, perhaps it's not the _best_ tool to wield while dining :-). Deep learning models perform best when given a lot of data, but can underperform classical machine learning on smaller datasets if transfer learning isn't a possibility. At the end of the day, one should start with intuition, but it shouldn't end there. Try everything under the sun.

Also, neural nets are non-deterministic if not seeded. Re-running the net with different starting values could result in different accuracy / loss scores. Furthermore, we ran Bayesian optimization on TF-IDF/LReg but did not do so for our net. Optimizing our layer sizes, learning rate, batch size, etc. will all have an effect on training.
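
The seeding point can be illustrated with nothing but the standard library. The helper `init_weights` below is a made-up stand-in for a layer's random initialization. In a real run one would also need to seed NumPy and the deep learning backend, and even then some GPU operations may remain non-deterministic.

```python
import random


def init_weights(n, seed=None):
    """Draw n small Gaussian 'weights'; with a seed, the draw is repeatable."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 0.05) for _ in range(n)]


run_a = init_weights(5, seed=42)
run_b = init_weights(5, seed=42)
run_c = init_weights(5, seed=7)

print(run_a == run_b)   # same seed, same starting point
print(run_a == run_c)   # different seed, different starting point
```

Reporting scores averaged over a few seeds gives a fairer comparison between the models in this notebook than any single run.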

That's all for now.