# Sentiment Analysis using word vector and ConvNet in Keras + data generator

### Data Generator?
In previous attempts, due to RAM constraint we couldn't fit the entire reviews and all the word vector dimensions into memory. Thus we can only take the first 100 vector dimensions, as well as maybe first 200 words of movie reviews.

With data generator and Keras' `model.fit_generator()` function, we can pass a Python generator that spews out infinite number of X_train and Y_train.

Downside: It's slow, and the bottleneck is not the graphics card, nor the CPU, from what I see in htop and nvidia-smi.

In [1]:
# Notebook
%matplotlib inline
import matplotlib.pyplot as plt
import sys
import numpy as np

# ==== CONFIGS ====

# The word vector can be swapped with say GoogleNews 6B dataset
# word_vector_bin_file = "word2vec/w2v-padded.bin"
# word_vector_dims = 100
word_vector_bin_file = "word2vec/GoogleNews.bin"
word_vector_dims = 300
# word_vector_bin_file = "word2vec/model-0.bin"
# word_vector_dims = 100
# word_vector_bin_file = "word2vec/d2v-not-padded-300d.bin"
# word_vector_dims = 100

# in aclImdb, the longest review is 2470 words long
# Due to memory constraint, in this one I limit to 200 words
max_sentence_length = 250

# Can easily swap with other datasets if you want
positive_review_txts = "aclImdb/train/pos/*.txt"
negative_review_txts = "aclImdb/train/neg/*.txt"
positive_review_vals = "aclImdb/test/pos/*.txt"
negative_review_vals = "aclImdb/test/neg/*.txt"
# positive_review_txts = "polarity2/txt_sentoken/pos/*.txt"
# negative_review_txts = "polarity2/txt_sentoken/neg/*.txt"
# positive_review_vals = "polarity2/txt_sentoken/pos/*.txt"
# negative_review_vals = "polarity2/txt_sentoken/neg/*.txt"

pad_token = '<PAD/>'
positive_y = [1, 0]
negative_y = [0, 1]

# Test theano and graphics card
import theano.tensor as T

Using gpu device 0: GeForce GTX 1060 6GB (CNMeM is disabled, cuDNN 5105)


## Step 1: Make X_train

X_train data structure is a 3D array, consisting of reviews, words, and vectors:
```json
[
  // a review
  [
    // a word, and its array of 100 vectors
    [0.75, 0.64 ...],
    ...
  ], 
  ...
]
```

### TODO:
1. Load all the reviews into memory
2. Normalize the text
3. Add words to vocab array to make word vector retrieval faster
3. Determine vocab size, max review length

In [2]:
from bs4 import BeautifulSoup  
import re

def normalise_text(text):
    #1 Remove HTML (inspired by Kaggle)
    text = BeautifulSoup(text, "html.parser").getText()

    #2 Tokenize (stolen from Yoon Kim's CNN)
    text = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", text)     
    text = re.sub(r"\'s", " \'s", text) 
    text = re.sub(r"\'ve", " \'ve", text) 
    text = re.sub(r"n\'t", " n\'t", text) 
    text = re.sub(r"\'re", " \'re", text) 
    text = re.sub(r"\'d", " \'d", text) 
    text = re.sub(r"\'ll", " \'ll", text) 
    text = re.sub(r",", " , ", text) 
    text = re.sub(r"!", " ! ", text) 
    text = re.sub(r"\(", " \( ", text) 
    text = re.sub(r"\)", " \) ", text) 
    text = re.sub(r"\?", " \? ", text) 
    text = re.sub(r"\s{2,}", " ", text)
    
    # Step 3: 
    return text.lower()

def pad_text_list(text_list, pad_token="<PAD/>", pad_width=0):
    return text_list + ([pad_token] * (pad_width - len(text_list)))

def text_to_padded_list(text, pad_token="<PAD/>", pad_width=0):
    text_list = normalise_text(text).split()
    return pad_text_list(text_list, pad_token, pad_width)

In [3]:
import glob

# highest word count shall be the convnet rows
highest_review_word_count = 0
training_reviews = []
validating_reviews = []

# just for notebook
file_read_count = 0

for txt in glob.glob(positive_review_txts):
    with (open(txt, 'r')) as f:
        word_array =  normalise_text(f.read()).split()
        highest_review_word_count = max(highest_review_word_count, len(word_array))
        training_reviews.append([word_array, positive_y])
        file_read_count += 1
        if file_read_count % 1000 == 0:
            sys.stdout.write("\r{0}".format(file_read_count))
            sys.stdout.flush()

for txt in glob.glob(negative_review_txts):
    with (open(txt, 'r')) as f:
        word_array = normalise_text(f.read()).split()
        highest_review_word_count = max(highest_review_word_count, len(word_array))
        training_reviews.append([word_array, negative_y])
        file_read_count += 1
        if file_read_count % 1000 == 0:
            sys.stdout.write("\r{0}".format(file_read_count))
            sys.stdout.flush()
        
# For validation purposes
for txt in glob.glob(positive_review_vals):
    with (open(txt, 'r')) as f:
        word_array = normalise_text(f.read()).split()
        validating_reviews.append([word_array, positive_y])
        file_read_count += 1
        if file_read_count % 1000 == 0:
            sys.stdout.write("\r{0}".format(file_read_count))
            sys.stdout.flush()

for txt in glob.glob(negative_review_vals):
    with (open(txt, 'r')) as f:
        word_array = normalise_text(f.read()).split()
        validating_reviews.append([word_array, negative_y])
        file_read_count += 1
        if file_read_count % 1000 == 0:
            sys.stdout.write("\r{0}".format(file_read_count))
            sys.stdout.flush()

print('highest word count: ', highest_review_word_count)

50000('highest word count: ', 2606)


## Step 2: Assign vector to vocabs

In [4]:
import sys
import gensim
from gensim.models import Word2Vec
word_vecs = Word2Vec.load_word2vec_format(word_vector_bin_file, binary=True)

In [5]:
word_vecs_x = {}

def word_vector_for(word):
    try:
        return word_vecs[word][:word_vector_dims]
    except KeyError:
        pass
    
    try:
        return word_vecs_x[word]
    except KeyError:
        word_vecs_x[word] = np.random.uniform(-0.25, 0.25, word_vector_dims)
        return word_vecs_x[word]

In [6]:
import random
# In this case, we will use a generator to generate these big ass numpy arrays on the fly.
# Considerations:
# - This generator must be an infinite loop
# - Every iteration must have shuffled data
#
#
# It will return a tuple of single X_train (sentence_length, dims) and Y_train (2,1)
#
def test_reviews_generator():
    while 1:
        # Shuffle the reviews
        random.shuffle(training_reviews)
        
        batch_counter = 0
        batch_size = 50
        x_train = np.full((batch_size, max_sentence_length, word_vector_dims), 0, dtype='float32')
        y_train = np.full((batch_size, 2), 0, dtype='float32')
        for i, review in enumerate(training_reviews):
            for j, word in enumerate(review[0]):
                x_train[batch_counter][j] = word_vector_for(word)
                if j == max_sentence_length - 1:
                    break
            x_train[batch_counter][len(review[0]):-1] = word_vector_for(pad_token)
            y_train[batch_counter] = np.array(review[1])
            
            if batch_counter + 1 == batch_size:
                batch_counter = 0
                yield x_train, y_train
                x_train = np.full((batch_size, max_sentence_length, word_vector_dims), 0, dtype='float32')
                y_train = np.full((batch_size, 2), 0, dtype='float32')
            else:
                batch_counter += 1

test123 = test_reviews_generator()
test456 = next(test123)
# print(test456[0].shape)
print(test456[0][0])

def validation_reviews_generator():
    while 1:
        # Shuffle the reviews
        random.shuffle(validating_reviews)
        
        batch_counter = 0
        batch_size = 50
        x_train = np.full((batch_size, max_sentence_length, word_vector_dims), 0, dtype='float32')
        y_train = np.full((batch_size, 2), 0, dtype='float32')
        for i, review in enumerate(validating_reviews):
            for j, word in enumerate(review[0]):
                x_train[batch_counter][j] = word_vector_for(word)
                if j == max_sentence_length - 1:
                    break
            x_train[batch_counter][len(review[0]):-1] = word_vector_for(pad_token)
            y_train[batch_counter] = np.array(review[1])

            if batch_counter + 1 == batch_size:
                batch_counter = 0
                yield x_train, y_train
                x_train = np.full((batch_size, max_sentence_length, word_vector_dims), 0, dtype='float32')
                y_train = np.full((batch_size, 2), 0, dtype='float32')
            else:
                batch_counter += 1

[[ 0.08447266 -0.00035286  0.05322266 ...,  0.01708984  0.06079102
  -0.10888672]
 [ 0.00704956 -0.07324219  0.171875   ...,  0.01123047  0.1640625
   0.10693359]
 [ 0.11474609  0.06689453 -0.15234375 ..., -0.05566406 -0.23046875
  -0.16113281]
 ..., 
 [ 0.16343309  0.11218996  0.21791375 ..., -0.10633187 -0.23156008
  -0.11883703]
 [ 0.16343309  0.11218996  0.21791375 ..., -0.10633187 -0.23156008
  -0.11883703]
 [ 0.          0.          0.         ...,  0.          0.          0.        ]]


## Step 3: Keras

![YoonKim CNN Architecture](yoonkim-cnn-architecture.png)

In [7]:
from keras.models import Model, Sequential
from keras.layers.convolutional import Convolution1D, Convolution2D
from keras.layers.pooling import MaxPooling1D, MaxPooling2D
from keras.layers import Merge, Dense, Dropout, Activation, Input, Flatten
from keras.optimizers import SGD

Using Theano backend.


In [8]:
# Based on the paper, there are filters of various sizes
filters = 200
epochs = 10

layer1_filter_sizes = [3,4,5]
layer1_convs = []

graph_in = Input(shape=(max_sentence_length, word_vector_dims))

for filter_size in layer1_filter_sizes:
    conv = Convolution1D(filters,
                         filter_size,
                         border_mode = 'valid',
                         activation='relu',
                         subsample_length=1)(graph_in)
    pool = MaxPooling1D(pool_length=2)(conv)
    flatten = Flatten()(pool)
    layer1_convs.append(flatten)

# Merge the conv
merged = Merge(mode='concat')(layer1_convs)
graph = Model(input=graph_in, output=merged)

final_model = Sequential()
final_model.add(graph)
# final_model.add(Dense(64))
# final_model.add(Activation('relu'))
# final_model.add(Dropout(0.25))
final_model.add(Dense(32))
final_model.add(Activation('relu'))
# final_model.add(Dropout(0.25))
final_model.add(Dense(16))
final_model.add(Activation('relu'))
final_model.add(Dropout(0.5))
final_model.add(Dense(2))
final_model.add(Activation('softmax'))

sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
final_model.compile(loss='binary_crossentropy',
                    optimizer='rmsprop',
                    metrics=['accuracy'])


## Step 4: TRAIN THIS

In [33]:
final_model.fit_generator(test_reviews_generator(),
                          len(training_reviews),
                          25,
                          validation_data=validation_reviews_generator(),
                          nb_val_samples=(len(validating_reviews)/4))

Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


<keras.callbacks.History at 0x7ff30366ce50>

In [34]:
final_model.save('models/cnn-googlenewsw2v.h5')

## Step 5: Custom Predict

Will implement for small bit of text prediction later.

Somehow the predictions are dead wrong here.

In [11]:
def test_predict(text):
    word_array = text_to_padded_list(text)
    word_vec_array = np.full(fill_value=0.0,
                             shape=(1, max_sentence_length, word_vector_dims),
                             dtype='float32')
    for i, word in enumerate(word_array):
        word_vec_array[0][i] = word_vector_for(word)
        if i + 1 >= max_sentence_length:
            word_vec_array[i:-1] = word_vector_for(pad_token)
            break

    prediction = final_model.predict(word_vec_array, batch_size=1)[0]
    if prediction[0] > prediction[1]:
        return ['movie review is positive', prediction]
    else:
        return ['movie review is negative', prediction]

In [12]:
test_predict("this is very good")

['movie review is negative', array([ 0.49774861,  0.50225139], dtype=float32)]

In [13]:
test_predict("gosh this is just bad")

['movie review is negative', array([ 0.49774837,  0.50225157], dtype=float32)]

In [14]:
test_predict("train to busan is one of the most value for money movie one can pay for")

['movie review is negative', array([ 0.49833244,  0.50166756], dtype=float32)]

In [15]:
test_predict("Blackhat is not only disappointing, its embarrassing")

['movie review is negative', array([ 0.49899375,  0.50100625], dtype=float32)]

In [16]:
test_predict('''
Suffers from inconsistencies, both technical and story wise. They change the shooting styles, cameras, fps, warmth/cold - for no apparent reasons at all. Feels like it's not clear what this movie "wants to be". The main character is supposed to be a "super-hacker" but doesn't do anything "super hack-y", just wanders around, shooting people, and nails the female protagonist. Doesn't have many hacking-scenes for a "hackers movie", has tons of boring gun-scenes instead, from some reason. The motivation of the villain was, not interesting. References many other "movie-cliches" (not in a good way). Severely lacks humor. The few jokes in it are really cheesy (yeah, it's not a comedy , but comic reliefs are important). Many of the audience members left the theater in the middle or before the end
''')

['movie review is positive', array([ 0.51813865,  0.48186132], dtype=float32)]

In [17]:
test_predict('explosive summer flick that will keep you on the couch for hours')

['movie review is negative', array([ 0.49716318,  0.50283682], dtype=float32)]

In [18]:
test_predict('why would anyone watch this?')

['movie review is negative', array([ 0.49702409,  0.50297588], dtype=float32)]

In [19]:
test_predict('Some people walked out of this one, it\'s just that crap')

['movie review is negative', array([ 0.49685174,  0.5031482 ], dtype=float32)]

In [20]:
test_predict('this is definitely the best flick from christopher nolan yet!')

['movie review is negative', array([ 0.49752399,  0.50247604], dtype=float32)]

In [21]:
test_predict('i dug my eyes out')

['movie review is negative', array([ 0.4974511 ,  0.50254893], dtype=float32)]

In [22]:
test_predict('this is the one you must watch this year')

['movie review is negative', array([ 0.49720857,  0.5027914 ], dtype=float32)]

In [23]:
test_predict('touching love story indeed')

['movie review is negative', array([ 0.49816152,  0.50183845], dtype=float32)]

In [24]:
test_predict('would love to lie on the grassfield and watch this with her again')

['movie review is negative', array([ 0.49668354,  0.50331652], dtype=float32)]

In [25]:
test_predict('I bet there are more productive things to do than watching this film') # negative

['movie review is negative', array([ 0.49722725,  0.50277275], dtype=float32)]

In [26]:
test_predict('I would have to dig my eyes out from the socket on this one') # negative

['movie review is negative', array([ 0.4966912 ,  0.50330877], dtype=float32)]

In [27]:
# Inception IMDB 10/10
test_predict('''
What is the most resilient parasite? An Idea! Yes, Nolan has created something with his unbelievably, incredibly and god- gifted mind which will blow the minds of the audience away. The world premiere of the movie, directed by Hollywood's most inventive dreamers, was shown in London and has already got top notch reviews worldwide and has scored maximum points! Now the question arises what the movie has that it deserve all this?

Dom Cobb(Di Caprio) is an extractor who is paid to invade the dreams of various business tycoons and steal their top secret ideas. Cobb robs forcefully the psyche with practiced skill, though he's increasingly haunted by the memory of his late wife, Mal (Marion Cotillard), who has a nasty habit of showing up in his subconscious and wreaking havoc on his missions. Cobb had been involved so much in his heist work that he had lost his love!

But then, as fate had decided, a wealthy business man Saito( Ken Watanabe) hands over the responsibility of dissolving the empire of his business rival Robert Fischer Jr.(Cillian Murphy). But this time his job was not to steal the idea but to plant a new one: 'Inception'

Then what happens is the classic heist movie tradition. To carry out the the task, Cobb's 'brainiac' specialists team up again with him, Arthur (Joseph Gordon-Levitt), his longtime organizer; Tom Hardy (Eames), a "forger" who can shapeshift at will; and Yusuf (Dileep Rao), a powerful sedative supplier. 

There is only one word to describe the cinematography, the set designs and the special effects, and that is Exceptional! You don't just watch the scenes happening, you feel them. The movie is a real thrill ride. The action scenes are well picturised and the music by Hans Zimmer is electronically haunting. Never, in the runtime of the movie, you will get a chance to move your eyes from the screen to any other object.

Leonardo, who is still popularly known for Jack Dawson played by him in Titanic, should be relieved as his role as Dom Cobb will be remembered forever. His performance may or may not fetch him an Oscar but it will be his finest performance till date. The supporting cast too did an extraordinary work. Christopher Nolan, ah! what a man he is. His work is nothing less than a masterpiece and he deserves all the awards in the 'Best Director' category. If "Inception" is a metaphysical puzzle, it's also a metaphorical one: It's hard not to draw connections between Cobb's dream-weaving and Nolan's film making, intended to seduce us, mess with our heads and leave an ever-lasting impression.

To conclude, I would just say before your life ends, do yourself a favor by experiencing this exceptionally lucid classic created by Nolan! ''')

['movie review is positive', array([ 0.52830791,  0.47169214], dtype=float32)]

In [28]:
# IMDB Suicide Squad 2/10
test_predict('''
I don't get the ratings here. This is a cut and dry poorly made movie and fans of the DC universe deserve better. I don't normally post my reviews here. But I have to share my take on this movie because it just wasn't good. I didn't even have to go into spoilers to show how terrible it is. Movie goers shouldn't mindlessly consume these films. Christopher Nolan set a high bar, but producers and studios need to step messing with auteurs and maybe we can get a quality DC movie:

There is nothing in Suicide Squad that shows any hope that an auteur filmmaker can do anything distinctive with the current cash cow of the Hollywood machine: the super hero movie. What Christopher Nolan once made his own has devolved into a predictable pastiche whose charms should be wearing thin on audiences. It doesn't help that the movie is also an example of how bad one of these films can be when it becomes watered down and designed to refrain from shaking up anything in the so-called DC Universe. Suicide Squad, a PG-13 film, was supposed to be DC's entry to rival Marvel's R-rated Deadpool. Even though Deadpool had its own problems as a self-aware action movie, it still had focus and a bravado that is nowhere to be found in Suicide Squad.

Suicide Squad follows a group of villains with super powers released from prison as part of a government plan to protect the world from terrorists or whatever sign-of-the-times fear currently plaguing society (Zika?). Starring Will Smith as the hit man Deadshot and Margot Robbie as the Joker's manic girlfriend Harley Quinn, alongside several other less familiar DC baddies, these guys are supposed to be complex people who have long fallen from grace and are supposed to rise above to find their humanity and gain the audience's sympathy. But writer-director David Ayer tries so hard to take a safe route, you can see the gears trying to manipulate audience emotion, revealing the inherit problems of these comic book adaptations straining to catch up with decades of printed storytelling.

You can't totally blame Ayer, who last gave moviegoers Fury, an incredibly strong and startling war movie featuring a better fleshed out motley crew of characters. The preciousness Hollywood has for its ongoing world building of interconnected comic book films creates such tight restrictions on storytelling that anything that might upset that world has no room to prosper. At one point, toward the end of Suicide Squad, one character asks another, "Shouldn't you be dead?" Of course not, this is the DC universe, and it's gotta be milked. That means no major players should be written off in one movie.

The result of these storytelling restraints is a soulless kind of filmmaking hampered by pussyfooting. It's like a syrupy glaze that drowns out any possibility to shine above what has become a predictable pattern of storytelling. Characters dole out uninspired lines that play superficially to feelings, like, "Dad, I know you do bad things, but I still love you." Then there are the clichés, like "fight fire with fire." Sometimes the script inadvertently deflates the tension by spelling things out. Someone over a radio says, "Use extreme caution," and someone in the action responds, "I don't like this." But in case you miss that, someone else says, "I don't like it either." A kid playing with his action figures can come up with better chatter to establish tension....''')

['movie review is positive', array([ 0.50675797,  0.49324206], dtype=float32)]

In [29]:
test_predict('As bad as a cheese topped with naan and sambal')

['movie review is positive', array([ 0.50013787,  0.49986213], dtype=float32)]

In [30]:
test_predict('Still a better love story than Twilight')

['movie review is negative', array([ 0.49879947,  0.50120056], dtype=float32)]

In [31]:
test_predict('''
Cruise is at peak starriness in Jack Reacher: Never Go Back, burning with charisma, purpose and old-school don't-mess-with-the-hero machismo.''')

['movie review is positive', array([ 0.50270253,  0.49729744], dtype=float32)]