# Sentiment Analysis Using Word Vectors and LSTM in Keras + Data Generator


In [1]:
# Notebook
%matplotlib inline
import matplotlib.pyplot as plt
import sys
import numpy as np
import text_tokenizer
from gensim.models import Word2Vec

# ==== CONFIGS ====

# The word vectors can be swapped for others, e.g. the pretrained GoogleNews vectors;
# word_vector_dims must then match the dimensionality of the new vectors
word_vector_bin_file = "word2vec/w2v-padded.bin"
word_vector_dims = 100

# In aclImdb, the longest review is 2470 words long.
# Due to memory constraints, reviews are truncated to 300 words here.
max_review_length = 300

# Can easily swap with other datasets if you want
positive_review_txts = "aclImdb/train/pos/*.txt"
negative_review_txts = "aclImdb/train/neg/*.txt"
positive_review_vals = "aclImdb/test/pos/*.txt"
negative_review_vals = "aclImdb/test/neg/*.txt"

# One-hot labels: index 0 = positive, index 1 = negative
positive_y = [1, 0]
negative_y = [0, 1]

Using gpu device 0: GeForce GTX 1060 6GB (CNMeM is disabled, cuDNN 5105)


## Step 1: Make Embedding Layer

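The training input format for the LSTM differs a little from that of the other classifiers tested in this project: each review is fed in as a fixed-length sequence of word indices, which an Embedding layer then maps to word vectors.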

### TODO:
1. Use Gensim Word2Vec to load the existing word embeddings.
2. Load the files, tokenize the words, then substitute each word with its `idx` from `w2v_vocab`.
3. Set the Keras Embedding layer to use `w2v_vocab` and `w2v_weights`.
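
The three steps above can be sketched on toy data. This is a minimal illustration rather than the notebook's pipeline: `toy_vocab`, `toy_weights`, and `encode` are hypothetical names, a plain `split()` stands in for `text_tokenizer.normalise_text`, and the final NumPy row lookup only mimics what a frozen Embedding layer computes.

```python
import numpy as np

# Hypothetical toy vocabulary and weights standing in for w2v_vocab / w2v_weights
toy_vocab = {"<pad>": 0, "good": 1, "bad": 2, "movie": 3}
toy_dims = 4
toy_weights = np.arange(len(toy_vocab) * toy_dims, dtype="float32").reshape(len(toy_vocab), toy_dims)

def encode(text, vocab, max_len):
    """Map words to vocabulary indices, then truncate or zero-pad to max_len."""
    ids = [vocab.get(word, 0) for word in text.lower().split()[:max_len]]
    padded = np.zeros(max_len, dtype="int32")
    padded[:len(ids)] = ids
    return padded

ids = encode("good movie unknownword", toy_vocab, max_len=5)
# Row lookup: the operation a frozen Embedding layer performs on each index
vectors = toy_weights[ids]
print(ids)            # [1 3 0 0 0]
print(vectors.shape)  # (5, 4)
```

Note that unknown words and padding both end up as index 0 in this sketch, as they do in the loading code of this notebook.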

In [2]:
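# Load the pretrained vectors (word2vec binary format)
w2v_model = Word2Vec.load_word2vec_format(word_vector_bin_file, binary=True)
# Map each word to its row index in the weight matrix
w2v_vocab = dict([(k, v.index) for k, v in w2v_model.vocab.items()])
# Embedding matrix of shape (vocab_size, word_vector_dims)
w2v_weights = w2v_model.syn0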

## Step 2: Load and Index the Reviews

In [3]:
import glob

# Highest review word count; it would set the convnet rows and is not used by the LSTM below
highest_review_word_count = 0

X_train = []
y_train = []
X_test  = []
y_test  = []

# just for notebook
file_read_count = 0
def tick_file_read_count():
    global file_read_count
    file_read_count += 1
    if file_read_count % 1000 == 0:
        sys.stdout.write("\r{0}".format(file_read_count))
        sys.stdout.flush()

# Load files, tokenize, and convert into IDs in one go
def load_reviews(pattern, label, X, y):
    for txt in glob.glob(pattern):
        with open(txt, 'r') as f:
            word_array = text_tokenizer.normalise_text(f.read()).split()
        # Words missing from the vocabulary map to index 0
        word_idx = np.array([w2v_vocab[word] if word in w2v_vocab else 0 for word in word_array[:max_review_length]])
        # Truncate or zero-pad every review to max_review_length
        x_np = np.zeros((max_review_length))
        x_np[:word_idx.shape[0]] = word_idx
        X.append(x_np)
        y.append(label)
        tick_file_read_count()

load_reviews(positive_review_txts, positive_y, X_train, y_train)
load_reviews(negative_review_txts, negative_y, X_train, y_train)

# For validation purposes
load_reviews(positive_review_vals, positive_y, X_test, y_test)
load_reviews(negative_review_vals, negative_y, X_test, y_test)
        
# numpy-fy them
X_train = np.array(X_train)
y_train = np.array(y_train)
X_test  = np.array(X_test)
y_test  = np.array(y_test)

50000

## Step 3: Keras LSTM

In [4]:
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Input
from keras.layers import Embedding, LSTM

Using Theano backend.


In [5]:
final_model = Sequential()
# Frozen Embedding layer initialised with the pretrained word2vec weights
final_model.add(Embedding(len(w2v_vocab),
                          word_vector_dims,
                          weights=[w2v_weights],
                          input_length=max_review_length,
                          trainable=False))
final_model.add(LSTM(word_vector_dims, activation='sigmoid', inner_activation='hard_sigmoid'))
final_model.add(Dropout(0.5))
# Two softmax outputs, matching the one-hot labels positive_y and negative_y
final_model.add(Dense(2))
final_model.add(Activation('softmax'))

final_model.compile(loss='binary_crossentropy',
                    optimizer='adam',
                    metrics=['accuracy'])
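
The model pairs a two-unit softmax with `binary_crossentropy`, where `categorical_crossentropy` is the more common choice. For one-hot labels over exactly two classes the two losses agree, which the NumPy sketch below checks by hand. The helper names are hypothetical, and the formulas are written out directly rather than taken from Keras; the sketch assumes binary cross-entropy is averaged over the two output units.

```python
import numpy as np

def categorical_ce(y_true, y_pred):
    """Categorical cross-entropy per sample: -sum(y * log(p))."""
    return -np.sum(y_true * np.log(y_pred), axis=-1)

def binary_ce(y_true, y_pred):
    """Binary cross-entropy per output unit, averaged over the units."""
    per_unit = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return np.mean(per_unit, axis=-1)

# One-hot labels in the notebook's layout: [1, 0] positive, [0, 1] negative
y_true = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
# Softmax-like predictions: each row sums to 1
y_pred = np.array([[0.9, 0.1], [0.3, 0.7], [0.4, 0.6]])

cce = categorical_ce(y_true, y_pred)
bce = binary_ce(y_true, y_pred)
print(np.allclose(cce, bce))  # True
```

With more than two classes, or with labels that are not one-hot, the two losses differ, so this equivalence should not be relied on beyond the two-class case.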


## Step 4: TRAIN THIS

In [6]:
final_model.fit(X_train,
                y_train,
                validation_data=(X_test, y_test),
                nb_epoch=25,
                batch_size=50)

Train on 25000 samples, validate on 25000 samples
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


<keras.callbacks.History at 0x7f49e6748850>
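
The `fit` call above holds every padded review in memory at once. Where that is a constraint, batches can be produced lazily by a Python generator instead. The sketch below is an assumption about how such a data generator might look, not code from this notebook: `batch_generator` is a hypothetical name and it runs on small stand-in arrays. Keras 1 models expose `fit_generator` for consuming generators of this kind, and it expects the generator to loop indefinitely.

```python
import numpy as np

def batch_generator(X, y, batch_size, shuffle=True, seed=0):
    """Yield (X_batch, y_batch) tuples forever, reshuffling on every pass."""
    rng = np.random.RandomState(seed)
    n = X.shape[0]
    while True:
        order = rng.permutation(n) if shuffle else np.arange(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            yield X[idx], y[idx]

# Stand-in data: 7 "reviews" of length 4 with one-hot labels
X_demo = np.arange(28).reshape(7, 4)
y_demo = np.array([[1, 0], [0, 1]] * 3 + [[1, 0]])

gen = batch_generator(X_demo, y_demo, batch_size=3, shuffle=False)
first_pass = [next(gen) for _ in range(3)]
print([xb.shape for xb, _ in first_pass])  # [(3, 4), (3, 4), (1, 4)]
```

This sketch still indexes arrays that are already in memory. To actually reduce memory use, the loop body would have to read and encode the review files for each batch rather than slice a preloaded array.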

In [7]:
final_model.save('models/lstm-w2v-imdb-cbow-100d.h5')

## Step 5: Custom Predict

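Predict the sentiment of a small bit of text by preprocessing it the same way as the training reviews.

In [8]:
def test_predict(text):
    # Tokenize, index, and zero-pad the text exactly as the training reviews were prepared
    word_array = text_tokenizer.normalise_text(text).split()
    word_idx = [w2v_vocab[word] if word in w2v_vocab else 0 for word in word_array[:max_review_length]]
    x = np.zeros((1, max_review_length))
    x[0, :len(word_idx)] = word_idx
    # prediction is [P(positive), P(negative)], matching positive_y and negative_y
    prediction = final_model.predict(x)[0]
    if prediction[0] > prediction[1]:
        return ['movie review is positive', prediction]
    else:
        return ['movie review is negative', prediction]

In [9]:
test_predict("this is very good")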

In [None]:
test_predict("gosh this is just bad")

In [None]:
test_predict("train to busan is one of the most value for money movie one can pay for")

In [None]:
test_predict("Blackhat is not only disappointing, its embarrassing")

In [None]:
test_predict('''
Suffers from inconsistencies, both technical and story wise. They change the shooting styles, cameras, fps, warmth/cold - for no apparent reasons at all. Feels like it's not clear what this movie "wants to be". The main character is supposed to be a "super-hacker" but doesn't do anything "super hack-y", just wanders around, shooting people, and nails the female protagonist. Doesn't have many hacking-scenes for a "hackers movie", has tons of boring gun-scenes instead, from some reason. The motivation of the villain was, not interesting. References many other "movie-cliches" (not in a good way). Severely lacks humor. The few jokes in it are really cheesy (yeah, it's not a comedy , but comic reliefs are important). Many of the audience members left the theater in the middle or before the end
''')

In [None]:
test_predict('explosive summer flick that will keep you on the couch for hours')

In [None]:
test_predict('why would anyone watch this?')

In [None]:
test_predict('Some people walked out of this one, it\'s just that crap')

In [None]:
test_predict('this is definitely the best flick from christopher nolan yet!')

In [None]:
test_predict('i dug my eyes out')

In [None]:
test_predict('this is the one you must watch this year')

In [None]:
test_predict('touching love story indeed')

In [None]:
test_predict('would love to lie on the grassfield and watch this with her again')

In [None]:
test_predict('I bet there are more productive things to do than watching this film') # negative

In [None]:
test_predict('I would have to dig my eyes out from the socket on this one') # negative

In [None]:
# Inception IMDB 10/10
test_predict('''
What is the most resilient parasite? An Idea! Yes, Nolan has created something with his unbelievably, incredibly and god- gifted mind which will blow the minds of the audience away. The world premiere of the movie, directed by Hollywood's most inventive dreamers, was shown in London and has already got top notch reviews worldwide and has scored maximum points! Now the question arises what the movie has that it deserve all this?

Dom Cobb(Di Caprio) is an extractor who is paid to invade the dreams of various business tycoons and steal their top secret ideas. Cobb robs forcefully the psyche with practiced skill, though he's increasingly haunted by the memory of his late wife, Mal (Marion Cotillard), who has a nasty habit of showing up in his subconscious and wreaking havoc on his missions. Cobb had been involved so much in his heist work that he had lost his love!

But then, as fate had decided, a wealthy business man Saito( Ken Watanabe) hands over the responsibility of dissolving the empire of his business rival Robert Fischer Jr.(Cillian Murphy). But this time his job was not to steal the idea but to plant a new one: 'Inception'

Then what happens is the classic heist movie tradition. To carry out the the task, Cobb's 'brainiac' specialists team up again with him, Arthur (Joseph Gordon-Levitt), his longtime organizer; Tom Hardy (Eames), a "forger" who can shapeshift at will; and Yusuf (Dileep Rao), a powerful sedative supplier. 

There is only one word to describe the cinematography, the set designs and the special effects, and that is Exceptional! You don't just watch the scenes happening, you feel them. The movie is a real thrill ride. The action scenes are well picturised and the music by Hans Zimmer is electronically haunting. Never, in the runtime of the movie, you will get a chance to move your eyes from the screen to any other object.

Leonardo, who is still popularly known for Jack Dawson played by him in Titanic, should be relieved as his role as Dom Cobb will be remembered forever. His performance may or may not fetch him an Oscar but it will be his finest performance till date. The supporting cast too did an extraordinary work. Christopher Nolan, ah! what a man he is. His work is nothing less than a masterpiece and he deserves all the awards in the 'Best Director' category. If "Inception" is a metaphysical puzzle, it's also a metaphorical one: It's hard not to draw connections between Cobb's dream-weaving and Nolan's film making, intended to seduce us, mess with our heads and leave an ever-lasting impression.

To conclude, I would just say before your life ends, do yourself a favor by experiencing this exceptionally lucid classic created by Nolan! ''')

In [None]:
# IMDB Suicide Squad 2/10
test_predict('''
I don't get the ratings here. This is a cut and dry poorly made movie and fans of the DC universe deserve better. I don't normally post my reviews here. But I have to share my take on this movie because it just wasn't good. I didn't even have to go into spoilers to show how terrible it is. Movie goers shouldn't mindlessly consume these films. Christopher Nolan set a high bar, but producers and studios need to step messing with auteurs and maybe we can get a quality DC movie:

There is nothing in Suicide Squad that shows any hope that an auteur filmmaker can do anything distinctive with the current cash cow of the Hollywood machine: the super hero movie. What Christopher Nolan once made his own has devolved into a predictable pastiche whose charms should be wearing thin on audiences. It doesn't help that the movie is also an example of how bad one of these films can be when it becomes watered down and designed to refrain from shaking up anything in the so-called DC Universe. Suicide Squad, a PG-13 film, was supposed to be DC's entry to rival Marvel's R-rated Deadpool. Even though Deadpool had its own problems as a self-aware action movie, it still had focus and a bravado that is nowhere to be found in Suicide Squad.

Suicide Squad follows a group of villains with super powers released from prison as part of a government plan to protect the world from terrorists or whatever sign-of-the-times fear currently plaguing society (Zika?). Starring Will Smith as the hit man Deadshot and Margot Robbie as the Joker's manic girlfriend Harley Quinn, alongside several other less familiar DC baddies, these guys are supposed to be complex people who have long fallen from grace and are supposed to rise above to find their humanity and gain the audience's sympathy. But writer-director David Ayer tries so hard to take a safe route, you can see the gears trying to manipulate audience emotion, revealing the inherit problems of these comic book adaptations straining to catch up with decades of printed storytelling.

You can't totally blame Ayer, who last gave moviegoers Fury, an incredibly strong and startling war movie featuring a better fleshed out motley crew of characters. The preciousness Hollywood has for its ongoing world building of interconnected comic book films creates such tight restrictions on storytelling that anything that might upset that world has no room to prosper. At one point, toward the end of Suicide Squad, one character asks another, "Shouldn't you be dead?" Of course not, this is the DC universe, and it's gotta be milked. That means no major players should be written off in one movie.

The result of these storytelling restraints is a soulless kind of filmmaking hampered by pussyfooting. It's like a syrupy glaze that drowns out any possibility to shine above what has become a predictable pattern of storytelling. Characters dole out uninspired lines that play superficially to feelings, like, "Dad, I know you do bad things, but I still love you." Then there are the clichés, like "fight fire with fire." Sometimes the script inadvertently deflates the tension by spelling things out. Someone over a radio says, "Use extreme caution," and someone in the action responds, "I don't like this." But in case you miss that, someone else says, "I don't like it either." A kid playing with his action figures can come up with better chatter to establish tension....''')

In [None]:
test_predict('As bad as a cheese topped with naan and sambal')

In [None]:
test_predict('Still a better love story than Twilight')

In [None]:
test_predict('''
Cruise is at peak starriness in Jack Reacher: Never Go Back, burning with charisma, purpose and old-school don't-mess-with-the-hero machismo.''')