# Predicting the Sentiment of Movie Reviews

There are two goals for this analysis. The first is to accurately predict the sentiment of movie reviews, and the second is to develop my model in such a way that its outputs can be analyzed with TensorBoard. This is the first time that I am using TensorBoard, so I want a task that is somewhat challenging but does not require a huge dataset. There are 25,000 training reviews and 25,000 testing reviews, so this model can train for multiple iterations overnight on my MacBook Pro. The data is provided by a Kaggle competition from 2015 (https://www.kaggle.com/c/word2vec-nlp-tutorial). Although the competition has concluded, it is still an excellent learning opportunity. The sections of this analysis are:
- Inspect the Data
- Clean and Format the Data
- Build and Train the Model
- Make the Predictions
- Summary

In [0]:
! pip install keras



In [3]:
import pandas as pd
import numpy as np
import tensorflow as tf
import nltk, re, time
from nltk.corpus import stopwords
from string import punctuation
from collections import defaultdict
from tqdm import tqdm
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from collections import namedtuple

Using TensorFlow backend.


In [0]:
# Load the data
train_data = pd.read_csv("./drive/My Drive/deep_learning/research/labeledTrainData.tsv", delimiter="\t")
test_data = pd.read_csv("./drive/My Drive/deep_learning/research/testData.tsv", delimiter="\t")

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Inspect the Data

In [5]:
train_data.head()

       id  sentiment                                             review
0  5814_8          1  With all this stuff going down at the moment w...
1  2381_9          1  \The Classic War of the Worlds\" by Timothy Hi...
2  7759_3          0  The film starts with a manager (Nicholas Bell)...
3  3630_4          0  It must be assumed that those who praised this...
4  9495_8          1  Superbly trashy and wondrously unpretentious 8...


In [6]:
test_data.head()

         id                                             review
0  12311_10  Naturally in a film who's main themes are of m...
1    8348_2  This movie is a disaster within a disaster fil...
2    5828_4  All in all, this is a movie for kids. We saw i...
3    7186_2  Afraid of the Dark left me with the impression...
4   12128_7  A very accurate depiction of small time mob li...


In [7]:
print(train_data.shape)
print(test_data.shape)

(25000, 3)
(25000, 2)


The reviews are rather long, so we won't use all of the text to train our model. Using the full text should make the predictions more accurate, but it would lengthen training beyond the time I want to give to this project.
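To make the trade-off concrete, here is a minimal sketch of cutting token sequences to a fixed length in plain Python. It assumes the default behaviour of Keras `pad_sequences` (padding and truncating at the front, so the end of each review is kept); `pad_or_truncate` is a hypothetical helper for illustration and is not used elsewhere in this notebook.

```python
def pad_or_truncate(seq, maxlen, pad_value=0):
    """Keep the last `maxlen` tokens and left-pad shorter sequences.

    Assumption: mirrors pad_sequences defaults (padding='pre', truncating='pre').
    """
    if maxlen <= 0:
        raise ValueError("maxlen must be positive")
    kept = list(seq)[-maxlen:]            # drop tokens from the front
    return [pad_value] * (maxlen - len(kept)) + kept


print(pad_or_truncate([5, 6, 7], 5))           # [0, 0, 5, 6, 7]
print(pad_or_truncate([1, 2, 3, 4, 5, 6], 4))  # [3, 4, 5, 6]
```

A shorter `maxlen` means fewer RNN time steps per review, which is where the training-time saving comes from; the cost is that the opening of long reviews is discarded.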

# Clean and Format the Data

In [0]:
def clean_text(text, remove_stopwords=True):
    '''Clean the text, with the option to remove stopwords'''

    # Convert words to lower case and split them
    words = text.lower().split()

    # Optionally, remove stop words (matched before punctuation is stripped)
    if remove_stopwords:
        stops = set(stopwords.words("english"))
        words = [w for w in words if w not in stops]

    text = " ".join(words)

    # Replace HTML line breaks, then every character except a-z and "."
    # (this already removes all punctuation other than periods)
    text = re.sub(r"<br />", " ", text)
    text = re.sub(r"[^a-z.]", " ", text)
    # Collapse the runs of spaces left behind by the substitutions
    text = re.sub(r" +", " ", text)

    return text

In [0]:
class My_tokenizer:
    '''Map whitespace-separated tokens to integer indices.'''

    def __init__(self, text=None):
        # Index 0 is never assigned to a token, so it can serve as padding.
        # Unknown tokens map to index 1; note that the first fitted token
        # is also assigned index 1.
        self.token_dict = {'unk': 1}
        self.current_dict_size = 0
        if text is not None:
            self.fit_dict_from_text(text)

    def _split(self, text, with_dot=False):
        # Optionally treat each period as its own token
        if with_dot:
            text = text.replace(".", " . ")
        return text.split()

    def fit_dict_from_text(self, text, with_dot=False):
        for token in self._split(text, with_dot):
            if token not in self.token_dict:
                self.token_dict[token] = self.current_dict_size + 1
                self.current_dict_size += 1

    def fit_dict_from_list(self, text_list, with_dot=False):
        for t in text_list:
            self.fit_dict_from_text(t, with_dot)

    def get_dict(self):
        return self.token_dict

    def gen(self, text, with_dot=False):
        # Tokens missing from the dictionary fall back to index 1 ('unk')
        return [self.token_dict.get(t, 1) for t in self._split(text, with_dot)]

    def gen_list(self, text_list, with_dot=False):
        return [self.gen(text, with_dot) for text in text_list]

Clean the training and testing reviews

In [10]:
print('test review = ',test_data)

test review =               id                                             review
0      12311_10  Naturally in a film who's main themes are of m...
1        8348_2  This movie is a disaster within a disaster fil...
2        5828_4  All in all, this is a movie for kids. We saw i...
3        7186_2  Afraid of the Dark left me with the impression...
4       12128_7  A very accurate depiction of small time mob li...
...         ...                                                ...
24995   2155_10  Sony Pictures Classics, I'm looking at you! So...
24996     59_10  I always felt that Ms. Merkerson had never got...
24997    2531_1  I was so disappointed in this movie. I am very...
24998    7772_8  From the opening sequence, filled with black a...
24999  11465_10  This is a great horror film for people who don...

[25000 rows x 2 columns]


In [0]:
my_test = ['I really love this movie. It is awesome',
           'I enjoy this movie so much. What a masterpiece. Everything is excellent, the actors are very talented',
           'This movie sucks. It is too bad. What a waste of time. It is terrible',
           'I will never watch it again. Why the actors are so bad',
           'I think this movie is fine']

In [12]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [0]:
train_clean = []
for review in train_data.review:
    train_clean.append(clean_text(review))

In [0]:
test_clean = []
for review in test_data.review:
    test_clean.append(clean_text(review))

In [0]:
my_test_clean = []
for review in my_test:
    my_test_clean.append(clean_text(review))

In [16]:
# Tokenize the reviews
all_reviews = train_clean + test_clean
tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_reviews)
print("Fitting is complete.")

train_seq = tokenizer.texts_to_sequences(train_clean)
print("train_seq is complete.")

test_seq = tokenizer.texts_to_sequences(test_clean)
print("test_seq is complete")

Fitting is complete.
train_seq is complete.
test_seq is complete


In [17]:
my_tokenizer = My_tokenizer()
my_tokenizer.fit_dict_from_list(all_reviews, with_dot=True)
print("Fitting is complete.")

Fitting is complete.


In [18]:
my_test_seq = tokenizer.texts_to_sequences(my_test_clean)
print("my_test_seq is complete")

my_test_seq is complete


In [19]:
dot_index = my_tokenizer.get_dict()["."]
print('dot_index = ',dot_index)

dot_index =  18


In [42]:
n_words = len(my_tokenizer.get_dict())
print('n_words = ',n_words)

n_words =  99427


In [20]:
my_test_clean

['really love movie. awesome',
 'enjoy movie much. masterpiece. everything excellent actors talented',
 'movie sucks. bad. waste time. terrible',
 'never watch again. actors bad',
 'think movie fine']

In [21]:
train_clean[:5]

['stuff going moment mj i ve started listening music watching odd documentary there watched wiz watched moonwalker again. maybe want get certain insight guy thought really cool eighties maybe make mind whether guilty innocent. moonwalker part biography part feature film remember going see cinema originally released. subtle messages mj s feeling towards press also obvious message drugs bad m kay. visually impressive course michael jackson unless remotely like mj anyway going hate find boring. may call mj egotist consenting making movie mj fans would say made fans true really nice him. the actual feature film bit finally starts minutes excluding smooth criminal sequence joe pesci convincing psychopathic powerful drug lord. wants mj dead bad beyond me. mj overheard plans nah joe pesci s character ranted wanted people know supplying drugs etc dunno maybe hates mj s music. lots cool things like mj turning car robot whole speed demon sequence. also director must patience saint came filming k

In [22]:
# my_test_seq = my_tokenizer.texts_to_sequences(my_test_clean)
# print("my_test_seq is complete")

my_test_seq = my_tokenizer.gen_list(my_test_clean, with_dot=True)
print("my_test_seq is complete")

my_test_pad = pad_sequences(my_test_seq, maxlen = 100)
print("my_test_pad is complete.")

my_x_test = my_test_pad

my_test_seq is complete
my_test_pad is complete.


In [24]:
# # Find the number of unique tokens
# word_index = tokenizer.word_index
# print("Words in index: %d" % len(word_index))
train_seq = my_tokenizer.gen_list(train_clean, with_dot=True)
print("train_seq is complete")

train_pad = pad_sequences(train_seq, maxlen = 100)
print("train_pad is complete.")

test_seq = my_tokenizer.gen_list(test_clean, with_dot=True)
print("test_seq is complete")

test_pad = pad_sequences(test_seq, maxlen = 100)
print("test_pad is complete.")


# print('train_seq = ',train_seq)
# print('test_seq = ',test_seq)

train_seq is complete
train_pad is complete.
test_seq is complete
test_pad is complete.


In [25]:
print('train_clean = ',train_clean[0])
print('train_seq[0] = ',train_seq[0])
print('train_pad[0] = ',train_pad[0])

train_clean =  stuff going moment mj i ve started listening music watching odd documentary there watched wiz watched moonwalker again. maybe want get certain insight guy thought really cool eighties maybe make mind whether guilty innocent. moonwalker part biography part feature film remember going see cinema originally released. subtle messages mj s feeling towards press also obvious message drugs bad m kay. visually impressive course michael jackson unless remotely like mj anyway going hate find boring. may call mj egotist consenting making movie mj fans would say made fans true really nice him. the actual feature film bit finally starts minutes excluding smooth criminal sequence joe pesci convincing psychopathic powerful drug lord. wants mj dead bad beyond me. mj overheard plans nah joe pesci s character ranted wanted people know supplying drugs etc dunno maybe hates mj s music. lots cool things like mj turning car robot whole speed demon sequence. also director must patience saint c

In [26]:
# Pad and truncate the reviews so that they all have the same length.
max_review_length = 200

train_pad = pad_sequences(train_seq, maxlen = max_review_length)
print("train_pad is complete.")

test_pad = pad_sequences(test_seq, maxlen = max_review_length)
print("test_pad is complete.")

train_pad is complete.
test_pad is complete.


In [27]:
my_test_pad = pad_sequences(my_test_seq, maxlen = max_review_length)
print("my_test_pad is complete.")

my_test_pad is complete.


In [28]:
print('train_pad = ',train_pad.shape)

train_pad =  (25000, 200)


In [0]:
# Creating the training and validation sets
x_train, x_valid, y_train, y_valid = train_test_split(train_pad, train_data.sentiment, test_size = 0.15, random_state = 2)
x_test = test_pad

In [30]:
train_pad[0]

array([ 48,  49,  50,  51,  52,  53,  54,  55,  18,  56,  57,  58,  59,
        60,  61,  62,  63,   4,  64,   2,  65,  66,  67,  18,  68,  69,
         4,  70,  71,  72,  73,   4,  74,  75,  76,  77,  74,  78,  26,
        79,  80,  18,  81,  82,  36,  37,  83,  84,  85,  86,  87,  88,
        89,  90,  91,  92,  93,  94,  95,  96,  97,  18,  98,   4,  99,
        53, 100, 101,  18,   4, 102, 103, 104,  91,  92,  45, 105, 106,
       107, 108, 109, 110,  52, 111, 112,  19, 113,   4,  45,   9,  18,
       114,  27, 115,  63,   4, 116, 117, 118, 119, 120, 121,  90,  18,
        49, 122, 123, 124, 125, 126, 127, 128,  53,  90, 129, 130,  65,
       131, 132, 133, 134, 135, 119, 136, 137, 138, 139, 140,  18, 141,
       142,  73, 108,  63,   4, 132, 143, 144, 145, 146, 108,  18, 147,
       148, 149,  18, 150, 151, 152,  51, 153,   4,  45, 154, 155,  73,
       156,  59,  60, 157, 132, 158, 108, 159, 160, 161,  32, 162, 163,
         5,   6, 164, 165,  18,  18,  18,  18, 166, 162, 109, 10

In [0]:
my_x_test = my_test_pad

In [32]:
# Inspect the shape of the data
print(x_train.shape)
print(x_valid.shape)
print(x_test.shape)

(21250, 200)
(3750, 200)
(25000, 200)


In [33]:
print('my_x_test = ',my_x_test,my_x_test.shape)

my_x_test =  [[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0

In [0]:
def gen_ones_sparse(input_arr):
  total_res = []
  for i in range(input_arr.shape[0]):
    word_list = []  
    for j in range(input_arr.shape[1]):
      if input_arr[i][j] == dot_index:
        word_list.append(np.ones(10))
      else:
        word_list.append(np.zeros(10))
    total_res.append(word_list)
  return np.array(total_res)
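
The nested loops in `gen_ones_sparse` can be slow on 25,000 reviews of 200 tokens each. As a sketch only, the same indicator array could be built with NumPy broadcasting. The name `gen_ones_sparse_vectorized` and the explicit `dot_index` and `n_prototypes` arguments are assumptions introduced here for illustration; the notebook itself uses the global `dot_index` and a hard-coded width of 10.

```python
import numpy as np


def gen_ones_sparse_vectorized(input_arr, dot_index, n_prototypes=10):
    """Return a float array of shape (rows, cols, n_prototypes).

    Each position holds all ones where the token equals dot_index
    (a sentence end) and all zeros everywhere else.
    """
    mask = (np.asarray(input_arr) == dot_index).astype(np.float64)
    # Add a trailing axis and repeat the 0/1 flag once per prototype
    return np.repeat(mask[:, :, None], n_prototypes, axis=2)


demo = np.array([[0, 18, 7],
                 [18, 18, 3]])
out = gen_ones_sparse_vectorized(demo, dot_index=18)
print(out.shape)     # (2, 3, 10)
print(out[:, :, 0])  # the 0/1 sentence-end flags, one per token
```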

In [35]:
my_test_sparse = gen_ones_sparse(my_test_pad)
print('my_test_sparse = ',my_test_sparse)
print('my_test_sparse shape = ',my_test_sparse.shape)

my_test_sparse =  [[[0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  ...
  [0. 0. 0. ... 0. 0. 0.]
  [1. 1. 1. ... 1. 1. 1.]
  [0. 0. 0. ... 0. 0. 0.]]

 [[0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  ...
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]]

 [[0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  ...
  [0. 0. 0. ... 0. 0. 0.]
  [1. 1. 1. ... 1. 1. 1.]
  [0. 0. 0. ... 0. 0. 0.]]

 [[0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  ...
  [1. 1. 1. ... 1. 1. 1.]
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]]

 [[0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  ...
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]]]
my_test_sparse shape =  (5, 200, 10)


In [36]:
x_train_sparse = gen_ones_sparse(x_train)
print('x_train_sparse shape = ',x_train_sparse.shape)

x_test_sparse = gen_ones_sparse(x_test)
print('x_test_sparse shape = ',x_test_sparse.shape)

x_valid_sparse = gen_ones_sparse(x_valid)
print('x_valid_sparse shape = ',x_valid_sparse.shape)

x_train_sparse shape =  (21250, 200, 10)
x_test_sparse shape =  (25000, 200, 10)
x_valid_sparse shape =  (3750, 200, 10)


In [37]:
print('x_train = ',x_train[1])
print('x_train_sparse = ',x_train_sparse.shape)

x_train =  [ 5390 32474   132  4177  6924   810  2494   722 22296    75   759 11609
    18 25523  8641    18   412 15792   179  8227    40   686   394  1254
    18    73  5261   759   513   340  2356  1986   802  1313    18  4509
   379   468   132  4291   215   206  1238  9022  5058  1944    18    26
    26   544  1705    73   306  1438  2110  2295   306    39  2110  2475
    18    17   122 55144    73  7462 53438 30951 43962    18  3172  3283
 14955  1370    73    45  8121 64904 20605   543 23544 20605  9898   214
  1155   543 23544    73    45  2381 20605 12073  8121    18    79    18
  1916    53   543    73  4061   219    18   245 20605  1149  2484  2905
 24292    63 39566 39567    29    73   215 14588  1433 24492   769 25806
    18  4087  2095  1611  5422 12133   405   758  7932   101    18 18105
  1284   506  1611  8523    18  9923   842   625  1272 20605  7094  3367
  5808   168 18019  7196    18 10099  4345 18697   363 15664   722  2044
   677 20605  9335   759   393 15885 487

# Build and Train the Model

In [0]:
# The default parameters of the model
# n_words = len(word_index)
embed_size = 300
batch_size = 250
lstm_size = 128
num_layers = 2
dropout = 0.5
learning_rate = 0.001
epochs = 100
multiple_fc = False
fc_units = 256

In [0]:
def get_batches(x, y, batch_size):
    '''Create the batches for the training and validation data'''
    n_batches = len(x)//batch_size
    x, y = x[:n_batches*batch_size], y[:n_batches*batch_size]
    for ii in range(0, len(x), batch_size):
        yield x[ii:ii+batch_size], y[ii:ii+batch_size]

In [0]:
def get_test_batches(x, batch_size):
    '''Create the batches for the testing data'''
    n_batches = len(x)//batch_size
    x = x[:n_batches*batch_size]
    for ii in range(0, len(x), batch_size):
        yield x[ii:ii+batch_size]

In [0]:
def lstm_cell(lstm_size, keep_prob):
    lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size)
    drop = tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=keep_prob)
    return drop

In [0]:
def build_rnn(n_words, embed_size, batch_size, lstm_size, num_layers, 
              dropout, learning_rate, multiple_fc, fc_units, n_prototypes = 10):
    '''Build the Recurrent Neural Network'''

    tf.reset_default_graph()

    # Declare placeholders we'll feed into the graph
    with tf.name_scope('inputs'):
        inputs = tf.placeholder(tf.int32, [batch_size, max_review_length], name='inputs')

    with tf.name_scope('sent_ind'):
        sent_ind = tf.placeholder(tf.float32, [batch_size, max_review_length, n_prototypes], name='sent_ind')

    with tf.name_scope('labels'):
        labels = tf.placeholder(tf.int32, [batch_size, 1], name='labels')

    keep_prob = tf.placeholder(tf.float32, name='keep_prob')

    # Create the embeddings
    with tf.name_scope("embeddings"):
        embedding = tf.Variable(tf.random_uniform((n_words, embed_size), -1, 1))
        embed = tf.nn.embedding_lookup(embedding, inputs)

    # Build the RNN layers
    with tf.name_scope("RNN_layers"):
        cell = tf.contrib.rnn.MultiRNNCell([lstm_cell(lstm_size, keep_prob) for _ in range(num_layers)])
    
    # Set the initial state
    with tf.name_scope("RNN_init_state"):
        initial_state = cell.zero_state(batch_size, tf.float32)

    # Run the data through the RNN layers
    with tf.name_scope("RNN_forward"):
        outputs, final_state = tf.nn.dynamic_rnn(cell, embed,
                                                 initial_state=initial_state)

    with tf.name_scope("prototype"):
        pos_prototypes = tf.Variable(tf.random_normal([1, lstm_size], stddev=0.03))
        for i in range(n_prototypes - 5):
            prototype = tf.Variable(tf.random_normal([1, lstm_size], stddev=0.03))
            pos_prototypes = tf.concat([pos_prototypes, prototype], 0)

        neg_prototypes = tf.Variable(tf.random_normal([1, lstm_size], stddev=0.03))
        for i in range(3):
            prototype = tf.Variable(tf.random_normal([1, lstm_size], stddev=0.03))
            neg_prototypes = tf.concat([neg_prototypes, prototype], 0)

        prototypes = tf.concat([pos_prototypes,neg_prototypes],0)

        # Broadcast the RNN outputs and the prototypes to a common shape of
        # [batch_size, max_review_length, n_prototypes, lstm_size]
        x1 = tf.expand_dims(outputs, 2)
        x1 = tf.broadcast_to(x1, [outputs.shape[0].value, outputs.shape[1].value, n_prototypes, lstm_size])

        x2 = tf.broadcast_to(prototypes, [outputs.shape[0].value, outputs.shape[1].value, n_prototypes, lstm_size])

        x3 = x1 - x2
        x4 = x3 * x3
        distances = tf.reduce_sum(x4 , axis=3)
        
 
        # Keep distances only at sentence-end positions (sent_ind is 1 there, 0 elsewhere)
        sparse_dist = distances * sent_ind
        # Per-prototype fill value that exceeds every kept distance
        max1 = tf.reduce_max(sparse_dist, axis=1)
        max2 = tf.expand_dims(max1, 1) + 1

        # Replace masked-out (zero) entries with the fill value so they never win the minimum
        max4 = tf.where(sparse_dist > 0, sparse_dist, max2 * tf.ones_like(sparse_dist))
        # Time step with the smallest raw distance per prototype (computed but not used below)
        max_index = tf.argmin(tf.where(distances > 0, distances, max2 * tf.ones_like(distances)), axis=1)
        # Smallest sentence-end distance to each prototype: shape [batch_size, n_prototypes]
        max_reduce_distances = tf.reduce_min(max4, axis=1)
        print('[db] max_reduce_distances = ',max_reduce_distances)
    
    # Create the fully connected layers
    with tf.name_scope("fully_connected"):
        
        # Initialize the weights and biases
        # weights = tf.truncated_normal_initializer(stddev=0.1)
        # weights = tf.get_variable("weights", initializer=tf.constant([1]))
        weights = tf.constant_initializer([1])
        biases = tf.zeros_initializer()
        
        # print('[db] outputs = ',outputs)
        # print('[db] outputs[:,-1] = ',outputs[:,-1])

        dense = tf.contrib.layers.fully_connected(max_reduce_distances,
                                                  num_outputs = fc_units,
                                                  activation_fn = tf.sigmoid,
                                                  weights_initializer = weights,
                                                  biases_initializer = biases)

        # dense = tf.contrib.layers.dropout(dense, keep_prob)
        
        # Depending on the iteration, use a second fully connected layer
        if multiple_fc == True:
            dense = tf.contrib.layers.fully_connected(dense,
                                                      num_outputs = fc_units,
                                                      activation_fn = tf.sigmoid,
                                                      weights_initializer = weights,
                                                      biases_initializer = biases)
            dense = tf.contrib.layers.dropout(dense, keep_prob)
    
    # Make the predictions
    with tf.name_scope('predictions'):
        predictions = tf.contrib.layers.fully_connected(dense, 
                                                        num_outputs = 1, 
                                                        activation_fn=tf.sigmoid,
                                                        weights_initializer = weights,
                                                        biases_initializer = biases)
        tf.summary.histogram('predictions', predictions)
    
    # Calculate the cost
    with tf.name_scope('cost'):
        cost = tf.losses.mean_squared_error(labels, predictions)
        tf.summary.scalar('cost', cost)
    
    # Train the model
    with tf.name_scope('train'):    
        optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)

    # Determine the accuracy
    with tf.name_scope("accuracy"):
        correct_pred = tf.equal(tf.cast(tf.round(predictions), tf.int32), labels)
        accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
        tf.summary.scalar('accuracy', accuracy)
    
    # Merge all of the summaries
    merged = tf.summary.merge_all()    

    # Export the nodes 
    export_nodes = ['inputs', 'sent_ind', 'labels', 'keep_prob', 'initial_state', 'final_state','accuracy',
                    'predictions', 'cost', 'optimizer', 'merged']
    Graph = namedtuple('Graph', export_nodes)
    local_dict = locals()
    graph = Graph(*[local_dict[each] for each in export_nodes])
    
    return graph
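
The prototype block inside `build_rnn` is easier to follow on a tiny example. The sketch below repeats the same arithmetic in NumPy instead of TensorFlow: squared Euclidean distance from every time step's output to every prototype, masking to sentence-end positions, then the minimum over time. The function name `min_sentence_end_distances` and the toy values are assumptions made for illustration.

```python
import numpy as np


def min_sentence_end_distances(outputs, prototypes, sent_ind):
    """outputs: (batch, time, hidden); prototypes: (n_prototypes, hidden);
    sent_ind: (batch, time, n_prototypes), 1.0 at sentence ends, else 0.0.
    Returns an array of shape (batch, n_prototypes)."""
    # Squared Euclidean distance between each output and each prototype
    diff = outputs[:, :, None, :] - prototypes[None, None, :, :]
    distances = np.sum(diff * diff, axis=3)
    # Zero out every position that is not a sentence end
    sparse_dist = distances * sent_ind
    # Fill value larger than any kept distance, one per (review, prototype)
    fill = sparse_dist.max(axis=1, keepdims=True) + 1
    masked = np.where(sparse_dist > 0, sparse_dist, fill)
    return masked.min(axis=1)


outputs = np.array([[[5.0, 5.0], [1.0, 0.0], [3.0, 3.0]]])  # 1 review, 3 steps
prototypes = np.array([[0.0, 0.0], [1.0, 2.0]])             # 2 prototypes
sent_ind = np.zeros((1, 3, 2))
sent_ind[0, 1, :] = 1.0                                     # step 1 ends a sentence
print(min_sentence_end_distances(outputs, prototypes, sent_ind))  # [[1. 4.]]
```

As in the TensorFlow version, a distance of exactly zero at a sentence end is indistinguishable from a masked position, and a review with no sentence end yields the fill value of 1 for every prototype.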

In [0]:
def train(model, epochs, log_string):
    '''Train the RNN'''

    saver = tf.train.Saver()
    
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())

        # Used to determine when to stop the training early
        valid_loss_summary = []
        stop_early = 0
        
        # Keep track of which batch iteration is being trained
        iteration = 0

        print()
        print("Training Model: {}".format(log_string))

        train_writer = tf.summary.FileWriter('./logs/3/train/{}'.format(log_string), sess.graph)
        valid_writer = tf.summary.FileWriter('./logs/3/valid/{}'.format(log_string))

        for e in range(epochs):
            state = sess.run(model.initial_state)
            
            # Record progress with each epoch
            train_loss = []
            train_acc = []
            val_acc = []
            val_loss = []

            with tqdm(total=len(x_train)) as pbar:
                for _, (x, y) in enumerate(get_batches(x_train, y_train, batch_size), 1):
                    # print('[db] x shape ',x,x.shape)
                    x_sparse = gen_ones_sparse(x)
                    # print('[db] x_sparse shape ',x_sparse,x_sparse.shape)
                    feed = {model.inputs: x,
                            model.sent_ind: x_sparse,
                            model.labels: y[:, None],
                            model.keep_prob: dropout,
                            model.initial_state: state}
                    summary, loss, acc, state, _ = sess.run([model.merged, 
                                                             model.cost, 
                                                             model.accuracy, 
                                                             model.final_state, 
                                                             model.optimizer], 
                                                            feed_dict=feed)                
                    
                    # Record the loss and accuracy of each training batch
                    train_loss.append(loss)
                    train_acc.append(acc)
                    
                    # Record the progress of training
                    train_writer.add_summary(summary, iteration)
                    
                    iteration += 1
                    pbar.update(batch_size)
            
            # Average the training loss and accuracy of each epoch
            avg_train_loss = np.mean(train_loss)
            avg_train_acc = np.mean(train_acc) 

            val_state = sess.run(model.initial_state)
            with tqdm(total=len(x_valid)) as pbar:
                for x, y in get_batches(x_valid, y_valid, batch_size):
                    # print('[db] x shape ',x,x.shape)
                    x_sparse = gen_ones_sparse(x)
                    # print('[db] x_sparse shape ',x_sparse,x_sparse.shape)
                    feed = {model.inputs: x,
                            model.sent_ind: x_sparse,
                            model.labels: y[:, None],
                            model.keep_prob: 1,
                            model.initial_state: val_state}
                    summary, batch_loss, batch_acc, val_state = sess.run([model.merged, 
                                                                          model.cost, 
                                                                          model.accuracy, 
                                                                          model.final_state], 
                                                                         feed_dict=feed)
                    
                    # Record the validation loss and accuracy of each epoch
                    val_loss.append(batch_loss)
                    val_acc.append(batch_acc)
                    pbar.update(batch_size)
            
            # Average the validation loss and accuracy of each epoch
            avg_valid_loss = np.mean(val_loss)    
            avg_valid_acc = np.mean(val_acc)
            valid_loss_summary.append(avg_valid_loss)
            
            # Record the validation data's progress
            valid_writer.add_summary(summary, iteration)

            # Print the progress of each epoch
            print("Epoch: {}/{}".format(e, epochs),
                  "Train Loss: {:.3f}".format(avg_train_loss),
                  "Train Acc: {:.3f}".format(avg_train_acc),
                  "Valid Loss: {:.3f}".format(avg_valid_loss),
                  "Valid Acc: {:.3f}".format(avg_valid_acc))

            # Stop training if the validation loss does not decrease after 3 epochs
            if avg_valid_loss > min(valid_loss_summary):
                print("No Improvement.")
                stop_early += 1
                if stop_early == 3:
                    break   
            
            # Reset stop_early if the validation loss finds a new low
            # Save a checkpoint of the model
            else:
                print("New Record!")
                stop_early = 0
                checkpoint = "./sentiment_{}.ckpt".format(log_string)
                saver.save(sess, checkpoint)
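
The early-stopping rule at the end of `train` can be checked in isolation. This is a minimal sketch that follows the same comparison as the loop above (an epoch counts as "no improvement" only when its loss is strictly greater than the best loss so far); `early_stopping_epoch` and the `patience` argument are hypothetical names introduced for the example.

```python
def early_stopping_epoch(valid_losses, patience=3):
    """Return the index of the epoch at which training would stop,
    or None if the patience limit is never reached."""
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(valid_losses):
        if loss > best:
            stale += 1              # "No Improvement."
            if stale == patience:
                return epoch
        else:
            best = loss             # "New Record!": a checkpoint would be saved here
            stale = 0
    return None


print(early_stopping_epoch([0.30, 0.25, 0.26, 0.27, 0.28, 0.20]))  # 4
print(early_stopping_epoch([0.30, 0.20, 0.10]))                    # None
```

In the first call the loss of 0.20 at epoch 5 is never reached, because three consecutive epochs without a new low end training at epoch 4.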

In [55]:
# Train the model with the desired tuning parameters
for lstm_size in [64]:
    for multiple_fc in [False]:
        for fc_units in [128]:
            log_string = 'ru={},fcl={},fcu={}'.format(lstm_size,
                                                      multiple_fc,
                                                      fc_units)
            model = build_rnn(n_words = n_words, 
                              embed_size = embed_size,
                              batch_size = batch_size,
                              lstm_size = lstm_size,
                              num_layers = num_layers,
                              dropout = dropout,
                              learning_rate = learning_rate,
                              multiple_fc = multiple_fc,
                              fc_units = fc_units)            
            train(model, epochs, log_string)
            pass

[db] max_reduce_distances =  Tensor("prototype/Min:0", shape=(250, 10), dtype=float32)
Instructions for updating:
Please use `layer.__call__` method instead.


  0%|          | 0/21250 [00:00<?, ?it/s]


Training Model: ru=64,fcl=False,fcu=128


  5%|▍         | 1000/21250 [00:03<01:18, 256.66it/s]


KeyboardInterrupt: ignored

# Make the Predictions (Need modifications)

In [0]:
print('x_test = ',x_test,x_test.shape)

In [0]:
print('my_x_test = ',my_x_test,my_x_test.shape)

In [0]:
def make_predictions(lstm_size, multiple_fc, fc_units, checkpoint):
    '''Predict the sentiment of the testing data'''
    
    # Record all of the predictions
    all_preds = []

    model = build_rnn(n_words = n_words, 
                      embed_size = embed_size,
                      batch_size = batch_size,
                      lstm_size = lstm_size,
                      num_layers = num_layers,
                      dropout = dropout,
                      learning_rate = learning_rate,
                      multiple_fc = multiple_fc,
                      fc_units = fc_units) 
    
    with tf.Session() as sess:
        saver = tf.train.Saver()
        # Load the model
        saver.restore(sess, checkpoint)
        test_state = sess.run(model.initial_state)
        for x in get_test_batches(x_test, batch_size):
            # The graph also needs the sentence-end indicator for each batch
            feed = {model.inputs: x,
                    model.sent_ind: gen_ones_sparse(x),
                    model.keep_prob: 1,
                    model.initial_state: test_state}
            predictions = sess.run(model.predictions, feed_dict=feed)
            for pred in predictions:
                all_preds.append(float(pred[0]))
                
    return all_preds

In [0]:
def make_predictions_my_test(lstm_size, multiple_fc, fc_units, checkpoint):
    '''Predict the sentiment of the custom sentences in my_x_test and return the final LSTM states'''
    
    # Record all of the predictions
    all_preds = []

    model = build_rnn(n_words = n_words, 
                      embed_size = embed_size,
                      batch_size = batch_size,
                      lstm_size = lstm_size,
                      num_layers = num_layers,
                      dropout = dropout,
                      learning_rate = learning_rate,
                      multiple_fc = multiple_fc,
                      fc_units = fc_units) 
    
    with tf.Session() as sess:
        saver = tf.train.Saver()
        # Load the model
        saver.restore(sess, checkpoint)
        test_state = sess.run(model.initial_state)
        # The graph expects a full batch, so copy the custom reviews into a
        # zero-filled array (250 x 200 here, assumed to match batch_size and
        # max_review_length). The unused rows stay as padding and the caller
        # discards their predictions.
        x = np.zeros(shape=(250,200))
        for i,e in enumerate(my_x_test):
            x[i] = e
        # keep_prob = 1 disables dropout at inference time
        feed = {model.inputs: x,
                model.keep_prob: 1,
                model.initial_state: test_state}
        predictions, _final_states = sess.run([model.predictions,model.final_state], feed_dict=feed)
        for pred in predictions:
            all_preds.append(float(pred))
                
    return all_preds, _final_states

I am going to compare the results of the best three models, chosen by their performance on the validation data, and then average their predictions, which should produce an even better set of predictions.
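
As a minimal sketch of the averaging step, assuming each model returns one probability per review and all models score the reviews in the same order, the element-wise mean can be computed as below. The function name `average_predictions` and the toy values are illustrative and are not part of the notebook.

```python
def average_predictions(*prediction_lists):
    """Return the element-wise mean of several equally long prediction lists."""
    lengths = {len(p) for p in prediction_lists}
    if len(lengths) != 1:
        raise ValueError("all prediction lists must have the same length")
    # zip(*...) groups the i-th prediction of every model together
    return [sum(values) / len(values) for values in zip(*prediction_lists)]


# Toy probabilities from three hypothetical models for the same three reviews
model_a = [0.9, 0.2, 0.6]
model_b = [0.8, 0.1, 0.4]
model_c = [0.7, 0.3, 0.5]

combined = average_predictions(model_a, model_b, model_c)
print(combined)
```

Averaging tends to help most when the models make different mistakes, which is one reason to mix architectures rather than average three near-identical runs.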

In [0]:
checkpoint1 = "./drive/My Drive/deep_learning/research/sentiment_ru=128,fcl=False,fcu=256.ckpt"
checkpoint2 = "./drive/My Drive/deep_learning/research/sentiment_ru=128,fcl=False,fcu=128.ckpt"
checkpoint3 = "./drive/My Drive/deep_learning/research/sentiment_ru=64,fcl=True,fcu=256.ckpt"

In [0]:
# Make predictions using the best 3 models
predictions1 = make_predictions(128, False, 256, checkpoint1)
predictions2 = make_predictions(128, False, 128, checkpoint2)
predictions3 = make_predictions(64, True, 256, checkpoint3)

In [0]:
print('predictions1 = ',predictions1)

In [0]:
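# Run only the third model on the custom sentences.
# make_predictions_my_test needs my_x_test, which is built in the next cell,
# and returns a (predictions, final_states) tuple.
# my_predictions1 = make_predictions_my_test(128, False, 256, checkpoint1)
# my_predictions2 = make_predictions_my_test(128, False, 128, checkpoint2)
my_predictions3 = make_predictions_my_test(64, True, 256, checkpoint3)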

In [0]:
my_test = ['I really love this movie. It is awesome',
           'I enjoy this movie so much. What a masterpiece',
           'Everything is excellent, the actors are very talented',
           'This movie sucks. It is too bad',
           'What a waste of time. It is terrible',
           'I will never watch it again. Why the actors are so bad',
           'I think this movie is fine']

# This shorter list overwrites the one above and is the one that gets scored
my_test = ['it is excellent',
           'what a masterpiece',
           'everything is great',
           'it sucks',
           'What a waste of time',
           'it is terrible',
           'it is ok']

my_test_clean = []
for review in my_test:
    my_test_clean.append(clean_text(review))

my_test_seq = tokenizer.texts_to_sequences(my_test_clean)
print("my_test_seq is complete")

my_test_pad = pad_sequences(my_test_seq, maxlen = max_review_length)
print("my_test_pad is complete.")

my_x_test = my_test_pad

(my_predictions3, final_states) = make_predictions_my_test(64, True, 256, checkpoint3)
print('my_predictions3 = ',my_predictions3[:len(my_test)])
# Pair each probability p with its complement so it can be plotted in 2D
dd_preds = []
for p in my_predictions3[:len(my_test)]:
    dd_preds.append([p, 1 - p])

(c,h) = final_states[1]
# print('final_states = ',final_states[1])
print('c = ',c.shape)
print('h = ',h.shape)
# print('dd_preds = ',dd_preds)
dd_preds = np.array(dd_preds)

In [0]:
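# Keep the final hidden state of each custom sentence (the other rows are padding)
res = h[:len(my_test)]

from sklearn.manifold import TSNE
# Recent scikit-learn releases require perplexity < number of samples,
# so cap it for this tiny set of sentences
tsne = TSNE(n_components=2, random_state=0, perplexity=min(30, len(res) - 1))

X_2d = tsne.fit_transform(res)

print('X_2d = ',X_2d)
print('X_2d[:,0] = ',X_2d[:,0])
print('X_2d[:,1] = ',X_2d[:,1])

import matplotlib.pyplot as plt

plt.scatter(X_2d[:,0], X_2d[:,1])

# Label each point with the sentence it represents
for i, txt in enumerate(my_test):
    plt.annotate(txt, (X_2d[i, 0], X_2d[i, 1]))

plt.show()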



In [0]:
plt.scatter(dd_preds[:,0], dd_preds[:,1])


for i, txt in enumerate(my_test):
    plt.annotate(txt, (dd_preds[:,0][i], dd_preds[:,1][i]))

plt.show()

In [0]:
"""
==========================
tSNE to visualize digits
==========================

Here we use :class:`sklearn.manifold.TSNE` to visualize the digits
dataset. Indeed, the digits are vectors in an 8*8 = 64 dimensional space.
We want to project them in 2D for visualization. tSNE is often a good
solution, as it groups and separates data points based on their local
relationship.

"""

############################################################
# Load the digits data
from sklearn import datasets
digits = datasets.load_digits()
# Take the first 500 data points: it's hard to see 1500 points
X = digits.data[:500]
y = digits.target[:500]

############################################################
# Fit and transform with a TSNE
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, random_state=0)

############################################################
# Project the data in 2D
X_2d = tsne.fit_transform(X)

############################################################
# Visualize the data
target_ids = range(len(digits.target_names))

from matplotlib import pyplot as plt
plt.figure(figsize=(6, 5))
colors = 'r', 'g', 'b', 'c', 'm', 'y', 'k', 'w', 'orange', 'purple'
for i, c, label in zip(target_ids, colors, digits.target_names):
    plt.scatter(X_2d[y == i, 0], X_2d[y == i, 1], c=c, label=label)
plt.legend()
plt.show()



In [0]:
# Average the best three predictions
predictions_combined = (pd.DataFrame(predictions1) + pd.DataFrame(predictions2) + pd.DataFrame(predictions3))/3

In [0]:
def write_submission(predictions, string):
    '''write the predictions to a csv file'''
    # test_data is the DataFrame loaded from testData.tsv at the top of the notebook
    submission = pd.DataFrame(data={"id":test_data["id"], "sentiment":predictions})
    # quoting=3 is csv.QUOTE_NONE
    submission.to_csv("submission_{}.csv".format(string), index=False, quoting=3)

In [0]:
write_submission(predictions1, "ru=128,fcl=False,fcu=256") 
write_submission(predictions2, "ru=128,fcl=False,fcu=128") 
write_submission(predictions3, "ru=64,fcl=True,fcu=256") 
write_submission(predictions_combined.iloc[:,0], "combined")

The results of the predictions are as follows (Kaggle used area under the ROC curve to evaluate submissions):
- Predictions1: 0.919
- Predictions2: 0.914
- Predictions3: 0.916
- Combined Predictions: 0.935
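
For reference, area under the ROC curve equals the probability that a randomly chosen positive review receives a higher score than a randomly chosen negative one, with ties counted as half. A minimal pure-Python sketch of that definition follows. The function name `roc_auc` and the toy labels and scores are illustrative, and this is not Kaggle's scoring code.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Toy example: one of the four (positive, negative) pairs is ranked wrongly
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
auc = roc_auc(labels, scores)
print(auc)
```

The pairwise loop is quadratic, so for the full 25,000 reviews a library routine such as `sklearn.metrics.roc_auc_score(y_true, y_score)` is the more practical choice. It needs the true labels, which are only available for the training and validation splits.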

# Summary

I am rather pleased with how this analysis has finished. I am now much more confident in using TensorBoard to improve the design of a model, and I have achieved rather good results. The submission with the combined predictions ranks 206 out of 578, in the top 35.6%. This result could have been improved by using a larger model, using pretrained vectors (such as GloVe), and using an ensemble of more predictions. Although it would be nice to carry out these efforts and improve my results, I feel that would not be the best use of my time. There are more complicated projects that I would like to work on now, rather than training this model multiple times for a competition that has already concluded.