# RNN for Sentiment Analysis

Adapted from http://nbviewer.jupyter.org/github/rasbt/pattern_classification/blob/master/machine_learning/scikit-learn/outofcore_modelpersistence.ipynb

<br>
<br>

## The IMDb Movie Review Dataset

In this section, we will train a simple logistic regression model to classify movie reviews from the 50k IMDb review dataset collected by Maas et al.

> AL Maas, RE Daly, PT Pham, D Huang, AY Ng, and C Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics.

[Source: http://ai.stanford.edu/~amaas/data/sentiment/]

The dataset consists of 50,000 movie reviews from the original "train" and "test" subdirectories. The class labels are binary (1 = positive, 0 = negative), and the dataset contains 25,000 positive and 25,000 negative reviews.
For simplicity, I assembled the reviews in a single CSV file.


In [None]:
import pandas as pd
# if you want to download the original file:
#df = pd.read_csv('https://raw.githubusercontent.com/rasbt/pattern_classification/master/data/50k_imdb_movie_reviews.csv')
# otherwise load local file
df = pd.read_csv('shuffled_movie_data.csv')
df.tail()

Let us shuffle the rows of the dataset so that the class labels appear in random order.

In [None]:
import numpy as np
## uncomment these lines if you have downloaded the original file:
#np.random.seed(0)
#df = df.reindex(np.random.permutation(df.index))
#df[['review', 'sentiment']].to_csv('shuffled_movie_data.csv', index=False)

<br>
<br>

## Preprocessing Text Data

Now, let us define a simple `tokenizer` that splits the text into individual word tokens. Furthermore, we will use some simple regular expressions to remove HTML markup and all non-letter characters except "emoticons," convert the text to lower case, remove stopwords, and apply the Porter stemming algorithm to reduce the words to their root form.

In [None]:
import numpy as np
from nltk.stem.porter import PorterStemmer
import re
from nltk.corpus import stopwords

stop = stopwords.words('english')
porter = PorterStemmer()

def tokenizer(text):
    text = re.sub(r'<[^>]*>', '', text)  # strip HTML markup
    # collect emoticons such as :) ;-( =D before non-word characters are removed
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    text = [w for w in text.split() if w not in stop]  # drop stop words
    tokenized = [porter.stem(w) for w in text]  # Porter stemming
    return tokenized

Let's give it a try:

In [None]:
tokenizer('This :) is a <a> test! :-)</br>')
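
To see what the regular-expression steps do on their own, here is a minimal sketch that leaves out the stop-word removal and stemming, so it runs without NLTK. The function name `regex_clean` is hypothetical and not part of the original notebook.

```python
import re

def regex_clean(text):
    # Hypothetical helper: only the regex steps of `tokenizer`, no NLTK.
    text = re.sub(r'<[^>]*>', '', text)  # strip HTML tags
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    # replace runs of non-word characters with a space, then re-append emoticons
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    return text.split()

tokens = regex_clean('This :) is a <a> test! :-)</br>')
print(tokens)  # ['this', 'is', 'a', 'test', ':)', ':)']
```

The emoticons are moved to the end of the token list and lose their "nose" character (`:-)` becomes `:)`).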

## Learning (SciKit)

First, we define a generator that returns the document body and the corresponding class label:

In [None]:
def stream_docs(path):
    with open(path, 'r') as csv:
        next(csv) # skip header
        for line in csv:
            text, label = line[:-3], int(line[-2])
            yield text, label
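
The slicing in `stream_docs` relies on every row ending in a comma, a single-digit label, and a newline. A small sketch with a made-up row (the review text here is invented for illustration) shows which characters each slice picks out:

```python
# Made-up CSV row in the same layout: review text, comma, label, newline
line = '"A surprisingly good movie.",1\n'

# line[-1] is the newline, line[-2] the label, line[-3] the separating comma
text, label = line[:-3], int(line[-2])
print(repr(text))   # the review, still wrapped in its CSV quotes
print(label)
```

Because the surrounding quotes are left in place, this shortcut only works for a file with exactly this layout; the `csv` module would be the more general choice.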

To confirm that the `stream_docs` function fetches the documents as intended, let us execute the following code snippet before we implement the `get_minibatch` function:

In [None]:
next(stream_docs(path='shuffled_movie_data.csv'))

Having confirmed that our `stream_docs` function works, we will now implement a `get_minibatch` function to fetch a specified number (`size`) of documents:

In [None]:
def get_minibatch(doc_stream, size):
    docs, y = [], []
    for _ in range(size):
        text, label = next(doc_stream)
        docs.append(text)
        y.append(label)
    return docs, y
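
Note that `next(doc_stream)` raises `StopIteration` once the file is exhausted, for example if more documents are requested in total than the file contains. A hedged variant, here called `get_minibatch_safe` (a hypothetical name, not from the original notebook), returns whatever documents are left instead:

```python
def get_minibatch_safe(doc_stream, size):
    # Hypothetical variant of get_minibatch: stops quietly at end of stream.
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        pass  # stream exhausted; return the partial batch
    return docs, y

# Toy stream of (text, label) pairs standing in for stream_docs(...)
toy_stream = iter([('good', 1), ('bad', 0), ('fine', 1)])
first = get_minibatch_safe(toy_stream, size=2)
second = get_minibatch_safe(toy_stream, size=2)
print(first)
print(second)
```

The second call returns a batch with a single document rather than failing, so the caller can check for a short or empty batch.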

Next, we will make use of the "hashing trick" through scikit-learn's [HashingVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html) to create a bag-of-words model of our documents. Details of the bag-of-words model for document classification can be found in [Naive Bayes and Text Classification I - Introduction and Theory](http://arxiv.org/abs/1410.5329).

In [None]:
from sklearn.feature_extraction.text import HashingVectorizer
vect = HashingVectorizer(decode_error='ignore', 
                         n_features=2**21,
                         preprocessor=None, 
                         tokenizer=tokenizer)
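
The idea behind the hashing trick can be sketched in a few lines of plain Python: each token is hashed to a column index, so no vocabulary has to be stored or fitted. This is only an illustration. It uses MD5 from `hashlib` and a tiny, made-up `n_features`, whereas `HashingVectorizer` uses a signed MurmurHash3 and also normalizes the resulting vectors by default.

```python
import hashlib
from collections import Counter

def hashed_bag_of_words(tokens, n_features=16):
    # Illustrative only: map each token to a bucket in [0, n_features)
    counts = Counter()
    for tok in tokens:
        digest = hashlib.md5(tok.encode('utf-8')).hexdigest()
        counts[int(digest, 16) % n_features] += 1
    return dict(counts)

doc = ['great', 'movie', 'great', 'cast']
vec = hashed_bag_of_words(doc)
print(vec)  # bucket index -> token count
```

Different tokens can collide in the same bucket; a large `n_features` such as `2**21` keeps that unlikely.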

Using the `SGDClassifier` from scikit-learn, we can instantiate a logistic regression classifier that learns from the documents incrementally using stochastic gradient descent.

In [None]:
from sklearn.linear_model import SGDClassifier
clf = SGDClassifier(loss='log', random_state=1, n_iter=1)
doc_stream = stream_docs(path='shuffled_movie_data.csv')

#import pyprind
#pbar = pyprind.ProgBar(45)

classes = np.array([0, 1])
for _ in range(45):
    X_train, y_train = get_minibatch(doc_stream, size=1000)
    X_train = vect.transform(X_train)
    clf.partial_fit(X_train, y_train, classes=classes)
    #pbar.update()
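
What `partial_fit` does with each minibatch can be pictured as stochastic gradient descent steps on the logistic loss. The following is a minimal pure-Python sketch under simplifying assumptions (dense features, a fixed learning rate `eta`, no regularization); the names `sgd_step` and `eta` are illustrative, and this is not scikit-learn's implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_step(w, b, X_batch, y_batch, eta=0.1):
    # One pass over a minibatch, updating the weights after every sample
    for x, y in zip(X_batch, y_batch):
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
        err = p - y  # gradient of the log loss w.r.t. the net input
        w = [wi - eta * err * xi for wi, xi in zip(w, x)]
        b -= eta * err
    return w, b

# Toy data: the label equals the first feature
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
y = [1, 0, 1, 0]

w, b = [0.0, 0.0], 0.0
for _ in range(200):  # feed the same minibatch repeatedly
    w, b = sgd_step(w, b, X, y)

preds = [int(sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5) for x in X]
print(preds)
```

Because the weights are carried over from call to call, the model can be trained on data that never fits into memory at once.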

Depending on your machine, it will take about 2-3 minutes to stream the documents and learn the weights for the logistic regression model to classify "new" movie reviews. Executing the preceding code, we used the first 45,000 movie reviews to train the classifier, which means that we have 5,000 reviews left for testing:

In [None]:
X_test, y_test = get_minibatch(doc_stream, size=5000)
X_test = vect.transform(X_test)
print('Accuracy: %.3f' % clf.score(X_test, y_test))

I think that the predictive performance, an accuracy of ~87%, is quite "reasonable" given that we "only" used the default parameters and didn't do any hyperparameter optimization. 

After we estimated the model performance, let us use those last 5,000 test samples to update our model.

In [None]:
clf = clf.partial_fit(X_test, y_test)

## RNN

## Preprocessing

In [1]:
import pandas as pd
import numpy as np
df = pd.read_csv('shuffled_movie_data.csv')
df.tail()

| Unnamed: 0 | review | sentiment |
|---|---|---|
| 49995 | OK, lets start with the best. the building. al... | 0 |
| 49996 | The British 'heritage film' industry is out of... | 0 |
| 49997 | I don't even know where to begin on this one. ... | 0 |
| 49998 | Richard Tyler is a little boy who is scared of... | 0 |
| 49999 | I waited long to watch this movie. Also becaus... | 1 |


In [2]:
X = df.review
y = df.sentiment

In [3]:
import tensorflow as tf
EMBEDDING_DIMENSION = 50  # dimension of the embeddings
MAX_REVIEW_LENGTH = 200  # number of tokens kept per review

In [4]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

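MAX_NB_WORDS = 5000  # intended vocabulary size (most frequent words)
# NOTE: num_words is set to MAX_REVIEW_LENGTH (200) here, not MAX_NB_WORDS, so only
# the most frequent words with an index below 200 are kept in the sequences.
tokenizer = Tokenizer(num_words=MAX_REVIEW_LENGTH,filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
               lower=True,split=" ")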

Using TensorFlow backend.


In [5]:
tokenizer.fit_on_texts(X)
sequences = tokenizer.texts_to_sequences(X)
word_index = tokenizer.word_index  # vocabulary of the dataset
N_WORDS = len(word_index)  # number of unique words in the dataset
print('%s unique words.' % N_WORDS)

124252 unique words.


In [6]:
reviews_vectors = pad_sequences(sequences, maxlen=MAX_REVIEW_LENGTH)
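
With its default settings, `pad_sequences` pads short sequences with zeros at the front and truncates long ones from the front, keeping the last `maxlen` tokens. The pure-Python sketch below mimics that default behaviour for illustration; `pad_pre` is a hypothetical helper, not part of Keras.

```python
def pad_pre(seq, maxlen, value=0):
    # Mimics the 'pre' padding and 'pre' truncating defaults
    seq = seq[-maxlen:]                         # keep the last maxlen items
    return [value] * (maxlen - len(seq)) + seq  # left-pad with zeros

short = pad_pre([5, 8, 2], maxlen=5)
long_ = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short)  # [0, 0, 5, 8, 2]
print(long_)  # [3, 4, 5, 6, 7]
```

This is why the vectors of short reviews start with a run of zeros.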

In [7]:
reviews_vectors[111]

array([ 75,   3,   3, 194,  55,  18,  60,   6, 146,   4,   3,  16, 115,
        10,   1, 108, 136,   3,  36,   1,  39,   5,  82,   9,   2,  44,
        62,  34,   8,   9,   1,  35,   8,   9, 159,   6,   5,  87,  18,
        28, 149,  62,  84, 115,  10,  21,  52,  28,   1,  67,   2,   5,
        20,   9,  51, 119,  53,  30,  33,   5,  26,  84,   4,   1,  18,
        14, 198,  44,  62,  35,  94,   1,   3,  16,   1,   2,  33,   1,
         5,   1,   1, 173,   4,   1,  19,  60,   6,  62,   5,  15,   1,
        84,   9,  39,   4,  57,  51,  16,  57,  51,   8,   9,   8,   3,
        51,   2,  79,  16,  51,  15,   1, 164,  35, 184,   5,  25,  53,
        20,  24,  38,   1,   5,  77,   1,   4,   1,   8,   1, 127,   3,
       169,   4,   1, 168,  12,   1,  90,   1,  28,  66,  16,  35,  13,
        24,   5,  16,   5,  25,  87,  24,   1,   2,  24,   1,  19, 100,
       105,   4,   2, 113,   2,  91,   1,   2,   1,  36,   5, 103, 104,
        16,  54, 111,  12,   5,  26, 137,  37, 104,  20, 136,  3

In [8]:
word_index

{'the': 1,
 'and': 2,
 'a': 3,
 'of': 4,
 'to': 5,
 'is': 6,
 'br': 7,
 'in': 8,
 'it': 9,
 'i': 10,
 'this': 11,
 'that': 12,
 'was': 13,
 'as': 14,
 'for': 15,
 'with': 16,
 'movie': 17,
 'but': 18,
 'film': 19,
 'on': 20,
 'not': 21,
 'you': 22,
 'are': 23,
 'his': 24,
 'have': 25,
 'be': 26,
 'one': 27,
 'he': 28,
 'all': 29,
 'at': 30,
 'by': 31,
 'an': 32,
 'they': 33,
 'so': 34,
 'who': 35,
 'from': 36,
 'like': 37,
 'or': 38,
 'just': 39,
 'her': 40,
 'out': 41,
 'about': 42,
 'if': 43,
 "it's": 44,
 'has': 45,
 'there': 46,
 'some': 47,
 'what': 48,
 'good': 49,
 'when': 50,
 'more': 51,
 'very': 52,
 'up': 53,
 'no': 54,
 'time': 55,
 'my': 56,
 'even': 57,
 'would': 58,
 'she': 59,
 'which': 60,
 'only': 61,
 'really': 62,
 'see': 63,
 'story': 64,
 'their': 65,
 'had': 66,
 'can': 67,
 'me': 68,
 'well': 69,
 'were': 70,
 'than': 71,
 'much': 72,
 'we': 73,
 'bad': 74,
 'been': 75,
 'get': 76,
 'do': 77,
 'great': 78,
 'other': 79,
 'will': 80,
 'also': 81,
 'into': 82,
 'p

## Splitting the Dataset

In [9]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(reviews_vectors, y, test_size=0.2)
print('X : ',X.shape)
print('y : ',y.shape)

print('X_train: ',X_train.shape)
print('y_train: ',y_train.shape)
print('X_test: ',X_test.shape)
print('y_test: ',y_test.shape)

X :  (50000,)
y :  (50000,)
X_train:  (40000, 200)
y_train:  (40000,)
X_test:  (10000, 200)
y_test:  (10000,)


In [10]:
# step = 2
# batch_size = 5
# offset = (step * batch_size) % (y_train.shape[0] - batch_size)
# x = X_train[offset:(offset + batch_size),:]
# y = y_train[offset:(offset + batch_size)]
        
# x.shape, x

In [11]:
X_test[1]

array([  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0, 125,  86,  12,  11,   6,
         3,  17,   6,  12,   9,  13,  20,  21,  19,   1,  64,   6,   2,
         1,   6,  15,  12,  50,   1,  23,  20,   3,   8, 117,   5,  65,
         8,   3,   1,   6,  82,  32,  31,  65,   8,   8,   3,   4,  60,
         8,   1, 168,  20, 157,   1,   3, 122,  33,  98,  61,   5,   8,
        14,  15,   1, 111,  11,  17,  22,   5,  12,   1,   4,   6,  41,
         4,  24,  50,   1,  12,  24,   6,  21,  61,  12,  18,   1,  53,
        16,   3, 128,   5,  25,   2,   5,  65,   4,  54,  27,   

## Loading Pre-trained Embeddings (GloVe)

In [12]:
glove_file = 'glove.twitter.27B.' + str(EMBEDDING_DIMENSION) + 'd.txt'
emb_dict = {}
# each line holds a word followed by its embedding values
with open(glove_file, encoding='utf-8') as glove:
    for line in glove:
        values = line.split()
        word = values[0]
        vector = np.array(values[1:], dtype=np.float32)
        # keep only well-formed lines with the expected dimension
        if vector.shape[0] == EMBEDDING_DIMENSION:
            emb_dict[word] = vector
print('GloVe vocabulary size: ', len(emb_dict))


GloVe vocabulary size:  1193513


In [13]:
# stack the GloVe vectors into one matrix; rows follow the order of the GloVe file
embeddings = np.array(list(emb_dict.values()))
embeddings.shape

(1193513, 50)
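
One thing to watch for: the rows of `embeddings` follow the order of the GloVe file, whereas the integer ids produced by the Keras `Tokenizer` follow `word_index`. For a lookup by token id to return the vector of the same word, the matrix would need one row per tokenizer index. Below is a minimal sketch of that alignment, using tiny made-up stand-ins for `word_index` and `emb_dict`; the name `build_embedding_matrix` is hypothetical.

```python
import numpy as np

def build_embedding_matrix(word_index, emb_dict, num_words, dim):
    # Row i holds the vector of the word whose tokenizer id is i;
    # row 0 (padding) and words missing from emb_dict stay all-zero.
    matrix = np.zeros((num_words, dim), dtype=np.float32)
    for word, idx in word_index.items():
        if idx < num_words and word in emb_dict:
            matrix[idx] = emb_dict[word]
    return matrix

# Toy stand-ins for the real word_index and GloVe dictionary
toy_word_index = {'the': 1, 'movie': 2, 'zzzz': 3}
toy_emb_dict = {'movie': np.array([0.5, -1.0], dtype=np.float32),
                'the': np.array([0.1, 0.2], dtype=np.float32)}

toy_matrix = build_embedding_matrix(toy_word_index, toy_emb_dict, num_words=4, dim=2)
print(toy_matrix)
```

A matrix built this way is also much smaller than the full GloVe table, since it only needs as many rows as there are token ids in use.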

## Building the Graph

In [14]:
tf.reset_default_graph()
batchSize = 1000
lstmUnits = 64
numClasses = 1
learning_rate = 0.01
num_layers = 2


In [15]:
# graph = tf.Graph()
# with graph.as_default():
labels = tf.placeholder(tf.int32, [batchSize, numClasses])
# integer word ids for each review
inputs = tf.placeholder(tf.int32, [batchSize, MAX_REVIEW_LENGTH])
# one embedding vector per word id: [batchSize, MAX_REVIEW_LENGTH, EMBEDDING_DIMENSION]
data = tf.nn.embedding_lookup(embeddings, inputs)
keep_prob = tf.placeholder(tf.float32, name='keep_prob')
# y_pred = rnn_model(data,lstmUnits,numClasses)
    

In [16]:
def lstm_cell():
    # one LSTM layer with dropout applied to its outputs
    lstm = tf.contrib.rnn.LSTMCell(lstmUnits, reuse=tf.get_variable_scope().reuse)
    return tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=keep_prob)

# stack num_layers LSTM layers
cell = tf.contrib.rnn.MultiRNNCell([lstm_cell() for _ in range(num_layers)])
initial_state = cell.zero_state(batchSize, tf.float32)

In [17]:
outputs, final_state = tf.nn.dynamic_rnn(cell, data,
                                             initial_state=initial_state)
    

In [18]:
predictions = tf.contrib.layers.fully_connected(outputs[:, -1], 1, activation_fn=tf.sigmoid)
cost = tf.losses.mean_squared_error(labels, predictions)    
optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)
correct_pred = tf.equal(tf.cast(tf.round(predictions), tf.int32), labels)
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
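
To make the loss and accuracy definitions above concrete, here is the same arithmetic in plain Python on made-up sigmoid outputs: the mean squared error between probabilities and 0/1 labels, and the accuracy after thresholding at 0.5. The numbers are invented for illustration.

```python
probs = [0.9, 0.2, 0.6, 0.4]   # made-up sigmoid outputs
labels = [1, 0, 0, 1]

# mean squared error between predicted probabilities and labels
mse = sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)
# threshold at 0.5 to get hard predictions, then compare with the labels
preds = [int(p > 0.5) for p in probs]
acc = sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)

print(round(mse, 4), acc)  # 0.1925 0.5
```

A model that always outputs 0.5 would have a mean squared error of exactly 0.25, which is a useful baseline when reading the training loss.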

In [19]:
def get_batches(x, y, batch_size=1000):
    
    n_batches = len(x)//batch_size
    x, y = x[:n_batches*batch_size], y[:n_batches*batch_size]
    for ii in range(0, len(x), batch_size):
        yield x[ii:ii+batch_size], y[ii:ii+batch_size]

In [20]:
epochs = 10
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    iteration = 1
    for e in range(epochs):
        state = sess.run(initial_state)
        
        for ii, (x, y) in enumerate(get_batches(X_train, y_train, batchSize), 1):
            feed = {inputs: x,
                    labels: y[:, None],
                    keep_prob: 0.5,
                    initial_state: state}
            loss, state, _ = sess.run([cost, final_state, optimizer], feed_dict=feed)
            
            if iteration%5==0:
                print("Epoch: {}/{}".format(e, epochs),
                      "Iteration: {}".format(iteration),
                      "Train loss: {:.3f}".format(loss))

            if iteration%25==0:
                val_acc = []
                val_state = sess.run(cell.zero_state(batchSize, tf.float32))
                for x, y in get_batches(X_test, y_test, batchSize):
                    feed = {inputs: x,
                            labels: y[:, None],
                            keep_prob: 1,
                            initial_state: val_state}
                    batch_acc, val_state = sess.run([accuracy, final_state], feed_dict=feed)
                    val_acc.append(batch_acc)
                print("Val acc: {:.3f}".format(np.mean(val_acc)))
            iteration +=1

Epoch: 0/10 Iteration: 5 Train loss: 0.255
Epoch: 0/10 Iteration: 10 Train loss: 0.249
Epoch: 0/10 Iteration: 15 Train loss: 0.251
Epoch: 0/10 Iteration: 20 Train loss: 0.248
Epoch: 0/10 Iteration: 25 Train loss: 0.245
Val acc: 0.552
Epoch: 0/10 Iteration: 30 Train loss: 0.245
Epoch: 0/10 Iteration: 35 Train loss: 0.237
Epoch: 0/10 Iteration: 40 Train loss: 0.250
Epoch: 1/10 Iteration: 45 Train loss: 0.242
Epoch: 1/10 Iteration: 50 Train loss: 0.236
Val acc: 0.620
Epoch: 1/10 Iteration: 55 Train loss: 0.232
Epoch: 1/10 Iteration: 60 Train loss: 0.226
Epoch: 1/10 Iteration: 65 Train loss: 0.218
Epoch: 1/10 Iteration: 70 Train loss: 0.219
Epoch: 1/10 Iteration: 75 Train loss: 0.217
Val acc: 0.663
Epoch: 1/10 Iteration: 80 Train loss: 0.216
Epoch: 2/10 Iteration: 85 Train loss: 0.214
Epoch: 2/10 Iteration: 90 Train loss: 0.205
Epoch: 2/10 Iteration: 95 Train loss: 0.208
Epoch: 2/10 Iteration: 100 Train loss: 0.215
Val acc: 0.688
Epoch: 2/10 Iteration: 105 Train loss: 0.208
Epoch: 2/10 Ite