# RNN for Sentiment Analysis

This notebook performs sentiment analysis of movie reviews with multilayer recurrent neural networks (RNNs) built from unidirectional and bidirectional LSTM units. Pre-trained GloVe word embeddings are used as the numerical representation of the text.

## 1. Preprocessing

This stage covers the following tasks:
- Loading the dataset.
- Defining the maximum length of a review.
- Defining the dimension of the word embeddings.
- Tokenizing each movie review.
- Defining the vocabulary of the dataset.
- Representing each movie review as a vector of indices into the dataset vocabulary.
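
The tokenization, vocabulary and index steps listed above can be sketched in plain Python. This is a toy re-implementation for illustration only, not the Keras `Tokenizer` used in the notebook; the helper names (`tokenize`, `build_vocab`, `to_indices`) and the two toy reviews are assumptions.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase the text and keep runs of letters, digits and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())

def build_vocab(texts):
    # Rank words by frequency, most frequent first; index 0 is left free for padding.
    counts = Counter(tok for t in texts for tok in tokenize(t))
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_indices(text, vocab, num_words=None):
    # Map tokens to ids, dropping unknown words and ids that are not below num_words.
    ids = [vocab[tok] for tok in tokenize(text) if tok in vocab]
    return [i for i in ids if num_words is None or i < num_words]

toy_reviews = ["A great great movie", "A dull movie"]   # hypothetical data
toy_vocab = build_vocab(toy_reviews)
toy_ids = [to_indices(r, toy_vocab) for r in toy_reviews]
print(toy_vocab)
print(toy_ids)
```

Passing a small `num_words` to `to_indices` shows how a vocabulary cap silently removes rarer words from a review instead of mapping them to a placeholder id.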

In [1]:
# Load the dataset
import pandas as pd
import numpy as np
df = pd.read_csv('shuffled_movie_data.csv')
df.tail()

Unnamed: 0,review,sentiment
49995,"OK, lets start with the best. the building. al...",0
49996,The British 'heritage film' industry is out of...,0
49997,I don't even know where to begin on this one. ...,0
49998,Richard Tyler is a little boy who is scared of...,0
49999,I waited long to watch this movie. Also becaus...,1


In [2]:
X = df.review
y = df.sentiment

In [3]:
# Embedding dimension and maximum review length

import tensorflow as tf
EMBEDDING_DIMENSION = 50  # dimension of the word embeddings
MAX_REVIEW_LENGTH = 200   # reviews are padded or truncated to this many tokens

In [4]:
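# Tokenization of the reviews

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

MAX_NB_WORDS = 5000 # intended vocabulary cap: only the most frequent words are kept
# NOTE: num_words is set to MAX_REVIEW_LENGTH (200), not MAX_NB_WORDS, so
# texts_to_sequences keeps only the 199 most frequent words (ids 1-199).
# The index vector printed further down contains only ids below 200, which is
# consistent with this setting. Pass num_words=MAX_NB_WORDS to apply the intended cap.
tokenizer = Tokenizer(num_words=MAX_REVIEW_LENGTH,filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
               lower=True,split=" ")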

Using TensorFlow backend.


In [5]:
# Build the vocabulary of the dataset

tokenizer.fit_on_texts(X)
sequences = tokenizer.texts_to_sequences(X)
word_index = tokenizer.word_index # full dataset vocabulary (not truncated by num_words)
N_WORDS = len(word_index) # number of unique words in the dataset
print('%s palabras unicas.' %N_WORDS)

124252 palabras unicas.


In [6]:
# Represent each review as a fixed-length vector of vocabulary indices

reviews_vectors = pad_sequences(sequences, maxlen=MAX_REVIEW_LENGTH)
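
With its default arguments, `pad_sequences` pads and truncates at the front of each sequence (`padding='pre'`, `truncating='pre'`). The sketch below mimics that default on plain lists; `pad_pre` is a hypothetical helper written for illustration, not part of Keras.

```python
def pad_pre(seq, maxlen, value=0):
    # Keep the last `maxlen` ids, then left-pad with `value` up to `maxlen`.
    kept = seq[-maxlen:]
    return [value] * (maxlen - len(kept)) + kept

short = pad_pre([5, 8, 2], maxlen=5)               # shorter than maxlen: zeros go in front
long = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)    # longer than maxlen: the beginning is cut
print(short, long)
```

For long reviews this means the model sees the last `MAX_REVIEW_LENGTH` tokens rather than the first ones.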

In [7]:
reviews_vectors[111]

array([ 75,   3,   3, 194,  55,  18,  60,   6, 146,   4,   3,  16, 115,
        10,   1, 108, 136,   3,  36,   1,  39,   5,  82,   9,   2,  44,
        62,  34,   8,   9,   1,  35,   8,   9, 159,   6,   5,  87,  18,
        28, 149,  62,  84, 115,  10,  21,  52,  28,   1,  67,   2,   5,
        20,   9,  51, 119,  53,  30,  33,   5,  26,  84,   4,   1,  18,
        14, 198,  44,  62,  35,  94,   1,   3,  16,   1,   2,  33,   1,
         5,   1,   1, 173,   4,   1,  19,  60,   6,  62,   5,  15,   1,
        84,   9,  39,   4,  57,  51,  16,  57,  51,   8,   9,   8,   3,
        51,   2,  79,  16,  51,  15,   1, 164,  35, 184,   5,  25,  53,
        20,  24,  38,   1,   5,  77,   1,   4,   1,   8,   1, 127,   3,
       169,   4,   1, 168,  12,   1,  90,   1,  28,  66,  16,  35,  13,
        24,   5,  16,   5,  25,  87,  24,   1,   2,  24,   1,  19, 100,
       105,   4,   2, 113,   2,  91,   1,   2,   1,  36,   5, 103, 104,
        16,  54, 111,  12,   5,  26, 137,  37, 104,  20, 136,  3

## 2. Splitting the dataset

In this stage the dataset is split into a training set and a test set (a single 80/20 hold-out split) so the model can be validated on unseen reviews.

In [8]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(reviews_vectors, y, test_size=0.2)
print('X : ',X.shape)
print('y : ',y.shape)

print('X_train: ',X_train.shape)
print('y_train: ',y_train.shape)
print('X_test: ',X_test.shape)
print('y_test: ',y_test.shape)

X :  (50000,)
y :  (50000,)
X_train:  (40000, 200)
y_train:  (40000,)
X_test:  (10000, 200)
y_test:  (10000,)
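
`train_test_split` is called here without a `random_state`, so each run draws a different split. A minimal sketch of a seeded hold-out split, written in plain Python to show the idea; `holdout_split` and the seed value are assumptions for illustration, not the scikit-learn implementation.

```python
import random

def holdout_split(n_samples, test_size=0.2, seed=0):
    # Shuffle the sample indices with a fixed seed and cut off the test share.
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    n_test = int(round(n_samples * test_size))
    return idx[n_test:], idx[:n_test]

train_idx, test_idx = holdout_split(50000, test_size=0.2, seed=42)
print(len(train_idx), len(test_idx))
```

Fixing the seed makes accuracy figures comparable between runs, because every run trains and evaluates on the same reviews.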


In [9]:
# step = 2
# batch_size = 5
# offset = (step * batch_size) % (y_train.shape[0] - batch_size)
# x = X_train[offset:(offset + batch_size),:]
# y = y_train[offset:(offset + batch_size)]
        
# x.shape, x

## 3. Loading pre-trained embeddings (GloVe)

In [10]:
glove_file = 'glove.twitter.27B.' + str(EMBEDDING_DIMENSION) + 'd.txt'
emb_dict = {}  # word -> embedding vector
# Each line of the GloVe file holds a word followed by its vector components
with open(glove_file, encoding='utf-8') as glove:
    for line in glove:
        values = line.split()
        word = values[0]
        vector = np.array(values[1:], dtype=np.float32)
        # Skip malformed lines whose vector does not have the expected dimension
        if vector.shape[0] == EMBEDDING_DIMENSION:
            emb_dict[word] = vector
print('vocabulario glove size: ',len(emb_dict))


vocabulario glove size:  1193513


In [11]:
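# Stack the GloVe vectors into a matrix of shape (n_words, EMBEDDING_DIMENSION).
# NOTE: rows follow the order of the GloVe file, not tokenizer.word_index, so the
# ids produced by the tokenizer do not select the matching word vectors.
embeddings = np.array(list(emb_dict.values()))
embeddings.shape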

(1193513, 50)
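
The ids produced by the tokenizer index `word_index`, while the rows of `embeddings` follow the order of the GloVe file. A lookup by id therefore needs a matrix whose row `i` holds the vector of the word with id `i`. The following is a minimal sketch of one way to build such a matrix, shown on toy data; `build_embedding_matrix` and the toy dictionaries are hypothetical and not part of the original notebook.

```python
import numpy as np

def build_embedding_matrix(word_index, emb_dict, dim, num_words=None):
    # Row i holds the vector of the word whose tokenizer id is i.
    # Row 0 (padding) and words missing from emb_dict stay at zero.
    n_rows = (max(word_index.values()) + 1) if num_words is None else num_words
    matrix = np.zeros((n_rows, dim), dtype=np.float32)
    for word, idx in word_index.items():
        if idx < n_rows and word in emb_dict:
            matrix[idx] = emb_dict[word]
    return matrix

# Toy stand-ins for tokenizer.word_index and the GloVe dictionary (assumed values).
toy_word_index = {"the": 1, "movie": 2, "zzzunknown": 3}
toy_emb = {
    "movie": np.array([0.5, -0.5], dtype=np.float32),
    "the": np.array([0.1, 0.2], dtype=np.float32),
}
toy_matrix = build_embedding_matrix(toy_word_index, toy_emb, dim=2)
print(toy_matrix.shape)
```

In the notebook this would presumably be called with `word_index`, `emb_dict` and `EMBEDDING_DIMENSION`, with `num_words` matching the value given to the `Tokenizer`, which also keeps the matrix far smaller than the full GloVe table.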

In [12]:
# Generator that yields the dataset in mini-batches (a trailing partial batch is dropped)
def get_batches(x, y, batch_size=1000):
    
    n_batches = len(x)//batch_size
    x, y = x[:n_batches*batch_size], y[:n_batches*batch_size]
    for ii in range(0, len(x), batch_size):
        yield x[ii:ii+batch_size], y[ii:ii+batch_size]
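
A short usage example of `get_batches` on toy data (the function is repeated so the snippet runs on its own; `toy_x` and `toy_y` are made-up inputs). It shows that only full batches are yielded, which matters here because the placeholders defined later have a fixed batch dimension.

```python
def get_batches(x, y, batch_size=1000):
    # Same logic as the notebook cell above.
    n_batches = len(x) // batch_size
    x, y = x[:n_batches * batch_size], y[:n_batches * batch_size]
    for ii in range(0, len(x), batch_size):
        yield x[ii:ii + batch_size], y[ii:ii + batch_size]

toy_x = list(range(10))
toy_y = [i % 2 for i in toy_x]
batches = list(get_batches(toy_x, toy_y, batch_size=4))
for xb, yb in batches:
    print(xb, yb)
```

With 10 samples and a batch size of 4, two batches are produced and the last two samples are never seen.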

## 4. Model definition: building the graph

In this stage the RNN model is built with TensorFlow (1.x graph-mode API).

In [13]:
tf.reset_default_graph()

batchSize = 1000
lstmUnits = 64
numClasses = 1
learning_rate = 0.01
num_layers = 2


In [14]:
# graph = tf.Graph()
# with graph.as_default():
labels = tf.placeholder(tf.int32,[batchSize,numClasses])
# word ids of each review
inputs = tf.placeholder(tf.int32,[batchSize,MAX_REVIEW_LENGTH])
# embedded reviews, shape [batchSize, MAX_REVIEW_LENGTH, EMBEDDING_DIMENSION]
data = tf.nn.embedding_lookup(embeddings,inputs)
# probability of keeping a unit in the dropout layers (fed at run time)
keep_prob = tf.placeholder(tf.float32, name='keep_prob')
#     y_pred = rnn_model(data,lstmUnits,numClasses)
    

In this part the LSTM cells are defined, and each one is wrapped in a Dropout layer for regularization.

In [15]:
def lsmt_cell():
    lstm = tf.contrib.rnn.LSTMCell(lstmUnits, reuse=tf.get_variable_scope().reuse)
    return tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=keep_prob)

cell = tf.contrib.rnn.MultiRNNCell([lsmt_cell() for _ in range(num_layers)])
initial_state = cell.zero_state(batchSize, tf.float32)
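
What `output_keep_prob` does to the cell outputs can be sketched in NumPy. This is an illustration of inverted dropout under the assumption that the wrapper behaves like `tf.nn.dropout` on the outputs; the `dropout` function and the array `h` are hypothetical.

```python
import numpy as np

def dropout(outputs, keep_prob, rng):
    # Inverted dropout: zero each unit with probability 1 - keep_prob and
    # rescale the survivors by 1 / keep_prob so the expected value is unchanged.
    mask = rng.random(outputs.shape) < keep_prob
    return outputs * mask / keep_prob

rng = np.random.default_rng(0)
h = np.ones((4, 64), dtype=np.float32)          # stand-in for LSTM outputs
h_train = dropout(h, keep_prob=0.5, rng=rng)    # training: about half the units are zeroed
h_eval = dropout(h, keep_prob=1.0, rng=rng)     # evaluation: outputs pass through unchanged
print(np.unique(h_train), np.array_equal(h_eval, h))
```

This is why `keep_prob` is a placeholder: it is fed as 0.5 during training and as 1 during evaluation.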

## Bidirectional LSTM

In [16]:
# Definition of the bidirectional LSTM cells and layers

 
# cell_fw = tf.nn.rnn_cell.LSTMCell(lstmUnits, reuse=tf.get_variable_scope().reuse)
multi_cell_fw = tf.contrib.rnn.MultiRNNCell([lsmt_cell() for _ in range(num_layers)])
# cell_bw = tf.nn.rnn_cell.LSTMCell(lstmUnits, reuse=tf.get_variable_scope().reuse)
multi_cell_bw = tf.contrib.rnn.MultiRNNCell([lsmt_cell() for _ in range(num_layers)])
initial_state = cell.zero_state(batchSize, tf.float32)

In [17]:
outputs, final_state  = tf.nn.bidirectional_dynamic_rnn(multi_cell_fw, multi_cell_bw, data, initial_state_fw=initial_state,initial_state_bw=initial_state, dtype=tf.float32)
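
`tf.nn.bidirectional_dynamic_rnn` returns `outputs` as a pair `(output_fw, output_bw)`, both aligned to the input time order. The toy sketch below illustrates what that alignment implies when the two are concatenated and the last time step is taken; a running sum stands in for the recurrent layer, and all names are assumptions for illustration.

```python
import numpy as np

def running_sum_rnn(x):
    # Stand-in for a recurrent layer: the output at step t is the sum of inputs up to t.
    return np.cumsum(x, axis=1)

def bidirectional(x):
    fw = running_sum_rnn(x)
    # Backward pass: process the reversed sequence, then flip back to input order.
    bw = running_sum_rnn(x[:, ::-1])[:, ::-1]
    return fw, bw

x = np.arange(1, 5, dtype=np.float32).reshape(1, 4, 1)   # [batch=1, time=4, features=1]
fw, bw = bidirectional(x)
merged = np.concatenate([fw, bw], axis=2)                 # [batch, time, 2 * features]
last_step = merged[:, -1]
print(merged.shape, last_step)
```

At the last time step the forward half has accumulated the whole sequence, while the backward half has only seen the final input. If a summary of the full sequence is wanted from both directions, the final states returned by the RNN are an alternative to slicing the last output.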

In [18]:
outputs = tf.concat(outputs,axis=2)
predictions = tf.contrib.layers.fully_connected(outputs[:, -1], 1, activation_fn=tf.sigmoid)
cost = tf.losses.mean_squared_error(labels, predictions)    
optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)
correct_pred = tf.equal(tf.cast(tf.round(predictions), tf.int32), labels)
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
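
The loss and accuracy defined in this cell can be reproduced in NumPy on a handful of made-up values. The binary cross-entropy at the end is not used in the notebook; it is included only as the more common alternative for a sigmoid output. All values and names here are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

logits = np.array([[2.0], [-1.0], [0.5], [-3.0]])   # hypothetical pre-activation outputs
labels = np.array([[1], [0], [0], [0]])

predictions = sigmoid(logits)                        # probabilities in (0, 1)
mse = np.mean((labels - predictions) ** 2)           # loss used in the notebook
accuracy = np.mean(np.round(predictions).astype(int) == labels)

# Alternative loss (not in the notebook): binary cross-entropy.
eps = 1e-7
p = np.clip(predictions, eps, 1 - eps)
bce = -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))
print(mse, accuracy, bce)
```

A model that outputs 0.5 for every review has an MSE of exactly 0.25, which is why the training loss in the log starts close to 0.25 and decreases from there.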

Next, the model is trained. Training is done in mini-batches.
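
The training schedule implied by the notebook's settings (40,000 training reviews, batches of 1,000, 10 epochs, validation every 50 iterations) can be worked out with a few lines of arithmetic. The variable names are chosen here for illustration.

```python
n_train = 40000      # training reviews (80% of 50,000)
batch_size = 1000
epochs = 10
eval_every = 50      # validation is run every 50 iterations

batches_per_epoch = n_train // batch_size
total_iterations = batches_per_epoch * epochs
eval_iterations = [i for i in range(1, total_iterations + 1) if i % eval_every == 0]
print(batches_per_epoch, total_iterations, len(eval_iterations))
```

Each epoch therefore spans 40 iterations, which matches the epoch counter in the log changing shortly after iteration 40.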

In [19]:
epochs = 10
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    iteration = 1
    for e in range(epochs):
        
        for ii, (x, y) in enumerate(get_batches(X_train, y_train, batchSize), 1):
            # Training step with dropout active (keep_prob=0.5).
            # np.asarray avoids multi-dimensional indexing on a pandas Series.
            feed = {inputs: x,
                    labels: np.asarray(y)[:, None],
                    keep_prob: 0.5}
            loss, _ = sess.run([cost, optimizer], feed_dict=feed)
            
            if iteration%8==0:
                print("Epoch: {}/{}".format(e, epochs),
                      "Iteration: {}".format(iteration),
                      "Train loss: {:.3f}".format(loss))

            if iteration%50==0:
                # Evaluate on the held-out set with dropout disabled (keep_prob=1)
                val_acc = []
                for x_val, y_val in get_batches(X_test, y_test, batchSize):
                    feed = {inputs: x_val,
                            labels: np.asarray(y_val)[:, None],
                            keep_prob: 1}
                    batch_acc = sess.run(accuracy, feed_dict=feed)
                    val_acc.append(batch_acc)
                print("Val acc: {:.3f}".format(np.mean(val_acc)))
            iteration +=1

Epoch: 0/10 Iteration: 8 Train loss: 0.252
Epoch: 0/10 Iteration: 16 Train loss: 0.251
Epoch: 0/10 Iteration: 24 Train loss: 0.248
Epoch: 0/10 Iteration: 32 Train loss: 0.234
Epoch: 0/10 Iteration: 40 Train loss: 0.246
Epoch: 1/10 Iteration: 48 Train loss: 0.241
Val acc: 0.596
Epoch: 1/10 Iteration: 56 Train loss: 0.235
Epoch: 1/10 Iteration: 64 Train loss: 0.225
Epoch: 1/10 Iteration: 72 Train loss: 0.224
Epoch: 1/10 Iteration: 80 Train loss: 0.217
Epoch: 2/10 Iteration: 88 Train loss: 0.209
Epoch: 2/10 Iteration: 96 Train loss: 0.215
Val acc: 0.684
Epoch: 2/10 Iteration: 104 Train loss: 0.206
Epoch: 2/10 Iteration: 112 Train loss: 0.194
Epoch: 2/10 Iteration: 120 Train loss: 0.192
Epoch: 3/10 Iteration: 128 Train loss: 0.205
Epoch: 3/10 Iteration: 136 Train loss: 0.210
Epoch: 3/10 Iteration: 144 Train loss: 0.194
Val acc: 0.727
Epoch: 3/10 Iteration: 152 Train loss: 0.175
Epoch: 3/10 Iteration: 160 Train loss: 0.182
Epoch: 4/10 Iteration: 168 Train loss: 0.177
Epoch: 4/10 Iteration: 

The accuracy obtained with the implemented model is 76.9%.