# Recurrent Neural Networks for Sentiment Analysis

Adapted from http://nbviewer.jupyter.org/github/rasbt/pattern_classification/blob/master/machine_learning/scikit-learn/outofcore_modelpersistence.ipynb

## The IMDb Movie Review Dataset

In this section, we will train a bidirectional LSTM network to classify movie reviews from the 50k IMDb review dataset collected by Maas et al.

> AL Maas, RE Daly, PT Pham, D Huang, AY Ng, and C Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics.

[Source: http://ai.stanford.edu/~amaas/data/sentiment/]

The dataset consists of 50,000 movie reviews from the original "train" and "test" subdirectories. The class labels are binary (1 = positive, 0 = negative), with 25,000 positive and 25,000 negative reviews.
For simplicity, I assembled the reviews in a single CSV file.


### Import libraries and load the data

We first import pandas and NumPy, then load the assembled reviews into a DataFrame.

In [1]:
import pandas as pd
import numpy as np
# if you want to download the original file:
#df = pd.read_csv('https://raw.githubusercontent.com/rasbt/pattern_classification/master/data/50k_imdb_movie_reviews.csv')
# otherwise load local file
df = pd.read_csv('data/shuffled_movie_data.csv')
df.tail()

Unnamed: 0,review,sentiment
49995,"OK, lets start with the best. the building. al...",0
49996,The British 'heritage film' industry is out of...,0
49997,I don't even know where to begin on this one. ...,0
49998,Richard Tyler is a little boy who is scared of...,0
49999,I waited long to watch this movie. Also becaus...,1


## 2. Loading Libraries

In [2]:
import re
from keras.models import Sequential
from keras.layers import Activation, Dense, Embedding, SimpleRNN, Bidirectional
from keras import backend as K
from keras.callbacks import ReduceLROnPlateau, EarlyStopping, ModelCheckpoint
from keras.callbacks import TensorBoard

Using TensorFlow backend.


## 3. Pre-processing data

In [3]:
from keras.preprocessing.text import Tokenizer

In [4]:
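# Size of the vocabulary passed to the tokenizer
num_words = 10000
tokenizer = Tokenizer(num_words=num_words)
tokenizer.fit_on_texts(df.review)
# Convert each review into a list of integer word indices
sequences = tokenizer.texts_to_sequences(df.review)
y = np.array(df.sentiment)
y[0:5]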


array([1, 0, 0, 1, 0])
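To make the tokenization step more concrete, here is a rough pure-Python sketch of the idea: words are ranked by frequency, the most frequent word gets index 1, and words whose index is not below `num_words` are dropped. The helper names, the regular expression, and the alphabetical tie-breaking are assumptions made for this illustration; the Keras `Tokenizer` has its own filtering rules and should be treated as the reference.

```python
import re
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z0-9']+", text.lower()))
    # Ties are broken alphabetically so the result is deterministic (an assumption).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: i for i, (word, _) in enumerate(ranked, start=1)}

def texts_to_sequences(texts, word_index, num_words):
    """Replace each word by its index, dropping words with index >= num_words."""
    sequences = []
    for text in texts:
        words = re.findall(r"[a-z0-9']+", text.lower())
        sequences.append([word_index[w] for w in words
                          if w in word_index and word_index[w] < num_words])
    return sequences

toy_reviews = ["A great great movie", "A dull movie"]
word_index = fit_word_index(toy_reviews)
toy_sequences = texts_to_sequences(toy_reviews, word_index, num_words=4)
print(word_index)
print(toy_sequences)
```

In this toy example the rare word "dull" receives index 4 and is dropped because `num_words=4` keeps only indices 1 to 3.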

In [5]:
from keras.preprocessing.sequence import pad_sequences 

max_review_length = 200

# 'pre' pads short reviews with leading zeros and cuts long reviews from the start
pad = 'pre'

X = pad_sequences(sequences, maxlen=max_review_length, padding=pad, truncating=pad)
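The effect of `padding='pre'` and `truncating='pre'` on a single review can be sketched in plain Python. The function `pad_pre` is a hypothetical helper written for this illustration, not part of Keras: it keeps the last `maxlen` tokens and fills the front with zeros.

```python
def pad_pre(sequence, maxlen, value=0):
    """Keep the last `maxlen` tokens, then left-pad with `value` up to `maxlen`."""
    if maxlen <= 0:
        raise ValueError("maxlen must be positive")
    kept = sequence[-maxlen:]  # truncating='pre' drops tokens from the start
    return [value] * (maxlen - len(kept)) + kept  # padding='pre' adds zeros in front

short_review = [5, 8, 2]
long_review = [1, 2, 3, 4, 5, 6]
print(pad_pre(short_review, 5))  # [0, 0, 5, 8, 2]
print(pad_pre(long_review, 4))   # [3, 4, 5, 6]
```

With both options set to `'pre'`, the end of each review is always preserved, so the last time steps the recurrent layer sees are real tokens rather than padding.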

## 4. Splitting data

In [6]:
from sklearn.model_selection import train_test_split

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.2)
print(X_train.shape)
print(X_test.shape)
input_shape = X_train.shape

(40000, 200)
(10000, 200)


## 5. Generating model

In [8]:
from keras.layers import LSTM

K.clear_session()

lstm_model = Sequential()
# The Embedding layer maps each of the num_words token ids to a 32-dimensional vector;
# input_length fixes the length of every input sequence to max_review_length
lstm_model.add(Embedding(num_words, 
                        32, 
                        input_length=max_review_length))

lstm_model.add(Bidirectional(LSTM(32)))
lstm_model.add(Dense(1))
lstm_model.add(Activation('sigmoid'))
lstm_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 200, 32)           320000    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 64)                16640     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 65        
_________________________________________________________________
activation_1 (Activation)    (None, 1)                 0         
Total params: 336,705
Trainable params: 336,705
Non-trainable params: 0
_________________________________________________________________
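The parameter counts in the summary can be reproduced by hand. The sketch below assumes the usual LSTM layout of four gates, each with input weights, recurrent weights, and a bias vector; the variable names are illustrative.

```python
num_words = 10000     # vocabulary size fed to the Embedding layer
embedding_dim = 32    # size of each word vector
lstm_units = 32       # hidden units per LSTM direction

# One 32-dimensional vector per vocabulary entry
embedding_params = num_words * embedding_dim

# Four gates, each with input weights, recurrent weights, and a bias
lstm_params_one_direction = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)

# The Bidirectional wrapper holds one forward and one backward LSTM
bidirectional_params = 2 * lstm_params_one_direction

# The Dense layer sees the concatenated 64-dimensional output plus one bias
dense_params = 2 * lstm_units * 1 + 1

total_params = embedding_params + bidirectional_params + dense_params
print(embedding_params, bidirectional_params, dense_params, total_params)
```

These values match the 320,000, 16,640, 65, and 336,705 reported by `lstm_model.summary()`.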


## 6. Training Network

In [9]:
lstm_model.compile(optimizer="adam", 
              loss='binary_crossentropy', 
              metrics=['accuracy'])

lstm_history = lstm_model.fit(X_train, 
                              y_train,
                              epochs=10,
                              batch_size=128,
                              validation_split=0.3)

Train on 28000 samples, validate on 12000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


## 7. Prediction phase

In [10]:
y_pred = []
print(len(X_test), " Iterations will be done.")
# Predict one review at a time and threshold the sigmoid output at 0.5
for i in range(len(X_test)):
    result2 = lstm_model.predict(X_test[i].reshape(1, X_test.shape[1]), batch_size=1, verbose=2)[0]
    pred = (result2 > 0.5) * 1
    y_pred.append(pred)
    # Print a progress line every 100 reviews
    if i % 100 == 0:
        print("Test:", i, "-> y_pred:", pred, "-> REAL:", y_test[i])

10000  Iterations will be done.
Test: 0 -> y_pred: [1] -> REAL: 1
Test: 100 -> y_pred: [1] -> REAL: 1
Test: 200 -> y_pred: [0] -> REAL: 0
Test: 300 -> y_pred: [1] -> REAL: 1
Test: 400 -> y_pred: [1] -> REAL: 1
Test: 500 -> y_pred: [1] -> REAL: 1
Test: 600 -> y_pred: [0] -> REAL: 1
Test: 700 -> y_pred: [0] -> REAL: 0
Test: 800 -> y_pred: [0] -> REAL: 0
Test: 900 -> y_pred: [1] -> REAL: 1
Test: 1000 -> y_pred: [0] -> REAL: 0
Test: 1100 -> y_pred: [1] -> REAL: 1
Test: 1200 -> y_pred: [0] -> REAL: 0
Test: 1300 -> y_pred: [0] -> REAL: 0
Test: 1400 -> y_pred: [0] -> REAL: 0
Test: 1500 -> y_pred: [1] -> REAL: 1
Test: 1600 -> y_pred: [0] -> REAL: 0
Test: 1700 -> y_pred: [0] -> REAL: 0
Test: 1800 -> y_pred: [1] -> REAL: 1
Test: 1900 -> y_pred: [1] -> REAL: 1
Test: 2000 -> y_pred: [1] -> REAL: 1
Test: 2100 -> y_pred: [0] -> REAL: 1
Test: 2200 -> y_pred: [1] -> REAL: 1
Test: 2300 -> y_pred: [0] -> REAL: 0
Test: 2400 -> y_pred: [1] -> REAL: 1
...
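The decision rule inside the loop, `(result2 > 0.5) * 1`, turns the sigmoid output into a class label. A minimal sketch of that rule on made-up probabilities follows; `to_label` and the example values are hypothetical and only illustrate the thresholding.

```python
def to_label(probability, threshold=0.5):
    """Return 1 (positive) if the probability exceeds the threshold, else 0."""
    return int(probability > threshold)

# Made-up sigmoid outputs, not taken from the trained model
example_probabilities = [0.91, 0.12, 0.5, 0.67]
labels = [to_label(p) for p in example_probabilities]
print(labels)
```

Because the comparison is strict, a probability of exactly 0.5 is assigned to the negative class.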

## 8. Results

In [11]:
from sklearn.metrics import confusion_matrix
Result=confusion_matrix(y_test, y_pred)
print(Result)

[[4183  808]
 [ 535 4474]]


In [12]:
# Rows of the confusion matrix are true labels, so dividing by a row total gives per-class recall
print("Negative recall: ",Result[0,0]/(Result[0,0]+Result[0,1]))
print("Positive recall: ",Result[1,1]/(Result[1,0]+Result[1,1]))
print("Accuracy: ",(Result[1,1]+Result[0,0])/(Result[0,0]+Result[0,1]+Result[1,0]+Result[1,1]))

Negative recall:  0.8381085954718493
Positive recall:  0.8931922539429028
Accuracy:  0.8657
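The ratios printed above divide each diagonal entry by its row total. Under the scikit-learn convention (rows are true labels, columns are predicted labels), that is the recall of each class; precision divides by the column total instead. The sketch below recomputes both from the confusion matrix shown earlier, with illustrative variable names.

```python
# Confusion matrix from the results above: rows = true label, columns = predicted label
confusion = [[4183, 808],
             [535, 4474]]
(tn, fp), (fn, tp) = confusion

# Recall: fraction of each true class that was recovered (row-wise)
negative_recall = tn / (tn + fp)
positive_recall = tp / (tp + fn)

# Precision: fraction of each predicted class that was correct (column-wise)
negative_precision = tn / (tn + fn)
positive_precision = tp / (tp + fp)

accuracy = (tn + tp) / (tn + fp + fn + tp)

print("Negative recall:", negative_recall, "precision:", negative_precision)
print("Positive recall:", positive_recall, "precision:", positive_precision)
print("Accuracy:", accuracy)
```

On this matrix the model recovers positive reviews more often than negative ones (higher positive recall), while its negative predictions are correct more often than its positive ones (higher negative precision).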
