## Neural Net For Sentiment Analysis

**Objective:**
 Use a neural network to predict the sentiment (positive or negative) of movie reviews from the Cornell Movie Dataset.

**Background**: 

* A LogisticRegression model reached an accuracy of about 84% on this task. [Refer to this python notebook](./MovieSentiment.ipynb)

* Let's check what accuracy a neural network achieves on the same data.


### Using Neural Network

The steps involved in using a neural network are:

  * [Data preparation](#read_data)
      + **Get the Vocabulary set.**
      + **Split the data into train and test. Use only train data to build Vocabulary set.**
  
  * [Train CNN with Embedding Layer](#nnet)
  
  * [Evaluate the Model](#metric)

<a id='read_data'></a>
### Read the Data

In [127]:
import numpy as np
from sklearn.datasets import load_files

In [128]:
reviews = load_files("./txt_sentoken")

In [129]:
type(reviews)

sklearn.utils.Bunch

In [130]:
reviews.keys()

['target_names', 'data', 'target', 'DESCR', 'filenames']

In [131]:
print(reviews.target_names)

['neg', 'pos']


In [132]:
print(reviews.target)

[0 1 1 ... 1 0 0]


In [133]:
len(reviews.data)

2000

In [134]:
## Get the data and target values
X, y = reviews.data, reviews.target

#### Split train and test data

In [135]:
from sklearn.model_selection import train_test_split

In [136]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=123)

#### Build Vocabulary set using Train data

* Clean the training data and count how often each word occurs.
* Ignore stop words (very common words such as "the" or "is").

In [137]:
import re
from nltk.corpus import stopwords
import nltk

In [138]:
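# load the stop words once; a set makes each membership test fast
STOP_WORDS = set(stopwords.words('english'))

# clean the given input text and return the remaining tokens
def cleanup_text(text):
    clean_tokens = []
    # convert to lower case
    text = text.lower()
    
    # replace special (non-word) characters with spaces
    text = re.sub(r'\W', ' ', text)
    
    # meant to drop single-character words; the pattern is anchored with ^ and $,
    # so it only fires when the whole text is one letter surrounded by whitespace
    text = re.sub(r'^\s+[a-z]\s+$', ' ', text)
    
    # collapse a text that is nothing but whitespace into a single space
    text = re.sub(r'^\s+$', ' ', text)
    
    # ignore the stop words
    for word in nltk.word_tokenize(text):
        if word not in STOP_WORDS:
            clean_tokens.append(word)
        
    return clean_tokens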

In [139]:
print(cleanup_text('1234 hi this movie is nice'))

['1234', 'hi', 'movie', 'nice']


In [140]:
print(cleanup_text('hello 1234 !@#@ 10:30 very boring !!!!. Not a good idea !@!#$ '))

['hello', '1234', '10', '30', 'boring', 'good', 'idea']
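
The single-character pattern in `cleanup_text` is anchored to the whole string, so one-letter words inside a longer review survive unless they happen to be stop words. A minimal sketch of one way to drop them, assuming that is the goal; `drop_single_letters` is a hypothetical helper and is not part of this notebook:

```python
import re

def drop_single_letters(text):
    """Hypothetical helper: blank out standalone single letters (a-z)."""
    # \b marks a word boundary, so only one-letter words match.
    text = re.sub(r'\b[a-z]\b', ' ', text.lower())
    # Collapse the runs of whitespace left behind.
    return re.sub(r'\s+', ' ', text).strip()

print(drop_single_letters('plan b was x rated'))  # plan was rated
```

Switching to a pattern like this would change the vocabulary counts reported below, so the numbers in this notebook apply only to the original cleaning code.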


#### Identify frequently used words

In [141]:
from collections import Counter

In [142]:
Vocab = Counter()

In [143]:
for review in X_train:
    Vocab.update(cleanup_text(review))

In [144]:
len(Vocab)

36164

In [145]:
print(Vocab.most_common(25))

[('film', 7666), ('movie', 4673), ('one', 4669), ('like', 2942), ('even', 2056), ('good', 1962), ('time', 1896), ('story', 1747), ('would', 1701), ('much', 1625), ('character', 1622), ('also', 1588), ('get', 1580), ('well', 1548), ('two', 1512), ('characters', 1485), ('first', 1475), ('see', 1411), ('way', 1332), ('make', 1321), ('really', 1255), ('life', 1231), ('plot', 1228), ('little', 1214), ('films', 1209)]


In [146]:
# keep only the words that occur at least twice
min_occurrence = 2
Tokens = [k for k,c in Vocab.items() if c >= min_occurrence]

In [147]:
print(len(Tokens))

23075
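
The minimum-occurrence filter cut the vocabulary from 36164 words to 23075. The same idea on a toy example, as a sketch with made-up token lists (`toy_reviews` is illustrative data, not taken from the dataset):

```python
from collections import Counter

# Toy token lists standing in for cleaned reviews (illustrative data only).
toy_reviews = [
    ['great', 'film', 'great', 'cast'],
    ['dull', 'film'],
]

toy_vocab = Counter()
for tokens in toy_reviews:
    toy_vocab.update(tokens)

# Words seen fewer than min_occurrence times are dropped.
min_occurrence = 2
kept = sorted(word for word, count in toy_vocab.items() if count >= min_occurrence)
print(kept)  # ['film', 'great']
```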


#### Save Vocab list to file

In [148]:
def save_to_file(lines, file_name):
    # join the tokens into one newline-separated string
    data = '\n'.join(lines)
    
    # the with-block closes the file even if the write fails
    with open(file_name, 'w') as f:
        f.write(data)

In [149]:
# save vocab tokens to vocab.txt
save_to_file(Tokens, 'Vocab.txt')

<a id='nnet'></a>
### Build Embedding Layer

* [Load the Vocabulary.](#load_vocab)
* [Encode the training data to numbers (using tokenizer)](#encode_train)
* Use Keras Embedding Layer
* Build CNN model
* Save the model.

<a id='load_vocab'></a>
#### Load Vocabulary

In [150]:
# load the vocabulary
def read_vocabulary_file(file_name):
    # the with-block closes the file once it has been read
    with open(file_name, 'r') as f:
        return f.read()
 


In [151]:
# load vocabulary file
Vocab = read_vocabulary_file('./Vocab.txt')

In [152]:
Vocab = set(Vocab.split())

In [153]:
print(len(Vocab))

23075


#### Clean the train data

Clean each training review and drop every word that is NOT in our vocabulary.

In [154]:
# clean the review
# return the cleaned words as one string
def cleanup_review(text):
    # convert to lower case
    text = text.lower()
    
    # replace special (non-word) characters with spaces
    text = re.sub(r'\W', ' ', text)
    
    # meant to drop single-character words; the pattern is anchored with ^ and $,
    # so it only fires when the whole text is one letter surrounded by whitespace
    text = re.sub(r'^\s+[a-z]\s+$', ' ', text)
    
    # split into tokens
    tokens = text.split()
    
    # keep only the tokens that are in the vocabulary
    clean_tokens = [word for word in tokens if word in Vocab]
    return ' '.join(clean_tokens)

In [155]:
print( cleanup_review("that was a thrilling movie"))

thrilling movie


In [156]:
X_train_clean = list()

In [157]:
for review in X_train:
    X_train_clean.append(cleanup_review(review))

In [158]:
print(len(X_train_clean))

1600


<a id='encode_train'></a>
#### Encode the Training data

In [159]:
from keras.preprocessing.text import Tokenizer

In [160]:
tokenizer = Tokenizer()

In [161]:
tokenizer.fit_on_texts(X_train_clean)

In [162]:
tokenizer

<keras_preprocessing.text.Tokenizer at 0x126e39490>

In [163]:
tokenizer.document_count

1600

In [164]:
tokenizer.texts_to_matrix

<bound method Tokenizer.texts_to_matrix of <keras_preprocessing.text.Tokenizer object at 0x126e39490>>

In [165]:
tokenizer.word_counts

OrderedDict([('crazy', 79),
             ('beautiful', 226),
             ('suffers', 43),
             ('damned', 26),
             ('syndrome', 19),
             ('spate', 2),
             ('flighty', 2),
             ('cookie', 29),
             ('cutter', 10),
             ('teen', 123),
             ('films', 1209),
             ('romantic', 211),
             ('drama', 242),
             ('addresses', 12),
             ('alcoholism', 5),
             ('parental', 3),
             ('loss', 48),
             ('along', 423),
             ('love', 896),
             ('story', 1747),
             ('rather', 502),
             ('applaud', 10),
             ('production', 253),
             ('early', 248),
             ('reviews', 73),
             ('dismissed', 11),
             ('overblown', 14),
             ('afterschool', 2),
             ('special', 483),
             ('even', 2056),
             ('worse', 197),
             ('wake', 24),
             ('federal', 27),
            

In [166]:
tokenizer.word_index

{'woods': 1247,
 'spiders': 13580,
 'hanging': 2222,
 'woody': 724,
 'comically': 8245,
 'hennings': 23004,
 'originality': 2449,
 'rickman': 6135,
 'rawhide': 22849,
 'bringing': 1665,
 'liaisons': 9263,
 'sommerset': 14227,
 'wooden': 2778,
 'wednesday': 10600,
 'circuitry': 22661,
 'crotch': 9124,
 'elgar': 16651,
 'stereotypical': 2452,
 'miniatures': 20642,
 'gorman': 20499,
 'francesca': 17689,
 'scraped': 21468,
 'inanimate': 18132,
 'errors': 11691,
 'cooking': 5450,
 'joely': 13574,
 'designing': 9684,
 'succumb': 13495,
 'shocks': 10154,
 'chins': 20693,
 'china': 1933,
 'shandling': 4475,
 'confronts': 4657,
 'wiseguy': 8542,
 'natured': 4778,
 'kids': 300,
 'uplifting': 4037,
 'controversy': 6890,
 'spotty': 20828,
 'golden': 2627,
 'projection': 18739,
 'lengthen': 20093,
 'intent': 4055,
 'unsinkable': 12066,
 'stern': 4213,
 'dna': 5814,
 'catchy': 4048,
 'insecurity': 11459,
 'cannibal': 12511,
 'sidebars': 19315,
 'music': 174,
 'therefore': 1579,
 'mystic': 15630,
 ...

In [167]:
#X_train_encoded_mat = tokenizer.texts_to_matrix(X_train_clean, mode='count')
X_train_encoded_mat = tokenizer.texts_to_sequences(X_train_clean)

In [168]:
X_train_encoded_mat[0:3]

[[1274,
  365,
  2332,
  3612,
  3612,
  4645,
  17884,
  17885,
  3306,
  7419,
  ...

In [169]:
print(len(X_train_encoded_mat))

1600


In [170]:
# number of distinct words + 1, because index 0 is reserved for padding
vocab_size = len(tokenizer.word_index) + 1

In [171]:
vocab_size

23028
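
The Keras `Tokenizer` numbers words by frequency rank starting at 1 and never assigns index 0, which is why the vocabulary size is one more than the number of distinct words. A pure-Python sketch that mimics this numbering on toy input; `build_word_index` is a hypothetical stand-in, not the Keras implementation:

```python
from collections import Counter

def build_word_index(texts):
    """Hypothetical stand-in: rank words by frequency, starting at index 1."""
    counts = Counter(word for text in texts for word in text.split())
    # most_common() returns (word, count) pairs sorted by descending count.
    ranked = counts.most_common()
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

toy_texts = ['good film good cast', 'bad film']
word_index = build_word_index(toy_texts)

# Index 0 is left unused, so the embedding needs len(word_index) + 1 rows.
vocab_size = len(word_index) + 1

print(word_index)
print(vocab_size)  # 5
```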

In [172]:
## Find the maximum number of words in a review
max_review_length = max([len(s.split()) for s in X_train_clean ])

In [173]:
max_review_length

1192

In [174]:
from keras.preprocessing.sequence import pad_sequences

In [175]:
# the training data is already encoded as integer sequences.

# Keras needs every input to have the same length,
# so pad each sequence with zeros up to max_review_length
X_train_encoded_padded = pad_sequences(X_train_encoded_mat, maxlen=max_review_length, padding='post')

In [176]:
X_train_encoded_padded[0]

array([1274,  365, 2332, ...,    0,    0,    0], dtype=int32)
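
With `padding='post'`, zeros are appended to shorter sequences. Sequences longer than `maxlen` are cut from the front unless `truncating` is changed, which matters for any test review longer than the longest training review. A pure-Python sketch of that behaviour, assuming `maxlen` is at least 1; `pad_post` is a hypothetical stand-in, not the Keras function:

```python
def pad_post(sequences, maxlen, value=0):
    """Hypothetical stand-in for post-padding integer sequences to one length."""
    padded = []
    for seq in sequences:
        # Keep the last `maxlen` items when a sequence is too long
        # (this mirrors the Keras default of truncating='pre').
        seq = list(seq)[-maxlen:]
        # Append the padding value until the sequence reaches maxlen.
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

print(pad_post([[5, 3], [7, 1, 4, 9, 2]], maxlen=4))
# [[5, 3, 0, 0], [1, 4, 9, 2]]
```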

In [177]:
print(X_train_encoded_padded)

[[ 1274   365  2332 ...     0     0     0]
 [ 3308   765   337 ...     0     0     0]
 [   20 14843  1739 ...     0     0     0]
 ...
 [    1   323   286 ...     0     0     0]
 [  655     3   399 ...     0     0     0]
 [  393   192     4 ...     0     0     0]]


In [178]:
print(X_train_encoded_mat)

[[1274, 365, 2332, 3612, 3612, 4645, 17884, 17885, 3306, 7419, 799, 25, 400, 335, 6640, 11360, 14835, 2093, 161, 41, 8, 125, 7420, 318, 324, 1405, 6974, 5932, 17886, 130, 5, 441, 3858, 3501, 2476, 17887, 14836, 1681, 3402, 837, 653, 38, 95, 26, 694, 180, 216, 17888, 17889, 542, 37, 91, 14837, 1117, 35, 566, 1181, 850, 70, 3307, 83, 1708, 3726, 143, 20, 10273, 567, 524, 11361, 99, 2333, 143, 8583, 26, 115, 3613, 1, 277, 950, 4006, 222, 401, 1564, 96, 1488, 8584, 2423, 1274, 365, 612, 1621, 6, 63, 162, 822, 305, 2, 1232, 458, 1159, 1955, 2664, 4007, 1622, 11362, 2094, 12828, 416, 126, 5933, 8, 695, 12829, 271, 15, 2531, 57, 2053, 3129, 10274, 17890, 1248, 14838, 108, 15, 274, 1584, 1054, 360, 5107, 5933, 123, 215, 2733, 1132, 92, 9348, 10274, 1834, 7421, 4150, 277, 127, 6269, 1182, 14, 510, 3859, 10275, 7951, 311, 586, 17891, 770, 12830, 3859, 2232, 511, 455, 4646, 336, 2334, 10274, 625, 63, 2183, 2054, 224, 459, 210, 1012, 1709, 1430, 172, 2233, 60, 5644, 3227, 381, 9349, 7952, 356, 80,

#### Define the Neural Net model

In [179]:
from keras.models import Sequential
from keras.layers import Dense, Flatten, Embedding
from keras.layers import Conv1D, MaxPooling1D

from keras.utils import plot_model

In [180]:
# define the model
def define_model():
    model = Sequential()
    # embedding layer: maps each of the vocab_size word indices to a
    # 100-dimensional vector; every input holds max_review_length indices
    model.add(Embedding(vocab_size, 100, input_length=max_review_length))
    # add the convolution and max pooling layers
    model.add(Conv1D(filters=32, kernel_size=8, activation='relu'))
    model.add(MaxPooling1D(pool_size=2))
    model.add(Flatten())
    model.add(Dense(10, activation='relu'))
    # a single sigmoid unit outputs the probability of the 'pos' class
    model.add(Dense(1, activation='sigmoid'))
    # compile network
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    # summarize defined model
    model.summary()
    plot_model(model, to_file='model.png', show_shapes=True)
    return model

In [181]:
model = define_model()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_5 (Embedding)      (None, 1192, 100)         2302800   
_________________________________________________________________
conv1d_5 (Conv1D)            (None, 1185, 32)          25632     
_________________________________________________________________
max_pooling1d_5 (MaxPooling1 (None, 592, 32)           0         
_________________________________________________________________
flatten_6 (Flatten)          (None, 18944)             0         
_________________________________________________________________
dense_7 (Dense)              (None, 10)                189450    
_________________________________________________________________
dense_8 (Dense)              (None, 1)                 11        
Total params: 2,517,893
Trainable params: 2,517,893
Non-trainable params: 0
_________________________________________________________________
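
The output shapes and parameter counts in the summary can be checked by hand. A short sketch of the arithmetic, assuming the convolution uses the Keras defaults of stride 1 and `'valid'` padding:

```python
# Values taken from the notebook cells above.
vocab_size = 23028
embedding_dim = 100
max_review_length = 1192
filters, kernel_size = 32, 8
pool_size = 2
dense_units = 10

# One 100-dimensional vector per word index.
embedding_params = vocab_size * embedding_dim
# A 'valid' convolution shortens the sequence by kernel_size - 1.
conv_len = max_review_length - kernel_size + 1
# Each filter has kernel_size * embedding_dim weights plus one bias.
conv_params = (kernel_size * embedding_dim + 1) * filters
# Max pooling halves the length, rounding down.
pool_len = conv_len // pool_size
flat_size = pool_len * filters
# Dense layers: one weight per input per unit, plus one bias per unit.
dense_params = flat_size * dense_units + dense_units
output_params = dense_units * 1 + 1

total = embedding_params + conv_params + dense_params + output_params
print(conv_len, pool_len, flat_size)  # 1185 592 18944
print(embedding_params, conv_params, dense_params, output_params)
print(total)  # 2517893
```

Over 90% of the parameters sit in the embedding layer, which has to be learned from only 1600 training reviews.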


<img src="./model.png"></img>

In [182]:
# fit network
model.fit(X_train_encoded_padded, y_train, epochs=10, verbose=2)


Epoch 1/10
 - 26s - loss: 0.6864 - acc: 0.5212
Epoch 2/10
 - 25s - loss: 0.5488 - acc: 0.7756
Epoch 3/10
 - 24s - loss: 0.1197 - acc: 0.9881
Epoch 4/10
 - 24s - loss: 0.0071 - acc: 0.9994
Epoch 5/10
 - 24s - loss: 0.0024 - acc: 1.0000
Epoch 6/10
 - 24s - loss: 0.0014 - acc: 1.0000
Epoch 7/10
 - 24s - loss: 9.4401e-04 - acc: 1.0000
Epoch 8/10
 - 25s - loss: 6.8990e-04 - acc: 1.0000
Epoch 9/10
 - 24s - loss: 5.0874e-04 - acc: 1.0000
Epoch 10/10
 - 24s - loss: 3.9727e-04 - acc: 1.0000


<keras.callbacks.History at 0x127f9a5d0>

In [183]:
# save the model
model.save('model.h5' )

<a id='metric'></a>
### Evaluate Model

#### Clean the Test data

In [184]:
# collect the cleaned test reviews
X_test_clean = list()

In [185]:
# clean the test data
for review in X_test:
    X_test_clean.append(cleanup_review(review))

In [186]:
print(len(X_test_clean))

400


#### Encode the test data

In [187]:
# use tokenizer to encode the test data
X_test_encoded_mat = tokenizer.texts_to_sequences(X_test_clean)

In [188]:
#X_test_encoded_mat

In [189]:
X_test_encoded_padded = pad_sequences(X_test_encoded_mat, maxlen=max_review_length, padding='post')

In [190]:
# evaluate the test data

_, acc = model.evaluate(X_test_encoded_padded, y_test, verbose=0)

In [191]:
print( 'Test Accuracy: %f'  % (acc*100))

Test Accuracy: 85.250000
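
An accuracy of 85.25% on 400 test reviews corresponds to 341 correct predictions. For a sigmoid output, accuracy is the share of reviews whose predicted probability falls on the correct side of a cut-off. A minimal sketch of that calculation, assuming a 0.5 threshold; `binary_accuracy` here is a hypothetical helper and the data is made up:

```python
def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of predictions whose thresholded probability matches the label."""
    hits = sum(int(p > threshold) == t for t, p in zip(y_true, y_prob))
    return hits / float(len(y_true))

# Made-up labels and sigmoid outputs, for illustration only.
y_true = [0, 1, 1, 0]
y_prob = [0.20, 0.90, 0.40, 0.10]
print(binary_accuracy(y_true, y_prob))  # 0.75
```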
