## Neural Net For Sentiment Analysis

**Objective:**
 Use a Neural Network to predict the sentiment (positive or negative) of a movie review from the Cornell Movie Review dataset.

**Background**: 

* With a LogisticRegression model, we got an accuracy of about 84%. [Refer to this python notebook](./MovieSentiment.ipynb)

* Let's check the sentiment accuracy using a Neural Network.


### Using Neural Network

The following steps are involved in using a Neural Network.

  * [Data preparation](#read_data)
      + **Get the Vocabulary set.**
      + **Split the data into train and test. Use only train data to build Vocabulary set.**
  
  * [Train CNN with Embedding Layer](#nnet)
  
  * [Evaluate the Model](#metric)

<a id='read_data'></a>
### Read the Data

In [1]:
import numpy as np
from sklearn.datasets import load_files

In [2]:
reviews = load_files("./txt_sentoken")

In [3]:
type(reviews)

sklearn.utils.Bunch

In [4]:
reviews.keys()

['target_names', 'data', 'target', 'DESCR', 'filenames']

In [5]:
print(reviews.target_names)

['neg', 'pos']


In [6]:
print(reviews.target)

[0 1 1 ... 1 0 0]


In [7]:
len(reviews.data)

2000

In [8]:
## Get the data and target values
X, y = reviews.data, reviews.target

#### Split train and test data

In [9]:
from sklearn.model_selection import train_test_split

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=123)
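
As a rough illustration of what an 80/20 split does to the 2000 reviews, here is a minimal standard-library sketch. `split_train_test` and the toy data are hypothetical, and the shuffle is not the one scikit-learn performs, so the selected reviews would differ; only the resulting sizes (1600 and 400) are meant to match.

```python
import random

def split_train_test(items, labels, test_size=0.20, seed=123):
    """Hypothetical helper: shuffle indices, then hold out a test fraction."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(items) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(items, train_idx), pick(items, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

# Toy stand-ins for the 2000 reviews and their 0/1 labels.
toy_X = ['review %d' % i for i in range(2000)]
toy_y = [i % 2 for i in range(2000)]

toy_X_train, toy_X_test, toy_y_train, toy_y_test = split_train_test(toy_X, toy_y)
print(len(toy_X_train))  # 1600
print(len(toy_X_test))   # 400
```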

#### Build Vocabulary set using Train data

* Clean the train data and collect the frequently used words.
* Make sure to ignore the stop words (very common words such as "this" and "is").

In [11]:
import re
from nltk.corpus import stopwords
import nltk

In [12]:
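# clean the given input text and return a list of tokens
def cleanup_text(text):
    clean_tokens = []
    # build the stop-word set once instead of once per token
    stop_words = set(stopwords.words('english'))

    # convert to lower case
    text = text.lower()

    # replace special (non-word) characters with spaces
    text = re.sub(r'\W', ' ', text)

    # NOTE: this pattern is anchored with ^ and $, so it only matches a text
    # that is a single letter surrounded by whitespace; single-letter words
    # inside a longer review are kept
    text = re.sub(r'^\s+[a-z]\s+$', ' ', text)

    # collapse a whitespace-only text to a single space
    text = re.sub(r'^\s+$', ' ', text)

    # drop the stop words
    for word in nltk.word_tokenize(text):
        if word not in stop_words:
            clean_tokens.append(word)

    return clean_tokens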

In [13]:
print(cleanup_text('1234 hi this movie is nice'))

['1234', 'hi', 'movie', 'nice']


In [14]:
print(cleanup_text('hello 1234 !@#@ 10:30 very boring !!!!. Not a good idea !@!#$ '))

['hello', '1234', '10', '30', 'boring', 'good', 'idea']
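
The single-character pattern used in the cleanup is anchored with `^` and `$`, so it can only match when the whole text is one letter. The sketch below compares it with a word-boundary pattern on a made-up sample; `word_boundary` is an assumed alternative, not what the notebook ran, and switching to it would change the vocabulary counts reported further down.

```python
import re

# Pattern from the cleanup function: matches only if the entire text is
# whitespace, one letter, whitespace.
anchored = r'^\s+[a-z]\s+$'
# Assumed alternative: any single letter standing alone as a word.
word_boundary = r'\b[a-z]\b'

sample = 'x marks a b movie'

kept = re.sub(anchored, ' ', sample)
stripped = re.sub(word_boundary, ' ', sample).split()

print(kept)      # x marks a b movie  (no match, text unchanged)
print(stripped)  # ['marks', 'movie']
```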


#### Identify frequently used words

In [15]:
from collections import Counter

In [16]:
Vocab = Counter()

In [17]:
for cnt in range(len(X_train)):
    tokens = cleanup_text(X_train[cnt])
    Vocab.update(tokens)

In [18]:
len(Vocab)

36164

In [19]:
print(Vocab.most_common(25))

[('film', 7666), ('movie', 4673), ('one', 4669), ('like', 2942), ('even', 2056), ('good', 1962), ('time', 1896), ('story', 1747), ('would', 1701), ('much', 1625), ('character', 1622), ('also', 1588), ('get', 1580), ('well', 1548), ('two', 1512), ('characters', 1485), ('first', 1475), ('see', 1411), ('way', 1332), ('make', 1321), ('really', 1255), ('life', 1231), ('plot', 1228), ('little', 1214), ('films', 1209)]


In [20]:
# keep only the tokens that occur at least twice
min_occurrence = 2
Tokens = [k for k,c in Vocab.items() if c >= min_occurrence]

In [21]:
print(len(Tokens))

23075
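
To make the effect of `min_occurrence` concrete, here is a small sketch on made-up, already-tokenized reviews. The toy data and the `toy_` names are assumptions for illustration; the counting and filtering mirror the cells above.

```python
from collections import Counter

# Toy, already-tokenized "reviews" (assumed data, not from the dataset).
toy_reviews = [
    ['great', 'film', 'great', 'cast'],
    ['dull', 'film', 'weak', 'plot'],
    ['great', 'plot'],
]

toy_vocab = Counter()
for tokens in toy_reviews:
    toy_vocab.update(tokens)

# Words seen only once ('cast', 'dull', 'weak') are dropped.
min_occurrence = 2
toy_tokens = sorted(k for k, c in toy_vocab.items() if c >= min_occurrence)
print(toy_tokens)  # ['film', 'great', 'plot']
```

In the notebook the same filter reduces the vocabulary from 36164 to 23075 words.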


#### Save Vocab list to file

In [22]:
def save_to_file(lines, file_name):
    # join the tokens, one per line
    data = '\n'.join(lines)

    # the with-block closes the file even if the write fails
    with open(file_name, 'w') as f:
        f.write(data)

In [23]:
# save vocab tokens to vocab.txt
save_to_file(Tokens, 'Vocab.txt')

<a id='nnet'></a>
## Build Embedding Layer

* [Load the Vocabulary.](#load_vocab)
* [Encode the training data to numbers (using tokenizer)](#encode_train)
* Use Keras Embedding Layer
* Build CNN model
* Save the model.
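
Conceptually, an embedding layer is a trainable lookup table with one row per word index. The sketch below shows that idea with plain Python lists; the toy sizes and the names `embedding_table` and `embed` are assumptions, and a real Keras `Embedding` layer also updates the rows during training, which this sketch does not do.

```python
import random

random.seed(0)
toy_vocab_size, toy_embedding_dim = 6, 4  # toy sizes for illustration

# One small random vector per word index; row 0 belongs to the padding index.
embedding_table = [
    [random.uniform(-0.05, 0.05) for _ in range(toy_embedding_dim)]
    for _ in range(toy_vocab_size)
]

def embed(sequence):
    """Replace every word index with its row of the table."""
    return [embedding_table[idx] for idx in sequence]

padded_review = [3, 1, 5, 0, 0]  # an encoded review padded with zeros
vectors = embed(padded_review)
print(len(vectors))     # 5, one vector per position
print(len(vectors[0]))  # 4, the embedding dimension
```

With Keras defaults (`mask_zero=False`), the padding index 0 is looked up like any other index, so padded positions also receive a vector.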

<a id='load_vocab'></a>
#### Load Vocabulary

In [24]:
# read the whole vocabulary file and return its text
def read_vocabulary_file(file_name):
    # the with-block closes the file after reading
    with open(file_name, 'r') as file:
        text = file.read()
    return text


In [25]:
# load vocabulary file
Vocab = read_vocabulary_file('./Vocab.txt')

In [26]:
Vocab = set(Vocab.split())

In [27]:
print(len(Vocab))

23075


#### Clean the train data

For each training review, clean the review and drop every word that is NOT in our vocabulary.

In [28]:
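# clean the review
# return the cleaned words as one string
def cleanup_review(text):
    # convert to lower case
    text = text.lower()

    # replace special (non-word) characters with spaces
    text = re.sub(r'\W', ' ', text)

    # NOTE: this pattern is anchored with ^ and $, so it only matches a text
    # that is a single letter surrounded by whitespace; single-letter words
    # are instead removed below if they are not in the vocabulary
    text = re.sub(r'^\s+[a-z]\s+$', ' ', text)

    # split into tokens
    tokens = text.split()

    # keep only the tokens that are in the vocabulary
    clean_tokens = [word for word in tokens if word in Vocab]
    clean_tokens = ' '.join(clean_tokens)
    return clean_tokens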

In [29]:
print( cleanup_review("that was a thrilling movie"))

thrilling movie
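
Because the vocabulary is built from the train data only, any word that never appeared (often enough) in training is silently dropped from later reviews. A minimal sketch of that filtering, using a made-up vocabulary and a hypothetical helper `filter_to_vocab`:

```python
# Toy vocabulary (assumed); the notebook's Vocab holds 23075 words.
toy_vocab = {'thrilling', 'movie', 'plot'}

def filter_to_vocab(text, vocab):
    """Lower-case the text and keep only the words found in vocab."""
    return ' '.join(w for w in text.lower().split() if w in vocab)

seen = filter_to_vocab('That was a THRILLING movie', toy_vocab)
unseen = filter_to_vocab('an unwatchable snoozefest', toy_vocab)

print(repr(seen))    # 'thrilling movie'
print(repr(unseen))  # '' (every word is outside the vocabulary)
```

A review made up entirely of unseen words becomes an empty string, which is later encoded as an all-padding sequence.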


In [30]:
X_train_clean = list()

In [31]:
for cnt in range(len(X_train)):
    clean_text = cleanup_review(X_train[cnt])
    X_train_clean.append(clean_text)

In [32]:
print(len(X_train_clean))

1600


<a id='encode_train'></a>
#### Encode the Training data

In [34]:
from keras.preprocessing.text import Tokenizer

Using Theano backend.


In [35]:
tokenizer = Tokenizer()

In [36]:
tokenizer.fit_on_texts(X_train_clean)

In [37]:
tokenizer

<keras_preprocessing.text.Tokenizer at 0x1c24adf550>

In [38]:
tokenizer.document_count

1600

In [39]:
tokenizer.texts_to_matrix

<bound method Tokenizer.texts_to_matrix of <keras_preprocessing.text.Tokenizer object at 0x1c24adf550>>

In [40]:
print(len(tokenizer.word_counts))

23027


In [41]:
print(len(tokenizer.word_index))

23027


In [42]:
#X_train_encoded_mat = tokenizer.texts_to_matrix(X_train_clean, mode='count')
X_train_encoded_mat = tokenizer.texts_to_sequences(X_train_clean)

In [43]:
#X_train_encoded_mat[0:3]

In [44]:
print(len(X_train_encoded_mat))

1600


In [45]:
# all words in the tokenizer's word index + 1, because index 0 is reserved (it is the padding value)
vocab_size = len(tokenizer.word_index) +1

In [46]:
vocab_size

23028
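
The extra 1 in `vocab_size` comes from how word indices are assigned: the Keras `Tokenizer` numbers words from 1 in order of decreasing frequency and never assigns index 0. The sketch below is a simplified stand-in for that behaviour, not the Keras implementation; `build_word_index` and the toy texts are assumptions.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency, starting at 1 so that 0 stays unused."""
    counts = Counter(word for text in texts for word in text.split())
    ranked = counts.most_common()
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

toy_texts = ['film good film', 'film bad plot good']
toy_word_index = build_word_index(toy_texts)

# Encode each text as a list of word indices.
toy_encoded = [[toy_word_index[w] for w in t.split()] for t in toy_texts]
toy_vocab_size = len(toy_word_index) + 1  # +1 for the unused index 0

print(toy_encoded[0])   # [1, 2, 1]
print(toy_vocab_size)   # 5
```

An embedding table needs a row for every index from 0 to the largest word index, which is why 23027 words lead to a `vocab_size` of 23028.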

In [47]:
## Find the maximum number of words in a review
max_review_length = max([len(s.split()) for s in X_train_clean ])

In [48]:
max_review_length

1192

In [49]:
from keras.preprocessing.sequence import pad_sequences

In [50]:
# the train data is already encoded.

# Keras needs every input sequence to have the same length,
# so pad each review with zeros at the end (padding='post') up to max_review_length
X_train_encoded_padded = pad_sequences(X_train_encoded_mat, maxlen=max_review_length, padding='post')

In [51]:
X_train_encoded_padded[0]

array([1274,  365, 2332, ...,    0,    0,    0], dtype=int32)
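
The padding step can be sketched in plain Python. `pad_post` is a hypothetical helper that mimics `pad_sequences(..., padding='post')` with the default `truncating='pre'`, assuming `maxlen` is at least 1: short sequences get zeros appended, and sequences longer than `maxlen` keep only their last `maxlen` items.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad at the end with `value`; truncate from the front if too long."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep the last maxlen items
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

toy_padded = pad_post([[5, 3], [7, 1, 4, 9, 2]], maxlen=4)
print(toy_padded)  # [[5, 3, 0, 0], [1, 4, 9, 2]]
```

The truncation case matters for the test data: a test review longer than `max_review_length` (which is measured on the train data only) would lose its first words.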

In [52]:
#print(X_train_encoded_padded)

In [53]:
#print(X_train_encoded_mat)

#### Define the Neural Net model

In [65]:
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Embedding

from keras.layers.convolutional import Conv1D
from keras.layers.convolutional import MaxPooling1D

from keras.utils.vis_utils import plot_model

In [68]:
# define the model
def define_model():
    model = Sequential()
    # embedding layer: vocab_size rows, 100 dimensions per word,
    # one input of max_review_length word indices per review
    model.add(Embedding(vocab_size, 100, input_length=max_review_length))
    # add the convolution and max pooling layers
    model.add(Conv1D(filters=32, kernel_size=8, activation='relu'))
    model.add(MaxPooling1D(pool_size=2))
    # flatten the feature maps and classify with two dense layers
    model.add(Flatten())
    model.add(Dense(10, activation='relu'))
    # a single sigmoid unit outputs the probability of the 'pos' class
    model.add(Dense(1, activation='sigmoid'))
    # compile network
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    # summarize defined model
    model.summary()
    #plot_model(model, to_file='model.png', show_shapes=True)
    return model

In [69]:
model = define_model()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_6 (Embedding)      (None, 1192, 100)         2302800   
_________________________________________________________________
conv1d_6 (Conv1D)            (None, 1185, 32)          25632     
_________________________________________________________________
max_pooling1d_6 (MaxPooling1 (None, 592, 32)           0         
_________________________________________________________________
flatten_6 (Flatten)          (None, 18944)             0         
_________________________________________________________________
dense_11 (Dense)             (None, 10)                189450    
_________________________________________________________________
dense_12 (Dense)             (None, 1)                 11        
Total params: 2,517,893
Trainable params: 2,517,893
Non-trainable params: 0
_________________________________________________________________
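
The output shapes and parameter counts in the summary can be re-derived by hand. The sketch below does that arithmetic from the values used in this notebook, assuming the Keras defaults for `Conv1D` and `MaxPooling1D` (no padding, stride 1 for the convolution, stride equal to the pool size for pooling).

```python
vocab_size = 23028
embedding_dim = 100
max_review_length = 1192
filters, kernel_size, pool_size = 32, 8, 2
dense_units = 10

# Embedding: one 100-dimensional vector per word index.
embedding_params = vocab_size * embedding_dim

# Conv1D without padding shortens the sequence by kernel_size - 1.
conv_len = max_review_length - kernel_size + 1
conv_params = kernel_size * embedding_dim * filters + filters  # weights + biases

# Max pooling halves the length (integer division); it has no parameters.
pool_len = conv_len // pool_size
flat_units = pool_len * filters

dense_params = flat_units * dense_units + dense_units
output_params = dense_units * 1 + 1

total_params = embedding_params + conv_params + dense_params + output_params
print(conv_len, pool_len, flat_units)  # 1185 592 18944
print(total_params)                    # 2517893
```

Most of the parameters (about 2.3 million of 2.5 million) sit in the embedding table.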


<img src="./model.png"></img>

In [70]:
# fit network
model.fit(X_train_encoded_padded, y_train, epochs=10, verbose=2)


Epoch 1/10
 - 22s - loss: 0.6880 - acc: 0.5369
Epoch 2/10
 - 22s - loss: 0.4450 - acc: 0.8706
Epoch 3/10
 - 22s - loss: 0.0469 - acc: 0.9994
Epoch 4/10
 - 22s - loss: 0.0042 - acc: 1.0000
Epoch 5/10
 - 22s - loss: 0.0021 - acc: 1.0000
Epoch 6/10
 - 23s - loss: 0.0015 - acc: 1.0000
Epoch 7/10
 - 432s - loss: 0.0011 - acc: 1.0000
Epoch 8/10
 - 23s - loss: 9.2943e-04 - acc: 1.0000
Epoch 9/10
 - 23s - loss: 7.8380e-04 - acc: 1.0000
Epoch 10/10
 - 22s - loss: 6.7627e-04 - acc: 1.0000


<keras.callbacks.History at 0x1c287fe090>

In [71]:
# save the model
model.save('model.h5' )

<a id='metric'></a>
### Evaluate Model

#### Clean the test data

In [72]:
## Translate the test data
X_test_clean = list()

In [73]:
# clean the test data
for cnt in range(len(X_test)):
    clean_text = cleanup_review(X_test[cnt])
    X_test_clean.append(clean_text)

In [74]:
print(len(X_test_clean))

400


#### Encode the test data

In [75]:
# use tokenizer to encode the test data
X_test_encoded_mat = tokenizer.texts_to_sequences(X_test_clean)

In [76]:
#X_test_encoded_mat

In [77]:
X_test_encoded_padded = pad_sequences(X_test_encoded_mat, maxlen=max_review_length, padding='post')

In [78]:
# evaluate the test data

_, acc = model.evaluate(X_test_encoded_padded, y_test, verbose=0)

In [79]:
print( 'Test Accuracy: %f'  % (acc*100))

Test Accuracy: 84.000000
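
With 400 test reviews, an accuracy of 84% corresponds to 336 correct predictions. The sketch below shows, on made-up labels and probabilities, how the two numbers returned by `model.evaluate` are commonly computed for a sigmoid output: the mean binary cross-entropy and the accuracy at a 0.5 threshold. The function names and toy values are assumptions, not Keras internals.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Share of predictions that land on the correct side of the threshold."""
    hits = sum(1 for y, p in zip(y_true, y_prob) if (p > threshold) == bool(y))
    return hits / float(len(y_true))

toy_labels = [1, 0, 1, 0]
toy_probs = [0.9, 0.2, 0.4, 0.6]

toy_loss = binary_crossentropy(toy_labels, toy_probs)
toy_acc = binary_accuracy(toy_labels, toy_probs)
print(toy_acc)  # 0.5, the last two predictions are on the wrong side
```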
