# Sentiment Analysis
___
#### Description:
Sentiment analysis is the task of classifying a piece of text as expressing either positive or negative sentiment. In this case, the dataset consists of movie reviews, where a positive review has a sentiment label of '1' and a negative review has a sentiment label of '0'. There are many ways to perform sentiment analysis, but in this notebook I will use a many-to-one recurrent neural network (RNN), because RNNs perform well on sequential data.
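
To make "many-to-one" concrete, here is a minimal sketch of a recurrent network that reads a whole sequence but emits a single probability at the end. It is a plain scalar-state RNN, not the LSTM trained later in this notebook, and the function name `many_to_one_rnn`, the weights, and the toy inputs are all hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def many_to_one_rnn(inputs, w_x, w_h, b_h, w_y, b_y):
    """Run a scalar-state RNN over `inputs` and return one probability."""
    h = 0.0  # initial hidden state
    for x in inputs:
        # The hidden state is updated at every time step...
        h = math.tanh(w_x * x + w_h * h + b_h)
    # ...but only the final state is used to produce an output.
    return sigmoid(w_y * h + b_y)

# Hypothetical weights and a toy three-step input sequence
prob = many_to_one_rnn([0.5, -1.0, 2.0], w_x=0.8, w_h=0.5, b_h=0.0, w_y=1.5, b_y=0.0)
print(prob)
```

The single output per sequence is what makes this shape a fit for classification: one review goes in, one sentiment score comes out.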
___
#### Dataset:
The original dataset comes from http://ai.stanford.edu/~amaas/data/sentiment/ and contains 25,000 training examples, but I used a reuploaded version from https://www.kaggle.com/c/word2vec-nlp-tutorial/data. 

Note: Keras provides a preprocessed version of this dataset under `keras.datasets.imdb`, but in this notebook I will preprocess the dataset from scratch.
___
#### Reference:
My intuition behind RNNs and sentiment analysis comes from taking Andrew Ng's Deep Learning course.

The following resources were also helpful guides:

https://machinelearningmastery.com/develop-word-based-neural-language-models-python-keras/

https://machinelearningmastery.com/predict-sentiment-movie-reviews-using-deep-learning/

In [1]:
# Import dependencies
import pandas as pd
import numpy as np
import re
import time

In [2]:
# Read the data
df = pd.read_table('labeledTrainData.tsv')
df.head()

|   | id | sentiment | review |
|---|----|-----------|--------|
| 0 | 5814_8 | 1 | With all this stuff going down at the moment w... |
| 1 | 2381_9 | 1 | \The Classic War of the Worlds\" by Timothy Hi... |
| 2 | 7759_3 | 0 | The film starts with a manager (Nicholas Bell)... |
| 3 | 3630_4 | 0 | It must be assumed that those who praised this... |
| 4 | 9495_8 | 1 | Superbly trashy and wondrously unpretentious 8... |


In [3]:
# Separate data into reviews and sentiments
reviews = df['review']
sentiments = df['sentiment']

In [4]:
# Display a review to get an idea of how to preprocess
reviews[np.random.randint(len(reviews))]

'This has to be one of the most sincere and touching boy-meets-girl movies ever made. While \\Rebel Without a Cause\\" and \\"Say Anything\\" deliver nice portrayals, this movies strips down useless subplots and Hollywood divergences. This movie focuses purely on watching the budding of a beautiful romance. You never doubt for a second that the film will lead towards the romantic pairing of these two people. You almost immediately sense the synergy and the chemistry between Jesse and Celine, and it is simply pure joy to watch them find it. This movie is mostly all dialogue -based. But, every conversation between these too is greatly intriguing. What makes this pairing so romantic is how real it is. How in all that conversation, while often having no real bearing on anything critical, you can sense the nuances as these two become more fond and trusting of each other. This is exactly they way you would dream that you meet that special someone. And what makes it so true is that it is not 

In [5]:
# Define a function to clean a review
def clean_review(review):
    review = re.sub('<[^<>]+>', ' ', review) # Remove html formatting
    review = review.replace('\x96', ' ') # Replace \x96 (an en dash in Windows-1252), which shows up as a box in some reviews
    review = review.lower() # Make letters lowercase
    words = review.split() # Split review into words for further cleaning
    words = [re.sub('[^a-z]', '', word) for word in words] # Remove non-alphabetical characters
    review = ' '.join(words) # Put words back together to form clean review
    
    return review

In [7]:
# Clean every review
clean_reviews = [clean_review(review) for review in reviews]

In [6]:
# Compare an original to a cleaned review
index = np.random.randint(len(reviews))

print('Original:\n', reviews[index])
print('\nCleaned:\n', clean_review(reviews[index]))

Original:
 The film was made in 1942 and with World War 11 around, the movie industry decided to capitalize on the fact that spies were around.<br /><br />The film is fun to watch due to the fabulous dancing of Eleanor Powell. The late Miss Powell was certainly a great hoofer in every sense of the word. She is again paired with a very young looking Red Skelton here. The two of them also starred in \I Dood It.\"<br /><br />Moroni Olsen, who 3 years later, was superb as the interrogating police officer in \"Mildred Pierce\" again appears as an officer asking Powell to deliver an item. Trouble is that Olsen and his rogues are really the Japanese spies.<br /><br />Bert Lahr is his usual brilliant self here and he gets ample support from Virginia O'Brien."

Cleaned:
 the film was made in  and with world war  around the movie industry decided to capitalize on the fact that spies were around the film is fun to watch due to the fabulous dancing of eleanor powell the late miss powell was certai
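
The cleaned output above contains double spaces where tokens such as "1942" and "11" were reduced to empty strings. Because the tokenizer splits on whitespace, this should not change the resulting sequences, but if tidier text is wanted, the empty tokens can be dropped before joining. The sketch below is one way to do that; `clean_review_compact` is a hypothetical name and is not used elsewhere in this notebook.

```python
import re

def clean_review_compact(review):
    """Hypothetical variant of clean_review that drops tokens left empty by cleaning."""
    review = re.sub(r'<[^<>]+>', ' ', review)   # remove HTML tags
    review = review.replace('\x96', ' ')         # replace the stray \x96 character
    review = review.lower()
    words = [re.sub(r'[^a-z]', '', word) for word in review.split()]
    return ' '.join(word for word in words if word)  # skip empty tokens

sample = "The film was made in 1942 and with World War 11 around.<br /><br />Fun!"
compact = clean_review_compact(sample)
print(compact)
```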

In [10]:
# Convert each review to a vector of integers
from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(clean_reviews)

sequences = tokenizer.texts_to_sequences(clean_reviews) # list of vectors

In [11]:
# Display a review in its new representation
print(sequences[index])

[1, 18, 12, 89, 7, 2, 15, 184, 328, 183, 1, 16, 1538, 833, 5, 19, 1, 186, 11, 64, 183, 1, 18, 6, 243, 5, 102, 663, 5, 1, 2664, 1098, 4, 2538, 1, 523, 701, 2538, 12, 421, 3, 83, 7, 169, 272, 4, 1, 687, 53, 6, 171, 15, 3, 51, 181, 282, 794, 127, 1, 103, 4, 92, 76, 2753, 7, 9, 8, 35, 146, 292, 12, 878, 13, 1, 549, 1848, 7, 3910, 4954, 171, 712, 13, 32, 1848, 2141, 2538, 5, 1567, 32, 1075, 6, 11, 2, 23, 22, 61, 1, 850, 6, 23, 621, 514, 1307, 127, 2, 26, 202, 1358, 34, 4416]
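
Each integer above is a word's frequency rank: 1 is the most common word in the corpus, 2 the next, and so on. The following is a simplified pure-Python imitation of that idea, not Keras's actual implementation; the helper names and the toy reviews are hypothetical. It also shows the effect of a `num_words`-style cap: every word still gets an index, but only ranks below the cap appear in the encoded sequences.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to a rank: 1 = most frequent, 2 = next, and so on."""
    counts = Counter(word for text in texts for word in text.split())
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def encode_texts(texts, word_index, num_words):
    """Encode texts as ranks, skipping words ranked num_words or higher."""
    return [
        [word_index[w] for w in text.split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

toy_reviews = ["the movie was great", "the movie was bad", "the plot was thin"]
toy_word_index = fit_word_index(toy_reviews)
toy_sequences = encode_texts(toy_reviews, toy_word_index, num_words=4)

print(len(toy_word_index))  # every distinct word is indexed
print(toy_sequences)        # but only ranks below num_words are emitted
```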


In [12]:
# Choose length of sequence (input)
maxlen = 500
print('Max length:', maxlen) 

# The longest review has a sequence length of 1311, but padding every
# review to that length would make training slow, so sequences are capped at 500.

Max length: 500
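
One way to sanity-check a cap like this is to measure how many reviews fit under it without truncation. The sketch below does that for a few candidate values; the `coverage` helper and the toy lengths are hypothetical, and in this notebook the real lengths would come from the unpadded sequences.

```python
def coverage(lengths, maxlen):
    """Fraction of sequences that fit within maxlen without truncation."""
    return sum(length <= maxlen for length in lengths) / len(lengths)

# Hypothetical sequence lengths; in the notebook these would come from
# [len(seq) for seq in sequences] before padding.
toy_lengths = [120, 230, 310, 450, 480, 520, 760, 1311]

for candidate in (250, 500, 1311):
    print(candidate, coverage(toy_lengths, candidate))
```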


In [13]:
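# Vocab size
# Note: word_index holds every word seen during fitting, regardless of the
# Tokenizer's num_words=5000 cap, so this is larger than the Embedding layer needs.
vocab_size = len(tokenizer.word_index) + 1 # including 0th index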
print('Vocab size: ', vocab_size)

Vocab size:  108639


In [14]:
# Pad sequences to max length
from keras.preprocessing.sequence import pad_sequences

sequences = pad_sequences(sequences, maxlen=maxlen, padding='pre', truncating='pre')
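
With `padding='pre'` and `truncating='pre'`, short sequences get zeros added at the front and long sequences lose tokens from the front, so the end of each review is always kept. A minimal pure-Python sketch of that behaviour for a single sequence follows; `pad_pre` is a hypothetical helper, not part of Keras.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad or left-truncate one sequence to exactly maxlen items (maxlen >= 1)."""
    seq = list(seq)[-maxlen:]                   # truncating='pre': keep the last maxlen tokens
    return [value] * (maxlen - len(seq)) + seq  # padding='pre': zeros go in front

print(pad_pre([7, 8, 9], maxlen=5))           # short sequence
print(pad_pre([1, 2, 3, 4, 5, 6], maxlen=4))  # long sequence
```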

In [16]:
# Get the inputs and outputs ready for training
X = sequences
y = np.array(sentiments)

print('X shape:', X.shape)
print('y shape:', y.shape)

X shape: (25000, 500)
y shape: (25000,)


In [17]:
# Build an RNN
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

model = Sequential([
    Embedding(vocab_size, 32, input_length=X.shape[1]),
    LSTM(100, dropout=0.2),
    Dense(1, activation='sigmoid')
])

model.summary() # summary() prints the table itself; wrapping it in print() also prints "None"

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 500, 32)           3476448   
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 101       
Total params: 3,529,749
Trainable params: 3,529,749
Non-trainable params: 0
_________________________________________________________________
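
The parameter counts in the summary can be reproduced by hand, which is a useful check on the layer sizes. The sketch below uses the standard formulas for embedding, LSTM, and dense layers; the helper function names are hypothetical.

```python
def embedding_params(vocab_size, embed_dim):
    return vocab_size * embed_dim  # one vector per vocabulary index

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    return input_dim * units + units  # weights plus biases

total = embedding_params(108639, 32) + lstm_params(32, 100) + dense_params(100, 1)
print(embedding_params(108639, 32), lstm_params(32, 100), dense_params(100, 1), total)
```

Nearly all of the parameters sit in the embedding layer. Since the tokenizer was created with `num_words=5000`, an embedding sized for 5,000 indices would need only 5000 × 32 = 160,000 parameters.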


In [18]:
# Compile and fit the model to X and y
from keras.callbacks import ModelCheckpoint

# Checkpoint weights after every epoch (optional)
checkpoint = ModelCheckpoint('weights-{epoch:02d}-{val_acc:.2f}.hdf5')

# Compile
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Fit
model.fit(X, y, epochs=15, validation_split=0.2, callbacks=[checkpoint])

Train on 20000 samples, validate on 5000 samples
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<keras.callbacks.History at 0x176d4289710>
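
The "Train on 20000 samples, validate on 5000 samples" line follows from `validation_split=0.2`. According to the Keras documentation, the validation set is taken from the last samples of `X` and `y`, before any shuffling. The sketch below imitates that split on a plain list; `split_last_fraction` is a hypothetical helper, and the exact rounding Keras uses may differ.

```python
def split_last_fraction(samples, validation_split):
    """Hold out the final fraction of samples, leaving order unchanged."""
    n_val = int(len(samples) * validation_split)
    split_at = len(samples) - n_val
    return samples[:split_at], samples[split_at:]

train, val = split_last_fraction(list(range(25000)), validation_split=0.2)
print(len(train), len(val))
print(val[0])  # index of the first held-out sample
```

Because the split is positional, a dataset sorted by label would give an unrepresentative validation set, so it can be worth checking the label balance in the last 20% of the data.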

In [19]:
# Save the model and the summary for it
model.save('sentiment_analysis_movies.h5')

with open('sentiment_analysis_movies.txt', 'w+') as f:
    model.summary(print_fn=lambda x: f.write(x + '\n'))

In [20]:
# Apply model to new examples
example = "This movie was complete trash directed by someone who hasn't even read the script. \
Speaking of the script, it was full of holes and felt that it was written within a week. This \
movie was beyond disappointing. Save yourself 2 hours and avoid this movie at all costs."

In [21]:
# Prepare the example
example = clean_review(example)
example = tokenizer.texts_to_sequences([example])
example = pad_sequences(example, maxlen=maxlen, padding='pre', truncating='pre')

In [22]:
# Make a prediction
model.predict(example)
# Consider values > 0.5 to be positive reviews

array([[4.1450485e-06]], dtype=float32)
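
The sigmoid output is a score between 0 and 1, so turning it into a label only needs the 0.5 threshold mentioned in the comment above. A minimal sketch, with `to_label` as a hypothetical helper:

```python
def to_label(probability, threshold=0.5):
    """Map a sigmoid output to 1 (positive) or 0 (negative)."""
    return 1 if probability > threshold else 0

print(to_label(4.1450485e-06))  # the score predicted above
print(to_label(0.93))           # a hypothetical high score
```

The score for the example review is far below 0.5, so it is classified as negative, which matches the tone of the text.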