[View in Colaboratory](https://colab.research.google.com/github/aniquetahir/Colaboratory/blob/master/SentimentAnalysis.ipynb)

# Sentiment Analysis using Keras

## Part 1: RNN Analysis


## Prerequisites
- Install Keras
- Install Bokeh for visualizations
- Download the training dataset
- Import Keras Model, Layers

In [1]:
!pip install keras
!pip install bokeh

Collecting bokeh
  Downloading https://files.pythonhosted.org/packages/07/1b/1bb751797f0bbbafc2642c629656ce158e7e7b7fb1110f449f7c320fb819/bokeh-0.13.0.tar.gz (16.0MB)
Collecting packaging>=16.8 (from bokeh)
  Downloading https://files.pythonhosted.org/packages/ad/c2/b500ea05d5f9f361a562f089fc91f77ed3b4783e13a08a3daf82069b1224/packaging-17.1-py2.py3-none-any.whl
Building wheels for collected packages: bokeh
  Running setup.py bdist_wheel for bokeh ... done
  Stored in directory: /root/.cache/pip/wheels/05/3e/43/95ff0bde940a0a5d86ec13c22d2a4bddc97271cd788f441a63
Successfully built bokeh
Installing collected packages: packaging, bokeh
Successfully installed bokeh-0.13.0 packaging-17.1


### Test Bokeh visualizations


In [0]:
from bokeh.plotting import figure, show
from bokeh.io import output_notebook, push_notebook

In [66]:
output_notebook()

### Sentiment Analysis Data
We will use the IMDB movie review dataset that ships with Keras for this experiment.

In [4]:
import numpy as np
import pandas as pd
import keras

Using TensorFlow backend.


In [0]:
from keras.datasets import imdb

In [0]:
#@title ### Dataset Parameters
#@markdown ---
#@markdown #### Number of words to use in classification
num_classification_words = 20000 #@param {type: "number"}
#@markdown #### The word limit for each entry
words_limit = 100 #@param {type: "number"}



In [8]:
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=num_classification_words)

Downloading data from https://s3.amazonaws.com/text-datasets/imdb.npz


#### Data Preprocessing
Our RNN takes sequences of a constant length, so every review is padded or truncated to exactly `words_limit` tokens.


In [0]:
from keras.preprocessing import sequence

In [10]:
x_train_seq = sequence.pad_sequences(x_train, maxlen=words_limit)
x_test_seq = sequence.pad_sequences(x_test, maxlen=words_limit)

print('train shape:', x_train_seq.shape)
print('test shape:', x_test_seq.shape)

train shape: (25000, 100)
test shape: (25000, 100)
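
As a rough illustration of what the padding step does to a single review, here is a minimal pure-Python sketch. It assumes the Keras defaults for `pad_sequences` (zeros are added at the front, and over-long sequences keep their last `maxlen` tokens); the helper name `pad_pre` and the sample reviews are made up for this example.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad or left-truncate one sequence to exactly `maxlen` items."""
    if maxlen < 1:
        raise ValueError("maxlen must be at least 1")
    seq = list(seq)
    if len(seq) >= maxlen:
        # Too long: keep only the last `maxlen` tokens.
        return seq[-maxlen:]
    # Too short: fill the front with the padding value.
    return [value] * (maxlen - len(seq)) + seq


short_review = [1, 14, 22]                 # hypothetical encoded review
long_review = [1, 14, 22, 16, 43, 530]     # hypothetical encoded review

print(pad_pre(short_review, 5))  # [0, 0, 1, 14, 22]
print(pad_pre(long_review, 5))   # [14, 22, 16, 43, 530]
```

With `words_limit = 100`, this is why both arrays above end up with 100 columns regardless of the original review lengths.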


In [0]:
from keras.models import Sequential, Model
from keras.layers import Embedding, SimpleRNN, Dense, Dropout, Activation, Input, LSTM, GRU

## RNN Sentiment Analysis
Here we create a simple RNN that classifies each review as positive or negative.


In [0]:
rnn_input = Input(shape=(words_limit,))
embedding = Embedding(num_classification_words, 128, input_length=words_limit)(rnn_input)
simple_rnn = SimpleRNN(128)(embedding)
dropout = Dropout(0.4)(simple_rnn)
#dense_middle = Dense(128)(dropout)
#dropout2 = Dropout(0.4)(dense_middle)
dense = Dense(1)(dropout)
activation = Activation('sigmoid')(dense)
model = Model(rnn_input, activation)


In [37]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_6 (InputLayer)         (None, 100)               0         
_________________________________________________________________
embedding_6 (Embedding)      (None, 100, 128)          2560000   
_________________________________________________________________
simple_rnn_6 (SimpleRNN)     (None, 128)               32896     
_________________________________________________________________
dropout_8 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_8 (Dense)              (None, 1)                 129       
_________________________________________________________________
activation_6 (Activation)    (None, 1)                 0         
Total params: 2,593,025
Trainable params: 2,593,025
Non-trainable params: 0
_________________________________________________________________
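
The parameter counts in the summary above can be checked by hand. The sketch below assumes the standard layouts (an embedding table of `vocab_size x embed_dim`, a recurrent layer with an input kernel, a recurrent kernel, and a bias, and a dense layer with weights plus bias); the helper function names are made up for this example.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary entry.
    return vocab_size * embed_dim


def simple_rnn_params(input_dim, units):
    # Input kernel + recurrent kernel + bias, per unit.
    return units * (input_dim + units + 1)


def dense_params(input_dim, units):
    # Weights + bias, per output unit.
    return units * (input_dim + 1)


vocab_size, embed_dim, units = 20000, 128, 128

total = (
    embedding_params(vocab_size, embed_dim)
    + simple_rnn_params(embed_dim, units)
    + dense_params(units, 1)
)
print(embedding_params(vocab_size, embed_dim))  # 2560000
print(simple_rnn_params(embed_dim, units))      # 32896
print(dense_params(units, 1))                   # 129
print(total)                                    # 2593025
```

Almost all of the weights sit in the embedding table, so `num_classification_words` has far more influence on model size than the recurrent layer does.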


In [0]:
# from keras.metrics import categorical_accuracy
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

In [39]:
model.fit(x_train_seq, y_train, batch_size=32, epochs=3, validation_data=(x_test_seq, y_test))

Train on 25000 samples, validate on 25000 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7fb148323470>

## LSTM Sentiment Analysis

In [42]:
lstm_input = Input(shape=(words_limit,))
embedding = Embedding(num_classification_words, 128, input_length=words_limit)(lstm_input)
simple_lstm = LSTM(128)(embedding)
dropout = Dropout(0.4)(simple_lstm)
#dense_middle = Dense(128)(dropout)
#dropout2 = Dropout(0.4)(dense_middle)
dense = Dense(1)(dropout)
activation = Activation('sigmoid')(dense)
model = Model(lstm_input, activation)
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_8 (InputLayer)         (None, 100)               0         
_________________________________________________________________
embedding_8 (Embedding)      (None, 100, 128)          2560000   
_________________________________________________________________
lstm_2 (LSTM)                (None, 128)               131584    
_________________________________________________________________
dropout_10 (Dropout)         (None, 128)               0         
_________________________________________________________________
dense_10 (Dense)             (None, 1)                 129       
_________________________________________________________________
activation_8 (Activation)    (None, 1)                 0         
Total params: 2,691,713
Trainable params: 2,691,713
Non-trainable params: 0
_________________________________________________________________


In [43]:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(x_train_seq, y_train, batch_size=32, epochs=3, validation_data=(x_test_seq, y_test))

Train on 25000 samples, validate on 25000 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7fb1472077f0>

In [45]:
score, acc = model.evaluate(x_test_seq, y_test, batch_size=32)
print("Score: ",score)
print("Accuracy: ",acc)

Score:  0.44481423910140994
Accuracy:  0.83812
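
Because the model was compiled with `loss='binary_crossentropy'` and `metrics=['accuracy']`, the "score" is the mean binary cross-entropy and the accuracy is the fraction of correct labels. The sketch below shows how both numbers could be computed from predicted probabilities; it assumes a 0.5 decision threshold, and the labels and probabilities are hypothetical values, not outputs of the model above.

```python
import math


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of predictions on the correct side of the threshold."""
    hits = sum(1 for y, p in zip(y_true, y_prob) if (p > threshold) == bool(y))
    return hits / len(y_true)


labels = [1, 0, 1, 0]            # hypothetical ground truth
probs = [0.9, 0.2, 0.4, 0.6]     # hypothetical sigmoid outputs

print("Score:", binary_crossentropy(labels, probs))
print("Accuracy:", binary_accuracy(labels, probs))  # 0.5
```

The loss penalizes confident mistakes more heavily than hesitant ones, which is why it can move even when the accuracy stays the same.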


## GRU Sentiment Analysis


In [48]:
gru_input = Input(shape=(words_limit,))
embedding = Embedding(num_classification_words, 128, input_length=words_limit)(gru_input)
gru = GRU(128)(embedding)
dropout = Dropout(0.4)(gru)
#dense_middle = Dense(128)(dropout)
#dropout2 = Dropout(0.4)(dense_middle)
dense = Dense(1)(dropout)
activation = Activation('sigmoid')(dense)
model = Model(gru_input, activation)
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_9 (InputLayer)         (None, 100)               0         
_________________________________________________________________
embedding_9 (Embedding)      (None, 100, 128)          2560000   
_________________________________________________________________
gru_1 (GRU)                  (None, 128)               98688     
_________________________________________________________________
dropout_11 (Dropout)         (None, 128)               0         
_________________________________________________________________
dense_11 (Dense)             (None, 1)                 129       
_________________________________________________________________
activation_9 (Activation)    (None, 1)                 0         
Total params: 2,658,817
Trainable params: 2,658,817
Non-trainable params: 0
_________________________________________________________________
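
The recurrent layer sizes reported in the three summaries differ by a simple factor: the GRU has three weight blocks and the LSTM has four, each shaped like a SimpleRNN layer. The sketch below reproduces the reported numbers under that assumption; the function name is made up, and newer Keras releases may report a slightly larger GRU count because they use a different bias layout.

```python
def recurrent_params(input_dim, units, blocks):
    # Each block has an input kernel, a recurrent kernel, and a bias.
    return blocks * units * (input_dim + units + 1)


embed_dim, units = 128, 128

for name, blocks in [("SimpleRNN", 1), ("GRU", 3), ("LSTM", 4)]:
    print(name, recurrent_params(embed_dim, units, blocks))
# SimpleRNN 32896
# GRU 98688
# LSTM 131584
```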


In [49]:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
history = model.fit(x_train_seq, y_train, batch_size=32, epochs=3, validation_data=(x_test_seq, y_test))

Train on 25000 samples, validate on 25000 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


In [51]:
score, acc = model.evaluate(x_test_seq, y_test, batch_size=32)
print("Score: ",score)
print("Accuracy: ",acc)

Score:  0.45425059258461
Accuracy:  0.8454


In [52]:
history.history.keys()

dict_keys(['val_loss', 'val_acc', 'loss', 'acc'])
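
Each key in `history.history` maps to a list with one value per epoch, so the dictionary can also be inspected directly before plotting. The sketch below picks the best epoch for a given metric; the helper `best_epoch` is made up for this example and the numbers in `example_history` are hypothetical, not the results of the training run above.

```python
def best_epoch(history_dict, metric="val_loss", mode="min"):
    """Return (1-based epoch, value) of the best entry for `metric`."""
    values = history_dict[metric]
    pick = min if mode == "min" else max
    best_index = pick(range(len(values)), key=values.__getitem__)
    return best_index + 1, values[best_index]


# Hypothetical values in the same layout as history.history.
example_history = {
    "loss": [0.52, 0.33, 0.24],
    "acc": [0.74, 0.86, 0.91],
    "val_loss": [0.45, 0.39, 0.42],
    "val_acc": [0.80, 0.83, 0.84],
}

print(best_epoch(example_history))                               # (2, 0.39)
print(best_epoch(example_history, metric="val_acc", mode="max"))  # (3, 0.84)
```

Note that some Keras versions name the accuracy keys `accuracy` and `val_accuracy` instead of `acc` and `val_acc`, so it is worth checking `history.history.keys()` first, as done above.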

In [72]:
p = figure(title="Loss History", x_axis_label='Epoch', y_axis_label='Loss')
p.line(range(len(history.history['val_loss'])), history.history['val_loss'], 
       legend="Val. Loss", line_width=2, line_color='orange')
p.line(range(len(history.history['loss'])), history.history['loss'], 
       legend="Loss", line_width=2, line_color='blue')

In [73]:
output_notebook()
show(p)

In [76]:
p = figure(title="Accuracy History", x_axis_label='Epoch', y_axis_label='Accuracy')
p.line(range(len(history.history['val_acc'])), history.history['val_acc'], 
       legend="Val. Acc", line_width=2, line_color='orange')
p.line(range(len(history.history['acc'])), history.history['acc'], 
       legend="Acc", line_width=2, line_color='blue')
output_notebook()
show(p)

## Conclusion
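RNNs and their gated variants (LSTM and GRU) are a convenient way to do sentiment analysis with very little workflow: after three epochs the LSTM and GRU models reach roughly 84% test accuracy. This notebook uses the preprocessed IMDB data that ships with Keras, so a real-life use case would also need its own preprocessing steps, such as tokenizing the text and mapping words to integer indices.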
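
To give a feel for the preprocessing that the IMDB dataset has already done, here is a minimal sketch of turning raw text into the integer sequences the models expect. It loosely follows the conventions of the Keras IMDB loader (index 1 marks the start of a review, index 2 replaces out-of-vocabulary words, and word ranks are shifted by 3); the function name, the regular expression, and the tiny `toy_ranks` vocabulary are assumptions made up for this example. With the real dataset, the rank table would come from `imdb.get_word_index()`.

```python
import re

START, OOV, INDEX_FROM = 1, 2, 3  # assumed to match the Keras IMDB defaults


def encode_review(text, word_rank, num_words):
    """Map raw text to a list of integer word indices."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    encoded = [START]
    for token in tokens:
        rank = word_rank.get(token)
        index = rank + INDEX_FROM if rank is not None else OOV
        # Words outside the top `num_words` indices are treated as unknown.
        encoded.append(index if index < num_words else OOV)
    return encoded


# Hypothetical frequency ranks (1 = most frequent word).
toy_ranks = {"the": 1, "movie": 2, "was": 3, "great": 4, "boring": 5}

print(encode_review("The movie was GREAT!", toy_ranks, num_words=20000))
# [1, 4, 5, 6, 7]
print(encode_review("The plot was great", toy_ranks, num_words=20000))
# [1, 4, 2, 6, 7]
```

The resulting lists would still need to be padded to `words_limit` before being passed to a model.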