# Predict the sentiment of movie reviews using an RNN
- Use a recurrent neural network to classify the sentiment of 25,000 movie reviews as positive or negative.


### Setup Environment

In [None]:
from google.colab import drive
#drive.mount('/content/drive')
drive.mount('/gdrive', force_remount=True)

Mounted at /gdrive


### Kaggle dataset
https://www.kaggle.com/rochachan/bag-of-words-meets-bags-of-popcorn


### Download Kaggle dataset to colab

In [None]:
# Install kaggle support library
!pip install kaggle --quiet

In [None]:
# Make dir for kaggle
!mkdir .kaggle

In [None]:
cp '/gdrive/My Drive/AIML/Kaggle_API_Token/kaggle.json' /content/.kaggle/kaggle.json

In [None]:
!ls '/content/.kaggle/'

kaggle.json


In [None]:
!mkdir ~/.kaggle
!cp /content/.kaggle/kaggle.json ~/.kaggle/kaggle.json
!kaggle config set -n path -v{/content}
!chmod 600 /root/.kaggle/kaggle.json

- path is now set to: {/content}


In [None]:
!ls -la /root/.kaggle/

total 16
drwxr-xr-x 2 root root 4096 Aug  2 11:35 .
drwx------ 1 root root 4096 Aug  2 11:35 ..
-rw------- 1 root root  101 Aug  2 11:35 kaggle.json


In [None]:
!kaggle datasets list

ref                                                         title                                             size  lastUpdated          downloadCount  
----------------------------------------------------------  -----------------------------------------------  -----  -------------------  -------------  
andrewmvd/data-analyst-jobs                                 Data Analyst Jobs                                  2MB  2020-07-14 08:37:57           1714  
vzrenggamani/hanacaraka                                     Aksara Jawa / Hanacaraka                           9MB  2020-07-10 15:09:31             59  
mrgeislinger/bart-ridership                                 BART Ridership                                   325MB  2020-07-09 22:28:07            179  
moezabid/zillow-all-homes-data                              Zillow All Homes Data                              5MB  2020-07-18 11:44:48            732  
mrmorj/restaurant-recommendation-challenge                  Restaurant Recommendat

In [None]:
!kaggle competitions download -c word2vec-nlp-tutorial -p /content

Downloading sampleSubmission.csv to /content
  0% 0.00/276k [00:00<?, ?B/s]
100% 276k/276k [00:00<00:00, 52.4MB/s]
Downloading unlabeledTrainData.tsv.zip to /content
 35% 9.00M/26.0M [00:00<00:00, 34.9MB/s]
100% 26.0M/26.0M [00:00<00:00, 65.5MB/s]
Downloading testData.tsv.zip to /content
 40% 5.00M/12.6M [00:00<00:00, 31.4MB/s]
100% 12.6M/12.6M [00:00<00:00, 61.8MB/s]
Downloading labeledTrainData.tsv.zip to /content
 39% 5.00M/13.0M [00:00<00:00, 43.2MB/s]
100% 13.0M/13.0M [00:00<00:00, 63.4MB/s]


In [None]:
!ls -l /content/

total 53108
-rw-r--r-- 1 root root 13585269 Aug  2 11:36 labeledTrainData.tsv.zip
drwxr-xr-x 1 root root     4096 Jul 30 16:30 sample_data
-rw-r--r-- 1 root root   282796 Aug  2 11:35 sampleSubmission.csv
-rw-r--r-- 1 root root 13258140 Aug  2 11:36 testData.tsv.zip
-rw-r--r-- 1 root root 27243285 Aug  2 11:35 unlabeledTrainData.tsv.zip


### Load required libraries

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

### Load the training set

In [None]:
df = pd.read_csv('labeledTrainData.tsv.zip', delimiter='\t', header=0, quoting=3)

In [None]:
df.shape

(25000, 3)

In [None]:
df.head()

         id  sentiment                                             review
0  "5814_8"          1  "With all this stuff going down at the moment ...
1  "2381_9"          1  "\"The Classic War of the Worlds\" by Timothy ...
2  "7759_3"          0  "The film starts with a manager (Nicholas Bell...
3  "3630_4"          0  "It must be assumed that those who praised thi...
4  "9495_8"          1  "Superbly trashy and wondrously unpretentious ...


### Split into train and test

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df['review'], df['sentiment'], test_size=0.2, random_state=1)

In [None]:
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(20000,) (20000,)
(5000,) (5000,)


### Build a Tokenizer to tokenize the text and create word indices
- Ways to convert words to numbers:
1. Count vectorization (creates the vocabulary and assigns an index to each word)
2. TF-IDF vectorization (creates the vocabulary and assigns an index to each word)
3. Word-embedding (word2vec-style) vectorization (we convert the input to word indices using the Tokenizer class, and the embedding is learned by the neural network)

Steps:
1. Prepare the vocabulary of the dataset (the unique words in the dataset).
2. Assign an index to each unique word in the vocabulary.
3. Use the Keras Tokenizer class to create the vocabulary and assign the indices.
4. Replace all the words in the reviews with their word indices.
5. That is, convert the text in X_train and X_test by replacing each word with its index in the vocabulary.
6. Use the texts_to_sequences method of the Tokenizer class to replace each word with its word index.
7. All sequences in a batch must have the same length, because of matrix multiplication requirements. So we first decide the input length, i.e. the number of words kept per review. This can be based on a statistic such as the median or mean number of words per review.
(Here we assume a maximum input length of 300.)
8. Pad the sequences that are shorter than the maximum length. Padding is usually added at the beginning of the input (pre-padding).
9. Use the Keras pad_sequences method to pad sequences shorter than 300. We use pre-padding because an LSTM remembers recent inputs better than earlier ones, so the leading zeros are largely forgotten.
10. Conceptually, the next step is to one-hot encode the word indices. We do not do this ourselves; Keras handles it for us.
11. That is, once the input is supplied as word indices, two things are needed: converting the word indices to one-hot vectors, and converting those vectors into dense word embeddings.
12. The Embedding layer placed before the LSTM takes care of both steps. It looks up a dense vector for each word index, which gives the same result as multiplying a one-hot vector by the embedding matrix.
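
Steps 1 to 6 can be sketched in plain Python to show what the word index and the index sequences look like. This is only an illustrative stand-in for the Keras Tokenizer: the helper names `build_word_index` and `texts_to_sequences` and the toy sentences are assumptions made here, and the real Tokenizer also strips punctuation, which this sketch does not.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to a rank-based index (1 = most frequent). Index 0 is left free for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() sorts by descending count; ties keep first-seen order
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index, num_words=None):
    """Replace each word with its index, dropping unknown words and words with index >= num_words."""
    sequences = []
    for text in texts:
        seq = []
        for word in text.lower().split():
            idx = word_index.get(word)
            if idx is None:
                continue  # word never seen while building the vocabulary
            if num_words is not None and idx >= num_words:
                continue  # word is outside the kept vocabulary
            seq.append(idx)
        sequences.append(seq)
    return sequences

# Hypothetical toy data, not taken from the review dataset
train_texts = ["the movie was great", "the movie was bad", "great acting"]
word_index = build_word_index(train_texts)
train_seqs = texts_to_sequences(train_texts, word_index)
test_seqs = texts_to_sequences(["the acting was awful"], word_index)

print(word_index)
print(train_seqs)
print(test_seqs)  # "awful" is not in the vocabulary, so it is dropped
```

The vocabulary is built from the training texts only, so words that appear only in the test texts are silently dropped, which mirrors fitting the tokenizer on X_train alone.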

In [None]:
import tensorflow as tf

In [None]:
# top_words is the number of words considered when building the vocabulary. If left unset (None), all words in the dataset are used.
top_words = 10000  # vocabulary size
t = tf.keras.preprocessing.text.Tokenizer(num_words=top_words)  # num_words -> vocabulary size

In [None]:
#Fit tokenizer with actual training data
t.fit_on_texts(X_train.tolist())

In [None]:
#word indices
#t.word_index

### Prepare Training and Test data

In [None]:
X_train[0:1]

6655    "Obvious attack on Microsoft made by people wh...
Name: review, dtype: object

In [None]:
#takes each word in text and replaces with its word index
X_train = t.texts_to_sequences(X_train.tolist())

In [None]:
#X_train[0]

In [None]:
X_test = t.texts_to_sequences(X_test.tolist())

### How many words should be present in each review?

In [None]:
#Define maximum number of words to consider in each review
max_review_length = 300
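
The value 300 is assumed here rather than derived. One way to sanity-check such a choice is to look at the distribution of review lengths after tokenization. The sketch below uses hypothetical toy sequences and assumed helper names (`summarize_lengths`, `coverage`); in the notebook, the same functions could be applied to the tokenized X_train.

```python
import statistics

def summarize_lengths(sequences):
    """Return mean, median, 90th percentile (nearest rank) and max of the sequence lengths."""
    lengths = sorted(len(seq) for seq in sequences)
    n = len(lengths)
    # nearest-rank percentile: ceil(0.9 * n), computed with integer arithmetic
    p90_rank = (90 * n + 99) // 100
    return {
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "p90": lengths[p90_rank - 1],
        "max": lengths[-1],
    }

def coverage(sequences, maxlen):
    """Fraction of sequences that fit within maxlen without being truncated."""
    return sum(len(seq) <= maxlen for seq in sequences) / len(sequences)

# Hypothetical toy sequences with lengths 3, 5, 5, 8, 10, 12, 15, 20, 40 and 120
toy_sequences = [[1] * n for n in (3, 5, 5, 8, 10, 12, 15, 20, 40, 120)]
stats = summarize_lengths(toy_sequences)
print(stats)
print(coverage(toy_sequences, 40))
```

A length near a high percentile keeps most reviews intact while avoiding the memory cost of padding everything to the longest review.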

### Pad the Sequences

In [None]:
#Pad training and test reviews

X_train = tf.keras.preprocessing.sequence.pad_sequences(X_train, maxlen=max_review_length, padding='pre')

X_test = tf.keras.preprocessing.sequence.pad_sequences(X_test, maxlen=max_review_length, padding='pre')

In [None]:
print(X_train.shape, X_test.shape)

(20000, 300) (5000, 300)
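
To make the effect of `padding='pre'` concrete, here is a minimal pure-Python sketch of what happens to short and long sequences. The function name `pad_pre` is an assumption for illustration. It mimics `pad_sequences` with `padding='pre'` and the default `truncating='pre'`, which drops tokens from the start of sequences longer than `maxlen`.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences with `value` and keep only the last `maxlen` tokens of long ones.

    Assumes maxlen >= 1.
    """
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]  # truncating='pre': drop tokens from the start
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded

short_review = [1, 2, 3]
long_review = [1, 2, 3, 4, 5, 6, 7]
padded = pad_pre([short_review, long_review], maxlen=5)
print(padded)
```

Every row now has the same length, which is why the padded arrays above have shape (number of reviews, 300).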


### Build the model

In [None]:
#Initialize a sequential model
tf.keras.backend.clear_session()
model = tf.keras.models.Sequential()

In [None]:
#Add Embedding layer
model.add(tf.keras.layers.Embedding(input_dim=top_words+1, 
                                    output_dim=50, 
                                    input_length=max_review_length))

In [None]:
model.output_shape

(None, 300, 50)
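
The output shape (None, 300, 50) means each of the 300 word indices is replaced by a 50-dimensional vector. The following plain-Python sketch, with a hypothetical 4-word vocabulary and 3-dimensional embeddings, illustrates that looking up a row of the embedding matrix gives the same result as multiplying a one-hot vector by that matrix. The matrix values are made up; in the model they are learned during training.

```python
def one_hot(index, vocab_size):
    """Return a one-hot row vector with a 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(vocab_size)]

def vec_times_matrix(vector, matrix):
    """Multiply a row vector by a matrix given as a list of rows."""
    n_cols = len(matrix[0])
    return [sum(vector[r] * matrix[r][c] for r in range(len(matrix))) for c in range(n_cols)]

# Hypothetical embedding matrix: one row per word index, toy values only
embedding_matrix = [
    [0.05, 0.05, 0.05],  # index 0 (used for padding)
    [0.1, 0.2, 0.3],     # index 1
    [0.4, 0.5, 0.6],     # index 2
    [0.7, 0.8, 0.9],     # index 3
]

sequence = [0, 2, 3]  # a padded toy review
looked_up = [embedding_matrix[i] for i in sequence]
via_one_hot = [vec_times_matrix(one_hot(i, 4), embedding_matrix) for i in sequence]
print(looked_up)
print(looked_up == via_one_hot)
```

The lookup avoids ever materializing the large, sparse one-hot vectors, which is why only word indices need to be passed to the model.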

In [None]:
#Add LSTM layer
model.add(tf.keras.layers.LSTM(units=256, 
                               dropout=0.20, 
                               recurrent_dropout=0.20))



In [None]:
model.output_shape

(None, 256)

In [None]:
#Add Dense layer as output layer
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

In [None]:
model.output_shape

(None, 1)

In [None]:
#compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
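
For reference, the two numbers reported during training and evaluation (loss and accuracy) can be reproduced by hand on a few predictions. The sketch below is a plain-Python approximation of binary cross-entropy and of accuracy with a 0.5 threshold; the labels and probabilities are hypothetical, and the clipping constant `eps` is an assumption rather than a value taken from this notebook.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)), with p clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of predictions whose thresholded probability matches the label."""
    hits = sum((p > threshold) == bool(y) for y, p in zip(y_true, y_prob))
    return hits / len(y_true)

# Hypothetical labels and sigmoid outputs
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.6]
loss = binary_crossentropy(y_true, y_prob)
acc = accuracy(y_true, y_prob)
print(round(loss, 4), acc)
```

Confident wrong predictions are penalized heavily by the logarithm, so the loss can rise even while accuracy stays roughly the same.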

In [None]:
#model summary
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 300, 50)           500050    
_________________________________________________________________
lstm (LSTM)                  (None, 256)               314368    
_________________________________________________________________
dense (Dense)                (None, 1)                 257       
Total params: 814,675
Trainable params: 814,675
Non-trainable params: 0
_________________________________________________________________
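
The parameter counts in the summary can be checked with a few lines of arithmetic. The helper names below are assumptions made for this sketch; the formulas are the standard ones for an embedding table, an LSTM with four gates (each with input weights, recurrent weights and a bias), and a fully connected layer.

```python
def embedding_params(input_dim, output_dim):
    """One output_dim-sized vector per word index."""
    return input_dim * output_dim

def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias vector."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    """Weight matrix plus one bias per output unit."""
    return input_dim * units + units

top_words = 10000
embedding = embedding_params(top_words + 1, 50)  # 10001 * 50
lstm = lstm_params(50, 256)                      # 4 * (256 * (50 + 256) + 256)
dense = dense_params(256, 1)                     # 256 + 1
total = embedding + lstm + dense
print(embedding, lstm, dense, total)
```

These values match the Param # column of the summary: 500050, 314368 and 257, for a total of 814,675.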


In [None]:
# Train the model
model.fit(X_train, y_train, 
          batch_size=32, 
          epochs=20, 
          validation_data=(X_test, y_test))

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x7fad50f15b00>

In [None]:
model.evaluate(X_train, y_train)



[0.004863114561885595, 0.9990000128746033]

In [None]:
model.evaluate(X_test, y_test)



[0.8150085806846619, 0.852400004863739]