<h2>Building a Sentiment Analysis Model for Movie Reviews Using LSTM</h2>


#### Load data

- Data can be downloaded from [Kaggle](https://www.kaggle.com/c/word2vec-nlp-tutorial/data).
- The dataset contains 25000 movie reviews with their sentiment value (1 -> positive sentiment, 0 -> negative sentiment).
- Unlike the previous exercise, which supplied pre-trained word2vec weights, this exercise learns the word embeddings while training the sentiment analysis model.
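- Download 'labeledTrainData.tsv.zip' from Kaggle for this exercise, unzip it, and place 'labeledTrainData.tsv' in your Google Drive (the loading code below expects it directly under 'MyDrive').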

In [None]:
import pandas as pd

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

In [None]:
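#Load the dataset into memory.
#quoting=3 is csv.QUOTE_NONE: double quotes inside reviews are treated as ordinary characters
df = pd.read_csv('/content/gdrive/MyDrive/labeledTrainData.tsv', header=0, delimiter="\t", quoting=3)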

#Check number of records and columns
print(df.shape)
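
The `quoting=3` argument used when loading the file is the integer value of `csv.QUOTE_NONE` from Python's standard library, which tells the parser to treat double quotes as ordinary characters. Reviews often contain quotation marks, so this avoids misreading them as field delimiters. A minimal sketch of the effect, using the standard `csv` module on a made-up row (the sample text is an assumption for illustration, not taken from the dataset):

```python
import csv
import io

# Made-up sample in the same tab-separated layout (id, sentiment, review).
sample = 'id\tsentiment\treview\n"1_8"\t1\t"A "great" film"\n'

# csv.QUOTE_NONE is the constant behind quoting=3.
quote_none_value = csv.QUOTE_NONE

reader = csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
header, first_row = list(reader)

print(quote_none_value)  # the integer passed as quoting=
print(header)            # column names
print(first_row)         # quote characters are kept as literal text
```

Because the quotes are kept as literal text, they may need to be stripped in a later cleaning step if they are not wanted.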

In [None]:
#Preview some records
df.head()

#### Data Preprocessing

**Split data** into training and test sets

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
#We will use 80% of examples for training and 20% for test
X_train, X_test, y_train, y_test = train_test_split(df['review'],
                                                    df['sentiment'],
                                                    test_size=0.2,
                                                    random_state=42)
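
To make the 80/20 split concrete, here is a simplified pure-Python sketch of a seeded shuffle-and-slice. It is not scikit-learn's implementation and will not reproduce the same row assignment as `train_test_split`; the helper name `simple_split` is hypothetical.

```python
import random

def simple_split(n_rows, test_size=0.2, seed=42):
    """Return (train_indices, test_indices) from a seeded shuffle."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)   # same seed -> same order on every run
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = simple_split(25000)
print(len(train_idx), len(test_idx))
```

Fixing the seed plays the same role as `random_state=42`: rerunning the cell gives the same split, which keeps results comparable between runs.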

**Build a tokenizer** to convert each review into a sequence of word indices

In [None]:
import tensorflow as tf

In [None]:
#Vocab size - we will limit vocabulary to 10000
top_words = 10000

#Build tokenizer
t = tf.keras.preprocessing.text.Tokenizer(num_words=top_words)
t.fit_on_texts(X_train.tolist())

In [None]:
#Replace each word in a review with its index in the tokenizer's vocabulary
X_train = t.texts_to_sequences(X_train.tolist())
X_test = t.texts_to_sequences(X_test.tolist())
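
A simplified pure-Python sketch of what the tokenizer does may help: words are ranked by frequency in the training text (index 1 is the most frequent), and only words whose index is below `num_words` are kept when converting text to sequences. The helper names `fit_word_index` and `to_sequences` are hypothetical, and the sketch skips the punctuation filtering that the Keras `Tokenizer` applies.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index, num_words):
    """Map words to indices, dropping unknown words and indices >= num_words."""
    return [
        [word_index[w] for w in text.lower().split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

train_texts = ["the movie was good", "the movie was bad", "good good film"]
word_index = fit_word_index(train_texts)
sequences = to_sequences(["the film was good"], word_index, num_words=5)
print(word_index)
print(sequences)  # "film" is too rare to fall below num_words, so it is dropped
```

By the same rule, with `num_words=10000` the sequences contain indices from 1 to 9999, and 0 remains free for padding.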

In [None]:
#Check out first training review
X_train[0]

**Pad sequences** to make every review the same length

In [None]:
#Check length of 101st example and 201st example
len(X_train[100]), len(X_train[200])

In [None]:
#Fixed length (in tokens) for every review
max_review_length = 300

#Shorter reviews are zero-padded at the end; longer reviews are truncated
X_train = tf.keras.preprocessing.sequence.pad_sequences(X_train, maxlen=max_review_length, padding='post')
X_test = tf.keras.preprocessing.sequence.pad_sequences(X_test, maxlen=max_review_length, padding='post')

In [None]:
#Check length of 101st example and 201st example again
len(X_train[100]), len(X_train[200])
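
The effect of `pad_sequences` with `padding='post'` can be sketched in plain Python. This assumes the Keras default for the `truncating` argument, which is `'pre'` (tokens are dropped from the start of reviews longer than `maxlen`); the helper name `pad_post` is hypothetical.

```python
def pad_post(sequence, maxlen, value=0):
    """Truncate from the front, then pad with `value` at the end."""
    kept = sequence[-maxlen:]                      # keep the last maxlen tokens
    return kept + [value] * (maxlen - len(kept))   # fill the tail with zeros

short_review = pad_post([5, 8, 2], maxlen=5)
long_review = pad_post([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short_review)
print(long_review)
```

After this step every review has exactly `maxlen` entries, which is why both length checks above return 300.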

#### Build the Graph

In [None]:
#Start a Sequential Model
model = tf.keras.Sequential()

In [None]:
embedding_vector_length=50

Add **Embedding layer**

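Here the word embeddings are learned as part of training the sentiment analysis model. Unlike the last exercise, we do not provide pre-trained Word2Vec weights. The layer is trainable and learns an embedding vector for each word in the vocabulary. Strictly speaking, these are task-specific embeddings trained on the sentiment labels rather than Word2Vec embeddings.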

In [None]:
model.add(tf.keras.layers.Embedding(top_words + 1, #Vocabulary size: must exceed the largest word index (0 is the padding value)
                                    embedding_vector_length, #Embedding size, i.e. 50 in this case
                                    input_length=max_review_length, #Length of each review, i.e. 300 in this case
                                    ))

In [None]:
#Check the shape of the model's output so far
model.output

The output of the Embedding layer is 3-dimensional:

- batch_size x max_review_length (300) x embedding_vector_length (50).
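
An embedding layer behaves like a lookup table: each word index selects one row of a weight matrix. The following pure-Python sketch uses a tiny, randomly initialised table (toy sizes chosen for illustration, not the notebook's 10001 x 50) to show where the three dimensions come from.

```python
import random

vocab_size, embedding_dim = 6, 4          # toy sizes for illustration
rng = random.Random(0)

# One row of embedding_dim weights per word index (row 0 serves the padding value).
table = [[rng.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
         for _ in range(vocab_size)]

def embed(batch):
    """Replace every word index in a batch of sequences with its vector."""
    return [[table[idx] for idx in sequence] for sequence in batch]

batch = [[3, 1, 0], [2, 2, 5]]            # 2 reviews, each padded to length 3
output = embed(batch)
shape = (len(output), len(output[0]), len(output[0][0]))
print(shape)                              # batch x review length x embedding size
```

In the real layer the table rows are weights that are updated by backpropagation during training.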

Let's add an LSTM as the hidden layer

In [None]:
model.add(tf.keras.layers.LSTM(128)) #128 is size of hidden state and cell state
model.add(tf.keras.layers.Dropout(0.25))

Add output layer

In [None]:
#We need one output - probability of positive sentiment
model.add(tf.keras.layers.Dense(1,activation='sigmoid'))

Compile the model

In [None]:
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
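
For reference, the sigmoid output and the binary cross-entropy loss chosen above can be written out directly. This is a plain-Python sketch of the standard formulas for a single example, not the Keras implementation (which also clips probabilities for numerical stability and averages over a batch).

```python
import math

def sigmoid(z):
    """Squash a real-valued score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_crossentropy(y_true, p):
    """Loss for one example with label y_true (0 or 1) and predicted probability p."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

p = sigmoid(2.0)                       # a fairly confident "positive" prediction
loss_if_positive = binary_crossentropy(1, p)
loss_if_negative = binary_crossentropy(0, p)
print(round(p, 4), round(loss_if_positive, 4), round(loss_if_negative, 4))
```

The loss is small when the confident prediction matches the label and large when it does not, which is what pushes the network toward well-calibrated probabilities.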

In [None]:
#Check model
model.summary()
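
The parameter counts reported by `model.summary()` can be checked by hand. The sketch below uses the standard formulas for these layer types (an LSTM has four gates, each with input weights, recurrent weights and a bias). The variable `lstm_units` is introduced here for illustration, and the totals assume default Keras layer settings.

```python
top_words = 10000
embedding_vector_length = 50
lstm_units = 128

# Embedding: one vector of length 50 per row of the lookup table.
embedding_params = (top_words + 1) * embedding_vector_length

# LSTM: 4 gates x (input weights + recurrent weights + bias).
lstm_params = 4 * (lstm_units * (embedding_vector_length + lstm_units) + lstm_units)

# Dense(1): one weight per LSTM unit plus one bias. Dropout has no parameters.
dense_params = lstm_units * 1 + 1

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)
# 500050 91648 129 591827
```

Most of the parameters sit in the embedding table, so the vocabulary size and embedding length have the largest effect on model size.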

#### Execute the graph

In [None]:
model.fit(X_train,y_train,
          epochs=10,
          batch_size=128,
          validation_data=(X_test, y_test))
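
With 25000 reviews and an 80/20 split, training uses 20000 examples, so each epoch consists of a fixed number of batches. A quick calculation, assuming the final partial batch is kept rather than dropped:

```python
import math

n_reviews = 25000
n_test = n_reviews // 5            # 20% held out for testing
n_train = n_reviews - n_test
batch_size = 128

steps_per_epoch = math.ceil(n_train / batch_size)
last_batch_size = n_train - (steps_per_epoch - 1) * batch_size
print(steps_per_epoch, last_batch_size)
# 157 32
```

Each step is one weight update, so 10 epochs correspond to 1570 updates under these assumptions.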

Predict with the trained model

In [None]:
#Feed the 101st test example; the output is the predicted probability of positive sentiment
model.predict(X_test[100:101])
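
`model.predict` returns one probability per input row. To turn it into a class label, a threshold can be applied. The value 0.5 is a common choice but an assumption here, the helper name `to_label` is hypothetical, and the probabilities below are made up for illustration.

```python
def to_label(probability, threshold=0.5):
    """Map a positive-sentiment probability to 1 (positive) or 0 (negative)."""
    return 1 if probability >= threshold else 0

# Hypothetical predictions shaped like the predict output: one row per review.
predictions = [[0.91], [0.12], [0.50]]
labels = [to_label(row[0]) for row in predictions]
print(labels)
```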

Try changing the size of the hidden state / cell state in the LSTM (128 above) to see whether the model improves.