# Rotten Tomatoes Movie Reviews

## What to do?
Build a model that predicts the sentiment label (one of five classes, 0 to 4) of each movie-review phrase provided.

## What is this?
The dataset is comprised of tab-separated files with phrases from the Rotten Tomatoes dataset. The train/test split has been preserved for the purposes of benchmarking, but the sentences have been shuffled from their original order. Each Sentence has been parsed into many phrases by the Stanford parser. Each phrase has a PhraseId. Each sentence has a SentenceId. Phrases that are repeated (such as short/common words) are only included once in the data.

* train.tsv contains the phrases and their associated sentiment labels. We have additionally provided a SentenceId so that you can track which phrases belong to a single sentence.
* test.tsv contains just phrases. You must assign a sentiment label to each phrase.

## Import Crucial Packages

In [None]:
# suppress warning messages (errors are still raised)
import warnings
warnings.filterwarnings('ignore')

# fundamental data exploration and manipulation
import random
import pandas as pd
import numpy as np
import tensorflow as tf

# train-test split
from sklearn.model_selection import train_test_split

# lstm model
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Embedding, LSTM, Conv1D, GlobalMaxPooling1D, Flatten, MaxPooling1D, GRU, SpatialDropout1D, Bidirectional
from tensorflow.keras.optimizers import Adam

seed = 0
max_features = 10000
max_len = 125

random.seed(seed)
tf.random.set_seed(seed)
np.random.seed(seed)

## Exploratory Data Analysis
We check on the shape and description of each variable contained in the `train.tsv` from this [dataset](https://www.kaggle.com/competitions/sentiment-analysis-on-movie-reviews/data).

Datasets usually come as `.csv` files. `.tsv` files hold tab-separated values, which `pd.read_csv` can read when we pass the argument `sep='\t'`.
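As a small illustration of the `sep` argument, the sketch below reads a tab-separated snippet from an in-memory string instead of a file. The two rows are made-up stand-ins that only mimic the column layout of `train.tsv`.

```python
import io

import pandas as pd

# Hypothetical two-row sample that mimics the columns of train.tsv
sample_tsv = (
    "PhraseId\tSentenceId\tPhrase\tSentiment\n"
    "1\t1\tA series\t2\n"
    "2\t1\tseries\t2\n"
)

# sep="\t" tells pandas that the columns are separated by tabs, not commas
sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t")
print(sample.shape)
print(sample.columns.tolist())
```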

In [None]:
# read .tsv file
train = pd.read_csv("/content/train.tsv", sep="\t")
test = pd.read_csv("/content/test.tsv", sep="\t")
train.head()

| | PhraseId | SentenceId | Phrase | Sentiment |
|---|---|---|---|---|
| 0 | 1 | 1 | A series of escapades demonstrating the adage ... | 1 |
| 1 | 2 | 1 | A series of escapades demonstrating the adage ... | 2 |
| 2 | 3 | 1 | A series | 2 |
| 3 | 4 | 1 | A | 2 |
| 4 | 5 | 1 | series | 2 |


In [None]:
train.shape

(156060, 4)

The result above shows that the training set has 156,060 rows and 4 columns, so we know how many labelled phrases are available for fitting the model.

In [None]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 156060 entries, 0 to 156059
Data columns (total 4 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   PhraseId    156060 non-null  int64 
 1   SentenceId  156060 non-null  int64 
 2   Phrase      156060 non-null  object
 3   Sentiment   156060 non-null  int64 
dtypes: int64(3), object(1)
memory usage: 4.8+ MB


In [None]:
train.describe()

| | PhraseId | SentenceId | Sentiment |
|---|---|---|---|
| count | 156060.0 | 156060.0 | 156060.0 |
| mean | 78030.5 | 4079.732744 | 2.063578 |
| std | 45050.785842 | 2502.764394 | 0.893832 |
| min | 1.0 | 1.0 | 0.0 |
| 25% | 39015.75 | 1861.75 | 2.0 |
| 50% | 78030.5 | 4017.0 | 2.0 |
| 75% | 117045.25 | 6244.0 | 3.0 |
| max | 156060.0 | 8544.0 | 4.0 |


In [None]:
train['Phrase'][0]

'A series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story .'

In [None]:
train["Sentiment"].value_counts()
train['Sentiment']

0         1
1         2
2         2
3         2
4         2
         ..
156055    2
156056    1
156057    3
156058    2
156059    2
Name: Sentiment, Length: 156060, dtype: int64

## Data Preprocessing

After understanding our dataset, we normalize the text by lowercasing every phrase with `.lower()`, so that the same word written with different capitalization is treated as one token.

We then tokenize the phrases: the tokenizer builds a vocabulary from the training text and replaces each word with an integer index, keeping only the most frequent words (bounded by `max_features`).

In [None]:
# lowercase every phrase in both sets
x = train['Phrase'].str.lower()
x_test = test['Phrase'].str.lower()

# build the vocabulary on the training phrases only, then map words to integer indices
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(x)
x = tokenizer.texts_to_sequences(x)
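To make the tokenization step concrete, here is a simplified pure-Python sketch of the idea: rank words by frequency, then replace each word with its rank. It is not the Keras implementation (for example, it skips punctuation filtering), and the helper names `build_word_index` and `to_sequences` are made up for this illustration.

```python
from collections import Counter


def build_word_index(texts):
    """Map each word to an integer rank, where 1 is the most frequent word."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def to_sequences(texts, word_index, num_words=None):
    """Replace words with their indices, dropping indices >= num_words if given."""
    sequences = []
    for text in texts:
        ids = [word_index[w] for w in text.lower().split() if w in word_index]
        if num_words is not None:
            ids = [i for i in ids if i < num_words]
        sequences.append(ids)
    return sequences


# Made-up phrases: "a" appears 3 times, "good" twice, "movie" once
phrases = ["A good movie", "a good", "A"]
word_index = build_word_index(phrases)
full_sequences = to_sequences(phrases, word_index)
capped_sequences = to_sequences(phrases, word_index, num_words=3)
print(word_index)
print(full_sequences)
print(capped_sequences)
```

Index 0 is never assigned to a word in this sketch, which leaves it free to serve as the padding value.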

Since the sequences have different lengths, we pad them to a common length of `max_len` = 125. By default `pad_sequences` adds the padding value 0 at the start of each sequence, which is why the rows below begin with zeros.

In [None]:
x = pad_sequences(x, maxlen=max_len)
x

array([[   0,    0,    0, ...,    3,    2,   42],
       [   0,    0,    0, ...,   13,    1, 2976],
       [   0,    0,    0, ...,    0,    2,  323],
       ...,
       [   0,    0,    0, ...,    0, 9376, 9377],
       [   0,    0,    0, ...,    0,    0, 9376],
       [   0,    0,    0, ...,    0,    0, 9377]], dtype=int32)
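The padding behaviour seen in the output above can be sketched in plain NumPy. This is an illustrative re-implementation of front ("pre") padding and truncation under that assumption, not the Keras source; `pad_pre` is a hypothetical helper name.

```python
import numpy as np


def pad_pre(sequences, maxlen):
    """Left-pad with zeros (or keep only the last maxlen items) so every row has length maxlen."""
    out = np.zeros((len(sequences), maxlen), dtype=np.int32)
    for row, seq in enumerate(sequences):
        trimmed = seq[-maxlen:]  # keep the last maxlen tokens of long sequences
        if trimmed:
            out[row, maxlen - len(trimmed):] = trimmed
    return out


# Made-up token sequences of different lengths
padded = pad_pre([[1, 2, 3], [4], [5, 6, 7, 8, 9]], maxlen=4)
print(padded)
```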

In [None]:
y = to_categorical(train['Sentiment'])
y

array([[0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 1., 0., 0.],
       ...,
       [0., 0., 0., 1., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 1., 0., 0.]], dtype=float32)
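For intuition, the same one-hot encoding can be reproduced with NumPy alone. This is a minimal sketch of the idea rather than the Keras implementation, and `one_hot` is a hypothetical helper name.

```python
import numpy as np


def one_hot(labels, num_classes):
    """Turn integer class labels into rows with a single 1 at the label's position."""
    labels = np.asarray(labels, dtype=int)
    return np.eye(num_classes, dtype=np.float32)[labels]


# Three example labels out of the five sentiment classes (0 to 4)
encoded = one_hot([1, 2, 3], num_classes=5)
print(encoded)
```

Taking `encoded.argmax(axis=1)` recovers the original integer labels.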

The `to_categorical` call above converts the integer sentiment labels into a one-hot (binary class) matrix that can be fed to the network. We then apply the same fitted tokenizer and padding to the test phrases.

In [None]:
x_test = tokenizer.texts_to_sequences(x_test)
x_test = pad_sequences(x_test, maxlen=max_len)
x_test

array([[   0,    0,    0, ...,  614, 1024,  392],
       [   0,    0,    0, ...,  614, 1024,  392],
       [   0,    0,    0, ...,    0,    0,   16],
       ...,
       [   0,    0,    0, ...,    2,  126, 5916],
       [   0,    0,    0, ...,    2,  126, 5916],
       [   0,    0,    0, ...,    0,  373, 2014]], dtype=int32)

## Train-Test Split

Next, we split the labelled data into a training set and a validation set. With `test_size=0.25`, 75% of the rows go to the training set and the remaining 25% are held out for validation.

In [None]:
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.25, random_state=seed)

In [None]:
x_train.shape

(117045, 125)

In [None]:
x_val.shape

(39015, 125)

The shapes confirm the split: `x_val` receives 25% of the rows (39,015) and the remaining 75% (117,045) go to `x_train`.
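The two shapes can be checked with a little arithmetic. The sketch below assumes the held-out count is the rounded-up product of the row count and `test_size`, with the rest used for training.

```python
import math

n_samples = 156060  # rows in train.tsv, taken from train.shape
test_size = 0.25    # value passed to train_test_split

# Assumption: the held-out count is rounded up and the remainder is used for training
n_val = math.ceil(n_samples * test_size)
n_train = n_samples - n_val

print(n_train, n_val)  # 117045 39015
```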

## Model Building

We'll use an LSTM (Long Short-Term Memory) network as our model. It suits this dataset because it is designed for sequential data and can learn long-term dependencies between time steps, here the words of a phrase.

![lstm](https://miro.medium.com/v2/resize:fit:720/format:webp/1*ryIzdQtDwrdx_sJHdufrEQ.png)

The [figure](https://towardsdatascience.com/tutorial-on-lstm-a-computational-perspective-f3417442c2cd#:~:text=LSTM%20equations,-The%20figure%20below&text=This%20is%20one%20timestep%20input,the%20LSTM%20for%20this%20timestep.) above illustrates the equations computed by an LSTM cell, which is organized around three gates:

* Forget Gate
* Input Gate
* Output Gate
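The three gates can be sketched in NumPy for a single time step. This follows the standard textbook LSTM equations; the function name `lstm_step`, the stacked weight layout, and the gate ordering are choices made for this illustration and do not mirror how Keras stores its weights.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W: (4 * units, input_dim), U: (4 * units, units), b: (4 * units,)
    Rows are stacked as [forget, input, output, candidate] (illustrative ordering).
    """
    units = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    f = sigmoid(z[0:units])               # forget gate: how much old cell state to keep
    i = sigmoid(z[units:2 * units])       # input gate: how much new information to write
    o = sigmoid(z[2 * units:3 * units])   # output gate: how much cell state to expose
    g = np.tanh(z[3 * units:])            # candidate values for the cell state
    c_t = f * c_prev + i * g
    h_t = o * np.tanh(c_t)
    return h_t, c_t


# Tiny example with all-zero weights, so every gate evaluates to 0.5
input_dim, units = 3, 2
W = np.zeros((4 * units, input_dim))
U = np.zeros((4 * units, units))
b = np.zeros(4 * units)
h_t, c_t = lstm_step(np.ones(input_dim), np.zeros(units), np.ones(units), W, U, b)
print(h_t, c_t)
```

With zero weights the forget gate keeps half of the previous cell state, which makes the role of each gate easy to verify by hand.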

Since we are classifying the sentiment of each phrase, which depends on word order and context, an LSTM is a reasonable choice for this dataset.

Let's create a `Sequential()` model to begin applying layers of LSTM.

In [None]:
model = Sequential()

model.add(Embedding(max_features, 100, mask_zero=True))
model.add(LSTM(128, dropout=0.4, recurrent_dropout=0.4, return_sequences=True))
model.add(LSTM(64, dropout=0.4, recurrent_dropout=0.4, return_sequences=True))
model.add(LSTM(32, dropout=0.5, recurrent_dropout=0.5, return_sequences=False))
model.add(Dense(5, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer=Adam(learning_rate=0.001), metrics=['accuracy'])
model.summary()



Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 100)         1000000   
                                                                 
 lstm (LSTM)                 (None, None, 128)         117248    
                                                                 
 lstm_1 (LSTM)               (None, None, 64)          49408     
                                                                 
 lstm_2 (LSTM)               (None, 32)                12416     
                                                                 
 dense (Dense)               (None, 5)                 165       
                                                                 
Total params: 1,179,237
Trainable params: 1,179,237
Non-trainable params: 0
_________________________________________________________________
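The parameter counts in the summary can be reproduced by hand. The sketch below uses the usual formulas for embedding, LSTM, and dense layers; the helper names are made up for this check.

```python
def embedding_params(vocab_size, embed_dim):
    # one embedding vector per vocabulary entry
    return vocab_size * embed_dim


def lstm_params(input_dim, units):
    # four weight sets (three gates plus the candidate), each with
    # input weights, recurrent weights, and a bias
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    # weight matrix plus one bias per output unit
    return input_dim * units + units


counts = [
    embedding_params(10000, 100),  # embedding
    lstm_params(100, 128),         # lstm
    lstm_params(128, 64),          # lstm_1
    lstm_params(64, 32),           # lstm_2
    dense_params(32, 5),           # dense
]
total = sum(counts)
print(counts, total)
```

The values match the `Param #` column and the reported total of 1,179,237.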


## Model Training

Training requires choosing the number of epochs and the batch size; here we use 5 epochs and a batch size of 2048. For the loss we use `categorical_crossentropy`, the usual choice for multi-class classification with one-hot encoded labels.
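As a rough illustration of what this loss measures, the NumPy sketch below computes softmax probabilities and the categorical cross-entropy for a single example. The logits are made up, and the helper functions are simplified stand-ins rather than the Keras implementations.

```python
import numpy as np


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Negative log-probability assigned to the true class."""
    y_pred = np.clip(y_pred, eps, 1.0)
    return float(-np.sum(y_true * np.log(y_pred)))


logits = np.array([0.5, 1.0, 3.0, 0.2, -1.0])  # made-up scores for the 5 classes
probs = softmax(logits)
y_true = np.array([0.0, 0.0, 1.0, 0.0, 0.0])   # one-hot label for class 2
loss = categorical_crossentropy(y_true, probs)
print(probs, loss)
```

The loss shrinks towards 0 as the probability assigned to the true class approaches 1.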

In [None]:
epochs = 5
batch_size = 2048 

In [None]:
# the model was already compiled when it was built, so we only need to fit it here
model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=epochs, batch_size=batch_size, verbose=1)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7fa7885619c0>

## Model Prediction

After training our model, we predict a sentiment for each phrase in `test.tsv` and write the results into the `sampleSubmission.csv` template.

In [None]:
submission = pd.read_csv('/content/sampleSubmission.csv')
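# model.predict returns one probability per class, so keep the most likely class for each phrase
probabilities = model.predict(x_test, batch_size=batch_size, verbose=1)
submission['Sentiment'] = np.argmax(probabilities, axis=1)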
submission.to_csv('mrsa_lstm.csv', index=False)
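Because the final `Dense(5, activation='softmax')` layer outputs five probabilities per phrase, a single sentiment label per phrase is obtained by taking the index of the largest probability. The sketch below shows this on made-up probabilities.

```python
import numpy as np

# Made-up softmax outputs for three phrases (each row sums to 1)
probabilities = np.array([
    [0.05, 0.10, 0.70, 0.10, 0.05],
    [0.60, 0.20, 0.10, 0.05, 0.05],
    [0.05, 0.05, 0.10, 0.30, 0.50],
])

# argmax over the class axis picks the most likely sentiment for each row
labels = np.argmax(probabilities, axis=1)
print(labels)  # [2 0 4]
```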



## Conclusion

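There we have it: a stacked LSTM trained on the labelled phrases of `train.tsv` that predicts a sentiment for every phrase in `test.tsv` and writes the results in the `sampleSubmission.csv` format. The validation accuracy reported during training is the measure of how well this model fits the given dataset.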