![MLU Logo](../data/MLU_Logo.png)

# <a name="0">Machine Learning Accelerator - Natural Language Processing - Lecture 3</a>

## Final Project: Neural Networks and Recurrent Neural Networks (RNNs) for the IMDB Movie Review Dataset

__Dataset:__ Sentiment (positive or negative) analysis of movie reviews. The dataset is originally hosted here: http://ai.stanford.edu/~amaas/data/sentiment/

We continue to work on our final project dataset. This time, you will see how Neural Networks, Recurrent Neural Networks (RNNs), and their variants, GRU and LSTM, perform at predicting the sentiment of review texts. If you are interested in trying Transformers, this is a good place for that too!

Use the notebooks from the class to implement the model, then train and test it with the corresponding datasets.
You can follow these steps:
1. Read the training and test data (Given)
2. Train a classifier (Implement)
3. Make predictions on your test dataset (Implement)

## 1. Reading the dataset

We will use the __pandas__ library to read our dataset.

#### __Training data:__

In [2]:
import pandas as pd

train_df = pd.read_csv('../data/final_project/imdb_train.csv', header=0)
train_df.head()

| Unnamed: 0 | text | label |
|---|---|---|
| 0 | This movie makes me want to throw up every tim... | 0 |
| 1 | Listening to the director's commentary confirm... | 0 |
| 2 | One of the best Tarzan films is also one of it... | 1 |
| 3 | Valentine is now one of my favorite slasher fi... | 1 |
| 4 | No mention if Ann Rivers Siddons adapted the m... | 0 |


#### __Test data:__

In [3]:
import pandas as pd

test_df = pd.read_csv('../data/final_project/imdb_test.csv', header=0)
test_df.head()

| Unnamed: 0 | text | label |
|---|---|---|
| 0 | What I hoped for (or even expected) was the we... | 0 |
| 1 | Garden State must rate amongst the most contri... | 0 |
| 2 | There is a lot wrong with this film. I will no... | 1 |
| 3 | To qualify my use of "realistic" in the summar... | 1 |
| 4 | Dirty War is absolutely one of the best politi... | 1 |


In [4]:
#Explore distribution of classes
print(train_df['label'].value_counts())
print(test_df['label'].value_counts())

#Explore NaNs
print(train_df.isnull().values.any())
print(test_df.isnull().values.any())

0    12500
1    12500
Name: label, dtype: int64
0    12500
1    12500
Name: label, dtype: int64
False
False


In [5]:
#Download relevant nltk packages
import nltk, re
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/anshbordia/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/anshbordia/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/anshbordia/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [6]:
#Stop words removal
#Not all stop words are bad in this scenario. We will retain some helpful words as shown below
from nltk.corpus import stopwords
excluding = ['against', 'not', 'don', "don't",'ain', 'aren', "aren't", 'couldn', "couldn't",
             'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 
             'haven', "haven't", 'isn', "isn't", 'mightn', "mightn't", 'mustn', "mustn't",
             'needn', "needn't",'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', 
             "weren't", 'won', "won't", 'wouldn', "wouldn't"]

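#Take the negation words above out of the stop word list so they survive preprocessing.
#A plain set difference is used: symmetric_difference would also add any word from
#`excluding` that is missing from the NLTK list, which is the opposite of what we want.
accepted_stopwords = list(set(stopwords.words('english')).difference(excluding))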
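
To illustrate why negation words are kept, here is a minimal sketch on a toy example. The stop word set, the `drop_stopwords` helper, and the sample review are all hypothetical and stand in for the NLTK list and the real data; only the idea of subtracting the retained words from the stop word list comes from the cell above.

```python
# Hypothetical miniature stop word list; the real one comes from nltk.corpus.stopwords
toy_stopwords = {"this", "is", "a", "the", "not", "was"}
# Negation words we want to keep, mirroring the idea behind `excluding`
toy_excluding = {"not"}

def drop_stopwords(sentence, stop_words):
    """Return the sentence with every word found in stop_words removed."""
    return " ".join(w for w in sentence.split() if w not in stop_words)

review = "this movie is not good"
all_removed = drop_stopwords(review, toy_stopwords)
negation_kept = drop_stopwords(review, toy_stopwords - toy_excluding)

print(all_removed)    # movie good
print(negation_kept)  # movie not good
```

With the full stop word list the negative review collapses to "movie good", which reads as positive; keeping "not" preserves the signal the classifier needs.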

In [7]:
#Setup tokenizer and lemmatizer for preprocessing
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer 

lemmatizer = WordNetLemmatizer()

In [49]:
def preprocess(text):
    """
    Convert to lower case
    Remove leading and trailing white space
    Remove intermediate extra white space
    Remove HTML tags
    Drop stop words and purely numeric tokens
    Lemmatize individual words
    """
    processed_sentences = []
    for sentence in text:
        temp = sentence.lower().strip()
        #Raw strings avoid invalid escape sequence warnings for patterns such as \s
        temp = re.sub(r'\s+', ' ', temp)
        temp = re.sub(r'<.*?>', '', temp)

        selected_words = []
        for word in word_tokenize(temp):
            if word not in accepted_stopwords and not word.isnumeric():
                selected_words.append(lemmatizer.lemmatize(word))

        processed_sentences.append(" ".join(selected_words))
    return processed_sentences
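
The text cleaning steps at the start of `preprocess` can be tried in isolation. The sketch below uses only the standard `re` module on a made-up sample string (the string and the variable names `step` and `spaced` are illustrative, not part of the assignment).

```python
import re

# Made-up review fragment with extra spaces and IMDB-style <br /> tags
sample = "  A GREAT   film.<br /><br />Loved it!  "

# Same order of operations as preprocess: lower, strip, collapse spaces, drop tags
step = sample.lower().strip()
step = re.sub(r"\s+", " ", step)   # collapse runs of white space into one space
step = re.sub(r"<.*?>", "", step)  # non-greedy match removes each tag separately
print(step)                        # a great film.loved it!

# Possible variation: replace tags with a space first, then collapse white space
spaced = re.sub(r"\s+", " ", re.sub(r"<.*?>", " ", sample.lower().strip()))
print(spaced)                      # a great film. loved it!
```

Removing tags with an empty string joins the words on either side of the tag. Whether that matters for the final model is untested here; substituting a space is one option to consider if the tokenizer output looks off.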

In [50]:
train_X = preprocess(train_df['text'].values)
test_X = preprocess(test_df['text'].values)

In [10]:
train_y = train_df['label'].values
test_y = test_df['label'].values

!pip install sentence-transformers

In [107]:
#BERT sentence transformer for generating sentence embeddings
#Encoding the full dataset takes too long, so we embed only the first 1000 reviews
from sentence_transformers import SentenceTransformer
bert = SentenceTransformer('bert-base-nli-mean-tokens')

embedded_sentences = bert.encode(train_X[0:1000])
embedded_sentences.shape

(1000, 768)

In [108]:
#Add a timestep axis: the LSTM expects 3D input of shape (samples, timesteps, features)
import numpy as np

embedded_sentences = np.expand_dims(embedded_sentences, axis = 1)

In [109]:
embedded_sentences.shape

(1000, 1, 768)
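
The shape change can be reproduced without running BERT. This is a minimal sketch with a dummy array of zeros standing in for the embeddings; the names `fake_embeddings` and `as_sequences` are illustrative.

```python
import numpy as np

# Stand-in for sentence embeddings: 4 sentences, 768 features each (dummy zeros)
fake_embeddings = np.zeros((4, 768), dtype=np.float32)

# Insert a length-1 "timesteps" axis so the shape is (samples, timesteps, features)
as_sequences = np.expand_dims(fake_embeddings, axis=1)

print(fake_embeddings.shape)  # (4, 768)
print(as_sequences.shape)     # (4, 1, 768)
```

Note that with a single timestep, each review is presented to the LSTM as a sequence of length 1, so the recurrent connections have no earlier steps to draw on.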

In [110]:
#LSTM Training
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Dropout

lstm = Sequential()
lstm.add(LSTM(128, input_shape = (1,768)))
lstm.add(Dropout(0.25))
lstm.add(Dense(1, activation='sigmoid'))
lstm.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(lstm.summary())
lstm.fit(embedded_sentences, train_y[0:1000], epochs=5, batch_size=128)

Model: "sequential_10"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_10 (LSTM)               (None, 128)               459264    
_________________________________________________________________
dropout_8 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_8 (Dense)              (None, 1)                 129       
Total params: 459,393
Trainable params: 459,393
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7fc063b2b6a0>
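
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard LSTM formulation (four gates, each with input weights, recurrent weights, and a bias vector); the helper names are illustrative.

```python
def lstm_param_count(input_dim, units):
    """Standard LSTM: 4 gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)

def dense_param_count(input_dim, units):
    """Fully connected layer: one weight per input-unit pair plus one bias per unit."""
    return input_dim * units + units

lstm_params = lstm_param_count(768, 128)
dense_params = dense_param_count(128, 1)
total_params = lstm_params + dense_params

print(lstm_params, dense_params, total_params)  # 459264 129 459393
```

These match the `Param #` column: 459,264 for the LSTM layer, 129 for the dense layer, and 459,393 in total. Dropout adds no trainable parameters.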

In [112]:
#Testing on 500 reviews
from sklearn.metrics import classification_report

test_embeds = bert.encode(test_X[0:500])
test_embeds = np.expand_dims(test_embeds, axis = 1)


In [114]:
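#predict_classes is not available in recent TensorFlow releases; threshold the sigmoid output at 0.5 instead
preds = (lstm.predict(test_embeds) > 0.5).astype('int32')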
print(classification_report(test_y[0:500], preds))

              precision    recall  f1-score   support

           0       0.78      0.78      0.78       263
           1       0.76      0.76      0.76       237

    accuracy                           0.77       500
   macro avg       0.77      0.77      0.77       500
weighted avg       0.77      0.77      0.77       500
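As a rough check on how the two average rows are formed, the sketch below recomputes them for the precision column from the rounded per-class values in the report. Because the inputs are already rounded to two decimals, the results are approximate.

```python
# Per-class precision and support, copied from the report above
precision = {0: 0.78, 1: 0.76}
support = {0: 263, 1: 237}

total = sum(support.values())

# Macro average: plain mean over classes, ignoring class sizes
macro_avg = sum(precision.values()) / len(precision)

# Weighted average: each class contributes in proportion to its support
weighted_avg = sum(precision[c] * support[c] for c in precision) / total

print(round(macro_avg, 2))     # 0.77
print(round(weighted_avg, 2))  # 0.77
```

The two averages agree here because the 500 test reviews are split almost evenly between the classes (263 vs. 237).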



