# <center> IMDB Review Sentiment Analysis <br> Using a TensorFlow-Keras Neural Network</center>


#### <center>Chris Davis <br> October, 2021</center>

## Table of Contents:

- ***[Importing Libraries](#Importing-Libraries)***
- ***[Preprocessing Text](#Preprocessing-Text)***
- ***[Adding GloVe embedded layer](#Adding-GloVe-embedded-layer)***
- ***[Model Building](#Model-Building)***
- ***[Model Fit and Performance](#Model-Fit-and-Performance)***
- ***[Citations](#Citations)***

### Importing Libraries

In [None]:
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
# nltk.download('stopwords')
# nltk.download('punkt')
# nltk.download('wordnet')

%matplotlib inline

### Preprocessing Text

In [None]:
#Import raw data file
#df_raw = pd.read_csv('Amazon Labeled Comments.csv', sep=',')
df_raw = pd.read_csv('IMDB Reviews.csv', sep=',')

#View raw data
df_raw.head(10)

***- Perform text standardization, stopword removal, and lemmatization:***

In [None]:
# Define regex cleaning function
def text_clean(data):
    # Convert to lower
    data = data.lower()

    # Collapse newlines and other whitespace runs into a single space
    data = re.sub(r'\s+', ' ', data)

    # Convert ? and ! to .
    data = re.sub(r'[?!]', '.', data)

    # Replace every character except letters and apostrophes with a space
    data = re.sub(r"[^a-zA-Z']", ' ', data)

    # Remove multiple spaces
    data = re.sub(' +', ' ', data)

    return data

# Define remove stopwords function
stop = set(stopwords.words("english"))

def remove_stopwords(text):
    filtered_words = [word for word in text.split() if word not in stop]
    return " ".join(filtered_words)

# Define lemmatization function
Ltzr = WordNetLemmatizer()

def word_lemmatizer(text):
    filtered_words = [Ltzr.lemmatize(word) for word in text.split()]
    return " ".join(filtered_words)

# Create copy of raw data for processing
df_clean = df_raw.copy()

# Clean/standardize text
df_clean["Comment"] = df_clean["Comment"].apply(text_clean)

# Remove stopwords
df_clean["Comment"] = df_clean["Comment"].apply(remove_stopwords)

# Lemmatize
df_clean["Comment"] = df_clean["Comment"].apply(word_lemmatizer)
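
To make the effect of the cleaning steps concrete, the sketch below runs the same regex substitutions on a single made-up review. The sample sentence and the tiny `SAMPLE_STOPWORDS` set are assumptions for illustration only; the notebook itself uses NLTK's full English stopword list, so its output on real reviews will differ.

```python
import re

def text_clean(data):
    # Same substitutions as the notebook's cleaning function
    data = data.lower()
    data = re.sub(r"\s+", " ", data)         # collapse whitespace runs
    data = re.sub(r"[?!]", ".", data)        # ? and ! become .
    data = re.sub(r"[^a-zA-Z']", " ", data)  # keep letters and apostrophes only
    data = re.sub(r" +", " ", data)          # collapse repeated spaces
    return data

# Hypothetical stand-in for NLTK's English stopword list
SAMPLE_STOPWORDS = {"the", "was", "but", "a", "it", "i"}

def remove_stopwords(text, stop=SAMPLE_STOPWORDS):
    return " ".join(word for word in text.split() if word not in stop)

sample = "The plot was GREAT!\nBut the ending... I didn't love it."
cleaned = text_clean(sample)
filtered = remove_stopwords(cleaned)

print(repr(cleaned))
print(repr(filtered))
```

Note that punctuation is replaced by spaces rather than deleted, so words on either side of a period or ellipsis never get fused together.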

***- Define test/train split and tokenize:***

In [None]:
# Split to test/train
X_train, X_test, Y_train, Y_test = train_test_split(df_clean.Comment,
                                                    df_clean.Sentiment,
                                                    random_state=42,
                                                    test_size=0.2)
# Tokenize
tokenizer = Tokenizer(oov_token='OOV')
tokenizer.fit_on_texts(X_train)

X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)

num_words = len(tokenizer.word_index) + 1
unique_words = list(tokenizer.word_index.keys())
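
The `+ 1` in `num_words` exists because index 0 is never assigned to a word; it is reserved for padding. The following is a simplified pure-Python imitation of how a frequency-ranked word index with an out-of-vocabulary token behaves, not the Keras implementation itself. The helper names `build_word_index` and `texts_to_sequences` and the sample texts are hypothetical.

```python
from collections import Counter

def build_word_index(texts, oov_token="OOV"):
    """Rank words by frequency; index 0 stays unused, the OOV token gets 1."""
    counts = Counter(word for text in texts for word in text.split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def texts_to_sequences(texts, word_index, oov_token="OOV"):
    """Map each word to its index, falling back to the OOV index."""
    oov = word_index[oov_token]
    return [[word_index.get(word, oov) for word in text.split()] for text in texts]

train_texts = ["great movie great cast", "terrible movie"]
test_texts = ["great soundtrack"]

word_index = build_word_index(train_texts)
num_words = len(word_index) + 1  # +1 accounts for the reserved padding index 0

print(word_index)
print(num_words)
print(texts_to_sequences(test_texts, word_index))
```

Because the index is built from the training texts only, a word that appears solely in the test set ("soundtrack" here) is mapped to the OOV index instead of leaking test vocabulary into the model.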

***- Pad sequences:***

In [None]:
# View comment length histogram to set max pad length
plt.hist([len(i) for i in X_train])
plt.xlabel("Word Count")
plt.ylabel("Occurrence Count")
plt.title("Comment Length")

In [None]:
# Pad Sequences
max_length = 300
X_train = pad_sequences(X_train, maxlen=max_length)
X_test = pad_sequences(X_test, maxlen=max_length)
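
With its default arguments, `pad_sequences` pads and truncates at the front of each sequence (`padding='pre'`, `truncating='pre'`). The pure-Python sketch below mimics that default behavior on two short sequences; `pad_pre` is a hypothetical helper written for illustration and assumes `maxlen` is at least 1.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences and keep only the last maxlen tokens of long ones."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # drop tokens from the front if too long
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

result = pad_pre([[5, 8], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(result)
```

For reviews longer than `max_length`, this means the beginning of the review is discarded and the ending is kept.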

### Adding GloVe embedded layer

***- Building the pretrained GloVe embedding matrix:***

In [None]:
# Load pretrained GloVe vectors into a word -> vector dictionary
embeddings_index = {}
with open("Glove Data/glove.6B.100d.txt", encoding="utf8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

embedding_matrix = np.zeros((num_words, 100))

for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
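
Words that have no GloVe vector keep an all-zero row in the matrix, so it can be useful to check how much of the vocabulary is actually covered. The sketch below repeats the same loading and lookup logic on a tiny in-memory file; the 3-dimensional vectors and the `word_index` are made up for illustration and stand in for the real 100-dimensional GloVe file.

```python
import io
import numpy as np

# Hypothetical 3-dimensional vectors in the GloVe text format
glove_text = io.StringIO(
    "movie 0.1 0.2 0.3\n"
    "great 0.4 0.5 0.6\n"
)

embeddings_index = {}
for line in glove_text:
    values = line.split()
    embeddings_index[values[0]] = np.asarray(values[1:], dtype="float32")

# Hypothetical tokenizer vocabulary; "zzyzx" has no pretrained vector
word_index = {"OOV": 1, "great": 2, "movie": 3, "zzyzx": 4}
embedding_dim = 3

embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector

covered = sum(word in embeddings_index for word in word_index)
print(f"{covered}/{len(word_index)} words have a pretrained vector")
print(embedding_matrix)
```

Row 0 (padding), the OOV row, and the row for any uncovered word all remain zero, and because the embedding layer is frozen they stay that way during training.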


### Model Building

***- Building Keras sequential model:***

In [None]:
# Create model
model = keras.models.Sequential()
model.add(layers.Embedding(num_words, 100, weights=[embedding_matrix], input_length=max_length, trainable=False))
model.add(layers.SpatialDropout1D(0.4))
model.add(layers.LSTM(64, dropout=0.1))
model.add(layers.Dense(1, activation="sigmoid"))

# Loss function
loss = keras.losses.BinaryCrossentropy(from_logits=False)

#optimizer
optim = keras.optimizers.Adam(learning_rate=0.001)

#Define metrics
metrics = ["accuracy"]

#Define early stopping criteria
early_stopping = EarlyStopping(monitor='val_accuracy', patience=5, mode='max')

#Compile model
model.compile(loss=loss, optimizer=optim, metrics=metrics)

#Print model summary
model.summary()


### Model Composition:
    
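***Embedding Layer:*** Pretrained 100-dimensional GloVe embeddings, frozen during training (`trainable=False`)

***Dropout Layer:*** Spatial dropout (rate 0.4) added to reduce over-fitting

***LSTM Layer:*** A single long short-term memory layer with 64 units

***Dense Layer:*** A single sigmoid unit producing the binary output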


### Model Parameters:

***loss function:***
Binary cross-entropy, computed on probabilities rather than logits (`from_logits=False`), because the sigmoid activation in the output layer has already converted the model output to a probability.
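
As a rough illustration of why the `from_logits` setting has to match the output layer, the NumPy sketch below computes binary cross-entropy two ways: from sigmoid probabilities, and directly from raw logits. The function names and sample values are hypothetical, and this is not the Keras implementation, only the underlying formula.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_from_probs(y_true, p, eps=1e-7):
    """Binary cross-entropy when the model already outputs probabilities."""
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

def bce_from_logits(y_true, z):
    """Same loss from raw logits: max(z, 0) - z*y + log(1 + exp(-|z|))."""
    return float(np.mean(np.maximum(z, 0) - z * y_true + np.log1p(np.exp(-np.abs(z)))))

y = np.array([1.0, 0.0, 1.0])    # made-up labels
z = np.array([2.0, -1.0, 0.5])   # made-up pre-activation outputs

loss_probs = bce_from_probs(y, sigmoid(z))
loss_logits = bce_from_logits(y, z)
print(loss_probs, loss_logits)
```

The two values agree, which is the point: applying the sigmoid in the layer and using the probability form is equivalent to omitting the sigmoid and using the logit form. Applying the sigmoid twice, or not at all, would give a different and incorrect loss.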

***optimizer:***
The Adam optimizer, a stochastic gradient descent variant with adaptive per-parameter step sizes, is used with a learning rate of 0.001.

***stopping criteria:***
The stopping criterion is evaluated on validation accuracy, which here is computed on the test set passed as `validation_data`, since this is the metric I am trying to maximize. If validation accuracy does not improve for 5 consecutive epochs (`patience=5`), training stops before reaching the 50-epoch limit.
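
The patience rule can be sketched in a few lines of plain Python. This is a simplified imitation of the callback's bookkeeping, assuming a "higher is better" metric and no minimum improvement threshold; `stopping_epoch` and the accuracy values are made up for illustration.

```python
def stopping_epoch(val_accuracy, patience):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("-inf")
    wait = 0
    for epoch, acc in enumerate(val_accuracy, start=1):
        if acc > best:
            best = acc
            wait = 0          # improvement resets the counter
        else:
            wait += 1         # no improvement over the best value so far
            if wait >= patience:
                return epoch
    return None

history_sketch = [0.80, 0.84, 0.86, 0.85, 0.86, 0.85, 0.84, 0.85, 0.83]
stop_at = stopping_epoch(history_sketch, patience=5)
print(stop_at)
```

Matching the best value without exceeding it (epoch 5 in this series) does not reset the counter. Also note that `restore_best_weights` defaults to `False` in `EarlyStopping`, so the model keeps the weights from the final epoch rather than from the best one unless that option is set.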

***evaluation metric:***
The model is tuned to maximize prediction accuracy on the held-out reviews.

### Model Fit and Performance

In [None]:
# Fit model to training data
history = model.fit(X_train, Y_train, epochs=50, validation_data=(X_test, Y_test), verbose=2, callbacks=[early_stopping])

In [None]:
# Return prediction loss and accuracy
model.evaluate(X_test, Y_test)
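
The accuracy reported for a sigmoid output is the fraction of reviews whose predicted probability falls on the correct side of a 0.5 threshold. A minimal NumPy sketch of that calculation, using made-up probabilities and labels rather than actual model output:

```python
import numpy as np

def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of predictions on the correct side of the threshold."""
    y_pred = (np.asarray(y_prob) > threshold).astype(int)
    return float(np.mean(y_pred == np.asarray(y_true)))

probs = np.array([0.91, 0.12, 0.55, 0.40])  # hypothetical sigmoid outputs
labels = np.array([1, 0, 0, 1])             # hypothetical true sentiments

acc = binary_accuracy(labels, probs)
print(acc)
```

Accuracy ignores how confident each prediction is, which is why the loss can keep changing between epochs while accuracy stays flat.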

In [None]:
#Create training visualization
val_accuracy = history.history['val_accuracy']
val_loss = history.history['val_loss']  # separate name so the loss object defined earlier is not overwritten
epochs = range(1, len(val_accuracy) + 1)
plt.plot(epochs, val_accuracy)
plt.plot(epochs, val_loss)
plt.title('Validation Performance')
plt.xlabel("Epoch")
plt.ylabel('Accuracy/Loss')
plt.legend(['test_accuracy', 'test_loss'])
plt.show()

In [None]:
# Plot train accuracy vs validation accuracy
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Train vs Validation accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'])
plt.show()

The tuned model correctly predicts review sentiment in the test set with an accuracy of %. 

### Citations

Pennington, Jeffrey, Richard Socher, and Christopher D. Manning (2014). GloVe: Global Vectors for Word Representation.

Oliinyk, Halyna (2017). Word embeddings: exploration, explanation, and exploitation (with code in Python). https://towardsdatascience.com/word-embeddings-exploration-explanation-and-exploitation-with-code-in-python-5dac99d5d795