## D213 Advanced Data Analytics PA 1
##### Submitted By Edwin Perry
### Table of Contents
<ol>
    <li><a href="#A">Research Question</a></li>
    <li><a href="#B">Data Preparation</a></li>
    <li><a href="#C">Network Architecture</a></li>
    <li><a href="#D">Model Evaluation</a></li>
    <li><a href="#E">Summary and Recommendations</a></li>
    <li><a href="#F">Reporting</a></li>
</ol>
<h4 id="A">Research Question</h4>
<h5>Providing Question</h5>
<p>For this project, I want to determine whether a neural network trained on customer reviews can adequately predict overall customer sentiment.</p>
<h5>Objectives/Goal</h5>
<p>The overall goal of this process is to create a neural network capable of accurately predicting whether a customer's experience with a product or service was positive or negative, based on the textual review the customer has left.

Using a neural network to recognize positive or negative sentiment from a textual review is a relatively complex natural language processing (NLP) task. The dataset includes a binary label indicating whether each review is positive or negative, which serves as the ground truth and allows me to test the model's predictions against the actual values.

The neural network created from this process could serve a number of future uses. Perhaps the most useful is fake review detection. Many products and services are known to suffer from fake reviews submitted by bots rather than by people who have actually used the product or service.
This neural network can flag reviews whose text does not align with the provided rating, helping to root out the most likely fake reviews.</p>
<h5>Type of Neural Network</h5>
<p>Neural networks come in a variety of types, so it is important to identify the type best suited to this analysis. I decided to use a Recurrent Neural Network (RNN) because, rather than simply scoring individual words in the review, it also takes word order into account. For example, if a user were to say "I do not recommend this movie," a model that considers each word independently would likely see "recommend" and conclude that the review is positive. The RNN, however, can take into account that "not" immediately precedes "recommend," recognizing that the review is negative.</p>
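To make the word-order point concrete, the following is a minimal sketch (not part of the analysis pipeline) showing that an order-blind bag-of-words representation cannot distinguish two sentences built from the same words. The example sentences and the helper name `bag_of_words` are hypothetical.

```python
from collections import Counter

def bag_of_words(sentence):
    """Order-blind representation: lowercase word counts only (hypothetical helper)."""
    return Counter(sentence.lower().split())

# Two illustrative sentences with opposite meanings but identical words
positive = "good not bad"
negative = "bad not good"

# Word counts are identical, so a model that only sees counts cannot tell them apart
same_counts = bag_of_words(positive) == bag_of_words(negative)

# The ordered token sequences differ, which is the signal a sequence model can use
different_order = positive.split() != negative.split()

print(same_counts, different_order)
```

A sequence model receives the tokens in order, so the two sentences above arrive as different inputs even though their word counts match.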
<h4 id="B">Data Preparation</h4>
<p>There are a few tasks that need to be performed before the neural network can be created and tested, including some exploratory data analysis. The first step is to import the required libraries/packages and load the data into the Jupyter Notebook.</p>

In [83]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import Dense, Embedding, Flatten
from tensorflow.keras.models import Sequential
import nltk
import string
from nltk import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
nltk.download('stopwords')
pd.set_option("display.max_columns", None)

[nltk_data] Downloading package stopwords to /home/edwinp/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [84]:
import os
# Initialize a dictionary mapping review text -> sentiment label
# (identical review texts collapse into a single key)
data_dict = {}
# Define the folder path
folder_path = './sentiment+labelled+sentences'
full_folder_path = './sentiment+labelled+sentences/sentiment labelled sentences'

# List all data files in the folder, skipping the readme
files = [f for f in os.listdir(full_folder_path) if f != 'readme.txt']

# Initialize a dictionary to store file lengths
file_lengths = {}
print(files)
# Read each file, record its length, and parse each line
for file in files:
    with open(os.path.join(full_folder_path, file), 'r', encoding="utf-8") as f:
        content = f.readlines()
    file_lengths[file] = len(content)
    for line in content:
        # Each line ends with the 0/1 label; strip trailing whitespace first
        # so a final line without a newline is still parsed correctly
        stripped = line.rstrip()
        if not stripped:
            continue
        text = stripped[:-1]
        label = stripped[-1]
        data_dict[text.strip()] = int(label)

# Print the lengths of each file
for file, length in file_lengths.items():
    print(f"{file}: {length} lines")


['amazon_cells_labelled.txt', 'imdb_labelled.txt', 'yelp_labelled.txt']
amazon_cells_labelled.txt: 1000 lines
imdb_labelled.txt: 1000 lines
yelp_labelled.txt: 1000 lines


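<p>Above, we can see that each file contains 1000 lines, giving us 3000 rows of data in total. Because the reviews are stored as dictionary keys, identical review texts collapse into a single entry, so the dictionary holds slightly fewer than 3000 unique reviews. Next, we will look for unusual characters within the data.</p>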
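As a small illustration of why a dictionary keyed by review text can end up with fewer entries than lines read, the sketch below uses made-up rows (not taken from the dataset). Repeated text overwrites the earlier entry, keeping only the last label seen.

```python
# Hypothetical (text, label) rows; the second and third share the same text
rows = [("great phone", 1), ("not good", 0), ("not good", 0), ("loved it", 1)]

reviews = {}
for text, label in rows:
    # Assigning to an existing key overwrites it rather than adding a new entry
    reviews[text] = label

print(len(rows), len(reviews))
```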

In [85]:
import re

# Define a function to check for non-ASCII characters
def contains_unusual_characters(text):
    # Match any run of characters outside the 7-bit ASCII range
    # (accented letters, typographic punctuation, emojis, etc.)
    pattern = re.compile(r'[^\x00-\x7F]+')
    return bool(pattern.search(text))

# Initialize a list to store keys with unusual characters
unusual_keys = []

# Iterate through the keys in data_dict and check for unusual characters
for key in data_dict.keys():
    if contains_unusual_characters(key):
        unusual_keys.append(key)

# Print the keys with unusual characters
print("Keys with unusual characters:")
for key in unusual_keys:
    print(key)
print(len(unusual_keys))

Keys with unusual characters:
It's practically perfect in all of them  a true masterpiece in a sea of faux "masterpieces.
I'm glad this pretentious piece of s*** didn't do as planned by the Dodge stratus Big Shots... It's gonna help movie makers who aren't in the very restrained "movie business" of Québec.
The script iswas there a script?
I'll even say it again  this is torture.
This show is made for Americans - it is too stupid and full with hatred and clichés to be admitted elsewhere.
A cheap and cheerless heist movie with poor characterisation, lots of underbite style stoic emoting (think Chow Yun Fat in A Better Tomorrow) and some cheesy clichés thrown into an abandoned factory ready for a few poorly executed flying judo rolls a la John Woo.
And I forgot: The Casting here i superb, with Trond Fausa Aurvåg being perfect in the role as the Bothersome Man, who doesn't understand where he is, what he is doing and why.
The script is bad, very bad  it contains both cheesiness and une

<p>Only 17 reviews contain these non-ASCII characters. Because this is such a small fraction of the data, I believe that filtering them out of the analysis is justified.</p>
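As an aside, the same check can be expressed without a regular expression. The sketch below assumes Python 3.7 or later, where `str.isascii()` is available; the sample strings and the function name `has_non_ascii` are illustrative only.

```python
import re

NON_ASCII = re.compile(r'[^\x00-\x7F]+')

def has_non_ascii(text):
    """True if any character falls outside the 7-bit ASCII range."""
    return not text.isascii()

samples = ["Great value!", "full of clichés", "Québec", ""]

for s in samples:
    # Both approaches should agree on every string
    assert has_non_ascii(s) == bool(NON_ASCII.search(s))

flags = [has_non_ascii(s) for s in samples]
print(flags)
```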

In [86]:
# Filter out keys that are present in unusual_keys
filtered_data_dict = {key: value for key, value in data_dict.items() if key not in unusual_keys}

# Print the length of the filtered dictionary to verify
print(f"Original data_dict length: {len(data_dict)}")
print(f"Filtered data_dict length: {len(filtered_data_dict)}")

Original data_dict length: 2982
Filtered data_dict length: 2965


Now that these reviews have been removed, we will determine the vocabulary size of the remaining reviews in the dataset.

In [87]:
from collections import Counter
import nltk
# Ensure 'punkt' is downloaded
nltk.download('punkt')
nltk.download('punkt_tab')

# Tokenize the text in the keys of filtered_data_dict
all_words = []
for review in filtered_data_dict.keys():
    tokens = nltk.word_tokenize(review)
    all_words.extend(tokens)

# Count the unique words
vocabulary_size = len(set(all_words))

print(f"Vocabulary size: {vocabulary_size}")

Vocabulary size: 6012


[nltk_data] Downloading package punkt to /home/edwinp/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /home/edwinp/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


Next, we will propose a word embedding length, meaning the number of dimensions in the vector that represents each word. This is determined using the base-2 logarithm of the vocabulary size, rounded up, so that the embedding is large enough to capture meaningful differences between words but small enough to avoid unnecessary computational complexity.

In [88]:
import math

# Calculate the proposed embedding length
proposed_embedding_length = math.ceil(math.log2(vocabulary_size))
print(f"Proposed embedding length: {proposed_embedding_length}")

Proposed embedding length: 13
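
The arithmetic behind this value can be checked by hand: 2^12 = 4096 is smaller than the vocabulary size of 6012, while 2^13 = 8192 is larger, so the rounded-up base-2 logarithm is 13. The sketch below verifies this and, purely for comparison, also computes the fourth root of the vocabulary size, another commonly cited rule of thumb that is not used in this analysis.

```python
import math

vocab_size = 6012  # value reported by the vocabulary-size cell above

# Rounded-up base-2 logarithm, as used in this notebook
log2_dim = math.ceil(math.log2(vocab_size))

# Sanity check: log2_dim is the smallest exponent whose power of two covers the vocabulary
assert 2 ** (log2_dim - 1) < vocab_size <= 2 ** log2_dim

# Alternative rule of thumb (assumption, shown for comparison only): fourth root
fourth_root_dim = math.ceil(vocab_size ** 0.25)

print(log2_dim, fourth_root_dim)
```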


Finally, we will determine the maximum sequence length for this analysis by taking the 95th percentile of review lengths (measured in tokens). This length constraint keeps the inputs compact by truncating the few excessively long reviews while retaining the vast majority of reviews in full.

In [89]:
# Calculate the lengths of the reviews
review_lengths = [len(nltk.word_tokenize(review)) for review in filtered_data_dict.keys()]

# Calculate the 95th percentile of review lengths
max_sequence_length = np.percentile(review_lengths, 95)
max_sequence_length = int(max_sequence_length)
print(f"Statistically justified maximum sequence length: {max_sequence_length}")

Statistically justified maximum sequence length: 30
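
To show what the 95th-percentile cutoff means in practice, here is a minimal pure-Python sketch on a small, made-up list of review lengths. The `percentile` helper is a hypothetical re-implementation of linear interpolation between order statistics, which is assumed here to match NumPy's default behaviour; it is not the code used in this analysis.

```python
def percentile(values, q):
    """Percentile with linear interpolation between sorted values (hypothetical helper)."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100  # fractional index into the sorted list
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction

# Made-up token counts for 21 reviews, including one very long outlier
lengths = [5, 6, 7, 8, 8, 9, 10, 10, 11, 12, 12, 13, 14, 15, 16, 18, 20, 22, 25, 30, 80]

cutoff = int(percentile(lengths, 95))

# Share of reviews that fit within the cutoff without truncation
covered = sum(length <= cutoff for length in lengths) / len(lengths)

print(cutoff, round(covered, 2))
```

In this toy example only the single outlier exceeds the cutoff, so every other review would be kept in full.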


<h5>Goals of Tokenization Process</h5>
<p>The next step in the data preparation is tokenizing the text reviews. Neural networks cannot inherently interpret raw text, so we need to break the text into smaller units that map to numeric values.</p>

In [90]:
# Initialize the tokenizer
tokenizer = Tokenizer(num_words=vocabulary_size, oov_token="<OOV>")
tokenizer.fit_on_texts(filtered_data_dict.keys())

# Convert the text to sequences
sequences = tokenizer.texts_to_sequences(filtered_data_dict.keys())

# Convert the labels to a numpy array
labels = np.array(list(filtered_data_dict.values()))
print(f"Labels shape: {labels.shape}")

Labels shape: (2965,)
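
The following is a simplified, pure-Python sketch of the idea behind this tokenization step: build a word index ordered by frequency, reserve an index for out-of-vocabulary words, and map each review to a list of integers. It is an illustration only and not the Keras implementation; the helper names, the sample sentences, and the punctuation handling are assumptions.

```python
import re
from collections import Counter

OOV = "<OOV>"

def split_words(text):
    """Lowercase and keep only runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

def fit_word_index(texts):
    """Assign integer ids by descending frequency; id 1 is reserved for OOV words."""
    counts = Counter(word for text in texts for word in split_words(text))
    word_index = {OOV: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def to_sequence(text, word_index):
    """Replace each word with its id, falling back to the OOV id for unseen words."""
    return [word_index.get(word, word_index[OOV]) for word in split_words(text)]

train_texts = ["Good phone, good price", "Bad phone"]
word_index = fit_word_index(train_texts)

seq_known = to_sequence("good phone", word_index)
seq_unseen = to_sequence("terrible phone", word_index)
print(seq_known, seq_unseen)
```

The word "terrible" never appeared in the fitted texts, so it maps to the reserved out-of-vocabulary id rather than being dropped.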


<h5>Padding Process Explanation</h5>
<p>Padding is an essential step in preparing this data for the neural network. It standardizes the size of the inputs, which inherently have different lengths. I am going to add the padding to the end of each sequence, using the padding='post' argument. Sequences longer than the maximum sequence length are likewise cut off at the end, using truncating='post'.</p>

In [91]:
# Pad the sequences to ensure they all have the same length
padded_sequences = pad_sequences(sequences, maxlen=max_sequence_length, padding='post', truncating='post')

print(f"Padded sequences shape: {padded_sequences.shape}")

Padded sequences shape: (2965, 30)
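
To illustrate what post-padding and post-truncation do to individual sequences, here is a minimal pure-Python sketch. The helper `pad_post` is hypothetical and only mimics the behaviour described above for a single sequence; the analysis itself relies on `pad_sequences`.

```python
def pad_post(sequence, maxlen, pad_value=0):
    """Cut the sequence at maxlen, then fill the end with pad_value up to maxlen."""
    truncated = sequence[:maxlen]  # post-truncation: drop tokens from the end
    return truncated + [pad_value] * (maxlen - len(truncated))

short = pad_post([28, 47, 6], maxlen=5)
long = pad_post([28, 47, 6, 57, 117, 13, 71], maxlen=5)
print(short, long)
```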


<p>Below, we see one example of a padded sequence. The trailing entries of 0 are the padded values.</p>

In [92]:
# Print a single padded sequence
print(padded_sequences[0])

[  28   47    6   57  117   13   71    8  370    7   12   66   12    2
  185  578    4   76   61    5 2243    0    0    0    0    0    0    0
    0    0]


<h5>Sentiment Categories</h5>
<p>There are only 2 sentiment categories in this analysis, positive and negative, which makes this a binary classification problem. The output layer will therefore use a sigmoid activation function, which produces a value between 0 and 1 that can be compared against a threshold to decide whether a review is classified as positive or negative.</p>
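
As a sketch of how a sigmoid output is turned into a class label, the example below applies the sigmoid function to a few made-up raw scores and thresholds the result. The 0.5 threshold is an assumption (a common default), not a value stated in this analysis.

```python
import math

def sigmoid(z):
    """Squash a raw score into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def to_label(probability, threshold=0.5):
    """1 = positive sentiment, 0 = negative sentiment."""
    return 1 if probability >= threshold else 0

raw_scores = [-2.0, 0.0, 3.0]  # hypothetical pre-activation outputs
probabilities = [sigmoid(z) for z in raw_scores]
predicted = [to_label(p) for p in probabilities]
print([round(p, 3) for p in probabilities], predicted)
```
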
<h5>Steps Explanation</h5>
<p>To review the steps of the data preparation: after importing the data, we filtered out any reviews that contained non-ASCII characters. Then, we determined the vocabulary size, proposed a word embedding length, and calculated a statistically justified maximum sequence length. Next, we tokenized the reviews to convert the raw text into numerical data that the neural network can use, and padded the sequences so that all inputs have a standard length. Now, we will split the data into training, validation, and testing sets using a 70%/15%/15% split. The training set receives the majority of the data so that the model is exposed to the widest possible variety of examples. A validation set is technically optional, but it is useful for detecting overfitting and for fine-tuning hyperparameters, improving the performance of the model. Finally, the held-out test set is used to evaluate the performance of the final model.</p>

In [93]:
# Split the data into training and temporary datasets
X_train, X_temp, y_train, y_temp = train_test_split(padded_sequences, labels, test_size=0.3, random_state=42)

# Split the temporary dataset into validation and test datasets
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

print(f"Training data shape: {X_train.shape}")
print(f"Validation data shape: {X_val.shape}")
print(f"Testing data shape: {X_test.shape}")
print(f"Training labels shape: {y_train.shape}")
print(f"Validation labels shape: {y_val.shape}")
print(f"Testing labels shape: {y_test.shape}")

Training data shape: (2075, 30)
Validation data shape: (445, 30)
Testing data shape: (445, 30)
Training labels shape: (2075,)
Validation labels shape: (445,)
Testing labels shape: (445,)
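
The shapes above follow from applying two splits in sequence: 30% of the rows are held out first, and that held-out portion is then halved. The sketch below reproduces the arithmetic, assuming that scikit-learn rounds the held-out count up and assigns the remaining rows to the other side; the helper `split_sizes` is hypothetical.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_first, n_second) for a fractional test_size (hypothetical helper)."""
    n_second = math.ceil(n_samples * test_size)  # held-out rows, rounded up
    return n_samples - n_second, n_second

n_total = 2965  # rows remaining after filtering

n_train, n_temp = split_sizes(n_total, 0.3)  # roughly 70% train, 30% temporary
n_val, n_test = split_sizes(n_temp, 0.5)     # temporary set halved: validation and test

print(n_train, n_val, n_test)
```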


In [94]:
# Convert the numpy arrays to pandas DataFrames
X_train_df = pd.DataFrame(X_train)
X_test_df = pd.DataFrame(X_test)
y_train_df = pd.DataFrame(y_train, columns=['label'])
y_test_df = pd.DataFrame(y_test, columns=['label'])
# Build the validation DataFrames from the validation split (X_val, y_val)
X_validation_df = pd.DataFrame(X_val)
y_validation_df = pd.DataFrame(y_val, columns=['label'])

# Save the DataFrames to CSV files
X_train_df.to_csv('X_train.csv', index=False)
X_test_df.to_csv('X_test.csv', index=False)
y_train_df.to_csv('y_train.csv', index=False)
y_test_df.to_csv('y_test.csv', index=False)
X_validation_df.to_csv('X_validation.csv', index=False)
y_validation_df.to_csv('y_validation.csv', index=False)