# D213 Task 1 Advanced Data Analytics

## Part 1

### A1: Research Question and Data Selection

Research Question:
Is it possible to accurately determine customer sentiment from a customer's review using natural language processing and neural networks?

Data and Rationale:
The data used for this analysis is the "Sentiment Labelled Sentences" dataset, available at the link below. Each record pairs a sentence from a review with a label of 1 or 0, indicating positive or negative sentiment respectively.


https://archive.ics.uci.edu/dataset/331/sentiment+labelled+sentences

### A2: Objectives

The objective of this analysis is to determine the feasibility of using a natural language processing (NLP) neural network to determine a customer's sentiment from a review. Feasibility will be assessed by building an NLP model with TensorFlow. The model should take review text as input and classify the review as having a positive or negative sentiment.

### A3: Neural Network Type

# TODO

## Part 2

### B1: Data Exploration and Cleaning

1. Check for the presence of unusual characters
2. Vocabulary size
3. Proposed word embedding length
4. Statistical justification for the chosen maximum sequence length



In [209]:
# Import Packages
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from sklearn.feature_extraction.text import CountVectorizer

# Download Stopwords early to avoid rerunning
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

ImportError: cannot import name 'trapezoid' from 'sklearn.utils.fixes' (/Users/orlandmalphrus/.local/share/virtualenvs/D213-JXc5fzL4/lib/python3.9/site-packages/sklearn/utils/fixes.py)

In [None]:
# Load the three source files, build a combined DataFrame, and check row counts
amazon_df = pd.read_csv('./data/amazon_cells_labelled.txt', sep='\t', names=['review', 'score'])
imdb_df = pd.read_csv('./data/imdb_labelled.txt', sep='\t', names=['review', 'score'])
yelp_df = pd.read_csv('./data/yelp_labelled.txt', sep='\t', names=['review', 'score'])

print(f'Amazon Count: {amazon_df.shape[0]}')
print(f'IMDB Count: {imdb_df.shape[0]}')
print(f'Yelp Count: {yelp_df.shape[0]}')

# Label Data Source
amazon_df['source'] = 'amz'
imdb_df['source'] = 'imdb'
yelp_df['source'] = 'yelp'

# Join Dataframes 
df = pd.concat([amazon_df, imdb_df, yelp_df], ignore_index=True)
df.head()

In [None]:
df.info()

In [None]:
# Get all Unique chars from the dataset 
# Convert all reviews to a single string and then to a set to get unique characters
unique_chars = set(''.join(df['review']))
print('All Unique Characters:')
print(unique_chars)

In [None]:
non_alpha_numeric_chars = [char for char in unique_chars if not char.isalnum()]
print('Non alpha numeric characters:')
print(non_alpha_numeric_chars)
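
One caveat with the check above is that `str.isalnum()` also returns `True` for non-ASCII letters such as accented characters, so those never appear in the printed list. The sketch below is one possible complementary check, not part of the original analysis: it counts every character that is neither an ASCII letter or digit nor whitespace, so rare symbols stand out from ordinary punctuation. The `sample_reviews` list and the `unusual_char_counts` helper are hypothetical stand-ins for `df['review']`.

```python
from collections import Counter

# Hypothetical sample reviews standing in for df['review']
sample_reviews = [
    "Great phone, works well!",
    "The café scene was awful...",
    "Loved it :)",
]

def unusual_char_counts(texts):
    """Count characters that are neither ASCII letters/digits nor whitespace."""
    counts = Counter()
    for text in texts:
        for char in text:
            is_ascii_alnum = char.isascii() and char.isalnum()
            if not is_ascii_alnum and not char.isspace():
                counts[char] += 1
    return counts

counts = unusual_char_counts(sample_reviews)
# Most frequent first; characters seen only once or twice are worth inspecting
print(counts.most_common())
```

Sorting by frequency makes it easier to separate routine punctuation from characters that may point to encoding problems.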

In [None]:
# Remove non-alphanumeric chars
df['cleaned_review'] = df['review'].apply(lambda x: ''.join([char for char in x if char.isalnum() or char.isspace()]))
# Remove stopwords
df['cleaned_reduced_review'] = df['cleaned_review'].apply(lambda x: ' '.join([word for word in x.split() if word.lower() not in stop_words]))

df.head()
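
A point worth checking for sentiment work is that NLTK's English stop-word list contains negations such as "not" and "no", so removing every stop word can turn "not good" into "good". The sketch below shows one possible way to keep a chosen set of words. It uses a small hand-written stop-word set as a stand-in for the NLTK list, and the `remove_stopwords` helper and its `keep` parameter are hypothetical names, not something used in the cells above.

```python
# Small stand-in for NLTK's English stop-word list (assumption: the real list is much longer)
stop_words = {"the", "is", "a", "was", "not", "no", "it", "this"}

# Words to keep even though they are stop words (an illustrative choice)
negations = {"not", "no"}

def remove_stopwords(text, stop_words, keep=frozenset()):
    """Drop stop words from text, except for any words listed in keep."""
    kept = [
        word for word in text.split()
        if word.lower() not in stop_words or word.lower() in keep
    ]
    return " ".join(kept)

review = "This phone is not good"
without_negation = remove_stopwords(review, stop_words)
with_negation = remove_stopwords(review, stop_words, keep=negations)

print(without_negation)
print(with_negation)
```

Whether keeping negations helps the model is an empirical question; comparing validation accuracy with and without them would be the way to decide.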

In [None]:
# Vocabulary Size
all_words = ' '.join(df['cleaned_reduced_review']).lower().split()
vocabulary = set(all_words)
vocabulary_size = len(vocabulary)
print(f"Vocabulary size: {vocabulary_size}")

In [None]:
# Proposed word embedding length, based on an industry rule of thumb: the fourth root of the vocabulary size (Goldman, 2019)
proposed_embedding_length = round(vocabulary_size ** .25)
print(f"Proposed embedding length: {proposed_embedding_length}")
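
To give a feel for how slowly the fourth-root rule grows, the sketch below applies the same formula to a few vocabulary sizes. The sizes are illustrative assumptions, not values from this dataset, and `embedding_length` is a hypothetical helper name.

```python
def embedding_length(vocabulary_size):
    """Fourth-root rule of thumb, rounded to the nearest integer."""
    return round(vocabulary_size ** 0.25)

# Illustrative vocabulary sizes (assumptions, not measured from the data)
for size in (1_000, 5_000, 10_000, 100_000):
    print(f"vocabulary size {size:>7}: embedding length {embedding_length(size)}")
```

Because the rule grows so slowly, moderate changes in vocabulary size from different cleaning choices shift the suggested length only slightly.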

In [None]:
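# Justification of max sequence length
# Lengths are measured before stop-word removal, so they are an upper bound on the tokenized sequence lengths
df['cleaned_review_length'] = df['cleaned_review'].apply(lambda x: len(x.split()))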
clean_review_length_mean = df['cleaned_review_length'].mean()
clean_review_length_std = df['cleaned_review_length'].std()
clean_review_length_max = df['cleaned_review_length'].max()

print(f"Mean clean_review_length: {clean_review_length_mean}")
print(f"Standard deviation of clean_review_length: {clean_review_length_std}")
print(f"Maximum clean_review_length: {clean_review_length_max}")

# Mean + 2 standard deviations covers roughly 95% of reviews if lengths are approximately normally distributed
cutoff_length = int(clean_review_length_mean + 2 * clean_review_length_std)
print(f"Suggested max sequence length: {cutoff_length}")
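
The 95% figure relies on review lengths being roughly normal, and word counts are often right-skewed. One way to sanity-check the cutoff is to measure the share of reviews it actually covers. The sketch below does this with the standard library on a hypothetical list of lengths standing in for `df['cleaned_review_length']`; the `coverage` helper is an illustrative name.

```python
import statistics

# Hypothetical word counts standing in for df['cleaned_review_length']
lengths = [3, 5, 6, 7, 8, 9, 10, 11, 12, 14,
           15, 18, 20, 22, 25, 30, 35, 40, 55, 70]

def coverage(lengths, cutoff):
    """Fraction of sequences whose length does not exceed the cutoff."""
    return sum(1 for n in lengths if n <= cutoff) / len(lengths)

mean = statistics.mean(lengths)
std = statistics.stdev(lengths)  # sample standard deviation, matching pandas' default .std()
cutoff = int(mean + 2 * std)

print(f"Cutoff: {cutoff}")
print(f"Empirical coverage: {coverage(lengths, cutoff):.2%}")
```

If the measured coverage falls well below the target, a percentile of the length distribution may be a more direct way to choose the cutoff.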


In [None]:
# Test train split
train_df, test_df = train_test_split(df, test_size=0.2, random_state=2)

### B2: Tokenization

In [None]:
tokenizer = Tokenizer(oov_token='OOV')
tokenizer.fit_on_texts(train_df['cleaned_reduced_review'])
train_tokens = tokenizer.texts_to_sequences(train_df['cleaned_reduced_review'])
test_tokens = tokenizer.texts_to_sequences(test_df['cleaned_reduced_review'])

test_tokens
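
To illustrate why the tokenizer is fitted on the training split only, the sketch below is a simplified pure-Python imitation of the idea, not the Keras implementation: words are indexed by frequency using the training texts, index 1 is reserved for the out-of-vocabulary token, and any word not seen in training maps to that index. The `build_word_index` helper, the local `texts_to_sequences` function, and the sample texts are hypothetical.

```python
from collections import Counter

OOV_TOKEN = "OOV"

def build_word_index(texts):
    """Map each word to an integer, most frequent first; index 1 is reserved for OOV."""
    counts = Counter(word for text in texts for word in text.lower().split())
    vocab = [OOV_TOKEN] + [word for word, _ in counts.most_common()]
    return {word: index for index, word in enumerate(vocab, start=1)}

def texts_to_sequences(texts, word_index):
    """Replace each word by its index, falling back to the OOV index for unseen words."""
    oov_index = word_index[OOV_TOKEN]
    return [
        [word_index.get(word, oov_index) for word in text.lower().split()]
        for text in texts
    ]

# Hypothetical training and test texts
train_texts = ["good phone good battery", "bad battery"]
test_texts = ["good screen"]

word_index = build_word_index(train_texts)       # fitted on training data only
train_seq = texts_to_sequences(train_texts, word_index)
test_seq = texts_to_sequences(test_texts, word_index)

print(word_index)
print(test_seq)
```

Here "screen" never appears in the training texts, so it receives the OOV index. This mirrors what the model would encounter with unseen words at inference time.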