# D213 Task 1 Advanced Data Analytics

## Part 1

### A1: Research Question and Data Selection

Research Question:
Is it possible to accurately determine customer sentiment from a customer's review using natural language processing and neural networks?

Data and Rationale:
The data used for this analysis is the "Sentiment Labelled Sentences" dataset, available at the link below. Each record in this dataset provides a sentence taken from a review along with a label of 1 or 0, indicating positive or negative sentiment respectively.


https://archive.ics.uci.edu/dataset/331/sentiment+labelled+sentences

### A2: Objectives

The objective of this analysis is to determine the feasibility of using a natural language processing neural network to classify a customer's sentiment based on a review. Feasibility will be assessed by building an NLP model with TensorFlow. The model should take review text as input and predict whether the review expresses positive or negative sentiment.

### A3: Neural Network Type

# TODO

## Part 2

### B1: Data Exploration

1. Check for the presence of unusual characters
2. Vocabulary size
3. Proposed word embedding length
4. Statistical justification for the chosen maximum sequence length



In [127]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

In [128]:
# Load each source file, build a combined DataFrame, and check row counts
amazon_df = pd.read_csv('./data/amazon_cells_labelled.txt', sep='\t', names=['review', 'score'])
imdb_df = pd.read_csv('./data/imdb_labelled.txt', sep='\t', names=['review', 'score'])
yelp_df = pd.read_csv('./data/yelp_labelled.txt', sep='\t', names=['review', 'score'])

print(f'Amazon Count: {amazon_df.shape[0]}')
print(f'IMDB Count: {imdb_df.shape[0]}')
print(f'Yelp Count: {yelp_df.shape[0]}')

# Label Data Source
amazon_df['source'] = 'amz'
imdb_df['source'] = 'imdb'
yelp_df['source'] = 'yelp'

# Join Dataframes 
df = pd.concat([amazon_df, imdb_df, yelp_df], ignore_index=True)
df.head()

Amazon Count: 1000
IMDB Count: 1000
Yelp Count: 1000


|   | review | score | source |
|---|--------|-------|--------|
| 0 | So there is no way for me to plug it in here i... | 0 | amz |
| 1 | Good case, Excellent value. | 1 | amz |
| 2 | Great for the jawbone. | 1 | amz |
| 3 | Tied to charger for conversations lasting more... | 0 | amz |
| 4 | The mic is great. | 1 | amz |


In [129]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   review  3000 non-null   object
 1   score   3000 non-null   int64 
 2   source  3000 non-null   object
dtypes: int64(1), object(2)
memory usage: 70.4+ KB


In [141]:
# Get all Unique chars from the dataset 
# Convert all reviews to a single string and then to a set to get unique characters
unique_chars = set(''.join(df['review']))
print('All Unique Characters:')
print(unique_chars)

All Unique Characters:
{'+', 'm', 'b', '8', 'p', 'I', 'A', 'v', ')', 'G', 'y', '[', 'é', 'V', 'å', '?', 'a', "'", 'x', '%', 'j', '6', '2', '5', '$', 'n', 's', 'f', '0', 'X', 'k', 'K', ' ', 'N', 'O', 'Y', 'w', '1', 'L', 'h', '4', '.', '*', 'q', 'r', 'E', 't', 'M', 'u', 'B', 'U', ';', 'c', '\x97', '9', '\x96', ']', '"', 'W', '(', 'i', '&', '3', ':', 'C', 'S', 'ê', ',', 'J', '7', 'd', 'F', 'Z', 'H', 'P', '!', 'l', '#', '\x85', 'D', 'z', 'g', 'e', 'T', '/', 'R', '-', 'Q', 'o'}


In [142]:
non_alpha_numeric_chars = [char for char in unique_chars if not char.isalnum()]
print('Non alpha numeric characters:')
print(non_alpha_numeric_chars)

Non alpha numeric characters:
['+', ')', '[', '?', "'", '%', '$', ' ', '.', '*', ';', '\x97', '\x96', ']', '"', '(', '&', ':', ',', '!', '#', '\x85', '/', '-']
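The three escaped entries in the output above (`\x85`, `\x96`, `\x97`) are C1 control code points rather than ordinary punctuation. A minimal sketch for flagging such characters with the standard library follows. The helper names are hypothetical, and reading these code points as Windows-1252 bytes is an assumption about how the source files were encoded, not something the dataset documents.

```python
import unicodedata

def describe_unusual(chars):
    """Map each non-ASCII or control character to its Unicode category."""
    return {
        ch: unicodedata.category(ch)
        for ch in sorted(chars)
        if ord(ch) > 127 or unicodedata.category(ch) == "Cc"
    }

def reinterpret_as_cp1252(ch):
    """Assumption: the character is a Windows-1252 byte that was decoded as Latin-1."""
    return ch.encode("latin-1").decode("cp1252")

# Small sample drawn from the characters printed above
sample = {"a", "\x85", "\x96", "\x97", "\xe9"}
report = describe_unusual(sample)
for ch, category in report.items():
    print(repr(ch), category)

# Under the Windows-1252 assumption these become an ellipsis and two dashes
recovered = {ch: reinterpret_as_cp1252(ch) for ch in ("\x85", "\x96", "\x97")}
print({repr(k): f"U+{ord(v):04X}" for k, v in recovered.items()})
```

Category `Cc` marks control characters, while accented letters such as `é` report `Ll` (lowercase letter) and are legitimate text.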


In [143]:
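# Keep only alphanumeric and whitespace characters; accented letters such as 'é' count as alphanumeric and are kept
df['cleaned_review'] = df['review'].apply(lambda x: ''.join([char for char in x if char.isalnum() or char.isspace()]))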
df.head()

|   | review | score | source | cleaned_review | cleaned_review_length |
|---|--------|-------|--------|----------------|-----------------------|
| 0 | So there is no way for me to plug it in here i... | 0 | amz | So there is no way for me to plug it in here i... | 21 |
| 1 | Good case, Excellent value. | 1 | amz | Good case Excellent value | 4 |
| 2 | Great for the jawbone. | 1 | amz | Great for the jawbone | 4 |
| 3 | Tied to charger for conversations lasting more... | 0 | amz | Tied to charger for conversations lasting more... | 11 |
| 4 | The mic is great. | 1 | amz | The mic is great | 4 |
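To make the cleaning rule concrete, the sketch below applies the same character filter to a single string. The function name `clean_review` is hypothetical. The first example reproduces row 1 of the output above; the second is an invented string showing a side effect worth keeping in mind.

```python
def clean_review(text: str) -> str:
    """Keep alphanumeric and whitespace characters, drop everything else."""
    return "".join(ch for ch in text if ch.isalnum() or ch.isspace())

example = "Good case, Excellent value."
cleaned = clean_review(example)
print(cleaned)               # Good case Excellent value
print(len(cleaned.split()))  # 4

# Punctuation that joins two words with no spaces is deleted, merging the words
merged = clean_review("price/quality")
print(merged)                # pricequality
```

If merged tokens turn out to matter, replacing punctuation with a space instead of deleting it is one possible alternative, at the cost of splitting contractions such as "don't".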


In [144]:
# Vocabulary Size
all_words = ' '.join(df['cleaned_review']).lower().split()
vocab_size = len(set(all_words))
print(f"Vocabulary size: {vocab_size}")

Vocabulary size: 5399
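The proposed word embedding length listed under B1 is not computed in the cells shown. One common rule of thumb, used here as an assumption and not as the author's stated method, is to take the fourth root of the vocabulary size as a starting point. The function name below is hypothetical.

```python
import math

def suggest_embedding_dim(vocab_size: int) -> int:
    """Fourth-root rule of thumb: a heuristic starting point, not a tuned value."""
    return math.ceil(vocab_size ** 0.25)

vocab_size = 5399  # value printed by the vocabulary cell above
embedding_dim = suggest_embedding_dim(vocab_size)
print(f"Suggested embedding length: {embedding_dim}")
```

For a vocabulary of 5,399 words this gives 9. The value is best treated as a lower bound to tune against validation accuracy.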


In [135]:
# Justification of max sequence length
df['cleaned_review_length'] = df['cleaned_review'].apply(lambda x: len(x.split()))
rl_mean = df['cleaned_review_length'].mean()
rl_std = df['cleaned_review_length'].std()
rl_max = df['cleaned_review_length'].max()

print(f"Mean review length: {rl_mean}")
print(f"Standard deviation of review length: {rl_std}")
print(f"Maximum review length: {rl_max}")

# Mean + 2 standard deviations: covers roughly 95% of reviews if lengths are approximately normally distributed
cutoff_length = int(rl_mean + 2 * rl_std)
print(f"Suggested max sequence length: {cutoff_length}")


Mean review length: 11.777666666666667
Standard deviation of review length: 7.8309221893934255
Maximum review length: 70
Suggested max sequence length: 27
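Because the maximum length (70) sits far above the mean, the length distribution is probably right-skewed, so the "mean + 2 standard deviations covers 95%" reading is only approximate. A minimal sketch of an empirical cross-check follows. The function names and the toy lengths are hypothetical and are not drawn from the dataset.

```python
import math

def coverage(lengths, cutoff):
    """Fraction of sequences with at most `cutoff` tokens."""
    return sum(1 for n in lengths if n <= cutoff) / len(lengths)

def percentile_cutoff(lengths, pct=95):
    """Smallest observed length that covers at least pct% of sequences."""
    ordered = sorted(lengths)
    k = max(1, math.ceil(pct * len(ordered) / 100))  # sequences to cover
    return ordered[k - 1]

# Hypothetical right-skewed toy data, for illustration only
toy_lengths = [3, 4, 4, 5, 6, 7, 8, 10, 12, 40]
cutoff = percentile_cutoff(toy_lengths, pct=90)
print(cutoff, coverage(toy_lengths, cutoff))
```

In the notebook, passing `df['cleaned_review_length'].tolist()` to `coverage` with a cutoff of 27 would report the share of reviews that fit without truncation.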
