### Dataset

The Amazon dataset contains the customer reviews for all listed Electronics products, spanning May 1996 to July 2014. There are 1,689,188 reviews in total, written by 192,403 customers about 63,001 unique products. The data dictionary is as follows:

* asin - Unique ID of the product being reviewed, string
* helpful - A list with two elements: the number of users that voted helpful, and the total number of users that voted on the review (including the not helpful votes), list
* overall - The reviewer's rating of the product, int64
* reviewText - The review text itself, string
* reviewerID - Unique ID of the reviewer, string
* reviewerName - Specified name of the reviewer, string
* summary - Headline summary of the review, string
* unixReviewTime - Unix time at which the review was posted, int64
* reviewTime - The review date as raw text (for example, "06 2, 2013"), string
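
The two-element `helpful` field can be condensed into a single helpfulness ratio. The following is a minimal sketch, assuming every value is a `[helpful_votes, total_votes]` list as described above; the function name `helpfulness_ratio` is hypothetical and does not appear in the notebook.

```python
from typing import Optional, Sequence


def helpfulness_ratio(helpful: Sequence[int]) -> Optional[float]:
    """Return helpful votes divided by total votes, or None when nobody voted."""
    helpful_votes, total_votes = helpful
    if total_votes == 0:
        # Avoid dividing by zero for reviews without any votes, e.g. [0, 0].
        return None
    return helpful_votes / total_votes


print(helpfulness_ratio([12, 15]))  # 0.8
print(helpfulness_ratio([0, 0]))    # None
```

Returning `None` for unvoted reviews keeps them distinguishable from reviews that were voted on but found unhelpful.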

# Data Wrangling

In [1]:
import warnings

warnings.simplefilter("ignore", UserWarning)
warnings.simplefilter("ignore", FutureWarning)
warnings.simplefilter("ignore", DeprecationWarning)

In [2]:
import pandas as pd
import numpy as np
import os



In [3]:
# Get the current working directory and print it
MyWorkingDir = os.getcwd()

print(MyWorkingDir)

/Users/aquinojoeanson/Desktop/SPRINGBOARD/Capstone_Project_3/Notebook


In [4]:
# Load the dataset reviews_Electronics_5.json from the working directory
# into a DataFrame (df). The file holds one JSON record per line, hence lines=True.
df = pd.read_json('/Users/aquinojoeanson/Desktop/SPRINGBOARD/Capstone_Project_3/Notebook/reviews_Electronics_5.json', lines=True)

In [5]:
# Only overall and unixReviewTime are stored as integers. The remaining columns are read as objects (strings, plus lists for helpful).
display(df.head(10))

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime
0,AO94DHGC771SJ,528881469,amazdnu,"[0, 0]",We got this GPS for my husband who is an (OTR)...,5,Gotta have GPS!,1370131200,"06 2, 2013"
1,AMO214LNFCEI4,528881469,Amazon Customer,"[12, 15]","I'm a professional OTR truck driver, and I bou...",1,Very Disappointed,1290643200,"11 25, 2010"
2,A3N7T0DY83Y4IG,528881469,C. A. Freeman,"[43, 45]","Well, what can I say. I've had this unit in m...",3,1st impression,1283990400,"09 9, 2010"
3,A1H8PY3QHMQQA0,528881469,"Dave M. Shaw ""mack dave""","[9, 10]","Not going to write a long review, even thought...",2,"Great grafics, POOR GPS",1290556800,"11 24, 2010"
4,A24EV6RXELQZ63,528881469,Wayne Smith,"[0, 0]",I've had mine for a year and here's what we go...,1,"Major issues, only excuses for support",1317254400,"09 29, 2011"
5,A2JXAZZI9PHK9Z,594451647,"Billy G. Noland ""Bill Noland""","[3, 3]",I am using this with a Nook HD+. It works as d...,5,HDMI Nook adapter cable,1388707200,"01 3, 2014"
6,A2P5U7BDKKT7FW,594451647,Christian,"[0, 0]",The cable is very wobbly and sometimes disconn...,2,Cheap proprietary scam,1398556800,"04 27, 2014"
7,AAZ084UMH8VZ2,594451647,"D. L. Brown ""A Knower Of Good Things""","[0, 0]",This adaptor is real easy to setup and use rig...,5,A Perfdect Nook HD+ hook up,1399161600,"05 4, 2014"
8,AEZ3CR6BKIROJ,594451647,Mark Dietter,"[0, 0]",This adapter easily connects my Nook HD 7&#34;...,4,A nice easy to use accessory.,1405036800,"07 11, 2014"
9,A3BY5KCNQZXV5U,594451647,Matenai,"[3, 3]",This product really works great but I found th...,5,This works great but read the details...,1390176000,"01 20, 2014"


In [6]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1689188 entries, 0 to 1689187
Data columns (total 9 columns):
 #   Column          Non-Null Count    Dtype 
---  ------          --------------    ----- 
 0   reviewerID      1689188 non-null  object
 1   asin            1689188 non-null  object
 2   reviewerName    1664458 non-null  object
 3   helpful         1689188 non-null  object
 4   reviewText      1689188 non-null  object
 5   overall         1689188 non-null  int64 
 6   summary         1689188 non-null  object
 7   unixReviewTime  1689188 non-null  int64 
 8   reviewTime      1689188 non-null  object
dtypes: int64(2), object(7)
memory usage: 116.0+ MB
None


In [7]:
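# Convert unixReviewTime from Unix time (seconds since the epoch) to an MM-DD-YYYY date string.
# Note: datetime.fromtimestamp() uses the local time zone, so when that zone is behind UTC
# the result can fall one day before the date recorded in reviewTime.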
from datetime import datetime

condition = lambda row: datetime.fromtimestamp(row).strftime("%m-%d-%Y")
df["unixReviewTime"] = df["unixReviewTime"].apply(condition)
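
Because `datetime.fromtimestamp` interprets the timestamp in the machine's local time zone, the converted dates depend on where the notebook runs. The sketch below shows a time-zone-independent variant that pins the conversion to UTC; the helper name `unix_to_date_utc` is an assumption made for illustration, and the sample value is the first row's `unixReviewTime`.

```python
from datetime import datetime, timezone


def unix_to_date_utc(seconds: int) -> str:
    """Format a Unix timestamp as MM-DD-YYYY, interpreted in UTC."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime("%m-%d-%Y")


# First row of the dataset: reviewTime is "06 2, 2013".
print(unix_to_date_utc(1370131200))  # 06-02-2013
```

With the UTC variant, the result agrees with the `reviewTime` column regardless of the local time zone.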

In [8]:
df.drop(labels="reviewTime", axis=1, inplace=True)
display(df.head())

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime
0,AO94DHGC771SJ,528881469,amazdnu,"[0, 0]",We got this GPS for my husband who is an (OTR)...,5,Gotta have GPS!,06-01-2013
1,AMO214LNFCEI4,528881469,Amazon Customer,"[12, 15]","I'm a professional OTR truck driver, and I bou...",1,Very Disappointed,11-24-2010
2,A3N7T0DY83Y4IG,528881469,C. A. Freeman,"[43, 45]","Well, what can I say. I've had this unit in m...",3,1st impression,09-08-2010
3,A1H8PY3QHMQQA0,528881469,"Dave M. Shaw ""mack dave""","[9, 10]","Not going to write a long review, even thought...",2,"Great grafics, POOR GPS",11-23-2010
4,A24EV6RXELQZ63,528881469,Wayne Smith,"[0, 0]",I've had mine for a year and here's what we go...,1,"Major issues, only excuses for support",09-28-2011


In [9]:
#Sample product review (string) in reviewText
print(df["reviewText"].iloc[0])

We got this GPS for my husband who is an (OTR) over the road trucker.  Very Impressed with the shipping time, it arrived a few days earlier than expected...  within a week of use however it started freezing up... could of just been a glitch in that unit.  Worked great when it worked!  Will work great for the normal person as well but does have the "trucker" option. (the big truck routes - tells you when a scale is coming up ect...)  Love the bigger screen, the ease of use, the ease of putting addresses into memory.  Nothing really bad to say about the unit with the exception of it freezing which is probably one in a million and that's just my luck.  I contacted the seller and within minutes of my email I received a email back with instructions for an exchange! VERY impressed all the way around!


In [10]:
# The overall field holds the review's star rating (1 to 5); it will serve as the ground-truth label for the model.
print(df.overall.unique())

[5 1 3 2 4]


In [11]:
# NLP pre-processing: a sample review in its original form
sample_review = df["reviewText"].iloc[1689160]
print(sample_review)

Want to add wireless audio streaming to your home theater or home stereo that has 3.5mm input? Maybe you want to stream music in your car from your Android or iPhone? This little beauty adds wireless capability to a wireless-incapable device with ease.The only caveat is that it needs a USB power connection to run. This is good if you have a spare charger and a spare power outlet, bad if you don't. It's not a big deal to work around, but it is something you will want to keep in mind.Connecting to the device is easy. Just plug it into power, let it power on. Connect the 3.5mm output to your audio setup with a 3.5mm input. On your phone or Wi-Fi enabled audio player, turn on Wi-Fi. Search for the Sabrent_A1AE and connect. Now fire up your music app, such as Music on the iPhone, and play away. Well, make sure your stereo is powered up and on the right input (or car stereo or surround receiver and so on).The nice thing about this is that it uses Wi-Fi. This means it doesn't have the compres

In [12]:
# Decode HTML entities (for example, &#34;) back into the characters they represent.
import html

decoded_review = html.unescape(sample_review)
print(decoded_review)

Want to add wireless audio streaming to your home theater or home stereo that has 3.5mm input? Maybe you want to stream music in your car from your Android or iPhone? This little beauty adds wireless capability to a wireless-incapable device with ease.The only caveat is that it needs a USB power connection to run. This is good if you have a spare charger and a spare power outlet, bad if you don't. It's not a big deal to work around, but it is something you will want to keep in mind.Connecting to the device is easy. Just plug it into power, let it power on. Connect the 3.5mm output to your audio setup with a 3.5mm input. On your phone or Wi-Fi enabled audio player, turn on Wi-Fi. Search for the Sabrent_A1AE and connect. Now fire up your music app, such as Music on the iPhone, and play away. Well, make sure your stereo is powered up and on the right input (or car stereo or surround receiver and so on).The nice thing about this is that it uses Wi-Fi. This means it doesn't have the compres

In [14]:
# Remove numeric HTML entities such as &#34; from the review text; they carry no value for NLP.
pattern = r"\&\#[0-9]+\;"
df["preprocessed"] = df["reviewText"].str.replace(pat=pattern, repl="", regex=True)
print(df["preprocessed"].iloc[1689160])

Want to add wireless audio streaming to your home theater or home stereo that has 3.5mm input? Maybe you want to stream music in your car from your Android or iPhone? This little beauty adds wireless capability to a wireless-incapable device with ease.The only caveat is that it needs a USB power connection to run. This is good if you have a spare charger and a spare power outlet, bad if you don't. It's not a big deal to work around, but it is something you will want to keep in mind.Connecting to the device is easy. Just plug it into power, let it power on. Connect the 3.5mm output to your audio setup with a 3.5mm input. On your phone or Wi-Fi enabled audio player, turn on Wi-Fi. Search for the Sabrent_A1AE and connect. Now fire up your music app, such as Music on the iPhone, and play away. Well, make sure your stereo is powered up and on the right input (or car stereo or surround receiver and so on).The nice thing about this is that it uses Wi-Fi. This means it doesn't have the compres
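
The two cells above treat HTML entities differently: `html.unescape` restores the encoded character, while the regular expression deletes the entity. A small sketch of the difference follows; the sample string is made up for illustration, modeled on the truncated `Nook HD 7&#34;` review shown in the first table.

```python
import html
import re

# Hypothetical sample text containing a numeric HTML entity (&#34; encodes a double quote).
raw = "This adapter easily connects my Nook HD 7&#34; tablet"

decoded = html.unescape(raw)             # the entity becomes the character it encodes
removed = re.sub(r"&#[0-9]+;", "", raw)  # the entity is deleted outright

print(decoded)
print(removed)
```

Since punctuation is stripped later in the pipeline anyway, deleting the entity is a reasonable shortcut here; decoding matters when the original characters need to be kept.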

# Extracting the root word

In [None]:
# Lemmatize each review: tag every token with its part of speech, reduce it to
# its root word, then join the roots back into one whitespace-separated string.
#%%time
import re
import nltk
#nltk.download()

from nltk import word_tokenize, pos_tag
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize
from nltk.corpus import wordnet

# Download the required NLTK resources; nltk.download() skips packages
# that are already installed and up to date.
resources = ["wordnet", "stopwords", "punkt",
             "averaged_perceptron_tagger", "maxent_treebank_pos_tagger"]

for resource in resources:
    nltk.download(resource, quiet=True)

#create Lemmatizer object
lemma = WordNetLemmatizer()

def lemmatize_word(tagged_token):
    """ Returns lemmatized word given its tag"""
    root = []
    for token in tagged_token:
        tag = token[1][0]
        word = token[0]
        if tag.startswith('J'):
            root.append(lemma.lemmatize(word, wordnet.ADJ))
        elif tag.startswith('V'):
            root.append(lemma.lemmatize(word, wordnet.VERB))
        elif tag.startswith('N'):
            root.append(lemma.lemmatize(word, wordnet.NOUN))
        elif tag.startswith('R'):
            root.append(lemma.lemmatize(word, wordnet.ADV))
        else:          
            root.append(word)
    return root

def lemmatize_doc(document):
    """ Tags words then returns sentence with lemmatized words"""
    lemmatized_list = []
    tokenized_sent = sent_tokenize(document)
    for sentence in tokenized_sent:
        no_punctuation = re.sub(r"[`'\",.!?()]", " ", sentence)
        tokenized_word = word_tokenize(no_punctuation)
        tagged_token = pos_tag(tokenized_word)
        lemmatized = lemmatize_word(tagged_token)
        lemmatized_list.extend(lemmatized)
    return " ".join(lemmatized_list)

# Apply the lemmatizer to every review
df["preprocessed"] = df["preprocessed"].apply(lemmatize_doc)
print(df["preprocessed"].iloc[1689160])
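
The branching in `lemmatize_word` maps the first letter of a Penn Treebank tag to a WordNet part-of-speech code. The sketch below isolates that mapping as a table lookup so it can be checked without NLTK installed; the names `PENN_PREFIX_TO_WORDNET` and `penn_to_wordnet` are hypothetical, and the one-letter codes are the values of `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN` and `wordnet.ADV`.

```python
# WordNet part-of-speech codes keyed by the first letter of the Penn Treebank tag.
PENN_PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}


def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBD') to a WordNet POS code, or None."""
    return PENN_PREFIX_TO_WORDNET.get(tag[:1])


for tag in ["JJ", "VBD", "NNS", "RB", "DT"]:
    print(tag, "->", penn_to_wordnet(tag))
```

Tags that map to `None` (determiners, prepositions and so on) correspond to the `else` branch, where the word is kept unchanged.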

In [None]:
# Removing accents 
from unicodedata import normalize

remove_accent = lambda text: normalize("NFKD", text).encode("ascii", "ignore").decode("utf-8", "ignore")

df["preprocessed"] = df["preprocessed"].apply(remove_accent)

print(df["preprocessed"].iloc[1689160])
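
To see what the accent-removal step does to a single string, here is a self-contained sketch of the same idea; the function name `strip_accents` and the sample text are illustrative assumptions. NFKD normalization splits an accented letter into a base letter plus a combining mark, and the ASCII encoding step then discards the mark.

```python
from unicodedata import normalize


def strip_accents(text: str) -> str:
    """Decompose accented characters, then drop everything outside ASCII."""
    decomposed = normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(strip_accents("café naïve"))  # cafe naive
```

One caveat: characters with no ASCII base letter, such as non-Latin scripts, are removed entirely rather than transliterated.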

In [None]:
# Remove punctuation, keeping only alphanumeric characters and whitespace.
pattern = r"[^\w\s]"

df["preprocessed"] = df["preprocessed"].str.replace(pat=pattern, repl=" ", regex=True)

print(df["preprocessed"].iloc[1689160])

In [None]:
# Converted to lower case
df["preprocessed"] = df["preprocessed"].str.lower()

print(df["preprocessed"].iloc[1689160])

In [None]:
# Remove stop words.
from nltk.corpus import stopwords

# Strip apostrophes so the entries match the punctuation-free text.
stop_words = [word.replace("'", "") for word in stopwords.words("english")]

print(f"sample stop words: {stop_words[:15]} \n")

# A set gives constant-time membership tests across the full corpus.
stop_word_set = set(stop_words)
remove_stop_words = lambda row: " ".join([token for token in row.split(" ")
                                          if token not in stop_word_set])
df["preprocessed"] = df["preprocessed"].apply(remove_stop_words)

print(df["preprocessed"].iloc[1689160])

In [None]:
# Collapse runs of whitespace so that words are separated by a single space.
pattern = r"[\s]+"

df["preprocessed"] = df["preprocessed"].str.replace(pat=pattern, repl=" ", regex=True)

print(df["preprocessed"].iloc[1689160])
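
The last four cleaning steps (punctuation, lower case, stop words, extra whitespace) can be followed more easily on a single string. This is a minimal sketch, not the notebook's implementation: `clean_text` is a hypothetical helper, and the tiny `STOP_WORDS` set stands in for NLTK's English list.

```python
import re

# Hypothetical stop-word set for illustration; the notebook uses NLTK's English list.
STOP_WORDS = {"the", "is", "a", "it", "and"}


def clean_text(text: str) -> str:
    """Replace punctuation, lower-case, drop stop words, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text)  # punctuation becomes a space
    text = text.lower()
    # str.split() with no argument splits on any run of whitespace,
    # so empty tokens never appear and no separate collapsing step is needed.
    tokens = [token for token in text.split() if token not in STOP_WORDS]
    return " ".join(tokens)


print(clean_text("The cable is wobbly, and it DISCONNECTS!"))  # cable wobbly disconnects
```

The notebook keeps the steps in separate cells so that the sample review can be inspected after each one; folding them into one function trades that visibility for a single pass over the text.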

# Tokenization

In [None]:
# Split each review into a list of words.
corpora = df["preprocessed"].values
tokenized = [corpus.split(" ") for corpus in corpora]

print(tokenized[1689160])

# Phrase Modeling

In [None]:
# Detect phrases: a word pair must occur at least 300 times (min_count) and score above the threshold (50) to be merged.
from gensim.models import Phrases
from gensim.models.phrases import Phraser

bi_gram = Phrases(tokenized, min_count=300, threshold=50)

tri_gram = Phrases(bi_gram[tokenized], min_count=300, threshold=50)
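
`Phrases` decides which adjacent word pairs to merge by scoring each pair and comparing the score with `threshold`. The sketch below illustrates the scoring rule from Mikolov et al. that gensim's default scorer is based on: (pair count minus `min_count`) times vocabulary size, divided by the product of the two word counts. It is a simplified stand-in, not gensim's code; in particular, the vocabulary size here counts unigrams only, which may differ from gensim's own normalizer, and the toy corpus is made up.

```python
from collections import Counter


def phrase_scores(docs, min_count):
    """Score every adjacent word pair; higher means the pair co-occurs more than chance."""
    unigrams = Counter(token for doc in docs for token in doc)
    bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    vocab_size = len(unigrams)
    scores = {}
    for (first, second), count in bigrams.items():
        scores[(first, second)] = (
            (count - min_count) * vocab_size / (unigrams[first] * unigrams[second])
        )
    return scores


# Toy corpus for illustration only.
docs = [["hard", "drive", "fails"], ["hard", "drive", "works"], ["drive", "fast"]]
scores = phrase_scores(docs, min_count=1)
print(scores[("hard", "drive")])
print(scores[("drive", "fast")])
```

Subtracting `min_count` discounts rare pairs, which is why a pair seen only `min_count` times scores zero no matter how rare its component words are.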

# Unigrams

In [None]:
#Single pieces of token.
uni_gram_tokens = set([token for text in tokenized for token in text])
uni_gram_tokens = set(filter(lambda x: x != "", uni_gram_tokens))

print(list(uni_gram_tokens)[:50])

# Bigrams

In [None]:
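# Keep the entries of the bi_gram phrase vocabulary that meet min_count.
# Note: .decode("utf-8") assumes gensim 3.x, where the Phrases vocab keys are bytes.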
bigram_min = bi_gram.min_count

bi_condition = lambda x: x[1] >= bigram_min

bi_gram_tokens = dict(filter(bi_condition, bi_gram.vocab.items()))
bi_gram_tokens = set([token.decode("utf-8") \
                      for token in bi_gram_tokens])

bi_grams_only = bi_gram_tokens.difference(uni_gram_tokens)
print(list(bi_grams_only)[:50])

# Trigrams

In [None]:
# Linking bi_gram plus adjacent tokens.
trigram_min = tri_gram.min_count

tri_condition = lambda x: x[1] >= trigram_min

tri_gram_tokens = dict(filter(tri_condition, tri_gram.vocab.items()))
tri_gram_tokens = set([token.decode("utf-8") \
                       for token in tri_gram_tokens])

tri_grams_only = tri_gram_tokens.difference(bi_gram_tokens)
print(list(tri_grams_only)[:50])

In [None]:
# Apply the bi_gram and tri_gram phrasers to the tokenized corpora, building each Phraser only once.
bi_phraser, tri_phraser = Phraser(bi_gram), Phraser(tri_gram)
tokenized = [tri_phraser[bi_phraser[doc]] for doc in tokenized]

In [None]:
# Final form: single-character tokens are removed.
tokenized = [list(filter(lambda x: len(x) > 1, document)) \
             for document in tokenized]

print(tokenized[1689160])

# Creating the Vocabulary

In [None]:
# Tokens are assigned to lookup ID.
from gensim.corpora.dictionary import Dictionary

vocabulary = Dictionary(tokenized)

vocabulary_keys = list(vocabulary.token2id)[0:10]

for key in vocabulary_keys:
    print(f"ID: {vocabulary.token2id[key]}, Token: {key}")

# Vectorization

# Bag of Words Model

In [None]:
# Converting to numerical values. 
# Counting how many times a word appears.
bow = [vocabulary.doc2bow(doc) for doc in tokenized]

for idx, freq in bow[0]:
    print(f"Word: {vocabulary.get(idx)}, Frequency: {freq}")
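
A bag-of-words vector is just a list of `(token ID, count)` pairs per document. The following plain-Python sketch mirrors what `Dictionary` and `doc2bow` produce on a toy corpus; the names `token2id`, `to_bow` and `toy_bow` are illustrative, and the ID assignment order (first appearance) is an assumption that need not match gensim's.

```python
from collections import Counter

# Toy tokenized corpus for illustration only.
docs = [["gps", "truck", "gps"], ["cable", "nook"]]

# Assign each distinct token an integer ID in order of first appearance.
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))


def to_bow(doc):
    """Return sorted (token ID, count) pairs for one document."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())


toy_bow = [to_bow(doc) for doc in docs]
print(toy_bow)  # [[(0, 2), (1, 1)], [(2, 1), (3, 1)]]
```

Word order is discarded in this representation; only the counts survive, which is what the TF-IDF step reweights next.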

# TF-IDF Model

In [None]:
# TF-IDF reweights the counts in bow: a word scores higher when it is frequent in a review but rare across the corpus.
from gensim.models.tfidfmodel import TfidfModel

tfidf = TfidfModel(bow)

for idx, weight in tfidf[bow[0]]:
    print(f"Word: {vocabulary.get(idx)}, Weight: {weight:.3f}")
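
To make the weights printed above less opaque, here is a from-scratch sketch of a TF-IDF computation. It assumes what I understand to be `TfidfModel`'s default settings (raw term frequency, base-2 logarithm of total documents over document frequency, then scaling each document to unit length); treat that as an assumption rather than a statement of gensim's internals. The function name and toy corpus are hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {token: weight} dict per document, L2-normalized."""
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    weighted_docs = []
    for doc in docs:
        term_freq = Counter(doc)
        raw = {t: tf * math.log2(n_docs / doc_freq[t]) for t, tf in term_freq.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        # Tokens present in every document get weight 0 and are dropped.
        weighted_docs.append({t: w / norm for t, w in raw.items() if w > 0} if norm else {})
    return weighted_docs


docs = [["gps", "gps", "truck"], ["gps", "cable"], ["cable", "nook"]]
weights = tfidf_weights(docs)
print(weights[0])
```

In the first toy document, `truck` outweighs `gps` even though `gps` occurs twice, because `truck` appears in only one document of the corpus.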

# Word Embedding for Feature Engineering

In [None]:
%%time
# Train a Word2Vec model that embeds each token as a 100-dimensional vector.
# Note: size= and iter= are gensim 3.x parameter names (vector_size= and epochs= in gensim 4).
import numpy as np

from gensim.models import word2vec

np.set_printoptions(suppress=True)

feature_size = 100
context_size = 20
min_word = 1

word_vec = word2vec.Word2Vec(tokenized, size=feature_size,
                             window=context_size, min_count=min_word,
                             iter=50, seed=42)

# Final Dataframe

In [None]:
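# Gather every unique token and its row index from the word_vec model.
# Note: wv.vocab and wv.syn0 are gensim 3.x attributes (wv.key_to_index and wv.vectors in gensim 4).
word_vec_unpack = [(word, idx.index) for word, idx in
                   word_vec.wv.vocab.items()]

tokens, indexes = zip(*word_vec_unpack)

# word_vec_df is indexed by token, with one row per word vector
word_vec_df = pd.DataFrame(word_vec.wv.syn0[indexes, :], index=tokens)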

display(word_vec_df.head())

In [None]:
%%time
# Average the word vectors of each document. model_array has one row per
# document (axis 0) and one column per embedding dimension (axis 1).
# Iterating over the list directly avoids building a ragged NumPy array.
model_array = np.array([word_vec_df.loc[doc].mean(axis=0) for doc in tokenized])
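
The document vectors are plain averages of the word vectors in each review. A dependency-free sketch of that averaging on toy two-dimensional embeddings is shown below; `average_vector` and the embedding values are made up for illustration.

```python
def average_vector(tokens, embeddings):
    """Return the element-wise mean of the tokens' vectors, or None for an empty document."""
    vectors = [embeddings[token] for token in tokens if token in embeddings]
    if not vectors:
        return None
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Toy embeddings for illustration only.
embeddings = {"good": [1.0, 0.0], "price": [0.0, 1.0], "cable": [1.0, 1.0]}

print(average_vector(["good", "price"], embeddings))  # [0.5, 0.5]
print(average_vector([], embeddings))                 # None
```

In the notebook, a review left with no tokens after cleaning yields a row of NaN values instead of `None`, which is why rows with missing values are dropped before PCA.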

In [None]:
# model_df final DF
model_df = pd.DataFrame(model_array)
model_df["label"] = df["overall"]

display(model_df.head())

# Principal Component Analysis

In [None]:
# Use PCA to reduce model_df from 100 dimensions to 2 dimensions.
import matplotlib.pyplot as plt

from sklearn.decomposition import PCA

# Drop rows with missing values, then sample every 50th row of model_df
pca_df = model_df.dropna(axis=0)
pca_df = pca_df.iloc[::50]

# Fit PCA on the feature columns (every column except the label)
pca = PCA(n_components=2, random_state=42)
components = pca.fit_transform(pca_df.iloc[:, :-1])
labels = pca_df["label"]

# Set up plot components
x_axis = components[:,0]
y_axis = components[:,1]
color_map = pca_df["label"].map({1:"blue", \
                                 2:"red", \
                                 3:"yellow", \
                                 4:"green", \
                                 5:"orange"})

#plotting PCA
f, axes = plt.subplots(figsize=(20,10))
plt.scatter(x_axis, y_axis, color=color_map, s=1)
plt.show()

# EDA

# Word2Vec More

In [None]:
# Explore the Amazon dataset with several Natural Language Processing techniques.
# Take five common words in the corpora and use word_vec to find the five words most related to each.
word_bank = ["nook", "phone", "tv", "good", "price"]

for word in word_bank[:]:
    related_vec = word_vec.wv.most_similar(word, topn=5)
    related_words = np.array(related_vec)[:,0]
    word_bank.extend(related_words)
    print(f"{word}: {related_words}")

# t-SNE

In [None]:
# t-SNE helps visualize a high-dimensional dataset:
# it provides the coordinates of each word on a 2D scatter plot.
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, perplexity=5, n_iter=1000, random_state=42)

# De-duplicate word_bank while preserving order, so that each label
# lines up with its own vector (a set has no guaranteed order).
unique_words = list(dict.fromkeys(word_bank))
sample_vecs = word_vec.wv[unique_words]
sample_tsne = tsne.fit_transform(sample_vecs)
tsne_x = sample_tsne[:, 0]
tsne_y = sample_tsne[:, 1]

f, axes = plt.subplots(figsize=(20,7))
ax = plt.scatter(x=tsne_x, y=tsne_y)

for label, x, y in zip(unique_words, tsne_x, tsne_y):
    plt.annotate(label, xy=(x+3, y+3))

plt.show()

# Vector Algebra

Word vectors can be added (to combine the meaning of the components) or subtracted (to take the context of one token out of another).
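
As a rough illustration of what such a query computes, the sketch below adds and subtracts unit-length vectors and ranks the remaining words by cosine similarity. It is a simplified stand-in for `most_similar`, not gensim's code, and the two-dimensional embeddings are invented so the arithmetic can be followed by hand.

```python
import math


def unit(vector):
    """Scale a vector to length 1."""
    length = math.sqrt(sum(x * x for x in vector))
    return [x / length for x in vector]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(unit(a), unit(b)))


def most_similar(embeddings, positive, negative=(), topn=1):
    """Rank words by similarity to sum(positive) - sum(negative), excluding the inputs."""
    dims = len(next(iter(embeddings.values())))
    query = [0.0] * dims
    for word in positive:
        query = [q + x for q, x in zip(query, unit(embeddings[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(embeddings[word]))]
    exclude = set(positive) | set(negative)
    ranked = sorted(
        ((word, cosine(query, vec)) for word, vec in embeddings.items() if word not in exclude),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:topn]


# Toy embeddings for illustration only.
embeddings = {
    "cheap": [1.0, 1.0],
    "quality": [0.0, 1.0],
    "flimsy": [1.0, 0.0],
    "premium": [-1.0, 1.0],
}
result = most_similar(embeddings, positive=["cheap"], negative=["quality"])
print(result)
```

Here, removing the "quality" direction from "cheap" leaves a vector that points toward "flimsy" and away from "premium".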

### Example 1: Books + Touchscreen

In [None]:
# vector algebra and similarity scores
word_vec.wv.most_similar(positive=["books", "touchscreen"], \
                      negative=[], topn=1)

### Example 2: Cheap – Quality

In [None]:
# vector algebra and similarity scores
word_vec.wv.most_similar(positive=["cheap"], \
                      negative=["quality"], topn=1)

### Example 3: Tablet – Phone

In [None]:
# vector algebra and similarity scores
word_vec.wv.most_similar(positive=["tablet"], \
                      negative=["phone"], topn=1)

# Named-Entity Recognition

In [None]:
# most_helpful_text is the review that received the most helpful votes from Amazon users.
# Each helpful entry is a two-element list: the first element stores the number of helpful votes,
# the second the total number of votes (helpful and not helpful).
helpful = df["helpful"].tolist()
most_helpful = max(helpful, key=lambda x: x[0])

most_helpful_idx = df["helpful"].astype(str) == str(most_helpful)
most_helpful_idx = df[most_helpful_idx].index

most_helpful_text = df["reviewText"].iloc[most_helpful_idx].values[0]

print(most_helpful_text)

In [None]:
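%%time
# Go further and identify what the nouns in the document refer to, using named-entity recognition (NER).
# ner_dict is a defaultdict of lists that groups the entities in most_helpful_text by NER tag.
# Note: spacy.load("en") is the spaCy 2.x shortcut; spaCy 3 requires a full model name such as "en_core_web_sm".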
import spacy

from collections import defaultdict

ner = spacy.load("en")

ner_helpful = ner(most_helpful_text)

ner_dict = defaultdict(list)
for entity in ner_helpful.ents:
    ner_dict[entity.label_].append(entity)

for NER, name in ner_dict.items():
    print(f"{NER}:\n{name}\n")
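
The grouping pattern used for `ner_dict` can be tried without loading a spaCy model. In the sketch below, the `(text, label)` pairs are hypothetical stand-ins for the entities spaCy would return; they are not output from the notebook.

```python
from collections import defaultdict

# Hypothetical (entity text, label) pairs standing in for spaCy's doc.ents.
entities = [
    ("Amazon", "ORG"),
    ("Nook HD+", "PRODUCT"),
    ("Sabrent", "ORG"),
    ("a year", "DATE"),
]

# defaultdict(list) creates an empty list the first time a label is seen.
grouped = defaultdict(list)
for text, label in entities:
    grouped[label].append(text)

for label, names in grouped.items():
    print(f"{label}: {names}")
```

Using `defaultdict(list)` avoids checking whether a label already has an entry before appending to it.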

In [None]:
# to visualize the tags in the review
from spacy import displacy

displacy.render(ner_helpful, style="ent", jupyter=True)

# Dependency Tree

In [None]:
# Break each sentence down to show how its tokens depend on one another:
# dependency trees of the first three sentences of most_helpful_text.
def ner_displacy(sentence):
    ner_sentence = ner(sentence)
    displacy.render(ner_sentence, jupyter=True, \
                    options={"compact": False, \
                             "distance": 90, \
                             "word_spacing":20, \
                             "arrow_spacing":10, \
                             "arrow_stroke": 2, \
                             "arrow_width": 5})

for sentence in most_helpful_text.split(".")[0:3]:
    ner_displacy(sentence)

# Topic Modeling

In [None]:
%%time
# Reviews can be classified and grouped according to the type of electronics product they correspond to.
# LDA assigns each product review a weight for every topic,
# and each topic a weight for every token.
# Print the top five words most salient to the first topic.
import multiprocessing

from gensim.models.ldamulticore import LdaMulticore

cores = multiprocessing.cpu_count()

num_topics = 10
bow_lda = LdaMulticore(bow, num_topics=num_topics, id2word=vocabulary, \
                       passes=5, workers=cores, random_state=42)

for token, frequency in bow_lda.show_topic(0, topn=5):
    print(token, frequency)

In [None]:
# Summarize the model by printing the top five words of each topic
for topic in range(0, num_topics):
    print(f"\nTopic {topic+1}:")
    for token, frequency in bow_lda.show_topic(topic, topn=5):
        print(f" {token}, {frequency}")

In [None]:
# Interactively explore the words associated with the topics derived by LDA.
# Note: pyLDAvis 3.3 and later renamed this module to pyLDAvis.gensim_models.
import pyLDAvis.gensim

lda_idm = pyLDAvis.gensim.prepare(bow_lda, bow, vocabulary)

pyLDAvis.display(lda_idm)