# Capstone Project: Amazon Review Classification (Part 1)
Author: **Steven Lee**

# Categorizing Amazon Reviews

User reviews of products and services often give sellers and service providers valuable feedback.  At the very least, reviews can signal problems with the manufacture of goods, a dip in the quality of services, or issues with deliveries.  They can also give business owners useful ideas for improving products and services, and sometimes even point to new products or services that are in demand.

The goal is to build a classification model that sorts reviews into meaningful classes and tells business owners which product aspects customers find below par, lacking in certain regards, or in line with expectations.  The target is an accuracy score above 85%.  The models compared will include Naive Bayes, Random Forest and Neural Networks.

Sentiment analysis only tries to determine whether a review is positive or negative.  While this is helpful, it only tells business owners what proportion of buyers were happy or dissatisfied with their purchases.  The multi-class model is meant to go further and give business owners more meaningful insights about their products.

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Import-Data" data-toc-modified-id="Import-Data-1">Import Data</a></span></li><li><span><a href="#Inspect-and-Clean-Data" data-toc-modified-id="Inspect-and-Clean-Data-2">Inspect and Clean Data</a></span></li><li><span><a href="#Prepare-Bag-of-Words" data-toc-modified-id="Prepare-Bag-of-Words-3">Prepare Bag of Words</a></span></li><li><span><a href="#Save-Clean-Data-to-File" data-toc-modified-id="Save-Clean-Data-to-File-4">Save Clean Data to File</a></span></li></ul></div>

## Import Data

In [1]:
# Import required libraries.
import numpy as np
import pandas as pd
import gzip
import json

# Set pandas display options.
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_colwidth', 200)

from random import sample

# Import Tokenizer, Lemmatizer and stop words.
from tensorflow.keras.preprocessing.text import text_to_word_sequence
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

The following datasets are [updated versions](https://nijianmo.github.io/amazon/index.html) of the Amazon review dataset released in 2014.  For this project, the scope is limited to reviews of products under the **Tools and Home Improvement** main category.  I will use the smaller subset of the review data (roughly 2 million reviews), which is extracted from the full set of more than 9 million reviews.  The product metadata is included to see whether it can be merged with the final named entities to provide enhanced insights.

In [2]:
# Read in review and product datasets.
review_data = "../data/Tools_and_Home_Improvement_5.json.gz"
product_data = "../data/meta_Tools_and_Home_Improvement.json.gz"

In [3]:
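def parse(path):
    # Stream one JSON record per line and close the file when done.
    with gzip.open(path, 'rb') as g:
        for line in g:
            yield json.loads(line)

def getDF(path):
    # Collect the parsed records into a dataframe indexed 0..n-1.
    df = {}
    for i, d in enumerate(parse(path)):
        df[i] = d
    return pd.DataFrame.from_dict(df, orient='index')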
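
Because `parse` is a generator, a file can also be inspected without loading every record.  The sketch below is only an illustration and is not part of the original notebook: `peek` is a hypothetical helper, and the demo file and its records are made up for the example.

```python
import gzip
import itertools
import json
import os
import tempfile

def parse(path):
    # Stream one JSON record per line from a gzipped JSON-lines file.
    with gzip.open(path, 'rb') as g:
        for line in g:
            yield json.loads(line)

def peek(path, n=3):
    # Hypothetical helper: return only the first n records as a list of dicts.
    return list(itertools.islice(parse(path), n))

# Build a tiny gzipped JSON-lines file with made-up records to demonstrate.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.json.gz")
with gzip.open(demo_path, 'wt', encoding='utf-8') as f:
    for k in range(5):
        f.write(json.dumps({"asin": f"A{k}", "overall": 5.0}) + "\n")

first_two = peek(demo_path, n=2)
print(first_two)
```

Peeking at a handful of records this way is a cheap check of the field names before committing to the full load, which takes minutes for the product file.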

In [4]:
%%time

reviews = getDF(review_data)
reviews.head(3)

Wall time: 27.4 s


| |overall|verified|reviewTime|reviewerID|asin|style|reviewerName|reviewText|summary|unixReviewTime|vote|image|
|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
|0|5.0|True|01 28, 2018|AL19QO4XLBQPU|982085028|{'Style:': ' 1) IR30 POU (30A/3.4kW/110v)'}|J. Mollenkamp|returned, decided against this product|Five Stars|1517097600| | |
|1|5.0|True|11 30, 2017|A1I7CVB7X3T81E|982085028|{'Style:': ' 3) IR260 POU (30A/6kW/220v)'}|warfam|Awesome heater for the electrical requirements! Makes an awesome preheater for my talnkless system|Five Stars|1512000000| | |
|2|5.0|True|09 12, 2017|A1AQXO4P5U674E|982085028|{'Style:': ' Style64'}|gbieber2|Keeps the mist of your wood trim and on you. Bendable too.|Five Stars|1505174400| | |


In [5]:
%%time

products = getDF(product_data)
products.head(3)

Wall time: 1min 59s


Unnamed: 0,category,tech1,description,fit,title,also_buy,image,tech2,brand,feature,rank,also_view,main_cat,similar_item,date,price,asin,details
0,"[Tools & Home Improvement, Lighting & Ceiling Fans, Lamps & Shades, Table Lamps]",,[collectible table lamp],,Everett's Cottage Table Lamp,[],[],,,[],"[>#3,780,135 in Tools & Home Improvement (See top 100), >#45,028 in Tools & Home Improvement > Lighting & Ceiling Fans > Lamps & Shades > Table Lamps]",[],Tools & Home Improvement,,"October 30, 2010",,001212835X,
1,"[Tools & Home Improvement, Lighting & Ceiling Fans, Novelty Lighting]","class=""a-keyvalue prodDetTable"" role=""presentation"">\n \n \n \n \n <tr>\n \n \n \n \n ...",[Fun book light! Comes with two AAA batteries and a long-lasting LED lightbulb. Makes a great gift!],,Diary of a Wimpy Kid Book Light,[],"[https://images-na.ssl-images-amazon.com/images/I/51rylQjhYLL._SX38_SY50_CR,0,0,38,50_.jpg, https://images-na.ssl-images-amazon.com/images/I/51Svo2N%2B9DL._SX38_SY50_CR,0,0,38,50_.jpg, https://ima...",,Barnes & Noble,"[Easily clips to hardcover and paperback books, Comes with two AAA batteries, Long-lasting LED lightbulb]","[>#1,074,903 in Tools & Home Improvement (See top 100), >#37,631 in Tools & Home Improvement > Lighting & Ceiling Fans > Novelty Lighting]",[],Tools & Home Improvement,,"March 9, 2013",,0594510384,
2,"[Tools & Home Improvement, Paint, Wall Treatments & Supplies, Wall Stickers & Murals]",,"[A fun addition to any decor, The Beatles Yellow Submarine Wall Decals feature art from the 1968 animated classic starring the Fab Four in one of their most memorable adventures. Each package incl...",,Mudpuppy The Beatles Yellow Submarine Wall Decals,"[1481403621, B00EMLN7PS, B077NNCBTP, B01GWKH0FO, B01FUFGEHM, 1536201464, 1785863940, B003SJK6I6, B005S9JJOG, B07BTMLNFN, B01JH1GTUW, 1786707039, B0185QMG7K, B00KCNBED2, B07DV95WN6, 0735342431, B00...",[https://images-na.ssl-images-amazon.com/images/I/51IBDcZ5tJL._SS40_.jpg],,Mudpuppy,"[6 sheets of decals, Package: 11.25 x 11.25 in, Shrink wrapped]","[>#105,697 in Toys & Games (See Top 100 in Toys & Games), >#1,297 in Home & Kitchen > Home Dcor > Home Dcor Accents > Wall Stickers & Murals]","[B003VYAHRI, B00EMLN7PS, B01GWKH0FO, 0735344523, B00EMLB85Y, B00OZ95GDI, B004B49A6Q, B079KCKKVP, B003SJK6I6, B00DSD5D0S, B01FUFGEHM, B077NNCBTP, B07BTMLNFN, 0735342431, B00DSD5DUI, B0799GWBQ7, B07...",Toys & Games,"class=""a-bordered a-horizontal-stripes a-spacing-extra-large a-size-base comparison_table"">\n\n\n\n \n \n \n \n \n <tr class=""co...",,$20.52,0735342989,


In [6]:
# Check number of records and columns.
reviews.shape, products.shape

((2070831, 12), (571535, 18))

In [7]:
# Check for data types and nulls.
reviews.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2070831 entries, 0 to 2070830
Data columns (total 12 columns):
 #   Column          Dtype  
---  ------          -----  
 0   overall         float64
 1   verified        bool   
 2   reviewTime      object 
 3   reviewerID      object 
 4   asin            object 
 5   style           object 
 6   reviewerName    object 
 7   reviewText      object 
 8   summary         object 
 9   unixReviewTime  int64  
 10  vote            object 
 11  image           object 
dtypes: bool(1), float64(1), int64(1), object(9)
memory usage: 191.6+ MB


In [8]:
# Check for data types and nulls.
products.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 571535 entries, 0 to 571534
Data columns (total 18 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   category      571535 non-null  object
 1   tech1         571535 non-null  object
 2   description   571535 non-null  object
 3   fit           571535 non-null  object
 4   title         571535 non-null  object
 5   also_buy      571535 non-null  object
 6   image         571535 non-null  object
 7   tech2         571535 non-null  object
 8   brand         571535 non-null  object
 9   feature       571535 non-null  object
 10  rank          571535 non-null  object
 11  also_view     571535 non-null  object
 12  main_cat      571535 non-null  object
 13  similar_item  571535 non-null  object
 14  date          571535 non-null  object
 15  price         571535 non-null  object
 16  asin          571535 non-null  object
 17  details       571444 non-null  object
dtypes: object(18)
memory usa

## Inspect and Clean Data

|Required|Feature|Type|Description|
|:--|:--|:--|:--|
|o|`overall`|float|Rating of the product|
|x|`verified`|bool|Whether the review has been verified|
|x|`reviewTime`|object|Time of the review (raw)|
|x|`reviewerID`|object|ID of the reviewer e.g. A2SUAM1J3GNN3B|
|o|`asin`|object|ID of the product e.g. 0000013714|
|x|`style`|object|Dictionary of the product metadata e.g. "Format" is "Hardcover"|
|x|`reviewerName`|object|Name of the reviewer|
|o|`reviewText`|object|Text of the review|
|o|`summary`|object|Summary of the review|
|x|`unixReviewTime`|integer|Time of the review (unix time)|
|o|`vote`|object|Helpful votes of the review|
|x|`image`|object|Images that users post after they have received the product|

|Required|Feature|Type|Description|
|:--|:--|:--|:--|
|x|`category`|object|List of categories the product belongs to|
|x|`tech1`|object|First technical detail table of the product|
|o|`description`|object|Description of the product|
|x|`fit`|object||
|o|`title`|object|Name of the product|
|x|`also_buy`|object|Related products that customers also bought|
|x|`image`|object|url of the product image|
|x|`tech2`|object|Second technical detail table of the product|
|o|`brand`|object|Brand name|
|o|`feature`|object|Bullet-point format features of the product|
|o|`rank`|object|Sales rank information|
|x|`also_view`|object|Related products that customers also viewed|
|o|`main_cat`|object|Main category of the product|
|x|`similar_item`|object|similar product table|
|x|`date`|object|Date|
|o|`price`|object|Price in US dollars (at time of crawl)|
|o|`asin`|object|ID of the product, e.g. 0000031852|
|x|`details`|object|Additional product details|

In [9]:
# Identify unwanted columns in both dataframes for dropping.
unwanted_rev_cols = ['verified', 'reviewTime', 'reviewerID', 'style', 'reviewerName', 'unixReviewTime', 'image']
unwanted_pdt_cols = ['category', 'tech1', 'fit', 'also_buy', 'image', 'tech2', 'also_view', 'similar_item', 'date', 'details']

In [10]:
reviews.shape

(2070831, 12)

In [11]:
# Examine the various data columns to better understand the data.
randomlist = sample(range(reviews.shape[0]), 10)
for i in randomlist:
    print(reviews.loc[i, ['reviewText']], "\n")

reviewText    Works flawlessly and provides a nice light for the closet.  I love the USB recharging.  One star downgrade because I don't think the batteries last long enough on a charge.  We open that closet ma...
Name: 1641303, dtype: object 

reviewText    A thousand times better than the ones my work provide. The adjustable strap is thin but stays well, and it is well worth buying these for the comfort vs other brands of safety glasses. I can wear ...
Name: 145296, dtype: object 

reviewText    Excellent
Name: 343418, dtype: object 

reviewText    Worked great!
Name: 204868, dtype: object 

reviewText    nice for MANY applications, very well made, but that is to be expected with Klein at the helm.
Name: 1418134, dtype: object 

reviewText    finally my led won't flicker at lower output
Name: 1103105, dtype: object 

reviewText    Like the idea but some reviews kept me from buying this save your money i used the plastic box that bob evens mashed potatoes came in this winter to cover 

In [12]:
# Check for duplicate reviews, i.e. same reviewer, same product, same review and same summary.
reviews[reviews.duplicated(subset=['reviewerID', 'reviewerName', 'asin', 'reviewText', 'summary'])].count()

overall           100252
verified          100252
reviewTime        100252
reviewerID        100252
asin              100252
style              57427
reviewerName      100243
reviewText        100238
summary           100244
unixReviewTime    100252
vote               14568
image               1293
dtype: int64

In [13]:
# Drop duplicate reviews, i.e. same reviewer, same product, same review and same summary.
reviews.drop_duplicates(subset=['reviewerID', 'reviewerName', 'asin', 'reviewText', 'summary'], inplace=True)
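
The count reported by `duplicated` covers only the repeats, not the first occurrence of each review, because both `duplicated` and `drop_duplicates` default to `keep='first'`.  A minimal sketch on made-up rows (the toy data and the shortened key list are assumptions for illustration):

```python
import pandas as pd

# Made-up reviews: rows 0 and 1 are identical on every key column.
toy = pd.DataFrame({
    "reviewerID": ["R1", "R1", "R2"],
    "asin":       ["P1", "P1", "P1"],
    "reviewText": ["Works great", "Works great", "Works great"],
    "summary":    ["Five Stars", "Five Stars", "Five Stars"],
})
keys = ["reviewerID", "asin", "reviewText", "summary"]

# duplicated() flags every repeat after the first occurrence.
dup_mask = toy.duplicated(subset=keys)

# drop_duplicates() keeps the first occurrence and removes the flagged rows.
deduped = toy.drop_duplicates(subset=keys)

print(dup_mask.tolist())
print(len(deduped))
```

The same text from a different reviewer (row 2) is kept, since `reviewerID` is part of the key.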

In [14]:
# Check count of verified reviews.
reviews['verified'].value_counts()

True     1809779
False     160800
Name: verified, dtype: int64

In [15]:
# Check for null reviewText and summary values.
reviews['reviewText'].isnull().sum(), reviews['summary'].isnull().sum()

(508, 271)

In [16]:
# Check for records with both null summary and reviewText.  These records will be dropped.
reviews[reviews['summary'].isnull() & reviews['reviewText'].isnull()].count()

overall           30
verified          30
reviewTime        30
reviewerID        30
asin              30
style             18
reviewerName      30
reviewText         0
summary            0
unixReviewTime    30
vote               0
image              0
dtype: int64

In [17]:
# Drop rows with both null summary and reviewText.
reviews.drop(reviews[reviews['summary'].isnull() & reviews['reviewText'].isnull()].index, inplace=True)

In [18]:
# Check for records with null reviewText but summary has data.
reviews[reviews['reviewText'].isnull() & ~reviews['summary'].isnull()].count()

overall           478
verified          478
reviewTime        478
reviewerID        478
asin              478
style             252
reviewerName      478
reviewText          0
summary           478
unixReviewTime    478
vote               54
image             123
dtype: int64

In [19]:
# Check for records with null reviewText and where summary is only rating information.
rating_only = ["five stars", "four stars", "three stars", "two stars", "one star"]
reviews[reviews['reviewText'].isnull() & reviews['summary'].str.lower().isin(rating_only)].count()

overall           426
verified          426
reviewTime        426
reviewerID        426
asin              426
style             221
reviewerName      426
reviewText          0
summary           426
unixReviewTime    426
vote               34
image              82
dtype: int64

In [20]:
# Drop rows with null reviewText and where summary is only rating information.
rating_only = ["five stars", "four stars", "three stars", "two stars", "one star"]
reviews.drop(reviews[reviews['reviewText'].isnull() 
                     & reviews['summary'].str.lower().isin(rating_only)].index, inplace=True)
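
The null-handling rules in this part of the notebook come down to three cases: drop a record when both fields are missing, drop it when the text is missing and the summary only repeats the rating, and otherwise let the summary stand in for the missing text.  The sketch below replays those rules on made-up rows; the toy data and variable names are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "reviewText": [None, None, None, "Solid drill"],
    "summary":    [None, "Five Stars", "Handle broke in a week", "Good"],
})
rating_only = ["five stars", "four stars", "three stars", "two stars", "one star"]

no_text = toy["reviewText"].isnull()

# Rule 1: nothing to classify when both fields are missing.
both_null = no_text & toy["summary"].isnull()

# Rule 2: a summary such as "Five Stars" only repeats the rating.
rating_summary = no_text & toy["summary"].str.lower().isin(rating_only)

kept = toy[~(both_null | rating_summary)].copy()

# Rule 3: an informative summary stands in for the missing review text.
kept["reviewText"] = kept["reviewText"].fillna(kept["summary"])
print(kept)
```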

In [21]:
# Assign summary values to null reviewText.
reviews.loc[reviews['reviewText'].isnull(), 'reviewText'] = reviews['summary']

# Combine summary and reviewText into reviewText.
# reviews.loc[~reviews['summary'].isnull(), 'reviewText'] = reviews['summary'] + ". " + reviews['reviewText']

# Assign remaining null summary values with fullstops.
reviews['summary'] = reviews['summary'].fillna(".")

# Replace newline characters with spaces and remove backslashes before apostrophes.
reviews['reviewText'] = reviews['reviewText'].replace(r"\n", " ", regex=True)
reviews['reviewText'] = reviews['reviewText'].replace(r"\\'", "'", regex=True)

# Remove urls.
# reviews[reviews['reviewText'].str.contains("http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+", 
#                                            regex=True, case=False)]
reviews['reviewText'] = reviews['reviewText'].replace(r"http\S+|www\.\S+", "", regex=True)
reviews['reviewText'] = reviews['reviewText'].replace(r"[A-Za-z]+\.com", "", regex=True)

# Create a length column to store the length of the reviewText.
reviews['length'] = reviews['reviewText'].str.split().apply(len)

# Check for count of records where length of reviewText is between 5 and 128.
reviews[(reviews['length'] > 4) & (reviews['length'] < 129)].count()

overall           1482487
verified          1482487
reviewTime        1482487
reviewerID        1482487
asin              1482487
style              771240
reviewerName      1482415
reviewText        1482487
summary           1482487
unixReviewTime    1482487
vote               185085
image               25261
length            1482487
dtype: int64
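
To see what the cleaning patterns and the length rule do to a single review, here is a small sketch using only the standard library.  `clean_text` and `word_count` are hypothetical helpers, the sample review is made up, and the URL pattern escapes the dot in `www\.`, which is my assumption about the intended match.

```python
import re

URL_PATTERN = re.compile(r"http\S+|www\.\S+")
DOMAIN_PATTERN = re.compile(r"[A-Za-z]+\.com")

def clean_text(text):
    # Hypothetical helper mirroring the cleaning steps on a single string.
    text = text.replace("\n", " ")
    text = URL_PATTERN.sub("", text)
    text = DOMAIN_PATTERN.sub("", text)
    return text

def word_count(text):
    # Whitespace-delimited word count, as stored in the length column.
    return len(text.split())

sample_review = ("Great saw.\nManual at https://example.com/manual "
                 "and support via example.com today")
cleaned = clean_text(sample_review)
n_words = word_count(cleaned)

# Keep only reviews with 5 to 128 words.
keep = 4 < n_words < 129
print(cleaned, n_words, keep)
```

Removing a URL leaves a double space behind, but `split()` ignores runs of whitespace, so the word count is unaffected.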

In [22]:
# Keep records where length of reviewText is between 5 and 128, and drop the rest.
reviews.drop(reviews[(reviews['length'] < 5) | (reviews['length'] > 128)].index, inplace=True)

In [23]:
# Replace null values in vote column with zeroes.
reviews['vote'] = reviews['vote'].fillna(0)

# Remove commas used as thousands separators, then convert the column to integer.
reviews['vote'] = reviews['vote'].replace(",", "", regex=True).astype(int)
reviews['vote']

0          0
1          0
2          0
3          0
5          5
          ..
2070822    0
2070827    0
2070828    0
2070829    0
2070830    0
Name: vote, Length: 1482487, dtype: int32
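
Note that `astype` returns a new Series rather than changing the column in place, so the result has to be assigned back for the conversion to stick.  A minimal sketch on made-up vote values (the values are assumptions chosen to cover the missing, plain and comma-separated cases):

```python
import pandas as pd

# Made-up vote values: missing, a plain digit string,
# and a string with a thousands separator.
votes = pd.Series([None, "3", "1,024"], dtype=object)

clean = (
    votes.fillna("0")                         # no votes recorded -> zero
         .str.replace(",", "", regex=False)   # "1,024" -> "1024"
         .astype(int)                         # returns a new integer Series
)
print(clean.tolist())
```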

In [24]:
# Check for duplicate products, i.e. same asin, same main_cat, same brand and same title.
products[products.duplicated(subset=['asin', 'main_cat', 'brand', 'title'])].count()

category        12195
tech1           12195
description     12195
fit             12195
title           12195
also_buy        12195
image           12195
tech2           12195
brand           12195
feature         12195
rank            12195
also_view       12195
main_cat        12195
similar_item    12195
date            12195
price           12195
asin            12195
details         12195
dtype: int64

In [25]:
# Drop duplicate products, i.e. same asin, same main_cat, same brand and same title.
products.drop_duplicates(subset=['asin', 'main_cat', 'brand', 'title'], inplace=True)

In [26]:
# Drop unwanted columns in both tables.
reviews.drop(unwanted_rev_cols, axis=1, inplace=True)
products.drop(unwanted_pdt_cols, axis=1, inplace=True)

In [27]:
# Merge both tables with an inner join on asin (the product ID).
merged = pd.merge(left=reviews, right=products, on='asin')
merged.shape

(1479593, 13)
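
The merged table has 2,894 fewer rows than the 1,482,487 filtered reviews, which suggests that some reviews have no matching product record and were dropped by the inner join.  One way to audit this, sketched here on made-up tables, is a left join with `indicator=True`; the toy frames and names are assumptions for illustration.

```python
import pandas as pd

reviews_toy = pd.DataFrame({"asin": ["P1", "P1", "P2", "P9"],
                            "reviewText": ["a", "b", "c", "d"]})
products_toy = pd.DataFrame({"asin": ["P1", "P2", "P3"],
                             "title": ["Drill", "Lamp", "Saw"]})

# Inner join (the pd.merge default): keeps only asin values present in both tables.
inner = pd.merge(left=reviews_toy, right=products_toy, on="asin")

# Left join with indicator=True labels each review with where its match came from.
audit = pd.merge(left=reviews_toy, right=products_toy, on="asin",
                 how="left", indicator=True)
unmatched = audit[audit["_merge"] == "left_only"]

print(len(inner))
print(unmatched["asin"].tolist())
```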

In [28]:
merged['main_cat'].unique()

array(['Tools & Home Improvement', 'Office Products', 'Toys & Games',
       'Industrial & Scientific', 'Automotive', 'Sports & Outdoors',
       'Amazon Home',
       '<img src="https://images-na.ssl-images-amazon.com/images/G/01/nav2/images/gui/amazon-fashion-store-new._CB520838675_.png" class="nav-categ-image" alt="AMAZON FASHION"/>',
       'All Electronics', 'Camera & Photo', 'Home Audio & Theater',
       'Baby', 'Cell Phones & Accessories', 'Arts, Crafts & Sewing',
       'Pet Supplies', 'Musical Instruments', 'All Beauty', 'Grocery',
       'Car Electronics', 'Health & Personal Care', 'Computers', '',
       'Video Games', 'Amazon Devices',
       '<img src="https://images-na.ssl-images-amazon.com/images/G/01/digital/music/logos/amzn_music_logo_subnav._CB471835632_.png" class="nav-categ-image" alt="Digital Music"/>',
       'Appliances', 'Books', 'GPS & Navigation',
       '<img src="https://images-na.ssl-images-amazon.com/images/G/01/handmade/brand/logos/2018/subnav_logo._CB50

In [29]:
# Cleanup main_cat values.
merged.loc[merged['main_cat'] == '<img src="https://images-na.ssl-images-amazon.com/images/G/01/nav2/images/gui/amazon-fashion-store-new._CB520838675_.png" class="nav-categ-image" alt="AMAZON FASHION"/>', 
           'main_cat'] = "Amazon Fashion"
merged.loc[merged['main_cat'] == '<img src="https://images-na.ssl-images-amazon.com/images/G/01/digital/music/logos/amzn_music_logo_subnav._CB471835632_.png" class="nav-categ-image" alt="Digital Music"/>', 
           'main_cat'] = "Digital Music"
merged.loc[merged['main_cat'] == '<img src="https://images-na.ssl-images-amazon.com/images/G/01/handmade/brand/logos/2018/subnav_logo._CB502360610_.png" class="nav-categ-image" alt="Handmade"/>', 
           'main_cat'] = "Handmade"
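
Instead of matching each full `<img>` string, the category name could be pulled from the tag's `alt` attribute.  This is a sketch of an alternative, not what the notebook does: `clean_main_cat` is a hypothetical helper, and the tag below is a shortened, made-up value in the same shape as the real ones.

```python
import re

# Capture the alt text of an <img ...> tag.
ALT_PATTERN = re.compile(r'<img\b[^>]*\balt="([^"]*)"')

def clean_main_cat(value):
    # Hypothetical helper: replace an <img ...> tag with its alt text in title case.
    match = ALT_PATTERN.search(value)
    if match:
        return match.group(1).title()
    # Plain category names pass through unchanged.
    return value

tag = '<img src="https://example.com/logo.png" class="nav-categ-image" alt="AMAZON FASHION"/>'
fashion = clean_main_cat(tag)
plain = clean_main_cat("Tools & Home Improvement")
print(fashion, "|", plain)
```

Applied with `merged['main_cat'].map(clean_main_cat)`, this would also cover any image-tag categories not listed by hand.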

In [30]:
# Check for records where main_cat is empty string.
merged[merged['main_cat'] == ""].count()

overall        1294
asin           1294
reviewText     1294
summary        1294
vote           1294
length         1294
description    1294
title          1294
brand          1294
feature        1294
rank           1294
main_cat       1294
price          1294
dtype: int64

In [31]:
# Drop records where main_cat is empty string.
merged.drop(merged[merged['main_cat'] == ""].index, inplace=True)

In [32]:
# Cleanup empty brand values.
merged.loc[merged['brand'] == "", 'brand'] = "None"
merged.shape

(1478299, 13)

## Prepare Bag of Words

In [33]:
# Convert list of English stop words to set.
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])
stopwords_set = set(stop_words)

# Create the lemmatizer once instead of once per token.
lemmatizer = WordNetLemmatizer()

def create_doc(text):
    """ 
    Creates a document of lowercase words from input text.  Input text is first tokenized by text_to_word_sequence (Keras), 
    filtered to tokens longer than 3 characters, lemmatized (WordNetLemmatizer), and then stripped of stop words.
    
    Parameters
    ----------
    text     : str
        Raw review text from Amazon reviews.
    
    Returns
    -------
    -        : str
        Document of lowercase words.
        
    """
    # Tokenize with function from Keras.
    tokens = text_to_word_sequence(text)
    
    # Keep tokens longer than 3 characters and lemmatize them to base form.
    base_tokens = [lemmatizer.lemmatize(word) for word in tokens if len(word) > 3]
    
    # Remove stop words.
    doc_words = [word for word in base_tokens if word not in stopwords_set]
    
    return " ".join(doc_words)
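
One side effect of the `len(word) > 3` filter is that short content words disappear along with the stop words, so "air pressure regulator" becomes "pressure regulator" and negations such as "not" are lost.  The dependency-free sketch below illustrates this; it is not the notebook's pipeline.  The regex tokenizer and the tiny stop-word set are stand-ins for Keras and NLTK, and lemmatization is left out.

```python
import re

# Tiny hand-picked stop-word set, a stand-in for NLTK's English list.
TOY_STOPWORDS = {"this", "with", "have", "from", "that", "does"}

def toy_doc(text):
    # Lowercase and extract runs of letters (rough stand-in for the Keras tokenizer).
    tokens = re.findall(r"[a-z']+", text.lower())
    # Same length rule as create_doc: tokens of 3 characters or fewer are dropped.
    long_tokens = [t for t in tokens if len(t) > 3]
    # Lemmatization is skipped here; only stop-word removal follows.
    return " ".join(t for t in long_tokens if t not in TOY_STOPWORDS)

doc = toy_doc("This air regulator does not leak, but the cap is bad")
print(doc)
```

Here "does not leak" and "bad" reduce to just "leak", which is worth keeping in mind when the documents are later used to separate complaints from praise.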

In [34]:
# Create new column in dataframe to hold documents.
merged['document'] = [create_doc(review) for review in merged['reviewText']]

In [35]:
# Compare reviewText and created document.
# Sample index labels rather than positions, because dropped rows left gaps in the index.
randomlist = sample(list(merged.index), 10)
for i in randomlist:
    print(merged.loc[i, ['reviewText']])
    print(merged.loc[i, ['document', 'length']], "\n")

reviewText    This air pressure regulator has intermittent problems with leaking air as if you were trying to lower the pressure when in fact I haven't touched it.  The only way to make it stop leaking is to tu...
Name: 717233, dtype: object
document    pressure regulator intermittent problem leaking trying lower pressure fact touched make stop leaking turn pressure zero turn back
length                                                                                                                                     47
Name: 717233, dtype: object 

reviewText    Gave this to my athletic trainer grandson, and now with football season, it's a daily use item.  You do have to remove the blade assembly to clean the cutting edge and remove tape residue.
Name: 75122, dtype: object
document    gave athletic trainer grandson football season daily item remove blade assembly clean cutting edge remove tape residue
length                                                                             

## Save Clean Data to File

In [36]:
# Save clean review and product data to file.
merged.to_csv("../data/reviews_clean.csv", index=False)
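
One thing to watch when the file is read back: a `document` can be an empty string if every word in the review was a stop word or had 3 characters or fewer, and pandas reads empty CSV fields as NaN by default.  A minimal sketch on made-up rows, using an in-memory buffer in place of the real file:

```python
import io
import pandas as pd

# Made-up rows: the second document is empty.
toy = pd.DataFrame({"reviewText": ["Sturdy ladder, worth it", "It is ok to me"],
                    "document":   ["sturdy ladder worth", ""]})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default read: the empty field comes back as NaN.
buffer.seek(0)
default_read = pd.read_csv(buffer)

# keep_default_na=False: the empty field comes back as an empty string.
buffer.seek(0)
string_read = pd.read_csv(buffer, keep_default_na=False)

print(default_read["document"].isna().tolist())
print(string_read["document"].tolist())
```

Either option works downstream as long as it is chosen deliberately; filling NaN with an empty string after loading is another possibility.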