# Capstone Project: Amazon Review Classification (Part 1)
Author: **Steven Lee**

# Categorizing Amazon Reviews

User reviews of products and services can give sellers and service providers valuable feedback on many areas of their business.  At the very least, reviews can signal problems with the manufacture of goods, a dip in the quality of services, or issues with deliveries.  They can also give business owners useful ideas for improving products and services, and can sometimes point to new products or services that are in demand.

The goal is to build a classification model that sorts reviews into multiple meaningful classes and shows which product aspects customers find below par, meeting expectations, or lacking in certain regards.  The target for this model is an accuracy score above 85%.  The models compared will include Naive Bayes, Random Forest and Neural Networks.

Sentiment analysis merely attempts to determine whether a review is positive or negative.  While this is helpful, it only tells business owners the proportion of buyers who were satisfied or dissatisfied with their purchases.  The multi-class model proposed here aims to give business owners more meaningful insights about their products.

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Import-Data" data-toc-modified-id="Import-Data-1">Import Data</a></span></li><li><span><a href="#Inspect-and-Clean-Data" data-toc-modified-id="Inspect-and-Clean-Data-2">Inspect and Clean Data</a></span></li><li><span><a href="#Prepare-Bag-of-Words" data-toc-modified-id="Prepare-Bag-of-Words-3">Prepare Bag of Words</a></span></li><li><span><a href="#Save-Clean-Data-to-File" data-toc-modified-id="Save-Clean-Data-to-File-4">Save Clean Data to File</a></span></li></ul></div>

## Import Data

In [1]:
# Import required libraries.
import numpy as np
import pandas as pd
import gzip
import json

from random import sample

# Import Tokenizer, Lemmatizer and stop words.
from tensorflow.keras.preprocessing.text import text_to_word_sequence
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

The following datasets are [updated versions](https://nijianmo.github.io/amazon/index.html) of the Amazon review dataset released in 2014.  For this project, the scope is limited to reviews of products under the **Tools and Home Improvement** main category.  I will use the smaller subset of the review data (roughly 2 million reviews), which is extracted from the main data of more than 9 million reviews.  The product metadata is included to see whether it can be merged with the final named entities to provide enhanced insights.
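Both datasets carry an `asin` product ID, so a merge on that column is one way the product metadata could later be attached to the reviews. The sketch below is only an illustration on made-up toy frames; the names `reviews_demo`, `products_demo` and `merged_demo` are hypothetical, and the real merge may need different columns.

```python
import pandas as pd

# Toy stand-ins for the review and product tables (all values are made up).
reviews_demo = pd.DataFrame({
    "asin": ["A1", "A1", "B2"],
    "overall": [5.0, 2.0, 4.0],
    "reviewText": ["Works well", "Broke in a week", "Good value"],
})
products_demo = pd.DataFrame({
    "asin": ["A1", "B2", "C3"],
    "title": ["Cordless Drill", "LED Work Light", "Tape Measure"],
})

# A left join keeps every review, even when its product has no metadata row.
# validate="many_to_one" raises if an asin appears twice in the product table.
merged_demo = reviews_demo.merge(
    products_demo[["asin", "title"]],
    on="asin",
    how="left",
    validate="many_to_one",
)
print(merged_demo)
```

A left join is used here so that no review is dropped; an inner join would silently discard reviews whose product is missing from the metadata.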

In [2]:
# Read in review and product datasets.
review_data = "../data/Tools_and_Home_Improvement_5.json.gz"
product_data = "../data/meta_Tools_and_Home_Improvement.json.gz"

In [3]:
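def parse(path):
    """Yield one parsed JSON record per line of a gzip-compressed file."""
    # The context manager closes the file once iteration finishes.
    with gzip.open(path, 'rb') as g:
        for line in g:
            yield json.loads(line)

def getDF(path):
    """Load all records into a DataFrame with a 0..n-1 integer index."""
    # Key each record by its position so from_dict can use it as the index.
    df = {i: d for i, d in enumerate(parse(path))}
    return pd.DataFrame.from_dict(df, orient='index')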
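Since loading the full files takes a while, it can help to peek at a few records first. The sketch below reads only the first `n` lines of a gzip-compressed JSON-lines file; `preview_records` and the temporary demo file are hypothetical, and the code assumes the data is UTF-8 encoded with one JSON object per line.

```python
import gzip
import json
import os
import tempfile
from itertools import islice


def preview_records(path, n=3):
    """Return the first n JSON-lines records from a gzip file as dicts."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        # islice stops after n lines, so the rest of the file is never read.
        return [json.loads(line) for line in islice(handle, n)]


# Demo on a small temporary file so the sketch runs without the real data.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.json.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as handle:
    for i in range(5):
        handle.write(json.dumps({"asin": f"X{i}", "overall": 5.0}) + "\n")

first_rows = preview_records(demo_path, n=3)
print(first_rows)
```

On the real data, `pd.DataFrame(preview_records(review_data))` would give a quick look at the columns before committing to the full load.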

In [4]:
%%time

reviews = getDF(review_data)
reviews.head(3)

Wall time: 26.8 s


Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,vote,image
0,5.0,True,"01 28, 2018",AL19QO4XLBQPU,982085028,{'Style:': ' 1) IR30 POU (30A/3.4kW/110v)'},J. Mollenkamp,"returned, decided against this product",Five Stars,1517097600,,
1,5.0,True,"11 30, 2017",A1I7CVB7X3T81E,982085028,{'Style:': ' 3) IR260 POU (30A/6kW/220v)'},warfam,Awesome heater for the electrical requirements! Makes an awesome preheater for my talnkless system,Five Stars,1512000000,,
2,5.0,True,"09 12, 2017",A1AQXO4P5U674E,982085028,{'Style:': ' Style64'},gbieber2,Keeps the mist of your wood trim and on you. Bendable too.,Five Stars,1505174400,,


In [5]:
%%time

products = getDF(product_data)
products.head(3)

Wall time: 1min 51s


Unnamed: 0,category,tech1,description,fit,title,also_buy,image,tech2,brand,feature,rank,also_view,main_cat,similar_item,date,price,asin,details
0,"[Tools & Home Improvement, Lighting & Ceiling Fans, Lamps & Shades, Table Lamps]",,[collectible table lamp],,Everett's Cottage Table Lamp,[],[],,,[],"[>#3,780,135 in Tools & Home Improvement (See top 100), >#45,028 in Tools & Home Improvement > Lighting & Ceiling Fans > Lamps & Shades > Table Lamps]",[],Tools & Home Improvement,,"October 30, 2010",,001212835X,
1,"[Tools & Home Improvement, Lighting & Ceiling Fans, Novelty Lighting]","class=""a-keyvalue prodDetTable"" role=""presentation"">\n \n \n \n \n <tr>\n \n \n \n \n \n <th class=""a-color-secondary a-size-base prodDetSectionEntry"">\n \tBrand\t\n </th>\n \n \n \n \n \n <td class=""a-size-base"">\n Barnes & Noble\n </td>\n \n </tr>\n \n \n \n <tr>\n \n \n \n \n \n <th class=""a-color-secondary a-size-base prodDetSectionEntry"">\n Item Weight\n </th>\n \n \n \n \n \n <td class=""a-size-base"">\n 1.6 ounces\n </td>\n \n </tr>\n \n \n \n <tr>\n \n \n \n \n \n <th class=""a-color-secondary a-size-base prodDetSectionEntry"">\n Package Dimensions\n </th>\n \n \n \n \n \n <td class=""a-size-base"">\n 7.5 x 5.5 x 0.6 inches\n </td>\n \n </tr>\n \n \n \n <tr>\n \n \n \n \n \n <th class=""a-color-secondary a-size-base prodDetSectionEntry"">\n Batteries\n </th>\n \n \n \n \n \n <td class=""a-size-base"">\n 2 AA batteries required.\n </td>\n \n </tr>\n \n \n \n <tr>\n \n \n \n \n \n <th class=""a-color-secondary a-size-base prodDetSectionEntry"">\n Power Source\t\n </th>\n \n \n \n \n \n <td class=""a-size-base"">\n battery-powered\n </td>\n \n </tr>\n \n \n \n <tr>\n \n \n \n \n \n <th class=""a-color-secondary a-size-base prodDetSectionEntry"">\n Batteries Included?\n </th>\n \n \n \n \n \n <td class=""a-size-base"">\n No\n </td>\n \n </tr>\n \n \n \n <tr>\n \n \n \n \n \n <th class=""a-color-secondary a-size-base prodDetSectionEntry"">\n Batteries Required?\n </th>\n \n \n \n \n \n <td class=""a-size-base"">\n No\n </td>\n \n </tr>\n \n \n \n <tr>\n \n \n \n \n <th class=""a-color-secondary a-size-base prodDetSectionEntry"">\n \n \n \n <span class=""a-declarative"" data-action=""a-popover"" data-a-popover=""{&quot;cache&quot;:&quot;true&quot;,&quot;closeButton&quot;:&quot;true&quot;,&quot;name&quot;:&quot;\tType of Bulb\t&quot;,&quot;width&quot;:&quot;280&quot;,&quot;header&quot;:&quot;\tType of 
Bulb\t&quot;,&quot;position&quot;:&quot;triggerRight&quot;,&quot;scrollable&quot;:&quot;false&quot;,&quot;url&quot;:&quot;/gp/jewelry/technical-specs-help/?ie=UTF8&amp;hideLogo=1&amp;page_ident=1000516121&quot;}"">\n <a class=""a-link-normal"" target=""_blank"" rel=""noopener"" href=""/gp/jewelry/technical-specs-help/?ie=UTF8&hideLogo=1&page_ident=1000516121"">\n \tType of Bulb\t\n </a>\n </span>\n \n \n \n </th>\n \n \n \n \n \n <td class=""a-size-base"">\n Battery powered\n </td>\n \n </tr>\n \n \n \n",[Fun book light! Comes with two AAA batteries and a long-lasting LED lightbulb. Makes a great gift!],,Diary of a Wimpy Kid Book Light,[],"[https://images-na.ssl-images-amazon.com/images/I/51rylQjhYLL._SX38_SY50_CR,0,0,38,50_.jpg, https://images-na.ssl-images-amazon.com/images/I/51Svo2N%2B9DL._SX38_SY50_CR,0,0,38,50_.jpg, https://images-na.ssl-images-amazon.com/images/I/51VFEE2jlrL._SX38_SY50_CR,0,0,38,50_.jpg, https://images-na.ssl-images-amazon.com/images/I/51%2Bn%2BXsm1uL._SX38_SY50_CR,0,0,38,50_.jpg]",,Barnes & Noble,"[Easily clips to hardcover and paperback books, Comes with two AAA batteries, Long-lasting LED lightbulb]","[>#1,074,903 in Tools & Home Improvement (See top 100), >#37,631 in Tools & Home Improvement > Lighting & Ceiling Fans > Novelty Lighting]",[],Tools & Home Improvement,,"March 9, 2013",,0594510384,
2,"[Tools & Home Improvement, Paint, Wall Treatments & Supplies, Wall Stickers & Murals]",,"[A fun addition to any decor, The Beatles Yellow Submarine Wall Decals feature art from the 1968 animated classic starring the Fab Four in one of their most memorable adventures. Each package includes six sheets of colorful and stylish wall stickers. Great for Beatles fans of all ages! - Package: 11.25 x 11.25 in. - 6 sheets of decals: 11 x 11 in. - Shrink-wrapped - Made in the U.S.A.]",,Mudpuppy The Beatles Yellow Submarine Wall Decals,"[1481403621, B00EMLN7PS, B077NNCBTP, B01GWKH0FO, B01FUFGEHM, 1536201464, 1785863940, B003SJK6I6, B005S9JJOG, B07BTMLNFN, B01JH1GTUW, 1786707039, B0185QMG7K, B00KCNBED2, B07DV95WN6, 0735342431, B007MJS48C, 0763658545, B011N7IADC, B003VYAHRI]",[https://images-na.ssl-images-amazon.com/images/I/51IBDcZ5tJL._SS40_.jpg],,Mudpuppy,"[6 sheets of decals, Package: 11.25 x 11.25 in, Shrink wrapped]","[>#105,697 in Toys & Games (See Top 100 in Toys & Games), >#1,297 in Home & Kitchen > Home Dcor > Home Dcor Accents > Wall Stickers & Murals]","[B003VYAHRI, B00EMLN7PS, B01GWKH0FO, 0735344523, B00EMLB85Y, B00OZ95GDI, B004B49A6Q, B079KCKKVP, B003SJK6I6, B00DSD5D0S, B01FUFGEHM, B077NNCBTP, B07BTMLNFN, 0735342431, B00DSD5DUI, B0799GWBQ7, B075VBFLVK, 0735342415, B011N7IADC, B0711DTW35, B0185QMG7K, B01167FLK4, B001QXT2RW, B07BK5QZX1, B07B3LTKKQ, B07CWPPVD7, B01JH1GTUW, B00O5XDPXW, B073T669Y1, B071GDS37T, B011N7H4MU, B006OAFPJQ, B00AUKM4NG, B07CBYZC47, B071YCM88J, 1536201464, B0092HNX8I, B0181YGEBK, B007MJS48W, B00DSD5CPO, B073LX68H9, 1785863940, B00DSD5EAM, B003EUEOPK, B0731NJM5H, B06ZYW74RM, B0741DBMYM, B004UQO380, B00B2VXGNE]",Toys & Games,"class=""a-bordered a-horizontal-stripes a-spacing-extra-large a-size-base comparison_table"">\n\n\n\n \n \n \n \n \n <tr class=""comparison_table_image_row"">\n <td class=""comparison_table_first_col""></td>\n\n\n <th class=""comparison_image_title_cell"" role=""columnheader"">\n <div class=""a-row a-spacing-top-micro"">\n 
<center>\n <img alt=""Mudpuppy The Beatles Yellow Submarine Wall Decals"" src=""https://images-na.ssl-images-amazon.com/images/I/61St19N7ztL._SL500_AC_SS350_.jpg"" id=""comparison_image"">\n </center>\n </div>\n <div class=""a-row a-spacing-top-small"">\n <div id=""comparison_title"" class=""a-section a-spacing-none"">\n <span aria-hidden=""true"" class=""a-size-base a-color-base a-text-bold"">\n This item\n </span>\n <span aria-hidden=""true"" class=""a-size-base a-color-base"">Mudpuppy The Beatles Yellow Submarine Wall Decals</span>\n </div>\n \n \n </div>\n </th>\n\n\n \n <th class=""comparison_image_title_cell comparable_item0"" role=""columnheader"">\n <a class=""a-link-normal"" target=""_self"" rel=""noopener"" href=""/dp/B00EMLB85Y/ref=psdc_2445485011_t1_0735342989"">\n <div class=""a-row a-spacing-top-micro"">\n <center>\n <img alt="""" src=""https://images-na.ssl-images-amazon.com/images/I/31k%2BvDb3azL._SL500_AC_SS350_.jpg"" aria-hidden=""true"" id=""comparison_image0"">\n </center>\n </div>\n <div id=""comparison_title0"" class=""a-row a-spacing-top-small"">\n <span class=""a-size-base"">Beatles Yellow Submarine Wall Graphic Decal Sticker 25"" x 15""</span>\n </div>\n </a>\n \n \n </th>\n \n <th class=""comparison_image_title_cell comparable_item1"" role=""columnheader"">\n <a class=""a-link-normal"" target=""_self"" rel=""noopener"" href=""/dp/B004B49A6Q/ref=psdc_2445485011_t2_0735342989"">\n <div class=""a-row a-spacing-top-micro"">\n <center>\n <img alt="""" src=""https://images-na.ssl-images-amazon.com/images/I/51TVp4e%2BWnL._SL500_AC_SS350_.jpg"" aria-hidden=""true"" id=""comparison_image1"">\n </center>\n </div>\n <div id=""comparison_title1"" class=""a-row a-spacing-top-small"">\n <span class=""a-size-base"">Beatles - Vinyl Wall Art Words Lettering Decal Decor</span>\n </div>\n </a>\n \n \n </th>\n \n <th class=""comparison_image_title_cell comparable_item2"" role=""columnheader"">\n <a class=""a-link-normal"" target=""_self"" rel=""noopener"" 
href=""/dp/B00OZ95GDI/ref=psdc_2445485011_t3_0735342989"">\n <div class=""a-row a-spacing-top-micro"">\n <center>\n <img alt="""" src=""https://images-na.ssl-images-amazon.com/images/I/518QPoSEBHL._SL500_AC_SS350_.jpg"" aria-hidden=""true"" id=""comparison_image2"">\n </center>\n </div>\n <div id=""comparison_title2"" class=""a-row a-spacing-top-small"">\n <span class=""a-size-base"">Yellow Submarine Wall Decals, 30"" W by 17"" H, Ocean Wall Decals, Sea Life Decals, Underwater Nursery, The Beatles, Submarine Wall Decals, Kids Decals PLUS FREE WHITE HELLO DOOR DECAL WITH PURCHASE</span>\n </div>\n </a>\n \n \n </th>\n \n <th class=""comparison_image_title_cell comparable_item3"" role=""columnheader"">\n <a class=""a-link-normal"" target=""_self"" rel=""noopener"" href=""/dp/B00O5XDPXW/ref=psdc_2445485011_t4_0735342989"">\n <div class=""a-row a-spacing-top-micro"">\n <center>\n <img alt="""" src=""https://images-na.ssl-images-amazon.com/images/I/31pkrYar7GL._SL500_AC_SS350_.jpg"" aria-hidden=""true"" id=""comparison_image3"">\n </center>\n </div>\n <div id=""comparison_title3"" class=""a-row a-spacing-top-small"">\n <span class=""a-size-base"">Abbey Road The Beatles Quote Decal Wall Vinyl Art Sticker Music</span>\n </div>\n </a>\n \n \n </th>\n \n <th class=""comparison_image_title_cell comparable_item4"" role=""columnheader"">\n <a class=""a-link-normal"" target=""_self"" rel=""noopener"" href=""/dp/B004UQO380/ref=psdc_2445485011_t5_0735342989"">\n <div class=""a-row a-spacing-top-micro"">\n <center>\n <img alt="""" src=""https://images-na.ssl-images-amazon.com/images/I/51K3PBw54AL._SL500_AC_SS350_.jpg"" aria-hidden=""true"" id=""comparison_image4"">\n </center>\n </div>\n <div id=""comparison_title4"" class=""a-row a-spacing-top-small"">\n <span class=""a-size-base"">Beatles Promo - Vinyl Wall Art Decal Stickers Decor Graphics</span>\n </div>\n </a>\n \n \n </th>\n \n </tr>\n\n\n <tr></tr>\n\n\n \n \n <tr>\n <td class=""comparison_table_first_col""></td>\n\n\n <td 
class=""comparison_add_to_cart_button"">\n \n \n \n \n <span id=""comparison_add_to_cart_button"" class=""a-button a-spacing-small a-button-primary""><span class=""a-button-inner""><a id=""comparison_add_to_cart_button-announce"" href=""/gp/item-dispatch/ref=psdc_2445485011_a0_0735342989?ie=UTF8&amp;itemCount=1&amp;nodeID=165793011&amp;offeringID.1=VYNk5f3ZVSHHbuTvy4xjf6xRB7phfhwUtv7TjzwX5muXgzALxGBrLFh3hdmfSmjZDugooJF6OHudobjy7Meqse29fmaF2mnaQ%252BPpzQnYdRh4yFegnS9KtA%253D%253D&amp;storeID=toys-and-games&amp;session-id=144-7107030-8486903&amp;submit.addToCart=addToCart&amp;signInToHUC=0"" class=""a-button-text"" role=""button"">Add to Cart</a></span></span>\n \n \n </td>\n\n\n \n <td class=""a-text-left comparison_add_to_cart_button comparable_item0"">\n \n \n \n <span id=""comparison_add_to_cart_button0"" class=""a-button a-spacing-small a-button-primary""><span class=""a-button-inner""><a id=""comparison_add_to_cart_button0-announce"" href=""/gp/item-dispatch/ref=psdc_2445485011_a1_0735342989?ie=UTF8&amp;itemCount=1&amp;nodeID=1055398&amp;offeringID.1=VYNk5f3ZVSEMC0znTc9vNeJL%252BmSojGg5jO4yuE7VevkzakN7Z0D%252F7v8ag8j5ZZD89qmURZTX8Bmu4DilghpDnWSLi3DQVJqs05HjVxEAR%252B7L87GxJehdKF9XceUaKUyjQsV0dLa0ZfDYAfNINyhaNVQwR%252FGvx4n%252B&amp;storeID=home-garden&amp;session-id=144-7107030-8486903&amp;submit.addToCart=addToCart&amp;signInToHUC=0"" class=""a-button-text"" role=""button"">Add to Cart</a></span></span>\n \n \n </td>\n \n <td class=""a-text-left comparison_add_to_cart_button comparable_item1"">\n \n \n \n <span id=""comparison_add_to_cart_button1"" class=""a-button a-spacing-small a-button-primary""><span class=""a-button-inner""><a id=""comparison_add_to_cart_button1-announce"" 
href=""/gp/item-dispatch/ref=psdc_2445485011_a2_0735342989?ie=UTF8&amp;itemCount=1&amp;nodeID=228013&amp;offeringID.1=VYNk5f3ZVSEMC0znTc9vNT%252FcB%252FWPxqwBSWKVDSI36XxLQK1%252B4hI6qerlYxg44PHqOLvQk3u7Vv0BId842iX036usF7KRU8E%252BYHyDLAnJnS5dW8DAK1uEtzoTrfN%252FRA3jlmfZ4IFUfcLjIDzUy1Mz4g%253D%253D&amp;storeID=hi&amp;session-id=144-7107030-8486903&amp;submit.addToCart=addToCart&amp;signInToHUC=0"" class=""a-button-text"" role=""button"">Add to Cart</a></span></span>\n \n \n </td>\n \n <td class=""a-text-left comparison_add_to_cart_button comparable_item2"">\n \n \n <span id=""comparison_see_details_button2"" class=""a-button a-spacing-small""><span class=""a-button-inner""><a id=""comparison_see_details_button2-announce"" href=""/dp/B00OZ95GDI/ref=psdc_2445485011_t3_0735342989"" class=""a-button-text"" role=""button"">\n See Details\n </a></span></span>\n \n \n \n </td>\n \n <td class=""a-text-left comparison_add_to_cart_button comparable_item3"">\n \n \n \n <span id=""comparison_add_to_cart_button3"" class=""a-button a-spacing-small a-button-primary""><span class=""a-button-inner""><a id=""comparison_add_to_cart_button3-announce"" href=""/gp/item-dispatch/ref=psdc_2445485011_a4_0735342989?ie=UTF8&amp;itemCount=1&amp;nodeID=1055398&amp;offeringID.1=VYNk5f3ZVSEMC0znTc9vNX9l4bU%252B31ZEQeaDxRPHVKRrxuJFEJUm4OSnNxg4Ev2%252FSkjP0GWgpEp2dU5U5We9EOb%252BC4j9YgY7K%252B5pK7AmLM4%252Fmenb215%252FMna2H3C9G9XqF0qqOut6hZ7hiqRRMt5TvOUETSzE9qwt&amp;storeID=home-garden&amp;session-id=144-7107030-8486903&amp;submit.addToCart=addToCart&amp;signInToHUC=0"" class=""a-button-text"" role=""button"">Add to Cart</a></span></span>\n \n \n </td>\n \n <td class=""a-text-left comparison_add_to_cart_button comparable_item4"">\n \n \n \n <span id=""comparison_add_to_cart_button4"" class=""a-button a-spacing-small a-button-primary""><span class=""a-button-inner""><a id=""comparison_add_to_cart_button4-announce"" 
href=""/gp/item-dispatch/ref=psdc_2445485011_a5_0735342989?ie=UTF8&amp;itemCount=1&amp;nodeID=228013&amp;offeringID.1=VYNk5f3ZVSEMC0znTc9vNYtzIAUhHn3Ai8fK%252BeMaCBpwg7LXDtZXm8T0xGLz2BRECCt0MMdWwTA%252FpIQH45iZo855vYjm9C5KxNnUPUs0j%252BfncuJhoDREWFq5XYaTMr7X8Ey3Gqa8UvkoRioo%252BJostg%253D%253D&amp;storeID=hi&amp;session-id=144-7107030-8486903&amp;submit.addToCart=addToCart&amp;signInToHUC=0"" class=""a-button-text"" role=""button"">Add to Cart</a></span></span>\n \n \n </td>\n \n </tr>\n\n\n <tr id=""comparison_custormer_rating_row"">\n \n \n \n \n\n <th class=""comparison_attribute_name_column comparison_table_first_col"" role=""rowheader"">\n <span class=""a-size-base a-color-base"">Customer Rating</span>\n </th>\n\n\n <td class=""comparison_baseitem_column"">\n \n \n \n <span>\n <span class=""a-declarative"" data-action=""a-popover"" data-a-popover=""{&quot;max-width&quot;:&quot;700&quot;,&quot;closeButton&quot;:&quot;false&quot;,&quot;position&quot;:&quot;triggerBottom&quot;,&quot;url&quot;:&quot; /gp/customer-reviews/widgets/average-customer-review/popover/ref=acr_dpComparsion__popover?contextId=dpComparsion&amp;asin=0735342989 &quot;,&quot;restoreFocusOnHide&quot;:&quot;false&quot;}"">\n <i class=""a-icon a-icon-star a-star-3-5 a-spacing-none""><span class=""a-icon-alt"">3 out of 5 stars</span></i>\n </span>\n <a class=""a-link-normal"" target=""_self"" rel=""noopener"" href=""/product-reviews/0735342989/ref=psdc_2445485011_r0_0735342989?_encoding=UTF8&amp;showViewpoints=1"">(10)</a>\n <span class=""a-letter-space""></span>\n </span>\n </td>\n\n\n \n <td class=""comparison_sim_items_column comparable_item0"">\n \n \n \n <span>\n <span class=""a-declarative"" data-action=""a-popover"" data-a-popover=""{&quot;max-width&quot;:&quot;700&quot;,&quot;closeButton&quot;:&quot;false&quot;,&quot;position&quot;:&quot;triggerBottom&quot;,&quot;url&quot;:&quot; 
/gp/customer-reviews/widgets/average-customer-review/popover/ref=acr_dpComparsion__popover?contextId=dpComparsion&amp;asin=B00EMLB85Y &quot;,&quot;restoreFocusOnHide&quot;:&quot;false&quot;}"">\n <i class=""a-icon a-icon-star a-star-4-5 a-spacing-none""><span class=""a-icon-alt"">4 out of 5 stars</span></i>\n </span>\n <a class=""a-link-normal"" target=""_self"" rel=""noopener"" href=""/product-reviews/B00EMLB85Y/ref=psdc_2445485011_r1_0735342989?_encoding=UTF8&amp;showViewpoints=1"">(3)</a>\n <span class=""a-letter-space""></span>\n </span>\n </td>\n \n <td class=""comparison_sim_items_column comparable_item1"">\n \n \n \n <span>\n <span class=""a-declarative"" data-action=""a-popover"" data-a-popover=""{&quot;max-width&quot;:&quot;700&quot;,&quot;closeButton&quot;:&quot;false&quot;,&quot;position&quot;:&quot;triggerBottom&quot;,&quot;url&quot;:&quot; /gp/customer-reviews/widgets/average-customer-review/popover/ref=acr_dpComparsion__popover?contextId=dpComparsion&amp;asin=B004B49A6Q &quot;,&quot;restoreFocusOnHide&quot;:&quot;false&quot;}"">\n <i class=""a-icon a-icon-star a-star-4 a-spacing-none""><span class=""a-icon-alt"">4 out of 5 stars</span></i>\n </span>\n <a class=""a-link-normal"" target=""_self"" rel=""noopener"" href=""/product-reviews/B004B49A6Q/ref=psdc_2445485011_r2_0735342989?_encoding=UTF8&amp;showViewpoints=1"">(29)</a>\n <span class=""a-letter-space""></span>\n </span>\n </td>\n \n <td class=""comparison_sim_items_column comparable_item2"">\n \n \n \n <span>\n <span class=""a-declarative"" data-action=""a-popover"" data-a-popover=""{&quot;max-width&quot;:&quot;700&quot;,&quot;closeButton&quot;:&quot;false&quot;,&quot;position&quot;:&quot;triggerBottom&quot;,&quot;url&quot;:&quot; /gp/customer-reviews/widgets/average-customer-review/popover/ref=acr_dpComparsion__popover?contextId=dpComparsion&amp;asin=B00OZ95GDI &quot;,&quot;restoreFocusOnHide&quot;:&quot;false&quot;}"">\n <i class=""a-icon a-icon-star a-star-2 a-spacing-none""><span 
class=""a-icon-alt"">2 out of 5 stars</span></i>\n </span>\n <a class=""a-link-normal"" target=""_self"" rel=""noopener"" href=""/product-reviews/B00OZ95GDI/ref=psdc_2445485011_r3_0735342989?_encoding=UTF8&amp;showViewpoints=1"">(1)</a>\n <span class=""a-letter-space""></span>\n </span>\n </td>\n \n <td class=""comparison_sim_items_column comparable_item3"">\n \n \n \n <span>\n <span class=""a-declarative"" data-action=""a-popover"" data-a-popover=""{&quot;max-width&quot;:&quot;700&quot;,&quot;closeButton&quot;:&quot;false&quot;,&quot;position&quot;:&quot;triggerBottom&quot;,&quot;url&quot;:&quot; /gp/customer-reviews/widgets/average-customer-review/popover/ref=acr_dpComparsion__popover?contextId=dpComparsion&amp;asin=B00O5XDPXW &quot;,&quot;restoreFocusOnHide&quot;:&quot;false&quot;}"">\n <i class=""a-icon a-icon-star a-star-5 a-spacing-none""><span class=""a-icon-alt"">5 out of 5 stars</span></i>\n </span>\n <a class=""a-link-normal"" target=""_self"" rel=""noopener"" href=""/product-reviews/B00O5XDPXW/ref=psdc_2445485011_r4_0735342989?_encoding=UTF8&amp;showViewpoints=1"">(1)</a>\n <span class=""a-letter-space""></span>\n </span>\n </td>\n \n <td class=""comparison_sim_items_column comparable_item4"">\n \n \n \n <span>\n <span class=""a-declarative"" data-action=""a-popover"" data-a-popover=""{&quot;max-width&quot;:&quot;700&quot;,&quot;closeButton&quot;:&quot;false&quot;,&quot;position&quot;:&quot;triggerBottom&quot;,&quot;url&quot;:&quot; /gp/customer-reviews/widgets/average-customer-review/popover/ref=acr_dpComparsion__popover?contextId=dpComparsion&amp;asin=B004UQO380 &quot;,&quot;restoreFocusOnHide&quot;:&quot;false&quot;}"">\n <i class=""a-icon a-icon-star a-star-3-5 a-spacing-none""><span class=""a-icon-alt"">3 out of 5 stars</span></i>\n </span>\n <a class=""a-link-normal"" target=""_self"" rel=""noopener"" href=""/product-reviews/B004UQO380/ref=psdc_2445485011_r5_0735342989?_encoding=UTF8&amp;showViewpoints=1"">(19)</a>\n <span 
class=""a-letter-space""></span>\n </span>\n </td>\n \n </tr>\n\n\n <tr id=""comparison_price_row"">\n <th class=""comparison_attribute_name_column comparison_table_first_col"" role=""rowheader"">\n <span class=""a-size-base a-color-base"">Price</span>\n </th>\n\n\n <td class=""comparison_baseitem_column"">\n \n \n \n \n \n \n \n <span class=""a-price"" data-a-size=""l"" data-a-color=""base""><span class=""a-offscreen"">$20.52</span><span aria-hidden=""true""><span class=""a-price-symbol"">$</span><span class=""a-price-whole"">20<span class=""a-price-decimal"">�</span></span><span class=""a-price-fraction"">52</span></span></span>\n \n \n \n \n \n </td>\n\n\n \n <td class=""comparison_sim_items_column comparable_item0"">\n \n \n \n \n \n \n <span class=""a-price"" data-a-size=""l"" data-a-color=""base""><span class=""a-offscreen"">$21.70</span><span aria-hidden=""true""><span class=""a-price-symbol"">$</span><span class=""a-price-whole"">21<span class=""a-price-decimal"">�</span></span><span class=""a-price-fraction"">70</span></span></span>\n \n \n \n \n \n </td>\n \n <td class=""comparison_sim_items_column comparable_item1"">\n \n \n \n \n \n \n <span class=""a-price"" data-a-size=""l"" data-a-color=""base""><span class=""a-offscreen"">$27.97</span><span aria-hidden=""true""><span class=""a-price-symbol"">$</span><span class=""a-price-whole"">27<span class=""a-price-decimal"">�</span></span><span class=""a-price-fraction"">97</span></span></span>\n \n \n \n \n \n </td>\n \n <td class=""comparison_sim_items_column comparable_item2"">\n \n \n \n \n \n \n <span class=""a-price"" data-a-size=""l"" data-a-color=""base""><span class=""a-offscreen"">$19.99</span><span aria-hidden=""true""><span class=""a-price-symbol"">$</span><span class=""a-price-whole"">19<span class=""a-price-decimal"">�</span></span><span class=""a-price-fraction"">99</span></span></span>\n \n \n \n \n \n </td>\n \n <td class=""comparison_sim_items_column comparable_item3"">\n \n \n \n \n \n \n 
<span class=""a-price"" data-a-size=""l"" data-a-color=""base""><span class=""a-offscreen"">$13.76</span><span aria-hidden=""true""><span class=""a-price-symbol"">$</span><span class=""a-price-whole"">13<span class=""a-price-decimal"">�</span></span><span class=""a-price-fraction"">76</span></span></span>\n \n \n \n \n \n </td>\n \n <td class=""comparison_sim_items_column comparable_item4"">\n \n \n \n \n \n \n <span class=""a-price"" data-a-size=""l"" data-a-color=""base""><span class=""a-offscreen"">$22.97</span><span aria-hidden=""true""><span class=""a-price-symbol"">$</span><span class=""a-price-whole"">22<span class=""a-price-decimal"">�</span></span><span class=""a-price-fraction"">97</span></span></span>\n \n \n \n \n \n </td>\n \n </tr>\n\n\n \n <tr id=""comparison_shipping_info_row"">\n <th class=""comparison_attribute_name_column comparison_table_first_col"" role=""rowheader"">\n <span class=""a-size-base a-color-base"">Shipping</span>\n </th>\n\n \n <td class=""comparison_baseitem_column"">\n \n \n \n \n <span class=""a-size-base a-color-base"">Eligible for FREE Shipping</span>\n \n \n \n \n \n \n </td>\n\n \n \n <td class=""comparison_sim_items_column comparable_item0"">\n \n \n \n <span class=""a-size-base a-color-base"">Eligible for FREE Shipping</span>\n \n \n \n \n \n \n </td>\n \n <td class=""comparison_sim_items_column comparable_item1"">\n \n \n \n \n \n \n <span class=""a-size-base a-color-price a-text-bold"">$5.50</span>\n \n \n \n </td>\n \n <td class=""comparison_sim_items_column comparable_item2"">\n \n \n \n \n \n \n <span class=""a-size-base a-color-price a-text-bold"">$5.99</span>\n \n \n \n </td>\n \n <td class=""comparison_sim_items_column comparable_item3"">\n \n \n \n <span class=""a-size-base a-color-base"">Eligible for FREE Shipping</span>\n \n \n \n \n \n \n </td>\n \n <td class=""comparison_sim_items_column comparable_item4"">\n \n \n \n \n \n \n <span class=""a-size-base a-color-price a-text-bold"">$5.50</span>\n \n \n \n 
</td>\n \n </tr>\n\n\n\n <tr id=""comparison_sold_by_row"">\n <th class=""comparison_attribute_name_column comparison_table_first_col"" role=""rowheader"">\n <span class=""a-size-base a-color-base"">Sold By</span>\n </th>\n\n \n <td class=""comparison_baseitem_column"">\n \n \n \n <span class=""a-size-base a-color-base"">Amazon.com</span>\n \n \n \n </td>\n\n\n \n <td class=""comparison_sim_items_column comparable_item0"">\n \n \n \n <a class=""a-spacing-top-small a-link-normal"" target=""_self"" rel=""noopener"" href=""/gp/help/seller/at-a-glance.html/ref=psdc_2445485011_s1_0735342989?ie=UTF8&amp;seller=A26N3EJ2VC0EL1"">Boston Decal Works LLC</a>\n \n \n </td>\n \n <td class=""comparison_sim_items_column comparable_item1"">\n \n \n \n <a class=""a-spacing-top-small a-link-normal"" target=""_self"" rel=""noopener"" href=""/gp/help/seller/at-a-glance.html/ref=psdc_2445485011_s2_0735342989?ie=UTF8&amp;seller=AC6TRIE4VL9UK"">The Custom Vinyl Shop</a>\n \n \n </td>\n \n <td class=""comparison_sim_items_column comparable_item2"">\n \n \n \n <a class=""a-spacing-top-small a-link-normal"" target=""_self"" rel=""noopener"" href=""/gp/help/seller/at-a-glance.html/ref=psdc_2445485011_s3_0735342989?ie=UTF8&amp;seller=A3C69RP9OO6269"">Decor Designs Decals</a>\n \n \n </td>\n \n <td class=""comparison_sim_items_column comparable_item3"">\n \n \n \n <a class=""a-spacing-top-small a-link-normal"" target=""_self"" rel=""noopener"" href=""/gp/help/seller/at-a-glance.html/ref=psdc_2445485011_s4_0735342989?ie=UTF8&amp;seller=A2AQJTKFLAYYUJ"">John D Izard</a>\n \n \n </td>\n \n <td class=""comparison_sim_items_column comparable_item4"">\n \n \n \n <a class=""a-spacing-top-small a-link-normal"" target=""_self"" rel=""noopener"" href=""/gp/help/seller/at-a-glance.html/ref=psdc_2445485011_s5_0735342989?ie=UTF8&amp;seller=AC6TRIE4VL9UK"">The Custom Vinyl Shop</a>\n \n \n </td>\n \n </tr>\n\n\n \n <tr>\n <th class=""a-span3 comparison_attribute_name_column comparison_table_first_col"" 
role=""rowheader"">\n <span class=""a-size-base a-color-base"">Color</span>\n </th>\n \n\n <td class=""comparison_baseitem_column"">\n \n \n \n <span class=""a-color-secondary""></span>\n \n \n \n </td>\n\n\n \n <td class=""comparison_sim_items_column comparable_item0"">\n \n \n \n \n \n \n <span class=""a-size-base a-color-base"">Yellow</span>\n \n \n </td>\n \n <td class=""comparison_sim_items_column comparable_item1"">\n \n \n \n \n \n \n <span class=""a-size-base a-color-base"">Black</span>\n \n \n </td>\n \n <td class=""comparison_sim_items_column comparable_item2"">\n \n \n \n \n \n \n <span class=""a-size-base a-color-base"">Multi</span>\n \n \n </td>\n \n <td class=""comparison_sim_items_column comparable_item3"">\n \n \n \n \n \n <span class=""a-color-secondary""></span>\n \n \n \n </td>\n \n <td class=""comparison_sim_items_column comparable_item4"">\n \n \n \n \n \n <span class=""a-color-secondary""></span>\n \n \n \n </td>\n \n </tr>\n \n <tr>\n <th class=""a-span3 comparison_attribute_name_column comparison_table_first_col"" role=""rowheader"">\n <span class=""a-size-base a-color-base"">Material Type</span>\n </th>\n \n\n <td class=""comparison_baseitem_column"">\n \n \n \n <span class=""a-color-secondary""></span>\n \n \n \n </td>\n\n\n \n <td class=""comparison_sim_items_column comparable_item0"">\n \n \n \n \n \n \n <span class=""a-size-base a-color-base"">Vinyl</span>\n \n \n </td>\n \n <td class=""comparison_sim_items_column comparable_item1"">\n \n \n \n \n \n \n <span class=""a-size-base a-color-base"">vinyl</span>\n \n \n </td>\n \n <td class=""comparison_sim_items_column comparable_item2"">\n \n \n \n \n \n \n <span class=""a-size-base a-color-base"">vinyl</span>\n \n \n </td>\n \n <td class=""comparison_sim_items_column comparable_item3"">\n \n \n \n \n \n \n <span class=""a-size-base a-color-base"">Vinyl</span>\n \n \n </td>\n \n <td class=""comparison_sim_items_column comparable_item4"">\n \n \n \n \n \n \n <span class=""a-size-base 
a-color-base"">vinyl</span>\n \n \n </td>\n \n </tr>\n \n",,$20.52,0735342989,


In [6]:
# Check number of records and columns.
reviews.shape, products.shape

((2070831, 12), (571535, 18))

In [7]:
# Check for data types and nulls.
reviews.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2070831 entries, 0 to 2070830
Data columns (total 12 columns):
 #   Column          Dtype  
---  ------          -----  
 0   overall         float64
 1   verified        bool   
 2   reviewTime      object 
 3   reviewerID      object 
 4   asin            object 
 5   style           object 
 6   reviewerName    object 
 7   reviewText      object 
 8   summary         object 
 9   unixReviewTime  int64  
 10  vote            object 
 11  image           object 
dtypes: bool(1), float64(1), int64(1), object(9)
memory usage: 191.6+ MB


In [8]:
# Check for data types and nulls.
products.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 571535 entries, 0 to 571534
Data columns (total 18 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   category      571535 non-null  object
 1   tech1         571535 non-null  object
 2   description   571535 non-null  object
 3   fit           571535 non-null  object
 4   title         571535 non-null  object
 5   also_buy      571535 non-null  object
 6   image         571535 non-null  object
 7   tech2         571535 non-null  object
 8   brand         571535 non-null  object
 9   feature       571535 non-null  object
 10  rank          571535 non-null  object
 11  also_view     571535 non-null  object
 12  main_cat      571535 non-null  object
 13  similar_item  571535 non-null  object
 14  date          571535 non-null  object
 15  price         571535 non-null  object
 16  asin          571535 non-null  object
 17  details       571444 non-null  object
dtypes: object(18)
memory usa

## Inspect and Clean Data

|Required|Feature|Type|Description|
|:--|:--|:--|:--|
|o|`overall`|float|Rating of the product|
|x|`verified`|bool|If review has been verified|
|x|`reviewTime`|object|Time of the review (raw)|
|x|`reviewerID`|object|ID of the reviewer e.g. A2SUAM1J3GNN3B|
|o|`asin`|object|ID of the product e.g. 0000013714|
|x|`style`|object|Dictionary of the product metadata e.g. "Format" is "Hardcover"|
|x|`reviewerName`|object|Name of the reviewer|
|o|`reviewText`|object|Text of the review|
|x|`summary`|object|Summary of the review|
|x|`unixReviewTime`|integer|Time of the review (unix time)|
|x|`vote`|object|Helpful votes of the review|
|x|`image`|object|Images that users post after they have received the product|

|Required|Feature|Type|Description|
|:--|:--|:--|:--|
|x|`category`|object|List of categories the product belongs to|
|x|`tech1`|object|First technical detail table of the product|
|x|`description`|object|Description of the product|
|x|`fit`|object||
|x|`title`|object|Name of the product|
|x|`also_buy`|object|Related products (also bought)|
|x|`image`|object|URL of the product image|
|x|`tech2`|object|Second technical detail table of the product|
|o|`brand`|object|Brand name|
|x|`feature`|object|Bullet-point format features of the product|
|x|`rank`|object|Sales rank information|
|x|`also_view`|object|Related products (also viewed)|
|o|`main_cat`|object|Main category of the product|
|x|`similar_item`|object|Similar product table|
|x|`date`|object|Date|
|x|`price`|object|Price in US dollars (at time of crawl)|
|o|`asin`|object|ID of the product, e.g. 0000031852|
|x|`details`|object|Additional product details|

In [9]:
# Identify unwanted columns in both dataframes for dropping.
unwanted_rev_cols = ['verified', 'reviewTime', 'reviewerID', 'style', 'reviewerName', 'summary', 'unixReviewTime', 
                     'vote', 'image']
unwanted_pdt_cols = ['category', 'tech1', 'description', 'fit', 'title', 'also_buy', 'image', 'tech2', 'feature', 'rank', 
                     'also_view', 'similar_item', 'date', 'price', 'details']

In [10]:
reviews.shape

(2070831, 12)

In [11]:
# Examine the various data columns to better understand the data.
randomlist = sample(range(reviews.shape[0]), 10)
for i in randomlist:
    print(reviews.loc[i, ['reviewText']], "\n")

reviewText    Purchased for son but I bought one last year because it is tiny and fits on my key chain without adding a lot of weight.  I use mine all the time for all kinds of things but mostly for opening packages and removing tags from stuff while in the car.  Every mom should have one!  I have had mine for over a year and nothing on it has broken.
Name: 52863, dtype: object 

reviewText    OK
Name: 311227, dtype: object 

reviewText    The rubber on the old stopper broke down to the point it was in two pieces.  The funny thing was that it broke in such a way that it would still seal the tub but would restrict the water leaving the tub when opened.  I opted to replace the cartridge assembly because the rubber gasket by itself was more expensive and you might as well get a new spring at the same time.  Replacement was easy.  Simple unscrew the upper cap, unscrew the old cartridge assembly, install the new one and screw the old upper cap to the new cartridge assembly.  Works perfectly

In [12]:
# Check for duplicate reviews, i.e. same reviewer, same product, same review and same summary.
reviews[reviews.duplicated(subset=['reviewerID', 'reviewerName', 'asin', 'reviewText', 'summary'])].count()

overall     100252
verified    100252
             ...  
vote         14568
image         1293
Length: 12, dtype: int64
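
As a minimal sketch of how `duplicated` behaves, the toy frame below (made-up values, with a shortened key list) shows the difference between the default `keep='first'` and `keep=False`. Summing the boolean mask gives a single duplicate count, whereas `.count()` on the filtered frame reports non-null values per column.

```python
import pandas as pd

# Toy data (hypothetical): rows 0 and 1 match on every key column, row 2 differs in reviewText.
toy = pd.DataFrame({
    "reviewerID": ["A1", "A1", "A1"],
    "asin": ["B001", "B001", "B001"],
    "reviewText": ["Great", "Great", "Great value"],
})

key_cols = ["reviewerID", "asin", "reviewText"]

# keep='first' (the default) flags only the later copies in each duplicate group.
later_copies = toy.duplicated(subset=key_cols).sum()

# keep=False flags every member of a duplicate group, including the first.
all_copies = toy.duplicated(subset=key_cols, keep=False).sum()

# drop_duplicates with the default keep='first' retains one row per group.
deduped = toy.drop_duplicates(subset=key_cols)

print(later_copies, all_copies, len(deduped))
```

Inspecting the `keep=False` rows before dropping can help confirm that the flagged records really are repeats rather than near-matches.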

In [13]:
# Drop duplicate reviews, i.e. same reviewer, same product, same review and same summary.
reviews.drop_duplicates(subset=['reviewerID', 'reviewerName', 'asin', 'reviewText', 'summary'], inplace=True)

In [14]:
# Check count of verified reviews.
reviews['verified'].value_counts()

True     1809779
False     160800
Name: verified, dtype: int64

In [15]:
# Check for null reviewText and summary values.
reviews['reviewText'].isnull().sum(), reviews['summary'].isnull().sum()

(508, 271)

In [16]:
# Check for records with both null summary and reviewText.  These records will be dropped.
reviews[reviews['summary'].isnull() & reviews['reviewText'].isnull()].count()

overall     30
verified    30
            ..
vote         0
image        0
Length: 12, dtype: int64

In [17]:
# Drop rows with both null summary and reviewText.
reviews.drop(reviews[reviews['summary'].isnull() & reviews['reviewText'].isnull()].index, inplace=True)

In [18]:
# Check for records with null reviewText but summary has data.
reviews[reviews['reviewText'].isnull() & ~reviews['summary'].isnull()].count()

overall     478
verified    478
           ... 
vote         54
image       123
Length: 12, dtype: int64

In [19]:
# Check for records with null reviewText and where summary is only rating information.
reviews[reviews['reviewText'].isnull() 
        & reviews['summary'].str.lower().isin(["five stars", "four stars", "three stars", "two stars", 
                                               "one star"])].count()

overall     426
verified    426
           ... 
vote         34
image        82
Length: 12, dtype: int64

In [20]:
# Drop rows with null reviewText and where summary is only rating information.
reviews.drop(reviews[reviews['reviewText'].isnull() 
                     & reviews['summary'].str.lower().isin(["five stars", "four stars", "three stars", "two stars", 
                                                            "one star"])].index, inplace=True)

In [21]:
# Assign summary values to null reviewText.
reviews.loc[reviews['reviewText'].isnull(), 'reviewText'] = reviews['summary']

# Assign remaining null summary values with full stops.
reviews['summary'] = reviews['summary'].fillna(".")

# Replace newline characters with spaces and remove backslashes before apostrophes.
reviews['reviewText'] = reviews['reviewText'].replace(r"\n", " ", regex=True)
reviews['reviewText'] = reviews['reviewText'].replace(r"\\'", "'", regex=True)

# Remove urls.
# reviews[reviews['reviewText'].str.contains("http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+", 
#                                            regex=True, case=False)]
reviews['reviewText'] = reviews['reviewText'].replace(r"http\S+|www\.\S+", "", regex=True)
reviews['reviewText'] = reviews['reviewText'].replace(r"[A-Za-z]+\.com", "", regex=True)

# Create a word_cnt column to store the number of words in reviewText.
reviews['word_cnt'] = reviews['reviewText'].str.split().str.len()

# Check for count of records where the word count of reviewText is between 5 and 128.
reviews[(reviews['word_cnt'] > 4) & (reviews['word_cnt'] < 129)].count()

overall     1482487
verified    1482487
             ...   
image         25261
word_cnt    1482487
Length: 13, dtype: int64
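
The text clean-up and word count can be tried on a few made-up strings. This is only a sketch: the sample texts and the `example.com` address are placeholders, and `Series.str.replace` is used here as one possible way to apply the same patterns.

```python
import pandas as pd

# Hypothetical sample texts covering a newline, a full URL and a bare domain.
texts = pd.Series([
    "Line one\nline two",
    "See https://example.com/item for details",
    "Bought it on example.com last week",
])

cleaned = (
    texts
    .str.replace(r"\n", " ", regex=True)               # newlines become spaces
    .str.replace(r"http\S+|www\.\S+", "", regex=True)  # strip http(s) and www links
    .str.replace(r"[A-Za-z]+\.com", "", regex=True)    # strip bare .com domains
)

# Splitting on whitespace ignores the doubled spaces left behind by removed URLs.
word_cnt = cleaned.str.split().str.len()

print(cleaned.tolist())
print(word_cnt.tolist())
```

Raw strings (`r"..."`) keep escapes such as `\S` and `\.` intact for the regex engine, which avoids invalid-escape warnings in newer Python versions.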

In [22]:
# Keep records where the word count of reviewText is between 5 and 128, and drop the rest.
reviews.drop(reviews[(reviews['word_cnt'] < 5) | (reviews['word_cnt'] > 128)].index, inplace=True)
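
An alternative way to express the same filter, shown here as a sketch on made-up counts, is to keep the rows that satisfy the condition with `Series.between`, which is inclusive at both ends by default.

```python
import pandas as pd

# Hypothetical word counts around both boundaries.
toy = pd.DataFrame({"word_cnt": [1, 4, 5, 60, 128, 129]})

# Keep rows whose word count lies in the closed interval [5, 128].
kept = toy[toy["word_cnt"].between(5, 128)]

print(kept["word_cnt"].tolist())
```

Selecting by boolean mask does not rely on index labels, so it behaves the same even if the index contains repeated labels.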

In [23]:
# # Replace null values in vote column with zeroes.
# reviews['vote'] = reviews['vote'].fillna(0)

# # Replace commas used as thousands separator before converting type to integer.
# reviews['vote'].replace(",", "", regex=True, inplace=True)
# reviews['vote'].astype(int)

In [24]:
# Check for duplicate products, i.e. same asin, same main_cat, same brand and same title.
products[products.duplicated(subset=['asin', 'main_cat', 'brand', 'title'])].count()

category    12195
tech1       12195
            ...  
asin        12195
details     12195
Length: 18, dtype: int64

In [25]:
# Drop duplicate products, i.e. same asin, same main_cat, same brand and same title.
products.drop_duplicates(subset=['asin', 'main_cat', 'brand', 'title'], inplace=True)

In [26]:
# Drop unwanted columns in both tables.
reviews.drop(unwanted_rev_cols, axis=1, inplace=True)
products.drop(unwanted_pdt_cols, axis=1, inplace=True)

In [27]:
# Merge both tables with an inner join on asin (the product ID).
merged = pd.merge(left=reviews, right=products, on='asin', how='inner')
merged.shape

(1479593, 6)
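
To see which reviews fail to find a product, and to guard against a product ID appearing more than once in the product table, a merge can be run with `indicator=True` and `validate`. The sketch below uses small made-up frames; `reviews_toy` and `products_toy` are illustrative names, not objects from this notebook.

```python
import pandas as pd

# Hypothetical frames: B999 has no product record, B003 has no reviews.
reviews_toy = pd.DataFrame({
    "asin": ["B001", "B001", "B002", "B999"],
    "overall": [5.0, 4.0, 1.0, 3.0],
})
products_toy = pd.DataFrame({
    "asin": ["B001", "B002", "B003"],
    "brand": ["Acme", "", "Bolt"],
})

# A left join with indicator=True adds a _merge column showing where each row came from.
# validate="many_to_one" raises pandas.errors.MergeError if asin repeats in the right table.
checked = pd.merge(reviews_toy, products_toy, on="asin", how="left",
                   validate="many_to_one", indicator=True)

unmatched = (checked["_merge"] == "left_only").sum()

# Rows marked "both" are exactly the rows an inner join would return.
inner = checked[checked["_merge"] == "both"].drop(columns="_merge")

print(unmatched, inner.shape)
```

Because the products here were de-duplicated on four columns rather than on `asin` alone, a `validate` check on the real data could raise an error; that would indicate repeated product IDs that multiply review rows in the join.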

In [28]:
merged['main_cat'].unique()

array(['Tools & Home Improvement', 'Office Products', 'Toys & Games',
       'Industrial & Scientific', 'Automotive', 'Sports & Outdoors',
       'Amazon Home',
       '<img src="https://images-na.ssl-images-amazon.com/images/G/01/nav2/images/gui/amazon-fashion-store-new._CB520838675_.png" class="nav-categ-image" alt="AMAZON FASHION"/>',
       'All Electronics', 'Camera & Photo', 'Home Audio & Theater',
       'Baby', 'Cell Phones & Accessories', 'Arts, Crafts & Sewing',
       'Pet Supplies', 'Musical Instruments', 'All Beauty', 'Grocery',
       'Car Electronics', 'Health & Personal Care', 'Computers', '',
       'Video Games', 'Amazon Devices',
       '<img src="https://images-na.ssl-images-amazon.com/images/G/01/digital/music/logos/amzn_music_logo_subnav._CB471835632_.png" class="nav-categ-image" alt="Digital Music"/>',
       'Appliances', 'Books', 'GPS & Navigation',
       '<img src="https://images-na.ssl-images-amazon.com/images/G/01/handmade/brand/logos/2018/subnav_logo._CB50

In [29]:
# Cleanup main_cat values.
merged.loc[merged['main_cat'] == '<img src="https://images-na.ssl-images-amazon.com/images/G/01/nav2/images/gui/amazon-fashion-store-new._CB520838675_.png" class="nav-categ-image" alt="AMAZON FASHION"/>', 
           'main_cat'] = "Amazon Fashion"
merged.loc[merged['main_cat'] == '<img src="https://images-na.ssl-images-amazon.com/images/G/01/digital/music/logos/amzn_music_logo_subnav._CB471835632_.png" class="nav-categ-image" alt="Digital Music"/>', 
           'main_cat'] = "Digital Music"
merged.loc[merged['main_cat'] == '<img src="https://images-na.ssl-images-amazon.com/images/G/01/handmade/brand/logos/2018/subnav_logo._CB502360610_.png" class="nav-categ-image" alt="Handmade"/>', 
           'main_cat'] = "Handmade"
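
Since each of these values carries the category name in its `alt` attribute, a regex-based variant is possible. The following is a sketch on made-up values (the image URL is a placeholder), assuming every image tag has an `alt` attribute and that title case is the desired form.

```python
import pandas as pd

# Hypothetical main_cat values: plain text, an image tag, and an empty string.
main_cat = pd.Series([
    "Books",
    '<img src="https://example.com/logo.png" class="nav-categ-image" alt="AMAZON FASHION"/>',
    "",
])

# Capture the alt text where present; rows without an alt attribute become NaN.
alt_text = main_cat.str.extract(r'alt="([^"]+)"', expand=False)

# Use the title-cased alt text where it exists, otherwise keep the original value.
cleaned = alt_text.str.title().where(alt_text.notna(), main_cat)

print(cleaned.tolist())
```

This avoids hard-coding each full tag, at the cost of relying on the tags following the same pattern.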

In [30]:
# Check for records where main_cat is empty string.
merged[merged['main_cat'] == ""].count()

overall     1294
asin        1294
            ... 
brand       1294
main_cat    1294
Length: 6, dtype: int64

In [31]:
# Drop records where main_cat is empty string.
merged.drop(merged[merged['main_cat'] == ""].index, inplace=True)

In [32]:
# Cleanup empty brand values.
merged.loc[merged['brand'] == "", 'brand'] = "None"
merged.loc[merged['brand'].isnull(), 'brand'] = "None"
merged.shape

(1478299, 6)

In [33]:
# Drop asin as no longer needed and reset index.
merged.drop('asin', axis=1, inplace=True)
merged.reset_index(drop=True, inplace=True)

## Prepare Bag of Words

In [34]:
# Build a set of English stop words, extended with a few extra terms.
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])
stopwords_set = set(stop_words)

# Create the lemmatizer once so it is not re-instantiated for every word.
lemmatizer = WordNetLemmatizer()

def create_doc(text):
    """ 
    Creates a document of lowercase words from input text.  Input text is first tokenized by text_to_word_sequence (Keras), 
    filtered to tokens longer than 3 characters, lemmatized (WordNetLemmatizer), and then stripped of stop words.
    
    Parameters
    ----------
    text     : str
        Raw review text from Amazon reviews.
    
    Returns
    -------
    str
        Document of lowercase words separated by single spaces.
        
    """
    # Tokenize with function from Keras.
    tokens = text_to_word_sequence(text)
    
    # Lemmatize tokens longer than 3 characters to base form.
    base_tokens = [lemmatizer.lemmatize(word) for word in tokens if len(word) > 3]
    
    # Remove stop words.
    doc_words = [word for word in base_tokens if word not in stopwords_set]
    
    return " ".join(doc_words)

In [35]:
# Create new column in dataframe to hold documents.
merged['document'] = [create_doc(review) for review in merged['reviewText']]
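
To illustrate what a bag of words looks like once documents are in this space-separated form, here is a minimal standard-library sketch. The three documents are made up, and the notebook itself may use a different vectorizer later on.

```python
from collections import Counter

# Hypothetical documents in the same format that create_doc returns.
docs = [
    "bought lamp work great love switch",
    "arrived time reasonable price",
    "switch work great price",
]

# Count term occurrences per document.
doc_counts = [Counter(doc.split()) for doc in docs]

# Build a sorted vocabulary from all terms seen in any document.
vocab = sorted(set().union(*doc_counts))

# Document-term matrix: one row per document, one column per vocabulary term.
# A Counter returns 0 for terms that do not occur in a document.
matrix = [[counts[term] for term in vocab] for counts in doc_counts]

print(vocab)
for row in matrix:
    print(row)
```

Word order is discarded in this representation; only the per-document term counts remain.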

In [37]:
# Compare reviewText and created document.
randomlist = sample(range(merged.shape[0]), 10)
for i in randomlist:
    print(merged.loc[i, ['reviewText']])
    print(merged.loc[i, ['document', 'word_cnt']], "\n")

reviewText    I bought two of these.  One for a lamp and one for a fan.  Both work great.  I love being able to just walk up to the fan and step on the switch and it starts up.  My lamp has a odd inline switch that hangs behind the end table next to the sofa, so with this switch, I just walk up and click the switch with my foot as it lays under the end table.
Name: 220158, dtype: object
document    bought lamp work great love able walk step switch start lamp inline switch hang behind table next sofa switch walk click switch foot lay table
word_cnt                                                                                                                                                76
Name: 220158, dtype: object 

reviewText    Arrived on time, and reasonable price
Name: 1304102, dtype: object
document    arrived time reasonable price
word_cnt                                6
Name: 1304102, dtype: object 

reviewText    I really like Cree bulbs and have them in many locations in 

## Save Clean Data to File

In [38]:
# Save clean review and product data to file.
merged.to_csv("../data/reviews_clean.csv", index=False)
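
One thing worth checking when the file is read back: a document that ended up empty (for example, a review made up only of short words and stop words) is written as an empty field, which `read_csv` treats as missing by default. The sketch below uses an in-memory buffer and made-up rows to show the difference; depending on the pandas version, the literal string "None" may also be parsed as missing under the defaults.

```python
import io

import pandas as pd

# Hypothetical rows: one empty document and one brand stored as the string "None".
toy = pd.DataFrame({"brand": ["None", "Acme"], "document": ["", "arrived time"]})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default parsing converts empty fields to NaN.
buffer.seek(0)
default_read = pd.read_csv(buffer)

# keep_default_na=False keeps empty fields and strings such as "None" as literal text.
buffer.seek(0)
literal_read = pd.read_csv(buffer, keep_default_na=False)

print(default_read["document"].isna().sum())
print(literal_read.values.tolist())
```

If the downstream notebook expects strings in every row, either reading with `keep_default_na=False` or filling missing values after loading would be options to consider.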