# Amazon product catalog


## Problem scope

What do Amazon reviews say about the product, and can reviews be used reliably to predict the product category?

## Open questions / workflow

1. Predict the rating from the item description (regression on text data).

2. Measure how well reviewed an item is.

3. Decide which fields to use: product, product contents, product tags (one document per row), plus brand, company type, product category, and description.

4. Estimate the probability that an item we labeled actually belongs to that class.


## Data imports

In [1]:
#libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Training data - _to be scrapped_

In [2]:
#fetch data for TRAIN
orders = pd.read_csv('../data/01-Jan-2008_to_18-May-2021.csv')

In [3]:
#peek at data for TRAIN
orders.head(2)

Unnamed: 0,Order Date,Order ID,Title,Category,ASIN/ISBN,UNSPSC Code,Website,Release Date,Condition,Seller,...,Carrier Name & Tracking Number,Item Subtotal,Item Subtotal Tax,Item Total,Tax Exemption Applied,Tax Exemption Type,Exemption Opt-Out,Buyer Name,Currency,Group Name
0,12/15/08,002-9208212-4089021,,,B0006HGNJ4,,Amazon.com,,,Watch Values,...,USPS(9101805213907299379333),$24.95,$0.00,$24.95,,,,Veronica,USD,
1,03/23/09,058-3723913-1124538,The Wall,ABIS_MUSIC,B000006TRV,55111512.0,Amazon.com,1994-01-01T00:00,new,megahitrecords,...,,$15.24,$0.00,$15.24,,,,Veronica,USD,


In [4]:
#understand prospective features
orders.columns

Index(['Order Date', 'Order ID', 'Title', 'Category', 'ASIN/ISBN',
       'UNSPSC Code', 'Website', 'Release Date', 'Condition', 'Seller',
       'Seller Credentials', 'List Price Per Unit', 'Purchase Price Per Unit',
       'Quantity', 'Payment Instrument Type', 'Purchase Order Number',
       'PO Line Number', 'Ordering Customer Email', 'Shipment Date',
       'Shipping Address Name', 'Shipping Address Street 1',
       'Shipping Address Street 2', 'Shipping Address City',
       'Shipping Address State', 'Shipping Address Zip', 'Order Status',
       'Carrier Name & Tracking Number', 'Item Subtotal', 'Item Subtotal Tax',
       'Item Total', 'Tax Exemption Applied', 'Tax Exemption Type',
       'Exemption Opt-Out', 'Buyer Name', 'Currency', 'Group Name'],
      dtype='object')

In [5]:
#understand size of data
orders.shape

(2242, 36)

In [6]:
#data types we are dealing with and presence of nulls
orders.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2242 entries, 0 to 2241
Data columns (total 36 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Order Date                      2242 non-null   object 
 1   Order ID                        2242 non-null   object 
 2   Title                           2163 non-null   object 
 3   Category                        2163 non-null   object 
 4   ASIN/ISBN                       2242 non-null   object 
 5   UNSPSC Code                     2163 non-null   float64
 6   Website                         2242 non-null   object 
 7   Release Date                    77 non-null     object 
 8   Condition                       2241 non-null   object 
 9   Seller                          2241 non-null   object 
 10  Seller Credentials              5 non-null      object 
 11  List Price Per Unit             2242 non-null   object 
 12  Purchase Price Per Unit         22

In [7]:
#descriptive stats
orders.describe().round(4).T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
UNSPSC Code,2163.0,49090790.0,9088133.0,10111300.0,50000000.0,50202300.0,53102500.5,78130000.0
Quantity,2242.0,1.1026,0.4331,0.0,1.0,1.0,1.0,7.0
Purchase Order Number,0.0,,,,,,,
PO Line Number,0.0,,,,,,,
Shipping Address Street 2,641.0,504.0,0.0,504.0,504.0,504.0,504.0,504.0
Tax Exemption Type,0.0,,,,,,,
Group Name,0.0,,,,,,,


### Product dataset 1

In [8]:
#import data from mktg dataset
#the file is tab-separated: read_csv('path_to_file', sep='\t')
#from https://stackoverflow.com/questions/9652832/how-to-load-a-tsv-file-into-a-pandas-dataframe
products = pd.read_csv(
    '../data/marketing_sample_for_amazon_com-amazon_product_data__20200401_20200630__30k_data.tsv',
    sep='\t')

In [9]:
#peek
products.head(2)

Unnamed: 0,Uniq Id,Crawl Timestamp,Dataset Origin,Product Id,Product Barcode,Product Company Type Source,Product Brand Source,Product Brand Normalised Source,Product Name Source,Match Rank,...,Product Currency,Product Available Inventory,Product Image Url,Product Model Number,Product Tags,Product Contents,Product Rating,Product Reviews Count,Bsr,Joining Key
0,690e47123b04e15df768a9f54f1654e8,2020-05-13 11:02:04 +0000,,ed177532df52b8b2fdba6cff731b9d00,,Competitor,,,,,...,USD,999999999,https://images-na.ssl-images-amazon.com/images...,,100% Linen Table Runner 14x90 Tailored with Mi...,100% Pure Linen Imported 100% Pure Linen table...,3.5,4,"#83,321 in Kitchen & Dining (See Top 100 in Ki...",78af1f66f1e365647484688360621e5e
1,b02a91cfdae2c5596b568e771b402176,2020-06-27 15:31:29 +0000,,8756f065784790599a218174bc932395,UPC: 799460760240,Competitor,,,,,...,USD,999999999,https://images-na.ssl-images-amazon.com/images...,NL-F-5-2,2 pcs/set LONG Home Sauna Spa Exfoliating Nylo...,The nylon exfoliating cloth is the best bath p...,0.0,0,"#175,862 in Beauty & Personal Care (See Top 10...",83e8d81095fe218d5bb580194669e0c4


In [10]:
products.columns

Index(['Uniq Id', 'Crawl Timestamp', 'Dataset Origin', 'Product Id',
       'Product Barcode', 'Product Company Type Source',
       'Product Brand Source', 'Product Brand Normalised Source',
       'Product Name Source', 'Match Rank', 'Match Score', 'Match Type',
       'Retailer', 'Product Category', 'Product Brand', 'Product Name',
       'Product Price', 'Sku', 'Upc', 'Product Url', 'Market',
       'Product Description', 'Product Currency',
       'Product Available Inventory', 'Product Image Url',
       'Product Model Number', 'Product Tags', 'Product Contents',
       'Product Rating', 'Product Reviews Count', 'Bsr', 'Joining Key'],
      dtype='object')

In [11]:
products.shape

(29984, 32)

In [12]:
products.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29984 entries, 0 to 29983
Data columns (total 32 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   Uniq Id                          29984 non-null  object 
 1   Crawl Timestamp                  29984 non-null  object 
 2   Dataset Origin                   0 non-null      float64
 3   Product Id                       29984 non-null  object 
 4   Product Barcode                  6291 non-null   object 
 5   Product Company Type Source      29984 non-null  object 
 6   Product Brand Source             71 non-null     object 
 7   Product Brand Normalised Source  71 non-null     object 
 8   Product Name Source              71 non-null     object 
 9   Match Rank                       0 non-null      float64
 10  Match Score                      0 non-null      float64
 11  Match Type                       0 non-null      float64
 12  Retailer          

In [13]:
products.describe().round(2).T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Dataset Origin,0.0,,,,,,,
Match Rank,0.0,,,,,,,
Match Score,0.0,,,,,,,
Match Type,0.0,,,,,,,
Product Price,23553.0,41.29,121.16,0.01,11.29,18.5,34.99,7400.0
Product Available Inventory,29984.0,909996400.0,268150900.0,111111100.0,1000000000.0,999999999.0,1000000000.0,999999999.0
Product Rating,29984.0,3.12,1.98,0.0,0.0,4.1,4.6,5.0
Product Reviews Count,29984.0,53.56,341.98,0.0,0.0,3.0,20.0,24231.0


### Product dataset 2

In [14]:
#import ecom dataset
ecom = pd.read_csv('../data/amazon_co-ecommerce_sample.csv')

In [15]:
ecom.shape

(10000, 17)

In [16]:
ecom.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 17 columns):
 #   Column                                       Non-Null Count  Dtype  
---  ------                                       --------------  -----  
 0   uniq_id                                      10000 non-null  object 
 1   product_name                                 10000 non-null  object 
 2   manufacturer                                 9993 non-null   object 
 3   price                                        8565 non-null   object 
 4   number_available_in_stock                    7500 non-null   object 
 5   number_of_reviews                            9982 non-null   object 
 6   number_of_answered_questions                 9235 non-null   float64
 7   average_review_rating                        9982 non-null   object 
 8   amazon_category_and_sub_category             9310 non-null   object 
 9   customers_who_bought_this_item_also_bought   8938 non-null   object 
 10 

In [17]:
ecom.head()

Unnamed: 0,uniq_id,product_name,manufacturer,price,number_available_in_stock,number_of_reviews,number_of_answered_questions,average_review_rating,amazon_category_and_sub_category,customers_who_bought_this_item_also_bought,description,product_information,product_description,items_customers_buy_after_viewing_this_item,customer_questions_and_answers,customer_reviews,sellers
0,eac7efa5dbd3d667f26eb3d3ab504464,Hornby 2014 Catalogue,Hornby,£3.42,5 new,15,1.0,4.9 out of 5 stars,Hobbies > Model Trains & Railway Sets > Rail V...,http://www.amazon.co.uk/Hornby-R8150-Catalogue...,Product Description Hornby 2014 Catalogue Box ...,Technical Details Item Weight640 g Product Dim...,Product Description Hornby 2014 Catalogue Box ...,http://www.amazon.co.uk/Hornby-R8150-Catalogue...,Does this catalogue detail all the previous Ho...,Worth Buying For The Pictures Alone (As Ever) ...,"{""seller""=>[{""Seller_name_1""=>""Amazon.co.uk"", ..."
1,b17540ef7e86e461d37f3ae58b7b72ac,FunkyBuys® Large Christmas Holiday Express Fes...,FunkyBuys,£16.99,,2,1.0,4.5 out of 5 stars,Hobbies > Model Trains & Railway Sets > Rail V...,http://www.amazon.co.uk/Christmas-Holiday-Expr...,Size Name:Large FunkyBuys® Large Christmas Hol...,Technical Details Manufacturer recommended age...,Size Name:Large FunkyBuys® Large Christmas Hol...,http://www.amazon.co.uk/Christmas-Holiday-Expr...,can you turn off sounds // hi no you cant turn...,Four Stars // 4.0 // 18 Dec. 2015 // By\n \...,"{""seller""=>{""Seller_name_1""=>""UHD WHOLESALE"", ..."
2,348f344247b0c1a935b1223072ef9d8a,CLASSIC TOY TRAIN SET TRACK CARRIAGES LIGHT EN...,ccf,£9.99,2 new,17,2.0,3.9 out of 5 stars,Hobbies > Model Trains & Railway Sets > Rail V...,http://www.amazon.co.uk/Classic-Train-Lights-B...,BIG CLASSIC TOY TRAIN SET TRACK CARRIAGE LIGHT...,Technical Details Manufacturer recommended age...,BIG CLASSIC TOY TRAIN SET TRACK CARRIAGE LIGHT...,http://www.amazon.co.uk/Train-With-Tracks-Batt...,What is the gauge of the track // Hi Paul.Trut...,**Highly Recommended!** // 5.0 // 26 May 2015 ...,"{""seller""=>[{""Seller_name_1""=>""DEAL-BOX"", ""Sel..."
3,e12b92dbb8eaee78b22965d2a9bbbd9f,HORNBY Coach R4410A BR Hawksworth Corridor 3rd,Hornby,£39.99,,1,2.0,5.0 out of 5 stars,Hobbies > Model Trains & Railway Sets > Rail V...,,Hornby 00 Gauge BR Hawksworth 3rd Class W 2107...,Technical Details Item Weight259 g Product Dim...,Hornby 00 Gauge BR Hawksworth 3rd Class W 2107...,,,I love it // 5.0 // 22 July 2013 // By\n \n...,
4,e33a9adeed5f36840ccc227db4682a36,Hornby 00 Gauge 0-4-0 Gildenlow Salt Co. Steam...,Hornby,£32.19,,3,2.0,4.7 out of 5 stars,Hobbies > Model Trains & Railway Sets > Rail V...,http://www.amazon.co.uk/Hornby-R6367-RailRoad-...,Product Description Hornby RailRoad 0-4-0 Gild...,Technical Details Item Weight159 g Product Dim...,Product Description Hornby RailRoad 0-4-0 Gild...,http://www.amazon.co.uk/Hornby-R2672-RailRoad-...,,Birthday present // 5.0 // 14 April 2014 // By...,
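
Because the goal is to predict product category, the `amazon_category_and_sub_category` strings shown above could be split on ` > ` to get a top-level label. This is a minimal sketch on a small hand-made frame; the column name `top_category` is hypothetical and the sample values are illustrative, not taken from the data.

```python
import pandas as pd

# Illustrative stand-in for ecom['amazon_category_and_sub_category'] (made-up values)
sample = pd.DataFrame({
    'amazon_category_and_sub_category': [
        'Hobbies > Model Trains & Railway Sets',
        'Games > Card Games',
        None,
    ]
})

# Split on the " > " separator and keep the first level as the label;
# rows with a missing category stay missing
levels = sample['amazon_category_and_sub_category'].str.split(' > ')
sample['top_category'] = levels.str[0]

print(sample['top_category'].tolist())
```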


## Cleaning

Let's clean up our columns, including price, so that they can be converted to numeric types properly.

In [18]:
#strip price of the pound sign
ecom['price'] = ecom['price'].str.strip('£')
ecom['price'].head(2)

0     3.42
1    16.99
Name: price, dtype: object

In [19]:
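#strip out "out of 5 stars" from average_review_rating
#NOTE: str.strip() trims any of the listed characters from both ends, not the substring,
#so "4.5 out of 5 stars" is cut down to "4." in the output of this cell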
ecom['average_review_rating'] = ecom['average_review_rating'].str.strip(' out of 5 stars')
ecom['average_review_rating'].head(2)

0    4.9
1     4.
Name: average_review_rating, dtype: object
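
`str.strip` treats its argument as a set of characters to trim from both ends, which is why `4.5 out of 5 stars` comes back as `4.` above. One possible alternative, sketched here on sample strings rather than the real column, is to pull out the leading number with `str.extract`:

```python
import pandas as pd

# Sample values in the same format as ecom['average_review_rating']
ratings = pd.Series(['4.9 out of 5 stars', '4.5 out of 5 stars', '5.0 out of 5 stars', None])

# Capture the leading decimal number and convert it to float; missing values stay NaN
parsed = ratings.str.extract(r'^(\d+(?:\.\d+)?)', expand=False).astype(float)

print(parsed.tolist())
```

Anchoring the pattern at the start of the string keeps the trailing `5` in "out of 5 stars" from being picked up.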

In [20]:
#strip out the word "new" from number in stock -- but not actually useful for training, so can skip

In [21]:
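#strip out commas from number_of_reviews column
#NOTE: str.strip(",") only trims leading/trailing commas; a thousands separator inside the value is left in place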
ecom['number_of_reviews'] = ecom['number_of_reviews'].str.strip(",")
ecom['number_of_reviews'].sort_values(ascending=False)

3182     99
7179     99
1541     98
6616     98
133      97
       ... 
6452    NaN
7133    NaN
7866    NaN
8923    NaN
9833    NaN
Name: number_of_reviews, Length: 10000, dtype: object

In [22]:
#fill in NaN's with 0's
#NOTE: this fills every column, so text columns also receive the integer 0
ecom = ecom.fillna(0)
ecom.head(3)

Unnamed: 0,uniq_id,product_name,manufacturer,price,number_available_in_stock,number_of_reviews,number_of_answered_questions,average_review_rating,amazon_category_and_sub_category,customers_who_bought_this_item_also_bought,description,product_information,product_description,items_customers_buy_after_viewing_this_item,customer_questions_and_answers,customer_reviews,sellers
0,eac7efa5dbd3d667f26eb3d3ab504464,Hornby 2014 Catalogue,Hornby,3.42,5 new,15,1.0,4.9,Hobbies > Model Trains & Railway Sets > Rail V...,http://www.amazon.co.uk/Hornby-R8150-Catalogue...,Product Description Hornby 2014 Catalogue Box ...,Technical Details Item Weight640 g Product Dim...,Product Description Hornby 2014 Catalogue Box ...,http://www.amazon.co.uk/Hornby-R8150-Catalogue...,Does this catalogue detail all the previous Ho...,Worth Buying For The Pictures Alone (As Ever) ...,"{""seller""=>[{""Seller_name_1""=>""Amazon.co.uk"", ..."
1,b17540ef7e86e461d37f3ae58b7b72ac,FunkyBuys® Large Christmas Holiday Express Fes...,FunkyBuys,16.99,0,2,1.0,4.0,Hobbies > Model Trains & Railway Sets > Rail V...,http://www.amazon.co.uk/Christmas-Holiday-Expr...,Size Name:Large FunkyBuys® Large Christmas Hol...,Technical Details Manufacturer recommended age...,Size Name:Large FunkyBuys® Large Christmas Hol...,http://www.amazon.co.uk/Christmas-Holiday-Expr...,can you turn off sounds // hi no you cant turn...,Four Stars // 4.0 // 18 Dec. 2015 // By\n \...,"{""seller""=>{""Seller_name_1""=>""UHD WHOLESALE"", ..."
2,348f344247b0c1a935b1223072ef9d8a,CLASSIC TOY TRAIN SET TRACK CARRIAGES LIGHT EN...,ccf,9.99,2 new,17,2.0,3.9,Hobbies > Model Trains & Railway Sets > Rail V...,http://www.amazon.co.uk/Classic-Train-Lights-B...,BIG CLASSIC TOY TRAIN SET TRACK CARRIAGE LIGHT...,Technical Details Manufacturer recommended age...,BIG CLASSIC TOY TRAIN SET TRACK CARRIAGE LIGHT...,http://www.amazon.co.uk/Train-With-Tracks-Batt...,What is the gauge of the track // Hi Paul.Trut...,**Highly Recommended!** // 5.0 // 26 May 2015 ...,"{""seller""=>[{""Seller_name_1""=>""DEAL-BOX"", ""Sel..."
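
Calling `fillna(0)` on the whole frame also writes the integer `0` into text columns, which mixes types inside a column. Here is a sketch of a column-wise alternative, assuming only count-like columns should default to zero (that choice is an assumption, not something the data dictates):

```python
import pandas as pd
import numpy as np

# Small stand-in for ecom with one numeric and one text column
df = pd.DataFrame({
    'number_of_answered_questions': [1.0, np.nan, 2.0],
    'manufacturer': ['Hornby', None, 'ccf'],
})

# Fill only the columns where 0 is a meaningful default; text columns stay missing
df = df.fillna({'number_of_answered_questions': 0})

print(df)
```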


In [23]:
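#force numeric dtype on number_of_reviews and the other count, price and rating columns
#pattern from https://stackoverflow.com/questions/15891038/change-column-type-in-pandas:
#df[["a", "b"]] = df[["a", "b"]].apply(pd.to_numeric)
#assign the result back, otherwise the conversion is discarded

num_cols = ['number_of_reviews', 'number_of_answered_questions', 'price', 'average_review_rating']
ecom[num_cols] = ecom[num_cols].apply(pd.to_numeric)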

ecom.dtypes


ValueError: Unable to parse string "1,040" at position 8872
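
The `ValueError` points at a thousands separator: `"1,040"` still contains its comma, so `pd.to_numeric` cannot parse it. Below is a hedged sketch of one way to make the conversion go through, shown on made-up sample values. It removes commas anywhere in the string with `str.replace`, then converts with `errors='coerce'` so that anything still unparseable becomes `NaN` instead of raising.

```python
import pandas as pd

# Made-up sample values in the style of the ecom columns
df = pd.DataFrame({
    'number_of_reviews': ['15', '1,040', None],
    'price': ['3.42', '16.99', 'n/a'],
})

cols = ['number_of_reviews', 'price']
for col in cols:
    # Remove thousands separators anywhere in the string, not just at the ends
    cleaned = df[col].str.replace(',', '', regex=False)
    # Unparseable entries become NaN rather than raising ValueError
    df[col] = pd.to_numeric(cleaned, errors='coerce')

print(df.dtypes)
print(df['number_of_reviews'].mean())
```

Once the columns are numeric, aggregations such as `.mean()` operate on numbers instead of strings.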

## EDA

In [24]:
#let's analyze the categories
orders['Category'].value_counts()[:50]

GROCERY                             170
VEGETABLE                           132
FRUIT                                83
SKIN_CLEANING_AGENT                  81
ABIS_BOOK                            66
INCONTINENCE_PROTECTOR               66
DAIRY_BASED_DRINK                    64
HEALTH_PERSONAL_CARE                 56
BABY_PRODUCT                         53
SKIN_CLEANING_WIPE                   50
PAPER_TOWEL                          37
BREAD                                37
MEAT                                 34
DAIRY_BASED_CHEESE                   33
BEAUTY                               31
TOILET_PAPER                         30
NUTRITIONAL_SUPPLEMENT               30
LAUNDRY_DETERGENT                    29
SKIN_MOISTURIZER                     26
PANTS                                25
CONDITIONER                          24
WASTE_BAG                            24
UNDERPANTS                           22
SHOES                                22
TOOTH_CLEANING_AGENT                 21


In [25]:
ecom.columns

Index(['uniq_id', 'product_name', 'manufacturer', 'price',
       'number_available_in_stock', 'number_of_reviews',
       'number_of_answered_questions', 'average_review_rating',
       'amazon_category_and_sub_category',
       'customers_who_bought_this_item_also_bought', 'description',
       'product_information', 'product_description',
       'items_customers_buy_after_viewing_this_item',
       'customer_questions_and_answers', 'customer_reviews', 'sellers'],
      dtype='object')

In [26]:
ecom['price'].value_counts().sort_values(ascending=False)[:15]

0        1435
9.99      189
4.99      140
14.99     132
5.99      126
6.99      126
7.99      125
12.99     124
2.99      118
3.99      114
19.99     112
11.99      89
8.99       82
1.99       78
10.99      77
Name: price, dtype: int64

In [27]:
ecom.describe()

Unnamed: 0,number_of_answered_questions
count,10000.0
mean,1.6946
std,2.46774
min,0.0
25%,1.0
50%,1.0
75%,2.0
max,39.0


In [28]:
ecom['price'].mean()

TypeError: can only concatenate str (not "int") to str

In [29]:
ecom['number_available_in_stock'].value_counts()

0         2500
2 new     1337
3 new      981
4 new      753
5 new      590
          ... 
86 new       1
70 new       1
66 new       1
55 new       1
78 new       1
Name: number_available_in_stock, Length: 90, dtype: int64

In [30]:
ecom['number_of_answered_questions'].mean()

1.6946

In [31]:
ecom['number_of_reviews'].mean()

TypeError: can only concatenate str (not "int") to str

## Feature engineering

Columns such as ratings need to be converted into a form the models can consume.

The plan:

* Bin the ratings column into categories and one-hot encode it

In [None]:
#one-hot encode ratings column
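
A possible shape for this cell, assuming the ratings have already been converted to floats and that whole-star bins are the desired categories. Both are assumptions, and the bin edges and the `rating` prefix are illustrative.

```python
import pandas as pd

# Illustrative numeric ratings; the real column would be ecom['average_review_rating']
ratings = pd.Series([4.9, 4.5, 3.9, 5.0, 2.1], name='average_review_rating')

# Bin ratings into whole-star buckets: (0, 1], (1, 2], ..., (4, 5]
bins = [0, 1, 2, 3, 4, 5]
labels = ['1', '2', '3', '4', '5']
rating_bin = pd.cut(ratings, bins=bins, labels=labels)

# One indicator column per bucket
rating_dummies = pd.get_dummies(rating_bin, prefix='rating')

print(rating_dummies.columns.tolist())
```

With these edges a rating of exactly 0 falls outside every bin, so such rows would get no indicator set.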

In [None]:
#split new vs. used items into separate binary columns (binarize)
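
The `number_available_in_stock` values look like `5 new`, so one option is to split them into a count and a condition, then binarize the condition. This is a sketch on sample strings; the `used` value and the output column names are assumptions for illustration.

```python
import pandas as pd

# Sample values; "used" is assumed here for illustration
stock = pd.Series(['5 new', '2 new', '3 used', None], name='number_available_in_stock')

# Split "<count> <condition>" into two named columns
parts = stock.str.extract(r'^(?P<stock_count>\d+)\s+(?P<condition>\w+)$')
parts['stock_count'] = pd.to_numeric(parts['stock_count'])

# Binary indicator columns for each condition; missing rows get 0 in both
parts['is_new'] = (parts['condition'] == 'new').astype(int)
parts['is_used'] = (parts['condition'] == 'used').astype(int)

print(parts)
```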

## NLP

In [None]:
#vectorize text
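
One common way to vectorize text is TF-IDF. The sketch below uses scikit-learn's `TfidfVectorizer`, which this notebook has not imported so far, on a few made-up descriptions. The parameter choices are placeholders, not tuned settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up product descriptions standing in for a text column such as ecom['description']
docs = [
    'classic toy train set with track and carriages',
    'steam locomotive model for 00 gauge track',
    'exfoliating nylon bath cloth',
]

# Lowercase, drop English stop words, and weight terms by TF-IDF
vectorizer = TfidfVectorizer(stop_words='english', lowercase=True)
X = vectorizer.fit_transform(docs)

# X is a sparse matrix with one row per document and one column per term
print(X.shape)
print(sorted(vectorizer.vocabulary_)[:5])
```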

## Modeling - classifier

In [None]:
#reviews
#price
#category
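
This sketch shows how a text classifier could produce both a predicted category and a probability for that label, which speaks to the open question about how likely a labeled item is to belong to its class. It assumes scikit-learn, a TF-IDF plus logistic regression pipeline (one option among many), and a tiny made-up training set. It uses text only and leaves out price and review features.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up training set: product text and its category label
texts = [
    'toy train set with track', 'model steam locomotive', 'railway carriage coach',
    'nylon bath cloth', 'exfoliating shower towel', 'spa body scrub cloth',
]
labels = ['Hobbies', 'Hobbies', 'Hobbies', 'Beauty', 'Beauty', 'Beauty']

# Vectorize the text, then fit a linear classifier on the TF-IDF features
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(texts, labels)

# predict_proba returns one probability per class, in the order of clf.classes_
new_item = ['wooden train track']
probs = clf.predict_proba(new_item)[0]
print(dict(zip(clf.classes_, probs.round(3))))
```

The probabilities from such a model are not guaranteed to be well calibrated, so they are best read as relative confidence unless calibration is checked separately.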

## Evaluation

## Conclusions & next steps