# Amazon product catalog


## Problem scope

What do Amazon reviews say about the product, and can reviews be used reliably to predict the product category?

## Open questions / workflow

1. Predict the rating based on item desc.: regression w/ language data

2. How well reviewed something is

3. Figure out product, product contents, product tags, document per row + brand + company type, product category, description

4. Probability that the thing we labeled is actually in that class?


## Data imports

In [1]:
#libraries
import pandas as pd, numpy as np, matplotlib.pyplot as plt, seaborn as sns, re

### Product dataset

In [2]:
#import ecom dataset
ecom = pd.read_csv('../data/amazon_co-ecommerce_sample.csv')

In [3]:
#check data size
ecom.shape

(10000, 17)

In [4]:
#check data types
ecom.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 17 columns):
 #   Column                                       Non-Null Count  Dtype  
---  ------                                       --------------  -----  
 0   uniq_id                                      10000 non-null  object 
 1   product_name                                 10000 non-null  object 
 2   manufacturer                                 9993 non-null   object 
 3   price                                        8565 non-null   object 
 4   number_available_in_stock                    7500 non-null   object 
 5   number_of_reviews                            9982 non-null   object 
 6   number_of_answered_questions                 9235 non-null   float64
 7   average_review_rating                        9982 non-null   object 
 8   amazon_category_and_sub_category             9310 non-null   object 
 9   customers_who_bought_this_item_also_bought   8938 non-null   object 
 10 

In [5]:
#peek at dataframe
ecom.head()

Unnamed: 0,uniq_id,product_name,manufacturer,price,number_available_in_stock,number_of_reviews,number_of_answered_questions,average_review_rating,amazon_category_and_sub_category,customers_who_bought_this_item_also_bought,description,product_information,product_description,items_customers_buy_after_viewing_this_item,customer_questions_and_answers,customer_reviews,sellers
0,eac7efa5dbd3d667f26eb3d3ab504464,Hornby 2014 Catalogue,Hornby,£3.42,5 new,15,1.0,4.9 out of 5 stars,Hobbies > Model Trains & Railway Sets > Rail V...,http://www.amazon.co.uk/Hornby-R8150-Catalogue...,Product Description Hornby 2014 Catalogue Box ...,Technical Details Item Weight640 g Product Dim...,Product Description Hornby 2014 Catalogue Box ...,http://www.amazon.co.uk/Hornby-R8150-Catalogue...,Does this catalogue detail all the previous Ho...,Worth Buying For The Pictures Alone (As Ever) ...,"{""seller""=>[{""Seller_name_1""=>""Amazon.co.uk"", ..."
1,b17540ef7e86e461d37f3ae58b7b72ac,FunkyBuys® Large Christmas Holiday Express Fes...,FunkyBuys,£16.99,,2,1.0,4.5 out of 5 stars,Hobbies > Model Trains & Railway Sets > Rail V...,http://www.amazon.co.uk/Christmas-Holiday-Expr...,Size Name:Large FunkyBuys® Large Christmas Hol...,Technical Details Manufacturer recommended age...,Size Name:Large FunkyBuys® Large Christmas Hol...,http://www.amazon.co.uk/Christmas-Holiday-Expr...,can you turn off sounds // hi no you cant turn...,Four Stars // 4.0 // 18 Dec. 2015 // By\n \...,"{""seller""=>{""Seller_name_1""=>""UHD WHOLESALE"", ..."
2,348f344247b0c1a935b1223072ef9d8a,CLASSIC TOY TRAIN SET TRACK CARRIAGES LIGHT EN...,ccf,£9.99,2 new,17,2.0,3.9 out of 5 stars,Hobbies > Model Trains & Railway Sets > Rail V...,http://www.amazon.co.uk/Classic-Train-Lights-B...,BIG CLASSIC TOY TRAIN SET TRACK CARRIAGE LIGHT...,Technical Details Manufacturer recommended age...,BIG CLASSIC TOY TRAIN SET TRACK CARRIAGE LIGHT...,http://www.amazon.co.uk/Train-With-Tracks-Batt...,What is the gauge of the track // Hi Paul.Trut...,**Highly Recommended!** // 5.0 // 26 May 2015 ...,"{""seller""=>[{""Seller_name_1""=>""DEAL-BOX"", ""Sel..."
3,e12b92dbb8eaee78b22965d2a9bbbd9f,HORNBY Coach R4410A BR Hawksworth Corridor 3rd,Hornby,£39.99,,1,2.0,5.0 out of 5 stars,Hobbies > Model Trains & Railway Sets > Rail V...,,Hornby 00 Gauge BR Hawksworth 3rd Class W 2107...,Technical Details Item Weight259 g Product Dim...,Hornby 00 Gauge BR Hawksworth 3rd Class W 2107...,,,I love it // 5.0 // 22 July 2013 // By\n \n...,
4,e33a9adeed5f36840ccc227db4682a36,Hornby 00 Gauge 0-4-0 Gildenlow Salt Co. Steam...,Hornby,£32.19,,3,2.0,4.7 out of 5 stars,Hobbies > Model Trains & Railway Sets > Rail V...,http://www.amazon.co.uk/Hornby-R6367-RailRoad-...,Product Description Hornby RailRoad 0-4-0 Gild...,Technical Details Item Weight159 g Product Dim...,Product Description Hornby RailRoad 0-4-0 Gild...,http://www.amazon.co.uk/Hornby-R2672-RailRoad-...,,Birthday present // 5.0 // 14 April 2014 // By...,


## Cleaning

Let's clean up our columns, including price, so that they can converge properly.

### Remove extraneous characters

From the pound signs in the prices, to prices listed as ranges, we have some issues in our data formats. Let's fix these.

For price ranges, I will merely consume the lower bound of the price.

In [6]:
#Substantial datatype conversion help here from https://pbpython.com/currency-cleanup.html

def clean_price(x):
    
    if isinstance(x, str):
#If the value is a string, then remove currency symbol, delimiters and anything 
#else following the price; otherwise, the value is numeric and can be converted as is.
        
        return(x.replace('£', '').replace(',', '').split(' - ')[0])
    #strip price of the pound sign
    #we need to get rid of price ranges. I will just consume the LOWER BOUND in case of a price range.

    return(x)

In [7]:
#apply my price function to convert the price column type
ecom['price'] = ecom['price'].apply(clean_price).astype('float')

In [8]:
#strip out "out of 5 stars" from average_review_rating
ecom['average_review_rating'] = ecom['average_review_rating'].str.strip(' out of 5 stars')
ecom['average_review_rating'].head(2)

0    4.9
1     4.
Name: average_review_rating, dtype: object

In [9]:
#strip out the word "new" from number_available_in_stock; we'll handle this during feature eng

In [10]:
#strip out commas from number_of_reviews column
ecom['number_of_reviews'] = ecom['number_of_reviews'].str.replace(",", "")

In [11]:
ecom['number_of_reviews'].sort_values(ascending=False)[:30]

3182    99
7179    99
1541    98
6616    98
133     97
4153    97
9434    97
6642    96
6552    96
3305    95
5166    95
6878    95
7183    95
1678    94
9476    94
6066    93
792     92
664     92
1812    92
7326    92
9441    91
4280    91
7422    91
9830    91
7219    91
9945    91
3840    90
4688     9
2207     9
159      9
Name: number_of_reviews, dtype: object

In [12]:
#double check that our digits converted correctly
ecom.loc[ecom['number_of_reviews']=='1040']

Unnamed: 0,uniq_id,product_name,manufacturer,price,number_available_in_stock,number_of_reviews,number_of_answered_questions,average_review_rating,amazon_category_and_sub_category,customers_who_bought_this_item_also_bought,description,product_information,product_description,items_customers_buy_after_viewing_this_item,customer_questions_and_answers,customer_reviews,sellers
8872,c751a76dd7668f78b4222b5547e7249b,TOMY Pop-Up Pirate,Tomy,9.99,59 new,1040,11.0,4.0,Characters & Brands > Tomy,http://www.amazon.co.uk/Hungry-Hippos-Elefun-F...,Style Name:Pop-Up-Pirate/T7028 Product Descrip...,Technical Details Brand Tomy Model NumberT7028...,Style Name:Pop-Up-Pirate/T7028 Product Descrip...,,Is this good for 5 and 6 year old kids? // Hi ...,Crazy fun // 4.0 // 9 Sept. 2007 // By\n \n...,"{""seller""=>[{""Seller_name_1""=>""Amazon.co.uk"", ..."


### Converting object columns to numeric where applicable

We had 4 columns originally that seemed like good candidates for conversion to numeric:

* price
* number_of_reviews
* number_of_answered_questions
* average_review_rating

In [13]:
#force num type on number_of_reviews, number_of_answered_questions, price, average_review_rating
ecom['number_of_answered_questions'] = ecom[['number_of_answered_questions']].apply(pd.to_numeric)

In [14]:
ecom['average_review_rating'] = ecom['average_review_rating'].astype(float)

In [15]:
#first remove nulls for this to work
ecom.dropna(subset=['number_of_reviews'], inplace=True)

#convert
ecom['number_of_reviews'] = ecom['number_of_reviews'].astype(int)

In [16]:
#confirm conversions
ecom.dtypes

uniq_id                                         object
product_name                                    object
manufacturer                                    object
price                                          float64
number_available_in_stock                       object
number_of_reviews                                int64
number_of_answered_questions                   float64
average_review_rating                          float64
amazon_category_and_sub_category                object
customers_who_bought_this_item_also_bought      object
description                                     object
product_information                             object
product_description                             object
items_customers_buy_after_viewing_this_item     object
customer_questions_and_answers                  object
customer_reviews                                object
sellers                                         object
dtype: object

### Imputing nulls

For most of the fields, imputing with a 0 seems to make sense -- no reviews is no different than 0 reviews.

For price, however, we'll want to impute with the mean, now that we've successfully converted it to numeric.

As for average review rating, in the absence of one, we will simply drop those few observations since that's a valuable column and I'd want to be careful extending any kind of mean to it, since it is possibly our target.

In [17]:
#`dropna.()` nulls without an average_review_rating since we need those for training, and there are few
ecom.dropna(subset=['average_review_rating'], inplace=True)

In [18]:
#fill price with average of column
ecom['price'] = ecom['price'].fillna(ecom['price'].mean())

In [19]:
#fill in NaN's with 0's for everything else
ecom = ecom.fillna(0)
ecom.head(3)

Unnamed: 0,uniq_id,product_name,manufacturer,price,number_available_in_stock,number_of_reviews,number_of_answered_questions,average_review_rating,amazon_category_and_sub_category,customers_who_bought_this_item_also_bought,description,product_information,product_description,items_customers_buy_after_viewing_this_item,customer_questions_and_answers,customer_reviews,sellers
0,eac7efa5dbd3d667f26eb3d3ab504464,Hornby 2014 Catalogue,Hornby,3.42,5 new,15,1.0,4.9,Hobbies > Model Trains & Railway Sets > Rail V...,http://www.amazon.co.uk/Hornby-R8150-Catalogue...,Product Description Hornby 2014 Catalogue Box ...,Technical Details Item Weight640 g Product Dim...,Product Description Hornby 2014 Catalogue Box ...,http://www.amazon.co.uk/Hornby-R8150-Catalogue...,Does this catalogue detail all the previous Ho...,Worth Buying For The Pictures Alone (As Ever) ...,"{""seller""=>[{""Seller_name_1""=>""Amazon.co.uk"", ..."
1,b17540ef7e86e461d37f3ae58b7b72ac,FunkyBuys® Large Christmas Holiday Express Fes...,FunkyBuys,16.99,0,2,1.0,4.0,Hobbies > Model Trains & Railway Sets > Rail V...,http://www.amazon.co.uk/Christmas-Holiday-Expr...,Size Name:Large FunkyBuys® Large Christmas Hol...,Technical Details Manufacturer recommended age...,Size Name:Large FunkyBuys® Large Christmas Hol...,http://www.amazon.co.uk/Christmas-Holiday-Expr...,can you turn off sounds // hi no you cant turn...,Four Stars // 4.0 // 18 Dec. 2015 // By\n \...,"{""seller""=>{""Seller_name_1""=>""UHD WHOLESALE"", ..."
2,348f344247b0c1a935b1223072ef9d8a,CLASSIC TOY TRAIN SET TRACK CARRIAGES LIGHT EN...,ccf,9.99,2 new,17,2.0,3.9,Hobbies > Model Trains & Railway Sets > Rail V...,http://www.amazon.co.uk/Classic-Train-Lights-B...,BIG CLASSIC TOY TRAIN SET TRACK CARRIAGE LIGHT...,Technical Details Manufacturer recommended age...,BIG CLASSIC TOY TRAIN SET TRACK CARRIAGE LIGHT...,http://www.amazon.co.uk/Train-With-Tracks-Batt...,What is the gauge of the track // Hi Paul.Trut...,**Highly Recommended!** // 5.0 // 26 May 2015 ...,"{""seller""=>[{""Seller_name_1""=>""DEAL-BOX"", ""Sel..."


In [20]:
ecom.isnull().sum()

uniq_id                                        0
product_name                                   0
manufacturer                                   0
price                                          0
number_available_in_stock                      0
number_of_reviews                              0
number_of_answered_questions                   0
average_review_rating                          0
amazon_category_and_sub_category               0
customers_who_bought_this_item_also_bought     0
description                                    0
product_information                            0
product_description                            0
items_customers_buy_after_viewing_this_item    0
customer_questions_and_answers                 0
customer_reviews                               0
sellers                                        0
dtype: int64

## EDA

Interesting possible questions for exploration:

* Is there a correlation between number of reviews and number of questions?
* Is there a relationship between the number of sellers and the star rating?
* Is there a relationship between the star rating and the number of reviews?
* Is there a relationship between the product description and the star rating?
* Is there a relationship between a product being sold as new vs. used and its rating?
* Is there a relaionship between the product price and average rating?
* Do certain categories rank higher than others?

In [21]:
ecom.columns

Index(['uniq_id', 'product_name', 'manufacturer', 'price',
       'number_available_in_stock', 'number_of_reviews',
       'number_of_answered_questions', 'average_review_rating',
       'amazon_category_and_sub_category',
       'customers_who_bought_this_item_also_bought', 'description',
       'product_information', 'product_description',
       'items_customers_buy_after_viewing_this_item',
       'customer_questions_and_answers', 'customer_reviews', 'sellers'],
      dtype='object')

In [22]:
ecom['price'].value_counts().sort_values(ascending=False)[:15]
#do as dist, instead

20.265209    1432
9.990000      189
4.990000      140
14.990000     132
5.990000      126
6.990000      125
7.990000      125
12.990000     123
2.990000      118
3.990000      114
19.990000     112
11.990000      89
8.990000       82
1.990000       78
10.990000      77
24.990000      69
13.990000      62
29.990000      61
16.990000      52
17.990000      42
15.990000      41
7.950000       41
9.950000       40
0.990000       39
39.990000      39
15.000000      37
Name: price, dtype: int64

In [23]:
ecom.describe()

Unnamed: 0,price,number_of_reviews,number_of_answered_questions,average_review_rating
count,9982.0,9982.0,9982.0,9982.0
mean,20.265209,9.139952,1.69425,2.096584
std,42.911606,33.728145,2.469224,2.173103
min,0.01,1.0,0.0,0.0
25%,5.95,1.0,1.0,0.0
50%,12.99,2.0,1.0,0.0
75%,20.265209,6.0,2.0,4.2
max,2439.92,1399.0,39.0,4.9


In [24]:
ecom['number_of_reviews'].value_counts()

1      4315
2      1427
3       768
4       524
5       351
       ... 
379       1
108       1
124       1
132       1
243       1
Name: number_of_reviews, Length: 194, dtype: int64

In [25]:
ecom['number_available_in_stock'].value_counts()[:10]
#replace w/ a search for ~like new, ~like used, and then get a count

0        2497
2 new    1336
3 new     980
4 new     752
5 new     589
6 new     475
1 new     403
7 new     369
8 new     292
9 new     207
Name: number_available_in_stock, dtype: int64

In [26]:
#groupby product ID to reveal number of verbal reviews

## Feature engineering

We need to convert things like ratings so that the modeling works correctly.

The plan:

* Categorize / one hot encode the ratings column 

In [27]:
#one-hot encode ratings column on a scale of 1-5? Or is target?

In [28]:
#move out new vs. used items into sep. columns (binarize)

## NLP

Open questions:

* Sentiment of reviews
* Does sentiment correspond to star rating?

In [29]:
#vectorize text

## Modeling - classifier

In [30]:
#reviews
#price
#category

## Evaluation

## Conclusions & next steps