# Amazon Product Review Dataset Analysis

In this notebook, we perform exploratory data analysis on the [Amazon product review dataset](https://s3.amazonaws.com/amazon-reviews-pds/readme.html), which includes millions of Amazon customer reviews spanning two decades, starting in 1995.

This dataset is a rich source of information for ML and NLP projects. To keep the analysis runnable locally without a cluster, this notebook focuses on a smaller portion of the full dataset.

## Pre-requisites
You're assumed to be somewhat familiar with:
* the Linux command line
* `aq_tools`
* the AWS API
* Essentia commands

## Environment and Note
This tutorial runs locally on a single EC2 instance and focuses on using `aq_tools` without udb.



## Data Source

The data is located in an S3 bucket named `wataru-essentia-proto` (CHANGE THIS TO PUBLIC S3 BUCKET LATER), in gzipped TSV format.

To scan, fetch, and inspect the data, we need to:
1. set Essentia's datastore to the S3 bucket
2. create a data category that includes the targeted data
3. look at the summary of the category

**Selecting datastore**<br>
We start by setting the S3 bucket as the datastore with the `ess select` command.

In [3]:
ess select s3://wataru-essentia-proto

**Creating Category**<br>
Our dataset is located under the `/tsv/` directory in the bucket. The `ess ls` command displays the files inside that directory.

In [6]:
# piping the output to head to get the top 10 lines, for simplicity
ess ls /tsv/ | head -n 10

 230M Aug 01 18:55    /tsv/amazon_reviews_multilingual_DE_v1_00.tsv.gz
  67M Aug 01 18:55    /tsv/amazon_reviews_multilingual_FR_v1_00.tsv.gz
  90M Aug 01 18:55    /tsv/amazon_reviews_multilingual_JP_v1_00.tsv.gz
 333M Aug 01 18:55    /tsv/amazon_reviews_multilingual_UK_v1_00.tsv.gz
 1.4G Aug 01 18:55    /tsv/amazon_reviews_multilingual_US_v1_00.tsv.gz
 618M Aug 01 18:55    /tsv/amazon_reviews_us_Apparel_v1_00.tsv.gz
 555M Aug 01 18:55    /tsv/amazon_reviews_us_Automotive_v1_00.tsv.gz
 340M Aug 01 18:55    /tsv/amazon_reviews_us_Baby_v1_00.tsv.gz
 871M Aug 01 18:56    /tsv/amazon_reviews_us_Beauty_v1_00.tsv.gz
 2.6G Aug 01 18:56    /tsv/amazon_reviews_us_Books_v1_00.tsv.gz


In this notebook we'll only use the multilingual datasets (Germany, France, Japan, UK and US), which are the first five files listed. <br>
The glob pattern matching these files in the S3 bucket is
```/tsv/amazon_reviews_multilingual_*_v1_00.tsv.gz```
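As a quick sanity check of what the pattern selects, here is a minimal Python sketch that applies the same wildcard to the file names from the listing above. It uses Python's `fnmatch` module, whose wildcard rules are assumed (not confirmed) to be close enough to Essentia's glob matching for this simple pattern.

```python
from fnmatch import fnmatch

# File names copied from the `ess ls /tsv/` listing above.
listed_files = [
    "/tsv/amazon_reviews_multilingual_DE_v1_00.tsv.gz",
    "/tsv/amazon_reviews_multilingual_FR_v1_00.tsv.gz",
    "/tsv/amazon_reviews_multilingual_JP_v1_00.tsv.gz",
    "/tsv/amazon_reviews_multilingual_UK_v1_00.tsv.gz",
    "/tsv/amazon_reviews_multilingual_US_v1_00.tsv.gz",
    "/tsv/amazon_reviews_us_Apparel_v1_00.tsv.gz",
    "/tsv/amazon_reviews_us_Automotive_v1_00.tsv.gz",
    "/tsv/amazon_reviews_us_Baby_v1_00.tsv.gz",
    "/tsv/amazon_reviews_us_Beauty_v1_00.tsv.gz",
    "/tsv/amazon_reviews_us_Books_v1_00.tsv.gz",
]

pattern = "/tsv/amazon_reviews_multilingual_*_v1_00.tsv.gz"

# Keep only the paths that match the wildcard pattern.
matched = [path for path in listed_files if fnmatch(path, pattern)]

for path in matched:
    print(path)
```

Only the five multilingual files should be printed; the `amazon_reviews_us_*` files do not match.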

We create the category from this pattern with the `ess category add` command.

In [7]:
# create a category named amazon_reviews; this returns an error if it already exists
ess category add amazon_reviews "/tsv/amazon_reviews_multilingual_*_v1_00.tsv.gz" 

2019-10-28 18:43:24 ip-10-10-1-118 ess[3471]: Fetching file list from datastore.
2019-10-28 18:43:24 ip-10-10-1-118 ess[3471]: Examining largest matched file to determine compression type: /tsv/amazon_reviews_multilingual_US_v1_00.tsv.gz
2019-10-28 18:43:25 ip-10-10-1-118 ess[3471]: Probing largest matched file to determine data configuration: /tsv/amazon_reviews_multilingual_US_v1_00.tsv.gz


**Taking a look at the detail**<br>
Now that we have a category, let's take a look at its summary, including the schema and sample data, with `ess summary`.

In [8]:
ess summary amazon_reviews

Name:        amazon_reviews
Pattern:     /tsv/amazon_reviews_multilingual_*_v1_00.tsv.gz
Exclude:     None
Date Format: auto
Date Regex:  
Archive:     
Delimiter:   Tab
# of files:  5
Total size:  2.1GB
File range:  1970-01-01 - 1970-01-01
# columns:   15
Column Spec: S:marketplace I:customer_id S:review_id S:product_id I:product_parent S:product_title S:product_category I:star_rating I:helpful_votes I:total_votes S:vine S:verified_purchase S:review_headline S,esc:review_body S:review_date
Pkey: 
Schema: S:marketplace I:customer_id S:review_id S:product_id I:product_parent S:product_title S:product_category I:star_rating I:helpful_votes I:total_votes S:vine S:verified_purchase S:review_headline S,esc:review_body S:review_date
Preprocess:  
usecache:    False
Comment:    

First few lines:
marketplace	customer_id	review_id	product_id	product_parent	product_title	product_category	star_rating	helpful_votes	total_votes	vine	verified_purchase	review_headline	review_body	review_date
US	5309

Essentia was able to extract the data schema automatically, along with each column name. According to the metadata file, the columns represent the following.

* **marketplace**: 2 letter country code of the marketplace where the review was written
* **customer_id**: Random identifier that can be used to aggregate reviews written by a single author.
* **review_id**: The unique ID of the review
* **product_id**: The unique Product ID the review pertains to. In the multilingual dataset the reviews
* **product_parent**: Random identifier that can be used to aggregate reviews for the same product.
* **product_title**: title of the product
* **product_category**: Broad product category that can be used to group reviews (also used to group the dataset into coherent parts)
* **star_rating**: The 1–5 star rating of the review.
* **helpful_votes**: Number of helpful votes.
* **total_votes**: Number of total votes the review received.
* **vine**: Review was written as part of the Vine program
* **verified_purchase**: the review is on a verified purchase
* **review_headline**: The title of the review
* **review_body**: the review text
* **review_date**: The date the review was written.

Looks like we can extract some useful information from columns such as `marketplace`, `product_category`, `star_rating`, `review_date`. 
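To work with the schema programmatically, the `Column Spec` line from the summary can be split into columns. The sketch below is plain Python, not part of Essentia; the helper name `parse_column_spec` is hypothetical, and reading `I` as an integer type is an assumption based on the sample columns.

```python
# Column spec string copied from the `ess summary amazon_reviews` output above.
column_spec = (
    "S:marketplace I:customer_id S:review_id S:product_id I:product_parent "
    "S:product_title S:product_category I:star_rating I:helpful_votes "
    "I:total_votes S:vine S:verified_purchase S:review_headline "
    "S,esc:review_body S:review_date"
)


def parse_column_spec(spec):
    """Split 'TYPE[,attr]:name' tokens into (name, type_code, attrs) tuples."""
    columns = []
    for token in spec.split():
        type_part, name = token.split(":", 1)
        type_code, *attrs = type_part.split(",")
        columns.append((name, type_code, attrs))
    return columns


columns = parse_column_spec(column_spec)

# Columns tagged `I` (assumed here to mean integer).
integer_columns = [name for name, type_code, _ in columns if type_code == "I"]

print(len(columns), "columns")
print("I-typed columns:", integer_columns)
```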

## Exploratory Data Analysis
**Note:**<br>
There are multiple ways of counting unique values in aq_tools, such as:
* using `aq_cnt`
* setting up udb and using the `aq_udb -cnt` option

Here, we'll be using the `aq_cnt` command.

### Occurrence Counts on Columns

Now that we know what the data looks like, we can start analyzing it. We'll begin with the distribution of star ratings in the dataset.

**star rating**<br>

We can display the distribution of `star_rating` with `aq_cnt`.

* `ess stream amazon_reviews --bulk --master`
    * `--bulk` and `--master` are set in order to process the 5 files as one stream and run `aq_cnt` on it once. By default, the `ess` command runs the given command serially, once for each file.
* `aq_cnt -f,+1,tsv,eok...`
    * `eok` is added to skip invalid rows, as well as the headers between files (4 headers will be present in the stream from the files).
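To make the header issue concrete, the Python sketch below mimics the effect described above on a tiny two-column stream. It is only an illustration, not how `aq_cnt` is implemented: the function `data_rows` and the sample rows are hypothetical, and the stream stands in for two files concatenated one after the other.

```python
def data_rows(lines):
    """Yield (marketplace, star_rating) pairs from a concatenated TSV stream.

    The first line is always skipped (similar in spirit to `+1`), and any
    later line whose rating is not a number is dropped (similar in spirit
    to `eok`), which removes the repeated header lines.
    """
    for index, line in enumerate(lines):
        if index == 0:
            continue  # header of the first file
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            continue  # malformed row
        try:
            rating = int(fields[1])
        except ValueError:
            continue  # header of a later file: "star_rating" is not a number
        yield fields[0], rating


# Hypothetical stream: two small files, each starting with its own header.
stream = [
    "marketplace\tstar_rating\n",
    "US\t5\n",
    "US\t4\n",
    "marketplace\tstar_rating\n",
    "UK\t3\n",
]

rows = list(data_rows(stream))
print(rows)
```

The second header line is silently dropped because its rating field cannot be parsed as a number.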

In [27]:
ess stream amazon_reviews --bulk --master "*" "*" "aq_cnt -f,+1,tsv,eok - -d %cols -kX - rating star_rating"

"star_rating","count"
2,396197
1,567938
3,766981
4,1772963
5,6330030


The distribution of star ratings is plotted below.

<img src="img/star_distro.png">

**Marketplace**<br>

You can apply a similar command to get the total number of reviews in each marketplace.

In [28]:
ess stream amazon_reviews --bulk --master "*" "*" "aq_cnt -f,+1,tsv,eok - -d %cols -kX - marketplace marketplace"

"marketplace","count"
"US",6931013
"UK",1707480
"JP",262430
"FR",254075
"DE",679111


The marketplace counts are plotted below.

<img src="img/marketplace_count.png">

**Product Category**<br>

The same approach gives the count of each product category across all marketplaces and years.

In [29]:
ess stream amazon_reviews --bulk --master "*" "*" "aq_cnt -f,+1,tsv,eok - -d %cols -kX - category product_category"

"product_category","count"
"Mobile_Electronics",184
"Outdoors",3195
"Tools",7510
"Apparel",127
"Grocery",20
"Pet Products",51
"Beauty",55
"Software",204
"Furniture",101
"Lawn and Garden",1973
"Automotive",1534
"Baby",11960
"Kitchen",2365
"Camera",36786
"Video",56804
"Personal_Care_Appliances",705
"Home Entertainment",37807
"Office Products",4057
"Musical Instruments",15963
"Home",7008
"Health & Personal Care",1434
"Shoes",12559
"Luggage",475
"Wireless",36228
"Sports",9694
"Digital_Ebook_Purchase",1558331
"Digital_Music_Purchase",163296
"Toys",117348
"Video Games",22489
"Electronics",27068
"Music",1426187
"PC",95778
"Mobile_Apps",1773737
"Watches",17169
"Digital_Video_Download",1116575
"Video DVD",2066643
"Books",1194816
"Home Improvement",5873


Just like that, you can take a look at the occurrence counts of any categorical column.

### Time Series

This dataset contains reviews going back to 1995; I was not even aware that the Amazon store existed back then. 
We'd like to explore how the characteristics and patterns of reviews changed over time. 

**Breaking down timestamp into month and year**<br>
Currently the timestamp of each review (year, month and day) is stored as a single string in the `review_date` column. Let's break it down into year and month so that records can be grouped by these values.

We use aq_pp's `-mapf` and `-mapc` options to extract and map the year and month, and output only the `review_id`, `review_date`, `year` and `month` columns for clarity.

In [40]:
ess stream amazon_reviews --bulk --master "*" "*" \
'aq_pp -f,+1,tsv - -d %cols -mapf review_date "%%year:4-4%%-%%month:2-2%%%*" \
-mapc s:year %%YEAR%% -mapc s:month %%MONTH%% -c review_id review_date year month' | head -n 6

"review_id","review_date","year","month"
"RVOG49N0H1FB6","2014-08-01","2014","08"
"RNCMD6OLTP4HM","2014-12-04","2014","12"
"R4AUOBI8YC0R8","2014-12-04","2014","12"
"R1VSHIJ1RHIBTE","2015-07-16","2015","07"
"R3JBLVALWSLCZD","2014-02-08","2014","02"


Now the data is ready for the time series analysis.
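For readers more used to Python, the same year and month extraction can be sketched as below. This is an equivalent in spirit only, assuming every `review_date` follows the `YYYY-MM-DD` layout seen in the sample output; the helper name `split_review_date` is hypothetical.

```python
def split_review_date(review_date):
    """Return (year, month) as strings from a 'YYYY-MM-DD' review_date."""
    year, month, _day = review_date.split("-")
    return year, month


# Dates copied from the sample output above.
sample_dates = ["2014-08-01", "2014-12-04", "2015-07-16"]

parsed = [split_review_date(date) for date in sample_dates]

for date, (year, month) in zip(sample_dates, parsed):
    print(date, year, month)
```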

**Number of reviews over the years**<br>
Using the command from the last cell, we output `review_id` and `year` and pipe them into the `aq_cnt` command.
Because `review_id` is a unique identifier for each review, we count the number of unique review IDs grouped by year.<br>
Lastly, we use the `aq_ord` command to sort the output by year.
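The logic of this group, count-unique, and sort sequence can be sketched in Python as follows. It is an illustration of the idea rather than of `aq_cnt` itself, and it uses only the five sample rows shown in the previous output.

```python
from collections import defaultdict

# (review_id, year) pairs taken from the sample output of the previous cell.
rows = [
    ("RVOG49N0H1FB6", "2014"),
    ("RNCMD6OLTP4HM", "2014"),
    ("R4AUOBI8YC0R8", "2014"),
    ("R1VSHIJ1RHIBTE", "2015"),
    ("R3JBLVALWSLCZD", "2014"),
]

# Group by year, keeping a set of review IDs so duplicates count once.
ids_by_year = defaultdict(set)
for review_id, year in rows:
    ids_by_year[year].add(review_id)

# Count unique IDs per year and sort by year.
annual_reviews = {year: len(ids) for year, ids in sorted(ids_by_year.items())}

for year, count in annual_reviews.items():
    print(year, count)
```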

In [51]:
ess stream amazon_reviews --bulk --master "*" "*" \
'aq_pp -f,+1,tsv - -d %cols -mapf review_date "%%year:4-4%%-%%month:2-2%%%*" \
-mapc s:year %%YEAR%% -mapc s:month %%MONTH%% -c review_id year' | \
aq_cnt -f,+1 - -d S:review_id S:year -g year -k annual_reviews review_id | \
aq_ord -f,+1 - -d S:year I:count I:annual_reviews -sort year

"year","count","annual_reviews"
2019-10-28 21:46:49 ip-10-10-1-118 ess[5045]: ***Error*** Multiple errors.  See task.log for more details
"2004",13,13
"2005",69,69
"2006",44,44
"2007",170,170
"2008",951,951
"2009",1860,1860
"2010",2098,2098
"2011",2590,2590
"2012",6634,6634
"2013",24629,24629
"2014",28438,28438
"2015",18275,18275


**Number of reviews per month over the years**<br>

We can break the yearly counts down by month by grouping on both year and month. <br>
Afterwards, we can use `aq_rst` to pivot the result into a year-by-month table.


In [60]:
ess stream amazon_reviews --bulk --master "*" "*" \
'aq_pp -f,+1,tsv - -d %cols -mapf review_date "%%year:4-4%%-%%month:2-2%%%*" \
-mapc s:year %%YEAR%% -mapc s:month %%MONTH%% -c review_id year month' | \
aq_cnt -f,+1 - -d S:review_id S:year S:month -g year month -k reviews review_id | \
aq_rst -f,+1 - -d S:month S:year I:reviews -key month -lab year -val reviews

2019-10-28 22:22:50 ip-10-10-1-118 ess[5293]: ***Error*** Multiple errors.  See task.log for more details
"month","01","02","03","04","05","06","07","08","09","10","11","12"
"2004",0,0,0,3,1,0,0,1,2,0,1,5
"2005",5,17,4,7,5,9,3,9,3,0,3,4
"2006",1,3,6,2,3,0,5,4,7,1,3,9
"2007",7,4,8,4,2,6,8,7,8,4,49,63
"2008",109,50,62,57,64,82,66,94,92,84,101,90
"2009",130,174,163,130,140,111,198,134,137,177,161,205
"2010",235,169,130,162,152,129,154,189,188,222,183,185
"2011",220,227,217,196,180,171,225,162,214,218,250,310
"2012",436,335,312,368,389,331,400,397,545,533,962,1626
"2013",2174,1731,2088,1931,1906,1734,1856,1926,2045,2117,2425,2696
"2014",3019,2180,2395,2285,2266,1938,2072,2295,2290,2336,2450,2912
"2015",2913,2188,2238,2396,2246,2137,2110,2047,0,0,0,0


<img src="img/year_month.png">
*Heatmap, month vs year*
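The pivot step turns long-form `(year, month, count)` records into a wide table. A minimal Python sketch of that reshaping is shown below; the function name `pivot` is hypothetical, and the records are reconstructed from the non-zero 2004 cells and the first three 2005 cells of the table above.

```python
# Long-form records reconstructed from the table above.
records = [
    ("2004", "04", 3), ("2004", "05", 1), ("2004", "08", 1),
    ("2004", "09", 2), ("2004", "11", 1), ("2004", "12", 5),
    ("2005", "01", 5), ("2005", "02", 17), ("2005", "03", 4),
]

months = [f"{m:02d}" for m in range(1, 13)]


def pivot(records, months):
    """Build {year: {month: count}}, filling missing months with 0."""
    table = {}
    for year, month, reviews in records:
        row = table.setdefault(year, {m: 0 for m in months})
        row[month] += reviews
    return table


table = pivot(records, months)

print("year," + ",".join(months))
for year in sorted(table):
    print(year + "," + ",".join(str(table[year][m]) for m in months))
```

Months without any record are filled with 0, which is also how empty cells appear in the table above.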

**Number of reviews per month**<br>

This is useful for checking whether there are monthly or seasonal trends in the number of reviews.
Let's dig a little deeper and get the count of each `star_rating` for each month. We can do this by further grouping records by the `star_rating` value in the `aq_cnt` command, then using `aq_rst` to transform the result into a pivot table.

In [57]:
ess stream amazon_reviews --bulk --master "*" "*" \
'aq_pp -f,+1,tsv - -d %cols -mapf review_date "%%year:4-4%%-%%month:2-2%%%*" \
-mapc s:year %%YEAR%% -mapc s:month %%MONTH%% -c review_id star_rating month' | \
aq_cnt -f,+1 - -d S:review_id I:star_rating S:month -g month star_rating -k monthly review_id | \
aq_rst -f,+1 - -d S:month I:star_rating I:monthly -key star_rating -lab month -val monthly

2019-10-28 22:16:00 ip-10-10-1-118 ess[5236]: ***Error*** Multiple errors.  See task.log for more details
"star_rating","01","02","03","04","05","06","07","08","09","10","11","12"
1,472,394,433,440,419,432,457,478,293,322,465,460
2,294,248,260,246,262,210,253,257,206,197,255,269
3,609,470,538,502,483,447,470,490,380,391,445,505
4,1371,1022,1089,1050,1078,928,1067,1078,895,832,856,1131
5,6503,4944,5303,5303,5112,4631,4850,4962,3757,3950,4567,5740


<img src="img/rating_monthly.png">
*Heatmap of rating vs month*

### Number of Reviews Left per Customer

Let's take a look at how many reviews each customer leaves.<br>
To get this value, we need to take 2 steps:
1. Count the number of reviews left by each customer by grouping by `customer_id`. Let's call this `reviews_per_customer`.
2. Count the number of unique `customer_id` values for each value of `reviews_per_customer`.

**Note**<br>
The output of the commands is piped into the `head` command to display only the first 20 lines for clarity. Feel free to remove it to display the full result.

In [1]:
# step 1
ess stream amazon_reviews --bulk --master "*" "*" \
'aq_cnt -f,+1,eok,tsv - -d %cols -g customer_id -k reviews_per_customer review_id' | head -n 20

"customer_id","row","reviews_per_customer"
146004,2,2
254421,1,1
565563,1,1
52303,1,1
1162122,1,1
108745,1,1
1325068,1,1
515450,1,1
3733178,1,1
5662728,1,1
1117977,3,3
6454258,1,1
1772983,1,1
8556468,1,1
2288113,1,1
4599064,1,1
12461968,1,1
5724518,1,1
13397066,1,1


In [3]:
# step 2: pipe the result from step 1 into another aq_cnt,
# then sort with aq_ord on the last line
ess stream amazon_reviews --bulk --master "*" "*" \
'aq_cnt -f,+1,eok,tsv - -d %cols -g customer_id -k reviews_per_customer review_id | \
 aq_cnt -f,+1 - -d I:customer_id I:row I:reviews_per_customer -g reviews_per_customer -k numbers_customer customer_id | \
 aq_ord -f,+1 - -d I:reviews_per_customer I:row I:numbers_customer -sort reviews_per_customer' | head -n 20

"reviews_per_customer","row","numbers_customer"
1,4074808,4074808
2,851376,851376
3,305377,305377
4,142319,142319
5,76569,76569
6,45982,45982
7,30036,30036
8,20271,20271
9,14798,14798
10,10746,10746
11,8224,8224
12,6567,6567
13,5181,5181
14,4315,4315
15,3528,3528
16,2887,2887
17,2401,2401
18,2102,2102
19,1848,1848


<img src="img/reviews_customers.png">

*Number of reviews vs. number of customers*

You can observe that most customers leave only 1 review, and few leave more than about 5.
Note that this visualization only covers customers with up to 30 reviews, but the maximum number of reviews left by one customer was 3162.
This might indicate that some of the reviews or customers were fraudulent.
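The two counting steps used in this section amount to a "count of counts". As a rough Python illustration (not the `aq_cnt` implementation), the sketch below applies them to a short list with one `customer_id` per review; the IDs and their multiplicities follow the first rows of the step 1 output, and the list itself is a reconstruction for illustration.

```python
from collections import Counter

# One customer_id per review, reconstructed from the first rows of step 1
# (146004 left 2 reviews, 1117977 left 3, the others left 1 each).
review_customer_ids = [
    146004, 146004,
    254421,
    565563,
    52303,
    1117977, 1117977, 1117977,
]

# Step 1: number of reviews left by each customer.
reviews_per_customer = Counter(review_customer_ids)

# Step 2: number of customers for each reviews_per_customer value.
customers_per_count = Counter(reviews_per_customer.values())

for n_reviews, n_customers in sorted(customers_per_count.items()):
    print(n_reviews, n_customers)
```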

### Verifying Customer Reviews

This dataset comes with a `verified_purchase` column, which indicates whether the customer actually purchased the product from Amazon before leaving the review.
We will compare the number of reviews left per customer between verified and non-verified reviews.

**Verified Reviews**<br>
The steps below, each corresponding to a line in the following cell, give the number of reviews left by each customer for verified purchases only, from which the average is then taken.
1. create an Essentia data stream in bulk mode
2. keep only the verified records
3. get the number of reviews left by each customer by counting the unique `review_id` values per `customer_id`

In [1]:
ess stream amazon_reviews --bulk --master "*" "*" \
'aq_pp -f,+1,eok,tsv - -d %cols -filt "verified_purchase == \"Y\"" -c review_id customer_id | \
 aq_cnt -f,+1 - -d S:review_id I:customer_id -g customer_id -k num_reviews review_id | head -n 20'

"customer_id","row","num_reviews"
565563,1,1
52303,1,1
1162122,1,1
108745,1,1
1325068,1,1
515450,1,1
3733178,1,1
5662728,1,1
1117977,3,3
6454258,1,1
1772983,1,1
8556468,1,1
12461968,1,1
5724518,1,1
13397066,1,1
10015518,1,1
10319957,1,1
10409415,1,1
11092682,1,1


Taking the average of `num_reviews` gives 1.60, meaning that customers whose purchases are verified left 1.6 reviews on average.<br>
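The notebook does not show how the average was computed. One way to do it, sketched below in Python, is to parse the CSV-style output of `aq_cnt` and average the `num_reviews` column. The excerpt used here contains only four rows copied from the output above, so it will not reproduce the 1.60 figure, which comes from the full output.

```python
import csv
import io

# Four rows copied from the verified-purchase output above (excerpt only).
aq_cnt_output = '''"customer_id","row","num_reviews"
565563,1,1
52303,1,1
1162122,1,1
1117977,3,3
'''

# Read the header and rows, then pull out the num_reviews column as integers.
reader = csv.DictReader(io.StringIO(aq_cnt_output))
num_reviews = [int(row["num_reviews"]) for row in reader]

mean_reviews = sum(num_reviews) / len(num_reviews)
print(f"customers: {len(num_reviews)}, mean reviews per customer: {mean_reviews:.2f}")
```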
Let's take a look at un-verified purchases.

**Un-verified Reviews**<br>

In [2]:
ess stream amazon_reviews --bulk --master "*" "*" \
'aq_pp -f,+1,eok,tsv - -d %cols -filt "verified_purchase == \"N\"" -c review_id customer_id | \
 aq_cnt -f,+1 - -d S:review_id I:customer_id -g customer_id -k num_reviews review_id | head -n 20'

"customer_id","row","num_reviews"
146004,2,2
254421,1,1
1576648,1,1
2016966,1,1
2288113,1,1
4599064,1,1
28075342,1,1
30067162,1,1
30755027,1,1
43540573,1,1
51921580,1,1
432832,1,1
5491151,1,1
10517269,1,1
10641330,1,1
36536216,1,1
16640269,1,1
51996373,1,1
160760,1,1


Unverified reviewers left on average 1.7 reviews per customer, which is not much of a difference. Digging deeper, though, below is a boxplot of the number of reviews per customer ID by verification status.

<img src="img/review_status.png">

Note that the y axis of this plot is rescaled with log10; even so, notice how skewed the unverified distribution is. The maximum number of reviews left by one customer for un-verified purchases was 3161. You can see that a small number of users leave a large number of reviews, which makes those reviews seem less legitimate.
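To get a feel for what the log10 rescaling does to the y axis, the short Python sketch below maps a few review counts to their positions on a log10 scale. The values 1 and 3161 come from the text; 10 and 100 are added only as reference points.

```python
import math

# 1 and 3161 are from the analysis above; 10 and 100 are reference points.
review_counts = [1, 10, 100, 3161]

# Position of each count on a log10 axis.
log_positions = [math.log10(count) for count in review_counts]

for count, position in zip(review_counts, log_positions):
    print(f"{count:>5} reviews -> {position:.2f}")
```

On this scale a customer with 3161 reviews sits only about 3.5 units above a customer with a single review, which is why the skew remains visible without flattening the rest of the plot.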


### Star Ratings by Product Categories

Finally, let's investigate the count of each star rating within each category. 
We need to group the records by `product_category` first, and count the occurrences of each star rating within each group.

In [75]:
ess stream amazon_reviews --bulk --master "*" "*" \
'aq_cnt -f,+1,eok,tsv - -d %cols -g product_category -kX - keyName star_rating | \
 aq_ord -f,+1 - -d S:product_category I:star_rating I:Count -sort product_category star_rating'

"product_category","star_rating","Count"
"Apparel",1,6
"Apparel",2,7
"Apparel",3,12
"Apparel",4,17
"Apparel",5,85
"Automotive",1,157
"Automotive",2,76
"Automotive",3,137
"Automotive",4,233
"Automotive",5,931
"Baby",1,626
"Baby",2,470
"Baby",3,818
"Baby",4,1759
"Baby",5,8287
"Beauty",1,6
"Beauty",3,3
"Beauty",4,9
"Beauty",5,37
"Books",1,68170
"Books",2,53808
"Books",3,90924
"Books",4,193814
"Books",5,788100
"Camera",1,1852
"Camera",2,1447
"Camera",3,3136
"Camera",4,7462
"Camera",5,22889
"Digital_Ebook_Purchase",1,58749
"Digital_Ebook_Purchase",2,59250
"Digital_Ebook_Purchase",3,132614
"Digital_Ebook_Purchase",4,327890
"Digital_Ebook_Purchase",5,979828
"Digital_Music_Purchase",1,4400
"Digital_Music_Purchase",2,3725
"Digital_Music_Purchase",3,8102
"Digital_Music_Purchase",4,24341
"Digital_Music_Purchase",5,122728
"Digital_Video_Download",1,71347
"Digital_Video_Download",2,53363
"Digital_Video_Download",3,98355
"Digital_Video_Download",4,216916
"Digital_Video_Download",5,676594
"Electronic

<img src="img/category_stars.png">
*category and numbers of star ratings*
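Because category sizes differ by orders of magnitude, shares are easier to compare than raw counts. The Python sketch below, independent of the `aq_tools` pipeline, computes the 5-star share for two of the categories using the rows printed above.

```python
from collections import defaultdict

# (product_category, star_rating, count) rows copied from the output above.
rows = [
    ("Apparel", 1, 6), ("Apparel", 2, 7), ("Apparel", 3, 12),
    ("Apparel", 4, 17), ("Apparel", 5, 85),
    ("Baby", 1, 626), ("Baby", 2, 470), ("Baby", 3, 818),
    ("Baby", 4, 1759), ("Baby", 5, 8287),
]

# Total reviews and 5-star reviews per category.
totals = defaultdict(int)
five_star = defaultdict(int)
for category, star_rating, count in rows:
    totals[category] += count
    if star_rating == 5:
        five_star[category] += count

# Fraction of each category's reviews that are 5-star.
five_star_share = {category: five_star[category] / totals[category] for category in totals}

for category, share in sorted(five_star_share.items()):
    print(f"{category}: {totals[category]} reviews, {share:.1%} five-star")
```

The per-category totals (127 for Apparel, 11960 for Baby) agree with the product category counts shown earlier in the notebook.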


**Visualizations for this notebook were done with Python, but [PivotBillions](https://pivotbillions.com/) is a great big data EDA, ETL, and visualization tool you can use without coding as well!**