# Visualize Amazon Customer Reviews Dataset - Part 2

### Dataset columns:

- `marketplace`: 2-letter country code (in this case all "US").
- `customer_id`: Random identifier that can be used to aggregate reviews written by a single author.
- `review_id`: A unique ID for the review.
- `product_id`: The Amazon Standard Identification Number (ASIN).  `http://www.amazon.com/dp/<ASIN>` links to the product's detail page.
- `product_parent`: The parent of that ASIN.  Multiple ASINs (color or format variations of the same product) can roll up into a single parent.
- `product_title`: Title description of the product.
- `product_category`: Broad product category that can be used to group reviews (for example, digital videos).
- `star_rating`: The review's rating (1 to 5 stars).
- `helpful_votes`: Number of helpful votes for the review.
- `total_votes`: Number of total votes the review received.
- `vine`: Was the review written as part of the [Vine](https://www.amazon.com/gp/vine/help) program?
- `verified_purchase`: Was the review from a verified purchase?
- `review_headline`: The title of the review itself.
- `review_body`: The text of the review.
- `review_date`: The date the review was written.

In [1]:
%%bash
pip install -q --upgrade pip
pip install -q pandas==0.23.0
pip install -q numpy==1.14.3
pip install -q matplotlib==3.0.3
pip install -q seaborn==0.8.1
pip install -q PyAthena==1.8.0

In [2]:
# Imports & Settings

import boto3
import sagemaker

import numpy as np
import pandas as pd
import seaborn as sns

import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format='retina'

# Get region 
session = boto3.session.Session()
region_name = session.region_name

# Get SageMaker session & default S3 bucket
sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()

# Set Athena database & table 
database_name = 'dsoaws'
table_name = 'amazon_reviews_parquet'

In [6]:
# PyAthena imports
from pyathena import connect
from pyathena.pandas_cursor import PandasCursor
from pyathena.util import as_pandas

In [7]:
# Set S3 staging directory -- this is a temporary directory used for Athena queries
s3_staging_dir = 's3://{0}/athena/staging'.format(bucket)

### 1. Which product categories are the highest rated by average rating?

In [9]:
# SQL statement
statement = """
SELECT product_category, AVG(star_rating) AS avg_star_rating
FROM {}.{} 
GROUP BY product_category 
ORDER BY avg_star_rating DESC
""".format(database_name, table_name)

print(statement)


SELECT product_category, AVG(star_rating) AS avg_star_rating
FROM dsoaws.amazon_reviews_parquet 
GROUP BY product_category 
ORDER BY avg_star_rating DESC



In [11]:
cursor = connect(region_name=region_name, s3_staging_dir=s3_staging_dir).cursor()
cursor.execute(statement)

# Load query results into Pandas DataFrame and show results
df = as_pandas(cursor)
df.head()

Unnamed: 0,product_category,avg_star_rating
0,Gift Card,4.731363
1,Digital_Music_Purchase,4.642891
2,Music,4.436624
3,Books,4.341658
4,Grocery,4.312219


In [None]:
# TODO: Visualization
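
One possible way to fill in this visualization is a horizontal bar chart. The sketch below is only an illustration: it rebuilds a small DataFrame (the name `df_avg` is hypothetical) from the five rows shown in the output above, whereas in the notebook you would plot the full `df` returned by `as_pandas(cursor)`.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Sample rows copied from the df.head() output above (assumption: in the
# notebook you would use the full `df` from as_pandas(cursor) instead).
df_avg = pd.DataFrame({
    "product_category": ["Gift Card", "Digital_Music_Purchase", "Music", "Books", "Grocery"],
    "avg_star_rating": [4.731363, 4.642891, 4.436624, 4.341658, 4.312219],
})

# Sort ascending so the highest-rated category is drawn at the top of the chart
df_sorted = df_avg.sort_values("avg_star_rating", ascending=True)

fig, ax = plt.subplots(figsize=(8, 4))
ax.barh(df_sorted["product_category"], df_sorted["avg_star_rating"])
ax.set_xlim(1, 5)  # star ratings range from 1 to 5
ax.set_xlabel("Average star rating")
ax.set_ylabel("Product category")
ax.set_title("Average star rating per product category")
fig.tight_layout()
```

With `%matplotlib inline` active, the figure should render at the end of the cell.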

### 2. Which product categories have the most reviews?

In [12]:
# SQL statement
statement = """
SELECT product_category, COUNT(star_rating) AS count_star_rating 
FROM {}.{}
GROUP BY product_category 
ORDER BY count_star_rating DESC
""".format(database_name, table_name)

print(statement)


SELECT product_category, COUNT(star_rating) AS count_star_rating 
FROM dsoaws.amazon_reviews_parquet
GROUP BY product_category 
ORDER BY count_star_rating DESC



In [13]:
cursor = connect(region_name=region_name, s3_staging_dir=s3_staging_dir).cursor()
cursor.execute(statement)

# Load query results into Pandas DataFrame and show results
df = as_pandas(cursor)
df.head()

Unnamed: 0,product_category,count_star_rating
0,Books,19531329
1,Digital_Ebook_Purchase,17622415
2,Wireless,9002021
3,PC,6908554
4,Home,6221559


In [None]:
# TODO: Visualization
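
A bar chart of review counts is one option here. The following sketch assumes the five rows from the output above (held in a hypothetical `df_counts`) and scales the counts to millions so the axis stays readable; with the real query result you would use `df` directly.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Sample rows copied from the df.head() output above (assumption: the full
# query result would normally be used).
df_counts = pd.DataFrame({
    "product_category": ["Books", "Digital_Ebook_Purchase", "Wireless", "PC", "Home"],
    "count_star_rating": [19531329, 17622415, 9002021, 6908554, 6221559],
})

# Express counts in millions so the axis labels stay readable
df_counts["count_millions"] = df_counts["count_star_rating"] / 1e6

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(df_counts["product_category"], df_counts["count_millions"])
ax.set_ylabel("Number of reviews (millions)")
ax.set_title("Number of reviews per product category")
ax.tick_params(axis="x", labelrotation=45)  # long category names overlap otherwise
fig.tight_layout()
```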

### 3. When did each product category become available in the catalog based on the date of the first review?

In [14]:
# SQL statement -- CHRIS (treats review_date as a day offset from 1970-01-01)
statement = """
SELECT product_category, MIN(DATE_FORMAT(DATE_ADD('day', review_date, DATE_PARSE('1970-01-01','%Y-%m-%d')), '%Y-%m-%d')) AS first_review_date 
FROM {}.{}
GROUP BY product_category
ORDER BY first_review_date
""".format(database_name, table_name)

print(statement)


SELECT product_category, MIN(DATE_FORMAT(DATE_ADD('day', review_date, DATE_PARSE('1970-01-01','%Y-%m-%d')), '%Y-%m-%d')) AS first_review_date 
FROM dsoaws.amazon_reviews_parquet
GROUP BY product_category
ORDER BY first_review_date



In [None]:
cursor = connect(region_name=region_name, s3_staging_dir=s3_staging_dir).cursor()
cursor.execute(statement)

# Load query results into Pandas DataFrame and show results
df = as_pandas(cursor)
df.head()

In [16]:
# SQL statement -- ANTJE (uses review_date directly, without conversion)
statement = """
SELECT product_category, MIN(review_date) AS first_review_date
FROM {}.{}
GROUP BY product_category
ORDER BY first_review_date 
""".format(database_name, table_name)

print(statement)


SELECT product_category, MIN(review_date) AS first_review_date
FROM dsoaws.amazon_reviews_parquet
GROUP BY product_category
ORDER BY first_review_date 



In [17]:
cursor = connect(region_name=region_name, s3_staging_dir=s3_staging_dir).cursor()
cursor.execute(statement)

# Load query results into Pandas DataFrame and show results
df = as_pandas(cursor)
df.head()

Unnamed: 0,product_category,first_review_date
0,Books,1995-06-24
1,Video,1995-11-11
2,Music,1995-11-11
3,Video DVD,1996-07-08
4,Toys,1997-01-05


In [None]:
# TODO: Visualization
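
The first-review dates can be laid out on a simple timeline. This is a minimal sketch under the assumption that the dates arrive as strings or date objects; `pd.to_datetime` normalizes both. The DataFrame `df_first` is a hypothetical stand-in built from the five rows shown above.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Sample rows copied from the df.head() output above
df_first = pd.DataFrame({
    "product_category": ["Books", "Video", "Music", "Video DVD", "Toys"],
    "first_review_date": ["1995-06-24", "1995-11-11", "1995-11-11", "1996-07-08", "1997-01-05"],
})

# Convert to datetime so matplotlib draws a proper time axis
df_first["first_review_date"] = pd.to_datetime(df_first["first_review_date"])

fig, ax = plt.subplots(figsize=(8, 3))
ax.scatter(df_first["first_review_date"].values, df_first["product_category"].values)
ax.set_xlabel("Date of first review")
ax.set_ylabel("Product category")
ax.set_title("First review date per product category")
fig.autofmt_xdate()  # tilt the date labels to avoid overlap
fig.tight_layout()
```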

### 4. What is the breakdown of ratings (1-5) per product category?  


In [18]:
# SQL statement 
statement = """
SELECT product_category,
         star_rating,
         COUNT(*) AS count_reviews
FROM {}.{}
GROUP BY  product_category, star_rating
ORDER BY  product_category ASC, star_rating DESC, count_reviews
""".format(database_name, table_name)

print(statement)


SELECT product_category,
         star_rating,
         COUNT(*) AS count_reviews
FROM dsoaws.amazon_reviews_parquet
GROUP BY  product_category, star_rating
ORDER BY  product_category ASC, star_rating DESC, count_reviews



In [19]:
cursor = connect(region_name=region_name, s3_staging_dir=s3_staging_dir).cursor()
cursor.execute(statement)

# Load query results into Pandas DataFrame and show results
df = as_pandas(cursor)
df.head()

Unnamed: 0,product_category,star_rating,count_reviews
0,Apparel,5,3320566
1,Apparel,4,1147237
2,Apparel,3,623471
3,Apparel,2,369601
4,Apparel,1,445458


In [None]:
# TODO: Visualization

#### With this information, we can also quickly group by star ratings and count the reviews for each rating (5, 4, 3, 2, 1): 

In [20]:
# SQL statement 
statement = """
SELECT star_rating,
         COUNT(*) AS count_reviews
FROM {}.{}
GROUP BY  star_rating
ORDER BY  star_rating DESC, count_reviews 
""".format(database_name, table_name)

print(statement)


SELECT star_rating,
         COUNT(*) AS count_reviews
FROM dsoaws.amazon_reviews_parquet
GROUP BY  star_rating
ORDER BY  star_rating DESC, count_reviews 



In [21]:
cursor = connect(region_name=region_name, s3_staging_dir=s3_staging_dir).cursor()
cursor.execute(statement)

# Load query results into Pandas DataFrame and show results
df = as_pandas(cursor)
df.head()

Unnamed: 0,star_rating,count_reviews
0,5,93200812
1,4,26223470
2,3,12133927
3,2,7304430
4,1,12099639


In [None]:
# TODO: Visualization

#### Stacked percentage horizontal bar plot showing proportion of star ratings per product category

In [None]:
# See 01_notebook

In [None]:
# TODO: Visualization
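
A stacked percentage bar can be derived from the per-category breakdown by pivoting the counts and dividing each row by its total. The sketch below assumes only the five Apparel rows shown in the earlier output (held in a hypothetical `df_dist`); with the full query result, every category would get its own bar.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Apparel rows copied from the output of the per-category breakdown above
df_dist = pd.DataFrame({
    "product_category": ["Apparel"] * 5,
    "star_rating": [5, 4, 3, 2, 1],
    "count_reviews": [3320566, 1147237, 623471, 369601, 445458],
})

# One row per category, one column per star rating
pivot = df_dist.pivot(index="product_category", columns="star_rating", values="count_reviews")

# Convert counts to row-wise percentages and order the columns from 5 stars down to 1
pct = pivot.div(pivot.sum(axis=1), axis=0) * 100
pct = pct[[5, 4, 3, 2, 1]]

ax = pct.plot(kind="barh", stacked=True, figsize=(8, 2.5))
ax.set_xlabel("Share of reviews (%)")
ax.set_ylabel("Product category")
ax.set_title("Star rating distribution per product category")
ax.legend(title="Star rating", bbox_to_anchor=(1.02, 1), loc="upper left")
plt.tight_layout()
```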

### 5. Which star ratings (1-5) are the most helpful?

In [22]:
# SQL statement 
statement = """
SELECT star_rating,
         AVG(helpful_votes) AS avg_helpful_votes
FROM {}.{}
GROUP BY  star_rating
ORDER BY  star_rating DESC
""".format(database_name, table_name)

print(statement)


SELECT star_rating,
         AVG(helpful_votes) AS avg_helpful_votes
FROM dsoaws.amazon_reviews_parquet
GROUP BY  star_rating
ORDER BY  star_rating DESC



In [23]:
cursor = connect(region_name=region_name, s3_staging_dir=s3_staging_dir).cursor()
cursor.execute(statement)

# Load query results into Pandas DataFrame and show results
df = as_pandas(cursor)
df.head()

Unnamed: 0,star_rating,avg_helpful_votes
0,5,1.672698
1,4,1.678697
2,3,2.04809
3,2,2.506635
4,1,3.684641


In [None]:
# TODO: Visualization
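
A plain bar chart makes the pattern in the output above easy to see: in these results, lower star ratings collect more helpful votes on average. The sketch assumes the five rows shown above (in a hypothetical `df_helpful`).

```python
import pandas as pd
import matplotlib.pyplot as plt

# Rows copied from the df.head() output above
df_helpful = pd.DataFrame({
    "star_rating": [5, 4, 3, 2, 1],
    "avg_helpful_votes": [1.672698, 1.678697, 2.04809, 2.506635, 3.684641],
})

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(df_helpful["star_rating"], df_helpful["avg_helpful_votes"])
ax.set_xticks([1, 2, 3, 4, 5])
ax.set_xlabel("Star rating")
ax.set_ylabel("Average helpful votes")
ax.set_title("Average helpful votes per star rating")
fig.tight_layout()

# Star rating with the highest average number of helpful votes
most_helpful = df_helpful.loc[df_helpful["avg_helpful_votes"].idxmax(), "star_rating"]
print(most_helpful)
```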

### 6. Which products have the most helpful reviews?  How long are those reviews?

In [24]:
# SQL statement 
statement = """
SELECT product_title,
         helpful_votes,
         star_rating,
         LENGTH(review_body) AS review_body_length,
         SUBSTR(review_body, 1, 100) AS review_body_substr
FROM {}.{}
ORDER BY helpful_votes DESC LIMIT 10 
""".format(database_name, table_name)

print(statement)


SELECT product_title,
         helpful_votes,
         star_rating,
         LENGTH(review_body) AS review_body_length,
         SUBSTR(review_body, 1, 100) AS review_body_substr
FROM dsoaws.amazon_reviews_parquet
ORDER BY helpful_votes DESC LIMIT 10 



In [25]:
cursor = connect(region_name=region_name, s3_staging_dir=s3_staging_dir).cursor()
cursor.execute(statement)

# Load query results into Pandas DataFrame and show results
df = as_pandas(cursor)
df.head()

Unnamed: 0,product_title,helpful_votes,star_rating,review_body_length,review_body_substr
0,Kindle: Amazon's Original Wireless Reading Dev...,47524,5,12906,"This is less a \\""pros and cons\\"" review than..."
1,"BIC Cristal For Her Ball Pen, 1.0mm, Black, 16...",41393,5,863,Someone has answered my gentle prayers and FIN...
2,The Mountain Kids 100% Cotton Three Wolf Moon ...,41278,5,1566,This item has wolves on it which makes it intr...
3,"Kindle Keyboard 3G, Free 3G + Wi-Fi, 6"" E Ink ...",31924,5,23069,UPDATE NOVEMBER 2011:<br /><br />My review is ...
4,"Kindle Fire HD 7"", Dolby Audio, Dual-Band Wi-Fi",31417,4,13594,I've been an iPad user since the original came...


In [None]:
# TODO: Visualization

### 7. Which 2 products does the same individual both love and hate?

### 8. Which customers have written the most helpful reviews?  
And how many reviews have they written?  
Across how many categories?  What is their average star rating?

In [26]:
# SQL statement 
statement = """
SELECT customer_id,
       ROUND(AVG(helpful_votes),1) AS avg_helpful_votes,
       COUNT(*) AS review_count,
       COUNT(DISTINCT product_category) AS product_category_count,
       ROUND(AVG(star_rating),1) AS avg_star_rating
FROM {}.{}
GROUP BY  customer_id
HAVING count(*) > 100
ORDER BY avg_helpful_votes DESC LIMIT 10;
""".format(database_name, table_name)

print(statement)


SELECT customer_id,
       ROUND(AVG(helpful_votes),1) AS avg_helpful_votes,
       COUNT(*) AS review_count,
       COUNT(DISTINCT product_category) AS product_category_count,
       ROUND(AVG(star_rating),1) AS avg_star_rating
FROM dsoaws.amazon_reviews_parquet
GROUP BY  customer_id
HAVING count(*) > 100
ORDER BY avg_helpful_votes DESC LIMIT 10;



In [27]:
cursor = connect(region_name=region_name, s3_staging_dir=s3_staging_dir).cursor()
cursor.execute(statement)

# Load query results into Pandas DataFrame and show results
df = as_pandas(cursor)
df.head()

Unnamed: 0,customer_id,avg_helpful_votes,review_count,product_category_count,avg_star_rating
0,53025525,294.9,143,18,4.4
1,50629044,258.2,105,24,3.8
2,52621867,169.7,105,23,3.4
3,43346653,158.8,103,7,4.1
4,42793407,145.3,133,25,4.1


In [None]:
# TODO: Visualization

### 9. What is the ratio of positive to negative reviews?

In [30]:
# SQL statement 
statement = """
SELECT (CAST(positive_review_count AS DOUBLE) / CAST(negative_review_count AS DOUBLE)) AS positive_to_negative_sentiment_ratio
FROM (
  SELECT count(*) AS positive_review_count
  FROM {}.{}
  WHERE star_rating >= 4
), (
  SELECT count(*) AS negative_review_count
  FROM {}.{}
  WHERE star_rating < 4
)
""".format(database_name, table_name, database_name, table_name)

print(statement)


SELECT (CAST(positive_review_count AS DOUBLE) / CAST(negative_review_count AS DOUBLE)) AS positive_to_negative_sentiment_ratio
FROM (
  SELECT count(*) AS positive_review_count
  FROM dsoaws.amazon_reviews_parquet
  WHERE star_rating >= 4
), (
  SELECT count(*) AS negative_review_count
  FROM dsoaws.amazon_reviews_parquet
  WHERE star_rating < 4
)



In [31]:
cursor = connect(region_name=region_name, s3_staging_dir=s3_staging_dir).cursor()
cursor.execute(statement)

# Load query results into Pandas DataFrame and show results
df = as_pandas(cursor)
df.head()

Unnamed: 0,positive_to_negative_sentiment_ratio
0,3.786679


In [None]:
# TODO: Visualization
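
As a sanity check, the ratio can be recomputed from the per-star counts returned earlier for the overall rating breakdown. The snippet below is a small worked calculation, assuming those five counts and the same definition used in the query (4 and 5 stars are positive, 1 to 3 stars are negative); the variable names are illustrative only.

```python
# Review counts per star rating, copied from the overall star_rating breakdown above
count_by_rating = {5: 93200812, 4: 26223470, 3: 12133927, 2: 7304430, 1: 12099639}

# Same split as the SQL: star_rating >= 4 is positive, star_rating < 4 is negative
positive = sum(n for stars, n in count_by_rating.items() if stars >= 4)
negative = sum(n for stars, n in count_by_rating.items() if stars < 4)

ratio = positive / negative
print(f"positive={positive}, negative={negative}, ratio={ratio:.6f}")
```

The result agrees with the 3.786679 returned by Athena, which suggests the two queries are consistent with each other.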

### 10. How have ratings changed over time?  Have reviewers become more or less critical over the years?

In [35]:
# SQL statement 
statement = """
SELECT year, ROUND(AVG(star_rating),4) AS avg_rating
FROM {}.{}
GROUP BY year
ORDER BY year
""".format(database_name, table_name)

print(statement)


SELECT year, ROUND(AVG(star_rating),4) AS avg_rating
FROM dsoaws.amazon_reviews_parquet
GROUP BY year
ORDER BY year



In [36]:
cursor = connect(region_name=region_name, s3_staging_dir=s3_staging_dir).cursor()
cursor.execute(statement)

# Load query results into Pandas DataFrame and show results
df = as_pandas(cursor)
df.head()

Unnamed: 0,year,avg_rating
0,1995,4.6169
1,1996,4.6003
2,1997,4.4344
3,1998,4.3607
4,1999,4.2819


In [None]:
# TODO: Visualization
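
A line chart is a natural fit for a yearly trend. The sketch below assumes only the five years shown in the output above (in a hypothetical `df_year`); the full result would extend the line to later years.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Rows copied from the df.head() output above
df_year = pd.DataFrame({
    "year": [1995, 1996, 1997, 1998, 1999],
    "avg_rating": [4.6169, 4.6003, 4.4344, 4.3607, 4.2819],
})

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(df_year["year"], df_year["avg_rating"], marker="o")
ax.set_xticks(df_year["year"].tolist())  # one tick per year, no fractional years
ax.set_xlabel("Year")
ax.set_ylabel("Average star rating")
ax.set_title("Average star rating per year")
fig.tight_layout()
```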

### 11. Which customers are abusing the review system by reviewing the same product more than once?  What was their average star rating for each product?

In [40]:
# SQL statement 
statement = """
SELECT customer_id, product_category, product_title, 
ROUND(AVG(star_rating),4) AS avg_star_rating, COUNT(*) AS review_count 
FROM {}.{} 
GROUP BY customer_id, product_category, product_title 
HAVING COUNT(*) > 1 
ORDER BY review_count DESC
LIMIT 5
""".format(database_name, table_name)

print(statement)


SELECT customer_id, product_category, product_title, 
ROUND(AVG(star_rating),4) AS avg_star_rating, COUNT(*) AS review_count 
FROM dsoaws.amazon_reviews_parquet 
GROUP BY customer_id, product_category, product_title 
HAVING COUNT(*) > 1 
ORDER BY review_count DESC
LIMIT 5



In [41]:
cursor = connect(region_name=region_name, s3_staging_dir=s3_staging_dir).cursor()
cursor.execute(statement)

# Load query results into Pandas DataFrame and show results
df = as_pandas(cursor)
df.head()

Unnamed: 0,customer_id,product_category,product_title,avg_star_rating,review_count
0,38118182,Video DVD,Pearl Harbor,4.2308,130
1,33132919,Video DVD,Shania Twain - Up (Live in Chicago),5.0,110
2,52895956,Books,Frankenstein,3.0215,93
3,32330663,Music,Reprise,5.0,82
4,43622173,Music,Sinner,4.9756,82


In [None]:
# TODO: Visualization