# Visualize Amazon Customer Reviews Dataset - Part 2

### Dataset columns:

- `marketplace`: 2-letter country code (in this case all "US").
- `customer_id`: Random identifier that can be used to aggregate reviews written by a single author.
- `review_id`: A unique ID for the review.
- `product_id`: The Amazon Standard Identification Number (ASIN).  `http://www.amazon.com/dp/<ASIN>` links to the product's detail page.
- `product_parent`: The parent of that ASIN.  Multiple ASINs (color or format variations of the same product) can roll up into a single parent.
- `product_title`: Title of the product.
- `product_category`: Broad product category that can be used to group reviews (in this case digital videos).
- `star_rating`: The review's rating (1 to 5 stars).
- `helpful_votes`: Number of helpful votes for the review.
- `total_votes`: Number of total votes the review received.
- `vine`: Was the review written as part of the [Vine](https://www.amazon.com/gp/vine/help) program?
- `verified_purchase`: Was the review from a verified purchase?
- `review_headline`: The title of the review itself.
- `review_body`: The text of the review.
- `review_date`: The date the review was written.

In [2]:
%%bash
pip install -q --upgrade pip
pip install -q pandas==0.23.0
pip install -q numpy==1.14.3
pip install -q matplotlib==3.0.3
pip install -q seaborn==0.8.1
pip install -q PyAthena==1.8.0

In [3]:
# Imports & Settings

import boto3
import sagemaker

import numpy as np
import pandas as pd
import seaborn as sns

import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format='retina'

# Get region 
session = boto3.session.Session()
region_name = session.region_name

# Get SageMaker session & default S3 bucket
sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()

# Set Athena database & table 
database_name = 'dsoaws'
table_name = 'amazon_reviews_parquet'
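
The setup above installs PyAthena and names the Athena database and table, while the query cells only build and print SQL strings. A minimal sketch of how such a string could be executed is shown here. The helper names `athena_staging_dir` and `run_athena_query` and the `athena/staging` prefix are assumptions made for illustration and are not part of the original notebook; the only library calls used are `pyathena.connect` and `pandas.read_sql`.

```python
def athena_staging_dir(bucket, prefix="athena/staging"):
    """Build the S3 URI where Athena stores query results.

    The default prefix is hypothetical; use any prefix Athena can write to.
    """
    return "s3://{}/{}".format(bucket, prefix)


def run_athena_query(statement, region_name, s3_staging_dir):
    """Execute a SQL statement on Athena and return a pandas DataFrame."""
    # Imported lazily so the helper can be defined without AWS access.
    import pandas as pd
    from pyathena import connect

    conn = connect(region_name=region_name, s3_staging_dir=s3_staging_dir)
    return pd.read_sql(statement, conn)
```

With the variables defined in the settings cell, a call might look like `df = run_athena_query(statement, region_name, athena_staging_dir(bucket))`.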

### 1. Which product categories are the highest rated by average rating?

In [8]:
# SQL statement
statement = """
SELECT product_category, AVG(star_rating) AS avg_star_rating
FROM {}.{} 
GROUP BY product_category 
ORDER BY avg_star_rating DESC
""".format(database_name, table_name)

print(statement)


SELECT product_category, AVG(star_rating) AS avg_star_rating
FROM dsoaws.amazon_reviews_parquet 
GROUP BY product_category 
ORDER BY avg_star_rating DESC



In [9]:
# TODO: Visualization
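
One possible way to fill in this visualization is a horizontal bar chart with one bar per category. The sketch assumes the query result has already been loaded into a pandas DataFrame `df` with the columns `product_category` and `avg_star_rating`. The function names, the figure sizing rule, and the bar color are illustrative choices, not the authors' design.

```python
def figure_height(num_categories, inches_per_row=0.25, minimum=4.0):
    """Scale the figure height with the number of bars so labels stay readable."""
    return max(minimum, num_categories * inches_per_row)


def plot_avg_star_rating(df):
    """Draw average star rating per product category as horizontal bars.

    df is assumed to hold the query result, already sorted by the
    ORDER BY clause (highest average rating first).
    """
    # Imported lazily so the sizing helper works without plotting libraries.
    import matplotlib.pyplot as plt
    import seaborn as sns

    fig, ax = plt.subplots(figsize=(10, figure_height(len(df))))
    sns.barplot(x="avg_star_rating", y="product_category",
                data=df, color="steelblue", ax=ax)
    ax.set_xlabel("Average star rating")
    ax.set_ylabel("Product category")
    ax.set_title("Average star rating by product category")
    return ax
```

Horizontal bars keep long category names legible, and because the query sorts by `avg_star_rating DESC`, the highest-rated category appears at the top.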

### 2. Which product categories have the most reviews?

In [10]:
# SQL statement
statement = """
SELECT product_category, COUNT(star_rating) AS count_star_rating 
FROM {}.{}
GROUP BY product_category 
ORDER BY count_star_rating DESC
""".format(database_name, table_name)

print(statement)


SELECT product_category, COUNT(star_rating) AS count_star_rating 
FROM dsoaws.amazon_reviews_parquet
GROUP BY product_category 
ORDER BY count_star_rating DESC



In [None]:
# TODO: Visualization
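
A sketch of one way to visualize the review counts follows. It assumes a pandas DataFrame `df` with the columns `product_category` and `count_star_rating` from the query above. Review counts can span several orders of magnitude, so the hypothetical helper `format_count` abbreviates the axis labels; both function names are assumptions for illustration.

```python
def format_count(value):
    """Abbreviate a count, e.g. 1500000 -> '1.5M' and 12000 -> '12K'."""
    if value >= 1000000:
        return "{:.1f}M".format(value / 1000000.0)
    if value >= 1000:
        return "{:.0f}K".format(value / 1000.0)
    return str(int(value))


def plot_review_counts(df):
    """Draw the number of reviews per product category as horizontal bars."""
    # Imported lazily so format_count works without plotting libraries.
    import matplotlib.pyplot as plt
    import seaborn as sns
    from matplotlib.ticker import FuncFormatter

    fig, ax = plt.subplots(figsize=(10, 10))
    sns.barplot(x="count_star_rating", y="product_category",
                data=df, color="steelblue", ax=ax)
    # Show abbreviated counts on the x-axis instead of raw integers.
    ax.xaxis.set_major_formatter(FuncFormatter(lambda v, pos: format_count(v)))
    ax.set_xlabel("Number of reviews")
    ax.set_ylabel("Product category")
    ax.set_title("Number of reviews by product category")
    return ax
```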

### 3. When did each product category become available in the catalog based on the date of the first review?

In [12]:
# SQL statement (Chris's variant): converts review_date from days since 1970-01-01 to a 'YYYY-MM-DD' string
statement = """
SELECT product_category, MIN(DATE_FORMAT(DATE_ADD('day', review_date, DATE_PARSE('1970-01-01','%Y-%m-%d')), '%Y-%m-%d')) AS first_review_date 
FROM {}.{}
GROUP BY product_category
ORDER BY first_review_date
""".format(database_name, table_name)

print(statement)


SELECT product_category, MIN(DATE_FORMAT(DATE_ADD('day', review_date, DATE_PARSE('1970-01-01','%Y-%m-%d')), '%Y-%m-%d')) AS first_review_date 
FROM dsoaws.amazon_reviews_parquet
GROUP BY product_category
ORDER BY first_review_date
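
The query above adds `review_date` as a number of days to 1970-01-01 and formats the result as `YYYY-MM-DD`, which implies that the column is being read as a day offset from the Unix epoch. The same conversion can be sketched in plain Python to check individual values; the function name is hypothetical.

```python
from datetime import date, timedelta


def days_since_epoch_to_date(days):
    """Convert a day offset from 1970-01-01 into a 'YYYY-MM-DD' string."""
    return (date(1970, 1, 1) + timedelta(days=days)).strftime("%Y-%m-%d")


print(days_since_epoch_to_date(0))      # 1970-01-01
print(days_since_epoch_to_date(10957))  # 2000-01-01
```

Because the formatted strings are zero-padded and ordered year, month, day, taking `MIN` over them gives the same result as taking the earliest date.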



In [11]:
# SQL statement (Antje's variant): takes MIN of review_date as stored, without conversion
statement = """
SELECT product_category, MIN(review_date) AS first_review_date
FROM {}.{}
GROUP BY product_category
ORDER BY first_review_date 
""".format(database_name, table_name)

print(statement)


SELECT product_category, MIN(review_date) AS first_review_date
FROM dsoaws.amazon_reviews_parquet
GROUP BY product_category
ORDER BY first_review_date 



In [5]:
# TODO: Visualization
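
One way this visualization could be approached is to count how many categories received their first review in each year. The sketch assumes a pandas DataFrame `df` whose `first_review_date` column holds `YYYY-MM-DD` strings, as produced by the `DATE_FORMAT` variant of the query; if the column holds raw values instead, they would need converting first. Both function names are assumptions for illustration.

```python
from collections import Counter


def categories_per_year(first_review_dates):
    """Count how many categories have their first review in each year.

    Expects an iterable of 'YYYY-MM-DD' strings, one per category.
    """
    return Counter(int(d[:4]) for d in first_review_dates)


def plot_categories_per_year(df):
    """Draw the number of categories first reviewed in each year."""
    # Imported lazily so the counting helper works without matplotlib.
    import matplotlib.pyplot as plt

    counts = categories_per_year(df["first_review_date"])
    years = sorted(counts)
    fig, ax = plt.subplots(figsize=(10, 4))
    ax.bar(years, [counts[y] for y in years])
    ax.set_xlabel("Year of first review")
    ax.set_ylabel("Number of product categories")
    ax.set_title("Product categories by year of first review")
    return ax
```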

### 4.

### 5. 