# 09a - Amazon Customer Reviews - EDA

In [None]:
!pip install pandas-profiling --quiet

> [pandas-profiling](https://github.com/pandas-profiling/pandas-profiling) generates profile reports from a pandas dataframe

In [None]:
!pip install pyathena --quiet

> [PyAthena](https://github.com/laughingman7743/PyAthena/) is a Python DB API 2.0 (PEP 249) compliant client for Amazon Athena. We will use it with `pandas.read_sql()` to create dataframes for EDA.

In [None]:
from pyathena import connect
import pandas as pd
import pandas_profiling
import seaborn as sns

import sagemaker

sns.set()

## Athena connection

Replace the value of `s3_output` with your bucket name.

In [None]:
# The SageMaker execution IAM role takes care of authentication and authorization
workgroup = 'primary'
s3_output = 's3://athena.out.yourname'

conn = connect(work_group=workgroup, s3_staging_dir=s3_output)

## Exploration

In [None]:
%%time

# Randomly sample 10% of observations
df = pd.read_sql("""
SELECT
  *
FROM reviews.parquet
TABLESAMPLE BERNOULLI (10)
WHERE
  product_category='Baby'
""", conn)

In [None]:
# Check the memory usage
# Adjust the sampling ratio if the dataset is too big for interactive analysis
df.info()
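If the reported memory usage is too high, one rough way to pick a new sampling percentage is to scale the current one by the ratio of the memory budget to the observed usage. The sketch below assumes memory grows roughly linearly with the number of rows; the function name and the example numbers are illustrative, not measurements from this dataset.

```python
def suggest_sample_percent(observed_bytes, current_percent, budget_bytes):
    """Scale the sampling percentage so the dataframe fits the memory budget."""
    if observed_bytes <= 0:
        raise ValueError("observed_bytes must be positive")
    scaled = current_percent * budget_bytes / observed_bytes
    # A sampling percentage cannot exceed 100
    return min(100.0, scaled)


# Hypothetical numbers: a 10% sample used 2 GiB, and the budget is 512 MiB
new_percent = suggest_sample_percent(2 * 1024**3, 10, 512 * 1024**2)
print(new_percent)  # 2.5
```

In the notebook, `df.memory_usage(deep=True).sum()` could supply `observed_bytes`, since it also counts the memory held by string columns.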

### Profile report

In [None]:
# requires pandas-profiling
df.profile_report()

## Prepare the dataset for Autopilot

In [None]:
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = sagemaker_session.default_bucket()
prefix = 'autopilot'
s3_output = f's3://{bucket}/{prefix}/'

print(s3_output)

The following query takes about 30 seconds to run in Athena and about 7 minutes to read into a dataframe.

In [None]:
%%time

df2 = pd.read_sql("""
SELECT
  star_rating,
  review_body
FROM reviews.parquet
WHERE
  product_category='Baby'
AND
  review_body IS NOT NULL
""", conn)

Class imbalance is ignored for now.
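Before setting class imbalance aside, it can help to see how uneven the star ratings are. The following is a minimal sketch that computes each rating's share of the total; the helper `class_shares` and the sample ratings are hypothetical, not values from the reviews table.

```python
from collections import Counter


def class_shares(labels):
    """Return each label's share of the total, sorted by label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: counts[label] / total for label in sorted(counts)}


# Hypothetical ratings; in the notebook, pass df2["star_rating"] instead
ratings = [5, 5, 5, 4, 5, 1, 3, 5, 4, 2]
shares = class_shares(ratings)
print(shares)
```

With pandas, `df2["star_rating"].value_counts(normalize=True)` gives the same shares directly.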

In [None]:
# Autopilot expects a single CSV file, with header, without quotes
df2.to_csv('reviews.csv', index=False, header=True)
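Note that `to_csv` uses minimal quoting by default, so a review that contains a comma, a quote character, or a line break is still wrapped in quotes. The sketch below shows that behavior with the standard-library `csv` module and the same `QUOTE_MINIMAL` setting; the rows are made up for illustration and the helper `write_rows` is not part of the notebook.

```python
import csv
import io


def write_rows(rows, header):
    """Write rows to a CSV string using minimal quoting."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical reviews: the second one contains a comma
sample_rows = [(5, "Great stroller"), (2, "Flimsy, broke in a week")]
csv_text = write_rows(sample_rows, ["star_rating", "review_body"])
print(csv_text)
```

Only the field with the embedded comma is quoted, which is worth keeping in mind when checking the file against the "without quotes" expectation.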

<font color=orange>Copy</font> the dataset location

In [None]:
# Upload the file to S3; the call returns the S3 URI of the uploaded object
sagemaker_session.upload_data(path='reviews.csv', key_prefix='data')
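The value returned by `upload_data` is the S3 URI of the uploaded file. Since no bucket is passed here, the file goes to the session's default bucket under the `data` prefix. A small sketch of how that location is expected to be composed, assuming a hypothetical bucket name (the helper `expected_s3_uri` is illustrative and not part of the SageMaker SDK):

```python
import posixpath


def expected_s3_uri(bucket, key_prefix, local_path):
    """Compose the S3 URI for a local file uploaded under key_prefix."""
    name = posixpath.basename(local_path)
    return f"s3://{bucket}/{key_prefix}/{name}"


# Hypothetical bucket name; the real one comes from sagemaker_session.default_bucket()
uri = expected_s3_uri("sagemaker-us-east-1-123456789012", "data", "reviews.csv")
print(uri)
```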