# Explore Amazon Customer Reviews Dataset with Spark and AWS Glue Interactive Sessions
In this notebook, we query a subset of reviews covering the Digital Software, Digital Video Games, and Gift Card product categories. We also show the results for the entire dataset.

# Dataset Column Descriptions

- `marketplace`: 2-letter country code (in this case all "US").
- `customer_id`: Random identifier that can be used to aggregate reviews written by a single author.
- `review_id`: A unique ID for the review.
- `product_id`: The Amazon Standard Identification Number (ASIN).  `http://www.amazon.com/dp/<ASIN>` links to the product's detail page.
- `product_parent`: The parent of that ASIN.  Multiple ASINs (color or format variations of the same product) can roll up into a single parent.
- `product_title`: Title description of the product.
- `product_category`: Broad product category that can be used to group reviews (in this notebook: `Digital_Software`, `Digital_Video_Games`, and `Gift_Card`).
- `star_rating`: The review's rating (1 to 5 stars).
- `helpful_votes`: Number of helpful votes for the review.
- `total_votes`: Number of total votes the review received.
- `vine`: Was the review written as part of the [Vine](https://www.amazon.com/gp/vine/help) program?
- `verified_purchase`: Was the review from a verified purchase?
- `review_headline`: The title of the review itself.
- `review_body`: The text of the review.
- `review_date`: The date the review was written.
- `year`: The year derived from the review date.
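
In the sample query output later in this notebook, `review_date` appears as an integer (for example `13923`) in a row whose `year` is `2008`. One plausible reading, stated here as an assumption rather than a documented property of the table, is that the Parquet column stores the date as a count of days since the Unix epoch. A minimal sketch of that conversion, using a hypothetical helper named `days_to_date`:

```python
from datetime import date, timedelta

UNIX_EPOCH = date(1970, 1, 1)


def days_to_date(days_since_epoch: int) -> date:
    """Convert a day count since 1970-01-01 into a calendar date.

    Assumption: review_date is encoded as days since the Unix epoch.
    """
    return UNIX_EPOCH + timedelta(days=days_since_epoch)


# 13923 is the review_date value shown in the sample output of this notebook.
review_date = days_to_date(13923)
print(review_date.isoformat(), review_date.year)
```

If the assumption holds, the derived year agrees with the `year` column in the same row.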

In [None]:
import psutil

notebook_memory = psutil.virtual_memory()

# psutil reports total memory in bytes; require at least 32 GB
if notebook_memory.total < 32 * 1000 * 1000 * 1000:
    print('*******************************************')    
    print('YOU ARE NOT USING THE CORRECT INSTANCE TYPE')
    print('PLEASE CHANGE INSTANCE TYPE TO  m5.2xlarge ')
    print('*******************************************')
else:
    correct_instance_type=True
    print(notebook_memory)
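
Because `psutil.virtual_memory().total` is reported in bytes, a 32 GB threshold has to be scaled by three factors of 1000 before the comparison. The sketch below isolates that arithmetic in a hypothetical helper, `has_enough_memory`, and assumes the threshold is meant as decimal gigabytes; the byte counts passed in are illustrative, not measurements.

```python
REQUIRED_GB = 32  # assumed threshold, in decimal gigabytes


def has_enough_memory(total_bytes: int, required_gb: int = REQUIRED_GB) -> bool:
    """Return True when total_bytes is at least required_gb gigabytes."""
    return total_bytes >= required_gb * 1000**3


# Illustrative values: a 16 GiB machine fails the check, a ~33 GB machine passes.
print(has_enough_memory(16 * 1024**3))
print(has_enough_memory(33_000_000_000))
```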

In [3]:
import numpy as np
import pandas as pd
import seaborn as sns

import matplotlib.pyplot as plt

%matplotlib inline
%config InlineBackend.figure_format='retina'

In [5]:
%store -r ingest_create_athena_table_parquet_passed

In [6]:
try:
    ingest_create_athena_table_parquet_passed
except NameError:
    print("++++++++++++++++++++++++++++++++++++++++++++++")
    print("[ERROR] YOU HAVE TO RUN ALL PREVIOUS NOTEBOOKS")
    print("You did not convert into Parquet data.        ")
    print("++++++++++++++++++++++++++++++++++++++++++++++")

In [7]:
print(ingest_create_athena_table_parquet_passed)

In [8]:
if not ingest_create_athena_table_parquet_passed:
    print("++++++++++++++++++++++++++++++++++++++++++++++")
    print("[ERROR] YOU HAVE TO RUN ALL PREVIOUS NOTEBOOKS")
    print("You did not convert into Parquet data.        ")
    print("++++++++++++++++++++++++++++++++++++++++++++++")
else:
    print("[OK]")

In [10]:
%stop_session

Stopping session: 2ca8c9cf-0379-4942-83f1-04b93f69363c
Stopped session.


In [13]:
%additional_python_modules seaborn
%number_of_workers 5

Additional python modules to be included:
seaborn
Previous number of workers: 5
Setting new number of workers to: 5


In [1]:
spark

Authenticating with environment variables and user-defined glue_role_arn: arn:aws:iam::079002598131:role/service-role/AmazonSageMaker-ExecutionRole-20220804T150518
Trying to create a Glue session for the kernel.
Worker Type: G.1X
Number of Workers: 5
Session ID: 9aa91879-7874-41b9-a177-aaee939f4759
Job Type: glueetl
Applying the following default arguments:
--glue_kernel_version 0.37.2
--enable-glue-datacatalog true
--additional-python-modules seaborn
Waiting for session 9aa91879-7874-41b9-a177-aaee939f4759 to get into ready status...
Session 9aa91879-7874-41b9-a177-aaee939f4759 has been created.
<pyspark.sql.session.SparkSession object at 0x7f1d04fb4e10>


In [2]:
database_name = "default"
table_name = "amazon_reviews_parquet"




In [3]:
product_category = "Digital_Software"

df = spark.sql("""
    SELECT * FROM {}.{}
    WHERE product_category = '{}' LIMIT 100
""".format(database_name, table_name, product_category)
)

df.show()


+-----------+-----------+--------------+----------+--------------+--------------------+-----------+-------------+-----------+----+-----------------+--------------------+--------------------+-----------+----+----------------+
|marketplace|customer_id|     review_id|product_id|product_parent|       product_title|star_rating|helpful_votes|total_votes|vine|verified_purchase|     review_headline|         review_body|review_date|year|product_category|
+-----------+-----------+--------------+----------+--------------+--------------------+-----------+-------------+-----------+----+-----------------+--------------------+--------------------+-----------+----+----------------+
|         US|   52227601|R3I00Z3RS1192X|B000YMR5X4|     234295632|TurboTax Premier ...|          5|            4|          5|   N|                N|Great download ex...|I have used used ...|      13923|2008|Digital_Software|
|         US|   50526871|R1JKRJ049ZHGV9|B008S0IMCC|     534964191| Quicken Deluxe 2013|          4| 

In [4]:
product_category = "Digital_Software"

df = spark.sql("""
    SELECT review_body FROM {}.{}
    WHERE product_category = '{}' LIMIT 100
""".format(database_name, table_name, product_category)
)

df.show()

+--------------------+
|         review_body|
+--------------------+
|This program stat...|
|This is a complic...|
|This the worst pr...|
|I bought Music Ma...|
|I had to get them...|
|I had concerns ab...|
|I use this produc...|
|[[ASIN:B002K7C1HG...|
|I give it a one b...|
|I have been using...|
|Norton Internet S...|
|I just purchased ...|
|Hi. I am getting ...|
|This key card doe...|
|I had a program t...|
|I have been impre...|
|I have tried this...|
|Very used to this...|
|I have been using...|
|Bought this for a...|
+--------------------+
only showing top 20 rows
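
Both queries above build their SQL text with `str.format`, which inserts the category string into the statement verbatim. When the category could come from user input, one cautious option is to validate it against a fixed allowlist before formatting. The sketch below is illustrative only: `build_category_query` and `ALLOWED_CATEGORIES` are hypothetical names, and the allowlist simply mirrors the three categories used in this notebook.

```python
ALLOWED_CATEGORIES = {"Digital_Software", "Digital_Video_Games", "Gift_Card"}


def build_category_query(database_name, table_name, product_category,
                         columns="*", limit=100):
    """Build the SELECT statement used above, rejecting unknown categories."""
    if product_category not in ALLOWED_CATEGORIES:
        raise ValueError("Unexpected product category: {!r}".format(product_category))
    return (
        "SELECT {cols} FROM {db}.{tbl} "
        "WHERE product_category = '{cat}' LIMIT {limit}"
    ).format(
        cols=columns,
        db=database_name,
        tbl=table_name,
        cat=product_category,
        limit=int(limit),  # force an integer so the LIMIT clause stays numeric
    )


statement = build_category_query(
    "default", "amazon_reviews_parquet", "Digital_Software", columns="review_body"
)
print(statement)
```

The resulting string could then be passed to `spark.sql(...)` in the same way as the cells above.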


# Set Seaborn Parameters

In [5]:
import numpy as np
import pandas as pd
import seaborn as sns

import matplotlib.pyplot as plt

%matplotlib inline
%config InlineBackend.figure_format='retina'

sns.set_style("whitegrid")

sns.set(
    rc={
        "font.style": "normal",
        "axes.facecolor": "white",
        "grid.color": ".8",
        "grid.linestyle": "-",
        "figure.facecolor": "white",
        "figure.titlesize": 20,
        "text.color": "black",
        "xtick.color": "black",
        "ytick.color": "black",
        "axes.labelcolor": "black",
        "axes.grid": True,
        "axes.labelsize": 10,
        "xtick.labelsize": 10,
        "font.size": 10,
        "ytick.labelsize": 10,
    }
)

UnknownMagic: unknown magic command 'matplotlib'


# Helper Code to Display Values on Bars

In [None]:
# def show_values_barplot(axs, space):
#     def _show_on_plot(ax):
#         for p in ax.patches:
#             _x = p.get_x() + p.get_width() + float(space)
#             _y = p.get_y() + p.get_height()
#             value = round(float(p.get_width()), 2)
#             ax.text(_x, _y, value, ha="left")

#     if isinstance(axs, np.ndarray):
#         for idx, ax in np.ndenumerate(axs):
#             _show_on_plot(ax)
#     else:
#         _show_on_plot(axs)

# 1. Which Product Categories are Highest Rated by Average Rating?

In [None]:
# %%time

# # SQL statement
# statement = """
#     SELECT product_category, AVG(star_rating) AS avg_star_rating
#     FROM {}.{} 
#     WHERE product_category in ('Digital_Software', 'Gift_Card', 'Digital_Video_Games')
#     GROUP BY product_category 
#     ORDER BY avg_star_rating DESC
# """.format(
#     database_name, table_name
# )

# print(statement)

# df = pd.read_sql(statement, conn)
# df.head(5)

In [5]:
df = spark.sql("""SELECT product_category, AVG(star_rating) AS avg_star_rating 
    FROM {}.{} 
    WHERE product_category in ('Digital_Software', 'Gift_Card', 'Digital_Video_Games')
    GROUP BY product_category 
    ORDER BY avg_star_rating DESC""".format(database_name, table_name)
)
df.show()

+-------------------+------------------+
|   product_category|   avg_star_rating|
+-------------------+------------------+
|          Gift_Card| 4.731363105858364|
|Digital_Video_Games|3.8531262248076406|
|   Digital_Software|3.5393303553935973|
+-------------------+------------------+


In [None]:
# df = spark.read.parquet('s3://amazon-reviews-pds/parquet/')


# from pyspark.sql.functions import avg, desc

# df_avg_rating_per_product_category = df.groupBy('product_category') \
#                                        .agg(avg('star_rating').alias('avg_star_rating')) \
#                                        .sort(desc('avg_star_rating'))
# df_avg_rating_per_product_category.show(50)

In [None]:
# # Store number of categories
# num_categories = df.shape[0]
# print(num_categories)

# # Store average star ratings
# average_star_ratings = df

## Visualization for a Subset of Product Categories

In [None]:
# # Create plot
# barplot = sns.barplot(y="product_category", x="avg_star_rating", data=df, saturation=1)

# if num_categories < 10:
#     sns.set(rc={"figure.figsize": (10.0, 5.0)})

# # Set title and x-axis ticks
# plt.title("Average Rating by Product Category")
# plt.xticks([1, 2, 3, 4, 5], ["1-Star", "2-Star", "3-Star", "4-Star", "5-Star"])

# # Helper code to show actual values after bars
# show_values_barplot(barplot, 0.1)

# plt.xlabel("Average Rating")
# plt.ylabel("Product Category")

# # Export plot if needed
# plt.tight_layout()
# # plt.savefig('avg_ratings_per_category.png', dpi=300)

# # Show graphic
# plt.show(barplot)

## Visualization for All Product Categories
If you ran this same query across all product categories (150+ million reviews), you would see the following visualization:

<img src="img/c5-01.png"  width="80%" align="left">

# 2. Which Product Categories Have the Most Reviews?

In [5]:
df = spark.sql("""
    SELECT product_category, COUNT(star_rating) AS count_star_rating 
    FROM {}.{}
    WHERE product_category in ('Digital_Software', 'Gift_Card', 'Digital_Video_Games')    
    GROUP BY product_category 
    ORDER BY count_star_rating DESC
""".format(database_name, table_name)
)

df.show()

+-------------------+-----------------+
|   product_category|count_star_rating|
+-------------------+-----------------+
|          Gift_Card|           149086|
|Digital_Video_Games|           145431|
|   Digital_Software|           102084|
+-------------------+-----------------+


In [None]:
# # Store counts
# count_ratings = df["count_star_rating"]

# # Store max ratings
# max_ratings = df["count_star_rating"].max()
# print(max_ratings)

TypeError: 'Column' object is not callable
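
The `TypeError` above is what a pandas-style call produces on a Spark DataFrame: `df["count_star_rating"]` is a Spark `Column` expression rather than a pandas `Series`, so `.max()` cannot be called on it. For a result this small, one option is to bring the rows to the driver and work on them in plain Python. The sketch below uses ordinary tuples as stand-ins for the `Row` objects that `df.collect()` would return (a PySpark `Row` is a tuple subclass); the values are copied from the query output above.

```python
# Stand-ins for df.collect(): (product_category, count_star_rating)
rows = [
    ("Gift_Card", 149086),
    ("Digital_Video_Games", 145431),
    ("Digital_Software", 102084),
]

# Store counts
count_ratings = [count for _category, count in rows]

# Store max ratings and number of categories
max_ratings = max(count_ratings)
num_categories = len(rows)
print(max_ratings, num_categories)
```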


## Visualization for a Subset of Product Categories

In [None]:
# # Create Seaborn barplot
# barplot = sns.barplot(y="product_category", x="count_star_rating", data=df, saturation=1)

# if num_categories < 10:
#     sns.set(rc={"figure.figsize": (10.0, 5.0)})

# # Set title
# plt.title("Number of Ratings per Product Category for Subset of Product Categories")

# # Set x-axis ticks to match scale
# if max_ratings > 200000:
#     plt.xticks([100000, 1000000, 5000000, 10000000, 15000000, 20000000], ["100K", "1m", "5m", "10m", "15m", "20m"])
#     plt.xlim(0, 20000000)
# elif max_ratings <= 200000:
#     plt.xticks([50000, 100000, 150000, 200000], ["50K", "100K", "150K", "200K"])
#     plt.xlim(0, 200000)

# plt.xlabel("Number of Ratings")
# plt.ylabel("Product Category")

# plt.tight_layout()

# # Export plot if needed
# # plt.savefig('ratings_per_category.png', dpi=300)

# # Show the barplot
# plt.show(barplot)

## Visualization for All Product Categories
If you ran this same query across all product categories (150+ million reviews), you would see the following visualization:

<img src="img/c5-02.png"  width="80%" align="left">

# 3. When did each product category become available in the Amazon catalog based on the date of the first review?

In [None]:
# SQL statement
df = spark.sql("""
    SELECT product_category, MIN(year) AS first_review_year
    FROM {}.{}
    WHERE product_category in ('Digital_Software', 'Gift_Card', 'Digital_Video_Games')    
    GROUP BY product_category
    ORDER BY first_review_year 
""".format(database_name, table_name)
)

df.show()

In [None]:
# def get_x_y(df):
#     """ Get X and Y coordinates; return tuple """
#     series = df["first_review_year"].value_counts().sort_index()
#     # new_series = series.reindex(range(1,21)).fillna(0).astype(int)
#     return series.index, series.values

In [None]:
# X, Y = get_x_y(df)
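
The commented-out `get_x_y` helper counts how many product categories received their first review in each year. A pandas-free sketch of the same idea is shown below, assuming the query result has been collected into a mapping from category to first review year. The three years used here are read off the per-category, per-year output later in this notebook and serve only as illustrative input.

```python
from collections import Counter


def get_x_y(first_review_years):
    """Return (years, counts): how many categories had their first review in each year."""
    counts = Counter(first_review_years.values())
    years = sorted(counts)
    return years, [counts[year] for year in years]


# Illustrative input: category -> year of its first review
first_review_years = {
    "Gift_Card": 2004,
    "Digital_Video_Games": 2006,
    "Digital_Software": 2008,
}

X, Y = get_x_y(first_review_years)
print(X, Y)
```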

## Visualization for a Subset of Product Categories

In [None]:
# fig = plt.figure(figsize=(12, 5))
# ax = plt.gca()

# ax.set_title("Number Of First Product Category Reviews Per Year for Subset of Categories")
# ax.set_xlabel("Year")
# ax.set_ylabel("Count")

# ax.plot(X, Y, color="black", linewidth=2, marker="o")
# ax.fill_between(X, [0] * len(X), Y, facecolor="lightblue")

# ax.locator_params(integer=True)

# ax.set_xticks(range(1995, 2016, 1))
# ax.set_yticks(range(0, max(Y) + 2, 1))

# plt.xticks(rotation=45)

# # fig.savefig('first_reviews_per_year.png', dpi=300)
# plt.show()

## Visualization for All Product Categories
If you ran this same query across all product categories (150+ million reviews), you would see the following visualization:

<img src="img/c4-04.png"  width="80%" align="left">

# 4. What is the breakdown of ratings (1-5) per product category?  


In [None]:
# SQL statement
df = spark.sql("""
    SELECT product_category, star_rating, COUNT(*) AS count_reviews
    FROM {}.{}
    WHERE product_category in ('Digital_Software', 'Gift_Card', 'Digital_Video_Games')    
    GROUP BY  product_category, star_rating
    ORDER BY  product_category ASC, star_rating DESC, count_reviews
""".format(database_name, table_name)
)

df.show()

+----------------+-----------+-------------+
|product_category|star_rating|count_reviews|
+----------------+-----------+-------------+
|         Apparel|          5|      3320651|
|         Apparel|          4|      1147254|
|         Apparel|          3|       623483|
|         Apparel|          2|       369608|
|         Apparel|          1|       445464|
|      Automotive|          5|      2301688|
|      Automotive|          4|       526898|
|      Automotive|          3|       240023|
|      Automotive|          2|       147843|
|      Automotive|          1|       300024|
|            Baby|          5|      1078545|
|            Baby|          4|       289129|
|            Baby|          3|       150753|
|            Baby|          2|       101427|
|            Baby|          1|       145039|
|          Beauty|          5|      3254946|
|          Beauty|          4|       741443|
|          Beauty|          3|       398405|
|          Beauty|          2|       264029|
|         

## Prepare for Stacked Percentage Horizontal Bar Plot Showing Proportion of Star Ratings per Product Category

In [10]:
# # Create grouped DataFrames by category and by star rating
# grouped_category = df.groupby("product_category")
# grouped_star = df.groupby("star_rating")

# # Create sum of ratings per star rating
# df_sum = df.groupby(["star_rating"]).sum()

# # Calculate total number of star ratings
# total = df_sum["count_reviews"].sum()
# print(total)

AnalysisException: Cannot resolve column name "count_reviews" among (star_rating, sum(star_rating), sum(count_reviews))


In [None]:
# # Create dictionary of product categories and array of star rating distribution per category
# distribution = {}
# count_reviews_per_star = []
# i = 0

# for category, ratings in grouped_category:
#     count_reviews_per_star = []
#     for star in ratings["star_rating"]:
#         count_reviews_per_star.append(ratings.at[i, "count_reviews"])
#         i = i + 1
#     distribution[category] = count_reviews_per_star

# # Check if distribution has been created successfully
# print(distribution)

In [None]:
# # Check if distribution keys are set correctly to product categories
# print(distribution.keys())

In [None]:
# # Check if star rating distributions are set correctly
# print(distribution.items())

In [None]:
# # Sort distribution by average rating per category
# sorted_distribution = {}

# average_star_ratings.iloc[:, 0]
# for index, value in average_star_ratings.iloc[:, 0].items():
#     sorted_distribution[value] = distribution[value]

In [None]:
# df_sorted_distribution_pct = pd.DataFrame(sorted_distribution).transpose().apply(
#     lambda num_ratings: num_ratings/sum(num_ratings)*100, axis=1
# )
# df_sorted_distribution_pct.columns=['5', '4', '3', '2', '1']
# df_sorted_distribution_pct
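
The commented-out cells above turn per-category star counts into percentages for the stacked bar plot. The same calculation can be sketched without pandas, assuming the query result is available as `(product_category, star_rating, count_reviews)` tuples. The function name `rating_distribution_pct` is hypothetical, and the sample rows are copied from the query output shown earlier in this section.

```python
from collections import defaultdict

# Sample rows: (product_category, star_rating, count_reviews)
rows = [
    ("Apparel", 5, 3320651), ("Apparel", 4, 1147254), ("Apparel", 3, 623483),
    ("Apparel", 2, 369608), ("Apparel", 1, 445464),
    ("Automotive", 5, 2301688), ("Automotive", 4, 526898), ("Automotive", 3, 240023),
    ("Automotive", 2, 147843), ("Automotive", 1, 300024),
]


def rating_distribution_pct(rows, stars=(5, 4, 3, 2, 1)):
    """Map each category to its percentage share of reviews per star rating."""
    counts = defaultdict(dict)
    for category, star, count in rows:
        counts[category][star] = count

    distribution = {}
    for category, per_star in counts.items():
        total = sum(per_star.values())
        distribution[category] = [100.0 * per_star.get(star, 0) / total for star in stars]
    return distribution


distribution_pct = rating_distribution_pct(rows)
for category, shares in distribution_pct.items():
    print(category, [round(share, 1) for share in shares])
```

Each list is ordered from 5-star to 1-star, matching the column order used for the plot.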

## Visualization for a Subset of Product Categories

In [None]:
# categories = df_sorted_distribution_pct.index

# # Plot bars
# if len(categories) > 10:
#     plt.figure(figsize=(10,10))
# else: 
#     plt.figure(figsize=(10,5))

# df_sorted_distribution_pct.plot(kind="barh", 
#                                 stacked=True, 
#                                 edgecolor='white',
#                                 width=1.0,
#                                 color=['green', 
#                                        'orange', 
#                                        'blue', 
#                                        'purple', 
#                                        'red'])

# plt.title("Distribution of Reviews Per Rating Per Category", 
#           fontsize='16')

# plt.legend(bbox_to_anchor=(1.04,1), 
#            loc="upper left",
#            labels=['5-Star Ratings', 
#                    '4-Star Ratings', 
#                    '3-Star Ratings', 
#                    '2-Star Ratings', 
#                    '1-Star Ratings'])

# plt.xlabel("% Breakdown of Star Ratings", fontsize='14')
# plt.gca().invert_yaxis()
# plt.tight_layout()

# plt.show()

## Visualization for All Product Categories
If you ran this same query across all product categories (150+ million reviews), you would see the following visualization:

<img src="img/c5-04.png"  width="70%" align="left">

# 5. How Many Reviews per Star Rating? (5, 4, 3, 2, 1) 

In [11]:
# SQL statement
df = spark.sql("""
    SELECT star_rating, COUNT(*) AS count_reviews
    FROM {}.{}
    WHERE product_category in ('Digital_Software', 'Gift_Card', 'Digital_Video_Games')
    GROUP BY star_rating
    ORDER BY star_rating DESC, count_reviews 
""".format(database_name, table_name)
)

df.show()

+-----------+-------------+
|star_rating|count_reviews|
+-----------+-------------+
|          5|       256796|
|          4|        46958|
|          3|        23093|
|          2|        16208|
|          1|        53546|
+-----------+-------------+


## Results for All Product Categories
If you ran this same query across all product categories (150+ million reviews), you would see the following result:

<img src="img/star_rating_count_all.png"  width="25%" align="left">

In [None]:
# chart = df.plot.bar(
#     x="star_rating", y="count_reviews", rot=0, figsize=(10, 5), title="Review Count by Star Ratings", legend=False
# )

# plt.xlabel("Star Rating")
# plt.ylabel("Review Count")

# plt.show(chart)

## Visualization for All Product Categories
If you ran this same query across all product categories (150+ million reviews), you would see the following visualization:


<img src="img/star_rating_count_all_bar_chart.png"  width="70%" align="left">

# 6. How Did Star Ratings Change Over Time?
Is there a drop-off point for certain product categories over the years?

## Average Star Rating Across All Product Categories

In [35]:
# SQL statement
df = spark.sql("""
    SELECT year, ROUND(AVG(star_rating),4) AS avg_rating
    FROM {}.{}
    WHERE product_category in ('Digital_Software', 'Gift_Card', 'Digital_Video_Games')    
    GROUP BY year
    ORDER BY year
""".format(database_name, table_name)
)

df.show()

+----+----------+
|year|avg_rating|
+----+----------+
|2004|       4.5|
|2005|    3.2759|
|2006|     3.375|
|2007|      3.95|
|2008|    2.8966|
|2009|    3.7288|
|2010|    3.7614|
|2011|    3.9808|
|2012|    4.0955|
|2013|     4.008|
|2014|    4.2026|
|2015|    4.1125|
+----+----------+


In [None]:
# df["year"] = pd.to_datetime(df["year"], format="%Y").dt.year

## Visualization for a Subset of Product Categories

In [None]:
# fig = plt.gcf()
# fig.set_size_inches(12, 5)

# fig.suptitle("Average Star Rating Over Time (Across Subset of Product Categories)")

# ax = plt.gca()
# # ax = plt.gca().set_xticks(df['year'])
# ax.locator_params(integer=True)
# ax.set_xticks(df["year"].unique())

# df.plot(kind="line", x="year", y="avg_rating", color="red", ax=ax)

# # plt.xticks(range(1995, 2016, 1))
# # plt.yticks(range(0,6,1))
# plt.xlabel("Years")
# plt.ylabel("Average Star Rating")
# plt.xticks(rotation=45)

# # fig.savefig('average-rating.png', dpi=300)
# plt.show()

## Visualization for All Product Categories
If you ran this same query across all product categories (150+ million reviews), you would see the following visualization:

<img src="img/c4-06.png"  width="70%" align="left">

## Average Star Rating per Product Category Over Time

In [12]:
# SQL statement
df = spark.sql("""
    SELECT product_category, year, ROUND(AVG(star_rating), 4) AS avg_rating_category
    FROM {}.{}
    WHERE product_category in ('Digital_Software', 'Gift_Card', 'Digital_Video_Games')    
    GROUP BY product_category, year
    ORDER BY year 
""".format(database_name, table_name)
)

df.show()

+-------------------+----+-------------------+
|   product_category|year|avg_rating_category|
+-------------------+----+-------------------+
|          Gift_Card|2004|                4.5|
|          Gift_Card|2005|             3.2759|
|          Gift_Card|2006|             3.2857|
|Digital_Video_Games|2006|                4.0|
|          Gift_Card|2007|               3.95|
|Digital_Video_Games|2008|                2.0|
|          Gift_Card|2008|             3.3043|
|   Digital_Software|2008|             2.7333|
|   Digital_Software|2009|             2.7603|
|          Gift_Card|2009|             3.9389|
|Digital_Video_Games|2009|             3.8924|
|Digital_Video_Games|2010|             3.7338|
|          Gift_Card|2010|              4.307|
|   Digital_Software|2010|             3.1268|
|          Gift_Card|2011|             4.5916|
|   Digital_Software|2011|             3.4667|
|Digital_Video_Games|2011|             3.6484|
|          Gift_Card|2012|             4.7119|
|   Digital_S

## Visualization

In [None]:
# def plot_categories(df):
#     df_categories = df["product_category"].unique()
#     for category in df_categories:
#         # print(category)
#         df_plot = df.loc[df["product_category"] == category]
#         df_plot.plot(
#             kind="line",
#             x="year",
#             y="avg_rating_category",
#             c=np.random.rand(
#                 3,
#             ),
#             ax=ax,
#             label=category,
#         )

In [None]:
# fig = plt.gcf()
# fig.set_size_inches(12, 5)

# fig.suptitle("Average Star Rating Over Time Across Subset Of Categories")

# ax = plt.gca()

# ax.locator_params(integer=True)
# ax.set_xticks(df["year"].unique())

# plot_categories(df)

# plt.xlabel("Year")
# plt.ylabel("Average Star Rating")
# plt.legend(bbox_to_anchor=(0, -0.15, 1, 0), loc=2, ncol=2, mode="expand", borderaxespad=0)

# # fig.savefig('average_rating_category_all_data.png', dpi=300)
# plt.show()

## Visualization for All Product Categories
If you ran this same query across all product categories, you would see the following visualization:

<img src="img/average_rating_category_all_data.png"  width="70%" align="left">

# 7. Which Star Ratings (1-5) are Most Helpful?

In [14]:
# SQL statement
df = spark.sql("""
    SELECT star_rating, AVG(helpful_votes) AS avg_helpful_votes
    FROM {}.{}
    WHERE product_category in ('Digital_Software', 'Gift_Card', 'Digital_Video_Games')
    GROUP BY  star_rating
    ORDER BY  star_rating ASC
""".format(database_name, table_name)
)

df.show()

+-----------+------------------+
|star_rating| avg_helpful_votes|
+-----------+------------------+
|          1|4.8907294662533145|
|          2| 2.493028134254689|
|          3|1.5595635040921492|
|          4|1.0709357298010989|
|          5|0.5324966120967616|
+-----------+------------------+


## Results for All Product Categories
If you ran this same query across all product categories (150+ million reviews), you would see the following result:

<img src="img/star_rating_helpful_all.png"  width="25%" align="left">

## Visualization for a Subset of Product Categories

In [None]:
# chart = df.plot.bar(
#     x="star_rating", y="avg_helpful_votes", rot=0, figsize=(10, 5), title="Helpfulness Of Star Ratings", legend=False
# )

# plt.xlabel("Star Rating")
# plt.ylabel("Average Helpful Votes")

# plt.show(chart)

## Visualization for All Product Categories
If you ran this same query across all product categories (150+ million reviews), you would see the following visualization:

<img src="img/c4-08.png"  width="60%" align="left">

# 8. Which Products Have the Most Helpful Reviews?  How Long are the Most Helpful Reviews?

In [15]:
# SQL statement
df = spark.sql("""
    SELECT product_title, helpful_votes, star_rating,
           LENGTH(review_body) AS review_body_length,
           SUBSTR(review_body, 1, 100) AS review_body_substr
    FROM {}.{}
    WHERE product_category in ('Digital_Software', 'Gift_Card', 'Digital_Video_Games')
    ORDER BY helpful_votes DESC LIMIT 10 
""".format(database_name, table_name)
)

df.show()

+--------------------+-------------+-----------+------------------+--------------------+
|       product_title|helpful_votes|star_rating|review_body_length|  review_body_substr|
+--------------------+-------------+-----------+------------------+--------------------+
|Amazon.com eGift ...|         5987|          1|              3498|I think I am just...|
|TurboTax Deluxe F...|         5363|          1|              3696|I have been a loy...|
|SimCity - Limited...|         5068|          1|              2478|Guess what? If yo...|
|SimCity - Limited...|         3789|          1|              1423|How would you fee...|
|Microsoft Office ...|         2955|          1|              4932|I have never been...|
|SimCity - Limited...|         2509|          5|              1171|You'd think I'd b...|
|TurboTax Deluxe F...|         2439|          1|              3710|Although a long t...|
|Playstation Netwo...|         2384|          5|               190|$49.99 for $50 of...|
|Amazon eGift Card...

## Results for All Product Categories
If you ran this same query across all product categories (150+ million reviews), you would see the following result:

<img src="img/most_helpful_all.png"  width="90%" align="left">

# 9. What is the Ratio of Positive (5, 4) to Negative (3, 2, 1) Reviews?

In [17]:
# SQL statement
df = spark.sql("""
    SELECT (CAST(positive_review_count AS DOUBLE) / CAST(negative_review_count AS DOUBLE)) AS positive_to_negative_sentiment_ratio
    FROM (
      SELECT count(*) AS positive_review_count
      FROM {}.{}
      WHERE star_rating >= 4 and product_category in ('Digital_Software', 'Gift_Card', 'Digital_Video_Games')

    ), (
      SELECT count(*) AS negative_review_count
      FROM {}.{}
      WHERE star_rating < 4 and product_category in ('Digital_Software', 'Gift_Card', 'Digital_Video_Games')
    )
""".format(database_name, table_name, database_name, table_name)
)

df.show()

+------------------------------------+
|positive_to_negative_sentiment_ratio|
+------------------------------------+
|                   3.271554277467231|
+------------------------------------+
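
The ratio above can be cross-checked against the per-star counts reported in question 5, since both queries filter on the same three categories. The sketch below repeats the arithmetic in plain Python, treating ratings of 4 and 5 as positive and everything below 4 as negative, as the SQL does; the counts are copied from that earlier output.

```python
# Review counts per star rating, copied from the output of question 5
star_counts = {5: 256796, 4: 46958, 3: 23093, 2: 16208, 1: 53546}

positive_review_count = sum(count for star, count in star_counts.items() if star >= 4)
negative_review_count = sum(count for star, count in star_counts.items() if star < 4)

positive_to_negative_sentiment_ratio = positive_review_count / negative_review_count
print(positive_review_count, negative_review_count)
print(round(positive_to_negative_sentiment_ratio, 4))
```

The two sub-totals also add up to the same number of reviews as the three category counts in question 2, which suggests the outputs were produced from the same data.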


## Results for All Product Categories
If you ran this same query across all product categories (150+ million reviews), you would see the following result:

<img src="img/ratio_all.png"  width="25%" align="left">

# 10. Which Customers are Abusing the Review System by Repeatedly Reviewing the Same Product?  What Was Their Average Star Rating for Each Product?

In [18]:
# SQL statement
df = spark.sql("""
    SELECT customer_id, product_category, product_title, 
    ROUND(AVG(star_rating),4) AS avg_star_rating, COUNT(*) AS review_count 
    FROM {}.{} 
    WHERE product_category in ('Digital_Software', 'Gift_Card', 'Digital_Video_Games')    
    GROUP BY customer_id, product_category, product_title 
    HAVING COUNT(*) > 1 
    ORDER BY review_count DESC
    LIMIT 5
""".format(database_name, table_name)
)

df.show()


## Result for All Product Categories
If you ran this same query across all product categories (150+ million reviews), you would see the following result:
  
<img src="img/athena-abuse-all.png"  width="60%" align="left">

# 11. What is the distribution of review lengths (number of words)?

In [19]:
df = spark.sql("""
    SELECT CARDINALITY(SPLIT(review_body, ' ')) as num_words
    FROM {}.{}
    WHERE product_category in ('Digital_Software', 'Gift_Card', 'Digital_Video_Games')
""".format(database_name, table_name)
)

df.show()

+---------+
|num_words|
+---------+
|      107|
|        4|
|      170|
|       18|
|       21|
|        6|
|       64|
|       10|
|       28|
|        5|
|      206|
|        2|
|      146|
|        5|
|      200|
|      132|
|       45|
|       11|
|      146|
|      150|
+---------+
only showing top 20 rows


In [20]:
# summary = df["num_words"].describe(percentiles=[0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 1.00])
# summary

TypeError: 'Column' object is not callable
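
The `TypeError` above comes from calling the pandas `describe` method on a Spark `Column`. For a small sample, the word count and a percentile can be sketched in plain Python instead. The helper names below are hypothetical. `num_words` splits on single spaces, which should mirror the `SPLIT(review_body, ' ')` expression in the query. `percentile_nearest_rank` uses the nearest-rank definition, whereas pandas interpolates linearly by default, so the two can differ slightly. The sample values are the 20 rows shown in the output above.

```python
def num_words(review_body: str) -> int:
    """Count space-separated tokens; consecutive spaces yield empty tokens."""
    return len(review_body.split(" "))


def percentile_nearest_rank(values, pct: int):
    """Return the pct-th percentile (integer pct, 1-100) using the nearest-rank method."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling of pct * n / 100
    return ordered[rank - 1]


# num_words values copied from the first 20 rows of the query output
sample_num_words = [107, 4, 170, 18, 21, 6, 64, 10, 28, 5,
                    206, 2, 146, 5, 200, 132, 45, 11, 146, 150]

p80 = percentile_nearest_rank(sample_num_words, 80)
print(num_words("Great download experience"), p80)
```

A percentile chosen this way is one possible basis for a maximum sequence length, in the spirit of the red line at the 80th percentile in the histogram cell.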


In [None]:
# df["num_words"].plot.hist(xticks=[0, 16, 32, 64, 128, 256], bins=100, range=[0, 256]).axvline(
#     x=summary["80%"], c="red"
# )

![](./img/max_seq_length_viz.png)

# Release Resources

In [None]:
%%html

<p><b>Shutting down your kernel for this notebook to release resources.</b></p>
<button class="sm-command-button" data-commandlinker-command="kernelmenu:shutdown" style="display:none;">Shutdown Kernel</button>
        
<script>
try {
    els = document.getElementsByClassName("sm-command-button");
    els[0].click();
}
catch(err) {
    // NoOp
}    
</script>