# Goal

The goal of this notebook is to explore Amazon reviews and see whether bias exists between different reviewers.  

### Notes/Thoughts/Reminders

### This is an ongoing analysis  

Need to separate the analysis between series and movies  
Need to test the mean score and compare it between the two groups  
Compare means and calculate variances  
Run a t-test or Wilcoxon test as appropriate

Check whether there is a consensus group of reviewers who rate most similarly.

## Environment Setup and Dependencies

In [None]:
# Dependencies
import os

# set spark version
spark_version = 'spark-3.0.3'
os.environ['SPARK_VERSION']=spark_version

# Install Spark and Java
!apt-get update
!apt-get install openjdk-11-jdk-headless -qq > /dev/null
!wget -q http://www.apache.org/dist/spark/$SPARK_VERSION/$SPARK_VERSION-bin-hadoop2.7.tgz
!tar xf $SPARK_VERSION-bin-hadoop2.7.tgz
!pip install -q findspark


# Set Environment Variables
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
os.environ["SPARK_HOME"] = f"/content/{spark_version}-bin-hadoop2.7"

# Start a SparkSession
import findspark
findspark.init()

Ign:1 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  InRelease
Hit:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease
Get:3 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ InRelease [3,626 B]
Hit:4 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Release
Get:5 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Hit:6 http://archive.ubuntu.com/ubuntu bionic InRelease
Get:7 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
Hit:8 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bi

In [None]:
# setup pyspark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("AmazonAnalysis")\
  .config("spark.driver.extraClassPath","/content/postgresql-42.2.9.jar")\
  .getOrCreate()

# Extract 
*  Create connection to S3 bucket and file

DATA COLUMNS:   
marketplace       - 2-letter country code of the marketplace where the review was written.  
customer_id       - Random identifier that can be used to aggregate reviews written by a single author.  
review_id         - The unique ID of the review.  
product_id        - The unique Product ID the review pertains to. In the multilingual dataset the reviews  
                    for the same product in different countries can be grouped by the same product_id.  
product_parent    - Random identifier that can be used to aggregate reviews for the same product.  
product_title     - Title of the product.  
product_category  - Broad product category that can be used to group reviews 
                    (also used to group the dataset into coherent parts).  
star_rating       - The 1-5 star rating of the review.  
helpful_votes     - Number of helpful votes.  
total_votes       - Number of total votes the review received.  
vine              - Review was written as part of the Vine program.  
verified_purchase - The review is on a verified purchase.  
review_headline   - The title of the review.  
review_body       - The review text.  
review_date       - The date the review was written.  

DATA FORMAT  
Tab ('\t') separated text file, without quote or escape characters.  
First line in each file is header; 1 line corresponds to 1 record.  
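The format notes above (tab-separated, no quote or escape characters, header on the first line) can be illustrated with a small standard-library sketch. The sample records and the `read_reviews` helper are made up for illustration and only carry a few of the documented columns.

```python
import csv
import io

# Two made-up records in the documented layout (only a few columns, for brevity).
sample = (
    "marketplace\tcustomer_id\tstar_rating\treview_headline\n"
    'US\t111\t5\tA "great" show\n'
    "US\t222\t3\tIt's fine\n"
)

def read_reviews(text):
    # QUOTE_NONE treats quote characters as ordinary text, matching the
    # "without quote or escape characters" note above.
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)

rows = read_reviews(sample)
print(rows[0]["review_headline"])
```

Every value comes back as a string here, so numeric columns such as `star_rating` would still need converting. Spark's CSV reader has a `quote` option that, according to its documentation, can be set to an empty string to turn quoting off; the read in this notebook does not set it.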

In [None]:
# add files to pyspark
from pyspark import SparkFiles

# Load file
# https://s3.amazonaws.com/amazon-reviews-pds/tsv/index.txt
url = "https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Digital_Video_Download_v1_00.tsv.gz" 
filename = "amazon_reviews_us_Digital_Video_Download_v1_00.tsv.gz"
spark.sparkContext.addFile(url)

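# read file
# note: in Spark datetime patterns MM is the month (mm is minutes); the review_date values
# are yyyy-MM-dd, so the column is still read as a string and converted in the Transform step
df = spark.read.csv(SparkFiles.get(filename), header=True, inferSchema=True, sep='\t', timestampFormat="MM/dd/yy")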
df.show(10)

+-----------+-----------+--------------+----------+--------------+--------------------+--------------------+-----------+-------------+-----------+----+-----------------+--------------------+--------------------+-----------+
|marketplace|customer_id|     review_id|product_id|product_parent|       product_title|    product_category|star_rating|helpful_votes|total_votes|vine|verified_purchase|     review_headline|         review_body|review_date|
+-----------+-----------+--------------+----------+--------------+--------------------+--------------------+-----------+-------------+-----------+----+-----------------+--------------------+--------------------+-----------+
|         US|   12190288|R3FU16928EP5TC|B00AYB1482|     668895143|Enlightened: Seas...|Digital_Video_Dow...|          5|            0|          0|   N|                Y|I loved it and I ...|I loved it and I ...| 2015-08-31|
|         US|   30549954|R1IZHHS1MH3AQ4|B00KQD28OM|     246219280|             Vicious|Digital_Video_Dow

In [None]:
df.groupby('vine').count().show()

+----+-------+
|vine|  count|
+----+-------+
|   N|4057147|
+----+-------+



# Transform
*  Remove bad and duplicated records
*  Check number of records left
*  Convert column datatypes if needed

In [None]:
# size of dataframe (rows)
print(df.count())

# drop incomplete records
df = df.dropna()
print(df.count())

# drop duplicated records (if any; should be none)
df = df.dropDuplicates()
print(df.count())

# check datatypes
df.printSchema()

4057147
4056518
4056518
root
 |-- marketplace: string (nullable = true)
 |-- customer_id: integer (nullable = true)
 |-- review_id: string (nullable = true)
 |-- product_id: string (nullable = true)
 |-- product_parent: integer (nullable = true)
 |-- product_title: string (nullable = true)
 |-- product_category: string (nullable = true)
 |-- star_rating: integer (nullable = true)
 |-- helpful_votes: integer (nullable = true)
 |-- total_votes: integer (nullable = true)
 |-- vine: string (nullable = true)
 |-- verified_purchase: string (nullable = true)
 |-- review_headline: string (nullable = true)
 |-- review_body: string (nullable = true)
 |-- review_date: string (nullable = true)



In [None]:
# convert review_date from string to date
from pyspark.sql.functions import to_date, col
complete_table = df.withColumn("review_date", to_date(col("review_date"), "yyyy-MM-dd"))

# check change
complete_table.printSchema()

root
 |-- marketplace: string (nullable = true)
 |-- customer_id: integer (nullable = true)
 |-- review_id: string (nullable = true)
 |-- product_id: string (nullable = true)
 |-- product_parent: integer (nullable = true)
 |-- product_title: string (nullable = true)
 |-- product_category: string (nullable = true)
 |-- star_rating: integer (nullable = true)
 |-- helpful_votes: integer (nullable = true)
 |-- total_votes: integer (nullable = true)
 |-- vine: string (nullable = true)
 |-- verified_purchase: string (nullable = true)
 |-- review_headline: string (nullable = true)
 |-- review_body: string (nullable = true)
 |-- review_date: date (nullable = true)
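Date patterns are case sensitive: in Spark's datetime patterns `MM` is the month and `mm` is the minute. The sketch below shows the same distinction with Python's `datetime`, where `%m` is the month and `%M` is the minute. It uses a date string of the same shape as the `review_date` values shown earlier and is only an illustration, not part of the Spark pipeline.

```python
from datetime import datetime

raw = "2015-08-31"  # same shape as the review_date values in the data

# %m is the month and %M is the minute, mirroring MM vs mm in Spark patterns.
as_month = datetime.strptime(raw, "%Y-%m-%d")
as_minute = datetime.strptime(raw, "%Y-%M-%d")

print(as_month)   # the middle field is read as the month
print(as_minute)  # the middle field is read as minutes; the month falls back to its default
```

The second parse does not fail, it just produces a different date, which is why a wrong pattern letter can go unnoticed.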



## General Review
- What questions do I formulate from looking at the data?

In [None]:
complete_table.show(5)

+-----------+-----------+--------------+----------+--------------+--------------------+--------------------+-----------+-------------+-----------+----+-----------------+--------------------+--------------------+-----------+
|marketplace|customer_id|     review_id|product_id|product_parent|       product_title|    product_category|star_rating|helpful_votes|total_votes|vine|verified_purchase|     review_headline|         review_body|review_date|
+-----------+-----------+--------------+----------+--------------+--------------------+--------------------+-----------+-------------+-----------+----+-----------------+--------------------+--------------------+-----------+
|         US|   51950426|R1005KN8L3OP23|B00COTH9VI|     956367867|Seeking Asian Female|Digital_Video_Dow...|          5|            1|          1|   N|                Y|What a great docu...|What a great docu...| 2015-04-07|
|         US|   42507369|R1008R0427X1FG|B009KHHELW|      41559476|Duck Dynasty Seas...|Digital_Video_Dow

In [None]:
complete_table.agg({'review_date': 'max'}).show()

+----------------+
|max(review_date)|
+----------------+
|      2015-08-31|
+----------------+



In [None]:
complete_table.agg({'review_date': 'min'}).show()

+----------------+
|min(review_date)|
+----------------+
|      2000-10-04|
+----------------+



In [None]:
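# use explicit aggregate functions: a dict with 'star_rating' twice keeps only the last entry
from pyspark.sql import functions as F
complete_table.groupby('vine').agg(F.count('vine'), F.avg('star_rating'), F.stddev('star_rating')).show()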

Review Comments:
*  Most columns are pretty self-explanatory
*  Initial thought is that I need to specifically target the `customer_id`, `product_title`, and `vine` status to determine bias relationships
*  Need to check what these columns mean and how they affect the data:
  *  `product_parent`
  *  `vine`
  *  `verified_purchase`
*  Should probably go to Amazon and check out one of the titles and the reviews section since that is where this info should come from.

## Customer Inspection

In [None]:
from pyspark.sql.functions import col, asc, desc
# NOTE: 'star_rating' appears twice in the dict, so only the last aggregate (stddev) is computed
top_customers = complete_table.groupby('customer_id').agg({'customer_id': 'count', 'star_rating': 'mean', 'star_rating':'stddev'}).sort(col('count(customer_id)').desc())
top_customers.show()

+-----------+-------------------+------------------+
|customer_id|stddev(star_rating)|count(customer_id)|
+-----------+-------------------+------------------+
|   43430756| 1.1466950029602812|              2745|
|   39122522| 1.4677304463865763|               707|
|   30160665| 1.3412714675001378|               579|
|   49382242| 1.1040682379826534|               527|
|   50605810| 0.7162294922476529|               496|
|   12714026| 1.2588258102987135|               474|
|   20052283| 0.8957312551423887|               469|
|    5291529| 0.8632297187933043|               434|
|   17486470| 1.4211032675673525|               402|
|   44167709|  1.075516254256177|               393|
|   22263100| 0.9857063376341949|               386|
|   42398245|  1.031430340340169|               374|
|   50818682| 1.1828674484054744|               357|
|   27106921| 1.6445530191414834|               338|
|   41926755| 1.1660943116267561|               334|
|   12653036|  1.116638223221243|             

The table above shows that reviewers with a large number of reviews score movies very differently. It is not clear how these reviewers would score the same set of movies, but across samples of more than 300 reviews each, the average rating ranged from 2.58 to 4.64. Analyzing survey response bias will be handled in a separate notebook, since it is a more involved process and, with this many reviews, some interesting information can likely be extracted.  
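The output table in this section has a `stddev(star_rating)` column but no average column. That is a consequence of how Python handles a dict literal with a repeated key: the later value replaces the earlier one before Spark ever sees it. A minimal sketch in plain Python, using the same dict shape as the `agg` call (the `pairs` list is only an illustrative alternative):

```python
# Same shape as the dict passed to agg() in this section.
spec = {'customer_id': 'count', 'star_rating': 'mean', 'star_rating': 'stddev'}
print(spec)  # the 'mean' entry has been replaced by 'stddev'

# A list of (column, aggregate) pairs can hold the same column more than once.
pairs = [('customer_id', 'count'), ('star_rating', 'mean'), ('star_rating', 'stddev')]
print(len(pairs))
```

In PySpark, one way to get both statistics is to pass column expressions from `pyspark.sql.functions` (for example `avg("star_rating")` and `stddev("star_rating")`) to `agg` instead of a dict.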

## Product Parent Inspection

In [None]:
# not sure what product_parent means; check several values
review_list = complete_table.filter('product_parent == 534732318')
review_list.show()

+-----------+-----------+--------------+----------+--------------+-------------+--------------------+-----------+-------------+-----------+----+-----------------+--------------------+--------------------+-----------+
|marketplace|customer_id|     review_id|product_id|product_parent|product_title|    product_category|star_rating|helpful_votes|total_votes|vine|verified_purchase|     review_headline|         review_body|review_date|
+-----------+-----------+--------------+----------+--------------+-------------+--------------------+-----------+-------------+-----------+----+-----------------+--------------------+--------------------+-----------+
|         US|   21405234|R228VZZCNNJZY9|B01489L5LQ|     534732318|  After Words|Digital_Video_Dow...|          3|            4|          8|   N|                Y|Four stars go to ...|The best of the m...| 2015-08-23|
|         US|   37232123|R2RVE4TNX16LM7|B01489L5LQ|     534732318|  After Words|Digital_Video_Dow...|          5|            3|     

In [None]:
review_product_parent_record = complete_table.filter('product_parent == 853694223')
review_product_parent_record.show()

+-----------+-----------+--------------+----------+--------------+--------------+--------------------+-----------+-------------+-----------+----+-----------------+--------------------+--------------------+-----------+
|marketplace|customer_id|     review_id|product_id|product_parent| product_title|    product_category|star_rating|helpful_votes|total_votes|vine|verified_purchase|     review_headline|         review_body|review_date|
+-----------+-----------+--------------+----------+--------------+--------------+--------------------+-----------+-------------+-----------+----+-----------------+--------------------+--------------------+-----------+
|         US|   51739130|R1TIXBUT10IETQ|B00I3MQNWG|     853694223|Bosch Season 1|Digital_Video_Dow...|          5|            0|          0|   N|                Y|          Five Stars|Great series.<br ...| 2015-08-29|
|         US|   36609931|R1I5DO9T0YY62H|B00I3MQNWG|     853694223|Bosch Season 1|Digital_Video_Dow...|          5|            0|

In [None]:
review_product_parent_record.groupby('product_title').agg({'product_title':'count'}).show()

+--------------------+--------------------+
|       product_title|count(product_title)|
+--------------------+--------------------+
|Bosch Season 1 [U...|                1056|
|Chapter Three:  B...|                  21|
|Chapter Four:  Fu...|                  38|
|Chapter Seven:  L...|                  43|
|Bosch: The Offici...|                 225|
|Chapter Ten:  Us ...|                 400|
|Chapter Two:  Los...|                  49|
|Chapter Five:  Ma...|                  38|
|               Pilot|                9351|
|Bosch: Behind the...|                  97|
|Chapter Six:  Don...|                  30|
| Bosch: A Look Ahead|                  31|
|Chapter Nine: The...|                  85|
|Chapter Eight:  H...|                  65|
|      Bosch Season 1|               53158|
+--------------------+--------------------+



Product Parent Comments:
*  Shows that `product_parent` is a grouping of related titles (here a season, its episodes, and extras)
*  `product_parent` is probably a better filter metric for standard movies, but not for series.  
*  Do sequels to movies fall under the same `product_parent` code?

In [None]:
# NOTE: 'star_rating' appears twice in the dict, so only the last aggregate (stddev) is computed
top_product_parent_expanded = complete_table.groupby('product_parent').agg({'product_parent': 'count', 'star_rating': 'mean', 'star_rating':'stddev', 'product_title':'max'}).sort(col('count(product_parent)').desc())
top_product_parent_expanded.show()

+--------------+---------------------+-------------------+--------------------+
|product_parent|count(product_parent)|stddev(star_rating)|  max(product_title)|
+--------------+---------------------+-------------------+--------------------+
|     853694223|                64687| 0.7329521807324126|               Pilot|
|     360747388|                24950| 1.1258722151064684|Why Do We Cover t...|
|     192466294|                24144| 0.5700737413464739|Downton Abbey: Or...|
|     459613388|                23975| 0.4972830325471113|Episode 8 (Origin...|
|     730000855|                18240| 0.6925583220789996|            Veterans|
|      82685115|                17550| 0.9197108370275767|You Have Insulted...|
|     756881760|                16604|  1.213892423084143|             Zingers|
|     593966951|                16410| 1.1536516590701158|Under The Dome, S...|
|      47146773|                15618| 1.3262660051632356|The After Exclusi...|
|     682981764|                14573| 0

In [None]:
top_reviewed_titles = complete_table.groupby('product_parent', 'product_title').agg({'product_title':'count', 'product_parent':'max', 'star_rating':'mean'}).sort(col('count(product_title)').desc())
top_reviewed_titles.show()

+--------------+--------------------+-------------------+------------------+--------------------+
|product_parent|       product_title|max(product_parent)|  avg(star_rating)|count(product_title)|
+--------------+--------------------+-------------------+------------------+--------------------+
|     853694223|      Bosch Season 1|          853694223| 4.606098799804357|               53158|
|     192466294|Downton Abbey Sea...|          192466294| 4.857869514040693|               23788|
|     459613388|Downton Abbey Sea...|          459613388| 4.873770078210411|               23782|
|     360747388|Transparent Season 1|          360747388| 4.463614140654381|               21272|
|     730000855|  Justified Season 1|          730000855| 4.679647684467095|               18052|
|     593966951|Under The Dome, S...|          593966951| 4.153612511309293|               15474|
|     682981764|Downton Abbey Sea...|          682981764| 4.892263246899661|               14192|
|     666093513|Cata

## Vine Reviewer Inspection

In [None]:
# NOTE: 'star_rating' appears twice in the dict, so only the last aggregate (stddev) is computed
complete_table.groupby('vine').agg({'vine':'count', 'star_rating':'mean', 'star_rating':'stddev'}).show()


+----+-------------------+-----------+
|vine|stddev(star_rating)|count(vine)|
+----+-------------------+-----------+
|   N| 1.2234969947632452|    4056518|
+----+-------------------+-----------+



In [None]:
# 'Y' must be a quoted string; the bare name Y raised a NameError
complete_table.filter(complete_table.vine == 'Y').describe().show()

In [None]:
# quote the literal inside the SQL expression; an unquoted N is parsed as a column name
complete_table.filter("vine == 'N'").describe().show()

# Feature Modification  
*  Identify series
*  One-Hot-Encode `star_rating`

## Identify Series Shows

In [None]:
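from pyspark.sql.functions import col, asc, desc, lit
# flag titles containing "Season" as series; titles without that word (e.g. "Pilot") are not flagged
type_list = complete_table.product_title.contains("Season")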

In [None]:
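# type_list is a Column, not a DataFrame, so it has no dtypes;
# the attribute access is treated as a field lookup, hence the output below
type_list.dtypes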

Column<b'contains(product_title, Season)[dtypes]'>

In [None]:
complete_table = complete_table.withColumn('isSeries', type_list)
complete_table.show()

+-----------+-----------+--------------+----------+--------------+--------------------+--------------------+-----------+-------------+-----------+----+-----------------+--------------------+--------------------+-----------+--------+
|marketplace|customer_id|     review_id|product_id|product_parent|       product_title|    product_category|star_rating|helpful_votes|total_votes|vine|verified_purchase|     review_headline|         review_body|review_date|isSeries|
+-----------+-----------+--------------+----------+--------------+--------------------+--------------------+-----------+-------------+-----------+----+-----------------+--------------------+--------------------+-----------+--------+
|         US|   51950426|R1005KN8L3OP23|B00COTH9VI|     956367867|Seeking Asian Female|Digital_Video_Dow...|          5|            1|          1|   N|                Y|What a great docu...|What a great docu...| 2015-04-07|   false|
|         US|   42507369|R1008R0427X1FG|B009KHHELW|      41559476|Du

In [None]:
complete_table.groupby('isSeries').agg({'isSeries':'count', 'star_rating':'mean', 'helpful_votes':'sum'}).show()

+--------+-----------------+------------------+---------------+
|isSeries| avg(star_rating)|sum(helpful_votes)|count(isSeries)|
+--------+-----------------+------------------+---------------+
|    true|4.602062651843032|            360543|        1553825|
|   false|3.952329750392877|           1667937|        2502693|
+--------+-----------------+------------------+---------------+
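The two group means above (about 4.60 for titles flagged as series versus 3.95 for the rest) are the kind of difference the notes at the top propose testing. Below is a minimal sketch of Welch's t statistic using only the standard library. The `welch_t` helper and the two short rating lists are made up for illustration and are not drawn from the dataset.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    # statistics.variance is the sample variance (n - 1 in the denominator)
    se_a = statistics.variance(sample_a) / len(sample_a)
    se_b = statistics.variance(sample_b) / len(sample_b)
    t = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation for the degrees of freedom
    dof = (se_a + se_b) ** 2 / (
        se_a ** 2 / (len(sample_a) - 1) + se_b ** 2 / (len(sample_b) - 1)
    )
    return t, dof

# Made-up star ratings, NOT taken from the review data.
series_ratings = [5, 5, 4, 5, 4, 5, 3, 5]
movie_ratings = [4, 3, 5, 2, 4, 3, 4, 5]

t_stat, dof = welch_t(series_ratings, movie_ratings)
print(round(t_stat, 3), round(dof, 1))
```

Star ratings are ordinal and heavily skewed toward 5, so a rank-based test such as the Wilcoxon rank-sum test mentioned in the notes may be the more appropriate choice. If SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` computes the Welch variant along with a p-value.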



## Separate `star_rating` into own columns

In [None]:
from pyspark.sql.functions import udf, col
from pyspark.sql.types import IntegerType

# one 0/1 indicator column per star value; a native column expression
# avoids the overhead of a Python UDF
for i in [1, 2, 3, 4, 5]:
    new_column_name = f"is_{i}"
    complete_table = complete_table.withColumn(new_column_name, (col("star_rating") == i).cast("int"))
complete_table.show()



+-----------+-----------+--------------+----------+--------------+--------------------+--------------------+-----------+-------------+-----------+----+-----------------+--------------------+--------------------+-----------+--------+----+----+----+----+----+
|marketplace|customer_id|     review_id|product_id|product_parent|       product_title|    product_category|star_rating|helpful_votes|total_votes|vine|verified_purchase|     review_headline|         review_body|review_date|isSeries|is_1|is_2|is_3|is_4|is_5|
+-----------+-----------+--------------+----------+--------------+--------------------+--------------------+-----------+-------------+-----------+----+-----------------+--------------------+--------------------+-----------+--------+----+----+----+----+----+
|         US|   51950426|R1005KN8L3OP23|B00COTH9VI|     956367867|Seeking Asian Female|Digital_Video_Dow...|          5|            1|          1|   N|                Y|What a great docu...|What a great docu...| 2015-04-07|   
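For reference, the `is_1` through `is_5` columns hold a 1 in the column that matches the review's star rating and 0 elsewhere. A plain-Python sketch of the same encoding for a single rating (the `one_hot_star` helper is hypothetical and not part of the notebook):

```python
def one_hot_star(rating, levels=(1, 2, 3, 4, 5)):
    """Return {'is_1': 0 or 1, ..., 'is_5': 0 or 1} for one star rating."""
    return {f"is_{level}": int(rating == level) for level in levels}

print(one_hot_star(5))
```

Summing each indicator column per group then gives the count of 1-star through 5-star reviews for that group.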

## Determine length of review

In [None]:
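def count_words(x):
  # number of space-separated tokens in the review text
  return len(x.split(" "))

countWords = udf(count_words, "int")
# apply the UDF to the review_body column; passing the bare UDF object raised an AssertionError
complete_table = complete_table.withColumn('feedback_length', countWords(col('review_body')))
complete_table.show()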
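The word count used for the review length is sensitive to how the text is split. A short plain-Python sketch of the difference between `split(" ")` and `split()`, using a made-up string with a double space and a trailing space:

```python
def count_words(text):
    # str.split() with no argument splits on any run of whitespace
    # and drops empty strings
    return len(text.split())

sample = "Great  series. Loved it "  # made-up text, not from the dataset

print(len(sample.split(" ")))  # counts the empty strings around extra spaces
print(count_words(sample))     # counts only the actual words
```

With `split(" ")`, an empty review body also counts as one word, whereas `split()` returns zero. Which behavior is wanted depends on how `feedback_length` will be used later.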


