# Amazon Reviews Dataset - Extract, Transform and Analysis of Vine Reviews 
Author: Rosie Gianan
#### Dataset source URL:
https://s3.amazonaws.com/amazon-reviews-pds/tsv/index.txt
    
#### Dataset used for this extract, transform and analysis:
https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Luggage_v1_00.tsv.gz
    
#### DATA COLUMNS:
-	marketplace       - 2 letter country code of the marketplace where the review was written.
-	customer_id       - Random identifier that can be used to aggregate reviews written by a single author.
-	review_id         - The unique ID of the review.
-	product_id        - The unique Product ID the review pertains to. In the multilingual dataset, reviews for the same product in different countries can be grouped by the same product_id.
-	product_parent    - Random identifier that can be used to aggregate reviews for the same product.
-	product_title     - Title of the product.
-	product_category  - Broad product category that can be used to group reviews (also used to group the dataset into coherent parts).
-	star_rating       - The 1-5 star rating of the review.
-	helpful_votes     - Number of helpful votes.
-	total_votes       - Number of total votes the review received.
-	vine              - Review was written as part of the Vine program.
-	verified_purchase - The review is on a verified purchase.
-	review_headline   - The title of the review.
-	review_body       - The review text.
-	review_date       - The date the review was written.

#### DATA FORMAT
-	Tab ('\t') separated text file, without quote or escape characters.
-	First line in each file is header; 1 line corresponds to 1 record.
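
The format described above (tab-separated, no quote or escape characters, header on the first line) can be illustrated with Python's standard `csv` module. This is only a minimal sketch and not part of the original notebook: the two sample rows and the reduced set of columns are made up for illustration, and `parse_reviews_tsv` is a hypothetical helper name.

```python
import csv
import io


def parse_reviews_tsv(text):
    """Parse tab-separated text whose first line is a header.

    quoting=csv.QUOTE_NONE treats quote characters as ordinary data,
    which matches a file written without quote or escape characters.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


# Made-up sample using a subset of the dataset's columns (illustration only).
sample = (
    "review_id\tstar_rating\tvine\treview_headline\n"
    'R0000000000001\t5\tN\tGreat "carry-on" bag\n'
    "R0000000000002\t3\tY\tIt's fine\n"
)

rows = parse_reviews_tsv(sample)
print(len(rows))
print(rows[0]["review_headline"])
```

Every value comes back as a string, including `star_rating`, which is why numeric columns need an explicit cast before any arithmetic.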

In [1]:
import os
# Find the latest Spark 3.2.x release at http://www.apache.org/dist/spark/ and enter it as the Spark version
# For example:
spark_version = 'spark-3.2.2'
# spark_version = 'spark-3.<enter version>'

os.environ['SPARK_VERSION']=spark_version

# Install Spark and Java
!apt-get update
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://www.apache.org/dist/spark/$SPARK_VERSION/$SPARK_VERSION-bin-hadoop2.7.tgz
!tar xf $SPARK_VERSION-bin-hadoop2.7.tgz
!pip install -q findspark

# Set Environment Variables
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = f"/content/{spark_version}-bin-hadoop2.7"

# Initialize findspark so that pyspark can be imported from SPARK_HOME
import findspark
findspark.init()

Get:1 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ InRelease [3,626 B]
Hit:2 http://archive.ubuntu.com/ubuntu bionic InRelease
Get:3 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
Ign:4 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  InRelease
Hit:5 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease
Hit:6 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Release
Hit:7 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic InRelease
Get:8 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Get:9 http://archive.ubuntu.com/ubuntu bionic-backports InRelease [83.3 kB]
Hit:10 http://ppa.launchpad.net/cran/libgit2/ubuntu bionic InRelease
Hit:11 http://ppa.launchpad.net/deadsnakes/ppa/ubuntu bionic InRelease
Hit:12 http://ppa.launchpad.net/graphics-drivers/ppa/ubuntu bionic InRelease
Get:14 http://archive.ubuntu.com/ubuntu b

## Postgres connection

In [2]:
# Download the PostgreSQL JDBC driver used for the Postgres connection
!wget https://jdbc.postgresql.org/download/postgresql-42.2.9.jar

--2022-11-19 20:20:15--  https://jdbc.postgresql.org/download/postgresql-42.2.9.jar
Resolving jdbc.postgresql.org (jdbc.postgresql.org)... 72.32.157.228, 2001:4800:3e1:1::228
Connecting to jdbc.postgresql.org (jdbc.postgresql.org)|72.32.157.228|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 914037 (893K) [application/java-archive]
Saving to: ‘postgresql-42.2.9.jar’


2022-11-19 20:20:15 (10.5 MB/s) - ‘postgresql-42.2.9.jar’ saved [914037/914037]



## Spark session

In [3]:
# Start Spark session
from pyspark.sql import SparkSession
spark = (
    SparkSession.builder
    .appName("CloudETL")
    .config("spark.driver.extraClassPath", "/content/postgresql-42.2.9.jar")
    .getOrCreate()
)

## Extract the data from AWS S3

In [4]:
# Read in data from S3 Buckets
from pyspark import SparkFiles
url = "https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Luggage_v1_00.tsv.gz"
spark.sparkContext.addFile(url)
df = spark.read.csv(SparkFiles.get("amazon_reviews_us_Luggage_v1_00.tsv.gz"), sep="\t", header=True)

# Show DataFrame
df.show()

+-----------+-----------+--------------+----------+--------------+--------------------+----------------+-----------+-------------+-----------+----+-----------------+--------------------+--------------------+-----------+
|marketplace|customer_id|     review_id|product_id|product_parent|       product_title|product_category|star_rating|helpful_votes|total_votes|vine|verified_purchase|     review_headline|         review_body|review_date|
+-----------+-----------+--------------+----------+--------------+--------------------+----------------+-----------+-------------+-----------+----+-----------------+--------------------+--------------------+-----------+
|         US|   40884699| R9CO86UUJCAW5|B00VGTN02Y|     786681372|Teenage Mutant Ni...|         Luggage|          3|            0|          0|   N|                Y|my review of this...|my review of this...| 2015-08-31|
|         US|   23208852|R3PR8X6QGVJ8B1|B005KIWL0E|     618251799|Kenneth Cole Reac...|         Luggage|          5|    

In [5]:
# Count the number of records (rows) in the dataset
df.count()

348657

## Transform the data

In [6]:
# Drop rows that contain any null value, then remove duplicate rows
df = df.dropna(how='any')
df = df.dropDuplicates()
df.count()

348613

In [7]:
# Print the column names and types
df.dtypes

[('marketplace', 'string'),
 ('customer_id', 'string'),
 ('review_id', 'string'),
 ('product_id', 'string'),
 ('product_parent', 'string'),
 ('product_title', 'string'),
 ('product_category', 'string'),
 ('star_rating', 'string'),
 ('helpful_votes', 'string'),
 ('total_votes', 'string'),
 ('vine', 'string'),
 ('verified_purchase', 'string'),
 ('review_headline', 'string'),
 ('review_body', 'string'),
 ('review_date', 'string')]

In [8]:
# Cast the numeric columns from string to integer
from pyspark.sql.types import IntegerType
df = df.withColumn("customer_id", df["customer_id"].cast(IntegerType()))
df = df.withColumn("product_parent", df["product_parent"].cast(IntegerType()))
df = df.withColumn("star_rating", df["star_rating"].cast(IntegerType()))
df = df.withColumn("helpful_votes", df["helpful_votes"].cast(IntegerType()))
df = df.withColumn("total_votes", df["total_votes"].cast(IntegerType()))
df.dtypes

[('marketplace', 'string'),
 ('customer_id', 'int'),
 ('review_id', 'string'),
 ('product_id', 'string'),
 ('product_parent', 'int'),
 ('product_title', 'string'),
 ('product_category', 'string'),
 ('star_rating', 'int'),
 ('helpful_votes', 'int'),
 ('total_votes', 'int'),
 ('vine', 'string'),
 ('verified_purchase', 'string'),
 ('review_headline', 'string'),
 ('review_body', 'string'),
 ('review_date', 'string')]

### Create the df for vine_table

In [9]:
# Create the df for vine_table
df_vine_table = df.select([ "review_id"
                          , "star_rating"
                          , "helpful_votes"
                          , "total_votes"
                          , "vine"
                          ]).dropDuplicates(["review_id"])
df_vine_table.show()

+--------------+-----------+-------------+-----------+----+
|     review_id|star_rating|helpful_votes|total_votes|vine|
+--------------+-----------+-------------+-----------+----+
|R1000QXOTMQIU4|          1|           18|         20|   N|
|R1001T49IL9HAU|          5|            2|          3|   N|
|R1002WVSKO1QHG|          5|            0|          0|   N|
|R1002ZGISG3B05|          4|            2|          2|   N|
|R1004N9KZKSQSY|          4|            0|          0|   N|
|R1006PZC1D8T21|          5|            0|          1|   N|
|R1006VZRSTX6GT|          4|            1|          1|   N|
|R1007N4WRJCE2S|          5|            1|          1|   N|
|R100844D7ATZKI|          4|            1|          1|   N|
| R1008ROGV9HPR|          5|            0|          0|   N|
|R10090Y4VX54G9|          5|           34|         37|   N|
|R1009PQF75PXP8|          2|            3|          4|   N|
|R100CH5V0LFAFF|          5|            0|          0|   N|
|R100D8W6AAGSFA|          5|            

In [10]:
df_vine_table.dtypes

[('review_id', 'string'),
 ('star_rating', 'int'),
 ('helpful_votes', 'int'),
 ('total_votes', 'int'),
 ('vine', 'string')]

In [11]:
df_vine_table.count()

348613

In [12]:
# Create vine paid df
df_vine_paid = df_vine_table.filter('vine = "Y"')
df_vine_paid.show()

+--------------+-----------+-------------+-----------+----+
|     review_id|star_rating|helpful_votes|total_votes|vine|
+--------------+-----------+-------------+-----------+----+
|R10K89GQOXHP7G|          4|            0|          0|   Y|
|R10SROQYM7GU29|          5|            0|          0|   Y|
|R10UFCNIZ8XMVD|          4|            1|          1|   Y|
|R10VUE4T3RBX5R|          5|            1|          1|   Y|
|R11HWLSG36EATO|          5|            0|          1|   Y|
|R12O3R9EX9C0BZ|          5|            0|          0|   Y|
|R12QV9N7VK7I4O|          4|            0|          0|   Y|
|R13RAWMKH88CFZ|          4|            0|          0|   Y|
|R13V2UJF8N0134|          5|            2|          2|   Y|
|R1405CHZB74TQR|          5|            0|          0|   Y|
|R144U7PJ79YR7R|          4|            0|          0|   Y|
|R14568FBGOPC5R|          4|            0|          0|   Y|
|R1484EBZLMVXM2|          5|            1|          2|   Y|
|R149ISO7W6AEVG|          5|            

In [None]:
df_vine_paid.count()

904

In [13]:
# Create vine unpaid df
df_vine_unpaid = df_vine_table.filter('vine = "N"')
df_vine_unpaid.show()

+--------------+-----------+-------------+-----------+----+
|     review_id|star_rating|helpful_votes|total_votes|vine|
+--------------+-----------+-------------+-----------+----+
|R1000QXOTMQIU4|          1|           18|         20|   N|
|R1001T49IL9HAU|          5|            2|          3|   N|
|R1002WVSKO1QHG|          5|            0|          0|   N|
|R1002ZGISG3B05|          4|            2|          2|   N|
|R1004N9KZKSQSY|          4|            0|          0|   N|
|R1006PZC1D8T21|          5|            0|          1|   N|
|R1006VZRSTX6GT|          4|            1|          1|   N|
|R1007N4WRJCE2S|          5|            1|          1|   N|
|R100844D7ATZKI|          4|            1|          1|   N|
| R1008ROGV9HPR|          5|            0|          0|   N|
|R10090Y4VX54G9|          5|           34|         37|   N|
|R1009PQF75PXP8|          2|            3|          4|   N|
|R100CH5V0LFAFF|          5|            0|          0|   N|
|R100D8W6AAGSFA|          5|            

In [15]:
df_vine_unpaid.count()

347709

## Analysis

In [16]:
# Extract vine paid with rating = 5
df_vine_paid_star_rating_5 = df_vine_paid.filter(df_vine_paid["star_rating"] == 5)
df_vine_paid_star_rating_5.count()

472

In [17]:
# Extract vine unpaid with rating = 5
df_vine_unpaid_star_rating_5 = df_vine_unpaid.filter(df_vine_unpaid["star_rating"] == 5)
df_vine_unpaid_star_rating_5.count()

216003

In [19]:
# Get the average star_rating, helpful_votes and total_votes for paid and unpaid vine reviews
df_avg = df_vine_table.groupBy("vine").avg()
df_avg.select("vine", "avg(star_rating)", "avg(helpful_votes)", "avg(total_votes)").show()

+----+------------------+------------------+------------------+
|vine|  avg(star_rating)|avg(helpful_votes)|  avg(total_votes)|
+----+------------------+------------------+------------------+
|   Y|4.3584070796460175|2.3584070796460175| 2.870575221238938|
|   N|  4.22309172325134|2.0626299578095475|2.4305007923292177|
+----+------------------+------------------+------------------+



In [20]:
# Get the count of paid (vine = Y) and unpaid (vine = N) reviews
df_vine_table.createOrReplaceTempView("COUNT_VINE")
spark.sql("SELECT vine, COUNT(vine) FROM COUNT_VINE GROUP BY vine").show()

+----+-----------+
|vine|count(vine)|
+----+-----------+
|   Y|        904|
|   N|     347709|
+----+-----------+



In [21]:
# Get the count of 5-star reviews for paid and unpaid vine
spark.sql("SELECT vine, COUNT(vine) FROM COUNT_VINE WHERE star_rating = 5 GROUP BY vine").show()

+----+-----------+
|vine|count(vine)|
+----+-----------+
|   Y|        472|
|   N|     216003|
+----+-----------+
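
The counts above can be turned into percentages to make the two groups easier to compare. The following is a small sketch that uses only the totals and 5-star counts printed by the two queries above. The helper name `percent` is an assumption introduced for illustration.

```python
def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Counts copied from the query outputs above.
total = {"Y": 904, "N": 347709}
five_star = {"Y": 472, "N": 216003}

for vine in ("Y", "N"):
    share = percent(five_star[vine], total[vine])
    print(f"vine={vine}: {share:.1f}% of reviews are 5-star")

# Share of the whole dataset that was written under the Vine program.
paid_share = percent(total["Y"], total["Y"] + total["N"])
print(f"paid Vine reviews: {paid_share:.2f}% of all reviews")
```

On these figures, about 52.2% of paid Vine reviews and 62.1% of unpaid reviews are 5-star, and paid reviews make up roughly 0.26% of the dataset.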



# Conclusion

The number of paid Vine reviews is very small compared with the number of unpaid reviews (904 versus 347,709), which shows that reviewers are willing to make the effort to review a product without an incentive. At first glance, this suggests the program is trustworthy, since the reviews are overwhelmingly not incentive-based. The average star rating is high for both paid and unpaid reviews, at more than 4 stars (about 4.36 and 4.22, respectively), while the average helpful votes and total votes are both below 3. Comparing the ratings with the votes, however, implies that the Vine program has not been shown to be trustworthy, since the number of votes is very low for such high ratings.

