# Goal

The goal of this notebook is to identify the Vine reviewers in the Amazon Product Reviews datasets.

### Notes/Thoughts/Reminders
For each dataset, pull the file and extract:
  *   the total number of records
  *   the total number of Vine users
  *   a list of all the Vine users (identified by `customer_id`)
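
As a rough illustration of the three extractions listed above, here is a minimal pure-Python sketch that works on a tiny in-memory TSV. The helper name `summarize_reviews` and the sample rows are hypothetical; the `vine` and `customer_id` column names are the ones the Spark code in this notebook relies on.

```python
import csv
import io


def summarize_reviews(tsv_text):
    """Return total rows, Vine review count, and sorted unique Vine customer IDs."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    total = 0
    vine_reviews = 0
    vine_customers = set()
    for row in reader:
        total += 1
        if row["vine"] == "Y":
            vine_reviews += 1
            vine_customers.add(row["customer_id"])
    return {
        "total_reviews": total,
        "vine_review_count": vine_reviews,
        "vine_members": sorted(vine_customers),
    }


# Hypothetical three-row sample; the customer IDs are made up for illustration.
sample = (
    "customer_id\tvine\n"
    "111\tY\n"
    "222\tN\n"
    "111\tY\n"
)
result = summarize_reviews(sample)
print(result)
```

Note that one customer can write several Vine reviews, so the review count and the number of distinct Vine users can differ.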

## Environment Setup and Dependencies

In [1]:
# Dependencies
import os

# set spark version
spark_version = 'spark-3.0.3'
os.environ['SPARK_VERSION']=spark_version

# Install Spark and Java
!apt-get update
!apt-get install openjdk-11-jdk-headless -qq > /dev/null
!wget -q http://www.apache.org/dist/spark/$SPARK_VERSION/$SPARK_VERSION-bin-hadoop2.7.tgz
!tar xf $SPARK_VERSION-bin-hadoop2.7.tgz
!pip install -q findspark


# Set Environment Variables
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
os.environ["SPARK_HOME"] = f"/content/{spark_version}-bin-hadoop2.7"

# Make the Spark installation importable (the SparkSession itself is created in a later cell)
import findspark
findspark.init()

Hit:1 http://archive.ubuntu.com/ubuntu bionic InRelease
Get:2 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
Get:3 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Hit:4 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ InRelease
Hit:5 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic InRelease
Get:6 http://archive.ubuntu.com/ubuntu bionic-backports InRelease [74.6 kB]
Hit:7 http://ppa.launchpad.net/cran/libgit2/ubuntu bionic InRelease
Hit:8 http://ppa.launchpad.net/deadsnakes/ppa/ubuntu bionic InRelease
Hit:9 http://ppa.launchpad.net/graphics-drivers/ppa/ubuntu bionic InRelease
Ign:10 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  InRelease
Hit:11 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease
Hit:12 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Release
Fetched 252 kB in 3s (79.4 kB/s)
Reading package li

In [2]:
# postgres connection
!wget https://jdbc.postgresql.org/download/postgresql-42.2.9.jar

--2022-08-06 01:16:04--  https://jdbc.postgresql.org/download/postgresql-42.2.9.jar
Resolving jdbc.postgresql.org (jdbc.postgresql.org)... 72.32.157.228, 2001:4800:3e1:1::228
Connecting to jdbc.postgresql.org (jdbc.postgresql.org)|72.32.157.228|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 914037 (893K) [application/java-archive]
Saving to: ‘postgresql-42.2.9.jar.2’


2022-08-06 01:16:04 (1.78 MB/s) - ‘postgresql-42.2.9.jar.2’ saved [914037/914037]



In [3]:
# setup pyspark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("AmazonDatasetReview")\
  .config("spark.driver.extraClassPath","/content/postgresql-42.2.9.jar")\
  .getOrCreate()

# Extract 
*  Create a connection to the S3 bucket and index file

In [4]:
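import requests
from bs4 import BeautifulSoup

url = 'https://s3.amazonaws.com/amazon-reviews-pds/tsv/index.txt'
html = requests.get(url)
html.raise_for_status()  # fail early if the index could not be fetched

# Keep the class name intact so this cell can be re-run without shadowing it
soup = BeautifulSoup(html.text, 'html.parser')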

In [5]:
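# index.txt is plain text; take the first node of the parsed document
text = soup.contents[0]
text_list = text.split('\n')

url_list = []

# Keep only dataset URLs, skipping the sample and multilingual files
for line in text_list:
  if line.startswith('http'):
    if 'sample' not in line and 'multilingual' not in line:
      url_list.append(line)

url_list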

['https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Wireless_v1_00.tsv.gz',
 'https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Watches_v1_00.tsv.gz',
 'https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Video_Games_v1_00.tsv.gz',
 'https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Video_DVD_v1_00.tsv.gz',
 'https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Video_v1_00.tsv.gz',
 'https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Toys_v1_00.tsv.gz',
 'https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Tools_v1_00.tsv.gz',
 'https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Sports_v1_00.tsv.gz',
 'https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Software_v1_00.tsv.gz',
 'https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Shoes_v1_00.tsv.gz',
 'https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Pet_Products_v1_00.tsv.g
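
The filtering step above can also be written as a small standalone function that works directly on the index text. This is only a sketch: the function name `extract_dataset_urls`, its `excluded` parameter, and the sample index lines are assumptions for illustration, not part of the original notebook.

```python
def extract_dataset_urls(index_text, excluded=("sample", "multilingual")):
    """Return lines that look like URLs and contain none of the excluded words."""
    urls = []
    for line in index_text.splitlines():
        line = line.strip()
        if line.startswith("http") and not any(word in line for word in excluded):
            urls.append(line)
    return urls


# Made-up index contents: only the Watches URL is taken from the output above.
index_text = """Some descriptive header line
https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Watches_v1_00.tsv.gz
https://example.com/tsv/sample_file.tsv
https://example.com/tsv/reviews_multilingual_file.tsv.gz
"""
dataset_urls = extract_dataset_urls(index_text)
print(dataset_urls)
```

Using `splitlines()` and `strip()` also tolerates blank lines and trailing whitespace in the index.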

In [6]:
# add files to pyspark
from pyspark import SparkFiles

summary=[]

# Load file
for url in url_list:

  summary_table = {}

  filename = url.split("tsv/")[1]
  short_filename = filename.replace(".tsv.gz","")
  summary_table["filename"] = short_filename


  spark.sparkContext.addFile(url)

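  # read file ('MM' is month in Spark datetime patterns; 'mm' would mean minutes)
  df = spark.read.csv(SparkFiles.get(filename), header=True, inferSchema=True, sep='\t', timestampFormat="MM/dd/yy")
  # drop rows containing any null value, then drop exact duplicate rows
  df = df.dropna()
  df = df.dropDuplicates()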


  # *****************************************************
  # find total reviews
  total_records = df.count()
  summary_table['total_reviews'] = total_records


  # *****************************************************
  # find vine reviews
  # collect the per-flag counts once; a flag that never occurs yields None
  vine_counts = {row['vine']: row['count'] for row in df.groupBy('vine').count().collect()}

  summary_table["vine_review_count"] = vine_counts.get("Y")
  summary_table["not_vine_review_count"] = vine_counts.get("N")


  # *****************************************************
  # vine member list

  vine_members = df.filter(df.vine == "Y").select('customer_id').distinct()

  vine_member_list = list(vine_members.toPandas()['customer_id'])

  summary_table['vine_members'] = vine_member_list

  summary.append(summary_table)
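
The loop derives each file name with `url.split("tsv/")[1]`, which depends on the `tsv/` folder appearing in the URL. A slightly more defensive variant, sketched below, takes the last path component instead. The helper name `dataset_names` is hypothetical.

```python
import posixpath
from urllib.parse import urlparse


def dataset_names(url, suffix=".tsv.gz"):
    """Return (filename, short_filename) for a dataset URL."""
    filename = posixpath.basename(urlparse(url).path)
    if filename.endswith(suffix):
        short_filename = filename[: -len(suffix)]
    else:
        short_filename = filename
    return filename, short_filename


example_url = (
    "https://s3.amazonaws.com/amazon-reviews-pds/tsv/"
    "amazon_reviews_us_Watches_v1_00.tsv.gz"
)
names = dataset_names(example_url)
print(names)
```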


In [7]:
# The loop above took 3 hr 47 min
# Create a Spark DataFrame from the list of summary dicts
dataframe = spark.createDataFrame(summary)

# Show the DataFrame
dataframe.show()



+--------------------+---------------------+-------------+--------------------+-----------------+
|            filename|not_vine_review_count|total_reviews|        vine_members|vine_review_count|
+--------------------+---------------------+-------------+--------------------+-----------------+
|amazon_reviews_us...|              8983572|      9001052|[50699952, 520769...|            17480|
|amazon_reviews_us...|               958932|       960679|[38679000, 492148...|             1747|
|amazon_reviews_us...|              1781596|      1785886|[18800155, 518545...|             4290|
|amazon_reviews_us...|              5064083|      5068421|[18800155, 516674...|             4338|
|amazon_reviews_us...|               380575|       380575|                  []|             null|
|amazon_reviews_us...|              4821664|      4863497|[22978817, 188001...|            41833|
|amazon_reviews_us...|              1733213|      1740974|[17171509, 520769...|             7761|
|amazon_reviews_us..

In [8]:
import pandas as pd

summary_df = pd.DataFrame(summary)

summary_df

|   | filename | total_reviews | vine_review_count | not_vine_review_count | vine_members |
|---|----------|---------------|-------------------|-----------------------|--------------|
| 0 | amazon_reviews_us_Wireless_v1_00 | 9001052 | 17480.0 | 8983572 | [50699952, 52076994, 22978817, 35218069, 42553... |
| 1 | amazon_reviews_us_Watches_v1_00 | 960679 | 1747.0 | 958932 | [38679000, 49214860, 17171509, 41591813, 51854... |
| 2 | amazon_reviews_us_Video_Games_v1_00 | 1785886 | 4290.0 | 1781596 | [18800155, 51854558, 17481726, 38679000, 18496... |
| 3 | amazon_reviews_us_Video_DVD_v1_00 | 5068421 | 4338.0 | 5064083 | [18800155, 51667424, 20590238, 17171509, 50874... |
| 4 | amazon_reviews_us_Video_v1_00 | 380575 | NaN | 380575 | [] |
| 5 | amazon_reviews_us_Toys_v1_00 | 4863497 | 41833.0 | 4821664 | [22978817, 18800155, 45053274, 50699952, 41549... |
| 6 | amazon_reviews_us_Tools_v1_00 | 1740974 | 7761.0 | 1733213 | [17171509, 52076994, 51854558, 20590238, 38679... |
| 7 | amazon_reviews_us_Sports_v1_00 | 4849000 | 10080.0 | 4838920 | [18800155, 52773179, 50152643, 22978817, 41549... |
| 8 | amazon_reviews_us_Software_v1_00 | 341913 | 10415.0 | 331498 | [50874998, 17481726, 51854558, 27189281, 50152... |
| 9 | amazon_reviews_us_Shoes_v1_00 | 4366324 | 895.0 | 4365429 | [17481726, 51854558, 49214860, 52228204, 25620... |
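
Several customer IDs (for example `18800155`) show up in more than one dataset, so the per-dataset lists overlap. One possible way to get a single de-duplicated list of Vine reviewers, and to see how many datasets each one appears in, is sketched below. The function name `vine_member_overlap` and the sample records are hypothetical; the records only mimic the shape of the `summary` list built earlier.

```python
from collections import Counter


def vine_member_overlap(summary_records):
    """Count, for each customer ID, how many datasets list it as a Vine member."""
    counts = Counter()
    for record in summary_records:
        # set() guards against an ID being counted twice within one dataset
        counts.update(set(record["vine_members"]))
    return counts


# Hypothetical records with made-up IDs, shaped like entries of `summary`.
summary_sample = [
    {"filename": "dataset_a", "vine_members": [1, 2, 3]},
    {"filename": "dataset_b", "vine_members": [2, 3]},
    {"filename": "dataset_c", "vine_members": []},
]
overlap = vine_member_overlap(summary_sample)
unique_members = sorted(overlap)
print(unique_members)
print(overlap.most_common(2))
```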


In [10]:
dataframe.write.parquet("amazon_vine_reviews.parquet")
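
Spark writes Parquet output as a directory of part files rather than a single file. If downloading that directory directly does not work in your environment, one option is to zip it first and download the archive. The sketch below uses only the standard library; the helper name `zip_directory` and the demo directory are assumptions for illustration.

```python
import os
import shutil
import tempfile


def zip_directory(dir_path):
    """Zip dir_path into '<dir_path>.zip' and return the archive path."""
    abs_path = os.path.abspath(dir_path)
    parent, name = os.path.split(abs_path)
    return shutil.make_archive(abs_path, "zip", root_dir=parent, base_dir=name)


# Demo on a throwaway directory that stands in for Spark's output folder.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = os.path.join(tmp, "demo.parquet")
    os.makedirs(demo_dir)
    with open(os.path.join(demo_dir, "part-00000.parquet"), "w") as f:
        f.write("placeholder")
    archive_path = zip_directory(demo_dir)
    archive_exists = os.path.exists(archive_path)
    archive_name = os.path.basename(archive_path)

print(archive_name, archive_exists)
```

In this notebook the call would presumably look like `zip_directory("amazon_vine_reviews.parquet")`, with the returned path passed to the download step.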

In [13]:
from google.colab import files
files.download("/content/amazon_vine_reviews.parquet")