## ETL Notebook:

A starting point for extracting and transforming gzip-compressed (.gz) JSON files with PySpark.

In [4]:
import pandas as pd
import numpy as np
import json
import gzip
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
from pyspark.sql.types import *

In [5]:
# Create Spark session 
spark = SparkSession \
    .builder \
    .master("local[*]") \
    .appName('test') \
    .getOrCreate()

sc = spark.sparkContext

Run a shell cell (`%%sh`) to decompress the file with the `gzip` command-line tool (Python's [gzip](https://docs.python.org/3/library/gzip.html) module is an alternative):

In [None]:
# %%sh

# gzip -d Electronics_5.json.gz
# ls -lhtr
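
If shelling out is not convenient, the first records can also be read straight from the archive with Python's standard-library `gzip` module. This is a minimal sketch, assuming the archive holds newline-delimited JSON; the helper name `peek_gzip_lines` is hypothetical and not part of the notebook.

```python
import gzip
from itertools import islice


def peek_gzip_lines(path, n=3):
    """Return the first n lines of a gzip-compressed text file.

    The file is streamed, so nothing is decompressed to disk.
    """
    with gzip.open(path, mode="rt", encoding="utf-8") as f:
        return [line.rstrip("\n") for line in islice(f, n)]


# Hypothetical usage against the archive named in this notebook:
# for line in peek_gzip_lines("Electronics_5.json.gz"):
#     print(line)
```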

View the first few records (the file holds one JSON object per line):

In [6]:
N = 3  # number of records to preview
with open("Electronics_5.json") as f:
    for _ in range(N):
        print(f.readline(), end="")

{"overall": 5.0, "vote": "67", "verified": true, "reviewTime": "09 18, 1999", "reviewerID": "AAP7PPBU72QFM", "asin": "0151004714", "style": {"Format:": " Hardcover"}, "reviewerName": "D. C. Carrad", "reviewText": "This is the best novel I have read in 2 or 3 years.  It is everything that fiction should be -- beautifully written, engaging, well-plotted and structured.  It has several layers of meanings -- historical, family,  philosophical and more -- and blends them all skillfully and interestingly.  It makes the American grad student/writers' workshop \"my parents were  mean to me and then my professors were mean to me\" trivia look  childish and silly by comparison, as they are.\nAnyone who says this is an  adolescent girl's coming of age story is trivializing it.  Ignore them.  Read this book if you love literature.\nI was particularly impressed with  this young author's grasp of the meaning and texture of the lost world of  French Algeria in the 1950's and '60's...particularly poig
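
Before writing a schema, it can help to parse one record and list each field with the Python type that `json.loads` assigns to it. The sketch below is illustrative only: `describe_record` is a hypothetical helper, and `sample` is the first record from the output above with `reviewText` shortened and the truncated tail omitted.

```python
import json


def describe_record(line):
    """Map each top-level field of one JSON record to its Python type name."""
    record = json.loads(line)
    return {key: type(value).__name__ for key, value in record.items()}


# First record from the output above, with reviewText shortened for brevity.
sample = (
    '{"overall": 5.0, "vote": "67", "verified": true, '
    '"reviewTime": "09 18, 1999", "reviewerID": "AAP7PPBU72QFM", '
    '"asin": "0151004714", "style": {"Format:": " Hardcover"}, '
    '"reviewerName": "D. C. Carrad", '
    '"reviewText": "This is the best novel I have read in 2 or 3 years."}'
)

for field, type_name in describe_record(sample).items():
    print(f"{field}: {type_name}")
```

In this record `overall` parses as a float, `verified` as a bool, `vote` as a string, and `style` as a nested object.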

Copy the first record from the output above into the [Schema Generator](https://preetranjan.github.io/pyspark-schema-generator/) to produce a schema like the one below:

The schema below *should* work for all of the k-core datasets. Note that Spark silently drops any JSON fields that a supplied schema does not list.

In [7]:
schema = StructType([
    StructField('overall', FloatType(), True),  # changed from StringType to FloatType
    StructField('vote', StringType(), True),
    StructField('verified', BooleanType(), True),
    StructField('reviewTime', StringType(), True),
    StructField('reviewerID', StringType(), True),
    StructField('asin', StringType(), True),
    StructField('style', StructType([StructField('Format:', StringType(), True)]), True),
    StructField('reviewerName', StringType(), True),
    StructField('reviewText', StringType(), True),
    StructField('summary', StringType(), True),
    StructField('unixReviewTime', IntegerType(), True),
])
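
One way to test whether the schema really fits a given dataset is to scan a sample of records for field names it does not list, since Spark would drop those without warning. This is a rough sketch in plain Python: the two sets mirror the schema above, while `unexpected_fields` and its default sample size are hypothetical choices.

```python
import json
from itertools import islice

# Top-level field names and nested `style` keys listed in the schema above.
EXPECTED_FIELDS = {
    "overall", "vote", "verified", "reviewTime", "reviewerID", "asin",
    "style", "reviewerName", "reviewText", "summary", "unixReviewTime",
}
EXPECTED_STYLE_KEYS = {"Format:"}


def unexpected_fields(lines, limit=10_000):
    """Collect field names in the first `limit` records that the schema omits."""
    extra = set()
    for line in islice(lines, limit):
        record = json.loads(line)
        extra |= set(record) - EXPECTED_FIELDS
        style = record.get("style") or {}  # `style` may be missing or null
        extra |= {f"style.{key}" for key in set(style) - EXPECTED_STYLE_KEYS}
    return extra


# Hypothetical usage on the decompressed file:
# with open("Electronics_5.json") as f:
#     print(unexpected_fields(f))
```

An empty result for the sample suggests the schema covers it; any names returned are candidates for extra `StructField` entries.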

Pass the schema explicitly so Spark does not have to infer it, which would require an extra pass over the file:

In [8]:
# 5 core electronics data
e5_core_df = spark \
    .read \
    .json("Electronics_5.json", schema = schema)

e5_core_df.printSchema()

root
 |-- overall: float (nullable = true)
 |-- vote: string (nullable = true)
 |-- verified: boolean (nullable = true)
 |-- reviewTime: string (nullable = true)
 |-- reviewerID: string (nullable = true)
 |-- asin: string (nullable = true)
 |-- style: struct (nullable = true)
 |    |-- Format:: string (nullable = true)
 |-- reviewerName: string (nullable = true)
 |-- reviewText: string (nullable = true)
 |-- summary: string (nullable = true)
 |-- unixReviewTime: integer (nullable = true)



Show the first 20 rows (the default for `show()`, which also truncates long strings):

In [9]:
e5_core_df.show()

+-------+----+--------+-----------+--------------+----------+-----------------+--------------------+--------------------+--------------------+--------------+
|overall|vote|verified| reviewTime|    reviewerID|      asin|            style|        reviewerName|          reviewText|             summary|unixReviewTime|
+-------+----+--------+-----------+--------------+----------+-----------------+--------------------+--------------------+--------------------+--------------+
|    5.0|  67|    true|09 18, 1999| AAP7PPBU72QFM|0151004714|     { Hardcover}|        D. C. Carrad|This is the best ...|      A star is born|     937612800|
|    3.0|   5|    true|10 23, 2013|A2E168DTVGE6SV|0151004714|{ Kindle Edition}|                 Evy|Pages and pages o...|A stream of consc...|    1382486400|
|    5.0|   4|   false| 09 2, 2008|A1ER5AYS3FQ9O3|0151004714|     { Paperback}|               Kcorn|This is the kind ...|I'm a huge fan of...|    1220313600|
|    5.0|  13|   false| 09 4, 2000|A1T17LMQABMBN5|01
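
The `reviewTime` strings and `unixReviewTime` integers in the rows above encode the same dates. Here is a small standard-library sketch of that conversion, assuming the timestamps are midnight UTC (which holds for the three complete rows shown); the function names are hypothetical.

```python
from datetime import datetime, timezone


def parse_review_time(text):
    """Parse strings such as '09 18, 1999' or '09 2, 2008' into a date."""
    return datetime.strptime(text, "%m %d, %Y").date()


def unix_to_date(seconds):
    """Convert a Unix timestamp in seconds to a UTC calendar date."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).date()


# (reviewTime, unixReviewTime) pairs copied from the rows above.
rows = [
    ("09 18, 1999", 937612800),
    ("10 23, 2013", 1382486400),
    ("09 2, 2008", 1220313600),
]
for review_time, unix_time in rows:
    print(parse_review_time(review_time), unix_to_date(unix_time))
```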

Check the number of partitions:

In [10]:
e5_core_df.rdd.getNumPartitions()

32
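
The partition count depends on the file size and on the number of cores Spark sees. As a rough model only, Spark sizes file splits from `spark.sql.files.maxPartitionBytes` (default 128 MB) and `spark.sql.files.openCostInBytes` (default 4 MB). The sketch below approximates that calculation for a single uncompressed file and may differ from what a particular Spark version does. `estimate_partitions` is a hypothetical helper, and the 4 GiB size and 8 cores in the example are made-up inputs, not measurements from this notebook.

```python
import math

MAX_PARTITION_BYTES = 128 * 1024 * 1024  # default spark.sql.files.maxPartitionBytes
OPEN_COST_BYTES = 4 * 1024 * 1024        # default spark.sql.files.openCostInBytes


def estimate_partitions(file_bytes, cores):
    """Approximate the number of read partitions for one splittable file."""
    bytes_per_core = (file_bytes + OPEN_COST_BYTES) // cores
    split_bytes = min(MAX_PARTITION_BYTES, max(OPEN_COST_BYTES, bytes_per_core))
    return math.ceil(file_bytes / split_bytes)


# Hypothetical inputs: a 4 GiB file read with 8 local cores.
print(estimate_partitions(4 * 1024**3, 8))
```

The real value should always be confirmed with `getNumPartitions()`. A gzip-compressed file read directly is not splittable, so it would arrive as a single partition.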

Check the shape as (rows, columns):

In [11]:
print((e5_core_df.count(), len(e5_core_df.columns)))

(6739590, 11)