### Amazon Electronics Analysis

In [1]:
import pandas as pd
import numpy as np
import json
import gzip
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql.functions import *
import re


In [2]:
# Create Spark session 
spark = SparkSession \
    .builder \
    .master("local[*]") \
    .appName('Milestone_I') \
    .getOrCreate()

sc = spark.sparkContext

23/06/07 17:36:28 WARN Utils: Your hostname, MBP.local resolves to a loopback address: 127.0.0.1; using 192.168.0.29 instead (on interface en0)
23/06/07 17:36:28 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/06/07 17:36:29 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


### Meta electronics file for price data

View first few rows of JSON:

In [3]:
N = 2
with open("meta_Electronics.json") as f:
    for i in range(0,N):
        print(f.readline(), end = '')

{"category": ["Electronics", "Camera &amp; Photo", "Video Surveillance", "Surveillance Systems", "Surveillance DVR Kits"], "tech1": "", "description": ["The following camera brands and models have been tested for compatibility with GV-Software.\nGeoVision \tACTi \tArecont Vision \tAXIS \tBosch \tCanon\nCNB \tD-Link \tEtroVision \tHikVision \tHUNT \tIQEye\nJVC \tLG \tMOBOTIX \tPanasonic \tPelco \tSamsung\nSanyo \tSony \tUDP \tVerint \tVIVOTEK \t \n \nCompatible Standard and Protocol\nGV-System also allows for integration with all other IP video devices compatible with ONVIF(V2.0), PSIA (V1.1) standards, or RTSP protocol.\nONVIF \tPSIA \tRTSP \t  \t  \t \nNote: Specifications are subject to change without notice. Every effort has been made to ensure that the information on this Web site is accurate. No liability is assumed for incidental or consequential damages arising from the use of the information or products contained herein."], "fit": "", "title": "Genuine Geovision 1 Channel 3rd P

[Generate a schema](https://preetranjan.github.io/pyspark-schema-generator/) and load the file into a Spark DataFrame:

In [4]:
schema = StructType([StructField('category',ArrayType(StringType()),True),  
StructField('tech1',StringType(),True),  
StructField('description',ArrayType(StringType()),True),  
StructField('fit',StringType(),True),  
StructField('title',StringType(),True),  
StructField('also_buy',ArrayType(StringType()),True),  
StructField('tech2',StringType(),True),  
StructField('brand',StringType(),True),  
StructField('feature',ArrayType(StringType()),True),  
StructField('rank',ArrayType(StringType()),True),  
StructField('also_view',ArrayType(StringType()),True),  
StructField('main_cat',StringType(),True),  
StructField('similar_item',StringType(),True),  
StructField('date',StringType(),True),  
StructField('price',StringType(),True),  
StructField('asin',StringType(),True),  
StructField('imageURL',ArrayType(StringType()),True),  
StructField('imageURLHighRes',ArrayType(StringType()),True)])

The generated schema required some editing: `StringType()` was added as the element type for the `ArrayType()` fields that the generator had produced with `null` as the parameter.

In [5]:
meta_elect_df = spark \
    .read \
    .format("json") \
    .load("meta_Electronics.json", schema = schema)

meta_elect_df.printSchema()

root
 |-- category: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- tech1: string (nullable = true)
 |-- description: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- fit: string (nullable = true)
 |-- title: string (nullable = true)
 |-- also_buy: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- tech2: string (nullable = true)
 |-- brand: string (nullable = true)
 |-- feature: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- rank: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- also_view: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- main_cat: string (nullable = true)
 |-- similar_item: string (nullable = true)
 |-- date: string (nullable = true)
 |-- price: string (nullable = true)
 |-- asin: string (nullable = true)
 |-- imageURL: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- imageU

Check partitions:

In [6]:
meta_elect_df.rdd.getNumPartitions()

82

Check size:

In [7]:
print((meta_elect_df.count(), len(meta_elect_df.columns)))



(786445, 18)


                                                                                

The count is 423 rows short of the 786,868 products listed on the dataset website, a shortfall of about 0.05%.
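
One way to cross-check the row count outside Spark is to stream the file and count the lines that parse as JSON. The sketch below is only an illustration: `count_json_lines` is a hypothetical helper (not part of the notebook), and it reads one line at a time so the whole file never sits in memory. The last lines redo the shortfall arithmetic from the counts reported above.

```python
import json

def count_json_lines(path):
    """Stream a JSON Lines file; return (non_empty_lines, lines_that_parse)."""
    total = parsed = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            total += 1
            try:
                json.loads(line)
                parsed += 1
            except json.JSONDecodeError:
                pass  # count the line, but not as parsed
    return total, parsed

# Shortfall relative to the product count listed on the dataset website
listed, loaded = 786_868, 786_445
missing = listed - loaded
print(missing, f"{missing / listed:.3%}")  # 423 0.054%
```

Calling `count_json_lines("meta_Electronics.json")` would show whether the gap comes from the file itself or from the load step.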

Look at data:

In [8]:
meta_elect_df.show()

+--------------------+-----+--------------------+---+--------------------+--------------------+-----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+------------------+--------------------+----------+--------------------+--------------------+
|            category|tech1|         description|fit|               title|            also_buy|tech2|               brand|             feature|                rank|           also_view|            main_cat|        similar_item|              date|               price|      asin|            imageURL|     imageURLHighRes|
+--------------------+-----+--------------------+---+--------------------+--------------------+-----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+------------------+--------------------+----------+--------------------+--------------------+
|[Electronics, Cam...|     |[The foll

Pick columns we want:

In [9]:
cols_to_use = ['asin', 'brand', 'main_cat', 'price', 'title']
meta_elect_df.select(cols_to_use).show()

+----------+--------------------+--------------------+--------------------+--------------------+
|      asin|               brand|            main_cat|               price|               title|
+----------+--------------------+--------------------+--------------------+--------------------+
|0011300000|           GeoVision|  Camera &amp; Photo|              $65.00|Genuine Geovision...|
|0043396828|        33 Books Co.|  Camera &amp; Photo|                    |Books "Handbook o...|
|0060009810|Visit Amazon's Ca...|               Books|              $11.49|      One Hot Summer|
|0060219602|Visit Amazon's Di...|               Books|.a-section.a-spac...|Hurray for Hattie...|
|0060786817|Visit Amazon's Lo...|               Books|              $13.95|sex.lies.murder.f...|
|0070524076|Visit Amazon's Al...|               Books|                    |     College Physics|
|0091912407|            ABBY LEE|               Books|               $4.76|Girl with a One-t...|
|0101635370|          Crazy Ca

We only want rows with prices:

In [10]:
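# Regex for prices that contain a dollar sign ($). rlike() matches anywhere in
# the string, so use r'^\$' instead if the value must start with "$".
# Named price_pattern to avoid shadowing expr() from pyspark.sql.functions.
price_pattern = r'\$.*'

# Keep only rows with a dollar-sign price, then select the columns of interest
meta_elect_df = meta_elect_df.filter(meta_elect_df.price.rlike(price_pattern)).select(cols_to_use)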
meta_elect_df.show()

+----------+--------------------+--------------------+---------+--------------------+
|      asin|               brand|            main_cat|    price|               title|
+----------+--------------------+--------------------+---------+--------------------+
|0011300000|           GeoVision|  Camera &amp; Photo|   $65.00|Genuine Geovision...|
|0060009810|Visit Amazon's Ca...|               Books|   $11.49|      One Hot Summer|
|0060786817|Visit Amazon's Lo...|               Books|   $13.95|sex.lies.murder.f...|
|0091912407|            ABBY LEE|               Books|    $4.76|Girl with a One-t...|
|0132492776|     Enter The Arena|Home Audio & Theater|    $7.99|Wireless Bluetoot...|
|0151004714|Visit Amazon's Cl...|               Books|   $13.81|The Last Life: A ...|
|0151014841|Visit Amazon's An...|               Books|    $5.79|        Lady Lazarus|
|0303532572|TDK Electronics Corp|Home Audio &amp; ...|   $48.99|TDK Hi8 MP120 Pre...|
|0312171048|Visit Amazon's Je...|               Books|

In [11]:
# Check dimensions
print((meta_elect_df.count(), len(meta_elect_df.columns)))



(304323, 5)


                                                                                

About 39% of the metadata rows (304,323 of 786,445) have a price.
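
A note on the price filter above: `rlike` does a partial match, so the pattern `\$.*` keeps any value that contains a dollar sign, not only values that start with one. The sketch below uses Python's `re.search` as a stand-in for Spark's regex engine (an assumption; the two agree for a pattern this simple), and the sample strings are hypothetical.

```python
import re

partial = re.compile(r"\$.*")  # same pattern as the filter above
anchored = re.compile(r"^\$")  # strict "starts with $" variant

# Hypothetical price strings, not taken from the dataset
samples = ["$65.00", "", "from $4.76", ".a-section {}"]

kept_partial = [s for s in samples if partial.search(s)]
kept_anchored = [s for s in samples if anchored.search(s)]

print(kept_partial)   # ['$65.00', 'from $4.76']
print(kept_anchored)  # ['$65.00']
```

If stray text with an embedded `$` turns out to be present in `price`, the anchored pattern would be the safer choice.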

### Electronics data subset (5-core)

Read first few lines of JSON: 

In [12]:
# Read only the first N lines: loading the entire ~4 GB file into 8 GB of available RAM causes major slowdowns.

N = 2
with open("Electronics_5.json") as f:
    for i in range(0,N):
        print(f.readline(), end = '')


{"overall": 5.0, "vote": "67", "verified": true, "reviewTime": "09 18, 1999", "reviewerID": "AAP7PPBU72QFM", "asin": "0151004714", "style": {"Format:": " Hardcover"}, "reviewerName": "D. C. Carrad", "reviewText": "This is the best novel I have read in 2 or 3 years.  It is everything that fiction should be -- beautifully written, engaging, well-plotted and structured.  It has several layers of meanings -- historical, family,  philosophical and more -- and blends them all skillfully and interestingly.  It makes the American grad student/writers' workshop \"my parents were  mean to me and then my professors were mean to me\" trivia look  childish and silly by comparison, as they are.\nAnyone who says this is an  adolescent girl's coming of age story is trivializing it.  Ignore them.  Read this book if you love literature.\nI was particularly impressed with  this young author's grasp of the meaning and texture of the lost world of  French Algeria in the 1950's and '60's...particularly poig

Copy first row of JSON output above and use [Schema Generator](https://preetranjan.github.io/pyspark-schema-generator/) to create schema below:

In [13]:
schema = StructType([StructField('overall',FloatType(),True),  # Changed to FloatType from StringType
    StructField('vote',StringType(),True),  
    StructField('verified',BooleanType(),True),  
    StructField('reviewTime',StringType(),True),  
    StructField('reviewerID',StringType(),True),  
    StructField('asin',StringType(),True),  
    StructField('style',StructType([StructField('Format:',StringType(),True)]),True),  
    StructField('reviewerName',StringType(),True),  
    StructField('reviewText',StringType(),True),  
    StructField('summary',StringType(),True),  
    StructField('unixReviewTime',IntegerType(),True)])
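
One caveat with a schema generated from a single record: it only contains the keys present in that record. Here `style` ends up with just `Format:`, and any other key would be dropped on load. A minimal sketch for checking this, assuming a hypothetical helper `collect_keys` and made-up sample records, is to collect the union of keys over a larger sample of lines:

```python
import json

def collect_keys(lines, nested_field="style"):
    """Return (top-level keys, keys inside nested_field) seen across all lines."""
    top, nested = set(), set()
    for line in lines:
        record = json.loads(line)
        top.update(record)
        sub = record.get(nested_field)
        if isinstance(sub, dict):
            nested.update(sub)
    return top, nested

# Hypothetical records for illustration only
sample = [
    '{"overall": 5.0, "asin": "A", "style": {"Format:": " Hardcover"}}',
    '{"overall": 4.0, "asin": "B", "style": {"Color:": " Black"}, "vote": "3"}',
    '{"overall": 3.0, "asin": "C"}',
]
top_keys, style_keys = collect_keys(sample)
print(sorted(top_keys))
print(sorted(style_keys))
```

On the real file this could be run over the first few thousand lines, for example `collect_keys(itertools.islice(f, 10000))` with `f` an open file handle.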

Use the schema so Spark does not have to infer it:

In [25]:
# 5 core electronics data
e5_core_df = spark \
    .read \
    .json("Electronics_5.json", schema = schema)

e5_core_df.printSchema()

root
 |-- overall: float (nullable = true)
 |-- vote: string (nullable = true)
 |-- verified: boolean (nullable = true)
 |-- reviewTime: string (nullable = true)
 |-- reviewerID: string (nullable = true)
 |-- asin: string (nullable = true)
 |-- style: struct (nullable = true)
 |    |-- Format:: string (nullable = true)
 |-- reviewerName: string (nullable = true)
 |-- reviewText: string (nullable = true)
 |-- summary: string (nullable = true)
 |-- unixReviewTime: integer (nullable = true)



Show data:

In [26]:
e5_core_df.show()

+-------+----+--------+-----------+--------------+----------+-----------------+--------------------+--------------------+--------------------+--------------+
|overall|vote|verified| reviewTime|    reviewerID|      asin|            style|        reviewerName|          reviewText|             summary|unixReviewTime|
+-------+----+--------+-----------+--------------+----------+-----------------+--------------------+--------------------+--------------------+--------------+
|    5.0|  67|    true|09 18, 1999| AAP7PPBU72QFM|0151004714|     { Hardcover}|        D. C. Carrad|This is the best ...|      A star is born|     937612800|
|    3.0|   5|    true|10 23, 2013|A2E168DTVGE6SV|0151004714|{ Kindle Edition}|                 Evy|Pages and pages o...|A stream of consc...|    1382486400|
|    5.0|   4|   false| 09 2, 2008|A1ER5AYS3FQ9O3|0151004714|     { Paperback}|               Kcorn|This is the kind ...|I'm a huge fan of...|    1220313600|
|    5.0|  13|   false| 09 4, 2000|A1T17LMQABMBN5|01

Check partitions:

In [27]:
e5_core_df.rdd.getNumPartitions()

32
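
The partition count is probably explained by how Spark splits input files. As a rough sketch, assuming the default `spark.sql.files.maxPartitionBytes` of 128 MB and a file of exactly 4 GiB (the notebook only says "4 gig"), the number of splits is about the file size divided by the split size. Spark's real logic also weighs file open cost and core count, so treat this as an estimate; `estimate_partitions` is a hypothetical helper.

```python
import math

# Default value of spark.sql.files.maxPartitionBytes (128 MB)
DEFAULT_MAX_PARTITION_BYTES = 128 * 1024**2

def estimate_partitions(file_size_bytes, max_partition_bytes=DEFAULT_MAX_PARTITION_BYTES):
    """Approximate number of input splits for one large, splittable file."""
    return math.ceil(file_size_bytes / max_partition_bytes)

print(estimate_partitions(4 * 1024**3))  # 32
```

Under the same assumption, the 82 partitions of the metadata DataFrame would correspond to a file of roughly 10 GiB.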

Check shape:

In [28]:
print((e5_core_df.count(), len(e5_core_df.columns)))



(6739590, 11)


                                                                                

This matches the row count given on the dataset website. Next, pick the columns we want:

In [29]:
cols_to_use = ['asin', 'overall', 'unixReviewTime', 'reviewerID', 'reviewText']
e5_core_df = e5_core_df.select(cols_to_use)
e5_core_df.show()

+----------+-------+--------------+--------------+--------------------+
|      asin|overall|unixReviewTime|    reviewerID|          reviewText|
+----------+-------+--------------+--------------+--------------------+
|0151004714|    5.0|     937612800| AAP7PPBU72QFM|This is the best ...|
|0151004714|    3.0|    1382486400|A2E168DTVGE6SV|Pages and pages o...|
|0151004714|    5.0|    1220313600|A1ER5AYS3FQ9O3|This is the kind ...|
|0151004714|    5.0|     968025600|A1T17LMQABMBN5|What gorgeous lan...|
|0151004714|    3.0|     949622400|A3QHJ0FXK33OBE|I was taken in by...|
|0380709473|    4.0|    1370390400|A3IYSOTP3HA77N|I read this proba...|
|0380709473|    5.0|    1466985600|A11SXV34PZUQ5E|I read every Perr...|
|0380709473|    5.0|    1438214400|A2AUQM1HT2D5T8|I love this serie...|
|0380709473|    5.0|    1424044800|A3UD8JRWLX6SRX|         Great read!|
|0380709473|    4.0|    1384992000|A3MV1KKHX51FYT|Crows Can't Count...|
|0446697192|    5.0|    1247529600|A3LXXYBYUHZWS5|Fresh from Con


Next up is to merge price with the data subset on `asin`.

### Merge data (primary key = asin)

Left join to get metadata:

In [63]:
elect_df = e5_core_df.join(meta_elect_df, on = 'asin', how = 'left')
elect_df.show()
# print('5 core dataset: ', (e5_core_df.count(), len(e5_core_df.columns)))
# print('Joined dataset: ', (elect_df.count(), len(elect_df.columns)))



+----------+-------+--------------+--------------+--------------------+--------------------+------------------+-------+--------------------+
|      asin|overall|unixReviewTime|    reviewerID|          reviewText|               brand|          main_cat|  price|               title|
+----------+-------+--------------+--------------+--------------------+--------------------+------------------+-------+--------------------+
|0446697192|    5.0|    1247529600|A3LXXYBYUHZWS5|Fresh from Connec...|Visit Amazon's Zo...|             Books| $17.99|Hollywood Is like...|
|0446697192|    5.0|    1247184000|A1X4L7AO1BXMHK|I don't know abou...|Visit Amazon's Zo...|             Books| $17.99|Hollywood Is like...|
|0446697192|    3.0|    1251849600|A1Y9RUTH5GG3MU|Obviously the pre...|Visit Amazon's Zo...|             Books| $17.99|Hollywood Is like...|
|0446697192|    4.0|    1251590400| AAR8E3JF9K93P|I am very happy t...|Visit Amazon's Zo...|             Books| $17.99|Hollywood Is like...|
|0446697192| 

                                                                                

Check for missing values (takes a few minutes):

In [20]:
# elect_df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in elect_df.columns]
#    ).show()



+----+-------+----------+----------+----------+-------+--------+-------+-------+
|asin|overall|reviewTime|reviewerID|reviewText|  brand|main_cat|  price|  title|
+----+-------+----------+----------+----------+-------+--------+-------+-------+
|   0|      0|         0|         0|      1380|2389218| 2389218|2389218|2389218|
+----+-------+----------+----------+----------+-------+--------+-------+-------+



                                                                                

Drop any rows with missing values.

In [64]:
elect_df = elect_df.na.drop()
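
A plain-Python sketch of what the left join plus `na.drop()` does, using hypothetical rows: reviews without a metadata match get `None` in the metadata columns and are then dropped, which is close to an inner join. It also shows that a duplicated key on the metadata side multiplies review rows. Whether `meta_elect_df` actually contains duplicate `asin` values is not checked in this notebook, so that part is an assumption worth verifying.

```python
from collections import defaultdict

def left_join(left, right, key):
    """Left join two lists of dicts on `key`; unmatched rows get None values."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    right_cols = {c for row in right for c in row if c != key}
    out = []
    for row in left:
        matches = index.get(row[key])
        if matches:
            out.extend({**row, **m} for m in matches)
        else:
            out.append({**row, **{c: None for c in right_cols}})
    return out

def drop_na(rows):
    """Keep only rows where every value is present."""
    return [r for r in rows if all(v is not None for v in r.values())]

# Hypothetical data: "B" has no metadata, "C" appears twice in the metadata
reviews = [{"asin": "A", "overall": 5.0}, {"asin": "B", "overall": 3.0}, {"asin": "C", "overall": 4.0}]
meta = [{"asin": "A", "price": "$17.99"}, {"asin": "C", "price": "$5.00"}, {"asin": "C", "price": "$5.00"}]

joined = left_join(reviews, meta, "asin")
print(len(joined))           # 4: the "C" review is duplicated
print(len(drop_na(joined)))  # 3: the unmatched "B" review is dropped
```

If duplicates do exist, calling `dropDuplicates(['asin'])` on the metadata before the join would be one way to keep one row per review.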

In [22]:
# elect_df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in elect_df.columns]
#    ).show()

                                                                                

+----+-------+----------+----------+----------+-----+--------+-----+-----+
|asin|overall|reviewTime|reviewerID|reviewText|brand|main_cat|price|title|
+----+-------+----------+----------+----------+-----+--------+-----+-----+
|   0|      0|         0|         0|         0|    0|       0|    0|    0|
+----+-------+----------+----------+----------+-----+--------+-----+-----+



No nulls remain (this check took almost 10 minutes).

### Clean dates (unixReviewTime)

In [66]:
# Convert epoch seconds to a date, then derive year and month.
# Note: from_unixtime() renders in the Spark session time zone, so dates
# can shift by a day relative to UTC.
elect_df = elect_df \
    .withColumn('unixReviewTime', to_date(from_unixtime(col("unixReviewTime")))) \
    .withColumn('year', year("unixReviewTime")) \
    .withColumn('month', month("unixReviewTime"))

elect_df.show()



+----------+-------+--------------+--------------+--------------------+--------------------+---------+------+--------------------+----+-----+
|      asin|overall|unixReviewTime|    reviewerID|          reviewText|               brand| main_cat| price|               title|year|month|
+----------+-------+--------------+--------------+--------------------+--------------------+---------+------+--------------------+----+-----+
|0446697192|    5.0|    2009-07-13|A3LXXYBYUHZWS5|Fresh from Connec...|Visit Amazon's Zo...|    Books|$17.99|Hollywood Is like...|2009|    7|
|0446697192|    5.0|    2009-07-09|A1X4L7AO1BXMHK|I don't know abou...|Visit Amazon's Zo...|    Books|$17.99|Hollywood Is like...|2009|    7|
|0446697192|    3.0|    2009-09-01|A1Y9RUTH5GG3MU|Obviously the pre...|Visit Amazon's Zo...|    Books|$17.99|Hollywood Is like...|2009|    9|
|0446697192|    4.0|    2009-08-29| AAR8E3JF9K93P|I am very happy t...|Visit Amazon's Zo...|    Books|$17.99|Hollywood Is like...|2009|    8|
|04466
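
The converted dates above look one day earlier than the UTC dates of the raw timestamps. `from_unixtime` formats in the Spark session time zone, so a session running west of UTC turns midnight UTC into the previous evening. The sketch below checks this with the standard library; the UTC-7 offset is a hypothetical stand-in for the machine's local zone, which the notebook does not state.

```python
from datetime import datetime, timedelta, timezone

ts = 1247529600        # unixReviewTime of the first joined row shown above
ts_first = 937612800   # first raw review, whose reviewTime is "09 18, 1999"

utc_date = datetime.fromtimestamp(ts, tz=timezone.utc).date()
west_of_utc = timezone(timedelta(hours=-7))  # hypothetical local zone
local_date = datetime.fromtimestamp(ts, tz=west_of_utc).date()

print(utc_date)    # 2009-07-14
print(local_date)  # 2009-07-13, matching the Spark output above
print(datetime.fromtimestamp(ts_first, tz=timezone.utc).date())  # 1999-09-18
```

Setting `spark.sql.session.timeZone` to `UTC` before the conversion would be one way to keep the dates aligned with the original `reviewTime` strings. Reviews at month or year boundaries would otherwise land in the wrong `month` or `year` bucket.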

                                                                                

### Clean Price

In [72]:
elect_df = elect_df.withColumn('price', regexp_replace('price', '[$,]', '').cast('double'))
elect_df.printSchema()
elect_df.show()

root
 |-- asin: string (nullable = true)
 |-- overall: float (nullable = true)
 |-- unixReviewTime: date (nullable = true)
 |-- reviewerID: string (nullable = true)
 |-- reviewText: string (nullable = true)
 |-- brand: string (nullable = true)
 |-- main_cat: string (nullable = true)
 |-- price: double (nullable = true)
 |-- title: string (nullable = true)
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = true)



                                                                                

+----------+-------+--------------+--------------+--------------------+--------------------+---------+-----+--------------------+----+-----+
|      asin|overall|unixReviewTime|    reviewerID|          reviewText|               brand| main_cat|price|               title|year|month|
+----------+-------+--------------+--------------+--------------------+--------------------+---------+-----+--------------------+----+-----+
|0446697192|    5.0|    2009-07-13|A3LXXYBYUHZWS5|Fresh from Connec...|Visit Amazon's Zo...|    Books|17.99|Hollywood Is like...|2009|    7|
|0446697192|    5.0|    2009-07-09|A1X4L7AO1BXMHK|I don't know abou...|Visit Amazon's Zo...|    Books|17.99|Hollywood Is like...|2009|    7|
|0446697192|    3.0|    2009-09-01|A1Y9RUTH5GG3MU|Obviously the pre...|Visit Amazon's Zo...|    Books|17.99|Hollywood Is like...|2009|    9|
|0446697192|    4.0|    2009-08-29| AAR8E3JF9K93P|I am very happy t...|Visit Amazon's Zo...|    Books|17.99|Hollywood Is like...|2009|    8|
|0446697192| 
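
A caveat on the price cleaning above: stripping `$` and `,` and casting to double only works for single amounts. With Spark's default (non-ANSI) cast, anything that is still not a number after stripping becomes null, and since this step runs after `na.drop()`, such nulls would stay in the data. The sketch below approximates that behavior in plain Python; `clean_price` is a hypothetical helper, and the price-range example is an assumption, not a value confirmed to be in this dataset.

```python
import re

def clean_price(raw):
    """Strip '$' and ',' then convert to float; None where conversion fails."""
    stripped = re.sub(r"[$,]", "", raw)
    try:
        return float(stripped)
    except ValueError:
        return None  # stands in for the null Spark's cast would produce

for raw in ["$17.99", "$1,299.00", "$10.00 - $20.00"]:
    print(raw, "->", clean_price(raw))
```

Counting nulls in `price` after the cast would show whether any rows are affected. Aggregations such as `sum` and `avg` skip nulls silently.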

### A bit of exploration

Dimension check (after dropping missing data):

In [40]:
print("# rows: ", elect_df.count())



# rows =  4585331


                                                                                

Down from 6.74 million reviews to 4.59 million.

How many unique items are there?

In [41]:
elect_df.select('asin').distinct().count()

                                                                                

78133

How many unique customers?

In [42]:
elect_df.select('reviewerID').distinct().count()

                                                                                

722689

About 723K unique customers.

In [43]:
print('Columns overview')
pd.DataFrame(elect_df.dtypes, columns = ['Column Name','Data type'])

Columns overview


| | Column Name | Data type |
|---|---|---|
| 0 | asin | string |
| 1 | overall | float |
| 2 | unixReviewTime | string |
| 3 | reviewerID | string |
| 4 | reviewText | string |
| 5 | brand | string |
| 6 | main_cat | string |
| 7 | price | string |
| 8 | title | string |


What is the frequency of reviews by category? (Possible customer segments?)

In [44]:
elect_df.groupBy('main_cat').count().orderBy('count', ascending = False).toPandas()

23/06/07 18:21:39 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.

| | main_cat | count |
|---|---|---|
| 0 | Computers | 1502170 |
| 1 | All Electronics | 824431 |
| 2 | Home Audio & Theater | 687849 |
| 3 | Camera & Photo | 675530 |
| 4 | Cell Phones & Accessories | 339843 |
| 5 | Car Electronics | 95802 |
| 6 | `Camera &amp; Photo` | 72940 |
| 7 | `Home Audio &amp; Theater` | 70590 |
| 8 | Tools & Home Improvement | 52898 |
| 9 | Office Products | 47290 |
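
The table above counts some categories twice because part of the data carries the HTML entity `&amp;` instead of `&`. A small sketch of merging them with the standard library's `html.unescape`, using the four affected counts from the table:

```python
import html
from collections import Counter

# Counts copied from the category table above
counts = {
    "Camera & Photo": 675530,
    "Camera &amp; Photo": 72940,
    "Home Audio & Theater": 687849,
    "Home Audio &amp; Theater": 70590,
}

merged = Counter()
for name, n in counts.items():
    merged[html.unescape(name)] += n  # "&amp;" becomes "&"

print(dict(merged))
# {'Camera & Photo': 748470, 'Home Audio & Theater': 758439}
```

In Spark, the same normalization could be applied before grouping, for example with `regexp_replace('main_cat', '&amp;', '&')`.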


In [47]:
elect_df.groupBy('brand').count().orderBy('count', ascending = False).toPandas()

23/06/07 18:27:22 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.

| | brand | count |
|---|---|---|
| 0 | Logitech | 106331 |
| 1 | SanDisk | 102637 |
| 2 | Sony | 80910 |
| 3 | Samsung | 75529 |
| 4 | AmazonBasics | 69277 |
| ... | ... | ... |
| 11043 | Kashimura | 3 |
| 11044 | Venturer | 3 |
| 11045 | ToolUSA | 3 |
| 11046 | Deer River | 2 |


Summary by year:

In [73]:
# Reviews per year, with the listed price summed and averaged per review
elect_df.groupBy('year') \
    .agg(count("year").alias("num_reviews"),
         sum("price").alias("total$"),
         avg("price").alias("avg_order$")) \
    .orderBy('year') \
    .show(truncate=False)

23/06/07 19:27:04 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.

+----+-----------+--------------------+------------------+
|year|num_reviews|total$              |avg_order$        |
+----+-----------+--------------------+------------------+
|1999|65         |1347.7299999999996  |20.734307692307684|
|2000|595        |12513.22            |21.030621848739496|
|2001|1055       |32970.600000000006  |31.251753554502375|
|2002|1679       |48750.92            |29.070316040548597|
|2003|2792       |88354.34999999998   |31.668225806451606|
|2004|4303       |154821.74           |35.97995352079944 |
|2005|6837       |361747.2000000001   |53.03433514147488 |
|2006|10520      |418606.4400000002   |39.80283731102027 |
|2007|24139      |908392.2800000019   |37.687934282039656|
|2008|36869      |1536533.7599999977  |41.713961178227166|
|2009|50865      |1887129.9899999986  |37.14018598335004 |
|2010|61805      |1883727.9000000008  |30.506703051110982|
|2011|97290      |2639615.550000003   |27.17722906327866 |
|2012|172247     |4695731.760000022   |27.30743414089499

Most reviews come from the later years. Using the listed price of each reviewed item as a proxy for spend, reviewers spent a lot on electronics between 2014 and 2016, and the average order during that time was around $25. Other interesting values would be the top category or top brands by year.

Next: calculate customer lifetime value (CLV) for one year, using the `year` values. First, [partition](https://sparkbyexamples.com/spark/spark-read-write-dataframe-parquet-example/) and save the data:

In [76]:
elect_df.write.partitionBy("year").parquet("electronics_cleaned.parquet")

23/06/07 19:46:46 WARN MemoryManager: Total allocation exceeds 95.00% (929,405,326 bytes) of heap memory
Scaling row group sizes to 98.92% for 7 writers
23/06/07 19:46:46 WARN MemoryManager: Total allocation exceeds 95.00% (929,405,326 bytes) of heap memory
Scaling row group sizes to 86.56% for 8 writers

Deeper analysis in separate notebook.