## Section 5: Apache Spark 1.6 - Core Spark APIs - Get Daily Revenue Per Product

1. Use the retail_db data set
2. Problem Statement
    * Get daily revenue per product, considering only COMPLETE and CLOSED orders
    * Data needs to be sorted in ascending order by date and then in descending order by the revenue computed for each product for each day
    * Data should be delimited by "," in this order: [order_date,daily_revenue_per_product,product_name]
    * Data for orders and order_items is available in HDFS
        - <path>/orders
        - <path>/order_items
    * Data for products is available locally under
        - <path>/products
3. Final output needs to be stored under
    * HDFS location - Avro format
        - <path>/daily_revenue_avro_python
    * HDFS location - text format
        - <path>/daily_revenue_python
    * Local location
        - <path>/daily_revenue_python
    * Solution needs to be stored under
        - <local path>/daily_revenue_python.txt

#### Launch Pyspark

- Understand the environment and use resources optimally
- Understand the capacity of the cluster
- The ResourceManager web interface is hosted on port 8088 of the ResourceManager host, e.g. 192.168.1.109:8088
- If the address is not known, go to yarn-site.xml and look up the property yarn.resourcemanager.webapp.address (or yarn.resourcemanager.webapp.https.address when HTTPS is enabled)
- In the UI find
    * Memory Total
    * VCores Total
- Determine the size of the data to decide how much capacity to use
    * du -s -h file_path
    * hdfs dfs -du -s -h hdfs_path
- Launch pyspark
    ***
    pyspark --master yarn --conf spark.ui.port=12369 --num-executors 2 --executor-memory 512m
    ***
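
To relate the launch flags to the Memory Total figure in the ResourceManager UI, a rough estimate of what YARN will reserve can help. The sketch below assumes Spark's default per-executor memory overhead of max(384 MB, 10% of executor memory). The function name `estimate_yarn_memory_mb` is hypothetical, and the rounding YARN applies to its own allocation increment is ignored, as is the memory for the driver or application master.

```python
def estimate_yarn_memory_mb(num_executors, executor_memory_mb,
                            overhead_fraction=0.10, min_overhead_mb=384):
    """Rough estimate (in MB) of the memory YARN reserves for the executors.

    Assumption: overhead per executor is max(min_overhead_mb,
    overhead_fraction * executor_memory_mb), which matches Spark's
    documented default. Container rounding is not modelled.
    """
    overhead_mb = max(min_overhead_mb, int(executor_memory_mb * overhead_fraction))
    return num_executors * (executor_memory_mb + overhead_mb)


# Values taken from the launch command above: 2 executors with 512 MB each.
requested_mb = estimate_yarn_memory_mb(num_executors=2, executor_memory_mb=512)
print(requested_mb)
```

Comparing this estimate with Memory Total gives a quick sanity check that the session leaves room for other jobs on the cluster.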

In [1]:
def displayRDD(rDDName):
    print('RDD Content:')
    for i in rDDName.take(10):
        print(i)
    print('RDD Count:',rDDName.count())

In [2]:
# Read orders data from HDFS into an RDD
orders = sc.textFile('/user/pi/retail_db/orders')
displayRDD(orders)

RDD Content:
1,2013-07-25 00:00:00.0,11599,CLOSED
2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT
3,2013-07-25 00:00:00.0,12111,COMPLETE
4,2013-07-25 00:00:00.0,8827,CLOSED
5,2013-07-25 00:00:00.0,11318,COMPLETE
6,2013-07-25 00:00:00.0,7130,COMPLETE
7,2013-07-25 00:00:00.0,4530,COMPLETE
8,2013-07-25 00:00:00.0,2911,PROCESSING
9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT
10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT
RDD Count: 68883


In [3]:
# Read order_items data from HDFS into an RDD
order_items = sc.textFile('/user/pi/retail_db/order_items')
displayRDD(order_items)

RDD Content:
1,1,957,1,299.98,299.98
2,2,1073,1,199.99,199.99
3,2,502,5,250.0,50.0
4,2,403,1,129.99,129.99
5,4,897,2,49.98,24.99
6,4,365,5,299.95,59.99
7,4,502,3,150.0,50.0
8,4,1014,4,199.92,49.98
9,5,957,1,299.98,299.98
10,5,365,5,299.95,59.99
RDD Count: 172198


In [4]:
# Read product data from the local file system and distribute it as an RDD
with open("/home/pi/shared/retail_db/products/part-00000") as f:
    productsRaw = f.read().splitlines()
products = sc.parallelize(productsRaw)
displayRDD(products)

RDD Content:
1,2,Quest Q64 10 FT. x 10 FT. Slant Leg Instant U,,59.98,http://images.acmesports.sports/Quest+Q64+10+FT.+x+10+FT.+Slant+Leg+Instant+Up+Canopy
2,2,Under Armour Men's Highlight MC Football Clea,,129.99,http://images.acmesports.sports/Under+Armour+Men%27s+Highlight+MC+Football+Cleat
3,2,Under Armour Men's Renegade D Mid Football Cl,,89.99,http://images.acmesports.sports/Under+Armour+Men%27s+Renegade+D+Mid+Football+Cleat
4,2,Under Armour Men's Renegade D Mid Football Cl,,89.99,http://images.acmesports.sports/Under+Armour+Men%27s+Renegade+D+Mid+Football+Cleat
5,2,Riddell Youth Revolution Speed Custom Footbal,,199.99,http://images.acmesports.sports/Riddell+Youth+Revolution+Speed+Custom+Football+Helmet
6,2,Jordan Men's VI Retro TD Football Cleat,,134.99,http://images.acmesports.sports/Jordan+Men%27s+VI+Retro+TD+Football+Cleat
7,2,Schutt Youth Recruit Hybrid Custom Football H,,99.99,http://images.acmesports.sports/Schutt+Youth+Recruit+Hybrid+Custom+Football+Helmet+2014
8,2,Nike M

In [5]:
# Keep only COMPLETE and CLOSED orders
# First, list the distinct status values
orders.map(lambda o:o.split(",")[3]).distinct().collect()

['CLOSED',
 'CANCELED',
 'COMPLETE',
 'PENDING_PAYMENT',
 'SUSPECTED_FRAUD',
 'PENDING',
 'ON_HOLD',
 'PROCESSING',
 'PAYMENT_REVIEW']

In [6]:
ordersFiltered=orders.filter(lambda o:o.split(",")[3] in ['CLOSED','COMPLETE'])
displayRDD(ordersFiltered)

RDD Content:
1,2013-07-25 00:00:00.0,11599,CLOSED
3,2013-07-25 00:00:00.0,12111,COMPLETE
4,2013-07-25 00:00:00.0,8827,CLOSED
5,2013-07-25 00:00:00.0,11318,COMPLETE
6,2013-07-25 00:00:00.0,7130,COMPLETE
7,2013-07-25 00:00:00.0,4530,COMPLETE
12,2013-07-25 00:00:00.0,1837,CLOSED
15,2013-07-25 00:00:00.0,2568,COMPLETE
17,2013-07-25 00:00:00.0,2667,COMPLETE
18,2013-07-25 00:00:00.0,1205,CLOSED
RDD Count: 30455
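
The filter predicate can be checked without a cluster. This is a minimal plain-Python sketch of the same logic, using a few of the order lines shown above; the names `WANTED_STATUSES` and `is_wanted` are illustrative and not part of the notebook.

```python
# A few order lines copied from the output above.
sample_orders = [
    "1,2013-07-25 00:00:00.0,11599,CLOSED",
    "2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT",
    "3,2013-07-25 00:00:00.0,12111,COMPLETE",
    "8,2013-07-25 00:00:00.0,2911,PROCESSING",
]

# A set gives constant-time membership tests.
WANTED_STATUSES = {"CLOSED", "COMPLETE"}


def is_wanted(order_line):
    """Return True when the order status (4th field) is CLOSED or COMPLETE."""
    return order_line.split(",")[3] in WANTED_STATUSES


kept_ids = [int(line.split(",")[0]) for line in sample_orders if is_wanted(line)]
print(kept_ids)
```

The same function could be passed to `orders.filter(is_wanted)` in place of the lambda.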


In [7]:
ordersPairRdd = ordersFiltered.map(lambda o:(int(o.split(",")[0]),o.split(',')[1]))
displayRDD(ordersPairRdd)

RDD Content:
(1, '2013-07-25 00:00:00.0')
(3, '2013-07-25 00:00:00.0')
(4, '2013-07-25 00:00:00.0')
(5, '2013-07-25 00:00:00.0')
(6, '2013-07-25 00:00:00.0')
(7, '2013-07-25 00:00:00.0')
(12, '2013-07-25 00:00:00.0')
(15, '2013-07-25 00:00:00.0')
(17, '2013-07-25 00:00:00.0')
(18, '2013-07-25 00:00:00.0')
RDD Count: 30455


In [8]:
orderItemsPairRdd=order_items.map(lambda oi:(int(oi.split(",")[1]),(int(oi.split(",")[2]),float(oi.split(",")[4]))))
displayRDD(orderItemsPairRdd)

RDD Content:
(1, (957, 299.98))
(2, (1073, 199.99))
(2, (502, 250.0))
(2, (403, 129.99))
(4, (897, 49.98))
(4, (365, 299.95))
(4, (502, 150.0))
(4, (1014, 199.92))
(5, (957, 299.98))
(5, (365, 299.95))
RDD Count: 172198


In [9]:
ordersJoin=ordersPairRdd.join(orderItemsPairRdd)
displayRDD(ordersJoin)

RDD Content:
(35188, ('2014-02-27 00:00:00.0', (627, 79.98)))
(35192, ('2014-02-27 00:00:00.0', (642, 120.0)))
(35196, ('2014-02-27 00:00:00.0', (572, 39.99)))
(35200, ('2014-02-27 00:00:00.0', (1073, 199.99)))
(35228, ('2014-02-27 00:00:00.0', (1014, 249.9)))
(35228, ('2014-02-27 00:00:00.0', (1004, 399.98)))
(35232, ('2014-02-27 00:00:00.0', (502, 100.0)))
(35248, ('2014-02-27 00:00:00.0', (1014, 199.92)))
(35264, ('2014-02-27 00:00:00.0', (502, 200.0)))
(35264, ('2014-02-27 00:00:00.0', (365, 239.96)))
RDD Count: 75408
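
The count drops from 172198 order items to 75408 joined records because `join` is an inner join: only keys present in both RDDs survive. This is a rough plain-Python emulation of that behaviour on a handful of the pairs shown earlier. The helper `inner_join` is hypothetical and does not reproduce Spark's partitioning or output order.

```python
from collections import defaultdict


def inner_join(left, right):
    """Emulate an inner join on lists of (key, value) pairs."""
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)
    # One output row per matching (left value, right value) combination.
    return [(key, (lv, rv))
            for key, lv in left
            for rv in right_by_key.get(key, [])]


# Filtered orders: order 2 (PENDING_PAYMENT) is already gone.
orders_pairs = [
    (1, "2013-07-25 00:00:00.0"),
    (3, "2013-07-25 00:00:00.0"),
    (4, "2013-07-25 00:00:00.0"),
    (5, "2013-07-25 00:00:00.0"),
]
# Order items: none for order 3 in this sample.
order_items_pairs = [
    (1, (957, 299.98)),
    (2, (1073, 199.99)),
    (4, (897, 49.98)),
    (4, (365, 299.95)),
    (5, (957, 299.98)),
]

joined = inner_join(orders_pairs, order_items_pairs)
for row in joined:
    print(row)
```

In this sample, the items of order 2 are dropped because that order was filtered out, and order 3 is dropped because it has no items.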


In [10]:
dailyRevenueProducts = ordersJoin.map(lambda t:(t[1][1][0],(t[1][0],t[1][1][1])))
displayRDD(dailyRevenueProducts)

RDD Content:
(627, ('2014-02-27 00:00:00.0', 79.98))
(642, ('2014-02-27 00:00:00.0', 120.0))
(572, ('2014-02-27 00:00:00.0', 39.99))
(1073, ('2014-02-27 00:00:00.0', 199.99))
(1014, ('2014-02-27 00:00:00.0', 249.9))
(1004, ('2014-02-27 00:00:00.0', 399.98))
(502, ('2014-02-27 00:00:00.0', 100.0))
(1014, ('2014-02-27 00:00:00.0', 199.92))
(502, ('2014-02-27 00:00:00.0', 200.0))
(365, ('2014-02-27 00:00:00.0', 239.96))
RDD Count: 75408
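
The nested indexing in the lambda (`t[1][1][0]` and so on) is easy to get wrong. As an alternative sketch, tuple unpacking in a named function makes the reshaping explicit; `to_product_key` is a hypothetical name, and the sample row is copied from the join output above.

```python
def to_product_key(joined_row):
    """Re-key (order_id, (order_date, (product_id, subtotal))) by product_id."""
    order_id, (order_date, (product_id, subtotal)) = joined_row
    return (product_id, (order_date, subtotal))


row = (35188, ("2014-02-27 00:00:00.0", (627, 79.98)))
print(to_product_key(row))
```

The function can be passed directly to `ordersJoin.map(to_product_key)`.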


In [11]:
productsMap=products.map(lambda p:(int(p.split(",")[0]),p.split(",")[2]))
displayRDD(productsMap)

RDD Content:
(1, 'Quest Q64 10 FT. x 10 FT. Slant Leg Instant U')
(2, "Under Armour Men's Highlight MC Football Clea")
(3, "Under Armour Men's Renegade D Mid Football Cl")
(4, "Under Armour Men's Renegade D Mid Football Cl")
(5, 'Riddell Youth Revolution Speed Custom Footbal')
(6, "Jordan Men's VI Retro TD Football Cleat")
(7, 'Schutt Youth Recruit Hybrid Custom Football H')
(8, "Nike Men's Vapor Carbon Elite TD Football Cle")
(9, 'Nike Adult Vapor Jet 3.0 Receiver Gloves')
(10, "Under Armour Men's Highlight MC Football Clea")
RDD Count: 1345


In [12]:
dailyRevenueProductsName=dailyRevenueProducts.join(productsMap)
displayRDD(dailyRevenueProductsName)

RDD Content:
(60, (('2014-01-19 00:00:00.0', 999.99), 'SOLE E25 Elliptical'))
(60, (('2014-04-06 00:00:00.0', 999.99), 'SOLE E25 Elliptical'))
(60, (('2014-07-09 00:00:00.0', 999.99), 'SOLE E25 Elliptical'))
(860, (('2013-12-23 00:00:00.0', 599.99), 'Bushnell Pro X7 Jolt Slope Rangefinder'))
(860, (('2014-01-25 00:00:00.0', 599.99), 'Bushnell Pro X7 Jolt Slope Rangefinder'))
(860, (('2014-03-17 00:00:00.0', 599.99), 'Bushnell Pro X7 Jolt Slope Rangefinder'))
(821, (('2014-02-28 00:00:00.0', 51.99), 'Titleist Pro V1 High Numbers Personalized Gol'))
(821, (('2014-02-28 00:00:00.0', 51.99), 'Titleist Pro V1 High Numbers Personalized Gol'))
(821, (('2014-02-28 00:00:00.0', 51.99), 'Titleist Pro V1 High Numbers Personalized Gol'))
(821, (('2014-03-04 00:00:00.0', 103.98), 'Titleist Pro V1 High Numbers Personalized Gol'))
RDD Count: 75408


In [13]:
from operator import add
dailyRevenueProductsNameGP = dailyRevenueProductsName.map(lambda r:((r[1][0][0],r[1][1]),r[1][0][1])).reduceByKey(add)
displayRDD(dailyRevenueProductsNameGP)

RDD Content:
(('2013-08-08 00:00:00.0', "Nike Men's CJ Elite 2 TD Football Cleat"), 3249.749999999998)
(('2013-08-10 00:00:00.0', "Nike Men's CJ Elite 2 TD Football Cleat"), 2729.789999999999)
(('2013-10-07 00:00:00.0', "Nike Men's CJ Elite 2 TD Football Cleat"), 2079.84)
(('2013-10-14 00:00:00.0', "Nike Men's CJ Elite 2 TD Football Cleat"), 2339.8199999999997)
(('2013-11-19 00:00:00.0', "Nike Men's CJ Elite 2 TD Football Cleat"), 3769.7099999999973)
(('2013-11-20 00:00:00.0', "Nike Men's CJ Elite 2 TD Football Cleat"), 4159.679999999997)
(('2013-11-24 00:00:00.0', "Nike Men's CJ Elite 2 TD Football Cleat"), 5459.5799999999945)
(('2013-12-17 00:00:00.0', "Nike Men's CJ Elite 2 TD Football Cleat"), 2209.83)
(('2013-12-19 00:00:00.0', "Nike Men's CJ Elite 2 TD Football Cleat"), 2599.7999999999993)
(('2014-01-08 00:00:00.0', "Nike Men's CJ Elite 2 TD Football Cleat"), 1169.91)
RDD Count: 9120
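
`reduceByKey(add)` sums all values that share the same (order_date, product_name) key. The plain-Python sketch below emulates that aggregation on the Titleist rows shown in the previous output. The helper `reduce_by_key` is hypothetical and runs on a single machine, whereas Spark combines values within each partition before shuffling.

```python
from collections import defaultdict
from functools import reduce
from operator import add


def reduce_by_key(pairs, func):
    """Group values by key, then fold each group with func."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}


name = "Titleist Pro V1 High Numbers Personalized Gol"
pairs = [
    (("2014-02-28 00:00:00.0", name), 51.99),
    (("2014-02-28 00:00:00.0", name), 51.99),
    (("2014-02-28 00:00:00.0", name), 51.99),
    (("2014-03-04 00:00:00.0", name), 103.98),
]

totals = reduce_by_key(pairs, add)
for key, revenue in totals.items():
    # Round only for display; float addition can leave tiny residues.
    print(key, round(revenue, 2))
```

The long decimal tails in the output above (for example 3249.749999999998) are ordinary floating-point residue from repeated addition; they can be rounded when formatting the final result.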


In [16]:
dailyRevenueProductsNameSorted=dailyRevenueProductsNameGP.map(lambda p:((p[0][0],-p[1]),p[0][1])).sortByKey().\
map(lambda p:p[0][0]+','+str(-p[0][1])+','+p[1])
displayRDD(dailyRevenueProductsNameSorted)

RDD Content:
2013-07-25 00:00:00.0,5599.719999999999,Field & Stream Sportsman 16 Gun Fire Safe
2013-07-25 00:00:00.0,5099.489999999999,Nike Men's Free 5.0+ Running Shoe
2013-07-25 00:00:00.0,4499.700000000001,Diamondback Women's Serene Classic Comfort Bi
2013-07-25 00:00:00.0,3359.4399999999996,Perfect Fitness Perfect Rip Deck
2013-07-25 00:00:00.0,2999.8499999999995,Pelican Sunstream 100 Kayak
2013-07-25 00:00:00.0,2798.8800000000006,O'Brien Men's Neoprene Life Vest
2013-07-25 00:00:00.0,1949.8500000000001,Nike Men's CJ Elite 2 TD Football Cleat
2013-07-25 00:00:00.0,1650.0,Nike Men's Dri-FIT Victory Golf Polo
2013-07-25 00:00:00.0,1079.73,Under Armour Girls' Toddler Spine Surge Runni
2013-07-25 00:00:00.0,599.99,Bowflex SelectTech 1090 Dumbbells
RDD Count: 9120
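
The sort relies on a composite key: the date sorts ascending, and negating the revenue makes larger amounts come first within each date. Here is a small plain-Python sketch of the same idea, using values that appear elsewhere in this section; the variable names are illustrative.

```python
# ((order_date, product_name), revenue) records in arbitrary order.
records = [
    (("2014-06-29 00:00:00.0", "Perfect Fitness Perfect Rip Deck"), 3359.44),
    (("2013-07-25 00:00:00.0", "Bowflex SelectTech 1090 Dumbbells"), 599.99),
    (("2014-06-29 00:00:00.0", "Diamondback Women's Serene Classic Comfort Bi"), 3599.76),
    (("2013-07-25 00:00:00.0", "Nike Men's Dri-FIT Victory Golf Polo"), 1650.0),
]

# Ascending by date, then descending by revenue via the negated value.
ordered = sorted(records, key=lambda r: (r[0][0], -r[1]))

# Format as order_date,daily_revenue_per_product,product_name.
lines = ["%s,%s,%s" % (date, revenue, product)
         for (date, product), revenue in ordered]
for line in lines:
    print(line)
```

Negating the value works only for numeric sort fields; a descending sort on a string field would need a different approach.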


In [17]:
dailyRevenueProductsNameSorted.saveAsTextFile('/user/pi/retail_db/daily_revenue_python')

In [26]:
! hdfs dfs -ls /user/pi/retail_db/daily_revenue_python

2020-05-21 19:09:28,726 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 21 items
-rw-r--r--   2 pi supergroup          0 2020-05-21 19:06 /user/pi/retail_db/daily_revenue_python/_SUCCESS
-rw-r--r--   2 pi supergroup      29347 2020-05-21 19:06 /user/pi/retail_db/daily_revenue_python/part-00000
-rw-r--r--   2 pi supergroup      34065 2020-05-21 19:06 /user/pi/retail_db/daily_revenue_python/part-00001
-rw-r--r--   2 pi supergroup      28870 2020-05-21 19:06 /user/pi/retail_db/daily_revenue_python/part-00002
-rw-r--r--   2 pi supergroup      39125 2020-05-21 19:06 /user/pi/retail_db/daily_revenue_python/part-00003
-rw-r--r--   2 pi supergroup      42584 2020-05-21 19:06 /user/pi/retail_db/daily_revenue_python/part-00004
-rw-r--r--   2 pi supergroup      25390 2020-05-21 19:06 /user/pi/retail_db/daily_revenue_python/part-00005
-rw-r--r--   2 pi supergroup      38863 2020-05-21 19:06 /user/pi/retail_db/d

In [27]:
! hdfs dfs -rm -r -f /user/pi/retail_db/daily_revenue_python
# Save the output again, this time in 2 partitions
dailyRevenueProductsNameSorted.coalesce(2).saveAsTextFile('/user/pi/retail_db/daily_revenue_python')

2020-05-21 19:10:25,715 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Deleted /user/pi/retail_db/daily_revenue_python


In [28]:
! hdfs dfs -ls /user/pi/retail_db/daily_revenue_python

2020-05-21 19:10:36,109 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 3 items
-rw-r--r--   2 pi supergroup          0 2020-05-21 19:10 /user/pi/retail_db/daily_revenue_python/_SUCCESS
-rw-r--r--   2 pi supergroup     317852 2020-05-21 19:10 /user/pi/retail_db/daily_revenue_python/part-00000
-rw-r--r--   2 pi supergroup     310279 2020-05-21 19:10 /user/pi/retail_db/daily_revenue_python/part-00001


In [29]:
dailyRevenue=sc.textFile('/user/pi/retail_db/daily_revenue_python')
displayRDD(dailyRevenue)

RDD Content:
2013-07-25 00:00:00.0,5599.719999999999,Field & Stream Sportsman 16 Gun Fire Safe
2013-07-25 00:00:00.0,5099.489999999999,Nike Men's Free 5.0+ Running Shoe
2013-07-25 00:00:00.0,4499.700000000001,Diamondback Women's Serene Classic Comfort Bi
2013-07-25 00:00:00.0,3359.4399999999996,Perfect Fitness Perfect Rip Deck
2013-07-25 00:00:00.0,2999.8499999999995,Pelican Sunstream 100 Kayak
2013-07-25 00:00:00.0,2798.8800000000006,O'Brien Men's Neoprene Life Vest
2013-07-25 00:00:00.0,1949.8500000000001,Nike Men's CJ Elite 2 TD Football Cleat
2013-07-25 00:00:00.0,1650.0,Nike Men's Dri-FIT Victory Golf Polo
2013-07-25 00:00:00.0,1079.73,Under Armour Girls' Toddler Spine Surge Runni
2013-07-25 00:00:00.0,599.99,Bowflex SelectTech 1090 Dumbbells
RDD Count: 9120


### Run pyspark with avro package
***
PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark --master spark://192.168.1.109:7077 --packages org.apache.spark:spark-avro_2.11:2.4.0
***

In [19]:
# Save in Avro file format
# Format the records and convert them to a DataFrame
dailyRevenueProductsNameSorted1=dailyRevenueProductsNameGP.map(lambda p:((p[0][0],-p[1]),p[0][1])).sortByKey().\
map(lambda p:(p[0][0],round(-p[0][1],2),p[1]))
displayRDD(dailyRevenueProductsNameSorted1)

RDD Content:
('2013-07-25 00:00:00.0', 5599.72, 'Field & Stream Sportsman 16 Gun Fire Safe')
('2013-07-25 00:00:00.0', 5099.49, "Nike Men's Free 5.0+ Running Shoe")
('2013-07-25 00:00:00.0', 4499.7, "Diamondback Women's Serene Classic Comfort Bi")
('2013-07-25 00:00:00.0', 3359.44, 'Perfect Fitness Perfect Rip Deck')
('2013-07-25 00:00:00.0', 2999.85, 'Pelican Sunstream 100 Kayak')
('2013-07-25 00:00:00.0', 2798.88, "O'Brien Men's Neoprene Life Vest")
('2013-07-25 00:00:00.0', 1949.85, "Nike Men's CJ Elite 2 TD Football Cleat")
('2013-07-25 00:00:00.0', 1650.0, "Nike Men's Dri-FIT Victory Golf Polo")
('2013-07-25 00:00:00.0', 1079.73, "Under Armour Girls' Toddler Spine Surge Runni")
('2013-07-25 00:00:00.0', 599.99, 'Bowflex SelectTech 1090 Dumbbells')
RDD Count: 9120


In [20]:
dailyRevenueDF=dailyRevenueProductsNameSorted1.toDF(schema=['order_date','daily_revenue_per_product','product_name'])
dailyRevenueDF.show()

+--------------------+-------------------------+--------------------+
|          order_date|daily_revenue_per_product|        product_name|
+--------------------+-------------------------+--------------------+
|2013-07-25 00:00:...|                  5599.72|Field & Stream Sp...|
|2013-07-25 00:00:...|                  5099.49|Nike Men's Free 5...|
|2013-07-25 00:00:...|                   4499.7|Diamondback Women...|
|2013-07-25 00:00:...|                  3359.44|Perfect Fitness P...|
|2013-07-25 00:00:...|                  2999.85|Pelican Sunstream...|
|2013-07-25 00:00:...|                  2798.88|O'Brien Men's Neo...|
|2013-07-25 00:00:...|                  1949.85|Nike Men's CJ Eli...|
|2013-07-25 00:00:...|                   1650.0|Nike Men's Dri-FI...|
|2013-07-25 00:00:...|                  1079.73|Under Armour Girl...|
|2013-07-25 00:00:...|                   599.99|Bowflex SelectTec...|
|2013-07-25 00:00:...|                   319.96|Elevation Trainin...|
|2013-07-25 00:00:..

In [21]:
# Save the DataFrame as Avro files
dailyRevenueDF.write.save('/user/pi/retail_db/daily_revenue_avro_python',format="com.databricks.spark.avro")

In [24]:
# Read the Avro files back
dailyRevenueDFRead=sqlContext.read.load('/user/pi/retail_db/daily_revenue_avro_python',format="com.databricks.spark.avro")
dailyRevenueDFRead.show()

+--------------------+-------------------------+--------------------+
|          order_date|daily_revenue_per_product|        product_name|
+--------------------+-------------------------+--------------------+
|2014-06-29 00:00:...|                  3599.76|Diamondback Women...|
|2014-06-29 00:00:...|                  3359.44|Perfect Fitness P...|
|2014-06-29 00:00:...|                   2650.0|Nike Men's Dri-FI...|
|2014-06-29 00:00:...|                  2099.16|O'Brien Men's Neo...|
|2014-06-29 00:00:...|                  1949.85|Nike Men's CJ Eli...|
|2014-06-29 00:00:...|                  1599.92|Pelican Sunstream...|
|2014-06-29 00:00:...|                  1159.71|Under Armour Girl...|
|2014-06-29 00:00:...|                   659.98|Stiga Master Seri...|
|2014-06-29 00:00:...|                   499.95|Merrell Women's G...|
|2014-06-29 00:00:...|                   299.99|Titleist Club Glo...|
|2014-06-29 00:00:...|                    280.0|adidas Youth Germ...|
|2014-06-29 00:00:..