# Activity: Customer Spend

Task: Create a driver program that does the following:

1. Split each comma-delimited line into fields
2. Map each line to key/value pairs of customerID and dollar amount.
3. Use `reduceByKey` to add up amount spent by customer ID.
4. `collect()` the results and print them.

In [1]:
spark

Intitializing Scala interpreter ...

Spark Web UI available at http://192.168.1.19:4040
SparkContext available as 'sc' (version = 2.4.5, master = local[*], app id = local-1588961514825)
SparkSession available as 'spark'


res0: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@6a402e0b


# Data import

In [2]:
val spend = sc.textFile("../../data/customer-orders.csv")

spend: org.apache.spark.rdd.RDD[String] = ../../data/customer-orders.csv MapPartitionsRDD[1] at textFile at <console>:25


Take a peek at the first five rows.

In [4]:
spend.take(5).foreach(println)

44,8602,37.19
35,5368,65.89
2,3391,40.64
47,6694,14.98
29,680,13.08


**NOTE**: In the source data, the first column is the `customerID` and the third column is the `spend`. We form the key/value pairs from these two columns.

Because `textFile` reads each line of the CSV file as a string, we need to convert the entries into numeric types.

In [11]:
def parseLine(row:String)={
    val fields = row.split(",")
    val customerID = fields(0)
    val spend = fields(2)
    (customerID.toInt, spend.toFloat) // Convert to numeric types
}

parseLine: (row: String)(Int, Float)
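
`parseLine` throws an exception if a row has fewer than three fields or contains a non-numeric value. The following is a minimal sketch of a more defensive variant, assuming malformed rows should simply be skipped. The name `parseLineSafe` is hypothetical and not part of the original notebook.

```scala
import scala.util.Try

// Hypothetical defensive variant of parseLine:
// returns None for rows that are too short or not numeric.
def parseLineSafe(row: String): Option[(Int, Float)] = {
  val fields = row.split(",")
  if (fields.length < 3) None
  else Try((fields(0).trim.toInt, fields(2).trim.toFloat)).toOption
}
```

With the RDD above, `spend.flatMap(row => parseLineSafe(row))` should keep only the rows that parse, since each `None` contributes no element.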


In [13]:
val customerSpend = spend.map(parseLine(_))

customerSpend: org.apache.spark.rdd.RDD[(Int, Float)] = MapPartitionsRDD[4] at map at <console>:28


In [14]:
val customerTotalSpend = customerSpend.reduceByKey((x,y)=> x+y)

customerTotalSpend: org.apache.spark.rdd.RDD[(Int, Float)] = ShuffledRDD[5] at reduceByKey at <console>:26
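
To see what `reduceByKey` computes, here is a sketch of the same aggregation on a plain Scala collection, with no Spark involved. It uses three sample rows from the data above plus one hypothetical extra order for customer 44 (not in the original data), so that at least one key has two values to add. The names `sampleRows`, `pairs`, and `totals` are illustrative.

```scala
// Three rows from the sample output, plus one HYPOTHETICAL row ("44,1,10.00")
val sampleRows = Seq("44,8602,37.19", "35,5368,65.89", "2,3391,40.64", "44,1,10.00")

// Same key/value mapping as parseLine, using Double for the amount
val pairs: Seq[(Int, Double)] = sampleRows.map { row =>
  val fields = row.split(",")
  (fields(0).toInt, fields(2).toDouble)
}

// Local equivalent of reduceByKey(_ + _):
// group the pairs by key, then reduce each group's values with +
val totals: Map[Int, Double] =
  pairs.groupBy(_._1).map { case (id, kvs) => id -> kvs.map(_._2).reduce(_ + _) }

totals.foreach { case (id, total) => println(s"Customer: $id, Total_Spend: $total") }
```

On an RDD, `reduceByKey` gives the same per-key result but combines values within each partition before shuffling, which is why it is generally preferred over grouping first and summing afterwards.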


In [18]:
val results = customerTotalSpend.collect()

results: Array[(Int, Float)] = Array((34,5330.7993), (52,5245.0605), (96,3924.23), (4,4815.05), (16,4979.0605), (82,4812.49), (66,4681.92), (28,5000.7104), (54,6065.39), (80,4727.86), (98,4297.26), (30,4990.72), (14,4735.0303), (50,4517.2695), (36,4278.05), (24,5259.92), (64,5288.69), (92,5379.281), (74,4647.1304), (90,5290.41), (72,5337.4395), (70,5368.2505), (18,4921.27), (12,4664.59), (38,4898.461), (20,4836.86), (78,4524.51), (10,4819.6997), (94,4475.5703), (84,4652.9395), (56,4701.02), (76,4904.2104), (22,5019.449), (46,5963.111), (48,4384.3296), (32,5496.0503), (0,5524.9497), (62,5253.3213), (42,5696.8403), (40,5186.4297), (6,5397.8794), (8,5517.24), (86,4908.809), (58,5437.7305), (44,4756.8906), (88,4830.55), (60,5040.7095), (26,5250.4004), (68,6375.45), (2,5994.591), (13,4367.62...
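
Totals such as `5330.7993` show `Float` rounding: the input amounts have two decimal places, but the sums do not. If exact cents matter, one option is to keep the amount as a `BigDecimal`. The sketch below assumes that requirement; `parseLineExact` is a hypothetical name, not part of the original notebook.

```scala
// Hypothetical variant of parseLine that keeps the amount as an exact decimal
def parseLineExact(row: String): (Int, BigDecimal) = {
  val fields = row.split(",")
  (fields(0).toInt, BigDecimal(fields(2)))
}

// Add the amounts of the first two sample rows
val a = parseLineExact("44,8602,37.19")._2
val b = parseLineExact("35,5368,65.89")._2
val exactSum = a + b // exactly 103.08
```

In the notebook, the analogous pipeline would be `spend.map(parseLineExact(_)).reduceByKey(_ + _)`. Using `toDouble` instead of `toFloat` is a lighter alternative that reduces, but does not eliminate, the rounding.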

In [19]:
for (result <- results){
    val customerID = result._1
    val totalSpend = result._2
    println(s"Customer: $customerID, Total_Spend: $totalSpend")
}

Customer: 34, Total_Spend: 5330.7993
Customer: 52, Total_Spend: 5245.0605
Customer: 96, Total_Spend: 3924.23
Customer: 4, Total_Spend: 4815.05
Customer: 16, Total_Spend: 4979.0605
Customer: 82, Total_Spend: 4812.49
Customer: 66, Total_Spend: 4681.92
Customer: 28, Total_Spend: 5000.7104
Customer: 54, Total_Spend: 6065.39
Customer: 80, Total_Spend: 4727.86
Customer: 98, Total_Spend: 4297.26
Customer: 30, Total_Spend: 4990.72
Customer: 14, Total_Spend: 4735.0303
Customer: 50, Total_Spend: 4517.2695
Customer: 36, Total_Spend: 4278.05
Customer: 24, Total_Spend: 5259.92
Customer: 64, Total_Spend: 5288.69
Customer: 92, Total_Spend: 5379.281
Customer: 74, Total_Spend: 4647.1304
Customer: 90, Total_Spend: 5290.41
Customer: 72, Total_Spend: 5337.4395
Customer: 70, Total_Spend: 5368.2505
Customer: 18, Total_Spend: 4921.27
Customer: 12, Total_Spend: 4664.59
Customer: 38, Total_Spend: 4898.461
Customer: 20, Total_Spend: 4836.86
Customer: 78, Total_Spend: 4524.51
Customer: 10, Total_Spend: 4819.6997

Sort the results by decreasing spend.

In [22]:
val results = customerTotalSpend.sortBy(_._2, ascending=false).collect()

results: Array[(Int, Float)] = Array((68,6375.45), (73,6206.199), (39,6193.1104), (54,6065.39), (71,5995.66), (2,5994.591), (97,5977.1895), (46,5963.111), (42,5696.8403), (59,5642.8906), (41,5637.619), (0,5524.9497), (8,5517.24), (85,5503.4307), (61,5497.48), (32,5496.0503), (58,5437.7305), (63,5415.15), (15,5413.5103), (6,5397.8794), (92,5379.281), (43,5368.83), (70,5368.2505), (72,5337.4395), (34,5330.7993), (9,5322.6494), (55,5298.09), (90,5290.41), (64,5288.69), (93,5265.75), (24,5259.92), (33,5254.659), (62,5253.3213), (26,5250.4004), (52,5245.0605), (87,5206.3994), (40,5186.4297), (35,5155.42), (11,5152.29), (65,5140.3496), (69,5123.01), (81,5112.71), (19,5059.4307), (25,5057.6104), (60,5040.7095), (17,5032.6797), (29,5032.5303), (22,5019.449), (28,5000.7104), (30,4990.72), (16,49...

In [23]:
for (result <- results){
    val customerID = result._1
    val totalSpend = result._2
    println(s"Customer: $customerID, Total_Spend: $totalSpend")
}

Customer: 68, Total_Spend: 6375.45
Customer: 73, Total_Spend: 6206.199
Customer: 39, Total_Spend: 6193.1104
Customer: 54, Total_Spend: 6065.39
Customer: 71, Total_Spend: 5995.66
Customer: 2, Total_Spend: 5994.591
Customer: 97, Total_Spend: 5977.1895
Customer: 46, Total_Spend: 5963.111
Customer: 42, Total_Spend: 5696.8403
Customer: 59, Total_Spend: 5642.8906
Customer: 41, Total_Spend: 5637.619
Customer: 0, Total_Spend: 5524.9497
Customer: 8, Total_Spend: 5517.24
Customer: 85, Total_Spend: 5503.4307
Customer: 61, Total_Spend: 5497.48
Customer: 32, Total_Spend: 5496.0503
Customer: 58, Total_Spend: 5437.7305
Customer: 63, Total_Spend: 5415.15
Customer: 15, Total_Spend: 5413.5103
Customer: 6, Total_Spend: 5397.8794
Customer: 92, Total_Spend: 5379.281
Customer: 43, Total_Spend: 5368.83
Customer: 70, Total_Spend: 5368.2505
Customer: 72, Total_Spend: 5337.4395
Customer: 34, Total_Spend: 5330.7993
Customer: 9, Total_Spend: 5322.6494
Customer: 55, Total_Spend: 5298.09
Customer: 90, Total_Spend: 

**OBSERVATIONS**: The highest spend corresponds to customer 68, while the lowest corresponds to customer 45.
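
Rather than scanning the printed list, the extremes can be read off programmatically with `maxBy` and `minBy`. The sketch below runs on a small sample of pairs copied from the output above, so its minimum is the lowest within the sample only, not within the full data set. The names `sample`, `top`, and `bottom` are illustrative.

```scala
// A few (customerID, totalSpend) pairs copied from the output above
val sample: Array[(Int, Float)] =
  Array((54, 6065.39f), (68, 6375.45f), (34, 5330.7993f))

val top = sample.maxBy(_._2)    // pair with the highest spend in the sample
val bottom = sample.minBy(_._2) // lowest within this sample only

println(s"Highest: customer ${top._1}, lowest in sample: customer ${bottom._1}")
```

In the notebook, `results.maxBy(_._2)` and `results.minBy(_._2)` would apply the same calls to the full collected array.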