# Amount of Money Spent per Customer

Our goal in this notebook is to calculate the total amount of money each customer spent at a fictitious store. The data for this example can be found at ../datasets/customer_orders.csv and simulates the store's sales. Each row contains, in order, the following fields: customer id, item id, and the amount spent on that item.

First, we import the required libraries and configure Spark to run on our local machine.

In [1]:
import pyspark
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster('local').setAppName('amount_per_customer')
sc = SparkContext(conf = conf)

Let's load the data and display a small sample of it.

In [2]:
raw_data = sc.textFile('../datasets/customer_orders.csv')

for row in raw_data.take(10):
    print(row)

44,8602,37.19
35,5368,65.89
2,3391,40.64
47,6694,14.98
29,680,13.08
91,8900,24.59
70,3959,68.68
85,1733,28.53
53,9900,83.55
14,1505,4.32


We define the following function to split each row and keep only the customer id and amount spent fields. It transforms each row into a key/value pair in which the key is the customer id and the value is the amount spent.

In [3]:
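def split_data(row):
    # Each row looks like 'customer_id,item_id,amount_spent'
    fields = row.split(',')
    customer_id = int(fields[0])
    amount_spent = float(fields[2])

    # Return a (key, value) pair; the item id (fields[1]) is dropped
    return (customer_id, amount_spent)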
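
Before handing the function to Spark, it can be checked on a plain string. The sketch below repeats split_data so that it runs on its own and applies it to the first sample row shown earlier. It also adds a hypothetical variant, safe_split_data, which is not part of the original notebook. It returns None for rows that cannot be parsed, on the assumption that malformed rows should simply be skipped.

```python
def split_data(row):
    fields = row.split(',')
    return (int(fields[0]), float(fields[2]))


def safe_split_data(row):
    # Hypothetical helper: returns None instead of raising on a malformed row
    try:
        return split_data(row)
    except (IndexError, ValueError):
        return None


print(split_data('44,8602,37.19'))       # (44, 37.19)
print(safe_split_data('not,a number'))   # None
```

With a variant like this, unparsable rows could be dropped in Spark with something along the lines of raw_data.map(safe_split_data).filter(lambda pair: pair is not None).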

Now we apply the function defined above to the raw_data RDD and display its first ten elements.

In [4]:
data = raw_data.map(split_data)

for row in data.take(10):
    print(row)

(44, 37.19)
(35, 65.89)
(2, 40.64)
(47, 14.98)
(29, 13.08)
(91, 24.59)
(70, 68.68)
(85, 28.53)
(53, 83.55)
(14, 4.32)


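In the next cell, we use .reduceByKey to calculate the total amount spent per customer. This transformation merges all the values that share a key using the given function, which here adds them together.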

In [5]:
total_amount = data.reduceByKey(lambda x, y: x + y)

for row in total_amount.take(10):
    print(row)

(44, 4756.8899999999985)
(35, 5155.419999999999)
(2, 5994.59)
(47, 4316.299999999999)
(29, 5032.529999999999)
(91, 4642.259999999999)
(70, 5368.249999999999)
(85, 5503.43)
(53, 4945.299999999999)
(14, 4735.030000000001)
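
To make the effect of .reduceByKey concrete, here is a minimal pure-Python sketch of the same idea, run on a handful of made-up pairs. The reduce_by_key helper is hypothetical and only imitates the result on a single machine. It does not reflect how Spark distributes the work.

```python
from functools import reduce

# A few (customer_id, amount_spent) pairs; the values are made up for illustration
pairs = [(44, 10.0), (35, 5.0), (44, 2.5), (35, 1.5), (2, 4.0)]


def reduce_by_key(pairs, func):
    # Group the values by key, then fold each group with func
    grouped = {}
    for key, value in pairs:
        grouped.setdefault(key, []).append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}


totals = reduce_by_key(pairs, lambda x, y: x + y)
print(totals)  # {44: 12.5, 35: 6.5, 2: 4.0}
```

Spark applies the function both within and across partitions, so it should be associative and commutative, as addition is. The long decimal tails in the output above, such as 4756.8899999999985, are ordinary floating-point rounding artifacts from summing many float values.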


Finally, we would like to order our results by the total amount spent. To achieve this, we swap the keys and the values so that the total amount becomes the key. Then, when we call the .sortByKey transformation with ascending=False, the results are sorted by total amount spent, from highest to lowest. Let's do this and print the top 20 customers.

In [6]:
amounts_ordered = total_amount.map(lambda x: (x[1], x[0])).sortByKey(ascending=False)

for customer_info in amounts_ordered.take(20):
    customer_id = customer_info[1]
    amount_spent = customer_info[0]
    print('Customer Id: ' + str(customer_id) + ', Amount Spent: ' + str(round(amount_spent,2)))

Customer Id: 68, Amount Spent: 6375.45
Customer Id: 73, Amount Spent: 6206.2
Customer Id: 39, Amount Spent: 6193.11
Customer Id: 54, Amount Spent: 6065.39
Customer Id: 71, Amount Spent: 5995.66
Customer Id: 2, Amount Spent: 5994.59
Customer Id: 97, Amount Spent: 5977.19
Customer Id: 46, Amount Spent: 5963.11
Customer Id: 42, Amount Spent: 5696.84
Customer Id: 59, Amount Spent: 5642.89
Customer Id: 41, Amount Spent: 5637.62
Customer Id: 0, Amount Spent: 5524.95
Customer Id: 8, Amount Spent: 5517.24
Customer Id: 85, Amount Spent: 5503.43
Customer Id: 61, Amount Spent: 5497.48
Customer Id: 32, Amount Spent: 5496.05
Customer Id: 58, Amount Spent: 5437.73
Customer Id: 63, Amount Spent: 5415.15
Customer Id: 15, Amount Spent: 5413.51
Customer Id: 6, Amount Spent: 5397.88
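
Swapping keys and values is one way to sort by amount. Another is to sort directly with a key function, which PySpark also supports through RDD.sortBy, for example total_amount.sortBy(lambda x: x[1], ascending=False). The sketch below shows the same idea in plain Python on three totals taken from the .reduceByKey output, as an illustration rather than a replacement for the cell above. It also formats amounts with :.2f, which always prints two decimals (6206.20 rather than 6206.2).

```python
# Totals for three customers, copied from the reduceByKey output above
totals = [(44, 4756.8899999999985), (35, 5155.419999999999), (2, 5994.59)]

# Sort by the second element of each pair (the amount), largest first
ordered = sorted(totals, key=lambda pair: pair[1], reverse=True)

for customer_id, amount_spent in ordered:
    print(f'Customer Id: {customer_id}, Amount Spent: {amount_spent:.2f}')
```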
