## Exercises - Custom Map Reduce Functions
The exercises below are the same ones you solved earlier. This time, solve them using the custom map reduce APIs.
* The next cell contains the Python code for all the map reduce functions developed earlier. Run it to make these functions available in this notebook.

In [1]:
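# Return the elements of c for which f(e) is truthy
def myFilter(c, f):
    c_f = []
    for e in c:
        if f(e):
            c_f.append(e)
    return c_f

# Apply f to every element of c and return the results as a list
def myMap(c, f):
    c_f = []
    for e in c:
        c_f.append(f(e))
    return c_f

# Fold c into a single value using f; c must contain at least one element
def myReduce(c, f):
    t = c[0]
    for e in c[1:]:
        t = f(t, e)
    return t

# Combine the values of (key, value) pairs that share the same key using f
def myReduceByKey(p, f):
    p_f = {}
    for e in p:
        if e[0] in p_f:
            p_f[e[0]] = f(p_f[e[0]], e[1])
        else:
            p_f[e[0]] = e[1]
    return list(p_f.items())

# Inner join of two lists of (key, value) pairs; duplicate keys in c1 keep only the last value
def myJoin(c1, c2):
    c1_dict = dict(c1)  # dict with first element as key and second element as value
    results = []  # Initializing empty list
    for c2_item in c2:
        if c2_item[0] in c1_dict:
            results.append((c2_item[0], (c1_dict[c2_item[0]], c2_item[1])))
    return results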
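
As a quick sanity check of how these helpers behave, the sketch below runs four of them on small made-up lists. The sample values are illustrative only and do not come from the data set, and the helper definitions are repeated in compact, equivalent form purely so the snippet runs on its own.

```python
# Compact equivalents of the notebook helpers, repeated so this runs standalone
def myFilter(c, f):
    return [e for e in c if f(e)]

def myMap(c, f):
    return [f(e) for e in c]

def myReduce(c, f):
    t = c[0]
    for e in c[1:]:
        t = f(t, e)
    return t

def myReduceByKey(p, f):
    p_f = {}
    for k, v in p:
        p_f[k] = f(p_f[k], v) if k in p_f else v
    return list(p_f.items())

# Made-up sample data (not from the retail_db files)
nums = [1, 2, 3, 4, 5, 6]
evens = myFilter(nums, lambda n: n % 2 == 0)      # [2, 4, 6]
squares = myMap(evens, lambda n: n * n)           # [4, 16, 36]
total = myReduce(squares, lambda t, e: t + e)     # 56

pairs = [('a', 1), ('b', 2), ('a', 3)]
sums = myReduceByKey(pairs, lambda t, e: t + e)   # [('a', 4), ('b', 2)]

print(evens, squares, total, sums)
```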

* Get number of COMPLETE orders placed by each customer
* Get the total number of PENDING or PENDING_PAYMENT orders for January 2014.
* Get outstanding amount for each month considering orders with status PAYMENT_REVIEW, PENDING, PENDING_PAYMENT and PROCESSING.

### Details of Data

Here are the details about the orders data which you can use for these exercises.
* Location: `/data/retail_db/orders/part-00000`
* Records are newline-delimited, one record per line.
* Attributes within each record are comma-separated.
* Here are the columns in the orders data set.
  * order_id
  * order_date
  * order_customer_id
  * order_status
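
A single record can be split into these four fields with `str.split`. The following is a minimal sketch using the first record of the orders file; the variable `order_month` is introduced here only for illustration.

```python
# First record of the orders file
record = '1,2013-07-25 00:00:00.0,11599,CLOSED'

order_id, order_date, order_customer_id, order_status = record.split(',')

# Every field is a string after splitting; convert explicitly where needed
order_id = int(order_id)
order_customer_id = int(order_customer_id)
order_month = order_date[:7]  # 'YYYY-MM' prefix of the date (illustrative helper value)

print(order_id, order_month, order_customer_id, order_status)
```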

In [2]:
# Get the details about file
!ls -ltr /data/retail_db/orders/part-00000

-rw-r--r-- 1 root root 2999944 Jan 21  2021 /data/retail_db/orders/part-00000


In [3]:
# Get first five lines from the file
!head -5 /data/retail_db/orders/part-00000

1,2013-07-25 00:00:00.0,11599,CLOSED
2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT
3,2013-07-25 00:00:00.0,12111,COMPLETE
4,2013-07-25 00:00:00.0,8827,CLOSED
5,2013-07-25 00:00:00.0,11318,COMPLETE


In [4]:
# Get number of lines from the file
# We can use linux command wc with -l
!wc -l /data/retail_db/orders/part-00000

68883 /data/retail_db/orders/part-00000


Here are the details about the order_items data which you can use for these exercises.
* Location: `/data/retail_db/order_items/part-00000`
* Records are newline-delimited, one record per line.
* Attributes within each record are comma-separated.
* Here are the columns in the order_items data set.
  * order_item_id
  * order_item_order_id
  * order_item_product_id
  * order_item_quantity
  * order_item_subtotal
  * order_item_product_price
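
The same splitting approach applies to order_items, with the numeric columns converted explicitly. This sketch uses the third record of the order_items file; the observation that the subtotal equals quantity times price holds for this sample row and is not claimed for every record.

```python
# Third record of the order_items file
record = '3,2,502,5,250.0,50.0'

fields = record.split(',')
order_item_order_id = fields[1]            # kept as a string, like the order_id in orders
order_item_quantity = int(fields[3])
order_item_subtotal = float(fields[4])
order_item_product_price = float(fields[5])

# For this sample row, the subtotal matches quantity * price
print(order_item_order_id, order_item_subtotal,
      order_item_quantity * order_item_product_price)
```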

In [5]:
# Get the details about file
!ls -ltr /data/retail_db/order_items/part-00000

-rw-r--r-- 1 root root 5408880 Jan 21  2021 /data/retail_db/order_items/part-00000


In [6]:
# Get first five lines from the file
!head -5 /data/retail_db/order_items/part-00000

1,1,957,1,299.98,299.98
2,2,1073,1,199.99,199.99
3,2,502,5,250.0,50.0
4,2,403,1,129.99,129.99
5,4,897,2,49.98,24.99


In [7]:
# Get number of lines from the file
# We can use linux command wc with -l
!wc -l /data/retail_db/order_items/part-00000

172198 /data/retail_db/order_items/part-00000


### Exercise 1 - Read data from file
Before getting into the problem statements, develop the code to read a file into a list of elements.
* We should be able to use this function to read any text file that uses a line as the record delimiter.

In [8]:
# Update the logic here
def get_list_from_file(file_path):
    # Use a context manager so the file is closed once it has been read
    with open(file_path) as f:
        data_list = f.read().splitlines()
    return data_list
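
If the `/data/retail_db` files are not available, a reader function of this kind can be tried on a small temporary file. The sketch below is one way to do that, assuming a context-manager based reader; the two records written to the file are copied from the orders preview.

```python
import os
import tempfile

def get_list_from_file(file_path):
    # Read the whole file and split it into lines without trailing newlines
    with open(file_path) as f:
        return f.read().splitlines()

# Write a tiny throwaway file with two records from the orders preview
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('1,2013-07-25 00:00:00.0,11599,CLOSED\n')
    tmp.write('2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT\n')
    tmp_path = tmp.name

sample = get_list_from_file(tmp_path)
os.remove(tmp_path)  # clean up the temporary file

print(len(sample))
print(sample[0])
```

Because `splitlines` drops the line terminators, the final newline in the file does not produce an empty trailing element.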

* Run the cells below to validate the function.
* You should see 68883 as the output of the cell with `len(orders)` below.
* You should see 172198 as the output of the cell with `len(order_items)` below.

In [9]:
orders = get_list_from_file('/data/retail_db/orders/part-00000')

In [10]:
orders[:5]

['1,2013-07-25 00:00:00.0,11599,CLOSED',
 '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
 '3,2013-07-25 00:00:00.0,12111,COMPLETE',
 '4,2013-07-25 00:00:00.0,8827,CLOSED',
 '5,2013-07-25 00:00:00.0,11318,COMPLETE']

In [11]:
len(orders)

68883

In [12]:
order_items = get_list_from_file('/data/retail_db/order_items/part-00000')

In [13]:
order_items[:5]

['1,1,957,1,299.98,299.98',
 '2,2,1073,1,199.99,199.99',
 '3,2,502,5,250.0,50.0',
 '4,2,403,1,129.99,129.99',
 '5,4,897,2,49.98,24.99']

In [14]:
len(order_items)

172198

### Exercise 2 - Complete Order Count by Customer

Get the number of COMPLETE orders placed by each customer. Develop a function which reads the orders data and returns the count of orders with COMPLETE status for each customer, using **order_customer_id** as the key.
* The function should take the full list of orders as its argument and return the count of COMPLETE orders by customer. The function should return a **dict** type object.
* An order is considered complete if its **order_status** is **COMPLETE**.
* You can review the structure of the data under the **Details of Data** section in this notebook.
* Use the relevant functions defined at the beginning of this notebook.
  * `myFilter`
  * `myMap`
  * `myReduce`
  * `myReduceByKey`
  * `myJoin`

In [15]:
# Update the logic here
# You need to use myFilter and myReduceByKey for this
def get_complete_order_count_by_customer(orders):
    order_complete=myFilter(orders, lambda order: order.split(',')[3]=='COMPLETE')
    order_complete_map=myMap(order_complete,lambda order: (order.split(',')[2],1))
    complete_order_count_by_customer=myReduceByKey(order_complete_map, lambda t,e : t+e )
    return dict(complete_order_count_by_customer)

* Run the cells below to validate the function. The result should be a **dict** with **10538** entries, one for each customer with at least one COMPLETE order.

In [16]:
orders = get_list_from_file('/data/retail_db/orders/part-00000')

In [17]:
complete_order_count_by_customer = get_complete_order_count_by_customer(orders)

In [18]:
# This should return dict
type(complete_order_count_by_customer)

dict

In [19]:
# This should return 10538
len(complete_order_count_by_customer)

10538

* Run the cell below to preview the data.

In [20]:
for e in sorted(complete_order_count_by_customer.items())[:5]:
    print(e)

('1', 1)
('10', 2)
('100', 3)
('1000', 4)
('10000', 2)
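
The customer ids in the preview are strings, so `sorted` orders them character by character ('1', '10', '100', ...). If numeric order is preferred, one option is to sort on the integer value of the key. The counts below are made up to mimic the shape of the returned dict.

```python
# Made-up counts keyed by customer id strings
counts = {'10': 2, '1': 1, '100': 3, '2': 5}

# Default sort compares the string keys lexicographically
by_string = sorted(counts.items())

# Sorting on int(key) gives numeric customer id order instead
by_id = sorted(counts.items(), key=lambda kv: int(kv[0]))

print(by_string)
print(by_id)
```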


### Exercise 3 - Pending Order Count

Get the total number of PENDING or PENDING_PAYMENT orders for January 2014. Develop a function which reads the orders data and returns the pending order count.
* The function should take the full list of orders as its argument and return the count of pending orders.
* An order is considered pending if its status is **PENDING** or **PENDING_PAYMENT**. Only orders placed in January 2014 should be counted.
* The second element in each comma-separated record gives us the date.
* The fourth (last) element in each comma-separated record gives us the order status.
* Use the relevant functions defined at the beginning of this notebook.
  * `myFilter`
  * `myMap`
  * `myReduce`
  * `myReduceByKey`
  * `myJoin`

In [21]:
# Update the logic here
# You need to use myFilter and myReduce for this
def get_pending_order_count(orders):
    def is_pending_in_jan_2014(order):
        fields = order.split(',')  # split once and reuse the fields
        return (fields[3] in ('PENDING', 'PENDING_PAYMENT')
                and fields[1].startswith('2014-01'))
    # len counts the filtered records; summing 1s with myMap and myReduce is equivalent
    return len(myFilter(orders, is_pending_in_jan_2014))
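
The hint in the cell above mentions `myReduce`, while the solution counts with `len`. A possible variant that counts through `myMap` and `myReduce` is sketched below. The function name `get_pending_order_count_with_reduce` and the sample records are hypothetical, and the helpers are repeated in compact form so the snippet runs on its own.

```python
# Compact equivalents of the notebook helpers, repeated so this runs standalone
def myFilter(c, f):
    return [e for e in c if f(e)]

def myMap(c, f):
    return [f(e) for e in c]

def myReduce(c, f):
    t = c[0]
    for e in c[1:]:
        t = f(t, e)
    return t

def get_pending_order_count_with_reduce(orders):
    def is_pending_in_jan_2014(order):
        fields = order.split(',')
        return (fields[3] in ('PENDING', 'PENDING_PAYMENT')
                and fields[1].startswith('2014-01'))

    pending = myFilter(orders, is_pending_in_jan_2014)
    if not pending:
        return 0  # myReduce reads c[0], so guard against an empty list
    # Map every record to 1 and add the ones up
    return myReduce(myMap(pending, lambda order: 1), lambda t, e: t + e)

# Made-up sample records in the orders format
sample_orders = [
    '1,2014-01-03 00:00:00.0,100,PENDING',
    '2,2014-01-15 00:00:00.0,101,PENDING_PAYMENT',
    '3,2014-02-01 00:00:00.0,102,PENDING',
    '4,2014-01-20 00:00:00.0,103,COMPLETE',
]
print(get_pending_order_count_with_reduce(sample_orders))  # 2
```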

* Run the cell below to validate your function. You should get **1969** as output.

In [22]:
get_pending_order_count(orders)

1969

* You can also validate the result with a simple Linux command pipeline.

In [23]:
!egrep -w '(PENDING|PENDING_PAYMENT)' /data/retail_db/orders/part-00000|grep 2014-01|wc -l

1969


### Exercise 4 - Get Outstanding Revenue

Get outstanding amount for each month considering orders with status PAYMENT_REVIEW, PENDING, PENDING_PAYMENT and PROCESSING. Modularize by developing multiple functions.
* Develop a function which takes the orders list as argument and returns a collection of order ids with one of the pending statuses.
* Develop a function which takes the **order_items list** as well as the **pending orders collection** from the previous step as arguments and returns the outstanding amount.
* You can use **order_item_subtotal** to compute the outstanding amount.
* Here are the instructions for the solution.
  * Create a list of tuples named `pending_orders` for pending orders using `myFilter` and `myMap`. Each tuple in the list should contain order_id and the hard coded value 1.
  * Create a list of tuples named `order_item` from order_items, in which each tuple contains `order_id` and `order_item_subtotal`.
  * Create a new list named `order_item_subtotals` by invoking `myJoin` using `pending_orders` and `order_item`.
  * `order_item_subtotals` is a `list` of tuples where the first element in each tuple is order_id and the second element is a nested tuple which contains the values from `pending_orders` and `order_item`.
  * We can then use the `myMap` function to get order_item_subtotal and then use `sum` to get the outstanding revenue.
  * Make sure to use `round` to round off to 2 decimals.
* Review the **Details of Data** section to get more details of the columns.
* Use the relevant functions defined at the beginning of this notebook.
  * `myFilter`
  * `myMap`
  * `myReduce`
  * `myReduceByKey`
  * `myJoin`
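
The join step described above can be sketched on a handful of values. The order ids and subtotals below are taken from the previews in this notebook; note that `myJoin` builds a dict from its first argument, so the first list should have unique keys (true for order ids), while the second list may repeat a key.

```python
# Same logic as the myJoin helper defined at the beginning of this notebook
def myJoin(c1, c2):
    c1_dict = dict(c1)
    results = []
    for c2_item in c2:
        if c2_item[0] in c1_dict:
            results.append((c2_item[0], (c1_dict[c2_item[0]], c2_item[1])))
    return results

pending_orders = [('2', 1), ('8', 1)]                  # (order_id, 1)
order_item = [('1', 299.98), ('2', 199.99),
              ('2', 250.0), ('8', 179.97)]             # (order_id, subtotal)

joined = myJoin(pending_orders, order_item)
# Each element is (order_id, (1, subtotal)); order '1' is dropped because it is not pending
print(joined)

subtotals = [rec[1][1] for rec in joined]
total = round(sum(subtotals), 2)
print(total)
```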

In [24]:
def get_pending_order(orders):
    
    order_filter=myFilter(orders,
                         lambda order:order.split(',')[3] in ('PAYMENT_REVIEW', 'PENDING', 'PENDING_PAYMENT','PROCESSING')
                         )
    pending_orders=myMap(order_filter,
                   lambda order:(order.split(',')[0],1)
                   )
    return pending_orders

In [25]:
pending_orders=get_pending_order(orders)

In [26]:
order_items[:5]

['1,1,957,1,299.98,299.98',
 '2,2,1073,1,199.99,199.99',
 '3,2,502,5,250.0,50.0',
 '4,2,403,1,129.99,129.99',
 '5,4,897,2,49.98,24.99']

In [27]:
order_item = myMap(order_items,
                   lambda item: (item.split(',')[1], float(item.split(',')[4]))
                   )

# Reuse pending_orders from the earlier cell instead of filtering the orders again
order_item_subtotals = myJoin(pending_orders, order_item)
order_item_subtotals[:5]

[('2', (1, 199.99)),
 ('2', (1, 250.0)),
 ('2', (1, 129.99)),
 ('8', (1, 179.97)),
 ('8', (1, 299.95))]

In [28]:
# sum_subtotal=myReduceByKey(order_item_subtotals,lambda t,e:(round(t[0] + e[0], 2), t[1] + e[1]))
# sum_subtotal[:10]

In [29]:
# Each joined element is (order_id, (1, order_item_subtotal)); keep only the subtotal
subtotal_list = myMap(order_item_subtotals, lambda rec: rec[1][1])
round(sum(subtotal_list), 2)

15982030.54
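
The exercise statement asks for the outstanding amount for each month, while the cells above compute one overall total. A per-month breakdown could be sketched as follows, assuming the month is the `YYYY-MM` prefix of **order_date**. The function name `get_outstanding_by_month` is hypothetical, the helpers are repeated in compact form so the snippet runs on its own, and the records marked as made-up are not from the data set.

```python
# Compact equivalents of the notebook helpers, repeated so this runs standalone
def myFilter(c, f):
    return [e for e in c if f(e)]

def myMap(c, f):
    return [f(e) for e in c]

def myReduceByKey(p, f):
    p_f = {}
    for k, v in p:
        p_f[k] = f(p_f[k], v) if k in p_f else v
    return list(p_f.items())

def myJoin(c1, c2):
    c1_dict = dict(c1)
    return [(k, (c1_dict[k], v)) for k, v in c2 if k in c1_dict]

def get_outstanding_by_month(orders, order_items):
    statuses = ('PAYMENT_REVIEW', 'PENDING', 'PENDING_PAYMENT', 'PROCESSING')
    pending = myFilter(orders, lambda o: o.split(',')[3] in statuses)
    # (order_id, 'YYYY-MM') instead of (order_id, 1), so the month survives the join
    order_months = myMap(pending, lambda o: (o.split(',')[0], o.split(',')[1][:7]))
    item_subtotals = myMap(order_items,
                           lambda i: (i.split(',')[1], float(i.split(',')[4])))
    joined = myJoin(order_months, item_subtotals)        # (order_id, (month, subtotal))
    month_subtotals = myMap(joined, lambda rec: rec[1])  # (month, subtotal)
    totals = myReduceByKey(month_subtotals, lambda t, e: t + e)
    return {month: round(amount, 2) for month, amount in totals}

sample_orders = [
    '1,2013-07-25 00:00:00.0,11599,CLOSED',
    '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
    '9,2013-08-01 00:00:00.0,300,PROCESSING',   # made-up record
]
sample_items = [
    '1,1,957,1,299.98,299.98',
    '2,2,1073,1,199.99,199.99',
    '3,2,502,5,250.0,50.0',
    '4,2,403,1,129.99,129.99',
    '6,9,1014,2,99.96,49.98',                   # made-up record
]
outstanding = get_outstanding_by_month(sample_orders, sample_items)
print(outstanding)
```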