# Development of Map Reduce APIs
Let us develop our own Map Reduce APIs to understand how they work internally. To follow along, we need to be comfortable passing functions as arguments.

We will provide the code and walk through it to show how the Map Reduce APIs are implemented internally.

* Develop myFilter
* Validate myFilter Function
* Develop myMap
* Validate myMap Function
* Develop myReduce
* Develop myReduceByKey
* Exercises

## Develop myFilter

Develop a function named myFilter which takes a collection and a function as arguments. The function should do the following:
* Iterate through the elements of the collection.
* Apply the condition to each element using the function passed as an argument. We can pass either a named function or a lambda function.
* Return a new collection containing only the elements that satisfy the condition.

In [1]:
def myFilter(c, f):
    c_f = []
    for e in c:
        if f(e):
            c_f.append(e)
    return c_f
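As a quick sanity check, the sketch below compares `myFilter` with Python's built-in `filter` on a small list. The list `numbers` and the predicate `is_even` are made up for illustration and are not part of the course data.

```python
def myFilter(c, f):
    c_f = []
    for e in c:
        if f(e):
            c_f.append(e)
    return c_f

# Hypothetical sample data and predicate, used only for illustration
numbers = [1, 2, 3, 4, 5, 6]

def is_even(n):
    return n % 2 == 0

evens = myFilter(numbers, is_even)
# The built-in filter returns a lazy iterator, so wrap it in list() to compare
evens_builtin = list(filter(is_even, numbers))

print(evens)
print(evens == evens_builtin)
```

Both approaches should give the same elements; the main difference is that `myFilter` builds the whole list eagerly.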

## Validate myFilter function

Use the same examples that were used earlier in Processing Collections using loops.

* Read orders data

In [2]:
orders_path = "/Users/itversity/Research/data/retail_db/orders/part-00000.csv"
orders = open(orders_path). \
    read(). \
    splitlines()
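The cell above never closes the file explicitly. One possible alternative, sketched here, reads the lines inside a `with` block so that the file is closed as soon as reading finishes. The helper name `read_lines` is hypothetical, and the demo writes two sample records to a temporary file so that it runs without the retail_db data.

```python
import os
import tempfile

def read_lines(path):
    # The with block closes the file as soon as the lines are read
    with open(path) as f:
        return f.read().splitlines()

# Demo on a temporary file holding two records in the orders layout
sample = (
    "1,2013-07-25 00:00:00.0,11599,CLOSED\n"
    "2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write(sample)
    tmp_path = tmp.name

sample_orders = read_lines(tmp_path)
os.remove(tmp_path)  # clean up the temporary file

print(sample_orders)
```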

In [3]:
orders[:10]

['1,2013-07-25 00:00:00.0,11599,CLOSED',
 '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
 '3,2013-07-25 00:00:00.0,12111,COMPLETE',
 '4,2013-07-25 00:00:00.0,8827,CLOSED',
 '5,2013-07-25 00:00:00.0,11318,COMPLETE',
 '6,2013-07-25 00:00:00.0,7130,COMPLETE',
 '7,2013-07-25 00:00:00.0,4530,COMPLETE',
 '8,2013-07-25 00:00:00.0,2911,PROCESSING',
 '9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT',
 '10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT']

In [8]:
order = '1,2013-07-25 00:00:00.0,11599,CLOSED'
int(order.split(',')[2]) == 11599

True

* Get orders placed by customer id 12431


In [9]:
customer_orders = myFilter(orders, 
                           lambda order: int(order.split(',')[2]) == 12431
                          )

In [10]:
customer_orders

['3774,2013-08-16 00:00:00.0,12431,CANCELED',
 '3870,2013-08-17 00:00:00.0,12431,PENDING_PAYMENT',
 '4032,2013-08-17 00:00:00.0,12431,ON_HOLD',
 '22812,2013-12-12 00:00:00.0,12431,PENDING',
 '22927,2013-12-13 00:00:00.0,12431,CLOSED',
 '25614,2013-12-30 00:00:00.0,12431,CLOSED',
 '27585,2014-01-12 00:00:00.0,12431,PROCESSING',
 '28244,2014-01-15 00:00:00.0,12431,PENDING_PAYMENT',
 '29109,2014-01-21 00:00:00.0,12431,ON_HOLD',
 '29232,2014-01-21 00:00:00.0,12431,ON_HOLD',
 '45894,2014-05-06 00:00:00.0,12431,CLOSED',
 '46217,2014-05-07 00:00:00.0,12431,CLOSED',
 '49678,2014-05-31 00:00:00.0,12431,PENDING',
 '51865,2014-06-15 00:00:00.0,12431,PROCESSING',
 '63146,2014-02-13 00:00:00.0,12431,PENDING_PAYMENT',
 '67110,2014-07-14 00:00:00.0,12431,PENDING']

* Get orders placed by customer id 12431 in January 2014

In [12]:
customer_orders_for_month = myFilter(orders, 
                           lambda order: int(order.split(',')[2]) == 12431
                                     and order.split(',')[1].startswith('2014-01')
                          )
customer_orders_for_month

['27585,2014-01-12 00:00:00.0,12431,PROCESSING',
 '28244,2014-01-15 00:00:00.0,12431,PENDING_PAYMENT',
 '29109,2014-01-21 00:00:00.0,12431,ON_HOLD',
 '29232,2014-01-21 00:00:00.0,12431,ON_HOLD']

* Get orders placed by customer id 12431 in January 2014 with status PROCESSING or PENDING_PAYMENT

In [13]:
customer_orders_for_month = myFilter(orders, 
                           lambda order: int(order.split(',')[2]) == 12431
                                     and order.split(',')[1].startswith('2014-01')
                                     and order.split(',')[3] in ('PENDING_PAYMENT', 'PROCESSING')
                          )

In [14]:
customer_orders_for_month

['27585,2014-01-12 00:00:00.0,12431,PROCESSING',
 '28244,2014-01-15 00:00:00.0,12431,PENDING_PAYMENT']
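The lambdas above split the same record up to three times. Since `myFilter` accepts any function, a named function can be passed instead, and it can split each record only once. The sketch below assumes a hypothetical function name `is_open_jan_2014_order` and a three-record `sample_orders` list copied from the outputs above.

```python
def myFilter(c, f):
    c_f = []
    for e in c:
        if f(e):
            c_f.append(e)
    return c_f

# Three records copied from the outputs above, used only for illustration
sample_orders = [
    '27585,2014-01-12 00:00:00.0,12431,PROCESSING',
    '29109,2014-01-21 00:00:00.0,12431,ON_HOLD',
    '45894,2014-05-06 00:00:00.0,12431,CLOSED',
]

def is_open_jan_2014_order(order):
    # Split once and unpack the four fields of an order record
    order_id, order_date, customer_id, status = order.split(',')
    return (int(customer_id) == 12431
            and order_date.startswith('2014-01')
            and status in ('PENDING_PAYMENT', 'PROCESSING'))

result = myFilter(sample_orders, is_open_jan_2014_order)
print(result)
```

A named function is also easier to test on a single record before passing it to `myFilter`.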

## Develop myMap

Develop a function named myMap which takes a collection and a function as arguments. The function should do the following:
* Iterate through the elements of the collection.
* Apply the transformation logic to each element using the function passed as an argument.
* Return a new collection containing the transformed elements.

In [15]:
def myMap(c, f):
    c_t = []
    for e in c:
        c_t.append(f(e))
    return c_t
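For comparison, the sketch below runs the same transformation through `myMap`, the built-in `map`, and a list comprehension. The list `values` is a made-up sample.

```python
def myMap(c, f):
    c_t = []
    for e in c:
        c_t.append(f(e))
    return c_t

# Hypothetical sample data, used only for illustration
values = [1, 2, 3]

squares = myMap(values, lambda e: e * e)
# The built-in map is lazy, so wrap it in list() to materialize the result
squares_builtin = list(map(lambda e: e * e, values))
# A list comprehension expresses the same transformation inline
squares_comprehension = [e * e for e in values]

print(squares, squares_builtin, squares_comprehension)
```

All three should produce one output element per input element, which is the defining property of a map operation.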

## Validate myMap function
* Create a list of the numbers from 1 to 9 and return the square of each number.

In [16]:
l = list(range(1, 10))
l

[1, 2, 3, 4, 5, 6, 7, 8, 9]

In [17]:
myMap(l, lambda e: e * e)

[1, 4, 9, 16, 25, 36, 49, 64, 81]

* Use orders to extract the order dates, then apply set to get only the unique dates.

In [18]:
orders_path = "/Users/itversity/Research/data/retail_db/orders/part-00000.csv"
orders = open(orders_path). \
    read(). \
    splitlines()

In [19]:
orders[:10]

['1,2013-07-25 00:00:00.0,11599,CLOSED',
 '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
 '3,2013-07-25 00:00:00.0,12111,COMPLETE',
 '4,2013-07-25 00:00:00.0,8827,CLOSED',
 '5,2013-07-25 00:00:00.0,11318,COMPLETE',
 '6,2013-07-25 00:00:00.0,7130,COMPLETE',
 '7,2013-07-25 00:00:00.0,4530,COMPLETE',
 '8,2013-07-25 00:00:00.0,2911,PROCESSING',
 '9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT',
 '10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT']

In [20]:
order = '1,2013-07-25 00:00:00.0,11599,CLOSED'
order.split(',')[1]

'2013-07-25 00:00:00.0'

In [21]:
order_dates = myMap(orders,
                    lambda order: order.split(',')[1]
                   )
order_dates[:10]

['2013-07-25 00:00:00.0',
 '2013-07-25 00:00:00.0',
 '2013-07-25 00:00:00.0',
 '2013-07-25 00:00:00.0',
 '2013-07-25 00:00:00.0',
 '2013-07-25 00:00:00.0',
 '2013-07-25 00:00:00.0',
 '2013-07-25 00:00:00.0',
 '2013-07-25 00:00:00.0',
 '2013-07-25 00:00:00.0']

In [22]:
len(orders)

68883

In [23]:
len(order_dates)

68883

In [None]:
set(order_dates)

In [25]:
len(set(order_dates))

364

* Use orders and extract order_id as well as order_date from each element in the form of a tuple. Make sure that order_id is of type int.

In [26]:
orders_path = "/Users/itversity/Research/data/retail_db/orders/part-00000.csv"
orders = open(orders_path). \
    read(). \
    splitlines()

In [27]:
orders[:10]

['1,2013-07-25 00:00:00.0,11599,CLOSED',
 '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
 '3,2013-07-25 00:00:00.0,12111,COMPLETE',
 '4,2013-07-25 00:00:00.0,8827,CLOSED',
 '5,2013-07-25 00:00:00.0,11318,COMPLETE',
 '6,2013-07-25 00:00:00.0,7130,COMPLETE',
 '7,2013-07-25 00:00:00.0,4530,COMPLETE',
 '8,2013-07-25 00:00:00.0,2911,PROCESSING',
 '9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT',
 '10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT']

In [None]:
[(1, '2013-07-25 00:00:00.0'), (2, '2013-07-25 00:00:00.0')]

In [29]:
order = '1,2013-07-25 00:00:00.0,11599,CLOSED'
(int(order.split(',')[0]), order.split(',')[1])

(1, '2013-07-25 00:00:00.0')

In [30]:
order_tuples = myMap(orders,
                     lambda order: (int(order.split(',')[0]), order.split(',')[1])
                    )

In [31]:
order_tuples[:10]

[(1, '2013-07-25 00:00:00.0'),
 (2, '2013-07-25 00:00:00.0'),
 (3, '2013-07-25 00:00:00.0'),
 (4, '2013-07-25 00:00:00.0'),
 (5, '2013-07-25 00:00:00.0'),
 (6, '2013-07-25 00:00:00.0'),
 (7, '2013-07-25 00:00:00.0'),
 (8, '2013-07-25 00:00:00.0'),
 (9, '2013-07-25 00:00:00.0'),
 (10, '2013-07-25 00:00:00.0')]

## Develop myReduce
Develop a function named myReduce which takes a collection and a function as arguments. The function should do the following:
* Iterate through the elements of the collection.
* Perform the aggregation using the function passed as an argument. That function takes the running result and the next element, and contains the necessary arithmetic logic.
* Return the aggregated result.

In [32]:
l = [1, 4, 6, 2, 5]

In [34]:
l[1:]

[4, 6, 2, 5]

In [35]:
def myReduce(c, f):
    t = c[0]
    for e in c[1:]:
        t = f(t, e)
    return t

In [36]:
myReduce(l, lambda t, e: t + e)

18

In [37]:
myReduce(l, lambda t, e: t * e)

240

In [39]:
min(7, 5)

5

In [40]:
myReduce(l, lambda t, e: min(t, e))

1

In [41]:
myReduce(l, lambda t, e: max(t, e))

6
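The standard library offers a similar function, `functools.reduce`. The sketch below compares it with `myReduce` on the same list; the variable names `total`, `total_builtin` and `empty_total` are chosen here for illustration.

```python
from functools import reduce

def myReduce(c, f):
    t = c[0]
    for e in c[1:]:
        t = f(t, e)
    return t

l = [1, 4, 6, 2, 5]

total = myReduce(l, lambda t, e: t + e)
# functools.reduce takes the function first and the collection second
total_builtin = reduce(lambda t, e: t + e, l)

# Both versions fail on an empty collection. functools.reduce accepts an
# optional initial value as a third argument, which covers that case.
empty_total = reduce(lambda t, e: t + e, [], 0)

print(total, total_builtin, empty_total)
```

Note that `myReduce` as written relies on indexing and slicing, so it expects a sequence such as a list rather than an arbitrary iterable.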

In [42]:
order_items_path = "/Users/itversity/Research/data/retail_db/order_items/part-00000.csv"
order_items = open(order_items_path). \
    read(). \
    splitlines()

In [43]:
order_items[:10]

['1,1,957,1,299.98,299.98',
 '2,2,1073,1,199.99,199.99',
 '3,2,502,5,250.0,50.0',
 '4,2,403,1,129.99,129.99',
 '5,4,897,2,49.98,24.99',
 '6,4,365,5,299.95,59.99',
 '7,4,502,3,150.0,50.0',
 '8,4,1014,4,199.92,49.98',
 '9,5,957,1,299.98,299.98',
 '10,5,365,5,299.95,59.99']

In [46]:
order_item = '2,2,1073,1,199.99,199.99'
int(order_item.split(',')[1]) == 2

True

In [47]:
order_items_filtered = myFilter(order_items,
                                lambda order_item: int(order_item.split(',')[1]) == 2
                               )

In [48]:
order_items_filtered

['2,2,1073,1,199.99,199.99', '3,2,502,5,250.0,50.0', '4,2,403,1,129.99,129.99']

In [49]:
order_item = '2,2,1073,1,199.99,199.99'
float(order_item.split(',')[4])

199.99

In [50]:
order_item_subtotals = myMap(order_items_filtered,
                             lambda order_item: float(order_item.split(',')[4])
                            )

In [54]:
order_item_subtotals

[199.99, 250.0, 129.99]

In [53]:
sum(order_item_subtotals)

579.98

In [55]:
myReduce(order_item_subtotals, lambda t, e: t + e)

579.98

In [56]:
myReduce(order_item_subtotals, lambda t, e: min(t, e))

129.99
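The filter, map and reduce steps above can also be combined into one helper. This is only a sketch: the function name `get_order_revenue` is hypothetical, the sample records are the first five rows of order_items shown earlier, and the helper assumes the order has at least one item (otherwise `myReduce` fails on the empty list).

```python
def myFilter(c, f):
    c_f = []
    for e in c:
        if f(e):
            c_f.append(e)
    return c_f

def myMap(c, f):
    c_t = []
    for e in c:
        c_t.append(f(e))
    return c_t

def myReduce(c, f):
    t = c[0]
    for e in c[1:]:
        t = f(t, e)
    return t

# First five order_items records from the output shown earlier
sample_order_items = [
    '1,1,957,1,299.98,299.98',
    '2,2,1073,1,199.99,199.99',
    '3,2,502,5,250.0,50.0',
    '4,2,403,1,129.99,129.99',
    '5,4,897,2,49.98,24.99',
]

def get_order_revenue(order_items, order_id):
    # Keep only the items of the given order
    items = myFilter(order_items,
                     lambda oi: int(oi.split(',')[1]) == order_id)
    # Extract the subtotal (fifth field) as a float
    subtotals = myMap(items, lambda oi: float(oi.split(',')[4]))
    # Add up the subtotals and round away floating point noise
    return round(myReduce(subtotals, lambda t, e: t + e), 2)

revenue = get_order_revenue(sample_order_items, 2)
print(revenue)
```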

## Develop myReduceByKey
Develop a function named myReduceByKey which takes a collection of tuples and a function as arguments. Each tuple in the collection should have exactly 2 elements: a key and a value. The function should do the following:
* Iterate through the collection of tuples.
* Group the data by the first element (the key) of each tuple and aggregate the second elements (the values) using the function passed as an argument. That function should contain the necessary arithmetic logic.
* Return a collection of tuples, where the first element is a unique key and the second element is the aggregated result for that key.

In [57]:
d = {}
d[2] = 199.99

In [58]:
d

{2: 199.99}

In [59]:
if 2 in d: d[2] = d[2] + 250.0

In [60]:
d

{2: 449.99}

In [61]:
if 4 in d: d[4] = d[4] + 100
else: d[4] = 100

In [62]:
d

{2: 449.99, 4: 100}

In [69]:
def myReduceByKey(c_p, f):
    d = {}
    for e in c_p:
        if e[0] in d:
            d[e[0]] = f(d[e[0]], e[1])
        else:
            d[e[0]] = e[1]
    return list(d.items())
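The dictionary-based approach is one way to group by key. As an alternative sketch, the same result can be obtained by sorting the pairs and combining `itertools.groupby` with `functools.reduce`. The function name `reduce_by_key_sorted` and the sample `pairs` are assumptions made for this illustration.

```python
from functools import reduce
from itertools import groupby

def myReduceByKey(c_p, f):
    d = {}
    for e in c_p:
        if e[0] in d:
            d[e[0]] = f(d[e[0]], e[1])
        else:
            d[e[0]] = e[1]
    return list(d.items())

def reduce_by_key_sorted(c_p, f):
    # groupby only merges adjacent items, so sort by key first
    ordered = sorted(c_p, key=lambda e: e[0])
    return [(k, reduce(f, [e[1] for e in grp]))
            for k, grp in groupby(ordered, key=lambda e: e[0])]

# Hypothetical sample pairs, used only for illustration
pairs = [('b', 1), ('a', 2), ('b', 3), ('a', 4)]

by_dict = myReduceByKey(pairs, lambda t, e: t + e)
by_groupby = reduce_by_key_sorted(pairs, lambda t, e: t + e)

print(by_dict)
print(by_groupby)
```

The dictionary version keeps keys in the order they are first seen and needs no sort, while the `groupby` version returns keys in sorted order.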

* Use the function to get the count by date from orders.

In [70]:
orders_path = "/Users/itversity/Research/data/retail_db/orders/part-00000.csv"
orders = open(orders_path). \
    read(). \
    splitlines()

In [71]:
orders[:10]

['1,2013-07-25 00:00:00.0,11599,CLOSED',
 '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
 '3,2013-07-25 00:00:00.0,12111,COMPLETE',
 '4,2013-07-25 00:00:00.0,8827,CLOSED',
 '5,2013-07-25 00:00:00.0,11318,COMPLETE',
 '6,2013-07-25 00:00:00.0,7130,COMPLETE',
 '7,2013-07-25 00:00:00.0,4530,COMPLETE',
 '8,2013-07-25 00:00:00.0,2911,PROCESSING',
 '9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT',
 '10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT']

In [72]:
orders_map = myMap(orders, 
                   lambda order: (order.split(',')[1], 1)
                  )
orders_map[:10]

[('2013-07-25 00:00:00.0', 1),
 ('2013-07-25 00:00:00.0', 1),
 ('2013-07-25 00:00:00.0', 1),
 ('2013-07-25 00:00:00.0', 1),
 ('2013-07-25 00:00:00.0', 1),
 ('2013-07-25 00:00:00.0', 1),
 ('2013-07-25 00:00:00.0', 1),
 ('2013-07-25 00:00:00.0', 1),
 ('2013-07-25 00:00:00.0', 1),
 ('2013-07-25 00:00:00.0', 1)]

In [73]:
order_count_by_date = myReduceByKey(orders_map, 
                                    lambda t, e: t + e
                                   )

In [74]:
order_count_by_date[:10]

[('2013-07-25 00:00:00.0', 143),
 ('2013-07-26 00:00:00.0', 269),
 ('2013-07-27 00:00:00.0', 202),
 ('2013-07-28 00:00:00.0', 187),
 ('2013-07-29 00:00:00.0', 253),
 ('2013-07-30 00:00:00.0', 227),
 ('2013-07-31 00:00:00.0', 252),
 ('2013-08-01 00:00:00.0', 246),
 ('2013-08-02 00:00:00.0', 224),
 ('2013-08-03 00:00:00.0', 183)]

* Use the function to get the revenue for each order id.

In [75]:
order_items_path = "/Users/itversity/Research/data/retail_db/order_items/part-00000.csv"
order_items = open(order_items_path). \
    read(). \
    splitlines()

In [76]:
order_items[:10]

['1,1,957,1,299.98,299.98',
 '2,2,1073,1,199.99,199.99',
 '3,2,502,5,250.0,50.0',
 '4,2,403,1,129.99,129.99',
 '5,4,897,2,49.98,24.99',
 '6,4,365,5,299.95,59.99',
 '7,4,502,3,150.0,50.0',
 '8,4,1014,4,199.92,49.98',
 '9,5,957,1,299.98,299.98',
 '10,5,365,5,299.95,59.99']

In [77]:
order_items_map = myMap(order_items,
                        lambda order_item: (int(order_item.split(',')[1]),
                                            float(order_item.split(',')[4])
                                           )
                       )

In [78]:
order_items_map[:10]

[(1, 299.98),
 (2, 199.99),
 (2, 250.0),
 (2, 129.99),
 (4, 49.98),
 (4, 299.95),
 (4, 150.0),
 (4, 199.92),
 (5, 299.98),
 (5, 299.95)]

In [81]:
revenue_per_order = myReduceByKey(order_items_map,
                                  lambda t, e: round(t + e, 2)
                                 )

In [82]:
revenue_per_order[:10]

[(1, 299.98),
 (2, 579.98),
 (4, 699.85),
 (5, 1129.86),
 (7, 579.92),
 (8, 729.84),
 (9, 599.96),
 (10, 651.92),
 (11, 919.79),
 (12, 1299.87)]

In [84]:
myReduceByKey(order_items_map,
              lambda t, e: min(t, e)
             )[:10]

[(1, 299.98),
 (2, 129.99),
 (4, 49.98),
 (5, 99.96),
 (7, 79.95),
 (8, 50.0),
 (9, 199.98),
 (10, 21.99),
 (11, 49.98),
 (12, 100.0)]

In [86]:
order_items_map = myMap(order_items,
                        lambda order_item: (int(order_item.split(',')[1]),
                                            (float(order_item.split(',')[4]), 1)
                                           )
                       )

In [89]:
order_items_map[:10]

[(1, (299.98, 1)),
 (2, (199.99, 1)),
 (2, (250.0, 1)),
 (2, (129.99, 1)),
 (4, (49.98, 1)),
 (4, (299.95, 1)),
 (4, (150.0, 1)),
 (4, (199.92, 1)),
 (5, (299.98, 1)),
 (5, (299.95, 1))]

In [None]:
[2, [(199.99, 1), (250.0, 1), (129.99, 1)]]

In [91]:
t1 = (199.99, 1)
t2 = (250.0, 1)
(t1[0] + t2[0], t1[1] + t2[1])

(449.99, 2)

In [88]:
myReduceByKey(order_items_map,
              lambda t, e: (round(t[0] + e[0], 2), t[1] + e[1])
             )[:10]

[(1, (299.98, 1)),
 (2, (579.98, 3)),
 (4, (699.85, 4)),
 (5, (1129.86, 5)),
 (7, (579.92, 3)),
 (8, (729.84, 4)),
 (9, (599.96, 3)),
 (10, (651.92, 5)),
 (11, (919.79, 5)),
 (12, (1299.87, 5))]
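Carrying the revenue and the item count together makes it possible to derive an average per order with one more map step. The sketch below assumes a variable named `revenue_and_count` holding the first two tuples of the output above; the name `avg_revenue_per_item` is also chosen here for illustration.

```python
def myMap(c, f):
    c_t = []
    for e in c:
        c_t.append(f(e))
    return c_t

# First two (order_id, (revenue, count)) tuples from the output above
revenue_and_count = [(1, (299.98, 1)), (2, (579.98, 3))]

# Divide the total revenue by the item count for each order
avg_revenue_per_item = myMap(
    revenue_and_count,
    lambda e: (e[0], round(e[1][0] / e[1][1], 2))
)

print(avg_revenue_per_item)
```

An average cannot be computed by reducing the values directly, because averaging two averages does not weight them by count; reducing (sum, count) pairs and dividing at the end avoids that problem.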

## Exercises
Here are the same exercises that you have solved before. Try to solve them using the Map Reduce APIs developed above.
* The script below contains all of the above Map Reduce APIs. Use it as a module to solve the problems listed after it.
  * Create a file with name `mymapreduce.py` and copy the script into it.
  * Import and use it with `from mymapreduce import *`.

In [None]:
def myFilter(c, f):
    c_f = []
    for e in c:
        if f(e):
            c_f.append(e)
    return c_f

def myMap(c, f):
    c_f = []
    for e in c:
        c_f.append(f(e))
    return c_f

def myReduce(c, f):
    t = c[0]
    for e in c[1:]:
        t = f(t, e)
    return t

def myReduceByKey(p, f):
    p_f = {}
    for e in p:
        if e[0] in p_f:
            p_f[e[0]] = f(p_f[e[0]], e[1])
        else:
            p_f[e[0]] = e[1]
    return list(p_f.items())

* Get the number of COMPLETE orders placed by each customer.
* Get the total number of PENDING or PENDING_PAYMENT orders for January 2014.
* Get the outstanding amount for each month, considering orders with status PAYMENT_REVIEW, PENDING, PENDING_PAYMENT and PROCESSING.