# Manipulating Collections using Map Reduce APIs – Python 3

Now that we understand collections and how to manipulate them using traditional loops, let us look at existing APIs such as map, filter and reduce for processing collection data.

* Define problem statements
* Develop myFilter, myMap and myReduce APIs
* Understanding existing packages and APIs
* Developing Solutions using Map Reduce APIs

### Define Problem Statements
Let us look at a few similar problem statements and understand how we can build solutions using conventional loops.

* Filtering
    * Get COMPLETE orders from orders data set
    * Get orders placed on 2013-07-25
    * Get order items for given order id
    * In all 3 cases we need to iterate through the collection, filter based on the criteria and return a new collection

In [2]:
#01-loops-filtering-by-order-status.py 
ordersPath = "/data/retail_db/orders/part-00000"

def readData(dataPath):
    # Read the whole file and return its lines; the with block closes the file
    with open(dataPath) as dataFile:
        return dataFile.read().splitlines()

orders = readData(ordersPath)

ordersFiltered = []
for order in orders:
    if(order.split(",")[3] == "COMPLETE"):
        ordersFiltered.append(order)
        
ordersFiltered[:10]

['3,2013-07-25 00:00:00.0,12111,COMPLETE',
 '5,2013-07-25 00:00:00.0,11318,COMPLETE',
 '6,2013-07-25 00:00:00.0,7130,COMPLETE',
 '7,2013-07-25 00:00:00.0,4530,COMPLETE',
 '15,2013-07-25 00:00:00.0,2568,COMPLETE',
 '17,2013-07-25 00:00:00.0,2667,COMPLETE',
 '22,2013-07-25 00:00:00.0,333,COMPLETE',
 '26,2013-07-25 00:00:00.0,7562,COMPLETE',
 '28,2013-07-25 00:00:00.0,656,COMPLETE',
 '32,2013-07-25 00:00:00.0,3960,COMPLETE']

In [3]:
# 02-loops-filtering-by-order-date.py
ordersPath = "/data/retail_db/orders/part-00000"

def readData(dataPath):
    # Read the whole file and return its lines; the with block closes the file
    with open(dataPath) as dataFile:
        return dataFile.read().splitlines()

orders = readData(ordersPath)

ordersFiltered = []
for order in orders:
    if(order.split(",")[1] == "2013-07-25 00:00:00.0"):
        ordersFiltered.append(order)
        
ordersFiltered[:10]

['1,2013-07-25 00:00:00.0,11599,CLOSED',
 '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
 '3,2013-07-25 00:00:00.0,12111,COMPLETE',
 '4,2013-07-25 00:00:00.0,8827,CLOSED',
 '5,2013-07-25 00:00:00.0,11318,COMPLETE',
 '6,2013-07-25 00:00:00.0,7130,COMPLETE',
 '7,2013-07-25 00:00:00.0,4530,COMPLETE',
 '8,2013-07-25 00:00:00.0,2911,PROCESSING',
 '9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT',
 '10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT']

In [5]:
# 03-loops-filtering-by-order-id.py 
orderItemsPath = "/data/retail_db/order_items/part-00000"

def readData(dataPath):
    # Read the whole file and return its lines; the with block closes the file
    with open(dataPath) as dataFile:
        return dataFile.read().splitlines()

orderItems = readData(orderItemsPath)

orderItemsFiltered = []
for orderItem in orderItems:
    if(int(orderItem.split(",")[1]) == 2):
        orderItemsFiltered.append(orderItem)
        
orderItemsFiltered[:10]

['2,2,1073,1,199.99,199.99', '3,2,502,5,250.0,50.0', '4,2,403,1,129.99,129.99']
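
The three loops above share one shape: iterate, test a condition, collect the matches. As a minimal sketch of that shared shape, the same filters can be written as list comprehensions. The records are a small hard-coded sample (an assumption, so the snippet runs without the `/data/retail_db` files); most rows are copied from the outputs above, and the one placeholder row is marked in a comment.

```python
# Sample orders copied from the outputs shown above
orders = [
    "1,2013-07-25 00:00:00.0,11599,CLOSED",
    "2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT",
    "3,2013-07-25 00:00:00.0,12111,COMPLETE",
    "5,2013-07-25 00:00:00.0,11318,COMPLETE",
]
orderItems = [
    # Placeholder row: only order id 1 and subtotal 299.98 appear in the
    # outputs above; the remaining fields are made up for illustration
    "1,1,0,1,299.98,299.98",
    "2,2,1073,1,199.99,199.99",
    "3,2,502,5,250.0,50.0",
    "4,2,403,1,129.99,129.99",
]

# Same criteria as the loops, one comprehension per problem statement
completeOrders = [o for o in orders if o.split(",")[3] == "COMPLETE"]
ordersOnDate = [o for o in orders if o.split(",")[1] == "2013-07-25 00:00:00.0"]
itemsForOrder = [oi for oi in orderItems if int(oi.split(",")[1]) == 2]

print(completeOrders)
print(len(ordersOnDate))
print(itemsForOrder)
```

Only the condition after `if` changes between the three cases, which is what makes a generic filter function possible.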

* Mapping
    * Get order_id and order_status from orders (1st and 4th fields of orders data)
    * Get order_item_order_id and order_item_subtotal from order_items (2nd and 5th fields of order_items data)
    * Get order_month from orders data (extract year and month from 2nd field)
    * In all 3 cases we need to iterate through collection, transform individual records and add them to new collection

In [7]:
# 01-loops-get-order-id-and-order-status.py
ordersPath = "/data/retail_db/orders/part-00000"

def readData(dataPath):
    # Read the whole file and return its lines; the with block closes the file
    with open(dataPath) as dataFile:
        return dataFile.read().splitlines()

orders = readData(ordersPath)

ordersMap = []
for order in orders:
    ordersMap.append((int(order.split(",")[0]), order.split(",")[3]))
        
ordersMap[:10]


[(1, 'CLOSED'),
 (2, 'PENDING_PAYMENT'),
 (3, 'COMPLETE'),
 (4, 'CLOSED'),
 (5, 'COMPLETE'),
 (6, 'COMPLETE'),
 (7, 'COMPLETE'),
 (8, 'PROCESSING'),
 (9, 'PENDING_PAYMENT'),
 (10, 'PENDING_PAYMENT')]

In [6]:
# 02-loops-get-order-id-and-subtotal.py
orderItemsPath = "/data/retail_db/order_items/part-00000"

def readData(dataPath):
    # Read the whole file and return its lines; the with block closes the file
    with open(dataPath) as dataFile:
        return dataFile.read().splitlines()

orderItems = readData(orderItemsPath)

orderItemsMap = []
for orderItem in orderItems:
    orderItemsMap.append((int(orderItem.split(",")[1]), float(orderItem.split(",")[4])))
        
orderItemsMap[:10]

[(1, 299.98),
 (2, 199.99),
 (2, 250.0),
 (2, 129.99),
 (4, 49.98),
 (4, 299.95),
 (4, 150.0),
 (4, 199.92),
 (5, 299.98),
 (5, 299.95)]

In [1]:
# 03-loops-get-order-month.py 
ordersPath = "/data/retail_db/orders/part-00000"

def readData(dataPath):
    # Read the whole file and return its lines; the with block closes the file
    with open(dataPath) as dataFile:
        return dataFile.read().splitlines()

orders = readData(ordersPath)

ordersMap = []
for order in orders:
    ordersMap.append(order.split(",")[1][:7])
        
ordersMap[:10]

['2013-07',
 '2013-07',
 '2013-07',
 '2013-07',
 '2013-07',
 '2013-07',
 '2013-07',
 '2013-07',
 '2013-07',
 '2013-07']
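
The mapping loops above call `split(",")` more than once per record. As a small sketch (not part of the original scripts), the record can be split once and the pieces reused. The sample records are copied from the outputs above so the snippet runs without the data files.

```python
# Sample orders copied from the outputs shown above
orders = [
    "1,2013-07-25 00:00:00.0,11599,CLOSED",
    "2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT",
    "3,2013-07-25 00:00:00.0,12111,COMPLETE",
]

orderIdAndStatus = []
orderMonths = []
for order in orders:
    fields = order.split(",")            # split once, reuse the pieces
    orderIdAndStatus.append((int(fields[0]), fields[3]))
    orderMonths.append(fields[1][:7])    # first 7 characters: YYYY-MM

print(orderIdAndStatus)
print(orderMonths)
```

Every input record produces exactly one output element, so the mapped collections always have the same length as `orders`.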

* Reduce (on order item subtotals that are filtered by order_id and then mapped)
    * Get total revenue by adding all the revenues
    * Get minimum of order item subtotal
    * Get maximum of order item subtotal
    * In all 3 cases we need to initialize an aggregator, loop through the values in the collection and combine each value with the aggregator (add it for the total, compare it for the minimum and maximum)

In [10]:
#01-loops-get-data-for-aggregations.py
orderItemsPath = "/data/retail_db/order_items/part-00000"

def readData(dataPath):
    # Read the whole file and return its lines; the with block closes the file
    with open(dataPath) as dataFile:
        return dataFile.read().splitlines()

def getOrderItemsFiltered(orderItems, orderId):
    orderItemsFiltered = []
    for orderItem in orderItems:
        if(int(orderItem.split(",")[1]) == orderId):
            orderItemsFiltered.append(orderItem)
    return orderItemsFiltered

def getOrderItemsMap(orderItemsFiltered):
    orderItemsMap = []
    for orderItem in orderItemsFiltered:
        orderItemsMap.append(float(orderItem.split(",")[4]))
    return orderItemsMap

orderItems = readData(orderItemsPath)
orderItemsFiltered = getOrderItemsFiltered(orderItems, 2)
orderItemsMap = getOrderItemsMap(orderItemsFiltered)

In [3]:
#02-loops-get-total-revenue.py
totalRevenue = orderItemsMap[0]
for orderItemSubtotal in orderItemsMap[1:]:
    totalRevenue += orderItemSubtotal

totalRevenue

579.98

In [4]:
#03-loops-get-min-revenue.py
minRevenue = orderItemsMap[0]
for orderItemSubtotal in orderItemsMap[1:]:
    minRevenue = minRevenue if(minRevenue < orderItemSubtotal) else orderItemSubtotal

minRevenue

129.99

In [5]:
#04-loops-get-max-revenue.py
maxRevenue = orderItemsMap[0]
for orderItemSubtotal in orderItemsMap[1:]:
    maxRevenue = maxRevenue if(maxRevenue > orderItemSubtotal) else orderItemSubtotal

maxRevenue

250.0
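
A quick way to cross-check the three aggregation loops is to compare them with Python's built-in `sum`, `min` and `max`. The sketch below hard-codes the subtotals of order 2 taken from the outputs above, so it runs without the data files.

```python
# Subtotals of order id 2, as shown in the filtered output above
orderItemsMap = [199.99, 250.0, 129.99]

totalRevenue = sum(orderItemsMap)   # same role as the += loop
minRevenue = min(orderItemsMap)     # same role as the "<" comparison loop
maxRevenue = max(orderItemsMap)     # same role as the ">" comparison loop

# Round the total for display, since floating point sums can carry tiny errors
print(round(totalRevenue, 2), minRevenue, maxRevenue)
```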

### Develop myFilter, myMap and myReduce APIs
Now let us see how we can leverage lambda functions to develop generic functions to filter data, to apply transformations (mapping) and to perform aggregations using reduce.
* myFilter function
    * Define function with two arguments
    * first argument – function with one argument, typically passed as a lambda (at run time we pass a code snippet which returns True or False)
    * second argument – collection
    * Develop the logic which will iterate through the elements in the collection, apply the passed filter criteria and add the elements which satisfy the criteria to a new collection.
    * Here is the code and also sample invocations covering all 3 scenarios discussed above.

In [8]:
#myFilter.py
def readData(dataPath):
    # Read the whole file and return its lines; the with block closes the file
    with open(dataPath) as dataFile:
        return dataFile.read().splitlines()

def myFilter(f, c):
    newC = []
    for i in c:
        if(f(i)):
            newC.append(i)
    return newC

ordersPath = "/data/retail_db/orders/part-00000"
orders = readData(ordersPath)
ordersCompleted = myFilter(lambda o: o.split(",")[3] == "COMPLETE", orders)
ordersCompleted[:10]

ordersFilteredByDate = myFilter(lambda o: o.split(",")[1] == "2013-07-25 00:00:00.0", orders)
ordersFilteredByDate[:10]

orderItemsPath = '/data/retail_db/order_items/part-00000'
orderItems = readData(orderItemsPath)
orderItemsFiltered = myFilter(lambda o: int(o.split(",")[1]) == 2, orderItems)
orderItemsFiltered

['2,2,1073,1,199.99,199.99', '3,2,502,5,250.0,50.0', '4,2,403,1,129.99,129.99']
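
The first argument of `myFilter` does not have to be a lambda; any function that takes one element and returns True or False works. Here is a minimal sketch using a named predicate on a small in-memory sample copied from the outputs above. The helper name `isComplete` is hypothetical and only for illustration.

```python
def myFilter(f, c):
    newC = []
    for i in c:
        if f(i):
            newC.append(i)
    return newC

# Hypothetical named predicate, equivalent to the lambda used above
def isComplete(order):
    return order.split(",")[3] == "COMPLETE"

# Sample orders copied from the outputs shown earlier
orders = [
    "1,2013-07-25 00:00:00.0,11599,CLOSED",
    "3,2013-07-25 00:00:00.0,12111,COMPLETE",
    "8,2013-07-25 00:00:00.0,2911,PROCESSING",
]

viaNamedFunction = myFilter(isComplete, orders)
viaLambda = myFilter(lambda o: o.split(",")[3] == "COMPLETE", orders)
print(viaNamedFunction)
print(viaNamedFunction == viaLambda)
```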

* myMap function
    * Define function with two arguments
    * first argument – lambda function with one argument (at run time we pass a code snippet which transforms one record into another)
    * second argument – collection
    * Develop the logic which will iterate through the elements in the collection, apply the passed transformation rule to each element and add the transformed elements to a new collection.

In [9]:
#myMap.py
def readData(dataPath):
    # Read the whole file and return its lines; the with block closes the file
    with open(dataPath) as dataFile:
        return dataFile.read().splitlines()

def myMap(f, c):
    newC = []
    for i in c:
        newC.append(f(i))
    return newC

ordersPath = "/data/retail_db/orders/part-00000"
orders = readData(ordersPath)
orderIdAndStatus = myMap(lambda o: (int(o.split(",")[0]), o.split(",")[3]), orders)
orderIdAndStatus[:10]

orderItemsPath = '/data/retail_db/order_items/part-00000'
orderItems = readData(orderItemsPath)
orderIdAndSubtotal = myMap(lambda oi: (int(oi.split(",")[1]), float(oi.split(",")[4])), orderItems)
orderIdAndSubtotal[:10]

ordersPath = "/data/retail_db/orders/part-00000"
orders = readData(ordersPath)
orderMonths = myMap(lambda o: o.split(",")[1][:7], orders)
orderMonths[:10]


['2013-07',
 '2013-07',
 '2013-07',
 '2013-07',
 '2013-07',
 '2013-07',
 '2013-07',
 '2013-07',
 '2013-07',
 '2013-07']
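
The transformation passed to `myMap` can return any shape, not only tuples or strings. As an illustrative sketch, the hypothetical helper `toOrderDict` below turns each record into a dict. The key `order_date` is an assumed name for the 2nd field; `order_id` and `order_status` follow the names used in the problem statements.

```python
def myMap(f, c):
    newC = []
    for i in c:
        newC.append(f(i))
    return newC

# Hypothetical helper: parse one comma separated order record into a dict
def toOrderDict(order):
    fields = order.split(",")
    return {
        "order_id": int(fields[0]),
        "order_date": fields[1],      # assumed field name
        "order_status": fields[3],
    }

# Sample orders copied from the outputs shown earlier
orders = [
    "1,2013-07-25 00:00:00.0,11599,CLOSED",
    "2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT",
]

orderDicts = myMap(toOrderDict, orders)
print(orderDicts[0])
```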

* myReduce function
    * Define function with two arguments
    * first argument – lambda function with 2 arguments (at run time we need to pass logic which combines the two values, such as addition or comparison)
    * second argument – collection
    * Develop the logic which will iterate through the elements in collection, apply aggregation and return one value.

In [10]:
#myReduce.py
def readData(dataPath):
    # Read the whole file and return its lines; the with block closes the file
    with open(dataPath) as dataFile:
        return dataFile.read().splitlines()

def myFilter(f, c):
    newC = []
    for i in c:
        if(f(i)):
            newC.append(i)
    return newC

def myMap(f, c):
    newC = []
    for i in c:
        newC.append(f(i))
    return newC

orderItemsPath = '/data/retail_db/order_items/part-00000'
orderItems = readData(orderItemsPath)
orderItemsFiltered = myFilter(lambda oi: int(oi.split(",")[1]) == 2, orderItems)
orderItemSubtotals = myMap(lambda oi: float(oi.split(",")[4]), orderItemsFiltered)
orderItemSubtotals

def myReduce(f, c):
    t = c[0]
    for i in c[1:]:
        t = f(t, i)
    return t

orderRevenue = myReduce(lambda x, y: x + y, orderItemSubtotals)
orderRevenue

minSubtotal = myReduce(lambda x, y: x if(x < y) else y, orderItemSubtotals)
minSubtotal

maxSubtotal = myReduce(lambda x, y: x if(x > y) else y, orderItemSubtotals)
maxSubtotal


250.0
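
`myReduce` as written reads `c[0]`, so it needs a non-empty list (or another indexable sequence). One possible variation, sketched here under the hypothetical name `myReduceWithInitial`, accepts any iterable and an optional starting value. This is an illustration, not the version used in the rest of this section.

```python
_MISSING = object()  # sentinel to detect "no initial value passed"

def myReduceWithInitial(f, c, initial=_MISSING):
    items = iter(c)                      # works for lists and for iterators
    if initial is _MISSING:
        try:
            t = next(items)              # seed with the first element
        except StopIteration:
            raise TypeError("cannot reduce an empty collection without an initial value") from None
    else:
        t = initial
    for i in items:
        t = f(t, i)
    return t

orderItemSubtotals = [199.99, 250.0, 129.99]
print(myReduceWithInitial(lambda x, y: x if x < y else y, orderItemSubtotals))
print(myReduceWithInitial(lambda x, y: x + y, [], 0.0))   # empty input returns the initial value
```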

### Understanding existing packages and APIs
Now that we have seen how to develop reusable functions to process the data, let us understand the existing APIs in different Python packages.
* map (built-in)
* filter (built-in)
* reduce from functools (in Python 3)
* itertools, which has several functions
* numpy
* pandas
* and more

We will review some of the APIs by going through help. In place of myFilter, myMap and myReduce, you can leverage existing APIs to get similar functionality.
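
One difference from myFilter and myMap is worth seeing before switching: in Python 3 the built-in `filter` and `map` return lazy iterators rather than lists, and an iterator can be consumed only once. A short sketch with hard-coded subtotals:

```python
from functools import reduce

# help(filter)  # uncomment to read the built-in documentation

subtotals = [199.99, 250.0, 129.99]

largeOnes = filter(lambda s: s > 150, subtotals)   # a lazy filter object, not a list
firstPass = list(largeOnes)                        # [199.99, 250.0]
secondPass = list(largeOnes)                       # [] because the iterator is exhausted

doubled = list(map(lambda s: s * 2, subtotals))    # wrap in list() to materialize

# functools.reduce accepts an optional initial value as its third argument
emptyTotal = reduce(lambda x, y: x + y, [], 0.0)

print(firstPass, secondPass, len(doubled), emptyTotal)
```

Wrap the result in `list()` whenever it needs to be indexed, sliced or reused.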

### Developing Solutions using Map Reduce APIs
Now, let us understand how to build applications using existing APIs.
* Get revenue for given order id from order_items
    * Use filter to get the items for a given order id
    * Use map to get order item subtotals
    * Use reduce to aggregate. We can also use sum to get the total of the elements in a numeric list.


In [11]:
#mapreduce-revenueForOrderId.py
def readData(dataPath):
    # Read the whole file and return its lines; the with block closes the file
    with open(dataPath) as dataFile:
        return dataFile.read().splitlines()

orderItemsPath = '/data/retail_db/order_items/part-00000'
orderItems = readData(orderItemsPath)
orderItemsFiltered = filter(lambda oi: int(oi.split(",")[1]) == 2, orderItems)
orderItemSubtotals = map(lambda oi: float(oi.split(",")[4]), orderItemsFiltered)

import functools as ft
orderRevenue = ft.reduce(lambda x, y: x + y, orderItemSubtotals)

orderRevenue

579.98
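
As noted in the steps, `sum` can stand in for `reduce` when the aggregation is a plain total. The sketch below runs both on a small in-memory sample. The rows for order 2 are copied from the outputs above, and the first row is a placeholder marked in the comment.

```python
import functools as ft

orderItems = [
    # Placeholder row: only order id 1 and subtotal 299.98 appear in the
    # outputs above; the remaining fields are made up for illustration
    "1,1,0,1,299.98,299.98",
    "2,2,1073,1,199.99,199.99",
    "3,2,502,5,250.0,50.0",
    "4,2,403,1,129.99,129.99",
]

# filter + map + reduce, with 0.0 as the initial value so that an
# order id without items yields 0.0 instead of raising an error
orderItemsFiltered = filter(lambda oi: int(oi.split(",")[1]) == 2, orderItems)
orderItemSubtotals = map(lambda oi: float(oi.split(",")[4]), orderItemsFiltered)
revenueViaReduce = ft.reduce(lambda x, y: x + y, orderItemSubtotals, 0.0)

# The same pipeline as a generator expression passed to sum
revenueViaSum = sum(
    float(oi.split(",")[4]) for oi in orderItems if int(oi.split(",")[1]) == 2
)

print(round(revenueViaReduce, 2), round(revenueViaSum, 2))
```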

* The built-in map, filter and reduce do not directly perform by-key aggregations
* We need to use modules such as itertools (part of the standard library) or packages such as pandas
* itertools approach – Get revenue for each order id from order_items
    * Read data into collection – order items
    * Sort data using sort function based on the key we are going to use to group – order item order id. groupby only groups consecutive elements with the same key, which is why the sort comes first
    * Group data using groupby function of itertools using key on which we need to get the aggregation
    * groupby returns an iterator in which each element contains
        * key on which data is grouped
        * iterator over the elements corresponding to the key
    * apply map function to process the elements corresponding to each key and return the sum of order item subtotal

In [13]:
#itertools-getRevenuePerOrder.py
def readData(dataPath):
    # Read the whole file and return its lines; the with block closes the file
    with open(dataPath) as dataFile:
        return dataFile.read().splitlines()

orderItemsPath = '/data/retail_db/order_items/part-00000'
orderItems = readData(orderItemsPath)

#Using itertools
import itertools as it
# help(it.groupby)
# groupby only groups consecutive records, so sort by order id first
orderItems.sort(key=lambda oi: int(oi.split(",")[1]))
orderItemsGroupByOrderId = it.groupby(orderItems, lambda oi: int(oi.split(",")[1]))
# Each element g is (order id, iterator over that order's items);
# g is used instead of orderItems to avoid shadowing the list above
revenuePerOrder = map(lambda g: 
                      (g[0], sum(map(lambda oi: float(oi.split(",")[4]), g[1]))),
                      orderItemsGroupByOrderId)
list(revenuePerOrder)[:10]

[(1, 299.98),
 (2, 579.98),
 (4, 699.85),
 (5, 1129.8600000000001),
 (7, 579.9200000000001),
 (8, 729.8399999999999),
 (9, 599.96),
 (10, 651.9200000000001),
 (11, 919.79),
 (12, 1299.8700000000001)]
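
The sort step matters because `itertools.groupby` starts a new group every time the key changes. The sketch below shows this on a deliberately unsorted in-memory sample, and also shows a dictionary-based alternative using `collections.defaultdict` that needs no sorting. The helper names `orderId` and `subtotal` are hypothetical, and one sample row is a placeholder as marked.

```python
import itertools as it
from collections import defaultdict

orderItems = [
    "2,2,1073,1,199.99,199.99",
    # Placeholder row: only order id 1 and subtotal 299.98 appear in the
    # outputs above; the remaining fields are made up for illustration
    "1,1,0,1,299.98,299.98",
    "3,2,502,5,250.0,50.0",
    "4,2,403,1,129.99,129.99",
]

def orderId(oi):      # hypothetical helper: 2nd field as int
    return int(oi.split(",")[1])

def subtotal(oi):     # hypothetical helper: 5th field as float
    return float(oi.split(",")[4])

# Without sorting, order id 2 shows up as two separate groups
unsortedKeys = [k for k, _ in it.groupby(orderItems, orderId)]   # [2, 1, 2]

# With sorting, each order id forms exactly one group
sortedItems = sorted(orderItems, key=orderId)
revenuePerOrder = [
    (k, sum(subtotal(oi) for oi in grp))
    for k, grp in it.groupby(sortedItems, orderId)
]

# Alternative without sorting: accumulate into a dict keyed by order id
revenueByOrderId = defaultdict(float)
for oi in orderItems:
    revenueByOrderId[orderId(oi)] += subtotal(oi)

print(unsortedKeys)
print([(k, round(v, 2)) for k, v in revenuePerOrder])
```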

* pandas approach – Get revenue for each order id from order_items
    * We will look into pandas in detail as part of the next chapter
    * Create list for column names
    * Pass path and column names to pandas read_csv function to create data frame
    * We can refer to columns in data frames using their names
    * Apply groupby function to group data using order item order id and invoke aggregate function sum on order item subtotal. This returns a Series indexed by order item order id which contains the revenue for each order.

In [16]:
#pandas-getRevenuePerOrder.py
import pandas as pd

orderItemsPath = '/data/retail_db/order_items/part-00000'
colNames = ["order_item_id", "order_item_order_id", "order_item_product_id",
         "order_item_quantity", "order_item_subtotal", "order_item_product_price"]

orderItems = pd.read_csv(orderItemsPath, names=colNames)
revenuePerOrder = orderItems.groupby(["order_item_order_id"])["order_item_subtotal"].sum()

revenuePerOrder

order_item_order_id
1         299.98
2         579.98
4         699.85
5        1129.86
7         579.92
          ...   
68879    1259.97
68880     999.77
68881     129.99
68882     109.99
68883    2149.99
Name: order_item_subtotal, Length: 57431, dtype: float64
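
The output above is a pandas Series (note the `Name:` and `dtype:` footer) indexed by order id. If a data frame with two named columns is preferred, `reset_index` can convert it. The sketch below assumes pandas is installed and reads a small in-memory sample through `io.StringIO` instead of the data file; the column name `revenue` is an illustrative choice, and one sample row is a placeholder as marked.

```python
import io
import pandas as pd

sampleCsv = io.StringIO(
    "2,2,1073,1,199.99,199.99\n"
    "3,2,502,5,250.0,50.0\n"
    "4,2,403,1,129.99,129.99\n"
    # Placeholder row: only order id 1 and subtotal 299.98 appear in the
    # outputs above; the remaining fields are made up for illustration
    "1,1,0,1,299.98,299.98\n"
)
colNames = ["order_item_id", "order_item_order_id", "order_item_product_id",
            "order_item_quantity", "order_item_subtotal", "order_item_product_price"]

orderItems = pd.read_csv(sampleCsv, names=colNames)

# groupby(...)[column].sum() returns a Series indexed by the grouping key
revenueSeries = orderItems.groupby("order_item_order_id")["order_item_subtotal"].sum()

# reset_index turns the index into a regular column and names the values column
revenueFrame = revenueSeries.reset_index(name="revenue")

print(type(revenueSeries).__name__)
print(list(revenueFrame.columns))
```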