# Processing Collections using loops
As part of this module, we will see how to manipulate collections using loops. We typically don’t use loops for this kind of processing, but working through them will help improve our programming skills.

* Reading files into collections
* Standard Transformations
* Filtering Data
* Performing Aggregations
* Joining Data Sets
* Exercises
* Limitations of using Loops

## Reading files into collections

Let us understand how to read data from files into collections. 
* Python has simple yet rich APIs to perform file I/O.
* We can create a file object with `open` in different modes (read-only text mode by default).
* To read the contents of the file into memory, the file object provides methods such as `read()`.
* `read()` returns the entire contents of the file as a single string.
* If the data has multiple records delimited by the newline character, we can apply `splitlines()` to the output of `read()`.
* `splitlines()` splits the string at line boundaries and returns a list of lines.

In [9]:
path = '/Users/itversity/Research/data/retail_db/orders/part-00000.csv'
# C:\\users\\itversity\\Research
orders_file = open(path)

In [10]:
orders_raw = orders_file.read()

In [11]:
orders = orders_raw.splitlines()

In [12]:
orders[:10]

['1,2013-07-25 00:00:00.0,11599,CLOSED',
 '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
 '3,2013-07-25 00:00:00.0,12111,COMPLETE',
 '4,2013-07-25 00:00:00.0,8827,CLOSED',
 '5,2013-07-25 00:00:00.0,11318,COMPLETE',
 '6,2013-07-25 00:00:00.0,7130,COMPLETE',
 '7,2013-07-25 00:00:00.0,4530,COMPLETE',
 '8,2013-07-25 00:00:00.0,2911,PROCESSING',
 '9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT',
 '10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT']
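
The cells above leave the file object open. A minimal sketch of the same steps using a `with` block, which closes the file automatically, is shown below. The temporary file and its two sample records are stand-ins for the real `part-00000.csv`, used only so the snippet runs anywhere.

```python
import os
import tempfile

# Hypothetical stand-in for the real orders file: two sample records
sample = (
    '1,2013-07-25 00:00:00.0,11599,CLOSED\n'
    '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT\n'
)

with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as tmp:
    tmp.write(sample)
    sample_path = tmp.name

# The with block closes the file as soon as the read completes
with open(sample_path) as sample_file:
    sample_orders = sample_file.read().splitlines()

os.remove(sample_path)  # clean up the temporary file
print(sample_orders)
```

The same pattern applies to the real path: only the argument to `open` changes.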

## Standard Transformations

Let us understand standard transformations we perform on top of data in collections.
* Filtering
* Row level transformations such as standardization, cleansing etc.
* Aggregations
* Grouped Aggregations
* Sorting and Ranking

Typically, we use external libraries such as Pandas, PySpark, etc. to perform these standard transformations. However, we will implement them with conventional loops to understand how they work and to improve our programming skills.
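
As a small illustration of a row level transformation, the sketch below converts one raw record into typed fields. The helper name `parse_order` is hypothetical and not part of the original material; the field order follows the orders sample shown earlier.

```python
def parse_order(order):
    """Split one comma-separated order record into typed fields."""
    order_id, order_date, customer_id, order_status = order.split(',')
    # Keep only the date part (YYYY-MM-DD) and normalize the status to upper case
    return int(order_id), order_date[:10], int(customer_id), order_status.strip().upper()


parsed = parse_order('3,2013-07-25 00:00:00.0,12111,COMPLETE')
print(parsed)
```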


## Filtering Data
Let us perform a few tasks to understand how to filter the data in collections using loops and conditionals.

* Here are the details about orders.
  * Data is in text file format
  * Each line in the file contains one record.
  * Each record contains 4 attributes which are separated by “,”
    * order_id
    * order_date
    * order_customer_id
    * order_status
  * Create a function named **get_customer_orders** which takes an **orders list** and a **customer_id** as arguments and **returns all the orders placed by customer_id**.


In [13]:
orders[:10]

['1,2013-07-25 00:00:00.0,11599,CLOSED',
 '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
 '3,2013-07-25 00:00:00.0,12111,COMPLETE',
 '4,2013-07-25 00:00:00.0,8827,CLOSED',
 '5,2013-07-25 00:00:00.0,11318,COMPLETE',
 '6,2013-07-25 00:00:00.0,7130,COMPLETE',
 '7,2013-07-25 00:00:00.0,4530,COMPLETE',
 '8,2013-07-25 00:00:00.0,2911,PROCESSING',
 '9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT',
 '10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT']

In [14]:
order = '3,2013-07-25 00:00:00.0,12111,COMPLETE'

In [19]:
int(order.split(',')[2]) == 12111

True

In [20]:
def get_customer_orders(orders, customer_id):
    orders_filtered = []
    for order in orders:
        if int(order.split(',')[2]) == customer_id:
            orders_filtered.append(order)
    return orders_filtered

* Use the function to get all the orders placed by the customer with id 12431.

In [21]:
get_customer_orders(orders, 12431)

['3774,2013-08-16 00:00:00.0,12431,CANCELED',
 '3870,2013-08-17 00:00:00.0,12431,PENDING_PAYMENT',
 '4032,2013-08-17 00:00:00.0,12431,ON_HOLD',
 '22812,2013-12-12 00:00:00.0,12431,PENDING',
 '22927,2013-12-13 00:00:00.0,12431,CLOSED',
 '25614,2013-12-30 00:00:00.0,12431,CLOSED',
 '27585,2014-01-12 00:00:00.0,12431,PROCESSING',
 '28244,2014-01-15 00:00:00.0,12431,PENDING_PAYMENT',
 '29109,2014-01-21 00:00:00.0,12431,ON_HOLD',
 '29232,2014-01-21 00:00:00.0,12431,ON_HOLD',
 '45894,2014-05-06 00:00:00.0,12431,CLOSED',
 '46217,2014-05-07 00:00:00.0,12431,CLOSED',
 '49678,2014-05-31 00:00:00.0,12431,PENDING',
 '51865,2014-06-15 00:00:00.0,12431,PROCESSING',
 '63146,2014-02-13 00:00:00.0,12431,PENDING_PAYMENT',
 '67110,2014-07-14 00:00:00.0,12431,PENDING']

* Create a function named get_customer_orders_for_month which takes an orders list, a customer_id and a month in the format YYYY-MM as arguments and returns all the orders placed by customer_id in the given month.

In [24]:
order = '3,2013-07-25 00:00:00.0,12111,COMPLETE'

In [25]:
int(order.split(',')[2]) == 12111

True

In [23]:
order.split(',')[1].startswith('2013-07')

True

In [26]:
int(order.split(',')[2]) == 12111 and order.split(',')[1].startswith('2013-07')

True

In [29]:
def get_customer_orders_for_month(orders, customer_id, order_month):
    orders_filtered = []
    for order in orders:
        order_elements = order.split(',')
        if int(order_elements[2]) == customer_id and order_elements[1].startswith(order_month):
            orders_filtered.append(order)
    return orders_filtered

* Use the function to get all the orders placed by the customer with id 12431 in January 2014.

In [30]:
get_customer_orders_for_month(orders, 12431, '2014-01')

['27585,2014-01-12 00:00:00.0,12431,PROCESSING',
 '28244,2014-01-15 00:00:00.0,12431,PENDING_PAYMENT',
 '29109,2014-01-21 00:00:00.0,12431,ON_HOLD',
 '29232,2014-01-21 00:00:00.0,12431,ON_HOLD']

* Write ad hoc code to get all the orders placed by the customer with id 12431 in January 2014 whose status is either PENDING_PAYMENT or PROCESSING.

In [36]:
for order in orders:
    order_elements = order.split(',')
    if int(order_elements[2]) == 12431 \
        and order_elements[1].startswith('2014-01') \
        and (order_elements[3] in ('PROCESSING', 'PENDING_PAYMENT')):
        print(order)

27585,2014-01-12 00:00:00.0,12431,PROCESSING
28244,2014-01-15 00:00:00.0,12431,PENDING_PAYMENT
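
The ad hoc loop above can also be wrapped in a function, in the same style as `get_customer_orders_for_month`. This is only a sketch: the name `get_customer_orders_by_status` and the `order_statuses` parameter are assumptions made for illustration, and the sample list reuses four records from the earlier output.

```python
def get_customer_orders_by_status(orders, customer_id, order_month, order_statuses):
    """Return orders for a customer in a month whose status is in order_statuses."""
    orders_filtered = []
    for order in orders:
        order_elements = order.split(',')
        if (int(order_elements[2]) == customer_id
                and order_elements[1].startswith(order_month)
                and order_elements[3] in order_statuses):
            orders_filtered.append(order)
    return orders_filtered


orders_sample = [
    '27585,2014-01-12 00:00:00.0,12431,PROCESSING',
    '28244,2014-01-15 00:00:00.0,12431,PENDING_PAYMENT',
    '29109,2014-01-21 00:00:00.0,12431,ON_HOLD',
    '29232,2014-01-21 00:00:00.0,12431,ON_HOLD',
]
matches = get_customer_orders_by_status(
    orders_sample, 12431, '2014-01', ('PROCESSING', 'PENDING_PAYMENT')
)
for order in matches:
    print(order)
```

Wrapping the conditions in parentheses avoids the backslash line continuations used in the ad hoc version.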


## Performing Aggregations

Let us perform a few tasks to understand how to perform aggregations by key over the data in collections using loops and conditionals.
* Here are the details about orders.
  * Data is in text file format
  * Each line in the file contains one record.
  * Each record contains 4 attributes which are separated by “,”
    * order_id
    * order_date
    * order_customer_id
    * order_status
* Here are the details about order_items.
  * Data is in text file format
  * Each line in the file contains one record.
  * Each record contains 6 attributes which are separated by “,”
    * order_item_id
    * order_item_order_id
    * order_item_product_id
    * order_item_quantity
    * order_item_subtotal
    * order_item_product_price
* Create a function get_count_by_order_status which takes an orders list as argument and returns a dict which contains each order_status and its corresponding count.


In [40]:
d = {}

d['CLOSED'] = 1

In [41]:
d

{'CLOSED': 1}

In [42]:
d['CLOSED'] = 2

In [43]:
d

{'CLOSED': 2}

In [48]:
d = {}

In [51]:
# Each run of this cell adds 1; the output below shows 2, consistent with two runs after d = {}
if 'CLOSED' in d: d['CLOSED'] = d['CLOSED'] + 1
else: d['CLOSED'] = 1

In [52]:
d

{'CLOSED': 2}

In [57]:
d['CLOSED']

2

In [67]:
def get_count_by_order_status(orders):
    order_count = {}
    for order in orders:
        order_status = order.split(',')[3]
        if order_status in order_count: order_count[order_status] += 1
        else: order_count[order_status] = 1
    return order_count

* Use the function to get count by status and preview the output.

In [69]:
get_count_by_order_status(orders)

{'CLOSED': 7556,
 'PENDING_PAYMENT': 15030,
 'COMPLETE': 22899,
 'PROCESSING': 8275,
 'PAYMENT_REVIEW': 729,
 'PENDING': 7610,
 'ON_HOLD': 3798,
 'CANCELED': 1428,
 'SUSPECTED_FRAUD': 1558}
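
The `if`/`else` inside the loop can be shortened with `dict.get`, which returns a default value when the key is missing. A sketch on a few sample records follows; the function name `get_count_by_order_status_v2` is hypothetical.

```python
def get_count_by_order_status_v2(orders):
    """Count orders per status using dict.get with a default of 0."""
    order_count = {}
    for order in orders:
        order_status = order.split(',')[3]
        # get returns 0 the first time a status is seen
        order_count[order_status] = order_count.get(order_status, 0) + 1
    return order_count


orders_sample = [
    '1,2013-07-25 00:00:00.0,11599,CLOSED',
    '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
    '3,2013-07-25 00:00:00.0,12111,COMPLETE',
    '4,2013-07-25 00:00:00.0,8827,CLOSED',
    '5,2013-07-25 00:00:00.0,11318,COMPLETE',
]
status_counts = get_count_by_order_status_v2(orders_sample)
print(status_counts)
```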

* Create a function get_revenue_per_order which takes an order_items list as argument and returns a dict which contains each order_item_order_id and its corresponding order_revenue.

In [81]:
def get_revenue_per_order(order_items):
    revenue_per_order = {}
    for order_item in order_items:
        # Split once and reuse the fields
        order_item_elements = order_item.split(',')
        order_item_order_id = int(order_item_elements[1])
        order_item_subtotal = float(order_item_elements[4])
        # Membership test, so an existing total of 0.0 is not treated as missing
        if order_item_order_id in revenue_per_order:
            revenue_per_order[order_item_order_id] += order_item_subtotal
        else:
            revenue_per_order[order_item_order_id] = order_item_subtotal
    return revenue_per_order

* Use the function to get revenue for each order_item_order_id and preview the output.

In [82]:
order_items_path = '/Users/itversity/Research/data/retail_db/order_items/part-00000.csv'
# The with block closes the file once its contents are read
with open(order_items_path) as order_items_file:
    order_items = order_items_file.read().splitlines()

In [85]:
for i in order_items[:10]:
    print(i)

1,1,957,1,299.98,299.98
2,2,1073,1,199.99,199.99
3,2,502,5,250.0,50.0
4,2,403,1,129.99,129.99
5,4,897,2,49.98,24.99
6,4,365,5,299.95,59.99
7,4,502,3,150.0,50.0
8,4,1014,4,199.92,49.98
9,5,957,1,299.98,299.98
10,5,365,5,299.95,59.99


In [86]:
list(get_revenue_per_order(order_items).items())[:10]

[(1, 299.98),
 (2, 579.98),
 (4, 699.85),
 (5, 1129.8600000000001),
 (7, 579.9200000000001),
 (8, 729.8399999999999),
 (9, 599.96),
 (10, 651.9200000000001),
 (11, 919.79),
 (12, 1299.8700000000001)]
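
Values such as `1129.8600000000001` in the output above come from binary floating point arithmetic, not from the data. One way to avoid them, sketched here as an assumption rather than as the approach used in this module, is to accumulate with `decimal.Decimal`; calling `round(value, 2)` on the final totals is a lighter alternative. The function name `get_revenue_per_order_decimal` is hypothetical, and the sample reuses the first ten order_items records.

```python
from decimal import Decimal


def get_revenue_per_order_decimal(order_items):
    """Sum order_item_subtotal per order using exact decimal arithmetic."""
    revenue_per_order = {}
    for order_item in order_items:
        fields = order_item.split(',')
        order_id = int(fields[1])
        # Build the Decimal from the original text so no binary rounding creeps in
        subtotal = Decimal(fields[4])
        revenue_per_order[order_id] = revenue_per_order.get(order_id, Decimal('0')) + subtotal
    return revenue_per_order


order_items_sample = [
    '1,1,957,1,299.98,299.98',
    '2,2,1073,1,199.99,199.99',
    '3,2,502,5,250.0,50.0',
    '4,2,403,1,129.99,129.99',
    '5,4,897,2,49.98,24.99',
    '6,4,365,5,299.95,59.99',
    '7,4,502,3,150.0,50.0',
    '8,4,1014,4,199.92,49.98',
    '9,5,957,1,299.98,299.98',
    '10,5,365,5,299.95,59.99',
]
decimal_revenue = get_revenue_per_order_decimal(order_items_sample)
for order_id, revenue in decimal_revenue.items():
    print(order_id, revenue)
```

Order 5 has further items beyond these ten records in the full data set, so its total here is lower than in the output above.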

* Create a function get_order_count_by_month which takes an orders list and an order_status as arguments and returns a dict which contains each order_month and its count. We only count those orders whose status matches the order_status passed.

In [87]:
def get_order_count_by_month(orders, order_status):
    order_count = {}
    for order in orders:
        order_month = order.split(',')[1][:7]
        l_order_status = order.split(',')[3]
        if l_order_status == order_status:
            if order_month in order_count: order_count[order_month] += 1
            else: order_count[order_month] = 1
    return order_count

* Use the function to get the count for each order_month and preview the output. We also need to pass the status as an argument.

In [89]:
get_order_count_by_month(orders, 'CLOSED')

{'2013-07': 161,
 '2013-08': 637,
 '2013-09': 676,
 '2013-10': 609,
 '2013-11': 686,
 '2013-12': 705,
 '2014-01': 633,
 '2014-02': 602,
 '2014-03': 612,
 '2014-04': 583,
 '2014-05': 585,
 '2014-06': 563,
 '2014-07': 504}

## Joining Data Sets

Let us perform a few tasks to understand how to perform joins over multiple collections using loops and conditionals.
* There are different strategies for joins.
  * Nested Loops
  * Sort Merge
  * Hash Join
* We will use the Hash Join approach with orders and order_items, since a dict lookup is a hash lookup.
  * Build a dict for one data set, orders, keyed by order_id.
  * Look up each record of the other data set, order_items, in that dict while iterating over it.
* Develop a function get_daily_revenue which takes orders, order_items and order_status as arguments and returns a dict containing order_date and order_revenue. Revenue is computed only from those orders whose status matches the status passed.

In [91]:
def get_orders_dict(orders, order_status):
    orders_dict = {}
    for order in orders:
        order_id = int(order.split(',')[0])
        order_date = order.split(',')[1]
        l_order_status = order.split(',')[3]
        if l_order_status == order_status:
            orders_dict[order_id] = order_date
    return orders_dict

In [93]:
len(get_orders_dict(orders, 'COMPLETE'))

22899

In [100]:
def get_daily_revenue(orders, order_items, order_status):
    orders_dict = get_orders_dict(orders, order_status)
    daily_revenue = {}
    for order_item in order_items:
        order_item_order_id = int(order_item.split(',')[1])
        order_item_subtotal = float(order_item.split(',')[4])
        
        if order_item_order_id in orders_dict:
            orders_dict_date = orders_dict[order_item_order_id]
            if orders_dict_date in daily_revenue:
                daily_revenue[orders_dict_date] = round(daily_revenue[orders_dict_date] + order_item_subtotal, 2)
            else:
                daily_revenue[orders_dict_date] = order_item_subtotal
    return daily_revenue

* Use the function to get daily revenue considering only COMPLETE orders.

In [None]:
get_daily_revenue(orders, order_items, 'COMPLETE')
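
The dict-based lookup in `get_daily_revenue` probes `orders_dict` once per order item. For comparison, a literal nested loops join rescans the orders list for every order item, which takes up to `len(orders) * len(order_items)` comparisons. The sketch below is illustrative only: the name `get_daily_revenue_nested` is hypothetical, it assumes each order_id appears at most once in orders, and the sample data is taken from the records previewed earlier.

```python
def get_daily_revenue_nested(orders, order_items, order_status):
    """Join order_items to orders with an explicit inner loop (no dict lookup)."""
    daily_revenue = {}
    for order_item in order_items:
        item_fields = order_item.split(',')
        order_item_order_id = int(item_fields[1])
        order_item_subtotal = float(item_fields[4])
        # Inner loop: scan the orders list for the matching order_id
        for order in orders:
            order_fields = order.split(',')
            if int(order_fields[0]) == order_item_order_id and order_fields[3] == order_status:
                order_date = order_fields[1]
                daily_revenue[order_date] = round(
                    daily_revenue.get(order_date, 0.0) + order_item_subtotal, 2
                )
                break  # assumes order_id is unique, so stop at the first match
    return daily_revenue


orders_sample = [
    '1,2013-07-25 00:00:00.0,11599,CLOSED',
    '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
    '4,2013-07-25 00:00:00.0,8827,CLOSED',
    '5,2013-07-25 00:00:00.0,11318,COMPLETE',
]
order_items_sample = [
    '1,1,957,1,299.98,299.98',
    '2,2,1073,1,199.99,199.99',
    '3,2,502,5,250.0,50.0',
    '4,2,403,1,129.99,129.99',
    '5,4,897,2,49.98,24.99',
    '6,4,365,5,299.95,59.99',
    '7,4,502,3,150.0,50.0',
    '8,4,1014,4,199.92,49.98',
]
closed_revenue = get_daily_revenue_nested(orders_sample, order_items_sample, 'CLOSED')
print(closed_revenue)
```

On the full data set the inner scan becomes the dominant cost, which is why building the dict first is usually preferable.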

## Exercises
Here are some exercises you can work on to process collections using conventional loops and conditionals. Create a function for each of the problem statements below.

* Get the number of COMPLETE orders placed by each customer.
* Get the total number of PENDING or PENDING_PAYMENT orders for January 2014.
* Get the outstanding amount for each month, considering orders with status PAYMENT_REVIEW, PENDING, PENDING_PAYMENT and PROCESSING.

## Limitations of using Loops

There are several limitations to using loops.
* In the examples above, most of the functions repeat similar logic to iterate through elements.
* We spend more time coding non-business logic than business logic.
* This results in a lot of code, which can become a maintenance problem.
* Map Reduce APIs address these problems.
  * We do not have to write the loops and conditionals ourselves.
  * Loops and conditionals are taken care of by the existing APIs.
  * We can focus on the business logic, which can be passed in using lambda functions.
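
As a brief preview of that style, the customer filter from earlier can be written with Python's built-in `filter` and a lambda, and a total can be computed with `map` and `functools.reduce`. This is a sketch on a few sample records; the variable names are chosen for illustration.

```python
from functools import reduce

orders_sample = [
    '1,2013-07-25 00:00:00.0,11599,CLOSED',
    '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
    '3,2013-07-25 00:00:00.0,12111,COMPLETE',
]
# filter applies the lambda to each record; no explicit loop or if statement
customer_orders = list(
    filter(lambda order: int(order.split(',')[2]) == 12111, orders_sample)
)

order_items_sample = [
    '1,1,957,1,299.98,299.98',
    '2,2,1073,1,199.99,199.99',
    '3,2,502,5,250.0,50.0',
]
# map extracts the subtotal from each record, reduce adds the values up
subtotals = map(lambda item: float(item.split(',')[4]), order_items_sample)
total_revenue = reduce(lambda total, value: total + value, subtotals, 0.0)

print(customer_orders)
print(round(total_revenue, 2))
```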
