## Exercises - Manipulating Collections using Loops

Let us go through some exercises to understand how to process collections using conventional loops and conditionals. Create a function for each of the problem statements below.
* Get number of COMPLETE orders placed by each customer.
* Get total number of PENDING or PENDING_PAYMENT orders placed in January 2014.
* Get outstanding amount for each month considering orders with status PAYMENT_REVIEW, PENDING, PENDING_PAYMENT and PROCESSING.

### Details of Data

Here are the details about the orders data which you can leverage to take care of these exercises.
* Location: `/data/retail_db/orders/part-00000`
* Records are line delimited (one record per line).
* Attributes in each record are comma separated.
* Here are the columns in the orders data set.
  * order_id
  * order_date
  * order_customer_id
  * order_status

In [3]:
# Get the details about file
!ls -ltr /data/retail_db/orders/part-00000

-rw-r--r-- 1 root root 2999944 Jan 21  2021 /data/retail_db/orders/part-00000


In [4]:
# Get first five lines from the file
!head -5 /data/retail_db/orders/part-00000

1,2013-07-25 00:00:00.0,11599,CLOSED
2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT
3,2013-07-25 00:00:00.0,12111,COMPLETE
4,2013-07-25 00:00:00.0,8827,CLOSED
5,2013-07-25 00:00:00.0,11318,COMPLETE


In [5]:
# Get number of lines from the file
# We can use linux command wc with -l
!wc -l /data/retail_db/orders/part-00000

68883 /data/retail_db/orders/part-00000


Here are the details about the order_items data which you can leverage to take care of these exercises.
* Location: `/data/retail_db/order_items/part-00000`
* Records are line delimited (one record per line).
* Attributes in each record are comma separated.
* Here are the columns in the order_items data set.
  * order_item_id
  * order_item_order_id
  * order_item_product_id
  * order_item_quantity
  * order_item_subtotal
  * order_item_product_price

In [6]:
# Get the details about file
!ls -ltr /data/retail_db/order_items/part-00000

-rw-r--r-- 1 root root 5408880 Jan 21  2021 /data/retail_db/order_items/part-00000


In [7]:
# Get first five lines from the file
!head -5 /data/retail_db/order_items/part-00000

1,1,957,1,299.98,299.98
2,2,1073,1,199.99,199.99
3,2,502,5,250.0,50.0
4,2,403,1,129.99,129.99
5,4,897,2,49.98,24.99


In [8]:
# Get number of lines from the file
# We can use linux command wc with -l
!wc -l /data/retail_db/order_items/part-00000

172198 /data/retail_db/order_items/part-00000


### Exercise 1 - Read data from file
Before getting into the problem statements, develop the code to read a file into a list of elements.
* We should be able to use this function to read any text file which uses the line as record delimiter.

In [9]:
# Update the logic here
def get_list_from_file(file_path):
    # Use a context manager so that the file is closed after reading
    with open(file_path) as data_file:
        data_list = data_file.read().splitlines()
    return data_list

* Run the cells below to validate the function.
* You should see 68883 as the output of the cell with `len(orders)`.
* You should see 172198 as the output of the cell with `len(order_items)`.

In [10]:
orders = get_list_from_file('/data/retail_db/orders/part-00000')

In [11]:
orders[:5]

['1,2013-07-25 00:00:00.0,11599,CLOSED',
 '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
 '3,2013-07-25 00:00:00.0,12111,COMPLETE',
 '4,2013-07-25 00:00:00.0,8827,CLOSED',
 '5,2013-07-25 00:00:00.0,11318,COMPLETE']

In [12]:
len(orders)

68883

In [13]:
order_items = get_list_from_file('/data/retail_db/order_items/part-00000')

In [14]:
order_items[:5]

['1,1,957,1,299.98,299.98',
 '2,2,1073,1,199.99,199.99',
 '3,2,502,5,250.0,50.0',
 '4,2,403,1,129.99,129.99',
 '5,4,897,2,49.98,24.99']

In [15]:
len(order_items)

172198
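
When the `/data/retail_db` files are not at hand, the same reading logic can still be checked against a throwaway file. The sketch below is only an illustration: the temporary file and its three sample records are created just for the check, and `get_list_from_file` is redefined so that the block runs on its own.

```python
import os
import tempfile


def get_list_from_file(file_path):
    # Read the whole file and return one element per line
    with open(file_path) as data_file:
        return data_file.read().splitlines()


# Sample records written to a temporary file (illustration only)
sample_lines = [
    '1,2013-07-25 00:00:00.0,11599,CLOSED',
    '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
    '3,2013-07-25 00:00:00.0,12111,COMPLETE',
]
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp_file:
    tmp_file.write('\n'.join(sample_lines) + '\n')
    tmp_path = tmp_file.name

sample_orders = get_list_from_file(tmp_path)
os.remove(tmp_path)  # clean up the temporary file

print(len(sample_orders))
print(sample_orders[0])
```

The list should contain exactly the lines that were written, without trailing newline characters.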

### Exercise 2 - Complete Order Count by Customer

Get number of COMPLETE orders placed by each customer. Develop a function which reads the orders data and returns the complete order count for each customer using **order_customer_id**.
* The function should take the complete order list as argument and return the count of complete orders by customer. The function should return a **dict** type object.
* An order is said to be complete if the **order_status** is **COMPLETE**.
* You can review the structure of the data under the **Details of Data** section in this notebook.

In [20]:
# Update the logic here
def get_complete_order_count_by_customer(orders):
    order_count_by_customer = {}
    for order in orders:
        # Split each record only once
        order_details = order.split(',')
        order_status = order_details[3]
        if order_status == 'COMPLETE':
            order_customer_id = int(order_details[2])
            if order_customer_id in order_count_by_customer:
                order_count_by_customer[order_customer_id] += 1
            else:
                order_count_by_customer[order_customer_id] = 1
    return order_count_by_customer

In [21]:
# # Update the logic here
# def get_complete_order_count_by_customer(orders):
#     order_count_by_customer = {}
#     for order in orders:
#         order_status=order.split(',')[3]
#         order_customer_id=order.split(',')[2]
#         if order_status=='COMPLETE':
#             if order_count_by_customer.get(int(order.split(',')[2])): # dict key customer_id not found
#                 order_count_by_customer[int(order.split(',')[2])] += 1
#             else:
#                 order_count_by_customer[int(order.split(',')[2])] = 1
#     return order_count_by_customer

* Run the cells below to validate the function. The dict should contain **10538** customers, and the counts across all customers should add up to **22899** COMPLETE orders.

In [22]:
orders = get_list_from_file('/data/retail_db/orders/part-00000')
len(orders)

68883

In [23]:
complete_order_count_by_customer = get_complete_order_count_by_customer(orders)

In [24]:
# This should return dict
type(complete_order_count_by_customer)

dict

In [25]:
# This should return 10538
len(complete_order_count_by_customer)

10538

* Run the cell below to preview the data. You should see the following output.
```python
(1, 1)
(2, 2)
(3, 5)
(4, 4)
(5, 2)
```

In [26]:
for e in sorted(complete_order_count_by_customer.items())[:5]:
    print(e)

(1, 1)
(2, 2)
(3, 5)
(4, 4)
(5, 2)
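
The two validation numbers measure different things: `len` of the dict counts customers, while the sum of its values counts orders. The sketch below shows the difference on a handful of records. The first five records come from the preview above, the sixth one is made up so that one customer has two COMPLETE orders, and `count_complete_orders` is a hypothetical name used only here.

```python
sample_orders = [
    '1,2013-07-25 00:00:00.0,11599,CLOSED',
    '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
    '3,2013-07-25 00:00:00.0,12111,COMPLETE',
    '4,2013-07-25 00:00:00.0,8827,CLOSED',
    '5,2013-07-25 00:00:00.0,11318,COMPLETE',
    # Made-up record: a second COMPLETE order for customer 12111
    '99999,2013-07-26 00:00:00.0,12111,COMPLETE',
]


def count_complete_orders(orders):
    counts = {}
    for order in orders:
        fields = order.split(',')
        if fields[3] == 'COMPLETE':
            customer_id = int(fields[2])
            # dict.get returns 0 when the customer is seen for the first time
            counts[customer_id] = counts.get(customer_id, 0) + 1
    return counts


sample_counts = count_complete_orders(sample_orders)
customer_count = len(sample_counts)             # customers with COMPLETE orders
complete_order_count = sum(sample_counts.values())  # COMPLETE orders overall

print(sample_counts)
print(customer_count, complete_order_count)
```

Here `dict.get` with a default replaces the `if`/`else` branch; both forms produce the same counts.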


### Exercise 3 - Pending Order Count

Get total number of PENDING or PENDING_PAYMENT orders placed in January 2014. Develop a function which reads the orders data and returns the pending order count.
* The function should take the complete order list as argument and return the count of pending orders.
* An order is said to be pending if the status is **PENDING** or **PENDING_PAYMENT**. We should only consider the orders placed in January 2014.
* The second element in each comma separated record gives us the order date.
* The fourth (last) element in each comma separated record gives us the order status.

In [27]:
# order_details=orders[0].split(',')
# order_details[3] in ('CLOSED','COMPLETE') and order_details[1].startswith('2013-07')

In [28]:
# Update the logic here
def get_pending_order_count(orders):
    order_count=0
    for order in orders:
        order_details=order.split(',')
        order_date=order_details[1]
        order_status=order_details[3]
        if order_status in ('PENDING', 'PENDING_PAYMENT') and order_date.startswith('2014-01'):
            order_count+=1
    return order_count

* Run the cell below to validate your function. You should get **1969** as output.

In [29]:
get_pending_order_count(orders)

1969

* You can also validate the result using simple Linux commands.

In [30]:
!egrep -w '(PENDING|PENDING_PAYMENT)' /data/retail_db/orders/part-00000|grep 2014-01|wc -l

1969
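
The filter condition can also be tried on a few records before running it against the full file. The sketch below uses made-up sample records, and `is_pending_in_month` is a hypothetical helper that is not part of the exercise; it only isolates the date and status check used in the loop.

```python
PENDING_STATUSES = ('PENDING', 'PENDING_PAYMENT')


def is_pending_in_month(order, month_prefix):
    # order_date is the 2nd field and order_status is the 4th field
    fields = order.split(',')
    order_date, order_status = fields[1], fields[3]
    return order_status in PENDING_STATUSES and order_date.startswith(month_prefix)


# Made-up records covering the cases the filter has to separate
sample_orders = [
    '101,2014-01-03 00:00:00.0,11,PENDING',          # counted
    '102,2014-01-15 00:00:00.0,12,PENDING_PAYMENT',  # counted
    '103,2013-12-31 00:00:00.0,13,PENDING',          # wrong month
    '104,2014-01-20 00:00:00.0,14,COMPLETE',         # wrong status
    '105,2014-02-01 00:00:00.0,15,PENDING_PAYMENT',  # wrong month
]

pending_count = 0
for order in sample_orders:
    if is_pending_in_month(order, '2014-01'):
        pending_count += 1

print(pending_count)
```

Only the first two records satisfy both conditions, so the count for this sample is 2.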


### Exercise 4 - Get Outstanding Revenue

Get outstanding amount for each month considering orders with status PAYMENT_REVIEW, PENDING, PENDING_PAYMENT and PROCESSING. Modularize by developing multiple functions.
* Develop a function which takes the orders list as argument and returns a collection of order ids with one of the pending statuses.
* Develop a function which takes the **order_items list** as well as the **pending orders collection** returned by the first function as arguments and returns the outstanding amount.
* You can use **order_item_subtotal** to compute the outstanding amount.
* Here are the instructions for the solution.
  * Create a list, set or dict of pending orders as part of the first function, with a name that starts with **get_pending_orders**.
  * As part of **get_outstanding_revenue**, make sure to iterate through **order_items** and look up into **pending_orders** to get the subtotal for each order item.
* Review the **Details of Data** section to get more details about the columns.

* Develop a function to create a list of orders with pending status and look up into it.

In [88]:
# def get_pending_orders(orders):
#     pending_orders =[]
#     comparing_status = ['PENDING', 'PENDING_PAYMENT','PAYMENT_REVIEW' ,'PROCESSING']
#     for order in orders:
#         order_status = order.split(',')[3]
#         if order_status in comparing_status:
#             order_concat = order.split(',')[0] + ',' +order.split(',')[3]
#             pending_orders.append(order_concat)
#     return pending_orders

In [14]:
# Update the logic here
def get_pending_orders(orders):
    pending_orders=[
        order.split(',')[0] + ',' + order.split(',')[3] for order in orders 
        if (order.split(',')[3] in ('PAYMENT_REVIEW','PENDING','PENDING_PAYMENT','PROCESSING'))]
    return pending_orders

* Validate by running the cells below to see if the list is created with the order id and status of each pending order.

In [15]:
orders = get_list_from_file('/data/retail_db/orders/part-00000')

In [16]:
pending_orders = get_pending_orders(orders)

In [17]:
# It should return list
type(pending_orders)

list

In [18]:
# Preview first five elements
pending_orders[:5]

['2,PENDING_PAYMENT',
 '8,PROCESSING',
 '9,PENDING_PAYMENT',
 '10,PENDING_PAYMENT',
 '11,PAYMENT_REVIEW']

In [104]:
# Reading first element from the list
pending_orders[0] 

'2,PENDING_PAYMENT'

In [72]:
# It should return 31644
len(pending_orders)

31644

In [73]:
# def get_outstanding_revenue(order_items, pending_orders):
#     outstanding_revenue=0
#     for order_item in order_items:
#         order_item_order_id=int(order_item.split(',')[1])
#         subtotal=float(order_item.split(',')[4])
#         for pending_order in pending_orders:
#             if order_item_order_id == pending_order[0]:
#                 outstanding_revenue+=subtotal
#     return outstanding_revenue

In [105]:
def get_outstanding_revenue(order_items, pending_orders):
    outstanding_revenue = 0
    # Extract the order ids from the 'order_id,order_status' strings
    pending_order_ids = []
    for pending_order in pending_orders:
        pending_order_ids.append(pending_order.split(',')[0])
    for order_item in order_items:
        # The in operator scans the list element by element
        if order_item.split(',')[1] in pending_order_ids:
            outstanding_revenue += float(order_item.split(',')[4])
    return round(outstanding_revenue, 2)

In [106]:
order_items = get_list_from_file('/data/retail_db/order_items/part-00000')

In [107]:
%%time
# You should get 15982030.54 as output. A difference of a few dollars is fine.
get_outstanding_revenue(order_items, pending_orders)

CPU times: user 34.2 s, sys: 3.91 ms, total: 34.2 s
Wall time: 34.2 s


15982030.54

### Using Set

In [14]:
# def get_pending_orders_set(orders):
#     pending_orders =set()
#     comparing_status = ['PENDING', 'PENDING_PAYMENT','PAYMENT_REVIEW' ,'PROCESSING']
#     for order in orders:
#         order_status = order.split(',')[3]
#         if order_status in comparing_status:
#             order_concat = order.split(',')[0] + ',' +order.split(',')[3]
#             pending_orders.add(order_concat)
#     return pending_orders

* Develop a function to create a set of orders with pending status and look up into it.

In [26]:
# Update the logic here
def get_pending_orders_set(orders):
    pending_orders=set(
        order.split(',')[0] + ',' + order.split(',')[3] for order in orders 
        if (order.split(',')[3] in ('PAYMENT_REVIEW','PENDING','PENDING_PAYMENT','PROCESSING')))
    return pending_orders

* Validate by running the cells below to see if the set is created with the order id and status of each pending order.

In [27]:
orders = get_list_from_file('/data/retail_db/orders/part-00000')

In [28]:
pending_orders = get_pending_orders_set(orders)

In [29]:
# It should return set
type(pending_orders)

set

In [30]:
# Preview first five elements
list(pending_orders)[:5]

['41443,PROCESSING',
 '6325,PENDING',
 '13325,PROCESSING',
 '13455,PENDING_PAYMENT',
 '29998,PROCESSING']

In [31]:
# Reading first element after converting the set to a list (sets are unordered)
list(pending_orders)[0] 

'41443,PROCESSING'

In [32]:
# It should return 31644
len(pending_orders)

31644

In [42]:
def get_outstanding_revenue(order_items, pending_orders):
    outstanding_revenue = 0
    # Extract the order ids from the 'order_id,order_status' strings into a set
    pending_order_ids = set()
    for pending_order in pending_orders:
        pending_order_ids.add(pending_order.split(',')[0])
    for order_item in order_items:
        # The in operator on a set is a hash lookup
        if order_item.split(',')[1] in pending_order_ids:
            outstanding_revenue += float(order_item.split(',')[4])
    return round(outstanding_revenue, 2)

In [43]:
order_items = get_list_from_file('/data/retail_db/order_items/part-00000')

In [44]:
%%time
# You should get 15982030.54 as output. A difference of a few dollars is fine.
get_outstanding_revenue(order_items, pending_orders)

CPU times: user 74.1 ms, sys: 10 µs, total: 74.1 ms
Wall time: 73.9 ms


15982030.54
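
Since the revenue lookup only needs order ids, one possible variation is to store just the ids in the set, so that the elements do not have to be split again later. The sketch below is an illustration and not the solution used above: `get_pending_order_ids` and `sum_pending_subtotals` are hypothetical names, and the sample data is the first five records of each file from the previews.

```python
PENDING_STATUSES = ('PAYMENT_REVIEW', 'PENDING', 'PENDING_PAYMENT', 'PROCESSING')


def get_pending_order_ids(orders):
    pending_order_ids = set()
    for order in orders:
        fields = order.split(',')
        if fields[3] in PENDING_STATUSES:
            # Keep only the order id; the status is not needed for the lookup
            pending_order_ids.add(fields[0])
    return pending_order_ids


def sum_pending_subtotals(order_items, pending_order_ids):
    outstanding_revenue = 0.0
    for order_item in order_items:
        fields = order_item.split(',')
        if fields[1] in pending_order_ids:
            outstanding_revenue += float(fields[4])
    return round(outstanding_revenue, 2)


sample_orders = [
    '1,2013-07-25 00:00:00.0,11599,CLOSED',
    '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
    '3,2013-07-25 00:00:00.0,12111,COMPLETE',
    '4,2013-07-25 00:00:00.0,8827,CLOSED',
    '5,2013-07-25 00:00:00.0,11318,COMPLETE',
]
sample_order_items = [
    '1,1,957,1,299.98,299.98',
    '2,2,1073,1,199.99,199.99',
    '3,2,502,5,250.0,50.0',
    '4,2,403,1,129.99,129.99',
    '5,4,897,2,49.98,24.99',
]

sample_pending_ids = get_pending_order_ids(sample_orders)
sample_revenue = sum_pending_subtotals(sample_order_items, sample_pending_ids)
print(sample_pending_ids)
print(sample_revenue)
```

In this sample only order 2 is pending, so the result is the sum of its three item subtotals.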

### Using Dictionary

* Develop a function to create a dict of orders with pending status and look up into it. The dict needs to have the order id as key. The value can be any constant, since only the keys are used for the lookup; the solution below stores the order status as the value.

In [22]:
# Update the logic here
def get_pending_orders_dict(orders):
    pending_orders={
        order.split(',')[0]:order.split(',')[3] for order in orders 
        if (order.split(',')[3] in ('PAYMENT_REVIEW','PENDING','PENDING_PAYMENT','PROCESSING'))}
    return pending_orders

* Validate by running the cells below to see if the dict is created with order id and order status.

In [23]:
orders = get_list_from_file('/data/retail_db/orders/part-00000')

In [24]:
pending_orders = get_pending_orders_dict(orders)

In [25]:
# It should return dict
type(pending_orders)

dict

In [29]:
# Preview first five elements
list(pending_orders.items())[:5]

[('2', 'PENDING_PAYMENT'),
 ('8', 'PROCESSING'),
 ('9', 'PENDING_PAYMENT'),
 ('10', 'PENDING_PAYMENT'),
 ('11', 'PAYMENT_REVIEW')]

In [30]:
# Reading first element from the dict
list(pending_orders.items())[0] 

('2', 'PENDING_PAYMENT')

In [31]:
# It should return 31644
len(pending_orders)

31644

In [36]:
def get_outstanding_revenue(order_items, pending_orders):
    outstanding_revenue = 0
    for order_item in order_items:
        if order_item.split(',')[1] in pending_orders:
            outstanding_revenue += float(order_item.split(',')[4])
    return round(outstanding_revenue, 2)

In [37]:
order_items = get_list_from_file('/data/retail_db/order_items/part-00000')

In [38]:
%%time
# You should get 15982030.54 as output. A difference of a few dollars is fine.
get_outstanding_revenue(order_items, pending_orders)

CPU times: user 65.2 ms, sys: 0 ns, total: 65.2 ms
Wall time: 64.6 ms


15982030.54
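
The problem statement asks for the outstanding amount for each month, while the functions above return a single total. One way to get a monthly breakdown is sketched below, under the assumption that the month is the first seven characters of `order_date` (for example `2013-07`). The function names and the sample records are made up for the illustration.

```python
PENDING_STATUSES = ('PAYMENT_REVIEW', 'PENDING', 'PENDING_PAYMENT', 'PROCESSING')


def get_pending_order_months(orders):
    # Map order id -> 'YYYY-MM' for orders with one of the pending statuses
    pending_order_months = {}
    for order in orders:
        fields = order.split(',')
        if fields[3] in PENDING_STATUSES:
            pending_order_months[fields[0]] = fields[1][:7]
    return pending_order_months


def get_outstanding_revenue_by_month(order_items, pending_order_months):
    revenue_by_month = {}
    for order_item in order_items:
        fields = order_item.split(',')
        # dict.get returns None when the order is not pending
        month = pending_order_months.get(fields[1])
        if month is not None:
            revenue_by_month[month] = revenue_by_month.get(month, 0.0) + float(fields[4])
    return {month: round(amount, 2) for month, amount in revenue_by_month.items()}


# Made-up sample records
sample_orders = [
    '1,2013-07-25 00:00:00.0,11599,CLOSED',
    '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
    '3,2013-08-02 00:00:00.0,12111,PROCESSING',
]
sample_order_items = [
    '1,1,957,1,299.98,299.98',
    '2,2,1073,1,199.99,199.99',
    '3,2,502,5,250.0,50.0',
    '4,3,403,1,129.99,129.99',
]

sample_months = get_pending_order_months(sample_orders)
sample_revenue_by_month = get_outstanding_revenue_by_month(sample_order_items, sample_months)
print(sample_revenue_by_month)
```

Keeping the month as the dict value lets a single lookup answer both questions: whether the order is pending and which month its subtotal belongs to.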

### Exercise 5 - Compare Performance

As part of the previous exercise you were asked to come up with the solution using 3 different approaches. Add a markdown cell below each question and provide your answer.

* Question: Which of the 3 approaches is the fastest? Add a markdown cell below and provide your answer.
  * list
  * set
  * dict

Answer: The dictionary solution is the fastest one in the runs above (65.2 ms, against 74.1 ms for the set and 34.2 s for the list).

* Question: Explain why the option you have chosen is faster than the others. Add a markdown cell below and provide your answer.

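Answer: A membership test on a dict hashes the key and goes straight to the matching entry, so each lookup takes roughly constant time instead of scanning the elements one by one as a list does. A set is hash based as well, which is why its timing is close to that of the dict; the set solution above is slightly slower because it first extracts the order ids from the `order_id,order_status` strings on every call.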
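
The difference in lookup cost can be observed in isolation with the standard `timeit` module. The sketch below uses synthetic ids rather than the retail data, and the sizes are arbitrary choices made for the illustration; the absolute timings depend on the machine, so no expected numbers are given.

```python
import timeit

# Synthetic ids stored in the three container types
ids = [str(i) for i in range(20000)]
ids_list = list(ids)
ids_set = set(ids)
ids_dict = {order_id: 1 for order_id in ids}

# 200 probe values: 100 are present in the containers and 100 are not
probes = [str(i) for i in range(0, 40000, 200)]


def count_hits(container):
    # The in operator scans a list but uses hashing for set and dict
    return sum(1 for probe in probes if probe in container)


for name, container in (('list', ids_list), ('set', ids_set), ('dict', ids_dict)):
    seconds = timeit.timeit(lambda: count_hits(container), number=3)
    print(name, round(seconds, 4))
```

All three containers return the same number of hits; only the time taken differs, with the list typically far behind the set and the dict.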