## Exercises - Pandas Data Frames

The following exercises practice common Pandas operations.
* Create Pandas Data Frames using Schema
* Get all the orders which belong to August 2013
* Get all the orders which belong to the months of August, September and October in 2013
* Get count of orders by status for January 2014
* Get all the records from orders where there are no corresponding records in order_items
* Get all the customers who have not placed any orders
* Get the revenue by status

### Exercise 1 - Create Pandas Data Frames using Schema

Create a Pandas Data Frame for each of orders, order_items and customers. Use **schemas/retail_db/retail.json** to get the column names.

In [1]:
import os
import json
import csv
import pandas as pd

def get_df(base_folder, data_set_name, schema_file):
    # Look up the column names for this data set in the schema file
    with open(schema_file) as schema_fh:
        retail_schemas = json.load(schema_fh)
    columns = [col['column_name'] for col in retail_schemas[data_set_name]]
    # Collect the raw lines from every file in the data set folder
    data = []
    for file_name in os.listdir(f'{base_folder}/{data_set_name}'):
        file_path = f'{base_folder}/{data_set_name}/{file_name}'
        with open(file_path) as raw_data:
            data += list(raw_data)
    # Plain comma split: every column stays a string and the last one keeps its trailing '\n'
    return pd.DataFrame([rec.split(',') for rec in data], columns=columns)

In [3]:
orders = get_df('D:/BIGDATA_LEARN/bigdata-learn/data/retail_db', 'orders', 'schemas/retail_db/retail.json')

In [4]:
order_items = get_df('D:/BIGDATA_LEARN/bigdata-learn/data/retail_db', 'order_items', 'schemas/retail_db/retail.json')

In [5]:
customers = get_df('D:/BIGDATA_LEARN/bigdata-learn/data/retail_db', 'customers', 'schemas/retail_db/retail.json')

In [43]:
orders.head()

| | order_id | order_date | order_customer_id | order_status |
|---|---|---|---|---|
| 0 | 1 | 2013-07-25 00:00:00.0 | 11599 | CLOSED\n |
| 1 | 2 | 2013-07-25 00:00:00.0 | 256 | PENDING_PAYMENT\n |
| 2 | 3 | 2013-07-25 00:00:00.0 | 12111 | COMPLETE\n |
| 3 | 4 | 2013-07-25 00:00:00.0 | 8827 | CLOSED\n |
| 4 | 5 | 2013-07-25 00:00:00.0 | 11318 | COMPLETE\n |


In [42]:
customers.head()

| | customer_id | customer_fname | customer_lname | customer_email | customer_password | customer_street | customer_city | customer_state | customer_zipcode |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Richard | Hernandez | XXXXXXXXX | XXXXXXXXX | 6303 Heather Plaza | Brownsville | TX | 78521\n |
| 1 | 2 | Mary | Barrett | XXXXXXXXX | XXXXXXXXX | 9526 Noble Embers Ridge | Littleton | CO | 80126\n |
| 2 | 3 | Ann | Smith | XXXXXXXXX | XXXXXXXXX | 3422 Blue Pioneer Bend | Caguas | PR | 00725\n |
| 3 | 4 | Mary | Jones | XXXXXXXXX | XXXXXXXXX | 8324 Little Common | San Marcos | CA | 92069\n |
| 4 | 5 | Robert | Hudson | XXXXXXXXX | XXXXXXXXX | "10 Crystal River Mall " | Caguas | PR | 00725\n |
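The `\n` at the end of the last column and the quotation marks around the street in row 4 are most likely side effects of splitting raw lines on commas. As a rough sketch of one alternative, the loader could delegate parsing to `pd.read_csv`. The helper name `get_df_parsed` and the tiny data set written to a temporary folder are made up for illustration; only the `column_name` key of the schema is taken from the code above.

```python
import json
import os
import tempfile

import pandas as pd


def get_df_parsed(base_folder, data_set_name, schema_file):
    """Hypothetical variant of get_df that lets pandas parse each CSV file."""
    with open(schema_file) as schema_fh:
        schemas = json.load(schema_fh)
    columns = [col['column_name'] for col in schemas[data_set_name]]
    folder = os.path.join(base_folder, data_set_name)
    # The files carry no header row, so the column names come from the schema
    frames = [
        pd.read_csv(os.path.join(folder, name), header=None, names=columns)
        for name in sorted(os.listdir(folder))
    ]
    return pd.concat(frames, ignore_index=True)


# Made-up two-row data set so the sketch runs on its own
order_columns = ['order_id', 'order_date', 'order_customer_id', 'order_status']
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, 'orders'))
    with open(os.path.join(tmp, 'orders', 'part-00000'), 'w') as fh:
        fh.write('1,2013-07-25 00:00:00.0,11599,CLOSED\n')
        fh.write('2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT\n')
    schema_path = os.path.join(tmp, 'retail.json')
    with open(schema_path, 'w') as fh:
        json.dump({'orders': [{'column_name': c} for c in order_columns]}, fh)
    demo_orders = get_df_parsed(tmp, 'orders', schema_path)

print(demo_orders['order_status'].tolist())
```

With this approach numeric columns are inferred as numbers, so values such as zip codes with leading zeros would lose them unless `dtype=str` is passed to `pd.read_csv`. The remaining cells in this notebook assume the all-string frames produced by `get_df`.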


### Exercise 2 - Get all the orders which belong to August 2013

In [7]:
orders[orders['order_date'].str[:7]=='2013-08']

| | order_id | order_date | order_customer_id | order_status |
|---|---|---|---|---|
| 1296 | 1297 | 2013-08-01 00:00:00.0 | 11607 | COMPLETE\n |
| 1297 | 1298 | 2013-08-01 00:00:00.0 | 5105 | CLOSED\n |
| 1298 | 1299 | 2013-08-01 00:00:00.0 | 7802 | COMPLETE\n |
| 1299 | 1300 | 2013-08-01 00:00:00.0 | 553 | PENDING_PAYMENT\n |
| 1300 | 1301 | 2013-08-01 00:00:00.0 | 1604 | PENDING_PAYMENT\n |
| ... | ... | ... | ... | ... |
| 68705 | 68706 | 2013-08-20 00:00:00.0 | 130 | COMPLETE\n |
| 68706 | 68707 | 2013-08-23 00:00:00.0 | 11730 | COMPLETE\n |
| 68707 | 68708 | 2013-08-26 00:00:00.0 | 8852 | ON_HOLD\n |
| 68708 | 68709 | 2013-08-30 00:00:00.0 | 4756 | COMPLETE\n |
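The string prefix comparison works because `order_date` is stored as text. A possible alternative, sketched here on a few made-up rows, is to parse the column with `pd.to_datetime` and filter on the `.dt` accessor. The frame `sample_orders` is illustrative and not the real data set.

```python
import pandas as pd

# Made-up rows in the same text format as the order_date column
sample_orders = pd.DataFrame({
    'order_id': ['1', '1297', '68706'],
    'order_date': ['2013-07-25 00:00:00.0',
                   '2013-08-01 00:00:00.0',
                   '2013-08-20 00:00:00.0'],
})

# Parse the text into timestamps, then compare year and month separately
order_dates = pd.to_datetime(sample_orders['order_date'])
august_2013 = sample_orders[(order_dates.dt.year == 2013) & (order_dates.dt.month == 8)]
print(august_2013['order_id'].tolist())
```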


### Exercise 3 - Get all the orders which belong to the months of August, September and October in 2013

In [10]:
orders[orders['order_date'].str[:7].isin(('2013-08','2013-09','2013-10'))]

| | order_id | order_date | order_customer_id | order_status |
|---|---|---|---|---|
| 1296 | 1297 | 2013-08-01 00:00:00.0 | 11607 | COMPLETE\n |
| 1297 | 1298 | 2013-08-01 00:00:00.0 | 5105 | CLOSED\n |
| 1298 | 1299 | 2013-08-01 00:00:00.0 | 7802 | COMPLETE\n |
| 1299 | 1300 | 2013-08-01 00:00:00.0 | 553 | PENDING_PAYMENT\n |
| 1300 | 1301 | 2013-08-01 00:00:00.0 | 1604 | PENDING_PAYMENT\n |
| ... | ... | ... | ... | ... |
| 68737 | 68738 | 2013-10-27 00:00:00.0 | 1100 | COMPLETE\n |
| 68738 | 68739 | 2013-10-28 00:00:00.0 | 2528 | PENDING\n |
| 68739 | 68740 | 2013-10-29 00:00:00.0 | 10691 | ON_HOLD\n |
| 68740 | 68741 | 2013-10-30 00:00:00.0 | 5974 | PENDING_PAYMENT\n |
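Because the three months are consecutive, the same result could also be expressed as a half-open date range rather than a list of month prefixes. The snippet below is a small sketch of that idea on made-up rows; `sample_orders`, `start` and `end` are illustrative names.

```python
import pandas as pd

# Made-up rows covering one date before, three inside and one after the range
sample_orders = pd.DataFrame({
    'order_id': ['1', '2', '3', '4', '5'],
    'order_date': ['2013-07-25 00:00:00.0', '2013-08-01 00:00:00.0',
                   '2013-09-15 00:00:00.0', '2013-10-30 00:00:00.0',
                   '2013-11-01 00:00:00.0'],
})

order_dates = pd.to_datetime(sample_orders['order_date'])
start = pd.Timestamp('2013-08-01')  # first day included
end = pd.Timestamp('2013-11-01')    # first day excluded
aug_to_oct = sample_orders[(order_dates >= start) & (order_dates < end)]
print(aug_to_oct['order_id'].tolist())
```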


### Exercise 4 - Get count of orders by status for January 2014

In [13]:
orders[orders['order_date'].str.slice(0,7)=='2014-01'].shape[0]

5908
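The expression above returns the total number of orders in January 2014 rather than a count per status. One way to get the breakdown, sketched on made-up rows, is to group the filtered frame by `order_status`. The `str.strip()` step is an assumption added here to drop the trailing `\n` that `get_df` leaves on the last column.

```python
import pandas as pd

# Made-up rows; the status values keep the trailing newline seen in the outputs
sample_orders = pd.DataFrame({
    'order_id': ['1', '2', '3', '4'],
    'order_date': ['2014-01-01 00:00:00.0', '2014-01-15 00:00:00.0',
                   '2014-01-20 00:00:00.0', '2014-02-01 00:00:00.0'],
    'order_status': ['COMPLETE\n', 'CLOSED\n', 'COMPLETE\n', 'COMPLETE\n'],
})

january_2014 = sample_orders[sample_orders['order_date'].str[:7] == '2014-01']
count_by_status = (
    january_2014
    .assign(order_status=january_2014['order_status'].str.strip())
    .groupby('order_status')['order_id']
    .count()
)
print(count_by_status)
```

The counts in the resulting Series add up to the number of rows in the filtered frame, which gives a quick sanity check against the total reported above.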

### Exercise 5 - Get all the records from orders where there are no corresponding records in order_items

In [31]:
orders.set_index('order_id'). \
    join(order_items.set_index('order_item_order_id'))

| order_id | order_date | order_customer_id | order_status | order_item_id | order_item_product_id | order_item_quantity | order_item_subtotal | order_item_product_price |
|---|---|---|---|---|---|---|---|---|
| 1 | 2013-07-25 00:00:00.0 | 11599 | CLOSED\n | 1 | 957 | 1 | 299.98 | 299.98\n |
| 10 | 2013-07-25 00:00:00.0 | 5648 | PENDING_PAYMENT\n | 24 | 1073 | 1 | 199.99 | 199.99\n |
| 10 | 2013-07-25 00:00:00.0 | 5648 | PENDING_PAYMENT\n | 25 | 1014 | 2 | 99.96 | 49.98\n |
| 10 | 2013-07-25 00:00:00.0 | 5648 | PENDING_PAYMENT\n | 26 | 403 | 1 | 129.99 | 129.99\n |
| 10 | 2013-07-25 00:00:00.0 | 5648 | PENDING_PAYMENT\n | 27 | 917 | 1 | 21.99 | 21.99\n |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9998 | 2013-09-25 00:00:00.0 | 9419 | PENDING\n | 24970 | 278 | 1 | 44.99 | 44.99\n |
| 9998 | 2013-09-25 00:00:00.0 | 9419 | PENDING\n | 24971 | 627 | 1 | 39.99 | 39.99\n |
| 9999 | 2013-09-25 00:00:00.0 | 1185 | CLOSED\n | 24972 | 627 | 3 | 119.97 | 39.99\n |
| 9999 | 2013-09-25 00:00:00.0 | 1185 | CLOSED\n | 24973 | 365 | 5 | 299.95 | 59.99\n |


In [26]:
orders.set_index('order_id'). \
    join(order_items.set_index('order_item_order_id')). \
    query('order_item_id.isna()')

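| order_id | order_date | order_customer_id | order_status | order_item_id | order_item_product_id | order_item_quantity | order_item_subtotal | order_item_product_price |
|---|---|---|---|---|---|---|---|---|
| 10015 | 2013-09-25 00:00:00.0 | 3112 | COMPLETE\n | NaN | NaN | NaN | NaN | NaN |
| 10016 | 2013-09-25 00:00:00.0 | 1214 | PROCESSING\n | NaN | NaN | NaN | NaN | NaN |
| 10022 | 2013-09-25 00:00:00.0 | 2697 | PENDING_PAYMENT\n | NaN | NaN | NaN | NaN | NaN |
| 10031 | 2013-09-25 00:00:00.0 | 9968 | COMPLETE\n | NaN | NaN | NaN | NaN | NaN |
| 10035 | 2013-09-25 00:00:00.0 | 2570 | PENDING_PAYMENT\n | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9978 | 2013-09-25 00:00:00.0 | 3100 | CANCELED\n | NaN | NaN | NaN | NaN | NaN |
| 9980 | 2013-09-25 00:00:00.0 | 8261 | PENDING_PAYMENT\n | NaN | NaN | NaN | NaN | NaN |
| 9982 | 2013-09-25 00:00:00.0 | 4860 | COMPLETE\n | NaN | NaN | NaN | NaN | NaN |
| 9994 | 2013-09-25 00:00:00.0 | 9585 | PENDING_PAYMENT\n | NaN | NaN | NaN | NaN | NaN |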


In [36]:
orders_joined = orders.set_index('order_id'). \
    join(order_items.set_index('order_item_order_id'))

In [37]:
orders_joined[orders_joined['order_item_id'].isna()]

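| order_id | order_date | order_customer_id | order_status | order_item_id | order_item_product_id | order_item_quantity | order_item_subtotal | order_item_product_price |
|---|---|---|---|---|---|---|---|---|
| 10015 | 2013-09-25 00:00:00.0 | 3112 | COMPLETE\n | NaN | NaN | NaN | NaN | NaN |
| 10016 | 2013-09-25 00:00:00.0 | 1214 | PROCESSING\n | NaN | NaN | NaN | NaN | NaN |
| 10022 | 2013-09-25 00:00:00.0 | 2697 | PENDING_PAYMENT\n | NaN | NaN | NaN | NaN | NaN |
| 10031 | 2013-09-25 00:00:00.0 | 9968 | COMPLETE\n | NaN | NaN | NaN | NaN | NaN |
| 10035 | 2013-09-25 00:00:00.0 | 2570 | PENDING_PAYMENT\n | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9978 | 2013-09-25 00:00:00.0 | 3100 | CANCELED\n | NaN | NaN | NaN | NaN | NaN |
| 9980 | 2013-09-25 00:00:00.0 | 8261 | PENDING_PAYMENT\n | NaN | NaN | NaN | NaN | NaN |
| 9982 | 2013-09-25 00:00:00.0 | 4860 | COMPLETE\n | NaN | NaN | NaN | NaN | NaN |
| 9994 | 2013-09-25 00:00:00.0 | 9585 | PENDING_PAYMENT\n | NaN | NaN | NaN | NaN | NaN |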
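Both cells above find the unmatched orders by joining and then keeping the rows whose item columns are missing. When only the order columns are needed, a membership test with `isin` may be a lighter option, since it avoids building the joined frame. This is a sketch on made-up rows; `sample_orders` and `sample_order_items` are illustrative.

```python
import pandas as pd

# Made-up rows: order 3 has no matching order item
sample_orders = pd.DataFrame({
    'order_id': ['1', '2', '3'],
    'order_status': ['CLOSED', 'PENDING_PAYMENT', 'COMPLETE'],
})
sample_order_items = pd.DataFrame({
    'order_item_id': ['1', '2', '3'],
    'order_item_order_id': ['1', '2', '2'],
})

# True for every order whose id appears at least once in order_items
has_items = sample_orders['order_id'].isin(sample_order_items['order_item_order_id'])
orders_without_items = sample_orders[~has_items]
print(orders_without_items)
```

Both key columns must have the same type for the comparison to match. Here they are strings, as they are in the frames built by `get_df`.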


### Exercise 6 - Get all the customers who have not placed any orders

In [44]:
customers.set_index('customer_id'). \
    join(orders.set_index('order_customer_id')). \
    query('order_id.isna()')

| customer_id | customer_fname | customer_lname | customer_email | customer_password | customer_street | customer_city | customer_state | customer_zipcode | order_id | order_date | order_status |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 10060 | Mary | Shaw | XXXXXXXXX | XXXXXXXXX | 4645 Fallen Timber By-pass | Caguas | PR | 00725\n | NaN | NaN | NaN |
| 10330 | Mary | Smith | XXXXXXXXX | XXXXXXXXX | 3410 Lazy Shadow Pathway | Hamilton | OH | 45013\n | NaN | NaN | NaN |
| 10439 | Emma | Smith | XXXXXXXXX | XXXXXXXXX | 1465 Clear Elk Diversion | Caguas | PR | 00725\n | NaN | NaN | NaN |
| 10913 | Mary | Williams | XXXXXXXXX | XXXXXXXXX | 9113 Grand Hills Parade | San Jose | CA | 95123\n | NaN | NaN | NaN |
| 10958 | Joan | Smith | XXXXXXXXX | XXXXXXXXX | 8771 Middle Quail Heath | Los Angeles | CA | 90024\n | NaN | NaN | NaN |
| 1187 | Dorothy | Vazquez | XXXXXXXXX | XXXXXXXXX | 363 Green Goose Run | Danbury | CT | 06810\n | NaN | NaN | NaN |
| 12175 | Amanda | Smith | XXXXXXXXX | XXXXXXXXX | 3729 Cinder Grove Concession | Tonawanda | NY | 14150\n | NaN | NaN | NaN |
| 12190 | Mary | Smith | XXXXXXXXX | XXXXXXXXX | 4462 Little Lagoon Route | Tempe | AZ | 85283\n | NaN | NaN | NaN |
| 12392 | Alan | Wolf | XXXXXXXXX | XXXXXXXXX | 6470 Fallen Barn Autoroute | Santa Ana | CA | 92704\n | NaN | NaN | NaN |
| 1481 | Grace | Smith | XXXXXXXXX | XXXXXXXXX | 2171 Clear Lake Isle | Caguas | PR | 00725\n | NaN | NaN | NaN |
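The same left-join idea could also be written with `DataFrame.merge` and `indicator=True`, which adds a `_merge` column stating whether each row was found on the left side only, the right side only, or both. The sketch below uses made-up rows; `sample_customers` and `sample_orders` are illustrative.

```python
import pandas as pd

# Made-up rows: only customer 1 has placed orders
sample_customers = pd.DataFrame({
    'customer_id': ['1', '2', '3'],
    'customer_fname': ['Richard', 'Mary', 'Ann'],
})
sample_orders = pd.DataFrame({
    'order_id': ['10', '11'],
    'order_customer_id': ['1', '1'],
})

merged = sample_customers.merge(
    sample_orders,
    how='left',
    left_on='customer_id',
    right_on='order_customer_id',
    indicator=True,  # adds the _merge column
)
# 'left_only' marks customers with no matching order
customers_without_orders = merged.loc[
    merged['_merge'] == 'left_only', ['customer_id', 'customer_fname']
]
print(customers_without_orders)
```

Filtering on `_merge` does not depend on picking a column from the right side that is guaranteed to be non-null, which is the assumption behind `order_id.isna()` in the cell above.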


### Exercise 7 - Get the revenue by status

In [45]:
orders.head()

| | order_id | order_date | order_customer_id | order_status |
|---|---|---|---|---|
| 0 | 1 | 2013-07-25 00:00:00.0 | 11599 | CLOSED\n |
| 1 | 2 | 2013-07-25 00:00:00.0 | 256 | PENDING_PAYMENT\n |
| 2 | 3 | 2013-07-25 00:00:00.0 | 12111 | COMPLETE\n |
| 3 | 4 | 2013-07-25 00:00:00.0 | 8827 | CLOSED\n |
| 4 | 5 | 2013-07-25 00:00:00.0 | 11318 | COMPLETE\n |


In [47]:
order_items.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 172198 entries, 0 to 172197
Data columns (total 6 columns):
 #   Column                    Non-Null Count   Dtype 
---  ------                    --------------   ----- 
 0   order_item_id             172198 non-null  object
 1   order_item_order_id       172198 non-null  object
 2   order_item_product_id     172198 non-null  object
 3   order_item_quantity       172198 non-null  object
 4   order_item_subtotal       172198 non-null  object
 5   order_item_product_price  172198 non-null  object
dtypes: object(6)
memory usage: 7.9+ MB


In [50]:
order_items['total'] = order_items['order_item_subtotal'].astype(float)

In [51]:
orders.set_index('order_id'). \
    join(order_items.set_index('order_item_order_id'), how='inner'). \
    groupby('order_status')['total']. \
    sum()

order_status
CANCELED\n             696030.99
CLOSED\n              3736048.79
COMPLETE\n           11276933.69
ON_HOLD\n             1864731.24
PAYMENT_REVIEW\n       357841.45
PENDING\n             3851881.28
PENDING_PAYMENT\n     7581671.05
PROCESSING\n          4190636.76
SUSPECTED_FRAUD\n      766844.68
Name: total, dtype: float64
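The status labels in the result still carry the trailing `\n` left by `get_df`. A possible cleanup, sketched on made-up rows, strips the labels and converts the subtotal with `pd.to_numeric` before aggregating. It uses `merge` on the key columns instead of an index join, which is a stylistic choice here and not the notebook's method.

```python
import pandas as pd

# Made-up rows; order 4 has no items and drops out of the inner merge
sample_orders = pd.DataFrame({
    'order_id': ['1', '2', '3', '4'],
    'order_status': ['COMPLETE\n', 'CLOSED\n', 'COMPLETE\n', 'PENDING\n'],
})
sample_order_items = pd.DataFrame({
    'order_item_order_id': ['1', '2', '2', '3'],
    'order_item_subtotal': ['100.5', '25.0', '10.0', '50.25'],
})

# Remove the trailing newline from the status and turn the subtotal into a number
cleaned_orders = sample_orders.assign(
    order_status=sample_orders['order_status'].str.strip()
)
items = sample_order_items.assign(
    order_item_subtotal=pd.to_numeric(sample_order_items['order_item_subtotal'])
)

revenue_by_status = (
    cleaned_orders
    .merge(items, left_on='order_id', right_on='order_item_order_id', how='inner')
    .groupby('order_status')['order_item_subtotal']
    .sum()
)
print(revenue_by_status)
```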