## Exercises - Pandas Data Frames

Here are some exercises on Pandas Data Frames.
* Create Pandas Data Frames using Schema
* Get all the orders which belong to the month of 2013 August
* Get all the orders which belong to the months of August, September and October in 2013
* Get count of orders by status for the month of 2014 January
* Get all the records from orders where there are no corresponding records in order_items
* Get all the customers who have not placed any orders
* Get the revenue by status

### Exercise 1 - Create Pandas Data Frames using Schema

Create Pandas Data Frames for orders, order_items and customers. Make sure to use **schemas/retail_db/retail.json** to get the column names.

In [1]:
import os
import json
import csv
import pandas as pd

def get_df(base_folder, data_set_name, schema_file):
    # Column names for the data set come from the schema JSON file
    with open(schema_file) as schema_fp:
        retail_schemas = json.load(schema_fp)
    columns = [col['column_name'] for col in retail_schemas[data_set_name]]
    data = []
    # Read every file in the data set folder; each line keeps its trailing
    # newline, so values in the last column end with '\n'
    for file_name in os.listdir(f'{base_folder}/{data_set_name}'):
        file_path = f'{base_folder}/{data_set_name}/{file_name}'
        with open(file_path) as raw_data:
            data += list(raw_data)
    return pd.DataFrame(map(lambda rec: rec.split(','), data), columns=columns)
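For comparison, the column names taken from the schema can also be passed to `pd.read_csv`, which parses numeric columns and drops the trailing newline. This is only a minimal sketch on in-memory data: `sample_schema` and `sample_csv` are made-up stand-ins, and only the `column_name` key is taken from `get_df` above.

```python
import io
import pandas as pd

# Hypothetical stand-in for the parsed schema file; only the
# 'column_name' key is taken from the get_df function above.
sample_schema = {
    'orders': [
        {'column_name': 'order_id'},
        {'column_name': 'order_date'},
        {'column_name': 'order_customer_id'},
        {'column_name': 'order_status'},
    ]
}

# Made-up rows in a comma-separated layout without a header line.
sample_csv = io.StringIO(
    '1,2013-07-25 00:00:00.0,11599,CLOSED\n'
    '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT\n'
)

# names= supplies the header, so the first line is read as data.
columns = [col['column_name'] for col in sample_schema['orders']]
sample_orders = pd.read_csv(sample_csv, names=columns)
```

With this approach `order_id` comes back as an integer column rather than as strings.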

In [2]:
orders = get_df('/data/retail_db', 'orders', 'schemas/retail_db/retail.json')

In [3]:
customers = get_df('/data/retail_db', 'customers', 'schemas/retail_db/retail.json')

In [4]:
order_items = get_df('/data/retail_db', 'order_items', 'schemas/retail_db/retail.json')
# order_items[:3]

In [5]:
# order_items.query('order_item_order_id == 2')

In [6]:
# order_items[order_items['order_item_order_id'] == 2]

In [7]:
order_items['order_item_order_id']

0             1
1             2
2             2
3             2
4             4
          ...  
172193    68881
172194    68882
172195    68882
172196    68883
172197    68883
Name: order_item_order_id, Length: 172198, dtype: object

In [8]:
order_items.order_item_order_id[:4]

0    1
1    2
2    2
3    2
Name: order_item_order_id, dtype: object
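The `dtype: object` in the output above indicates that the ids are stored as strings, so a comparison such as `order_item_order_id == 2` would presumably match nothing until the column is cast. The sketch below illustrates this on a small made-up Series; `sample_ids` is hypothetical data, not the real column.

```python
import pandas as pd

# Made-up ids stored as strings, mirroring the object dtype shown above.
sample_ids = pd.Series(['1', '2', '2', '4'],
                       name='order_item_order_id', dtype=object)

# Comparing strings with an integer matches no rows.
matches_as_str = int((sample_ids == 2).sum())

# After casting to int64 the same comparison finds the rows.
sample_ids_int = sample_ids.astype('int64')
matches_as_int = int((sample_ids_int == 2).sum())
```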

### Get all the orders which belong to the month of 2013 August

In [9]:
orders.query('order_date.str.startswith("2013-08")', engine='python')

| | order_id | order_date | order_customer_id | order_status |
|---|---|---|---|---|
| 1296 | 1297 | 2013-08-01 00:00:00.0 | 11607 | COMPLETE\n |
| 1297 | 1298 | 2013-08-01 00:00:00.0 | 5105 | CLOSED\n |
| 1298 | 1299 | 2013-08-01 00:00:00.0 | 7802 | COMPLETE\n |
| 1299 | 1300 | 2013-08-01 00:00:00.0 | 553 | PENDING_PAYMENT\n |
| 1300 | 1301 | 2013-08-01 00:00:00.0 | 1604 | PENDING_PAYMENT\n |
| ... | ... | ... | ... | ... |
| 68705 | 68706 | 2013-08-20 00:00:00.0 | 130 | COMPLETE\n |
| 68706 | 68707 | 2013-08-23 00:00:00.0 | 11730 | COMPLETE\n |
| 68707 | 68708 | 2013-08-26 00:00:00.0 | 8852 | ON_HOLD\n |
| 68708 | 68709 | 2013-08-30 00:00:00.0 | 4756 | COMPLETE\n |
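As an alternative to matching on the string prefix, the dates can be parsed with `pd.to_datetime` and filtered on year and month. The following is a minimal sketch on made-up rows; `sample_orders` is hypothetical data that only imitates the `order_date` format shown above.

```python
import pandas as pd

# Made-up orders using the same date format as the real data set.
sample_orders = pd.DataFrame({
    'order_id': ['1', '2', '3'],
    'order_date': [
        '2013-07-31 00:00:00.0',
        '2013-08-01 00:00:00.0',
        '2013-08-15 00:00:00.0',
    ],
})

# Parse the strings once, then filter on the year and month parts.
order_dates = pd.to_datetime(sample_orders['order_date'])
august_2013 = sample_orders[
    (order_dates.dt.year == 2013) & (order_dates.dt.month == 8)
]
```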


### Get all the orders which belong to the months of August, September and October in 2013

In [10]:
orders[(orders.order_date.str.startswith('2013-08')) | 
       (orders.order_date.str.startswith('2013-09')) | 
       (orders.order_date.str.startswith('2013-10'))]

| | order_id | order_date | order_customer_id | order_status |
|---|---|---|---|---|
| 1296 | 1297 | 2013-08-01 00:00:00.0 | 11607 | COMPLETE\n |
| 1297 | 1298 | 2013-08-01 00:00:00.0 | 5105 | CLOSED\n |
| 1298 | 1299 | 2013-08-01 00:00:00.0 | 7802 | COMPLETE\n |
| 1299 | 1300 | 2013-08-01 00:00:00.0 | 553 | PENDING_PAYMENT\n |
| 1300 | 1301 | 2013-08-01 00:00:00.0 | 1604 | PENDING_PAYMENT\n |
| ... | ... | ... | ... | ... |
| 68737 | 68738 | 2013-10-27 00:00:00.0 | 1100 | COMPLETE\n |
| 68738 | 68739 | 2013-10-28 00:00:00.0 | 2528 | PENDING\n |
| 68739 | 68740 | 2013-10-29 00:00:00.0 | 10691 | ON_HOLD\n |
| 68740 | 68741 | 2013-10-30 00:00:00.0 | 5974 | PENDING_PAYMENT\n |
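The three `startswith` conditions can also be expressed as a single membership test on the `yyyy-MM` prefix. This is a sketch on made-up rows, assuming the same date format as above; `sample_orders` and `months` are illustrative names.

```python
import pandas as pd

# Made-up orders spanning months inside and outside the target range.
sample_orders = pd.DataFrame({
    'order_id': ['1', '2', '3', '4', '5'],
    'order_date': [
        '2013-07-31 00:00:00.0',
        '2013-08-01 00:00:00.0',
        '2013-09-15 00:00:00.0',
        '2013-10-30 00:00:00.0',
        '2013-11-01 00:00:00.0',
    ],
})

# The first seven characters of the date are the year and month.
months = ['2013-08', '2013-09', '2013-10']
in_range = sample_orders[
    sample_orders['order_date'].str.slice(0, 7).isin(months)
]
```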


### Get count of orders by status for the month of 2014 January

In [11]:
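# Note: this returns the total number of orders in January 2014, not a separate count for each status
orders[orders['order_date'].str.slice(0, 7).str.replace('-', '').astype('int64') == 201401]['order_status'].count()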

5908
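The exercise asks for a count per status, whereas a plain `.count()` returns a single total. One way to get the per-status breakdown is to group the filtered rows by `order_status`. The sketch below uses made-up rows, so its counts say nothing about the real data set.

```python
import pandas as pd

# Made-up orders; the last row falls outside January 2014.
sample_orders = pd.DataFrame({
    'order_date': [
        '2014-01-01 00:00:00.0',
        '2014-01-02 00:00:00.0',
        '2014-01-03 00:00:00.0',
        '2014-02-01 00:00:00.0',
    ],
    'order_status': ['COMPLETE', 'CLOSED', 'COMPLETE', 'COMPLETE'],
})

# Filter to the month first, then count the rows in each status group.
january_2014 = sample_orders[
    sample_orders['order_date'].str.startswith('2014-01')
]
count_by_status = january_2014.groupby('order_status').size()
```

Summing `count_by_status` gives back the overall number of orders for the month.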

### Get all the records from orders where there are no corresponding records in order_items

In [12]:
orders_joined = orders.set_index('order_id'). \
join(order_items.set_index('order_item_order_id'))
orders_joined[orders_joined['order_item_id'].isna()]

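| | order_date | order_customer_id | order_status | order_item_id | order_item_product_id | order_item_quantity | order_item_subtotal | order_item_product_price |
|---|---|---|---|---|---|---|---|---|
| 10015 | 2013-09-25 00:00:00.0 | 3112 | COMPLETE\n | NaN | NaN | NaN | NaN | NaN |
| 10016 | 2013-09-25 00:00:00.0 | 1214 | PROCESSING\n | NaN | NaN | NaN | NaN | NaN |
| 10022 | 2013-09-25 00:00:00.0 | 2697 | PENDING_PAYMENT\n | NaN | NaN | NaN | NaN | NaN |
| 10031 | 2013-09-25 00:00:00.0 | 9968 | COMPLETE\n | NaN | NaN | NaN | NaN | NaN |
| 10035 | 2013-09-25 00:00:00.0 | 2570 | PENDING_PAYMENT\n | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9978 | 2013-09-25 00:00:00.0 | 3100 | CANCELED\n | NaN | NaN | NaN | NaN | NaN |
| 9980 | 2013-09-25 00:00:00.0 | 8261 | PENDING_PAYMENT\n | NaN | NaN | NaN | NaN | NaN |
| 9982 | 2013-09-25 00:00:00.0 | 4860 | COMPLETE\n | NaN | NaN | NaN | NaN | NaN |
| 9994 | 2013-09-25 00:00:00.0 | 9585 | PENDING_PAYMENT\n | NaN | NaN | NaN | NaN | NaN |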
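The same "orders without items" result can be sketched with `DataFrame.merge` and its `indicator` flag, which marks rows found only on the left side. The data below is made up for illustration; `sample_orders` and `sample_order_items` are hypothetical stand-ins for the real frames.

```python
import pandas as pd

# Made-up data: order '2' has no matching order items.
sample_orders = pd.DataFrame({
    'order_id': ['1', '2', '3'],
    'order_status': ['COMPLETE', 'CLOSED', 'PENDING'],
})
sample_order_items = pd.DataFrame({
    'order_item_id': ['1', '2', '3'],
    'order_item_order_id': ['1', '1', '3'],
})

# indicator=True adds a '_merge' column describing where each row matched.
merged = sample_orders.merge(
    sample_order_items,
    how='left',
    left_on='order_id',
    right_on='order_item_order_id',
    indicator=True,
)

# Keep only the left-only rows and the original orders columns.
orders_without_items = merged.loc[
    merged['_merge'] == 'left_only', sample_orders.columns
]
```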


### Get all the customers who have not placed any orders

In [13]:
customer_order = customers.set_index('customer_id'). \
    join(orders.set_index('order_customer_id'))
customer_order[customer_order['order_id'].isna()].reset_index().rename(columns={'index':'customer_id'})

| | customer_id | customer_fname | customer_lname | customer_email | customer_password | customer_street | customer_city | customer_state | customer_zipcode | order_id | order_date | order_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10060 | Mary | Shaw | XXXXXXXXX | XXXXXXXXX | 4645 Fallen Timber By-pass | Caguas | PR | 00725\n | NaN | NaN | NaN |
| 1 | 10330 | Mary | Smith | XXXXXXXXX | XXXXXXXXX | 3410 Lazy Shadow Pathway | Hamilton | OH | 45013\n | NaN | NaN | NaN |
| 2 | 10439 | Emma | Smith | XXXXXXXXX | XXXXXXXXX | 1465 Clear Elk Diversion | Caguas | PR | 00725\n | NaN | NaN | NaN |
| 3 | 10913 | Mary | Williams | XXXXXXXXX | XXXXXXXXX | 9113 Grand Hills Parade | San Jose | CA | 95123\n | NaN | NaN | NaN |
| 4 | 10958 | Joan | Smith | XXXXXXXXX | XXXXXXXXX | 8771 Middle Quail Heath | Los Angeles | CA | 90024\n | NaN | NaN | NaN |
| 5 | 1187 | Dorothy | Vazquez | XXXXXXXXX | XXXXXXXXX | 363 Green Goose Run | Danbury | CT | 06810\n | NaN | NaN | NaN |
| 6 | 12175 | Amanda | Smith | XXXXXXXXX | XXXXXXXXX | 3729 Cinder Grove Concession | Tonawanda | NY | 14150\n | NaN | NaN | NaN |
| 7 | 12190 | Mary | Smith | XXXXXXXXX | XXXXXXXXX | 4462 Little Lagoon Route | Tempe | AZ | 85283\n | NaN | NaN | NaN |
| 8 | 12392 | Alan | Wolf | XXXXXXXXX | XXXXXXXXX | 6470 Fallen Barn Autoroute | Santa Ana | CA | 92704\n | NaN | NaN | NaN |
| 9 | 1481 | Grace | Smith | XXXXXXXXX | XXXXXXXXX | 2171 Clear Lake Isle | Caguas | PR | 00725\n | NaN | NaN | NaN |
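When only the customer columns are needed, a membership test avoids the join and the extra order columns. This is a minimal sketch on made-up data; `sample_customers` and `sample_orders` are hypothetical frames using the same key column names as above.

```python
import pandas as pd

# Made-up data: only customer '1' has placed orders.
sample_customers = pd.DataFrame({
    'customer_id': ['1', '2', '3'],
    'customer_fname': ['Mary', 'Emma', 'Joan'],
})
sample_orders = pd.DataFrame({
    'order_id': ['10', '11'],
    'order_customer_id': ['1', '1'],
})

# True for customers whose id appears in at least one order.
has_orders = sample_customers['customer_id'].isin(
    sample_orders['order_customer_id']
)

# Negate the mask to keep customers with no orders.
customers_without_orders = sample_customers[~has_orders]
```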


In [14]:
#order_items[order_items['order_item_order_id']==2]
# order_items.query('order_item_order_id == 2')

### Get the revenue by status

In [15]:
# Inner join keeps only the orders that have at least one order item
df_joined = orders.set_index('order_id'). \
    join(order_items.set_index('order_item_order_id'), how='inner')
# Subtotals are read as strings, so cast them before summing
df_joined['order_item_subtotal'] = df_joined['order_item_subtotal'].astype(float)
group_by_status = df_joined. \
    groupby('order_status')['order_item_subtotal']. \
    sum(). \
    round(2). \
    reset_index(name='Revenue')
group_by_status

| | order_status | Revenue |
|---|---|---|
| 0 | CANCELED\n | 696030.99 |
| 1 | CLOSED\n | 3736048.79 |
| 2 | COMPLETE\n | 11276933.69 |
| 3 | ON_HOLD\n | 1864731.24 |
| 4 | PAYMENT_REVIEW\n | 357841.45 |
| 5 | PENDING\n | 3851881.28 |
| 6 | PENDING_PAYMENT\n | 7581671.05 |
| 7 | PROCESSING\n | 4190636.76 |
| 8 | SUSPECTED_FRAUD\n | 766844.68 |
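The status labels in the output above still carry a trailing `\n` left over from reading the raw lines. A possible cleanup is to strip the column before grouping, as sketched below on made-up data; the frames and amounts are hypothetical and unrelated to the real revenue figures.

```python
import pandas as pd

# Made-up data; statuses keep the trailing newline seen in the output above.
sample_orders = pd.DataFrame({
    'order_id': ['1', '2', '3'],
    'order_status': ['COMPLETE\n', 'CLOSED\n', 'COMPLETE\n'],
})
sample_order_items = pd.DataFrame({
    'order_item_order_id': ['1', '1', '2', '3'],
    'order_item_subtotal': ['100.50', '49.50', '20.00', '10.00'],
})

# Remove surrounding whitespace, including the trailing newline.
sample_orders['order_status'] = sample_orders['order_status'].str.strip()

# merge defaults to an inner join, so orders without items drop out.
joined = sample_orders.merge(
    sample_order_items,
    left_on='order_id',
    right_on='order_item_order_id',
)
joined['order_item_subtotal'] = joined['order_item_subtotal'].astype(float)

revenue_by_status = (
    joined.groupby('order_status')['order_item_subtotal']
    .sum()
    .round(2)
    .reset_index(name='Revenue')
)
```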
