## Data Processing using Pandas
Let us understand how to process data using a third-party library called Pandas.

* Limitations of Collections
* Overview of Pandas
* Overview of Series
* Reading files into Data Frames
* Standard Transformations
* Projection and Filtering Data
* Aggregations
* Writing to Files
* Joining Data Frames
* Exercises

## Limitations of Collections
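Let us understand some of the limitations of built-in collections such as list, dict and tuple.
* There is no structure defined for the data.
* The code is not readable, as fields are typically referred to by position rather than by name.
* Some of the APIs are scattered across multiple Python modules or libraries.

Pandas addresses these limitations by providing a robust set of APIs where we can refer to columns using names and perform all standard transformations.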
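To make the first two limitations concrete, here is a minimal sketch that uses only built-in collections. The sample records and their positional layout are made up for illustration; they loosely mimic the order_items data used later in this section.

```python
# Hypothetical sample records:
# (order_item_id, order_item_order_id, order_item_subtotal)
order_items = [
    (1, 1, 299.98),
    (2, 2, 199.99),
    (3, 2, 250.0),
    (4, 2, 129.99),
]

# With plain tuples we have to remember that position 1 is the order id
# and position 2 is the subtotal; nothing in the data itself says so.
revenue_per_order = {}
for item in order_items:
    order_id = item[1]
    subtotal = item[2]
    revenue_per_order[order_id] = revenue_per_order.get(order_id, 0.0) + subtotal

# Round to 2 decimals to hide floating point noise.
rounded = {order_id: round(revenue, 2) for order_id, revenue in revenue_per_order.items()}
print(rounded)
```

Even this small aggregation needs a loop, a dict and knowledge of field positions, which is the kind of code that named columns are meant to simplify.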


## Overview of Pandas
Let us understand the details with respect to Pandas.
* Pandas is not part of the Python standard library, and hence we need to install it using pip.
* It has 2 main types of data structures - Series and DataFrame.
* We can perform all standard transformations using Pandas APIs
* We also have SQL based wrappers on top of Pandas where we can write queries.


## Overview of Series
Let us quickly go through one of the Pandas Data Structures - Series.
* Pandas Series is a one-dimensional labeled array capable of holding any data type.
* It is similar to one column in an Excel spreadsheet or a database table.
* We can create a Series from a dict or a list.

In [1]:
d = {"JAN": 10, "FEB": 15, "MAR": 12, "APR": 16}

In [2]:
type(d)

dict

In [3]:
import pandas as pd
s = pd.Series(d)

In [4]:
s

JAN    10
FEB    15
MAR    12
APR    16
dtype: int64

In [11]:
s['FEB']

15

In [12]:
type(s)

pandas.core.series.Series

In [13]:
s.sum()

53

In [5]:
l = [10, 15, 12, 16]

In [6]:
l_s = pd.Series(l)

In [7]:
l_s

0    10
1    15
2    12
3    16
dtype: int64

In [8]:
l_s[0]

10
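As a small sketch of what "labeled array" means in practice, the following builds a Series from a list with an explicit index and applies operations to all elements at once. The variable names are illustrative only.

```python
import pandas as pd

# Same values as the list above, but with explicit labels as the index.
sales = pd.Series([10, 15, 12, 16], index=["JAN", "FEB", "MAR", "APR"])

# Label-based and position-based access to the same element.
feb_by_label = sales.loc["FEB"]
feb_by_position = sales.iloc[1]

# Operations apply to every element without an explicit loop.
doubled = sales * 2
above_avg = sales[sales > sales.mean()]

print(feb_by_label, feb_by_position)
print(doubled.to_dict())
print(above_avg.to_dict())
```

Using loc and iloc makes it explicit whether we mean a label or a position, which matters once the index is not the default 0, 1, 2, and so on.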

* When we fetch only one column from a Pandas Data Frame, it will be returned as a Series.

In [15]:
orders_path = "/Users/itversity/Research/data/retail_db/orders/part-00000.csv"

In [16]:
orders_schema = [
  "order_id",
  "order_date",
  "order_customer_id",
  "order_status"
]

In [17]:
orders = pd.read_csv(orders_path,
  header=None,
  names=orders_schema
)

In [18]:
type(orders)

pandas.core.frame.DataFrame

In [21]:
order_dates = orders.order_date
order_dates

0        2013-07-25 00:00:00.0
1        2013-07-25 00:00:00.0
2        2013-07-25 00:00:00.0
3        2013-07-25 00:00:00.0
4        2013-07-25 00:00:00.0
                 ...          
68878    2014-07-09 00:00:00.0
68879    2014-07-13 00:00:00.0
68880    2014-07-19 00:00:00.0
68881    2014-07-22 00:00:00.0
68882    2014-07-23 00:00:00.0
Name: order_date, Length: 68883, dtype: object

In [22]:
type(order_dates)

pandas.core.series.Series

**Don’t worry too much about creating Data Frames yet; for now we are only trying to understand how Data Frames and Series are related.**

## Reading files into Data Frames
Let us see how we can create a Pandas Data Frame.
read_csv is the most popular API to create a Data Frame by reading data from files.
* Here are some of the important options.
  * sep or delimiter
  * header or names
  * index_col
  * dtype
  * and many more
* We have several other APIs which help us create Data Frames.
  * read_fwf
  * read_table
  * read_json (implemented in pandas.io.json)
  * and more
* Here is how we can create a Data Frame for the orders dataset.
  * The delimiter is a comma, which is the default for read_csv.
  * There is no header in the file, and hence we have to set the keyword argument header to None.
  * We can pass the column names as a list using the keyword argument names.
  * Data types of each column are typically inferred based on the data; however, we can explicitly specify data types using dtype.
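The options listed above can be tried out without any file on disk. The sketch below reads a small, made-up pipe-delimited sample from an in-memory buffer; the sample rows and the choice of order_id as the index are assumptions for illustration.

```python
import io
import pandas as pd

# Hypothetical pipe-delimited sample with no header row.
raw = io.StringIO(
    "1|2013-07-25 00:00:00.0|11599|CLOSED\n"
    "2|2013-07-25 00:00:00.0|256|PENDING_PAYMENT\n"
)

sample_orders = pd.read_csv(
    raw,
    sep="|",               # non-default delimiter
    header=None,           # the data has no header row
    names=["order_id", "order_date", "order_customer_id", "order_status"],
    index_col="order_id",  # use order_id as the row index
    dtype={"order_customer_id": "int64"},  # explicit type instead of inference
)

print(sample_orders.dtypes)
print(sample_orders.loc[2, "order_status"])
```

Because order_id becomes the index, the resulting Data Frame has three columns, and rows can be looked up by order id using loc.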


In [None]:
import pandas as pd
pd.read_csv?

In [24]:
orders_path = "/Users/itversity/Research/data/retail_db/orders/part-00000.csv"

In [25]:
orders_schema = [
  "order_id",
  "order_date",
  "order_customer_id",
  "order_status"
]

In [26]:
orders = pd.read_csv(orders_path,
                     delimiter=',',
                     header=None,
                     names=orders_schema
                    )

In [27]:
orders

| | order_id | order_date | order_customer_id | order_status |
|---|---|---|---|---|
| 0 | 1 | 2013-07-25 00:00:00.0 | 11599 | CLOSED |
| 1 | 2 | 2013-07-25 00:00:00.0 | 256 | PENDING_PAYMENT |
| 2 | 3 | 2013-07-25 00:00:00.0 | 12111 | COMPLETE |
| 3 | 4 | 2013-07-25 00:00:00.0 | 8827 | CLOSED |
| 4 | 5 | 2013-07-25 00:00:00.0 | 11318 | COMPLETE |
| ... | ... | ... | ... | ... |
| 68878 | 68879 | 2014-07-09 00:00:00.0 | 778 | COMPLETE |
| 68879 | 68880 | 2014-07-13 00:00:00.0 | 1117 | COMPLETE |
| 68880 | 68881 | 2014-07-19 00:00:00.0 | 2518 | PENDING_PAYMENT |
| 68881 | 68882 | 2014-07-22 00:00:00.0 | 10000 | ON_HOLD |


In [29]:
orders.head(10)

| | order_id | order_date | order_customer_id | order_status |
|---|---|---|---|---|
| 0 | 1 | 2013-07-25 00:00:00.0 | 11599 | CLOSED |
| 1 | 2 | 2013-07-25 00:00:00.0 | 256 | PENDING_PAYMENT |
| 2 | 3 | 2013-07-25 00:00:00.0 | 12111 | COMPLETE |
| 3 | 4 | 2013-07-25 00:00:00.0 | 8827 | CLOSED |
| 4 | 5 | 2013-07-25 00:00:00.0 | 11318 | COMPLETE |
| 5 | 6 | 2013-07-25 00:00:00.0 | 7130 | COMPLETE |
| 6 | 7 | 2013-07-25 00:00:00.0 | 4530 | COMPLETE |
| 7 | 8 | 2013-07-25 00:00:00.0 | 2911 | PROCESSING |
| 8 | 9 | 2013-07-25 00:00:00.0 | 5657 | PENDING_PAYMENT |
| 9 | 10 | 2013-07-25 00:00:00.0 | 5648 | PENDING_PAYMENT |


## Standard Transformations
Let us see some of the standard transformations that can be performed using Data Frame APIs.
* Projection
* Filtering
* Aggregations
* Joins

Let us see some examples related to standard transformations using Pandas APIs. But before that, let us read both orders and order_items into Pandas Data Frames.

* Read order_items data and project order_item_order_id and order_item_subtotal. The columns should be named as follows, in this order.
  * order_item_id
  * order_item_order_id
  * order_item_product_id
  * order_item_quantity
  * order_item_subtotal
  * order_item_product_price

In [31]:
import pandas as pd
pd.read_csv?

Signature:
pd.read_csv(
    filepath_or_buffer: Union[str, pathlib.Path, IO[~AnyStr]],
    sep=',',
    delimiter=None,
    header='infer',
    names=None,
    index_col=None,
    usecols=None,
    squeeze=False,
    prefix=None,
    ...

In [32]:
orders_path = "/Users/itversity/Research/data/retail_db/orders/part-00000.csv"

In [33]:
orders_schema = [
  "order_id",
  "order_date",
  "order_customer_id",
  "order_status"
]

In [34]:
orders = pd.read_csv(orders_path,
                     delimiter=',',
                     header=None,
                     names=orders_schema
                    )

In [35]:
# Reading order_items
order_items_path = "/Users/itversity/Research/data/retail_db/order_items/part-00000.csv"

In [36]:
order_items_schema = [
    "order_item_id",
    "order_item_order_id",
    "order_item_product_id",
    "order_item_quantity",
    "order_item_subtotal",
    "order_item_product_price"
]

In [37]:
order_items = pd.read_csv(
    order_items_path,
    header=None,
    names=order_items_schema
)

In [38]:
order_items

| | order_item_id | order_item_order_id | order_item_product_id | order_item_quantity | order_item_subtotal | order_item_product_price |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 957 | 1 | 299.98 | 299.98 |
| 1 | 2 | 2 | 1073 | 1 | 199.99 | 199.99 |
| 2 | 3 | 2 | 502 | 5 | 250.00 | 50.00 |
| 3 | 4 | 2 | 403 | 1 | 129.99 | 129.99 |
| 4 | 5 | 4 | 897 | 2 | 49.98 | 24.99 |
| ... | ... | ... | ... | ... | ... | ... |
| 172193 | 172194 | 68881 | 403 | 1 | 129.99 | 129.99 |
| 172194 | 172195 | 68882 | 365 | 1 | 59.99 | 59.99 |
| 172195 | 172196 | 68882 | 502 | 1 | 50.00 | 50.00 |
| 172196 | 172197 | 68883 | 208 | 1 | 1999.99 | 1999.99 |


## Projection and Filtering Data

Let us understand how to project as well as filter data in Data Frames.

* Projecting data

In [39]:
orders.order_date

0        2013-07-25 00:00:00.0
1        2013-07-25 00:00:00.0
2        2013-07-25 00:00:00.0
3        2013-07-25 00:00:00.0
4        2013-07-25 00:00:00.0
                 ...          
68878    2014-07-09 00:00:00.0
68879    2014-07-13 00:00:00.0
68880    2014-07-19 00:00:00.0
68881    2014-07-22 00:00:00.0
68882    2014-07-23 00:00:00.0
Name: order_date, Length: 68883, dtype: object

In [40]:
orders['order_date']

0        2013-07-25 00:00:00.0
1        2013-07-25 00:00:00.0
2        2013-07-25 00:00:00.0
3        2013-07-25 00:00:00.0
4        2013-07-25 00:00:00.0
                 ...          
68878    2014-07-09 00:00:00.0
68879    2014-07-13 00:00:00.0
68880    2014-07-19 00:00:00.0
68881    2014-07-22 00:00:00.0
68882    2014-07-23 00:00:00.0
Name: order_date, Length: 68883, dtype: object

In [41]:
# Project order_item_order_id and order_item_subtotal
order_items[['order_item_order_id', 'order_item_subtotal']]

| | order_item_order_id | order_item_subtotal |
|---|---|---|
| 0 | 1 | 299.98 |
| 1 | 2 | 199.99 |
| 2 | 2 | 250.00 |
| 3 | 2 | 129.99 |
| 4 | 4 | 49.98 |
| ... | ... | ... |
| 172193 | 68881 | 129.99 |
| 172194 | 68882 | 59.99 |
| 172195 | 68882 | 50.00 |
| 172196 | 68883 | 1999.99 |
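The difference between projecting with a single column name and with a list of names is easy to miss. The following sketch, on a few made-up rows in the order_items layout, shows that the first form yields a Series while the second yields a Data Frame.

```python
import pandas as pd

# Hypothetical rows in the order_items layout.
items = pd.DataFrame({
    "order_item_order_id": [1, 2, 2],
    "order_item_subtotal": [299.98, 199.99, 250.0],
})

one_column = items["order_item_subtotal"]          # a single name returns a Series
one_column_frame = items[["order_item_subtotal"]]  # a list of names returns a Data Frame

print(type(one_column).__name__)
print(type(one_column_frame).__name__)
```

Passing a list is useful when later steps expect a Data Frame, even if only one column is projected.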


* Filter for order_item_order_id 2

In [42]:
order_items[order_items.order_item_order_id == 2]

| | order_item_id | order_item_order_id | order_item_product_id | order_item_quantity | order_item_subtotal | order_item_product_price |
|---|---|---|---|---|---|---|
| 1 | 2 | 2 | 1073 | 1 | 199.99 | 199.99 |
| 2 | 3 | 2 | 502 | 5 | 250.0 | 50.0 |
| 3 | 4 | 2 | 403 | 1 | 129.99 | 129.99 |


In [43]:
order_items[order_items['order_item_order_id'] == 2]

| | order_item_id | order_item_order_id | order_item_product_id | order_item_quantity | order_item_subtotal | order_item_product_price |
|---|---|---|---|---|---|---|
| 1 | 2 | 2 | 1073 | 1 | 199.99 | 199.99 |
| 2 | 3 | 2 | 502 | 5 | 250.0 | 50.0 |
| 3 | 4 | 2 | 403 | 1 | 129.99 | 129.99 |


In [44]:
order_items['order_item_order_id'] == 2

0         False
1          True
2          True
3          True
4         False
          ...  
172193    False
172194    False
172195    False
172196    False
172197    False
Name: order_item_order_id, Length: 172198, dtype: bool

In [45]:
order_items.query('order_item_order_id == 2')

| | order_item_id | order_item_order_id | order_item_product_id | order_item_quantity | order_item_subtotal | order_item_product_price |
|---|---|---|---|---|---|---|
| 1 | 2 | 2 | 1073 | 1 | 199.99 | 199.99 |
| 2 | 3 | 2 | 502 | 5 | 250.0 | 50.0 |
| 3 | 4 | 2 | 403 | 1 | 129.99 | 129.99 |


* Filter for order_item_order_id 2 and order_item_subtotal between 150 and 250

In [48]:
order_items[(order_items.order_item_order_id == 2) &
            ((order_items.order_item_subtotal >= 150) &
             (order_items.order_item_subtotal <= 250)
            )
           ]

| | order_item_id | order_item_order_id | order_item_product_id | order_item_quantity | order_item_subtotal | order_item_product_price |
|---|---|---|---|---|---|---|
| 1 | 2 | 2 | 1073 | 1 | 199.99 | 199.99 |
| 2 | 3 | 2 | 502 | 5 | 250.0 | 50.0 |


In [51]:
order_items.query('order_item_order_id == 2 and ' +
                  'order_item_subtotal >= 150 and ' +
                  'order_item_subtotal <= 250')

| | order_item_id | order_item_order_id | order_item_product_id | order_item_quantity | order_item_subtotal | order_item_product_price |
|---|---|---|---|---|---|---|
| 1 | 2 | 2 | 1073 | 1 | 199.99 | 199.99 |
| 2 | 3 | 2 | 502 | 5 | 250.0 | 50.0 |
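A range condition such as "between 150 and 250" can also be written with Series.between, and the individual conditions can be kept in named boolean masks before combining them. This is a sketch on made-up rows, not the full dataset.

```python
import pandas as pd

# Hypothetical rows in the order_items layout.
items = pd.DataFrame({
    "order_item_id": [2, 3, 4, 5],
    "order_item_order_id": [2, 2, 2, 4],
    "order_item_subtotal": [199.99, 250.0, 129.99, 49.98],
})

# Each condition is a boolean Series aligned with the rows of items.
is_order_2 = items["order_item_order_id"] == 2
in_range = items["order_item_subtotal"].between(150, 250)  # inclusive on both ends by default

# Combine the masks with & and use the result to select rows.
result = items[is_order_2 & in_range]
print(result)
```

Naming the masks tends to be easier to read than nesting several parenthesized comparisons inside one pair of brackets.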


* Filter for orders which were placed on August 1st, 2013

In [52]:
orders[orders.order_date == '2013-08-01 00:00:00.0']

| | order_id | order_date | order_customer_id | order_status |
|---|---|---|---|---|
| 1296 | 1297 | 2013-08-01 00:00:00.0 | 11607 | COMPLETE |
| 1297 | 1298 | 2013-08-01 00:00:00.0 | 5105 | CLOSED |
| 1298 | 1299 | 2013-08-01 00:00:00.0 | 7802 | COMPLETE |
| 1299 | 1300 | 2013-08-01 00:00:00.0 | 553 | PENDING_PAYMENT |
| 1300 | 1301 | 2013-08-01 00:00:00.0 | 1604 | PENDING_PAYMENT |
| ... | ... | ... | ... | ... |
| 57959 | 57960 | 2013-08-01 00:00:00.0 | 10177 | PENDING |
| 57960 | 57961 | 2013-08-01 00:00:00.0 | 835 | COMPLETE |
| 57961 | 57962 | 2013-08-01 00:00:00.0 | 10521 | PENDING_PAYMENT |
| 67446 | 67447 | 2013-08-01 00:00:00.0 | 8956 | COMPLETE |


In [54]:
orders[orders.order_date.str.startswith('2013-08-01')]

| | order_id | order_date | order_customer_id | order_status |
|---|---|---|---|---|
| 1296 | 1297 | 2013-08-01 00:00:00.0 | 11607 | COMPLETE |
| 1297 | 1298 | 2013-08-01 00:00:00.0 | 5105 | CLOSED |
| 1298 | 1299 | 2013-08-01 00:00:00.0 | 7802 | COMPLETE |
| 1299 | 1300 | 2013-08-01 00:00:00.0 | 553 | PENDING_PAYMENT |
| 1300 | 1301 | 2013-08-01 00:00:00.0 | 1604 | PENDING_PAYMENT |
| ... | ... | ... | ... | ... |
| 57959 | 57960 | 2013-08-01 00:00:00.0 | 10177 | PENDING |
| 57960 | 57961 | 2013-08-01 00:00:00.0 | 835 | COMPLETE |
| 57961 | 57962 | 2013-08-01 00:00:00.0 | 10521 | PENDING_PAYMENT |
| 67446 | 67447 | 2013-08-01 00:00:00.0 | 8956 | COMPLETE |


In [56]:
orders.query('order_date.str.startswith("2013-08-01")', engine='python')

| | order_id | order_date | order_customer_id | order_status |
|---|---|---|---|---|
| 1296 | 1297 | 2013-08-01 00:00:00.0 | 11607 | COMPLETE |
| 1297 | 1298 | 2013-08-01 00:00:00.0 | 5105 | CLOSED |
| 1298 | 1299 | 2013-08-01 00:00:00.0 | 7802 | COMPLETE |
| 1299 | 1300 | 2013-08-01 00:00:00.0 | 553 | PENDING_PAYMENT |
| 1300 | 1301 | 2013-08-01 00:00:00.0 | 1604 | PENDING_PAYMENT |
| ... | ... | ... | ... | ... |
| 57959 | 57960 | 2013-08-01 00:00:00.0 | 10177 | PENDING |
| 57960 | 57961 | 2013-08-01 00:00:00.0 | 835 | COMPLETE |
| 57961 | 57962 | 2013-08-01 00:00:00.0 | 10521 | PENDING_PAYMENT |
| 67446 | 67447 | 2013-08-01 00:00:00.0 | 8956 | COMPLETE |
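The filters above treat order_date as a string, since the column has dtype object. One possible alternative, sketched here on made-up rows, is to parse the values with pd.to_datetime and compare on the calendar date. The variable names are illustrative.

```python
import pandas as pd

# Hypothetical rows in the orders layout.
orders_sample = pd.DataFrame({
    "order_id": [1, 1297, 1298, 68883],
    "order_date": [
        "2013-07-25 00:00:00.0",
        "2013-08-01 00:00:00.0",
        "2013-08-01 00:00:00.0",
        "2014-07-23 00:00:00.0",
    ],
})

# Parse the strings into datetime values.
order_ts = pd.to_datetime(orders_sample["order_date"])

# normalize() sets the time part to midnight, so the comparison is on the date only.
placed_on_aug_1 = orders_sample[order_ts.dt.normalize() == pd.Timestamp("2013-08-01")]
print(placed_on_aug_1)
```

Once the values are parsed, attributes such as order_ts.dt.year and order_ts.dt.month are also available for month or year based filters.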


## Aggregations

Let us understand how to perform aggregations using Pandas. There are 2 types of aggregations.
* Global Aggregations
* By Key Aggregations

### Global Aggregations

There are several global aggregations that can be performed.

* Getting number of records in the Data Frame.

In [59]:
orders.shape

(68883, 4)

In [58]:
orders.shape[0]

68883

* Getting the number of non-null values (values other than np.NaN) in each attribute of a Data Frame

In [60]:
orders.count()

order_id             68883
order_date           68883
order_customer_id    68883
order_status         68883
dtype: int64

In [61]:
type(orders.count())

pandas.core.series.Series

In [62]:
orders.count()['order_id']

68883

* Getting basic statistics of numeric fields of a Data Frame

In [63]:
orders.describe()

| | order_id | order_customer_id |
|---|---|---|
| count | 68883.0 | 68883.0 |
| mean | 34442.0 | 6216.571099 |
| std | 19884.953633 | 3586.205241 |
| min | 1.0 | 1.0 |
| 25% | 17221.5 | 3122.0 |
| 50% | 34442.0 | 6199.0 |
| 75% | 51662.5 | 9326.0 |
| max | 68883.0 | 12435.0 |


* Get the revenue for order id 2 from order_items

In [67]:
order_items[order_items.order_item_order_id == 2].order_item_subtotal.sum()

579.98
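Several global aggregations over the same column can be computed in one call by passing a list of function names to agg. The sketch below uses made-up rows that mirror the items of order id 2.

```python
import pandas as pd

# Hypothetical rows in the order_items layout.
items = pd.DataFrame({
    "order_item_order_id": [1, 2, 2, 2],
    "order_item_subtotal": [299.98, 199.99, 250.0, 129.99],
})

order_2 = items[items["order_item_order_id"] == 2]

# The result is a Series indexed by the aggregation names.
summary = order_2["order_item_subtotal"].agg(["sum", "min", "max", "count"])
print(summary.round(2))
```

Individual values can then be read by name, for example summary["sum"].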

### By Key Aggregations

By Key Aggregations are those which are computed per key. Here are some of the examples.

* Getting number of orders per day

In [74]:
orders.groupby(orders['order_date'])['order_id'].count()

order_date
2013-07-25 00:00:00.0    143
2013-07-26 00:00:00.0    269
2013-07-27 00:00:00.0    202
2013-07-28 00:00:00.0    187
2013-07-29 00:00:00.0    253
                        ... 
2014-07-20 00:00:00.0    285
2014-07-21 00:00:00.0    235
2014-07-22 00:00:00.0    138
2014-07-23 00:00:00.0    166
2014-07-24 00:00:00.0    185
Name: order_id, Length: 364, dtype: int64

* Getting number of orders per status

In [76]:
orders.groupby('order_status')['order_status'].count()

order_status
CANCELED            1428
CLOSED              7556
COMPLETE           22899
ON_HOLD             3798
PAYMENT_REVIEW       729
PENDING             7610
PENDING_PAYMENT    15030
PROCESSING          8275
SUSPECTED_FRAUD     1558
Name: order_status, dtype: int64

* Computing revenue per order

In [78]:
order_items. \
    groupby('order_item_order_id')['order_item_subtotal']. \
    sum()

order_item_order_id
1         299.98
2         579.98
4         699.85
5        1129.86
7         579.92
          ...   
68879    1259.97
68880     999.77
68881     129.99
68882     109.99
68883    2149.99
Name: order_item_subtotal, Length: 57431, dtype: float64

In [81]:
order_items. \
    groupby('order_item_order_id')['order_item_subtotal']. \
    agg(['sum', 'min', 'max', 'count']). \
    rename(columns={'count': 'item_count', 'sum': 'revenue'})

| order_item_order_id | revenue | min | max | item_count |
|---|---|---|---|---|
| 1 | 299.98 | 299.98 | 299.98 | 1 |
| 2 | 579.98 | 129.99 | 250.00 | 3 |
| 4 | 699.85 | 49.98 | 299.95 | 4 |
| 5 | 1129.86 | 99.96 | 299.98 | 5 |
| 7 | 579.92 | 79.95 | 299.98 | 3 |
| ... | ... | ... | ... | ... |
| 68879 | 1259.97 | 129.99 | 999.99 | 3 |
| 68880 | 999.77 | 149.94 | 250.00 | 5 |
| 68881 | 129.99 | 129.99 | 129.99 | 1 |
| 68882 | 109.99 | 50.00 | 59.99 | 2 |
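As an alternative to aggregating and then renaming columns, pandas (0.25 and later) supports named aggregation, where the output column names are given directly in the agg call. The following is a sketch on made-up rows; the output names revenue and item_count simply mirror the ones used above.

```python
import pandas as pd

# Hypothetical rows in the order_items layout.
items = pd.DataFrame({
    "order_item_order_id": [1, 2, 2, 2, 4],
    "order_item_subtotal": [299.98, 199.99, 250.0, 129.99, 49.98],
})

revenue_per_order = (
    items
    .groupby("order_item_order_id")
    .agg(
        # output_name=(input_column, aggregation)
        revenue=("order_item_subtotal", "sum"),
        item_count=("order_item_subtotal", "count"),
    )
    .round(2)
)
print(revenue_per_order)
```

Wrapping the chain in parentheses is one way to split it across lines without trailing backslashes.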


In [82]:
order_items.rename(columns={'order_item_order_id': 'order_id'})

| | order_item_id | order_id | order_item_product_id | order_item_quantity | order_item_subtotal | order_item_product_price |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 957 | 1 | 299.98 | 299.98 |
| 1 | 2 | 2 | 1073 | 1 | 199.99 | 199.99 |
| 2 | 3 | 2 | 502 | 5 | 250.00 | 50.00 |
| 3 | 4 | 2 | 403 | 1 | 129.99 | 129.99 |
| 4 | 5 | 4 | 897 | 2 | 49.98 | 24.99 |
| ... | ... | ... | ... | ... | ... | ... |
| 172193 | 172194 | 68881 | 403 | 1 | 129.99 | 129.99 |
| 172194 | 172195 | 68882 | 365 | 1 | 59.99 | 59.99 |
| 172195 | 172196 | 68882 | 502 | 1 | 50.00 | 50.00 |
| 172196 | 172197 | 68883 | 208 | 1 | 1999.99 | 1999.99 |


## Writing to Files

Pandas also provides simple APIs to write the data back to files.

* Let us write the revenue per order along with order_id to a file.

In [None]:
order_items.to_csv?

In [84]:
base_dir = "/Users/itversity/Research/data/retail_db"

In [85]:
output_dir = base_dir + "/revenue_per_order"

In [92]:
import subprocess

subprocess.call(['mkdir', output_dir])

0

In [94]:
import subprocess
#ls -ltr /Users/itversity/Research/data/retail_db/revenue_per_order
subprocess.check_output(['ls', '-ltr', output_dir])

b''

In [96]:
%%sh
ls -ltr /Users/itversity/Research/data/retail_db

total 0
drwxr-xr-x  3 itversity  staff   96 Oct  7  2019 categories
drwxr-xr-x  3 itversity  staff   96 Oct  7  2019 customers
drwxr-xr-x  3 itversity  staff   96 Oct  7  2019 departments
drwxr-xr-x  3 itversity  staff   96 Oct  7  2019 products
drwxr-xr-x  3 itversity  staff   96 Apr 18 09:30 order_items
drwxr-xr-x  6 itversity  staff  192 Apr 22 06:40 orders_filtered
drwxr-xr-x  4 itversity  staff  128 Apr 25 11:10 orders
drwxr-xr-x  8 itversity  staff  256 Apr 25 11:20 order_count_by_date
drwxr-xr-x  2 itversity  staff   64 May  9 09:23 revenue_per_order


In [100]:
order_items. \
    groupby('order_item_order_id')['order_item_subtotal']. \
    agg(['sum', 'min', 'max', 'count']). \
    rename(columns={'count': 'item_count', 'sum': 'revenue'}). \
    to_json(output_dir + '/revenue_per_order.json', orient='table')

In [101]:
%%sh
ls -ltr /Users/itversity/Research/data/retail_db/revenue_per_order

total 9768
-rw-r--r--  1 itversity  staff  4999380 May  9 09:27 revenue_per_order.json


In [106]:
order_items. \
    groupby('order_item_order_id')['order_item_subtotal']. \
    agg(['sum', 'min', 'max', 'count']). \
    rename(columns={'count': 'item_count', 'sum': 'revenue'}). \
    round(2). \
    to_csv(output_dir + '/revenue_per_order.csv')
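The write step can be checked by reading the file back. The sketch below writes a small, made-up revenue Data Frame to a temporary directory (standing in for the real output location) and restores it with read_csv. It also uses os.makedirs as a portable alternative to invoking mkdir through subprocess; that substitution is a suggestion, not what the cells above do.

```python
import os
import tempfile
import pandas as pd

# Hypothetical aggregated result, indexed by order id.
revenue = pd.DataFrame(
    {"revenue": [299.98, 579.98], "item_count": [1, 3]},
    index=pd.Index([1, 2], name="order_item_order_id"),
)

# A temporary directory stands in for the real output location.
with tempfile.TemporaryDirectory() as tmp_dir:
    output_dir = os.path.join(tmp_dir, "revenue_per_order")
    os.makedirs(output_dir, exist_ok=True)  # does not fail if the directory exists

    csv_path = os.path.join(output_dir, "revenue_per_order.csv")
    revenue.to_csv(csv_path)  # the index is written as the first column

    # Read the file back, restoring order_item_order_id as the index.
    restored = pd.read_csv(csv_path, index_col="order_item_order_id")

print(restored)
```

Since to_csv writes the index by default, passing index_col when reading back keeps the order id as the index instead of turning it into a regular column.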

## Joining Data Frames

Let us understand how to join Data Frames using Pandas.

* Join orders and order_items using orders.order_id and order_items.order_item_order_id.

In [126]:
orders.set_index('order_id')

| order_id | order_date | order_customer_id | order_status |
|---|---|---|---|
| 1 | 2013-07-25 00:00:00.0 | 11599 | CLOSED |
| 2 | 2013-07-25 00:00:00.0 | 256 | PENDING_PAYMENT |
| 3 | 2013-07-25 00:00:00.0 | 12111 | COMPLETE |
| 4 | 2013-07-25 00:00:00.0 | 8827 | CLOSED |
| 5 | 2013-07-25 00:00:00.0 | 11318 | COMPLETE |
| ... | ... | ... | ... |
| 68879 | 2014-07-09 00:00:00.0 | 778 | COMPLETE |
| 68880 | 2014-07-13 00:00:00.0 | 1117 | COMPLETE |
| 68881 | 2014-07-19 00:00:00.0 | 2518 | PENDING_PAYMENT |
| 68882 | 2014-07-22 00:00:00.0 | 10000 | ON_HOLD |


In [110]:
order_items.set_index('order_item_order_id')

| order_item_order_id | order_item_id | order_item_product_id | order_item_quantity | order_item_subtotal | order_item_product_price |
|---|---|---|---|---|---|
| 1 | 1 | 957 | 1 | 299.98 | 299.98 |
| 2 | 2 | 1073 | 1 | 199.99 | 199.99 |
| 2 | 3 | 502 | 5 | 250.00 | 50.00 |
| 2 | 4 | 403 | 1 | 129.99 | 129.99 |
| 4 | 5 | 897 | 2 | 49.98 | 24.99 |
| ... | ... | ... | ... | ... | ... |
| 68881 | 172194 | 403 | 1 | 129.99 | 129.99 |
| 68882 | 172195 | 365 | 1 | 59.99 | 59.99 |
| 68882 | 172196 | 502 | 1 | 50.00 | 50.00 |
| 68883 | 172197 | 208 | 1 | 1999.99 | 1999.99 |


In [112]:
orders.join?

Signature: orders.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False)
Docstring:
Join columns of another DataFrame.

Join columns with `other` DataFrame either on index or on a key
column. Efficiently join multiple DataFrame objects by index at once by
passing a list.

Parameters
----------
other : DataFrame, Series, or list of DataFrame
    Index should be similar to one of the columns in this one. If a
    Series is passed, its name attribute must be set, and that will be
    used as the column name in the resulting joined DataFrame.
on : str, list of str, or array-like, optional
    Column or index level name(s) in the caller to join on the index
    in `other`,

In [111]:
# Join orders and order_items using order_id (order_item_order_id from order_items)
orders.set_index('order_id'). \
    join(order_items.set_index('order_item_order_id'))

| | order_date | order_customer_id | order_status | order_item_id | order_item_product_id | order_item_quantity | order_item_subtotal | order_item_product_price |
|---|---|---|---|---|---|---|---|---|
| 1 | 2013-07-25 00:00:00.0 | 11599 | CLOSED | 1.0 | 957.0 | 1.0 | 299.98 | 299.98 |
| 2 | 2013-07-25 00:00:00.0 | 256 | PENDING_PAYMENT | 2.0 | 1073.0 | 1.0 | 199.99 | 199.99 |
| 2 | 2013-07-25 00:00:00.0 | 256 | PENDING_PAYMENT | 3.0 | 502.0 | 5.0 | 250.00 | 50.00 |
| 2 | 2013-07-25 00:00:00.0 | 256 | PENDING_PAYMENT | 4.0 | 403.0 | 1.0 | 129.99 | 129.99 |
| 3 | 2013-07-25 00:00:00.0 | 12111 | COMPLETE | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 68881 | 2014-07-19 00:00:00.0 | 2518 | PENDING_PAYMENT | 172194.0 | 403.0 | 1.0 | 129.99 | 129.99 |
| 68882 | 2014-07-22 00:00:00.0 | 10000 | ON_HOLD | 172195.0 | 365.0 | 1.0 | 59.99 | 59.99 |
| 68882 | 2014-07-22 00:00:00.0 | 10000 | ON_HOLD | 172196.0 | 502.0 | 1.0 | 50.00 | 50.00 |
| 68883 | 2014-07-23 00:00:00.0 | 5533 | COMPLETE | 172197.0 | 208.0 | 1.0 | 1999.99 | 1999.99 |


In [113]:
orders.set_index('order_id'). \
    join(order_items.set_index('order_item_order_id'), how='inner')

| | order_date | order_customer_id | order_status | order_item_id | order_item_product_id | order_item_quantity | order_item_subtotal | order_item_product_price |
|---|---|---|---|---|---|---|---|---|
| 1 | 2013-07-25 00:00:00.0 | 11599 | CLOSED | 1 | 957 | 1 | 299.98 | 299.98 |
| 2 | 2013-07-25 00:00:00.0 | 256 | PENDING_PAYMENT | 2 | 1073 | 1 | 199.99 | 199.99 |
| 2 | 2013-07-25 00:00:00.0 | 256 | PENDING_PAYMENT | 3 | 502 | 5 | 250.00 | 50.00 |
| 2 | 2013-07-25 00:00:00.0 | 256 | PENDING_PAYMENT | 4 | 403 | 1 | 129.99 | 129.99 |
| 4 | 2013-07-25 00:00:00.0 | 8827 | CLOSED | 5 | 897 | 2 | 49.98 | 24.99 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 68881 | 2014-07-19 00:00:00.0 | 2518 | PENDING_PAYMENT | 172194 | 403 | 1 | 129.99 | 129.99 |
| 68882 | 2014-07-22 00:00:00.0 | 10000 | ON_HOLD | 172195 | 365 | 1 | 59.99 | 59.99 |
| 68882 | 2014-07-22 00:00:00.0 | 10000 | ON_HOLD | 172196 | 502 | 1 | 50.00 | 50.00 |
| 68883 | 2014-07-23 00:00:00.0 | 5533 | COMPLETE | 172197 | 208 | 1 | 1999.99 | 1999.99 |
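Joins can also be expressed with pd.merge, which takes the join columns directly through left_on and right_on, so set_index is not needed. This is a sketch on made-up rows; order id 3 deliberately has no items so that the difference between inner and left joins is visible.

```python
import pandas as pd

# Hypothetical rows in the orders and order_items layouts.
orders_sample = pd.DataFrame({
    "order_id": [1, 2, 3],
    "order_status": ["CLOSED", "PENDING_PAYMENT", "COMPLETE"],
})
items_sample = pd.DataFrame({
    "order_item_id": [1, 2, 3],
    "order_item_order_id": [1, 2, 2],
    "order_item_subtotal": [299.98, 199.99, 250.0],
})

# Inner join keeps only orders that have at least one item.
inner = pd.merge(
    orders_sample, items_sample,
    left_on="order_id", right_on="order_item_order_id", how="inner",
)

# Left join keeps every order; item columns are NaN where nothing matches.
left = pd.merge(
    orders_sample, items_sample,
    left_on="order_id", right_on="order_item_order_id", how="left",
)

print(len(inner), len(left))
print(left[left["order_item_id"].isna()])
```

Unlike join on indexes, merge keeps both key columns (order_id and order_item_order_id) as regular columns in the result.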


* Compute Daily Revenue using orders.order_date and order_items.order_item_subtotal considering only COMPLETE and CLOSED orders.

In [None]:
# Compute Daily Revenue using
# orders.order_date and order_items.order_item_subtotal
# considering only COMPLETE and CLOSED orders.

import pandas as pd

In [120]:
# Reading orders
orders_path = "/Users/itversity/Research/data/retail_db/orders/part-00000.csv"
orders_schema = [
    "order_id",
    "order_date",
    "order_customer_id",
    "order_status"
]

orders = pd.read_csv(
    orders_path,
    header=None,
    names=orders_schema
)

In [121]:
# Reading order_items
order_items_path = "/Users/itversity/Research/data/retail_db/order_items/part-00000.csv"
order_items_schema = [
    "order_item_id",
    "order_item_order_id",
    "order_item_product_id",
    "order_item_quantity",
    "order_item_subtotal",
    "order_item_product_price"
]

order_items = pd.read_csv(
    order_items_path,
    header=None,
    names=order_items_schema
)

In [122]:
orders_considered = orders.query("order_status in ('COMPLETE', 'CLOSED')")

In [123]:
# Equivalent filter using a boolean mask built with isin
orders_filtered = orders[orders.order_status.isin(["COMPLETE", "CLOSED"])]

In [124]:
orders_considered. \
    set_index('order_id'). \
    join(order_items.set_index('order_item_order_id'), how='inner'). \
    groupby('order_date')['order_item_subtotal']. \
    agg(['sum']). \
    rename(columns={'sum': 'revenue'})

| order_date | revenue |
|---|---|
| 2013-07-25 00:00:00.0 | 31547.23 |
| 2013-07-26 00:00:00.0 | 54713.23 |
| 2013-07-27 00:00:00.0 | 48411.48 |
| 2013-07-28 00:00:00.0 | 35672.03 |
| 2013-07-29 00:00:00.0 | 54579.70 |
| ... | ... |
| 2014-07-20 00:00:00.0 | 60047.45 |
| 2014-07-21 00:00:00.0 | 51427.70 |
| 2014-07-22 00:00:00.0 | 36717.24 |
| 2014-07-23 00:00:00.0 | 38795.23 |


## Exercises
Here are some exercises related to Pandas.

* Get all the orders which belong to the month of August 2013.
* Get all the orders which belong to the months of August, September and October in 2013.
* Get the count of orders by status for the month of January 2014.
* Get all the records from orders where there are no corresponding records in order_items.
* Get all the customers who have not placed any orders.
* Get the revenue by status.