## Data Processing using Pandas
Let us understand how to process data using Pandas, a third-party Python library.


## Limitations of Collections
Let us understand some of the limitations of Python collections such as list, dict and tuple.
* There is no structure defined for the data.
* The code is often not readable, as fields are typically referred to by position.
* Some of the APIs are scattered across multiple Python modules or plugins.
* Pandas addresses these limitations with a robust set of APIs, where we can refer to columns by name and perform all standard transformations.
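To make the contrast concrete, here is a minimal sketch that reads the same field first from a list of tuples and then from a Data Frame. The three records are hypothetical sample values, not the actual dataset.

```python
import pandas as pd

# Hypothetical sample records:
# (order_id, order_date, order_customer_id, order_status)
records = [
    (1, "2013-07-25", 11599, "CLOSED"),
    (2, "2013-07-25", 256, "PENDING_PAYMENT"),
    (3, "2013-07-25", 12111, "COMPLETE"),
]

# With plain collections, the field is addressed by position
statuses_from_tuples = [rec[3] for rec in records]

# With a Data Frame, the same field is addressed by name
df = pd.DataFrame(
    records,
    columns=["order_id", "order_date", "order_customer_id", "order_status"],
)
statuses_from_df = df["order_status"].tolist()

print(statuses_from_tuples)
print(statuses_from_df)
```

Both approaches return the same values, but `df["order_status"]` states the intent while `rec[3]` requires the reader to remember the field order.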


## Overview of Pandas
Let us understand the details with respect to Pandas.
* Pandas is not a core Python module and hence we need to install it using pip (`pip install pandas`).
* It has 2 types of data structures - Series and DataFrame
* We can perform all standard transformations using Pandas APIs
* We also have SQL based wrappers on top of Pandas where we can write queries.


## Overview of Series
Let us quickly go through one of the Pandas data structures - Series.
* Pandas Series is a one-dimensional labeled array capable of holding any data type.
* It is similar to one column in an Excel spreadsheet or a database table.
* We can create a Series from a dict (the keys become the labels) or from a list (the labels default to 0, 1, 2, ...).

In [1]:
d = {"JAN": 10, "FEB": 15, "MAR": 12, "APR": 16}

In [2]:
import pandas as pd
s = pd.Series(d)

In [3]:
s

JAN    10
FEB    15
MAR    12
APR    16
dtype: int64

In [4]:
type(s)

pandas.core.series.Series

In [5]:
l = [10, 15, 12, 16]
pd.Series(l)

0    10
1    15
2    12
3    16
dtype: int64

In [6]:
s.count()

4

In [7]:
s.sum()

53

In [8]:
s.min()

10

In [9]:
s.max()

16
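Because a Series is labeled, individual values can be looked up by label as well as by position. The following is a small sketch using the same monthly values as above; the variable names `feb`, `first` and `above_12` are only illustrative.

```python
import pandas as pd

s = pd.Series({"JAN": 10, "FEB": 15, "MAR": 12, "APR": 16})

feb = s["FEB"]        # label-based lookup
first = s.iloc[0]     # position-based lookup
above_12 = s[s > 12]  # boolean filtering keeps the labels

print(feb, first)
print(above_12)
```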

* When we fetch only one column from a Pandas DataFrame, it is returned as a Series.

In [10]:
orders_path = "/data/retail_db/orders/part-00000"

In [11]:
orders_schema = [
  "order_id",
  "order_date",
  "order_customer_id",
  "order_status"
]

In [12]:
pd.read_csv?

[0;31mSignature:[0m
[0mpd[0m[0;34m.[0m[0mread_csv[0m[0;34m([0m[0;34m[0m
[0;34m[0m    [0mfilepath_or_buffer[0m[0;34m:[0m[0mUnion[0m[0;34m[[0m[0mstr[0m[0;34m,[0m [0mpathlib[0m[0;34m.[0m[0mPath[0m[0;34m,[0m [0mIO[0m[0;34m[[0m[0;34m~[0m[0mAnyStr[0m[0;34m][0m[0;34m][0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0msep[0m[0;34m=[0m[0;34m','[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mdelimiter[0m[0;34m=[0m[0;32mNone[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mheader[0m[0;34m=[0m[0;34m'infer'[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mnames[0m[0;34m=[0m[0;32mNone[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mindex_col[0m[0;34m=[0m[0;32mNone[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0musecols[0m[0;34m=[0m[0;32mNone[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0msqueeze[0m[0;34m=[0m[0;32mFalse[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mprefix[0m[0;34m=[0m[0;32mNone[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0m

In [13]:
orders = pd.read_csv(orders_path,
  header=None,
  names=orders_schema
)

In [14]:
type(orders)

pandas.core.frame.DataFrame

In [15]:
order_dates = orders.order_date

In [16]:
type(order_dates)

pandas.core.series.Series

In [17]:
# Preview Series
order_dates

0        2013-07-25 00:00:00.0
1        2013-07-25 00:00:00.0
2        2013-07-25 00:00:00.0
3        2013-07-25 00:00:00.0
4        2013-07-25 00:00:00.0
                 ...          
68878    2014-07-09 00:00:00.0
68879    2014-07-13 00:00:00.0
68880    2014-07-19 00:00:00.0
68881    2014-07-22 00:00:00.0
68882    2014-07-23 00:00:00.0
Name: order_date, Length: 68883, dtype: object

**Don’t worry too much about creating Data Frames yet; for now we are only trying to understand how Data Frame and Series are related.**

## Reading files into Data Frames
Let us see how we can create a Pandas Data Frame by reading data from files.
* read_csv is the most popular API to create a Data Frame by reading data from files.
* Here are some of the important options.
  * sep or delimiter
  * header or names
  * index_col
  * dtype
  * and many more
* There are several other APIs that can be used to create a Data Frame.
  * read_fwf
  * read_table
  * read_json (part of pandas.io.json)
  * and more
* Here is how we can create a Data Frame for the orders dataset.
  * The delimiter is the default one (comma).
  * There is no header in the file and hence we have to set the keyword argument header to None.
  * We can pass the column names as a list using names.
  * Data types of each column are typically inferred from the data. However, we can explicitly specify data types using dtype.
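The options listed above can be tried without access to the actual file. The sketch below is an illustration only: it reads two hypothetical records from an in-memory buffer (`io.StringIO`) instead of the orders file, and passes `sep`, `header`, `names` and `dtype` explicitly.

```python
import io
import pandas as pd

# Hypothetical in-memory sample standing in for the orders file
sample = io.StringIO(
    "1,2013-07-25 00:00:00.0,11599,CLOSED\n"
    "2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT\n"
)

orders_sample = pd.read_csv(
    sample,
    sep=",",        # the default, spelled out for clarity
    header=None,    # the data has no header line
    names=["order_id", "order_date", "order_customer_id", "order_status"],
    dtype={"order_id": "int64", "order_customer_id": "int64"},
)

print(orders_sample.dtypes)
```

Columns not mentioned in `dtype` are still inferred from the data.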


In [18]:
orders_path = "/data/retail_db/orders/part-00000"

In [19]:
orders_schema = [
  "order_id",
  "order_date",
  "order_customer_id",
  "order_status"
]

In [20]:
orders = pd.read_csv(orders_path,
  header=None,
  names=orders_schema
)

In [21]:
type(orders)

pandas.core.frame.DataFrame

In [22]:
# Preview Data Frame
orders

|  | order_id | order_date | order_customer_id | order_status |
|---|---|---|---|---|
| 0 | 1 | 2013-07-25 00:00:00.0 | 11599 | CLOSED |
| 1 | 2 | 2013-07-25 00:00:00.0 | 256 | PENDING_PAYMENT |
| 2 | 3 | 2013-07-25 00:00:00.0 | 12111 | COMPLETE |
| 3 | 4 | 2013-07-25 00:00:00.0 | 8827 | CLOSED |
| 4 | 5 | 2013-07-25 00:00:00.0 | 11318 | COMPLETE |
| ... | ... | ... | ... | ... |
| 68878 | 68879 | 2014-07-09 00:00:00.0 | 778 | COMPLETE |
| 68879 | 68880 | 2014-07-13 00:00:00.0 | 1117 | COMPLETE |
| 68880 | 68881 | 2014-07-19 00:00:00.0 | 2518 | PENDING_PAYMENT |
| 68881 | 68882 | 2014-07-22 00:00:00.0 | 10000 | ON_HOLD |


## Standard Transformations
Let us see some of the standard transformations that can be performed using Data Frame APIs.
* Projection
* Filtering
* Aggregations
* Joins

Let us see some examples of standard transformations using Pandas APIs. Before that, let us read both orders and order_items into Pandas Data Frames.

* Read the order_items data using the following column names, in the same order. We will later project order_item_order_id and order_item_subtotal.
  * order_item_id
  * order_item_order_id
  * order_item_product_id
  * order_item_quantity
  * order_item_subtotal
  * order_item_product_price

In [23]:
import pandas as pd

In [24]:
# Reading orders
orders_path = "/data/retail_db/orders/part-00000"

In [25]:
orders_schema = [
    "order_id",
    "order_date",
    "order_customer_id",
    "order_status"
]

In [26]:
orders = pd.read_csv(
    orders_path,
    header=None,
    names=orders_schema
)

In [27]:
# Reading order_items
order_items_path = "/data/retail_db/order_items/part-00000"

In [28]:
order_items_schema = [
    "order_item_id",
    "order_item_order_id",
    "order_item_product_id",
    "order_item_quantity",
    "order_item_subtotal",
    "order_item_product_price"
]

In [29]:
order_items = pd.read_csv(
    order_items_path,
    header=None,
    names=order_items_schema
)

## Projection and Filtering Data

Let us understand how to project as well as filter data in Data Frames.

* Projecting data

In [30]:
orders.order_date

0        2013-07-25 00:00:00.0
1        2013-07-25 00:00:00.0
2        2013-07-25 00:00:00.0
3        2013-07-25 00:00:00.0
4        2013-07-25 00:00:00.0
                 ...          
68878    2014-07-09 00:00:00.0
68879    2014-07-13 00:00:00.0
68880    2014-07-19 00:00:00.0
68881    2014-07-22 00:00:00.0
68882    2014-07-23 00:00:00.0
Name: order_date, Length: 68883, dtype: object

In [31]:
orders['order_date']

0        2013-07-25 00:00:00.0
1        2013-07-25 00:00:00.0
2        2013-07-25 00:00:00.0
3        2013-07-25 00:00:00.0
4        2013-07-25 00:00:00.0
                 ...          
68878    2014-07-09 00:00:00.0
68879    2014-07-13 00:00:00.0
68880    2014-07-19 00:00:00.0
68881    2014-07-22 00:00:00.0
68882    2014-07-23 00:00:00.0
Name: order_date, Length: 68883, dtype: object

In [32]:
# Project order_item_order_id and order_item_subtotal
order_items[["order_item_order_id", "order_item_subtotal"]]

|  | order_item_order_id | order_item_subtotal |
|---|---|---|
| 0 | 1 | 299.98 |
| 1 | 2 | 199.99 |
| 2 | 2 | 250.00 |
| 3 | 2 | 129.99 |
| 4 | 4 | 49.98 |
| ... | ... | ... |
| 172193 | 68881 | 129.99 |
| 172194 | 68882 | 59.99 |
| 172195 | 68882 | 50.00 |
| 172196 | 68883 | 1999.99 |


* Filter for order_item_order_id 2

In [33]:
order_items[order_items.order_item_order_id == 2]

|  | order_item_id | order_item_order_id | order_item_product_id | order_item_quantity | order_item_subtotal | order_item_product_price |
|---|---|---|---|---|---|---|
| 1 | 2 | 2 | 1073 | 1 | 199.99 | 199.99 |
| 2 | 3 | 2 | 502 | 5 | 250.0 | 50.0 |
| 3 | 4 | 2 | 403 | 1 | 129.99 | 129.99 |


In [34]:
order_items[order_items["order_item_order_id"] == 2]

|  | order_item_id | order_item_order_id | order_item_product_id | order_item_quantity | order_item_subtotal | order_item_product_price |
|---|---|---|---|---|---|---|
| 1 | 2 | 2 | 1073 | 1 | 199.99 | 199.99 |
| 2 | 3 | 2 | 502 | 5 | 250.0 | 50.0 |
| 3 | 4 | 2 | 403 | 1 | 129.99 | 129.99 |


In [35]:
order_items.query('order_item_order_id == 2')

|  | order_item_id | order_item_order_id | order_item_product_id | order_item_quantity | order_item_subtotal | order_item_product_price |
|---|---|---|---|---|---|---|
| 1 | 2 | 2 | 1073 | 1 | 199.99 | 199.99 |
| 2 | 3 | 2 | 502 | 5 | 250.0 | 50.0 |
| 3 | 4 | 2 | 403 | 1 | 129.99 | 129.99 |


* Filter for order_item_order_id 2 and order_item_subtotal between 150 and 250 (both inclusive)

In [37]:
order_items[(order_items.order_item_order_id == 2) & 
            ((order_items.order_item_subtotal >= 150) & (order_items.order_item_subtotal <= 250))]

|  | order_item_id | order_item_order_id | order_item_product_id | order_item_quantity | order_item_subtotal | order_item_product_price |
|---|---|---|---|---|---|---|
| 1 | 2 | 2 | 1073 | 1 | 199.99 | 199.99 |
| 2 | 3 | 2 | 502 | 5 | 250.0 | 50.0 |


In [38]:
order_items.query('order_item_order_id == 2 and ' +
                  'order_item_subtotal >= 150 and ' +
                  'order_item_subtotal <= 250'
                 )

|  | order_item_id | order_item_order_id | order_item_product_id | order_item_quantity | order_item_subtotal | order_item_product_price |
|---|---|---|---|---|---|---|
| 1 | 2 | 2 | 1073 | 1 | 199.99 | 199.99 |
| 2 | 3 | 2 | 502 | 5 | 250.0 | 50.0 |
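The range condition can also be written with `Series.between`, which includes both bounds by default. This is a sketch of an alternative spelling, not the approach used in the cells above; the small `items` Data Frame is a hypothetical sample that mirrors the subtotals of order 2.

```python
import pandas as pd

# Hypothetical sample: the three items of order 2 plus one item of order 4
items = pd.DataFrame({
    "order_item_order_id": [2, 2, 2, 4],
    "order_item_subtotal": [199.99, 250.00, 129.99, 49.98],
})

# between(150, 250) is equivalent to (x >= 150) & (x <= 250)
mask = (items["order_item_order_id"] == 2) & \
    items["order_item_subtotal"].between(150, 250)
filtered = items[mask]

print(filtered)
```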


* Filter for orders placed on August 1st, 2013

In [39]:
orders[orders.order_date.str.startswith('2013-08-01')]

|  | order_id | order_date | order_customer_id | order_status |
|---|---|---|---|---|
| 1296 | 1297 | 2013-08-01 00:00:00.0 | 11607 | COMPLETE |
| 1297 | 1298 | 2013-08-01 00:00:00.0 | 5105 | CLOSED |
| 1298 | 1299 | 2013-08-01 00:00:00.0 | 7802 | COMPLETE |
| 1299 | 1300 | 2013-08-01 00:00:00.0 | 553 | PENDING_PAYMENT |
| 1300 | 1301 | 2013-08-01 00:00:00.0 | 1604 | PENDING_PAYMENT |
| ... | ... | ... | ... | ... |
| 57959 | 57960 | 2013-08-01 00:00:00.0 | 10177 | PENDING |
| 57960 | 57961 | 2013-08-01 00:00:00.0 | 835 | COMPLETE |
| 57961 | 57962 | 2013-08-01 00:00:00.0 | 10521 | PENDING_PAYMENT |
| 67446 | 67447 | 2013-08-01 00:00:00.0 | 8956 | COMPLETE |


In [40]:
orders.query('order_date.str.startswith("2013-08-01")', engine='python')

|  | order_id | order_date | order_customer_id | order_status |
|---|---|---|---|---|
| 1296 | 1297 | 2013-08-01 00:00:00.0 | 11607 | COMPLETE |
| 1297 | 1298 | 2013-08-01 00:00:00.0 | 5105 | CLOSED |
| 1298 | 1299 | 2013-08-01 00:00:00.0 | 7802 | COMPLETE |
| 1299 | 1300 | 2013-08-01 00:00:00.0 | 553 | PENDING_PAYMENT |
| 1300 | 1301 | 2013-08-01 00:00:00.0 | 1604 | PENDING_PAYMENT |
| ... | ... | ... | ... | ... |
| 57959 | 57960 | 2013-08-01 00:00:00.0 | 10177 | PENDING |
| 57960 | 57961 | 2013-08-01 00:00:00.0 | 835 | COMPLETE |
| 57961 | 57962 | 2013-08-01 00:00:00.0 | 10521 | PENDING_PAYMENT |
| 67446 | 67447 | 2013-08-01 00:00:00.0 | 8956 | COMPLETE |
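The filters above treat order_date as a string, since its dtype is object. An alternative, sketched below on a hypothetical three-row sample, is to parse the column with `pd.to_datetime` and compare dates instead of string prefixes. Whether this is preferable depends on the use case; it is shown only as an option.

```python
import pandas as pd

# Hypothetical sample with the same date format as the orders data
orders_sample = pd.DataFrame({
    "order_id": [1, 1297, 1298],
    "order_date": [
        "2013-07-25 00:00:00.0",
        "2013-08-01 00:00:00.0",
        "2013-08-01 00:00:00.0",
    ],
})

# Parse the strings into timestamps
order_ts = pd.to_datetime(orders_sample["order_date"])

# normalize() resets the time part to midnight, so only the date is compared
on_aug_1 = orders_sample[order_ts.dt.normalize() == pd.Timestamp("2013-08-01")]

print(on_aug_1)
```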


## Aggregations

Let us understand how to perform aggregations using Pandas. There are 2 types of aggregations.
* Global Aggregations
* By key Aggregations

### Global Aggregations

There are several global aggregations that can be performed.

* Getting the number of records (rows) and columns in the Data Frame.

In [43]:
orders.shape

(68883, 4)

* Getting the number of non-null (non np.NaN) values in each column of a Data Frame

In [44]:
orders.count()

order_id             68883
order_date           68883
order_customer_id    68883
order_status         68883
dtype: int64

* Getting basic statistics of numeric fields of a Data Frame

In [45]:
orders.describe()

|  | order_id | order_customer_id |
|---|---|---|
| count | 68883.0 | 68883.0 |
| mean | 34442.0 | 6216.571099 |
| std | 19884.953633 | 3586.205241 |
| min | 1.0 | 1.0 |
| 25% | 17221.5 | 3122.0 |
| 50% | 34442.0 | 6199.0 |
| 75% | 51662.5 | 9326.0 |
| max | 68883.0 | 12435.0 |


* Get the revenue for order id 2 from order_items

In [46]:
order_items[order_items.order_item_order_id == 2].order_item_subtotal.sum()

579.98
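Several global aggregations can be computed in one call by passing a list of function names to `agg`. The sketch below uses a hypothetical four-row sample in which order 2 has the same subtotals as in the data above.

```python
import pandas as pd

# Hypothetical sample of order items
items = pd.DataFrame({
    "order_item_order_id": [1, 2, 2, 2],
    "order_item_subtotal": [299.98, 199.99, 250.00, 129.99],
})

order_2 = items[items["order_item_order_id"] == 2]["order_item_subtotal"]

# The result is a Series indexed by the function names
summary = order_2.agg(["count", "sum", "min", "max"])

print(summary)
```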

### By Key Aggregations

By Key Aggregations are those which are computed per key. Here are some of the examples.

* Getting number of orders per day
* Getting number of orders per status
* Computing revenue per order

In [47]:
## Getting number of orders per day
orders.groupby(orders.order_date).count()
## This gives count of each and every field by default

| order_date | order_id | order_customer_id | order_status |
|---|---|---|---|
| 2013-07-25 00:00:00.0 | 143 | 143 | 143 |
| 2013-07-26 00:00:00.0 | 269 | 269 | 269 |
| 2013-07-27 00:00:00.0 | 202 | 202 | 202 |
| 2013-07-28 00:00:00.0 | 187 | 187 | 187 |
| 2013-07-29 00:00:00.0 | 253 | 253 | 253 |
| ... | ... | ... | ... |
| 2014-07-20 00:00:00.0 | 285 | 285 | 285 |
| 2014-07-21 00:00:00.0 | 235 | 235 | 235 |
| 2014-07-22 00:00:00.0 | 138 | 138 | 138 |
| 2014-07-23 00:00:00.0 | 166 | 166 | 166 |


In [48]:
orders.groupby(orders.order_date)['order_status'].count()

order_date
2013-07-25 00:00:00.0    143
2013-07-26 00:00:00.0    269
2013-07-27 00:00:00.0    202
2013-07-28 00:00:00.0    187
2013-07-29 00:00:00.0    253
                        ... 
2014-07-20 00:00:00.0    285
2014-07-21 00:00:00.0    235
2014-07-22 00:00:00.0    138
2014-07-23 00:00:00.0    166
2014-07-24 00:00:00.0    185
Name: order_status, Length: 364, dtype: int64

In [50]:
## Getting number of orders per status
orders.groupby(orders.order_status)['order_status'].count()

order_status
CANCELED            1428
CLOSED              7556
COMPLETE           22899
ON_HOLD             3798
PAYMENT_REVIEW       729
PENDING             7610
PENDING_PAYMENT    15030
PROCESSING          8275
SUSPECTED_FRAUD     1558
Name: order_status, dtype: int64
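For a simple count per key, `Series.value_counts` gives the same numbers as the groupby above. The sketch below compares the two on a hypothetical five-row sample; `sort_index()` is applied because `value_counts` orders by frequency rather than by key.

```python
import pandas as pd

# Hypothetical sample of orders
orders_sample = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5],
    "order_status": ["CLOSED", "PENDING_PAYMENT", "COMPLETE", "CLOSED", "COMPLETE"],
})

by_groupby = orders_sample.groupby("order_status")["order_status"].count()
by_value_counts = orders_sample["order_status"].value_counts().sort_index()

print(by_groupby)
print(by_value_counts)
```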

In [52]:
orders. \
    groupby(orders.order_status)['order_status']. \
    agg(['count', 'min', 'max']). \
    rename(columns={'count': 'order_count'})

| order_status | order_count | min | max |
|---|---|---|---|
| CANCELED | 1428 | CANCELED | CANCELED |
| CLOSED | 7556 | CLOSED | CLOSED |
| COMPLETE | 22899 | COMPLETE | COMPLETE |
| ON_HOLD | 3798 | ON_HOLD | ON_HOLD |
| PAYMENT_REVIEW | 729 | PAYMENT_REVIEW | PAYMENT_REVIEW |
| PENDING | 7610 | PENDING | PENDING |
| PENDING_PAYMENT | 15030 | PENDING_PAYMENT | PENDING_PAYMENT |
| PROCESSING | 8275 | PROCESSING | PROCESSING |
| SUSPECTED_FRAUD | 1558 | SUSPECTED_FRAUD | SUSPECTED_FRAUD |


In [53]:
## Computing revenue per order
order_items. \
    groupby('order_item_order_id')['order_item_subtotal']. \
    agg(['sum']). \
    rename(columns={'sum': 'revenue'})

| order_item_order_id | revenue |
|---|---|
| 1 | 299.98 |
| 2 | 579.98 |
| 4 | 699.85 |
| 5 | 1129.86 |
| 7 | 579.92 |
| ... | ... |
| 68879 | 1259.97 |
| 68880 | 999.77 |
| 68881 | 129.99 |
| 68882 | 109.99 |
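Joins are listed among the standard transformations but are not demonstrated in the cells above. As a minimal sketch, `pd.merge` can join orders with order_items on the order id and the result can then be aggregated. The two small Data Frames are hypothetical samples, and aggregating revenue by status is only an illustrative choice.

```python
import pandas as pd

# Hypothetical samples of orders and order items
orders_sample = pd.DataFrame({
    "order_id": [1, 2, 4],
    "order_status": ["CLOSED", "PENDING_PAYMENT", "CLOSED"],
})
items = pd.DataFrame({
    "order_item_order_id": [1, 2, 2, 2, 4],
    "order_item_subtotal": [299.98, 199.99, 250.00, 129.99, 49.98],
})

# Inner join: keep only orders that have matching order items
joined = pd.merge(
    orders_sample,
    items,
    left_on="order_id",
    right_on="order_item_order_id",
    how="inner",
)

# Revenue per order status, computed on the joined data
revenue_by_status = (
    joined.groupby("order_status")["order_item_subtotal"].sum().round(2)
)

print(revenue_by_status)
```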


## Writing to files

Pandas also provides simple APIs to write the data back to files.

* Let us write the revenue per order along with order_id to a file.

In [56]:
order_items.to_csv?

[0;31mSignature:[0m
[0morder_items[0m[0;34m.[0m[0mto_csv[0m[0;34m([0m[0;34m[0m
[0;34m[0m    [0mpath_or_buf[0m[0;34m:[0m[0mUnion[0m[0;34m[[0m[0mstr[0m[0;34m,[0m [0mpathlib[0m[0;34m.[0m[0mPath[0m[0;34m,[0m [0mIO[0m[0;34m[[0m[0;34m~[0m[0mAnyStr[0m[0;34m][0m[0;34m,[0m [0mNoneType[0m[0;34m][0m[0;34m=[0m[0;32mNone[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0msep[0m[0;34m:[0m[0mstr[0m[0;34m=[0m[0;34m','[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mna_rep[0m[0;34m:[0m[0mstr[0m[0;34m=[0m[0;34m''[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mfloat_format[0m[0;34m:[0m[0mUnion[0m[0;34m[[0m[0mstr[0m[0;34m,[0m [0mNoneType[0m[0;34m][0m[0;34m=[0m[0;32mNone[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mcolumns[0m[0;34m:[0m[0mUnion[0m[0;34m[[0m[0mSequence[0m[0;34m[[0m[0mcollections[0m[0;34m.[0m[0mabc[0m[0;34m.[0m[0mHashable[0m[0;34m][0m[0;34m,[0m [0mNoneType[0m[0;34m][0m[0;34m=[0m[0;3

In [63]:
import getpass

username = getpass.getuser()

username

'training'

In [66]:
import os
os.system(f'mkdir -p /home/{username}/retail_db')
os.system(f'ls -ltr /home/{username}/retail_db')

0
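The directory can also be created without shelling out, using `os.makedirs` with `exist_ok=True`, which behaves like `mkdir -p`. In the sketch below a temporary directory stands in for the home directory so that it runs anywhere; `base_dir` and `target_dir` are illustrative names.

```python
import os
import tempfile

# A temporary directory stands in for /home/{username}
base_dir = tempfile.mkdtemp()
target_dir = os.path.join(base_dir, "retail_db")

os.makedirs(target_dir, exist_ok=True)  # creates the directory
os.makedirs(target_dir, exist_ok=True)  # calling it again does not fail

print(os.listdir(base_dir))
```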

In [71]:
order_items. \
    groupby('order_item_order_id')['order_item_subtotal']. \
    agg(['sum']). \
    rename(columns={'sum': 'revenue'}). \
    round(2). \
    to_csv(f'/home/{username}/retail_db/order_revenue.csv')

In [76]:
order_items. \
    groupby('order_item_order_id')['order_item_subtotal']. \
    agg(['sum']). \
    rename(columns={'sum': 'revenue'}). \
    round(2). \
    to_json(f'/home/{username}/retail_db/order_revenue.json', orient='table')
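One way to check what was written is to read the file back. The sketch below is an illustration under assumptions: it uses a hypothetical four-row sample and a temporary directory instead of the home directory. Since `to_csv` writes the index (order_item_order_id) as the first column, `index_col` is used to restore it.

```python
import os
import tempfile
import pandas as pd

# Hypothetical sample of order items
items = pd.DataFrame({
    "order_item_order_id": [1, 2, 2, 2],
    "order_item_subtotal": [299.98, 199.99, 250.00, 129.99],
})

revenue = items. \
    groupby("order_item_order_id")["order_item_subtotal"]. \
    agg(["sum"]). \
    rename(columns={"sum": "revenue"}). \
    round(2)

# Write to a temporary location and read the file back
out_path = os.path.join(tempfile.mkdtemp(), "order_revenue.csv")
revenue.to_csv(out_path)

restored = pd.read_csv(out_path, index_col="order_item_order_id")

print(restored)
```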

In [74]:
order_items.to_json?

[0;31mSignature:[0m
[0morder_items[0m[0;34m.[0m[0mto_json[0m[0;34m([0m[0;34m[0m
[0;34m[0m    [0mpath_or_buf[0m[0;34m:[0m[0mUnion[0m[0;34m[[0m[0mstr[0m[0;34m,[0m [0mpathlib[0m[0;34m.[0m[0mPath[0m[0;34m,[0m [0mIO[0m[0;34m[[0m[0;34m~[0m[0mAnyStr[0m[0;34m][0m[0;34m,[0m [0mNoneType[0m[0;34m][0m[0;34m=[0m[0;32mNone[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0morient[0m[0;34m:[0m[0mUnion[0m[0;34m[[0m[0mstr[0m[0;34m,[0m [0mNoneType[0m[0;34m][0m[0;34m=[0m[0;32mNone[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mdate_format[0m[0;34m:[0m[0mUnion[0m[0;34m[[0m[0mstr[0m[0;34m,[0m [0mNoneType[0m[0;34m][0m[0;34m=[0m[0;32mNone[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mdouble_precision[0m[0;34m:[0m[0mint[0m[0;34m=[0m[0;36m10[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mforce_ascii[0m[0;34m:[0m[0mbool[0m[0;34m=[0m[0;32mTrue[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mdate_unit[0m[0;34m:[0m[

In [77]:
import platform
print(platform.python_version())

3.6.8
