# Data exploration and cleaning

In [27]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Data loading and exploration

In [28]:
orders = pd.read_csv('../00.Data/orders_cripted.csv')

In [29]:
orders.head()

Unnamed: 0,Name,Financial Status,Paid at,Fulfillment Status,Fulfilled at,Accepts Marketing,Currency,Subtotal,Shipping,Taxes,Total,Discount Code,Discount Amount,Shipping Method,Created at,Lineitem quantity,Lineitem name,Lineitem price,Lineitem compare at price,Lineitem sku,Lineitem requires shipping,Lineitem taxable,Lineitem fulfillment status,Billing City,Billing Zip,Billing Country,Notes,Note Attributes,Cancelled at,Payment Method,Payment Reference,Refunded Amount,Vendor,Id,Tags,Risk Level,Source,Lineitem discount,Tax 1 Name,Tax 1 Value,Tax 2 Name,Tax 2 Value,Tax 3 Name,Tax 3 Value,Tax 4 Name,Tax 4 Value,Tax 5 Name,Tax 5 Value,Receipt Number
0,#1244,paid,2019-04-15 09:51:49 +0200,unfulfilled,,yes,EUR,45.0,4.9,0.0,49.9,,0.0,Envío estándar,2019-04-15 09:51:49 +0200,1,Cesta de temporada sin pescado (1/2 pensión) - 2,45.0,,,True,False,pending,Barcelona,'08032,ES,,Fecha de entrega: 16/04/2019\nDía de la semana...,,Stripe,c2959210184748.1,0.0,Mercat a Casa,891010000000.0,,Low,web,0,,,,,,,,,,,
1,#1243,paid,2019-04-11 23:35:23 +0200,fulfilled,2019-04-11 23:43:20 +0200,yes,EUR,40.1,4.9,0.0,45.0,,0.0,Envío estándar,2019-04-11 23:35:23 +0200,2,Hamburguesa de cebolla queso y huevo (2 uds.),3.8,,361.0,True,False,fulfilled,Barcelona,'08021,ES,,Fecha de entrega: 12/04/2019\nDía de la semana...,,Stripe,c2946921922604.1,0.0,Carns Ruano,886963000000.0,,Low,web,0,,,,,,,,,,,
2,#1243,,,,,,,,,,,,,,2019-04-11 23:35:23 +0200,1,TEM - QA - 17 - 2,32.5,,,True,False,fulfilled,,,,,,,,,,Mercat a Casa,,,,,0,,,,,,,,,,,
3,#1242,refunded,2019-04-11 23:21:35 +0200,unfulfilled,,yes,EUR,40.1,4.9,0.0,45.0,,0.0,Envío estándar,2019-04-11 23:21:34 +0200,2,Hamburguesa de cebolla queso y huevo (2 uds.),3.8,,361.0,True,False,pending,Barcelona,'08021,ES,,Fecha de entrega: 12/04/2019\nDía de la semana...,2019-04-11 23:32:02 +0200,Stripe,c2946859925548.1,45.0,Carns Ruano,886948000000.0,,Low,web,0,,,,,,,,,,,
4,#1242,,,,,,,,,,,,,,2019-04-11 23:21:34 +0200,1,TEM - QA - 17 - 2,32.5,,,True,False,pending,,,,,,,,,,Mercat a Casa,,,,,0,,,,,,,,,,,


There appear to be many NaN values. In particular, the 'Tax N Name' and 'Tax N Value' columns look empty, and it is not yet clear what they represent. Let's count the NaN values per column and compare them with the total number of rows.

In [30]:
orders.shape

(2415, 49)

In [31]:
orders.isna().sum()

Name                              0
Financial Status               2178
Paid at                        2258
Fulfillment Status             2178
Fulfilled at                   2208
Accepts Marketing              2178
Currency                       2178
Subtotal                       2178
Shipping                       2178
Taxes                          2178
Total                          2178
Discount Code                  2378
Discount Amount                2178
Shipping Method                2188
Created at                        0
Lineitem quantity                 0
Lineitem name                     0
Lineitem price                    0
Lineitem compare at price      2392
Lineitem sku                     79
Lineitem requires shipping        0
Lineitem taxable                  0
Lineitem fulfillment status       0
Billing City                   2186
Billing Zip                    2186
Billing Country                2185
Notes                          2362
Note Attributes             
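
To relate the NaN counts to the total number of rows, it can help to look at the share of missing values per column. The sketch below is a minimal illustration on a toy frame; `nan_summary` and `toy` are hypothetical names that are not part of the original notebook, and the values are made up for demonstration only.

```python
import numpy as np
import pandas as pd

def nan_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the NaN count and NaN share per column, emptiest columns first."""
    summary = pd.DataFrame({
        "n_missing": df.isna().sum(),
        "share_missing": df.isna().mean().round(3),
    })
    return summary.sort_values("share_missing", ascending=False)

# Toy frame standing in for `orders` (illustrative values only).
toy = pd.DataFrame({
    "Name": ["#1", "#1", "#2", "#3"],
    "Total": [10.0, np.nan, 5.0, 7.5],
    "Receipt Number": [np.nan, np.nan, np.nan, np.nan],
})
print(nan_summary(toy))
```

On the real data, the same function could be called as `nan_summary(orders)`.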

OK, so we have plenty of NaNs. We'll drop every 'Tax' column except 'Tax 1 Name' and 'Tax 1 Value', since the others contain no data. 'Receipt Number' is also empty, so we'll drop it too.

In [32]:
orders.drop(columns=['Tax 2 Name', 'Tax 2 Value', 'Tax 3 Name', 'Tax 3 Value', 'Tax 4 Name', 'Tax 4 Value', 'Tax 5 Name', 'Tax 5 Value', 'Receipt Number'], inplace=True)
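
Instead of listing the empty columns by hand, they could also be detected programmatically. This is only a sketch on toy data, assuming "empty" means that every value in the column is NaN; `fully_empty_columns` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

def fully_empty_columns(df: pd.DataFrame) -> list:
    """List the columns in which every value is NaN."""
    mask = df.isna().all()
    return mask[mask].index.tolist()

# Toy frame standing in for `orders` (illustrative values only).
toy = pd.DataFrame({
    "Name": ["#1", "#1", "#2"],
    "Tax 1 Value": [np.nan, 0.5, np.nan],      # sparse but not empty: kept
    "Tax 2 Value": [np.nan, np.nan, np.nan],   # fully empty: dropped
    "Receipt Number": [np.nan, np.nan, np.nan],
})
empty_cols = fully_empty_columns(toy)
trimmed = toy.drop(columns=empty_cols)
print(empty_cols)
```

A sparse column such as 'Tax 1 Value' survives this filter, which matches the decision to keep the 'Tax 1' pair for now.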

Each row records part of an order, but there is no unique identifier per row, and the 'Id' column is mostly empty.
There are two possible scenarios:

1) 'Id' is the order id and we only have 2415 - 2178 = 237 orders

2) All the rows are actually orders but there is no id.

Scenario 1 is the more likely one. To rule out scenario 2, we group by 'Created at', which is never null: if two rows were created at exactly the same time, we can assume they belong to the same order.

In [33]:
orders.groupby('Created at').count().shape

(237, 39)

There are 237 distinct timestamps, which means we actually have 237 orders; the rows with empty order fields hold further details of the same order. In fact, 'Name' appears to be the order ID.

In [34]:
orders.groupby('Name').count().shape

(237, 39)
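
Both groupings yield 237 groups, but equal group counts alone do not prove that 'Name' and 'Created at' split the rows in the same way. A stricter check is to confirm that the two columns map one-to-one. The helper `is_one_to_one` and the toy frames below are hypothetical and for illustration only.

```python
import pandas as pd

def is_one_to_one(df: pd.DataFrame, a: str, b: str) -> bool:
    """True if each value of `a` pairs with exactly one value of `b`, and vice versa."""
    pairs = df[[a, b]].drop_duplicates()
    return bool(pairs[a].is_unique and pairs[b].is_unique)

# Toy data: order #1 has two rows sharing one timestamp.
toy_ok = pd.DataFrame({
    "Name": ["#1", "#1", "#2"],
    "Created at": ["2019-01-01 10:00:00", "2019-01-01 10:00:00", "2019-01-02 09:30:00"],
})
# Toy data: two different orders created at the same second.
toy_clash = pd.DataFrame({
    "Name": ["#1", "#2"],
    "Created at": ["2019-01-01 10:00:00", "2019-01-01 10:00:00"],
})
print(is_one_to_one(toy_ok, "Name", "Created at"))
print(is_one_to_one(toy_clash, "Name", "Created at"))
```

On the real data the call would be `is_one_to_one(orders, 'Name', 'Created at')`.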

In [35]:
orders['Name'].value_counts().head()

#1164    34
#1171    33
#1051    32
#1153    30
#1009    30
Name: Name, dtype: int64

In [36]:
orders['Name'].value_counts().tail()

#1203    1
#1195    1
#1019    1
#1220    1
#1211    1
Name: Name, dtype: int64

So some orders span 34 rows while others have just 1. Let's look at order #1164 to understand what kind of information the additional rows contain.

In [37]:
orders[orders['Name']=='#1164']

Unnamed: 0,Name,Financial Status,Paid at,Fulfillment Status,Fulfilled at,Accepts Marketing,Currency,Subtotal,Shipping,Taxes,Total,Discount Code,Discount Amount,Shipping Method,Created at,Lineitem quantity,Lineitem name,Lineitem price,Lineitem compare at price,Lineitem sku,Lineitem requires shipping,Lineitem taxable,Lineitem fulfillment status,Billing City,Billing Zip,Billing Country,Notes,Note Attributes,Cancelled at,Payment Method,Payment Reference,Refunded Amount,Vendor,Id,Tags,Risk Level,Source,Lineitem discount,Tax 1 Name,Tax 1 Value
266,#1164,paid,2019-01-15 22:58:00 +0100,fulfilled,2019-01-15 23:46:53 +0100,yes,EUR,105.1,0.0,0.0,105.1,,0.0,Envío gratuito,2019-01-15 22:57:57 +0100,1,Jamón serrano cebo - 150 grs.,8.25,,398.0,True,False,fulfilled,Barcelona,'08006,ES,,"streamthing_delivery_date: January 16th, 2019\...",,Stripe,c2584919212076.1,0.0,Xarcuteria Alonso Andrés,785876000000.0,,Low,web,0,,
267,#1164,,,,,,,,,,,,,,2019-01-15 22:57:57 +0100,1,Jamón dulce - 150 grs.,2.7,,395.0,True,False,fulfilled,,,,,,,,,,Xarcuteria Alonso Andrés,,,,,0,,
268,#1164,,,,,,,,,,,,,,2019-01-15 22:57:57 +0100,1,Tomate Monterosa - 500 grs.,1.95,,351.0,True,False,fulfilled,,,,,,,,,,Fruites i Verdures Rovira,,,,,0,,
269,#1164,,,,,,,,,,,,,,2019-01-15 22:57:57 +0100,1,Tomate de untar (300 grs),1.8,,354.0,True,False,fulfilled,,,,,,,,,,Fruites i Verdures Rovira,,,,,0,,
270,#1164,,,,,,,,,,,,,,2019-01-15 22:57:57 +0100,1,Puerro - 1 Kg. (2/3 uds.),2.8,,333.0,True,False,fulfilled,,,,,,,,,,Fruites i Verdures Rovira,,,,,0,,
271,#1164,,,,,,,,,,,,,,2019-01-15 22:57:57 +0100,1,Pimientos Rojo - 500 grs. (1/2 uds.),1.45,,158.0,True,False,fulfilled,,,,,,,,,,Fruites i Verdures Rovira,,,,,0,,
272,#1164,,,,,,,,,,,,,,2019-01-15 22:57:57 +0100,1,Patata Kennebec - 1 Kg.,1.4,,155.0,True,False,fulfilled,,,,,,,,,,Fruites i Verdures Rovira,,,,,0,,
273,#1164,,,,,,,,,,,,,,2019-01-15 22:57:57 +0100,1,Judía Perona fina - 500 grs.,3.9,,149.0,True,False,fulfilled,,,,,,,,,,Fruites i Verdures Rovira,,,,,0,,
274,#1164,,,,,,,,,,,,,,2019-01-15 22:57:57 +0100,1,Jengibre - 100 grs.,0.5,,342.0,True,False,fulfilled,,,,,,,,,,Fruites i Verdures Rovira,,,,,0,,
275,#1164,,,,,,,,,,,,,,2019-01-15 22:57:57 +0100,1,Espinacas - Bolsa 300 grs.,2.0,,148.0,True,False,fulfilled,,,,,,,,,,Fruites i Verdures Rovira,,,,,0,,


We can see that there is unique information for the first row. This would be the general order information, which includes: 

    order_specific = ['Name', 'Financial Status', 'Paid at', 'Fulfillment Status', 'Fulfilled at', 'Accepts Marketing', 'Currency', 'Subtotal', 'Shipping', 'Taxes', 'Total', 'Discount Amount', 'Shipping Method', 'Billing City', 'Billing Zip', 'Billing Country', 'Note Attributes', 'Payment Method', 'Payment Reference', 'Refunded Amount', 'Id', 'Risk Level', 'Source']

Other variables are clearly item-specific:

    item_specific = ['Lineitem quantity', 'Lineitem name', 'Lineitem price', 'Lineitem sku', 'Lineitem requires shipping', 'Lineitem taxable', 'Lineitem fulfillment status']
    
    
Some variables could be either order-specific or item-specific. This needs to be confirmed, because in this order they are NaN on every row.

    uncertain_order = ['Discount Code','Notes', 'Cancelled at', 'Tags', 'Tax 1 Name', 'Tax 1 Value']
    uncertain_item = ['Lineitem compare at price']

We will now split the DataFrame as follows:
* order_specific information
* item_specific information

To do that we need to:

1) Classify the uncertain variables

2) Link both tables through some sort of order id

### Classify the uncertain variables

In [39]:
orders[['Discount Code','Notes', 'Cancelled at', 'Tags', 'Tax 1 Name', 'Tax 1 Value']].notna().sum()

Discount Code    37
Notes            53
Cancelled at     29
Tags              4
Tax 1 Name        6
Tax 1 Value       6
dtype: int64

In [26]:
orders[orders['Discount Code'].notna()]['Name'].value_counts().head()

#1126    1
#1186    1
#1075    1
#1189    1
#1204    1
Name: Name, dtype: int64

'Discount Code' appears on at most one row per order, so it is order-specific. We'll run the same check on the remaining variables.

In [46]:
orders[orders['Notes'].notna()]['Name'].value_counts().head()

#1021    1
#1182    1
#1034    1
#1042    1
#1184    1
Name: Name, dtype: int64

In [48]:
orders[orders['Cancelled at'].notna()]['Name'].value_counts().head()

#1015    1
#1056    1
#1175    1
#1173    1
#1183    1
Name: Name, dtype: int64

In [42]:
orders[orders['Tags'].notna()]['Name'].value_counts().head()

#1233    1
#1234    1
#1232    1
#1235    1
Name: Name, dtype: int64

In [43]:
orders[orders['Tax 1 Name'].notna()]['Name'].value_counts().head()

#1075    1
#1074    1
#1176    1
#1199    1
#1174    1
Name: Name, dtype: int64

In [44]:
orders[orders['Tax 1 Value'].notna()]['Name'].value_counts().head()

#1075    1
#1074    1
#1176    1
#1199    1
#1174    1
Name: Name, dtype: int64

In [49]:
orders[orders['Lineitem compare at price'].notna()]['Name'].value_counts().head()

#1152    2
#1109    2
#1134    2
#1165    2
#1111    1
Name: Name, dtype: int64

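All the uncertain variables behave as we suspected: the order-level candidates appear at most once per order, while 'Lineitem compare at price' can appear on several rows of the same order, as expected for an item-level variable. The uncertain lists are therefore confirmed, although these variables are filled in for only a few orders or items. We will put them in the corresponding DataFrames and decide later what to do with them. The updated list of variables per DataFrame is the following: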

    order_specific = ['Name', 'Financial Status', 'Paid at', 'Fulfillment Status', 'Fulfilled at', 'Accepts Marketing', 'Currency', 'Subtotal', 'Shipping', 'Taxes', 'Total', 'Discount Amount', 'Shipping Method', 'Billing City', 'Billing Zip', 'Billing Country', 'Note Attributes', 'Payment Method', 'Payment Reference', 'Refunded Amount', 'Id', 'Risk Level', 'Source', 'Discount Code','Notes', 'Cancelled at', 'Tags', 'Tax 1 Name', 'Tax 1 Value']

    item_specific = ['Lineitem quantity', 'Lineitem name', 'Lineitem price', 'Lineitem sku', 'Lineitem requires shipping', 'Lineitem taxable', 'Lineitem fulfillment status', 'Lineitem compare at price']
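
The per-column checks above can be condensed into a single pass that reports, for every column, the largest number of non-null rows found within any one order. This is a sketch on toy data; `max_filled_rows_per_order` is a hypothetical helper. A result of 1 is consistent with an order-level variable but does not prove it on its own, whereas a value above 1 points to an item-level variable.

```python
import numpy as np
import pandas as pd

def max_filled_rows_per_order(df: pd.DataFrame, key: str = "Name") -> pd.Series:
    """For each column, the largest count of non-null rows within a single order."""
    filled = df.drop(columns=key).notna()
    return filled.groupby(df[key]).sum().max()

# Toy frame standing in for `orders` (illustrative values only).
toy = pd.DataFrame({
    "Name": ["#1", "#1", "#2", "#2", "#2"],
    "Discount Code": ["X", np.nan, np.nan, np.nan, np.nan],
    "Lineitem compare at price": [3.0, 2.5, np.nan, np.nan, np.nan],
    "Lineitem name": ["a", "b", "c", "d", "e"],
})
counts = max_filled_rows_per_order(toy)
print(counts)
```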

## Splitting the dataframe

To make the information easier to understand, we will split our DataFrame in two. The first will be order-specific and the second item-specific. This will help us structure the data better and clean each DataFrame separately, with a better understanding of each variable.

In [50]:
cols_order = ['Name', 'Financial Status', 'Paid at', 'Fulfillment Status', 'Fulfilled at', 'Accepts Marketing', 'Currency', 'Subtotal', 'Shipping', 'Taxes', 'Total', 'Discount Amount', 'Shipping Method', 'Billing City', 'Billing Zip', 'Billing Country', 'Note Attributes', 'Payment Method', 'Payment Reference', 'Refunded Amount', 'Id', 'Risk Level', 'Source', 'Discount Code','Notes', 'Cancelled at', 'Tags', 'Tax 1 Name', 'Tax 1 Value']
order_specific = orders[cols_order]

# Note: 'Name' and 'Id' are included in the item-specific DataFrame so that each item can be traced back to its order.
cols_item = ['Name', 'Id', 'Lineitem quantity', 'Lineitem name', 'Lineitem price', 'Lineitem sku', 'Lineitem requires shipping', 'Lineitem taxable', 'Lineitem fulfillment status', 'Lineitem compare at price']
item_specific = orders[cols_item]
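
Because 'Id' seems to be filled only on the first row of each order, most item rows would carry a NaN 'Id' after the split. One possible fix, sketched below on toy data under that assumption, is to forward-fill 'Id' within each order. Since 'Name' is never null, it can also serve as the linking key by itself.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `item_specific` (illustrative values only).
toy_items = pd.DataFrame({
    "Name": ["#1", "#1", "#1", "#2"],
    "Id": [101.0, np.nan, np.nan, 202.0],
    "Lineitem name": ["a", "b", "c", "d"],
})

# Propagate the first row's Id to the remaining rows of the same order.
toy_items["Id"] = toy_items.groupby("Name")["Id"].ffill()
print(toy_items)
```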

## Cleaning order-specific data

In [51]:
order_specific.head()

Unnamed: 0,Name,Financial Status,Paid at,Fulfillment Status,Fulfilled at,Accepts Marketing,Currency,Subtotal,Shipping,Taxes,Total,Discount Amount,Shipping Method,Billing City,Billing Zip,Billing Country,Note Attributes,Payment Method,Payment Reference,Refunded Amount,Id,Risk Level,Source,Discount Code,Notes,Cancelled at,Tags,Tax 1 Name,Tax 1 Value
0,#1244,paid,2019-04-15 09:51:49 +0200,unfulfilled,,yes,EUR,45.0,4.9,0.0,49.9,0.0,Envío estándar,Barcelona,'08032,ES,Fecha de entrega: 16/04/2019\nDía de la semana...,Stripe,c2959210184748.1,0.0,891010000000.0,Low,web,,,,,,
1,#1243,paid,2019-04-11 23:35:23 +0200,fulfilled,2019-04-11 23:43:20 +0200,yes,EUR,40.1,4.9,0.0,45.0,0.0,Envío estándar,Barcelona,'08021,ES,Fecha de entrega: 12/04/2019\nDía de la semana...,Stripe,c2946921922604.1,0.0,886963000000.0,Low,web,,,,,,
2,#1243,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,#1242,refunded,2019-04-11 23:21:35 +0200,unfulfilled,,yes,EUR,40.1,4.9,0.0,45.0,0.0,Envío estándar,Barcelona,'08021,ES,Fecha de entrega: 12/04/2019\nDía de la semana...,Stripe,c2946859925548.1,45.0,886948000000.0,Low,web,,,2019-04-11 23:32:02 +0200,,,
4,#1242,,,,,,,,,,,,,,,,,,,,,,,,,,,,


We need to keep only the first row of each order; the remaining rows contain only NaN values.

In [67]:
# Keep the first row of each order. DataFrame.append was removed in pandas 2.0,
# and drop_duplicates does the same job in a single vectorised call.
order_specific_clean = order_specific.drop_duplicates(subset='Name', keep='first')
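
Before discarding the other rows, it may be worth confirming that they really hold no order-level data. The sketch below runs that check on a toy frame, keeping the first row per 'Name' and counting the non-null values left in the rows that would be dropped; all variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `order_specific` (illustrative values only).
toy = pd.DataFrame({
    "Name": ["#1", "#1", "#2", "#2", "#2"],
    "Total": [10.0, np.nan, 5.0, np.nan, np.nan],
    "Currency": ["EUR", None, "EUR", None, None],
})

first_rows = toy.drop_duplicates(subset="Name", keep="first")
discarded = toy.drop(index=first_rows.index)

# Non-null values per column among the discarded rows; all zeros means nothing is lost.
leftover = discarded.drop(columns="Name").notna().sum()
print(leftover)
```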

In [69]:
order_specific_clean.isna().sum()

Name                    0
Financial Status        0
Paid at                80
Fulfillment Status      0
Fulfilled at           30
Accepts Marketing       0
Currency                0
Subtotal                0
Shipping                0
Taxes                   0
Total                   0
Discount Amount         0
Shipping Method        10
Billing City            8
Billing Zip             8
Billing Country         7
Note Attributes        17
Payment Method          0
Payment Reference      37
Refunded Amount         0
Id                      0
Risk Level              0
Source                  0
Discount Code         200
Notes                 184
Cancelled at          208
Tags                  233
Tax 1 Name            231
Tax 1 Value           231
dtype: int64

We can see that:
* 80 orders were not registered as paid
* 30 orders were not fulfilled
* 10 orders do not have shipping method
* 8 orders have neither Billing City nor Billing Zip
* 7 orders do not have Billing country
* 17 orders do not have Note Attributes
* 37 orders do not have a payment reference
* 200 orders do not have a discount code
* 184 orders do not have notes
* 208 orders were not cancelled
* 233 orders do not have tags
* 231 orders do not have tax information

We will now analyse each of the variables with NaN values to correct this issue.
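
For several of these columns the NaN itself carries meaning: a missing 'Cancelled at' means the order was not cancelled, and a missing 'Discount Code' means no code was used. One way to make that explicit, shown here as a sketch on toy data, is to derive boolean flags from the presence of a value. The flag names and the discount code 'WELCOME' are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `order_specific_clean` (illustrative values only).
toy_orders = pd.DataFrame({
    "Name": ["#1", "#2", "#3"],
    "Cancelled at": [np.nan, "2019-04-11 23:32:02 +0200", np.nan],
    "Discount Code": ["WELCOME", np.nan, np.nan],
})

# True where a value is present, False where it is NaN.
flags = toy_orders[["Cancelled at", "Discount Code"]].notna().rename(
    columns={"Cancelled at": "was_cancelled", "Discount Code": "used_discount"}
)
print(flags)
```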

### Solving NaN values

#### Binary encoding