# Data exploration and cleaning

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Data loading and exploration

In [2]:
orders = pd.read_csv('../00.Data/orders_cripted.csv')

In [3]:
orders.head()

Unnamed: 0,Name,Financial Status,Paid at,Fulfillment Status,Fulfilled at,Accepts Marketing,Currency,Subtotal,Shipping,Taxes,...,Tax 1 Value,Tax 2 Name,Tax 2 Value,Tax 3 Name,Tax 3 Value,Tax 4 Name,Tax 4 Value,Tax 5 Name,Tax 5 Value,Receipt Number
0,#1244,paid,2019-04-15 09:51:49 +0200,unfulfilled,,yes,EUR,45.0,4.9,0.0,...,,,,,,,,,,
1,#1243,paid,2019-04-11 23:35:23 +0200,fulfilled,2019-04-11 23:43:20 +0200,yes,EUR,40.1,4.9,0.0,...,,,,,,,,,,
2,#1243,,,,,,,,,,...,,,,,,,,,,
3,#1242,refunded,2019-04-11 23:21:35 +0200,unfulfilled,,yes,EUR,40.1,4.9,0.0,...,,,,,,,,,,
4,#1242,,,,,,,,,,...,,,,,,,,,,


We seem to have a lot of NaN values. It is not clear what the 'Tax N Name' and 'Tax N Value' columns are, and they appear to be empty. Let's compare the number of NaN values with the total amount of data we have.
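A share of missing values per column is often easier to read than raw counts. The following is a minimal sketch on a toy frame (the values are made up for illustration and are not the real export); it assumes nothing beyond standard pandas calls.

```python
import pandas as pd
import numpy as np

# Toy stand-in for the export (hypothetical values, not the real data)
toy = pd.DataFrame({
    'Name': ['#1', '#1', '#2', '#2'],
    'Subtotal': [45.0, np.nan, 40.1, np.nan],
    'Tax 2 Name': [np.nan, np.nan, np.nan, np.nan],
})

# Fraction of missing values per column, largest first
nan_share = toy.isna().mean().sort_values(ascending=False)
print(nan_share)
```

On the real data the same expression would be `orders.isna().mean()`.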

In [4]:
orders.shape

(2415, 49)

In [5]:
orders.isna().sum()

Name                              0
Financial Status               2178
Paid at                        2258
Fulfillment Status             2178
Fulfilled at                   2208
Accepts Marketing              2178
Currency                       2178
Subtotal                       2178
Shipping                       2178
Taxes                          2178
Total                          2178
Discount Code                  2378
Discount Amount                2178
Shipping Method                2188
Created at                        0
Lineitem quantity                 0
Lineitem name                     0
Lineitem price                    0
Lineitem compare at price      2392
Lineitem sku                     79
Lineitem requires shipping        0
Lineitem taxable                  0
Lineitem fulfillment status       0
Billing City                   2186
Billing Zip                    2186
Billing Country                2185
Notes                          2362
Note Attributes             

OK, so we have plenty of NaNs. We'll drop all the 'Tax N' columns except 'Tax 1 Name' and 'Tax 1 Value', as the others contain no data. 'Receipt Number' has no values either, so we will drop it too.

In [6]:
orders.drop(columns=['Tax 2 Name', 'Tax 2 Value', 'Tax 3 Name', 'Tax 3 Value', 'Tax 4 Name', 'Tax 4 Value', 'Tax 5 Name', 'Tax 5 Value', 'Receipt Number'], inplace=True)
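Instead of typing the column names by hand, the fully empty columns could also be detected programmatically. This is only a sketch on a toy frame with made-up values; on the real data it would drop every all-NaN column, which may be more than the nine listed above.

```python
import pandas as pd
import numpy as np

# Toy frame (hypothetical values) with two completely empty columns
toy = pd.DataFrame({
    'Name': ['#1', '#2'],
    'Tax 1 Value': [1.21, np.nan],
    'Tax 2 Value': [np.nan, np.nan],
    'Receipt Number': [np.nan, np.nan],
})

# Columns in which every value is missing
empty_cols = toy.columns[toy.isna().all()].tolist()

# Drop them and keep the rest
trimmed = toy.drop(columns=empty_cols)
print(empty_cols)
```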

Each row records part of an order, but no column uniquely identifies a row, and the 'Id' column is mostly empty.
There are two possible scenarios:

1) 'Id' is the order id and we only have 2415 - 2178 = 237 orders

2) Every row is a separate order and there is simply no id.

Option 1 is the most likely. To rule out option 2 we group by 'Created at', which has no nulls: if two rows were created at exactly the same time, we can assume that they belong to the same order.

In [7]:
orders.groupby('Created at').count().shape

(237, 39)

We have 237 distinct timestamps, meaning that we actually have 237 orders; the mostly empty rows hold additional detail for the same order. In fact, 'Name' appears to be the ID of the orders, so we check it the same way.

In [8]:
orders.groupby('Name').count().shape

(237, 39)
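Matching group counts (237 in both cases) strongly suggest that 'Name' and 'Created at' identify the same orders, but they do not strictly prove a one-to-one mapping. A hedged sketch of a direct check, shown on a toy frame with illustrative values:

```python
import pandas as pd

# Toy frame (illustrative values): two rows for order #1, one for order #2
toy = pd.DataFrame({
    'Name': ['#1', '#1', '#2'],
    'Created at': ['2019-04-11 23:35:23 +0200',
                   '2019-04-11 23:35:23 +0200',
                   '2019-04-15 09:51:49 +0200'],
})

# Distinct timestamps per order name, and distinct names per timestamp
stamps_per_name = toy.groupby('Name')['Created at'].nunique()
names_per_stamp = toy.groupby('Created at')['Name'].nunique()

# True only if every name has one timestamp and every timestamp one name
one_to_one = bool((stamps_per_name == 1).all() and (names_per_stamp == 1).all())
print(one_to_one)
```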

In [9]:
orders['Name'].value_counts().head()

#1164    34
#1171    33
#1051    32
#1153    30
#1157    30
Name: Name, dtype: int64

In [10]:
orders['Name'].value_counts().tail()

#1175    1
#1190    1
#1013    1
#1244    1
#1155    1
Name: Name, dtype: int64

So we have some orders with 34 rows of information and some with just 1. Let's check the order #1164 to better understand what type of information we might find in the additional rows.

In [11]:
orders[orders['Name']=='#1164']

Unnamed: 0,Name,Financial Status,Paid at,Fulfillment Status,Fulfilled at,Accepts Marketing,Currency,Subtotal,Shipping,Taxes,...,Payment Reference,Refunded Amount,Vendor,Id,Tags,Risk Level,Source,Lineitem discount,Tax 1 Name,Tax 1 Value
266,#1164,paid,2019-01-15 22:58:00 +0100,fulfilled,2019-01-15 23:46:53 +0100,yes,EUR,105.1,0.0,0.0,...,c2584919212076.1,0.0,Xarcuteria Alonso Andrés,785876000000.0,,Low,web,0,,
267,#1164,,,,,,,,,,...,,,Xarcuteria Alonso Andrés,,,,,0,,
268,#1164,,,,,,,,,,...,,,Fruites i Verdures Rovira,,,,,0,,
269,#1164,,,,,,,,,,...,,,Fruites i Verdures Rovira,,,,,0,,
270,#1164,,,,,,,,,,...,,,Fruites i Verdures Rovira,,,,,0,,
271,#1164,,,,,,,,,,...,,,Fruites i Verdures Rovira,,,,,0,,
272,#1164,,,,,,,,,,...,,,Fruites i Verdures Rovira,,,,,0,,
273,#1164,,,,,,,,,,...,,,Fruites i Verdures Rovira,,,,,0,,
274,#1164,,,,,,,,,,...,,,Fruites i Verdures Rovira,,,,,0,,
275,#1164,,,,,,,,,,...,,,Fruites i Verdures Rovira,,,,,0,,


We can see that there is unique information for the first row. This would be the general order information, which includes: 

    order_specific = ['Name', 'Financial Status', 'Paid at', 'Fulfillment Status', 'Fulfilled at', 'Accepts Marketing', 'Currency', 'Subtotal', 'Shipping', 'Taxes', 'Total', 'Discount Amount', 'Shipping Method', 'Billing City', 'Billing Zip', 'Billing Country', 'Note Attributes', 'Payment Method', 'Payment Reference', 'Refunded Amount', 'Id', 'Risk Level', 'Source']

Other variables are clearly item-specific:

    item_specific = ['Lineitem quantity', 'Lineitem name', 'Lineitem price', 'Lineitem sku', 'Lineitem requires shipping', 'Lineitem taxable', 'Lineitem fulfillment status']
    
    
Some variables could be either order-specific or item-specific; this needs to be confirmed because they are NaN in every row shown here.

    uncertain_order = ['Discount Code','Notes', 'Cancelled at', 'Tags', 'Tax 1 Name', 'Tax 1 Value']
    uncertain_item = ['Lineitem compare at price']

We will now split the DataFrame as follows:
* order_specific information
* item_specific information

To do that we need:

1) Classify the uncertain variables

2) Reference both tables with some sort of Order Id

### Classify the uncertain variables

In [12]:
orders[['Discount Code','Notes', 'Cancelled at', 'Tags', 'Tax 1 Name', 'Tax 1 Value']].notna().sum()

Discount Code    37
Notes            53
Cancelled at     29
Tags              4
Tax 1 Name        6
Tax 1 Value       6
dtype: int64

In [13]:
orders[orders['Discount Code'].notna()]['Name'].value_counts().head()

#1197    1
#1235    1
#1219    1
#1188    1
#1073    1
Name: Name, dtype: int64

We can see that 'Discount Code' is filled at most once per order, so it is order-specific. We'll run the same check on the rest of the variables.

In [14]:
orders[orders['Notes'].notna()]['Name'].value_counts().head()

#1056    1
#1184    1
#1182    1
#1033    1
#1026    1
Name: Name, dtype: int64

In [15]:
orders[orders['Cancelled at'].notna()]['Name'].value_counts().head()

#1242    1
#1049    1
#1184    1
#1145    1
#1174    1
Name: Name, dtype: int64

In [16]:
orders[orders['Tags'].notna()]['Name'].value_counts().head()

#1235    1
#1234    1
#1233    1
#1232    1
Name: Name, dtype: int64

In [17]:
orders[orders['Tax 1 Name'].notna()]['Name'].value_counts().head()

#1065    1
#1074    1
#1174    1
#1199    1
#1075    1
Name: Name, dtype: int64

In [18]:
orders[orders['Tax 1 Value'].notna()]['Name'].value_counts().head()

#1065    1
#1074    1
#1174    1
#1199    1
#1075    1
Name: Name, dtype: int64

In [19]:
orders[orders['Lineitem compare at price'].notna()]['Name'].value_counts().head()

#1152    2
#1165    2
#1109    2
#1134    2
#1137    1
Name: Name, dtype: int64

All the uncertain variables are specific to each order or item as we suspected (so the uncertain lists are now certain), but they are populated for only a few orders/items. We will put them in the corresponding DataFrames and decide later what to do with them. The updated list of variables per DataFrame is the following:

    order_specific = ['Name', 'Financial Status', 'Paid at', 'Fulfillment Status', 'Fulfilled at', 'Accepts Marketing', 'Currency', 'Subtotal', 'Shipping', 'Taxes', 'Total', 'Discount Amount', 'Shipping Method', 'Billing City', 'Billing Zip', 'Billing Country', 'Note Attributes', 'Payment Method', 'Payment Reference', 'Refunded Amount', 'Id', 'Risk Level', 'Source', 'Discount Code','Notes', 'Cancelled at', 'Tags', 'Tax 1 Name', 'Tax 1 Value']

    item_specific = ['Lineitem quantity', 'Lineitem name', 'Lineitem price', 'Lineitem sku', 'Lineitem requires shipping', 'Lineitem taxable', 'Lineitem fulfillment status', 'Lineitem compare at price']
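The per-column checks above can be condensed into a single pass. The sketch below uses a toy frame with made-up values; the rule "at most one filled cell per order means order-level" is the same heuristic applied manually above, not a guarantee.

```python
import pandas as pd
import numpy as np

# Toy frame (hypothetical values): order #1 has two rows, order #2 has three
toy = pd.DataFrame({
    'Name': ['#1', '#1', '#2', '#2', '#2'],
    'Discount Code': ['SPRING', np.nan, np.nan, np.nan, np.nan],
    'Lineitem compare at price': [3.5, 2.0, np.nan, np.nan, np.nan],
})

candidates = ['Discount Code', 'Lineitem compare at price']

# count() ignores NaN, so this is the largest number of filled cells in any order
max_filled = toy.groupby('Name')[candidates].count().max()

# At most one filled cell per order suggests an order-level column
level = max_filled.map(lambda n: 'order' if n <= 1 else 'item')
print(level)
```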

## Splitting the dataframe

To make the information easier to understand, we will split our DataFrame in two. The first part will be order-specific and the second item-specific. This will help us structure the data better and clean each DataFrame separately with a better understanding of each variable.

In [20]:
cols_order = ['Name', 'Financial Status', 'Paid at', 'Fulfillment Status', 'Fulfilled at', 'Accepts Marketing', 'Currency', 'Subtotal', 'Shipping', 'Taxes', 'Total', 'Discount Amount', 'Shipping Method', 'Billing City', 'Billing Zip', 'Billing Country', 'Note Attributes', 'Payment Method', 'Payment Reference', 'Refunded Amount', 'Id', 'Risk Level', 'Source', 'Discount Code','Notes', 'Cancelled at', 'Tags', 'Tax 1 Name', 'Tax 1 Value']
order_specific = orders[cols_order]

# Note: 'Name' and 'Id' are also included in the item-specific DataFrame so that each item can be traced back to its order.
cols_item = ['Name', 'Id', 'Lineitem quantity', 'Lineitem name', 'Lineitem price', 'Lineitem sku', 'Lineitem requires shipping', 'Lineitem taxable', 'Lineitem fulfillment status', 'Lineitem compare at price']
item_specific = orders[cols_item]

## Cleaning order-specific data

In [21]:
order_specific.head()

Unnamed: 0,Name,Financial Status,Paid at,Fulfillment Status,Fulfilled at,Accepts Marketing,Currency,Subtotal,Shipping,Taxes,...,Refunded Amount,Id,Risk Level,Source,Discount Code,Notes,Cancelled at,Tags,Tax 1 Name,Tax 1 Value
0,#1244,paid,2019-04-15 09:51:49 +0200,unfulfilled,,yes,EUR,45.0,4.9,0.0,...,0.0,891010000000.0,Low,web,,,,,,
1,#1243,paid,2019-04-11 23:35:23 +0200,fulfilled,2019-04-11 23:43:20 +0200,yes,EUR,40.1,4.9,0.0,...,0.0,886963000000.0,Low,web,,,,,,
2,#1243,,,,,,,,,,...,,,,,,,,,,
3,#1242,refunded,2019-04-11 23:21:35 +0200,unfulfilled,,yes,EUR,40.1,4.9,0.0,...,45.0,886948000000.0,Low,web,,,2019-04-11 23:32:02 +0200,,,
4,#1242,,,,,,,,,,...,,,,,,,,,,


We need to keep only the first row of each order; the remaining rows contain only NaN values in these columns.

In [22]:
# DataFrame.append was removed in pandas 2.0; drop_duplicates keeps the first row of each order
order_specific_clean = order_specific.drop_duplicates(subset='Name', keep='first')

In [23]:
order_specific_clean.isna().sum()

Name                    0
Financial Status        0
Paid at                80
Fulfillment Status      0
Fulfilled at           30
Accepts Marketing       0
Currency                0
Subtotal                0
Shipping                0
Taxes                   0
Total                   0
Discount Amount         0
Shipping Method        10
Billing City            8
Billing Zip             8
Billing Country         7
Note Attributes        17
Payment Method          0
Payment Reference      37
Refunded Amount         0
Id                      0
Risk Level              0
Source                  0
Discount Code         200
Notes                 184
Cancelled at          208
Tags                  233
Tax 1 Name            231
Tax 1 Value           231
dtype: int64

We can see that:
* 'Paid at': 80 orders were not registered as paid
* 'Fulfilled at': 30 orders were not fulfilled
* 'Shipping Method': 10 orders do not have a shipping method
* 'Billing City' and 'Billing Zip': 8 orders have neither a billing city nor a billing zip
* 'Billing Country': 7 orders do not have a billing country
* 'Note Attributes': 17 orders do not have note attributes
* 'Payment Reference': 37 orders do not have a payment reference
* 'Discount Code': 200 orders do not have a discount code
* 'Notes': 184 orders do not have notes
* 'Cancelled at': 208 orders were not cancelled
* 'Tags': 233 orders do not have tags
* 'Tax 1 Name' and 'Tax 1 Value': 231 orders do not have tax information

We will now analyse each of the variables with NaN values to correct this issue.

### Order-specific data: Solving NaN values

#### Dropping values

We will drop the following variables because they do not provide relevant information for the analysis:

'Paid at', 'Fulfilled at', 'Billing Country', 'Note Attributes', 'Payment Reference', 'Cancelled at', 'Tags', 'Tax 1 Name', 'Tax 1 Value'

In [24]:
order_specific_clean = order_specific_clean.drop(columns=['Paid at', 'Fulfilled at', 'Billing Country', 'Note Attributes', 'Payment Reference', 'Cancelled at', 'Tags', 'Tax 1 Name', 'Tax 1 Value'])

In [25]:
#We check the status of the NaN values
order_specific_clean.isna().sum()

Name                    0
Financial Status        0
Fulfillment Status      0
Accepts Marketing       0
Currency                0
Subtotal                0
Shipping                0
Taxes                   0
Total                   0
Discount Amount         0
Shipping Method        10
Billing City            8
Billing Zip             8
Payment Method          0
Refunded Amount         0
Id                      0
Risk Level              0
Source                  0
Discount Code         200
Notes                 184
dtype: int64

We don't care about the discount code but we do care whether there was a discount or not.

In [26]:
order_specific_clean['Discount Code'] = order_specific_clean['Discount Code'].fillna(value=0)

In [27]:
order_specific_clean['Discount Code'] = order_specific_clean['Discount Code'].map(lambda x: 0 if x==0 else 1)
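The two steps above (fill with 0, then map to 0/1) could probably be collapsed into one expression. A minimal sketch on a toy Series; the codes 'WELCOME10' and 'FREESHIP' are invented for illustration.

```python
import pandas as pd
import numpy as np

# Toy column (hypothetical discount codes)
codes = pd.Series(['WELCOME10', np.nan, 'FREESHIP', np.nan], name='Discount Code')

# 1 where a code is present, 0 where it is missing
has_discount = codes.notna().astype(int)
print(has_discount.tolist())
```

This avoids temporarily mixing the integer 0 with string codes in the same column.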

We need to analyse 'Notes' because it lets us identify test records.

In [28]:
order_specific_clean['Notes'] = order_specific_clean['Notes'].fillna(value='None')
# Substrings that identify a note written for a test order
test_markers = ['prueba', 'prova', 'Hola', 'fdfsdfsdf', 'wqweqwe', 'hohsadasdasd', 'fsdfsfsd', 'hola mundo']
order_specific_clean['Notes'] = order_specific_clean['Notes'].map(lambda x: 'test' if any(m in x for m in test_markers) else x)


We now remove all test records

In [29]:
order_specific_clean = order_specific_clean[order_specific_clean['Notes'] != 'test']
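If the line items of the removed test orders should disappear as well (an assumption; the notebook has not decided this yet), the item-level table could be filtered by the surviving order names. A sketch with toy frames and made-up values:

```python
import pandas as pd

# Toy frames (hypothetical values): order #2 was removed as a test record
orders_kept = pd.DataFrame({'Name': ['#1', '#3']})
items = pd.DataFrame({
    'Name': ['#1', '#1', '#2', '#3'],
    'Lineitem name': ['apples', 'pears', 'test item', 'bread'],
})

# Keep only line items whose order survived the test-record filter
items_kept = items[items['Name'].isin(orders_kept['Name'])]
print(items_kept)
```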

In [30]:
#We check the status of the NaN values
order_specific_clean.isna().sum()

Name                   0
Financial Status       0
Fulfillment Status     0
Accepts Marketing      0
Currency               0
Subtotal               0
Shipping               0
Taxes                  0
Total                  0
Discount Amount        0
Shipping Method       10
Billing City           8
Billing Zip            8
Payment Method         0
Refunded Amount        0
Id                     0
Risk Level             0
Source                 0
Discount Code          0
Notes                  0
dtype: int64

In [31]:
order_specific_clean['Shipping Method'].value_counts()

Envío estándar                              133
Envío gratuito                               51
Recogida en el mercado                       21
El reparto se realiza a través de Shargo      4
Name: Shipping Method, dtype: int64

When 'Shipping Method' is NaN, it means that the order was delivered by one of the employees. We will fill these NaNs with 'Other'.

In [35]:
order_specific_clean['Shipping Method'] = order_specific_clean['Shipping Method'].fillna(value='Other')

In [36]:
#We check the status of the NaN values
order_specific_clean.isna().sum()

Name                  0
Financial Status      0
Fulfillment Status    0
Accepts Marketing     0
Currency              0
Subtotal              0
Shipping              0
Taxes                 0
Total                 0
Discount Amount       0
Shipping Method       0
Billing City          8
Billing Zip           8
Payment Method        0
Refunded Amount       0
Id                    0
Risk Level            0
Source                0
Discount Code         0
Notes                 0
dtype: int64

In [37]:
order_specific_clean[order_specific_clean['Billing City'].isna()]

Unnamed: 0,Name,Financial Status,Fulfillment Status,Accepts Marketing,Currency,Subtotal,Shipping,Taxes,Total,Discount Amount,Shipping Method,Billing City,Billing Zip,Payment Method,Refunded Amount,Id,Risk Level,Source,Discount Code,Notes
5,#1241,paid,fulfilled,no,EUR,38.94,0.0,0.0,38.94,0.0,Other,,,manual,0.0,886555000000.0,Low,shopify_draft_order,0,
24,#1231,paid,fulfilled,no,EUR,200.1,0.0,0.0,200.1,0.0,Other,,,manual,0.0,878885000000.0,Low,shopify_draft_order,0,A entregar 6 de abril
36,#1228,paid,fulfilled,no,EUR,54.5,0.0,0.0,54.5,0.0,Other,,,manual,0.0,877671000000.0,Low,shopify_draft_order,0,
46,#1224,paid,fulfilled,no,EUR,26.16,0.0,0.0,26.16,0.0,Other,,,manual,0.0,873965000000.0,Low,shopify_draft_order,0,
47,#1223,paid,fulfilled,no,EUR,40.94,0.0,0.0,40.94,0.0,Other,,,manual,0.0,870070000000.0,Low,shopify_draft_order,0,
69,#1214,paid,fulfilled,no,EUR,51.15,0.0,0.0,51.15,0.0,Other,,,manual,0.0,861757000000.0,Low,shopify_draft_order,0,
71,#1212,paid,fulfilled,no,EUR,32.65,0.0,0.0,32.65,0.0,Other,,,manual,0.0,854882000000.0,Low,iphone,0,Es un pedido para Vimet (evento Wado)
78,#1207,paid,fulfilled,no,EUR,27.5,0.0,0.0,27.5,0.0,Other,,,manual,0.0,852265000000.0,Low,shopify_draft_order,0,Es un pedido para Vimet (Coworking)


In almost all cases where the order is delivered by an employee, 'Billing City' and 'Billing Zip' are null. We will replace these nulls with 'Unknown'.

In [39]:
order_specific_clean['Billing City'] = order_specific_clean['Billing City'].fillna(value='Unknown')

In [40]:
order_specific_clean['Billing Zip'] = order_specific_clean['Billing Zip'].fillna(value='Unknown')
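Both columns could also be filled in one call by passing a dict to `fillna`. A minimal sketch on a toy frame with illustrative values:

```python
import pandas as pd
import numpy as np

# Toy frame (illustrative values): the second order has no billing details
toy = pd.DataFrame({
    'Billing City': ['Barcelona', np.nan],
    'Billing Zip': ["'08021", np.nan],
    'Total': [45.0, 38.94],
})

# The dict maps each column to its own fill value; other columns are untouched
filled = toy.fillna({'Billing City': 'Unknown', 'Billing Zip': 'Unknown'})
print(filled)
```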

In [41]:
#We check the status of the NaN values
order_specific_clean.isna().sum()

Name                  0
Financial Status      0
Fulfillment Status    0
Accepts Marketing     0
Currency              0
Subtotal              0
Shipping              0
Taxes                 0
Total                 0
Discount Amount       0
Shipping Method       0
Billing City          0
Billing Zip           0
Payment Method        0
Refunded Amount       0
Id                    0
Risk Level            0
Source                0
Discount Code         0
Notes                 0
dtype: int64

All NaN values are now resolved.

### Order-specific data: Data cleaning

There are no NaN values left, but we still need to check the content of each variable in order to clean it.

In [42]:
order_specific_clean.head()

Unnamed: 0,Name,Financial Status,Fulfillment Status,Accepts Marketing,Currency,Subtotal,Shipping,Taxes,Total,Discount Amount,Shipping Method,Billing City,Billing Zip,Payment Method,Refunded Amount,Id,Risk Level,Source,Discount Code,Notes
0,#1244,paid,unfulfilled,yes,EUR,45.0,4.9,0.0,49.9,0.0,Envío estándar,Barcelona,'08032,Stripe,0.0,891010000000.0,Low,web,0,
1,#1243,paid,fulfilled,yes,EUR,40.1,4.9,0.0,45.0,0.0,Envío estándar,Barcelona,'08021,Stripe,0.0,886963000000.0,Low,web,0,
3,#1242,refunded,unfulfilled,yes,EUR,40.1,4.9,0.0,45.0,0.0,Envío estándar,Barcelona,'08021,Stripe,45.0,886948000000.0,Low,web,0,
5,#1241,paid,fulfilled,no,EUR,38.94,0.0,0.0,38.94,0.0,Other,Unknown,Unknown,manual,0.0,886555000000.0,Low,shopify_draft_order,0,
6,#1240,paid,fulfilled,yes,EUR,65.35,4.9,0.0,70.25,10.0,Envío estándar,Barcelona,'08037,Stripe,0.0,885370000000.0,Low,web,1,¿Podemos cambiar la lechuga larga por cualquie...


#### Name -- OK

We checked **'Name'** previously and it looks OK for now. We might drop it and keep only 'Id'; we'll decide this later.

#### Financial Status -- OK

In [43]:
order_specific_clean['Financial Status'].value_counts()

paid                  146
partially_refunded     37
voided                 23
pending                 8
refunded                5
Name: Financial Status, dtype: int64

Looks good

#### Fulfillment Status -- OK

In [44]:
order_specific_clean['Fulfillment Status'].value_counts()

fulfilled      196
unfulfilled     22
partial          1
Name: Fulfillment Status, dtype: int64

Also good

#### Accepts Marketing

In [45]:
order_specific_clean['Accepts Marketing'].value_counts()

yes    184
no      35
Name: Accepts Marketing, dtype: int64

We will convert this to a 0/1 flag (boolean in meaning, but stored as int) to minimise work during the analysis later.

In [47]:
order_specific_clean['Accepts Marketing'] = order_specific_clean['Accepts Marketing'].map(lambda x: 1 if x=='yes' else 0)
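One caveat of the lambda above is that any label other than 'yes' silently becomes 0. A dict lookup is a possible alternative that surfaces unexpected labels as NaN first. Sketch on a toy Series:

```python
import pandas as pd

# Toy column with the two labels seen in the value counts above
accepts = pd.Series(['yes', 'no', 'yes'], name='Accepts Marketing')

# Labels missing from the dict would become NaN instead of silently 0
flags = accepts.map({'yes': 1, 'no': 0})
assert flags.notna().all(), 'unexpected label in Accepts Marketing'

# Safe to store as int once every label has been recognised
flags = flags.astype(int)
print(flags.tolist())
```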

#### Currency

In [48]:
order_specific_clean['Currency'].value_counts()

EUR    219
Name: Currency, dtype: int64

We drop 'Currency' because every value is EUR, so it is irrelevant for the analysis.

In [49]:
order_specific_clean = order_specific_clean.drop(columns='Currency')
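The same idea generalises: any column with a single distinct value carries no information. A hedged sketch for spotting such columns, on a toy frame with illustrative values:

```python
import pandas as pd

# Toy frame (illustrative values)
toy = pd.DataFrame({
    'Currency': ['EUR', 'EUR', 'EUR'],
    'Total': [49.9, 45.0, 38.94],
})

# Columns with exactly one distinct value (NaN counted as a value)
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) == 1]
print(constant_cols)
```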

#### Subtotal -- OK

In [54]:
order_specific_clean['Subtotal'].dtype

dtype('float64')

Also looks good

#### Shipping -- OK

In [55]:
order_specific_clean['Shipping'].value_counts()

4.9    129
0.0     86
3.9      4
Name: Shipping, dtype: int64

Also looks good

#### Taxes -- OK

In [57]:
order_specific_clean['Taxes'].value_counts()

0.00     213
10.40      1
2.72       1
53.80      1
17.34      1
1.21       1
8.04       1
Name: Taxes, dtype: int64

Some values are oddly high. Let's check the non-zero values

In [58]:
order_specific_clean[order_specific_clean['Taxes'] != 0.0]

Unnamed: 0,Name,Financial Status,Fulfillment Status,Accepts Marketing,Subtotal,Shipping,Taxes,Total,Discount Amount,Shipping Method,Billing City,Billing Zip,Payment Method,Refunded Amount,Id,Risk Level,Source,Discount Code,Notes
96,#1199,paid,fulfilled,1,310.0,0.0,53.8,310.0,0.0,Envío gratuito,Barcelona,'08006,Stripe,0.0,837411000000.0,Low,web,0,
134,#1176,voided,unfulfilled,1,103.65,0.0,17.34,103.65,0.0,Envío gratuito,Barcelona,'08021,custom,0.0,815409000000.0,Low,web,0,
142,#1174,voided,unfulfilled,1,66.75,4.9,10.4,71.65,0.0,Envío estándar,Barcelona,'08006,custom,0.0,811620000000.0,Low,web,0,
1746,#1075,paid,fulfilled,1,71.2,4.9,8.04,71.2,4.9,Envío estándar,Bellaterra,'08193,Stripe,0.0,501550000000.0,Low,web,1,
1752,#1074,partially_refunded,fulfilled,0,40.35,4.9,2.72,40.35,4.9,Envío estándar,Barcelona,'08021,Stripe,2.85,501271000000.0,Low,web,1,
1870,#1065,paid,fulfilled,1,44.1,4.9,1.21,49.0,0.0,Envío estándar,Barcelona,'08021,Stripe,0.0,484438000000.0,Low,web,0,


Looks OK
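To make "oddly high" more concrete, the tax can be expressed as a share of the subtotal. The sketch below copies 'Subtotal' and 'Taxes' from three of the rows in the table above; the 'Tax share' column name is introduced here for illustration only.

```python
import pandas as pd

# Subtotal and Taxes copied from three of the rows shown above
taxed = pd.DataFrame({
    'Name': ['#1199', '#1176', '#1065'],
    'Subtotal': [310.0, 103.65, 44.1],
    'Taxes': [53.8, 17.34, 1.21],
})

# Taxes as a share of the subtotal, rounded for display
taxed['Tax share'] = (taxed['Taxes'] / taxed['Subtotal']).round(3)
print(taxed)
```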

#### Total <-- NEED TO CHECK WITH RAMON


In [65]:
order_specific_clean[order_specific_clean['Total'] != order_specific_clean['Subtotal'] + order_specific_clean['Shipping'] - order_specific_clean['Discount Amount']]

Unnamed: 0,Name,Financial Status,Fulfillment Status,Accepts Marketing,Subtotal,Shipping,Taxes,Total,Discount Amount,Shipping Method,Billing City,Billing Zip,Payment Method,Refunded Amount,Id,Risk Level,Source,Discount Code,Notes
6,#1240,paid,fulfilled,1,65.35,4.9,0.0,70.25,10.0,Envío estándar,Barcelona,'08037,Stripe,0.0,885370000000.0,Low,web,1,¿Podemos cambiar la lechuga larga por cualquie...
9,#1239,paid,fulfilled,1,77.45,0.0,0.0,77.45,10.0,Envío gratuito,Barcelona,'08005,Stripe,0.0,884384000000.0,Low,web,1,Cambiar hamburguesas de pollo por las de ternera
17,#1235,paid,fulfilled,1,133.95,0.0,0.0,133.95,10.0,Envío gratuito,Barcelona,'08022,Stripe,0.0,880764000000.0,Low,web,1,
22,#1233,paid,fulfilled,1,69.0,0.0,0.0,69.0,10.0,Envío gratuito,barcelona,'08021,Stripe,0.0,880288000000.0,Low,web,1,
48,#1222,paid,fulfilled,1,38.05,4.9,0.0,42.95,0.0,Envío estándar,Barcelona,'08021,Stripe,0.0,870063000000.0,Low,web,0,
56,#1220,paid,fulfilled,1,69.0,0.0,0.0,69.0,10.0,Envío gratuito,Barcelona,'08005,Stripe,0.0,869611000000.0,Low,web,1,Cambiar piña por pera\r\nNo poner las pasas
57,#1219,paid,fulfilled,1,53.65,4.9,0.0,58.55,10.0,Envío estándar,Barcelona,'08032,Stripe,0.0,868680000000.0,Low,web,1,
92,#1204,paid,fulfilled,1,49.0,4.9,0.0,53.9,10.0,Envío estándar,Barcelona,'08012,Stripe,0.0,850552000000.0,Low,web,1,Agradecería sustituir los garbanzos por judías...
119,#1186,paid,fulfilled,1,55.0,4.9,0.0,59.9,10.0,Envío estándar,Barcelona,'08012,Stripe,0.0,825292000000.0,Low,web,1,
126,#1181,paid,fulfilled,1,27.55,4.9,0.0,32.45,32.4,Envío estándar,Barcelona,'08023,Stripe,0.0,818497000000.0,Low,web,1,


#### Discount Amount -- OK

In [73]:
order_specific_clean['Discount Amount'].value_counts()

0.0     183
4.9      24
10.0     11
32.4      1
Name: Discount Amount, dtype: int64

In [74]:
order_specific_clean['Discount Amount'].dtype

dtype('float64')

Looks good

#### Shipping Method

In [76]:
order_specific_clean['Shipping Method'].value_counts()

Envío estándar                              133
Envío gratuito                               51
Recogida en el mercado                       21
Other                                        10
El reparto se realiza a través de Shargo      4
Name: Shipping Method, dtype: int64

Looks good, but we will shorten the values to reduce the amount of text. This can be useful later if we want to display them in a plot.

In [78]:
shipping_labels = {'Envío estándar': 'estándar', 'Envío gratuito': 'gratuito', 'Recogida en el mercado': 'mercado', 'El reparto se realiza a través de Shargo': 'shargo'}
order_specific_clean['Shipping Method'] = order_specific_clean['Shipping Method'].map(lambda x: shipping_labels.get(x, 'otros'))

#### Billing City

In [80]:
order_specific_clean['Billing City'].value_counts()

Barcelona                  166
Barcelona                   16
barcelona                   14
Unknown                      8
test                         7
jsjajsjsjjs                  2
BARCELONA                    2
Sant feliu de Llobregat      1
Bellaterra                   1
khg                          1
dsa                          1
Name: Billing City, dtype: int64

In [89]:
order_specific_clean = order_specific_clean.replace(['Barcelona ', ' Barcelona', 'barcelona', 'BARCELONA'], 'Barcelona')
order_specific_clean = order_specific_clean.replace(['khg', 'dsa', 'test', 'jsjajsjsjjs'], 'Unknown')

In [90]:
order_specific_clean['Billing City'].value_counts()

Barcelona                  198
Unknown                     19
Sant feliu de Llobregat      1
Bellaterra                   1
Name: Billing City, dtype: int64
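A column-scoped variant of this clean-up is sketched below. It strips whitespace and unifies case for Barcelona instead of listing every spelling, and it touches only the one Series, so the same strings in other columns would be left alone. The toy values mirror the value counts above; the `junk` list name is introduced here for illustration.

```python
import pandas as pd

# Toy column mirroring the spelling variants seen above
cities = pd.Series(
    ['Barcelona', 'Barcelona ', 'barcelona', 'BARCELONA', 'khg', 'Bellaterra'],
    name='Billing City',
)

# Placeholder strings treated as missing (taken from the value counts above)
junk = ['khg', 'dsa', 'test', 'jsjajsjsjjs']

# Remove leading and trailing whitespace
cleaned = cities.str.strip()
# Unify spelling variants of Barcelona regardless of case
cleaned = cleaned.where(cleaned.str.lower() != 'barcelona', 'Barcelona')
# Map placeholder strings to 'Unknown'
cleaned = cleaned.replace(junk, 'Unknown')
print(cleaned.value_counts())
```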

#### Billing Zip

#### Payment Method

#### Refunded Amount

#### Id

#### Risk Level

#### Source

#### Discount Code

#### Notes