## Cleaning order-specific data

In [None]:
##Load the data

In [None]:
order_specific.head()

In [None]:
# Keep only the first row of each order ('Name' identifies the order).
# DataFrame.append was removed in pandas 2.0, so we use drop_duplicates instead of a loop.
order_specific_clean = order_specific.drop_duplicates(subset='Name', keep='first')

In [None]:
order_specific_clean.isna().sum()

We can see that:
* 'Paid at': 80 orders were not registered as paid
* 'Fulfilled at': 30 orders were not fulfilled
* 'Shipping Method': 10 orders do not have a shipping method
* 'Billing City' and 'Billing Zip': 8 orders have neither a billing city nor a billing zip
* 'Billing Country': 7 orders do not have a billing country
* 'Note Attributes': 17 orders do not have note attributes
* 'Payment Reference': 37 orders do not have a payment reference
* 'Discount Code': 200 orders do not have a discount code
* 'Notes': 184 orders do not have notes
* 'Cancelled at': 208 orders were not cancelled
* 'Tags': 233 orders do not have tags
* 'Tax 1 Name' and 'Tax 1 Value': 231 orders do not have tax information
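
The list above was read off the full `isna().sum()` output. As a minimal sketch, the same summary can be restricted to the affected columns and sorted; the `demo` frame below is a hypothetical miniature of the orders table, not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the orders table, for illustration only
demo = pd.DataFrame({
    'Name': ['#1001', '#1002', '#1003'],
    'Paid at': ['2020-01-01', np.nan, np.nan],
    'Tags': [np.nan, np.nan, np.nan],
})

# Count NaNs per column, keep only the columns that have any, largest first
na_counts = demo.isna().sum()
na_summary = na_counts[na_counts > 0].sort_values(ascending=False)
print(na_summary)
```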

We will now analyse each of the variables with NaN values to correct this issue.

## Solving NaN values

#### Dropping values

We will drop the following variables as they do not provide relevant information for the analysis:

'Paid at', 'Fulfilled at', 'Billing Country', 'Note Attributes', 'Payment Reference', 'Cancelled at', 'Tags', 'Tax 1 Name', 'Tax 1 Value'

In [1]:
order_specific_clean = order_specific_clean.drop(columns=['Paid at', 'Fulfilled at', 'Billing Country', 'Note Attributes', 'Payment Reference', 'Cancelled at', 'Tags', 'Tax 1 Name', 'Tax 1 Value'])



In [None]:
#We check the status of the NaN values
order_specific_clean.isna().sum()

We don't care about the specific discount code, only whether a discount was applied.

In [None]:
order_specific_clean['Discount Code'] = order_specific_clean['Discount Code'].fillna(value=0)

In [None]:
order_specific_clean['Discount Code'] = order_specific_clean['Discount Code'].map(lambda x: 0 if x==0 else 1)
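
The fill-then-map steps above can also be written as a single operation. This is only a sketch on a hypothetical `demo` frame (the codes are made up), and it would have to replace both cells rather than run after them, because once the NaNs are filled with 0 every value counts as non-missing.

```python
import numpy as np
import pandas as pd

# Hypothetical discount codes, for illustration only
demo = pd.DataFrame({'Discount Code': ['WELCOME10', np.nan, 'FREESHIP', np.nan]})

# 1 where a code is present, 0 where it is missing
demo['Discount Code'] = demo['Discount Code'].notna().astype(int)
print(demo)
```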

We need to analyse 'Notes' because it identifies test records.

In [None]:
order_specific_clean['Notes'] = order_specific_clean['Notes'].fillna(value='None')
# Any note containing one of these markers identifies a test order
test_markers = ['prueba', 'prova', 'Hola', 'fdfsdfsdf', 'wqweqwe', 'hohsadasdasd', 'fsdfsfsd', 'hola mundo']
order_specific_clean['Notes'] = order_specific_clean['Notes'].map(lambda x: 'test' if any(marker in x for marker in test_markers) else x)


We now remove all test records.

In [None]:
order_specific_clean = order_specific_clean[order_specific_clean['Notes'] != 'test']
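
The labelling and filtering of test records above could also be done in one vectorised step with `Series.str.contains`. The sketch below assumes the same marker strings as in the notebook and a 'Notes' column whose NaNs have already been filled; the `demo` notes themselves are hypothetical.

```python
import re
import pandas as pd

# Hypothetical notes, for illustration only
demo = pd.DataFrame({'Notes': ['None', 'esto es una prueba', 'Leave at the door', 'wqweqwe']})

# Marker strings taken from the cell above
test_markers = ['prueba', 'prova', 'Hola', 'fdfsdfsdf', 'wqweqwe', 'hohsadasdasd', 'fsdfsfsd', 'hola mundo']

# Build one case-sensitive regex that matches any marker literally
pattern = '|'.join(re.escape(marker) for marker in test_markers)
is_test = demo['Notes'].str.contains(pattern, regex=True)

# Keep only the rows that are not test records
demo_clean = demo[~is_test]
print(demo_clean)
```

Unlike the original approach, this leaves the 'Notes' text of the remaining rows untouched and never writes the 'test' label into the column.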

In [None]:
#We check the status of the NaN values
order_specific_clean.isna().sum()

In [None]:
order_specific_clean['Shipping Method'].value_counts()

A NaN in 'Shipping Method' means that the order was delivered by one of the employees. We will fill these NaNs with 'Other'.

In [None]:
order_specific_clean['Shipping Method'] = order_specific_clean['Shipping Method'].fillna(value='Other')

In [None]:
#We check the status of the NaN values
order_specific_clean.isna().sum()

In [None]:
order_specific_clean[order_specific_clean['Billing City'].isna()]

In almost all cases, when the order is delivered by an employee, the Billing City and Zip are null. We will replace this with 'Unknown'.

In [None]:
order_specific_clean['Billing City'] = order_specific_clean['Billing City'].fillna(value='Unknown')

In [None]:
order_specific_clean['Billing Zip'] = order_specific_clean['Billing Zip'].fillna(value='Unknown')
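
The column-by-column fills above can be expressed in one call, since `DataFrame.fillna` accepts a dict that maps column names to fill values. A minimal sketch on a hypothetical `demo` frame, using the fill values chosen in this notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical rows, for illustration only
demo = pd.DataFrame({
    'Shipping Method': ['Envío estándar', np.nan],
    'Billing City': ['Barcelona', np.nan],
    'Billing Zip': ['08001', np.nan],
})

# One fill value per column, matching the choices made above
fill_values = {'Shipping Method': 'Other', 'Billing City': 'Unknown', 'Billing Zip': 'Unknown'}
demo = demo.fillna(value=fill_values)
print(demo)
```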

In [None]:
#We check the status of the NaN values
order_specific_clean.isna().sum()

All NaN values have now been resolved.

### Order-specific data: Data cleaning

There are no NaN values left, but we still need to check the content of each variable and clean it where necessary.

In [None]:
order_specific_clean.head()

#### Name -- OK

We checked **'Name'** previously and it looks OK for now. We might drop it later and keep 'Id' instead; we'll decide this later.

#### Financial Status -- OK

In [None]:
order_specific_clean['Financial Status'].value_counts()

Looks good

#### Fulfillment Status -- OK

In [None]:
order_specific_clean['Fulfillment Status'].value_counts()

Also good

#### Accepts Marketing

In [None]:
order_specific_clean['Accepts Marketing'].value_counts()

We will convert this to a 0/1 flag (boolean in meaning, stored as int) to minimise work during the analysis later.

In [None]:
order_specific_clean['Accepts Marketing'] = order_specific_clean['Accepts Marketing'].map(lambda x: 1 if x=='yes' else 0)

#### Currency


In [None]:
order_specific_clean['Currency'].value_counts()

We drop 'Currency' because every order is in EUR, so it is irrelevant for the analysis.

In [None]:
order_specific_clean = order_specific_clean.drop(columns='Currency')

#### Subtotal -- OK

In [None]:
order_specific_clean['Subtotal'].dtype

Also looks good

#### Shipping -- OK


In [None]:
order_specific_clean['Shipping'].value_counts()

Also looks good

#### Taxes -- OK

In [None]:
order_specific_clean['Taxes'].value_counts()

Some values are oddly high. Let's check the non-zero values

In [None]:
order_specific_clean[order_specific_clean['Taxes'] != 0.0]

Looks OK

#### Total

In [None]:
order_specific_clean[order_specific_clean['Total'] != order_specific_clean['Subtotal'] + order_specific_clean['Shipping']]

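When a free-shipping discount is used, the discount is applied directly to the shipping cost.

WARNING! This query sometimes returns rows in which 'Total' appears to equal 'Subtotal' + 'Shipping'. This is most likely a floating-point rounding effect: the sum of two floats can differ from the stored total in the last decimal places, so an exact `!=` comparison still flags the row.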
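
One way to separate genuine mismatches from rounding noise is to compare with a tolerance instead of exact inequality. The sketch below uses `numpy.isclose` on a hypothetical `demo` frame; the half-cent tolerance (`atol=0.005`) is an assumption, not something taken from the data.

```python
import numpy as np
import pandas as pd

# Hypothetical amounts, for illustration only
demo = pd.DataFrame({
    'Subtotal': [0.1, 10.0, 25.5],
    'Shipping': [0.2, 3.5, 0.0],
    'Total': [0.3, 13.5, 20.5],
})

expected = demo['Subtotal'] + demo['Shipping']

# Exact comparison: also flags row 0, because 0.1 + 0.2 is not exactly 0.3 in floating point
exact_mismatch = demo[demo['Total'] != expected]

# Tolerant comparison: only rows that differ by more than about half a cent (assumed tolerance)
real_mismatch = demo[~np.isclose(demo['Total'], expected, atol=0.005)]

print(exact_mismatch.index.tolist())
print(real_mismatch.index.tolist())
```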

#### Discount Amount -- OK

In [None]:
order_specific_clean['Discount Amount'].value_counts()

In [None]:
order_specific_clean['Discount Amount'].dtype

Looks good

#### Shipping Method

In [None]:
order_specific_clean['Shipping Method'].value_counts()

Looks good, but we will replace the values with shorter labels. This will be useful later if we want to display them in a plot.

In [None]:
# Short labels for the known shipping methods; anything else becomes 'otros'
shipping_labels = {'Envío estándar': 'shargo', 'Envío gratuito': 'gratuito', 'Recogida en el mercado': 'mercado', 'El reparto se realiza a través de Shargo': 'shargo'}
order_specific_clean['Shipping Method'] = order_specific_clean['Shipping Method'].map(lambda x: shipping_labels.get(x, 'otros'))

#### Billing City

In [None]:
order_specific_clean['Billing City'].value_counts()

In [None]:
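# Restrict the replacements to 'Billing City' so that matching values in other columns are left untouched
order_specific_clean['Billing City'] = order_specific_clean['Billing City'].replace(['Barcelona ', ' Barcelona', 'barcelona', 'BARCELONA'], 'Barcelona')
order_specific_clean['Billing City'] = order_specific_clean['Billing City'].replace(['khg', 'dsa', 'test', 'jsjajsjsjjs'], 'Unknown')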
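
The replacements above list each spelling variant by hand. A more general sketch, assuming whitespace and capitalisation are the only differences between variants, is to strip and title-case the column. The `demo` frame is hypothetical, and title-casing may alter legitimate names that contain lowercase particles, so the resulting `value_counts()` should still be reviewed.

```python
import pandas as pd

# Hypothetical city values, for illustration only
demo = pd.DataFrame({'Billing City': ['Barcelona ', ' Barcelona', 'barcelona', 'BARCELONA', 'khg', 'Unknown']})

# Placeholder strings taken from the cell above
placeholders = ['khg', 'dsa', 'test', 'jsjajsjsjjs']

# Remove surrounding whitespace, then map placeholder entries to 'Unknown'
city = demo['Billing City'].str.strip()
city = city.where(~city.isin(placeholders), 'Unknown')

# Title-case last, so that placeholder matching above stays case-sensitive
demo['Billing City'] = city.str.title()
print(demo['Billing City'].value_counts())
```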

In [None]:
order_specific_clean['Billing City'].value_counts()

#### Billing Zip


In [None]:
#### Payment Method

In [None]:
#### Refunded Amount

In [None]:
#### Id

In [None]:
#### Risk Level

In [None]:
#### Source

In [None]:
#### Discount Code


In [None]:
#### Notes