# Orders - Data Wrangling and Data Consistency Checks:

1. Importing libraries and dataset
2. Checking for columns, datatype, shape using .info()
3. Removing unnecessary columns using .drop()
4. Renaming columns using .rename()
5. Addressing missing values
6. Addressing duplicates
7. Checking for mixed datatype
8. Changing datatypes to reduce memory usage
9. Performing Descriptive Analysis
10. Exporting wrangled, consistency checked dataframe

## 1. Importing libraries and dataset

In [1]:
# Importing libraries

import pandas as pd
import os

In [2]:
# Accessing EnvFile for path

%run EnvFile.ipynb

Stored 'path' (str)


In [3]:
# Importing orders.csv to dataframe

df_ords = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'orders.csv'))

In [4]:
# Checking the head

df_ords.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2539329,1,prior,1,2,8,
1,2398795,1,prior,2,3,7,15.0
2,473747,1,prior,3,3,12,21.0
3,2254736,1,prior,4,4,7,29.0
4,431534,1,prior,5,4,15,28.0


## 2. Checking for columns, datatype, shape using .info()

In [5]:
# Checking the info for columns, datatypes, shape of dataframe

df_ords.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3421083 entries, 0 to 3421082
Data columns (total 7 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   order_id                int64  
 1   user_id                 int64  
 2   eval_set                object 
 3   order_number            int64  
 4   order_dow               int64  
 5   order_hour_of_day       int64  
 6   days_since_prior_order  float64
dtypes: float64(1), int64(5), object(1)
memory usage: 182.7+ MB


#### The shape of df_ords before consistency checks is (3421083, 7) with memory usage of 182.7+ MB.

## 3. Removing unnecessary columns using .drop()

In [6]:
# Removing unnecessary column eval_set

df_ords.drop(columns = ['eval_set'], inplace = True)

## 4. Renaming columns using .rename()

In [7]:
# Renaming columns with appropriate names

df_ords.rename(columns = {'order_dow':'order_day_of_week'}, inplace = True)

## 5. Addressing missing values

In [8]:
# Finding missing values

df_ords.isnull().sum()

order_id                       0
user_id                        0
order_number                   0
order_day_of_week              0
order_hour_of_day              0
days_since_prior_order    206209
dtype: int64

In [9]:
# Creating a subset of missing values in df_ords

df_ords_nan = df_ords[df_ords['days_since_prior_order'].isnull()]

In [10]:
df_ords_nan

Unnamed: 0,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order
0,2539329,1,1,2,8,
11,2168274,2,1,2,11,
26,1374495,3,1,1,14,
39,3343014,4,1,6,11,
45,2717275,5,1,3,12,
...,...,...,...,...,...,...
3420930,969311,206205,1,4,12,
3420934,3189322,206206,1,3,18,
3421002,2166133,206207,1,6,19,
3421019,2227043,206208,1,1,15,


#### Findings of missing values

All 206209 missing values of days_since_prior_order in df_ords_nan appear to belong to rows with order_number 1, one for each user_id from 1 to 206209. 
This makes sense: no order precedes a user's first order, so there is no days-since-prior-order value to record.
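
The finding above can be checked directly instead of being read off the truncated output. The following is a minimal sketch on a toy dataframe (the values are illustrative, not taken from orders.csv); the same comparison should apply to df_ords, assuming its columns are named as shown above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_ords (illustrative values only)
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 2],
    'order_number': [1, 2, 1, 2, 3],
    'days_since_prior_order': [np.nan, 15.0, np.nan, 10.0, 3.0],
})

missing = toy['days_since_prior_order'].isnull()
first_order = toy['order_number'] == 1

# True only when the null rows and the first-order rows are exactly the same rows
nulls_are_first_orders = bool((missing == first_order).all())

# True only when no user has more than one null row
one_null_per_user = bool(toy.loc[missing, 'user_id'].is_unique)

print(nulls_are_first_orders, one_null_per_user)
```

If either check came back False on the real data, the nulls would need a closer look before being treated as first orders.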

#### Addressing missing values

I've chosen not to impute or remove the null values, since statistics can still be calculated with them present and those rows are needed to analyze each customer's first order. 
An order_number of 1 already acts as a flag, so no separate flag column is needed.

## 6. Addressing duplicates

In [11]:
# Check for duplicates

df_ords[df_ords.duplicated()]

Unnamed: 0,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order


There are no duplicates, so no changes are needed.
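
Besides full-row duplicates, it can be worth checking whether a key column repeats on its own. The sketch below assumes order_id is meant to be unique per order, which the notebook does not state; the toy values are illustrative.

```python
import pandas as pd

# Toy stand-in for df_ords: order_id 11 appears twice, but the rows differ in user_id
toy = pd.DataFrame({
    'order_id': [10, 11, 11, 12],
    'user_id': [1, 1, 3, 2],
})

# Rows that are identical across every column
full_row_dupes = int(toy.duplicated().sum())

# Rows whose order_id has already appeared, regardless of the other columns
order_id_dupes = int(toy['order_id'].duplicated().sum())

print(full_row_dupes, order_id_dupes)
```

Here the full-row check finds nothing while the key check flags one row, which is the case a full-row check alone would miss.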

## 7. Checking for mixed datatype

In [12]:
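# Check for mixed-type data in df_ords
# Note: DataFrame.applymap is deprecated as of pandas 2.1; DataFrame.map replaces it in newer versions

for col in df_ords.columns.tolist():
    # Flag rows whose value type differs from the type of the first row in this column
    mixeddata = (df_ords[[col]].applymap(type) != df_ords[[col]].iloc[0].apply(type)).any(axis = 1)
    if len(df_ords[mixeddata]) > 0:
        print(col)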

There is no mixed-type data: the loop printed no column names.
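
An alternative way to express the same check is to count the distinct Python types in each column. This is a sketch on a toy dataframe with hypothetical column names, not the method used in the cell above.

```python
import pandas as pd

# Toy dataframe: 'clean' holds one type, 'mixed' holds both str and int
toy = pd.DataFrame({
    'clean': [1, 2, 3],
    'mixed': ['a', 2, 'c'],
})

# A column is mixed when its values map to more than one Python type
mixed_cols = [col for col in toy.columns if toy[col].map(type).nunique() > 1]

print(mixed_cols)
```

Series.map is available across pandas versions, so this form avoids relying on applymap.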

## 8. Changing datatypes to reduce memory usage

In [13]:
# Change datatypes to reduce memory usage

df_ords['order_id'] = df_ords['order_id'].astype('int32')
df_ords['user_id'] = df_ords['user_id'].astype('int32')
df_ords['order_number'] = df_ords['order_number'].astype('int8')
df_ords['order_day_of_week'] = df_ords['order_day_of_week'].astype('int8')
df_ords['order_hour_of_day'] = df_ords['order_hour_of_day'].astype('int8')
df_ords['days_since_prior_order'] = df_ords['days_since_prior_order'].astype('float16')

In [14]:
# Checking for memory usage reduction using info

df_ords.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3421083 entries, 0 to 3421082
Data columns (total 6 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   order_id                int32  
 1   user_id                 int32  
 2   order_number            int8   
 3   order_day_of_week       int8   
 4   order_hour_of_day       int8   
 5   days_since_prior_order  float16
dtypes: float16(1), int32(2), int8(3)
memory usage: 42.4 MB


#### The shape of df_ords after consistency checks is (3421083, 6) with memory usage of 42.4 MB.
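
Downcasting integers is only safe when every value fits the narrower type. A minimal sketch of a range check that could be run before astype() follows; the helper fits() and the column big_id are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy columns: the first two mirror ranges seen in df_ords, big_id is made up
toy = pd.DataFrame({
    'order_number': [1, 50, 100],
    'order_hour_of_day': [0, 12, 23],
    'big_id': [1, 40000, 3000000],
})

# Candidate target dtypes to test (int16 for big_id is deliberately too small)
targets = {'order_number': 'int8', 'order_hour_of_day': 'int8', 'big_id': 'int16'}

def fits(series, dtype):
    """Return True when the series' min and max lie within the integer dtype's range."""
    info = np.iinfo(np.dtype(dtype))
    return bool(info.min <= series.min() and series.max() <= info.max)

report = {col: fits(toy[col], dtype) for col, dtype in targets.items()}
print(report)
```

A False entry means astype() with that dtype would not preserve the values, so a wider type is needed.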

## 9. Performing Descriptive Analysis

In [15]:
# Checking descriptive analysis

df_ords.describe()

Unnamed: 0,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order
count,3421083.0,3421083.0,3421083.0,3421083.0,3421083.0,3214874.0
mean,1710542.0,102978.2,17.15486,2.776219,13.45202,
std,987581.7,59533.72,17.73316,2.046829,4.226088,0.0
min,1.0,1.0,1.0,0.0,0.0,0.0
25%,855271.5,51394.0,5.0,1.0,10.0,4.0
50%,1710542.0,102689.0,11.0,3.0,13.0,7.0
75%,2565812.0,154385.0,23.0,5.0,16.0,15.0
max,3421083.0,206209.0,100.0,6.0,23.0,30.0


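#### All the min and max values are as expected.

The blank mean and the 0.0 std for days_since_prior_order are not properties of the data. They are most likely artifacts of storing that column as float16 in step 8, since float16 has too little range and precision for sums over millions of rows.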
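
Because days_since_prior_order was converted to float16 in step 8, its mean and std in the table above may not be reliable. One way to get trustworthy values is to upcast the column before aggregating. The sketch below uses the five values from the head of the dataframe as toy input; it does not reproduce results for the full dataset.

```python
import numpy as np
import pandas as pd

# The first five days_since_prior_order values from df_ords.head(), stored as float16
s16 = pd.Series([np.nan, 15.0, 21.0, 29.0, 28.0], dtype='float16')

# Upcast to float64 so the intermediate sums are computed at full precision
stats = s16.astype('float64').agg(['mean', 'std'])

print(stats)
```

The upcast is temporary and leaves the stored column at its compact dtype.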

## 10. Exporting wrangled, consistency checked dataframe

In [16]:
# Export cleaned dataframe

df_ords.to_csv(os.path.join(path, '02 Data','Prepared Data', 'orders_checked.csv'), index = False)
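
A CSV file stores only text, so the reduced dtypes from step 8 are not preserved by this export. A sketch of restoring them on import with the dtype argument of pd.read_csv follows. It reads from an in-memory string rather than the real file, and the float32 choice for days_since_prior_order is an assumption made for this example.

```python
import io
import pandas as pd

# Two rows in the shape of the exported file (a subset of columns, for brevity)
csv_text = (
    "order_id,order_number,days_since_prior_order\n"
    "2539329,1,\n"
    "2398795,2,15.0\n"
)

# Default import: pandas infers int64 and float64
default = pd.read_csv(io.StringIO(csv_text))

# Import with an explicit dtype mapping to keep the smaller types
typed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={'order_id': 'int32', 'order_number': 'int8', 'days_since_prior_order': 'float32'},
)

print(default.dtypes)
print(typed.dtypes)
```

A binary format that records dtypes, such as pickle via to_pickle(), would be another option if later notebooks need the types unchanged.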