# 2 Data Consistency Checks

## Table of contents
1. [Importing Libraries & Data](#imports)
2. [Products data frame](#paragraph1)
    1. [Mixed-type data](#subparagraph1)
    2. [Missing values](#subparagraph2)
    3. [Duplicates](#subparagraph3)
    4. [Export](#subparagraph4)
3. [Orders data frame](#paragraph2)
    1. [Mixed-type data](#subparagraph4)
    2. [Missing values](#subparagraph5)
    3. [Duplicates](#subparagraph6)
    4. [Export](#subparagraph7)

### 1. Importing Libraries & Data <a name="imports"></a>

In [98]:
# Importing pandas, numpy and os libraries
import pandas as pd
import numpy as np
import os

# Turning project folder path into a string
path = r'C:\Users\deuts\Desktop\Instacart Basket Analysis'

# Creating list with relevant 'orders' columns
vars_list = ['order_id', 'user_id', 'order_number', 'order_dow', 'order_hour_of_day', 'days_since_prior_order']

# Importing 'orders.csv' 
df_ords = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'orders.csv'), usecols = vars_list)

# Importing 'products.csv'
df_prods = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'products.csv'), index_col = False)


### 2. Products data frame <a name="paragraph1"></a>

### 2.1 Mixed-type data <a name="subparagraph1"></a>

In [99]:
# Create a data frame
df_test = pd.DataFrame()

In [100]:
# Create a mixed-type column and display it
df_test['mix'] = ['a','b',1,True]
df_test

Unnamed: 0,mix
0,a
1,b
2,1
3,True


In [101]:
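# Check for mixed types in the test data frame
for col in df_test.columns.tolist():
  # Flag every row whose value type differs from the type of the first value in the column
  weird = (df_test[[col]].applymap(type) != df_test[[col]].iloc[0].apply(type)).any(axis = 1)
  # Print the column name if at least one row was flagged
  if len(df_test[weird]) > 0:
    print(col)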

mix


In [102]:
# Changing data type of 'mix' column to string
df_test['mix'] = df_test['mix'].astype('str')
df_test['mix'].dtype

dtype('O')
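
The same check can be wrapped in a small helper so it can be reused on any data frame. The following is a minimal sketch, not the method used in this notebook: the function name `mixed_type_columns` and the toy frame `df_example` are assumptions made for illustration, and unlike the loop above it ignores missing values and counts the distinct types per column instead of comparing against the first row.

```python
import pandas as pd


def mixed_type_columns(df):
    """Return the names of columns whose non-missing values have more than one Python type."""
    mixed = []
    for col in df.columns:
        # Series.map(type) yields the Python type of every value in the column
        n_types = df[col].dropna().map(type).nunique()
        if n_types > 1:
            mixed.append(col)
    return mixed


# Toy data (hypothetical): 'mix' holds str, int and bool values, 'clean' only str
df_example = pd.DataFrame({
    'mix': ['a', 'b', 1, True],
    'clean': ['w', 'x', 'y', 'z'],
})

found = mixed_type_columns(df_example)
print(found)
```

Returning a list instead of printing makes it easy to loop over the flagged columns afterwards, for example to cast each one with `astype('str')`.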

### 2.2 Missing values <a name="subparagraph2"></a>

In [1]:
# Identifying missing values
df_prods.isnull().sum()

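NameError: name 'df_prods' is not defined

#### This cell was run before the import cell, so it raised a NameError. The row counts below (49693 before filtering, 49677 after) show that 'product_name' is missing in 16 rows.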

In [None]:
# Creating a data frame with the rows that have a missing product name
df_nan = df_prods[df_prods['product_name'].isnull()]
df_nan

In [105]:
# Checking the row count of the products data frame
df_prods.shape

(49693, 5)

In [106]:
# Creating a new data frame without the rows that have a missing product name
df_prods_clean = df_prods[df_prods['product_name'].notnull()]

In [107]:
# Checking the row count of the clean products data frame
df_prods_clean.shape

(49677, 5)
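
The count-then-filter pattern used in this section can be tried on a small frame. This is a sketch on made-up data: `df_example` and its values are hypothetical, and `dropna(subset=...)` is shown as an alternative way to express the same filter, not as the notebook's own approach.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): two of the four product names are missing
df_example = pd.DataFrame({
    'product_id': [1, 2, 3, 4],
    'product_name': ['Tea', np.nan, 'Salt', None],
})

# Count missing values per column
missing_per_column = df_example.isnull().sum()

# Keep only the rows where 'product_name' is present
df_example_clean = df_example.dropna(subset=['product_name'])

# The drop in row count should equal the number of missing names
rows_dropped = len(df_example) - len(df_example_clean)

print(missing_per_column)
print(rows_dropped)
```

Comparing `rows_dropped` with the missing count for the filtered column is a quick way to confirm that the filter removed exactly the intended rows.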

### 2.3 Duplicates <a name="subparagraph3"></a>

In [108]:
# Creating a data frame with the fully duplicated rows of the clean products data frame and displaying it to count them
df_dups = df_prods_clean[df_prods_clean.duplicated()]
df_dups

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
462,462,Fiber 4g Gummy Dietary Supplement,70,11,4.8
18459,18458,Ranger IPA,27,5,9.2
26810,26808,Black House Coffee Roasty Stout Beer,27,5,13.4
35309,35306,Gluten Free Organic Peanut Butter & Chocolate ...,121,14,6.8
35495,35491,Adore Forever Body Wash,127,11,9.9


In [109]:
# Dropping the duplicates into a new data frame and displaying it as a check
df_prods_clean_no_dups = df_prods_clean.drop_duplicates()
df_prods_clean_no_dups

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
1,2,All-Seasons Salt,104,13,9.3
2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,10.5
4,5,Green Chile Anytime Sauce,5,13,4.3
...,...,...,...,...,...
49688,49684,"Vodka, Triple Distilled, Twist of Vanilla",124,5,5.3
49689,49685,En Croute Roast Hazelnut Cranberry,42,1,3.1
49690,49686,Artisan Baguette,112,3,7.8
49691,49687,Smartblend Healthy Metabolism Dry Cat Food,41,8,4.7
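
By default `duplicated()` marks only the second and later copies of a row, which is why each duplicated product appears once in the duplicates frame. The sketch below, on hypothetical toy data, uses `keep=False` to show every copy side by side before dropping the extras.

```python
import pandas as pd

# Toy data (hypothetical): the 'Salt' row appears twice
df_example = pd.DataFrame({
    'product_id': [1, 2, 2, 3],
    'product_name': ['Tea', 'Salt', 'Salt', 'Rice'],
})

# keep=False marks all copies of a duplicated row, not just the repeats
all_copies = df_example[df_example.duplicated(keep=False)]

# drop_duplicates() keeps the first copy of each row by default
df_example_no_dups = df_example.drop_duplicates()

print(all_copies)
print(df_example_no_dups.shape)
```

Looking at all copies first can help to confirm that the rows really are identical before they are removed.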


### 2.4 Export <a name="subparagraph4"></a>

In [110]:
# Exporting clean products data frame
df_prods_clean_no_dups.to_csv(os.path.join(path, '02 Data','Prepared Data', 'products_cleaned.csv'))
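
One detail to be aware of when exporting: `to_csv` writes the index as an extra first column unless `index=False` is passed, and that column comes back as `Unnamed: 0` when the file is read again. The following sketch shows the difference on hypothetical toy data, writing to an in-memory buffer instead of a file.

```python
import io

import pandas as pd

# Toy data (hypothetical)
df_example = pd.DataFrame({'product_id': [1, 2], 'prices': [5.8, 9.3]})

# Export with the default settings: the index is written as an unnamed first column
buffer_with_index = io.StringIO()
df_example.to_csv(buffer_with_index)
buffer_with_index.seek(0)
cols_with_index = pd.read_csv(buffer_with_index).columns.tolist()

# Export with index=False: only the real columns are written
buffer_no_index = io.StringIO()
df_example.to_csv(buffer_no_index, index=False)
buffer_no_index.seek(0)
cols_no_index = pd.read_csv(buffer_no_index).columns.tolist()

print(cols_with_index)
print(cols_no_index)
```

Whether to keep the index depends on how the file is read later; the export above keeps it.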

### 3. Orders data frame <a name="paragraph2"></a>

### 3.1 Mixed-type data <a name="subparagraph4"></a>

In [111]:
# First look at the products data frame: descriptive statistics
df_prods.describe()

Unnamed: 0,product_id,aisle_id,department_id,prices
count,49693.0,49693.0,49693.0,49693.0
mean,24844.345139,67.770249,11.728433,9.994136
std,14343.717401,38.316774,5.850282,453.519686
min,1.0,1.0,1.0,1.0
25%,12423.0,35.0,7.0,4.1
50%,24845.0,69.0,13.0,7.1
75%,37265.0,100.0,17.0,11.2
max,49688.0,134.0,21.0,99999.0


In [112]:
# First five rows of the products data frame
df_prods.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
1,2,All-Seasons Salt,104,13,9.3
2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,10.5
4,5,Green Chile Anytime Sauce,5,13,4.3


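#### The maximum price of 99999.0 looks wrong: it is far above the 75th percentile (11.2) and inflates the standard deviation to about 453.5. It is likely a placeholder or cap rather than a real price.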
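
A suspicious maximum like this can be followed up by listing the rows above a chosen limit. This is a sketch on hypothetical toy data; the cut-off `PRICE_LIMIT = 100` is an assumption for illustration and is not taken from the notebook.

```python
import pandas as pd

# Toy data (hypothetical): one price looks like a placeholder value
df_example = pd.DataFrame({
    'product_name': ['Tea', 'Salt', 'Placeholder item'],
    'prices': [4.5, 9.3, 99999.0],
})

# Hypothetical cut-off; a sensible value depends on the real price range
PRICE_LIMIT = 100

# Rows whose price exceeds the cut-off
suspicious = df_example[df_example['prices'] > PRICE_LIMIT]

print(suspicious)
```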

In [113]:
# Checking df_ords for mixed-type data
for col in df_ords.columns.tolist():
  # Flag every row whose value type differs from the type of the first value in the column
  weird = (df_ords[[col]].applymap(type) != df_ords[[col]].iloc[0].apply(type)).any(axis = 1)
  # Print the column name if at least one row was flagged
  if len(df_ords[weird]) > 0:
    print(col)

#### No column name is printed, so the orders data frame contains no mixed-type data.

In [114]:
# Checking for missing data in the orders data frame
df_ords.isnull().sum()

order_id                       0
user_id                        0
order_number                   0
order_dow                      0
order_hour_of_day              0
days_since_prior_order    206209
dtype: int64

### 3.2 Missing values <a name="subparagraph5"></a>

In [115]:
# Creating a subset with the rows that have a missing 'days_since_prior_order' value
df_ords_nan = df_ords[df_ords['days_since_prior_order'].isnull()]
df_ords_nan

Unnamed: 0,order_id,user_id,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2539329,1,1,2,8,
11,2168274,2,1,2,11,
26,1374495,3,1,1,14,
39,3343014,4,1,6,11,
45,2717275,5,1,3,12,
...,...,...,...,...,...,...
3420930,969311,206205,1,4,12,
3420934,3189322,206206,1,3,18,
3421002,2166133,206207,1,6,19,
3421019,2227043,206208,1,1,15,


#### The 'days_since_prior_order' column is missing in 206209 rows. Explanation: the rows shown all have order_number = 1. A customer's first order has no previous order to count from, so the value is expected to be missing for new customers.

In [116]:
# Addressing the missing values
# Creating a copy of the orders data frame with a new column that flags new buyers
df_ords_clean = df_ords.copy()
df_ords_clean['new_customer'] = df_ords_clean['days_since_prior_order'].isnull()
df_ords_clean

Unnamed: 0,order_id,user_id,order_number,order_dow,order_hour_of_day,days_since_prior_order,new_customer
0,2539329,1,1,2,8,,True
1,2398795,1,2,3,7,15.0,False
2,473747,1,3,3,12,21.0,False
3,2254736,1,4,4,7,29.0,False
4,431534,1,5,4,15,28.0,False
...,...,...,...,...,...,...,...
3421078,2266710,206209,10,5,18,29.0,False
3421079,1854736,206209,11,4,10,30.0,False
3421080,626363,206209,12,1,12,18.0,False
3421081,2977660,206209,13,1,12,7.0,False
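
Whether a new column also shows up on the source data frame depends on whether the frame was copied before the column was added. The sketch below, with hypothetical names and toy data, contrasts plain assignment (a second name for the same object) with `.copy()`.

```python
import pandas as pd

# Plain assignment: both names refer to the same data frame
df_original = pd.DataFrame({'order_id': [1, 2]})
df_alias = df_original
df_alias['flag'] = True
alias_changed_original = 'flag' in df_original.columns

# Explicit copy: the new column stays on the copy only
df_original_2 = pd.DataFrame({'order_id': [1, 2]})
df_copy = df_original_2.copy()
df_copy['flag'] = True
copy_changed_original = 'flag' in df_original_2.columns

print(alias_changed_original)
print(copy_changed_original)
```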


#### With the new column 'new_customer', each null value in 'days_since_prior_order' can be matched directly to the customer type. The null value makes sense whenever the customer is a new customer (new_customer = True).
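
The explanation can also be checked directly: if missing values only occur on first orders, the flag should be True exactly where `order_number` is 1. This is a minimal sketch on hypothetical toy data that reuses the notebook's column names.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): two customers with two orders each
df_example = pd.DataFrame({
    'user_id': [1, 1, 2, 2],
    'order_number': [1, 2, 1, 2],
    'days_since_prior_order': [np.nan, 15.0, np.nan, 7.0],
})

# Flag rows without a prior order, as in the cell above
df_example['new_customer'] = df_example['days_since_prior_order'].isnull()

# True only if the flag and "this is the first order" agree on every row
is_first_order = df_example['order_number'] == 1
nulls_match_first_orders = bool((df_example['new_customer'] == is_first_order).all())

print(nulls_match_first_orders)
```

If the comparison came back False on real data, some missing values would need a different explanation.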

### 3.3 Duplicates <a name="subparagraph6"></a>

In [117]:
# Checking for duplicates in the orders data frame
df_ords_dups = df_ords_clean[df_ords_clean.duplicated()]
df_ords_dups

Unnamed: 0,order_id,user_id,order_number,order_dow,order_hour_of_day,days_since_prior_order,new_customer


#### There are no duplicates in the orders data frame.
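
`duplicated()` without arguments compares entire rows. A stricter variant checks only a key column, which would also catch the same order appearing twice with different values elsewhere. The sketch below uses hypothetical toy data and assumes `order_id` is meant to be unique.

```python
import pandas as pd

# Toy data (hypothetical): order 11 appears twice with different hours
df_example = pd.DataFrame({
    'order_id': [10, 11, 11],
    'order_hour_of_day': [8, 9, 14],
})

# Full-row comparison: the two rows for order 11 differ, so nothing is flagged
full_row_dups = int(df_example.duplicated().sum())

# Key-only comparison: the repeated order_id is flagged
key_dups = int(df_example.duplicated(subset=['order_id']).sum())

print(full_row_dups)
print(key_dups)
```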

### 3.4 Export <a name="subparagraph7"></a>

In [118]:
# Exporting clean orders data frame
df_ords_clean.to_csv(os.path.join(path, '02 Data','Prepared Data', 'orders_cleaned.csv'))