# 4.5 Data Consistency Checks

## Instacart Grocery Basket Analysis

#### -Errol Hinkamp

##### Table of Contents

1. Import libraries
2. Import data
3. Exercise walkthrough
- 3.1 Check for mixed-type data
- 3.2 Remove price outliers
- 3.3 Check for missing values
- 3.4 Check for duplicates
4. Examine products dataframe
5. Check for mixed-type data
6. Check for missing values
7. Check for duplicates
8. Export dataframes

# 1. Import libraries

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import os

# 2. Import data

In [2]:
# Import first dataframe
path=r'C:\Users\Errol\Documents\Data Analyst Work\Achievement 4\Instacart Basket Analysis'
df_ords = pd.read_csv(os.path.join(path, '02 Data', 'Prepared Data', 'orders_wrangled.csv'), index_col=0)
# Import second dataframe
df_prods = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'products.csv'), index_col = False)


# 3. Exercise walkthrough

### 3.1 Check for mixed-type data

In [3]:
# Create dataframe
df_test=pd.DataFrame()

In [4]:
# Create mixed-type column
df_test['mix']=['a','b',1,True]

In [5]:
df_test

Unnamed: 0,mix
0,a
1,b
2,1
3,True


In [6]:
# Check for mixed-type columns: flag any column where a value's type differs from the type of the first row
for col in df_test.columns.tolist():
    weird = (df_test[[col]].applymap(type) != df_test[[col]].iloc[0].apply(type)).any(axis=1)
    if len(df_test[weird]) > 0:
        print(col)

mix


In [7]:
# Change column data type
df_test['mix'] = df_test['mix'].astype('str')
# Check column data type
df_test.dtypes

mix    object
dtype: object
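
The check above relies on `DataFrame.applymap`, which is deprecated in recent pandas releases (2.1 and later) in favor of `DataFrame.map`. A minimal alternative sketch, assuming only that `Series.map` is available, counts the distinct Python types in each column. The helper name `find_mixed_type_columns` and the `df_demo` data are hypothetical and not part of the exercise. As with the original loop, a text column containing NaN would also be flagged, because NaN is a float.

```python
import pandas as pd


def find_mixed_type_columns(df):
    """Return the names of columns that hold more than one Python type."""
    mixed = []
    for col in df.columns:
        # Series.map(type) yields the Python type of every value in the column
        if df[col].map(type).nunique() > 1:
            mixed.append(col)
    return mixed


# Hypothetical demo data mirroring the df_test example
df_demo = pd.DataFrame({'mix': ['a', 'b', 1, True], 'clean': [1, 2, 3, 4]})
mixed_before = find_mixed_type_columns(df_demo)

# After casting to string, the column should no longer be flagged
df_demo['mix'] = df_demo['mix'].astype('str')
mixed_after = find_mixed_type_columns(df_demo)

print(mixed_before, mixed_after)
```

Returning a list instead of printing makes it easier to loop over the flagged columns and cast each one.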

### 3.2 Remove price outliers

In [8]:
# Looking ahead to exercise 4.9: replace outlying prices (above 100) with NaN now
df_prods.loc[df_prods['prices'] > 100, 'prices'] = np.nan

In [9]:
# Confirm outlying prices have been removed
df_prods['prices'].max()

25.0
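
Note that this step replaces the outlying prices with NaN rather than dropping the rows, so the affected products show up as missing values in the next check. A small sketch on made-up data (the prices in `df_demo` are hypothetical, not taken from the products file) shows the effect on the missing-value count and on the maximum:

```python
import numpy as np
import pandas as pd

# Hypothetical prices; the two large values stand in for outliers
df_demo = pd.DataFrame({'prices': [5.0, 25.0, 14900.0, 99999.0]})

n_missing_before = df_demo['prices'].isnull().sum()

# Same pattern as the notebook: overwrite prices above 100 with NaN
df_demo.loc[df_demo['prices'] > 100, 'prices'] = np.nan

n_flagged = df_demo['prices'].isnull().sum() - n_missing_before
max_price = df_demo['prices'].max()  # max() skips NaN by default

print(n_flagged, max_price)
```

Counting missing values before and after the assignment gives the number of prices that were flagged, which is a slightly stronger confirmation than checking the maximum alone.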

### 3.3 Check for missing values

In [10]:
# Search for missing values in products dataframe
df_prods.isnull().sum()

product_id        0
product_name     16
aisle_id          0
department_id     0
prices            2
dtype: int64

In [11]:
# Create dataframe containing only rows with a missing product_name
df_nan = df_prods[df_prods['product_name'].isnull()]

In [12]:
# Create dataframe containing only rows with a missing price
df_nan2 = df_prods[df_prods['prices'].isnull()]

In [13]:
df_nan

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
33,34,,121,14,12.2
68,69,,26,7,11.8
115,116,,93,3,10.8
261,262,,110,13,12.1
525,525,,109,11,1.2
1511,1511,,84,16,14.3
1780,1780,,126,11,12.3
2240,2240,,52,1,14.2
2586,2586,,104,13,12.4
3159,3159,,126,11,13.1


In [14]:
df_nan2

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
21554,21553,Lowfat 2% Milkfat Cottage Cheese,108,16,
33666,33664,2 % Reduced Fat Milk,84,16,


In [15]:
# Check dataframe shape before excising missing values
df_prods.shape

(49693, 5)

In [16]:
# Create dataframe without missing values
df_prods_clean = df_prods[df_prods['product_name'].notnull()]
df_prods_clean = df_prods_clean[df_prods_clean['prices'].notnull()]

In [17]:
# Check dataframe shape after excising missing values
df_prods_clean.shape

(49675, 5)
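
The two boolean filters above could also be written as a single `dropna` call restricted to the columns of interest. The sketch below uses hypothetical demo data; only the column names `product_name` and `prices` come from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical demo data with one missing name and one missing price
df_demo = pd.DataFrame({
    'product_name': ['Milk', None, 'Bread', 'Eggs'],
    'prices': [2.5, 1.2, np.nan, 3.0],
})

# Drop a row if either of the listed columns is missing
df_demo_clean = df_demo.dropna(subset=['product_name', 'prices'])

rows_dropped = len(df_demo) - len(df_demo_clean)
print(rows_dropped)
```

Comparing `rows_dropped` with the totals from `isnull().sum()` is a quick sanity check. In the notebook, 49693 - 49675 = 18 rows were removed, matching the 16 missing names plus the 2 missing prices.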

### 3.4 Check for duplicates

In [18]:
# Create dataframe of duplicates from products dataframe
df_dups = df_prods_clean[df_prods_clean.duplicated()]

In [19]:
df_dups

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
462,462,Fiber 4g Gummy Dietary Supplement,70,11,4.8
18459,18458,Ranger IPA,27,5,9.2
26810,26808,Black House Coffee Roasty Stout Beer,27,5,13.4
35309,35306,Gluten Free Organic Peanut Butter & Chocolate ...,121,14,6.8
35495,35491,Adore Forever Body Wash,127,11,9.9


In [20]:
# Check shape of dataframe before removing duplicates
df_prods_clean.shape

(49675, 5)

In [21]:
# Make new dataframe without duplicates
df_prods_clean_no_dups = df_prods_clean.drop_duplicates()

In [22]:
# Check shape of new dataframe
df_prods_clean_no_dups.shape

(49670, 5)
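
By default `duplicated()` marks only the second and later copies of a row, which is why `df_dups` holds 5 rows and exactly 5 rows disappear after `drop_duplicates()`. A short sketch on hypothetical data shows the difference between the default and `keep=False`, which marks every copy:

```python
import pandas as pd

# Hypothetical demo data: product 2 appears twice
df_demo = pd.DataFrame({
    'product_id': [1, 2, 2, 3],
    'product_name': ['A', 'B', 'B', 'C'],
})

later_copies = df_demo[df_demo.duplicated()]           # default keep='first'
all_copies = df_demo[df_demo.duplicated(keep=False)]   # every member of a duplicate group
deduped = df_demo.drop_duplicates()                    # keeps the first copy of each row

print(len(later_copies), len(all_copies), len(deduped))
```

Using `keep=False` can be handy for inspecting the full duplicate groups side by side before deciding which copy to keep.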

# 4. Examine products dataframe

In [23]:
# Get info on cleaned products dataframe
df_prods_clean_no_dups.describe()

Unnamed: 0,product_id,aisle_id,department_id,prices
count,49670.0,49670.0,49670.0,49670.0
mean,24850.238716,67.761305,11.72877,7.680437
std,14340.93185,38.31606,5.850834,4.199381
min,1.0,1.0,1.0,1.0
25%,12432.25,35.0,7.0,4.1
50%,24850.5,69.0,13.0,7.1
75%,37268.75,100.0,17.0,11.1
max,49688.0,134.0,21.0,25.0


##### The maximum value of product_id (49688) is higher than the count of values (49670). This doesn't necessarily mean there's an issue, since the product_id doesn't have any meaningful numeric value and is just an identifier. The difference between the max and the count is 18, which corresponds to the rows with missing values that were dropped earlier: 16 with a missing product_name and 2 with a missing price. The 5 duplicate rows that were removed don't add to the gap, because the first copy of each duplicated product is still in the dataframe. Mystery solved, I say.
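
The row counts reported in this notebook can be reconciled with a few lines of arithmetic. Every number below is copied from the outputs above; the variable names are illustrative only.

```python
# Figures taken from the notebook outputs
rows_original = 49693    # df_prods.shape[0]
missing_names = 16       # product_name nulls
missing_prices = 2       # prices nulls (after flagging outliers)
duplicate_rows = 5       # rows in df_dups
max_product_id = 49688   # max of product_id in describe()

rows_after_missing = rows_original - missing_names - missing_prices
rows_final = rows_after_missing - duplicate_rows
gap = max_product_id - rows_final

print(rows_after_missing, rows_final, gap)
```

The gap between the highest product_id and the final row count equals the 18 rows dropped for missing values. The original row count minus the 5 duplicates also equals the highest product_id, which is consistent with each ID from 1 to 49688 appearing once in the source file apart from the duplicates.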

# 5. Check for mixed-type data

In [24]:
# Check for mixed-type columns in orders dataframe
for col in df_ords.columns.tolist():
    weird = (df_ords[[col]].applymap(type) != df_ords[[col]].iloc[0].apply(type)).any(axis=1)
    if len(df_ords[weird]) > 0:
        print(col)

##### No mixed-type columns

# 6. Check for missing values

In [25]:
# Search for missing values in orders dataframe
df_ords.isnull().sum()

order_id                      0
user_id                       0
order_number                  0
orders_day_of_week            0
order_hour_of_day             0
days_since_last_order    206209
dtype: int64

##### There are missing values in the days_since_last_order field, but that's to be expected since each customer's first order will naturally have a blank in that column. No action is needed.
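
One way to back up this explanation is to confirm that the blanks fall only on first orders. The sketch below runs on hypothetical demo data and assumes that a customer's first order is the one with `order_number == 1`, which the notebook does not state explicitly. The column names are the ones shown in the output above.

```python
import numpy as np
import pandas as pd

# Hypothetical demo data: two customers, first orders have no previous order
df_demo = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2],
    'order_number': [1, 2, 3, 1, 2],
    'days_since_last_order': [np.nan, 7.0, 14.0, np.nan, 30.0],
})

is_missing = df_demo['days_since_last_order'].isnull()
is_first_order = df_demo['order_number'] == 1  # assumption: numbering starts at 1

# True only if a value is missing exactly when the order is a first order
missing_only_on_first = bool((is_missing == is_first_order).all())
n_missing = int(is_missing.sum())
n_users = df_demo['user_id'].nunique()

print(missing_only_on_first, n_missing, n_users)
```

If the same comparison holds on `df_ords`, the 206209 missing values should equal the number of distinct `user_id` values.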

# 7. Check for duplicates

In [26]:
# Create new dataframe containing only duplicates
df_ords_dups = df_ords[df_ords.duplicated()]
df_ords_dups

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_last_order


##### No duplicates

# 8. Export dataframes

In [27]:
# Export df_prods_clean_no_dups
df_prods_clean_no_dups.to_csv(os.path.join(path, '02 Data','Prepared Data', 'products_checked.csv'))
# Export df_ords
df_ords.to_csv(os.path.join(path, '02 Data','Prepared Data','orders_checked.csv'))
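
Both exports write the dataframe index as the first CSV column. That is fine as long as the file is later read back with `index_col=0`, as the orders file was in section 2. Otherwise the index comes back as an extra `Unnamed: 0` column. A minimal round-trip sketch, using an in-memory buffer and hypothetical data in place of the real files:

```python
import io

import pandas as pd

# Hypothetical demo data standing in for the cleaned products dataframe
df_demo = pd.DataFrame({'product_id': [1, 2], 'prices': [5.0, 7.5]})

# Write to an in-memory buffer instead of a file; the index is included by default
buffer = io.StringIO()
df_demo.to_csv(buffer)

# Reading without index_col turns the saved index into a regular column
buffer.seek(0)
df_plain = pd.read_csv(buffer)

# Reading with index_col=0 restores the original column layout
buffer.seek(0)
df_indexed = pd.read_csv(buffer, index_col=0)

print(df_plain.columns.tolist())
print(df_indexed.columns.tolist())
```

Passing `index=False` to `to_csv` is another option if the index carries no information, at the cost of changing how later notebooks need to read the file.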