# Exercise 4.5 – Data Consistency Checks
### Purpose
This notebook performs data consistency checks on the Instacart datasets,
including checks for mixed data types, missing values, and duplicate records,
to ensure data quality prior to analysis and visualization.

In [1]:
import pandas as pd
import numpy as np
import os

In [3]:
df_prods = pd.read_csv(
    r"C:\Users\sheid\OneDrive\Documents\Instacart Basket Analysis 01-05-2026\02 Data\Original Data\products.csv"
)

In [4]:
df_ords = pd.read_csv(
    r"C:\Users\sheid\OneDrive\Documents\Instacart Basket Analysis 01-05-2026\02 Data\Prepared Data\orders_wrangled.csv"
)

In [5]:
df_prods.shape

(49693, 5)

In [6]:
df_ords.shape

(3421083, 6)

In [7]:
# Create an empty test dataframe
df_test = pd.DataFrame()

In [8]:
# Create a mixed-type column
df_test['mix'] = ['a', 'b', 1, True]

In [9]:
df_test.head()

Unnamed: 0,mix
0,a
1,b
2,1
3,True


In [10]:
# Check for mixed data types in dataframe columns
for col in df_test.columns.tolist():
    # Compare each value's type with the type of the column's first value
    col_types = df_test[col].map(type)
    weird = col_types != col_types.iloc[0]
    if weird.any():
        print(col)

mix


In [11]:
# Fix mixed-type column by converting all values to string
df_test['mix'] = df_test['mix'].astype('str')

In [12]:
df_test.head()

Unnamed: 0,mix
0,a
1,b
2,1
3,True


In [13]:
# Re-check for mixed data types
for col in df_test.columns.tolist():
    # Compare each value's type with the type of the column's first value
    col_types = df_test[col].map(type)
    weird = col_types != col_types.iloc[0]
    if weird.any():
        print(col)


In [14]:
df_prods.isnull().sum()

product_id        0
product_name     16
aisle_id          0
department_id     0
prices            0
dtype: int64

In [15]:
df_nan = df_prods[df_prods['product_name'].isnull()]
df_nan

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
33,34,,121,14,12.2
68,69,,26,7,11.8
115,116,,93,3,10.8
261,262,,110,13,12.1
525,525,,109,11,1.2
1511,1511,,84,16,14.3
1780,1780,,126,11,12.3
2240,2240,,52,1,14.2
2586,2586,,104,13,12.4
3159,3159,,126,11,13.1


In [16]:
df_prods.shape

(49693, 5)

In [17]:
df_prods_clean = df_prods[df_prods['product_name'].notnull()]

In [18]:
df_prods_clean.shape

(49677, 5)

In [19]:
df_dups = df_prods_clean[df_prods_clean.duplicated()]

In [20]:
df_dups

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
462,462,Fiber 4g Gummy Dietary Supplement,70,11,4.8
18459,18458,Ranger IPA,27,5,9.2
26810,26808,Black House Coffee Roasty Stout Beer,27,5,13.4
35309,35306,Gluten Free Organic Peanut Butter & Chocolate ...,121,14,6.8
35495,35491,Adore Forever Body Wash,127,11,9.9


In [21]:
df_dups.shape

(5, 5)

In [22]:
df_prods_clean_no_dups = df_prods_clean.drop_duplicates()

In [23]:
df_prods_clean_no_dups.shape

(49672, 5)

In [24]:
prepared_data_path = r"C:\Users\sheid\OneDrive\Documents\Instacart Basket Analysis 01-05-2026\02 Data\Prepared Data"

In [25]:
df_prods_clean_no_dups.to_csv(
    os.path.join(prepared_data_path, "products_checked.csv"),
    index=False
)

In [26]:
os.path.exists(os.path.join(prepared_data_path, "products_checked.csv"))

True

In [27]:
df_prods_checked = pd.read_csv(os.path.join(prepared_data_path, "products_checked.csv"))
df_prods_checked.shape

(49672, 5)

In [29]:
# Step 1: Load cleaned products dataset (exported after missing values + duplicates removal)
prepared_data_path = r"C:\Users\sheid\OneDrive\Documents\Instacart Basket Analysis 01-05-2026\02 Data\Prepared Data"

df_prods_checked = pd.read_csv(os.path.join(prepared_data_path, "products_checked.csv"))

# Quick check: rows/columns should match what we confirmed (49,672 rows, 5 columns)
df_prods_checked.shape

(49672, 5)

In [30]:
# Step 2: Descriptive statistics for orders dataset
df_ords.describe()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
count,3421083.0,3421083.0,3421083.0,3421083.0,3421083.0,3214874.0
mean,1710542.0,102978.2,17.15486,2.776219,13.45202,11.11484
std,987581.7,59533.72,17.73316,2.046829,4.226088,9.206737
min,1.0,1.0,1.0,0.0,0.0,0.0
25%,855271.5,51394.0,5.0,1.0,10.0,4.0
50%,1710542.0,102689.0,11.0,3.0,13.0,7.0
75%,2565812.0,154385.0,23.0,5.0,16.0,15.0
max,3421083.0,206209.0,100.0,6.0,23.0,30.0


## Step 2: Descriptive Statistics – Orders Dataset

Descriptive statistics were reviewed for the orders dataset using `df_ords.describe()`.  
All numeric columns fall within their expected ranges. The identifier columns (`order_id`, `user_id`) start at 1, and `order_id` runs from 1 to 3,421,083, matching the row count. The temporal columns are encoded as expected: `orders_day_of_week` ranges from 0 to 6 and `order_hour_of_day` from 0 to 23.  

Missing values appear only in `days_since_prior_order`, whose count (3,214,874) is 206,209 lower than the 3,421,083 rows in the dataset. This is expected, because the field is undefined for a customer's first order. No anomalies or unexpected values were identified that require further investigation.
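
The range review above can also be written as an explicit check. The following is a minimal sketch on a small made-up frame, not the actual orders data; `sample`, `expected_ranges`, and `out_of_range` are hypothetical names introduced for illustration, and the bounds are simply the minimum and maximum values reported by `describe()`.

```python
import pandas as pd

# Hypothetical sample standing in for df_ords (values invented for illustration)
sample = pd.DataFrame({
    "orders_day_of_week": [0, 3, 6, 2],
    "order_hour_of_day": [0, 13, 23, 8],
    "days_since_prior_order": [None, 7.0, 30.0, 0.0],
})

# Assumed valid bounds, taken from the min/max values shown by describe()
expected_ranges = {
    "orders_day_of_week": (0, 6),
    "order_hour_of_day": (0, 23),
    "days_since_prior_order": (0, 30),
}

def out_of_range(df, ranges):
    """Return {column: number of non-missing values outside [low, high]}."""
    counts = {}
    for col, (low, high) in ranges.items():
        values = df[col].dropna()  # missing values are checked separately
        counts[col] = int((~values.between(low, high)).sum())
    return counts

violations = out_of_range(sample, expected_ranges)
print(violations)
```

A result of zero for every column would confirm the encoding; any non-zero count points to rows worth inspecting.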


## Step 3: Mixed-Type Data Check – Orders Dataset

In [32]:
# Step 3: Check for mixed-type data in df_ords

for col in df_ords.columns.tolist():
    # Compare each value's type with the type of the column's first value
    col_types = df_ords[col].map(type)
    weird = col_types != col_types.iloc[0]

    if weird.any():
        print(col)


A mixed-type data check was performed on the orders dataset using a column-wise data type comparison.  
No mixed-type columns were identified, indicating that all variables contain consistent data types and do not require correction at this stage.
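
The column-wise type comparison can be wrapped in a small reusable helper. This is a sketch rather than the notebook's own method: `mixed_type_columns` and the `demo` frame are hypothetical, and the helper counts distinct Python types per column instead of comparing against the first row, which flags the same columns.

```python
import pandas as pd

def mixed_type_columns(df):
    """Return the names of columns that hold more than one Python type."""
    mixed = []
    for col in df.columns:
        # Series.map(type) yields the Python type of every value in the column
        if df[col].map(type).nunique() > 1:
            mixed.append(col)
    return mixed

# Hypothetical demo frame (not the Instacart data)
demo = pd.DataFrame({"mix": ["a", "b", 1, True], "clean": [1, 2, 3, 4]})
print(mixed_type_columns(demo))
```

An empty list corresponds to the loop printing nothing. In an object column, a missing value is stored as a float (`NaN`), so a text column with missing entries is also reported as mixed.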

## Step 4: Mixed-Type Data Treatment – Orders Dataset
Because the check in Step 3 identified no mixed-type columns, no data type corrections were required at this stage.

## Step 5: Missing Values Check — Orders Dataset

In [34]:
# Step 5: Check for missing values in orders dataset
df_ords.isnull().sum()

order_id                       0
user_id                        0
order_number                   0
orders_day_of_week             0
order_hour_of_day              0
days_since_prior_order    206209
dtype: int64

A missing values check was performed on the orders dataset using `df_ords.isnull().sum()`.

Missing values were identified **only** in the `days_since_prior_order` column (206,209 rows). This is an expected characteristic of the Instacart dataset: the field is undefined for a customer's **first-ever order**, because there is no prior order from which to calculate a time difference. The count also equals the maximum `user_id` reported in Step 2 (206,209), which is consistent with one first order per customer.

No missing values were observed in any other columns (`order_id`, `user_id`, `order_number`, `orders_day_of_week`, `order_hour_of_day`), indicating overall data completeness for key order identifiers and temporal variables.
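
The explanation that missing values belong to first orders can be tested directly. The sketch below uses a small invented frame with the same column names as `df_ords`, and assumes that a first order is recorded as `order_number == 1`; the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical sample with the same column names as df_ords (values invented)
sample = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "order_number": [1, 2, 3, 1, 2],
    "days_since_prior_order": [None, 15.0, 0.0, None, 30.0],
})

is_missing = sample["days_since_prior_order"].isnull()
is_first = sample["order_number"] == 1  # assumption: first orders are numbered 1

# True only if every missing value is a first order and every first order is missing
missing_matches_first = bool((is_missing == is_first).all())

# Cross-tabulation of the two flags for a quick visual check
print(pd.crosstab(is_first, is_missing))
print(missing_matches_first)
```

Run against the real data, a `False` result would mean some values are missing for another reason and need separate treatment.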


In [35]:
# Step 6: Address missing values in orders dataset
# Fill missing days_since_prior_order with 0 to represent first-time orders

df_ords['days_since_prior_order'] = df_ords['days_since_prior_order'].fillna(0)

# Verify missing values have been addressed
df_ords.isnull().sum()

order_id                  0
user_id                   0
order_number              0
orders_day_of_week        0
order_hour_of_day         0
days_since_prior_order    0
dtype: int64

## Step 6: Missing Values Treatment — Orders Dataset

Missing values in the `days_since_prior_order` column were addressed by replacing null values with `0`.

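This approach was selected because missing values in this column correspond to **first-time customer orders**, where no prior order exists, and because it preserves all customer records in the dataset. Note that `0` is also a genuine value in this column (the minimum reported in Step 2, before filling, was already `0.0`), so after filling, a first order cannot be told apart from a same-day reorder by this column alone.

This method avoids unnecessary row removal. Because the 206,209 filled zeros lower the mean and other summary statistics of `days_since_prior_order`, first orders may need to be excluded when that column is summarized.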
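
One way to keep first orders identifiable after filling is to record a flag before the nulls are overwritten. This is a sketch on invented data, not a step the notebook performs; the `first_order` column and the `same_day_reorders` name are hypothetical.

```python
import pandas as pd

# Hypothetical sample (values invented for illustration)
sample = pd.DataFrame({
    "order_number": [1, 2, 3, 1],
    "days_since_prior_order": [None, 0.0, 7.0, None],
})

# Record which rows were missing before overwriting them
sample["first_order"] = sample["days_since_prior_order"].isnull()
sample["days_since_prior_order"] = sample["days_since_prior_order"].fillna(0)

# Genuine same-day reorders: value 0 but not a first order
same_day_reorders = sample[
    (sample["days_since_prior_order"] == 0) & ~sample["first_order"]
]
print(len(same_day_reorders))
```

With the flag in place, statistics on `days_since_prior_order` can be computed on rows where `first_order` is `False`.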


In [36]:
# Step 7: Check for duplicate rows in orders dataset
df_ords_dups = df_ords[df_ords.duplicated()]

# View duplicates (if any)
df_ords_dups

# Check how many duplicate rows exist
df_ords_dups.shape

(0, 6)

## Step 7: Duplicate Values Check — Orders Dataset

A duplicate check was performed on the orders dataset using the `duplicated()` method to identify fully duplicated rows across all columns.

No duplicate rows were identified in the orders dataset (`0` duplicates found). This indicates that each order record is unique and that the dataset does not contain redundant order entries.

The absence of duplicate rows is expected, as each `order_id` represents a distinct transaction. No further action was required at this stage.
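
Because `duplicated()` compares entire rows, it does not by itself show that `order_id` is unique. A key-level check could look like the sketch below, which uses an invented frame; the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical sample: order_id 11 appears twice with different user_id values
sample = pd.DataFrame({
    "order_id": [10, 11, 11, 12],
    "user_id": [1, 1, 2, 3],
})

# Full-row duplicates: rows identical in every column
full_row_dups = int(sample.duplicated().sum())

# Key duplicates: rows that repeat an order_id already seen
key_dups = int(sample.duplicated(subset=["order_id"]).sum())

print(full_row_dups, key_dups)
```

In this sample the full-row check finds nothing while the key check finds one repeated `order_id`, which is the case a full-row check would miss.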


## Step 8: Duplicate Values Treatment — Orders Dataset

No duplicate rows were identified in the orders dataset during the duplicate check performed in Step 7.

As a result, no duplicate removal was necessary, and the dataset was left unchanged to preserve all valid order records.


In [37]:
# Step 9: Export final cleaned orders dataset

prepared_data_path = r"C:\Users\sheid\OneDrive\Documents\Instacart Basket Analysis 01-05-2026\02 Data\Prepared Data"

df_ords.to_csv(
    os.path.join(prepared_data_path, "orders_checked.csv"),
    index=False
)

In [38]:
# Verify orders file exists
os.path.exists(os.path.join(prepared_data_path, "orders_checked.csv"))

True

In [39]:
df_ords_checked = pd.read_csv(
    os.path.join(prepared_data_path, "orders_checked.csv")
)

df_ords_checked.shape

(3421083, 6)
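
The shape comparison confirms that no rows or columns were lost on export. A slightly stricter round-trip check also compares column names and values. The following is a sketch on an invented frame written to a temporary directory, not to the project folder; `sample` and `round_trip_ok` are hypothetical names.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample standing in for the cleaned orders data
sample = pd.DataFrame({
    "order_id": [1, 2, 3],
    "days_since_prior_order": [0.0, 7.0, 30.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "orders_checked.csv")
    sample.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# Same shape, same column order, and same values and dtypes after reloading
round_trip_ok = (
    reloaded.shape == sample.shape
    and list(reloaded.columns) == list(sample.columns)
    and reloaded.equals(sample)
)
print(round_trip_ok)
```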