# Task 4.5 – Data Consistency Checks

This notebook performs consistency checks on the Instacart datasets.

**Objectives:**
- Identify mixed-type columns
- Find and handle missing values
- Detect and remove duplicate records

## Importing libraries

In [1]:
# 📦 Import libraries
import pandas as pd
import numpy as np
import os

## Load Data
I'll work with:
- `products.csv` (Original Data)
- `orders_wrangled.csv` (Prepared Data)

In [2]:
path = r'/Users/canancengel/A4_Instacart Basket Analysis/02_Data'
df_prods = pd.read_csv(os.path.join(path, 'Original Data', 'products.csv'))
df_ords = pd.read_csv(os.path.join(path, 'Prepared Data', 'orders_wrangled.csv'))

In [3]:
df_prods

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
1,2,All-Seasons Salt,104,13,9.3
2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,10.5
4,5,Green Chile Anytime Sauce,5,13,4.3
...,...,...,...,...,...
49688,49684,"Vodka, Triple Distilled, Twist of Vanilla",124,5,5.3
49689,49685,En Croute Roast Hazelnut Cranberry,42,1,3.1
49690,49686,Artisan Baguette,112,3,7.8
49691,49687,Smartblend Healthy Metabolism Dry Cat Food,41,8,4.7


## Mixed-Type Columns Check
The loop below flags any column whose values are stored as more than one Python type.

In [4]:
# Test for mixed-type columns in df_prods
for col in df_prods.columns:
    types = df_prods[col].map(type)
    if len(types.unique()) > 1:
        print(f"Mixed types found in column: {col}")

Mixed types found in column: product_name
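A likely reason for this flag is missing values: in a text column, pandas stores a missing entry as a float `NaN` next to the `str` values. The sketch below reproduces that on a small toy frame (the data is made up for illustration and is not taken from `products.csv`).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing name in an otherwise text column
toy = pd.DataFrame({
    "product_id": [1, 2, 3],
    "product_name": pd.Series(["Cookies", np.nan, "Oolong Tea"], dtype=object),
})

# Python types present in the product_name column
name_types = set(toy["product_name"].map(type))

# Columns holding more than one Python type (same idea as the loop above)
mixed_cols = [col for col in toy.columns if toy[col].map(type).nunique() > 1]

print(name_types)
print(mixed_cols)
```

If the real column behaves the same way, removing or filling the missing names should clear the mixed-type flag.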


In [5]:
# Descriptive statistics for df_ords
df_ords.describe()

Unnamed: 0,order_id,user_id,order_number,order_dow,order_hour_of_day,days_since_prior_order
count,3421083.0,3421083.0,3421083.0,3421083.0,3421083.0,3214874.0
mean,1710542.0,102978.2,17.15486,2.776219,13.45202,11.11484
std,987581.7,59533.72,17.73316,2.046829,4.226088,9.206737
min,1.0,1.0,1.0,0.0,0.0,0.0
25%,855271.5,51394.0,5.0,1.0,10.0,4.0
50%,1710542.0,102689.0,11.0,3.0,13.0,7.0
75%,2565812.0,154385.0,23.0,5.0,16.0,15.0
max,3421083.0,206209.0,100.0,6.0,23.0,30.0


### Interpretation of df_ords.describe()
The `order_number`, `order_hour_of_day`, and `days_since_prior_order` columns appear consistent.  
- `order_hour_of_day` ranges from 0 to 23, which matches the 24-hour clock.  
- `days_since_prior_order` has a maximum of 30.0, so no gap between orders exceeds 30 days.  
- `days_since_prior_order` has 3,214,874 non-null values out of 3,421,083 rows. The missing values are expected, because a customer's first order has no prior order to measure from.
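The same reading of `describe()` can be written as an explicit range check. This is a minimal sketch on toy data; the bounds in `expected_ranges` are my assumption, taken from the minimum and maximum values shown above, and are not documented limits of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows shaped like df_ords
orders = pd.DataFrame({
    "order_number": [1, 2, 1],
    "order_hour_of_day": [8, 23, 0],
    "days_since_prior_order": [np.nan, 30.0, np.nan],
})

# Assumed plausible bounds per column (inclusive on both ends)
expected_ranges = {
    "order_number": (1, 100),
    "order_hour_of_day": (0, 23),
    "days_since_prior_order": (0, 30),
}

# Count values outside the expected range, ignoring missing values
out_of_range = {}
for col, (low, high) in expected_ranges.items():
    values = orders[col].dropna()
    out_of_range[col] = int((~values.between(low, high)).sum())

print(out_of_range)
```

A non-zero count for any column would point to rows worth inspecting before further analysis.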

In [6]:
# Same test for df_ords
for col in df_ords.columns:
    types = df_ords[col].map(type)
    if len(types.unique()) > 1:
        print(f"Mixed types found in column: {col}")

There are no mixed-type columns detected in the `df_ords` dataframe. All columns are consistent in data type.

## Missing Values Check
I'll identify missing values in both dataframes.

In [7]:
import pandas as pd

In [8]:
# Reload the original products data (same file as in the Load Data step)
df_prods = pd.read_csv(os.path.join(path, 'Original Data', 'products.csv'))

In [9]:
# Keep rows whose product_name is present and not blank, then drop exact duplicate rows
has_name = df_prods['product_name'].notnull() & (df_prods['product_name'].str.strip() != "")
df_prods_clean = df_prods[has_name].drop_duplicates()
print(df_prods_clean.shape)

(49672, 5)


In [10]:
# Descriptive statistics for df_prods
df_prods.describe()

Unnamed: 0,product_id,aisle_id,department_id,prices
count,49693.0,49693.0,49693.0,49693.0
mean,24844.345139,67.770249,11.728433,9.994136
std,14343.717401,38.316774,5.850282,453.519686
min,1.0,1.0,1.0,1.0
25%,12423.0,35.0,7.0,4.1
50%,24845.0,69.0,13.0,7.1
75%,37265.0,100.0,17.0,11.2
max,49688.0,134.0,21.0,99999.0


In [11]:
# Reload the prepared orders data (same file as in the Load Data step)
df_ords = pd.read_csv(os.path.join(path, 'Prepared Data', 'orders_wrangled.csv'))

In [12]:
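# .copy() makes df_ords_clean an independent dataframe, so adding a column
# to it later does not trigger a SettingWithCopyWarning
df_ords_clean = df_ords.drop_duplicates().copy()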

In [13]:
df_ords_clean['new_customer'] = df_ords_clean['days_since_prior_order'].isnull()

In [14]:
# Show number of missing values
df_ords_clean['days_since_prior_order'].isnull().sum()

206209

In [15]:
print(df_ords_clean.shape)

(3421083, 7)


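The 206,209 missing values in the `days_since_prior_order` column correspond to first orders, which have no prior order to measure from. The count equals the maximum `user_id` (206,209), which is consistent with one first order per customer.
These missing values have not been removed from the dataset, as they provide important information.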
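One way to test this interpretation is to compare the missing-value flag with `order_number == 1`. The sketch below uses a toy frame with the same column names; I have not run this comparison on the full orders data, so the agreement shown here is for the toy rows only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two customers, first orders have no prior gap
orders = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2],
    "order_number": [1, 2, 1, 2, 3],
    "days_since_prior_order": [np.nan, 7.0, np.nan, 30.0, 4.0],
})

# Flag orders without a prior-order gap
orders["new_customer"] = orders["days_since_prior_order"].isnull()

# The flag should line up exactly with each customer's first order
first_order = orders["order_number"] == 1
flags_match = bool((orders["new_customer"] == first_order).all())

n_flagged = int(orders["new_customer"].sum())
n_users = int(orders["user_id"].nunique())

print(flags_match, n_flagged, n_users)
```

If `flags_match` were `False` on the real data, some missing values would need a different explanation than "first order".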

In [17]:
# Descriptive statistics for df_ords
df_ords.describe()

Unnamed: 0,order_id,user_id,order_number,order_dow,order_hour_of_day,days_since_prior_order
count,3421083.0,3421083.0,3421083.0,3421083.0,3421083.0,3214874.0
mean,1710542.0,102978.2,17.15486,2.776219,13.45202,11.11484
std,987581.7,59533.72,17.73316,2.046829,4.226088,9.206737
min,1.0,1.0,1.0,0.0,0.0,0.0
25%,855271.5,51394.0,5.0,1.0,10.0,4.0
50%,1710542.0,102689.0,11.0,3.0,13.0,7.0
75%,2565812.0,154385.0,23.0,5.0,16.0,15.0
max,3421083.0,206209.0,100.0,6.0,23.0,30.0


After identifying and removing missing values from the products dataframe and addressing missing values in the orders dataframe, I created clean versions of each dataset.

- The cleaned products dataframe contains 49,672 rows and 5 columns. Rows with missing or blank product names and exact duplicate rows were removed, 21 rows in total (49,693 − 49,672).

- The cleaned orders dataframe contains 3,421,083 rows and 7 columns. Rows with missing `days_since_prior_order` were not removed, as these represent new customers; instead, a new column (`new_customer`) was created to flag these orders. The remaining 3,214,874 rows have a non-missing `days_since_prior_order`.

These cleaned dataframes will be used for all subsequent analysis steps to ensure data consistency and accuracy.
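To make a row count like this easy to audit, the removed rows can be split by reason. The following is a minimal sketch on toy data; the variable names (`has_name`, `n_missing`, `n_dupes`) are illustrative and the numbers do not come from the Instacart files.

```python
import pandas as pd

# Hypothetical toy data: one missing name, one blank name, one exact duplicate
products = pd.DataFrame({
    "product_id": [1, 2, 3, 3, 4],
    "product_name": ["Cookies", None, "Baguette", "Baguette", "   "],
    "prices": [5.8, 9.3, 7.8, 7.8, 4.3],
})

# Rows with a usable (present and non-blank) product name
has_name = products["product_name"].notnull() & (products["product_name"].str.strip() != "")

named = products[has_name]
products_clean = named.drop_duplicates()

# Break the removed rows down by reason
n_missing = int((~has_name).sum())
n_dupes = int(named.duplicated().sum())
rows_removed = len(products) - len(products_clean)

print(n_missing, n_dupes, rows_removed)
```

The two partial counts should always add up to `rows_removed`, which gives a quick check that no rows were lost for another reason.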

## Handling Missing Product Names
I'll create a subset to view rows with missing product names.

In [18]:
# Subset for missing product names
df_nan = df_prods[df_prods['product_name'].isnull()]
df_nan

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
33,34,,121,14,12.2
68,69,,26,7,11.8
115,116,,93,3,10.8
261,262,,110,13,12.1
525,525,,109,11,1.2
1511,1511,,84,16,14.3
1780,1780,,126,11,12.3
2240,2240,,52,1,14.2
2586,2586,,104,13,12.4
3159,3159,,126,11,13.1


## Duplicates Check
I'll check for duplicate rows in both dataframes.

In [19]:
# Duplicates in df_prods
df_prods.duplicated().sum()

5

In [20]:
# Duplicates in df_ords
df_ords.duplicated().sum()

0
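Note that `duplicated()` marks only the repeats after the first occurrence by default, so the sum above counts the extra copies, not every row involved. A small sketch on toy data (values made up for illustration) shows the difference with `keep=False`.

```python
import pandas as pd

# Hypothetical toy data with one fully duplicated row
toy = pd.DataFrame({
    "product_id": [1, 2, 2, 3],
    "product_name": ["Cookies", "Salt", "Salt", "Tea"],
})

# Default keep='first': only the second copy is marked as a duplicate
n_extra = int(toy.duplicated().sum())

# keep=False marks every copy, which is handy for inspecting them side by side
all_copies = toy[toy.duplicated(keep=False)]

# drop_duplicates keeps the first occurrence of each row
deduped = toy.drop_duplicates()

print(n_extra, len(all_copies), len(deduped))
```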

## Drop Duplicates

In [21]:
# Drop duplicates
df_prods = df_prods.drop_duplicates()
df_ords = df_ords.drop_duplicates()

## Final Checks Complete
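Both dataframes have now been checked for consistency:
- Mixed types found in `product_name` (df_prods); none in df_ords
- Missing values located in `product_name` (df_prods) and `days_since_prior_order` (df_ords)
- 5 duplicates removed from df_prods; none found in df_ords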

## Exporting Cleaned Data

The cleaned products and orders dataframes are exported as CSV files to the "Prepared Data" folder.  
The files are named `products_cleaned.csv` and `orders_cleaned.csv`.

In [23]:
# Export without the dataframe index so no extra index column is written
df_prods_clean.to_csv(os.path.join(path, 'Prepared Data', 'products_cleaned.csv'), index=False)
df_ords_clean.to_csv(os.path.join(path, 'Prepared Data', 'orders_cleaned.csv'), index=False)
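As a quick sanity check on an export like this, a file can be read back and compared with the dataframe that was written. The sketch below uses a temporary folder and toy data so it does not touch the project files; it also shows why `index=False` matters.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data standing in for a cleaned dataframe
df = pd.DataFrame({
    "product_id": [1, 2],
    "product_name": ["Cookies", "Salt"],
    "prices": [5.8, 9.3],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Export without the index, then read the file back
    clean_path = os.path.join(tmp_dir, "products_cleaned.csv")
    df.to_csv(clean_path, index=False)
    reloaded = pd.read_csv(clean_path)

    # Export with the index for comparison
    indexed_path = os.path.join(tmp_dir, "with_index.csv")
    df.to_csv(indexed_path)
    reloaded_with_index = pd.read_csv(indexed_path)

same_columns = list(reloaded.columns) == list(df.columns)
same_shape = reloaded.shape == df.shape
extra_column = reloaded_with_index.columns[0]

print(same_columns, same_shape, extra_column)
```

Writing the index adds an unnamed first column that pandas labels `Unnamed: 0` on reload, which is why the export above passes `index=False`.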