#### Table of Contents
1. Data Consistency Checks
2. Step 1: Import Libraries
3. Step 2: Import Datasets
4. Step 3: Descriptive Statistics
5. Step 4: Check for Mixed-Type Columns
6. Step 5: Check for Missing Values
7. Step 6: Check for Duplicates
8. Step 7: Save Cleaned Data
9. Verify Saved Files

# Data Consistency Checks
This notebook runs data consistency checks for the Instacart Basket Analysis project. I’ll check the data for mixed data types, missing values, and duplicates, and clean up what I find so everything is ready for analysis.

## Step 1: Import Libraries
First, I’ll load the libraries I’ll need for working with the data.

In [7]:
import pandas as pd  # For data manipulation
import numpy as np  # For numerical operations

## Step 2: Import Datasets
Time to bring in the data I’ll be working with! This includes the original `products`, `departments`, and `orders` files, plus two previously prepared files. I’ll load them into pandas dataframes so I can start cleaning them up and checking for any consistency issues.

In [14]:
# Importing the datasets
df_prods = pd.read_csv('../02 Data/Original Data/products.csv')  # Products dataset
df_departments = pd.read_csv('../02 Data/Original Data/departments.csv')  # Departments dataset
df_orders = pd.read_csv('../02 Data/Original Data/orders.csv')  # Orders dataset

# Prepared data
df_cleaned_prods = pd.read_csv('../02 Data/Prepared Data/cleaned_products.csv')  # Cleaned products
df_orders_wrangled = pd.read_csv('../02 Data/Prepared Data/orders_wrangled.csv')  # Wrangled orders

In [16]:
# Preview datasets
print("Products Data:")
print(df_prods.head(), "\n")

print("Departments Data:")
print(df_departments.head(), "\n")

print("Orders Data:")
print(df_orders.head(), "\n")

Products Data:
   product_id                                       product_name  aisle_id  \
0           1                         Chocolate Sandwich Cookies        61   
1           2                                   All-Seasons Salt       104   
2           3               Robust Golden Unsweetened Oolong Tea        94   
3           4  Smart Ones Classic Favorites Mini Rigatoni Wit...        38   
4           5                          Green Chile Anytime Sauce         5   

   department_id  prices  
0             19     5.8  
1             13     9.3  
2              7     4.5  
3              1    10.5  
4             13     4.3   

Departments Data:
  department_id       1      2       3        4        5              6  \
0    department  frozen  other  bakery  produce  alcohol  international   

           7     8                9  ...            12      13         14  \
0  beverages  pets  dry goods pasta  ...  meat seafood  pantry  breakfast   

             15          16 
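
The departments preview above shows the table in wide form: a single row with one column per department ID. That layout is awkward to join against `products`, so here is a minimal sketch of how it could be reshaped into one row per department. The frame below is a toy stand-in built from the first three columns of the preview, and `df_wide_demo` and `df_long_demo` are hypothetical names that are not used elsewhere in this notebook.

```python
import pandas as pd

# Toy stand-in for the wide departments table (hypothetical, first three IDs only)
df_wide_demo = pd.DataFrame(
    [["department", "frozen", "other", "bakery"]],
    columns=["department_id", "1", "2", "3"],
)

# Move the label column into the index, then flip rows and columns
df_long_demo = df_wide_demo.set_index("department_id").T.reset_index()

# Give the two resulting columns explicit names
df_long_demo.columns = ["department_id", "department"]

print(df_long_demo)
```

The IDs stay as strings here because they started out as column labels. Casting them with `astype(int)` would be a reasonable next step before any merge.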

## Step 3: Descriptive Statistics
I'll start by running descriptive statistics on the datasets to get an overview of the data and check for anything unusual. This will help identify potential outliers or inconsistencies in the numerical columns.

In [19]:
# Descriptive statistics for orders
print("Descriptive Statistics for Orders Data:")
print(df_orders.describe(), "\n")

# Descriptive statistics for products
print("Descriptive Statistics for Products Data:")
print(df_prods.describe(), "\n")

Descriptive Statistics for Orders Data:
           order_id       user_id  order_number     order_dow  \
count  3.421083e+06  3.421083e+06  3.421083e+06  3.421083e+06   
mean   1.710542e+06  1.029782e+05  1.715486e+01  2.776219e+00   
std    9.875817e+05  5.953372e+04  1.773316e+01  2.046829e+00   
min    1.000000e+00  1.000000e+00  1.000000e+00  0.000000e+00   
25%    8.552715e+05  5.139400e+04  5.000000e+00  1.000000e+00   
50%    1.710542e+06  1.026890e+05  1.100000e+01  3.000000e+00   
75%    2.565812e+06  1.543850e+05  2.300000e+01  5.000000e+00   
max    3.421083e+06  2.062090e+05  1.000000e+02  6.000000e+00   

       order_hour_of_day  days_since_prior_order  
count       3.421083e+06            3.214874e+06  
mean        1.345202e+01            1.111484e+01  
std         4.226088e+00            9.206737e+00  
min         0.000000e+00            0.000000e+00  
25%         1.000000e+01            4.000000e+00  
50%         1.300000e+01            7.000000e+00  
75%         1.600
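
Reading `describe()` by eye works for a first pass, but a rule-based flag can make the outlier check repeatable. Below is a minimal sketch using the common 1.5 × IQR rule. The rule is my own choice of technique for illustration, not something this notebook applies, and `flag_iqr_outliers`, `prices_demo`, and the extreme value are all hypothetical.

```python
import pandas as pd

def flag_iqr_outliers(series, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Toy prices: five plausible values plus one made-up extreme value
prices_demo = pd.Series([5.8, 9.3, 4.5, 10.5, 4.3, 99999.0])

outlier_mask = flag_iqr_outliers(prices_demo)
print(prices_demo[outlier_mask])
```

On the real data, the same function could be pointed at `df_prods['prices']` or at any numeric column in `df_orders`.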

## Step 4: Check for Mixed-Type Columns
Next, I'll check whether any columns in the orders data hold mixed data types, as this can cause problems during analysis. The loop prints a message only for columns that contain more than one type.

In [22]:
# Checking for mixed-type columns
for col in df_orders.columns:
    mixed_types = (df_orders[col].apply(type).nunique() > 1)
    if mixed_types:
        print(f"Column {col} contains mixed data types.")
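
The loop above only looks at `df_orders`. As a sketch, the same test can be wrapped in a small helper so it can be reused on any dataframe, with a cast to fix whatever it finds. The helper name `find_mixed_type_columns` and the toy frame are hypothetical, and casting to `str` is just one possible fix.

```python
import pandas as pd

def find_mixed_type_columns(df):
    """Return the names of columns whose values have more than one Python type."""
    return [col for col in df.columns if df[col].map(type).nunique() > 1]

# Toy frame: product_name mixes strings with an integer
df_demo = pd.DataFrame({
    "product_id": [1, 2, 3],
    "product_name": ["Salt", 42, "Tea"],
})

mixed_cols = find_mixed_type_columns(df_demo)
print(mixed_cols)

# One possible fix: cast the affected columns to strings.
# Caution: astype(str) also turns missing values into the text 'nan',
# so handle missing values first if the column has any.
df_demo[mixed_cols] = df_demo[mixed_cols].astype(str)
print(find_mixed_type_columns(df_demo))
```

The helper could then be called as `find_mixed_type_columns(df_prods)` to cover the products data as well.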

## Step 5: Check for Missing Values
Next, I'll check for missing values in the data to make sure there aren’t any gaps that could cause problems during analysis.

In [25]:
# Checking for missing values
print("Missing Values in Orders Data:")
print(df_orders.isnull().sum(), "\n")

print("Missing Values in Products Data:")
print(df_prods.isnull().sum(), "\n")

Missing Values in Orders Data:
order_id                       0
user_id                        0
eval_set                       0
order_number                   0
order_dow                      0
order_hour_of_day              0
days_since_prior_order    206209
dtype: int64 

Missing Values in Products Data:
product_id        0
product_name     16
aisle_id          0
department_id     0
prices            0
dtype: int64 
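
The counts show where the gaps are but not what to do about them. The missing `days_since_prior_order` values plausibly belong to each user's first order (the count matches the highest `user_id` in the descriptive statistics), so those can reasonably stay. The 16 missing product names are harder to justify. Here is a minimal sketch of one way to inspect and filter them out, on toy data. `df_prods_demo`, `df_nan_demo`, and `df_prods_clean_demo` are hypothetical names, and dropping the rows is an assumption about the preferred fix.

```python
import numpy as np
import pandas as pd

# Toy products frame with one missing product name
df_prods_demo = pd.DataFrame({
    "product_id": [1, 2, 3],
    "product_name": ["Chocolate Sandwich Cookies", np.nan, "All-Seasons Salt"],
    "prices": [5.8, 9.3, 4.5],
})

# Look at the affected rows before deciding what to do with them
df_nan_demo = df_prods_demo[df_prods_demo["product_name"].isnull()]
print(df_nan_demo)

# Keep only the rows that have a product name
df_prods_clean_demo = df_prods_demo[df_prods_demo["product_name"].notnull()]
print(df_prods_clean_demo.shape)
```

Filtering into a new dataframe rather than overwriting keeps the original available for comparison.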



## Step 6: Check for Duplicates
I'll check if there are any duplicate rows in the data that need to be removed.

In [28]:
# Checking for duplicates
print("Duplicate Rows in Orders Data:")
print(df_orders.duplicated().sum(), "\n")

print("Duplicate Rows in Products Data:")
print(df_prods.duplicated().sum(), "\n")

# Removing duplicates
df_orders = df_orders.drop_duplicates()
df_prods = df_prods.drop_duplicates()

Duplicate Rows in Orders Data:
0 

Duplicate Rows in Products Data:
5 
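
Before dropping duplicates, it can help to look at the rows themselves. By default `duplicated()` marks only the repeats, while `keep=False` marks every copy so the full groups are visible. A minimal sketch on toy data (`df_dup_demo` and the related names are hypothetical):

```python
import pandas as pd

# Toy frame with one fully duplicated row
df_dup_demo = pd.DataFrame({
    "product_id": [1, 2, 2, 3],
    "product_name": ["Cookies", "Salt", "Salt", "Tea"],
})

# Default: only the second and later copies count as duplicates
n_repeats = int(df_dup_demo.duplicated().sum())

# keep=False: show every row that belongs to a duplicate group
df_all_copies = df_dup_demo[df_dup_demo.duplicated(keep=False)]
print(df_all_copies)

# Dropping duplicates keeps the first copy of each group
df_no_dups_demo = df_dup_demo.drop_duplicates()
print(df_no_dups_demo.shape)
```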



## Step 7: Save Cleaned Data
Now that the data has been checked and cleaned, I’ll export the cleaned dataframes to my "Prepared Data" folder for further analysis.

In [31]:
# Save cleaned data
df_orders.to_csv('../02 Data/Prepared Data/cleaned_orders.csv', index=False)
df_prods.to_csv('../02 Data/Prepared Data/cleaned_products.csv', index=False)

## Verify Saved Files

In [34]:
# Verify saved files (optional)
df_orders_cleaned = pd.read_csv('../02 Data/Prepared Data/cleaned_orders.csv')
df_prods_cleaned = pd.read_csv('../02 Data/Prepared Data/cleaned_products.csv')

print("Orders Data:")
print(df_orders_cleaned.head(), "\n")

print("Products Data:")
print(df_prods_cleaned.head())

Orders Data:
   order_id  user_id eval_set  order_number  order_dow  order_hour_of_day  \
0   2539329        1    prior             1          2                  8   
1   2398795        1    prior             2          3                  7   
2    473747        1    prior             3          3                 12   
3   2254736        1    prior             4          4                  7   
4    431534        1    prior             5          4                 15   

   days_since_prior_order  
0                     NaN  
1                    15.0  
2                    21.0  
3                    29.0  
4                    28.0   

Products Data:
   product_id                                       product_name  aisle_id  \
0           1                         Chocolate Sandwich Cookies        61   
1           2                                   All-Seasons Salt       104   
2           3               Robust Golden Unsweetened Oolong Tea        94   
3           4  Smart Ones C
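
Eyeballing `head()` confirms the files load, but it doesn't confirm that nothing changed on the way to disk. As a sketch, a round-trip check can compare shape, column names, and missing-value counts before and after saving. The example writes a toy frame to a temporary folder so it doesn't touch the project files, and all names in it are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for a cleaned dataframe
df_roundtrip_demo = pd.DataFrame({
    "order_id": [2539329, 2398795, 473747],
    "days_since_prior_order": [None, 15.0, 21.0],
})

# Write to a temporary folder and read the file straight back
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "cleaned_demo.csv")
    df_roundtrip_demo.to_csv(demo_path, index=False)
    df_reloaded = pd.read_csv(demo_path)

# Compare the saved copy against the in-memory original
same_shape = df_reloaded.shape == df_roundtrip_demo.shape
same_columns = list(df_reloaded.columns) == list(df_roundtrip_demo.columns)
same_missing = df_reloaded.isnull().sum().equals(df_roundtrip_demo.isnull().sum())

print(same_shape, same_columns, same_missing)
```

In this notebook the same comparisons could be run between `df_orders` and `df_orders_cleaned`, and between `df_prods` and `df_prods_cleaned`.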