# Data Consistency Task
This notebook focuses on performing data consistency checks on the `df_ords` dataframe. I'll analyze the data, clean it, and document the findings as part of the Instacart Basket Analysis project.

## Step 1: Import Libraries and Dataset
First, I'll import the necessary libraries and the `df_ords` dataset to start analyzing the data.

In [3]:
import pandas as pd
import numpy as np

# Load the orders dataset
df_ords = pd.read_csv('../02 Data/Prepared Data/cleaned_orders.csv')

# Preview the dataset
print("Orders Data:")
print(df_ords.head())

Orders Data:
   order_id  user_id eval_set  order_number  order_dow  order_hour_of_day  \
0   2539329        1    prior             1          2                  8   
1   2398795        1    prior             2          3                  7   
2    473747        1    prior             3          3                 12   
3   2254736        1    prior             4          4                  7   
4    431534        1    prior             5          4                 15   

   days_since_prior_order  
0                     NaN  
1                    15.0  
2                    21.0  
3                    29.0  
4                    28.0  


## Step 2: Descriptive Statistics
I'll start by running descriptive statistics on the `df_ords` dataset. This will help identify any unusual values or patterns that need to be addressed.

In [8]:
# Descriptive statistics for df_ords
print("Descriptive Statistics for Orders Data:")
print(df_ords.describe())

Descriptive Statistics for Orders Data:
           order_id       user_id  order_number     order_dow  \
count  3.421083e+06  3.421083e+06  3.421083e+06  3.421083e+06   
mean   1.710542e+06  1.029782e+05  1.715486e+01  2.776219e+00   
std    9.875817e+05  5.953372e+04  1.773316e+01  2.046829e+00   
min    1.000000e+00  1.000000e+00  1.000000e+00  0.000000e+00   
25%    8.552715e+05  5.139400e+04  5.000000e+00  1.000000e+00   
50%    1.710542e+06  1.026890e+05  1.100000e+01  3.000000e+00   
75%    2.565812e+06  1.543850e+05  2.300000e+01  5.000000e+00   
max    3.421083e+06  2.062090e+05  1.000000e+02  6.000000e+00   

       order_hour_of_day  days_since_prior_order  
count       3.421083e+06            3.214874e+06  
mean        1.345202e+01            1.111484e+01  
std         4.226088e+00            9.206737e+00  
min         0.000000e+00            0.000000e+00  
25%         1.000000e+01            4.000000e+00  
50%         1.300000e+01            7.000000e+00  
75%         1.600

### Answer:

- The `order_hour_of_day` column looks good: it ranges from **0 to 23**, which makes sense since those are the hours in a day.
- For `days_since_prior_order`, the range is **0 to 30**, which seems normal. A 0 means the order was placed on the same day as the previous one; first-time orders show up as missing values (`NaN`) instead, as in the first row of the preview. No weird negative values here, so this column is fine too.
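The range check above was done by eye from the `describe()` output. Here is a minimal sketch of how it could be automated. The `sample` dataframe, the `expected_ranges` dictionary, and the `out_of_range_counts` helper are hypothetical names I made up for illustration; they are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_ords (hypothetical values, not the real data)
sample = pd.DataFrame({
    "order_hour_of_day": [8, 7, 12, 23, 0],
    "days_since_prior_order": [np.nan, 15.0, 0.0, 30.0, 28.0],
})

# Assumed valid ranges, taken from the ranges reported in the answer above
expected_ranges = {
    "order_hour_of_day": (0, 23),
    "days_since_prior_order": (0, 30),
}

def out_of_range_counts(df, ranges):
    """Count non-missing values outside [low, high] for each column."""
    counts = {}
    for col, (low, high) in ranges.items():
        values = df[col].dropna()  # missing values are checked separately
        counts[col] = int((~values.between(low, high)).sum())
    return counts

range_report = out_of_range_counts(sample, expected_ranges)
print(range_report)
```

A count of 0 for every column would confirm what the summary statistics suggest. Missing values are dropped on purpose here, since they are handled in their own step.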

## Step 3: Check for Mixed-Type Columns

In [19]:
# Check for mixed data types in the df_ords dataframe
for col in df_ords.columns:
    mixed_types = (df_ords[col].apply(type).nunique() > 1)
    if mixed_types:
        print(f"Column {col} contains mixed data types.")

### Answer:

I ran a check for mixed data types, and nothing printed, which means all the columns in `df_ords` have consistent data types. No issues here; everything checks out!
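Since the check prints nothing on clean data, it can be hard to tell whether it works at all. Below is a small sketch that runs the same logic on a toy dataframe with one deliberately mixed column. The `toy` dataframe and the `mixed_type_columns` helper are hypothetical and only there for illustration.

```python
import pandas as pd

def mixed_type_columns(df):
    """Return the names of columns whose values have more than one Python type."""
    return [col for col in df.columns if df[col].apply(type).nunique() > 1]

# Toy data: "mixed" holds an int, a str, and a float in one object column
toy = pd.DataFrame({
    "clean": [1, 2, 3],
    "mixed": [1, "two", 3.0],
})

mixed_cols = mixed_type_columns(toy)
print(mixed_cols)
```

On this toy data only the `mixed` column should be reported, which is the behavior the loop above relies on.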

## Step 4: Check for Missing Values

In [28]:
# Check for missing values in df_ords
print("Missing Values in Orders Data:")
print(df_ords.isnull().sum())

Missing Values in Orders Data:
order_id                       0
user_id                        0
eval_set                       0
order_number                   0
order_dow                      0
order_hour_of_day              0
days_since_prior_order    206209
dtype: int64


### Code: Create First-Time Order Flag

In [31]:
# Create a flag for first-time orders
df_ords['first_time_order'] = df_ords['days_since_prior_order'].isnull()

# Verify the new column
print(df_ords[['days_since_prior_order', 'first_time_order']].head(10))

   days_since_prior_order  first_time_order
0                     NaN              True
1                    15.0             False
2                    21.0             False
3                    29.0             False
4                    28.0             False
5                    19.0             False
6                    20.0             False
7                    14.0             False
8                     0.0             False
9                    30.0             False


### Answer:

I found that the `days_since_prior_order` column has **206,209** missing values, which I’m assuming represent first-time orders. That count also matches the largest `user_id` in the descriptive statistics (2.062090e+05), which is consistent with one first order per customer. To keep this information, I created a new column called `first_time_order`. This column is `True` for first-time orders (where `days_since_prior_order` is missing) and `False` for repeat orders. Now, I can easily distinguish between first-time and repeat orders.
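The first-time-order interpretation is an assumption, so it is worth testing rather than taking on faith. Here is a minimal sketch of one way to do that, on the assumption that `order_number == 1` marks a customer's first order. The `sample` dataframe holds made-up rows, not the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_ords (hypothetical rows)
sample = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "order_number": [1, 2, 3, 1, 2],
    "days_since_prior_order": [np.nan, 15.0, 21.0, np.nan, 0.0],
})

is_missing = sample["days_since_prior_order"].isnull()
is_first = sample["order_number"] == 1

# True only if the two masks agree on every row
assumption_holds = bool((is_missing == is_first).all())

# Exactly one missing value per user would also support the assumption
missing_per_user = is_missing.groupby(sample["user_id"]).sum()

print(assumption_holds)
print(missing_per_user.to_dict())
```

If the same comparison came back `True` on `df_ords`, the missing values could be treated as first orders with more confidence.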

## Step 5: Check for Duplicates

In [40]:
# Check for duplicate rows in df_ords
duplicates = df_ords.duplicated().sum()
print(f"Number of duplicate rows in df_ords: {duplicates}")

Number of duplicate rows in df_ords: 0


### Answer:

I checked the `df_ords` dataframe for duplicate rows, and the result came back with **0 duplicates**. This means there’s no need to remove or investigate duplicate rows. Everything looks good here!
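`duplicated()` with no arguments only flags rows that match in every column. As a complement, here is a sketch of a key-level check, assuming `order_id` is meant to be unique per order. The `sample` dataframe is hypothetical and contains a repeated `order_id` on purpose.

```python
import pandas as pd

# Toy data: order_id 102 appears twice, but the rows differ elsewhere
sample = pd.DataFrame({
    "order_id": [101, 102, 102, 103],
    "user_id": [1, 1, 1, 2],
    "order_number": [1, 2, 3, 1],
})

# Full-row duplicates: every column must match
full_row_dupes = int(sample.duplicated().sum())

# Key-level duplicates: only order_id must match
key_dupes = int(sample.duplicated(subset=["order_id"]).sum())

print(f"Full-row duplicates: {full_row_dupes}")
print(f"Duplicate order_id values: {key_dupes}")
```

The full-row check misses the repeated `order_id` in this toy data, which is why the `subset` version can be a useful second look.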

## Step 6: Save Cleaned Data

In [46]:
# Save the cleaned orders dataframe
df_ords.to_csv('../02 Data/Prepared Data/cleaned_orders_with_flags.csv', index=False)

print("Cleaned data has been saved successfully!")

Cleaned data has been saved successfully!
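The success message only confirms that `to_csv` ran without raising an error. Here is a rough sketch of a round-trip check that reloads the export and compares it to the original. It writes to an in-memory buffer instead of the real file path, and the `sample` dataframe is a hypothetical stand-in for `df_ords`.

```python
import io

import numpy as np
import pandas as pd

# Toy stand-in for the cleaned dataframe, including the new flag column
sample = pd.DataFrame({
    "order_id": [1, 2, 3],
    "days_since_prior_order": [np.nan, 15.0, 0.0],
})
sample["first_time_order"] = sample["days_since_prior_order"].isnull()

# Write to an in-memory buffer rather than to disk, then read it back
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# equals() compares values and dtypes, treating NaNs in the same spot as equal
round_trip_ok = sample.equals(reloaded)
print(round_trip_ok)
```

The same idea should carry over to the real file by passing the export path to `pd.read_csv` instead of the buffer.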


## Summary of Findings

For this task, I ran data consistency checks on the `df_ords` dataframe as part of the Instacart Basket Analysis project. Here’s what I found and how I handled it:

1. **Descriptive Statistics:**
I ran `df_ords.describe()` to get a better understanding of the data. The `order_hour_of_day` column ranges from 0 to 23, which makes sense since those are the hours in a day. The `days_since_prior_order` column ranges from 0 to 30, which also seems normal. There were no negative or unusual values in these columns.

2. **Mixed-Type Data Check:**
I checked for mixed data types, and everything came back clean. All the columns in `df_ords` have consistent data types, so there were no issues here.

3. **Missing Values:**
I found 206,209 missing values in the `days_since_prior_order` column. I’m assuming these are first-time orders, since a customer’s first order has no prior order to count from. To keep this information, I added a new column called `first_time_order`, which is `True` for first-time orders and `False` for repeat orders.

4. **Duplicate Values:**
I checked for duplicate rows, and there were none. Everything looks good here!

5. **Exported Cleaned Data:**
I saved the cleaned dataframe as `cleaned_orders_with_flags.csv` in the Prepared Data folder. This version of the data now includes the `first_time_order` flag and is ready for the next steps.