# 4.7 Deriving New Variables

## Table of Contents
1. Introduction
2. Import Libraries
3. Load Data
4. Derive New Variables
    - 4.1 Create the "price_label" Column
    - 4.2 Create the "busiest_day" Column
    - 4.3 Update "busiest_day" to Include Two Busiest and Least Busy Days
    - 4.4 Add the 'order_hour_of_day' Column
    - 4.5 Create the "busiest_period_of_day" Column
5. Consistency Checks
6. Verify Data and Export
7. Conclusion

## 1. Introduction
In this notebook, I created new variables for the Instacart dataset to gain more insight into customer behavior. Derived variables like these make it easier to spot trends, split the data into useful segments, and speed up later analysis.

### Goals:
- Categorize products based on their price range.
- Identify the busiest and least busy days for ordering.
- Break the day into segments to show which hours are busiest, least busy, or somewhere in between.

## 2. Import Libraries
To get started, I imported the libraries I needed to work with the data. These are: 
- pandas: For working with the dataset and manipulating data. 
- numpy: For anything that requires number-crunching or calculations.

In [19]:
# Importing the libraries I need
import pandas as pd
import numpy as np

## 3. Load Data
Next, I loaded the `merged_orders_products.csv` dataset. This dataset combines orders and product details, which makes it perfect for the kind of analysis I want to do. I also double-checked that it loaded correctly.

In [10]:
# Load the merged orders-products data from the specified file path
merged_orders_products = pd.read_csv(r'D:\YVC\Data Analytics (CF)\Python Fundamentals for Data Analysts\Instacart Basket Analysis\02 Data\Prepared Data\merged_orders_products.csv')

# Display the first few rows of the dataframe to confirm successful loading
print(merged_orders_products.head())

   order_id  product_id  add_to_cart_order  reordered           product_name  \
0         2       33120                  1          1     Organic Egg Whites   
1         2       28985                  2          1  Michigan Organic Kale   
2         2        9327                  3          0          Garlic Powder   
3         2       45918                  4          1         Coconut Butter   
4         2       30035                  5          0      Natural Sweetener   

   aisle_id  department_id  prices  
0        86             16    11.3  
1        83              4    13.4  
2       104             13     3.6  
3        19             13     8.4  
4        17             13    13.7  
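
The long hard-coded path above only works on one machine. As a minimal sketch, the path could be assembled with `pathlib` from a single project folder instead. The helper name `load_prepared` and the file `toy.csv` are hypothetical; only the `02 Data/Prepared Data` folder names come from the path used in this notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_prepared(project_dir, filename):
    """Read a CSV from the project's '02 Data/Prepared Data' folder."""
    csv_path = Path(project_dir) / "02 Data" / "Prepared Data" / filename
    return pd.read_csv(csv_path)


# Demo against a throwaway folder so the sketch runs anywhere;
# in the notebook, project_dir would be the Instacart Basket Analysis folder.
with tempfile.TemporaryDirectory() as project_dir:
    prepared_dir = Path(project_dir) / "02 Data" / "Prepared Data"
    prepared_dir.mkdir(parents=True)
    (prepared_dir / "toy.csv").write_text("order_id,prices\n2,11.3\n2,13.4\n")

    toy = load_prepared(project_dir, "toy.csv")

print(toy.shape)
```

Keeping the project folder in one variable means only that one value changes when the notebook moves to another computer.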


## 4. Derive New Variables

### 4.1 Create the "price_label" Column

Here, I created a column called `"price_label"` that groups products into three categories by price: **Low-range** (5 or less), **Mid-range** (above 5, up to 15), and **High-range** (above 15). This gives each row more context than the raw price alone.

In [25]:
# Create a new column "price_label" based on the price range
# - Low-range product: prices <= 5
# - Mid-range product: 5 < prices <= 15
# - High-range product: prices > 15
merged_orders_products['price_label'] = merged_orders_products['prices'].apply(
    lambda x: 'Low-range product' if x <= 5 else ('Mid-range product' if x <= 15 else 'High-range product')
)

# Check the first few rows to ensure the new column is created
print(merged_orders_products[['prices', 'price_label']].head())

   prices        price_label
0    11.3  Mid-range product
1    13.4  Mid-range product
2     3.6  Low-range product
3     8.4  Mid-range product
4    13.7  Mid-range product
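
The row-by-row `apply` above can be slow on a dataset with tens of millions of rows. As a sketch of one vectorized alternative, `pd.cut` can assign the same three labels from the same thresholds. The toy prices below are made up to exercise the boundaries at 5 and 15; they are not rows from the Instacart data.

```python
import numpy as np
import pandas as pd

# Toy prices chosen to hit both boundaries (5 and 15); illustrative only
sample = pd.DataFrame({"prices": [3.6, 5.0, 8.4, 15.0, 15.1]})

# Right-closed bins: (-inf, 5], (5, 15], (15, inf)
sample["price_label"] = pd.cut(
    sample["prices"],
    bins=[-np.inf, 5, 15, np.inf],
    labels=["Low-range product", "Mid-range product", "High-range product"],
)

print(sample)
```

One behavioral difference to keep in mind: `pd.cut` leaves a missing price unlabeled (NaN), whereas the lambda would label a missing price as a High-range product because both comparisons evaluate to False.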


### 4.2 Create the "busiest_day" Column
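Next, I added a column that classifies each day of the week as Busiest day, Least busy, or Regularly busy, based on the `order_dow` values. Because `order_dow` is not in the merged dataframe yet, I first merged it in from `orders.csv`. The code treats day 0 as the busiest day and day 4 as the least busy.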

In [12]:
# Load the orders.csv file to access order-related details
orders = pd.read_csv(r'D:\YVC\Data Analytics (CF)\Python Fundamentals for Data Analysts\Instacart Basket Analysis\02 Data\Original Data\orders.csv')

# Merge the 'order_dow' column from the orders data into the merged_orders_products dataframe
# This is needed to determine the busiest days of the week
merged_orders_products = merged_orders_products.merge(
    orders[['order_id', 'order_dow']],  # Selecting only relevant columns
    on='order_id',  # Merging based on 'order_id'
    how='left'  # Keeping all rows from merged_orders_products
)

# Confirm that the 'order_dow' column has been successfully added
print(merged_orders_products.head())

# Create a new column "busiest_day" to classify the day as:
# - 'Busiest day': order_dow == 0
# - 'Least busy': order_dow == 4
# - 'Regularly busy': all other days
merged_orders_products['busiest_day'] = merged_orders_products['order_dow'].apply(
    lambda x: 'Busiest day' if x == 0 else ('Least busy' if x == 4 else 'Regularly busy')
)

# Check the first few rows to confirm the column has been added
print(merged_orders_products[['order_dow', 'busiest_day']].head())

   order_id  product_id  add_to_cart_order  reordered           product_name  \
0         2       33120                  1          1     Organic Egg Whites   
1         2       28985                  2          1  Michigan Organic Kale   
2         2        9327                  3          0          Garlic Powder   
3         2       45918                  4          1         Coconut Butter   
4         2       30035                  5          0      Natural Sweetener   

   aisle_id  department_id  prices        price_label  order_dow  
0        86             16    11.3  Mid-range product          5  
1        83              4    13.4  Mid-range product          5  
2       104             13     3.6  Low-range product          5  
3        19             13     8.4  Mid-range product          5  
4        17             13    13.7  Mid-range product          5  
   order_dow     busiest_day
0          5  Regularly busy
1          5  Regularly busy
2          5  Regularly busy
3          5  Regularly busy
4          5  Regularly busy
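
The cell above hard-codes day 0 as the busiest and day 4 as the least busy. As a hedged sketch, those two days could instead be read from the data with `value_counts()`. The toy `order_dow` values are invented so that day 0 is the most frequent and day 4 the least frequent; the helper `label_day` is a hypothetical name.

```python
import pandas as pd

# Toy data: day 0 appears most often, day 4 least often (illustrative only)
sample = pd.DataFrame({"order_dow": [0, 0, 0, 1, 1, 4, 5, 5]})

# Count rows per day, then pick the most and least frequent day
day_counts = sample["order_dow"].value_counts()
busiest = day_counts.idxmax()
least_busy = day_counts.idxmin()


def label_day(dow):
    """Return the busiest_day label for a single order_dow value."""
    if dow == busiest:
        return "Busiest day"
    if dow == least_busy:
        return "Least busy"
    return "Regularly busy"


sample["busiest_day"] = sample["order_dow"].map(label_day)
print(day_counts)
print(sample)
```

Note that counting on the merged dataframe counts product rows, not orders; counting on `orders` instead would rank days by number of orders.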

### 4.3 Update "busiest_day" to Include Two Busiest and Least Busy Days
To get a clearer picture, I added a `"busiest_days"` column that flags the **two busiest** and **two least busy** days instead of just one of each. The original `"busiest_day"` column is left unchanged.

In [13]:
# The new "busiest_days" column labels:
# - Busiest days: days 0 and 1
# - Least busy days: days 4 and 5
# - Regularly busy: all other days

# Store the result in a separate "busiest_days" column so "busiest_day" is preserved
merged_orders_products['busiest_days'] = merged_orders_products['order_dow'].apply(
    lambda x: 'Busiest days' if x in [0, 1] else ('Least busy days' if x in [4, 5] else 'Regularly busy')
)

# Check the first few rows to confirm the updated column
print(merged_orders_products[['order_dow', 'busiest_days']].head())

   order_dow     busiest_days
0          5  Least busy days
1          5  Least busy days
2          5  Least busy days
3          5  Least busy days
4          5  Least busy days
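
The same two-busiest / two-least-busy labeling can also be written as a lookup table, which keeps the day assignments in one place. This is a minimal sketch on a toy column covering all seven days; the dictionary name `day_labels` is hypothetical.

```python
import pandas as pd

# One row per day of the week (illustrative only)
sample = pd.DataFrame({"order_dow": [0, 1, 2, 3, 4, 5, 6]})

# Days not listed here fall through to "Regularly busy"
day_labels = {
    0: "Busiest days",
    1: "Busiest days",
    4: "Least busy days",
    5: "Least busy days",
}

# map() returns NaN for days missing from the dictionary; fillna() supplies the default
sample["busiest_days"] = sample["order_dow"].map(day_labels).fillna("Regularly busy")

print(sample)
```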


### 4.4 Add the 'order_hour_of_day' Column
Here, I merged the `order_hour_of_day` column from the orders dataset so I could analyze the busiest hours of the day.

In [14]:
# Add the 'order_hour_of_day' column from orders.csv
merged_orders_products = merged_orders_products.merge(
    orders[['order_id', 'order_hour_of_day']],
    on='order_id',
    how='left'
)

# Check if the column is now included
print(merged_orders_products.head())

   order_id  product_id  add_to_cart_order  reordered           product_name  \
0         2       33120                  1          1     Organic Egg Whites   
1         2       28985                  2          1  Michigan Organic Kale   
2         2        9327                  3          0          Garlic Powder   
3         2       45918                  4          1         Coconut Butter   
4         2       30035                  5          0      Natural Sweetener   

   aisle_id  department_id  prices        price_label  order_dow  \
0        86             16    11.3  Mid-range product          5   
1        83              4    13.4  Mid-range product          5   
2       104             13     3.6  Low-range product          5   
3        19             13     8.4  Mid-range product          5   
4        17             13    13.7  Mid-range product          5   

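      busiest_day     busiest_days  order_hour_of_day  
0  Regularly busy  Least busy days                  9  
1  Regularly busy  Least busy days                  9  
2  Regularly busy  Least busy days                  9  
3  Regularly busy  Least busy days                  9  
4  Regularly busy  Least busy days                  9  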
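
Running a merge cell like this more than once makes pandas add `_x` and `_y` suffixes, because the column already exists on both sides. As a sketch of one way to make the step safe to re-run, a small guard can skip the merge when the column is already present. The helper name `add_order_column` and the toy rows are hypothetical; `validate="many_to_one"` is a real `merge` option that raises an error if `order_id` is not unique in `orders`.

```python
import pandas as pd

# Toy stand-ins for merged_orders_products and orders (illustrative only)
items = pd.DataFrame({"order_id": [2, 2, 3], "product_id": [33120, 28985, 33754]})
orders = pd.DataFrame({"order_id": [2, 3], "order_hour_of_day": [9, 17]})


def add_order_column(df, orders_df, column):
    """Left-merge one column from orders_df unless df already has it."""
    if column in df.columns:
        return df  # already merged: skip to avoid _x/_y duplicates
    return df.merge(
        orders_df[["order_id", column]],
        on="order_id",
        how="left",
        validate="many_to_one",  # each order_id may appear only once in orders_df
    )


items = add_order_column(items, orders, "order_hour_of_day")
items = add_order_column(items, orders, "order_hour_of_day")  # second call is a no-op

print(items)
```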

### 4.5 Create the "busiest_period_of_day" Column
Finally, I used the `order_hour_of_day` column to split the day into three periods: **Most orders**, **Fewest orders**, and **Average orders**.

In [20]:
# Categorize the time of day into:
# - Most orders: hours 10 through 16 (10 AM to 4 PM)
# - Fewest orders: before hour 8 (8 AM) or after hour 20 (8 PM)
# - Average orders: all other hours (8, 9, and 17 through 20)

# Create the "busiest_period_of_day" column based on order_hour_of_day
merged_orders_products['busiest_period_of_day'] = merged_orders_products['order_hour_of_day'].apply(
    lambda x: 'Most orders' if 10 <= x <= 16 else ('Fewest orders' if x < 8 or x > 20 else 'Average orders')
)

# Check the first few rows to confirm the new column is created
print(merged_orders_products[['order_hour_of_day', 'busiest_period_of_day']].head())

   order_hour_of_day busiest_period_of_day
0                  9        Average orders
1                  9        Average orders
2                  9        Average orders
3                  9        Average orders
4                  9        Average orders
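
The same hour rules can be expressed without `apply` by using `np.select`, which evaluates each condition on the whole column at once. This is a sketch under the thresholds stated above; the toy hours are chosen to sit on each side of the boundaries and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy hours on both sides of every boundary (illustrative only)
sample = pd.DataFrame({"order_hour_of_day": [7, 8, 9, 10, 16, 17, 20, 21]})
hour = sample["order_hour_of_day"]

# Conditions are checked in order; rows matching none get the default
conditions = [
    hour.between(10, 16),        # inclusive on both ends
    (hour < 8) | (hour > 20),
]
choices = ["Most orders", "Fewest orders"]

sample["busiest_period_of_day"] = np.select(conditions, choices, default="Average orders")

print(sample)
```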


In [103]:
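# Re-create the "busiest_period_of_day" column from 'order_hour_of_day_y'.
# pandas adds _x/_y suffixes when a merge brings in a column name that already
# exists, which happens if the merge cell in 4.4 is run more than once.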
merged_orders_products['busiest_period_of_day'] = merged_orders_products['order_hour_of_day_y'].apply(
    lambda x: 'Most orders' if 10 <= x <= 16 else ('Fewest orders' if x < 8 or x > 20 else 'Average orders')
)

# Check if the new column has been created successfully
print(merged_orders_products[['order_hour_of_day_y', 'busiest_period_of_day']].head())

   order_hour_of_day_y busiest_period_of_day
0                    9        Average orders
1                    9        Average orders
2                    9        Average orders
3                    9        Average orders
4                    9        Average orders


## 5. Consistency Checks
Before exporting the data, I ran two quick consistency checks: counting the missing values in each column and confirming each column's data type. Catching problems here avoids surprises in later analysis.

In [132]:
# Check for missing values in the dataframe
print(merged_orders_products.isnull().sum())

order_id                     0
product_id                   0
add_to_cart_order            0
reordered                    0
product_name             28171
aisle_id                     0
department_id                0
prices                       0
price_label                  0
order_dow                    0
busiest_day                  0
order_hour_of_day_x          0
order_hour_of_day_y          0
order_hour_of_day            0
busiest_period_of_day        0
dtype: int64


In [136]:
# Check the data types of all columns
print(merged_orders_products.dtypes)

order_id                   int64
product_id                 int64
add_to_cart_order          int64
reordered                  int64
product_name              object
aisle_id                   int64
department_id              int64
prices                   float64
price_label               object
order_dow                  int64
busiest_day               object
order_hour_of_day_x        int64
order_hour_of_day_y        int64
order_hour_of_day          int64
busiest_period_of_day     object
dtype: object


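##### Output:
The only column with missing values is `product_name` (28,171 rows); every other column is complete, and the data types are as expected (integers for IDs and counts, a float for `prices`, and objects for the text labels). The output also shows three copies of the hour column (`order_hour_of_day_x`, `order_hour_of_day_y`, and `order_hour_of_day`), most likely left over from re-running the merge in 4.4.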
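
As a hedged sketch of how these checks could be narrowed down, the snippet below lists only the columns that actually contain missing values and drops suffixed columns that are exact copies of their unsuffixed counterpart. The toy dataframe imitates the column names seen above; its values are invented.

```python
import pandas as pd

# Toy frame mimicking the duplicated hour columns (values are illustrative)
sample = pd.DataFrame({
    "order_id": [2, 2, 3],
    "product_name": ["Organic Egg Whites", None, "Garlic Powder"],
    "order_hour_of_day_x": [9, 9, 17],
    "order_hour_of_day_y": [9, 9, 17],
    "order_hour_of_day": [9, 9, 17],
})

# Keep only the columns that have at least one missing value
null_counts = sample.isnull().sum()
columns_with_nulls = null_counts[null_counts > 0]

# A suffixed column is redundant if its unsuffixed twin exists with identical values
redundant = [
    col
    for col in sample.columns
    if col.endswith(("_x", "_y"))
    and col[:-2] in sample.columns
    and (sample[col] == sample[col[:-2]]).all()
]
cleaned = sample.drop(columns=redundant)

print(columns_with_nulls)
print(cleaned.columns.tolist())
```

Comparing values before dropping matters: if a suffixed column differed from the unsuffixed one, that would point to a merge problem worth investigating rather than a harmless duplicate.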

## 6. Verify Data and Export
Before exporting the data, I checked the value counts of `busiest_period_of_day` to verify that the new column looked correct. Then, I saved the updated dataset as a pickle file for efficient storage and future use.

In [212]:
# Verify the new column values
print(merged_orders_products['busiest_period_of_day'].value_counts())

# Export the dataframe as a pickle file
merged_orders_products.to_pickle(r'D:\YVC\Data Analytics (CF)\Python Fundamentals for Data Analysts\Instacart Basket Analysis\02 Data\Prepared Data\merged_orders_products_updated.pkl')
print("Data exported successfully.")

busiest_period_of_day
Most orders       18680601
Average orders    10140386
Fewest orders      3613225
Name: count, dtype: int64
Data exported successfully.
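
To confirm that an export like this can be read back intact, the file can be reloaded and compared with the original dataframe. This is a minimal sketch using a temporary folder and a toy dataframe; the file name `sample_updated.pkl` is hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for merged_orders_products (illustrative only)
sample = pd.DataFrame({
    "order_hour_of_day": [9, 12, 22],
    "busiest_period_of_day": ["Average orders", "Most orders", "Fewest orders"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "sample_updated.pkl"
    sample.to_pickle(out_path)           # export
    reloaded = pd.read_pickle(out_path)  # read it straight back

# equals() checks values and dtypes, so True means the round trip lost nothing
round_trip_ok = reloaded.equals(sample)
print(round_trip_ok)
```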


## 7. Conclusion
In this notebook, I created several new variables to make the Instacart data more insightful. These new variables help with segmenting data based on price, busy days, and periods of the day. With this cleaner, more detailed dataset, I’m ready to dive deeper into customer behavior and improve future analyses.