## Task 4.8 - Grouping Data & Aggregating Variables

### Step 1: Import Libraries and Define Path

In this step, I import the required Python libraries and set the path to access our cleaned and merged dataset from previous tasks.

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import os

In [2]:
# Set path to your project directory
path = r'/Users/canancengel/A4_Instacart Basket Analysis/02_Data/Prepared Data'

### Step 2: Load the Prepared DataFrame

I load the `ords_prods_merge.pkl` file that contains our prepared Instacart data from Task 4.7.

In [3]:
df = pd.read_pickle(os.path.join(path, 'ords_prods_merge.pkl'))

In [4]:
# Create a subset of the first 1 million rows
df_subset = df[:1000000]
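
Note that slicing with `df[:1000000]` keeps the first 1 million rows in stored order rather than a random selection. As a minimal sketch on made-up toy data (the `toy` frame, the subset size, and the seed are illustrative assumptions, not part of the project), the difference between a head slice and a random sample looks like this:

```python
import pandas as pd

# Toy stand-in for the merged dataframe (hypothetical values)
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 3],
    "order_number": [1, 2, 1, 2, 1, 2],
})

# Head slice: always the first rows, so later users never appear
head_subset = toy[:3]

# Random sample: rows drawn from the whole frame (seed chosen arbitrarily)
random_subset = toy.sample(n=3, random_state=42)

print(head_subset["user_id"].tolist())
print(sorted(random_subset.index.tolist()))
```

A head slice is quicker to reason about, but a random sample is usually the safer choice when the subset is meant to approximate the full data.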

In [5]:
# Check shape and preview
print(df_subset.shape)
df_subset.head()

(1000000, 15)


|  | order_id | user_id | order_number | order_dow | order_hour_of_day | days_since_prior_order | new_customer | product_id | add_to_cart_order | reordered | _merge | product_name | aisle_id | department_id | prices |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | 1 | 2 | 8 | NaN | True | 196 | 1 | 0 | both | Soda | 77 | 7 | 9.0 |
| 1 | 2539329 | 1 | 1 | 2 | 8 | NaN | True | 14084 | 2 | 0 | both | Organic Unsweetened Vanilla Almond Milk | 91 | 16 | 12.5 |
| 2 | 2539329 | 1 | 1 | 2 | 8 | NaN | True | 12427 | 3 | 0 | both | Original Beef Jerky | 23 | 19 | 4.4 |
| 3 | 2539329 | 1 | 1 | 2 | 8 | NaN | True | 26088 | 4 | 0 | both | Aged White Cheddar Popcorn | 23 | 19 | 4.7 |
| 4 | 2539329 | 1 | 1 | 2 | 8 | NaN | True | 26405 | 5 | 0 | both | XL Pick-A-Size Paper Towel Rolls | 54 | 17 | 1.0 |


### Step 3: Calculate Mean Order Number by Department (Entire DataFrame)

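This step calculates the mean `order_number` for each department to analyze general order frequency patterns. Because `order_number` is the position of an order in a customer's history, a higher mean suggests that the department's products are bought more often in customers' later orders.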

In [6]:
# Group by department and calculate mean order number
mean_orders_full = df.groupby('department_id')['order_number'].mean()

In [7]:
# Display the results
mean_orders_full

department_id
1     15.457838
2     17.277920
3     17.170395
4     17.811403
5     15.215751
6     16.439806
7     17.225802
8     15.340650
9     15.895474
10    20.197148
11    16.170638
12    15.887671
13    16.583536
14    16.773669
15    16.165037
16    17.665606
17    15.694469
18    19.310397
19    17.177343
20    16.473447
21    22.902379
Name: order_number, dtype: float64

Analysis:
The table shows the mean `order_number` per department based on the full dataset. Compared with the subset of the first 1 million rows (computed in the next cells), the department means differ by up to about 1.1 in either direction: the full dataset is higher in 9 departments (for example department 21, 22.90 vs. 21.96) and lower in the other 12 (for example department 8, 15.34 vs. 16.46). Because `df[:1000000]` takes the first rows rather than a random sample, the subset likely covers only part of the customer base. The subset is therefore useful for a quick preview, while the entire dataset provides a more reliable basis for business decision-making.

In [8]:
# Subset mean order number per department
mean_orders_subset = df_subset.groupby('department_id')['order_number'].mean()

In [9]:
# Display the subset means
mean_orders_subset

department_id
1     14.800024
2     17.091743
3     17.913544
4     17.893092
5     15.214270
6     15.382135
7     17.694027
8     16.458105
9     15.957363
10    20.091818
11    16.482026
12    15.615061
13    16.484023
14    17.524632
15    15.691875
16    18.014071
17    16.150593
18    19.602850
19    17.631340
20    17.138607
21    21.956893
Name: order_number, dtype: float64

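Comparison: Both results show the same overall pattern. Departments 21, 10, and 18 have the highest means in both, and no department differs by more than about 1.2 between the two. The remaining differences come from the subset covering only the first 1 million rows; the full dataset provides the more complete view.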
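
To make such a comparison easier to read, the two results can be placed side by side. The following is a minimal sketch that uses three department values copied from the outputs above as stand-ins; the names `mean_full`, `mean_sub`, and `comparison` are illustrative assumptions.

```python
import pandas as pd

# Stand-ins for the two result Series (departments 1, 8 and 21 from the outputs above)
mean_full = pd.Series({1: 15.457838, 8: 15.340650, 21: 22.902379}, name="full")
mean_sub = pd.Series({1: 14.800024, 8: 16.458105, 21: 21.956893}, name="subset")

# Align on department_id and compute the signed difference (subset minus full)
comparison = pd.concat([mean_full, mean_sub], axis=1)
comparison["diff"] = comparison["subset"] - comparison["full"]

print(comparison.round(3))
```

In the notebook itself, the same pattern should work with `mean_orders_full.rename('full')` and `mean_orders_subset.rename('subset')` in place of the toy Series.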

### Step 4: Create a Max Order Column per User

This step creates a new column `max_order` that holds the highest `order_number` for each user, which corresponds to the total number of orders that user has made.

In [10]:
df['max_order'] = df.groupby('user_id')['order_number'].transform('max')
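
As a small illustration of why `transform` is used here instead of a plain aggregation, the sketch below runs both on made-up toy data (the `toy` frame is an assumption for demonstration only). An aggregation returns one value per user, while `transform` repeats that value on every row of the user, so it can be assigned straight back as a column.

```python
import pandas as pd

# Hypothetical toy data: user 1 has three orders, user 2 has two
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "order_number": [1, 2, 3, 1, 2],
})

# Aggregation: one value per user
per_user = toy.groupby("user_id")["order_number"].max()

# transform: the per-user value broadcast back to every row
toy["max_order"] = toy.groupby("user_id")["order_number"].transform("max")

print(per_user.to_dict())          # {1: 3, 2: 2}
print(toy["max_order"].tolist())   # [3, 3, 3, 2, 2]
```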

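### Step 5: Create a Loyalty Flag

Users are segmented into loyalty categories based on `max_order`: Loyal (more than 40 orders), Regular (more than 10 and up to 40 orders), and New (10 orders or fewer).

In [11]:
# More than 40 orders: loyal customer
df.loc[df['max_order'] > 40, 'loyalty_flag'] = 'Loyal customer'

In [12]:
# More than 10 and up to 40 orders: regular customer
df.loc[(df['max_order'] <= 40) & (df['max_order'] > 10), 'loyalty_flag'] = 'Regular customer'

In [13]:
# 10 orders or fewer: new customer
df.loc[df['max_order'] <= 10, 'loyalty_flag'] = 'New customer'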
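
The same thresholds could also be expressed in a single call with `pd.cut`. This is only an alternative sketch, not the approach used in this notebook; the `max_order` values below are made up and include both boundary values to show which side of each threshold they fall on.

```python
import pandas as pd

# Hypothetical per-user maximum order numbers, including the boundaries 10 and 40
max_order = pd.Series([3, 10, 11, 40, 41, 99])

# Right-closed bins: (0, 10], (10, 40], (40, inf]
loyalty = pd.cut(
    max_order,
    bins=[0, 10, 40, float("inf")],
    labels=["New customer", "Regular customer", "Loyal customer"],
)

print(loyalty.tolist())
```

One difference worth noting is that `pd.cut` returns a categorical column rather than plain strings.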

### Step 6: Display Distribution of Loyalty Flag

This section displays how many rows fall into each loyalty category. Because `df` has one row per ordered product, the counts are product rows, not distinct users.

In [14]:
print(df['loyalty_flag'].value_counts(dropna=False))

loyalty_flag
Regular customer    15876776
Loyal customer      10284093
New customer         6243990
Name: count, dtype: int64
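
To count distinct users instead of rows, one option is to keep a single row per `user_id` before counting. The sketch below uses made-up toy data (the `toy` frame is an assumption) to show how the two views can differ when some users have many more rows than others.

```python
import pandas as pd

# Hypothetical toy data: one loyal user with four rows, two new users with one row each
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 2, 3],
    "loyalty_flag": ["Loyal customer"] * 4 + ["New customer", "New customer"],
})

# Row-level counts, as in the cell above
row_counts = toy["loyalty_flag"].value_counts()

# User-level counts: one row per user, then count
user_counts = toy.drop_duplicates("user_id")["loyalty_flag"].value_counts()

print(row_counts.to_dict())
print(user_counts.to_dict())
```

In this toy case the loyal segment dominates by rows but is the minority by users, which is why row counts should not be read directly as user shares.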


### Step 7: Calculate Average Days Between Orders per User

I calculate each user's mean number of days since their prior order from the `days_since_prior_order` column and store it in `order_frequency`.

In [15]:
# Calculate average days between orders per user
df['order_frequency'] = df.groupby('user_id')['days_since_prior_order'].transform('mean')
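
A user's first order has no prior order, so its `days_since_prior_order` is missing. The sketch below, on made-up toy data, illustrates how the grouped mean handles this: missing values are skipped, and a user with no non-missing value at all ends up with `NaN`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: user 1 has three rows, user 2 only a first order
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "days_since_prior_order": [np.nan, 10.0, 20.0, np.nan],
})

# Per-user mean broadcast to every row; NaN values are ignored in the mean
toy["order_frequency"] = (
    toy.groupby("user_id")["days_since_prior_order"].transform("mean")
)

print(toy["order_frequency"].tolist())
```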

### Step 8: Create an Income Flag

Based on the average price of the products they buy (`avg_price`), users are labeled "High spender" (average of 10 or more) or "Low spender" (average below 10).

In [16]:
# Average product price per user
df['avg_price'] = df.groupby('user_id')['prices'].transform('mean')

# Flag users by their average product price
df.loc[df['avg_price'] >= 10, 'income_flag'] = 'High spender'
df.loc[df['avg_price'] < 10, 'income_flag'] = 'Low spender'

In [17]:
print(df['income_flag'].value_counts(dropna=False))

income_flag
Low spender     31770614
High spender      634245
Name: count, dtype: int64


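### Step 9: Create Order Frequency Flag Using Median Days

Using each user's median `days_since_prior_order`, I segment users into frequency categories: Frequent (10 days or fewer), Regular (more than 10 and up to 20 days), and Non-frequent (more than 20 days).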

In [18]:
# Create order frequency flag based on median of days_since_prior_order
df['median_days'] = df.groupby('user_id')['days_since_prior_order'].transform('median')

df.loc[df['median_days'] > 20, 'order_frequency_flag'] = 'Non-frequent customer'
df.loc[(df['median_days'] > 10) & (df['median_days'] <= 20), 'order_frequency_flag'] = 'Regular customer'
df.loc[df['median_days'] <= 10, 'order_frequency_flag'] = 'Frequent customer'
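
One reason to prefer the median over the mean for this flag is that a single unusually long gap can shift the mean noticeably. A minimal sketch with made-up gap values for one hypothetical user:

```python
import pandas as pd

# Hypothetical gaps (in days) between one user's orders, with one long pause
gaps = pd.Series([5.0, 6.0, 7.0, 30.0])

mean_gap = gaps.mean()      # pulled upward by the 30-day gap
median_gap = gaps.median()  # stays close to the typical gap

print(mean_gap, median_gap)
```

With the thresholds used above, the mean (12.0) would place this user in the Regular segment, while the median (6.5) places them in the Frequent segment.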

### Step 10: Display Distribution of Order Frequency Flag

The output shows how many rows fall into each frequency category: Frequent, Regular, and Non-frequent.

In [19]:
print(df['order_frequency_flag'].value_counts(dropna=False))

order_frequency_flag
Frequent customer        21559853
Regular customer          7208564
Non-frequent customer     3636437
NaN                             5
Name: count, dtype: int64


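Interpretation: About two-thirds of the rows (21,559,853 of 32,404,859) belong to frequent customers, suggesting high platform engagement and regular ordering behavior. These counts are product rows, so customers who order more are likely overrepresented relative to their share of users. The 5 `NaN` rows have no median, presumably because those users have no non-missing `days_since_prior_order` value.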

### Step 11: Create a Customer-Level Aggregated DataFrame

This step creates a new dataframe with one row per user and two aggregated metrics (total orders and average days between orders) for deeper analysis.

In [21]:
# Create a customer-level dataframe with one row per user
customer_df = df.groupby('user_id').agg({
    'order_number': 'max',
    'order_frequency': 'mean'
}).reset_index()

In [22]:
# Rename columns for clarity
customer_df.rename(columns={
    'order_number': 'total_orders',
    'order_frequency': 'avg_days_between_orders'
}, inplace=True)

In [23]:
# Check the head of the new customer-level dataframe
customer_df.head()

|  | user_id | total_orders | avg_days_between_orders |
|---|---|---|---|
| 0 | 1 | 10 | 20.259259 |
| 1 | 2 | 14 | 15.967033 |
| 2 | 3 | 12 | 11.487179 |
| 3 | 4 | 5 | 15.357143 |
| 4 | 5 | 4 | 14.5 |
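
The aggregation and the rename could also be combined in one step with pandas named aggregation. This is an alternative sketch on made-up toy data, not the code used above; the `toy` and `customer_toy` names are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy data with the two columns used in the aggregation
toy = pd.DataFrame({
    "user_id": [1, 1, 2],
    "order_number": [1, 2, 1],
    "order_frequency": [12.0, 12.0, 30.0],
})

# Named aggregation: output column name = (input column, function)
customer_toy = toy.groupby("user_id", as_index=False).agg(
    total_orders=("order_number", "max"),
    avg_days_between_orders=("order_frequency", "mean"),
)

print(customer_toy)
```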


### Step 12: Export Final DataFrames

I save both the customer-level dataframe and the enriched full dataframe as pickle files for future use or further analysis.

In [24]:
# Export the customer-level dataframe as a pickle file
customer_df.to_pickle(os.path.join(path, 'customer_lifetime_value.pkl'))

In [25]:
df.to_pickle(os.path.join(path, 'df_grouped.pkl'))
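
After exporting, a quick round-trip check can confirm that a file reads back unchanged. The sketch below uses a made-up toy dataframe and a temporary directory so it does not touch the project folder; the file name `toy_export.pkl` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy dataframe standing in for the exported data
toy = pd.DataFrame({"user_id": [1, 2], "total_orders": [10, 14]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_file = os.path.join(tmp_dir, "toy_export.pkl")  # hypothetical file name
    toy.to_pickle(out_file)
    reloaded = pd.read_pickle(out_file)

# True if values, dtypes and index survived the round trip
print(reloaded.equals(toy))
```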

## Summary:
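- Nearly 89% of rows belong to customers in the "Frequent" (about 67%) and "Regular" (about 22%) order frequency segments, showing active platform usage.
- About 98% of rows belong to "Low spender" customers (average product price below 10), possibly indicating price sensitivity.
- About 32% of rows belong to "Loyal customer" users (more than 40 orders), which is a positive signal for customer retention. Because these shares are based on product rows, the share of distinct loyal users is likely smaller.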

These insights can be used to inform marketing strategies and personalization efforts.