# 4.8 IC Grouping Data & Aggregating Variables Task

## Contents

### 01. Importing Libraries

### 02. Split Data into Groups and Aggregate

### 03. Analyse Result

### 04. Create a Loyalty Flag

### 05. Check Statistics

### 06. Create a Spending Flag

### 07. Create a Frequency Flag

### 08. Export Dataframe

#### 01. Importing Libraries

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import os

In [2]:
# Set the project folder path (used for both import and export)
path = r'/Users/yaseminmustafa/Desktop/CareerFoundry/Exercise 4/15-05-2025_Instacart Basket Analysis'

In [3]:
# Import the ords_prods_merge dataframe
ords_prods_merge = pd.read_pickle(os.path.join(path, '02_Data', 'Prepared Data', 'ords_prods_merge_aggregated.pkl'))

In [4]:
# Check output
ords_prods_merge.head()

Unnamed: 0,order_id,user_id,dataset_label,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,...,product_name,aisle_id,department_id,prices,price_range_loc,busiest_day,busiest_days,busiest_period_of_day,max_order,loyalty_flag
0,2,202279.0,prior,3.0,5.0,9.0,8.0,33120.0,1.0,1.0,...,Organic Egg Whites,86,16,11.3,Mid-range product,Regularly busy,Regularly busy,Average orders,8.0,New customer
1,2,202279.0,prior,3.0,5.0,9.0,8.0,28985.0,2.0,1.0,...,Michigan Organic Kale,83,4,13.4,Mid-range product,Regularly busy,Regularly busy,Average orders,8.0,New customer
2,2,202279.0,prior,3.0,5.0,9.0,8.0,9327.0,3.0,0.0,...,Garlic Powder,104,13,3.6,Low-range product,Regularly busy,Regularly busy,Average orders,8.0,New customer
3,2,202279.0,prior,3.0,5.0,9.0,8.0,45918.0,4.0,1.0,...,Coconut Butter,19,13,8.4,Mid-range product,Regularly busy,Regularly busy,Average orders,8.0,New customer
4,2,202279.0,prior,3.0,5.0,9.0,8.0,30035.0,5.0,0.0,...,Natural Sweetener,17,13,13.7,Mid-range product,Regularly busy,Regularly busy,Average orders,8.0,New customer


In [5]:
# Check shape
ords_prods_merge.shape

(32404859, 21)

#### 02. Split Data into Groups and Aggregate

In [6]:
# Split the data into groups based on department_id and apply agg()
ords_prods_merge.groupby('department_id').agg({'order_number': ['mean']})

department_id,order_number (mean)
1,16.559358
2,18.413176
3,18.2796
4,18.91589
5,16.497751
6,17.60939
7,18.303975
8,16.383301
9,17.022963
10,21.227447
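
The same `groupby()` + `agg()` pattern can return several statistics at once. The sketch below uses a small made-up dataframe (`toy` and its values are assumptions for illustration, not part of the Instacart data) and pandas named aggregation to give the result flat column names.

```python
import pandas as pd

# Toy stand-in for ords_prods_merge; the values are invented for illustration.
toy = pd.DataFrame({
    "department_id": [1, 1, 2, 2, 2],
    "order_number": [2, 4, 10, 20, 30],
})

# One row per department, with several statistics of order_number.
# Named aggregation (new_name=(column, function)) avoids a multi-level header.
dept_stats = toy.groupby("department_id").agg(
    mean_order=("order_number", "mean"),
    min_order=("order_number", "min"),
    max_order=("order_number", "max"),
)
print(dept_stats)
```

Named aggregation is one option; the dictionary form used in cell 6 works equally well when a two-level column header is acceptable.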


#### 03. Analyse Result

##### Most departments in the subset have slightly higher mean `order_number` values than in the full dataset. A likely explanation is that the subset captures more recent or sequential order activity, which would skew the averages upward, particularly if frequent shoppers are overrepresented in the early rows of the dataset.

#### 04. Create a Loyalty Flag

##### The loyalty flag was already created with `transform()` and the `loc` indexer in the previous steps and was included in the exported DataFrame. However, I have repeated the steps here.

In [7]:
# Split the data into groups based on the "user_id" column.
# Apply transform() to the "order_number" column to get the maximum order number for each user.
# Store the result in a new column, "max_order".
# Passing the string 'max' instead of np.max avoids the pandas FutureWarning.
ords_prods_merge['max_order'] = ords_prods_merge.groupby(['user_id'])['order_number'].transform('max')


In [8]:
# Check output
ords_prods_merge.head()

Unnamed: 0,order_id,user_id,dataset_label,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,...,product_name,aisle_id,department_id,prices,price_range_loc,busiest_day,busiest_days,busiest_period_of_day,max_order,loyalty_flag
0,2,202279.0,prior,3.0,5.0,9.0,8.0,33120.0,1.0,1.0,...,Organic Egg Whites,86,16,11.3,Mid-range product,Regularly busy,Regularly busy,Average orders,8.0,New customer
1,2,202279.0,prior,3.0,5.0,9.0,8.0,28985.0,2.0,1.0,...,Michigan Organic Kale,83,4,13.4,Mid-range product,Regularly busy,Regularly busy,Average orders,8.0,New customer
2,2,202279.0,prior,3.0,5.0,9.0,8.0,9327.0,3.0,0.0,...,Garlic Powder,104,13,3.6,Low-range product,Regularly busy,Regularly busy,Average orders,8.0,New customer
3,2,202279.0,prior,3.0,5.0,9.0,8.0,45918.0,4.0,1.0,...,Coconut Butter,19,13,8.4,Mid-range product,Regularly busy,Regularly busy,Average orders,8.0,New customer
4,2,202279.0,prior,3.0,5.0,9.0,8.0,30035.0,5.0,0.0,...,Natural Sweetener,17,13,13.7,Mid-range product,Regularly busy,Regularly busy,Average orders,8.0,New customer


In [9]:
# Check shape 
ords_prods_merge.shape

(32404859, 21)
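
The shape stays at 32,404,859 rows because `transform()` returns one value per original row, whereas `agg()` returns one row per group. A minimal sketch of the difference, using a made-up dataframe (`toy` is hypothetical):

```python
import pandas as pd

# Toy data: two users with three and two orders respectively.
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "order_number": [1, 2, 3, 1, 2],
})

# agg() collapses each group to a single row (one value per user).
per_user = toy.groupby("user_id")["order_number"].agg("max")

# transform() broadcasts the group result back to every row,
# so it can be assigned directly as a new column.
toy["max_order"] = toy.groupby("user_id")["order_number"].transform("max")

print(per_user)
print(toy)
```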

In [10]:
# Conditional assignment with loc: users with more than 40 orders are loyal customers
ords_prods_merge.loc[ords_prods_merge['max_order'] > 40, 'loyalty_flag'] = 'Loyal customer'

In [11]:
ords_prods_merge.loc[(ords_prods_merge['max_order'] <= 40) & (ords_prods_merge['max_order'] > 10), 'loyalty_flag'] = 'Regular customer'

In [12]:
ords_prods_merge.loc[ords_prods_merge['max_order'] <= 10, 'loyalty_flag'] = 'New customer'

In [13]:
# Print frequency 
ords_prods_merge['loyalty_flag'].value_counts(dropna = False)

loyalty_flag
Regular customer    15081691
Loyal customer      10095381
New customer         5151691
NaN                  2076096
Name: count, dtype: int64
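
As an alternative to three separate `loc` assignments, the same thresholds can be expressed in one call with `pd.cut()`. This is only a sketch on made-up values (`toy` is hypothetical), not the method used in this notebook. Note that it returns a categorical column rather than plain strings, and rows with a missing `max_order` stay `NaN`, as in the output above.

```python
import numpy as np
import pandas as pd

# Toy max_order values, including the boundary cases and a missing value.
toy = pd.DataFrame({"max_order": [8, 10, 11, 40, 41, np.nan]})

# Bins are right-inclusive by default: (0, 10], (10, 40], (40, inf)
toy["loyalty_flag"] = pd.cut(
    toy["max_order"],
    bins=[0, 10, 40, np.inf],
    labels=["New customer", "Regular customer", "Loyal customer"],
)

print(toy["loyalty_flag"].value_counts(dropna=False))
```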

In [14]:
# Check shape
ords_prods_merge.shape

(32404859, 21)

In [15]:
# Identify specific columns
ords_prods_merge[['user_id', 'loyalty_flag', 'order_number']].head(15)

Unnamed: 0,user_id,loyalty_flag,order_number
0,202279.0,New customer,3.0
1,202279.0,New customer,3.0
2,202279.0,New customer,3.0
3,202279.0,New customer,3.0
4,202279.0,New customer,3.0
5,202279.0,New customer,3.0
6,202279.0,New customer,3.0
7,202279.0,New customer,3.0
8,202279.0,New customer,3.0
9,205970.0,Regular customer,16.0


#### 05. Check Statistics

In [16]:
# Group by loyalty_flag and describe prices
ords_prods_merge.groupby('loyalty_flag')['prices'].describe()

loyalty_flag,count,mean,std,min,25%,50%,75%,max
Loyal customer,10094068.0,7.774068,4.206601,1.0,4.2,7.4,11.2,25.0
New customer,5150733.0,7.802315,4.267897,1.0,4.2,7.4,11.3,25.0
Regular customer,15079186.0,7.799374,4.255452,1.0,4.2,7.4,11.3,25.0


#### 06. Create a Spending Flag

In [17]:
# Split the data into groups based on the “user_id” column.
# Apply the transform() function on the “prices” column to generate the average price per user. 
# Create a new column, “avg_spending,” into which you’ll place the results of your aggregation.
ords_prods_merge['avg_spending'] = ords_prods_merge.groupby('user_id')['prices'].transform('mean')

In [18]:
# Check output
ords_prods_merge.head()

Unnamed: 0,order_id,user_id,dataset_label,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,...,aisle_id,department_id,prices,price_range_loc,busiest_day,busiest_days,busiest_period_of_day,max_order,loyalty_flag,avg_spending
0,2,202279.0,prior,3.0,5.0,9.0,8.0,33120.0,1.0,1.0,...,86,16,11.3,Mid-range product,Regularly busy,Regularly busy,Average orders,8.0,New customer,8.6875
1,2,202279.0,prior,3.0,5.0,9.0,8.0,28985.0,2.0,1.0,...,83,4,13.4,Mid-range product,Regularly busy,Regularly busy,Average orders,8.0,New customer,8.6875
2,2,202279.0,prior,3.0,5.0,9.0,8.0,9327.0,3.0,0.0,...,104,13,3.6,Low-range product,Regularly busy,Regularly busy,Average orders,8.0,New customer,8.6875
3,2,202279.0,prior,3.0,5.0,9.0,8.0,45918.0,4.0,1.0,...,19,13,8.4,Mid-range product,Regularly busy,Regularly busy,Average orders,8.0,New customer,8.6875
4,2,202279.0,prior,3.0,5.0,9.0,8.0,30035.0,5.0,0.0,...,17,13,13.7,Mid-range product,Regularly busy,Regularly busy,Average orders,8.0,New customer,8.6875


In [19]:
# Check shape
ords_prods_merge.shape

(32404859, 22)

In [20]:
# Conditional assignment with loc: average spending below 10 is a low spender
ords_prods_merge.loc[ords_prods_merge['avg_spending'] < 10, 'spending_flag'] = 'Low spender'

In [21]:
# Conditional assignment with loc: average spending of 10 or more is a high spender
ords_prods_merge.loc[ords_prods_merge['avg_spending'] >= 10, 'spending_flag'] = 'High spender'

In [22]:
# Print frequency 
ords_prods_merge['spending_flag'].value_counts(dropna = False)

spending_flag
Low spender     30207666
NaN              2076096
High spender      121097
Name: count, dtype: int64
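
The `NaN` count here matches the one for `loyalty_flag`. One plausible reason, which would need to be checked against the real data, is that comparisons with a missing value are always `False`, so rows with a missing `avg_spending` satisfy neither condition and receive no flag. A small sketch of that behaviour on made-up values (`toy` is hypothetical):

```python
import numpy as np
import pandas as pd

# Toy avg_spending values, including one missing value.
toy = pd.DataFrame({"avg_spending": [8.7, 10.0, 12.5, np.nan]})

toy.loc[toy["avg_spending"] < 10, "spending_flag"] = "Low spender"
toy.loc[toy["avg_spending"] >= 10, "spending_flag"] = "High spender"

# NaN < 10 and NaN >= 10 are both False, so the last row gets no flag.
unflagged = toy["spending_flag"].isna()
missing_input = toy["avg_spending"].isna()

# True when every unflagged row is explained by a missing input value.
print((unflagged == missing_input).all())
```

Running the same comparison on `ords_prods_merge` would show whether the unflagged rows are fully explained by missing inputs.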

In [23]:
# Check output
ords_prods_merge.head()

Unnamed: 0,order_id,user_id,dataset_label,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,...,department_id,prices,price_range_loc,busiest_day,busiest_days,busiest_period_of_day,max_order,loyalty_flag,avg_spending,spending_flag
0,2,202279.0,prior,3.0,5.0,9.0,8.0,33120.0,1.0,1.0,...,16,11.3,Mid-range product,Regularly busy,Regularly busy,Average orders,8.0,New customer,8.6875,Low spender
1,2,202279.0,prior,3.0,5.0,9.0,8.0,28985.0,2.0,1.0,...,4,13.4,Mid-range product,Regularly busy,Regularly busy,Average orders,8.0,New customer,8.6875,Low spender
2,2,202279.0,prior,3.0,5.0,9.0,8.0,9327.0,3.0,0.0,...,13,3.6,Low-range product,Regularly busy,Regularly busy,Average orders,8.0,New customer,8.6875,Low spender
3,2,202279.0,prior,3.0,5.0,9.0,8.0,45918.0,4.0,1.0,...,13,8.4,Mid-range product,Regularly busy,Regularly busy,Average orders,8.0,New customer,8.6875,Low spender
4,2,202279.0,prior,3.0,5.0,9.0,8.0,30035.0,5.0,0.0,...,13,13.7,Mid-range product,Regularly busy,Regularly busy,Average orders,8.0,New customer,8.6875,Low spender


In [24]:
# Check shape
ords_prods_merge.shape

(32404859, 23)

#### 07. Create a Frequency Flag

In [25]:
# Split the data into groups based on the "user_id" column.
# Apply transform() to the "days_since_prior_order" column to get the median number of days between orders per user.
# Store the result in a new column, "median_order".
ords_prods_merge['median_order'] = ords_prods_merge.groupby('user_id')['days_since_prior_order'].transform('median')

In [26]:
# Check output
ords_prods_merge.head()

Unnamed: 0,order_id,user_id,dataset_label,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,...,prices,price_range_loc,busiest_day,busiest_days,busiest_period_of_day,max_order,loyalty_flag,avg_spending,spending_flag,median_order
0,2,202279.0,prior,3.0,5.0,9.0,8.0,33120.0,1.0,1.0,...,11.3,Mid-range product,Regularly busy,Regularly busy,Average orders,8.0,New customer,8.6875,Low spender,24.0
1,2,202279.0,prior,3.0,5.0,9.0,8.0,28985.0,2.0,1.0,...,13.4,Mid-range product,Regularly busy,Regularly busy,Average orders,8.0,New customer,8.6875,Low spender,24.0
2,2,202279.0,prior,3.0,5.0,9.0,8.0,9327.0,3.0,0.0,...,3.6,Low-range product,Regularly busy,Regularly busy,Average orders,8.0,New customer,8.6875,Low spender,24.0
3,2,202279.0,prior,3.0,5.0,9.0,8.0,45918.0,4.0,1.0,...,8.4,Mid-range product,Regularly busy,Regularly busy,Average orders,8.0,New customer,8.6875,Low spender,24.0
4,2,202279.0,prior,3.0,5.0,9.0,8.0,30035.0,5.0,0.0,...,13.7,Mid-range product,Regularly busy,Regularly busy,Average orders,8.0,New customer,8.6875,Low spender,24.0


In [27]:
# Check shape
ords_prods_merge.shape

(32404859, 24)
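
For reference, the median transform skips missing values within each group, which matters if a user's first order has no `days_since_prior_order`. A minimal sketch on made-up values (`toy` is hypothetical, and whether first orders are missing in the real data is an assumption):

```python
import numpy as np
import pandas as pd

# Toy data: the first order of each user has no prior order, hence NaN.
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "days_since_prior_order": [np.nan, 7.0, 9.0, np.nan, 30.0],
})

# The median ignores NaN within each group and is broadcast to every row.
toy["median_order"] = (
    toy.groupby("user_id")["days_since_prior_order"].transform("median")
)

print(toy)
```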

In [28]:
# Conditional assignment with loc: a median gap of more than 20 days is a non-frequent customer
ords_prods_merge.loc[ords_prods_merge['median_order'] > 20, 'order_frequency_flag'] = 'Non-frequent customer'

In [29]:
ords_prods_merge.loc[(ords_prods_merge['median_order'] > 10) & (ords_prods_merge['median_order'] <= 20), 'order_frequency_flag'] = 'Regular customer'

In [30]:
ords_prods_merge.loc[ords_prods_merge['median_order'] <= 10, 'order_frequency_flag'] = 'Frequent customer'

In [31]:
# Print frequency 
ords_prods_merge['order_frequency_flag'].value_counts(dropna = False)

order_frequency_flag
Frequent customer        20675290
Regular customer          6594542
Non-frequent customer     3058931
NaN                       2076096
Name: count, dtype: int64

In [32]:
# Check output
ords_prods_merge.head()

Unnamed: 0,order_id,user_id,dataset_label,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,...,price_range_loc,busiest_day,busiest_days,busiest_period_of_day,max_order,loyalty_flag,avg_spending,spending_flag,median_order,order_frequency_flag
0,2,202279.0,prior,3.0,5.0,9.0,8.0,33120.0,1.0,1.0,...,Mid-range product,Regularly busy,Regularly busy,Average orders,8.0,New customer,8.6875,Low spender,24.0,Non-frequent customer
1,2,202279.0,prior,3.0,5.0,9.0,8.0,28985.0,2.0,1.0,...,Mid-range product,Regularly busy,Regularly busy,Average orders,8.0,New customer,8.6875,Low spender,24.0,Non-frequent customer
2,2,202279.0,prior,3.0,5.0,9.0,8.0,9327.0,3.0,0.0,...,Low-range product,Regularly busy,Regularly busy,Average orders,8.0,New customer,8.6875,Low spender,24.0,Non-frequent customer
3,2,202279.0,prior,3.0,5.0,9.0,8.0,45918.0,4.0,1.0,...,Mid-range product,Regularly busy,Regularly busy,Average orders,8.0,New customer,8.6875,Low spender,24.0,Non-frequent customer
4,2,202279.0,prior,3.0,5.0,9.0,8.0,30035.0,5.0,0.0,...,Mid-range product,Regularly busy,Regularly busy,Average orders,8.0,New customer,8.6875,Low spender,24.0,Non-frequent customer


In [33]:
# Check shape
ords_prods_merge.shape

(32404859, 25)

#### 08. Export Dataframe

In [34]:
# Export data to pkl
ords_prods_merge.to_pickle(os.path.join(path, '02_Data','Prepared Data', 'ords_prods_merge_grouped.pkl'))
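
After exporting, a quick round-trip check can confirm that the pickle file reads back unchanged. The sketch below uses a small made-up dataframe and a temporary folder (`toy` and `toy_grouped.pkl` are hypothetical names) so it does not touch the project files.

```python
import os
import tempfile

import pandas as pd

# Toy dataframe standing in for the exported data.
toy = pd.DataFrame({
    "user_id": [1, 2],
    "loyalty_flag": ["New customer", "Loyal customer"],
})

# Write to a temporary folder, read the file back, and compare.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "toy_grouped.pkl")
    toy.to_pickle(out_path)
    restored = pd.read_pickle(out_path)

round_trip_ok = restored.equals(toy)
print(round_trip_ok)
```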