# Step 7 Instacart Customer Profiling

# Table of Contents
* [1. Introduction](#1.-Introduction)
* [2. Importing Data](#2.-Importing-Data)
    * [2.1 Importing Libraries](#2.1-Importing-Libraries)
    * [2.2 Importing Data](#2.2-Importing-Data)
* [3. Age Profile](#3.-Age-Profile)
* [4. Dependents Profile](#4.-Dependents-Profile)
* [5. Income Profile](#5.-Income-Profile)
* [6. Customer Profiles](#6.-Customer-Profiles)
* [7. Exporting Data](#7.-Exporting-Data) 

# 1. Introduction

The marketing and business strategy teams at Instacart are focusing on developing more targeted marketing strategies by understanding customer demographics in their database. They decided to profile customers based on age, income, and the number of dependents, where the number of dependents indicates the count of children per user. The following categories were established for customer profiling:

- For the "age profile" variable, three groups: "Young adult", "Adult", and "Retired".
- For the "dependents profile", two groups: "No kids" and "With kids".
- For the "income profile", three groups: "Low income", "Middle income", and "High income".

Combining these classifications will yield 18 unique customer profiles, ranging from "Young adult, No kids, Low income" to "Retired, With kids, High income".
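
As a quick sanity check on the count of 18, the combinations can be enumerated directly. This is only an illustrative sketch; the list names below are hypothetical and not part of the notebook, and the labels are joined with ", " as in the sentence above.

```python
from itertools import product

# Group labels as listed above (variable names are illustrative only)
age_groups = ["Young adult", "Adult", "Retired"]
dependents_groups = ["No kids", "With kids"]
income_groups = ["Low income", "Middle income", "High income"]

# Every (age, dependents, income) combination, joined into one label
profiles = [", ".join(combo) for combo in product(age_groups, dependents_groups, income_groups)]

print(len(profiles))   # number of distinct profiles
print(profiles[0])     # first combination
print(profiles[-1])    # last combination
```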


# 2. Importing Data

In [2]:
import pandas as pd
import numpy as np
import os

In [3]:
path = r'C:\Users\kuohe\Instacart Basket Analysis 12-18-2023'

### Orders Products Der

In [4]:
orders_products_der = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared data', 'orders_products_der_step6.pkl'))

In [5]:
# Checking Size
orders_products_der.shape

(30992966, 32)

In [6]:
# Checking Columns
orders_products_der.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_sequence,reordered,product_name,...,max_order_number,price_range_products,busiest_days,busiest_period_of_day,region,loyalty_flag,mean_price_per_user,type_of_spender,median_days_since_prior_order,usage_frequency
0,2539329,1,1,2,8,,196,1,0,Soda,...,10,mid-range,Regular days,Average orders,South,New,6.367797,Low spender,20.5,Non-frequent customer
1,2398795,1,2,3,7,15.0,196,1,1,Soda,...,10,mid-range,Least busiest days,Average orders,South,New,6.367797,Low spender,20.5,Non-frequent customer
2,473747,1,3,3,12,21.0,196,1,1,Soda,...,10,mid-range,Least busiest days,Average orders,South,New,6.367797,Low spender,20.5,Non-frequent customer
3,2254736,1,4,4,7,29.0,196,1,1,Soda,...,10,mid-range,Least busiest days,Average orders,South,New,6.367797,Low spender,20.5,Non-frequent customer
4,431534,1,5,4,15,28.0,196,1,1,Soda,...,10,mid-range,Least busiest days,Most orders,South,New,6.367797,Low spender,20.5,Non-frequent customer


In [7]:
orders_products_der.dtypes

order_id                            int32
user_id                             int32
order_number                         int8
orders_day_of_week                   int8
order_hour_of_day                    int8
days_since_prior_order            float64
product_id                          int32
add_to_cart_sequence                int32
reordered                            int8
product_name                       object
aisle_id                             int8
department_id                        int8
prices                            float64
First Name                         object
Surname                            object
Gender                           category
STATE                            category
Age                                 int32
date_joined                        object
No._of_dependants                   int32
family_status                    category
income                              int64
max_order_number                     int8
price_range_products              

In [9]:
# Keeping only the columns needed for profiling to reduce memory usage
variables = ['order_id', 'user_id', 'days_since_prior_order', 'department_id', 'prices', 'Age', 'No._of_dependants', 'income', 'region']
orders_products_der = orders_products_der.loc[:, variables]

In [10]:
orders_products_der.shape

(30992966, 9)
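
One way to gauge how much memory a column selection saves is to compare `memory_usage(deep=True)` before and after. The sketch below uses a small toy frame with made-up values, not the actual Instacart data, so the numbers it prints are not representative of the real table.

```python
import pandas as pd

# Toy frame standing in for the full table (hypothetical values)
toy = pd.DataFrame({
    "order_id": [1, 2, 3],
    "user_id": [10, 10, 11],
    "product_name": ["Soda", "Milk", "Bread"],
    "prices": [9.0, 4.5, 3.2],
})

# Total bytes used, including the contents of object (string) columns
before = toy.memory_usage(deep=True).sum()

# Keep only the columns needed downstream
subset = toy.loc[:, ["order_id", "user_id", "prices"]]
after = subset.memory_usage(deep=True).sum()

print(before, after)
```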

In [11]:
# Checking Column
orders_products_der.head()

Unnamed: 0,order_id,user_id,days_since_prior_order,department_id,prices,Age,No._of_dependants,income,region
0,2539329,1,,7,9.0,31,3,40423,South
1,2398795,1,15.0,7,9.0,31,3,40423,South
2,473747,1,21.0,7,9.0,31,3,40423,South
3,2254736,1,29.0,7,9.0,31,3,40423,South
4,431534,1,28.0,7,9.0,31,3,40423,South


# 3. Age Profile

The "age profile" variable will be divided into three groups: "Young adult," "Adult," and "Retired." The following are the criteria for categorization:
- Users who are younger than 26 will be categorized as "Young adult."
- "Adult" refers to those who are 26 years of age or older but under 65.
- Users will be classified as "Retired" if they are 65 or older.
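
The same binning could also be expressed in a single call with `pd.cut`. This is a sketch of an alternative, not the approach this notebook uses, and it assumes the cut-offs of 26 and 65 applied in the conditions of this section. The toy ages are made up to exercise the boundaries.

```python
import pandas as pd

# Toy ages chosen to sit on both sides of each boundary
ages = pd.Series([18, 25, 26, 64, 65, 81], name="Age")

# right=False makes each bin closed on the left: [0, 26), [26, 65), [65, inf)
age_profile = pd.cut(
    ages,
    bins=[0, 26, 65, float("inf")],
    labels=["Young adult", "Adult", "Retired"],
    right=False,
)

print(age_profile.tolist())
```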


## Variable Evaluation

In [13]:
# Observing the "Age" column
orders_products_der['Age'].describe()

count    3.099297e+07
mean     4.946787e+01
std      1.848520e+01
min      1.800000e+01
25%      3.300000e+01
50%      4.900000e+01
75%      6.500000e+01
max      8.100000e+01
Name: Age, dtype: float64

## Conditions

In [14]:
# If the user is younger than 26, then user will be considered as “Young adult”
orders_products_der.loc[orders_products_der['Age'] < 26, 'age_profile'] = 'Young adult'

In [15]:
# If the user is 26 or older and younger than 65, they will be considered as an “Adult”
orders_products_der.loc[(orders_products_der['Age'] < 65) & (orders_products_der['Age'] >= 26), 'age_profile'] = 'Adult' 

In [16]:
# Users 65 and older will be considered as "Retired"
orders_products_der.loc[orders_products_der['Age'] >= 65, 'age_profile'] = 'Retired'

## Output Evaluation 

In [17]:
# Counting the values
orders_products_der['age_profile'].value_counts(dropna = False)

age_profile
Adult          18921994
Retired         8202766
Young adult     3868206
Name: count, dtype: int64

In [19]:
# Checking the parameters set
orders_products_der.groupby('age_profile').agg({'Age' : ['min', 'max']})

age_profile,Age min,Age max
Adult,26,64
Retired,65,81
Young adult,18,25
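
A min/max check like the one above can also be turned into explicit assertions, so that a mislabeled row fails loudly instead of relying on visual inspection. The following is a minimal sketch on toy data; the frame and its values are hypothetical.

```python
import pandas as pd

# Toy frame with ages on the group boundaries (hypothetical values)
toy = pd.DataFrame({
    "Age": [18, 25, 26, 64, 65, 81],
    "age_profile": ["Young adult", "Young adult", "Adult", "Adult", "Retired", "Retired"],
})

# Minimum and maximum age observed within each label
bounds = toy.groupby("age_profile")["Age"].agg(["min", "max"])

# Each label must stay inside its intended age range
assert bounds.loc["Young adult", "max"] < 26
assert bounds.loc["Adult", "min"] >= 26 and bounds.loc["Adult", "max"] < 65
assert bounds.loc["Retired", "min"] >= 65

print(bounds)
```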


In [21]:
# Double checking 
orders_products_der[['Age', 'age_profile']].head()

Unnamed: 0,Age,age_profile
0,31,Adult
1,31,Adult
2,31,Adult
3,31,Adult
4,31,Adult


In [22]:
# Double checking
orders_products_der[['Age', 'age_profile']].tail()

Unnamed: 0,Age,age_profile
32434480,25,Young adult
32434481,25,Young adult
32434482,25,Young adult
32434483,25,Young adult
32434484,25,Young adult


# 4. Dependents Profile

Two groups will be defined for the "dependents_profile": "No kids" and "With kids". The criteria for this categorization are:
- Users with no dependents (0) will be labeled as “No kids”.
- Users who have one or more dependents will be labeled as “With kids”.
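
For a two-way rule like this, a vectorized `np.where` is a common alternative to looping over rows, and it tends to be much faster on tables with tens of millions of rows. The sketch below is only an illustration on a toy frame with made-up values; it is not the method used in the cells of this section.

```python
import numpy as np
import pandas as pd

# Toy frame reusing the notebook's column name (values are hypothetical)
toy = pd.DataFrame({"No._of_dependants": [3, 0, 1, 2, 0]})

# 0 dependents -> "No kids"; anything else -> "With kids"
toy["dependents_profile"] = np.where(
    toy["No._of_dependants"] == 0, "No kids", "With kids"
)

print(toy["dependents_profile"].tolist())
```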

## Variable Evaluation

In [24]:
# Observing the column "No._of_dependants"
orders_products_der['No._of_dependants'].describe()

count    3.099297e+07
mean     1.501799e+00
std      1.118900e+00
min      0.000000e+00
25%      1.000000e+00
50%      2.000000e+00
75%      3.000000e+00
max      3.000000e+00
Name: No._of_dependants, dtype: float64

In [26]:
# Observing the frequency of values within the column
orders_products_der['No._of_dependants'].value_counts(dropna = False)

No._of_dependants
3    7779460
0    7747091
2    7740423
1    7725992
Name: count, dtype: int64

## Conditions

In [28]:
# If the value is 0, the user is considered "No kids". If the value is 1 or more, the user is considered "With kids".
result = []

for value in orders_products_der["No._of_dependants"]:
  if value == 0:
    result.append("No kids")
  else:
    result.append("With kids")

In [29]:
# Checking results
result

['With kids',
 'With kids',
 'With kids',
 ...
 'No kids',
 'No kids',
 'No kids',
 ...]

In [30]:
# Adding the result to the columns
orders_products_der['dependents_profile'] = result

## Output Evaluation

In [31]:
# Checking Result
orders_products_der.head()

Unnamed: 0,order_id,user_id,days_since_prior_order,department_id,prices,Age,No._of_dependants,income,region,age_profile,dependents_profile
0,2539329,1,,7,9.0,31,3,40423,South,Adult,With kids
1,2398795,1,15.0,7,9.0,31,3,40423,South,Adult,With kids
2,473747,1,21.0,7,9.0,31,3,40423,South,Adult,With kids
3,2254736,1,29.0,7,9.0,31,3,40423,South,Adult,With kids
4,431534,1,28.0,7,9.0,31,3,40423,South,Adult,With kids


In [32]:
# Counting values within new column
orders_products_der['dependents_profile'].value_counts(dropna = False)

dependents_profile
With kids    23245875
No kids       7747091
Name: count, dtype: int64

In [33]:
# Checking the parameters
orders_products_der.groupby('dependents_profile').agg({'No._of_dependants' : ['min', 'max']})

dependents_profile,No._of_dependants min,No._of_dependants max
No kids,0,0
With kids,1,3


In [34]:
# Checking the parameters
orders_products_der[['dependents_profile', 'No._of_dependants']].head()

Unnamed: 0,dependents_profile,No._of_dependants
0,With kids,3
1,With kids,3
2,With kids,3
3,With kids,3
4,With kids,3


In [36]:
# Checking the parameters
orders_products_der[['dependents_profile', 'No._of_dependants']].tail()

Unnamed: 0,dependents_profile,No._of_dependants
32434480,No kids,0
32434481,No kids,0
32434482,No kids,0
32434483,No kids,0
32434484,No kids,0


# 5. Income Profile

The "income profile" variable will be segmented into three categories: "Low income", "Middle income", and "High income". The classification is based on the following criteria:
- Users earning less than 60k per year will be categorized as "Low income".
- Users with an annual income of 60k or more but below 250k will be classified as "Middle income".
- Users earning 250k per year or more will be designated as "High income".
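
A three-way rule like this can be written in vectorized form with `np.select`. The sketch below is an alternative illustration, not the notebook's own method. It assumes the thresholds used in the code cell of this section (below 60,000 is low, 250,000 and above is high), and the toy incomes are chosen to sit on the boundaries.

```python
import numpy as np
import pandas as pd

# Toy incomes on both sides of each threshold
income = pd.Series([25903, 59999, 60000, 249904, 250000, 593901])

# Conditions are checked in order; rows matching none fall to the default
conditions = [income < 60000, income >= 250000]
choices = ["Low income", "High income"]

income_profile = pd.Series(
    np.select(conditions, choices, default="Middle income"),
    index=income.index,
)

print(income_profile.tolist())
```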

## Variable Evaluation

In [37]:
# Observing the "income" column
orders_products_der['income'].describe()

count    3.099297e+07
mean     9.967341e+04
std      4.313976e+04
min      2.590300e+04
25%      6.728900e+04
50%      9.676500e+04
75%      1.281010e+05
max      5.939010e+05
Name: income, dtype: float64

## Conditions

In [39]:
# If user earns less than 60k per year, they are considered "Low income"
# If user earns at least 60k but less than 250k per year, they are considered "Middle income"
# If user earns 250k or more per year, they are considered "High income"
result = []

for value in orders_products_der["income"]:
  if value < 60000:
    result.append("Low income")
  elif value >= 250000:
    result.append("High income")
  else:
    result.append("Middle income")

In [41]:
result

['Low income',
 'Low income',
 'Low income',
 ...]

In [42]:
# Adding the result to the dataframe
orders_products_der['income_profile'] = result

## Output Evaluation

In [43]:
# Checking results
orders_products_der.head()

Unnamed: 0,order_id,user_id,days_since_prior_order,department_id,prices,Age,No._of_dependants,income,region,age_profile,dependents_profile,income_profile
0,2539329,1,,7,9.0,31,3,40423,South,Adult,With kids,Low income
1,2398795,1,15.0,7,9.0,31,3,40423,South,Adult,With kids,Low income
2,473747,1,21.0,7,9.0,31,3,40423,South,Adult,With kids,Low income
3,2254736,1,29.0,7,9.0,31,3,40423,South,Adult,With kids,Low income
4,431534,1,28.0,7,9.0,31,3,40423,South,Adult,With kids,Low income


In [46]:
# Checking the count of "income_profile"
orders_products_der['income_profile'].value_counts(dropna = False)

income_profile
Middle income    25004498
Low income        5808573
High income        179895
Name: count, dtype: int64

In [47]:
# Checking the parameters of "income_profile"
orders_products_der.groupby('income_profile').agg({'income' : ['min', 'max']})

income_profile,income min,income max
High income,250190,593901
Low income,25903,59999
Middle income,60000,249904


In [48]:
# Checking the parameters of "income_profile"
orders_products_der[['income_profile', 'income']].head()

Unnamed: 0,income_profile,income
0,Low income,40423
1,Low income,40423
2,Low income,40423
3,Low income,40423
4,Low income,40423


In [49]:
# Checking the parameters of "income_profile"
orders_products_der[['income_profile', 'income']].tail()

Unnamed: 0,income_profile,income
32434480,Low income,53755
32434481,Low income,53755
32434482,Low income,53755
32434483,Low income,53755
32434484,Low income,53755


# 6. Customer Profiles

In [50]:
# Profiling each customer based on the dataframe that was created
orders_products_der['customer_profile'] = orders_products_der['age_profile'] + '_' + orders_products_der['dependents_profile'] + '_' + orders_products_der['income_profile']

In [51]:
# Checking result
orders_products_der.head()

Unnamed: 0,order_id,user_id,days_since_prior_order,department_id,prices,Age,No._of_dependants,income,region,age_profile,dependents_profile,income_profile,customer_profile
0,2539329,1,,7,9.0,31,3,40423,South,Adult,With kids,Low income,Adult_With kids_Low income
1,2398795,1,15.0,7,9.0,31,3,40423,South,Adult,With kids,Low income,Adult_With kids_Low income
2,473747,1,21.0,7,9.0,31,3,40423,South,Adult,With kids,Low income,Adult_With kids_Low income
3,2254736,1,29.0,7,9.0,31,3,40423,South,Adult,With kids,Low income,Adult_With kids_Low income
4,431534,1,28.0,7,9.0,31,3,40423,South,Adult,With kids,Low income,Adult_With kids_Low income


In [52]:
# Checking result
orders_products_der.tail()

Unnamed: 0,order_id,user_id,days_since_prior_order,department_id,prices,Age,No._of_dependants,income,region,age_profile,dependents_profile,income_profile,customer_profile
32434480,3308056,106143,10.0,7,10.7,25,0,53755,West,Young adult,No kids,Low income,Young adult_No kids_Low income
32434481,2988973,106143,5.0,7,10.7,25,0,53755,West,Young adult,No kids,Low income,Young adult_No kids_Low income
32434482,930,106143,4.0,7,10.7,25,0,53755,West,Young adult,No kids,Low income,Young adult_No kids_Low income
32434483,467253,106143,7.0,7,10.7,25,0,53755,West,Young adult,No kids,Low income,Young adult_No kids_Low income
32434484,156685,106143,5.0,7,10.7,25,0,53755,West,Young adult,No kids,Low income,Young adult_No kids_Low income


In [53]:
# Counting the values within the column of "customer_profile"
orders_products_der['customer_profile'].value_counts(dropna = False)

customer_profile
Adult_With kids_Middle income          11328954
Retired_With kids_Middle income         5442930
Adult_No kids_Middle income             3812244
Adult_With kids_Low income              2740559
Young adult_With kids_Middle income     1980497
Retired_No kids_Middle income           1796557
Adult_No kids_Low income                 926872
Young adult_With kids_Low income         926080
Retired_With kids_Low income             694263
Young adult_No kids_Middle income        643316
Young adult_No kids_Low income           307579
Retired_No kids_Low income               213220
Adult_With kids_High income               83414
Retired_With kids_High income             41653
Adult_No kids_High income                 29951
Retired_No kids_High income               14143
Young adult_With kids_High income          7525
Young adult_No kids_High income            3209
Name: count, dtype: int64
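
Raw counts like these are often easier to compare as shares of all rows. One way to get them is `value_counts(normalize=True)`. The sketch below uses a tiny made-up series rather than the real column, so the printed shares are illustrative only.

```python
import pandas as pd

# Toy profile labels (hypothetical distribution)
profiles = pd.Series(
    ["Adult_With kids_Middle income"] * 3 + ["Retired_No kids_Low income"],
    name="customer_profile",
)

# normalize=True returns fractions of the total instead of raw counts
shares = profiles.value_counts(normalize=True)

print(shares)
```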

## Calculating Customer Profiles

In [54]:
# Average price of products purchased within each customer profile
orders_products_der['mean_price_per_customer_profile'] = orders_products_der.groupby(['customer_profile'])['prices'].transform('mean')


In [55]:
# Maximum price of products purchased within each customer profile
orders_products_der['max_price_per_customer_profile'] = orders_products_der.groupby(['customer_profile'])['prices'].transform('max')


In [56]:
# Minimum price of products purchased within each customer profile
orders_products_der['min_price_per_customer_profile'] = orders_products_der.groupby(['customer_profile'])['prices'].transform('min')


In [57]:
# Checking the aggregation
orders_products_der.groupby('customer_profile').agg({'prices' : ['min', 'max', 'mean']})

customer_profile,prices min,prices max,prices mean
Adult_No kids_High income,1.0,25.0,7.844429
Adult_No kids_Low income,1.0,25.0,7.110357
Adult_No kids_Middle income,1.0,25.0,7.958333
Adult_With kids_High income,1.0,25.0,7.772997
Adult_With kids_Low income,1.0,25.0,7.116217
Adult_With kids_Middle income,1.0,25.0,7.958375
Retired_No kids_High income,1.0,25.0,7.728465
Retired_No kids_Low income,1.0,25.0,6.512758
Retired_No kids_Middle income,1.0,25.0,7.961454
Retired_With kids_High income,1.0,25.0,7.764586


In [58]:
# Checking Results
orders_products_der[['customer_profile', 'mean_price_per_customer_profile', 'max_price_per_customer_profile', 'min_price_per_customer_profile', 'prices']].head()

Unnamed: 0,customer_profile,mean_price_per_customer_profile,max_price_per_customer_profile,min_price_per_customer_profile,prices
0,Adult_With kids_Low income,7.116217,25.0,1.0,9.0
1,Adult_With kids_Low income,7.116217,25.0,1.0,9.0
2,Adult_With kids_Low income,7.116217,25.0,1.0,9.0
3,Adult_With kids_Low income,7.116217,25.0,1.0,9.0
4,Adult_With kids_Low income,7.116217,25.0,1.0,9.0
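
The repeated values in each row above follow from how `transform` works: it computes one statistic per group and broadcasts it back to every row of that group, whereas `agg` returns one row per group. A minimal sketch on toy data (group labels and prices are made up):

```python
import pandas as pd

# Toy frame with two groups (hypothetical values)
toy = pd.DataFrame({
    "customer_profile": ["A", "A", "B"],
    "prices": [2.0, 4.0, 10.0],
})

# agg: one value per group, indexed by the group label
per_group = toy.groupby("customer_profile")["prices"].agg("mean")

# transform: same values, repeated so the result aligns with the original rows
per_row = toy.groupby("customer_profile")["prices"].transform("mean")

print(per_group.to_dict())
print(per_row.tolist())
```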


## Customer Profile Usage Frequency

In [59]:
# Average days since prior order within each customer profile
orders_products_der['mean_usage_per_customer_profile'] = orders_products_der.groupby(['customer_profile'])['days_since_prior_order'].transform('mean')


In [60]:
# Maximum days since prior order within each customer profile
orders_products_der['max_usage_per_customer_profile'] = orders_products_der.groupby(['customer_profile'])['days_since_prior_order'].transform('max')


In [61]:
# Minimum days since prior order within each customer profile
orders_products_der['min_usage_per_customer_profile'] = orders_products_der.groupby(['customer_profile'])['days_since_prior_order'].transform('min')


In [62]:
# Checking the calculation
orders_products_der.groupby('customer_profile').agg({'days_since_prior_order' : ['min', 'max', 'mean']})

customer_profile,days_since_prior_order min,days_since_prior_order max,days_since_prior_order mean
Adult_No kids_High income,0.0,30.0,9.552416
Adult_No kids_Low income,0.0,30.0,11.016815
Adult_No kids_Middle income,0.0,30.0,10.717566
Adult_With kids_High income,0.0,30.0,10.363265
Adult_With kids_Low income,0.0,30.0,11.045855
Adult_With kids_Middle income,0.0,30.0,10.791625
Retired_No kids_High income,0.0,30.0,9.633782
Retired_No kids_Low income,0.0,30.0,11.396369
Retired_No kids_Middle income,0.0,30.0,10.703718
Retired_With kids_High income,0.0,30.0,10.280375


In [63]:
# Checking results
orders_products_der[['customer_profile', 'mean_usage_per_customer_profile', 'max_usage_per_customer_profile', 'min_usage_per_customer_profile', 'days_since_prior_order']].head()

Unnamed: 0,customer_profile,mean_usage_per_customer_profile,max_usage_per_customer_profile,min_usage_per_customer_profile,days_since_prior_order
0,Adult_With kids_Low income,11.045855,30.0,0.0,
1,Adult_With kids_Low income,11.045855,30.0,0.0,15.0
2,Adult_With kids_Low income,11.045855,30.0,0.0,21.0
3,Adult_With kids_Low income,11.045855,30.0,0.0,29.0
4,Adult_With kids_Low income,11.045855,30.0,0.0,28.0
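
Row 0 above has a missing `days_since_prior_order` (a customer's first order has no prior order) yet still receives the group statistics. That is because pandas groupby reductions skip missing values by default and `transform` then fills every row of the group, including the one with the gap. A small sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Toy group whose first row has no prior order (hypothetical values)
toy = pd.DataFrame({
    "customer_profile": ["A", "A", "A"],
    "days_since_prior_order": [np.nan, 15.0, 21.0],
})

# NaN is ignored when computing the mean: (15 + 21) / 2
mean_days = toy.groupby("customer_profile")["days_since_prior_order"].transform("mean")

print(mean_days.tolist())
```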


# 7. Exporting Data

In [64]:
# Checking size
orders_products_der.shape

(30992966, 19)

In [65]:
# Exporting to prepared data folder
# Since there is a large amount of data, pickle format is preferred
orders_products_der.to_pickle(os.path.join(path, '02 Data','Prepared data', 'orders_products_cust_step7.pkl'))
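
To confirm that an export like this round-trips cleanly, the file can be read back and compared with the original frame. The sketch below does this on a toy frame in a temporary directory; the file name `toy_step7.pkl` and the data are hypothetical and unrelated to the project folder above.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the profiled table (hypothetical values)
toy = pd.DataFrame({
    "user_id": [1, 2],
    "customer_profile": ["Adult_No kids_Low income", "Retired_With kids_High income"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "toy_step7.pkl")  # hypothetical file name
    toy.to_pickle(out_path)               # write
    restored = pd.read_pickle(out_path)   # read back

# equals() checks values, dtypes, and index
print(restored.equals(toy))
```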