<a href="https://colab.research.google.com/github/aminayusif/PurchaseIQ/blob/main/PurchaseIQ_CustomerSegmentation_Using_K_Means_Clustering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Introduction

This notebook performs customer segmentation using K-Means clustering on e-commerce consumer behavior data. It includes steps for data loading, exploration, preprocessing, clustering, cluster analysis, and anomaly detection using Isolation Forest. The goal is to identify distinct customer segments and unusual purchasing patterns to inform targeted marketing strategies.

Key Sections:

- Data Loading and Exploration
- Data Preprocessing
- Customer Segmentation (K-Means Clustering)
- Cluster Analysis and Interpretation
- Anomaly Detection (Isolation Forest)
- Marketing Recommendations based on Segments and Anomalies

### Data Loading and Exploration

#### Import common libraries

In [5]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

#### Load dataset

In [6]:
df = pd.read_csv('/content/ECommerce_consumer behaviour.csv')


In [7]:
# Display the first few rows, then the column summary.
# df.info() prints its report directly and returns None,
# so it is not wrapped in display().
display(df.head())
df.info()

| | order_id | user_id | order_number | order_dow | order_hour_of_day | days_since_prior_order | product_id | add_to_cart_order | reordered | department_id | department | product_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2425083 | 49125 | 1 | 2 | 18 | NaN | 17 | 1 | 0 | 13 | pantry | baking ingredients |
| 1 | 2425083 | 49125 | 1 | 2 | 18 | NaN | 91 | 2 | 0 | 16 | dairy eggs | soy lactosefree |
| 2 | 2425083 | 49125 | 1 | 2 | 18 | NaN | 36 | 3 | 0 | 16 | dairy eggs | butter |
| 3 | 2425083 | 49125 | 1 | 2 | 18 | NaN | 83 | 4 | 0 | 4 | produce | fresh vegetables |
| 4 | 2425083 | 49125 | 1 | 2 | 18 | NaN | 83 | 5 | 0 | 4 | produce | fresh vegetables |

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2019501 entries, 0 to 2019500
Data columns (total 12 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   order_id                int64  
 1   user_id                 int64  
 2   order_number            int64  
 3   order_dow               int64  
 4   order_hour_of_day       int64  
 5   days_since_prior_order  float64
 6   product_id              int64  
 7   add_to_cart_order       int64  
 8   reordered               int64  
 9   department_id           int64  
 10  department              object 
 11  product_name            object 
dtypes: float64(1), int64(9), object(2)
memory usage: 184.9+ MB

### Data Preprocessing

Let's check for null values in the dataset

In [8]:
df.isnull().sum()

| Column | Null count |
|---|---|
| order_id | 0 |
| user_id | 0 |
| order_number | 0 |
| order_dow | 0 |
| order_hour_of_day | 0 |
| days_since_prior_order | 124342 |
| product_id | 0 |
| add_to_cart_order | 0 |
| reordered | 0 |
| department_id | 0 |
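
Raw null counts are easier to judge as a share of all rows: 124,342 nulls out of 2,019,501 rows is roughly 6.2% of **days_since_prior_order**. A minimal sketch of that calculation follows; it uses a small made-up frame named `toy` (an assumption for illustration) rather than the real dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the notebook's df; the values are illustrative only.
toy = pd.DataFrame({
    "order_id": [1, 1, 2, 3],
    "days_since_prior_order": [np.nan, 3.0, 6.0, np.nan],
})

# isnull() yields booleans, so the column mean is the fraction of nulls.
null_share = toy.isnull().mean() * 100

print(null_share.round(1))
```

On the real data, the same expression with `df` in place of `toy` would report the percentage of nulls per column.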


We display the unique values in the '**days_since_prior_order**' column to inspect its contents.

In [10]:
df['days_since_prior_order'].unique()

array([nan,  3.,  6.,  7., 30., 20.,  4.,  8., 15., 10., 28.,  9., 12.,
       11.,  2., 25., 13., 29., 14., 21.,  5.,  1., 18.,  0., 19., 17.,
       22., 26., 24., 16., 23., 27.])

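We fill the null values in the **days_since_prior_order** column with 0, as these likely represent a user's first order, for which there is no prior order to measure from. Note that 0 also occurs as a genuine value in this column, so after filling, first orders and same-day reorders can no longer be told apart.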

In [11]:
df['days_since_prior_order'] = df['days_since_prior_order'].fillna(0)
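
Because 0 is already a valid value in this column, filling with 0 discards the information about which rows were originally null. One possible way to keep it, sketched below on a toy frame, is to record a flag before filling. The column name `is_first_order` and the frame `toy` are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the notebook's df; the values are illustrative only.
toy = pd.DataFrame({
    "order_number": [1, 1, 2, 3],
    "days_since_prior_order": [np.nan, np.nan, 0.0, 7.0],
})

# Hypothetical flag (not in the original notebook): remember which rows
# had no prior-order value before the NaNs are overwritten.
toy["is_first_order"] = toy["days_since_prior_order"].isna()

# Same fill as in the notebook.
toy["days_since_prior_order"] = toy["days_since_prior_order"].fillna(0)

print(toy)
```

Whether such a flag is useful depends on the features later fed to the clustering step; it is shown here only as an option.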

The dataset now has no null values:

In [12]:
df.isnull().sum()

| Column | Null count |
|---|---|
| order_id | 0 |
| user_id | 0 |
| order_number | 0 |
| order_dow | 0 |
| order_hour_of_day | 0 |
| days_since_prior_order | 0 |
| product_id | 0 |
| add_to_cart_order | 0 |
| reordered | 0 |
| department_id | 0 |


We will also clean the 'department' column by removing leading and trailing whitespace.

In [21]:

df['department'] = df['department'].str.strip()
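
To illustrate why this matters, here is a small sketch on made-up values (the series `toy` is an assumption, not data from the notebook): labels that differ only in surrounding whitespace are counted as separate categories until they are stripped.

```python
import pandas as pd

# Illustrative values only: the same label with stray whitespace.
toy = pd.Series([" produce", "produce ", "produce"], name="department")

# Before stripping, each variant is treated as a distinct category.
before = toy.nunique()

# After stripping, the variants collapse into a single category.
after = toy.str.strip().nunique()

print(before, after)
```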

With the nulls filled, we can now convert **days_since_prior_order** from float to an integer data type.

In [14]:
df['days_since_prior_order'] = df['days_since_prior_order'].astype(np.int64)
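
The cast only preserves the data if the column contains no NaNs and no fractional values. A minimal sketch of checking this first, on a toy series named `toy` (an assumption for illustration), might look like the following. The `int8` variant is an optional suggestion, not something the notebook does; it relies on the values staying within 0 to 30, as the unique values above indicate.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the filled column; the values are illustrative only.
toy = pd.Series([0.0, 3.0, 30.0, 7.0], name="days_since_prior_order")

# The cast is lossless only if there are no NaNs and no fractional parts.
assert toy.notna().all()
assert (toy % 1 == 0).all()

# Same conversion as in the notebook.
as_int = toy.astype(np.int64)

# Optional (assumption): a smaller integer type also holds values in 0..30
# and uses less memory on a frame with millions of rows.
as_small = toy.astype(np.int8)

print(as_int.dtype, as_small.dtype)
```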

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2019501 entries, 0 to 2019500
Data columns (total 12 columns):
 #   Column                  Dtype 
---  ------                  ----- 
 0   order_id                int64 
 1   user_id                 int64 
 2   order_number            int64 
 3   order_dow               int64 
 4   order_hour_of_day       int64 
 5   days_since_prior_order  int64 
 6   product_id              int64 
 7   add_to_cart_order       int64 
 8   reordered               int64 
 9   department_id           int64 
 10  department              object
 11  product_name            object
dtypes: int64(10), object(2)
memory usage: 184.9+ MB


We examine the unique values and their counts for the categorical columns, including the integer-coded **order_dow** and **order_hour_of_day**, to understand how the data is distributed across categories.

In the context of customer segmentation, understanding the distribution of categories like '**department**' and '**order_dow**' can give us insights into customer preferences and behavior patterns related to the types of products they buy and the days they place orders.

In [20]:
# Examine unique values and their counts for categorical columns
for col in ['order_dow', 'order_hour_of_day', 'department', 'product_name']:
    if col in df.columns:
        display(f"Unique values and counts for column: {col}")
        display(df[col].value_counts())

'Unique values and counts for column: order_dow'

| order_dow | count |
|---|---|
| 0 | 391831 |
| 1 | 349236 |
| 6 | 280751 |
| 5 | 262157 |
| 2 | 261912 |
| 3 | 238730 |
| 4 | 234884 |


'Unique values and counts for column: order_hour_of_day'

| order_hour_of_day | count |
|---|---|
| 10 | 173306 |
| 11 | 170291 |
| 14 | 167831 |
| 15 | 167157 |
| 13 | 166376 |
| 12 | 163511 |
| 16 | 158247 |
| 9 | 150248 |
| 17 | 129383 |
| 8 | 106754 |


'Unique values and counts for column: department'

| department | count |
|---|---|
| produce | 588996 |
| dairy eggs | 336915 |
| snacks | 180692 |
| beverages | 168126 |
| frozen | 139536 |
| pantry | 116262 |
| bakery | 72983 |
| canned goods | 66053 |
| deli | 65176 |
| dry goods pasta | 54054 |


'Unique values and counts for column: product_name'

| product_name | count |
|---|---|
| fresh fruits | 226039 |
| fresh vegetables | 212611 |
| packaged vegetables fruits | 109596 |
| yogurt | 90751 |
| packaged cheese | 61502 |
| ... | ... |
| kitchen supplies | 561 |
| baby bath body care | 515 |
| baby accessories | 504 |
| beauty | 387 |


The value counts for '**department**', '**order_hour_of_day**', '**product_name**', and '**order_dow**' show the frequency of each category.

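'Produce' is the most frequent department (588,996 rows), and day '0' (likely Sunday) is the most frequent order day of the week (391,831 rows). Because each row is one item added to the cart within an order, these counts measure items purchased rather than orders placed.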
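
Each row in this dataset is one cart item, so `value_counts()` on the raw frame counts items rather than orders. A minimal sketch of counting orders per day instead is shown below, using a toy frame named `toy` (an assumption for illustration). The mapping from day numbers to names assumes 0 = Sunday, which the dataset itself does not confirm.

```python
import pandas as pd

# Toy stand-in for the notebook's df: three orders, six cart items.
toy = pd.DataFrame({
    "order_id": [10, 10, 10, 11, 12, 12],
    "order_dow": [0, 0, 0, 1, 0, 0],
})

# Row-level counts: every cart item counts once.
items_per_day = toy["order_dow"].value_counts()

# Order-level counts: keep one row per order before counting.
orders_per_day = toy.drop_duplicates("order_id")["order_dow"].value_counts()

# ASSUMPTION: 0 = Sunday ... 6 = Saturday (not confirmed by the data).
day_names = ["Sunday", "Monday", "Tuesday", "Wednesday",
             "Thursday", "Friday", "Saturday"]
orders_by_name = orders_per_day.rename(index=lambda d: day_names[d])

print(items_per_day)
print(orders_by_name)
```

If order-level behaviour is what the segmentation should capture, the deduplicated counts are likely the more relevant ones.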

Based on the output for 'order_hour_of_day', ordering peaks between 10 AM and 4 PM (hours 10 through 16): each of these hours accounts for roughly 158,000 to 173,000 rows, compared with about 150,000 at hour 9, 129,000 at hour 17, and 107,000 at hour 8. The hours missing from the truncated output rank lower still, since the counts are sorted in descending order. This suggests that customers are most active during daytime hours.
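
To quantify how concentrated activity is in the 10 to 16 window, the counts can be turned into shares and summed over that range. The following is a minimal sketch on a toy frame named `toy` (an assumption for illustration); the names `hour_share` and `daytime_share` are likewise hypothetical.

```python
import pandas as pd

# Toy stand-in for the notebook's df; the values are illustrative only.
toy = pd.DataFrame({"order_hour_of_day": [10, 10, 11, 14, 14, 14, 3, 22]})

# Share of rows per hour, ordered by hour instead of by frequency.
hour_share = (
    toy["order_hour_of_day"]
    .value_counts(normalize=True)
    .sort_index()
)

# Label-based slice on the sorted index: hours 10 through 16 inclusive.
daytime_share = hour_share.loc[10:16].sum()

print(hour_share)
print(f"Share of rows placed between hours 10 and 16: {daytime_share:.2f}")
```

Sorting by hour also makes the result straightforward to plot as a bar chart over the day.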