## Customer Segmentation using Unsupervised Machine Learning 

### 1. Project Background

Customer segmentation is a critical strategy in modern business analytics that enables companies to understand and categorize their customer base. It involves breaking down a customer base into distinct groups with similar characteristics, such as spending habits or product interests. Rather than treating all customers uniformly, businesses can tailor their marketing strategies, product offerings, and services to meet the specific needs of different customer segments.

Traditional segmentation approaches often rely on predefined business rules or demographic categories. However, unsupervised machine learning techniques, particularly clustering algorithms, offer a data-driven alternative that can uncover hidden patterns and natural groupings within customer data that might not be immediately obvious to human analysts.

In this project, I employed unsupervised clustering methods to segment customers. By identifying these segments, businesses can:

- Personalize marketing campaigns to resonate with specific customer groups
- Optimize product recommendations based on segment preferences
- Improve customer retention by understanding the needs of different groups
- Allocate resources more efficiently by focusing on high-value segments
- Enhance customer experience through targeted service improvements

The insights gained from this analysis can inform strategic decisions across marketing, product development, and customer service departments, ultimately leading to increased customer satisfaction and business profitability.

### 2. Load Data & Necessary Libraries

In [136]:
#import all the necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px


from sklearn.preprocessing import StandardScaler, RobustScaler, OneHotEncoder
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

In [137]:
data= pd.read_csv(r'D:\Customer_Segementation_with_Unsupervised_Learning\dataset\Customer_data.csv')
data.head()

Unnamed: 0,master_id,order_channel,last_order_channel,first_order_date,last_order_date,last_order_date_online,last_order_date_offline,order_num_total_ever_online,order_num_total_ever_offline,customer_value_total_ever_offline,customer_value_total_ever_online,interested_in_categories_12
0,cc294636-19f0-11eb-8d74-000d3a38a36f,Android App,Offline,2020-10-30,2021-02-26,2021-02-21,2021-02-26,4.0,1.0,139.99,799.38,[KADIN]
1,f431bd5a-ab7b-11e9-a2fc-000d3a38a36f,Android App,Mobile,2017-02-08,2021-02-16,2021-02-16,2020-01-10,19.0,2.0,159.97,1853.58,"[ERKEK, COCUK, KADIN, AKTIFSPOR]"
2,69b69676-1a40-11ea-941b-000d3a38a36f,Android App,Android App,2019-11-27,2020-11-27,2020-11-27,2019-12-01,3.0,2.0,189.97,395.35,"[ERKEK, KADIN]"
3,1854e56c-491f-11eb-806e-000d3a38a36f,Android App,Android App,2021-01-06,2021-01-17,2021-01-17,2021-01-06,1.0,1.0,39.99,81.98,"[AKTIFCOCUK, COCUK]"
4,d6ea1074-f1f5-11e9-9346-000d3a38a36f,Desktop,Desktop,2019-08-03,2021-03-07,2021-03-07,2019-08-03,1.0,1.0,49.99,159.99,[AKTIFSPOR]


**Background behind the Data**

The data contains information about purchases made by customers in our hypothetical store; let's call it '***Wamz General Stores***'. The dataset includes the following features:

- master_id: Unique client number
- order_channel: The channel of the shopping platform that was used (Android, iOS, Desktop, Mobile)
- last_order_channel: The channel where the last purchase was made
- first_order_date: The date of the first purchase made by the customer
- last_order_date: The date of the customer's last purchase
- last_order_date_online: The date of the last purchase made by the customer on the online platform
- last_order_date_offline: The date of the last purchase made by the customer on the offline platform
- order_num_total_ever_online: The total number of purchases made by the customer on the online platform
- order_num_total_ever_offline: Total number of purchases made by the customer offline
- customer_value_total_ever_offline: The total price paid by the customer for offline purchases
- customer_value_total_ever_online: The total price paid by the customer for their online shopping
- interested_in_categories_12: List of categories the customer has purchased from in the last 12 months

### 3. Data Cleaning & Preparation

In [138]:
#let's start by checking the data for missingness
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19945 entries, 0 to 19944
Data columns (total 12 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   master_id                          19945 non-null  object 
 1   order_channel                      19945 non-null  object 
 2   last_order_channel                 19945 non-null  object 
 3   first_order_date                   19945 non-null  object 
 4   last_order_date                    19945 non-null  object 
 5   last_order_date_online             19945 non-null  object 
 6   last_order_date_offline            19945 non-null  object 
 7   order_num_total_ever_online        19945 non-null  float64
 8   order_num_total_ever_offline       19945 non-null  float64
 9   customer_value_total_ever_offline  19945 non-null  float64
 10  customer_value_total_ever_online   19945 non-null  float64
 11  interested_in_categories_12        19945 non-null  obj

The data has no missing values.
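
`data.info()` reports non-null counts per column; a more direct check is to count missing values explicitly. Below is a minimal sketch on a small made-up frame (the `toy` frame and its values are assumptions for illustration, not rows from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (illustrative values only)
toy = pd.DataFrame({
    "order_channel": ["Desktop", "Mobile", None],
    "order_num_total_ever_online": [1.0, np.nan, 3.0],
    "customer_value_total_ever_online": [49.99, 20.0, 75.5],
})

# Count missing values per column; a column with no gaps reports 0
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Single boolean answer: does the frame contain any missing value at all?
has_missing = bool(toy.isna().any().any())
print(has_missing)
```

On the real frame, `data.isna().sum()` should return 0 for every column if the conclusion above holds.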

In [139]:
#Now check for duplicates
data[data.duplicated()]

Unnamed: 0,master_id,order_channel,last_order_channel,first_order_date,last_order_date,last_order_date_online,last_order_date_offline,order_num_total_ever_online,order_num_total_ever_offline,customer_value_total_ever_offline,customer_value_total_ever_online,interested_in_categories_12


No rows are returned, which means the data has no duplicates either.
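
`data.duplicated()` only flags rows that match on every column. Since `master_id` is described as a unique client number, it may also be worth checking that key on its own. A small sketch with a made-up frame (all values are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    "master_id": ["a1", "b2", "a1"],
    "order_channel": ["Desktop", "Mobile", "Ios App"],
})

# No row is an exact copy of another, so full-row duplicates are empty
full_row_dupes = toy[toy.duplicated()]

# But the same master_id appears twice; keep=False marks every occurrence
id_dupes = toy[toy.duplicated(subset="master_id", keep=False)]

print(len(full_row_dupes), len(id_dupes))
```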

The next cleaning step is to check the data types and data consistency.

In [140]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19945 entries, 0 to 19944
Data columns (total 12 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   master_id                          19945 non-null  object 
 1   order_channel                      19945 non-null  object 
 2   last_order_channel                 19945 non-null  object 
 3   first_order_date                   19945 non-null  object 
 4   last_order_date                    19945 non-null  object 
 5   last_order_date_online             19945 non-null  object 
 6   last_order_date_offline            19945 non-null  object 
 7   order_num_total_ever_online        19945 non-null  float64
 8   order_num_total_ever_offline       19945 non-null  float64
 9   customer_value_total_ever_offline  19945 non-null  float64
 10  customer_value_total_ever_online   19945 non-null  float64
 11  interested_in_categories_12        19945 non-null  obj

Next, we can convert all the temporal features into proper date-time objects.

In [141]:
data.columns.to_list()

['master_id',
 'order_channel',
 'last_order_channel',
 'first_order_date',
 'last_order_date',
 'last_order_date_online',
 'last_order_date_offline',
 'order_num_total_ever_online',
 'order_num_total_ever_offline',
 'customer_value_total_ever_offline',
 'customer_value_total_ever_online',
 'interested_in_categories_12']

In [142]:
date_cols=['first_order_date','last_order_date','last_order_date_online','last_order_date_offline']

In [143]:
for col in date_cols:
    data[col] = pd.to_datetime(data[col])


In [144]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19945 entries, 0 to 19944
Data columns (total 12 columns):
 #   Column                             Non-Null Count  Dtype         
---  ------                             --------------  -----         
 0   master_id                          19945 non-null  object        
 1   order_channel                      19945 non-null  object        
 2   last_order_channel                 19945 non-null  object        
 3   first_order_date                   19945 non-null  datetime64[ns]
 4   last_order_date                    19945 non-null  datetime64[ns]
 5   last_order_date_online             19945 non-null  datetime64[ns]
 6   last_order_date_offline            19945 non-null  datetime64[ns]
 7   order_num_total_ever_online        19945 non-null  float64       
 8   order_num_total_ever_offline       19945 non-null  float64       
 9   customer_value_total_ever_offline  19945 non-null  float64       
 10  customer_value_total_ever_online  

### 4. Feature Engineering

Machine learning algorithms do not understand dates the same way we humans do. Models only see and understand numbers (more specifically, distances between numbers). Take 31-12-2025 and 01-01-2026: for us humans they are only one day apart, but a model that reads the digits as plain integers sees

```31122025 - 01012026 = 30109999```

And this makes absolutely no sense!! So, when you have date-time features, unless it's a time-series analysis, you have to convert them into more meaningful numeric metrics, such as the number of days between two events.
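
To make the point concrete, here is a small sketch contrasting proper date arithmetic with the naive "digits as an integer" reading. The two dates are the ones used in the example above; the variable names are only illustrative.

```python
import pandas as pd

d1 = pd.Timestamp("2025-12-31")
d2 = pd.Timestamp("2026-01-01")

# Proper date arithmetic: subtracting Timestamps yields a Timedelta
gap_days = (d2 - d1).days

# Naive approach: read the DDMMYYYY digits as plain integers
naive_gap = abs(int("31122025") - int("01012026"))

print(gap_days)   # 1
print(naive_gap)  # a huge number with no relation to elapsed time
```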

In [145]:
#Calculate days_since_last_purchase - how many days ago did the customer last make a purchase
#The reference date is one day after the most recent order in the data, so every value is at least 1
reference_date = data['last_order_date'].max() + pd.Timedelta(days=1)
data['days_since_last_purchase'] = (reference_date - data['last_order_date']).dt.days
data.head()

Unnamed: 0,master_id,order_channel,last_order_channel,first_order_date,last_order_date,last_order_date_online,last_order_date_offline,order_num_total_ever_online,order_num_total_ever_offline,customer_value_total_ever_offline,customer_value_total_ever_online,interested_in_categories_12,days_since_last_purchase
0,cc294636-19f0-11eb-8d74-000d3a38a36f,Android App,Offline,2020-10-30,2021-02-26,2021-02-21,2021-02-26,4.0,1.0,139.99,799.38,[KADIN],94
1,f431bd5a-ab7b-11e9-a2fc-000d3a38a36f,Android App,Mobile,2017-02-08,2021-02-16,2021-02-16,2020-01-10,19.0,2.0,159.97,1853.58,"[ERKEK, COCUK, KADIN, AKTIFSPOR]",104
2,69b69676-1a40-11ea-941b-000d3a38a36f,Android App,Android App,2019-11-27,2020-11-27,2020-11-27,2019-12-01,3.0,2.0,189.97,395.35,"[ERKEK, KADIN]",185
3,1854e56c-491f-11eb-806e-000d3a38a36f,Android App,Android App,2021-01-06,2021-01-17,2021-01-17,2021-01-06,1.0,1.0,39.99,81.98,"[AKTIFCOCUK, COCUK]",134
4,d6ea1074-f1f5-11e9-9346-000d3a38a36f,Desktop,Desktop,2019-08-03,2021-03-07,2021-03-07,2019-08-03,1.0,1.0,49.99,159.99,[AKTIFSPOR],85


Next, we calculate the days since the first purchase, which signifies how long the customer has been with the business.

In [146]:
#days since first purchase
data['days_since_first_purchase']=(reference_date-data['first_order_date']).dt.days
data.head()

Unnamed: 0,master_id,order_channel,last_order_channel,first_order_date,last_order_date,last_order_date_online,last_order_date_offline,order_num_total_ever_online,order_num_total_ever_offline,customer_value_total_ever_offline,customer_value_total_ever_online,interested_in_categories_12,days_since_last_purchase,days_since_first_purchase
0,cc294636-19f0-11eb-8d74-000d3a38a36f,Android App,Offline,2020-10-30,2021-02-26,2021-02-21,2021-02-26,4.0,1.0,139.99,799.38,[KADIN],94,213
1,f431bd5a-ab7b-11e9-a2fc-000d3a38a36f,Android App,Mobile,2017-02-08,2021-02-16,2021-02-16,2020-01-10,19.0,2.0,159.97,1853.58,"[ERKEK, COCUK, KADIN, AKTIFSPOR]",104,1573
2,69b69676-1a40-11ea-941b-000d3a38a36f,Android App,Android App,2019-11-27,2020-11-27,2020-11-27,2019-12-01,3.0,2.0,189.97,395.35,"[ERKEK, KADIN]",185,551
3,1854e56c-491f-11eb-806e-000d3a38a36f,Android App,Android App,2021-01-06,2021-01-17,2021-01-17,2021-01-06,1.0,1.0,39.99,81.98,"[AKTIFCOCUK, COCUK]",134,145
4,d6ea1074-f1f5-11e9-9346-000d3a38a36f,Desktop,Desktop,2019-08-03,2021-03-07,2021-03-07,2019-08-03,1.0,1.0,49.99,159.99,[AKTIFSPOR],85,667


Now that we know how long a customer has been with the business, we can extract the average number of purchases per day, which helps separate high-frequency shoppers from low-frequency shoppers.

We can also include the average daily spend, which is the total revenue from both online and offline purchases divided by how long the customer has been with the business (days since first purchase).

In [147]:
#purchases per day
#first we get the total no. of purchase & total revenue - combine offline +online
data['total_purchases']=data['order_num_total_ever_online']+data['order_num_total_ever_offline']
data['total_value']=data['customer_value_total_ever_online']+ data['customer_value_total_ever_offline']

#with that now we can calculate the avg daily purchases
data['avg_daily_purchase']= data['total_purchases'] / data['days_since_first_purchase']
data['avg_daily_value']=data['total_value'] / data['days_since_first_purchase']


#preview the first two rows
data.head(2)

Unnamed: 0,master_id,order_channel,last_order_channel,first_order_date,last_order_date,last_order_date_online,last_order_date_offline,order_num_total_ever_online,order_num_total_ever_offline,customer_value_total_ever_offline,customer_value_total_ever_online,interested_in_categories_12,days_since_last_purchase,days_since_first_purchase,total_purchases,total_value,avg_daily_purchase,avg_daily_value
0,cc294636-19f0-11eb-8d74-000d3a38a36f,Android App,Offline,2020-10-30,2021-02-26,2021-02-21,2021-02-26,4.0,1.0,139.99,799.38,[KADIN],94,213,5.0,939.37,0.023474,4.410188
1,f431bd5a-ab7b-11e9-a2fc-000d3a38a36f,Android App,Mobile,2017-02-08,2021-02-16,2021-02-16,2020-01-10,19.0,2.0,159.97,1853.58,"[ERKEK, COCUK, KADIN, AKTIFSPOR]",104,1573,21.0,2013.55,0.01335,1.28007


Next, we can look at the average order value on each channel, that is, how much a customer spends per order online versus offline. (The columns are named `*_spending_ratio`, but each one is the total spend on a channel divided by the number of orders on that channel.)

In [148]:
data['online_spending_ratio']=data['customer_value_total_ever_online']/data['order_num_total_ever_online']
data['offline_spending_ratio']=data['customer_value_total_ever_offline']/data['order_num_total_ever_offline']
data.head(2)

Unnamed: 0,master_id,order_channel,last_order_channel,first_order_date,last_order_date,last_order_date_online,last_order_date_offline,order_num_total_ever_online,order_num_total_ever_offline,customer_value_total_ever_offline,customer_value_total_ever_online,interested_in_categories_12,days_since_last_purchase,days_since_first_purchase,total_purchases,total_value,avg_daily_purchase,avg_daily_value,online_spending_ratio,offline_spending_ratio
0,cc294636-19f0-11eb-8d74-000d3a38a36f,Android App,Offline,2020-10-30,2021-02-26,2021-02-21,2021-02-26,4.0,1.0,139.99,799.38,[KADIN],94,213,5.0,939.37,0.023474,4.410188,199.845,139.99
1,f431bd5a-ab7b-11e9-a2fc-000d3a38a36f,Android App,Mobile,2017-02-08,2021-02-16,2021-02-16,2020-01-10,19.0,2.0,159.97,1853.58,"[ERKEK, COCUK, KADIN, AKTIFSPOR]",104,1573,21.0,2013.55,0.01335,1.28007,97.556842,79.985
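
Dividing by an order count assumes the count is never zero. In the rows shown above every customer has at least one online and one offline order, but if that ever fails, the division produces `inf` or `NaN`. One hedged way to guard against it, shown on a made-up frame (the zero-order customer is hypothetical, and filling with 0 is a modelling assumption):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "customer_value_total_ever_online": [799.38, 0.0],
    "order_num_total_ever_online": [4.0, 0.0],
})

# Replace zero counts with NaN so the division yields NaN instead of inf
orders = toy["order_num_total_ever_online"].replace(0, np.nan)
toy["online_spending_ratio"] = toy["customer_value_total_ever_online"] / orders

# Treat "no online orders" as an average order value of 0 (an assumption)
toy["online_spending_ratio"] = toy["online_spending_ratio"].fillna(0.0)
print(toy)
```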


The final feature we can extract is looking at the number of categories our customers typically purchase from. We will need some simple string methods to make it work

In [149]:
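#regex=False treats '[' and ']' as literal characters (older pandas versions default to regex matching)
data['interested_in_categories_12']=data['interested_in_categories_12'].str.replace('[',' ',regex=False)
data['interested_in_categories_12']=data['interested_in_categories_12'].str.replace(']',' ',regex=False)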
data.head(2)

Unnamed: 0,master_id,order_channel,last_order_channel,first_order_date,last_order_date,last_order_date_online,last_order_date_offline,order_num_total_ever_online,order_num_total_ever_offline,customer_value_total_ever_offline,customer_value_total_ever_online,interested_in_categories_12,days_since_last_purchase,days_since_first_purchase,total_purchases,total_value,avg_daily_purchase,avg_daily_value,online_spending_ratio,offline_spending_ratio
0,cc294636-19f0-11eb-8d74-000d3a38a36f,Android App,Offline,2020-10-30,2021-02-26,2021-02-21,2021-02-26,4.0,1.0,139.99,799.38,KADIN,94,213,5.0,939.37,0.023474,4.410188,199.845,139.99
1,f431bd5a-ab7b-11e9-a2fc-000d3a38a36f,Android App,Mobile,2017-02-08,2021-02-16,2021-02-16,2020-01-10,19.0,2.0,159.97,1853.58,"ERKEK, COCUK, KADIN, AKTIFSPOR",104,1573,21.0,2013.55,0.01335,1.28007,97.556842,79.985


In [150]:
data['no_of_categories_interested_in']=data['interested_in_categories_12'].str.split().str.len()
data.head()

Unnamed: 0,master_id,order_channel,last_order_channel,first_order_date,last_order_date,last_order_date_online,last_order_date_offline,order_num_total_ever_online,order_num_total_ever_offline,customer_value_total_ever_offline,...,interested_in_categories_12,days_since_last_purchase,days_since_first_purchase,total_purchases,total_value,avg_daily_purchase,avg_daily_value,online_spending_ratio,offline_spending_ratio,no_of_categories_interested_in
0,cc294636-19f0-11eb-8d74-000d3a38a36f,Android App,Offline,2020-10-30,2021-02-26,2021-02-21,2021-02-26,4.0,1.0,139.99,...,KADIN,94,213,5.0,939.37,0.023474,4.410188,199.845,139.99,1
1,f431bd5a-ab7b-11e9-a2fc-000d3a38a36f,Android App,Mobile,2017-02-08,2021-02-16,2021-02-16,2020-01-10,19.0,2.0,159.97,...,"ERKEK, COCUK, KADIN, AKTIFSPOR",104,1573,21.0,2013.55,0.01335,1.28007,97.556842,79.985,4
2,69b69676-1a40-11ea-941b-000d3a38a36f,Android App,Android App,2019-11-27,2020-11-27,2020-11-27,2019-12-01,3.0,2.0,189.97,...,"ERKEK, KADIN",185,551,5.0,585.32,0.009074,1.062287,131.783333,94.985,2
3,1854e56c-491f-11eb-806e-000d3a38a36f,Android App,Android App,2021-01-06,2021-01-17,2021-01-17,2021-01-06,1.0,1.0,39.99,...,"AKTIFCOCUK, COCUK",134,145,2.0,121.97,0.013793,0.841172,81.98,39.99,2
4,d6ea1074-f1f5-11e9-9346-000d3a38a36f,Desktop,Desktop,2019-08-03,2021-03-07,2021-03-07,2019-08-03,1.0,1.0,49.99,...,AKTIFSPOR,85,667,2.0,209.98,0.002999,0.314813,159.99,49.99,1
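
Counting whitespace-separated tokens works here because each category is a single word followed by a comma. A slightly more defensive variant, sketched on made-up strings in the same bracketed format, strips the brackets and splits on commas instead, so a multi-word category name (hypothetical here) would still count once. The helper `count_categories` is an illustrative name, not part of the notebook.

```python
import pandas as pd

raw = pd.Series(["[KADIN]", "[ERKEK, COCUK, KADIN, AKTIFSPOR]", "[]"])

def count_categories(text: str) -> int:
    """Count comma-separated entries inside a '[A, B, C]' style string."""
    inner = text.strip().strip("[]")
    # Drop empty pieces so '[]' counts as zero categories
    return len([part for part in inner.split(",") if part.strip()])

n_categories = raw.apply(count_categories)
print(n_categories.tolist())  # [1, 4, 0]
```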


That is enough for now, but it is by no means an exhaustive list of the features we can extract from the data.  **What other features can you think of that we can engineer? Please feel free to add as many as you can justify!!!**

### 5. Exploratory Data Analysis (Student Led)

**Univariate Analysis**

**Bivariate Analysis**

**Multivariate Analysis**

### 6. Data Preprocessing 

In [151]:
'''I am going to drop all the temporal data, the master_id (because identifiers carry no
information for clustering) and, of course, the interested categories'''

data.drop(columns=['master_id','first_order_date','last_order_date',
                   'last_order_date_online','last_order_date_offline','interested_in_categories_12'],
                   inplace=True)


In [152]:
data.head()

Unnamed: 0,order_channel,last_order_channel,order_num_total_ever_online,order_num_total_ever_offline,customer_value_total_ever_offline,customer_value_total_ever_online,days_since_last_purchase,days_since_first_purchase,total_purchases,total_value,avg_daily_purchase,avg_daily_value,online_spending_ratio,offline_spending_ratio,no_of_categories_interested_in
0,Android App,Offline,4.0,1.0,139.99,799.38,94,213,5.0,939.37,0.023474,4.410188,199.845,139.99,1
1,Android App,Mobile,19.0,2.0,159.97,1853.58,104,1573,21.0,2013.55,0.01335,1.28007,97.556842,79.985,4
2,Android App,Android App,3.0,2.0,189.97,395.35,185,551,5.0,585.32,0.009074,1.062287,131.783333,94.985,2
3,Android App,Android App,1.0,1.0,39.99,81.98,134,145,2.0,121.97,0.013793,0.841172,81.98,39.99,2
4,Desktop,Desktop,1.0,1.0,49.99,159.99,85,667,2.0,209.98,0.002999,0.314813,159.99,49.99,1


##### 6.1 Feature Encoding

In [153]:
#split the numerical and categorical features; the categorical ones will be encoded
num_df=data.select_dtypes(include=[np.number])
cat_df=data.select_dtypes(include=[object])

In [154]:
ohe=OneHotEncoder(sparse_output=False)
ohe_transform=ohe.fit_transform(cat_df)
encoded_df = pd.DataFrame(ohe_transform, columns=ohe.get_feature_names_out(cat_df.columns))
encoded_df.index = cat_df.index
final_df=pd.concat([num_df,encoded_df],axis=1)
final_df.head()

Unnamed: 0,order_num_total_ever_online,order_num_total_ever_offline,customer_value_total_ever_offline,customer_value_total_ever_online,days_since_last_purchase,days_since_first_purchase,total_purchases,total_value,avg_daily_purchase,avg_daily_value,...,no_of_categories_interested_in,order_channel_Android App,order_channel_Desktop,order_channel_Ios App,order_channel_Mobile,last_order_channel_Android App,last_order_channel_Desktop,last_order_channel_Ios App,last_order_channel_Mobile,last_order_channel_Offline
0,4.0,1.0,139.99,799.38,94,213,5.0,939.37,0.023474,4.410188,...,1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,19.0,2.0,159.97,1853.58,104,1573,21.0,2013.55,0.01335,1.28007,...,4,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,3.0,2.0,189.97,395.35,185,551,5.0,585.32,0.009074,1.062287,...,2,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
3,1.0,1.0,39.99,81.98,134,145,2.0,121.97,0.013793,0.841172,...,2,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
4,1.0,1.0,49.99,159.99,85,667,2.0,209.98,0.002999,0.314813,...,1,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


##### 6.2 Feature Scaling

In [155]:
scaler=RobustScaler()
X_scaled=scaler.fit_transform(final_df)
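
`RobustScaler` centers each feature on its median and scales it by the interquartile range, which makes it less sensitive to extreme spenders than mean/variance scaling. A small illustration on made-up values (the numbers are assumptions, not taken from the dataset):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# One feature with a single extreme value
values = np.array([[10.0], [20.0], [30.0], [40.0], [1000.0]])

robust = RobustScaler().fit_transform(values)
standard = StandardScaler().fit_transform(values)

# RobustScaler: (x - median) / IQR, so the median maps to 0
print(robust.ravel())
# StandardScaler: (x - mean) / std, so the outlier drags the mean upward
print(standard.ravel())
```

In this toy case the median value maps to exactly 0 under `RobustScaler`, while under `StandardScaler` it ends up below 0 because the single outlier pulls the mean far above the typical values.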

### 7. Clustering algorithms

##### 7.1 Customer Segmentation with Hierarchical clustering

##### 7.2 Customer Segmentation with K-means clustering
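
As a starting point for this section, here is a minimal sketch of choosing the number of clusters with the silhouette score. It runs on synthetic blobs rather than `X_scaled` so that it is self-contained; the candidate range of `k`, the `random_state`, and the `n_init` value are assumptions, not settings taken from this project.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled customer matrix: 3 well-separated groups
X_demo, _ = make_blobs(n_samples=300, centers=[[-8, -8], [0, 0], [8, 8]],
                       cluster_std=0.6, random_state=42)

scores = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = model.fit_predict(X_demo)
    # Silhouette ranges from -1 to 1; higher means tighter, better-separated clusters
    scores[k] = silhouette_score(X_demo, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

On real customer data the silhouette curve is rarely this clear-cut, so it is worth reading it alongside how interpretable the resulting segments are.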

##### 7.3 Customer Segmentation with Gaussian Mixture Models
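
Unlike K-means, a Gaussian mixture model gives each point a probability of belonging to every component, which can be useful for customers who sit between segments. The sketch below is a hedged illustration on synthetic blobs rather than `X_scaled`; `n_components=3` and `covariance_type="full"` are assumptions chosen for the toy data, not tuned values for this project.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the scaled customer matrix
X_demo, _ = make_blobs(n_samples=300, centers=[[-8, -8], [0, 0], [8, 8]],
                       cluster_std=0.6, random_state=42)

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=42)
gmm.fit(X_demo)

# Hard assignment: the most probable component for each point
hard_labels = gmm.predict(X_demo)
# Soft assignment: one probability per component, each row sums to 1
soft_probs = gmm.predict_proba(X_demo)

print(soft_probs[:3].round(3))
# BIC can be compared across different n_components; lower is better
print("BIC:", gmm.bic(X_demo))
```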

##### 7.4 Customer Segmentation with DBSCAN