# Customer Clustering Analysis

This notebook performs customer clustering using simple and interpretable
behavioral features derived from transactional data.

The focus is on:
- Easy-to-understand feature logic
- Clean clustering workflow

In [18]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

## Objective

The objective of this notebook is to:
- Create customer-level behavioral features
- Apply clustering to group similar customers
- Evaluate cluster quality
- Save clustering outputs for downstream use

In [19]:
print("Starting customer clustering analysis")

Starting customer clustering analysis


## Load Cleaned Transaction Data

The cleaned transactional dataset generated from the data cleaning notebook
is loaded here. This dataset is assumed to be free from missing values and
datatype issues.

In [21]:
data_path = "../data/clean_transactions.csv"
df = pd.read_csv(data_path)
df.head()

,invoice_no,product_id,Description,quantity,invoice_date,price,customer_id,Country,transaction_value
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom,15.3
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,20.34
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom,22.0
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,20.34
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,20.34
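The text above assumes the cleaned data has no missing values or datatype issues. A minimal sketch of how that assumption could be checked is shown below. The helper name `summarize_quality` and the toy frame are hypothetical and are not part of the original notebook.

```python
import pandas as pd


def summarize_quality(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the dtype and missing-value count of every column."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
    })


# Toy frame standing in for clean_transactions.csv (values are made up).
toy = pd.DataFrame({
    "invoice_no": [1, 1, 2],
    "customer_id": [17850.0, 17850.0, None],  # one deliberately missing id
    "transaction_value": [15.3, 20.34, 22.0],
})

report = summarize_quality(toy)
print(report)
```

Running the same helper on `df` would show whether any column still contains missing values before features are built.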


## Ensure Correct Datatypes

Before feature creation, we ensure that the invoice date column
is in datetime format.

In [22]:
df["invoice_date"] = pd.to_datetime(df["invoice_date"])
df.dtypes

invoice_no                    int64
product_id                   object
Description                  object
quantity                      int64
invoice_date         datetime64[ns]
price                       float64
customer_id                 float64
Country                      object
transaction_value           float64
dtype: object

## Create Simple Customer-Level Features

Instead of recency-based calculations, we use simple per-customer
aggregations that are easy to explain.

Features created:
- Total number of orders
- Total quantity purchased
- Total amount spent
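- Average value per transaction row (stored as avg_order_value; this is the mean of transaction_value over a customer's line items, not a per-invoice average)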

In [23]:
customer_features = (
    df.groupby("customer_id")
      .agg(
          total_orders=("invoice_no", "nunique"),
          total_quantity=("quantity", "sum"),
          total_spend=("transaction_value", "sum"),
          avg_order_value=("transaction_value", "mean")
      )
      .reset_index()
)

customer_features.head()

,customer_id,total_orders,total_quantity,total_spend,avg_order_value
0,12346.0,1,74215,77183.6,77183.6
1,12347.0,7,2458,4310.0,23.681319
2,12348.0,4,2341,1797.24,57.975484
3,12349.0,1,631,1757.55,24.076027
4,12350.0,1,197,334.4,19.670588
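In the cell above, `avg_order_value` is the mean of `transaction_value` over a customer's rows, so it reflects line items rather than whole invoices. If a per-invoice average is wanted instead, one possible approach is to sum each invoice first and then average those totals. The function name `per_invoice_average` and the toy data below are illustrative assumptions, not part of the notebook.

```python
import pandas as pd


def per_invoice_average(frame: pd.DataFrame) -> pd.Series:
    """Average invoice total per customer (sum per invoice, then mean)."""
    invoice_totals = (
        frame.groupby(["customer_id", "invoice_no"])["transaction_value"].sum()
    )
    return (
        invoice_totals.groupby(level="customer_id")
        .mean()
        .rename("avg_invoice_value")
    )


# Made-up data: customer 1 has two invoices, customer 2 has one.
toy = pd.DataFrame({
    "customer_id": [1, 1, 1, 2],
    "invoice_no": [10, 10, 11, 12],
    "transaction_value": [5.0, 15.0, 40.0, 8.0],
})

row_mean = toy.groupby("customer_id")["transaction_value"].mean()
invoice_mean = per_invoice_average(toy)
print(pd.concat([row_mean.rename("row_mean"), invoice_mean], axis=1))
```

For customer 1 the two definitions differ: the row-level mean averages three line items, while the invoice-level mean averages two invoice totals.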


## Optional Customer Activity Duration

This feature measures the number of whole days between a customer's first
and last purchase in the dataset. Customers with a single purchase date get 0.
It helps distinguish short-term buyers from long-term customers.

In [24]:
activity_days = (
    df.groupby("customer_id")["invoice_date"]
      .apply(lambda x: (x.max() - x.min()).days)
      .reset_index(name="active_days")
)

customer_features = customer_features.merge(
    activity_days, on="customer_id", how="left"
)

customer_features.head()

,customer_id,total_orders,total_quantity,total_spend,avg_order_value,active_days
0,12346.0,1,74215,77183.6,77183.6,0
1,12347.0,7,2458,4310.0,23.681319,365
2,12348.0,4,2341,1797.24,57.975484,282
3,12349.0,1,631,1757.55,24.076027,0
4,12350.0,1,197,334.4,19.670588,0
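The same activity span can also be computed without a Python-level `lambda`, by aggregating the first and last dates per customer and subtracting them. This is only a sketch of an equivalent formulation. The function name `active_days_vectorized`, the intermediate column names, and the toy dates are assumptions made for illustration.

```python
import pandas as pd


def active_days_vectorized(frame: pd.DataFrame) -> pd.Series:
    """Whole days between each customer's first and last invoice date."""
    dates = frame.groupby("customer_id")["invoice_date"].agg(
        first_seen="min", last_seen="max"
    )
    # .dt.days keeps only the whole-day part of the timedelta.
    return (dates["last_seen"] - dates["first_seen"]).dt.days.rename("active_days")


# Made-up dates: a one-year span, a 23-hour span, and a single purchase.
toy = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 3],
    "invoice_date": pd.to_datetime([
        "2010-12-01 08:26:00",
        "2011-12-01 08:26:00",
        "2010-12-01 08:00:00",
        "2010-12-02 07:00:00",
        "2010-12-05 10:00:00",
    ]),
})

spans = active_days_vectorized(toy)
print(spans)
```

Because only whole days are kept, a customer whose purchases are less than 24 hours apart gets the same value, 0, as a customer with a single purchase.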


## Feature Scaling

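K-means relies on Euclidean distance calculations, so features with large
numeric ranges (such as total_spend) would otherwise dominate the result.
All numerical features are standardized to zero mean and unit variance
so that each contributes on a comparable scale.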

In [25]:
features_for_clustering = customer_features[
    ["total_orders", "total_quantity", "total_spend", "avg_order_value", "active_days"]
]

scaler = StandardScaler()
scaled_features = scaler.fit_transform(features_for_clustering)
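To make the scaling step concrete, the sketch below reproduces what `StandardScaler` computes by hand: each column has its mean subtracted and is divided by its population standard deviation (`ddof=0`). The small array is made-up data, not taken from the notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up features on very different scales (e.g. order counts vs. spend).
X = np.array([
    [1.0, 100.0],
    [2.0, 300.0],
    [3.0, 500.0],
    [10.0, 100000.0],
])

# Manual z-score: subtract the column mean, divide by the column std (ddof=0).
manual = (X - X.mean(axis=0)) / X.std(axis=0)

scaled = StandardScaler().fit_transform(X)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

Standardization puts the columns on a common scale but does not remove outliers, so an extreme customer still sits far from the rest after scaling.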

## Determine Optimal Number of Clusters

Cluster counts from k = 2 to k = 8 are tested.
The silhouette score (which ranges from -1 to 1, higher is better) is used
to evaluate how well customers are separated.

In [26]:
silhouette_scores = {}

for k in range(2, 9):
    kmeans = KMeans(n_clusters=k, random_state=42)
    labels = kmeans.fit_predict(scaled_features)
    score = silhouette_score(scaled_features, labels)
    silhouette_scores[k] = score

silhouette_scores

{2: 0.9684059292887868,
 3: 0.9203830580676403,
 4: 0.6101682213641073,
 5: 0.6150687297201186,
 6: 0.6269285546168829,
 7: 0.6269044674244914,
 8: 0.6220866439055609}
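A very high silhouette score can arise when a handful of extreme points form their own cluster, so it may be worth checking cluster sizes alongside the score. The sketch below illustrates this with synthetic data (a large group plus three distant points). The data, and the explicit `n_init=10` setting, are assumptions for this example and not the notebook's configuration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Synthetic data: 200 ordinary points and 3 far-away outliers.
bulk = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = rng.normal(loc=100.0, scale=1.0, size=(3, 2))
X = np.vstack([bulk, outliers])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

sizes = np.bincount(labels)          # number of points in each cluster
score = silhouette_score(X, labels)  # overall separation quality

print("cluster sizes:", sizes)
print("silhouette:", round(score, 3))
```

In the notebook, the analogous check would be `np.bincount(labels)` inside the loop over `k`, or `customer_features["cluster"].value_counts()` after the final fit.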

## Train Final Clustering Model

The number of clusters with the highest silhouette score
is selected for the final model.

In [27]:
optimal_k = max(silhouette_scores, key=silhouette_scores.get)

final_kmeans = KMeans(n_clusters=optimal_k, random_state=42)
customer_features["cluster"] = final_kmeans.fit_predict(scaled_features)

customer_features.head()

,customer_id,total_orders,total_quantity,total_spend,avg_order_value,active_days,cluster
0,12346.0,1,74215,77183.6,77183.6,0,1
1,12347.0,7,2458,4310.0,23.681319,365,0
2,12348.0,4,2341,1797.24,57.975484,282,0
3,12349.0,1,631,1757.55,24.076027,0,0
4,12350.0,1,197,334.4,19.670588,0,0


## Analyze Cluster Characteristics

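Average feature values are calculated for each cluster
to understand customer behavior patterns. Because customer_id is an
identifier, its per-cluster mean in the output has no behavioral meaning
and can be ignored.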

In [28]:
cluster_summary = (
    customer_features
    .groupby("cluster")
    .mean()
    .reset_index()
)

cluster_summary

,cluster,customer_id,total_orders,total_quantity,total_spend,avg_order_value,active_days
0,0,15300.97624,4.257439,1107.237601,1928.799717,37.638584,130.410381
1,1,14479.333333,25.333333,117375.666667,175287.373333,44492.024666,185.666667


## Save Clustering Outputs

The trained clustering model and customer feature dataset
are saved for use in the application and dashboard.

In [30]:
import joblib

joblib.dump(final_kmeans, "../models/behavior_model.pkl")
customer_features.to_csv("../data/customer_features.csv", index=False)

print("Clustering model and customer features saved successfully")

Clustering model and customer features saved successfully
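The cell above saves the fitted K-means model, which was trained on scaled features. To assign new customers to clusters later, the same fitted scaler is also needed. One possible approach, shown here as an assumption rather than what this notebook does, is to bundle the scaler and the model in a scikit-learn pipeline and save that single object. The file name `behavior_pipeline.pkl` and the synthetic data are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the customer feature matrix: two separated groups.
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(50, 2)),
    rng.normal(loc=20.0, scale=1.0, size=(50, 2)),
])

# Scaler and model are fitted together, so they are saved and loaded together.
pipeline = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=2, n_init=10, random_state=42),
)
pipeline.fit(X)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "behavior_pipeline.pkl")  # hypothetical file name
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

# Unscaled new rows go straight in; the pipeline applies the scaler first.
new_customers = np.array([[0.5, -0.2], [19.0, 21.0]])
restored_labels = restored.predict(new_customers)
print(restored_labels)
```

An alternative with the same effect is to call `joblib.dump` on the existing `scaler` object alongside `final_kmeans` and apply both in order when loading.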


## Output Summary

This notebook produced:
- A customer-level feature dataset
- A trained clustering model
- Interpretable customer segments

These outputs will be used by the application and for dashboard visualization.