<a href="https://colab.research.google.com/github/abhinavverma523/retail-customer-segmentation-kmeans/blob/main/retail_customer_segmentation_kmeans.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Examine the customer data**

In [1]:
!pip install -U kaleido



In [2]:
import pandas as pd
import numpy as np

**Load the data**

In [3]:
df = pd.read_csv('/content/Mall_Customers 2.csv')

**Display basic information about the dataset**

In [4]:
print("CUSTOMER DATA OVERVIEW:")
print("="*40)
print(f"Dataset shape: {df.shape}")
print(f"Number of customers: {len(df)}")
print()

print("FIRST 10 CUSTOMERS:")
print(df.head(10))
print()

print("DATASET INFO:")
print(df.info())
print()

print("STATISTICAL SUMMARY:")
print(df.describe())
print()

print("GENDER DISTRIBUTION:")
print(df['Gender'].value_counts())

CUSTOMER DATA OVERVIEW:
Dataset shape: (200, 5)
Number of customers: 200

FIRST 10 CUSTOMERS:
   CustomerID  Gender  Age  Annual Income (k$)  Spending Score (1-100)
0           1    Male   19                  15                      39
1           2    Male   21                  15                      81
2           3  Female   20                  16                       6
3           4  Female   23                  16                      77
4           5  Female   31                  17                      40
5           6  Female   22                  17                      76
6           7  Female   35                  18                       6
7           8  Female   23                  18                      94
8           9    Male   64                  19                       3
9          10  Female   30                  19                      72

DATASET INFO:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
 #   C

***Preprocessing the customer features (feature selection, missing-value check, standardization) for K-means clustering and segmentation analysis.***

**Import required libraries for clustering and visualization**

In [5]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
import warnings
warnings.filterwarnings('ignore')

**Set up plotting style**

In [6]:
plt.style.use('default')
sns.set_palette("husl")


**Prepare the data for clustering: select Age, Annual Income, and Spending Score as features**

In [7]:
clustering_features = ['Age', 'Annual Income (k$)', 'Spending Score (1-100)']
X = df[clustering_features].copy()

print("FEATURES SELECTED FOR CLUSTERING:")
print("="*40)
for i, feature in enumerate(clustering_features):
    print(f"{i+1}. {feature}")

print(f"\nClustering data shape: {X.shape}")
print("\nFirst 5 rows of clustering data:")
print(X.head())

FEATURES SELECTED FOR CLUSTERING:
1. Age
2. Annual Income (k$)
3. Spending Score (1-100)

Clustering data shape: (200, 3)

First 5 rows of clustering data:
   Age  Annual Income (k$)  Spending Score (1-100)
0   19                  15                      39
1   21                  15                      81
2   20                  16                       6
3   23                  16                      77
4   31                  17                      40


**Check for missing values**

In [8]:
print(f"\nMissing values: {X.isnull().sum().sum()}")



Missing values: 0


**Standardize the features so that no single feature dominates the distance calculation**

In [9]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print("\nData standardized successfully!")
print(f"Scaled data shape: {X_scaled.shape}")


Data standardized successfully!
Scaled data shape: (200, 3)
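
As a rough illustration of what `StandardScaler` does here, the sketch below standardizes the first five rows shown earlier by hand, using NumPy only. It assumes the scaler's default behavior (subtract the column mean, divide by the population standard deviation, `ddof=0`); the helper name `standardize` is hypothetical.

```python
import numpy as np

def standardize(values):
    """Return per-column z-scores: (x - mean) / std, with population std (ddof=0)."""
    values = np.asarray(values, dtype=float)
    mean = values.mean(axis=0)
    std = values.std(axis=0)  # NumPy's default ddof=0 matches StandardScaler
    return (values - mean) / std

# First five rows of the clustering data: Age, Annual Income (k$), Spending Score
sample = np.array([
    [19, 15, 39],
    [21, 15, 81],
    [20, 16, 6],
    [23, 16, 77],
    [31, 17, 40],
])

z = standardize(sample)
print(z.mean(axis=0).round(6))  # each column mean is approximately 0
print(z.std(axis=0).round(6))   # each column standard deviation is approximately 1
```

After this step every feature is measured in standard deviations, so a gap in age counts as much as a comparable gap in income or spending score.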


***Calculating the optimal number of customer segments using the elbow method and silhouette scores for clustering analysis.***

**Find optimal number of clusters using Elbow Method and Silhouette Analysis**

In [10]:
def find_optimal_clusters(X_scaled, max_k=10):
    """
    Find optimal number of clusters using elbow method and silhouette analysis
    """
    inertias = []
    silhouette_scores = []
    K_range = range(2, max_k + 1)

    for k in K_range:
        kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
        kmeans.fit(X_scaled)
        inertias.append(kmeans.inertia_)
        silhouette_scores.append(silhouette_score(X_scaled, kmeans.labels_))

    return K_range, inertias, silhouette_scores

**Calculate metrics for different cluster numbers**

In [12]:
K_range, inertias, silhouette_scores = find_optimal_clusters(X_scaled)

print("CLUSTER EVALUATION RESULTS:")
print("="*40)
print("K\tInertia\t\tSilhouette Score")
print("-"*40)
for k, inertia, sil_score in zip(K_range, inertias, silhouette_scores):
    print(f"{k}\t{inertia:.2f}\t\t{sil_score:.3f}")

CLUSTER EVALUATION RESULTS:
K	Inertia		Silhouette Score
----------------------------------------
2	389.39		0.335
3	295.21		0.358
4	205.23		0.404
5	168.25		0.417
6	133.87		0.428
7	117.01		0.417
8	103.87		0.408
9	93.09		0.418
10	82.39		0.407
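
The inertia column is the within-cluster sum of squared distances. A minimal sketch of that calculation on toy 2-D points follows; the points, the labels, and the helper name `inertia` are illustrative assumptions, not values from the dataset.

```python
import numpy as np

def inertia(points, labels):
    """Sum of squared Euclidean distances from each point to its cluster mean."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for label in np.unique(labels):
        members = points[labels == label]
        center = members.mean(axis=0)  # cluster mean, used here as the centroid
        total += ((members - center) ** 2).sum()
    return total

# Two tight toy clusters (hypothetical data)
points = [[0, 0], [0, 2], [10, 10], [10, 12]]
labels = [0, 0, 1, 1]

print(inertia(points, labels))        # 4.0: every point is 1 unit from its centroid
print(inertia(points, [0, 0, 0, 0]))  # far larger when all points share one cluster
```

Because adding clusters can only shorten these distances, inertia tends to keep falling as k grows, which is why the elbow (rather than the minimum) is the quantity of interest.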


**Find the optimal k based on the highest silhouette score**

In [13]:
optimal_k = K_range[silhouette_scores.index(max(silhouette_scores))]
print(f"\nOptimal number of clusters (highest silhouette): {optimal_k}")
print(f"Silhouette score: {max(silhouette_scores):.3f}")



Optimal number of clusters (highest silhouette): 6
Silhouette score: 0.428
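
To see how close the candidates are, the sketch below re-ranks the silhouette scores from the evaluation table. The values are copied from the printed output (rounded to three decimals); the variable names are illustrative.

```python
# Silhouette scores copied from the evaluation table (k -> score)
silhouette_by_k = {
    2: 0.335, 3: 0.358, 4: 0.404, 5: 0.417, 6: 0.428,
    7: 0.417, 8: 0.408, 9: 0.418, 10: 0.407,
}

best_k = max(silhouette_by_k, key=silhouette_by_k.get)
ranked = sorted(silhouette_by_k.items(), key=lambda item: item[1], reverse=True)

print(f"Best k: {best_k}")
for k, score in ranked[:4]:
    gap = silhouette_by_k[best_k] - score
    print(f"k={k}: score={score:.3f}, gap to best={gap:.3f}")
```

The leading candidates sit within roughly 0.01 of each other, so the silhouette score alone does not strongly separate k=5 from k=6.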


**Also check the elbow method manually**

In [14]:
print(f"\nInertia reduction from k=2 to k=3: {inertias[0] - inertias[1]:.2f}")
print(f"Inertia reduction from k=3 to k=4: {inertias[1] - inertias[2]:.2f}")
print(f"Inertia reduction from k=4 to k=5: {inertias[2] - inertias[3]:.2f}")


Inertia reduction from k=2 to k=3: 94.17
Inertia reduction from k=3 to k=4: 89.99
Inertia reduction from k=4 to k=5: 36.98
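
The same check can be extended to every step in the range. The sketch below recomputes the drops from the inertia values printed in the evaluation table; because those inputs are rounded to two decimals, individual results can differ from the notebook's output by 0.01. The variable names are illustrative.

```python
# Inertia values copied from the evaluation table (k = 2..10), rounded to two decimals
ks = list(range(2, 11))
inertia_table = [389.39, 295.21, 205.23, 168.25, 133.87, 117.01, 103.87, 93.09, 82.39]

# Absolute and relative drop in inertia for each step k -> k+1
drops = [prev - curr for prev, curr in zip(inertia_table, inertia_table[1:])]
relative = [drop / prev for drop, prev in zip(drops, inertia_table)]

for k, drop, rel in zip(ks, drops, relative):
    print(f"k={k} -> k={k + 1}: drop={drop:.2f} ({rel:.1%})")
```

On these numbers the drop shrinks most sharply after k=4 and again after k=6, so the elbow is not clear-cut and other criteria, such as interpretability, can reasonably break the tie.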


***Applying K-means clustering to purchase behavior and demographics with k=5. The silhouette score peaks at k=6 (0.428), but k=5 (0.417) is used because five segments are easier to interpret for the business.***

In [15]:
# Apply K-means clustering
# Using k=5 rather than the silhouette-optimal k=6 for better business interpretability
k_clusters = 5

print(f"APPLYING K-MEANS CLUSTERING WITH {k_clusters} CLUSTERS:")
print("="*50)

# Fit K-means model
kmeans = KMeans(n_clusters=k_clusters, random_state=42, n_init=10)
cluster_labels = kmeans.fit_predict(X_scaled)

# Add cluster labels to original dataframe
df_clustered = df.copy()
df_clustered['Cluster'] = cluster_labels

# Calculate cluster centers (in original scale)
cluster_centers_scaled = kmeans.cluster_centers_
cluster_centers_original = scaler.inverse_transform(cluster_centers_scaled)

# Create cluster summary
print("CLUSTER CENTERS (Original Scale):")
print("-"*50)
cluster_summary = pd.DataFrame(
    cluster_centers_original,
    columns=clustering_features,
    index=[f'Cluster {i}' for i in range(k_clusters)]
)
print(cluster_summary.round(2))

print(f"\nCLUSTER SIZES:")
print("-"*20)
cluster_counts = pd.Series(cluster_labels).value_counts().sort_index()
for cluster, count in cluster_counts.items():
    percentage = (count / len(df_clustered)) * 100
    print(f"Cluster {cluster}: {count} customers ({percentage:.1f}%)")

print(f"\nSILHOUETTE SCORE: {silhouette_score(X_scaled, cluster_labels):.3f}")
print(f"WITHIN-CLUSTER SUM OF SQUARES: {kmeans.inertia_:.2f}")

APPLYING K-MEANS CLUSTERING WITH 5 CLUSTERS:
CLUSTER CENTERS (Original Scale):
--------------------------------------------------
             Age  Annual Income (k$)  Spending Score (1-100)
Cluster 0  46.25               26.75                   18.35
Cluster 1  25.19               41.09                   62.24
Cluster 2  32.88               86.10                   81.53
Cluster 3  39.87               86.10                   19.36
Cluster 4  55.64               54.38                   48.85

CLUSTER SIZES:
--------------------
Cluster 0: 20 customers (10.0%)
Cluster 1: 54 customers (27.0%)
Cluster 2: 40 customers (20.0%)
Cluster 3: 39 customers (19.5%)
Cluster 4: 47 customers (23.5%)

SILHOUETTE SCORE: 0.417
WITHIN-CLUSTER SUM OF SQUARES: 168.25
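
Converting the centers back with `scaler.inverse_transform` amounts to multiplying by each feature's standard deviation and adding its mean. The sketch below checks this on toy data with NumPy only; the array values and the cluster assignment are hypothetical.

```python
import numpy as np

# Hypothetical raw data: columns are Age and Annual Income (k$)
raw = np.array([[20.0, 15.0], [24.0, 17.0], [50.0, 80.0], [54.0, 90.0]])
labels = np.array([0, 0, 1, 1])  # assumed cluster assignment

mean = raw.mean(axis=0)
std = raw.std(axis=0)            # population std, as StandardScaler uses
scaled = (raw - mean) / std

# Cluster centers in scaled space are the per-cluster means of the scaled rows
centers_scaled = np.array([scaled[labels == c].mean(axis=0) for c in (0, 1)])

# Manual inverse transform: undo the scaling, then undo the centering
centers_original = centers_scaled * std + mean

print(centers_original)  # rows: [22, 16] and [52, 85], up to floating-point error
```

Because standardization is a linear rescaling, the un-scaled center of a cluster equals the plain mean of that cluster's original rows, which is why the centers above line up with the per-cluster averages reported later.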


***Analyzing each customer cluster in detail to gain insights and tailor marketing strategies for each segment.***

**Detailed cluster analysis**

In [16]:
print("DETAILED CLUSTER ANALYSIS:")
print("="*60)

cluster_names = {
    0: "Budget Shoppers",
    1: "Young Spenders",
    2: "High Value Customers",
    3: "Conservative High Earners",
    4: "Mature Moderates"
}

DETAILED CLUSTER ANALYSIS:


**Analyze each cluster**

In [17]:
for cluster_id in range(k_clusters):
    cluster_data = df_clustered[df_clustered['Cluster'] == cluster_id]

    print(f"\n{cluster_names[cluster_id].upper()} (Cluster {cluster_id}):")
    print("-" * 40)
    print(f"Size: {len(cluster_data)} customers ({len(cluster_data)/len(df_clustered)*100:.1f}%)")

    # Age analysis
    print(f"Age: {cluster_data['Age'].mean():.1f} ± {cluster_data['Age'].std():.1f} years")
    print(f"Age Range: {cluster_data['Age'].min()}-{cluster_data['Age'].max()} years")


BUDGET SHOPPERS (Cluster 0):
----------------------------------------
Size: 20 customers (10.0%)
Age: 46.2 ± 11.6 years
Age Range: 20-67 years

YOUNG SPENDERS (Cluster 1):
----------------------------------------
Size: 54 customers (27.0%)
Age: 25.2 ± 5.5 years
Age Range: 18-38 years

HIGH VALUE CUSTOMERS (Cluster 2):
----------------------------------------
Size: 40 customers (20.0%)
Age: 32.9 ± 3.9 years
Age Range: 27-40 years

CONSERVATIVE HIGH EARNERS (Cluster 3):
----------------------------------------
Size: 39 customers (19.5%)
Age: 39.9 ± 10.9 years
Age Range: 19-59 years

MATURE MODERATES (Cluster 4):
----------------------------------------
Size: 47 customers (23.5%)
Age: 55.6 ± 8.9 years
Age Range: 40-70 years


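**Age analysis (this cell runs outside the loop, so `cluster_data` still holds the last cluster, Cluster 4: Mature Moderates)**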

In [18]:
print(f"Age: {cluster_data['Age'].mean():.1f} ± {cluster_data['Age'].std():.1f} years")
print(f"Age Range: {cluster_data['Age'].min()}-{cluster_data['Age'].max()} years")


Age: 55.6 ± 8.9 years
Age Range: 40-70 years


**Income analysis (Cluster 4 only; `cluster_data` is left over from the loop)**

In [19]:
print(f"Income: ${cluster_data['Annual Income (k$)'].mean():.1f}k ± ${cluster_data['Annual Income (k$)'].std():.1f}k")
print(f"Income Range: ${cluster_data['Annual Income (k$)'].min()}k-${cluster_data['Annual Income (k$)'].max()}k")

Income: $54.4k ± $8.8k
Income Range: $38k-$79k


**Spending analysis (Cluster 4 only; `cluster_data` is left over from the loop)**

In [20]:
print(f"Spending Score: {cluster_data['Spending Score (1-100)'].mean():.1f} ± {cluster_data['Spending Score (1-100)'].std():.1f}")
print(f"Spending Range: {cluster_data['Spending Score (1-100)'].min()}-{cluster_data['Spending Score (1-100)'].max()}")

Spending Score: 48.9 ± 6.3
Spending Range: 35-60


**Gender distribution (Cluster 4 only; `cluster_data` is left over from the loop)**

In [21]:
gender_dist = cluster_data['Gender'].value_counts()
print(f"Gender: {gender_dist.to_dict()}")

Gender: {'Female': 27, 'Male': 20}
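
The income, spending, and gender cells run after the loop has finished, so they describe only the last cluster. One way to get the same figures for every cluster is to move all of the statistics into a single loop, as sketched below. The small DataFrame and its `Cluster` labels are made up for illustration; in the notebook the loop would run over `df_clustered`.

```python
import pandas as pd

# Hypothetical stand-in for df_clustered (the Cluster labels here are invented)
demo = pd.DataFrame({
    'Gender': ['Male', 'Male', 'Female', 'Female', 'Female', 'Female'],
    'Age': [19, 21, 20, 23, 31, 22],
    'Annual Income (k$)': [15, 15, 16, 16, 17, 17],
    'Spending Score (1-100)': [39, 81, 6, 77, 40, 76],
    'Cluster': [0, 1, 0, 1, 0, 1],
})

numeric_cols = ['Age', 'Annual Income (k$)', 'Spending Score (1-100)']
reports = {}

for cluster_id, cluster_data in demo.groupby('Cluster'):
    # mean, std, min, and max for every numeric feature of this cluster
    stats = cluster_data[numeric_cols].agg(['mean', 'std', 'min', 'max'])
    gender = cluster_data['Gender'].value_counts().to_dict()
    reports[cluster_id] = {'stats': stats, 'gender': gender}

    print(f"\nCluster {cluster_id} ({len(cluster_data)} customers)")
    print(stats.round(1))
    print(f"Gender: {gender}")
```

Keeping the results in `reports` also makes it possible to compare clusters afterwards instead of reading them off the printed output.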


**Business insights**

In [22]:
print("\n" + "="*60)
print("BUSINESS INSIGHTS & RECOMMENDATIONS:")
print("="*60)

insights = {
    0: "Budget-conscious older customers with limited spending. Focus on value deals and essential products.",
    1: "Young customers with moderate income but high spending willingness. Target with trendy products and flexible payment options.",
    2: "Premium customers - high income and high spending. Focus on luxury products, premium services, and VIP treatment.",
    3: "High earners but conservative spenders. Target with quality, investment-worthy products and highlight long-term value.",
    4: "Stable middle-aged customers with moderate behavior. Focus on family-oriented products and loyalty programs."
}

for cluster_id, insight in insights.items():
    print(f"\n{cluster_names[cluster_id]}: {insight}")


BUSINESS INSIGHTS & RECOMMENDATIONS:

Budget Shoppers: Budget-conscious older customers with limited spending. Focus on value deals and essential products.

Young Spenders: Young customers with moderate income but high spending willingness. Target with trendy products and flexible payment options.

High Value Customers: Premium customers - high income and high spending. Focus on luxury products, premium services, and VIP treatment.

Conservative High Earners: High earners but conservative spenders. Target with quality, investment-worthy products and highlight long-term value.

Mature Moderates: Stable middle-aged customers with moderate behavior. Focus on family-oriented products and loyalty programs.


***Saving the clustering results and generating detailed summaries for customer segmentation analysis.***

**Add the cluster names to the clustered data before saving it for visualization**

In [23]:
df_clustered['Cluster_Name'] = df_clustered['Cluster'].map(cluster_names)

**Save the main results**

In [24]:
df_clustered.to_csv('customer_clusters.csv', index=False)

**Create cluster summary table**

In [25]:
cluster_summary_detailed = pd.DataFrame({
    'Cluster_ID': range(k_clusters),
    'Cluster_Name': [cluster_names[i] for i in range(k_clusters)],
    'Customer_Count': [len(df_clustered[df_clustered['Cluster'] == i]) for i in range(k_clusters)],
    'Percentage': [len(df_clustered[df_clustered['Cluster'] == i])/len(df_clustered)*100 for i in range(k_clusters)],
    'Avg_Age': [df_clustered[df_clustered['Cluster'] == i]['Age'].mean() for i in range(k_clusters)],
    'Avg_Income': [df_clustered[df_clustered['Cluster'] == i]['Annual Income (k$)'].mean() for i in range(k_clusters)],
    'Avg_Spending': [df_clustered[df_clustered['Cluster'] == i]['Spending Score (1-100)'].mean() for i in range(k_clusters)]
})

cluster_summary_detailed.to_csv('cluster_summary.csv', index=False)

print("SAVED FILES:")
print("1. customer_clusters.csv - Full dataset with cluster assignments")
print("2. cluster_summary.csv - Cluster summary statistics")

print(f"\nCluster Summary Table:")
print(cluster_summary_detailed.round(2))

SAVED FILES:
1. customer_clusters.csv - Full dataset with cluster assignments
2. cluster_summary.csv - Cluster summary statistics

Cluster Summary Table:
   Cluster_ID               Cluster_Name  Customer_Count  Percentage  Avg_Age  \
0           0            Budget Shoppers              20        10.0    46.25   
1           1             Young Spenders              54        27.0    25.19   
2           2       High Value Customers              40        20.0    32.88   
3           3  Conservative High Earners              39        19.5    39.87   
4           4           Mature Moderates              47        23.5    55.64   

   Avg_Income  Avg_Spending  
0       26.75         18.35  
1       41.09         62.24  
2       86.10         81.53  
3       86.10         19.36  
4       54.38         48.85  
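
The summary table can also be built with a single `groupby` and named aggregation instead of filtering the DataFrame once per column. This is a sketch on invented rows, not the notebook's data; the output column names mirror `cluster_summary_detailed`, and the segment names are placeholders.

```python
import pandas as pd

# Hypothetical clustered data (labels invented for illustration)
demo = pd.DataFrame({
    'Age': [19, 21, 20, 23, 64],
    'Annual Income (k$)': [15, 15, 16, 16, 19],
    'Spending Score (1-100)': [39, 81, 6, 77, 3],
    'Cluster': [0, 1, 0, 1, 0],
})
names = {0: 'Segment A', 1: 'Segment B'}  # placeholder names

summary = (
    demo.groupby('Cluster')
    .agg(
        Customer_Count=('Age', 'count'),
        Avg_Age=('Age', 'mean'),
        Avg_Income=('Annual Income (k$)', 'mean'),
        Avg_Spending=('Spending Score (1-100)', 'mean'),
    )
    .reset_index()
    .rename(columns={'Cluster': 'Cluster_ID'})
)
summary['Percentage'] = summary['Customer_Count'] / len(demo) * 100
summary['Cluster_Name'] = summary['Cluster_ID'].map(names)

print(summary.round(2))
```

The grouped version scans the data once and keeps each statistic next to the column it comes from, which tends to be easier to extend than one list comprehension per column.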


**Create data for scatter plots**

In [26]:
income_spending_data = df_clustered[['Annual Income (k$)', 'Spending Score (1-100)', 'Cluster', 'Cluster_Name']].copy()
age_spending_data = df_clustered[['Age', 'Spending Score (1-100)', 'Cluster', 'Cluster_Name']].copy()
age_income_data = df_clustered[['Age', 'Annual Income (k$)', 'Cluster', 'Cluster_Name']].copy()

print("\nData prepared for visualization!")
print("Ready to create scatter plots and cluster analysis charts.")


Data prepared for visualization!
Ready to create scatter plots and cluster analysis charts.


***Creating a scatter plot to visualize customer segments based on income and spending score using clustering results.***

In [27]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from sklearn.cluster import KMeans
import numpy as np

In [28]:
df = pd.read_csv("/content/Mall_Customers 2.csv")

**Create clusters using K-means on Annual Income and Spending Score**

In [29]:
X = df[['Annual Income (k$)', 'Spending Score (1-100)']]

**Apply K-means clustering with 5 clusters**

In [30]:
kmeans = KMeans(n_clusters=5, random_state=42)
df['Cluster'] = kmeans.fit_predict(X)


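**Create cluster names (K-means label IDs are arbitrary, so check this mapping against `kmeans.cluster_centers_` before relying on it)**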

In [31]:
cluster_names = {
    0: 'Low Income Low Spend',
    1: 'High Income Low Spend',
    2: 'Medium Income Med Spend',
    3: 'Low Income High Spend',
    4: 'High Income High Spend'
}

df['Cluster_Name'] = df['Cluster'].map(cluster_names)
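
K-means numbers its clusters arbitrarily, so a fixed ID-to-name dictionary can attach the wrong label after a rerun or a library upgrade. One possible safeguard is to derive each name from the cluster's centroid. The sketch below does this with hypothetical cut-offs and made-up centroid values; neither comes from the notebook's output, and the helper names `level` and `name_clusters` are illustrative.

```python
def level(value, low, high):
    """Bucket a value into 'Low', 'Medium', or 'High' using two cut-offs."""
    if value < low:
        return 'Low'
    if value > high:
        return 'High'
    return 'Medium'

def name_clusters(centers, income_cuts=(40, 70), spend_cuts=(40, 60)):
    """Map cluster ID -> descriptive name from (income, spending) centroids.

    The default cut-offs are illustrative assumptions and should be tuned to the data.
    """
    names = {}
    for cluster_id, (income, spending) in enumerate(centers):
        income_level = level(income, *income_cuts)
        spend_level = level(spending, *spend_cuts)
        names[cluster_id] = f"{income_level} Income {spend_level} Spend"
    return names

# Made-up centroids in (Annual Income (k$), Spending Score) order
example_centers = [(55.0, 50.0), (88.0, 17.0), (26.0, 21.0), (25.0, 79.0), (87.0, 82.0)]
print(name_clusters(example_centers))
```

In the notebook, `kmeans.cluster_centers_` could be passed in place of `example_centers`, since each row holds the income and spending coordinates of one centroid.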


**Create scatter plot**

In [32]:
fig = px.scatter(df,
                 x='Annual Income (k$)',
                 y='Spending Score (1-100)',
                 color='Cluster_Name',
                 title='Customer Segmentation: Income vs Spending Score',
                 color_discrete_sequence=['#1FB8CD', '#DB4545', '#2E8B57', '#5D878F', '#D2BA4C'])

In [33]:
# Update traces for medium size and transparency
fig.update_traces(marker=dict(size=8, opacity=0.7), cliponaxis=False)

# Update layout for centered legend (5 items)
fig.update_layout(legend=dict(orientation='h', yanchor='bottom', y=1.05, xanchor='center', x=0.5))

# Update axis labels to fit 15 character limit
fig.update_xaxes(title='Income (k$)')
fig.update_yaxes(title='Spending Score')


***Creating a scatter plot to visualize customer clusters based on age and spending score across the different segments.***

In [36]:
import pandas as pd
import plotly.express as px
from sklearn.cluster import KMeans
import numpy as np

# Load the data
df = pd.read_csv("/content/Mall_Customers 2.csv")

# Create clusters using K-means clustering on Age and Spending Score
features = df[['Age', 'Spending Score (1-100)']]
kmeans = KMeans(n_clusters=4, random_state=42)
df['Cluster'] = kmeans.fit_predict(features)

# Name each cluster from its mean age and mean spending score.
# K-means label IDs are arbitrary, so the names are derived from the data rather than hardcoded.
# Note: two clusters can fall into the same bucket and then share one name (and color) in the plot.
cluster_names = {}

for cluster in range(4):
    cluster_data = df[df['Cluster'] == cluster]
    avg_age = cluster_data['Age'].mean()
    avg_spending = cluster_data['Spending Score (1-100)'].mean()

    if avg_age < 35 and avg_spending > 60:
        cluster_names[cluster] = 'Young High Spnd'
    elif avg_age < 35 and avg_spending <= 60:
        cluster_names[cluster] = 'Young Low Spnd'
    elif avg_age >= 35 and avg_spending > 60:
        cluster_names[cluster] = 'Old High Spnd'
    else:
        cluster_names[cluster] = 'Old Low Spnd'

df['Cluster_Name'] = df['Cluster'].map(cluster_names)

# Create scatter plot
fig = px.scatter(df,
                x='Age',
                y='Spending Score (1-100)',
                color='Cluster_Name',
                title='Customer Clusters: Age vs Spend Score',
                color_discrete_sequence=['#1FB8CD', '#DB4545', '#2E8B57', '#5D878F'])

# Update traces for medium size and transparency
fig.update_traces(marker=dict(size=8, opacity=0.7), cliponaxis=False)

# Update axis labels to meet 15 character limit
fig.update_xaxes(title_text='Age')
fig.update_yaxes(title_text='Spend Score')

# Center legend under title since we have 4 items (≤5)
fig.update_layout(legend=dict(orientation='h', yanchor='bottom', y=1.05, xanchor='center', x=0.5))


***Grouped bar chart comparing the average characteristics of each customer cluster across age, income, and spending score metrics.***

In [37]:
import pandas as pd
import plotly.graph_objects as go
from sklearn.cluster import KMeans
import numpy as np

# Load the data
df = pd.read_csv("/content/Mall_Customers 2.csv")

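# Prepare features for clustering (using Age, Income, and Spending Score)
# Note: the features are not standardized here, unlike in the main analysis,
# so these clusters differ from the five segments described earlier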
features = df[['Age', 'Annual Income (k$)', 'Spending Score (1-100)']]

# Perform K-means clustering with 5 clusters
kmeans = KMeans(n_clusters=5, random_state=42)
df['Cluster'] = kmeans.fit_predict(features)

# Calculate cluster summaries
cluster_summary = df.groupby('Cluster').agg({
    'Age': 'mean',
    'Annual Income (k$)': 'mean',
    'Spending Score (1-100)': 'mean'
}).reset_index()

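# Create cluster names
# Caution: names are assigned by position, but K-means label IDs are arbitrary.
# In the summary printed by this cell, 'High Value' lands on the cluster with the
# lowest average spending score and 'Low Engagement' on one of the highest,
# so the list should be reordered to match the actual cluster profiles.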
cluster_names = ['Budget Conscious', 'Young Spenders', 'Moderate Income', 'High Value', 'Low Engagement']
cluster_summary['Cluster_Name'] = cluster_names

# Scale age to make it more comparable (multiply by a factor to bring it in similar range)
cluster_summary['Scaled_Age'] = cluster_summary['Age'] * 2

print("Cluster Summary:")
print(cluster_summary)

# Create grouped bar chart
fig = go.Figure()

# Add bars for each metric
fig.add_trace(go.Bar(
    name='Scaled Age',
    x=cluster_summary['Cluster_Name'],
    y=cluster_summary['Scaled_Age'],
    marker_color='#1FB8CD'
))

fig.add_trace(go.Bar(
    name='Avg Income',
    x=cluster_summary['Cluster_Name'],
    y=cluster_summary['Annual Income (k$)'],
    marker_color='#DB4545'
))

fig.add_trace(go.Bar(
    name='Avg Spending',
    x=cluster_summary['Cluster_Name'],
    y=cluster_summary['Spending Score (1-100)'],
    marker_color='#2E8B57'
))

# Update layout
fig.update_layout(
    title='Customer Cluster Comparison',
    xaxis_title='Cluster',
    yaxis_title='Value',
    barmode='group',
    legend=dict(orientation='h', yanchor='bottom', y=1.05, xanchor='center', x=0.5)
)

# Update traces
fig.update_traces(cliponaxis=False)



Cluster Summary:
   Cluster        Age  Annual Income (k$)  Spending Score (1-100)  \
0        0  46.213483           47.719101               41.797753   
1        1  32.454545          108.181818               82.727273   
2        2  24.689655           29.586207               73.655172   
3        3  40.394737           87.000000               18.631579   
4        4  31.787879           76.090909               77.757576   

       Cluster_Name  Scaled_Age  
0  Budget Conscious   92.426966  
1    Young Spenders   64.909091  
2   Moderate Income   49.379310  
3        High Value   80.789474  
4    Low Engagement   63.575758  
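
This cell clusters the raw features, whereas the main analysis clustered standardized ones, which likely contributes to the two sets of cluster averages being different. A possible way to keep preprocessing and clustering together is a scikit-learn pipeline, sketched here on synthetic data. The generated data, the choice of three clusters, and the variable names are all assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic customers: three blobs in (age, income, spending score) space
blob_means = np.array([[25, 30, 75], [45, 85, 20], [55, 55, 50]], dtype=float)
X = np.vstack([rng.normal(loc=m, scale=3.0, size=(30, 3)) for m in blob_means])

# Scaling and clustering are fitted together, so every use shares the same preprocessing
pipeline = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=3, n_init=10, random_state=42),
)
labels = pipeline.fit_predict(X)

# Centers live in scaled space; convert them back to original units for reporting
scaler = pipeline.named_steps['standardscaler']
kmeans = pipeline.named_steps['kmeans']
centers = scaler.inverse_transform(kmeans.cluster_centers_)
print(np.round(centers, 1))
```

Reusing one fitted pipeline for the tables and for every chart would keep the segment definitions identical across the notebook.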
