
# Customer Segmentation Analysis using K-Means Clustering

This notebook demonstrates:
- Data cleaning and preprocessing
- Exploratory data analysis
- Optimal cluster selection using Elbow Method and Silhouette Score
- K-Means clustering for customer segmentation
- Visualization of customer segments


In [1]:

# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score


In [2]:

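# Load dataset
# Note: housing.csv appears to be whitespace-delimited with no header row,
# so the default comma separator reads every line as one string column
# and uses the first record as the column name (see the output below).
data = pd.read_csv("housing.csv")
data.head()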


|   | 0.00632 18.00 2.310 0 0.5380 6.5750 65.20 4.0900 1 296.0 15.30 396.90 4.98 24.00 |
|---|---|
| 0 | 0.02731 0.00 7.070 0 0.4690 6.4210 78... |
| 1 | 0.02729 0.00 7.070 0 0.4690 7.1850 61... |
| 2 | 0.03237 0.00 2.180 0 0.4580 6.9980 45... |
| 3 | 0.06905 0.00 2.180 0 0.4580 7.1470 54... |
| 4 | 0.02985 0.00 2.180 0 0.4580 6.4300 58... |
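
The single wide column above suggests that `housing.csv` is whitespace-delimited and has no header row, so the default comma separator reads each line as one string and promotes the first record to a column name. A minimal sketch of a loader under that assumption follows. The helper `load_whitespace_table` and the generic column names `f0`, `f1`, ... are hypothetical, and the sample text reuses only the one complete record visible in the output.

```python
import io

import pandas as pd


def load_whitespace_table(source):
    """Read a headerless, whitespace-delimited table (assumed file layout)."""
    # sep=r"\s+" splits on runs of whitespace; header=None keeps the first
    # record as data instead of turning it into column names.
    df = pd.read_csv(source, sep=r"\s+", header=None)
    # Placeholder names (hypothetical): the file itself carries none.
    df.columns = [f"f{i}" for i in range(df.shape[1])]
    return df


# The one complete record shown in the notebook output.
sample = (
    " 0.00632  18.00   2.310  0  0.5380  6.5750  65.20  4.0900"
    "   1  296.0  15.30 396.90   4.98  24.00\n"
)
parsed = load_whitespace_table(io.StringIO(sample))
print(parsed.shape)
print(parsed.dtypes.value_counts())
```

If the real file parses the same way, passing `"housing.csv"` instead of the `StringIO` buffer should give a frame of numeric columns that the later cells can scale.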


## Dataset Overview

In [3]:

data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 505 entries, 0 to 504
Data columns (total 1 columns):
 #   Column                                                                                            Non-Null Count  Dtype 
---  ------                                                                                            --------------  ----- 
 0    0.00632  18.00   2.310  0  0.5380  6.5750  65.20  4.0900   1  296.0  15.30 396.90   4.98  24.00  505 non-null    object
dtypes: object(1)
memory usage: 4.1+ KB


In [4]:

data.describe()


|        | 0.00632 18.00 2.310 0 0.5380 6.5750 65.20 4.0900 1 296.0 15.30 396.90 4.98 24.00 |
|--------|---|
| count  | 505 |
| unique | 505 |
| top    | 0.02731 0.00 7.070 0 0.4690 6.4210 78... |
| freq   | 1 |


## Data Cleaning and Preprocessing

In [5]:

# Drop exact duplicate rows
data = data.drop_duplicates()

# Fill missing values in numeric columns with each column's median
data = data.fillna(data.median(numeric_only=True))

# Select all numeric columns for clustering
# (include="number" also covers int32/float32, not only int64/float64)
num_data = data.select_dtypes(include="number")


## Feature Scaling

In [6]:

scaler = StandardScaler()
scaled_data = scaler.fit_transform(num_data)


ValueError: at least one array or dtype is required
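
This `ValueError` follows from the load step: the frame holds one `object` column, so `select_dtypes` returns a frame with zero columns and `StandardScaler` has nothing to fit. A small guard makes that failure easier to read. The sketch below is an illustration, not part of the original notebook; `scale_numeric` and the toy frames are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler


def scale_numeric(df):
    """Standardize numeric columns, failing early if there are none."""
    numeric = df.select_dtypes(include="number")
    if numeric.shape[1] == 0:
        raise ValueError(
            "No numeric columns found; check the separator and header "
            "used when reading the file."
        )
    scaled = StandardScaler().fit_transform(numeric)
    return pd.DataFrame(scaled, columns=numeric.columns, index=numeric.index)


# A frame shaped like the one loaded in this notebook: one text column.
text_only = pd.DataFrame({"raw": ["0.02731 0.00 7.070", "0.02729 0.00 7.070"]})
try:
    scale_numeric(text_only)
    failed = False
except ValueError as exc:
    failed = True
    print(exc)

# A small numeric frame (made-up values) scales to zero mean, unit variance.
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 60.0]})
scaled_toy = scale_numeric(toy)
print(scaled_toy.round(3))
```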

## Elbow Method to Find Optimal K

In [7]:

inertia = []
K = range(1, 11)

for k in K:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(scaled_data)
    inertia.append(km.inertia_)

plt.plot(K, inertia, marker='o')
plt.xlabel("Number of Clusters")
plt.ylabel("Inertia")
plt.title("Elbow Method")
plt.show()


NameError: name 'scaled_data' is not defined
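
The `NameError` is a knock-on effect: the scaling cell raised, so `scaled_data` was never created. To show what the elbow loop computes once scaled data exists, here is a sketch on synthetic data. The use of `make_blobs`, the three chosen centers, and the helper `elbow_inertias` are assumptions for illustration and do not come from the notebook's dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler


def elbow_inertias(X, k_values):
    """Return the K-Means inertia (within-cluster sum of squares) per k."""
    inertias = []
    for k in k_values:
        km = KMeans(n_clusters=k, random_state=42, n_init=10)
        km.fit(X)
        inertias.append(km.inertia_)
    return inertias


# Synthetic stand-in for scaled_data: three well-separated groups.
X_raw, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=42,
)
X = StandardScaler().fit_transform(X_raw)

k_values = list(range(1, 7))
inertias = elbow_inertias(X, k_values)
for k, value in zip(k_values, inertias):
    print(k, round(value, 2))
```

On data like this the inertia drops sharply up to the true number of groups and flattens afterwards; that bend is the "elbow" the plot is meant to reveal.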

## Silhouette Score Analysis

In [None]:

sil_scores = []

for k in range(2, 11):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(scaled_data)
    sil_scores.append(silhouette_score(scaled_data, labels))

plt.plot(range(2,11), sil_scores, marker='o')
plt.xlabel("Number of Clusters")
plt.ylabel("Silhouette Score")
plt.title("Silhouette Score Analysis")
plt.show()
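
Reading the best `k` off the silhouette plot can also be done programmatically by taking the `k` with the highest mean score. The sketch below assumes synthetic data from `make_blobs` as a stand-in for `scaled_data`; `best_k_by_silhouette` is a hypothetical helper.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def best_k_by_silhouette(X, k_values):
    """Return the k with the highest mean silhouette score, plus all scores."""
    scores = {}
    for k in k_values:
        km = KMeans(n_clusters=k, random_state=42, n_init=10)
        labels = km.fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Synthetic stand-in data: three well-separated groups.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=42,
)

best_k, scores = best_k_by_silhouette(X, range(2, 7))
print("best k:", best_k)
for k, score in scores.items():
    print(k, round(score, 3))
```

The silhouette score is only defined for at least two clusters, which is why the range starts at 2 here and in the cell above.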


## K-Means Clustering

In [None]:

# Choose optimal number of clusters (example: k=4)
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
data['Cluster'] = kmeans.fit_predict(scaled_data)

data.head()
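
Once labels are attached, a per-cluster summary helps interpret what each segment represents. The following is a sketch on a made-up toy frame; the column names `spend` and `visits` and all values are hypothetical, and `k=2` is chosen only because the toy data has two obvious groups.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up example data with two obvious groups.
toy = pd.DataFrame(
    {
        "spend": [10.0, 12.0, 11.0, 95.0, 100.0, 98.0],
        "visits": [1.0, 2.0, 1.0, 20.0, 22.0, 21.0],
    }
)

# Scale, cluster, and attach the labels to the original (unscaled) rows.
scaled = StandardScaler().fit_transform(toy)
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
toy["Cluster"] = kmeans.fit_predict(scaled)

# Segment sizes and per-cluster feature means in the original units.
sizes = toy["Cluster"].value_counts().sort_index()
profile = toy.groupby("Cluster")[["spend", "visits"]].mean()
print(sizes)
print(profile)
```

Summarizing the unscaled columns keeps the profile in the original units, which is usually easier to explain than standardized values.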


## Cluster Visualization

In [None]:

# Visualize clusters using first two numerical features
plt.figure(figsize=(8,6))
sns.scatterplot(
    x=num_data.columns[0],
    y=num_data.columns[1],
    hue=data['Cluster'],
    palette='Set2',
    data=data
)
plt.title("Customer Segments")
plt.show()
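
Plotting the first two numeric columns shows only a slice of the feature space, so clusters that separate along other features may look mixed. One alternative, which the notebook itself does not use, is to project the scaled data onto two principal components before plotting. The sketch below assumes synthetic five-feature data; `project_2d` is a hypothetical helper.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def project_2d(X_scaled, labels):
    """Project scaled features onto two principal components for plotting."""
    pca = PCA(n_components=2)
    coords = pca.fit_transform(X_scaled)
    projected = pd.DataFrame(coords, columns=["PC1", "PC2"])
    projected["Cluster"] = labels
    return projected, pca.explained_variance_ratio_


# Synthetic stand-in data with five features.
X_raw, _ = make_blobs(n_samples=200, n_features=5, centers=3, random_state=42)
X_scaled = StandardScaler().fit_transform(X_raw)
labels = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X_scaled)

projected, explained = project_2d(X_scaled, labels)
print(projected.head())
print("variance explained by PC1 and PC2:", explained.round(3))
```

The resulting frame can be passed to `sns.scatterplot(data=projected, x="PC1", y="PC2", hue="Cluster", palette="Set2")` in the same way as the cell above.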



## Conclusion

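- The notebook sets up a **K-Means clustering** workflow: cleaning, scaling, choosing the number of clusters, fitting, and plotting
- In the run recorded above, `housing.csv` was read as a single text column, so feature scaling failed and the **Elbow Method** and **Silhouette Score** cells did not produce results
- Once the file loads as numeric columns, the cluster plot and per-cluster summaries can be used to interpret the resulting segments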
