# Customer Segmentation Using K-Means Clustering
This report performs customer segmentation using **K-Means Clustering**, a widely used unsupervised learning algorithm. It aims to group similar customers together based on their demographic and behavioral attributes. This type of analysis helps businesses tailor their services to specific customer segments.

## Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import warnings
warnings.filterwarnings('ignore')

## Load Dataset

In [None]:
train = pd.read_csv('Train.csv')
test = pd.read_csv('Test.csv')
train.head()

## Data Cleaning and Preprocessing
Before performing clustering, we clean and preprocess the dataset:
- Drop the `ID` and target (`Segmentation`) columns
- Impute missing values (mean for numeric features, most frequent value for categorical features)
- Convert categorical variables using One-Hot Encoding
- Scale numeric features to have zero mean and unit variance

In [None]:
data = train.drop(columns=['ID', 'Segmentation'])
categorical_features = data.select_dtypes(include='object').columns.tolist()
numerical_features = data.select_dtypes(exclude='object').columns.tolist()

numeric_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    # `sparse` was renamed to `sparse_output` in scikit-learn 1.2 and removed in 1.4
    ('encoder', OneHotEncoder(drop='first', sparse_output=False))
])

preprocessor = ColumnTransformer([
    ('num', numeric_pipeline, numerical_features),
    ('cat', categorical_pipeline, categorical_features)
])

processed_data = preprocessor.fit_transform(data)

## Exploratory Data Analysis: Univariate Analysis
We examine each feature individually to understand its distribution.

In [None]:
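# Size the grid from the number of features instead of hard-coding 3x3
n_cols = 3
n_rows = (len(data.columns) + n_cols - 1) // n_cols
fig, axes = plt.subplots(n_rows, n_cols, figsize=(18, 4 * n_rows))
axes = axes.flatten()
for i, col in enumerate(data.columns):
    if data[col].dtype == 'object':
        # Categorical feature: bar chart of category counts
        sns.countplot(y=col, data=data, ax=axes[i])
    else:
        # Numeric feature: histogram with a kernel density estimate
        sns.histplot(data[col].dropna(), kde=True, ax=axes[i])
    axes[i].set_title(f'Distribution of {col}')
# Hide any unused axes in the last row
for ax in axes[len(data.columns):]:
    ax.set_visible(False)
plt.tight_layout()
plt.show()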

## Determining Optimal Clusters: Elbow Method
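The Elbow Method helps identify a suitable number of clusters by plotting the within-cluster sum of squares (inertia) against the number of clusters k. Inertia always decreases as k grows, so we look for the "elbow": the point after which adding another cluster yields only a small further reduction.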

In [None]:
inertia = []
for k in range(1, 11):
    model = KMeans(n_clusters=k, random_state=42, n_init=10)
    model.fit(processed_data)
    inertia.append(model.inertia_)

plt.figure(figsize=(8, 5))
plt.plot(range(1, 11), inertia, marker='o')
plt.title('Elbow Method for Optimal k')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.grid(True)
plt.show()
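
For reference, inertia is the sum of squared Euclidean distances from each sample to the centroid of its assigned cluster. The sketch below recomputes it by hand and compares the result with `inertia_`. It uses synthetic data from `make_blobs` as a stand-in for `processed_data`, so the numbers say nothing about the customer dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the preprocessed customer data (assumption)
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)

model = KMeans(n_clusters=4, random_state=42, n_init=10).fit(X)

# Look up the centroid each sample was assigned to
assigned_centroids = model.cluster_centers_[model.labels_]

# Sum of squared distances between samples and their assigned centroids
manual_inertia = np.sum((X - assigned_centroids) ** 2)

print(manual_inertia, model.inertia_)
```

The two values should agree up to floating-point error, which confirms what the y-axis of the elbow plot measures.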

## K-Means Clustering and PCA Visualization
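Based on the elbow plot, we choose 4 clusters. K-Means is fitted on the full preprocessed feature set; PCA (Principal Component Analysis) is used only to project the data onto two components so the clusters can be plotted.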

In [None]:
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
labels = kmeans.fit_predict(processed_data)

pca = PCA(n_components=2)
pca_data = pca.fit_transform(processed_data)

plt.figure(figsize=(10, 6))
plt.scatter(pca_data[:, 0], pca_data[:, 1], c=labels, cmap='viridis', s=50)
plt.title('Customer Segments Visualized with PCA')
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.colorbar(label='Cluster')
plt.show()
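
A two-component projection discards part of the variance, so clusters that overlap in the plot may still be separated in the full feature space. One way to gauge how much the plot can be trusted is to check `explained_variance_ratio_`. The sketch below does this on synthetic six-feature data, which is an assumption standing in for `processed_data`.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic stand-in with more than two features (assumption)
X, _ = make_blobs(n_samples=300, n_features=6, centers=4, random_state=42)

pca_check = PCA(n_components=2)
pca_check.fit(X)

# Fraction of total variance captured by each of the two components
ratios = pca_check.explained_variance_ratio_
retained = ratios.sum()

print(f"Per component: {np.round(ratios, 3)}")
print(f"Variance retained by 2 components: {retained:.1%}")
```

If the retained share is low, the scatter plot should be read as a rough view rather than as evidence of how well the clusters separate.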

## Conclusion
K-Means clustering grouped customers into four segments based on their demographic and behavioral attributes. This segmentation can guide targeted marketing strategies and personalized customer service. Because the analysis did not validate the segments, further work should profile each cluster and compare the segments with actual business outcomes.
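
As a starting point for cluster profiling, the sketch below attaches cluster labels to a small hypothetical frame and summarizes each segment. The column names `Age` and `Spending_Score`, the values, and the labels are illustrative assumptions. In the notebook, the `labels` array from K-Means would be assigned to a copy of `data` instead.

```python
import pandas as pd

# Hypothetical customers with pre-assigned cluster labels
toy = pd.DataFrame({
    "Age": [22, 25, 61, 58, 40, 43],
    "Spending_Score": ["Low", "Low", "High", "High", "Average", "Average"],
    "cluster": [0, 0, 1, 1, 2, 2],
})

# Number of customers per segment
sizes = toy["cluster"].value_counts().sort_index()

# Numeric profile: summary statistics per segment
age_profile = toy.groupby("cluster")["Age"].agg(["mean", "min", "max"])

# Categorical profile: share of each category within a segment
spending_profile = pd.crosstab(
    toy["cluster"], toy["Spending_Score"], normalize="index"
)

print(sizes)
print(age_profile)
print(spending_profile)
```

Comparing these per-segment summaries against the overall averages is a simple way to describe what distinguishes each segment.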