**Question 1**: Difference between K-Means and Hierarchical Clustering (with use cases)       
**Answer**: **K-Means Clustering**

- Partitions data into a fixed number (K) of clusters

- Uses centroids and minimizes within-cluster variance

- Fast and scalable for large datasets

- Sensitive to initial centroids and outliers

**Use case**:
Customer segmentation in large retail datasets where the number of segments is roughly known in advance.

**Hierarchical Clustering**

- Builds clusters step by step (bottom-up or top-down)

- Does not require predefining the number of clusters

- Produces a dendrogram showing cluster hierarchy

- Computationally expensive for large datasets (standard agglomerative algorithms need at least O(n²) time and memory)

**Use case**:
Genetic or document similarity analysis where understanding relationships between clusters is important.
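
The contrast above can be sketched in a few lines. This is a minimal illustration, assuming SciPy is installed alongside scikit-learn; the blob data, the Ward linkage, and the cut levels are illustrative choices, not part of the original answer.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data (illustrative values, not from the text)
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.6, random_state=0)

# K-Means: K must be chosen before fitting
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Hierarchical (agglomerative, Ward linkage): build the full merge tree once...
Z = linkage(X, method="ward")

# ...then cut it at any number of clusters afterwards without refitting
for k in (2, 3, 4):
    hier_labels = fcluster(Z, t=k, criterion="maxclust")
    print(k, "clusters -> sizes", np.bincount(hier_labels)[1:])
```

The linkage matrix `Z` is the same structure that `scipy.cluster.hierarchy.dendrogram` draws, which is why one hierarchical fit can answer "what if we used 2, 3, or 4 clusters?" while K-Means has to be refit for each K.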

**Question 2**: Purpose of the Silhouette Score             

**Answer**: The Silhouette Score measures how well a data point fits within its assigned cluster compared to other clusters.

- Value ranges from −1 to +1

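- Higher value → better clustering; values near 0 indicate overlapping clusters, and negative values suggest points assigned to the wrong cluster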

- Considers both:

  - **Cohesion** (within-cluster distance)

  - **Separation** (between-cluster distance)

Why it matters:   
Helps evaluate clustering quality without ground truth labels.
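
As a rough sketch of how this is computed in practice: for each point, with a the mean distance to points in its own cluster and b the mean distance to points in the nearest other cluster, the silhouette value is s = (b − a) / max(a, b). The example below assumes scikit-learn's `silhouette_score` and `silhouette_samples`; the dataset and K are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

# Mean silhouette over all points: a single summary value in [-1, 1]
score = silhouette_score(X, labels)

# Per-point silhouette values, useful for spotting poorly assigned points
per_point = silhouette_samples(X, labels)

print(f"mean silhouette: {score:.3f}")
print(f"points with negative silhouette: {(per_point < 0).sum()}")
```

No true labels are passed to either function; only the data and the cluster assignments are needed.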

**Question 3**: Core parameters of DBSCAN and their influence


**Answer**: **eps (epsilon)**              
- Radius around a data point

- Controls neighborhood size

- Too small → many points marked as noise

- Too large → clusters merge

**min_samples**

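- Minimum number of points within eps of a point (in scikit-learn, counting the point itself) for it to be a core point of a dense region

- Higher value → stricter clusters, more points labelled as noise

- Lower value → more and smaller clusters, with sparse noise more likely to be absorbed into a cluster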

Together, eps and min_samples define the density a region needs to count as a cluster, not the cluster's shape or size.
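
The effect of eps can be checked with a small sweep. This is a sketch on `make_moons` data; the three eps values are arbitrary picks meant to land in the "too small", "reasonable", and "too large" regimes for this particular dataset.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.1, random_state=42)

results = {}
for eps in (0.05, 0.3, 1.5):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    # Noise points carry the label -1 and are not counted as a cluster
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    results[eps] = (n_clusters, n_noise)
    print(f"eps={eps}: clusters={n_clusters}, noise points={n_noise}")
```

With the smallest eps most points should end up as noise, and with the largest the two moons should merge into one cluster.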

**Question 4**: Why feature scaling is important in clustering

**Answer**:

- Distance-based algorithms (K-Means, DBSCAN) rely on a distance metric, typically Euclidean distance

- Features with larger ranges dominate distance calculations

- Scaling ensures equal contribution from all features

Without scaling:

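- Clusters become biased toward the features with the largest ranges

- DBSCAN's density estimates become unreliable, because a single eps cannot suit features on very different scales

- K-Means centroids are pulled along the large-range features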
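
A small synthetic experiment makes the point concrete. The data below is made up for illustration: one feature separates two groups but lives on a small scale, while a second feature is pure noise on a scale of thousands. Agreement with the known groups is measured with the adjusted Rand index (ARI).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200
true_groups = np.repeat([0, 1], n // 2)

# Feature 0 separates the two groups but lives on a small scale (around 0 vs 1)
informative = true_groups + rng.normal(scale=0.1, size=n)
# Feature 1 is pure noise on a much larger scale
noisy = rng.normal(scale=1000.0, size=n)
X = np.column_stack([informative, noisy])

def kmeans_ari(data):
    """Cluster into 2 groups and compare with the known grouping."""
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)
    return adjusted_rand_score(true_groups, labels)

ari_raw = kmeans_ari(X)
ari_scaled = kmeans_ari(StandardScaler().fit_transform(X))
print(f"ARI without scaling: {ari_raw:.2f}")
print(f"ARI with scaling:    {ari_scaled:.2f}")
```

Without scaling, K-Means effectively splits on the noisy feature alone; after standardization the informative feature can drive the split.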

**Question 5**: Elbow Method in K-Means

**Answer**: The Elbow Method plots:

- Number of clusters (K) vs

- Inertia (within-cluster sum of squares)

As K increases:

- Inertia decreases

- At some point, improvement slows → elbow point

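This elbow suggests a reasonable number of clusters: beyond it, adding clusters yields little further reduction in inertia. The elbow is often ambiguous, so it is best treated as a heuristic and cross-checked with the Silhouette Score.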
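
The procedure can be sketched as follows, using `KMeans.inertia_` for the within-cluster sum of squares. The dataset and the range of K are illustrative; the values are printed rather than plotted to keep the example short.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

k_values = list(range(1, 9))
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(model.inertia_)

# Show the drop in inertia gained by each extra cluster; the elbow is where
# these drops become small relative to the earlier ones
for k, prev, cur in zip(k_values[1:], inertias, inertias[1:]):
    print(f"K={k}: inertia={cur:.1f}, drop from K-1={prev - cur:.1f}")
```

To see the curve itself, `plt.plot(k_values, inertias, marker="o")` with matplotlib gives the usual elbow plot.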

**Question 6**: K-Means on make_blobs data (with visualization)

In [None]:
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

kmeans = KMeans(n_clusters=4, random_state=42)
labels = kmeans.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.scatter(kmeans.cluster_centers_[:, 0],
            kmeans.cluster_centers_[:, 1],
            marker='X', s=200)
plt.title("K-Means Clustering on make_blobs")
plt.show()



**Question 7**: DBSCAN on Wine dataset (with scaling)

In [None]:
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
import numpy as np

X, _ = load_wine(return_X_y=True)

X_scaled = StandardScaler().fit_transform(X)

dbscan = DBSCAN(eps=1.5, min_samples=5)
labels = dbscan.fit_predict(X_scaled)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("Number of clusters (excluding noise):", n_clusters)




**Question 8**: DBSCAN on make_moons data (outliers highlighted)

In [None]:
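import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.1, random_state=42)

dbscan = DBSCAN(eps=0.3, min_samples=5)
labels = dbscan.fit_predict(X)

# DBSCAN labels noise points (outliers) as -1
noise = labels == -1

# Plot clustered points by label, then overlay the outliers as red crosses
plt.scatter(X[~noise, 0], X[~noise, 1], c=labels[~noise])
plt.scatter(X[noise, 0], X[noise, 1], c='red', marker='x', label='Outliers (label -1)')
plt.title("DBSCAN on make_moons (Noise = -1)")
plt.legend()
plt.show()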


**Question 9**: PCA + Agglomerative Clustering on Wine dataset

In [None]:
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering

X, _ = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

agg = AgglomerativeClustering(n_clusters=3)
labels = agg.fit_predict(X_pca)

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels)
plt.title("Agglomerative Clustering on PCA-reduced Wine Data")
plt.show()


**Question 10**: Real-world e-commerce clustering workflow

**Answer**:

**Step 1**: Algorithm choice

- K-Means for large-scale customer segmentation

- DBSCAN for detecting niche or anomalous customers

- Hierarchical clustering for exploratory analysis

**Step 2**: Data preprocessing

- Handle missing values (mean / median / mode)

- Encode categorical variables

- Apply StandardScaler

- Remove extreme outliers

**Step 3**: Choosing number of clusters

- Elbow Method

- Silhouette Score

- Business interpretability (marketing relevance)

**Step 4**: Business benefits

- Personalized promotions

- Improved customer retention

- Better product recommendations

- Higher conversion rates

In [None]:
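from sklearn.pipeline import Pipeline

# The pipeline scales the data itself, so pass it the raw features
# (fitting on X_scaled would standardize twice).
# X, the Wine features loaded above, stands in for a customer feature matrix.
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('kmeans', KMeans(n_clusters=5, random_state=42))
])

pipeline.fit(X)
print("Customer clustering completed")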
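
The preprocessing in Step 2 can also be folded into the pipeline. The sketch below assumes pandas is available and uses a made-up customer table; the column names, values, and the choice of two segments are all hypothetical. It covers imputation, encoding, and scaling, and leaves outlier removal out.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical customer table; column names and values are made up
customers = pd.DataFrame({
    "annual_spend": [120.0, 950.0, np.nan, 40.0, 2100.0, 310.0, 1800.0, 75.0],
    "orders_per_year": [3, 24, 10, 1, 40, np.nan, 35, 2],
    "favourite_category": ["books", "electronics", "fashion", "books",
                           "electronics", np.nan, "electronics", "fashion"],
})

numeric_cols = ["annual_spend", "orders_per_year"]
categorical_cols = ["favourite_category"]

preprocess = ColumnTransformer([
    # Numeric columns: fill gaps with the median, then standardize
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    # Categorical columns: fill gaps with the mode, then one-hot encode
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]),
     categorical_cols),
])

segmenter = Pipeline([
    ("preprocess", preprocess),
    ("kmeans", KMeans(n_clusters=2, n_init=10, random_state=42)),
])

# Fit the whole workflow and attach a segment label to each customer
customers["segment"] = segmenter.fit_predict(customers)
print(customers[["annual_spend", "orders_per_year", "segment"]])
```

Keeping imputation, encoding, and scaling inside one pipeline means new customers passed to `segmenter.predict` go through exactly the same transformations as the training data.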
