## Question 1: What is the difference between K-Means and Hierarchical Clustering? Provide a use case for each.

Answer:

K-Means Clustering is a partition-based clustering algorithm. It divides the dataset into a fixed number K of clusters, assigning each data point to the nearest cluster center (centroid) and updating the centroids until the assignments stabilize. It works best when clusters are roughly spherical, similar in size, and well separated, and when K is known or can be estimated in advance.

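Hierarchical Clustering builds a tree of nested clusters (a dendrogram). It does not require specifying the number of clusters up front; the tree can be cut at any level to obtain the desired number. It can be Agglomerative (bottom-up, repeatedly merging the closest clusters) or Divisive (top-down, repeatedly splitting clusters). It is useful when we want to understand how clusters relate to one another.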
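
As a rough illustration of cutting one dendrogram at different levels, here is a minimal sketch using SciPy's `linkage` and `fcluster`. The synthetic data, the Ward linkage, and the variable names are assumptions chosen for the example, not part of the original answer.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

# Small synthetic dataset (assumed, for illustration only)
X, _ = make_blobs(n_samples=30, centers=3, random_state=0)

# Build the full merge tree once; no number of clusters is needed at this stage
Z = linkage(X, method="ward")

# Cut the same tree at two different levels
labels_3 = fcluster(Z, t=3, criterion="maxclust")
labels_2 = fcluster(Z, t=2, criterion="maxclust")

print("Linkage matrix shape:", Z.shape)  # one row per merge
print("Clusters when cut into 3:", len(np.unique(labels_3)))
print("Clusters when cut into 2:", len(np.unique(labels_2)))
```

The same `Z` can also be passed to `scipy.cluster.hierarchy.dendrogram` to draw the tree.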

## Key Differences

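* K-Means needs K in advance; Hierarchical does not, since the number of clusters is chosen afterwards by cutting the dendrogram.

* K-Means is faster for large datasets; standard agglomerative clustering compares all pairs of points, so its time and memory cost grow at least quadratically with the number of samples.

* K-Means works better for round (convex) clusters; Hierarchical can capture more complex shapes, depending on the linkage used (for example, single linkage).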
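
The last point can be sketched on two concentric rings, a shape that is not round in the K-Means sense. This is an illustrative comparison under assumed settings (synthetic `make_circles` data, single linkage, agreement with the true rings measured by the adjusted Rand index), not a general benchmark.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_circles
from sklearn.metrics import adjusted_rand_score

# Two concentric rings (synthetic, illustrative)
X, y_true = make_circles(n_samples=500, factor=0.4, noise=0.03, random_state=0)

# K-Means splits the plane around two centroids
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Single linkage merges nearest neighbors, so it can follow each ring
single_labels = AgglomerativeClustering(n_clusters=2, linkage="single").fit_predict(X)

# Adjusted Rand index: 1.0 means perfect agreement with the true rings
ari_kmeans = adjusted_rand_score(y_true, kmeans_labels)
ari_single = adjusted_rand_score(y_true, single_labels)
print("K-Means ARI:", round(ari_kmeans, 3))
print("Single-linkage ARI:", round(ari_single, 3))
```

Single linkage is sensitive to noise that bridges clusters, so this advantage should not be expected on every dataset.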

## Use Cases

* K-Means Use Case: Customer segmentation in retail where clusters are well separated (e.g., low spenders, medium spenders, high spenders).

* Hierarchical Use Case: Grouping documents/news articles based on similarity to explore topic hierarchy.


___
## Question 2: Explain the purpose of the Silhouette Score in evaluating clustering algorithms.

Answer:
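The Silhouette Score evaluates how well clustering has been performed without needing ground-truth labels. For each point it compares cohesion (how close the point is to the other members of its own cluster) with separation (how far it is from the nearest neighboring cluster). If a is the mean distance from the point to the other points in its cluster and b is the mean distance to the points in the nearest other cluster, the silhouette value is s = (b - a) / max(a, b).

* A score close to +1 means the data point is well matched to its own cluster and far from other clusters.

* A score near 0 means the point lies on or near the boundary between two clusters.

* A negative score means the point is, on average, closer to a neighboring cluster than to its own, so it is likely in the wrong cluster.

The overall Silhouette Score is the mean over all points and ranges from -1 to +1:

* close to +1 → compact, well-separated clusters

* around 0 → overlapping clusters

* close to -1 → many points assigned to the wrong cluster

It helps in comparing clustering models and in choosing the number of clusters: the score is computed for several candidate values of K, and the value with the highest mean score is preferred.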
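
A minimal sketch of both uses, assuming synthetic `make_blobs` data and K-Means as the clustering model: `silhouette_samples` returns the per-point values and `silhouette_score` returns their mean.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Synthetic data (assumed, for illustration only)
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

# Per-point silhouette values and their mean (the overall score)
per_point = silhouette_samples(X, labels)
overall = silhouette_score(X, labels)
print("Overall silhouette score:", round(float(overall), 3))
print("Points with a negative value:", int((per_point < 0).sum()))

# Compare the mean score across candidate values of K
for k in range(2, 7):
    candidate = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(f"K={k}: silhouette={silhouette_score(X, candidate):.3f}")
```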

___


## Question 3: What are the core parameters of DBSCAN, and how do they influence the clustering process?

Answer:

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) forms clusters based on density.

**Core Parameters**

**1. eps (epsilon):**
It defines the maximum distance between two points for them to be considered neighbors.

* Small eps → too many points become noise

* Large eps → clusters merge together

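**2. min_samples:**
It defines the minimum number of points that must lie within eps of a point (in scikit-learn, counting the point itself) for that point to be a core point, i.e. part of a dense region.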

* High min_samples → fewer clusters, more noise

* Low min_samples → more clusters, less noise

**Influence on clustering**

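* DBSCAN identifies core points (at least min_samples neighbors within eps), border points (within eps of a core point but not core themselves), and noise points (neither).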

* It is useful for detecting clusters of arbitrary shapes and also identifies outliers naturally.
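
The effect of eps can be sketched by running DBSCAN several times on the same data and counting clusters, core points, border points, and noise points. The dataset, the eps values, and the helper name `summarize` are assumptions made for this illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Synthetic data (assumed, for illustration only)
X, _ = make_moons(n_samples=200, noise=0.1, random_state=42)

def summarize(eps, min_samples=5):
    """Hypothetical helper: fit DBSCAN and count clusters and point types."""
    model = DBSCAN(eps=eps, min_samples=min_samples).fit(X)
    labels = model.labels_
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    n_core = len(model.core_sample_indices_)
    n_border = len(labels) - n_core - n_noise  # clustered but not core
    return n_clusters, n_core, n_border, n_noise

eps_values = [0.05, 0.15, 0.3, 3.0]
results = {eps: summarize(eps) for eps in eps_values}
for eps, (n_clusters, n_core, n_border, n_noise) in results.items():
    print(f"eps={eps}: clusters={n_clusters}, core={n_core}, "
          f"border={n_border}, noise={n_noise}")
```

With min_samples held fixed, increasing eps can only reduce the number of noise points, and a very large eps merges everything into a single cluster.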



____

## Question 4: Why is feature scaling important when applying clustering algorithms like K-Means and DBSCAN?

Answer:

Feature scaling is important because clustering algorithms like K-Means and DBSCAN use distance calculations (usually Euclidean distance).

If one feature has a much larger range (for example, income 0–100,000) than another (for example, age 0–60), then income will dominate the distance calculation.

**Why scaling matters**

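* It ensures all features contribute comparably to the distance

* It yields clusters that reflect all features, not just the one with the largest numeric range

* It prevents bias due to large-scale features

* It makes DBSCAN’s eps meaningful, since a single radius is applied across every feature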

**Common scaling methods:**

* StandardScaler (mean=0, std=1)

* MinMaxScaler (0 to 1)
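
To make the dominance effect concrete, the sketch below finds the nearest neighbor of one customer before and after standardization. The five (age, income) rows and the helper name `nearest_to_first` are made up for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical customers: [age, income]
X = np.array([
    [25, 50_000],  # customer 0 (reference)
    [26, 58_000],  # customer 1: almost the same age, income differs by 8,000
    [60, 51_000],  # customer 2: 35 years older, income differs by 1,000
    [40, 90_000],  # customer 3
    [50, 20_000],  # customer 4
], dtype=float)

def nearest_to_first(data):
    """Return the index of the row closest (Euclidean) to row 0."""
    distances = np.linalg.norm(data - data[0], axis=1)
    distances[0] = np.inf  # ignore the reference row itself
    return int(np.argmin(distances))

raw_nearest = nearest_to_first(X)
scaled_nearest = nearest_to_first(StandardScaler().fit_transform(X))

print("Nearest to customer 0 on raw data:", raw_nearest)        # 2
print("Nearest to customer 0 after scaling:", scaled_nearest)   # 1
```

On the raw data the 35-year age gap is almost invisible next to the income differences; after scaling, both features are measured in standard deviations and the similar-age customer becomes the nearest neighbor.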

____



## Question 5: What is the Elbow Method in K-Means clustering and how does it help determine the optimal number of clusters?

Answer:

The Elbow Method is used to find the best value of K in K-Means clustering.

It plots:

* K (number of clusters) on the x-axis

* WCSS (Within-Cluster Sum of Squares, available as the inertia_ attribute of scikit-learn's KMeans) on the y-axis

WCSS decreases as K increases, but after a certain point, the improvement becomes very small. That point forms a bend like an elbow.

**How it helps**

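* The “elbow point” suggests a good K: the value beyond which adding clusters yields only small reductions in WCSS

* It avoids selecting too many clusters

* It balances clustering quality against model complexity

* When the bend is not clear-cut, the choice can be confirmed with the Silhouette Score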
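
The quantities behind the elbow plot can be computed as in the sketch below, which assumes synthetic `make_blobs` data and a candidate range of K from 1 to 8. It prints the values instead of plotting them; `plt.plot(ks, wcss, marker="o")` would draw the curve.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data (assumed, for illustration only)
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

ks = list(range(1, 9))
wcss = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(model.inertia_)  # within-cluster sum of squares for this K

# The elbow is where the reduction from one K to the next becomes small
for k, prev, curr in zip(ks[1:], wcss[:-1], wcss[1:]):
    print(f"K={k}: WCSS={curr:.1f}, reduction vs K-1 = {prev - curr:.1f}")
```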

In [None]:
#Question 6: Generate synthetic data using make_blobs(n_samples=300, centers=4), apply KMeans clustering, and visualize the results with cluster centers. (Include your Python code and output in the code box below.)

#Answer:
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Generate synthetic data
X, y = make_blobs(n_samples=300, centers=4, random_state=42)

# Apply KMeans
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)  # explicit n_init keeps results consistent across scikit-learn versions
labels = kmeans.fit_predict(X)

# Plot results
plt.figure(figsize=(7,5))
plt.scatter(X[:, 0], X[:, 1], c=labels, s=40)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c='red', s=200, marker='X', label='Centers')
plt.title("KMeans Clustering on make_blobs Data")
plt.legend()
plt.show()


In [None]:
#Question 7: Load the Wine dataset, apply StandardScaler , and then train a DBSCAN model. Print the number of clusters found (excluding noise). (Include your Python code and output in the code box below.)

#Answer:
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

# Load dataset
wine = load_wine()
X = wine.data

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

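# Train DBSCAN (eps and min_samples are data-dependent; in 13 scaled
# dimensions an eps this small may leave many points labelled as noise)
dbscan = DBSCAN(eps=1.5, min_samples=5)
labels = dbscan.fit_predict(X_scaled)

# Count clusters excluding noise (label -1)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int((labels == -1).sum())
print("Number of clusters found (excluding noise):", n_clusters)
print("Number of noise points:", n_noise)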


In [None]:
#Question 8: Generate moon-shaped synthetic data using make_moons(n_samples=200, noise=0.1), apply DBSCAN, and highlight the outliers in the plot. (Include your Python code and output in the code box below.)

#Answer:
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

# Generate moon-shaped data
X, y = make_moons(n_samples=200, noise=0.1, random_state=42)

# Apply DBSCAN
dbscan = DBSCAN(eps=0.3, min_samples=5)
labels = dbscan.fit_predict(X)

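# Plot clustered points, coloured by cluster label
outliers = labels == -1
plt.figure(figsize=(7,5))
plt.scatter(X[~outliers, 0], X[~outliers, 1], c=labels[~outliers], s=40)

# Highlight outliers (noise points) with a colour and marker that the
# default colormap does not use for clusters (its top colour is yellow)
plt.scatter(X[outliers, 0], X[outliers, 1], c='red', marker='x', s=80, label='Outliers')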

plt.title("DBSCAN on make_moons Data with Outliers")
plt.legend()
plt.show()


In [None]:
#Question 9: Load the Wine dataset, reduce it to 2D using PCA, then apply Agglomerative Clustering and visualize the result in 2D with a scatter plot. (Include your Python code and output in the code box below.)

#Answer:
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering

# Load dataset
wine = load_wine()
X = wine.data

# Scale data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Reduce to 2D using PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Apply Agglomerative Clustering (3 clusters, matching the three wine classes in the dataset)
agg = AgglomerativeClustering(n_clusters=3)
labels = agg.fit_predict(X_pca)

# Plot
plt.figure(figsize=(7,5))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, s=40)
plt.title("Agglomerative Clustering on Wine Dataset (PCA Reduced)")
plt.xlabel("PCA 1")
plt.ylabel("PCA 2")
plt.show()


## Question 10: You are working as a data analyst at an e-commerce company. The marketing team wants to segment customers based on their purchasing behavior to run targeted promotions. The dataset contains customer demographics and their product purchase history across categories. Describe your real-world data science workflow using clustering:

* Which clustering algorithm(s) would you use and why?

* How would you preprocess the data (missing values, scaling)?

* How would you determine the number of clusters?

* How would the marketing team benefit from your clustering analysis?

(Include your Python code and output in the code box below.)

Answer:


## Workflow

**1. Business Understanding**

The goal is to group customers into meaningful segments based on purchase behavior and demographics.

**2. Data Cleaning & Preprocessing**

* Handle missing values using:

  * mean/median for numerical data

  * mode for categorical data

* Convert categorical variables using one-hot encoding (OneHotEncoder)

* Scale numerical features using StandardScaler (important for distance-based clustering)
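
These preprocessing steps could be combined into a single scikit-learn pipeline, roughly as sketched below. The column names (`age`, `income`, `region`) and the tiny DataFrame are hypothetical stand-ins for the real customer data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical customer table with missing values
df = pd.DataFrame({
    "age": [22, np.nan, 45, 35],
    "income": [25000, 30000, np.nan, 60000],
    "region": pd.Series(["north", "south", np.nan, "north"], dtype=object),
})
numeric_cols = ["age", "income"]
categorical_cols = ["region"]

# Numerical columns: fill gaps with the median, then standardize
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: fill gaps with the mode, then one-hot encode
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer(
    [("num", numeric_steps, numeric_cols),
     ("cat", categorical_steps, categorical_cols)],
    sparse_threshold=0,  # always return a dense array
)

X_ready = preprocess.fit_transform(df)
print("Prepared matrix shape:", X_ready.shape)
```

Keeping the steps in one transformer means the same imputation values and scaling parameters can later be applied to new customers with `preprocess.transform`.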

**3. Algorithm Selection**

* K-Means: good for large customer datasets, fast, easy to interpret

* DBSCAN: useful for detecting outliers (fraud/unusual customers)

* For this use case, I will mainly use K-Means for segmentation.

**4. Selecting the Number of Clusters**

* Use the Elbow Method to shortlist candidate values of K

* Use the Silhouette Score to confirm which candidate gives the best clustering quality

**5. Marketing Benefits**

* Create personalized campaigns for each segment

* Identify high-value customers for premium offers

* Identify discount seekers and target them with coupons

* Improve customer retention and increase revenue
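
One way to hand segments over to the marketing team is a per-cluster profile table. The sketch below assumes a small made-up dataset and two segments; the column names and the `segment` label are illustrative only.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical customers: three low spenders and three high spenders
df = pd.DataFrame({
    "income": [25000, 30000, 28000, 80000, 90000, 100000],
    "electronics_spend": [2000, 2500, 1800, 12000, 15000, 16000],
    "grocery_spend": [5000, 4500, 5200, 3000, 2000, 1800],
})

# Cluster on scaled features, then attach the labels to the original table
X_scaled = StandardScaler().fit_transform(df)
df["segment"] = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_scaled)

# Profile each segment in the original units so it is easy to interpret
profile = df.groupby("segment").mean().round(0)
profile["n_customers"] = df["segment"].value_counts().sort_index()
print(profile)
```

Reading the averages in the original units (rather than scaled values) makes it easier to name each segment and decide which promotion fits it.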

In [None]:
#Python Code Example

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Example dataset (dummy customer purchase behavior)
data = {
    "age": [22, 25, 45, 35, 52, 23, 40, 60],
    "income": [25000, 30000, 80000, 60000, 90000, 28000, 65000, 100000],
    "electronics_spend": [2000, 2500, 12000, 9000, 15000, 1800, 9500, 16000],
    "grocery_spend": [5000, 4500, 3000, 3500, 2000, 5200, 3200, 1800]
}

df = pd.DataFrame(data)

# Scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df)

# Apply KMeans
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)  # explicit n_init keeps results consistent across scikit-learn versions
labels = kmeans.fit_predict(X_scaled)

# Evaluate
sil_score = silhouette_score(X_scaled, labels)

print("Cluster Labels:", labels)
print("Silhouette Score:", sil_score)
