1️ What is unsupervised learning in the context of machine learning?

Unsupervised learning is a type of machine learning where the model is trained on data without labeled outputs. The goal is to discover hidden patterns, structures, or groupings within the data. Common tasks include clustering (grouping similar data points), dimensionality reduction, and anomaly detection. Unlike supervised learning, there is no ground truth label to guide the learning process.

2️ How does the K-Means clustering algorithm work?

K-Means partitions data into K clusters by minimizing the sum of squared Euclidean distances between each data point and the centroid of its assigned cluster.
Steps:

Initialize K centroids randomly.

Assign each data point to the nearest centroid.

Recompute centroids as the mean of assigned points.

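Repeat the assignment and centroid update steps until convergence (the centroids or assignments stop changing, or an iteration limit is reached).
Each iteration reduces the within-cluster sum of squared distances, but the algorithm converges to a local minimum that depends on the initial centroids.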
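The steps above can be sketched in a few lines of NumPy. This is an illustrative sketch only, not scikit-learn's implementation; the function name kmeans_sketch, the toy data, and the choice to keep the old centroid when a cluster becomes empty are all assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    """Plain K-Means iterations; illustrative only (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its points
        # (assumption: an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy data: two well-separated groups of three points each.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centroids = kmeans_sketch(X, k=2)
print(labels)
print(centroids)
```

On data this cleanly separated the two groups end up in different clusters; on harder data the result can depend on the random starting centroids.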

3️ Explain the concept of a dendrogram in hierarchical clustering.

A dendrogram is a tree-like diagram used to visualize the hierarchical relationships between clusters. It shows how clusters are merged (or split) at different distance thresholds. The vertical axis represents distance (dissimilarity), and the horizontal axis shows data points or clusters. Cutting the dendrogram at a certain height determines the number of clusters.
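As a rough illustration of "cutting" the tree, SciPy's fcluster can turn a linkage matrix into flat cluster labels at a chosen height. The toy data and the cut height of 3.0 are assumptions picked for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 1-D data: two tight groups that are far apart.
X = np.array([[0.0], [0.5], [1.0], [10.0], [10.5], [11.0]])

# Merge history; column 2 of Z holds the height (distance) of each merge.
Z = linkage(X, method="average")

# Cut at height 3.0: only merges at or below this distance are kept.
labels_cut = fcluster(Z, t=3.0, criterion="distance")

# Cutting above the final merge height leaves a single cluster.
labels_all = fcluster(Z, t=Z[-1, 2] + 1.0, criterion="distance")

print(labels_cut)
print(labels_all)
```

Lowering the cut height produces more, smaller clusters; raising it produces fewer, larger ones.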

4️ What is the main difference between K-Means and Hierarchical Clustering?

K-Means: Partition-based, requires predefined number of clusters (K), efficient for large datasets.

Hierarchical Clustering: Builds a hierarchy of clusters (tree structure), does not require specifying K beforehand.
K-Means is faster but assumes spherical clusters, while hierarchical clustering is more flexible but computationally expensive.

5️ What are the advantages of DBSCAN over K-Means?

DBSCAN can:

Detect clusters of arbitrary shapes (not just spherical).

Identify noise and outliers automatically.

Work well when clusters have irregular boundaries.
Unlike K-Means, it does not require specifying the number of clusters beforehand.

6️ When would you use Silhouette Score in clustering?

Silhouette Score is used to evaluate clustering quality. It measures how similar a point is to its own cluster compared to the nearest other cluster, and ranges from -1 to 1. Use it when choosing the number of clusters or comparing clustering algorithms. A score close to 1 indicates compact, well-separated clusters, while a score near 0 indicates overlapping clusters.
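To make the definition concrete, the sketch below computes the silhouette value of one point by hand as (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points of the nearest other cluster, and compares it with scikit-learn's silhouette_samples. The toy data and labels are assumptions for the example.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Toy 1-D data with two obvious clusters.
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

i = 0  # the point we inspect
same = labels == labels[i]
same[i] = False  # exclude the point itself from its own cluster

a = np.abs(X[same] - X[i]).mean()                  # cohesion: mean distance within own cluster
b = np.abs(X[labels != labels[i]] - X[i]).mean()   # separation: mean distance to the other cluster
s_manual = (b - a) / max(a, b)

s_sklearn = silhouette_samples(X, labels)[i]
print(s_manual, s_sklearn)
```

With only two clusters, the "nearest other cluster" is simply the other one; with more clusters, b would be the smallest such mean distance.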

7️ What are the limitations of Hierarchical Clustering?

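Computationally expensive for large datasets: standard agglomerative algorithms need O(n²) memory for the distance matrix and between O(n²) and O(n³) time, depending on the linkage and implementation.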

Once merged or split, decisions cannot be reversed.

Sensitive to noise and outliers.

Difficult to scale compared to algorithms like K-Means.

8️ Why is feature scaling important in clustering algorithms like K-Means?

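K-Means relies on distance calculations (Euclidean distance). If features have different scales (e.g., age in years vs salary in currency units), the features with larger numeric ranges dominate the distance calculation, leading to biased clustering. Scaling (for example standardization or min-max scaling) puts the features on comparable ranges so that each one contributes comparably.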
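A small sketch of the effect, using made-up age and salary values (the numbers are purely illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical records: [age in years, salary in currency units].
X = np.array([[25.0, 50_000.0],
              [60.0, 51_000.0],   # very different age, similar salary
              [26.0, 90_000.0]])  # similar age, very different salary

def dist(a, b):
    """Euclidean distance between two rows."""
    return np.linalg.norm(a - b)

# Raw distances: the salary column dominates, the 35-year age gap barely registers.
raw_01, raw_02 = dist(X[0], X[1]), dist(X[0], X[2])

# After standardization both columns have zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)
scaled_01, scaled_02 = dist(X_scaled[0], X_scaled[1]), dist(X_scaled[0], X_scaled[2])

print(raw_01, raw_02)
print(scaled_01, scaled_02)
```

Before scaling, the second distance is dozens of times larger than the first; after scaling, the two distances are of similar size, so both features influence the clustering.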

9️ How does DBSCAN identify noise points?

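DBSCAN labels points as noise if they do not belong to any dense region. A point is considered noise if:

It has fewer than min_samples neighbors within radius eps (so it is not a core point), and

It does not lie within eps of any core point (so it is not a border point of a cluster either).
Such points are marked with label -1 in scikit-learn.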
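The following sketch shows the difference between a point that merely has few neighbors but sits next to a dense group (it still joins the cluster) and a truly isolated point (labeled -1). The coordinates, eps, and min_samples are assumptions chosen for the example; in scikit-learn a point counts as its own neighbor.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.05, 0.05],  # dense group
    [0.36, 0.05],  # few neighbors, but within eps of points in the dense group
    [5.0, 5.0],    # far away from everything
])

db = DBSCAN(eps=0.3, min_samples=5).fit(X)

print(db.labels_)               # cluster label per point, -1 means noise
print(db.core_sample_indices_)  # indices of the core points
```

Only the isolated point receives label -1; the sixth point is not a core point, yet it is attached to the cluster because it lies within eps of a core point.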

1️0️ Define inertia in the context of K-Means.

Inertia is the sum of squared distances between each data point and its assigned cluster centroid. It measures cluster compactness. Lower inertia indicates tighter, more cohesive clusters.
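As a quick check of this definition, inertia can be recomputed by hand from a fitted model and compared with scikit-learn's inertia_ attribute. The dataset parameters below are arbitrary choices for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Centroid of the cluster each point was assigned to.
assigned_centers = km.cluster_centers_[km.labels_]

# Sum of squared distances between points and their assigned centroids.
inertia_manual = np.sum((X - assigned_centers) ** 2)

print(inertia_manual, km.inertia_)
```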

1️1️ What is the elbow method in K-Means clustering?

The elbow method is a heuristic for choosing the number of clusters (K). It plots inertia against K. Inertia keeps decreasing as K increases, so instead of looking for a minimum you look for the “elbow point” where the decrease slows significantly; that K is taken as a reasonable choice.

1️2️ Describe the concept of "density" in DBSCAN.

Density refers to the number of points within a specified radius (eps) of a given point. A point with at least min_samples points in that neighborhood is a core point. DBSCAN grows clusters by connecting core points that lie within eps of each other, and treats points in sparse regions as noise.

1️3️ Can hierarchical clustering be used on categorical data?

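Yes, but it requires appropriate distance metrics such as Hamming distance or other similarity measures instead of Euclidean distance. Standard implementations usually assume numeric data, so categorical features typically need to be encoded first or passed in as a precomputed distance matrix.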
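One possible way to do this with SciPy is to compute pairwise Hamming distances and feed them to linkage. The integer-encoded records below are hypothetical, and average linkage is just one reasonable choice.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical categorical records, integer-encoded per column.
# The codes are labels only; their numeric order carries no meaning.
X = np.array([[0, 1, 2],
              [0, 1, 1],
              [2, 0, 0],
              [2, 0, 1]])

# Hamming distance: fraction of columns on which two records disagree.
D = pdist(X, metric="hamming")

# linkage accepts the condensed distance matrix returned by pdist.
Z = linkage(D, method="average")
labels = fcluster(Z, t=2, criterion="maxclust")  # ask for at most 2 clusters
print(labels)
```

The first two records differ in only one column and are grouped together, as are the last two.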

1️4️ What does a negative Silhouette Score indicate?

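A negative Silhouette Score means a data point is, on average, closer to the points of another cluster than to those of its own cluster. This suggests the point may have been assigned to the wrong cluster; a negative average score over the whole dataset indicates poor clustering overall.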
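A tiny sketch of this situation, using scikit-learn's silhouette_samples on toy data where one point is deliberately given the "wrong" label (the data and labels are assumptions for the example):

```python
import numpy as np
from sklearn.metrics import silhouette_samples

X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])

# The point at 2.0 (index 2) is deliberately assigned to the far cluster.
labels = np.array([0, 0, 1, 1, 1, 1])

s = silhouette_samples(X, labels)
print(s)
```

The value at index 2 is negative because that point is much closer to the points at 0.0 and 1.0 than to the cluster it was assigned to.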

1️5️ Explain the term "linkage criteria" in hierarchical clustering.

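Linkage criteria define how the distance between two clusters is computed when deciding which pair to merge next:

Single linkage: minimum distance between any two points, one from each cluster

Complete linkage: maximum distance between any two points, one from each cluster

Average linkage: mean of all pairwise distances between the two clusters

Ward linkage: merges the pair of clusters whose union gives the smallest increase in within-cluster variance
Different linkage methods produce different cluster shapes and structures: single linkage tends to form long, chained clusters, while complete and Ward linkage favor compact ones.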
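As a small numeric illustration, the single, complete, and average linkage distances between two toy clusters can be read off the matrix of pairwise distances. The clusters A and B are made up for the example.

```python
import numpy as np
from scipy.spatial.distance import cdist

A = np.array([[0.0, 0.0], [1.0, 0.0]])  # cluster A
B = np.array([[3.0, 0.0], [5.0, 0.0]])  # cluster B

pairwise = cdist(A, B)  # every point in A against every point in B

single = pairwise.min()     # closest pair
complete = pairwise.max()   # farthest pair
average = pairwise.mean()   # mean over all pairs

print(single, complete, average)  # 2.0 5.0 3.5
```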

1️6️ Why might K-Means perform poorly on data with varying cluster sizes or densities?

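K-Means assigns every point to its nearest centroid, which implicitly assumes roughly spherical clusters of similar size and density. When clusters vary in size, density, or shape, a large or sparse cluster may be split, and points near its edge may be absorbed by a neighboring centroid, producing inaccurate cluster boundaries.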

1️7️ What are the core parameters in DBSCAN, and how do they influence clustering?

Core parameters:

eps: radius defining neighborhood size

min_samples: minimum number of points to form a dense region
Large eps → fewer clusters, more merging
Small eps → more clusters, possibly more noise
Higher min_samples → stricter density requirement
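A rough way to see these effects is to run DBSCAN with several settings on the same data and count clusters and noise points. The helper name summarize and the specific eps and min_samples values are assumptions for the example.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

def summarize(eps, min_samples):
    """Return (number of clusters excluding noise, number of noise points)."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int((labels == -1).sum())
    return n_clusters, n_noise

# Effect of eps with min_samples fixed.
for eps in (0.05, 0.2, 2.0):
    print("eps =", eps, summarize(eps, min_samples=5))

# Effect of min_samples with eps fixed.
for min_samples in (5, 20):
    print("min_samples =", min_samples, summarize(0.2, min_samples))
```

With a very large eps every point falls into one cluster; shrinking eps or raising min_samples can only keep or increase the number of noise points.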

1️8️ How does K-Means++ improve upon standard K-Means initialization?

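K-Means++ selects initial centroids more carefully by spreading them apart. The first centroid is chosen uniformly at random from the data; each subsequent centroid is sampled with probability proportional to its squared distance from the nearest centroid already chosen. This reduces the risk of poor initialization, usually speeds up convergence, and often leads to better clustering results than purely random initialization.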
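The seeding idea can be sketched as follows: after a first random pick, each new centroid is drawn with probability proportional to its squared distance from the nearest centroid chosen so far. This is a simplified sketch, not scikit-learn's implementation; the function name kmeans_pp_init and the toy data are assumptions.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Distance-weighted seeding in the spirit of K-Means++ (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # First centroid: a data point chosen uniformly at random.
    centers = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        # Squared distance from every point to its nearest chosen centroid.
        diffs = X[:, None, :] - np.array(centers)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # Draw the next centroid with probability proportional to d2:
        # far-away points are favored, already chosen points have probability 0.
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)

# Toy data: two tight pairs of points far apart.
X = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 0.0], [10.1, 0.0]])
init_centers = kmeans_pp_init(X, k=2)
print(init_centers)
```

On this data the second centroid is overwhelmingly likely to come from the pair that does not contain the first one, which is the spreading behavior the method is designed for.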

1️9️ What is agglomerative clustering?

Agglomerative clustering is a bottom-up hierarchical clustering method. Each data point starts as its own cluster, and the algorithm repeatedly merges the closest clusters until a stopping criterion (like number of clusters) is reached.

2️0️ What makes Silhouette Score a better metric than just inertia for model evaluation?

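Inertia only measures compactness within clusters and ignores separation between clusters. It also keeps decreasing as K grows, so it cannot be compared directly across different values of K. Silhouette Score considers both:

Cohesion (how close points are to the other points in their own cluster)

Separation (how far points are from the nearest other cluster)
Hence, Silhouette Score, which is bounded between -1 and 1, gives a more balanced and reliable evaluation of clustering quality than inertia alone.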



1️ Question:

Generate synthetic data with 4 centers using make_blobs and apply K-Means clustering. Visualize using a scatter plot.

Answer:

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

kmeans = KMeans(n_clusters=4, random_state=42)
labels = kmeans.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.title("KMeans Clustering (4 centers)")
plt.show()

2️ Question:

Load the Iris dataset and use Agglomerative Clustering to group the data into 3 clusters. Display the first 10 predicted labels.

Answer:

from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering

iris = load_iris()
X = iris.data

agg = AgglomerativeClustering(n_clusters=3)
labels = agg.fit_predict(X)

print(labels[:10])

3️ Question:

Generate synthetic data using make_moons and apply DBSCAN. Highlight outliers in the plot.

Answer:

from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

db = DBSCAN(eps=0.2, min_samples=5)
labels = db.fit_predict(X)

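# Plot clustered points by label, then overlay noise points (label -1) as red crosses.
noise = labels == -1
plt.scatter(X[~noise, 0], X[~noise, 1], c=labels[~noise])
plt.scatter(X[noise, 0], X[noise, 1], c="red", marker="x", label="Outliers (label -1)")
plt.legend()
plt.title("DBSCAN on Moons (outliers marked)")
plt.show()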

4️ Question:

Load the Wine dataset and apply K-Means clustering after standardizing the features. Print the size of each cluster.

Answer:

from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import numpy as np

wine = load_wine()
X = StandardScaler().fit_transform(wine.data)

kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(X)

unique, counts = np.unique(labels, return_counts=True)
print(dict(zip(unique, counts)))

5️ Question:

Use make_circles to generate synthetic data and cluster it using DBSCAN. Plot the result.

Answer:

from sklearn.datasets import make_circles
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt

# Fix the seed so the generated circles (and the plot) are reproducible.
X, _ = make_circles(n_samples=300, factor=0.5, noise=0.05, random_state=42)

db = DBSCAN(eps=0.2, min_samples=5)
labels = db.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.title("DBSCAN on Circles")
plt.show()

6️ Question:

Load the Breast Cancer dataset, apply MinMaxScaler, and use K-Means with 2 clusters. Output the cluster centroids.

Answer:

from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

data = load_breast_cancer()
X = MinMaxScaler().fit_transform(data.data)

kmeans = KMeans(n_clusters=2, random_state=42)
kmeans.fit(X)

print(kmeans.cluster_centers_)

7️ Question:

Generate synthetic data using make_blobs with varying cluster standard deviations and cluster with DBSCAN.

Answer:

from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=[1.0, 2.5, 0.5], random_state=42)

db = DBSCAN(eps=0.9, min_samples=5)
labels = db.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.title("DBSCAN with varying std")
plt.show()

8️ Question:

Load the Digits dataset, reduce it to 2D using PCA, and visualize clusters from K-Means.

Answer:

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

digits = load_digits()
X = digits.data

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

kmeans = KMeans(n_clusters=10, random_state=42)
labels = kmeans.fit_predict(X_pca)

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels)
plt.title("KMeans on Digits PCA")
plt.show()

9️ Question:

Create synthetic data using make_blobs and evaluate silhouette scores for k = 2 to 5. Display as a bar chart.

Answer:

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

scores = []
ks = range(2, 6)

for k in ks:
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(X)
    scores.append(silhouette_score(X, labels))

plt.bar(ks, scores)
plt.xlabel("k")
plt.ylabel("Silhouette Score")
plt.title("Silhouette Scores for k=2 to 5")
plt.show()

1️0️ Question:

Load the Iris dataset and use hierarchical clustering. Plot a dendrogram with average linkage.

Answer:

from sklearn.datasets import load_iris
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt

iris = load_iris()
Z = linkage(iris.data, method='average')

plt.figure(figsize=(10,5))
dendrogram(Z)
plt.title("Dendrogram (Average Linkage)")
plt.show()

1️1️ Question:

Generate synthetic data with overlapping clusters using make_blobs, then apply K-Means and visualize with decision boundaries.

Answer:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=2.5, random_state=42)

kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(X)

# Decision boundary
h = 0.02
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.title("KMeans with Decision Boundaries")
plt.show()

1️2️ Question:

Load the Digits dataset and apply DBSCAN after reducing dimensions with t-SNE. Visualize the results.

Answer:

from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt

digits = load_digits()
X = digits.data

tsne = TSNE(n_components=2, random_state=42)
X_2d = tsne.fit_transform(X)

db = DBSCAN(eps=5, min_samples=5)
labels = db.fit_predict(X_2d)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels)
plt.title("DBSCAN on t-SNE Digits")
plt.show()

1️3️ Question:

Generate synthetic data using make_blobs and apply Agglomerative Clustering with complete linkage. Plot the result.

Answer:

from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

agg = AgglomerativeClustering(n_clusters=4, linkage='complete')
labels = agg.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.title("Agglomerative Clustering (Complete Linkage)")
plt.show()

1️4️ Question:

Load the Breast Cancer dataset and compare inertia values for K = 2 to 6 using K-Means. Show results in a line plot.

Answer:

from sklearn.datasets import load_breast_cancer
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

data = load_breast_cancer()
X = data.data

inertia = []
K = range(2, 7)

for k in K:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X)
    inertia.append(kmeans.inertia_)

plt.plot(K, inertia, marker='o')
plt.xlabel("Number of Clusters")
plt.ylabel("Inertia")
plt.title("Inertia vs K")
plt.show()

1️5️ Question:

Generate synthetic concentric circles using make_circles and cluster using Agglomerative Clustering with single linkage.

Answer:

from sklearn.datasets import make_circles
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt

# Fix the seed so the generated circles (and the plot) are reproducible.
X, _ = make_circles(n_samples=300, factor=0.5, noise=0.05, random_state=42)

agg = AgglomerativeClustering(n_clusters=2, linkage='single')
labels = agg.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.title("Agglomerative Clustering (Single Linkage)")
plt.show()

1️6️ Question:

Use the Wine dataset, apply DBSCAN after scaling the data, and count the number of clusters (excluding noise).

Answer:

from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
import numpy as np

wine = load_wine()
X = StandardScaler().fit_transform(wine.data)

db = DBSCAN(eps=1.5, min_samples=5)
labels = db.fit_predict(X)

clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("Number of clusters (excluding noise):", clusters)

1️7️ Question:

Generate synthetic data with make_blobs and apply KMeans. Then plot the cluster centers on top of the data points.

Answer:

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(X)

centers = kmeans.cluster_centers_

plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.scatter(centers[:, 0], centers[:, 1], s=200, marker='X')
plt.title("KMeans with Cluster Centers")
plt.show()

1️8️ Question:

Load the Iris dataset, cluster with DBSCAN, and print how many samples were identified as noise.

Answer:

from sklearn.datasets import load_iris
from sklearn.cluster import DBSCAN

iris = load_iris()
X = iris.data

db = DBSCAN(eps=0.5, min_samples=5)
labels = db.fit_predict(X)

# DBSCAN marks noise samples with the label -1.
noise_count = int((labels == -1).sum())
print("Number of noise samples:", noise_count)

1️9️ Question:

Generate synthetic non-linearly separable data using make_moons, apply K-Means, and visualize the clustering result.

Answer:

from sklearn.datasets import make_moons
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

kmeans = KMeans(n_clusters=2, random_state=42)
labels = kmeans.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.title("KMeans on Non-linear Moons Data")
plt.show()

2️0️ Question:

Load the Digits dataset, apply PCA to reduce to 3 components, then use KMeans and visualize with a 3D scatter plot.

Answer:

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (only needed to register the 3D projection on Matplotlib < 3.2)

digits = load_digits()
X = digits.data

pca = PCA(n_components=3)
X_3d = pca.fit_transform(X)

kmeans = KMeans(n_clusters=10, random_state=42)
labels = kmeans.fit_predict(X_3d)

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X_3d[:, 0], X_3d[:, 1], X_3d[:, 2], c=labels)
ax.set_title("3D PCA + KMeans on Digits")
plt.show()
