# Clustering

## Theoretical Questions

**Question 1:** What is unsupervised learning in the context of machine learning?

**Answer:**
**Unsupervised learning** is a type of machine learning where the algorithm learns patterns from **unlabeled data**. Unlike supervised learning, there is no predefined target or output variable. The goal is to explore the data and find meaningful structures or patterns on its own, such as grouping similar data points together (clustering) or reducing the number of features (dimensionality reduction).

**Question 2:** How does K-Means clustering algorithm work?

**Answer:**
The **K-Means clustering algorithm** works by partitioning a dataset into a predefined number of 'K' distinct, non-overlapping clusters. The steps are as follows:
1. **Initialization:** Randomly select 'K' initial cluster centroids (center points) from the data.
2. **Assignment Step:** Assign each data point to the nearest cluster centroid, based on a distance metric (usually Euclidean distance).
3. **Update Step:** Recalculate the centroid of each cluster by taking the mean of all data points assigned to it.
4. **Iteration:** Repeat the Assignment and Update steps until the centroids no longer move significantly, meaning the clusters have stabilized.
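
The four steps above can be sketched in plain NumPy. This is a minimal illustration rather than scikit-learn's implementation; the function name `kmeans`, its parameters, and the toy data are assumptions made for this example.

```python
import numpy as np

def kmeans(X, k, n_iters=100, tol=1e-6, seed=0):
    """Plain K-Means (Lloyd's algorithm) on an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k distinct data points as the starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2. Assignment: index of the nearest centroid for every point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: each centroid moves to the mean of its assigned points
        #    (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Iteration: stop once the centroids have effectively stopped moving.
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, labels

# Two well-separated toy groups (made-up data for illustration).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids, labels = kmeans(X, k=2)
print(labels)
print(centroids)
```

On this toy data the first three points end up in one cluster and the last three in the other; which cluster gets index 0 depends on the random initialization.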

**Question 3:** Explain the concept of a dendrogram in hierarchical clustering.

**Answer:**
A **dendrogram** is a tree-like diagram used to visualize the arrangement of the clusters produced by hierarchical clustering. It illustrates how clusters are merged (in agglomerative clustering) or split (in divisive clustering). The y-axis of the dendrogram represents the distance or dissimilarity between clusters. By cutting the dendrogram at a certain height, you can determine the number of clusters for your dataset.
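
Cutting a dendrogram at a given height can also be done in code. The sketch below uses SciPy's `linkage` and `fcluster` on made-up one-dimensional data; the cut heights `3.0` and `0.6` are arbitrary values chosen for this example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up 1-D data: two tight groups that are far apart.
X = np.array([[0.0], [0.5], [1.0], [10.0], [10.5], [11.0]])

# Linkage matrix: each row records one merge and the height at which it happened.
Z = linkage(X, method='average')

# Cutting high keeps every merge below that height, leaving few clusters.
labels_high_cut = fcluster(Z, t=3.0, criterion='distance')
# Cutting lower undoes more merges, leaving more clusters.
labels_low_cut = fcluster(Z, t=0.6, criterion='distance')

print("Clusters when cutting at 3.0:", len(set(labels_high_cut)))
print("Clusters when cutting at 0.6:", len(set(labels_low_cut)))
```

The lower the cut, the more clusters remain, which is the same trade-off you read off the dendrogram by eye.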

**Question 4:** What is the main difference between K-Means and Hierarchical Clustering?

**Answer:**
The main difference is that **K-Means** is a **partitional** algorithm that produces a single flat partition of the data, while **Hierarchical Clustering** builds a nested **hierarchy** of clusters.
- In **K-Means**, you must specify the number of clusters (K) beforehand, and it assigns each data point to exactly one cluster.
- In **Hierarchical Clustering**, you do not need to pre-specify the number of clusters. It builds a hierarchy of clusters, which can be visualized using a dendrogram, allowing you to choose the number of clusters after the fact.

**Question 5:** What are the advantages of DBSCAN over K-Means?

**Answer:**
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) has several key advantages over K-Means:
1. **Arbitrarily Shaped Clusters:** DBSCAN can find clusters of arbitrary shape, whereas K-Means implicitly favors roughly spherical (convex) clusters.
2. **No Need to Specify K:** You do not need to pre-specify the number of clusters; the algorithm infers it from the density structure of the data. You do still have to choose `eps` and `MinPts`.
3. **Outlier Detection:** DBSCAN has a built-in mechanism for identifying noise points (outliers) that do not belong to any cluster.

**Question 6:** When would you use Silhouette Score in clustering?

**Answer:**
The **Silhouette Score** evaluates the quality of a clustering, whether it comes from K-Means or another algorithm. It ranges from -1 to 1: values near 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest misassigned points. It is particularly useful for:
- **Choosing the Optimal Number of Clusters (K):** You can calculate the Silhouette Score for different values of K and choose the K that yields the highest score.
- **Comparing Different Clustering Algorithms:** It can be used to compare the performance of different clustering algorithms on the same dataset.

**Question 7:** What are the limitations of Hierarchical Clustering?

**Answer:**
The main limitations of Hierarchical Clustering are:
- **High Computational Complexity:** The naive agglomerative algorithm takes O(n^3) time and O(n^2) memory for the pairwise distance matrix. Optimized variants bring the time down to roughly O(n^2 log n), or O(n^2) for some linkages, but that is still impractical for very large datasets.
- **Irreversible Decisions:** Once a merge or split is made, it cannot be undone. An early incorrect decision can lead to poor final clusters.
- **Sensitivity to Noise:** It can be sensitive to noise and outliers in the data.

**Question 8:** Why is feature scaling important in clustering algorithms like K-Means?

**Answer:**
Feature scaling is crucial because K-Means is a **distance-based algorithm**. If features are on different scales (e.g., age in years vs. income in thousands), the feature with the larger range will dominate the distance calculation. This means the clusters will be biased towards that feature. Scaling the features (e.g., using `StandardScaler` or `MinMaxScaler`) ensures that all features contribute equally to the distance computation, leading to more meaningful and accurate clusters.
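
A small numeric check makes the effect concrete. The sketch below uses three made-up rows of `[age, income]` and standardizes them by hand with NumPy (zero mean, unit variance per column, which is what `StandardScaler` computes); the variable names are chosen for this example.

```python
import numpy as np

# Made-up rows: [age in years, income in currency units].
X = np.array([[25.0, 50_000.0],
              [60.0, 50_500.0],
              [26.0, 58_000.0]])

def feature_shares(a, b):
    """Fraction of the squared Euclidean distance contributed by each feature."""
    sq = (a - b) ** 2
    return sq / sq.sum()

# Raw features: the 35-year age gap between rows 0 and 1 is almost invisible,
# because the income column has a much larger numeric range.
share_raw = feature_shares(X[0], X[1])

# Standardize each column to zero mean and unit variance.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
share_std = feature_shares(X_std[0], X_std[1])

print("Share of squared distance [age, income], raw:   ", share_raw.round(3))
print("Share of squared distance [age, income], scaled:", share_std.round(3))
```

Before scaling, age accounts for well under 1% of the squared distance between the first two rows; after scaling it accounts for most of it, because relative to the spread of each column the age gap is large and the income gap is small.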

**Question 9:** How does DBSCAN identify noise points?

**Answer:**
DBSCAN identifies noise points based on its concept of density. It classifies data points into three types:
- **Core points:** Points that have at least `MinPts` points within a specified radius (`eps`). In scikit-learn, where the parameter is called `min_samples`, this count includes the point itself.
- **Border points:** Points that are within the radius of a core point but do not have enough neighbors to be core points themselves.
- **Noise points (outliers):** Any point that is neither a core point nor a border point. These are points that are isolated in low-density regions.
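
The three definitions can be checked directly on a toy dataset. The sketch below is a NumPy illustration of the definitions only, not DBSCAN's full cluster-expansion procedure. The helper `classify_points` is hypothetical, and it assumes the neighborhood count includes the point itself (the convention scikit-learn uses for `min_samples`).

```python
import numpy as np

def classify_points(X, eps, min_pts):
    """Label each point 'core', 'border', or 'noise' following the DBSCAN definitions."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    within_eps = dists <= eps                      # each neighborhood includes the point itself
    is_core = within_eps.sum(axis=1) >= min_pts
    # Border: not core, but inside the eps-neighborhood of at least one core point.
    is_border = ~is_core & (within_eps & is_core[None, :]).any(axis=1)
    return np.where(is_core, 'core', np.where(is_border, 'border', 'noise'))

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],  # dense square
              [2.0, 1.0],                                       # hangs off the square
              [8.0, 8.0]])                                      # isolated point
kinds = classify_points(X, eps=1.1, min_pts=3)
print(kinds.tolist())
# ['core', 'core', 'core', 'core', 'border', 'noise']
```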

**Question 10:** Define inertia in the context of K-Means.

**Answer:**
**Inertia** in K-Means is the **sum of squared distances** from each sample to its closest cluster center. It measures how internally coherent the clusters are: for a fixed K, a lower inertia means tighter, more compact clusters. K-Means searches for the cluster centers that minimize inertia, although it is only guaranteed to reach a local minimum, which is why implementations typically rerun it from several initializations.
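
As a worked example, inertia can be computed by hand for a tiny dataset. The data and the two centers below are made up; with a fitted scikit-learn `KMeans` model the same quantity is exposed as the `inertia_` attribute.

```python
import numpy as np

X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]])
centers = np.array([[1.0, 0.0], [11.0, 0.0]])

# Squared distance from every sample to every center, shape (n_samples, n_centers).
sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)

# Inertia: for each sample keep the squared distance to its closest center, then sum.
inertia = sq_dists.min(axis=1).sum()
print(inertia)  # 4.0 (each of the four points is at distance 1 from its center)
```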

**Question 11:** What is the elbow method in K-Means clustering?

**Answer:**
The **elbow method** is a heuristic for choosing the number of clusters (K) in K-Means. It works by plotting the inertia (sum of squared distances) for a range of K values. As K increases, inertia decreases. The plot typically looks like an arm, and the "elbow" point, where the rate of decrease in inertia sharply slows down, is taken as a reasonable choice for K. Because the elbow is judged by eye, it can be ambiguous when the curve bends gradually.

**Question 12:** Describe the concept of "density" in DBSCAN.

**Answer:**
In DBSCAN, "density" at a particular point is defined by two parameters: `eps` (epsilon) and `MinPts` (minimum points). A region is considered **dense** if it contains at least `MinPts` number of points within a radius of `eps`. This concept of density allows the algorithm to form clusters by connecting dense regions and to identify points in sparse regions as noise.

**Question 13:** Can hierarchical clustering be used on categorical data?

**Answer:**
Yes, hierarchical clustering can be used on categorical data, but it requires an appropriate dissimilarity measure. Standard distance metrics like Euclidean distance are not suitable for categorical features. Instead, you would need to use a metric like the **Jaccard distance** or create a dissimilarity matrix based on techniques like Gower's distance, which can handle mixed data types.
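
One possible way to do this with SciPy is sketched below: the categorical records are one-hot encoded by hand, pairwise Jaccard dissimilarities are computed with `pdist`, and the resulting condensed distance matrix is passed to `linkage`. The records and the encoding scheme are made up for this example; Gower's distance is not shown because neither SciPy nor scikit-learn provides it directly.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up categorical records: (color, size, shape).
records = [("red", "small", "round"),
           ("red", "small", "square"),
           ("blue", "large", "round"),
           ("blue", "large", "square")]

# One-hot encode by hand: one boolean column per (attribute index, value) pair.
columns = sorted({(j, v) for r in records for j, v in enumerate(r)})
onehot = np.array([[r[j] == v for (j, v) in columns] for r in records])

# Condensed pairwise Jaccard dissimilarities, then average-linkage clustering.
D = pdist(onehot, metric="jaccard")
Z = linkage(D, method="average")

# Ask for two flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

The two red/small records share two of three attributes, as do the two blue/large records, so each pair ends up in its own cluster.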

**Question 14:** What does a negative Silhouette Score indicate?

**Answer:**
A negative Silhouette Score for a data point indicates that it has likely been **assigned to the wrong cluster**. It means that the average distance to the points in its own cluster is greater than the average distance to the points in the nearest neighboring cluster. This suggests that the point is closer to a different cluster than the one it was assigned to.
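
The per-sample score is s = (b - a) / max(a, b), where a is the mean distance to the other points in the sample's own cluster and b is the mean distance to the points in the nearest other cluster. The sketch below computes it by hand on made-up one-dimensional data with one deliberately mislabeled point. The helper `silhouette_of` is hypothetical and assumes the sample's cluster has at least two members.

```python
import numpy as np

def silhouette_of(i, X, labels):
    """Silhouette coefficient of sample i: (b - a) / max(a, b)."""
    d = np.linalg.norm(X - X[i], axis=1)
    own = labels == labels[i]
    # a: mean distance to the other members of the sample's own cluster.
    a = d[own].sum() / (own.sum() - 1)
    # b: smallest mean distance to the members of any other cluster.
    b = min(d[labels == c].mean() for c in set(labels) if c != labels[i])
    return (b - a) / max(a, b)

X = np.array([[0.0], [1.0], [2.0], [9.0], [10.0]])
labels = np.array([0, 0, 0, 0, 1])   # the point at 9.0 is deliberately mislabeled

print(silhouette_of(3, X, labels))   # -0.875: a = 8, b = 1
print(silhouette_of(0, X, labels))   # 0.6:    a = 4, b = 10
```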

**Question 15:** Explain the term "linkage criteria" in hierarchical clustering.

**Answer:**
The **linkage criterion** defines how the distance between two clusters is measured in agglomerative hierarchical clustering. This is crucial for deciding which clusters to merge at each step. Common linkage criteria include:
- **Single Linkage:** The distance between the closest points in the two clusters.
- **Complete Linkage:** The distance between the farthest points in the two clusters.
- **Average Linkage:** The average distance between all pairs of points in the two clusters.
- **Ward's Linkage:** Merges the pair of clusters that gives the smallest increase in total within-cluster variance.
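
The four criteria can be compared on two tiny made-up clusters. The sketch below computes each quantity directly with NumPy; for Ward it computes the increase in the within-cluster sum of squared deviations that merging would cause, which is the quantity the criterion seeks to keep small. The helper `sse` is a name chosen for this example.

```python
import numpy as np

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0], [6.0, 0.0]])

# All pairwise distances between a point in A and a point in B, shape (len(A), len(B)).
pair = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

single = pair.min()     # closest pair      -> 3.0
complete = pair.max()   # farthest pair     -> 6.0
average = pair.mean()   # mean of all pairs -> 4.5

def sse(P):
    """Sum of squared deviations of the points in P from their own mean."""
    return ((P - P.mean(axis=0)) ** 2).sum()

# Ward: how much the total within-cluster sum of squares grows if A and B merge.
ward_increase = sse(np.vstack([A, B])) - sse(A) - sse(B)   # -> 20.25

print(single, complete, average, ward_increase)
```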

**Question 16:** Why might K-Means clustering perform poorly on data with varying cluster sizes or densities?

**Answer:**
K-Means performs poorly in such scenarios because it implicitly assumes that clusters are **spherical and have similar sizes and densities**. Its objective is to minimize inertia, which works best when clusters are well-separated and roughly equal in size. When faced with varying densities or sizes, K-Means can struggle to place centroids correctly, often splitting large or sparse clusters and incorrectly grouping smaller, denser ones.

**Question 17:** What are the core parameters in DBSCAN, and how do they influence clustering?

**Answer:**
The two core parameters in DBSCAN are:
1. **`eps` (epsilon):** This is the radius that defines the neighborhood around a data point. It determines how close points must be to each other to be considered part of the same cluster.
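2. **`MinPts` (Minimum Points):** This is the minimum number of data points that must lie within `eps` of a point for it to count as a core point, i.e. to form a dense region. scikit-learn calls this parameter `min_samples`.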

They influence clustering by defining density. A larger `eps` or a smaller `MinPts` will result in more points being clustered together, while a smaller `eps` or a larger `MinPts` will result in more points being classified as noise.
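
The effect of both parameters can be seen by counting noise points on a small made-up dataset. The sketch below uses scikit-learn's `DBSCAN`; the helper `n_noise` and the specific parameter values are chosen for this example.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Made-up 1-D data: a chain of five points, one nearby straggler, one far outlier.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [7.0], [20.0]])

def n_noise(eps, min_samples):
    """Number of points DBSCAN labels as noise (-1) for the given parameters."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    return int(np.sum(labels == -1))

baseline = n_noise(eps=1.5, min_samples=3)     # straggler and outlier are noise
larger_eps = n_noise(eps=3.5, min_samples=3)   # straggler is absorbed as a border point
larger_min = n_noise(eps=1.5, min_samples=4)   # no point is dense enough: everything is noise

print(baseline, larger_eps, larger_min)
```

Raising `eps` reduces the noise count here, and raising `min_samples` increases it, matching the description above.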

**Question 18:** How does K-Means++ improve upon standard K-Means initialization?

**Answer:**
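Standard K-Means uses random initialization for its centroids, which can sometimes lead to poor clustering results or slow convergence. **K-Means++** improves this with a smarter initialization technique. It picks the first centroid uniformly at random from the data, then samples each subsequent centroid with probability proportional to its squared distance from the nearest centroid already chosen. This tends to spread the initial centroids out and makes it unlikely that several of them start inside the same cluster, which generally leads to better final clusters and faster convergence. It is the default initialization in scikit-learn's `KMeans` (`init='k-means++'`).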
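
The seeding rule can be sketched in a few lines of NumPy. This is an illustration of the basic K-Means++ rule, not scikit-learn's implementation; the function name `kmeans_pp_init` and the toy data are assumptions made for this example.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Pick k initial centroids from the rows of X with the K-Means++ seeding rule."""
    rng = np.random.default_rng(seed)
    # First centroid: a uniformly random data point.
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        chosen = np.array(centroids)
        # Squared distance from each point to its nearest already-chosen centroid.
        d2 = ((X[:, None, :] - chosen[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # Sample the next centroid with probability proportional to that squared distance,
        # so far-away points are likely and already-chosen points have probability zero.
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)

X = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
init = kmeans_pp_init(X, k=3)
print(init)
```

Note that the choice is probabilistic rather than a strict "farthest point" rule, which keeps a single outlier from always being selected as a centroid.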

**Question 19:** What is agglomerative clustering?

**Answer:**
**Agglomerative clustering** is a type of hierarchical clustering that follows a **"bottom-up"** approach. It starts by treating each data point as its own individual cluster. Then, at each step, it merges the two closest clusters based on a chosen linkage criterion. This process is repeated until all data points belong to a single, large cluster.

**Question 20:** What makes Silhouette Score a better metric than just inertia for model evaluation?

**Answer:**
The Silhouette Score is often better because it measures two important aspects of cluster quality: **cohesion** (how close points are within a cluster) and **separation** (how far apart different clusters are).

Inertia, on the other hand, only measures cohesion. A major limitation of inertia is that it always decreases as the number of clusters (K) increases, making it hard to identify the optimal K without a heuristic like the elbow method. The Silhouette Score provides a more balanced view and often has a clear peak at the optimal number of clusters.

## Practical Questions

**Question 21:** Generate synthetic data with 4 centers using make_blobs and apply K-Means clustering. Visualize using a scatter plot.

In [None]:
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

kmeans = KMeans(n_clusters=4, random_state=42)
y_kmeans = kmeans.fit_predict(X)

plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
plt.title('K-Means Clustering on Synthetic Blobs')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()

**Question 22:** Load the Iris dataset and use Agglomerative Clustering to group the data into 3 clusters. Display the first 10 predicted labels.

In [None]:
from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering

iris = load_iris()
X = iris.data

agg_cluster = AgglomerativeClustering(n_clusters=3)
y_pred = agg_cluster.fit_predict(X)

print("First 10 predicted labels:")
print(y_pred[:10])

**Question 23:** Generate synthetic data using make_moons and apply DBSCAN. Highlight outliers in the plot.

In [None]:
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt
import numpy as np

X, y = make_moons(n_samples=200, noise=0.1, random_state=42)

dbscan = DBSCAN(eps=0.2, min_samples=5)
y_db = dbscan.fit_predict(X)

plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=y_db, cmap='viridis')

# Highlight outliers (DBSCAN labels noise points as -1)
outliers = X[y_db == -1]
if len(outliers) > 0:  # check the number of rows, not whether any coordinate is non-zero
    plt.scatter(outliers[:, 0], outliers[:, 1], c='red', s=70, label='Outliers')
    plt.legend()

plt.title('DBSCAN Clustering on Moons Data')
plt.show()

**Question 24:** Load the Wine dataset and apply K-Means clustering after standardizing the features. Print the size of each cluster.

In [None]:
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import numpy as np

wine = load_wine()
X = wine.data

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

kmeans = KMeans(n_clusters=3, random_state=42)
y_kmeans = kmeans.fit_predict(X_scaled)

cluster_sizes = np.bincount(y_kmeans)
for i, size in enumerate(cluster_sizes):
    print(f"Size of cluster {i}: {size}")

**Question 25:** Use make_circles to generate synthetic data and cluster it using DBSCAN. Plot the result.

In [None]:
from sklearn.datasets import make_circles
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt

X, y = make_circles(n_samples=500, factor=0.5, noise=0.05, random_state=42)

dbscan = DBSCAN(eps=0.1)
y_db = dbscan.fit_predict(X)

plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=y_db, cmap='plasma')
plt.title('DBSCAN on Concentric Circles')
plt.show()

**Question 26:** Load the Breast Cancer dataset, apply MinMaxScaler, and use K-Means with 2 clusters. Output the cluster centroids.

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

cancer = load_breast_cancer()
X = cancer.data

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

kmeans = KMeans(n_clusters=2, random_state=42)
kmeans.fit(X_scaled)

print("Cluster Centroids:")
print(kmeans.cluster_centers_)

**Question 27:** Generate synthetic data using make_blobs with varying cluster standard deviations and cluster with DBSCAN.

In [None]:
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt

X, y = make_blobs(n_samples=200, centers=3, cluster_std=[1.0, 2.5, 0.5], random_state=42)

dbscan = DBSCAN(eps=1.0, min_samples=5)
y_db = dbscan.fit_predict(X)

plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=y_db, cmap='viridis')
plt.title('DBSCAN on Blobs with Varying Density')
plt.show()

**Question 28:** Load the Digits dataset, reduce it to 2D using PCA, and visualize clusters from K-Means.

In [None]:
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

digits = load_digits()
X = digits.data

X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

kmeans = KMeans(n_clusters=10, random_state=42)
y_kmeans = kmeans.fit_predict(X_pca)

plt.figure(figsize=(10, 8))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y_kmeans, s=20, cmap='viridis')
plt.title('K-Means Clustering on Digits Dataset (after PCA)')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()

**Question 29:** Create synthetic data using make_blobs and evaluate silhouette scores for k=2 to 5. Display as a bar chart.

In [None]:
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

X, y = make_blobs(n_samples=500, centers=4, cluster_std=1.0, random_state=42)

k_range = range(2, 6)
silhouette_scores = []

for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    y_kmeans = kmeans.fit_predict(X)
    score = silhouette_score(X, y_kmeans)
    silhouette_scores.append(score)

plt.figure(figsize=(8, 5))
plt.bar([str(k) for k in k_range], silhouette_scores, color='skyblue')
plt.title('Silhouette Scores for Different Values of K')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Silhouette Score')
plt.show()

**Question 30:** Load the Iris dataset and use hierarchical clustering to group data. Plot a dendrogram with average linkage.

In [None]:
from sklearn.datasets import load_iris
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt

iris = load_iris()
X = iris.data

linked = linkage(X, 'average')

plt.figure(figsize=(12, 8))
dendrogram(linked, orientation='top', distance_sort='descending', show_leaf_counts=True)
plt.title('Hierarchical Clustering Dendrogram (Average Linkage)')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.show()

**Question 31:** Generate synthetic data with overlapping clusters using make_blobs, then apply K-Means and visualize with decision boundaries.

In [None]:
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import numpy as np

X, y = make_blobs(n_samples=200, centers=3, cluster_std=1.5, random_state=42)

kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)

# Plotting decision boundaries
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02), np.arange(y_min, y_max, 0.02))
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.figure(figsize=(10, 8))
plt.imshow(Z, interpolation='nearest', extent=(xx.min(), xx.max(), yy.min(), yy.max()), cmap=plt.cm.Pastel2, aspect='auto', origin='lower')
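# Points are colored by their true blob label (y), so samples that fall on the
# wrong side of a K-Means boundary stand out against the background regions.
plt.scatter(X[:, 0], X[:, 1], c=y, s=30, cmap='viridis')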
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=250, marker='*', c='red', label='Centroids')
plt.title('K-Means on Overlapping Clusters')
plt.legend()
plt.show()

**Question 32:** Load the Digits dataset and apply DBSCAN after reducing dimensions with t-SNE. Visualize the results.

In [None]:
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt

digits = load_digits()
X = digits.data

tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X)

dbscan = DBSCAN(eps=5, min_samples=5)
y_db = dbscan.fit_predict(X_tsne)

plt.figure(figsize=(10, 8))
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y_db, cmap='viridis')
plt.title('DBSCAN on Digits Dataset (after t-SNE)')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.show()

**Question 33:** Generate synthetic data using make_blobs and apply Agglomerative Clustering with complete linkage. Plot the result.

In [None]:
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt

X, y = make_blobs(n_samples=200, centers=4, cluster_std=0.7, random_state=42)

agg_cluster = AgglomerativeClustering(n_clusters=4, linkage='complete')
y_pred = agg_cluster.fit_predict(X)

plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap='viridis')
plt.title('Agglomerative Clustering (Complete Linkage)')
plt.show()

**Question 34:** Load the Breast Cancer dataset and compare inertia values for K=2 to 6 using K-Means. Show results in a line plot.

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

cancer = load_breast_cancer()
X = cancer.data
X_scaled = StandardScaler().fit_transform(X)

k_range = range(2, 7)
inertias = []

for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_scaled)
    inertias.append(kmeans.inertia_)

plt.figure(figsize=(8, 5))
plt.plot(k_range, inertias, marker='o')
plt.title('Elbow Method for Optimal K')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia')
plt.xticks(k_range)
plt.grid(True)
plt.show()

**Question 35:** Generate synthetic concentric circles using make_circles and cluster using Agglomerative Clustering with single linkage.

In [None]:
from sklearn.datasets import make_circles
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt

X, y = make_circles(n_samples=500, factor=0.5, noise=0.05, random_state=42)

agg_cluster = AgglomerativeClustering(n_clusters=2, linkage='single')
y_pred = agg_cluster.fit_predict(X)

plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap='plasma')
plt.title('Agglomerative Clustering (Single Linkage) on Circles')
plt.show()

**Question 36:** Use the Wine dataset, apply DBSCAN after scaling the data, and count the number of clusters (excluding noise).

In [None]:
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
import numpy as np

wine = load_wine()
X = wine.data
X_scaled = StandardScaler().fit_transform(X)

dbscan = DBSCAN(eps=2.5, min_samples=5)
y_db = dbscan.fit_predict(X_scaled)

n_clusters = len(np.unique(y_db[y_db != -1]))
print(f"Number of clusters found by DBSCAN (excluding noise): {n_clusters}")

**Question 37:** Generate synthetic data with make_blobs and apply KMeans. Then plot the cluster centers on top of the data points.

In [None]:
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

kmeans = KMeans(n_clusters=4, random_state=42)
y_kmeans = kmeans.fit_predict(X)

plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=250, marker='*', c='red', label='Centroids')
plt.title('K-Means Clustering with Centroids')
plt.legend()
plt.show()

**Question 38:** Load the Iris dataset, cluster with DBSCAN, and print how many samples were identified as noise.

In [None]:
from sklearn.datasets import load_iris
from sklearn.cluster import DBSCAN
import numpy as np

iris = load_iris()
X = iris.data

dbscan = DBSCAN(eps=0.5, min_samples=5)
y_db = dbscan.fit_predict(X)

n_noise = np.sum(y_db == -1)
print(f"Number of noise points identified by DBSCAN: {n_noise}")

**Question 39:** Generate synthetic non-linearly separable data using make_moons, apply K-Means, and visualize the clustering result.

In [None]:
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

X, y = make_moons(n_samples=200, noise=0.05, random_state=42)

kmeans = KMeans(n_clusters=2, random_state=42)
y_kmeans = kmeans.fit_predict(X)

plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, cmap='viridis')
plt.title('K-Means on Non-Linearly Separable Data (Fails)')
plt.show()

**Question 40:** Load the Digits dataset, apply PCA to reduce to 3 components, then use KMeans and visualize with a 3D scatter plot.

In [None]:
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # only needed on older Matplotlib releases to register the '3d' projection

digits = load_digits()
X = digits.data

X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_scaled)

kmeans = KMeans(n_clusters=10, random_state=42)
y_kmeans = kmeans.fit_predict(X_pca)

fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X_pca[:, 0], X_pca[:, 1], X_pca[:, 2], c=y_kmeans, cmap='viridis')
ax.set_title('K-Means on Digits Dataset (3D PCA)')
ax.set_xlabel('PC 1')
ax.set_ylabel('PC 2')
ax.set_zlabel('PC 3')
plt.show()

**Question 41:** Generate synthetic blobs with 5 centers and apply KMeans. Then use silhouette_score to evaluate the clustering.

In [None]:
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, y = make_blobs(n_samples=500, centers=5, cluster_std=1.0, random_state=42)

kmeans = KMeans(n_clusters=5, random_state=42)
y_kmeans = kmeans.fit_predict(X)

score = silhouette_score(X, y_kmeans)
print(f"Silhouette Score for K-Means (k=5): {score:.4f}")

**Question 42:** Load the Breast Cancer dataset, reduce dimensionality using PCA, and apply Agglomerative Clustering. Visualize in 2D.

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt

cancer = load_breast_cancer()
X = cancer.data

X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

agg_cluster = AgglomerativeClustering(n_clusters=2)
y_pred = agg_cluster.fit_predict(X_pca)

plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y_pred, cmap='viridis')
plt.title('Agglomerative Clustering on Breast Cancer Data (after PCA)')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()

**Question 43:** Generate noisy circular data using make_circles and visualize clustering results from KMeans and DBSCAN side-by-side.

In [None]:
from sklearn.datasets import make_circles
from sklearn.cluster import KMeans, DBSCAN
import matplotlib.pyplot as plt

X, y = make_circles(n_samples=500, factor=0.5, noise=0.08, random_state=42)

y_kmeans = KMeans(n_clusters=2, random_state=42).fit_predict(X)
y_dbscan = DBSCAN(eps=0.15).fit_predict(X)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))

ax1.scatter(X[:, 0], X[:, 1], c=y_kmeans, cmap='plasma')
ax1.set_title('K-Means Result')

ax2.scatter(X[:, 0], X[:, 1], c=y_dbscan, cmap='plasma')
ax2.set_title('DBSCAN Result')

plt.suptitle('K-Means vs. DBSCAN on Noisy Circles')
plt.show()

**Question 44:** Load the Iris dataset and plot the Silhouette Coefficient for each sample after KMeans clustering.

In [None]:
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples
import matplotlib.pyplot as plt
import numpy as np

iris = load_iris()
X = iris.data

kmeans = KMeans(n_clusters=3, random_state=42)
y_kmeans = kmeans.fit_predict(X)

sample_silhouette_values = silhouette_samples(X, y_kmeans)

plt.figure(figsize=(10, 6))
y_ax_lower = 10
for i in range(3):
    ith_cluster_silhouette_values = sample_silhouette_values[y_kmeans == i]
    ith_cluster_silhouette_values.sort()
    size_cluster_i = ith_cluster_silhouette_values.shape[0]
    y_ax_upper = y_ax_lower + size_cluster_i
    plt.fill_betweenx(np.arange(y_ax_lower, y_ax_upper), 0, ith_cluster_silhouette_values)
    y_ax_lower = y_ax_upper + 10

plt.title('Silhouette Plot for each Sample')
plt.xlabel('Silhouette Coefficient')
plt.ylabel('Cluster Label')
plt.yticks([])
plt.show()

**Question 45:** Generate synthetic data using make_blobs and apply Agglomerative Clustering with 'average' linkage. Visualize clusters.

In [None]:
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt

X, y = make_blobs(n_samples=200, centers=4, cluster_std=0.9, random_state=42)

agg_cluster = AgglomerativeClustering(n_clusters=4, linkage='average')
y_pred = agg_cluster.fit_predict(X)

plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap='viridis')
plt.title('Agglomerative Clustering (Average Linkage)')
plt.show()

**Question 46:** Load the Wine dataset, apply KMeans, and visualize the cluster assignments in a seaborn pairplot (first 4 features).

In [None]:
from sklearn.datasets import load_wine
from sklearn.cluster import KMeans
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

wine = load_wine()
X = wine.data

kmeans = KMeans(n_clusters=3, random_state=42)
y_kmeans = kmeans.fit_predict(X)

df = pd.DataFrame(X[:, :4], columns=wine.feature_names[:4])
df['cluster'] = y_kmeans

sns.pairplot(df, hue='cluster', palette='viridis')
plt.suptitle('K-Means Clusters on Wine Dataset (First 4 Features)', y=1.02)
plt.show()

**Question 47:** Generate noisy blobs using make_blobs and use DBSCAN to identify both clusters and noise points. Print the count.

In [None]:
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN
import numpy as np

X, y = make_blobs(n_samples=250, centers=3, cluster_std=1.2, random_state=42)

dbscan = DBSCAN(eps=1.0, min_samples=5)
y_db = dbscan.fit_predict(X)

n_clusters = len(np.unique(y_db[y_db != -1]))
n_noise = np.sum(y_db == -1)

print(f"Number of clusters found: {n_clusters}")
print(f"Number of noise points found: {n_noise}")

**Question 48:** Load the Digits dataset, reduce dimensions using t-SNE, then apply Agglomerative Clustering and plot the clusters.

In [None]:
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt

digits = load_digits()
X = digits.data

tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X)

agg_cluster = AgglomerativeClustering(n_clusters=10)
y_pred = agg_cluster.fit_predict(X_tsne)

plt.figure(figsize=(10, 8))
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y_pred, cmap='viridis')
plt.title('Agglomerative Clustering on Digits Dataset (after t-SNE)')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.show()