# **Theoretical**



## 1. What is unsupervised learning in the context of machine learning?
Unsupervised learning is a type of machine learning where the algorithm learns patterns from unlabeled data without any predefined output labels or human supervision. Unlike supervised learning, there are no correct answers or target variables provided to the model. The system tries to learn the underlying structure, patterns, or relationships in the data on its own. Common unsupervised learning tasks include clustering (grouping similar data points), dimensionality reduction (reducing the number of variables), and anomaly detection (identifying unusual data points). Unsupervised learning is particularly useful when you don't know what you're looking for in the data or when labeling data would be too expensive or time-consuming.

## 2. How does K-Means clustering algorithm work?
K-Means is an iterative clustering algorithm that partitions data into K distinct, non-overlapping clusters. Here's how it works in detail:

1. **Initialization**: Randomly select K data points as initial cluster centroids (or use a smarter initialization method like K-Means++).

2. **Assignment Step**: For each data point, calculate the distance (usually Euclidean) to all centroids and assign the point to the nearest centroid's cluster.

3. **Update Step**: Recalculate the centroids by taking the mean of all data points assigned to each cluster.

4. **Iteration**: Repeat the assignment and update steps until either:
   - The centroids no longer change significantly (convergence)
   - A maximum number of iterations is reached
   - The within-cluster sum of squares (inertia) stops decreasing significantly

The algorithm aims to minimize the within-cluster sum of squares (inertia), which measures how tightly grouped the points in each cluster are.
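
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than scikit-learn's implementation: the function name `kmeans_lloyd`, its parameters, and the toy data are assumptions made for this example, and it uses plain random initialization instead of K-Means++.

```python
import numpy as np

def kmeans_lloyd(X, k, n_iter=100, tol=1e-6, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels, inertia)."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: squared Euclidean distance to every centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        # (an empty cluster simply keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:  # convergence: centroids stopped moving
            break
    # Final assignment and inertia for the returned centroids.
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    inertia = float(d2.min(axis=1).sum())
    return centroids, labels, inertia

# Two well-separated toy groups (made-up data for illustration).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids, labels, inertia = kmeans_lloyd(X, k=2)
print(labels, inertia)
```

On data this cleanly separated, the two groups are recovered regardless of which points are drawn as initial centroids; on harder data the result can depend on the seed, which is why multiple restarts are common.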

## 3. Explain the concept of a dendrogram in hierarchical clustering.
A dendrogram is a tree-like diagram that records the sequences of merges or splits in hierarchical clustering. It provides a visual representation of the hierarchical relationships between data points and clusters. Key features:

- **Leaves**: The bottom of the dendrogram represents individual data points.
- **Nodes**: Points where clusters merge, showing which clusters or points are joined.
- **Height/Y-axis**: Represents the distance or dissimilarity between merging clusters. The higher the merge point, the more dissimilar the clusters were when they merged.
- **Cutting the dendrogram**: Drawing a horizontal line at a chosen height determines the number of clusters: a line that crosses K vertical branches yields K clusters.

Dendrograms are useful for understanding the natural hierarchy in data and determining an appropriate number of clusters by examining where large "jumps" in merge distances occur.
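
As a rough sketch of the "large jump" idea, the merge heights stored by SciPy's `linkage` can be inspected directly and the tree cut with `fcluster`. The toy data and the midpoint rule for choosing the cut height are assumptions made for illustration, not a standard recipe.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 1-D data (made up): two tight groups that are far apart.
X = np.array([[0.0], [0.2], [0.4], [5.0], [5.2], [5.4]])

# Each row of Z is one merge: [cluster_a, cluster_b, merge_height, new_size].
Z = linkage(X, method="average")
heights = Z[:, 2]

# Find the largest jump between consecutive merge heights and cut inside it.
gaps = np.diff(heights)
i = gaps.argmax()
cut = (heights[i] + heights[i + 1]) / 2

# A horizontal cut at this height gives the flat cluster labels.
labels = fcluster(Z, t=cut, criterion="distance")
print(len(set(labels)), labels)
```

Here the last merge happens far above all the others, so cutting just below it leaves two clusters.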

## 4. What is the main difference between K-Means and Hierarchical Clustering?
The main differences are:

1. **Approach**:
   - K-Means is a partitional clustering algorithm that divides data into K distinct clusters all at once.
   - Hierarchical clustering builds a hierarchy of clusters either through:
     * Agglomerative (bottom-up) approach: Starts with each point as its own cluster and merges them
     * Divisive (top-down) approach: Starts with all points in one cluster and splits them

2. **Number of clusters**:
   - K-Means requires specifying K (number of clusters) beforehand.
   - Hierarchical clustering doesn't require pre-specifying K (you can decide later by cutting the dendrogram).

3. **Result structure**:
   - K-Means gives a flat set of clusters.
   - Hierarchical clustering gives a nested tree of clusters (dendrogram).

4. **Flexibility**:
   - K-Means tends to produce roughly spherical clusters of similar spatial extent.
   - Hierarchical clustering can reveal more complex cluster structures.

5. **Performance**:
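   - K-Means is generally faster: each iteration costs roughly O(n·K·d), so the runtime grows about linearly with the number of points n.
   - Hierarchical clustering is more computationally expensive: O(n²) to O(n³) time depending on the linkage and implementation, plus O(n²) memory for the distance matrix.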

## 5. What are the advantages of DBSCAN over K-Means?
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) has several advantages over K-Means:

1. **No need to specify number of clusters**: DBSCAN automatically determines the number of clusters based on data density.

2. **Handles arbitrary cluster shapes**: Unlike K-Means which finds spherical clusters, DBSCAN can find clusters of any shape.

3. **Robust to outliers**: DBSCAN explicitly models noise points, while K-Means forces all points into clusters.

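4. **Tolerates moderate density differences**: Because a single eps and min_samples apply to the whole dataset, DBSCAN can separate clusters whose densities differ moderately, but not clusters whose densities differ widely.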

5. **Works well with spatial data**: DBSCAN is particularly effective for geospatial data or data where density matters.

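6. **Mostly deterministic results**: For fixed parameters, core points and noise points always receive the same labels; only border points reachable from more than one cluster can change cluster depending on processing order. K-Means results, by contrast, depend on random initialization.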

7. **No assumption of cluster size**: K-Means tends to create clusters of similar size, while DBSCAN doesn't have this bias.

However, DBSCAN requires careful tuning of its parameters (eps and min_samples) and struggles with high-dimensional data or data with significantly varying densities.

## 6. When would you use Silhouette Score in clustering?
The Silhouette Score is used in these scenarios:

1. **Evaluating clustering quality**: When you want to assess how well-separated the clusters are and how appropriately each point has been assigned to its cluster.

2. **Determining optimal number of clusters**: When using methods like K-Means where you need to choose K, the Silhouette Score can help identify the K that produces the most distinct clusters.

3. **Comparing different clustering algorithms**: To objectively compare the results of different clustering approaches on the same data.

4. **Validating cluster assignments**: When you need to check if points are consistently closer to their own cluster than to other clusters.

The score ranges from -1 to 1, where:
- 1 indicates perfect clustering (points are very close to their cluster and far from others)
- 0 indicates overlapping clusters
- -1 indicates incorrect clustering

It's particularly useful when ground truth labels aren't available (which is typical in unsupervised learning).

## 7. What are the limitations of Hierarchical Clustering?
Hierarchical clustering has several limitations:

1. **Computational complexity**:
   - Agglomerative: O(n³) time and O(n²) memory for standard implementations
   - Divisive: Even more computationally expensive
   - Becomes impractical for large datasets (>10,000 points)

2. **Sensitivity to noise and outliers**: A few outliers can significantly affect the merging process and resulting hierarchy.

3. **Difficulty with updates**: Adding new data points requires recomputing the entire hierarchy.

4. **Irreversible decisions**: Once clusters are merged or split, the decision cannot be undone in subsequent steps.

5. **Difficulty choosing the "right" clusters**: Determining where to cut the dendrogram can be subjective.

6. **Memory intensive**: Needs to store the full distance/similarity matrix for agglomerative clustering.

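7. **Bias toward compact clusters**: With complete, average, or Ward linkage, it often performs poorly on non-convex cluster shapes, much like K-Means; single linkage can follow elongated shapes but is prone to chaining.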

8. **Linkage criteria impact**: Different linkage methods (single, complete, average, Ward's) can produce very different results.

9. **Difficulty scaling**: Doesn't work well with very large datasets due to memory and computational constraints.

## 8. Why is feature scaling important in clustering algorithms like K-Means?
Feature scaling is crucial for K-Means and many other clustering algorithms because:

1. **Distance-based sensitivity**: K-Means uses distance metrics (typically Euclidean) to assign points to clusters. Features with larger scales will dominate the distance calculation, making the algorithm effectively ignore features with smaller scales.

2. **Centroid calculation**: Since centroids are means of feature values, unscaled features with larger ranges will have a greater influence on centroid positions.

3. **Convergence issues**: Without scaling, the algorithm might take longer to converge or get stuck in poor local optima.

4. **Comparable features**: Scaling ensures all features contribute equally to the similarity measure, which is especially important when features represent different units or measurement scales.

Common scaling methods include:
- Standardization (subtract mean, divide by std dev) - good when data is normally distributed
- Min-Max scaling (scale to [0,1] range) - good for bounded data
- Robust scaling (uses median and IQR) - good for data with outliers

Without scaling, clustering results can be misleading and dominated by the highest-variance features regardless of their actual importance.
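
A small numeric sketch makes the effect concrete. The three "customers" below are invented for illustration; the standardization is done by hand with NumPy and should match what `StandardScaler` does with its default settings.

```python
import numpy as np

# Hypothetical customers: [age in years, income in dollars].
X = np.array([[25.0, 50_000.0],
              [60.0, 50_500.0],
              [26.0, 58_000.0]])

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

# Raw features: the income column dominates every distance.
raw_01 = dist(X[0], X[1])   # 35-year age gap, $500 income gap
raw_02 = dist(X[0], X[2])   # 1-year age gap, $8,000 income gap

# Standardize each column: subtract the mean, divide by the standard deviation.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
std_01 = dist(X_std[0], X_std[1])
std_02 = dist(X_std[0], X_std[2])

print(raw_01, raw_02)
print(std_01, std_02)
```

Before scaling, customer 0 looks far closer to customer 1 (about 500 versus about 8,000) even though their ages differ by 35 years. After standardization the two distances are roughly equal, so the age gap counts about as much as the income gap.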

## 9. How does DBSCAN identify noise points?
DBSCAN identifies noise points (also called outliers) through the following process:

1. **Core points**: A point is a core point if at least min_samples points (including itself) are within its ε (eps) neighborhood.

2. **Border points**: A point that is not a core point but is within the ε neighborhood of a core point.

3. **Noise points**: Any point that is neither a core point nor a border point is considered noise.

In other words:
- Noise points have fewer than min_samples points in their ε neighborhood.
- None of these points are within ε distance of any core point.
- They don't belong to any cluster's dense region.

These points are labeled as -1 in DBSCAN's output. The identification of noise is automatic and doesn't require a separate outlier detection step, which is one of DBSCAN's advantages over methods like K-Means that force all points into clusters.
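
The three definitions can be checked directly on a toy dataset. The helper `classify_points` below is a hypothetical name written for this example; it only labels point types from the definitions and does not build clusters the way DBSCAN itself does.

```python
import numpy as np

def classify_points(X, eps, min_samples):
    """Label each point 'core', 'border', or 'noise' from the DBSCAN definitions."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = d <= eps                         # a point counts as its own neighbor
    is_core = neighbors.sum(axis=1) >= min_samples
    # Border: not core, but inside the eps-neighborhood of at least one core point.
    near_core = (neighbors & is_core[None, :]).any(axis=1)
    return np.where(is_core, "core", np.where(near_core, "border", "noise"))

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],  # dense square
              [0.25, 0.1],                                      # on the edge of the square
              [5.0, 5.0]])                                      # isolated point
kinds = classify_points(X, eps=0.2, min_samples=4)
print(kinds)
```

The four corners of the square are core points, the point at (0.25, 0.1) has only three neighbors within eps but touches a core point (border), and the isolated point is noise.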

## 10. Define inertia in the context of K-Means.
Inertia, in K-Means clustering, is the sum of squared distances of samples to their closest cluster center. It's also called within-cluster sum of squares (WCSS). Mathematically, it's defined as:

Inertia = Σ (for all points i) distance(x_i, centroid of cluster containing x_i)²

Where:
- x_i is a data point
- The distance is typically Euclidean distance
- The centroid is the mean of all points in the cluster

Key properties:
1. **Optimization target**: K-Means tries to minimize inertia during its iterative process.
2. **Cluster compactness**: Lower inertia means tighter, more compact clusters.
3. **Limitation**: Inertia assumes clusters are convex and isotropic, and isn't reliable for comparing across different numbers of clusters or datasets.
4. **Elbow method**: Used to determine optimal K by looking for the "elbow" where inertia stops decreasing significantly.

Inertia tends to decrease as K increases (with K=n giving zero inertia but meaningless clusters), so it can't be used alone to choose K.
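
To make the definition and the K=n caveat concrete, here is a small hand computation. The function name `inertia` and the four toy points are assumptions for this sketch; scikit-learn exposes the same quantity as `KMeans.inertia_` for the clustering it finds.

```python
import numpy as np

def inertia(X, labels):
    """Sum of squared distances from each point to its cluster's mean."""
    total = 0.0
    for c in np.unique(labels):
        members = X[labels == c]
        centroid = members.mean(axis=0)
        total += float(((members - centroid) ** 2).sum())
    return total

X = np.array([[1.0, 1.0], [1.0, 3.0], [8.0, 8.0], [10.0, 8.0]])

one_cluster = inertia(X, np.array([0, 0, 0, 0]))   # K = 1
two_clusters = inertia(X, np.array([0, 0, 1, 1]))  # K = 2
n_clusters = inertia(X, np.arange(len(X)))         # K = n, every point alone

print(one_cluster, two_clusters, n_clusters)
```

The values fall from 104 (K=1) to 4 (K=2) to 0 (K=n), which shows why the lowest inertia alone cannot identify a sensible K.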

## 11. What is the elbow method in K-Means clustering?
The elbow method is a heuristic used to determine the optimal number of clusters (K) in K-Means clustering. Here's how it works:

1. **Process**:
   - Run K-Means for a range of K values (typically from 1 to some reasonable maximum)
   - For each K, calculate the inertia (within-cluster sum of squares)
   - Plot inertia against K

2. **Identifying the elbow**:
   - As K increases, inertia decreases (more clusters can fit the data better)
   - Look for the point where the rate of decrease sharply changes (the "elbow")
   - This point suggests diminishing returns from increasing K further

3. **Interpretation**:
   - Before the elbow: Adding clusters significantly improves fit
   - After the elbow: Additional clusters provide marginal improvement
   - The elbow K is often chosen as optimal

4. **Limitations**:
   - Sometimes no clear elbow exists
   - Subjective to identify exactly where the elbow is
   - Doesn't work well when data has overlapping clusters
   - Inertia always decreases with K, making very large K seem better

Often used in conjunction with other metrics like silhouette score for more robust K selection.

## 12. Describe the concept of "density" in DBSCAN.
In DBSCAN, "density" refers to the concentration of data points in a particular region of the feature space. The algorithm's core idea is that clusters are dense regions separated by less dense regions. Key aspects:

1. **Density parameters**:
   - ε (eps): Radius of the neighborhood around a point
   - min_samples: Minimum number of points required to form a dense region

2. **Density definitions**:
   - A point is in a dense region if at least min_samples points are within ε distance
   - A cluster is a maximal set of density-connected points
   - Density-reachable: A point q is density-reachable from p if there is a chain of points from p to q in which each point lies within ε of the previous one and every point except possibly q is a core point

3. **Density-based clustering**:
   - Discovers clusters of arbitrary shape by connecting dense regions
   - Can separate clusters by sparse regions
   - Identifies outliers as points in low-density regions

4. **Advantages of density-based approach**:
   - Doesn't assume spherical clusters like K-Means
   - Can find clusters of varying shapes and sizes
   - Naturally handles noise/outliers in sparse regions

The density concept makes DBSCAN particularly effective for spatial data and datasets with irregular cluster shapes.

## 13. Can hierarchical clustering be used on categorical data?
Yes, hierarchical clustering can be used with categorical data, but with some important considerations:

1. **Distance metrics**: Need to use appropriate dissimilarity measures for categorical data:
   - Hamming distance: Fraction of features that differ
   - Jaccard distance: For sets of categories
   - Simple matching coefficient: For binary categorical data
   - Gower's distance: Can handle mixed data types

2. **Linkage methods**: Some linkage methods work better than others:
   - Single linkage can lead to chaining
   - Complete or average linkage are often better choices

3. **Preprocessing**:
   - May need to one-hot encode or use other encoding schemes
   - Some implementations require numeric input

4. **Limitations**:
   - Results can be sensitive to the distance metric chosen
   - Some information may be lost in encoding categorical variables
   - Interpretation can be more challenging than with numerical data

5. **Alternatives**:
   - For purely categorical data, specific algorithms like ROCK or COOLCAT may be more appropriate
   - For mixed data, consider Gower's distance with hierarchical clustering

The key is choosing an appropriate dissimilarity measure that properly captures the relationships between categorical values.
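
As one possible sketch, SciPy can cluster integer-coded categorical records using Hamming distance. The records and the ad hoc integer encoding below are invented for illustration; the codes act only as identifiers, since Hamming distance checks equality and ignores their numeric order.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up records: (color, size, shape).
records = [("red", "small", "round"),
           ("red", "small", "square"),
           ("blue", "large", "round"),
           ("blue", "large", "square")]

# Integer-encode each column separately.
columns = list(zip(*records))
codes = np.array([[sorted(set(col)).index(v) for v in col] for col in columns]).T

# Hamming distance: fraction of features on which two records differ.
D = pdist(codes, metric="hamming")

# Average linkage on the condensed distance matrix, then cut into 2 clusters.
Z = linkage(D, method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

The two red/small records end up together and the two blue/large records end up together, because each pair differs in only one of three features.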

## 14. What does a negative Silhouette Score indicate?
A negative Silhouette Score indicates that:

1. **Poor clustering**: On average, data points are closer to points in other clusters than to points in their own cluster.

2. **Possible interpretations**:
   - The number of clusters may be incorrect (too many or too few)
   - The clustering algorithm may be inappropriate for the data structure
   - The data may not have meaningful cluster structure
   - The distance metric may not be suitable for the data

3. **Specific meaning**:
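   - For a point with silhouette s = (b − a) / max(a, b), a is the mean distance to the other points in its own cluster and b is the mean distance to the points in the nearest other cluster
   - A negative value means b < a: the point is, on average, closer to another cluster than to its own, which suggests it is assigned to the "wrong" cluster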

4. **Actions to consider**:
   - Try a different number of clusters
   - Consider a different clustering algorithm
   - Re-examine feature scaling or preprocessing
   - Check if the data actually contains meaningful clusters
   - Try a different distance metric if appropriate

Positive scores indicate that points are, on average, closer to their own cluster than to the nearest other one (higher is better); a negative average suggests that many points would fit better in a neighboring cluster.
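
A single misassigned point shows how a negative value arises. The helper `silhouette_of_point` is a hypothetical name for this sketch and assumes the point's own cluster has at least one other member; in practice `sklearn.metrics.silhouette_samples` computes the per-point values.

```python
import numpy as np

def silhouette_of_point(X, labels, i):
    """s(i) = (b - a) / max(a, b) for a single sample i."""
    dists = np.linalg.norm(X - X[i], axis=1)
    own = labels == labels[i]
    own[i] = False                        # exclude the point itself
    a = dists[own].mean()                 # mean distance within its own cluster
    b = min(dists[labels == c].mean()     # mean distance to the nearest other cluster
            for c in np.unique(labels) if c != labels[i])
    return float((b - a) / max(a, b))

# 1-D toy data: the last point sits next to cluster 1 but is labeled 0.
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
labels = np.array([0, 0, 0, 1, 1, 0])

s_wrong = silhouette_of_point(X, labels, 5)
s_ok = silhouette_of_point(X, labels, 0)
print(s_wrong, s_ok)
```

For the misassigned point, a = 11 and b = 1.5, giving a silhouette of about -0.86, while the point at 0.0 still scores positive.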

## 15. Explain the term "linkage criteria" in hierarchical clustering.
Linkage criteria determine how the distance between two clusters is computed during hierarchical clustering. Different criteria lead to different cluster structures. Common linkage methods:

1. **Single linkage** (nearest neighbor):
   - Distance between clusters = minimum distance between any two points in different clusters
   - Tends to produce "chaining" - long, elongated clusters
   - Sensitive to noise and outliers

2. **Complete linkage** (farthest neighbor):
   - Distance = maximum distance between any two points in different clusters
   - Tends to produce compact, spherical clusters of similar size
   - Less sensitive to noise but can break large clusters

3. **Average linkage**:
   - Distance = average distance between all pairs of points in different clusters
   - Compromise between single and complete linkage
   - Less sensitive to outliers than single linkage

4. **Ward's method** (minimum variance):
   - Minimizes the total within-cluster variance
   - Merges clusters that lead to smallest increase in total variance
   - Tends to produce clusters of similar size
   - Works well with Euclidean distance

5. **Centroid linkage**:
   - Distance = distance between cluster centroids
   - Can lead to inversion in dendrograms (where later merges occur at lower distances than earlier ones)

The choice of linkage affects:
- The shape and size of resulting clusters
- The interpretation of the dendrogram
- The algorithm's sensitivity to noise
- The computational complexity
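
The criteria differ only in how they summarize the point-to-point distances between two clusters. The two tiny clusters below are made up for illustration, and the Ward value is computed with the standard merge-cost formula |A||B| / (|A| + |B|) · ‖c_A − c_B‖², the increase in within-cluster sum of squares.

```python
import numpy as np

# Two small clusters (toy data).
A = np.array([[0.0, 0.0], [0.0, 2.0]])
B = np.array([[3.0, 0.0], [3.0, 4.0]])

# All pairwise distances between a point in A and a point in B.
pair = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

single = float(pair.min())      # nearest pair
complete = float(pair.max())    # farthest pair
average = float(pair.mean())    # mean over all pairs

cA, cB = A.mean(axis=0), B.mean(axis=0)
centroid = float(np.linalg.norm(cA - cB))   # distance between centroids

# Ward: increase in total within-cluster sum of squares if A and B merge.
ward = len(A) * len(B) / (len(A) + len(B)) * float(((cA - cB) ** 2).sum())

print(single, complete, average, centroid, ward)
```

The same pair of clusters is 3 apart under single linkage and 5 apart under complete linkage, which is why the merge order, and therefore the dendrogram, can change with the criterion.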

## 16. Why might K-Means clustering perform poorly on data with varying cluster sizes or densities?
K-Means tends to perform poorly with varying cluster sizes or densities because:

1. **Equal-size assumption**: K-Means implicitly assumes clusters are roughly equally sized, as it assigns points to the nearest centroid without considering cluster density or size.

2. **Spherical cluster bias**: The algorithm works best when clusters are spherical and equally dense, as it uses Euclidean distance which spreads equally in all directions.

3. **Centroid influence**: In clusters with different densities:
   - Dense clusters may be split to assign points to sparse clusters
   - Sparse clusters may be merged into nearby dense clusters

4. **Distance metric limitation**: The boundary between two clusters lies halfway between their centroids, regardless of how large or dense each cluster is, so points on the edge of a large cluster can be assigned to a neighboring small one.

5. **Initialization sensitivity**: With uneven cluster sizes, random initialization is more likely to place initial centroids in large clusters, ignoring smaller ones.

6. **Variance sensitivity**: K-Means minimizes within-cluster variance, which can lead to:
   - Splitting large, loose clusters
   - Merging small, tight clusters

7. **Example failure cases**:
   - One large, diffuse cluster and one small, dense cluster
   - Clusters with very different numbers of points
   - Adjacent clusters with different densities

Alternatives like DBSCAN or Gaussian Mixture Models often perform better on such data.

## 17. What are the core parameters in DBSCAN, and how do they influence clustering?
DBSCAN has two core parameters:

1. **eps (ε)**:
   - The maximum distance between two points for one to be considered in the neighborhood of the other
   - Influences:
     * Cluster size: Larger ε leads to larger clusters (more points are reachable)
     * Number of clusters: Larger ε may merge separate clusters
     * Noise points: Larger ε may convert noise to border points
   - Too small: Many small clusters and noise
   - Too large: Few large clusters, may merge distinct groups

2. **min_samples**:
   - Minimum number of points required to form a dense region (core point)
   - Influences:
     * Cluster formation: Higher values require more points to start a cluster
     * Noise sensitivity: Higher values make algorithm more robust to noise
     * Cluster granularity: Higher values find only very dense regions as clusters
   - Too small: Many small clusters including around noise
   - Too large: May miss legitimate smaller clusters

Choosing parameters:
- For ε: Plot each point's distance to its k-th nearest neighbor in sorted order (with k ≈ min_samples) and pick ε near the "knee" of the curve
- For min_samples: Depends on data size and expected cluster density (often start with 2*dimensions)
- Domain knowledge about expected cluster density helps

Additional considerations:
- Distance metric choice affects ε interpretation
- Data scaling is crucial as ε is sensitive to feature scales
- Higher dimensions require larger min_samples due to curse of dimensionality
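
The k-distance heuristic can be sketched with plain NumPy. The helper name `k_distances`, the synthetic data, and the choice k=4 are assumptions for this example; for larger datasets `sklearn.neighbors.NearestNeighbors` would be the more practical way to get the same distances.

```python
import numpy as np

def k_distances(X, k):
    """Sorted distance from each point to its k-th nearest neighbor (self excluded)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    d.sort(axis=1)             # column 0 is the zero distance to the point itself
    return np.sort(d[:, k])

# Synthetic data: one dense blob plus two far-away outliers.
rng = np.random.default_rng(0)
blob = rng.normal(0.0, 0.3, size=(50, 2))
outliers = np.array([[5.0, 5.0], [-5.0, 4.0]])
X = np.vstack([blob, outliers])

kd = k_distances(X, k=4)
print(kd[-5:])   # the last values jump sharply: those are the outliers
```

Plotting `kd` would show a flat curve followed by a sharp rise; a value of ε just below the rise keeps the blob together while leaving the two outliers as noise.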

## 18. How does K-Means++ improve upon standard K-Means initialization?
K-Means++ improves standard K-Means initialization by providing a smarter method for selecting initial centroids, addressing these issues with random initialization:

1. **Standard K-Means problems**:
   - Random initialization can lead to poor clusterings
   - Centroids might start too close together
   - Some clusters might get no initial centroids
   - Requires more runs to get good results

2. **K-Means++ algorithm**:
   a. Choose first centroid uniformly at random from data points
   b. For each subsequent centroid:
      i. Compute D(x), the distance from each point to nearest existing centroid
      ii. Choose new centroid with probability proportional to D(x)²
      iii. Repeat until K centroids are chosen
   c. Proceed with standard K-Means

3. **Advantages**:
   - Initial centroids are spread out, covering the dataset better
   - Higher probability of picking centroids in different clusters
   - Leads to more consistent and better final clusterings
   - Often converges faster with fewer iterations
   - Reduces need for multiple random initializations

4. **Theoretical guarantees**:
   - Expected approximation ratio of O(log k) to optimal clustering
   - In practice, often finds better solutions than random initialization

5. **Practical impact**:
   - Typically requires fewer iterations to converge
   - Produces more stable results across different runs
   - Especially helpful when K is large, since purely random seeding is then more likely to leave some clusters without an initial centroid
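
The seeding procedure in step 2 can be written compactly. This is a simplified sketch with a hypothetical function name `kmeans_pp_init`; scikit-learn's own K-Means++ additionally tries several candidates per step, which is not reproduced here.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Choose k initial centroids from X using D(x)^2 weighting."""
    rng = np.random.default_rng(seed)
    # First centroid: uniformly at random from the data points.
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        C = np.array(centroids)
        # D(x)^2: squared distance from each point to its nearest chosen centroid.
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # Sample the next centroid with probability proportional to D(x)^2.
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

# Toy data: three loose groups of distinct points.
X = np.array([[0.0, 0.0], [0.5, 0.2], [0.1, 0.6],
              [8.0, 8.0], [8.4, 7.9], [7.8, 8.5],
              [0.0, 9.0], [0.3, 9.4], [0.6, 8.8]])
init = kmeans_pp_init(X, k=3)
print(init)
```

An already chosen point has D(x)² = 0 and can never be picked again, and far-away points are strongly favored, so the seeds tend to land in different groups; the resulting array can then be passed to a standard K-Means loop.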

## 19. What is agglomerative clustering?
Agglomerative clustering is a bottom-up hierarchical clustering approach where:

1. **Initialization**:
   - Start with each data point as its own cluster (n clusters)

2. **Iterative process**:
   a. Compute pairwise distances between all clusters
   b. Merge the two closest clusters based on linkage criterion
   c. Update distance matrix to reflect new cluster
   d. Repeat until all points are in one cluster

3. **Key components**:
   - **Distance metric**: How to measure distance between points (Euclidean, Manhattan, etc.)
   - **Linkage criterion**: How to define distance between clusters (single, complete, average, Ward's)

4. **Output**:
   - Produces a dendrogram showing the complete merging history
   - Can be cut at any level to obtain a specific number of clusters

5. **Characteristics**:
   - Deterministic (unlike K-Means with random initialization)
   - Reveals hierarchical structure in data
   - Doesn't require pre-specifying number of clusters
   - More interpretable than flat clustering methods

6. **Variants**:
   - **Single linkage**: Minimum distance between clusters (can lead to chaining)
   - **Complete linkage**: Maximum distance (creates compact clusters)
   - **Average linkage**: Average distance between points
   - **Ward's method**: Minimizes variance when merging

7. **Applications**:
   - When hierarchical relationships are important
   - For small to medium datasets
   - When you want to explore data at multiple granularities
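
The iterative process in step 2 can be sketched as a deliberately naive loop. The function name `agglomerative` and the toy data are assumptions for this example; it recomputes cluster distances from scratch at every step (roughly O(n³) overall), whereas library implementations update a distance matrix more efficiently.

```python
import numpy as np

def agglomerative(X, n_clusters, linkage="single"):
    """Naive agglomerative clustering; returns a list of index lists."""
    combine = {"single": np.min, "complete": np.max, "average": np.mean}[linkage]
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = [[i] for i in range(len(X))]       # start: every point is its own cluster
    while len(clusters) > n_clusters:
        best = (np.inf, None, None)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Linkage criterion applied to all cross-cluster point distances.
                dist = combine(d[np.ix_(clusters[a], clusters[b])])
                if dist < best[0]:
                    best = (dist, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]   # merge the two closest clusters
        del clusters[b]
    return clusters

X = np.array([[0.0], [1.0], [2.5], [9.0], [10.0]])
result = agglomerative(X, n_clusters=2)
print(result)
```

Stopping at `n_clusters=2` is equivalent to cutting the dendrogram just below the final merge; letting the loop run down to one cluster and recording each merge would give the full hierarchy.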

## 20. What makes Silhouette Score a better metric than just inertia for model evaluation?
The Silhouette Score is often more informative than inertia because:

1. **Relative vs absolute measure**:
   - Inertia is an absolute measure of cluster compactness
   - Silhouette Score measures both cluster cohesion (how close points are to their cluster) and separation (how far points are from other clusters)

2. **Normalization**:
   - Silhouette Score is normalized between -1 and 1, making it comparable across different datasets and scales
   - Inertia values are dataset-specific and not directly comparable

3. **Cluster separation**:
   - Inertia only measures within-cluster distances
   - Silhouette considers both within-cluster and between-cluster distances

4. **Number of clusters**:
   - Inertia always decreases as K increases, making it hard to choose K
   - Silhouette Score can indicate when adding clusters doesn't improve separation

5. **Interpretability**:
   - Silhouette values:
     * Near 1: Points are well-clustered
     * Near 0: Points are on/very near cluster boundaries
     * Negative: Points may be in wrong cluster
   - Inertia has no such direct interpretation

6. **Shape awareness**:
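   - Silhouette is computed from pairwise distances rather than centroids, so it works with any distance metric, although it still tends to favor compact, convex clusters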
   - Inertia assumes spherical clusters due to centroid-based measurement

7. **Practical advantages**:
   - Helps identify poorly clustered points
   - Can compare different clustering algorithms
   - Works better for choosing K in many cases

However, Silhouette Score is more computationally expensive to calculate (O(n²)) and may not work well with very large datasets.

# **Practical**

In [None]:
# 21. Generate synthetic data with 4 centers using make_blobs and apply K-Means clustering. Visualize using a scatter plot

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

X, y = make_blobs(n_samples=300, centers=4, random_state=42)
kmeans = KMeans(n_clusters=4, random_state=42)
kmeans.fit(X)
y_pred = kmeans.predict(X)

plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=200, c='red', marker='X')
plt.title("K-Means Clustering with 4 Centers")
plt.show()



In [None]:
# 22. Load the Iris dataset and use Agglomerative Clustering to group the data into 3 clusters. Display the first 10 predicted labels


from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering

iris = load_iris()
X = iris.data
agg = AgglomerativeClustering(n_clusters=3)
y_pred = agg.fit_predict(X)

print("First 10 predicted labels:", y_pred[:10])

In [None]:
# 23. Generate synthetic data using make_moons and apply DBSCAN. Highlight outliers in the plot

from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
dbscan = DBSCAN(eps=0.3, min_samples=5)
y_pred = dbscan.fit_predict(X)

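# Plot clustered points, then overlay noise points (label -1) as red crosses.
noise = y_pred == -1
plt.scatter(X[~noise, 0], X[~noise, 1], c=y_pred[~noise], cmap='viridis')
plt.scatter(X[noise, 0], X[noise, 1], c='red', marker='x', label='Outliers (noise)')
plt.title("DBSCAN on Moons Dataset")
plt.legend()
plt.show()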


In [None]:
# 24. Load the Wine dataset and apply K-Means clustering after standardizing the features. Print the size of each cluster

from sklearn.datasets import load_wine
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X = wine.data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X_scaled)
y_pred = kmeans.predict(X_scaled)

from collections import Counter
print("Cluster sizes:", Counter(y_pred))

In [None]:
# 25. Use make_circles to generate synthetic data and cluster it using DBSCAN. Plot the result

from sklearn.datasets import make_circles
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt

X, _ = make_circles(n_samples=300, noise=0.05, factor=0.5, random_state=42)
dbscan = DBSCAN(eps=0.2, min_samples=5)
y_pred = dbscan.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap='viridis')
plt.title("DBSCAN on Circles Dataset")
plt.show()

In [None]:
# 26. Load the Breast Cancer dataset, apply MinMaxScaler, and use K-Means with 2 clusters. Output the cluster centroids

from sklearn.datasets import load_breast_cancer
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

cancer = load_breast_cancer()
X = cancer.data
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

kmeans = KMeans(n_clusters=2, random_state=42)
kmeans.fit(X_scaled)

print("Cluster centroids:\n", kmeans.cluster_centers_)

In [None]:
# 27. Generate synthetic data using make_blobs with varying cluster standard deviations and cluster with DBSCAN

from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=[1.0, 2.5, 0.5], random_state=42)
dbscan = DBSCAN(eps=0.8, min_samples=5)
y_pred = dbscan.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap='viridis')
plt.title("DBSCAN on Blobs with Varying STD")
plt.show()

In [None]:
# 28. Load the Digits dataset, reduce it to 2D using PCA, and visualize clusters from K-Means


from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

digits = load_digits()
X = digits.data
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

kmeans = KMeans(n_clusters=10, random_state=42)
y_pred = kmeans.fit_predict(X_pca)

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y_pred, cmap='tab10')
plt.title("K-Means Clusters on Digits PCA")
plt.show()

In [None]:
# 29. Create synthetic data using make_blobs and evaluate silhouette scores for k = 2 to 5. Display as a bar chart

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
scores = []
k_values = range(2, 6)

for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42)
    y_pred = kmeans.fit_predict(X)
    score = silhouette_score(X, y_pred)
    scores.append(score)

plt.bar(k_values, scores)
plt.xlabel("Number of clusters (k)")
plt.ylabel("Silhouette Score")
plt.title("Silhouette Scores for Different k Values")
plt.show()

In [None]:
# 30. Load the Iris dataset and use hierarchical clustering to group data. Plot a dendrogram with average linkage

from sklearn.datasets import load_iris
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt

iris = load_iris()
X = iris.data

Z = linkage(X, method='average')
plt.figure(figsize=(10, 5))
dendrogram(Z)
plt.title("Dendrogram with Average Linkage (Iris Dataset)")
plt.xlabel("Sample Index")
plt.ylabel("Distance")
plt.show()

In [None]:
# 31. Generate synthetic data with overlapping clusters using make_blobs, then apply K-Means and visualize with decision boundaries

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as plt

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=2.0, random_state=42)
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)
y_pred = kmeans.predict(X)

# Create mesh grid
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                     np.arange(y_min, y_max, 0.1))
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3, cmap='viridis')
plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap='viridis')
plt.title("K-Means with Decision Boundaries")
plt.show()

In [None]:
# 32. Load the Digits dataset and apply DBSCAN after reducing dimensions with t-SNE. Visualize the results

from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt

digits = load_digits()
X = digits.data
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X)

dbscan = DBSCAN(eps=5, min_samples=5)
y_pred = dbscan.fit_predict(X_tsne)

plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y_pred, cmap='tab10')
plt.title("DBSCAN Clustering after t-SNE")
plt.show()


In [None]:
# 33. Generate synthetic data using make_blobs and apply Agglomerative Clustering with complete linkage. Plot the result

from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
agg = AgglomerativeClustering(n_clusters=4, linkage='complete')
y_pred = agg.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap='viridis')
plt.title("Agglomerative Clustering with Complete Linkage")
plt.show()


In [None]:
# 34. Load the Breast Cancer dataset and compare inertia values for K = 2 to 6 using K-Means. Show results in a line plot

from sklearn.datasets import load_breast_cancer
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

cancer = load_breast_cancer()
X = cancer.data

inertias = []
k_values = range(2, 7)

for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X)
    inertias.append(kmeans.inertia_)

plt.plot(k_values, inertias, marker='o')
plt.xlabel('Number of clusters (K)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal K')
plt.xticks(k_values)
plt.show()

In [None]:
# 35. Generate synthetic concentric circles using make_circles and cluster using Agglomerative Clustering with single linkage.

from sklearn.datasets import make_circles
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt

X, _ = make_circles(n_samples=300, noise=0.05, factor=0.5, random_state=42)
agg = AgglomerativeClustering(n_clusters=2, linkage='single')
y_pred = agg.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap='viridis')
plt.title('Agglomerative Clustering with Single Linkage')
plt.show()

In [None]:
# 36. Use the Wine dataset, apply DBSCAN after scaling the data, and count the number of clusters (excluding noise)

from sklearn.datasets import load_wine
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X = wine.data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

dbscan = DBSCAN(eps=1.5, min_samples=5)
y_pred = dbscan.fit_predict(X_scaled)

n_clusters = len(set(y_pred)) - (1 if -1 in y_pred else 0)
print(f"Number of clusters (excluding noise): {n_clusters}")
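
The `eps=1.5` used above is a fixed guess, and the cluster count DBSCAN reports depends heavily on it. One common heuristic, sketched here as an assumption rather than the original author's method, is to look at the sorted distance from each point to its k-th nearest neighbour (with k equal to `min_samples`) and pick `eps` near the point where that curve bends upward.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_wine().data)

min_samples = 5  # same value as in the DBSCAN call above

# Querying the training data returns each point as its own first neighbour,
# which matches DBSCAN's convention that min_samples includes the point itself.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_scaled)
distances, _ = nn.kneighbors(X_scaled)

# Sorted distance to the farthest of the min_samples neighbours.
k_dist = np.sort(distances[:, -1])

for q in (50, 75, 90):
    print(f"{q}th percentile of k-distance: {np.percentile(k_dist, q):.2f}")
```

Plotting `k_dist` with `plt.plot(k_dist)` gives the usual k-distance curve; the percentiles printed here are only a rough numeric summary of it.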


In [None]:
# 37. Generate synthetic data with make_blobs and apply KMeans. Then plot the cluster centers on top of the data points

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
kmeans = KMeans(n_clusters=4, random_state=42)
kmeans.fit(X)
y_pred = kmeans.predict(X)

plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap='viridis', alpha=0.5)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=200, c='red', marker='X', label='Cluster Centers')
plt.title('K-Means Clustering with Cluster Centers')
plt.legend()
plt.show()

In [None]:
# 38. Load the Iris dataset, cluster with DBSCAN, and print how many samples were identified as noise

from sklearn.datasets import load_iris
from sklearn.cluster import DBSCAN

iris = load_iris()
X = iris.data

dbscan = DBSCAN(eps=0.5, min_samples=5)
y_pred = dbscan.fit_predict(X)

n_noise = list(y_pred).count(-1)
print(f"Number of noise samples: {n_noise}")

In [None]:
# 39. Generate synthetic non-linearly separable data using make_moons, apply K-Means, and visualize the clustering result

from sklearn.datasets import make_moons
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
kmeans = KMeans(n_clusters=2, random_state=42)
# Fit and predict in one step; calling predict() on an unfitted model raises NotFittedError
y_pred = kmeans.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap='viridis')
plt.title('K-Means on Non-linearly Separable Data (Moons)')
plt.show()
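
The plot shows K-Means cutting the moons roughly in half rather than following their shape. As a hedged sketch, the mismatch can also be measured with the adjusted Rand index against the generator's labels, with DBSCAN as a density-based comparison. The `eps=0.3` and `min_samples=5` values are assumptions chosen for this noise level, and the true labels are used only for evaluation, not for clustering.

```python
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import adjusted_rand_score

# Keep the generator's labels this time so the clusterings can be scored.
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=42)

y_kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
y_dbscan = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)  # assumed parameters

# Adjusted Rand index: 1.0 means perfect agreement, about 0 means chance level.
ari_kmeans = adjusted_rand_score(y_true, y_kmeans)
ari_dbscan = adjusted_rand_score(y_true, y_dbscan)

print(f"K-Means ARI: {ari_kmeans:.3f}")
print(f"DBSCAN ARI:  {ari_dbscan:.3f}")
```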

In [None]:
# 40. Load the Digits dataset, apply PCA to reduce to 3 components, then use KMeans and visualize with a 3D scatter plot.

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

digits = load_digits()
X = digits.data
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X)

kmeans = KMeans(n_clusters=10, random_state=42)
y_pred = kmeans.fit_predict(X_pca)

fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X_pca[:, 0], X_pca[:, 1], X_pca[:, 2], c=y_pred, cmap='tab10', s=20)
ax.set_title('K-Means Clustering on Digits (3D PCA)')
plt.show()

In [None]:
# 41. Generate synthetic blobs with 5 centers and apply KMeans. Then use silhouette_score to evaluate the clustering

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

X, _ = make_blobs(n_samples=500, centers=5, random_state=42)
kmeans = KMeans(n_clusters=5, random_state=42)
y_pred = kmeans.fit_predict(X)

score = silhouette_score(X, y_pred)
print(f"Silhouette Score: {score:.3f}")

plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap='viridis')
plt.title(f"K-Means Clustering (Silhouette Score: {score:.3f})")
plt.show()


In [None]:
# 42. Load the Breast Cancer dataset, reduce dimensionality using PCA, and apply Agglomerative Clustering. Visualize in 2D

from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt

cancer = load_breast_cancer()
X = cancer.data
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

agg = AgglomerativeClustering(n_clusters=2)
y_pred = agg.fit_predict(X_pca)

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y_pred, cmap='viridis')
plt.title("Agglomerative Clustering on PCA-reduced Breast Cancer Data")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()

In [None]:
# 43. Generate noisy circular data using make_circles and visualize clustering results from KMeans and DBSCAN side-by-side

from sklearn.datasets import make_circles
from sklearn.cluster import KMeans, DBSCAN
import matplotlib.pyplot as plt

X, _ = make_circles(n_samples=300, noise=0.05, factor=0.5, random_state=42)

# KMeans
kmeans = KMeans(n_clusters=2, random_state=42)
y_kmeans = kmeans.fit_predict(X)

# DBSCAN
dbscan = DBSCAN(eps=0.2, min_samples=5)
y_dbscan = dbscan.fit_predict(X)

# Plot
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, cmap='viridis')
plt.title("KMeans Clustering")

plt.subplot(1, 2, 2)
plt.scatter(X[:, 0], X[:, 1], c=y_dbscan, cmap='viridis')
plt.title("DBSCAN Clustering")
plt.show()

In [None]:
# 44. Load the Iris dataset and plot the Silhouette Coefficient for each sample after KMeans clustering

from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples
import matplotlib.pyplot as plt
import numpy as np

iris = load_iris()
X = iris.data
kmeans = KMeans(n_clusters=3, random_state=42)
y_pred = kmeans.fit_predict(X)

silhouette_vals = silhouette_samples(X, y_pred)
y_lower = 10

plt.figure(figsize=(8, 6))
for i in range(3):
    cluster_silhouette_vals = silhouette_vals[y_pred == i]
    cluster_silhouette_vals.sort()
    y_upper = y_lower + cluster_silhouette_vals.shape[0]
    plt.fill_betweenx(np.arange(y_lower, y_upper),
                     0, cluster_silhouette_vals,
                     alpha=0.7)
    plt.text(-0.05, y_lower + 0.5 * cluster_silhouette_vals.shape[0], str(i))
    y_lower = y_upper + 10

plt.axvline(x=np.mean(silhouette_vals), color="red", linestyle="--")
plt.title("Silhouette Plot for Iris Dataset")
plt.xlabel("Silhouette Coefficient Values")
plt.ylabel("Cluster Label")
plt.show()

In [None]:
# 45. Generate synthetic data using make_blobs and apply Agglomerative Clustering with 'average' linkage. Visualize clusters

from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
agg = AgglomerativeClustering(n_clusters=4, linkage='average')
y_pred = agg.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap='viridis')
plt.title("Agglomerative Clustering with Average Linkage")
plt.show()

In [None]:
# 46. Load the Wine dataset, apply KMeans, and visualize the cluster assignments in a seaborn pairplot (first 4 features)

from sklearn.datasets import load_wine
from sklearn.cluster import KMeans
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt  # needed for plt.suptitle and plt.show below

wine = load_wine()
X = wine.data
kmeans = KMeans(n_clusters=3, random_state=42)
y_pred = kmeans.fit_predict(X)

# Create DataFrame with first 4 features and cluster assignments
df = pd.DataFrame(X[:, :4], columns=wine.feature_names[:4])
df['Cluster'] = y_pred

sns.pairplot(df, hue='Cluster', palette='viridis')
plt.suptitle("Pairplot of Wine Data with KMeans Clusters", y=1.02)
plt.show()

In [None]:
# 47. Generate noisy blobs using make_blobs and use DBSCAN to identify both clusters and noise points. Print the count

from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN
import numpy as np

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.5, random_state=42)
dbscan = DBSCAN(eps=0.8, min_samples=10)
y_pred = dbscan.fit_predict(X)

unique, counts = np.unique(y_pred, return_counts=True)
for label, count in zip(unique, counts):
    if label == -1:
        print(f"Noise points: {count}")
    else:
        print(f"Cluster {label} size: {count}")

In [None]:
# 48. Load the Digits dataset, reduce dimensions using t-SNE, then apply Agglomerative Clustering and plot the clusters

from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt

digits = load_digits()
X = digits.data
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X)

agg = AgglomerativeClustering(n_clusters=10)
y_pred = agg.fit_predict(X_tsne)

plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y_pred, cmap='tab10', s=10)
plt.title("Agglomerative Clustering on t-SNE reduced Digits Data")
plt.show()