Theoretical Questions
1. What is unsupervised learning in the context of machine learning?

Unsupervised learning is a machine learning approach where the model learns patterns from unlabeled data without predefined outputs. It identifies structures, such as clusters or associations, in the data.

2. How does K-Means clustering algorithm work?

K-Means partitions data into K clusters by:

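Initializing K centroids, for example by picking K data points at random.
Assigning each point to its nearest centroid.
Updating each centroid to the mean of the points assigned to it.
Repeating the assignment and update steps until the assignments stop changing or a maximum number of iterations is reached.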
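
The four steps above can be sketched in plain NumPy. This is a minimal illustration, not scikit-learn's implementation; the function name kmeans_sketch, its defaults, and the toy data are assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Minimal Lloyd's-algorithm sketch; name and defaults are illustrative."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points.
        # An empty cluster keeps its previous centroid.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy data (assumed for illustration): two well-separated groups of three points
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centroids = kmeans_sketch(X, k=2)
print("Labels:", labels)
print("Centroids:\n", centroids)
```

On this toy data the two groups end up in separate clusters, although which group receives label 0 depends on the random initialization.
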
3. Explain the concept of a dendrogram in hierarchical clustering.

A dendrogram is a tree-like diagram showing the hierarchical relationship between data points in hierarchical clustering. It visualizes the order and distance of merges (or splits) between clusters.

4. What is the main difference between K-Means and Hierarchical Clustering?

K-Means partitions data into a fixed number of clusters (K) using centroids, while hierarchical clustering builds a tree of clusters (dendrogram) without requiring a predefined K, allowing nested clusters.

5. What are the advantages of DBSCAN over K-Means?

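Identifies clusters of arbitrary shape.
Automatically detects noise/outliers.
No need to specify the number of clusters in advance.
Less affected by outliers, because noise points are left unassigned instead of pulling a centroid toward them. (With a single global eps, however, DBSCAN can struggle when cluster densities differ widely.)

6. When would you use Silhouette Score in clustering?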

Silhouette Score is used to evaluate clustering quality by measuring how similar points are within their cluster compared to other clusters. It’s used to select optimal K or compare clustering algorithms.

7. What are the limitations of Hierarchical Clustering?

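High computational cost: standard agglomerative algorithms typically need O(n²) memory and between O(n²) and O(n³) time, depending on the linkage and implementation.
Sensitive to noise and outliers.
Hard to scale to large datasets.
Merges (or splits) are greedy and cannot be undone later.

8. Why is feature scaling important in clustering algorithms like K-Means?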

Feature scaling ensures all features contribute equally to distance calculations. K-Means relies on Euclidean distance, so unscaled features with larger ranges can dominate clustering.
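
A small sketch of this effect follows. The data (an income column and an age column) and the helper nearest_other are hypothetical, chosen only to make the difference visible.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: column 0 is income, column 1 is age.
X = np.array([[30000.0, 25.0],
              [30400.0, 45.0],
              [32000.0, 26.0],
              [80000.0, 50.0],
              [75000.0, 30.0]])

def nearest_other(X, i):
    """Index of the point closest to X[i] in Euclidean distance, excluding X[i]."""
    d = np.linalg.norm(X - X[i], axis=1)
    d[i] = np.inf
    return int(d.argmin())

# Unscaled: income differences dominate the distance.
before = nearest_other(X, 0)

# Standardized: each feature has zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)
after = nearest_other(X_scaled, 0)

print("Nearest neighbor of point 0 before scaling:", before)
print("Nearest neighbor of point 0 after scaling:", after)
```

Before scaling, point 0 is matched with the point whose income is closest, even though their ages differ by 20 years. After scaling, it is matched with the point that is similar in both income and age.
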

9. How does DBSCAN identify noise points?

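DBSCAN labels a point as noise if it is not a core point (it has fewer than min_samples points within radius eps) and it does not lie within eps of any core point. Such points cannot be reached from any dense region; scikit-learn assigns them the label -1.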
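
The rule can be checked by hand against scikit-learn. The sketch below assumes Euclidean distance and counts a point as its own neighbor, which is how scikit-learn's DBSCAN counts min_samples; the dataset and parameter values are arbitrary choices for illustration.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.1, random_state=42)
eps, min_samples = 0.2, 5

# Core points: at least min_samples points (including itself) within eps.
within_eps = cdist(X, X) <= eps
is_core = within_eps.sum(axis=1) >= min_samples

# A non-core point within eps of some core point is a border point;
# a non-core point with no core point within eps is noise.
near_core = (within_eps & is_core[None, :]).any(axis=1)
is_noise = ~is_core & ~near_core

db = DBSCAN(eps=eps, min_samples=min_samples).fit(X)
labels = db.labels_

print("Manual noise count:", int(is_noise.sum()))
print("DBSCAN noise count:", int(np.sum(labels == -1)))
```
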

10. Define inertia in the context of K-Means.

Inertia is the sum of squared distances between each point and its assigned cluster centroid, measuring cluster compactness.
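
As a quick check of the definition, inertia can be recomputed from the fitted centroids and labels and compared with the inertia_ attribute. The dataset and parameters below are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)

# Look up the centroid assigned to each point, then sum the squared distances.
assigned_centroids = kmeans.cluster_centers_[kmeans.labels_]
manual_inertia = np.sum((X - assigned_centroids) ** 2)

print("Manual inertia: ", manual_inertia)
print("kmeans.inertia_:", kmeans.inertia_)
```
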

11. What is the elbow method in K-Means clustering?

The elbow method plots inertia against K (number of clusters) and selects K where adding more clusters yields diminishing reductions in inertia, forming an "elbow" shape.

12. Describe the concept of "density" in DBSCAN.

Density in DBSCAN refers to the number of points within a radius eps. Points with at least min_samples neighbors within eps are core points, forming dense regions (clusters).

13. Can hierarchical clustering be used on categorical data?

Yes, with appropriate distance metrics (e.g., Hamming distance) and linkage criteria. However, numerical data with Euclidean distance is more common.

14. What does a negative Silhouette Score indicate?

A negative Silhouette Score indicates that a point is closer to points in another cluster than its own, suggesting poor clustering quality or misassignment.
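
A worked example, using the standard definition s = (b - a) / max(a, b), where a is the mean distance from a point to the other members of its own cluster and b is the mean distance to the members of the nearest other cluster. The one-dimensional data and the deliberately wrong label are assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Hypothetical 1-D data: the point at 4.0 is deliberately given the wrong label.
X = np.array([[0.0], [1.0], [4.0], [5.0], [6.0]])
labels = np.array([0, 0, 0, 1, 1])

i = 2  # the misassigned point at 4.0
own = (labels == labels[i]) & (np.arange(len(X)) != i)
other = labels != labels[i]  # only one other cluster exists here

a = np.abs(X[own] - X[i]).mean()    # mean distance to the rest of its own cluster
b = np.abs(X[other] - X[i]).mean()  # mean distance to the other cluster
s = (b - a) / max(a, b)

print("a =", a, "b =", b, "s =", s)
print("silhouette_samples:", silhouette_samples(X, labels)[i])
```

Here a = 3.5 and b = 1.5, so s = -2 / 3.5, roughly -0.57: the point sits closer to the other cluster than to its own.
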

15. Explain the term "linkage criteria" in hierarchical clustering.

Linkage criteria define how distances between clusters are calculated in hierarchical clustering (e.g., single: minimum distance, complete: maximum distance, average: mean distance).
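
The three criteria named above can be computed directly from the matrix of pairwise distances between two clusters. The two small clusters A and B below are hypothetical.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two small hypothetical clusters in 2-D
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[3.0, 0.0], [5.0, 0.0]])

# Euclidean distance between every point in A and every point in B
D = cdist(A, B)

single = D.min()     # distance of the closest pair
complete = D.max()   # distance of the farthest pair
average = D.mean()   # mean over all pairs

print("single:", single, "complete:", complete, "average:", average)
# single: 2.0 complete: 5.0 average: 3.5
```
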

16. Why might K-Means clustering perform poorly on data with varying cluster sizes or density?

K-Means assumes spherical clusters of similar size and density. Varying sizes or densities lead to incorrect centroid placement and poor cluster separation.

17. What are the core parameters in DBSCAN, and how do they influence clustering?

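eps: Radius of the neighborhood around each point; values that are too small leave many points as noise or split clusters into fragments, while values that are too large merge separate clusters.
min_samples: Minimum number of points within eps for a point to count as a core point; higher values demand denser regions, so more points are labeled as noise and small clusters may be missed.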
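
One way to see these effects is to sweep each parameter while holding the other fixed. The sketch below uses make_moons and an arbitrary grid of values; the helper summarize is hypothetical.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.1, random_state=42)

def summarize(eps, min_samples):
    """Return (number of clusters, number of noise points) for one setting."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    return n_clusters, int(np.sum(labels == -1))

# Vary eps with min_samples fixed, then vary min_samples with eps fixed.
for eps in (0.05, 0.1, 0.2, 0.4):
    print("eps =", eps, "-> (clusters, noise) =", summarize(eps, 5))
for min_samples in (3, 5, 10, 20):
    print("min_samples =", min_samples, "-> (clusters, noise) =", summarize(0.2, min_samples))
```

On a fixed dataset, the number of noise points can only stay the same or shrink as eps grows, and can only stay the same or grow as min_samples grows. The number of clusters does not follow such a simple pattern.
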
18. How does K-Means++ improve upon standard K-Means initialization?

K-Means++ chooses the first centroid uniformly at random from the data, then picks each subsequent centroid with probability proportional to the squared distance from a point to its nearest already-chosen centroid. This spreads the initial centroids out and makes poor initializations less likely.
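
A minimal sketch of this seeding rule in NumPy follows. The function name kmeans_pp_init and the toy data are assumptions for illustration; library implementations may add refinements that this sketch omits.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Sketch of k-means++ seeding; name and signature are illustrative."""
    rng = np.random.default_rng(seed)
    # First centroid: one data point chosen uniformly at random.
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from every point to its nearest chosen centroid.
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        # Sample the next centroid with probability proportional to d2.
        # Points already chosen have d2 = 0 and cannot be picked again.
        next_idx = rng.choice(len(X), p=d2 / d2.sum())
        centroids.append(X[next_idx])
    return np.array(centroids)

# Toy data (assumed): two tight groups far apart
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [10.0, 10.0], [10.1, 10.0], [10.0, 10.1]])
init = kmeans_pp_init(X, k=2)
print(init)
```

Because far-away points carry almost all of the probability mass, the second centroid is very likely, though not guaranteed, to land in the other group.
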

19. What is agglomerative clustering?

Agglomerative clustering is a bottom-up hierarchical clustering method that starts with each point as a cluster and iteratively merges the closest pairs based on a linkage criterion.

20. What makes Silhouette Score a better metric than just inertia for model evaluation?

Silhouette Score evaluates both cohesion (within-cluster distance) and separation (between-cluster distance), while inertia only measures within-cluster compactness, ignoring inter-cluster separation.

In [None]:
# 21. Generate synthetic data with 4 centers using make_blobs and apply K-Means clustering. Visualize using a scatter plot.


from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Generate data
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Apply K-Means
kmeans = KMeans(n_clusters=4, random_state=42)
labels = kmeans.fit_predict(X)

# Visualize
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.title('K-Means Clustering (4 Centers)')
plt.show()

In [None]:
#22 Load the Iris dataset and use Agglomerative Clustering to group the data into 3 clusters. Display the first 10 predicted labels.


from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering

# Load data
iris = load_iris()
X = iris.data

# Apply Agglomerative Clustering
agg = AgglomerativeClustering(n_clusters=3)
labels = agg.fit_predict(X)

# Print first 10 labels
print("First 10 predicted labels:", labels[:10])

In [None]:
#23 Generate synthetic data using make_moons and apply DBSCAN. Highlight outliers in the plot.


from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt
import numpy as np

# Generate data
X, _ = make_moons(n_samples=200, noise=0.1, random_state=42)

# Apply DBSCAN
dbscan = DBSCAN(eps=0.2, min_samples=5)
labels = dbscan.fit_predict(X)

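# Visualize: clustered points colored by label, outliers (label -1) in red
noise = labels == -1
plt.scatter(X[~noise, 0], X[~noise, 1], c=labels[~noise], cmap='viridis')
plt.scatter(X[noise, 0], X[noise, 1], c='red', marker='x', label='Outliers')
plt.legend()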
plt.title('DBSCAN Clustering (Outliers in Red)')
plt.show()

In [None]:
#24 Load the Wine dataset and apply K-Means clustering after standardizing the features. Print the size of each cluster.


from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import numpy as np

# Load data
wine = load_wine()
X = wine.data

# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply K-Means
kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(X_scaled)

# Print cluster sizes
print("Cluster sizes:", np.bincount(labels))

In [None]:

#25 Use make_circles to generate synthetic data and cluster it using DBSCAN. Plot the result.


from sklearn.datasets import make_circles
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt

# Generate data
X, _ = make_circles(n_samples=200, factor=0.5, noise=0.05, random_state=42)

# Apply DBSCAN
dbscan = DBSCAN(eps=0.2, min_samples=5)
labels = dbscan.fit_predict(X)

# Visualize
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.title('DBSCAN on Concentric Circles')
plt.show()

In [None]:
#26 Load the Breast Cancer dataset, apply MinMaxScaler, and use K-Means with 2 clusters. Output the cluster centroids.

from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

# Load data
cancer = load_breast_cancer()
X = cancer.data

# Scale
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Apply K-Means
kmeans = KMeans(n_clusters=2, random_state=42)
kmeans.fit(X_scaled)

# Print centroids
print("Cluster centroids:\n", kmeans.cluster_centers_)

In [1]:
# 27. Generate synthetic data using make_blobs with varying cluster standard deviations and cluster with DBSCAN.

from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt

# Generate data
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=[0.5, 1.0, 1.5], random_state=42)

# Apply DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
labels = dbscan.fit_predict(X)

# Visualize (noise points, if any, have label -1)
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.title('DBSCAN on Blobs with Varying Standard Deviations')
plt.show()

In [None]:


# 28. Load the Digits dataset, reduce to 2D using PCA, and visualize clusters from K-Means.

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Load data
digits = load_digits()
X = digits.data

# Reduce to 2D
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Apply K-Means
kmeans = KMeans(n_clusters=10, random_state=42)
labels = kmeans.fit_predict(X_pca)

# Visualize
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap='viridis')
plt.title('K-Means on Digits (PCA 2D)')
plt.show()

In [None]:
#29. Create synthetic data using make_blobs and evaluate silhouette scores for k=2 to 5. Display as a bar chart.

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

# Generate data
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Compute silhouette scores
scores = []
for k in range(2, 6):
    kmeans = KMeans(n_clusters=k, random_state=42)
    labels = kmeans.fit_predict(X)
    scores.append(silhouette_score(X, labels))

# Visualize
plt.bar(range(2, 6), scores)
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Scores for K-Means')
plt.show()

In [None]:

#30 Load the Iris dataset and use hierarchical clustering to group data. Plot a dendrogram with average linkage.


from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt

# Load data
iris = load_iris()
X = iris.data

# Perform hierarchical clustering
Z = linkage(X, method='average')

# Plot dendrogram
plt.figure(figsize=(10, 5))
dendrogram(Z)
plt.title('Dendrogram (Average Linkage) for Iris')
plt.show()

In [None]:
#31 Generate synthetic data with overlapping clusters using make_blobs, then apply K-Means and visualize with decision boundaries.

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import numpy as np

# Generate data
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.5, random_state=42)

# Apply K-Means
kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(X)

# Create mesh grid for decision boundaries
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1), np.arange(y_min, y_max, 0.1))
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# Visualize
plt.contourf(xx, yy, Z, alpha=0.3, cmap='viridis')
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.title('K-Means with Decision Boundaries')
plt.show()

In [None]:
# 32. Load the Digits dataset and apply DBSCAN after reducing dimensions with t-SNE. Visualize the results.
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt

# Load data
digits = load_digits()
X = digits.data

# Reduce to 2D
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X)

# Apply DBSCAN
dbscan = DBSCAN(eps=2, min_samples=5)
labels = dbscan.fit_predict(X_tsne)

# Visualize
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=labels, cmap='viridis')
plt.title('DBSCAN on Digits (t-SNE 2D)')
plt.show()

In [None]:
# 33. Generate synthetic data using make_blobs and apply Agglomerative Clustering with complete linkage. Plot the result.



from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt

# Generate data
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Apply Agglomerative Clustering
agg = AgglomerativeClustering(n_clusters=4, linkage='complete')
labels = agg.fit_predict(X)

# Visualize
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.title('Agglomerative Clustering (Complete Linkage)')
plt.show()

In [None]:

# 34. Load the Breast Cancer dataset and compare inertia values for K=2 to 6 using K-Means. Show results in a line plot.
from sklearn.datasets import load_breast_cancer
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Load data
cancer = load_breast_cancer()
X = cancer.data

# Compute inertia
inertias = []
for k in range(2, 7):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X)
    inertias.append(kmeans.inertia_)

# Visualize
plt.plot(range(2, 7), inertias, marker='o')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.title('Inertia for K-Means on Breast Cancer')
plt.show()

In [None]:

# 35. Generate synthetic concentric circles using make_circles and cluster using Agglomerative Clustering with single linkage.
from sklearn.datasets import make_circles
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt

# Generate data
X, _ = make_circles(n_samples=200, factor=0.5, noise=0.05, random_state=42)

# Apply Agglomerative Clustering
agg = AgglomerativeClustering(n_clusters=2, linkage='single')
labels = agg.fit_predict(X)

# Visualize
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.title('Agglomerative Clustering (Single Linkage) on Circles')
plt.show()

In [None]:

# 36. Use the Wine dataset, apply DBSCAN after scaling the data, and count the number of clusters (excluding noise).



from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
import numpy as np

# Load data
wine = load_wine()
X = wine.data

# Scale
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply DBSCAN
dbscan = DBSCAN(eps=1.5, min_samples=5)
labels = dbscan.fit_predict(X_scaled)

# Count clusters (exclude noise: label -1)
n_clusters = len(np.unique(labels)) - (1 if -1 in labels else 0)
print("Number of clusters:", n_clusters)

In [None]:
# 37. Generate synthetic data with make_blobs and apply K-Means. Then plot the cluster centers on top of the data points.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Generate data
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Apply K-Means
kmeans = KMeans(n_clusters=4, random_state=42)
labels = kmeans.fit_predict(X)

# Visualize
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], c='red', marker='x', s=200, linewidths=3)
plt.title('K-Means with Cluster Centers')
plt.show()

In [None]:
# 38. Load the Iris dataset, cluster with DBSCAN, and print how many samples were identified as noise.

from sklearn.datasets import load_iris
from sklearn.cluster import DBSCAN
import numpy as np

# Load data
iris = load_iris()
X = iris.data

# Apply DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
labels = dbscan.fit_predict(X)

# Count noise points (label -1)
noise_count = np.sum(labels == -1)
print("Number of noise points:", noise_count)

In [None]:
# 39. Generate synthetic non-linearly separable data using make_moons, apply K-Means, and visualize the clustering result.

from sklearn.datasets import make_moons
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Generate data
X, _ = make_moons(n_samples=200, noise=0.1, random_state=42)

# Apply K-Means
kmeans = KMeans(n_clusters=2, random_state=42)
labels = kmeans.fit_predict(X)

# Visualize
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.title('K-Means on Non-Linearly Separable Data')
plt.show()

In [None]:


# 40. Load the Digits dataset, apply PCA to reduce to 3 components, then use K-Means and visualize with a 3D scatter plot.

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Load data
digits = load_digits()
X = digits.data

# Reduce to 3D
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X)

# Apply K-Means
kmeans = KMeans(n_clusters=10, random_state=42)
labels = kmeans.fit_predict(X_pca)

# Visualize
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X_pca[:, 0], X_pca[:, 1], X_pca[:, 2], c=labels, cmap='viridis')
ax.set_title('K-Means on Digits (PCA 3D)')
plt.show()

In [None]:

# 41. Generate synthetic blobs with 5 centers and apply K-Means. Then use silhouette_score to evaluate the clustering.

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Generate data
X, _ = make_blobs(n_samples=500, centers=5, random_state=42)

# Apply K-Means
kmeans = KMeans(n_clusters=5, random_state=42)
labels = kmeans.fit_predict(X)

# Compute silhouette score
score = silhouette_score(X, labels)
print("Silhouette Score:", score)

In [None]:

# 42. Load the Breast Cancer dataset, reduce dimensionality using PCA, and apply Agglomerative Clustering. Visualize in 2D.

from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt

# Load data
cancer = load_breast_cancer()
X = cancer.data

# Reduce to 2D
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Apply Agglomerative Clustering
agg = AgglomerativeClustering(n_clusters=2)
labels = agg.fit_predict(X_pca)

# Visualize
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap='viridis')
plt.title('Agglomerative Clustering on Breast Cancer (PCA 2D)')
plt.show()

In [None]:



# 43. Generate noisy circular data using make_circles and visualize clustering results from K-Means and DBSCAN side-by-side.

from sklearn.datasets import make_circles
from sklearn.cluster import KMeans, DBSCAN
import matplotlib.pyplot as plt

# Generate data
X, _ = make_circles(n_samples=200, factor=0.5, noise=0.05, random_state=42)

# Apply K-Means
kmeans = KMeans(n_clusters=2, random_state=42)
kmeans_labels = kmeans.fit_predict(X)

# Apply DBSCAN
dbscan = DBSCAN(eps=0.2, min_samples=5)
dbscan_labels = dbscan.fit_predict(X)

# Visualize side-by-side
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(X[:, 0], X[:, 1], c=kmeans_labels, cmap='viridis')
ax1.set_title('K-Means on Circles')
ax2.scatter(X[:, 0], X[:, 1], c=dbscan_labels, cmap='viridis')
ax2.set_title('DBSCAN on Circles')
plt.show()

In [None]:
# 44. Load the Iris dataset and plot the Silhouette Coefficient for each sample after K-Means clustering.

from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples
import matplotlib.pyplot as plt
import numpy as np

# Load data
iris = load_iris()
X = iris.data

# Apply K-Means
kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(X)

# Compute silhouette scores
silhouette_vals = silhouette_samples(X, labels)

# Visualize
plt.bar(range(len(silhouette_vals)), silhouette_vals)
plt.xlabel('Sample Index')
plt.ylabel('Silhouette Coefficient')
plt.title('Silhouette Coefficients for Iris (K-Means)')
plt.show()

In [None]:



# 45. Generate synthetic data using make_blobs and apply Agglomerative Clustering with 'average' linkage. Visualize clusters.


from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt

# Generate data
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Apply Agglomerative Clustering
agg = AgglomerativeClustering(n_clusters=4, linkage='average')
labels = agg.fit_predict(X)

# Visualize
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.title('Agglomerative Clustering (Average Linkage)')
plt.show()

In [None]:
# 46. Load the Wine dataset, apply K-Means, and visualize the cluster assignments in a seaborn pairplot (first 4 features).

from sklearn.datasets import load_wine
from sklearn.cluster import KMeans
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

# Load data
wine = load_wine()
X = wine.data[:, :4]  # First 4 features
df = pd.DataFrame(X, columns=wine.feature_names[:4])

# Apply K-Means
kmeans = KMeans(n_clusters=3, random_state=42)
df['Cluster'] = kmeans.fit_predict(X)

# Visualize
sns.pairplot(df, hue='Cluster', diag_kind='hist')
plt.suptitle('K-Means Clustering on Wine (First 4 Features)', y=1.02)
plt.show()

In [None]:

# 47. Generate noisy blobs using make_blobs and use DBSCAN to identify both clusters and noise points. Print the count.

from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN
import numpy as np

# Generate data
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)

# Apply DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
labels = dbscan.fit_predict(X)

# Count clusters and noise
n_clusters = len(np.unique(labels)) - (1 if -1 in labels else 0)
n_noise = np.sum(labels == -1)
print(f"Number of clusters: {n_clusters}, Number of noise points: {n_noise}")

In [None]:
# 48. Load the Digits dataset, reduce dimensions using t-SNE, then apply Agglomerative Clustering and plot the clusters.



from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt

# Load data
digits = load_digits()
X = digits.data

# Reduce to 2D
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X)

# Apply Agglomerative Clustering
agg = AgglomerativeClustering(n_clusters=10)
labels = agg.fit_predict(X_tsne)

# Visualize
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=labels, cmap='viridis')
plt.title('Agglomerative Clustering on Digits (t-SNE 2D)')
plt.show()