
# Lecture 09a: K-Means Basics (In-Class Micro-Lab)

**Duration:** ~50 minutes  
**Artifact:** `elbow_and_silhouette.png` (saved to the repo)

## Learning Objectives
By the end of this micro-lab you will be able to:
- Explain what clustering does and why it is unsupervised.
- Run K-Means on a simple dataset and visualize the clusters.
- Choose a reasonable number of clusters using inertia and silhouette score.
- Produce a clean visualization artifact and a brief written reflection.

> **Instructions:** Work top-to-bottom. Look for the **👉 Your Turn** prompts and complete those cells.


In [None]:

# ✅ Environment check (scikit-learn, numpy, matplotlib)
import sys, platform
import numpy as np
import matplotlib.pyplot as plt

print("Python:", sys.version.split()[0], "| Platform:", platform.platform())
# We import sklearn where it's needed below to keep import errors easy to spot.



---
## 1) Intro to Clustering (Concept + Visualization)

Clustering tries to find structure in unlabeled data by grouping similar points.  
There are **no labels**: the algorithm discovers patterns based on a similarity notion (usually distance).
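
To make "similarity as distance" concrete, here is a minimal sketch using three made-up points (`a`, `b`, and `c` are hypothetical values chosen for illustration, not part of the lab dataset). Points with a small Euclidean distance between them are the ones a clustering algorithm would tend to group together.

```python
import numpy as np

# Three hypothetical 2-D points: a and b sit close together, c is far away.
a = np.array([1.0, 1.0])
b = np.array([1.5, 1.2])
c = np.array([8.0, 9.0])

def euclidean(p, q):
    """Straight-line (Euclidean) distance between two points."""
    return float(np.sqrt(np.sum((p - q) ** 2)))

d_ab = euclidean(a, b)
d_ac = euclidean(a, c)

print(f"distance(a, b) = {d_ab:.2f}")
print(f"distance(a, c) = {d_ac:.2f}")
# a is much closer to b than to c, so a and b are the natural pair to group.
```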


In [None]:

# Generate a simple unlabeled dataset
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

# Quick scatter plot
plt.figure()
plt.scatter(X[:, 0], X[:, 1], s=20)
plt.title("Unlabeled Data (Visual Exploration)")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.tight_layout()
plt.savefig("clusters_raw.png", dpi=150)
plt.show()

print("Saved visual exploration figure to clusters_raw.png")



### 👉 Your Turn (Short Reflection)
In **2–3 sentences**, answer:
1. How can an algorithm group points without labels?
2. Looking at the scatter, how many clusters do *you* expect? Why?



---
## 2) Implementing K-Means (Hands-On)

We'll fit a K-Means model, visualize the clusters, and compare different `k` values.
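
Before using scikit-learn, it may help to see the idea behind K-Means in a few lines. The sketch below is a simplified version of the classic two-step loop (assign each point to its nearest center, then move each center to the mean of its points). It is not scikit-learn's implementation; the function name `lloyd_kmeans`, its parameters, and the tiny `points` dataset are all assumptions made for this illustration.

```python
import numpy as np

def nearest_center(X, centers):
    """Return, for each row of X, the index of the closest center."""
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

def lloyd_kmeans(X, init_centers, max_iters=100):
    """Simplified K-Means loop starting from the given initial centers."""
    centers = np.asarray(init_centers, dtype=float).copy()
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest center.
        labels = nearest_center(X, centers)
        # Update step: each center moves to the mean of its points
        # (a center with no points stays where it is).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(len(centers))
        ])
        if np.allclose(new_centers, centers):
            break  # centers stopped moving, so the loop has converged
        centers = new_centers
    return nearest_center(X, centers), centers

# Hypothetical toy data: two tight groups of three points each.
points = np.array([[0, 0], [0, 1], [1, 0],
                   [10, 10], [10, 11], [11, 10]], dtype=float)

# Deliberately poor start: both initial centers come from the first group.
final_labels, final_centers = lloyd_kmeans(points, init_centers=points[:2])

print("labels: ", final_labels)
print("centers:", final_centers.round(2))
```

Even with both starting centers in the same group, the loop separates the two groups on this toy data. On harder data a poor start can lead to a worse result, which is one reason library implementations use smarter initialization and multiple restarts.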


In [None]:

from sklearn.cluster import KMeans

# Set your initial guess for the number of clusters
K = 4  # 👉 Try 3, 5, or 6 after you run once

# n_init="auto" requires scikit-learn >= 1.2; on older versions use n_init=10
kmeans = KMeans(n_clusters=K, n_init="auto", random_state=0)
labels = kmeans.fit_predict(X)
centroids = kmeans.cluster_centers_

# Plot results with centroids
plt.figure()
plt.scatter(X[:, 0], X[:, 1], s=20, c=labels)
plt.scatter(centroids[:, 0], centroids[:, 1], marker="x", s=100, c="red")  # red so centroids stand out
plt.title(f"K-Means Clustering (k={K})")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.tight_layout()
plt.savefig("kmeans_clusters.png", dpi=150)
plt.show()

print("Saved clustering figure to kmeans_clusters.png")



### 👉 Your Turn (Experiment)
1. Re-run the cell above with **different values of `K`** (e.g., 3, 5, 6).  
2. In 2–3 sentences, describe what happens when `K` is **too small** vs **too large**.



---
## 3) Evaluating K-Means (Elbow + Silhouette)

Two common guides for choosing `k`:
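- **Inertia** (within-cluster sum of squared distances to each point's centroid): lower means tighter clusters, but it keeps shrinking as `k` grows, so look for the "elbow" where the improvement levels off rather than for the minimum.
- **Silhouette score** (ranges from -1 to 1): larger is better; it compares how close each point is to its own cluster with how close it is to the nearest other cluster.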
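
To make the two metrics concrete, here is a minimal sketch that recomputes both by hand and compares them with scikit-learn's values. The `_demo` variable names and the choice of `n_init=10` are assumptions made for this illustration, and the silhouette loop is written for readability rather than speed (it also assumes every cluster has at least two points).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances, silhouette_score

# Same kind of data as the lab, under separate names to avoid clobbering X.
X_demo, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)
model_demo = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_demo)
labels_demo = model_demo.labels_

# Inertia: sum of squared distances from each point to its assigned centroid.
assigned_centers = model_demo.cluster_centers_[labels_demo]
manual_inertia = float(np.sum((X_demo - assigned_centers) ** 2))

# Silhouette: for each point, a = mean distance to the other points in its own
# cluster, b = smallest mean distance to the points of any other cluster.
D = pairwise_distances(X_demo)
sil_values = []
for i in range(len(X_demo)):
    same = labels_demo == labels_demo[i]
    same[i] = False  # exclude the point itself
    a_i = D[i, same].mean()
    b_i = min(
        D[i, labels_demo == other].mean()
        for other in np.unique(labels_demo)
        if other != labels_demo[i]
    )
    sil_values.append((b_i - a_i) / max(a_i, b_i))
manual_silhouette = float(np.mean(sil_values))

print("inertia    (manual vs sklearn):", round(manual_inertia, 3),
      round(model_demo.inertia_, 3))
print("silhouette (manual vs sklearn):", round(manual_silhouette, 4),
      round(silhouette_score(X_demo, labels_demo), 4))
```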


In [None]:

from sklearn.metrics import silhouette_score

ks = list(range(2, 11))
inertias = []
sil_scores = []

for k in ks:
    model = KMeans(n_clusters=k, n_init="auto", random_state=0).fit(X)
    inertias.append(model.inertia_)
    labels_k = model.labels_
    sil = silhouette_score(X, labels_k)
    sil_scores.append(sil)

print("ks:", ks)
print("inertia:", [round(v, 2) for v in inertias])
print("silhouette:", [round(v, 3) for v in sil_scores])


In [None]:

# Plot the elbow (inertia) curve
plt.figure()
plt.plot(ks, inertias, marker="o")
plt.title("Elbow Method (Inertia vs k)")
plt.xlabel("k (number of clusters)")
plt.ylabel("Inertia (lower is better)")
plt.xticks(ks)
plt.tight_layout()
plt.savefig("elbow.png", dpi=150)
plt.show()

print("Saved elbow curve to elbow.png")


In [None]:

# Plot the silhouette score curve
plt.figure()
plt.plot(ks, sil_scores, marker="o")
plt.title("Silhouette Score vs k (higher is better)")
plt.xlabel("k (number of clusters)")
plt.ylabel("Silhouette Score")
plt.xticks(ks)
plt.tight_layout()
plt.savefig("silhouette.png", dpi=150)
plt.show()

print("Saved silhouette curve to silhouette.png")


In [None]:

# Create a single combined artifact image by stacking elbow.png over silhouette.png
from PIL import Image

elbow_img = Image.open("elbow.png")
sil_img = Image.open("silhouette.png")

# Pad to same width if needed
w = max(elbow_img.width, sil_img.width)
def pad_to_width(img, w):
    if img.width == w:
        return img
    new_img = Image.new("RGB", (w, img.height), (255, 255, 255))
    new_img.paste(img, ((w - img.width)//2, 0))
    return new_img

elbow_img_p = pad_to_width(elbow_img, w)
sil_img_p   = pad_to_width(sil_img, w)

combined = Image.new("RGB", (w, elbow_img_p.height + sil_img_p.height), (255, 255, 255))
combined.paste(elbow_img_p, (0, 0))
combined.paste(sil_img_p, (0, elbow_img_p.height))
combined.save("elbow_and_silhouette.png")

print("Saved combined artifact to elbow_and_silhouette.png")



---
## 4) Artifact & Reflection (Submit These)

**Files to keep/commit:**
- `kmeans_clusters.png`: your best clustering figure
- `elbow_and_silhouette.png`: combined evaluation artifact
- This notebook with your written responses

### 👉 Your Turn (Reflection)
In **3–5 sentences**, answer:
1. Based on your elbow and silhouette results, what `k` would you choose for this dataset, and why?  
2. In your own words, what does K-Means optimize?  
3. Name one scenario where K-Means might perform poorly (and briefly explain why).

> **Tip:** When done, `git add . && git commit -m "Lecture 09a artifacts"` and push.



---
### ⭐ Optional Extension (If You Have Time)
- Try `init="random"` vs `init="k-means++"` and compare runtime / results (inertia).
- Add Gaussian noise or change `cluster_std` in `make_blobs` and see how silhouette changes.
- Try a non-spherical dataset (e.g., two interleaving moons via `make_moons`) and note K-Means' limitations.
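
As a starting point for the last item, here is a minimal sketch of the two-moons experiment. The `noise=0.05` setting, the variable names, and the use of `adjusted_rand_score` (which compares the clustering against the known moon membership) are assumptions made for this illustration; the lab itself does not prescribe them.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-circles; y_moons records which moon each point is on.
X_moons, y_moons = make_moons(n_samples=300, noise=0.05, random_state=0)

moon_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_moons)

# Adjusted Rand index: 1.0 means the clusters match the moons exactly,
# values near 0 mean the agreement is no better than chance.
ari = adjusted_rand_score(y_moons, moon_labels)
print(f"Adjusted Rand index on two moons: {ari:.2f}")
```

Because K-Means assigns each point to the nearest centroid, the boundary between two clusters is a straight line, so it tends to cut across both moons instead of following their curved shapes. Plotting `X_moons` colored by `moon_labels` makes this easy to see.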
