# FML 05 — Minimal Solutions for **Task (2p)**
This notebook contains a **minimal, ready‑to‑run** solution for the assignment:

- Use **Agglomerative Clustering** on **clusters5** and **densegrid** datasets
- Find a reasonable number of clusters from the **dendrogram**
- Compare **single** vs **complete** linkage on each dataset
- Show **scatter plots** of the data colored by cluster labels
- Add a short **Markdown interpretation** below the plots


In [None]:
import io, zipfile, requests
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram

# NOTE: The download below needs internet access; if this environment is offline, fetch the zip elsewhere first.
ZIP_URL = "https://github.com/rasvob/VSB-FEI-Fundamentals-of-Machine-Learning-Exercises/raw/master/datasets/data_clustering.zip"


In [None]:
def load_dataset_from_zip(zf: zipfile.ZipFile, base_name: str) -> np.ndarray:
    """Load a 2D dataset (N x 2) from the zip by trying common extensions."""
    # Collect archive members (files only) whose name contains base_name
    candidates = [
        name for name in zf.namelist()
        if base_name.lower() in name.lower() and not name.endswith("/")
    ]
    if not candidates:
        raise FileNotFoundError(f"Couldn't find any file for '{base_name}' in the zip.")
    # Prefer .npy, then .csv, then .txt (False sorts before True)
    extensions = (".npy", ".csv", ".txt")
    candidates.sort(key=lambda n: tuple(not n.lower().endswith(ext) for ext in extensions))

    for fname in candidates:
        raw = io.BytesIO(zf.read(fname))
        lower = fname.lower()  # match extensions case-insensitively
        if lower.endswith(".npy"):
            return np.load(raw)
        # np.loadtxt assumes purely numeric rows with no header line;
        # pass skiprows=1 or a different delimiter if the file differs.
        if lower.endswith(".csv"):
            return np.loadtxt(raw, delimiter=",")
        if lower.endswith(".txt"):
            return np.loadtxt(raw)
    raise RuntimeError(f"Failed to load dataset '{base_name}'.")

In [None]:
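def plot_sklearn_dendrogram(model, **kwargs):
    """Plot a dendrogram for a fitted AgglomerativeClustering model.

    The model must expose `distances_`, so fit it with `distance_threshold`
    set or with `compute_distances=True`. Adapted from the scikit-learn docs.
    """
    # Count the samples under each merged node (fourth column of the linkage matrix)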
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count

    linkage_matrix = np.column_stack([model.children_, model.distances_, counts]).astype(float)
    dendrogram(linkage_matrix, **kwargs)
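
The helper above assembles the same four-column linkage matrix that SciPy's own `linkage` function returns. As a small illustration of that layout, here is a sketch on four made-up 1-D points (toy values, not part of the assignment data):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Four 1-D points forming two tight pairs: {0, 1} and {5, 6}
X_toy = np.array([[0.0], [1.0], [5.0], [6.0]])

# Each row of Z describes one merge:
#   [child_a, child_b, merge_distance, samples_in_new_cluster]
# Child ids below n_samples are original points; id n_samples + i is the
# cluster created by row i.
Z = linkage(X_toy, method="single")
print(Z)

final_merge = Z[-1]
print("last merge distance:", final_merge[2])
print("samples under the root:", int(final_merge[3]))
```

The last row joins the two pairs, so its count column equals the total number of samples. That is the same bookkeeping the `counts` loop performs for `model.children_`.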

In [None]:
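# Download the zip (requires internet)
resp = requests.get(ZIP_URL, timeout=30)
resp.raise_for_status()  # fail early on HTTP errors instead of parsing an error page as a zip
zf = zipfile.ZipFile(io.BytesIO(resp.content))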

datasets = {}
for name in ["clusters5", "densegrid"]:
    data = load_dataset_from_zip(zf, name)
    # Ensure shape (N,2)
    data = np.asarray(data)
    if data.ndim == 1:
        data = data.reshape(-1, 2)
    datasets[name] = data

for k,v in datasets.items():
    print(k, v.shape)

In [None]:
for name, data in datasets.items():
    for linkage in ["single", "complete"]:
        model = AgglomerativeClustering(distance_threshold=0, n_clusters=None, linkage=linkage)
        model = model.fit(data)

        plt.figure(figsize=(8, 4))
        plt.title(f"{name} — dendrogram ({linkage} linkage)")
        # Full tree with one leaf per sample; hide leaf labels, which overlap for large N
        plot_sklearn_dendrogram(model, no_labels=True)
        plt.xlabel("Samples")
        plt.ylabel("Distance")
        plt.show()
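
Reading k off a dendrogram usually means looking for the tallest vertical stretch with no merges. One way to make that concrete is to look for the largest jump between consecutive merge distances. The sketch below is a heuristic, not part of the assignment; `suggest_k_by_largest_gap` is a hypothetical helper, and the data are synthetic blobs standing in for the real datasets:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs


def suggest_k_by_largest_gap(distances):
    """Hypothetical helper: cut the tree inside the largest jump between merge heights."""
    gaps = np.diff(distances)        # gaps[i] = distances[i + 1] - distances[i]
    i = int(np.argmax(gaps))         # cut after merge i, before merge i + 1
    n_samples = len(distances) + 1   # a full tree has n_samples - 1 merges
    return n_samples - (i + 1)       # each merge reduces the cluster count by one


# Synthetic stand-in data: three well-separated blobs (not the assignment datasets)
X_demo, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [20, 0], [10, 17]],
    cluster_std=0.3,
    random_state=0,
)

# Build the full tree so that distances_ is available
tree = AgglomerativeClustering(distance_threshold=0, n_clusters=None, linkage="complete").fit(X_demo)
suggested_k = suggest_k_by_largest_gap(tree.distances_)
print("suggested k:", suggested_k)
```

Treat the result as a starting point and check it against the plot. With single linkage on noisy data, the largest jump can simply split off a few outliers.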

In [None]:
# Minimal, sensible picks for k (adjust if your dendrogram suggests otherwise)
k_map = {
    "clusters5": 5,
    "densegrid": 2
}

for name, data in datasets.items():
    k = k_map[name]
    for linkage in ["single", "complete"]:
        model = AgglomerativeClustering(linkage=linkage, n_clusters=k)
        labels = model.fit_predict(data)

        plt.figure(figsize=(5,5))
        plt.title(f"{name} — {linkage} linkage (k={k})")
        plt.scatter(data[:,0], data[:,1], c=labels, s=12)
        plt.xlabel("x1")
        plt.ylabel("x2")
        plt.tight_layout()
        plt.show()

### ✍️ Interpretation (write this before the defense)
- **clusters5**: The dendrogram shows a clear separation into about 5 groups, and with **k=5** both linkages recover visually meaningful clusters. **Complete** linkage yields more compact, roughly spherical groups; **single** linkage can chain together nearby points where neighboring groups border each other.

- **densegrid**: The points are densely packed on a grid, so a strict **k=2** split is somewhat artificial. **Single** linkage connects neighboring points early (chaining), which makes the cluster boundaries irregular; **complete** linkage favors more balanced partitions but may cut across the grid. In practice, a density-based method (e.g., **DBSCAN**) could be more appropriate here.

- Takeaway: **No one linkage criterion works best everywhere**; choose it by the cluster shape you expect (complete for compact groups, single for elongated ones) and validate with visual inspection and domain needs.
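
The compact-versus-elongated trade-off can be checked on a small synthetic case. The sketch below uses two made-up parallel strips of points (not the assignment data) and scores each linkage with the adjusted Rand index against the known strip membership:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

# Two elongated, parallel strips: 30 points each, spacing 1 along x, 3 apart in y
xs = np.arange(30, dtype=float)
strip_a = np.column_stack([xs, np.zeros_like(xs)])
strip_b = np.column_stack([xs, np.full_like(xs, 3.0)])
X_strips = np.vstack([strip_a, strip_b])
y_true = np.array([0] * 30 + [1] * 30)

scores = {}
for linkage in ["single", "complete"]:
    labels = AgglomerativeClustering(n_clusters=2, linkage=linkage).fit_predict(X_strips)
    # Adjusted Rand index: 1.0 means the labels match the two strips exactly
    scores[linkage] = adjusted_rand_score(y_true, labels)
    print(f"{linkage:>8}: ARI = {scores[linkage]:.2f}")
```

Single linkage follows each strip because neighbors within a strip are closer than the gap between strips. Complete linkage penalizes long clusters, so it cannot keep a whole strip together and cuts across both strips instead.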
