# Boletin 1 Practice Report
This notebook is my personal record of the Boletin 1 work. I explain every task in simple English so I can present it later in class.


In [None]:
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from PIL import Image
from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    adjusted_rand_score,
    completeness_score,
    homogeneity_score,
    mutual_info_score,
    rand_score,
    silhouette_score,
    v_measure_score,
)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

from class_helpers import (
    ensure_practice_paths,
    grid_search_kmeans,
    load_zoo,
    paths,
    scale_features,
)

ensure_practice_paths()


## 1. Zoo K-Means study
I start with the zoo table from the practice files. I remove the type column, scale the numeric fields, and test several cluster counts and seeds.


In [None]:
zoo_data = load_zoo(include_type=False)
X_zoo = zoo_data.features
y_zoo = zoo_data.labels
scaler_zoo, X_zoo_scaled = scale_features(X_zoo)
zoo_data.table.head()


The preview confirms that the animal names and the feature flags are loaded correctly.


In [None]:
k_values = [5, 6, 7, 8]
seed_values = [0, 1, 2]
results_kmeans = grid_search_kmeans(X_zoo_scaled, y_zoo, k_values, seed_values)
results_kmeans


The table shows inertia, silhouette, and adjusted Rand index (ARI) for each combination of k and seed. In my run, k = 7 with seed 0 gives the best ARI.
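To make sure I can explain what such a grid search does, here is a minimal sketch of the idea. It is only my guess at the logic: `grid_search_sketch` is a hypothetical name and not the real `grid_search_kmeans` from `class_helpers`, and the synthetic blobs stand in for the zoo table so the cell runs on its own.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score


def grid_search_sketch(X, y, k_values, seed_values):
    """Hypothetical stand-in for the class helper: one row per (k, seed)."""
    rows = []
    for k in k_values:
        for seed in seed_values:
            model = KMeans(n_clusters=k, random_state=seed, n_init=10)
            labels = model.fit_predict(X)
            rows.append(
                {
                    "k": k,
                    "seed": seed,
                    "inertia": model.inertia_,                  # internal: within-cluster sum of squares
                    "silhouette": silhouette_score(X, labels),  # internal: cohesion vs separation
                    "ARI": adjusted_rand_score(y, labels),      # external: needs the true labels
                }
            )
    return pd.DataFrame(rows)


# Synthetic data (assumption): four well separated blobs instead of the zoo features.
X_blobs, y_blobs = make_blobs(n_samples=120, centers=4, cluster_std=0.5, random_state=0)
demo_results = grid_search_sketch(X_blobs, y_blobs, [3, 4, 5], [0, 1])
print(demo_results)
```

Inertia and silhouette only use the features, while ARI compares against the known labels, so only ARI tells me how close the clusters are to the real classes.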


In [None]:
best_row = results_kmeans.sort_values("ARI", ascending=False).iloc[0]
best_k = int(best_row["k"])
best_seed = int(best_row["seed"])
best_k, best_seed


Now I fit one final K-Means model with that choice and compare the cluster labels with the true animal types.


In [None]:
final_kmeans = KMeans(n_clusters=best_k, random_state=best_seed, n_init=10)
final_labels = final_kmeans.fit_predict(X_zoo_scaled)
pd.crosstab(final_labels, y_zoo, rownames=["cluster"], colnames=["type"])


The seven-cluster solution matches the real types quite well, so I keep that option for the report.
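To read a cross table like this more quickly, I can pick the majority type of each cluster and compute the share of animals that belong to their cluster's majority type (often called purity). This is a small sketch on made-up labels, not the zoo data; `toy_clusters` and `toy_types` are hypothetical names of mine.

```python
import numpy as np
import pandas as pd

# Made-up example: 8 animals, 3 clusters, 3 types.
toy_clusters = np.array([0, 0, 0, 1, 1, 2, 2, 2])
toy_types = np.array(
    ["mammal", "mammal", "bird", "bird", "bird", "fish", "fish", "mammal"]
)

table = pd.crosstab(toy_clusters, toy_types, rownames=["cluster"], colnames=["type"])

# Most frequent type inside each cluster (row-wise maximum of the table).
majority_type = table.idxmax(axis=1)

# Purity: animals that agree with their cluster's majority type, divided by all animals.
purity = table.max(axis=1).sum() / table.to_numpy().sum()

print(table)
print(majority_type)
print(purity)
```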


## 2. Agglomerative clustering comparison
I repeat the zoo study with the four linkage rules from class and compare external metrics and silhouette.


In [None]:
linkage_methods = ["single", "complete", "average", "ward"]
agglomerative_rows = []
for method in linkage_methods:
    model = AgglomerativeClustering(n_clusters=best_k, linkage=method)
    labels = model.fit_predict(X_zoo_scaled)
    agglomerative_rows.append(
        {
            "method": method,
            "rand": rand_score(y_zoo, labels),
            "adjusted_rand": adjusted_rand_score(y_zoo, labels),
            "mutual_info": mutual_info_score(y_zoo, labels),
            "homogeneity": homogeneity_score(y_zoo, labels),
            "completeness": completeness_score(y_zoo, labels),
            "v_measure": v_measure_score(y_zoo, labels),
            "silhouette": silhouette_score(X_zoo_scaled, labels),
        }
    )
agglomerative_results = pd.DataFrame(agglomerative_rows)
agglomerative_results


Complete linkage keeps the best balance between external scores and silhouette, so I would explain that choice during the presentation.
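To explain the three related columns in the table, it helps me to remember that the V-measure is the harmonic mean of homogeneity and completeness (with scikit-learn's default `beta=1.0`). The sketch below checks this on made-up labels; `true_types` and `found_clusters` are hypothetical and unrelated to the zoo results.

```python
import numpy as np
from sklearn.metrics import completeness_score, homogeneity_score, v_measure_score

# Made-up labels: three true types and an imperfect clustering.
true_types = [0, 0, 0, 1, 1, 1, 2, 2, 2]
found_clusters = [0, 0, 1, 1, 1, 1, 2, 2, 0]

h = homogeneity_score(true_types, found_clusters)   # each cluster holds one type?
c = completeness_score(true_types, found_clusters)  # each type sits in one cluster?
v = v_measure_score(true_types, found_clusters)

# With the default beta = 1, V-measure equals the harmonic mean of h and c.
harmonic_mean = 2 * h * c / (h + c)
print(h, c, v, harmonic_mean)
```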


## 3. DBSCAN manual example
The statement shows twelve points in the plane with `eps` equal to 0.5 and `min_samples` equal to three. I recreate the example with NumPy and scikit-learn.


In [None]:
points = np.array([
    [1.0, 1.0],
    [1.2, 0.9],
    [0.8, 1.1],
    [1.0, 1.2],
    [8.0, 8.0],
    [8.2, 7.9],
    [7.9, 8.1],
    [8.1, 8.2],
    [0.5, 7.5],
    [0.6, 7.7],
    [0.4, 7.6],
    [0.7, 7.4],
])

model_dbscan = DBSCAN(eps=0.5, min_samples=3)
dbscan_labels = model_dbscan.fit_predict(points)

dbscan_summary = pd.DataFrame({
    "point": [f"P{i+1}" for i in range(len(points))],
    "x": points[:, 0],
    "y": points[:, 1],
    "label": dbscan_labels,
})
dbscan_summary


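With these coordinates, every point has the other three points of its group within a distance of 0.5, so all twelve points are core points. DBSCAN therefore returns three groups of four points (labels 0, 1, and 2) and marks no point as noise. If the picture from class shows noise points, I need to check the coordinates above against the statement.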
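To check a DBSCAN result by hand, I can count how many points lie within `eps` of each point and compare the count with `min_samples`. The sketch below does this on five made-up points (`toy_points` is a hypothetical example, not the twelve points from the statement). Note that scikit-learn counts the point itself as one of its own neighbors.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Made-up example: four points close together and one isolated point.
toy_points = np.array(
    [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2], [5.0, 5.0]]
)
eps, min_samples = 0.5, 3

# Pairwise Euclidean distances via broadcasting: shape (5, 5).
diffs = toy_points[:, None, :] - toy_points[None, :, :]
distances = np.sqrt((diffs ** 2).sum(axis=-1))

# Number of points within eps of each point, the point itself included.
neighbor_counts = (distances <= eps).sum(axis=1)
is_core = neighbor_counts >= min_samples

toy_labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(toy_points)
print(neighbor_counts)  # [4 4 4 4 1]
print(is_core)
print(toy_labels)       # [ 0  0  0  0 -1]
```

The isolated point has only itself inside its radius, so it is neither a core point nor reachable from one, and DBSCAN gives it the noise label -1.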


In [None]:
plt.figure(figsize=(4, 4))
plt.scatter(points[:, 0], points[:, 1], c=dbscan_labels, cmap="tab10", s=80, edgecolor="black")
plt.title("DBSCAN result with eps 0.5 and min samples 3")
plt.xlabel("x")
plt.ylabel("y")
plt.grid(True)
plt.show()


## 4. Image color reduction with K-Means
Now I reduce the color palettes of the gradient, stripes, and landscape images from the practice folder.


In [None]:
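def load_image(path: Path) -> np.ndarray:
    # Force three RGB channels so grayscale, palette, or RGBA files reshape the same way.
    image = Image.open(path).convert("RGB")
    # Scale the 0-255 integers to floats in [0, 1].
    return np.asarray(image, dtype=np.float32) / 255.0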


def compress_image(array: np.ndarray, k: int, seed: int = 0) -> tuple[np.ndarray, float]:
    pixels = array.reshape(-1, array.shape[-1])
    model = KMeans(n_clusters=k, random_state=seed, n_init=5)
    labels = model.fit_predict(pixels)
    palette = model.cluster_centers_
    compressed = palette[labels].reshape(array.shape)
    mse = float(np.mean((array - compressed) ** 2))
    return compressed, mse

image_arrays = {
    name: load_image(paths[name])
    for name in ["gradient", "stripes", "landscape"]
}


I test the palette sizes from the statement and record the mean squared error for each case.


In [None]:
compression_plan = {
    "gradient": [4, 8, 16, 32],
    "stripes": [3, 5, 7, 9],
    "landscape": [5, 10, 20, 30],
}

records = []
compressed_examples = {}
for name, array in image_arrays.items():
    ks = compression_plan[name]
    for k in ks:
        compressed, mse = compress_image(array, k)
        records.append({"image": name, "k": k, "mse": mse})
        # Keep only the smallest palette of each image for the side-by-side figure.
        if k == ks[0]:
            compressed_examples[(name, k)] = compressed
results_compression = pd.DataFrame(records)
results_compression


The mean squared error goes down when I increase k. The stripes image keeps a tiny error even with small palettes because it already has few colors.
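The effect of an image that already has few colors can be shown with a tiny synthetic case. This is a sketch under my own assumptions: `toy_image` is a made-up 6 x 9 image with exactly three colors, not the stripes file from the practice folder. Once k reaches the number of distinct colors, every pixel is rebuilt from its own color and the error drops to practically zero.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up 6 x 9 RGB image with three vertical stripes (values in [0, 1]).
stripe_colors = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
one_row = np.repeat(stripe_colors, 3, axis=0)       # shape (9, 3): r r r g g g b b b
toy_image = one_row[None, :, :].repeat(6, axis=0)   # shape (6, 9, 3)

mse_by_k = {}
for k in [2, 3]:
    pixels = toy_image.reshape(-1, 3)
    model = KMeans(n_clusters=k, random_state=0, n_init=5).fit(pixels)
    # Replace every pixel with the center of its cluster.
    rebuilt = model.cluster_centers_[model.labels_].reshape(toy_image.shape)
    mse_by_k[k] = float(np.mean((toy_image - rebuilt) ** 2))

print(mse_by_k)
```

With k = 2 two stripes must share one averaged color, so the error is clearly positive; with k = 3 each stripe gets its own center.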


In [None]:
fig, axes = plt.subplots(len(compressed_examples), 2, figsize=(6, 6))
for index, ((name, k), compressed) in enumerate(compressed_examples.items()):
    original = image_arrays[name]
    row_axes = axes[index]
    row_axes[0].imshow(original)
    row_axes[0].set_title(f"Original {name}")
    row_axes[0].axis("off")
    row_axes[1].imshow(compressed)
    row_axes[1].set_title(f"k = {k}")
    row_axes[1].axis("off")
plt.tight_layout()
plt.show()


## 5. PCA with the digits dataset
I switch to the digits dataset to practice principal component analysis. I scale the pixels, fit PCA with thirty components, and study the variance ratio.


In [None]:
digits = load_digits()
X_digits = digits.data
y_digits = digits.target

scaler_digits = StandardScaler()
X_digits_scaled = scaler_digits.fit_transform(X_digits)

pca = PCA(n_components=30, random_state=0)
X_digits_pca = pca.fit_transform(X_digits_scaled)
variance_progress = pd.Series(pca.explained_variance_ratio_).cumsum()
variance_progress.head(10)


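On the standardized pixels the first ten components should cover only roughly sixty percent of the variance, not eighty, so ten components alone would not be enough. The last value of the series, `variance_progress.iloc[-1]`, shows how much variance the thirty components keep, and that is the number I use to justify thirty components for the next test.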
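To turn the cumulative series into a concrete component count, I can search for the first position where it reaches a chosen threshold. This is a minimal sketch: `components_for` is a hypothetical helper of mine and the thresholds are arbitrary examples. It refits a full PCA so the cell is self-contained.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_check = StandardScaler().fit_transform(load_digits().data)

# Full PCA (all components) so the cumulative ratio ends close to 1.
cumulative = np.cumsum(PCA().fit(X_check).explained_variance_ratio_)


def components_for(threshold: float) -> int:
    """Smallest number of components whose cumulative ratio reaches the threshold."""
    # searchsorted returns the first index with cumulative[index] >= threshold.
    return int(np.searchsorted(cumulative, threshold) + 1)


for threshold in (0.80, 0.90, 0.95):
    print(threshold, components_for(threshold))
```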


## 6. Classifier comparison with and without PCA
I compare logistic regression and k nearest neighbors before and after PCA.


In [None]:
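# Split the raw pixels first so the scaler and PCA never see the test set.
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X_digits,
    y_digits,
    test_size=0.25,
    random_state=0,
    stratify=y_digits,
)

# Fit the scaler on the training part only, then apply it to both parts.
scaler_split = StandardScaler().fit(X_train_raw)
X_train_scaled = scaler_split.transform(X_train_raw)
X_test_scaled = scaler_split.transform(X_test_raw)

# Same rule for PCA: fit on the training part, transform both parts.
pca_split = PCA(n_components=30, random_state=0).fit(X_train_scaled)
X_train_pca = pca_split.transform(X_train_scaled)
X_test_pca = pca_split.transform(X_test_scaled)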

logreg = LogisticRegression(max_iter=2000, random_state=0)
logreg.fit(X_train_scaled, y_train)
logreg_pca = LogisticRegression(max_iter=2000, random_state=0)
logreg_pca.fit(X_train_pca, y_train)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)
knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)

results_classifiers = pd.DataFrame(
    [
        {
            "model": "LogisticRegression",
            "dataset": "original",
            "accuracy": logreg.score(X_test_scaled, y_test),
        },
        {
            "model": "LogisticRegression",
            "dataset": "pca",
            "accuracy": logreg_pca.score(X_test_pca, y_test),
        },
        {
            "model": "KNN",
            "dataset": "original",
            "accuracy": knn.score(X_test_scaled, y_test),
        },
        {
            "model": "KNN",
            "dataset": "pca",
            "accuracy": knn_pca.score(X_test_pca, y_test),
        },
    ]
)
results_classifiers


Logistic regression keeps the best accuracy after PCA, while k nearest neighbors loses a little performance. This matches what we discussed in class about linear models and compression.
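One way to double check a comparison like this, as a sketch and not part of the assignment, is to wrap scaling, PCA, and the classifier in a pipeline. Then every cross-validation fold fits the scaler and the PCA on its own training part only. The 5-fold setting and the name `pipeline_pca` are my own assumptions.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_raw, y_raw = load_digits(return_X_y=True)

# Each step is refit inside every fold, so no test information leaks into
# the scaler or the PCA.
pipeline_pca = make_pipeline(
    StandardScaler(),
    PCA(n_components=30, random_state=0),
    LogisticRegression(max_iter=2000),
)

cv_scores = cross_val_score(pipeline_pca, X_raw, y_raw, cv=5)
print(cv_scores.mean(), cv_scores.std())
```

Swapping `LogisticRegression` for `KNeighborsClassifier(n_neighbors=5)` in the same pipeline would give the matching number for k nearest neighbors.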
