# Clustering Analysis Project – Red Wine Dataset (Step-by-Step)

This notebook follows the project steps:
- Load and inspect the dataset  
- Prepare (standardize) the features  
- Run **K-Means**, **Mean Shift**, and **Hierarchical (Agglomerative)** clustering  
- Evaluate with **Silhouette Score** and **Adjusted Rand Index (ARI)** (using the `quality` column as a reference label)  
- Visualize clusters with **PCA**  
- Interpret which method performed best  

> Dataset file expected in the same folder as this notebook: `winequality-red.csv` (semicolon `;` delimiter)


## 1) Import Required Libraries

In [5]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, MeanShift, estimate_bandwidth, AgglomerativeClustering
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score, adjusted_rand_score

plt.rcParams["figure.figsize"] = (8, 5)


## 2) Load the Red Wine Dataset

**Important:** this dataset uses `;` as the separator.

In [8]:

df = pd.read_csv("winequality-red.csv", sep=";")
df.head()
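
If `winequality-red.csv` is not in the notebook's working directory, `pd.read_csv` raises `FileNotFoundError`. The following is a minimal sketch of a loader that fails with a more helpful message. The helper name `load_wine` and its `path` parameter are assumptions for illustration and are not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_wine(path="winequality-red.csv"):
    """Hypothetical helper: read the semicolon-delimited wine CSV."""
    csv_path = Path(path)
    if not csv_path.is_file():
        # Report the directory that was searched so the fix is obvious.
        raise FileNotFoundError(
            f"{csv_path.name} not found in {csv_path.resolve().parent}; "
            "place the file next to the notebook or pass its full path."
        )
    return pd.read_csv(csv_path, sep=";")
```

With the file in place, `df = load_wine()` would behave like the `pd.read_csv` call above.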

## 3) Inspect the Data

In [None]:

print("Shape (rows, columns):", df.shape)

print("\nColumns:")
print(df.columns.tolist())

print("\nMissing values per column:")
print(df.isna().sum())

print("\nSummary statistics:")
df.describe()


## 4) Prepare the Data (Standardization)

We will:
1. Separate `quality` (we **do not** use it for clustering)  
2. Standardize the remaining features with **StandardScaler**
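
To make the standardization step concrete, here is a small sketch on made-up numbers (the array `X_toy` is an assumption, not wine data). It checks that `StandardScaler` matches the manual formula: subtract each column's mean, then divide by its population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (made up for illustration): two features on very different scales.
X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 500.0]])

X_std = StandardScaler().fit_transform(X_toy)

# StandardScaler divides by the population standard deviation (ddof=0),
# which is also NumPy's default for .std().
X_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print("Matches manual formula:", np.allclose(X_std, X_manual))
```

After scaling, both columns have mean 0 and standard deviation 1, so neither dominates the Euclidean distances used by the clustering algorithms.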


In [None]:

y_ref = df["quality"]            # reference only (for ARI)
X = df.drop(columns=["quality"]) # clustering features

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print("X shape:", X.shape)
print("X_scaled shape:", X_scaled.shape)


## 5) Apply K-Means Clustering

We test k = 2 to 10, plot the inertia and the silhouette score for each value, and choose the k with the highest silhouette score.


In [None]:

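k_values = range(2, 11)
inertias = []    # within-cluster sum of squared distances for each k
sil_scores = []  # mean silhouette score for each k

for k in k_values:
    # n_init="auto" requires scikit-learn >= 1.2
    km = KMeans(n_clusters=k, random_state=42, n_init="auto")
    labels = km.fit_predict(X_scaled)
    inertias.append(km.inertia_)
    sil_scores.append(silhouette_score(X_scaled, labels))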

plt.figure()
plt.plot(list(k_values), inertias, marker="o")
plt.title("K-Means Elbow Plot (Inertia vs k)")
plt.xlabel("k")
plt.ylabel("Inertia")
plt.show()

plt.figure()
plt.plot(list(k_values), sil_scores, marker="o")
plt.title("K-Means Silhouette Score vs k")
plt.xlabel("k")
plt.ylabel("Silhouette score")
plt.show()

best_k = list(k_values)[int(np.argmax(sil_scores))]
print("Best k by silhouette:", best_k)


### Run K-Means with best k

In [None]:

kmeans = KMeans(n_clusters=best_k, random_state=42, n_init="auto")
kmeans_labels = kmeans.fit_predict(X_scaled)

kmeans_sil = silhouette_score(X_scaled, kmeans_labels)
kmeans_ari = adjusted_rand_score(y_ref, kmeans_labels)

print("K-Means")
print(" - clusters:", best_k)
print(" - Silhouette:", round(kmeans_sil, 4))
print(" - ARI vs quality:", round(kmeans_ari, 4))


## 6) Apply Mean Shift Clustering

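Mean Shift does not take the number of clusters as an input. It infers that number from the data, and the result depends strongly on the bandwidth, which we estimate here with `estimate_bandwidth`.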
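
To illustrate how much the bandwidth matters, here is a sketch on synthetic data (three well-separated blobs from `make_blobs`, which is an assumption and not the wine dataset). The two bandwidth values are arbitrary choices for the demonstration.

```python
import numpy as np
from sklearn.cluster import MeanShift
from sklearn.datasets import make_blobs

# Synthetic data (an assumption for illustration): three well-separated blobs.
X_toy, _ = make_blobs(
    n_samples=150,
    centers=[[-10, -10], [0, 0], [10, 10]],
    cluster_std=0.5,
    random_state=42,
)

n_clusters_by_bw = {}
for bw in (2.0, 50.0):
    toy_labels = MeanShift(bandwidth=bw).fit_predict(X_toy)
    n_clusters_by_bw[bw] = len(np.unique(toy_labels))
    print(f"bandwidth={bw}: {n_clusters_by_bw[bw]} cluster(s)")
```

A bandwidth larger than the spread of the whole dataset merges everything into one cluster, while a small bandwidth keeps the blobs apart. This is why the `quantile` passed to `estimate_bandwidth` is worth experimenting with.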


In [None]:

bandwidth = estimate_bandwidth(
    X_scaled,
    quantile=0.2,
    n_samples=min(500, len(df)),
    random_state=42
)
print("Estimated bandwidth:", bandwidth)

meanshift = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms_labels = meanshift.fit_predict(X_scaled)

ms_n_clusters = len(np.unique(ms_labels))
ms_sil = silhouette_score(X_scaled, ms_labels) if ms_n_clusters > 1 else float("nan")
ms_ari = adjusted_rand_score(y_ref, ms_labels)

print("Mean Shift")
print(" - clusters:", ms_n_clusters)
print(" - Silhouette:", round(ms_sil, 4))
print(" - ARI vs quality:", round(ms_ari, 4))


## 7) Apply Hierarchical (Agglomerative) Clustering

We use the same number of clusters as K-Means (`best_k`) for a fair comparison.
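
Because both methods receive the same `n_clusters`, their labelings can also be compared with each other directly, not only against `quality`. The sketch below does this on synthetic blobs (an assumption for illustration); on the wine data the same call would be `adjusted_rand_score(kmeans_labels, agg_labels)`.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic data (an assumption for illustration): three well-separated blobs.
X_toy, _ = make_blobs(
    n_samples=150,
    centers=[[-10, -10], [0, 0], [10, 10]],
    cluster_std=0.5,
    random_state=42,
)

k = 3
km_toy = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_toy)
ward_toy = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X_toy)

# ARI near 1 means the two algorithms found essentially the same partition.
agreement = adjusted_rand_score(km_toy, ward_toy)
print("ARI between K-Means and Ward labels:", round(agreement, 4))
```

On real data the agreement is usually lower than on clean blobs; a low value suggests the cluster structure is weak or method-dependent.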


In [None]:

agg = AgglomerativeClustering(n_clusters=best_k, linkage="ward")
agg_labels = agg.fit_predict(X_scaled)

agg_sil = silhouette_score(X_scaled, agg_labels)
agg_ari = adjusted_rand_score(y_ref, agg_labels)

print("Agglomerative")
print(" - clusters:", best_k)
print(" - Silhouette:", round(agg_sil, 4))
print(" - ARI vs quality:", round(agg_ari, 4))


## 8) Compare Results

In [None]:

results = pd.DataFrame({
    "Algorithm": ["K-Means", "Mean Shift", "Agglomerative"],
    "Num_Clusters": [len(np.unique(kmeans_labels)), len(np.unique(ms_labels)), len(np.unique(agg_labels))],
    "Silhouette": [kmeans_sil, ms_sil, agg_sil],
    "ARI_vs_quality": [kmeans_ari, ms_ari, agg_ari]
})

results


## 9) Visualize Clusters with PCA

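We project the standardized features onto the first two principal components and color each point by its cluster label. The clustering itself was run on all features; PCA is used only for plotting.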
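
A 2D plot can only be trusted as far as the two components capture the data's variance. The sketch below shows one way to inspect the cumulative explained variance, using random stand-in data (an assumption); on the wine data one would fit `PCA()` on `X_scaled` instead.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in data (an assumption): 200 samples, 5 features,
# with feature 1 made strongly correlated with feature 0.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 5))
X_toy[:, 1] = X_toy[:, 0] + 0.1 * rng.normal(size=200)

# Fit PCA with all components and accumulate the variance ratios.
pca_full = PCA().fit(X_toy)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

for i, c in enumerate(cumulative, start=1):
    print(f"{i} component(s): {c:.3f} of variance retained")
```

If the first two components retain only a small share of the variance, clusters that overlap in the plot may still be separated in the full feature space.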


In [None]:

pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X_scaled)

def plot_clusters(labels, title):
    plt.figure()
    plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, s=15)
    plt.title(title)
    plt.xlabel("PCA 1")
    plt.ylabel("PCA 2")
    plt.show()

plot_clusters(kmeans_labels, f"K-Means (k={best_k}) - PCA")
plot_clusters(ms_labels, f"Mean Shift (k={len(np.unique(ms_labels))}) - PCA")
plot_clusters(agg_labels, f"Agglomerative (k={best_k}) - PCA")

print("Total explained variance (2 PCA components):", round(pca.explained_variance_ratio_.sum(), 4))


## 10) Interpret Results (Simple)

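- **Clusters** group wines with similar chemistry (acidity, sugar, alcohol, sulphates, etc.).  
- **Silhouette score** measures how well separated the clusters are. It ranges from -1 to 1; values near 1 mean points sit close to their own cluster and far from the others, and values near 0 mean the clusters overlap.  
- **ARI** measures how closely the cluster labels agree with the `quality` score. It is about 0 for random labelings and 1 for a perfect match. It is a reference check only, because clustering never sees `quality`.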
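
The two metrics in the list above can be seen on a tiny made-up example (the points and labelings below are assumptions for illustration). One labeling matches the obvious groups and one mixes them.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Made-up 1-D points forming two tight, distant groups.
X_toy = np.array([[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]])
good = np.array([0, 0, 0, 1, 1, 1])  # matches the two groups
bad = np.array([0, 1, 0, 1, 0, 1])   # mixes the two groups

sil_good = silhouette_score(X_toy, good)
sil_bad = silhouette_score(X_toy, bad)

# ARI ignores label names: swapping 0 and 1 still counts as a perfect match.
ari_swapped = adjusted_rand_score(good, 1 - good)
ari_bad = adjusted_rand_score(good, bad)

print("Silhouette (good, bad):", round(sil_good, 3), round(sil_bad, 3))
print("ARI (swapped, bad):", round(ari_swapped, 3), round(ari_bad, 3))
```

The matching labeling scores close to 1 on silhouette, and the mixed labeling scores far lower on both metrics.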

**Pick the “best” algorithm** mainly by the highest silhouette score, and use the PCA plots to see if clusters look separated.
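
A high silhouette score can come from one dominant cluster plus a few tiny outlier clusters, so it helps to check cluster sizes before declaring a winner. The helper `cluster_size_report` below is a hypothetical name introduced for this sketch, and the labels are made up.

```python
import numpy as np


def cluster_size_report(labels):
    """Hypothetical helper: map each cluster label to its share of samples."""
    values, counts = np.unique(labels, return_counts=True)
    shares = counts / counts.sum()
    return {int(v): float(s) for v, s in zip(values, shares)}


# Made-up labels: one dominant cluster plus two near-singleton clusters.
toy_labels = np.array([0] * 97 + [1] * 2 + [2] * 1)
report = cluster_size_report(toy_labels)
print(report)
```

On the notebook's results, calling it on `kmeans_labels`, `ms_labels`, and `agg_labels` would show whether any method produced such a lopsided partition.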


## 11) Short Written Summary (Copy/Paste)

We loaded the Red Wine Quality dataset (`winequality-red.csv`) and prepared it for clustering by removing the `quality` column from the input features (because clustering is unsupervised). All remaining chemical measurements were standardized with `StandardScaler` so that every feature had a comparable scale.  

Next, we applied three clustering algorithms: **K-Means**, **Mean Shift**, and **Agglomerative (hierarchical)** clustering. For K-Means, we tested multiple values of *k* (2 to 10) and selected the *k* with the best **silhouette score**. Mean Shift automatically determined the number of clusters based on the estimated bandwidth. Agglomerative clustering was run using the same number of clusters as the selected K-Means solution so results could be compared more fairly.  

To evaluate the clusters, we used the **silhouette score** (higher is better, meaning the clusters are more clearly separated). We also calculated **Adjusted Rand Index (ARI)** by comparing cluster labels to the `quality` scores as a reference check. Finally, we used **PCA** to reduce the data to two dimensions and plotted each algorithm’s clusters to visually compare how well they separate.  

Overall, the algorithm with the highest silhouette score (and a reasonable number of clusters) is considered the best choice for this dataset, and the PCA plots help confirm whether the clusters look distinct in 2D.


## 12) Submit

Submit:
1. This notebook (`clustering_wine.ipynb`)  
2. The short written summary above
