# HW4: K-Means Clustering with MNIST Dataset

In this assignment, digits from the MNIST dataset are clustered and labeled using the K-Means clustering algorithm. Clustering is performed with three different distance metrics (Euclidean, Manhattan, Cosine), and performance is evaluated with 5-fold cross-validation.

## Requirements
- Load the MNIST dataset and normalize it.
- Create \( k=10 \) clusters with the K-Means algorithm.
- Cluster with three distance metrics: Euclidean, Manhattan, Cosine.
- Apply 5-fold cross-validation.
- Calculate confusion matrices and accuracy scores for the training and test sets.
- Report the results.

In [2]:
import numpy as np
from keras.datasets import mnist
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist


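# Load the MNIST training split; the official test split is discarded
# because cross-validation creates its own train/test folds.
(X, y), _ = mnist.load_data()
# Flatten each 28x28 image to a 784-vector and scale pixels to [0, 1].
X = X.reshape((X.shape[0], -1)).astype(np.float32) / 255.0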


n_splits = 5
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
distance_metrics = {
    "euclidean": "euclidean",
    "manhattan": "cityblock",
    "cosine": "cosine"
}
results = {}

for metric_name, metric in distance_metrics.items():
    print(f"\n--- {metric_name.upper()} ---")
    train_accuracies = []
    test_accuracies = []
    fold = 1
    for train_idx, test_idx in skf.split(X, y):
        print(f"\nFold {fold}")
        X_train, X_test = X[train_idx], X[test_idx]
        y_train, y_test = y[train_idx], y[test_idx]

      
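        # scikit-learn's KMeans always minimizes squared Euclidean distance,
        # so the centroids are Euclidean for every metric in this loop.
        if metric_name == "euclidean":
            kmeans = KMeans(n_clusters=10, random_state=42, n_init=10)
            kmeans.fit(X_train)
            train_clusters = kmeans.predict(X_train)
            test_clusters = kmeans.predict(X_test)
            centers = kmeans.cluster_centers_
        else:
            # Manhattan and cosine only change the assignment step: each
            # sample goes to the nearest Euclidean centroid under `metric`.
            kmeans = KMeans(n_clusters=10, random_state=42, n_init=10)
            kmeans.fit(X_train)
            centers = kmeans.cluster_centers_
            train_clusters = cdist(X_train, centers, metric).argmin(axis=1)
            test_clusters = cdist(X_test, centers, metric).argmin(axis=1)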

       
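        # Contingency table: rows are clusters, columns are true digit labels.
        table = np.zeros((10, 10), dtype=int)
        for true_label, cluster in zip(y_train, train_clusters):
            table[cluster, true_label] += 1

        # Greedy one-to-one mapping: take the largest remaining cell and label
        # that cluster if both the cluster and the digit are still free.
        # The loop runs exactly 10 times, so an iteration that hits a taken
        # cluster or digit is wasted and some clusters can keep the label -1.
        cluster_labels = -np.ones(10, dtype=int)
        used_labels = set()
        used_clusters = set()
        table_copy = table.copy()
        for _ in range(10):
            i, j = np.unravel_index(np.argmax(table_copy), table_copy.shape)
            if cluster_labels[i] == -1 and j not in used_labels:
                cluster_labels[i] = j
                used_labels.add(j)
                used_clusters.add(i)
            table_copy[i, j] = -1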

      
        # Translate cluster ids to digit labels and score the training fold.
        # Unlabeled clusters predict -1, which adds an extra row and column
        # to the confusion matrix.
        train_pred_labels = np.array([cluster_labels[c] for c in train_clusters])
        train_acc = accuracy_score(y_train, train_pred_labels)
        train_conf = confusion_matrix(y_train, train_pred_labels)
        print("Train Accuracy:", train_acc)
        print("Train Confusion Matrix:\n", train_conf)
        train_accuracies.append(train_acc)

        # Score the held-out fold (recomputes the nearest-centroid assignment).
        test_pred_clusters = cdist(X_test, centers, metric).argmin(axis=1)
        test_pred_labels = np.array([cluster_labels[c] for c in test_pred_clusters])
        test_acc = accuracy_score(y_test, test_pred_labels)
        test_conf = confusion_matrix(y_test, test_pred_labels)
        print("Test Accuracy:", test_acc)
        print("Test Confusion Matrix:\n", test_conf)
        test_accuracies.append(test_acc)
        fold += 1

    print(f"\n{metric_name.upper()} Mean Train Accuracy: {np.mean(train_accuracies):.4f}")
    print(f"{metric_name.upper()} Mean Test Accuracy: {np.mean(test_accuracies):.4f}")
    results[metric_name] = {
        "train_acc": train_accuracies,
        "test_acc": test_accuracies
    }


--- EUCLIDEAN ---

Fold 1
Train Accuracy: 0.49904166666666666
Train Confusion Matrix:
 [[   0    0    0    0    0    0    0    0    0    0    0]
 [2018 2245    2   11  138   31    0  155   10  129    0]
 [2391    0 2965    6    5    6    0    6    9    6    0]
 [ 369   10  285 3369  258  141    0  161   59  115    0]
 [ 176   11  345  179 3102  148    0   48   41  854    0]
 [ 255   10  116   29    1 2589    0  132 1527   14    0]
 [ 838   51  123   13 1414  304    0  101  262 1230    0]
 [ 272   87  220   65   24   65    0 3923    1   78    0]
 [ 231   12  259   32    3 1438    0    4 3025    8    0]
 [ 363   30  232   47  918  158    0   36  161 2736    0]
 [ 113   34  194    7   61 2327    0    7 1954   62    0]]
Test Accuracy: 0.5011666666666666
Test Confusion Matrix:
 [[  0   0   0   0   0   0   0   0   0   0   0]
 [499 572   1   5  22   7   0  27   4  47   0]
 [585   0 753   3   0   0   0   3   0   4   0]
 [ 82   1  75 838  64  35   0  46  12  38   0]
 [ 56   3  85  41 813  27  

## Results and Comments

In this section, we analyze and interpret the results obtained from the K-Means clustering using different distance metrics on the MNIST dataset.

### Overall Performance Evaluation

The mean training and test accuracies for each distance metric across the 5 folds are as follows:

- **Euclidean Distance Metric**:
  - **Mean Training Accuracy**: 0.4996
  - **Mean Test Accuracy**: 0.4992
- **Manhattan Distance Metric**:
  - **Mean Training Accuracy**: 0.3780
  - **Mean Test Accuracy**: 0.3778
- **Cosine Distance Metric**:
  - **Mean Training Accuracy**: 0.5108
  - **Mean Test Accuracy**: 0.5097

Based on these results, the **Cosine distance metric** achieved the highest accuracy, about one percentage point above Euclidean on both the training and test folds (0.5108 vs. 0.4996 and 0.5097 vs. 0.4992). The Manhattan metric yielded the lowest accuracies (0.3780 and 0.3778), roughly 12 to 13 points behind the other two. For every metric, training and test accuracy differ by at most 0.0011, so there is no sign of overfitting to the training folds.
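Because the differences between metrics are small, it can help to report the spread across folds alongside the mean. The sketch below assumes a `results` dictionary shaped like the one built in the notebook (`train_acc` and `test_acc` lists per metric); the helper name `summarize` and the example values are hypothetical placeholders, not the notebook's per-fold results.

```python
import numpy as np

def summarize(results):
    """Return {metric: (mean_train, std_train, mean_test, std_test)}."""
    summary = {}
    for name, scores in results.items():
        train = np.asarray(scores["train_acc"], dtype=float)
        test = np.asarray(scores["test_acc"], dtype=float)
        summary[name] = (train.mean(), train.std(), test.mean(), test.std())
    return summary

# Placeholder values for illustration only; not the notebook's fold results.
example_results = {
    "metric_a": {"train_acc": [0.50, 0.52, 0.48], "test_acc": [0.49, 0.51, 0.50]},
}
for name, (tr_m, tr_s, te_m, te_s) in summarize(example_results).items():
    print(f"{name}: train {tr_m:.4f} ± {tr_s:.4f}, test {te_m:.4f} ± {te_s:.4f}")
```

Calling `summarize(results)` at the end of the notebook would give the same means as the printed ones, plus a standard deviation that indicates whether a one-point gap exceeds fold-to-fold variation.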

### Insights from Confusion Matrices

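The confusion matrices provide detailed insights into the classification performance for each metric. Note that the printed matrices are 11×11 rather than 10×10: the first row and column correspond to the placeholder label -1 assigned to clusters that the mapping step left unlabeled, so digit \( d \) sits at index \( d+1 \).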

- **Euclidean Distance Metric**:
  - Classes such as 0, 1, and 2 were well-separated in both training and test sets.
  - Significant confusion was observed between classes 4 and 9. For example, in Fold 1's test confusion matrix, 226 samples of class 9 were misclassified as class 4.

- **Manhattan Distance Metric**:
  - Overall accuracy was low, and misclassifications were spread across many classes.
  - High confusion was noted among classes 3, 5, and 8. Class 5 was also confused with class 9: in Fold 5's test matrix, 464 samples of class 5 were misclassified as class 9.

- **Cosine Distance Metric**:
  - Strong performance was observed for classes 0, 1, and 2.
  - Confusion between classes 4 and 9 persisted. In Fold 1's test matrix, 248 samples of class 9 were misclassified as class 4, slightly more than the 226 reported for the Euclidean metric in the same fold, although overall accuracy remained higher.
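Reading the largest off-diagonal cells by eye is error-prone, so a small helper can list them. This is a minimal sketch: `top_confusions` is a hypothetical name, and the `labels` argument must match the row and column order of the matrix (for the notebook's output this would be the sorted union of true and predicted labels, including -1 if any cluster was left unlabeled).

```python
import numpy as np

def top_confusions(conf, labels, k=3):
    """Return the k largest off-diagonal cells as (true, predicted, count)."""
    conf = np.asarray(conf).copy()
    np.fill_diagonal(conf, 0)  # ignore correct predictions
    flat_order = np.argsort(conf, axis=None)[::-1][:k]
    rows, cols = np.unravel_index(flat_order, conf.shape)
    return [(labels[r], labels[c], int(conf[r, c])) for r, c in zip(rows, cols)]

# Toy 3-class matrix for illustration only.
toy = np.array([[5, 1, 0],
                [0, 4, 3],
                [2, 0, 6]])
print(top_confusions(toy, labels=[0, 1, 2], k=2))  # [(1, 2, 3), (2, 0, 2)]
```

In the notebook, something like `top_confusions(test_conf, np.unique(np.concatenate([y_test, test_pred_labels])))` would report the most frequent misclassifications for a fold.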

### Comments

- The **Cosine distance metric** gave the best results for MNIST in this setup, with the highest accuracy scores of 0.5108 (training) and 0.5097 (test). A likely reason is that cosine distance measures the angle between the pixel vectors and ignores their magnitude, which may better capture the shape-based similarities of digits regardless of overall ink intensity.
- The **Euclidean distance metric** performed close to Cosine (0.4996 and 0.4992) and showed the same difficulty distinguishing visually similar classes such as 4 and 9.
- The **Manhattan distance metric** performed poorly (0.3780 and 0.3778). One possible cause is a mismatch in the procedure: the centroids are fitted by scikit-learn's `KMeans`, which minimizes squared Euclidean distance, and only the final assignment step uses Manhattan distance. The same applies to the Cosine runs, so the comparison reflects the assignment metric rather than a full clustering under each metric.
- The confusion matrices highlight the model's weaknesses, particularly the confusion between classes 4 and 9, which may stem from their visual similarity. Addressing this could involve additional feature engineering or more advanced algorithms.
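Part of the error may also come from the label-mapping step rather than from the clustering itself: the greedy loop in the notebook can leave clusters unlabeled (the -1 column in the printed matrices). A possible alternative, sketched here under the assumption that a one-to-one mapping is still wanted, is to solve the assignment with `scipy.optimize.linear_sum_assignment`. The helper name `map_clusters_to_labels` is hypothetical, and the function assumes a square cluster-by-label count table like the notebook's `table`.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def map_clusters_to_labels(table):
    """One-to-one cluster -> label mapping that maximizes the matched count.

    table[c, l] is the number of training samples in cluster c whose true
    label is l. Assumes a square table (as many clusters as labels).
    """
    table = np.asarray(table)
    cluster_idx, label_idx = linear_sum_assignment(table, maximize=True)
    mapping = np.empty(table.shape[0], dtype=int)
    mapping[cluster_idx] = label_idx
    return mapping

# Toy table where taking the largest cell first is not optimal:
# greedy matches 10 + 0 samples, the optimal mapping matches 9 + 9.
toy = np.array([[10, 9],
                [9, 0]])
print(map_clusters_to_labels(toy))  # [1 0]
```

Replacing the greedy loop with `cluster_labels = map_clusters_to_labels(table)` would label every cluster, but the accuracies reported in this notebook were produced with the greedy loop, so the numbers would change.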

### Conclusion

When applying K-Means clustering to the MNIST dataset with the procedure used here, the **Cosine distance metric** provides the best performance, with a mean test accuracy of 0.5097 compared with 0.4992 for Euclidean and 0.3778 for Manhattan. For this dataset and algorithm combination, the Cosine metric is therefore recommended. Further optimizations could be explored to reduce confusion between specific classes such as 4 and 9.
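One optimization in this direction, offered only as a sketch, is to L2-normalize each image vector before clustering. For unit-length vectors, squared Euclidean distance equals twice the cosine distance, so a Euclidean nearest-centroid assignment against unit-length centers ranks them in the same order as cosine distance. The helper `l2_normalize_rows` is a hypothetical name, and random vectors stand in for flattened images.

```python
import numpy as np

def l2_normalize_rows(X, eps=1e-12):
    """Scale each row of X to unit Euclidean length (eps guards zero rows)."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, eps)

rng = np.random.default_rng(0)
points = l2_normalize_rows(rng.random((5, 784)))    # stand-in for images
center = l2_normalize_rows(rng.random((1, 784)))[0]  # stand-in for a centroid

cosine_dist = 1.0 - points @ center
sq_euclid = np.sum((points - center) ** 2, axis=1)

# For unit vectors: ||p - c||^2 = 2 * (1 - cos(p, c))
print(np.allclose(sq_euclid, 2.0 * cosine_dist))  # True
```

The notebook already imports `sklearn.preprocessing.normalize`, which performs the same row-wise L2 scaling by default. Note that the mean of unit vectors is generally not unit-length, so a full spherical K-Means would also renormalize the centroids after each update; whether this improves accuracy here is untested.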