<a href="https://colab.research.google.com/github/chefs-kiss/ML_J2026/blob/main/PA3_Classification_with_Fashion_MNIST.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Name:

Who you worked with:

## Objectives
The goals of this project are to:
- Perform EDA, PCA, and visualize the data
- Implement K-means clustering
- Evaluate the clustering results against the true labels
- Thoughtfully interpret and discuss the results

## Overview
In this assignment, you will explore the Fashion MNIST dataset, which contains grayscale images of 10 different clothing items. You will focus on applying unsupervised learning techniques, specifically K-means clustering, to see if the algorithm can naturally identify patterns that correspond to different clothing categories. You will also critically evaluate the performance of clustering against the ground truth labels and reflect on the limitations of the algorithm.

## Schedule
Here is the suggested schedule for working on this project:
- Weekend: Read through project instructions, complete Task 0.
- Tuesday: Complete Tasks 1-2.
- Wednesday: Complete Tasks 3-4.
- Thursday: Complete Task 5.

This project is due on Thursday, 3/6, by 11:59pm.


# Task 0: Data Exploration

We'll be working with the [Fashion MNIST](https://github.com/zalandoresearch/fashion-mnist) dataset, which contains 70,000 grayscale images of 10 different fashion items.

The images show individual articles of clothing at low resolution (28 by 28 pixels), as seen here:

<table>
  <tr><td>
    <img src="https://tensorflow.org/images/fashion-mnist-sprite.png"
         alt="Fashion MNIST sprite"  width="600">
  </td></tr>
  <tr><td align="center">
    <b>Figure 1.</b> <a href="https://github.com/zalandoresearch/fashion-mnist">Fashion-MNIST samples</a> (by Zalando, MIT License).<br/>&nbsp;
  </td></tr>
</table>

Fashion MNIST is intended as a replacement for the classic [MNIST](http://yann.lecun.com/exdb/mnist/) dataset. MNIST is often used as the "Hello, World" of machine learning programs for computer vision. The MNIST dataset contains images of handwritten digits (0, 1, 2, etc.) in a format identical to that of the articles of clothing you'll use here.

Fashion MNIST has more variety, and it's a slightly more challenging problem than regular MNIST. Both datasets are relatively small and are used to verify that an algorithm works as expected. They're good starting points to test and debug code.

The dataset we will be using contains 60,000 images to train the model.


Here, we'll load the data and perform some exploratory data analysis.

## Load Data

We will be working with a dataset from TensorFlow. This library is one we will come back to when working with neural nets. For now, we are only using it for one of its datasets.

In [None]:
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.datasets import fashion_mnist

# Load the Fashion MNIST dataset
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()

We will only be using the training dataset since we're doing an unsupervised approach.

## Sample image
Now, let's visualize an example image from the dataset.

In [None]:
# Display an example image
plt.figure(figsize=(3, 3))
plt.imshow(train_images[0], cmap='gray')
plt.title(f'Label: {train_labels[0]}')
plt.show()

## DataFrame

We will create a pandas DataFrame so that we can do some initial EDA. Remember that for forming the clusters, we will only use the features. However, we can still use the labels later to evaluate our results.

Before we create the pandas DataFrame, we're going to flatten each image into one long vector. This is the first step required to get our data into a pandas DataFrame. It also introduces a helpful method called `reshape`. If you decide to use image data in the future, this method will be very helpful to know.
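As a minimal sketch of what `reshape` does, the toy example here flattens a small stack of made-up 2 x 3 "images" into row vectors and then restores the original shape. The array `tiny_images` is hypothetical and only stands in for `train_images`.

```python
import numpy as np

# Hypothetical stand-in for train_images: 4 "images", each 2 x 3 pixels
tiny_images = np.arange(24).reshape((4, 2, 3))

# Flatten each image into one row of 2 * 3 = 6 values
tiny_flat = tiny_images.reshape((tiny_images.shape[0], 2 * 3))
print(tiny_flat.shape)  # (4, 6)

# Reshaping back recovers the original images exactly
restored = tiny_flat.reshape((4, 2, 3))
print(np.array_equal(restored, tiny_images))  # True
```

Flattening only changes how the values are arranged, not the values themselves, which is why the round trip is lossless.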

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

# Flatten the images from 28x28 to a 784-dimensional vector to use with our algorithm
train_data = train_images.reshape((train_images.shape[0], 28 * 28)).astype(np.float32)

# Create a DataFrame to make viewing easier (if necessary for your case)
X = pd.DataFrame(train_data)

#labels for evaluating clusters later on
y = train_labels

Let's take a peek at what this data looks like.

In [None]:
X.head(3)

Each feature is a single pixel in the 28x28 image, with values ranging from 0 to 255. The labels `y` are an array of integers, ranging from 0 to 9.

In [None]:
list(y[:20])

Each image is mapped to a single label. Since the class names are not included with the dataset, we're going to store them here to use later when evaluating and plotting our images.

In [None]:
y_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat', 'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']

## Explore the data

Let's explore the format of the dataset before training the model. The following shows there are 60,000 images, with each image represented as 784 features (28 x 28 pixels, flattened):

In [None]:
X.shape

Likewise, there are 60,000 labels:

In [None]:
len(y)

### 💻 Question1: Is The Data Balanced?

Let's take a look at how many of each type of clothing article we have.

Below is the dataframe of our target.

Add a new line of code that takes `y_df` and finds the counts of each class.

Is our dataset balanced?


In [None]:
y_df = pd.DataFrame(y)
#add code here

### ✏ Question2: Image Flattening

Our images are flattened from a 28×28 image to 784 features. What spatial information might be lost in this process? How could this impact our clustering results?

## PCA

It would take a very long time to generate pairplots for 700+ features. Instead, we'll use principal component analysis (PCA) for dimensionality reduction, so that we can visualize a projection of the data. Here, we reduce the data to a few dimensions.
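To make the idea of a projection concrete, here is a minimal sketch of PCA written with only `numpy` on made-up 3-dimensional data. It illustrates the technique and is not meant to reproduce scikit-learn's exact implementation. The names `toy`, `components`, and `toy_projection` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 points in 3-D that vary mostly along one direction
t = rng.normal(size=200)
toy = np.column_stack((t, 2 * t, 0.1 * rng.normal(size=200)))

# Step 1: center each feature at zero
centered = toy - toy.mean(axis=0)

# Step 2: the right singular vectors are the principal directions
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)

# Step 3: project the data onto the first 2 principal components
toy_projection = centered @ components[:2].T
print(toy_projection.shape)  # (200, 2)

# Fraction of the total variance captured by each component
explained = singular_values ** 2 / np.sum(singular_values ** 2)
print(explained)
```

In this toy case almost all of the variance falls on the first component, because the first two features move together.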


In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=9)
pca.fit(X)
projection = pca.transform(X)
projection_df = pd.DataFrame(projection)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# pairplot creates its own figure, so a separate plt.figure() call is not needed
sns.pairplot(projection_df)
plt.show()

### ✏ Question3: PCA plots
How do the different shapes of these plots (such as nugget-shaped, normal, or multi-modal) help us understand the characteristics of clothing categories? For example, do they show variations in style within a category, or do they highlight differences between categories?

## Task 1: Closest Centroid

First, you'll implement a function that takes an array of data points and an array of centroids, and returns an array giving the index of the closest centroid to each data point. Note that the shapes should be:

* `data` has shape $(N, D)$, where $N$ is the number of datapoints and $D$ is the dimensionality of each datapoint.
* `centroids` has shape $(k, D)$, where $k$ is the number of centroids, and $D$ is the dimensionality of each datapoint.
* `closest_centroids` (the return value) has shape $(N,)$, where $N$ is the number of datapoints.

The code below has been outlined in such a way that only basic programming logic needs to be used. You may find `numpy` methods helpful when calculating the Euclidean distance.

Hint: since we have high-dimensional data you can use `np.sum((point - centroid) ** 2)` to find the sum of squared differences when you do the distance calculation.
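As a small worked example of the hint, the sketch here compares one point against the two centroids from the first test case. Because the square root is an increasing function, the squared distance and the Euclidean distance always agree on which centroid is closer. The variable names are illustrative only.

```python
import numpy as np

point = np.array([-2, 2])
centroid_a = np.array([-3, 3])
centroid_b = np.array([3, 3])

# Sum of squared differences (the squared Euclidean distance)
sq_a = np.sum((point - centroid_a) ** 2)  # 1^2 + (-1)^2 = 2
sq_b = np.sum((point - centroid_b) ** 2)  # (-5)^2 + (-1)^2 = 26

# Taking the square root gives the Euclidean distance itself
dist_a = np.sqrt(sq_a)
dist_b = np.sqrt(sq_b)

# Both measures agree that centroid_a is the closer one
print(sq_a < sq_b, dist_a < dist_b)  # True True
```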


In [None]:
import numpy as np

def closest_centroid(data, centroids):
    # init to store the closest centroid index for each data point
    closest_centroids = []

    # looping through each data point
    for point in data:
        min_distance = float('inf')  #init min distance to huge number like infinity
        closest_centroid_index = -1

        # counter for centroid index
        i = 0

        # looping through each centroid to compute the distance to the data point

            # calculate the Euclidean distance


            # if the distance is smaller than the minimum (min_distance) found so far, update the closest centroid (closest_centroid_index)


            # increment the counter by 1

        # appending the index of the closest centroid to the result list
        closest_centroids.append(closest_centroid_index)

    # return the list of closest centroid indices
    return np.array(closest_centroids)


The code chunks below will test your functions. If you run both and no errors occur, your function works as expected.

In [None]:
# testing your function
data = np.array([[-2,2], [-1, 2], [-1,1], [1,1], [1,2], [2,2]])
centroids = np.array([[-3,3], [3,3]])
assert(np.array_equal(closest_centroid(data, centroids), np.array([0,0,0,1,1,1])))

In [None]:
data = train_data[:10]  # First 10 images
centroids = train_data[121:123]  # Images at index 121 and 122 as centroids (the slice end is exclusive)
closest_centroid(data, centroids)
#assert(np.array_equal(closest_centroid(data, centroids), np.array([1, 1, 0, 0, 0, 1, 0, 1, 0, 0])))

### ✏ Question4: Dimensionality and Distances

In high-dimensional spaces like our 784-dimensional images, how does distance calculation become problematic? (This is related to what we call the "curse of dimensionality")

### ✏ Question5: Computational Costs

If we apply this function to all 60,000 images, it will be computationally expensive. How might you modify the approach to make it more efficient for large datasets? For example, we may consider only a subset of the data rather than all 60k images.

### ✏ Question6: Dealing with Ties

What would happen if two centroids were equally distant from a data point? How does your function handle this case, and is this approach appropriate?

## Task 2: Recompute Centroids

Next, you'll define a function that recomputes centroids once each data point has been assigned to a cluster. This function takes an array of datapoints and an array giving the cluster assignments. The index of each centroid should correspond to its cluster number. Note that the shapes should be:

* `data` has shape $(N, D)$, where $N$ is the number of datapoints and $D$ is the dimensionality of each datapoint.
* `labels` has shape $(N,)$, where $N$ is the number of datapoints.
* `centroids` (the return value) has shape $(k, D)$, where $k$ is the number of centroids, and $D$ is the dimensionality of each datapoint.
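
As a worked example, consider the first test case below, where the points $(-2,2)$, $(-1,2)$, $(-1,1)$, and $(1,1)$ are assigned to cluster 0 and the points $(1,2)$ and $(2,2)$ are assigned to cluster 1. Each centroid is the coordinate-wise mean of the points in its cluster:

$$
c_0 = \frac{1}{4}\big[(-2,2) + (-1,2) + (-1,1) + (1,1)\big] = \left(-\tfrac{3}{4},\ \tfrac{3}{2}\right), \qquad
c_1 = \frac{1}{2}\big[(1,2) + (2,2)\big] = \left(\tfrac{3}{2},\ 2\right)
$$

These are the values the first test expects, with $c_0$ stored at index 0 and $c_1$ at index 1.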


In [None]:
import numpy as np

def compute_centroids(data, labels):
    # getting the number of clusters
    k = np.max(labels) + 1 #adding one since we start counting at 0 not 1

    # init to store the new centroids
    new_centroids = []

    # looping through each cluster
    for i in range(k):
        # get the data points(data) assigned to cluster i (check if labels are equal to i)
        cluster_points = #

        # compute the mean (use np.mean) of the data points in the cluster (cluster_points)
        new_centroid = #

        # append the new centroid to our new centroid list
        new_centroids.append(new_centroid)

    # convert new_centroids into a numpy array and return
    return #


The code chunks below will test your functions. If you run both and no errors occur, your function works as expected.

In [None]:
# testing your function
data = np.array([[-2,2], [-1, 2], [-1,1], [1,1], [1,2], [2,2]])
labels = np.array([0,0,0,0,1,1])
assert(np.array_equal(compute_centroids(data, labels), np.array([[-.75, 1.5], [1.5, 2]])))

In [None]:
data = train_data[:10]  # Take the first 10 images for testing
labels = np.array([1, 0, 1, 1, 1, 0, 1, 0, 1, 1])  # Example labels (clusters)
assert(np.array_equal(compute_centroids(data, labels)[0][5:7], np.array([0.6666667 , 0.33333334], dtype='float32')))

### ✏ Question7: Centroid as an Image

A centroid is the mean of all points in a cluster. For image data, what does this "average image" actually represent visually? Would it still look like a recognizable piece of clothing?

### ✏ Question8: Mean vs Median

The mean minimizes the sum of squared distances to all points. What if we used the median instead? How might this change our clusters and when might this be beneficial?

## Task 3: Implement $k$-Means

Now that you've seen how the various components of the algorithm are created, we're going to switch gears and create a model.

Usually when we have a ton of data, we have methods to make it easier on our algorithms (and machines) to create the clusters. For our purposes, we're going to take only three of the types of clothing and do k-means on this subset of data. This will help us as we eventually want to also investigate the clusters to see what is going on.
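
As a minimal sketch of how a subset of classes can be selected, the toy example here keeps only the rows whose label is in a chosen list by using a boolean mask from `np.isin`. The arrays `toy_X` and `toy_y` are hypothetical stand-ins for the real features and labels.

```python
import numpy as np

# Hypothetical toy features (one row per item) and integer labels
toy_X = np.array([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0], [8.0, 9.0]])
toy_y = np.array([4, 0, 8, 7, 0])

# Boolean mask that is True where the label is one of the classes we keep
keep = np.isin(toy_y, [4, 7, 8])

# Apply the same mask to features and labels so they stay aligned
toy_X_subset = toy_X[keep]
toy_y_subset = toy_y[keep]
print(toy_X_subset.shape)  # (3, 2)
print(toy_y_subset)        # [4 8 7]
```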

With large data, the process of choosing the best k also takes a bit of time. Instead, we're going to use some domain knowledge (we're choosing 3 articles of clothing in our subset) and use that to pick our k value.
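
For context, one common way to compare candidate values of k (not the approach used in this assignment) is to fit k-means once per candidate and keep the value with the highest silhouette score. The sketch here does this on made-up 2-D blobs. The data, the range of k values, and the names `toy_X`, `scores`, and `best_k` are all assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical toy data: three well-separated 2-D blobs
toy_X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

# Fit k-means for several candidate k values and record each silhouette score
scores = {}
for k in range(2, 6):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(toy_X)
    scores[k] = silhouette_score(toy_X, model.labels_)

# Keep the k with the highest score
best_k = max(scores, key=scores.get)
print(scores)
print(best_k)
```

Every extra candidate means another full fit plus another silhouette computation, which is why this search becomes slow on tens of thousands of 784-dimensional images.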

In [None]:
dataset = np.column_stack((X, y))
articles = [y_names.index("Coat"), y_names.index("Bag"), y_names.index("Sneaker")]
subset = dataset[np.isin(dataset[:, -1], articles)]
X_subset = subset[:, :-1]
y_subset = subset[:,-1:]

In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

kmeans = KMeans(n_clusters=3, random_state = 42)
kmeans.fit(X_subset)
silhouette_score(X_subset, kmeans.labels_)

### ✏ Question9: Subset Article Choices
Our subset currently contains Coat, Bag, and Sneaker. What if instead we had selected only footwear items: Sandal, Sneaker, Ankle boot? What clustering challenges might arise when items are similar? How might this affect our silhouette score compared to clustering with an unrelated subset?

### ✏ Question10: Silhouette Score

The silhouette score measures how well-separated clusters are. What would a perfect silhouette score be, and what does our current score suggest about our clusters?

# Task 4: Compare Clusters with Ground Truth
Now, we'll compare the clusters found by the k-means algorithm with the true labels of the dataset using a confusion matrix.
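
To show how to read such a table, here is a minimal sketch using `pd.crosstab` on a handful of made-up labels. The arrays `true_labels` and `cluster_ids` are hypothetical and are not results from the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical true labels and cluster assignments for 8 items
true_labels = np.array([4, 4, 4, 7, 7, 8, 8, 8])
cluster_ids = np.array([0, 0, 1, 2, 2, 1, 1, 1])

# Rows (row_0) are true labels, columns (col_0) are cluster ids;
# each cell counts the items with that label/cluster combination
table = pd.crosstab(true_labels, cluster_ids)
print(table)
```

Note that cluster ids are arbitrary: cluster 0 does not have to line up with any particular label, so the table is read by looking for the largest count in each row or column rather than along the diagonal.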

In [None]:
import pandas as pd

# Compare clusters with true labels
pd.crosstab(y_subset.flatten(), kmeans.predict(X_subset))

## Finding particular images

You can investigate further using the `find_pic` function below, which takes a class label (items from row_0 in the matrix above) and a cluster label (items from col_0 in the matrix above) and displays a randomly chosen image that meets both criteria.

In [None]:
def find_pic(article_type, cluster_label):
  """Show a randomly chosen image whose true label is article_type and whose
  k-means cluster is cluster_label."""
  valid_indices = np.where((y_subset.flatten() == article_type) & (kmeans.labels_ == cluster_label))[0]
  if len(valid_indices) == 0:  # nothing to show for this label/cluster pair
    print(f'No images with label {article_type} in cluster {cluster_label}')
    return
  chosen_index = np.random.choice(valid_indices)
  print(chosen_index)
  plt.figure(figsize=(3, 3))
  plt.imshow(X_subset[chosen_index].reshape(28, 28), cmap='gray')
  plt.title(f'Label: {article_type}, Cluster: {cluster_label}')
  plt.show()

In [None]:
find_pic(4,0)

In [None]:
find_pic(4,1)

In [None]:
find_pic(4,2)

### ✏ Question11: Labels

What do the row values 4, 7, and 8 mean in the context of our data?

### 💻 Question12: Identifying Clothing

Is the model generally grouping items according to their true class? Which clothing article type seems easiest for the algorithm to identify? Which is most confused? You can use the `find_pic` function above if that is helpful.

### 💻 Question13: Visual Inspection

Consider the shapes of our articles of clothing. What visual features might cause the algorithm to group certain articles together despite having different labels? Use the function above to find at least two pieces of evidence in the data to support your claim. Add your code below.

### ✏ Question14: Clusters

For each cluster:
* which of the labels appear in the cluster?
* is there a label that occurs significantly more frequently than the others?

Where does the algorithm have difficulty? Why do you think this is happening?

# Reflection

Take a moment to reflect on the assignment.



## ✏ Question 15: Reflection

What did you like about it? What could be improved? Your answers will not affect your overall grade. This feedback will be used to improve future programming assignments.



# Grading
For each of the following accomplishments, there is a breakdown of points, which total 21. The fraction of points earned out of 21 will be multiplied by 5 to get your final score (e.g., 17 points earned will be 17/21 * 5 → 4.05).
* (1pt) Task0 q1: Identified counts and discussed if balanced.
* (1pt) Task0 q2: Identified at least one concern of flattening images
* (1pt) Task0 q3: Discussed distributions of features
* (2pt) Task1: Function `closest_centroid` runs as expected
* (1pt) Task1 q4: Correctly states why dimensions matter when calculating distances
* (1pt) Task1 q5: At least one approach to save on computational costs is discussed.
* (1pt) Task1 q6: Discusses ties and offers a solution
* (2pt) Task2: Function `compute_centroids` runs as expected
* (1pt) Task2 q7: Correctly explains what "average image" means
* (1pt) Task2 q8: Discusses mean vs median appropriately
* (1pt) Task3 q9: Discusses subsets and their influence on the accuracy of the model
* (1pt) Task3 q10: Interprets the silhouette score
* (1pt) Task4 q11: Identifies what each label means in the context of the dataset
* (2pt) Task4 q12: Interprets the confusion matrix for articles of clothing (rows)
* (1pt) Task4 q13: Used the function `find_pic` to support the claims
* (2pt) Task4 q14: Interprets the confusion matrix for clusters (columns)
* (1pt) Task5 q15: You have reflected on the assignment