**Note to grader:** Each question is assigned a score. The final score will be (sum of actual scores)/(sum of maximum scores)*100. The grading rubrics are shown in the TA guidelines.

# **Assignment 4**

The goal of this assignment is to run some experiments with scikit-learn on a fairly sizeable and interesting image data set. This is the MNIST data set, which consists of many images of handwritten digits, each having 28x28 pixels. By today's standards, this may seem relatively tiny, but only a few years ago it was quite challenging computationally, and it motivated the development of several ML algorithms and models that are now state-of-the-art solutions for much bigger data sets.

The assignment is experimental. We will test whether a combination of PCA and kNN can yield good results on the MNIST data set.

Note: This assignment has fewer difficult Python parts. You can get things done by repeating steps from the class notebooks. But your participation and interaction via Canvas is always appreciated!

## Preparation Steps

In [10]:
# Import all necessary python packages
import numpy as np
#import os
#import pandas as pd
#import matplotlib.pyplot as plt
#from matplotlib.colors import ListedColormap
#from sklearn.linear_model import LogisticRegression

In [11]:
# we load the data set directly from scikit learn
#
# note: this operation may take a few seconds. If for any reason it fails we
# can revert back to loading from local storage.

from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split


X, y = fetch_openml('mnist_784', version=1, return_X_y=True)
y = y.astype(int)
X = ((X / 255.) - .5) * 2
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=10000, random_state=123, stratify=y)




## <font color = 'blue'> Question 1. Inspecting the Dataset (50 pts = 10 pts by 5 questions)</font>

**(i)** How many data points are in the training and test sets? <br>
**(ii)** How many attributes does the data set have?

Explain how you found the answer to the first two questions.

[**Hint**: Use the 'shape' attribute of numpy arrays. ]

**(iii)** How many different labels does this data set have? Can you demonstrate how to read that number from the vector of labels *y_train*?  <br>
**(iv)** How does the number of attributes relate to the size of the images? <br>
**(v)** What is the role of line 12 (X = ((X / 255.) - .5) * 2) in the above code?





*(Please insert cells below for your answers. Clearly identify the part of the question you answer)*

In [12]:
print("Answer to question 1:")
print("Number of training data points: ", X_train.shape[0])
print("Number of test data points: ", X_test.shape[0])
print("")

print("Answer to question 2:")
print("Number of attributes: ", X_train.shape[1])
print("")

print("Answer to question 3:")
print("The number of different labels in the data set is 10")
print("Here is how you can read that number from the vector of labels y_train:")
print("Number of unique labels: ", len(np.unique(y_train)))
print("")

print("Answer to question 4:")
print("The number of attributes relates to the size of the images because each attribute represents a pixel in the image.")
print("The images are 28x28 pixels, and therefore have 784 attributes.")
print("")

print("Answer to question 5:")
print("The line X = ((X / 255.) - .5) * 2 is normalizing the pixel values.")
print("The original pixel values, which are integers in the range 0-255, are first scaled down to the range 0-1 by dividing by 255.")
print("Then they are shifted to the range -0.5 to 0.5 by subtracting 0.5. Finally, they are scaled to the range -1 to 1 by multiplying by 2. ")
print("This kind of normalization can help the learning algorithm converge faster.")

Answer to question 1:
Number of training data points:  60000
Number of test data points:  10000

Answer to question 2:
Number of attributes:  784

Answer to question 3:
The number of different labels in the data set is 10
Here is how you can read that number from the vector of labels y_train:
Number of unique labels:  10

Answer to question 4:
The number of attributes relates to the size of the images because each attribute represents a pixel in the image.
The images are 28x28 pixels, and therefore have 784 attributes.

Answer to question 5:
The line X = ((X / 255.) - .5) * 2 is normalizing the pixel values.
The original pixel values, which are integers in the range 0-255, are first scaled down to the range 0-1 by dividing by 255.
Then they are shifted to the range -0.5 to 0.5 by subtracting 0.5. Finally, they are scaled to the range -1 to 1 by multiplying by 2. 
This kind of normalization can help the learning algorithm converge faster.
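
The arithmetic in answers 4 and 5 can be checked on a toy example. This is a minimal sketch, not part of the assignment data: `pixels` and `flat_image` are hypothetical arrays made up for illustration.

```python
import numpy as np

# Hypothetical pixel intensities covering the raw 0-255 range
pixels = np.array([0, 64, 128, 255], dtype=np.uint8)

# Same scaling as in the loading cell: 0 maps to -1 and 255 maps to 1
scaled = ((pixels / 255.) - .5) * 2
print("Scaled values:", scaled)

# A flat vector of 784 attributes corresponds to one 28x28 image
flat_image = np.arange(784)
image = flat_image.reshape(28, 28)
print("Image shape:", image.shape)
```

The same `reshape(28, 28)` call should work on one row of the real data once that row is converted to a numpy array.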


In [13]:
# For grader use only
# maxScore = maxScore + 50


##  <font color = 'blue'> Question 2. PCA on MNIST (10 pts) </font>

Because the number of attributes of the MNIST data set may be too big to apply kNN on it (due to the 'curse of dimensionality'), we want to compress the images down to a smaller number of 'fake' attributes.

Use scikit-learn to output two data sets, *X_train_transformed* and *X_test_transformed*, each with $l$ attributes. Here a reasonable choice of $l$ is 10, equal to the number of labels. But you can try slightly smaller or bigger values as well.

Print out the shape of *X_train_transformed* and *X_test_transformed*.


**Hint**: Take a look at [this notebook](https://colab.research.google.com/drive/1DG5PjWejo8F7AhozHxj8329SuMtXZ874?usp=drive_fs), and imitate what we did there. Be careful though, to use only the scikit-learn demonstration, not the exhaustive PCA steps.

**Note**: This computation can take a while. If problems are encountered we can try the same experiment on a downsized data set.

In [14]:
from sklearn.decomposition import PCA

# number of components to keep
n_components = 10

pca = PCA(n_components=n_components)
X_train_transformed = pca.fit_transform(X_train)
X_test_transformed = pca.transform(X_test)

print("Shape of X_train_transformed: ", X_train_transformed.shape)
print("Shape of X_test_transformed: ", X_test_transformed.shape)


Shape of X_train_transformed:  (60000, 10)
Shape of X_test_transformed:  (10000, 10)
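
One way to judge whether a given $l$ is reasonable is to look at how much variance the kept components explain. The sketch below illustrates the idea on synthetic data, so it runs in a moment. `X_demo`, `latent`, and `mixing` are hypothetical names, and the numbers it prints say nothing about MNIST itself.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: 500 points in 50 dimensions that really live near a
# 5-dimensional subspace, plus a little noise (an assumption for the demo)
latent = rng.normal(size=(500, 5))
mixing = rng.normal(size=(5, 50))
X_demo = latent @ mixing + 0.01 * rng.normal(size=(500, 50))

pca_demo = PCA(n_components=10)
X_demo_reduced = pca_demo.fit_transform(X_demo)

# Cumulative fraction of the total variance captured by the first components
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)
print("Reduced shape:", X_demo_reduced.shape)
print("Cumulative explained variance:", np.round(cumulative, 4))
```

On the fitted `pca` object from the cell above, the same `explained_variance_ratio_` attribute would show how much of the MNIST variance the 10 components retain.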


In [15]:
# for grader use
# maxScore = maxScore + 10



## <font color = 'blue'> Question 3. kNN on MNIST attributes from PCA (40 pts = 10 pts by 4 questions) </font>


Having calculated the *transformed* MNIST data set, we can now apply a kNN approach to the MNIST classification problem. Here are the steps:

(i) Fit a $k$-NN classifier on the transformed data set. Here $k$ is a hyperparameter, and you can experiment with it. Be aware though, that larger $k$ can take more time to fit.

(ii) Apply the classifier on the transformed test set. What is the classification accuracy?

(iii) Experiment with different settings of $k$. Experiment design: calculate the accuracy for increasing values of $k$; stop when the accuracy decreases for 5 consecutive values of $k$; report your findings in a separate cell.

[**Hint**: Take a look at this [notebook](https://colab.research.google.com/drive/1Mh6I3bR8pE90kcs28JfKok59NtfV_7ct?usp=drive_fs)]


In [16]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Initialize the classifier
knn = KNeighborsClassifier(n_neighbors=3)  # start with k=3

# Fit the model
knn.fit(X_train_transformed, y_train)

# Predict the labels for the test set
y_pred = knn.predict(X_test_transformed)

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy with k=3: ", accuracy)

# Experiment with different values of k
k_values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
accuracies = []

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_transformed, y_train)
    y_pred = knn.predict(X_test_transformed)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"Accuracy with k={k}: {accuracy}")

# Stop when accuracy decreases for 5 consecutive k values.
# Five consecutive decreases span six accuracy values, so compare each
# adjacent pair inside a window of six.
for i in range(5, len(accuracies)):
    window = accuracies[i-5:i+1]
    if all(earlier > later for earlier, later in zip(window, window[1:])):
        print(f"Stopped experimenting at k={k_values[i]} as the accuracy decreased for 5 consecutive k values.")
        break


Accuracy with k=3:  0.9321
Accuracy with k=1: 0.9185
Accuracy with k=2: 0.9146
Accuracy with k=3: 0.9321
Accuracy with k=4: 0.9314
Accuracy with k=5: 0.9347
Accuracy with k=6: 0.9327
Accuracy with k=7: 0.9343
Accuracy with k=8: 0.9331
Accuracy with k=9: 0.9345
Accuracy with k=10: 0.9331
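
The stopping rule in part (iii) can also be applied while the loop runs, so that no extra classifiers are fitted after the rule fires. The sketch below is one possible reading of that rule, assuming "decreases for 5 values of k" means five consecutive drops in accuracy. `search_k` and `toy_accuracy` are hypothetical helpers, and the toy curve stands in for a real fit-and-score step.

```python
def search_k(evaluate, max_k=50, patience=5):
    """Try k = 1, 2, ... and stop after `patience` consecutive accuracy drops."""
    history = []            # list of (k, accuracy) pairs
    consecutive_drops = 0
    for k in range(1, max_k + 1):
        acc = evaluate(k)
        # Count a drop only when accuracy is strictly lower than the previous k
        if history and acc < history[-1][1]:
            consecutive_drops += 1
        else:
            consecutive_drops = 0
        history.append((k, acc))
        if consecutive_drops >= patience:
            break
    best_k, best_acc = max(history, key=lambda pair: pair[1])
    return best_k, best_acc, history


def toy_accuracy(k):
    # Hypothetical accuracy curve that peaks at k=4 and then falls steadily
    return 0.9 - 0.01 * abs(k - 4)


best_k, best_acc, history = search_k(toy_accuracy)
print(f"Best k: {best_k}, accuracy: {best_acc}, last k tried: {history[-1][0]}")
```

In the notebook, `evaluate` could be a function that fits `KNeighborsClassifier(n_neighbors=k)` on `X_train_transformed` and returns `accuracy_score` on the transformed test set.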


In [17]:
# for grader use
#maxScore = maxScore + 40



In [18]:
# for grader use

#score = actualScore*100/maxScore