# Assignment 1

Name        : Chia Sin Liew   
Last edited : February 21st, 2022 

The goal of this assignment is to study the K-Nearest Neighbors (K-NN) classifier model.

- **Part A**: Model optimization via feature selection & varying threshold
- **Part B**: Understanding the curse of dimensionality & the fundamental limitation of the K-NN model

In [8]:
import pandas as pd
import numpy as np
import warnings
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score
from sklearn.metrics import roc_curve, roc_auc_score, precision_recall_curve
from collections import defaultdict
from scipy import stats

## Part B: Classification of Unstructured Data

You will create a K-NN classifier (using Scikit-Learn) to perform multi-class classification on the following unstructured dataset.

### **Dataset**:
The CIFAR-10 dataset (Canadian Institute For Advanced Research) contains 60,000 32 x 32 color images in 10 different classes. The 10 different classes represent airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks. There are 6,000 images of each class. You may directly load this dataset using the Keras API: https://keras.io/api/datasets/cifar10/


### **Preprocessing**:

You will need to perform some pre-processing steps. The first step is to reshape the data. The dimension of the training set is 50000 x 32 x 32 x 3 and that of the test set is 10000 x 32 x 32 x 3. Before you use this data for the K-NN model, you need to flatten each sample (i.e., 32 x 32 x 3 = 3072) such that the dimensions of the training and test sets become:
- 50000 x 3072
- 10000 x 3072
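
The flattening step described above can be sketched on a small synthetic batch. This is only an illustration: the array `images` and its values are made up, and passing `-1` to `reshape` is one way to avoid hard-coding the sample count and the 3072 length.

```python
import numpy as np

# Small synthetic batch with the same per-image shape as CIFAR-10
# (the values are arbitrary; this is not the real dataset).
images = np.arange(4 * 32 * 32 * 3).reshape(4, 32, 32, 3)

# -1 lets NumPy infer the flattened length (32 * 32 * 3 = 3072).
flat = images.reshape(len(images), -1)
print(flat.shape)  # (4, 3072)

# Row-major flattening keeps the channel index fastest: pixel (row, col, ch)
# lands at column (row * 32 + col) * 3 + ch of the flattened sample.
row, col, ch = 5, 7, 2
print(flat[1, (row * 32 + col) * 3 + ch] == images[1, row, col, ch])  # True
```

Flattening keeps every pixel value but discards the 2D neighborhood structure, so the K-NN model only ever compares images position by position.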

In [9]:
import keras

# first load data
(X_train, y_train), (X_test, y_test) = keras.datasets.cifar10.load_data()

In [10]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(50000, 32, 32, 3)
(50000, 1)
(10000, 32, 32, 3)
(10000, 1)


In [11]:
# Reshape data 
X_train = X_train.reshape(50000, 3072)
X_test = X_test.reshape(10000, 3072)

# Flatten the (n, 1) label column vectors into 1D arrays of shape (n,)
y_train = y_train.ravel()
y_test = y_test.ravel()


print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(50000, 3072)
(50000,)
(10000, 3072)
(10000,)


#### Data Scaling - Min-Max Scaling

In [12]:
X_train = X_train/255.0
X_test = X_test/255.0

print("\nMin, max for X_train: %.2f, %.2f" % (X_train.min(), X_train.max()))
print("\nMin, max for X_test: %.2f, %.2f" % (X_test.min(), X_test.max()))


Min, max for X_train: 0.00, 1.00

Min, max for X_test: 0.00, 1.00


### **Experiments**:

- **Experiment 5)** Create a K-NN classifier (using Scikit-Learn) and perform multi-class classification. Report train accuracy, test accuracy, and test confusion matrix.

In [13]:
%%time

# Create a KNN classifier
knn_mult = KNeighborsClassifier(n_neighbors=5, p=1)

knn_mult.fit(X_train, y_train)

y_train_predicted = knn_mult.predict(X_train)


CPU times: user 1h 12min 38s, sys: 56.6 s, total: 1h 13min 35s
Wall time: 2h 40min 33s


In [15]:
# Calculate:
#    - training accuracy
train_accuracy_knn = np.mean(y_train_predicted == y_train)
print("\nTraining Accuracy: ", train_accuracy_knn)

#    - test accuracy
test_accuracy_knn = knn_mult.score(X_test, y_test)
print("\nTest Accuracy: ", test_accuracy_knn)

#    - test confusion matrix
# first calculate the no. of correct predictions
y_test_predicted = knn_mult.predict(X_test)
print("\nNo. of correct predictions (Test): %d/%d" % (np.sum(y_test_predicted == y_test), len(y_test)))

# then calculate confusion matrix
print("\nConfusion Matrix (Test):\n", confusion_matrix(y_test, y_test_predicted))


Training Accuracy:  0.53512

Test Accuracy:  0.377

No. of correct predictions (Test): 3770/10000

Confusion Matrix (Test):
 [[582   9 101  10  49   7  25   7 195  15]
 [139 288  89  50 130  40  44  17 168  35]
 [145   5 456  54 206  30  55  13  34   2]
 [ 82  11 215 246 162 109 101  14  52   8]
 [ 92   4 259  40 489  18  43  14  40   1]
 [ 72   4 214 151 166 266  64  14  43   6]
 [ 36   4 259  74 285  27 288   1  25   1]
 [116  10 155  50 259  58  38 267  37  10]
 [154  20  47  33  43  17  10   6 662   8]
 [166  90  71  40  91  30  46  27 213 226]]


### Question 2:
To answer the questions below you need to compare the poor performance of your K-NN model on the CIFAR-10 dataset with the performance of a K-NN model on the MNIST handwritten digits image dataset. Click on the following link and observe the performance of a [K-NN model on the MNIST dataset](https://github.com/rhasanbd/K-Nearest-Neighbors-Learning-Without-Learning/blob/master/K-NN-6-Curse%20of%20Dimensionality.ipynb). The model obtained over 97% test accuracy. 


However, your K-NN model on the CIFAR-10 dataset would perform awfully poorly.


a) Explain why your K-NN model was unable to obtain high test accuracy on the CIFAR-10 image classification problem.

**Ans**: The K-NN model cannot achieve high accuracy on the CIFAR-10 dataset because the images vary too much: objects of the same class appear against different backgrounds and at different angles and scales. As a result, the pixel-wise distances between images of different classes are similar to the distances between images of the same class, so the nearest neighbors carry little information about the class and the K-NN model becomes ineffective.
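
The effect described in this answer can be illustrated with a toy example. The sketch below uses made-up 32 x 32 single-channel images (not CIFAR-10 samples) and the Manhattan distance, matching the `p=1` setting used in the experiment: the same shape moved by a few pixels ends up farther away than a different shape drawn at the same position.

```python
import numpy as np

# Toy "object": a bright 8 x 8 square on a dark 32 x 32 canvas.
square = np.zeros((32, 32))
square[12:20, 12:20] = 1.0

# The same object, moved 8 pixels to the right (no overlap with the original).
shifted_square = np.roll(square, 8, axis=1)

# A different object at the original position: the square with a 4 x 4 hole.
ring = square.copy()
ring[14:18, 14:18] = 0.0

def l1_distance(a, b):
    """Manhattan distance between two flattened images."""
    return float(np.abs(a.ravel() - b.ravel()).sum())

d_same_object = l1_distance(square, shifted_square)
d_other_object = l1_distance(square, ring)

print("square vs. shifted square:", d_same_object)   # 128.0
print("square vs. ring:          ", d_other_object)  # 16.0
```

A nearest-neighbor search on raw pixels would therefore match the square with the ring rather than with its own shifted copy, which is the kind of confusion that position, scale, and background changes cause in CIFAR-10.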

b) Why does a K-NN model perform accurately on the MNIST handwritten digits image classification problem? The following notebooks might be useful to answer this question: https://github.com/rhasanbd/Study-of-Analogy-based-Learning-Image-Classification

**Ans**: The MNIST dataset is a grayscale dataset in which the handwritten digits are centered and scaled to a common size. Because all the digits share the same plain background and have a roughly uniform scale and angle, a K-NN model can accurately determine the digit classes based on the pixel-by-pixel distance between images.

This observation is also supported by the t-SNE plot in this [notebook](https://github.com/rhasanbd/Study-of-Analogy-based-Learning-Image-Classification/blob/main/Study%20of%20Analogy%20based%20Learning-Image%20Classification-3.ipynb), which displays a clear separation of the digit images for the MNIST dataset but a single large blob of overlapping images for the CIFAR-10 dataset.

c) Is it possible to achieve above 90% accuracy on the CIFAR-10 dataset using a K-NN model? Justify your answer.

**Ans**: Theoretically, it is possible to achieve above 90% accuracy on the CIFAR-10 dataset using a K-NN model if high-level features that meaningfully distinguish the 10 classes are available. To obtain these features, a more sophisticated model (or models) capable of extracting them from the raw CIFAR-10 pixels has to be used. Once these features are available, a K-NN model or another K-NN-like model can be used to classify them, as in the implementation by [Xu et al. 2020](https://arxiv.org/pdf/2012.02733v1.pdf).
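
The idea of running K-NN on extracted features instead of raw pixels can be sketched on a toy problem. Everything below is an assumption made for illustration: the 1D "images", the two templates, and the hand-crafted `extract_features` function (an FFT magnitude, which does not change under circular shifts) are stand-ins for real images and for a learned feature extractor. The sketch says nothing about the accuracy attainable on CIFAR-10 and is not the method of Xu et al.

```python
import numpy as np

rng = np.random.default_rng(0)
LENGTH = 64

# Two hypothetical "object" templates with equal energy (toy stand-ins for classes).
TEMPLATES = np.zeros((2, LENGTH))
TEMPLATES[0, [0, 1, 2, 3]] = 1.0   # class 0: one wide block
TEMPLATES[1, [0, 1, 8, 9]] = 1.0   # class 1: two narrow blocks

def make_data(n_per_class):
    """Each sample is a template at a random circular shift plus small noise."""
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            shifted = np.roll(TEMPLATES[label], rng.integers(LENGTH))
            X.append(shifted + 0.02 * rng.standard_normal(LENGTH))
            y.append(label)
    return np.array(X), np.array(y)

def extract_features(X):
    """Hand-crafted stand-in for a learned extractor: the FFT magnitude
    of each sample is unchanged by circular shifts of the input."""
    return np.abs(np.fft.rfft(X, axis=1))

def one_nn_accuracy(train_X, train_y, test_X, test_y):
    """Brute-force 1-NN with Euclidean distance."""
    dists = np.linalg.norm(test_X[:, None, :] - train_X[None, :, :], axis=2)
    predictions = train_y[dists.argmin(axis=1)]
    return float(np.mean(predictions == test_y))

X_tr, y_tr = make_data(4)      # very few labeled examples
X_te, y_te = make_data(100)

raw_acc = one_nn_accuracy(X_tr, y_tr, X_te, y_te)
feat_acc = one_nn_accuracy(extract_features(X_tr), y_tr,
                           extract_features(X_te), y_te)
print(f"1-NN on raw values:               {raw_acc:.2f}")
print(f"1-NN on shift-invariant features: {feat_acc:.2f}")
```

In this toy setting the classifier is the same in both runs; only the representation changes. That mirrors the argument above: the limitation lies in measuring distance on raw pixels, not in the nearest-neighbor rule itself.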