<a href="https://colab.research.google.com/github/ch00226855/CMP414765Spring2021/blob/main/Week04_ImageClassification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 4
# An Image Classification Example
- Learn about the MNIST handwritten digit data
- Visualize and analyze the dataset
- Apply simple SVM and kNN models to build classifiers
- Evaluate the performance of the models

*Readings*: Textbook Chapter 3

## I. The MNIST Data

The **MNIST database** is a large database of handwritten digits that is commonly used for training various image processing systems. The images were collected from digits written by high school students and employees of the United States Census Bureau. The database has a training set of 60,000 examples, and a test set of 10,000 examples.

The original dataset is in a format that is difficult for beginners to use. A version converted to CSV format is available [here](https://pjreddie.com/projects/mnist-in-csv/).

- Right click on the hyperlink "train set" and click "Copy Link Address"
- Use the `wget` command to download the file to the Colab environment.
- Check the "Files" tab on the left to confirm that the CSV file has been successfully downloaded.
- Download `mnist_test.csv` in the same way.
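
Each row of the CSV file stores one image: the first value is the digit label and the remaining 784 values are the pixel intensities of a 28×28 image, listed row by row. The sketch below illustrates that layout on a made-up two-row file with 2×2 "images", so it runs without downloading anything. The names `csv_text` and `toy` are hypothetical and are not used elsewhere in this notebook.

```python
import io

import pandas as pd

# Hypothetical two-row file with 2x2 "images".
# Real MNIST rows hold 1 label + 784 pixel values.
csv_text = "7,0,255,0,255\n3,12,0,0,200\n"

# header=None because the file has no header row,
# so pandas numbers the columns 0, 1, 2, ...
toy = pd.read_csv(io.StringIO(csv_text), header=None)

labels = toy[0]               # column 0: the digit label
pixels = toy.drop(columns=0)  # remaining columns: pixel intensities

print(toy.shape)
print(labels.tolist())

# Reshape the flat pixel row of the first example into a 2D image.
first_image = pixels.iloc[0].to_numpy().reshape(2, 2)
print(first_image)
```

For the real file the same steps apply with `reshape(28, 28)` instead of `reshape(2, 2)`.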

In [None]:
# Download the training CSV file
!wget https://pjreddie.com/media/files/mnist_train.csv

In [None]:
# Exercise: Download the test CSV file.



In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# Load the training set and show its first 5 rows.
raw_data = pd.read_csv("mnist_train.csv", header=None, sep=',')
raw_data.head()

## II. Data Exploration
- Show basic information about the dataset:
    - size, column names, data types
    - class frequencies for each categorical feature
    - max, min, mean for each numerical feature
    - correlation between the class feature and each input feature
- Visualize a data example as an image.
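
As a rough guide to the checks listed above, here is a minimal sketch on a small synthetic data frame with the same layout as MNIST (a `label` column plus pixel columns). The data is random and the names `toy`, `pixel_cols`, `summary`, and `corr_with_label` are made up for illustration. It is one possible way to compute these summaries, not the only one.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in: 200 rows, a "label" column plus 4 pixel columns (0..255).
toy = pd.DataFrame(rng.integers(0, 256, size=(200, 4)), columns=[1, 2, 3, 4])
toy.insert(0, "label", rng.integers(0, 10, size=200))

print(toy.shape)    # size: (rows, columns)
print(toy.dtypes)   # data type of each column
print(toy["label"].value_counts().sort_index())  # class frequencies

# Max, min, and mean of every pixel column (one column of output per pixel).
pixel_cols = [c for c in toy.columns if c != "label"]
summary = toy[pixel_cols].agg(["max", "min", "mean"])
print(summary)

# Pearson correlation between the label and each pixel column.
corr_with_label = toy[pixel_cols].corrwith(toy["label"])
print(corr_with_label)
```

On the real MNIST data, pixel columns that are constant (for example, corner pixels that are always 0) have an undefined correlation and show up as `NaN`.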

In [None]:
# Size, column names, and data types



In [None]:
# class frequencies for each categorical feature
data = raw_data.rename({0: "label"}, axis=1)
data['label'].value_counts().sort_index()

In [None]:
# Plot a bar chart to show class frequencies
data['label'].value_counts().sort_index().plot.bar()

In [None]:
# maximum, minimum, and mean of numerical features



In [None]:
# Visualize data as images
ind = 1
input_features = [x for x in data.columns if x != "label"]
data_example = data.loc[ind, input_features]
print(data_example.shape)

In [None]:
# Convert the data example to a numpy array
data_example_array = data_example.values
print(data_example_array.shape)

In [None]:
# Transform the array to a 28*28 2D array
data_example_array_transformed = data_example_array.reshape([28, 28])
print(data_example_array_transformed.shape)
print(data_example_array_transformed)

In [None]:
plt.imshow(data_example_array_transformed)

In [None]:
# Write a function to automate the process
def get_image(data, ind):
    # Use data.loc to extract the 784 pixel values of row `ind`
    input_features2 = [column for column in data.columns if column != 'label']
    data_example = data.loc[ind, input_features2]

    # Convert the pandas Series to a numpy array
    data_example_numpyArray = data_example.values

    # Change the shape to [28, 28]
    data_example_numpyArray_transformed = data_example_numpyArray.reshape([28, 28])

    # Use imshow() to display the image.
    plt.imshow(data_example_numpyArray_transformed)

#     return data_example_numpyArray_transformed

In [None]:
ind = 123
get_image(data, ind)

## III. Build A Classifier
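
This section trains on a random subsample of the training set, which only works well if every digit still appears often enough in the sample. The sketch below shows one way to check this, using a synthetic label column in place of the real one. The sizes 60,000 and 6,000 mirror the notebook, but the labels are random and all variable names are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic stand-in for the label column of a 60,000-row training set.
labels = pd.Series(rng.integers(0, 10, size=60_000), name="label")

# Draw 6,000 row indices without replacement.
sample_idx = rng.choice(labels.index.to_numpy(), size=6_000, replace=False)
sample_labels = labels.loc[sample_idx]

# Number of sampled examples per digit.
counts = sample_labels.value_counts().sort_index()
print(counts)

# Compare class proportions in the sample against the full set.
comparison = pd.DataFrame({
    "full": labels.value_counts(normalize=True).sort_index(),
    "sample": sample_labels.value_counts(normalize=True).sort_index(),
})
print(comparison.round(3))
```

If the two columns of `comparison` are close for every digit, the subsample is a reasonable miniature of the full training set.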

In [None]:
# Create a smaller training set to reduce training time
sample_size = 6000
samples = np.random.choice(data.index, sample_size, replace=False)
mnist_train_small = data.loc[samples]
print(mnist_train_small.shape)

In [None]:
# Verify mnist_train_small still contains enough training examples for each label



### 1. k-Nearest-Neighbor Method
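
Before working through the cells below, it may help to see the whole fit / predict / score cycle in one place. The sketch uses scikit-learn's small built-in digits dataset (`load_digits`, 8×8 images) as a stand-in so that it runs without the MNIST files. The train/test split and the names `knn_demo`, `demo_predictions`, and `demo_accuracy` are assumptions made for illustration, not part of the exercise solution.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data: 1,797 8x8 digit images bundled with scikit-learn (not MNIST).
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit a 3-nearest-neighbor classifier on the training portion.
knn_demo = KNeighborsClassifier(n_neighbors=3)
knn_demo.fit(X_train, y_train)

# Predict labels for the held-out images and measure accuracy.
demo_predictions = knn_demo.predict(X_test)
demo_accuracy = accuracy_score(y_test, demo_predictions)
print(f"kNN test accuracy on the stand-in data: {demo_accuracy:.3f}")
```

With MNIST the same three calls apply: `fit` on the training features and labels, `predict` on the test features, and `accuracy_score(true_labels, predictions)`.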

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(mnist_train_small[input_features], mnist_train_small['label'])

In [None]:
# Load the test set



In [None]:
# Use the model to make predictions on the test images



In [None]:
# Compute prediction accuracy
from sklearn.metrics import accuracy_score



### 2. Support Vector Machine
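
A linear SVM follows the same fit / predict / score pattern as kNN. The sketch below again uses scikit-learn's built-in `load_digits` data as a stand-in for MNIST. The pixel scaling, `max_iter=10_000`, and the names `svm_demo`, `svm_predictions`, and `svm_accuracy` are illustrative choices, not settings prescribed by this notebook.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Stand-in data: 8x8 digit images bundled with scikit-learn (not MNIST).
X, y = load_digits(return_X_y=True)
X = X / 16.0  # load_digits pixels range from 0 to 16; scale them to [0, 1]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# A larger max_iter gives the solver more room to converge.
svm_demo = LinearSVC(max_iter=10_000, random_state=0)
svm_demo.fit(X_train, y_train)

svm_predictions = svm_demo.predict(X_test)
svm_accuracy = accuracy_score(y_test, svm_predictions)
print(f"Linear SVM test accuracy on the stand-in data: {svm_accuracy:.3f}")
```

MNIST pixel values range from 0 to 255, so the analogous scaling there would divide by 255. Scaling often helps `LinearSVC` converge, although it is not strictly required.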

In [None]:
# Build a linear SVM classifier
from sklearn.svm import LinearSVC



In [None]:
# Use the model to make predictions on the test images



In [None]:
# Calculate accuracy score



## IV. Performance Evaluation
- Test Accuracy
- Error images
- Confusion matrix

Besides test accuracy, we should use other metrics to better understand how the models perform on different types of inputs. Let's first find some images that the SVM model labels incorrectly.
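
The idea of extracting mis-classified rows can be sketched on a tiny made-up results table. The labels and predictions here are invented for illustration, and the column name `prediction` is an assumption. Any name works as long as it is used consistently.

```python
import pandas as pd

# Hypothetical results table: true labels and a model's predictions.
results = pd.DataFrame({
    "label":      [3, 5, 8, 1, 9, 5],
    "prediction": [3, 3, 8, 1, 4, 5],
})

# Boolean mask: True where the prediction disagrees with the true label.
is_error = results["prediction"] != results["label"]

# Keep only the mis-classified rows.
errors = results[is_error]
print(errors)

# The mean of a boolean mask is the fraction of True values.
print(f"error rate: {is_error.mean():.2f}")
```

The index values of `errors` identify which rows were mis-classified, so they can be passed to a display helper such as `get_image` to look at the corresponding images.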

In [None]:
# Append predictions as a new column to the test data frame.



In [None]:
# Extract rows that are mis-classified



In [None]:
# Show one image that is incorrectly classified



We can also look at the classification accuracy for each label via the confusion matrix.
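
To make the layout of a confusion matrix concrete, here is a small sketch with three classes and made-up labels. scikit-learn's `confusion_matrix` takes the true labels first and the predictions second, so row `i` counts the examples whose true label is `i` and column `j` counts those predicted as `j`. The arrays `y_true` and `y_pred` are invented for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true labels and predictions for a 3-class problem.
y_true = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 2, 2, 2, 0])

# Rows: true labels. Columns: predicted labels.
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[1 1 0]
#  [0 2 1]
#  [1 0 3]]

# Diagonal entries are correct predictions; dividing by the row sums
# gives the fraction of each true class that was classified correctly.
per_class_accuracy = cm.diagonal() / cm.sum(axis=1)
print(per_class_accuracy)
```

Large off-diagonal entries point to pairs of digits that the model tends to confuse.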

In [None]:
# Construct the confusion matrix
from sklearn.metrics import confusion_matrix
# confusion_matrix expects (y_true, y_pred): rows are true labels, columns are predictions
mat = confusion_matrix(mnist_test['label'], test_predictions)
print(mat)

In [None]:
# Visualize the confusion matrix as an image


