# Week 5
# An Image Classification Example (Chapter 3)
- Learn about the MNIST hand-written digit data
- Visualize and analyze the dataset
- Apply simple SVM and kNN models to build a classifier
- Evaluate the performance of the models

## I. The MNIST Data

The **MNIST database** is a large database of handwritten digits that is commonly used for training various image processing systems. The images were collected from digits written by high school students and employees of the United States Census Bureau. The database has a training set of 60,000 examples, and a test set of 10,000 examples.

The original dataset is distributed in a binary format that is awkward for beginners to read. A CSV version is available on [Kaggle](https://www.kaggle.com/oddrationale/mnist-in-csv). To get the data:
- Log in to Kaggle.com
- Click "Download" to download the data as a `.zip` file
- Extract `mnist_train.csv` and `mnist_test.csv` from the `.zip` file
- Load `mnist_train.csv` as a data frame
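
The last step might look like the sketch below. The helper name `load_mnist_csv` is hypothetical, and the path in the usage comment is an assumption that mirrors the test-set path used later in this notebook; adjust it to wherever the files were extracted.

```python
import pandas as pd

def load_mnist_csv(path):
    """Read one MNIST CSV file into a data frame.

    Assumes the file has a 'label' column plus one column per pixel.
    """
    data = pd.read_csv(path)
    if "label" not in data.columns:
        raise ValueError("expected a 'label' column in " + str(path))
    return data

# Usage (the path is an assumption; change it to match your own folder):
# mnist_train = load_mnist_csv("Data/mnist-in-csv/mnist_train.csv")
# mnist_train.head()
```

Checking for the `label` column up front gives a clearer error than a `KeyError` several cells later if the wrong file is loaded.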

![](https://upload.wikimedia.org/wikipedia/commons/thumb/2/27/MnistExamples.png/330px-MnistExamples.png)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# Load the training set and show its first 5 rows.


## II. Data Exploration
- Show basic information about the dataset:
    - size, column names, data types
    - class frequencies for each categorical feature
    - max, min, mean for each numerical feature
    - correlation between the class feature and each input feature
- Visualize a data example as an image.
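
The summary steps above can be tried on a small synthetic frame first. The sketch below uses a made-up stand-in (`toy`, with four hypothetical pixel columns `p1` to `p4`) rather than the real MNIST data; the same calls should work on `mnist_train`.

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for mnist_train: a 'label' column plus 4 "pixel" columns.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.integers(0, 256, size=(100, 4)),
                   columns=["p1", "p2", "p3", "p4"])
toy["p4"] = 0   # a constant column, like MNIST pixels that are always blank
toy.insert(0, "label", rng.integers(0, 10, size=100))

print(toy.shape)     # (number of rows, number of columns)
print(toy.dtypes)    # data type of every column
print(toy["label"].value_counts().sort_index())   # class frequencies

# Max, min, and mean of each input feature
pixel_columns = [c for c in toy.columns if c != "label"]
summary = toy[pixel_columns].agg(["max", "min", "mean"])
print(summary)

# Pearson correlation between the label and each input feature.
# A constant column has zero variance, so its correlation is NaN.
label_corr = toy[pixel_columns].corrwith(toy["label"])
print(label_corr)
```

Because the label is a category rather than a quantity, these correlations are only a rough screening tool; a pixel can be very informative for one digit and still show a correlation near zero.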

In [None]:
# Size, column names, and data types


In [None]:
# class frequencies for each categorical feature
mnist_train['label'].value_counts()

In [None]:
mnist_train['label'].value_counts().sort_index()

In [None]:
# Plot a bar chart to show class frequencies
mnist_train['label'].value_counts().sort_index().plot.bar()

In [None]:
# maximum, minimum, and mean of numerical features


In [None]:
mnist_train.describe()

In [None]:
# Visualize data as images
ind = 12345
input_features = [x for x in mnist_train.columns if x != "label"]
data_example = mnist_train.loc[ind, input_features]
print(data_example.shape)

In [None]:
# Convert the data example to a numpy array
data_example_array = data_example.values
print(data_example_array.shape)

In [None]:
# Transform the array to a 28*28 2D array
data_example_array_transformed = data_example_array.reshape([28, 28])
print(data_example_array_transformed.shape)
print(data_example_array_transformed)

In [None]:
plt.imshow(data_example_array_transformed)

In [None]:
# Write a function to automate the process
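def get_image(data, ind):
    # Select the pixel columns (everything except the label) for row `ind`
    input_features = [x for x in data.columns if x != "label"]
    data_example = data.loc[ind, input_features]
    # Convert to a numpy array and reshape the 784 values into a 28x28 grid
    data_example_array_transformed = data_example.values.reshape([28, 28])
    return data_example_array_transformed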

In [None]:
img = get_image(mnist_train, ind)
plt.imshow(img)

## III. Build A Classifier

In [None]:
# Create a smaller training set to reduce training time
sample_size = 6000
samples = np.random.choice(mnist_train.index, sample_size, replace=False)
mnist_train_small = mnist_train.loc[samples]
print(mnist_train_small.shape)

### 1. k-Nearest-Neighbor Method
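
A minimal sketch of the fit/predict/score workflow follows. So that it runs without any download, it uses scikit-learn's bundled 8x8 digits dataset (`load_digits`) as a stand-in for the MNIST CSV files; the value `n_neighbors=5` and the 75/25 split are illustrative assumptions, not tuned choices.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data: 1,797 images of 8x8 digits (NOT the 28x28 MNIST set).
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Each prediction is a majority vote among the 5 closest training images.
knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X_train, y_train)

# Predict labels for the held-out images and measure the fraction correct.
knn_predictions = knn_model.predict(X_test)
knn_accuracy = accuracy_score(y_test, knn_predictions)
print(knn_accuracy)
```

With the notebook's data, the same calls would take `mnist_train_small[input_features]` and `mnist_train_small['label']` for `fit`, and `mnist_test[input_features]` for `predict`.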

In [None]:
from sklearn.neighbors import KNeighborsClassifier


In [None]:
# Load the test data set
mnist_test = pd.read_csv('Data/mnist-in-csv/mnist_test.csv', sep=',')
mnist_test.head()

In [None]:
# Use the model to make predictions on the test images


In [None]:
from sklearn.metrics import accuracy_score
# accuracy_score expects the true labels first, then the predictions
accuracy = accuracy_score(mnist_test['label'], test_predictions)
print(accuracy)

### 2. Support Vector Machine
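
The same workflow applies to a linear SVM. The sketch below again uses scikit-learn's bundled 8x8 digits dataset as a stand-in for MNIST; the scaling step and `dual=False` are assumptions chosen to keep training fast and stable, not settings taken from this notebook.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Stand-in data: 8x8 digit images (NOT the 28x28 MNIST set).
X, y = load_digits(return_X_y=True)
X = X / 16.0   # pixel values in this dataset range from 0 to 16
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# dual=False solves the primal problem, which suits data with
# more samples than features.
svm_model = LinearSVC(dual=False, random_state=0)
svm_model.fit(X_train, y_train)

# Predict labels for the held-out images and measure the fraction correct.
svm_predictions = svm_model.predict(X_test)
svm_accuracy = accuracy_score(y_test, svm_predictions)
print(svm_accuracy)
```

For the MNIST CSV data, pixel values run from 0 to 255, so the corresponding scaling would divide by 255.0 on both the training and test features.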

In [None]:
from sklearn.svm import LinearSVC


In [None]:
# Use the model to make predictions on the test images


In [None]:
# Calculate accuracy score

## IV. Performance Evaluation
- Test Accuracy
- Error images
- Confusion matrix
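
The three items above can be illustrated on a tiny hand-made example. The labels below are made up for illustration and are not MNIST results; with real data, `y_true` would be `mnist_test['label']` and `y_pred` the model's predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for 8 examples from 3 classes
y_true = np.array([0, 1, 2, 2, 1, 0, 2, 1])
y_pred = np.array([0, 2, 2, 2, 1, 0, 1, 1])

# Test accuracy: fraction of predictions that match the true labels
accuracy = accuracy_score(y_true, y_pred)
print(accuracy)           # 0.75

# Positions of the wrongly predicted examples
error_positions = np.flatnonzero(y_true != y_pred)
print(error_positions)    # [1 6]

# Confusion matrix: row i, column j counts examples whose true class is i
# and whose predicted class is j
mat = confusion_matrix(y_true, y_pred)
print(mat)
# [[2 0 0]
#  [0 2 1]
#  [0 1 2]]
```

Any entry of `error_positions` can then be used to look up the corresponding test image and inspect why the model got it wrong.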

In [None]:
# Find one image with wrong prediction


In [None]:
# Confusion matrix
from sklearn.metrics import confusion_matrix
# True labels go first, so rows are true classes and columns are predictions
mat = confusion_matrix(mnist_test['label'], test_predictions)
print(mat)

In [None]:
# Visualize the confusion matrix
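
One possible way to draw the matrix is as a heat map with the count written in each cell. The helper name `plot_confusion_matrix` is hypothetical, the example counts are made up, and the axis labels assume the matrix was built with the true labels as the first argument.

```python
import matplotlib.pyplot as plt
import numpy as np

def plot_confusion_matrix(mat, class_names):
    """Draw a confusion matrix as a heat map with the count in each cell."""
    fig, ax = plt.subplots(figsize=(5, 5))
    image = ax.imshow(mat, cmap="Blues")
    fig.colorbar(image, ax=ax)
    ax.set_xticks(range(len(class_names)))
    ax.set_yticks(range(len(class_names)))
    ax.set_xticklabels(class_names)
    ax.set_yticklabels(class_names)
    ax.set_xlabel("Predicted label")
    ax.set_ylabel("True label")
    # Write each count at the center of its cell
    for row in range(mat.shape[0]):
        for col in range(mat.shape[1]):
            ax.text(col, row, str(mat[row, col]), ha="center", va="center")
    return ax

# Illustrative 3-class matrix (made-up counts, not MNIST results)
example_mat = np.array([[2, 0, 0], [0, 2, 1], [0, 1, 2]])
ax = plot_confusion_matrix(example_mat, class_names=[0, 1, 2])
```

For the ten digit classes, the call would pass the 10x10 matrix and `class_names=range(10)`; large off-diagonal cells then point to the digit pairs the model confuses most often.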

