# MNIST handwritten digits dimensionality reduction with scikit-learn

In this notebook, we'll use some popular methods to reduce the dimensionality of MNIST digits data before classification.

[Section 1](#1.-Feature-extraction) of the notebook contains examples of feature extraction methods, and [Section 2](#2.-Feature-selection) contains two methods for feature selection. Any of these methods can then be used to train an MNIST digit classifier on lower-dimensional data in [Section 3](#3.-Classification-with-dimension-reduced-data).

First, the needed imports.

In [None]:
%matplotlib inline

from pml_utils import get_mnist

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import __version__
from sklearn import decomposition, feature_selection
from skimage.measure import block_reduce
from skimage.feature import canny

import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

from packaging.version import Version
assert Version(__version__) >= Version("0.20"), "Version >= 0.20 of sklearn is required."

Then we load the MNIST data. The first time this cell is run, the data may need to be downloaded, which can take a while.

In [None]:
X_train, y_train, X_test, y_test = get_mnist('MNIST')

print('MNIST data loaded: train:',len(X_train),'test:',len(X_test))
print('X_train:', X_train.shape)
print('y_train:', y_train.shape)
print('X_test:', X_test.shape)
print('y_test:', y_test.shape)

## 1. Feature extraction

### 1.1 PCA

[Principal component analysis](http://scikit-learn.org/stable/modules/decomposition.html#pca) (PCA) is a standard method to decompose a high-dimensional dataset into a set of successive orthogonal components that explain a maximum amount of the variance. Here we project the data onto the first `n_components` principal components. Each component has the maximal possible variance under the constraint that it is orthogonal to the preceding components.

The option `whiten=True` can be used to whiten the outputs to have unit component-wise variances.  Its usefulness depends on the model to be used.

In [None]:
%%time
n_components = 50
pca = decomposition.PCA(n_components=n_components, whiten=True)
X_pca = pca.fit_transform(X_train)
print('X_pca:', X_pca.shape)

We can inspect the amount of variance explained by the principal components.

In [None]:
plt.figure()
plt.plot(np.arange(n_components)+1, pca.explained_variance_)
plt.title('Explained variance by PCA components')
plt.ylabel('explained variance')
plt.xlabel('PCA component');
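
One way to choose `n_components` is to look at the cumulative fraction of variance explained. The following is a minimal sketch on synthetic data, not on MNIST: the array `X_demo`, the name `pca_demo`, and the 95% target are illustrative assumptions. It relies on the `explained_variance_ratio_` attribute of a fitted `PCA` object.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for X_train: 500 samples, 20 features whose
# scale decreases from feature to feature (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 20)) * np.linspace(10.0, 0.1, 20)

pca_demo = PCA(n_components=20).fit(X_demo)

# Cumulative fraction of the total variance explained by the first k components.
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)

# Smallest number of components that reaches the (assumed) 95% target.
target = 0.95
n_needed = int(np.searchsorted(cumulative, target) + 1)
print('components needed for %.0f%% of the variance: %d' % (100 * target, n_needed))
```

The same computation should work with the `pca` object fitted above, although with `n_components = 50` the cumulative sum only covers the variance captured by those 50 components.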

### 1.2 Image feature extraction

There are a lot of different feature extraction methods for image data.  Common ones include extraction of colors, textures, and shapes from images, or detection of edges, corners, lines, blobs, or templates.  Let's try a simple filtering-based method to reduce the dimensionality of the features, and a widely-used edge detector.

The [`measure.block_reduce()`](http://scikit-image.org/docs/dev/api/skimage.measure.html#skimage.measure.block_reduce) function from scikit-image applies a function (for example `np.mean`, `np.max` or `np.median`) to blocks of the image, resulting in a downsampled image.

In [None]:
X_train_img = X_train.reshape(-1, 28, 28)
filter_size = 2
X_train_img_downsampled = block_reduce(X_train_img, 
                                       block_size=(1, filter_size, filter_size), 
                                       func=np.mean)

print('X_train_img:', X_train_img.shape)
print('X_train_img_downsampled:', X_train_img_downsampled.shape)
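
To make the block-wise averaging concrete, here is a small NumPy-only sketch of the same idea on a 4x4 array. It is an illustration under the assumption that the image size is an exact multiple of the block size, not a reimplementation of `block_reduce()`; the names `img`, `block`, and `img_small` are made up for this example.

```python
import numpy as np

# A tiny 4x4 "image" so that the block means can be checked by hand.
img = np.arange(16, dtype=float).reshape(4, 4)

# Split each axis into (number of blocks, block size) and average
# over the two block-size axes. This assumes the image size is an
# exact multiple of the block size.
block = 2
blocks = img.reshape(4 // block, block, 4 // block, block)
img_small = blocks.mean(axis=(1, 3))

print(img_small)
```

Each output pixel is the mean of one 2x2 block: the top-left value, for instance, is the mean of 0, 1, 4 and 5, that is 2.5.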

The [`feature.canny()`](http://scikit-image.org/docs/dev/api/skimage.feature.html#skimage.feature.canny) function applies the [Canny edge detector](https://en.wikipedia.org/wiki/Canny_edge_detector) to extract edges from the image.  Processing all images may take a couple of minutes.

In [None]:
%%time

sigma = 1.0
X_train_img_canny = np.zeros(X_train_img.shape)
for i in range(X_train_img.shape[0]):
    X_train_img_canny[i,:,:] = canny(X_train_img[i,:,:], sigma=sigma)
print('X_train_img_canny:', X_train_img_canny.shape)

Let's compare the original and filtered digit images:

In [None]:
pltsize=1

plt.figure(figsize=(10*pltsize, pltsize))
plt.suptitle('Original')
plt.subplots_adjust(top=0.8)
for i in range(10):
    plt.subplot(1,10,i+1)
    plt.axis('off')
    plt.imshow(X_train_img[i,:,:], cmap="gray", interpolation='none')

plt.figure(figsize=(10*pltsize, pltsize))
plt.suptitle('Downsampled with a %dx%d filter' % (filter_size, filter_size))
plt.subplots_adjust(top=0.8)
for i in range(10):
    plt.subplot(1,10,i+1)
    plt.axis('off')
    plt.imshow(X_train_img_downsampled[i,:,:], cmap="gray", interpolation='none')
    
plt.figure(figsize=(10*pltsize, pltsize))
plt.suptitle('Canny edge detection with sigma=%.2f' % sigma)
plt.subplots_adjust(top=0.8)
for i in range(10):
    plt.subplot(1,10,i+1)
    plt.axis('off')
    plt.imshow(X_train_img_canny[i,:,:], cmap="gray", interpolation='none')
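
The downsampled and edge-detected arrays are three-dimensional (samples, height, width), whereas scikit-learn estimators expect a two-dimensional array (samples, features). A minimal sketch of the flattening step, using a zero-filled stand-in array `imgs` rather than the actual data:

```python
import numpy as np

# Stand-in for X_train_img_downsampled: 5 images of 14x14 pixels.
imgs = np.zeros((5, 14, 14))

# Flatten every image into one feature vector per sample.
X_flat = imgs.reshape(imgs.shape[0], -1)
print('X_flat:', X_flat.shape)
```

Applied to `X_train_img_downsampled` with `filter_size = 2`, the same reshape would give 14 x 14 = 196 features per digit instead of the original 784.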

## 2. Feature selection

### 2.1 Low variance

The MNIST digits have a lot of components (pixels) with little variance.  These components are not particularly useful for discriminating between the classes, so they can probably be removed safely.  Let's first draw the component-wise variances of MNIST data.

In [None]:
variances = np.var(X_train, axis=0)
plt.figure()
plt.plot(variances)
plt.title('Component-wise variance of MNIST digits')
plt.ylabel('variance')
plt.xlabel('component');

The variances can also be plotted for each pixel in the image plane.

In [None]:
plt.figure()
sns.heatmap(variances.reshape(28,28), cmap=sns.color_palette("Blues"))
plt.title('Pixel-wise variance of MNIST digits')
plt.grid(False)

Select an appropriate `variance_threshold` based on the *"Component-wise variance of MNIST digits"* figure above.

In [None]:
%%time

variance_threshold = 1000
lv = feature_selection.VarianceThreshold(threshold=variance_threshold)
X_lv = lv.fit_transform(X_train)
print('X_lv:', X_lv.shape)
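
To get a feel for how the threshold controls the number of retained features, the following sketch counts the kept features for a few thresholds. The data are synthetic, and the threshold values and the names `X_demo`, `scales`, and `kept` are assumptions made for illustration only.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Synthetic stand-in for X_train: 200 samples, 6 features with
# standard deviations 0, 1, 5, 10, 30 and 50 (illustrative only).
rng = np.random.default_rng(0)
scales = np.array([0.0, 1.0, 5.0, 10.0, 30.0, 50.0])
X_demo = rng.normal(size=(200, 6)) * scales

kept = {}
for threshold in [0.0, 50.0, 500.0]:
    selector = VarianceThreshold(threshold=threshold).fit(X_demo)
    # get_support() returns a boolean mask of the retained features.
    kept[threshold] = int(selector.get_support().sum())
    print('threshold=%g keeps %d of %d features'
          % (threshold, kept[threshold], X_demo.shape[1]))
```

With the fitted selector from the notebook, the test data would be reduced to the same columns with `lv.transform(X_test)`.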

### 2.2 Univariate feature selection

Another method for feature selection is to select the *k* best features based on univariate statistical tests between the features and the class of each sample.  This makes it a supervised method, so we need to pass `y_train` to `fit_transform()`.
See [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection) for the set of available statistical tests and other further options.

In [None]:
%%time

k = 50
ukb = feature_selection.SelectKBest(k=k)
X_ukb = ukb.fit_transform(X_train, y_train)
print('X_ukb:', X_ukb.shape)
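
When no `score_func` is given, `SelectKBest` uses the ANOVA F-test (`f_classif`). The sketch below, on synthetic data, shows how a different test can be passed explicitly; `chi2` is used as one possible alternative because it accepts non-negative features such as pixel intensities. The data and the names `X_demo`, `y_demo`, and `selected_by` are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2, f_classif

# Synthetic stand-in: 300 samples, 10 non-negative "pixel" features,
# two classes. Only features 0 and 1 depend on the class label.
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=300)
X_demo = rng.integers(0, 50, size=(300, 10)).astype(float)
X_demo[:, 0] += 100 * y_demo
X_demo[:, 1] += 100 * (1 - y_demo)

selected_by = {}
for name, score_func in [('f_classif', f_classif), ('chi2', chi2)]:
    selector = SelectKBest(score_func=score_func, k=2).fit(X_demo, y_demo)
    # Indices of the features retained by this test.
    selected_by[name] = np.flatnonzero(selector.get_support())
    print(name, 'selects features', selected_by[name])
```

On MNIST, pixels that are constant across the training set can cause `f_classif` to emit warnings, which is one reason to experiment with other tests.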

We can check which features (that is, pixels in this case) got selected:

In [None]:
support = ukb.get_support()
plt.figure()
sns.heatmap(support.reshape(28,28), cmap=sns.color_palette("Blues"))
#with sns.axes_style("white"):
#    plt.imshow(support.reshape(28,28), interpolation='none')
plt.title('Support of SelectKBest() with k=%d' % k)
plt.grid(False)

## 3. Classification with dimension-reduced data 

Let's now train a classifier using lower-dimensional data. Choose any of the above feature extraction or feature selection methods, and reduce the dimensionality of the MNIST data with that method. You can also implement your own dimensionality reduction method.

Note that you also need to transform the test data into the lower-dimensional space using `transform()`.  Here is an example for PCA:

In [None]:
X_test_pca = pca.transform(X_test)
print('X_test_pca:', X_test_pca.shape)

Select a classification method from the ones that have been discussed in the previous lectures. For example, nearest neighbor classifiers or decision trees are good choices. Compare the results (accuracy, time) to classification using the original MNIST data.
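
As a starting point, the comparison could be structured along the following lines. This is only a sketch under stated assumptions: it uses the small 8x8 digits dataset bundled with scikit-learn (`load_digits`) as a quick stand-in for MNIST, and the choices of 20 components and a 3-nearest-neighbor classifier are arbitrary.

```python
import time

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Small 8x8 digits dataset bundled with scikit-learn, used here as a
# stand-in for MNIST so that the example runs quickly.
X_demo, y_demo = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

# Fit PCA on the training data only, then apply the same projection
# to the test data.
pca_demo = PCA(n_components=20).fit(X_tr)
X_tr_pca = pca_demo.transform(X_tr)
X_te_pca = pca_demo.transform(X_te)

results = {}
datasets = {'original': (X_tr, X_te), 'PCA': (X_tr_pca, X_te_pca)}
for name, (train, test) in datasets.items():
    start = time.perf_counter()
    clf = KNeighborsClassifier(n_neighbors=3).fit(train, y_tr)
    results[name] = clf.score(test, y_te)
    elapsed = time.perf_counter() - start
    print('%s: %d features, accuracy %.3f, %.2f s'
          % (name, train.shape[1], results[name], elapsed))
```

For the actual exercise, replace the stand-in arrays with `X_train`, `X_test` and their reduced counterparts such as `X_pca` and `X_test_pca`. Timings on a dataset this small say little about MNIST-scale behavior.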


## 4. Other methods for dimensionality reduction

Study and experiment with additional dimensionality reduction methods based on [decomposition](http://scikit-learn.org/stable/modules/decomposition.html) or [feature selection](http://scikit-learn.org/stable/modules/feature_selection.html).  See also [unsupervised dimensionality reduction](http://scikit-learn.org/stable/modules/unsupervised_reduction.html).