<img src="data/images/lecture-notebook-header.png" />

# Dimensionality Reduction: Digits Dataset

This notebook doesn't introduce any new concepts; it applies PCA, LDA, and t-SNE to a different dataset. The IRIS dataset is "too simple": it has only 4 features to begin with, and those 4 features show strong linear relationships, so PCA and LDA easily yield good results. With only 150 samples, the IRIS dataset is also very small, so all 3 dimensionality reduction methods run very fast.

The [Digits dataset](https://scikit-learn.org/stable/auto_examples/datasets/plot_digits_last_image.html) provided by `scikit-learn` is made up of 1,797 8x8 images. Each image, like the ones shown below, is of a hand-written digit, represented as a grayscale image with pixel values ranging from 0 to 16 that indicate the intensity of each pixel. To use an 8x8 image like this, we first have to flatten it into a feature vector of length 64, resulting in 64 features. And since these features represent pixel values, there are no obvious linear relationships between them.
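
The flattening step described above can be sketched as follows. This is a minimal illustration, assuming the `images` and `data` attributes returned by `load_digits`; the variable names `first_image` and `flattened` are only chosen for this example.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# Each entry of digits.images is an 8x8 array of pixel intensities
first_image = digits.images[0]

# Flatten the 8x8 grid row by row into a vector of length 64
flattened = first_image.reshape(-1)

print('Image shape: {}'.format(first_image.shape))
print('Feature vector shape: {}'.format(flattened.shape))

# digits.data already holds the flattened version of every image
print('Matches digits.data[0]: {}'.format(np.array_equal(flattened, digits.data[0])))
```

In practice there is no need to flatten the images by hand, since `digits.data` already provides the 64-dimensional feature vectors.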

## Setting up the Notebook

### Specify how Plots Get Rendered

In [None]:
%matplotlib inline

### Make all Required Imports

In [None]:
import numpy as np
import pandas as pd

from tqdm import tqdm

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.manifold import TSNE

import matplotlib.pyplot as plt
import matplotlib.cm as cm

### Load and Prepare Dataset (Digits)

The Digits dataset is part of the `scikit-learn` package, making it very easy to load into the right format.

In [None]:
digits = load_digits()

X = digits.data
y = digits.target

print('The dataset contains {} samples and {} features.'.format(X.shape[0], X.shape[1]))

For illustration, the code cell below shows a few images from the dataset.

In [None]:
_, axes = plt.subplots(nrows=1, ncols=6, figsize=(10, 3))
for ax, image, label in zip(axes, digits.images, digits.target):
    ax.set_axis_off()
    ax.imshow(image, cmap=plt.cm.gray_r, interpolation="nearest")
    ax.set_title('Sample: {}'.format(label))

---

## Principal Component Analysis (PCA)

As the dataset is still not really large, we can easily afford to calculate PCA with different values for `n_components` and see how much of the variance is explained by the resulting components.

In [None]:
x_vals, y_vals = [], []

for n in tqdm(range(1, 33)):
    
    pca = PCA(n_components=n).fit(X)

    x_vals.append(n)
    y_vals.append(np.sum(pca.explained_variance_ratio_))

The first thing to observe is that the computation is still very fast. After all, the dataset is not truly large.

Since the overall explained variance for each value of `n_components` is a value between 0 and 1, we can use a line plot to visualize the relationship.

In [None]:
plt.figure()
plt.tick_params(labelsize=14)
# Label the axes with what is actually plotted (values are ratios between 0 and 1)
plt.xlabel('number of components', fontsize=18)
plt.ylabel('explained variance (ratio)', fontsize=18)
plt.plot(x_vals, y_vals)
plt.tight_layout()
plt.show()

Compared to the IRIS dataset, we need quite a number of principal components to explain a good amount of the variance, e.g., more than 20 components to explain more than 90% of the variance. This is in line with our intuition that there are no clear linear relationships between the features representing pixel values.
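
As a side note, the number of components needed for a given share of the variance can also be read off a single PCA fit, because the explained variance ratios of the leading components do not depend on how many components are kept. The sketch below assumes a threshold of 90%; the names `pca_full`, `cumulative`, `threshold`, and `n_needed` are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data

# Fit PCA once, keeping all components
pca_full = PCA().fit(X)

# Cumulative share of the variance explained by the first 1, 2, 3, ... components
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches the threshold
threshold = 0.90  # assumed value for illustration
n_needed = int(np.argmax(cumulative >= threshold)) + 1

print('Components needed for {:.0%} of the variance: {}'.format(threshold, n_needed))
```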

For plotting, this is a bit problematic. Of course, we can reduce the number of features down to 2 and have a look at the resulting plot. But from the result we have so far we can expect a very poor separation of the class labels.

In [None]:
pca = PCA(n_components=2).fit(X)

print('Overall explained variance: {:.3f}'.format(np.sum(pca.explained_variance_ratio_)))

X_pca = pca.transform(X)

And here is the corresponding plot:

In [None]:
plt.figure()
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap=cm.tab10, s=50)
plt.tick_params(top=False, bottom=False, left=False, right=False, labelleft=False, labelbottom=False)  
plt.tight_layout()
plt.show()

One can vaguely make out the 10 different classes, but the overlap between the classes is equally noticeable.

---

## Linear Discriminant Analysis (LDA)

We can do a similar analysis with LDA, but note that the maximum number of components we can try is the number of classes minus 1. So let's first determine the number of classes, although we already know that it's 10.

In [None]:
num_classes = len(np.unique(y))

print('Number of classes: {}'.format(num_classes))
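
The upper bound on the number of LDA components can be sketched as follows. In `scikit-learn`, the bound is the smaller of the number of classes minus 1 and the number of features; for the Digits dataset the number of classes is the limiting factor. The names `max_components`, `lda_max`, and `X_projected` are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_digits(return_X_y=True)

n_classes = len(np.unique(y))
n_features = X.shape[1]

# LDA can produce at most min(n_classes - 1, n_features) components
max_components = min(n_classes - 1, n_features)

# Fitting with the maximum allowed number of components
lda_max = LinearDiscriminantAnalysis(n_components=max_components).fit(X, y)
X_projected = lda_max.transform(X)

print('Maximum number of LDA components: {}'.format(max_components))
print('Shape after projection: {}'.format(X_projected.shape))
```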

We use the same loop to iterate over all possible values of `n_components` for LDA, for each value keeping track of the resulting overall explained variance.

In [None]:
x_vals, y_vals = [], []

for n in tqdm(range(1, num_classes)):
    
    lda = LinearDiscriminantAnalysis(n_components=n).fit(X, y)

    x_vals.append(n)
    y_vals.append(np.sum(lda.explained_variance_ratio_))

Again, the computation takes essentially no time, and we can visualize the result using a line plot.

In [None]:
plt.figure()
plt.tick_params(labelsize=14)
# Label the axes with what is actually plotted (values are ratios between 0 and 1)
plt.xlabel('number of components', fontsize=18)
plt.ylabel('explained variance (ratio)', fontsize=18)
plt.plot(x_vals, y_vals)
plt.tight_layout()
plt.show()

While the trend is similar to the one exhibited by PCA, the absolute values for the explained variance are larger for LDA. Here, only 6 components explain around 90% of the variance. For 2 components, however, this value is still below 50%.

In [None]:
lda = LinearDiscriminantAnalysis(n_components=2).fit(X, y)

print('Overall explained variance: {:.3f}'.format(np.sum(lda.explained_variance_ratio_)))

# Project the data with the fitted LDA model (not the PCA model from above)
X_lda = lda.transform(X)

Let's plot the dataset with the 2 new features.

In [None]:
plt.figure()
plt.scatter(X_lda[:, 0], X_lda[:, 1], c=y, cmap=cm.tab10, s=50)
plt.tick_params(top=False, bottom=False, left=False, right=False, labelleft=False, labelbottom=False)  
plt.tight_layout()
plt.show()

As with PCA, 2 components are not enough to separate the 10 classes cleanly, and the overlap between the classes remains noticeable.

---

## t-distributed Stochastic Neighbor Embedding (t-SNE)

Lastly, we use t-SNE to reduce the dimensionality of the dataset. Since, unlike with PCA and LDA, we cannot quantify the quality of the reduction using the overall explained variance, we directly reduce the dimensionality to 2 features. As in the previous notebook, feel free to run the algorithm multiple times with the same parameter settings to see that the results vary due to the non-deterministic nature of t-SNE.

In [None]:
%%time

X_tsne = TSNE(n_components=2, perplexity=50).fit_transform(X)

The first thing to notice is that t-SNE runs noticeably slower compared to PCA and LDA since t-SNE is an iterative algorithm. For really large datasets, this can be a significant challenge.

Anyway, we can plot the dataset using a scatter plot.

In [None]:
plt.figure()
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap=cm.tab10, s=25)
plt.tick_params(top=False, bottom=False, left=False, right=False, labelleft=False, labelbottom=False)  
plt.tight_layout()
plt.show()
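
The run-to-run variation mentioned above comes from the random choices t-SNE makes internally. A common way to make a run repeatable is to fix the `random_state` parameter of `TSNE`. The sketch below is only an illustration: the helper `embed`, the seeds 0 and 1, and the restriction to the first 300 samples (to keep the runtime short) are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X = load_digits().data

# Use only a subset of the samples to keep this example fast (illustrative choice)
X_small = X[:300]

def embed(seed):
    # Fixing random_state controls the random choices made by t-SNE
    tsne = TSNE(n_components=2, perplexity=50, random_state=seed)
    return tsne.fit_transform(X_small)

emb_a = embed(0)
emb_b = embed(1)

print('Embedding shape: {}'.format(emb_a.shape))
# Different seeds will generally give different embeddings
print('Identical for different seeds: {}'.format(np.allclose(emb_a, emb_b)))
```

Fixing the seed is useful when a plot has to be reproduced exactly; leaving `random_state` unset is the better choice when the goal is to see how stable the visible clusters are across runs.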

## Summary

Unlike for the IRIS dataset, PCA and LDA don't yield 2d scatter plots that are as "clean" as the one produced by t-SNE, since the features do not have strong linear relationships with each other. However, note that visualization is often not the main reason for dimensionality reduction and is therefore not a good benchmark for assessing the effectiveness of a dimensionality reduction technique.

A main challenge with t-SNE is the potentially very long runtime for datasets with a (very) large number of samples and features. In practice, a common way to address this is to first apply PCA to the dataset to reduce the number of features "a bit" (e.g., by an order of magnitude) and then apply t-SNE to the result. For the Digits dataset here, since it's arguably a small dataset, applying PCA first and then t-SNE makes almost no difference to the runtime. However, for the [MNIST Dataset](https://en.wikipedia.org/wiki/MNIST_database), the difference can be significant, reducing the runtime of t-SNE from ~1h on the original data down to ~5min when first applying PCA (with enough components to yield a similar 2d visualization).
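
The two-step approach described above can be sketched as follows. This is a minimal illustration rather than a tuned setup: the helper `pca_then_tsne`, the choice of 30 PCA components, the fixed seed, and the restriction to the first 500 samples (to keep the runtime short) are all assumptions made for this example.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def pca_then_tsne(X, n_pca_components=30, perplexity=50, random_state=0):
    # Step 1: reduce the number of features with PCA (fast, linear)
    X_reduced = PCA(n_components=n_pca_components).fit_transform(X)
    # Step 2: run t-SNE on the reduced data to get a 2d embedding
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=random_state)
    return tsne.fit_transform(X_reduced)

X = load_digits().data

# Use only a subset of the samples to keep this example fast (illustrative choice)
X_subset = X[:500]

X_embedded = pca_then_tsne(X_subset)

print('Shape of the 2d embedding: {}'.format(X_embedded.shape))
```

The number of PCA components is a trade-off: too few components discard structure that t-SNE could have used, while too many bring little speed-up.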