# Dimensionality Reduction

### What is dimensionality reduction?
In data analysis, *dimensionality* can be roughly understood as the number of features in a dataset.
In cases where the dimensionality is very high, 
certain supervised learning algorithms will underperform or fail altogether unless the supervised process 
is preceded by a *reduction* step.
Here we'll be outlining a few of the more common dimensionality reduction algorithms in `scikit-learn`.

### References

1. Scikit-learn documentation
    * [Principal component analysis (PCA)](http://scikit-learn.org/stable/modules/decomposition.html#principal-component-analysis-pca)
    * [LLE/Isomap/Spectral Embedding](http://scikit-learn.org/stable/auto_examples/manifold/plot_compare_methods.html#sphx-glr-auto-examples-manifold-plot-compare-methods-py)

## Principal Component Analysis (PCA) 
PCA is based on some fairly complicated maths, but can be simply understood as *transforming* a dataset into a new,
less noisy one of reduced dimension. 
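Specifically, PCA re-expresses the data along new orthogonal axes (the principal components), ordered by how much of the data's variance each one captures. The components that capture the least variance are dropped, leaving behind only those that capture the most. This works best when the low-variance directions are mostly noise.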
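
To make the idea of "keeping the directions with the most variance" concrete, here is a minimal sketch on synthetic data. The dataset is made up for illustration (it is not part of the original example): the third feature is almost a copy of the first, so nearly all of the variance lives in two directions.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical toy data: 200 samples, 3 features, where the third feature
# is the first feature plus a small amount of noise (so it is redundant)
rng = np.random.RandomState(0)
x0 = rng.normal(size=200)
x1 = rng.normal(size=200)
X = np.column_stack([x0, x1, x0 + 0.01 * rng.normal(size=200)])

# Keep only the two components that capture the most variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)
print("explained variance ratio:", pca.explained_variance_ratio_)
print("total variance kept:", pca.explained_variance_ratio_.sum())
```

Because the dropped component is almost pure noise here, the two kept components should account for nearly all of the variance. Real datasets are rarely this clean, which is why the number of components is usually tuned rather than guessed.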

Because dimensionality reduction in itself isn't all that useful, except perhaps for visualizing data, 
this example will employ *pipelining*: chaining together a PCA step and a logistic regression classifier for more accurate prediction.

*This example is adapted from the scikit-learn documentation for PCA.*

In [None]:
import os, sys
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import linear_model, decomposition, datasets # decomposition roughly equals reduction
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV # CV tools to set dimensionality of the dataset

# Construct model
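# A higher max_iter and looser tol help the solver converge on the digits data
logistic = linear_model.LogisticRegression(max_iter=10000, tol=0.1)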

# Prepare PCA pipeline
pca = decomposition.PCA()
pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])

# Load dataset (we're using the toy dataset 'digits')
digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target

# Plot original data

# Fit PCA
pca.fit(X_digits)

# Plot PCA spectrum
plt.figure(1, figsize=(4, 3))
plt.clf()
plt.axes([.2, .2, .7, .7])
plt.plot(pca.explained_variance_, linewidth=2)
plt.axis('tight')
plt.xlabel('n_components')
plt.ylabel('explained_variance_')

# Prepare for prediction step
n_components = [20, 40, 64]
Cs = np.logspace(-4, 4, 3)  # 3 values evenly spaced on a log scale from 1e-4 to 1e4

# Cross-validate over both steps at once. Parameters of pipeline steps are
# addressed with '__'-separated names: <step name>__<parameter name>
estimator = GridSearchCV(pipe,
                         dict(pca__n_components=n_components,
                              logistic__C=Cs))
estimator.fit(X_digits, y_digits)

plt.axvline(estimator.best_estimator_.named_steps['pca'].n_components,
            linestyle=':', label='n_components chosen')
plt.legend(prop=dict(size=12))
plt.show()
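
The `'__'` naming convention used in the grid search above can be tried on its own. The following is a small sketch, assuming a fresh pipeline with the same step names (`pipe_demo` is a name introduced here for illustration):

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# A stand-alone pipeline with the same step names as the example above
pipe_demo = Pipeline(steps=[('pca', PCA()), ('logistic', LogisticRegression())])

# '<step name>__<parameter name>' addresses a parameter of a single step
pipe_demo.set_params(pca__n_components=20, logistic__C=1.0)

print(pipe_demo.named_steps['pca'].n_components)
print(pipe_demo.get_params()['logistic__C'])
```

`GridSearchCV` relies on the same mechanism: each key in its parameter grid is passed to `set_params` on a clone of the pipeline before fitting.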

## Very large datasets: incremental PCA
While standard `PCA` is very useful for smaller datasets, 
there are some problems with the default implementation.
`PCA` as it exists in `scikit-learn` is based on batch processing, 
which requires the entire dataset being processed to fit within main memory.
This becomes problematic as the size of the dataset increases. 

As such, `IncrementalPCA` was developed.
This PCA implementation uses 'minibatch' processing,
splitting a dataset into moderately sized chunks and processing them sequentially.

Aside from the different use case, `IncrementalPCA` is used in much the same way as `PCA`;
it additionally accepts a `batch_size` parameter and offers `partial_fit` for feeding chunks by hand.
If your dataset is particularly large, 
you should look at the `IncrementalPCA` documentation and most likely use `decomposition.IncrementalPCA()` rather 
than `decomposition.PCA()` for your reduction algorithm.
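
As a rough sketch of minibatch processing, the snippet below feeds the digits data to `IncrementalPCA` in chunks with `partial_fit`. The digits set easily fits in memory, so the chunking here only stands in for a real out-of-core workflow; the chunk count of 9 and the choice of 20 components are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import IncrementalPCA

X_digits, _ = load_digits(return_X_y=True)

ipca = IncrementalPCA(n_components=20)

# Feed the data in 9 chunks of roughly 200 rows each; every chunk must
# contain at least n_components samples
for batch in np.array_split(X_digits, 9):
    ipca.partial_fit(batch)

# Once fitted, the transform step works like ordinary PCA
X_reduced = ipca.transform(X_digits)
print(X_reduced.shape)
```

In a genuinely large-data setting, each batch would typically be read from disk (for example from a memory-mapped array) rather than sliced from an array that is already loaded.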

# When PCA fails...
If `PCA` isn't working out, you have a few options. 
Smaller datasets (fewer than 10K samples) can safely employ `Isomap` or `SpectralEmbedding` 
(or the corresponding function `spectral_embedding`) for reduction. Failing that, try
`LocallyLinearEmbedding` or `locally_linear_embedding`.
Larger datasets are generally relegated to kernel approximation, which will be covered elsewhere.

## Spectral Embedding
You will not be required to understand the math behind this.
Running the code below (from the `scikit-learn` [documentation](http://scikit-learn.org/stable/auto_examples/manifold/plot_compare_methods.html#sphx-glr-auto-examples-manifold-plot-compare-methods-py)) 
generates an S-shaped cloud of points in three dimensions and plots the results of several different reduction algorithms.

In [None]:
#Author: Jake Vanderplas -- <vanderplas@astro.washington.edu>

print(__doc__)

from time import time

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from matplotlib.ticker import NullFormatter

from sklearn import manifold, datasets

# Next line to silence pyflakes. This import is needed.
Axes3D

n_points = 1000
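# make_s_curve is exposed directly on sklearn.datasets
# (the samples_generator module was removed in later releases)
X, color = datasets.make_s_curve(n_points, random_state=0)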
n_neighbors = 10
n_components = 2

fig = plt.figure(figsize=(15, 8))
plt.suptitle("Manifold Learning with %i points, %i neighbors"
             % (1000, n_neighbors), fontsize=14)

# Plot the original three-dimensional S-curve in the first panel
ax = fig.add_subplot(251, projection='3d')
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=color, cmap=plt.cm.Spectral)
ax.view_init(4, -72)

methods = ['standard', 'ltsa', 'hessian', 'modified']
labels = ['LLE', 'LTSA', 'Hessian LLE', 'Modified LLE']

for i, method in enumerate(methods):
    t0 = time()
    # Recent scikit-learn releases require these arguments by keyword
    Y = manifold.LocallyLinearEmbedding(n_neighbors=n_neighbors,
                                        n_components=n_components,
                                        eigen_solver='auto',
                                        method=method).fit_transform(X)
    t1 = time()
    print("%s: %.2g sec" % (methods[i], t1 - t0))

    ax = fig.add_subplot(252 + i)
    plt.scatter(Y[:, 0], Y[:, 1], c=color, cmap=plt.cm.Spectral)
    plt.title("%s (%.2g sec)" % (labels[i], t1 - t0))
    ax.xaxis.set_major_formatter(NullFormatter())
    ax.yaxis.set_major_formatter(NullFormatter())
    plt.axis('tight')

t0 = time()
Y = manifold.Isomap(n_neighbors=n_neighbors,
                    n_components=n_components).fit_transform(X)
t1 = time()
print("Isomap: %.2g sec" % (t1 - t0))
ax = fig.add_subplot(257)
plt.scatter(Y[:, 0], Y[:, 1], c=color, cmap=plt.cm.Spectral)
plt.title("Isomap (%.2g sec)" % (t1 - t0))
ax.xaxis.set_major_formatter(NullFormatter())
ax.yaxis.set_major_formatter(NullFormatter())
plt.axis('tight')


t0 = time()
mds = manifold.MDS(n_components, max_iter=100, n_init=1)
Y = mds.fit_transform(X)
t1 = time()
print("MDS: %.2g sec" % (t1 - t0))
ax = fig.add_subplot(258)
plt.scatter(Y[:, 0], Y[:, 1], c=color, cmap=plt.cm.Spectral)
plt.title("MDS (%.2g sec)" % (t1 - t0))
ax.xaxis.set_major_formatter(NullFormatter())
ax.yaxis.set_major_formatter(NullFormatter())
plt.axis('tight')


t0 = time()
se = manifold.SpectralEmbedding(n_components=n_components,
                                n_neighbors=n_neighbors)
Y = se.fit_transform(X)
t1 = time()
print("SpectralEmbedding: %.2g sec" % (t1 - t0))
ax = fig.add_subplot(259)
plt.scatter(Y[:, 0], Y[:, 1], c=color, cmap=plt.cm.Spectral)
plt.title("SpectralEmbedding (%.2g sec)" % (t1 - t0))
ax.xaxis.set_major_formatter(NullFormatter())
ax.yaxis.set_major_formatter(NullFormatter())
plt.axis('tight')

t0 = time()
tsne = manifold.TSNE(n_components=n_components, init='pca', random_state=0)
Y = tsne.fit_transform(X)
t1 = time()
print("t-SNE: %.2g sec" % (t1 - t0))
ax = fig.add_subplot(2, 5, 10)
plt.scatter(Y[:, 0], Y[:, 1], c=color, cmap=plt.cm.Spectral)
plt.title("t-SNE (%.2g sec)" % (t1 - t0))
ax.xaxis.set_major_formatter(NullFormatter())
ax.yaxis.set_major_formatter(NullFormatter())
plt.axis('tight')

plt.show()

### Discussion
The images above are two-dimensional embeddings of the three-dimensional S-curve shown in the upper-left panel. 
Each method is designed to preserve some notion of distance or neighborhood structure from the original data while eliminating dimensional complexity. 
`SpectralEmbedding`, `Isomap`, and the `LLE` variants tend to be the fastest (the timing of each run is shown in its panel title), 
so these tend to be preferred when applicable. 
Note that each method (aside from the LLE variants) produces a different reduced image, 
but every embedding has two dimensions, one fewer than the original data.
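
Visual inspection can be complemented with a number. One option, not used in the original example, is `sklearn.manifold.trustworthiness`, which scores between 0 and 1 how well an embedding keeps each point's nearest neighbors. The sketch below applies it to an `Isomap` embedding of the same kind of S-curve; the choice of `Isomap` and of 10 neighbors is only an assumption for illustration.

```python
from sklearn.datasets import make_s_curve
from sklearn.manifold import Isomap, trustworthiness

# Same kind of data as in the comparison above
X, _ = make_s_curve(1000, random_state=0)

# Reduce from three dimensions to two
Y = Isomap(n_neighbors=10, n_components=2).fit_transform(X)

# Values close to 1 mean local neighborhoods are well preserved
score = trustworthiness(X, Y, n_neighbors=10)
print(Y.shape, score)
```

Running the same scoring for each method in the figure gives a rough way to compare them beyond timing, although a single score cannot capture everything a plot shows.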