# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Intro to Dimensionality Reduction
Week 7 | Lesson 2.3

### LEARNING OBJECTIVES
*After this lesson, you will be able to:*
- Follow the logical workflow behind dimensionality reduction
- Describe the basic intuition of Principal Component Analysis
- Calculate eigenvectors and eigenvalues for use in Principal Component Analysis


### STUDENT PRE-WORK
*Before this lesson, you should already be able to:*
- Work comfortably with scikit-learn and NumPy
- Create functions from scratch in Python
- Explain basic linear algebra concepts such as matrices

### LESSON GUIDE
| TIMING  | TYPE  | TOPIC  |
|:-:|---|---|
| 10 min  | [Introduction](#introduction)   | Introduction to Dimensionality Reduction |
| 15 min  | [Demo](#demo)  | Applications of Dimensionality Reduction: A Long-Form Approach  |
| 25 min  | [Guided Practice](#guided-practice)  | Conducting Dimensionality Analysis  |
| 25 min  | [Independent Practice](#ind-practice)  | Dimensionality Reduction on the Iris Dataset  |
| 5 min  | [Conclusion](#conclusion)  | Conclusion  |

---


<a name="introduction"></a>
## Introduction: What is Dimensionality Reduction? (10 mins)

Dimensionality reduction reduces the number of random variables that you are considering for analysis until you are left with the most important variables.

Dimensionality reduction is not an end goal in itself, but a tool to form a dataset with more parsimonious features for further visualization and/or modelling.

> Check: where have we already done dimensionality reduction? What are the potential benefits?

Imagine a scatter plot with one variable on the x axis and another on the y axis, where the points fall roughly along a 45 degree line. Fitting a line models most of the information in the data (but leaves some noise). If we rotate the axes until that 45 degree line is completely horizontal, each point can be described by its position along the line alone. Both of our measurements are now summarized on a single axis: the data is effectively *one-dimensional*.

![graph1](./assets/images/graph1.jpg)

![graph2](./assets/images/graph2.jpg)
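To make the picture concrete, here is a minimal sketch on synthetic data. The data, the random seed, and every variable name are assumptions made for illustration and are not part of the lesson's datasets. It builds two strongly correlated variables, rotates the cloud by 45 degrees, and measures how much of the total variance ends up on the first rotated axis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y is roughly equal to x, plus a little noise.
x = rng.normal(size=500)
y = x + rng.normal(scale=0.1, size=500)
points = np.column_stack([x, y])
points = points - points.mean(axis=0)  # center the cloud on the origin

# Rotate by -45 degrees so the y = x line becomes the horizontal axis.
theta = -np.pi / 4
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
rotated = points.dot(rotation.T)

# Share of the total variance that lies along each rotated axis.
variances = rotated.var(axis=0)
variance_share = variances / variances.sum()
print(variance_share)
```

Almost all of the variance should land on the first axis, which is why dropping the second axis loses very little information in this example.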

<a name="demo"></a>
## Demo: Applications of Dimensionality Reduction (20 mins)

Our first priority is to get comfortable with the initial manual workflow of PCA. (We'll expand on the math, applications and intuition in a following lesson.)

- Isolate the feature data
- Center and scale the feature data
- Calculate their covariance matrix
- Calculate the eigenvalues and eigenvectors
- Choose the best n principal components
- Calculate newly extracted feature data



```python
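from sklearn.preprocessing import StandardScaler

# `data` is a DataFrame; `feature_columns` / `target_column` are placeholder column labels.
x = data.loc[:, feature_columns].values          # feature matrix
y = data.loc[:, target_column].values            # target values
x_standard = StandardScaler().fit_transform(x)   # zero mean, unit variance per feature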
```
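As a rough illustration of what "center and scale" means, the sketch below standardizes a small made-up matrix by hand and compares the result with `StandardScaler`. The array `x_demo` and its values are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: 5 observations, 2 features on different scales.
x_demo = np.array([[1.0, 200.0],
                   [2.0, 180.0],
                   [3.0, 260.0],
                   [4.0, 220.0],
                   [5.0, 240.0]])

# Manual version: subtract the column mean, divide by the column standard deviation.
# StandardScaler uses the population standard deviation (ddof=0), NumPy's default.
x_manual = (x_demo - x_demo.mean(axis=0)) / x_demo.std(axis=0)

x_scaled = StandardScaler().fit_transform(x_demo)

# Both approaches should agree, and each column should have mean 0 and std 1.
print(np.allclose(x_manual, x_scaled))
print(x_scaled.mean(axis=0).round(6), x_scaled.std(axis=0))
```

Scaling matters here because a feature measured in large units would otherwise dominate the covariance matrix.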

A **covariance matrix** of _n_ features is just an _n_ x _n_ matrix, where each element is the [covariance](https://en.wikipedia.org/wiki/Covariance) of one pair of those _n_ features.

```python
cov_mat = np.cov(x_standard.T)
```

(We're **transposing** the matrix only because, by default, np.cov expects each row to hold a feature and each column to hold an observation.)
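The sketch below, on synthetic data, computes the covariance matrix by hand and checks it against `np.cov`. It also shows that passing `rowvar=False` gives the same result as transposing. The array `x_demo` and the seed are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x_demo = rng.normal(size=(100, 3))          # 100 observations, 3 features
x_centered = x_demo - x_demo.mean(axis=0)   # subtract each column's mean

# Sample covariance: (X^T X) / (n - 1) on centered data.
n_obs = x_centered.shape[0]
cov_manual = x_centered.T.dot(x_centered) / (n_obs - 1)

cov_transposed = np.cov(x_demo.T)            # features on rows
cov_rowvar = np.cov(x_demo, rowvar=False)    # same result without transposing

print(cov_manual.shape)
print(np.allclose(cov_manual, cov_transposed), np.allclose(cov_manual, cov_rowvar))
```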

Now, we decompose our matrix by calling the numpy linear algebra function ```linalg.eig()``` to calculate the [**eigenvectors** and **eigenvalues**](https://en.wikipedia.org/wiki/Eigenvalues_and_eigenvectors).

```python
eigenValues, eigenVectors = np.linalg.eig(cov_mat)
```

The eigenvectors of a linear transformation are vectors that do not change direction under that transformation, but only have their magnitude scaled by some scalar value (the eigenvalue). In this context, the larger an eigenvalue, the more variance (information) in our data its corresponding eigenvector explains.
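A small numerical check can make this definition less abstract. The sketch below uses a made-up 2 x 2 symmetric matrix (`cov_demo`, standing in for a covariance matrix) and confirms that multiplying each eigenvector by the matrix only rescales it by its eigenvalue.

```python
import numpy as np

# A small symmetric matrix, standing in for a covariance matrix.
cov_demo = np.array([[2.0, 1.0],
                     [1.0, 2.0]])

eig_vals, eig_vecs = np.linalg.eig(cov_demo)

# Each COLUMN of eig_vecs is an eigenvector; eig_vals[i] pairs with eig_vecs[:, i].
for i in range(len(eig_vals)):
    v = eig_vecs[:, i]
    # Multiplying by the matrix only rescales v by its eigenvalue.
    print(eig_vals[i], np.allclose(cov_demo.dot(v), eig_vals[i] * v))
```

Note that the eigenvectors are stored as columns, which is why the later code indexes them with `[:, i]` rather than `[i]`.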

Once we have our eigenvalues and eigenvectors, we can work on transforming our data into a lower-dimensional space. Remember the visual representation from above: this is exactly what we are doing in this step. Don't worry about the mathematics of this for now; we'll touch on it later!

Notice that when calling ```linalg.eig``` from numpy, the input must be a square matrix, and the call returns two arrays that we unpack into two variables: the eigenvalues, and a matrix whose columns are the corresponding eigenvectors.

<a name="guided-practice"></a>
## Guided Practice: Conducting Dimensionality Analysis (20 mins)

Now that you know the procedure, let's run through an implementation of dimensionality reduction with a real dataset.

We're going to revisit the [wine](./assets/datasets/wine_v.csv) dataset, which lists the attributes of different wine varieties.

```python
import math
import os

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.preprocessing import StandardScaler
```



```python
wine = pd.read_csv('./assets/datasets/wine_v.csv')
wine.head()
```

| Unnamed: 0 | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | Varietal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | Cabernet |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 | Cabernet |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 | Cabernet |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 | Cabernet |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | Cabernet |


```python
# Isolate the feature data (x) and the target (y) by column position.
x = wine.iloc[:, 0:11].values
y = wine.iloc[:, 12].values

# Center and scale the feature data.
x_standard = StandardScaler().fit_transform(x)
```


```python
# Calculate their covariance matrix.
cov_mat = np.cov(x_standard.T)

# Calculate the eigenvalues and eigenvectors.
eigenValues, eigenVectors = np.linalg.eig(cov_mat)
```

```python
# Choose the best n principal components.
# Pair each eigenvalue with its eigenvector (the i-th column of eigenVectors).
eig_pairs = [(np.abs(eigenValues[i]), eigenVectors[:, i]) for i in range(len(eigenValues))]

# Sort by eigenvalue, largest first. A larger eigenvalue means its eigenvector
# explains more of the variance in the data.
eig_pairs.sort(key=lambda pair: pair[0], reverse=True)
for pair in eig_pairs[:2]:
    print(pair)
```

```
(3.1010718226758942, array([ 0.48931422, -0.23858436,  0.46363166,  0.14610715,  0.21224658,
       -0.03615752,  0.02357485,  0.39535301, -0.43851962,  0.24292133,
       -0.11323207]))
(1.9271148896490469, array([-0.11050274,  0.27493048, -0.15179136,  0.27208024,  0.14805156,
        0.51356681,  0.56948696,  0.23357549,  0.00671079, -0.03755392,
       -0.38618096]))
```


```python
# Compare with sklearn's PCA method.
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca.fit(x_standard)
print("The information (explained variance) contained in each principal component: ", pca.explained_variance_ratio_)
print(pca.components_)
```

```
The information (explained variance) contained in each principal component:  [ 0.28173931  0.1750827 ]
[[ 0.48931422 -0.23858436  0.46363166  0.14610715  0.21224658 -0.03615752
   0.02357485  0.39535301 -0.43851962  0.24292133 -0.11323207]
 [-0.11050274  0.27493048 -0.15179136  0.27208024  0.14805156  0.51356681
   0.56948696  0.23357549  0.00671079 -0.03755392 -0.38618096]]
```
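The explained variance ratio reported by sklearn is closely related to the eigenvalues computed by hand: each ratio is an eigenvalue divided by the sum of all eigenvalues. The sketch below checks this on synthetic data. The data, the seed, and the names `raw`, `x_demo` and `pca_demo` are assumptions for illustration, not the wine data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
# Synthetic correlated data: 200 observations, 4 features.
raw = rng.normal(size=(200, 4)).dot(rng.normal(size=(4, 4)))
x_demo = StandardScaler().fit_transform(raw)

eig_vals, eig_vecs = np.linalg.eig(np.cov(x_demo.T))

# Sort eigenvalues from largest to smallest and normalize so they sum to 1.
sorted_vals = np.sort(eig_vals)[::-1]
manual_ratio = sorted_vals / sorted_vals.sum()

pca_demo = PCA(n_components=4).fit(x_demo)
print(manual_ratio)
print(pca_demo.explained_variance_ratio_)
```

The two printed arrays should agree up to floating point error.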


### Now what?

We can use the top two eigenvectors to transform our data into a lower-dimensional space.

```python
# Our transformation matrix: the top two eigenvectors as columns, shape (11, 2).
W = np.hstack((eig_pairs[0][1].reshape(11, 1), eig_pairs[1][1].reshape(11, 1)))
X_reduced = x_standard.dot(W)
X_reduced
```

```
array([[-1.61952988,  0.45095009],
       [-0.79916993,  1.85655306],
       [-0.74847909,  0.88203886],
       ..., 
       [-1.45612897,  0.31174559],
       [-2.27051793,  0.97979111],
       [-0.42697475, -0.53669021]])
```
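The manual projection should match what sklearn's `PCA` produces, except that the sign of each component is arbitrary and may be flipped. Here is a minimal sketch of that comparison on synthetic data. The data and the names `w_demo`, `manual_scores` and `sklearn_scores` are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
raw = rng.normal(size=(150, 5)).dot(rng.normal(size=(5, 5)))  # synthetic data
x_demo = StandardScaler().fit_transform(raw)

eig_vals, eig_vecs = np.linalg.eig(np.cov(x_demo.T))

# Column indices of the two largest eigenvalues, largest first.
top_two = np.argsort(eig_vals)[::-1][:2]
w_demo = eig_vecs[:, top_two]              # shape (5, 2): one eigenvector per column
manual_scores = x_demo.dot(w_demo)         # shape (150, 2)

sklearn_scores = PCA(n_components=2).fit_transform(x_demo)

# An eigenvector's sign is arbitrary, so compare absolute values.
print(manual_scores.shape)
print(np.allclose(np.abs(manual_scores), np.abs(sklearn_scores)))
```

Using `np.argsort` on the eigenvalues is one alternative to sorting a list of eigenpairs; it avoids comparing arrays when two eigenvalues happen to be equal.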

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Baseline: train on all 11 standardized features.
X_train, X_test, y_train, y_test = train_test_split(x_standard, y, test_size=0.33, random_state=1)
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test), "mean accuracy, using {0} dimensions.".format(x_standard.shape[1]))

# Same model, trained on the two principal components only.
X_train, X_test, y_train, y_test = train_test_split(X_reduced, y, test_size=0.33, random_state=1)
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test), "mean accuracy, using {0} principal component dimensions.".format(X_reduced.shape[1]))
```

```
0.594696969697 mean accuracy, using 11 dimensions.
0.496212121212 mean accuracy, using 2 principal component dimensions.
```
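The two components kept here account for roughly 46% of the variance (0.2817 + 0.1751 in the sklearn output), which is one plausible reason for the drop in accuracy. One common way to choose n is to keep enough components to reach a target share of the variance. The helper below is a hypothetical sketch of that idea. The function name, the 0.90 default threshold, and the demo eigenvalues are all assumptions and are not taken from the wine data.

```python
import numpy as np


def components_needed(eigenvalues, threshold=0.90):
    """Smallest number of components whose cumulative explained variance reaches threshold."""
    sorted_vals = np.sort(np.abs(eigenvalues))[::-1]
    cumulative = np.cumsum(sorted_vals) / sorted_vals.sum()
    # First position where the running total meets the threshold (+1 turns an index into a count).
    count = np.searchsorted(cumulative, threshold) + 1
    # Guard against floating point error pushing the count past the number of components.
    return int(min(count, len(cumulative)))


# Hypothetical eigenvalues, not taken from the wine data.
demo_eigenvalues = np.array([5.0, 3.0, 1.5, 0.5])
print(components_needed(demo_eigenvalues, threshold=0.90))
```

With these demo values the cumulative shares are 0.5, 0.8, 0.95 and 1.0, so three components are needed to reach 90%.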


<a name="ind-practice"></a>
## Independent Practice: Dimensionality Reduction on the Iris dataset (20 minutes)

Now that we've gone over the long-form approach to dimensionality reduction and worked through an example, let's put your skills to the test! We're going to be working with the classic [iris dataset](./assets/datasets/iris.csv). We want to decompose the data to the point of finding the eigenvectors and eigenvalues. Grab the [starter code](./code/starter-code/w7d2-dimensionality-reduction-iris-starter-code.ipynb) to begin!

> Note: [solution code](./code/solution-code/w7d2-dimensionality-reduction-iris-solution-code.ipynb).

<a name="conclusion"></a>
## Conclusion (5 mins)
- Recap and recall the process steps in dimensionality reduction
    -  Covariance Matrix: First, we standardize the features and create their covariance matrix, which we decompose to find our eigenvalues and eigenvectors.
    -  Eigenvectors & Eigenvalues: We decompose the covariance matrix to derive our eigenvectors and eigenvalues, and select the eigenpairs with the largest eigenvalues to become our principal components.
    -  Projection: Lastly, we project the standardized data onto the new feature subspace spanned by the selected eigenvectors.
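The steps in this recap can be collected into a single function. The sketch below is one possible arrangement, run on synthetic data. The helper name `manual_pca`, its return values, and the demo data are assumptions made for illustration rather than a reference implementation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler


def manual_pca(features, n_components):
    """Hypothetical helper: return (projected data, components, explained variance ratios)."""
    # 1. Center and scale the feature data.
    standardized = StandardScaler().fit_transform(features)
    # 2. Covariance matrix of the features.
    cov_mat = np.cov(standardized.T)
    # 3. Eigendecomposition; eigenvectors are the columns of eig_vecs.
    eig_vals, eig_vecs = np.linalg.eig(cov_mat)
    # 4. Keep the eigenpairs with the largest eigenvalues.
    order = np.argsort(eig_vals)[::-1][:n_components]
    components = eig_vecs[:, order]
    ratios = eig_vals[order] / eig_vals.sum()
    # 5. Project the standardized data onto the selected eigenvectors.
    return standardized.dot(components), components, ratios


rng = np.random.default_rng(4)
demo_features = rng.normal(size=(120, 6)).dot(rng.normal(size=(6, 6)))  # synthetic data
projected, components, ratios = manual_pca(demo_features, n_components=2)
print(projected.shape, components.shape, ratios)
```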

***



### ADDITIONAL RESOURCES

- [Unsupervised Dimensionality Reduction in sklearn](http://scikit-learn.org/stable/modules/unsupervised_reduction.html)
- [In depth overview of Dimensionality Reduction and PCA from Stanford University](http://ufldl.stanford.edu/wiki/index.php/PCA)