# PCA

## What is PCA?
PCA (principal component analysis) is the first of the dimensionality reduction techniques you will see. It takes your full dataset and reduces it to only the parts that hold the most information.


## How Well Will You Understand PCA After This Lesson?
The goal is for everyone to leave this lesson with an understanding of:

* How PCA is used in the world.
* How to perform PCA in Python.
* A conceptual understanding of how the algorithm works.
* How to interpret the results of PCA.

If you want to dive deeper into the mathematics, there will be additional links provided, but it will not be a main focus of this lesson.

## PCA Lesson Topics

There is a lot to cover with Principal Component Analysis (or PCA). However, you will gain a solid understanding of PCA by the end of this lesson, by applying this technique in a couple of scenarios using scikit-learn, and practicing interpreting the results.

We will also cover conceptually how the algorithm works, and I will provide links to explore what is happening mathematically in case you want to dive in deeper! Here is an outline of what you can expect in this lesson.

### 1. Dimensionality Reduction through Feature Selection and Feature Extraction
With large datasets we often suffer from what is known as the "curse of dimensionality," and need to reduce the number of features to effectively develop a model. Feature Selection and Feature Extraction are two general approaches for reducing dimensionality.

### 2. Feature Extraction using PCA
Principal Component Analysis is a common method for extracting new "latent features" from our dataset, based on existing features.

### 3. Fitting PCA
In this part of the lesson, you will use PCA in scikit-learn to reduce the dimensionality of images of handwritten digits.

### 4. Interpreting Results
Once you are able to use PCA on a dataset, it is essential that you know how to interpret the results you get back. There are two main parts to interpreting your results - the principal components themselves and the variability of the original data captured by those components. You will get familiar with both.


## Latent features

Latent features are features that aren't explicitly in your dataset.
![pca_1.png](pics/pca_1.png)
In this example, you saw that the following features are all related to the latent feature **home size**:

* lot size
* number of rooms
* floor plan size
* size of garage
* number of bedrooms
* number of bathrooms

Similarly, the following features could be reduced to a single latent feature of **home neighborhood**:

* local crime rate
* number of schools in five miles
* property tax rate
* local median income
* average air quality index
* distance to highway

So even if our original dataset has the 12 features listed, we might be able to reduce this to only 2 latent features relating to the home size and home neighborhood.

![pca_1.png](pics/pca_2.png)


## Reducing the Number of Features - Dimensionality Reduction

Our real estate example is a great way to develop an understanding of feature reduction and latent features. But we have a smallish number of features in this example, so it's not clear why it's so necessary to reduce the number of features. And in this case it wouldn't actually be required: we could handle all 12 original features to create a model.

But the "curse of dimensionality" becomes clearer when we're grappling with large real-world datasets that might involve hundreds or thousands of features. To develop a model effectively on such data, we really do need to reduce the number of dimensions.

### Two Approaches: Feature Selection and Feature Extraction
#### Feature Selection
Feature Selection involves finding a subset of the original features of your data that you determine are most relevant and useful. In the example image below, taken from the video, notice that "floor plan size" and "local crime rate" are features that we have selected as a subset of the original data.

![pca_1.png](pics/pca_3.png)

* **Filter methods** - Filtering approaches use a ranking or sorting algorithm to filter out the less useful features. Filter methods are based on discerning inherent correlations among the features in unsupervised learning, or on correlations with the output variable in supervised settings. They are usually applied as a preprocessing step. Common tools for determining correlations in filter methods include: *Pearson's Correlation, Linear Discriminant Analysis (LDA), and Analysis of Variance (ANOVA)*.
* **Wrapper methods** - Wrapper approaches generally select features by directly testing their impact on the performance of a model. The idea is to "wrap" this procedure around your algorithm, repeatedly calling the algorithm using different subsets of features, and measuring the performance of each model. *Cross-validation* is used across these multiple tests. The features that produce the best models are selected. Clearly this is a computationally expensive approach for finding the best-performing subset of features, since it requires many calls to the learning algorithm. Common examples of wrapper methods are: *Forward Search, Backward Search, and Recursive Feature Elimination*.

Scikit-learn has a feature selection module that offers a variety of methods to improve a model's accuracy scores or to boost its performance on very high-dimensional datasets.
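
As a rough sketch of the two families above, the example below uses `SelectKBest` with the ANOVA F-test (`f_classif`) as a filter method and `RFE` (Recursive Feature Elimination) as a wrapper method, both from `sklearn.feature_selection`. The wine dataset, the choice of keeping 5 features, and logistic regression as the wrapped model are illustrative assumptions, not part of the lesson.

```python
from sklearn.datasets import load_wine
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Illustrative dataset (an assumption): 178 samples, 13 numeric features.
X, y = load_wine(return_X_y=True)
X = StandardScaler().fit_transform(X)

# Filter method: score each feature against the target with the ANOVA F-test
# and keep the 5 highest-scoring features. No model is trained here.
filter_selector = SelectKBest(score_func=f_classif, k=5)
X_filter = filter_selector.fit_transform(X, y)

# Wrapper method: repeatedly fit the model and drop the weakest feature
# until only 5 remain. This calls the learning algorithm many times.
wrapper_selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)
X_wrapper = wrapper_selector.fit_transform(X, y)

print("filter kept columns: ", filter_selector.get_support(indices=True))
print("wrapper kept columns:", wrapper_selector.get_support(indices=True))
print(X_filter.shape, X_wrapper.shape)
```

Both selectors return a subset of the original columns unchanged, which is what distinguishes Feature Selection from Feature Extraction.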

#### Feature Extraction
Feature Extraction involves extracting, or constructing, new features called latent features. In the example image below, taken from the video, "Size Feature" and "Neighborhood Quality Feature" are new latent features, extracted from the original input data.

![pca_1.png](pics/pca_4.png)

##### Methods of Feature Extraction
Constructing latent features is exactly the goal of **Principal Component Analysis** (PCA), which we'll explore throughout the rest of this lesson.

Other methods for accomplishing Feature Extraction include **Independent Component Analysis** (ICA) and **Random Projection**, which we will study in the following lesson.

**Further Exploration**
If you're interested in deeper study of these topics, here are a couple of helpful blog posts and a research paper:

* [https://www.analyticsvidhya.com/blog/2016/12/introduction-to-feature-selection-methods-with-an-example-or-how-to-select-the-right-variables/](https://www.analyticsvidhya.com/blog/2016/12/introduction-to-feature-selection-methods-with-an-example-or-how-to-select-the-right-variables/)
* [https://elitedatascience.com/dimensionality-reduction-algorithms](https://elitedatascience.com/dimensionality-reduction-algorithms)
* [http://www.ai.mit.edu/projects/jmlr/papers/volume3/guyon03a/source/old/guyon03a.pdf](http://www.ai.mit.edu/projects/jmlr/papers/volume3/guyon03a/source/old/guyon03a.pdf)


## Principal Components

![pca_1.png](pics/pca_5.png)

![pca_1.png](pics/pca_6.png)

An advantage of **Feature Extraction** over **Feature Selection** is that the latent features can be constructed to incorporate data from multiple features. They therefore retain more of the information present in the various original inputs, instead of losing that information by dropping many of those inputs.

Principal components are **linear combinations** of the original features in a dataset that aim to retain the most information in the original data.

You can think of a principal component in the same way that you think about a latent feature.

The general approach to this problem of high-dimensional datasets is to search for a projection of the data onto a smaller number of features which preserves the information as much as possible.
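
To make "linear combination" concrete, here is a minimal sketch on synthetic data. The three features and the hidden `size` factor are made up for illustration (loosely echoing the home size example). It checks that the value scikit-learn's `PCA` returns for the first component is just a weighted sum of the mean-centered original features, with the weights stored in `components_`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical data: three observed features driven by one hidden "size" factor.
size = rng.normal(size=200)
X = np.column_stack([
    size + 0.1 * rng.normal(size=200),
    2 * size + 0.1 * rng.normal(size=200),
    -size + 0.1 * rng.normal(size=200),
])

pca = PCA(n_components=1).fit(X)
X_reduced = pca.transform(X)          # shape (200, 1): one latent feature

# The component is a vector of weights, one per original feature.
weights = pca.components_[0]

# Projecting by hand: center the data, then take the weighted sum of features.
manual = (X - pca.mean_) @ weights

print("weights:", np.round(weights, 3))
print("matches PCA output:", np.allclose(manual, X_reduced[:, 0]))
```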

## PCA Properties
There are two main properties of principal components:

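They retain the most information in the dataset. Retaining the most information means finding the line that minimizes the distances from the points to the component across all the points. This is similar in spirit to regression, except that PCA measures the perpendicular distance from each point to the line, while regression measures the vertical distance in the target variable.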

![pca_1.png](pics/pca_7.png)

The created components are orthogonal to one another. So far we have mostly focused on what the first component of a dataset looks like. When there are many components, each additional component is orthogonal to all the others. Depending on how the components are used, there are benefits to having orthogonal components. In regression, for example, we would often like features that are uncorrelated with one another, and using the components as features guarantees this.

![pca_1.png](pics/pca_8.png)
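
The orthogonality property can be checked numerically. The sketch below uses made-up correlated data (an assumption for illustration) and verifies two things: the component vectors are mutually orthogonal with unit length, and the transformed features are uncorrelated with one another.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# Synthetic correlated data: 300 samples, 4 features mixed from random sources.
X = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 4))

pca = PCA(n_components=3).fit(X)
components = pca.components_          # shape (3, 4): one row per component

# Dot products between components: identity matrix means orthonormal rows.
gram = components @ components.T

# Correlations between the new features: off-diagonal entries should be ~0.
scores = pca.transform(X)
corr = np.corrcoef(scores, rowvar=False)

print(np.round(gram, 6))
print(np.round(corr, 6))
```

The original columns of `X` are correlated by construction, so the near-zero off-diagonal correlations are a result of the PCA transform rather than a property of the input.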

### Quiz

![pca_1.png](pics/pca_q_1.png)
![pca_1.png](pics/pca_q_2.png)
![pca_1.png](pics/pca_q_3.png)

### What Are Eigenvalues and Eigenvectors?
The mathematics of PCA isn't really necessary for PCA to be useful. However, it can be useful to fully understand the mathematics of a technique to understand how it might be extended to new cases. For this reason, the page has a few additional references which go more into the mathematics of PCA.

A simple introduction of what PCA is aimed to accomplish is provided here in a simple example.

A nice visual, and mathematical, illustration of PCA is provided in this video by 3Blue1Brown.

https://www.youtube.com/watch?v=PFDu9oVAE-g

If you dive into the literature surrounding PCA, you will without a doubt run into the language of eigenvalues and eigenvectors. These are just the math-y words for things you have already encountered in this lesson.

An eigenvalue (of the data's covariance matrix) corresponds to the amount of variability captured by a principal component, and the matching eigenvector is the principal component itself. To see more on these ideas, take a look at the three links below:

[A great introduction into the mathematics of principal components analysis.](http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf)

[An example of using PCA in Python by one of my favorite data scientists.](https://sebastianraschka.com/Articles/2015_pca_in_3_steps.html)

[An example of PCA from the scikit-learn documentation.](http://scikit-learn.org/stable/auto_examples/applications/plot_face_recognition.html#sphx-glr-auto-examples-applications-plot-face-recognition-py)
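
The link between eigenvalues, eigenvectors, and principal components described above can also be checked directly. This is a minimal sketch on synthetic data (the mixing matrix is an arbitrary assumption): it takes the eigendecomposition of the covariance matrix with NumPy and compares it with what scikit-learn's `PCA` reports in `explained_variance_` and `components_`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# Synthetic data with clearly different variance along different directions.
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.2]])
X = rng.normal(size=(500, 3)) @ mixing

# Eigendecomposition of the sample covariance matrix.
cov = np.cov(X, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)   # returned in ascending order

# Sort from largest to smallest eigenvalue to match PCA's ordering.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]             # one eigenvector per column

pca = PCA(n_components=3).fit(X)

# Eigenvalues equal the variance captured by each component.
print(np.allclose(eigenvalues, pca.explained_variance_))

# Eigenvectors equal the components, up to an arbitrary sign flip.
print(np.allclose(np.abs(eigenvectors.T), np.abs(pca.components_), atol=1e-6))
```

The comparison uses absolute values because an eigenvector and its negation describe the same direction, so two libraries may legitimately return opposite signs.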


## When to use PCA? 

1. Whenever you want to reduce the dimensionality of your data.
2. Whenever you want to find latent features that capture a number of other features in a single factor.

![pca](pics/pca_11.png)

![pca](pics/pca_12.png)


## Recap

### 1. Two Methods for Dimensionality Reduction

You learned that Feature Selection and Feature Extraction are two general approaches for reducing the number of features in your data. Feature Selection processes result in a subset of the most significant original features in the data, while Feature Extraction methods like PCA construct new latent features that represent the original data well.

### 2. Dimensionality Reduction and Principal Components
You learned that Principal Component Analysis (PCA) is a technique that is used to reduce the dimensionality of your dataset. The reduced features are called principal components, or latent features. These principal components are simply linear combinations of the original features in your dataset.

You learned that these components have two major properties:

1. They aim to capture the most variability in the original dataset.
1. They are orthogonal to (uncorrelated with) one another.

### 3. Fitting PCA
Once you got the gist of what PCA was doing, we used it on handwritten digits within scikit-learn.

We did this all within a function called `do_pca`, which returned the PCA model, as well as the reduced feature matrix. You simply passed in the number of features you wanted back, as well as the original dataset.
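
The lesson's own `do_pca` is not reproduced here, so the following is only a sketch of what a function with that description might look like. The standardization step and the use of scikit-learn's built-in `load_digits` data as a stand-in for the lesson's handwritten digits are assumptions.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def do_pca(n_components, data):
    """Standardize `data`, fit PCA, and return (fitted PCA model, reduced matrix).

    A hypothetical reconstruction based on the description above; the
    scaling step is an assumption.
    """
    X_scaled = StandardScaler().fit_transform(data)
    pca = PCA(n_components)
    X_pca = pca.fit_transform(X_scaled)
    return pca, X_pca


# Stand-in data (an assumption): 1797 images of 8x8 digits, 64 pixel features.
digits, _ = load_digits(return_X_y=True)

pca, X_pca = do_pca(10, digits)
print(digits.shape, "->", X_pca.shape)
```

Returning the fitted model alongside the reduced matrix is convenient because the model is what you inspect later when interpreting the components and the variance they explain.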

### 4. Interpreting Results
You then saw there are two major parts to interpreting the PCA results:

1. The variance explained by each component. You were able to visualize this with scree plots to understand how many components you might keep based on how much information was being retained.
2. The principal components themselves. The weights on the original features gave us an idea of which features contribute most to each component, and therefore which aspects of the original dataset that component explains.
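
Both parts of the interpretation can be read off a fitted scikit-learn `PCA` object. The sketch below is illustrative only: it uses the built-in `load_digits` data and an 85% variance threshold, both of which are assumptions rather than choices made in the lesson. A scree plot is simply a plot of the `explained_variance_ratio_` values computed here.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)   # stand-in data: 64 pixel features
pca = PCA().fit(X)                    # keep every component

# Part 1: variance explained by each component, and the running total.
ratios = pca.explained_variance_ratio_
cumulative = np.cumsum(ratios)

# Smallest number of components whose cumulative variance reaches 85%.
n_keep = int(np.searchsorted(cumulative, 0.85) + 1)
print("components needed for 85% of the variance:", n_keep)

# Part 2: the components themselves. Large absolute weights mark the
# original features (here, pixels) that matter most to a component.
first_component = pca.components_[0]
top_features = np.argsort(np.abs(first_component))[::-1][:5]
print("most heavily weighted features in component 1:", top_features)
```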

### 5. Mini-project
Finally, you applied PCA to a dataset on vehicle information. You gained valuable experience using scikit-learn, as well as interpreting the results of PCA.

With mastery of these skills, you are now ready to use PCA for any task in which you feel it may be useful. If you have a large amount of data and are feeling afflicted by the curse of dimensionality, you can reduce your data to a smaller number of latent features, and you know just the way to do it!

### 6. Do you think you understand PCA well enough yet to explain it in a way that would make sense to your grandmother?
Here is an interesting StackExchange post that does just that, and with animated graphics! https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues