<img src="materials/images/introduction-to-statistics-II-cover.png"/>


# 👋 Welcome, before you start
<br>

### 📚 Module overview

We will go through eleven lessons with you:
    
- [**Lesson 1: Z-score**](Lesson_1_Z-score.ipynb)

- [**Lesson 2: P-value**](Lesson_2_P-value.ipynb)

- [**Lesson 3: Welch's T-test**](Lesson_3_Welchs_T-test.ipynb)

- [**Lesson 4: Log2 Fold Change**](Lesson_4_Log2_Fold_Change.ipynb)

- [**Lesson 5: Pearson Correlation**](Lesson_5_Pearson_Correlation.ipynb)

- [**Lesson 6: Spearman Correlation**](Lesson_6_Spearman_Correlation.ipynb)

- [**Lesson 7: False Discovery Rate**](Lesson_7_False_Discovery_Rate.ipynb)

- [**Lesson 8: Benjamini Hochberg**](Lesson_8_Benjamini_Hochberg.ipynb)

- <font color=#E98300>**Lesson 9: Dimensionality Reduction Methods: Principal Component Analysis**</font>    `📍You are here.`

- [**Lesson 10: Dimensionality Reduction Methods: t-SNE**](Lesson_10_Dimensionality_Reduction_Methods_t-SNE.ipynb)

- [**Lesson 11: UMAP**](Lesson_11_UMAP.ipynb)
</br>



<div class="alert alert-block alert-info">
<h3>⌨️ Keyboard shortcut</h3>

These common shortcuts can save you time as you go through this notebook:
- Run the current cell: **`Shift + Enter`**.
- Add a cell above the current cell: Press **`A`**.
- Add a cell below the current cell: Press **`B`**.
- Change a code cell to a markdown cell: Select the cell, and then press **`M`**.
- Delete a cell: Press **`D`** twice.

Need more help with keyboard shortcuts? Press **`H`** to look them up.
</div>

---

# Lesson 9: Dimensionality Reduction Methods: Principal Component Analysis

<mark>**Principal Component Analysis (PCA)**</mark> is a method that is often used to reduce the **dimensionality (i.e., number of variables)** of large data sets. It works by transforming a large set of variables into a smaller set that retains most of the information in the larger set.

Dimensionality reduction is often useful for visualizing a dataset, or for making a dataset easier to work with and faster for machine learning algorithms to process.

In sum, the idea of PCA is to **reduce the number of variables** in a dataset, while **preserving as much information** as possible.

`🕒 This module should take about 25 minutes to complete.`

`✍️ This notebook is written using Python.`

---

## Principal Components

**Principal components** are newly constructed variables that are linear combinations of the original variables. The principal components represent the directions in the data that explain a maximal amount of variance, i.e., the directions that capture the most information.

For example, in the visualization below, you can see the data as displayed along the original x-axis and y-axis. However, the green line represents a new axis along which a large amount of variability exists among the data points. Additionally, perpendicular to that line, there also exists a large amount of variance among the data points. Those two axes (directions) could be two of the principal components since they contain a lot of information about the original variables. These new axes provide a better view of the differences among the data points. The larger the dispersion of the data points along a line, the larger the variance it carries and the more information it has.


<img src="materials/images/images_dimensionality_reduction_principal_component_analysis/variability.png"/>

The newly constructed variables (i.e., principal components) are built so that the information contained within one component does not overlap with the others. 
- The first principal component contains the largest share of the information from the original variables. 
- The second component contains the largest share of the remaining information, and so on for all n components, where n is the number of original variables. 

Organizing information as principal components allows you to **reduce the dimensionality (i.e. number of variables)** of your dataset without losing much information **by discarding the components that contain minimal information**.

<img src="materials/images/images_dimensionality_reduction_principal_component_analysis/principal_components_chart.png"/>

<div class="alert alert-block alert-success">
    <b>Note:</b> There are as many principal components as there are variables in the data set. The first principal component accounts for the largest possible variance in the dataset. The second principal component accounts for the next highest variance and is uncorrelated with (i.e., perpendicular to) the first principal component. This continues until the total number of principal components is equal to the original number of variables.
</div>

---

# **Principal component analysis involves 5 steps:**
    
1. Standardize the range of values for each variable
2. Compute the covariance matrix to identify correlations among the variables
3. Compute the eigenvectors and eigenvalues of the covariance matrix to identify the principal components
4. Decide how many principal components to keep
5. Reorient the data along the principal components' axes

## Step 1: Standardization

$$\large \text{standardized value} = \frac{x - \text{mean}}{\text{stdev}}$$

We first transform the variables to a comparable range of values by relating each data point to its respective mean, in terms of standard deviations. The aim of this step is to standardize the range of the variables so that each one contributes equally to the analysis. 
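
As a minimal sketch of this step, the cell below standardizes a small made-up array with NumPy. The toy values and variable names are assumptions for illustration only; they are not part of the lesson's dataset.

```python
import numpy as np

# Toy data (made up for illustration): 5 samples, 2 variables on very different scales
X = np.array([
    [1.0, 200.0],
    [2.0, 400.0],
    [3.0, 600.0],
    [4.0, 800.0],
    [5.0, 1000.0],
])

# Standardize each column: subtract its mean, then divide by its standard deviation
mean = X.mean(axis=0)
stdev = X.std(axis=0)  # population standard deviation (ddof=0)
X_std = (X - mean) / stdev

print(X_std.mean(axis=0))  # approximately 0 for each column
print(X_std.std(axis=0))   # approximately 1 for each column
```

After standardization, both columns are on the same scale, so neither variable dominates the analysis simply because its raw values are larger.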

## Step 2: Covariance matrix

The purpose of this step is to see if there are any relationships among the variables in order to identify any redundant information. To identify these correlations, we compute the covariance matrix.

The covariance matrix is an n × n symmetric matrix, where n is the number of variables. It contains the covariances associated with all possible pairs of the n variables.

<img src="materials/images/images_dimensionality_reduction_principal_component_analysis/cov_matrix1.png"/>

It’s the signs of the covariances that tell us about the correlations between the variables. If the sign is positive, the two variables increase or decrease together (positively correlated). If the sign is negative, one variable increases when the other decreases (negatively correlated). There may also be times when there is little to no relationship between variables.

<img src="materials/images/images_dimensionality_reduction_principal_component_analysis/cov_matrix2.png"/>
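
To make the sign interpretation concrete, here is a small sketch using `np.cov` on three made-up variables. The values of `a`, `b`, and `c` are assumptions chosen so that `b` rises with `a` while `c` falls as `a` rises.

```python
import numpy as np

# Toy variables (made up for illustration)
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([2.0, 4.1, 5.9, 8.2, 9.8])  # increases with a
c = np.array([9.0, 7.2, 5.1, 2.9, 1.0])  # decreases as a increases

# Stack into a (samples x variables) array; rowvar=False tells NumPy
# that each column is a variable
X = np.column_stack([a, b, c])
cov = np.cov(X, rowvar=False)

print(cov.shape)   # n x n, where n is the number of variables
print(cov[0, 1])   # covariance of a and b: positive
print(cov[0, 2])   # covariance of a and c: negative
```

The diagonal of `cov` holds each variable's own variance, and the matrix is symmetric because the covariance of `a` with `b` equals the covariance of `b` with `a`.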

## Step 3: Compute the eigenvectors and eigenvalues of the covariance matrix

Eigenvectors and eigenvalues are calculated from the covariance matrix, and always come in pairs such that every eigenvector has an eigenvalue. There are as many eigenvector/eigenvalue pairs as there are variables in the dataset. It is the eigenvectors and eigenvalues that determine the principal components of the data. 

The **eigenvectors** of the **covariance matrix** are the directions of the axes along which there is the most variance (most information); these are what we call <mark>**principal components**</mark>. 

The **eigenvalues** indicate the amount of variance carried in each eigenvector (i.e., principal component). By ranking the eigenvectors in order of their eigenvalues, highest to lowest, you get the <mark>principal components in order of importance</mark>. 

To determine the percentage of variance (information) accounted for by each principal component, we divide the eigenvalue of each component by the sum of eigenvalues.
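
The sketch below illustrates this step with NumPy on randomly generated toy data (an assumption for illustration, not the lesson's dataset). It uses `np.linalg.eigh`, which is intended for symmetric matrices such as a covariance matrix and returns eigenvalues in ascending order, so we re-sort them from highest to lowest.

```python
import numpy as np

# Toy data (made up): 200 samples, 3 variables, the first two strongly related
rng = np.random.default_rng(0)
n = 200
shared = rng.normal(size=n)
X = np.column_stack([
    shared + 0.2 * rng.normal(size=n),
    shared + 0.2 * rng.normal(size=n),
    rng.normal(size=n),
])

# Steps 1 and 2: standardize, then compute the covariance matrix
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
cov = np.cov(X_std, rowvar=False)

# Step 3: eigen-decomposition of the symmetric covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Rank the eigenvector/eigenvalue pairs from highest to lowest eigenvalue
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]  # each column is one principal component

# Share of the total variance carried by each principal component
explained_variance_ratio = eigenvalues / eigenvalues.sum()
print(explained_variance_ratio.round(3))
```

Because two of the three toy variables carry nearly the same information, the first component should account for a large share of the variance here.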

## Step 4: Decide how many principal components to keep

Next, you will decide whether to keep all of the principal components or discard those of lesser importance. 

If we decide to keep only the most important components (carrying the most information), then we are performing **dimensionality reduction**. Our final dataset will have only as many dimensions as the number of principal components we keep.

<div class="alert alert-block alert-warning">
    <b>Alert:</b> The principal components are less interpretable than the original variables, and often have no direct real-world meaning, since they are constructed as combinations of the original variables.
</div>
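
One common way to make this decision is to keep the smallest number of components whose cumulative share of variance reaches a chosen threshold. The sketch below assumes a 95% threshold and uses made-up variance ratios; both are illustrative assumptions, and the right threshold depends on your application.

```python
import numpy as np

# Illustrative (made-up) explained variance ratios, sorted from largest to smallest
explained_variance_ratio = np.array([0.50, 0.25, 0.15, 0.06, 0.04])

# Hypothetical threshold: keep enough components to cover 95% of the variance
threshold = 0.95

# Running total of variance explained as components are added one by one
cumulative = np.cumsum(explained_variance_ratio)

# Index of the first component at which the running total reaches the threshold
n_keep = int(np.argmax(cumulative >= threshold)) + 1

print(cumulative.round(2))
print(n_keep)  # 4 components are needed to reach 95% in this toy example
```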

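## Step 5: Re-orient the data along the principal components' axes
Finally, we re-orient the data from the original axes to the ones represented by the principal components. In practice, this means multiplying the standardized data by the matrix whose columns are the eigenvectors we chose to keep.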
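
The five steps can be sketched end to end with NumPy as follows. The toy data, the choice to keep 2 components, and the variable names (`W`, `X_pca`) are assumptions for illustration.

```python
import numpy as np

# Toy data (made up): 100 samples, 3 correlated variables
rng = np.random.default_rng(42)
mixing = np.array([
    [2.0, 0.5, 0.0],
    [0.0, 1.0, 0.3],
    [0.0, 0.0, 0.2],
])
X = rng.normal(size=(100, 3)) @ mixing

# Step 1: standardize
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix
cov = np.cov(X_std, rowvar=False)

# Step 3: eigenvectors and eigenvalues, ranked from highest to lowest eigenvalue
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]

# Step 4: keep the top 2 components (an arbitrary choice for this sketch)
n_keep = 2
W = eigenvectors[:, order[:n_keep]]  # columns are the kept eigenvectors

# Step 5: re-orient the data along the kept principal components
X_pca = X_std @ W

print(X_pca.shape)                          # (100, 2)
print(np.cov(X_pca, rowvar=False).round(3)) # off-diagonal entries are about 0
```

The covariance matrix of the re-oriented data is diagonal: the new variables are uncorrelated, and their variances equal the eigenvalues of the components we kept.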

---

### PCA effectively provides:
- A measure of how each variable is associated with the others - **Covariance matrix**.
- Insight into the directions in which our data are dispersed (where most of the information resides) - **Eigenvectors**.
- The relative importance of these different directions - **Eigenvalues**.
- Dimensionality reduction of our data set by enabling us to drop the eigenvectors (i.e., principal components) that are relatively unimportant.

---

# Example dimensionality reduction using PCA

<img src="materials/images/images_dimensionality_reduction_principal_component_analysis/cell_morphology.png"/>

Below, we have a dataset of breast cancer cell morphology. We would like to use the variables to predict whether a tumor is benign or malignant.

### ✅ `Run` each of the cells below:

### Preview the dataset

In [None]:
import pandas as pd
df = pd.read_csv("data/data_dimensionality_reduction_principal_component_analysis/breast-cancer.csv")
df.head()

There are 10 independent variables used to predict the dependent variable named "diagnosis". We will look to reduce the number of variables (dimensions) while maintaining predictive performance.

### Standardize the data set

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = df.iloc[:,:-1].values
y = df.iloc[:,-1].values

# Split dataset into test/train sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, test_size=0.3)

# Standardize features
stdsc = StandardScaler()
X_train_std = stdsc.fit_transform(X_train)
X_test_std = stdsc.transform(X_test)

# Dimensionality Reduction
We will perform principal component analysis to compute all 10 principal components, ordered from most to least information.

### Principal Component Analysis (PCA)

In [None]:
from sklearn.decomposition import PCA

pca = PCA()
X_train_pca = pca.fit_transform(X_train_std)
X_test_pca = pca.transform(X_test_std)

### Below are the explained variance ratios of the 10 principal components:

In [None]:
ratios = pca.explained_variance_ratio_.round(3)
pd.DataFrame(data = [ratios], 
             columns = ["PC1", "PC2", "PC3", "PC4", "PC5", "PC6", "PC7", "PC8", "PC9", "PC10"], 
             index=["Variance Explained:"])

We can see that the first component carries about 56% of the information, more than any other component, which justifies its position. 

The second component carries around 24% of the information, the third almost 9%, and the fourth just over 5%. 

The remaining six components each carry less than 5% of the information, and collectively carry only about 6%. 

We will look at the effects of reducing the dimensionality of the data set by removing these less important components.

---

# Machine Learning

## Logistic Regression
#### Perform logistic regression to make predictions using all 10 of the original variables.

In [None]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression()
log_reg.fit(X_train_std, y_train)
log_reg.score(X_test_std, y_test)

#### Perform logistic regression to make predictions using all 10 of the principal components.

In [None]:
# Use all 10 principal components
log_reg.fit(X_train_pca, y_train)
log_reg.score(X_test_pca, y_test)

We can see that the performance is identical when using all 10 of the original variables and all 10 of the principal components. They both achieved an accuracy of around <mark>**93%**</mark>. This is expected: keeping all 10 components only re-orients the data and discards no information.

---

## Reduce the number of dimensions

Let's reduce the number of dimensions to the top 5 principal components since they contain about 98% of the information.
#### Perform logistic regression to make predictions using only the 5 most important principal components.

In [None]:
pca = PCA(n_components=5)
X_train_pca = pca.fit_transform(X_train_std)
X_test_pca = pca.transform(X_test_std)
ratios = pca.explained_variance_ratio_.round(3)

PC = pd.DataFrame(data = [ratios], 
             columns = ["PC1", "PC2", "PC3", "PC4", "PC5"], 
             index=["Variance Explained:"])
PC["Total Variance"] = PC.sum(axis=1)
PC

In [None]:
# Use 5 principal components
log_reg.fit(X_train_pca, y_train)
log_reg.score(X_test_pca, y_test)

Note that even though we cut the number of dimensions in half (to 5), the performance remained almost identical at <mark>**92.3%**</mark>. This is because nearly all of the information is contained within those first 5 components.

---

## Reduce the number of dimensions again

Let's reduce the number of dimensions one more time. Let's try the first 4 principal components which contain about 94% of the information.

#### Perform logistic regression to make predictions using only the 4 most important principal components.

In [None]:
pca = PCA(n_components=4)
X_train_pca = pca.fit_transform(X_train_std)
X_test_pca = pca.transform(X_test_std)
ratios = pca.explained_variance_ratio_.round(3)

PC = pd.DataFrame(data = [ratios], 
             columns = ["PC1", "PC2", "PC3", "PC4"], 
             index=["Variance Explained:"])
PC["Total Variance"] = PC.sum(axis=1)
PC

In [None]:
# Use 4 principal components
log_reg.fit(X_train_pca, y_train)
log_reg.score(X_test_pca, y_test)

## Accuracy using principal components and dimensionality reduction

- 10 original variables: **92.9%**
- 10 principal components: **92.9%**
- 5 principal components: **92.3%**
- 4 principal components: **91.2%**

Reducing the dimensionality of the data set from 10 to 4 lowered the accuracy only slightly, to <mark>**91.2%**</mark>. This may be an excellent tradeoff: nearly identical performance with 60% fewer dimensions (variables). **This is the advantage of using principal component analysis**. 
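
The comparison above can also be written as a single loop. The sketch below chains the scaler, PCA, and logistic regression with scikit-learn's `make_pipeline`. As an assumption, it uses scikit-learn's bundled breast cancer dataset (30 features) as a stand-in for the lesson's 10-variable CSV, so the accuracies it prints will differ from those listed above. Setting `max_iter=1000` is also our choice, made only to avoid convergence warnings.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data: scikit-learn's bundled breast cancer dataset (30 features),
# NOT the CSV used in this lesson
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=1, test_size=0.3
)

scores = {}
for k in (10, 5, 4):
    # Standardize, keep the top k principal components, then classify
    model = make_pipeline(
        StandardScaler(),
        PCA(n_components=k),
        LogisticRegression(max_iter=1000),
    )
    model.fit(X_train, y_train)
    scores[k] = model.score(X_test, y_test)

for k, accuracy in scores.items():
    print(f"{k} principal components: {accuracy:.3f}")
```

Using a pipeline ensures the scaler and PCA are fitted on the training set only and then applied unchanged to the test set, which is what the separate `fit_transform` and `transform` calls in the cells above do by hand.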

---

# 🌟 Ready for the next one?
<br>


- [**Lesson 10: Dimensionality Reduction Methods: t-SNE**](Lesson_10_Dimensionality_Reduction_Methods_t-SNE.ipynb)

- [**Lesson 11: UMAP**](Lesson_11_UMAP.ipynb)
</br>

---

# Contributions & acknowledgment

Thanks to Antony Ross for contributing the content for this notebook.

---

Copyright (c) 2022 Stanford Data Ocean (SDO)

All rights reserved.