<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/Python-Notebook-Banners/Exercise.png"  style="display: block; margin-left: auto; margin-right: auto;";/>
</div>

# Exercise: Principal component analysis
© ExploreAI Academy

In this exercise, we apply PCA to a dataset, evaluate the cumulative variance explained, and determine the appropriate number of components to retain.

## Learning objectives

By the end of this exercise, you should be able to:
* Apply PCA to reduce a dataset’s dimensionality.
* Evaluate the variance explained by each principal component and the cumulative variance explained as components are added.
* Determine the number of components needed to capture at least 85% of the variance.

## Overview

The Digits dataset consists of 1,797 images of handwritten digits, each an 8×8 image flattened into a 64-dimensional feature vector. This high dimensionality makes the data harder to visualise and explore, and it can increase the complexity of models trained on it.

In this exercise, we apply PCA to the Digits dataset and evaluate its ability to reduce the dataset's dimensionality while retaining valuable information.

## Import libraries 

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

## Load and prepare dataset

In [2]:
# Load the dataset
digits = load_digits()
X = digits.data
y = digits.target

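# Standardise the features (zero mean, unit variance) so that no pixel
# dominates the principal components purely because of its scale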
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
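
Before applying PCA, it can be worth confirming that the standardisation behaved as expected. The check below is a minimal sketch rather than part of the exercise: the names `col_means` and `col_stds` are introduced here for illustration. It relies on the fact that `StandardScaler` leaves constant (zero-variance) columns at 0 instead of dividing by zero.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler

X = load_digits().data
X_scaled = StandardScaler().fit_transform(X)

# After scaling, every column should have a mean of (approximately) 0.
col_means = X_scaled.mean(axis=0)

# Columns should have a standard deviation of ~1, except constant pixels,
# which stay at 0 because they carry no variance to rescale.
col_stds = X_scaled.std(axis=0)

print("Largest absolute column mean:", np.abs(col_means).max())
print("Number of constant columns:", int(np.sum(np.isclose(col_stds, 0.0))))
```

Constant columns contribute nothing to the total variance, so they do not affect which components PCA finds.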

## Exercises

### Exercise 1

To reduce the dataset's dimensionality, let's transform the standardised dataset by applying PCA.

In [5]:
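# Fit PCA with all components retained and project the standardised data
pca = PCA()
pca_dimension_reduced = pca.fit_transform(X_scaled)
# Store the projected data in a DataFrame (one column per principal component)
pca_df = pd.DataFrame(data=pca_dimension_reduced)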

### Exercise 2

To understand which components carry the most information, we can assess how much of the dataset's variance is captured by each principal component.

Compute and print the explained variance ratio (the `explained_variance_ratio_` attribute) for each principal component, formatted to four decimal places.

In [10]:
for index, ratio in enumerate(pca.explained_variance_ratio_):
    print(f"Component {index+1} has a variance ratio of {ratio:.4f}")

Component 1 has a variance ratio of 0.1203
Component 2 has a variance ratio of 0.0956
Component 3 has a variance ratio of 0.0844
Component 4 has a variance ratio of 0.0650
Component 5 has a variance ratio of 0.0486
Component 6 has a variance ratio of 0.0421
Component 7 has a variance ratio of 0.0394
Component 8 has a variance ratio of 0.0339
Component 9 has a variance ratio of 0.0300
Component 10 has a variance ratio of 0.0293
Component 11 has a variance ratio of 0.0278
Component 12 has a variance ratio of 0.0258
Component 13 has a variance ratio of 0.0228
Component 14 has a variance ratio of 0.0223
Component 15 has a variance ratio of 0.0217
Component 16 has a variance ratio of 0.0191
Component 17 has a variance ratio of 0.0178
Component 18 has a variance ratio of 0.0164
Component 19 has a variance ratio of 0.0160
Component 20 has a variance ratio of 0.0149
Component 21 has a variance ratio of 0.0135
Component 22 has a variance ratio of 0.0127
Component 23 has a variance ratio of 0.01

### Exercise 3

We can also evaluate how much total variance is captured as components are added incrementally. This can help us get a view of how many components are needed to capture a substantial proportion of the dataset's variance.

Determine the cumulative variance ratio by taking the running sum of the explained variance ratios across the principal components.

In [12]:
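# np.cumsum gives a running total, so the value printed for component k is the
# share of variance captured by components 1 to k together, not by k alone.
for index, ratio in enumerate(np.cumsum(pca.explained_variance_ratio_)):
    print(f"Component {index+1} has a variance ratio of {ratio:.4f}")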

Component 1 has a variance ratio of 0.1203
Component 2 has a variance ratio of 0.2159
Component 3 has a variance ratio of 0.3004
Component 4 has a variance ratio of 0.3654
Component 5 has a variance ratio of 0.4140
Component 6 has a variance ratio of 0.4561
Component 7 has a variance ratio of 0.4955
Component 8 has a variance ratio of 0.5294
Component 9 has a variance ratio of 0.5594
Component 10 has a variance ratio of 0.5887
Component 11 has a variance ratio of 0.6166
Component 12 has a variance ratio of 0.6423
Component 13 has a variance ratio of 0.6651
Component 14 has a variance ratio of 0.6874
Component 15 has a variance ratio of 0.7090
Component 16 has a variance ratio of 0.7281
Component 17 has a variance ratio of 0.7459
Component 18 has a variance ratio of 0.7623
Component 19 has a variance ratio of 0.7782
Component 20 has a variance ratio of 0.7931
Component 21 has a variance ratio of 0.8066
Component 22 has a variance ratio of 0.8193
Component 23 has a variance ratio of 0.83

### Exercise 4

Based on the results from **Exercise 3**, determine how many components are needed to capture at least 85% of the total variance.

Discuss the impact of this on subsequent analysis or modelling.

In [14]:
cum_explained_variance_ratio = np.cumsum(pca.explained_variance_ratio_)
for index, ratio in enumerate(cum_explained_variance_ratio):
    if ratio >= 0.85:
        print(f"{index+1} components needed to capture at least 85% of total variance")
        break

25 components needed to capture at least 85% of total variance


## Solutions

### Exercise 1

In [None]:
# Apply PCA
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

### Exercise 2

In [None]:
# Get the explained variance ratio
explained_variance_ratio = pca.explained_variance_ratio_

# Print the explained variance ratio
for i, ev in enumerate(explained_variance_ratio):
    print(f"PC{i+1}: Explained Variance = {ev:.4f}")

The explained variance ratios show that the first few principal components capture the most variance: PC1 accounts for 12.03%, PC2 for 9.56%, and PC3 for 8.44%. The ratio decreases with each later component, so each one carries less information about the dataset's overall variance than the one before it.
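
To make the ratio itself concrete, the sketch below computes it by hand with NumPy on a small synthetic dataset. The toy data and the names `X_toy`, `component_variance`, and `toy_ratio` are assumptions made for illustration and are not part of the exercise. The idea is that the variance along each component, divided by the total variance, gives that component's explained variance ratio.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical toy data: 200 samples, 3 features with very different spreads.
X_toy = rng.normal(size=(200, 3)) * np.array([5.0, 2.0, 0.5])

# Centre the data, then take the singular values. The squared singular
# values are proportional to the variance along each principal component.
X_centred = X_toy - X_toy.mean(axis=0)
singular_values = np.linalg.svd(X_centred, compute_uv=False)
component_variance = singular_values**2 / (X_centred.shape[0] - 1)

# Explained variance ratio = variance along a component / total variance.
toy_ratio = component_variance / component_variance.sum()
print(np.round(toy_ratio, 4))
```

The ratios come out sorted from largest to smallest and sum to 1, which is the same pattern seen in the Digits output.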

### Exercise 3

In [None]:
# Calculate the cumulative variance ratio
cumulative_variance_ratio = np.cumsum(explained_variance_ratio)

# Print the cumulative variance ratio
for i, cv in enumerate(cumulative_variance_ratio):
    print(f"PC{i+1}: Cumulative Variance = {cv:.4f}")

The results show that the cumulative variance grows with each additional principal component. For example, PC1 alone captures 12.03%, and the first 10 components together capture 58.87%. All 64 components together capture 100% of the variance. This helps identify how many components are needed to capture a significant portion of the dataset's variance.
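
A plot often makes the cumulative curve easier to read than a long printout. The following is a minimal sketch, assuming a dashed reference line at the 85% threshold used in this exercise; the figure size, labels, and variable names are illustrative choices.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_digits().data)
cumulative = np.cumsum(PCA().fit(X_scaled).explained_variance_ratio_)
component_numbers = np.arange(1, len(cumulative) + 1)

fig, ax = plt.subplots(figsize=(7, 4))
# Cumulative variance against the number of components retained.
ax.plot(component_numbers, cumulative, marker="o", markersize=3)
# Dashed reference line at the 85% target.
ax.axhline(0.85, color="grey", linestyle="--", label="85% threshold")
ax.set_xlabel("Number of components")
ax.set_ylabel("Cumulative explained variance ratio")
ax.set_title("Cumulative variance explained (Digits, standardised)")
ax.legend()
fig.tight_layout()
```

In a notebook the figure renders when the cell finishes; in a script, call `plt.show()` or `fig.savefig(...)` afterwards. The point where the curve crosses the dashed line is the number of components asked for in Exercise 4.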

### Exercise 4

The cumulative variance ratio shows that to reach at least 85% cumulative variance, 25 components are needed.

We observe that by retaining the first 25 components, we can capture 85.13% of the total variance in the dataset, which significantly reduces the dataset's dimensionality while retaining most of its information.
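
As an alternative to looping over the cumulative ratios, scikit-learn's `PCA` accepts a float between 0 and 1 for `n_components` and then keeps just enough components to exceed that share of variance. The sketch below assumes the same standardised Digits data; `pca_85` and `X_reduced` are names introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_digits().data)

# A float in (0, 1) asks PCA to select the number of components from the
# variance target; svd_solver="full" is set explicitly for this mode.
pca_85 = PCA(n_components=0.85, svd_solver="full")
X_reduced = pca_85.fit_transform(X_scaled)

print("Components kept:", pca_85.n_components_)
print("Variance retained:", round(pca_85.explained_variance_ratio_.sum(), 4))
print("Reduced shape:", X_reduced.shape)
```

The number of components kept should agree with the count found by the loop in Exercise 4, which gives a quick cross-check of the manual result.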

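Using 25 components instead of the original 64 shrinks the feature space of any downstream model, which can shorten training time and may reduce overfitting. The trade-off is interpretability: each component is a linear combination of all 64 pixels, so it is harder to interpret than an individual pixel.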
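
One way to check the effect on a downstream model is to train the same classifier with and without the PCA step and compare cross-validated accuracy. This is a sketch under stated assumptions: logistic regression, 5-fold cross-validation, and `max_iter=2000` are arbitrary illustrative choices, not something the exercise prescribes.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)

# Two pipelines that differ only in the PCA step. Scaling and PCA are fitted
# inside each cross-validation fold, so no test data leaks into them.
full_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))
pca_model = make_pipeline(
    StandardScaler(), PCA(n_components=25), LogisticRegression(max_iter=2000)
)

full_scores = cross_val_score(full_model, X, y, cv=5)
pca_scores = cross_val_score(pca_model, X, y, cv=5)

print(f"64 features:   mean accuracy = {full_scores.mean():.3f}")
print(f"25 components: mean accuracy = {pca_scores.mean():.3f}")
```

If the two scores are close, the 25-component representation is keeping most of the information the classifier needs. A large gap would suggest that the discarded variance still matters for this task.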

PCA also helps with visualisation. Although 25 dimensions cannot be plotted directly, the first two or three components can, and such plots can provide meaningful insights into class separations or clustering.
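
A sketch of such a visualisation is shown below, using only the first two components. According to the Exercise 3 output these two capture about 21.59% of the variance, so the plot is a coarse view rather than a faithful picture of the full dataset. The colour map and marker size are illustrative choices.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
# Standardise, then keep only the first two principal components.
X_2d = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))

fig, ax = plt.subplots(figsize=(6, 5))
# One point per image, coloured by its digit label (0 to 9).
points = ax.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap="tab10", s=8)
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.set_title("Digits projected onto the first two components")
fig.colorbar(points, ax=ax, label="Digit")
fig.tight_layout()
```

In a notebook the figure renders when the cell finishes; in a script, call `plt.show()` or `fig.savefig(...)` afterwards.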

Therefore, the ability to capture over 85% of the variance with 25 components makes PCA a viable dimensionality reduction technique for this dataset, preserving most information while simplifying further analyses.

<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/ExploreAI_logos/EAI_Blue_Dark.png"  style="width:200px";/>
</div>