<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/Python-Notebook-Banners/Exercise.png"  style="display: block; margin-left: auto; margin-right: auto;";/>
</div>

# Exercise: Principal component analysis
© ExploreAI Academy

In this exercise, we apply PCA to a dataset, evaluate the cumulative variance explained, and determine the appropriate number of components to retain.

## Learning objectives

By the end of this exercise, you should be able to:
* Apply PCA to reduce a dataset’s dimensionality.
* Evaluate the cumulative variance explained by each principal component.
* Determine the number of components needed to capture at least 85% of the variance. 

## Overview

The Digits dataset consists of 1,797 images of handwritten digits, each an 8×8 greyscale image flattened into a 64-dimensional feature vector. This high dimensionality makes the data harder to visualise and explore, and it can increase the complexity of downstream models.

In this exercise, we apply PCA to the Digits dataset and evaluate its ability to reduce the dataset's dimensionality while retaining valuable information.

## Import libraries 

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

## Load and prepare dataset

In [None]:
# Load the dataset
digits = load_digits()
X = digits.data
y = digits.target

# Standardise the features to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
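
As a quick sanity check (a sketch that is not part of the original notebook), we can confirm the shape of the data and that the scaled features are centred at zero. The variable names mirror the cell above; counting constant pixels is an added check, included on the assumption that it is useful to know which features carry no variance before applying PCA.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler

digits = load_digits()
X = digits.data
X_scaled = StandardScaler().fit_transform(X)

# Number of samples and features
print(X.shape)  # (1797, 64)

# After standardisation, every feature mean should be (numerically) zero
print(np.allclose(X_scaled.mean(axis=0), 0))

# Pixels with the same value in every image have zero variance;
# StandardScaler leaves such columns at zero instead of dividing by zero
n_constant = int((X.std(axis=0) == 0).sum())
print(f"Constant pixels: {n_constant}")
```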

## Exercises

### Exercise 1

To reduce the dataset's dimensionality, let's transform the standardised dataset by applying PCA.

In [None]:
# Your solution here...

### Exercise 2

To understand which components carry the most information, we can assess how much of the dataset's variance is captured by each principal component.

Compute and print the explained variance ratio for each principal component, formatted to four decimal places.

In [None]:
# Your solution here...

### Exercise 3

We can also evaluate how much total variance is captured as components are added incrementally. This can help us get a view of how many components are needed to capture a substantial proportion of the dataset's variance.

Determine the cumulative variance ratio by summing the explained variance ratios of each principal component.

In [None]:
# Your solution here...

### Exercise 4

Based on the results from **Exercise 3**, determine how many components are needed to capture at least 85% of the total variance.

Discuss the impact of this on subsequent analysis or modelling.

## Solutions

### Exercise 1

In [None]:
# Apply PCA
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

### Exercise 2

In [None]:
# Get the explained variance ratio
explained_variance_ratio = pca.explained_variance_ratio_

# Print the explained variance ratio
for i, ev in enumerate(explained_variance_ratio):
    print(f"PC{i+1}: Explained Variance = {ev:.4f}")

The explained variance ratios are sorted in descending order, so the first few principal components (PC1, PC2, PC3) capture the most variance, with PC1 alone accounting for 12.03% of the total. Each later component explains progressively less, so it carries less information about the dataset's overall variance.
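
To make the meaning of these ratios concrete, the sketch below recomputes them by hand: each ratio is an eigenvalue of the sample covariance matrix divided by the sum of all eigenvalues. This is an illustrative check under the assumption that all 64 components are kept; the name `manual_ratio` is introduced here and does not appear in the notebook.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_digits().data)
pca = PCA().fit(X_scaled)

# Eigenvalues of the sample covariance matrix, sorted from largest to smallest
eigenvalues = np.linalg.eigvalsh(np.cov(X_scaled, rowvar=False))[::-1]

# Each eigenvalue's share of the total variance
manual_ratio = eigenvalues / eigenvalues.sum()

# Compare the hand-computed shares with scikit-learn's attribute
matches = np.allclose(manual_ratio, pca.explained_variance_ratio_)
print(matches)
```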

### Exercise 3

In [None]:
# Calculate the cumulative variance ratio
cumulative_variance_ratio = np.cumsum(explained_variance_ratio)

# Print the cumulative variance ratio
for i, cv in enumerate(cumulative_variance_ratio):
    print(f"PC{i+1}: Cumulative Variance = {cv:.4f}")

The results show that the cumulative variance grows with each additional principal component. For example, PC1 alone captures 12.03%, and the first 10 components together capture 58.87%. All 64 components together capture 100% of the variance. This running total helps identify how many components are needed to capture a chosen share of the dataset's variance.
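
A plot often makes this trend easier to read than a printed list. The following is a minimal sketch, not part of the original solution, that draws the individual and cumulative ratios together with a dashed line at the 85% target from the learning objectives. The figure size, colours, and labels are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_digits().data)
explained_variance_ratio = PCA().fit(X_scaled).explained_variance_ratio_
cumulative_variance_ratio = np.cumsum(explained_variance_ratio)

# Component numbers start at 1 to match the PC1, PC2, ... labels
components = np.arange(1, len(explained_variance_ratio) + 1)

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(components, explained_variance_ratio, alpha=0.5, label="Individual")
ax.step(components, cumulative_variance_ratio, where="mid", label="Cumulative")
ax.axhline(0.85, color="red", linestyle="--", label="85% threshold")
ax.set_xlabel("Principal component")
ax.set_ylabel("Explained variance ratio")
ax.legend()
plt.show()
```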

### Exercise 4

The cumulative variance ratio first reaches 85% at the 25th component: the first 25 components together capture 85.13% of the total variance. Retaining them cuts the dimensionality from 64 to 25 while keeping most of the dataset's information.
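
Rather than scanning the printed list, the component count can be found programmatically. This is one possible sketch; the names `threshold` and `n_components_85` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_digits().data)
cumulative_variance_ratio = np.cumsum(PCA().fit(X_scaled).explained_variance_ratio_)

threshold = 0.85  # target share of variance, taken from the exercise

# np.argmax returns the index of the first True value;
# add 1 to turn the zero-based index into a component count
n_components_85 = int(np.argmax(cumulative_variance_ratio >= threshold)) + 1

print(f"Components needed: {n_components_85}")
print(f"Variance captured: {cumulative_variance_ratio[n_components_85 - 1]:.4f}")
```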

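Using 25 components instead of the original 64 shrinks the feature space of any downstream model, which can shorten training time and reduce the risk of overfitting. The trade-off is interpretability: each component is a linear combination of all 64 pixels, so it is harder to relate back to the original image than a single pixel is.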
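
One way to examine the effect on a downstream model is to cross-validate the same classifier with and without the PCA step. The sketch below is an assumption-laden illustration: the choice of `LogisticRegression`, `max_iter=2000`, and 5-fold cross-validation are all ours, not the notebook's, and the scores it prints should be read as indicative only.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)

# Baseline: classifier trained on all 64 standardised features
full_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))

# Reduced: keep only enough components to explain 85% of the variance
reduced_model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.85, svd_solver="full"),
    LogisticRegression(max_iter=2000),
)

# Scaling and PCA are refitted inside each fold, so no test data leaks in
full_scores = cross_val_score(full_model, X, y, cv=5)
reduced_scores = cross_val_score(reduced_model, X, y, cv=5)

print(f"All 64 features:    {full_scores.mean():.3f}")
print(f"PCA (85% variance): {reduced_scores.mean():.3f}")
```

Placing the scaler and PCA inside a pipeline keeps the comparison fair, because both are fitted on the training folds only.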

The leading components also make visualising the data easier: plotting the first two or three can reveal class separations or clusters that cannot be seen directly in 64 dimensions.
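
As an illustration of the visualisation point, the sketch below projects the standardised data onto its first two principal components and colours each point by its digit label. The colour map and marker size are arbitrary choices, and two components capture only part of the variance, so some classes may overlap.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

digits = load_digits()
X_scaled = StandardScaler().fit_transform(digits.data)

# Keep only the first two principal components for a 2D view
X_2d = PCA(n_components=2).fit_transform(X_scaled)

fig, ax = plt.subplots(figsize=(7, 6))
scatter = ax.scatter(X_2d[:, 0], X_2d[:, 1], c=digits.target, cmap="tab10", s=10)
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
fig.colorbar(scatter, ax=ax, label="Digit")
plt.show()
```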

Therefore, the ability to capture over 85% of the variance with 25 components makes PCA a viable dimensionality reduction technique for this dataset, preserving most information while simplifying further analyses.
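
In practice, scikit-learn can select the number of components for us: passing a float between 0 and 1 as `n_components` keeps the smallest number of components whose cumulative explained variance exceeds that fraction. The sketch below sets `svd_solver="full"` explicitly, since that is the solver the documentation ties to this behaviour; the names `pca_85` and `X_reduced` are introduced here for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_digits().data)

# A float n_components is interpreted as a target share of variance
pca_85 = PCA(n_components=0.85, svd_solver="full")
X_reduced = pca_85.fit_transform(X_scaled)

print(X_reduced.shape)
print(f"Components kept: {pca_85.n_components_}")
print(f"Variance captured: {pca_85.explained_variance_ratio_.sum():.4f}")
```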

<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/ExploreAI_logos/EAI_Blue_Dark.png"  style="width:200px";/>
</div>