# Principal Component Analysis (PCA) on Iris Dataset

In this notebook, we will apply **Principal Component Analysis (PCA)** to the **Iris dataset**. PCA is a dimensionality reduction technique that projects the data onto a smaller number of orthogonal directions (the principal components) while preserving as much of the variance in the data as possible.

## Steps:
1. **Load the Iris dataset**
2. **Standardize the data**
3. **Apply PCA**
4. **Visualize the results**
5. **Interpret the explained variance**

In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris

# Step 1: Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Check the shape of the data
X.shape

The Iris dataset contains 150 samples with 4 features each. PCA looks for the directions of greatest variance, so features measured on a larger scale would otherwise dominate the components. Let's standardize the data so that every feature contributes on an equal footing.
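To make the scale sensitivity concrete, the following minimal sketch prints the variance of each raw feature and its share of the total. The names `raw_variances` and `variance_share` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Per-feature variance of the raw (unscaled) measurements
raw_variances = iris.data.var(axis=0)

# Fraction of the total variance contributed by each feature
variance_share = raw_variances / raw_variances.sum()

for name, var, share in zip(iris.feature_names, raw_variances, variance_share):
    print(f"{name}: variance={var:.3f}, share={share:.1%}")
```

A feature with a much larger share than the others will pull the first principal component toward itself unless the data is rescaled first.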

In [2]:
# Step 2: Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Check the mean and standard deviation after scaling
np.mean(X_scaled, axis=0), np.std(X_scaled, axis=0)

The data has been standardized, which ensures that each feature has a mean of 0 and a standard deviation of 1. We can now apply PCA to reduce the dataset to two principal components.
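As a sanity check on what `StandardScaler` computes, the sketch below standardizes the data by hand and compares the result. It assumes the scaler's default settings (centering and scaling both enabled); `X_manual` and `matches` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = load_iris().data

# z-score by hand: subtract the column mean, divide by the column
# standard deviation (population form, ddof=0, which NumPy uses by default)
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

X_sklearn = StandardScaler().fit_transform(X)

matches = np.allclose(X_manual, X_sklearn)
print(matches)
```

The comparison passes only with the population standard deviation (`ddof=0`); using the sample form (`ddof=1`) would give slightly different values.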

In [3]:
# Step 3: Apply PCA
pca = PCA(n_components=2)  # Reduce to 2 components
X_pca = pca.fit_transform(X_scaled)

# Check the shape of the PCA-transformed data
X_pca.shape

We have reduced the dataset to 2 principal components. Next, let's visualize the transformed data in a 2D plot.
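For readers curious about what `fit_transform` does internally, here is a rough sketch of PCA via an eigendecomposition of the covariance matrix. It illustrates the underlying math and is not necessarily the solver scikit-learn uses. The names `X_manual` and `same_up_to_sign` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)

# Covariance matrix of the standardized features (columns are variables)
cov = np.cov(X_scaled, rowvar=False)

# eigh handles symmetric matrices and returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project onto the two leading eigenvectors (the data is already centered)
X_manual = X_scaled @ eigvecs[:, :2]

pca = PCA(n_components=2)
X_sklearn = pca.fit_transform(X_scaled)

# Each component is only defined up to a sign flip, so compare magnitudes
same_up_to_sign = np.allclose(np.abs(X_manual), np.abs(X_sklearn))
print(same_up_to_sign)
```

The leading eigenvalues correspond to `pca.explained_variance_`, and the eigenvectors to the rows of `pca.components_`, again up to sign.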

In [4]:
# Step 4: Visualize the PCA results
plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', edgecolor='k', s=100)
plt.title('PCA of Iris Dataset')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.colorbar(ticks=[0, 1, 2], label='Class Label (0=setosa, 1=versicolor, 2=virginica)')
plt.show()

[Figure: scatter plot of the Iris samples on Principal Component 1 (x-axis) and Principal Component 2 (y-axis), colored by class label]

The scatter plot above shows the first two principal components of the Iris dataset. Each point is color-coded according to its class label (species of Iris).
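To relate the two plot axes back to the original measurements, one option is to inspect the component loadings stored in `pca.components_`. The sketch below refits the same pipeline so that it runs on its own; the `loadings` table and its `PC1`/`PC2` column labels are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)
pca = PCA(n_components=2).fit(X_scaled)

# components_ has shape (n_components, n_features); transpose it so that
# each row is an original feature and each column a principal component
loadings = pd.DataFrame(
    pca.components_.T,
    index=iris.feature_names,
    columns=["PC1", "PC2"],
)
print(loadings.round(3))
```

Features with large absolute loadings on a component contribute most to that axis. The signs are arbitrary and may flip between library versions.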

Next, let's check the **explained variance** to see how much of the original data's variance is captured by the two principal components.

In [5]:
# Step 5: Explained Variance
print(f'Explained Variance Ratio: {pca.explained_variance_ratio_}')
print(f'Total Variance Explained by 2 Components: {pca.explained_variance_ratio_.sum():.4f}')

Explained Variance Ratio: [0.72962445 0.22850762]
Total Variance Explained by 2 Components: 0.9581

The first two principal components explain approximately **95.81%** of the total variance in the standardized dataset (about 72.96% from the first component and 22.85% from the second), so the 2D projection retains most of the structure in the original four features.
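A common follow-up is to look at the cumulative explained variance across all components and pick the smallest number that reaches a target. The sketch below does this for the standardized data. The 0.95 `threshold` is a hypothetical choice, and `pca_full`, `cumulative`, and `n_needed` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)

# Fit with all components kept so the full variance spectrum is available
pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

threshold = 0.95  # hypothetical target share of variance to retain

# Index of the first cumulative value that reaches the threshold, plus one
n_needed = int(np.argmax(cumulative >= threshold)) + 1

for k, total in enumerate(cumulative, start=1):
    print(f"{k} component(s): {total:.4f}")
print(f"Components needed for {threshold:.0%}: {n_needed}")
```

scikit-learn's `PCA` also accepts a fraction between 0 and 1 for `n_components`, which performs a similar selection during fitting.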

This concludes our PCA example. We have demonstrated how to reduce the dimensionality of the Iris dataset using PCA, visualize the results, and interpret the explained variance.