<a href="https://colab.research.google.com/github/comparativechrono/Principles-of-Data-Science/blob/main/Week_3/Section_8_Python_Example__PCA_with_Scikit_learn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Section 8 - PCA with Scikit-learn

Principal Component Analysis (PCA) is one of the most commonly used techniques for dimensionality reduction in data science. PCA reduces the dimensionality of data by transforming the original variables into a new set of variables, which are linear combinations of the original variables. These new variables, called principal components, are orthogonal and ranked according to the variance of data along them, which ensures that the first few retain most of the variation present in all of the original variables. This notebook demonstrates how to implement PCA using Python's scikit-learn library, focusing on a dataset with multiple features to illustrate how PCA can simplify the data while retaining the essential information.
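
To make the description above concrete, here is a minimal NumPy sketch of PCA computed by eigendecomposition of the covariance matrix. It is only an illustration: the toy data, the `mixing` matrix, and all variable names are assumptions made up for this example, and scikit-learn's own `PCA` class uses an SVD-based routine rather than this exact procedure.

```python
import numpy as np

# Hypothetical toy data (not part of the notebook): 200 samples, 3 correlated features
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = np.array([[2.0, 0.5, 1.0],
                   [0.0, 1.0, -0.5]])
X_toy = latent @ mixing + 0.1 * rng.normal(size=(200, 3))

# Center the data, then eigendecompose its covariance matrix
X_centered = X_toy - X_toy.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns eigenvalues in ascending order

# Reorder so the first component is the direction of largest variance
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Each principal component score is a linear combination of the original features
scores = X_centered @ eigvecs

print("Variance along each component:", eigvals)
print("Components are orthonormal:", np.allclose(eigvecs.T @ eigvecs, np.eye(3)))
print("Score variances equal eigenvalues:",
      np.allclose(scores.var(axis=0, ddof=1), eigvals))
```

The columns of `eigvecs` play the role of the principal components: they are mutually orthogonal, and the eigenvalues give the variance of the data along each one.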

1. Setting Up the Environment:

Ensure you have the necessary Python packages installed. NumPy handles the arrays, scikit-learn provides PCA and the example dataset, and matplotlib is used for plotting:

In [None]:
%pip install numpy matplotlib scikit-learn

2. Importing Required Libraries:

Start by importing the libraries needed for data manipulation, PCA, and plotting:

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris

3. Loading and Preprocessing Data:

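For this example, we'll use the Iris dataset, a classic dataset in pattern recognition. It contains 150 samples, 50 from each of three species of Iris flowers, described by four features: sepal length, sepal width, petal length, and petal width. Because PCA is driven by variance, we standardize each feature to zero mean and unit variance first so that no feature dominates simply because of its scale.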

In [None]:
# Load the dataset
data = load_iris()
X = data.data
y = data.target
feature_names = data.feature_names

# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
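
As a quick sanity check on the preprocessing step, the sketch below reloads the data and confirms that each standardized column has mean close to 0 and standard deviation close to 1. The names `col_means` and `col_stds` are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = load_iris().data
X_scaled = StandardScaler().fit_transform(X)

# StandardScaler uses the population standard deviation (ddof=0),
# which is also NumPy's default for .std()
col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)

print("Shape:", X_scaled.shape)
print("Column means:", np.round(col_means, 6))
print("Column stds: ", np.round(col_stds, 6))
```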

4. Applying PCA:

We'll perform PCA to reduce the data from four dimensions to two. This lets us plot the data in a plane while retaining most of its variance; the explained variance ratio printed below quantifies how much each component retains.

In [None]:
# Initialize PCA and reduce to two dimensions
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Explained variance ratio
print("Explained variance ratio:", pca.explained_variance_ratio_)
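
One common way to judge whether two components are enough is to look at the cumulative explained variance across all components. The sketch below fits PCA without limiting `n_components` and reports the running total. The 95% threshold is an assumption chosen for illustration, not a rule, and `pca_full`, `cumulative`, and `n_needed` are names introduced here.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)

# Keep every component so the full variance spectrum is visible
pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

for k, total in enumerate(cumulative, start=1):
    print(f"{k} component(s): {total:.3f} of total variance")

# Smallest number of components whose cumulative ratio reaches the assumed threshold
threshold = 0.95
n_needed = int(np.searchsorted(cumulative, threshold) + 1)
print("Components needed for", threshold, "->", n_needed)
```

scikit-learn can also make this choice directly: passing a float between 0 and 1, such as `PCA(n_components=0.95)`, keeps just enough components to reach that fraction of the variance.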

5. Visualizing the Results:

Visualize the transformed dataset with the first two principal components.

In [None]:
plt.figure(figsize=(8, 6))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', edgecolor='k', s=50)
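plt.title('PCA of Iris Dataset')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
# The colors encode class labels, so a legend with species names is clearer than a colorbar
handles, _ = scatter.legend_elements()
plt.legend(handles, data.target_names, title='Species')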
plt.show()

6. Interpreting the Results:

The plot shows the new two-dimensional representation of the data. Setosa forms a clearly separate cluster, while versicolor and virginica lie close together and overlap slightly. The explained variance ratio printed earlier tells us what fraction of the total variance each principal component captures. For the standardized Iris data, the first two components account for roughly 73% and 23% of the variance, about 96% in total, so little variance is lost in the projection.
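
To see what each principal component represents in terms of the original measurements, one option is to inspect the `components_` attribute of the fitted `PCA` object. Each row holds the weights that combine the four standardized features into one component. This is a small sketch that refits the model so it can run on its own; note that the sign of a component is arbitrary, so only the relative sizes and signs of the weights within a row are meaningful.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = load_iris()
X_scaled = StandardScaler().fit_transform(data.data)
pca = PCA(n_components=2).fit(X_scaled)

# Rows of components_ are principal components; columns are original features
for i, component in enumerate(pca.components_, start=1):
    print(f"PC{i}:")
    for name, weight in zip(data.feature_names, component):
        print(f"  {name:<20s} {weight:+.3f}")
```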

7. Conclusion:

This example illustrates how PCA can be implemented in Python using scikit-learn to reduce the dimensionality of data, simplifying the dataset while retaining most of its variance. PCA is often useful as a preprocessing step for other machine learning algorithms, which may train slowly or overfit on high-dimensional data. Reducing the number of dimensions can improve the performance of these models and makes the data easier to visualize and inspect.
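
As one illustration of PCA used as a preprocessing step, the sketch below chains scaling, PCA, and a classifier in a scikit-learn pipeline and evaluates it with cross-validation. The choice of `LogisticRegression`, two components, and five folds are assumptions made for this example rather than recommendations from the text above.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Inside a pipeline, the scaler and PCA are refitted on each training fold,
# so no information from the held-out fold leaks into the preprocessing
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    LogisticRegression(max_iter=1000),
)

scores = cross_val_score(model, X, y, cv=5)
print("Accuracy per fold:", scores)
print("Mean accuracy:", scores.mean())
```

Wrapping the steps in a pipeline is a design choice that keeps the evaluation honest: fitting the scaler or PCA on the full dataset before splitting would let test data influence the transformation.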

Dimensionality reduction techniques such as PCA are valuable wherever the number of input features is too large to handle effectively: they simplify visualization, make computation more manageable, and can improve the performance of predictive models.