# Principal Component Analysis (PCA) in 5 Minutes

In this notebook, we explore the main ideas behind Principal Component Analysis (PCA), as explained in the StatQuest video. PCA is widely used in machine learning and data visualization to reduce the dimensionality of data, making it easier to plot and analyze.

We will demonstrate PCA using a simplified example in which cells (or any other entities, such as people, cars, or cities) are described by a few characteristics. The aim is to identify potential groups of cells based on these characteristics. We will use Python's scikit-learn (`sklearn`) library to perform PCA.

Let's start by importing the necessary libraries.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

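Let's create a synthetic dataset representing the gene activity in cells. The values are drawn independently from a uniform distribution on [0, 1), so the data contain no built-in groups; the dataset only serves to show the mechanics of PCA.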

In [None]:
np.random.seed(0)
gene_activity = np.random.rand(100,3) # 100 cells with 3 genes each
df = pd.DataFrame(data=gene_activity, columns=['Gene1', 'Gene2', 'Gene3'])

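We first standardize each gene to zero mean and unit variance, so that no gene dominates simply because of its scale. We then apply PCA to reduce the 3-dimensional data to 2 dimensions, which can be easily visualized.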

In [None]:
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df)

pca = PCA(n_components=2) # we want to reduce to 2 dimensions
principalComponents = pca.fit_transform(scaled_data)
principalDf = pd.DataFrame(data=principalComponents, columns=['PC1', 'PC2'])

We can now visualize the cells in a 2-dimensional plot, where each point represents a cell. Cells with similar gene activity tend to lie closer together, although some detail is lost because the third component has been dropped.

In [None]:
fig, ax = plt.subplots()
ax.scatter(principalDf['PC1'], principalDf['PC2'])
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_title('PCA Plot')
plt.show()
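
The random data above have no built-in groups, so the plot shows a diffuse cloud. As a rough sketch of what PCA shows when groups do exist, the block below builds a hypothetical dataset (the two groups, their means, and the names `group_a`, `group_b`, and `labels` are assumptions made for illustration, not part of the original example) and checks how far apart the groups sit along PC1.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical setup: two groups of 50 cells whose mean activity differs
# by 3 units on every gene, with unit-variance noise inside each group.
group_a = rng.normal(loc=0.0, scale=1.0, size=(50, 3))
group_b = rng.normal(loc=3.0, scale=1.0, size=(50, 3))
cells = np.vstack([group_a, group_b])
labels = np.array([0] * 50 + [1] * 50)

# Same pipeline as in the notebook: standardize, then project onto 2 components.
scores = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(cells))

# Compare the group means along PC1 with the spread inside each group.
pc1_a = scores[labels == 0, 0]
pc1_b = scores[labels == 1, 0]
gap = abs(pc1_a.mean() - pc1_b.mean())
spread = max(pc1_a.std(), pc1_b.std())
print(f"gap between group means on PC1: {gap:.2f}")
print(f"largest within-group spread on PC1: {spread:.2f}")
```

A gap that is large relative to the within-group spread means the two groups form separate clusters along PC1. Passing `c=labels` to `ax.scatter` in a plot like the one above would color the points by group.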

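PCA reduces the dimensionality of our data by creating new axes (principal components), each of which is a linear combination of the original variables. The first principal component (PC1) points in the direction of greatest variation in the data. Each subsequent component is perpendicular to the previous ones and captures as much of the remaining variation as possible, so no component captures more variation than the one before it.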
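
To make the ordering of the components concrete, here is a minimal sketch that prints the share of variance captured by each component, using scikit-learn's `explained_variance_ratio_` attribute. It regenerates data of the same shape as the notebook's synthetic dataset so that it runs on its own; the variable names (`pca_full`, `ratios`, `manual_ratios`) are illustrative choices. The NumPy cross-check assumes the standard description of PCA as an eigendecomposition of the covariance matrix.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

np.random.seed(0)
data = np.random.rand(100, 3)  # 100 cells, 3 genes, as in the notebook
scaled = StandardScaler().fit_transform(data)

# Keep all three components so the ratios account for all of the variance.
pca_full = PCA(n_components=3).fit(scaled)
ratios = pca_full.explained_variance_ratio_

for i, r in enumerate(ratios, start=1):
    print(f"PC{i}: {r:.1%} of the variance")

# Cross-check: the eigenvalues of the covariance matrix, sorted from
# largest to smallest, give the same proportions.
eigenvalues = np.linalg.eigvalsh(np.cov(scaled, rowvar=False))[::-1]
manual_ratios = eigenvalues / eigenvalues.sum()
print(np.allclose(ratios, manual_ratios))
```

Because the genes here are independent random values, the three shares come out fairly close to one another; on data with correlated variables, PC1 would typically dominate.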

In this example, we reduced our 3-dimensional data to 2 dimensions, but PCA can be used to reduce data from any number of dimensions to any lower number of dimensions. This makes PCA a powerful tool for visualizing and understanding high-dimensional data.
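
As an illustration of reducing from more dimensions, the sketch below uses a hypothetical dataset of 200 cells and 10 genes (the hidden-factor construction and the names `hidden`, `loadings`, and `genes` are assumptions for this example only). Instead of fixing the number of components, it passes a fraction to `n_components`, which asks scikit-learn to keep just enough components to explain that share of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: 10 genes driven by 2 hidden factors plus a little noise,
# so most of the variation lives in far fewer than 10 dimensions.
hidden = rng.normal(size=(200, 2))
loadings = rng.normal(size=(2, 10))
genes = hidden @ loadings + 0.1 * rng.normal(size=(200, 10))

scaled = StandardScaler().fit_transform(genes)

# A float between 0 and 1 keeps the smallest number of components whose
# cumulative explained variance exceeds that fraction.
pca = PCA(n_components=0.95)
reduced = pca.fit_transform(scaled)

print("components kept:", pca.n_components_)
print("reduced shape:", reduced.shape)
print("cumulative variance explained:", pca.explained_variance_ratio_.cumsum())
```

Choosing the number of components by explained variance is one common heuristic; for plotting, fixing `n_components=2` as in the notebook is usually simpler.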

PCA is just one of many techniques for making sense of high-dimensional data. Related methods include heat maps, t-SNE plots, and multidimensional scaling (MDS) plots. These methods can be used in combination to provide different perspectives on the same data.
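
For comparison, here is a minimal sketch of two of these alternatives using scikit-learn's `sklearn.manifold` module. It regenerates data of the same shape as the notebook's synthetic dataset so that it runs on its own; the `perplexity` value and the fixed `random_state` are arbitrary choices made for reproducibility, not recommendations.

```python
import numpy as np
from sklearn.manifold import MDS, TSNE
from sklearn.preprocessing import StandardScaler

np.random.seed(0)
data = np.random.rand(100, 3)  # 100 cells, 3 genes, as in the notebook
scaled = StandardScaler().fit_transform(data)

# MDS looks for a 2-D layout that preserves pairwise distances between cells.
mds_coords = MDS(n_components=2, random_state=0).fit_transform(scaled)

# t-SNE looks for a 2-D layout that preserves local neighborhoods;
# perplexity must be smaller than the number of samples.
tsne_coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(scaled)

print(mds_coords.shape, tsne_coords.shape)
```

Either array can be plotted with the same `ax.scatter` call used for the PCA plot. Unlike principal components, the axes of MDS and t-SNE plots are not ordered by how much variation they capture.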
