# Principal Component Analysis

Principal component analysis (PCA) is a standard way to reduce the dimension $p$ of a dataset
 (which can be quite large) to something more manageable, given an $n\times p$ data matrix.

PCA tries to find “components” that capture the maximal variance within the data. 

For three-dimensional data, this is the basic image you may have come across:

<img src=img/pca_classic.png width="30%" height="30%">

## PCA in a nutshell

Each blue point represents one observation (a row of **X**). There are $n = 20$ observations, each with $p = 3$ features. PCA reduces the data from three dimensions to $r = 2$ by finding two new, perpendicular directions (red arrows) that capture as much of the data’s variance as possible. These directions define a two-dimensional plane (grey) that best represents the original 3D data.

## Mathematical formulation of PCA

We now formalize the geometric intuition of PCA.  
Let $\mathbf{X} \in \mathbb{R}^{n \times p}$ denote the data matrix, where each row corresponds to an observation and each column to a feature. Assume that each column of $\mathbf{X}$ has been mean-centered, so that the data are centered around the origin:

$$
\frac{1}{n} \sum_{i=1}^n X_{ij} = 0, \quad \forall j = 1, \dots, p.
$$
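The centering assumption can be sketched in NumPy as follows. This is a minimal illustration on synthetic data; the names `X_raw` and `X_centered` and the chosen means and scales are assumptions, not part of the original material.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: n = 20 observations, p = 3 features with non-zero means.
X_raw = rng.normal(loc=[5.0, -2.0, 10.0], scale=[1.0, 2.0, 0.5], size=(20, 3))

# Subtract each column's mean so that every feature averages to zero.
X_centered = X_raw - X_raw.mean(axis=0)

col_means = X_centered.mean(axis=0)
print(col_means)  # zero up to floating-point error
```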

The goal of PCA is to find a unit vector $c \in \mathbb{R}^p$ that defines a direction in feature space along which the projected data have maximal variance. Formally, this corresponds to the optimization problem:

$$
\max_{c} \; c^\top \mathbf{X}^\top \mathbf{X} c
$$

subject to the unit-norm constraint

$$
c^\top c = 1.
$$

Let $w = \mathbf{X}c$ denote the projection of the data onto this direction. Since $c^\top c = 1$ and the data are mean-centered, the variance of the projected data is

$$
\operatorname{Var}(w) = \frac{1}{n} w^\top w = \frac{1}{n} c^\top \mathbf{X}^\top \mathbf{X} c.
$$

Thus, maximizing the variance of $w$ is equivalent to maximizing $c^\top \mathbf{X}^\top \mathbf{X} c$.  
The optimal $c$ is therefore the eigenvector of $\mathbf{X}^\top \mathbf{X}$ corresponding to its largest eigenvalue.
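As a quick numerical sanity check (a sketch on synthetic data, with illustrative variable names), we can compare the variance along the top eigenvector of $\mathbf{X}^\top \mathbf{X}$ with the variance along many random unit directions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic centered data with very different spread along each axis.
X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.3])
X = X - X.mean(axis=0)
n = X.shape[0]

# eigh returns eigenvalues of a symmetric matrix in ascending order.
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
c_top = eigvecs[:, -1]

# Variance of the projection w = Xc (np.var divides by n by default).
var_top = (X @ c_top).var()

# Variance along 1000 random unit directions, one per column.
C_rand = rng.normal(size=(3, 1000))
C_rand /= np.linalg.norm(C_rand, axis=0)
var_rand = (X @ C_rand).var(axis=0)

print(var_top, var_rand.max())
```

No random direction should exceed the variance along `c_top`, and `var_top` should equal the largest eigenvalue divided by $n$.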

### Proof that the optimal $c$ is the eigenvector of $\mathbf{X}^\top \mathbf{X}$ with the largest eigenvalue

We start from the PCA optimization problem:

$$
\max_{c} \; c^\top \mathbf{X}^\top \mathbf{X} c
$$

subject to the constraint

$$
c^\top c = 1.
$$

#### Step 1: Form the Lagrangian

To handle the constraint, we introduce a Lagrange multiplier $\lambda$ and define the Lagrangian:

$$
\mathcal{L}(c, \lambda) = c^\top \mathbf{X}^\top \mathbf{X} c - \lambda (c^\top c - 1).
$$

#### Step 2: Take the derivative with respect to $c$

Setting the gradient of $\mathcal{L}$ with respect to $c$ to zero gives the first-order condition for optimality:

$$
\nabla_c \mathcal{L}(c, \lambda) = 2 \mathbf{X}^\top \mathbf{X} c - 2 \lambda c = 0.
$$

Simplifying, we obtain

$$
\mathbf{X}^\top \mathbf{X} c = \lambda c.
$$

This is the eigenvalue equation for $\mathbf{X}^\top \mathbf{X}$, showing that any optimal solution $c$ must be one of its eigenvectors.

#### Step 3: Determine which eigenvector maximizes the objective

Substituting $\mathbf{X}^\top \mathbf{X} c = \lambda c$ into the objective function yields:

$$
c^\top \mathbf{X}^\top \mathbf{X} c = c^\top (\lambda c) = \lambda (c^\top c) = \lambda.
$$

Since $c^\top c = 1$, the value of the objective function equals the corresponding eigenvalue $\lambda$.  
Therefore, to maximize the variance, we must select the eigenvector associated with the largest eigenvalue $\lambda_{\max}$.

#### Step 4: Interpretation

- The eigenvector $c_1$ corresponding to $\lambda_{\max}$ defines the **first principal component direction** — the axis along which the data variance is maximal.  
- The projected data (component scores) are given by $w = \mathbf{X} c_1$.  
- Subsequent principal components correspond to the remaining eigenvectors, ordered by decreasing eigenvalues, and are mutually orthogonal.
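The interpretation above can be checked numerically. The following sketch (synthetic data, illustrative names) sorts the eigenvectors by decreasing eigenvalue and verifies that the directions are mutually orthogonal and that the resulting scores are uncorrelated:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic centered data with correlated features.
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))
X = X - X.mean(axis=0)

eigvals, eigvecs = np.linalg.eigh(X.T @ X)

# Reorder so that the first column is the first principal component direction.
order = np.argsort(eigvals)[::-1]
eigvals, C = eigvals[order], eigvecs[:, order]

gram = C.T @ C        # identity matrix if the directions are orthonormal
W = X @ C             # scores for all components, one column per component
score_cov = W.T @ W   # diagonal, with the eigenvalues on the diagonal

print(np.round(gram, 6))
```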


There are several equivalent ways to solve the PCA optimization problem and obtain $c$ and $w$.  
The classical approach is to compute the **eigendecomposition** of the scatter matrix (the sample covariance matrix up to a factor of $1/n$):

$$
\mathbf{S} = \mathbf{X}^\top \mathbf{X},
$$

where $\mathbf{S} \in \mathbb{R}^{p \times p}$.  
We then solve the eigenvalue problem

$$
\mathbf{S} c = \lambda c,
$$

and select $c$ as the eigenvector corresponding to the largest eigenvalue $\lambda_{\max}$.  
This eigenvector defines the direction of maximal variance, and the associated eigenvalue, divided by $n$, is the variance captured along that direction.  

In practice, the eigendecomposition of $\mathbf{S}$ is often computed via the **singular value decomposition (SVD)** of $\mathbf{X}$:

$$
\mathbf{X} = \mathbf{U} \mathbf{\Sigma} \mathbf{V}^\top,
$$

where the columns of $\mathbf{V}$ are the eigenvectors of $\mathbf{X}^\top \mathbf{X}$, and the squared singular values in $\mathbf{\Sigma}^2$ correspond to the eigenvalues.  
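The correspondence between the two decompositions can be sketched as follows (synthetic data; variable names are illustrative). Eigenvectors are only defined up to sign, so the comparison uses absolute inner products:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 5))
X = X - X.mean(axis=0)

# Thin SVD: singular values come back in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Eigendecomposition of X^T X, flipped to descending order.
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

# |v_k . c_k| should be 1 for every k if the columns match up to sign.
alignment = np.abs(np.sum(Vt.T * eigvecs, axis=0))

print(s**2)
print(eigvals)
```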

However, this matrix-based formulation assumes that $\mathbf{X}$ is a fully observed two-dimensional array.  
In more complex scenarios—such as when the data are represented as **tensors** (multi-way arrays), contain **missing entries**, or possess additional structural constraints—this classical eigenvalue/SVD approach no longer applies directly.  
In those cases, generalized formulations of PCA (e.g., tensor PCA, probabilistic PCA, or low-rank matrix completion) are required to handle the additional complexity.

Once the optimization problem has been solved—by eigendecomposition, SVD, or another suitable method—we obtain the vectors $c$ (the principal component direction) and $w = \mathbf{X}c$ (the corresponding component scores).  
The best **rank-one approximation** of the data matrix can then be written as the outer product of these two vectors:

$$
\mathbf{X} \approx w c^\top.
$$

This approximation minimizes the reconstruction error (in the least-squares sense) among all rank-one matrices.  
The matrix $w c^\top$ has rank one because it is formed by the outer product of two vectors.  
Geometrically, this reconstruction represents the projection of the data onto the one-dimensional subspace spanned by $c$.

<img src="img/rank_one.png" width="50%" height="50%">

**Example: Rank-one reconstruction using the first principal component.**  
An example data matrix (left) with $n = 12$ observations and $p = 8$ features is approximated by the outer product $w c^\top$ (middle), producing a rank-one matrix (right).  
Here, $w$ contains the **component scores** (one weight per observation), and $c^\top$ represents the **principal component direction** in feature space.
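A minimal sketch of the rank-one reconstruction, assuming a synthetic $12 \times 8$ matrix (the data and variable names are illustrative). It also checks that the squared reconstruction error equals the sum of the discarded squared singular values, and that a random direction does no better:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(12, 8))
X = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
c = Vt[0]              # first principal component direction (length p)
w = X @ c              # component scores (length n)
X1 = np.outer(w, c)    # rank-one approximation w c^T

err = np.linalg.norm(X - X1, "fro") ** 2
expected_err = np.sum(s[1:] ** 2)

# Rank-one projection onto a random unit direction, for comparison.
d = rng.normal(size=8)
d /= np.linalg.norm(d)
err_random = np.linalg.norm(X - np.outer(X @ d, d), "fro") ** 2

print(err, expected_err, err_random)
```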


Most data can’t be well-described by a single principal component. Typically, we compute multiple principal components by computing all eigenvectors of $\mathbf{X}^\top\mathbf{X}$ and ranking them by decreasing eigenvalue. This ranking can be visualized with a scree plot, which shows the variance explained by each successive principal component. A common heuristic is to look for the “knee” (the point where the curve flattens out) and keep the components before it, treating the remaining ones as mostly noise.

<img src=img/scree.png width="40%" height="40%">

**Scree plot.** Principal components are ranked by the amount of variance they capture in the original dataset; a scree plot gives some sense of how many components are needed.
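The values shown in a scree plot can be computed directly from the singular values. The sketch below uses synthetic data built as a rank-3 signal plus small noise (an assumption made purely for illustration), so the variance per component should drop sharply after the third component:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, true_rank = 200, 10, 3

# Hypothetical data: low-rank signal plus a small amount of noise.
signal = 3.0 * (rng.normal(size=(n, true_rank)) @ rng.normal(size=(true_rank, p)))
X = signal + 0.1 * rng.normal(size=(n, p))
X = X - X.mean(axis=0)

s = np.linalg.svd(X, compute_uv=False)   # singular values, descending
variances = s**2 / n                      # variance captured by each component
ratios = variances / variances.sum()      # fraction of total variance

for k, ratio in enumerate(ratios, start=1):
    print(f"PC{k}: {ratio:.4f}")
```

Plotting `ratios` against the component index with `plt.plot` would give the scree plot itself.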

We can organize the top $r$ principal components into a matrix $C = [c_1, c_2, \dots, c_r]$ and the corresponding component scores into $W = [w_1, w_2, \dots, w_r]$, where $w_k = \mathbf{X} c_k$. Our reconstruction of the data is now a sum of $r$ outer products:

$$
\mathbf{X} \approx \sum_{k=1}^r w_k c_k^\top \quad \text{or} \quad \mathbf{X} \approx W C^\top
$$

<img src=img/pca_3.png width="50%" height="50%">

**Example reconstruction of data with 3 principal components.** A data matrix (left) is approximated by the product of an $n\times r$ matrix and an $r\times p$ matrix (i.e. $WC^T$). This product has rank at most $r$ (in this example, $r=3$). Each paired column of $W$ and row of $C^T$ forms an outer product, so the full reconstruction can also be thought of as a sum of $r$ rank-one matrices.
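The equivalence between the matrix product and the sum of outer products can be sketched as follows (synthetic data, illustrative names, $r = 3$ as in the figure):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(12, 8))
X = X - X.mean(axis=0)
r = 3

_, s, Vt = np.linalg.svd(X, full_matrices=False)
C = Vt[:r].T    # p x r matrix of principal component directions
W = X @ C       # n x r matrix of component scores

# Two equivalent ways to build the rank-r reconstruction.
X_hat = W @ C.T
X_sum = sum(np.outer(W[:, k], C[:, k]) for k in range(r))

err = np.linalg.norm(X - X_hat, "fro") ** 2
print(np.linalg.matrix_rank(X_hat), err)
```

The squared error again equals the sum of the squared singular values that were left out.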

## Python implementation

### Uses of PCA

- **Identify relationships between variables:** PCA reveals correlations and dependencies between features in the data.  
- **Data interpretation and visualization:** By reducing dimensionality, PCA helps visualize complex datasets in 2D or 3D spaces.  
- **Dimensionality reduction:** Reducing the number of variables simplifies further analysis and computation.  
- **Genetic and population studies:** PCA is widely used to visualize genetic distances and relatedness among populations.

### Objectives of PCA

- **Dimension reduction:** PCA transforms a large number of variables into a smaller set of uncorrelated principal components, capturing most of the variance.  
- **Pattern identification:** PCA uncovers hidden structures or relationships in the data that may not be apparent in the original feature space.  
- **Feature extraction:** PCA creates new features (principal components) that are often more informative than the original variables.  
- **Data compression:** By keeping only the top principal components, PCA reduces data size while preserving as much information as possible.  
- **Noise reduction:** Components with small variance often correspond to noise; removing them improves signal quality.  
- **Visualization of high-dimensional data:** PCA enables projection of high-dimensional datasets onto lower dimensions for easier interpretation and insight.


In [None]:
## Step 1: Import the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [None]:
# importing or loading the dataset
dataset = pd.read_csv('wine.csv')
 
# split the dataset into the feature matrix X and the label vector y
X = dataset.iloc[:, 0:13].values
y = dataset.iloc[:, 13].values

In [None]:
# Splitting the X and Y into the
# Training set and Testing set
from sklearn.model_selection import train_test_split
 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

In [None]:
# performing preprocessing part
from sklearn.preprocessing import StandardScaler


sc = StandardScaler()
 
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [None]:
# Applying PCA function on training
# and testing set of X component
from sklearn.decomposition import PCA
 
pca = PCA(n_components = 2)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
 
explained_variance = pca.explained_variance_ratio_

In [None]:
explained_variance

In [None]:
sum(explained_variance)

### Explained Variance in PCA

The **explained variance** quantifies how much of the total information (variance) in the dataset is captured by each principal component. This is important because reducing the dimensionality of the data inevitably discards some variance.  

For example, if we reduce a four-dimensional dataset to two principal components, we lose a portion of the total variance. The **explained variance ratio** of a principal component is defined as:

$$
\text{explained\_variance\_ratio\_} = \frac{\text{variance of a principal component}}{\text{total variance of the data}}.
$$

Using this measure, we can interpret the contribution of each component. For instance:  

- The **first principal component** may capture $36.88\%$ of the total variance.  
- The **second principal component** may capture $19.32\%$ of the total variance.  

Together, the first two components account for $36.88\% + 19.32\% = 56.2\%$ of the total variance.  

Thus, by examining the explained variance ratio, we can decide how many components are needed to retain a sufficient amount of information while reducing dimensionality.
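One way to turn this into a rule is to keep the smallest number of components whose cumulative explained variance reaches a chosen threshold. The helper below is a hypothetical sketch (the function name and the example ratios are made up for illustration and are not the wine dataset's values):

```python
import numpy as np

def components_for_threshold(ratios, threshold):
    """Smallest number of leading components whose cumulative ratio >= threshold."""
    cumulative = np.cumsum(ratios)
    # Index of the first cumulative value that reaches the threshold.
    k = int(np.searchsorted(cumulative, threshold)) + 1
    # If the threshold is never reached, fall back to keeping every component.
    return min(k, len(cumulative))

# Hypothetical explained variance ratios, sorted in decreasing order.
example_ratios = [0.5, 0.3, 0.15, 0.05]
n_keep = components_for_threshold(example_ratios, 0.9)
print(n_keep)
```

With scikit-learn, the same function could be applied to `pca.explained_variance_ratio_` after fitting `PCA` with all components.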


In [None]:
# Fitting Logistic Regression To the training set
from sklearn.linear_model import LogisticRegression 
 
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)

In [None]:
# Predicting the test set result using
# predict function under LogisticRegression
y_pred = classifier.predict(X_test)

In [None]:
# making confusion matrix between
#  test set of Y and predicted value.
from sklearn.metrics import confusion_matrix
 
cm = confusion_matrix(y_test, y_pred)
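To make the structure of a confusion matrix concrete, here is a NumPy-only sketch on toy labels (the label arrays are invented for illustration and are unrelated to the wine data). Rows index the true class and columns the predicted class, and the accuracy is the diagonal sum divided by the total count:

```python
import numpy as np

# Hypothetical true and predicted labels for three classes (0, 1, 2).
y_true_toy = np.array([0, 0, 1, 1, 2, 2, 2, 1])
y_pred_toy = np.array([0, 1, 1, 1, 2, 2, 0, 1])
n_classes = 3

# cm_toy[i, j] counts observations of true class i predicted as class j.
cm_toy = np.zeros((n_classes, n_classes), dtype=int)
np.add.at(cm_toy, (y_true_toy, y_pred_toy), 1)

# Correct predictions lie on the diagonal.
accuracy = np.trace(cm_toy) / cm_toy.sum()
print(cm_toy)
print(accuracy)
```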

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

# Set data
X_set, y_set = X_train, y_train

# Create grid
X1, X2 = np.meshgrid(
    np.arange(X_set[:, 0].min() - 1, X_set[:, 0].max() + 1, 0.01),
    np.arange(X_set[:, 1].min() - 1, X_set[:, 1].max() + 1, 0.01)
)

# Plot decision boundary
plt.contourf(
    X1, X2,
    classifier.predict(np.c_[X1.ravel(), X2.ravel()]).reshape(X1.shape),
    alpha=0.75,
    cmap=ListedColormap(('yellow', 'white', 'aquamarine'))
)

plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())

# Scatter plot for each class
colors = ['red', 'green', 'blue']
for i, label in enumerate(np.unique(y_set)):
    plt.scatter(
        X_set[y_set == label, 0],
        X_set[y_set == label, 1],
        color=colors[i],
        label=label
    )

# Labels and title
plt.title('Logistic Regression (Training set)')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend()
plt.show()


In [None]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

# Set data
X_set, y_set = X_test, y_test

# Create grid
X1, X2 = np.meshgrid(
    np.arange(X_set[:, 0].min() - 1, X_set[:, 0].max() + 1, 0.01),
    np.arange(X_set[:, 1].min() - 1, X_set[:, 1].max() + 1, 0.01)
)

# Plot decision boundary
plt.contourf(
    X1, X2,
    classifier.predict(np.c_[X1.ravel(), X2.ravel()]).reshape(X1.shape),
    alpha=0.75,
    cmap=ListedColormap(('yellow', 'white', 'aquamarine'))
)

plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())

# Scatter plot for each class
colors = ['red', 'green', 'blue']
for i, label in enumerate(np.unique(y_set)):
    plt.scatter(
        X_set[y_set == label, 0],
        X_set[y_set == label, 1],
        color=colors[i],
        label=label
    )

# Labels and title
plt.title('Logistic Regression (Test set)')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend()
plt.show()

This is a simple example of how to perform PCA in Python. The code reports the explained variance ratio of the first two principal components and plots the training and test observations in the PC1-PC2 plane, together with the decision regions of a logistic regression classifier fitted on those two components. By selecting an appropriate number of principal components, we can reduce the dimensionality of the dataset while retaining most of its variance.