# Curse of Dimensionality

* As the number of dimensions increases, data become sparse: the amount of data needed to maintain an accurate analysis grows exponentially with the dimension.


* Points that are close to each other in a low-dimensional space may be far apart in a higher-dimensional space, making predictions much less reliable. The more features the training set has, the greater the risk of overfitting it.
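
To make the sparsity claim concrete, the sketch below estimates the mean Euclidean distance between two points drawn uniformly from the unit hypercube $[0,1]^d$. The helper name `mean_pair_distance`, the sample size, and the chosen dimensions are illustrative assumptions, not part of the original notes.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_pair_distance(dim, n_pairs=2000):
    """Estimate the mean distance between two uniform random points in [0, 1]^dim."""
    a = rng.random((n_pairs, dim))
    b = rng.random((n_pairs, dim))
    return float(np.linalg.norm(a - b, axis=1).mean())

dims = [2, 10, 100, 1000]
mean_distances = {d: mean_pair_distance(d) for d in dims}

for d, dist in mean_distances.items():
    print(f"dim={d:>4}  mean distance ~ {dist:.2f}")
```

Each coordinate contributes $1/6$ to the expected squared distance, so the typical distance grows roughly like $\sqrt{d/6}$: random points drift apart as $d$ increases, even though every coordinate stays in $[0,1]$.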

## Principal Component Analysis (PCA)

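* PCA finds the axis that accounts for the largest amount of variance in the training dataset, then a second axis, orthogonal to the first, that accounts for the largest amount of remaining variance, and so on.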


* PCA assumes that the dataset is centered around the origin. 


* Given the training set matrix $X\in M_{m,n}$,

    1. compute the zero-centered matrix $X_0 = X - np.mean(X,0)$ (subtract each column's mean);  
    1. apply the SVD to $X_0$ and get $X_0 = U\Sigma V^T$, where 
    
        * $U\in M_{m,m}$ and $V\in M_{n,n}$ are unitary and
        * $\Sigma\in M_{m,n}$ is a rectangular diagonal matrix whose diagonal entries are nonnegative real numbers arranged in descending order.
    
    
* The columns of $V$ (equivalently, the rows of $V^T$) are called the __principal components__. The diagonal entries of $\Sigma$ are called the __singular values__.


* Note $X_0V = U\Sigma$. If we want to project the training instances from $\mathbb{R}^n$ onto $\mathbb{R}^k\;(k\leq n)$, we compute $U_k\Sigma_k$, where $U_k = U[:,:k]\in M_{m,k}$ and $\Sigma_k = \Sigma[:k,:k]\in M_{k,k}$. Equivalently, we can compute $X_0 V_k$, where $V_k = V[:,:k]\in M_{n,k}$.


* Conversely, to reconstruct a dataset ($\in M_{r,n}$) from $Y\in M_{r,k}$, we compute $Y V_k^T$ and then add $np.mean(X,0)$ to it.


* If $s_1,\ldots,s_r$ are all the singular values of $X_0$ (_i.e._ the diagonal entries of $\Sigma$, with $r=\min(m,n)$), then the __explained variance__ of the first $k$ principal components is $\frac{1}{m-1}[s_1^2,\ldots,s_k^2]$ and their __explained variance ratio__ is $\frac{1}{s_1^2+\cdots+s_r^2}[s_1^2,\ldots,s_k^2]$, where the denominator sums over all $r$ singular values, not only the first $k$.
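
The steps above can be sketched directly in NumPy. The toy data, the sizes `m`, `n`, `k`, and all variable names are assumptions made for illustration. The sketch uses the thin SVD (`full_matrices=False`), which drops the columns of $U$ that are multiplied by zero rows of $\Sigma$ and therefore yields the same projection.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 200, 5, 2
# Correlated toy data (an assumption, only for illustration)
X = rng.normal(size=(m, n)) @ rng.normal(size=(n, n))

# Step 1: center the data around the origin
mean = X.mean(axis=0)
X0 = X - mean

# Step 2: thin SVD; U is (m, r), s is (r,), Vt is (r, n), with r = min(m, n)
U, s, Vt = np.linalg.svd(X0, full_matrices=False)

# Projection onto the first k principal components, computed two ways
V_k = Vt[:k].T              # (n, k): first k columns of V
Y = X0 @ V_k                # X_0 V_k
Y_alt = U[:, :k] * s[:k]    # U_k Sigma_k, using broadcasting for the diagonal

# Reconstruction back in R^n (approximate when k < n)
X_rec = Y @ V_k.T + mean

# Explained variance and explained variance ratio for all components
explained_variance = s**2 / (m - 1)
explained_variance_ratio = s**2 / np.sum(s**2)

print(np.allclose(Y, Y_alt))
print(explained_variance_ratio[:k])
```

The first `k` entries of `explained_variance` coincide with the sample variances of the columns of `Y`, which is where the factor $\frac{1}{m-1}$ comes from.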


```python
from sklearn.decomposition import PCA
clf = PCA(n_components, svd_solver, ...)
```
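
A minimal usage sketch of the same workflow with scikit-learn follows. The toy data and the choices `n_components=2` and `svd_solver="full"` are assumptions for illustration; `PCA` centers the data itself, so no manual mean subtraction is needed.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Correlated toy data (an assumption, only for illustration)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

pca = PCA(n_components=2, svd_solver="full")
Y = pca.fit_transform(X)           # project onto the first 2 principal components
X_rec = pca.inverse_transform(Y)   # map back to the original 5-dimensional space

# components_ holds the principal components as rows: (n_components, n_features)
print(pca.components_.shape)
print(pca.explained_variance_ratio_)
```

Here `pca.mean_` plays the role of `np.mean(X,0)` and the rows of `pca.components_` correspond to the columns of $V_k$.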

Reference: scikit-learn.org


## Incremental PCA

This algorithm has constant memory complexity (on the order of `batch_size * n_features`), which makes it possible to use `np.memmap` files without loading the entire file into memory. It also accepts sparse input.


```python
from sklearn.decomposition import IncrementalPCA
clf = IncrementalPCA(n_components, ...)
```
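
The sketch below shows one way to feed data in mini-batches with `partial_fit`. The toy array, the number of batches, and `n_components=5` are assumptions; in practice each batch would typically come from disk (for example a slice of an `np.memmap` array) rather than from an in-memory array.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))   # stand-in for data too large to fit in memory

ipca = IncrementalPCA(n_components=5)

# Feed the data one mini-batch at a time; each batch must contain
# at least n_components samples.
for batch in np.array_split(X, 10):
    ipca.partial_fit(batch)

Y = ipca.transform(X)
print(Y.shape)
```

Calling `fit` with the `batch_size` parameter is an alternative when the whole array (or memmap) can be passed at once.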


## Randomized PCA

```python
from sklearn.decomposition import PCA
clf = PCA(n_components, svd_solver='randomized', ...)
```
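
The randomized solver computes an approximation of the first `n_components` principal components instead of a full SVD, which tends to be faster when `n_components` is much smaller than the number of features. The sketch below compares it with the full solver on low-rank toy data; the data, sizes, and `random_state` are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Rank-10 toy data with 100 features (an assumption, only for illustration)
X = rng.normal(size=(500, 10)) @ rng.normal(size=(10, 100))

pca_full = PCA(n_components=5, svd_solver="full").fit(X)
pca_rand = PCA(n_components=5, svd_solver="randomized", random_state=0).fit(X)

print(pca_full.explained_variance_)
print(pca_rand.explained_variance_)
```

On data like this, whose rank is small, the two solvers should agree closely; on general data the randomized result is only an approximation.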


## Kernel PCA

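Kernel PCA applies the kernel trick to PCA, which makes it possible to perform nonlinear projections for dimensionality reduction.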

```python
from sklearn.decomposition import KernelPCA
clf = KernelPCA(n_components, kernel,...)
```
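
As a minimal sketch, the example below applies an RBF kernel to two concentric circles, a dataset that no linear projection can untangle. The dataset and the values `kernel="rbf"` and `gamma=10` are assumptions chosen for illustration, not tuned settings.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# Two concentric circles: not linearly separable in the original 2-D space
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10)
Y = kpca.fit_transform(X)   # coordinates in the kernel-induced feature space

print(Y.shape)
```

Since kernel PCA is unsupervised, the kernel and its hyperparameters are usually chosen by evaluating a downstream task or a reconstruction error.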