## PCA

* Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables (entities each of which takes on various numerical values) into a set of values of linearly uncorrelated variables called principal components.

*  PCA transforms data linearly into new properties that are not correlated with each other.

* What is the difference between SVD and PCA? 
    * SVD gives you the whole nine yards: it factorizes a matrix into special matrices that are easy to manipulate and analyze.
    
    * It lays the foundation for untangling data into independent components.
    * PCA skips the less significant components. We can therefore use SVD to compute PCA by truncating the less important basis vectors in the original SVD matrices.

* Identify the hyperplane that lies closest to the data (the line along which the projected points have the largest sum of squared distances from the origin, i.e. the largest variance)

* Project the data onto that hyperplane
* The subsequent principal axes are orthogonal to each other

* The objective is to preserve the maximum amount of variance -> less loss of information than other projections.
* Equivalently, it is the projection onto the axis with the lowest mean squared error (MSE) between the original points and their projections
* Centering
    * PCA requires the data to be zero-centered around the origin
    * Scikit-learn's `PCA` centers the data by default
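
The link between "maximum variance" and "lowest MSE" can be checked numerically. The sketch below is only an illustration on synthetic 2-D data (the data, the seed, and the helper name `projection_stats` are assumptions, not part of the notes). It centers the data, takes the first principal direction from an SVD, and compares it with a projection onto the plain x axis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic correlated 2-D data (assumed purely for illustration)
x = rng.normal(size=500)
X = np.column_stack([x, 0.5 * x + 0.1 * rng.normal(size=500)])

# Centering: PCA works on zero-mean data
Xc = X - X.mean(axis=0)

# Rows of Vt are the principal directions of the centered data
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Vt[0]


def projection_stats(direction):
    """Return (variance of projected coordinates, mean squared reconstruction error)."""
    d = direction / np.linalg.norm(direction)
    coords = Xc @ d                 # 1-D coordinates along the direction
    recon = np.outer(coords, d)     # points mapped back into 2-D
    mse = np.mean(np.sum((Xc - recon) ** 2, axis=1))
    return coords.var(), mse


var_pc1, mse_pc1 = projection_stats(pc1)
var_x, mse_x = projection_stats(np.array([1.0, 0.0]))

total_var = Xc.var(axis=0).sum()
print("PC1   : variance", var_pc1, "MSE", mse_pc1)
print("x axis: variance", var_x, "MSE", mse_x)
```

For any unit direction, retained variance plus reconstruction MSE equals the total variance, so maximising one is the same as minimising the other.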

## PCA Basics and prerequisites:

*  Reference : https://www.youtube.com/watch?v=g-Hb26agBFg

 * Variance & Covariance:
    * Variance is how spread out the data is.

    * Why is covariance required when there are 2 or more variables?
      * We can find the x-variance and the y-variance. Is that not sufficient?

      * No: the x and y variances are the same for a positively sloped and a negatively sloped relationship. Variance is the squared distance from the mean, so positive and negative distances give equal squares.

      * Covariance fixes this by using the sum of products of the coordinates. Points from positively sloped data mostly have positive products, so the sum is positive; the opposite holds for negatively sloped data.

      * $ cov(x, y) = \frac{1}{n}\sum (x - \mu_x) (y - \mu_y) $. Once the data is centered so that the means are zero, this is simply the average of the products of the coordinates.

      * For random data points, the positive and negative products of the coordinates cancel in the sum, so the covariance is close to zero. In other words, one variable does not vary systematically with the other.

      * Covariance measures how much the two variables tend to increase together. If cov is + -> + slope, i.e. the coordinates mostly share the same sign ((x, y) is either (+, +) or (-, -))
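
A small numeric check of the point above, as a sketch on made-up data: two datasets with opposite slopes have identical variances, and only the covariance tells them apart. The helper `cov` is a hypothetical name that follows the $\frac{1}{n}$ formula from the notes.

```python
import numpy as np


def cov(x, y):
    """Population covariance (1/n convention) of two 1-D arrays."""
    return np.mean((x - x.mean()) * (y - y.mean()))


x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y_up = 0.5 * x      # positively sloped relationship
y_down = -0.5 * x   # negatively sloped relationship

# The variances cannot distinguish the two cases ...
var_up, var_down = y_up.var(), y_down.var()

# ... but the sign of the covariance can
cov_up, cov_down = cov(x, y_up), cov(x, y_down)

print(var_up, var_down)   # 0.5 0.5
print(cov_up, cov_down)   # 1.0 -1.0
```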

  * Center data to zero before starting PCA

  * Create covariance matrix $\Sigma$:
      * $ \Sigma = \begin{pmatrix} cov(x, x) & cov(x, y) \\ cov(x, y) & cov(y, y) \end{pmatrix} $

      * where $ cov(x, x) $ is the variance of x, i.e. the covariance of x with itself.

      * Regardless of the number of data points, the matrix is always square (one row and one column per variable). It is also always symmetric, so its eigenvectors can be chosen to be orthogonal.

      * The covariance matrix can be used as a linear transformation: the new coordinates are obtained by multiplying the old coordinates by the covariance matrix.

      * With the covariance matrix (9, 4, 4, 3): (x, y) -> (9x + 4y, 4x + 3y) {ex: the point (1, 0) in the old coordinates is transformed to (9, 4)}
        * When you multiply any vector by the covariance matrix, the vector turns towards the direction of highest variance.
        * In the above example, the (1, 0) vector is stretched more along the x direction than along y because the x variance > the y variance.
        
        * If you keep multiplying again and again, the slope of the vector converges to the direction of highest variance (the leading eigenvector), while its magnitude keeps increasing (link: https://www.youtube.com/watch?v=jydawdlGLmo)

      * Points transformed by the covariance matrix are stretched along its eigenvectors.

      * $ \Sigma x = \lambda x $

      * Solving eigen Value problem:

        * ![title](Images\Proj_3.PNG)

        * For a particular covariance matrix (2, 0.8, 0.8, 0.6) -> the eigenvalues are approximately 2.36 and 0.24

        * The eigenvectors are found by substituting each eigenvalue into $ (\Sigma - \lambda I) e = 0 $. For the larger eigenvalue this leads to e11 = 2.2 e12.

        * Many vectors satisfy the above equation. Choose the simplest one and take its unit vector -> (2.2, 1) -> (0.91, 0.41), because eigenvectors are conventionally reported as unit vectors
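
The hand calculation for the (2, 0.8, 0.8, 0.6) matrix can be cross-checked with NumPy. This is a minimal sketch using `np.linalg.eigh`, which is meant for symmetric matrices and returns eigenvalues in ascending order; the sign flip at the end is only a display convention chosen here.

```python
import numpy as np

Sigma = np.array([[2.0, 0.8],
                  [0.8, 0.6]])

# eigh is intended for symmetric matrices; eigenvalues come back ascending
eigvals, eigvecs = np.linalg.eigh(Sigma)

# Reorder so the largest eigenvalue comes first
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
eigvecs = eigvecs[:, order]

# Eigenvectors are only defined up to sign; make the first entry positive
v1 = eigvecs[:, 0]
if v1[0] < 0:
    v1 = -v1

ratio = v1[0] / v1[1]   # corresponds to e11 / e12 in the notes

print(eigvals)  # approximately [2.363, 0.237]
print(v1)       # approximately [0.91, 0.41]
print(ratio)    # approximately 2.2
```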

  * Why $ \Sigma $ ?
      * The covariance matrix tells us along which directions the data is spread and by how much.

      * In the example above (9, 4, 4, 3): the data is spread most along the x axis (variance 9) and next along the y axis (variance 3)

* Create the covariance matrix $\Sigma$, solve the eigenvalue problem, and keep only the required eigenvalues and eigenvectors for dimensionality reduction
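
The "keep multiplying by the covariance matrix" idea from this section is known as power iteration. Below is a rough sketch using the (9, 4, 4, 3) example matrix. The vector is renormalised at every step (an addition made here so that the magnitude does not blow up), and the iteration count of 50 is an arbitrary choice.

```python
import numpy as np

Sigma = np.array([[9.0, 4.0],
                  [4.0, 3.0]])

v = np.array([1.0, 0.0])
first_step = Sigma @ v          # (1, 0) -> (9, 4), as in the notes

# Power iteration: multiply repeatedly, renormalising each time
for _ in range(50):
    v = Sigma @ v
    v = v / np.linalg.norm(v)

# The Rayleigh quotient estimates the corresponding eigenvalue
lam = v @ Sigma @ v

# Reference answer from a direct eigen-decomposition (largest eigenvalue is last)
ref_vals, ref_vecs = np.linalg.eigh(Sigma)
top = ref_vecs[:, -1]

print(v)    # approximately [0.894, 0.447], i.e. (2, 1) / sqrt(5)
print(lam)  # approximately 11.0
```

The direction settles quickly here because the two eigenvalues (11 and 1) are far apart; the closer they are, the slower the convergence.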


## PCA - Code Emporium

* PCA is linear dimensionality reduction
    * $ x \in R^D, z \in R^M $ where M << D

    * $ z = U^T x $ where $ U \in R^{D \times M} $

    * z is our reduced dimension output from higher dimension input data
    
* PCA tries to maintain information during transformation
    * Information is measured by the (co)variance of the data
    
    * $ S_z = \frac{1}{N} Z^T Z $, where $ Z = XU $ holds the N projected points as rows and $ S_z \in R^{M \times M} $

    * Objective: reduce the dimension while maximising the retained variance
    
    * $ max \ S_z = max \ \frac{1}{N}(XU)^T (XU) $
    * $ max \ \frac{1}{N}U^T X^T XU = max \ U^TS_xU $ subject to $ U^T U = I $
    * This is an optimization problem with an equality constraint -> Lagrangian

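    * Optimization with the Lagrangian: 
        * $ L(U, \lambda) = U^TS_xU + \lambda (I - U^T U) $

        * Solve this by taking the gradient with respect to U and setting it to zero: $ \frac{\partial L}{\partial U} = 2 S_x U - 2 \lambda U = 0 $ (using the symmetry of $ S_x $)

        * We get $ S_x U = \lambda U $ -> Eigen Value Problem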

        * Eigen Decomposition : $ S_x = W \Lambda W^{-1} $, where W contains eigen vectors, $\Lambda$ contains eigen values
    
    * Variance Explained (eigenvalues sorted in decreasing order): 
        * Total Variance = $ \sum_{i=1}^{n} \lambda_i $
        * Retained Variance (after reduction) = $ \sum_{i=1}^{d} \lambda_i $
        
        * Percentage Variance Explained = $ \frac{\sum_{i=1}^{d} \lambda_i}{\sum_{i=1}^{n} \lambda_i} $
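
The steps in this section can be put together as a short sketch. The function name `pca_eig` and the random test data are assumptions made for illustration; the $\frac{1}{N}$ covariance convention follows the notes.

```python
import numpy as np


def pca_eig(X, M):
    """PCA via eigen-decomposition of the covariance matrix S_x.

    X: (N, D) data matrix, M: number of components to keep.
    Returns projected data Z, basis U, sorted eigenvalues, explained-variance ratio.
    """
    N = X.shape[0]
    Xc = X - X.mean(axis=0)             # center the data
    S_x = (Xc.T @ Xc) / N               # (D, D) covariance matrix

    eigvals, W = np.linalg.eigh(S_x)    # ascending order for symmetric input
    order = np.argsort(eigvals)[::-1]   # sort descending
    eigvals, W = eigvals[order], W[:, order]

    U = W[:, :M]                        # (D, M) top-M eigenvectors
    Z = Xc @ U                          # rows are z = U^T x
    explained = eigvals[:M].sum() / eigvals.sum()
    return Z, U, eigvals, explained


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

Z, U, eigvals, explained = pca_eig(X, M=2)
S_z = (Z.T @ Z) / X.shape[0]            # covariance of the reduced data

print(Z.shape)      # (200, 2)
print(explained)    # fraction of total variance retained
```

`S_z` comes out diagonal with the top eigenvalues on the diagonal, which is another way of saying that the principal components are uncorrelated.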


## Summary of PCA:

* PCA can be computed through 2 approaches.
    * Eigen Decomposition

    * SVD 

* Eigen Decomposition:
    * Done through the covariance matrix 
        * Why the covariance matrix? Because it is symmetric, it contains information about the variances in the data, and its eigenvectors can be chosen orthonormal (perpendicular, with unit length)
    * $ C = W \Lambda W^{-1} $
    
    * $\Lambda$ contains the eigenvalues, W contains the eigenvectors
    * Reduced Projection : $ X_k = X \ W_k $

    * In general this decomposition is valid only if W inverse exists (i.e. the matrix has linearly independent eigenvectors); a symmetric covariance matrix always satisfies this, with $ W^{-1} = W^T $
    
    * SVD gives a decomposition for all matrices without any such constraint

* SVD :
    * Decompose any matrix as below
    $$ X_{m,n} = U_{m,m} \ \Sigma_{m,n} \ V_{n,n}^T $$

    * Compared to Eigen Decomposition, this can work on non-square matrices. 
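
The two routes summarised above can be compared side by side. This is a sketch on random data (assumed for illustration), using the $\frac{1}{N}$ covariance convention from the earlier section. The eigenvalues of $C$ should equal the squared singular values of the centered data divided by $N$, and the projections should agree up to the sign of each column.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
Xc = X - X.mean(axis=0)
N = Xc.shape[0]
k = 2

# Route 1: eigen-decomposition of the covariance matrix
C = (Xc.T @ Xc) / N
eigvals, W = np.linalg.eigh(C)
eigvals, W = eigvals[::-1], W[:, ::-1]   # descending order
X_k_eig = Xc @ W[:, :k]

# Route 2: SVD of the centered data matrix itself
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
X_k_svd = U[:, :k] * s[:k]               # same as Xc @ Vt[:k].T

print(eigvals)
print(s ** 2 / N)
```

The SVD route never forms $X^T X$ explicitly, which is generally kinder numerically; that is a common reason libraries prefer it.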
    

## SVD :

* Decompose any matrix as below
    $$ X_{m,n} = U_{m,m} \ \Sigma_{m,n} \ V_{n,n}^T $$

* Compared to Eigen Decomposition, this can work on non-square matrices. 

* Example : 
$$X = \begin{pmatrix} 3 & 2 & 2 \\ 2 & 3 & -2 \end{pmatrix}$$

$$X = USV^T = \begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ 1/\sqrt{2} & -1/\sqrt{2} \end{pmatrix}  \begin{pmatrix} 5 & 0 & 0 \\ 0 & 3 & 0 \end{pmatrix}  \begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} & 0 \\ 1/\sqrt{18} & -1/\sqrt{18} & 4/\sqrt{18} \\ 2/3 & -2/3 & -1/3 \end{pmatrix} $$ 

The singular values are 5 and 3 (the square roots of the eigenvalues 25 and 9 of $XX^T$). The right singular vectors are the columns of V (the matrix shown above is $V^T$)

* Obtaining U & V from $XX^T$ and $X^TX$ 

First for U Matrix:
$$XX^T = \begin{pmatrix} 17 & 8 \\ 8 & 17 \end{pmatrix}$$
$ \lambda = 25, 9 $
$$ u_1, u_2 = \begin{pmatrix} 1/\sqrt 2 \\ 1/\sqrt 2 \end{pmatrix}, \begin{pmatrix} 1/\sqrt 2 \\ -1/\sqrt 2 \end{pmatrix} $$

Second for V Matrix:
$$X^TX = \begin{pmatrix} 13 & 12 & 2 \\ 12 & 13 & -2 \\ 2 & -2 & 8\end{pmatrix}$$
$ \lambda = 25, 9, 0 $
$$ v_1, v_2, v_3 = \begin{pmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \\ 0 \end{pmatrix}, \begin{pmatrix} 1/\sqrt{18} \\ -1/\sqrt{18} \\ 4/\sqrt{18} \end{pmatrix}, \begin{pmatrix} 2/3 \\ -2/3 \\ -1/3 \end{pmatrix} $$
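
The worked example can be verified with NumPy, as a quick sketch. Note that `np.linalg.svd` may return singular vectors with signs flipped relative to the hand calculation, so the check is done on the reconstruction and on the singular values rather than on individual vectors.

```python
import numpy as np

X = np.array([[3.0, 2.0, 2.0],
              [2.0, 3.0, -2.0]])

# Full SVD: U is 2x2, s holds the singular values, Vt is 3x3
U, s, Vt = np.linalg.svd(X)

# Rebuild the 2x3 "diagonal" matrix from the singular values
S = np.zeros((2, 3))
S[:2, :2] = np.diag(s)
X_rebuilt = U @ S @ Vt

# Eigenvalues of X X^T and X^T X (descending) are the squared singular values
eig_left = np.linalg.eigvalsh(X @ X.T)[::-1]
eig_right = np.linalg.eigvalsh(X.T @ X)[::-1]

print(s)          # approximately [5, 3]
print(eig_left)   # approximately [25, 9]
print(eig_right)  # approximately [25, 9, 0]
```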




## Understanding SVD

* SVD is analogous to factorizing algebraic expressions, while PCA is analogous to approximating a factorized expression by keeping the ‘biggest’ terms, and dropping all ‘smaller’ terms.

* SVD stands for Singular Value Decomposition - in school algebra, values are constant numbers, and decomposition is called factorizing. So, we will take an algebraic expression and factorize it, based on some constant factors.

* Consider the algebraic expression: $ 90x^6+150x^5+180x^4+90x^3+30x^2 $

* We will factorize this in a strange way:

    * We will have exactly three factors in a specific order: L (for left), C (for center) and R (for right).

    * L and R will contain only x factors, while C will only contain constant factors (C also stands for constant).

We see that our expression can be factorized as:

$$ (3x^3+2x^2+x) \cdot (30) \cdot (x^3+x^2+x) $$

Where:

$$ L=3x^3+2x^2+x $$

$$ C=30 $$

$$ R=x^3+x^2+x $$

* Factorizing C again (largest to smallest): $ 90x^6+150x^5+180x^4+90x^3+30x^2= (3x^3+2x^2+x) \cdot (5 \cdot 3 \cdot 2) \cdot (x^3+x^2+x) $


* A piece of terminology: we will call each of L, C and R a factor, and each term within a factor a component.


* What we did so far is analogous to SVD in linear algebra. We will continue from this factorization to perform the analogue of PCA.
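
The factorization used in the analogy can be checked by multiplying the factors back together. In this sketch the polynomials are stored as coefficient arrays (highest power first), and `np.convolve` is used because convolving coefficient arrays is the same as multiplying polynomials.

```python
import numpy as np

# Coefficients, highest power first
L = np.array([3, 2, 1, 0])   # 3x^3 + 2x^2 + x
C = 5 * 3 * 2                # constant factor, 30
R = np.array([1, 1, 1, 0])   # x^3 + x^2 + x

# Multiply the three factors back together
product = C * np.convolve(L, R)

# 90x^6 + 150x^5 + 180x^4 + 90x^3 + 30x^2
expected = np.array([90, 150, 180, 90, 30, 0, 0])

print(product)  # [ 90 150 180  90  30   0   0]
```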

### PCA with SVD Intuition

* Now, PCA stands for Principal Component Analysis - we want the principal components based on our factorization. In the school algebra analogy, we want the most important (principal) components in the algebraic expression.

* Let us say that we are told that x is very large. We also want the first two principal components.

* We will then keep the largest two components in each factor, and drop the rest (the third).

* Drop the third component from each factor -> an approximation of the original expression.

* We are now only interested in the components (powers of x and constant numbers), and we can see that the third factor (R) is completely determined by the first two factors (L and C) and the original expression. Hence, we can focus on getting the x components from L and the constant components from C, and **ignore R**. This is similar to retaining only the uncorrelated factors in the expression

* So after ignoring R, and keeping the first two components of L and C, we have:
    $$ (3x^3+2x^2) \cdot (5 \cdot 3) $$

* And we can say that $ L \cdot C $ determines an approximation of the original expression with the first two principal components.

* In some ‘nice’ cases, the L components -> eigenvectors, and C components -> eigenvalues. The C components are always called singular values.

* The expression in school algebra is analogous to a matrix in linear algebra. A matrix can be ‘factorized’ into ‘factors’, each of which contain ‘components’ (polynomial or numeric terms within each factor).
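
In matrix terms, "keeping the biggest components" means keeping the largest singular values and their vectors. A minimal sketch, reusing the 2x3 example matrix from the SVD section with k = 1 (the choice of k is an assumption for illustration):

```python
import numpy as np

X = np.array([[3.0, 2.0, 2.0],
              [2.0, 3.0, -2.0]])

U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the k largest singular values and their vectors
k = 1
X_k = (U[:, :k] * s[:k]) @ Vt[:k]

# Frobenius-norm error of the approximation
err = np.linalg.norm(X - X_k)

print(X_k)  # approximately [[2.5, 2.5, 0], [2.5, 2.5, 0]]
print(err)  # approximately 3.0, the dropped singular value
```

The error equals the singular value that was dropped, which is why discarding the smallest singular values loses the least information.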

## A few additional notes:

* While not all algebraic expressions are directly factorizable, all matrices can be decomposed with SVD (even complex matrices!).

* We can think of each constant component as scaling each of the variable components - since 5 is larger than 3, the overall expression is scaled more by the first numeric component than the second. Analogously, each eigenvalue is a scaling term. The scale analogy breaks down a bit, though, because in the numeric case, each numeric component scales all the variable components, whereas in linear algebra, each eigenvalue is ‘paired’ with its corresponding eigenvector, and only affects how that eigenvector is scaled.
* We have a general formula for the inverse (reciprocal) of an algebraic expression using a binomial expansion of the series (see Negative Binomial Series). Similarly, we can ‘approximate’ the inverse of a matrix (even a non-invertible one!) by computing its ‘pseudo-inverse’ (see Moore–Penrose inverse), which uses SVD.
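
The pseudo-inverse mentioned above can be sketched directly from the SVD: invert the non-zero singular values and leave the zero ones at zero. The helper name `pinv_via_svd` and the tolerance are assumptions made here, and the sketch assumes a real-valued matrix. NumPy's own `np.linalg.pinv` is used only as a reference.

```python
import numpy as np


def pinv_via_svd(A, tol=1e-12):
    """Moore-Penrose pseudo-inverse of a real matrix via SVD (illustrative sketch)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Invert only the singular values that are meaningfully non-zero
    s_inv = np.zeros_like(s)
    mask = s > tol
    s_inv[mask] = 1.0 / s[mask]
    # A = U diag(s) V^T, so A^+ = V diag(1/s) U^T
    return (Vt.T * s_inv) @ U.T


# A singular (non-invertible) matrix: the second row is twice the first
A = np.array([[1.0, 2.0],
              [2.0, 4.0]])

A_pinv = pinv_via_svd(A)

print(A_pinv)              # approximately [[0.04, 0.08], [0.08, 0.16]]
print(np.linalg.pinv(A))   # reference result from NumPy
```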
