# Dimensionality Reduction
## 1 Motivation
In real-world machine learning projects, problems tend to have a large number of features (high dimensionality) for several reasons:
* the problem is complex and includes several aspects
* several teams work together, so certain features end up redundant.

Therefore, reducing the number of features has several advantages:
* improve performance, for example faster training and lower memory use (depending on the situation)
* visualize the dataset and gain a better intuitive understanding of it

## 2 Principal Component Analysis (PCA)
### 2.1 Problem formulation
Principal Component Analysis is the most popular algorithm for dimensionality reduction. Given $m$ vectors $\in \mathbb{R} ^ {n}$, the algorithm finds $k$ vectors $\in \mathbb{R} ^ {n}$ (with $k \leq n$) spanning a subspace such that the sum of the squared distances between each vector and its projection onto that subspace (the projection error) is minimal.
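
One way to write this objective explicitly, assuming the data is mean-normalized and the directions are constrained to be orthonormal (the symbols $u^{(1)}, ..., u^{(k)} \in \mathbb{R}^{n}$ are introduced here for illustration only):

$\begin{align}
\min_{u^{(1)}, ..., u^{(k)}} \frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} - \sum_{j=1}^{k} \left( (u^{(j)})^T x^{(i)} \right) u^{(j)} \right\|^2
\end{align}$

The inner sum is the projection of $x^{(i)}$ onto the subspace spanned by the $u^{(j)}$, so each term is the squared distance between an example and its projection.
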
### 2.2 Algorithm
Before executing the algorithm, it is preferable to apply mean normalization (so that every feature has zero mean) as well as feature scaling (so that features with different ranges become comparable). The algorithm can be broken into the following steps:
1. compute the covariance matrix:
$\begin{align} \Sigma = \frac{1}{m} \sum_{i=1}^{m} x^{(i)} \cdot (x^{(i)}) ^ T = \frac{1}{m} X ^ {T} \cdot X \end{align}$
2. apply the Singular Value Decomposition to $\Sigma$, obtaining the matrices $U$, $S$, $V$.
3. Consider the $n \times k$ matrix $U_{reduced}$ composed of the first $k$ columns of $U$
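4. for every vector $x^{(i)}$ we compute the reduced vector $z^{(i)} = U_{reduced} ^ T \cdot x^{(i)} \in \mathbb{R} ^ {k}$. Its approximation in the original space is $x_{approx} ^{(i)} = U_{reduced} \cdot z^{(i)} \in \mathbb{R} ^ {n}$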
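
The steps above can be sketched with NumPy as follows. This is a minimal illustration rather than a reference implementation: the function names `pca_fit` and `pca_project`, the random data, and the choice to leave constant features unscaled are all assumptions made for the example.

```python
import numpy as np


def pca_fit(X, k):
    """Return (mu, sigma, U_reduced, S) for an (m, n) data matrix X."""
    # Mean normalization and feature scaling, applied before the algorithm.
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0  # assumption: constant features are left unscaled
    X_norm = (X - mu) / sigma

    # Step 1: covariance matrix Sigma = (1/m) X^T X, shape (n, n).
    m = X_norm.shape[0]
    Sigma = (X_norm.T @ X_norm) / m

    # Step 2: singular value decomposition of Sigma.
    # NumPy returns the diagonal of S as a 1-D array, and V transposed.
    U, S, Vt = np.linalg.svd(Sigma)

    # Step 3: keep the first k columns of U.
    U_reduced = U[:, :k]
    return mu, sigma, U_reduced, S


def pca_project(x, mu, sigma, U_reduced):
    """Step 4: map one n-dimensional vector to k dimensions."""
    return U_reduced.T @ ((x - mu) / sigma)


# Hypothetical data: m = 200 examples with n = 5 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))

mu, sigma, U_reduced, S = pca_fit(X, k=2)
projected = pca_project(X[0], mu, sigma, U_reduced)
print(U_reduced.shape, projected.shape)  # (5, 2) (2,)
```

The same normalization parameters (`mu`, `sigma`) have to be applied to any vector before it is projected, which is why `pca_fit` returns them alongside `U_reduced`.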

### 2.3 Number of principal components: $K$
#### 2.3.1 Choosing K
The average squared projection error is defined as: $\begin{align} \frac{1}{m} \sum_{i=1}^{m} ||x^{(i)} - x_{approx} ^{(i)}|| ^{2} \end{align}$

The total variance is defined as $\frac{1}{m} \sum_{i=1}^{m} ||x^{(i)}||^2$

The number $K$ is chosen as the smallest number of components satisfying:
$\begin{align} 
\frac{\frac{1}{m} \sum_{i=1}^{m} ||x^{(i)} - x_{approx} ^{(i)}|| ^{2}}{\frac{1}{m} \sum_{i=1}^{m} ||x^{(i)}||^2} = 
\frac{\sum_{i=1}^{m} ||x^{(i)} - x_{approx} ^{(i)}|| ^{2}}{\sum_{i=1}^{m} ||x^{(i)}||^2} \leq C \end{align}$

We choose $K$ such that $(1 - C) \cdot 100\%$ of the variance is retained. For example, $C = 0.01$ means that $99\%$ of the variance is retained.
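
A direct, brute-force way to apply this criterion is to evaluate the ratio for $K = 1, 2, ...$ until it drops below $C$. The sketch below assumes that $x_{approx}^{(i)}$ is measured in the original space, as $U_{reduced} \cdot U_{reduced}^T \cdot x^{(i)}$, so that the difference with $x^{(i)}$ is well defined. The function names and the synthetic data are hypothetical.

```python
import numpy as np


def projection_error_ratio(X_norm, U, k):
    """Projection error divided by total variance when keeping k components."""
    U_reduced = U[:, :k]
    # Assumption: x_approx is the projection mapped back into R^n.
    X_approx = X_norm @ U_reduced @ U_reduced.T
    error = np.sum((X_norm - X_approx) ** 2)
    total = np.sum(X_norm ** 2)
    return error / total


def choose_k_by_search(X_norm, C=0.01):
    """Smallest k whose error ratio is at most C, found by trying each k."""
    m, n = X_norm.shape
    U, _, _ = np.linalg.svd(X_norm.T @ X_norm / m)
    for k in range(1, n + 1):
        if projection_error_ratio(X_norm, U, k) <= C:
            return k
    return n


# Hypothetical data: 6 features that are noisy mixes of 2 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))
X = latent @ rng.normal(size=(2, 6)) + 0.01 * rng.normal(size=(500, 6))
X_norm = X - X.mean(axis=0)

k = choose_k_by_search(X_norm, C=0.01)
print(k)
```

This version recomputes the approximation for every candidate value, which becomes costly when $n$ is large.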


#### 2.3.2 Implementation notes
It is possible to prove:
$\begin{align} 
\frac{\sum_{i=1}^{m} ||x^{(i)} - x_{approx} ^{(i)}|| ^{2}}{\sum_{i=1}^{m} ||x^{(i)}||^2} = 1 - \frac{\sum_{i=1}^{k} S_{ii}}{\sum_{i=1}^{n} S_{ii}}
\end{align}$
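where $S$ is the diagonal matrix returned by the SVD and $S_{ii}$ is its $i$-th diagonal value.  
Thus, the value of $K$ can be chosen with only one call to the *SVD* function: compute $S$ once, then take the smallest $k$ for which $\frac{\sum_{i=1}^{k} S_{ii}}{\sum_{i=1}^{n} S_{ii}} \geq 1 - C$.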
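
Assuming the diagonal of $S$ is available as a 1-D array sorted in decreasing order (which is what `np.linalg.svd` returns), the selection reduces to a cumulative sum. The function name and the example values below are hypothetical.

```python
import numpy as np


def choose_k_from_singular_values(S, C=0.01):
    """Smallest k such that 1 - sum(S[:k]) / sum(S) <= C, for 0 < C < 1."""
    # Fraction of variance retained by the first 1, 2, ..., n components.
    retained = np.cumsum(S) / np.sum(S)
    # argmax returns the first index where the condition holds.
    return int(np.argmax(retained >= 1.0 - C)) + 1


# Hypothetical diagonal values; retained fractions are 0.5, 0.8, 0.95, 0.99, 1.0.
S = np.array([5.0, 3.0, 1.5, 0.4, 0.1])
print(choose_k_from_singular_values(S, C=0.1))  # 3
```

No approximation $x_{approx}^{(i)}$ needs to be computed for any candidate value, since everything is read off the single decomposition.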

### 2.4 The new training dataset
For an initial dataset represented as the $m \times n$ matrix
$\begin{align}
X =\begin{bmatrix} 
(x^{(1)})^T \\
(x^{(2)})^T \\
\vdots \\
(x^{(m)})^T
\end{bmatrix}
\end{align}$
, it is possible to find the new dataset as the $m \times k$ matrix
$\begin{align}
Z =\begin{bmatrix} 
(z^{(1)})^T \\
(z^{(2)})^T \\
\vdots \\
(z^{(m)})^T
\end{bmatrix} = X \cdot U_{reduced}
\end{align}$  
For a single example, we have
$\begin{align} z = U_{reduced} ^ T \cdot x \end{align}$
This can be extended to the entire dataset:
$ \begin{align} U_{reduced} ^ T \cdot X^T = \begin{bmatrix} z^{(1)} & z^{(2)} & \cdots & z^{(m)} \end{bmatrix} = Z^{T}  \end{align}$ 
Thus $\begin{align} Z = X \cdot U_{reduced} \end{align}$
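
The equivalence between the per-example mapping and the matrix product can be checked numerically. The snippet below is a small sketch on hypothetical random data, with $k = 2$ chosen arbitrarily.

```python
import numpy as np

# Hypothetical data: m = 100 examples, n = 4 features, mean-normalized.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X = X - X.mean(axis=0)

U, S, Vt = np.linalg.svd(X.T @ X / X.shape[0])
U_reduced = U[:, :2]  # k = 2

# Whole dataset at once: Z has shape (m, k).
Z = X @ U_reduced

# Same result one example at a time, using U_reduced^T x for each row x.
Z_loop = np.array([U_reduced.T @ x for x in X])

print(Z.shape)                 # (100, 2)
print(np.allclose(Z, Z_loop))  # True
```

The matrix form is usually preferable in practice because it avoids a Python-level loop over the examples.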

### 2.5 PCA and Supervised Learning
For a supervised learning problem with a dataset $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), ..., (x^{(m)}, y^{(m)})$, PCA is applied as follows:
1. extract the inputs $x^{(1)}, x^{(2)}, ..., x^{(m)}$ without their labels
2. map them to their reduced representations $z^{(1)}, z^{(2)},..., z^{(m)}$
3. obtain the new training set $(z^{(1)}, y^{(1)}), (z^{(2)}, y^{(2)}), ..., (z^{(m)}, y^{(m)})$
4. The mapping parameters obtained, $U_{reduced}$ and $k$, are then used to convert $x_{cv}$ to $z_{cv}$ and $x_{test}$ to $z_{test}$. No additional execution of PCA is required.
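
The workflow above can be sketched as follows. The helper names `fit_pca` and `apply_pca`, the synthetic data, the labels, and the split sizes are all hypothetical. Reusing the training-set mean on the other splits is an assumption added here, since the text only lists $U_{reduced}$ and $k$ as mapping parameters.

```python
import numpy as np


def fit_pca(X_train, k):
    """Learn the mapping parameters from the training inputs only."""
    mu = X_train.mean(axis=0)
    X_centered = X_train - mu
    U, _, _ = np.linalg.svd(X_centered.T @ X_centered / X_train.shape[0])
    return mu, U[:, :k]


def apply_pca(X, mu, U_reduced):
    """Map any split with the parameters learned on the training set."""
    return (X - mu) @ U_reduced


# Hypothetical labeled dataset with 10 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_train, X_cv, X_test = X[:200], X[200:250], X[250:]
y_train, y_cv, y_test = y[:200], y[200:250], y[250:]

# Steps 1-3: PCA runs once, on the unlabeled training inputs.
mu, U_reduced = fit_pca(X_train, k=3)
Z_train = apply_pca(X_train, mu, U_reduced)

# Step 4: the same mapping converts the cross-validation and test inputs.
Z_cv = apply_pca(X_cv, mu, U_reduced)
Z_test = apply_pca(X_test, mu, U_reduced)

print(Z_train.shape, Z_cv.shape, Z_test.shape)  # (200, 3) (50, 3) (50, 3)
```

A learning algorithm would then be trained on `(Z_train, y_train)` and evaluated on `(Z_cv, y_cv)` and `(Z_test, y_test)`.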