# Principal Component Analysis vs. Denoising Variational Autoencoders
## PCA vs. DVAE with examples

To serve this notebook as slides, run `jupyter nbconvert *.ipynb --to slides --post serve`.

# An intuitive perspective ...

*"... natural, real-world high-dimensional data concentrates close to a non-linear low-dimensional manifold ..."*

![](manifold2.png)

![](manifold1.png)

![](manifold3.png)

**But how do we learn the manifold and the probability distribution on it?**

# PCA
* unsupervised learning method
* linear transformation that maps a set of observations to a new coordinate system in which the first coordinate has the largest possible variance, and each subsequent coordinate has the largest possible variance while being orthogonal to the preceding ones [2]
* can be computed from the eigendecomposition of the covariance matrix of the observations
* can equivalently be computed from the singular value decomposition of the centered observations
* the new coordinates are decorrelated
* among all linear projections of the same dimension, reconstructions of the observations from the leading principal components have the least total squared error
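The equivalence of the two computation routes and the decorrelation property can be checked numerically. The following is a minimal NumPy sketch on synthetic data; the toy data, the variable names and the choice of storing observations as columns are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 4, 500                                   # observation size, number of observations

# Toy data (assumption): independent sources with distinct variances, then rotated
scales = np.array([3.0, 2.0, 1.0, 0.5])
rotation, _ = np.linalg.qr(rng.normal(size=(n, n)))
Y = rotation @ (scales[:, None] * rng.normal(size=(n, N)))
Y0 = Y - Y.mean(axis=1, keepdims=True)          # center each coordinate

# Route 1: eigendecomposition of the sample covariance matrix
cov = Y0 @ Y0.T / (N - 1)
eigvals, eigvecs = np.linalg.eigh(cov)          # eigh returns ascending eigenvalues
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

# Route 2: singular value decomposition of the centered observations
U, s, _ = np.linalg.svd(Y0, full_matrices=False)

# Both routes give the same variances and the same directions (up to sign)
same_variances = np.allclose(eigvals, s**2 / (N - 1))
same_directions = np.allclose(np.abs(eigvecs.T @ U), np.eye(n), atol=1e-6)

# Decorrelation: the covariance of the transformed data is diagonal
X0 = eigvecs.T @ Y0
decorrelated = np.allclose(X0 @ X0.T / (N - 1), np.diag(eigvals))

print(same_variances, same_directions, decorrelated)
```

The sign ambiguity is why the comparison uses absolute values: an eigenvector and its negation describe the same direction.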


## Basic math of PCA 

Let $\{y_i\}^N_{i=1}$ be a set of $N$ observation vectors, each of size $n$, with $n\leq N$. 


Let $Y \in \mathbb{R}^{n \times N}$ be the matrix obtained by horizontally concatenating $\{y_i\}^N_{i=1}$, 

$Y = \begin{bmatrix} | & & | \\ y_1 & \cdots & y_N \\ | & & | \end{bmatrix}$ 

To study the statistical properties of the data we first center it. We compute the element-wise mean of the $N$ observations as an $n$-dimensional vector 

$\bar{y} = \frac{1}{N} \sum_{i=1}^{N} y_i = \frac{1}{N} Y 1_{N}$, 

where $1_N$ is a column vector of $N$ ones. The centered observations are then 

$Y_0 = Y - \bar{y} 1_{N}^T$
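As a small worked example of the centering step, the sketch below computes the mean vector with the all-ones vector and subtracts it from every column. The three toy observations and the names `ones_N`, `y_bar` are assumptions chosen for illustration.

```python
import numpy as np

# Three observations of size n = 2, stored as the columns of Y (toy values)
Y = np.array([[1.0, 2.0, 6.0],
              [4.0, 0.0, 2.0]])
n, N = Y.shape
ones_N = np.ones((N, 1))            # column vector of N ones

y_bar = Y @ ones_N / N              # element-wise mean, shape (n, 1)
Y0 = Y - y_bar @ ones_N.T           # subtract the mean from every column

print(y_bar.ravel())                # [3. 2.]
print(Y0 @ ones_N)                  # each row of Y0 sums to zero
```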

A linear transformation on a finite dimensional vector can be expressed as a matrix multiplication: 

$x_i = W^T y_i$, 

where $y_i \in \mathbb{R}^{n}$, $x_i \in \mathbb{R}^{m}$ and $W \in \mathbb{R}^{n \times m}$. The $j$-th element of $x_i$ is the inner product between $y_i$ and the $j$-th column of $W$, denoted $w_j$.

Given the linear transformation, it is clear that 

$X = W^TY$ and $X_0 = W^TY_0$.
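The matrix form can be checked directly. The sketch below uses a random $W$ and random observations (both assumptions for illustration) to confirm that each entry of $X$ is an inner product between a column of $W$ and an observation, and that transforming the centered data is the same as centering the transformed data.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, N = 5, 2, 8
Y = rng.normal(size=(n, N))          # observations as columns
W = rng.normal(size=(n, m))          # arbitrary linear transformation

X = W.T @ Y                          # transformed observations, shape (m, N)

# Entry (j, i) of X is the inner product of column w_j with observation y_i
i, j = 3, 1
entry_matches = np.isclose(X[j, i], W[:, j] @ Y[:, i])

# By linearity, X_0 = W^T Y_0 is exactly the centered version of X
Y0 = Y - Y.mean(axis=1, keepdims=True)
X0 = W.T @ Y0
centering_commutes = np.allclose(X0, X - X.mean(axis=1, keepdims=True))

print(entry_matches, centering_commutes)
```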

In particular, when $W^T$ represents the transformation applied by Principal Component Analysis, we denote $W = P$. Each column of $P$, denoted $\{p_j\}^m_{j=1}$, is a loading vector, whereas each transformed vector $\{x_i\}^N_{i=1}$ is a principal component.

The first loading vector is the unit vector with which the inner products of the centered observations have the greatest variance:
$p_1 = \arg\max_{w_1} w_1^T Y_0Y_0^Tw_1$ subject to $w_1^Tw_1 = 1$.

The solution of this problem is the eigenvector of $Y_0Y_0^T$ corresponding to its largest eigenvalue. $Y_0Y_0^T$ equals the sample covariance matrix up to the scale factor $\frac{1}{N-1}$, which does not change the eigenvectors.
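The constrained maximization can be illustrated numerically. The sketch below, on assumed toy data, takes the top eigenvector of $Y_0Y_0^T$ and checks that no randomly drawn unit vector attains a larger value of the objective; a random search is only a sanity check, not a proof.

```python
import numpy as np

rng = np.random.default_rng(2)
n, N = 3, 400
# Toy data (assumption): coordinates with standard deviations 2, 1 and 0.3
Y = np.diag([2.0, 1.0, 0.3]) @ rng.normal(size=(n, N))
Y0 = Y - Y.mean(axis=1, keepdims=True)
S = Y0 @ Y0.T                                   # scatter matrix Y_0 Y_0^T

eigvals, eigvecs = np.linalg.eigh(S)            # ascending eigenvalues
p1 = eigvecs[:, -1]                             # eigenvector of the largest eigenvalue
best = p1 @ S @ p1                              # objective value at p1

# Evaluate the objective w^T S w for many random unit vectors
candidates = rng.normal(size=(n, 1000))
candidates /= np.linalg.norm(candidates, axis=0)
values = np.einsum('ik,ij,jk->k', candidates, S, candidates)

print(np.isclose(best, eigvals[-1]))            # objective at p1 is the top eigenvalue
print(values.max() <= best + 1e-6)              # no candidate does better
```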

# Autoencoders
* unsupervised neural network
* trained to minimize the reconstruction error of the observations [1]

https://lilianweng.github.io/lil-log/2018/08/12/from-autoencoder-to-beta-vae.html


# PCA vs. Autoencoders
* an autoencoder with a single fully-connected hidden layer, linear activation functions and a squared error cost function is closely related to PCA: its weights span the principal subspace [3]
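The subspace claim can be illustrated without a deep learning framework. The sketch below fits a linear autoencoder by alternating least squares. This is an assumption made to keep the example deterministic and fast, and it is not the Adam-based Keras training used later in this notebook. It also omits biases and L2 regularization, and all names and the toy data are hypothetical. It checks only that the decoder columns span the same subspace as the first $m$ loading vectors. It does not claim that the individual columns are loading vectors.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, N = 5, 2, 300

# Toy data (assumption) with a clear gap between the m-th and (m+1)-th variances
scales = np.array([3.0, 2.0, 0.5, 0.2, 0.1])
rotation, _ = np.linalg.qr(rng.normal(size=(n, n)))
Y = rotation @ (scales[:, None] * rng.normal(size=(n, N)))
Y0 = Y - Y.mean(axis=1, keepdims=True)

# Linear autoencoder: codes X = W1 @ Y0, reconstructions W2 @ X
W1 = rng.normal(size=(m, n))                    # random initial encoder
for _ in range(100):
    X = W1 @ Y0
    W2 = Y0 @ np.linalg.pinv(X)                 # least-squares decoder for these codes
    W1 = np.linalg.pinv(W2)                     # least-squares encoder for this decoder

# First m loading vectors from the eigendecomposition of Y_0 Y_0^T
_, eigvecs = np.linalg.eigh(Y0 @ Y0.T)
P_m = eigvecs[:, ::-1][:, :m]

# Cosines of the principal angles between span(W2) and span(P_m); all 1 means equal subspaces
Q, _ = np.linalg.qr(W2)
cosines = np.linalg.svd(Q.T @ P_m, compute_uv=False)

# The autoencoder's reconstruction error matches that of PCA with m components
err_ae = np.sum((Y0 - W2 @ W1 @ Y0) ** 2)
err_pca = np.sum((Y0 - P_m @ P_m.T @ Y0) ** 2)

print(cosines)
print(err_ae, err_pca)
```

Because any invertible mixing of the codes leaves the reconstruction unchanged, the unregularized loss determines only the subspace; the notebook's later code applies an SVD to the trained, L2-regularized decoder weights to obtain loading vectors, following [3].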


# Variational Autoencoders




# Denoising Variational Autoencoders




In [None]:
from keras.datasets import mnist
from keras.layers import Input, Dense
from keras import regularizers, models, optimizers
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Analytical PCA of the training set
def AnalyticalPCA(y, dimension):
    pca = PCA(n_components=dimension)
    pca.fit(y)
    loadings = pca.components_
    return loadings

# Linear autoencoder: one linear hidden layer, L2-regularized weights, squared-error loss
def LinearAE(y, dimension, learning_rate = 1e-4, regularization = 5e-4, epochs=3):
    inputs = Input(shape=(y.shape[1],))                                         # one flattened observation per row of y
    encoded = Dense(dimension, activation='linear',
                    kernel_regularizer=regularizers.l2(regularization))(inputs)
    decoded = Dense(y.shape[1], activation='linear',
                    kernel_regularizer=regularizers.l2(regularization))(encoded)
    autoencoder = models.Model(inputs, decoded)
    autoencoder.compile(optimizer=optimizers.Adam(learning_rate=learning_rate), loss='mean_squared_error')
    autoencoder.fit(y, y, epochs=epochs, batch_size=4, shuffle=True)            # the target is the input itself
    (w1,b1,w2,b2)=autoencoder.get_weights()                                     # encoder kernel and bias, decoder kernel and bias
    return (w1,b1,w2,b2)

def PlotResults(p,dimension,name):
    sqrt_dimension = int(np.ceil(np.sqrt(dimension)))
    plt.figure()
    for i in range(p.shape[0]):
        plt.subplot(sqrt_dimension, sqrt_dimension, i + 1)
        plt.imshow(p[i, :, :],cmap='gray')
        plt.axis('off')
    plt.savefig(name + '.png')

dimension = 16                                                                  # feel free to change this, but you may have to tune hyperparameters
(y, _), (_, _) = mnist.load_data(path='./mnist.npz')                            # load MNIST training images

shape_y = y.shape                                                               # store shape of y before reshaping it
y = np.reshape(y,[shape_y[0],shape_y[1]*shape_y[2]]).astype('float32')/255      # reshape y to be a 2D matrix of the dataset
p_analytical = AnalyticalPCA(y,dimension)                                       # PCA by applying SVD to y
(_, _, w2, _) = LinearAE(y, dimension)                                          # train a linear autoencoder
(p_linear_ae, _, _) = np.linalg.svd(w2.T, full_matrices=False)                  # PCA by applying SVD to linear autoencoder weights
p_analytical = np.reshape(p_analytical,[dimension,shape_y[1],shape_y[2]])       # reshape loading vectors before plotting
w2 = np.reshape(w2,[dimension,shape_y[1],shape_y[2]])                           # reshape autoencoder weights before plotting
p_linear_ae = np.reshape(p_linear_ae.T, [dimension, shape_y[1], shape_y[2]])    # reshape loading vectors before plotting

In [3]:
PlotResults(p_analytical,dimension,'AnalyticalPCA')
PlotResults(w2,dimension,'W2')
PlotResults(p_linear_ae,dimension,'LinearAE_PCA')

# References and further reading
[1] Goodfellow, I., Bengio, Y. and Courville, A., 2016. Deep Learning. MIT Press.

[2] Hastie, T., Tibshirani, R. and Friedman, J.H., 2017. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer.

[3] Plaut, E., 2018. From principal subspaces to principal components with linear autoencoders. arXiv preprint arXiv:1804.10253.

[4] Im, D.I.J., Ahn, S., Memisevic, R. and Bengio, Y., 2017, February. Denoising criterion for variational auto-encoding framework. In Thirty-First AAAI Conference on Artificial Intelligence.

[5] Rolinek, M., Zietlow, D. and Martius, G., 2019. Variational Autoencoders Pursue PCA Directions (by Accident). In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 12406-12415).

[6] Lei, N., Luo, Z., Yau, S.T. and Gu, D.X., 2018. Geometric understanding of deep learning. arXiv preprint arXiv:1805.10451.

[7] Kingma, D.P. and Welling, M., 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.