# Autoencoder

An autoencoder learns a lossy, compressed representation of its input data.
An autoencoder has three parts:
1. Encoding function
2. Decoding function
3. Distance function that measures the information lost between the original data and its reconstruction (the decompressed representation).
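
The three parts above can be sketched with a tiny linear autoencoder in NumPy. This is only a toy illustration, not the model trained later in this notebook: the data, the layer sizes, the learning rate, and the names `encode`, `decode`, and `reconstruction_loss` are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (an assumption for illustration): 200 points in 8-D that lie
# close to a 2-D subspace, so a 2-D code can represent them well.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 8))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 8))

# Parts 1 and 2: linear encoding and decoding functions.
W_enc = 0.1 * rng.normal(size=(8, 2))
W_dec = 0.1 * rng.normal(size=(2, 8))

def encode(x):
    return x @ W_enc

def decode(code):
    return code @ W_dec

# Part 3: distance between the input and its reconstruction.
def reconstruction_loss(x, x_hat):
    return np.mean((x - x_hat) ** 2)

initial_loss = reconstruction_loss(X, decode(encode(X)))

# Plain gradient descent on the reconstruction loss.
learning_rate = 0.02
for _ in range(2000):
    code = encode(X)
    X_hat = decode(code)
    d_X_hat = 2.0 * (X_hat - X) / X.size   # gradient of the loss w.r.t. X_hat
    grad_dec = code.T @ d_X_hat
    grad_enc = X.T @ (d_X_hat @ W_dec.T)
    W_dec -= learning_rate * grad_dec
    W_enc -= learning_rate * grad_enc

final_loss = reconstruction_loss(X, decode(encode(X)))
print(initial_loss, final_loss)
```

The compression is lossy because the 8-D input is squeezed through a 2-D code; the loss only measures how much of the input survives the round trip.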

Two practical applications of autoencoders are data denoising and dimensionality reduction for data visualization.

t-SNE is a good algorithm for 2D visualization, but it typically works best when its input is already relatively low-dimensional. One use for autoencoders is to compress high-dimensional data into a lower-dimensional space, e.g. 32 dimensions, and then use t-SNE to map the compressed data to a 2D plane.

### Encoder and Decoder

decoder:= P(X|z)
encoder:= Q(z|X)

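The encoder Q is a neural net that outputs the mean, Mu, and the standard deviation, Sigma, of the distribution over the encoded data. (In the code below it outputs the log-variance, log Sigma^2, in place of Sigma.)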

### Xavier Initialization

For our weight initialization we want a Gaussian distribution with 0 mean and finite variance.

We want to choose a variance that will not lead to exploding or vanishing gradients. To do this we use a variance equal to 1/N, where N is the number of input neurons. In the original paper by Glorot and Bengio, *Understanding the difficulty of training deep feedforward neural networks*, N is the average (N_in + N_out)/2, which gives a variance of 2/(N_in + N_out). Many later implementations use only the number of input neurons, which is simpler. The `xavier_init` function below uses a standard deviation of 1/sqrt(N_in/2), i.e. a variance of 2/N_in.
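
A quick numerical check of the 1/N rule, as a sketch: push unit-variance inputs through one linear layer and compare the variance of the outputs under a Xavier-style scale and under a naive unit-variance scale. The helper name `xavier_normal`, its `mode` argument, and the layer width of 512 are assumptions made for the example, not part of this notebook's model.

```python
import numpy as np

def xavier_normal(n_in, n_out, rng, mode="fan_in"):
    """Sample an (n_in, n_out) weight matrix from a zero-mean Gaussian."""
    if mode == "fan_in":
        variance = 1.0 / n_in                  # variance 1/N with N = N_in
    elif mode == "fan_avg":
        variance = 2.0 / (n_in + n_out)        # N = (N_in + N_out) / 2
    else:
        raise ValueError("mode must be 'fan_in' or 'fan_avg'")
    return rng.normal(0.0, np.sqrt(variance), size=(n_in, n_out))

rng = np.random.default_rng(0)
x = rng.normal(size=(2000, 512))               # inputs with variance close to 1

W_xavier = xavier_normal(512, 512, rng)
W_naive = rng.normal(0.0, 1.0, size=(512, 512))  # unit variance, no scaling

var_xavier = (x @ W_xavier).var()
var_naive = (x @ W_naive).var()
print(var_xavier, var_naive)
```

With the 1/N scale the output variance stays close to the input variance, while the unscaled weights multiply it by roughly the number of inputs, which is the kind of growth that compounds across layers.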

### Reparameterization Trick

In a variational autoencoder we randomly sample from the encoder's distribution. We cannot backpropagate through this random node in the computation graph, because we cannot compute a gradient for our parameters through a random function. To fix this we move the learnable parameters outside the sampling function: we draw noise eps from a fixed N(0, 1) and compute z = Mu + Sigma * eps. Now no gradient is needed for the stochastic node, and the gradients reach Mu and Sigma through a deterministic expression.
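
A small worked example of why moving the parameters outside the sampling function helps, sketched in NumPy. The objective E[z^2] and the values of `mu` and `sigma` are assumptions chosen because the true gradients are known in closed form (E[z^2] = mu^2 + sigma^2, so the gradients are 2*mu and 2*sigma).

```python
import numpy as np

rng = np.random.default_rng(0)

mu, sigma = 1.5, 0.7                   # stand-ins for the learnable parameters
eps = rng.standard_normal(200_000)     # noise that does not depend on mu or sigma

# Reparameterized sample: z is distributed as N(mu, sigma^2), but it is a
# deterministic function of (mu, sigma) once eps is fixed.
z = mu + sigma * eps

# Chain rule through the deterministic map, averaged over the noise:
#   d(z^2)/d(mu)    = 2 * z * dz/dmu    = 2 * z
#   d(z^2)/d(sigma) = 2 * z * dz/dsigma = 2 * z * eps
grad_mu = np.mean(2.0 * z)
grad_sigma = np.mean(2.0 * z * eps)

print(grad_mu, 2 * mu)        # Monte Carlo estimate vs. exact gradient
print(grad_sigma, 2 * sigma)
```

The randomness lives entirely in `eps`, so the derivatives with respect to `mu` and `sigma` are ordinary derivatives of an arithmetic expression, which is what backpropagation needs.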

### Unnecessary Notes

Z is the latent vector 

### Things I want to try

Changing the size of the latent vector and looking at image reconstruction

Letting the user select values for the latent vector, or some system in that vein

Looking at how limited the autoencoder is in terms of how many examples it needs, and how quickly the number of samples must grow for it to approximate a greater diversity of images.

Can I use an autoencoder to encode data for a GA? What happens, and what does that look like? I know that is what Hardmaru did, or something like it, but I want to try it myself. I'm not sure exactly how to do this, but it's definitely going to happen. This sounds like it is super computationally expensive, but I'd like to read more about the interaction of GA and machine learning.


In [38]:
import torch
import torch.nn.functional as nn
import torch.autograd as autograd
import torch.optim as optim
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec

import os
from torch.autograd import Variable
# Note: this MNIST helper exists only in TensorFlow 1.x; it was removed in TensorFlow 2
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets('../MNIST_data', one_hot= True)

mb_size = 64 #mini batch size
Z_dim = 100
X_dim = mnist.train.images.shape[1] #784, number of pixels 28x28
Y_dim = mnist.train.labels.shape[1] #10, number of categories
h_dim = 128 
c = 0
lr = 1e-3 #learning rate

def xavier_init(size):
    in_dim = size[0]
    xavier_stddev = 1. / np.sqrt(in_dim / 2.)
    return Variable(torch.randn(*size) * xavier_stddev, requires_grad=True)

Wxh = xavier_init(size=[X_dim, h_dim]) # weights for linear layer
bxh = Variable(torch.zeros(h_dim), requires_grad=True)

Whz_mu = xavier_init(size=[h_dim, Z_dim]) # weights for mean
bhz_mu = Variable(torch.zeros(Z_dim), requires_grad=True)

Whz_var = xavier_init(size=[h_dim, Z_dim]) # weights for log-variance (exponentiated in sample_z)
bhz_var = Variable(torch.zeros(Z_dim), requires_grad=True)
    
# Encoder
# Multiply the input by the weights and add the bias, then map the hidden activation
# to the mean and log-variance with their own weights + bias
def Q(X):
    h = nn.relu(X @ Wxh + bxh.repeat(X.size(0), 1)) # .repeat tiles the bias across the batch; recent PyTorch versions broadcast this automatically
    z_mu = h @ Whz_mu + bhz_mu.repeat(h.size(0), 1) 
    z_var = h @ Whz_var + bhz_var.repeat(h.size(0), 1)
    return z_mu, z_var

Extracting ../MNIST_data/train-images-idx3-ubyte.gz
Extracting ../MNIST_data/train-labels-idx1-ubyte.gz
Extracting ../MNIST_data/t10k-images-idx3-ubyte.gz
Extracting ../MNIST_data/t10k-labels-idx1-ubyte.gz


In [24]:
img = mnist.train.images[0]


In [39]:
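def sample_z(mu, log_var):
    # Reparameterization trick: draw eps from N(0, 1), then shift and scale it.
    # eps takes its shape from mu, so any batch size works.
    eps = Variable(torch.randn(*mu.size()))
    # exp(log_var / 2) is the standard deviation
    return mu + torch.exp(log_var / 2) * eps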

In [41]:
# Decoder
# Pass the latent code through a linear layer + ReLU, then a linear layer + sigmoid

Wzh = xavier_init(size=[Z_dim, h_dim])
bzh = Variable(torch.zeros(h_dim), requires_grad=True)

Whx = xavier_init(size=[h_dim, X_dim])
bhx = Variable(torch.zeros(X_dim), requires_grad=True)


def P(z):
    h = nn.relu(z @ Wzh + bzh.repeat(z.size(0), 1)) 
    X = torch.sigmoid(h @ Whx + bhx.repeat(h.size(0), 1)) # sigmoid keeps pixel values in (0, 1)
    return X

In [43]:
params = [Wxh, bxh, Whz_mu, bhz_mu, Whz_var, bhz_var,
          Wzh, bzh, Whx, bhx]

solver = optim.Adam(params, lr=lr)

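for it in range(100000):
    X, _ = mnist.train.next_batch(mb_size)
    X = Variable(torch.from_numpy(X))

    # Reset the gradients accumulated by the previous step
    solver.zero_grad()

    # Forward: the encoder's second output is treated as the log-variance
    z_mu, z_log_var = Q(X)
    z = sample_z(z_mu, z_log_var)
    X_sample = P(z)

    # Loss: reconstruction error plus KL divergence from the N(0, 1) prior,
    # both summed over the mini batch
    recon_loss = nn.binary_cross_entropy(X_sample, X, reduction='sum')
    kl_loss = 0.5 * torch.sum(torch.exp(z_log_var) + z_mu**2 - 1. - z_log_var)
    loss = recon_loss + kl_loss

    # Backward
    loss.backward()

    # Update
    solver.step()

    # Print the loss occasionally to monitor training
    if it % 1000 == 0:
        print('Iter {}; loss: {:.4f}'.format(it, loss.item()))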
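
For reference, the `kl_loss` line in the training loop matches the closed-form KL divergence between a diagonal Gaussian and a standard normal prior. This reading assumes the second output of `Q` is the log-variance, which is how `sample_z` uses it:

$$
D_{KL}\big(\mathcal{N}(\mu, \sigma^2)\,\|\,\mathcal{N}(0, 1)\big) = \frac{1}{2}\sum_{j}\left(\sigma_j^2 + \mu_j^2 - 1 - \log \sigma_j^2\right)
$$

Here $\sigma_j^2$ is the exponential of the encoder's log-variance output and $j$ runs over the latent dimensions; the code additionally sums over the examples in the mini batch.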