# Auto-Encoders
An autoencoder is a neural network, often a shallow one, that performs feature engineering and feature selection automatically. 

One other area that unsupervised learning excels in is *feature extraction*, which is a method used to generate a new feature representation from an original set of features. 

The new feature representation is called a *learned representation* and is used to improve performance on supervised learning problems. 

Autoencoders are one such form of feature extraction. They use a **feedforward, non-recurrent neural network** to perform *representation learning*. 

In an autoencoder, each layer of the neural network learns a representation of the original features, and subsequent layers build on the representations learned by the preceding layers. 

Layer by layer, the autoencoder learns increasingly complicated representations from simpler ones, building what is known as a hierarchy of concepts that become more and more abstract. 

The output layer is the final newly learned representation of the original features. This learned representation can then be used as input into a supervised learning model with the objective of reducing the generalization error. 

# Neural Networks
Neural networks perform representation learning, where each layer of the neural network learns a representation from the previous layer. 

By building more nuanced and detailed representations layer by layer, neural networks can accomplish demanding tasks such as computer vision, speech recognition, and machine translation. 

The major difference between machine learning using neural networks and classical machine learning is that much of the feature representation is learned automatically in the neural network case, whereas it is hand-designed in classical machine learning. 

The output of each node is fed into an activation function, which determines what value is passed from the current layer to the next layer of the neural network. Common activation functions include sigmoid, tanh, and ReLU. For classification, the final activation function is usually the softmax function, which outputs the probability that the input observation belongs to each class.
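As a rough illustration of the activation functions named above, here is a minimal NumPy sketch. The function names and the sample vector `z` are chosen for this example only and do not come from any particular library.

```python
import numpy as np

def sigmoid(z):
    # Squashes any real value into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # Keeps positive values and replaces negative values with 0.
    return np.maximum(0.0, z)

def softmax(z):
    # Subtracting the max is a common trick for numerical stability;
    # it does not change the result.
    shifted = z - np.max(z, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

z = np.array([-2.0, 0.0, 3.0])  # hypothetical pre-activation values
print("sigmoid:", sigmoid(z))
print("tanh:   ", np.tanh(z))
print("relu:   ", relu(z))
print("softmax:", softmax(z))   # entries are non-negative and sum to 1
```

Softmax is the only one of the four that looks at the whole vector at once, which is why it suits a final layer that has to output one probability per class.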

Neural networks may also have bias nodes; these nodes are always constant values and, unlike the normal nodes, are not connected to the previous layer. Rather, they allow the output of an activation function to be shifted lower or higher. With the hidden layers - including the nodes, bias nodes, and activation functions - the neural network is trying to learn the right function approximation to use to map the input layer to the output layer. 
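To make the role of the bias concrete, the sketch below computes one layer as `activation(x @ W + b)` and compares the output with and without a bias. The helper `dense_layer` and all numbers are assumptions made up for this illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dense_layer(x, W, b, activation):
    # Weighted sum of the inputs, shifted by the bias, then activated.
    return activation(x @ W + b)

x = np.array([[0.5, -1.0]])      # one observation, two features
W = np.array([[1.0], [2.0]])     # weights into a single node

no_bias = dense_layer(x, W, np.array([0.0]), sigmoid)
with_bias = dense_layer(x, W, np.array([2.0]), sigmoid)

# The bias does not depend on x; it only shifts the node's output.
print("without bias:", no_bias)
print("with bias:   ", with_bias)
```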

In the case of supervised learning problems, this is pretty straightforward. The input layer represents the features that are fed into the neural network, and the output layer represents the label assigned to each observation. 

During the training process, the neural network determines which weights across the neural network help minimize the error between its predicted label for each observation and the true label. 
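The idea of adjusting weights to shrink the error can be sketched with plain gradient descent on a single linear layer. This is a deliberately tiny stand-in for a full network; the synthetic data, `true_w`, and the learning rate are all assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic supervised problem: labels are a known linear function of X.
X = rng.normal(size=(100, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w

w = np.zeros(3)          # weights to be learned
learning_rate = 0.1
losses = []

for _ in range(200):
    error = X @ w - y                     # predicted label minus true label
    losses.append(np.mean(error ** 2))    # mean squared error
    grad = 2.0 * X.T @ error / len(X)     # gradient of the MSE w.r.t. w
    w -= learning_rate * grad             # step against the gradient

print("first loss:", losses[0])
print("last loss: ", losses[-1])
print("learned weights:", w)
```

A real network applies the same loop to every layer's weights, with the gradients obtained by backpropagation.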

In **unsupervised learning** problems, the neural network learns representations of the input layer via the various hidden layers but is not guided by labels. 

Neural networks are incredibly powerful and are capable of modeling complex nonlinear relationships to a degree that classical machine learning algorithms struggle with. 

There is a potential risk. Because neural networks can model complex nonlinear relationships, they are also much more prone to overfitting. Thus regularization of these networks is vital to their performance. 
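One common form of regularization is an L2 penalty on the weights, added to the training loss. The sketch below is only one possible choice (the text above does not name a specific regularizer), and `l2_regularized_loss` and `lam` are hypothetical names.

```python
import numpy as np

def l2_regularized_loss(y_true, y_pred, weights, lam=0.01):
    # Data term: how far the predictions are from the targets.
    data_loss = np.mean((y_true - y_pred) ** 2)
    # Penalty term: large weights make the loss larger, which
    # discourages needlessly complex functions.
    penalty = lam * sum(np.sum(W ** 2) for W in weights)
    return data_loss + penalty

y_true = np.array([1.0, 2.0])
y_pred = np.array([1.0, 2.0])            # perfect predictions
weights = [np.array([[1.0, 2.0]])]       # sum of squares = 5

loss = l2_regularized_loss(y_true, y_pred, weights, lam=0.1)
print(loss)  # the penalty alone, since the data term is zero
```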

## Autoencoder: The Encoder and the Decoder
An autoencoder has two parts: an encoder and a decoder. 

The **encoder** converts the input set of features into a different representation. 

The **decoder** converts this newly learned representation to the original format. 

The core concept of an autoencoder is similar to that of dimensionality reduction. As in dimensionality reduction, an autoencoder does not memorize the original observations and features, which would amount to learning the identity function. Rather, autoencoders must **approximate** the original observations as closely as possible - but not exactly - using a newly learned representation. 

In other words: the autoencoder learns an **approximation** of the identity function. 

Since the autoencoder is constrained, it is forced to learn the most salient properties of the original data, capturing the underlying structure of the data; this is similar to what happens in dimensionality reduction.

The constraint is a very important attribute of autoencoders: it forces the autoencoder to intelligently choose which important information to capture and which irrelevant or less important information to discard. 

## Undercomplete Autoencoders
We care most about the **encoder** because this component is the one that learns a new representation of the original data. 

This new representation is the new set of features derived from the original set of features and observations. 

* Encoder: $h = f(x)$
* Decoder: $r = g(h)$

Therefore, we aim to build: 

## $$g(f(x)) \approx x$$

An **undercomplete autoencoder** constrains the encoder function's output, $h$, to have fewer dimensions than $x$. 

Constrained in this manner, the autoencoder attempts to minimize a *loss function* we define such that the **reconstruction error** - after the decoder reconstructs the observations approximately using the encoder's output - is as small as possible. 
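The pieces above ($h = f(x)$, $r = g(h)$, and the reconstruction error) can be sketched in a few lines of NumPy. The weights here are random and untrained, the layer sizes are arbitrary, and mean squared error is assumed as the loss; the point is only to show the shapes and the quantity being minimized.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features, n_hidden = 8, 3   # n_hidden < n_features: undercomplete

# Untrained, randomly initialized weights (for illustration only).
W_enc = rng.normal(scale=0.1, size=(n_features, n_hidden))
W_dec = rng.normal(scale=0.1, size=(n_hidden, n_features))

def f(x):
    # Encoder: maps x to the lower-dimensional representation h.
    return np.tanh(x @ W_enc)

def g(h):
    # Decoder: maps h back to the original number of features.
    return h @ W_dec

def reconstruction_error(x, r):
    # Mean squared error between the input and its reconstruction.
    return np.mean((x - r) ** 2)

x = rng.normal(size=(5, n_features))   # 5 observations
h = f(x)
r = g(h)

print("x:", x.shape, "h:", h.shape, "r:", r.shape)
print("reconstruction error:", reconstruction_error(x, r))
```

Training would adjust `W_enc` and `W_dec` so that this error becomes as small as the bottleneck allows.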

The decoder's last layer will contain as many units as the encoder's input has features, so that it produces a reconstructed output with the same dimensions as the raw input. 

**WHEN** the decoder is linear and the loss function is the mean squared error, an undercomplete autoencoder learns the same sort of new representation as PCA. 
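To see what "the same sort of representation as PCA" means in terms of $f$ and $g$, the sketch below writes PCA itself as a linear encoder and decoder (via NumPy's SVD) and checks that its mean squared reconstruction error is no worse than that of a random linear projection of the same size. This shows only the PCA side; it does not train an autoencoder. The data and the choice `k = 2` are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated toy data, centered as PCA requires.
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
X_centered = X - X.mean(axis=0)

k = 2  # number of dimensions kept
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
V_k = Vt[:k].T  # top-k principal directions, shape (5, k)

def encode(x):
    return x @ V_k        # linear encoder: h = f(x)

def decode(h):
    return h @ V_k.T      # linear decoder: r = g(h)

pca_error = np.mean((X_centered - decode(encode(X_centered))) ** 2)

# Comparison: projection onto a random k-dimensional subspace.
Q, _ = np.linalg.qr(rng.normal(size=(5, k)))
random_error = np.mean((X_centered - X_centered @ Q @ Q.T) ** 2)

print("PCA reconstruction error:   ", pca_error)
print("random projection error:    ", random_error)
```

A linear autoencoder trained with mean squared error is aiming for this same minimum, which is why its learned code spans the same subspace as the principal components.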

**HOWEVER**, if the encoder and decoder functions are nonlinear, the autoencoder can learn much more complex nonlinear representations. 
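As a minimal sketch of a nonlinear case, the code below trains a tiny undercomplete autoencoder with a `tanh` encoder layer and a linear decoder, using hand-written gradients and full-batch gradient descent. The toy data, layer sizes, learning rate, and step count are all assumptions; in practice a deep learning library would handle the gradients.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 3 features that depend nonlinearly on one hidden factor t.
t = rng.uniform(-1.0, 1.0, size=(256, 1))
X = np.hstack([t, t ** 2, np.sin(3.0 * t)])

n_features, n_code = X.shape[1], 2   # code is smaller than the input
W_enc = rng.normal(scale=0.1, size=(n_features, n_code))
b_enc = np.zeros(n_code)
W_dec = rng.normal(scale=0.1, size=(n_code, n_features))
b_dec = np.zeros(n_features)

learning_rate = 0.1
losses = []

for step in range(2000):
    # Forward pass: h = f(x) is nonlinear, r = g(h) is linear.
    H = np.tanh(X @ W_enc + b_enc)
    R = H @ W_dec + b_dec
    diff = R - X
    losses.append(np.mean(diff ** 2))     # reconstruction error (MSE)

    # Backward pass: gradients of the MSE with respect to each parameter.
    dR = 2.0 * diff / diff.size
    dW_dec = H.T @ dR
    db_dec = dR.sum(axis=0)
    dH = dR @ W_dec.T
    dZ = dH * (1.0 - H ** 2)              # derivative of tanh
    dW_enc = X.T @ dZ
    db_enc = dZ.sum(axis=0)

    # Gradient descent update.
    W_enc -= learning_rate * dW_enc
    b_enc -= learning_rate * db_enc
    W_dec -= learning_rate * dW_dec
    b_dec -= learning_rate * db_dec

print("initial reconstruction error:", losses[0])
print("final reconstruction error:  ", losses[-1])
```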

## Overcomplete Autoencoders
If the encoder learns a representation in a greater number of dimensions than the original input dimensions, the autoencoder is considered overcomplete. 

If we employ some form of regularization, which penalizes the neural network for learning unnecessarily complex functions, overcomplete autoencoders can be used successfully for dimensionality reduction and automatic feature engineering. 

Regularized overcomplete autoencoders are harder to design successfully but are potentially more powerful because they can learn more complex but not overly complex representations that better approximate the original observations without copying them precisely. 

## Dense vs. Sparse Autoencoders
Normal autoencoders output a dense final matrix such that a handful of features have the most salient information that has been captured about the original data. 

HOWEVER, we may want to output a sparse final matrix such that the information captured is more evenly distributed across the features that the autoencoder learns. 

To do this, we need to include not just a *reconstruction error* as part of the autoencoder but also a *sparsity penalty* so that the autoencoder must take the sparsity of the final matrix into consideration. 

Sparse autoencoders are generally overcomplete: the hidden layers have more units than there are input features, with the caveat that only a small fraction of the hidden units are allowed to be active at the same time. 
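One simple way to express a sparsity penalty is an L1 term on the hidden activations, added to the reconstruction error. This is only one of several possible penalties (the text above does not specify which one is meant), and `sparse_autoencoder_loss` and `sparsity_weight` are hypothetical names.

```python
import numpy as np

def sparse_autoencoder_loss(x, r, h, sparsity_weight=1e-3):
    # Reconstruction error: how well r approximates x.
    reconstruction = np.mean((x - r) ** 2)
    # Sparsity penalty: mean absolute activation of the hidden units.
    # It is smallest when most hidden units are inactive (zero).
    sparsity = np.mean(np.abs(h))
    return reconstruction + sparsity_weight * sparsity

x = np.zeros((4, 3))
r = np.zeros((4, 3))               # perfect reconstruction in both cases

h_dense = np.ones((4, 6))          # every hidden unit active
h_sparse = np.zeros((4, 6))
h_sparse[:, 0] = 1.0               # only one of six units active

dense_loss = sparse_autoencoder_loss(x, r, h_dense, sparsity_weight=0.1)
sparse_loss = sparse_autoencoder_loss(x, r, h_sparse, sparsity_weight=0.1)
print("dense activations: ", dense_loss)
print("sparse activations:", sparse_loss)
```

With identical reconstructions, the code with fewer active units gets the lower loss, which is the pressure that pushes a sparse autoencoder toward sparse representations.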

## Denoising Autoencoder
In some cases, we may want the autoencoder we design to more aggressively ignore the noise in the data, especially if we suspect the original data is corrupted to some degree. 

One example is a recording of a conversation between two people at a noisy coffee shop in the middle of the day, where we would want to isolate the conversation (the signal) from the background chatter (the noise). 

Another is a dataset of images that are grainy or distorted due to low resolution or some blurring effect. We will want to isolate the core image from the distortion. 

Therefore, we design a *denoising autoencoder* that receives the corrupted data as input and is trained to output the original, uncorrupted data as best as possible. 
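The training setup for a denoising autoencoder can be sketched as follows: the input is a corrupted copy of the data, while the loss compares the reconstruction against the clean data. Gaussian noise is assumed here as the corruption, and `corrupt` and `denoising_loss` are hypothetical helpers. No model is trained; an identity "model" that returns its noisy input serves as a baseline.

```python
import numpy as np

def corrupt(x, noise_std, rng):
    # Adds zero-mean Gaussian noise to simulate corrupted inputs.
    return x + rng.normal(scale=noise_std, size=x.shape)

def denoising_loss(x_clean, reconstruction):
    # The target is the CLEAN data, not the noisy input.
    return np.mean((x_clean - reconstruction) ** 2)

rng = np.random.default_rng(0)
x_clean = rng.uniform(0.0, 1.0, size=(1000, 16))
x_noisy = corrupt(x_clean, noise_std=0.2, rng=rng)

# Baseline: a "model" that just copies its noisy input to the output.
# Its loss is roughly the noise variance (0.2 ** 2).
baseline = denoising_loss(x_clean, x_noisy)
print("loss of copying the noisy input:", baseline)
```

A trained denoising autoencoder should score below this baseline, because copying the input no longer minimizes the loss.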

## Variational Autoencoder
An alternative autoencoder, known as the *variational autoencoder*, has an encoder that outputs two vectors instead of one: a vector of means, `mu`, and a vector of standard deviations, `sigma`. 

These two vectors form random variables such that the $i$-th elements of `mu` and `sigma` correspond to the mean and standard deviation of the $i$-th random variable. 

By forming this stochastic output via its encoder, the variational autoencoder is able to sample across a continuous space based on what it has learned from the input data. 
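The sampling step can be sketched with the usual reparameterization `z = mu + sigma * eps`, where `eps` is standard normal noise. The values of `mu` and `sigma` below are made up for illustration; in a real variational autoencoder they would be produced by the encoder for each input.

```python
import numpy as np

def sample_latent(mu, sigma, rng):
    # Each element z[i] is drawn from a normal distribution with
    # mean mu[i] and standard deviation sigma[i].
    eps = rng.standard_normal(size=np.shape(mu))
    return mu + sigma * eps

rng = np.random.default_rng(0)
mu = np.array([0.0, 2.0])       # hypothetical encoder output: means
sigma = np.array([1.0, 0.5])    # hypothetical encoder output: std devs

samples = np.stack([sample_latent(mu, sigma, rng) for _ in range(10_000)])

print("sample means:", samples.mean(axis=0))   # close to mu
print("sample stds: ", samples.std(axis=0))    # close to sigma
```

Because each draw differs, passing different samples `z` through the decoder yields different outputs.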

The variational autoencoder is not confined to just the examples it has trained on but can generalize and output new examples even if it may have never seen precisely similar ones before. 

This is incredibly powerful because the variational autoencoder can now generate new synthetic data that appears to belong to the distribution it has learned from the original input data. 

Such advances have led to an entirely new and trending field in unsupervised learning known as generative modeling, which includes *generative adversarial networks*. 