Unsupervised Learning & Autoencoders
-----

<center><img src="http://blog.venturesity.com/wp-content/uploads/2015/06/clustering.png" height="500"/></center>

By The End Of This Session You Should Be Able To:
----

- Explain the difference between Supervised & Unsupervised Learning
- List examples of Unsupervised Learning
- Define and diagram an Autoencoder
- Explain practical applications for Autoencoders
- List limitations of Autoencoders

What is Unsupervised Learning (UL)?
------
<br>
<center><img src="http://oliviaklose.azurewebsites.net/content/images/2015/02/2-supervised-vs-unsupervised-1.png" height="500"/></center>

Given inputs, find "interesting patterns" in the data.

Also known as Descriptive Learning or Knowledge Discovery.

Check for understanding
------

What is the best error metric (e.g., accuracy, precision, recall,…) for UL?

__None__. Without labels there is no ground truth to score against, so none of these metrics apply.

Framing Unsupervised Learning
------

Density estimation

Build models with the form: p(x|θ)

x is the data and θ is the set of parameters of the generative model
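
As a minimal sketch of unconditional density estimation, the snippet below assumes the model family is a one-dimensional Gaussian, so θ is just a mean and a standard deviation. The toy data and the names `fit_gaussian` and `density` are illustrative assumptions, not part of the lecture.

```python
import math
import statistics

def fit_gaussian(xs):
    """Estimate theta = (mean, std) from unlabeled data by maximum likelihood."""
    mu = statistics.fmean(xs)
    sigma = statistics.pstdev(xs, mu)
    return mu, sigma

def density(x, theta):
    """Evaluate p(x | theta) for a 1-D Gaussian."""
    mu, sigma = theta
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

data = [4.8, 5.1, 5.0, 4.9, 5.2]   # only inputs x, no labels y
theta = fit_gaussian(data)

typical = density(5.0, theta)      # a point that looks like the training data
unusual = density(9.0, theta)      # a point far from the training data
print(typical > unusual)
```

A point that resembles the training data gets a much higher density than one far away from it, which is the sense in which the model has found a "pattern".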

Check for understanding
------

If unsupervised learning is p(x|θ), what is supervised learning?

p(y|x,θ)

If Supervised Learning p(y|x,θ) is __conditional__ density estimation,

Unsupervised Learning p(x|θ) is __unconditional__ density estimation.

Check for understanding
----

During the RL lecture, I used a cake metaphor for RL, SL, and UL.

If RL is the cherry, what is the icing and the cake?

Let them eat cake 🍰! 
-----

<br>

<center><img src="https://cdn-images-1.medium.com/max/800/1*KDvA9Fq3lm-eQOyGlcKAKg.png" height="500"/></center>

Source: Dmytro (Dima) Lituiev (UCSF)

Supervised Learning
------

<center><img src="images/tiger_supervised.png" height="500"/></center>

Given: features

Task: predict target/label

Autoencoders (AE)
-----

<center><img src="images/tiger_autoencoder.png" height="500"/></center>

- Only features, no label

- Learn a representation from features

Classifier Analogy
----
<br>

<center><img src="images/supervised_kid.png" height="500"/></center>

A kid learning to name objects.

Autoencoders Analogy
----- 
<br>
<center><img src="images/unsupervised_kid.png" height="500"/></center>

Drawing a picture from memory or retelling a story by heart.

What are autoencoders (AE) good for?
------

- Dimensionality reduction, especially for data visualization
- Compression
- Feature extraction
- Data denoising / noise reduction
- Anomaly detection 
- Pattern generation

AE Pattern generation
-----

<center><img src="images/ugly_sweaters.png" height="500"/></center>

[Source: Stitchfix Blog](http://multithreaded.stitchfix.com/assets/images/blog/random_shirts3.png)

Check for understanding
-----

What other DL architecture can generate realistic reconstructions of images?

<center><img src="http://www.kdnuggets.com/wp-content/uploads/generative-adversarial-network.png" width="500"/></center>

GANs often generate sharper, more realistic images than AE.

Check for understanding
------

<center><img src="images/w2v_neural_net_blank.png" height="500"/></center>

Where else in the gU curriculum have you seen AE?

word2vec is an Autoencoder
-----

<center><img src="images/w2v_neural_net.png" height="500"/></center>

AE are "fancy": They have "bow-ties" 🎀
------

<center><img src="http://2.bp.blogspot.com/-xdfotR5CIoY/VjJ7UKrFP2I/AAAAAAAAFlM/sHI4T4j0IrY/s1600/autoencoders.png" height="500"/></center>

AE for data compression
-----

<center><img src="http://static.squarespace.com/static/531f2c4ee4b002f5b011bf00/t/536bdcefe4b03580f8f6bb16/1399577848961/hbosiliconvalleypiedpiperoldlogo" height="500"/></center>

1. Data specific
2. Lossy
3. Learned automatically from data

1. Autoencoders are data specific
------

AE can only compress data similar to what they were trained on. Learned weights do not generalize across domains.

Our AE trained on tigers 🐯 won't do well on faces 👩.

This differs from other compression algorithms (e.g., MP3), which are rule-based and can compress any content within a data format.

2. Autoencoders are lossy
------

Decompressed outputs will be degraded compared to the original inputs.

In contrast to lossless compression (e.g., zip files).
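
To make the lossy versus lossless distinction concrete, here is a small sketch. It uses `zlib` for the lossless round trip and, as a stand-in for the lossy case, a hand-written 4-bit quantizer. The quantizer is an assumption chosen for brevity; it is not an autoencoder, it only shows the same kind of degradation.

```python
import zlib

original = bytes(range(0, 256, 3)) * 4

# Lossless: decompressing gives back exactly the original bytes.
restored = zlib.decompress(zlib.compress(original))

# Lossy stand-in (NOT an autoencoder): keep only the top 4 bits of each byte.
def lossy_encode(data):
    return bytes(b >> 4 for b in data)

def lossy_decode(code):
    # Map each 4-bit code back to the middle of its bucket.
    return bytes((c << 4) + 8 for c in code)

approx = lossy_decode(lossy_encode(original))
max_error = max(abs(a - b) for a, b in zip(original, approx))

print(restored == original)   # lossless round trip
print(approx == original)     # lossy round trip
print(max_error)              # worst per-byte error
```

The lossy output is close to the input but never guaranteed to match it, which is the behaviour to expect from an AE reconstruction as well.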

3. Autoencoders are learned automatically from data
-------

We ❤️ machine learning!

Easy to train specialized AE instances.

AE do __not__ require any new hand-written rules, just training data and time.

[Source](https://blog.keras.io/building-autoencoders-in-keras.html)

Are AE useful for data compression?
------

Usually, not really. 😞

Autoencoders are data-specific. Thus, they are generally impractical for real-world data compression problems.

But if you have specific data (e.g., words) then use AE.

Maybe in the future this limitation will be overcome. ✨

AE Parts List
------

<center><img src="images/tiger_autoencoder.png" height="500"/></center>

You need three things: 

1. An encoding function
2. A decoding function
3. A distance / loss function 
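
The three parts can be sketched in a few lines. This is a toy under stated assumptions: the encoder and decoder below are hand-picked (averaging and copying) rather than learned, and the names `encode`, `decode`, and `loss` are illustrative.

```python
def encode(x):
    """Encoder: squeeze a 2-D point into a 1-D code (here, simply the mean)."""
    return [(x[0] + x[1]) / 2]

def decode(z):
    """Decoder: expand the 1-D code back into a 2-D point."""
    return [z[0], z[0]]

def loss(r, x):
    """Distance function: squared error between reconstruction and input."""
    return sum((ri - xi) ** 2 for ri, xi in zip(r, x))

x = [3.0, 5.0]
z = encode(x)      # [4.0]
r = decode(z)      # [4.0, 4.0]
err = loss(r, x)   # 2.0
print(z, r, err)
```

In a real AE the encoder and decoder are neural networks, and training adjusts their weights to push the loss down.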

AE Formalism
-----

| Concept | Symbol |
|:-------|:------:|
| original data | x |
| encoder | h(⋅) |
| encoding or "hidden/latent state" | z = h(x) |
| decoder (g for generator) | g(⋅) |
| reconstruction | r = g(z) |
| identity mapping | r = g(h(x)) ≈ x |
| reconstruction error | squared error (MSE up to a 1/n factor): L(r, x) = ∣∣r − x∣∣<sup>2</sup> |
| constraint loss (optional), e.g., L1 norm for sparsity | C(z) = ∑ ∣z<sub>i</sub>∣ |
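
The two loss terms in the table can be computed directly. The sketch below is illustrative only: the vectors are made up, and the weighting factor `lam` is an assumption that is not in the table (with `lam=1.0` the total is the plain sum of the two terms).

```python
def reconstruction_error(r, x):
    """L(r, x) = ||r - x||^2."""
    return sum((ri - xi) ** 2 for ri, xi in zip(r, x))

def constraint_loss(z):
    """C(z) = sum_i |z_i| (L1 norm, encourages sparse codes)."""
    return sum(abs(zi) for zi in z)

def total_loss(x, z, r, lam=1.0):
    """Reconstruction error plus (optionally weighted) constraint loss."""
    return reconstruction_error(r, x) + lam * constraint_loss(z)

x = [1.0, 2.0, 3.0]   # original data
z = [0.5, -1.5]       # hidden / latent state
r = [1.0, 2.0, 2.0]   # reconstruction

print(reconstruction_error(r, x))   # 1.0
print(constraint_loss(z))           # 2.0
print(total_loss(x, z, r))          # 3.0
```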


AE architecture
-----

<center><img src="http://ufldl.stanford.edu/tutorial/images/Autoencoder636.png" width="450"/></center>

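The autoencoder tries to learn a function:  
h<sub>W,b</sub>(x) ≈ x.

Here h<sub>W,b</sub> denotes the whole network (encoder plus decoder), i.e., g(h(x)) in the formalism table. In other words, it is trying to learn an approximation to the identity function, so as to output x̂ that is similar to x.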

[Source](http://ufldl.stanford.edu/tutorial/unsupervised/Autoencoders/)

AE Learning
-----

Minimize the information __loss__ between the original data and its reconstruction (the decompressed version of the compressed representation).

Total loss = Reconstruction error + Constraint loss 

Minimize the loss with Stochastic Gradient Descent (SGD) or one of its variants.
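
A minimal sketch of this training loop, assuming a tiny linear autoencoder (2 inputs, 1 hidden unit, 2 outputs), squared-error loss, no constraint term, and plain SGD with hand-derived gradients. The data, initial weights, and learning rate are all illustrative choices.

```python
import random

# Toy data lying on a line, so one hidden unit is enough to describe it.
data = [[t, 2 * t] for t in (-1.0, -0.5, 0.0, 0.5, 1.0)]

w = [0.1, 0.1]   # encoder weights: z = w . x
v = [0.1, 0.1]   # decoder weights: r = v * z

def forward(x):
    z = w[0] * x[0] + w[1] * x[1]
    r = [v[0] * z, v[1] * z]
    return z, r

def dataset_loss():
    total = 0.0
    for x in data:
        _, r = forward(x)
        total += sum((ri - xi) ** 2 for ri, xi in zip(r, x))
    return total

initial_loss = dataset_loss()

rng = random.Random(0)
lr = 0.01
for epoch in range(200):
    rng.shuffle(data)                    # "stochastic": random sample order
    for x in data:                       # one sample per update
        z, r = forward(x)
        err = [2 * (r[i] - x[i]) for i in range(2)]   # dL/dr
        grad_z = err[0] * v[0] + err[1] * v[1]        # dL/dz
        for i in range(2):
            v[i] -= lr * err[i] * z                   # dL/dv_i = dL/dr_i * z
            w[i] -= lr * grad_z * x[i]                # dL/dw_i = dL/dz * x_i

final_loss = dataset_loss()
print(initial_loss, final_loss)
```

In practice a deep learning library computes these gradients automatically; writing them out by hand only works for a model this small.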

AE structure
------

<center><img src="images/tiger_autoencoder.png" height="500"/></center>

Remember: AE try to learn the identity function (a relatively trivial task).

But by placing constraints on the network, we can discover interesting structure in the data.




AE constraints
-----

1. Limit the number of hidden units
2. L1 regularization

By having fewer hidden states (nodes) than the input dimensionality, AE creates an "undercomplete representation".
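
As a rough illustration of an undercomplete bottleneck, the sketch below pushes a 4-dimensional input through 2 hidden units and back. The weight matrices are hand-picked assumptions (pairwise averaging, then copying), not learned values.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

# Encoder: 2 x 4 matrix, so 4 inputs are squeezed into 2 hidden units.
W_enc = [[0.5, 0.5, 0.0, 0.0],
         [0.0, 0.0, 0.5, 0.5]]

# Decoder: 4 x 2 matrix, so 2 hidden units are expanded back to 4 outputs.
W_dec = [[1.0, 0.0],
         [1.0, 0.0],
         [0.0, 1.0],
         [0.0, 1.0]]

x = [1.0, 3.0, 2.0, 2.0]
z = matvec(W_enc, x)   # hidden code
r = matvec(W_dec, z)   # reconstruction

print(len(x), len(z), len(r))
print(z, r)
```

The reconstruction matches the last two inputs but not the first two: with only 2 hidden units, some detail has to be thrown away.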

Check for understanding
-----

<center><img src="images/auto3.png" width="500"/></center>

How many input dimensions are there?  
How many compressed dimensions are there?

There are 4 input dimensions that are compressed to 2 dimensions.

Check for understanding
------
<center><img src="images/same.png" height="500"/></center>
What if there is an __equal__ number of input, hidden, and output nodes?

There is no compression. 

The AE can simply learn the identity function. Same data in and out!
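
A quick way to see this, assuming linear layers with no activation function: if the encoder and decoder weights are both the identity matrix, the reconstruction error is zero for any input, yet nothing about the data has been learned.

```python
def identity_matrix(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

n = 3                       # same number of input, hidden, and output nodes
W_enc = identity_matrix(n)
W_dec = identity_matrix(n)

x = [0.2, -1.0, 4.0]
z = matvec(W_enc, x)        # hidden state is just a copy of x
r = matvec(W_dec, z)        # reconstruction is a copy of the copy
error = sum((ri - xi) ** 2 for ri, xi in zip(r, x))

print(r == x, error)
```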

Check for understanding
------
<center><img src="images/more.png" height="500"/></center>
What if there are __more__ hidden nodes than input/output nodes?

The NN can learn to map the data to different parts of the hidden layer with minimal reuse of weights.

AE architecture
-----

1. Encoder
2. Decoder

1. Encoder
------

<center><img src="images/encoder.png" height="500"/></center>

Compresses a multi-dimensional observed data sample to a hidden / latent representation of the data

The hidden representation is low-dimensional or otherwise constrained.

2. Decoder
------

<center><img src="images/decoder.png" height="500"/></center>

Unpacks / interprets / decompresses the hidden / latent representation into the __reconstruction__ of a sample

<center><img src="https://cdn-images-1.medium.com/max/1000/1*j_y0bNZLP1yzqtyF48Z3Ug.png" height="500"/></center>

AE are NOT unsupervised (I've misled you)
------

<center><img src="https://blog.keras.io/img/ae/autoencoder_schema.jpg" height="500"/></center>

AE are a __self-supervised__ technique.

Self-supervised: A type of supervised learning where the targets are generated from the input data.

Skip-gram architecture for word2vec is self-supervised
------

<center><img src="images/skip-gram.png" height="500"/></center>
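
To show what "targets generated from the input data" means for skip-gram, here is a small sketch that turns raw text into (center word, context word) training pairs. The function name `skipgram_pairs`, the example sentence, and the window size are illustrative assumptions.

```python
def skipgram_pairs(tokens, window=2):
    """Build (input, target) pairs: each word predicts its neighbours."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the quick brown fox".split()
pairs = skipgram_pairs(tokens, window=1)
for center, context in pairs:
    print(center, "->", context)
```

No human labeled anything here: both the inputs and the targets come straight from the text.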

Check for understanding
------

What is the difference between an MLP and an AE?

The output of an AE has to be the same size as the input. An MLP's output is often much smaller.
<center><img src="https://qph.ec.quoracdn.net/main-qimg-82b1a05b73274bc412f629d8e035af30.webp" height="500"/></center>
[Source](https://www.quora.com/What-is-the-difference-between-a-neural-network-and-an-autoencoder-network)

Stacked Autoencoders: We must go Deeper!
------

<center><img src="images/stacked.png" height="500"/></center>

A stacked autoencoder is a neural network consisting of multiple layers of sparse autoencoders in which the outputs of each layer are wired to the inputs of the successive layer.

[Learn more here](http://ufldl.stanford.edu/wiki/index.php/Stacked_Autoencoders)
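
The wiring can be sketched as a chain of encoders, each consuming the previous layer's output. This is only an illustration of the data flow: `make_pooling_layer` is a hypothetical helper that builds fixed averaging weights, whereas real stacked autoencoders use trained, nonlinear layers.

```python
def make_pooling_layer(n_in):
    """Hypothetical fixed 'encoder' that halves the dimension by averaging pairs."""
    n_out = n_in // 2
    return [[0.5 if j // 2 == i else 0.0 for j in range(n_in)]
            for i in range(n_out)]

def apply_layer(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def stacked_encode(layers, x):
    """Feed each layer's output into the next layer; keep every activation."""
    activations = [x]
    for W in layers:
        activations.append(apply_layer(W, activations[-1]))
    return activations

layers = [make_pooling_layer(8), make_pooling_layer(4)]   # 8 -> 4 -> 2
x = [1.0, 3.0, 2.0, 2.0, 0.0, 4.0, 5.0, 7.0]
activations = stacked_encode(layers, x)

print([len(a) for a in activations])
print(activations[-1])
```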

Summary
----

- Unsupervised Learning learns useful representations of data without labels.
- Autoencoders (AE) can be considered a variation of Unsupervised Learning (more precisely, they are self-supervised).
- AE are often used to compress data. But they have limited applications.
- AE consist of 3 parts:
    1. Encoder
    2. Decoder
    3. Distance function
- AE try to learn the identity mapping, but constraints prevent a trivial copy:
    - Reduced number of hidden nodes
    - Regularization: Typically L1 
- AE are just another variation of NN, with the requirement of an equal number of input and output nodes

<br>
<br>
--

---
Bonus Material
---


Variational Autoencoders (VAEs)
------

The mathematical basis of VAEs actually has relatively little to do with
classical autoencoders, e.g., sparse autoencoders or denoising autoencoders.

P(X) = ∫ P(X|z; θ) P(z) dz

VAEs approximately maximize the equation above under a latent-variable graphical model (usually drawn in plate notation).
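
To make the integral less abstract, the sketch below estimates P(X) by naive Monte Carlo: sample z from the prior and average P(X|z). This is not how a VAE is trained; it only illustrates the quantity being maximized. The standard normal prior and the unit-variance Gaussian likelihood centred on z are toy assumptions, chosen because the exact answer is then known in closed form.

```python
import math
import random

def prior_sample(rng):
    """Draw z ~ P(z); assumed here to be a standard normal."""
    return rng.gauss(0.0, 1.0)

def likelihood(x, z):
    """P(X|z): assumed Gaussian centred on z with unit variance (toy decoder)."""
    return math.exp(-0.5 * (x - z) ** 2) / math.sqrt(2 * math.pi)

def marginal_mc(x, n_samples=20000, seed=0):
    """Estimate P(X) = integral of P(X|z) P(z) dz by averaging over prior samples."""
    rng = random.Random(seed)
    total = sum(likelihood(x, prior_sample(rng)) for _ in range(n_samples))
    return total / n_samples

x = 0.5
estimate = marginal_mc(x)

# For this toy model the marginal is Gaussian with mean 0 and variance 2.
exact = math.exp(-x ** 2 / 4) / math.sqrt(4 * math.pi)
print(estimate, exact)
```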

They are called “autoencoders” only because the final training objective that derives from this setup does have an encoder and a decoder, and resembles a traditional autoencoder.

[Source](https://arxiv.org/pdf/1606.05908.pdf)

Restricted Boltzmann Machines & Deep Belief Network
-----

- Rarely used in applied settings today
- Deep Belief Networks are stacked Restricted Boltzmann Machines

<br>
<br>