# Bayesian Perspective<hr>
- Where does our cost function come from?
- Consider \\(p(z|x)\\)
- We would like to know what it is
- Our encoder approximates this: \\(q(z|x)\\) approximates \\(p(z|x)\\)
- But why do we care about \\(p(z|x)\\)?
- \\(p(z|x)\\) is our posterior
- We want a good mapping from x -> z 

## Examples
- Gaussian Mixture Models and Hidden Markov Models
- GMM:
 - \\(p(z|x)\\) tells us "which cluster x belongs to"
 - z is cluster identity
 - in clustering that is our intended goal - given an x, find its cluster
- HMM:
 - z refers to "hidden state"
 - Ex. x = word in a sentence, z = part-of-speech tag
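
To make the GMM case concrete, here is a minimal sketch of computing \\(p(z|x)\\) with Bayes' rule for a 1-D mixture. The function name `gmm_posterior` and the two-cluster parameters are hypothetical, chosen only for illustration.

```python
import numpy as np

def gmm_posterior(x, weights, means, stds):
    """Return p(z = k | x) for a 1-D Gaussian mixture via Bayes' rule."""
    weights, means, stds = map(np.asarray, (weights, means, stds))
    # Likelihood p(x | z = k) under each Gaussian component k
    lik = np.exp(-0.5 * ((x - means) / stds) ** 2) / (stds * np.sqrt(2 * np.pi))
    joint = weights * lik        # p(x, z = k) = p(z = k) * p(x | z = k)
    return joint / joint.sum()   # divide by p(x) = sum_k p(x, z = k)

# Hypothetical mixture: two equally weighted clusters centered at 0 and 5
post = gmm_posterior(x=0.2, weights=[0.5, 0.5], means=[0.0, 5.0], stds=[1.0, 1.0])
print(post)           # posterior probability of each cluster given x
print(post.argmax())  # index of the most likely cluster for this x
```

Here the posterior can be computed exactly because z takes only a few values; the encoder \\(q(z|x)\\) is needed when this normalization is not tractable.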
 
## Classification
- In classification, we have targets \\(p(y|x)\\)
- Neural network predictions are \\(q(y|x)\\)
- Typically, \\(p(y = k|x) = 1\\) for some class k, and \\(p(y = k'|x) = 0\\) for all \\(k' \\neq k\\)
- The correct cost function for classification:
 - Cross-Entropy\\([p(y|x), q(y|x)]\\)
- Cross-Entropy = KL-Divergence + entropy of \\(p\\), which is constant with respect to \\(q\\) (so the gradient for both is the same)
- Only difference:
 - Supervised learning: we are given \\(y / p(y|x)\\)
 - Unsupervised learning: we don't know \\(z/p(z|x)\\)
- Approximating \\(p(z|x)\\) is the unsupervised equivalent of approximating \\(p(y|x)\\)
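
The relationship between cross-entropy and KL-divergence can be checked numerically. This is a small sketch with made-up distributions (`p_onehot`, `p_soft`, and `q` are illustrative values, not from any dataset): the difference between the two is the entropy of the target, which does not depend on the prediction and is zero for one-hot targets.

```python
import numpy as np

def cross_entropy(p, q):
    # H(p, q) = -sum_k p_k log q_k, skipping terms where p_k = 0
    mask = p > 0
    return -np.sum(p[mask] * np.log(q[mask]))

def entropy(p):
    # H(p) = -sum_k p_k log p_k, with 0 log 0 treated as 0
    mask = p > 0
    return -np.sum(p[mask] * np.log(p[mask]))

def kl_divergence(p, q):
    # KL(p || q) = sum_k p_k (log p_k - log q_k)
    mask = p > 0
    return np.sum(p[mask] * (np.log(p[mask]) - np.log(q[mask])))

p_onehot = np.array([0.0, 1.0, 0.0])  # typical classification target
p_soft = np.array([0.1, 0.8, 0.1])    # a non-one-hot target, for comparison
q = np.array([0.2, 0.7, 0.1])         # network prediction q(y|x)

for p in (p_onehot, p_soft):
    ce, kl, h = cross_entropy(p, q), kl_divergence(p, q), entropy(p)
    print(ce, kl + h)  # the two printed values agree for each target
```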

## Important
- Supervised learning: we want to find \\(p(y|x)\\), the true target
- Unsupervised learning: we want to find \\(p(z|x)\\), the true unobserved variable

## KL-Divergence 
- Now that we know we want \\(q(z|x)\\) to approximate \\(p(z|x)\\), how can we formulate it as a cost function?
- KL-divergence!

##### Before
![kl_divergence2](../images/kl_divergence2.PNG)

##### Expected value 
![expected_value](../images/expected_value.PNG)
Let's use the expected value instead.

##### After
![kl_divergence3](../images/kl_divergence3.PNG)
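
For reference, the standard definition written out in LaTeX for a discrete z (replace the sum with an integral for continuous z). This is assumed to match the figures above, using \\(q(z|x)\\) as the distribution the expectation is taken over:

```latex
\mathrm{KL}\big(q(z|x)\,\|\,p(z|x)\big)
  = \sum_z q(z|x)\,\log\frac{q(z|x)}{p(z|x)}
  = \mathbb{E}_{z \sim q(z|x)}\big[\log q(z|x) - \log p(z|x)\big]
```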


## Bayes Rule
![bayes_rule](../images/bayes_rule.PNG)
<br>
## Bayes Rule (cont.)
![bayes_rule2](../images/bayes_rule2.PNG)
<br>
## Why is this interesting?
![elbo2](../images/elbo2.PNG)
<br>
## Left-hand side
![elbo-left](../images/elbo_left.PNG)
<br>

## Re-arranging
![elbo3](../images/elbo3.PNG)
<br>
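
Written out under the standard definitions (assumed to match the figures), the steps from Bayes' rule to the rearranged form are sketched below. Here \\(\mathbb{E}\\) with subscript q denotes an expectation over \\(z \sim q(z|x)\\), and \\(\log p(x)\\) leaves the expectation because it does not depend on z.

```latex
\begin{aligned}
\mathrm{KL}\big(q(z|x)\,\|\,p(z|x)\big)
  &= \mathbb{E}_{q}\big[\log q(z|x) - \log p(z|x)\big] \\
  % Bayes' rule: p(z|x) = p(x|z) p(z) / p(x)
  &= \mathbb{E}_{q}\big[\log q(z|x) - \log p(x|z) - \log p(z)\big] + \log p(x) \\
  &= -\mathbb{E}_{q}\big[\log p(x|z)\big]
     + \mathrm{KL}\big(q(z|x)\,\|\,p(z)\big) + \log p(x) \\[6pt]
\log p(x) - \mathrm{KL}\big(q(z|x)\,\|\,p(z|x)\big)
  &= \mathbb{E}_{q}\big[\log p(x|z)\big]
     - \mathrm{KL}\big(q(z|x)\,\|\,p(z)\big)
\end{aligned}
```

The right-hand side of the last line is the quantity usually called the ELBO. Because the KL-divergence on the left is never negative, the right-hand side is a lower bound on \\(\log p(x)\\).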

## ELBO
![elbo4](../images/elbo4.PNG)
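
The identity \\(\log p(x) = \text{ELBO} + \text{KL}(q(z|x) \| p(z|x))\\) can be verified numerically for a small discrete latent variable. This is a sketch with made-up numbers: `prior`, `lik`, and `q` are hypothetical values for a z with three states and one fixed x.

```python
import numpy as np

prior = np.array([0.5, 0.3, 0.2])    # p(z)
lik = np.array([0.10, 0.40, 0.05])   # p(x | z) evaluated at one fixed x
q = np.array([0.2, 0.7, 0.1])        # q(z | x), an arbitrary approximation

evidence = np.sum(prior * lik)       # p(x) = sum_z p(z) p(x | z)
posterior = prior * lik / evidence   # p(z | x) by Bayes' rule

def kl(a, b):
    # KL(a || b) for strictly positive discrete distributions
    return np.sum(a * (np.log(a) - np.log(b)))

# ELBO = E_q[log p(x|z)] - KL(q(z|x) || p(z))
elbo = np.sum(q * np.log(lik)) - kl(q, prior)
# Gap between log p(x) and the ELBO
gap = kl(q, posterior)

print(np.log(evidence), elbo + gap)  # the two values agree
print(elbo <= np.log(evidence))      # the ELBO is a lower bound on log p(x)
```

Setting `q = posterior` drives the gap to zero, which is why maximizing the ELBO pushes \\(q(z|x)\\) toward \\(p(z|x)\\).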

# Summary
- Originally, I gave you a cost function and stated that we would minimize it
- From a practical standpoint, it made sense
- Cost = reconstruction error + regularization penalty
- Call it the ELBO (strictly, this cost is the negative of the ELBO, so minimizing it maximizes the ELBO)
- In deep learning / machine learning in general, "target-output penalty + regularization penalty" is often a starting premise
- Next, the probabilistic perspective: accurately estimating \\(p(z|x)\\) justifies the use of the ELBO
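
As a rough illustration of "reconstruction error + regularization penalty", here is a minimal NumPy sketch of the cost for one sample. It assumes a Bernoulli decoder, a diagonal Gaussian encoder, and a standard normal prior on z; those choices and the name `vae_cost` are assumptions for this example, not something stated above.

```python
import numpy as np

def vae_cost(x, x_recon, mu, log_var):
    """Negative ELBO estimate for one sample (assumed Bernoulli/Gaussian setup).

    x       : binary input vector
    x_recon : decoder output probabilities for a z sampled from q(z|x)
    mu      : encoder mean of q(z|x)
    log_var : encoder log-variance of q(z|x)
    """
    eps = 1e-7
    x_recon = np.clip(x_recon, eps, 1 - eps)  # avoid log(0)
    # Reconstruction error: -log p(x|z) under a Bernoulli decoder
    recon = -np.sum(x * np.log(x_recon) + (1 - x) * np.log(1 - x_recon))
    # Regularization: closed-form KL(N(mu, sigma^2) || N(0, I))
    kl = 0.5 * np.sum(mu ** 2 + np.exp(log_var) - 1.0 - log_var)
    return recon + kl, recon, kl

# Hypothetical values for a 3-pixel input and a 2-D latent space
x = np.array([1.0, 0.0, 1.0])
x_recon = np.array([0.9, 0.2, 0.7])
total, recon, kl = vae_cost(x, x_recon, mu=np.zeros(2), log_var=np.zeros(2))
print(total, recon, kl)  # the KL term is zero when q(z|x) equals the prior
```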

## Small note about variational inference
- For those of you who have studied VI before
- A common technique for solving VI problems is the mean-field approximation
- Not relevant if you don't already know what it is
- Like me, you may have been looking for something resembling the mean-field approximation
- But that is not what is being used for the variational autoencoder!