# Cost Function<hr>
## Training an autoencoder
- Machine learning models typically have 2 functions we're interested in: learning and inference
- **"Inference"** - too broad and overused, but generalizes the concept of the **"forward"** direction
- Supervised learning: prediction
- Unsupervised learning: transforming data into latent representation
- scikit-learn: `fit(X, Y)` / `predict(X)` or `fit(X)` / `transform(X)`
- Next step: how to fit / train / learn?
- As usual, we start with a cost function, then try to minimize it

## Cost Function
- A little tricky
- Will look a little strange compared to what we're used to (common theme in this course)
- Would be beneficial to know about variational inference 
- But not required - we will derive a solution from first principles


- Outline:
- Start with what it is
- Then ask: "Does it make sense?"
- i.e. does it decrease as the autoencoder improves at its task?
- Then, a theoretical perspective considering deep learning / machine learning cost functions in general 
- Later: look at the cost function from a probabilistic perspective
- Since variational inference is part of Bayesian ML, the probabilistic perspective is of particular interest

## The ELBO 
- We call the objective function the **"ELBO"** - evidence lower bound 
- Reason why will be explained later
- We want to maximize the ELBO, therefore our **"cost"** (which is something to minimize) is the negative ELBO, i.e. -ELBO

![elbo](../images/elbo.PNG)
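
In text form, and assuming the image shows the standard variational autoencoder objective, the ELBO is the sum of the two terms discussed in the next two sections (an expected log-likelihood minus a KL divergence):

\\[ \\text{ELBO} = E_{q(z|x)}\\left[ \\log p(x|z) \\right] - KL\\left( q(z|x) \\,\\|\\, p(z) \\right) \\]
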
## Expected Log-Likelihood
- Looks strange, but it's just a very fancy way of doing what we usually do
- What is the expected value of a log probability?
- (Negative) cross-entropy!
- Since x and x_hat are Bernoulli probabilities:
![log_likelihood](../images/log_likelihood.PNG)


- If input/output are Gaussian - cross-entropy is squared error + some constants
- Same losses we used in our regular autoencoder
![log_likelihood2](../images/log_likelihood2.PNG)
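
The two reconstruction terms above can be sketched in a few lines. This is a minimal sketch, not a definitive implementation: it assumes a single sample, values in [0, 1] for the Bernoulli case, unit variance for the Gaussian case, and that `x_hat` is the decoder output for one sampled z. The function names are hypothetical, and plain Python lists stand in for tensors.

```python
import math

def bernoulli_log_likelihood(x, x_hat, eps=1e-7):
    """Log-likelihood of x under Bernoulli outputs x_hat.

    This is the negative binary cross-entropy, summed over dimensions.
    """
    total = 0.0
    for xi, pi in zip(x, x_hat):
        pi = min(max(pi, eps), 1.0 - eps)  # clip to avoid log(0)
        total += xi * math.log(pi) + (1.0 - xi) * math.log(1.0 - pi)
    return total

def gaussian_log_likelihood(x, x_hat):
    """Log-likelihood of x under a unit-variance Gaussian centered at x_hat.

    Equals -0.5 * squared error plus a constant that does not depend on x_hat.
    """
    squared_error = sum((xi - mi) ** 2 for xi, mi in zip(x, x_hat))
    constant = -0.5 * len(x) * math.log(2.0 * math.pi)
    return -0.5 * squared_error + constant

# Hypothetical example values: a good and a poor reconstruction of x
x = [1.0, 0.0, 1.0]
good = [0.9, 0.1, 0.8]
bad = [0.5, 0.5, 0.5]

print(bernoulli_log_likelihood(x, good), bernoulli_log_likelihood(x, bad))
print(gaussian_log_likelihood(x, good), gaussian_log_likelihood(x, bad))
```

In both cases the better reconstruction gets the higher log-likelihood, so maximizing this term is the same as minimizing the usual autoencoder loss.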

## KL-Divergence
- KL divergence between \\(q(z|x)\\) and \\(p(z)\\)
- In Bayesian ML we call \\(p(z)\\) the "prior"
- It is up to us to "choose it"
- For convenience, we choose \\(p(z) = N(0,1)\\)
- No theoretical underpinning, it's just convenient
- Weakness of Bayesian ML: a poorly chosen prior can lead to bad results even in a good model
![KL_divergence](../images/KL_divergence.PNG)
![KL_divergence2](../images/KL_divergence2.PNG)

- What does KL-Divergence do?
- Allows us to compare 2 probability distributions
- If q = p, KLD = 0
- If q != p, KLD > 0
- So this term encourages \\(q(z|x)\\) to "be like" \\(p(z) = N(0,1)\\)
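
To make the properties above concrete, here is a minimal sketch of the KL term. It assumes \\(q(z|x)\\) is a Gaussian with a separate mean and standard deviation per latent dimension (an assumption about the encoder, with hypothetical names), in which case the KL divergence to \\(N(0,1)\\) has the well-known closed form \\(\\frac{1}{2} \\sum (\\mu^2 + \\sigma^2 - 1 - \\log \\sigma^2)\\), summed over latent dimensions.

```python
import math

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions.

    mu and sigma are lists holding the mean and standard deviation
    of q(z|x) for each latent dimension (sigma must be positive).
    """
    return 0.5 * sum(
        m * m + s * s - 1.0 - math.log(s * s)
        for m, s in zip(mu, sigma)
    )

# q equals the prior: the divergence is zero
print(kl_to_standard_normal([0.0, 0.0], [1.0, 1.0]))

# q differs from the prior (shifted mean, or different spread): positive
print(kl_to_standard_normal([1.0, 0.0], [1.0, 1.0]))
print(kl_to_standard_normal([0.0, 0.0], [2.0, 0.5]))
```

The penalty grows as the mean moves away from 0 or the standard deviation moves away from 1, which is how this term pulls \\(q(z|x)\\) toward the prior.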

## Why does it make sense?
- Cost function is made up of 2 parts
- First part: Tells us how close our output is to target (we always have this, whether we are doing classification, regression, or reconstructing input)
- Second part: Regularization
- Common construction in machine learning, including deep learning models and SVMs (Support Vector Machines)
- Makes sense: Cost = Target-Output Penalty + Regularization Penalty
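
The two-part structure can be sketched as a single function. This is a minimal sketch under stated assumptions: one sample, Bernoulli outputs, a Gaussian \\(q(z|x)\\) described by a per-dimension mean and standard deviation, and a \\(N(0,1)\\) prior. The name `negative_elbo` and its arguments are hypothetical.

```python
import math

def negative_elbo(x, x_hat, mu, sigma, eps=1e-7):
    """Cost = target-output penalty + regularization penalty."""
    # Target-output penalty: binary cross-entropy between x and x_hat
    reconstruction_penalty = 0.0
    for xi, pi in zip(x, x_hat):
        pi = min(max(pi, eps), 1.0 - eps)  # clip to avoid log(0)
        reconstruction_penalty -= (
            xi * math.log(pi) + (1.0 - xi) * math.log(1.0 - pi)
        )

    # Regularization penalty: KL( N(mu, sigma^2) || N(0, 1) )
    regularization_penalty = 0.5 * sum(
        m * m + s * s - 1.0 - math.log(s * s)
        for m, s in zip(mu, sigma)
    )
    return reconstruction_penalty + regularization_penalty

# Hypothetical example values
x = [1.0, 0.0, 1.0]
print(negative_elbo(x, [0.9, 0.1, 0.8], mu=[0.0], sigma=[1.0]))  # good output
print(negative_elbo(x, [0.5, 0.5, 0.5], mu=[0.0], sigma=[1.0]))  # worse output
print(negative_elbo(x, [0.9, 0.1, 0.8], mu=[2.0], sigma=[1.0]))  # q far from prior
```

The cost rises either when the reconstruction gets worse or when \\(q(z|x)\\) drifts away from the prior, matching the two penalties described above.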

## Good Enough for Now
- We don't know "why" the cost function is this way, but we know it makes sense
- We will look at why it takes this form later (with much more theory and math)
- Not needed for implementation (coming next)