<a href="https://colab.research.google.com/github/ayushmangupta/TF2/blob/master/Generative_Models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Terminology 

**PDF:** A probability density function is a non-negative function that integrates to one. The likelihood is defined as the joint density of the observed data, viewed as a function of the parameter.

**Likelihood function:** The likelihood function is a function of the parameter only, with the data held fixed.
- The likelihood function is a function of the unknown parameter θ (conditioned on the data). As such, it typically does not have area 1 (i.e. the integral over all possible values of θ is not 1) and is therefore, by definition, not a pdf.
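A minimal sketch of why a likelihood is not a density in θ, assuming a binomial model (the discrete analogue of a pdf) and hypothetical data of 7 successes in 10 trials. Summed over the data the function totals 1, but integrated over θ it gives 1 / (n + 1).

```python
from math import comb

def binomial_likelihood(theta, k, n):
    """Probability of k successes in n trials when the success rate is theta."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)

def integrate(f, a, b, steps=10_000):
    """Midpoint-rule numerical integration of f over [a, b]."""
    h = (b - a) / steps
    return sum(f(a + (i + 0.5) * h) for i in range(steps)) * h

k, n = 7, 10  # hypothetical observed data, for illustration only

# As a function of the data (theta fixed) the values sum to 1 ...
pmf_total = sum(binomial_likelihood(0.6, j, n) for j in range(n + 1))

# ... but as a function of theta (data fixed) the area is 1 / (n + 1).
likelihood_area = integrate(lambda t: binomial_likelihood(t, k, n), 0.0, 1.0)

print(pmf_total, likelihood_area, 1 / (n + 1))
```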

**Wasserstein distances:** The Wasserstein distance is commonly used in place of the Kullback-Leibler divergence (which is closely related to the cross-entropy loss in the deep learning context). In contrast to the latter, Wasserstein distances consider not only the values of the probability distribution or density at any given point, but also incorporate spatial information, via the underlying metric, about where those differences occur. Intuitively, it yields a smaller distance if probability mass is moved to a nearby point or region and a larger distance if probability mass is moved far away.
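The intuition above can be sketched on one-dimensional histograms. This is an illustration under stated assumptions: bins are unit-spaced, the Wasserstein-1 distance is computed from the difference of the CDFs (valid in one dimension), and the KL divergence clamps zero probabilities to a small `eps`, since the true value would be infinite.

```python
from math import log

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) on shared bins; q is clamped to eps to keep the value finite."""
    return sum(pi * log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def wasserstein_1d(p, q):
    """Wasserstein-1 distance for histograms on unit-spaced bins."""
    total, cdf_gap = 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_gap += pi - qi       # running difference of the two CDFs
        total += abs(cdf_gap)
    return total

base = [1.0, 0.0, 0.0, 0.0, 0.0]  # all mass in bin 0
near = [0.0, 1.0, 0.0, 0.0, 0.0]  # mass moved one bin over
far = [0.0, 0.0, 0.0, 0.0, 1.0]   # mass moved four bins over

# Wasserstein grows with how far the mass moved.
print(wasserstein_1d(base, near), wasserstein_1d(base, far))
# KL only sees that the supports do not overlap, so both values are equal.
print(kl_divergence(base, near), kl_divergence(base, far))
```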





- Hierarchical Bayesian models 
- Multivariate kernel methods 
- Discriminative machine learning 
- Clustering algorithms
- Dimensionality reduction 



# Differentiable Inference and Generative Models

**What are generative models?**

Generative modeling loosely refers to building a model of data, for instance p(image), that we can sample from. This is in contrast to discriminative modeling, such as regression or classification, which tries to estimate conditional distributions such as p(class | image).

**Why generative models?**

Even when we're only interested in making predictions, there are practical reasons to build generative models:

- **Data efficiency and semi-supervised learning:** Generative models can reduce the amount of data required. As a simple example, building an image classifier $p(\text{class} \mid \text{image})$ requires estimating a very high-dimensional function, possibly requiring a lot of data or clever assumptions. In contrast, we could model the data as being generated from some low-dimensional or sparse latent variables $z$, as in $p(\text{image}) = \int p(\text{image} \mid z)\, p(z)\, dz$. Then, to do classification, we only need to learn $p(\text{class} \mid z)$, which will usually be a much simpler function. This approach also lets us take advantage of unlabeled data, which is known as semi-supervised learning.
- **Model checking by sampling:** Understanding complex regression and classification models is hard. It is often not clear what these models have learned from the data and what they missed. There is a simple way to sanity-check and inspect generative models: sample from them, and compare the sampled data to the real data to see if anything is missing.
- **Understanding:** Generative models usually assume that each datapoint is generated from a (usually low-dimensional) latent variable. These latent variables are often interpretable, and sometimes can tell us about the hidden causes of a phenomenon. They can also sometimes let us do interesting things such as interpolating between examples.
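The latent-variable integral $p(x) = \int p(x \mid z)\, p(z)\, dz$ can be estimated by sampling $z$ from the prior. The sketch below assumes a toy one-dimensional model that is not part of these notes: $z \sim N(0, 1)$ and $x \mid z \sim N(z, 1)$, for which the marginal is known in closed form as $N(0, 2)$.

```python
import random
from math import exp, pi, sqrt

def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return exp(-((x - mean) ** 2) / (2 * var)) / sqrt(2 * pi * var)

def marginal_mc(x, num_samples=100_000, seed=0):
    """Monte Carlo estimate of p(x) = E_{z ~ p(z)}[p(x | z)]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(0.0, 1.0)                  # sample from the prior p(z)
        total += normal_pdf(x, mean=z, var=1.0)  # evaluate p(x | z)
    return total / num_samples

x = 0.5
estimate = marginal_mc(x)
exact = normal_pdf(x, mean=0.0, var=2.0)  # closed form for this toy model
print(estimate, exact)
```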

### Types of Generative Models
 
 - VAEs
 - Autoregressive models
 - GANs


Lectures



# VAE
- VAEs are trained by maximizing the variational lower bound (also known as the evidence lower bound, ELBO) on the log-likelihood of the data.
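The variational lower bound has the standard form $\log p(x) \ge \mathbb{E}_{q(z)}[\log p(x \mid z)] - KL(q(z) \,\|\, p(z))$. As a hedged illustration, the sketch below evaluates it in closed form for an assumed toy model ($z \sim N(0, 1)$, $x \mid z \sim N(z, 1)$, Gaussian $q$ with mean `m` and variance `s2`); a real VAE would use neural networks and sampling instead.

```python
from math import log, pi

def elbo(x, m, s2):
    """ELBO for z ~ N(0, 1), x | z ~ N(z, 1) with q(z) = N(m, s2)."""
    expected_log_lik = -0.5 * log(2 * pi) - 0.5 * ((x - m) ** 2 + s2)
    kl_q_prior = 0.5 * (s2 + m**2 - 1.0 - log(s2))
    return expected_log_lik - kl_q_prior

def log_evidence(x):
    """Exact log p(x); marginally x ~ N(0, 2) in this toy model."""
    return -0.5 * log(2 * pi * 2.0) - x**2 / 4.0

x = 1.0
print(log_evidence(x))         # exact log-likelihood
print(elbo(x, m=0.0, s2=1.0))  # loose bound: q equals the prior
print(elbo(x, m=0.5, s2=0.5))  # tight bound: q equals the exact posterior N(x/2, 1/2)
```

The bound is tight only when `q` matches the true posterior, which is why maximizing it trains both the model and the approximate posterior.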

# GAN with Objective Function



### GAN:
- GANs do not require any approximation and can be trained end-to-end through the differentiable network.
- The basic idea of GANs is to simultaneously train a discriminator and a generator: the discriminator aims to distinguish between real samples and generated samples, while the generator tries to generate fake samples that look as real as possible, making the discriminator believe that the fake samples come from the real data.
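The two-player game can be written as the original minimax objective $\min_G \max_D \; \mathbb{E}_{x}[\log D(x)] + \mathbb{E}_{z}[\log(1 - D(G(z)))]$. The sketch below evaluates both sides of it from discriminator outputs; the numbers are hypothetical and no networks are trained.

```python
from math import log

def discriminator_loss(d_real, d_fake):
    """Negative of the value function that the discriminator maximizes.

    d_real and d_fake hold discriminator outputs in (0, 1) on real and
    generated samples respectively.
    """
    real_term = sum(log(d) for d in d_real) / len(d_real)
    fake_term = sum(log(1.0 - d) for d in d_fake) / len(d_fake)
    return -(real_term + fake_term)

def generator_loss(d_fake):
    """Minimax generator loss: the mean of log(1 - D(G(z)))."""
    return sum(log(1.0 - d) for d in d_fake) / len(d_fake)

# Hypothetical discriminator outputs, for illustration only.
confident_d = discriminator_loss(d_real=[0.9, 0.95], d_fake=[0.1, 0.05])
fooled_d = discriminator_loss(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
print(confident_d, fooled_d)

# The generator's loss drops when the discriminator rates its samples as real.
print(generator_loss([0.1, 0.05]), generator_loss([0.9, 0.95]))
```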

### Applications
- Image generation
  - InfoGAN
- Image super-resolution
  - Image Super-Resolution Using a Generative Adversarial Network
- Text-to-image synthesis
  - Generative adversarial text-to-image synthesis
- Image-to-image translation
  - Image-to-image translation with conditional adversarial networks
  
  

| GAN | Objective Function |
| ------------- |:-------------:|
| GAN (Original) | Jensen-Shannon divergence |
| WGAN | Earth-Mover (EM) distance |
| Improved WGAN | No weight clipping (gradient penalty instead) |
| LSGAN | L2 loss objective |
| RWGAN | Relaxed WGAN |
| McGAN | Mean and covariance feature matching |
| GMMN | Maximum Mean Discrepancy |
| MMD GAN | Adversarial kernel for GMMN |
| Cramer GAN | Cramer distance |
| Fisher GAN | Chi-square objective |
| EBGAN | Autoencoder instead of discriminator |
| BEGAN | WGAN and EBGAN merged objective |
| MAGAN | Dynamic margin on hinge loss for EBGAN |



# GAN Variations and Their Loss Functions




### Jensen-Shannon Divergence
If $P$ and $Q$ are probability measures, then the Jensen-Shannon divergence is:
$$ JSD(P, Q) = \frac{1}{2} KL(P \| R) + \frac{1}{2} KL(Q \| R) $$ where $R = \frac{1}{2} (P+Q)$
- $R$ is the mid-point measure and $KL(\cdot \| \cdot)$ is the Kullback-Leibler divergence.
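A small sketch of the Jensen-Shannon divergence for discrete distributions, using the standard definition with a weight of 1/2 on each KL term. The example distributions are made up for illustration.

```python
from math import log

def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence with mid-point measure r = (p + q) / 2."""
    r = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, r) + 0.5 * kl(q, r)

p = [0.5, 0.5, 0.0]
q = [0.0, 0.5, 0.5]
disjoint_a, disjoint_b = [1.0, 0.0], [0.0, 1.0]

print(jsd(p, q), jsd(q, p))         # symmetric in its arguments
print(jsd(disjoint_a, disjoint_b))  # reaches the upper bound log(2)
```

Unlike the KL divergence, the result stays finite when the supports do not overlap, because the mid-point measure is positive wherever either distribution is.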


### WGAN
WGAN does not use the JSD to measure divergence; instead it uses the Earth-Mover (EM) distance, also known as the Wasserstein distance.
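For reference, the standard form of the EM distance between a real distribution $P_r$ and a generator distribution $P_g$ is given here. The symbols are assumptions taken from the usual WGAN presentation rather than from these notes.

$$ W(P_r, P_g) = \inf_{\gamma \in \Pi(P_r, P_g)} \mathbb{E}_{(x, y) \sim \gamma}\big[\lVert x - y \rVert\big] $$

Here $\Pi(P_r, P_g)$ is the set of joint distributions whose marginals are $P_r$ and $P_g$. Each $\gamma$ can be read as a transport plan, and the distance is the cost of the cheapest plan that moves the mass of $P_r$ onto $P_g$.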




### LSGAN (Least Squares GAN) [Paper](https://arxiv.org/pdf/1611.04076.pdf)
- The loss for real samples should be lower than the loss for fake samples. This allows the LSGAN to focus on fake samples that have a large margin.
- The authors introduce regularization in the form of weight decay, encouraging the weights of the loss function to lie within a bounded region that guarantees the theoretical requirements.
- **Important:** The motivation for the L2 loss is that a log loss mostly cares about whether or not a sample is labeled correctly. It does not heavily penalize a sample based on its distance from correct classification; if a label is correct, it does not worry about it further. In contrast, an L2 loss does care about distance: data far away from where it should be is penalized according to that distance. LSGAN argues that this produces more informative gradients.

Summary
- Loss function instead of a critic
- Weight decay regularization to bound the loss function
- L2 loss instead of log loss for proportional penalization
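The difference between the two losses shows up in their gradients. The sketch below assumes a scalar discriminator score `s` for a generated sample, a non-saturating log loss `-log(sigmoid(s))`, and a least-squares loss `0.5 * (s - target) ** 2` with target 1; these choices are illustrative assumptions, not the exact setup of the paper.

```python
from math import exp

def sigmoid(s):
    return 1.0 / (1.0 + exp(-s))

def log_loss_grad(s):
    """d/ds of the generator log loss -log(sigmoid(s))."""
    return -(1.0 - sigmoid(s))

def l2_loss_grad(s, target=1.0):
    """d/ds of the least-squares generator loss 0.5 * (s - target) ** 2."""
    return s - target

# Hypothetical scores for samples already classified as real (s > 0).
for s in (1.0, 5.0, 10.0):
    print(s, log_loss_grad(s), l2_loss_grad(s))
```

For large scores the log-loss gradient vanishes because the sample is already on the correct side, while the L2 gradient keeps growing with the distance from the target.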


### Cramer GAN
Cramer GAN starts by outlining an issue with the popular WGAN. It claims that there are three properties that a probability divergence should satisfy:

- Sum invariance
- Scale sensitivity
- Unbiased sample gradients

Of these properties, the authors argue that the Wasserstein distance lacks the final one, unlike KLD or JSD, which both have it. They demonstrate that this is an issue in practice and propose a new distance: the Cramer distance.

**The Cramer distance**

The Cramer distance looks somewhat similar to the EM distance. However, because of how it is defined, it does not suffer from the biased sample gradients that the EM distance does. This is proven in the paper, for readers who wish to dig into the mathematics.
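In one dimension the similarity is easy to see: the EM distance is the integral of $|F_P(x) - F_Q(x)|$ over $x$, while the squared Cramer distance is the integral of $(F_P(x) - F_Q(x))^2$, where $F_P$ and $F_Q$ are the CDFs. A minimal sketch on histograms with unit-spaced bins (an assumption made for illustration):

```python
from itertools import accumulate

def cdf_gaps(p, q):
    """Difference of the two CDFs at each unit-spaced bin."""
    return [fp - fq for fp, fq in zip(accumulate(p), accumulate(q))]

def em_distance(p, q):
    """1D Earth-Mover distance: sum of absolute CDF differences."""
    return sum(abs(g) for g in cdf_gaps(p, q))

def cramer_distance_squared(p, q):
    """1D squared Cramer distance: sum of squared CDF differences."""
    return sum(g * g for g in cdf_gaps(p, q))

p = [0.5, 0.5, 0.0, 0.0]
q = [0.0, 0.0, 0.5, 0.5]
print(em_distance(p, q), cramer_distance_squared(p, q))
```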
 
 
### EBGAN
Setting up

1. Train an autoencoder on the original data.
2. Run generated images through this autoencoder.
3. Poorly generated images will have a high reconstruction loss, so the reconstruction loss becomes a good measure of sample quality.

Summary
- Autoencoder as the discriminator
- Reconstruction loss used as cost, setup similar to original GAN cost
- Fast, stable, and robust
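A sketch of how the reconstruction loss can act as the cost, using the margin-based losses from the EBGAN paper. The reconstruction errors and the margin below are hypothetical values; a real implementation would compute them with an autoencoder.

```python
def ebgan_discriminator_loss(recon_real, recon_fake, margin):
    """D(x) + max(0, margin - D(G(z))), where D(.) is a reconstruction error."""
    return recon_real + max(0.0, margin - recon_fake)

def ebgan_generator_loss(recon_fake):
    """The generator wants its samples to be reconstructed well."""
    return recon_fake

# Fake samples that reconstruct too well (error below the margin) cost the
# discriminator; fakes with error above the margin add nothing.
close_fakes = ebgan_discriminator_loss(recon_real=0.1, recon_fake=0.2, margin=1.0)
poor_fakes = ebgan_discriminator_loss(recon_real=0.1, recon_fake=1.5, margin=1.0)
print(close_fakes, poor_fakes)
print(ebgan_generator_loss(0.2), ebgan_generator_loss(1.5))
```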
    
    
### Boundary Equilibrium GAN

Boundary Equilibrium GAN (BEGAN) is an iteration on EBGAN. It instead uses the autoencoder reconstruction loss in a way that is similar to WGAN’s loss function.

In order to do this, a parameter needs to be introduced to balance the training of the discriminator and generator. This parameter is weighted as a running mean over the samples, dancing at the boundary between improving the two halves (thus where it gets its name: “boundary equilibrium”).
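A hedged sketch of the balancing step, following the update rule in the BEGAN paper: the balance term `k` is nudged so that the fake reconstruction loss tracks `gamma` times the real one. The loss values below are hypothetical, and the names `gamma` and `lambda_k` follow the paper's notation rather than these notes.

```python
def began_step(loss_real, loss_fake, k, gamma=0.5, lambda_k=0.001):
    """One BEGAN bookkeeping step given autoencoder reconstruction losses."""
    d_loss = loss_real - k * loss_fake
    g_loss = loss_fake
    # Proportional control: move k toward the point where
    # loss_fake equals gamma * loss_real.
    k_next = k + lambda_k * (gamma * loss_real - loss_fake)
    k_next = min(max(k_next, 0.0), 1.0)  # keep k inside [0, 1]
    # Convergence measure used to monitor training.
    convergence = loss_real + abs(gamma * loss_real - loss_fake)
    return d_loss, g_loss, k_next, convergence

k = 0.0
for loss_real, loss_fake in [(0.30, 0.05), (0.28, 0.10), (0.25, 0.20)]:
    d_loss, g_loss, k, measure = began_step(loss_real, loss_fake, k)
    print(round(d_loss, 5), round(g_loss, 5), round(k, 6), round(measure, 5))
```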


# Reference

- http://www.cs.toronto.edu/~duvenaud/courses/csc2541/
- https://www.adhiraiyan.org/deeplearning/03.00-Probability-and-Information-Theory
- Density estimation using Real NVP (youtube explanation and paper)
- Deep Learning with TF2 Book tf2 version [Link](https://github.com/adhiraiyan/DeepLearningWithTF2.0/blob/44deed6175d321c807cfc4812df6d0fed633f8bf/notebooks/04.00-Numerical-Computation.ipynb)