
Premise of GANs

A GAN takes a random sample from a latent or prior distribution as input and maps it to the data space. The task of training is to learn a deterministic function that can efficiently capture the dependencies and patterns in the data so that the mapped point resembles a sample generated from the data distribution.

Example:

I have generated 300 samples from an isotropic bivariate Gaussian distribution.

*Figure: 300 samples from an isotropic bivariate Gaussian.*

When passed through a deterministic function, the points form a ring, which suggests that a sufficiently high-capacity function may be able to model the data distribution of high-dimensional data like images. Neural networks are our best bet, as they are universal function approximators. Hence, deep neural networks are used to model the data distribution of images. Unlike MLE or KDE, this is implicit density estimation.

*Figure: the mapped points form a ring.*
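The exact mapping used in this repo isn't shown above, so here is a minimal sketch with an arbitrary deterministic function (`ring_map` is my own, hypothetical choice) that pushes isotropic Gaussian samples onto a ring:

```python
# Minimal sketch (not the repo's code): map isotropic Gaussian samples
# onto a ring with a fixed, deterministic function.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
z = rng.standard_normal((300, 2))  # 300 samples from an isotropic bivariate Gaussian

def ring_map(z, radius=2.0):
    """Hypothetical mapping: keep each point's direction, squash its norm around `radius`."""
    norm = np.linalg.norm(z, axis=1, keepdims=True)
    return z / norm * (radius + 0.1 * norm)

x = ring_map(z)

fig, axes = plt.subplots(1, 2, figsize=(8, 4))
axes[0].scatter(z[:, 0], z[:, 1], s=8)
axes[0].set_title("latent samples")
axes[1].scatter(x[:, 0], x[:, 1], s=8)
axes[1].set_title("mapped samples (ring)")
plt.show()
```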

Probability Review

*Figure: frequency table and joint distribution of two discrete random variables X and Y (counts out of 50 trials).*

Conditional Distribution

In the above table, fix the value of one random variable, say X = x_1; the distribution of Y when X = x_1 is called the conditional distribution, P(Y | X = x_1). The conditional expectation is the expectation of the conditional distribution.
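In symbols (the original formula image is missing here; these are the standard definitions):

$$
P(Y = y \mid X = x_1) = \frac{P(X = x_1,\, Y = y)}{P(X = x_1)}, \qquad \mathbb{E}[Y \mid X = x_1] = \sum_{y} y\, P(Y = y \mid X = x_1)
$$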

In the above table, the conditional probability of y_1 given X=x_1 is 2/17

Marginal Distribution

Integrate or sum over one variable to get the marginal distribution of the other variable.
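For two discrete random variables, the standard marginalization formula (the original formula image is missing here) is:

$$
P(X = x_i) = \sum_{j} P(X = x_i,\, Y = y_j)
$$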

In the above table, the marginal probability of x_1 according to the above formula is 2/50 + 10/50 + 5/50 = 17/50.

Joint Distribution

A joint distribution, a.k.a. the data distribution, captures the joint probabilities between random variables. In the above table, the joint probability P(X = x_2 & Y = y_3) is 2/50. This is what a GAN tries to model from the sample data.

Consider images of size 28 x 28. Each pixel is a random variable that can take any value from 0 to 255. Hence, we have 784 random variables in total. A GAN tries to model the dependencies between these pixels.

Bayes' Theorem

From the above table: P(Y = y_1 | X = x_1) = P(Y = y_1 & X = x_1) / P(X = x_1) = (2/50)/(17/50) = 2/17
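In general (the standard statement of Bayes' theorem; the original formula image is missing here):

$$
P(Y = y \mid X = x) = \frac{P(X = x \mid Y = y)\, P(Y = y)}{P(X = x)} = \frac{P(X = x,\, Y = y)}{P(X = x)}
$$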

Entropy

Entropy measures the degree of uncertainty of the outcome of a trial conducted according to a distribution p(x).

$$
H(p) = -\sum_{x} p(x)\,\log_2 p(x)
$$

The entropy of an unbiased coin is higher than that of a biased coin, and the difference grows with the degree of polarization of the biased coin's probabilities.
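As a worked example (the biased coin with P(heads) = 0.9 is my own illustrative choice):

$$
H(\text{fair}) = -\bigl(0.5\log_2 0.5 + 0.5\log_2 0.5\bigr) = 1\ \text{bit}, \qquad H(\text{biased}) = -\bigl(0.9\log_2 0.9 + 0.1\log_2 0.1\bigr) \approx 0.47\ \text{bits}
$$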

Cross Entropy

Cross entropy measures the degree of uncertainty of a trial that you assume follows a distribution q(x), when in truth it follows p(x).

$$
H(p, q) = -\sum_{x} p(x)\,\log_2 q(x)
$$

The cross entropy is higher than the entropy when a trial is conducted according to the unbiased coin's distribution but you think it is being conducted according to the biased coin's distribution.

KL Divergence

KL Divergence is the difference between the cross-entropy and the entropy of the true distribution. It is equal to zero when the two distributions are equal. Hence, when you want to approximate or model one probability distribution with another, minimizing the KL Divergence between them will make them similar.

D(fair||biased) = H(fair, biased) – H(fair) = 2.19-1 = 1.19
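A quick numerical check of D(p||q) = H(p, q) − H(p). The biased coin's probabilities are not stated above, so I assume roughly (0.95, 0.05), which approximately reproduces the 2.19 and 1.19 figures:

```python
# Sketch: entropy, cross-entropy and KL divergence for a fair vs. an
# (assumed) 0.95/0.05 biased coin, in bits.
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p)

def cross_entropy(p, q):
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q))

fair, biased = [0.5, 0.5], [0.95, 0.05]

h_fair = entropy(fair)                 # 1.0 bit
h_cross = cross_entropy(fair, biased)  # ~2.20 bits
kl = h_cross - h_fair                  # ~1.20 bits
print(h_fair, h_cross, kl)
```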

JS Divergence

Due to the division by the probability of an outcome in the KL Divergence equation, it may become intractable in some cases; for example, if q_k is zero, the KL Divergence becomes infinite. Moreover, KL Divergence is not symmetric, i.e., D(p||q) is not equal to D(q||p), which makes it unusable as a distance metric. To suppress these effects, the JS divergence compares each distribution against the average of the two.
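Concretely (standard definition, with m the average distribution):

$$
\mathrm{JSD}(p \,\|\, q) = \tfrac{1}{2}\, D(p \,\|\, m) + \tfrac{1}{2}\, D(q \,\|\, m), \qquad m = \tfrac{1}{2}(p + q)
$$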

GANs

As aforementioned, a GAN takes a random sample from the latent space as input and maps it to the data space. In DCGANs, the mapping function is a deep neural network, which is differentiable and parameterized by the network weights. The mapping function is called the Generator (G). The Discriminator (D) is also a deep neural network; it takes a sample in the data space and maps it to a probability, i.e., the probability that the sample was generated from the real data distribution.

P_z: prior / latent distribution. Typically, this space is much smaller than the data space.

P_g: data distribution of the samples generated by the generator.

P_r: real data distribution.

Let D_m = (d_1, d_2, d_3, ..., d_m) be the data sampled according to P_r.

Let G_n = (g_{m+1}, g_{m+2}, ..., g_n) be the data generated according to P_g.

Train D to minimize the empirical loss. I am writing the objectives as minimizations, as most deep learning frameworks only implement minimization of a function.
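The equation image is not reproduced here; in the notation above, the standard empirical discriminator objective (presumably what Equation 1 states) is:

$$
\min_{D}\; -\frac{1}{m}\sum_{i=1}^{m} \log D(d_i) \;-\; \frac{1}{n-m}\sum_{j=m+1}^{n} \log\bigl(1 - D(g_j)\bigr)
$$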

Fix the D network, and train G to maximize the loss of D over G_n.
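In the same minimization convention, the corresponding (saturating) generator objective from the standard formulation is:

$$
\min_{G}\; \frac{1}{n-m}\sum_{j=m+1}^{n} \log\bigl(1 - D(g_j)\bigr), \qquad g_j = G(z_j),\ z_j \sim P_z
$$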

As stated in the original paper, in the early training period the above loss doesn't offer enough gradient to update the parameters of the G network: initially P_g is far from P_r, which makes it easy for D to classify the generated images. Hence, we instead maximize log D(g_j) by switching the labels.
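A minimal PyTorch sketch of the alternating procedure above, including the label-switching (non-saturating) generator update, i.e. minimizing −log D(G(z)) instead of log(1 − D(G(z))). The toy ring data, network sizes, and hyperparameters are my own assumptions, not taken from this repository:

```python
# Sketch of alternating GAN training on toy 2-D data (assumed setup,
# not this repo's DCGAN): D minimizes binary cross-entropy with
# real=1 / fake=0, then G is updated with the labels switched.
import math
import torch
import torch.nn as nn

latent_dim, data_dim, batch = 16, 2, 64

G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 64), nn.LeakyReLU(0.2), nn.Linear(64, 1), nn.Sigmoid())

bce = nn.BCELoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

def sample_real(n):
    """Toy stand-in for P_r: points on a ring of radius 2."""
    theta = torch.rand(n, 1) * 2 * math.pi
    return torch.cat([2 * torch.cos(theta), 2 * torch.sin(theta)], dim=1)

ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

for step in range(5000):
    # Discriminator step: minimize the empirical loss (Equation 1 style),
    # keeping G fixed via detach().
    real = sample_real(batch)
    fake = G(torch.randn(batch, latent_dim)).detach()
    d_loss = bce(D(real), ones) + bce(D(fake), zeros)
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step with switched labels: minimize -log D(G(z)),
    # which gives stronger gradients early in training.
    fake = G(torch.randn(batch, latent_dim))
    g_loss = bce(D(fake), ones)  # pretend the fakes are "real"
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```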

Optimization and Theoretical Results

Optimal Discriminator for fixed `G`

Equation 1 is an empirical loss function. Its risk function, i.e., the loss over the whole population (every possible image), can be written as:
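(The equation images are missing; the following is the standard form from the original GAN paper, presumably Equation 3, together with the pointwise-optimal prediction.)

$$
R(D) = -\int_{x} \Bigl[\, p_r(x)\log D(x) + p_g(x)\log\bigl(1 - D(x)\bigr) \Bigr]\, dx
$$

Minimizing the integrand pointwise with respect to D(x) gives

$$
\hat{y}^{*}(x) = D^{*}(x) = \frac{p_r(x)}{p_r(x) + p_g(x)}
$$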

So when y_hat = y_hat*, the discriminator's loss is at its minimum. At the end of training, if G does a good job of approximating P_r, then P_g ≈ P_r.

Substituting this into Equation 3 gives the optimal loss of the discriminator at the end of training:
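(The equation image is missing; with P_g = P_r the optimal discriminator is 1/2 everywhere, so the risk above evaluates to what is presumably Equation 4.)

$$
R(D^{*}) = -\int p_r(x)\log\tfrac{1}{2}\,dx \;-\; \int p_g(x)\log\tfrac{1}{2}\,dx \;=\; \log 4
$$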

This is the cost obtained when both D and G are perfectly optimized.

From the JS divergence equation, the JS divergence between P_g and P_r is:
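(Standard definition applied to p_r and p_g; the original equation image is missing.)

$$
\mathrm{JSD}(P_r \,\|\, P_g) = \tfrac{1}{2}\, D\!\left(p_r \,\Big\|\, \tfrac{p_r + p_g}{2}\right) + \tfrac{1}{2}\, D\!\left(p_g \,\Big\|\, \tfrac{p_r + p_g}{2}\right)
$$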

From Equation 3, with the optimal discriminator substituted in:
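(Standard rearrangement in terms of the JS divergence, in the minimization convention used above; the original equation image is missing.)

$$
R(D^{*}) = \log 4 \;-\; 2\,\mathrm{JSD}(P_r \,\|\, P_g)
$$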

The JS divergence is always non-negative. Hence, for the value of the above equation to equal the value calculated in Equation 4, the JS divergence must be 0, i.e., P_g = P_r. To conclude, when D is at its best, G needs to make P_g ≈ P_r to reach the global optimum.
