### Loss Function in Vanilla GANs

In a **vanilla GAN** (Generative Adversarial Network), there are two competing neural networks:

1. The **generator** $ G $, which generates fake data samples.
2. The **discriminator** $ D $, which distinguishes real data samples from fake ones.

The goal of GANs is to train the generator to produce data that is indistinguishable from real data. This competition between the generator and discriminator is formalized in their respective **loss functions**, which are derived from a **min-max game**.

#### Discriminator Loss

The discriminator $ D $ is trained to maximize the probability of correctly classifying real and fake samples. Its loss function $ \mathcal{L}_D $ is given by:

$$
\mathcal{L}_D = -\mathbb{E}_{\mathbf{x} \sim p_{\text{data}}}[\log(D(\mathbf{x}))] - \mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}}[\log(1 - D(G(\mathbf{z})))]
$$

where:

- $ \mathbf{x} \sim p_{\text{data}} $ represents samples from the real data distribution.
- $ \mathbf{z} \sim p_{\mathbf{z}} $ is a noise vector sampled from a prior distribution (e.g., Gaussian), which the generator uses as input.
- $ G(\mathbf{z}) $ is the generated (fake) data sample created by the generator from noise $ \mathbf{z} $.
- $ D(\mathbf{x}) $ is the discriminator's estimate of the probability that the sample $ \mathbf{x} $ is real.
- $ D(G(\mathbf{z})) $ is the discriminator’s estimate of the probability that the generated sample $ G(\mathbf{z}) $ is real. The discriminator wants this value to be small, which is equivalent to making $ 1 - D(G(\mathbf{z})) $ large.
- $ \mathbb{E} $ denotes the expectation, i.e., the average over samples drawn from the indicated distribution (the real data distribution or the generator’s noise prior). In practice, it is estimated by averaging over a mini-batch of samples during training.
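
As a sketch of that mini-batch estimate, assume a batch of $ m $ real samples $ \mathbf{x}^{(1)}, \dots, \mathbf{x}^{(m)} $ and $ m $ noise vectors $ \mathbf{z}^{(1)}, \dots, \mathbf{z}^{(m)} $ (the batch size $ m $ and the superscript indexing are notation introduced here for illustration). Each expectation in $ \mathcal{L}_D $ is then approximated by a sample mean:

$$
\mathcal{L}_D \approx -\frac{1}{m}\sum_{i=1}^{m} \log(D(\mathbf{x}^{(i)})) - \frac{1}{m}\sum_{i=1}^{m} \log(1 - D(G(\mathbf{z}^{(i)})))
$$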
  
The discriminator's loss consists of two main terms:

1. **Real Data Loss**: $ \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}}[\log(D(\mathbf{x}))] $ – the expected log-probability that real data samples are classified as real. We want to maximize this.
2. **Fake Data Loss**: $ \mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}}[\log(1 - D(G(\mathbf{z})))] $ – the expected log-probability that fake data samples are classified as fake. We again want to maximize this.

The discriminator's objective is to **maximize** the sum of the real data and fake data terms, which is equivalent to minimizing the negative of this sum, $ \mathcal{L}_D $. This encourages $ D $ to assign high probabilities to real samples and low probabilities to fake ones.
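
To make the behaviour of $ \mathcal{L}_D $ concrete, here is a minimal sketch in plain Python. It assumes the discriminator outputs are already available as lists of probabilities; the function name `discriminator_loss`, the `eps` guard, and the example values are hypothetical choices made for this illustration, not part of any library.

```python
import math

def discriminator_loss(d_real, d_fake, eps=1e-12):
    """Mini-batch estimate of L_D.

    d_real: D(x) for a batch of real samples, each in (0, 1).
    d_fake: D(G(z)) for a batch of generated samples, each in (0, 1).
    eps:    small constant (an assumption of this sketch) that keeps log() finite.
    """
    real_term = sum(math.log(p + eps) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p + eps) for p in d_fake) / len(d_fake)
    return -(real_term + fake_term)

# Discriminator that scores real samples high and fake samples low.
loss_good = discriminator_loss(d_real=[0.9, 0.95, 0.85], d_fake=[0.1, 0.05, 0.2])

# Discriminator that gets it the wrong way round.
loss_poor = discriminator_loss(d_real=[0.3, 0.4, 0.2], d_fake=[0.7, 0.6, 0.8])

print(f"good discriminator: {loss_good:.3f}")
print(f"poor discriminator: {loss_poor:.3f}")
```

The first loss comes out smaller than the second, which is the sense in which minimizing $ \mathcal{L}_D $ rewards correct classification.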

#### Generator Loss

The generator $ G $ is trained to **minimize** the probability that the discriminator correctly classifies generated samples as fake. In other words, it wants $ D(G(\mathbf{z})) $ to be close to 1, so that the discriminator believes the generated samples are real.

Its loss function $ \mathcal{L}_G $ is given by:

$$
\mathcal{L}_G = -\mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}}[\log(D(G(\mathbf{z})))]
$$

Here, the generator tries to **maximize** $\mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}}[\log(D(G(\mathbf{z})))]$, which is equivalent to minimizing $-\mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}}[\log(D(G(\mathbf{z})))]$.

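The loss that follows directly from the min-max game is different. There, the generator **minimizes** the expected log-probability that the discriminator classifies generated data as fake:

$$
\mathcal{L}_G^{\text{minimax}} = \mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}}[\log(1 - D(G(\mathbf{z})))]
$$

Both losses decrease as $D(G(\mathbf{z}))$ increases, so they push the generator in the same direction, but they are not equal and their gradients differ in magnitude.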

When the discriminator becomes very good, $D(G(\mathbf{z}))$ is close to 0 for fake samples, causing the term $\log(1 - D(G(\mathbf{z})))$ to saturate. This leads to vanishing gradients for the generator, making it hard to update the generator's weights.

One popular modification to the generator’s loss is to maximize $\log(D(G(\mathbf{z})))$ instead of minimizing $\log(1 - D(G(\mathbf{z})))$. This form leads to stronger gradients during the early stages of training.

Hence, using

$$
\mathcal{L}_G = -\mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}}[\log(D(G(\mathbf{z})))]
$$

helps stabilize training.
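
The saturation argument can be checked numerically. The sketch below assumes the discriminator ends in a sigmoid, so that $D(G(\mathbf{z})) = \sigma(a)$ for some pre-activation value $a$ (the symbol $a$ and the helper names are introduced here for illustration only). Under that assumption, the derivative of $\log(1 - \sigma(a))$ with respect to $a$ is $-\sigma(a)$, while the derivative of $-\log(\sigma(a))$ is $-(1 - \sigma(a))$.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def grad_saturating(a):
    # d/da log(1 - sigmoid(a)) = -sigmoid(a)
    return -sigmoid(a)

def grad_non_saturating(a):
    # d/da [-log(sigmoid(a))] = -(1 - sigmoid(a))
    return -(1.0 - sigmoid(a))

# Negative values of a mean the discriminator is confident the sample is fake.
for a in (-6.0, -2.0, 0.0):
    print(
        f"a={a:+.1f}  D(G(z))={sigmoid(a):.4f}  "
        f"saturating grad={grad_saturating(a):+.4f}  "
        f"non-saturating grad={grad_non_saturating(a):+.4f}"
    )
```

When the discriminator confidently rejects a sample (large negative $a$), the first gradient shrinks towards 0 while the second approaches $-1$, so the generator still receives a usable signal.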

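In PyTorch, this loss can be computed with `nn.BCELoss` (Binary Cross-Entropy Loss). Applied to the discriminator's output on a generated sample, it is defined as:

$$
\text{BCELoss}(D(G(\mathbf{z})), y) = -\left[ y \cdot \log(D(G(\mathbf{z}))) + (1 - y) \cdot \log(1 - D(G(\mathbf{z}))) \right]
$$

where:

- $ D(G(\mathbf{z})) $ is the discriminator's output when given a fake image generated by $ G $.
- $ y $ is the ground truth label. When training the generator, $ y = 1 $, because we want the generator to produce images that the discriminator classifies as real. The second term then vanishes and the loss reduces to $ -\log(D(G(\mathbf{z}))) $, which matches $ \mathcal{L}_G $ once averaged over the batch.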
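
As a quick check of the binary cross-entropy formula, the following plain-Python sketch re-implements it by hand (it does not call PyTorch) and averages over a batch. The helper `bce` and the sample values in `d_fake` are hypothetical and chosen only for illustration.

```python
import math

def bce(p, y):
    """Binary cross-entropy for one prediction p in (0, 1) and label y in {0, 1}."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Hypothetical discriminator outputs D(G(z)) for a batch of generated samples.
d_fake = [0.2, 0.4, 0.7]

# Generator step: the fake samples are labelled as real (y = 1).
loss_g_bce = sum(bce(p, 1) for p in d_fake) / len(d_fake)

# Direct evaluation of the generator loss -E[log D(G(z))].
loss_g_direct = -sum(math.log(p) for p in d_fake) / len(d_fake)

print(loss_g_bce, loss_g_direct)
```

The two values agree up to floating-point rounding, because the $(1 - y)$ term is multiplied by zero when $y = 1$.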

#### Min-Max Game Formulation

In a GAN, the generator and discriminator are playing a **min-max game** where:

$$
\min_G \max_D \mathcal{L}(D, G) = \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}}[\log(D(\mathbf{x}))] + \mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}}[\log(1 - D(G(\mathbf{z})))]
$$

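The generator tries to **minimize** this objective, while the discriminator tries to **maximize** it. Since $ \mathcal{L}_D = -\mathcal{L}(D, G) $, maximizing the objective over $ D $ is the same as minimizing the discriminator loss.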
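
The opposing directions can be illustrated with a small numerical sketch. It evaluates a mini-batch estimate of $ \mathcal{L}(D, G) $ for three hand-picked sets of discriminator outputs; the function name `value` and all numbers are hypothetical, and no actual networks are trained.

```python
import math

def value(d_real, d_fake):
    """Mini-batch estimate of the min-max objective L(D, G)."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# Baseline: D(x) on real samples and D(G(z)) on generated samples.
v_base = value(d_real=[0.7, 0.8], d_fake=[0.3, 0.2])

# A better discriminator: real scores go up, fake scores go down.
v_better_d = value(d_real=[0.9, 0.95], d_fake=[0.1, 0.05])

# A better generator against the baseline discriminator: fake scores go up.
v_better_g = value(d_real=[0.7, 0.8], d_fake=[0.6, 0.7])

print(f"baseline: {v_base:.3f}")
print(f"better D: {v_better_d:.3f}")
print(f"better G: {v_better_g:.3f}")
```

Improving the discriminator raises the objective, while improving the generator lowers it.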

#### Why This Loss Function?

This adversarial loss function forces the generator to produce outputs that the discriminator cannot distinguish from real data. As the generator improves, the discriminator’s task becomes more challenging, leading to more realistic generated samples. The adversarial nature of this loss function is what drives GANs to produce high-quality outputs.
