- generative adversarial network

    - generative: learn to generate samples from a new probability distribution that *in the end* mimics the original probability distribution of a dataset
    
    - adversarial: some conflict or opposition, as exists between the 2 NNs, generator and discriminator, which compete against each other

- 2 main neural-net models:

    - **discriminator**:
    
        - discriminates between 2 different classes of data
        
        - for instance, a model built to detect *fake vs real* data is a discriminator. In that case the model outputs 1 ==> real, 0 ==> fake
        
    - **generator**:
    
        - trained on training data sampled from some true distribution D; when given samples from some standard random distribution Z (over some other, latent variables), it produces a distribution $\hat{\textrm{D}}$ which is as close as possible to D according to some closeness metric.
        
        - symbolically represented as G.
        
        - G(Z) = $\hat{\textrm{D}}$, such that $\hat{\textrm{D}}$ $\approx$ D
        
- hence, G learns to generate a sample, and the discriminator checks whether this sample is a fake or not. If the discriminator says that it's a fake, G has to learn something more (i.e. optimise some objective function), so as to generate a sample which the discriminator classifies as real; in this way $\hat{\textrm{D}}$ $\approx$ D

- if the decision made by the discriminator is wrong, that error is fed back to fine-tune it; the generator is fine-tuned in the same way, as part of backprop.

# Learning Mechanism

- weights and biases for each of the generator and discriminator
- while training one the other is held constant, i.e. not trained
- <font size="4">training the discriminator</font> is much easier
    - label the artificial instances(samples generated by the generator) as y = 0, real instances(from true dataset) as y=1
    - these 2 collection of samples are then combined into 1 large set, and the discriminator learns to output y = 0 or y=1, thus the discriminator is involved with a binary classification task
- <font size="4">training the generator</font>
    - the discriminator weights and biases are kept fixed, so that it doesn't become so strong that the generator is never able to beat it
    - while training the generator, its outputs, i.e. the artificial instances, are labelled 1, so as to push the generator towards samples that the discriminator believes are real
    - if the discriminator is strong enough to classify this artificial instance as y=0, we backprop this error through the (fixed) discriminator and adjust the weights and biases of the generator. This continues up to the point where the discriminator outputs 0.5, meaning that it has become confused, or in other words, the *generator has become smart enough* to fool the discriminator.
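
The labelling scheme described above can be sketched in plain Python. This is only an illustration: the helper names `make_discriminator_batch` and `make_generator_targets`, and the toy sample values, are assumptions and are not part of the notebook's TensorFlow code.

```python
import random

random.seed(0)

def make_discriminator_batch(real_samples, fake_samples):
    """Combine real (label 1) and generated (label 0) samples into one shuffled set."""
    batch = [(x, 1) for x in real_samples] + [(x, 0) for x in fake_samples]
    random.shuffle(batch)
    return batch

def make_generator_targets(fake_samples):
    """While training the generator, its outputs are paired with the target label 1."""
    return [(x, 1) for x in fake_samples]

real = [0.9, 1.1, 1.0]   # toy "real" samples (hypothetical values)
fake = [-0.2, 0.3]       # toy generator outputs (hypothetical values)

disc_batch = make_discriminator_batch(real, fake)
gen_batch = make_generator_targets(fake)
print(disc_batch)
print(gen_batch)
```

The same generated samples therefore carry label 0 when the discriminator is trained and label 1 when the generator is trained.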

# Loss function of GAN


## Notations
- p.d.f. of true dataset d$_{\textrm{true}}$: p$_{\textrm{data}}$(x), where x = true sample belonging to d$_{\textrm{true}}$, and X = random variable that denotes the sample belonging to d$_{\textrm{true}}$

- discriminator D(x; $\theta_2$) ==> variables before ";" are inputs, variables after ";" are parameters that we need to optimize
    - x is passed into D(x; $\theta_2$), which gives us D(x), the probability that x belongs to d$_{\textrm{true}}$, i.e. that it is a true/real sample
    
    - since it's a probability, D(x) $\in$ \[0, 1\]

- a priori probability distribution on the input noise variable p$_{\textrm{Z}}$(z) is defined, which will be fed into the generator G(z; $\theta_1$)

    - the generator outputs a sample, x$_{\textrm{G}}$
    
    - goal: p$_{\textrm{G}}$(x$_{\textrm{G}}$) = p$_{\textrm{data}}$(x)
    
    - this sample is also fed to the D(x; $\theta_2$), with a label = 0
    
    - the output when an artificial sample is given to the discriminator is D(G(z)), which is a probability, just like D(x)

- <font color="red">Note</font>: G() and D() must be differentiable functions, or else gradient-based updates will not be possible

## Binary crossentropy loss

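- L(y, $\hat{\textrm{y}}$) = ylog($\hat{\textrm{y}}$) + (1-y)log(1-$\hat{\textrm{y}}$), i.e. the log-likelihood; the usual binary crossentropy is the negative of this quantity, so maximising L is the same as minimising the crossentropy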

- for samples coming from d$_{\textrm{true}}$, y = 1, $\hat{\textrm{y}}$ = D(x) 
    
    - hence, L(1, D(x)) = log(D(x))
    
- for samples coming from generator, y = 0, $\hat{\textrm{y}}$ = D(G(z))
    
    - hence, L(0, D(G(z))) = log( 1 - D(G(z)) )
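
A minimal numeric sketch of the two cases above, using the sign convention of the formula in this section. The function name `log_likelihood` and the values chosen for D(x) and D(G(z)) are illustrative assumptions.

```python
import math

def log_likelihood(y, y_hat):
    """L(y, y_hat) = y*log(y_hat) + (1 - y)*log(1 - y_hat)."""
    return y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat)

d_real = 0.9   # hypothetical D(x) for a real sample
d_fake = 0.2   # hypothetical D(G(z)) for a generated sample

loss_real = log_likelihood(1, d_real)   # reduces to log(D(x))
loss_fake = log_likelihood(0, d_fake)   # reduces to log(1 - D(G(z)))
print(loss_real, loss_fake)
```

Both values are never positive, and each approaches 0 as the discriminator becomes more confident in the correct label.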

## Discriminator

- objective: correctly classify fake from real

- maximise L(1, D(x)) and L(0, D(G(z))), since these are the objective terms for the discriminator

- we already know that D(x) and D(G(z)) are probabilities, 
    
    - hence log(D(x)) $\in (-\infty$, 0], so the max value for this function is 0, which is attained at D(x) = 1.
    
    - hence 1-D(G(z)) $\in [0, 1]$, which means that log( 1 - D(G(z)) ) $\in \, (-\infty$, 0], so the max value for this function is 0, which is attained at D(G(z)) = 0
    
- these results are expected anyway, i.e. a sample from the generator should be classified as having 0 probability of belonging to d$_{\textrm{true}}$, and samples from d$_{\textrm{true}}$ should be classified as having full probability of belonging to d$_{\textrm{true}}$

- hence, the net objective for the discriminator becomes max.( log(D(x)) + log( 1 - D( G(z) ) ) )

## Generator

- so as to fool the discriminator, all the generator has to do is D(G(z)) = 1, since this corresponds to being classified as a true sample

- this basically means that log(1 - D(G(z)) ) $\rightarrow -\infty$ 
<img src="generator.png"/>
    
- hence the goal is to min. ( log(1 - D(G(z)) ) )
    
    - it's generally represented as min.( log(D(x)) + log(1 - D(G(z)) ) ), although D(x) has nothing to do with the generator G.
    

Hence it's clear that the generator and discriminator have exactly **<font color="red">opposite</font>** objective functions, and that's why the term **adversarial** is used

min.$_{\textrm{G}}$ max.$_{\textrm{D}}$ (log(D(x)) + log(1 - D(G(z)) ))

<font size="4">On including all training samples, we get</font>:\
min.$_{\textrm{G}}$ max.$_{\textrm{D}}$ V(G, D), where V(G, D) = <font size="4">E$_{x \sim p_{\textrm{data}}(x)}$</font>$\left[\textrm{log}\left(D(x)\right)\right]$ +  <font size="4">E$_{z \sim p_{\textrm{z}}(z)}$</font>$\left[\textrm{log}\left(1-D(G(z))\right)\right]$ and E is the expectation value over the respective p.d.f., p$_{\textrm{data}}$(x) or p$_{\textrm{z}}$(z)
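
The two expectations in V(G, D) can be estimated by sampling. The sketch below does this for a toy 1-D problem; the data distribution N(2, 1), the logistic discriminator `D`, the shift generator `G` and all parameter values are hypothetical choices made only for illustration.

```python
import math
import random

random.seed(0)

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def D(x, w=1.0, b=-1.0):
    """Toy discriminator: a logistic unit on a scalar input (assumed form)."""
    return sigmoid(w * x + b)

def G(z, shift=0.0):
    """Toy generator: shifts the noise sample (assumed form)."""
    return z + shift

def estimate_value(n=20000, shift=0.0):
    """Monte Carlo estimate of E[log D(x)] + E[log(1 - D(G(z)))]."""
    real_term = sum(math.log(D(random.gauss(2.0, 1.0))) for _ in range(n)) / n
    fake_term = sum(math.log(1.0 - D(G(random.gauss(0.0, 1.0), shift))) for _ in range(n)) / n
    return real_term + fake_term

v_bad_generator = estimate_value(shift=0.0)    # generated samples far from the data
v_good_generator = estimate_value(shift=2.0)   # generated samples match the data
print(v_bad_generator, v_good_generator)
```

For this fixed discriminator, the generator whose samples match the data yields the smaller value of V, which is the direction the generator's minimisation pushes towards.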

# Find the best discriminator

- fix G, optimal discriminator D is given by:
    - D$_{\textrm{G}}^{*}$(x) =  $\frac{\textrm{p}_{\textrm{data}}(x)}{\textrm{p}_{\textrm{data}}(x)+\textrm{p}_{\textrm{G}}(x)}$

- since G is fixed, the loss function becomes max.(sum-of-expected-values)

- E$_{\textrm{p(x)}}$\[x\] = $\int_{\textrm{x}}xp(x)dx$, and more generally E$_{\textrm{p(x)}}$\[f(x)\] = $\int_{\textrm{x}}f(x)p(x)dx$
    
- hence, for fixed G, we maximise over D the quantity V(G, D) = <font size="4">E$_{x \sim p_{\textrm{data}}(x)}$</font>$\left[\textrm{log}\left(D(x)\right)\right]$ +  <font size="4">E$_{z \sim p_{\textrm{z}}(z)}$</font>$\left[\textrm{log}\left(1-D(G(z))\right)\right]$
    
    - \begin{equation}
        V(G, D) = \int_x p_{\textrm{data}}(x)\log(D(x))\,dx + \int_z p_{\textrm{z}}(z)\log(1-D(G(z)))\,dz \end{equation}
        
- for a given p.d.f. p$_x$(x), the p.d.f. of y = f(x) can also be calculated, provided f is invertible
    
    - this is called change of variable
    
    - p$_y$(y) = p$_x(f^{-1}(y)) \left|\frac{d(f^{-1}(y))}{dy}\right|$, where y = f(x)
    
- we know that our generator produces x' = G(z), and we want the distribution of this x' to mimic that of x
    
    - thus, we get p$_{\textrm{G}}$(x') = p$_{\textrm{z}}(G^{-1}(x')) \left|\frac{d(G^{-1}(x'))}{dx'}\right|$
    
    - **strong assumption: G is invertible** (and, for simplicity, increasing, so that the derivative is positive and the absolute value can be dropped)
    
    - performing *change of variable* for the expression $\int_z p_{\textrm{z}}(z)\log(1-D(G(z)))\,dz$, we get $\int_{x'} p_{\textrm{z}}(G^{-1}(x'))\log(1-D(x'))\,dG^{-1}(x')$
    
    - multiplying and dividing the above expression by dx', inside the integral, we obtain $\int_{x'} p_{\textrm{z}}(G^{-1}(x'))\frac{dG^{-1}(x')}{dx'} $ . log(1-D(x'))dx'
    
    - this becomes $\int_{x'} \textrm{p}_{\textrm{G}}(x') $ . log(1-D(x'))dx' 
    

- hence V(G, D) = $\int_x p_{\textrm{data}}(x)\textrm{log(D(x))dx}$  + $\int_{x'} \textrm{p}_{\textrm{G}}(x')$. log(1-D(x'))dx'

    - x' is only a dummy integration variable, so we can rename it to x and combine the two integrals (this does not require the generator to be perfect)
    
    - hence V(G, D) = $\int_x \left(p_{\textrm{data}}(x)\textrm{log(D(x))} \, + \, \textrm{p}_{\textrm{G}}(x).log(1-D(x))\right)dx$
    
- this above V(G, D) needs to be maximised, which means that the first derivative w.r.t. D(x) should be 0, since the **argument here is D(x)**, i.e. the discriminator, and not the random variable x.
    
    - $\frac{d(V(G, D))}{dD(x)}_{D = D^{*}} = 0$
    
    - $\frac{d(V(G, D))}{dD(x)}$ = $\frac{p_{\textrm{data}}(x)}{\textrm{D(x)}} \, - \, \frac{\textrm{p}_{\textrm{G}}(x)}{1-D(x)}$ = 0 
    
    - on rearranging the terms, we finally obtain: <font size="4"> D$_{\textrm{G}}^{*}$(x) =  $\frac{\textrm{p}_{\textrm{data}}(x)}{\textrm{p}_{\textrm{data}}(x)+\textrm{p}_{\textrm{G}}(x)}$</font>
    
    - just to prove that V is maximum at this value, let's evaluate $\frac{d^2(V(G, D))}{dD(x)^2}$:
    \begin{equation}
    \frac{d^2(V(G, D))}{dD(x)^2} = -\frac{p_{\textrm{data}}(x)}{\textrm{D(x)}^{2}} - \frac{\textrm{p}_{\textrm{G}}(x)}{\left(1-D(x)\right)^{2}} \\
    \textrm{since   } p_{\textrm{data}}(x) \ge 0 \textrm{ and } p_{\textrm{G}}(x) \ge 0 \, \, \Rightarrow \, \, \frac{d^2(V(G, D))}{dD(x)^2} \le 0
    \end{equation}
    
    - hence the second derivative is negative (wherever at least one of the two densities is non-zero), thus proving that this is in fact the value at which the objective is maximized.    
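
As a quick numeric sanity check of the closed form for the optimal discriminator, the sketch below maximises the integrand of V(G, D) at a single point x by brute-force search over D(x). The density values `p_data` and `p_g` and the grid resolution are hypothetical.

```python
import math

def pointwise_objective(d, p_data, p_g):
    """Integrand of V(G, D) at a single x: p_data*log(d) + p_g*log(1 - d)."""
    return p_data * math.log(d) + p_g * math.log(1.0 - d)

p_data, p_g = 0.6, 0.2                      # hypothetical density values at one point x
grid = [i / 1000 for i in range(1, 1000)]   # candidate values of D(x) in (0, 1)

best_d = max(grid, key=lambda d: pointwise_objective(d, p_data, p_g))
closed_form = p_data / (p_data + p_g)       # optimal discriminator value from the derivation
print(best_d, closed_form)
```

The grid maximiser agrees with p_data / (p_data + p_G) up to the grid spacing.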

# Finding the best generator

- now that the best discriminator is known, lets keep it fixed and try to optimise for the generator, i.e. min.$_{\textrm{G}}$(V(G, D$_{\textrm{G}}^{*}$(x)))

- optimal generator should have the condition that p$_{\textrm{G}}$(x) = p$_{\textrm{data}}$(x)

- using D$_{\textrm{G}}^{*}$(x) =  $\frac{\textrm{p}_{\textrm{data}}(x)}{\textrm{p}_{\textrm{data}}(x)+\textrm{p}_{\textrm{G}}(x)}$, we get: V(G, D$_{\textrm{G}}^{*}$(x)) = $\int_x \left(p_{\textrm{data}}(x)\textrm{log}\left(\frac{\textrm{p}_{\textrm{data}}(x)}{\textrm{p}_{\textrm{data}}(x)+\textrm{p}_{\textrm{G}}(x)}\right) \, + \, \textrm{p}_{\textrm{G}}(x).log\left(\frac{\textrm{p}_{\textrm{G}}(x)}{\textrm{p}_{\textrm{data}}(x)+\textrm{p}_{\textrm{G}}(x)}\right)\right)dx$

- now add and subtract these 2 terms : log(2).p$_{\textrm{data}}$(x) , log(2).p$_{\textrm{G}}$(x)\
we thus have $\int_x \left(-\log2.\left(p_{\textrm{data}}(x)+ p_{\textrm{G}}(x)\right)\, + \, p_{\textrm{data}}(x)\textrm{log}\left(\frac{2.\textrm{p}_{\textrm{data}}(x)}{\textrm{p}_{\textrm{data}}(x)+\textrm{p}_{\textrm{G}}(x)}\right) \, + \, \textrm{p}_{\textrm{G}}(x).log\left(\frac{2.\textrm{p}_{\textrm{G}}(x)}{\textrm{p}_{\textrm{data}}(x)+\textrm{p}_{\textrm{G}}(x)}\right)\right)dx$

    - this simplifies to <font size="4" color="red">-log4 + KL$\left[p_{\textrm{data}}(x) || \frac{\textrm{p}_{\textrm{data}}(x) + \textrm{p}_{\textrm{G}}(x)}{2}\right]$ + KL$\left[p_{\textrm{G}}(x) || \frac{\textrm{p}_{\textrm{data}}(x) + \textrm{p}_{\textrm{G}}(x)}{2}\right]$ </font>

- remember that x' = G(z), z = random noise, x' = sample generated by the generator
    
    - also <font color="red" size="4">p$_{\textrm{G}}$(x') = p$_{\textrm{z}}(G^{-1}(x')) \frac{d(G^{-1}(x'))}{dx'}$</font>
    
- the sum of KL divergences above is represented by a new divergence function, called the **Jensen-Shannon Divergence** 
    
    - JSD(a || b) = $\frac{1}{2}\left[ \textrm{KL(a||c)} +  \textrm{KL(b||c)} \right]$, where c = $\frac{\textrm{a+b}}{2}$
    
    - hence V(G, D$_{\textrm{G}}^{*}$) = -log4 + 2.JSD(p$_{\textrm{data}}$ || p$_{\textrm{G}}$)
    
- the KL divergence, and hence the JSD, becomes 0 at p$_{\textrm{G}}$(x) = p$_{\textrm{data}}$(x) (2 same quantities)

- **hence, driving the JSD to 0 is the main objective**; the minimum value of V is then -log4

- observe that when G = G*, i.e. p$_{\textrm{G}}$(x) = p$_{\textrm{data}}$(x), even the best discriminator D = D$_{\textrm{G}}^{*}$ outputs <font size="4" color="red">D = 1/2</font> everywhere; hence our generator has confused even the best discriminator.
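
The identity between V(G, D*) and the Jensen-Shannon divergence can be checked numerically on discrete distributions, where the integral becomes a sum. The two probability vectors below are hypothetical, and the helper names are assumptions made for this sketch.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: average KL to the mixture (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * (kl(p, m) + kl(q, m))

def value_at_optimal_d(p_data, p_g):
    """V(G, D*) with D*(x) = p_data(x) / (p_data(x) + p_g(x)), summed over x."""
    total = 0.0
    for pd_x, pg_x in zip(p_data, p_g):
        d_star = pd_x / (pd_x + pg_x)
        total += pd_x * math.log(d_star) + pg_x * math.log(1.0 - d_star)
    return total

p_data = [0.1, 0.4, 0.5]   # hypothetical data distribution
p_g = [0.3, 0.3, 0.4]      # hypothetical generator distribution

lhs = value_at_optimal_d(p_data, p_g)
rhs = -math.log(4) + 2 * jsd(p_data, p_g)
print(lhs, rhs)
print(value_at_optimal_d(p_data, p_data), -math.log(4))
```

The second print line covers the case p_G = p_data, where the JSD vanishes and V reaches its minimum of -log 4.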

# Optimizing the loss function

- <font size="5" color="blue">for the j$^{\textrm{th}}$ iteration</font>:
    - <font size="4" color="purple">for k steps do</font>:
        * Sample mini-batch of **m noise samples** {z$^{(1)}$, z$^{(2)}$, .... z$^{(\textrm{m})}$ } from noise prior p$_{\textrm{z}}$(z)
        * Sample mini-batch of **m data samples** {x$^{(1)}$, x$^{(2)}$, .... x$^{(\textrm{m})}$ } from dataset-distribution, p$_{\textrm{data}}$(x)
        * Update the discriminator by **ascending** (<font color="green">since we want to maximise the cost w.r.t. discriminator</font>) its *stochastic* (basically for the current mini-batch) gradient\
            $\nabla$<font size="4">$_{\theta_{\textrm{d}}}$</font> $\frac{1}{\textrm{m}} \sum \limits_{\textrm{i=1}}^{\textrm{m}} $ \[ log(D(x$^{(\textrm{i})}$)) + log( 1 - D( G(z$^{(\textrm{i})}$) ) ) \], where $\theta_{\textrm{d}}$ denotes the parameters of the discriminator network
    - <font size="4" color="purple">endfor</font>
    - Sample mini-batch of **m noise samples** {z$^{(1)}$, z$^{(2)}$, .... z$^{(\textrm{m})}$ } from noise prior p$_{\textrm{z}}$(z)
    - Update the generator by **descending** (<font color="red">since we want to minimise the cost w.r.t. generator</font>) its *stochastic* (basically for the current mini-batch) gradient
        * $\nabla$<font size="4">$_{\theta_{\textrm{g}}}$</font> $\frac{1}{\textrm{m}} \sum \limits_{\textrm{i=1}}^{\textrm{m}}$ log( 1 - D( G(z$^{(\textrm{i})}$) ) ), where $\theta_{\textrm{g}}$ denotes the parameters of the generator network
- <font size="5" color="blue">endfor</font>
* practically speaking, in the early iterations, when the generator isn't yet *smart enough* to fool the discriminator, i.e. the discriminator can easily identify the samples produced by the generator, D(G(z)) $\approx$ 0
    
    * here log(1 - D(G(z))) saturates: when D ends in a sigmoid (as in the implementation later in this notebook), the gradient of log(1 - D(G(z))) w.r.t. the pre-sigmoid output is -D(G(z)) $\approx$ 0, so the generator receives almost no learning signal. It would seem as if the convergence criterion has been reached, but that's not so
    * to overcome this, in practical scenarios, the function <font size="4">argmax$_{\textrm{G}}$(E$_{p_{z}(z)}$[log(D(G(z)))])</font> is used instead
    
    * even for the initial iterations where D(G(z)) $\approx$ 0, the slope of log(D(G(z))) is very large, thus informing the model that there is a long way to go before convergence is achieved. Also, since the gradient is high, it corresponds to a large step in the steepest descent algorithm.
    
    * <font size="4" color="blue">the objective for the discriminator remains the same</font>, since it doesn't suffer from this gradient problem.
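
The difference between the two generator objectives can be seen by comparing their gradients with respect to the discriminator's pre-sigmoid output. This sketch assumes D = sigmoid(a) for a scalar logit `a`; the logit value -6 is a hypothetical example of a discriminator that is confident the sample is fake.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def grad_saturating(a):
    """d/da log(1 - sigmoid(a)) = -sigmoid(a)."""
    return -sigmoid(a)

def grad_non_saturating(a):
    """d/da log(sigmoid(a)) = 1 - sigmoid(a)."""
    return 1.0 - sigmoid(a)

logit = -6.0   # hypothetical logit, so that D(G(z)) is close to 0
print(sigmoid(logit))
print(grad_saturating(logit), grad_non_saturating(logit))
```

With D(G(z)) near 0, the gradient of log(1 - D(G(z))) almost vanishes, while the gradient of log(D(G(z))) stays close to 1 and keeps the generator learning.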

# Drawbacks of GAN

1. Vanishing Gradients
    
    1. this is observed for the generator, and that's why the objective was changed, as mentioned in the previous section.
    
    2. let $\theta_{\textrm{g}} = \theta,\,\, \theta_{\textrm{d}} = \phi$, $\frac{\partial \textrm{V(G, D)}}{\partial \theta} = \nabla_{\theta}$[E$_{p_{z}(z)}$[log(1-D$_{\phi}$(G$_{\theta}$(z)))]] = E<font size="4">$_{\textrm{p}_{\textrm{z}}(z)}\left[ \frac{\partial \textrm{G}_{\theta}(z)}{\partial \theta} \frac{1}{\textrm{D}_{\phi}(\textrm{G}_{\theta}(z))-1} \frac{\partial \textrm{D}_{\phi}(\textrm{G}_{\theta}(z))}{\partial \textrm{G}_{\theta}(z)} \right]$</font>
    
    3. writing x = G(z) for a generated sample, the gradient becomes E<font size="4">$_{\textrm{p}_{\textrm{G}}(x)}\left[ \frac{\partial x}{\partial \theta} \frac{1}{\textrm{D(x)}-1} \frac{\partial D(x)}{\partial x} \right]$</font>
    
    4. with our basic assumption that after **k steps, a perfect discriminator is obtained**, samples generated by the generator are always classified as D(G(z)) = 0. Hence all samples drawn from the generator have D(x) = 0; D is flat around them, so $\frac{\partial D(x)}{\partial x}$ = 0 and the derivative w.r.t. $\theta$ turns to 0.
    
    5. this is the mathematical display of vanishing gradients for the generator
    
    6. changing the generator objective to max.(log(D$_{\phi}$(G$_{\theta}$(z)))), or to put it more precisely, max.(E$_{\textrm{p}_{\textrm{z}}(z)}$\[log(D$_{\phi}$(G$_{\theta}$(z)))\]), is the remedy mentioned in the previous section
    
2. Mode Collapse
    
    1. generator collapses to a setting where it ends up **always producing the same outputs**
    
    2. the p.d.f. function of the true dataset is a complex, multi-modal function, having different peaks such that there can be a subset of all classes concentrated in these peaks
    
    3. let's consider the MNIST handwritten-digit dataset
    
    4. it may so happen that we have different curves for p.d.f. for each class(not talking about the p.d.f. of the true entire data here). \
    We can see below that 0 can be represented in first gaussian, and 9 can be represented in the last one 
    
    5. hence the generator might find it hard to learn this multi-modal distribution (here the number of modes means the total number of classes) and since its main objective is to fool the discriminator, <font color="green">instead of learning from all modes</font>, <font color="red">it may just end up learning from 1 of these modes in a near-perfect manner</font> (this is an easier task), such that it's easily able to replicate the samples that constitute that particular mode.
    
    6. as the discriminator gets better at telling apart the artificial and real samples belonging to 1 mode, the generator has to either produce better samples for that mode or it can simply learn to produce better samples from some other mode for which the discriminator has not yet learnt to tell real and fake apart.
    
    7. hence, in the above point, if the generator ends up picking the former option many times, mode collapse will occur, or at the very least has a high chance of occurring.
    
    8. an analogy to understand this is that *instead of becoming a jack of all trades*, <u>the generator decides to become the master of one</u>
    
3. Hard to achieve Nash equilibrium

    1. [Salimans 2016](https://arxiv.org/abs/1606.03498) discusses this problem in detail
    
    2. in principle, GANs should train in such a way that both the generator and discriminator reach a Nash equilibrium at the end of this *2 player, non-cooperative game.*
    
    3. however, each model updates its own cost with no regard for how the other is updating itself.
    
    4. hence updating the gradients of both models **<font color="red">concurrently cannot guarantee convergence</font>**
    
    5. let's assume one player controls x and minimises f$_1$(x) = xy, while the other controls y and minimises f$_2$(y) = -xy, i.e. their respective objectives are to minimise and maximise x.y
    
    6. with each gradient update, x and y oscillate with growing amplitude, so the instability becomes worse with time
    
4. Problem with counting
    
    1. fail to differentiate between number of objects that should occur in a generated image
    
5. Problem with perspective
    
    1. unable to differentiate between the front view and rear view
    
    2. this is seen when GANs are learning from 3D-objects(images are 3D in a sense that some perspective is involved in that image) to generate their 2D images.
    
6. Problem with global structure

    1. problems in understanding holistic structure similar to the above perspective problem.
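
The instability described in the Nash-equilibrium item above can be reproduced with a few lines of Python. In this sketch one player adjusts x to minimise x*y while the other adjusts y to minimise -x*y (that is, to maximise x*y), both taking simultaneous gradient steps. The starting point, learning rate and step count are hypothetical.

```python
import math

def simultaneous_gradient_steps(x, y, lr=0.1, steps=100):
    """Return the distance from the equilibrium (0, 0) after every update."""
    distances = [math.hypot(x, y)]
    for _ in range(steps):
        grad_x = y     # d(x*y)/dx, used by the player minimising x*y
        grad_y = -x    # d(-x*y)/dy, used by the player minimising -x*y
        # both players update at the same time, each ignoring the other's move
        x, y = x - lr * grad_x, y - lr * grad_y
        distances.append(math.hypot(x, y))
    return distances

distances = simultaneous_gradient_steps(x=1.0, y=1.0)
print(distances[0], distances[-1])
```

Each simultaneous step multiplies the squared distance from (0, 0) by (1 + lr^2), so the pair spirals away from the equilibrium instead of converging to it.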

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import matplotlib.pyplot as plt
# import the TF1-compatible API directly (a plain `import tensorflow as tf` would only be overwritten by this alias)
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()


(train_images, train_labels), (_, _) = tf.keras.datasets.fashion_mnist.load_data()
train_images = train_images/255. 
print(train_images.shape, train_labels.shape)

Since TensorFlow v2 has removed `placeholder` from its default API, we force TensorFlow to use the v1 components and disable the v2 behavior.

In [None]:
print(list(train_images[0]))
plt.imshow(train_images[0], cmap='gray')
plt.show()

# trying to change the visualisation for better resolution
from skimage.transform import resize
img_down = resize(train_images[0], (14, 14))
img_up = resize(train_images[0], (56, 56))
img_2 = resize(train_images[0], (448, 448))
fig, ax = plt.subplots(1, 3, figsize=(12, 5))

ax[0].imshow(img_down, cmap='gray')
ax[0].axis("off")

ax[1].imshow(img_up, cmap='gray')
ax[1].axis("off")

ax[2].imshow(img_2, cmap='gray')
ax[2].axis("off")

plt.show()

let's set the training parameters:
* learning rate
* batch size
* number of epochs

and the neural net params:
* image dimensions (flattened)
* hidden dimensions of the generator NN (flattened, number of neurons in this layer)
* hidden dimensions of the discriminator NN (flattened, number of neurons in this layer)
* dimensions of the random noise, z (flattened out value, $100\times1$ vector produced)

In [None]:
learning_rate, batch_size, epochs = 0.0002, 128, 100000

image_dim, gen_hidden_dim, disc_hidden_dim, z_noise_dim = 784, 256, 256, 100

In [None]:
indicesList = [i for i in range(train_images.shape[0])]
from random import choices
def fetchBatch():
    choiceOptions = choices(indicesList, k=batch_size)
    x_train_arr = np.array([train_images[i] for i in choiceOptions])
#     x_train_arr = x_train_arr/255.
    return x_train_arr.reshape(batch_size, image_dim)

In [None]:
plt.imshow(fetchBatch()[0].reshape([28, 28]))
plt.axis("off")
plt.show()

the image dimension is 784, since the generator has to produce a flattened image of dimensions $28\times28$

the function below, called <u>xavier initialization</u>, helps in reaching convergence faster.

* it's an initialization method, used here for initializing weights and biases

* With each passing layer, we want the variance to remain the same. 
    * This helps us keep the signal from exploding to a high value or vanishing to zero. 
    
    * In other words, we need to initialize the weights in such a way that the variance remains the same for x and y.(x = input to current layer, y = output of the current layer) 

* we ignore the updation of biases for this proof:
    
    * <img src="linearNeuron.png" />
    * consider a linear neuron: y = $\sum\limits_i w^ix^i$ (weights is a vector, rather than a matrix, x is the feature vector for a sample)
    
    * var(ab) = E(a)$^2$var(b) + E(b)$^2$var(a) + var(a).var(b)
    
    * **assumption: weights and samples are independent and drawn from zero-mean distributions**, so the first two terms of the formula above vanish
    
    * hence var(w$^i$x$^i$) = var(w$^i$).var(x$^i$), var(y) = $\sum\limits_i$ var(w$^i$).var(x$^i$)
    
    * since the samples and weights are identically distributed (all items are taken from the same probability distribution, with no overall trend), var(w$^i$) = var(w$^j$), var(x$^i$) = var(x$^j$).
    
    * thus var(y) = N.var(w).var(x). We want **var(y) = var(x)**, hence **var(w) = $\frac{1}{N}$**, <font color="purple">N: feature-dimensionality</font>
    
* We need to pick the weights from a Gaussian distribution with zero mean and a variance of 1/N, where N specifies the number of input neurons. 
    * This is how it’s implemented in the Caffe library. 
    
* In the original paper, the authors take the average of the number input neurons and the output neurons. 
    
    * <img src="normalNetwork.jpeg" />
     
    * So the formula becomes:
    var(w) = $\frac{1}{N_{\textrm{avg}}}$, where $N_{\textrm{avg}}$ = <font size="4">$\frac{N_{\textrm{in}}+N_{\textrm{out}}}{2}$</font>
    
    * in the above image, N$_{\textrm{in}}$ = 3, N$_{\textrm{out}}$ = 2
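
The variance argument above can be checked empirically with the standard library alone. The sketch simulates a linear neuron with N inputs, comparing weights drawn with variance 1/N against weights drawn with variance 1; the values of `N` and `num_trials` are hypothetical.

```python
import random
import statistics

random.seed(0)

N = 100             # number of inputs to the linear neuron (hypothetical)
num_trials = 2000   # number of independent (w, x) draws

def linear_neuron_output(weight_std):
    """y = sum_i w_i * x_i with x_i ~ N(0, 1) and w_i ~ N(0, weight_std^2)."""
    xs = [random.gauss(0.0, 1.0) for _ in range(N)]
    ws = [random.gauss(0.0, weight_std) for _ in range(N)]
    return sum(w * x for w, x in zip(ws, xs))

# var(w) = 1/N keeps var(y) close to var(x) = 1
var_xavier = statistics.pvariance([linear_neuron_output((1.0 / N) ** 0.5) for _ in range(num_trials)])
# var(w) = 1 inflates var(y) to roughly N
var_unit = statistics.pvariance([linear_neuron_output(1.0) for _ in range(num_trials)])
print(var_xavier, var_unit)
```

With var(w) = 1/N the output variance stays near the input variance, whereas unit-variance weights scale it up by about a factor of N per layer.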

In [None]:
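def xavier_init(shape):
    # shape[0] is the number of input units N_in of the layer.
    # Note: stddev = sqrt(2 / N_in), i.e. var(w) = 2 / N_in, which is a scaled
    # variant of the var(w) = 1 / N rule derived above.
    return tf.random.normal(
        shape=shape,
        stddev=1./tf.sqrt(shape[0]/2.0)
    )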

1. <font size="4">Variables</font>:

    1. Variable tensors are used when the values require updating within a session. 
    
    2. It is the type of tensor that would be used for the weights matrix when creating neural networks, since these values will be updated as the model is being trained.
    
    3. Something to note is that declaring a variable tensor does not automatically initialize the values. 
        1. The values need to be initialized explicitly when starting a session using one of the following: \
        `tf.global_variables_initializer().run()`\
        `session.run(tf.global_variables_initializer())`
        
        2. we have chosen the latter.
    
    4. In Python-based TensorFlow, a `tf.Variable` instance has the same lifecycle as other Python objects. 
    
    5. When there are no references to a variable it is automatically deallocated.
    
    6. Variables can also be named (`a = tf.Variable(..., name="myNameIsJohnCena")`) which can help you track and debug them. 
    
    7. You can give two variables the same name.
    
    8. Variable names are preserved when saving and loading models. 
    
        1. By default, variables in models will acquire unique variable names automatically, so you don't need to assign them yourself unless you want to.
        
    9. You can turn off gradients for a variable by setting `trainable` to false at creation. 
         1. An example of a variable that would not need gradients is a training step counter.

In [None]:
weights = {
    "disc_H": tf.Variable(xavier_init([image_dim, disc_hidden_dim])),
    "disc_final": tf.Variable(xavier_init([disc_hidden_dim, 1])),
    "gen_H": tf.Variable(xavier_init([z_noise_dim, gen_hidden_dim])),
    "gen_final": tf.Variable(xavier_init([gen_hidden_dim, image_dim]))
}

biases = {
    "disc_H": tf.Variable(xavier_init([disc_hidden_dim])),
    "disc_final": tf.Variable(xavier_init([1])),
    "gen_H": tf.Variable(xavier_init([gen_hidden_dim])),
    "gen_final": tf.Variable(xavier_init([image_dim]))
}

**\_H** means that the parameter is defined for the hidden layer of that model(generator/discriminator)\
**\_final** means that the parameter is defined for the final/output layer of that model(generator/discriminator)

In [None]:
print(weights["disc_H"], "\n\n\nshape of discriminator's hidden layer weights is :", weights["disc_H"].shape)

<font size="4">Placeholder</font>
1. A variable that we can declare, but don't need to assign any value to immediately

2. allows us to create our operations and build our computation graph, without needing the data

3. we then feed data into the graph through these placeholders.

4. it is *a place in memory where we will store value later on.*

5. its value is defined in the `feed_dict` argument that is provided in the `session.run` function

6. placeholders don't need to be statically sized. Since the network only works for a fixed feature dimension, we provide the shape such that the first element (the batch size) can be anything, but the last element has to be `z_noise_dim` for `z_input` and `image_dim` for `x_input`.

`tf.name_scope` 
1. So as the name suggests, the scope functions create a scope for the names of the ops you create inside. 

2. This has an effect on how you refer to tensors, on reuse, on how the graph shows in TensorBoard and so on.

3. will make the name of all operations added within it have a prefix.
    1. `Generator(x)` and `Discriminator(x)` are called ops/operations
    
4. usually used to group some variables together in an op. 

In [None]:
def Discriminator(x):
    hidden_layer = tf.nn.relu(tf.add(tf.matmul(x, weights["disc_H"]), biases["disc_H"]))
    final_layer = tf.add(tf.matmul(hidden_layer, weights["disc_final"]), biases["disc_final"])
    disc_output = tf.nn.sigmoid(final_layer)
    return final_layer, disc_output

def Generator(x):
    hidden_layer = tf.nn.relu(tf.add(tf.matmul(x, weights["gen_H"]), biases["gen_H"]))
    final_layer = tf.add(tf.matmul(hidden_layer, weights["gen_final"]), biases["gen_final"])
    gen_output = tf.nn.sigmoid(final_layer)
    return gen_output

# placeholders for external inputs
z_input = tf.placeholder(tf.float32, shape=[None, z_noise_dim], name="input_noise")
x_input = tf.placeholder(tf.float32, shape=[None, image_dim], name="real_input")

# build the generator network
with tf.name_scope("Generator") as scope:
    output_gen = Generator(z_input) # implements G(z)
    
# build the discriminator network
with tf.name_scope("Discriminator") as scope:
    real_output1_disc, real_output_disc = Discriminator(x_input) # implements D(x)
    fake_output1_disc, fake_output_disc = Discriminator(output_gen) # implements D(G(z))

In [None]:
with tf.name_scope("Discriminator_Loss") as scope:
    # expectation value of log(D(x)) + log(1 - D(G(z)))
    # 0.0001 is added inside each log so that it never evaluates log(0)
    
    # a negative sign is taken here, since tensorflow minimizes the loss by default, but we want to maximise this objective
    # minimizing the negated objective is equivalent to maximizing the objective itself
    Discriminator_Loss = -tf.reduce_mean(tf.log(real_output_disc+0.0001)+tf.log(1.-fake_output_disc+0.0001))
    
    
with tf.name_scope("Generator_Loss") as scope:
    # expectation value of log(D(G(z))) is maximised, instead of minimising log(1 - D(G(z))),
    # since the latter has a very small gradient in the early iterations (vanishing gradients)
    # 0.0001 is added so that the log never evaluates log(0)
    Generator_Loss = -tf.reduce_mean(tf.log(fake_output_disc+0.0001))
    
disc_loss_total = tf.summary.scalar("Disc_Total_loss", Discriminator_Loss)
gen_loss_total = tf.summary.scalar("Gen_loss", Generator_Loss)

`var_list`: Optional list or tuple of `tf.Variable` to update to minimize loss.

`minimize`
* Add operations to minimize `loss`(here the value of this argument is `Discriminator_Loss`/`Generator_Loss`) by updating `var_list`.

* This method simply combines calls `compute_gradients()` and `apply_gradients()`. 

In [None]:
generator_var = [weights["gen_H"], weights["gen_final"], biases["gen_H"], biases["gen_final"]]
discriminator_var = [weights["disc_H"], weights["disc_final"], biases["disc_H"], biases["disc_final"]]

# define the optimizer
with tf.name_scope("Optimizer_Discriminator") as scope:
    # var_list: update only those variables in this list
    # hence generator is kept constant while training the discriminator
    Discriminator_optimize = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(Discriminator_Loss, var_list=discriminator_var)
    

with tf.name_scope("Optimizer_Generator") as scope:
    Generator_optimize = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(Generator_Loss, var_list=generator_var)

`tf.summary.FileWriter`
1. The FileWriter class provides a mechanism to create an event file in a given directory and add summaries and events to it. 

2. The class updates the file contents **asynchronously**. 

    1. This allows a training program to call methods to add data to the file directly from the training loop, without slowing down training.
    
3. the method `add_summary()`
    
    1. Adds a Summary protocol buffer to the event file.
    
    2. This method wraps the provided summary in an Event protocol buffer and adds it to the event file.
    
    3. Can pass the result of evaluating any summary op, using `tf.Session.run`(<font color="purple">we have done this</font>) or `tf.Tensor.eval`, to this function. 
    
    4. `global_step`: Number. Optional global step value to record with the summary.

* as mentioned earlier, `init` holds the op returned by `global_variables_initializer`, which is then executed by `session.run`



In [None]:
init = tf.global_variables_initializer()

sess = tf.Session()
sess.run(init)
writer = tf.summary.FileWriter("./log", sess.graph)

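# note: each "epoch" here is one training step on a single mini-batch
# (one discriminator update followed by one generator update)
for epoch in range(epochs):
    x_batch = fetchBatch()
    
    # generate noise to feed to the generator
    z_noise = np.random.uniform(-1., 1., size=[batch_size, z_noise_dim])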
    _, disc_loss_epoch = sess.run([Discriminator_optimize, Discriminator_Loss], feed_dict = {x_input: x_batch, z_input: z_noise})
    _, gen_loss_epoch = sess.run([Generator_optimize, Generator_Loss], feed_dict = {z_input: z_noise})
    
    # run discriminator summary
    summary_disc_loss = sess.run(disc_loss_total, feed_dict = {x_input: x_batch, z_input: z_noise})
    
    # add discriminator summary
    writer.add_summary(summary_disc_loss, epoch)
    
    # run generator summary
    summary_gen_loss = sess.run(gen_loss_total, feed_dict = {z_input: z_noise})
    
    # add generator summary
    writer.add_summary(summary_gen_loss, epoch)
    
    if epoch % 2000 == 0:
        print("Steps: {0}, generator loss : {1}, discriminator loss : {2}".format(epoch, gen_loss_epoch, disc_loss_epoch))

after training the GAN, we generate images from the generator

In [None]:
# n = 6
# canvas = np.empty((28*n, 28*n))

# for i in range(n):
#     # noise input
#     z_noise = np.random.uniform(-1., 1., size=[batch_size, z_noise_dim])
#     g = sess.run(output_gen, feed_dict = {z_input: z_noise})
    
#     # reverse colors for better display
#     g = -1 * (g-1)
    
#     for j in range(n):
#         # draw the generated images
#         canvas[i*28:(i+1)*28, j*28:(j+1)*28] = g[j].reshape([28, 28])

# plt.figure(figsize=(n, n))
# plt.imshow(canvas, origin="upper", cmap="gray")
# plt.show()

z_noise = np.random.uniform(-1., 1., size=[batch_size, z_noise_dim])
g = sess.run(output_gen, feed_dict = {z_input: z_noise})
print(g.shape) # 128,784

# extract first image generated
im1 = g[0]
im1 = im1.reshape((28, 28))
print(im1.shape) # 28, 28

plt.imshow(im1, origin="upper", cmap="gray")
plt.show()

In [None]:
im1 = resize(im1, (256, 256))
fig = plt.figure(figsize=(3, 3))
plt.imshow(im1, origin="upper", cmap="Greys")
plt.axis("off")
plt.show()