# Style-GAN++ — Scribblings

In [1]:
import tensorflow as tf
from tensorflow.python.keras.layers.convolutional import Conv
from tensorflow.python import nn

## Revised style block

In [45]:
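class StyleBlock(tf.keras.models.Model):
    def __init__(self, conv_kwargs, dense_kwargs):
        # The base initialiser must run before weights or sub-layers are created.
        super().__init__()
        self.conv = tf.keras.layers.Conv2D(**conv_kwargs)
        self.bias = self.add_weight(shape=(conv_kwargs['filters'],),
                                    initializer='zeros',
                                    trainable=True)
        self.dense_mod = tf.keras.layers.Dense(**dense_kwargs)

    def call(self, x, style, noise):
        # Per-sample style scales, assumed to have shape (batch, filters).
        y = self.dense_mod(style)
        # Modulate: broadcast the scales over the spatial axes (channels_last).
        x = self.conv(x) * y[:, tf.newaxis, tf.newaxis, :]
        # Normalise each feature map by its std over the spatial axes.
        # TODO: confirm these are the right axes
        x = x / tf.math.reduce_std(x, axis=[1, 2], keepdims=True)
        x = tf.nn.bias_add(x, self.bias)
        x = x + noise
        return x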

Let there be $C$ channels in the input and $F$ filters in the output. Let the kernel of the $f$-th filter, $w_f$, have shape $C \times H \times W$, and let $p$ be a $C \times H \times W$ patch of the input. The output pixel $a_{fi'j'}$ in the $f$-th channel, obtained by convolving this patch with the $f$-th kernel, is

$$a_{fi'j'} = \sum_{c=0}^{C-1}\sum_{i=0}^{H-1}\sum_{j=0}^{W-1}w_{fcij}p_{cij}$$
    
In the modulation step we scale the output feature maps of the conv layer by an $F$-dimensional style vector $y$. We can achieve the same result by scaling the conv kernel instead: rather than computing $a'_{fi'j'} = y_fa_{fi'j'}$ we use

$$
w'_f = y_fw_f \\
a'_{fi'j'} = \sum_{c=0}^{C-1}\sum_{i=0}^{H-1}\sum_{j=0}^{W-1}w'_{fcij}p_{cij} \\
$$
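
As a quick sanity check of this equivalence, the following pure-Python sketch computes one output pixel per filter both ways. The sizes, the random values and the helper `dot` are made up for illustration, and the kernel and patch are flattened to vectors of length $C \cdot H \cdot W$.

```python
import math
import random

random.seed(0)

C, H, W, F = 3, 3, 3, 4          # illustrative sizes, not from the notes
n = C * H * W                    # number of entries in one kernel / patch

# One flattened kernel w_f per output filter, one flattened input patch p,
# and one style scale y_f per output filter.
w = [[random.gauss(0.0, 1.0) for _ in range(n)] for _ in range(F)]
p = [random.gauss(0.0, 1.0) for _ in range(n)]
y = [random.uniform(0.5, 2.0) for _ in range(F)]


def dot(u, v):
    """Sum of elementwise products, i.e. the triple sum over c, i, j."""
    return sum(a * b for a, b in zip(u, v))


# Option 1: convolve, then scale the output: a'_f = y_f * a_f.
scaled_outputs = [y[f] * dot(w[f], p) for f in range(F)]

# Option 2: scale the kernel first, w'_f = y_f * w_f, then convolve.
scaled_kernels = [dot([y[f] * wk for wk in w[f]], p) for f in range(F)]

# The two results agree up to floating-point rounding.
same = all(math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)
           for a, b in zip(scaled_outputs, scaled_kernels))
print(same)
```

The check only uses linearity of the sum, so it holds for any patch, not just random ones.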

If the $p_{cij}$ are i.i.d. with $p_{cij} \sim \mathcal{N}(\mu, 1)$ (std of 1), then the variance and std of the $f$-th channel activations are

$$\text{Var}(a_f) = \text{Var}\left(\sum_{cij}w_{fcij}p_{cij}\right)
= \sum_{cij}\text{Var}(w_{fcij}p_{cij})
= \sum_{cij}w_{fcij}^2\text{Var}(p_{cij})
= \sum_{cij}w_{fcij}^2
\\\sigma_f = \sqrt{\text{Var}(a_f)} = \sqrt{\sum_{cij}w_{fcij}^2}$$

Since the norm step divides the feature maps by their std, under the assumption $p_{cij} \sim \mathcal{N}(\mu, 1)$ it is equivalent to dividing the outputs by the $L_2$ norm of the weights. After modulation the same argument applies with $w'_f$ in place of $w_f$.

Similarly, the variance scaling can be folded into the kernel:

$$
w''_f = w'_f / ||w'_f||_2\\
a''_{fi'j'} = \sum_{c=0}^{C-1}\sum_{i=0}^{H-1}\sum_{j=0}^{W-1}w''_{fcij}p_{cij} \\
$$
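
The demodulation step can be checked numerically in the same spirit. This is a minimal sketch under the i.i.d. unit-variance assumption above (with $\mu = 0$): the kernel size, the style scale `y_f` and the sample count are arbitrary illustrative choices.

```python
import math
import random

random.seed(1)

n = 27                      # C * H * W for an illustrative 3 x 3 x 3 kernel
y_f = 1.7                   # illustrative style scale for this filter

w_f = [random.gauss(0.0, 1.0) for _ in range(n)]
w_mod = [y_f * wk for wk in w_f]                   # w'_f = y_f * w_f
l2 = math.sqrt(sum(wk * wk for wk in w_mod))       # ||w'_f||_2
w_demod = [wk / l2 for wk in w_mod]                # w''_f = w'_f / ||w'_f||_2

# The demodulated kernel has unit L2 norm by construction.
demod_norm = math.sqrt(sum(wk * wk for wk in w_demod))

# Empirical std of a''_f over many i.i.d. unit-variance patches.
n_samples = 20000
acts = []
for _ in range(n_samples):
    p = [random.gauss(0.0, 1.0) for _ in range(n)]
    acts.append(sum(wk * pk for wk, pk in zip(w_demod, p)))
mean = sum(acts) / n_samples
std = math.sqrt(sum((a - mean) ** 2 for a in acts) / n_samples)

print(demod_norm, std)
```

The empirical std should come out close to 1, whatever value of `y_f` is used, because the style scale cancels in $w'_f / ||w'_f||_2$.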


In [222]:
class StyleConv(Conv):

    def __init__(self, *args, **kwargs):
        # Pop the custom argument so the base class does not see it, and
        # set attributes only once the base layer has been initialised.
        eps = kwargs.pop('eps')
        super(StyleConv, self).__init__(*args, **kwargs)
        self.eps = eps

    def _mod_demod_kernel(self, style):
        # Modulate: style is assumed to have shape (filters,), so it
        # broadcasts over the last (output-filter) axis of the kernel.
        kernel = self.kernel * style
        # Demodulate: divide each filter by the L2 norm of its flattened weights.
        l2_norm = tf.norm(tf.reshape(kernel, [-1, tf.shape(kernel)[-1]]),
                          axis=0, ord=2, keepdims=True)
        kernel = kernel / (l2_norm + self.eps)
        return kernel

    def call(self, x, style, noise):
        outputs = self._convolution_op(x, self._mod_demod_kernel(style))
        outputs = outputs + noise
        if self.use_bias:
            if self.data_format == 'channels_first':
                if self.rank == 1:
                    # nn.bias_add does not accept a 1D input tensor.
                    bias = tf.reshape(self.bias, (1, self.filters, 1))
                    outputs += bias
                else:
                    outputs = nn.bias_add(outputs, self.bias, data_format='NCHW')
            else:
                outputs = nn.bias_add(outputs, self.bias, data_format='NHWC')

        if self.activation is not None:
            return self.activation(outputs)

        return outputs


class StyleConv2D(StyleConv):

    def __init__(self, *args, **kwargs):
        kwargs['rank'] = 2
        super(StyleConv2D, self).__init__(*args, **kwargs)

## Lazy regularisation
- To save computation, the reg terms are evaluated only once every $k$ training iterations
- For $k$ iterations the usual loss is used
- Then for 1 iteration the reg loss is used
- The Adam state is shared between the two, with its hyperparameters rescaled because every $k$ main iterations now take $k + 1$ optimiser steps

In [228]:
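def get_adam_lazy_reg(beta1, beta2, lam, n_iters):
    # lam is assumed to be the base learning rate; n_iters is k, the number
    # of main-loss iterations between regularisation steps.
    factor = n_iters / (n_iters + 1)
    return tf.optimizers.Adam(learning_rate=factor * lam,
                              beta_1=beta1**factor,
                              beta_2=beta2**factor)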
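
To make the schedule and the rescaling concrete, here is a framework-free sketch. `lazy_reg_schedule` and `adjust_adam` are hypothetical helper names, and the base learning rate 0.002 and $k = 16$ are placeholder values chosen for illustration; $\beta_1 = 0$ and $\beta_2 = 0.99$ are the values listed under Architecture.

```python
def lazy_reg_schedule(n_steps, k):
    """Yield 'main' for k steps, then 'reg' for one step, repeating."""
    for step in range(n_steps):
        yield 'reg' if step % (k + 1) == k else 'main'


def adjust_adam(lam, beta1, beta2, k):
    """Rescale Adam hyperparameters by c = k / (k + 1)."""
    c = k / (k + 1)
    return c * lam, beta1 ** c, beta2 ** c


# With k = 2, every third optimiser step is a regularisation step.
schedule = list(lazy_reg_schedule(6, 2))
print(schedule)

# Placeholder base learning rate and k; betas as listed under Architecture.
lr, b1, b2 = adjust_adam(lam=0.002, beta1=0.0, beta2=0.99, k=16)
print(lr, b1, b2)
```

Both the learning rate and $\beta_2$ move only slightly for large $k$, since $c = k/(k+1)$ is then close to 1.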

## Path length regularisation

I think that, since the $\mathbf{y}$ are described as random images, we have $\nabla_\mathbf{w}(g(\mathbf{w})\cdot \mathbf{y}) = \mathbf{J}_\mathbf{w}^T\mathbf{y} + (\nabla_\mathbf{w}\mathbf{y})^T g(\mathbf{w}) = \mathbf{J}_\mathbf{w}^T\mathbf{y}$: if $\mathbf{y}$ is a random image drawn from $\mathcal{N}(0,\mathbf{I})$ independently of $\mathbf{w}$, its gradient with respect to $\mathbf{w}$ is 0.

This contrivance is used because it gives us $\mathbf{J}_\mathbf{w}^T\mathbf{y}$ without having to form the full Jacobian.
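
The identity can be checked on a toy example. In this sketch `g` is a made-up two-parameter function standing in for the generator, and the gradient of the scalar $g(\mathbf{w})\cdot\mathbf{y}$ is taken by central finite differences; an autodiff framework would obtain the same vector from a single backward pass.

```python
import math


def g(w):
    """Toy 'generator': maps a 2-vector w to a 4-pixel 'image'."""
    w0, w1 = w
    return [w0 * w1, math.sin(w0), w1 ** 2, w0 + 3.0 * w1]


def jacobian(w):
    """Analytic Jacobian of g, shape 4 x 2 (rows: pixels, columns: w)."""
    w0, w1 = w
    return [[w1, w0],
            [math.cos(w0), 0.0],
            [0.0, 2.0 * w1],
            [1.0, 3.0]]


def grad_of_projection(w, y, h=1e-6):
    """Central-difference gradient of the scalar g(w) . y with respect to w."""
    def scalar(v):
        return sum(gi * yi for gi, yi in zip(g(v), y))

    grad = []
    for k in range(len(w)):
        up = list(w)
        up[k] += h
        down = list(w)
        down[k] -= h
        grad.append((scalar(up) - scalar(down)) / (2.0 * h))
    return grad


w = [0.3, -1.2]
y = [0.5, -1.0, 2.0, 0.25]      # a fixed 'random image', held constant

# Explicit J^T y, built from the full Jacobian.
J = jacobian(w)
jt_y = [sum(J[b][a] * y[b] for b in range(len(y))) for a in range(len(w))]

# The same vector obtained as a gradient, without forming J.
fd = grad_of_projection(w, y)
print(jt_y, fd)
```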

Writing $\mathbf{J}_{\mathbf{w},ba} = \partial g_b / \partial \mathbf{w}_a$, so that $a$ indexes the components of $\mathbf{w}$ and $b$ the pixels, the expected value of $\lVert \mathbf{J}_\mathbf{w}^T\mathbf{y} \rVert_2^2$ is:


$$E_{\mathbf{y}}\left[\left\lVert\mathbf{J}_\mathbf{w}^T\mathbf{y}\right\rVert_2^2\right]
= E_{\mathbf{y}}\left[\sum_a\left(\sum_b{\mathbf{J}_{\mathbf{w},ba}}\mathbf{y}_b\right)^2\right]
= E_{\mathbf{y}}\left[\sum_a\sum_b\sum_{b'}{\mathbf{J}_{\mathbf{w},ba}}\mathbf{y}_b
{\mathbf{J}_{\mathbf{w},b'a}}\mathbf{y}_{b'}\right]
\\ = \sum_a\sum_b {\mathbf{J}_{\mathbf{w},ba}}^2 E\left[\mathbf{y}_b^2\right]
+
\sum_a\sum_b\sum_{b' \neq b}{\mathbf{J}_{\mathbf{w},ba}}
{\mathbf{J}_{\mathbf{w},b'a}}E\left[\mathbf{y}_{b}\right]E\left[\mathbf{y}_{b'}\right]
\\ = \sum_a\sum_b {\mathbf{J}_{\mathbf{w},ba}}^2 = \text{tr}\left({\mathbf{J}_{\mathbf{w}}}{\mathbf{J}_{\mathbf{w}}}^T\right)
$$

Since the elements of $\mathbf{y}$ are independent, the expectation of a product of two distinct elements factorises into the product of their expectations, which is why the cross terms vanish. We also rely on the following:

$$E[\mathbf{y}_b] = 0 \implies E[\mathbf{y}_b^2] = \text{Var}(\mathbf{y}_b) = 1$$
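
A Monte Carlo estimate gives a quick check of this result. The sketch below uses a small made-up Jacobian (3 pixels by 2 latent dimensions) and an arbitrary sample count, and compares the sample mean of $\lVert \mathbf{J}^T\mathbf{y} \rVert_2^2$ with the sum of squared Jacobian entries.

```python
import random

random.seed(0)

# Illustrative Jacobian: rows are output pixels, columns are latent dimensions.
J = [[1.0, 2.0],
     [0.0, -1.0],
     [3.0, 0.5]]
n_pixels, n_latent = len(J), len(J[0])

# tr(J J^T) equals the sum of squared entries of J.
frobenius_sq = sum(x * x for row in J for x in row)

n_samples = 50000
total = 0.0
for _ in range(n_samples):
    # y ~ N(0, I): one independent standard normal entry per pixel.
    y = [random.gauss(0.0, 1.0) for _ in range(n_pixels)]
    jt_y = [sum(J[b][a] * y[b] for b in range(n_pixels)) for a in range(n_latent)]
    total += sum(v * v for v in jt_y)
estimate = total / n_samples

print(frobenius_sq, estimate)
```

The estimate fluctuates from run to run with a different seed, but it should sit within a few percent of `frobenius_sq` at this sample size.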

Minimising the above makes the elements of the Jacobian small. In practice a target value $a$ is subtracted from the norm before squaring:

$$E_{\mathbf{y}}\left[\left(\left\lVert\mathbf{J}_\mathbf{w}^T\mathbf{y}\right\rVert_2 - a\right)^2\right]$$

The value $a$ is set to the exponential moving average of $\left\lVert\mathbf{J}_\mathbf{w}^T\mathbf{y}\right\rVert_2$. Let us think about how this regularisation might work:

- At any step $\mathbf{J}_\mathbf{w}$ depends on the weights at that point.
- Say that at step $t$ the norm $\left\lVert\mathbf{J}_\mathbf{w}^T\mathbf{y}\right\rVert_2$ is quite different from $a$, which pushes up the loss. Simplistically, this encourages the weights to move the norm towards $a$.
- The weights will also be influenced by the other losses.
- At the next step $a$ has moved a little towards the recent values of the norm, so if the norm has decreased too much, the step after will push it up again.
- Possibly the term will become more stable, staying near some value of $a$, as the network as a whole becomes more stable.
- What this value is depends on the weights that work well for the task as a whole.
- However, the effect of the regularisation is to exert some control on the Jacobian.
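
The interaction between the moving target and the penalty can be sketched as follows. `PathLengthPenalty` is a hypothetical helper, the decay value is a placeholder, and updating $a$ before computing the penalty is one possible ordering rather than a confirmed detail of any reference implementation.

```python
class PathLengthPenalty:
    """Tracks a as an EMA of the norms and returns the mean (norm - a)^2."""

    def __init__(self, decay=0.99):
        self.decay = decay      # placeholder EMA decay
        self.a = 0.0            # running target for ||J^T y||_2

    def __call__(self, norms):
        # norms: one value of ||J^T y||_2 per sample in the batch.
        mean_norm = sum(norms) / len(norms)
        # Move the target a small step towards the current batch mean.
        self.a = self.decay * self.a + (1.0 - self.decay) * mean_norm
        # Penalise each sample's deviation from the target.
        return sum((n - self.a) ** 2 for n in norms) / len(norms)


penalty = PathLengthPenalty(decay=0.5)
first = penalty([2.0, 4.0])     # target moves from 0.0 to 1.5
target_after_first = penalty.a
second = penalty([1.5, 1.5])    # norms already sit on the target
print(first, target_after_first, second)
```

Because $a$ follows the norms rather than being fixed in advance, the penalty mainly discourages the norm from varying between samples and steps, and leaves its overall scale to be set by training.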

## Architecture

- Dimensionality of $Z$ and $W$: 512
- Mapping network architecture 
    - 8 fully connected layers
    - 100$\times$ lower lr
- Architecture
    - leaky ReLU, $\alpha=0.2$
    - bilinear filtering in all up/downsampling layers (?)
    - minibatch std layer at end of discriminator
- Training
    - equalized lr for all trainable params (?)
    - EMA of generator weights
    - style mixing reg
    - non-saturating logistic loss with $R_1$ reg
    - Adam optimiser ($\beta_1 = 0, \beta_2 = 0.99, \epsilon = 10^{-8}$)
    - batch_size 32
    - 8 GPUs