# CUT Architecture



Explained in **section 4** of [Park et al.](https://arxiv.org/abs/2007.15651), the architecture of CUT is largely the same as that of CycleGAN (see [Zhu et al.](https://arxiv.org/abs/1703.10593)).

From CycleGAN, CUT reuses:
- the ResNet-based generator, a variation of that from [Johnson et al.](https://arxiv.org/abs/1603.08155)
- the 70x70 PatchGAN discriminator
- the least-squares adversarial GAN loss, which CycleGAN already uses in place of binary cross entropy

CUT differs from CycleGAN by:
- using PatchNCE loss instead of $\ell_1$ cycle-consistency loss




## Generator

The generator is based on the ResNet-based architecture from [Johnson et al.](https://arxiv.org/abs/1603.08155) with 9 residual blocks (assuming 256x256 training images). The generator is partitioned into two sub-networks, an encoder $G_{enc}$ and a decoder $G_{dec}$.


### Encoder

$G_{enc}$ consists of the following layers:
1. Reflection Padding
    - one (3, 3, 3, 3) [reflection padding](https://pytorch.org/docs/master/generated/torch.nn.ReflectionPad2d.html) layer applied to the input image
2. Convolutional Block (`c7s1-64`)
    - one (7×7 Convolution)-(InstanceNorm)-(ReLU) block, with 64 filters, stride 1, and **no padding**
3. Downsampling Block (`d128`)
    - one (3×3 Convolution)-(InstanceNorm)-(ReLU) block, with 128 filters, stride 2, and zero-padding 1
4. Downsampling Block (`d256`)
    - one (3×3 Convolution)-(InstanceNorm)-(ReLU) block, with 256 filters, stride 2, and zero-padding 1
5. Residual Block (`R256`)
    - nine (3x3 Convolution)-(BatchNorm)-(ReLU)-(3x3 Convolution)-(BatchNorm) blocks, with 256 filters, stride 1, and **no padding**
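
The spatial sizes produced by the first four encoder stages can be checked with the standard convolution output-size formula. The sketch below is only an illustration: it assumes a 256x256 input, the helper name `conv_out` is hypothetical, and the residual blocks are left out.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


size = 256                                             # assumed input height/width
sizes = []

size += 2 * 3                                          # reflection padding of 3 per side
sizes.append(size)
size = conv_out(size, kernel=7, stride=1)              # c7s1-64, no padding
sizes.append(size)
size = conv_out(size, kernel=3, stride=2, padding=1)   # d128
sizes.append(size)
size = conv_out(size, kernel=3, stride=2, padding=1)   # d256
sizes.append(size)

print(sizes)  # [262, 256, 128, 64]
```

The reflection padding of 3 exactly offsets the shrinkage of the unpadded 7×7 convolution, so only the two stride-2 blocks change the resolution.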


### Decoder
$G_{dec}$ consists of the following layers:
1. Residual Block (`R256`)
    - nine (3x3 Convolution)-(BatchNorm)-(ReLU)-(3x3 Convolution)-(BatchNorm) blocks, with 256 filters, stride 1, and **no padding**
2. Upsampling Block (`u128`)
    - one (3×3 Convolution)-(InstanceNorm)-(ReLU) block, with 128 filters, stride **1/2**, and zero-padding 1
3. Upsampling Block (`u64`)
    - one (3×3 Convolution)-(InstanceNorm)-(ReLU) block, with 64 filters, stride **1/2**, and zero-padding 1
4. Convolutional Block (`c7s1-3`)
    - one (7×7 Convolution)-(InstanceNorm)-(ReLU) block, with 3 filters, stride 1, and **no padding**
5. Reflection Padding
    - one (3, 3, 3, 3) reflection padding layer
6. (Scaled) Tanh Activation
    - one Tanh activation layer, whose output in [-1, 1] is rescaled to the range [0, 255] to form an RGB image

> **NOTE**: The decoder acts as a symmetric reverse of the encoder. The fractionally-strided convolutional blocks in the decoder undo the integer-strided convolutional blocks in the encoder: a stride-1/2 convolution doubles the spatial resolution (for example, from 128x128 to 256x256), just as a stride-2 convolution halves it.
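
A stride-1/2 convolution is usually implemented as a transposed convolution with stride 2. The sketch below checks that the two upsampling blocks restore the resolution lost in the encoder. The helper name `conv_transpose_out` and the `output_padding=1` setting are assumptions (the latter is a common choice for exact doubling), not details stated above.

```python
def conv_transpose_out(size, kernel, stride, padding, output_padding=0):
    """Output length of a transposed convolution along one spatial axis."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


size = 64  # assumed spatial size after the two downsampling blocks (256 -> 128 -> 64)
u128 = conv_transpose_out(size, kernel=3, stride=2, padding=1, output_padding=1)
u64 = conv_transpose_out(u128, kernel=3, stride=2, padding=1, output_padding=1)

print(u128, u64)  # 128 256
```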

The naming convention for the blocks (e.g., `c7s1-3`) is borrowed from Johnson et al.
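
To make the naming convention concrete, here is a small, hypothetical parser that turns a block name into the kernel size, stride, and filter count listed above. The function `parse_block` and the table `BLOCK_KINDS` are illustrative names, not part of any library.

```python
import re

# Hypothetical lookup table: name prefix -> (kind, kernel size, stride)
BLOCK_KINDS = {
    "c7s1-": ("conv", 7, 1),
    "d": ("downsample", 3, 2),
    "R": ("residual", 3, 1),
    "u": ("upsample", 3, 0.5),
}


def parse_block(name):
    """Parse names such as 'c7s1-64', 'd128', 'R256', or 'u128'."""
    match = re.fullmatch(r"(c7s1-|d|R|u)(\d+)", name)
    if match is None:
        raise ValueError(f"unrecognized block name: {name!r}")
    kind, kernel, stride = BLOCK_KINDS[match.group(1)]
    return {
        "kind": kind,
        "kernel": kernel,
        "stride": stride,
        "filters": int(match.group(2)),
    }


for name in ["c7s1-64", "d128", "R256", "u128", "c7s1-3"]:
    print(name, parse_block(name))
```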


<img src="https://i.imgur.com/kBlcix2.png" width="300">

Above is a visual depiction of the residual blocks used for the encoder and decoder. Notice the skip connection from the input to the output of the final BatchNorm. (Source: [Supplementary material](https://web.eecs.umich.edu/~justincj/papers/eccv16/JohnsonECCV16Supplementary.pdf) by Johnson et al.)


## Discriminator

**Section 7.2** of Zhu et al. specifies the PatchGAN architecture:

> For discriminator networks, we use 70 × 70 PatchGAN [22]. Let `Ck` denote a 4 × 4 Convolution-InstanceNorm-LeakyReLU layer with k filters and stride 2. After the last layer, we apply a convolution to produce a 1-dimensional output. We do not use InstanceNorm for the first C64 layer. We use leaky ReLUs with a slope of 0.2. The discriminator architecture is: `C64-C128-C256-C512`

Roughly speaking, the PatchGAN discriminator, created by [Isola et al.](https://arxiv.org/abs/1611.07004), is designed to better capture "high frequency" visual information. $\ell_2$ and $\ell_1$ losses capture low-frequency information well, but tend to result in blurry images. See this [Quora post](https://qr.ae/pN2yFq) for an explanation of frequency in image processing.

PatchGAN instead attempts to classify whether each patch (typically of size 70x70) is real or fake. Therefore, PatchGAN can be thought of as measuring style/texture loss. Isola et al. found that using PatchGAN generates higher quality images, runs faster, requires fewer parameters, and works on images of arbitrary size.
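
The "70x70" figure is the receptive field of one output unit of the discriminator. It can be reproduced with the usual receptive-field recurrence, as in the sketch below. The stride pattern is an assumption: the 70x70 value follows if `C512` and the final 1-channel convolution use stride 1 (a common choice in public implementations), whereas the quoted text states stride 2 for every `Ck`, which would give a larger field. The function name is hypothetical.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of a stack of (kernel, stride) convolutions."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # each layer widens the field by (k - 1) input-space steps
        jump *= stride              # distance between adjacent units, measured in input pixels
    return rf


# Assumed strides: C64, C128, C256 with stride 2; C512 and the final conv with stride 1
assumed_layers = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]
print(receptive_field(assumed_layers))   # 70

# Stride 2 for all four Ck layers, read literally from the quote
literal_layers = [(4, 2), (4, 2), (4, 2), (4, 2), (4, 1)]
print(receptive_field(literal_layers))   # 94
```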

## Multilayer Patchwise Contrastive Loss (PatchNCE)

The newly introduced concept of patchwise contrastive loss is computed for a network consisting of two sub-networks: (i) the encoder $G_{enc}$, a subset of layers of $G$, and (ii) $H$, a two-layer MLP (multi-layer perceptron, i.e., fully-connected network). PatchNCE loss replaces the $\ell_1$ cycle-consistency loss from CycleGAN. 

The loss is named as such (NCE, short for "Noise Contrastive Estimation") because:

> "These methods make use of noise contrastive estimation, learning an embedding where associated signals are brought together, in contrast to other samples in the dataset" (Park et al. 4).

In other words, we maximize the similarity between each output patch and the corresponding input patch (the positive), and minimize its similarity to *contrastive* patches (the negatives: random non-corresponding patches from the same input). As a rough example, the head of a generated zebra should be more similar to the head of the input horse it was translated from than to irrelevant patches such as grass or sky.

To compute PatchNCE loss, we use softmax cross-entropy, which measures the probability of selecting the positive patch over the negative patches for a given output patch. The positive and negative patches are sampled from the latent space $H(G_{enc}(x))$, and the output patches are sampled from the latent space $H(G_{enc}(G(x)))$.
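
For a single output patch, this softmax cross-entropy can be sketched in plain Python as follows. This is a minimal illustration, not the reference implementation: the function names are hypothetical, the features are assumed to be non-zero vectors compared by cosine similarity, and the temperature `tau=0.07` is an assumed value.

```python
import math


def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def patch_nce(query, positive, negatives, tau=0.07):
    """Cross-entropy of picking `positive` (index 0) among positive + negatives."""
    q = normalize(query)
    logits = [dot(q, normalize(positive)) / tau]
    logits += [dot(q, normalize(n)) / tau for n in negatives]
    # log-sum-exp with the maximum subtracted for numerical stability
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]


# The output patch matches its positive: the loss is close to zero
low = patch_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
# The output patch matches a negative instead: the loss is large
high = patch_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
print(low < high)  # True
```

In the full loss, this quantity would be averaged over all sampled patches and summed over the chosen layers.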

**Appendix C, Section 1** explains that 256 random samples are drawn from each of the following layers (the receptive field size of each is given in parentheses):
1. RGB pixels from the initial image input (1x1)
2. `d128`, a downsampling convolution (9x9)
3. `d256`, a downsampling convolution (15x15)
4. the first of the nine `R256` residual blocks (35x35)
5. the fifth `R256` residual block (99x99)

For example, for the first feature extraction, since the input image has 3 channels (from RGB), and the receptive field size per sample is 1x1, we have a tensor of shape (256, 3, 1, 1).
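
The sampling step can be sketched with nested lists standing in for a feature map. The helpers `sample_patch_ids` and `gather` are hypothetical names, and the sketch assumes that the same sampled locations are reused on the output features so that each output patch lines up with its input patch.

```python
import random


def sample_patch_ids(height, width, num_patches=256, seed=None):
    """Pick distinct flat spatial indices from a height x width grid."""
    rng = random.Random(seed)
    total = height * width
    return rng.sample(range(total), min(num_patches, total))


def gather(feature_map, patch_ids, width):
    """feature_map[c][row][col] -> one channel vector per sampled location."""
    return [
        [channel[i // width][i % width] for channel in feature_map]
        for i in patch_ids
    ]


height = width = 256
# Toy 3-channel "image": channel c is filled with the constant value c
image = [[[c + 0.0] * width for _ in range(height)] for c in range(3)]

ids = sample_patch_ids(height, width, seed=0)
feats = gather(image, ids, width)
print(len(feats), len(feats[0]))  # 256 3
```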

## Adversarial GAN Loss

$L_{GAN}$, the discriminator-generator adversarial loss, is computed using [least-squares loss](https://arxiv.org/abs/1611.04076) instead of the standard binary cross-entropy loss defined by [Goodfellow et al.](https://papers.nips.cc/paper/5423-generative-adversarial-nets).

The Least-Squares GAN objectives are:
- $\min_D V_{LSGAN}(D) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[D(G(x))^2] + \mathbb{E}_{y \sim p_{\text{data}}(y)}[(D(y) - 1)^2]$

- $\min_G V_{LSGAN}(G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[(D(G(x)) - 1)^2]$
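
As a small numeric illustration of these two objectives, the sketch below evaluates them on a handful of made-up discriminator scores, with the expectations replaced by batch means. The function names and the example scores are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def lsgan_d_loss(d_fake, d_real):
    """Discriminator objective: push D(G(x)) toward 0 and D(y) toward 1."""
    return mean([s ** 2 for s in d_fake]) + mean([(s - 1) ** 2 for s in d_real])


def lsgan_g_loss(d_fake):
    """Generator objective: push D(G(x)) toward 1."""
    return mean([(s - 1) ** 2 for s in d_fake])


d_fake = [0.0, 0.5]   # made-up scores D(G(x)) on generated images
d_real = [1.0, 0.5]   # made-up scores D(y) on real images

print(lsgan_d_loss(d_fake, d_real))  # 0.25
print(lsgan_g_loss(d_fake))          # 0.625
```

With a PatchGAN discriminator, the scores would be the entries of the per-patch output map rather than one value per image.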