# Group 4 Project Version 2 Submission


## Paper Information and Our Information
### **Paper Title:** SeD Semantic-Aware Discriminator for Image Super-Resolution
### **Paper Authors:** Bingchen Li, Xin Li, Hanxin Zhu, Yeying Jin, Ruoyu Feng, Zhizheng Zhang, Zhibo Chen

### **Authors:**  Yigit Ekin and Mustafa Utku Aydogdu
### **Mail:** e270207@metu.edu.tr e270206@metu.edu.tr
### **Paper Description:** 
### Github Repository: [Link](https://github.com/YigitEkin/sed)

In this work, researchers highlight the use of Generative Adversarial Networks (GANs) for image super-resolution tasks, particularly focusing on texture recovery. They note a limitation in existing methods where a single discriminator is employed to teach the super-resolution network the distribution of high-quality real-world images, leading to coarse learning and unexpected output. To address this, they introduce a Semantic-aware Discriminator (SeD), which incorporates image semantics to guide the network in learning fine-grained image distributions.

The SeD leverages image semantics extracted from a trained semantic model, allowing the discriminator to discern real and fake images based on different semantic conditions. By integrating semantic features into the discriminator using spatial cross-attention modules, they aim to enhance the SR network's ability to generate more realistic and visually appealing images. The approach capitalizes on pretrained vision models and extensive datasets to enrich the understanding of image semantics and improve the fidelity of super-resolved images.


Authors suggest that Vanilla Discriminators ignore the important semantics of the inputs, hence giving  semantic features of an image ( extracted via a pretrained network ), enables a better discriminator and hence a better feedback for the generator. The situation is demonstrated better on Figure 1, giving semantic features as condition enables the discriminator to specialize by finding boundaries within classes.



<img src="img/fig1.png" style="width:700px; height:auto; display: flex; justify-content: center"/> <br/> <br/>



A classical setup of Super Resolution GAN Network is as below. The generator network takes the low resolution image as input and produces a high resolution image. The discriminator takes the generated and the ground truth high resolution images and classifies as real or fake. In our setup, our generator takes 64x64 low resolution images and generate 256x256 high resolution images.

<img src="img/sr_gan_setup.png" style="width:700px; height:auto; display: flex; justify-content: center"/> <br/> <br/>


The proposed setup of the paper is as below, now the semantic feature maps of the ground truth high resolution images is also given as input to the discriminator. 


<img src="img/sed_gan_setup.png" style="width:700px; height:auto; display: flex; justify-content: center"/> <br/> <br/>


Authors employ two discriminator types, a patch-based discriminator and a pixel-wise discriminator. 
The proposed architecture of the Patch-wise Semantic Aware Discriminator is shown below. 


<img src="img/patchwise_sed.png" style="width:700px; height:auto; display: flex; justify-content: center"/> <br/> <br/>


<details>
  <summary>Patchwise SED</summary>

  ```python
class DownSampler(nn.Module):
    # Downsamples 4 times in a conv, bn, leaky relu fashion that halves the spatial dimensions in each step and doubles the number of filters
    def __init__(self, input_channels, num_filters=64):
        super().__init__()
        self.conv1 = nn.Conv2d(input_channels, num_filters, kernel_size=4, stride=2, padding=1)
        self.bn1 = nn.BatchNorm2d(num_filters)
        self.leaky_relu = nn.LeakyReLU(0.2)
        
        self.conv2 = nn.Conv2d(num_filters, num_filters * 2, kernel_size=4, stride=2, padding=1)
        self.bn2 = nn.BatchNorm2d(num_filters * 2)
        
        self.conv3 = nn.Conv2d(num_filters * 2, num_filters * 4, kernel_size=4, stride=2, padding=1)
        self.bn3 = nn.BatchNorm2d(num_filters * 4)
        
        self.conv4 = nn.Conv2d(num_filters * 4, num_filters * 8, kernel_size=4, stride=2, padding=1)
        self.bn4 = nn.BatchNorm2d(num_filters * 8)
        
    def forward(self, x):
        x = self.leaky_relu(self.bn1(self.conv1(x)))
        x = self.leaky_relu(self.bn2(self.conv2(x)))
        x = self.leaky_relu(self.bn3(self.conv3(x)))
        x = self.bn4(self.conv4(x))
        return x
    
    
class PatchDiscriminatorWithSeD(nn.Module):
    # PatchGAN discriminator with semantic-aware fusion blocks 
    def __init__(self, input_channels, num_filters=64):
        super().__init__()
        #First downsample the input size from 256x256 to 16x16 to match the semantic feature map size
        self.downsampler = DownSampler(input_channels, num_filters)
        #Use 3 semantic-aware fusion blocks to fuse the semantic feature maps with the downsampled input
        self.semantic_aware_fusion_block1 = SemanticAwareFusionBlock()
        self.semantic_aware_fusion_block2 = SemanticAwareFusionBlock(channel_size_changer_input_nc=1024)
        self.semantic_aware_fusion_block3 = SemanticAwareFusionBlock(channel_size_changer_input_nc=1024)
        #Final convolution to get the output
        self.final_conv = nn.Conv2d(num_filters * 16, 1, kernel_size=4, stride=1, padding=1)
        
    def forward(self, semantic_feature_maps, fs):
        x = self.downsampler(fs)
        x = self.semantic_aware_fusion_block1(semantic_feature_maps, x)
        x = self.semantic_aware_fusion_block2(semantic_feature_maps, x)
        x = self.semantic_aware_fusion_block3(semantic_feature_maps, x)
        x = self.final_conv(x)
        return x
  ```
</details>


The discriminator has a specialized block called Semantic Aware Fusion Block. Semantic Aware Fusion Block takes the ground truth semantic features extracted by CLIP Feature Extractor, and applies cross attention between either ground truth or generated high resolution images as shown below. First the generated (or ground truth) feature maps are processed through normalization and self-attention mechanism , and then cross attention is applied. 


<img src="img/semantic_aware_fb.png" style="width:700px; height:auto; display: flex; justify-content: center"/> <br/> <br/>



<details>
  <summary>Semantic Aware Fusion Block</summary>

  ```python
class SemanticAwareFusionBlock(nn.Module):
    def __init__(self, channel_size_changer_input_nc=512):
        super().__init__()
        self.group_norm = nn.GroupNorm(32, 1024) 

        self.channel_size_changer1 = nn.Conv2d(in_channels=channel_size_changer_input_nc, out_channels=128, kernel_size=1)
        self.reduce_channels2 = nn.Conv2d(in_channels=1024, out_channels=128, kernel_size=1)

        self.layer_norm_1 = nn.LayerNorm(128)
        self.layer_norm_2 = nn.LayerNorm(128)
        self.layer_norm_3 = nn.LayerNorm(128)

        self.self_attention = SelfAttention(128, num_heads=1, dimensionality=128)
        self.cross_attention = CrossAttention(128, heads=1, dim_head=128)

        self.GeLU = nn.GELU()

        #define 1x1 convolutions
        self.increase_channels1 = nn.Conv2d(256, 1024, 1)

    def forward(self, semantic_feature_maps, fs):
        # fs ( or sh for generated) have shape batch, 3 x 16 x 16
        #semantic feature maps  have shape batch x 1024 x 16 x 16
        final_permute_height = semantic_feature_maps.shape[2]
        final_permute_width = semantic_feature_maps.shape[3]
        
        #first handle S_h
        semantic_feature_maps = self.group_norm(semantic_feature_maps)

        #reduce the channel dimensions for the feature maps from 1024 to 128 for computation
        semantic_feature_maps = self.reduce_channels2(semantic_feature_maps)


        # Permute dimensions to rearrange the tensor
        semantic_feature_maps = semantic_feature_maps.permute(0, 2, 3, 1).contiguous().view(semantic_feature_maps.size(0), -1, semantic_feature_maps.size(1))

        #apply layer normalization
        semantic_feature_maps = self.layer_norm_1(semantic_feature_maps)

        #apply self attention
        semantic_feature_maps = self.self_attention(semantic_feature_maps) #returned has shape 1,196,128 for now
        #apply layer normalization
        query = self.layer_norm_2(semantic_feature_maps)

        #now handle fs or  sh
        #reduce the channel dimensions for the sh

        #make number of channels = 128 to be compatible with the semantic feature maps
        fs = self.channel_size_changer1(fs)

        #to use fs as residual, obtain a clone, 
        #note that gradient still accumulates in the original fs, so no problem
        fs_residual = fs.clone()

        #permute the dimensions
        fs = fs.permute(0, 2, 3, 1).contiguous().view(fs.size(0), -1, fs.size(1))

        #apply cross attention, query is the semantic feature maps and fs is the key and value
        out = self.cross_attention(query, fs)

        #apply layer normalization
        out = self.layer_norm_3(out)

        #apply GeLU
        out = self.GeLU(out)

        #permute the dimensions
        out = out.permute(0,2,1).contiguous().view(out.size(0), -1, final_permute_height, final_permute_width)

        #add the residual
        output = torch.cat((out,fs_residual), dim=1)

        #increase the channels back to 1024
        output = self.increase_channels1(output)
    
        return output
```
</details>



The architecture of the CLIP Feature Extractor is demonstrated below. The CLIP Feature Extractor has normally 4 layers, but authors suggest using the outputs of third layer as going further causes loss of spatial information which is problematic while restoring a high resolution image. The architecture is shown below.


<img src="img/clip_feature_extractor.png" style="width:700px; height:auto; display: flex; justify-content: center"/> <br/> <br/>

<details>
  <summary>CLIP Feature Extractor</summary>

  ```python

class CLIPRN50(nn.Module):
    """
    A ResNet class that is similar to torchvision's but contains the following changes:
    - There are now 3 "stem" convolutions as opposed to 1, with an average pool instead of a max pool.
    - Performs anti-aliasing strided convolutions, where an avgpool is prepended to convolutions with stride > 1
    - The final pooling layer is a QKV attention instead of an average pool
    """

    def __init__(self, layers, output_dim, heads, input_resolution=224, width=64):
        super().__init__()
        self.output_dim = output_dim
        self.input_resolution = input_resolution

        # the 3-layer stem
        self.conv1 = nn.Conv2d(3, width // 2, kernel_size=3, stride=2, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(width // 2)
        self.conv2 = nn.Conv2d(width // 2, width // 2, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(width // 2)
        self.conv3 = nn.Conv2d(width // 2, width, kernel_size=3, padding=1, bias=False)
        self.bn3 = nn.BatchNorm2d(width)
        self.avgpool = nn.AvgPool2d(2)
        self.relu = nn.ReLU(inplace=True)

        # residual layers
        self._inplanes = width  
        self.layer1 = self._make_layer(width, layers[0])
        self.layer2 = self._make_layer(width * 2, layers[1], stride=2)
        self.layer3 = self._make_layer(width * 4, layers[2], stride=2)

        embed_dim = width * 32  # the ResNet feature dimension
        self.attnpool = AttentionPool2d(input_resolution // 32, embed_dim, heads, output_dim)

        #add the openai-provided normalization
        #https://github.com/jianjieluo/OpenAI-CLIP-Feature/blob/01269a8fceb540d3b6477b43177ea33845c9514c/clip/clip.py#L82C9-L82C92
        self.preprocess = transforms.Compose([
            transforms.Normalize((0.48145466, 0.4578275, 0.40821073), (0.26862954, 0.26130258, 0.27577711)),
        ])

        #load
        self.ckpt_path = "RN50"
        self.load_ckpt(self.ckpt_path)
        self.freeze()

    def _make_layer(self, planes, blocks, stride=1):
        layers = [Bottleneck(self._inplanes, planes, stride)]

        self._inplanes = planes * Bottleneck.expansion
        for _ in range(1, blocks):
            layers.append(Bottleneck(self._inplanes, planes))

        return nn.Sequential(*layers)

    def forward(self, x):
        def stem(x):
            for conv, bn in [(self.conv1, self.bn1), (self.conv2, self.bn2), (self.conv3, self.bn3)]:
                x = self.relu(bn(conv(x)))
            x = self.avgpool(x)
            return x

        x = x.type(self.conv1.weight.dtype)
        x = self.preprocess(x)
        x = stem(x)
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)

        return x
```
</details>


The pixel-wise discriminator has U-Net architecture, which also employs the Semantic-Aware Fusion Block. The authors use both patch-based and pixel-wise based discriminators to demonstrate effectiveness of the Semantic-Aware Fusion Block. The architecture of the Pixel-wise Semantic Aware Discriminator is shown below.  

<img src="img/pixelwise_sed.png" style="width:700px; height:auto; display: flex; justify-content: center"/> <br/> <br/>

<details>
  <summary>Pixelwise SED</summary>

  ```python

class DownSamplerPx(nn.Module):
    #downsamples 4 times in a conv, bn, leaky relu fashion that halves the spatial dimensions in each step and doubles the number of filters
    def __init__(self, input_channels, num_filters=64):
        super().__init__()
        self.conv1 = nn.Conv2d(input_channels, num_filters, kernel_size=4, stride=2, padding=1)
        self.bn1 = nn.BatchNorm2d(num_filters)
        self.leaky_relu = nn.LeakyReLU(0.2)
        
        self.conv2 = nn.Conv2d(num_filters, num_filters, kernel_size=4, stride=2, padding=1)
        self.bn2 = nn.BatchNorm2d(num_filters)
        
        self.conv3 = nn.Conv2d(num_filters, num_filters, kernel_size=4, stride=2, padding=1)
        self.bn3 = nn.BatchNorm2d(num_filters)
        
        self.conv4 = nn.Conv2d(num_filters, num_filters, kernel_size=4, stride=2, padding=1)
        self.bn4 = nn.BatchNorm2d(num_filters)
        
    def forward(self, x):
        x = self.leaky_relu(self.bn1(self.conv1(x)))
        x = self.leaky_relu(self.bn2(self.conv2(x)))
        x = self.leaky_relu(self.bn3(self.conv3(x)))
        x = self.bn4(self.conv4(x))
        return x
    
class UNetPixelDiscriminatorwithSed(nn.Module):
    def __init__(self, in_channels=3, out_channels=1, num_filters=64):
        super(UNetPixelDiscriminatorwithSed, self).__init__()

        #downsampler takes 256x256 images and downsamples to the 16x16
        #to make dimensionality compatible with semantic feature maps
        self.downsampler = DownSamplerPx(in_channels, num_filters)
        
        # Semantic Aware Fusion Blocks
        self.semantic_aware_fusion_block1 = SemanticAwareFusionBlock(channel_size_changer_input_nc=64)
        self.semantic_aware_fusion_block2 = SemanticAwareFusionBlock(channel_size_changer_input_nc=1024)
        self.semantic_aware_fusion_block3 = SemanticAwareFusionBlock(channel_size_changer_input_nc=1024)
        
        self.upconv1 = nn.Conv2d(1024, 1024, kernel_size=1, stride=1)
        self.upconv2 = nn.Conv2d(1024, 64, kernel_size=1, stride=1)
        self.upconv3 = nn.Conv2d(64, 3, kernel_size=1, stride=1)


    def forward(self,semantic_feature_maps, fs):
        x = self.downsampler(fs)
        enc1 = self.semantic_aware_fusion_block1(semantic_feature_maps, x)
        enc2 = self.semantic_aware_fusion_block2(semantic_feature_maps, enc1)
        enc3 = self.semantic_aware_fusion_block3(semantic_feature_maps, enc2)
        
        dec = self.upconv1(enc3 + enc2)
        dec = self.upconv2(dec + enc1)
        dec = self.upconv3(dec + x)
        
        return dec
```
</details>




Throughout the experiments, we use the RRDB Generator proposed in ESRGAN paper, whose building blocks are shown in below Figure.

<img src="img/rrdb_generator.png" style="width:700px; height:auto; display: flex; justify-content: center"/> <br/> <br/>

<details>
  <summary>RRDB Generator</summary>

  ```python
class DenseBlock(nn.Module):
  '''
  Dense Block structure from https://arxiv.org/pdf/1809.00219 Fig4 : Left
  '''
    def __init__(self, in_channels, out_channels, num_blocks=5, is_upsample=False):
        super().__init__()
        self.blocks = make_blocks(in_channels, out_channels, num_blocks, is_upsample)

    def forward(self, x):
        prev_features = x
        for block in self.blocks:
            current_output = block(prev_features)
            prev_features = torch.cat([prev_features, current_output], dim=1)
        return x + current_output * 0.2

class Residual_in_ResidualBlock(nn.Module):
  '''
  RRDB  structure from https://arxiv.org/pdf/1809.00219 Fig4 : Right
  consists of 3 Dense Blocks
  '''
    def __init__(self, in_channels, num_blocks=3, is_upsample=False):
        super().__init__()
        self.rrdb1 = DenseBlock(in_channels, in_channels, num_blocks, is_upsample)
        self.rrdb2 = DenseBlock(in_channels, in_channels, num_blocks, is_upsample)
        self.rrdb3 = DenseBlock(in_channels, in_channels, num_blocks, is_upsample)
        
    def forward(self, x):
        out1 = self.rrdb1(x)
        out2 = self.rrdb2(out1)
        out3 = self.rrdb3(out2)
        return x + out3 * 0.2

class RRDBNet(nn.Module):
    '''ESRGAN Generator, which consists of 23 Residual in Residual Dense Blocks
    paper : https://arxiv.org/pdf/1809.00219
    '''
    def __init__(self, in_channels=3, num_channels=64, num_blocks=23, clip_output=False):
        super().__init__()
        self.conv1 = get_layer(in_channels, num_channels)
        self.conv2 = get_layer(num_channels, num_channels)
        self.conv3 = get_layer(num_channels, num_channels)
        self.act = nn.LeakyReLU(0.2, inplace=True)
        self.output = get_layer(num_channels, in_channels)
        self.first_ups = get_layer(num_channels, num_channels, is_upsample=True)
        self.second_ups = get_layer(num_channels, num_channels, is_upsample=True)
        self.rrdb = nn.Sequential(*[Residual_in_ResidualBlock(num_channels) for _ in range(num_blocks)])
        self.clip_output = clip_output

    def forward(self, x):
        res = self.conv1(x)
        x = self.rrdb(res)
        x = self.conv2(x)
        x = x + res
        x = self.first_ups(x)
        x = self.second_ups(x)
        x = self.act(self.conv3(x))
        if self.clip_output:
            x = self.output(x).clip(-1, 1)
        else:
            x = self.output(x)
        return x
```
</details>


         
To see the effect of Semantic Aware Fusion Block , we also implemented Vanilla Patch-wise Discriminator and Vanilla Pixel-wise Discriminator.

<details>
  <summary>Patch-wise Discriminator</summary>

  ```python
#Vanilla patchgan discriminator
class PatchDiscriminator(nn.Module):
    def __init__(self, input_channels, num_filters=64):
        super().__init__()
        #Downsample the input size from 256x256 to 16x16
        self.downsampler = DownSampler(input_channels, num_filters)
        self.final_conv = nn.Conv2d(num_filters * 8, 1, kernel_size=4, stride=1, padding=1)
        
    def forward(self, fs):
        fs = self.downsampler(fs)
        fs = self.final_conv(fs)
        return fs
```
</details>



<details>
  <summary>Pixel-wise Discriminator </summary>

  ```python
class UNetPixelDiscriminator(nn.Module):
    def __init__(self, in_channels=3, out_channels=1, num_filters=64):
        super(UNetPixelDiscriminator, self).__init__()

        # Encoder
        self.encoder = nn.Sequential(
            self._conv_block(in_channels, num_filters),
            self._conv_block(num_filters, num_filters),
            self._conv_block(num_filters, num_filters * 2),
            self._conv_block(num_filters * 2, num_filters * 4),
        )

        # Bottleneck
        self.bottleneck = nn.Sequential(
            nn.Conv2d(num_filters * 4, num_filters * 4, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
        )

        # Decoder
        self.decoder = nn.Sequential(
            self._upconv_block(num_filters * 4, num_filters * 4),
            self._upconv_block(num_filters * 4, num_filters * 2),
            self._upconv_block(num_filters * 2, num_filters),
            self._upconv_block(num_filters, num_filters),
            nn.Conv2d(num_filters, out_channels, kernel_size=1, stride=1, padding=0),
            nn.Sigmoid()
        )

    def _conv_block(self, in_channels, out_channels, kernel_size=4, stride=2, padding=1):
        return nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding),
            nn.LeakyReLU(0.2, inplace=True),
            nn.BatchNorm2d(out_channels),
        )

    def _upconv_block(self, in_channels, out_channels, kernel_size=4, stride=2, padding=1):
        return nn.Sequential(
            nn.ConvTranspose2d(in_channels, out_channels, kernel_size, stride, padding),
            nn.ReLU(inplace=True),
            nn.BatchNorm2d(out_channels),
        )

    def forward(self, fs):
        # Encoder
        # fs = 64
        enc1 = self.encoder[0](fs) # 32x32x64
        enc2 = self.encoder[1](enc1) # 16x16x64
        enc3 = self.encoder[2](enc2) # 8x8x128
        enc4 = self.encoder[3](enc3) # 4x4x256

        #Bottleneck
        bottleneck = self.bottleneck(enc4) # 2x2x256

        #Decoder with skip connections using addition
        dec = self.decoder[0](bottleneck) # 4x4x256
        dec = self.decoder[1](dec + enc4) # 8x8x128
        dec = self.decoder[2](dec + enc3) # 16x16x64
        dec = self.decoder[3](dec + enc2) # 32x32x64
        dec = self.decoder[4](dec + enc1) # 64x64x1

        return dec
```
</details>

### **Our Assumptions:**
Implementing a super resolution model based solely on a paper, without access to the accompanying code, was challenging due to the complexities of understanding and implementing the loss function, architecture, and performance metrics described in the paper. Dealing with dimensionality inconsistencies in paper. Some are listed below.

* we assumed that the group normalization has 32 groups (not stated in the paper)
* we assumed that the conv block in patchwise discriminator is a  convolution block that doubles the channel size and with kernel_size of 4, stride=2 and padding=1 followed by a batch normalization block followed by a leaky relu block (not included in the last convolution block) which is not stated in the paper.
* They did not specified the adverserial loss function details. As a result, we have decided to go with wassertein loss with gradient penalty to achieve a more stable training.
* They did not specify how they have preprocessed the dataset. As a result, due to small number of images in the dataset, we have decided to conduct a literature survey on how different models have overcome this issue and found that ESRGAN does combine 2 datasets and crops random patches from each image to increase the number of images.
* We have decided to move with crop size of 400 for hr images and 100 for lr images. This means that during training our model inputs 100x100 crops and tries to generate 400x400 hr version of it.
* For cross attention, we have decided to use single head attention rather than multi head attention
* CLIP preprocessor normally downscales the image to 224x224 before extracting embeddings. We believed that this can downgrade the performance w.r.t hr images as a result, we did not use this preprocessor.
* To obtain same spatial dimensionality with the clip embeddings (for concatenation specified in the image below in part d), we added extra convolution layer that did not change the channel size but decreases the spatial dimensions.
* The authors did not describe the weight (lambda) values of the loss functions as a result, we have decided to go with 1 for mse and 10 for gradient penalty in wasserstein loss
* The authors did not specify whether they have used multi-head attention or single head attention. As a result, we have decided to go with single head attention because we thought it should be sufficient enough.
* The authors did not specify the dimensionlity of attention head. So, we have decided to go with 128 as this will result in 8 times less memory usage. 
* For the coefficients of the losses (i.e VGG, adverserial, MSE), we have conducted several experiments and the current setup in the config files are the ones that have achieved the best scores. One thing that we have tried to keep constant is the ratio between the losses. For example, if the VGG loss is 0.1 times the adverserial loss, we have tried to keep this ratio constant in all experiments by changing the coefficients.
* The authors did not specify how they have calculated the psnr, ssim, lpips scores for the datasets. Given that they have given qualitative results as crops from image, we have calculated the scores by using random crops as ground truth images.

## Hyper-parameters of your model

We aim to compare the effect of SeD discriminator with vanilla discriminator. As a result, we have two different training setups. Before reading the hyperparameters, please note that the hyperparameters are the same for both models except for the discriminator part. In addition, the losses used for the model can be seen from the image below where L_s is VGG  perceptual loss, L_p is the pixelwise MSE loss and L_adv is the adverserial loss.


<img src="img/losses.png"> <br/> <br/>
The hyperparameters of the models are as follows:

### Vanilla Patchgan Discriminator
- **Train Batch Size**: 16
- **Image Size**: 256
- **Downsample Factor**: 4 (downsampling factor for low-resolution images)
- **Losses**:
  - **VGG**:
    - `weight`: 1000.0
    - `output_layer_idx`: 23 (index of the layer to extract features from)
  - **Adversarial_G**:
    - `weight`: 1.0
  - **MSE**:
    - `weight`: 1e-1
  - **Adversarial_D**:
    - `r1_gamma`: 10.0 (constant for wasserstein GP)
    - `r2_gamma`: 0.0 (constant for wasserstein GP)
- **Super Resolution Module Configuration**:
  - `generator_learning_rate`: 1e-4
  - `discriminator_learning_rate`: 1e-5
  - `generator_decay_steps`: [50_000, 100_000, 150_000, 200_000, 250_000]
  - `discriminator_decay_steps`: [50_000, 100_000, 150_000, 200_000, 250_000]
  - `generator_decay_gamma`: 0.5
  - `discriminator_decay_gamma`: 0.5

### Patch SeD
- **Train Batch Size**: 16
- **Image Size**: 256
- **Downsample Factor**: 4 (downsampling factor for low-resolution images)
- **Losses**:
  - **VGG**:
    - `weight`: 1000.0
    - `output_layer_idx`: 23 (index of the layer to extract features from)
  - **Adversarial_G**:
    - `weight`: 1.0
  - **MSE**:
    - `weight`: 1e-1
  - **Adversarial_D**:
    - `r1_gamma`: 10.0 (constant for wasserstein GP)
    - `r2_gamma`: 0.0 (constant for wasserstein GP)
- **Super Resolution Module Configuration**:
  - `generator_learning_rate`: 1e-4
  - `discriminator_learning_rate`: 1e-5
  - `generator_decay_steps`: [50_000, 100_000, 150_000, 200_000, 250_000]
  - `discriminator_decay_steps`: [50_000, 100_000, 150_000, 200_000, 250_000]
  - `generator_decay_gamma`: 0.5
  - `discriminator_decay_gamma`: 0.5


### Vanilla Pixelwise Discriminator
- **Train Batch Size**: 16
- **Image Size**: 256
- **Downsample Factor**: 4 (downsampling factor for low-resolution images)
- **Losses**:
  - **VGG**:
    - `weight`: 1000.0
    - `output_layer_idx`: 23 (index of the layer to extract features from)
  - **Adversarial_G**:
    - `weight`: 1.0
  - **MSE**:
    - `weight`: 1e-1
  - **Adversarial_D**:
    - `r1_gamma`: 10.0 (constant for wasserstein GP)
    - `r2_gamma`: 0.0 (constant for wasserstein GP)
- **Super Resolution Module Configuration**:
  - `generator_learning_rate`: 1e-4
  - `discriminator_learning_rate`: 1e-5
  - `generator_decay_steps`: [50_000, 100_000, 150_000, 200_000, 250_000]
  - `discriminator_decay_steps`: [50_000, 100_000, 150_000, 200_000, 250_000]
  - `generator_decay_gamma`: 0.5
  - `discriminator_decay_gamma`: 0.5

### Pixelwise SeD
- **Train Batch Size**: 16
- **Image Size**: 256
- **Downsample Factor**: 4 (downsampling factor for low-resolution images)
- **Losses**:
  - **VGG**:
    - `weight`: 1000.0
    - `output_layer_idx`: 23 (index of the layer to extract features from)
  - **Adversarial_G**:
    - `weight`: 1.0
  - **MSE**:
    - `weight`: 1e-1
  - **Adversarial_D**:
    - `r1_gamma`: 10.0 (constant for wasserstein GP)
    - `r2_gamma`: 0.0 (constant for wasserstein GP)
- **Super Resolution Module Configuration**:
  - `generator_learning_rate`: 1e-4
  - `discriminator_learning_rate`: 1e-5
  - `generator_decay_steps`: [50_000, 100_000, 150_000, 200_000, 250_000]
  - `discriminator_decay_steps`: [50_000, 100_000, 150_000, 200_000, 250_000]
  - `generator_decay_gamma`: 0.5
  - `discriminator_decay_gamma`: 0.5

### Device Support

In [None]:
import torch
device = "gpu" if torch.cuda.is_available() else "cpu"

## Training and saving of the model.

### Training with SeD

#### **IMPORTANT NOTE:** the training of the model is done on a remote server where we have not used jupyter notebook. Normally, scripts in the cells below are used to train the model. However, in order to not overly crowd the jupyter notebook for the reviewers, we have included the code that is responsible for training but the training logs will be displayed in the last cell of this section named as training loop which abstracts all this logic


### TRAINING CONFIG

### Note for Device Choice : To select device, change line 8 on any config file:
Line 8 contains the following:
 ```python
 device = torch.device("cuda") if accelerator=="gpu" else torch.device("cpu")
 ```
Change device to `cpu`, `cuda` or do not change if you want to select the best available option

```python
import torch
from pytorch_lightning.strategies import DDPStrategy

train_batch_size = 16
val_batch_size = 8
test_batch_size = 8

image_size = 256


###########################
##### Dataset Configs #####
###########################

dataset_module = dict(
    num_workers=4,
    train_batch_size=train_batch_size,
    val_batch_size=val_batch_size,
    test_batch_size=test_batch_size,
    train_dataset_config=dict(image_size=256, image_dir_hr="data/dataset_cropped/hr", image_dir_lr="data/dataset_cropped/lr", downsample_factor=4,mirror_augment_prob=0.5),
    val_dataset_config=dict(image_size=256, image_dir_hr="data/evaluation/hr/manga109", image_dir_lr="data/evaluation/lr/manga109"),
    test_dataset_config=dict(image_size=256, image_dir_hr="data/evaluation/hr/manga109", image_dir_lr="data/evaluation/lr/manga109"),
)

##################
##### Losses #####
##################
vgg_ckpt_path="pretrained_models/vgg16.pth"
loss_dict = dict(
    VGG=dict(weight=5e-5, model_config=dict(path=vgg_ckpt_path, output_layer_idx=23, resize_input=False)),
    Adversarial_G=dict(weight=1.0),
    MSE=dict(weight=1.0),
    Adversarial_D=dict(r1_gamma=10.0, r2_gamma=0.0)
)

#########################
##### Model Configs #####
#########################

super_resolution_module_config = dict(loss_dict=loss_dict, 
    generator_learning_rate=1e-4, discriminator_learning_rate=1e-5, 
    generator_decay_steps=[50_000, 100_000, 150_000, 200_000, 250_000], 
    discriminator_decay_steps=[50_000, 100_000, 150_000, 200_000, 250_000], 
    generator_decay_gamma=0.5, discriminator_decay_gamma=0.5,
    clip_generator_outputs=False,
    use_sed_discriminator=True)

#######################
###### Callbacks ######
#######################

ckpt_callback = dict(every_n_train_steps=4000, save_top_k=1, save_last=True, monitor='fid_test', mode='min')
synthesize_callback_train = dict(num_samples=12, eval_every=2000) # TODO: 4000
synthesize_callback_test = dict(num_samples=6, eval_every=2000)
fid_callback = dict(eval_every=4000)
```

### Training loop for the following Models:
- Patch-wise Discriminator with SeD
- Vanilla Patch-wise Discriminator
- Pixel-wise Discriminator with SeD
- Vanilla Pixel-wise Discriminator


In [6]:
CFG="configs/patchgan_sed.py" #Training of patchgan discriminator with SeD

!python train.py --config_file=$CFG --device=device  # --resume_from logs/sed

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name          | Type                      | Params
------------------------------------------------------------
0 | generator     | RRDBNet                   | 15.4 M
1 | discriminator | PatchDiscriminatorWithSeD | 4.7 M 
2 | clip          | CLIPRN50                  | 23.4 M
------------------------------------------------------------
20.1 M    Trainable params
23.4 M    Non-trainable params
43.5 M    Total params
173.867   Total estimated model para

In [7]:
CFG="configs/patchgan.py" #Training of patchgan discriminator without SeD

!python train.py --config_file=$CFG --device=device #--debug # --resume_from logs/sed

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name          | Type               | Params
-----------------------------------------------------
0 | generator     | RRDBNet            | 15.4 M
1 | discriminator | PatchDiscriminator | 2.8 M 
2 | clip          | CLIPRN50           | 23.4 M
-----------------------------------------------------
18.2 M    Trainable params
23.4 M    Non-trainable params
41.5 M    Total params
166.180   Total estimated model params size (MB)
SLURM auto-requeueing enabled

In [8]:
CFG="configs/pixelwise_sed.py" #Training of pixelwise discriminator with SeD

!python train.py --config_file=$CFG --device=device #--debug # --resume_from logs/sed

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name          | Type                          | Params
----------------------------------------------------------------
0 | generator     | RRDBNet                       | 15.4 M
1 | discriminator | UNetPixelDiscriminatorwithSed | 3.2 M 
2 | clip          | CLIPRN50                      | 23.4 M
----------------------------------------------------------------
18.6 M    Trainable params
23.4 M    Non-trainable params
42.0 M    Total params
167.802   To

In [None]:
CFG="configs/pixelwise.py" #Training of pixelwise discriminator without SeD

!python train.py --config_file=$CFG --device=device #--debug # --resume_from logs/sed

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name          | Type                   | Params
---------------------------------------------------------
0 | generator     | RRDBNet                | 15.4 M
1 | discriminator | UNetPixelDiscriminator | 3.5 M 
2 | clip          | CLIPRN50               | 23.4 M
---------------------------------------------------------
19.0 M    Trainable params
23.4 M    Non-trainable params
42.3 M    Total params
169.295   Total estimated model params size (MB)
SLURM

<details>
  <summary>
        <div>
            <h3>Training Details:</h3>
            The authors state in the paper that they have conducted a training of 35 hours with 4 tesla V100 GPU's with a batch size of 8. We, due to resource constraints, were not eligible to 4 Tesla V100 GPU's for distributed training. As a result, we have conducted a training with a single Nvidia A40 with a batch size of 16 for approximately 45 hours. <br/>
            The results given below are the training loss curves and lpips test logs of the original training we have conducted. The training logs are not included in the notebook due to the large size of the logs. Hence, we have only provided the screenshot of the logs.
        </div>
  </summary>

  <details>
      <summary><h4>Patchgan SeD:</h4></summary>
      <div style="display:flex; justify-content:center; align-items:center;">
      <img src="img/adv_d_ptch_sed.png" style="width: 600px; height:auto;"/>
      <img src="img/adv_g_ptc_sed.png" style="width: 600px; height:auto;"/>
      </div>
      <div style="display:flex; justify-content:center; align-items:center; margin-top:50px;">
      <img src="img/mse_ptch_sed.png" style="width: 600px; height:auto;"/>
      <img src="img/vgg_ptch_sed.png" style="width: 600px; height:auto;"/>
      </div>
      <div style="display:flex; justify-content:center; align-items:center; margin-top:50px;">
      <img src="img/lpips_test_ptch_sed.png" style="width: 800px; height:auto;"/>
      </div>
  </details>

  <details>
      <summary><h4>Vanilla Patchgan:</h4></summary>
      <div style="display:flex; justify-content:center; align-items:center;">
      <img src="img/adv_d_ptc.png" style="width: 600px; height:auto;"/>
      <img src="img/adv_g_ptch.png" style="width: 600px; height:auto;"/>
      </div>
      <div style="display:flex; justify-content:center; align-items:center; margin-top:50px;">
      <img src="img/mse_ptc.png" style="width: 600px; height:auto;"/>
      <img src="img/vgg_ptch.png" style="width: 600px; height:auto;"/>
      </div>
      <div style="display:flex; justify-content:center; align-items:center; margin-top:50px;">
      <img src="img/lpips_ptc.png" style="width: 800px; height:auto;"/>
      </div>
  </details>
  <details>
      <summary><h4>Pixelwise SeD:</h4></summary>
      <div style="display:flex; justify-content:center; align-items:center;">
      <img src="img/adv_d_px_sed.png" style="width: 600px; height:auto;"/>
      <img src="img/adv_g_px_sed.png" style="width: 600px; height:auto;"/>
      </div>
      <div style="display:flex; justify-content:center; align-items:center; margin-top:50px;">
      <img src="img/mse_l_px_sed.png" style="width: 600px; height:auto;"/>
      <img src="img/vgg_px_sed.png" style="width: 600px; height:auto;"/>
      </div>
      <div style="display:flex; justify-content:center; align-items:center; margin-top:50px;">
      <img src="img/lpips_px_sed.png" style="width: 800px; height:auto;"/>
      </div>
  </details>
  <details>
      <summary><h4>Vanilla Pixelwise Discriminator:</h4></summary>
      <div style="display:flex; justify-content:center; align-items:center;">
      <img src="img/adv_d_px.png" style="width: 600px; height:auto;"/>
      <img src="img/adv_g_px.png" style="width: 600px; height:auto;"/>
      </div>
      <div style="display:flex; justify-content:center; align-items:center; margin-top:50px;">
      <img src="img/mse_px.png" style="width: 600px; height:auto;"/>
      <img src="img/vgg_px.png" style="width: 600px; height:auto;"/>
      </div>
      <div style="display:flex; justify-content:center; align-items:center; margin-top:50px;">
      <img src="img/lpips_px.png" style="width: 800px; height:auto;"/>
      </div>
  </details>
</details>

## Inference
**Important Note:** currently this script generates super-resolved results for the Set5 dataset. If you want to generate results for other datasets, you need to change the paths accordingly.

In [2]:
from models.super_resolution_module import SuperResolutionModule
from datasets.dataset_module import DatasetModule
from tqdm import tqdm
from PIL import Image
import torch 
import numpy as np
import os


def postprocess_image(image, min_val=-1.0, max_val=1.0):
    image = image.astype(np.float64)
    image = np.clip(image, -1, 1)
    image = (image - min_val) * 255 / (max_val - min_val)
    image = image.astype(np.uint8)
    image = image.transpose(1, 2, 0)
    return image

def generate_results(ckpt, image_dir_hr, image_dir_lr, save_path, device):
    model = SuperResolutionModule.load_from_checkpoint(ckpt) 
    train_batch_size = 1  # given so that each image is processed by itself
    val_batch_size = 1 # given so that each image is processed by itself
    test_batch_size = 1 # given so that each image is processed by itself

    model.eval()
    dataset_module = dict(
        num_workers=4,
        train_batch_size=train_batch_size,
        val_batch_size=val_batch_size,
        test_batch_size=test_batch_size,
        train_dataset_config=dict(image_size=256, image_dir_hr=image_dir_hr, image_dir_lr=image_dir_lr, downsample_factor=4),
        val_dataset_config=dict(image_size=256, image_dir_hr=image_dir_hr, image_dir_lr=image_dir_lr),
        test_dataset_config=dict(image_size=256, image_dir_hr=image_dir_hr, image_dir_lr=image_dir_lr),
    )


    data_module_gt = DatasetModule(**dataset_module)
    data_module_gt.setup('test')
    dataloader = data_module_gt.test_dataloader()

    os.makedirs("results_Set5", exist_ok=True)
    os.makedirs(f"results_Set5/{save_path}", exist_ok=True)

    os.makedirs("gt_Set5", exist_ok=True)
    
    cnt = 0
    for batch in tqdm(dataloader, desc=f"Generating Results on SR images", total=len(dataloader)):
        sr_images = model.make_high_resolution(batch)
        sr = sr_images['generated_super_resolution_image'].to(device)
        hr = batch['image_hr'].to(device)
        #save the sr images to the "sr_pngs" folder
        for i in range(len(sr)):
            img = sr[i]
            hr_img = hr[i]
            
            img = postprocess_image(img.detach().cpu().numpy())
            img = Image.fromarray(img)
            img.save(f"results_Set5/{save_path}/{cnt}.png")
            
            hr_img = postprocess_image(hr_img.detach().cpu().numpy())
            hr_img = Image.fromarray(hr_img)
            hr_img.save(f"gt_Set5/{cnt}.png")


            
            cnt += 1
    
torch.manual_seed(1256)
np.random.seed(1256)
ckpt="model_weights/patchgan_sed.ckpt"
ckpt2="model_weights/pixelwise_sed.ckpt"
ckpt3="model_weights/pixelwise.ckpt"
ckpt4="model_weights/patchgan.ckpt"

generate_results_dict = {
    "patchgan": ckpt4,
    "patchgan_sed": ckpt,
    "pixelwise": ckpt3,
    "pixelwise_sed": ckpt2,
} 

image_path_hr = "data/evaluation/hr/Set5"
image_path_lr = "data/evaluation/hr/Set5"

for key, item in generate_results_dict.items():
    generate_results(item, image_path_hr, image_path_lr, key, device)


Calculating FID on SR images: 100%|██████████| 5/5 [00:01<00:00,  4.24it/s]
Calculating FID on SR images: 100%|██████████| 5/5 [00:01<00:00,  4.07it/s]
Calculating FID on SR images: 100%|██████████| 5/5 [00:00<00:00,  5.30it/s]
Calculating FID on SR images: 100%|██████████| 5/5 [00:01<00:00,  4.58it/s]


## Reproducing results

In [4]:
"""
StarGAN v2
Copyright (c) 2020-present NAVER Corp.
This work is licensed under the Creative Commons Attribution-NonCommercial
4.0 International License. To view a copy of this license, visit
http://creativecommons.org/licenses/by-nc/4.0/ or send a letter to
Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.
"""

#WE HAVE IMPLEMENTED THIS CODE BLOCK BY USING THE REFERENCE AT THE TOP AS A GUIDANCE

import torch
import numpy as np
from tqdm import tqdm
from datasets.dataset_module import DatasetModule
from losses.lpips.lpips import LPIPS
import torch.nn.functional as F
import cv2

def print_metrics_given_path(sr_path, gt_path, label, device):
    print("calculating metrics for " + sr_path + "and " +  gt_path)
    train_batch_size = 1  # given as temporary data
    val_batch_size = 1 # given as temporary data
    test_batch_size = 1 # given as temporary data
    
    
    ################ lpips
    lpips_model = LPIPS(net_type='alex', device=device).to('cpu')
    lpips_model.eval()
    image_size = 256
    #train and val datasets must be given to the dataset module dict. Hence, we have provided a dummy instance for both module dicts.
    #Note that they are not used for calculating the metrics
    dataset_module_gt = dict(
        num_workers=4,
        train_batch_size=train_batch_size,
        val_batch_size=val_batch_size,
        test_batch_size=test_batch_size,
        train_dataset_config=dict(image_size=256, image_dir_hr="", image_dir_lr="", downsample_factor=4,mirror_augment_prob=0), #dummy
        val_dataset_config=dict(image_size=256, image_dir_hr="", image_dir_lr=""), #dummy
        test_dataset_config=dict(image_size=256, image_dir_hr=gt_path, image_dir_lr=gt_path),
    )
    
    dataset_module_gt = DatasetModule(**dataset_module_gt)
    dataset_module_gt.setup('test')
    first_dataloader = dataset_module_gt.test_dataloader()
    
    #train and val datasets must be given to the dataset module dict. Hence, we have provided a dummy instance for both module dicts.
    #Note that they are not used for calculating the metrics
    dataset_module_sr = dict( 
        num_workers=4,
        train_batch_size=train_batch_size,
        val_batch_size=val_batch_size,
        test_batch_size=test_batch_size,
        train_dataset_config=dict(image_size=256, image_dir_hr="", image_dir_lr="", downsample_factor=4,mirror_augment_prob=0),#dummy
        val_dataset_config=dict(image_size=256, image_dir_hr="", image_dir_lr=""), #dummy
        test_dataset_config=dict(image_size=256, image_dir_hr=sr_path, image_dir_lr=sr_path),
    )
    
    data_module_sr = DatasetModule(**dataset_module_sr)
    data_module_sr.setup('test')
    second_dataloader = data_module_sr.test_dataloader()
    
    def get_lpips_mean(dataloader1,dataloader2,lpips_model,device,dataset_type):
        lpips_model.to(device)
        lpips_list = []
        with torch.no_grad():
            for batch1,batch2 in tqdm(zip(dataloader1,dataloader2), desc=f"Calculating {dataset_type} LPIPS on sr images", total=len(dataloader1)):
                gt_images = batch1["image_hr"].to(device) * 0.5 + 0.5
                sr_images = batch2["image_hr"].to(device) * 0.5 + 0.5
                lpips = lpips_model(gt_images, sr_images, return_similarity=True)
                lpips_list.append(lpips.cpu())
        lpips_list = torch.cat(lpips_list).numpy()
        lpips_mean = np.nanmean(lpips_list)
        lpips_model.to('cpu')
        return lpips_mean
    
    
    
    lpips_mean = get_lpips_mean(first_dataloader,second_dataloader,lpips_model,device,"lpips")
        
    # Constants for SSIM calculation

    def create_gaussian_window(size=11, sigma=1.5):
        kernel = cv2.getGaussianKernel(size, sigma)
        window = np.outer(kernel, kernel)
        return window
    
    def compute_mean(image, window):
        return cv2.filter2D(image, -1, window)
    
    def compute_variance(image, mean, window):
        return cv2.filter2D(image ** 2, -1, window) - mean ** 2
    
    def compute_covariance(image1, image2, mean1, mean2, window):
        return cv2.filter2D(image1 * image2, -1, window) - mean1 * mean2
    
    def ssim_fn(img1, img2):

        img1 = img1.astype(np.float64)
        img2 = img2.astype(np.float64)
    
        # generate Gaussian window
        window = create_gaussian_window()
    
        # Compute means
        mean1 = compute_mean(img1, window)
        mean2 = compute_mean(img2, window)
    
        # Compute variances
        variance1 = compute_variance(img1, mean1, window)
        variance2 = compute_variance(img2, mean2, window)
    
        # Compute covariance
        covariance = compute_covariance(img1, img2, mean1, mean2, window)
    
        # Calculate SSIM score
        mean_product = 2 * mean1 * mean2
        mean_sum = mean1 ** 2 + mean2 ** 2
        variance_sum = variance1 + variance2
        covariance_product = 2 * covariance
    
        numerator = (mean_product + (0.01 * 255) ** 2) * (covariance_product + (0.03 * 255) ** 2)
        denominator = (mean_sum + (0.01 * 255) ** 2) * (variance_sum + (0.03 * 255) ** 2)
        ssim_score = numerator / denominator

        # Return mean SSIM
        return ssim_score.mean()

    def get_ssim_mean(dataloader1,dataloader2,device,dataset_type):
        ssim_list = []
        with torch.no_grad():
            for batch1,batch2 in tqdm(zip(dataloader1,dataloader2), desc=f"Calculating {dataset_type} SSIM on sr images", total=len(dataloader1)):
                gt_images = batch1["image_hr"].to(device) * 127.5 + 127.5
                sr_images = batch2["image_hr"].to(device) * 127.5 + 127.5
                gt_images = gt_images.squeeze(0).detach().cpu().numpy().transpose(1,2,0)
                sr_images = sr_images.squeeze(0).detach().cpu().numpy().transpose(1,2,0)
                ssim_val = ssim_fn(sr_images, gt_images)
                ssim_list.append(ssim_val)
        ssim_mean = np.nanmean(ssim_list)
        return ssim_mean
        
    ssim_mean = get_ssim_mean(first_dataloader,second_dataloader,device,"ssim")
    
    #### PSNR
    
    def psnr(img1, img2, max_val=1.0):
        # Convert images to float tensors
        img1 = img1.float()
        img2 = img2.float()
        
        max_val = img1.max()
        # Calculate MSE (Mean Squared Error)
        mse = F.mse_loss(img1, img2)
        
        # Calculate PSNR (Peak Signal-to-Noise Ratio)
        psnr = 20 * torch.log10(max_val / torch.sqrt(mse))
        
        return psnr.item()
    
    def get_psnr_mean(dataloader1,dataloader2,device,dataset_type):
        psnr_list = []
        with torch.no_grad():
            for batch1,batch2 in tqdm(zip(dataloader1,dataloader2), desc=f"Calculating {dataset_type} PSNR on sr images", total=len(dataloader1)):
                gt_images = batch1["image_hr"].to(device) * 127.5 + 127.5
                sr_images = batch2["image_hr"].to(device) * 127.5 + 127.5
                psnr_val = psnr(sr_images, gt_images)
                psnr_list.append(psnr_val)
        psnr_mean = np.nanmean(psnr_list)
        return psnr_mean
    
    
    psnr_mean = get_psnr_mean(first_dataloader,second_dataloader,device,"psnr")
    print(f"for the {label} scores are:\n psnr: {psnr_mean}\n lpips: {lpips_mean} \n ssim: {ssim_mean}")

metric_path_dict = {
    "Set5":["results_Set5/", "gt_Set5"],
    "Set14":["results_Set14/", "gt_Set14"],
    "urban100":["results_urban100/", "gt_urban100"],
    "manga109":["results_manga/", "gt_manga"],
    "DIV2K": ["results_DIV2K_valid_HR/", "gt_div2k"]
}

models_list = ["patchgan/", "patchgan_sed/", "pixelwise/", "pixelwise_sed/"]

for dataset, dir_ in metric_path_dict.items():
    for model in models_list:
        sr_path = dir_[0] + model
        gt_path = dir_[1] + "/"
        label = dataset + " using " + model.replace("/","")
        print_metrics_given_path(sr_path, gt_path, label, device)



calculating metrics for tmp/results_Set5/patchgan/and tmp/gt_Set5/


Calculating lpips LPIPS on sr images: 100%|██████████| 5/5 [00:00<00:00, 25.54it/s]
Calculating ssim SSIM on sr images: 100%|██████████| 5/5 [00:00<00:00, 11.68it/s]
Calculating psnr PSNR on sr images: 100%|██████████| 5/5 [00:00<00:00, 25.34it/s]


for the Set5 using patchgan scores are:
 psnr: 28.32532958984375
 lpips: 0.07299451529979706 
 ssim: 0.8197302284707041
calculating metrics for tmp/results_Set5/patchgan_sed/and tmp/gt_Set5/


Calculating lpips LPIPS on sr images: 100%|██████████| 5/5 [00:00<00:00, 23.60it/s]
Calculating ssim SSIM on sr images: 100%|██████████| 5/5 [00:00<00:00, 11.65it/s]
Calculating psnr PSNR on sr images: 100%|██████████| 5/5 [00:00<00:00, 23.56it/s]


for the Set5 using patchgan_sed scores are:
 psnr: 28.52031478881836
 lpips: 0.07668115198612213 
 ssim: 0.8267332764951745
calculating metrics for tmp/results_Set5/pixelwise/and tmp/gt_Set5/


Calculating lpips LPIPS on sr images: 100%|██████████| 5/5 [00:00<00:00, 47.34it/s]
Calculating ssim SSIM on sr images: 100%|██████████| 5/5 [00:00<00:00, 15.25it/s]
Calculating psnr PSNR on sr images: 100%|██████████| 5/5 [00:00<00:00, 26.85it/s]


for the Set5 using pixelwise scores are:
 psnr: 28.85939750671387
 lpips: 0.11080751568078995 
 ssim: 0.8417996802882358
calculating metrics for tmp/results_Set5/pixelwise_sed/and tmp/gt_Set5/


Calculating lpips LPIPS on sr images: 100%|██████████| 5/5 [00:00<00:00, 23.49it/s]
Calculating ssim SSIM on sr images: 100%|██████████| 5/5 [00:00<00:00, 15.21it/s]
Calculating psnr PSNR on sr images: 100%|██████████| 5/5 [00:00<00:00, 27.35it/s]


for the Set5 using pixelwise_sed scores are:
 psnr: 28.582693099975586
 lpips: 0.08782997727394104 
 ssim: 0.8272810757698196
calculating metrics for tmp/results_Set14/patchgan/and tmp/gt_Set14/


Calculating lpips LPIPS on sr images: 100%|██████████| 14/14 [00:00<00:00, 65.74it/s]
Calculating ssim SSIM on sr images: 100%|██████████| 14/14 [00:00<00:00, 16.67it/s]
Calculating psnr PSNR on sr images: 100%|██████████| 14/14 [00:00<00:00, 60.30it/s]


for the Set14 using patchgan scores are:
 psnr: 25.425380979265487
 lpips: 0.12650790810585022 
 ssim: 0.7052855982882288
calculating metrics for tmp/results_Set14/patchgan_sed/and tmp/gt_Set14/


Calculating lpips LPIPS on sr images: 100%|██████████| 14/14 [00:00<00:00, 56.17it/s]
Calculating ssim SSIM on sr images: 100%|██████████| 14/14 [00:00<00:00, 15.91it/s]
Calculating psnr PSNR on sr images: 100%|██████████| 14/14 [00:00<00:00, 98.36it/s]


for the Set14 using patchgan_sed scores are:
 psnr: 25.516498974391393
 lpips: 0.12707917392253876 
 ssim: 0.7115346855304715
calculating metrics for tmp/results_Set14/pixelwise/and tmp/gt_Set14/


Calculating lpips LPIPS on sr images: 100%|██████████| 14/14 [00:00<00:00, 55.88it/s]
Calculating ssim SSIM on sr images: 100%|██████████| 14/14 [00:00<00:00, 18.15it/s]
Calculating psnr PSNR on sr images: 100%|██████████| 14/14 [00:00<00:00, 62.72it/s]


for the Set14 using pixelwise scores are:
 psnr: 25.908776010785783
 lpips: 0.18113267421722412 
 ssim: 0.7328405892092797
calculating metrics for tmp/results_Set14/pixelwise_sed/and tmp/gt_Set14/


Calculating lpips LPIPS on sr images: 100%|██████████| 14/14 [00:00<00:00, 73.07it/s]
Calculating ssim SSIM on sr images: 100%|██████████| 14/14 [00:00<00:00, 17.03it/s]
Calculating psnr PSNR on sr images: 100%|██████████| 14/14 [00:00<00:00, 65.54it/s]


for the Set14 using pixelwise_sed scores are:
 psnr: 25.551700592041016
 lpips: 0.14808118343353271 
 ssim: 0.7168674343844766
calculating metrics for tmp/results_urban100/patchgan/and tmp/gt_urban100/


Calculating lpips LPIPS on sr images: 100%|██████████| 100/100 [00:00<00:00, 128.13it/s]
Calculating ssim SSIM on sr images: 100%|██████████| 100/100 [00:05<00:00, 16.76it/s]
Calculating psnr PSNR on sr images: 100%|██████████| 100/100 [00:00<00:00, 159.73it/s]


for the urban100 using patchgan scores are:
 psnr: 23.356260175704957
 lpips: 0.13328886032104492 
 ssim: 0.7040493599656144
calculating metrics for tmp/results_urban100/patchgan_sed/and tmp/gt_urban100/


Calculating lpips LPIPS on sr images: 100%|██████████| 100/100 [00:00<00:00, 129.10it/s]
Calculating ssim SSIM on sr images: 100%|██████████| 100/100 [00:06<00:00, 16.65it/s]
Calculating psnr PSNR on sr images: 100%|██████████| 100/100 [00:00<00:00, 136.29it/s]


for the urban100 using patchgan_sed scores are:
 psnr: 23.39994520187378
 lpips: 0.13472136855125427 
 ssim: 0.7065629538047355
calculating metrics for tmp/results_urban100/pixelwise/and tmp/gt_urban100/


Calculating lpips LPIPS on sr images: 100%|██████████| 100/100 [00:00<00:00, 129.07it/s]
Calculating ssim SSIM on sr images: 100%|██████████| 100/100 [00:05<00:00, 18.57it/s]
Calculating psnr PSNR on sr images: 100%|██████████| 100/100 [00:00<00:00, 138.10it/s]


for the urban100 using pixelwise scores are:
 psnr: 23.566014013290406
 lpips: 0.17821992933750153 
 ssim: 0.7208096088247611
calculating metrics for tmp/results_urban100/pixelwise_sed/and tmp/gt_urban100/


Calculating lpips LPIPS on sr images: 100%|██████████| 100/100 [00:00<00:00, 134.52it/s]
Calculating ssim SSIM on sr images: 100%|██████████| 100/100 [00:05<00:00, 17.01it/s]
Calculating psnr PSNR on sr images: 100%|██████████| 100/100 [00:00<00:00, 143.73it/s]


for the urban100 using pixelwise_sed scores are:
 psnr: 23.432228927612304
 lpips: 0.1529310643672943 
 ssim: 0.710573309456417
calculating metrics for tmp/results_manga/patchgan/and tmp/gt_manga/


Calculating lpips LPIPS on sr images: 100%|██████████| 109/109 [00:00<00:00, 157.99it/s]
Calculating ssim SSIM on sr images: 100%|██████████| 109/109 [00:05<00:00, 18.69it/s]
Calculating psnr PSNR on sr images: 100%|██████████| 109/109 [00:00<00:00, 174.50it/s]


for the manga109 using patchgan scores are:
 psnr: 26.605389043825483
 lpips: 0.05496285855770111 
 ssim: 0.824411791092232
calculating metrics for tmp/results_manga/patchgan_sed/and tmp/gt_manga/


Calculating lpips LPIPS on sr images: 100%|██████████| 109/109 [00:00<00:00, 138.84it/s]
Calculating ssim SSIM on sr images: 100%|██████████| 109/109 [00:05<00:00, 18.74it/s]
Calculating psnr PSNR on sr images: 100%|██████████| 109/109 [00:00<00:00, 145.91it/s]


for the manga109 using patchgan_sed scores are:
 psnr: 26.53113312677506
 lpips: 0.05527523159980774 
 ssim: 0.8234676195979304
calculating metrics for tmp/results_manga/pixelwise/and tmp/gt_manga/


Calculating lpips LPIPS on sr images: 100%|██████████| 109/109 [00:00<00:00, 136.92it/s]
Calculating ssim SSIM on sr images: 100%|██████████| 109/109 [00:06<00:00, 17.66it/s]
Calculating psnr PSNR on sr images: 100%|██████████| 109/109 [00:00<00:00, 162.57it/s]


for the manga109 using pixelwise scores are:
 psnr: 26.92198289643734
 lpips: 0.06947644054889679 
 ssim: 0.8361318013636717
calculating metrics for tmp/results_manga/pixelwise_sed/and tmp/gt_manga/


Calculating lpips LPIPS on sr images: 100%|██████████| 109/109 [00:00<00:00, 131.03it/s]
Calculating ssim SSIM on sr images: 100%|██████████| 109/109 [00:06<00:00, 17.48it/s]
Calculating psnr PSNR on sr images: 100%|██████████| 109/109 [00:00<00:00, 155.97it/s]


for the manga109 using pixelwise_sed scores are:
 psnr: 26.71990483835203
 lpips: 0.061318933963775635 
 ssim: 0.8284960323340378
calculating metrics for tmp/results_DIV2K_valid_HR/patchgan/and tmp/gt_div2k/


Calculating lpips LPIPS on sr images: 100%|██████████| 100/100 [00:00<00:00, 128.28it/s]
Calculating ssim SSIM on sr images: 100%|██████████| 100/100 [00:05<00:00, 17.89it/s]
Calculating psnr PSNR on sr images: 100%|██████████| 100/100 [00:00<00:00, 155.96it/s]


for the DIV2K using patchgan scores are:
 psnr: 31.41322151184082
 lpips: 0.08203309774398804 
 ssim: 0.831089736958795
calculating metrics for tmp/results_DIV2K_valid_HR/patchgan_sed/and tmp/gt_div2k/


Calculating lpips LPIPS on sr images: 100%|██████████| 100/100 [00:00<00:00, 124.99it/s]
Calculating ssim SSIM on sr images: 100%|██████████| 100/100 [00:05<00:00, 16.87it/s]
Calculating psnr PSNR on sr images: 100%|██████████| 100/100 [00:00<00:00, 180.01it/s]


for the DIV2K using patchgan_sed scores are:
 psnr: 31.53876937866211
 lpips: 0.08428461849689484 
 ssim: 0.8307865551960572
calculating metrics for tmp/results_DIV2K_valid_HR/pixelwise/and tmp/gt_div2k/


Calculating lpips LPIPS on sr images: 100%|██████████| 100/100 [00:00<00:00, 126.74it/s]
Calculating ssim SSIM on sr images: 100%|██████████| 100/100 [00:05<00:00, 17.31it/s]
Calculating psnr PSNR on sr images: 100%|██████████| 100/100 [00:00<00:00, 160.98it/s]


for the DIV2K using pixelwise scores are:
 psnr: 31.448500556945802
 lpips: 0.12308312207460403 
 ssim: 0.8460997286660628
calculating metrics for tmp/results_DIV2K_valid_HR/pixelwise_sed/and tmp/gt_div2k/


Calculating lpips LPIPS on sr images: 100%|██████████| 100/100 [00:00<00:00, 128.90it/s]
Calculating ssim SSIM on sr images: 100%|██████████| 100/100 [00:05<00:00, 17.45it/s]
Calculating psnr PSNR on sr images: 100%|██████████| 100/100 [00:00<00:00, 144.97it/s]


for the DIV2K using pixelwise_sed scores are:
 psnr: 31.655537166595458
 lpips: 0.1033484935760498 
 ssim: 0.8364254826235448


## Comparisons

### Quantitative Comparisions

| Dataset  | Method          | PSNR (Ours)             | PSNR              | LPIPS     (Ours)        | LPIPS             | SSIM    (Ours)          | SSIM              |
|----------|-----------------|-------------------|-------------------|-------------------|-------------------|-------------------|-------------------|
| Set5     | patchgan        | 28.3              | 30.6              | 0.073             | 0.070             | 0.820             | 0.860             |
| Set5     | patchgan_sed    | 28.5              | 31.2              | 0.077             | 0.064             | 0.827             | 0.867             |
| Set5     | pixelwise       | 28.9              | 31.1              | 0.111             | 0.072             | 0.842             | 0.869             |
| Set5     | pixelwise_sed   | 28.6              | 31.7              | 0.088             | 0.069             | 0.827             | 0.880             |
| Set14    | patchgan        | 25.4              | 26.9              | 0.127             | 0.130             | 0.705             | 0.724             |
| Set14    | patchgan_sed    | 25.5              | 27.3              | 0.127             | 0.117             | 0.712             | 0.736             |
| Set14    | pixelwise       | 25.9              | 27.5              | 0.181             | 0.127             | 0.733             | 0.739             |
| Set14    | pixelwise_sed   | 25.6              | 27.9              | 0.148             | 0.123             | 0.717             | 0.757             |
| Urban100 | patchgan        | 23.4              | 24.8              | 0.133             | 0.120             | 0.704             | 0.752             |
| Urban100 | patchgan_sed    | 23.4              | 25.9              | 0.135             | 0.106             | 0.707             | 0.779             |
| Urban100 | pixelwise       | 23.6              | 25.6              | 0.178             | 0.125             | 0.721             | 0.768             |
| Urban100 | pixelwise_sed   | 23.4              | 26.2              | 0.153             | 0.112             | 0.711             | 0.788             |
| Manga109 | patchgan        | 26.6              | 28.6              | 0.055             | 0.058             | 0.824             | 0.872             |
| Manga109 | patchgan_sed    | 26.5              | 29.9              | 0.055             | 0.048             | 0.823             | 0.888             |
| Manga109 | pixelwise       | 26.9              | 29.4              | 0.069             | 0.056             | 0.836             | 0.882             |
| Manga109 | pixelwise_sed   | 26.7              | 30.4              | 0.061             | 0.047             | 0.828             | 0.897             |
| DIV2K    | patchgan        | 31.4              | 28.7              | 0.082             | 0.111             | 0.831             | 0.792             |
| DIV2K    | patchgan_sed    | 31.5              | 29.2              | 0.084             | 0.094             | 0.831             | 0.802             |
| DIV2K    | pixelwise       | 31.4              | 29.2              | 0.123             | 0.110             | 0.846             | 0.802             |
| DIV2K    | pixelwise_sed   | 31.7              | 29.9              | 0.103             | 0.102             | 0.836             | 0.818             |

## Qualitative Comparison for our goals
**Ground Truth (GT) - GT downscaled by 4 - Patchgan - Patchgan + SeD - Pixelwise - Pixelwise + SeD**
<img src="img/merged_image2.jpg" /> <br/>
<img src="img/merged_image3.jpg" /> <br/>
<img src="img/merged_image4.jpg" /> <br/>
<img src="img/merged_image5.jpg" /> <br/>
<img src="img/merged_image6.jpg" /> <br/>


## Challenges we have encountered when implementing the paper
- The main challenge was that there were too many ambiguities in the paper (dimension of layers, multi-head/single head attention)
- Preprocessing the data was also a challenge as the authors did not specify how they have preprocessed the data so we have decided to go with cropping the datasets to enhance the size of the dataset
- Implementing CLIP feature extractor was also a challenge while trying to get the outputs from the 3rd layer.
- Integrating the Semantic Aware Fusion Block was also a challenge because there were dimensionality mismatches. At first, we have tried to implement the model with the given setup but we have encountered dimensionality mismatches. As a result, we have decided to add extra convolution layers to make the dimensions compatible.


### Comments and Responses

#### Comment: 
**Paper author information is not provided.**

**Response:**
Provided in v2.

---

#### Comment: 
**We could not try the notebook due to the problems on downloading datasets (big size and low download speed), one restricted download link and path issues (elaborated below).**

**Response:**
Download problems resolved (for restriction and issues). We also faced the big size in some servers but were able to download on another server.

---

#### Comment: 
**Because of the technical issues they mentioned to the instructor, the declared plots are for a small subset of the dataset.**

**Response:**
We also demonstrate training logic on a small toy dataset (as we cannot keep the Jupyter notebook open for 1.8 days). However, declared plots are for original results.

---

#### Comment: 
**Download script is problematic for the reasons stated in the additional comments part below.**

**Response:**
Path problems, etc., fixed in v2.

---

#### Comment: 
**Why single head cross attention is used instead of multi-head cross attention? (Why num_heads is 1?)**

**Response:**
The paper does not explicitly mention that they use multi-head cross attention. Hence, we used a single head cross attention logic. But this can still be adjusted in the code for multi-head easily.

---

#### Comment: 
**It is not written in the assumptions, but the feature maps are assumed to have a dimension of 128, which we cannot see in the paper explicitly. Is there a reason for this? (Why “dim_head” is 128?)**

**Response:**
For the computational resources we have, a dimension of 128 was the maximum that we could handle. The paper also does not explicitly mention dimensionality. This can be changed for larger GPU resources easily.

---

#### Comment: 
**Is the choice of lambda values in the loss as 1 and 10 stems from a reason? For example, is it derived by hyperparameter experiments? We think this choice should affect the training significantly and the reason for lambda choices should have a reasoning.**

**Response:**
We conducted hyperparameter tuning with a few experiments. These values reproduced the original paper results (almost). Our main focus was to make loss values (Perceptual, VGG, Adversarial) on the same scale.

---

#### Comment: 
**Because of the technical problems they have encountered, they could not measure the metrics on the desired datasets.**

**Response:**
Metrics are measured in v2, and target results are achieved.

---

#### Comment: 
**The notebook also reports the original values of the targeted quantitative results, for comparison. They are not explicitly stated in the notebook.**

**Response:**
They are added in v2.

---

#### Comment: 
**LICENSE contains only Yiğit Ekin’s name.**

**Response:**
Fixed in v2.

---

#### Comment: 
**For RRDB module, leaky relu (or any activation) is not used between conv layers, which is contrary to the referred RRDB paper.**

**Response:**
Very good catch. We thank the reviewers. Fixed in v2.

---

#### Comment: 
**Plots and examples are demonstrated well. We think that adding the `train.py` files content to the main Jupyter notebook is not necessary.**

**Response:**
Since they are needed in the demo, we hold them as well in v2.

---

#### Comment: 
**The code lacks comments, which makes it hard to follow the code. It should contain more comments, possibly referring where it is related to in which paper for easier interpretability.**

**Response:**
More comments are added in v2.

---

#### Comment: 
**Using too much (e.g., `RRDBNet` -> `Residual_in_ResidualBlock` -> `DenseBlock` -> `make_blocks` -> `get_layer` -> `Conv2d`) hierarchy deteriorates the code’s readability in our opinion. It may be 1-2 levels more basic to enhance following up the code.**

**Response:**
We did not change this part, as we build these blocks based on the referred paper.

---

#### Comment: 
**In the implementations `models/patchgan_discriminator.py` file, there is an absolute path import `sys.path.append('/scratch/users/hpc-yekin/hpc_run/SeD/models')`, which should be changed with a dynamic path like `sys.path.append('./models')`, since the former can only work with your local PC, and the latter may be more flexible.**

**Response:**
Fixed in v2, now using dynamic path.

---

#### Comment: 
**The `environment.yml` file does not install `clip`, but it must be imported in `models/feature_extractor_model.py`.**

**Response:**
Clip added to `environment.yml` in v2.

---

#### Comment: 
**`download_dataset.sh` did not work with your `resize_4.py`. This is because the parser should have `--folder` instead of `folder` and `--save_path` instead of `save_path` (-- is missing). The `gdown https://drive.google.com/file/d/1henrktM4Cw9hJIJBDEObAzl-eCbpzNaJ/view?usp=drive_link` part does not work since the link requires access. Moreover, the download of `Flickr2K` was very slow (lower than 500kb/s) in our case. Again, it may be a problem on our side, but it is likely to be caused by the enabled download rate from the given link. If some other faster link or option is provided in the script, it would be much better.**

**Response:**
Fixed in v2.

---

#### Comment: 
**In the `prepare_dataset.py` script, the `os.mkdir("data2/dataset_cropped") if not os.path.exists("data2/dataset_cropped") else None` line does not work. You can replace it with `os.makedirs("data2/dataset_cropped", exist_ok=True)`. Similarly, replacing the `os.mkdir(save_folder) if not os.path.exists(save_folder) else None` with `os.makedirs(save_folder, exist_ok=True)` should work.**

**Response:**
Fixed in v2.

---

#### Comment: 
**The script `prepare_dataset.py` does not work since it assumes there is a `data2/hr` and `data2/lr` directories exist already. (we did not understand the `data2` folder. Maybe, it is left from the code authors’ local environment)**

**Response:**
It was used for a toy dataset. Removed in v2.

---

#### Comment: 
**In the `models/super_resolution_module.py` there is a possible redundant model instance creation at line 34. It should be merged with an if-else block with line 36. Creating a `PatchDiscriminatorWithSeD` instance can be somehow costly and if `is_pixelwise_disc` is False, it will be created and then deleted for no reason.**

**Response:**
Fixed according to the feedback.

---

#### Comment: 
**In `models/super_resolution_module.py`, lines 132 and 133 send low and high-resolution batch parts to GPU, which remains unused because `get_loss_forward_dict` method uses the batch object. This causes unnecessary usage of GPU memory, which may limit the highest possible training batch size to 50%.**

**Response:**
Fixed according to the feedback.

---

#### Comment: 
**For SeD patchwise discriminator, the adversarial loss of the generator converged to a very high value too early, whereas the discriminator's adversarial loss converged to a low value. This may be a sign of mode collapse or unstable training (since the generator's loss always increased). Similarly, for patchwise discriminator, the generator's adversarial loss converges to a value that does not show significant improvement (decrease), whereas the discriminator's adversarial loss seems to converge to a low value too early, may be caused by the same issues.**

**Response:**
We do not find this review very useful (and also not intuitive). We detected that the main problem in our v1 submission was related to a data augmentation problem in our dataloader class. We were applying mirror augmentation to high resolution images and not to low resolution images during training. As a result, the generator was not able to learn the correct mapping between low and high resolution images. To be precise, it is not included in our logs but the generator of v1 was learning to create a blurry image where the instance on one side was copied on to the other side (like a model that duplicates the object on the right side of the image to both left and right side). We solved the dataloader problem, and discriminator and generator losses started looking as desired.

---

#### Comment: 
**In the discriminator's initial part, the paper only states a “conv” in the architecture. However, using a 4-block downsampler with stride=2 may cause an imbalance between the discriminator and the generator. One of the reasons the discriminator learns too much and the generator learns nearly nothing might be due to this (reported plots of training adversarial losses in the notebook). We are suggesting paying attention to discriminator experiments.**

**Response:**
We do not find this review very useful (and also not intuitive). As mentioned before, we detected that the main problem in our v1 submission was related to a data augmentation problem in our dataloader class. We solved the dataloader problem, and discriminator and generator losses started looking as desired.

---

