# First Part: What is Style Transfer?

![image](https://drive.google.com/uc?export=view&id=1CyF8m1l-tVChLZmAzqLEKOwp8HbhEMB9)

## This is style transfer.




# Arbitrary Style Transfer in a data paucity scenario

The objective of this project is to test how well Adaptive Instance Normalization (AdaIN) captures the style of an image in the style transfer task.   
The approach was first proposed in 2017 by [Xun Huang and Serge Belongie](https://arxiv.org/abs/1703.06868), and it proved to be successful and comparable with the previous non-arbitrary approaches.  
In the following years several other strategies were developed for this task, including the more popular GAN (Generative Adversarial Network) architectures.  
One of the authors' objectives was to try new Encoder-Decoder architectures, more complex than VGG19, while keeping the AdaIN mechanism.  
This notebook provides a complete unofficial PyTorch re-implementation of the main experiments conducted in the original paper, and tests the effectiveness of AdaIN with ResNet34, with and without the residual connections.  


In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

In [9]:
import torch
import torchvision
import torchvision.transforms as transforms
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from PIL import Image
import matplotlib.pyplot as plt
import numpy as np
import torch.nn.functional as F
import plotly.graph_objects as go
from torch.optim import lr_scheduler
from PIL import ImageFile
from architectures import *
from utils import *
import ipywidgets as widgets
ImageFile.LOAD_TRUNCATED_IMAGES = True
Image.MAX_IMAGE_PIXELS = 100000000000 


In [10]:
import IPython
js_code = '''
function ClickConnect(){
console.log("Working");
document.querySelector("colab-toolbar-button#connect").click()
}
setInterval(ClickConnect,60000)
'''
display(IPython.display.Javascript(js_code))

<IPython.core.display.Javascript object>

In [None]:
!nvidia-smi -L

In [None]:
!pip3 install pytorch-lightning==1.5.10
import pytorch_lightning as pl

SEED = 2005

pl.seed_everything(SEED)

# Datasets

* To train the network we need to feed it with two different datasets: one with the content images and one with the paintings (style images).  

* The former was MS-COCO 2017, and the latter was the Painter by Numbers dataset (instead of the WikiArt dataset).  

* The experiments used 20,000 pairs of images from the two datasets for training, and 200 pairs for testing. The datasets are available at https://drive.google.com/drive/folders/1S0S-H_vXYiKBR6lBbf46YSmbrBQfU1Yv?usp=sharing.  
  
* All the images were preprocessed along the lines of the procedure described in the paper: they were first resized to 512x512 and then center-cropped to 256x256.

* It was not necessary to normalize the images, because the network was trained to handle that itself, and normalizing them would have affected performance.

![image](https://drive.google.com/uc?export=view&id=19bZ-IhozbQTnAUZU0IuSoj5Wn6w_TCkJ)


#### The next cells work only if the dataset paths are defined.
* The PyTorch Lightning framework was used to keep the code organised.

In [None]:
transform = transforms.Compose([transforms.Resize((512,512)),
                               transforms.CenterCrop(256),
                               transforms.ToTensor()])

path_content = ''  # MS-COCO 20K Images
path_style = ''    # Painter by Numbers 20k Images
path_test_content = '' #MS-COCO 200 Images
path_test_style = ''   #Painter by Numbers 200 Images

# mscoco = torchvision.datasets.ImageFolder(root = path_content, transform = transform)
# paint_by = torchvision.datasets.ImageFolder(root = path_style, transform = transform)
# test_content = torchvision.datasets.ImageFolder(root = path_test_content, transform = transform)
# test_style = torchvision.datasets.ImageFolder(root = path_test_style, transform = transform)



In [None]:
class datasets(Dataset):  # Pairs two image datasets: item idx returns (image from dataset1, image from dataset2) and drops the folder labels.

  def __init__(self, dataset1, dataset2, dataset_length):
    self.dataset1 = dataset1
    self.dataset2 = dataset2
    self.dataset_length = dataset_length

  def __len__(self):
    return self.dataset_length

  def __getitem__(self,idx):
    out1, _ = self.dataset1[idx]
    out2, _ = self.dataset2[idx]
    return out1,out2


In [None]:
# The Lightning data module wraps the paired datasets and exposes the corresponding dataloaders. #

class pl_Datasets(pl.LightningDataModule):

    def __init__(self, dataset1, dataset2, dataset3, dataset4, batch_size, dataset_length):
        super().__init__()    # Initialise the LightningDataModule base class
        self.d1 = dataset1    # Training content images
        self.d2 = dataset2    # Training style images
        self.d3 = dataset3    # Test content images
        self.d4 = dataset4    # Test style images
        self.batch_size = batch_size
        self.dataset_length = dataset_length
        self.test_set_length = 200

    def setup(self, stage = None):
        if stage == 'fit':
            self.train_dataset = datasets(self.d1, self.d2, self.dataset_length) # The main datasets, 20k images from each
        elif stage == 'test':
            self.test_dataset = datasets(self.d3, self.d4, self.test_set_length) # The held-out datasets, 200 images from each

    def train_dataloader(self, *args, **kwargs):
        return DataLoader(self.train_dataset, batch_size = self.batch_size, shuffle = True)

    def val_dataloader(self, *args, **kwargs):
        return DataLoader(self.test_dataset, batch_size = self.batch_size, shuffle = False)

    def test_dataloader(self, *args, **kwargs):
        return DataLoader(self.test_dataset, batch_size = self.batch_size, shuffle = False)

batch_size = 8  # As specified in the paper
dataset_length = 20000 #00
universal_device = 'cuda'
#pl_data = pl_Datasets(mscoco, paint_by, test_content, test_style, batch_size, dataset_length)

In [None]:
pl_data.setup('fit')
pl_data.setup('test')


batch = next(iter(pl_data.train_dataloader()))
content_image, style_image = batch
print("The size of the style image is",style_image.shape)
print("The size of the content image is",content_image.shape)

The size of the style image is torch.Size([8, 3, 256, 256])
The size of the content image is torch.Size([8, 3, 256, 256])


# Architecture
The heart of the method is the Adaptive Instance Normalization (AdaIN) layer.  
The layer is defined as follows:  
$$ \mathrm{AdaIN}(x,y) = \sigma(y) \left( \frac{x - \mu(x)}{\sigma(x)} \right) + \mu(y), $$ 
  
where $x$ is the encoded content image and $y$ is the encoded style image.

### The Intuition
The intuition is that we embed the two images with the same pre-trained Encoder, and the AdaIN layer matches them in this lower-dimensional feature space. 
A trained Decoder then maps the feature maps back to the image space. 

* Roughly speaking, the AdaIN layer aligns the content features with the stylistic features in the lower-dimensional space, so that the reconstructed image depicts these stylistic features.

* The layer is called adaptive because it does not need to learn any parameters: it always depends on the statistics of the feature maps. 

* The mean and variance (for each channel) of the encoded content input are adjusted to match those of the encoded style input (the normalized content is scaled by $\sigma(y)$ and shifted by $\mu(y)$).

* The decoder contains no normalization layers, because they tend to center the content input on some pre-defined styles.  
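
The statistic matching described above can be checked on random tensors. The following is a minimal sketch, not the notebook's own code: the helper names `channel_mean_std` and `adain` and the tensor shapes are assumptions chosen for illustration, and the random tensors stand in for encoded images.

```python
import torch


def channel_mean_std(feat, eps=1e-5):
    """Per-sample, per-channel mean and std of a (N, C, H, W) feature map."""
    n, c = feat.shape[:2]
    flat = feat.reshape(n, c, -1)
    mean = flat.mean(dim=2).reshape(n, c, 1, 1)
    std = torch.sqrt(flat.var(dim=2) + eps).reshape(n, c, 1, 1)
    return mean, std


def adain(content_feat, style_feat, eps=1e-5):
    """Normalize the content features, then rescale them with the style statistics."""
    c_mean, c_std = channel_mean_std(content_feat, eps)
    s_mean, s_std = channel_mean_std(style_feat, eps)
    return s_std * (content_feat - c_mean) / c_std + s_mean


torch.manual_seed(0)
# Random stand-ins for encoded images; the spatial sizes are allowed to differ.
content_feat = torch.randn(2, 4, 16, 16)
style_feat = 3.0 * torch.randn(2, 4, 8, 8) + 1.5

out = adain(content_feat, style_feat)
out_mean, out_std = channel_mean_std(out)
style_mean, style_std = channel_mean_std(style_feat)

print("max mean gap:", (out_mean - style_mean).abs().max().item())
print("max std gap: ", (out_std - style_std).abs().max().item())
```

Both printed gaps should be close to zero (the std gap is limited by the small `eps` term), while `out` keeps the spatial size of the content features.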
  
All of this can be better understood by looking at the visual representation of the architecture (taken from the paper).  

![image](https://drive.google.com/uc?export=view&id=1uiyj5xu62dAmbF02zNUKtfdN78y3ANTD)  
As the image shows, a pre-trained VGG19 was used as the Encoder, with the encoding defined up to ReLU 4_1 (512, 64x64).  
* In my experiments I always used this Encoder.  
* The Encoder was paired with a Decoder, which is the part trained to reconstruct the image. I tried 3 different Decoder architectures:  
 1. Vgg19,
 2. ResNet34,
 3. ResNet34 with no residual connections. 

Architectures (2) and (3) are depicted in the following figure:
![image](https://drive.google.com/uc?export=view&id=1r5ZcaNfzQKch2J-GziKklXKYcy3l4M-V)  
In (3), the residual connections are removed.

Additionally, the main network is fully convolutional, so it can be applied to images of any size.  
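
A toy sketch of this property, under the assumption of a small hypothetical stack (`toy_net` is not the notebook's Encoder or Decoder): since it only contains paddings, convolutions, a pooling step and a nearest-neighbour upsampling, nothing in it fixes the input resolution.

```python
import torch
import torch.nn as nn

# Hypothetical fully convolutional stack, used only to illustrate the property.
toy_net = nn.Sequential(
    nn.ReflectionPad2d(1),
    nn.Conv2d(3, 8, kernel_size=3),
    nn.ReLU(),
    nn.MaxPool2d(2),                              # halves height and width
    nn.Upsample(scale_factor=2, mode="nearest"),  # restores height and width
    nn.ReflectionPad2d(1),
    nn.Conv2d(8, 3, kernel_size=3),
)

output_shapes = {}
with torch.no_grad():
    for height, width in [(256, 256), (128, 192), (64, 320)]:
        image = torch.rand(1, 3, height, width)
        output_shapes[(height, width)] = tuple(toy_net(image).shape)

for size, shape in output_shapes.items():
    print(size, "->", shape)
```

The same weights process all three inputs. A network with a fully connected layer would instead require one fixed input size.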





In [None]:
# This is the engineering code: here we define the architecture of the AdaIN network, using the Encoder and Decoder defined in the architectures.py file in the repository. #

class Neural_style_network(pl.LightningModule):
  def __init__(self, lr, dec_path, alpha, num_epochs, device, first_train = False, net = 'vgg', residuals = True):
    super(Neural_style_network,self).__init__()
    
    self.enc = Encoder(device) # The Encoder is common to all the networks.
    self.path = dec_path       # The path is chosen to be where it was saved the model, or where we are about to save the model.
    if net == 'vgg':
      self.dec = Decoder()     # The Decoder is the mirror of the Encoder
    elif net == 'res':
      self.dec = DecodedRes(residuals) # The ResNet decoder is different, so it is defined as a separate class; it also changes depending on whether the residual connections are used.

    self.first = first_train           # We need to specify if it is the first trial with a network or not
    self.lr = lr
    self.alpha = alpha                 # alpha is the style weight to apply in the total loss function.
    self.epochs = num_epochs           # These are the number of epochs
    self.loss_list = []
    self.optimizer = torch.optim.Adam(self.dec.parameters(), lr=self.lr)
    self.scheduler = lr_scheduler.CosineAnnealingLR(self.optimizer, T_max=self.epochs, eta_min= 0, last_epoch= -1, verbose=True)
    
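    if not first_train and self.path is not None:  # Resume from the checkpoint stored at dec_path (skipped when no path is given)
      self.checkpoint = load_model(self.path, self, device)
      self.scheduler.load_state_dict(self.checkpoint['scheduler'])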

    
  
  def forward(self, content_image, style_image, test = None):  #This method implements the architecture as sketched in the structure.
    enc_image = self.enc.forward(content_image)
    enc_style = self.enc.forward(style_image, lista = True)  # The list that we give to the encoder means that we want to retrieve from the layers of the networks the levels associated to relu_1_1, relu_2_1, relu_3_1, relu_4_1.

    adapted_image = self.AdaIn(enc_image, enc_style[-1])
    
    if test is not None: # At test time we return the decoded image; `test` is the stylization strength
      
      decoded_adapt =(1-test)*enc_image +test*(adapted_image) # Interpolate the features: test=0 keeps the content features, test=1 uses the full AdaIN output.
      
      return self.dec(decoded_adapt)
    
    decoded_adapt =  self.dec(adapted_image)
    renc = self.enc.forward(decoded_adapt, lista = True)

    content_loss = self.Content_loss(renc[-1], adapted_image)
    style_loss = self.Style_loss(renc, enc_style)  

    return self.total_loss(content_loss, style_loss)
  
  def training_step(self, batch, batch_idx):
    content, style = batch
    loss = self.forward(content, style)
    self.log("train_loss", loss, on_step=True, on_epoch=True, prog_bar=True, logger=True)
    
    return loss

  def validation_step(self, batch, batch_idx):
    content, style = batch
    loss = self.forward(content, style)
    self.log("test_loss", loss, on_step=True, on_epoch=True, prog_bar=True, logger=True)
    self.loss_list.append(loss.item())
    return loss

  def on_train_epoch_end(self, *args, **kwargs):
    if self.path is not None:  # If the target path is not defined we don't save the model.
      model = self.state_dict()
      loss = sum(self.loss_list)/len(self.loss_list)
      checkpoint = {}
      checkpoint['model_state'] = model
      checkpoint['scheduler'] = self.scheduler.state_dict()
      if self.first: # we initialize the checkpoint loss list
        checkpoint['loss'] = [loss]
        self.first = False # Now we know that for a given path there will be already some information stored. 

      else:
        checkpoint['loss'] = self.checkpoint['loss'] + [loss] # Update the previous information.

      self.loss_list = [] # re-initialize the loss before the next epoch
      save_model(checkpoint, self.path)
      self.checkpoint = load_model(self.path, self, self.device)
      self.scheduler.load_state_dict(self.checkpoint['scheduler'])
    else:
      return



  def configure_optimizers(self):
    return [self.optimizer],[self.scheduler]

  def calc_mean_std(self, input, eps=1e-5): # Given a batch of feature maps, compute the mean and standard deviation of each channel of each sample; both have shape (batch, channel, 1, 1)
    batch_size, channels = input.shape[:2]

    reshaped = input.view(batch_size, channels, -1) # Reshape channel wise
    mean = torch.mean(reshaped, dim = 2).view(batch_size, channels, 1, 1) # Calculate mean and reshape
    std = torch.sqrt(torch.var(reshaped, dim=2)+eps).view(batch_size, channels, 1, 1) # Calculate variance, add epsilon (avoid 0 division),
                                                                                      # calculate std and reshape
    return mean, std

  def total_loss(self, content_loss, style_loss): # This is the total loss
    return content_loss + self.alpha*style_loss


  def AdaIn(self, content, style):
    assert content.shape[:2] == style.shape[:2] # Only the first two dims must match, so that different image sizes are possible
    batch_size, n_channels = content.shape[:2]
    mean_content, std_content = self.calc_mean_std(content)
    mean_style, std_style = self.calc_mean_std(style)

    output = std_style*((content - mean_content) / (std_content)) + mean_style # Normalize, then modify mean and std
    return output

  def Content_loss(self, input, target): # Content loss is a simple MSE Loss, we want to reduce the distance of the AdaIn output, with the re-encoded stylized image
    loss = F.mse_loss(input, target)
    return loss

  def Style_loss(self, input, target):
    mean_loss, std_loss = 0, 0

    for input_layer, target_layer in zip(input, target): 
      mean_input_layer, std_input_layer = self.calc_mean_std(input_layer)
      mean_target_layer, std_target_layer = self.calc_mean_std(target_layer)

      mean_loss += F.mse_loss(mean_input_layer, mean_target_layer) # Distance in the same channels is reduced within the same layer, and then it is done for all the layers.
      std_loss += F.mse_loss(std_input_layer, std_target_layer)

    return mean_loss+std_loss


# Training step 

The networks were trained until their test loss plateaued.
Some of the hyperparameters are listed here:  
* learning rate: $10^{-5}$,
* λ: 2 (the style weight, called `alpha` in the code; not to be confused with weight decay),
* batch size: 8,
* [Adam Optimizer](https://arxiv.org/abs/1412.6980)
* [Cosine learning rate Scheduler](https://arxiv.org/abs/1608.03983)
* Up-Sampling mode: 'nearest'
* Reflection Padding.  
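
As a rough illustration of the cosine schedule listed above, the sketch below evaluates the closed-form cosine annealing rule with the values used in this notebook (initial learning rate $10^{-5}$, 20 epochs, minimum 0). The function name `cosine_annealing_lr` is hypothetical, and the sketch assumes the scheduler is stepped once per epoch.

```python
import math


def cosine_annealing_lr(epoch, base_lr=1e-5, t_max=20, eta_min=0.0):
    """Closed-form cosine annealing: base_lr at epoch 0, eta_min at epoch t_max."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))


# Learning rate at the start of each epoch, plus the final value at epoch 20.
schedule = [cosine_annealing_lr(epoch) for epoch in range(21)]

for epoch in (0, 5, 10, 15, 20):
    print(f"epoch {epoch:2d}: lr = {schedule[epoch]:.2e}")
```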

The implementation of the losses was taken from this [repository](https://github.com/MAlberts99/PyTorch-AdaIN-StyleTransfer).  
The formulas are:  
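$$ \text{Content Loss}: L_c = \| f(g(t)) - t \|_2, $$ 
$$ \text{Style Loss}: L_s = \sum_{i=1}^{L} \| \mu(\phi_i(g(t))) - \mu(\phi_i(s)) \|_2 + \sum_{i=1}^{L} \| \sigma(\phi_i(g(t))) - \sigma(\phi_i(s)) \|_2, $$
$$ \text{Total Loss}: L = L_c + \lambda \cdot L_s. $$  

Here $t = \mathrm{AdaIN}(x,y)$, $g(t)$ is the decoded image, $f(\cdot)$ is the encoder function, $s$ is the style image, and $\phi_i$ is the $i$-th encoder layer used for the style loss (relu_1_1, relu_2_1, relu_3_1 and relu_4_1 in the code above).  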

For the style loss, we only minimize the distances between the feature statistics, whereas the content loss is meant to preserve the spatial structure of the image.
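
The difference between the two losses can be made concrete with a small sketch on random tensors. This is an illustration under assumptions, not the training code: the function names, the layer shapes and the random stand-ins for encoder outputs are hypothetical, and `F.mse_loss` (a mean of squared differences, as in the notebook's code) is used in place of the plain norms written in the formulas.

```python
import torch
import torch.nn.functional as F


def channel_mean_std(feat, eps=1e-5):
    """Per-sample, per-channel mean and std of a (N, C, H, W) feature map."""
    n, c = feat.shape[:2]
    flat = feat.reshape(n, c, -1)
    return flat.mean(dim=2), torch.sqrt(flat.var(dim=2) + eps)


def content_loss(reencoded, target):
    """Distance between the re-encoded stylized image f(g(t)) and the AdaIN output t."""
    return F.mse_loss(reencoded, target)


def style_loss(output_feats, style_feats):
    """Sum over layers of the mismatch between channel means and channel stds."""
    loss = torch.zeros(())
    for out_feat, sty_feat in zip(output_feats, style_feats):
        out_mean, out_std = channel_mean_std(out_feat)
        sty_mean, sty_std = channel_mean_std(sty_feat)
        loss = loss + F.mse_loss(out_mean, sty_mean) + F.mse_loss(out_std, sty_std)
    return loss


torch.manual_seed(0)
# Random stand-ins for four encoder layers (hypothetical channel counts and sizes).
shapes = [(2, 8, 32, 32), (2, 16, 16, 16), (2, 32, 8, 8), (2, 64, 4, 4)]
style_feats = [torch.randn(s) for s in shapes]
output_feats = [torch.randn(s) for s in shapes]
t = torch.randn(shapes[-1])  # stand-in for the AdaIN output

lam = 2.0  # style weight, as in the hyperparameter list
l_c = content_loss(output_feats[-1], t)
l_s = style_loss(output_feats, style_feats)
total = l_c + lam * l_s

# Shuffle the spatial positions of the last layer: the channel statistics are unchanged.
perm = torch.randperm(16)
shuffled = output_feats[-1].reshape(2, 64, 16)[:, :, perm].reshape(2, 64, 4, 4)
l_s_shuffled = style_loss(output_feats[:-1] + [shuffled], style_feats)
l_c_shuffled = content_loss(shuffled, t)

print("style loss   original / shuffled:", l_s.item(), l_s_shuffled.item())
print("content loss original / shuffled:", l_c.item(), l_c_shuffled.item())
```

Shuffling the spatial positions leaves the style loss unchanged up to rounding, because only per-channel statistics enter it, whereas the content loss compares the feature maps position by position.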



In [None]:
lr = 0.00001
alpha = 2
num_epochs =20

net_type = 'vgg'

universal_device = 'cuda'

PATH = None # e.g. './models/' + net_type + ".pt"; if None, the model won't be saved.

FIRST = False # If True, no checkpoint is loaded and a new one is written to PATH at the end of every epoch; if False, training resumes from the checkpoint stored at PATH.

model = Neural_style_network( lr, PATH, alpha, num_epochs, universal_device, first_train = FIRST) #, net = 'res', residuals = True)


In [None]:
trainer = pl.Trainer(
    max_epochs=num_epochs,  # maximum number of epochs.
    gpus=1,  # the number of gpus we have at our disposal.
    default_root_dir="./models/"
)


In [None]:
trainer.fit(model = model, datamodule = pl_data)

## End of the training code
### Proceed with the other notebook...