# Neural style transfer

---
> **This notebook is quite compute intensive: it might be better to run it on Google Colab (or on Kaggle). <br>
> Depending on your recent Google Colab usage, Kaggle may be faster.**
---

As seen in our previous courses, Machine Learning models are usually used to solve industrial problems such as classifying satellite images or extracting features from an image. Apart from these "real-life" industrial problems, some research works have less direct "practical" purposes and are more artistic, often introducing generative methods and models. 
**Style transfer** is a technique that generates a new image by merging a content image with a style image. For example, we can merge a picture of ISAE-SUPAERO's building with the style of some paintings or visual works:

<br>

<center><img src="https://drive.google.com/uc?id=1lb37kiTRzqu5QhjJMkZyF4XUSLsI3o6p" width="80%" style="background-color:white">
</center>


This notebook is divided into two main parts. The first section [Style transfer baseline](#section1) presents the basics of neural style transfer, and reproduces the results of the article [1]. The second section [Real-time style transfer](#section2), which relies on part of the article [4], improves the results of the first section by speeding up the generation process. A final short section [Going further](#section3) presents some improvements in visual neural style transfer, and also some adaptations of style transfer in other domains, especially in Music Information Retrieval.

In the following, we will call:
- The **content image**: the original image on which we want to transfer the style.
- The **style image**: the original image that contains the required style.
- The **combined image**: the output image that combines the content image with the style image.

Run this cell to download data and set up your session. If you are on Kaggle, check that the "Internet" toggle is switched on. The `%load` magic command does not seem to work on Colab, so simply copy-paste from the .py file if needed, or run a `!cat file` <br>

In [None]:
# Download data (size ~250 MB)
!wget https://nextcloud.isae.fr/index.php/s/DwP7K9HpmNKZaR2/download -O data.zip
    
!unzip -q data.zip
!rm data.zip

!mkdir train
!unzip -q data/coco.zip -d train
!unzip -q data/saved_models.zip 
!unzip -q data/images.zip
!unzip -q data/solutions.zip
!unzip -q data/results.zip
!mv data/utils.py ./utils.py
!rm -r data

!mkdir models
!rm -rf sample_data

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
import torchvision
from torchvision import transforms, datasets

import re
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.animation as animation
import imageio
from PIL import Image
from tqdm.autonotebook import tqdm
from IPython.display import display, HTML, Audio

import utils

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
device

<a class="anchor" id="section1"></a>
## Style transfer baseline 

This section is based on [Gatys & al. (2015), A neural algorithm of artistic style](https://arxiv.org/pdf/1508.06576.pdf), the main paper that tackled the issue of style transfer. This algorithm is an unsupervised method that generates one combined image from a content image and a style image.

The style transfer method described in this paper relies on three main ideas that can differ from other machine learning problems:
- We do not train a model itself. The optimized parameters are **not** the layers' parameters, but the **input** of the model.
- We compute a custom loss that takes into account the content and the style.
- The loss is computed using abstractions of the images, and not the images themselves.

### Overall process
The overall process is presented in this figure:

<br>

<center><img src="https://drive.google.com/uc?id=1WKQ6SgrAdZ8lz_Atq8a_gfywH9mHYA1Y" style="border: solid 1pt #cccccc" width="75%"><br>
Figure 1: Overall style transfer process presented in Gatys & al. (2015)
</center>

<br>

We first choose a content image $I_{content}$, a style image $I_{style}$, and we initialize a combined image $I_{comb}$ randomly. Then, we repeat this process:
- (1) Feed the combined image into a pre-trained VGG-19.
- (2) Using the original content and style images, compute the loss from the feature maps of the VGG-19. This loss is the weighted sum of a content loss and a style loss.
- (3) Backpropagate to the combined image and **only** update the pixels of the combined image (i.e. the VGG-19's parameters are not updated).

We then return the combined image.
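
The loop above can be sketched on a toy problem. This is only an illustration under simplifying assumptions: a small random convolution (`frozen_net`, a hypothetical name) stands in for the pre-trained VGG-19, and a plain MSE on its output stands in for the content and style losses. The point is that the optimizer receives the image tensor, not the network's parameters.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for the pre-trained network: its weights are frozen
frozen_net = nn.Conv2d(3, 8, kernel_size=3, padding=1)
for p in frozen_net.parameters():
    p.requires_grad = False
weights_before = frozen_net.weight.clone()

# Fixed target features (plays the role of the content/style targets)
target = frozen_net(torch.rand(1, 3, 16, 16)).detach()

# The "combined image" is the only tensor being optimized
image = torch.rand(1, 3, 16, 16, requires_grad=True)
optimizer = torch.optim.Adam([image], lr=0.05)

losses = []
for _ in range(100):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(frozen_net(image), target)
    loss.backward()      # the gradient flows through the network down to the pixels
    optimizer.step()     # only `image` is updated
    losses.append(loss.item())

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
print("network unchanged:", torch.equal(weights_before, frozen_net.weight))
```

The optimizer used here (Adam) is chosen only to keep the sketch short.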

The main remaining questions are thus : 
- *What are the inputs of the loss function ?*
- *How can we define a content loss and a style loss ?*

### Data loader

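We first load our data: the style image and the content image. Moreover, the pre-trained VGG-19 model used below was initially trained on the ImageNet dataset. Thus, it is usual to normalize our input images using the ImageNet statistics. In the cell below, this amounts to switching to the BGR channel order, subtracting the ImageNet mean (the standard deviation is left at 1), and rescaling the values to the 0-255 range.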
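
As a rough sketch of this pre-processing and of its inverse, the snippet below reproduces the arithmetic with plain tensor operations. The function names `to_network_space` and `to_image_space` are illustrative and are not part of the notebook's `utils` module.

```python
import torch

# ImageNet channel means in RGB order, on a 0-1 scale
mean_rgb = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)

def to_network_space(img_rgb):
    """RGB image in [0, 1] -> BGR, mean-subtracted, scaled to 0-255."""
    bgr = img_rgb.flip(0)                   # swap channel order RGB -> BGR
    return (bgr - mean_rgb.flip(0)) * 255

def to_image_space(x):
    """Inverse mapping: back to an RGB image in [0, 1]."""
    bgr = x / 255 + mean_rgb.flip(0)        # undo the scaling, add the mean back
    return bgr.flip(0).clamp(0, 1)          # BGR -> RGB, then clip to the valid range

img = torch.rand(3, 4, 4)
restored = to_image_space(to_network_space(img))
print(torch.allclose(img, restored, atol=1e-5))
```

Applying the inverse mapping is what allows an optimized tensor to be displayed again as a regular image.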

In [None]:
img_size = 512 
prep = transforms.Compose([
        transforms.Resize(img_size),
        transforms.ToTensor(),
        transforms.Lambda(lambda x: x[torch.LongTensor([2,1,0])]), # turn to BGR
        transforms.Normalize(mean=[0.406, 0.456, 0.485], # subtract imagenet mean
                            std=[1,1,1]),
        transforms.Lambda(lambda x: x.mul_(255)),
    ])

post = transforms.Compose([
        transforms.Lambda(lambda x: x.mul_(1./255)),
        transforms.Normalize(mean=[-0.406, -0.456, -0.485], # add imagenet mean
                            std=[1,1,1]),
        transforms.Lambda(lambda x: x[torch.LongTensor([2,1,0])]), # turn to RGB
        transforms.Lambda(lambda x: torch.clip(x,0,1)),
        transforms.ToPILImage()
    ])

In [None]:
style_filename = "images/style-images/mosaic.jpg"
content_filename = "images/content-images/isae.jpg"

img_names = [style_filename, content_filename]
imgs = [Image.open(name) for name in img_names]
imgs_torch = [prep(img).unsqueeze(0).to(device) for img in imgs]
style_img, content_img = imgs_torch

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5))
utils.show_img_gatys(style_img, ax=ax1, title="Style")
utils.show_img_gatys(content_img, ax=ax2, title="Content")

<a class="anchor" id="subsection-model"></a>
### Model 

In order to get an abstraction of an image, the style transfer method uses the features found in a pre-trained VGG-19 network. We thus use the following network, that is to say the full VGG-19 network from which the final dense classification layers are removed:

<br>

<center><img src="https://drive.google.com/uc?id=1hIY24ERLVEUfOPJDuXX7Dpk1R1TBGmsM" style="border: solid 1pt #cccccc"/><br>
Figure 2: VGG-19 network and feature maps used for content and style losses
</center>

<br>

As described more precisely in the following section [Loss](#subsection-loss), we will use the intermediate outputs of some layers as abstractions of the images.
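
For illustration, one generic way of reading intermediate outputs in PyTorch is to register forward hooks. The sketch below uses a small hypothetical network rather than the VGG-19; the notebook itself takes a different route and rewrites the network so that `forward` returns the requested outputs directly.

```python
import torch
import torch.nn as nn

# Hypothetical small network standing in for a pre-trained feature extractor
net = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
)

activations = {}

def save_output(name):
    # Returns a hook that stores the layer's output under `name`
    def hook(module, inputs, output):
        activations[name] = output
    return hook

# Keep the outputs of the two ReLU layers (indices 1 and 4)
net[1].register_forward_hook(save_output("relu1"))
net[4].register_forward_hook(save_output("relu2"))

final = net(torch.rand(1, 3, 32, 32))
for name, act in activations.items():
    print(name, tuple(act.shape))
```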

In [None]:
full_vgg19 = torch.hub.load('pytorch/vision:v0.10.0', 'vgg19', pretrained=True)

We define a more practical model, so that the output of the model is not only the result after the fifth max-pooling layer, but also all the required intermediate results.

In [None]:
class VGG19Features(nn.Module):
    def __init__(self):
        super(VGG19Features, self).__init__()
        self.conv1_1 = nn.Conv2d(3, 64, kernel_size=3, padding=1)
        self.conv1_2 = nn.Conv2d(64, 64, kernel_size=3, padding=1)
        self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2)
        
        self.conv2_1 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
        self.conv2_2 = nn.Conv2d(128, 128, kernel_size=3, padding=1)
        self.pool2 = nn.MaxPool2d(kernel_size=2, stride=2)
        
        self.conv3_1 = nn.Conv2d(128, 256, kernel_size=3, padding=1)
        self.conv3_2 = nn.Conv2d(256, 256, kernel_size=3, padding=1)
        self.conv3_3 = nn.Conv2d(256, 256, kernel_size=3, padding=1)
        self.conv3_4 = nn.Conv2d(256, 256, kernel_size=3, padding=1)
        self.pool3 = nn.MaxPool2d(kernel_size=2, stride=2)
        
        self.conv4_1 = nn.Conv2d(256, 512, kernel_size=3, padding=1)
        self.conv4_2 = nn.Conv2d(512, 512, kernel_size=3, padding=1)
        self.conv4_3 = nn.Conv2d(512, 512, kernel_size=3, padding=1)
        self.conv4_4 = nn.Conv2d(512, 512, kernel_size=3, padding=1)
        self.pool4 = nn.MaxPool2d(kernel_size=2, stride=2)
        
        self.conv5_1 = nn.Conv2d(512, 512, kernel_size=3, padding=1)
        self.conv5_2 = nn.Conv2d(512, 512, kernel_size=3, padding=1)
        self.conv5_3 = nn.Conv2d(512, 512, kernel_size=3, padding=1)
        self.conv5_4 = nn.Conv2d(512, 512, kernel_size=3, padding=1)
        self.pool5 = nn.MaxPool2d(kernel_size=2, stride=2)
        
    
    def forward(self, x, out_keys):
        # arguments
        # ---------
        # out_keys : list[str]
        #    IDs of layers used in the loss function (both style and content)
        #
        # return
        # ------
        # list[torch.tensor] 
        #     List of layers' output, in the same order as out_keys 
        
        out = dict()
        out['r11'] = F.relu(self.conv1_1(x))
        out['r12'] = F.relu(self.conv1_2(out['r11']))
        out['p1'] = self.pool1(out['r12'])
        
        out['r21'] = F.relu(self.conv2_1(out['p1']))
        out['r22'] = F.relu(self.conv2_2(out['r21']))
        out['p2'] = self.pool2(out['r22'])
        
        out['r31'] = F.relu(self.conv3_1(out['p2']))
        out['r32'] = F.relu(self.conv3_2(out['r31']))
        out['r33'] = F.relu(self.conv3_3(out['r32']))
        out['r34'] = F.relu(self.conv3_4(out['r33']))
        out['p3'] = self.pool3(out['r34'])
        
        out['r41'] = F.relu(self.conv4_1(out['p3']))
        out['r42'] = F.relu(self.conv4_2(out['r41']))
        out['r43'] = F.relu(self.conv4_3(out['r42']))
        out['r44'] = F.relu(self.conv4_4(out['r43']))
        out['p4'] = self.pool4(out['r44'])
        
        out['r51'] = F.relu(self.conv5_1(out['p4']))
        out['r52'] = F.relu(self.conv5_2(out['r51']))
        out['r53'] = F.relu(self.conv5_3(out['r52']))
        out['r54'] = F.relu(self.conv5_4(out['r53']))
        out['p5'] = self.pool5(out['r54'])
        
        return [out[key] for key in out_keys]

In [None]:
VGG19Features()

We now copy the parameters of the pre-trained torch model into our model.
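
The copy relies on two helpers from `torch.nn.utils`. As a minimal sketch on a single pair of layers (the layers `source` and `target` are illustrative and not taken from the VGG-19):

```python
import torch
import torch.nn as nn
from torch.nn.utils import parameters_to_vector, vector_to_parameters

torch.manual_seed(0)
source = nn.Conv2d(3, 4, kernel_size=3, padding=1)   # plays the role of a pre-trained layer
target = nn.Conv2d(3, 4, kernel_size=3, padding=1)   # freshly initialized layer, same shape

# Flatten weight and bias of the source layer into a single 1-D vector ...
flat = parameters_to_vector(source.parameters())
# ... and write that vector back into the target layer's parameters
vector_to_parameters(flat, target.parameters())

x = torch.rand(2, 3, 8, 8)
print(flat.numel())                         # 3*4*3*3 weights + 4 biases
print(torch.equal(source(x), target(x)))
```

This only works when both layers declare their parameters in the same order and with the same shapes, which is why the custom model mirrors the layer layout of the original one.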

In [None]:
# Retrieve parameters from the full VGG
list_params_original = []
for layer in full_vgg19.features:
    if isinstance(layer, torch.nn.modules.conv.Conv2d):
        params = layer.parameters()
        list_params_original.append(torch.nn.utils.parameters_to_vector(params))
        
# Set params to the new model
def init_params(model):
    i = 0
    for layer in model.children():
        if isinstance(layer, torch.nn.modules.conv.Conv2d):
            torch.nn.utils.vector_to_parameters(list_params_original[i], layer.parameters())
            i += 1

    for param in model.parameters():
        param.requires_grad = False

In [None]:
# Check that our model and the torch model give both the same result
mtest = VGG19Features()
init_params(mtest)

x0 = torch.rand(4, 3, 100, 100)
assert torch.equal(full_vgg19.features(x0), mtest(x0, ['p5'])[0])

In [None]:
# Initialization of the VGG model, that will then NEVER be updated
vgg19 = VGG19Features()
init_params(vgg19)
vgg19 = vgg19.to(device)

<a class="anchor" id="subsection-loss"></a>
### Loss

The loss function is the key idea of the paper. We want to quantify these two statements: "we can still recognize the content of the image" and "the image is in the style of the style image". Thus, the loss function is the sum of two terms, the content loss and the style loss.

#### Content loss

First of all, we want to define a content loss. 

We can take a simple example: the content image is a gray house, and here are combined images. 

<br>

<center><img src="https://drive.google.com/uc?id=1DLiBFAQbOb45XbHITbpxOgkBumsglKyA" style="border: solid 1pt #cccccc" width="40%"/></center>

The three combined images are also obviously houses, but if we compare the content image and a combined image pixel by pixel, they will be very different. So a pixel-by-pixel MSE between the content and the combination won't be relevant.

Instead, it would be better to compute the loss between abstractions of the content image and the combined image. We can thus use feature maps resulting from one of the late convolution layers of the VGG-19 presented above. Indeed, given that we use a *pre-trained* model, we hope that the model has learnt to store high-level information in the late feature maps, that is to say, information about the content of the image. Therefore, to compare the content of two images, we will compare their two corresponding feature maps. 
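
A toy illustration of this idea, under a strong simplification: a 2x2 average pooling stands in for the learned feature maps. A checkerboard and the same checkerboard shifted by one pixel have the same "content", yet every pixel differs.

```python
import torch
import torch.nn.functional as F

# 8x8 checkerboard (values 0/1) and the same pattern shifted by one pixel
rows = torch.arange(8).view(8, 1)
cols = torch.arange(8).view(1, 8)
board = ((rows + cols) % 2).float().view(1, 1, 8, 8)
shifted = torch.roll(board, shifts=1, dims=3)

# Pixel-wise comparison: every pixel differs
pixel_mse = F.mse_loss(board, shifted).item()

# Toy "abstraction": 2x2 average pooling stands in for a learned feature map
feature_mse = F.mse_loss(F.avg_pool2d(board, 2), F.avg_pool2d(shifted, 2)).item()

print(pixel_mse, feature_mse)  # 1.0 0.0
```

The pixel-wise MSE is maximal while the MSE on the pooled representation is zero. Real VGG features are of course far richer than an average pooling, but the same principle applies.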

In practice, we will simply compute the MSE between a late feature map with the content image as input, and that same feature map with the combined image as input:

<center><img src="https://drive.google.com/uc?id=1PuNIlW1LXC3OgB5yIJOhKuzzOA2TTxGr" style="border: solid 1pt #cccccc" width="80%"/><br>
Figure 3: Content loss
</center>

<br>

where $F^\ell_{comb}$ (resp. $F^\ell_{content}$) is the feature map $\ell$ found in the VGG-19 model, with the combined (resp. content) image as input.

The paper suggests using the feature maps resulting from the convolutional layer `conv4_2`.

<font color="green">
<hr>
    
> **Your turn !** <br>
> - `loss_fn_content` is the content loss function. <br>
> - `content_targets` is a list of tensors of size `len(content_layers)` containing the objects passed to the content loss function during the training process.
<hr>
</font>

In [None]:
content_layers = ['r42']
loss_fn_content = ...
content_targets = ...

In [None]:
# !cat solutions/ex1.py
# %load solutions/ex1.py

#### Style loss 

The style loss is less direct. Indeed, the style can be characterized by the color, the texture, possible repeated patterns... The paper describes the style as "feature correlations", that is to say how often two features co-occur. For instance, in Van Gogh's *Starry Night*, we want to describe the style as: "there are a lot of blue lines". We now want to quantify these co-occurrences.

Let's take a simple example (adapted from [3]). Let's consider a simple convolutional layer with two output channels. The first one detects diagonal lines and the second one detects green objects. We feed the layer with a 4x4 image like that:

<center><img src="https://drive.google.com/uc?id=1LlBG9DWK7xdbYz-2xuBKbtEvCdOOVc3Z" width="30%"/></center>

Thus, the content of the feature maps is the following. The diagonal feature map captures the two diagonal lines on the left and the green feature map captures the green objects at the bottom.

<center><img src="https://drive.google.com/uc?id=1VyLCDgVEN1xgYONoHcu1mqd8RPhl4sUk" width="60%"/></center>

If we compute the [Frobenius inner product](https://en.wikipedia.org/wiki/Frobenius_inner_product) between these two feature maps (i.e. the inner product of the two flattened matrices), we have:

$$
\begin{bmatrix}
1 & 0 \\ 1 & 0
\end{bmatrix}
\cdot
\begin{bmatrix}
0 & 0 \\ 1 & 1
\end{bmatrix}
= 1
$$

Now, let's take another image as input of the convolutional layer:

<center><img src="https://drive.google.com/uc?id=1TeTqdRyv9UgHPVsmmjv2gCrD3rYpD5-i" width="60%"/></center>

We also compute the inner product :
$$
\begin{bmatrix}
1 & 1 \\ 1 & 1
\end{bmatrix}
\cdot
\begin{bmatrix}
1 & 1 \\ 1 & 1
\end{bmatrix}
= 4
$$

The second inner product is greater than the first one. We managed to quantify the observation that "green diagonal lines have a greater contribution in the style of the second image than in the first image". In other words, we managed to put a single value describing the co-occurrence between these two features "green" and "diagonal". Now, let's formalize and generalize this approach.
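
The two inner products above can be checked numerically. The helper name `frobenius_inner` is only illustrative.

```python
import torch

def frobenius_inner(a, b):
    # Element-wise product summed over all entries
    return (a * b).sum().item()

# First image: "diagonal" and "green" feature maps
diag_1 = torch.tensor([[1., 0.], [1., 0.]])
green_1 = torch.tensor([[0., 0.], [1., 1.]])

# Second image: both features are active everywhere
diag_2 = torch.ones(2, 2)
green_2 = torch.ones(2, 2)

print(frobenius_inner(diag_1, green_1))  # 1.0
print(frobenius_inner(diag_2, green_2))  # 4.0
```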

To describe a style, we want to quantify how often a feature co-occurs with another feature. Thus, from the output of one convolutional layer of depth $K$, we will compute $K^2$ inner products, corresponding to the inner product between each pair of feature maps, and store them in a $K \times K$ matrix, called [Gram matrix](https://en.wikipedia.org/wiki/Gram_matrix). This Gram matrix is thus the fingerprint of the style of an image. 

Visually, for a set $F^\ell$ of feature maps of size $(h, w, K)$ resulting from a convolutional layer: 

<center><img src="https://drive.google.com/uc?id=1-XyuRauYVqHX8SpjFmY091k514Lko3n1" width="20%"/></center>

we construct the Gram matrix $G^\ell$ of size $K \times K$, in which, for $A_i, A_j$, two feature maps of size $h \times w$:

$$
\begin{align}
    \forall(i,j) \in [1 ... K]^2,~~G_{ij}^\ell 
    & = \langle A_i, A_j \rangle \\
    & = \text{flatten}(A_i) \times \text{flatten}(A_j)^T
\end{align}
$$

where $\langle\cdot,\cdot\rangle$ is the Frobenius inner product.

Thus, the Gram matrix contains the style information captured by one convolutional layer.
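
As a small numerical check of the definition (a sketch on random data, without any batch dimension or normalization), the pairwise inner products can be compared with the equivalent matrix product:

```python
import torch

torch.manual_seed(0)
K, h, w = 3, 4, 5
feature_maps = torch.rand(K, h, w)

# Explicit definition: one inner product per pair of feature maps
gram_loops = torch.zeros(K, K)
for i in range(K):
    for j in range(K):
        gram_loops[i, j] = (feature_maps[i] * feature_maps[j]).sum()

# Equivalent matrix form: flatten each map into a row, then multiply by the transpose
flat = feature_maps.view(K, h * w)
gram_matmul = flat @ flat.T

print(torch.allclose(gram_loops, gram_matmul))
print(torch.allclose(gram_matmul, gram_matmul.T))  # a Gram matrix is symmetric
```

The matrix form is the one that scales: it replaces $K^2$ separate inner products with a single matrix multiplication.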

We can then compute multiple Gram matrices from multiple layers within the VGG-19 model. To compare styles of two images, we thus compare the Gram matrices of these layers, by applying a mean-squared error function. 
To sum up, we first retrieve the Gram matrices of the style image and the Gram matrices of the combined image and we then compute the MSE between these two Gram matrices:

<center><img src="https://drive.google.com/uc?id=1EwPOTfaHLtB2lVl1Ni7-CXB_FFkZbzic" style="border: solid 1pt #cccccc" width="90%"/><br>
Figure 4: Style loss
</center>

<br>

And we reiterate this process for each chosen layer. The final style loss is then :
$$
    \mathcal{L}_{style} = \sum_\ell w_\ell E_\ell
$$

where $w_\ell$ is an optional weight and $E_\ell$ is the style loss of one layer:
$$
    E_\ell = \text{MSE}\left(\text{Gram}(F^\ell_{comb}), \text{Gram}(F^\ell_{style})\right)
$$


The paper suggests using the feature maps $\ell$ resulting from the convolutional layers `conv1_1`, `conv2_1`, `conv3_1`, `conv4_1` and `conv5_1`.

In [None]:
def gram_matrix(y):
    """Computes the Gram matrix of a set of feature maps"""
    (b, ch, h, w) = y.size() # batch, channels, height, width
    # Flatten the feature map
    features = y.view(b, ch, w * h) 
    # Transpose the flatten matrix without modifying the batch and channel dimensions
    features_t = features.transpose(1, 2) 
    # Batch-matrix-multiplication and normalization by channels*height*width
    gram = features.bmm(features_t) / (ch*h*w)
    return gram

<font color="green">
<hr>
    
> **Your turn !** <br>
> - `out` is the value of the style loss between a feature map `feature` and its corresponding `target` stored in `style_targets`.  <br> 
> - `loss_fn_style` is the style loss function. <br>
> - `style_targets` is a list of tensors of size `len(style_layers)` containing the objects passed to the style loss function during the training process.
<hr>
</font>

In [None]:
class GramMSELoss(nn.Module):
    def forward(self, feature, target):
        out = ...
        return out

In [None]:
style_layers = ['r11','r21','r31','r41', 'r51'] 
loss_fn_style = ...
style_targets = ...

In [None]:
# !cat solutions/ex2.py
# %load solutions/ex2.py

In [None]:
# !cat solutions/ex3.py
# %load solutions/ex3.py

The final loss is a weighted sum of the style loss and the content loss:

$$
\mathcal{L}_{total}(I_{comb}) = \alpha \cdot \mathcal{L}_{content}(I_{comb}, I_{content}) + \beta \cdot \mathcal{L}_{style}(I_{comb}, I_{style})
$$

In [None]:
# Style and content weights for the total loss
style_weight = 1e5
content_weight = 1e0

### Training

We can now begin the training process. To recap, we initialize a random combined image. We then feed the VGG-19 with the combined image. After extracting the features needed to compute the content and style losses, we backpropagate the gradient through the network **without updating its parameters**. Once the gradient reaches the input combined image, we **update the pixels of this combined image**. We then repeat this process with the updated combined image as input.

Some practical steps before training. 
- First, just to make it handier, we concatenate the lists to call `model()` only once, instead of twice (to compute the content loss and then the style loss). 
- We can also `.detach()` the targets (high-level feature map for content, and Gram matrices for style) given that they are fixed values: the gradient does not need to backpropagate through them. Thus, detaching them from the computation graph can save a small amount of computation.
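
To make the effect of `.detach()` concrete, here is a minimal sketch on a two-element tensor (the variables are illustrative and unrelated to the notebook's targets):

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

prediction = w * 3.0            # part of the computation graph
target = (w * 2.0).detach()     # same values as w * 2, but cut off from the graph

loss = ((prediction - target) ** 2).sum()
loss.backward()

print(target.requires_grad)     # False
print(w.grad)                   # the gradient comes from `prediction` only
```

Because `target` is detached, it is treated as a constant during backpropagation.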

In [None]:
loss_layers = style_layers + content_layers
targets = [target.detach() for target in (style_targets + content_targets)]

- Instead of initializing the first combined image as a random image, we can initialize it by cloning the content image. Thus, the content loss will increase (given that adding the style will twist the actual content), but we will get an acceptable result faster. You can initialize it with noise instead, if you want.
- We use the [L-BFGS](https://pytorch.org/docs/stable/generated/torch.optim.LBFGS.html) optimizer, as suggested in the baseline material: that is why we must define the required function `closure()`. Note that the params argument is `[out_img]`, and not something like `model.parameters()` as we usually wrote in previous courses.

On Colab, it takes ~ 5 minutes (and on Kaggle ~ 1 min).
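
As a minimal sketch of the `closure()` pattern, independent of style transfer, the snippet below minimizes a simple quadratic with L-BFGS. The counter `calls` is only there to show that a single call to `optimizer.step` may evaluate the closure several times, which is why the notebook counts iterations inside the closure.

```python
import torch

x = torch.zeros(3, requires_grad=True)    # the tensor being optimized
optimizer = torch.optim.LBFGS([x])         # a list of tensors, not model.parameters()

calls = {"n": 0}

def closure():
    # L-BFGS re-evaluates the loss and its gradient through this function
    calls["n"] += 1
    optimizer.zero_grad()
    loss = ((x - 3.0) ** 2).sum()
    loss.backward()
    return loss

optimizer.step(closure)
calls_after_first_step = calls["n"]

for _ in range(4):
    optimizer.step(closure)

print(x.detach(), calls_after_first_step)
```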

In [None]:
# Initialize the output image
## with noise
# out_img = torch.randn_like(content_img).requires_grad_()
## with the content image
out_img = content_img.clone().requires_grad_()


# ===== Hyper-parameters =====
max_iter = 500
optimizer = torch.optim.LBFGS([out_img])
# ============================

show_every = 50
save_every = 10
n_iter = 0
list_images = [] # To make the animation

utils.print_head()
    

# Training loop
def closure():
    global n_iter

    # Set gradients of out_img at zero
    optimizer.zero_grad()
    
    # `out` is a list of tensors, corresponding to each feature maps indexed by loss_layers
    out = vgg19(out_img, loss_layers)

    # Retrieve style_losses and content_losses
    style_losses = []
    content_losses = []
    for feat_i, feature in enumerate(out):
        if feat_i < len(style_layers):
            style_losses.append(style_weight * loss_fn_style(feature, targets[feat_i])) 
        else: 
            content_losses.append(content_weight * loss_fn_content(feature, targets[feat_i])) 

    # The total loss is the sum of style and content losses
    content_loss = sum(content_losses)
    style_loss = sum(style_losses) 
    loss = content_loss + style_loss
    
    # Updates the image
    loss.backward()
    n_iter += 1

    # Log
    if n_iter % show_every == (show_every-1):
        utils.print_log(n_iter, content_loss, style_loss, loss)
        # print([f"{loss_layers[li]} : {l.item()}" for li, l in enumerate(layer_losses)])

    if n_iter % save_every == (save_every-1):
        list_images.append(post(out_img.clone().data[0].cpu().squeeze()))

    return loss


while n_iter <= max_iter:
    optimizer.step(closure)

### Visualization

Once the training process has ended, we can retrieve `out_img` (that we optimized in the training loop) to visualize it. Some style images give better results than other images: for instance, Van Gogh's *Starry Night* or Hokusai's *Great Wave* are perceptually less convincing, probably due to more complex features and a less homogeneous texture.

In [None]:
utils.show_img_gatys(out_img)
plt.gcf().set_size_inches(7, 7)

In [None]:
utils.save_img_gatys(out_img, filename='combined_image_gatys.png')

We can also visualize the evolution of `out_img` during the training process.

In [None]:
anim = utils.make_gif_one_canvas(list_images)
HTML(anim.to_html5_video())

And finally, here is the impact of the relative weighting between content and style. The indicated value is the ratio $\beta/\alpha$, i.e. the ratio of the style weight to the content weight. For these images, `out_img` was first initialized with noise.

<center><img src="https://drive.google.com/uc?id=17Un-UJmHRbZK7LiXpxViWpDrwJ1HqA5M" width="90%" style="background-color:white">
</center>

<br>

To conclude this first section, we have managed to merge a content image and a style image to generate this combined image.

However, you may notice that the main drawback of this method is that generating *one* image is already time-consuming. For example, it would not be suited to video style transfer, let alone real-time style transfer.
The next section will focus on improving this method to generate images **of the same style** more quickly.

---

<a class="anchor" id="section2"></a>
## Real-time style transfer 

This section is based on [Johnson & al. (2016), Perceptual Losses for Real-Time Style Transfer and Super-Resolution](https://arxiv.org/abs/1603.08155), which improves on Gatys's paper.

### Overall process

The style transfer process relies on two models. A first model aims at **generating combined images** from a content image and is **trained** on a specific style. The second model, as in Gatys's paper, is a pre-trained network (here a VGG-16) that aims at **abstracting images**, from which we will retrieve feature maps to compute content and style losses.

The overall process is presented in this figure:

<br>

<center><img src="https://drive.google.com/uc?id=1wDsTN9UCY2E9ghf2RhDfooauk3Pwsaca" style="border: solid 1pt #cccccc" width="100%"><br>
Figure 5: Overall style transfer process presented in Johnson & al. (2016)
</center>

<br>

We first choose an image dataset, a style image $I_{style}$, and we initialize the image transformation network randomly. Then, we repeat this process:
- (1) Feed an image from the dataset into the transformation network.
- (2) This network then produces a combined image $I_{comb}$.
- (3) Feed this combined image $I_{comb}$ into a pre-trained VGG-16.
- (4) Using the original content image (i.e. the image from the dataset) and the chosen style image, compute the loss from the feature maps of the VGG-16, as in Gatys's paper.
- (5) Backpropagate to the transformation network and **only** update its parameters (i.e. the VGG-16 and the images are **not** updated).

Thus, the image transformation network will learn to generate images **of the same style** as $I_{style}$. To transfer this style on our images, we simply feed them in this network and get the returned combined image.

To recap, the main difference with Gatys's paper is that we will now **optimize a model**, and not the input of the model.
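
This change of optimization target can be sketched on a toy problem. Both networks below are hypothetical stand-ins (single convolutions), and the loss only contains a content-like term: the goal is merely to show which parameters move during training.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins: a tiny trainable generator and a frozen "loss network"
transform_net = nn.Conv2d(3, 3, kernel_size=3, padding=1)
loss_net = nn.Conv2d(3, 8, kernel_size=3, padding=1)
for p in loss_net.parameters():
    p.requires_grad = False

loss_net_before = loss_net.weight.clone()
transform_before = transform_net.weight.detach().clone()

# The optimizer now receives the generator's parameters, not an image
optimizer = torch.optim.Adam(transform_net.parameters(), lr=1e-2)

for _ in range(20):
    x = torch.rand(4, 3, 16, 16)        # a batch of "content images"
    y = transform_net(x)                # generated images
    # Toy perceptual loss: compare features of the output with features of the input
    loss = nn.functional.mse_loss(loss_net(y), loss_net(x))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print("loss network unchanged:", torch.equal(loss_net_before, loss_net.weight))
print("generator updated:", not torch.equal(transform_before, transform_net.weight))
```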

### Data Loader

Like above, the style image will be normalized with the ImageNet mean and standard deviation. The content images will also be normalized before being fed to the networks during the training process.

The model is trained on the [COCO Dataset](https://cocodataset.org/) that gathers images (and annotations, that we won't use). Models provided in the paper are actually trained on the complete dataset (330,000 images). To make it faster, the data downloaded at the beginning contains only 1000 images of the COCO dataset. Even so, we get acceptable results with this small amount of training data.
Some completely trained models are provided in `saved_models/` (both provided with the article, and manually trained for this workshop).

In [None]:
# ===== Hyper-parameters =====
image_size = 128
dataset = "train"
lr = 1e-2
batch_size = 4
style_image = "images/style-images/mosaic.jpg"
epochs = 10
content_weight = 1e5
style_weight = 1e10
# ============================

In [None]:
transform = transforms.Compose([
    transforms.Resize(image_size),
    transforms.CenterCrop(image_size),
    transforms.ToTensor(),
    transforms.Lambda(lambda x: x.mul(255))
])
train_dataset = datasets.ImageFolder(dataset, transform)
train_loader = DataLoader(train_dataset, batch_size=batch_size)

In [None]:
def normalize_batch(batch):
    # Normalize using imagenet mean and std
    mean = batch.new_tensor([0.485, 0.456, 0.406]).view(-1, 1, 1)
    std = batch.new_tensor([0.229, 0.224, 0.225]).view(-1, 1, 1)
    batch = batch / 255.0  # out-of-place, so that the caller's tensor is not modified
    return (batch - mean) / std

In [None]:
style_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))
])
style = Image.open(style_image).convert('RGB')
style = style_transform(style)
style = style.repeat(batch_size, 1, 1, 1).to(device)

### Model


In the figure above, the first network "Transform Image Network" learns to generate combined images from a content image and is trained on an image dataset. It is based on downsampling convolutional layers, residual blocks and upsampling convolutional layers. The network is detailed in the paper's [supplementary material](https://cs.stanford.edu/people/jcjohns/papers/fast-style/fast-style-supp.pdf).
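
To give an idea of this layout, here is a heavily reduced sketch: one downsampling convolution, one residual block and one upsampling stage. It is not the architecture of the paper nor the notebook's `utils.ImageTransformNet`; the class names, channel counts and the use of instance normalization are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3 convolutions with a skip connection (spatial size is preserved)."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.InstanceNorm2d(channels, affine=True),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.InstanceNorm2d(channels, affine=True),
        )

    def forward(self, x):
        return x + self.body(x)

class TinyTransformNet(nn.Module):
    """Hypothetical reduced layout: downsample -> residual -> upsample."""
    def __init__(self):
        super().__init__()
        self.down = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),  # halves H and W
            nn.ReLU(),
        )
        self.res = ResidualBlock(16)
        self.up = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="nearest"),           # doubles H and W
            nn.Conv2d(16, 3, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return self.up(self.res(self.down(x)))

net = TinyTransformNet()
out = net(torch.rand(1, 3, 64, 64))
print(tuple(out.shape))  # same shape as the input
```

The key property is that the output has the same shape as the input, so the network maps an image to an image.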


The second network is used to extract different levels of abstraction of the combined image in order to compute the content and the style losses. Like in Gatys's paper, a convolutional network is used. However, it specifically uses the features of a VGG-16 model, which is similar to the VGG-19, except that it has three fewer convolutional layers.
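
The difference between the two networks can be summarized by the number of convolutions in each of their five blocks:

```python
# Number of 3x3 convolutions in each of the five blocks (before each max-pooling)
vgg16_blocks = [2, 2, 3, 3, 3]
vgg19_blocks = [2, 2, 4, 4, 4]

extra = [b19 - b16 for b16, b19 in zip(vgg16_blocks, vgg19_blocks)]
print(sum(vgg16_blocks), sum(vgg19_blocks))  # 13 16
print(extra)                                 # [0, 0, 1, 1, 1]
```

Adding the three dense layers of the full classifiers gives the 16 and 19 weight layers the networks are named after. In the notation used in this notebook, the outputs `r34`, `r44` and `r54` therefore exist only in the VGG-19.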

<br>

<center><img src="https://drive.google.com/uc?id=1pvXn8cjoJm2nPf4vXg4o8z2uSKeJ2wL3" style="border: solid 1pt #cccccc" width="90%"><br>
Figure 6: VGG-16 network and feature maps used for content and style losses
</center>

<br>

In [None]:
vgg16 = utils.initialize_pretrained_vgg16().to(device)

### Loss

We use the same loss as above, that is to say the sum of a content loss and a style loss. The content loss remains an MSE between two feature maps (i.e. two abstractions of the image). The style loss remains the sum of $E_\ell$ over a few layers $\ell$, where $E_\ell$ is the MSE between Gram matrices, which represent co-occurrences of features.

<font color="green">
<hr>
    
> **Your turn !** (Similar to above...)<br>
> - `style_targets` is a list of tensors of size `len(style_layers)` containing the objects passed to the style loss function during the training process.
> - `loss_fn_style` is the style loss function. <br>
> - `loss_fn_content` is the content loss function. <br>
<hr>
</font>

Note that we do not compute `content_targets` beforehand, given that the content will change depending on the image of the dataset. Thus, these targets must be computed within the training loop.

In [None]:
style_layers = ['r12','r22','r33','r43'] 
style_targets = ...
loss_fn_style = ...

content_layers = ['r22']
loss_fn_content = ...

loss_layers = style_layers + content_layers

In [None]:
# !cat solutions/ex4.py
# %load solutions/ex4.py

### Training (optional)

The training loop is a common training loop, in which we now optimize the **model** parameters (unlike the training loop above, in which we optimized the image). 

**If you have enough time, you can train your own model (~ 2'00 per epoch on Colab, and 0'40 per epoch on Kaggle), else you can skip this cell and use a pre-trained model in the following subsection.**

In [None]:
model = utils.ImageTransformNet().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr)

for e in range(epochs):
    model.train()
    agg_content_loss = 0
    agg_style_loss = 0
    count = 0
    
    pbar = tqdm(total=len(train_loader), desc=f"Epoch {e+1}/{epochs}", leave=False)
    utils.print_head()
    
    for batch_id, (x, _) in enumerate(train_loader):
        n_batch = len(x)
        count += n_batch
        
        # Reset the gradients of the model parameters to zero
        optimizer.zero_grad()

        x = x.to(device)
        y = model(x)
        
        # `features_y` and `features_x` are lists of tensors, one per feature map indexed by loss_layers
        features_y = vgg16(normalize_batch(y), loss_layers)
        features_x = vgg16(normalize_batch(x), loss_layers)
        
        # Retrieve style_losses and content_losses
        style_losses = []
        content_losses = []
        for feat_i, feature in enumerate(features_y):
            if feat_i < len(style_layers):
                gram_feat_i = style_targets[feat_i] # Get the batch of Gram matrices corresponding to this feature
                style_losses.append(style_weight * loss_fn_style(feature, gram_feat_i[:n_batch, :, :]))
            else: 
                content_losses.append(content_weight * loss_fn_content(feature, features_x[feat_i])) 
        
        # The total loss is the sum of style and content losses
        content_loss = sum(content_losses)
        style_loss = sum(style_losses)
        total_loss = content_loss + style_loss
        
        # Updates the model
        total_loss.backward()
        optimizer.step()
        
        # Log
        agg_content_loss += content_loss.item()
        agg_style_loss += style_loss.item()

        if (batch_id + 1) % 10 == 0:
            avg_content_loss = agg_content_loss / (batch_id + 1)
            avg_style_loss = agg_style_loss / (batch_id + 1)
            avg_total_loss = avg_content_loss + avg_style_loss
            utils.print_log(e, avg_content_loss, avg_style_loss, avg_total_loss, count, len(train_dataset))
        
        if (batch_id + 1) % 100 == 0:
            model.eval().cpu()
            save_model_filename = f"models/epoch_{e+1}_checkpoint_{batch_id+1}.pth"
            torch.save(model.state_dict(), save_model_filename)
            model.to(device).train()

        
        pbar.update(1)
    pbar.close()
    print()


    # Save model
    model.eval().cpu()
    save_model_filename = f"models/epoch_{e+1}.pth"
    torch.save(model.state_dict(), save_model_filename)
    model.to(device).train()
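
The slice `gram_feat_i[:n_batch, :, :]` in the loop above matters when the dataset size is not a multiple of the batch size: the last batch is then smaller than the precomputed batch of Gram matrices. Here is a small sketch of the situation, assuming the targets were built by repeating one Gram matrix `batch_size` times (the shapes and the names `gram`, `style_target` and `shapes` are illustrative only):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

batch_size = 4
# 10 toy images: the loader yields batches of 4, 4 and 2 images.
dataset = TensorDataset(torch.rand(10, 3, 8, 8), torch.zeros(10))
loader = DataLoader(dataset, batch_size=batch_size)

# Hypothetical precomputed target: one (C x C) Gram matrix repeated batch_size times.
gram = torch.rand(1, 5, 5)
style_target = gram.repeat(batch_size, 1, 1)

shapes = []
for x, _ in loader:
    n_batch = len(x)
    # Slicing keeps the target aligned with the actual batch size.
    shapes.append(tuple(style_target[:n_batch, :, :].shape))
```

Passing `drop_last=True` to the `DataLoader` would be another way to avoid the mismatch, at the cost of ignoring the last few images of each epoch.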

### Visualization

We now have a trained model, specialized in a specific style: it has learnt to render any image we feed it in that style. We can also feed it a series of images (for example, the frames of an animation), and it will stylize them much faster than the first method.

Your model has been saved in `models/` and you can find pre-trained models in `saved_models/`.
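
These checkpoints are `state_dict` files written with `torch.save`. As a rough sketch of the save and reload round trip (the `TinyNet` class and the temporary path are hypothetical; `utils.load_model` presumably does something similar for `ImageTransformNet`, but its implementation is not shown here):

```python
import os
import tempfile

import torch


class TinyNet(torch.nn.Module):
    """Hypothetical stand-in for the image transformation network."""

    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 3, kernel_size=3, padding=1)

    def forward(self, x):
        return self.conv(x)


trained = TinyNet()
path = os.path.join(tempfile.mkdtemp(), "epoch_1.pth")
torch.save(trained.state_dict(), path)   # same call as in the training loop

# Rebuild the architecture, then load the saved weights into it.
restored = TinyNet()
restored.load_state_dict(torch.load(path, map_location="cpu"))
restored.eval()                          # inference mode for stylization

x = torch.rand(1, 3, 16, 16)
with torch.no_grad():
    same_output = torch.allclose(trained(x), restored(x))
```

Because only the weights are stored, the class definition must be available when the file is loaded.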

#### Static image

In [None]:
content_image_filename = "images/content-images/isae.jpg"
# model = "saved_models/udnie.pth"
# model = "saved_models/mosaic.pth"
model = "models/epoch_10.pth"

In [None]:
# Load and transform the image
content_image = Image.open(content_image_filename).convert('RGB')
content_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Lambda(lambda x: x.mul(255))
])
content_image = content_transform(content_image)
content_image = content_image.unsqueeze(0).to(device)

# Apply the style model
with torch.no_grad():
    style_model = utils.load_model(model).to(device)
    output = style_model(content_image).cpu()[0]

In [None]:
utils.show_image_johnson(output)
plt.gcf().set_size_inches(7, 7)

#### Animation

In [None]:
gif_filename = "images/gif/cat_typing.gif"

model = "saved_models/udnie.pth"
# model = "saved_models/mosaic.pth"
# model = "models/epoch_10.pth"

SHOW_FRAMES = True

In [None]:
# Load the original gif
gif = imageio.get_reader(gif_filename)
n_frames = len(gif)

# Load the style model
style_model = utils.load_model(model).to(device)

# Loop on each frame
list_images_style = []
list_images_original = []
for i_frame, frame in enumerate(tqdm(gif, leave=False)):
    content_image = utils.get_frame_gif(frame).to(device) 
    
    with torch.no_grad():
        output = style_model(content_image).cpu()[0]
    
    list_images_style.append(utils.get_image_johnson(output))
    list_images_original.append(frame)

    
    if SHOW_FRAMES:
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5))
        ax1.imshow(frame)
        utils.show_image_johnson(output, ax2)
        print(f"Frame {i_frame+1} / {n_frames}")
        plt.tight_layout()
        plt.show()

In [None]:
# anim = utils.make_gif_one_canvas(list_images_style)
anim = utils.make_gif_both(list_images_original, list_images_style)
HTML(anim.to_html5_video())

In [None]:
anim.save("both_anim.gif", writer=animation.PillowWriter(fps=20))

In [None]:
#@markdown Animation results
display(HTML(utils.colab_video("./results/udnie_anim.mp4")))
display(HTML(utils.colab_video("./results/mosaic_anim.mp4")))

Depending on the style model, the stylized animation can be more or less convincing. For instance, the `udnie.pth` model gives fairly good results, while the `mosaic.pth` model can produce a "flickering" animation. The first style relies more on colors, while the second is more about geometric patterns. Since the model stylizes each frame independently of the others, geometric shapes that appear and disappear from one frame to the next are much more visible and cause this lack of continuity. Some studies address this discontinuity in style transfer ([Ruder & al. (2016)](https://arxiv.org/pdf/1604.08610.pdf)).
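
One crude way to put a number on this flickering, offered here only as an illustrative sketch and not as the method of the study cited above, is to compare the mean frame-to-frame change of the stylized frames with that of the original frames. The function name `mean_frame_change` and the synthetic frames below are assumptions. In the notebook, `list_images_original` and `list_images_style` could be passed instead, provided they hold arrays of identical shape.

```python
import numpy as np


def mean_frame_change(frames):
    """Average absolute pixel difference between consecutive frames."""
    stack = np.stack([np.asarray(f, dtype=np.float64) for f in frames])
    return float(np.abs(np.diff(stack, axis=0)).mean())


rng = np.random.default_rng(0)
base = rng.random((4, 4, 3))

# A perfectly static clip versus the same clip with a new random pattern per frame.
static_clip = [base for _ in range(5)]
flickering_clip = [base + 0.2 * rng.random((4, 4, 3)) for _ in range(5)]

static_score = mean_frame_change(static_clip)
flicker_score = mean_frame_change(flickering_clip)
```

A stylized score that is much higher than the original one suggests that the model adds frame-to-frame variation of its own, although this measure cannot tell flicker apart from legitimate motion.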

---

<a class="anchor" id="section3"></a>
## Going further 

### Improvements

Neural style transfer is a fairly broad research field: later studies specialize in specific styles, focus on photorealistic outputs, or extend the process to other kinds of 2D or 3D images:

<br>
<center><img src="https://d3i71xaburhd42.cloudfront.net/b0760764dc573b519f76d5a79531d49af333c67a/3-Figure2-1.png" width="100%"><br>
Taxonomy of Neural Style transfer techniques [source: <a href=https://arxiv.org/pdf/1705.04058.pdf> Jing & al (2018)</a>]
</center>

From a technical point of view, while Gatys's model relies on convolutional networks, several later visual style transfer models rely on Generative Adversarial Networks (GANs). The main style transfer model using this approach is [CycleGAN](https://arxiv.org/pdf/1703.10593.pdf). It addresses the broader problem of "image-to-image translation", that is to say, transferring a concept from one image to another. For instance, it can turn a zebra into a horse, or a summer picture into a winter picture; style transfer is thus one particular application of this model.

### Style transfer in other artistic research fields

Style transfer is also tackled in other artistic fields, especially in Music Information Retrieval (MIR), an active research field that includes computer science applied to music. 


Firstly, with an audio approach (i.e. in which the source material is the sound's waveform), Google's Magenta team released the [DDSP](https://magenta.tensorflow.org/ddsp) (Differentiable Digital Signal Processing) model, which can perform timbre transfer. In other words, you can have one instrument replace another in an audio track.
From a technical point of view, the model is an autoencoder whose latent space is interpretable, coupled with standard signal processing methods. Both the encoder and the decoder are trained on a specific timbre and are thus specialized in a specific instrument. For instance, by combining a flute-encoder and a violin-decoder, you can transfer the timbre of the violin into a flute track (with some modifications in the actual paper...). 

In [None]:
#@markdown Demo DDSP: timbre transfer (taken from DDSP's website)
print("Original violin track")
display(Audio("https://storage.googleapis.com/ddsp/timbre_transfer/instruments/violin_violin.mp3"))
print("Transferred with flute timbre")
display(Audio("https://storage.googleapis.com/ddsp/timbre_transfer/instruments/violin_flute2.mp3"))

Apart from the audio approach, with a symbolic approach (i.e. in which the source material is the sheet music), some models have also been implemented to perform "musical style transfer", focusing on musical genre transfer. In other words, they manage to orchestrate or arrange a melody in the style of a specific composer. <br>
MIR studies often apply models from other domains to music (for instance, a lot of [NLP methods](https://sites.google.com/view/nlp4musa-2021) are used for musical studies). In the case of style transfer, the [CycleGAN](https://arxiv.org/pdf/1809.07575.pdf) model has been adapted to music and gives preliminary results that later GAN-based work on musical style transfer can build on. <br>
Symbolic style transfer is also tackled in the study [Groove2Groove](https://groove2groove.telecom-paris.fr/), which generates accompaniments of different styles. Technically, it relies on a content encoder and a style encoder, that feed a decoder based on recurrent networks.
As with the less convincing Van Gogh results, the results of Groove2Groove in the baroque style of Haendel are less satisfying. Indeed, the baroque style is usually characterized by a diverse musical texture and often includes more complex structures (like contrapuntal writing), unlike pop music, which is much more repetitive.


---

## Conclusion

To sum up, Neural Style transfer aims at merging the style of an image into a content image. This can be done by optimizing the combined image itself, or by training a model on a specific style. In both cases, the idea is to compute the loss as a sum of, on the one hand, a content loss that compares abstractions of images, and on the other hand, a style loss that compares the co-occurrences of features.
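
As a compact reminder of that loss, here is a minimal single-layer sketch. It assumes a mean squared error for both terms and a Gram matrix normalized by `C*H*W`, which is one common convention and not necessarily the exact one used by `loss_fn_style` in this notebook. The random feature maps stand in for VGG activations.

```python
import torch
import torch.nn.functional as F


def gram_matrix(features):
    """Gram matrix of a (B, C, H, W) feature map, normalized by C*H*W."""
    b, c, h, w = features.shape
    flat = features.reshape(b, c, h * w)
    return flat.bmm(flat.transpose(1, 2)) / (c * h * w)


def total_loss(feat_out, feat_content, feat_style, content_weight=1.0, style_weight=1.0):
    """Content loss on raw features plus style loss on Gram matrices."""
    content_loss = F.mse_loss(feat_out, feat_content)
    style_loss = F.mse_loss(gram_matrix(feat_out), gram_matrix(feat_style))
    return content_weight * content_loss + style_weight * style_loss


torch.manual_seed(0)
feat = torch.rand(2, 4, 6, 6)        # hypothetical feature maps from one layer
shuffled = feat.flip(dims=(2, 3))    # same activations, different spatial layout

gram = gram_matrix(feat)
style_only = total_loss(shuffled, feat, feat, content_weight=0.0)
content_only = total_loss(shuffled, feat, feat, style_weight=0.0)
```

Flipping the feature maps spatially leaves the Gram matrix unchanged, so the style term stays at (numerically) zero while the content term does not: the style loss captures which features co-occur, not where they are located.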

Neural style transfer goes beyond the simple examples seen in this notebook, by extending these techniques to other types of images, or by carrying them over into other artistic research fields.

Thanks for reading!

## Sources

- [1] Gatys, L. A., Ecker, A. S., & Bethge, M. (2015). A neural algorithm of artistic style. _arXiv preprint arXiv:1508.06576._ [Link](https://arxiv.org/pdf/1508.06576.pdf) 
- [2] Gatys, L. A., Ecker, A. S., & Bethge, M. (2016). Image style transfer using convolutional neural networks. In _Proceedings of the IEEE conference on computer vision and pattern recognition_ (pp. 2414-2423). [Link](https://openaccess.thecvf.com/content_cvpr_2016/papers/Gatys_Image_Style_Transfer_CVPR_2016_paper.pdf)
- [3] Rhodes, N. (2021). CS 152 Day 15: Neural Networks/Deep Learning — Spring, 2021, _Harvey Mudd College, Computer Science_. [Link](https://www.youtube.com/playlist?list=PL2Yggtk_pK6-v87LproSNu1Mzz1oPd97X)
- [4] Johnson, J., Alahi, A., & Fei-Fei, L. (2016, October). Perceptual losses for real-time style transfer and super-resolution. In _European conference on computer vision_ (pp. 694-711). Springer, Cham. [Link](https://arxiv.org/abs/1603.08155). Supplementary material: [Link](https://cs.stanford.edu/people/jcjohns/papers/fast-style/fast-style-supp.pdf)

Adapted from these repositories :
- Gatys & al. : [Pytorch Neural Style Transfer](https://github.com/leongatys/PytorchNeuralStyleTransfer)
- Johnson & al. : [Pytorch Fast Neural Style](https://github.com/pytorch/examples/tree/master/fast_neural_style)