# Neural Style Transfer

Resources:
* [Original paper](https://arxiv.org/abs/1508.06576)
* [Official PyTorch Tutorial](https://pytorch.org/tutorials/advanced/neural_style_tutorial.html)
* [Blog by Amar](https://towardsdatascience.com/implementing-neural-style-transfer-using-pytorch-fd8d43fb7bfa)

In [None]:
from typing import List

import torch
import torch.nn as nn
import torchvision.transforms as transforms
import torchvision.models as models
from torch import Tensor

## Model

Here we describe the VGG19 model.

Some image of VGG19.

Then its implementation in PyTorch and how I am slicing it.

In [None]:
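class VGG19(nn.Module):
    def __init__(self):
        super(VGG19, self).__init__()
        # Newer torchvision releases (0.13+) prefer `weights=models.VGG19_Weights.DEFAULT` over `pretrained=True`
        self.model = models.vgg19(pretrained=True)

    def forward(self, x):
        # TODO: run `x` through the sliced network and return the selected feature maps
        pass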
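
One way the slicing could look is sketched below. This is an illustration under assumptions, not the notebook's final implementation: `FeatureExtractor`, `capture_indices`, and the toy network are hypothetical names introduced here. The toy network stands in for VGG19 so the sketch runs without downloading weights.

```python
from typing import Iterable, List

import torch
import torch.nn as nn


class FeatureExtractor(nn.Module):
    """Runs a sequential network and collects the outputs of selected layers."""

    def __init__(self, layers: nn.Sequential, capture_indices: Iterable[int]):
        super().__init__()
        self.capture_indices = sorted(set(capture_indices))
        last = self.capture_indices[-1]
        # Keep only the layers up to the last captured one; later layers are never needed.
        # In-place ReLUs are swapped for out-of-place ones so captured conv outputs are not overwritten.
        kept = [
            nn.ReLU(inplace=False) if isinstance(layer, nn.ReLU) else layer
            for layer in list(layers.children())[: last + 1]
        ]
        self.layers = nn.Sequential(*kept)
        for param in self.layers.parameters():
            param.requires_grad_(False)  # the network stays fixed; only the image is optimized

    def forward(self, x: torch.Tensor) -> List[torch.Tensor]:
        outputs = []
        for index, layer in enumerate(self.layers):
            x = layer(x)
            if index in self.capture_indices:
                outputs.append(x)
        return outputs


# Tiny stand-in network (hypothetical), used instead of VGG19 for a quick check
toy = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(2),
    nn.Conv2d(8, 16, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
)
extractor = FeatureExtractor(toy, capture_indices=[0, 3])
features = extractor(torch.rand(1, 3, 32, 32))
```

With torchvision, the same class could presumably be given `models.vgg19(...).features`. In that module, indices 0, 5, 10, 19 and 28 are `conv1_1` to `conv5_1` and index 21 is `conv4_2`, the layers the paper uses for style and content respectively.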


## Loss

### Total Loss

$$
\mathcal{L}_{total}(\vec{p}, \vec{a}, \vec{x}) = \alpha\mathcal{L}_{content}(\vec{p}, \vec{x}) + \beta\mathcal{L}_{style}(\vec{a}, \vec{x}) \tag{1}
$$
$ \vec{p} $ - content image  
$ \vec{a} $ - style image   
$ \vec{x} $ - generated image   
$ \alpha $ - content coefficient   
$ \beta $ - style coefficient

> The generated image $\vec{x}$ can be initialized either as a copy of the content image or as white noise (random values).
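
The two initialization options can be sketched as follows. The helper name `init_generated_image` and its `from_noise` flag are hypothetical, and uniform noise in `[0, 1)` is only one possible reading of "white noise".

```python
import torch


def init_generated_image(content_image: torch.Tensor, from_noise: bool = False) -> torch.Tensor:
    """Returns a leaf tensor with gradients enabled, so an optimizer can update its pixels."""
    if from_noise:
        # Assumption: uniform noise in [0, 1), matching an image scaled to that range
        generated = torch.rand_like(content_image)
    else:
        # Copy, so that optimizing the generated image never modifies the content image
        generated = content_image.detach().clone()
    return generated.requires_grad_(True)


content = torch.rand(1, 3, 8, 8)  # stand-in for a loaded content image
from_content = init_generated_image(content)
from_noise = init_generated_image(content, from_noise=True)
```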

In [None]:
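class TotalLoss(nn.Module):
    def __init__(self, content_features: Tensor, style_features: List[Tensor], alpha: float = 1., beta: float = 1000.):
        super(TotalLoss, self).__init__()

        self.alpha = alpha
        self.beta = beta

        # ContentLoss and StyleLoss are defined in the cells below
        self.content_loss = ContentLoss(content_features)
        self.style_loss = StyleLoss(style_features)

    def forward(self, input_content_features: Tensor, input_style_features: List[Tensor]) -> Tensor:
        # Formula (1): weighted sum of the content and style terms for the generated image
        content_loss = self.content_loss(input_content_features)
        style_loss = self.style_loss(input_style_features)
        return self.alpha * content_loss + self.beta * style_loss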

### Content Loss

$$
\mathcal{L}_{content}(\vec{p}, \vec{x}) = \dfrac{1}{2} \sum_{i,j}(F_{i,j}^{l} - P_{i,j}^{l})^2 \tag{2}
$$

Content loss is half the sum of squared errors between: 
  
$ F_{i,j}^{l} $ - Output of conv layer **$l$** for input (generated) image  
$ P_{i,j}^{l} $ - Output of conv layer **$l$** for content image  

> $ i, j $ index the activation of the **i-th** filter at position **j**. Implementation-wise this changes nothing, because we sum over the whole output of the conv layer.

There's a small difference in notation when it comes to implementation. In the paper, $F^{l}$ is defined as $F^{l} \in \mathbb{R}^{N_l \times M_l}$, i.e. a matrix with $N_l$ rows (the number of feature maps) and $M_l$ columns ($height \times width$ of a feature map). Here we instead have $F^{l} \in \mathbb{R}^{N_l \times H_l \times W_l}$, which is just a reshaped version of the matrix in the paper. Because the loss sums over all elements, both layouts give the same value.
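
A quick check of the reshaping claim, as a minimal sketch on random tensors (the sizes and the helper name `content_loss_value` are arbitrary choices made for this illustration):

```python
import torch

torch.manual_seed(0)
N, H, W = 3, 4, 5  # N feature maps of size H x W

generated = torch.randn(N, H, W)  # plays the role of F^l
content = torch.randn(N, H, W)    # plays the role of P^l


def content_loss_value(f: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
    # Formula (2): half the sum of squared differences over all elements
    return 0.5 * ((f - p) ** 2).sum()


loss_3d = content_loss_value(generated, content)
# Same data viewed as the N x M matrix from the paper, with M = H * W
loss_2d = content_loss_value(generated.reshape(N, H * W), content.reshape(N, H * W))
```

Both values agree, since the reshape only changes how the same elements are indexed.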

In [None]:
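class ContentLoss(nn.Module):
    def __init__(self, content_features: Tensor):
        super(ContentLoss, self).__init__()

        # `reduction` is a constructor argument of nn.MSELoss, not an argument of its forward call
        self.mse = nn.MSELoss(reduction='sum')
        # The target is a constant: detach it so no gradient flows into the content image
        self.content_features = content_features.detach()

    def forward(self, input_features: Tensor) -> Tensor:
        return self.mse(input_features, self.content_features) / 2  # Matches formula (2)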

### Style Loss

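Instead of comparing raw conv layer outputs as the content loss does, the style loss first computes a Gram matrix. The paper describes the Gram matrix as capturing the feature correlations between the feature maps of a layer: entry $G_{i,j}^l$ is the inner product between the vectorised feature maps $i$ and $j$ (I will have to look more into what that means in practice).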
$$
G_{i,j}^l = \sum_{k}(F_{i,k}^lF_{j,k}^l) \tag{3}
$$
where:

$F_{i,k}^l$ is again the output of conv layer **$l$**, defined as $F^{l} \in \mathbb{R}^{N_l \times M_l}$, where $(N_l = channels, M_l = height \times width)$  
$F_{j,k}^l$ is an element of row $j$ of the same matrix, so summing over $k$ is the same as multiplying by the transpose

This essentially means $G^l$ is computed as $F^l (F^l)^T$ and $G^l \in \mathbb{R}^{N_l \times N_l}$
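
As a small worked example of formula (3), the sketch below computes the Gram matrix of a hand-written $3 \times 2$ feature matrix twice: once with the explicit sum over $k$, and once as a matrix product. The numbers are made up for illustration.

```python
import torch

# N = 3 feature maps, each flattened to M = 2 positions
feats = torch.tensor([[1., 2.],
                      [3., 4.],
                      [0., 1.]])
N, M = feats.shape

# Formula (3) written out: G[i, j] = sum_k F[i, k] * F[j, k]
G_loop = torch.zeros(N, N)
for i in range(N):
    for j in range(N):
        for k in range(M):
            G_loop[i, j] += feats[i, k] * feats[j, k]

# The same thing as a single matrix product F F^T
G_matmul = feats @ feats.t()

print(G_matmul)
# tensor([[ 5., 11.,  2.],
#         [11., 25.,  4.],
#         [ 2.,  4.,  1.]])
```

The result is symmetric, and its diagonal holds the squared norm of each flattened feature map.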

> Because $G^l$ has shape $N_l \times N_l$, the size of the Gram matrix varies between conv layers with different numbers of feature maps, and its values grow with the size of the feature maps. The official PyTorch tutorial handles this by normalizing each Gram matrix, dividing it by the number of elements in the feature maps. I haven't seen this exact step in the paper (which scales the per-layer loss instead), but I'll test both **with** and **without** normalization.
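
To see why some scaling can help, the sketch below compares Gram matrices of constant feature maps at two spatial sizes. `gram` is a hypothetical helper written only for this illustration, and dividing by `C * H * W` is an assumed normalization choice.

```python
import torch


def gram(feature_maps: torch.Tensor, normalize: bool = False) -> torch.Tensor:
    # Hypothetical helper for this illustration; expects a batch of one, shape (1, C, H, W)
    _, c, h, w = feature_maps.shape
    flat = feature_maps.reshape(c, h * w)
    g = flat @ flat.t()
    if normalize:
        g = g / (c * h * w)  # assumption: scale by the number of feature-map elements
    return g


small = torch.ones(1, 2, 8, 8)    # 2 feature maps of 8 x 8
large = torch.ones(1, 2, 16, 16)  # same content at 16 x 16

raw_small, raw_large = gram(small), gram(large)
norm_small, norm_large = gram(small, normalize=True), gram(large, normalize=True)

print(raw_small[0, 0].item(), raw_large[0, 0].item())    # 64.0 256.0
print(norm_small[0, 0].item(), norm_large[0, 0].item())  # 0.5 0.5
```

Without scaling, the entries grow with the number of spatial positions. With it, the two layers end up on a comparable scale.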

In [None]:
def gram_matrix(feature_maps: Tensor, normalize: bool = False) -> Tensor:
    B, C, H, W = feature_maps.size()

    assert B == 1, f"Batch size must be 1! Got B={B}"

    # Flatten each feature map into a row: F has shape (N_l, M_l) = (C, H * W)
    features = feature_maps.reshape(C, H * W)
    g = torch.mm(features, features.t())  # Formula (3): G = F F^T, shape (C, C)

    if normalize:
        # Same scaling as the official PyTorch tutorial: divide by the number of feature-map elements
        g = g / (B * C * H * W)

    return g

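With the *Gram matrix* defined, the paper computes the **loss per layer** as a scaled mean-squared error between $G^l$ (Gram matrix of the generated image) and $A^l$ (Gram matrix of the style image):
$$
E_l = \dfrac{1}{4N_l^2M_l^2}\sum_{i,j}(G_{i,j}^l-A_{i,j}^l)^2\tag{4}
$$
> How is this a mean? Averaging over the $N_l^2$ entries of $G^l$ accounts for the $1/N_l^2$ factor. The remaining $1/(4M_l^2)$ is an extra constant scaling, so $E_l$ is a scaled mean-squared error rather than a plain one.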
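
One way to check how formula (4) relates to an ordinary mean-squared error is to compute both on small random matrices. This is a minimal sketch with arbitrary sizes, and the variable names are chosen for this illustration only.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, M = 4, 6  # N feature maps, M = height * width positions

generated_feats = torch.randn(N, M)
style_feats = torch.randn(N, M)

G = generated_feats @ generated_feats.t()  # Gram matrix of the generated image
A = style_feats @ style_feats.t()          # Gram matrix of the style image

# Formula (4) written out directly
e_l_paper = ((G - A) ** 2).sum() / (4 * N ** 2 * M ** 2)

# Plain MSE averages over the N * N entries of the Gram matrix...
mse = F.mse_loss(G, A)
# ...so only the constant 1 / (4 * M^2) is missing
e_l_from_mse = mse / (4 * M ** 2)
```

The two values match, which suggests that using plain MSE on Gram matrices differs from formula (4) only by a per-layer constant.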

In [None]:
class StyleLoss(nn.Module):
    def __init__(self, style_features: List[Tensor]):
        super(StyleLoss, self).__init__()

        self.mse = nn.MSELoss()
        self.style_features = style_features
        # Targets are constants: detach them so no gradient flows into the style image
        self.style_gram_matrix = [gram_matrix(style_feature).detach() for style_feature in self.style_features]

    @property
    def num_layers(self):
        return len(self.style_features)  # Used as a constant per-layer weight (1 / num_layers) in forward

    def forward(self, input_features: List[Tensor]) -> Tensor:
        assert len(input_features) == len(self.style_gram_matrix), \
            f"Mismatched lengths of features! {len(input_features)} != {len(self.style_gram_matrix)}"

        inputs_gram_matrix = [gram_matrix(input_feature) for input_feature in input_features]

        style_loss = 0
        for style_gram, input_gram in zip(self.style_gram_matrix, inputs_gram_matrix):
            # input_gram = G, style_gram = A; the mean over N_l^2 entries equals E_l in formula (4)
            # up to the constant factor 1 / (4 * M_l^2)
            e_l = self.mse(input_gram, style_gram)
            style_loss += e_l

        return style_loss / self.num_layers