This notebook demonstrates an example of neural style transfer in Pytorch. Neural style transfer is where the style of the pixels that make up one image is transferred to another image. The content of the second image is preserved, it is just rendered in a style that mimics the style of the first image. 

Unlike in an image classification or recognition task, two loss metrics are used instead of just one. A content loss will be determined as well as a style loss, and the two loss metrics will be combined to deliver a representation between the total loss for the content and the total loss for the style.

To start out with, we want to import all the libraries and models we need. 

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.optim as optim
from torchvision import transforms, models
from PIL import Image

We're going to define a function to load in the images and transform them into tensors that the model can work with. Before we do that, we're going to need to specify the size of the image we want to work with. 

After we define these variables, we'll create the list of transforms we want to use. We're going to create a function to load in the image using our transforms. We'll use the transforms we specified earlier as well as the Image function from PIL to import the images and turn them into tensors. 

We also need to normalize the channels within the model and send the normalized data to the device. 

In [2]:
# function loads image and transforms to PyTorch Tensor, also normalizes
# specify a max size and an optional shape
# normalization numbers come from recommended numbers used on the ImageNet dataset

def image_load(img_path, max_size=800, shape=None):

    # open the image and convert it to RGB
    image = Image.open(img_path).convert('RGB')

    # if the total image size is greater than our specified size,
    # recast to chosen max size
    if max(image.size) > max_size:
        size = max_size
    else:
        size = max(image.size)


    if shape is not None:
        size = shape


    im_transforms = transforms.Compose([transforms.Resize((size, int(1.5* size))),
                                        transforms.ToTensor(),
                                        transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))
                                        ])

    # unsqueeze the image after doing transforms
    # unsqueeze takes input, dimensions and output tensor(optional)
    # this discards the transparent alpha channel (:3) and adds batch dimension
    image = im_transforms(image)[:3, :, :].unsqueeze(0)

    return image


Let's be sure that the images are loading in as we expect.

In [3]:
# use function to load image
style_image = image_load("style1.jpg")

# let's print the shape to make sure it is what we expect
# should be [1, 3, x, x] - batch dim, color channels, h x w
print(style_image.shape)

torch.Size([1, 3, 800, 1200])


We'll also need to have a function that converts the tensors output by our model to images which can be displayed.

In [4]:
# function to convert the tensors back to images
# so that the image can be displayed

def image_convert(tensor):
    # clone the image
    image = tensor.to("cpu").clone().detach()
    image = image.numpy().squeeze()
    # convert from a tensor into a numpy array
    image = image.transpose(1, 2, 0)
    # undo normalization
    image = image * np.array((0.229, 0.224, 0.225)) + np.array(
        (0.485, 0.456, 0.406))
    # clip the values to between 0 and 1
    image = image.clip(0, 1)
    return image

We now need to create functions that will calculate the loss for the content and style. We can calculate the content loss by getting the features of the content image. We'll have a function select the layers which handle the processing of the image features, the convolutional layers, and return the features. This same process can be applied to getting the Style Loss as well, although one extra function is needed to fully compute the style loss.

In [5]:
def select_features(image, model, layers=None):
    # if no layers are specified, use these layers
    if layers is None:
        layers = {'0': 'conv1_1', '5': 'conv2_1',
                  '10': 'conv3_1',
                  '19': 'conv4_1',
                  '21': 'conv4_2',  ## content layer
                  '28': 'conv5_1'}

    # Dict to store the features
    features = {}

    x = image

    # store the feature map responses if the name of the layer matches
    # one of the keys in a given predefined layer dict.

    for name, layer in enumerate(model.features):
        # set x to the layer the image is passing through
        x = layer(x)
        # if the name of the layer is in the chosen layers, get the features from that layer
        if str(name) in layers:
            features[layers[str(name)]] = x

    return features

We'll also need to set up another function to finish getting the style loss. Computing the style loss is easily done by constructing a gram matrix. A gram matrix results from multiplying a matrix by its transpose. Given a matrix, it will contain feature maps FXL of layer L. FXL will be reshaped into F^XL, which is a KxN matrix where K is the number of feature maps at layer L and N is the length of any vectorized feature map FkXL.

The gram matrix needs to be normalized. Normalizing the gram matrix can be done by dividing every element in the matrix by the total number of elements. If this is not done, large dimension N values can negatively impact gradient descent.

In [6]:
def gram_matrix(tensor):
    # get the number of filters, height and width (channel doesn't matter)
    _, num_filters, h, w = tensor.size()
    tensor = tensor.view(num_filters, h * w)
    # matrix multiplication against the transpose to get the gram matrix
    gram = torch.mm(tensor, tensor.t())

    return gram

We now need to select a model to carry out the transfer with. Our convolutional neural network in this case will be VGG19. The VGG19 model will take in predfined weights, which we can load in from PyTorch's website. We won't be computing any gradients for the predefined model, so we need to be sure that `requires_grad` is set to false.

In [7]:
# load in the VGG model and set requires grad to False

torch.utils.model_zoo.load_url('https://download.pytorch.org/models/vgg19-dcbb9e9d.pth', model_dir='/home/daniel/Downloads/saved_weights/')

vgg_model = models.vgg19()
vgg_model.load_state_dict(torch.load('/home/daniel/Downloads/saved_weights/vgg19-dcbb9e9d.pth'))

for param in vgg_model.parameters():
    param.requires_grad_(False)

While this isn't necessary, we can replace the Max Pooling layers with Average Pooling layers, as the Average Pooling layers tend to perform a little better for style transfer tasks.

In [8]:
# replace max pool with average pooling layers, as they seem to perform better
for i, layer in enumerate(vgg_model.features):
    # if it matches a max pooling layer, replace
    if isinstance(layer, torch.nn.MaxPool2d):
        vgg_model.features[i] = torch.nn.AvgPool2d(kernel_size=2, stride=2, padding=0)

Now we can declare the device we are using, the CUDA if it is available, and send the model to the device in evaluation mode.

In [9]:
# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
vgg_model.to(device).eval()

VGG(
  (features): Sequential(
    (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU(inplace)
    (2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (3): ReLU(inplace)
    (4): AvgPool2d(kernel_size=2, stride=2, padding=0)
    (5): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (6): ReLU(inplace)
    (7): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (8): ReLU(inplace)
    (9): AvgPool2d(kernel_size=2, stride=2, padding=0)
    (10): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU(inplace)
    (12): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (13): ReLU(inplace)
    (14): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (15): ReLU(inplace)
    (16): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (17): ReLU(inplace)
    (18): AvgPool2d(kernel_size=2, stride=2, padding=0)
 

Now we need to load in the content and style images and send them to the device as well. 

In [10]:
# send both images to the device
content = image_load("content1.jpg").to(device)
style = style_image.to(device)

# extract the feature maps from both of the images
features_content = select_features(content, vgg_model)
features_style = select_features(style, vgg_model)

Using the feature extraction function we defined earlier, we'll used the pretrained model to get the features for both the content and style images. We'll then get the gram matrix for the style features.

In [11]:
# compute the gram matrices for the style layers
style_grams = {layer: gram_matrix(features_style[layer]) for layer in features_style}

We now have to create a third image that will be transformed into our target image tensor. We can either create a random image or copy the content image.

In [12]:
# Create an image to transform, make random image
target_image = torch.randn_like(content).requires_grad_(True).to(device)

We've selected the features for the style and content images, but since the network didn't compute weights we'll have to specify these ourselves. We need the weights to finally retrieve the loss for the content and style images. 

We'll select mutiple convolutional layers and give them weights, this is because the different layers define different portions of the style, or contribute in different ways to its representation. Since there are different style layers, we can define the style weights individually. We'll use a multiplicative weight scheme for the different layers
meaning we can edit these values and tune the style artifacts to our liking.

In [13]:
# define the style weights individually
style_image_weights = {'conv1_1': 0.75,
                 'conv2_1': 0.5,
                 'conv3_1': 0.2,
                 'conv4_1': 0.2,
                 'conv5_1': 0.2}

We also need something that tracks the weight of the individual loss terms for content and style.

In [14]:
# track individual loss terms for content and style
content_weight = 1e4
style_weight = 1e2

We're almost ready to start the training process. In terms of handling the loss for the content, it is just MSE Loss between the feature map responses of both the content image and target image. Meanwhile, the style loss is similar but the feature maps are replaced by the divided gram matricies while the MSE loss is divided by the total number of elements in the respective feature map. Before we create the training loop, we'll choose an optimizer to use. The Adam optimizer should work fine in this instance.


In [15]:
# declare the optimizer we're going to use

optimizer = optim.Adam([target_image], lr=0.01)

After we decide how many iterations we want to run the training cycle for, we'll crate the transformation loop. The loop computes the losses for style and content, multiplies the losses by the weights, and then sums them together to get the total loss.

After the total loss is calculated, backpropogation is done and the valuee of the pixels are updated until the iterations are done.

In [16]:
for i in range(1, 6000 + 1):

    optimizer.zero_grad()

    # get the features from the target image and model
    target_features = select_features(target_image, vgg_model)

    # compute contnet loss
    content_loss = torch.mean((target_features['conv4_2'] - features_content['conv4_2']) ** 2)

    # set inital style loss to zero
    style_loss = 0

    # get the individual target features
    for layer in style_image_weights:
        target_feature = target_features[layer]
        # compute the gram matrix
        target_gram = gram_matrix(target_feature)
        # find the shape of the target feature
        _, d, h, w = target_feature.shape
        # get the style gram of the current layer
        style_gram = style_grams[layer]
        # compute the loss for the current style layer
        layer_style_loss = style_image_weights[layer] * torch.mean((target_gram - style_gram) ** 2)
        # update the total style loss
        style_loss += layer_style_loss / (d * h * w)

    content_loss = content_weight * content_loss
    style_loss = style_weight * style_loss
    total_loss = content_loss + style_loss
    # do backprop and optimize
    total_loss.backward(retain_graph=True)
    optimizer.step()

    # every 50 iterations, print out statistics
    if i % 50 == 0:
        total_loss_rounded = round(total_loss.item(), 2)
        # proportion of loss belonging to content
        content_fraction = round(content_weight*content_loss.item()/total_loss.item(), 2)
        # proportion of loss belonging to style
        style_fraction = round(content_weight*content_loss.item()/total_loss.item(), 2)
        # Print the current iteration and both the content and style loss
        print('Current Iteration: {}, Total loss: {} - (content: {}, style {})'.format(i, total_loss_rounded, content_fraction, style_fraction))

Current Iteration: 50, Total loss: 40195.97 - (content: 1400.77, style 1400.77)


KeyboardInterrupt: 

Now that the training is finished, the final image can be saved to a variable.

In [None]:
# now we can carry out our training and create the final image
created_img = image_convert(target_image)

Now let's visualize the output.

In [None]:
# now let's visualize the image
fig = plt.figure()
plt.imshow(created_img)
plt.axis('off')
plt.savefig('shinkawa-lastofus.png')