In [12]:
import os
import PIL
import cv2
import torch
import torch.nn as nn
from torchvision.utils import save_image
from torchvision.models import vgg19
from torchvision import transforms
from PIL import Image

# Neural Style Transfer Report

Welcome to the Neural Style Transfer Report! This Jupyter notebook serves as a comprehensive guide to understanding the implementation of the full algorithm of neural style transfer, including video style transfer. The primary objective of this report is to provide a detailed explanation of each block of the source code, highlighting its functionality and its role in contributing to the overall neural style transfer process.

Neural style transfer is a fascinating technique that combines the content of one image or video with the style of another image to create visually stunning and artistically transformed results. The algorithm of neural style transfer leverages the concept of transfer learning, utilizing a pre-trained deep neural network called VGG19. This model serves as a powerful feature extractor, capturing the content and style information from the input images or frames.

Throughout this report, we will delve into the code blocks, discussing their purpose and functionality in depth. We will explore how the algorithm utilizes the VGG19 model and extracts features from the content and style image by examining the output of specific layers within the network. By the end, you will have a solid understanding of how each block of the source code contributes to the entire neural style transfer pipeline for both images and videos.

_NOTE_: This report is intended to be a companion to the source code. It is not meant to be a standalone document. The source code is well commented and should be read in conjunction with this report. Additionally, this report focuses heavily on the technical details of the implementation. If you wish to utilize the neural style transfer tools provided by this repository, please refer to the [README](README.md) for instructions on how to run the programs.

## Notebook Structure

This notebook is divided into several sections to provide a comprehensive report on the project from different aspects. The following is an overview of the notebook's structure:

1. Data Preparation: This section discusses how the content and style images/videos are preprocessed before being fed into the model.

2. Model Architecture: This section demonstrates how the model is built on top of the pre-trained VGG19 model.

3. Model Training: This section explains how the loss function is computed, and how the model is trained to minimize the loss.

4. Experiments and Results: This section illustrates how the final outcomes of the model are saved and how they are evaluated.

5. Conclusion and Future Work: This section summarizes the main takeaways and limitations of the project and discusses potential future work.

_NOTE_: In each section, we first introduce how the algorithm is implemented for image style transfer, and then discuss how it is extended to video style transfer, if applicable.

## Data Preparation

In the neural style transfer process, data collection is not a mandatory step. The essential elements to begin with are a content image and a style image that you intend to merge to produce a style-transferred image. However, prior to feeding the images into the model, some preprocessing is necessary. This preprocessing ensures that (1) the images are in the appropriate tensor format, and (2) they possess uniform dimensions, enabling efficient calculation of the loss function during the subsequent training phase.

Once we obtain the paths to the content image and the style image, we may call the following `load_image()` function from `src/process_image.py` to convert an image into a tensor.


In [1]:
def load_image(image_path, device, output_size=None):
    """Loads an image by transforming it into a tensor."""
    img = Image.open(image_path)

    # Resolve the target size as (height, width); PIL reports size as (width, height).
    if output_size is None:
        output_dim = (img.size[1], img.size[0])
    elif isinstance(output_size, int):
        output_dim = (output_size, output_size)
    elif (
        isinstance(output_size, tuple)
        and len(output_size) == 2
        and all(isinstance(dim, int) for dim in output_size)
    ):
        output_dim = output_size
    else:
        # Also reached for malformed tuples, which previously left output_dim as None.
        raise ValueError("ERROR: output_size must be an integer or a 2-tuple of (height, width) if provided.")

    torch_loader = transforms.Compose(
        [
            transforms.Resize(output_dim),
            transforms.ToTensor()
        ]
    )
    
    img_tensor = torch_loader(img).unsqueeze(0)
    return img_tensor.to(device)

This function takes the path to an image and returns the image as a tensor allocated to the specified device. It also performs the necessary preprocessing: it resizes the image to the user-defined output size (or keeps the original size if none is given), scales the pixel values to the range of 0 to 1 via `transforms.ToTensor()`, and adds a leading batch dimension with `unsqueeze(0)`. We call this function in `image_style_transfer.py` to load the content image first, and then load the style image at the spatial size of the content tensor so that both tensors have identical dimensions, as follows.

In [3]:
def image_style_transfer(config):
    """Implements neural style transfer on a content image using a style image, applying provided configuration."""
    ...
    
    # load content and style images
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    output_size = config.get('output_image_size')
    if output_size is not None:
        if len(output_size) > 1: 
            output_size = tuple(output_size)
        else:
            output_size = output_size[0]

    content_tensor = load_image(content_path, device, output_size=output_size)
    output_size = (content_tensor.shape[2], content_tensor.shape[3])
    style_tensor = load_image(style_path, device, output_size=output_size)
    
    ...
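
The `output_image_size` handling above can be condensed into one small helper. The following is a minimal sketch and not part of the repository: the name `resolve_output_dim` is hypothetical, and it assumes the configuration value is either `None` or a list of one or two integers, as the code above suggests. It also makes explicit that PIL reports `size` as `(width, height)`, whereas `transforms.Resize` expects `(height, width)`.

```python
def resolve_output_dim(image_size, output_size=None):
    """Return the (height, width) an image should be resized to.

    image_size  -- (width, height), the order used by PIL's Image.size
    output_size -- None, or a list/tuple of one or two ints
                   (assumed shape of the 'output_image_size' config value)
    """
    if output_size is None:
        width, height = image_size
        return (height, width)  # keep the original size, in (H, W) order

    values = list(output_size)
    valid = len(values) in (1, 2) and all(
        isinstance(v, int) and not isinstance(v, bool) for v in values
    )
    if not valid:
        raise ValueError("output_size must hold one or two integers.")

    if len(values) == 1:
        return (values[0], values[0])  # a single value means a square output
    return (values[0], values[1])      # interpreted as (height, width)


print(resolve_output_dim((640, 480)))              # (480, 640)
print(resolve_output_dim((640, 480), [512]))       # (512, 512)
print(resolve_output_dim((640, 480), [300, 400]))  # (300, 400)
```

Keeping this logic in one place would let the image and video entry points share the same validation.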

In the case of video style transfer, we perform the same preprocessing step on the style image. However, for the content video, we first use the following code block from `video_style_transfer.py` to extract the frames from the video and save each frame as an image in a directory called `content_frames`. This directory is created inside the output directory.

In [2]:
def video_style_transfer(config):
    """Implements neural style transfer on a video using a style image, applying provided configuration."""
    ...

    # open the content video
    cap = cv2.VideoCapture(content_video_path)
    # retrieve metadata from content video
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    content_fps = cap.get(cv2.CAP_PROP_FPS)

    if total_frames == 0:
        print(f"ERROR: could not retrieve frames from content video at path: '{content_video_path}'.")
        cap.release()
        return

    # extract frames from content video and save each one as a JPEG image
    for i in range(total_frames):
        frame_path = os.path.join(output_dir, "content_frames", f"frame-{i+1:08d}.jpg")
        success, img = cap.read()
        if success:
            cv2.imwrite(frame_path, img)
        else:
            print(f"ERROR: {frame_path} failed to be extracted.")
            cap.release()    # release the capture before bailing out
            return

    cap.release()
    
    ...
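
The frame files are named with an eight-digit, zero-padded index so that alphabetical order matches playback order. The short sketch below illustrates this with the standard library only. `frame_path` is a hypothetical helper, not a function from the repository, and the directory names simply mirror those used above.

```python
import os


def frame_path(output_dir, index, subdir="content_frames", prefix="frame"):
    """Build the path of the frame with the given zero-based index (hypothetical helper)."""
    return os.path.join(output_dir, subdir, f"{prefix}-{index + 1:08d}.jpg")


names = [os.path.basename(frame_path("out", i)) for i in (0, 9, 99)]
print(names)  # ['frame-00000001.jpg', 'frame-00000010.jpg', 'frame-00000100.jpg']

# With padding, sorting by name reproduces the numeric frame order.
padded_in_order = sorted(names) == names
print(padded_in_order)  # True

# Without padding, "frame-10" sorts before "frame-2".
unpadded = [f"frame-{n}.jpg" for n in (1, 2, 10)]
print(sorted(unpadded))  # ['frame-1.jpg', 'frame-10.jpg', 'frame-2.jpg']
```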

Then, for each frame of the content video, we call the `load_image()` function to convert the frame into a tensor. The following excerpts of `_image_style_transfer()` and `video_style_transfer()` from `video_style_transfer.py` show how we iterate through the frames of the content video and convert each frame into a tensor.

In [4]:
def _image_style_transfer(content_frame_path, style_path, output_frame_path, output_size):
    try:
        content_img = Image.open(content_frame_path)
    except FileNotFoundError:
        print(f"ERROR: could not find such file: '{content_frame_path}'.")
        return
    except PIL.UnidentifiedImageError:
        print(f"ERROR: could not identify image file: '{content_frame_path}'.")
        return

    try:
        style_img = Image.open(style_path)
    except FileNotFoundError:
        print(f"ERROR: could not find such file: '{style_path}'.")
        return
    except PIL.UnidentifiedImageError:
        print(f"ERROR: could not identify image file: '{style_path}'.")
        return

    # load content and style images
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    content_tensor = load_image(content_frame_path, device, output_size=output_size)
    output_size = (content_tensor.shape[2], content_tensor.shape[3])
    style_tensor = load_image(style_path, device, output_size=output_size)
    
    ...


def video_style_transfer(config):
    """Implements neural style transfer on a video using a style image, applying provided configuration."""
    ...
    
    for i in range(total_frames):
        content_frame_path = os.path.join(output_dir, "content_frames", f"frame-{i+1:08d}.jpg")
        output_frame_path = os.path.join(output_dir, "transferred_frames", f"transferred_frame-{i+1:08d}.jpg")
        success = _image_style_transfer(content_frame_path, style_path, output_frame_path, output_size)
    
    ...

## Model Architecture

The `ImageStyleTransfer_VGG19` model architecture consists of a modified version of the VGG19 network. It extracts the features from input images that the later steps of neural style transfer rely on. The implementation of this model, from `src/train_model.py`, is as follows.

In [6]:
class ImageStyleTransfer_VGG19(nn.Module):
    def __init__(self):
        super(ImageStyleTransfer_VGG19, self).__init__()

        self.chosen_features = {0: 'conv11', 5: 'conv21', 10: 'conv31', 19: 'conv41', 28: 'conv51'}
        self.model = vgg19(weights='DEFAULT').features[:29]

    def forward(self, x):
        feature_maps = dict()
        for idx, layer in enumerate(self.model):
            x = layer(x)
            if idx in self.chosen_features.keys():
                feature_maps[self.chosen_features[idx]] = x
        
        return feature_maps

In the initialization method (`__init__()`), the model defines a dictionary called `chosen_features`, which maps the indices of the layers of interest in the VGG19 feature stack (0, 5, 10, 19, and 28, namely the 1st, 6th, 11th, 20th, and 29th layers) to readable names. Each of these is the first convolutional layer of one of VGG19's five convolutional blocks, and they are selected for their ability to capture semantic information in content and style images. Additionally, the model loads the pre-trained VGG19 feature layers up to and including the 29th layer (`features[:29]`), which are used for feature extraction.

During the forward pass (`forward()` method), the input image (`x`) is passed through the model's layers one by one. Whenever the index of a layer appears in `chosen_features`, the output of that layer is stored under its layer name in the output dictionary. These feature maps are used in subsequent steps to calculate the loss function.
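
The same pattern of tapping intermediate outputs can be tried without downloading the VGG19 weights. The sketch below is only an illustration: `TinyFeatureExtractor`, its three-layer stack, and the names `'conv1'` and `'conv2'` are made up for this example and stand in for the VGG19 feature layers.

```python
import torch
import torch.nn as nn


class TinyFeatureExtractor(nn.Module):
    """Toy stand-in for ImageStyleTransfer_VGG19 (illustrative only)."""

    def __init__(self):
        super().__init__()
        # index in the layer stack -> name under which the output is stored
        self.chosen_features = {0: "conv1", 2: "conv2"}
        self.model = nn.Sequential(
            nn.Conv2d(3, 4, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(4, 8, kernel_size=3, padding=1),
        )

    def forward(self, x):
        feature_maps = {}
        for idx, layer in enumerate(self.model):
            x = layer(x)
            if idx in self.chosen_features:
                feature_maps[self.chosen_features[idx]] = x
        return feature_maps


features = TinyFeatureExtractor()(torch.rand(1, 3, 16, 16))
for name, fmap in features.items():
    print(name, tuple(fmap.shape))
# conv1 (1, 4, 16, 16)
# conv2 (1, 8, 16, 16)
```

Returning a dictionary keyed by layer name lets the loss code select layers by name instead of by position in the network.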

## Model Training

The model training process for image style transfer begins with the content image and the style image, both as tensors. The goal is to iteratively update the generated image to minimize a specific loss function so that a new image is generated that combines the content of the content image with the artistic style of the style image. 

The generated image is initialized as a copy of the content image, so the starting point already retains the content information of the content image. Compared to initializing the generated image with random noise, this approach lets the optimization converge faster and is therefore more computationally efficient. The loss function used in image style transfer is a linear combination of the content loss and the style loss.

### Content Loss

The content loss measures the difference between the features extracted from the generated image and those extracted from the content image. Mathematically, it is the mean squared difference between the feature maps of the generated image tensor and the content image tensor. Minimizing it encourages the generated image to preserve the content of the content image. The following function `_get_content_loss()` implements the content loss calculation.

In [7]:
def _get_content_loss(content_feature, generated_feature):
    """Compute MSE between content feature map and generated feature map as content loss."""
    return torch.mean((generated_feature - content_feature) ** 2)

### Style Loss

The style loss, on the other hand, evaluates the differences in texture, colors, and patterns between the features of the generated image and the style image. Mathematically, it is the mean squared difference between the Gram matrices of the generated image features and the style image features. You can refer to this [wiki page](https://en.wikipedia.org/wiki/Gram_matrix) for more information about Gram matrices. Minimizing it encourages the generated image to adopt the artistic style of the style image. The following `_get_style_loss()` function implements the style loss calculation.

In [8]:
def _get_style_loss(style_feature, generated_feature):
    """Compute MSE between gram matrix of style feature map and of generated feature map as style loss."""
    _, channel, height, width = generated_feature.shape
    style_gram = style_feature.view(channel, height*width).mm(
        style_feature.view(channel, height*width).t()
    )
    generated_gram = generated_feature.view(channel, height*width).mm(
        generated_feature.view(channel, height*width).t()
    )

    return torch.mean((generated_gram - style_gram) ** 2)
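
To make the Gram matrix concrete, here is a small worked example, assuming a single feature map with two channels and a 1×2 spatial grid. The helper `gram_matrix` is a hypothetical refactoring of the two identical `view(...).mm(...)` expressions above, not a function from the source code.

```python
import torch


def gram_matrix(feature):
    """Gram matrix of a (1, C, H, W) feature map: F @ F.T with F of shape (C, H*W)."""
    _, channel, height, width = feature.shape
    flat = feature.reshape(channel, height * width)
    return flat.mm(flat.t())


# Channel 0 holds [1, 2] and channel 1 holds [3, 4].
feature = torch.tensor([[[[1.0, 2.0]], [[3.0, 4.0]]]])
print(tuple(feature.shape))  # (1, 2, 1, 2)

# Entries: [[1*1 + 2*2, 1*3 + 2*4],
#           [3*1 + 4*2, 3*3 + 4*4]] = [[5, 11], [11, 25]]
gram = gram_matrix(feature)
print(gram)
```

Each entry is the dot product between two channels, so the matrix records which channels respond together while discarding where in the image they do so.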

### Training Process

During the training process, the generated image is iteratively updated to minimize the combined content and style loss. The pixel values of the generated image are adjusted by gradient-based optimization: the loss is backpropagated through the network to obtain gradients with respect to the pixels, while the VGG19 weights themselves are left unchanged. The training continues for a specified number of iterations. The final generated image is the outcome of the image style transfer process, combining the content of the content image with the artistic style of the style image.
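
Written out, the objective described above takes the following form. The notation is introduced here for illustration and does not appear in the source code: $F_l(x)$ is the feature map of layer $l$ for image $x$, reshaped to a $C_l \times H_l W_l$ matrix; $c$, $s$, and $g$ are the content, style, and generated images; $\mathcal{C}$ and $\mathcal{S}$ are the sets of layers used for the content and style losses; the square is taken element-wise, and $\operatorname{mean}$ averages over all entries.

$$
\begin{aligned}
\mathcal{L}_{\text{content}} &= \sum_{l \in \mathcal{C}} \operatorname{mean}\!\left[\big(F_l(g) - F_l(c)\big)^2\right] \\
G_l(x) &= F_l(x)\,F_l(x)^{\top} \\
\mathcal{L}_{\text{style}} &= \sum_{l \in \mathcal{S}} \operatorname{mean}\!\left[\big(G_l(g) - G_l(s)\big)^2\right] \\
\mathcal{L}_{\text{total}} &= \alpha\,\mathcal{L}_{\text{content}} + \beta\,\mathcal{L}_{\text{style}}
\end{aligned}
$$

The weights $\alpha$ and $\beta$ set the trade-off between preserving content and adopting style, and only the pixels of $g$ are updated.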

The function `train_image()` from `src/train_model.py` is responsible for the entire training process described above for image style transfer. The function takes in the content and style images as tensors, the initialized output image as tensor, the output directory, and the optional training configuration as arguments. It stores the final outcome of the training process in the generated image tensor, and returns an indicator of whether the training process was successful or not.

_NOTE_: Here we only present the most important code blocks of the training process. For the full implementation, please refer to the [source code](src/train_model.py).

In [9]:
def train_image(content, style, generated, device, train_config, output_dir, output_img_fmt, content_img_name, style_img_name, verbose=False):
    """Update the output image using pre-trained VGG19 model."""
    ...
    
    model = ImageStyleTransfer_VGG19().to(device).eval()    # evaluation mode; the VGG19 weights are never passed to the optimizer
    optimizer = torch.optim.Adam([generated], lr=lr)
    
    for epoch in range(num_epochs):
        # get features maps of content, style and generated images from chosen layers
        content_features = model(content)
        style_features = model(style)
        generated_features = model(generated)

        content_loss = style_loss = 0

        for layer_name in generated_features.keys():
            content_feature = content_features[layer_name]
            style_feature = style_features[layer_name]
            generated_feature = generated_features[layer_name]

            if layer_name in capture_content_features_from:
                # compute content loss
                content_loss_per_feature = _get_content_loss(content_feature, generated_feature)
                content_loss += content_loss_per_feature
            
            if layer_name in capture_style_features_from:
                # compute style loss
                style_loss_per_feature = _get_style_loss(style_feature, generated_feature)
                style_loss += style_loss_per_feature

        # compute loss 
        total_loss = alpha * content_loss + beta * style_loss

        optimizer.zero_grad()
        total_loss.backward()
        optimizer.step()

    ...

    return 1
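
The essential point of the loop above is that the optimizer is given the image tensor rather than any network weights. The toy sketch below isolates that idea under simplifying assumptions: there is no VGG19 model, the loss is a plain MSE against a fixed target, and the names `target` and `generated`, the learning rate, and the step count are all made up for illustration.

```python
import torch

torch.manual_seed(0)

target = torch.rand(1, 3, 8, 8)  # stands in for the desired result
generated = torch.zeros(1, 3, 8, 8, requires_grad=True)  # the pixels being optimized

# Only the image tensor is handed to the optimizer.
optimizer = torch.optim.Adam([generated], lr=0.05)

initial_loss = torch.mean((generated - target) ** 2).item()
for step in range(200):
    loss = torch.mean((generated - target) ** 2)
    optimizer.zero_grad()
    loss.backward()   # gradients flow back to the pixels of `generated`
    optimizer.step()  # the pixels are updated in place

final_loss = torch.mean((generated - target) ** 2).item()
print(f"loss before: {initial_loss:.4f}, after: {final_loss:.6f}")
```

In the actual pipeline, the MSE against `target` is replaced by the weighted content and style losses computed from the VGG19 feature maps.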

There is no major difference between the training processes for image style transfer and video style transfer. The only difference is that in video style transfer, we iterate through the frames of the content video and run the training process described above on each frame separately. To streamline the discussion in this notebook, we do not repeat the code block here. Please refer to the [source code](src/video_style_transfer.py) for the full implementation if interested.

_NOTE_: The hyperparameter values used in the training process for both image and video style transfer are discussed in detail in the [README](README.md) file.

## Experiments and Results

### Saving the Result

After the training process is completed, we use the `save_image()` API from `torchvision.utils` to convert the generated image tensor into a PIL image and save it in the output directory. The following code block from `image_style_transfer.py` shows how we can save the generated image.

In [10]:
def image_style_transfer(config):
    """Implements neural style transfer on a content image using a style image, applying provided configuration."""
    ...
    
    # train model
    success = train_image(content_tensor, style_tensor, generated_tensor, device, train_config, output_dir, output_img_fmt, content_img_name, style_img_name, verbose=verbose)

    # save output image to specified directory
    if success:
        save_image(generated_tensor, os.path.join(output_dir, f'nst-{content_img_name}-{style_img_name}-final.{output_img_fmt}'))
    
    ...

In the case of video style transfer, after saving all transferred frames into a directory called `transferred_frames`, we need one further step to synthesize the transferred frames into a video. We leverage the `OpenCV` library to perform this step. The following code block from `video_style_transfer.py` shows how we can synthesize the transferred frames into a video.

In [11]:
    
def video_style_transfer(config):
    """Implements neural style transfer on a video using a style image, applying provided configuration."""
    ...
    
    # synthesize video using transferred content frames
    cv2_fourcc = cv2.VideoWriter_fourcc(*'mp4v')
    video_writer = cv2.VideoWriter(output_video_path, cv2_fourcc, output_fps, (output_frame_width, output_frame_height), True)

    for i in range(total_frames):
        frame = cv2.imread(os.path.join(output_dir, "transferred_frames", f"transferred_frame-{i+1:08d}.jpg"))
        if frame is not None:
            video_writer.write(frame)

    video_writer.release()
    
    ...

Again, since this report emphasizes the technical details of the implementation, we do not include the final outputs of image and video style transfer in this notebook. If interested, please refer to the [README](README.md) file, which presents several example outputs along with details on installation and usage.

### Model Evaluation

The evaluation of the neural style transfer model is based on the quality of the final outcome of the image and/or video style transfer process. Unlike training a classical machine learning model, we do not have a quantitative metric to evaluate the quality of the final outcome in this project. This is because the quality of the final outcome is subjective and depends on the user's preference. Nonetheless, there are some general guidelines that can be used to evaluate the quality of the final outcome. For example, the final outcome should retain the content of the content image and adopt the artistic style of the style image. Additionally, the final outcome should be visually appealing and aesthetically pleasing.

We use these general guidelines to tune the hyperparameters of the model and to evaluate the quality of the outputs during our experiments. The final hyperparameter values are those that yield visually pleasing results without excessively long training times. However, we encourage users to experiment with different hyperparameter values to find the combination that best suits their practical needs and aesthetic preferences. Please refer to the [README](README.md) file for more details on how to tune the hyperparameters.

## Conclusion and Future Work

### Conclusion
The outcomes of this neural style transfer project have provided valuable insights into the application of style transfer techniques to images and videos. We have successfully demonstrated the ability to combine the content of one image or video with the artistic style of another, resulting in visually appealing style-transferred outputs. By applying the concept of transfer learning and leveraging a pre-trained deep neural network, VGG19 in our case, we are able to extract and manipulate high-level features from digital content to achieve impressive artistic transformations.

The implications of our results are significant, as they showcase the potential of high-level, delicate manipulation on images and videos. This opens up avenues for various applications, including artistic expression, visual storytelling, and multimedia content generation.

### Limitations
However, it is important to acknowledge the limitations of our approach. The style transfer process is computationally demanding, especially for high-resolution images or long videos with high frame rates, because every image or frame is optimized separately over many iterations. Additionally, the choice of style image and content image greatly impacts the quality and artistic appeal of the final output. Therefore, selecting suitable style and content images is crucial for achieving good results.

### Future Work
For future work, several aspects can be explored to further improve upon this project. First, exploring alternative deep neural network architectures and loss functions could improve the overall quality of the style-transferred outputs. Second, improving the computational efficiency of the algorithm, for example through parallel processing or hardware acceleration, deserves more effort, as it could bring style transfer on high-resolution images and videos closer to real time.

In conclusion, this neural style transfer project has demonstrated the effectiveness of combining content and style to generate visually captivating outputs, and has shown the potential of high-level feature extraction and manipulation. This work lays the foundation for further advancements in this field, opening up possibilities for future research and practical applications in art, design, and multimedia content generation.