## PCA to Autoencoders
##### EECS 495: Deep Learning from Scratch, Final Project
##### By Philip Meyers IV (pmm432)

For my final project, I wanted to look at the method of style transfer discussed in 'A Neural Algorithm of Artistic Style' by Gatys et al. The paper presents a unique and very successful approach to the problem of style transfer: creating a composite image by combining the content of one image with the style of another. Although the task appears quite difficult, the process demonstrated by Gatys et al. is pretty simple to implement. (Running it, on the other hand, is far more intensive, as I found out.) Gatys et al. use activations from hidden layers in a CNN trained for image classification (specifically the VGG network) to define a new optimization problem. The researchers observed that activations in the deeper layers represent the high-level content of an image (its objects and their arrangement) without pinning down exact pixel values, while the correlations between feature maps within a layer (the Gram matrices), taken across several layers, capture texture-like properties such as an artist's painting style. Thus a composite image can be generated from a purely noisy image by learning a value for each pixel so that the image's deep-layer activations match those of the content image and its feature correlations match those of the style image.
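
The two similarity terms can be sketched in NumPy. This is a minimal illustration, not the TensorFlow code used in this project: the function names are made up for the example, and the normalization by array size is an assumption chosen to mirror the implementation further down. The toy data shows why a Gram matrix is a reasonable stand-in for "style": shuffling the spatial positions of the activations leaves it unchanged, while the content term changes a lot.

```python
import numpy as np

def gram_matrix(features):
    """Channel-by-channel correlations of one layer's activations.

    `features` has shape (height, width, channels).
    """
    h, w, c = features.shape
    flat = features.reshape(-1, c)       # (h*w, c): one row per spatial position
    return flat.T @ flat / flat.size     # (c, c), divided by h*w*c

def content_loss(generated, content):
    # Half the summed squared difference, divided by the number of activations
    return 0.5 * np.sum((generated - content) ** 2) / content.size

def style_loss(generated, style):
    # Same form as the content loss, but applied to the Gram matrices
    g_gen, g_style = gram_matrix(generated), gram_matrix(style)
    return 0.5 * np.sum((g_gen - g_style) ** 2) / g_style.size

# Toy activations: shuffle the spatial positions but keep the channel vectors
rng = np.random.default_rng(0)
acts = rng.normal(size=(4, 4, 3))
shuffled = rng.permutation(acts.reshape(-1, 3)).reshape(4, 4, 3)

print(style_loss(shuffled, acts))    # essentially zero: same correlations
print(content_loss(shuffled, acts))  # clearly non-zero: layout has changed
```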

Below is my implementation of the style transfer net using VGG19 and Tensorflow:

In [None]:
import os
import urllib.request
import scipy.io
import tensorflow as tf
import numpy as np
from functools import reduce

class VGG19:
    VGG19_URL = "http://www.vlfeat.org/matconvnet/models/beta16/imagenet-vgg-verydeep-19.mat"
    VGG19_PATH = os.path.abspath(VGG19_URL.split('/')[-1])
    
    # From https://www.mathworks.com/help/nnet/ref/vgg19.html
    # We only care about convolutional part and ignore everything that touches the fully-connected layers
    VGG19_LAYERS = ( 
        'conv1_1', 'relu1_1', 'conv1_2', 'relu1_2', 'pool1', 
        'conv2_1', 'relu2_1', 'conv2_2', 'relu2_2', 'pool2', 
        'conv3_1', 'relu3_1', 'conv3_2', 'relu3_2', 'conv3_3', 'relu3_3', 'conv3_4', 'relu3_4', 'pool3',
        'conv4_1', 'relu4_1', 'conv4_2', 'relu4_2', 'conv4_3', 'relu4_3', 'conv4_4', 'relu4_4', 'pool4',
        'conv5_1', 'relu5_1', 'conv5_2', 'relu5_2', 'conv5_3', 'relu5_3', 'conv5_4', 'relu5_4',
    )
    
    def ensure_exists(self):
        if not os.path.isfile(self.VGG19_PATH):
            print("No cache of VGG19 data found")
            print("Downloading from {} (this could take a while)...".format(self.VGG19_URL))
            urllib.request.urlretrieve(self.VGG19_URL, self.VGG19_PATH)
            print("VGG19 data downloaded and stored at {}".format(self.VGG19_PATH))
        else:
            print("Cached VGG19 data found at {}".format(self.VGG19_PATH))
    
    def load_data(self, path):
        self.ensure_exists()
        
        data = scipy.io.loadmat(path)
        weights = data['layers'][0]
        mean = data['normalization'][0][0][0]
        mean_pixel = np.mean(mean, axis=(0, 1))
        self.mean_pixel = mean_pixel
        self.weights = weights
    
    def load_net(self, input_layer, pooling_type='max'):
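        # getattr avoids an AttributeError when load_data() was never called
        if getattr(self, 'weights', None) is None or getattr(self, 'mean_pixel', None) is None: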
            raise Exception('Data not loaded!')
        weights = self.weights
        net = {}
        prev = input_layer
        for i, name in enumerate(self.VGG19_LAYERS):
            if 'conv' in name:
                kernels, bias = weights[i][0][0][0][0]
                
                # matlab is [width, height, channels_in, channels_out]
                # tf is     [height, width, channels_in, channels_out]
                kernels = np.transpose(kernels, (1,0,2,3))
                bias = bias.reshape(-1)
                            
                conv = tf.nn.conv2d(prev, tf.constant(kernels), strides=(1,1,1,1), padding='SAME')
                conv = tf.nn.bias_add(conv, bias)
                prev = conv
            elif 'relu' in name:
                relu = tf.nn.relu(prev)
                prev = relu
            elif 'pool' in name:
                pool = None
                if pooling_type == 'avg':
                    pool = tf.nn.avg_pool(prev, ksize=(1,2,2,1), strides=(1,2,2,1), padding='SAME')
                else:
                    pool = tf.nn.max_pool(prev, ksize=(1,2,2,1), strides=(1,2,2,1), padding='SAME')
                prev = pool        
            else:
                raise Exception('Unrecognized layer `{}` encountered'.format(name))
            
            net[name] = prev
        
        return net

def compute_features(image, vgg, feature_layers, pooling_type, gram, g):
    shape = (1,) + image.shape
    placeholder = tf.placeholder('float', shape=shape)
    net = vgg.load_net(placeholder, pooling_type)
    image_norm = image - vgg.mean_pixel

    features = {}
    with g.as_default(), tf.Session() as sess:
        for l in feature_layers:
            f = net[l].eval(feed_dict={placeholder:np.asarray([image_norm])})
            if gram:
                f = np.reshape(f, (-1, f.shape[3]))
                f = np.matmul(f.T, f) / f.size
            features[l] = f
    return features

def style_transfer(
    vgg,
    content_image, 
    style_image,
    iterations,
    content_weight,
    style_weight,
    pooling_type,
    content_layers,
    style_layers,
    learning_rate,
    beta1,
    beta2,
    epsilon,
    checkpoint_iters):
    
    g = tf.Graph()
    with g.as_default():
        content_features = compute_features(content_image, vgg, content_layers, pooling_type, False, g)
        style_features = compute_features(style_image, vgg, style_layers, pooling_type, True, g)

        initial_image = tf.Variable(tf.random_normal((1,) + content_image.shape))
        net = vgg.load_net(initial_image, pooling_type)
        
        content_loss = []
        for l in content_layers:
            content_loss.append(tf.nn.l2_loss(net[l]-content_features[l]) / content_features[l].size)
        content_loss = reduce(tf.add, content_loss)
        
        style_loss = []
        for l in style_layers:
            layer = net[l]
            _, height, width, number = map(lambda i: i.value, layer.get_shape())
            size = height * width * number
            feats = tf.reshape(layer, (-1, number))
            gram = tf.matmul(tf.transpose(feats), feats) / size
            style_gram = style_features[l]
            style_loss.append(tf.nn.l2_loss(gram - style_gram) / style_gram.size)        
        style_loss = reduce(tf.add, style_loss)
        
        total_loss = content_weight * content_loss + style_weight * style_loss
        train = tf.train.AdamOptimizer(learning_rate, beta1, beta2, epsilon).minimize(total_loss)
        
        best_loss = float('inf')
        best = None
        did_improve = False
        with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
            sess.run(tf.global_variables_initializer())
            
            for i in range(iterations):
                did_improve = False
                train.run()
                loss = total_loss.eval()
                if loss < best_loss:
                    best_loss = loss
                    best = initial_image.eval()
                    did_improve = True
                    
                img_out = best.reshape(content_image.shape) + vgg.mean_pixel
                yield (i, img_out, did_improve)

In [None]:
import matplotlib.pyplot as plt
from matplotlib import rcParams
rcParams['figure.autolayout'] = True
%matplotlib inline
from matplotlib.pyplot import imshow
import scipy.misc
from PIL import Image  # used by save_img below
import time

def read_img(path):
    return scipy.misc.imread(path).astype(np.float64)

def save_img(path, img):
    img = np.clip(img, 0, 255).astype(np.uint8)
    Image.fromarray(img).save(path, quality=95)

I explored this approach to style transfer on three pairs of images:

* Content image of Main Library, style image of cubism
![alt text](main.jpg "Main Library")
![alt text](cubism.jpg "Cubism")
* Content image of me, style image of Van Gogh's self-portrait
![alt text](me.jpg "Me")
![alt text](vg.jpg "Van Gogh")
* Content image of me, style image of Naruto
![alt text](me.jpg "Me")
![alt text](naruto.jpg "Naruto")

I explored the hyperparameter space of content weight, style weight, pooling type, and learning rate, running as many combinations of these parameters as time permitted.

Time was unexpectedly the biggest challenge of this project. I knew the optimization would be long, but I didn't know that it would be 2-hours-per-image long, even after I resized the image! When I attempted to reduce the size of the image to a point where the process only took 5-10 minutes, the images were way too small to be able to discern much detail or style. A big learning experience of this project was simply learning how to queue up a lot of work (**and** constantly save results--I lost about a day's worth of work because my computer crashed before I could save the results of multiple style transfers) and leave my computer to itself for a day or two to grind through the style transfers. 
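
One way to guard against this kind of loss is to write each result to disk as soon as its run finishes, and to skip finished runs when the queue is restarted. The sketch below is not the code I ran: `run_queue` and `fake_transfer` are hypothetical names, and the JSON file is only a stand-in for the output image (in this notebook the write would be a `save_img` call).

```python
import json
import os
import tempfile

def run_queue(jobs, run_job, out_dir):
    """Run each job and persist its result immediately.

    `jobs` maps a job name to a dict of keyword arguments for `run_job`,
    which must return something JSON-serializable. Jobs whose output file
    already exists are skipped, so the queue can be restarted after a crash.
    Returns the names of the jobs that were actually run.
    """
    os.makedirs(out_dir, exist_ok=True)
    completed = []
    for name, params in jobs.items():
        path = os.path.join(out_dir, name + '.json')
        if os.path.exists(path):
            continue  # finished in an earlier session
        result = run_job(**params)
        tmp_path = path + '.tmp'
        with open(tmp_path, 'w') as f:
            json.dump({'params': params, 'result': result}, f)
        os.replace(tmp_path, path)  # rename only after the write has finished
        completed.append(name)
    return completed

def fake_transfer(content_weight, style_weight):
    # Stand-in for a long-running style transfer
    return content_weight / style_weight

jobs = {
    'cw5_sw500': {'content_weight': 5, 'style_weight': 500},
    'cw50_sw5000': {'content_weight': 50, 'style_weight': 5000},
}
out_dir = tempfile.mkdtemp()
first = run_queue(jobs, fake_transfer, out_dir)
second = run_queue(jobs, fake_transfer, out_dir)  # everything is already on disk
print(first, second)
```

Writing to a temporary file and renaming it means a crash in the middle of a write never leaves a half-written result that would later be mistaken for a finished one.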

The following cell was used to experiment with 2 pooling types (max and average), 3 content weights (0.5, 5, 50), and 3 style weights (50, 500, 5000) for a total of 18 hyperparameter combinations:

In [None]:
content_path = 'me.jpg'
style_path = 'naruto.jpg'

content_image = read_img(content_path)
content_image = scipy.misc.imresize(content_image, 0.25)
style_image = read_img(style_path)
style_image = scipy.misc.imresize(style_image, 0.5)

vgg = VGG19()
vgg.load_data(vgg.VGG19_PATH)

content_layers = ('relu4_2', 'relu5_2')
style_layers = ('relu1_1', 'relu2_1', 'relu3_1', 'relu4_1', 'relu5_1')

pooling_type ='max'
epsilon = 1e-08
beta1 = 0.9
beta2 = 0.999
learning_rate = 1e1
iterations = 1000
content_weight = 5e0
style_weight=5e2
ct = 0
best_images = []
for pooling_type in ('max', 'avg'):
    for cw in (-1, 0, 1):
        content_weight = 5*10**cw
        for sw in (1,2,3):
            ct += 1
            style_weight = 5*10**sw
            
            print(ct, pooling_type, content_weight, style_weight)

            best_img = content_image
            for i, img, did_improve in style_transfer(
                vgg,
                content_image, 
                style_image,
                iterations,
                content_weight,
                style_weight,
                pooling_type,
                content_layers,
                style_layers,
                learning_rate,
                beta1,
                beta2,
                epsilon,
                checkpoint_iters=10):

                if did_improve:
                    best_img = img
                    
            hp = {
                'ct': ct,
                'content_weight': content_weight,
                'cw':cw,
                'style_weight':style_weight,
                'sw': sw,
                'pooling_type':pooling_type
            }
            
            best_images.append((best_img, hp))

for (img, hp) in best_images:
    filename = 'me_naruto_cw{}_sw{}_{}.jpg'.format(hp['content_weight'], hp['style_weight'], hp['pooling_type'])
    path = 'images/{}'.format(filename)
    save_img(path, img)

I used the exact same setup to explore the same hyperparameter space for the other two image pairs. I was also disappointed with the results from the initial Me/Naruto style transfer, so I re-ran all 18 combinations for that pair with a higher learning rate of 100 (instead of 10), which I was just barely able to finish in time.
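
For reference, the grid searched by the nested loops can be enumerated without running any transfers. This is a small sketch using `itertools.product` instead of the loops in the cell above; the variable names are illustrative, and the filename pattern is copied from the Me/Naruto run.

```python
from itertools import product

pooling_types = ('max', 'avg')
content_weights = [5 * 10 ** e for e in (-1, 0, 1)]  # 0.5, 5, 50
style_weights = [5 * 10 ** e for e in (1, 2, 3)]     # 50, 500, 5000

# Same nesting order as the loops: pooling type, then content, then style
grid = list(product(pooling_types, content_weights, style_weights))
filenames = [
    'me_naruto_cw{}_sw{}_{}.jpg'.format(cw, sw, pool)
    for pool, cw, sw in grid
]

print(len(grid))      # 18
print(filenames[0])   # me_naruto_cw0.5_sw50_max.jpg
```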

The following is my favorite result from each of the image pairs:
![alt text](images/me_naruto_cw50_sw5000_avg.jpg "9th Hokage")
![alt text](images/me_vg_cw5_sw500_max.jpg "Van Phil")
![alt text](images/main_cubism_cw50_sw5000_avg.jpg "Main Cubism")


Finally, an analysis of the results. The "best looking" images all had a content weight:style weight ratio of 1:100. When style was weighted less heavily than that (e.g. 1:10), the generated image maintained too many of the details from the content image:
![](images/main_cubism_cw5_sw50_avg.jpg)
And when style was weighted much more heavily than 1:100 (e.g. 1:10,000), almost all of the original content was lost and the generated image looked like a distorted version of the style image:
![](images/main_cubism_cw0.5_sw5000_avg.jpg)
When only the pooling type was varied, max pooling always yielded a more "transformed" image. Sometimes this distortion came as a better or stronger presence of style:
![](images/me_vg_cw50_sw500_avg.jpg)
![](images/me_vg_cw50_sw500_max.jpg)
But other times the average pooling yielded an acceptable result while the max pooling yielded an incomprehensible result:
![](images/me_naruto_cw5_sw5000_avg.jpg)
![](images/me_naruto_cw5_sw5000_max.jpg)
To my disappointment, running the Me/Naruto image pair with the higher learning rate did not yield better results. Perhaps the images are too dissimilar, or perhaps the Naruto image simply doesn't contain enough "style" information in the chosen VGG-19 layers.

Given more time (and a more reliable computer), I would love to explore how the choice of style and content layers affects style transfer across various kinds of content and style (rather than relying on the observations and conclusions offered by Gatys et al. and the Internet). The approach I would like to take is to choose style and content layers based on their ability to maximally communicate information about a given content or style. For style, I envision doing this by collecting many images and grouping them by category (e.g. impressionist paintings, cubism, watercolors), feeding those images through VGG-19 and collecting their activations at each hidden layer, then evaluating each layer, category by category, on how well it clusters that category's images in its output space. My reasoning is that if a certain layer maps a certain category of styles (like impressionist paintings) closer together than other layers do, then perhaps that layer has a better embedding for that specific style. The same could be done for content images.
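
A rough sketch of what that layer comparison could look like is below. Everything in it is an assumption made for illustration: I have not run this experiment, the score (mean between-category distance divided by mean within-category distance) is just one simple choice among many, and the names `separation_score` and `best_layer` are hypothetical. Random vectors stand in for the per-image VGG-19 activations.

```python
import numpy as np

def separation_score(features, labels):
    """Mean between-category distance divided by mean within-category distance.

    `features` is (n_images, n_dims): one flattened activation (or Gram
    matrix) per image for a single layer. Higher means tighter clusters.
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all images
    diffs = features[:, None, :] - features[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)  # ignore each image vs. itself
    within = dists[same & off_diag].mean()
    between = dists[~same].mean()
    return between / within

def best_layer(features_by_layer, labels):
    scores = {name: separation_score(f, labels)
              for name, f in features_by_layer.items()}
    return max(scores, key=scores.get), scores

# Synthetic stand-in data: 5 images per category, 2-D "activations"
rng = np.random.default_rng(0)
labels = np.array(['cubism'] * 5 + ['impressionism'] * 5)
offsets = np.repeat([[0.0, 0.0], [10.0, 10.0]], 5, axis=0)
features_by_layer = {
    'layer_a': rng.normal(size=(10, 2)),            # categories overlap
    'layer_b': rng.normal(size=(10, 2)) + offsets,  # categories well separated
}
name, scores = best_layer(features_by_layer, labels)
print(name, scores)
```

With real data, `features_by_layer` would hold one array per VGG-19 layer, and the pairwise distance matrix would need a more memory-friendly computation for large image sets.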

This project has offered a lot of insight into the potential and difficulty of optimization with neural nets. It is incredible how well these newer methods can work, and they can still be pretty fascinating when they don't work as planned:
![](images/me_vg_cw0.5_sw5000_max.jpg)