# Neural Style Transfer

Based on Andrew Ng's Deeplearning.ai course.

## Visualization

Zeiler and Fergus, 2013, Visualizing and Understanding Convolutional Networks.

Pick a unit in layer 1 and find the 9 image patches that maximize that unit's activation. Repeat this for other units.
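
The selection step can be sketched as follows. This is a minimal illustration, assuming the activation of one unit has already been recorded for every candidate patch; `top_k_patches` and the random `unit_activations` array are hypothetical stand-ins, not part of the paper's code.

```python
import numpy as np

def top_k_patches(activations, k=9):
    """Return the indices of the k patches with the largest activation."""
    activations = np.asarray(activations)
    # argsort is ascending, so take the last k entries and reverse them
    return np.argsort(activations)[-k:][::-1]

rng = np.random.default_rng(0)
# Hypothetical data: activation of a single unit on 1000 candidate patches
unit_activations = rng.normal(size=1000)
best = top_k_patches(unit_activations)
```

`best` holds the patch indices in descending order of activation; running the same function on another unit's activations gives that unit's 9 patches.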

## Cost Function

Gatys et al., 2015, A Neural Algorithm of Artistic Style.

$$ \text{Minimize: } J(G) = \alpha J_{content}(C, G) + \beta J_{style}(S, G) $$

Here C is the content image, S is the style image, and G is the generated image. $\alpha$ and $\beta$ are hyperparameters that weight the content and style costs.

Process:

### Find the generated image G.

1. Initialize G randomly, e.g. 100x100x3
2. Use gradient descent to minimize $J(G)$ above, updating the pixel values of G at each step.
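
The two steps above can be sketched with a toy cost. This is only an illustration under strong assumptions: the cost is a plain squared pixel distance to a stand-in image `C`, so the gradient is available in closed form. In the real method the gradient of $J(G)$ is obtained by backpropagating through a pre-trained ConvNet. The names `cost`, `grad`, and `learning_rate` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
C = rng.uniform(size=(100, 100, 3))  # stand-in target image (assumption)
G = rng.uniform(size=(100, 100, 3))  # step 1: initialize G randomly

def cost(G):
    # Toy cost: half the squared pixel distance between G and C
    return 0.5 * np.sum((C - G) ** 2)

def grad(G):
    # Closed-form gradient of the toy cost with respect to G
    return G - C

learning_rate = 0.1
initial_cost = cost(G)
for _ in range(100):
    # step 2: gradient descent update on the pixels of G
    G = G - learning_rate * grad(G)
final_cost = cost(G)
```

Only the pixels of G are updated; the target stays fixed, just as the network weights stay fixed in the full algorithm.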

### Content Cost Function

* Use hidden layer $l$ to compute the content cost. $l$ is usually somewhere in the middle of the network, neither too shallow nor too deep.
* Use a pre-trained ConvNet, e.g. VGG.
* Let $a^{[l](C)}$ and $a^{[l](G)}$ be the activations of layer $l$ on images C and G, respectively.
* If $a^{[l](C)}$ and $a^{[l](G)}$ are similar, both images have similar content.

$$ J_{content}(C, G) = \frac{1}{2} \| a^{[l](C)} - a^{[l](G)} \|^2 $$
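
A minimal NumPy sketch of this formula, assuming the layer-$l$ activations have already been extracted as arrays of identical shape (the function name `content_cost` is illustrative):

```python
import numpy as np

def content_cost(a_C, a_G):
    """Half the squared L2 distance between two activation volumes."""
    return 0.5 * np.sum((a_C - a_G) ** 2)

# Tiny example: a 2x2x3 volume of ones versus a volume of zeros
a_C = np.ones((2, 2, 3))
a_G = np.zeros((2, 2, 3))
print(content_cost(a_C, a_G))  # 6.0
```

The cost is zero exactly when the two activation volumes are identical.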

### Style Cost Function

Define **style** as **correlation** between activations across channels.

**Style Matrix**

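Let $a^{[l]}_{i,j,k}$ be the activation at position $(i, j, k)$, where $i$, $j$ and $k$ index the height, width and channel of the layer-$l$ activation volume.

$G^{[l]}$ is the $n^{[l]}_C \times n^{[l]}_C$ style matrix (also called the Gram matrix), where $n^{[l]}_C$ is the number of channels in layer $l$. Note that $G^{[l]}$ here is not the generated image G. $G^{[l]}_{kk'}$ measures the (unnormalized) correlation between channels $k$ and $k'$, with $k, k' \in \{1, \dots, n^{[l]}_C\}$.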

$$ G^{[l]}_{kk'} = \sum_{i=1}^{n^{[l]}_H} \sum_{j=1}^{n^{[l]}_W} a^{[l]}_{i,j,k} \times a^{[l]}_{i,j,k'} $$

Compute the style matrices for both the style image and the generated image, giving $G^{[l](S)}$ and $G^{[l](G)}$.
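
The double sum over positions can be written as a single matrix product. The sketch below assumes the activations are stored as an array of shape `(n_H, n_W, n_C)`; `gram_matrix` is an illustrative name.

```python
import numpy as np

def gram_matrix(a):
    """Style matrix of an activation volume with shape (n_H, n_W, n_C)."""
    n_H, n_W, n_C = a.shape
    # Flatten the spatial positions: one row per (i, j), one column per channel
    flat = a.reshape(n_H * n_W, n_C)
    # Entry (k, k') sums a[i, j, k] * a[i, j, k'] over all positions
    return flat.T @ flat

rng = np.random.default_rng(0)
a = rng.normal(size=(3, 4, 5))
G_l = gram_matrix(a)  # shape (5, 5), symmetric
```

The same function would be applied once to the style image's activations and once to the generated image's activations.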

Style cost is then:

$$
\begin{aligned}
J^{[l]}_{style}(S, G) &= \frac{1}{(2n^{[l]}_H n^{[l]}_W n^{[l]}_C)^2} \| G^{[l](S)} - G^{[l](G)} \|^2_F \\
&= \frac{1}{(2n^{[l]}_H n^{[l]}_W n^{[l]}_C)^2} \sum_{k} \sum_{k'} \bigg( G^{[l](S)}_{kk'} - G^{[l](G)}_{kk'} \bigg)^2
\end{aligned}
$$
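
A short sketch of the per-layer style cost, assuming both activation volumes have the same shape `(n_H, n_W, n_C)`. The helper names are illustrative, and the Gram matrix is computed inline so the block stands alone.

```python
import numpy as np

def gram_matrix(a):
    """Style matrix of an activation volume with shape (n_H, n_W, n_C)."""
    n_H, n_W, n_C = a.shape
    flat = a.reshape(n_H * n_W, n_C)
    return flat.T @ flat

def layer_style_cost(a_S, a_G):
    """Squared Frobenius distance between style matrices, normalized as above."""
    n_H, n_W, n_C = a_S.shape
    diff = gram_matrix(a_S) - gram_matrix(a_G)
    return np.sum(diff ** 2) / (2 * n_H * n_W * n_C) ** 2

# Tiny example: one channel on a 2x2 grid, ones versus zeros
a_S = np.ones((2, 2, 1))
a_G = np.zeros((2, 2, 1))
print(layer_style_cost(a_S, a_G))  # 0.25
```

In the example the style matrices are `[[4]]` and `[[0]]`, so the squared difference is 16 and the normalizer is $(2 \cdot 2 \cdot 2 \cdot 1)^2 = 64$.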

The results turn out to be better when the style cost is combined over multiple layers, each weighted by a hyperparameter $\lambda^{[l]}$:

$$ J_{style}(S, G) = \sum_{l} \lambda^{[l]} J_{style}^{[l]}(S, G) $$
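
The weighted sum over layers might look like the sketch below. It assumes the activations of each chosen layer are supplied as `(lambda_l, a_S, a_G)` triples; that layout, and the function names, are hypothetical choices for illustration.

```python
import numpy as np

def layer_style_cost(a_S, a_G):
    """Per-layer style cost for activation volumes of shape (n_H, n_W, n_C)."""
    n_H, n_W, n_C = a_S.shape
    flat_S = a_S.reshape(n_H * n_W, n_C)
    flat_G = a_G.reshape(n_H * n_W, n_C)
    diff = flat_S.T @ flat_S - flat_G.T @ flat_G
    return np.sum(diff ** 2) / (2 * n_H * n_W * n_C) ** 2

def style_cost(layers):
    """Sum lambda_l * J_style^[l] over an iterable of (lambda_l, a_S, a_G)."""
    return sum(lam * layer_style_cost(a_S, a_G) for lam, a_S, a_G in layers)

# Two hypothetical layers with weights 0.5 and 2.0
layers = [
    (0.5, np.ones((2, 2, 1)), np.zeros((2, 2, 1))),
    (2.0, np.ones((2, 2, 1)), np.zeros((2, 2, 1))),
]
total = style_cost(layers)
```

Each layer here has a per-layer cost of 0.25, so the total is $0.5 \cdot 0.25 + 2.0 \cdot 0.25 = 0.625$.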

# 1D & 3D Generalizations

A 1D input is convolved by sliding a 1D filter along the sequence. E.g. a length 14 sequence with a length 5 filter gives a length 10 output, since $14 - 5 + 1 = 10$.
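
A quick NumPy check of the 1D case, assuming stride 1 and no padding. The averaging filter is an arbitrary choice for illustration; `np.correlate` is used because ConvNet "convolution" does not flip the filter.

```python
import numpy as np

signal = np.arange(14, dtype=float)  # length 14 input sequence
kernel = np.ones(5) / 5.0            # hypothetical length 5 averaging filter

# "valid" mode keeps only positions where the filter fully overlaps the input
output = np.correlate(signal, kernel, mode="valid")
print(output.shape)  # (10,)
```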

3D inputs include videos or any sequence of 2D data. E.g. a 14x14x14 input volume with a 5x5x5 filter gives a 10x10x10 output volume.

With channels, a 14x14x14x3 input convolved with 16 filters of size 5x5x5x3 gives a 10x10x10x16 output.
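
The shape arithmetic in these examples follows the same rule in every spatial dimension: output size $= n - f + 1$. The helper below is a small sketch assuming stride 1 and no padding; `conv_output_shape` is an illustrative name, and channel counts are assumed to match between input and filter.

```python
def conv_output_shape(spatial_in, spatial_filter, num_filters=None):
    """Output shape of a stride-1, unpadded convolution.

    spatial_in / spatial_filter: spatial sizes only (channels excluded).
    num_filters: if given, appended as the output channel dimension.
    """
    out = tuple(n - f + 1 for n, f in zip(spatial_in, spatial_filter))
    return out if num_filters is None else out + (num_filters,)

print(conv_output_shape((14,), (5,)))                    # (10,)
print(conv_output_shape((14, 14, 14), (5, 5, 5)))        # (10, 10, 10)
print(conv_output_shape((14, 14, 14), (5, 5, 5), 16))    # (10, 10, 10, 16)
```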