# Convolutional Neural Networks
## Visualization and Neural Style Transfer

Author: Binghen Wang

Last Updated: 25 Dec, 2022


---

## Contents
- [Visualizing Deep CNN](#Visual)
    - [Network Structure](#Visual-NS)
    - [Visualization](#Visual-Vis)
        - [Feature maps at different layers](#Visual-Vis-1)
        - [Feature evolution during training](#Visual-Vis-2)
    - [Other Takeaways](#Visual-OT)
- [Neural Style Transfer](#NST)
    - [Cost Function](#NST-CF)
    - [Content Cost Function](#NST-CCF)
    - [Style Cost Function](#NST-SCF)

<a name = "Visual"></a>
## Visualizing Deep CNN
Source: <a href='https://arxiv.org/pdf/1311.2901.pdf'>Visualizing and Understanding Convolutional Networks</a>, Fergus & Zeiler (2013)

**Approach**: a multi-layered Deconvolutional Network (deconvnet) is used to project feature activations back to the input pixel space.
<ol>
    <li> to examine a given activation, set all other activations in that convnet layer to 0
    <li> pass the feature maps as input to the attached deconvnet layer
    <li> repeat until input pixel space is reached:
    <ol>
        <li> <b>unpool</b> (using <b>switch variables</b> passed from the convnet which store the locations of the maxima)
        <li> <b>rectify</b> (reLU)
        <li> <b>filter</b> (using transposed versions of the same filters as those in the convnet)
    </ol>
        to reconstruct the activity in the layer beneath that gave rise to the chosen activations
</ol>
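
The unpooling step can be sketched in NumPy as follows. This is a minimal illustration, assuming non-overlapping pooling on a single 2D feature map; the function names `max_pool_with_switches` and `unpool` are hypothetical and not taken from the paper, whose pooling configuration may differ.

```python
import numpy as np

def max_pool_with_switches(x, size=2):
    """Non-overlapping max pooling on a 2D map (assumed setup).

    Returns the pooled map and the "switches": for each pooling
    region, the flat index of the maximum inside that region.
    """
    h, w = x.shape
    assert h % size == 0 and w % size == 0, "map must tile evenly"
    # Rearrange into (h/size, w/size, size*size): one row of values per region.
    blocks = x.reshape(h // size, size, w // size, size).transpose(0, 2, 1, 3)
    blocks = blocks.reshape(h // size, w // size, size * size)
    switches = blocks.argmax(axis=-1)
    pooled = blocks.max(axis=-1)
    return pooled, switches

def unpool(pooled, switches, size=2):
    """Place each pooled value back at the location recorded by its switch.

    All other positions are filled with zeros.
    """
    ph, pw = pooled.shape
    blocks = np.zeros((ph, pw, size * size), dtype=pooled.dtype)
    rows, cols = np.indices((ph, pw))
    blocks[rows, cols, switches] = pooled
    # Undo the rearrangement done in max_pool_with_switches.
    out = blocks.reshape(ph, pw, size, size).transpose(0, 2, 1, 3)
    return out.reshape(ph * size, pw * size)

x = np.array([[1.0, 2.0, 0.0, 0.0],
              [3.0, 4.0, 0.0, 5.0],
              [0.0, 0.0, 9.0, 0.0],
              [7.0, 0.0, 0.0, 0.0]])
pooled, switches = max_pool_with_switches(x)
reconstructed = unpool(pooled, switches)
```

Here `reconstructed` keeps only the maxima of `x` at their original positions, which is why the switch variables preserve the spatial structure of the stimulus.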

<a name = "Visual-NS"></a>
### Network Structure
<div style = "text-align: center;">
    <img src="./images/Visual CNN network structure.png" style="width:90%;" >
</div>

<a name = "Visual-Vis"></a>
### Visualization

<a name = "Visual-Vis-1"></a>
#### Feature maps at different layers
<div style = "text-align: center;">
    <img src="./images/Visual CNN.png" style="width:100%;" >
</div>

**Takeaways:**
1. **Visualizations (constructed via deconvnet) have lower variation than the corresponding image patches**. They also reveal much more readily the focus of a given activation (e.g. row 1, col 2 in layer 5).
2. **Higher layers show greater invariance** to input deformations compared with lower layers.
3. **There is strong grouping within each feature map**.

<a name = "Visual-Vis-2"></a>
#### Feature evolution during training
The evolution of a randomly chosen subset of model features through training at epochs [1,2,5,10,20,30,40,64] is shown below:
<div style = "text-align: center;">
    <img src="./images/Visual CNN feature evolution.png" style="width:100%;" > <br>
</div>

**Main Takeaway: it is important to let the models train until fully converged.**
- lower layers converge quickly within a few epochs
- upper layers take longer to converge

<a name = "Visual-OT"></a>
### Other Takeaways
- **Small transformations of the image** (translation and scaling) affect the lower layers more than the upper layers. (Feature invariance)
- The output of the studied model is **not invariant to rotation**, except for objects with rotational symmetry.
- The studied model does establish some degree of **correspondence** (analyzed through occlusion of different parts of an object).
- The **overall depth** of the model is **important** for obtaining good performance.
- **Increasing the size of the middle convolution** layers gives a useful gain in performance. However, enlarging the fully connected layers on top of this results in over-fitting.
- **Using an ensemble of models** typically improves predictions.
- **Transfer learning** with the improved ImageNet feature extractor outperforms leading results on some other datasets, casting **doubt on the utility of benchmarks with small (i.e. $<10^4$) training sets**.

<a name = "NST"></a>
## Neural Style Transfer
<div style = "text-align: center;">
    <img src="./images/Neural Style Transfer.png" style="width:80%;" > <br>
    Source: <a href = "https://arxiv.org/pdf/1508.06576.pdf">A Neural Algorithm of Artistic Style</a>, Bethge, Ecker &amp; Gatys (2015)
</div>

<br>

To perform a **neural style transfer**, we need a **content image $C$** and a **style image $S$**. The goal is to **generate an image $G$** that has the content of $C$ in the style of $S$. The approach involves the following steps:
1. pick a **pre-trained model**, e.g. VGG-16.
2. define a **cost function** based on $C,S,G$.
3. initialize an image $G$ and run **gradient descent in input space** (updating the pixels of $G$ rather than the model parameters).
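
To illustrate what gradient descent in input space means, here is a toy sketch. It assumes a frozen linear map `W` as a stand-in for the pre-trained network and uses only a squared-error cost on its outputs; `W`, `c`, `g` and the step size are all hypothetical choices for illustration, not part of the original method.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))  # frozen toy "network" (assumption): features = W @ image
c = rng.normal(size=16)       # flattened content image C
g = rng.normal(size=16)       # generated image G, randomly initialized

def cost(image):
    """0.5 * squared distance between the features of `image` and of C."""
    diff = W @ image - W @ c
    return 0.5 * float(diff @ diff)

initial_cost = cost(g)

# A step size of 1 / (largest eigenvalue of W^T W) keeps the descent stable.
lr = 1.0 / np.linalg.norm(W, 2) ** 2
for _ in range(500):
    grad = W.T @ (W @ g - W @ c)  # gradient with respect to the image G
    g -= lr * grad                # update the pixels; W is never changed

final_cost = cost(g)
```

The only quantity being updated is `g`, which mirrors step 3: the network weights stay fixed and the image itself is optimized.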

<a name = "NST-CF"></a>
### Cost Function
$$
J(G) = \alpha \, J_{\text{content}}(C,G) + \beta \, J_{\text{style}}(S,G)
$$
where $\alpha$ and $\beta$ control the weights of content and style in the generated image. A larger $\alpha$ (or a smaller $\beta$) emphasizes content over style.

<a name = "NST-CCF"></a>
### Content Cost Function
An example of a **content cost function** is:
$$
J_{\text{content}}(C,G) = \frac{1}{2} {\left\Vert a^{[l](C)} - a^{[l](G)} \right\Vert}^2
$$
where $a^{[l](C)}$ and $a^{[l](G)}$ are the activations of layer $l$ for the images $C$ and $G$. If $a^{[l](C)}$ and $a^{[l](G)}$ are similar, both images have similar content.
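
A minimal NumPy sketch of this content cost, assuming the layer activations have already been extracted as arrays of the same shape (the function name `content_cost` is hypothetical):

```python
import numpy as np

def content_cost(a_C, a_G):
    """0.5 * squared Euclidean distance between two activation volumes."""
    return 0.5 * float(np.sum((a_C - a_G) ** 2))

# Toy activations of shape (n_H, n_W, n_C) = (2, 2, 3).
a_C = np.ones((2, 2, 3))
a_G = np.zeros((2, 2, 3))
j_content = content_cost(a_C, a_G)
```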

<a name = "NST-SCF"></a>
### Style Cost Function
As a measure of the similarity in style, we consider the correlation between activations across channels. (E.g. in a given style, stripes may tend to appear in orange and with a certain stroke.)
#### Style Matrix
To define the style cost function, we first introduce the **Style Matrix** (also known as the **Gram Matrix**). Let $a^{[l]}_{i,j,k}$ be the activation of layer $l$ at position $(i,j,k)$ (height, width, channel). Then the Style Matrix $G^{[l]}$ has shape $n_C^{[l]} \times n_C^{[l]}$ and is defined by:
$$
G^{[l]} = \left[\begin{array}{ccccc}
G_{11}^{[l]} & G_{12}^{[l]} & \cdots & G_{1,n_C^{[l]}-1}^{[l]} & G_{1,n_C^{[l]}}^{[l]} \\
G_{21}^{[l]} & G_{22}^{[l]} & \cdots & G_{2,n_C^{[l]}-1}^{[l]} & G_{2,n_C^{[l]}}^{[l]} \\
\vdots & \vdots & \ddots & \vdots & \vdots\\
G_{n_C^{[l]}-1,1}^{[l]} & G_{n_C^{[l]}-1,2}^{[l]} & \cdots & G_{n_C^{[l]}-1,n_C^{[l]}-1}^{[l]} & G_{n_C^{[l]}-1,n_C^{[l]}}^{[l]} \\
G_{n_C^{[l]},1}^{[l]} & G_{n_C^{[l]},2}^{[l]} & \cdots & G_{n_C^{[l]},n_C^{[l]}-1}^{[l]} & G_{n_C^{[l]},n_C^{[l]}}^{[l]}
\end{array}\right]
$$
where
$$
G_{kk^\prime}^{[l]} = \sum_{i=1}^{n_H^{[l]}}\sum_{j=1}^{n_W^{[l]}} a_{ijk}^{[l]}a_{ijk^\prime}^{[l]}
$$
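
The double sum above is a matrix product once the spatial dimensions are flattened. A minimal NumPy sketch, assuming activations stored as an array of shape $(n_H, n_W, n_C)$ (the function name `gram_matrix` is hypothetical):

```python
import numpy as np

def gram_matrix(a):
    """Style (Gram) matrix of an activation volume of shape (n_H, n_W, n_C)."""
    n_H, n_W, n_C = a.shape
    flat = a.reshape(n_H * n_W, n_C)  # one row per spatial position
    return flat.T @ flat              # entry (k, k') sums a_ijk * a_ijk' over i, j

rng = np.random.default_rng(0)
a = rng.normal(size=(4, 5, 3))
G_l = gram_matrix(a)
```

The result is symmetric with shape $n_C^{[l]} \times n_C^{[l]}$, and its diagonal entries measure how strongly each channel is activated overall.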

#### Style Cost Function
Using the style matrices of $S$ and $G$, we define the **style cost function** of layer $l$ as follows:
$$
J_{\text{style}}^{[l]}(S,G) = \frac{1}{{\left(2n_H^{[l]}n_W^{[l]}n_C^{[l]}\right)}^2} {\left\Vert G^{[l](S)} - G^{[l](G)} \right\Vert}_F^2 = \frac{1}{{\left(2n_H^{[l]}n_W^{[l]}n_C^{[l]}\right)}^2} \sum_{k} \sum_{k^\prime}  {\left(G_{kk^\prime}^{[l](S)} - G_{kk^\prime}^{[l](G)}\right)}^2
$$
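
A minimal NumPy sketch of the per-layer style cost, assuming both activation volumes come from the same layer and share the shape $(n_H, n_W, n_C)$ (the function names are hypothetical):

```python
import numpy as np

def gram_matrix(a):
    """Style matrix of an activation volume of shape (n_H, n_W, n_C)."""
    n_H, n_W, n_C = a.shape
    flat = a.reshape(n_H * n_W, n_C)
    return flat.T @ flat

def style_layer_cost(a_S, a_G):
    """Normalized squared Frobenius distance between the two style matrices."""
    n_H, n_W, n_C = a_S.shape
    diff = gram_matrix(a_S) - gram_matrix(a_G)
    return float(np.sum(diff ** 2)) / (2 * n_H * n_W * n_C) ** 2

# Toy activations of shape (1, 1, 2).
a_S = np.ones((1, 1, 2))
a_G = np.zeros((1, 1, 2))
j_style_l = style_layer_cost(a_S, a_G)
```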

We can use information from multiple layers for improved performance:
$$
J_{\text{style}}(S,G) = \sum_{l} \lambda^{[l]} J_{\text{style}}^{[l]}(S,G)
$$
where $\lambda^{[l]}$ governs the contribution of layer $l$ to the style cost function.
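
The weighted sum over layers can be sketched as follows. This assumes the activations of each chosen layer are available in dictionaries keyed by a layer name; the names `style_cost`, `"l1"`, `"l2"` and the weights are hypothetical.

```python
import numpy as np

def style_layer_cost(a_S, a_G):
    """Per-layer style cost for activation volumes of shape (n_H, n_W, n_C)."""
    n_H, n_W, n_C = a_S.shape
    gram_S = np.einsum("ijk,ijm->km", a_S, a_S)
    gram_G = np.einsum("ijk,ijm->km", a_G, a_G)
    return float(np.sum((gram_S - gram_G) ** 2)) / (2 * n_H * n_W * n_C) ** 2

def style_cost(acts_S, acts_G, lambdas):
    """Sum of per-layer style costs, each weighted by its lambda."""
    return sum(
        lam * style_layer_cost(acts_S[name], acts_G[name])
        for name, lam in lambdas.items()
    )

# Toy activations for two layers with different shapes.
acts_S = {"l1": np.ones((1, 1, 2)), "l2": np.full((1, 1, 1), 2.0)}
acts_G = {"l1": np.zeros((1, 1, 2)), "l2": np.zeros((1, 1, 1))}
lambdas = {"l1": 0.2, "l2": 0.8}
j_style = style_cost(acts_S, acts_G, lambdas)
```

Setting a layer's weight to zero removes it from the style cost, so the dictionary of weights doubles as the choice of style layers.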