# Special applications

## Face recognition

* face verification (is the input image the claimed person? 1:1 problem) vs recognition (check the input image against a DB of n persons, 1:n problem, higher accuracy needed)
* one-shot learning problem -> recognition based on 1 example (!)
* a traditional CNN classifier won't work (too few examples per person, retraining needed for every new person); learning a similarity function does
* if d(img1,img2) <= t, it is the same person, otherwise not

### Siamese network

* [paper](https://arxiv.org/pdf/1701.01876)
* the same CNN (conv, pooling, FC layers) is applied to both images; the activations of an FC layer deep in the network serve as the encoding $f(x)$ of a picture
* the difference between images can be calculated as $d(x^{(1)},x^{(2)})= \|{f(x^{(1)})-f(x^{(2)})}\|^2_2$
* the params of the NN are learned such that the distance is small for the same person and large otherwise
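
The distance above, combined with the threshold rule $d \leq t$ from the previous section, can be sketched in NumPy. This is a minimal sketch: `encode` is a hypothetical stand-in for the trained network $f$ (a fixed random projection), and the image size and threshold are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for the trained encoder f: a fixed random projection
# from a flattened 8x8 image to a 128-dimensional encoding.
W = rng.normal(size=(128, 64))

def encode(img):
    """Map an image to its encoding f(x) (illustrative only)."""
    return W @ img.reshape(-1)

def distance(img1, img2):
    """Squared L2 distance between encodings: ||f(x1) - f(x2)||_2^2."""
    diff = encode(img1) - encode(img2)
    return float(diff @ diff)

def same_person(img1, img2, threshold):
    """Verification rule: accept when d(img1, img2) <= threshold."""
    return distance(img1, img2) <= threshold

img_a = rng.normal(size=(8, 8))
img_b = img_a + 0.01 * rng.normal(size=(8, 8))  # near-duplicate of img_a
img_c = rng.normal(size=(8, 8))                 # unrelated image

print(distance(img_a, img_b), distance(img_a, img_c))
```

In a real system the encoder is a trained CNN and the threshold is tuned on held-out pairs.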

### Triplet loss function

* [paper](https://arxiv.org/pdf/1503.03832)
* anchor image A vs positive example (same person) P vs negative example (different person) N
    * $\| f(A)-f(P) \|^2 \leq \| f(A)-f(N)\|^2$, that is $d(A,P) \leq d(A,N)$
    * $\| f(A)-f(P)\|^2 - \| f(A)-f(N) \|^2 \leq 0$, this can be trivially satisfied by an encoding that outputs the same vector (e.g. all zeros) for every image
    * $\| f(A)-f(P)\|^2 - \| f(A)-f(N) \|^2 + \alpha \leq 0$, solved by adding a margin $\alpha$
* loss function
    * $L(A,P,N) = \max(\| f(A)-f(P)\|^2 - \| f(A)-f(N) \|^2 + \alpha, 0)$, i.e. if the objective is achieved the loss is 0, otherwise the loss is positive
* cost function
    * $J = \sum_{i=1}^n L(A^{(i)},P^{(i)},N^{(i)})$
* training 
    * multiple images per person are needed (e.g. 10k pics of 1k persons)
    * choose hard triplets for training, where $d(A,P) \approx d(A,N)$
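
The loss and cost above can be written in a few lines of NumPy. This is a minimal sketch operating on precomputed encodings; the margin `alpha=0.2` and the toy 2-D encodings are assumptions for illustration.

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """Per-triplet loss: max(||f(A)-f(P)||^2 - ||f(A)-f(N)||^2 + alpha, 0)."""
    d_ap = np.sum((f_a - f_p) ** 2, axis=-1)
    d_an = np.sum((f_a - f_n) ** 2, axis=-1)
    return np.maximum(d_ap - d_an + alpha, 0.0)

def triplet_cost(f_a, f_p, f_n, alpha=0.2):
    """Cost J: sum of the per-triplet losses over the batch."""
    return float(np.sum(triplet_loss(f_a, f_p, f_n, alpha)))

# Two toy triplets with 2-D encodings (one row per triplet).
f_a = np.array([[0.0, 0.0], [0.0, 0.0]])
f_p = np.array([[0.1, 0.0], [1.0, 0.0]])
f_n = np.array([[1.0, 0.0], [0.5, 0.0]])

losses = triplet_loss(f_a, f_p, f_n)
print(losses, triplet_cost(f_a, f_p, f_n))
```

The first triplet already satisfies the margin, so it contributes zero loss; the second has the negative closer than the positive, so it is penalized.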


### Face verification & binary classification

* the deep image representations (encodings) of both images are fed into a logistic unit in the output layer (classifier outputs whether same person or not)
* the sigmoid can be applied to the element-wise absolute difference of the encodings, such as $\hat{y} = \sigma(\sum_{k=1}^{128} w_k|f(x^{(i)})_k-f(x^{(j)})_k|+b)$
    * other distance measures can be considered, such as the chi-squared similarity $\frac{(f(x^{(i)})_k-f(x^{(j)})_k)^2}{f(x^{(i)})_k+f(x^{(j)})_k}$
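
The prediction $\hat{y}$ can be sketched as follows. This is a minimal sketch on precomputed 128-dimensional encodings; the weights `w` and bias `b` are hypothetical, untrained values chosen only to make the behaviour visible.

```python
import numpy as np

def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def same_person_probability(f_i, f_j, w, b):
    """y_hat = sigmoid(sum_k w_k * |f(x_i)_k - f(x_j)_k| + b)."""
    return float(sigmoid(np.dot(w, np.abs(f_i - f_j)) + b))

rng = np.random.default_rng(0)
f_i = rng.normal(size=128)
f_j = rng.normal(size=128)

# Hypothetical, untrained parameters: negative weights make a larger
# difference between encodings push the prediction toward 0.
w = -np.ones(128)
b = 2.0

p_identical = same_person_probability(f_i, f_i, w, b)
p_different = same_person_probability(f_i, f_j, w, b)
print(p_identical, p_different)
```

In practice `w` and `b` are learned jointly with the encoder on pairs labelled same/different.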

## Neural style transfer

* [paper](https://arxiv.org/pdf/1508.06576)
* learned representation [paper](https://arxiv.org/pdf/1311.2901)
    * pick a unit in layer 1, visualize the image patches that maximize the unit's activation
        * repeat for other units
        * color gradients, lines, etc
    * pick a unit in a deeper layer, visualize the image patches that maximize the unit's activation
        * repeat for other units
        * textures
    * pick a unit in a still deeper layer, visualize the image patches that maximize the unit's activation
        * repeat for other units
        * parts of objects, parts of animals, etc
    * pick a unit in the deepest layers, visualize the image patches that maximize the unit's activation
        * repeat for other units
        * whole objects, animals, people, text, ...

### Cost function

* C content image, S style image, G generated image; the learned representations explored above are needed to define the costs
* $J(G) = \alpha \cdot J_{content}(C,G) + \beta \cdot J_{style}(S,G)$ is the cost of the generated image, where the parameters $\alpha$ and $\beta$ weight the content and style terms
* algorithm
    * initialize G randomly
    * use gradient descent to minimize J(G), $G := G - \frac{\partial}{\partial G} J(G)$ (scaled by a learning rate)
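
The update loop can be sketched with a toy cost. This sketch makes strong simplifying assumptions: the feature extractor is replaced by the identity, so $J(G) = \frac{1}{2}\|G - C\|^2$ with the analytic gradient $G - C$, and the learning rate is an assumed value. A real implementation would backpropagate the full $J(G)$ through a pre-trained network.

```python
import numpy as np

rng = np.random.default_rng(0)
C = rng.uniform(size=(4, 4, 3))  # toy "content image"
G = rng.uniform(size=(4, 4, 3))  # initialize G randomly

def cost(G):
    """Toy stand-in for J(G): 0.5 * ||G - C||^2 (identity "features")."""
    return 0.5 * float(np.sum((G - C) ** 2))

def grad(G):
    """Analytic gradient of the toy cost with respect to G."""
    return G - C

learning_rate = 0.1  # assumed step size
history = [cost(G)]
for _ in range(100):
    G = G - learning_rate * grad(G)  # gradient descent update on the pixels of G
    history.append(cost(G))

print(history[0], history[-1])
```

Note that the optimization variable is the image G itself, not the network weights.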

### Content cost function

* we use a layer $l$ to compute the content cost; $l$ is chosen somewhere between the shallow and deep layers
* a pre-trained net can be used (Inception, VGG, ...)
* let $a^{[l](C)}$ and $a^{[l](G)}$ be the activations of layer $l$ on the two images
* if the activations are similar, the content is similar
* $J_{content}(C,G) = \frac{1}{2} \|a^{[l](C)} - a^{[l](G)}\|^2$, i.e. the sum of squared element-wise differences between the activations
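
A minimal sketch of the content cost, assuming the layer-$l$ activations have already been extracted from a pre-trained network as arrays of shape (n_h, n_w, n_c); the toy activation values are made up for illustration.

```python
import numpy as np

def content_cost(a_C, a_G):
    """J_content = 0.5 * sum of squared element-wise differences of activations."""
    return 0.5 * float(np.sum((a_C - a_G) ** 2))

# Toy activations of shape (n_h, n_w, n_c) = (2, 2, 3).
a_C = np.zeros((2, 2, 3))
a_G = np.ones((2, 2, 3))

print(content_cost(a_C, a_G))
```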

### Style cost function

* we use a layer $l$ to measure style
* style is defined as the correlation between activations across conv channels
    * if two channels are correlated, the features they detect tend to occur together (e.g. a texture and a type of line)
    * if they are not correlated, the features are not usually seen together (texture, shapes, ...)
    * this can be leveraged to compare the style image and the generated image
* style matrix (measuring the correlations)
    * let $a^{[l]}_{i,j,k}$ be the activation at position (i, j, k), indexing height, width, and channel
    * $G^{[l]}$ is a matrix of size $n_c^{[l]} \times n_c^{[l]}$; the entry $G^{[l]}_{k,k'}$ measures the correlation between channels $k$ and $k'$
    * $G^{[l]}_{k,k'} = \sum_{i=1}^{n_h^{[l]}} \sum_{j=1}^{n_w^{[l]}} a^{[l]}_{i,j,k} a^{[l]}_{i,j,k'}$, strictly an unnormalized cross-covariance (Gram matrix) rather than a true correlation
    * $G^{[l](S)}_{k,k'} = \sum_{i=1}^{n_h^{[l]}} \sum_{j=1}^{n_w^{[l]}} a^{[l](S)}_{i,j,k} a^{[l](S)}_{i,j,k'}$
    * $G^{[l](G)}_{k,k'} = \sum_{i=1}^{n_h^{[l]}} \sum_{j=1}^{n_w^{[l]}} a^{[l](G)}_{i,j,k} a^{[l](G)}_{i,j,k'}$
    * $J_{style}^{[l]} (S,G) = \frac{1}{(2 n_h^{[l]} n_w^{[l]} n_c^{[l]})^2} \sum_k \sum_{k'} (G_{k,k'}^{[l](S)}-G_{k,k'}^{[l](G)})^2$

* it is good to use style cost function for multiple different layers such as $J_{style} (S,G) = \sum_l \lambda^{[l]} J_{style}^{[l]} (S, G)$
* and the goal is to optimize $J(G) = \alpha \cdot J_{content}(C,G) + \beta \cdot J_{style}(S,G)$
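
The style matrix and style cost can be sketched in NumPy. This is a minimal sketch assuming activations of shape (n_h, n_w, n_c) have already been extracted; the per-layer cost sums the squared element-wise differences of the two style matrices and divides by $(2 n_h n_w n_c)^2$. The function names and toy inputs are illustrative.

```python
import numpy as np

def gram_matrix(a):
    """Style matrix: G[k, k'] = sum over (i, j) of a[i, j, k] * a[i, j, k']."""
    n_h, n_w, n_c = a.shape
    flat = a.reshape(n_h * n_w, n_c)  # one row per spatial position
    return flat.T @ flat              # shape (n_c, n_c)

def layer_style_cost(a_S, a_G):
    """Per-layer style cost between style and generated activations."""
    n_h, n_w, n_c = a_S.shape
    diff = gram_matrix(a_S) - gram_matrix(a_G)
    return float(np.sum(diff ** 2) / (2 * n_h * n_w * n_c) ** 2)

def style_cost(acts_S, acts_G, lambdas):
    """Weighted sum of per-layer style costs over several layers."""
    return sum(lam * layer_style_cost(s, g)
               for lam, s, g in zip(lambdas, acts_S, acts_G))

rng = np.random.default_rng(0)
a = rng.normal(size=(3, 4, 2))
gram = gram_matrix(a)

# Toy single-position activations with 2 channels.
a_S = np.ones((1, 1, 2))
a_G = np.zeros((1, 1, 2))
print(layer_style_cost(a_S, a_G))
```

Flattening the spatial axes turns the double sum over (i, j) into a single matrix product, which is how the style matrix is usually computed in practice.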

## Conv generalization

* convolution can be generalized to 1D, 2D, or 3D data
* for 3D volume, we need 3D filter
    * input: 3D volume with 1 channel (14 x 14 x 14 x 1)
    * 16 3D filters (5 x 5 x 5 x 1)
    * output (10 x 10 x 10 x 16)
    * 32 3D filters (5 x 5 x 5 x 16)
    * output (6 x 6 x 6 x 32)
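
The shapes above follow from the usual output-size formula applied to each spatial axis. A minimal sketch, assuming stride 1 and no padding as in the example; `conv_output_shape` is a hypothetical helper, not a library function.

```python
def conv_output_shape(input_shape, filter_size, n_filters, stride=1, padding=0):
    """Output shape of an N-D convolution; input_shape ends with the channel axis."""
    *spatial, _channels = input_shape
    out = [(n + 2 * padding - filter_size) // stride + 1 for n in spatial]
    return (*out, n_filters)

layer1 = conv_output_shape((14, 14, 14, 1), filter_size=5, n_filters=16)
layer2 = conv_output_shape(layer1, filter_size=5, n_filters=32)
print(layer1, layer2)
```

The same helper covers the 1D and 2D cases, since only the number of spatial axes changes.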
