## Face Recognition 

#### What is face recognition? 

* Face verification vs. face recognition 
   * Verification - given an image and name/ID, output whether the image is the claimed person 
   * Recognition - given you have $K$ persons in a database and their respective names/IDs, when you get an input image, output ID if image is any of the $K$ persons (or "not recognized")
   
#### One Shot Learning 

* One-shot learning - recognize that person given just one image of that person's face
    * aka learn from just **one** example 
    * to make it work, we learn a "similarity" function 
        * $d($img1, img2$)$ $=$ degree of difference between images
            * If $d($img1,img2$) \leq \tau \rightarrow$ "same" else "different"
            * pairwise comparison between the input image and each of the $K$ persons in the database
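
The pairwise comparison above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production pipeline: it assumes each face has already been turned into a small vector, and the names `recognize`, `database` and the value of `tau` are hypothetical.

```python
import numpy as np

def d(enc1, enc2):
    """Degree of difference: squared Euclidean distance between two encodings."""
    diff = np.asarray(enc1, dtype=float) - np.asarray(enc2, dtype=float)
    return float(np.dot(diff, diff))

def recognize(query, database, tau):
    """Compare the query with every person; return the closest ID if within tau."""
    best_id, best_dist = None, float("inf")
    for person_id, reference in database.items():
        dist = d(query, reference)
        if dist < best_dist:
            best_id, best_dist = person_id, dist
    return best_id if best_dist <= tau else "not recognized"

# Hypothetical 3-dimensional encodings, one stored example per person (one-shot).
database = {"alice": [0.0, 1.0, 0.0], "bob": [1.0, 0.0, 0.0]}
print(recognize([0.1, 0.9, 0.0], database, tau=0.5))  # close to alice's example
print(recognize([5.0, 5.0, 5.0], database, tau=0.5))  # far from everyone
```

Adding a new person only requires adding one entry to `database`; nothing has to be retrained, which is the point of learning a similarity function instead of a $K$-way classifier.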
            
#### Siamese Network 

* Taigman et al., 2014. DeepFace: Closing the gap to human-level performance in face verification
* Siamese Network - running two **identical CNNs** on two different inputs 
* Siamese Network explained: Suppose you have images $x^{(1)}$ and $x^{(2)} $ such that
    * $x^{(1)} \rightarrow$ ConvNet until last FC layer $\rightarrow f(x^{(1)}) =$ "encoding of $x^{(1)}$"
    * $x^{(2)} \rightarrow$ ConvNet until last FC layer $\rightarrow f(x^{(2)}) =$ "encoding of $x^{(2)}$"
    * Then, we can define $d(x^{(1)}, x^{(2)}) = \Vert {f(x^{(1)}) -  f(x^{(2)})}\Vert^2_2$ such that:
        * $\Vert {f(x^{(i)}) -  f(x^{(j)})}\Vert^2_2$ is small if $x^{(i)}$ and  $x^{(j)}$ are the same person 
        * $\Vert {f(x^{(i)}) -  f(x^{(j)})}\Vert^2_2$ is large if $x^{(i)}$ and  $x^{(j)}$ are different people 
            * Train the parameters of the ConvNet with backprop so that these conditions hold
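
As a rough sketch of the "two identical networks" idea, the snippet below applies one shared set of parameters to both inputs and compares the resulting encodings. The single weight matrix `W` is only a hypothetical stand-in for a ConvNet, and the image and encoding sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for a ConvNet: one shared weight matrix that maps a
# flattened 8x8 image to a 4-dimensional encoding.
W = rng.normal(size=(4, 64))

def f(x):
    """Encoding of image x, computed with the shared parameters W."""
    return np.tanh(W @ x.ravel())

def d(x1, x2):
    """Squared L2 distance between the encodings of the two images."""
    diff = f(x1) - f(x2)
    return float(diff @ diff)

x1 = rng.normal(size=(8, 8))
x2 = rng.normal(size=(8, 8))
print(d(x1, x1))  # identical inputs give 0.0
print(d(x1, x2))  # different inputs give a positive distance
```

Because both branches call the same `f`, there is only one set of parameters to train; the "two networks" are really one network evaluated twice.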


            
#### Triplet Loss 

* Schroff et al. 2015, FaceNet: A unified embedding for face recognition and clustering 
* One way to learn parameters that give a good encoding for faces is to define a triplet loss function and apply gradient descent to it
* Suppose $A$ = anchor image, $P$ = positive (same person), $N$ = negative (different person)
    * Want $\Vert f(A) - f(P)\Vert^2 \leq \Vert f(A) - f(N)\Vert^2$ 
        * Trivial solution: if $f$ outputs zero (or the same encoding) for every image, then $\Vert f(A) - f(P)\Vert^2 - \Vert f(A) - f(N)\Vert^2 \leq 0 $ is satisfied without learning anything. To rule this out, we add a margin $\alpha$ such that:
           *  $\Vert f(A) - f(P)\Vert^2 - \Vert f(A) - f(N)\Vert^2 + \alpha \leq 0 $
        * Thus, we have $\Vert f(A) - f(P)\Vert^2 + \alpha \leq \Vert f(A) - f(N)\Vert^2$
        
* Triplet loss function: Given 3 images $A, P, N$:
    * Define $ L(A,P,N) = \max(\Vert f(A) - f(P)\Vert^2 - \Vert f(A) - f(N)\Vert^2 + \alpha, 0)$
    * Thus, overall cost is $J = \sum\limits_{i=1}^{m}{L(A^{(i)},P^{(i)},N^{(i)})}$
    * Training set can be: 10k pictures of 1k people with some pairs $A$ and $P$ of the same person 
        * During training, if $A,P,N$ are chosen randomly, then  $d(A,P) + \alpha \leq d(A,N)$ is easily satisfied
        * So, choose triplets that are "hard" to train on s.t.:
            * $d(A,P) \approx d(A,N)$, so that the model has to work "extra hard" to put a margin of at least $\alpha$ between the positive and the negative distance 
        * use gradient descent to minimize the cost as per usual 
* some companies have trained on more than 100 million images; instead of training from scratch, you can use the parameters they have published 
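
The loss for a single triplet can be written directly from the definition. The sketch below works on precomputed encodings; the margin value `alpha=0.2` and the example vectors are assumptions chosen only to contrast an "easy" and a "hard" negative.

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """L(A, P, N) = max(||f(A)-f(P)||^2 - ||f(A)-f(N)||^2 + alpha, 0)."""
    d_ap = np.sum((f_a - f_p) ** 2)
    d_an = np.sum((f_a - f_n) ** 2)
    return float(max(d_ap - d_an + alpha, 0.0))

f_a = np.array([0.0, 0.0])      # anchor
f_p = np.array([0.1, 0.0])      # same person: close to the anchor
easy_n = np.array([2.0, 0.0])   # different person, far from the anchor
hard_n = np.array([0.3, 0.0])   # different person, but close to the anchor

print(triplet_loss(f_a, f_p, easy_n))  # margin already satisfied: loss is 0
print(triplet_loss(f_a, f_p, hard_n))  # margin violated: loss is positive

# Overall cost J: sum of the per-triplet losses.
J = sum(triplet_loss(f_a, f_p, n) for n in (easy_n, hard_n))
print(J)
```

The easy triplet contributes nothing to $J$ and therefore no gradient, which is why training focuses on hard triplets.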
    

#### Face Verification and Binary Classification

* Have a siamese NN and have two encodings $f(x^{(i)})$ feed into a logistic regression unit to make a prediction $\hat{y}$ where $\hat{y} = 1$ if they are the same person and $0$ if they are not.
    * Alternative to the triplet loss 
    * This turns face recognition into a binary classification problem!
    * Let's formulate $\hat{y}$ as follows. Say the encoding has $h$ features and $x^{(i)}$, $x^{(j)}$ are the two input faces. Then:
        * $\hat{y} = \sigma(\sum\limits_{k=1}^{h} w_k \vert f(x^{(i)})_k - f(x^{(j)})_k \vert + b) $
        * An alternative to the absolute difference is the chi-square similarity: $\dfrac{(f(x^{(i)})_k - f(x^{(j)})_k)^2}{f(x^{(i)})_k + f(x^{(j)})_k}$
    * precompute the encodings of the faces in the database, so at prediction time you only have to compute the encoding of the new image 
    * use different pairs to train the Siamese NN
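
A minimal sketch of the logistic unit on top of two encodings follows. The weights `w`, bias `b` and the encodings are made-up values for illustration; in practice they would be learned from labeled pairs.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def same_person_probability(f_i, f_j, w, b):
    """y_hat = sigmoid(sum_k w_k * |f_i[k] - f_j[k]| + b)."""
    return float(sigmoid(np.dot(w, np.abs(f_i - f_j)) + b))

# Hypothetical parameters: negative weights, so larger differences push the
# prediction toward 0 ("different people").
w = np.array([-4.0, -4.0, -4.0])
b = 2.0

stored = np.array([0.2, 0.7, 0.1])      # precomputed database encoding
new_same = np.array([0.25, 0.65, 0.1])  # new image, similar encoding
new_other = np.array([0.9, 0.1, 0.8])   # new image, very different encoding

print(same_person_probability(new_same, stored, w, b))   # above 0.5
print(same_person_probability(new_other, stored, w, b))  # below 0.5
```

Only the encoding of the new image has to be computed at prediction time; `stored` is read from the database.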

## Neural Style Transfer 

#### What is neural style transfer?

* Given a content image $C$ and a style image $S$, output a generated image $G$ that is in the style of $S$ and has the contents of $C$
    * Need to look at various features in the ConvNet, both shallow and deep for a better intuition 
    
#### What are deep ConvNets learning? 
* Zeiler and Fergus, 2013. Visualizing and understanding convolutional networks 
* Pick a unit in layer 1. Find 9 image patches that maximize the unit's activation. Repeat for other units. 
    * Figure out what a NN is actually learning in this first layer 
    * Do this also to later layers to visualize what's happening
    
#### Cost Function 

* Gatys et al., 2015. A neural algorithm of artistic style. 
* Remember $C,S,G$
* We create a cost function $J(G)$ to measure how good a generated image is.
    * $J(G) = \alpha J_{\text{content}}(C,G) + \beta J_{\text{style}}(S,G)$
* Process to generate image $G$
    * 1.) initialize $G$ pixels randomly: $G: 100 \times 100 \times 3$
    * 2.) Use gradient descent to minimize $J(G)$
        * $G := G - \eta \dfrac{\partial}{\partial G}J(G)$, where $\eta$ is the learning rate
        * i.e. update the pixels 
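
The two steps can be sketched with a toy cost in place of the real $J(G)$. Everything specific here is an assumption: the cost is just half the squared pixel distance to a hypothetical `target` image (so that its gradient is easy to write by hand), and `learning_rate` is an arbitrary step size.

```python
import numpy as np

rng = np.random.default_rng(0)
target = rng.uniform(size=(100, 100, 3))  # hypothetical image the toy cost pulls G toward

def J(G):
    """Toy cost standing in for J(G): half the squared pixel distance to target."""
    return 0.5 * np.sum((G - target) ** 2)

def dJ_dG(G):
    """Gradient of the toy cost with respect to every pixel of G."""
    return G - target

G = rng.uniform(size=(100, 100, 3))       # 1.) initialize G's pixels randomly
initial_cost = J(G)
learning_rate = 0.1                       # assumed step size
for _ in range(100):                      # 2.) gradient descent on the pixels
    G = G - learning_rate * dJ_dG(G)

print(initial_cost, J(G))                 # the cost shrinks as the pixels are updated
```

The key point is that the parameters being optimized are the pixels of $G$ themselves, not the weights of a network.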



#### Content Cost Function 

* Remember $J(G) = \alpha J_{\text{content}}(C,G) + \beta J_{\text{style}}(S,G)$
    
* Let's figure out the content cost function $J_{\text{content}}(C,G)$
* Use a hidden layer $l$ to compute content cost 
* Use pre-trained ConvNet 
* Let $a^{[l](C)}$ and $a^{[l](G)}$ be the activation of layer $l$ on the images 
    * if the $a$s are similar, both have similar content 
        * $J_{\text{content}}(C,G) = \dfrac{1}{2}\Vert a^{[l](C)} - a^{[l](G)} \Vert^2$ aka element-wise sum of squares of differences between the activations in layer $l$ of the content and generated image 
            * perform gradient descent on this content cost function 
            * incentivize the algorithm to find image $G$ so that the hidden layers are like the content image  
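
A short sketch of this content cost, assuming the layer-$l$ activations have already been extracted from a pre-trained ConvNet. The random arrays below are placeholders for those activations, and the shape is arbitrary.

```python
import numpy as np

def content_cost(a_C, a_G):
    """J_content(C, G) = 1/2 * ||a_C - a_G||^2, summed over all entries of layer l."""
    return 0.5 * float(np.sum((a_C - a_G) ** 2))

rng = np.random.default_rng(0)
# Hypothetical layer-l activations with shape (n_H, n_W, n_C) = (4, 4, 3).
a_C = rng.normal(size=(4, 4, 3))
a_G = rng.normal(size=(4, 4, 3))

print(content_cost(a_C, a_C))  # identical activations give 0.0
print(content_cost(a_C, a_G))  # different activations give a positive cost
```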



#### Style Cost Function 

*  What **is** the style of an image? 
    * Defined as the **correlation between activations across channels** 
        * 'correlated' activations across different channels - wherever the feature detected by one channel appears in the image, the feature detected by the other channel tends to appear there too 
            * Ng's example uses vertical lines and the color orange
        * 'uncorrelated' activations across different channels - where the feature detected by one channel appears, the feature detected by the other channel tends not to appear alongside it  
    * We can use the degree of correlations as a measure of the style
    
* Style matrix, basically we use multiplication to measure correlation between channels  
    * Let $a^{[l]}_{i,j,k} =$ activation at $(i,j,k)$. $G^{[l]}$ is $n_C^{[l]} \times n_C^{[l]}$
        * We denote by $G^{[l]}_{kk'}$ the entry that measures how correlated channel $k$ is with channel $k'$, where $k, k' = 1,...,n_C^{[l]}$
        * style matrix:  $G^{[l]}_{kk'} = \sum\limits_{i=1}^{n_H^{[l]}}\sum\limits_{j=1}^{n_W^{[l]}} a^{[l]}_{i,j,k} a^{[l]}_{i,j,k'}$
            * large if two channels are correlated, and small if two channels are not correlated 
    * We have a style matrix for both the style image and the generated image, s.t.:
        * style image matrix: $G^{[l](S)}_{kk'} = \sum\limits_{i=1}^{n_H^{[l](S)}}\sum\limits_{j=1}^{n_W^{[l](S)}} a^{[l](S)}_{i,j,k} a^{[l](S)}_{i,j,k'}$
        * generated image matrix: $G^{[l](G)}_{kk'} = \sum\limits_{i=1}^{n_H^{[l](G)}}\sum\limits_{j=1}^{n_W^{[l](G)}} a^{[l](G)}_{i,j,k} a^{[l](G)}_{i,j,k'}$ 
        * 'Gram matrix' in linear algebra 
* Thus, the style cost function for a layer $l$ is: 
    * $J^{[l]}_{\text{style}}(S,G) = \dfrac{1}{(2n_H^{[l]}n_W^{[l]}n_C^{[l]})^2}\Vert G^{[l](S)} - G^{[l](G)}\Vert^2_F = \dfrac{1}{(2n_H^{[l]}n_W^{[l]}n_C^{[l]})^2} \sum\limits_k\sum\limits_{k'}(G^{[l](S)}_{kk'} - G^{[l](G)}_{kk'})^2$
        * the normalization constant $\dfrac{1}{(2n_H^{[l]}n_W^{[l]}n_C^{[l]})^2}$ doesn't matter that much because this cost is multiplied by $\beta$, the hyperparameter we use during optimization 

* Thus, the style cost function over all the layers is:
    * $J_{\text{style}}(S,G) = \sum\limits_l \lambda^{[l]}J_{\text{style}}^{[l]}(S,G)$
        * where $\lambda^{[l]}$ is a hyperparameter that weights layer $l$'s contribution to the style cost 

* THEREFORE, the overall cost function is: 

$$J(G) = \alpha J_{\text{content}}(C,G) + \beta J_{\text{style}}(S,G)$$
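
Putting the pieces together, the sketch below computes the Gram matrix, the per-layer style cost, and the overall $J(G)$ from activations. It assumes the activations have already been extracted from a pre-trained ConvNet; the random arrays, the layer shapes, and the values of `alpha`, `beta` and `lambdas` are all placeholders.

```python
import numpy as np

def gram_matrix(a):
    """G[k, k'] = sum over i, j of a[i, j, k] * a[i, j, k']; shape (n_C, n_C)."""
    return np.einsum("ijk,ijl->kl", a, a)

def layer_style_cost(a_S, a_G):
    """Squared Frobenius distance between Gram matrices, divided by (2 n_H n_W n_C)^2."""
    n_H, n_W, n_C = a_S.shape
    diff = gram_matrix(a_S) - gram_matrix(a_G)
    return float(np.sum(diff ** 2)) / (2 * n_H * n_W * n_C) ** 2

def content_cost(a_C, a_G):
    """1/2 * sum of squared differences between the activations."""
    return 0.5 * float(np.sum((a_C - a_G) ** 2))

def total_cost(a_C, a_G_content, style_pairs, lambdas, alpha=10.0, beta=40.0):
    """J(G) = alpha * J_content + beta * sum_l lambda_l * J_style_l."""
    j_style = sum(lam * layer_style_cost(a_S, a_G)
                  for lam, (a_S, a_G) in zip(lambdas, style_pairs))
    return alpha * content_cost(a_C, a_G_content) + beta * j_style

rng = np.random.default_rng(0)
shapes = [(8, 8, 4), (4, 4, 6)]           # hypothetical (n_H, n_W, n_C) per layer
acts_S = [rng.normal(size=s) for s in shapes]
acts_G = [rng.normal(size=s) for s in shapes]
a_C = rng.normal(size=shapes[1])          # content compared at the second layer

cost = total_cost(a_C, acts_G[1], list(zip(acts_S, acts_G)), lambdas=[0.5, 0.5])
print(gram_matrix(acts_S[0]).shape)       # (4, 4)
print(cost)
```

The `einsum` call is a direct transcription of the double sum over $i$ and $j$; it is one of several equivalent ways to compute the Gram matrix.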


#### 1D and 3D Generalizations

* 1D data, as you would expect
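    * $14 \times 1 * 5 \times 1 \rightarrow 10 \times 16$ (with 16 filters), where $*$ is the convolution operator and the output length is $14 - 5 + 1 = 10$ 
    * a lot of 1D data uses Recurrent Neural Networks, models designed specifically for sequence data 
* 3D data works the same way, with the filter sliding along three axes 
    * $14 \times 14 \times 14 \times 1 * 5 \times 5 \times 5 \times 1 \rightarrow 10 \times 10 \times 10 \times 16$ (with 16 filters)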
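
The 1D shape arithmetic can be checked with a small NumPy sketch. The signal and filter values are random placeholders; the sizes (length 14, filter length 5, 16 filters) follow the example above, with stride 1 and no padding assumed.

```python
import numpy as np

rng = np.random.default_rng(0)
signal = rng.normal(size=14)          # 1D input of length 14, one channel
filters = rng.normal(size=(16, 5))    # 16 filters, each of length 5

# "Valid" cross-correlation (what deep-learning libraries call convolution):
# slide each filter over the signal with stride 1 and no padding.
filter_len = filters.shape[1]
n_out = signal.size - filter_len + 1
output = np.array(
    [[np.dot(signal[t:t + filter_len], f) for f in filters] for t in range(n_out)]
)
print(output.shape)  # (10, 16)
```

The same $n - f + 1$ rule applies independently along each axis in the 2D and 3D cases.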

### Programming Assignment: Art Generation with Neural Style Transfer

#### Tensorflow notes

* t.eval() vs sess.run() (TensorFlow 1.x; t is a tf.Tensor, sess is a tf.Session) 
    * t.eval() fetches the value of only one tensor 
        * equivalent to calling tf.get_default_session().run(t) for some Tensor t
    * sess.run() can fetch the values of multiple tensors in a single step 
* tf.reduce_sum() by default sums all elements in a tensor 
* though tf.set_random_seed(i) sets a random seed, the value of a random tensor still changes between evaluations even though no operations have been applied to it. The random op is re-executed every time it is run, so each evaluation draws a new sample; the seed only makes that sequence of samples repeat from one program run to the next
* use tf.square(x) to square x element-wise  

#### Programming notes 

* remember:

```python
hello = (1, 2)
print(type(hello))  # tuple in python
```

#### Lessons learned 

* Style matrix is Gram matrix! from linear algebra! ayyy :) 
    * (should probably take a linear algebra course and prob and stats course)
* Perception change: 
    * 'One important part of the gram matrix is that the diagonal elements such as $G_{ii}$ also measures how active filter $i$ is. For example, suppose filter $i$ is detecting vertical textures in the image. Then $G_{ii}$ measures how common  vertical textures are in the image as a whole: If $G_{ii}$ is large, this means that the image has a lot of vertical texture.'

* Content cost function used in assignment:
$$J_{content}(C,G) =  \frac{1}{4 \times n_H \times n_W \times n_C}\sum _{ \text{all entries}} (a^{(C)} - a^{(G)})^2 $$

* Style cost function used in assignment:
$$J_{style}^{[l]}(S,G) = \frac{1}{4 \times {n_C}^2 \times (n_H \times n_W)^2} \sum _{i=1}^{n_C}\sum_{j=1}^{n_C}(G^{(S)}_{ij} - G^{(G)}_{ij})^2 $$
    * As such, the goal is to minimize the 'distance' between the Gram matrix of the style image and that of the generated image, i.e. to make each feature (and each pair of features) about as prevalent in the generated image as it is in the style image 
    
* Why does the style matrix have dimensions $n_C \times n_C$?
    * style matrix is a measure of correlation between the activations of different channels for an image
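
One way to see the $n_C \times n_C$ shape is to unroll the activations into a matrix with one row per channel and multiply it by its own transpose. The sketch below does this in NumPy and plugs the result into the assignment's style cost; the function names and the random activations are hypothetical, and this is not the assignment's TensorFlow code.

```python
import numpy as np

def gram_from_unrolled(a):
    """Unroll (n_H, n_W, n_C) activations to (n_C, n_H*n_W) and return A @ A.T."""
    n_H, n_W, n_C = a.shape
    A = a.reshape(n_H * n_W, n_C).T
    return A @ A.T

def layer_style_cost(a_S, a_G):
    """1 / (4 * n_C^2 * (n_H*n_W)^2) * sum_ij (G_S[i, j] - G_G[i, j])^2."""
    n_H, n_W, n_C = a_S.shape
    diff = gram_from_unrolled(a_S) - gram_from_unrolled(a_G)
    return float(np.sum(diff ** 2)) / (4 * n_C**2 * (n_H * n_W) ** 2)

rng = np.random.default_rng(0)
a_S = rng.normal(size=(4, 4, 3))   # hypothetical activations, n_C = 3
a_G = rng.normal(size=(4, 4, 3))

G_S = gram_from_unrolled(a_S)
print(G_S.shape)                   # (3, 3): one entry per pair of channels
print(np.diag(G_S))                # G_ii = sum of squared activations of channel i
print(layer_style_cost(a_S, a_G))
```

Each row of the unrolled matrix holds every activation of one channel, so the product has one entry per pair of channels, and the diagonal entries measure how active each individual channel is.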
    
    
* What you should remember:
    - The style of an image can be represented using the Gram matrix of a hidden layer's activations. However, we get even better results combining this representation from multiple different layers. This is in contrast to the content representation, where usually using just a single hidden layer is sufficient.
    - Minimizing the style cost will cause the image $G$ to follow the style of the image $S$. 

* Need to go back and look at optimizer functions and their benefits/differences