# I- The paper
### 1. Learning to align visual and language data.
Figure 2.  
- key insight: "people make frequent references to some particular, but unknown location in the image."
- build upon Karpathy et al. [24]: learn to ground dependency tree relations to image regions with a ranking objective.  
-> use of bidirectional RNN (BRNN) -> word representations.  
-> simplified objective.
##### 1.1 representing images.
- map images to 20 $h$-dimensional vectors, $\{v_i | i = 1, ..., 20\}$.
- pre-trained Region CNN (RCNN) on ImageNet + finetuned on the 200 classes of the ImageNet Detection Challenge.
- 19 top regions + 1 for the whole image -> $\forall i \in [|1, 20|], v_i = W_m[CNN_{\theta}(I_{b_i})] + b_m$ (1)
  - $W_m$: $h \times 4096$ dimensional matrix
  - $CNN_{\theta}$: maps bounding boxes pixels to $4096$-dimensional vectors.
  - $\theta$: 60 million parameters
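
Equation (1) is an affine projection of each region's CNN feature into the embedding space. A minimal pure-Python sketch, assuming toy sizes (feature dimension 4 instead of 4096, $h = 2$) and made-up weights; `project_region` is a hypothetical helper, not a name from the paper or its code:

```python
def project_region(W_m, b_m, cnn_feat):
    """v_i = W_m * cnn_feat + b_m, eq. (1), written with plain lists."""
    return [sum(w * x for w, x in zip(row, cnn_feat)) + b
            for row, b in zip(W_m, b_m)]

# Toy sizes (assumption): CNN feature dimension 4 (4096 in the notes), h = 2.
W_m = [[1.0, 0.0, 0.0, 0.0],
       [0.0, 1.0, 1.0, 0.0]]
b_m = [0.5, -1.0]
region_feats = [[1.0, 2.0, 3.0, 4.0],   # stand-in for CNN(I_{b_1})
                [0.0, 1.0, 0.0, 1.0]]   # stand-in for CNN(I_{b_2})

V = [project_region(W_m, b_m, feat) for feat in region_feats]
print(V)
```

With the real sizes, `V` would hold 20 vectors of dimension $h$, one per bounding box plus one for the whole image.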
##### 1.2 representing sentences: same $h$-dimensional embedding space.
- BRNN: 1-hot encoding of each of the N words over a vocabulary -> $h$-dimensional vector.
```
                                                                    +--------------------------------+
                                                               .--->| f(e_t + W_b x h_{t+1}^b + b_b) |--> h_t^b --.
      +-----------+          +--------------------+           /     +--------------------------------+             \    +-----------------------------+
1_t --| W_w * 1_t |--> x_t --| f(W_e * x_t + b_e) |--> e_t --*                                     (4)              *-->| f(W_d(h_t^f + h_t^b) + b_d) |--> s_t
      +-----------+          +--------------------+           \     +--------------------------------+             /    +-----------------------------+
                (2)                             (3)            `--->| f(e_t + W_f x h_{t-1}^f + b_f) |--> h_t^f --'                                 (6)
                                                                    +--------------------------------+
                                                                                                   (5)
```
$W_w$, $W_e$, $W_b$, $W_f$ and $W_d$ are learned.  
$b_e$, $b_b$, $b_f$ and $b_d$ are learned.  
Figure of the whole pipeline (fig. 3?)
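
The data flow of equations (2) to (6) can be sketched in a few lines of pure Python. Assumptions, none of which come from these notes: $f$ is taken to be ReLU, both recurrent states start at zero, and the tiny identity-like weights are made up so the output is easy to check by hand. `brnn_forward` and the helper names are hypothetical.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def add(*vectors):
    return [sum(xs) for xs in zip(*vectors)]

def brnn_forward(word_ids, p):
    """Map word indices to s_1..s_N following equations (2)-(6)."""
    h = len(p["b_e"])
    # (2) W_w * 1_t picks column t of W_w; (3) e_t = f(W_e x_t + b_e)
    x = [[row[t] for row in p["W_w"]] for t in word_ids]
    e = [relu(add(matvec(p["W_e"], x_t), p["b_e"])) for x_t in x]
    # (5) forward direction, left to right, zero initial state (assumption)
    h_f, state = [], [0.0] * h
    for e_t in e:
        state = relu(add(e_t, matvec(p["W_f"], state), p["b_f"]))
        h_f.append(state)
    # (4) backward direction, right to left, zero initial state (assumption)
    h_b, state = [], [0.0] * h
    for e_t in reversed(e):
        state = relu(add(e_t, matvec(p["W_b"], state), p["b_b"]))
        h_b.append(state)
    h_b.reverse()
    # (6) s_t = f(W_d (h_t^f + h_t^b) + b_d)
    return [relu(add(matvec(p["W_d"], add(f, b)), p["b_d"]))
            for f, b in zip(h_f, h_b)]

# Made-up parameters: vocabulary of 3 words, embedding size 2, h = 2.
I2 = [[1.0, 0.0], [0.0, 1.0]]
half = [[0.5, 0.0], [0.0, 0.5]]
zero = [0.0, 0.0]
params = {
    "W_w": [[1.0, 0.0, 1.0],
            [0.0, 1.0, 1.0]],
    "W_e": I2, "b_e": zero,
    "W_f": half, "b_f": zero,
    "W_b": half, "b_b": zero,
    "W_d": I2, "b_d": zero,
}
s = brnn_forward([0, 1], params)
print(s)
```

Each $s_t$ mixes context from both sides of word $t$: the forward state carries the words before it, the backward state the words after it.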
##### 1.3 alignment objective.
$S_{kl} = \sum_{t \in g_l}\sum_{i \in g_k}\max(0, v_i^T s_t)$ (7)  
$t \in g_l$ is a sentence fragment in sentence $l$.  
$i \in g_k$ is an image fragment in image $k$.  
--> a region-word pair adds to the score only when its vectors are positively aligned.

$S_{kl} = \sum_{t \in g_l}\max_{i \in g_k}(v_i^T s_t)$ (8)  
"every word $s_t$ aligns to the single best image region."
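
The difference between the two scores is easy to see on toy vectors. A minimal sketch, assuming made-up 2-dimensional region vectors `V` and word vectors `S`; `score_sum` and `score_max` are hypothetical names for equations (7) and (8):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def score_sum(V, S):
    """Eq. (7): sum over all region-word pairs of max(0, v_i . s_t)."""
    return sum(max(0.0, dot(v, s)) for s in S for v in V)

def score_max(V, S):
    """Eq. (8): each word keeps only its best-matching region."""
    return sum(max(dot(v, s) for v in V) for s in S)

V = [[1.0, 0.0], [0.0, 1.0]]    # two region vectors v_i (made up)
S = [[2.0, -1.0], [0.5, 3.0]]   # two word vectors s_t (made up)

print(score_sum(V, S))  # 2 + 0 + 0.5 + 3
print(score_max(V, S))  # 2 + 3
```

In (7) the second word is credited for both regions (0.5 and 3), while in (8) only its best region (3) counts.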

- the max-margin, structured loss:
$C(\theta) = \sum_k\left[\sum_l\max(0, S_{kl} - S_{kk} + 1) + \sum_l\max(0, S_{lk} - S_{kk} + 1)\right]$ (9)  
sum of rank images + rank sentences.
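
A literal transcription of (9) in pure Python, as a sketch on a made-up score matrix. It assumes `S[k][l]` holds $S_{kl}$ for image $k$ and sentence $l$, with matching pairs on the diagonal; `ranking_loss` is a hypothetical name. The sums are taken over every $l$ exactly as written, so each $l = k$ term adds the margin as a constant.

```python
def ranking_loss(S, margin=1.0):
    """Eq. (9): S[k][l] scores image k against sentence l; S[k][k] is the true pair."""
    n = len(S)
    total = 0.0
    for k in range(n):
        for l in range(n):
            # first term of (9): row k of S compared with the true pair score
            total += max(0.0, S[k][l] - S[k][k] + margin)
            # second term of (9): column k of S compared with the true pair score
            total += max(0.0, S[l][k] - S[k][k] + margin)
    return total

well_separated = [[3.0, 1.0],
                  [0.0, 2.0]]   # true pairs beat the others by at least the margin
confused = [[1.0, 3.0],
            [0.0, 2.0]]         # image 0 scores higher with sentence 1 than with its own

print(ranking_loss(well_separated))
print(ranking_loss(confused))
```

The first matrix only pays the constant $l = k$ terms; the second is penalized for every mismatched pair that comes within the margin of the true pair.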
##### 1.4 decoding text segment alignments to images.
- generating snippets of text instead of single words.
define a Markov Random Field:
- sentence with N words.
- image with M bounding boxes.
  - $\forall j \in [1, N], a_j \in [1, M]$
  - $E(a) = \sum_{j=1}^N\psi_j^U(a_j) + \sum_{j=1}^{N-1}\psi_j^B(a_j, a_{j+1})$ (10)
  - $\psi_j^U(a_j = i) = v_i^T s_j$ (11)
  - $\psi_j^B(a_j, a_{j+1}) = \beta 1[a_j = a_{j+1}]$ (12)
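
Because (10) is a chain, the best assignment can be found exactly with dynamic programming (Viterbi-style). A minimal sketch, assuming the best alignment is the one that maximizes $E$ as written in (10) to (12), so that high word-region scores and equal neighbouring boxes are both rewarded. `best_alignment` is a hypothetical helper, the unary scores are made up, and indices are 0-based.

```python
def best_alignment(unary, beta):
    """unary[j][i] is the score of word j with box i; returns the a maximizing (10)."""
    N, M = len(unary), len(unary[0])
    score = list(unary[0])   # best value of any prefix ending with box i
    back = []                # back[j-1][i]: best previous box when word j takes box i
    for j in range(1, N):
        new_score, ptr = [], []
        for i in range(M):
            # keeping the same box as the previous word earns beta, eq. (12)
            cands = [score[p] + (beta if p == i else 0.0) for p in range(M)]
            p_best = max(range(M), key=cands.__getitem__)
            new_score.append(cands[p_best] + unary[j][i])
            ptr.append(p_best)
        score = new_score
        back.append(ptr)
    # trace back from the best final box
    a = [max(range(M), key=score.__getitem__)]
    for ptr in reversed(back):
        a.append(ptr[a[-1]])
    return a[::-1]

unary = [[1.0, 0.0],
         [0.4, 0.5],   # the middle word weakly prefers box 1
         [1.0, 0.0]]

print(best_alignment(unary, beta=0.0))  # every word picks its own best box
print(best_alignment(unary, beta=1.0))  # smoothing pulls the middle word to box 0
```

Larger $\beta$ favours longer contiguous snippets assigned to one box; $\beta = 0$ reduces to the per-word argmax.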
### 2. TODO

# II- The code
##### 1. Basic torch initialization.
##### 2. Create a data loader instance.
##### 3. Initialize the networks
- from file
- from scratch:
  - (1). the language model (`protos.lm`)
  - (2). the convolutional network (`protos.cnn`)
  - (3). the feature expander (`protos.expander`)
  - (4). the language model criterion (`protos.crit`)
  - use cloned networks to be able to write smaller checkpoints.
##### 4. Validation evaluation (`eval_split`)
- (1). fetch a batch of data, pre-process it, do not augment.
- (2). forward pass :
```
         +-----+            +----------+                      +----+               +------+
images --| cnn |--> feats --| expander |--> expanded_feats -,-| lm |--> logprobs --| crit |--> loss
         +-----+            +----------+                   /  +----+               +------+
                                                 labels --'
```
- (3). sample generated captions for each image.
- (4). return `loss_sum / loss_evals`, `predictions = {(id, caption)}`, `lang_stats`
##### 5. Loss function (`lossFun`)
- (1). forward pass to transform images into "*back-propagatable*" losses.
```
         +--------+
images --| protos |--> loss
         +--------+
```
- (2). backward pass: through the criterion and `lm`; through `cnn` only if finetuning.
- (3). clip gradients.
- (4). apply L2 regularization.
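
Steps (3) and (4) amount to two element-wise operations on the flattened gradient. A rough sketch in Python rather than Torch, assuming clipping means clamping each component to `[-grad_clip, grad_clip]` and L2 regularization means adding `weight_decay * param` to the gradient. Both names are hypothetical, and these notes do not say which parameter groups are regularized.

```python
def clip_and_regularize(grads, params, grad_clip, weight_decay):
    """Clamp each gradient component, then add the L2 term weight_decay * param."""
    # step (3): element-wise clamp (assumed clipping scheme)
    clipped = [max(-grad_clip, min(grad_clip, g)) for g in grads]
    # step (4): gradient of (weight_decay / 2) * ||params||^2
    return [g + weight_decay * p for g, p in zip(clipped, params)]

grads = [5.0, -0.25, -7.0]
params = [1.0, 2.0, -4.0]
new_grads = clip_and_regularize(grads, params, grad_clip=1.0, weight_decay=0.5)
print(new_grads)
```

Clipping first, as in the order of the list above, means the regularization term itself is never clipped.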
##### 6. Main loop
- (1). eval loss and gradients.
- (2). save checkpoints: opt, iter, loss_history, val_predictions.
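- (3). decay learning rates for `lm` and `cnn` by a factor $`\epsilon = 2^{-\frac{i - i_0}{T}}`$ ($`i`$: current iteration, $`i_0`$: iteration where decay starts, $`T`$: halving period).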
- (4). parameters update.
- (5). update `cnn` only if finetuning is enabled and the warm-up period is over.
- (6). exploding loss or max iterations -> stop
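
The decay in step (3) halves the learning rate every $T$ iterations. A small worked sketch, assuming $i$ is the current iteration, $i_0$ the iteration at which decay starts and $T$ the halving period, and assuming no decay is applied before $i_0$; `decayed_lr` is a hypothetical helper and the numbers are illustrative only.

```python
def decayed_lr(base_lr, i, i0, T):
    """Learning rate at iteration i: base_lr * 2 ** (-(i - i0) / T) once i > i0."""
    if i <= i0:
        return base_lr  # assumption: the rate is left untouched before decay starts
    return base_lr * 2.0 ** (-(i - i0) / T)

for i in (100, 150, 200):
    print(i, decayed_lr(1.0, i, i0=100, T=50))
```

Starting from a rate of 1.0, this gives 1.0 at iteration 100, 0.5 at iteration 150 and 0.25 at iteration 200.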


