### 1. Learning Checkpoint: Transformers

#### I. Recap
- Recall that a limitation of RNNs is their difficulty capturing long-range dependencies.
- LSTMs aim to solve the problem with specific gates, but the context is still not explicitly modeled.
- In practice, RNNs/LSTMs face the vanishing gradient issue, where the gradients contributed by long-range context eventually become negligible.
- In addition, each hidden state depends on the previous one, so the computation over a sequence cannot be parallelized across time steps.

#### II. Attention Model for Embeddings
- Original Attention Paper: https://arxiv.org/pdf/1409.0473.pdf
- Attention for Transformers: https://arxiv.org/abs/1706.03762


- To handle variable-length inputs, we simply pad the input with an `<empty>` token up to a token limit.
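- $$A(q, K, V) = \sum_i \frac{ \exp(q^T k_i) }{ \sum_j \exp(q^T k_j) } v_i, \qquad A(Q, K, V) = \mathrm{softmax}(Q K^T) V$$
    - $q$ is a single query, here the embedding of one input; stacking the queries $q_i$ of all inputs as rows gives $Q$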
    - $k_j, v_j$ are key, value pairs that represent hidden state representations, which serve as candidates; specifically, $v_j$ is the jth possible output, which is produced using key $k_j$. 
- The interpretation of the attention output is that it returns a linear combination of the candidate values.
- Note the dimensionality of the model; $q_i$ and $k_j$ must have the same dimension for the dot product to work; $v_j$ can have a different dimension.
    - Assume the input has size $M$ and the set of candidates has size $C$. If $q$ and $k$ have dimension $d_q$ and $v$ has dimension $d_v$, then the output dimension is $[M \times d_v]$
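- Observe that as $d_q$ gets larger, the variance of the dot product $q^T k$ increases (linearly in $d_q$ when the components are independent with zero mean and unit variance). Large scores push the softmax into saturated regions where the gradient is very small, so we must scale the scores down by a factor of $\sqrt{d_q}$.
- $$A(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^T}{\sqrt{d_q}}\right) V$$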
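
The scaled formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names and the random test shapes ($M = 4$, $C = 6$, $d_q = 8$, $d_v = 5$) are assumptions chosen for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability; the result is unchanged.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: [M, d_q], K: [C, d_q], V: [C, d_v] -> (output [M, d_v], weights [M, C])."""
    d_q = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_q)      # [M, C] similarity of each query to each key
    weights = softmax(scores, axis=-1)   # each row is a distribution over candidates
    return weights @ V, weights          # linear combination of the candidate values

# Hypothetical sizes, chosen only for illustration.
rng = np.random.default_rng(0)
M, C, d_q, d_v = 4, 6, 8, 5
Q = rng.normal(size=(M, d_q))
K = rng.normal(size=(C, d_q))
V = rng.normal(size=(C, d_v))

out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape)      # (4, 5)
print(weights.shape)  # (4, 6)
```

Each row of `weights` sums to 1, which is what makes the output a weighted average of the candidate values.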

Knowledge Check:
- (i) Walk through the individual operations of the scaled dot-product attention model and show that its output has dimension $[M \times d_v]$
- (ii) Explain why the attention model helps improve parallelism over RNNs/LSTMs. (Hint: think about the autoregressive property.)
- (iii) Explain the attention model through the lens of global vs local context and how this relates to the story generation step in our project.

#### III. Self-Attention
- Instead of using an annotated set of key, value candidates, use the input embeddings themselves as the keys and values, so every input attends to all the inputs.
- Because the candidates are the inputs themselves, each position can attend to the relevant parts of the input without a range restriction; how strongly it attends to each position is controlled via the dot product.
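
A minimal sketch of this idea, assuming the simplest possible form in which queries, keys, and values are all the raw input embeddings (no learned projections). The function name and the toy input are assumptions for illustration. To show that there is no range restriction, the last token is made identical to the first, so the first token attends to it exactly as strongly as to itself.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    """X: [M, d]. Queries, keys, and values are all the input embeddings."""
    d = X.shape[-1]
    weights = softmax(X @ X.T / np.sqrt(d), axis=-1)  # [M, M]
    return weights @ X, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
X[5] = X[0]  # the last token is a copy of the first

out, weights = self_attention(X)
# The weight depends only on the dot product, not on how far apart the tokens are.
print(np.isclose(weights[0, 0], weights[0, 5]))
```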

#### IV. Multi-Head Attention
- Recall how a CNN layer applies several filters in parallel; Multi-Head attention is analogous: we project the inputs onto $h$ different heads, apply attention within each head, and then concatenate the results as the output.
- $$MHA(Q, K, V) = Concat(head_1, ..., head_h) W^O$$
- $$head_i = A(Q W_i^Q, K W_i^K, V W_i^V)$$
    - $d_{in}$ is the input embedding size; previously $d_{in} = d_q$, but now $d_q = d_{in} / h$, where $h$ is the number of heads
    - $d_{out}$ is the output embedding size; previously $d_{out} = d_v$, but now $d_v = d_{out} / h$

Knowledge Check: 
* (i) Show that the weight matrices of the Multi-Head Attention Model must have the following dimensions for the final model output to be $[M \times d_{out}]$:
  - $W_i^Q = [d_{in} \times d_q]$
  - $W_i^K = [d_{in} \times d_q]$
  - $W_i^V = [d_{out} \times d_v]$
  - $W^O = [h \cdot d_v \times d_{out}]$ (a single matrix shared by all heads, applied after concatenation)
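
The shapes listed above can be sanity-checked numerically. The following is a sketch under stated assumptions, not a proof: it follows this section's convention that $Q$ and $K$ have width $d_{in}$ and $V$ has width $d_{out}$ before projection, and the sizes ($M = 4$, $C = 6$, $h = 2$, $d_{in} = 8$, $d_{out} = 6$) and function names are made up for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1) @ V

def multi_head_attention(Q, K, V, W_Q, W_K, W_V, W_O):
    # One attention call per head, each on its own projection of Q, K, V.
    heads = [attention(Q @ wq, K @ wk, V @ wv)
             for wq, wk, wv in zip(W_Q, W_K, W_V)]
    # Concatenate the h heads along the feature axis, then mix with W_O.
    return np.concatenate(heads, axis=-1) @ W_O

# Hypothetical sizes for illustration only.
rng = np.random.default_rng(0)
M, C, h = 4, 6, 2
d_in, d_out = 8, 6
d_q, d_v = d_in // h, d_out // h

Q = rng.normal(size=(M, d_in))
K = rng.normal(size=(C, d_in))
V = rng.normal(size=(C, d_out))      # assumption: values have width d_out
W_Q = rng.normal(size=(h, d_in, d_q))
W_K = rng.normal(size=(h, d_in, d_q))
W_V = rng.normal(size=(h, d_out, d_v))
W_O = rng.normal(size=(h * d_v, d_out))

out = multi_head_attention(Q, K, V, W_Q, W_K, W_V, W_O)
print(out.shape)  # (4, 6)
```

Changing any one weight shape (for example, giving `W_O` fewer rows) makes a matrix product fail, which is a quick way to see why each dimension is forced.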


#### V. Encoders and Decoders
- An encoder translates the input to a hidden state. At encoding time, you can use both past and future context because you are given the full input.
- A decoder translates the hidden state to an output. At decoding time, you can only use the past context because the future context has not been generated yet. 
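- Applying the attention model, the encoder can use self-attention as we discussed. However, the decoder cannot see past the token currently being generated, so we must mask the future tokens: their attention scores are set to $-\infty$ before the softmax so that their attention weights become 0.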
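
One common way to implement the decoder mask is to set the scores of future positions to $-\infty$ before the softmax, which makes their attention weights exactly 0. The sketch below assumes unprojected self-attention on a toy input; the function name and sizes are illustrative, not taken from the notes.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(X):
    """X: [M, d]. Position i may attend only to positions 0..i."""
    M, d = X.shape
    scores = X @ X.T / np.sqrt(d)                        # [M, M]
    future = np.triu(np.ones((M, M), dtype=bool), k=1)   # True strictly above the diagonal
    scores = np.where(future, -np.inf, scores)           # exp(-inf) = 0 after softmax
    weights = softmax(scores, axis=-1)
    return weights @ X, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
out, weights = causal_self_attention(X)
print(np.round(weights, 2))  # lower-triangular: no weight on future tokens
```

The first token can only attend to itself, so its output equals its own embedding.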

#### VI. Transformers
- Additional Components
    - Residual Connection: adds the sublayer's input back to its output, so the input can also flow through without attention
    - Layer Normalization: rescales the activations to have 0 mean and unit variance, computed separately at each layer for each training point
    - Positional embeddings, as we discussed before, encode token positions using an oscillating function; the formula is not important here
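
The "Add and Norm" step that combines the first two components can be sketched as follows. This is a simplified illustration: it omits the learnable gain and bias that layer normalization usually carries, and the `sublayer` argument is a hypothetical stand-in for attention or a feedforward net.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each row (one token / training point) over its feature axis.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer):
    # Residual connection: add the sublayer's input back to its output, then normalize.
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8))
y = add_and_norm(x, lambda t: 0.5 * t)  # toy sublayer, for illustration only

print(y.mean(axis=-1))  # approximately 0 for every row
print(y.var(axis=-1))   # approximately 1 for every row
```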

- Encoder Block
    - Multi-Head Self-Attention
    - Add and Norm
    - Feedforward Neural Net
    - Add and Norm
- Decoder Block
    - Masked Multi-Head Self-Attention
    - Add and Norm
    - Multi-Head Attention (over the encoder output)
    - Add and Norm
    - Feedforward Neural Net
    - Add and Norm
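
To make the encoder block's ordering concrete, here is a minimal NumPy sketch that chains the four steps. It is written under several assumptions that are not in the notes: a ReLU feedforward net, layer normalization without learnable gain and bias, randomly initialized weights, and made-up sizes ($M = 5$, model width 8, $h = 2$, hidden width 16). The parameter dictionary `p` and all function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def multi_head_self_attention(X, W_Q, W_K, W_V, W_O):
    heads = []
    for wq, wk, wv in zip(W_Q, W_K, W_V):
        q, k, v = X @ wq, X @ wk, X @ wv
        heads.append(softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1) @ v)
    return np.concatenate(heads, axis=-1) @ W_O

def feed_forward(X, W1, b1, W2, b2):
    return np.maximum(0.0, X @ W1 + b1) @ W2 + b2  # ReLU is an assumption

def encoder_block(X, p):
    # Multi-head attention, then Add and Norm.
    X = layer_norm(X + multi_head_self_attention(X, p["W_Q"], p["W_K"], p["W_V"], p["W_O"]))
    # Feedforward neural net, then Add and Norm.
    return layer_norm(X + feed_forward(X, p["W1"], p["b1"], p["W2"], p["b2"]))

rng = np.random.default_rng(0)
M, d_model, h, d_ff = 5, 8, 2, 16
d_head = d_model // h
p = {
    "W_Q": rng.normal(size=(h, d_model, d_head)),
    "W_K": rng.normal(size=(h, d_model, d_head)),
    "W_V": rng.normal(size=(h, d_model, d_head)),
    "W_O": rng.normal(size=(h * d_head, d_model)),
    "W1": rng.normal(size=(d_model, d_ff)), "b1": np.zeros(d_ff),
    "W2": rng.normal(size=(d_ff, d_model)), "b2": np.zeros(d_model),
}
X = rng.normal(size=(M, d_model))
out = encoder_block(X, p)
print(out.shape)  # (5, 8)
```

Because the output has the same shape as the input, blocks like this can be stacked. A decoder block would follow the same pattern with a masked attention step first and a second attention step whose keys and values come from the encoder output.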