# Sequence models & attention mechanism

## Various seq2seq architectures


* sequence-to-sequence models (e.g. machine translation)
    * [paper](https://arxiv.org/pdf/1409.3215)
    * [paper](https://arxiv.org/pdf/1406.1078)
    * architecture
        * encoder network (takes the input sentence, outputs a vector representation)
        * decoder network (takes the vector representation, outputs the translation)
* image captioning
    * [paper](https://arxiv.org/pdf/1412.6632)
    * [paper](https://arxiv.org/pdf/1411.4555)
    * [paper](https://arxiv.org/pdf/1412.2306)
    * architecture
        * encoder network (a CNN without its classification head, e.g. AlexNet)
        * decoder network (takes the image representation, outputs the caption)

* the most likely sentence
    * machine translation ("conditional language model")
        * $P(y^{<1>},...,y^{<T_y>} | x^{<1>},...,x^{<T_x>})$
            * probability of the English sentence conditioned on the input French sentence
            * in the translation problem, we search for the translation $y$ that maximizes this probability
            * done with beam search; greedy search does not work well because we optimize the probability of the whole sentence, not of each token separately

* beam search algorithm
    * step 1: pick the $b$ most likely first words based on $P(y^{<1>}|x)$; $b$ is called the beam width
        * run the encoder on the whole sentence, then keep the $b$ most likely words from the first decoder step
    * step 2: for each kept first word, score the possible second words using $P(y^{<2>}|x, y^{<1>})$
        * the encoding of the whole sentence is reused; the second decoder step is run once per kept first word
        * $P(y^{<1>},y^{<2>}|x) = P(y^{<1>}|x) P(y^{<2>}|x, y^{<1>})$
        * every word in the vocabulary is evaluated for each kept first word, but only the $b$ best pairs in total are kept (a first word drops out if none of its continuations make the top $b$)
    * continue step by step until the \<EOS\> token has been reached $b$ times
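
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production decoder: `step_log_probs` is a hypothetical stand-in for the decoder network, `toy_model` is an invented probability table, and the stopping rule (stop once every kept candidate has ended or `max_len` is reached) is an assumption of this sketch.

```python
import math

def beam_search(step_log_probs, beam_width, eos="<EOS>", max_len=20):
    # step_log_probs(prefix) -> {token: log P(token | x, prefix)} (hypothetical interface)
    beams = [((), 0.0)]  # (partial sentence, summed log-probability)
    finished = []
    for _ in range(max_len):
        # Extend every kept prefix with every possible next token.
        candidates = [
            (prefix + (token,), score + lp)
            for prefix, score in beams
            for token, lp in step_log_probs(prefix).items()
        ]
        # Keep only the b best candidates across all prefixes.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_width]:
            (finished if prefix[-1] == eos else beams).append((prefix, score))
        if not beams:
            break
    return sorted(finished + beams, key=lambda c: c[1], reverse=True)

def toy_model(prefix):
    # Invented probabilities: "a" is the likelier first word,
    # but the sentence starting with "b" is likelier overall.
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.55, "y": 0.45},
        ("b",): {"x": 0.9, "y": 0.1},
    }
    probs = table.get(prefix, {"<EOS>": 1.0})
    return {token: math.log(p) for token, p in probs.items()}

best_beam = beam_search(toy_model, beam_width=2)[0]
best_greedy = beam_search(toy_model, beam_width=1)[0]
print(best_beam[0], best_greedy[0])
```

With `beam_width=1` the search is greedy and commits to "a", ending with a sentence of probability 0.33; with `beam_width=2` it keeps "b" alive and finds the sentence with probability 0.36.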

* beam search refinements
    * utility func
        * $\arg\max_y \prod_{t=1}^{T_y} P(y^{<t>} | x, y^{<1>},...,y^{<t-1>})$, see parts above
        * $\arg\max_y \sum_{t=1}^{T_y} \log P(y^{<t>} | x, y^{<1>},...,y^{<t-1>})$, to avoid numerical underflow (a product of many small probabilities rounds to zero)
        * $\arg\max_y \frac{1}{T_y^{\alpha}}\sum_{t=1}^{T_y} \log P(y^{<t>} | x, y^{<1>},...,y^{<t-1>})$, to normalize for length; $\alpha$ is between 0 and 1
        * evaluate all generated sentences with the utility func and keep the best one
    * how to set $b$?
        * large $b$: better results but slower; small $b$: worse results but faster
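
A small sketch of why the log and the length normalization matter. The helper name `sentence_score`, the token probabilities, and the default `alpha=0.7` are illustrative assumptions, not values taken from these notes.

```python
import math

def sentence_score(token_probs, alpha=0.7):
    # Sum of log-probabilities divided by T_y ** alpha.
    log_sum = sum(math.log(p) for p in token_probs)
    return log_sum / len(token_probs) ** alpha

# A raw product of many small probabilities underflows to zero,
# while the sum of logs stays finite.
raw_product = math.prod([0.01] * 1000)
log_sum = sum(math.log(p) for p in [0.01] * 1000)

short = [0.5, 0.5]   # 2 tokens, each fairly uncertain
long = [0.8] * 8     # 8 tokens, each fairly confident

print(raw_product, log_sum)
print(sentence_score(short, alpha=0.0), sentence_score(long, alpha=0.0))
print(sentence_score(short, alpha=0.7), sentence_score(long, alpha=0.7))
```

With `alpha=0.0` (no normalization) the short sentence scores higher simply because it has fewer factors; with `alpha=0.7` the longer, more confident sentence wins.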

* beam search error analysis
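    * two components can be at fault: the RNN (encoder-decoder) or beam search
    * the RNN computes $P(y|x)$; compute it for the human translation $y^*$ (the better one) and for the model translation $\hat{y}$, then compare the two
        * if $P(y^*|x) > P(\hat{y}|x)$, beam search failed to find the better sentence, so work on beam search; otherwise the RNN scores the worse sentence higher, so work on the RNN
        * repeat for multiple translations and record the verdicts in a table to see which component is at fault more often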
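
The table can be tallied with a few lines of code. This is a sketch with made-up probabilities; `attribute_errors` is a hypothetical helper name.

```python
def attribute_errors(examples):
    # examples: list of (P(human translation | x), P(model translation | x))
    counts = {"beam search": 0, "RNN": 0}
    for p_human, p_model in examples:
        if p_human > p_model:
            # The RNN prefers the human translation, yet the search did not find it.
            counts["beam search"] += 1
        else:
            # The RNN itself scores the worse translation at least as high.
            counts["RNN"] += 1
    return counts

# Made-up probabilities for three mistranslated dev-set sentences.
verdicts = attribute_errors([(2e-10, 1e-10), (1e-10, 3e-10), (5e-9, 1e-9)])
print(verdicts)
```

Whichever component collects the larger share of errors is the one worth working on first.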

* BLEU (bilingual evaluation understudy) score
    * [paper](https://aclanthology.org/P02-1040.pdf)
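    * a metric for scoring a machine translation against one or more good reference translations
    * modified precision: (sum over predicted words of the count clipped to the max # of times the word is observed in any single reference)/(# words predicted)
        * bi-grams: for each predicted bi-gram get its clipped count (max # observed in a reference) and its # predicted
        * modified bi-gram precision: (sum of clipped bi-gram counts)/(sum of # bi-grams predicted)
        * can be generalized for n-grams, conventionally calculated for 1-4 grams
        * $BP \cdot \exp\left(\frac{1}{4}\sum_{n=1}^{4}\log PRE_n\right)$, where $PRE_n$ is the modified n-gram precision and the brevity penalty is $BP = 1$ if the machine translation is longer than the reference, else $BP = e^{1-l_{ref}/l_{mt}}$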
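
A minimal sketch of the score for a single candidate sentence. It assumes the geometric-mean form from the linked paper (the exponential of the average log precision), picks the reference whose length is closest to the candidate for the brevity penalty, and returns 0 when any precision is 0; the function names are illustrative.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n):
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    # For every n-gram, the max number of times it occurs in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped / sum(cand.values())

def bleu(candidate, references, max_n=4):
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    # Assumption: use the reference length closest to the candidate length.
    ref_len = min((len(r) for r in references),
                  key=lambda length: (abs(length - len(candidate)), length))
    bp = 1.0 if len(candidate) > ref_len else math.exp(1 - ref_len / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

refs = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
p1 = modified_precision("the the the the the the the".split(), refs, 1)
p2 = modified_precision("the cat the cat on the mat".split(), refs, 2)
print(p1, p2, bleu(refs[0], refs))
```

The candidate made of seven copies of "the" gets a unigram precision of 2/7 because "the" appears at most twice in a single reference, which is exactly the clipping that plain precision lacks.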

* attention model
    * [paper](https://arxiv.org/pdf/1409.0473)
    * addresses problems with longer texts, where the plain encoder-decoder approach degrades because the whole input must be squeezed into one fixed-size vector
    * intuition
        * input foreign text
        * bidirectional RNN computes features for each of the input tokens
        * attention to particular tokens (to translate a token, we need to understand its context)
        * output RNN takes its previous state, the already translated tokens and the context of the current token (features of the input tokens relevant to the token we want to translate) as inputs to translate the current token
    * algorithm
        * input foreign text
        * bidirectional RNN
            * for every input step $t'$ we consider activations $a^{<t'>}= (\overleftarrow{a}^{<t'>}, \overrightarrow{a}^{<t'>})$
        * attention
            * for each output step $t$, it looks at the bidirectional RNN activations of all input tokens and computes attention weights $\alpha^{<t,t'>}$, which tell the output RNN how much context to take from each input token $t'$
            * $\sum_{t'} \alpha^{<t,t'>} = 1$, with $\alpha^{<t,t'>} \geq 0$
            * $c^{<t>} = \sum_{t'} \alpha^{<t,t'>} a^{<t'>}$, that is a sum of the bidirectional RNN activations weighted by the attention weights
        * output RNN (a standard unidirectional RNN)
            * inputs context $c^{<t>}$
            * inputs its previous state $s^{<t-1>}$
            * inputs the previous output token
            * outputs a token $y^{<t>}$
    * computing attention
        * $\alpha^{<t,t'>}$ = amount of attention $y^{<t>}$ should pay to $a^{<t'>}$
        * $\alpha^{<t,t'>} = \frac{\exp(e^{<t,t'>})}{\sum_{t''=1}^{T_x} \exp(e^{<t,t''>})}$, softmax to make sure $\sum_{t'} \alpha^{<t,t'>} = 1$
        * $e^{<t,t'>}$ is computed by a small feed-forward net using the previous output RNN state $s^{<t-1>}$ and $a^{<t'>}$ as its inputs
        * unfortunately the cost is quadratic: $T_x \times T_y$ attention weights have to be computed
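
One decoder step of the computation above, sketched in plain Python. The scoring function here is a dot product standing in for the small feed-forward net, which is an assumption made to keep the example short; `attention_step` and the example vectors are illustrative.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot_score(s_prev, a):
    # Stand-in for the feed-forward net that produces e^{<t,t'>}.
    return sum(si * ai for si, ai in zip(s_prev, a))

def attention_step(s_prev, activations, score_fn=dot_score):
    energies = [score_fn(s_prev, a) for a in activations]  # e^{<t,t'>} for every t'
    alphas = softmax(energies)                              # alpha^{<t,t'>}
    dim = len(activations[0])
    # c^{<t>}: attention-weighted sum of the encoder activations.
    context = [sum(alpha * a[i] for alpha, a in zip(alphas, activations))
               for i in range(dim)]
    return alphas, context

s_prev = [1.0, 0.0]                                  # previous decoder state s^{<t-1>}
activations = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]   # a^{<t'>} for T_x = 3
alphas, context = attention_step(s_prev, activations)
print(alphas, context)
```

Running this step once per output token over all $T_x$ input positions is where the quadratic cost comes from.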

## Speech recognition

* input audio clip -> text transcript
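* spectrogram (audio clip representation): shows time (x) vs frequency (y) vs energy (color)
* phonemes -> basic units of speech sounds (hand-engineered representation)
* large datasets allow for e2e speech recognition systems that do not need phonemes
* attention vs CTC (connectionist temporal classification) models
    * CTC -> the input has many more time steps than the transcript has characters, so the network outputs a character or "blank" at every time step -> collapse repeated characters not separated by "blank", then drop the blanks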
* trigger word detection (wake up word)
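    * spectrogram is fed into an RNN; labels are 1 right after the trigger word ends, 0 otherwise
    * labels are usually heavily imbalanced (far more 0s than 1s); a common remedy is to label several time steps after the trigger word as 1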
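
The CTC collapsing rule mentioned above can be sketched as follows. Using `_` as the blank symbol is an assumption of this example, and `ctc_collapse` is an illustrative name.

```python
from itertools import groupby

def ctc_collapse(frames, blank="_"):
    # 1) merge runs of identical characters, 2) drop the blank symbol
    merged = [ch for ch, _ in groupby(frames)]
    return "".join(ch for ch in merged if ch != blank)

print(ctc_collapse("ttt_h_eee___ ___qqq__"))
print(ctc_collapse("hh_e_ll_lo"))
```

The blank between the two `l` runs in the second example is what keeps the double letter in "hello" from being merged into one.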