
# Recurrent neural networks

* sequence problems
    * speech recognition (sound -> transcription)
    * music generation (genre -> sequence of notes)
    * sentiment classification (sentence -> sentiment)
    * DNA sequence analysis
    * machine translation (lang1 -> lang2)
    * video activity recognition (video frames -> activity)
    * named entity recognition (sentence -> persons)
* notation
    * x (sentence), y (binary vector with a 1 wherever the word is part of a name)
    * words in an input sequence run from $x^{<1>}$ to $x^{<T_x>}$, where $T_x$ is the number of tokens
    * scalars in a target sequence run from $y^{<1>}$ to $y^{<T_y>}$, where $T_y$ is the output length
    * for the $t$-th word of the $i$-th example we write $x^{(i)<t>}$, and similarly $y^{(i)<t>}$ for the target vector
* vocabulary/dictionary
    * tokens/words used to build a model
    * each word in a sentence can be encoded as a one-hot vector over the vocabulary
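
A minimal sketch of the one-hot encoding described above. The seven-word vocabulary and the `one_hot` helper are hypothetical and only serve as an illustration; a real vocabulary would hold tens of thousands of tokens. Words missing from the vocabulary are mapped to an `<UNK>` entry.

```python
import numpy as np

# Hypothetical toy vocabulary (assumption, not from the notes).
vocab = ["<UNK>", "<EOS>", "and", "harry", "hermione", "met", "potter"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word, word_to_index):
    """Return a one-hot vector for `word`, falling back to <UNK> if it is unknown."""
    vec = np.zeros(len(word_to_index))
    vec[word_to_index.get(word, word_to_index["<UNK>"])] = 1.0
    return vec

sentence = "harry potter met hermione".split()
# One row per token: x has shape (T_x, vocabulary size).
x = np.stack([one_hot(w, word_to_index) for w in sentence])
```

Each row of `x` plays the role of one $x^{<t>}$.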

## Recurrent neural network model

* using standard feed-forward NN
    * does not share features learned across positions in the sequence
    * inputs and outputs can have different lengths in different examples, which a fixed-size network cannot handle
* architecture
    * $x^{<1>}$ -> NN -> $\hat y^{<1>}$
    * $x^{<2>}$ -> NN + $a^{<1>}$ -> $\hat y^{<2>}$, activation from previous step passed
    * $x^{<3>}$ -> NN + $a^{<2>}$ -> $\hat y^{<3>}$, activation from previous step passed
    * ...
    * $x^{<t>}$ -> NN + $a^{<t-1>}$ -> $\hat y^{<t>}$
    * parameters are shared across time steps for both the vertical (input -> activation -> output) and horizontal (activation -> activation) connections
    * a prediction uses information only from earlier time steps (addressed by bidirectional RNNs)
* forward pass
    * hidden activations tanh/ReLU, output activations sigmoid/softmax
    * $a^{<0>} = 0$, $a^{<1>} = g(W_{aa}a^{<0>}+ W_{ax}x^{<1>}+b_a)$, $\hat y^{<1>} = g(W_{ya}a^{<1>}+b_y)$
    * $a^{<t>} = g(W_{aa}a^{<t-1>}+ W_{ax}x^{<t>}+b_a)$, $\hat y^{<t>} = g(W_{ya}a^{<t>}+b_y)$
    * can be simplified by horizontally stacking the weight matrices, $W_a = [W_{aa} \mid W_{ax}]$, and vertically stacking $a^{<t-1>}$ and $x^{<t>}$
        * $a^{<t>} = g(W_a[a^{<t-1>},x^{<t>}]+b_a)$
* backward pass
    * $L^{<t>}(\hat y^{<t>}, y^{<t>}) = -y^{<t>} \log \hat y^{<t>} - (1-y^{<t>}) \log (1-\hat y^{<t>})$
    * $L(\hat y, y) = \sum_{t=1}^{T_y} L^{<t>}(\hat y^{<t>}, y^{<t>})$, that is, the loss summed over the sequence
    * partial derivatives flow in the opposite direction of the forward pass; the recurrent path is the most significant (also called backpropagation through time)
* more architectures
    * one-to-one (standard MLP)
    * many-to-many (same length of x and y, described above)
    * many-to-many (different lengths of x and y)
        * encoder reads in the whole sentence (output starts only after the input ends)
        * decoder outputs the translation
    * many-to-one (sentiment analysis)
        * input layer -> sentence
        * output layer -> sentiment
        * architecture -> similar to the first one, output at the end of the sentence
    * one-to-many (music generation)
        * input layer -> genre (int)
        * output layer -> set of notes
        * architecture -> the first input generates a note, which is fed into the next time step (together with the previous activations), which outputs another note, ...
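
The forward pass and the summed loss of the many-to-many RNN can be sketched in NumPy as follows. This is an illustration under assumptions rather than a reference implementation: the sizes, the random weights, and the function names `rnn_forward` and `sequence_loss` are made up, the hidden activation is tanh, and the output is a single sigmoid unit per step (as in the name-tagging example).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_forward(xs, W_aa, W_ax, W_ya, b_a, b_y):
    """Many-to-many RNN: tanh hidden units, sigmoid output at every step."""
    a = np.zeros(W_aa.shape[0])              # a^{<0>} = 0
    activations, y_hats = [], []
    for x_t in xs:                           # xs has shape (T, n_x)
        a = np.tanh(W_aa @ a + W_ax @ x_t + b_a)
        activations.append(a)
        y_hats.append(sigmoid(W_ya @ a + b_y))
    return np.array(activations), np.array(y_hats)

def sequence_loss(y_hats, ys):
    """Binary cross-entropy summed over all time steps."""
    return float(np.sum(-ys * np.log(y_hats) - (1 - ys) * np.log(1 - y_hats)))

rng = np.random.default_rng(0)
n_x, n_a, T = 5, 4, 6                        # illustrative sizes (assumption)
W_aa = rng.normal(scale=0.1, size=(n_a, n_a))
W_ax = rng.normal(scale=0.1, size=(n_a, n_x))
W_ya = rng.normal(scale=0.1, size=(1, n_a))
b_a, b_y = np.zeros(n_a), np.zeros(1)

xs = rng.normal(size=(T, n_x))               # stand-in inputs
ys = rng.integers(0, 2, size=(T, 1)).astype(float)

activations, y_hats = rnn_forward(xs, W_aa, W_ax, W_ya, b_a, b_y)
loss = sequence_loss(y_hats, ys)

# Stacked form: W_a = [W_aa | W_ax] applied to the concatenation [a^{<t-1>}, x^{<t>}].
W_a = np.hstack([W_aa, W_ax])
a_1_stacked = np.tanh(W_a @ np.concatenate([np.zeros(n_a), xs[0]]) + b_a)
```

The last two lines check the stacked notation: `a_1_stacked` should match `activations[0]`, because multiplying the stacked matrix by the concatenated vector equals the sum of the two separate products.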

    



## Language model

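* speech recognition
    * P(the apple and pair salad) = $3.2 \times 10^{-13}$
    * P(the apple and pear salad) = $5.7 \times 10^{-10}$
    * the two sentences sound alike, so the recognizer picks the one the language model finds more probable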
* language model
    * solving $P(y^{<1>},y^{<2>},...,y^{<T_y>}) = ?$
    * training set: large corpus of text
    * tokenize each sentence and map every token to a one-hot vector
        * add tokens marking sentence boundaries, e.g. \<EOS> for the end of a sentence
        * replace tokens outside the vocabulary with \<UNK>
* model
    * inner workings
        * input $x^{<t>} = y^{<t-1>}$
        * sentence "Cats average 15 hours of sleep a day \<EOS>"
        * $x^{<1>} = 0$ (nothing) -> NN + $a^{<0>}$ -> $\hat y^{<1>}$, where the output layer uses softmax to predict the probability of each word in the dictionary
        * $x^{<2>} = y^{<1>}$ (Cats) -> NN + $a^{<1>}$ -> $\hat y^{<2>}$, where the net outputs P(...|Cats)
        * $x^{<3>} = y^{<2>}$ (average) -> NN + $a^{<2>}$ -> $\hat y^{<3>}$, where the net outputs P(...|Cats average)
        * ...
        * $x^{<9>} = y^{<8>}$ (day) -> NN + $a^{<8>}$ -> $\hat y^{<9>}$, where the net outputs P(\<EOS>|...)
    * cost function
        * $L^{<t>}(\hat y^{<t>}, y^{<t>}) = - \sum_i y_i^{<t>} \log \hat y_i^{<t>}$, summing over the vocabulary entries $i$
        * $L = \sum_{t=1}^{T_y} L^{<t>}(\hat y^{<t>}, y^{<t>})$
    * resulting $p(y^{<1>}, y^{<2>}, y^{<3>}) = p(y^{<1>}) p(y^{<2>} | y^{<1>}) p(y^{<3>} | y^{<1>}, y^{<2>})$
* sampling novel sequences
    * word-level
        * the language model outputs a probability for every word in the dictionary
        * we sample a word from the softmax output and feed it in as the next input, repeating until \<EOS> is sampled or a length limit is reached
        * \<UNK> samples can be rejected and resampled
    * character-level
        * similar to the approach above, with characters as the vocabulary
        * no need to worry about \<UNK>
        * sequences become much longer, so long-range dependencies are harder to capture
        * computationally more expensive to train
* vanishing gradient problem
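    * languages have long-term dependencies
    * gradients shrink as they are propagated back through many time steps, so it is hard for backpropagation to adjust the earliest steps (initial words)
    * as a result, an output is influenced mostly by nearby tokens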
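
Word-level sampling can be sketched as below. The weights are random and untrained, so the output is gibberish; the point is only the loop structure (sample from the softmax, reject `<UNK>`, feed the sample back in, stop at `<EOS>`). The vocabulary, the sizes, and the function name `sample_sequence` are assumptions made for this illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))                # shift for numerical stability
    return e / e.sum()

def sample_sequence(params, vocab, rng, max_len=20):
    """Sample tokens one at a time, feeding each sample back as the next input."""
    W_aa, W_ax, W_ya, b_a, b_y = params
    eos, unk = vocab.index("<EOS>"), vocab.index("<UNK>")
    a = np.zeros(W_aa.shape[0])              # a^{<0>} = 0
    x = np.zeros(len(vocab))                 # x^{<1>} = 0
    tokens = []
    for _ in range(max_len):
        a = np.tanh(W_aa @ a + W_ax @ x + b_a)
        p = softmax(W_ya @ a + b_y)          # distribution over the vocabulary
        idx = int(rng.choice(len(vocab), p=p))
        while idx == unk:                    # reject <UNK> and resample
            idx = int(rng.choice(len(vocab), p=p))
        if idx == eos:                       # end of sentence reached
            break
        tokens.append(vocab[idx])
        x = np.zeros(len(vocab))             # x^{<t+1>} = one-hot of the sample
        x[idx] = 1.0
    return tokens

vocab = ["<UNK>", "<EOS>", "a", "cats", "day", "hours", "sleep"]  # toy vocabulary
rng = np.random.default_rng(0)
n_a, n_v = 8, len(vocab)
params = (
    rng.normal(scale=0.1, size=(n_a, n_a)),  # W_aa
    rng.normal(scale=0.1, size=(n_a, n_v)),  # W_ax
    rng.normal(scale=0.1, size=(n_v, n_a)),  # W_ya
    np.zeros(n_a),                           # b_a
    np.zeros(n_v),                           # b_y
)
tokens = sample_sequence(params, vocab, rng)
```

A character-level variant would use the same loop with characters as the vocabulary, which removes the `<UNK>` branch but makes the sequences much longer.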

## Gated Recurrent Units (GRU)

* [paper](https://arxiv.org/pdf/1409.1259)
* [paper](https://arxiv.org/pdf/1412.3555)
* traditional RNN
    * computing activation at time t $a^{<t>} = g(W_a[a^{<t-1>},x^{<t>}]+b_a)$
    * inputs $a^{<t-1>}$ , $x^{<t>}$
    * outputs
        * to the next unit $a^{<t>}$
        * to the softmax $a^{<t>}$
* GRU
    * memory cell $c$, with $c^{<t>} = a^{<t>}$
    * candidate for replacing $c^{<t>}$: $\tilde c^{<t>} = \tanh(W_c [c^{<t-1>}, x^{<t>}] + b_c)$
    * update gate $\Gamma_u = \sigma(W_u [c^{<t-1>}, x^{<t>}] + b_u)$, with sigmoid activation (values between 0 and 1, in practice usually close to 0 or 1)
    * the gate decides when to update the memory cell; the memory cell carries "the memory" forward
    * update equation $c^{<t>} = \Gamma_u * \tilde c^{<t>} + (1-\Gamma_u) * c^{<t-1>}$, that is, if $\Gamma_u = 1$ the candidate value is accepted, and if $\Gamma_u = 0$ the old value is kept
    * structure of a unit
        * inputs $a^{<t-1>} = c^{<t-1>}$, $x^{<t>}$
        * internally $\tilde c^{<t>}$, $\Gamma_u$ calculated
        * the memory cell is either updated with the candidate values $\tilde c^{<t>}$ or left unchanged
        * outputs 
            * to the next unit $c^{<t>} = a^{<t>}$
            * to the softmax $a^{<t>}$
    * the gate can hold a value over many time steps, which reduces the vanishing gradient problem (activations from early steps can be propagated almost exactly)
    * $\tilde c$, $c$, and $\Gamma_u$ have the same size and the update is computed element-wise, so some bits can be updated while others are kept

* Full GRU
    * $\tilde c^{<t>} = \tanh(W_c[\Gamma_r * c^{<t-1>},x^{<t>}]+b_c)$
        * $\Gamma_r = \sigma (W_r[c^{<t-1>}, x^{<t>}]+b_r)$, a gate that describes how relevant $c^{<t-1>}$ is for computing the candidate $\tilde c^{<t>}$
    * $\Gamma_u = \sigma(W_u[c^{<t-1>}, x^{<t>}]+b_u)$
    * $c^{<t>} = \Gamma_u * \tilde c^{<t>} + (1-\Gamma_u)*c^{<t-1>}$
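
A single step of the full GRU can be sketched in NumPy as follows, with the relevance gate applied element-wise to $c^{<t-1>}$ before it is concatenated with $x^{<t>}$ (the usual formulation). The sizes, the random weights, and the name `gru_step` are assumptions for illustration. The example also forces the update gate towards 0 through a very negative bias to show the memory cell being carried over unchanged.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(c_prev, x_t, W_c, W_u, W_r, b_c, b_u, b_r):
    """One full-GRU step; returns the new memory cell c^{<t>} (= a^{<t>})."""
    concat = np.concatenate([c_prev, x_t])
    gamma_u = sigmoid(W_u @ concat + b_u)    # update gate
    gamma_r = sigmoid(W_r @ concat + b_r)    # relevance gate
    c_tilde = np.tanh(W_c @ np.concatenate([gamma_r * c_prev, x_t]) + b_c)
    # Element-wise blend of the candidate and the old memory.
    return gamma_u * c_tilde + (1 - gamma_u) * c_prev

rng = np.random.default_rng(0)
n_c, n_x = 4, 3                              # illustrative sizes (assumption)
W_c, W_u, W_r = (rng.normal(scale=0.1, size=(n_c, n_c + n_x)) for _ in range(3))
b_c, b_r = np.zeros(n_c), np.zeros(n_c)
c_prev, x_t = rng.normal(size=n_c), rng.normal(size=n_x)

# Ordinary step with a neutral update-gate bias.
c_new = gru_step(c_prev, x_t, W_c, W_u, W_r, b_c, np.zeros(n_c), b_r)

# Gate pushed towards 0: the old memory is kept almost exactly.
c_keep = gru_step(c_prev, x_t, W_c, W_u, W_r, b_c, np.full(n_c, -50.0), b_r)
```

With the gate near 0, `c_keep` equals `c_prev` up to floating-point precision, which is the mechanism that lets a value survive many time steps.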

## Long Short Term Memory (LSTM)

* [paper](https://www.bioinf.jku.at/publications/older/2604.pdf)
* structure of a unit
    * inputs $a^{<t-1>}$, $c^{<t-1>}$, $x^{<t>}$
    * $\tilde c^{<t>} = \tanh(W_c[a^{<t-1>},x^{<t>}]+b_c)$, computed from $a^{<t-1>}$ instead of $c^{<t-1>}$ (no relevance gate)
    * $\Gamma_u = \sigma(W_u[a^{<t-1>}, x^{<t>}]+b_u)$, update gate
    * $\Gamma_f = \sigma(W_f[a^{<t-1>}, x^{<t>}]+b_f)$, forget gate
    * $\Gamma_o = \sigma(W_o[a^{<t-1>}, x^{<t>}]+b_o)$, output gate
    * update $c^{<t>} = \Gamma_u * \tilde c^{<t>} + \Gamma_f * c^{<t-1>}$
    * outputs $a^{<t>} = \Gamma_o * \tanh(c^{<t>})$ and $c^{<t>}$
* with chained units, $c$ can be propagated through the chain over many time steps
* "peephole connection" is a variant where $c^{<t-1>}$ is also fed into the gates as an input
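
The LSTM unit can be sketched in NumPy as follows. The candidate is computed from $a^{<t-1>}$ and $x^{<t>}$ without a relevance gate, and separate update and forget gates replace the GRU's $\Gamma_u$ and $1-\Gamma_u$. The parameter dictionary layout, the sizes, and the names `lstm_step`, `lstm_forward`, and `init_params` are assumptions for illustration; peephole connections are not included.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(n_a, n_x, rng):
    """Random weights and zero biases for the candidate (c) and the u, f, o gates."""
    p = {}
    for name in "cufo":
        p[f"W_{name}"] = rng.normal(scale=0.1, size=(n_a, n_a + n_x))
        p[f"b_{name}"] = np.zeros(n_a)
    return p

def lstm_step(a_prev, c_prev, x_t, p):
    """One LSTM step; returns (a^{<t>}, c^{<t>})."""
    concat = np.concatenate([a_prev, x_t])
    c_tilde = np.tanh(p["W_c"] @ concat + p["b_c"])      # candidate
    gamma_u = sigmoid(p["W_u"] @ concat + p["b_u"])      # update gate
    gamma_f = sigmoid(p["W_f"] @ concat + p["b_f"])      # forget gate
    gamma_o = sigmoid(p["W_o"] @ concat + p["b_o"])      # output gate
    c_t = gamma_u * c_tilde + gamma_f * c_prev
    a_t = gamma_o * np.tanh(c_t)
    return a_t, c_t

def lstm_forward(xs, p, n_a):
    """Chain the unit over a sequence, starting from zero state."""
    a, c = np.zeros(n_a), np.zeros(n_a)
    for x_t in xs:
        a, c = lstm_step(a, c, x_t, p)
    return a, c

rng = np.random.default_rng(0)
n_a, n_x, T = 4, 3, 10                       # illustrative sizes (assumption)
params = init_params(n_a, n_x, rng)
xs = rng.normal(size=(T, n_x))
a_T, c_T = lstm_forward(xs, params, n_a)
```

Setting the forget gate close to 1 and the update gate close to 0 makes $c^{<t>} \approx c^{<t-1>}$, which is how the cell state can pass through many chained units nearly unchanged.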

## Others

**Bidirectional RNN (BRNN)**  
* uses context from both earlier and later time steps; a standard RNN looks only backwards
* example
    * $x^{<1>}$
        * $\overrightarrow a^{<1>}$, informs $\overrightarrow a^{<2>}$
        * $\overleftarrow a^{<1>}$, based on $\overleftarrow a^{<2>}$
        * softmax output $\hat y^{<1>}$, based on both units
    * $x^{<2>}$
        * $\overrightarrow a^{<2>}$, informs $\overrightarrow a^{<3>}$
        * $\overleftarrow a^{<2>}$, based on $\overleftarrow a^{<3>}$
        * softmax output $\hat y^{<2>}$, based on both units
    * ...
* forms an acyclic graph
* blocks can be RNN, GRU or LSTM
* $\hat y^{<t>} = g(W_y [\overrightarrow a^{<t>}, \overleftarrow a^{<t>}] +b_y)$
* the whole sequence is needed before a prediction can be made
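
A bidirectional forward pass can be sketched as two independent plain-RNN passes, one left to right and one right to left, whose activations are concatenated at each step before the softmax. The sizes, the random weights, and the names `run_direction` and `brnn_forward` are assumptions for illustration; GRU or LSTM blocks could replace the tanh cell.

```python
import numpy as np

def softmax_rows(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def run_direction(xs, W_aa, W_ax, b_a, reverse=False):
    """Plain tanh RNN over xs; with reverse=True it runs from the last step to the first."""
    T = len(xs)
    a = np.zeros(W_aa.shape[0])
    out = np.zeros((T, W_aa.shape[0]))
    steps = range(T - 1, -1, -1) if reverse else range(T)
    for t in steps:
        a = np.tanh(W_aa @ a + W_ax @ xs[t] + b_a)
        out[t] = a                           # stored at its own time index
    return out

def brnn_forward(xs, fwd, bwd, W_y, b_y):
    """Return softmax predictions built from forward and backward activations."""
    a_fwd = run_direction(xs, *fwd)
    a_bwd = run_direction(xs, *bwd, reverse=True)
    both = np.concatenate([a_fwd, a_bwd], axis=1)   # [a_fwd^{<t>}, a_bwd^{<t>}]
    return softmax_rows(both @ W_y.T + b_y)

rng = np.random.default_rng(0)
n_x, n_a, n_y, T = 3, 4, 2, 4                # illustrative sizes (assumption)

def make_cell():
    return (rng.normal(scale=0.5, size=(n_a, n_a)),
            rng.normal(scale=0.5, size=(n_a, n_x)),
            np.zeros(n_a))

fwd, bwd = make_cell(), make_cell()
W_y, b_y = rng.normal(scale=0.5, size=(n_y, 2 * n_a)), np.zeros(n_y)
xs = rng.normal(size=(T, n_x))
y_hats = brnn_forward(xs, fwd, bwd, W_y, b_y)
```

Because the backward pass starts at the final input, changing the last element of `xs` also changes the first prediction, which is why the whole sequence has to be available.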

**Deep RNN**
* horizontal & vertical propagation of the signal, same logic as in the shallow RNNs
* usually less deep than other networks (computationally intensive due to the temporal connections)
* the recurrent layers can also feed a stack of feed-forward layers that have no temporal connections
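
A stacked RNN can be sketched as below, assuming the common formulation $a^{[l]<t>} = g(W_a^{[l]}[a^{[l]<t-1>}, a^{[l-1]<t>}] + b_a^{[l]})$, where layer $l$ receives its own previous activation (horizontal) and the current activation of the layer below (vertical). The layer-index notation $[l]$, the sizes, and the name `deep_rnn_forward` are assumptions, not part of the notes above.

```python
import numpy as np

def deep_rnn_forward(xs, layers):
    """Run stacked tanh RNN layers; `layers` is a list of (W_a, b_a) pairs."""
    inputs = xs                              # layer 0 activations are the inputs
    for W_a, b_a in layers:
        a = np.zeros(b_a.shape[0])           # a^{[l]<0>} = 0
        outputs = []
        for inp in inputs:
            # Horizontal input: a (same layer, previous step).
            # Vertical input: inp (layer below, current step).
            a = np.tanh(W_a @ np.concatenate([a, inp]) + b_a)
            outputs.append(a)
        inputs = np.array(outputs)           # becomes the input of the next layer
    return inputs                            # activations of the top layer

rng = np.random.default_rng(0)
n_x, T = 3, 5                                # illustrative sizes (assumption)
sizes = [4, 4, 2]                            # hidden units per layer
layers, n_in = [], n_x
for n_a in sizes:
    layers.append((rng.normal(scale=0.1, size=(n_a, n_a + n_in)), np.zeros(n_a)))
    n_in = n_a

xs = rng.normal(size=(T, n_x))
top = deep_rnn_forward(xs, layers)           # shape (T, sizes[-1])
```

The rows of `top` could then be passed to feed-forward layers applied independently at each time step.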