# Recurrent Neural Networks

* so far our vanilla NNs expected a fixed-length input and predicted a fixed-length output (e.g., animacy)

* let's now look at neural networks that **handle variable length input**


* the exciting idea behind **Recurrent Neural Networks** (RNNs) is that they allow us to work on **sequences** of input, output, or both

What do we mean by a *variable length input*?

A variable length input means that the inputs $\mathbf{x}$ are sequences whose lengths can differ from one instance to the next.

For instance, the first training instance is a sequence of $l$ elements (say, $l$ tokens), while the second is a sequence of $m$ elements (again, e.g., tokens).

Mathematically, inspired by the notation of Cho (2015):
$$ \mathbf{x_1} = \langle x_1, ..., x_l \rangle$$ 
$$ \mathbf{x_2} = \langle x_1, ..., x_m \rangle$$
where $l\neq m$


### Detour: very simple example

From Cho (2015). Assume we have a binary vector $\mathbf{x}$ (it contains 0s and 1s). We want to count the number of 1s.



In [1]:
def add1(el, s):
    """Return the running count s, increased by 1 if el is a 1."""
    if el == 1:
        return s + 1
    return s

In [2]:
v = [0, 1, 0, 0, 1, 1]
s = 0  # memory: number of 1s seen so far
for el in v:
    s = add1(el, s)  # update the memory with one element at a time
print("count(1):", s)

count(1): 3


Two important components:
* memory $s$
* a function $add1$ that is applied to each input symbol *one at a time*, together with the memory $s$

$\rightarrow$ works for input of any length
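The same pattern can be written once for any step function. A minimal sketch: the helper name `run_with_memory` is made up for illustration (it is not from Cho (2015)); the computation is a fold, which Python also provides as `functools.reduce`.

```python
from functools import reduce

def add1(el, s):
    """Return the memory s, incremented if the current element is a 1."""
    return s + 1 if el == 1 else s

def run_with_memory(step, s0, xs):
    """Apply step(x, s) to each element of xs in order, threading the memory s."""
    s = s0
    for x in xs:
        s = step(x, s)
    return s

short_input = [1, 0, 1]
long_input = [0, 1, 0, 0, 1, 1, 1, 0]

count_short = run_with_memory(add1, 0, short_input)
count_long = run_with_memory(add1, 0, long_input)   # same code, longer input
print(count_short, count_long)

# The same computation with the standard-library fold
# (reduce passes the accumulator first, so the arguments are swapped).
count_reduce = reduce(lambda s, el: add1(el, s), long_input, 0)
print(count_reduce)
```

An RNN follows this loop, except that the memory is a vector and the step function has learned parameters.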

### Sequences

In language technology we often work with sequences, e.g., of words or characters: $$ \mathbf{x_2} = \langle x_1, ..., x_m \rangle$$

#### Approach 1: Averaging word vectors

We have already seen one approach to handling such sequences: taking the **mean** of all word vectors in a sentence. However, the word order is then lost.
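A small sketch of why averaging loses the order. The 2-dimensional word vectors below are made-up toy values, not trained embeddings.

```python
# Toy 2-dimensional word vectors; the values are invented for illustration.
vectors = {
    "dog": [1.0, 0.0],
    "bites": [0.0, 1.0],
    "man": [0.5, 0.5],
}

def mean_vector(tokens):
    """Average the word vectors of a sentence, dimension by dimension."""
    dims = len(vectors[tokens[0]])
    return [sum(vectors[t][d] for t in tokens) / len(tokens) for d in range(dims)]

a = mean_vector(["dog", "bites", "man"])
b = mean_vector(["man", "bites", "dog"])
print(a, b, a == b)  # both sentences get the same representation
```

Two sentences with opposite meanings end up with the same vector, which is the motivation for a model that reads the tokens in order.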

#### Approach 2: Recurrent neural networks (RNNs)

### RNNs ([Elman, 1990](https://onlinelibrary.wiley.com/doi/abs/10.1207/s15516709cog1402_1)): Dependence on previous step

* RNNs are called **recurrent** because they apply the same function at every step of the sequence, and each output depends on the previous computations (i.e., like having a *memory* of what has been seen so far)


More formally, following the notation in Goldberg (2015):

* $\mathbf{x_{1:n}}$ input sequence
* $\mathbf{s_0}$ starting state (initial state)
* function $R$ ("memory so far") that takes a state vector $\mathbf{s_{i-1}}$ and an input vector $x_i$ and produces a new state $ \mathbf{s_{i}}$
* function $O$ maps from state to output $\mathbf{y}$

Formulation of an RNN (Goldberg 2015): <img src="pics/rnn0.png" width=400> 

Graphical representation of an RNN (Goldberg 2015): <img src="pics/rnn1.png" width=400> 
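The abstract functions $R$ and $O$ can be sketched in plain Python. This is a minimal, untrained sketch under stated assumptions: $R$ is taken to be a simple tanh recurrence and $O$ the identity (one common choice for a simple RNN); the helper names `matvec`, `make_params` and `rnn` and the random initialisation are invented for illustration.

```python
import math
import random

def matvec(W, v):
    """Multiply matrix W (a list of rows) with vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def make_params(input_dim, state_dim, seed=0):
    """Randomly initialised parameters theta = (W_x, W_s, b); assumption: untrained."""
    rng = random.Random(seed)
    W_x = [[rng.uniform(-0.5, 0.5) for _ in range(input_dim)] for _ in range(state_dim)]
    W_s = [[rng.uniform(-0.5, 0.5) for _ in range(state_dim)] for _ in range(state_dim)]
    b = [0.0] * state_dim
    return W_x, W_s, b

def R(s_prev, x, theta):
    """New state s_i = tanh(W_x x_i + W_s s_{i-1} + b)."""
    W_x, W_s, b = theta
    return [math.tanh(a + c + d)
            for a, c, d in zip(matvec(W_x, x), matvec(W_s, s_prev), b)]

def O(s):
    """Output function; here simply the identity, y_i = s_i."""
    return s

def rnn(xs, s0, theta):
    """Run the RNN over x_{1:n}; return all states and all outputs."""
    states, outputs = [], []
    s = s0
    for x in xs:
        s = R(s, x, theta)  # the same theta is reused at every step
        states.append(s)
        outputs.append(O(s))
    return states, outputs

theta = make_params(input_dim=2, state_dim=3)
s0 = [0.0, 0.0, 0.0]
seq_a = [[1.0, 0.0], [0.0, 1.0]]               # length 2
seq_b = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]   # length 3
states_a, outputs_a = rnn(seq_a, s0, theta)
states_b, outputs_b = rnn(seq_b, s0, theta)
print(len(states_a), len(states_b))            # one state per input element
print(len(states_a[-1]), len(states_b[-1]))    # final states have the same size
```

Regardless of the input length, the final state has a fixed dimension, so it can be fed to an ordinary fixed-size classifier.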

### Unrolling over time

<img src="pics/rnn2.png">

Note: the parameters $\theta$ are shared across all time steps!

##### Expansion at state 4: 
<img src="pics/rnn3.png">
Note: $s_i$ depends on all previous states $s_0, \ldots, s_{i-1}$, and thereby on all inputs $x_1, \ldots, x_i$.
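Writing out the recursion $s_i = R(s_{i-1}, x_i)$ for $i=4$ makes this dependence explicit (a worked expansion of the definition above, with $\theta$ omitted for readability):

$$ s_4 = R(s_3, x_4) = R(R(s_2, x_3), x_4) = R(R(R(R(s_0, x_1), x_2), x_3), x_4) $$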

##### Another visualization

<img src="http://d3kbpzbmcynnmx.cloudfront.net/wp-content/uploads/2015/09/rnn.jpg" alt="illustration from WildML">
A recurrent neural network and the unfolding in time of the computation involved in its forward computation (LeCun et al., 2015).

However, basic (Elman) RNNs tend not to work well beyond a few recent time steps, because gradients vanish or explode during training. One trick against exploding gradients is gradient clipping; otherwise, alternative models are used.
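A minimal sketch of gradient clipping by norm, one common variant of the trick. The function name `clip_by_norm` and the threshold are assumptions for illustration; frameworks offer their own versions (e.g. `torch.nn.utils.clip_grad_norm_` in PyTorch). The last lines give a deliberately simplified scalar picture of why gradients vanish or explode.

```python
import math

def clip_by_norm(grads, max_norm):
    """Rescale the gradient vector so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

exploding = [30.0, -40.0]                        # L2 norm 50
clipped = clip_by_norm(exploding, max_norm=5.0)
print(clipped)                                   # same direction, norm reduced to 5
small = clip_by_norm([0.3, 0.4], max_norm=5.0)
print(small)                                     # already below the threshold: unchanged

# Toy view of vanishing/exploding gradients: backpropagating through
# n time steps multiplies n local factors (here one scalar per step).
for factor in (0.5, 1.5):
    print(factor, factor ** 50)                  # tiny for 0.5, huge for 1.5
```

Clipping only bounds the size of an update; it does not help with vanishing gradients, which is where gated models come in.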

### Different formulations of $R$ and $O$

Different formulations of $R$ and $O$ lead to different instantiations of RNNs:

* LSTM (Long Short-Term Memory) ([Hochreiter and Schmidhuber, 1997](https://www.bioinf.jku.at/publications/older/2604.pdf))
* GRU (Gated Recurrent Units) ([Cho et al., 2014](https://arxiv.org/abs/1409.1259))


#### LSTM 

Memory cells (read, write, forget) with gates that control the information flow. The gates make continuous (soft) decisions: how much to read, write, or forget.

This helps to keep information over longer time spans.

* [excellent introduction to LSTMs](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
* [article on dropout in RNNs](https://arxiv.org/abs/1409.2329)
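A rough sketch of a single LSTM time step in plain Python, following the commonly used formulation with forget, input and output gates. The parameter layout, the helper names (`affine`, `make_lstm_params`, `lstm_step`) and the random initialisation are assumptions for illustration; this is not a trained or optimised implementation.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, v):
    """Compute W v + b for a matrix W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + b_k for row, b_k in zip(W, b)]

def make_lstm_params(input_dim, hidden_dim, seed=0):
    """One weight matrix and bias per gate, acting on the concatenation [h_prev; x]."""
    rng = random.Random(seed)
    def matrix():
        return [[rng.uniform(-0.5, 0.5) for _ in range(hidden_dim + input_dim)]
                for _ in range(hidden_dim)]
    return {name: (matrix(), [0.0] * hidden_dim)
            for name in ("forget", "input", "output", "candidate")}

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step; returns the new hidden state h and memory cell c."""
    hx = h_prev + x  # list concatenation [h_prev; x]
    f = [sigmoid(z) for z in affine(*params["forget"], hx)]        # how much to keep
    i = [sigmoid(z) for z in affine(*params["input"], hx)]         # how much to write
    o = [sigmoid(z) for z in affine(*params["output"], hx)]        # how much to read
    g = [math.tanh(z) for z in affine(*params["candidate"], hx)]   # what to write
    c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c

params = make_lstm_params(input_dim=2, hidden_dim=3)
h, c = [0.0] * 3, [0.0] * 3
for x in [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]:
    h, c = lstm_step(x, h, c, params)
print(h)
print(c)
```

Because the gates are sigmoids with values between 0 and 1, each decision is soft. The additive update of `c` is what lets information survive over many steps.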

### Example: Name Classification

Predicting a class label from the last RNN state:
<img src="pics/many2one.png" width="20%">



[Classifying names with a character-level RNN, by Sean Robertson](https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html) NB: takes around 5 minutes to run on a 4-core CPU.
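To make the many-to-one setup concrete without PyTorch, here is a small untrained sketch: a name is read character by character, and only the last state is passed to a softmax layer. The label set `CLASSES`, the hidden size and the random weights are all hypothetical, so the printed probabilities are meaningless until the parameters are trained; the linked tutorial shows how to do that.

```python
import math
import random
import string

ALPHABET = string.ascii_lowercase
CLASSES = ["English", "German", "Italian"]   # hypothetical label set
HIDDEN = 8                                   # hypothetical state size

def one_hot(ch):
    """Encode a character as a one-hot vector over ALPHABET (all zeros if unknown)."""
    return [1.0 if ch == a else 0.0 for a in ALPHABET]

def matvec(W, v):
    """Multiply matrix W (a list of rows) with vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.3, 0.3) for _ in range(cols)] for _ in range(rows)]

def softmax(z):
    """Turn scores into probabilities (shifted by the max for numerical stability)."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
W_x = rand_matrix(HIDDEN, len(ALPHABET), rng)    # input-to-state weights
W_s = rand_matrix(HIDDEN, HIDDEN, rng)           # state-to-state weights
W_out = rand_matrix(len(CLASSES), HIDDEN, rng)   # last-state-to-class weights

def classify(name):
    """Read the name one character at a time; predict from the last state only."""
    s = [0.0] * HIDDEN
    for ch in name.lower():
        x = one_hot(ch)
        s = [math.tanh(a + b) for a, b in zip(matvec(W_x, x), matvec(W_s, s))]
    probs = softmax(matvec(W_out, s))
    return dict(zip(CLASSES, probs))

print(classify("Plank"))
print(classify("Schmidhuber"))   # a longer name goes through the same code
```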


### RNNs:  **sequences** of input, output, or both

Karpathy's illustration of RNNs:
<img src="https://tensorlayer.readthedocs.io/en/stable/_images/karpathy_rnn.jpeg">

<!--* From left to right: (1) Vanilla mode of processing without RNN, from fixed-sized input to fixed-sized output (e.g. image classification). (2) Sequence output (e.g. image captioning takes an image and outputs a sentence of words). (3) Sequence input (e.g. sentiment analysis where a given sentence is classified as expressing positive or negative sentiment). (4) Sequence input and sequence output (e.g. Machine Translation: an RNN reads a sentence in English and then outputs a sentence in French). (5) Synced sequence input and output (e.g. video classification where we wish to label each frame of the video). Notice that in every case are no pre-specified constraints on the lengths sequences because the recurrent transformation (green) is fixed and can be applied as many times as we like.*-->

##### One to many: Image caption generation, [Karpathy and Li (2014)](http://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Karpathy_Deep_Visual-Semantic_Alignments_2015_CVPR_paper.pdf)
<img src="pics/karpathy-li-2014.png" width=500>

##### Many to many: sequence to sequence Machine Translation ([Sutskever et al., 2014](http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf))
<img src="pics/seq2seq.png">

##### Many to many: POS tagging

Bidirectional RNN [(Plank et al., 2016)](https://www.aclweb.org/anthology/P16-2067):

<img src="pics/bilstm.png">
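The bidirectional idea can be sketched as two independent RNNs, one reading left to right and one right to left, whose states are concatenated per token. This is a simplified sketch with a plain tanh recurrence instead of an LSTM and with random, untrained weights; the helper names are invented for illustration and are not taken from Plank et al. (2016). For tagging, each concatenated vector would then be passed to an output layer that predicts one tag per token.

```python
import math
import random

def matvec(W, v):
    """Multiply matrix W (a list of rows) with vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def make_step(input_dim, state_dim, rng):
    """Create a simple tanh RNN step function with its own random parameters."""
    W_x = [[rng.uniform(-0.5, 0.5) for _ in range(input_dim)] for _ in range(state_dim)]
    W_s = [[rng.uniform(-0.5, 0.5) for _ in range(state_dim)] for _ in range(state_dim)]
    def step(s_prev, x):
        return [math.tanh(a + b) for a, b in zip(matvec(W_x, x), matvec(W_s, s_prev))]
    return step

def run(step, xs, state_dim):
    """Return the state after each element of xs."""
    s, states = [0.0] * state_dim, []
    for x in xs:
        s = step(s, x)
        states.append(s)
    return states

def birnn(xs, fwd_step, bwd_step, state_dim):
    """Concatenate, per position, the left-to-right and right-to-left states."""
    forward = run(fwd_step, xs, state_dim)
    backward = run(bwd_step, xs[::-1], state_dim)[::-1]  # re-align with positions
    return [f + b for f, b in zip(forward, backward)]

rng = random.Random(0)
STATE = 3
fwd, bwd = make_step(2, STATE, rng), make_step(2, STATE, rng)
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three toy token vectors
reps = birnn(sentence, fwd, bwd, STATE)
print(len(reps), len(reps[0]))   # one vector of size 2 * STATE per token
```

Each token's representation now contains information about both its left and its right context.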

## References

* D. Rao & B. McMahan's NLP with PyTorch (chapters 6 and 7).
* [Karpathy's blog on RNNs](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)    