# Recurrent Neural Networks

* so far our vanilla NNs expected a fixed-length input and predicted a fixed-length output (e.g., sentiment classification example, animacy detection)

* we are now looking at neural networks that **handle variable length input**


* the exciting idea behind **Recurrent Neural Networks** (RNNs) is that they allow us to work on **sequences** of input, output, or both

What do we mean by a *variable length input*?

Variable-length input means that the inputs $\mathbf{x}$ are sequences whose lengths can differ from one instance to the next.

For instance, the first training instance has $l$ dimensions (say, $l$ tokens), while the second has $m$ dimensions ($m$ tokens).

Mathematically, inspired by the notation of Cho (2015):
$$ \mathbf{x_1} = \langle x_1, ..., x_l \rangle$$ 
$$ \mathbf{x_2} = \langle x_1, ..., x_m \rangle$$
where $l\neq m$
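
To see why this is a problem for the vanilla networks used so far, here is a minimal sketch. The weight matrix `W` and the two all-ones inputs are made up for illustration; the point is only that a dense layer built for 4 input dimensions cannot be applied to an input of length 6.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))   # a dense layer expecting exactly 4 input dimensions

x_1 = np.ones(4)   # length l = 4
x_2 = np.ones(6)   # length m = 6

h_1 = W @ x_1      # works: result has shape (3,)
try:
    h_2 = W @ x_2  # fails: shapes (3, 4) and (6,) do not align
    fits = True
except ValueError:
    fits = False

print(h_1.shape, fits)  # (3,) False
```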


### Detour: very simple example

From Cho (2015). Assume we have a vector $\mathbf{x}$ containing zeros and ones. We want to count the number of 1s.



In [1]:
def add1(el, s):
    # return the updated memory s after seeing one input symbol el
    if el == 1:
        return s + 1
    return s

In [2]:
v = [0, 1, 0, 0, 1, 1]
s = 0  # memory: number of 1s seen so far
for el in v:
    s = add1(el, s)
print("count(1):", s)

count(1): 3


Two important components:
* memory $s$
* function `add1` is applied to each symbol in input *one at a time* together with memory $s$

$\rightarrow$ input of any length
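
The same pattern (a memory plus a step function applied to one symbol at a time) can be written as a fold over the sequence. The sketch below uses `functools.reduce`; note that the step function here takes the memory first and the symbol second, which is the argument order `reduce` expects, and that `count_ones` is a name introduced only for this illustration.

```python
from functools import reduce

def add1(s, el):
    """Step function: combine the current memory s with one input symbol el."""
    return s + 1 if el == 1 else s

def count_ones(v):
    # Fold the step function over the sequence, starting from memory s = 0.
    return reduce(add1, v, 0)

print(count_ones([0, 1, 0, 0, 1, 1]))  # 3
print(count_ones([1, 1]))              # 2
print(count_ones([]))                  # 0
```

Nothing in `count_ones` depends on the length of `v`, which is exactly the property we want from a network that reads sequences.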

### Sequences

In language technology we often work with sequences, e.g., sequences of words or sentences: $$ \mathbf{x_2} = \langle x_1, ..., x_m \rangle$$

#### Approach 1: We have already seen one approach to handle such sequences

Simply take the **mean** of all word vectors in a sentence (e.g., CBOW, see Goldberg's primer). But what about word order? Averaging discards it.
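
A small sketch of why the mean ignores word order. The 3-dimensional word vectors in `emb` are invented for illustration and do not come from any trained model.

```python
import numpy as np

# Hypothetical word vectors, made up for this example.
emb = {
    "dog":   np.array([1.0, 0.0, 2.0]),
    "bites": np.array([0.0, 3.0, 1.0]),
    "man":   np.array([2.0, 1.0, 0.0]),
}

def mean_vector(tokens):
    # Average the word vectors of a sentence into one fixed-length vector.
    return np.mean([emb[t] for t in tokens], axis=0)

a = mean_vector(["dog", "bites", "man"])
b = mean_vector(["man", "bites", "dog"])
print(np.allclose(a, b))  # True
```

Both sentences map to the same vector although they mean different things.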

#### Approach 2: Recurrent neural networks (RNNs)

### RNNs (Elman, 1990): Dependence on previous step

* RNNs are called **recurrent** because they apply the same function at every position of the sequence, and each step depends on the state computed at the previous step (i.e., like having a *memory* of what has been seen so far)


More formally, following the notation in Goldberg (2015):

* $\mathbf{x_{1:n}}$ input sequence
* $\mathbf{s_0}$ starting state (initial state)
* function $R$ ("memory up so far") that takes the previous state vector $\mathbf{s_{i-1}}$ and the input vector $\mathbf{x_i}$ and produces a new state $\mathbf{s_i}$
* function $O$ maps from state $\mathbf{s_i}$ to output $\mathbf{y_i}$
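
The components listed above can be sketched in NumPy as follows. This is one possible instantiation, not the only one: it assumes an Elman-style update $R(s, x) = \tanh(W_s s + W_x x + b)$ and takes $O$ to be the identity. The parameter names `W_s`, `W_x`, `b` and the dimensions are chosen for illustration.

```python
import numpy as np

def R(s_prev, x, W_s, W_x, b):
    """State update: new state from the previous state and the current input."""
    return np.tanh(W_s @ s_prev + W_x @ x + b)

def O(s):
    """Output function; in this sketch simply the identity on the state."""
    return s

def rnn(xs, s0, W_s, W_x, b):
    # Apply R once per input vector, carrying the state along.
    s = s0
    ys = []
    for x in xs:
        s = R(s, x, W_s, W_x, b)
        ys.append(O(s))
    return ys, s

rng = np.random.default_rng(0)
d_in, d_state = 3, 4
W_s = rng.normal(scale=0.1, size=(d_state, d_state))
W_x = rng.normal(scale=0.1, size=(d_state, d_in))
b = np.zeros(d_state)
s0 = np.zeros(d_state)

short_seq = rng.normal(size=(2, d_in))   # sequence of 2 input vectors
long_seq = rng.normal(size=(7, d_in))    # sequence of 7 input vectors
for xs in (short_seq, long_seq):
    ys, s_n = rnn(xs, s0, W_s, W_x, b)
    print(len(ys), s_n.shape)
# 2 (4,)
# 7 (4,)
```

The same parameters handle both sequences; only the number of loop iterations changes.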

Formulation of an RNN (Goldberg 2015): <img src="pics/rnn0.png" width=400> 

Graphical representation of an RNN (Goldberg 2015): <img src="pics/rnn1.png" width=400> 

### Unrolling over time

<img src="pics/rnn2.png">

Note: the parameters $\theta$ are shared across all time steps!

##### Expansion at state 4: 
<img src="pics/rnn3.png">
Note: $s_i$ depends on the initial state $s_0$ and on all inputs $x_1, \ldots, x_i$ seen so far.
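
Written out as a formula, and assuming the recurrence $s_i = R(s_{i-1}, x_i)$ (the parameters $\theta$ are left implicit), the expansion at state 4 reads:

$$ s_4 = R(s_3, x_4) = R(R(s_2, x_3), x_4) = R(R(R(s_1, x_2), x_3), x_4) = R(R(R(R(s_0, x_1), x_2), x_3), x_4) $$

Every input $x_1, \ldots, x_4$ and the initial state $s_0$ appear in the final expression, and the same function $R$ is applied at each level of nesting.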

##### Another visualization (LeCun et al., 2015)

<img src="http://d3kbpzbmcynnmx.cloudfront.net/wp-content/uploads/2015/09/rnn.jpg" alt="illustration from WildML">
A recurrent neural network and the unfolding in time of the computation involved in its forward computation (LeCun et al., 2015).

However, basic RNNs tend not to work well beyond a few recent time steps because of vanishing or exploding gradients. One trick against exploding gradients is gradient clipping; otherwise, alternative models are used.
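
Gradient clipping can be sketched as follows. This version rescales all gradients jointly when their combined L2 norm exceeds a threshold; it is one common variant (clipping each value element-wise is another), and the function name `clip_by_global_norm` and the toy gradients are chosen for illustration.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm <= max_norm:
        return grads, total_norm          # small enough: leave untouched
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm

# Toy gradients with joint norm sqrt(9 + 16 + 144) = 13.
grads = [np.array([3.0, 4.0]), np.array([[12.0]])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=5.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm_before, norm_after)  # the norm drops from 13 to (approximately) 5
```

The direction of the update is preserved; only its length is capped. Clipping does nothing against vanishing gradients, which is what motivates the gated models below.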

### Different formulations of R,O

lead to different instantiations of RNNs:

* LSTM (Long Short-Term Memory) (Hochreiter and Schmidhuber, 1997)
* GRU (Gated Recurrent Units) (Cho et al., 2014)


#### LSTM 

Memory cells (read, write, forget) with gates that control the information flow (continuous decisions: maybe read/write/forget).

This helps to keep information for longer.
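
A single LSTM step can be sketched in NumPy as below. This follows the commonly used formulation with input, forget and output gates; stacking the four parameter blocks into `W`, `U` and `b`, as well as all names and dimensions, are choices made for this illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b stack the parameters of the four blocks
    (input gate, forget gate, output gate, candidate) along the first axis."""
    d = h_prev.shape[0]
    z = W @ x + U @ h_prev + b       # shape (4 * d,)
    i = sigmoid(z[0:d])              # input gate: how much to write
    f = sigmoid(z[d:2 * d])          # forget gate: how much of the old cell to keep
    o = sigmoid(z[2 * d:3 * d])      # output gate: how much to read
    g = np.tanh(z[3 * d:4 * d])      # candidate content to write
    c = f * c_prev + i * g           # new memory cell
    h = o * np.tanh(c)               # new hidden state
    return h, c

rng = np.random.default_rng(0)
d_in, d = 3, 4
W = rng.normal(scale=0.1, size=(4 * d, d_in))
U = rng.normal(scale=0.1, size=(4 * d, d))
b = np.zeros(4 * d)

h, c = np.zeros(d), np.zeros(d)
for x in rng.normal(size=(5, d_in)):
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape, c.shape)  # (4,) (4,)
```

Because the gates are sigmoids, each decision is a value between 0 and 1 rather than a hard yes/no. When the forget gate is close to 1 and the input gate close to 0, the cell `c` is copied almost unchanged to the next step, which is how information survives over many steps.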

* excellent introduction to LSTMs: [http://colah.github.io/posts/2015-08-Understanding-LSTMs/](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
* [article on dropout in RNNs](https://arxiv.org/abs/1409.2329)

### Example: Sentiment Analysis

Predicting a class label from the last LSTM state (sometimes also known as LSTM **acceptor**, cf. Goldberg, 2015):
<img src="pics/many2one.png">



Implementation: imdb_lstm example from Keras
https://github.com/fchollet/keras/blob/master/examples/imdb_lstm.py  (NB: takes time to run!!)

**N.B.** The example has only a train/test split; don't use the test set for validation, use a dedicated dev set instead.
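
One way to carve a dev set out of the training data is sketched below. The helper `train_dev_split`, the 10% default and the toy arrays are assumptions for illustration; they are not part of the Keras example.

```python
import numpy as np

def train_dev_split(x, y, dev_fraction=0.1, seed=1337):
    """Shuffle the indices once and hold out dev_fraction of the examples as dev set."""
    x, y = np.asarray(x), np.asarray(y)
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    n_dev = int(len(x) * dev_fraction)
    dev_idx, train_idx = idx[:n_dev], idx[n_dev:]
    return (x[train_idx], y[train_idx]), (x[dev_idx], y[dev_idx])

# Toy stand-in for the padded training data.
x_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100) % 2
(x_tr, y_tr), (x_dev, y_dev) = train_dev_split(x_toy, y_toy, dev_fraction=0.25)
print(x_tr.shape, x_dev.shape)  # (75, 2) (25, 2)
```

Applied to the padded `x_train`/`y_train`, the resulting dev portion could then be passed as `validation_data` to `model.fit`, keeping the test set for a single final evaluation.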

In [4]:
### 
'''Trains a LSTM on the IMDB sentiment classification task.
The dataset is actually too small for LSTM to be of any advantage
compared to simpler, much faster methods such as TF-IDF+LogReg.
Notes:
- RNNs are tricky. Choice of batch size is important,
choice of loss and optimizer is critical, etc.
Some configurations won't converge.
- LSTM loss decrease patterns during training can be quite different
from what you see with CNNs/MLPs/etc.
'''
from __future__ import print_function
import numpy as np
np.random.seed(1337)  # for reproducibility

from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Embedding
from keras.layers import LSTM, SimpleRNN, GRU  # SimpleRNN/GRU: alternatives to try in place of LSTM
from keras.datasets import imdb

max_features = 20000
maxlen = 80  # cut texts after this number of words (among top max_features most common words)
# --> typically get this from data

batch_size = 32

print('Loading data...')
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

print('Pad sequences (samples x time)')
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

print('Build model...')
model = Sequential()
model.add(Embedding(max_features, 128))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))

# try using different optimizers and different optimizer configs
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

print('Train...')
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=2,  # reduced from 15 to 2 just to save time
          validation_data=(x_test, y_test))
score, acc = model.evaluate(x_test, y_test,
                            batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)

Loading data...
25000 train sequences
25000 test sequences
Pad sequences (samples x time)
x_train shape: (25000, 80)
x_test shape: (25000, 80)
Build model...
Train...
Train on 25000 samples, validate on 25000 samples
Epoch 1/2
Epoch 2/2
Test accuracy: 0.83304
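
The shapes `(25000, 80)` in the output come from `pad_sequences`, which turns the variable-length reviews into one fixed-size matrix so that they can be processed in batches. The sketch below mimics what that call does with its default settings (pad with 0 at the front, truncate from the front); `pad_pre` is a hypothetical helper written for illustration, not a Keras function.

```python
import numpy as np

def pad_pre(seqs, maxlen, value=0):
    """Left-pad (or left-truncate) each sequence of ints to exactly maxlen."""
    out = np.full((len(seqs), maxlen), value, dtype=np.int64)
    for row, seq in enumerate(seqs):
        kept = seq[-maxlen:]                # keep only the last maxlen items
        if kept:
            out[row, -len(kept):] = kept    # right-align, padding on the left
    return out

print(pad_pre([[1, 2, 3], [4, 5, 6, 7, 8, 9]], maxlen=4))
# [[0 1 2 3]
#  [6 7 8 9]]
```

With padding at the front, the real tokens sit right before the last LSTM state that is used for the prediction.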



### RNNs:  **sequences** of input, output, or both

Karpathy's illustration of RNNs:
<img src="http://benjaminbolte.com/resources/attention_rnn/karpathy_rnn.jpeg">

*From left to right: (1) Vanilla mode of processing without RNN, from fixed-sized input to fixed-sized output (e.g. image classification). (2) Sequence output (e.g. image captioning takes an image and outputs a sentence of words). (3) Sequence input (e.g. sentiment analysis where a given sentence is classified as expressing positive or negative sentiment). (4) Sequence input and sequence output (e.g. Machine Translation: an RNN reads a sentence in English and then outputs a sentence in French). (5) Synced sequence input and output (e.g. video classification where we wish to label each frame of the video). Notice that in every case there are no pre-specified constraints on the lengths of the sequences, because the recurrent transformation (green) is fixed and can be applied as many times as we like.*
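
Modes (3) and (5) differ only in which states are passed on to the output layer, as the following sketch tries to show. It assumes a plain tanh recurrence and a linear output layer `W_out` with two classes; all parameters are random and all names are chosen for illustration.

```python
import numpy as np

def step(s_prev, x, W_s, W_x):
    # The recurrent transformation: identical at every time step.
    return np.tanh(W_s @ s_prev + W_x @ x)

def run(xs, W_s, W_x):
    """Return the list of all states s_1, ..., s_n for an input sequence."""
    s = np.zeros(W_s.shape[0])
    states = []
    for x in xs:
        s = step(s, x, W_s, W_x)
        states.append(s)
    return states

rng = np.random.default_rng(0)
W_s = rng.normal(scale=0.1, size=(4, 4))
W_x = rng.normal(scale=0.1, size=(4, 3))
W_out = rng.normal(scale=0.1, size=(2, 4))   # hypothetical output layer, 2 classes

xs = rng.normal(size=(5, 3))                 # a sequence of 5 input vectors
states = run(xs, W_s, W_x)

many_to_one = W_out @ states[-1]             # mode (3): one prediction per sequence
many_to_many = [W_out @ s for s in states]   # mode (5): one prediction per time step
print(many_to_one.shape, len(many_to_many))  # (2,) 5
```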

##### One to many: Image caption generation, Karpathy and Li (2014)
<img src="pics/karpathy-li-2014.png" width=500>
http://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Karpathy_Deep_Visual-Semantic_Alignments_2015_CVPR_paper.pdf

##### Many to many: sequence to sequence Machine Translation (Sutskever et al., 2014)
<img src="pics/seq2seq.png">
http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf

##### Many to many: POS tagging

Bidirectional RNN (biRNN/biLSTM) (Plank et al., 2016):

<img src="pics/bilstm.png">
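
The bidirectional idea can be sketched as follows: one RNN reads the sequence left to right, a second one reads it right to left, and the two states for each position are concatenated. For brevity the sketch uses a plain tanh recurrence rather than an LSTM; the parameters are random and all names are chosen for illustration.

```python
import numpy as np

def run_rnn(xs, W_s, W_x):
    """Return the states of a simple tanh RNN, one per input position."""
    s = np.zeros(W_s.shape[0])
    states = []
    for x in xs:
        s = np.tanh(W_s @ s + W_x @ x)
        states.append(s)
    return states

def birnn(xs, fwd_params, bwd_params):
    fwd = run_rnn(xs, *fwd_params)                # reads x_1 ... x_n
    bwd = run_rnn(xs[::-1], *bwd_params)[::-1]    # reads x_n ... x_1, then re-aligned
    # Position i now sees its left context (fwd) and its right context (bwd).
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

rng = np.random.default_rng(0)
d_in, d = 3, 4
fwd_params = (rng.normal(scale=0.5, size=(d, d)), rng.normal(scale=0.5, size=(d, d_in)))
bwd_params = (rng.normal(scale=0.5, size=(d, d)), rng.normal(scale=0.5, size=(d, d_in)))

xs = rng.normal(size=(6, d_in))                   # e.g. 6 tokens
reps = birnn(xs, fwd_params, bwd_params)
print(len(reps), reps[0].shape)  # 6 (8,)
```

For tagging, each of the 6 concatenated vectors would then be fed to an output layer that predicts the tag of the corresponding token.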

## References

* [Kyunghyun Cho's excellent lecture notes, chapter 4](http://arxiv.org/abs/1511.07916)
* [Karpathy's blog on RNNs](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)
* [WildML blog](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/)
    