# Course 5. Sequence models

## Recurrent neural networks

### Why sequence models

Speech recognition (input audio clip, output sequence of words), music generation (no input, output notes), sentiment classification (input sentence, output rating), DNA sequence analysis (input string of AGCT ... output sequence of matching protein), machine translation, video activity recognition, named entity recognition, etc.

It's also interesting to note that RNNs were first introduced in 1986. (https://en.wikipedia.org/wiki/Recurrent_neural_network)

### Notation

Named entity recognition example:
<img src="imgs/name_entity_recognition.png">

Notation:

$x^{(i)}$ - $i$-th training example 

$x^{(i)<t>}$ - $t$-th element in the sequence of $i$-th training example

$T_{x}^{(i)}$ - length of the $i$-th training example sequence

$y^{(i)<t>}$ - $t$-th element of the output sequence of $i$-th training example

$T_{y}^{(i)}$ - length of the output sequence of $i$-th training example

<img src="imgs/representing_words.png">


### RNN

Why not use a standard NN to solve, for example, named entity recognition on a sentence?
1. Input sentences won't always have the same length
2. Even if we find a way around that (for example by defining a maximum length for the input layer), this type of model doesn't share learned features across different positions in the text

A better way to approach this problem is with recurrent neural nets.
<img src="imgs/rnn.png">

We define the notation $w_{ax}$ so that the first subscript letter indicates what is being computed (in this example the activation $a$) and the second subscript letter indicates what the parameters are multiplied by (in this example the input vector $x$).

<img src="imgs/forwardproprnn.png">

To simplify the notation a bit, we stack $w_{aa}$ and $w_{ax}$ side by side into a single matrix $w_a = [w_{aa} \mid w_{ax}]$. This also means we have to stack $a_{t-1}$ and $x_t$ into a single column vector $[a_{t-1}, x_t]$, so that $w_a [a_{t-1}, x_t] = w_{aa} a_{t-1} + w_{ax} x_t$.
<img src="imgs/simplified_rnn_notation.png">

We end up with the following notation:
\begin{align}
a_t = g(w_a [a_{t-1}, x_t] + b_a) \\
y_t = g(w_y a_t + b_y)
\end{align}
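
The stacked form can be checked numerically. The following is a minimal NumPy sketch, not the course's implementation: it assumes tanh for the hidden activation and softmax for the output (the notes leave $g$ unspecified), and the sizes `n_a`, `n_x`, `n_y` are hypothetical.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def rnn_step(a_prev, x_t, w_a, b_a, w_y, b_y):
    # Stack a_{t-1} and x_t into one vector, then apply w_a once.
    stacked = np.concatenate([a_prev, x_t])
    a_t = np.tanh(w_a @ stacked + b_a)   # assumption: g = tanh for the hidden state
    y_t = softmax(w_y @ a_t + b_y)       # assumption: g = softmax for the output
    return a_t, y_t

rng = np.random.default_rng(0)
n_a, n_x, n_y = 4, 3, 2                  # hypothetical sizes
w_aa = rng.normal(size=(n_a, n_a))
w_ax = rng.normal(size=(n_a, n_x))
w_a = np.hstack([w_aa, w_ax])            # w_aa and w_ax placed side by side
b_a = np.zeros(n_a)
w_y = rng.normal(size=(n_y, n_a))
b_y = np.zeros(n_y)

a_prev = np.zeros(n_a)
x_t = rng.normal(size=n_x)
a_t, y_t = rnn_step(a_prev, x_t, w_a, b_a, w_y, b_y)
```

Applying `w_a` to the stacked vector gives the same result as computing `w_aa @ a_prev + w_ax @ x_t` separately, which is the whole point of the simplified notation.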

### Backpropagation through time

Forward propagation runs from left to right through the sequence. Backpropagation then runs in the opposite direction, from the last time step back to the first, propagating gradients through the earlier time steps. That is why the RNN backprop algorithm is called backpropagation through time.
<img src="imgs/back_prop_through_time.png">


### Different types of RNN

Reference: http://karpathy.github.io/2015/05/21/rnn-effectiveness/

There are several types of RNN architectures:
* one to one - standard nn
* one to many - music generation (for example, given a small set of notes or a genre, the NN generates a piece of music)
* many to one - sentiment classification (for example whether a movie review is positive or negative)
* many to many where $T_x = T_y$ - for example named entity recognition
* many to many where $T_x \neq T_y$ - machine translation

<img src="imgs/rnn_types3.png">


### Language Model and sequence generation

If you say "The apple and pair salad" vs "The apple and pear salad", a good speech recognition system should be able to tell the difference and figure out that the second sentence makes much more sense. The system decides which sentence to use with a language model, i.e. a model that assigns each sentence a probability.

A language model here is an RNN that predicts the next word given the preceding words. Building one requires a large text corpus and a tokenization step, with the additional tokens EOS (end of sentence) and UNK (unknown word).

<img src="imgs/language_model.png">
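
The tokenization step can be sketched as follows. This is only an illustration: the `tokenize` helper, the `<UNK>`/`<EOS>` spellings, and the tiny vocabulary are hypothetical, and a real pipeline would build the vocabulary from the corpus and handle punctuation.

```python
def tokenize(sentence, vocab):
    # Lowercase, split on whitespace, and map out-of-vocabulary words to <UNK>.
    tokens = [w if w in vocab else "<UNK>" for w in sentence.lower().split()]
    # Append <EOS> so the model can learn where sentences end.
    return tokens + ["<EOS>"]

vocab = {"the", "apple", "and", "pear", "salad"}   # hypothetical tiny vocabulary
tokens = tokenize("The apple and kiwi salad", vocab)
# "kiwi" is not in the vocabulary, so it becomes <UNK>.
```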

### Sampling novel sequences

<img src="imgs/sampling_from_lm.png">
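
Sampling means drawing each next word from the predicted distribution and feeding it back in as the next input, until the end-of-sentence token appears. Below is a rough sketch of that loop, assuming a hypothetical four-word vocabulary; a fixed table of next-word probabilities stands in for the trained RNN, which would also carry a hidden state between steps.

```python
import numpy as np

vocab = ["the", "cat", "sat", "<EOS>"]            # hypothetical vocabulary
# Hypothetical stand-in for a trained model: row i holds P(next word | word i).
probs = np.array([
    [0.0, 0.9, 0.1, 0.0],
    [0.0, 0.0, 0.8, 0.2],
    [0.1, 0.0, 0.0, 0.9],
    [0.0, 0.0, 0.0, 1.0],
])

def sample_sequence(start, rng, max_len=20):
    idx = vocab.index(start)
    words = [start]
    while len(words) < max_len:
        # Draw the next word from the predicted distribution, then feed it back in.
        idx = rng.choice(len(vocab), p=probs[idx])
        if vocab[idx] == "<EOS>":
            break
        words.append(vocab[idx])
    return words

sentence = sample_sequence("the", np.random.default_rng(0))
```

The `max_len` cap guards against sequences that never reach the end-of-sentence token.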


### Vanishing gradients with RNNs

As with deep neural nets, gradients can grow (explode) or shrink (vanish) exponentially. To address the exploding gradient problem we can use gradient clipping. To address the vanishing gradient problem we introduce the Gated Recurrent Unit ... 
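
One way to clip gradients is to rescale them whenever their overall norm exceeds a threshold. The sketch below assumes this norm-based variant (clipping each element to a fixed range is another common choice); `clip_by_global_norm` and `max_norm` are illustrative names.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    # Compute the L2 norm over all gradient arrays together.
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm <= max_norm:
        return grads
    # Rescale every gradient by the same factor so the direction is preserved.
    scale = max_norm / total_norm
    return [g * scale for g in grads]

grads = [np.array([3.0, 4.0]), np.array([12.0])]   # global norm is 13
clipped = clip_by_global_norm(grads, max_norm=5.0)
```

Clipping limits the size of a single update but does nothing for vanishing gradients, which is why a different unit design is needed for that problem.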

### Gated Recurrent Unit GRU (watch again)

This is what an RNN unit looks like:
<img src="imgs/rnn_unit.png">

Now let's explain the gated recurrent unit (references [On the Properties of Neural Machine Translation: Encoder-Decoder Approaches](https://arxiv.org/abs/1409.1259), [Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling](https://arxiv.org/abs/1412.3555)).

**The cat, which already ate ... was full**

In this example we want our network to remember that the subject of the sentence (cat) was singular and, based on that, to generate ... _was full_ rather than _were full_.

<img src="imgs/gru_unit.png">

Full GRU

<img src="imgs/full_gru_unit.png">
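
A single GRU step can be sketched in NumPy as below. This assumes the common formulation with an update gate and a reset (relevance) gate; the parameter names in the dictionary `p` and the sizes are hypothetical, and the figures above remain the reference for the course's exact notation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(c_prev, x_t, p):
    concat = np.concatenate([c_prev, x_t])
    gamma_u = sigmoid(p["w_u"] @ concat + p["b_u"])   # update gate
    gamma_r = sigmoid(p["w_r"] @ concat + p["b_r"])   # reset (relevance) gate
    # The candidate memory uses the reset-gated previous state.
    c_tilde = np.tanh(p["w_c"] @ np.concatenate([gamma_r * c_prev, x_t]) + p["b_c"])
    # The update gate blends the candidate with the old memory.
    return gamma_u * c_tilde + (1.0 - gamma_u) * c_prev

rng = np.random.default_rng(0)
n_c, n_x = 4, 3                                       # hypothetical sizes
p = {w: rng.normal(size=(n_c, n_c + n_x)) for w in ("w_u", "w_r", "w_c")}
p.update({b: np.zeros(n_c) for b in ("b_u", "b_r", "b_c")})
c_prev = rng.uniform(-1, 1, size=n_c)
c_t = gru_step(c_prev, rng.normal(size=n_x), p)

# With a strongly negative update-gate bias the gate is near 0 and the memory is kept.
p_keep = dict(p, b_u=np.full(n_c, -50.0))
c_kept = gru_step(c_prev, rng.normal(size=n_x), p_keep)
```

When the update gate stays close to 0, the memory cell is copied almost unchanged from step to step. That is how the unit can carry "the subject was singular" across a long clause, and it is also what helps against vanishing gradients.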


### Long Short Term Memory (LSTM)

<img src="imgs/lstm.png">
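
For comparison, here is a sketch of one LSTM step, assuming the standard formulation without peephole connections; parameter names and sizes are hypothetical. Unlike the GRU, the LSTM uses separate forget and update gates for the old and new memory, plus an output gate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(a_prev, c_prev, x_t, p):
    concat = np.concatenate([a_prev, x_t])
    gamma_f = sigmoid(p["w_f"] @ concat + p["b_f"])   # forget gate
    gamma_u = sigmoid(p["w_u"] @ concat + p["b_u"])   # update (input) gate
    gamma_o = sigmoid(p["w_o"] @ concat + p["b_o"])   # output gate
    c_tilde = np.tanh(p["w_c"] @ concat + p["b_c"])   # candidate memory
    # Old and new memory are weighted by independent gates.
    c_t = gamma_f * c_prev + gamma_u * c_tilde
    a_t = gamma_o * np.tanh(c_t)
    return a_t, c_t

rng = np.random.default_rng(0)
n_a, n_x = 4, 3                                       # hypothetical sizes
p = {w: rng.normal(size=(n_a, n_a + n_x)) for w in ("w_f", "w_u", "w_o", "w_c")}
p.update({b: np.zeros(n_a) for b in ("b_f", "b_u", "b_o", "b_c")})
a_t, c_t = lstm_step(np.zeros(n_a), np.zeros(n_a), rng.normal(size=n_x), p)
```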

Both GRU and LSTM have their advantages and disadvantages:
* GRU is a much simpler model, so it's easier to build a much bigger network
* LSTM is more powerful and flexible

Reference: https://colah.github.io/posts/2015-08-Understanding-LSTMs/

### Bidirectional RNN

<img src="imgs/bidirectional_rnn_1.png">
<img src="imgs/bidirectional_rnn_2.png">



### Deep RNNs

<img src="imgs/deep_rnn.png">

Compared with the deep neural nets we saw in previous lectures, 3 hidden layers is already a lot for an RNN because of the amount of computation needed.