# 1 - Sequence Models

Sequence models are a special form of neural networks that take their input as a sequence of tokens. They are often applied in ML tasks such as speech recognition, Natural Language Processing or bioinformatics (like processing DNA sequences).

Previously seen models processed some sort of input (e.g. images) which exhibited the following properties:
- It was uniform (e.g. an image of a certain size)
- It was processed as a whole (i.e. an image was not partially processed)
- It was often multidimensional (e.g. an image with 3 color channels yields a matrix of $H \times W \times 3$)

Sequence models are a bit different in that they require their input to be a sequence of tokens. Examples of such sequences are:

- audio data (sequence of sounds)
- text (sequence of words)
- video (sequence of images)
- $\cdots$

The individual input sequences do not need to have the same length (i.e. the same number of tokens), neither for training nor for prediction. The input tokens are processed one after the other, one token per time step. Processing can be stopped at any point. One form of sequence model is the Recurrent Neural Network (RNN), which is often used for speech data (e.g. speech recognition), machine translation, generative models (e.g. generating music) or NLP (e.g. sentiment analysis, named entity recognition (NER), etc.).

The notation for an input sequence $x$ of length $T_x$ or an output sequence $y$ of length $T_y$ is as follows (note the new notation with chevrons around the indices to enumerate the tokens):

$$x = x^{<1>}, x^{<2>}, \cdots, x^{<t>}, \cdots, x^{<T_x>}$$

$$y = y^{<1>}, y^{<2>}, \cdots, y^{<t>}, \cdots, y^{<T_y>}$$

As stated above, the input and the output sequence don't need to be of the same length $\left( T_x^{(i)} \neq T_y^{(i)} \right)$. Also the length of the individual training samples can vary $\left( T_y^{(i)} \neq T_y^{(j)} \right)$.
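
To make the notation concrete, the following minimal sketch computes $T_x^{(i)}$ and $T_y^{(i)}$ for two samples of a hypothetical translation task. The toy sentences and the choice of whitespace-separated words as tokens are assumptions made for illustration only.

```python
# Toy training set of (input sentence, output sentence) pairs.
# The sentences are illustrative assumptions, not taken from a real corpus.
samples = [
    ("the cat sleeps", "le chat dort"),
    ("thank you very much", "merci beaucoup"),
]

lengths = []
for i, (x_text, y_text) in enumerate(samples, start=1):
    x = x_text.split()  # tokens x^{<1>}, ..., x^{<T_x>}
    y = y_text.split()  # tokens y^{<1>}, ..., y^{<T_y>}
    lengths.append((len(x), len(y)))
    print(f"sample {i}: T_x={len(x)}, T_y={len(y)}")
```

In this toy set the first sample has $T_x^{(1)} = T_y^{(1)} = 3$, while the second has $T_x^{(2)} = 4$ and $T_y^{(2)} = 2$, so lengths differ both within a sample and across samples.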

# 2 - Recurrent Neural Networks

The previously seen approach of a NN with an input layer, several hidden layers and an output layer is a poor fit for sequence data for the following reasons:

- input and output can have different lengths for each sample (e.g. sentences with different numbers of words)
- features learned at one position are not shared with other positions (e.g. in NER, where the named entity can be at any position in the sentence)

Because RNNs process their input token-by-token, they don't suffer from these disadvantages. A simple RNN only has one layer through which the tokens pass during training/processing. However, the result of this processing has an influence on the processing of the next token. Consider the following sample architecture of a simple RNN:

<div style="text-align:center">
    <img src="images/rnn.png" width=800>
    <caption><center><font color="purple">Example of an RNN</font></center></caption>
</div>

A side effect of this kind of processing is that an RNN requires far fewer parameters to be optimized than e.g. a ConvNet would need for the same task. This especially comes in handy for sentence processing, where each word (token) can be a vector of dimension e.g. $10,000 \times 1$ (one-hot encoding from a vocabulary of 10,000 words).
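
A rough back-of-the-envelope count illustrates the effect of sharing weights across time steps. The hidden size of 100 and the comparison against a plain fully connected layer that reads a fixed-length sentence of 20 one-hot words at once are assumptions chosen for this sketch. They are not figures from the text, and the baseline is a dense layer rather than a ConvNet.

```python
vocab_size = 10_000   # one-hot dimension, as in the text
hidden_size = 100     # assumed number of hidden units
max_len = 20          # assumed fixed sentence length for the dense baseline

# Recurrent layer: W_aa (hidden x hidden), W_ax (hidden x vocab) and b_a,
# reused at every time step regardless of the sentence length.
rnn_hidden_params = (hidden_size * hidden_size
                     + hidden_size * vocab_size
                     + hidden_size)

# Dense baseline: one hidden layer that sees all max_len one-hot vectors at once.
dense_hidden_params = hidden_size * (max_len * vocab_size) + hidden_size

print(rnn_hidden_params, dense_hidden_params)
```

Under these assumptions the recurrent layer has 1,010,100 parameters, whereas the dense layer has 20,000,100 and would grow further with the maximum sentence length.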

## 2.1 - Unidirectional RNN

As seen in the picture above, the RNN processes each token $x^{<t>}$ individually from left to right, one after the other. In each step $t$, the RNN tries to predict the output $\hat{y}^{<t>}$ from the input token $x^{<t>}$ and the previous activation $a^{<t-1>}$. Two weight matrices, $W_{aa}$ and $W_{ax}$, determine how strongly the previous activation and the input token influence the new activation. There is also a matrix $W_{ya}$ that governs the output predictions. Those matrices are the same for each step, i.e. they are shared across all time steps of a sequence. This way the layer is reused recurrently to process the sequence. A single input token can therefore not only directly influence the output at a given time step, but also indirectly the output of subsequent steps (thus the term recurrent). Conversely, a single prediction at time step $t$ depends not only on a single input token, but also on the previously seen tokens (bidirectional RNNs extend this so that the following tokens are taken into consideration as well).

## 2.2 - Forward propagation

The activation $a^{<t>}$ and prediction $\hat{y}^{<t>}$ for a single time step $t$ can be calculated as follows (for the first token the zero vector is often used as the previous activation):

$$a^{<t>} = g_1 \left( W_{aa} a^{<t-1>} + W_{ax} x^{<t>} + b_a \right)$$

$$\hat{y}^{<t>} = g_2 \left( W_{ya} a^{<t>} + b_y \right) \tag{1}$$
    
Note that the activation functions $g_1$ and $g_2$ can be different. The activation function used to calculate the next activation $(g_1)$ is often $\text{Tanh}$ or $\text{ReLU}$. The activation function used to predict the output $(g_2)$ is often the $\text{Sigmoid}$ function for binary classification, or $\text{Softmax}$ otherwise. By convention, the first index of a weight matrix denotes the output quantity and the second index the input quantity. $W_{ax}$ for example means <i>"use the weights in $W$ to compute some output $a$ from input $x$".</i>
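
The two formulas in (1) can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the helper names (`matvec`, `add`, `softmax`, `rnn_step`), the toy dimensions and the weight values are all assumptions, with $g_1 = \text{Tanh}$ and $g_2 = \text{Softmax}$.

```python
import math

def matvec(W, v):
    """Multiply matrix W (a list of rows) with vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def add(*vectors):
    """Element-wise sum of equally long vectors."""
    return [sum(vals) for vals in zip(*vectors)]

def softmax(z):
    m = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_step(a_prev, x_t, W_aa, W_ax, W_ya, b_a, b_y):
    """One time step of equation (1) with g_1 = tanh and g_2 = softmax."""
    a_t = [math.tanh(z) for z in add(matvec(W_aa, a_prev), matvec(W_ax, x_t), b_a)]
    y_hat_t = softmax(add(matvec(W_ya, a_t), b_y))
    return a_t, y_hat_t

# Assumed toy setup: 2 hidden units, 3-word vocabulary, untrained weights.
W_aa = [[0.1, 0.0], [0.0, 0.1]]
W_ax = [[0.5, -0.5, 0.0], [0.0, 0.5, -0.5]]
W_ya = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b_a = [0.0, 0.0]
b_y = [0.0, 0.0, 0.0]

a = [0.0, 0.0]                     # a^{<0>}: zero vector
sequence = [[1, 0, 0], [0, 1, 0]]  # two one-hot tokens
predictions = []
for x_t in sequence:
    a, y_hat = rnn_step(a, x_t, W_aa, W_ax, W_ya, b_a, b_y)
    predictions.append(y_hat)
```

The same weights are passed to `rnn_step` at every time step, and only the activation `a` carries information from one token to the next.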
    
This calculation can be simplified further by concatenating the matrices $W_{aa}$ and $W_{ax}$ horizontally into a single matrix $W_a$ and stacking the vectors $a^{<t-1>}$ and $x^{<t>}$ vertically:
    
$$W_a = \left[ W_{aa} \mid W_{ax} \right]$$
    
$$\left[ a^{<t-1>}, x^{<t>} \right] = \left[ \matrix{a^{<t-1>} \\ x^{<t>}} \right]$$
    
The simplified formula to calculate forward propagation is then:

$$a^{<t>} = g_1 \left( W_a \left[ a^{<t-1>}, x^{<t>} \right] + b_a \right)$$
    
$$\hat{y}^{<t>} = g_2 \left( W_{y} a^{<t>} + b_y \right)$$
    
Note that the formula to calculate $\hat{y}$ only changed in the subscripts used for the weight matrix. This simplified notation is equivalent to (1) but only uses one weight matrix instead of two.
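
A small numeric check can confirm that the concatenated form yields the same pre-activation as the two-matrix form. The values below are arbitrary assumptions, and `matvec` is a hypothetical helper defined only for this sketch.

```python
def matvec(W, v):
    """Multiply matrix W (a list of rows) with vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

W_aa = [[0.1, 0.2], [0.3, 0.4]]             # 2 x 2
W_ax = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]  # 2 x 3
a_prev = [0.2, -0.1]
x_t = [0.0, 1.0, 0.0]

# Two-matrix form: W_aa a^{<t-1>} + W_ax x^{<t>}
separate = [p + q for p, q in zip(matvec(W_aa, a_prev), matvec(W_ax, x_t))]

# Concatenated form: W_a = [W_aa | W_ax] side by side, inputs stacked vertically
W_a = [row_aa + row_ax for row_aa, row_ax in zip(W_aa, W_ax)]  # 2 x 5
stacked = a_prev + x_t                                         # length 5
combined = matvec(W_a, stacked)

same = all(abs(p - q) < 1e-12 for p, q in zip(separate, combined))
print(same)
```

If the hidden state has $n_a$ entries and the input has $n_x$ entries, $W_a$ has shape $n_a \times (n_a + n_x)$, which is $2 \times 5$ here.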
    
## 2.3 - Backpropagation

Because the input is read sequentially and the RNN computes a prediction in each step, the output is a sequence of predictions. The loss function for backprop for a single time step $t$ could be:

$$\mathcal{L}^{<t>} \left( \hat{y}^{<t>}, y^{<t>} \right) = -y^{<t>} \log{\hat{y}^{<t>}} - \left(1-y^{<t>}\right) \log{\left( 1- \hat{y}^{<t>} \right)}\tag{2}$$

The overall cost for a sequence of $T_y$ predictions is therefore the sum of the losses of the individual time steps:

$$\mathcal{L} \left(\hat{y}, y\right) = \sum_{t=1}^{T_y} \mathcal{L}^{<t>} \left(\hat{y}^{<t>}, y^{<t>} \right)\tag{3}$$
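
Equations (2) and (3) translate almost directly into code. The sketch below assumes binary labels per token (as in NER, where 1 marks a token that belongs to a named entity). The label sequence and the predicted probabilities are made-up values.

```python
import math

def step_loss(y_hat_t, y_t):
    """Binary cross-entropy for a single time step, equation (2)."""
    return -y_t * math.log(y_hat_t) - (1 - y_t) * math.log(1 - y_hat_t)

def sequence_loss(y_hat, y):
    """Sum of the per-step losses over all T_y steps, equation (3)."""
    return sum(step_loss(p, t) for p, t in zip(y_hat, y))

y = [0, 1, 1, 0]              # assumed true labels, T_y = 4
y_hat = [0.1, 0.8, 0.6, 0.3]  # hypothetical model outputs
total = sequence_loss(y_hat, y)
print(total)
```

Each step contributes the negative log of the probability the model assigned to the correct label, so confident wrong predictions dominate the sum.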

# 3 - RNN architectures

There are different types of network architectures for RNNs in terms of how the length of the input relates to the length of the output. An RNN can take a sequence of several tokens as an input and only produce a single value as an output. Such an architecture is called **many-to-one** and is used for tasks like **sentiment analysis**, where the RNN e.g. tries to predict a movie rating based on the text of a critic's review.

<br>

<div style="text-align:center">
    <img src="images/many-to-one.png" width=350>
    <caption><center><font color="purple"><b><u>Sentiment Analysis:</u></b> Predict the sentiment (y = 0/1) from a sequence of text</font></center></caption>
</div>
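
A many-to-one network can be sketched by running the recurrence over the whole sequence and computing an output only from the final activation. The function name `many_to_one`, the 2-dimensional toy tokens and the untrained weights are assumptions for illustration.

```python
import math

def matvec(W, v):
    """Multiply matrix W (a list of rows) with vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def many_to_one(sequence, W_aa, W_ax, b_a, w_ya, b_y):
    """Read the whole sequence, then emit one sigmoid output from the last activation."""
    a = [0.0] * len(b_a)
    for x_t in sequence:
        z = [p + q + b for p, q, b in zip(matvec(W_aa, a), matvec(W_ax, x_t), b_a)]
        a = [math.tanh(v) for v in z]
    logit = sum(w * v for w, v in zip(w_ya, a)) + b_y
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical, hand-picked weights (not trained)
W_aa = [[0.5, 0.0], [0.0, 0.5]]
W_ax = [[1.0, -1.0], [-1.0, 1.0]]
b_a = [0.0, 0.0]
w_ya = [2.0, -2.0]
b_y = 0.0

short_review = [[1, 0], [1, 0]]                         # T_x = 2
long_review = [[1, 0], [0, 1], [0, 1], [0, 1], [0, 1]]  # T_x = 5
p_short = many_to_one(short_review, W_aa, W_ax, b_a, w_ya, b_y)
p_long = many_to_one(long_review, W_aa, W_ax, b_a, w_ya, b_y)
```

Both inputs yield exactly one probability even though their lengths differ, because the output is computed once after the last token.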

The opposite is also possible: An RNN can take only a single value as input and produce a sequence as an output by re-using the previous outputs to make the next prediction. Such an architecture is called **one-to-many.** It could be used for example in an RNN that **generates music** by taking a genre as an input and generating a sequence of notes as an output.

<br>

<div style="text-align:center">
    <img src="images/one-to-many.png" width=350>
    <caption><center><font color="purple"><b><u>Music Generation:</u></b> Generate a sequence of music from given genre</font></center></caption>
</div>
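
The feedback loop of a one-to-many network can be sketched as below. Several simplifications are assumed: biases are omitted, the first input shares the 3-dimensional space of the notes, the most likely note is picked greedily at each step, and the weights are hand-picked rather than trained. The function name `one_to_many` is hypothetical.

```python
import math

def matvec(W, v):
    """Multiply matrix W (a list of rows) with vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def one_to_many(x_first, n_steps, W_aa, W_ax, W_ya):
    """Generate n_steps outputs, feeding each output back in as the next input."""
    a = [0.0] * len(W_aa)
    x_t = x_first
    notes = []
    for _ in range(n_steps):
        z = [p + q for p, q in zip(matvec(W_aa, a), matvec(W_ax, x_t))]
        a = [math.tanh(v) for v in z]
        y_hat = softmax(matvec(W_ya, a))
        note = max(range(len(y_hat)), key=y_hat.__getitem__)  # greedy choice
        notes.append(note)
        # The chosen note, one-hot encoded, becomes the next input
        x_t = [1.0 if i == note else 0.0 for i in range(len(y_hat))]
    return notes

# Hand-picked weights (assumed) that make the network cycle through the notes
W_aa = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.1]]
W_ax = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
W_ya = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

notes = one_to_many([1.0, 0.0, 0.0], n_steps=4, W_aa=W_aa, W_ax=W_ax, W_ya=W_ya)
print(notes)
```

Only the first input comes from outside the network. Every later input is the network's own previous prediction, which is what distinguishes this setup from the many-to-one case.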

There is theoretically also a one-to-one architecture. However, such an architecture is rarely encountered since it essentially corresponds to a standard NN.

Finally, there are networks which take an input sequence of length $T_x$ and produce an output sequence of length $T_y$. This is called a **many-to-many** architecture. In the RNN example shown in section 2, the length of the input was equal to the length of the output. However, input and output sequences need not be of the same length. This property is especially important for tasks like **machine translation** where the translated text might be longer or shorter than the original text.

<br>

<div style="text-align:center">
    <img src="images/many-to-many.png" width=350>
    <caption><center><font color="purple"><b><u>Machine Translation:</u></b> Translate input sequence into output sequence</font></center></caption>
</div>


## 3.1 - Encoder-Decoder Networks

Models with a many-to-many architecture might be implemented as encoder-decoder models. This is perhaps the most commonly used framework for sequence modelling with neural networks. As the name suggests, an encoder-decoder model consists of two RNNs.

- The encoder maps the input sequence $X$ to a hidden representation $H$ of the same length as the input. 
- The decoder then consumes this hidden representation to produce $Y$, i.e. make a prediction.

\begin{align}
H &= \operatorname{encode}(X) \\
Y &= p(Y \mid X) = \operatorname{decode}(H)
\end{align}

<br>

<div style="text-align:center">
    <img src="images/encoder-decoder.png" width=550>
</div>

# 4 - Language Model and Sequence generation

RNNs can be used for NLP tasks, e.g. in speech recognition to calculate the probability of each spelling variant for words that sound the same (homophones). Such tasks usually require large corpora of text, which are tokenized. A token can be a word, a sentence or just a single character. The most common words can then be kept in a dictionary and vectorized using one-hot encoding. These word vectors can then be used to represent sentences as a matrix of word vectors. A special vector for the unknown word $(<unk>)$ can be defined for words in a sentence that are not in the dictionary, plus an $<EOS>$ vector to indicate the end of a sentence.
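
The preprocessing described above might look as follows in a minimal form. The toy corpus, the dictionary size of 4 and the helper names `one_hot` and `encode_sentence` are assumptions. A real dictionary would be built from a large corpus and hold e.g. the 10,000 most common words.

```python
from collections import Counter

corpus = ["the cat sat on the mat", "the dog sat"]  # toy corpus (assumed)
vocab_limit = 4                                     # keep only the most common words

# Count word frequencies and build the dictionary with the two special tokens
counts = Counter(word for sentence in corpus for word in sentence.split())
vocab = [w for w, _ in counts.most_common(vocab_limit)] + ["<unk>", "<EOS>"]
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """One-hot vector for a word; words outside the dictionary map to <unk>."""
    vec = [0] * len(vocab)
    vec[index.get(word, index["<unk>"])] = 1
    return vec

def encode_sentence(sentence):
    """Represent a sentence as a list of one-hot vectors, terminated by <EOS>."""
    return [one_hot(w) for w in sentence.split() + ["<EOS>"]]

matrix = encode_sentence("the bird sat")
```

Here "bird" is not in the dictionary and is therefore encoded with the $<unk>$ vector, and the sentence of three words becomes a matrix with four rows because of the appended $<EOS>$ vector.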

The RNN can then calculate in each step the probability of each word appearing in the given context using softmax. This means if the dictionary contains the 10,000 most common words, the prediction $\hat{y}$ would be a vector of dimensions $(10,000 \times 1)$ containing the probabilities for each word. This probability is a conditional probability given the previous words already seen:

$$\hat{y}^{<t>} = P\left( x^{<t>} \mid x^{<t-1>}, x^{<t-2>}, \cdots, x^{<1>} \right)$$

This output vector is the probability distribution over all words in the dictionary for position $t$, given the $t-1$ preceding words. Predictions can be made until the $<EOS>$ token is processed or until some number of words have been processed. Such a network could be trained with the multi-class (softmax) counterpart of the loss function (2) and the cost function (3) to predict the next word for a given sequence of words. This also works on character level where the next character is predicted to form a word.
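
As a small worked example, the per-step softmax outputs can be combined into the probability of a whole sentence by multiplying the conditional probabilities of its words. The three-word vocabulary and the raw scores (logits) below are made up. The per-step loss used here is the multi-class cross-entropy $-\log \hat{y}^{<t>}_{\text{correct word}}$, which is an assumption about how the binary loss (2) would be generalized to softmax outputs.

```python
import math

vocab = ["cats", "sleep", "<EOS>"]  # toy vocabulary (assumed)

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores produced by the RNN at each time step
logits_per_step = [
    [2.0, 0.5, 0.1],  # step 1: no word seen yet
    [0.2, 1.5, 0.3],  # step 2: after "cats"
    [0.1, 0.2, 2.5],  # step 3: after "cats sleep"
]
sentence = ["cats", "sleep", "<EOS>"]

sentence_prob = 1.0
cost = 0.0
for logits, word in zip(logits_per_step, sentence):
    y_hat = softmax(logits)          # distribution over the vocabulary
    p = y_hat[vocab.index(word)]     # probability of the word that actually follows
    sentence_prob *= p               # product of the conditional probabilities
    cost += -math.log(p)             # summed per-step loss, as in (3)

print(sentence_prob, cost)
```

The summed cost is the negative logarithm of the sentence probability, so minimizing the cost during training is the same as maximizing the probability the model assigns to the training sentences.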