# Sequence Models

## Week 1: Recurrent Neural Networks

Sequence models such as recurrent neural networks (RNNs) are useful in tasks like the following:

+ Speech recognition: input is audio $\rightarrow$ output is a sequence of words
+ Music generation: input can be none or some parameters $\rightarrow$ output is a sequence of sounds
+ Sentiment classification: input is a comment $\rightarrow$ output is a score
+ DNA sequence analysis: input is a sequence of letters AGCCCCTGTGAAGGCTAG $\rightarrow$ output is detecting which part of the sequence corresponds to what
+ Machine translation: input is a sentence $\rightarrow$ output is a sentence
+ Video activity recognition: input is a sequence of frames $\rightarrow$ output is the recognized activity
+ Named entity recognition: input is a sentence $\rightarrow$ output is the names of the people in it

### Notation

Consider a named entity recognition example.

The i-th input is $x^{(i)}$: "Harry Potter and Hermione Granger invented a new spell". Each element of the $x^{(i)}$ sample is denoted by $x^{(i)<t>}$, where $t = 1, 2, \dots, T^{(i)}_x$. In this example $x^{(i)<1>}$ is "Harry".

The i-th target is $y^{(i)}:[1 \space 1 \space 0 \space 1 \space 1 \space 0 \space 0 \space 0 \space 0]$, where $y^{(i)<t>} = 1$ if the $t$-th element is part of a person's name and 0 otherwise. The length of $y^{(i)}$ is denoted by $T^{(i)}_y$.

Notice that the length of each sample can differ, and that in general the length of the input and of the output can also be different.

Consider a dictionary that takes the form of a vector whose elements are all the admitted words (generally around 50,000 words).

Each word can then be represented as a one-hot vector of the same length as the dictionary, with a 1 in the position of that word in the dictionary and zeros elsewhere.
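
As a minimal sketch of this one-hot representation, the snippet below builds a toy dictionary from the example sentence. The names `vocab`, `word_to_index` and `one_hot` are hypothetical helpers introduced only for illustration, and the dictionary has 9 words instead of tens of thousands.

```python
import numpy as np

# Hypothetical toy dictionary (sorted alphabetically); a real one would be far larger.
vocab = ["a", "and", "granger", "harry", "hermione", "invented", "new", "potter", "spell"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word, word_to_index):
    """Return a vector with a 1 at the word's dictionary position and 0 elsewhere."""
    vec = np.zeros(len(word_to_index))
    vec[word_to_index[word]] = 1.0
    return vec

# Plays the role of x^{<1>} for the sentence "Harry Potter and Hermione Granger ..."
x_1 = one_hot("harry", word_to_index)
```

Every one-hot vector has the same length as the dictionary, which is why this representation becomes very large for realistic vocabularies.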

### Recurrent Neural Network Model

#### Why not a standard network?

+ Inputs and outputs can have different lengths in different samples
+ It would not share features learned across different positions of text (if "Harry" is recognized as a name in one position it should share this knowledge with other positions too) $\rightarrow$ this is analogous to what CNNs do with images
+ Each element (word) of each input (sentence) has the length of the dictionary, so it's very big

### Basic Recurrent Neural Network

In a one-directional RNN the information is processed from left to right and the parameters are shared:

<img src="images/RNN.png" width="800px" />

Starting from some vector of hidden units $a^{<0>}$, generally equal to zero, each element $x^{<t>}$ (a one-hot vector) is combined with the previous activation through the parameters $W_{aa}$, $W_{ax}$ and $b_a$ and an activation function $g_1()$ (generally $\tanh$) to compute the hidden layer $a^{<t>} = g_1(W_{aa}a^{<t-1>} + W_{ax}x^{<t>}+b_a)$. This layer is then used to make a prediction $\hat{y}^{<t>} = g_2(W_{ya}a^{<t>}+b_y)$, where $g_2()$ is another activation function (for example a sigmoid if the problem is binary, or a softmax if the output has many classes).

<img src="images/description-block-rnn-ltr.png" width="600px" />


**The important thing is that the parameters $W_{aa}, W_{ax}, b_a, W_{ya}, b_y$ are the same across the whole sequence**. They will be updated through the backpropagation step.


So **each activation value $a^{<t-1>}$ is passed to the next one to make the following prediction**. However, one weakness of this RNN is that it only uses the information that is earlier in the sequence to make a prediction. In particular, when predicting $y^{<3>}$, it doesn't use information about the words $x^{<4>}$, $x^{<5>}$ and so on.

To simplify the notation let  $a^{<t>} = g(W_a [a^{<t-1>} , x^{<t>}]+b_a)$, where $[a^{<t-1>} , x^{<t>}]$ means stacking the two vectors one on top of the other. The matrix $W_a = [W_{aa} \mid W_{ax}]$ is then the horizontal stacking (side by side) of the matrices $W_{aa}$ and $W_{ax}$.

For example, if $a^{<t-1>}$ is a vector of length 100 and $x^{<t>}$ a vector of length 10,000, the new vector $[a^{<t-1>} , x^{<t>}]$ has length 10,100 and $W_a$ has dimension $100 \times 10{,}100$.
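
The equivalence between the two notations can be checked numerically. The sketch below uses toy sizes (`n_a = 4`, `n_x = 6`) and random parameters, all of which are assumptions made for illustration; it places $W_{aa}$ and $W_{ax}$ side by side and stacks $a^{<t-1>}$ on top of $x^{<t>}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_a, n_x = 4, 6  # toy sizes standing in for 100 and 10,000

W_aa = rng.normal(size=(n_a, n_a))
W_ax = rng.normal(size=(n_a, n_x))
b_a = np.zeros(n_a)

a_prev = rng.normal(size=n_a)  # a^{<t-1>}
x_t = np.zeros(n_x)            # x^{<t>} as a one-hot vector
x_t[2] = 1.0

# Form with two separate matrices
a_separate = np.tanh(W_aa @ a_prev + W_ax @ x_t + b_a)

# Compact form: matrices placed side by side, vectors stacked
W_a = np.hstack([W_aa, W_ax])             # shape (n_a, n_a + n_x)
stacked = np.concatenate([a_prev, x_t])   # length n_a + n_x
a_stacked = np.tanh(W_a @ stacked + b_a)
```

Both forms give the same activation, so the compact notation is only a bookkeeping convenience.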




### Backpropagation

Consider again the forward propagation:
    
- Starting from some value of $a^{<0>}$, the initialized parameters $W_a,b_a$ and the first input $x^{<1>}$ we compute $a^{<1>}$
- We use $a^{<1>}$ together with initialized parameters $W_y,b_y$ to compute the probability $\hat{y}^{<1>}$
- We pass the same parameters $W_a,b_a$ and $W_y,b_y$ (together with $a^{<t-1>}$) to compute every $a^{<t>}$ and so $\hat{y}^{<t>}$

For every prediction $\hat{y}^{<t>}$ we compute the loss function $L^{<t>}(\hat{y}^{<t>}, y^{<t>}) = -y^{<t>}\log(\hat{y}^{<t>}) - (1-y^{<t>})\log(1-\hat{y}^{<t>})$, the typical logistic regression loss.

The overall loss is given by $L(\hat{y}, y) = \sum_{t=1}^{T_y} L^{<t>}(\hat{y}^{<t>}, y^{<t>})$
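
A small sketch of this loss for the named entity recognition example follows. The predicted probabilities in `y_hat` are made-up numbers, and `sequence_loss` is a hypothetical helper name.

```python
import numpy as np

def sequence_loss(y_hat, y):
    """Sum over time steps of the logistic (binary cross-entropy) loss."""
    y_hat = np.asarray(y_hat, dtype=float)
    y = np.asarray(y, dtype=float)
    per_step = -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)
    return per_step.sum()

# Targets for "Harry Potter and Hermione Granger invented a new spell"
y = [1, 1, 0, 1, 1, 0, 0, 0, 0]
# Made-up model outputs, one probability per word
y_hat = [0.9, 0.8, 0.1, 0.7, 0.9, 0.2, 0.1, 0.1, 0.1]

loss = sequence_loss(y_hat, y)
```

The loss is small when the predicted probabilities are close to the targets and grows as they move apart.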

Backpropagation requires doing the computations (passing messages) in the opposite direction, taking derivatives with respect to the parameters $W_a, b_a, W_y, b_y$ in order to update them with gradient descent.

The most significant message to be passed is the one from $a^{<t>}$ to $a^{<t-1>}$, which runs backward along the sequence; this is why the procedure is called **backpropagation through time**.

### Examples of sequence data and RNN architectures

|Type of RNN|Illustration|Example|
|-|-|-|
| One-to-one $$T_x = T_y = 1$$ |<img src="images/rnn-one-to-one-ltr.png" width="200px" /> |Traditional neural network|
|One-to-many $$T_x = 1, T_y > 1$$|<img src="images/rnn-one-to-many-ltr.png" width="400px" />|Music generation|
|Many-to-one $$T_x > 1, T_y = 1$$|<img src="images/rnn-many-to-one-ltr.png" width="400px" /> |Sentiment classification|
|Many-to-many $$T_x = T_y $$|<img src="images/rnn-many-to-many-same-ltr.png" width="400px" />|Named entity recognition|
|Many-to-many $$T_x \neq T_y $$ |<img src="images/rnn-many-to-many-different-ltr.png" width="400px" /> |Machine translation|


### Language Model and Sequence Generation

An application of sequence models is speech recognition, where the input is an audio clip and the output is a sentence among all possible sentences. In this example the language model estimates the probability of a sentence, that is the probability of a sequence of words $P(y^{<1>}, y^{<2>}, ...,y^{<T_y>})$, and the system outputs the sentence with the highest probability.

To train a language model you need a large corpus of text. You then tokenize each sentence, mapping every word to a one-hot vector (all zeros except for a 1 in the word's position in the dictionary). Punctuation such as "." can also be tokenized, for example as an "end of sentence" token $<EOS>$. Words not present in the dictionary are tokenized as "unknown" $<UNK>$.

Consider the example 

$$\text{Cats average 15 hours of sleep a day.}$$

The sequence generation has a one-to-many structure in the form

<img src="images/rrn-sequence-generation.png" width="800px" />

First we feed $a^{<0>} = 0$ and $x^{<1>} = 0$ to compute $a^{<1>}$ and to make a prediction $\hat{y}^{<1>}$. Notice that $\hat{y}^{<t>}$ is a vector of probabilities (through softmax) across all the words in the dictionary. You then sample a word $y^{<t>}$ from that distribution (or take the one with the highest probability) and pass it as the next input, $x^{<t+1>} = y^{<t>}$.

Assuming that $y^{<1>}$ = "Cats", we then feed $x^{<2>} = y^{<1>}$ = "Cats" into the computation of $a^{<2>}$ and $\hat{y}^{<2>}$, so that this time $\hat{y}^{<2>}$ will be the probability of every word in the dictionary **conditional on** $y^{<1>}$ = "Cats", and so on. Notice that the probability of a sequence of 3 words is $P(y^{<1>}, y^{<2>}, y^{<3>}) = P(y^{<1>})P(y^{<2>}|y^{<1>})P(y^{<3>}|y^{<1>}, y^{<2>})$.

The model is trained by minimizing the loss $L(\hat{y}, y) = \sum_{t=1}^{T_y} L^{<t>}(\hat{y}^{<t>}, y^{<t>})$, where for every position $L^{<t>}(\hat{y}^{<t>}, y^{<t>}) = - \sum_{i=1}^M y_i^{<t>} \log(\hat{y}_i^{<t>})$. Here $M$ is the number of words in the dictionary and $y^{<t>}$ is the one-hot vector of the true word (a 1 in position $j$ if the true word is the $j$-th word of the dictionary, zeros elsewhere). If the true word is in position $j$, minimizing $- \sum_{i=1}^M y_i^{<t>} \log(\hat{y}_i^{<t>})$ is therefore equivalent to maximizing $\log(\hat{y}_j^{<t>})$, the log-probability that the model assigns to $y^{<t>} = j$.
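
The link between this loss and the probability of a sentence can be sketched as follows. The per-step distributions here are random softmax outputs standing in for what an RNN would produce conditional on the previous words; the vocabulary size `M = 5`, the `targets` indices and the helper names are all assumptions for illustration.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D vector of scores."""
    e = np.exp(z - z.max())
    return e / e.sum()

def sequence_cross_entropy(y_hat_steps, target_indices):
    """Sum over time of -log(probability assigned to the true word)."""
    return -sum(np.log(y_hat[j]) for y_hat, j in zip(y_hat_steps, target_indices))

rng = np.random.default_rng(0)
M = 5  # toy dictionary size

# Stand-ins for y_hat^{<1>}, y_hat^{<2>}, y_hat^{<3>}
y_hat_steps = [softmax(rng.normal(size=M)) for _ in range(3)]
targets = [2, 0, 4]  # hypothetical positions of the true words

loss = sequence_cross_entropy(y_hat_steps, targets)

# Chain rule: the sentence probability is the product of the per-step probabilities
sentence_prob = np.prod([y_hat[j] for y_hat, j in zip(y_hat_steps, targets)])
```

Since the loss is the negative log of the product, minimizing it is the same as maximizing the probability the model gives to the training sentence.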

### Vanishing gradients with RNNs

One problem of basic RNNs is that they're not very good at capturing long-term dependencies. For example, in the sentence

$$\text{The cat, which already [...], was full.}$$

the word "was" depends on the singular word "cat", which was much earlier in the sequence.

The **vanishing gradient problem**, typical of deep NNs, means that it's difficult for the error computed at the end of the sequence on $\hat{y}^{<T_y>}$ to affect the computations that come much earlier, such as $a^{<1>}$. As a result, each prediction is influenced mostly by the closest words in the sequence.

Although less common, there may also be the problem of exploding gradients, which makes parameters become very large and can result in numerical overflow.

A solution to that is called "gradient clipping" and consists of putting a cap on the maximum value of the gradient.
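
A minimal sketch of element-wise gradient clipping is shown below. The dictionary of gradients, its key names and the threshold of 5.0 are illustrative assumptions; another common variant, not shown here, rescales the whole gradient when its norm exceeds a threshold.

```python
import numpy as np

def clip_gradients(grads, max_value):
    """Clip every gradient entry to the interval [-max_value, max_value]."""
    return {name: np.clip(g, -max_value, max_value) for name, g in grads.items()}

# Made-up gradients with a few exploding entries
grads = {
    "dW_a": np.array([[0.5, -12.0], [30.0, 0.1]]),
    "db_a": np.array([-7.0, 2.0]),
}

clipped = clip_gradients(grads, max_value=5.0)
```

Entries already inside the interval are left untouched, so clipping only intervenes when a gradient explodes.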

### Gated Recurrent Unit (GRU)

One way of dealing with vanishing gradients is to use **Gated Recurrent Units (GRU)** (instead of the normal hidden unit) which are based on the concept of gate $\Gamma$ and memory cell $c$:

$$\Gamma = \sigma(W x^{<t>} + U a^{<t-1>} + b)$$

where $\sigma$ is the sigmoid function, so the gate is a vector whose values are mostly very close to zero or one, and $W,U,b$ are parameters. A gate tells us how much of the previous information to keep. The GRU outputs an activation value equal to the memory cell, $a^{<t>} = c^{<t>}$.

In every GRU we will be considering overwriting the memory cell $c^{<t>}$ with a candidate 

$$\tilde{c}^{<t>} = \tanh(W_c[\Gamma_r * c^{<t-1>},x^{<t>}]+b_c)$$

where $\Gamma_r = \sigma(W_r[c^{<t-1>},x^{<t>}]+b_r)$ tells us whether to drop previous information when computing the candidate; the $r$ stands for "relevance".

We then use the candidate $\tilde{c}^{<t>}$ and another gate $\Gamma_u$, where $u$ stands for "update"

$$\Gamma_u = \sigma(W_u[c^{<t-1>},x^{<t>}]+b_u)$$

(vector with values mostly very close to zero or one) to update the elements of the memory cell $c^{<t>}$ if the corresponding element of $\Gamma_u$ is close to one

$$c^{<t>} = \Gamma_u * \tilde{c}^{<t>} + (1-\Gamma_u) * c^{<t-1>}$$

where $*$ is element-wise multiplication: notice that $c^{<t>}$ is a vector with the same length as the hidden layer, and so are $\tilde{c}^{<t>}$, $\Gamma_r$ and $\Gamma_u$.



In summary:

- starting from $a^{<t-1>} = c^{<t-1>}$ and $x^{<t>}$ compute the gates $\Gamma_r$ and $\Gamma_u$
- with the gate $\Gamma_r$ compute the candidate $\tilde{c}^{<t>}$
- use $\Gamma_u$ and the candidate $\tilde{c}^{<t>}$ to eventually update $c^{<t>}$
- pass the new $c^{<t>} = a^{<t>}$ to the next unit


<img src="images/gru-ltr.png" width="400px" />

For example, in the sentence

$$\text{The cat, which already ate [...], was full}$$

it's optimal to update the memory cell at "cat" ($\Gamma_u = 1$), then keep $\Gamma_u = 0$ for the consecutive words until reaching the word "was". After "was", there is no need to memorize the word "cat" anymore, so we can set $\Gamma_u = 1$ again.

Because the multiplications are element-wise, the gates tell the GRU which dimensions of the memory cell vector to update at every time step. Some bits can be kept constant while others are updated. For example, one bit might remember whether "cat" is singular or plural, while other bits might track that the sentence is about food: since it mentions eating, you would expect it to say later whether the cat is full. Different bits can serve different purposes, and only a subset of them needs to change at each time step.
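
One GRU step, following the equations of this section, can be sketched as below. The toy sizes, the random parameters and the dictionary layout of `params` are assumptions made only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(c_prev, x_t, params):
    """Compute c^{<t>} (= a^{<t>}) from c^{<t-1>} and x^{<t>}."""
    concat = np.concatenate([c_prev, x_t])                     # [c^{<t-1>}, x^{<t>}]
    gamma_r = sigmoid(params["W_r"] @ concat + params["b_r"])  # relevance gate
    gamma_u = sigmoid(params["W_u"] @ concat + params["b_u"])  # update gate
    gated = np.concatenate([gamma_r * c_prev, x_t])            # [Gamma_r * c^{<t-1>}, x^{<t>}]
    c_tilde = np.tanh(params["W_c"] @ gated + params["b_c"])   # candidate
    return gamma_u * c_tilde + (1.0 - gamma_u) * c_prev        # element-wise mix

rng = np.random.default_rng(0)
n_c, n_x = 3, 5  # toy sizes
params = {f"W_{g}": rng.normal(size=(n_c, n_c + n_x)) for g in "ruc"}
params.update({f"b_{g}": np.zeros(n_c) for g in "ruc"})

c_prev = rng.normal(size=n_c)
x_t = rng.normal(size=n_x)
c_t = gru_step(c_prev, x_t, params)
```

When the update gate is pushed towards zero (for instance with a very negative `b_u`), the step returns essentially `c_prev`, which is how the cell can carry "cat" unchanged across many words.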



### Long Short Term Memory (LSTM)

Another way of dealing with vanishing gradients is the **Long Short Term Memory (LSTM)** unit, a more general and often more powerful unit than the GRU.

In LSTM the candidate memory cell and the update gate are computed in a similar way to the GRU, but this time $a^{<t>}$ is different from $c^{<t>}$, the gates are computed from $a^{<t-1>}$, and there is no relevance gate $\Gamma_r$:

$$\tilde{c}^{<t>} = \tanh(W_c[a^{<t-1>},x^{<t>}]+b_c)$$

$$\Gamma_u = \sigma(W_u[a^{<t-1>},x^{<t>}]+b_u)$$


The memory cell $c^{<t>}$ is updated not only through $\Gamma_u$ but also through a "forget" gate $\Gamma_f$

$$\Gamma_f = \sigma(W_f[a^{<t-1>},x^{<t>}]+b_f)$$

$$c^{<t>} = \Gamma_u * \tilde{c}^{<t>} + \Gamma_f * c^{<t-1>}$$

This gives the cell the option of keeping old values ($c^{<t-1>}$) and adding them to the candidate ($\tilde{c}^{<t>}$)

The new activation value is computed using an "output" gate

$$\Gamma_o = \sigma(W_o[a^{<t-1>},x^{<t>}]+b_o)$$

such that

$$a^{<t>} = \Gamma_o * \tanh(c^{<t>})$$

where $*$ is element-wise multiplication.

In summary,

- You use $a^{<t-1>}$ and $x^{<t>}$ to compute the gates $\Gamma_f$, $\Gamma_u$ and $\Gamma_o$ and the candidate $\tilde{c}^{<t>}$ (the "relevant" gate $\Gamma_r$ is omitted)
- You compute $c^{<t>}$
- You output $a^{<t>}$

<img src="images/lstm_unit.jpg" width="400px" />

By chaining many LSTM units one after the other, it is relatively easy for the memory cell $c^{<t>}$ to carry information from many previous time steps.
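
A single LSTM step can be sketched as follows, computing the candidate without the relevance gate, as the summary above notes. The toy sizes, random parameters and the layout of the dictionary `p` are assumptions for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(a_prev, c_prev, x_t, p):
    """Return (a^{<t>}, c^{<t>}) from a^{<t-1>}, c^{<t-1>} and x^{<t>}."""
    concat = np.concatenate([a_prev, x_t])           # [a^{<t-1>}, x^{<t>}]
    gamma_u = sigmoid(p["W_u"] @ concat + p["b_u"])  # update gate
    gamma_f = sigmoid(p["W_f"] @ concat + p["b_f"])  # forget gate
    gamma_o = sigmoid(p["W_o"] @ concat + p["b_o"])  # output gate
    c_tilde = np.tanh(p["W_c"] @ concat + p["b_c"])  # candidate memory cell
    c_t = gamma_u * c_tilde + gamma_f * c_prev
    a_t = gamma_o * np.tanh(c_t)
    return a_t, c_t

rng = np.random.default_rng(0)
n_a, n_x = 3, 5  # toy sizes
p = {f"W_{g}": rng.normal(size=(n_a, n_a + n_x)) for g in "ufoc"}
p.update({f"b_{g}": np.zeros(n_a) for g in "ufoc"})

a_prev = rng.normal(size=n_a)
c_prev = rng.normal(size=n_a)
x_t = rng.normal(size=n_x)
a_t, c_t = lstm_step(a_prev, c_prev, x_t, p)
```

With the forget gate close to one and the update gate close to zero, `c_t` stays essentially equal to `c_prev`, which is what lets the memory cell carry information across many steps.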

In a variation of LSTM the gates also depend on the previous memory cell value $c^{<t-1>}$; this is called a "peephole connection".

In the history of Deep Learning, LSTMs were invented first, and GRUs were later derived as a simplification of the more complicated LSTM model. Neither is universally superior. However, since the GRU is a simpler model (only two gates), it is easier to build a bigger network with it.

### Bidirectional RNN

No matter whether the units are standard RNN, GRU or LSTM, it can still be difficult to predict a word without knowing the following part of the sentence. An example is deciding whether the word "Teddy" is part of a name in the two sentences:

$$\text{Teddy bears are on sale!}$$
$$\text{Teddy Roosevelt was a great President}$$

In a Bidirectional RNN (BRNN), in addition to the forward activations $\vec{a}^{<t>}$, each of which is passed to the next unit, there are other activations that start from the end and go backward, $\overleftarrow{a}^{<t>}$:

<img src="images/brnn.jpg" width="600px" />

So the prediction is made by

$$\hat{y}^{<t>} = g(W_y[\vec{a}^{<t>},\overleftarrow{a}^{<t>}]+b_y)$$

This way the prediction of $y^{<3>}$ depends on $x^{<1>}$ through $\vec{a}^{<1>}, \vec{a}^{<2>}, \vec{a}^{<3>}$, and also on $x^{<4>}$ through $\overleftarrow{a}^{<4>}$ and $\overleftarrow{a}^{<3>}$, so it also uses information from the future.

It is common to use BRNNs with LSTM units. The disadvantage is that you need the whole sequence before making a prediction: in speech recognition, the model would have to wait for the end of the speech before processing it.

### Deep RNNs

For learning very difficult functions it is sometimes useful to stack many layers together. Instead of having only one activation layer $a^{<t>}$ between each $x^{<t>}$ and $y^{<t>}$, in Deep RNNs we stack multiple hidden layers on top of each other:

<img src="images/deep-rnn-ltr.png" width="400px" />


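This way each hidden layer, denoted by $a^{[l]<t>}$, is computed from the same layer at the previous time step and from the layer below at the same time step:

$$a^{[l]<t>} = g(W_a^{[l]}[a^{[l]<t-1>},a^{[l-1]<t>}]+b_a^{[l]})$$

where $a^{[0]<t>} = x^{<t>}$ and the parameters $W_a^{[l]}$ and $b_a^{[l]}$ are shared horizontally on the layer.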

For RNNs, having three layers is already a lot, because the recurrence over time makes each layer computationally expensive. However, there are versions of deep RNNs where, on top of three recurrent layers, instead of outputting $\hat{y}^{<t>}$ directly there are many other layers, not connected horizontally, between the last $a^{[k]<t>}$ and the prediction.

## Week 2: Natural Language Processing & Word Embeddings

### Word Representation

The traditional way of representing words is through a dictionary, that is a vector containing all the unique words. Each word is then represented as a one-hot vector over the dictionary: $O_k$ represents the one-hot vector with a 1 in position $k$ (the corresponding position in the dictionary) and zeros elsewhere.

In this way each word is represented on its own. The problem is that if, for example, the algorithm learns that after "orange" the word "juice" is likely, it would not be able to generalize when it gets the word "apple", because "orange" and "apple" have no connection in this one-hot representation.

**Word embeddings** represent words as vectors of dimension $k$ such that words carrying similar concepts have vectors with similar elements (in some of the $k$ dimensions).

For example the words "orange" and "apple" will have some very similar elements. So, in a named entity recognition problem, suppose that the model (a bidirectional RNN) learned that in the sentence

$$\text{Sally Johnson is an orange farmer}$$

"Sally Johnson" is the name of a person. The model will then be better at generalizing to the sentence

$$\text{Robert Lin is a durian cultivator}$$

where "Robert Lin" is also the name of a person, because of the similarity between "orange farmer" and "durian cultivator".

The way to use word embedding is through **Transfer Learning**:

- get a pre-trained word embedding learned from a very large corpus (between 1 and 100 billion words) in $k$ dimensions
- represent the words of your dictionary with these $k$-dimensional embeddings $\rightarrow$ each vector is also much smaller than a one-hot encoding, whose length equals the size of the dictionary
- optional: fine tune the word embeddings with your data

### Analogies using word vectors

In an embedding representation of the words "man", "woman", "king" and "queen", the difference between the vectors of "man" and "woman" will be ideally a vector similar to the difference between the vectors "king" and "queen". That is 

$$e_{man} - e_{woman} \approx e_{king} - e_{queen}$$

This allows us to find analogies: given the relationship between "man" and "woman", the word $w$ that relates to "king" in the same way should satisfy $e_{man} - e_{woman} \approx e_{king} - e_w$, so it is found as

$$w^* = \arg\max_w \; \text{sim}(e_w, e_{king} - e_{man} + e_{woman})$$

for a given measure of similarity.

A typical measure of similarity is the cosine similarity

$$\text{sim}(w_1,w_2) = \frac{w_1^T w_2}{\lVert w_1 \rVert_2 \lVert w_2 \rVert_2}$$

where the numerator is the inner product. It is the cosine of the angle between the two vectors, which is 1 if the angle is zero, 0 if the angle is 90 degrees and -1 if the angle is 180 degrees:

<img src="images/cosine_sim.png" width="800px" />

The more similar the two vectors are, the larger their inner product will be (for vectors of fixed norm).
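
The sketch below illustrates cosine similarity and the analogy search on made-up 2-D embeddings, where the first axis loosely encodes gender and the second royalty. The embedding values and the helper `complete_analogy` are hypothetical; the target vector used is $e_{king} - e_{man} + e_{woman}$, and the three query words are excluded from the candidates.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical embeddings, chosen by hand for illustration
embeddings = {
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([-1.0, 0.0]),
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([-1.0, 1.0]),
    "apple": np.array([0.1, -1.0]),
}

def complete_analogy(a, b, c, embeddings):
    """Find the word w such that a : b is like c : w."""
    target = embeddings[c] - embeddings[a] + embeddings[b]
    candidates = (w for w in embeddings if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine_similarity(embeddings[w], target))

answer = complete_analogy("man", "woman", "king", embeddings)
```

With these toy vectors the search returns "queen". Real embeddings have hundreds of dimensions and the match is only approximate.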




A reduced 2D representation of the embeddings can be obtained with t-SNE (t-distributed Stochastic Neighbor Embedding), so that similar words should appear close to each other in the 2D space.

<img src="images/t-sne.png" width="300px" />

It is a non-linear dimensionality reduction technique.

### Embedding matrix

Given $O_w$, the one-hot representation of a word in a dictionary of dimension $D$ (a vector with a 1 in position $w$ and zeros elsewhere), the same word can be represented as a dense vector $e_w$ of length $K$ such that

$$\underbrace{e_w}_{K \times 1} = \underbrace{E}_{K \times D} \underbrace{O_w}_{D \times 1}$$

where $E$ is the embedding matrix of dimension $K \times D$.
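
The product $E\,O_w$ simply selects column $w$ of $E$, as the following sketch with toy sizes shows (`D = 8`, `K = 3` and the random matrix are assumptions for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 8, 3                  # toy dictionary size and embedding dimension
E = rng.normal(size=(K, D))  # embedding matrix, one column per word

w = 5                        # position of the word in the dictionary
O_w = np.zeros(D)
O_w[w] = 1.0

e_w_matmul = E @ O_w         # matrix product with the one-hot vector
e_w_lookup = E[:, w]         # direct column lookup
```

Because the two results coincide, implementations typically use the column lookup and avoid multiplying by a large, mostly zero vector.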

### Learning Word Embeddings

In the history of word embeddings, researchers started with complex models and then realized how to simplify them.

The idea was that given a sentence, for example 

$$\underbrace{\text{I want a glass of orange}}_{\text{input}} \underbrace{\text{juice}}_{\text{target}}$$

- Transform the words in a one-hot representation of vectors of length $D$, where $D$ is the number of words in the dictionary
- Multiply the one-hot vectors by a matrix $E$ to obtain a $k$-dimensional representation; $E$ has dimension $k \times D$, where $k \ll D$
- Use the resulting embeddings as input of a NN to predict the output $y$ via soft-max
- Use gradient descent to find the optimal parameters of $E$, which give the $k$-dimensional embeddings

The input, called the *context*, can be just the last 4 words, or the 4 words before and after the target. It was shown that using just the previous word, or a single nearby word, as context is already enough to get meaningful embeddings.

### Word2vec

**Word2vec** is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words.

In the **Skip-gram model** you randomly choose a *context* word and then a *target* word within a certain window around it (before and after the context word).

- Start from the one-hot representation of the context word $o_c$
- Pass $o_c$ through a matrix $E$ to obtain the embedding $e_c = Eo_c$
- Pass the embedding $e_c$ through a soft-max layer to compute $\hat{y}$, a vector of the probabilities for each word of the dictionary to be the target word $y$

The architecture is the following:

$$o_c \rightarrow E \rightarrow \underbrace{e_c}_{Eo_c} \rightarrow \underbrace{\otimes}_{\text{soft-max}} \rightarrow \hat{y}$$

Notice that $\hat{y}$ is a vector of probabilities of dimension $D$, where each element $\hat{y}_j = p(y = j \mid c)$ is the probability that the target is word $j$ of the dictionary given the context $c$.


The soft-max for element $t$ of $\hat{y}$ is defined as

$$p(t|c) = \frac{e^{\theta_t^Te_c}}{\sum_{j=1}^D e^{\theta_j^Te_c}}$$

where in the denominator you sum among all words in the dictionary of length $D$ and $\theta_t$ is a parameter associated to target $t$. The bias term is omitted.

The loss function is the negative of the log-likelihood

$$L(\hat{y},y) = - \sum_{i=1}^D y_i \log \hat{y}_i$$

The parameters to learn are those in $E$ and $\theta$. The problem with this model is computing the soft-max: it involves a sum over all the words in the dictionary, which is computationally expensive.
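
The skip-gram soft-max and its loss can be sketched as below, with random parameters and toy sizes chosen only for illustration. Storing the vectors $\theta_t$ as the rows of a matrix `theta` is an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 10, 4                     # toy dictionary size and embedding dimension
E = rng.normal(size=(K, D))      # embedding matrix, one column per word
theta = rng.normal(size=(D, K))  # one parameter vector theta_t per row

def skipgram_probs(c, E, theta):
    """Return p(t | c) for every word t of the dictionary."""
    e_c = E[:, c]                    # same as E @ o_c
    logits = theta @ e_c             # theta_t^T e_c for all t
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()           # the sum runs over all D words

def skipgram_loss(c, t, E, theta):
    """Negative log-likelihood of the true target t given context c."""
    return -np.log(skipgram_probs(c, E, theta)[t])

y_hat = skipgram_probs(c=3, E=E, theta=theta)
loss = skipgram_loss(c=3, t=7, E=E, theta=theta)
```

The normalization inside `skipgram_probs` touches every word in the dictionary for each training pair, which is the cost that becomes prohibitive for large vocabularies.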

**Negative sampling** is a more efficient algorithm for word2vec. It works by generating a new training set in which each context word is paired with one word that actually appears near it (target = 1) and with $k$ randomly chosen words from the dictionary (target = 0). The number of negative samples $k$ is generally chosen between 5 and 20 for small datasets and between 2 and 5 for large datasets.

The model trains the matrix $E$ to predict whether the pair of words $(c,t)$ is related or not, which is simply a binary classification problem with a sigmoid activation function

$$p(y=1|c,t) = \sigma(\theta_t^Te_c)$$

for only $k+1$ samples at each time instead of all the words in the dictionary.

The negative samples are drawn from the distribution $p(w_i) = \frac{f(w_i)^{3/4}}{\sum_{j=1}^D{f(w_j)^{3/4}}}$, where $f(w_i)$ is the frequency of word $i$. This method is less computationally expensive than the skip-gram model with a full soft-max.
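
The sampling distribution and the construction of one group of $k+1$ training pairs can be sketched as follows. The word frequencies, the word indices and the helper name are made up for illustration.

```python
import numpy as np

def negative_sampling_distribution(frequencies):
    """Raise word frequencies to the power 3/4 and normalize."""
    weights = np.asarray(frequencies, dtype=float) ** 0.75
    return weights / weights.sum()

freqs = [1000, 100, 10, 1]  # made-up frequencies for a 4-word dictionary
probs = negative_sampling_distribution(freqs)

rng = np.random.default_rng(0)
k = 3
negatives = rng.choice(len(probs), size=k, p=probs)

# One positive pair plus k negative pairs: (context, word, label)
context, target = 2, 1  # hypothetical word indices
pairs = [(context, target, 1)] + [(context, int(n), 0) for n in negatives]
```

Compared with sampling proportionally to the raw frequencies, the 3/4 power gives frequent words a somewhat smaller share and rare words a somewhat larger one.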

**GloVe** (Global Vectors for word representation) is a word embedding algorithm that uses a co-occurrence matrix $X$ ($D \times D$), where each $X_{ij}$ denotes the number of times that a target $j$ occurred with a context $i$ within a certain window of words.

It works by minimizing the cost function

$$J = \sum_{i=1}^D \sum_{j=1}^D f(X_{ij}) (\theta_i^T e_j + b_i + b'_j - \log(X_{ij}))^2$$

where $f(X_{ij})$ is a weighting function that is zero if $X_{ij} = 0$ (so that $\log(X_{ij})$ never has to be evaluated at zero) and that balances very frequent words such as *stop words* (e.g. "is", "the", "at") against infrequent words.

The vectors $\theta_i$ and $e_j$ play symmetric roles in this objective. They are initialized as uniformly random vectors, and after gradient descent the final embedding for word $w$ is

$$e^*_w = \frac{e_w + \theta_w}{2}$$
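
A sketch of the GloVe cost is given below, with both bias terms added as in the original GloVe formulation. The specific weighting function (a power law capped at 1, with `x_max = 100` and `alpha = 0.75`) comes from the GloVe paper rather than from these notes; the array layout (`theta` with one row per word, `E` with one column per word) is an assumption of this sketch.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(x): zero at x = 0, grows as a power law, capped at 1."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_cost(X, theta, E, b, b_prime):
    """Weighted squared error between theta_i^T e_j + b_i + b'_j and log X_ij."""
    mask = X > 0
    log_X = np.log(np.where(mask, X, 1.0))  # placeholder log(1) = 0 where X_ij = 0
    residual = theta @ E + b[:, None] + b_prime[None, :] - log_X
    return float(np.sum(glove_weight(X) * mask * residual ** 2))

rng = np.random.default_rng(0)
D, K = 5, 2
theta = rng.normal(size=(D, K))
E = rng.normal(size=(K, D))
b = rng.normal(size=D)
b_prime = rng.normal(size=D)

# Made-up co-occurrence counts
X = rng.integers(0, 20, size=(D, D)).astype(float)
cost = glove_cost(X, theta, E, b, b_prime)

# Final embeddings: average of the two sets of vectors, one column per word
E_final = (E + theta.T) / 2
```

Entries with $X_{ij} = 0$ contribute nothing to the cost, so only observed co-occurrences drive the fit.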

### Sentiment Classification

An application of word embeddings and RNNs is sentiment classification, where the input is a text and the target is a number (e.g. 1-5). Because the RNN carries information along the sequence, each word is interpreted in the context of the other words, so that the word "good" contributes to a negative sentiment if it follows the words "lacking in".

<img src="images/sentiment_classification.jpg" width="600px" />

Embeddings allow the model to generalize, so that if it is trained on the sentence "Completely lacking in ..." it will still be able to handle "Completely absent of ...".

### Debiasing Word Embeddings

Word embeddings can reflect the gender, ethnicity, age, sexual orientation, and other biases of the text used to train the model, so that if asked for an analogy the model may output

$$\text{Man : Doctor as Woman : Nurse}$$

Suppose we want to reduce gender bias. The first step is to identify the direction of the embedding space that carries this bias. To do so we compute the average of the differences $(e_{male} - e_{female})$, $(e_{he} - e_{she})$, $(e_{man} - e_{woman})$, and so on (in practice a singular value decomposition is used instead of a plain average). The next step is to neutralize: for every word that is not definitional (definitional words, such as "grandmother", are gendered by definition and are left alone), remove the bias by shifting the word against the bias direction. The last step is to equalize pairs, so that for example the distance between "doctor" and "woman" is the same as the distance between "doctor" and "man"; this is achieved by further shifting words in the embedding space.
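
The neutralization step can be sketched as a projection that removes the component of a word vector along the bias direction. This projection formula is one common way to implement the shift and is an assumption here, as are the hand-picked 3-D embeddings and the use of a single word pair to estimate the bias direction `g`.

```python
import numpy as np

def neutralize(e, g):
    """Remove from e its component along the bias direction g."""
    return e - (e @ g) / (g @ g) * g

# Hypothetical embeddings; the first axis loosely encodes gender
e_man = np.array([1.0, 0.2, 0.0])
e_woman = np.array([-1.0, 0.2, 0.0])
e_doctor = np.array([0.4, 0.5, 0.8])

g = e_man - e_woman  # bias direction estimated from one pair only
e_doctor_debiased = neutralize(e_doctor, g)
```

In this toy case "man" and "woman" are already symmetric around the bias-free subspace, so after neutralization "doctor" is equally distant from both; in general a separate equalization step is needed to obtain this.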