# Deeplearning.ai - Sequence Models

## Notation

For a given sequence, use $x^{<t>}$ to index into position $t$ in the sequence.

$T_x$: **length** of the input sequence $x$. 

$x^{(i)<t>}$: training example $i$, position $t$.

$T^{(i)}_x$: **length** of the input sequence of the $i$th training example.

## Word Representations

Build a **Vocabulary** of all possible words.

### One-Hot Encoding

For sequence $x$, $x^{<t>}$ is a one-hot vector whose length equals the vocab size.
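A minimal sketch of the encoding, assuming a made-up six-word vocabulary and an `<UNK>` entry for out-of-vocabulary words (the names `vocab`, `word_to_index` and `one_hot_sequence` are illustrative only):

```python
import numpy as np

# Hypothetical toy vocabulary; a real one would hold tens of thousands of words.
vocab = ["a", "cat", "dog", "sat", "the", "<UNK>"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot_sequence(words):
    """Return an array of shape (T_x, len(vocab)), one one-hot row per position."""
    x = np.zeros((len(words), len(vocab)))
    for t, word in enumerate(words):
        # Words missing from the vocabulary map to the <UNK> entry.
        x[t, word_to_index.get(word, word_to_index["<UNK>"])] = 1.0
    return x

x = one_hot_sequence(["the", "cat", "sat"])
print(x.shape)            # (3, 6): T_x = 3 positions, vocabulary of 6 words
print(x.argmax(axis=1))   # position of the 1 in each row
```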

## RNN

A standard fully connected (FC) network doesn't work well for sequences. Problems:

* Inputs and outputs can have different lengths in different examples.
* It doesn't share features learned across different positions of the text.
* It can have lots of params, e.g. if the vocab is 50k or 100k words.

Typically for RNNs, the initial state $a^{<0>}$ is initialized with zeros. Parameters are shared for all time steps in the same example.


$$ a^{<t>} = g_t \bigg( W_{aa} a^{<t-1>} + W_{ax} x^{<t>} + b_a \bigg) $$

**Notation**: $W_{ax}$ means the weight $W$ used for computing $a$ by multiplying with $x$.

Alternatively, 

$$
\begin{aligned}
a^{<t>} &= g_t \bigg( W_{a} [a^{<t-1>}, x^{<t>}] + b_a \bigg) \\
\hat{y}^{<t>} &= g \bigg( W_{ya} a^{<t>} + b_y \bigg)
\end{aligned}
$$

Where:

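```
import numpy as np

n_a, n_x, m = 5, 3, 4  # example sizes
W_aa, W_ax = np.random.randn(n_a, n_a), np.random.randn(n_a, n_x)
a, x = np.random.randn(n_a, m), np.random.randn(n_x, m)

# concat horizontally (side by side, along the column axis)
W_a = np.concatenate([W_aa, W_ax], axis=1)

c, d = W_aa.shape
c, e = W_ax.shape

assert W_a.shape == (c, d + e)

# for a and x, concat vertically (stacked, along the row axis)
ax = np.concatenate([a, x], axis=0)

# both forms compute the same thing
assert np.allclose(W_a @ ax, W_aa @ a + W_ax @ x)
```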

### RNN Backward pass

See course assignment 1 for the maths equations for the backward pass. Here I write down the code for the backward pass of a single RNN cell, which I find more useful for understanding the shapes of the matrices. 

For detailed backward pass code for a full RNN, see the assignment. Basically, parameter gradients are **accumulated** over all timesteps in reverse order. Another thing to watch out for is that in a multi-step RNN, the upstream gradient for a hidden state comes from 2 places (the output at that timestep, `da_next`, and the `da_prev` returned by the following timestep's cell), therefore the upstream gradient is the **sum of these two sources**.

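```
import numpy as np

n_x, n_a, m = 3, 5, 10  # example sizes
# Wax.shape == (n_a, n_x), xt.shape == (n_x, m)
# Waa.shape == (n_a, n_a), a_prev.shape == (n_a, m), b.shape == (n_a, 1)
Wax, Waa, b = np.random.randn(n_a, n_x), np.random.randn(n_a, n_a), np.random.randn(n_a, 1)
xt, a_prev = np.random.randn(n_x, m), np.random.randn(n_a, m)
da_next = np.random.randn(n_a, m)  # gradient from upstream (random stand-in here)

# forward
z = np.dot(Wax, xt) + np.dot(Waa, a_prev) + b  # z.shape == (n_a, m)
a_next = np.tanh(z)

# backward
dz = (1 - a_next**2) * da_next  # tanh'(z) == 1 - tanh(z)**2
dWax = np.dot(dz, xt.T)  # dWax.shape == (n_a, m) * (m, n_x) == (n_a, n_x)
dWaa = np.dot(dz, a_prev.T) # dWaa.shape == (n_a, m) * (m, n_a) == (n_a, n_a)
db = np.sum(dz, axis=1, keepdims=True) # db.shape == (n_a, 1)
dxt = np.dot(Wax.T, dz) # dxt.shape == (n_x, n_a) * (n_a, m) == (n_x, m)
da_prev = np.dot(Waa.T, dz) # da_prev.shape == (n_a, n_a) * (n_a, m) == (n_a, m)
```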
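One way to sanity-check backward formulas like these is a finite-difference gradient check. The sketch below is my own addition, not from the assignment: it assumes a made-up scalar loss `sum(a_next * da_next)`, chosen only because its gradient with respect to `a_next` is exactly `da_next`, and compares the analytic `dWaa` against central differences.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_a, m = 3, 4, 5  # illustrative sizes
Wax, Waa = rng.normal(size=(n_a, n_x)), rng.normal(size=(n_a, n_a))
b = rng.normal(size=(n_a, 1))
xt, a_prev = rng.normal(size=(n_x, m)), rng.normal(size=(n_a, m))
da_next = rng.normal(size=(n_a, m))  # stand-in upstream gradient

def forward(Waa_):
    return np.tanh(Wax @ xt + Waa_ @ a_prev + b)

def scalar_loss(Waa_):
    # Hypothetical loss whose gradient w.r.t. a_next equals da_next.
    return np.sum(forward(Waa_) * da_next)

# Analytic gradient, same formula as the cell backward pass.
a_next = forward(Waa)
dz = (1 - a_next**2) * da_next
dWaa = dz @ a_prev.T

# Central finite differences, one entry of Waa at a time.
eps = 1e-6
dWaa_num = np.zeros_like(Waa)
for i in range(n_a):
    for j in range(n_a):
        step = np.zeros_like(Waa)
        step[i, j] = eps
        dWaa_num[i, j] = (scalar_loss(Waa + step) - scalar_loss(Waa - step)) / (2 * eps)

max_abs_error = np.max(np.abs(dWaa - dWaa_num))
print(max_abs_error)  # should be tiny if the analytic formula is right
```

The same loop can be pointed at `Wax`, `b`, `xt` or `a_prev` to check the other gradients.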

## BPTT

Example with binary cross-entropy loss: the total loss is the sum of the per-timestep losses.

$$
\begin{aligned}
\mathcal{L}^{<t>}(\hat{y}^{<t>}, y^{<t>}) &= -y^{<t>} \log \hat{y}^{<t>} - (1 - y^{<t>})\log (1-\hat{y}^{<t>}) \\
\mathcal{L}(\hat{y}, y) &= \sum^{T_y}_{t=1} \mathcal{L}^{<t>}(\hat{y}^{<t>}, y^{<t>})
\end{aligned}
$$

## Examples of RNN Architectures

Inspired by Andrej Karpathy's blog post The Unreasonable Effectiveness of Recurrent Neural Networks.

Many-to-many with `T_x == T_y`, many-to-one (e.g. sentiment classification), one-to-many (e.g. music generation).

Many-to-many but `T_x != T_y`, e.g. machine translation.

## Language Model 

A language model returns/estimates `Prob(sentence)`.

Train over a large corpus of English text. An `<EOS>` token is appended to every sentence. `<UNK>` represents unknown words, i.e. words not in the vocab.

RNN to predict next word in sentence, loss function: softmax loss, $\mathcal{L}(\hat{y}^{<t>}, y^{<t>}) = -\sum_i y^{<t>}_i \log\hat{y}^{<t>}_i$. 

## Sample from a Trained RNN

Use `np.random.choice()` with the probabilities output in $\hat{y}^{<t>}$ to sample from the possible outputs (e.g. the vocab), then feed the sampled output in as the input to the next time step.
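A minimal sketch of that sampling loop. Everything here is a stand-in: the four-word `vocab` is made up and the weights are random rather than trained, so the output is gibberish. It uses `rng.choice`, the `Generator` counterpart of `np.random.choice`, so the run is seeded.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "<EOS>"]       # hypothetical toy vocabulary
n_a, n_v = 8, len(vocab)
Waa, Wax = rng.normal(size=(n_a, n_a)), rng.normal(size=(n_a, n_v))
Wya = rng.normal(size=(n_v, n_a))            # random stand-ins for trained weights
ba, by = np.zeros((n_a, 1)), np.zeros((n_v, 1))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sample(max_len=20):
    a = np.zeros((n_a, 1))   # a<0> = 0
    x = np.zeros((n_v, 1))   # x<1> = 0, no previous word yet
    words = []
    for _ in range(max_len):
        a = np.tanh(Waa @ a + Wax @ x + ba)
        y_hat = softmax(Wya @ a + by)
        idx = rng.choice(n_v, p=y_hat.ravel())   # draw from the predicted distribution
        words.append(vocab[idx])
        if vocab[idx] == "<EOS>":
            break
        x = np.zeros((n_v, 1))
        x[idx] = 1.0         # the sampled word becomes the next input
    return words

print(sample())
```

Sampling stops at `<EOS>` or after `max_len` steps, whichever comes first.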

A character-level RNN is more computationally expensive, but it can easily model rare words that are not in the vocabulary.

## Vanishing Gradients

Simple RNNs do not handle long-term dependencies very well. Vanishing gradients are the more common problem for RNNs. Exploding gradients can be handled by clipping gradients.
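A small sketch of element-wise gradient clipping with `np.clip`. The gradient dictionary and the threshold of 5 are made-up values for illustration; clipping by global norm is another common variant not shown here.

```python
import numpy as np

def clip_gradients(gradients, max_value):
    """Clip every entry of every gradient array to [-max_value, max_value]."""
    return {name: np.clip(g, -max_value, max_value) for name, g in gradients.items()}

# Hypothetical gradients with a few exploding entries.
grads = {
    "dWaa": np.array([[12.0, -0.5], [3.0, -40.0]]),
    "db": np.array([[0.1], [-7.0]]),
}
clipped = clip_gradients(grads, max_value=5.0)
print(clipped["dWaa"])  # every entry now lies in [-5, 5]
```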

## GRU - Gated Recurrent Unit

**Notation**: memory cell $C$, $C^{<t>} = a^{<t>}$. 

### Simplified GRU

$u$ subscript below indicates **update** gate.

$$
\begin{aligned}
\tilde{C}^{<t>} &= \tanh \big( W_c [ C^{<t-1>}, x^{<t>}] + b_c \big) \\
\Gamma_u &= \sigma \big( W_u [C^{<t-1>}, x^{<t>}] + b_u \big) \\
C^{<t>} &= \Gamma_u \odot \tilde{C}^{<t>} + (1 - \Gamma_u) \odot C^{<t-1>} 
\end{aligned}
$$

$C^{<t>}$ can be a high-dimensional vector, in which case $\odot$ above is the **Hadamard** product, i.e. element-wise multiplication. 

### Full GRU
$$
\begin{aligned}
\tilde{C}^{<t>} &= \tanh \big( W_c [ \Gamma_r \odot C^{<t-1>}, x^{<t>}] + b_c \big) \\
\Gamma_u &= \sigma \big( W_u [C^{<t-1>}, x^{<t>}] + b_u \big) \\
\Gamma_r &= \sigma \big( W_r [C^{<t-1>}, x^{<t>}] + b_r \big) \\
C^{<t>} &= \Gamma_u \odot \tilde{C}^{<t>} + (1 - \Gamma_u) \odot C^{<t-1>} 
\end{aligned}
$$

$\Gamma_r$ is the **relevance** gate. It came out of lots of research on different variations of GRUs, where including $\Gamma_r$ was shown to handle long-term dependencies better. 

Based on Andrew's experience, GRUs are computationally simpler, so it is easier to build a bigger network with GRUs than with LSTMs. 
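The full GRU equations above can be sketched as a single forward step. This is an illustrative version with random stand-in weights; the `params` dictionary layout and the function name are my own assumptions, not the course's code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell_forward(c_prev, xt, params):
    """One full-GRU step. c_prev: (n_c, m), xt: (n_x, m); each W: (n_c, n_c + n_x)."""
    Wc, Wu, Wr = params["Wc"], params["Wu"], params["Wr"]
    bc, bu, br = params["bc"], params["bu"], params["br"]
    concat = np.concatenate([c_prev, xt], axis=0)            # [C<t-1>, x<t>]
    gamma_u = sigmoid(Wu @ concat + bu)                       # update gate
    gamma_r = sigmoid(Wr @ concat + br)                       # relevance gate
    gated = np.concatenate([gamma_r * c_prev, xt], axis=0)    # [Gamma_r * C<t-1>, x<t>]
    c_tilde = np.tanh(Wc @ gated + bc)                        # candidate memory
    return gamma_u * c_tilde + (1 - gamma_u) * c_prev         # C<t>

rng = np.random.default_rng(0)
n_c, n_x, m = 4, 3, 2  # illustrative sizes
params = {name: rng.normal(size=(n_c, n_c + n_x)) for name in ("Wc", "Wu", "Wr")}
params.update({name: np.zeros((n_c, 1)) for name in ("bc", "bu", "br")})
c_prev, xt = rng.normal(size=(n_c, m)), rng.normal(size=(n_x, m))
c_next = gru_cell_forward(c_prev, xt, params)
print(c_next.shape)  # (4, 2)
```

When the update gate is pushed towards 0 (a very negative `bu`), the cell simply copies `c_prev` forward, which is how the memory survives over many timesteps.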

## LSTM

LSTM is more powerful than GRU but computationally more expensive to use.

Here $C^{<t>} \neq a^{<t>}$. LSTM has a **forget gate** denoted by subscript $f$, and an **output gate** denoted by $o$.

Comparing to `cs231n`, which uses `ifog` for all the gates:

* `i`-gate is $\Gamma_u$ here, 
* `g`-gate is $\tilde{C}^{<t>}$. 

$$
\begin{aligned}
\tilde{C}^{<t>} &= \tanh \big( W_c [ a^{<t-1>}, x^{<t>}] + b_c \big) \\
\Gamma_u &= \sigma \big( W_u [a^{<t-1>}, x^{<t>}] + b_u \big) \\
\Gamma_f &= \sigma \big( W_f [a^{<t-1>}, x^{<t>}] + b_f \big) \\
\Gamma_o &= \sigma \big( W_o [a^{<t-1>}, x^{<t>}] + b_o \big) \\
C^{<t>} &= \Gamma_u \odot \tilde{C}^{<t>} + \Gamma_f \odot C^{<t-1>} \\
a^{<t>} &= \Gamma_o \odot \tanh(C^{<t>})
\end{aligned}
$$
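A minimal forward-step sketch of the LSTM equations above, assuming random stand-in weights and a hypothetical `params` dictionary (not the assignment's exact interface):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_forward(a_prev, c_prev, xt, params):
    """One LSTM step. a_prev, c_prev: (n_a, m); xt: (n_x, m); each W: (n_a, n_a + n_x)."""
    concat = np.concatenate([a_prev, xt], axis=0)               # [a<t-1>, x<t>]
    c_tilde = np.tanh(params["Wc"] @ concat + params["bc"])     # candidate
    gamma_u = sigmoid(params["Wu"] @ concat + params["bu"])     # update gate
    gamma_f = sigmoid(params["Wf"] @ concat + params["bf"])     # forget gate
    gamma_o = sigmoid(params["Wo"] @ concat + params["bo"])     # output gate
    c_next = gamma_u * c_tilde + gamma_f * c_prev
    a_next = gamma_o * np.tanh(c_next)
    return a_next, c_next

rng = np.random.default_rng(0)
n_a, n_x, m = 4, 3, 2  # illustrative sizes
params = {name: rng.normal(size=(n_a, n_a + n_x)) for name in ("Wc", "Wu", "Wf", "Wo")}
params.update({name: np.zeros((n_a, 1)) for name in ("bc", "bu", "bf", "bo")})
a_prev, c_prev = np.zeros((n_a, m)), np.zeros((n_a, m))
xt = rng.normal(size=(n_x, m))
a_next, c_next = lstm_cell_forward(a_prev, c_prev, xt, params)
print(a_next.shape, c_next.shape)  # (4, 2) (4, 2)
```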


### Peephole Connection
Insert $C^{<t-1>}$ in **all** gate updates. E.g. for the output gate:

$$ \Gamma_o = \sigma \big( W_o [C^{<t-1>}, a^{<t-1>}, x^{<t>}] + b_o \big) $$

### LSTM Backward Pass

Again, only the code for a single LSTM cell is given here; for the full LSTM RNN, see assignment 1.

```
import numpy as np

# ot, ft, it are gamma_o, gamma_f, gamma_u, shapes are all (n_a, m)
# cct is c_tilde (already passed through tanh)
# da_next, dc_next are upstream gradients w.r.t. a_next and c_next
# a_prev.shape == (n_a, m) == c_prev.shape
# xt.shape == (n_x, m)
# dot, dcct, dit, dft are gradients w.r.t. the gate pre-activations

tc = np.tanh(c_next)
z = 1 - tc**2

# total gradient reaching c_next: direct path plus the path through a_next
dc_total = dc_next + ot * z * da_next

dot = da_next * tc * ot * (1 - ot)  # dot.shape == (n_a, m)
dcct = dc_total * it * (1 - cct**2)
dit = dc_total * cct * it * (1 - it)
dft = dc_total * c_prev * ft * (1 - ft)
```
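A finite-difference check is a handy way to validate gate gradients of this kind. The sketch below is my own addition under simplifying assumptions: random gate pre-activations `zi, zf, zo, zc` stand in for $W[a^{<t-1>}, x^{<t>}] + b$, and a made-up scalar loss is chosen so that its gradients with respect to `a_next` and `c_next` are `da_next` and `dc_next`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_a, m = 3, 2  # illustrative sizes
zi, zf, zo, zc = (rng.normal(size=(n_a, m)) for _ in range(4))
c_prev = rng.normal(size=(n_a, m))
da_next, dc_next = rng.normal(size=(n_a, m)), rng.normal(size=(n_a, m))

def forward(zi, zf, zo, zc):
    it, ft, ot, cct = sigmoid(zi), sigmoid(zf), sigmoid(zo), np.tanh(zc)
    c_next = it * cct + ft * c_prev
    a_next = ot * np.tanh(c_next)
    return it, ft, ot, cct, c_next, a_next

def scalar_loss(zi, zf, zo, zc):
    # Hypothetical loss: its gradients w.r.t. a_next and c_next are da_next and dc_next.
    *_, c_next, a_next = forward(zi, zf, zo, zc)
    return np.sum(a_next * da_next) + np.sum(c_next * dc_next)

it, ft, ot, cct, c_next, a_next = forward(zi, zf, zo, zc)
dc_total = dc_next + da_next * ot * (1 - np.tanh(c_next) ** 2)
analytic = {
    "zi": dc_total * cct * it * (1 - it),
    "zf": dc_total * c_prev * ft * (1 - ft),
    "zo": da_next * np.tanh(c_next) * ot * (1 - ot),
    "zc": dc_total * it * (1 - cct ** 2),
}

eps, args = 1e-6, {"zi": zi, "zf": zf, "zo": zo, "zc": zc}
max_abs_error = 0.0
for name in args:
    numeric = np.zeros((n_a, m))
    for idx in np.ndindex(n_a, m):
        step = np.zeros((n_a, m))
        step[idx] = eps
        plus = dict(args, **{name: args[name] + step})
        minus = dict(args, **{name: args[name] - step})
        numeric[idx] = (scalar_loss(**plus) - scalar_loss(**minus)) / (2 * eps)
    max_abs_error = max(max_abs_error, np.max(np.abs(numeric - analytic[name])))
print(max_abs_error)  # should be tiny if the gate gradients are right
```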

## Bidirectional RNN

Useful for tasks such as language translation, where the preceding timesteps alone are inadequate to solve the problem at hand and later context is also needed.

A BRNN is an **acyclic graph**: it runs a forward RNN and a backward RNN over the same input and combines their activations. Each timestep's output prediction is therefore: 

$$ \hat{y}^{<t>} = g \big( W_y [\overrightarrow{a}^{<t>}, \overleftarrow{a}^{<t>}] + b_y \big) $$

<img src='./pics/brnn.png' width="600">

## Deep RNN

Formula for computing the cells:

$$ a^{[2]<3>} = g \big( W^{[2]}_a [a^{[2]<2>}, a^{[1]<3>}] + b^{[2]}_a \big) $$

<img src='./pics/deep_rnn.png' width='600'>

## Word Embedding

One hot vectors are binary, sparse, high dimensional. Word embeddings are low-dimensional floating-point vectors.

### Properties

**Analogies**

Mikolov et al., 2013. Linguistic regularities in continuous space word representations.

Question: Man -> Woman as King -> ?

The analogy holds when $e_{man} - e_{woman}$ is close to $e_{king} - e_{queen}$. So to answer the question, find a word $w$ such that:

$$\underset{w}{argmax} \text{ Similarity}(e_w, e_{king} - e_{man} + e_{woman}) $$

Commonly used similarity functions: cosine similarity, squared difference, etc.

**Cosine Similarity** (cosine of the angle between vectors $u$ and $v$):

$$sim(u, v) = \frac{u^T v}{\|u\|_2 \|v\|_2}$$
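A toy sketch of the analogy search with cosine similarity. The 2-d "embeddings" below are hand-made for illustration (axes loosely meaning gender and royalty), not learned vectors, and `complete_analogy` is a hypothetical helper name.

```python
import numpy as np

def cosine_similarity(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hand-made 2-d "embeddings", purely illustrative.
embeddings = {
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([-1.0, 0.0]),
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([-1.0, 1.0]),
    "apple": np.array([0.1, -1.0]),
}

def complete_analogy(a, b, c):
    """Return the word w maximizing sim(e_w, e_c - e_a + e_b), i.e. a -> b as c -> w."""
    target = embeddings[c] - embeddings[a] + embeddings[b]
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(embeddings[w], target))

print(complete_analogy("man", "woman", "king"))  # queen
```

The three query words are excluded from the candidates, otherwise one of them often wins trivially.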

### Embedding Matrix

With vocab size 10k and embedding dimension 300, the embedding matrix $E$ has shape (300, 10000).

Given a one-hot $o_j$ vector,  $E \cdot o_j = e_j$, gives the embedding vector for $j$.
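A quick sketch showing that multiplying by a one-hot vector just selects a column of $E$. Here `E` is a random stand-in for a learned matrix and `j` is an arbitrary index.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, emb_dim = 10_000, 300
E = rng.normal(size=(emb_dim, vocab_size))   # random stand-in for a learned E

j = 1234                                      # arbitrary word index
o_j = np.zeros(vocab_size)
o_j[j] = 1.0

e_j = E @ o_j                                 # matrix-vector product from the formula
assert np.allclose(e_j, E[:, j])              # same as reading column j directly
print(e_j.shape)  # (300,)
```

Since almost all of the multiplication is by zeros, a direct column lookup is the cheaper way to get $e_j$.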



### Word2Vec

#### Skip Gram Model:

**Skip gram**: pick a context word $c$, then predict a target word $t$ chosen at random from a window around it, e.g. within 4 words before or after. (Predicting the middle word from its surrounding words is the related CBOW model.)

From one-hot vector to prediction: $o_c \to E \to e_c = E \cdot o_c \to \text{softmax} \to \hat{y}$. Target labels $y$ are also one-hot vectors. 

Assume **vocab size is 10k**. $\theta_t$ is the parameter vector for output word $t$ in the softmax layer. $\theta_t$ and $e_c$ are vectors of the same dimension. 

$$\hat{y} = p(t \mid c) = \frac{e^{\theta_t^T e_c}}{\sum_{j=1}^{10000} e^{\theta_j^T e_c}}$$

Loss function $\mathcal{L}(\hat{y}, y) = -\sum_{i=1}^{10000} y_i \log \hat{y}_i$

Problem: slow. When the vocab is large, the sum over the whole vocab in the denominator of $\hat{y}$ is expensive to compute. Solution: use hierarchical softmax. 

### Negative Sampling

Mikolov et al., 2013

Context: $c$, word: $t$, target: $y$.

From a data point with $y=1$, generate $K$ negative examples with $y = 0$. 

$P(y = 1 \mid c, t) = \sigma( \theta_t^T e_c)$

This turns one 10k-way softmax problem into 10k binary classification problems, of which only $K + 1$ are updated for each training example. 

How to choose the negative examples?

Sample with the empirical frequency of words $f(w_i)$ in the corpus. Not so great, since very frequent words dominate the samples. 

Suggestion: $P(w_i) = \frac{f(w_i)^{3/4}}{\sum_{j=1}^{10000}f(w_j)^{3/4}}$
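A small sketch of the suggested sampling distribution. The three corpus counts and $K = 5$ are made-up values, and the function name is my own.

```python
import numpy as np

def negative_sampling_distribution(counts):
    """P(w_i) proportional to f(w_i)^(3/4), where f is the empirical frequency."""
    freq = np.asarray(counts, dtype=float)
    freq = freq / freq.sum()
    weights = freq ** 0.75
    return weights / weights.sum()

counts = [900, 90, 10]              # hypothetical corpus counts for three words
p = negative_sampling_distribution(counts)
print(p.round(3))                   # the frequent word gets less than its 0.9 share

rng = np.random.default_rng(0)
K = 5
negatives = rng.choice(len(counts), size=K, p=p)   # indices of K negative words
print(negatives)
```

Raising to the 3/4 power sits between sampling by raw frequency and sampling uniformly, so rare words are drawn somewhat more often than their raw frequency would suggest.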

### GloVe Word Vectors

$X_{ij} =$ number of times $i$ appears in the context of $j$. 

$$\text{Minimize} \sum_{i=1}^{10000}\sum_{j=1}^{10000} f(X_{ij}) \big(\theta_i^T e_j + b_i + b_j - \log X_{ij}\big)^2$$

where $f(X_{ij}) = 0$ when $X_{ij} = 0$.

$\theta_i$ and $e_j$ play symmetric roles in this objective. For an n-dimensional GloVe embedding, each vector is of shape (n,).

## Sentiment Classification

Method 1: 

Average all word embedding vectors, then apply softmax classification. Pros: can handle variable-length inputs. Cons: ignores word order.

### RNN

Feed embedding vectors to a many-to-one RNN. 

## Debiasing Word Embeddings

Bolukbasi et al., 2016

1. Identify bias direction
2. Neutralize: for every word that is not definitional, project to get rid of bias.
3. Equalize pairs. 
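The neutralize step can be sketched as removing the projection of an embedding onto the bias direction. The vectors below are made-up 3-d examples; in practice the bias direction `g` would come from step 1 (e.g. from differences of definitional word pairs).

```python
import numpy as np

def neutralize(e, g):
    """Remove from embedding e its component along the bias direction g."""
    e_bias = (e @ g) / (g @ g) * g     # projection of e onto g
    return e - e_bias

g = np.array([1.0, 0.0, 0.0])          # hypothetical bias direction
e = np.array([0.4, 0.3, -0.2])         # hypothetical non-definitional word embedding
e_debiased = neutralize(e, g)
print(e_debiased)        # the component along g is removed
print(e_debiased @ g)    # 0.0
```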


<img src='pics/debiasing_embedding.png' width='800'>

## Seq2Seq Architectures

Sutskever et al., 2014. Cho et al., 2014. 

**Seq2Seq: encoding network + decoding network**. 

**Image Captioning**: CNN to extract image features, which is then fed to an RNN to learn the captions.

Machine translation as building a **conditional language model**. $P(y^{<1>}, \cdots, y^{<T_y>} \mid x)$, where $y^{<1>}, \cdots, y^{<T_y>}$ are the English targets, and $x$ are French inputs. 

Objective:

$$\underset{y^{<1>}, \cdots, y^{<T_y>}}{argmax} P(y^{<1>}, \cdots, y^{<T_y>} \mid x) $$

Greedy search doesn't work well here, i.e. picking the most likely word $y^{<t>}$ one step at a time does not necessarily maximize the probability of the whole sentence.

## Beam Search

**Beam Width** $B$ (e.g. $B = 3$): keep track of the $B$ most likely candidate words at the first step, then the $B$ most likely partial sentences at each later step. At each step, only $B$ combinations are stored. 

Compute:

$$ p(y^{<1>}, y^{<2>} \mid x) = p(y^{<1>} \mid x) p(y^{<2>} \mid x, y^{<1>}) $$

$p(y^{<1>} \mid x)$ is stored based on the beam width setting above. $y^{<1>}$ is hard-wired as inputs to the network for computing $y^{<2>}$. 

$B = 1$ is equivalent to greedy search.
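A toy sketch of beam search. The decoder is replaced by a hypothetical lookup table `next_word_probs` with made-up probabilities, picked so that greedy search ($B = 1$) and a wider beam disagree. This is one simple variant (finished sentences compete for the same $B$ slots), not a definitive implementation.

```python
import math

# Hypothetical stand-in for the decoder: P(next word | x, prefix) on a tiny vocabulary.
def next_word_probs(prefix):
    table = {
        (): {"jane": 0.5, "in": 0.4, "<EOS>": 0.1},
        ("jane",): {"visits": 0.4, "is": 0.35, "<EOS>": 0.25},
        ("in",): {"september": 0.9, "<EOS>": 0.1},
    }
    return table.get(tuple(prefix), {"<EOS>": 1.0})

def beam_search(B, max_len=5):
    beams = [((), 0.0)]          # (prefix, sum of log probabilities)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for word, p in next_word_probs(prefix).items():
                candidates.append((prefix + (word,), score + math.log(p)))
        # Keep only the B most likely partial sentences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:B]:
            if prefix[-1] == "<EOS>":
                finished.append((prefix, score))
            else:
                beams.append((prefix, score))
        if not beams:
            break
    return max(finished or beams, key=lambda c: c[1])

print(beam_search(B=1))   # greedy: commits to "jane" first
print(beam_search(B=3))   # wider beam finds the more probable sentence
```

With these numbers greedy search ends at "jane visits" (probability 0.2), while $B = 3$ keeps "in" alive and finds "in september" (probability 0.36).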

### Length Normalization 

Objective function: 

$$ \underset{y}{argmax} \sum_{t=1}^{T_y} \log P(y^{<t>} \mid x, y^{<1>}, \cdots, y^{<t-1>}) $$

This objective results in a preference for **shorter** sentences: every probability is less than 1, so each additional word adds another negative log term. The trick is to normalize by length:

$$ \underset{y}{argmax} \frac{1}{T_y^{\alpha}} \sum_{t=1}^{T_y} \log P(y^{<t>} \mid x, y^{<1>}, \cdots, y^{<t-1>}) $$

Where $\alpha \in [0, 1]$. 
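A small numeric sketch of the effect of $\alpha$. The per-token probabilities are made up, and the default `alpha=0.7` is just an assumed value inside the stated range.

```python
import math

def normalized_score(token_probs, alpha=0.7):
    """Sum of log-probabilities divided by T_y ** alpha."""
    T_y = len(token_probs)
    return sum(math.log(p) for p in token_probs) / T_y ** alpha

short = [0.5, 0.5]               # hypothetical per-token probabilities, T_y = 2
long = [0.6, 0.6, 0.6, 0.6]      # T_y = 4: each token is more likely, but there are more

# alpha = 0: no normalization, the shorter sentence scores higher
print(normalized_score(short, alpha=0.0), normalized_score(long, alpha=0.0))
# alpha = 1: full normalization by length, the longer sentence scores higher
print(normalized_score(short, alpha=1.0), normalized_score(long, alpha=1.0))
```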

### Beam Width

**Large beam width**, $B$: better results, more memory usage, slower.

**Small** $B$: worse results, faster. 

Production systems usually see $B = 10$. In research you'd see large B such as 3000.

Unlike exact search algos such as breadth-first search and depth-first search, beam search runs faster but is **not** guaranteed to find the exact maximum of this objective function.

### Error Analysis for Beam Search

Pick an incorrectly predicted example, $y^*$ is **ground truth**, $\hat{y}$ is **predicted** result. 

Feed input through RNN to compute $P(y^* \mid x)$ and $P(\hat{y} \mid x)$. 

If $P(y^* \mid x) > P(\hat{y} \mid x)$: Beam search is at fault, it did not find the desired result.

If $P(y^* \mid x) \leq P(\hat{y} \mid x)$: RNN model is at fault.

Perform this analysis for the incorrectly predicted examples, then compute the fraction of errors due to beam search vs the RNN model. **Only if a large fraction is due to beam search would you consider increasing the beam width.**

<img src='pics/beam_error_analysis.png' width='800'>

## Bleu Score

A single real number evaluation metric. 

Measures how good a machine generated translation is. **BLEU = Bi-Lingual Evaluation Understudy**.

Papineni et al., 2002, Bleu: A method for automatic evaluation of machine translation.

**Modified Precision** (unigrams): count each word in the predicted result, but clip its count at the maximum number of times it occurs in any ground truth (reference) translation, i.e. clipped count / total number of words in $\hat{y}$. 

**Bigrams** does the above with pairs of adjacent words, clipped count / total number of bigrams in $\hat{y}$.

For **n-grams**:

$$ P_n = \frac{\sum_{\text{n-gram} \in \hat{y}} Count_{clip} (\text{n-gram})}{\sum_{\text{n-gram} \in \hat{y}} Count(\text{n-gram})} $$

Compute $P_n$ for $n \in \{1, 2, 3, 4\}$, then combine them into the Bleu score:

$$ BP \times \exp \bigg(\frac{1}{4} \sum_{n=1}^{4} \log P_n \bigg)$$

$BP$ is a brevity penalty factor, as shorter sentences are more likely to have higher precision. 

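```
import numpy as np

# MT = machine translation, y_hat
MT_output_length, reference_output_length = 5, 7  # example lengths

if MT_output_length > reference_output_length:
    BP = 1
else:
    # penalty below 1 when the MT output is shorter than the reference
    BP = np.exp(1 - reference_output_length / MT_output_length)
```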
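A sketch of modified n-gram precision with clipping, using only the standard library. The helper names are my own, and the reference and candidate sentences are small illustrative examples.

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram count divided by the total n-gram count of the candidate."""
    cand_counts = Counter(ngrams(candidate, n))
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

references = ["the cat is on the mat".split(), "there is a cat on the mat".split()]

candidate = "the the the the the the the".split()
print(modified_precision(candidate, references, 1))  # 2/7: "the" is clipped at 2

candidate = "the cat the cat on the mat".split()
print(modified_precision(candidate, references, 2))  # 4/6
```

Without clipping, the first candidate would get a unigram precision of 7/7, which is why the count is capped by the references.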

## Attention Model

Bahdanau et al., 2014. Neural machine translation by jointly learning to align and translate.

Problem with long sequences: **Bleu score declines with the length of the sentence for machine translation systems**.

The input sentence is encoded by a bidirectional RNN, which produces features $a^{<t'>}$ for each input word. 

A separate uni-directional RNN with state $s^{<t>}$ generates the translation; at each output step, the attention model computes attention weights over the words in the input. 

**Context** $C$ is defined as the weighted sum of features from the bidirectional RNN. 

$t$ is the timestep in the output (**attention**) RNN, $t'$ is the timestep in the input (bidirectional) RNN.

$\alpha^{<t, t'>}$ is the amount of attention $y^{<t>}$ should pay to $a^{<t'>}$. 

$f()$ is a small single-layer network. 

This algo runs in **quadratic time**: computing all the attention weights costs $O(T_x T_y)$.

$$
\begin{aligned}
\alpha^{<t, t'>} &\geq 0, \quad \forall t, t' \\
\sum_{t'=1}^{T_x} \alpha^{<t, t'>} &= 1, \quad \forall t \in \{1, \cdots, T_y\} \\
C^{<t>} &= \sum_{t'=1}^{T_x} \alpha^{<t, t'>} a^{<t'>}  \\
\alpha^{<t, t'>} &= \frac{\exp\big( e^{<t, t'>} \big)}{\sum_{t''=1}^{T_x} \exp\big( e^{<t, t''>}\big)} \\
e^{<t, t'>} &= f(s^{<t-1>}, a^{<t'>}) 
\end{aligned}
$$
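A sketch of one attention step: scores $e^{<t,t'>}$ from a small network, softmax over the input positions, then the context as a weighted sum. The sizes, the random weights `W1` and `w2`, and the tanh hidden layer are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T_x, n_a, n_s, n_h = 6, 4, 5, 8        # illustrative sizes
a = rng.normal(size=(T_x, 2 * n_a))     # features a<t'> (forward and backward concatenated)
s_prev = rng.normal(size=(n_s,))        # previous decoder state s<t-1>

# f(): a small network with random stand-in weights.
W1 = rng.normal(size=(n_h, n_s + 2 * n_a))
w2 = rng.normal(size=(n_h,))

def attention_context(s_prev, a):
    # e<t,t'> = f(s<t-1>, a<t'>) for every input position t'
    inputs = np.concatenate([np.tile(s_prev, (a.shape[0], 1)), a], axis=1)
    e = np.tanh(inputs @ W1.T) @ w2          # shape (T_x,)
    alpha = np.exp(e - e.max())
    alpha = alpha / alpha.sum()              # softmax over t'
    context = alpha @ a                      # weighted sum of a<t'>
    return alpha, context

alpha, context = attention_context(s_prev, a)
print(alpha.sum(), context.shape)  # weights sum to 1, context has shape (8,)
```

This runs once per output step $t$, each time scoring all $T_x$ input positions, which is where the quadratic cost comes from.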

<img src='pics/attention.png' width='800'>

## Speech Recognition

[Spectrogram](https://en.wikipedia.org/wiki/Spectrogram): typically the x-axis is time, the y-axis is frequency, and the color intensity shows the amount of energy. 


### CTC Cost for speech recognition 

Alex Graves et al. 

CTC = Connectionist temporal classification

Basic rule: collapse repeated characters not separated by "blank" (written `_`), then drop the blanks, e.g. `ttt_h_eee___<space>___qqq` collapses to "the q".
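The collapse rule can be sketched in a few lines. The function name is my own, and `_` is assumed to be the blank symbol, with a literal space character for the word gap.

```python
from itertools import groupby

def ctc_collapse(output, blank="_"):
    """Merge runs of repeated characters, then drop the blank symbol."""
    merged = [char for char, _ in groupby(output)]
    return "".join(char for char in merged if char != blank)

print(ctc_collapse("ttt_h_eee___ ___qqq__"))  # the q
print(ctc_collapse("hh_e_ll_ll_oo"))          # hello
```

The blank between the two `ll` runs is what lets a genuine double letter survive the merge.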

## Trigger Word Detection

Label as 1 right after the trigger word is said. **Problem**: training data is imbalanced (far more 0s than 1s). **Hack**: repeat the 1 label for multiple consecutive timesteps. 

<img src='pics/trigger_word.png' width='800'>