# Language model

![](images/rnn_1.png)

# Machine translation

## Statistical machine translation

![](images/translation_1.png)

### Learn the translation model

- From parallel corpus/data (same text with multiple translations)
- Break P(x|y) down further into P(x,a|y), where a is the **alignment**: the word-level correspondence between French sentence x and English sentence y
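
Written out, the decomposition looks roughly like this. It is a sketch in the notation of these notes (x the French source, y the English target, a the latent alignment), assuming the usual noisy-channel split into a translation model P(x|y) and a language model P(y), with the alignment summed out as a latent variable:

$$
\hat{y} = \arg\max_y P(y \mid x) = \arg\max_y P(x \mid y)\, P(y),
\qquad
P(x \mid y) = \sum_a P(x, a \mid y)
$$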

![](images/translation_2.png)

Alignment can be many-to-one (several words in one language are represented by 1 word in the other) or one-to-many (a single word, often called a fertile word, maps to several words). For example, the single French word **entarté** means **hit someone with a pie**

Alignment can even be many-to-many (phrase-level)

![](images/translation_3.png)

### How to compute argmax?

![](images/translation_4.png)

Example of a decoding approach:
https://youtu.be/XXtpJxZBa2c?t=830

### Cons of statistical machine translation

![](images/translation_5.png)

## Neural machine translation

### Definition

Do machine translation with a single neural network

The **architecture** is called: **sequence-to-sequence**

Other uses of sequence-to-sequence (besides NMT):
- Summarization (long text -> short text)
- Dialogue (previous utterances -> next utterance)
- Parsing (input text -> output parse as a sequence)
- Code generation (natural language -> Python code)

![](images/translation_6.png)

- **This is test-time behavior**. We don't know how to train this yet
- The final hidden state of the encoder RNN (orange box) becomes the initial hidden state of the decoder RNN
- We feed source-language (French) word embeddings to the encoder RNN and target-language (English) word embeddings to the decoder RNN, i.e. we have 2 separate sets of word embeddings
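
The test-time flow above can be sketched in plain Python. This is only a toy illustration, not the lecture's implementation: the vocabularies (`s1`, `t1`, ...), the size `HIDDEN`, the vanilla tanh RNN cell, the random untrained weights, and greedy (argmax) decoding are all assumptions made for the example.

```python
import math
import random

random.seed(0)
HIDDEN = 4  # toy hidden/embedding size (assumption)

# Hypothetical toy vocabularies; encoder and decoder keep separate embeddings.
SRC_VOCAB = ["s1", "s2", "s3"]
TGT_VOCAB = ["<START>", "t1", "t2", "t3", "<END>"]

def rand_vector(n):
    return [random.uniform(-1.0, 1.0) for _ in range(n)]

def rand_matrix(rows, cols):
    return [rand_vector(cols) for _ in range(rows)]

def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def rnn_step(w_x, w_h, x, h):
    """One vanilla RNN step: h' = tanh(W_x x + W_h h)."""
    return [math.tanh(a + b) for a, b in zip(matvec(w_x, x), matvec(w_h, h))]

src_emb = {w: rand_vector(HIDDEN) for w in SRC_VOCAB}
tgt_emb = {w: rand_vector(HIDDEN) for w in TGT_VOCAB}
enc_wx, enc_wh = rand_matrix(HIDDEN, HIDDEN), rand_matrix(HIDDEN, HIDDEN)
dec_wx, dec_wh = rand_matrix(HIDDEN, HIDDEN), rand_matrix(HIDDEN, HIDDEN)
out_w = rand_matrix(len(TGT_VOCAB), HIDDEN)  # hidden state -> vocab logits

def encode(src_words):
    """Run the encoder RNN and return its final hidden state."""
    h = [0.0] * HIDDEN
    for word in src_words:
        h = rnn_step(enc_wx, enc_wh, src_emb[word], h)
    return h

def greedy_decode(src_words, max_len=10):
    """Start the decoder from the encoder's final state; feed each argmax back in."""
    h = encode(src_words)
    word, output = "<START>", []
    for _ in range(max_len):
        h = rnn_step(dec_wx, dec_wh, tgt_emb[word], h)
        logits = matvec(out_w, h)
        word = TGT_VOCAB[logits.index(max(logits))]  # argmax over the vocab
        if word == "<END>":
            break
        output.append(word)
    return output

print(greedy_decode(["s1", "s2", "s3"]))
```

Because the weights are random, the printed "translation" is meaningless; the point is only the data flow from encoder to decoder.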

### A conditional language model

![](images/translation_7.png)
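
For reference, the factorization usually meant by "conditional language model" is the following, with x the source sentence and y the target sentence; the target length T is a symbol assumed here for the sketch:

$$
P(y \mid x) = \prod_{t=1}^{T} P(y_t \mid y_1, \dots, y_{t-1}, x)
$$

It is a language model because each factor predicts the next target word, and conditional because every factor also depends on x.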

### How to train these 2 RNNs

- The encoder RNN reads the source (French) sentence the way a language-model RNN would, but no loss is computed on its outputs.

- For the decoder RNN: initialize its hidden state with the final hidden state of the encoder RNN, then train it as if it's a language model on the target (English) words (with a loss at every step)

![](images/translation_8.png)

- **End-to-end backprop**: the weights of BOTH the decoder RNN and the encoder RNN are updated during training. You can go further and unfreeze the word embeddings of the 2 systems and fine-tune them as well.

- You can of course also initialize the encoder RNN and/or decoder RNN from pretrained language models
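
A minimal sketch of the training loss, assuming the usual setup where the loss is the average negative log-probability that the decoder assigns to the correct target word at each step. The function name `seq2seq_loss` and the probabilities below are made up for illustration.

```python
import math

def seq2seq_loss(step_probs, target_ids):
    """Average cross-entropy over decoder steps.

    step_probs[t] is the decoder's probability distribution over the
    target vocabulary at step t; target_ids[t] is the gold word index.
    """
    losses = [-math.log(probs[gold]) for probs, gold in zip(step_probs, target_ids)]
    return sum(losses) / len(losses)

# Hypothetical decoder outputs for a 3-word target over a 3-word vocabulary.
step_probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.25, 0.25, 0.5],
]
target_ids = [0, 1, 2]

loss = seq2seq_loss(step_probs, target_ids)
print(round(loss, 4))
```

In a real system this scalar is what gets backpropagated through the decoder and then the encoder.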

### Beam search for neural translation inference (test time)

![](images/translation_9.png)

Example: https://youtu.be/XXtpJxZBa2c?t=2132

A finished beam search (k=2) with 2 hypotheses of the same length

![](images/translation_11.png)

Note that this plain beam search ends with k hypotheses, and ideally they all have the same length. That's why we don't normalize their scores here (see more below)
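
A minimal beam search sketch with a fixed number of steps, so every hypothesis has the same length. The "decoder" here is a hypothetical lookup table (`TOY_MODEL`) where the next-word distribution depends only on the previous word; a real NMT decoder would condition on the full prefix and the source sentence.

```python
import math

# Hypothetical toy "decoder": next-word probabilities given the previous word.
# It only covers 3 decoding steps from <START>.
TOY_MODEL = {
    "<START>": {"he": 0.6, "I": 0.4},
    "he": {"hit": 0.7, "struck": 0.3},
    "I": {"was": 0.9, "got": 0.1},
    "hit": {"me": 0.8, "a": 0.2},
    "struck": {"me": 1.0},
    "was": {"hit": 1.0},
    "got": {"hit": 1.0},
}

def next_log_probs(prefix):
    """Log-probabilities of each possible next word after this prefix."""
    return {w: math.log(p) for w, p in TOY_MODEL[prefix[-1]].items()}

def beam_search(k, num_steps):
    """Keep the k highest-scoring partial hypotheses at every step."""
    beams = [(0.0, ["<START>"])]  # (sum of log-probs, words so far)
    for _ in range(num_steps):
        candidates = []
        for score, words in beams:
            for word, logp in next_log_probs(words).items():
                candidates.append((score + logp, words + [word]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
    return beams

beams = beam_search(k=2, num_steps=3)
for score, words in beams:
    print(round(math.exp(score), 3), " ".join(words[1:]))
```

With these made-up numbers, greedy decoding would commit to "he" first and end at "he hit me" (probability 0.336), while the beam keeps "I" alive and finds "I was hit" (0.36).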

### Stopping criterion for beam search + pick the best hypotheses

![](images/translation_10.png)

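Now that we have specific stopping criteria, the hypotheses under consideration no longer all have the same length, so their raw scores are not directly comparable (longer hypotheses accumulate lower scores)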

![](images/translation_12.png)
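
A small sketch of why length matters when picking the final hypothesis, assuming the common fix of dividing each score by the hypothesis length. The scores and the helper name `best_hypothesis` are invented for the example.

```python
def best_hypothesis(hypotheses):
    """Pick the hypothesis with the highest average log-probability per word."""
    return max(hypotheses, key=lambda h: h[0] / len(h[1]))

# Hypothetical finished hypotheses: (sum of log-probs, words).
finished = [
    (-2.0, ["he", "hit", "me"]),
    (-2.8, ["he", "hit", "me", "with", "a", "pie"]),
]

raw_best = max(finished, key=lambda h: h[0])
normalized_best = best_hypothesis(finished)
print(" ".join(raw_best[1]))
print(" ".join(normalized_best[1]))
```

The raw score prefers the short hypothesis simply because it multiplies fewer probabilities; the per-word average prefers the longer one here.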

## Pros and cons of neural MT over statistical MT

![](images/translation_13.png)

Disadvantages:
- less interpretable
    - hard to interpret the neurons and weights of the RNN
    - in contrast, SMT has subcomponents that are understandable, since humans designed them
- hard to control
    - hard to add or enforce hard-coded linguistic rules
    - hard to impose safety guidelines (safety concerns: controversial translations, swear words ...)

# Attention

A problem with the seq2seq model: the decoder RNN is fed only the final hidden state vector of the encoder RNN, so that single vector has to capture the whole source sentence. This is the **information bottleneck**

![](images/translation_14.png)

## Definition

On each step of the decoder, use a **direct connection to the encoder** to **focus on a particular part** of the **source sentence**

## Steps

![](images/translation_15.png)

Why 'dot product'? The decoder hidden state is trying to predict the next English word, so it contains the information needed to do so. A **high dot product (i.e. high similarity; the dot product is the unnormalized form of cosine similarity)** between that hidden state and an encoder hidden state (of some French word) **says: "oh this English word I am trying to predict might be highly associated with this French word"**
- note that the dot product is the **basic** way to calculate the attention scores; it assumes the 2 vectors (hidden states of encoder and decoder) have the same size
- there is also **multiplicative attention** for calculating the scores
    - it uses a learnable weight matrix to learn the best way to calculate those scores
    ![](images/translation_20.png)
- also, **additive attention**, see more here: https://youtu.be/XXtpJxZBa2c?t=4517
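
The two score functions above can be compared in a few lines. This is a sketch under assumptions: the vectors and the matrix `W` are made-up numbers, and in a real model `W` would be learned rather than fixed.

```python
def dot_score(s, h):
    """Basic dot-product attention score; requires len(s) == len(h)."""
    return sum(si * hi for si, hi in zip(s, h))

def multiplicative_score(s, W, h):
    """Multiplicative attention score s^T W h; W has shape len(s) x len(h)."""
    Wh = [sum(w * hi for w, hi in zip(row, h)) for row in W]
    return sum(si * x for si, x in zip(s, Wh))

# Same-size vectors: with W = identity the two scores coincide.
s = [1.0, 2.0]
h_same = [3.0, 4.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
print(dot_score(s, h_same), multiplicative_score(s, identity, h_same))

# Different sizes: only the multiplicative form is defined.
h_bigger = [0.5, -1.0, 2.0]
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # hypothetical 2 x 3 weight matrix
print(multiplicative_score(s, W, h_bigger))
```

The weight matrix is what lets the decoder and encoder hidden states have different sizes.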

...continue to attention steps...

![](images/translation_16.png)

![](images/translation_17.png)

## Math equations

![](images/translation_18.png)
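
In text form, the basic dot-product attention steps are commonly written as follows. The symbols are assumptions matching the usual lecture notation: encoder hidden states $h_1, \dots, h_N$, decoder hidden state $s_t$ at step $t$, all of the same dimension.

$$
\begin{aligned}
e^t &= [\, s_t^\top h_1, \dots, s_t^\top h_N \,] \in \mathbb{R}^N && \text{(attention scores)} \\
\alpha^t &= \operatorname{softmax}(e^t) \in \mathbb{R}^N && \text{(attention distribution)} \\
a_t &= \sum_{i=1}^{N} \alpha^t_i \, h_i && \text{(attention output)} \\
&[\, a_t ; s_t \,] && \text{(concatenate, then proceed as without attention)}
\end{aligned}
$$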

For the last step, **'proceed as in the non-attention seq2seq model'** = basically treat [a_t; s_t] the same way as concat[embedding, old_h] => orange arrow (matmul weight + tanh + dropout) => blue arrow (matmul weight + softmax) to get vocab probabilities, then compute the cross-entropy loss against the actual y
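
One decoder step of dot-product attention, sketched in plain Python. The function name `attention_step` and the toy vectors are assumptions for illustration; the layers that follow the concatenation are left out.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_step(s_t, encoder_states):
    """Return the attention distribution and the concatenation [a_t; s_t]."""
    # 1. Scores: dot product of the decoder state with each encoder state.
    scores = [sum(a * b for a, b in zip(s_t, h)) for h in encoder_states]
    # 2. Attention distribution over source positions.
    alpha = softmax(scores)
    # 3. Attention output: weighted sum of the encoder states.
    a_t = [
        sum(w * h[d] for w, h in zip(alpha, encoder_states))
        for d in range(len(s_t))
    ]
    # 4. Concatenate (list "+" is concatenation, not element-wise addition).
    return alpha, a_t + s_t

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 source positions
s_t = [2.0, 0.0]                                        # current decoder state
alpha, combined = attention_step(s_t, encoder_states)
print([round(w, 3) for w in alpha])
print([round(v, 3) for v in combined])
```

The first and third source positions get equal, higher weight here because they have the same (larger) dot product with `s_t`.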

## Pros of using attention

- **Solving the bottleneck problem**: instead of relying only on the last hidden state of the encoder RNN, **with attention, the decoder RNN can look at all the hidden states of the encoder and decide which to focus on** (based on the attention distribution)

- **Direct connection between encoder and decoder**, similar to a shortcut connection (skip connection): each decoder hidden state connects to all the encoder hidden states => lots of connections => gradients flow back more easily since there is no bottleneck => **reduces the vanishing gradient problem**

- **A sense of interpretability** from looking at the attention distribution for each word. Similar to the hard alignment of SMT, but this time the neural net learns a (soft) alignment by itself. TODO: reproduce this

![](images/translation_19.png)

## General definition of attention

https://youtu.be/XXtpJxZBa2c?t=4273