# 3. Neural Machine Translation

## 3.1. Feed Forward Neural Networks

* **Recap on FFNN**

>$$y^{(k)} = \phi(U^{(k)}) = \phi(W^{(k)} X^{(k)} + b^{(k)})$$

>* $\phi$: element-wise non-linear activation $\phi_i(U^{(k)}) = \phi (U^{(k)}_i)$
>* Parameters: $\Theta = \{ W^{(k)}, b^{(k)} \}^K_{k=1}$
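
The layer recursion above can be sketched in NumPy. This is a minimal illustration, assuming $\phi = \tanh$ and that each layer's output is the next layer's input; the function name `ffnn_forward` and the layer widths are hypothetical.

```python
import numpy as np

def ffnn_forward(x, params, phi=np.tanh):
    """Apply y^(k) = phi(W^(k) x^(k) + b^(k)) for k = 1..K, chaining the layers."""
    for W, b in params:
        x = phi(W @ x + b)  # phi acts element-wise on the pre-activation U^(k)
    return x

rng = np.random.default_rng(0)
sizes = [5, 4, 3]  # hypothetical layer widths: input 5, hidden 4, output 3
params = [(rng.normal(size=(m, n)), np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]
y = ffnn_forward(rng.normal(size=sizes[0]), params)
```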

* **FFNN for Symbol Sequence Transduction**

>$$y^{(1)} = \phi \left( W^{(1)}_{d \times V_S} [s]_{V_S \times 1} + b^{(1)}_{d \times 1} \right) \in \mathbb{R}^d$$

>* **One-Hot Representations**
>  * Source Vocabulary: $\mathcal{V}_S = \{s_1,...,s_{V_S}\} \rightarrow [s_i] \in \mathbb{R}^{V_S}$  
>  * Target Vocabulary: $\mathcal{V}_T = \{t_1,...,t_{V_T}\} \rightarrow [t_i] \in \mathbb{R}^{V_T}$
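
Because $[s]$ is one-hot, the product $W^{(1)}[s]$ simply selects one column of $W^{(1)}$. The sketch below checks this numerically; the sizes, the symbol index and the choice $\phi = \tanh$ are illustrative assumptions.

```python
import numpy as np

def one_hot(index, vocab_size):
    """Return the one-hot vector [s] for the symbol with the given index."""
    v = np.zeros(vocab_size)
    v[index] = 1.0
    return v

V_S, d = 6, 3                      # hypothetical vocabulary and hidden sizes
rng = np.random.default_rng(0)
W1 = rng.normal(size=(d, V_S))
b1 = np.zeros(d)

s = one_hot(2, V_S)
y1 = np.tanh(W1 @ s + b1)          # full matrix-vector product
y1_lookup = np.tanh(W1[:, 2] + b1) # equivalent column lookup
```

In practice the lookup form is preferred, since it avoids multiplying by a mostly-zero vector.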

* **FFNN for Sentence Translation Probabilities** (hidden layer: bottleneck)

><img src = 'images/image3_01.png' width=500>

>$$P(t^J_1 | s^I_1, A) = \prod_{(i,j) \in A} P(t_j|s_i) = \prod_{(i,j)\in A} y_j (s_i)$$
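
The product over alignment links can be sketched as a sum of log-probabilities. In this hedged example a small lookup table stands in for the network output $y_j(s_i)$; the table values, the sentence pair and the function name are all hypothetical.

```python
import math

def sentence_log_prob(src, tgt, alignment, word_prob):
    """log P(t | s, A) = sum over links (i, j) in A of log P(t_j | s_i)."""
    return sum(math.log(word_prob(tgt[j], src[i])) for i, j in alignment)

# Hypothetical word translation probabilities P(t | s), keyed by (s, t).
table = {("la", "the"): 0.8, ("maison", "house"): 0.5}

def word_prob(t, s):
    return table[(s, t)]

src = ["la", "maison"]
tgt = ["the", "house"]
alignment = [(0, 0), (1, 1)]       # links (i, j): source position, target position
log_p = sentence_log_prob(src, tgt, alignment, word_prob)
```

Working in log space avoids numerical underflow for long sentences.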

>* **Including Source Context**

>$$P(t^J_1 | s^I_1, A) = \prod_{(i,j) \in A} P(t_j|s_{i-1},s_i,s_{i+1}) \equiv \prod_{(i,j)\in A} y_j (c_i)$$

>* **Including Target Context**

>$$P(t^J_1 | s^I_1, A) = \prod^J_{j=1} \prod_{i:(i,j) \in A} P(t_j|t_{j-1},s_{i-1},s_i,s_{i+1}) \equiv \prod^J_{j=1} \prod_{i:(i,j) \in A} y_j (c_{i,j})$$

* **Training FFNN**

>* Word-aligned parallel text $\rightarrow$ training instances $\{(t^p,c^p)\}^n_{p=1}$

>$$E(\Theta) = - \sum^n_{p=1} \log p(t^p|c^p)$$

>* Since $[t]$ is one-hot-encoded, it selects the $j$-th output activation: $U^{(K)}_j = [t]' U^{(K)} = [t]'W^{(K)}X^{(K)}$ (bias omitted)

>$$y_j (c) = \frac{e^{-U_j^{(K)}}}{\sum_{j'} e^{-U_{j'}^{(K)}}} = \frac{e^{-[t]'W^{(K)}X^{(K)}}}{z(c;\Theta)} = p(t|c)$$

>* **Self-normalized softmax:** train so that $z(c;\Theta) \approx 1$ for all $c$ (or minimize the variance of $z$); evaluation can then skip the computationally expensive softmax normalization

>$$E(\Theta) = \sum^n_{p=1} \left( [t^p]' W^{(K)} X^{(K)} + \alpha [\log z(c^p;\Theta)]^2 \right)$$
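
A small numerical sketch of the two objectives, following the sign convention used here ($p(t|c) = e^{-U_t}/z$). The function names, the value of $\alpha$ and the test activations are assumptions for illustration only.

```python
import numpy as np

def log_z(u):
    """log z = log sum_j exp(-u_j), computed stably via the max trick."""
    m = np.max(-u)
    return m + np.log(np.sum(np.exp(-u - m)))

def cross_entropy(U, targets):
    """-sum_p log p(t^p | c^p) with p(t|c) = exp(-U_t) / z."""
    return sum(u[t] + log_z(u) for u, t in zip(U, targets))

def self_normalized_loss(U, targets, alpha=0.1):
    """sum_p ( U_t + alpha * (log z)^2 ), the objective as written above."""
    return sum(u[t] + alpha * log_z(u) ** 2 for u, t in zip(U, targets))

# Activations chosen so that z = 1 exactly: u = -log(p) for a probability vector p.
p_true = np.array([0.2, 0.3, 0.5])
U = [-np.log(p_true)]
targets = [1]
ce = cross_entropy(U, targets)
sn = self_normalized_loss(U, targets)
```

When $z = 1$ the penalty vanishes and $e^{-U_t}$ can be read directly as a probability, which is the point of self-normalization.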

* **FFNN LM** (e.g. trigram)

>$$P(t^J_1) = \prod^J_{j=1} p(t_j|t_{j-1},t_{j-2})$$

>* **Implement using FFNN** (weight matrix shared for different time-frames)

><img src = 'images/image3_02.png' width=500>

>* **Implement using WFSA**

>$$(t_{j-2},t_{j-1}) \overset{t_j / p(t_j|t_{j-1},t_{j-2})}{\longrightarrow} (t_{j-1},t_{j}) $$
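
The WFSA view can be sketched by treating the pair $(t_{j-2}, t_{j-1})$ as the state and accumulating arc weights as log-probabilities. The start symbol `<s>`, the uniform toy model and the function name are hypothetical.

```python
import math

def trigram_log_prob(tokens, prob, bos="<s>"):
    """Walk the trigram automaton, summing log p(t_j | t_{j-1}, t_{j-2})."""
    state = (bos, bos)                  # state = (t_{j-2}, t_{j-1})
    total = 0.0
    for t in tokens:
        total += math.log(prob(t, state))
        state = (state[1], t)           # arc to the next state (t_{j-1}, t_j)
    return total, state

def uniform_prob(t, state):
    """Toy stand-in for the FFNN output: uniform over a 4-word vocabulary."""
    return 0.25

log_p, final_state = trigram_log_prob(["a", "b", "c"], uniform_prob)
```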

## 3.2. RNN LM

* **RNNLM**

><img src = 'images/image3_03.png' width=300>

>* **Unrolled**

><img src = 'images/image3_04.png' width=300>

>\begin{align}
P(t_j|t_{<j}) &= \text{softmax} (Rh_{j-1} + c) \in [0,1]^{V_T} \\
h_j &= \tanh (W[t_j] + Uh_{j-1} + b) \in \mathbb{R}^d
\end{align}
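
The two recurrences can be sketched as a loop that first predicts $t_j$ from $h_{j-1}$ and then updates the state with $t_j$. This is a minimal sketch with random, untrained parameters; the zero initial state $h_0$ and the sizes are assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def rnnlm_log_prob(tokens, W, U, b, R, c):
    """Sum of log P(t_j | t_<j) for a sequence of token indices."""
    h = np.zeros(U.shape[0])                 # h_0, assumed to be zero
    log_p = 0.0
    for t in tokens:
        p = softmax(R @ h + c)               # P(. | t_<j) from h_{j-1}
        log_p += np.log(p[t])
        h = np.tanh(W[:, t] + U @ h + b)     # W[t_j] is column t of W
    return log_p, h

V_T, d = 5, 4                                # hypothetical sizes
rng = np.random.default_rng(0)
W, U, b = rng.normal(size=(d, V_T)), rng.normal(size=(d, d)), np.zeros(d)
R, c = rng.normal(size=(V_T, d)), np.zeros(V_T)
log_p, h_final = rnnlm_log_prob([1, 3, 0], W, U, b, R, c)
```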

>* **Bidirectional RNN** - concatenation of forward & reverse states

>$$h(s^{T_s}_1) = \left( \begin{matrix} \overrightarrow{h_{T_s}} \\ \overleftarrow{h_1} \end{matrix} \right)$$
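
A hedged sketch of the concatenation: one RNN reads the sentence left to right, a second reads it right to left, and the two final states are stacked. The plain $\tanh$ cell, the parameter tuples and the input dimensions are illustrative assumptions.

```python
import numpy as np

def run_rnn(xs, W, U, b):
    """Return the final state of a simple tanh RNN over the input vectors xs."""
    h = np.zeros(U.shape[0])
    for x in xs:
        h = np.tanh(W @ x + U @ h + b)
    return h

def bidirectional_encode(xs, fwd, bwd):
    """Concatenate the forward state at T_s with the backward state at position 1."""
    h_fwd = run_rnn(xs, *fwd)           # reads s_1 ... s_{T_s}
    h_bwd = run_rnn(xs[::-1], *bwd)     # reads s_{T_s} ... s_1
    return np.concatenate([h_fwd, h_bwd])

n, d, T_s = 2, 3, 4                     # hypothetical input size, state size, length
rng = np.random.default_rng(0)
xs = rng.normal(size=(T_s, n))
fwd = (rng.normal(size=(d, n)), rng.normal(size=(d, d)), np.zeros(d))
bwd = (rng.normal(size=(d, n)), rng.normal(size=(d, d)), np.zeros(d))
h = bidirectional_encode(xs, fwd, bwd)
```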

* **Stacked RNN**

><img src = 'images/image3_05.png' width=300>

>\begin{align}
P(t_j|t_{<j}) &= \text{softmax} (Rh^L_{j-1} + c) \\
h^l_j &= \tanh (W^l h^{l-1}_j + U^l h^l_{j-1} + b^l) \\
h^0_j &= [t_j]
\end{align}

>* $[h^l_j]$: representation of $t^j_1$
>* Single vector $h_J \in \mathbb{R}^{dL}$ (the $L$ layer states at position $J$, concatenated): representation of $t^J_1$
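
One time step of the stacked recurrence can be sketched as a loop over layers, where each layer consumes the output of the layer below and its own previous state. Sizes and parameters are hypothetical, and the final concatenation is one possible way to form a single sentence vector.

```python
import numpy as np

def stacked_rnn_step(x, h_prev, layers):
    """Compute h^l_j for l = 1..L from the input h^0_j = x and the states h^l_{j-1}."""
    h_new = []
    below = x                                    # h^0_j = [t_j]
    for (W, U, b), h_l in zip(layers, h_prev):
        below = np.tanh(W @ below + U @ h_l + b)
        h_new.append(below)
    return h_new

V_T, d, L = 5, 3, 2                              # hypothetical sizes
rng = np.random.default_rng(0)
layers = [(rng.normal(size=(d, V_T if l == 0 else d)),
           rng.normal(size=(d, d)), np.zeros(d)) for l in range(L)]

h = [np.zeros(d) for _ in range(L)]
for t in [1, 3, 0]:                              # token indices t_1..t_3
    x = np.zeros(V_T)
    x[t] = 1.0                                   # one-hot [t_j]
    h = stacked_rnn_step(x, h, layers)
sentence_vector = np.concatenate(h)              # all layer states, stacked
```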

* **GRU - Gated Recurrent Units**

>* **Reset Gate**

><img src = 'images/image3_06.png' width=250>

>\begin{align}
r_j &= \sigma(W_r x_j + U_r h_{j-1} + b_r) \\
h_j &= \tanh (W x_j + U (r_j \odot h_{j-1}) + b)
\end{align}

>* **Update Gate**

><img src = 'images/image3_07.png' width=300>

>\begin{align}
u_j &= \sigma(W_u x_j + U_u h_{j-1} + b_u) \\
h_j &= u_j \odot \tilde{h}_j + (1-u_j) \odot h_{j-1} \\
\tilde{h}_j &= \tanh (W x_j + U(r_j \odot h_{j-1}) + b)
\end{align}
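
Putting the reset and update gates together, one GRU step can be sketched as follows. This assumes each gate has its own input matrix, recurrent matrix and bias (named `W_r`, `U_r`, `b_r` and `W_u`, `U_u`, `b_u` here) and that the new state interpolates between the candidate $\tilde{h}_j$ and $h_{j-1}$; the parameter dictionary is a hypothetical convenience.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, p):
    """One GRU update: reset gate r, update gate u, candidate h_tilde, new state."""
    r = sigmoid(p["W_r"] @ x + p["U_r"] @ h_prev + p["b_r"])
    u = sigmoid(p["W_u"] @ x + p["U_u"] @ h_prev + p["b_u"])
    h_tilde = np.tanh(p["W"] @ x + p["U"] @ (r * h_prev) + p["b"])
    return u * h_tilde + (1.0 - u) * h_prev

n, d = 2, 3                                   # hypothetical input and state sizes
p = {k: np.zeros((d, n)) for k in ("W_r", "W_u", "W")}
p.update({k: np.zeros((d, d)) for k in ("U_r", "U_u", "U")})
p.update({k: np.zeros(d) for k in ("b_r", "b_u", "b")})

h_prev = np.array([0.1, -0.2, 0.3])
p["b_u"] = np.full(d, -50.0)                  # u ~ 0: the state is copied forward
h_copy = gru_step(np.ones(n), h_prev, p)
p["b_u"] = np.full(d, 50.0)                   # u ~ 1: the state is replaced by h_tilde
h_replace = gru_step(np.ones(n), h_prev, p)
```

The two extreme settings of the update gate show how the unit can either carry its state unchanged across many steps or overwrite it.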

* **LSTMs - Long Short-Term Memory units**

><img src = 'images/image3_08.png' width=250>

>\begin{align}
h^l_j &= \mathcal{F}(h^l_{j-1}, h^{l-1}_j) \\
c^l_j &= f \odot c^l_{j-1} + i \odot g \\
h^l_j &= \tanh(c^l_j) \odot o
\end{align}

>* Sigmoids: $i, o, g, f$
>* $g\approx 1, f\approx 0, o\approx 1 \rightarrow$ RNN unit
>* **LSTM with Dropout:** $g=\sigma (W D(h^{l-1}_j) + U h^l_{j-1} + b)$, where $D(\cdot)$ is the dropout operator
>  * Applied to non-recurrent connections
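
A minimal sketch of one LSTM step with dropout on the non-recurrent input only. It assumes the widely used parameterisation in which $i$, $f$, $o$ are sigmoid gates and $g$ is a $\tanh$ candidate (swap in a sigmoid for $g$ to match the list above), a single stacked weight matrix for the four pre-activations, and inverted dropout; all names and sizes are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dropout(x, rate, rng):
    """Inverted dropout: zero entries with probability `rate`, rescale the rest."""
    if rate == 0.0:
        return x
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

def lstm_step(x_below, h_prev, c_prev, W, U, b, rate=0.0, rng=None):
    """One LSTM update; D(.) is applied to h^{l-1}_j but not to h^l_{j-1}."""
    d = h_prev.shape[0]
    x_below = dropout(x_below, rate, rng)
    z = W @ x_below + U @ h_prev + b          # stacked pre-activations, shape (4d,)
    i, f, o = sigmoid(z[:d]), sigmoid(z[d:2 * d]), sigmoid(z[2 * d:3 * d])
    g = np.tanh(z[3 * d:])                    # assumed tanh candidate
    c = f * c_prev + i * g
    h = np.tanh(c) * o
    return h, c

n, d = 2, 3                                   # hypothetical input and state sizes
rng = np.random.default_rng(0)
W, U, b = rng.normal(size=(4 * d, n)), rng.normal(size=(4 * d, d)), np.zeros(4 * d)
h, c = lstm_step(rng.normal(size=n), np.zeros(d), np.zeros(d), W, U, b,
                 rate=0.5, rng=rng)
```

Leaving the recurrent path untouched means the memory cell is not corrupted repeatedly as it is carried across time steps.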