# 4. Statistical Models for Word Alignment in Translation

## 4.1. Word Alignment Models

* **Word Alignment**

>* **Alignment link:** $(i,j) \Leftrightarrow e_i \leftrightarrow f_j\;\;\;$ ($e_0$: NULL)
>* **Alignment process:** $a_j = i \Leftrightarrow e_{a_j} \leftrightarrow f_j$
>* **NULL:** used when simple word-to-word translation is inadequate

* **Statistical Model**

>$$P(f^J_1, a^J_1, J|e^I_0) = \underset{\text{(word translation distribution)}}{\prod^J_{j=1} p_T (\;f_j|e_{a_j})} \times \underset{\text{(alignment distribution)}}{P_A (a^J_1 | J,I)} \times \underset{\text{(sentence length distribution)}}{p_L(J|I)}$$

>* $p_L(J|I)$: simple but informative / tuned to language pairs and translation domains
>* $p_T(f|e)$: huge table of probabilities / back-off prob. distributed equally over remaining words
>* $P_A(a^J_1|J,I)$: completely unlexicalised / many models exist
>* **Automatic Alignments:** $\hat{a}_1^J = \text{argmax}_{a^J_1} P(f^J_1,a^J_1,J|e^I_1)$
>* **Translation Probability:** $P(f^J_1|e^I_1) = \sum_{a^J_1} P(f^J_1,a^J_1,J|e^I_1)$
>  * **Viterbi procedure** for Model-1 & Model-2
>  * **Forward-backward algorithm** for HMM
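
The factorisation above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the tables `p_T`, `flat_alignment` and `length` are made-up toy values, and the flat alignment term uses $1/(I+1)$ on the assumption that NULL ($e_0$) is a valid link target.

```python
from math import prod

def joint_prob(f, e, a, p_T, p_A, p_L):
    """P(f, a, J | e) = prod_j p_T(f_j | e_{a_j}) * P_A(a | J, I) * p_L(J | I).

    e[0] is the NULL token, so I = len(e) - 1; a[j] is an index into e.
    """
    J, I = len(f), len(e) - 1
    translation = prod(p_T[(fj, e[aj])] for fj, aj in zip(f, a))
    return translation * p_A(a, J, I) * p_L(J, I)

# Toy parameters (assumed values, for illustration only)
e = ["NULL", "the", "house"]
f = ["la", "maison"]
p_T = {("la", "the"): 0.7, ("maison", "house"): 0.8}
flat_alignment = lambda a, J, I: (1.0 / (I + 1)) ** J  # Model-1 style, NULL included
length = lambda J, I: 0.5                              # placeholder length model

p = joint_prob(f, e, [1, 2], p_T, flat_alignment, length)
```

Swapping `flat_alignment` for a Model-2 or HMM alignment function changes only the middle factor; the other two terms are shared by all the models in this section.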

* **IBM Model-1** (flat)

>$$p_{M1}(a_j|j,J,I) = \frac{1}{I} \;\;\;\rightarrow\;\;\; P_A(a^J_1|J,I) = \frac{1}{I^J}$$

* **IBM Model-2** (alignment links are independent)

>$$p_{M2}(a_j|j,J,I) \text{: table of dimension } I\times J\times I \;\;\;\rightarrow\;\;\; P_A(a^J_1|J,I) = \prod^J_{j=1} p_{M2}(a_j|j,J,I)$$

* **HMM Model** (1st order Markov Process)

>$$P_A (a^J_1|J,I) = \prod^J_{j=1} P(a_j|a^{j-1}_1, J, I) \approx \prod^J_{j=1} p_{HMM} (a_j|a_{j-1},I)$$

>* $p_{HMM} (\cdot)$: Markov Position Alignment Distribution

><img src = 'images/image4_01.png' width=400>

* **IBM Model-4** (implemented in GIZA++ toolkit)

>* **Generation Steps**
>  * **Step 1:** Create a tablet for each source word (allow **Null translation**)
>  * **Step 2:** **Fertility** (no. of target words each source word can generate)
>  * **Step 3:** Fill in tablet positions with target words, sampled from **translation table**
>  * **Step 4:** **Distortion model** (how target words are distributed throughout the sentence)

>* **Strength**
>  * Fertility & distortion $\rightarrow$ capture **phrases** in translation
>* **Weakness**
>  * Deficiency: the distortion model may assign prob. to non-sentences (prob. over alignments and generated sentences do not sum to one)
>  * Parameter estimation & word alignment $\rightarrow$ difficult to implement (DP-based algorithms not available)

## 4.2. Parameter Estimation for Alignment Models

* **Translation Sentence Pairs**

>$$\{F^{(r)},E^{(r)}\}^R_{r=1}$$

* **Sentence Length Distribution**

>$$P_L(J|I) = \frac{\sum^R_{r=1} \mathbb{1}(J=J^{(r)}, I=I^{(r)})}{\sum^R_{r=1} \mathbb{1}(I=I^{(r)})}$$
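
The relative-frequency estimate above amounts to two counters. A minimal sketch, assuming each sentence pair is given as a pair of token lists (the toy `corpus` is invented for illustration):

```python
from collections import Counter

def length_distribution(pairs):
    """pairs: list of (F, E) token lists. Returns {(J, I): p_L(J | I)}."""
    joint = Counter((len(F), len(E)) for F, E in pairs)   # numerator counts
    marginal = Counter(len(E) for _, E in pairs)          # denominator counts
    return {(J, I): c / marginal[I] for (J, I), c in joint.items()}

corpus = [
    (["a", "b"], ["x", "y"]),
    (["a", "b", "c"], ["x", "y"]),
    (["a"], ["x"]),
]
p_L = length_distribution(corpus)
```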

* **Word Translation Distribution** (require word alignment)

>$$\#_T(f\leftrightarrow e) = \sum^R_{r=1} \sum^{J^{(r)}}_{j=1} \sum^{I^{(r)}}_{i=1}
\mathbb{1}(e=e^{(r)}_i) \mathbb{1}(f=f^{(r)}_j) \mathbb{1}(a^{(r)}_j = i)$$

>$$P_T(f|e) = \frac{\#_T(f\leftrightarrow e)}{\sum_{f'} \#_T(f'\leftrightarrow e)}$$
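
With hard alignments available, the counts and the normalisation above can be sketched as follows. The data layout (`(F, E, a)` triples with `E[0] == "NULL"`) is an assumption made for this example; links to NULL are counted like any other link here, whereas the sum above starts at $i=1$.

```python
from collections import defaultdict

def translation_table(aligned_corpus):
    """aligned_corpus: list of (F, E, a); a[j] is the index in E linked to F[j].

    Returns p_T as a nested dict: p_T[e][f] = #(f <-> e) / sum_f' #(f' <-> e).
    """
    counts = defaultdict(lambda: defaultdict(float))
    for F, E, a in aligned_corpus:
        for fj, aj in zip(F, a):
            counts[E[aj]][fj] += 1.0
    return {e: {f: c / sum(row.values()) for f, c in row.items()}
            for e, row in counts.items()}

aligned = [
    (["la", "maison"], ["NULL", "the", "house"], [1, 2]),
    (["la", "fleur"], ["NULL", "the", "flower"], [1, 2]),
    (["le", "livre"], ["NULL", "the", "book"], [1, 2]),
]
p_T = translation_table(aligned)
```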

* **Alignment Distribution (Model-2)**

>$$\#_{M2}(i,j,J,I) = \sum^R_{r=1} \mathbb{1}(J=J^{(r)},I=I^{(r)}) \, \mathbb{1}(i=a^{(r)}_j)$$

>$$p_{M2}(i|j,J,I) = \frac{\#_{M2}(i,j,J,I)}{\sum^I_{i'=1} \#_{M2}(i',j,J,I)}$$

>$$P_A(a^J_1|J,I) = \prod^J_{j=1} p_{M2}(a_j|j,J,I)$$
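
A minimal sketch of the Model-2 table estimate, assuming each aligned sentence is summarised as `(J, I, a)` with `a[j-1]` the source position linked to target position `j` (toy data invented for illustration):

```python
from collections import defaultdict

def model2_table(aligned_corpus):
    """Returns {(j, J, I): {i: p_M2(i | j, J, I)}} from hard alignments."""
    counts = defaultdict(lambda: defaultdict(float))
    for J, I, a in aligned_corpus:
        for j, i in enumerate(a, start=1):
            counts[(j, J, I)][i] += 1.0          # one count per observed link
    return {ctx: {i: c / sum(row.values()) for i, c in row.items()}
            for ctx, row in counts.items()}

data = [(2, 2, [1, 2]), (2, 2, [1, 1]), (2, 2, [2, 2])]
p_M2 = model2_table(data)
```

Because the table is indexed by $(j, J, I)$, counts are shared only among sentence pairs with identical lengths, which is why the table becomes sparse on real corpora.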

* **Alignment Distribution (HMM)**

>$$\#_{HMM} (i,i',I) = \sum^R_{r=1} \mathbb{1}(I=I^{(r)}) \sum^{J^{(r)}}_{j=1} \mathbb{1}(i=a^{(r)}_j, i'=a^{(r)}_{j-1}) $$

>$$p_{HMM}(i|i',I) = \frac{\#_{HMM}(i,i',I)}{\sum^I_{i''=1} \#_{HMM}(i'',i',I)}$$

>$$P_A (a^J_1|J,I) = \prod^J_{j=1} P(a_j|a^{j-1}_1, J, I) \approx \prod^J_{j=1} p_{HMM} (a_j|a_{j-1},I)$$
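
The HMM transition counts can be sketched the same way. The notes do not say how $a_0$ is defined for the first position, so the `start` state below is an assumption introduced only for this example.

```python
from collections import defaultdict

def hmm_transitions(aligned_corpus, start=0):
    """aligned_corpus: list of (I, a). Returns {(i_prev, I): {i: p_HMM(i | i_prev, I)}}.

    `start` is an assumed value for a_0 (not specified in the notes).
    """
    counts = defaultdict(lambda: defaultdict(float))
    for I, a in aligned_corpus:
        prev = start
        for i in a:
            counts[(prev, I)][i] += 1.0
            prev = i
    return {ctx: {i: c / sum(row.values()) for i, c in row.items()}
            for ctx, row in counts.items()}

data = [(3, [1, 2, 3]), (3, [1, 2, 2]), (3, [1, 3, 3])]
p_HMM = hmm_transitions(data)
```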


## 4.3. Automatic Word Alignment - Viterbi Training

* **Viterbi Alignment**

>$$\hat{a}^J_1 = \underset{a^J_1}{\text{argmax}} P(f^J_1, a^J_1 | e^I_0)$$
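
For Model-1 and Model-2 the links are independent, so the argmax decomposes into one maximisation per target position; the HMM instead needs dynamic programming over positions. A minimal sketch for the Model-1/2 case, with a hypothetical `floor` value for unseen word pairs and toy probabilities:

```python
def viterbi_alignment_m2(f, e, p_T, p_M2=None, floor=1e-12):
    """e[0] is NULL. p_M2(i, j, J, I) gives the alignment prior;
    passing None falls back to the flat Model-1 prior 1/(I+1)."""
    J, I = len(f), len(e) - 1
    alignment = []
    for j, fj in enumerate(f, start=1):
        def score(i):
            prior = p_M2(i, j, J, I) if p_M2 else 1.0 / (I + 1)
            return prior * p_T.get((fj, e[i]), floor)
        alignment.append(max(range(I + 1), key=score))  # best source index for f_j
    return alignment

e = ["NULL", "the", "house"]
f = ["la", "maison"]
p_T = {("la", "the"): 0.7, ("la", "house"): 0.1,
       ("maison", "the"): 0.1, ("maison", "house"): 0.8}
best = viterbi_alignment_m2(f, e, p_T)
```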

* **Flat Start Training Procedure**

>* **1. Model-1**

>>* **1.1. Model-1 Initialization** - set $p_T(f|e)$ to be uniform
>>* **1.2. Model-1 Viterbi alignment** (to generate word-aligned parallel text)
>>* **Repeat** (until some stopping criterion is met)
>>  * **1.3. Update $p_T(f|e)$** (from the word-aligned parallel text)
>>  * **1.4. Viterbi alignment**
>>* **1.5. Output:** Model-1 word-aligned parallel text

>* **2. Model-2**

>>* **2.1. Model-2 Initialization** - find $p_{M2}(i|j,J,I)$ and $p_T(f|e)$ from **1.5.**
>>* **2.2. Model-2 Viterbi alignment**
>>* **Repeat**
>>  * **2.3. Update $p_{M2}(i|j,J,I)$ and $p_T(f|e)$** 
>>  * **2.4. Viterbi alignment**
>>* **2.5. Output:** Model-2 word-aligned parallel text

>* **3. HMM**

>>* **3.1. HMM Initialization** - find $p_{HMM}(i|i',I)$ and $p_T(f|e)$ from **2.5.**
>>* **3.2. HMM Viterbi alignment**
>>* **Repeat**
>>  * **3.3. Update $p_{HMM}(i|i',I)$ and $p_T(f|e)$**
>>  * **3.4. Viterbi alignment**
>>* **3.5. Output:** HMM parameters & word-aligned parallel text
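
Stage 1 of the procedure (steps 1.1 to 1.5) can be sketched as below. Two choices here are assumptions, not part of the notes: ties are broken at random (under a uniform start every alignment scores the same, so a deterministic tie-break would link everything to one position), and the stopping criterion is a fixed iteration count.

```python
import random
from collections import defaultdict

def viterbi_align(F, E, p_T, rng):
    """Best source index for each target word under Model-1; random tie-break."""
    alignment = []
    for f in F:
        scores = [p_T.get(e, {}).get(f, 0.0) for e in E]
        best = max(scores)
        alignment.append(rng.choice([i for i, s in enumerate(scores) if s == best]))
    return alignment

def estimate_p_T(corpus, alignments):
    """Relative-frequency estimate of p_T[e][f] from hard links."""
    counts = defaultdict(lambda: defaultdict(float))
    for (F, E), a in zip(corpus, alignments):
        for f, i in zip(F, a):
            counts[E[i]][f] += 1.0
    return {e: {f: c / sum(row.values()) for f, c in row.items()}
            for e, row in counts.items()}

def viterbi_train_model1(corpus, iterations=5, seed=0):
    rng = random.Random(seed)
    f_vocab = {f for F, _ in corpus for f in F}
    e_vocab = {e for _, E in corpus for e in E}
    p_T = {e: {f: 1.0 / len(f_vocab) for f in f_vocab} for e in e_vocab}  # 1.1
    alignments = [viterbi_align(F, E, p_T, rng) for F, E in corpus]       # 1.2
    for _ in range(iterations):
        p_T = estimate_p_T(corpus, alignments)                            # 1.3
        alignments = [viterbi_align(F, E, p_T, rng) for F, E in corpus]   # 1.4
    return p_T, alignments                                                # 1.5

corpus = [
    (["das", "Haus"], ["the", "house"]),
    (["das", "Buch"], ["the", "book"]),
    (["ein", "Buch"], ["a", "book"]),
]
p_T_trained, alignments = viterbi_train_model1(corpus)
```

Stages 2 and 3 follow the same loop, with the alignment step and the update step extended to the Model-2 table or the HMM transition table.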




## 4.4. Iterative Alignment Model Parameter Estimation by EM

* Replace hard counts with **link posteriors**

>$$\#(f\leftrightarrow e) = \sum^R_{r=1} \sum^{J^{(r)}}_{j=1} \sum^{I^{(r)}}_{i=1}
\mathbb{1}(e=e^{(r)}_i) \mathbb{1}(f=f^{(r)}_j) \underset{\text{(link posterior)}}{P(a^{(r)}_j = i | E^{(r)},F^{(r)})}$$

>$$P(f|e) = \frac{\#(f\leftrightarrow e)}{\sum_{f'} \#(f'\leftrightarrow e)}$$

* **Model-2**

>\begin{align}
\text{sentence-level likelihood:}\;\;\; P(f^J_1|e^I_1) &= \prod^J_{j=1} \sum^I_{i=0} p_{M2}(i|j,J,I) p_T(f_j|e_i) \\
\text{link posterior:}\;\;\; P(a_j=i|f^J_1,e^I_1) &= \frac{p_{M2}(i|j,J,I)p_T(f_j|e_i)}{\sum^I_{i'=0} p_{M2}(i'|j,J,I)p_T(f_j|e_{i'})} \equiv A_j(i) \\
P(a^J_1|f^J_1,e^I_1) &= \prod^J_{j=1} A_j(a_j)
\end{align}
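
The Model-2 posteriors are cheap because each target position is normalised on its own. A minimal sketch with toy values; the callable `p_M2(i, j, J, I)` and the flat prior used in the example are assumptions for illustration, and the sentence length term is left out.

```python
def link_posteriors_m2(f, e, p_T, p_M2):
    """Returns (A, sentence_prob): A[j][i] = P(a_j = i | f, e), e[0] is NULL.

    Assumes every target word has non-zero probability under at least one e_i.
    """
    J, I = len(f), len(e) - 1
    A, sentence_prob = [], 1.0
    for j, fj in enumerate(f, start=1):
        scores = [p_M2(i, j, J, I) * p_T.get((fj, e[i]), 0.0) for i in range(I + 1)]
        total = sum(scores)                 # inner sum over i for position j
        A.append([s / total for s in scores])
        sentence_prob *= total              # product over j of the inner sums
    return A, sentence_prob

e = ["NULL", "the", "house"]
f = ["la", "maison"]
p_T = {("la", "NULL"): 0.2, ("la", "the"): 0.7, ("la", "house"): 0.1,
       ("maison", "NULL"): 0.1, ("maison", "the"): 0.1, ("maison", "house"): 0.8}
flat = lambda i, j, J, I: 1.0 / (I + 1)
A, sentence_prob = link_posteriors_m2(f, e, p_T, flat)
```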

* **Model-1** (special case of Model-2)

>\begin{align}
P(f^J_1,a^J_1|e^I_1) &= \frac{1}{(I+1)^J} \prod^J_{j=1} p_T(f_j|e_{a_j}) \\
P(a_j=i|e^I_1,f^J_1) &= \frac{p_T(f_j|e_i)}{\sum^I_{i'=0} p_T(f_j|e_{i'})}
\end{align}
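
Putting the Model-1 link posterior together with the soft counts gives a complete EM loop. The sketch below assumes a uniform flat start and a fixed number of iterations; the three-sentence corpus is a toy example.

```python
from collections import defaultdict

def model1_em(corpus, iterations=10):
    """corpus: list of (F, E) token lists; NULL is prepended to E here.

    Returns p_T keyed by (f, e).
    """
    corpus = [(F, ["NULL"] + E) for F, E in corpus]
    f_vocab = {f for F, _ in corpus for f in F}
    p_T = defaultdict(lambda: 1.0 / len(f_vocab))      # uniform start
    for _ in range(iterations):
        counts, totals = defaultdict(float), defaultdict(float)
        for F, E in corpus:                            # E-step
            for f in F:
                z = sum(p_T[(f, e)] for e in E)
                for e in E:
                    post = p_T[(f, e)] / z             # link posterior
                    counts[(f, e)] += post
                    totals[e] += post
        # M-step: normalise the soft counts per source word
        p_T = defaultdict(float, {(f, e): c / totals[e] for (f, e), c in counts.items()})
    return p_T

corpus = [
    (["das", "Haus"], ["the", "house"]),
    (["das", "Buch"], ["the", "book"]),
    (["ein", "Buch"], ["a", "book"]),
]
table = model1_em(corpus)
```

Unlike Viterbi training, every co-occurring pair receives a fractional count in each iteration, so no tie-breaking is needed at the flat start.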

* **HMM**

>$$P(f^J_1,a^J_1|e^I_1,J) = \prod^J_{j=1} p_T(f_j|e_{a_j}) p_{HMM}(a_j|a_{j-1},I)$$

>* **Forward Probabilities**

>\begin{align}
\alpha_j(i) &= P(a_j=i,f^j_1|e^I_1) \\
&= \sum_{i'} p_T(f_j|e_i) p_{HMM}(a_j=i|a_{j-1}=i') \alpha_{j-1}(i')
\end{align}

>* **Backward Probabilities**

>\begin{align}
\beta_j(i) &= P(f^J_{j+1}|a_j=i,e^I_1) \\
&= \sum_{i'} \beta_{j+1}(i') p_T(f_{j+1}|e_{i'}) p_{HMM}(a_{j+1}=i'|a_j=i)
\end{align}

>* **Link Posteriors**

>$$P(a_j=i,f^J_1|e^I_1) = P(a_j=i,f^j_1|e^I_1) P(f^J_{j+1}|a_j=i,e^I_1) = \alpha_j(i) \beta_j(i)$$

>\begin{align}
P(a_j=i,a_{j-1}=i'|f^J_1,e^I_1) &= \frac{P(a_j=i,a_{j-1}=i',f^J_1|e^I_1)}{P(f^J_1|e^I_1)} \\
&= \frac{\alpha_{j-1}(i') p_{HMM}(i|i') p_T(f_j|e_i) \beta_j(i) }{ \sum_{i''} \alpha_J(i'')}
\end{align}
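
The recursions above translate directly into nested loops. In this minimal sketch NULL is omitted, states are 0-based positions in `e`, and `p_init` stands in for $p_{HMM}(a_1 = i \mid a_0)$; these choices and all numbers are assumptions for illustration. Real implementations usually rescale or work in log space to avoid underflow on long sentences.

```python
def forward_backward(f, e, p_T, p_HMM, p_init):
    """Returns (gamma, xi, likelihood).

    gamma[j][i]  = P(a_j = i | f, e)
    xi[j][k][i]  = P(a_{j-1} = k, a_j = i | f, e) for j >= 1 (xi[0] is None)
    """
    J, I = len(f), len(e)
    emit = [[p_T.get((fj, ei), 0.0) for ei in e] for fj in f]
    alpha = [[0.0] * I for _ in range(J)]
    beta = [[1.0] * I for _ in range(J)]
    for i in range(I):
        alpha[0][i] = p_init[i] * emit[0][i]
    for j in range(1, J):                                   # forward pass
        for i in range(I):
            alpha[j][i] = emit[j][i] * sum(p_HMM[k][i] * alpha[j - 1][k] for k in range(I))
    for j in range(J - 2, -1, -1):                          # backward pass
        for i in range(I):
            beta[j][i] = sum(beta[j + 1][k] * emit[j + 1][k] * p_HMM[i][k] for k in range(I))
    likelihood = sum(alpha[J - 1])
    gamma = [[alpha[j][i] * beta[j][i] / likelihood for i in range(I)] for j in range(J)]
    xi = [None] + [
        [[alpha[j - 1][k] * p_HMM[k][i] * emit[j][i] * beta[j][i] / likelihood
          for i in range(I)] for k in range(I)]
        for j in range(1, J)
    ]
    return gamma, xi, likelihood

e = ["the", "house"]
f = ["la", "maison"]
p_T = {("la", "the"): 0.7, ("la", "house"): 0.1,
       ("maison", "the"): 0.2, ("maison", "house"): 0.8}
p_init = [0.5, 0.5]
p_HMM = [[0.3, 0.7], [0.6, 0.4]]
gamma, xi, likelihood = forward_backward(f, e, p_T, p_HMM, p_init)
```

The `gamma` values feed the soft counts for $p_T$, and the `xi` values feed the soft counts for $p_{HMM}$.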

* **Model 4**

>* No efficient exact Viterbi algorithm / no efficient exact EM algorithm (both must be approximated)
>* There are computationally tractable HMMs that capture Model-4 features of fertility and word-to-phrase translation

* **Parallel EM - Hadoop/MapReduce**

>* EM algorithm: easily parallelizable
>* **Mappers:** compute counts (**E-steps**)
>* **Reducers:** merge counts & produce probabilities (**M-steps**)
>* Uses many CPUs $\rightarrow$ better than using 1 GPU
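
The mapper/reducer split can be illustrated without any Hadoop API. The sketch below simulates one EM iteration of Model-1 in plain Python: `mapper`, `merge` and `reducer` are hypothetical names, the shards are processed sequentially here rather than on separate machines, and `uniform` is an assumed starting table.

```python
from collections import defaultdict
from functools import reduce

def mapper(shard, p_T):
    """E-step on one shard: expected link counts under Model-1."""
    counts = defaultdict(float)
    for F, E in shard:
        for f in F:
            z = sum(p_T(f, e) for e in E)
            for e in E:
                counts[(f, e)] += p_T(f, e) / z
    return counts

def merge(a, b):
    """Add the counts of b into a."""
    for key, value in b.items():
        a[key] += value
    return a

def reducer(partial_counts):
    """Merge the per-shard counts, then normalise (M-step)."""
    counts = reduce(merge, partial_counts, defaultdict(float))
    totals = defaultdict(float)
    for (f, e), c in counts.items():
        totals[e] += c
    return {(f, e): c / totals[e] for (f, e), c in counts.items()}

corpus = [
    (["das", "Haus"], ["the", "house"]),
    (["das", "Buch"], ["the", "book"]),
    (["ein", "Buch"], ["a", "book"]),
]
uniform = lambda f, e: 0.25
shards = [corpus[:2], corpus[2:]]
table = reducer([mapper(shard, uniform) for shard in shards])
single = reducer([mapper(corpus, uniform)])
```

Because the expected counts are sums over sentence pairs, the sharded result matches the single-machine result, which is what makes the E-step easy to distribute.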


## 4.5. Document and Sentence Alignment

* **Segment Alignment** specified by a path on **Alignment Grid**

><img src = 'images/image4_02.png' width=300>

>* **Score 1:** Prob. distribution defined over alignment paths
>  * Paths near the diagonal with shorter segments are preferred
>* **Score 2:** Translation prob. defined over aligned segment pairs
>  * Employs simple model such as IBM Model-1
>* **Combine two scores** to assign a prob. to a particular segmentation and alignment
>* Assume **monotonic alignment** $\rightarrow$ reduce search space
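
Under the monotonicity assumption the best path through the alignment grid can be found by dynamic programming. The sketch below is illustrative only: the move set and its probabilities in `MOVES` (Score 1) are invented, and `pair_logprob` (Score 2) is a hypothetical stand-in that compares character lengths where a real system would use something like a Model-1 score.

```python
import math

# Assumed path prior over grid moves (source segments consumed, target segments consumed)
MOVES = {(1, 1): math.log(0.8), (1, 0): math.log(0.025), (0, 1): math.log(0.025),
         (2, 1): math.log(0.075), (1, 2): math.log(0.075)}

def align_segments(src, tgt, pair_logprob):
    """Best monotonic path; returns a list of (src_chunk, tgt_chunk) pairs."""
    n, m = len(src), len(tgt)
    best = {(0, 0): (0.0, None)}          # cell -> (score, back-pointer)
    for i in range(n + 1):
        for j in range(m + 1):
            if (i, j) not in best:
                continue
            score = best[(i, j)][0]
            for (di, dj), prior in MOVES.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                cand = score + prior + pair_logprob(src[i:ni], tgt[j:nj])
                if (ni, nj) not in best or cand > best[(ni, nj)][0]:
                    best[(ni, nj)] = (cand, (i, j))
    path, cell = [], (n, m)
    while best[cell][1] is not None:      # follow back-pointers to the origin
        prev = best[cell][1]
        path.append((src[prev[0]:cell[0]], tgt[prev[1]:cell[1]]))
        cell = prev
    return path[::-1]

def length_score(s, t):
    """Toy Score 2: penalise differences in total character length."""
    return -abs(sum(map(len, s)) - sum(map(len, t)))

src = ["aaaa", "bb", "cc", "dddddd"]
tgt = ["AAAA", "BBCC", "DDDDDD"]
path = align_segments(src, tgt, length_score)
```

In this toy case the path merges the two short source segments against one target segment, since any other route pays either a length penalty or a less likely move.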

* **Binary Divisive Clustering**

>* From coarse to fine / allow reordering
>* At each iteration, choose the single most likely splitting point