# 2. Statistical Machine Translation

## 2.1. What is SMT?

* **Translation:** search for the best hypothesis under a statistical model

>$$\hat{t} = \underset{t \in \mathcal{T}}{\text{argmax}} \; P(t|s) = \underset{t \in \mathcal{T}}{\text{argmax}} \; P(s|t) P(t)$$

>* $s$: input source sentence / $t$: target word sequence
>* $P(t|s)$: **Direct Translation (NMT)**
>* $P(s|t)$: **Translation Model** / estimated from parallel texts
>* $P(t)$: **Language Model** / estimated from monolingual texts
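The decision rule above can be illustrated with a minimal sketch. It assumes a small, fixed candidate list and toy probability tables (`TM`, `LM`, and `noisy_channel_decode` are hypothetical names invented for this example); a real decoder searches a far larger hypothesis space.

```python
import math

def noisy_channel_decode(source, candidates, tm_logprob, lm_logprob):
    """Return the candidate t that maximizes log P(s|t) + log P(t)."""
    return max(candidates, key=lambda t: tm_logprob(source, t) + lm_logprob(t))

# Toy, made-up probabilities purely for illustration.
# TM plays the role of P(s|t); LM plays the role of P(t).
TM = {
    ("la maison", "the house"): math.log(0.6),
    ("la maison", "house the"): math.log(0.6),
    ("la maison", "the home"): math.log(0.3),
}
LM = {
    "the house": math.log(0.05),
    "house the": math.log(0.0001),
    "the home": math.log(0.02),
}

best = noisy_channel_decode(
    "la maison",
    list(LM),
    lambda s, t: TM[(s, t)],
    lambda t: LM[t],
)
print(best)
```

In this toy setup the translation model cannot tell "the house" from "house the"; the language model is what penalizes the disfluent word order.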

* **SMT and ASR**

><img src = 'images/image2_01.png' width=400>

>* Both can be formulated using a **Source-Channel model**
>* In both, the output is found as a **MAP estimate**

## 2.2. Alignment

* **Parallel Texts**

>* Aligned elements in parallel texts are translations of each other

* **Alignment**

>* Multi-level, hierarchical process (words $\in$ sentences $\in$ paragraphs $\in$ documents $\in$ collections)
>* Can be obtained manually / semiautomatically / automatically

>$$\text{Alignment variable: } a_j = i \;\;\;\Leftrightarrow\;\;\; e_i \leftrightarrow f_j$$

>$$\text{Alignment model: } P(A|e,f) \;\;\;\text{where}\;\;\; A \subseteq \{ (i,j) : 1 \leq i \leq I, 1 \leq j \leq J\}$$

>* **Sentence alignment:** markers indicate start & end of translation segments (can be sub-sentence units)
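As a small sketch of the word-level notation, the alignment variable $a_j = i$ can be stored as a list indexed by the position $j$ and converted into a set of links $(i, j)$. The helper name `links_from_alignment` and the use of `0` for an unaligned word are assumptions made for this example, not something the notes define.

```python
def links_from_alignment(a):
    """Convert a_j = i (1-indexed; 0 is assumed to mean 'unaligned') to links (i, j)."""
    return {(i, j) for j, i in enumerate(a, start=1) if i != 0}

e = ["the", "blue", "house"]    # e_1 .. e_I
f = ["la", "maison", "bleue"]   # f_1 .. f_J
a = [1, 3, 2]                   # a_1 = 1, a_2 = 3, a_3 = 2

links = links_from_alignment(a)

# List the aligned word pairs in the order of the f positions.
pairs = [(e[i - 1], f[j - 1]) for i, j in sorted(links, key=lambda link: link[1])]
print(pairs)
```

Each $f_j$ has at most one link in this representation, while one $e_i$ may be linked to several $f_j$.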

## 2.3. Quality of MT

* **Human Metrics** (e.g. binary, scaled, preference scores) $\rightarrow$ slow and costly
* **Humans in the Loop**

><img src = 'images/image2_02.png' width=300>

>* Adapt systems to human feedback (e.g. preferences, suggestions)
>* Typically, only 0.5% of the feedback is useful, but that fraction is extremely useful
>* Agreement between experts $\approx$ Weighted agreement between 5 non-experts

* **BLEU:** MT metric based on n-gram precision

>* **NOTE:** BLEU is not an absolute measure. It only indicates **relative quality**

>$$BLEU(T,R) = \gamma(T,R) \exp \left( \sum^N_{n=1} \frac{1}{N} \log p_n (T,R) \right)$$

>* **Geometric mean** of the **n-gram precisions** (typically $N=4$)

>$$p_n (T,R) = \frac{\sum_i \bar{c}^i_n}{\sum_i c^i_n}$$

>* $c^i_n$: no. of hypothesized n-grams
>* $\bar{c}^i_n$: no. of correct n-grams; the contribution of each distinct n-gram is clipped to its maximum no. of occurrences in any one reference

>* **Brevity penalty** 

>$$r = \sum_i \min \big\{ \big|R^i_{(1)}\big|,\big|R^i_{(2)}\big|,\big|R^i_{(3)}\big|,\big|R^i_{(4)}\big| \big\} 
\;\;\;,\;\;\; c = \sum_i |T^i|$$

>$$\gamma(T,R) = \Bigg\{ \begin{matrix}1 && c>r \\ \exp(1-\frac{r}{c}) && c\leq r \end{matrix}$$
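The pieces above (clipped n-gram counts, geometric mean of the precisions, brevity penalty using the shortest reference) can be put together in a minimal sketch. It assumes pre-tokenized input and returns 0 when any n-gram order has no matches, a convention chosen here to avoid $\log 0$; the function names are hypothetical and this is not a replacement for a standard BLEU tool.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[k:k + n]) for k in range(len(tokens) - n + 1))

def corpus_bleu(hyps, refs, max_n=4):
    """hyps: list of token lists; refs: one list of reference token lists per hypothesis."""
    correct = [0] * max_n   # sum_i of clipped matches, per n
    total = [0] * max_n     # sum_i of hypothesized n-grams, per n
    c = r = 0
    for hyp, ref_set in zip(hyps, refs):
        c += len(hyp)
        r += min(len(ref) for ref in ref_set)   # shortest reference length
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            max_ref = Counter()
            for ref in ref_set:
                max_ref |= ngrams(ref, n)       # elementwise max over references
            correct[n - 1] += sum(min(cnt, max_ref[g]) for g, cnt in hyp_counts.items())
            total[n - 1] += sum(hyp_counts.values())
    if c == 0 or min(correct) == 0:
        return 0.0                              # assumed convention, avoids log(0)
    log_prec = sum(math.log(correct[n] / total[n]) for n in range(max_n)) / max_n
    penalty = 1.0 if c > r else math.exp(1 - r / c)
    return penalty * math.exp(log_prec)

hyp = "the the the the the the the".split()
ref_set = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
print(corpus_bleu([hyp], [ref_set], max_n=1))
```

With `max_n=1`, the seven occurrences of "the" are clipped to the two found in the first reference, so the unigram precision is 2/7 rather than 7/7.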

* Other **Automatic Metrics - Lexical Similarity** 

>* **WER** (Word Error Rate): insertions, deletions and substitutions, normalized by the no. of reference words $N$

>$$WER(T,R) = \frac{Ins + Del + Sub}{N}$$

>* **TER** (Translation Edit Rate): as WER, but also counts shifts ($Shft$) of word sequences as edits

>$$TER(T,R) = \frac{Ins + Del + Sub + Shft}{N}$$

>* **METEOR** (weighted harmonic mean of unigram precision $P$ and recall $R$): penalty $p \propto$ no. of alignment chunks between hyp and refs / accepts synonyms and considers stemming

>$$\frac{10PR}{R+9P} \times (1-p)$$

>* **NIST** (variant of BLEU): assigns a different value to each matching n-gram according to information gain statistics / less sensitive to the brevity penalty (Gaussian length distribution near the average ref. length)

>* **GTM** (General Text Matcher), **WNM** (Weighted N-gram Model), **CER** (Classification Error Rate), **ROUGE** & **ORANGE** (summary evaluation), ...
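Two of the lexical metrics above are simple enough to sketch directly. The code below assumes pre-tokenized input, a single reference, and $N$ equal to the number of reference words for WER; the METEOR function only evaluates the scoring formula from given $P$, $R$ and penalty $p$ and does not perform METEOR's alignment, stemming or synonym matching. All function names are hypothetical.

```python
def edit_distance(hyp, ref):
    """Word-level Levenshtein distance: insertions, deletions, substitutions."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        cur = [i]
        for j, r in enumerate(ref, start=1):
            cur.append(min(
                prev[j] + 1,              # drop the hypothesis word h
                cur[j - 1] + 1,           # add the reference word r
                prev[j - 1] + (h != r),   # substitute (free if the words match)
            ))
        prev = cur
    return prev[-1]

def wer(hyp, ref):
    """(Ins + Del + Sub) / N, with N taken as the number of reference words."""
    return edit_distance(hyp, ref) / len(ref)

def meteor_formula(P, R, p):
    """10PR / (R + 9P) * (1 - p), given precision, recall and chunk penalty."""
    if P == 0 or R == 0:
        return 0.0
    return 10 * P * R / (R + 9 * P) * (1 - p)

hyp = "the cat sat".split()
ref = "the cat sat on the mat".split()
print(wer(hyp, ref))
print(meteor_formula(1.0, 0.5, 0.0))
```

TER would additionally need a search over block shifts, which is left out of this sketch.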

* Other **Automatic Metrics - Linguistically-informed** 

>* **Shallow syntactic similarity** - part-of-speech, lemmas, phrase chunks
>* **Syntactic similarity** - dependency trees, head-word chains, constituency parsing
>* **Semantic similarity** - named entities, semantic roles, discourse representation


* **Variability in References $\rightarrow$ HyTER** 

>* Meaning-equivalent semantics for translation evaluation
>* Creates **FSAs** of translations $\rightarrow$ measure quality by **string edit distance**
>* Median no. of translations: more than 100M per sentence