In [4]:
%%capture
%load_ext autoreload
%autoreload 2
import sys
sys.path.append("..")
import statnlpbook.util as util
util.execute_notebook('word_mt.ipynb')

<!---
Latex Macros
-->
$$
\newcommand{\Xs}{\mathcal{X}}
\newcommand{\Ys}{\mathcal{Y}}
\newcommand{\y}{\mathbf{y}}
\newcommand{\balpha}{\boldsymbol{\alpha}}
\newcommand{\bbeta}{\boldsymbol{\beta}}
\newcommand{\aligns}{\mathbf{a}}
\newcommand{\align}{a}
\newcommand{\source}{\mathbf{s}}
\newcommand{\target}{\mathbf{t}}
\newcommand{\ssource}{s}
\newcommand{\starget}{t}
\newcommand{\repr}{\mathbf{f}}
\newcommand{\repry}{\mathbf{g}}
\newcommand{\x}{\mathbf{x}}
\newcommand{\prob}{p}
\newcommand{\vocab}{V}
\newcommand{\params}{\boldsymbol{\theta}}
\newcommand{\param}{\theta}
\DeclareMathOperator{\perplexity}{PP}
\DeclareMathOperator{\argmax}{argmax}
\DeclareMathOperator{\argmin}{argmin}
\newcommand{\train}{\mathcal{D}}
\newcommand{\counts}[2]{\#_{#1}(#2) }
\newcommand{\length}[1]{\text{length}(#1) }
\newcommand{\indi}{\mathbb{I}}
$$

# (Word-based) Machine Translation

## Machine Translation (MT)

* Machine Translation (MT): one of the canonical NLP applications
* Paradigms: 
  * Rule-based vs **statistical**
  * **word** vs phrase vs syntax units
  * **feature-engineering** vs neural

## Word-based MT

* foundational to all current approaches (e.g. neural methods)
* subcomponent in more complex systems (for alignments)

## MT as Structured Prediction

* **source** sentence \\(\source\\), usually tokenized
* **target** sentence \\(\target\\), usually tokenized
* a **model** \\(s_\params(\target,\source)\\) that measures how well \\(\target\\) matches \\(\source\\)
* learn the parameters \\(\params\\) from data 
* predict highest-scoring translation:

\begin{equation}
\argmax_\target s_\params(\target,\source)
\end{equation}

* MT models differ primarily in how \\(s\\) is defined, how \\(\params\\) are learned, and how the \\(\argmax\\) is found.
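
The prediction step above can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the candidate translations are given as an explicit list (real systems search a far larger space), and `toy_score` is a hypothetical stand-in for \\(s_\params\\) that merely prefers targets whose length matches the source.

```python
from typing import Callable, Iterable, List

Sentence = List[str]


def predict(source: Sentence,
            candidates: Iterable[Sentence],
            score: Callable[[Sentence, Sentence], float]) -> Sentence:
    """Return the candidate target with the highest score(target, source)."""
    return max(candidates, key=lambda target: score(target, source))


def toy_score(target: Sentence, source: Sentence) -> float:
    # Hypothetical scorer, for illustration only: penalise length mismatch.
    return -abs(len(target) - len(source))


source = ["la", "maison"]
candidates = [["house"], ["the", "house"], ["the", "big", "house"]]
best = predict(source, candidates, toy_score)
```

Swapping in a different `score` function or a different way of generating `candidates` changes the model and the search, while `predict` stays the same.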

## Noisy Channel

In [1]:
%%tikz
\input{../fig/noisy_channel.tex}

## A Naive Baseline Translation Model
The most straightforward translation model translates words one-by-one, in the order of appearance:
$$
\prob_\params^\text{Naive}(\source|\target) = \prod_{i=1}^{\length{\source}} \param_{\ssource_i,\starget_i}
$$
where \\(\param_{\ssource,\starget} \\) is the probability of translating \\(\starget\\) as \\(\ssource\\). \\(\params\\) is often referred to as the *translation table*.
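
As a minimal sketch of this product, the function below scores a source sentence given a target sentence. The dictionary `theta`, its toy values, and the function name `naive_prob` are assumptions made for illustration. Returning zero for unseen word pairs and for sentences of different lengths is also an assumption, since the formula pairs words position by position and says nothing about these cases.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical translation table: theta[(s, t)] is the probability of
# translating target word t as source word s.
TranslationTable = Dict[Tuple[str, str], float]


def naive_prob(source: List[str], target: List[str],
               theta: TranslationTable) -> float:
    """Product of word-by-word translation probabilities, in order."""
    if len(source) != len(target):
        # Assumption: the position-wise model is undefined here, so use 0.
        return 0.0
    # Unseen (s, t) pairs get probability 0 (an assumption of this sketch).
    return math.prod(theta.get((s, t), 0.0) for s, t in zip(source, target))


theta = {("la", "the"): 0.5, ("maison", "house"): 0.8}
p = naive_prob(["la", "maison"], ["the", "house"], theta)
```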

For many language pairs one can acquire training sets $\train=\left( \left(\source_i,\target_i\right) \right)_{i=1}^n $ of paired source and target sentences. For example, for French and English the [Aligned Hansards](http://www.isi.edu/natural-language/download/hansard/) of the Parliament of Canada can be used. Given such a training set \\(\train\\), we can learn the parameters \\(\params\\) using the [Maximum Likelihood estimator](/template/statnlpbook/02_methods/0x_mle). In the case of our naive model this amounts to setting
$$
\param_{\ssource,\starget} = \frac{\counts{\train}{s,t}}{\counts{\train}{t}} 
$$
Here \\(\counts{\train}{s,t}\\) is the number of times we see target word \\(t\\) translated as source word \\(s\\), and \\(\counts{\train}{t}\\) is the number of times we see the target word \\(t\\) in total.
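
The count ratio can be sketched with two counters. The function name `estimate_translation_table` and the tiny training set are hypothetical, and the sketch assumes that the word at position \\(i\\) in the target is translated as the word at position \\(i\\) in the source, so any words beyond the shorter sentence are ignored.

```python
from collections import Counter
from typing import Dict, List, Tuple

SentencePair = Tuple[List[str], List[str]]  # (source, target)


def estimate_translation_table(
        train: List[SentencePair]) -> Dict[Tuple[str, str], float]:
    """Relative-frequency estimate: theta[(s, t)] = #(s, t) / #(t)."""
    pair_counts: Counter = Counter()
    target_counts: Counter = Counter()
    for source, target in train:
        # Assumption: words are paired position by position.
        for s, t in zip(source, target):
            pair_counts[(s, t)] += 1
            target_counts[t] += 1
    return {(s, t): count / target_counts[t]
            for (s, t), count in pair_counts.items()}


# Toy training set, invented for illustration only.
train = [
    (["la", "maison"], ["the", "house"]),
    (["le", "chat"], ["the", "cat"]),
]
table = estimate_translation_table(train)
```

In this toy set "the" appears twice, once paired with "la" and once with "le", so each of those entries receives half of the probability mass for "the".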