# 11. Machine Translation — Lab solutions

## Preparations

### Introduction

In this lab, we will be using the [Python Natural Language Toolkit](http://www.nltk.org/) (`nltk`) again to get to know the IBM models better. There are proper, open-source MT systems out there (such as [Apertium](https://www.apertium.org/index.eng.html?dir=eng-cat#translation) and [MOSES](http://www.statmt.org/moses/)); however, getting to know them would require more than 90 minutes.

### Infrastructure

For today's exercises, you will need the docker image again. Provided you have already downloaded it last time, you can start it by:

- `docker ps -a`: lists all the containers you have created. Pick the one you used last time (with any luck, there is only one)
- `docker start <container id>`
- `docker exec -it <container id> bash`

When that's done, update your git repository:

```bash
cd /nlp/python_nlp_2017_fall/
git pull
```

If `git pull` returns with errors, it is most likely because some of your files have local changes (most likely the morphology or syntax notebooks, which you worked on in the previous labs). You can check this with `git status`. If the culprit is the file `A.ipynb`, you can resolve the problem like so:

```
cp A.ipynb A_mine.ipynb
git checkout A.ipynb
```

After that, `git pull` should work.

And start the notebook:
```
jupyter notebook --port=8888 --ip=0.0.0.0 --no-browser --allow-root
```

If you started the notebook, but cannot access it in your browser, make sure `jupyter` is not running on the host system as well. If so, stop it.

### Boilerplate

The following code imports the packages and defines the functions we are going to use.

In [1]:
import os
import random
import shutil
import urllib.request

import nltk
import numpy

def download_file(url, directory=''):
    real_dir = os.path.realpath(directory)
    if not os.path.isdir(real_dir):
        os.makedirs(real_dir)

    file_name = url.rsplit('/', 1)[-1]
    real_file = os.path.join(real_dir, file_name)
    
    if not os.path.isfile(real_file):
        with urllib.request.urlopen(url) as inf:
            with open(real_file, 'wb') as outf:
                shutil.copyfileobj(inf, outf)

## Exercises

### 1. Corpus acquisition

We download and preprocess a subset of the [Hunglish corpus](http://mokk.bme.hu/resources/hunglishcorpus/). It consists of English-Hungarian translation pairs extracted from open-source software documentation. The sentences are already aligned, but the corpus lacks word alignment.

#### 1.1 Download

Download the corpus. The url is `ftp://ftp.mokk.bme.hu/Hunglish2/softwaredocs/bi/opensource_X.bi`, where `X` is a number that ranges from 1 to 9. Use the `download_file` function defined above.

In [None]:
for i in range(1, 10):
    download_file('ftp://ftp.mokk.bme.hu/Hunglish2/softwaredocs/bi/opensource_{}.bi'.format(i))

#### 1.2 Conversion

Read the whole corpus (all files), but do not read it all into memory. Write a function that

- reads all files you have just downloaded
- is a generator that yields tuples (Hungarian snippet, English snippet)

Note:
- the files are encoded with the `iso-8859-2` (a.k.a. `Latin-2`) encoding
- the Hungarian and English snippets are separated by a tab
- don't forget to strip whitespace from the returned snippets
- throw away pairs with empty snippets
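
To see why the encoding matters, here is a minimal sketch that parses two made-up lines in the corpus format from an in-memory byte string. The sample text and the helper name `parse_lines` are illustrative assumptions, not part of the corpus or the lab code.

```python
# Two made-up lines in the corpus format: Hungarian, a tab, then English.
# The second line has an empty Hungarian side and should be dropped.
raw = 'Főoldal \t Home page\n\tOrphaned English text\n'.encode('iso-8859-2')

def parse_lines(data, encoding='iso-8859-2'):
    """Yield (Hungarian, English) pairs, skipping pairs with an empty side."""
    for line in data.decode(encoding).splitlines():
        fields = [field.strip() for field in line.split('\t')]
        if len(fields) == 2 and all(fields):
            yield tuple(fields)

pairs = list(parse_lines(raw))
print(pairs)

# Decoding with the wrong codec silently corrupts Hungarian letters such as 'ő'.
print(raw.decode('latin-1').splitlines()[0])
```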

In [2]:
def read_files(directory=''):
    for i in range(1, 10):
        with open(os.path.join(directory, 'opensource_{}.bi'.format(i)), encoding='iso-8859-2') as inf:
            for line in inf:
                snippets = tuple(map(str.strip, line.split('\t')))
                if len(snippets[0]) and len(snippets[1]):
                   yield snippets

#### 1.3 Tokenization

The text is not tokenized. Use `nltk`'s `word_tokenize()` function to tokenize the snippets. Also, lowercase them. You can do this in `read_files()` above if you wish, or in the code you write for [1.4](#1.4-Create-the-training-corpus) below.

Note:
- The model for the sentence tokenizer (`punkt`) is not installed by default. You have to `download()` it.
- NLTK doesn't have Hungarian tokenizer models, so there might be errors in the Hungarian result
- instead of just lowercasing everything, we might have chosen a more sophisticated solution, e.g. first calling `sent_tokenize()` and then lowercasing only the word at the beginning of each sentence, or even better, tagging the snippets for NER. However, we have neither the time nor the resources (models) to do that now.

In [3]:
from nltk.tokenize import sent_tokenize, word_tokenize

def preprocess(snippets):
    for hu_snippet, en_snippet in snippets:
        yield (list(map(str.lower, word_tokenize(hu_snippet))),
               list(map(str.lower, word_tokenize(en_snippet))))

#### 1.4 Create the training corpus

The models we are going to use expect a list of [`nltk.translate.api.AlignedSent`](http://www.nltk.org/api/nltk.translate.html) objects. Create a `bitext` variable that is a list of `AlignedSent` objects created from the preprocessed, tokenized corpus.

Note that `AlignedSent` also allows you to specify an *alignment* between the words in the two texts. Unfortunately (but not unexpectedly), the corpus doesn't have this information.

In [4]:
from nltk.translate.api import AlignedSent
bitext = [AlignedSent(*t) for t in preprocess(read_files())]

assert len(bitext) == 135439

Have a look at a few sentences from the corpus. Play around with the indices until you find another 3 interesting sentences.

In [None]:
display(bitext[100])
display(bitext[100000])

### 2. IBM Models

NLTK implements IBM models 1-5. Unfortunately, the implementations don't provide end-to-end machine translation systems, only their alignment models. Also, as you will see, the API leaves a lot to be desired $-$ it will be your task to smooth the rough edges. However, if you complete exercise 2, you will have a working IBM Model 1 system!

#### 2.1 IBM Model 1

Train an [IBM Model 1](http://www.nltk.org/api/nltk.translate.html#module-nltk.translate.ibm1) alignment. We do it in a separate code block, so that we don't rerun it by accident – training even a simple model takes some time.

In [None]:
from nltk.translate import IBMModel1
ibm1 = IBMModel1(bitext, 5)

The IBM models learn the **inverse translation probability**, so
- the **source** sentence is the second one in `AlignedSent` (here: English)
- the **target** sentence is the first one in `AlignedSent` (here: Hungarian).

In effect, our `ibm1` object learned an English-to-Hungarian translation model.

The model has two outputs:
1. The main output of the model is the _translation table_ (`ibm1.translation_table`), which stores the (word-to-word) translation probabilities
1. It also assigns an alignment to each `AlignedSent` object in the input corpus.

To see the [`Alignment`](http://www.nltk.org/api/nltk.translate.html#nltk.translate.api.Alignment)s, display the sentences you have chosen above again in the cell below.

In [None]:
print('Alignment in text:', bitext[100].alignment)
display(bitext[100])
display(bitext[100000])

The other output is the word-to-word translation table. Unfortunately, it is not structured in a very well thought-out way. Even though it represents $P(t|s)$ word translation probabilities, the table is structured as

```Python
ibm1.translation_table[target_word][source_word] == probability
```

In other words, it is easy to get the list of source words that translate to a target word, but it is very inefficient to enumerate all target words a source word can translate to, as one has to enumerate all entries of the form `ibm1.translation_table[*][source_word]`.
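
The asymmetry can be seen on a small made-up table with the same `table[target_word][source_word]` layout. The words, the probabilities, and the helper `targets_of` are illustrative assumptions only.

```python
# A made-up table in the same layout: table[target_word][source_word] == probability
toy_table = {
    'ház': {'house': 0.8, 'home': 0.2},
    'otthon': {'home': 0.7},
    'könyv': {'book': 0.9},
}

# Cheap: all source words for one target word is a single dictionary lookup.
print(toy_table['ház'])

# Expensive: all target words for one source word requires scanning every row.
def targets_of(table, source_word):
    return {t: row[source_word] for t, row in table.items() if source_word in row}

print(targets_of(toy_table, 'home'))
```

With a real vocabulary, the scan in `targets_of` touches every target word for every query, which is why a table indexed by source word first is worth building once.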

#### 2.2 Inverse translation table

In this exercise, we are going to fix this issue by creating another table that can be indexed by source first. Write a function that adds a new field `source_target_table` to `ibm1`, where $\forall s,t:$ `ibm1.source_target_table[s][t] == ibm1.translation_table[t][s]`.

In [None]:
def create_inverse_table(model):
    d = dict()
    for t, sd in model.translation_table.items():
        for s, p in sd.items():
            d.setdefault(s, {})[t] = p
    model.source_target_table = d
    
# Let's test it
create_inverse_table(ibm1)

for t in random.sample(list(ibm1.trg_vocab), 1000):  # sample() needs a sequence, not a set
    # Cannot sample ibm1.src_vocab, because the table is sparse
    s, p = random.choice(list(ibm1.translation_table[t].items()))
    assert ibm1.translation_table[t][s] == ibm1.source_target_table[s][t]

#### 2.3 Most probable word translations

With our new table, we can query word translation probabilities efficiently. Write a function that has three arguments:
- `word` is a word
- `source` is a boolean; the default is `True`
- `k` is the number of translations to return; the default is 5.

The function returns the $k$ most probable target words `word` translates to if `source` is `True`; otherwise, it returns the $k$ source words most likely to translate to the target `word`. In other words, `word` can be either a source or a target word, depending on the second parameter.

The function should return a list of `(word, probability)` tuples. If the word is not found in the model, return an empty list.

In [None]:
def most_probable_words(word, source=True, k=5):
    if source:
        translations = ibm1.source_target_table.get(word, {})
    else:
        translations = ibm1.translation_table.get(word, {})
    translations = {w: p for w, p in translations.items() if w and p}
    return sorted(translations.items(), key=lambda kv: (-kv[1], kv[0]))[:k]

assert most_probable_words('home') == most_probable_words('home', True) == most_probable_words('home', True, 5)
assert [w for w, _ in most_probable_words('home', k=4)] == ['saját', 'otthoni', 'kezdőlap', 'honlap']
assert len(most_probable_words('otthoni')) == 0
assert len(most_probable_words('signature', k=1000)) == 481

#### More about the IBM Model 1...

While the model doesn't have a `translate()` function, it does provide a way to compute the **translation probability** $P(T|S)$ with some additional codework. That additional work is what you have to put in.

As a reminder, here are the formulas for computing the translation probability. Following `IBMModel1`, instead of the $F$ (_foreign_) and $E$ (_English_) in the lecture, we are going to use $T$ for the _target_ and $S$ for the _source_ languages.

The formula for the translation probability is $P(T|S) = \sum_A P(T,A|S)$,<br>
where $P(T,A|S) = P(T|S,A) \times P(A|S)$.

The right side is computed as:
- alignment probability: $P(A|S) = \frac{\epsilon}{(J + 1)^K}$, where $J$ and $K$ are the lengths of the source and target sentences, respectively;
- probability of the target sentence given the source sentence and $A$: $P(T|S,A) = \prod_{k=1}^KP(t_k|s_{a_k})$, where $s_{a_k}$ is the source word that the alignment $A$ maps to the target word $t_k$.
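
As a worked example of these two factors, the sketch below computes them by hand for a two-word sentence pair. The probability table is made up, $\epsilon$ is taken to be 1, and none of the names come from `nltk`.

```python
# Made-up word translation probabilities, indexed as t[target_word][source_word].
t = {
    'das': {None: 0.1, 'the': 0.5, 'house': 0.1},
    'haus': {None: 0.1, 'the': 0.1, 'house': 0.8},
}

source = [None, 'the', 'house']   # position 0 is the NULL word
target = ['das', 'haus']
alignment = [1, 2]                # a_k: 1-based source position for each target word

J, K = len(source) - 1, len(target)
epsilon = 1.0

p_alignment = epsilon / (J + 1) ** K       # P(A|S)
p_words = 1.0                              # P(T|S,A)
for t_k, a_k in zip(target, alignment):
    p_words *= t[t_k][source[a_k]]

p_joint = p_words * p_alignment            # P(T,A|S)
print(p_alignment, p_words, p_joint)
```

Here $J = K = 2$, so $P(A|S) = 1/9$, and the aligned word pairs contribute $0.5 \times 0.8$.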

The functions available in `IBMModel1` of interest to us are:
- `prob_alignment_point(s, t)`: $P(t|s)$.
- `prob_t_a_given_s(alignment_info)`: it claims to be $P(T,A|S)$, but actually it computes $P(T|S,A)$.

Of the two functions, we will need the second one, which will invoke the former to compute $P(T|S,A)$.

#### 2.4 Alignment conversion

There is a slight problem with `prob_t_a_given_s()`: instead of an [`AlignedSent`](http://www.nltk.org/api/nltk.translate.html) (which we already have at this point), it expects an [`AlignmentInfo`](http://www.nltk.org/api/nltk.translate.html#nltk.translate.ibm_model.AlignmentInfo) object that contains about the same information: the source and target sentences as well as the alignment between them.

Unfortunately, `AlignmentInfo`'s representation of an alignment is completely different from the [`Alignment`](http://www.nltk.org/api/nltk.translate.html#nltk.translate.api.Alignment) object's. In a nutshell, given the 100th sentence displayed below,


In [None]:
print('Alignment:', bitext[100].alignment)
display(bitext[100])

  - Alignments:
      - `Alignment` in `AlignedSent` is a list of 0-based index pairs, `[(0, 4), (1, 0), (2, 1), (3, 0)]`
      - Don't forget that the IBM model translates from the second to the first sentence, so this in effect is a **target-to-source** alignment
      - The alignment in the `AlignmentInfo` objects is a `tuple` (!), where the `i`th position is the index of the source word that is aligned to the `i`th target word $-$ or `0`, if the `i`th target word is unaligned. The indices are **1-based**, because the `0`th word is `NULL` on both sides (see lecture page 35, slide 82). The tuple must also make space for the (unused) alignment for this `NULL` word. So the alignment for our sentence is `(0, 5, 1, 2, 1)` (e.g. the 3rd word, _otthoni_ in the target sentence is aligned with the 2nd source word, _home_, so `alignment[3] == 2`).
  - Sentences:
      - The sentences in `AlignedSent` are just lists of words
      - The sentences in `AlignmentInfo` are tuples of words, and they both contain an extra `None` as the 0th item to account for the `NULL` element, above.
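
The index shift is easy to get wrong, so here is a minimal pure-Python sketch of it, using the pairs quoted above. The helper name `pairs_to_tuple` is hypothetical, and treating `None` as "unaligned" is an assumption about how the pairs mark such words.

```python
def pairs_to_tuple(pairs, target_length):
    """Convert 0-based (target, source) pairs to a 1-based tuple with a NULL slot."""
    alignment = [0] * (target_length + 1)      # slot 0 belongs to the NULL word
    for target_index, source_index in pairs:
        if source_index is not None:           # assumed marker for an unaligned word
            alignment[target_index + 1] = source_index + 1
    return tuple(alignment)

converted = pairs_to_tuple([(0, 4), (1, 0), (2, 1), (3, 0)], 4)
print(converted)   # (0, 5, 1, 2, 1)

# The sentences get the same treatment: a None is prepended for the NULL word.
words = ['nem', 'található', 'aláírás', '.']
print((None,) + tuple(words))
```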
  
Your first task is to do the conversion from `AlignedSent` to `AlignmentInfo`. Take extra care not to mix up the source and target languages! In the constructor, leave _cepts_ empty.

**Note**: you can find test cases for tasks 2.4 $-$ 2.7 [at the end of the exercise](#Test-cases-for-Exercises-2.5-%E2%80%93-2.7.).

In [None]:
from nltk.translate.ibm_model import AlignmentInfo

def alignment_to_info(aligned_sent):
    source_sent = tuple([None] + aligned_sent.mots)
    target_sent = tuple([None] + aligned_sent.words)
    alignment = [0 for _ in range(len(target_sent))]
    for t, s in dict(aligned_sent.alignment).items():
        # A target word aligned to NULL has None as its source index; keep 0 for it
        if s is not None:
            alignment[t + 1] = s + 1
    return AlignmentInfo(tuple(alignment), source_sent, target_sent, None)

ai_100 = alignment_to_info(bitext[100])
ai_100k = alignment_to_info(bitext[100000])

assert ai_100.alignment == (0, 5, 1, 2, 1)
assert ai_100.src_sentence == (None, 'groupwise', 'home', 'screen', 'name', '1')

assert ai_100k.alignment == (0, 1, 3, 2, 4)
assert ai_100k.trg_sentence == (None, 'nem', 'található', 'aláírás', '.')

Finally, with the `AlignmentInfo` objects, we can compute $P(T|S,A)$ for the most probable alignments (those stored in the `AlignedSent` objects):

In [None]:
ibm1.prob_t_a_given_s(ai_100)

#### 2.5. Compute $P(T,A|S)$

As a reminder, $P(T,A|S) = P(T|A,S) \times P(A|S)$.

Since `prob_t_a_given_s()` only computes $P(T|A,S)$, you have to add the $P(A|S)$ component. See page 38, slide 95 and page 39, slide 100 in the lecture (don't forget that the notation is different: $S \equiv E$ and $T \equiv F$). Treat $\epsilon$ as 1.

The function should take the `model` and an `AlignmentInfo` as parameters. Don't forget that the sentences in `AlignmentInfo` contain an extra "word" when computing their lengths.

In [None]:
def real_prob_t_a_given_s(model, alignment_info):
    return model.prob_t_a_given_s(alignment_info) / pow(
        len(alignment_info.src_sentence), len(alignment_info.trg_sentence) - 1)

assert real_prob_t_a_given_s(ibm1, ai_100) == ibm1.prob_t_a_given_s(ai_100) / 1296

#### 2.6. Compute $P(T, A_{best}|S)$

Using the function you just implemented, write another one that, given an `AlignedSent` object, computes $P(T,A|S)$. Since `IBMModel1` aligns the sentences of the training set with the most probable alignment, this function will effectively compute $P(T,A_{best}|S)$.

In [None]:
def prob_t_best_a_given_s(model, aligned_sent):
    return real_prob_t_a_given_s(model, alignment_to_info(aligned_sent))

assert numpy.allclose(prob_t_best_a_given_s(ibm1, bitext[100]), real_prob_t_a_given_s(ibm1, ai_100))

#### 2.7. Compute $P(T|S)$

Write a function that, given an `AlignedSent` object, computes $P(T|S)$. It should enumerate all possible alignments (in the tuple format) and call the function you wrote in Exercise 2.5 with them.

Note: the [`itertools.product`](https://docs.python.org/3.5/library/itertools.html#itertools.product) function can be very useful in enumerating the alignments.
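
The sketch below shows what the enumeration looks like on a made-up two-word example: `product` generates all $(J + 1)^K$ alignments. It also checks the brute-force sum against the standard IBM Model 1 factorisation, $\prod_k \frac{1}{J+1}\sum_j P(t_k|s_j)$, which is not part of this lab but is a handy sanity check. The table and variable names are illustrative assumptions.

```python
from itertools import product

# Made-up word translation probabilities, indexed as t[target_word][source_word].
t = {
    'das': {None: 0.1, 'the': 0.5, 'house': 0.1},
    'haus': {None: 0.1, 'the': 0.1, 'house': 0.8},
}
source = [None, 'the', 'house']   # position 0 is the NULL word
target = ['das', 'haus']
J, K = len(source) - 1, len(target)

# Brute force: one term per alignment, (J + 1) ** K terms in total.
alignments = list(product(range(J + 1), repeat=K))
brute = 0.0
for a in alignments:
    p = 1.0
    for t_k, a_k in zip(target, a):
        p *= t[t_k][source[a_k]]
    brute += p / (J + 1) ** K

# Factorised: product over target words of the averaged word probabilities.
fact = 1.0
for t_k in target:
    fact *= sum(t[t_k][s] for s in source) / (J + 1)

print(len(alignments), brute, fact)
```

The number of alignments grows exponentially with the target length, so the brute-force approach is only practical for short sentences.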

In [None]:
from itertools import product

def prob_t_given_s(model, aligned_sent):
    ali = alignment_to_info(aligned_sent)
    p = 0
    for alig in product(range(len(ali.src_sentence)), repeat=len(ali.trg_sentence) - 1):
        ali.alignment = tuple([0] + list(alig))
        p += real_prob_t_a_given_s(model, ali)
    return p
        
prob_t_given_s(ibm1, bitext[100])

#### Test cases for Exercises 2.5 – 2.7.

In [None]:
testext = [
    AlignedSent(['klein', 'ist', 'das', 'haus'], ['the', 'house', 'is', 'small']),
    AlignedSent(['das', 'haus', 'ist', 'ja', 'groß'], ['the', 'house', 'is', 'big']),
    AlignedSent(['das', 'buch', 'ist', 'ja', 'klein'], ['the', 'book', 'is', 'small']),
    AlignedSent(['das', 'haus'], ['the', 'house']),
    AlignedSent(['das', 'buch'], ['the', 'book']),
    AlignedSent(['ein', 'buch'], ['a', 'book'])
]
ibm2 = IBMModel1(testext, 5)

# Tests for Exercise 2.4
testalis = [alignment_to_info(s) for s in testext]

assert testalis[2].alignment == (0, 1, 2, 3, 3, 4)
assert testalis[5].alignment == (0, 1, 2)

# Tests for Exercise 2.5
assert numpy.allclose(real_prob_t_a_given_s(ibm2, testalis[5]), 0.08283000979778607)
assert numpy.allclose(real_prob_t_a_given_s(ibm2, testalis[0]), 0.00018256804431244556)

# Tests for Exercise 2.6
assert numpy.allclose(prob_t_best_a_given_s(ibm2, testext[4]), 0.059443309368677)
assert numpy.allclose(prob_t_best_a_given_s(ibm2, testext[2]), 1.3593610057711997e-05)

# Tests for Exercise 2.7
assert numpy.allclose(prob_t_given_s(ibm2, testext[4]), 0.13718805082588842)
assert numpy.allclose(prob_t_given_s(ibm2, testext[2]), 0.0001809283308942621)

### 3. Phrase-based translation

NLTK also has some functions related to phrase-based translation, but these are far from finished. The components are scattered across two modules:
- [phrase_based](http://www.nltk.org/api/nltk.translate.html#module-nltk.translate.phrase_based) defines the function `phrase_extraction()` that can extract phrases from parallel text, based on an alignment
- [stack_decoder](http://www.nltk.org/api/nltk.translate.html#module-nltk.translate.stack_decoder) defines the `StackDecoder` object, which can be used to translate sentences based on a phrase table and a language model

#### 3.1. Decoding example

If you are wondering where the rest of the training functionality is, you spotted the problem: unfortunately, the part that assembles the phrase table based on the extracted phrases is missing. Also missing are the classes that represent and compute a language model. So in the code block below, we only run the decoder on an example sentence with a "hand-crafted" model.

Note: This is the same code as in the documentation of the decoder (above).

In [None]:
from collections import defaultdict
from math import log

from nltk.translate import PhraseTable
from nltk.translate.stack_decoder import StackDecoder

# The (probabilistic) phrase table
phrase_table = PhraseTable()
phrase_table.add(('niemand',), ('nobody',), log(0.8))
phrase_table.add(('niemand',), ('no', 'one'), log(0.2))
phrase_table.add(('erwartet',), ('expects',), log(0.8))
phrase_table.add(('erwartet',), ('expecting',), log(0.2))
phrase_table.add(('niemand', 'erwartet'), ('one', 'does', 'not', 'expect'), log(0.1))
phrase_table.add(('die', 'spanische', 'inquisition'), ('the', 'spanish', 'inquisition'), log(0.8))
phrase_table.add(('!',), ('!',), log(0.8))

# The "language model"
language_prob = defaultdict(lambda: -999.0)
language_prob[('nobody',)] = log(0.5)
language_prob[('expects',)] = log(0.4)
language_prob[('the', 'spanish', 'inquisition')] = log(0.2)
language_prob[('!',)] = log(0.1)
# Note: type() with three parameters creates a new type object
language_model = type('',(object,), {'probability_change': lambda self, context, phrase: language_prob[phrase],
                                     'probability': lambda self, phrase: language_prob[phrase]})()

stack_decoder = StackDecoder(phrase_table, language_model)

stack_decoder.translate(['niemand', 'erwartet', 'die', 'spanische', 'inquisition', '!'])
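
The `type()` one-liner in the cell above is compact but hard to read. A rough equivalent written as an ordinary class might look like the sketch below; the class name `ToyLanguageModel` is hypothetical, and the two method names are simply the ones the cell above defines.

```python
from collections import defaultdict
from math import log

class ToyLanguageModel:
    """Looks up hand-written log probabilities; unseen phrases get a large penalty."""

    def __init__(self, log_probs, unseen=-999.0):
        self._log_probs = defaultdict(lambda: unseen, log_probs)

    def probability(self, phrase):
        return self._log_probs[tuple(phrase)]

    def probability_change(self, context, phrase):
        # The context is ignored, as in the lambda-based version.
        return self._log_probs[tuple(phrase)]

toy_lm = ToyLanguageModel({('nobody',): log(0.5), ('expects',): log(0.4)})
print(toy_lm.probability(('nobody',)), toy_lm.probability(('somebody',)))
```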

#### 3.2. Train the phrase table*

Run through the parallel corpus (already aligned by an IBM model), and extract all phrases from it. You can limit the length of the phrases you consider to 2 (3, ...) words, but you have to do it manually, because the `max_phrase_length` argument of `phrase_extraction()` doesn't work. Once you have all the phrases, create a phrase table similar to the one above. Don't forget that the decoder expects _log_ probabilities.
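
One possible way to turn extracted phrase pairs into log probabilities is relative-frequency estimation, sketched below in plain Python. The list of phrase pairs is made up, the function name `estimate_log_probs` is hypothetical, and the choice to estimate $P(\text{target phrase}|\text{source phrase})$ is an assumption; adapt the direction to what your decoder setup needs.

```python
from collections import Counter, defaultdict
from math import log

# Made-up extracted phrase pairs (source phrase, target phrase); repeats are occurrences.
extracted = [
    (('niemand',), ('nobody',)),
    (('niemand',), ('nobody',)),
    (('niemand',), ('nobody',)),
    (('niemand',), ('no', 'one')),
    (('erwartet',), ('expects',)),
    (('die', 'spanische', 'inquisition'), ('the', 'spanish', 'inquisition')),
]

def estimate_log_probs(phrase_pairs, max_length=2):
    """Relative-frequency estimate of log P(target phrase | source phrase)."""
    # Apply the length limit manually, then count each surviving pair.
    pair_counts = Counter(
        (src, trg) for src, trg in phrase_pairs
        if len(src) <= max_length and len(trg) <= max_length)
    source_counts = Counter()
    for (src, _), count in pair_counts.items():
        source_counts[src] += count
    table = defaultdict(dict)
    for (src, trg), count in pair_counts.items():
        table[src][trg] = log(count / source_counts[src])
    return table

table = estimate_log_probs(extracted)
print(table[('niemand',)])
```

Each entry of `table` could then be passed to `phrase_table.add(src, trg, log_prob)` in the same way as the hand-crafted entries in 3.1.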