# IBM Model 2, Variational Bayes
#### Authors: Adriaan de Vries, Féliciën Veldema, Verna Dankers

This notebook implements the Variational Bayes (VB) training algorithm for IBM Model 2. Run the cells in order to train and evaluate the model.
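
For reference, the updates carried out by the code in section 3 can be summarised as below. This is read off the implementation rather than quoted from the authors, and the notation is chosen here for illustration: $j$ and $i$ are French and English positions, $l$ and $m$ the English and French sentence lengths, $\delta$ the jump, $\lambda$ the Dirichlet posterior parameters, and $\Psi$ the digamma function. Only the translation parameters receive a Dirichlet prior; the jump probabilities are re-estimated from normalised expected counts.

```latex
\begin{align}
\delta(i, j, l, m) &= i - \left\lfloor \frac{j \, l}{m} \right\rfloor \\
q(a_j = i) &= \frac{p(\delta(i, j, l, m)) \, t(f_j \mid e_i)}
                   {\sum_{i'} p(\delta(i', j, l, m)) \, t(f_j \mid e_{i'})} \\
\lambda_{f \mid e} &= \alpha + \sum_{\text{sentence pairs}} \;
                      \sum_{j : f_j = f} \; \sum_{i : e_i = e} q(a_j = i) \\
t(f \mid e) &= \exp\Big( \Psi(\lambda_{f \mid e})
               - \Psi\big( \textstyle\sum_{f'} \lambda_{f' \mid e} \big) \Big) \\
p(\delta) &= \frac{\sum q(a_j = i) \, [\delta(i, j, l, m) = \delta]}{\sum q(a_j = i)}
\end{align}
```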

### 1. Requirements

In [1]:
# Python libraries to install
from __future__ import print_function, division
from collections import defaultdict, Counter
from tqdm import tqdm
from random import random
from scipy.special import digamma, gammaln

import numpy as np
import matplotlib.pyplot as plt
import pickle
import math
import os

# Custom imports
from aer import read_naacl_alignments, AERSufficientStatistics
import data

### 2. Read in the data

Please set the paths to the data and run the code below. Functions for reading in the data have been placed outside of the notebook, as they are re-used by other notebooks.
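
Since `data.read_data` lives outside the notebook, its internals are not shown here. As a rough sketch of the preprocessing described in the cell comments (singletons in the training data become `-UNK-`, and unseen words in the validation and test data become `-UNK-`), something like the following would do. The helper names `replace_singletons` and `replace_unseen` are hypothetical and are not taken from `data.py`.

```python
from collections import Counter

UNK = "-UNK-"


def replace_singletons(sentences):
    """Hypothetical helper: replace words that occur exactly once by -UNK-."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return [[w if counts[w] > 1 else UNK for w in s] for s in sentences]


def replace_unseen(sentences, vocabulary):
    """Hypothetical helper: replace words outside the vocabulary by -UNK-."""
    return [[w if w in vocabulary else UNK for w in s] for s in sentences]


# Toy corpus: "house" and "garden" each occur once, "the" occurs twice
train = [["the", "house"], ["the", "garden"]]
train_unk = replace_singletons(train)

# The vocabulary is taken from the training data after replacement
vocab = {w for s in train_unk for w in s}
val_unk = replace_unseen([["the", "car"]], vocab)
```

Building the vocabulary after the singleton replacement means `-UNK-` itself is a known word, so the model has translation parameters for it at test time.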

In [2]:
english_train = 'training/hansards.36.2.e'
french_train = 'training/hansards.36.2.f'
english_val = 'validation/dev.e'
french_val = 'validation/dev.f'
fname = 'naacltest.txt'

# Read training data and replace words occurring once by -UNK-
training_data = data.read_data(english_train, french_train, True)
ext_data = list(zip(*training_data))

# Replace words in validation data that do not appear in the training data
validation_data = data.read_data(english_val, french_val, True, ttype='validation', eng_data=ext_data[0], fre_data=ext_data[1])
test_data = data.read_data("testing/test/test.e", "testing/test/test.f", True, ttype='test', eng_data=ext_data[0], fre_data=ext_data[1])

Adding the -UNK- token to the data.


  0%|                                   | 603/231164 [00:16<1:47:08, 35.86it/s]

KeyboardInterrupt: 

  0%|                                   | 603/231164 [00:30<3:11:13, 20.10it/s]

### 3. Implementation of IBM2 VB

First, we implement the training algorithm and the functions that compute alignments, the log likelihood, and the ELBO.
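
To make one VB step concrete before the full implementation, here is a minimal toy sketch of the two computations at its core: normalising `jump_prob * translate_prob` into alignment posteriors, and mapping the resulting Dirichlet parameters through the digamma function. The function names `posterior` and `vb_update` and the toy numbers are assumptions made for illustration. The pure-Python `digamma` is only a stand-in so that the snippet runs without SciPy; the notebook itself uses `scipy.special.digamma`.

```python
import math


def digamma(x):
    """Approximate digamma for x > 0 (stand-in for scipy.special.digamma)."""
    result = 0.0
    # Shift x upwards using psi(x) = psi(x + 1) - 1/x
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    # Asymptotic series: ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    result += math.log(x) - 0.5 / x
    result -= inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result


def posterior(f_word, e_words, t, jump_probs):
    """Alignment posterior q(a_j = i) for one French word.

    jump_probs[i] is the jump probability for English position i.
    """
    scores = [t[e][f_word] * jump_probs[i] for i, e in enumerate(e_words)]
    total = sum(scores)
    return [s / total for s in scores]


def vb_update(lambdas_e, alpha, vocab_size):
    """Map the Dirichlet parameters of one English word to exp(E[log theta])."""
    # French words without counts keep the prior value alpha
    unseen = vocab_size - len(lambdas_e)
    norm = digamma(sum(lambdas_e.values()) + alpha * unseen)
    return {f: math.exp(digamma(lam) - norm) for f, lam in lambdas_e.items()}


# Toy translation table and uniform jump probabilities (assumed values)
t = {"the": {"maison": 0.2, "la": 0.8}, "house": {"maison": 0.5, "la": 0.5}}
q = posterior("maison", ["the", "house"], t, [0.5, 0.5])

alpha = 0.1
# lambda = alpha + expected counts; the count 0.3 for "la" is made up
lambdas_house = {"maison": alpha + q[1], "la": alpha + 0.3}
new_t = vb_update(lambdas_house, alpha, vocab_size=2)
```

Note that the values in `new_t` sum to less than one. This is expected: `exp(E[log theta])` is not a normalised distribution, which is why the implementation keeps a separate `-REST-` entry instead of renormalising.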

In [None]:
def elbo(data, t, jumps, f_vocab, alpha, lambdas):
    """Calculate the ELBO for the training data.

    Args:
        data: list of (e, f) sentence pairs
        t: dictionary with translation probabilities e to f
        jumps: alignment (jump) probabilities
        f_vocab: set of French words
        alpha: concentration parameter of the symmetric Dirichlet prior
        lambdas: Dirichlet posterior parameters (alpha plus expected counts)
            from the last iteration

    Returns:
        float: ELBO
    """
    # Start by calculating log likelihood
    ll = log_likelihood(data, t, jumps)
    
    # Add -KL to the log likelihood
    elbo = ll
    gammaln_alpha = gammaln(alpha)
    c = gammaln(alpha * len(f_vocab))
    for e in tqdm(t):
        a = sum([(math.log(t[e][f]) if t[e][f] != 0 else 0) * (alpha - lambdas[e][f])
                 +  gammaln(lambdas[e][f]) - gammaln_alpha for f in lambdas[e] if f != "-REST-"])
        b = gammaln(sum([(lambdas[e][f] if f in lambdas[e] else alpha) for f in f_vocab]))
        elbo += a - b + c
    return elbo

def log_likelihood(data, translate_dict, jump_dict):
    """Calculate the log likelihood of the data under the best (Viterbi) alignments.

    Args:
        data: list of (e, f) sentence pairs
        translate_dict: dictionary with translation probabilities e to f
        jump_dict: alignment (jump) probabilities

    Returns:
        float: log likelihood
    """
    total = 0
    for e, f in data:
        # VB_align does not use f_vocab, so None is passed; NULL alignments are kept
        alignment = VB_align(e, f, translate_dict, jump_dict, None, add_null=True)
        logprob = 0
        for i, j in alignment:
            # j is 1-based in the alignment, hence j-1 for indexing
            logprob += math.log(translate_dict[e[i]][f[j-1]] * jump_dict[get_jump(j-1, i, len(e), len(f))])
        total += logprob
    return total

def VB_align_all(data, translate_dict, jump_dict, f_vocab, fname=None):
    """Create alignments for pairs of English and French sentences.
    Return them as one set per sentence pair and, if fname is given,
    also write them to file in NAACL format.
    
    Args:
        data: list of (e, f) sentence pairs
        translate_dict: dictionary with translation probabilities e to f
        jump_dict: alignment (jump) probabilities
        f_vocab: set of French words
        fname: filename to save alignments in, in NAACL format

    Returns:
        list of sets
    """
    alignments = []
    lines = []
    for k, (english_words, french_words) in enumerate(data):
        alignment = VB_align(english_words, french_words, translate_dict, jump_dict, f_vocab, False)
        for pos1, pos2 in alignment:
            # NAACL format: 1-based sentence number, English position, French position
            lines.append("{} {} {}\n".format(k + 1, pos1, pos2))
        alignments.append(set(alignment))
    # Only write when a filename is given; the context manager closes the file
    if fname is not None:
        with open(fname, 'w') as file:
            file.writelines(lines)
    return alignments

def VB_align(english_words, french_words, translate_dict, jump_dict, f_vocab, add_null=True):
    """Align one sentence pair, either with or without the NULL alignments.
    
    Args:
        english_words: list of English words (index 0 is treated as the NULL word)
        french_words: list of French words
        translate_dict: dictionary with translation probabilities e to f
        jump_dict: alignment (jump) probabilities
        f_vocab: set of French words (currently unused)
        add_null: boolean to indicate whether NULL alignments should be included

    Returns:
        list of tuples (English position, 1-based French position)
    """
    alignment = []
    for j, fword in enumerate(french_words):
        prior = 0.0
        alignment_j = 0
        for i, eword in enumerate(english_words):
            # Only include terms that are in the dictionary
            if eword in translate_dict:
                if fword in translate_dict[eword]:
                    prob = translate_dict[eword][fword]
                else:
                    prob = translate_dict[eword]["-REST-"]
                prob = prob * jump_dict[get_jump(j, i, len(english_words), len(french_words))]
                if prob > prior:
                    prior = prob
                    alignment_j = i
        # Keep the link unless it is a NULL alignment and add_null is False
        if alignment_j != 0 or add_null:
            alignment.append((alignment_j, j + 1))
    return alignment

def initialize_t(data, uniform=True):
    """Initialise the translation probabilities.
    
    Args:
        data: list of tuples, english and french sentences
        uniform: boolean indicating initialisation type

    Returns:
        defaultdict(Counter)
    """
    t = defaultdict(Counter)
    for e, f in data:
        for e_word in e:
            for f_word in f:
                if uniform:
                    t[e_word][f_word] = 1
                else:
                    t[e_word][f_word] = random()
    for e_word in t:
        normalization_factor = sum(list(t[e_word].values()))
        for f_word in t[e_word]:
            t[e_word][f_word] = t[e_word][f_word] / normalization_factor
    return t

def test(path, personal_sets=None):
    """Compute AER for alignments.

    Args:
        path (str): path to a file with NAACL alignments, read if personal_sets is None
        personal_sets: alignments in set representation
    
    Returns:
        float: AER
    """
    # 1. Read in gold alignments
    gold_sets = read_naacl_alignments('validation/dev.wa.nonullalign')

    # 2. Load the predictions from file unless they were passed in
    if personal_sets is None:
        personal_sets = read_naacl_alignments(path)
        predictions = []
        for s, p in personal_sets:
            links = set()
            for link in s:
                links.add(link)
            predictions.append(links)
    else:
        predictions=personal_sets

    # 3. Compute AER
    # first we get an object that manages sufficient statistics 
    metric = AERSufficientStatistics()
    # then we iterate over the corpus 
    for gold, pred in zip(gold_sets, predictions):
        metric.update(sure=gold[0], probable=gold[1], predicted=pred)
    # AER
    return metric.aer()

def VB_IBM2(training_data, validation, alpha, initial_translation_estimate=None, initialise_method = 'IBM1', max_steps=3):
    """Train IBM 2 using variational inference.

    Note: after every iteration this function also aligns the global test_data.
    
    Args:
        training_data: list of tuples with parallel sentences
        validation: list of tuples with parallel sentences
        alpha: concentration parameter of the Dirichlet prior
        initial_translation_estimate: initial translate_dict, required for 'ibm1'
        initialise_method: how to initialise translation probabilities
            ('ibm1', 'uniform' or 'random')
        max_steps: number of learning iterations

    Returns:
        dict: translation probabilities
        dict: jump probabilities
    """
    initialise_method = initialise_method.lower()
    assert initialise_method in ['ibm1', 'uniform', 'random'], "initialise method has to be in [ibm1, uniform, random]"
    if initialise_method == 'ibm1':
        assert initial_translation_estimate, "initial_translation_estimate has to be given if initialise_method is 'ibm1'"
        translate_dict = initial_translation_estimate
    elif initialise_method == 'uniform':
        translate_dict = initialize_t(training_data, True)
    else:
        translate_dict = initialize_t(training_data, False)
    
    jump_dict = {}
    for i in range(-100, 101):
        jump_dict[i] = 1/201
    aers = []
    elbos = []
    f_vocab = {f for e in translate_dict for f in translate_dict[e]}

    for iteration in range(max_steps):
        fname = 'IBM2_iteration' + str(iteration) + '.txt'
        lambdas = defaultdict(lambda : defaultdict(lambda : alpha))
        jump_counts = defaultdict(int)
        pos_counts = 0
        
        # VB `expectation'
        print("Expectation step {}".format(iteration))
        for e_s,f_s in tqdm(training_data):
            m = len(f_s)
            l = len(e_s)
            for j, f in enumerate(f_s):
                sum_of_probs = 0
                # Calculate normalizers
                for i, e in enumerate(e_s):
                    jump_prob = jump_dict.get(get_jump(j,i,l,m), 1/l)
                    translate_prob = translate_dict[e][f]
                    sum_of_probs += jump_prob * translate_prob

                # Collect counts
                for i, e in enumerate(e_s):
                    jump_prob = jump_dict.get(get_jump(j,i,l,m), 1/l)
                    translate_prob = translate_dict[e][f]
                    prob = jump_prob * translate_prob / sum_of_probs
                    lambdas[e][f] += prob
                    jump_counts[get_jump(j,i,l,m)] += prob
                    pos_counts += prob

        # VB `maximisation'
        print("Maximisation step {}".format(iteration))
        for e in tqdm(translate_dict):
            summation = 0
            for f2 in f_vocab:
                if f2 in lambdas[e]:
                    summation += lambdas[e][f2]
                else:
                    summation += alpha
            summation = digamma(summation)
            for f in translate_dict[e]:
                translate_dict[e][f] = np.exp(digamma(lambdas[e][f]) - summation)
            translate_dict[e]["-REST-"] = np.exp(digamma(alpha) - summation)

        for jump in jump_counts:
            jump_dict[jump] = jump_counts[jump] / pos_counts
            
        # Writing the iteration files in naacl for AER use
        alignments = VB_align_all(validation, translate_dict, jump_dict, f_vocab, fname)
        _ = VB_align_all(test_data, translate_dict, jump_dict, f_vocab, "ibm1_{}.vb.naacl".format(iteration+1))
        eb = elbo(training_data, translate_dict, jump_dict, f_vocab, alpha, lambdas)
        aer = test("", alignments)
        aers.append(aer)
        elbos.append(eb)
        print("AER {}, LL {}".format(aer, eb))
        
        # Save models in between; the context managers close the files
        with open("translate_dicts/ibm2_em_t_epoch_{}.pickle".format(iteration + 1), 'wb') as handle:
            pickle.dump(translate_dict, handle)
        with open("translate_dicts/ibm2_em_jump_epoch_{}.pickle".format(iteration + 1), 'wb') as handle:
            pickle.dump(jump_dict, handle)
    
    # Write AERs and ELBOs to file (the ELBO values are listed under the "LL" header)
    with open("ibm2_vb_{}_scores.txt".format(alpha), 'w') as f:
        f.write("AER\n")
        for aer in aers:
            f.write("{}\n".format(aer))
        f.write("\nLL\n")
        for eb in elbos:
            f.write("{}\n".format(eb))
    return translate_dict, jump_dict

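def get_jump(fre_pos, eng_pos, eng_len, fre_len):
    """Get the jump: the offset of the English position from the position
    that the French position maps to once it is rescaled to the English length.
    
    Args:
        fre_pos (int): position in French sentence
        eng_pos (int): position in English sentence
        eng_len (int): length of English sentence
        fre_len (int): length of French sentence

    Returns:
        int: jump
    """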
    equivalent_pos = int(math.floor(fre_pos * eng_len / fre_len))
    return eng_pos - equivalent_pos
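
To illustrate how `get_jump` buckets position pairs, the snippet below repeats the function so that it runs on its own and evaluates it for a toy sentence pair with 3 English and 6 French positions. The lengths are arbitrary values picked for the example.

```python
import math


def get_jump(fre_pos, eng_pos, eng_len, fre_len):
    """Same computation as get_jump in the cell above."""
    equivalent_pos = int(math.floor(fre_pos * eng_len / fre_len))
    return eng_pos - equivalent_pos


eng_len, fre_len = 3, 6

# One row per French position j, one column per English position i
jumps = [[get_jump(j, i, eng_len, fre_len) for i in range(eng_len)]
         for j in range(fre_len)]

for j, row in enumerate(jumps):
    print("French position {}: jumps {}".format(j, row))
```

Two consecutive French positions share the same row here because the French sentence is twice as long as the English one. A jump of 0 marks the English position on the rescaled diagonal. Since `VB_IBM2` initialises `jump_dict` for jumps from -100 to 100 only, any jump outside that range falls back to `1/l` in the expectation step.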

### 4. Train a model

In [8]:
pretrained_t = pickle.load(open("translate_dicts/IBM1VB/ibm1_vb_epoch_13.pickle", 'rb'))
alpha = 0.0001
ibm2_transdict, ibm2_jumpdict = VB_IBM2(
    training_data, validation_data, alpha, initial_translation_estimate=pretrained_t, max_steps=10
)

Expectation step 0


100%|█████████████████████████████████| 231164/231164 [11:03<00:00, 348.40it/s]


Maximisation step 0


100%|████████████████████████████████████| 25593/25593 [05:56<00:00, 71.87it/s]
100%|████████████████████████████████████| 25593/25593 [05:48<00:00, 73.43it/s]


AER 0.24431256181998018, LL -22982532.1409312
Expectation step 1


100%|█████████████████████████████████| 231164/231164 [11:15<00:00, 342.14it/s]


Maximisation step 1


100%|████████████████████████████████████| 25593/25593 [07:53<00:00, 54.01it/s]
100%|████████████████████████████████████| 25593/25593 [06:51<00:00, 62.25it/s]


AER 0.24751491053677932, LL -21772031.874373805
Expectation step 2


100%|█████████████████████████████████| 231164/231164 [11:49<00:00, 325.83it/s]


Maximisation step 2


100%|████████████████████████████████████| 25593/25593 [05:58<00:00, 71.47it/s]
100%|████████████████████████████████████| 25593/25593 [07:13<00:00, 59.00it/s]


AER 0.24900398406374502, LL -21499952.426671233
Expectation step 3


100%|█████████████████████████████████| 231164/231164 [11:19<00:00, 340.39it/s]


Maximisation step 3


100%|████████████████████████████████████| 25593/25593 [06:12<00:00, 68.71it/s]
100%|████████████████████████████████████| 25593/25593 [06:10<00:00, 69.01it/s]


AER 0.24275724275724275, LL -21372806.558607686
Expectation step 4


100%|█████████████████████████████████| 231164/231164 [10:47<00:00, 356.91it/s]


Maximisation step 4


100%|████████████████████████████████████| 25593/25593 [06:17<00:00, 67.75it/s]
100%|████████████████████████████████████| 25593/25593 [06:36<00:00, 64.52it/s]


AER 0.239, LL -21295252.372020103
Expectation step 5


100%|█████████████████████████████████| 231164/231164 [11:12<00:00, 343.93it/s]


Maximisation step 5


100%|████████████████████████████████████| 25593/25593 [05:58<00:00, 71.31it/s]
100%|████████████████████████████████████| 25593/25593 [06:38<00:00, 64.29it/s]


AER 0.242, LL -21244910.112500586
Expectation step 6


 99%|████████████████████████████████▌| 228276/231164 [10:32<00:08, 360.97it/s]

KeyboardInterrupt: 

 99%|████████████████████████████████▌| 228276/231164 [10:50<00:08, 350.97it/s]
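
The AER values in the log above come from `AERSufficientStatistics` in `aer.py`, which is not shown in this notebook. As a rough sketch of the standard alignment error rate, assuming that the probable links include the sure links, the metric for a single sentence pair can be computed as follows. The function name `alignment_error_rate` and the toy links are made up for illustration.

```python
def alignment_error_rate(sure, probable, predicted):
    """AER = 1 - (|A & S| + |A & P|) / (|A| + |S|).

    Assumes probable is a superset of sure (an assumption of this sketch).
    """
    a_and_s = len(predicted & sure)
    a_and_p = len(predicted & probable)
    return 1.0 - (a_and_s + a_and_p) / float(len(predicted) + len(sure))


# Links are (English position, French position) pairs
sure = {(1, 1), (2, 2)}
probable = sure | {(3, 2)}
predicted = {(1, 1), (3, 2), (3, 3)}

aer = alignment_error_rate(sure, probable, predicted)
perfect = alignment_error_rate(sure, probable, set(sure))
```

Lower is better: predicting exactly the sure links gives an AER of 0, while links outside the probable set and missed sure links both raise the score.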