# Natural Language Processing COMP3225
Seq2seq Neural Machine Translation (NMT) lab

Stuart Middleton, 25/08/2020

This lab will provide practical experience with neural machine translation software trained to generate German output text from English input text. You will learn how to use the encoder/decoder model, pre-process a bitext, train the model and use a decoder to translate sentences. You will then learn to use advanced normalization and tokenization, reverse the input sequence and use a beam search decoder for better performance.

In part 2 you will learn how to handle rare words that fall outside the limited vocabulary supported by NMT solutions. You will generate a statistical phrase alignment using fast_align and use this to generate a lookup dictionary to translate rare words. You will then use the phrase alignment to train the NMT model to predict out of vocabulary words with positional markers, allowing use of the lookup dictionary for translation.

## Part 1

### Pre-requisites

You will need python3. The code below will work OK on a CPU only machine with small numbers of sentences. We recommend you run the code on a machine with a GPU when training with a large number of sentences (which will give better translation accuracy but take a lot longer to compute).

### Task 1 - train a basic NMT model using a seq2seq encoder/decoder architecture

Further reading:
    [TensorFlow NMT tutorial](https://www.tensorflow.org/addons/tutorials/networks_seq2seq_nmt)
    [Luong 2015 NMT paper](https://www.aclweb.org/anthology/D15-1166/)

First install python3 and the pre-requisite libraries needed for this tutorial.

```
python3 -m pip install numpy
python3 -m pip install tensorflow-gpu
python3 -m pip install tensorflow_addons
python3 -m pip install keras
python3 -m pip install sacrebleu
python3 -m pip install mosestokenizer
python3 -m pip install notebook

unzip package for lab
jupyter notebook
==> will open browser windows from localhost:8888
==> load the lab .ipynb file
```

Create a new python code file for your work.

Import this lab's required python3 libraries. It is safer to import tensorflow2 last, as some complex libraries (e.g. numpy) have been known to conflict with tensorflow implementations of backend code, resulting in occasional segmentation faults. It is also best to use the tensorflow logger, as opposed to the python default logger, as tensorflow2 defines its own logger internally and if this is not used your log messages will probably be ignored inside tensorflow graphs.

In [None]:
import unicodedata, re, io, time, os, datetime, sys, codecs, math, gc, random, contextlib, itertools, json, string
import numpy as np
import sacrebleu, mosestokenizer

# small library of functions used by this tutorial
import lab_seq2seq_nmt_lib

import logging
import tensorflow as tf
import absl.logging
formatter = logging.Formatter('[%(levelname)s|%(filename)s:%(lineno)s %(asctime)s] %(message)s')
absl.logging.get_absl_handler().setFormatter(formatter)
absl.logging._warn_preinit_stderr = False
logger = tf.get_logger()
logger.setLevel(logging.INFO)

from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Embedding
from keras.models import load_model
import tensorflow_addons as tfa


We will now define the hyperparameters used for the models. The sequence length is 20 words for this lab to speed it up, but lengths are usually 40 or 50 words. As sequences get longer the performance degrades, so more epochs are needed to lower loss, and GPU memory use is proportional to sequence length. The same goes for vocabulary size, which is limited to a maximum of 50k to avoid running out of GPU memory (not a problem for the dataset here but for larger datasets it will be). The batch size should be large enough to stop the loss oscillating (yo-yoing) instead of converging, which can happen if a single entry in a small batch leads the training astray. For larger models the number of epochs will typically need to be around 100 to achieve a good loss score of 0.01 or lower.

The model training will take a long time (hours) if the full bitext corpus is used. Therefore for lab sessions we suggest using a truncated set of sentences (using the DEBUG_TRUNC_SENTS hyperparameter) aiming to get 0.2 loss or lower. At home you can train the model overnight on the full dataset to get much more accurate translations.
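
As a rough guide to the relative cost of the three hyperparameter presets used in this lab, the number of training steps per epoch is the number of sentences divided by the batch size, rounded down because an incomplete final batch is dropped. The sketch below is plain arithmetic; the helper name `steps_per_epoch` and the preset labels are hypothetical, and 152,820 is the size of the full bitext used in this lab.

```python
def steps_per_epoch(num_sentences, batch_size):
    # floor division: a final incomplete batch is dropped
    return num_sentences // batch_size

# (number of training sentences, batch size) for each preset
presets = {
    'debug': (1000, 64),
    'lab': (20000, 256),
    'full': (152820, 256),
}

steps = {name: steps_per_epoch(n, b) for name, (n, b) in presets.items()}
print(steps)
```

Multiplying the steps by the number of epochs gives a feel for why the full run is best left overnight.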

In [None]:
# encoder embedding dimension
EMBEDDING_DIM = 256

# encoder LSTM dimension
RNN_UNITS = 1024

# attention layer dimension
DENSE_UNITS = 1024

# dataset parameters
SHUFFLE_BUFFER_SIZE = 10000
MAX_SENT_LENGTH = 20
MAX_SENT_LENGTH_SOURCE = MAX_SENT_LENGTH
MAX_SENT_LENGTH_TARGET = MAX_SENT_LENGTH

# beam search parameters
BEAM_WIDTH = 12
BEAM_LENGTH_NORM_WEIGHT = 1.0

# very short run for debug, will probably provide single wrong word translations (in lab)
#BATCH_SIZE = 64
#EPOCHS = 10
#DEBUG_TRUNC_SENTS = 1000
#MAX_VOCAB_SIZE = 1000

# longer run, better accuracy (in lab with a CPU 1.5h, faster with GPU)
BATCH_SIZE = 256
EPOCHS = 10
DEBUG_TRUNC_SENTS = 20000
MAX_VOCAB_SIZE = 10000

# full run, best accuracy (overnight)
#BATCH_SIZE = 256
#EPOCHS = 100
#DEBUG_TRUNC_SENTS = None
#MAX_VOCAB_SIZE = 10000

Below is the model we will use for our neural machine translation application. The architecture follows the seminal sequence to sequence encoder/decoder architecture used by [Luong 2015](https://www.aclweb.org/anthology/D15-1166/). The encoder has an embedding layer and an LSTM layer. The decoder has an embedding layer, an attention layer and an LSTM layer. This code is based on the excellent [tensorflow seq2seq NMT tutorial](https://www.tensorflow.org/addons/tutorials/networks_seq2seq_nmt).

The focus of this lab is understanding how to use this model for machine translation and get the most from it. It is recommended you read [Luong 2015](https://www.aclweb.org/anthology/D15-1166/) to fully understand the inner workings of the model, which hyperparameters are best, and how different layer and technique choices affect the final BLEU translation score.
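
To give some intuition for what the attention layer computes, here is a minimal plain-Python sketch of the simplest "dot" scoring variant described in Luong 2015: each source position is scored against the current decoder state, the scores are turned into weights with a softmax, and the context vector is the weighted sum of the encoder outputs. The function names and toy vectors are hypothetical, and this is not the code path used by tfa.seq2seq.LuongAttention, which works on batched tensors.

```python
import math

def softmax(scores):
    # subtract the maximum for numerical stability
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot_attention(decoder_state, encoder_outputs):
    # score each source position by its dot product with the decoder state
    scores = [sum(d * e for d, e in zip(decoder_state, enc)) for enc in encoder_outputs]
    weights = softmax(scores)
    # context vector = attention-weighted sum of the encoder outputs
    dim = len(decoder_state)
    context = [sum(w * enc[k] for w, enc in zip(weights, encoder_outputs)) for k in range(dim)]
    return weights, context

# toy example: 3 source positions, hidden size 2
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]
weights, context = dot_attention(decoder_state, encoder_outputs)
print(weights, context)
```

The positions whose encoder output points in the same direction as the decoder state receive the largest weights, so they dominate the context vector.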

In [None]:
class EncoderNetwork(tf.keras.Model):

	def __init__( self, input_vocab_size, embedding_dims, rnn_units ):
		super().__init__()
		self.encoder_embedding = tf.keras.layers.Embedding( input_dim=input_vocab_size, output_dim=embedding_dims )
		self.encoder_rnnlayer = tf.keras.layers.LSTM( rnn_units, return_sequences=True, return_state=True )

class DecoderNetwork(tf.keras.Model):
	def __init__(self, output_vocab_size, embedding_dims, rnn_units, dense_units) :
		super().__init__()
		self.decoder_embedding = tf.keras.layers.Embedding( input_dim = output_vocab_size, output_dim = embedding_dims ) 
		self.dense_layer = tf.keras.layers.Dense( output_vocab_size )
		self.decoder_rnncell = tf.keras.layers.LSTMCell( rnn_units )
		self.dense_units = dense_units

		self.sampler = tfa.seq2seq.sampler.TrainingSampler()

		# Create attention mechanism with memory = None
		self.attention_mechanism = self.build_attention_mechanism( self.dense_units, None, BATCH_SIZE*[MAX_SENT_LENGTH_SOURCE] )
		self.rnn_cell = tfa.seq2seq.AttentionWrapper( self.decoder_rnncell, self.attention_mechanism, attention_layer_size=self.dense_units )
		self.decoder = tfa.seq2seq.BasicDecoder( self.rnn_cell, sampler = self.sampler, output_layer = self.dense_layer )

	def build_attention_mechanism(self, units, memory, memory_sequence_length) :
		return tfa.seq2seq.LuongAttention( units, memory = memory, memory_sequence_length=memory_sequence_length )

	def build_decoder_initial_state( self, batch_size, encoder_state, Dtype ) :
		decoder_initial_state = self.rnn_cell.get_initial_state( batch_size = batch_size, dtype = Dtype )
		decoder_initial_state = decoder_initial_state.clone( cell_state=encoder_state )
		return decoder_initial_state


The model is trained on a bitext consisting of sentences from a source language (e.g. EN) and target language (e.g. DE). A bitext is simply two sentence aligned files, a source language and its target for translation. Typically a bitext for machine translation will be over 100,000 sentences long. Bitext documents are often sourced from content that has been translated by humans, for example UN meeting minutes which are translated to many languages or Wikipedia articles where volunteers have translated them into different languages. The bitext we will use is an EN to DE bitext with 152,820 pairs of sentences.

Sentences usually need to be cleaned and normalized before they are ready for use as training data, so the model will not pick up on noise patterns. The pre-processing function below will load a corpus from disk and use simple regex patterns to normalize and clean the text. Later we will replace this function with a more advanced version using the machine translation community standard [MOSES](http://www.statmt.org/moses/) tokenization and normalization perl scripts.
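
To illustrate the kind of cleaning a simple regex normalizer performs, here is a minimal sketch that lowercases the text and strips punctuation, digits and non-printable characters. The function name `normalize_sent_sketch` is hypothetical and the exact rules are assumptions; the lab library's own normalize_sent_regex() may behave differently.

```python
import re
import unicodedata

def normalize_sent_sketch(sent):
    # normalize the character set to a canonical composed form and lowercase
    sent = unicodedata.normalize('NFKC', sent).lower()
    # replace non-printable characters (tabs, control codes) with spaces
    sent = ''.join(ch if ch.isprintable() else ' ' for ch in sent)
    # replace punctuation, digits and underscores with spaces (letters such as ä or ß are kept)
    sent = re.sub(r'[^\w\s]|[\d_]', ' ', sent)
    # collapse runs of whitespace so tokens can be split on single spaces
    return ' '.join(sent.split())

print(normalize_sent_sketch('Ich mag 3 Äpfel, "sehr"!'))
```

Note how much information is thrown away: numbers, quotation marks and sentence punctuation all disappear.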

In [None]:
def task1_preprocess( bitext_source_file, bitext_target_file, logger ) :
	# load bitext using simple regex normalization
	list_source_sents = lab_seq2seq_nmt_lib.load_corpus( filename = bitext_source_file, logger = logger, normalize_func = lab_seq2seq_nmt_lib.normalize_sent_regex )

	# translate mode?
	if bitext_target_file == None:
		return list_source_sents, None, None, None

	list_target_sents = lab_seq2seq_nmt_lib.load_corpus( filename = bitext_target_file, logger = logger, normalize_func = lab_seq2seq_nmt_lib.normalize_sent_regex )
	if len(list_source_sents) != len(list_target_sents) :
		raise Exception( 'bitext size mismatch' )

	# make bitext
	list_bitext = []
	for nSentIndex in range(len(list_source_sents)) :
		list_bitext.append( ( list_source_sents[nSentIndex], list_target_sents[nSentIndex] ) )

	list_source_sents = []
	list_target_sents = []
	for nSentIndex in range(len(list_bitext)) :
		list_source_sents.append( list_bitext[nSentIndex][0] )
		list_target_sents.append( list_bitext[nSentIndex][1] )

	return list_source_sents, list_target_sents, None, None


Each sentence needs to be tokenized and converted into a sequence vector. Tokenization is simply converting text into a sequence of tokens, then building a vocabulary index so tokens can be assigned an index value and made into a tensor ready for model training.

The tokenization below fits a tf.keras.preprocessing.text.Tokenizer() class on the input sentence lists to build a vocabulary index (one for source and one for target). The index is sorted by frequency of occurrence automatically by the Tokenizer() class. For now we allow all words to be kept, but later we will impose a vocabulary limit, which is needed for processing a larger bitext corpus. Once fit, the tokenizer is applied to each sentence in the corpus, and a tensor is created of padded sequences of word indexes representing the input sentences.
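
The idea can be sketched in plain Python as follows: count the words, give the most frequent word the lowest index, reserve index 0 for padding, then convert each sentence to a fixed-length list of indexes. This is an approximation for illustration only, not the Keras implementation; the helper names are hypothetical, and breaking frequency ties alphabetically and padding at the end of the sequence are assumptions.

```python
from collections import Counter

def build_word_index(sentences):
    counts = Counter(word for sent in sentences for word in sent.split())
    # most frequent word gets index 1; index 0 is reserved for padding
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: i + 1 for i, (word, _) in enumerate(ordered)}

def to_padded_sequence(sent, word_index, max_len):
    # map known words to indexes, truncate, then pad with zeros
    seq = [word_index[w] for w in sent.split() if w in word_index][:max_len]
    return seq + [0] * (max_len - len(seq))

sentences = ['<start> i like apples <end>', '<start> i like green apples <end>']
word_index = build_word_index(sentences)
padded = to_padded_sequence(sentences[0], word_index, max_len=8)
print(word_index)
print(padded)
```

Because every sequence has the same length, the whole corpus can be stacked into a single tensor of shape (number of sentences, maximum sentence length).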

In [None]:
def task1_tokenize( list_source_sents, list_target_sents, dict_alignment_matrix, logger, use_tokenizer_source = None, use_tokenizer_target = None ) :
	# train tokenizers (or use existing one if we are in translate phase)
	if use_tokenizer_source == None :
		tokenizer_source = lab_seq2seq_nmt_lib.train_tokenizer( list_lines = list_source_sents, logger = logger )
	else :
		tokenizer_source = use_tokenizer_source
	if use_tokenizer_target == None :
		tokenizer_target = lab_seq2seq_nmt_lib.train_tokenizer( list_lines = list_target_sents, logger = logger )
	else :
		tokenizer_target = use_tokenizer_target

	# shuffle corpus and truncate if needed
	if list_target_sents != None :
		bitext = list( zip( list_source_sents, list_target_sents ) )
		random.shuffle( bitext )
		if DEBUG_TRUNC_SENTS == None :
			nMax = len(bitext)
		else :
			# never index past the end of the corpus
			nMax = min( DEBUG_TRUNC_SENTS, len(bitext) )
		list_source_sents = []
		list_target_sents = []
		for nSentIndex in range(nMax) :
			list_source_sents.append( bitext[nSentIndex][0] )
			list_target_sents.append( bitext[nSentIndex][1] )
	else :
		if DEBUG_TRUNC_SENTS == None :
			nMax = len(list_source_sents)
		else :        
			nMax = DEBUG_TRUNC_SENTS
		list_source_sents = list_source_sents[:nMax]
        
	# apply tokenizers
	source_tensor_train = lab_seq2seq_nmt_lib.apply_tokenization( list_sents = list_source_sents, tokenizer = tokenizer_source, max_sent_length = MAX_SENT_LENGTH_SOURCE, reverse_seq = False, logger = logger )

	# translate mode?
	if list_target_sents == None :
		return tokenizer_source, tokenizer_target, source_tensor_train, None
    
	target_tensor_train = lab_seq2seq_nmt_lib.apply_tokenization( list_sents = list_target_sents, tokenizer = tokenizer_target, max_sent_length = MAX_SENT_LENGTH_TARGET, reverse_seq = False, logger = logger )
           
	return tokenizer_source, tokenizer_target, source_tensor_train, target_tensor_train


Now we can write a function to train the model on a bitext corpus, and then run the trained model on a set of unseen sentences to translate them and compute a BLEU score.

We start by calling the preprocessing and tokenize functions defined previously. This generates a set of source and target tensors ready for model training.

The dict_lookup_dictionary parameter will be used later on so can be ignored for now.

The training tensors are loaded into a tf.data.Dataset() class, then batched and randomly shuffled to help avoid training bias. Prefetch allows tensorflow to load the next batch in parallel with training on the current batch, multi-tasking for more efficient use of the GPU.

Next the encoder and decoder models are created and the optimizer defined. At this stage they are untrained. A checkpoint manager is used to save the best (lowest loss) model after each epoch, so at the end we have the best model to use. Checkpointing also allows training to be resumed if your computer crashes, which is helpful if training times are long.
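
The "save only when the loss improves" rule can be sketched in a few lines of plain Python. The function name and the loss values are hypothetical; in the lab code the actual saving is done by a tf.train.CheckpointManager.

```python
def epochs_to_checkpoint(epoch_losses):
    best = None
    saved = []
    for epoch, loss in enumerate(epoch_losses, start=1):
        # checkpoint only if this epoch beats the best loss seen so far
        if best is None or loss < best:
            best = loss
            saved.append(epoch)
    return saved

saved_epochs = epochs_to_checkpoint([2.1, 1.4, 1.6, 0.9, 1.0])
print(saved_epochs)
```

Because a checkpoint is only written on improvement, the most recently saved checkpoint is always the best one seen so far.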

We are using a simple tf.keras.losses.SparseCategoricalCrossentropy() function to compute the loss, which we will aim to minimize during training. Note the use of [@tf.function](https://www.tensorflow.org/guide/function) to decorate functions used within the main training loop. This allows tensorflow to use AutoGraph to build a graph from the python code, using tensorflow versions of python variables, and run much more efficiently than it would using normal python variables.

The train_step() function feeds a batch of input tensors to the encoder and then the decoder, and uses a gradient tape to keep track of the gradients, which are then fed to the optimizer to adjust the layer weights. The attention layer takes the (source) sequence encoder output and the (target) sequence decoder input and provides a context vector for the decoder to make a prediction. All batches are provided to train_step() every epoch and the overall loss is recorded to check for convergence. The most recent best model is kept at the end of training.
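
Two details of the training step are easy to miss: the target sequence is shifted by one position so the decoder learns to predict the next token, and padded positions are masked out of the loss. The plain-Python sketch below illustrates both; the token ids (`<start>`=1, `<end>`=2, `<pad>`=0) and the probabilities are hypothetical, and the loss is averaged over all positions with padded ones contributing zero, mirroring the mask followed by a mean.

```python
import math

PAD = 0  # assumed id of the padding token

def shift_for_teacher_forcing(target_seq):
    # decoder input drops the last token; the expected output drops the first
    return target_seq[:-1], target_seq[1:]

def masked_mean_loss(token_probs, expected):
    # token_probs[i] is the probability the model assigned to expected[i]
    losses = [0.0 if tok == PAD else -math.log(p) for p, tok in zip(token_probs, expected)]
    # mean over all positions; padded positions contribute zero
    return sum(losses) / len(losses)

target = [1, 7, 8, 2, 0, 0]  # <start> w7 w8 <end> <pad> <pad>
decoder_input, decoder_output = shift_for_teacher_forcing(target)
loss = masked_mean_loss([0.5, 0.5, 0.5, 0.9, 0.9], decoder_output)
print(decoder_input, decoder_output, loss)
```

Without the mask, the model would be rewarded for predicting padding, which is trivially easy and would make the loss look better than the translations really are.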

Once trained, the model is executed on a set of unseen validation sentences. The sentences are preprocessed and tokenized in the same way as the training bitext. A set of translated sentences is also provided as a 'gold' set to allow a BLEU score to be computed at the end.

The decode_func() and translate_func() are defined later and do the work of computing the translations using the trained model. The lookup_unkposN() is used later when rare word support is added.
    
A [BLEU score](https://en.wikipedia.org/wiki/BLEU) is calculated using the [sacrebleu](https://github.com/mjpost/sacrebleu) library, providing a standard measure of the overlap in word sequences (n-grams) between the predicted translation and the gold translation. The higher the BLEU score the better, with 20+ providing an error prone but understandable translation and 40+ providing a pretty good translation.
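
To make the BLEU calculation less of a black box, here is a simplified single-sentence sketch: clipped n-gram precisions for n = 1 to 4 are combined with a geometric mean and multiplied by a brevity penalty. The function names are hypothetical, and this is not a substitute for sacrebleu, which applies its own tokenization and smoothing and so will report different numbers, especially on very short sentences.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # count every contiguous run of n tokens
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(hypothesis, reference, max_n=4):
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # clip each hypothesis n-gram count by its count in the reference
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(overlap / total))
    # brevity penalty punishes hypotheses shorter than the reference
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return 100 * bp * math.exp(sum(log_precisions) / max_n)

print(simple_bleu('the cat sat on the mat', 'the cat sat on the mat'))
print(simple_bleu('i like bananas', 'i like apples', max_n=2))
```

A sentence with no matching 4-gram scores zero in this unsmoothed version, which is one reason BLEU is normally reported over a whole corpus instead of a single sentence.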

In [None]:
def exec_task( preprocess_func, tokenize_func, decode_func, translate_func, postprocess_func ) :
	logger.info( 'task started' )
	logger.info( '\tpreprocessing = ' + preprocess_func.__name__  )
	logger.info( '\ttokenizer = ' + tokenize_func.__name__  )
	logger.info( '\tdecode = ' + decode_func.__name__  )
	logger.info( '\ttranslate = ' + translate_func.__name__  )
	if postprocess_func != None :
		logger.info( '\tpostprocessing = ' + postprocess_func.__name__  )

	model_dir = './model'
	model_name = 'nmt_en_to_de'
	bitext_source_file = '../corpus/bitext_nmt_en.txt'
	bitext_target_file = '../corpus/bitext_nmt_de.txt'
	lookup_dict_file = './model/lookup_dictionary.txt'
	translated_source_file = './model/translate_source.txt'

	if os.path.exists( model_dir ) == False :
		os.mkdir( model_dir )    

	#
	# pre-process dataset (load, tokenize, make tensor)
	#

	list_source_sents, list_target_sents, dict_alignment_matrix, dict_lookup_dictionary = preprocess_func( bitext_source_file, bitext_target_file, logger )
	tokenizer_source, tokenizer_target, source_tensor_train, target_tensor_train = tokenize_func( list_source_sents, list_target_sents, dict_alignment_matrix, logger )
	logger.info( 'shape of source tensor = ' + repr( source_tensor_train.shape ) )

	vocab_source_size = len(tokenizer_source.word_index)+1
	vocab_target_size = len(tokenizer_target.word_index)+1
	logger.info( 'tokenizer vocab size = ' + repr( (vocab_source_size,vocab_target_size) ) )

	if dict_lookup_dictionary != None :
		logger.info( 'lookup dict vocab size = ' + repr( len(dict_lookup_dictionary) ) )
		logger.info( 'saving lookup dict to file = ' + lookup_dict_file )
		write_handle = codecs.open( lookup_dict_file, 'w', 'utf-8', errors = 'strict' )
		write_handle.write( json.dumps( dict_lookup_dictionary, indent = 4 ) + '\n' )
		write_handle.close()

	#
	# make dataset from tensor
	#

	steps_per_epoch = len(source_tensor_train)//BATCH_SIZE
	logger.info( 'steps per epoch = ' + repr(steps_per_epoch) )

	dataset = tf.data.Dataset.from_tensor_slices( ( source_tensor_train, target_tensor_train ) )
	dataset = dataset.batch( BATCH_SIZE, drop_remainder=True )
	dataset = dataset.shuffle( SHUFFLE_BUFFER_SIZE, reshuffle_each_iteration=True )
	dataset = dataset.prefetch( buffer_size=tf.data.experimental.AUTOTUNE )

	logger.info( 'Input batch X shape: {}'.format( tf.random.uniform( (BATCH_SIZE, 1) ).shape) )
	logger.info( 'Target batch Y shape: {}'.format( tf.random.uniform( (BATCH_SIZE, 1) ).shape) )

	#
	# Train model
	#

	encoderNetwork = EncoderNetwork( vocab_source_size,EMBEDDING_DIM, RNN_UNITS )
	decoderNetwork = DecoderNetwork( vocab_target_size,EMBEDDING_DIM, RNN_UNITS, DENSE_UNITS )

	optimizer = tf.keras.optimizers.Adam( learning_rate = 0.002, amsgrad = True )

	logger.info('Encoder params: (vocab_size, embedding_dims, units) {}'.format( repr( (vocab_source_size, EMBEDDING_DIM, RNN_UNITS) ) ))
	logger.info('Decoder params: (vocab_size, embedding_dims, rnn_units, dense_units) {}'.format( repr( (vocab_target_size, EMBEDDING_DIM, RNN_UNITS, DENSE_UNITS) ) ))

	# setup checkpoint
	checkpoint_dir = model_dir
	checkpoint = tf.train.Checkpoint( optimizer=optimizer, encoder=encoderNetwork, decoder=decoderNetwork )
	checkpointManager = tf.train.CheckpointManager( checkpoint, directory=checkpoint_dir, checkpoint_name=model_name, max_to_keep=3 )

	# setup tf.summary log dir and file handler (to generate output for TensorBoard)
	current_time = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
	train_log_dir = './logs/gradient_tape/' + current_time + '/train'
	test_log_dir = './logs/gradient_tape/' + current_time + '/test'
	train_summary_writer = tf.summary.create_file_writer(train_log_dir)
	test_summary_writer = tf.summary.create_file_writer(test_log_dir)

	# define metrics for TensorBoard
	train_loss = tf.keras.metrics.Mean('train_loss', dtype=tf.float32)

	# loss function based on categorical crossentropy
	@tf.function
	def loss_function(y_pred, y) :
		sparsecategoricalcrossentropy = tf.keras.losses.SparseCategoricalCrossentropy( from_logits=True,reduction='none' )
		loss = sparsecategoricalcrossentropy(y_true=y, y_pred=y_pred)
		mask = tf.logical_not(tf.math.equal(y,0))
		mask = tf.cast(mask, dtype=loss.dtype)
		loss = mask* loss
		loss = tf.reduce_mean(loss)
		return loss

	# training step using the tensorflow subclassing API
	@tf.function
	def train_step( input_batch, output_batch, encoder_initial_cell_state, train_loss ) :
		loss = 0
		with tf.GradientTape() as tape:
			encoder_emb_inp = encoderNetwork.encoder_embedding( input_batch )
			a, a_tx, c_tx = encoderNetwork.encoder_rnnlayer( encoder_emb_inp, initial_state = encoder_initial_cell_state, training = True )

			# Prepare correct Decoder input & output sequence data
			decoder_input = output_batch[:,:-1] # ignore last word in seq as this will always be <end> or <pad> (decoder input -> training)

			# Compare logits with timestepped +1 version of decoder_input
			decoder_output = output_batch[:,1:] #ignore first word to shift seq to right (decoder output -> target for loss function)

			# Decoder Embeddings
			decoder_emb_inp = decoderNetwork.decoder_embedding(decoder_input)

			# Setting up decoder memory from encoder output and Zero State for AttentionWrapperState
			decoderNetwork.attention_mechanism.setup_memory(a)

			decoder_initial_state = decoderNetwork.build_decoder_initial_state(
				BATCH_SIZE,
				encoder_state=[a_tx, c_tx],
				Dtype=tf.float32 )
			
			# BasicDecoderOutput
			outputs, _, _ = decoderNetwork.decoder( decoder_emb_inp, initial_state=decoder_initial_state, sequence_length=BATCH_SIZE*[MAX_SENT_LENGTH_TARGET-1], training = True)

			logits = outputs.rnn_output

			# Calculate loss
			loss = loss_function(logits, decoder_output)

		# Returns the list of all layer variables / weights.
		variables = encoderNetwork.trainable_variables + decoderNetwork.trainable_variables

		# Differentiate loss wrt variables
		gradients = tape.gradient(loss, variables)

		# grads_and_vars is a List of(gradient, variable) pairs.
		grads_and_vars = zip(gradients,variables)
		optimizer.apply_gradients(grads_and_vars)

		# log metrics for TensorBoard 
		train_loss(loss)

		# return batch loss
		return loss

	nBest = None
	for epoch in range(EPOCHS):
		start = time.time()

		encoder_initial_cell_state = [tf.zeros((BATCH_SIZE, RNN_UNITS)), tf.zeros((BATCH_SIZE, RNN_UNITS))]

		total_loss = 0.0

		batch_list = enumerate( dataset.take(steps_per_epoch) )
		for (batch_index, batch_tuple) in batch_list :
			# input_batch = shape( BATCH_SIZE,MAX_SENT_LENGTH_SOURCE )
			# target_batch = shape( BATCH_SIZE,MAX_SENT_LENGTH_TARGET )
			input_batch, target_batch = batch_tuple

			batch_loss = train_step( input_batch, target_batch, encoder_initial_cell_state, train_loss )
			total_loss += batch_loss

			# print individual batch loss every N batches
			if (batch_index+1)%100 == 0:
				logger.info( 'batch loss: {} epoch {} batch {}'.format(batch_loss.numpy(), epoch, batch_index+1))

		# write metric to summary for this epoch to disk
		with train_summary_writer.as_default():
			# record the loss at each step
			tf.summary.scalar( 'loss', train_loss.result(), step=epoch+1 )

		logger.info( 'Epoch {} Loss {:.4f}'.format(epoch + 1, total_loss / steps_per_epoch) )
		logger.info( 'Time taken for 1 epoch {} sec'.format(time.time() - start) )

		# checkpoint model if we have made an improvement
		if (nBest == None) or (total_loss < nBest) :
			logger.info( 'checkpointing' )
			checkpointManager.save()
			nBest = total_loss

		# reset TensorBoard metrics every epoch
		train_loss.reset_states()

	# the model will have been checkpoint saved into the model dir so we are good to go
	logger.info( 'model training complete - model checkpoints into dir ' + os.path.abspath( model_dir ) )

	#
	# Execute model (translate)
	#

	# add unseen test sentences and the associated gold translations for the BLEU score later (one pair here, add more as needed)
	# note: sacrebleu needs the period at end of sents
	list_validation = [ 'I like apples.' ]
	list_gold = [ 'Ich mag Äpfel.' ]

	# save as bitext
	write_handle = codecs.open( translated_source_file, 'w', 'utf-8', errors = 'strict' )
	for str_sent in list_validation :
		write_handle.write( str_sent + '\n' )
	write_handle.close()

	# load as we would for training
	list_validation, _ , _, _ = preprocess_func( translated_source_file, None, logger )
	_, _, validation_tensor, _ = tokenize_func( list_validation, None, dict_alignment_matrix, logger, use_tokenizer_source = tokenizer_source, use_tokenizer_target = tokenizer_target )

	# load the last model checkpoint (best)
	strCheckpointFile = tf.train.latest_checkpoint(checkpoint_dir)
	checkpoint.restore( strCheckpointFile ).expect_partial()

	# calc max sent length
	max_sent_length = 0
	for nSentIndex in range(len(list_validation)) :
		validation_sent = list_validation[nSentIndex]
		max_sent_length = max( max_sent_length, validation_sent.count(' ') + 1 )

	# prepare input batch
	inference_batch_size = len( list_validation )
	encoder_emb_inp = encoderNetwork.encoder_embedding( validation_tensor )

	# call encoder using input batch
	encoder_initial_cell_state = [ tf.zeros((inference_batch_size, RNN_UNITS)),tf.zeros((inference_batch_size, RNN_UNITS)) ]
	a, a_tx, c_tx = encoderNetwork.encoder_rnnlayer( encoder_emb_inp, initial_state = encoder_initial_cell_state, training = False )
	encoder_state=[a_tx, c_tx]

	# prepare decoder input batch using a sequence of <start> tokens to start processing seq
	decoder_input = tf.expand_dims( [ tokenizer_target.word_index['<start>'] ] * inference_batch_size,1 )
	decoder_emb_inp = decoderNetwork.decoder_embedding(decoder_input)

	# create decoder
	decoder_instance, decoder_initial_state = decode_func( decoderNetwork, a, inference_batch_size, encoder_state )
	decoder_embedding_matrix = decoderNetwork.decoder_embedding.variables[0]

	(first_finished, first_inputs, first_state) = decoder_instance.initialize(
		decoder_embedding_matrix,
		start_tokens = tf.fill( [inference_batch_size],tokenizer_target.word_index['<start>'] ),
		end_token = tokenizer_target.word_index['<end>'],
		initial_state = decoder_initial_state )

	# limit prediction to a maximum of 2 * source sent length, as a simple heuristic to limit problems when they occur in translation
	maximum_iterations = max_sent_length * 2
	if maximum_iterations > MAX_SENT_LENGTH_TARGET :
		maximum_iterations = MAX_SENT_LENGTH_TARGET

	# translate
	list_translated = translate_func( first_inputs, first_state, inference_batch_size, maximum_iterations, decoder_instance, tokenizer_source, tokenizer_target )

	# add period to end so sacrebleu knows it's a sentence
	for nIndex in range(len(list_translated)) :
		if list_translated[nIndex].endswith('.') == False :
			list_translated[nIndex] = list_translated[nIndex] + '.'
    
	logger.info( 'validation = ' + repr(list_validation) )
	logger.info( 'gold = ' + repr(list_gold) )

	# post processing of list_translated (if any)
	if postprocess_func != None :
		bleu = sacrebleu.corpus_bleu( list_translated, [ list_gold ] )
		logger.info( 'translation (before postprocessing) = ' + repr(list_translated) )
		logger.info( 'BLEU score (before postprocessing) = ' + str(bleu.score) )        
		postprocess_func( sent_list_source = list_validation, sent_list_translated = list_translated, dict_lookup = dict_lookup_dictionary )

	# calculate BLEU score for this corpus
	bleu = sacrebleu.corpus_bleu( list_translated, [ list_gold ] )
	logger.info( 'translation = ' + repr(list_translated) )
	logger.info( 'BLEU score = ' + str(bleu.score) )


At this stage we will use a basic decoder function that uses tfa.seq2seq.BasicDecoder() and tfa.seq2seq.GreedyEmbeddingSampler() to do a simple argmax, choosing the most likely next word at each position in the sequence. This function sets the decoder up ready for stepping on encoded sequences to make output token predictions.
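
Greedy decoding itself is very simple, as the toy sketch below shows: at every step take the highest scoring candidate, feed it back in as the next input, and stop at the end token. The score table and function name are hypothetical; a real decoder conditions on the full LSTM state and attention context, not just on the previous token.

```python
def greedy_decode(next_token_scores, start_token, end_token, max_steps):
    # next_token_scores maps the previous token to a dict of candidate scores
    tokens = []
    prev = start_token
    for _ in range(max_steps):
        scores = next_token_scores[prev]
        prev = max(scores, key=scores.get)  # argmax over the candidates
        if prev == end_token:
            break
        tokens.append(prev)
    return tokens

toy_scores = {
    '<start>': {'ich': 0.7, 'mag': 0.2, '<end>': 0.1},
    'ich': {'mag': 0.6, 'ich': 0.1, '<end>': 0.3},
    'mag': {'tee': 0.5, '<end>': 0.4, 'ich': 0.1},
    'tee': {'<end>': 0.9, 'mag': 0.1},
}

translation = greedy_decode(toy_scores, '<start>', '<end>', max_steps=10)
print(translation)
```

Greedy decoding never revisits a choice, so an early mistake cannot be undone later in the sentence.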

In [None]:
def task1_decoder( decoderNetwork, a, inference_batch_size, encoder_state ) :
	# use a basic decoder with greedy sampling
	greedy_sampler = tfa.seq2seq.GreedyEmbeddingSampler()
	decoder_instance = tfa.seq2seq.BasicDecoder( cell = decoderNetwork.rnn_cell, sampler = greedy_sampler, output_layer=decoderNetwork.dense_layer )
	encoder_memory = a

	# setup attention layer
	decoderNetwork.attention_mechanism.setup_memory( encoder_memory )

	# init decoder
	decoder_initial_state = decoderNetwork.build_decoder_initial_state( batch_size = inference_batch_size, encoder_state=encoder_state, Dtype=tf.float32 )

	return decoder_instance, decoder_initial_state

The translate function takes the decoder, steps through the batch of source sequences and predicts the output sequence for each. Stepping is done at a batch level, and a sequence of token indexes is predicted based on the current state (i.e. step position and the tokens that have come previously). The final translated words are looked up using the target tokenizer word index, terminating each sentence whenever an <end> token is predicted.
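
The final lookup step can be sketched in isolation: keep the predicted indexes up to the first stop token, then map each index back to its word. The helper name, the toy `index_word` table and the token ids are hypothetical.

```python
import itertools

def ids_to_sentence(token_ids, index_word, stop_ids):
    # keep ids up to (not including) the first stop id
    kept = itertools.takewhile(lambda i: i not in stop_ids, token_ids)
    # ids missing from the vocabulary are rendered as <unk>
    return ' '.join(index_word.get(i, '<unk>') for i in kept)

index_word = {1: '<start>', 2: '<end>', 3: 'ich', 4: 'mag', 5: 'tee'}
sentence = ids_to_sentence([3, 4, 5, 2, 4, 4], index_word, stop_ids={0, 2})
print(sentence)
```

Anything predicted after the first stop token is discarded, which is why the decoder can safely run for a fixed number of steps across the whole batch.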

In [None]:
def task1_translate( first_inputs, first_state, inference_batch_size, maximum_iterations, decoder_instance, tokenizer_source, tokenizer_target ) :
	# loop on each seq position (i.e. token) and predict the next token (i.e. translation)
	inputs = first_inputs
	state = first_state
	predictions = np.empty( (inference_batch_size,0), dtype = np.int32 )

	# make token predictions for the input batch
	nToken = 0
	while nToken < maximum_iterations :
		# make final predictions for the batch at this seq position
		outputs, next_state, next_inputs, finished = decoder_instance.step( nToken, inputs, state, training = False )
		inputs = next_inputs
		state = next_state
		outputs = np.expand_dims(outputs.sample_id,axis = -1)
		predictions = np.append(predictions, outputs, axis = -1)
		nToken += 1

	vocab_source_size = len(tokenizer_source.word_index)+1
	vocab_target_size = len(tokenizer_target.word_index)+1
	logger.info( 'tokenizer vocab size = ' + repr( (vocab_source_size,vocab_target_size) ) )

	# convert the predicted token index values to strings for final translation of the batch
	list_translated = []
	for i in range(len(predictions)):
		# get prediction for this sent index
		line = predictions[i,:]

		# get seq up to first <end> or <pad> token
		seq = list( itertools.takewhile( lambda index: index not in [ tokenizer_target.word_index['<end>'],tokenizer_target.word_index['<pad>'] ], line) )

		# lookup words
		listTokens = []
		for nToken in seq :
			if nToken != 0 :
				listTokens.append( tokenizer_target.index_word[nToken] )
			else :
				listTokens.append( '<unk>' )
		list_translated.append( ' '.join( listTokens ) )

	return list_translated


**EXERCISE Task 1**

Execute code for task 1 using the function call below. The translations and BLEU score will be very bad unless you run with a lot of sentences, so just check it generates a translation for now.

Whilst the model is training, take some time to step through the code used, including the functions in the library file lab_seq2seq_nmt_lib, to begin to build your understanding of what each is doing. This will take time, and after the lab you should read the linked papers, which describe the models used in more detail.

**END EXERCISE Task 1**

The next tasks will ask you to change some of these functions to boost translation performance over this basic setup.

In [None]:
exec_task( preprocess_func = task1_preprocess, tokenize_func = task1_tokenize, decode_func = task1_decoder, translate_func = task1_translate, postprocess_func = None )


### Task 2 - use MOSES normalization and tokenization to cleanup the bitext

Further reading:
    [mosestokenizer](https://pypi.org/project/mosestokenizer/)
    [MOSES statistical machine translation system](http://www.statmt.org/moses/)
    
The regex-based approach in task1_preprocess() is a crude, language independent way to clean and tokenize text. We normalized character sets, split tokens on whitespace, converted everything to lowercase and removed punctuation, non-printable characters and numbers. This loses a lot of valuable information we could train on.

The statistical machine translation community has developed, over many years, a set of perl scripts that do a much more sophisticated job and can handle grammatical structures in many languages. These are shipped with the MOSES toolkit. We can use these scripts via python libraries such as [mosestokenizer](https://pypi.org/project/mosestokenizer/).

In [None]:
def task2_preprocess( bitext_source_file, bitext_target_file, logger ) :
	# load bitext using moses perl scripts (tokenizer and normalization)
	list_source_sents = lab_seq2seq_nmt_lib.load_corpus( filename = bitext_source_file, logger = logger, normalize_func = lab_seq2seq_nmt_lib.normalize_sent_moses, word_tokenizer = mosestokenizer.MosesTokenizer( 'en' ), punc_normalization = mosestokenizer.MosesPunctuationNormalizer('en') )

	# translate mode?
	if bitext_target_file == None:
		return list_source_sents, None, None, None

	list_target_sents = lab_seq2seq_nmt_lib.load_corpus( filename = bitext_target_file, logger = logger, normalize_func = lab_seq2seq_nmt_lib.normalize_sent_moses, word_tokenizer = mosestokenizer.MosesTokenizer( 'de' ), punc_normalization = mosestokenizer.MosesPunctuationNormalizer('de') )
	if len(list_source_sents) != len(list_target_sents) :
		raise Exception( 'bitext size mismatch' )

	# make bitext
	list_bitext = []
	for nSentIndex in range(len(list_source_sents)) :
		list_bitext.append( ( list_source_sents[nSentIndex], list_target_sents[nSentIndex] ) )
	
	list_source_sents = []
	list_target_sents = []
	for nSentIndex in range(len(list_bitext)) :
		list_source_sents.append( list_bitext[nSentIndex][0] )
		list_target_sents.append( list_bitext[nSentIndex][1] )

	return list_source_sents, list_target_sents, None, None


**EXERCISE Task 2**

Add the logger statements below to exec_task() and run it. Look at the example source (EN) and target (DE) sentences printed.

    logger.info( 'example source sent = ' + repr(list_source_sents[-1]) )
    logger.info( 'example target sent = ' + repr(list_target_sents[-1]) )

Now execute using task2_preprocess() as below to run task 2 (moses preprocessing). Look at and understand the difference between the sentences preprocessed using regex and moses. Print a few more sentences, especially those with punctuation, symbolic characters and quotations.

    Task 1 (regex)
    
    example source sent = 'doubtless there exists in this world precisely the right woman for any given man to marry and vice versa but when you consider that a human being has the opportunity of being acquainted with only a few hundred people and out of the few hundred that there are but a dozen or less whom he knows intimately and out of the dozen one or two friends at most it will easily be seen when we remember the number of millions who inhabit this world that probably since the earth was created the right man has never yet met the right woman'

    example target sent = 'ohne zweifel findet sich auf dieser welt zu jedem mann genau die richtige ehefrau und umgekehrt wenn man jedoch in betracht zieht dass ein mensch nur gelegenheit hat mit ein paar hundert anderen bekannt zu sein von denen ihm nur ein dutzend oder weniger nahesteht darunter hochstens ein oder zwei freunde dann erahnt man eingedenk der millionen einwohner dieser weltleicht dass seit erschaffung ebenderselben wohl noch nie der richtige mann der richtigen frau begegnet ist'

    Task 2 (moses)

    example source sent = 'Doubtless there exists in this world precisely the right woman for any given man to marry and vice versa ; but when you consider that a human being has the opportunity of being acquainted with only a few hundred people , and out of the few hundred that there are but a dozen or less whom he knows intimately , and out of the dozen , one or two friends at most , it will easily be seen , when we remember the number of millions who inhabit this world , that probably , since the earth was created , the right man has never yet met the right woman'

    example target sent = 'Ohne Zweifel findet sich auf dieser Welt zu jedem Mann genau die richtige Ehefrau und umgekehrt ; wenn man jedoch in Betracht zieht , dass ein Mensch nur Gelegenheit hat , mit ein paar hundert anderen bekannt zu sein , von denen ihm nur ein Dutzend oder weniger nahesteht , darunter höchstens ein oder zwei Freunde , dann erahnt man eingedenk der Millionen Einwohner dieser Welt leicht , dass seit Erschaffung ebenderselben wohl noch nie der richtige Mann der richtigen Frau begegnet ist'

**END EXERCISE Task 2**

In [None]:
exec_task( preprocess_func = task2_preprocess, tokenize_func = task1_tokenize, decode_func = task1_decoder, translate_func = task1_translate, postprocess_func = None )


### Task 3 - reversing the input sequence for better performance

Further reading:
    [Luong NMT paper](https://www.aclweb.org/anthology/D15-1166/)

It has been shown that reversing the source input sequence (i.e. the words of the EN sentence) can deliver better LSTM memory utilization and result in better translations of longer sentences.

**EXERCISE Task 3**

Implement your own version of the task1_tokenize() function that reverses the input sentence sequence. For example, a source sentence ```'<start> hello there <end>'``` should become a tensor with word indexes for ```'<end> there hello <start> <pad> <pad> ...'```.
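The reversal and padding step can be sketched in plain Python as follows. The helper reverse_and_pad() and the toy vocabulary are hypothetical; the sketch assumes index 0 is reserved for the pad token and that truncation happens after the reversal, which may differ from the lab library.

```python
def reverse_and_pad(token_ids, max_len, pad_id=0):
    # hypothetical helper: reverse the token order, truncate to max_len, then pad on the right
    reversed_ids = list(reversed(token_ids))[:max_len]
    return reversed_ids + [pad_id] * (max_len - len(reversed_ids))

# hypothetical vocabulary, index 0 reserved for <pad>
word_index = {'<pad>': 0, '<start>': 1, '<end>': 2, 'hello': 3, 'there': 4}
ids = [word_index[word] for word in '<start> hello there <end>'.split()]
print(reverse_and_pad(ids, max_len=6))  # [2, 4, 3, 1, 0, 0]
```

The padding stays on the right, so only the real tokens change order.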

You can reuse lab_seq2seq_nmt_lib.train_tokenizer() but you should not use lab_seq2seq_nmt_lib.apply_tokenization() for this exercise.

Confirm your new function works by running exec_task() and providing it with your new function's handle. Log the source and target tensors that are generated to confirm the word indexes of the source tensor are reversed correctly.

A solution is below in task3_tokenize() for reference.

**END EXERCISE Task 3**

In [None]:
def task3_tokenize( list_source_sents, list_target_sents, dict_alignment_matrix, logger, use_tokenizer_source = None, use_tokenizer_target = None ) :
	# train tokenizers (or use existing one if we are in translate phase)
	if use_tokenizer_source == None :
		tokenizer_source = lab_seq2seq_nmt_lib.train_tokenizer( list_lines = list_source_sents, logger = logger )
	else :
		tokenizer_source = use_tokenizer_source
	if use_tokenizer_target == None :
		tokenizer_target = lab_seq2seq_nmt_lib.train_tokenizer( list_lines = list_target_sents, logger = logger )
	else :
		tokenizer_target = use_tokenizer_target

	# shuffle corpus and truncate if needed
	if list_target_sents != None :
		bitext = list( zip( list_source_sents, list_target_sents ) )
		random.shuffle( bitext )
		if DEBUG_TRUNC_SENTS == None :
			nMax = len(bitext)
		else :        
			nMax = DEBUG_TRUNC_SENTS
		list_source_sents = []
		list_target_sents = []
		for nSentIndex in range(nMax) :
			list_source_sents.append( bitext[nSentIndex][0] )
			list_target_sents.append( bitext[nSentIndex][1] )
	else :
		if DEBUG_TRUNC_SENTS == None :
			nMax = len(list_source_sents)
		else :        
			nMax = DEBUG_TRUNC_SENTS
		list_source_sents = list_source_sents[:nMax]

	# apply tokenizers (reverse the source sent)
	source_tensor_train = lab_seq2seq_nmt_lib.apply_tokenization( list_sents = list_source_sents, tokenizer = tokenizer_source, max_sent_length = MAX_SENT_LENGTH_SOURCE, reverse_seq = True, logger = logger )

	# translate mode?
	if list_target_sents == None :
		return tokenizer_source, tokenizer_target, source_tensor_train, None

	target_tensor_train = lab_seq2seq_nmt_lib.apply_tokenization( list_sents = list_target_sents, tokenizer = tokenizer_target, max_sent_length = MAX_SENT_LENGTH_TARGET, reverse_seq = False, logger = logger )

	return tokenizer_source, tokenizer_target, source_tensor_train, target_tensor_train


In [None]:
exec_task( preprocess_func = task2_preprocess, tokenize_func = task3_tokenize, decode_func = task1_decoder, translate_func = task1_translate, postprocess_func = None )

### Task 4 - replace the basic decoder with a beam search decoder for better performance

Further reading:
    [BeamSearchDecoder](https://www.tensorflow.org/addons/api_docs/python/tfa/seq2seq/BeamSearchDecoder)
    [Beam search tutorial](https://towardsdatascience.com/word-sequence-decoding-in-seq2seq-architectures-d102000344ad)
    [seq2seq tutorial using beam search decoder](https://github.com/dhirensk/ai/blob/master/English_to_French_seq2seq_tf_2_0_withAttention.ipynb)

It has been shown that using a beam search decoder will provide better results than a basic argmax decoder. This is because in sequence to sequence learning the token predictions are based on the information available at the current position in the sequence. With an argmax approach, once a token is chosen that choice is locked in and cannot be revisited later if, further down the sequence, it no longer makes sense. A beam search allows several alternative sequence fragments to be considered at once, with the ability to drop sequence options that become unlikely as decoding progresses.
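The difference can be seen on a toy example. The sketch below is not the tfa.seq2seq.BeamSearchDecoder implementation; TOY_MODEL is a made-up table of next-token probabilities and beam_search() is a hypothetical helper. A beam width of 1 behaves like the argmax decoder.

```python
import math

# hypothetical toy model: probability of the next token given the previous token
TOY_MODEL = {
    '<start>': {'a': 0.6, 'b': 0.4},
    'a': {'x': 0.5, 'y': 0.5},
    'b': {'x': 0.9, 'y': 0.1},
}

def beam_search(beam_width, num_steps):
    # each hypothesis is (token list, cumulative log probability)
    beams = [(['<start>'], 0.0)]
    for _ in range(num_steps):
        candidates = []
        for tokens, score in beams:
            for token, prob in TOY_MODEL[tokens[-1]].items():
                candidates.append((tokens + [token], score + math.log(prob)))
        # keep only the beam_width most likely hypotheses
        candidates.sort(key=lambda entry: entry[1], reverse=True)
        beams = candidates[:beam_width]
    # return the best hypothesis without the <start> token
    return beams[0][0][1:]

print(beam_search(beam_width=1, num_steps=2))  # ['a', 'x'] (probability 0.30)
print(beam_search(beam_width=2, num_steps=2))  # ['b', 'x'] (probability 0.36)
```

With a width of 1 the early choice of 'a' is locked in, whereas a width of 2 keeps 'b' alive long enough to find the more probable sequence.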

**EXERCISE Task 4**

Read the links in the above further reading section and try to write your own decoder() and translate() functions to use the tfa.seq2seq.BeamSearchDecoder class.

Confirm the decoder is working correctly by logging the partial prediction sets as the decoder steps through the batch of inputs. Use a beam width of 5 to make it easy to log. You can look up predicted token indexes in the index_word dict of the tokenizer instance to log the words as well as the word indices.

Observe how the 5 possible sequence fragments are updated each step and the less likely ones removed.

A solution is below in task4_decoder() and task4_translate() for reference.

**END EXERCISE Task 4**

In [None]:
def task4_decoder( decoderNetwork, a, inference_batch_size, encoder_state ) :
	# beam length normalization (avoids short sequences being favoured: probabilities are multiplied, so a longer sequence means a lower probability)
	# length normalization - see https://opennmt.net/OpenNMT/translation/beam_search/
	# length_penalty = (5 + length_seq)**penalty / 6**penalty
	# scores = logprob / length_penalty
	decoder_instance = tfa.seq2seq.BeamSearchDecoder( cell = decoderNetwork.rnn_cell, beam_width = BEAM_WIDTH, output_layer=decoderNetwork.dense_layer, length_penalty_weight = BEAM_LENGTH_NORM_WEIGHT )
	encoder_memory = tfa.seq2seq.tile_batch( a, multiplier=BEAM_WIDTH )

	# setup attention layer
	decoderNetwork.attention_mechanism.setup_memory( encoder_memory )

	# init decoder
	decoder_initial_state = decoderNetwork.rnn_cell.get_initial_state( batch_size = inference_batch_size * BEAM_WIDTH, dtype = tf.float32 )
	tiled_encoder_final_state = tfa.seq2seq.tile_batch( encoder_state, multiplier=BEAM_WIDTH )
	decoder_initial_state = decoder_initial_state.clone( cell_state=tiled_encoder_final_state )

	return decoder_instance, decoder_initial_state
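
The length normalization formula quoted in the comments of task4_decoder() can be checked by hand with a small worked example. This is a sketch of the formula only; length_normalized_score() is a hypothetical helper and the log probabilities are made up.

```python
def length_normalized_score(log_prob, length, penalty_weight):
    # hypothetical helper implementing the formula from the comments above
    # length_penalty = (5 + length)**penalty / 6**penalty
    length_penalty = ((5.0 + length) / 6.0) ** penalty_weight
    # scores = logprob / length_penalty
    return log_prob / length_penalty

# made-up values: a short sequence and a longer, less probable one
short_score = length_normalized_score(log_prob=-4.0, length=3, penalty_weight=1.0)
long_score = length_normalized_score(log_prob=-6.0, length=9, penalty_weight=1.0)
print(short_score, long_score)
```

With a penalty weight of 1.0 the longer sequence ends up with the higher score even though its raw log probability is lower; with a weight of 0.0 the penalty is 1 and the raw log probabilities are compared unchanged.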


In [None]:
def task4_translate( first_inputs, first_state, inference_batch_size, maximum_iterations, decoder_instance, tokenizer_source, tokenizer_target ) :
	inputs = first_inputs
	state = first_state
	predictions = np.empty((inference_batch_size, BEAM_WIDTH,0), dtype = np.int32)
	beam_scores =  np.empty((inference_batch_size, BEAM_WIDTH,0), dtype = np.float32)
	finished = False

	nToken = 0
	while (nToken < maximum_iterations) and (finished == False) :
		beam_search_outputs, next_state, next_inputs, finished_outputs = decoder_instance.step( nToken, inputs, state, training = False )
		inputs = next_inputs
		state = next_state
		outputs = np.expand_dims(beam_search_outputs.predicted_ids,axis = -1)
		scores = np.expand_dims(beam_search_outputs.scores,axis = -1)
		predictions = np.append(predictions, outputs, axis = -1)
		beam_scores = np.append(beam_scores, scores, axis = -1)

		# check if all beams have finished (numpy array of bool, one for each beam)
		if finished_outputs.numpy().all() == True :
			finished = True
		else :
			finished = False

		nToken += 1

	list_translated = []
	for i in range(len(predictions)):
		output_beams_per_sample = predictions[i,:,:]
		score_beams_per_sample = beam_scores[i,:,:]
		beam_result = zip(output_beams_per_sample,score_beams_per_sample)

		listTopK = []
		for beam, score in beam_result :
			# get seq up to first <end> or <pad> token
			# note: the decoder should already have stopped when it reached the <end> so this is really about removing erroneous <pad> sequences
			seq = list( itertools.takewhile( lambda index: index not in [tokenizer_target.word_index['<end>'],tokenizer_target.word_index['<pad>']], beam) )

			listTokens = []
			for nToken in seq :
				if nToken != 0 :
					listTokens.append(  tokenizer_target.index_word[nToken] )
				else :
					listTokens.append( '<unk>' )
			strSent = ' '.join( listTokens )

			score_indexes = np.arange(len(seq))
			beam_score = score[score_indexes].sum()

			listTopK.append( ( strSent, beam_score ) )
		
		listTopK = sorted( listTopK, key=lambda entry: entry[1], reverse=True )

		# take best result
		list_translated.append( listTopK[0][0] )

	return list_translated

In [None]:
exec_task( preprocess_func = task2_preprocess, tokenize_func = task3_tokenize, decode_func = task4_decoder, translate_func = task4_translate, postprocess_func = None )

## Part 2

### Task 5 - use statistical phrase alignment to create a rare word translation lookup dictionary

Further reading:
    [fast_align](https://github.com/clab/fast_align)
    [Dyer 2013 paper](https://www.aclweb.org/anthology/N13-1073/)

One of the major problems with NMT is that the vocabulary size has to be limited, otherwise the models will not fit into GPU memory. In 2020 vocabulary sizes of 50k were typically used as an upper limit for seq2seq models. On a realistic bitext such as [WikiMatrix](https://arxiv.org/abs/1907.05791) there are millions of sentences (e.g. EN to RU is 1.6 million sentences) with hundreds of thousands of words (e.g. EN to RU has TODO words). This means only the more frequent words that make it into the 50k vocabulary will be trained on. Any other word will be labelled as ```<unk>```. Whilst we can still get good BLEU scores, realistic translation systems need to be able to handle rare words and untranslatable symbols in an efficient way.

In this task we will use the well-established [fast_align](https://github.com/clab/fast_align) statistical phrase alignment tool to generate an alignment matrix for our bitext training corpus. We will then use this to make a lookup dictionary containing DE words which are often found to be aligned to EN words. In the next task, this lookup dictionary will allow us to translate ```<unk>``` tokens by taking the predicted aligned source word and looking up its translation in the dictionary we compute in this task.
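The effect of a limited vocabulary can be sketched with a tiny corpus. The helpers build_vocab() and replace_rare() below are hypothetical and are not the tokenizer used in the lab; they only illustrate how words outside the top N end up as unk tokens.

```python
from collections import Counter

def build_vocab(sentences, max_vocab_size):
    # hypothetical helper: count token frequencies over whitespace-tokenized sentences
    counts = Counter(token for sent in sentences for token in sent.split())
    # keep only the most frequent tokens
    return {token for token, _ in counts.most_common(max_vocab_size)}

def replace_rare(sentence, vocab):
    # hypothetical helper: any token outside the vocabulary becomes <unk>
    return ' '.join(token if token in vocab else '<unk>' for token in sentence.split())

corpus = ['the cat sat', 'the dog sat', 'the axolotl swam']
vocab = build_vocab(corpus, max_vocab_size=2)
print(replace_rare('the axolotl sat', vocab))  # the <unk> sat
```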

**EXERCISE Task 5**

Read the readme of [fast_align](https://github.com/clab/fast_align), install and compile it.

Write a new preprocessing function that will save the moses tokenized bitext file in a format suitable for fast_align. Run it and generate the fast_align corpus file.

Run fast_align and pipe the output to an alignment matrix file.

    ./fast_align -i bitext_for_fast_align.txt -d -o -v > bitext_nmt_alignment_matrix.txt

Example alignment file format:

    0-0 2-1 16-2 19-3 3-4 ...
    0-0 1-1 4-2 5-3 2-4 6-5 ...
    0-0 4-1 2-2 1-3 12-4 ...

Update your new preprocessing function so it can read in the alignment matrix file as a python dict. e.g. ```{ sent_index : { source_token_index : [ target_token_index1, target_token_index2 ... ] } }```
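One way to read the alignment lines into this dict format is sketched below. parse_alignment_lines() is a hypothetical helper, not the lab library's read_alignment_matrix(); it assumes each pair is written as source index, hyphen, target index, as in the example alignment file above.

```python
def parse_alignment_lines(lines):
    # hypothetical helper
    # returns { sent_index : { source_token_index : [ target_token_index, ... ] } }
    dict_alignment = {}
    for sent_index, line in enumerate(lines):
        dict_sent = {}
        for pair in line.split():
            source_index, target_index = pair.split('-')
            dict_sent.setdefault(int(source_index), []).append(int(target_index))
        dict_alignment[sent_index] = dict_sent
    return dict_alignment

example_lines = ['0-0 2-1 2-2', '0-0 1-1']
print(parse_alignment_lines(example_lines))
# {0: {0: [0], 2: [1, 2]}, 1: {0: [0], 1: [1]}}
```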

Use this alignment matrix dict to generate an index of source word alignments to target words, including a frequency count of how many times the alignment occurs in the corpus. e.g. ```{ source_token : { target_token : freq_count } }```
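A minimal sketch of the counting step, assuming the bitext is a list of (source, target) sentence pairs with space-separated tokens. count_alignments() is a hypothetical helper and the two-sentence bitext is made up.

```python
from collections import defaultdict

def count_alignments(dict_alignment, list_bitext):
    # hypothetical helper
    # returns { source_token : { target_token : freq_count } }
    dict_counts = defaultdict(lambda: defaultdict(int))
    for sent_index, (source_sent, target_sent) in enumerate(list_bitext):
        source_tokens = source_sent.split()
        target_tokens = target_sent.split()
        for source_index, target_indexes in dict_alignment.get(sent_index, {}).items():
            for target_index in target_indexes:
                # skip alignments that point outside either sentence
                if source_index < len(source_tokens) and target_index < len(target_tokens):
                    dict_counts[source_tokens[source_index]][target_tokens[target_index]] += 1
    return dict_counts

toy_bitext = [('I ran', 'Ich rannte'), ('I see', 'Ich sehe')]
toy_alignment = {0: {0: [0], 1: [1]}, 1: {0: [0], 1: [1]}}
toy_counts = count_alignments(toy_alignment, toy_bitext)
print(toy_counts['I']['Ich'])  # 2
```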

Finally make a translation lookup index for each source_token where there is an aligned target_token that has an occurrence frequency of more than 100. If there are multiple aligned tokens to choose from, pick the most frequent one. e.g. ```{ source_token : target_token }```
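The selection rule can be sketched as below, starting from a dict in the ```{ source_token : { target_token : freq_count } }``` format. make_lookup() is a hypothetical helper and the counts are made up.

```python
def make_lookup(dict_counts, freq_threshold):
    # hypothetical helper
    # returns { source_token : target_token }
    dict_lookup = {}
    for source_token, dict_targets in dict_counts.items():
        # pick the most frequently aligned target token
        target_token, freq = max(dict_targets.items(), key=lambda entry: entry[1])
        # keep it only if the alignment is frequent enough to trust
        if freq > freq_threshold:
            dict_lookup[source_token] = target_token
    return dict_lookup

example_counts = {'on': {'auf': 250, 'an': 120}, 'ran': {'rannte': 40}}
print(make_lookup(example_counts, freq_threshold=100))  # {'on': 'auf'}
```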

Save the translation lookup dict to disk and check it looks OK. It should have good translations from EN words to DE words.

Example translation lookup dict:

    {
    "Go": "Geh",
    "Stop": "auf",
    "on": "auf",
    "I": "Ich",
    "ran": "rannte",
    "see": "sehen",
    "try": "versuchen",
    "won": "wird",
    "me": "mir",
    "it": "es",
    ...
    }


A solution is below in task5_preprocess() for reference.

**END EXERCISE Task 5**    

In [None]:
def task5_preprocess( bitext_source_file, bitext_target_file, logger ) :
	fast_align_output = '../corpus/bitext_nmt_alignment_matrix.txt'
	bitext_file = './model/bitext_for_fast_align.txt'

	# load bitext using moses perl scripts (tokenizer and normalization)
	list_source_sents = lab_seq2seq_nmt_lib.load_corpus( filename = bitext_source_file, logger = logger, normalize_func = lab_seq2seq_nmt_lib.normalize_sent_moses, word_tokenizer = mosestokenizer.MosesTokenizer( 'en' ), punc_normalization = mosestokenizer.MosesPunctuationNormalizer('en') )

	# translate mode?
	if bitext_target_file == None:
		return list_source_sents, None, None, None

	list_target_sents = lab_seq2seq_nmt_lib.load_corpus( filename = bitext_target_file, logger = logger, normalize_func = lab_seq2seq_nmt_lib.normalize_sent_moses, word_tokenizer = mosestokenizer.MosesTokenizer( 'de' ), punc_normalization = mosestokenizer.MosesPunctuationNormalizer('de') )
	if len(list_source_sents) != len(list_target_sents) :
		raise Exception( 'bitext size mismatch' )

	# make bitext
	list_bitext = []
	for nSentIndex in range(len(list_source_sents)) :
		list_bitext.append( ( list_source_sents[nSentIndex], list_target_sents[nSentIndex] ) )

	# serialize full bitext using fast_align format so we can create the alignment matrix needed next
	# run this code once to create bitext, then run fast_align to make file 'bitext_nmt_alignment_matrix.txt', then run this code again to process alignment file
	# e.g. ./fast_align -i bitext_for_fast_align.txt -d -o -v > bitext_nmt_alignment_matrix.txt
	logger.info( 'saving bitext to file = ' + bitext_file )
	write_handle = codecs.open( bitext_file, 'w', 'utf-8', errors = 'strict' )
	for list_sent_pair in list_bitext :
		write_handle.write( list_sent_pair[0] + ' ||| ' + list_sent_pair[1] + '\n' )
	write_handle.close()

	# load fast_align alignment matrix file from disk
	if os.path.exists( fast_align_output ) == False :
		raise Exception('fast_align output file missing (run fast_align on the file ' + bitext_file + ' to make required file ' + fast_align_output + ')' )
	dict_alignment_matrix = lab_seq2seq_nmt_lib.read_alignment_matrix( file = fast_align_output, logger = logger )

	# create a rare word lookup dict from source to target statistically frequent word alignments
	dict_lookup_dictionary = lab_seq2seq_nmt_lib.create_lookup_dict( align_matrix = dict_alignment_matrix, bitext = list_bitext, freq_threshold = 100, logger = logger )

	list_source_sents = []
	list_target_sents = []
	for nSentIndex in range(len(list_bitext)) :
		list_source_sents.append( list_bitext[nSentIndex][0] )
		list_target_sents.append( list_bitext[nSentIndex][1] )

	return list_source_sents, list_target_sents, dict_alignment_matrix, dict_lookup_dictionary


In [None]:
exec_task( preprocess_func = task5_preprocess, tokenize_func = task3_tokenize, decode_func = task4_decoder, translate_func = task4_translate, postprocess_func = None )

### Task 6 - train model to predict positionally aware unk tokens and translate them with a lookup dictionary

Further reading:
    [Luong 2015 paper](https://www.aclweb.org/anthology/P15-1002/)

An effective approach for handling rare words ([Luong 2015 paper](https://www.aclweb.org/anthology/P15-1002/)) is to train the NMT model so that it learns to predict target ```<unk>``` tokens with an alignment index to the source sequence. For example, ```<unkpos-2>``` means an ```<unk>``` aligned to the source token 2 positions to the left of this target ```<unk>``` token. Then, as a post-processing stage, all ```<unkposN>``` tokens can be replaced with either the translation from the lookup dictionary or, by default, simply the source token. The latter is useful for words with no real translation, such as proper names or symbolic tokens.

In this task we will use the last task's [fast_align](https://github.com/clab/fast_align) statistical phrase alignment matrix and lookup dictionary to implement a rare word strategy similar to Luong.

**EXERCISE Task 6**

Read the links in the above further reading section to understand how to implement the ```<unkposN>``` strategy.

Try to write your own tokenize() function to limit the vocabulary size of the trained tokenizers (source and target) to something small like 1000 words. Then replace any source token not in this limited vocabulary with an ```<unk>``` token. Replace any target token not in the limited vocabulary with an ```<unkposN>``` token, where N is in the range [-7 ... 7]. Use the alignment matrix to look up what the relative position value should be of the source token aligned to the unk target token. If the relative alignment position is more than 7 tokens away, the alignment points to a token outside the source sequence, or no alignment is specified for that token, then default to ```<unkpos0>```. This will provide the NMT model with a training set that will teach it to predict the likely source alignment of unk tokens.
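The rule for choosing N can be sketched as a small function. unkpos_token() is a hypothetical helper, not the lab library's unkpos_replacement(). It assumes N is the source index minus the target index, so a negative N points to the left, and that the caller has already inverted the alignment dict to get the source indexes aligned to a given target token.

```python
def unkpos_token(target_index, aligned_source_indexes, source_length, max_offset=7):
    # hypothetical helper
    # no alignment recorded for this target token
    if not aligned_source_indexes:
        return '<unkpos0>'
    # assumption: use the first aligned source token if there are several
    source_index = aligned_source_indexes[0]
    offset = source_index - target_index
    # alignment outside the source sequence, or further away than max_offset
    if source_index < 0 or source_index >= source_length or abs(offset) > max_offset:
        return '<unkpos0>'
    return '<unkpos' + str(offset) + '>'

print(unkpos_token(5, [3], source_length=10))   # <unkpos-2>
print(unkpos_token(1, [12], source_length=20))  # <unkpos0>
```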

Write your own post-processing function to replace target ```<unkposN>``` tokens with a value from the lookup dictionary or, if no translation is available, simply copy the aligned source token. This function should have the arguments below and will replace token values directly within the reference list sent_list_translated.

		postprocess_func( sent_list_source, sent_list_translated, dict_lookup )
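
A minimal sketch of such a post-processing function is shown below. replace_unkpos() is a hypothetical name and is not the lab library's lookup_unkposN(). It assumes both lists hold space-separated sentence strings, that the source sentences are in their original (not reversed) order, and that N is the source index minus the target index.

```python
import re

UNKPOS_PATTERN = re.compile(r'^<unkpos(-?\d+)>$')

def replace_unkpos(sent_list_source, sent_list_translated, dict_lookup):
    # hypothetical helper following the postprocess_func() argument order
    for sent_index, translated in enumerate(sent_list_translated):
        source_tokens = sent_list_source[sent_index].split()
        output_tokens = []
        for target_index, token in enumerate(translated.split()):
            match = UNKPOS_PATTERN.match(token)
            if match is None:
                output_tokens.append(token)
                continue
            source_index = target_index + int(match.group(1))
            if 0 <= source_index < len(source_tokens):
                source_token = source_tokens[source_index]
                # use the lookup translation, else copy the source token
                output_tokens.append(dict_lookup.get(source_token, source_token))
            else:
                # aligned position is outside the source sentence
                output_tokens.append('<unk>')
        # overwrite in place so the caller's list is updated
        sent_list_translated[sent_index] = ' '.join(output_tokens)

source = ['Stuart ran on']
translated = ['<unkpos0> rannte <unkpos0>']
replace_unkpos(source, translated, {'on': 'auf'})
print(translated)  # ['Stuart rannte auf']
```

The proper name has no entry in the lookup dict so it is copied across unchanged, while the aligned word 'on' is replaced by its dictionary translation.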

A solution is below in task6_tokenize() and lab_seq2seq_nmt_lib.lookup_unkposN() for reference.

**END EXERCISE Task 6**    

In [None]:
def task6_tokenize( list_source_sents, list_target_sents, dict_alignment_matrix, logger, use_tokenizer_source = None, use_tokenizer_target = None ) :
	# train tokenizers (or use existing one if we are in translate phase)
	if use_tokenizer_source == None :
		tokenizer_source = lab_seq2seq_nmt_lib.train_tokenizer_top_N( list_lines = list_source_sents, top_N = MAX_VOCAB_SIZE, is_source = True, logger = logger )
	else :
		tokenizer_source = use_tokenizer_source
	if use_tokenizer_target == None :
		tokenizer_target = lab_seq2seq_nmt_lib.train_tokenizer_top_N( list_lines = list_target_sents, top_N = MAX_VOCAB_SIZE, is_source = False, logger = logger )
	else :
		tokenizer_target = use_tokenizer_target

	# replace target rare words with an <unkposN> token so they can be looked up (using the aligned source token) in the decoding stage from the lookup dict.
	# source rare words will be replaced with <unk>
	lab_seq2seq_nmt_lib.unkpos_replacement( align_matrix = dict_alignment_matrix, list_source_sents = list_source_sents, list_target_sents = list_target_sents, source_tokenizer = tokenizer_source, target_tokenizer = tokenizer_target, logger = logger )

	# shuffle corpus and truncate if needed
	if list_target_sents != None :
		bitext = list( zip( list_source_sents, list_target_sents ) )
		random.shuffle( bitext )
		if DEBUG_TRUNC_SENTS == None :
			nMax = len(bitext)
		else :        
			nMax = DEBUG_TRUNC_SENTS
		list_source_sents = []
		list_target_sents = []
		for nSentIndex in range(nMax) :
			list_source_sents.append( bitext[nSentIndex][0] )
			list_target_sents.append( bitext[nSentIndex][1] )
	else :
		if DEBUG_TRUNC_SENTS == None :
			nMax = len(list_source_sents)
		else :        
			nMax = DEBUG_TRUNC_SENTS
		list_source_sents = list_source_sents[:nMax]
        
	# apply tokenizers (reverse the source sent)
	source_tensor_train = lab_seq2seq_nmt_lib.apply_tokenization( list_sents = list_source_sents, tokenizer = tokenizer_source, max_sent_length = MAX_SENT_LENGTH_SOURCE, reverse_seq = True, logger = logger )

	# translate mode?
	if list_target_sents == None :
		return tokenizer_source, tokenizer_target, source_tensor_train, None

	target_tensor_train = lab_seq2seq_nmt_lib.apply_tokenization( list_sents = list_target_sents, tokenizer = tokenizer_target, max_sent_length = MAX_SENT_LENGTH_TARGET, reverse_seq = False, logger = logger )

	return tokenizer_source, tokenizer_target, source_tensor_train, target_tensor_train


In [None]:
exec_task( preprocess_func = task5_preprocess, tokenize_func = task6_tokenize, decode_func = task4_decoder, translate_func = task4_translate, postprocess_func = lab_seq2seq_nmt_lib.lookup_unkposN )