# Practical 10: Neural Machine Translation
#### Ayoub Bagheri
<img src="img/uu_logo.png" alt="logo" align="right" title="UU" width="50" height="20" />

#### Applied Text Mining - Utrecht Summer School

You made it! It is the last practical of the Applied Text Mining course. 

In this practical, we will create models for neural machine translation. Today we are curious to see how a simple deep learning-based model translates a sentence into its counterpart. See these examples:

<img src="translation_example.png" />

<img src="translation_example2.png" />

The aim of this practical is to convert a Dutch sentence to its English counterpart using a Neural Machine Translation (NMT) system. We will implement this task by building a simple Sequence-to-Sequence model (a special class of Recurrent Neural Network architectures) with the help of the Keras library.

Today we will use the following libraries. Take care to have them installed!

In [1]:
import string
import re
import statistics
from numpy import array, argmax, random, take
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding, Bidirectional, RepeatVector, TimeDistributed
from keras.preprocessing.text import Tokenizer
from keras.callbacks import ModelCheckpoint
from keras.preprocessing.sequence import pad_sequences
from keras.models import load_model
from keras import optimizers
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
# matplotlib inline
pd.set_option('display.max_colwidth', 200)

### Let's get started!

1\. **In this practical we will use a data set of tab-delimited bilingual sentence pairs from http://www.manythings.org/anki/. Use the following two functions (read_text and to_lines) to read the nld.txt data set (also provided on the course webpage next to the practical link). This data set contains phrases in Dutch with their translation in English. Convert the text sequences to an array and check the first items in your array.**

In [2]:
# function to read raw text file
def read_text(filename):
    # open the file; the with-block closes it even if reading fails
    with open(filename, mode='rt', encoding='utf-8') as file:
        # read all text
        text = file.read()
    return text

In [3]:
# split a text into sentences
def to_lines(text):
    sents = text.strip().split('\n')
    sents = [i.split('\t') for i in sents]
    return sents
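A minimal sketch of how the parsed lines might be turned into an array. To stay runnable without the data file, it parses a small hypothetical in-memory sample instead of calling `read_text("nld.txt")`; the sample sentences and the English-then-Dutch column order are assumptions, so check the order in your own copy of the file.

```python
from numpy import array

def to_lines(text):
    # one list per line, one item per tab-separated field
    return [line.split('\t') for line in text.strip().split('\n')]

# Hypothetical sample standing in for the result of read_text("nld.txt")
sample = "Go.\tGa.\nHi.\tHoi.\nRun!\tRen!\n"

# Each row is one sentence pair, each column one language
nld_eng = array(to_lines(sample))

print(nld_eng.shape)
print(nld_eng[:2])
```

If your download carries extra columns (for example an attribution field), slicing with `nld_eng[:, :2]` keeps only the two sentence columns.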

# Pre-processing

2\. **Use the maketrans() function to remove punctuation from the nld_eng array. maketrans() is a static method of the built-in str type that constructs a translation table, i.e. it specifies which characters should be replaced in a string and which should be deleted from it. To apply this translation table, call the translate() method on a string.**
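As a rough illustration of the idea, the sketch below builds a translation table that deletes every character in `string.punctuation` and applies it to both columns. The two sentence pairs are a hypothetical stand-in for the real `nld_eng` array.

```python
import string
from numpy import array

# Hypothetical stand-in for the nld_eng array read from nld.txt
nld_eng = array([["Hello, world!", "Hallo, wereld!"],
                 ["How are you?", "Hoe gaat het?"]])

# The third argument of str.maketrans lists the characters to delete
table = str.maketrans('', '', string.punctuation)

# Apply the table to every sentence in both columns
for col in range(nld_eng.shape[1]):
    nld_eng[:, col] = [s.translate(table) for s in nld_eng[:, col]]

print(nld_eng)
```

Note that `string.punctuation` only covers ASCII punctuation, so typographic quotes or dashes in the data would survive this step.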

3\. **Convert all words to lowercase.**
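One possible way to lowercase the array column by column, again on a hypothetical stand-in for `nld_eng`:

```python
from numpy import array

# Hypothetical stand-in for the punctuation-free nld_eng array
nld_eng = array([["Hello world", "Hallo wereld"],
                 ["Tom is here", "Tom is hier"]])

# Lowercase every sentence in every column
for col in range(nld_eng.shape[1]):
    nld_eng[:, col] = [s.lower() for s in nld_eng[:, col]]

print(nld_eng)
```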

# Text to Sequence

4\. **What is the maximum length of a sentence in each of the Dutch and English sets? What about the average length?**
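A small sketch of how these statistics could be computed, assuming that "length" means the number of whitespace-separated words. The helper name `length_stats` and the example sentences are made up for illustration.

```python
import statistics

def length_stats(sentences):
    # sentence length = number of whitespace-separated words
    lengths = [len(s.split()) for s in sentences]
    return max(lengths), statistics.mean(lengths)

# Hypothetical cleaned sentences; in the practical these would be
# the two columns of nld_eng
eng = ["go", "how are you", "tom is here now"]
nld = ["ga", "hoe gaat het", "tom is nu hier"]

eng_max, eng_mean = length_stats(eng)
nld_max, nld_mean = length_stats(nld)
print("English: max", eng_max, "mean", round(eng_mean, 2))
print("Dutch:   max", nld_max, "mean", round(nld_mean, 2))
```

These numbers help in choosing the maximum sequence length used for padding later on.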

5\. **Use the train_test_split function from sklearn to split the data set into training (80%) and test (20%) sets.**
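The split could look roughly like this. The ten placeholder sentence pairs and the `random_state` value are assumptions made only so the sketch is reproducible.

```python
from numpy import array
from sklearn.model_selection import train_test_split

# Ten placeholder sentence pairs standing in for nld_eng
nld_eng = array([[f"english sentence {i}", f"nederlandse zin {i}"]
                 for i in range(10)])

# 80% training, 20% test; rows (sentence pairs) stay together
train, test = train_test_split(nld_eng, test_size=0.2, random_state=12)

print(train.shape, test.shape)
```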

6\. **Time to tokenize the sentences. Use the Tokenizer function from Keras and fit the sentences. Find out about the vocabulary size for the Dutch and English sets.**

7\. **Write a function to convert tokens into sequences using an argument for maximum sentence length. Other input arguments to this function are tokenizer and sentences, and its output will be sequences of tokens.**
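A hedged sketch of the tokenization and encoding steps. The helper names `tokenization` and `encode_sequences` and the three Dutch sentences are illustrative choices, and the imports go through `tensorflow.keras`, which is an assumption about the installed version; if the standalone `keras.preprocessing` imports from the first cell work in your environment, use those instead.

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

def tokenization(lines):
    # fit a word-level tokenizer on the given sentences
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer

def encode_sequences(tokenizer, length, lines):
    # map words to integer ids, then pad with zeros at the end
    seq = tokenizer.texts_to_sequences(lines)
    return pad_sequences(seq, maxlen=length, padding='post')

# Hypothetical training sentences
nld_sentences = ["ik ben tom", "tom is hier", "ik ben hier"]

nld_tokenizer = tokenization(nld_sentences)
# +1 because index 0 is reserved for padding
nld_vocab_size = len(nld_tokenizer.word_index) + 1
print("Dutch vocabulary size:", nld_vocab_size)

train_X = encode_sequences(nld_tokenizer, 5, nld_sentences)
print(train_X)
```

With its default settings, `pad_sequences` truncates sentences longer than `maxlen` from the front, so a maximum length close to the longest sentence avoids losing words.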

8\. **Convert your tokenized training data into sequences. Use a maximum length of 20 and name the resulting arrays train_X and train_Y.**

9\. **In the same way, convert your tokenized test data into sequences and name the resulting arrays test_X and test_Y.**

# Neural Network Model

10\. **Define a Sequence-to-Sequence (Seq2Seq) model architecture using an embedding layer as the input layer, an LSTM layer as the encoder, and another LSTM layer followed by a Dense layer as the decoder. Make this a function and name it build_model(). Define different input arguments for your function, including the embedding_size and the number of LSTM units.**

11\. **Create a model by calling the function with embedding_size of 300 and 512 units for the LSTM layers.**
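One way such a function might look. The argument names other than `embedding_size`, the use of a `RepeatVector` layer to hand the encoder state to the decoder, and the vocabulary sizes in the call are all assumptions for this sketch, not the course solution.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, RepeatVector, Dense

def build_model(in_vocab, out_vocab, out_timesteps, embedding_size, units):
    model = Sequential()
    # input layer: map Dutch word ids to dense vectors
    model.add(Embedding(in_vocab, embedding_size))
    # encoder: compress the whole sentence into one vector
    model.add(LSTM(units))
    # repeat that vector once per output time step
    model.add(RepeatVector(out_timesteps))
    # decoder: emit one hidden state per output time step
    model.add(LSTM(units, return_sequences=True))
    # probability distribution over the English vocabulary
    model.add(Dense(out_vocab, activation='softmax'))
    return model

# Hypothetical vocabulary sizes; use the ones from your tokenizers
model = build_model(in_vocab=5000, out_vocab=4000, out_timesteps=20,
                    embedding_size=300, units=512)
```

The final layer outputs one softmax over the target vocabulary per time step, which is why the targets can stay as integer ids when a sparse loss is used.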

12\. **Compile the model with the RMSprop optimizer and sparse_categorical_crossentropy for loss.**

13\. **Fit the model with your desired number of epochs (e.g. 1 :)), a validation_split of 0.2, and a batch_size of 128. If training takes too long, you can use smaller values for the number of LSTM units (e.g. 100) and the embedding size (e.g. 50).**
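The compile and fit steps could be sketched as follows. Everything about the data here is a placeholder: the vocabulary sizes are invented and `train_X` / `train_Y` are random integers, used only so the calls run end to end. The learning rate and the `accuracy` metric are also assumptions.

```python
import numpy as np
from tensorflow.keras import optimizers
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, RepeatVector, Dense

# Placeholder sizes and random integer "sentences" (not real data)
in_vocab, out_vocab, max_len = 60, 50, 20
rng = np.random.default_rng(0)
train_X = rng.integers(1, in_vocab, size=(40, max_len))
train_Y = rng.integers(1, out_vocab, size=(40, max_len))

# Small version of the Seq2Seq model (embedding 50, 100 LSTM units)
model = Sequential([
    Embedding(in_vocab, 50),
    LSTM(100),
    RepeatVector(max_len),
    LSTM(100, return_sequences=True),
    Dense(out_vocab, activation='softmax'),
])

# Sparse loss: targets are integer word ids, not one-hot vectors
model.compile(optimizer=optimizers.RMSprop(learning_rate=0.001),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

history = model.fit(train_X, train_Y,
                    epochs=1, batch_size=128,
                    validation_split=0.2, verbose=0)

print(history.history.keys())
```

The returned `history` object keeps the per-epoch training and validation metrics in `history.history`.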

14\. **Plot the accuracy and loss of your model for the training and validation sets.**
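A possible plotting helper, assuming the dictionary stored in `history.history` has the keys `loss`, `val_loss`, `accuracy` and `val_accuracy` (the accuracy keys only exist if the model was compiled with an accuracy metric). The function name `plot_history` is made up, and the numbers in `demo` are placeholders to exercise the function, not real training results.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; omit in a notebook
import matplotlib.pyplot as plt

def plot_history(history_dict):
    fig, (ax_loss, ax_acc) = plt.subplots(1, 2, figsize=(10, 4))

    # left panel: loss per epoch
    ax_loss.plot(history_dict['loss'], label='train')
    ax_loss.plot(history_dict['val_loss'], label='validation')
    ax_loss.set_xlabel('epoch')
    ax_loss.set_ylabel('loss')
    ax_loss.legend()

    # right panel: accuracy per epoch
    ax_acc.plot(history_dict['accuracy'], label='train')
    ax_acc.plot(history_dict['val_accuracy'], label='validation')
    ax_acc.set_xlabel('epoch')
    ax_acc.set_ylabel('accuracy')
    ax_acc.legend()
    return fig

# Placeholder values only; pass history.history from model.fit instead
demo = {'loss': [2.0, 1.5], 'val_loss': [2.1, 1.7],
        'accuracy': [0.30, 0.40], 'val_accuracy': [0.28, 0.36]}
fig = plot_history(demo)
```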

15\. **Predict translations for the test set.**

16\. **Use the tokenizer's sequences_to_texts function to convert the predicted indices back to words.**
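The model outputs a probability distribution over the vocabulary for every time step, so the predicted word id is the argmax over the last axis. With a fitted Keras tokenizer (here hypothetically called `eng_tokenizer`), `eng_tokenizer.sequences_to_texts(ids.tolist())` would then do the lookup. The sketch below uses a hand-written dictionary and a hypothetical helper `ids_to_text` instead, so that it runs without a trained model; the probabilities are invented.

```python
import numpy as np

# Invented softmax output: 1 sentence, 4 time steps, 4 vocabulary ids
preds = np.array([[[0.10, 0.70, 0.10, 0.10],
                   [0.10, 0.10, 0.10, 0.70],
                   [0.10, 0.10, 0.70, 0.10],
                   [0.90, 0.00, 0.05, 0.05]]])

# Most probable word id at every time step
ids = np.argmax(preds, axis=-1)

# Hand-written stand-in for a tokenizer's index-to-word mapping;
# id 0 is the padding id and has no word
index_word = {1: "tom", 2: "here", 3: "is"}

def ids_to_text(seq, index_word):
    # skip ids without a word (such as padding)
    return " ".join(index_word[i] for i in seq if i in index_word)

texts = [ids_to_text(seq, index_word) for seq in ids.tolist()]
print(texts)
```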

17\. **Create a new dataframe with three columns where you show the input Dutch text of the test set, the actual output, and your predictions. Use the sample() function with your dataframe to randomly check some of the lines.**
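A sketch of the comparison dataframe. The column names and all three rows are placeholders; in the practical the columns would be filled with the Dutch test sentences, their English references, and the decoded model output.

```python
import pandas as pd

# Placeholder rows, not real model output
pred_df = pd.DataFrame({
    'input': ["ik ben tom", "tom is hier", "hoe gaat het"],
    'actual': ["i am tom", "tom is here", "how are you"],
    'predicted': ["i am tom", "tom is", "how is you"],
})

# Randomly inspect a few rows; random_state only makes the draw repeatable
print(pred_df.sample(2, random_state=1))
```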

**Tatoeba.org (https://tatoeba.org/en/downloads) has a large database of example sentences translated into many languages by volunteers. To get better data for your neural machine translator, you can use this site to generate and download customized sentence pairs. For example, it has more than one million sentence pairs translated from Dutch to English. This time, try to tune the hyperparameters and add an attention layer after the dense layer.**