# Neural Machine Translation
## based on the paper Sequence to Sequence Learning with Neural Networks
##### Dataset used: IWSLT'15 English-Vietnamese data [Small]
##### Link: https://nlp.stanford.edu/projects/nmt/

### Introduction to Machine Translation


Neural machine translation is implemented using sequence-to-sequence (seq2seq) learning. In seq2seq learning, an input sequence is mapped to an output sequence. Seq2seq learning can therefore be applied to tasks such as machine translation, chatbots, and question answering.

Deep Neural Networks (DNNs) are powerful models that work well whenever large labeled training sets are available. However, they can only be applied when inputs and outputs can be encoded as vectors of fixed dimensionality, so they cannot map sequences to sequences directly. This is a significant limitation, since many tasks, such as speech recognition and machine translation, involve sequences of varying length. To overcome this, Neural Machine Translation uses a multi-layer Long Short-Term Memory (LSTM) network to map the input sequence to a vector of fixed dimensionality, and a second LSTM to decode the target sequence from that vector.
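To make the idea of "a vector of fixed dimensionality" concrete, here is a minimal sketch in plain Python. It is not an LSTM: the function `encode`, its hand-picked weights, and the `hidden_size` parameter are assumptions made purely for illustration. It only shows that a recurrent update can fold sequences of different lengths into state vectors of the same size.

```python
import math

def encode(sequence, hidden_size=4):
    """Fold a variable-length list of integer token ids into a fixed-size state."""
    state = [0.0] * hidden_size
    for token in sequence:
        # toy recurrence: every unit mixes its previous value with the current token
        state = [
            math.tanh(0.5 * h + 0.1 * (token + 1) * (i + 1))
            for i, h in enumerate(state)
        ]
    return state

short_state = encode([3, 7])
long_state = encode([3, 7, 1, 9, 4, 2])

# both inputs end up as vectors with hidden_size entries, whatever their length
print(len(short_state), len(long_state))
```

A real encoder learns its weights during training and uses gated LSTM cells, but the shape of the computation is the same: one state update per input token, and a final state of fixed size.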

### Dataset details

The original model used the WMT'14 English to French dataset. The model was trained on a subset of 12M sentences, which consisted of 348M French words and 304M English words.

A vocabulary of the 160,000 most frequent English words and the 80,000 most frequent French words was used. Every out-of-vocabulary word was replaced with a special unknown token.

However, we'll use the English to French, European Parliament Proceedings Parallel Corpus dataset. This dataset is great for learning about seq2seq model and architecture.

### Setting up the dataset
1. Download the dataset files from http://www.statmt.org/europarl/index.html (parallel corpus French-English)
2. Create a folder dataset in the current directory<br>
```mkdir dataset```
3. Extract and move files into the dataset folder<br>
```tar xvzf fr-en.tgz -C ./dataset```
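
The extraction step can also be done from Python with the standard `tarfile` module. This is a sketch only: the helper name `extract_corpus` is hypothetical, and the commented-out call assumes `fr-en.tgz` has already been downloaded to the current directory.

```python
import os
import tarfile

def extract_corpus(archive_path, target_dir="dataset"):
    """Extract a gzip-compressed tar archive into target_dir and list its contents."""
    # create the target folder if it does not exist yet
    os.makedirs(target_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(path=target_dir)
    return sorted(os.listdir(target_dir))

# assumes fr-en.tgz is present in the current directory
# extract_corpus("fr-en.tgz")
```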

In [1]:
import re

In [2]:
# create dataset location holders
fileloc_en = './dataset/europarl-v7.fr-en.en'
fileloc_fr = './dataset/europarl-v7.fr-en.fr'

In [3]:
# open data files (the Europarl files are UTF-8 encoded)
file_en = open(fileloc_en, 'r', encoding='utf-8')
file_fr = open(fileloc_fr, 'r', encoding='utf-8')

## Create Dictionary and Preprocess data
A dictionary (also known as a vocabulary) is a list of the most frequent words in a particular language. Since we are working on English to French translation, we will create two dictionaries, one for English and one for French.

In [4]:
# define language tokens
start_token = 'startseq'
end_token = 'endseq'
unknown_token = '<unk>'

In [87]:
# this module provides regular expression matching operations similar to those found in Perl
import re

def preprocess_sentence(w, add_tokens=False):
    # convert each character to lower case
    w = w.lower()
    
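    # surround punctuation marks with spaces so that each becomes its own token
    w = re.sub(r"([?.!,¿'])", r" \1 ", w)
    # collapse runs of spaces (and double quotes) into a single space
    w = re.sub(r'[" "]+', " ", w)
    
    # optionally replace everything except (a-z, A-Z, ".", "?", "!", ",") with a space
    # w = re.sub(r"[^a-zA-Z?.!,]+", " ", w)
    # remove leading and trailing whitespace, including the trailing newline
    w = w.strip()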
    
    # add the start and end token to the sentence
    # the model will learn this behaviour over time
    if add_tokens:
        w = start_token + ' ' + w + ' ' + end_token
    
    return w

In [88]:
preprocess_sentence("this is. a sentence! with 'multi line. quotes'")

"this is . a sentence ! with ' multi line . quotes '"

### Create a dictionary
We'll create our own vocabulary from the words present in the English and French corpora. Each vocabulary will consist of the 50000 most frequent words of its language (49997 corpus words plus the three special tokens).

#### Steps:
1. Read all the words from the files
2. Create a set consisting of all the unique words
3. Extract the most frequent words and create dictionary
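
The three steps above can be sketched on a toy corpus with `collections.Counter`. The sentences in `lines` are made up for illustration and stand in for the lines of a data file; the cut-off of 4 words is likewise arbitrary.

```python
from collections import Counter

# toy corpus standing in for the lines of a data file (hypothetical sentences)
lines = ["the house is big .", "the house is old .", "a big garden ."]

# step 1: read all the words
words = [word for line in lines for word in line.split()]

# step 2: collect the unique words together with their counts
counts = Counter(words)

# step 3: keep the most frequent words, ordered from most to least frequent
vocab = [word for word, _ in counts.most_common(4)]
print(vocab)
```

A `Counter` covers steps 2 and 3 in one structure: its keys are the unique words, and `most_common` returns them sorted by frequency.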

In [6]:
# necessary imports for creating a dictionary
from collections import Counter
from pickle import load, dump

In [7]:
# vocab file locations
vocab_loc_en = './dataset/vocab.en'
vocab_loc_fr = './dataset/vocab.fr'

In [8]:
# create a Dictionary class for much better code representation
class Dictionary:
    def __init__(self, input_file):
        self.vocab = set()
        self.word_list = []
        self.processed_sentences = []
        self.input_file = input_file
        
    # sets up the dictionary object
    def set_up(self):
        self.read_file()
        self.preprocess()
        self.create_word_list()
        self.file.close()
        self.processed_sentences = []
        
    # frees up a bit of space by deallocating the list
    def tear_down(self):
        self.word_list = []
    
    # read the data file
    def read_file(self):
        self.file = open(self.input_file, 'r', encoding='utf-8')
        self.sentences = self.file.readlines()
    
    # pre-processes the data, see preprocess_sentence above for details
    def preprocess(self):
        for w in self.sentences:
            self.processed_sentences.append(preprocess_sentence(w))
    
    # creates a full list of words
    def create_word_list(self):
        for sentence in self.processed_sentences:
            for word in sentence.split(' '):
                self.word_list.append(word)
    
    # creates a list of the num_frequent most frequent words in the dataset
    def create_vocab(self, num_frequent=49997):
        self.vocab = list(dict(Counter(self.word_list).most_common(num_frequent)).keys())
    
    # returns the size of vocabulary
    def get_total_words(self):
        return len(self.vocab)
    
    # returns the vocabulary
    def get_vocab(self):
        return self.vocab
    
    # pickle the vocab to a file so that we do not have to create it again and again
    def save_vocab(self, vocab_file):
        with open(vocab_file, 'wb') as f:
            dump(self.vocab, f)
    
    # load the vocab from the pickled file
    def load_vocab(self, vocab_file):
        with open(vocab_file, 'rb') as f:
            self.vocab = load(f)

In [None]:
# create an english dictionary object
dict_english = Dictionary(fileloc_en)
dict_english.set_up()
dict_english.create_vocab()
dict_english.tear_down()

In [None]:
# verify if everything is working fine
vocab_en = dict_english.get_vocab()
vocab_en[:10]

In [None]:
# create a french dictionary object
dict_french = Dictionary(fileloc_fr)
dict_french.set_up()
dict_french.create_vocab()
dict_french.tear_down()

In [None]:
# verify if everything is working fine
vocab_fr = dict_french.get_vocab()
vocab_fr[:10]

Prepend the special tokens to the vocabularies

In [None]:
special_tokens = [unknown_token, end_token, start_token]

# insert each token at the front; the final order is start, end, unknown
for token in special_tokens:
    vocab_en.insert(0, token)
    vocab_fr.insert(0, token)

In [None]:
# verify if the vocabs are just fine
vocab_en[:10], vocab_fr[:10], len(vocab_en), len(vocab_fr)

#### Save Vocabulary
Save the vocab to file so that we do not have to create the vocab again

In [None]:
# save the vocab in a file
dict_english.save_vocab(vocab_loc_en)
dict_french.save_vocab(vocab_loc_fr)

#### Load Vocabulary
Load the vocab if we've already pickled it before (this saves a lot of time and memory)

In [69]:
# create Dictionary objects and load the vocab into the object
dict_english = Dictionary(fileloc_en)
dict_english.load_vocab(vocab_loc_en)

dict_french = Dictionary(fileloc_fr)
dict_french.load_vocab(vocab_loc_fr)

In [70]:
# load the vocab into variables
vocab_en = dict_english.get_vocab()
vocab_fr = dict_french.get_vocab()

In [74]:
# verify if the vocabs are just fine
vocab_en[:10], vocab_fr[:10], len(vocab_en), len(vocab_fr)

(['startseq', 'endseq', '<unk>', 'the', ',', '.', 'of', 'to', 'and', 'in'],
 ['startseq', 'endseq', '<unk>', 'de', ',', '.', 'la', 'et', 'le', 'les'],
 50000,
 50000)

### Map vocabulary words to unique integers and vice versa
A computer doesn't understand what a word is. To it, words are just arbitrary sequences of characters. It is we humans who understand that a specific sequence of characters has some meaning. To overcome this problem, we encode every word in the vocabulary as a unique integer and then train our model on these sequences of integers.
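
As a minimal sketch of integer encoding, the snippet below builds both mappings for a toy vocabulary and round-trips a sentence. The five-word `vocab` and the helpers `encode` and `decode` are assumptions for illustration; the particular numbering is arbitrary as long as each word gets a unique integer.

```python
vocab = ["startseq", "endseq", "<unk>", "the", "session"]  # toy vocabulary

# index 0 is left unused here, on the assumption that it is reserved for padding
word2idx = {word: index for index, word in enumerate(vocab, start=1)}
idx2word = {index: word for word, index in word2idx.items()}

def encode(sentence):
    """Map each word to its integer, falling back to the unknown token."""
    return [word2idx.get(word, word2idx["<unk>"]) for word in sentence.split()]

def decode(indices):
    """Map a list of integers back to a space-separated sentence."""
    return " ".join(idx2word[i] for i in indices)

encoded = encode("startseq the session resumed endseq")
print(encoded)          # [1, 4, 5, 3, 2]
print(decode(encoded))  # startseq the session <unk> endseq
```

The word `resumed` is not in the toy vocabulary, so it is encoded as the unknown token and cannot be recovered when decoding.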

In [72]:
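class Encoding:
    def __init__(self, vocab):
        self.word2idx = dict()
        self.idx2word = dict()
        # work on a copy so that the caller's vocabulary list is not reversed in place
        self.vocab = vocab.copy()
        self.create_encoding()
        
    def create_encoding(self):
        # after reversing, the last word of the vocabulary gets index 1
        # and the first word (startseq) gets index len(vocab); index 0 is never assigned
        self.vocab.reverse()
        for index, word in enumerate(self.vocab):
            self.word2idx[word] = index + 1
            
        # build the inverse mapping from integer back to word
        for word, index in self.word2idx.items():
            self.idx2word[index] = word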

In [79]:
# create an object for english encodings
encoding_en = Encoding(vocab_en)
word2idx_en = encoding_en.word2idx
idx2word_en = encoding_en.idx2word

In [80]:
# create an object for french encodings
encoding_fr = Encoding(vocab_fr)
word2idx_fr = encoding_fr.word2idx
idx2word_fr = encoding_fr.idx2word

In [81]:
vocab_en[-1000:-1]

['pimping',
 "'reasonable",
 'emu;',
 "occupation'",
 'daddy',
 "'progressive'",
 "mother'",
 "someone'",
 "proof'",
 'faintest',
 '51:',
 'dark-skinned',
 'public-spiritedness',
 'blood-stained',
 'drugging',
 '486',
 'north;',
 'proscribe',
 'sayyaf',
 'public-law',
 "'fishing",
 'carpetshell',
 "varela'",
 'messages;',
 'scraping',
 "collector's",
 'extra-european',
 "mckenna'",
 'ecurie',
 'stratospheric',
 'set-top',
 "television'",
 'high-capacity',
 'boroni',
 'wants:',
 'operas',
 'constrict',
 'consumer-oriented',
 'adulterating',
 'ashtrays',
 'starry',
 'echo-funded',
 'macaronesia',
 '14-point',
 '(jrc)',
 'austria)',
 'deceptively',
 'gusinsky',
 "'media",
 '576',
 "monnet'",
 'occar',
 'bruce',
 "empire'",
 'anti-prohibitionists',
 "tar'",
 'cedaw',
 'varaut',
 "brik's",
 'non-native',
 'multilateralisation',
 "making'",
 '974/98',
 'mendoza',
 'gal',
 'anti-abuse',
 'cartesian',
 'rau',
 'non-nato',
 "lechner's",
 'occurred;',
 'boxed',
 'tricolour',
 "turco's",
 'devast

### Preprocess the data
Raw real-world data may be incomplete, noisy, or inconsistent, which can hurt the accuracy of the trained model. To avoid this, raw data is generally cleaned before it is fed into the model. This cleaning process is known as data preprocessing.

Preprocessing language data may include some additional steps which we will see below.

#### Steps:
1. Convert sentences to lower case
2. Clean sentences by removing the trailing ```\n``` and separating punctuation marks from words with spaces
3. Add a _start_ and an _end_ token to each sentence, which helps the model determine where a sentence starts and ends (our model will learn this behaviour over time)
4. Replace words that are not in the vocabulary with a special unknown token

Example: The sentence ```That person is sitting on the bench, and enjoying the cool breeze.``` is converted to ```startseq that person is sitting on the bench , and enjoying the cool breeze . endseq```
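
The steps above can be sketched end to end as follows. This is a simplified stand-in for the notebook's own functions: the helper `clean` and the small `vocab` set are hypothetical, and punctuation is kept as separate tokens, as `preprocess_sentence` does. The vocabulary is deliberately tiny so that step 4 is visible.

```python
import re

START, END, UNKNOWN = "startseq", "endseq", "<unk>"

def clean(sentence, vocab):
    # steps 1 and 2: lower-case and drop surrounding whitespace such as "\n"
    sentence = sentence.lower().strip()
    # step 2 (continued): put spaces around punctuation so each mark is a token
    sentence = re.sub(r"([?.!,])", r" \1 ", sentence)
    # step 3: add the start and end tokens
    words = [START] + sentence.split() + [END]
    # step 4: replace out-of-vocabulary words with the unknown token
    return " ".join(w if w in vocab else UNKNOWN for w in words)

# toy vocabulary that is missing "bench", "enjoying", "cool" and "breeze"
vocab = {START, END, "that", "person", "is", "sitting", "on", "the", ",", "and", "."}

result = clean("That person is sitting on the bench, and enjoying the cool breeze.\n", vocab)
print(result)
# startseq that person is sitting on the <unk> , and <unk> the <unk> <unk> . endseq
```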

In [12]:
# number of sentence pairs used for training (kept small for learning purposes)
training_samples = 25000

In [13]:
# create a list of all the sentences
sentences_en = file_en.readlines()
sentences_fr = file_fr.readlines()

In [14]:
# open the translation files (the Europarl files are UTF-8 encoded)
datafile_en = open(fileloc_en, 'r', encoding='utf-8')
datafile_fr = open(fileloc_fr, 'r', encoding='utf-8')

In [15]:
# get all the training sentences from the files
sentences_en = datafile_en.readlines()
sentences_fr = datafile_fr.readlines()

In [16]:
# play with data
print(len(sentences_en))
print(len(sentences_fr))
print(sentences_en[:2])
print(sentences_fr[:2])

2007723
2007723
['Resumption of the session\n', 'I declare resumed the session of the European Parliament adjourned on Friday 17 December 1999, and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period.\n']
['Reprise de la session\n', 'Je déclare reprise la session du Parlement européen qui avait été interrompue le vendredi 17 décembre dernier et je vous renouvelle tous mes vux en espérant que vous avez passé de bonnes vacances.\n']


In [17]:
# get total number of training samples
num_samples_en = len(sentences_en)
num_samples_fr = len(sentences_fr)

In [20]:
# preprocesses language sentences

def preprocess_language_sentences(sentences, vocab):
    # a set gives constant-time membership tests; a list would be scanned for every word
    vocab_set = set(vocab)
    sentences_processed = []
    for sentence in sentences:
        sentence = preprocess_sentence(sentence, add_tokens=True)
        sentence_processed = ""
        for w in sentence.split(' '):
            # replace out-of-vocabulary words with the unknown token
            if w not in vocab_set:
                sentence_processed += " " + unknown_token
            else:
                sentence_processed += " " + w
        sentences_processed.append(sentence_processed)
        
    return sentences_processed

In [21]:
# get the final preprocessed sentences
sentences_en_processed = preprocess_language_sentences(sentences_en[:training_samples], vocab_en)
sentences_fr_processed = preprocess_language_sentences(sentences_fr[:training_samples], vocab_fr)

In [None]:
# preprocess english sentences

sentences_en_processed = []
for sentence in sentences_en[:training_samples]:
    sentence = preprocess_sentence(sentence, add_tokens=True)
    sentence_processed = ""
    for w in sentence.split(' '):
        if w not in vocab_en:
            sentence_processed += " " + unknown_token
        else:
            sentence_processed += " " + w
    sentences_en_processed.append(sentence_processed)

In [None]:
sentences_en_processed[:10]

In [None]:
# preprocess french sentences

sentences_fr_processed = []
for sentence in sentences_fr[:training_samples]:
    sentence = preprocess_sentence(sentence, add_tokens=True)
    sentence_processed = ""
    for w in sentence.split(' '):
        if w not in vocab_fr:
            sentence_processed += " " + unknown_token
        else:
            sentence_processed += " " + w
    sentences_fr_processed.append(sentence_processed)

In [22]:
print(sentences_en_processed[:10])
print(sentences_fr_processed[:10])

[' startseq resumption of the session endseq', ' startseq i declare resumed the session of the european parliament adjourned on friday 17 december 1999 , and i would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period . endseq', " startseq although , as you will have seen , the dreaded 'millennium <unk> failed to materialise , still the people in a number of countries suffered a series of natural disasters that truly were dreadful . endseq", ' startseq you have requested a debate on this subject in the course of the next few days , during this part-session . endseq', " startseq in the meantime , i should like to observe a minute' s silence , as a number of members have requested , on behalf of all the victims concerned , particularly those of the terrible storms , in the various countries of the european union . endseq", " startseq please rise , then , for this minute' s silence . endseq", " startseq (the house rose and observed a minute'