# Neural Machine Translation
## based on the paper Sequence to Sequence Learning with Neural Networks
##### Dataset used: IWSLT'15 English-Vietnamese data [Small]
##### Link: https://nlp.stanford.edu/projects/nmt/

### Introduction to Machine Translation


Neural machine translation is implemented with sequence-to-sequence (seq2seq) learning. In seq2seq learning, an input sequence is mapped to an output sequence. The same approach can therefore be used in areas such as machine translation, chatbots, and question answering.

Deep Neural Networks (DNNs) are powerful models that work well whenever large labeled training sets are available. However, they can only be applied to problems whose inputs and outputs are vectors of fixed dimensionality, so they cannot map sequences to sequences directly. This is a significant limitation, since many tasks, such as speech recognition and machine translation, involve sequential inputs of varying length. To overcome this, Neural Machine Translation uses a multi-layer Long Short-Term Memory (LSTM) network to map the input sequence to a vector of fixed dimensionality.
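To make the idea of "a sequence of any length becomes a vector of fixed size" concrete, here is a minimal sketch. It is not the model from the paper: it uses a plain tanh recurrent cell instead of an LSTM, scalar inputs instead of word embeddings, and random untrained weights. The names `encode`, `w_in`, `w_rec`, and `HIDDEN_SIZE` are made up for this illustration.

```python
import math
import random

def encode(sequence, w_in, w_rec, hidden_size):
    """Fold a variable-length sequence of numbers into a fixed-size vector."""
    hidden = [0.0] * hidden_size
    for x in sequence:
        # each step mixes the current input with the previous hidden state
        hidden = [
            math.tanh(
                w_in[i] * x
                + sum(w_rec[i][j] * hidden[j] for j in range(hidden_size))
            )
            for i in range(hidden_size)
        ]
    return hidden

random.seed(0)
HIDDEN_SIZE = 4  # hypothetical size, chosen only for the example
w_in = [random.uniform(-1, 1) for _ in range(HIDDEN_SIZE)]
w_rec = [[random.uniform(-1, 1) for _ in range(HIDDEN_SIZE)]
         for _ in range(HIDDEN_SIZE)]

short_vec = encode([0.5, -1.0], w_in, w_rec, HIDDEN_SIZE)
long_vec = encode([0.1, 0.2, 0.3, 0.4, 0.5], w_in, w_rec, HIDDEN_SIZE)
print(len(short_vec), len(long_vec))  # both vectors have HIDDEN_SIZE entries
```

The two inputs have different lengths, yet both are summarised by a vector with the same number of entries, which is the property the encoder relies on.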

### Dataset details

The original model used the WMT'14 English to French dataset. The model was trained on a subset of 12M sentences, which consisted of 348M French words and 304M English words.

A vocabulary of the 160,000 most frequent English words and the 80,000 most frequent French words was used. Every out-of-vocabulary word was replaced with a special unknown token.
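The replacement of out-of-vocabulary words can be sketched as follows. The helper `replace_oov` and the three-word `toy_vocabulary` are hypothetical and exist only for this illustration; a real vocabulary would hold the most frequent words of the corpus.

```python
UNKNOWN_TOKEN = "<unk>"  # placeholder string for words outside the vocabulary

def replace_oov(tokens, vocabulary, unknown=UNKNOWN_TOKEN):
    """Return the tokens with every out-of-vocabulary word replaced."""
    return [token if token in vocabulary else unknown for token in tokens]

toy_vocabulary = {"the", "house", "agrees"}
print(replace_oov("the house agrees unanimously".split(), toy_vocabulary))
# ['the', 'house', 'agrees', '<unk>']
```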

Here, however, we'll use the French-English portion of the European Parliament Proceedings Parallel Corpus. This dataset works well for learning about the seq2seq model and its architecture.

### Setting up the dataset
1. Download the dataset files from http://www.statmt.org/europarl/index.html (parallel corpus French-English)
2. Create a folder named dataset in the current directory<br>
```mkdir dataset```
3. Extract the files into the dataset folder<br>
```tar xvzf fr-en.tgz -C ./dataset```
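If you would rather stay inside the notebook, steps 2 and 3 can also be done from Python with the standard `tarfile` module. This is a minimal sketch; the function name `extract_corpus` is made up for this example, and it assumes the archive has already been downloaded.

```python
import os
import tarfile

def extract_corpus(archive_path, dest_dir="dataset"):
    """Extract a .tgz archive into dest_dir and return the member names."""
    # create the target folder if it does not exist yet
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        # unpack every member of the archive into dest_dir
        archive.extractall(dest_dir)
    return names
```

Calling `extract_corpus('fr-en.tgz')` from the directory that holds the downloaded archive should leave the corpus files in `./dataset`.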

In [18]:
import re

In [6]:
# create dataset location holders
fileloc_en = './dataset/europarl-v7.fr-en.en'
fileloc_fr = './dataset/europarl-v7.fr-en.fr'

In [7]:
# open data files (the corpus contains accented characters, so read it as UTF-8)
file_en = open(fileloc_en, 'r', encoding='utf-8')
file_fr = open(fileloc_fr, 'r', encoding='utf-8')
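A parallel corpus is only useful if sentence i of one file is the translation of sentence i of the other, so it can be worth checking that both files have the same number of lines. The sketch below is one way to do that; `load_parallel` is a hypothetical helper, not part of the original notebook.

```python
def load_parallel(path_source, path_target):
    """Read two line-aligned files and return a list of (source, target) pairs."""
    # the context manager closes both files once they have been read
    with open(path_source, "r", encoding="utf-8") as f_src, \
         open(path_target, "r", encoding="utf-8") as f_tgt:
        source = f_src.readlines()
        target = f_tgt.readlines()
    # a length mismatch means the files are not aligned sentence by sentence
    if len(source) != len(target):
        raise ValueError(
            f"line counts differ: {len(source)} vs {len(target)}"
        )
    return list(zip(source, target))
```

With the paths defined above, `load_parallel(fileloc_en, fileloc_fr)` would return the English-French sentence pairs.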

### Preprocess the data
Raw real-world data may be incomplete, noisy, or inconsistent, which can hurt the training accuracy of our model. To avoid this, the raw data is generally cleaned before it is fed into the model. This cleaning process is known as data preprocessing.

Preprocessing language data may include some additional steps which we will see below.

#### Steps:
1. Convert sentences to lower case
2. Clean sentences by stripping the trailing ```\n``` and separating punctuation marks from the words around them
3. Add a _start_ and _end_ token to each sentence, which helps the model determine where a sentence starts and ends (our model will learn this behaviour over time)

Example: The sentence ```That person is sitting on the bench, and enjoying the cool breeze.``` is converted to ```startseq that person is sitting on the bench , and enjoying the cool breeze . endseq```

In [8]:
# create a list of all the sentences
sentences_en = file_en.readlines()
sentences_fr = file_fr.readlines()

In [15]:
# define language tokens
start_token = 'startseq'
end_token = 'endseq'
unknown_token = '<unk>'

In [46]:
# the re module provides regular expression matching operations similar to those found in Perl
import re

def preprocess_sentence(w):
    # convert each character to lower case
    w = w.lower()
    
    # put a space on both sides of each punctuation mark (? . ! ,)
    w = re.sub(r"([?.!,])", r" \1 ", w)
    # collapse runs of spaces (and double quotes) into a single space
    w = re.sub(r'[" "]+', " ", w)
    
    # optional: replace everything with a space except (a-z, A-Z, ".", "?", "!", ",")
    # (note: this would also drop accented characters such as é)
    # w = re.sub(r"[^a-zA-Z?.!,]+", " ", w)
    # remove leading and trailing whitespace, including the trailing newline
    w = w.strip()
    
    # add the start and end token to the sentence
    # the model will learn this behaviour over time
    w = start_token + ' ' + w + ' ' + end_token
    
    return w

In [47]:
print(sentences_fr[12])
print(sentences_en[12])
preprocess_sentence(sentences_fr[12])

Si l'Assemblée en est d'accord, je ferai comme M. Evans l'a suggéré.

If the House agrees, I shall do as Mr Evans has suggested.



"startseq si l'assemblée en est d'accord , je ferai comme m . evans l'a suggéré . endseq"

### Create a dictionary
We'll create our own vocabulary from the words present in the English and French corpora. The vocabulary will consist of the 160,000 most frequent English words and the 80,000 most frequent French words.

#### Steps:
1. Read all the words from the files
2. Create a set consisting of all the unique words
3. Extract the most frequent words and create dictionary
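The three steps above can be sketched on a toy corpus as follows. The function `build_dictionary` and the sample sentences are assumptions made for this illustration; the class defined in the next cell is the notebook's own version of the first two steps.

```python
from collections import Counter

def build_dictionary(sentences, max_words):
    """Map the max_words most frequent words to integer indices."""
    # step 1: read all the words
    words = [word for sentence in sentences for word in sentence.split()]
    # step 2: the keys of the counter are the unique words
    counts = Counter(words)
    # step 3: keep the most frequent words and give each one an index
    most_frequent = counts.most_common(max_words)
    return {word: index for index, (word, _) in enumerate(most_frequent)}

toy_corpus = [
    "resumption of the session",
    "the session is resumed",
    "the house agrees",
]
word_to_index = build_dictionary(toy_corpus, max_words=2)
print(word_to_index)  # {'the': 0, 'session': 1}
```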

In [21]:
from collections import Counter

In [37]:
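class Dictionary:
    def __init__(self, input_file):
        self.vocab = set()
        self.word_list = []
        self.sentences = []
        # path to the corpus file
        self.input_file = input_file
    
    def read_file(self):
        # read every line of the corpus; the file is closed automatically
        with open(self.input_file, 'r', encoding='utf-8') as file:
            self.sentences = file.readlines()
        
    def create_word_list(self):
        # split on single spaces; newlines and punctuation stay attached to the words
        for sentence in self.sentences:
            for word in sentence.split(' '):
                self.word_list.append(word)
                
    def get_total_words(self):
        # count how often each word occurs
        return Counter(self.word_list)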

In [38]:
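# read the translation files
# train_en and train_vi are assumed to hold the paths to the IWSLT'15 English and Vietnamese training files
datafile_en = open(train_en, 'r', encoding='utf-8')
datafile_vi = open(train_vi, 'r', encoding='utf-8')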

In [39]:
# pass the file path (not the open file object), since read_file() opens it itself
dict_english = Dictionary(fileloc_en)
dict_english.read_file()
dict_english.create_word_list()
dict_english.get_total_words()

Counter({'Resumption': 221,
         'of': 1816812,
         'the': 3577195,
         'session\n': 421,
         'I': 550847,
         'declare': 1525,
         'resumed': 1246,
         'session': 2039,
         'European': 292298,
         'Parliament': 79457,
         'adjourned': 394,
         'on': 492785,
         'Friday': 675,
         '17': 1868,
         'December': 4742,
         '1999,': 1148,
         'and': 1403097,
         'would': 166566,
         'like': 100247,
         'once': 15128,
         'again': 16115,
         'to': 1680735,
         'wish': 19189,
         'you': 105213,
         'a': 832450,
         'happy': 4045,
         'new': 70999,
         'year': 17618,
         'in': 1098921,
         'hope': 29213,
         'that': 834946,
         'enjoyed': 851,
         'pleasant': 254,
         'festive': 42,
         'period.\n': 1587,
         'Although,': 128,
         'as': 321591,
         'will': 258701,
         'have': 353813,
         'seen,': 396,
  

Create a list of all input and output sentences

In [6]:
# get all the training sentences from the files
sentences_en = datafile_en.readlines()
sentences_vi = datafile_vi.readlines()

In [15]:
# inspect the data: number of sentences and the first two sentence pairs
print(len(sentences_en))
print(len(sentences_vi))
print(sentences_en[:2])
print(sentences_vi[:2])

133317
133317
['Rachel Pike : The science behind a climate headline\n', 'In 4 minutes , atmospheric chemist Rachel Pike provides a glimpse of the massive scientific effort behind the bold headlines on climate change , with her team -- one of thousands who contributed -- taking a risky flight over the rainforest in pursuit of data on a key molecule .\n']
['Khoa học đằng sau một tiêu đề về khí hậu\n', 'Trong 4 phút , chuyên gia hoá học khí quyển Rachel Pike giới thiệu sơ lược về những nỗ lực khoa học miệt mài đằng sau những tiêu đề táo bạo về biến đổi khí hậu , cùng với đoàn nghiên cứu của mình -- hàng ngàn người đã cống hiến cho dự án này -- một chuyến bay mạo hiểm qua rừng già để tìm kiếm thông tin về một phân tử then chốt .\n']


In [50]:
# get total number of training samples
num_samples = len(sentences_en)

In [8]:
# define start, end and an unknown token
start = 'startseq'
end = 'endseq'
unknown = 'unk'

In [14]:
# preprocess sentence
def preprocess_sentence(sentence):
    # convert sentence to lower case
    sentence = sentence.lower()
    # optional: replace everything with a space except (a-z, A-Z, ".", "?", "!", ",")
    # (note: this would also drop accented Vietnamese characters)
    # sentence = re.sub(r"[^a-zA-Z?.!,]+", " ", sentence)
    
    # remove leading and trailing whitespace, including the trailing newline
    sentence = sentence.strip()
    
    # add the start and end tokens
    sentence = start + ' ' + sentence + ' ' + end
    
    return sentence

In [15]:
# preprocess english sentences

sentences_en_processed = []
for sentence in sentences_en:
    sentence = preprocess_sentence(sentence)
    sentences_en_processed.append(sentence)

In [16]:
# preprocess vietnamese sentences

sentences_vi_processed = []
for sentence in sentences_vi:
    sentence = preprocess_sentence(sentence)
    # add the sentence to the list of processed sentences
    sentences_vi_processed.append(sentence)

In [17]:
print(sentences_en_processed[5:7])
print(sentences_vi_processed[5:7])

['startseq recently the headlines looked like this when the intergovernmental panel on climate change , or ipcc , put out their report on the state of understanding of the atmospheric system . endseq', 'startseq that report was written by 620 scientists from 40 countries . endseq']
['startseq các tiêu đề gần đây trông như thế này khi ban điều hành biến đổi khí hậu liên chính phủ , gọi tắt là ipcc đưa ra bài nghiên cứu của họ về hệ thống khí quyển . endseq', 'startseq nghiên cứu được viết bởi 620 nhà khoa học từ 40 quốc gia khác nhau . endseq']


In [17]:
# create a character array of all the vietnamese characters
char_array_vi = []
for sentence in sentences_vi:
    sentence = sentence.lower()
    for char in sentence:
        char_array_vi.append(char)

In [22]:
# create a character array of all the english characters
char_array_en = []
for sentence in sentences_en:
    sentence = sentence.lower()
    for char in sentence:
        char_array_en.append(char)

In [18]:
# create a dictionary of characters and their frequency for the english sentences
from collections import Counter
char_dict_en = dict(Counter(char_array_en))
len(char_dict_en)

96

In [19]:
# create a dictionary of characters and their frequency for the vietnamese sentences
from collections import Counter
char_dict_vi = dict(Counter(char_array_vi))
len(char_dict_vi)

164
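A character frequency table like the ones above can be turned into a character-to-index mapping, which is one possible next step if the characters are to be fed to a model. The sketch below is an assumption about how that could look, not something the notebook does; `build_char_index` and the toy sentences are made up for the example.

```python
from collections import Counter

def build_char_index(sentences):
    """Map each lower-cased character to an index, most frequent first."""
    counts = Counter(
        char for sentence in sentences for char in sentence.lower()
    )
    # most_common() with no argument returns every character, sorted by frequency
    return {char: index for index, (char, _) in enumerate(counts.most_common())}

toy_sentences = ["AAb", "aBc"]
char_to_index = build_char_index(toy_sentences)
print(char_to_index)  # {'a': 0, 'b': 1, 'c': 2}
```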