# Neural Machine Translation
## Based on the paper *Sequence to Sequence Learning with Neural Networks*
##### Dataset used: IWSLT'15 English-Vietnamese data [Small]
##### Link: https://nlp.stanford.edu/projects/nmt/

### Introduction to Machine Translation


Neural machine translation is implemented with sequence-to-sequence (seq2seq) learning. As the name suggests, a seq2seq model maps an input sequence to an output sequence. This makes seq2seq learning applicable to tasks such as machine translation, chatbots, and question answering.

Deep Neural Networks (DNNs) are powerful models that work well whenever large labeled training sets are available, but on their own they cannot map sequences to sequences: they require inputs and outputs encoded as vectors of fixed dimensionality. This is a significant limitation, since many tasks, such as speech recognition and machine translation, involve sequences of varying length. To overcome this, the paper uses a multi-layer Long Short-Term Memory (LSTM) network to map the input sequence to a vector of fixed dimensionality, and a second LSTM to decode the target sequence from that vector.
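
To make the "variable-length sequence in, fixed-size vector out" idea concrete, here is a minimal sketch of a toy recurrent encoder. It is a plain tanh recurrence with random, untrained weights, not the LSTM from the paper, and the names `encode`, `hidden_size`, and `seed` are made up for this illustration.

```python
import math
import random


def encode(sequence, hidden_size=4, seed=0):
    """Fold a sequence of numbers into a fixed-size hidden vector."""
    rng = random.Random(seed)
    # Random weights stand in for parameters a real model would learn.
    w_in = [rng.uniform(-1, 1) for _ in range(hidden_size)]
    w_rec = [[rng.uniform(-1, 1) for _ in range(hidden_size)]
             for _ in range(hidden_size)]

    hidden = [0.0] * hidden_size
    for x in sequence:
        # Each step mixes the current input with the previous hidden state.
        hidden = [
            math.tanh(w_in[i] * x
                      + sum(w_rec[i][j] * hidden[j] for j in range(hidden_size)))
            for i in range(hidden_size)
        ]
    return hidden


short_vector = encode([0.1, 0.5])
long_vector = encode([0.3, 0.2, 0.9, 0.4, 0.7])
print(len(short_vector), len(long_vector))  # 4 4
```

Both inputs end up as vectors of length 4 regardless of how many steps were consumed, which is the property the encoder side of a seq2seq model relies on.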

### Dataset details

The original model used the WMT'14 English-to-French dataset. It was trained on a subset of 12M sentences consisting of 348M French words and 304M English words.

A vocabulary of the 160,000 most frequent English words and the 80,000 most frequent French words was used. Every out-of-vocabulary word was replaced with a special unknown token.
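
The idea of a frequency-capped vocabulary with an unknown token can be sketched as follows. This is only an illustration on a toy corpus: the helper names `build_vocab` and `replace_oov` and the token string `unk` are assumptions of this sketch, not the paper's code.

```python
from collections import Counter


def build_vocab(sentences, max_size):
    """Return the set of the max_size most frequent whitespace-separated words."""
    counts = Counter(word for sentence in sentences for word in sentence.split())
    return {word for word, _ in counts.most_common(max_size)}


def replace_oov(sentence, vocab, unknown='unk'):
    """Replace every word outside the vocabulary with the unknown token."""
    return ' '.join(word if word in vocab else unknown
                    for word in sentence.split())


toy_corpus = ['the cat sat', 'the cat ran', 'the dog sat']
vocab = build_vocab(toy_corpus, max_size=3)   # keeps 'the', 'cat', 'sat'
print(replace_oov('the dog sat', vocab))      # the unk sat
```

Capping the vocabulary keeps the output layer of the model at a manageable size, at the cost of losing rare words.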

Here, however, we'll use the English-to-Vietnamese dataset. It is small, which makes it well suited to learning about the seq2seq model and its architecture.

### Setting up the data for training
1. Download the dataset files from https://nlp.stanford.edu/projects/nmt/ (train.en, train.vi)
2. Create a folder named dataset in the current directory<br>
```mkdir dataset```
3. Move files into the dataset folder<br>
```mv train.en train.vi ./dataset```

In [12]:
# paths to the English and Vietnamese training files
train_en = './dataset/train.en'
train_vi = './dataset/train.vi'

In [13]:
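# open the translation files as UTF-8 (the Vietnamese text is non-ASCII,
# so relying on the platform's default encoding can fail)
datafile_en = open(train_en, 'r', encoding='utf-8')
datafile_vi = open(train_vi, 'r', encoding='utf-8')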

Create a list of all input (English) and output (Vietnamese) sentences.

In [14]:
# get all the training sentences from the files
sentences_en = datafile_en.readlines()
sentences_vi = datafile_vi.readlines()
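
As an alternative to keeping file handles open, the two files could be read inside a `with` block, which closes them automatically. The sketch below also checks that both sides have the same number of lines. The helper name `read_parallel` is hypothetical, and the demo writes two throwaway files so that it runs without the dataset.

```python
import os
import tempfile


def read_parallel(source_path, target_path):
    """Read two line-aligned UTF-8 files and return (source, target) pairs."""
    with open(source_path, encoding='utf-8') as src, \
            open(target_path, encoding='utf-8') as tgt:
        source_lines = src.readlines()
        target_lines = tgt.readlines()
    # A parallel corpus must have one target line per source line.
    if len(source_lines) != len(target_lines):
        raise ValueError('source and target files have different line counts')
    return list(zip(source_lines, target_lines))


# Demo on temporary files (stand-ins for train.en and train.vi).
with tempfile.TemporaryDirectory() as tmp:
    demo_en = os.path.join(tmp, 'demo.en')
    demo_vi = os.path.join(tmp, 'demo.vi')
    with open(demo_en, 'w', encoding='utf-8') as handle:
        handle.write('science\nclimate\n')
    with open(demo_vi, 'w', encoding='utf-8') as handle:
        handle.write('khoa hoc\nkhi hau\n')
    pairs = read_parallel(demo_en, demo_vi)

print(len(pairs))
```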

In [15]:
# inspect the data: sentence counts and the first two sentences of each language
print(len(sentences_en))
print(len(sentences_vi))
print(sentences_en[:2])
print(sentences_vi[:2])

133317
133317
['Rachel Pike : The science behind a climate headline\n', 'In 4 minutes , atmospheric chemist Rachel Pike provides a glimpse of the massive scientific effort behind the bold headlines on climate change , with her team -- one of thousands who contributed -- taking a risky flight over the rainforest in pursuit of data on a key molecule .\n']
['Khoa học đằng sau một tiêu đề về khí hậu\n', 'Trong 4 phút , chuyên gia hoá học khí quyển Rachel Pike giới thiệu sơ lược về những nỗ lực khoa học miệt mài đằng sau những tiêu đề táo bạo về biến đổi khí hậu , cùng với đoàn nghiên cứu của mình -- hàng ngàn người đã cống hiến cho dự án này -- một chuyến bay mạo hiểm qua rừng già để tìm kiếm thông tin về một phân tử then chốt .\n']


In [50]:
# get total number of training samples
num_samples = len(sentences_en)

### Preprocess the training data
1. Trailing ```\n``` is removed from each sentence
2. Each sentence is converted to lower case
3. Special start and end tokens are added to mark where each sentence begins and ends. Over the course of training, the model learns to associate these tokens with sentence boundaries.

In [27]:
# define start, end and an unknown token
start = 'startseq'
end = 'endseq'
unknown = 'unk'

In [55]:
# preprocess english sentences

sentences_en_processed = []
for sentence in sentences_en:
    # convert sentence to lower case
    sentence = sentence.lower()
    # remove trailing whitespace, including the \n, from the sentence
    sentence = sentence.rstrip()
    # add start and end tokens
    sentence = start + ' ' + sentence + ' ' + end
    
    # add the sentence to the list of processed sentences
    sentences_en_processed.append(sentence)

In [56]:
# preprocess vietnamese sentences

sentences_vi_processed = []
for sentence in sentences_vi:
    # convert sentence to lower case
    sentence = sentence.lower()
    # remove trailing whitespace, including the \n, from the sentence
    sentence = sentence.rstrip()
    # add start and end tokens
    sentence = start + ' ' + sentence + ' ' + end
    
    # add the sentence to the list of processed sentences
    sentences_vi_processed.append(sentence)

In [59]:
print(sentences_en_processed[5:7])
print(sentences_vi_processed[5:7])

['startseq recently the headlines looked like this when the intergovernmental panel on climate change , or ipcc , put out their report on the state of understanding of the atmospheric system . endseq', 'startseq that report was written by 620 scientists from 40 countries . endseq']
['startseq các tiêu đề gần đây trông như thế này khi ban điều hành biến đổi khí hậu liên chính phủ , gọi tắt là ipcc đưa ra bài nghiên cứu của họ về hệ thống khí quyển . endseq', 'startseq nghiên cứu được viết bởi 620 nhà khoa học từ 40 quốc gia khác nhau . endseq']
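
The English and Vietnamese loops above apply exactly the same three steps, so they could share one function. A minimal sketch, assuming a hypothetical helper named `preprocess` that uses the same `startseq` and `endseq` token strings:

```python
START = 'startseq'
END = 'endseq'


def preprocess(sentence, start=START, end=END):
    """Lower-case a sentence, strip trailing whitespace, and wrap it in tokens."""
    return f'{start} {sentence.lower().rstrip()} {end}'


raw = ['Rachel Pike : The science behind a climate headline\n']
processed = [preprocess(sentence) for sentence in raw]
print(processed[0])
# startseq rachel pike : the science behind a climate headline endseq
```

With such a helper, each language's list reduces to a single list comprehension over its raw sentences.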


In [17]:
# create a character array of all the vietnamese characters
char_array_vi = []
for sentence in sentences_vi:
    sentence = sentence.lower()
    for char in sentence:
        char_array_vi.append(char)

In [22]:
# create a character array of all the english characters
char_array_en = []
for sentence in sentences_en:
    sentence = sentence.lower()
    for char in sentence:
        char_array_en.append(char)

In [18]:
# create a dictionary mapping each english character to its frequency
from collections import Counter
char_dict_en = dict(Counter(char_array_en))
len(char_dict_en)

96

In [19]:
# create a dictionary mapping each vietnamese character to its frequency
from collections import Counter
char_dict_vi = dict(Counter(char_array_vi))
len(char_dict_vi)

164
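
Note that the counts above are taken over the raw sentences, so the newline character is included as one of the distinct characters. As a rough sketch of a more compact variant, a `Counter` can be filled directly from the sentences without building an intermediate list, and set operations can then show which characters occur in only one language. The helper name `char_frequencies` and the toy sentences are assumptions made for this illustration.

```python
from collections import Counter


def char_frequencies(sentences):
    """Count lower-cased characters across sentences, ignoring line breaks."""
    return Counter(char
                   for sentence in sentences
                   for char in sentence.lower().rstrip('\n'))


toy_en = ['Climate change\n']
toy_vi = ['Khí hậu\n']

freq_en = char_frequencies(toy_en)
freq_vi = char_frequencies(toy_vi)

# Characters that appear in the Vietnamese sample but not in the English one.
only_vi = sorted(set(freq_vi) - set(freq_en))
print(freq_en.most_common(3))
print(only_vi)
```

Run on the full dataset, a comparison like this would be one way to see where the larger Vietnamese character inventory comes from.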