## English-Vietnamese machine translation
* Skip these cells if you are not using Google Colab

In [0]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [0]:
cd 'drive/My Drive/Colab Notebooks/machine_translation'

/content/drive/My Drive/Colab Notebooks/machine_translation


* Import necessary functions

In [0]:
from dataset import MTDataset
from model import Encoder, Decoder
from language import Language
from utils import preprocess, generate_seed
from train import train
from eval import validate
from translate import translate

### Preparing data
* Preprocess the input and target sentences of the train, validation and test sets: remove all digits and punctuation, then keep only the input-target pairs whose sentences are shorter than *MAX_LEN* words. Here *MAX_LEN* = 20 to speed up training.

In [0]:
MAX_LEN = 20
sentences_inp_train, sentences_trg_train = preprocess('datasets/train/train.en', 'datasets/train/train.vi', max_len=MAX_LEN)
sentences_inp_val, sentences_trg_val = preprocess('datasets/dev/tst2012.en', 'datasets/dev/tst2012.vi', max_len=MAX_LEN)
sentences_inp_test, sentences_trg_test = preprocess('datasets/test/tst2013.en', 'datasets/test/tst2013.vi', max_len=MAX_LEN)
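
The cleaning and length filtering described above can be sketched in plain Python. This is only an illustration under assumptions: `clean_sentence` and `filter_pairs` are hypothetical names, the real `utils.preprocess` reads from files and may differ in detail (for example in lowercasing or whitespace handling), and the example strings are made up.

```python
import string

# Characters to delete: all ASCII punctuation and digits.
_DROP = str.maketrans("", "", string.punctuation + string.digits)

def clean_sentence(sentence):
    """Lowercase, delete digits and punctuation, collapse repeated spaces."""
    return " ".join(sentence.lower().translate(_DROP).split())

def filter_pairs(inputs, targets, max_len):
    """Keep cleaned pairs in which both sentences have fewer than max_len words."""
    kept_inp, kept_trg = [], []
    for src, trg in zip(inputs, targets):
        src, trg = clean_sentence(src), clean_sentence(trg)
        n_src, n_trg = len(src.split()), len(trg.split())
        if 0 < n_src < max_len and 0 < n_trg < max_len:
            kept_inp.append(src)
            kept_trg.append(trg)
    return kept_inp, kept_trg

# Illustrative strings only; the second pair is too long and is dropped.
inp, trg = filter_pairs(
    ["But most people don &apos;t agree .", "too long " * 15],
    ["Nhưng hầu hết mọi người không đồng ý .", "quá dài " * 15],
    max_len=20,
)
print(inp)
print(trg)
```

Deleting punctuation rather than replacing it with spaces turns an escaped apostrophe such as `&apos;t` into `apost`, which may be why that token shows up in some of the test sentences further down.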

* Create a Language object for each set. Each object stores the **max length, all sentences, word_to_indices and indices_to_word mappings, vocab size and word vectors**. For the validation and test sets, **word_to_indices** and **indices_to_word** are taken from the training set.

In [0]:
train_inp = Language(sentences_inp_train)
train_trg = Language(sentences_trg_train)

val_inp = Language(sentences_inp_val, train=False, word2id=train_inp.word2id, id2word=train_inp.id2word)
val_trg = Language(sentences_trg_val, train=False, word2id=train_trg.word2id, id2word=train_trg.id2word)

test_inp = Language(sentences_inp_test, train=False, word2id=train_inp.word2id, id2word=train_inp.id2word)
test_trg = Language(sentences_trg_test, train=False, word2id=train_trg.word2id, id2word=train_trg.id2word)
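
To make the role of the shared mappings concrete, here is a minimal sketch of building a vocabulary on the training sentences and reusing it to encode unseen sentences. The special tokens, the helper names `build_vocab` and `encode`, and the padding scheme are all assumptions for illustration; the actual `Language` class may work differently.

```python
PAD, SOS, EOS, UNK = "<pad>", "<sos>", "<eos>", "<unk>"  # hypothetical special tokens

def build_vocab(sentences):
    """Assign an integer id to every distinct word, after the special tokens."""
    word2id = {tok: i for i, tok in enumerate((PAD, SOS, EOS, UNK))}
    for sentence in sentences:
        for word in sentence.split():
            word2id.setdefault(word, len(word2id))
    id2word = {i: w for w, i in word2id.items()}
    return word2id, id2word

def encode(sentence, word2id, max_len):
    """Map words to ids, wrap in <sos>/<eos>, pad to max_len + 2 (no truncation)."""
    ids = [word2id[SOS]]
    ids += [word2id.get(w, word2id[UNK]) for w in sentence.split()]
    ids.append(word2id[EOS])
    ids += [word2id[PAD]] * (max_len + 2 - len(ids))
    return ids

# The vocabulary is built on training data only.
train_w2i, train_i2w = build_vocab(["i was proud", "i was poor"])

# A validation sentence reuses it; the unseen word "hungry" maps to <unk>.
val_ids = encode("i was hungry", train_w2i, max_len=4)
print(val_ids)
```

Reusing the training mappings keeps the token ids consistent with the embedding rows the model learned, which is why the validation and test sets do not build their own vocabularies.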

* Create Dataset classes

In [0]:
train_set = MTDataset(train_inp.wordvec, train_trg.wordvec)
val_set = MTDataset(val_inp.wordvec, val_trg.wordvec)
test_set = MTDataset(test_inp.wordvec, test_trg.wordvec)

In [0]:
from torch.utils.data import DataLoader
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import StepLR

* Create DataLoaders

In [0]:
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
val_loader = DataLoader(val_set, batch_size=64)
test_loader = DataLoader(test_set, batch_size=64)

* Model's parameters

In [0]:
Tx, Ty = train_inp.max_len, train_trg.max_len
vocab_size_inp, vocab_size_trg = train_inp.vocab_size, train_trg.vocab_size
embedding_dim = 256
hidden_size = 1024

### Building models

In [0]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [0]:
# choose a seed for both models for consistent results
SEED = 5

* **Model 1: the Decoder's initial hidden state is computed as in the original paper.**

In [0]:
generate_seed(SEED)  # generate seed to ensure consistent result each run
encoder_1 = Encoder(vocab_size_inp, embedding_dim, hidden_size).to(device=device)
decoder_1 = Decoder(hidden_size, vocab_size_trg, embedding_dim).to(device=device)

In [0]:
optimizer_1 = torch.optim.Adam(params=list(encoder_1.parameters()) + list(decoder_1.parameters()))
criterion_1 = nn.CrossEntropyLoss()
scheduler_1 = StepLR(optimizer_1, step_size=2, gamma=0.2)
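
The effect of `StepLR(step_size=2, gamma=0.2)` can be worked out by hand. The sketch below assumes that `train` calls `scheduler.step()` once per epoch (the training loop is not shown here) and that the optimizer starts from Adam's default learning rate of 1e-3, since no `lr` is passed above. `step_lr` is a hypothetical helper, not part of PyTorch.

```python
def step_lr(base_lr, epoch, step_size, gamma):
    """Learning rate in effect during `epoch` (0-based) under a StepLR-style decay."""
    return base_lr * gamma ** (epoch // step_size)

# Five epochs, as in the training runs in this notebook.
schedule = [step_lr(1e-3, epoch, step_size=2, gamma=0.2) for epoch in range(5)]
for epoch, lr in enumerate(schedule, start=1):
    print(f"epoch {epoch}: lr = {lr:.1e}")
```

Under these assumptions the learning rate is about 1e-3 for epochs 1 and 2, 2e-4 for epochs 3 and 4, and 4e-5 for epoch 5.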

In [0]:
# train for 5 epochs, logging the loss every 200 iterations; keep the best state dict
statedict_1 = train(encoder_1, decoder_1, train_loader, val_loader, optimizer_1, criterion_1, train_trg.id2word, scheduler_1, 5, 200, device)

Epoch  1
Iter 0, loss = 9.013927
Iter 200, loss = 2.750423
Iter 400, loss = 2.298985
Iter 600, loss = 2.248380
Iter 800, loss = 1.933960
Iter 1000, loss = 1.963099
Validation BLEU score: 0.113232

Epoch  2
Iter 0, loss = 1.641850
Iter 200, loss = 1.793719
Iter 400, loss = 1.429143
Iter 600, loss = 1.457869
Iter 800, loss = 1.443988
Iter 1000, loss = 1.496264
Validation BLEU score: 0.135118

Epoch  3
Iter 0, loss = 1.053230
Iter 200, loss = 0.966383
Iter 400, loss = 1.037764
Iter 600, loss = 1.063965
Iter 800, loss = 0.900151
Iter 1000, loss = 1.001884
Validation BLEU score: 0.158528

Epoch  4
Iter 0, loss = 0.737438
Iter 200, loss = 0.766254
Iter 400, loss = 0.687951
Iter 600, loss = 0.783305
Iter 800, loss = 0.771455
Iter 1000, loss = 0.810231
Validation BLEU score: 0.159650

Epoch  5
Iter 0, loss = 0.649854
Iter 200, loss = 0.619188
Iter 400, loss = 0.510352
Iter 600, loss = 0.535633
Iter 800, loss = 0.545049
Iter 1000, loss = 0.547665
Validation BLEU score: 0.161177



In [0]:
# save state dict into a file
torch.save(statedict_1, 'saved_models/statedict_1.pth')

In [0]:
# load state dict
statedict_1 = torch.load('saved_models/statedict_1.pth', map_location=device)  # also works on a CPU-only machine
encoder_1.load_state_dict(statedict_1['encoder'])
decoder_1.load_state_dict(statedict_1['decoder'])

<All keys matched successfully>

* **Model 2 (modified version): the Decoder's initial hidden state h0 is built from the Encoder's last forward and last backward hidden states.**

In [0]:
# modified=True selects the variant that differs from the paper
generate_seed(SEED)
encoder_2 = Encoder(vocab_size_inp, embedding_dim, hidden_size, modified=True).to(device=device)
decoder_2 = Decoder(hidden_size, vocab_size_trg, embedding_dim).to(device=device)
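
The difference between the two models can be written out as equations. This is a sketch under assumptions: the notebook does not name the paper, so the first line assumes it is the attention model of Bahdanau et al. (2015), where the decoder is initialised from the backward encoder state alone. The second line is one plausible reading of "created from the last forward and last backward hidden states" (concatenation followed by a learned projection); the actual `Encoder` code may combine them differently, and $W_s'$ is a hypothetical weight matrix.

```latex
% Model 1 (assumed to follow Bahdanau et al., 2015):
s_0 = \tanh\left( W_s \, \overleftarrow{h}_1 \right)

% Model 2 (assumed form): use both directions of the bidirectional encoder
s_0 = \tanh\left( W_s' \, \left[ \overrightarrow{h}_{T_x} ; \overleftarrow{h}_1 \right] \right)
```

Here $\overrightarrow{h}_{T_x}$ is the forward RNN's state after reading the whole input and $\overleftarrow{h}_1$ is the backward RNN's state after reading it in reverse, so both are the "last" states of their respective directions.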

In [0]:
optimizer_2 = torch.optim.Adam(params=list(encoder_2.parameters()) + list(decoder_2.parameters()))
criterion_2 = nn.CrossEntropyLoss()
scheduler_2 = StepLR(optimizer_2, step_size=2, gamma=0.2)

In [0]:
# train for 5 epochs, logging the loss every 200 iterations
statedict_2 = train(encoder_2, decoder_2, train_loader, val_loader, optimizer_2, criterion_2, train_trg.id2word, scheduler_2, 5, 200, device)

Epoch  1
Iter 0, loss = 8.985854
Iter 200, loss = 2.694446
Iter 400, loss = 2.147349
Iter 600, loss = 2.028440
Iter 800, loss = 2.440283
Iter 1000, loss = 2.084523
Validation BLEU score: 0.121072

Epoch  2
Iter 0, loss = 1.484097
Iter 200, loss = 1.635226
Iter 400, loss = 1.487907
Iter 600, loss = 1.590866
Iter 800, loss = 1.401913
Iter 1000, loss = 1.591312
Validation BLEU score: 0.135914

Epoch  3
Iter 0, loss = 1.099661
Iter 200, loss = 0.975920
Iter 400, loss = 1.067758
Iter 600, loss = 1.179304
Iter 800, loss = 0.907614
Iter 1000, loss = 0.964632
Validation BLEU score: 0.159819

Epoch  4
Iter 0, loss = 0.740380
Iter 200, loss = 0.679936
Iter 400, loss = 0.850062
Iter 600, loss = 0.767678
Iter 800, loss = 0.733945
Iter 1000, loss = 0.789158
Validation BLEU score: 0.159320

Epoch  5
Iter 0, loss = 0.580725
Iter 200, loss = 0.512373
Iter 400, loss = 0.596796
Iter 600, loss = 0.521631
Iter 800, loss = 0.481102


In [0]:
# save state dict
torch.save(statedict_2, 'saved_models/statedict_2.pth')

In [0]:
# load state dict
statedict_2 = torch.load('saved_models/statedict_2.pth', map_location=device)  # also works on a CPU-only machine
encoder_2.load_state_dict(statedict_2['encoder'])
decoder_2.load_state_dict(statedict_2['decoder'])

<All keys matched successfully>

### Evaluate BLEU-4 score on the test set

In [0]:
print('Model 1 BLEU score: %.3f' %(100*validate(test_loader, encoder_1, decoder_1, test_trg.id2word, device)))
print('Model 2 BLEU score: %.3f' %(100*validate(test_loader, encoder_2, decoder_2, test_trg.id2word, device)))

Model 1 BLEU score: 18.524
Model 2 BLEU score: 19.283
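
For readers unfamiliar with the metric, here is a minimal pure-Python sketch of corpus-level BLEU-4 with one reference per sentence: clipped n-gram precisions for n = 1 to 4, their geometric mean, and a brevity penalty. It is an illustration only; `validate` is not shown in this notebook and may rely on a library implementation with smoothing or other details, so `corpus_bleu4` is a hypothetical stand-in and will not necessarily reproduce the scores above.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu4(references, hypotheses):
    """Unsmoothed corpus BLEU-4 in [0, 1], one reference per hypothesis."""
    matches, totals = [0] * 4, [0] * 4
    ref_len = hyp_len = 0
    for ref, hyp in zip(references, hypotheses):
        ref_len += len(ref)
        hyp_len += len(hyp)
        for n in range(1, 5):
            ref_counts, hyp_counts = ngrams(ref, n), ngrams(hyp, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives a zero score
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / 4
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)

ref = "the cat sat on the mat".split()
perfect = corpus_bleu4([ref], [ref])
one_wrong = corpus_bleu4([ref], ["the cat sat on a mat".split()])
print(100 * perfect, 100 * one_wrong)
```

The scores printed by the notebook are multiplied by 100, so a value of 18.5 corresponds to 0.185 on the 0 to 1 scale used during validation.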


### Test time
**1. Sentence 1**

In [0]:
sentence = sentences_inp_test[0]
print("Sentence: " + sentence)
print("Model 1: " + translate(sentence, train_inp.word2id, train_trg.word2id, train_trg.id2word, encoder_1, decoder_1, MAX_LEN, device))
print("Model 2: " + translate(sentence, train_inp.word2id, train_trg.word2id, train_trg.id2word, encoder_2, decoder_2, MAX_LEN, device))

Sentence: and i was very proud
Model 1: và tôi rất tự hào
Model 2: và tôi tự hào rất tự hào


**2. Sentence 2**

In [0]:
sentence = sentences_inp_test[50]
print("Sentence: " + sentence)
print("Model 1: " + translate(sentence, train_inp.word2id, train_trg.word2id, train_trg.id2word, encoder_1, decoder_1, MAX_LEN, device))
print("Model 2: " + translate(sentence, train_inp.word2id, train_trg.word2id, train_trg.id2word, encoder_2, decoder_2, MAX_LEN, device))

Sentence: but most people don apost agree
Model 1: nhưng hầu hết mọi người không đồng ý
Model 2: nhưng hầu hết mọi người không đồng ý


**3. Sentence 3**

In [0]:
sentence = sentences_inp_test[100]
print("Sentence: " + sentence)
print("Model 1: " + translate(sentence, train_inp.word2id, train_trg.word2id, train_trg.id2word, encoder_1, decoder_1, MAX_LEN, device))
print("Model 2: " + translate(sentence, train_inp.word2id, train_trg.word2id, train_trg.id2word, encoder_2, decoder_2, MAX_LEN, device))

Sentence: i also didn apost know that the second step is to isolate the victim
Model 1: tôi cũng không biết rằng thứ hai là để phân loại các nạn nhân
Model 2: tôi cũng không biết rằng bước thứ hai là để chuyển nạn nhân


**4. Sentence 4**

In [0]:
sentence = sentences_inp_test[1]
print("Sentence: " + sentence)
print("Model 1: " + translate(sentence, train_inp.word2id, train_trg.word2id, train_trg.id2word, encoder_1, decoder_1, MAX_LEN, device))
print("Model 2: " + translate(sentence, train_inp.word2id, train_trg.word2id, train_trg.id2word, encoder_2, decoder_2, MAX_LEN, device))

Sentence: my family was not poor  and myself  i had never experienced hunger
Model 1: gia đình tôi không phải là nghèo và tôi không bao giờ hồi phục hồi
Model 2: gia đình tôi không nghèo và tôi không bao giờ có thể nhìn qua đói


**5. Sentence 5**

In [0]:
sentence = sentences_inp_test[3]
print("Sentence: " + sentence)
print("Model 1: " + translate(sentence, train_inp.word2id, train_trg.word2id, train_trg.id2word, encoder_1, decoder_1, MAX_LEN, device))
print("Model 2: " + translate(sentence, train_inp.word2id, train_trg.word2id, train_trg.id2word, encoder_2, decoder_2, MAX_LEN, device))

Sentence: this was the first time i heard that people in my country were suffering
Model 1: lần đầu tiên tôi nghe thấy mọi người ở đất nước của tôi bị đau khổ
Model 2: đó là lần đầu tiên tôi nghe thấy mọi người ở đất nước của tôi rất đau khổ
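
The decoding loop behind calls like the ones above can be sketched as greedy search: feed the previous token to the decoder, take the highest-scoring next token, and stop at the end-of-sentence token or after `MAX_LEN` steps. Whether `translate` actually decodes greedily is an assumption, and `greedy_decode`, `toy_step` and the token ids below are hypothetical stand-ins for the real encoder and decoder.

```python
def greedy_decode(step_fn, init_state, sos_id, eos_id, max_len):
    """Pick the highest-scoring token at each step until <eos> or max_len tokens."""
    output, prev_id, state = [], sos_id, init_state
    for _ in range(max_len):
        scores, state = step_fn(prev_id, state)
        prev_id = max(range(len(scores)), key=scores.__getitem__)  # argmax
        if prev_id == eos_id:
            break
        output.append(prev_id)
    return output

def toy_step(prev_id, state):
    """Toy stand-in for one decoder step: replays a fixed sequence of token ids."""
    target = [3, 4, 2]          # id 2 plays the role of <eos>
    scores = [0.0] * 5
    scores[target[state]] = 1.0
    return scores, state + 1

id2word = {3: "xin", 4: "chào"}  # illustrative vocabulary
ids = greedy_decode(toy_step, init_state=0, sos_id=1, eos_id=2, max_len=20)
print(" ".join(id2word[i] for i in ids))
```

In a real model `step_fn` would wrap one decoder forward pass, with `state` holding the hidden state and the encoder outputs.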
