# Machine Translation for Translators Workshop

This notebook covers:

- Running a pre-trained English-Turkish MT model 
- Byte-pair encoding (BPE)
- Translating a document
- Creating training data from translation memory (TMX)
- Creating a test set and calculating BLEU
- Training a model from scratch
- Domain adaptation on our pre-trained model

NOTE: This coding template is partially based on Masakhane's starter notebook (https://github.com/masakhane-io/masakhane-mt)

# Part 1 - Machine translation with JoeyNMT and AfroTranslate

For this workshop we will use 🐨 JoeyNMT. 

JoeyNMT is an open-source, minimalist neural machine translation toolkit for educational purposes. [code](https://github.com/joeynmt/joeynmt), [documentation](https://joeynmt.readthedocs.io/en/latest/).

We will also use a Python package called [AfroTranslate](https://github.com/hgilles06/AfroTranslate) to easily interact with JoeyNMT.

Let's start by installing AfroTranslate. Since it's dependent on JoeyNMT, it'll automatically install it for us.



In [None]:
!pip install AfroTranslate

Collecting AfroTranslate
  Downloading AfroTranslate-0.0.6-py3-none-any.whl (12 kB)
Collecting joeynmt==1.3
  Downloading joeynmt-1.3-py3-none-any.whl (84 kB)
Collecting spacy==3.2.1
  Downloading spacy-3.2.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.0 MB)
Collecting six==1.12
  Downloading six-1.12.0-py2.py3-none-any.whl (10 kB)
Collecting torch==1.8.0
  Downloading torch-1.8.0-cp37-cp37m-manylinux1_x86_64.whl (735.5 MB)
Collecting torchtext==0.9.0
  Downloading torchtext-0.9.0-cp37-cp37m-manylinux1_x86_64.whl (7.1 MB)
Collecting numpy==1.20.1
  Downloading numpy-1.20.1-cp37-cp37m-manylinux2010_x86_64.whl (15.3 MB)
Collecting srsly<3.0.0,>=2.4.1
  Downloading srs

AfroTranslate gives direct access to pre-trained models for many African languages ([list](https://github.com/masakhane-io/masakhane-mt/tree/master/benchmarks)).

Let's load its English-Tigrinya model and translate a sentence with it.

In [None]:
#import translator
from afrotranslate import MasakhaneTranslate

In [None]:
#create translator object
translator = MasakhaneTranslate(model_name="en-ti")

#translate 
translator.translate("I love you so much!")

We can also load a model of our own.

Let's first download the model stored in the cloud.

[Direct link](https://drive.google.com/file/d/1E_bdkdnYW4wTSdujsDiCqBlBYngvjX0m/view?usp=sharing)

In [None]:
#Execute this to download directly to Colab
from google_drive_downloader import GoogleDriveDownloader as gdd

gdd.download_file_from_google_drive(file_id="1E_bdkdnYW4wTSdujsDiCqBlBYngvjX0m",
                                    dest_path="models/entr/entr.zip",
                                    unzip=True)

Now let's load our English-Turkish translator model using AfroTranslate

'▁ ut an ç ▁.'

The results look a bit strange. That's because the models were trained on BPE-encoded text, so they expect BPE-encoded input and produce BPE-encoded output.

# Part 2 - Byte-pair encoding (BPE)

- One of the most powerful improvements for neural machine translation is BPE tokenization [(Sennrich et al., 2015)](https://arxiv.org/abs/1508.07909).

- BPE tokenization caps the vocabulary at a fixed size by splitting rare words into smaller subword units.

- This is especially useful for agglutinative languages (like Turkish), where the vocabulary is effectively endless.

- Below you have the scripts for doing BPE tokenization of our data. We use the bpemb library, which provides pre-trained BPE models, to convert our text into subwords.
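To make the idea behind BPE concrete, here is a minimal toy sketch of how merges could be learned and applied. It is not what bpemb does internally (bpemb ships pre-trained SentencePiece models); the function names `learn_bpe`, `merge_pair` and `segment` and the tiny word list are made up for illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy version)."""
    # Every word starts as a sequence of characters; "▁" marks the word start.
    vocab = Counter(tuple("▁" + w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Apply the new merge to every word in the vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Split a word into subwords by replaying the learned merges in order."""
    symbols = tuple("▁" + word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical mini-corpus of Turkish word forms.
toy_words = ["evler", "evler", "evde", "okul", "okulda"]
toy_merges = learn_bpe(toy_words, num_merges=6)
print(toy_merges)
print(segment("evlerde", toy_merges))  # an unseen word is still covered by known subwords
```

With only a few merges, frequent stems become single tokens while rare or unseen word forms fall back to smaller pieces, which is why the vocabulary stays bounded.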




In [None]:
! pip install bpemb

In [None]:
from bpemb import BPEmb

BPE_VOCAB_SIZE = 5000
bpemb_en = BPEmb(lang="en", vs=BPE_VOCAB_SIZE, segmentation_only=True, preprocess=True)
bpemb_tr = BPEmb(lang="tr", vs=BPE_VOCAB_SIZE, segmentation_only=True, preprocess=True)

Let's use it now and see what it looks like

Now let's go back to using our model with properly encoded strings.

Don't forget that `bpemb.encode` outputs a list but we need to input a string to our translator.

You can use `' '.join(list)` to convert a list to string.

And to make that output readable, we need to decode it using the right BPE model
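As a small illustration of the list/string round trip described above, the sketch below uses a hand-written token list. `naive_decode` is a hypothetical stand-in that only approximates what a BPE model's decode step does (glue the pieces together and turn the `▁` markers back into spaces); in the notebook you would call the decode method of the right BPE model instead.

```python
# Example subword tokens, as a BPE encoder might return them (hand-written here).
tokens = ['▁', 'T', 'h', 'is', '▁is', '▁a', '▁test', '▁sent', 'ence', '.']

# The translator expects one string, so we join the tokens with spaces.
encoded_string = ' '.join(tokens)
print(encoded_string)

# Going back from the string to a token list is just a split on whitespace.
tokens_again = encoded_string.split()


def naive_decode(subword_tokens):
    """Hypothetical approximation of BPE decoding: concatenate, then map '▁' to spaces."""
    return ''.join(subword_tokens).replace('▁', ' ').strip()


print(naive_decode(tokens_again))
```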

Let's create a translator function that'll reduce our work...

In [None]:
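def translate_entr(tr_in):
  encoded = ' '.join(bpemb_en.encode(tr_in))   # BPE-encode the English input into a single string
  translated = translator.translate(encoded)   # ASSUMPTION: `translator` holds the English-Turkish model
  return bpemb_tr.decode(translated.split())   # decode the Turkish subwords into readable text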

In [None]:
translate_entr("Hello! What's up?")

# Part 3 (Challenge) - Translating a document

For this exercise, we're going to create a script that translates a Word document automatically. 

We're going to reuse some of the code from last session.

In [None]:
!pip install python-docx

Let's download our document to translate.

[Direct link](https://docs.google.com/document/d/1jwNdh4n_m4M-0Z4j23lH4xyeWfTyHPY-/edit?usp=sharing&ouid=114670265863192986077&rtpof=true&sd=true)

In [None]:
#Execute this to download directly to Colab
gdd.download_file_from_google_drive(file_id="1jwNdh4n_m4M-0Z4j23lH4xyeWfTyHPY-", dest_path="/content/MT.docx")

Let's import docx and use it to read our document

Let's see what's inside

Let's translate paragraphs one by one and put the translations into another list
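One possible shape for this loop is sketched below. It works on a plain list of paragraph strings and takes the translation function as a parameter, so it makes no assumptions about the document library or the model; `translate_paragraphs` and `fake_translate` are hypothetical names used only for this illustration.

```python
def translate_paragraphs(paragraphs, translate_fn):
    """Translate each non-empty paragraph and keep empty ones to preserve the layout."""
    translated = []
    for text in paragraphs:
        if text.strip():
            translated.append(translate_fn(text))
        else:
            translated.append(text)  # nothing to translate, keep the blank paragraph
    return translated


def fake_translate(text):
    """Stand-in for a real translator function, used only to make the sketch runnable."""
    return "[TR] " + text


sample_paragraphs = ["Machine translation", "", "It is getting better every year."]
for paragraph in translate_paragraphs(sample_paragraphs, fake_translate):
    print(repr(paragraph))
```

In the notebook, the paragraph texts would come from the document you read, and the stand-in would be replaced by your own translator function.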



And finally, write the translated paragraphs into a newly created document.

# Part 4 - Creating parallel data from TMX

Sometimes we would like to adapt a model to our own style or to the domain we translate in. If we have already done plenty of translations, we can use our translation memory to improve a model later. For that, we need to convert the translation memory into a classic parallel data format.

The classic parallel data format consists of two files, one per language, where line N of one file is the translation of line N of the other.

A translation memory already contains parallel data in this sense, just not in the format we want.
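To see what that difference looks like, here is a minimal sketch with a tiny, invented TMX snippet. It is a simplified illustration rather than the converter used in this notebook: it assumes each translation unit (`tu`) holds one `tuv` per language with an `xml:lang` attribute and a single `seg`, and the helper name `tmx_to_pairs` is hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this namespaced key.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical two-unit translation memory, invented for illustration.
TMX_SAMPLE = """<tmx version="1.4">
  <header srclang="en"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Good morning.</seg></tuv>
      <tuv xml:lang="tr"><seg>Günaydın.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Thank you.</seg></tuv>
      <tuv xml:lang="tr"><seg>Teşekkür ederim.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def tmx_to_pairs(tmx_text, source, target):
    """Return (source, target) sentence pairs from a TMX string."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.iter("tuv"):
            lang = tuv.attrib.get(XML_LANG, "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested inside inline tags.
                segments[lang] = "".join(seg.itertext()).strip()
        # Keep the unit only if both languages are present, so lines stay aligned.
        if source in segments and target in segments:
            pairs.append((segments[source], segments[target]))
    return pairs


pairs = tmx_to_pairs(TMX_SAMPLE, "en", "tr")
source_lines = [src for src, _ in pairs]  # would become the lines of the source file
target_lines = [tgt for _, tgt in pairs]  # would become the lines of the target file
print(source_lines)
print(target_lines)
```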

We can automate this conversion using Python scripting. 

Since TMX parsing is a bit complicated for our level, we're going to use this code by [Yasmin Moslem](https://github.com/ymoslem/file-converters/blob/main/TMX2MT/TMX2MT-ElementTree.py).

In [None]:
import xml
import xml.etree.ElementTree as ET
import sys
import re
import os

def xml_to_parallel(file, source, target):
  source_file = os.path.splitext(file)[0] + "." + source
  target_file = os.path.splitext(file)[0] + "." + target

  tree = ET.parse(file)  
  root = tree.getroot()

  langs = []

  for tu in root.iter('tu'):
      for tuv in tu.iter('tuv'):
          lang = list(tuv.attrib.values())
          langs.append(lang[0].lower())

  langs = set(langs)

  if source.lower() in langs and target.lower() in langs:
      with open(source_file, "w+", encoding='utf-8') as source_file, open(target_file, "w+", encoding='utf-8') as target_file:
          for tu in root.iter('tu'):
              for tuv in tu.iter('tuv'):
                  lang = list(tuv.attrib.values())
                  #print(lang[0])
                  if lang[0].lower() == source.lower():
                      for seg in tuv.iter('seg'):
                          source_text = ET.tostring(seg, 'utf-8', method="xml")
                          source_text = source_text.decode("utf-8")
                          source_text = re.sub('<.*?>|&lt;.*?&gt;|&?(amp|nbsp|quot);|{}', ' ', source_text)
                          source_text = re.sub(r'[ ]{2,}', ' ', source_text).strip()
                          source_file.write(str(source_text) + "\n")
                          #print(source_text)
                  elif lang[0].lower() == target.lower():
                      for seg in tuv.iter('seg'):
                          target_text = ET.tostring(seg, 'utf-8', method="xml")
                          target_text = target_text.decode("utf-8")
                          target_text = re.sub('<.*?>|&lt;.*?&gt;|&quot;|&apos;|{}', ' ', target_text)
                          target_text = re.sub(r'[ ]{2,}', ' ', target_text).strip()
                          target_file.write(str(target_text) + "\n")
                          #print(target_text)

Here we have a translation memory that contains News translations.

In [None]:
gdd.download_file_from_google_drive(file_id="1xh6k91VFfiWpUnjrs-9mU_tkRkdS-INm", dest_path="/content/news.en-tr.tmx")

Downloading 1xh6k91VFfiWpUnjrs-9mU_tkRkdS-INm into /content/news.en-tr.tmx... Done.


We just need to call our function now.



Now we can download and check the parallel data that we created.

(You can see them in the files panel once you hit refresh)

# Part 5 - Creating a test set and calculating BLEU

In this part, we'll see how our model performs on a test set we create from our in-domain data. 

Usually, we don't want to use all our in-domain data for training. We set aside a portion of it for testing and make sure it never ends up in the training data. If it did, that would be a sort of cheating, and the results would not reflect how well the model generalizes.
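A quick way to check for this kind of leakage is to look for test sentences that also occur in the training data. The sketch below is one possible check on plain lists of lines; the function name `find_overlap` and the sample sentences are made up for illustration.

```python
def find_overlap(train_lines, test_lines):
    """Return the test sentences that also appear (verbatim) in the training data."""
    train_set = {line.strip() for line in train_lines}
    return [line for line in test_lines if line.strip() in train_set]


# Hypothetical mini example
train_sample = ["The minister spoke today.", "Prices rose again.", "The match was cancelled."]
test_sample = ["Prices rose again.", "A new bridge was opened."]

leaked = find_overlap(train_sample, test_sample)
print(len(leaked), "of", len(test_sample), "test sentences also occur in the training data")
```

Exact matching is only a rough check, since near-duplicates would still slip through.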

Let's first see how big our data is:

In [None]:
#Execute this to see the size of news dataset
#NOTE: This is not Python! 
!wc news.en-tr.en news.en-tr.tr

  10007  197808 1180320 news.en-tr.tmx.en
  10007  149199 1255282 news.en-tr.tmx.tr
  20014  347007 2435602 total


Now let's take a portion of it, say the final 200 samples as testing data, and the rest as training data.

In [None]:
!head -n -200 news.en-tr.en > news.en-tr.train.en 
!tail -n 200 news.en-tr.en > news.en-tr.test.en 

!head -n -200 news.en-tr.tr > news.en-tr.train.tr 
!tail -n 200 news.en-tr.tr > news.en-tr.test.tr 
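If you prefer to stay in Python, the same kind of split can be sketched with list slicing. The helper name `split_holdout` and the sample data are hypothetical; the important point is that both language sides are cut at the same position so the lines stay aligned.

```python
def split_holdout(lines, n_test):
    """Split a list into (train, test), using the last n_test items as the test set."""
    if not 0 < n_test < len(lines):
        raise ValueError("n_test must be between 1 and len(lines) - 1")
    return lines[:-n_test], lines[-n_test:]


# Hypothetical line-aligned data: item i on each side belongs together
en_lines = ["en sentence %d" % i for i in range(10)]
tr_lines = ["tr sentence %d" % i for i in range(10)]

en_train, en_test = split_holdout(en_lines, 3)
tr_train, tr_test = split_holdout(tr_lines, 3)

print(len(en_train), len(en_test))
print(en_test[0], "|", tr_test[0])
```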

In [None]:
!wc news.en-tr.test.en news.en-tr.test.tr

   9987  197548 1178714 news.en-tr.train.en
   9987  149030 1253863 news.en-tr.train.tr
  19974  346578 2432577 total
  20  260 1606 news.en-tr.test.en
  20  169 1419 news.en-tr.test.tr
  40  429 3025 total


OK, now we have a separate training and testing set. Let's see how our generic English-Turkish model performs on it.

We'll first translate the English portion of our test set using our model.

In [None]:
#You can use this function to read a file line by line into a list
def read_file_lines_to_list(filename):
  with open(filename, 'r', encoding='utf-8') as f:
    return [l.rstrip('\n') for l in f]

#You can use this function to write a list of strings into a file
def write_list_to_file(strlist, filename):
  with open(filename, 'w', encoding='utf-8') as f:
    for s in strlist:
      f.write(s+"\n")

#You can use this function to see first n elements in a list
def get_first_n(l, n):
  for elem in l[:n]:
    print(elem)

Now we can use the same code we used before to translate the sentences in our list

Let's see what the translations look like

Now, let's write the translations to a text file.

For doing BLEU evaluations, we can use the SacreBLEU package

In [None]:
!pip install sacrebleu



In [None]:
!sacrebleu ______ -i ________
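To get a feel for what a BLEU score measures, here is a simplified sketch of the computation: clipped n-gram precisions for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty. It is not SacreBLEU's implementation; it assumes whitespace tokenization, one reference per sentence and no smoothing, so its numbers will generally differ from what SacreBLEU reports. The function names are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU (0-100) for whitespace-tokenized sentences."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_counts = ngram_counts(hyp_tok, n)
            ref_counts = ngram_counts(ref_tok, n)
            # Clipping: an n-gram is credited at most as often as it occurs in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if min(matches) == 0:
        return 0.0  # without smoothing, a single zero precision zeroes the score
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: punish output that is shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


reference = ["the cat sat on the mat"]
print(simple_bleu(["the cat sat on the mat"], reference))
print(simple_bleu(["the cat sat on the mat today"], reference))
print(simple_bleu(["a dog slept"], reference))
```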

# Part 7 - Fine-tuning our model


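AfroTranslate and the JoeyNMT version we need for training are incompatible: AfroTranslate pins JoeyNMT 1.3, while for training we install a newer JoeyNMT from source.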

We need to restart our runtime at this step and install JoeyNMT and bpemb again. 

Don't worry, the files we have prepared will stay in their places. 

In [None]:
# Install JoeyNMT
! git clone https://github.com/joeynmt/joeynmt.git
! cd joeynmt; pip3 install .

In [None]:
from bpemb import BPEmb

BPE_VOCAB_SIZE = 5000
bpemb_en = BPEmb(lang="en", vs=BPE_VOCAB_SIZE, segmentation_only=True, preprocess=True)
bpemb_tr = BPEmb(lang="tr", vs=BPE_VOCAB_SIZE, segmentation_only=True, preprocess=True)

In [None]:
#BPE encode the in-domain training data
news_en_bpe = [' '.join(bpemb_en.encode(l[:-1])) for l in open('news.en-tr.en', 'r').readlines()]
news_tr_bpe = [' '.join(bpemb_tr.encode(l[:-1])) for l in open('news.en-tr.tr', 'r').readlines()]

#Allocate first 1000 samples as development set
news_dev_en_bpe = news_en_bpe[0:1000]
news_dev_tr_bpe = news_tr_bpe[0:1000]

#Allocate last 200 samples as test set
news_test_en_bpe = news_en_bpe[-200:]
news_test_tr_bpe = news_tr_bpe[-200:]

#Allocate rest as training data
news_train_en_bpe = news_en_bpe[1000:-200]
news_train_tr_bpe = news_tr_bpe[1000:-200]

In [None]:
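# NOTE: the runtime restart cleared our helper functions, so re-run the cell
# that defines write_list_to_file before running this one
write_list_to_file(news_train_en_bpe, "news.en-tr.train.BPE.en")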
write_list_to_file(news_train_tr_bpe, "news.en-tr.train.BPE.tr")
write_list_to_file(news_dev_en_bpe, "news.en-tr.dev.BPE.en")
write_list_to_file(news_dev_tr_bpe, "news.en-tr.dev.BPE.tr")
write_list_to_file(news_test_en_bpe, "news.en-tr.test.BPE.en")
write_list_to_file(news_test_tr_bpe, "news.en-tr.test.BPE.tr")

In [None]:
# Let's create a JoeyNMT config file for finetuning training
# Changes from previous config are dataset names, model name, batch size and learning rate

# Create the config
config = """
name: "entr_finetune"

data:
    src: "en"
    trg: "tr"
    train: "/content/news.en-tr.train.BPE"
    dev:   "/content/news.en-tr.dev.BPE"
    test:  "/content/news.en-tr.test.BPE"
    level: "bpe"
    lowercase: False
    max_sent_length: 150
    src_vocab: "/content/models/entr/src_vocab.txt"
    trg_vocab: "/content/models/entr/trg_vocab.txt"

testing:
    beam_size: 5
    alpha: 1.0

training:
    load_model: "/content/models/entr/best.ckpt" # Load base model from its best scoring checkpoint (from local)
    random_seed: 42
    optimizer: "adam"
    normalization: "tokens"
    adam_betas: [0.9, 0.999] 
    scheduling: "plateau"           # TODO: try switching from plateau to Noam scheduling
    patience: 5                     # For plateau: decrease learning rate by decrease_factor if validation score has not improved for this many validation rounds.
    learning_rate_factor: 0.5       # factor for Noam scheduler (used with Transformer)
    learning_rate_warmup: 1000      # warmup steps for Noam scheduler (used with Transformer)
    decrease_factor: 0.7
    loss: "crossentropy"
    learning_rate: 0.0001
    learning_rate_min: 0.00000001
    weight_decay: 0.0
    label_smoothing: 0.1
    batch_size: 128
    batch_type: "token"
    eval_batch_size: 64
    eval_batch_type: "token"
    batch_multiplier: 1
    early_stopping_metric: "ppl"
    epochs: 30                     # TODO: Decrease while experimenting. Around 30 is sufficient to check whether it's working at all
    validation_freq: 100          # TODO: Set to at least once per epoch.
    logging_freq: 100
    eval_metric: "bleu"
    model_dir: "/content/models/entr_finetune"
    overwrite: True               # TODO: Set to True if you want to overwrite possibly existing models. 
    shuffle: True
    use_cuda: True
    max_output_length: 100
    print_valid_sents: [0, 1, 2, 3]
    keep_best_ckpts: 3
    save_latest_ckpt: True
model:
    initializer: "xavier"
    bias_initializer: "zeros"
    init_gain: 1.0
    embed_initializer: "xavier"
    embed_init_gain: 1.0
    tied_embeddings: True
    tied_softmax: True
    encoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4             # TODO: Increase to 8 for larger data.
        embeddings:
            embedding_dim: 256   # TODO: Increase to 512 for larger data.
            scale: True
            dropout: 0.2
        # typically ff_size = 4 x hidden_size
        hidden_size: 256         # TODO: Increase to 512 for larger data.
        ff_size: 1024            # TODO: Increase to 2048 for larger data.
        dropout: 0.3
    decoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4              # TODO: Increase to 8 for larger data.
        embeddings:
            embedding_dim: 256    # TODO: Increase to 512 for larger data.
            scale: True
            dropout: 0.2
        # typically ff_size = 4 x hidden_size
        hidden_size: 256         # TODO: Increase to 512 for larger data.
        ff_size: 1024            # TODO: Increase to 2048 for larger data.
        dropout: 0.3
"""
with open("/content/entr_finetune.yaml",'w') as f:
    f.write(config)

In [None]:
! cd joeynmt; python3 -m joeynmt train "/content/entr_finetune.yaml"

2022-01-28 08:46:42,471 - INFO - root - Hello! This is Joey-NMT (version 1.5.1).
2022-01-28 08:46:42,497 - INFO - joeynmt.data - Loading training data...
2022-01-28 08:46:42,681 - INFO - joeynmt.data - Building vocabulary...
2022-01-28 08:46:44,037 - INFO - joeynmt.data - Loading dev data...
2022-01-28 08:46:44,055 - INFO - joeynmt.data - Loading test data...
2022-01-28 08:46:44,058 - INFO - joeynmt.data - Data loaded.
2022-01-28 08:46:44,058 - INFO - joeynmt.model - Building an encoder-decoder model...
2022-01-28 08:46:44,344 - INFO - joeynmt.model - Enc-dec model built.
2022-01-28 08:46:46,757 - INFO - joeynmt.training - Total params: 13372928
2022-01-28 08:46:49,639 - INFO - joeynmt.training - Loading model from /content/models/entr/best.ckpt
2022-01-28 08:46:49,951 - INFO - joeynmt.helpers -                           cfg.name : entr_transformer
2022-01-28 08:46:49,951 - INFO - joeynmt.helpers -                       cfg.data.src : en
2022-01-28 08:46:49,951 - INFO - joeynmt.helpers

Let's create a translator for our new finetuned model

In [None]:
finetuned_translator = MasakhaneTranslate(model_path="/content/models/entr_finetune")

ValueError: ignored

Now let's test on the finetuned model to see if it has improved on the test set



---



# Part 8 (homework) - Training a model from scratch

## Retrieve your data & make a parallel corpus

In this part we will use an open corpus from the OPUS repository to train a translation model. We will first download the data, split it into training, development and testing sets, and then use JoeyNMT to train a baseline model. 

In the next cell, you need to set the languages you want to work with and specify which corpus you want to use to train. 

To select a corpus, go to https://opus.nlpl.eu/, enter your language pair and pick the corpus you think is most appropriate (in terms of size and domain).

In [None]:
# TODO: Set your source and target languages. Keep in mind, these traditionally use language codes as found here:
# These will also become the suffixes of all vocab and corpus files used throughout
import os
source_language = "en"
target_language = "tr"

seed = 42  # Random seed for shuffling.
tag = "baseline" # Give a unique name to your folder - this is to ensure you don't rewrite any models you've already submitted

os.environ["src"] = source_language # Sets them in bash as well, since we often use bash scripts
os.environ["tgt"] = target_language
os.environ["tag"] = tag

In [None]:
# This will save it to a folder in our gdrive instead!
from google.colab import drive
drive.mount('/content/drive')

!mkdir -p "/content/drive/My Drive/mt-workshop/$src-$tgt-$tag"
os.environ["gdrive_path"] = "/content/drive/My Drive/mt-workshop/%s-%s-%s" % (source_language, target_language, tag)

!echo $gdrive_path

Mounted at /content/drive
/content/drive/My Drive/mt-workshop/en-tr-baseline


In [None]:
# Install opus-tools (Warning! This is not really python)
! pip install opustools-pkg

Collecting opustools-pkg
  Downloading opustools_pkg-0.0.52-py3-none-any.whl (80 kB)
Installing collected packages: opustools-pkg
Successfully installed opustools-pkg-0.0.52


In [None]:
# TODO: Indicate here the ID of the corpus you want to use from OPUS
opus_corpus = "TED2020" 
os.environ["corpus"] = opus_corpus

# Downloading our corpus 
! opus_read -d $corpus -s $src -t $tgt -wm moses -w $corpus.$src $corpus.$tgt -q

# Extract the corpus file
! gunzip ${corpus}_latest_xml_$src-$tgt.xml.gz


Alignment file /proj/nlpl/data/OPUS/TED2020/latest/xml/en-tr.xml.gz not found. The following files are available for downloading:

   2 MB https://object.pouta.csc.fi/OPUS-TED2020/v1/xml/en-tr.xml.gz
  47 MB https://object.pouta.csc.fi/OPUS-TED2020/v1/xml/en.zip
  37 MB https://object.pouta.csc.fi/OPUS-TED2020/v1/xml/tr.zip

  85 MB Total size
./TED2020_latest_xml_en-tr.xml.gz ... 100% of 2 MB
./TED2020_latest_xml_en.zip ... 100% of 47 MB
./TED2020_latest_xml_tr.zip ... 100% of 37 MB



In [None]:
# Read the corpus into python lists
source_file = opus_corpus + '.' + source_language
target_file = opus_corpus + '.' + target_language

src_all = [sentence.strip() for sentence in open(source_file).readlines()]
tgt_all = [sentence.strip() for sentence in open(target_file).readlines()]

In [None]:
# Let's take a peek at the files
print("Source size:", len(src_all))
print("Target size:", len(tgt_all))
print("--------")

peek_size = 5
for i in range(peek_size):
  print("Sent #", i)
  print("SRC:", src_all[i])
  print("TGT:", tgt_all[i])
  print("---------")

Source size: 374378
Target size: 374378
--------
Sent # 0
SRC: Thank you so much , Chris .
TGT: Çok teşekkür ederim Chris .
---------
Sent # 1
SRC: And it 's truly a great honor to have the opportunity to come to this stage twice ; I 'm extremely grateful .
TGT: Bu sahnede ikinci kez yer alma fırsatına sahip olmak gerçekten büyük bir onur . Çok minnettarım .
---------
Sent # 2
SRC: I have been blown away by this conference , and I want to thank all of you for the many nice comments about what I had to say the other night .
TGT: Bu konferansta çok mutlu oldum , ve anlattıklarımla ilgili güzel yorumlarınız için sizlere çok teşekkür ederim .
---------
Sent # 3
SRC: And I say that sincerely , partly because ( Mock sob ) I need that .
TGT: Bunu içtenlikle söylüyorum , çünkü ... ( Ağlama taklidi ) Buna ihtiyacım var .
---------
Sent # 4
SRC: ( Laughter ) Put yourselves in my position .
TGT: ( Kahkahalar ) Kendinizi benim yerime koyun !
---------


## Making training, development and testing sets

We need to pick training, development and testing sets from our corpus. The training set contains the sentences the model learns from. The development set is used to monitor how the model is progressing during training. Finally, the testing set is used to evaluate the trained model.

You can optionally load your own testing set. 

In [None]:
# TODO: Determine ratios of each set
all_size = len(src_all)
dev_size = 1000
test_size = 1000
train_size = all_size - test_size - dev_size

src_train = src_all[0:train_size]
tgt_train = tgt_all[0:train_size]

src_dev = src_all[train_size:train_size+dev_size]
tgt_dev = tgt_all[train_size:train_size+dev_size]

src_test = src_all[train_size+dev_size:all_size]
tgt_test = tgt_all[train_size+dev_size:all_size]

print("Set sizes")
print("All:", len(src_all))
print("Train:", len(src_train))
print("Dev:", len(src_dev))
print("Test:", len(src_test))

Set sizes
All: 374378
Train: 372378
Dev: 1000
Test: 1000
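The split above takes the dev and test sentences from the end of the corpus in its original order. Since a `seed` for shuffling was defined earlier, one option is to shuffle the sentence pairs before slicing. The sketch below shows one way to do that while keeping source and target aligned; `shuffle_parallel` is a hypothetical helper, and whether shuffling is desirable depends on your corpus.

```python
import random


def shuffle_parallel(src_sentences, tgt_sentences, seed=42):
    """Shuffle two aligned lists in unison, reproducibly for a given seed."""
    if len(src_sentences) != len(tgt_sentences):
        raise ValueError("source and target must have the same number of sentences")
    pairs = list(zip(src_sentences, tgt_sentences))
    random.Random(seed).shuffle(pairs)  # a local generator leaves the global random state alone
    shuffled_src = [s for s, _ in pairs]
    shuffled_tgt = [t for _, t in pairs]
    return shuffled_src, shuffled_tgt


# Hypothetical aligned data
demo_src = ["src %d" % i for i in range(8)]
demo_tgt = ["tgt %d" % i for i in range(8)]

shuffled_src, shuffled_tgt = shuffle_parallel(demo_src, demo_tgt, seed=42)
for s, t in zip(shuffled_src[:3], shuffled_tgt[:3]):
    print(s, "<->", t)
```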


## Preprocessing the Data into Subword BPE Tokens

- One of the most powerful improvements for neural machine translation is BPE tokenization [(Sennrich et al., 2015)](https://arxiv.org/abs/1508.07909).

- BPE tokenization caps the vocabulary at a fixed size by splitting rare words into smaller subword units.

- This is especially useful for agglutinative languages (like Turkish), where the vocabulary is effectively endless.

- Below you have the scripts for doing BPE tokenization of our data. We use the bpemb library, which provides pre-trained BPE models, to convert our data into subwords.

In [None]:
! pip install bpemb
from bpemb import BPEmb

BPE_VOCAB_SIZE = 5000
bpemb_src = BPEmb(lang=source_language, vs=BPE_VOCAB_SIZE, segmentation_only=True, preprocess=False)
bpemb_tgt = BPEmb(lang=target_language, vs=BPE_VOCAB_SIZE, segmentation_only=True, preprocess=False)

Collecting bpemb
  Downloading bpemb-0.3.3-py3-none-any.whl (19 kB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Installing collected packages: sentencepiece, bpemb
Successfully installed bpemb-0.3.3 sentencepiece-0.1.96
downloading https://nlp.h-its.org/bpemb/en/en.wiki.bpe.vs5000.model


100%|██████████| 315918/315918 [00:00<00:00, 557183.59B/s]


downloading https://nlp.h-its.org/bpemb/tr/tr.wiki.bpe.vs5000.model


100%|██████████| 315775/315775 [00:00<00:00, 713720.23B/s]


In [None]:
# Testing BPE encoding
encoded_tokens = bpemb_src.encode("This is a test sentence to demonstrate how BPE encoding works for our source language.")
print(encoded_tokens)

encoded_string = " ".join(encoded_tokens)
print(encoded_string)

decoded_string = bpemb_src.decode(encoded_tokens)
print(decoded_string)

['▁', 'T', 'h', 'is', '▁is', '▁a', '▁test', '▁sent', 'ence', '▁to', '▁demonstr', 'ate', '▁how', '▁', 'BPE', '▁enc', 'od', 'ing', '▁works', '▁for', '▁our', '▁source', '▁language', '.']
▁ T h is ▁is ▁a ▁test ▁sent ence ▁to ▁demonstr ate ▁how ▁ BPE ▁enc od ing ▁works ▁for ▁our ▁source ▁language .
This is a test sentence to demonstrate how BPE encoding works for our source language.


In [None]:
# Shortcut functions to encode and decode
def encode_bpe(string, lang, to_lower=True):
  if to_lower:
    string = string.lower()
  if lang == source_language:
    return " ".join(bpemb_src.encode(string))
  elif lang == target_language:
    return " ".join(bpemb_tgt.encode(string))
  else:
    return ""

def decode_bpe(string, lang):
  tokens = string.strip().split()
  if lang == source_language:
    return bpemb_src.decode(tokens)
  elif lang == target_language:
    return bpemb_tgt.decode(tokens)
  else:
    return ""

In [None]:
# Let's encode all our sets with BPE
src_train_bpe = [encode_bpe(sentence, source_language) for sentence in src_train]
tgt_train_bpe = [encode_bpe(sentence, target_language) for sentence in tgt_train]

src_dev_bpe = [encode_bpe(sentence, source_language) for sentence in src_dev]
tgt_dev_bpe = [encode_bpe(sentence, target_language) for sentence in tgt_dev]

src_test_bpe = [encode_bpe(sentence, source_language) for sentence in src_test]
tgt_test_bpe = [encode_bpe(sentence, target_language) for sentence in tgt_test]

In [None]:
# Now let's write all our sets into separate files
# (plain-text and BPE-encoded versions of train, dev and test)
splits = {
  "train": (src_train, tgt_train),
  "dev": (src_dev, tgt_dev),
  "test": (src_test, tgt_test),
  "train.bpe": (src_train_bpe, tgt_train_bpe),
  "dev.bpe": (src_dev_bpe, tgt_dev_bpe),
  "test.bpe": (src_test_bpe, tgt_test_bpe),
}

for prefix, (src_sents, tgt_sents) in splits.items():
  src_path = prefix + "." + source_language
  tgt_path = prefix + "." + target_language
  # Write both sides in the same loop so the files stay line-aligned
  with open(src_path, "w", encoding="utf-8") as src_file, open(tgt_path, "w", encoding="utf-8") as tgt_file:
    for s, t in zip(src_sents, tgt_sents):
      src_file.write(s+"\n")
      tgt_file.write(t+"\n")

In [None]:
# Doublecheck the files. There should be no extra quotation marks or weird characters.
! head -n5 train.*
! head -n5 dev.*
! head -n5 test.*

==> train.bpe.en <==
▁than k ▁you ▁so ▁much ▁, ▁ch ris ▁.
▁and ▁it ▁' s ▁tr u ly ▁a ▁great ▁honor ▁to ▁have ▁the ▁opportun ity ▁to ▁come ▁to ▁this ▁stage ▁twice ▁; ▁i ▁' m ▁extrem ely ▁gr ate ful ▁.
▁i ▁have ▁been ▁bl own ▁away ▁by ▁this ▁conference ▁, ▁and ▁i ▁want ▁to ▁than k ▁all ▁of ▁you ▁for ▁the ▁many ▁n ice ▁com ments ▁about ▁what ▁i ▁had ▁to ▁say ▁the ▁other ▁night ▁.
▁and ▁i ▁say ▁that ▁s inc er ely ▁, ▁part ly ▁because ▁( ▁m ock ▁so b ▁ ) ▁i ▁need ▁that ▁.
▁( ▁la ugh ter ▁ ) ▁put ▁y ours elves ▁in ▁my ▁position ▁.

==> train.bpe.tr <==
▁çok ▁teş ek kür ▁eder im ▁chris ▁.
▁bu ▁sahne de ▁ikinci ▁kez ▁yer ▁al ma ▁fır sat ına ▁sahip ▁olmak ▁ger ç ekten ▁büyük ▁bir ▁onur ▁. ▁çok ▁min net tar ım ▁.
▁bu ▁konfer ans ta ▁çok ▁mut lu ▁ol d um ▁, ▁ve ▁anlat t ıkları m la ▁ilgili ▁güzel ▁yorum ların ız ▁için ▁s iz lere ▁çok ▁teş ek kür ▁eder im ▁.
▁bunu ▁iç ten likle ▁söy l üyor um ▁, ▁çünkü ▁... ▁( ▁ağ lama ▁tak li di ▁) ▁buna ▁ihtiy ac ım ▁var ▁.
▁( ▁kah k ah alar ▁) ▁kend in izi ▁beni

In [None]:
# If creating data for the first time, move all prepared data to the mounted location in google drive
! mkdir -p "$gdrive_path"/data
! cp train.* "$gdrive_path"/data
! cp test.* "$gdrive_path"/data
! cp dev.* "$gdrive_path"/data
! ls "$gdrive_path"/data  #See the contents of the drive directory

dev.bpe.en  dev.en  test.bpe.en  test.en  train.bpe.en	train.en
dev.bpe.tr  dev.tr  test.bpe.tr  test.tr  train.bpe.tr	train.tr


In [None]:
# OR... If continuing from previous run, load files from drive
! cp "$gdrive_path"/data/dev.* .
! cp "$gdrive_path"/data/train.* .
! cp "$gdrive_path"/data/test.* .



---


## Installation of JoeyNMT

JoeyNMT is a simple, minimalist NMT package which is useful for learning and teaching. Check out the documentation for JoeyNMT [here](https://joeynmt.readthedocs.io)  

In [None]:
#IMPORTANT: Restart runtime if you have installed AfroTranslate

# Install JoeyNMT
! git clone https://github.com/joeynmt/joeynmt.git
! cd joeynmt; pip3 install .

fatal: destination path 'joeynmt' already exists and is not an empty directory.
Processing /content/joeynmt
  DEPRECATION: A future pip version will change local packages to be built in-place without first copying to a temporary directory. We recommend you use --use-feature=in-tree-build to test your packages with this new behavior before it becomes the default.
   pip 21.3 will remove support for this functionality. You can find discussion regarding this at https://github.com/pypa/pip/issues/7555.
Collecting torch>=1.9.0
  Using cached torch-1.10.2-cp37-cp37m-manylinux1_x86_64.whl (881.9 MB)
Collecting torchtext>=0.10.0
  Using cached torchtext-0.11.2-cp37-cp37m-manylinux1_x86_64.whl (8.0 MB)
Building wheels for collected packages: joeynmt
  Building wheel for joeynmt (setup.py) ... done
  Created wheel for joeynmt: filename=joeynmt-1.5.1-py3-none-any.whl size=86003 sha256=f0fd9a53198708427af3f8e21bba478d2a83c8bf1f6b08083466061fbae5bf19
  Stored in directory: /tmp

In [None]:
#Move everything important under joeynmt directory
os.environ["data_path"] = os.path.join("joeynmt", "data", source_language + target_language)

# Create directory, copy everything we care about to the correct location
! mkdir -p $data_path
! cp train.* $data_path
! cp test.* $data_path
! cp dev.* $data_path
! ls $data_path

# Create that vocab using build_vocab
! sudo chmod 777 joeynmt/scripts/build_vocab.py
! joeynmt/scripts/build_vocab.py joeynmt/data/$src$tgt/train.bpe.$src joeynmt/data/$src$tgt/train.bpe.$tgt --output_path joeynmt/data/$src$tgt/vocab.txt

# Some output
! echo "Combined BPE Vocab"
! head -n 10 joeynmt/data/$src$tgt/vocab.txt

# Backup vocab to drive
! cp joeynmt/data/$src$tgt/vocab.txt "$gdrive_path"/data


dev.bpe.en  dev.en  test.bpe.en  test.en  train.bpe.en	train.en
dev.bpe.tr  dev.tr  test.bpe.tr  test.tr  train.bpe.tr	train.tr
Combined BPE Vocab
774
531
883
6397
794
431
381
1414
761
548


## Creating the JoeyNMT Config

JoeyNMT requires a YAML config file. We provide a template below, with a number of defaults already set that you are welcome to experiment with.

- We use the Transformer architecture.
- We set dropout reasonably high, at 0.3 (as recommended in [(Sennrich and Zhang, 2019)](https://www.aclweb.org/anthology/P19-1021)).

Things worth playing with:
- The batch size (changing it is also recommended for low-resource languages)
- The number of epochs (we've set it to 30 just so it runs in about an hour, for testing purposes)
- The decoder options (beam_size, alpha)
- The evaluation metric (BLEU versus chrF)
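
To get a feel for the `alpha` decoder option, here is a minimal sketch of a GNMT-style length penalty, which is the kind of normalization `alpha` is assumed to control in beam search. The two hypotheses and their scores are made up, and the exact formula JoeyNMT applies may differ in detail.

```python
def length_penalty(length, alpha):
    """GNMT-style length penalty: ((5 + length) / 6) ** alpha."""
    return ((5.0 + length) / 6.0) ** alpha


def rescored(log_prob, length, alpha):
    """Divide a hypothesis' summed log-probability by its length penalty."""
    return log_prob / length_penalty(length, alpha)


# Two made-up hypotheses: a short one, and a longer one with a lower raw score
short = {"log_prob": -4.0, "length": 4}
longer = {"log_prob": -5.0, "length": 10}

for alpha in (0.0, 1.0):
    s = rescored(short["log_prob"], short["length"], alpha)
    l = rescored(longer["log_prob"], longer["length"], alpha)
    winner = "short" if s > l else "longer"
    print(f"alpha={alpha}: short={s:.3f} longer={l:.3f} -> {winner} hypothesis wins")
```

With `alpha = 0` the raw scores decide and the short hypothesis wins; with `alpha = 1.0` the longer hypothesis is no longer punished for its length and comes out ahead.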

In [None]:
# This creates the config file for our JoeyNMT system. It might seem overwhelming so we've provided a couple of useful parameters you'll need to update
# (You can of course play with all the parameters if you'd like!)

name = '%s%s' % (source_language, target_language)
gdrive_path = os.environ["gdrive_path"]

# Create the config
config = """
name: "{name}_transformer"

data:
    src: "{source_language}"
    trg: "{target_language}"
    train: "data/{name}/train.bpe"
    dev:   "data/{name}/dev.bpe"
    test:  "data/{name}/test.bpe"
    level: "bpe"
    lowercase: False
    max_sent_length: 100
    src_vocab: "data/{name}/vocab.txt"
    trg_vocab: "data/{name}/vocab.txt"

testing:
    beam_size: 5
    alpha: 1.0

training:
    #load_model: "models/entr_transformer/1000.ckpt"
    load_model: "{gdrive_path}/models/{name}_transformer/1000.ckpt" # continue training from this checkpoint; comment this line out to train from scratch
    random_seed: 42
    optimizer: "adam"
    normalization: "tokens"
    adam_betas: [0.9, 0.999] 
    scheduling: "plateau"           # TODO: try switching from plateau to Noam scheduling
    patience: 5                     # For plateau: decrease learning rate by decrease_factor if validation score has not improved for this many validation rounds.
    learning_rate_factor: 0.5       # factor for Noam scheduler (used with Transformer)
    learning_rate_warmup: 1000      # warmup steps for Noam scheduler (used with Transformer)
    decrease_factor: 0.7
    loss: "crossentropy"
    learning_rate: 0.0003
    learning_rate_min: 0.00000001
    weight_decay: 0.0
    label_smoothing: 0.1
    batch_size: 4096
    batch_type: "token"
    eval_batch_size: 3600
    eval_batch_type: "token"
    batch_multiplier: 1
    early_stopping_metric: "ppl"
    epochs: 30                     # TODO: Decrease when you only want to check that things work. Around 30 is enough to see whether the model is learning at all
    validation_freq: 1000          # TODO: Set to at least once per epoch.
    logging_freq: 100
    eval_metric: "bleu"
    model_dir: "models/{name}_transformer"
    overwrite: True               # TODO: Set to True if you want to overwrite possibly existing models. 
    shuffle: True
    use_cuda: True
    max_output_length: 100
    print_valid_sents: [0, 1, 2, 3]
    keep_last_ckpts: 3

model:
    initializer: "xavier"
    bias_initializer: "zeros"
    init_gain: 1.0
    embed_initializer: "xavier"
    embed_init_gain: 1.0
    tied_embeddings: True
    tied_softmax: True
    encoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4             # TODO: Increase to 8 for larger data.
        embeddings:
            embedding_dim: 256   # TODO: Increase to 512 for larger data.
            scale: True
            dropout: 0.2
        # typically ff_size = 4 x hidden_size
        hidden_size: 256         # TODO: Increase to 512 for larger data.
        ff_size: 1024            # TODO: Increase to 2048 for larger data.
        dropout: 0.3
    decoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4              # TODO: Increase to 8 for larger data.
        embeddings:
            embedding_dim: 256    # TODO: Increase to 512 for larger data.
            scale: True
            dropout: 0.2
        # typically ff_size = 4 x hidden_size
        hidden_size: 256         # TODO: Increase to 512 for larger data.
        ff_size: 1024            # TODO: Increase to 2048 for larger data.
        dropout: 0.3
""".format(name=name, gdrive_path=os.environ["gdrive_path"], source_language=source_language, target_language=target_language)
with open("joeynmt/configs/transformer_{name}.yaml".format(name=name),'w') as f:
    f.write(config)
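
The config above leaves a TODO about switching from plateau to Noam scheduling. As a rough illustration, the sketch below evaluates the standard Noam formula (linear warmup followed by inverse-square-root decay) using the `learning_rate_factor`, `learning_rate_warmup` and `hidden_size` values from the config. The function `noam_lr` is a hypothetical helper, and JoeyNMT's own scheduler may differ in detail.

```python
def noam_lr(step, hidden_size=256, warmup=1000, factor=0.5):
    """Noam schedule: learning rate rises linearly during warmup, then decays as step ** -0.5."""
    step = max(step, 1)  # guard against step 0
    return factor * hidden_size ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)


for step in (1, 500, 1000, 4000, 16000):
    print(f"step {step:>6}: lr = {noam_lr(step):.6f}")
```

The rate peaks at the warmup step and then halves every time the step count quadruples, which is quite different from the plateau scheduler's behaviour of only reacting when validation stops improving.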

# Train the Model

This single JoeyNMT command runs the training using the config we made above.
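
Before launching a long run, it can help to estimate how many updates one epoch takes, since the config asks for `validation_freq` to be set to at least once per epoch. The sketch below is a back-of-the-envelope estimate for token-based batching. The corpus size and average sentence length are made-up assumptions, and the real number of batches depends on how JoeyNMT packs sentences.

```python
import math


def steps_per_epoch(sentence_lengths, batch_size_tokens):
    """Rough number of updates per epoch when batches are limited by token count."""
    total_tokens = sum(sentence_lengths)
    return math.ceil(total_tokens / batch_size_tokens)


# Hypothetical corpus: 100,000 sentences of 25 BPE tokens each
toy_lengths = [25] * 100_000

estimate = steps_per_epoch(toy_lengths, batch_size_tokens=4096)
print("Estimated updates per epoch:", estimate)
```

If the estimate is well below the configured `validation_freq`, the model is validated less than once per epoch, so lowering `validation_freq` would give more frequent feedback.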

In [None]:
# Train the model
# You can press Ctrl-C to stop, then run the next cell to save your checkpoints!
!cd joeynmt; python3 -m joeynmt train configs/transformer_$src$tgt.yaml

2021-07-28 16:52:38,026 - INFO - root - Hello! This is Joey-NMT (version 1.3).
2021-07-28 16:52:38,058 - INFO - joeynmt.data - Loading training data...
2021-07-28 16:52:46,845 - INFO - joeynmt.data - Building vocabulary...
2021-07-28 16:52:48,172 - INFO - joeynmt.data - Loading dev data...
2021-07-28 16:52:48,225 - INFO - joeynmt.data - Loading test data...
2021-07-28 16:52:48,238 - INFO - joeynmt.data - Data loaded.
2021-07-28 16:52:48,238 - INFO - joeynmt.model - Building an encoder-decoder model...
2021-07-28 16:52:48,484 - INFO - joeynmt.model - Enc-dec model built.
2021-07-28 16:52:48.642085: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-07-28 16:52:49,639 - INFO - joeynmt.training - Total params: 13372928
2021-07-28 16:52:52,915 - INFO - joeynmt.training - Loading model from /content/drive/My Drive/mt-workshop/en-tr-baseline/models/entr_transformer/1000.ckpt
2021-07-28 16:52:53,369 - INFO - joeynmt.helpe

In [None]:
# Copy the created models from the notebook storage to Google Drive for persistent storage
! mkdir -p "$gdrive_path/models/${src}${tgt}_transformer/"
! cp -r joeynmt/models/${src}${tgt}_transformer/* "$gdrive_path/models/${src}${tgt}_transformer/"

In [None]:
# OR... If continuing from previous work, load models from google drive to notebook storage  
! mkdir -p joeynmt/models/${src}${tgt}_transformer
! cp -r "$gdrive_path/models/${src}${tgt}_transformer" joeynmt/models

In [None]:
# Output our validation scores (loss, perplexity, BLEU and learning rate per validation round)
! cat "$gdrive_path/models/${src}${tgt}_transformer/validations.txt"

Steps: 1000	Loss: 4971087.00000	PPL: 85.69424	bleu: 2.36058	LR: 0.00030000	*
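
Once `validations.txt` has more than a handful of lines, it is easier to read programmatically. Below is a minimal sketch that parses lines in the tab-separated format shown above. `parse_validation_line` and `best_record` are hypothetical helpers, and treating the trailing `*` as a "new best" marker is an assumption.

```python
def parse_validation_line(line):
    """Parse one tab-separated line of validations.txt into a dict."""
    record = {}
    for field in line.strip().split("\t"):
        if ":" not in field:
            continue  # the trailing "*" marker has no "key: value" form
        key, value = field.split(":", 1)
        record[key.strip().lower()] = float(value)
    # Assumption: a trailing "*" flags a new best validation score
    record["is_best"] = line.rstrip().endswith("*")
    return record


def best_record(records, key="ppl", lower_is_better=True):
    """Pick the validation round with the best value for `key`."""
    choose = min if lower_is_better else max
    return choose(records, key=lambda r: r[key])


# The line from the output above
sample = "Steps: 1000\tLoss: 4971087.00000\tPPL: 85.69424\tbleu: 2.36058\tLR: 0.00030000\t*"
rec = parse_validation_line(sample)
print(rec["steps"], rec["ppl"], rec["bleu"], rec["is_best"])
```

To use it on a real run, read the file line by line, parse each line, and call `best_record(records, key="bleu", lower_is_better=False)` to find the round with the highest BLEU.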


In [None]:
# Test our model
! cd joeynmt; python3 -m joeynmt test "$gdrive_path/models/${src}${tgt}_transformer/config.yaml"

2021-07-28 16:28:56,802 - INFO - root - Hello! This is Joey-NMT (version 1.3).
2021-07-28 16:28:56,802 - INFO - joeynmt.data - Building vocabulary...
2021-07-28 16:28:57,990 - INFO - joeynmt.data - Loading dev data...
2021-07-28 16:28:58,003 - INFO - joeynmt.data - Loading test data...
2021-07-28 16:28:58,014 - INFO - joeynmt.data - Data loaded.
2021-07-28 16:28:58,045 - INFO - joeynmt.prediction - Process device: cuda, n_gpu: 1, batch_size per device: 18000 (with beam_size)
2021-07-28 16:29:01,412 - INFO - joeynmt.model - Building an encoder-decoder model...
2021-07-28 16:29:01,642 - INFO - joeynmt.model - Enc-dec model built.
2021-07-28 16:29:01,712 - INFO - joeynmt.prediction - Decoding on dev set (data/entr/dev.bpe.tr)...
2021-07-28 16:30:08,794 - INFO - joeynmt.prediction -  dev bleu[13a]:   0.97 [Beam search decoding with beam size = 5 and alpha = 1.0]
2021-07-28 16:30:08,795 - INFO - joeynmt.prediction - Decoding on test set (data/entr/test.bpe.tr)...
2021-07-28 16:31:30,255 - I

# Fine-tuning to domain

One important technique in neural machine translation is domain adaptation, or fine-tuning. It adapts the model to the specific domain we want to translate in.

One simple way of doing this is to take a pre-trained model and continue training it on our in-domain training set.

In this example we're going to fine-tune our model on news text.
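
In config terms, continuing training mostly means reusing the baseline settings and overriding a few of them. The sketch below illustrates that idea with plain dictionaries, using the values that differ between this notebook's two configs (data files, checkpoint to load, learning rate, batch size). `make_finetune_config` is a hypothetical helper; the notebook itself writes the full YAML by hand.

```python
import copy


def make_finetune_config(base, overrides):
    """Return a deep copy of `base` with the nested `overrides` applied."""
    cfg = copy.deepcopy(base)
    for section, values in overrides.items():
        cfg.setdefault(section, {}).update(values)
    return cfg


# Abbreviated baseline settings
base_cfg = {
    "data": {"train": "data/entr/train.bpe", "dev": "data/entr/dev.bpe", "test": "data/entr/test.bpe"},
    "training": {"learning_rate": 0.0003, "batch_size": 4096, "epochs": 30},
}

# Only the settings that change for fine-tuning
finetune_overrides = {
    "data": {"train": "data/entr/finetrain.bpe", "dev": "data/entr/finedev.bpe", "test": "data/entr/finetest.bpe"},
    "training": {
        "load_model": "models/entr_transformer/best.ckpt",  # start from the pre-trained weights
        "learning_rate": 0.0001,  # smaller steps, to avoid wiping out what was learned
        "batch_size": 1028,
    },
}

finetune_cfg = make_finetune_config(base_cfg, finetune_overrides)
print(finetune_cfg["training"])
```

The vocabulary and model architecture are deliberately left untouched: the checkpoint can only be loaded into a model of the same shape.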

In [None]:
fine_corpus = "WMT-News" 
os.environ["fine"] = fine_corpus

# Downloading our corpus 
! opus_read -d $fine -s $src -t $tgt -wm moses -w $fine.$src $fine.$tgt -q

# Extract the corpus file
! gunzip ${fine}_latest_xml_$src-$tgt.xml.gz



Alignment file /proj/nlpl/data/OPUS/WMT-News/latest/xml/en-tr.xml.gz not found. The following files are available for downloading:

  92 KB https://object.pouta.csc.fi/OPUS-WMT-News/v2019/xml/en-tr.xml.gz
  63 MB https://object.pouta.csc.fi/OPUS-WMT-News/v2019/xml/en.zip
   2 MB https://object.pouta.csc.fi/OPUS-WMT-News/v2019/xml/tr.zip

  66 MB Total size
./WMT-News_latest_xml_en-tr.xml.gz ... 100% of 92 KB
./WMT-News_latest_xml_en.zip ... 100% of 63 MB
./WMT-News_latest_xml_tr.zip ... 100% of 2 MB


In [None]:
# Read the corpus and BPE-encode every sentence
source_file = fine_corpus + '.' + source_language
target_file = fine_corpus + '.' + target_language

# Context managers close the files; UTF-8 is set explicitly for the Turkish text
with open(source_file, encoding='utf-8') as f:
    fine_src_all_bpe = [encode_bpe(sentence.strip(), source_language) for sentence in f]
with open(target_file, encoding='utf-8') as f:
    fine_tgt_all_bpe = [encode_bpe(sentence.strip(), target_language) for sentence in f]
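
Downloaded corpora sometimes contain empty or extremely long lines. The notebook does not filter them, but a minimal sketch of such a check could look like the following. `filter_pairs` is a hypothetical helper, and the default `max_len` of 100 simply mirrors `max_sent_length` in the config.

```python
def filter_pairs(src_lines, tgt_lines, max_len=100):
    """Keep aligned pairs where both sides are non-empty and at most max_len tokens."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")
    kept_src, kept_tgt = [], []
    for s, t in zip(src_lines, tgt_lines):
        s_len, t_len = len(s.split()), len(t.split())
        if 0 < s_len <= max_len and 0 < t_len <= max_len:
            kept_src.append(s)
            kept_tgt.append(t)
    return kept_src, kept_tgt


# Toy example: the second pair has an empty target, the third is too long for max_len=3
toy_src = ["12 7 3", "44 9", "1 2 3 4 5"]
toy_tgt = ["8 15", "", "6 7"]

clean_src, clean_tgt = filter_pairs(toy_src, toy_tgt, max_len=3)
print(clean_src, clean_tgt)
```

Filtering both sides together keeps the two lists aligned, which matters because line i of the source file must stay the translation of line i of the target file.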


In [None]:
# Let's take a peek at the files
print("Source size:", len(fine_src_all_bpe))
print("Target size:", len(fine_tgt_all_bpe))
print("--------")

peek_size = 5
for i in range(peek_size):
  print("Sent #", i)
  print("SRC:", decode_bpe(fine_src_all_bpe[i], source_language))
  print("TGT:", decode_bpe(fine_tgt_all_bpe[i], target_language))
  print("---------")

Source size: 20016
Target size: 20016
--------
Sent # 0
SRC: two people drowned in floods in trabzon
TGT: trabzon ' da sel iki kişiyi yuttu
---------
Sent # 1
SRC: the ikisu creek overflowed on account of heavy rainfall in the district of yomra in trabzon .
TGT: trabzon ’ un yomra ilçesinde etkili olan sağanak yağış nedeniyle i̇kisu deresi taştı .
---------
Sent # 2
SRC: two women disappeared in the floodwaters in the village of tasdelen and the road to the village of sayvan was closed .
TGT: taşdelen köyünde sele kapılan iki kadın kaybolurken , sayvan köyü yolu ulaşıma kapandı .
---------
Sent # 3
SRC: the body of one of the women drowned in the floods was found .
TGT: selde kayıp olan iki kadından birinin cesedine ulaşıldı .
---------
Sent # 4
SRC: there was precipitation in the highlands of yomra at around 15 : 00 .
TGT: yağış , yomra ’ nın yüksek kesiminde saat 15.00 sıralarında etkili oldu .
---------


In [None]:
# Allocate train, dev, test portions
all_size = len(fine_src_all_bpe)
dev_size = 500
test_size = 500
train_size = all_size - test_size - dev_size

fine_src_train_bpe = fine_src_all_bpe[0:train_size]
fine_tgt_train_bpe = fine_tgt_all_bpe[0:train_size]

fine_src_dev_bpe = fine_src_all_bpe[train_size:train_size+dev_size]
fine_tgt_dev_bpe = fine_tgt_all_bpe[train_size:train_size+dev_size]

fine_src_test_bpe = fine_src_all_bpe[train_size+dev_size:all_size]
fine_tgt_test_bpe = fine_tgt_all_bpe[train_size+dev_size:all_size]

print("Set sizes")
print("All:", len(fine_src_all_bpe))
print("Train:", len(fine_src_train_bpe))
print("Dev:", len(fine_src_dev_bpe))
print("Test:", len(fine_src_test_bpe))

Set sizes
All: 20016
Train: 19016
Dev: 500
Test: 500
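
The same allocation can be wrapped in a small reusable function with a couple of sanity checks. This is a sketch rather than the notebook's own code; `split_parallel` is a hypothetical helper that takes the last sentences as dev and test, just like the cell above.

```python
def split_parallel(src, tgt, dev_size, test_size):
    """Split aligned sentence lists into train/dev/test, taking dev and test from the end."""
    if len(src) != len(tgt):
        raise ValueError("source and target must have the same number of sentences")
    train_size = len(src) - dev_size - test_size
    if train_size <= 0:
        raise ValueError("dev_size + test_size leaves no training data")
    bounds = {
        "train": (0, train_size),
        "dev": (train_size, train_size + dev_size),
        "test": (train_size + dev_size, len(src)),
    }
    return {name: (src[a:b], tgt[a:b]) for name, (a, b) in bounds.items()}


# Dummy data with the same size as the corpus above
dummy_src = [f"src {i}" for i in range(20016)]
dummy_tgt = [f"tgt {i}" for i in range(20016)]

splits = split_parallel(dummy_src, dummy_tgt, dev_size=500, test_size=500)
for name, (s, t) in splits.items():
    print(name, len(s), len(t))
```

Slicing source and target with the same boundaries guarantees that the pairs stay aligned and that no sentence lands in more than one set.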


In [None]:
# Store sentences as files
with open("finetrain.bpe."+source_language, "w") as src_file, open("finetrain.bpe."+target_language, "w") as tgt_file:
  for s, t in zip(fine_src_train_bpe, fine_tgt_train_bpe):
    src_file.write(s+"\n")
    tgt_file.write(t+"\n")

with open("finedev.bpe."+source_language, "w") as src_file, open("finedev.bpe."+target_language, "w") as tgt_file:
  for s, t in zip(fine_src_dev_bpe, fine_tgt_dev_bpe):
    src_file.write(s+"\n")
    tgt_file.write(t+"\n")

with open("finetest.bpe."+source_language, "w") as src_file, open("finetest.bpe."+target_language, "w") as tgt_file:
  for s, t in zip(fine_src_test_bpe, fine_tgt_test_bpe):
    src_file.write(s+"\n")
    tgt_file.write(t+"\n")



In [None]:
# If creating data for the first time, move all prepared data to the mounted location in google drive
! mkdir -p "$gdrive_path"/data
! cp finetrain.* "$gdrive_path"/data
! cp finetest.* "$gdrive_path"/data
! cp finedev.* "$gdrive_path"/data
! ls "$gdrive_path"/data  #See the contents of the drive directory

mkdir: cannot create directory ‘/content/drive/My Drive/mt-workshop/en-tr-baseline/data’: File exists
dev.bpe.en  finedev.bpe.en   finetrain.bpe.en  test.en	     train.en
dev.bpe.tr  finedev.bpe.tr   finetrain.bpe.tr  test.tr	     train.tr
dev.en	    finetest.bpe.en  test.bpe.en       train.bpe.en
dev.tr	    finetest.bpe.tr  test.bpe.tr       train.bpe.tr


In [None]:
# OR... If continuing from previous run, load finetuning data from drive
! cp "$gdrive_path"/data/finedev.* .
! cp "$gdrive_path"/data/finetrain.* .
! cp "$gdrive_path"/data/finetest.* .

In [None]:
# Move everything important under the joeynmt directory
os.environ["data_path"] = os.path.join("joeynmt", "data", source_language + target_language)

# Move fine-tuning data to data directory
! mkdir -p $data_path
! cp finetrain.* $data_path
! cp finetest.* $data_path
! cp finedev.* $data_path
! ls $data_path

finedev.bpe.en	finetest.bpe.en  finetrain.bpe.en
finedev.bpe.tr	finetest.bpe.tr  finetrain.bpe.tr


In [None]:
# Also, load models from google drive to notebook storage  
! mkdir -p joeynmt/models/${src}${tgt}_transformer
! cp -r "$gdrive_path/models/${src}${tgt}_transformer" joeynmt/models

In [None]:
# Let's create a config file for finetuning training
# Changes from previous config are dataset names, model name, batch size and learning rate

name = '%s%s' % (source_language, target_language)
gdrive_path = os.environ["gdrive_path"]

# Create the config
config = """
name: "{name}_transformer"

data:
    src: "{source_language}"
    trg: "{target_language}"
    train: "data/{name}/finetrain.bpe"
    dev:   "data/{name}/finedev.bpe"
    test:  "data/{name}/finetest.bpe"
    level: "bpe"
    lowercase: False
    max_sent_length: 100
    src_vocab: "data/{name}/vocab.txt"
    trg_vocab: "data/{name}/vocab.txt"

testing:
    beam_size: 5
    alpha: 1.0

training:
    load_model: "{gdrive_path}/models/{name}_transformer/best.ckpt" # Load base model from its best scoring checkpoint (from gdrive)
    #load_model: "models/{name}_transformer/best.ckpt" # Load base model from its best scoring checkpoint (from local)
    random_seed: 42
    optimizer: "adam"
    normalization: "tokens"
    adam_betas: [0.9, 0.999] 
    scheduling: "plateau"           # TODO: try switching from plateau to Noam scheduling
    patience: 5                     # For plateau: decrease learning rate by decrease_factor if validation score has not improved for this many validation rounds.
    learning_rate_factor: 0.5       # factor for Noam scheduler (used with Transformer)
    learning_rate_warmup: 1000      # warmup steps for Noam scheduler (used with Transformer)
    decrease_factor: 0.7
    loss: "crossentropy"
    learning_rate: 0.0001
    learning_rate_min: 0.00000001
    weight_decay: 0.0
    label_smoothing: 0.1
    batch_size: 1028
    batch_type: "token"
    eval_batch_size: 3600
    eval_batch_type: "token"
    batch_multiplier: 1
    early_stopping_metric: "ppl"
    epochs: 30                     # TODO: Decrease when you only want to check that things work. Around 30 is enough to see whether the model is learning at all
    validation_freq: 1000          # TODO: Set to at least once per epoch.
    logging_freq: 100
    eval_metric: "bleu"
    model_dir: "models/{name}_transformer_finetune"  # separate directory, so the base model is not overwritten
    overwrite: True               # TODO: Set to True if you want to overwrite possibly existing models. 
    shuffle: True
    use_cuda: True
    max_output_length: 100
    print_valid_sents: [0, 1, 2, 3]
    keep_last_ckpts: 3

model:
    initializer: "xavier"
    bias_initializer: "zeros"
    init_gain: 1.0
    embed_initializer: "xavier"
    embed_init_gain: 1.0
    tied_embeddings: True
    tied_softmax: True
    encoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4             # TODO: Increase to 8 for larger data.
        embeddings:
            embedding_dim: 256   # TODO: Increase to 512 for larger data.
            scale: True
            dropout: 0.2
        # typically ff_size = 4 x hidden_size
        hidden_size: 256         # TODO: Increase to 512 for larger data.
        ff_size: 1024            # TODO: Increase to 2048 for larger data.
        dropout: 0.3
    decoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4              # TODO: Increase to 8 for larger data.
        embeddings:
            embedding_dim: 256    # TODO: Increase to 512 for larger data.
            scale: True
            dropout: 0.2
        # typically ff_size = 4 x hidden_size
        hidden_size: 256         # TODO: Increase to 512 for larger data.
        ff_size: 1024            # TODO: Increase to 2048 for larger data.
        dropout: 0.3
""".format(name=name, gdrive_path=os.environ["gdrive_path"], source_language=source_language, target_language=target_language)
with open("joeynmt/configs/transformer_{name}_finetune.yaml".format(name=name),'w') as f:
    f.write(config)

In [None]:
# Test our model on our domain before fine-tuning
# No fine-tuned checkpoint exists yet, so point --ckpt at the base model explicitly
! cd joeynmt; python3 -m joeynmt test "configs/transformer_${src}${tgt}_finetune.yaml" --ckpt "$gdrive_path/models/${src}${tgt}_transformer/best.ckpt"

2021-07-29 09:54:48,631 - INFO - root - Hello! This is Joey-NMT (version 1.3).
2021-07-29 09:54:48,631 - INFO - joeynmt.data - Building vocabulary...
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/content/joeynmt/joeynmt/__main__.py", line 48, in <module>
    main()
  File "/content/joeynmt/joeynmt/__main__.py", line 38, in main
    output_path=args.output_path, save_attention=args.save_attention)
  File "/content/joeynmt/joeynmt/prediction.py", line 293, in test
    data_cfg=cfg["data"], datasets=["dev", "test"])
  File "/content/joeynmt/joeynmt/data.py", line 112, in load_data
    dataset=train_data, vocab_file=src_vocab_file)
  File "/content/joeynmt/joeynmt/vocabulary.py", line 161, in build_vocab
    vocab = Vocabulary(file=vocab_file)
  File "/content/joeynmt/joeynmt/vocabulary.py", line 40, in __init

In [None]:
# Train to our domain
# You can press Ctrl-C to stop, then run the next cell to save your checkpoints!
! cd joeynmt; python3 -m joeynmt train "configs/transformer_${src}${tgt}_finetune.yaml"

2021-07-28 17:35:43,552 - INFO - root - Hello! This is Joey-NMT (version 1.3).
2021-07-28 17:35:43,585 - INFO - joeynmt.data - Loading training data...
2021-07-28 17:35:52,185 - INFO - joeynmt.data - Building vocabulary...
2021-07-28 17:35:53,413 - INFO - joeynmt.data - Loading dev data...
2021-07-28 17:35:53,459 - INFO - joeynmt.data - Loading test data...
2021-07-28 17:35:53,469 - INFO - joeynmt.data - Data loaded.
2021-07-28 17:35:53,469 - INFO - joeynmt.model - Building an encoder-decoder model...
2021-07-28 17:35:53,695 - INFO - joeynmt.model - Enc-dec model built.
2021-07-28 17:35:53.930981: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-07-28 17:35:54,906 - INFO - joeynmt.training - Total params: 13372928
2021-07-28 17:35:58,161 - INFO - joeynmt.training - Loading model from /content/drive/My Drive/mt-workshop/en-tr-baseline/models/entr_transformer/best.ckpt
2021-07-28 17:35:58,607 - INFO - joeynmt.helpe

In [None]:
# Copy the created models from the notebook storage to Google Drive for persistent storage
! mkdir -p "$gdrive_path/models/${src}${tgt}_transformer_finetune/"
! cp -r joeynmt/models/${src}${tgt}_transformer_finetune/* "$gdrive_path/models/${src}${tgt}_transformer_finetune/"



In [None]:
# Test again to see how our model improved
#! cd joeynmt; python3 -m joeynmt test "configs/transformer_${src}${tgt}_finetune.yaml"

! cd joeynmt; python3 -m joeynmt test "$gdrive_path/models/${src}${tgt}_transformer_finetune/config.yaml"