# Machine Translation for Translators Workshop
Localization summer school '21

### Note before beginning:

#### - This coding template is based on Masakhane's starter notebook (https://github.com/masakhane-io/masakhane-mt)
#### - The idea is that you should be able to make minimal changes to this in order to get SOME result for your own translation corpus. 
#### - The TL;DR: Go to the **"TODO"** comments which will tell you what to update to get up and running
#### - If you actually want to have a clue what you're doing, read the text and peek at the links
#### - With 100 epochs, it should take around 7 hours to run in Google Colab

## Retrieve your data & make a parallel corpus

In this workshop we will use an open corpus from the OPUS repository to train a translation model. We will first download the data, split it into training, development and testing sets, and then use JoeyNMT to train a baseline model.

In the next cell, you need to set the languages you want to work with and specify which corpus you want to use to train. 

To select a corpus, go to https://opus.nlpl.eu/, enter your language pair and pick the corpus that best fits your needs (size, domain).

In [1]:
# TODO: Set your source and target languages. Keep in mind, these traditionally use language codes as found here:
# These will also become the suffixes of all vocab and corpus files used throughout
import os
source_language = "en"
target_language = "tr"
opus_corpus = "TED2020" 
lc = False  # If True, lowercase the data.
seed = 42  # Random seed for shuffling.
tag = "baseline" # Give a unique name to your folder - this is to ensure you don't overwrite any models you've already submitted

os.environ["src"] = source_language # Sets them in bash as well, since we often use bash scripts
os.environ["tgt"] = target_language
os.environ["corpus"] = opus_corpus
os.environ["tag"] = tag

In [2]:
# This will save it to a folder in our gdrive instead!
from google.colab import drive
drive.mount('/content/drive')

!mkdir -p "/content/drive/My Drive/masakhane/$src-$tgt-$tag"
os.environ["gdrive_path"] = "/content/drive/My Drive/masakhane/%s-%s-%s" % (source_language, target_language, tag)

!echo $gdrive_path

Mounted at /content/drive
/content/drive/My Drive/masakhane/en-tr-baseline


In [3]:
# Install opus-tools (Warning! This is not really python)
! pip install opustools-pkg

Collecting opustools-pkg
  Downloading opustools_pkg-0.0.52-py3-none-any.whl (80 kB)
Installing collected packages: opustools-pkg
Successfully installed opustools-pkg-0.0.52


In [4]:
# Downloading our corpus 
! opus_read -d $corpus -s $src -t $tgt -wm moses -w $corpus.$src $corpus.$tgt -q

# Extract the corpus file
! gunzip ${corpus}_latest_xml_$src-$tgt.xml.gz


Alignment file /proj/nlpl/data/OPUS/TED2020/latest/xml/en-tr.xml.gz not found. The following files are available for downloading:

   2 MB https://object.pouta.csc.fi/OPUS-TED2020/v1/xml/en-tr.xml.gz
  47 MB https://object.pouta.csc.fi/OPUS-TED2020/v1/xml/en.zip
  37 MB https://object.pouta.csc.fi/OPUS-TED2020/v1/xml/tr.zip

  86 MB Total size
./TED2020_latest_xml_en-tr.xml.gz ... 100% of 2 MB
./TED2020_latest_xml_en.zip ... 100% of 47 MB
./TED2020_latest_xml_tr.zip ... 100% of 37 MB


In [5]:
# Read the corpus into python lists
source_file = opus_corpus + '.' + source_language
target_file = opus_corpus + '.' + target_language

# Read as UTF-8 explicitly and close the files when done
with open(source_file, encoding="utf-8") as f:
  src_all = [sentence.strip() for sentence in f]
with open(target_file, encoding="utf-8") as f:
  tgt_all = [sentence.strip() for sentence in f]

In [6]:
# Let's take a peek at the files
print("Source size:", len(src_all))
print("Target size:", len(tgt_all))
print("--------")

peek_size = 5
for i in range(peek_size):
  print("Sent #", i)
  print("SRC:", src_all[i])
  print("TGT:", tgt_all[i])
  print("---------")

Source size: 374378
Target size: 374378
--------
Sent # 0
SRC: Thank you so much , Chris .
TGT: Çok teşekkür ederim Chris .
---------
Sent # 1
SRC: And it 's truly a great honor to have the opportunity to come to this stage twice ; I 'm extremely grateful .
TGT: Bu sahnede ikinci kez yer alma fırsatına sahip olmak gerçekten büyük bir onur . Çok minnettarım .
---------
Sent # 2
SRC: I have been blown away by this conference , and I want to thank all of you for the many nice comments about what I had to say the other night .
TGT: Bu konferansta çok mutlu oldum , ve anlattıklarımla ilgili güzel yorumlarınız için sizlere çok teşekkür ederim .
---------
Sent # 3
SRC: And I say that sincerely , partly because ( Mock sob ) I need that .
TGT: Bunu içtenlikle söylüyorum , çünkü ... ( Ağlama taklidi ) Buna ihtiyacım var .
---------
Sent # 4
SRC: ( Laughter ) Put yourselves in my position .
TGT: ( Kahkahalar ) Kendinizi benim yerime koyun !
---------
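Corpora pulled from OPUS can contain empty lines or pairs where one side is much longer than the other. The following is a minimal sketch of an optional cleaning step, not part of the original notebook: the helper name `filter_pairs` and the length-ratio threshold of 3.0 are assumptions chosen for illustration only.

```python
def filter_pairs(src_sentences, tgt_sentences, max_ratio=3.0):
    """Keep aligned pairs where both sides are non-empty and of similar length.

    max_ratio is an illustrative threshold (assumption), not a recommended value.
    """
    kept_src, kept_tgt = [], []
    for s, t in zip(src_sentences, tgt_sentences):
        s_len, t_len = len(s.split()), len(t.split())
        if s_len == 0 or t_len == 0:
            continue  # drop pairs with an empty side
        if max(s_len, t_len) / min(s_len, t_len) > max_ratio:
            continue  # drop pairs whose token counts differ too much
        kept_src.append(s)
        kept_tgt.append(t)
    return kept_src, kept_tgt


# Toy data: the second pair has an empty source, the third is badly unbalanced.
toy_src = ["Thank you so much , Chris .", "", "Yes ."]
toy_tgt = ["Çok teşekkür ederim Chris .", "Merhaba .",
           "Evet , kesinlikle öyle , buna hiç şüphe yok ."]
clean_src, clean_tgt = filter_pairs(toy_src, toy_tgt)
print(len(clean_src), "pairs kept out of", len(toy_src))
```

If you try this on the real data, apply it to `src_all` and `tgt_all` before splitting, so that the two sides stay aligned.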


## Making training, development and testing sets

We need to pick training, development and testing sets from our corpus. The training set contains the sentences that the model learns from. The development set is used to monitor how the model is progressing during training. Finally, the testing set is used to evaluate the trained model.

You can optionally load your own testing set. 

In [7]:
# TODO: Determine ratios of each set
train_ratio = 0.8
dev_ratio = 0.1
test_ratio = 0.1

all_size = len(src_all)
train_size = int(all_size * train_ratio)
dev_size = int(all_size * dev_ratio)
test_size = all_size - train_size - dev_size

src_train = src_all[0:train_size]
tgt_train = tgt_all[0:train_size]

src_dev = src_all[train_size:train_size+dev_size]
tgt_dev = tgt_all[train_size:train_size+dev_size]

src_test = src_all[train_size+dev_size:all_size]
tgt_test = tgt_all[train_size+dev_size:all_size]

print("Set sizes")
print("All:", all_size)
print("Train:", train_size)
print("Dev:", dev_size)
print("Test:", test_size)

Set sizes
All: 374378
Train: 299502
Dev: 37437
Test: 37439
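The split above takes slices in corpus order, so the development and testing sets come from the end of the corpus, and the `seed` variable set in the first cell is not used. As a sketch of an alternative, assuming you prefer a random split, the hypothetical helper `shuffled_split` below shuffles source and target together before cutting:

```python
import random


def shuffled_split(src_sentences, tgt_sentences, train_ratio=0.8, dev_ratio=0.1, seed=42):
    """Shuffle aligned pairs together, then cut them into train/dev/test lists."""
    pairs = list(zip(src_sentences, tgt_sentences))
    # A local Random instance keeps the result reproducible without
    # touching the global random state.
    random.Random(seed).shuffle(pairs)
    train_end = int(len(pairs) * train_ratio)
    dev_end = train_end + int(len(pairs) * dev_ratio)
    return pairs[:train_end], pairs[train_end:dev_end], pairs[dev_end:]


# Toy data standing in for src_all / tgt_all.
toy_src = ["s%d" % i for i in range(10)]
toy_tgt = ["t%d" % i for i in range(10)]
train, dev, test = shuffled_split(toy_src, toy_tgt)
print(len(train), len(dev), len(test))
```

A random split is a trade-off: sentences from the same talk can end up in both training and testing sets, which may make test scores look better than they would on unseen talks.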


## Preprocessing the Data into Subword BPE Tokens

- One of the most powerful improvements for neural machine translation is using BPE tokenization [(Sennrich et al., 2015)](https://arxiv.org/abs/1508.07909).

- BPE tokenization limits the vocabulary to a fixed size by splitting words into subwords.

- This is especially useful for agglutinative languages (like Turkish), where the vocabulary is effectively endless.

- Below you have the scripts for doing BPE tokenization of our data. We use the bpemb library, which provides pre-trained BPE models, to convert our data into subwords.
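To make the idea concrete, here is a toy sketch of how BPE merges are learned, following the procedure described in the Sennrich et al. paper. It is not what bpemb runs internally: bpemb ships pre-trained SentencePiece models, while this example uses a made-up four-word vocabulary with `</w>` as an end-of-word marker.

```python
import collections
import re


def get_pair_counts(vocab):
    """Count how often each adjacent symbol pair occurs, weighted by word frequency."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of the symbol pair with a single merged symbol."""
    bigram = re.escape(" ".join(pair))
    # Lookarounds make sure we only match whole symbols, not parts of them.
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}


# Toy vocabulary: words split into characters, mapped to their frequency.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}

merges = []
for _ in range(4):  # the number of merges controls the final vocabulary size
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    merges.append(best)

print(merges)
print(list(vocab))
```

Each merge adds one new symbol, so stopping after a fixed number of merges is what caps the vocabulary size. Frequent words end up as single tokens, while rare words stay split into smaller pieces.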

In [8]:
! pip install bpemb
from bpemb import BPEmb

BPE_VOCAB_SIZE = 5000
bpemb_src = BPEmb(lang=source_language, vs=BPE_VOCAB_SIZE, segmentation_only=True, preprocess=False)
bpemb_tgt = BPEmb(lang=target_language, vs=BPE_VOCAB_SIZE, segmentation_only=True, preprocess=False)

Collecting bpemb
  Downloading bpemb-0.3.3-py3-none-any.whl (19 kB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Installing collected packages: sentencepiece, bpemb
Successfully installed bpemb-0.3.3 sentencepiece-0.1.96
downloading https://nlp.h-its.org/bpemb/en/en.wiki.bpe.vs5000.model


100%|██████████| 315918/315918 [00:01<00:00, 298993.58B/s]


downloading https://nlp.h-its.org/bpemb/tr/tr.wiki.bpe.vs5000.model


100%|██████████| 315775/315775 [00:01<00:00, 293777.40B/s]


In [9]:
# Testing BPE encoding
encoded_tokens = bpemb_src.encode("This is a test sentence to demonstrate how BPE encoding works for our source language.")
print(encoded_tokens)

encoded_string = " ".join(encoded_tokens)
print(encoded_string)

decoded_string = bpemb_src.decode(encoded_tokens)
print(decoded_string)

['▁', 'T', 'h', 'is', '▁is', '▁a', '▁test', '▁sent', 'ence', '▁to', '▁demonstr', 'ate', '▁how', '▁', 'BPE', '▁enc', 'od', 'ing', '▁works', '▁for', '▁our', '▁source', '▁language', '.']
▁ T h is ▁is ▁a ▁test ▁sent ence ▁to ▁demonstr ate ▁how ▁ BPE ▁enc od ing ▁works ▁for ▁our ▁source ▁language .
This is a test sentence to demonstrate how BPE encoding works for our source language.
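The `▁` symbol in the output marks the start of a word, which is why the original sentence can be restored from the pieces. As a rough sketch of what decoding amounts to for this example (not bpemb's actual implementation; `join_pieces` is a hypothetical helper):

```python
# The pieces printed by the cell above.
pieces = ['▁', 'T', 'h', 'is', '▁is', '▁a', '▁test', '▁sent', 'ence', '▁to',
          '▁demonstr', 'ate', '▁how', '▁', 'BPE', '▁enc', 'od', 'ing', '▁works',
          '▁for', '▁our', '▁source', '▁language', '.']


def join_pieces(pieces):
    """Glue subword pieces together and turn word-start markers back into spaces."""
    return "".join(pieces).replace("▁", " ").strip()


print(join_pieces(pieces))
```

Note that capital letters are split off as separate pieces here, which suggests the pre-trained vocabulary contains mostly lowercase subwords.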


In [10]:
# Shortcut functions to encode and decode
def encode_bpe(string, lang):
  if lang == source_language:
    return " ".join(bpemb_src.encode(string))
  elif lang == target_language:
    return " ".join(bpemb_tgt.encode(string))
  else:
    return ""

def decode_bpe(string, lang):
  tokens = string.strip().split()
  if lang == source_language:
    return bpemb_src.decode(tokens)
  elif lang == target_language:
    return bpemb_tgt.decode(tokens)
  else:
    return ""

In [11]:
# Let's encode all our sets with BPE
src_train_bpe = [encode_bpe(sentence, source_language) for sentence in src_train]
tgt_train_bpe = [encode_bpe(sentence, target_language) for sentence in tgt_train]

src_dev_bpe = [encode_bpe(sentence, source_language) for sentence in src_dev]
tgt_dev_bpe = [encode_bpe(sentence, target_language) for sentence in tgt_dev]

src_test_bpe = [encode_bpe(sentence, source_language) for sentence in src_test]
tgt_test_bpe = [encode_bpe(sentence, target_language) for sentence in tgt_test]

In [13]:
# Now let's write all our sets into separate files

def write_parallel(prefix, src_sentences, tgt_sentences):
  """Write aligned sentences to <prefix>.<source> and <prefix>.<target>, one per line."""
  with open(prefix + "." + source_language, "w", encoding="utf-8") as src_file, \
       open(prefix + "." + target_language, "w", encoding="utf-8") as tgt_file:
    for s, t in zip(src_sentences, tgt_sentences):
      src_file.write(s + "\n")
      tgt_file.write(t + "\n")

# Plain-text sets
write_parallel("train", src_train, tgt_train)
write_parallel("dev", src_dev, tgt_dev)
write_parallel("test", src_test, tgt_test)

# BPE-encoded sets
write_parallel("train.bpe", src_train_bpe, tgt_train_bpe)
write_parallel("dev.bpe", src_dev_bpe, tgt_dev_bpe)
write_parallel("test.bpe", src_test_bpe, tgt_test_bpe)

In [14]:
# Double-check the files. There should be no extra quotation marks or weird characters.
! head -n5 train.*
! head -n5 dev.*
! head -n5 test.*

==> train.bpe.en <==
▁ T h ank ▁you ▁so ▁much ▁, ▁ C h ris ▁.
▁ A nd ▁it ▁' s ▁tr u ly ▁a ▁great ▁honor ▁to ▁have ▁the ▁opportun ity ▁to ▁come ▁to ▁this ▁stage ▁twice ▁; ▁ I ▁' m ▁extrem ely ▁gr ate ful ▁.
▁ I ▁have ▁been ▁bl own ▁away ▁by ▁this ▁conference ▁, ▁and ▁ I ▁want ▁to ▁than k ▁all ▁of ▁you ▁for ▁the ▁many ▁n ice ▁com ments ▁about ▁what ▁ I ▁had ▁to ▁say ▁the ▁other ▁night ▁.
▁ A nd ▁ I ▁say ▁that ▁s inc er ely ▁, ▁part ly ▁because ▁( ▁ M ock ▁so b ▁ ) ▁ I ▁need ▁that ▁.
▁( ▁ L augh ter ▁ ) ▁ P ut ▁y ours elves ▁in ▁my ▁position ▁.

==> train.bpe.tr <==
▁ Ç ok ▁teş ek kür ▁eder im ▁ C h ris ▁.
▁ B u ▁sahne de ▁ikinci ▁kez ▁yer ▁al ma ▁fır sat ına ▁sahip ▁olmak ▁ger ç ekten ▁büyük ▁bir ▁onur ▁. ▁ Ç ok ▁min net tar ım ▁.
▁ B u ▁konfer ans ta ▁çok ▁mut lu ▁ol d um ▁, ▁ve ▁anlat t ıkları m la ▁ilgili ▁güzel ▁yorum ların ız ▁için ▁s iz lere ▁çok ▁teş ek kür ▁eder im ▁.
▁ B unu ▁iç ten likle ▁söy l üyor um ▁, ▁çünkü ▁... ▁( ▁ A ğ lama ▁tak li di ▁) ▁ B una ▁ihtiy ac ım ▁var ▁.
▁( ▁



---


## Installation of JoeyNMT

JoeyNMT is a simple, minimalist NMT package which is useful for learning and teaching. Check out the documentation for JoeyNMT [here](https://joeynmt.readthedocs.io)  

In [15]:
# Install JoeyNMT
! git clone https://github.com/joeynmt/joeynmt.git
! cd joeynmt; pip3 install .
# Install PyTorch v1.8.0 with GPU support (CUDA 10.1 build).
! pip install torch==1.8.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html

Cloning into 'joeynmt'...
remote: Enumerating objects: 3127, done.
remote: Counting objects: 100% (176/176), done.
remote: Compressing objects: 100% (85/85), done.
remote: Total 3127 (delta 101), reused 142 (delta 91), pack-reused 2951
Receiving objects: 100% (3127/3127), 8.09 MiB | 9.19 MiB/s, done.
Resolving deltas: 100% (2130/2130), done.
Processing /content/joeynmt
  DEPRECATION: A future pip version will change local packages to be built in-place without first copying to a temporary directory. We recommend you use --use-feature=in-tree-build to test your packages with this new behavior before it becomes the default.
   pip 21.3 will remove support for this functionality. You can find discussion regarding this at https://github.com/pypa/pip/issues/7555.
Collecting numpy==1.20.1
  Downloading numpy-1.20.1-cp37-cp37m-manylinux2010_x86_64.whl (15.3 MB)
Collecting torch==1.8.0
  Downloading torch-1.8.0-cp3

In [22]:
# Copy the data into JoeyNMT's data folder and build the vocabulary

os.environ["data_path"] = os.path.join("joeynmt", "data", source_language + target_language)

# Create directory, copy everything we care about to the correct location
! mkdir -p $data_path
! cp train.* $data_path
! cp test.* $data_path
! cp dev.* $data_path
! ls $data_path

# Create that vocab using build_vocab
! sudo chmod 777 joeynmt/scripts/build_vocab.py
! joeynmt/scripts/build_vocab.py joeynmt/data/$src$tgt/train.bpe.$src joeynmt/data/$src$tgt/train.bpe.$tgt --output_path joeynmt/data/$src$tgt/vocab.txt

# Some output
! echo "Combined BPE Vocab"
! tail -n 10 joeynmt/data/$src$tgt/vocab.txt


dev.bpe.en  dev.en  test.bpe.en  test.en  train.bpe.en	train.en
dev.bpe.tr  dev.tr  test.bpe.tr  test.tr  train.bpe.tr	train.tr
Combined BPE Vocab
▁spanish
SMW
KBI
IS
▁anton
1784
ò
عسل
مسكين
T8


In [17]:
# Also copy everything we care about to a mounted location in Google Drive (relevant if running in Colab) at gdrive_path
! cp train.* "$gdrive_path"
! cp test.* "$gdrive_path"
! cp dev.* "$gdrive_path"
! ls "$gdrive_path"  #See the contents of the drive directory

dev.bpe.en  dev.en  test.bpe.en  test.en  train.bpe.en	train.en
dev.bpe.tr  dev.tr  test.bpe.tr  test.tr  train.bpe.tr	train.tr


## Creating the JoeyNMT Config

JoeyNMT requires a YAML config. We provide a template below, with a number of defaults that you may play with!

- We use the Transformer architecture
- We set dropout reasonably high: 0.3 (recommended in [(Sennrich and Zhang, 2019)](https://www.aclweb.org/anthology/P19-1021))

Things worth playing with:
- The batch size (also recommended to change for low-resource languages)
- The number of epochs (we've set it at 30 just so it runs in about an hour, for testing purposes)
- The decoder options (beam_size, alpha)
- Evaluation metrics (BLEU versus chrF)
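To get a feel for what the `alpha` decoder option does, here is a small sketch of length-normalized rescoring in beam search. It assumes the GNMT-style length penalty `((5 + length) / 6) ** alpha`; the two hypotheses and their log-probabilities are made up for illustration, and the helper names are hypothetical.

```python
def length_penalty(length, alpha):
    """GNMT-style length penalty (assumed form); alpha = 0 disables normalization."""
    return ((5.0 + length) / 6.0) ** alpha


def rescore(hypotheses, alpha):
    """Sort (tokens, summed log-probability) pairs by length-normalized score."""
    return sorted(hypotheses,
                  key=lambda h: h[1] / length_penalty(len(h[0]), alpha),
                  reverse=True)


# Made-up candidates: a short one with a higher raw score and a longer, fuller one.
hyps = [
    (["çok", "teşekkürler"], -2.0),
    (["çok", "teşekkür", "ederim", "Chris", "."], -2.6),
]

for alpha in (0.0, 1.0):
    best_tokens, _ = rescore(hyps, alpha)[0]
    print("alpha=%.1f ->" % alpha, " ".join(best_tokens))
```

Without normalization, every extra token lowers the total log-probability, so short outputs tend to win. A larger `alpha` counteracts this and favors longer translations.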

In [23]:
# This creates the config file for our JoeyNMT system. It might seem overwhelming so we've provided a couple of useful parameters you'll need to update
# (You can of course play with all the parameters if you'd like!)

name = '%s%s' % (source_language, target_language)
gdrive_path = os.environ["gdrive_path"]

# Create the config
config = """
name: "{name}_transformer"

data:
    src: "{source_language}"
    trg: "{target_language}"
    train: "data/{name}/train.bpe"
    dev:   "data/{name}/dev.bpe"
    test:  "data/{name}/test.bpe"
    level: "bpe"
    lowercase: False
    max_sent_length: 100
    src_vocab: "data/{name}/vocab.txt"
    trg_vocab: "data/{name}/vocab.txt"

testing:
    beam_size: 5
    alpha: 1.0

training:
    #load_model: "{gdrive_path}/models/{name}_transformer/1.ckpt" # if uncommented, load a pre-trained model from this checkpoint
    random_seed: 42
    optimizer: "adam"
    normalization: "tokens"
    adam_betas: [0.9, 0.999] 
    scheduling: "plateau"           # TODO: try switching from plateau to Noam scheduling
    patience: 5                     # For plateau: decrease learning rate by decrease_factor if validation score has not improved for this many validation rounds.
    learning_rate_factor: 0.5       # factor for Noam scheduler (used with Transformer)
    learning_rate_warmup: 1000      # warmup steps for Noam scheduler (used with Transformer)
    decrease_factor: 0.7
    loss: "crossentropy"
    learning_rate: 0.0003
    learning_rate_min: 0.00000001
    weight_decay: 0.0
    label_smoothing: 0.1
    batch_size: 4096
    batch_type: "token"
    eval_batch_size: 3600
    eval_batch_type: "token"
    batch_multiplier: 1
    early_stopping_metric: "ppl"
    epochs: 30                     # TODO: Decrease while playing around and checking that things work. Around 30 is enough to check whether it's working at all.
    validation_freq: 1000          # TODO: Set to at least once per epoch.
    logging_freq: 100
    eval_metric: "bleu"
    model_dir: "models/{name}_transformer"
    overwrite: False               # TODO: Set to True if you want to overwrite possibly existing models. 
    shuffle: True
    use_cuda: True
    max_output_length: 100
    print_valid_sents: [0, 1, 2, 3]
    keep_last_ckpts: 3

model:
    initializer: "xavier"
    bias_initializer: "zeros"
    init_gain: 1.0
    embed_initializer: "xavier"
    embed_init_gain: 1.0
    tied_embeddings: True
    tied_softmax: True
    encoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4             # TODO: Increase to 8 for larger data.
        embeddings:
            embedding_dim: 256   # TODO: Increase to 512 for larger data.
            scale: True
            dropout: 0.2
        # typically ff_size = 4 x hidden_size
        hidden_size: 256         # TODO: Increase to 512 for larger data.
        ff_size: 1024            # TODO: Increase to 2048 for larger data.
        dropout: 0.3
    decoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4              # TODO: Increase to 8 for larger data.
        embeddings:
            embedding_dim: 256    # TODO: Increase to 512 for larger data.
            scale: True
            dropout: 0.2
        # typically ff_size = 4 x hidden_size
        hidden_size: 256         # TODO: Increase to 512 for larger data.
        ff_size: 1024            # TODO: Increase to 2048 for larger data.
        dropout: 0.3
""".format(name=name, gdrive_path=os.environ["gdrive_path"], source_language=source_language, target_language=target_language)
with open("joeynmt/configs/transformer_{name}.yaml".format(name=name),'w') as f:
    f.write(config)
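The config contains a TODO about switching from plateau to Noam scheduling. As a sketch of what that schedule looks like, the function below implements the learning rate formula from the original Transformer paper, using the `hidden_size`, `learning_rate_warmup` and `learning_rate_factor` values from the config above as defaults. The function name is hypothetical, and this is an illustration of the formula rather than JoeyNMT's own code.

```python
def noam_learning_rate(step, hidden_size=256, warmup=1000, factor=0.5):
    """Noam schedule: linear warmup, then decay proportional to 1/sqrt(step).

    step must be >= 1.
    """
    return factor * hidden_size ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)


for step in (1, 500, 1000, 4000, 16000):
    print("step %5d  lr %.6f" % (step, noam_learning_rate(step)))
```

The learning rate rises linearly until `warmup` steps, peaks there, and then decays slowly. Unlike plateau scheduling, it does not depend on validation scores.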

## Train the Model

This single joeynmt command runs the training using the config we made above.

In [None]:
# Train the model
# You can press Ctrl-C to stop. And then run the next cell to save your checkpoints! 
!cd joeynmt; python3 -m joeynmt train configs/transformer_$src$tgt.yaml

2021-07-28 12:41:07,414 - INFO - root - Hello! This is Joey-NMT (version 1.3).
2021-07-28 12:41:07,439 - INFO - joeynmt.data - Loading training data...
2021-07-28 12:41:15,383 - INFO - joeynmt.data - Building vocabulary...
2021-07-28 12:41:17,439 - INFO - joeynmt.data - Loading dev data...
2021-07-28 12:41:17,959 - INFO - joeynmt.data - Loading test data...
2021-07-28 12:41:19,884 - INFO - joeynmt.data - Data loaded.
2021-07-28 12:41:19,884 - INFO - joeynmt.model - Building an encoder-decoder model...
2021-07-28 12:41:20,186 - INFO - joeynmt.model - Enc-dec model built.
2021-07-28 12:41:20.439783: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-07-28 12:41:22,372 - INFO - joeynmt.training - Total params: 13813760
2021-07-28 12:41:24,627 - INFO - joeynmt.helpers - cfg.name                           : entr_transformer
2021-07-28 12:41:24,627 - INFO - joeynmt.helpers - cfg.data.src                       : en
2021-0

In [None]:
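# Copy the created models from the notebook storage to Google Drive for persistent storage
# (create the target folder first, otherwise cp has nowhere to copy to)
!mkdir -p "$gdrive_path/models/${src}${tgt}_transformer/"
!cp -r joeynmt/models/${src}${tgt}_transformer/* "$gdrive_path/models/${src}${tgt}_transformer/"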

In [None]:
# Output our validation scores
! cat "$gdrive_path/models/${src}${tgt}_transformer/validations.txt"
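If you want to find the best validation round programmatically, a sketch like the one below may help. It assumes that each line of `validations.txt` consists of tab-separated `Key: value` fields, which you should verify against your own file. The sample lines are made up for illustration and are not real results.

```python
def parse_validation_line(line):
    """Parse one line of tab-separated 'Key: value' fields into a dict."""
    record = {}
    for field in line.strip().split("\t"):
        if ":" not in field:
            continue  # skip fields without a key, e.g. a trailing marker
        key, value = field.split(":", 1)
        try:
            record[key.strip()] = float(value)
        except ValueError:
            record[key.strip()] = value.strip()
    return record


# Made-up lines in the assumed format (NOT real training results).
sample_lines = [
    "Steps: 1000\tLoss: 90000.0\tPPL: 55.0\tbleu: 1.2\tLR: 0.0003\t*",
    "Steps: 2000\tLoss: 80000.0\tPPL: 40.0\tbleu: 3.4\tLR: 0.0003\t*",
    "Steps: 3000\tLoss: 82000.0\tPPL: 42.0\tbleu: 3.1\tLR: 0.0003\t",
]

records = [parse_validation_line(line) for line in sample_lines]
best = max(records, key=lambda r: r["bleu"])
print("Best bleu %.2f at step %d" % (best["bleu"], best["Steps"]))
```

To use it on your own run, read the lines from the `validations.txt` file in your model folder instead of `sample_lines`.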

In [None]:
# Test our model
! cd joeynmt; python3 -m joeynmt test "$gdrive_path/models/${src}${tgt}_transformer/config.yaml"