# 1. Preparation

## 1.1 Install pip and torch

In [None]:
!pip install pip==24.0
!pip show torch | grep Version

## 1.2 Install fairseq

First, install fairseq from source in editable mode. In our experience the packaged release is unreliable to install, so we clone the repository instead.

In [None]:
!git clone https://github.com/facebookresearch/fairseq.git
%cd fairseq
!pip install --editable ./

Then we append the fairseq directory to the `PYTHONPATH` environment variable so that the fairseq commands can find the package.

In [None]:
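!echo $PYTHONPATH
import os
# Append the fairseq checkout; os.environ.get avoids a KeyError if PYTHONPATH is unset
os.environ['PYTHONPATH'] = os.environ.get('PYTHONPATH', '') + ":/content/fairseq/"
!echo $PYTHONPATH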

## 1.3 Install other packages

In [None]:
!pip install sacremoses
!pip install sentencepiece
!pip install sacrebleu

## 1.4 Activate GPU

In [None]:
import torch

if torch.cuda.is_available():
    device = torch.cuda.current_device()
    print('Current device:', torch.cuda.get_device_name(device))
else:
    device = 'cpu'
    print('Current device: CPU.')

# 2. Data Preparation

## 2.1 Download dataset

In [None]:
%cd /content/fairseq/examples/translation

!wget -O sample_data.zip https://bwsyncandshare.kit.edu/s/Xx3D56SJmG8PwXj/download
# If the wget command is not found (e.g. on Windows), download wget.exe from this website and move it to C:\Windows\System32: https://eternallybored.org/misc/wget/

!unzip -o sample_data.zip
# If the unzip command is not found (e.g. on Windows), download "Complete package, except sources" and copy unzip.exe to C:\Windows: https://gnuwin32.sourceforge.net/packages/unzip.htm

## 2.2 Preprocessing

In [None]:
# List files in downloaded `sample_data`
!ls -ltr sample_data

!echo -e "\nFirst lines of German:\n"
!head sample_data/train.wikimedia.de-en.de
!echo -e "\nFirst lines of English:\n"
!head sample_data/train.wikimedia.de-en.en

Segment the text into subwords using BPE.

In [None]:
import sentencepiece as spm

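# SentencePiece trains a unigram model by default, so request BPE explicitly.
# After execution, you will find bpe.model and bpe.vocab in the current directory.
spm.SentencePieceTrainer.train(input="sample_data/train.wikimedia.de-en.de,sample_data/train.wikimedia.de-en.en",
                               model_prefix="bpe",
                               model_type="bpe",
                               vocab_size=10000)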

print('Finished training sentencepiece model.')
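
To give an intuition for what BPE training does, here is a minimal toy sketch of the merge-learning loop: start from characters and repeatedly merge the most frequent adjacent pair. It is not SentencePiece's implementation (which works on raw sentences, marks word boundaries with `▁`, and is far more efficient); the function name `learn_bpe_merges` and the word list are made up for illustration.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` BPE merges from a list of words (toy version)."""
    # Represent each distinct word as a tuple of symbols, starting from characters.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Pick the most frequent pair (ties go to the pair seen first).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, replacing occurrences of the best pair by one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


# Hypothetical toy corpus, not taken from the sample data
toy_merges = learn_bpe_merges(["low", "low", "lower", "lowest", "newer", "newer"], num_merges=3)
print(toy_merges)
```

The `vocab_size` argument of the real trainer plays roughly the role of `num_merges` here: a larger value allows more merges and therefore longer subword units.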

Then we use the trained segmentation model to preprocess the sentences from train/dev/test sets:

In [None]:
# Load the trained sentencepiece model
spm_model = spm.SentencePieceProcessor(model_file="bpe.model")

# Important: encoding="utf-8"
for partition in ["train", "dev", "tst"]:
    for lang in ["de", "en"]:
        in_path = f"sample_data/{partition}.wikimedia.de-en.{lang}"
        out_path = f"sample_data/spm.{partition}.wikimedia.de-en.{lang}"
        # Context managers close both files even if an error occurs midway
        with open(in_path, "r", encoding="utf-8") as f_in, \
             open(out_path, "w", encoding="utf-8") as f_out:
            for line in f_in:
                # Segment the line into subword pieces
                pieces = spm_model.encode(line.strip(), out_type=str)
                # Join the pieces with spaces, one sentence per line
                f_out.write(" ".join(pieces) + "\n")
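
The segmented files contain space-separated pieces in which `▁` (U+2581) marks the start of a word. The sketch below shows, on a made-up list of pieces, how such a line can be turned back into plain text. To the best of our knowledge this mirrors what the `--remove-bpe=sentencepiece` option does during decoding, but treat it as an illustration rather than fairseq's exact code; `undo_sentencepiece` is a hypothetical helper name.

```python
def undo_sentencepiece(segmented_line):
    """Reverse the space-joined SentencePiece segmentation of one line."""
    # Drop the spaces between pieces, then turn each U+2581 marker back into a space.
    return segmented_line.replace(" ", "").replace("\u2581", " ").strip()


# Hypothetical pieces, as spm_model.encode(..., out_type=str) might return them
pieces = ["\u2581Hello", "\u2581wor", "ld", "!"]
segmented = " ".join(pieces)
restored = undo_sentencepiece(segmented)

print(segmented)
print(restored)
```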

Now, we will binarize the data for training with fairseq.  
Feel free to check the [documentation](https://fairseq.readthedocs.io/en/latest/command_line_tools.html) of fairseq commands.

In [None]:
# Directory with the segmented files (IPython substitutes $TEXT into the shell commands below)
TEXT="/content/fairseq/examples/translation/sample_data"
!echo $TEXT
# Binarize the data for training
!fairseq-preprocess \
    --source-lang de --target-lang en \
    --trainpref $TEXT/spm.train.wikimedia.de-en \
    --validpref $TEXT/spm.dev.wikimedia.de-en \
    --testpref $TEXT/spm.tst.wikimedia.de-en \
    --destdir data-bin/iwslt14.de-en \
    --thresholdtgt 0 --thresholdsrc 0 \
    --workers 8

The data preprocessing is completed.

# 3. Training

In [None]:
!CUDA_VISIBLE_DEVICES=0 fairseq-train \
    /content/fairseq/examples/translation/data-bin/iwslt14.de-en \
    --max-source-positions 4096 --max-target-positions 4096 \
    --skip-invalid-size-inputs-valid-test \
    --arch transformer --share-decoder-input-output-embed \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --dropout 0.3 --weight-decay 0.0001 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --keep-last-epochs 2 \
    --max-tokens 4096 \
    --max-epoch 10 \
    --fp16
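
The flags `--lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000` describe a learning rate that rises linearly during warmup and then decays with the inverse square root of the update number. Here is a small sketch of that schedule. It assumes the warmup starts from a learning rate of 0, which we believe is the default when no initial warmup rate is given; the function name `inverse_sqrt_lr` is ours, not fairseq's.

```python
def inverse_sqrt_lr(step, peak_lr=5e-4, warmup_updates=4000, warmup_init_lr=0.0):
    """Learning rate at a given update under linear warmup + inverse-sqrt decay."""
    if step < warmup_updates:
        # Linear warmup from warmup_init_lr (assumed 0) to peak_lr
        return warmup_init_lr + step * (peak_lr - warmup_init_lr) / warmup_updates
    # After warmup: decay proportionally to 1 / sqrt(step), continuous at the peak
    return peak_lr * (warmup_updates ** 0.5) * (step ** -0.5)


for step in [0, 2000, 4000, 16000]:
    print(step, inverse_sqrt_lr(step))
```

With only 10 epochs on a small dataset, training may end before or shortly after the 4000 warmup updates, so the peak learning rate might be used only briefly or not reached at all.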

# 4. Decoding

Now we can generate translations with the trained model.

In [None]:
# Log file that receives the output of fairseq-generate
PRED_LOG="/content/fairseq/examples/translation/de-en.decode.log"

!fairseq-generate /content/fairseq/examples/translation/data-bin/iwslt14.de-en \
      --task translation \
      --source-lang de \
      --target-lang en \
      --path /content/fairseq/examples/translation/checkpoints/checkpoint_best.pt \
      --batch-size 256 \
      --beam 4 \
      --max-source-positions 4096 --max-target-positions 4096 \
      --skip-invalid-size-inputs-valid-test \
      --remove-bpe=sentencepiece > $PRED_LOG

We extract the hypotheses and references from the decoding log file.

In [None]:
# --remove-bpe=sentencepiece already restored plain text, so we only drop the ID and score columns
!grep '^H-' de-en.decode.log | cut -f 3 > ./hyp.txt
!grep '^T-' de-en.decode.log | cut -f 2 > ./ref.txt
!head ./hyp.txt
!echo ""
!head ./ref.txt
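
The same extraction can be done in Python, which makes the log format explicit: hypothesis lines look like `H-<id>\t<score>\t<text>` and reference lines like `T-<id>\t<text>`. This is a sketch with a hypothetical helper `extract_hyp_ref` and a made-up log excerpt. Unlike the shell pipeline, which keeps the order of the log, it sorts the pairs by sentence ID.

```python
import re


def extract_hyp_ref(log_lines):
    """Collect hypotheses and references from fairseq-generate output lines."""
    hyps, refs = {}, {}
    for line in log_lines:
        fields = line.rstrip("\n").split("\t")
        match = re.match(r"^([HT])-(\d+)$", fields[0])
        if not match:
            continue  # skip S-, D-, P- lines and any other log output
        tag, sent_id = match.group(1), int(match.group(2))
        if tag == "H" and len(fields) >= 3:
            hyps[sent_id] = fields[2]  # fields: ID, score, hypothesis
        elif tag == "T" and len(fields) >= 2:
            refs[sent_id] = fields[1]  # fields: ID, reference
    # Keep only sentences that have both a hypothesis and a reference
    ids = sorted(hyps.keys() & refs.keys())
    return [hyps[i] for i in ids], [refs[i] for i in ids]


# Made-up excerpt in the format described above
sample_log = [
    "S-1\tein Test\n",
    "T-1\ta test\n",
    "H-1\t-0.25\ta test\n",
    "D-1\t-0.25\ta test\n",
    "S-0\thallo Welt\n",
    "T-0\thello world\n",
    "H-0\t-0.5\thello word\n",
]
hyps, refs = extract_hyp_ref(sample_log)
print(hyps)
print(refs)
```

To use it on the real log, pass an open file instead of the list, for example `extract_hyp_ref(open("de-en.decode.log", encoding="utf-8"))`.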

# 5. Evaluation

In [None]:
!bash -c "cat hyp.txt | sacrebleu ref.txt"
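
To make the reported number less of a black box, the following sketch computes a simplified corpus-level BLEU: clipped n-gram precisions for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty. It tokenizes on whitespace and applies no smoothing, so its scores will generally differ from sacrebleu's, which should remain the number you report. The function names and example sentences are made up for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus BLEU (whitespace tokens, one reference, no smoothing)."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_counts, ref_counts = ngrams(hyp_tok, n), ngrams(ref_tok, n)
            # Counter intersection clips each n-gram count at its reference count
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, a zero precision gives a zero score
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty punishes output that is shorter than the reference
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)
    return 100.0 * brevity_penalty * math.exp(log_precision)


perfect = simple_corpus_bleu(["the cat sat on the mat"], ["the cat sat on the mat"])
partial = simple_corpus_bleu(["the cat sat on a mat"], ["the cat sat on the mat"])
print(perfect, partial)
```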