# 1. Preparation

## 1.1 Install pip and torch

In [1]:
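# Pin pip to 24.0: pip 24.1 and later enforce standard dependency specifiers,
# which omegaconf 2.0.6 (required by fairseq) does not satisfy
!pip install pip==24.0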
!pip show torch | grep Version

Version: 2.5.0+cu121


## 1.2 Install fairseq

First, install fairseq from source in editable mode. In my experience this is far less painful than the original installation method.

In [2]:
!git clone https://github.com/facebookresearch/fairseq.git
%cd fairseq
!pip install --editable ./

Cloning into 'fairseq'...
remote: Enumerating objects: 35385, done.
remote: Counting objects: 100% (48/48), done.
remote: Compressing objects: 100% (36/36), done.
remote: Total 35385 (delta 10), reused 40 (delta 9), pack-reused 35337 (from 1)
Receiving objects: 100% (35385/35385), 25.47 MiB | 20.59 MiB/s, done.
Resolving deltas: 100% (25539/25539), done.
/content/fairseq
Obtaining file:///content/fairseq
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Collecting hydra-core<1.1,>=1.0.7 (from fairseq==0.12.2)
  Downloading hydra_core-1.0.7-py3-none-any.whl.metadata (3.7 kB)
Collecting omegaconf<2.1 (from fairseq==0.12.2)
  Downloading omegaconf-2.0.6-py3-none-any.whl.metadata (3.0 kB)
Collecting antlr4-python3-runtime==4.8 (from hydra-core<1.1,>=1.0.7->fairseq==0.12.2)
  

Next, append the fairseq directory to the existing `PYTHONPATH` environment variable so that the fairseq commands launched from the notebook can find the package.

In [3]:
!echo $PYTHONPATH
import os
# Use .get() so this also works when PYTHONPATH is not set yet
os.environ['PYTHONPATH'] = os.environ.get('PYTHONPATH', '') + ":/content/fairseq/"
!echo $PYTHONPATH

/env/python
/env/python:/content/fairseq/
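Before moving on, it can help to confirm that the installation is actually visible. The following is a minimal sketch using only the standard library; `fairseq_status` is a hypothetical helper name, and the list of command names assumes the usual fairseq entry points used later in this notebook. Note that the `package` entry reflects the current kernel, which may need a restart after an editable install.

```python
import importlib.util
import shutil


def fairseq_status(commands=("fairseq-preprocess", "fairseq-train", "fairseq-generate")):
    """Report whether the fairseq package and its CLI entry points can be found."""
    # find_spec returns None when the top-level package cannot be located
    status = {"package": importlib.util.find_spec("fairseq") is not None}
    for name in commands:
        # shutil.which returns None when the command is not on PATH
        status[name] = shutil.which(name) is not None
    return status


for name, found in fairseq_status().items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```

If any entry is reported as missing, re-running the installation cell is probably the first thing to try.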


## 1.3 Install other packages

In [4]:
!pip install sacremoses
!pip install sentencepiece
!pip install sacrebleu

Collecting sacremoses
  Downloading sacremoses-0.1.1-py3-none-any.whl.metadata (8.3 kB)
Downloading sacremoses-0.1.1-py3-none-any.whl (897 kB)
DEPRECATION: omegaconf 2.0.6 has a non-standard dependency specifier PyYAML>=5.1.*. pip 24.1 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of omegaconf or contact the author to suggest that they release a version with a conforming dependency specifiers. Discussion can be found at https://github.co

## 1.4 Activate GPU

In [5]:
import torch

if torch.cuda.is_available():
    device = torch.cuda.current_device()
    print('Current device: ', torch.cuda.get_device_name(device))
else:
    device = 'cpu'
    print('Current device: CPU.')

Current device:  Tesla T4


# 2. Data Preparation

## 2.1 Download dataset

In [6]:
%cd /content/fairseq/examples/translation

!wget -O sample_data.zip https://bwsyncandshare.kit.edu/s/Xx3D56SJmG8PwXj/download
# If wget command not found, download the wget.exe from this website and move it to C:\Windows\System32: https://eternallybored.org/misc/wget/

!unzip -o sample_data.zip
# If unzip command not found, download "Complete package, except sources" and copy unzip.exe to C:\Windows: https://gnuwin32.sourceforge.net/packages/unzip.htm

/content/fairseq/examples/translation
--2024-10-30 15:38:28--  https://bwsyncandshare.kit.edu/s/Xx3D56SJmG8PwXj/download
Resolving bwsyncandshare.kit.edu (bwsyncandshare.kit.edu)... 141.3.135.129, 2a00:1398:b::8d03:8781
Connecting to bwsyncandshare.kit.edu (bwsyncandshare.kit.edu)|141.3.135.129|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘sample_data.zip’

sample_data.zip         [              <=>   ]  33.03M  7.79MB/s    in 4.2s    

2024-10-30 15:38:34 (7.79 MB/s) - ‘sample_data.zip’ saved [34633335]

Archive:  sample_data.zip
   creating: sample_data/
 extracting: sample_data/dev.wikimedia.de-en.de  
 extracting: sample_data/dev.wikimedia.de-en.en  
 extracting: sample_data/train.wikimedia.de-en.de  
 extracting: sample_data/train.wikimedia.de-en.en  
 extracting: sample_data/tst.wikimedia.de-en.de  
 extracting: sample_data/tst.wikimedia.de-en.en  
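If `wget` or `unzip` is not available (the situation the comments in the cell above describe for Windows), the same two steps can probably be done with the Python standard library instead. This is a sketch, not part of the original notebook: `download` and `extract` are hypothetical helper names, and unlike `wget -O`, `download` skips the request when the file already exists.

```python
import urllib.request
import zipfile
from pathlib import Path


def download(url, dest):
    """Download url to dest unless the file already exists."""
    dest = Path(dest)
    if not dest.exists():
        urllib.request.urlretrieve(url, str(dest))
    return dest


def extract(zip_path, target_dir="."):
    """Extract every member of the archive and return the member names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
        return archive.namelist()


# Usage (same URL as in the cell above):
# zip_path = download("https://bwsyncandshare.kit.edu/s/Xx3D56SJmG8PwXj/download", "sample_data.zip")
# print(extract(zip_path))
```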


## 2.2 Preprocessing

In [7]:
# List files in downloaded `sample_data`
!ls -ltr sample_data

!echo -e "\nFirst lines of German:\n"
!head sample_data/train.wikimedia.de-en.de
!echo -e "\nFirst lines of English:\n"
!head sample_data/train.wikimedia.de-en.en

total 33832
-rw-r--r-- 1 root root   335913 Oct 30 15:26 tst.wikimedia.de-en.en
-rw-r--r-- 1 root root   347217 Oct 30 15:26 tst.wikimedia.de-en.de
-rw-r--r-- 1 root root 16015279 Oct 30 15:26 train.wikimedia.de-en.en
-rw-r--r-- 1 root root 17268516 Oct 30 15:26 train.wikimedia.de-en.de
-rw-r--r-- 1 root root   320077 Oct 30 15:26 dev.wikimedia.de-en.en
-rw-r--r-- 1 root root   345243 Oct 30 15:26 dev.wikimedia.de-en.de

First lines of German:

Er war Aufsichtsratsvorsitzender der Bahngesellschaften Chesapeake and Ohio Railway und New York Central Railroad. 
Youngs Großvater war ein Rancher im [[Texas Panhandle]]. 
Er wurde als drittes von vier Kindern geboren. 
Seine Mutter starb als er zehn Jahre alt war. 
Er besuchte die von der Baptistischen Kirche unterhaltene Canadian Academy. 
Als Teenager absolvierte er die [[Culver Military Academy]] in [[Culver (Indiana)]]. 
Diese schloss er 1914 als Klassenbester ab. Anschließend begann er ein Studium an der [[University of Virginia]] in [[C

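Segment the text into subwords with SentencePiece. Note that the trainer call in the next cell does not set `model_type`, so SentencePiece falls back to its default unigram model; pass `model_type="bpe"` if you want BPE proper. The `bpe` file prefix is only a name.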

In [8]:
import sentencepiece as spm

# After execution, `bpe.model` and `bpe.vocab` appear in the current directory.
spm.SentencePieceTrainer.train(input="sample_data/train.wikimedia.de-en.de,sample_data/train.wikimedia.de-en.en",
                               model_prefix="bpe",
                               vocab_size=10000)

print('Finished training sentencepiece model.')

Finished training sentencepiece model.


Then we use the trained segmentation model to preprocess the sentences from train/dev/test sets:

In [9]:
# Load the trained sentencepiece model
spm_model = spm.SentencePieceProcessor(model_file="bpe.model")

# Important: encoding="utf-8"
for partition in ["train", "dev", "tst"]:
    for lang in ["de", "en"]:
        in_path = f"sample_data/{partition}.wikimedia.de-en.{lang}"
        out_path = f"sample_data/spm.{partition}.wikimedia.de-en.{lang}"

        # Context managers close both files even if encoding fails midway
        with open(in_path, "r", encoding="utf-8") as f_in, \
             open(out_path, "w", encoding="utf-8") as f_out:
            for line in f_in:
                # Segment the line into subwords
                pieces = spm_model.encode(line.strip(), out_type=str)
                # Join the subwords into a space-separated string
                f_out.write(" ".join(pieces) + "\n")
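The space-joined format written above is reversible, which is what makes it safe to train on. As a rough illustration, the sketch below undoes it in plain Python: remove the joining spaces, then turn the SentencePiece word-boundary marker "▁" (U+2581) back into a space. As far as I know this mirrors what `--remove-bpe=sentencepiece` does at decoding time, but treat that as an assumption. `pieces_to_text` is a hypothetical helper, and the example segmentation is made up; real pieces depend on the trained model.

```python
def pieces_to_text(segmented_line):
    """Undo space-joined SentencePiece output: drop the joins, then turn U+2581 into spaces."""
    return segmented_line.replace(" ", "").replace("\u2581", " ").strip()


# Hypothetical segmentation, for illustration only
example = "\u2581Er \u2581wurde \u2581als \u2581dritt es \u2581von \u2581vier \u2581Kind ern \u2581geboren ."
print(pieces_to_text(example))  # Er wurde als drittes von vier Kindern geboren.
```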

Now, we will binarize the data for training with fairseq.  
Feel free to check the [documentation](https://fairseq.readthedocs.io/en/latest/command_line_tools.html) of fairseq commands.

In [10]:
# Preprocess/binarize the data
TEXT="/content/fairseq/examples/translation/sample_data"
!echo $TEXT
# Binarize the data for training
!fairseq-preprocess \
    --source-lang de --target-lang en \
    --trainpref $TEXT/spm.train.wikimedia.de-en \
    --validpref $TEXT/spm.dev.wikimedia.de-en \
    --testpref $TEXT/spm.tst.wikimedia.de-en \
    --destdir data-bin/iwslt14.de-en \
    --thresholdtgt 0 --thresholdsrc 0 \
    --workers 8

/content/fairseq/examples/translation/sample_data
2024-10-30 15:46:47.758228: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-10-30 15:46:48.009347: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-10-30 15:46:48.083353: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-10-30 15:46:48.509267: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
  d

The data preprocessing is completed.

# 3. Training

In [11]:
!CUDA_VISIBLE_DEVICES=0 fairseq-train \
    /content/fairseq/examples/translation/data-bin/iwslt14.de-en \
    --max-source-positions 4096 --max-target-positions 4096 \
    --skip-invalid-size-inputs-valid-test \
    --arch transformer --share-decoder-input-output-embed \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --dropout 0.3 --weight-decay 0.0001 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --keep-last-epochs 2 \
    --max-tokens 4096 \
    --max-epoch 10 \
    --fp16

2024-10-30 15:52:00.802100: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-10-30 15:52:00.822047: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-10-30 15:52:00.827996: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-10-30 15:52:00.843356: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-10-30 15:52:03 | INFO | numexpr.utils | NumExpr 
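To get a feeling for what `--lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000` means in the command above, here is a small sketch of a warmup plus inverse-square-root schedule. It is not fairseq's code: `inverse_sqrt_lr` is a hypothetical function, and it assumes the warmup starts from a learning rate of 0, which I believe is the behaviour when `--warmup-init-lr` is not given.

```python
import math


def inverse_sqrt_lr(step, peak_lr=5e-4, warmup_updates=4000, warmup_init_lr=0.0):
    """Learning rate at a given update (step >= 1) under warmup + inverse-sqrt decay."""
    if step < warmup_updates:
        # Linear warmup from warmup_init_lr (assumed 0) up to peak_lr
        return warmup_init_lr + step * (peak_lr - warmup_init_lr) / warmup_updates
    # After warmup: decay proportionally to 1 / sqrt(step)
    return peak_lr * math.sqrt(warmup_updates / step)


for step in (1, 1000, 4000, 16000):
    print(step, f"{inverse_sqrt_lr(step):.2e}")
```

Under these assumptions the rate peaks at `5e-4` after 4000 updates and has halved again by update 16000. A short run that never gets past the warmup phase would therefore train with a much smaller learning rate than the nominal one.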

# 4. Decoding

Now we can generate translations with the trained model.

In [12]:
# TEST_INPUT="/content/fairseq/examples/translation/sample_data/spm.tst.de-en.de"
PRED_LOG="/content/fairseq/examples/translation/de-en.decode.log"

!fairseq-generate /content/fairseq/examples/translation/data-bin/iwslt14.de-en \
      --task translation \
      --source-lang de \
      --target-lang en \
      --path /content/fairseq/examples/translation/checkpoints/checkpoint_best.pt \
      --batch-size 256 \
      --beam 4 \
      --max-source-positions 4096 --max-target-positions 4096 \
      --skip-invalid-size-inputs-valid-test \
      --remove-bpe=sentencepiece > $PRED_LOG

2024-10-30 16:27:10.020568: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-10-30 16:27:10.057867: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-10-30 16:27:10.069934: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-10-30 16:27:10.094893: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
INFO:fairseq.tasks.text_to_speech:Please install tens

We extract the hypotheses and references from the decoding log file.

In [14]:
!grep ^H "de-en.decode.log" | sed 's/^H-//g' | cut -f 3 | sed 's/ ##//g' > ./hyp.txt
!grep ^T "de-en.decode.log" | sed 's/^T-//g' | cut -f 2 | sed 's/ ##//g' > ./ref.txt
!head ./hyp.txt
!echo ""
!head ./ref.txt

Discography
Background
Personal life
Discography
Family
Education
Gallery
Education
Location
Examples

Publications (selection)
Column connectors
Teaching and Publications
Publications (selection)
Notable family members
Training and operations manager
New beginning 1929
Youth and professional life
Rhyme structure
Calendar Definition
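The same extraction can be sketched in Python, which may be easier to adapt than the `grep`/`sed`/`cut` pipeline. The sketch assumes the tab-separated layout that the shell commands above rely on: `H-<id>`, score, text for hypotheses and `T-<id>`, text for references. `parse_generate_log` is a hypothetical helper, and the sample lines are made up. Unlike the pipeline, it sorts by sentence id and keeps only ids that have both a hypothesis and a reference.

```python
def parse_generate_log(lines):
    """Collect hypotheses (H-*) and references (T-*) keyed by sentence id."""
    hyps, refs = {}, {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        tag, _, sent_id = fields[0].partition("-")
        if not sent_id.isdigit():
            continue  # skip log lines that are not per-sentence records
        if tag == "H" and len(fields) >= 3:
            hyps[int(sent_id)] = fields[2]  # fields: id, score, text
        elif tag == "T" and len(fields) >= 2:
            refs[int(sent_id)] = fields[1]  # fields: id, text
    ids = sorted(hyps.keys() & refs.keys())
    return [hyps[i] for i in ids], [refs[i] for i in ids]


# Made-up log lines, for illustration only
sample = [
    "S-1\tQuelltext\n",
    "T-1\tPersonal life\n",
    "H-1\t-0.42\tPersonal life\n",
    "T-0\tPublications (selection)\n",
    "H-0\t-1.30\tDiscography\n",
]
hyps, refs = parse_generate_log(sample)
print(hyps)  # ['Discography', 'Personal life']
print(refs)  # ['Publications (selection)', 'Personal life']
```

For the real log, something like `parse_generate_log(open("de-en.decode.log", encoding="utf-8"))` should work, since the function only iterates over lines.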


# 5. Evaluation

In [15]:
!bash -c "cat hyp.txt | sacrebleu ref.txt"

{
 "name": "BLEU",
 "score": 0.5,
 "signature": "nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.4.3",
 "verbose_score": "19.5/2.4/0.6/0.2 (BP = 0.297 ratio = 0.452 hyp_len = 27473 ref_len = 60809)",
 "nrefs": "1",
 "case": "mixed",
 "eff": "no",
 "tok": "13a",
 "smooth": "exp",
 "version": "2.4.3"
}
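The `verbose_score` line explains why the score is so low. As a sanity check, the sketch below recomputes the brevity penalty from the reported lengths using the standard BLEU definition, and then approximates the final score from the rounded n-gram precisions. `brevity_penalty` is a hypothetical helper, not a sacrebleu function, and because the precisions are rounded the last number is only approximate.

```python
import math


def brevity_penalty(hyp_len, ref_len):
    """BLEU brevity penalty: 1 if the hypothesis is at least as long as the reference."""
    if hyp_len == 0:
        return 0.0
    if hyp_len >= ref_len:
        return 1.0
    return math.exp(1.0 - ref_len / hyp_len)


# Lengths and n-gram precisions copied from the sacrebleu output above
hyp_len, ref_len = 27473, 60809
precisions = [19.5, 2.4, 0.6, 0.2]

bp = brevity_penalty(hyp_len, ref_len)
# Geometric mean of the (rounded) 1- to 4-gram precisions
geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))

print(f"ratio = {hyp_len / ref_len:.3f}")    # ratio = 0.452
print(f"BP    = {bp:.3f}")                   # BP    = 0.297
print(f"BLEU ~= {bp * geo_mean:.1f}")        # BLEU ~= 0.5
```

The hypotheses are less than half as long as the references, so the brevity penalty alone cuts the score to under a third. The very low 2- to 4-gram precisions do the rest.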