# 1. Preparation

## 1.1 Install pip and torch

In [1]:
import lmod
await lmod.purge(force=True)
await lmod.load('compiler/gnu/13.3')

In [2]:
import sys
# sys.executable should point to the Python interpreter in your virtual environment.
# sys.path should include your virtual environment's site-packages directory.
print(sys.executable)
print(sys.path)

/pfs/data5/home/kit/stud/unyfv/myEnv/bin/python
['/pfs/data5/home/kit/stud/unyfv', '', '/home/kit/stud/unyfv/.local/lib/python3.9/site-packages', '/opt/bwhpc/common/jupyter/tensorflow/2023-10-10/lib/python3.9/site-packages', '/usr/lib64/python39.zip', '/usr/lib64/python3.9', '/usr/lib64/python3.9/lib-dynload', '/pfs/data5/home/kit/stud/unyfv/myEnv/lib64/python3.9/site-packages', '/pfs/data5/home/kit/stud/unyfv/myEnv/lib/python3.9/site-packages']
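
The printed paths can also be checked programmatically. The sketch below is only an illustration: the helper name `in_virtualenv` is an assumption, not part of the original notebook. It relies on the standard-library convention that `sys.prefix` differs from `sys.base_prefix` inside a venv.

```python
import sys

def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment and
    # sys.base_prefix at the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)

print("interpreter:", sys.executable)
print("virtual environment active:", in_virtualenv())
```

If this prints `False` while `sys.executable` already points into `myEnv`, the kernel was probably started from a different interpreter than expected.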


In [3]:
import os
# Update PYTHONPATH and PATH
os.environ["PYTHONPATH"] = "/pfs/data5/home/kit/stud/unyfv/myEnv/lib/python3.9/site-packages:" + os.environ.get("PYTHONPATH", "")
os.environ["PATH"] = "/pfs/data5/home/kit/stud/unyfv/myEnv/bin:" + os.environ["PATH"]
# Verify the update
!which python
!which pip
!echo $PYTHONPATH

/pfs/data5/home/kit/stud/unyfv/myEnv/bin/python
/pfs/data5/home/kit/stud/unyfv/myEnv/bin/pip
/pfs/data5/home/kit/stud/unyfv/myEnv/lib/python3.9/site-packages:/home/kit/stud/unyfv/.local/lib/python3.9/site-packages:/opt/bwhpc/common/jupyter/tensorflow/2023-10-10/lib/python3.9/site-packages
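
Re-running the cell above prepends the same directories again each time. A minimal sketch of an idempotent alternative follows; the helper `prepend_path` and the `/hypothetical/...` directory are assumptions made for illustration only.

```python
import os

def prepend_path(var: str, directory: str, env=os.environ) -> str:
    """Prepend `directory` to the os.pathsep-separated variable `var`, without duplicates."""
    current = [p for p in env.get(var, "").split(os.pathsep) if p]
    # Drop an existing occurrence so the directory ends up first exactly once.
    parts = [directory] + [p for p in current if p != directory]
    env[var] = os.pathsep.join(parts)
    return env[var]

# Demonstration on a plain dict so the real environment is left untouched.
demo_env = {"PATH": os.pathsep.join(["/usr/bin", "/bin"])}
prepend_path("PATH", "/hypothetical/myEnv/bin", demo_env)
prepend_path("PATH", "/hypothetical/myEnv/bin", demo_env)  # second call changes nothing
print(demo_env["PATH"])
```

Passing `os.environ` (the default) instead of `demo_env` would apply the change to the running kernel.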


In [4]:
!pip install pip==24.0
!pip show torch | grep Version

DEPRECATION: omegaconf 2.0.6 has a non-standard dependency specifier PyYAML>=5.1.*. pip 24.1 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of omegaconf or contact the author to suggest that they release a version with a conforming dependency specifiers. Discussion can be found at https://github.com/pypa/pip/issues/12063
[notice] A new release of pip is available: 24.0 -> 24.3.1
[notice] To update, run: pip install --upgrade pip
Version: 2.5.1
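
The same version check can be done from Python without shelling out. This is a small sketch using the standard-library `importlib.metadata` module; the helper name `installed_version` is an assumption.

```python
from importlib import metadata

def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

for name in ["pip", "torch"]:
    print(name, installed_version(name))
```

A `None` result means the package is not visible to the interpreter that runs the notebook, which usually points to a `PYTHONPATH` or kernel mismatch.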


## 1.2 Install fairseq

Install fairseq first. The standard installation method tends to cause trouble, so we clone the repository and install it in editable mode instead.

In [None]:
!git clone https://github.com/facebookresearch/fairseq.git
%cd fairseq
!pip install --editable ./ 

Then we append the fairseq checkout to the `PYTHONPATH` environment variable so that the fairseq commands can be used in the terminal.

In [4]:
!echo $PYTHONPATH
os.environ['PYTHONPATH'] += ":/pfs/data5/home/kit/stud/unyfv/fairseq/"
!echo $PYTHONPATH

/pfs/data5/home/kit/stud/unyfv/myEnv/lib/python3.9/site-packages:/home/kit/stud/unyfv/.local/lib/python3.9/site-packages:/opt/bwhpc/common/jupyter/tensorflow/2023-10-10/lib/python3.9/site-packages
/pfs/data5/home/kit/stud/unyfv/myEnv/lib/python3.9/site-packages:/home/kit/stud/unyfv/.local/lib/python3.9/site-packages:/opt/bwhpc/common/jupyter/tensorflow/2023-10-10/lib/python3.9/site-packages:/pfs/data5/home/kit/stud/unyfv/fairseq/
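
To confirm that the command-line tools used later in this notebook are actually reachable, a check along these lines may help. It is a sketch only; the helper `missing_commands` is an assumption, while the three tool names are the ones invoked in the cells below.

```python
import shutil

def missing_commands(commands):
    """Return the commands from `commands` that cannot be found on PATH."""
    return [cmd for cmd in commands if shutil.which(cmd) is None]

fairseq_tools = ["fairseq-preprocess", "fairseq-train", "fairseq-generate"]
print("missing:", missing_commands(fairseq_tools))
```

An empty list means all three entry points resolve through the current `PATH`.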


## 1.3 Install other packages

In [None]:
!pip install sacremoses
!pip install sentencepiece
!pip install sacrebleu

## 1.4 Check GPU availability

In [13]:
import torch
print(torch.__version__)
if torch.cuda.is_available():
    device = torch.cuda.current_device()
    print('Current device: ', torch.cuda.get_device_name(device))
else:
    device = 'cpu'
    print('Current device: CPU.')

2.1.0+cu121
Current device:  Tesla V100-SXM2-32GB
True
2


All packages are now installed.
From here on, just run the following cells in order.

# 2. Data Preparation

## 2.1 Download dataset

Here we again use the TED dataset as an example.

In [None]:
!wget -O sample_data.zip https://bwsyncandshare.kit.edu/s/Xx3D56SJmG8PwXj/download
!unzip sample_data.zip -d dataset
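
Before preprocessing, it can be worth verifying that the archive unpacked into the layout the later cells expect. The sketch below assumes the `<partition>.de-en.<lang>` naming used in section 2.2; the helper `missing_files` is hypothetical.

```python
from pathlib import Path

def missing_files(data_dir, partitions=("train", "dev", "tst"), langs=("de", "en")):
    """List the expected `<partition>.de-en.<lang>` files that are absent from `data_dir`."""
    data_dir = Path(data_dir)
    expected = [data_dir / f"{p}.de-en.{l}" for p in partitions for l in langs]
    return [str(path) for path in expected if not path.is_file()]

print(missing_files("dataset/sample_data"))
```

An empty list indicates that all six text files are in place.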

## 2.2 Preprocessing

Segment the text into subwords using BPE.

In [None]:
import sentencepiece as spm

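# After training, two files appear in the working directory: bpe.model and bpe.vocab.
# Note: model_type is not set here, so SentencePiece uses its default (unigram);
# pass model_type="bpe" if a BPE model in the strict sense is wanted.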
spm.SentencePieceTrainer.train(input="dataset/sample_data/train.de-en.en,dataset/sample_data/train.de-en.de",
                               model_prefix="bpe",
                               vocab_size=10000)

print('Finished training sentencepiece model.')

Then we use the trained segmentation model to preprocess the sentences from train/dev/test sets:

In [5]:
# Load the trained sentencepiece model
spm_model = spm.SentencePieceProcessor(model_file="bpe.model")

# Important: encoding="utf-8"
for partition in ["train", "dev", "tst"]:
    for lang in ["de", "en"]:
        src_path = f"dataset/sample_data/{partition}.de-en.{lang}"
        out_path = f"dataset/sample_data/spm.{partition}.de-en.{lang}"
        # Context managers close both files even if encoding fails midway
        with open(src_path, "r", encoding="utf-8") as f_in, \
             open(out_path, "w", encoding="utf-8") as f_out:
            for line in f_in:
                # Segment the line into subwords
                pieces = spm_model.encode(line.strip(), out_type=str)
                # Join the subwords into a space-separated string
                f_out.write(" ".join(pieces) + "\n")
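
SentencePiece marks word boundaries with the character `▁` (U+2581), so the space-joined pieces written above can be turned back into plain text. The sketch below shows that reversal; the function name `desegment` and the sample line are assumptions, and the actual pieces depend on the trained model.

```python
def desegment(line: str) -> str:
    """Undo SentencePiece segmentation of a space-joined piece sequence."""
    # Pieces were joined with spaces, so remove those first, then turn the
    # U+2581 word-boundary markers back into ordinary spaces.
    return line.replace(" ", "").replace("\u2581", " ").strip()

# Hypothetical segmented line; real pieces depend on the trained model.
print(desegment("\u2581That ' s \u2581water ."))  # prints: That's water.
```

This is, as far as I understand it, the same transformation that `--remove-bpe=sentencepiece` applies to the decoder output in section 4.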

Now, we will binarize the data for training with fairseq.  
Feel free to check the [documentation](https://fairseq.readthedocs.io/en/latest/command_line_tools.html) of fairseq commands.

In [None]:
# Transformer
# Preprocess/binarize the data
TEXT="dataset/sample_data"
!echo $TEXT
# Binarize the data for training
!fairseq-preprocess \
    --source-lang de --target-lang en \
    --trainpref $TEXT/spm.train.de-en \
    --validpref $TEXT/spm.dev.de-en \
    --testpref $TEXT/spm.tst.de-en \
    --destdir binarized_data/iwslt14.de-en \
    --thresholdtgt 0 --thresholdsrc 0 \
    --workers 8

In [34]:
# mBART https://github.com/facebookresearch/fairseq/blob/main/examples/mbart/README.md
# Preprocess/binarize the data
TEXT="dataset/sample_data"
!echo $TEXT
# Binarize the data for training
!fairseq-preprocess \
    --source-lang de --target-lang en \
    --trainpref $TEXT/spm.train.de-en \
    --validpref $TEXT/spm.dev.de-en \
    --testpref $TEXT/spm.tst.de-en \
    --destdir binarized_data/iwslt14.de-en \
    --joined-dictionary \
    --workers 8

dataset/sample_data
2024-11-01 20:31:03.455656: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-11-01 20:31:03.491655: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-11-01 20:31:03.491690: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-11-01 20:31:03.491720: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-11-01 20:31:03.499411: I tensorflow/core/pl

The data preprocessing is completed.
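
Besides the binarized files, `fairseq-preprocess` writes plain-text dictionaries into the destination directory. Assuming the usual two-field `<token> <count>` line format, they can be inspected with a few lines of Python. The helper name and the sample entries below are hypothetical.

```python
from collections import Counter

def read_fairseq_dict(lines):
    """Parse `<token> <count>` lines into a Counter, skipping blank lines."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split from the right: the count is the last space-separated field.
        token, count = line.rsplit(" ", 1)
        counts[token] = int(count)
    return counts

# Hypothetical excerpt; a real file would be e.g. binarized_data/iwslt14.de-en/dict.en.txt.
sample = ["\u2581the 1200\n", ". 950\n", "\u2581water 12\n"]
vocab = read_fairseq_dict(sample)
print(len(vocab), vocab.most_common(1))
```

To read a real dictionary, pass an open file handle (opened with `encoding="utf-8"`) instead of `sample`.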

# 3. Training

In [35]:
# Wrap (0.9, 0.98) in double quotes; single quotes may cause an error.
!CUDA_VISIBLE_DEVICES=0 fairseq-train \
    binarized_data/iwslt14.de-en \
    --arch mbart_large --share-decoder-input-output-embed \
    --optimizer adam --adam-betas "(0.9, 0.98)" --clip-norm 0.0 \
    --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --dropout 0.3 --weight-decay 0.0001 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --keep-last-epochs 2 \
    --max-tokens 4096 \
    --max-epoch 10 \
    --fp16

2024-11-01 20:32:03.253714: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-11-01 20:32:03.290233: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-11-01 20:32:03.290269: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-11-01 20:32:03.290292: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-11-01 20:32:03.297900: I tensorflow/core/platform/cpu_feature_g
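
The flags `--lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000` describe a learning rate that ramps up linearly and then decays with the inverse square root of the update number. The sketch below reproduces that shape under the assumption that warmup starts from a rate of 0; it is an illustration, not fairseq's own implementation.

```python
import math

def inverse_sqrt_lr(step, peak_lr=5e-4, warmup_updates=4000, warmup_init_lr=0.0):
    """Learning rate at update `step` under linear warmup then 1/sqrt(step) decay."""
    if step < warmup_updates:
        # Linear ramp from warmup_init_lr (assumed 0 here) up to peak_lr.
        return warmup_init_lr + step * (peak_lr - warmup_init_lr) / warmup_updates
    # After warmup the rate decays proportionally to step ** -0.5.
    return peak_lr * math.sqrt(warmup_updates) / math.sqrt(step)

for step in (1000, 4000, 16000):
    print(step, f"{inverse_sqrt_lr(step):.2e}")
```

Under these assumptions the rate peaks at update 4000 and has halved again by update 16000, which helps when judging whether 10 epochs on a small dataset even leave the warmup phase.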

# 4. Decoding

Now we can generate translations with the trained model.

In [36]:
# The variable below is not needed: fairseq-generate reads the binarized test set directly.
# TEST_INPUT="dataset/spm.tst.de-en.de"

!fairseq-generate binarized_data/iwslt14.de-en \
      --task translation \
      --source-lang de \
      --target-lang en \
      --path checkpoints/checkpoint_best.pt \
      --batch-size 256 \
      --beam 4 \
      --remove-bpe=sentencepiece > "en-de.decode.log"

2024-11-01 21:19:02.870125: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-11-01 21:19:06.611487: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-11-01 21:19:06.611537: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-11-01 21:19:06.611574: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-11-01 21:19:07.163666: I tensorflow/core/platform/cpu_feature_g

We extract the hypotheses and references from the decoding log file.

In [37]:
%%bash
grep ^H "en-de.decode.log" | sed 's/^H-//g' | cut -f 3 | sed 's/ ##//g' > ./hyp.txt
grep ^T "en-de.decode.log" | sed 's/^T-//g' | cut -f 2 | sed 's/ ##//g' > ./ref.txt
head ./hyp.txt
echo ""
head ./ref.txt

That's water.
I can show you.
Now the good news.
Oh.
It's not about technology.
It's not about technology.
Well we had this.
We've done this.
Here's another group.
You look at this process.

The water is completely black.
Where are the different countries?
Now, the good news.
Evan. Okay.
It's not hardware.
It's not software.
Okay, we got that.
And we reported this.
Here's another group.
Look at this complexity here.
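
The same extraction can be done in Python, which makes it easier to keep hypotheses and references paired by sentence id. This is a sketch assuming the tab-separated `H-<id>`, `T-<id>` line layout that the shell pipeline above relies on; the function name, the sample source sentence, and the scores are hypothetical.

```python
def parse_generate_log(lines):
    """Collect hypotheses (H-) and references (T-) keyed by sentence id."""
    hyps, refs = {}, {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        tag, _, sent_id = fields[0].partition("-")
        if not sent_id.isdigit():
            continue  # skip log lines that are not per-sentence records
        if tag == "H" and len(fields) >= 3:
            hyps[int(sent_id)] = fields[2]   # H-<id> \t <score> \t <hypothesis>
        elif tag == "T" and len(fields) >= 2:
            refs[int(sent_id)] = fields[1]   # T-<id> \t <reference>
    ids = sorted(hyps.keys() & refs.keys())
    return [hyps[i] for i in ids], [refs[i] for i in ids]

# Hypothetical log excerpt; scores and the source line are made up.
sample_log = [
    "S-1\tDas Wasser ist schwarz.\n",
    "T-1\tThe water is completely black.\n",
    "H-1\t-0.52\tThat's water.\n",
    "D-1\t-0.52\tThat's water.\n",
]
hyps, refs = parse_generate_log(sample_log)
print(hyps, refs)
```

Unlike the `grep` pipeline, which keeps the order of the log, this version sorts by sentence id; both keep hypothesis and reference aligned.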


# 5. Evaluation

Here we use BLEU as an example.

In [38]:
!echo $PWD
!bash -c "cat hyp.txt | sacrebleu ref.txt"

/pfs/data5/home/kit/stud/unyfv
{
 "name": "BLEU",
 "score": 7.9,
 "signature": "nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.4.3",
 "verbose_score": "44.5/14.7/6.1/2.6 (BP = 0.781 ratio = 0.801 hyp_len = 44594 ref_len = 55644)",
 "nrefs": "1",
 "case": "mixed",
 "eff": "no",
 "tok": "13a",
 "smooth": "exp",
 "version": "2.4.3"
}
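
The `verbose_score` field contains everything needed to recompute the headline number: BLEU is the brevity penalty times the geometric mean of the four n-gram precisions. The sketch below redoes that arithmetic from the rounded values in the output above, so the result is only approximate; the function name is an assumption.

```python
import math

def bleu_from_stats(precisions, hyp_len, ref_len):
    """Combine n-gram precisions (in percent) and lengths into a BLEU score."""
    # Brevity penalty: 1 if the hypothesis is at least as long as the reference.
    bp = 1.0 if hyp_len >= ref_len else math.exp(1.0 - ref_len / hyp_len)
    # Geometric mean of the n-gram precisions.
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))
    return bp * geo_mean, bp

bleu, bp = bleu_from_stats([44.5, 14.7, 6.1, 2.6], hyp_len=44594, ref_len=55644)
print(f"BLEU ~ {bleu:.1f}, BP = {bp:.3f}")
```

The result should land close to the reported score of 7.9 and BP of 0.781. The low brevity penalty shows that the hypotheses are about 20% shorter than the references, which costs a noticeable share of the score.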

In [39]:
!pip install rouge_score
!bash -c "cat hyp.txt | python compute_rouge.py"

ROUGE scores:
rouge1: 0.3851
rouge2: 0.1604
rougeL: 0.3541
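
The script `compute_rouge.py` is not shown in this notebook. As a rough illustration of what a ROUGE-N F-measure captures, here is a simplified sketch based on whitespace tokens. It is not that script and will not reproduce the `rouge_score` package exactly, since it does no punctuation stripping or stemming. The function name is an assumption.

```python
from collections import Counter

def rouge_n(reference: str, hypothesis: str, n: int = 1) -> float:
    """F-measure of n-gram overlap between whitespace-tokenized strings."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref, hyp = ngrams(reference), ngrams(hypothesis)
    overlap = sum((ref & hyp).values())  # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Third sentence pair from the decoding output above.
print(round(rouge_n("Now, the good news.", "Now the good news."), 4))
```

Averaging this value over all lines of `ref.txt` and `hyp.txt` gives a corpus-level figure comparable in spirit to `rouge1` and, with `n=2`, to `rouge2`.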
