Hi guys, by following these tips, we should be able to run this notebook on Windows.  
Before we start:  
Make sure that you open your IDE as administrator. Otherwise, unexpected errors may occur when installing packages.

# 1. Preparation

## 1.1 Install pip and torch

In [None]:
!pip install pip==24.0
!pip show torch | findstr Version

## 1.2 Install fairseq

First, install fairseq. Trust me, the original installation method sucks...

In [None]:
!pip install git+https://github.com/One-sixth/fairseq.git
# This fork fixes a problem that may occur if your Python version is > 3.9.
# In case of an error here:
# It could be helpful to install the Microsoft C++ Build Tools (a C/C++ compiler) from this link:
# https://download.visualstudio.microsoft.com/download/pr/69e24482-3b48-44d3-af65-51f866a08313/99c7677154366062a43082921f40f3ce00ef2614dbf94db23b244dd13dc9443d/vs_BuildTools.exe
# Then download the C++ build tools (about 5.8 GB).

# Original installation method:
# !git clone https://github.com/facebookresearch/fairseq.git
# %cd fairseq
# !pip install --editable ./

Then we need to add a new environment variable so that we can use the fairseq command in the terminal.  
*We will add it manually!*

In [None]:
# On Windows, add a new environment variable using this location path.
!pip show fairseq | findstr Location
# ↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓
# Copy the path and add it manually!

# On Colab:
# !echo $PYTHONPATH
# import os
# os.environ['PYTHONPATH'] += ":/content/fairseq/"
# !echo $PYTHONPATH
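
Once the variable is added (you may need to restart the IDE so that it picks up the change), a quick check like the one below can tell you whether the fairseq commands are reachable. This is only a sketch: `find_commands` is a hypothetical helper, not part of fairseq, and it simply asks Python's `shutil.which` where each command would be found.

```python
import shutil

# Command names used later in this notebook.
FAIRSEQ_COMMANDS = ["fairseq-preprocess", "fairseq-train", "fairseq-generate"]


def find_commands(names):
    """Map each command name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


for name, path in find_commands(FAIRSEQ_COMMANDS).items():
    print(f"{name}: {path or 'NOT FOUND, check your environment variables'}")
```

If a command shows up as not found, the environment variable most likely still needs fixing before you continue.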

## 1.3 Install other packages

In [None]:
!pip install sacremoses
!pip install sentencepiece
!pip install sacrebleu

## 1.4 Check GPU

In [None]:
import torch

if torch.cuda.is_available():
    device = torch.cuda.current_device()
    print('Current device: ', torch.cuda.get_device_name(device))
else:
    device = 'cpu'
    print('Current device: CPU.')

So far, all packages have been installed.
From now on, just execute the following cells.

# 2. Data Preparation

## 2.1 Download dataset

Here we still use the TED dataset as an example.

In [None]:
!wget -O sample_data.zip https://bwsyncandshare.kit.edu/s/Xx3D56SJmG8PwXj/download
# If the wget command is not found, download wget.exe from this website and move it to C:\Windows\System32: https://eternallybored.org/misc/wget/

!unzip -o sample_data.zip -d dataset
# If the unzip command is not found, download "Complete package, except sources" and copy unzip.exe to C:\Windows: https://gnuwin32.sourceforge.net/packages/unzip.htm

!del sample_data.zip

--2024-10-30 11:35:25--  https://bwsyncandshare.kit.edu/s/Xx3D56SJmG8PwXj/download
Resolving bwsyncandshare.kit.edu (bwsyncandshare.kit.edu)... 141.3.135.129, 2a00:1398:b::8d03:8781
Connecting to bwsyncandshare.kit.edu (bwsyncandshare.kit.edu)|141.3.135.129|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: 'sample_data.zip'

     0K .......... .......... .......... .......... .......... 5.66M
    50K .......... .......... .......... .......... .......... 19.1M
   100K .......... .......... .......... .......... .......... 19.9M
   150K .......... .......... .......... .......... .......... 12.4M
   200K .......... .......... .......... .......... .......... 15.7M
   250K .......... .......... .......... .......... .......... 26.0M
   300K .......... .......... .......... .......... .......... 10.8M
   350K .......... .......... .......... .......... ..........  135M
   400K .......... .......... .......... .......... ....

Archive:  sample_data.zip
   creating: dataseta/data/        
 extracting: dataseta/data/dev.wikimedia.de-en.de  
 extracting: dataseta/data/dev.wikimedia.de-en.en  
 extracting: dataseta/data/test.wikimedia.de-en.de  
 extracting: dataseta/data/test.wikimedia.de-en.en  
 extracting: dataseta/data/train.wikimedia.de-en.de  
 extracting: dataseta/data/train.wikimedia.de-en.en  
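
If you would rather not install wget.exe and unzip.exe, the same download-and-extract step can probably be done with Python's standard library alone. The sketch below is an alternative to the cell above, not what the notebook originally used; `download` and `extract` are hypothetical helper names, and the usage lines are commented out because they need network access.

```python
import urllib.request
import zipfile
from pathlib import Path


def download(url, zip_path):
    """Download url to zip_path (needs network access)."""
    urllib.request.urlretrieve(url, zip_path)


def extract(zip_path, dest_dir):
    """Extract all members of zip_path into dest_dir and return their names."""
    Path(dest_dir).mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()


# Usage (same URL and paths as the wget/unzip cell):
# download("https://bwsyncandshare.kit.edu/s/Xx3D56SJmG8PwXj/download", "sample_data.zip")
# print(extract("sample_data.zip", "dataset"))
# Path("sample_data.zip").unlink()  # same effect as `del sample_data.zip`
```

Printing the returned member names is a handy way to check where the files actually ended up before moving on to preprocessing.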


## 2.2 Preprocessing

Segment the text into subwords using BPE.

In [None]:
import sentencepiece as spm

# After execution, you can find two BPE files (bpe.model and bpe.vocab) in the directory.
spm.SentencePieceTrainer.train(input="dataset/train.de-en.en,dataset/train.de-en.de",
                               model_prefix="bpe",
                               vocab_size=10000)

print('Finished training sentencepiece model.')

Then we use the trained segmentation model to preprocess the sentences from train/dev/test sets:

In [11]:
# Load the trained sentencepiece model
spm_model = spm.SentencePieceProcessor(model_file="bpe.model")

# Important: pass encoding="utf-8" explicitly; the default encoding on Windows may not be UTF-8.
for partition in ["train", "dev", "tst"]:
    for lang in ["de", "en"]:
        in_path = f"dataset/{partition}.de-en.{lang}"
        out_path = f"dataset/spm.{partition}.de-en.{lang}"

        # Both files are closed automatically when the with-block ends
        with open(in_path, "r", encoding="utf-8") as f_in, \
             open(out_path, "w", encoding="utf-8") as f_out:
            for line in f_in:
                # Segment the line into subwords
                line_segmented = spm_model.encode(line.strip(), out_type=str)
                # Join the subwords into one space-separated string
                f_out.write(" ".join(line_segmented) + "\n")
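
To get a feeling for what the `spm.*` files contain: SentencePiece marks the start of each word with the character `▁` (U+2581), and the pieces are written separated by spaces. The sketch below uses a made-up segmentation (not real output of the trained model) and a hypothetical helper `join_pieces` that reverses it, which is roughly what we expect `--remove-bpe=sentencepiece` to do during decoding later.

```python
WORD_START = "\u2581"  # SentencePiece's word-boundary marker, rendered as "▁"


def join_pieces(segmented_line):
    """Undo the segmentation: drop spaces between pieces, then turn markers into spaces."""
    return segmented_line.replace(" ", "").replace(WORD_START, " ").strip()


# Hypothetical segmentation of "Hello world", not real model output.
pieces = [WORD_START + "Hello", WORD_START + "wor", "ld"]
segmented = " ".join(pieces)  # the form written to the spm.* files

print(segmented)
print(join_pieces(segmented))
```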

Now, we will binarize the data for training with fairseq.  
Feel free to check the [documentation](https://fairseq.readthedocs.io/en/latest/command_line_tools.html) of fairseq commands.

In [None]:
# Preprocess/binarize the data
TEXT="dataset"
!echo $TEXT
# Binarize the data for training
!fairseq-preprocess \
    --source-lang en --target-lang de \
    --trainpref $TEXT/spm.train.de-en \
    --validpref $TEXT/spm.dev.de-en \
    --testpref $TEXT/spm.tst.de-en \
    --destdir binarized_data/iwslt14.de-en \
    --thresholdtgt 0 --thresholdsrc 0 \
    --workers 8

The data preprocessing is completed.

# 3. Training

In [None]:
# Make sure that (0.9, 0.98) is wrapped in double quotes (""); an error might occur if you use single quotes ('').
!fairseq-train \
    binarized_data/iwslt14.de-en \
    --arch transformer --share-decoder-input-output-embed \
    --optimizer adam --adam-betas "(0.9, 0.98)" --clip-norm 0.0 \
    --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --dropout 0.3 --weight-decay 0.0001 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --keep-last-epochs 2 \
    --max-tokens 4096 \
    --max-epoch 1 \
    --fp16
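
To see what `--lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000` means in practice, here is a rough sketch of the schedule: the learning rate grows linearly during warmup and then decays with the inverse square root of the update number. This is our own re-implementation for illustration, not fairseq's code; the function name is hypothetical, and the starting value `warmup_init_lr=0.0` is an assumption about the default.

```python
def inverse_sqrt_lr(step, peak_lr=5e-4, warmup_updates=4000, warmup_init_lr=0.0):
    """Approximate learning rate at a given update step (illustrative sketch)."""
    if step < warmup_updates:
        # Linear warmup from warmup_init_lr up to peak_lr
        return warmup_init_lr + (peak_lr - warmup_init_lr) * step / warmup_updates
    # After warmup: decay proportionally to 1 / sqrt(step)
    return peak_lr * (warmup_updates / step) ** 0.5


for step in [0, 2000, 4000, 16000]:
    print(step, inverse_sqrt_lr(step))
```

With only one epoch on a small dataset, training may well end before the 4000 warmup updates are reached, so the peak learning rate might never be used.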

# 4. Decoding

Now we can generate translations with the trained model.

In [None]:
!fairseq-generate binarized_data/iwslt14.de-en \
      --task translation \
      --source-lang en \
      --target-lang de \
      --path checkpoints/checkpoint_best.pt \
      --batch-size 256 \
      --beam 4 \
      --remove-bpe=sentencepiece > "en-de.decode.log"

We extract the hypotheses and references from the decoding log file.

In [None]:
%%bash
grep ^H "en-de.decode.log" | sed 's/^H-//g' | cut -f 3 | sed 's/ ##//g' > ./hyp.txt
grep ^T "en-de.decode.log" | sed 's/^T-//g' | cut -f 2 | sed 's/ ##//g' > ./ref.txt
head ./hyp.txt
head ./ref.txt
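
If bash is not available on your Windows machine, the same extraction can be sketched in plain Python. It assumes the log layout that the cell above relies on: hypothesis lines start with `H-` and carry the text in the third tab-separated field, reference lines start with `T-` and carry it in the second. `extract_hyp_ref` is a hypothetical helper, and the sample lines are made up.

```python
def extract_hyp_ref(log_lines):
    """Collect hypotheses (H- lines) and references (T- lines) from a decoding log."""
    hyps, refs = [], []
    for line in log_lines:
        fields = line.rstrip("\n").split("\t")
        if line.startswith("H-") and len(fields) >= 3:
            # H-<id> <tab> <score> <tab> <hypothesis>
            hyps.append(fields[2].replace(" ##", ""))
        elif line.startswith("T-") and len(fields) >= 2:
            # T-<id> <tab> <reference>
            refs.append(fields[1].replace(" ##", ""))
    return hyps, refs


# Made-up log lines, just to show the expected shape.
sample = ["S-0\tsource text\n", "T-0\tref text\n", "H-0\t-0.5\thyp text\n"]
print(extract_hyp_ref(sample))

# Usage on the real log:
# with open("en-de.decode.log", encoding="utf-8") as f:
#     hyps, refs = extract_hyp_ref(f)
# with open("hyp.txt", "w", encoding="utf-8") as f:
#     f.write("\n".join(hyps) + "\n")
# with open("ref.txt", "w", encoding="utf-8") as f:
#     f.write("\n".join(refs) + "\n")
```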

# 5. Evaluation

Here we use BLEU as an example.

In [None]:
!bash -c "cat hyp.txt | sacrebleu ref.txt"