This repo contains code written for the CL Team Lab project at the Institute of Natural Language Processing, University of Stuttgart.
It contains a Rust implementation of IBM Model 1, along with Python implementations of byte-pair encoding and BLEU for tokenization and evaluation.
To replicate our studies, please acquire the Korean-Chinese data through the following link: https://aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=realm&dataSetSn=129
Building the Rust code requires Rust and Cargo. If they are not yet installed, please follow the instructions on this page, which explain how to use rustup to install Rust; Cargo is installed alongside it automatically.
Once that is done, compile the code by running:
cargo build
This builds the project. Once the project is built, the program can be run from the main directory:
./target/debug/IBM-1
The source and target corpora are specified in src/main.rs. They should be parallel corpora of the two languages, each in its own .txt file.
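For readers who want to see what IBM Model 1 training does before reading the Rust source, here is a minimal Python sketch of the standard EM procedure. It is not the repo's implementation: the function name `train_ibm1`, the `<NULL>` token, and the toy German-English sentences are assumptions made for illustration only.

```python
from collections import defaultdict

NULL = "<NULL>"  # hypothetical empty word that any f-word may align to


def train_ibm1(pairs, iterations=10):
    """Estimate lexical translation probabilities t[(f, e)] = P(f | e).

    pairs: list of (f_tokens, e_tokens) sentence pairs, already tokenized.
    """
    f_vocab = {f for f_sent, _ in pairs for f in f_sent}
    # Start from a uniform distribution over the f-side vocabulary.
    t = defaultdict(lambda: 1.0 / len(f_vocab))

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (f, e) alignments
        total = defaultdict(float)  # expected count of e being aligned at all

        # E-step: distribute each f-word over the e-words of its sentence.
        for f_sent, e_sent in pairs:
            e_with_null = [NULL] + e_sent
            for f in f_sent:
                z = sum(t[(f, e)] for e in e_with_null)
                for e in e_with_null:
                    posterior = t[(f, e)] / z
                    count[(f, e)] += posterior
                    total[e] += posterior

        # M-step: renormalize the expected counts per e-word.
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t


if __name__ == "__main__":
    toy = [
        (["das", "Haus"], ["the", "house"]),
        (["das", "Buch"], ["the", "book"]),
        (["ein", "Buch"], ["a", "book"]),
    ]
    table = train_ibm1(toy, iterations=5)
    for (f, e), p in sorted(table.items(), key=lambda kv: -kv[1])[:5]:
        print(f"P({f} | {e}) = {p:.3f}")
```

Reading two parallel .txt files line by line and splitting each line into tokens would produce the `pairs` argument expected here.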
The BPE for ko-zh.ipynb file tokenizes the Chinese-Korean data, and the bleu_score.ipynb file evaluates the results.
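As a rough illustration of what byte-pair encoding learns, the sketch below implements the usual merge-learning loop in Python. It is an assumption about the general technique, not a copy of the notebook: the function names, the `</w>` end-of-word marker, and the toy word list are hypothetical.

```python
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(words, num_merges):
    """Return the list of merges learned from a list of word tokens."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges


if __name__ == "__main__":
    print(learn_bpe(["ab", "ab", "abc"], num_merges=2))
```

Because the initial symbols are single characters, the same loop applies to Korean and Chinese text once each line has been split into word-like units.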
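The word_segmentation.ipynb file assumes that the Korean-Chinese data exists under data/. It produces the segmentation data used for phrase-based MT training in Moses. The commands below install GIZA++ and Moses, then train, tune, and evaluate a Chinese-to-Korean model; the absolute paths assume the working directory /home/dojun/smt.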
mkdir smt
cd smt
sudo apt-get install build-essential git-core pkg-config automake libtool wget zlib1g-dev python-dev libbz2-dev
sudo apt-get install libsoap-lite-perl
git clone https://github.com/moses-smt/giza-pp.git
cd giza-pp
make
cd ..
mkdir tools
cp giza-pp/GIZA++-v2/GIZA++ giza-pp/GIZA++-v2/snt2cooc.out giza-pp/mkcls-v2/mkcls tools
cd /home/dojun/smt
wget https://www.achrafothman.net/aslsmt/tools/smt-moses-ao-ubuntu-16.04.tgz
tar -xzvf smt-moses-ao-ubuntu-16.04.tgz
cd ubuntu-16.04/
mv bin ../
mv scripts ../
mv training-tools ../
cd ..
rm -r ubuntu-16.04
rm -r smt-moses-ao-ubuntu-16.04.tgz
mkdir corpus
cd corpus
# copy test.ko, test.zh, train.ko, train.zh, tuning.ko, and tuning.zh to this folder, then check that they are present:
ls
/home/dojun/smt/bin/lmplz -o 3 < /home/dojun/smt/corpus/train.ko > /home/dojun/smt/corpus/train.arpa.ko
/home/dojun/smt/bin/build_binary /home/dojun/smt/corpus/train.arpa.ko /home/dojun/smt/corpus/train.blm.ko
echo "안녕 하세요" | /home/dojun/smt/bin/query /home/dojun/smt/corpus/train.blm.ko
cd /home/dojun/smt
mkdir working
cd working
nohup nice /home/dojun/smt/scripts/training/train-model.perl -root-dir train -corpus /home/dojun/smt/corpus/train -f zh -e ko -alignment grow-diag-final-and -reordering msd-bidirectional-fe -lm 0:3:/home/dojun/smt/corpus/train.blm.ko:8 -external-bin-dir /home/dojun/smt/tools >& training.out &
tail -f training.out
# once the line starting with "(9) create moses.ini @..." appears, you can type CTRL+C to exit the tail mode.
cd /home/dojun/smt/working
nohup nice /home/dojun/smt/scripts/training/mert-moses.pl /home/dojun/smt/corpus/tuning.zh /home/dojun/smt/corpus/tuning.ko /home/dojun/smt/bin/moses /home/dojun/smt/working/train/model/moses.ini --mertdir /home/dojun/smt/bin/ &> mert.out &
tail -f mert.out
# once the line starting with "Saving new config to: ./moses.ini Saved: ./moses.ini..." appears, you can type CTRL+C to exit the tail mode.
cd ..
cd working
mkdir binarised-model
/home/dojun/smt/bin/processPhraseTableMin -in train/model/phrase-table.gz -nscores 4 -out binarised-model/phrase-table
/home/dojun/smt/bin/processLexicalTableMin -in train/model/reordering-table.wbe-msd-bidirectional-fe.gz -out binarised-model/reordering-table
cp /home/dojun/smt/working/mert-work/moses.ini /home/dojun/smt/working/binarised-model/
cd binarised-model/
vim moses.ini
# 1. Change PhraseDictionaryMemory to PhraseDictionaryCompact
# 2. Set the path of the PhraseDictionaryCompact feature to point to: /home/dojun/smt/working/binarised-model/phrase-table.minphr
# 3. Set the path of the LexicalReordering feature to point to: /home/dojun/smt/working/binarised-model/reordering-table
# 4. Save moses.ini
# to save and quit, press ESC, then type :wq! followed by ENTER.
cd /home/dojun/smt/working
/home/dojun/smt/scripts/training/filter-model-given-input.pl filtered-corpus-mini mert-work/moses.ini /home/dojun/smt/corpus/test.zh -Binarizer /home/dojun/smt/bin/processPhraseTableMin
nohup nice /home/dojun/smt/bin/moses -f /home/dojun/smt/working/filtered-corpus-mini/moses.ini < /home/dojun/smt/corpus/test.zh > /home/dojun/smt/working/test.translated.ko 2> /home/dojun/smt/working/test.translated.out
# See the log
tail -f /home/dojun/smt/working/test.translated.out
cd ..
/home/dojun/smt/scripts/generic/multi-bleu.perl -lc /home/dojun/smt/corpus/test.ko < /home/dojun/smt/working/test.translated.ko
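To make the score reported by the evaluation step easier to interpret, here is a minimal Python sketch of corpus-level BLEU with a single reference per sentence: clipped n-gram precisions up to 4-grams, combined by a geometric mean and multiplied by a brevity penalty. It is a simplified illustration under those assumptions, not the code in bleu_score.ipynb or multi-bleu.perl, and the names `ngrams` and `corpus_bleu` are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """BLEU on a 0-100 scale for tokenized hypotheses and references."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            h, r = ngrams(hyp, n), ngrams(ref, n)
            # Counter intersection clips each n-gram count by the reference.
            matches[n - 1] += sum((h & r).values())
            totals[n - 1] += sum(h.values())

    # Without smoothing, a single zero precision makes the score zero.
    if min(matches) == 0:
        return 0.0

    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: punish output that is shorter than the reference.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_precision)


if __name__ == "__main__":
    hyp = ["the cat sat on the mat".split()]
    ref = ["the cat sat on a mat".split()]
    print(round(corpus_bleu(hyp, ref), 2))
```

Scores from this sketch will only match another BLEU tool when both use the same tokenization, so comparisons are most meaningful on the segmented test files used above.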
Thanks to achrafothman.net for this nice Moses tutorial.
Team members: (Ryan) Soh-Eun Shim, Dojun Park