README.md

Tools

In XLM/tools/, you will need to install the following tools:

Tokenizers

Moses tokenizer:

git clone https://github.com/moses-smt/mosesdecoder

Thai PyThaiNLP tokenizer:

pip install pythainlp

Japanese KyTea tokenizer:

wget http://www.phontron.com/kytea/download/kytea-0.4.7.tar.gz
tar -xzf kytea-0.4.7.tar.gz
cd kytea-0.4.7
./configure
make
make install
kytea --help

Chinese Stanford segmenter:

wget https://nlp.stanford.edu/software/stanford-segmenter-2018-10-16.zip
unzip stanford-segmenter-2018-10-16.zip

fastBPE

git clone https://github.com/glample/fastBPE
cd fastBPE
g++ -std=c++11 -pthread -O3 fastBPE/main.cc -IfastBPE -o fast
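
Once the steps above have been run, a quick check can confirm that everything landed where expected. The following is a minimal sketch, not part of the repository: `check_tools` is a hypothetical helper name, the paths are simply the directory and file names produced by the commands above, and it assumes `python` is the interpreter that `pip` installed into.

```shell
# Hypothetical helper: report which of the tools installed above are present.
# The argument is the tools directory (defaults to the current directory).
check_tools() {
    tools_dir="${1:-.}"
    missing=0

    # Directories and binaries created by the git clone / unzip / g++ steps.
    for path in mosesdecoder stanford-segmenter-2018-10-16 fastBPE/fast; do
        if [ -e "$tools_dir/$path" ]; then
            echo "found: $path"
        else
            echo "missing: $path"
            missing=1
        fi
    done

    # KyTea is installed system-wide by `make install`, so look it up on PATH.
    if command -v kytea >/dev/null 2>&1; then
        echo "found: kytea"
    else
        echo "missing: kytea"
        missing=1
    fi

    # PyThaiNLP is a Python package, so try importing it
    # (assumption: `python` is the interpreter used with pip).
    if python -c 'import pythainlp' >/dev/null 2>&1; then
        echo "found: pythainlp"
    else
        echo "missing: pythainlp"
        missing=1
    fi

    # Non-zero exit status if anything is missing.
    return "$missing"
}
```

For example, `check_tools XLM/tools` prints one `found:` or `missing:` line per tool and returns a non-zero status if any of them is absent.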