thxxwiki - create th-xx parallel corpus from Wikipedia dumps

Getting Started

  1. Download the th and xx Wikipedia dumps; replace xx with the language of your choice among those used to train mUSE
bash prepare_wiki.sh data/thwiki th
bash prepare_wiki.sh data/xxwiki xx
  2. Create thwiki.csv and xxwiki.csv from the Wikipedia dumps (a rough sketch of this step is shown after this list)
python wikidump2csv.py --input_dir 'data/thwiki/wiki_extr/th/*/*' --output_path data/thwiki.csv
python wikidump2csv.py --input_dir 'data/xxwiki/wiki_extr/xx/*/*' --output_path data/xxwiki.csv
  3. Align the titles of the two Wikipedias; the default cosine similarity threshold is 0.85 (see the mUSE alignment sketch after this list)
python align_titles.py --en_titles_path data/xxwiki.csv --th_titles_path data/thwiki.csv --output_path data/mappings.csv --bs 10000
  4. Create sentences from the aligned documents.
python create_sentences.py --en_path data/xxwiki.csv --th_path data/thwiki.csv --mappings_path data/mappings.csv --output_en_dir data/xx_sentences --output_th_dir data/th_sentences --use_thres 0.85
  5. Align sentences within each pair of aligned documents; the default cosine similarity threshold is 0.7
python align_sentences.py --en_dir data/xx_sentences --th_dir data/th_sentences --output_path data/xxth_aligned.csv --max_n 3 --bs 10000 --use_thres 0.7
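
For orientation, here is a minimal sketch of what the dump-to-CSV step (wikidump2csv.py) roughly does, assuming prepare_wiki.sh leaves WikiExtractor's default <doc ...> files under wiki_extr. The output column names ("title", "text") are assumptions, not taken from the repo.

import glob
import re
import pandas as pd

# WikiExtractor wraps every article in <doc id=... url=... title=...> ... </doc>
doc_re = re.compile(r'<doc id="[^"]*" url="[^"]*" title="([^"]*)">\n(.*?)</doc>', re.S)

rows = []
for path in glob.glob('data/thwiki/wiki_extr/th/*/*'):
    with open(path, encoding='utf-8') as f:
        for title, text in doc_re.findall(f.read()):
            rows.append({'title': title, 'text': text.strip()})

pd.DataFrame(rows).to_csv('data/thwiki.csv', index=False)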
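
Steps 3 and 5 both embed text with the multilingual Universal Sentence Encoder (mUSE) and keep pairs whose cosine similarity clears a threshold. Below is a minimal, hedged sketch of that idea for title alignment; the TF Hub handle, the CSV column names, and the output format are assumptions, and the real align_titles.py processes titles in batches (--bs 10000) rather than embedding everything at once.

import numpy as np
import pandas as pd
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401 -- registers the SentencePiece ops mUSE needs

# Multilingual Universal Sentence Encoder (assumed model handle)
muse = hub.load('https://tfhub.dev/google/universal-sentence-encoder-multilingual/3')

th_titles = pd.read_csv('data/thwiki.csv')['title'].tolist()  # column name assumed
xx_titles = pd.read_csv('data/xxwiki.csv')['title'].tolist()

th_emb = np.asarray(muse(th_titles))
xx_emb = np.asarray(muse(xx_titles))

# Normalize so a dot product equals cosine similarity
th_emb /= np.linalg.norm(th_emb, axis=1, keepdims=True)
xx_emb /= np.linalg.norm(xx_emb, axis=1, keepdims=True)

sims = th_emb @ xx_emb.T                         # (n_th, n_xx) similarity matrix
best = sims.argmax(axis=1)                       # best xx match per th title
best_score = sims[np.arange(len(best)), best]
keep = best_score >= 0.85                        # default threshold from step 3

pd.DataFrame({
    'th_title': np.array(th_titles)[keep],
    'xx_title': np.array(xx_titles)[best[keep]],
    'score': best_score[keep],
}).to_csv('data/mappings.csv', index=False)

Sentence alignment in step 5 follows the same embed-and-threshold pattern within each matched document pair, only with the lower default threshold of 0.7.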

data directory structure

data

#wikipedia dumps
--xxwiki
--thwiki

#sentences
--xx_sentences
----doc_0000.sent
----doc_0001.sent
...
--th_sentences
----doc_0000.sent
----doc_0001.sent
...

#csvs
--xxwiki.csv
--thwiki.csv
--mappings.csv
--xxth_aligned.csv
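
Once the pipeline finishes, xxth_aligned.csv holds the th-xx sentence pairs. A quick sanity check with pandas is shown below; the 'score' column name in the commented filter is an assumption, so check the printed column names first.

import pandas as pd

aligned = pd.read_csv('data/xxth_aligned.csv')
print(aligned.columns.tolist())  # check the actual column names first
print(aligned.shape)
print(aligned.head())

# Example: keep only pairs well above the default 0.7 cutoff for higher precision.
# The 'score' column name here is an assumption.
# high_conf = aligned[aligned['score'] >= 0.8]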
