Instructions for building a KenLM language model from Ukrainian Wikipedia data
You need Python 3 and pip for text processing.
pip install mwxml tqdm
Go to https://dumps.wikimedia.org/backup-index.html and find the "ukwiki: Dump complete" line. Follow the "ukwiki" link.
On that page, find the "Recombine all pages, current versions only." link and download the BZ2 archive.
Decompress the archive, then extract plain text from the dump:
bzip2 -d ukwiki-20220701-pages-meta-current.xml.bz2
python extract_text_from_dump.py ukwiki-20220701-pages-meta-current.xml > uncleaned_text.txt
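extract_text_from_dump.py is your own script and is not shown here; a minimal sketch of what it could look like, using mwxml to iterate over pages and a crude regex pass to strip wiki markup (the cleanup rules below are assumptions, not the actual script):

```python
import re
import sys

def strip_markup(text: str) -> str:
    """Very rough wikitext cleanup: templates, link syntax, quotes, HTML tags."""
    text = re.sub(r"\{\{[^{}]*\}\}", " ", text)                    # {{templates}}
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)  # [[target|label]] -> label
    text = re.sub(r"'{2,}", "", text)                              # ''italic'' / '''bold'''
    text = re.sub(r"<[^>]+>", " ", text)                           # HTML tags
    return re.sub(r"\s+", " ", text).strip()

def main(dump_path: str) -> None:
    import mwxml  # pip install mwxml
    with open(dump_path, "rb") as f:
        for page in mwxml.Dump.from_file(f):
            for revision in page:  # a "current versions only" dump has one revision per page
                if revision.text:
                    print(strip_markup(revision.text))

if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

Real wikitext is messier than these four regexes handle; this only illustrates the overall shape of the extraction step.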
python cleaner.py --corpus-path uncleaned_text.txt --corpus-clean cleaned_text.txt --n-workers 5 --min-words 2
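cleaner.py is likewise your own; a hedged sketch of the core filter such a script might apply — keep only lines with at least --min-words words and drop lines with leftover markup (these heuristics are assumptions; the real script also parallelizes over --n-workers processes):

```python
import re

def keep_line(line: str, min_words: int = 2) -> bool:
    """Decide whether a corpus line is worth keeping."""
    line = line.strip()
    if not line:
        return False
    if re.search(r"[{}<>\[\]|=]", line):  # leftover wiki/HTML markup
        return False
    return len(line.split()) >= min_words

def clean_corpus(lines, min_words: int = 2):
    """Yield whitespace-normalized lines that pass the filter."""
    for line in lines:
        if keep_line(line, min_words):
            yield re.sub(r"\s+", " ", line.strip())
```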
sudo apt install build-essential cmake libboost-system-dev libboost-thread-dev libboost-program-options-dev libboost-test-dev libeigen3-dev zlib1g-dev libbz2-dev liblzma-dev
wget -O - https://kheafield.com/code/kenlm.tar.gz | tar xz
mkdir kenlm/build && cd kenlm/build && cmake .. && make -j2
kenlm/build/bin/lmplz -o 5 < "cleaned_text.txt" > "uk_wiki.arpa"
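Here -o 5 sets the model order (a 5-gram model). The resulting uk_wiki.arpa is a plain-text file in the standard ARPA format; its layout looks roughly like this (placeholders, not real values):

```
\data\
ngram 1=<number of unigrams>
ngram 2=<number of bigrams>
...
ngram 5=<number of 5-grams>

\1-grams:
<log10 probability>	<word>	<backoff weight>
...

\end\
```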
python fix_kenlm.py --arpa-file-in uk_wiki.arpa --arpa-file-out uk_wiki_corrected.arpa
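lmplz writes an ARPA file whose unigram list contains `<s>` but no `</s>`, while many downstream decoders expect both. A sketch of what fix_kenlm.py presumably does — bump the 1-gram count in the header and mirror the `<s>` entry as `</s>` (a common recipe; the actual script may differ):

```python
def fix_arpa(in_path: str, out_path: str) -> None:
    """Add a missing </s> unigram to an ARPA file produced by lmplz.

    Assumption: the fix bumps the 1-gram count and duplicates the <s>
    entry as </s>; the real fix_kenlm.py may do something different.
    """
    with open(in_path, "r", encoding="utf-8") as f_in, \
         open(out_path, "w", encoding="utf-8") as f_out:
        has_added_eos = False
        for line in f_in:
            if not has_added_eos and "ngram 1=" in line:
                count = line.strip().split("=")[-1]
                f_out.write(line.replace(f"={count}", f"={int(count) + 1}"))
            elif not has_added_eos and "<s>" in line:
                f_out.write(line)
                f_out.write(line.replace("<s>", "</s>"))
                has_added_eos = True
            else:
                f_out.write(line)
```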
kenlm/build/bin/build_binary uk_wiki_corrected.arpa uk_wiki_corrected.bin