Curated list of Ukrainian natural language processing (NLP) resources (corpora, pretrained models, libraries, etc.)
- Brown-UK — carefully curated corpus of modern Ukrainian language
- UberText — 6 GB of news, Wikipedia and fiction texts
- Wikipedia
- OSCAR — shuffled sentences extracted from Common Crawl and classified with a language detection model. The Ukrainian portion is 28 GB after deduplication.
- CC-100 — documents extracted from Common Crawl, automatically classified and filtered. The Ukrainian part is 200M sentences, or 10 GB of deduplicated text.
- UA-GEC — grammatical error correction (GEC) and fluency corpus.
- NER-uk — Brown-UK labeled for named entities
- Yakaboo Book Reviews — book reviews, ratings and descriptions
- Universal Dependencies — dependency trees corpus
- ВЕСУМ — POS tag dictionary. Can generate a list of all wordforms valid for spelling.
- Tonal dictionary
- tree_stem — stemmer
- pymorphy2 + pymorphy2-dicts-uk — POS tagger and lemmatizer
- LanguageTool — grammar, style and spell checker
- Stanza — POS, tokenization, lemmatization, etc.
- Helsinki NLP models — 10 language pairs: Ukrainian to and from English, Finnish, French, Spanish and Swedish.
- M2M-100 — translates to and from any of 100 languages.
- fastText
- fastText_multilingual — word vectors in 78 languages, aligned to the same vector space.
- Word2Vec
- GloVe
- LexVec
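
Pretrained word vectors such as the fastText entries above are often distributed as plain-text files. As a minimal sketch, assuming the common `.vec` layout (a header line with vocabulary size and dimension, then one word per line followed by its vector components), the snippet below parses such a file and compares words by cosine similarity. The helper names `load_vec` and `cosine` and the three toy vectors are invented for illustration; check each project's documentation for the actual file format before relying on this.

```python
import io
import math


def load_vec(stream, limit=None):
    """Parse vectors from a text stream in the assumed .vec layout.

    First line: "<vocab_size> <dim>". Each following line: "<word> v1 ... v_dim".
    Returns (dim, {word: [floats]}). `limit` optionally caps the number of words read.
    """
    header = stream.readline().split()
    dim = int(header[1])
    vectors = {}
    for line in stream:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
        if limit is not None and len(vectors) >= limit:
            break
    return dim, vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy data standing in for a real file; the numbers are made up.
TOY_VEC = """3 2
кіт 1.0 0.0
кішка 0.9 0.1
стіл 0.0 1.0
"""

dim, vectors = load_vec(io.StringIO(TOY_VEC))
sim_cat = cosine(vectors["кіт"], vectors["кішка"])
sim_table = cosine(vectors["кіт"], vectors["стіл"])
print(f"dim={dim}, кіт~кішка={sim_cat:.3f}, кіт~стіл={sim_table:.3f}")
```

For a real file, replace the `io.StringIO` stream with `open(path, encoding="utf-8")` and consider passing `limit` to avoid loading the full vocabulary into memory.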