awesome-ukrainian-nlp

Curated list of Ukrainian natural language processing (NLP) resources (corpora, pretrained models, libraries, etc.)

1. Datasets / Corpora

Monolingual

Brown-UK — carefully curated corpus of modern Ukrainian language
UberText — 6 GB of news, Wikipedia and fiction texts
Wikipedia
OSCAR — shuffled sentences extracted from Common Crawl and classified with a language detection model. The Ukrainian portion is 28 GB after deduplication.
CC-100 — documents extracted from Common Crawl, automatically classified and filtered. The Ukrainian part is 200M sentences, or 10 GB of deduplicated text.

Labeled

UA-GEC — grammatical error correction (GEC) and fluency corpus.
NER-uk — Brown-UK labeled for named entities
Yakaboo Book Reviews — book reviews, ratings and descriptions
Universal Dependencies — corpus of dependency trees

Dictionaries

ВЕСУМ — POS tag dictionary. Can generate a list of all wordforms valid for spelling.
Tonal dictionary
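
A wordform list such as the one ВЕСУМ can generate lends itself to a simple membership-based spelling check. The sketch below assumes a hypothetical plain-text export with one wordform per line; the actual export format may differ, and `load_wordforms` and `unknown_words` are illustrative names, not part of any listed project.

```python
import re
from typing import Iterable, List, Set

# Runs of Unicode letters, optionally joined by an apostrophe (as in "п'ять").
TOKEN_RE = re.compile(r"[^\W\d_]+(?:['ʼ’][^\W\d_]+)*")


def normalize(word: str) -> str:
    """Lowercase a word and map apostrophe variants to a plain ASCII apostrophe."""
    return word.lower().replace("ʼ", "'").replace("’", "'")


def load_wordforms(lines: Iterable[str]) -> Set[str]:
    """Build a lookup set from lines holding one wordform each (assumed format)."""
    return {normalize(line.strip()) for line in lines if line.strip()}


def unknown_words(text: str, wordforms: Set[str]) -> List[str]:
    """Return the tokens of `text` that are absent from the wordform set."""
    return [
        token
        for token in TOKEN_RE.findall(text)
        if normalize(token) not in wordforms
    ]
```

With a real export, `load_wordforms` could be fed an open file object (opened with `encoding="utf-8"`). A plain lookup like this only flags unknown strings; it does not catch valid words used in the wrong context.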

2. Tools

tree_stem — stemmer
pymorphy2 + pymorphy2-dicts-uk — POS tagger and lemmatizer
LanguageTool — grammar, style and spell checker
Stanza — POS, tokenization, lemmatization, etc.
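
As a rough illustration of the pymorphy2 entry above, the sketch below lemmatizes a few words and reports the tag of the top-ranked parse. It assumes that both `pymorphy2` and `pymorphy2-dicts-uk` are installed; `make_analyzer` and `analyze_words` are hypothetical helper names, not part of the library.

```python
from typing import Any, Iterable, List, Tuple


def make_analyzer() -> Any:
    """Create a Ukrainian analyzer (requires pymorphy2 and pymorphy2-dicts-uk)."""
    import pymorphy2  # imported lazily so the helper below works without it

    return pymorphy2.MorphAnalyzer(lang="uk")


def analyze_words(words: Iterable[str], analyzer: Any) -> List[Tuple[str, str, str]]:
    """Return (word, lemma, tag) triples using the first parse of each word."""
    results = []
    for word in words:
        parses = analyzer.parse(word)
        if not parses:
            # No analysis available: fall back to the word itself, empty tag.
            results.append((word, word, ""))
            continue
        best = parses[0]  # pymorphy2 orders parses by estimated score
        results.append((word, best.normal_form, str(best.tag)))
    return results
```

A typical call would be `analyze_words(["мови", "читала"], make_analyzer())`. Taking only the first parse ignores ambiguity, so real pipelines may want to inspect all candidates.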

3. Pretrained models

Language models

Machine translation

Helsinki NLP models — 10 language pairs, Ukrainian from/to English, Finnish, French, Spanish, Swedish.
M2M-100 — translate from/to any of 100 languages.

Sequence-to-sequence models

mBART50
mT5

Named-entity recognition (NER)

MITIE NER Model

Word embeddings

fastText
fastText_multilingual — word vectors in 78 languages, aligned to the same vector space.
Word2Vec
GloVe
LexVec
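
Pretrained fastText vectors are commonly distributed as a plain-text `.vec` file: a header line with the vocabulary size and dimensionality, followed by one word and its vector per line. The sketch below reads such a file with the standard library only and compares two words by cosine similarity. The function names are illustrative, and other embeddings in this list may use a different layout (GloVe files, for instance, usually have no header line).

```python
import math
from typing import Dict, List, Optional, TextIO


def load_vec(stream: TextIO, limit: Optional[int] = None) -> Dict[str, List[float]]:
    """Parse a text-format vector file with a 'count dim' header line."""
    header = stream.readline().split()
    dim = int(header[1])
    vectors: Dict[str, List[float]] = {}
    for line in stream:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
        if limit is not None and len(vectors) >= limit:
            break  # stop early to keep memory use bounded
    return vectors


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

For a downloaded file, pass an open handle, e.g. `load_vec(open("cc.uk.300.vec", encoding="utf-8"), limit=50000)`; the `limit` argument is there because full vocabularies can take several gigabytes in memory.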
