Curated list of Ukrainian natural language processing (NLP) resources (corpora, pretrained models, libraries, etc.)

anstadnik/awesome-ukrainian-nlp

 
 

awesome-ukrainian-nlp

Curated list of Ukrainian natural language processing (NLP) resources (corpora, pretrained models, libraries, etc.)

1. Datasets / Corpora

Monolingual

  • Brown-UK — carefully curated corpus of modern Ukrainian language
  • UberText — 6 GB of news, Wikipedia and fiction texts
  • Wikipedia
  • OSCAR — shuffled sentences extracted from Common Crawl and classified with a language detection model. The Ukrainian portion is 28 GB after deduplication.
  • CC-100 — documents extracted from Common Crawl, automatically classified and filtered. The Ukrainian part is 200M sentences, or 10 GB of deduplicated text.

Labeled

Dictionaries

2. Tools

3. Pretrained models

Language models

Machine translation

  • Helsinki NLP models — 10 language pairs: Ukrainian to and from English, Finnish, French, Spanish, and Swedish.
  • M2M-100 — translates from and to any of 100 languages.
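
As a rough illustration of how one of the Helsinki NLP pairs listed above might be used, here is a minimal Python sketch. It assumes the checkpoints are published on the Hugging Face Hub under the `Helsinki-NLP/opus-mt-{src}-{tgt}` naming pattern and that the `transformers` library (with PyTorch) is installed; neither detail comes from this list. The helper names `opus_mt_model_name` and `translate` are hypothetical.

```python
# Partner languages taken from the list above. The Hub naming pattern
# "Helsinki-NLP/opus-mt-{src}-{tgt}" is an assumption, not stated in this list.
SUPPORTED_PARTNERS = {"en", "fi", "fr", "es", "sv"}


def opus_mt_model_name(src: str, tgt: str) -> str:
    """Build the assumed Hub identifier for a Ukrainian language pair."""
    pair = {src, tgt}
    # Exactly one side must be Ukrainian ("uk"); the other must be a listed partner.
    if "uk" not in pair or len(pair) != 2 or not (pair - {"uk"}) <= SUPPORTED_PARTNERS:
        raise ValueError(f"unsupported pair: {src}-{tgt}")
    return f"Helsinki-NLP/opus-mt-{src}-{tgt}"


def translate(texts: list[str], src: str = "uk", tgt: str = "en") -> list[str]:
    """Translate a batch of sentences (downloads the model on first use)."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import MarianMTModel, MarianTokenizer

    name = opus_mt_model_name(src, tgt)
    tokenizer = MarianTokenizer.from_pretrained(name)
    model = MarianMTModel.from_pretrained(name)
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)
```

Calling `translate(["Доброго ранку!"])` would then return a one-element list with the English translation, provided the assumed checkpoint is available. Loading the model inside the function keeps the sketch short; for repeated calls it would be cheaper to load it once and reuse it.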

Sequence-to-sequence models

Named-entity recognition (NER)

Word embeddings
