Releases: dmytro-yemelianov/verbacorpus

verba corpus v1.0.2

24 Jun 11:20

[1.0.2] - 2026-06-24

  • Proper bibliographic references for all 5 sources: generated references.bib (BibTeX) and references.csl.json (CSL-JSON), plus rendered citations.
  • About «Джерела» ("Sources") is now a bibliography with BibTeX/CSL-JSON downloads.
  • Full source citations on the /p/:id detail page and in DATACARD, the croissant isBasedOn field, and the CITATION.cff references: key.
  • Verified ISBN 978-966-02-5147-2 for the Mlodzynskyi 2009 reprint.
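
The ISBN in the last item can at least be sanity-checked offline with the standard ISBN-13 check-digit rule. The sketch below is illustrative only: the function name isbn13_is_valid is hypothetical, and a passing checksum shows that the number is well-formed, not that it belongs to the Mlodzynskyi reprint (confirming that presumably still needs a catalogue lookup).

```python
def isbn13_is_valid(isbn: str) -> bool:
    """Return True if `isbn` is 13 digits with a correct ISBN-13 check digit."""
    # Hyphens and spaces are presentation only; drop them before checking.
    compact = isbn.replace("-", "").replace(" ", "")
    if len(compact) != 13 or not (compact.isascii() and compact.isdigit()):
        return False
    # Weights alternate 1, 3, 1, 3, ... over all 13 digits; the weighted
    # sum of a valid ISBN-13 is a multiple of 10.
    total = sum(int(ch) * (3 if i % 2 else 1) for i, ch in enumerate(compact))
    return total % 10 == 0


print(isbn13_is_valid("978-966-02-5147-2"))  # True
```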

SHA256 corpus.csv: 79facbbcd1991ed8d1ae36a49684b2d5046ebcfed4bb71cee12870d9d929dbe0
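
A downloaded corpus.csv can be compared against the published digest. This is a minimal sketch using Python's standard hashlib; the helper names (sha256_of_file, matches_release) and the local path corpus.csv are assumptions for illustration, not part of the project.

```python
import hashlib
from pathlib import Path

# Digest published in the release notes for corpus.csv.
RELEASE_SHA256 = "79facbbcd1991ed8d1ae36a49684b2d5046ebcfed4bb71cee12870d9d929dbe0"


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large exports need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_release(path, expected=RELEASE_SHA256):
    """Return True if the file's SHA-256 equals the expected hex digest."""
    return sha256_of_file(path) == expected.lower()


if __name__ == "__main__":
    local = Path("corpus.csv")  # assumed download location
    if local.exists():
        print("checksum OK" if matches_release(local) else "checksum MISMATCH")
```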

Live: https://verbacorpus.org · API: https://verbacorpus.org/api.html · License: CC BY 4.0 (compilation)

verba corpus v1.0.1

24 Jun 07:29

[1.0.1] - 2026-06-24

Text-quality cleanup (no schema or count change; 48,787 entries).

  • Fixed ~150 entries that began with OCR junk (stray punctuation, list numbers, leading dashes): the junk was stripped and the entry recapitalized (e.g. ' По парі… → По парі…).
  • Repaired 70 mixed-script (homoglyph) entries where OCR had substituted Latin or Greek letters for Cyrillic ones (e.g. He варт → Не варт, Τοто → Тото); archaic word forms preserved.
  • Rescued 6 severely garbled entries (conservative, non-fabricating).
  • Canonicalized all text to plain ASCII punctuation (straight quotes, ASCII apostrophe, hyphen, ...) for code-friendly exports; the website/cards render Ukrainian typography (« » «—» ’ …) via a display layer.
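
The homoglyph repair and the ASCII punctuation pass described above can be sketched roughly as follows. This is not the project's actual cleanup code: the lookalike table is a deliberately tiny illustrative subset, the function names are hypothetical, and mapping every dash to a plain hyphen is an assumption based on the wording of the note.

```python
import re

# Illustrative subset of Latin/Greek -> Cyrillic lookalikes (assumption).
LOOKALIKES = {
    "H": "\u041d",       # Latin H -> Cyrillic En
    "T": "\u0422",       # Latin T -> Cyrillic Te
    "a": "\u0430", "c": "\u0441", "e": "\u0435",
    "o": "\u043e", "p": "\u0440", "x": "\u0445",
    "\u03a4": "\u0422",  # Greek capital Tau -> Cyrillic Te
    "\u03bf": "\u043e",  # Greek small omicron -> Cyrillic o
}

CYRILLIC = re.compile(r"[\u0400-\u04FF]")
WORD = re.compile(r"[^\W\d_]+")  # runs of Unicode letters


def fix_homoglyphs(text):
    """Swap lookalike letters for Cyrillic inside Cyrillic-context text."""
    if not CYRILLIC.search(text):
        return text  # no Cyrillic at all: leave untouched

    def repair(match):
        word = match.group(0)
        foreign = [ch for ch in word if not CYRILLIC.match(ch)]
        # Convert only when every non-Cyrillic letter is a known lookalike,
        # so genuinely foreign words are left alone.
        if foreign and all(ch in LOOKALIKES for ch in foreign):
            return "".join(LOOKALIKES.get(ch, ch) for ch in word)
        return word

    return WORD.sub(repair, text)


# Typographic punctuation -> plain ASCII, for code-friendly exports.
ASCII_PUNCT = str.maketrans({
    "\u00ab": '"', "\u00bb": '"', "\u201c": '"', "\u201d": '"', "\u201e": '"',
    "\u2018": "'", "\u2019": "'", "\u02bc": "'",
    "\u2013": "-", "\u2014": "-",
    "\u2026": "...",
})


def to_ascii_punct(text):
    """Replace guillemets, curly quotes, dashes and ellipsis with ASCII."""
    return text.translate(ASCII_PUNCT)
```

A display layer would apply the reverse direction at render time, which keeps the stored exports ASCII-only in their punctuation while the site still shows Ukrainian typography.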

SHA256 corpus.csv: 79facbbcd1991ed8d1ae36a49684b2d5046ebcfed4bb71cee12870d9d929dbe0

verba corpus v1.0.0

24 Jun 07:28

[1.0.0] - 2026-06-24

Initial public release.

  • 48,787 proverbs and adages from 5 sources: Франко 1901 (30,906),
    Номис 1864 (9,785), Бобкова (5,613), Ількевич 1841 (2,702), Млодзинський 2009 (2,261).
  • Every entry enriched with a modern-spelling rendering (modern_text) and 1–3 of 27
    thematic categories; 30,532 carry scholarly explanations.
  • Non-destructive variant linking across sources; 10-column schema.
  • Distributed as CSV, JSON, JSONL, XML; live at https://verbacorpus.org with a
    multi-format REST API and semantic search.
  • Known limitations: Nomis 1864 is best-effort OCR (~75–80% character fidelity);
    category tags ~85% acceptable; modern_text ~95% acceptable. Enrichment is LLM-generated.
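
For a first look at the CSV export, something like the following may be enough. It is only a sketch: modern_text is the one column name given in these notes, while the source column name, the summarize helper and the local file path are assumptions that should be checked against the real 10-column header.

```python
import csv
from collections import Counter


def summarize(path, source_column="source", modern_column="modern_text"):
    """Count rows, rows per source, and rows with a non-empty modern_text.

    `source_column` is a hypothetical name; adjust it to the actual header.
    """
    per_source = Counter()
    total = 0
    with_modern = 0
    with open(path, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            total += 1
            per_source[row.get(source_column) or "(unknown)"] += 1
            if (row.get(modern_column) or "").strip():
                with_modern += 1
    return {
        "total": total,
        "per_source": dict(per_source),
        "with_modern_text": with_modern,
    }
```

Calling summarize("corpus.csv") on a local copy returns a small dictionary that can be compared by hand with the counts listed above.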

SHA256 corpus.csv: 03faf4718ad39a3feaa38e484c709b6fb02260f50dd1f88c0f9f946838fffe43
