Gukhanmun 0.2.0: Open Korean Dictionary, parenthetical collapsing, proper name grouping, smart numerals #10
dahlia announced in Announcements
Gukhanmun is a library and CLI tool for converting Korean text written in mixed hanja/hangul script into consistently annotated output. Academic texts, historical documents, and legal materials routinely mix Chinese characters (hanja) with Korean script (hangul); Gukhanmun reads them and renders the result in your chosen mode: hangul-only, ruby annotation, original script, and more. It is available as a Rust crate, a Node.js package, a WebAssembly module, and a standalone CLI binary.
0.2.0 ships four user-visible improvements: bundled Open Korean Dictionary (우리말샘) support with North Korean readings out of the box, automatic collapsing of redundant inline reading annotations, correct grouping of proper names and unknown multi-character hanja words, and smarter numeral rendering for dates. The playground on the docs site was also significantly upgraded.
Open Korean Dictionary support
The Standard Korean Language Dictionary (標準國語大辭典) that ships with Gukhanmun covers about 260,000 entries, the prescriptive core of South Korean vocabulary. The Open Korean Dictionary, also published by the National Institute of Korean Language, is a far larger, collaboratively edited resource covering technical vocabulary, regional dialects, archaic forms, and, most significantly, North Korean readings. Its general category alone contains roughly 533,000 hanja keys, about twice the coverage of the Standard dictionary.
0.2.0 adds the Open Korean Dictionary as an optional bundled data source, split into four independently composable categories:
The most immediately visible effect is on the ko-kp preset, which previously had no bundled dictionary at all because the Standard dictionary's South Korean readings are wrong for North Korean text. The ko-kp preset now includes the North Korean category by default, giving it real dictionary support for the first time.
To disable all bundled dictionaries (Standard and Open Korean Dictionary alike), pass --no-bundled-dictionaries:
For JavaScript and TypeScript users, the new @gukhanmun/opendict-fst and @gukhanmun/opendict-cdb packages expose per-category loaders:
JavaScript presets do not auto-load dictionary data; supply the loaders explicitly when you want extended coverage. The dictionary data is distributed under CC BY-SA 2.0 KR, the same license as the Standard dictionary data.
See #5 for the full design rationale and #6 for the implementation.
Redundant parenthetical collapsing
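As a rough picture of the rule this section describes, here is a toy sketch in Rust. It is not Gukhanmun's implementation or API: the function name collapse_hangul_only, the helper is_hanja, and the two-entry dictionary are all hypothetical, and only the simplest case is modelled (a hanja word whose parenthetical exactly matches its dictionary reading, rendered in hangul-only mode). Everything else, including glosses and transliterations, is passed through unchanged, and the reading-pinning case is not modelled at all.

```rust
use std::collections::HashMap;

/// Toy stand-in for a reading dictionary (hanja word -> hangul reading).
/// Hypothetical data for illustration; not the bundled dictionaries.
fn toy_dictionary() -> HashMap<&'static str, &'static str> {
    HashMap::from([("庫間", "곳간"), ("數字", "숫자")])
}

/// True if `c` lies in the CJK Unified Ideographs block (U+4E00..=U+9FFF).
fn is_hanja(c: char) -> bool {
    ('\u{4E00}'..='\u{9FFF}').contains(&c)
}

/// Hypothetical helper: renders `word(paren)` for hangul-only output.
/// If `word` is hanja and `paren` is exactly its dictionary reading,
/// the pair is swapped to `reading(hanja)` instead of `reading(reading)`.
/// Any other input is returned unchanged.
fn collapse_hangul_only(word: &str, paren: &str, dict: &HashMap<&str, &str>) -> String {
    let word_is_hanja = !word.is_empty() && word.chars().all(is_hanja);
    if word_is_hanja && dict.get(word).copied() == Some(paren) {
        // The parenthetical repeats the reading: keep both scripts, reading first.
        return format!("{}({})", paren, word);
    }
    // Already hangul(hanja), a definition gloss, or something unknown: leave as-is.
    format!("{}({})", word, paren)
}
```

Under these assumptions, collapse_hangul_only("庫間", "곳간", &toy_dictionary()) yields 곳간(庫間), while a gloss such as 庫間(물건을 간직하여 두는 곳) comes back untouched.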
Korean authors writing mixed-script text sometimes include an explicit parenthetical alongside the hanja or hangul: 庫間(곳간) (hanja followed by its hangul reading) or 곳간(庫間) (hangul followed by the hanja equivalent). Before this release, Gukhanmun would convert the hanja and leave the parenthetical alone, producing redundant output like 곳간(곳간) in hangul-only mode:
庫間(곳간) 곳간(곳간) 곳간(庫間)
곳간(庫間) 곳간(곳간) 곳간(庫間)
庫間(곳간) 庫間(곳간) 庫間(곳간) ✓
Gukhanmun now recognises the author's intent and renders the word with both scripts in every rendering mode. Definition glosses such as 庫間(물건을 간직하여 두는 곳) and foreign transliterations such as 蔣介石(장제스) (where 介 does not read 제) pass through untouched.
A parenthetical can also pin an alternative reading for that specific occurrence. 數字(수자) fixes the reading to 수자 (“a few characters”) rather than the Standard dictionary's default 숫자 (“numeral”), even though the collapser would normally select the prescribed spelling.
The feature is on by default. To disable it, pass --no-collapse-parens on the CLI or call Builder::collapse_redundant_parens(false) in Rust. See #3 and #4.
Proper name and multi-character word grouping
When the converter encountered a multi-character hanja word not registered in the dictionary (a personal name, a place name, a technical term), but some of its individual characters happened to have single-character entries in the Standard dictionary, it would annotate each character separately instead of grouping them:
The engine now treats single-character dictionary matches that carry no special rendering marks as trivial, and merges them with adjacent fallback segments into a single grouped annotation. Homophone marking still applies correctly across the merged span, and the dictionary provenance is preserved.
This fix applies automatically with no configuration change. See #7 and #8.
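The merging rule can be sketched roughly as follows. This is an illustrative model, not the engine's code: the Segment and Kind types, the marked flag, and group_segments are hypothetical names. It also simplifies in one respect: the merged span is labelled as a plain fallback, whereas the real engine is described as preserving dictionary provenance.

```rust
/// Hypothetical segment kinds produced by a dictionary pass.
#[derive(Debug, Clone, PartialEq)]
enum Kind {
    /// Matched a dictionary entry; `marked` means it carries a special rendering mark.
    Dictionary { marked: bool },
    /// No dictionary entry; read character by character.
    Fallback,
}

#[derive(Debug, Clone, PartialEq)]
struct Segment {
    hanja: String,
    reading: String,
    kind: Kind,
}

/// Fallback segments and "trivial" matches (one character, unmarked) may merge.
fn is_mergeable(seg: &Segment) -> bool {
    match &seg.kind {
        Kind::Fallback => true,
        Kind::Dictionary { marked } => !*marked && seg.hanja.chars().count() == 1,
    }
}

/// Emits the pending run: merged if it spans several segments and includes
/// at least one fallback segment, otherwise unchanged. Leaves `run` empty.
fn flush(run: &mut Vec<Segment>, out: &mut Vec<Segment>) {
    let has_fallback = run.iter().any(|s| s.kind == Kind::Fallback);
    if run.len() > 1 && has_fallback {
        let hanja: String = run.iter().map(|s| s.hanja.as_str()).collect();
        let reading: String = run.iter().map(|s| s.reading.as_str()).collect();
        out.push(Segment { hanja, reading, kind: Kind::Fallback });
        run.clear();
    } else {
        out.append(run);
    }
}

/// Groups adjacent mergeable segments into single annotations.
fn group_segments(segments: Vec<Segment>) -> Vec<Segment> {
    let mut out = Vec::new();
    let mut run = Vec::new();
    for seg in segments {
        if is_mergeable(&seg) {
            run.push(seg);
        } else {
            flush(&mut run, &mut out);
            out.push(seg);
        }
    }
    flush(&mut run, &mut out);
    out
}
```

For example, a trivial match 洪 (홍) followed by a fallback segment 吉童 (길동) comes out as one segment, 洪吉童 (홍길동), rather than two separately annotated pieces.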
Smart numeral defaults
The CLI --numerals option previously defaulted to hangul-phonetic, which renders date numerals phonetically following Korean lexical conventions (六月 as 유월, not 육월). The new default is smart, which substitutes positional Arabic numerals for date-style hanja numeral sequences:
Words that are ambiguous (a numeral in one reading, a proper name or compound in another), such as 百濟 (Baekje) or 十長生, are left as fallback readings rather than being split into numeric fragments.
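To make the numeral substitution concrete, here is a small sketch of turning a hanja numeral sequence into an integer. It is an assumption-laden illustration, not the library's algorithm: parse_hanja_number, digit_value, and unit_value are hypothetical names, only values below 10,000 written with 十, 百, and 千 are handled, and the ordering of units is not validated. How the smart mode decides that a sequence is date-like, and how it renders the result, is not modelled.

```rust
/// Value of a hanja digit, if `c` is one.
fn digit_value(c: char) -> Option<u32> {
    match c {
        '〇' | '零' => Some(0),
        '一' => Some(1),
        '二' => Some(2),
        '三' => Some(3),
        '四' => Some(4),
        '五' => Some(5),
        '六' => Some(6),
        '七' => Some(7),
        '八' => Some(8),
        '九' => Some(9),
        _ => None,
    }
}

/// Value of a hanja multiplier unit, if `c` is one.
fn unit_value(c: char) -> Option<u32> {
    match c {
        '十' => Some(10),
        '百' => Some(100),
        '千' => Some(1000),
        _ => None,
    }
}

/// Hypothetical helper: parses either a digit-only sequence (二〇二四)
/// or a multiplicative one (千九百四十五). Returns None for anything
/// containing a non-numeral character, or on overflow.
fn parse_hanja_number(s: &str) -> Option<u32> {
    if s.is_empty() {
        return None;
    }
    // Digit-only form: read the characters as decimal places.
    if s.chars().all(|c| digit_value(c).is_some()) {
        return s
            .chars()
            .try_fold(0u32, |acc, c| acc.checked_mul(10)?.checked_add(digit_value(c)?));
    }
    // Multiplicative form: each unit multiplies the digit before it (or 1).
    let mut total = 0u32;
    let mut pending: Option<u32> = None;
    for c in s.chars() {
        if let Some(d) = digit_value(c) {
            if pending.is_some() {
                return None; // two digits in a row inside a multiplicative number
            }
            pending = Some(d);
        } else if let Some(u) = unit_value(c) {
            total = total.checked_add(pending.take().unwrap_or(1).checked_mul(u)?)?;
        } else {
            return None; // not a numeral character, e.g. 濟 in 百濟
        }
    }
    total.checked_add(pending.unwrap_or(0))
}
```

In this sketch 百濟 and 十長生 return None simply because they contain non-numeral characters, which loosely mirrors the idea that such words are left alone rather than split into numeric fragments.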
To restore the previous phonetic calendar behaviour, pass --numerals hangul-phonetic.
Playground improvements
The playground on the docs site received several upgrades in this cycle:
Orthographic fixes and dictionary updates
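The priority-tier idea behind the §30 fix in this section can be illustrated with a short sketch. It assumes a much simpler model than the actual dictionary extractor: tier and pick_reading are hypothetical names, and candidates are plain strings in the order they appear in a source dump. The six listed forms get the best tier, so they win regardless of that order.

```rust
/// The six Sino-Korean saisiot compounds named by Standard Korean Orthography §30.
const SECTION_30_FORMS: [&str; 6] = ["곳간", "셋방", "숫자", "찻간", "툇간", "횟수"];

/// Hypothetical priority tier: lower is better. §30 forms get tier 0.
fn tier(reading: &str) -> u8 {
    if SECTION_30_FORMS.iter().any(|form| *form == reading) {
        0
    } else {
        1
    }
}

/// Picks the candidate with the best tier. Among equal tiers the earliest
/// candidate wins, because `min_by_key` returns the first minimum.
fn pick_reading<'a>(candidates: &[&'a str]) -> Option<&'a str> {
    candidates.iter().copied().min_by_key(|reading| tier(reading))
}
```

With this model, both pick_reading(&["수자", "숫자"]) and pick_reading(&["숫자", "수자"]) select 숫자.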
Both readings of 數字, 숫자 and 수자, appear in the Standard dictionary, but 숫자 is the form mandated by Standard Korean Orthography §30 (한글 맞춤法 第30項), which names six Sino-Korean saisiot (사이시옷) compounds: 곳간, 셋방, 숫자, 찻간, 툇간, 횟수. The dictionary extractor now gives these six forms a dedicated priority tier so the prescribed spelling wins regardless of the order they appear in the source dump. See #1 (數字 should convert to 숫자, not 수자) and #2 (Prefer §30 saisiot readings over saisiot-free homographs).
Upgrading and installing
Upgrading from 0.1.x
Two default behaviours changed in 0.2.0; existing users should be aware of them before upgrading:
The CLI --numerals default is now smart. If you relied on phonetic calendar readings such as 六月 → 유월, add --numerals hangul-phonetic to restore the previous behaviour. The library preset defaults are unchanged.
Redundant parenthetical collapsing is now on by default. Pass --no-collapse-parens (CLI), set collapseRedundantParens: false (JavaScript), or call Builder::collapse_redundant_parens(false) (Rust) to opt out.
For Rust users: Annotation is now #[non_exhaustive], so exhaustive pattern matches on it will fail to compile. Add a .. arm or construct new Annotation values from Annotation::default().
The opendict Cargo feature, which bundles the Open Korean Dictionary data (~8 MB), is enabled by default. If binary size or compile time is a concern, disable it explicitly:
CLI
Install a prebuilt binary with mise:
On Windows, use winget:
Or build from source with Cargo:
Prebuilt archives for Linux, macOS, and Windows are attached to the 0.2.0 release on GitHub.
Rust
See the Rust installation guide for feature flag options.
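For Rust code affected by the #[non_exhaustive] change to Annotation mentioned in the upgrade notes, the two suggested patterns look roughly like this. The struct below is a stand-in, assuming Annotation is a struct: its fields base and reading are invented for illustration and are not the crate's real fields. Inside a single file the attribute is not enforced, so this only shows the shape that downstream code would take.

```rust
/// Hypothetical stand-in for the library's `Annotation`.
/// The field names are assumptions made for this example.
#[non_exhaustive]
#[derive(Debug, Default, Clone, PartialEq)]
pub struct Annotation {
    pub base: String,
    pub reading: String,
}

/// Destructure with `..` so that fields added later do not break the pattern.
pub fn reading_of(annotation: &Annotation) -> &str {
    let Annotation { reading, .. } = annotation;
    reading
}

/// Build a value by starting from `Default` and assigning fields,
/// since struct literals are rejected for non-exhaustive structs
/// outside the defining crate.
pub fn make_annotation(base: &str, reading: &str) -> Annotation {
    let mut annotation = Annotation::default();
    annotation.base = base.to_string();
    annotation.reading = reading.to_string();
    annotation
}
```

Note that functional update syntax (Annotation { reading, ..Default::default() }) is also rejected for non-exhaustive structs from another crate, which is why the sketch assigns fields after calling default().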
JavaScript/TypeScript
See the JavaScript installation guide for npm, yarn, Bun, and Deno equivalents.