0.2.0: UAX #29 word segmentation (Words) by redvers · Pull Request #3 · contact-red/unicode

redvers · 2026-05-28T07:14:48Z

First piece of the 0.2.0 segments theme. UAX #29 word boundaries — mirrors the Graphemes architecture.

What's new

- `WordBreak` closed union (20 values: Other, CR, LF, Newline, Extend, ZWJ, Regional_Indicator, Format, Katakana, Hebrew_Letter, ALetter, Single_Quote, Double_Quote, MidLetter, MidNum, MidNumLet, Numeric, ExtendNumLet, WSegSpace, Extended_Pictographic).
- Auto-generated `_UcdWordBreak` codepoint-range table built from `WordBreakProperty.txt` and `emoji-data.txt` (the latter supplies Extended_Pictographic for WB3c).
- `_WordBreakCursor` state machine implementing WB1..WB16. Two-pass design: the first pass decodes every codepoint into a `(byte_offset, class)` pair, and the second applies the rules with full lookahead. This handles the lookahead-dependent rules cleanly:
  - WB6: `AHLetter × (MidLetter | MidNumLetQ) AHLetter`. Needs to peek past the Mid* to see whether an AHLetter follows.
  - WB7b: `Hebrew_Letter × Double_Quote Hebrew_Letter`. Same shape.
  - WB12: `Numeric × (MidNum | MidNumLetQ) Numeric`. Same shape.
- `Words` topical primitive with `count`, `ranges`, `iter` over `String box`.
- `make conform-word` runs the `WordBreakTest.txt` conformance suite (now part of `make ci`).
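To illustrate why the two-pass design makes the lookahead rules easy, here is a minimal sketch in Python. It is not the Pony `_WordBreakCursor`. The names `classify`, `decode`, and `boundaries` are hypothetical, the classifier is a toy that covers only a handful of ASCII characters, and only WB1, WB2, WB5 to WB12, and the default break rule are modelled (no WB3*, WB4, Hebrew, Katakana, or Regional_Indicator handling).

```python
from typing import List, Optional, Tuple

AH = {"ALetter", "Hebrew_Letter"}
MID_LETTER = {"MidLetter", "MidNumLet", "Single_Quote"}  # MidLetter | MidNumLetQ
MID_NUM = {"MidNum", "MidNumLet", "Single_Quote"}        # MidNum | MidNumLetQ


def classify(ch: str) -> str:
    """Toy stand-in for a real Word_Break table lookup (a few ASCII cases only)."""
    if ch.isascii() and ch.isalpha():
        return "ALetter"
    if ch.isascii() and ch.isdigit():
        return "Numeric"
    if ch in ",;":
        return "MidNum"
    if ch == ".":
        return "MidNumLet"
    if ch == "'":
        return "Single_Quote"
    return "Other"


def decode(text: str) -> List[Tuple[int, str]]:
    """Pass 1: one (utf8_byte_offset, class) pair per codepoint."""
    pairs = []
    offset = 0
    for ch in text:
        pairs.append((offset, classify(ch)))
        offset += len(ch.encode("utf-8"))
    return pairs


def boundaries(text: str) -> List[int]:
    """Pass 2: apply the rules with one class of lookahead and two of lookback."""
    pairs = decode(text)
    if not pairs:
        return []
    cls = [c for _, c in pairs]
    out = [pairs[0][0]]  # WB1: break at start of text
    for i in range(1, len(cls)):
        prev, cur = cls[i - 1], cls[i]
        prev2: Optional[str] = cls[i - 2] if i >= 2 else None
        nxt: Optional[str] = cls[i + 1] if i + 1 < len(cls) else None
        no_break = (
            (prev in AH and cur in AH)                                        # WB5
            or (prev in AH and cur in MID_LETTER and nxt in AH)               # WB6
            or (prev2 in AH and prev in MID_LETTER and cur in AH)             # WB7
            or (prev == "Numeric" and cur == "Numeric")                       # WB8
            or (prev in AH and cur == "Numeric")                              # WB9
            or (prev == "Numeric" and cur in AH)                              # WB10
            or (prev2 == "Numeric" and prev in MID_NUM and cur == "Numeric")  # WB11
            or (prev == "Numeric" and cur in MID_NUM and nxt == "Numeric")    # WB12
        )
        if not no_break:
            out.append(pairs[i][0])  # default: break everywhere else
    out.append(len(text.encode("utf-8")))  # WB2: break at end of text
    return out
```

With the class array already materialised, `nxt` is a plain index lookup, so WB6 and WB12 need no buffering or backtracking. For example, `boundaries("can't stop")` returns `[0, 5, 6, 10]`, while the trailing period in `"3.14."` splits off because no Numeric follows it.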

WB4 transparency

WB4 (`X (Extend | Format | ZWJ)* → X`) is handled by tracking "effective" prev/prev2 classes that skip transparent codepoints. The state machine maintains both the raw prev (for WB3, WB3a, WB3b, WB3c, WB3d, which must see the actual preceding codepoint) and the effective lookback (for WB5..WB13b).
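The raw versus effective lookback can be sketched as follows. This is an illustrative Python model, not the Pony state machine: `lookback_states` and `word_starts` are hypothetical names, the input is a list of already-classified codepoints, and only WB1, WB3, WB3a, WB3b, WB3c, WB4, and WB5 are modelled (WB3d and everything after WB5 are omitted).

```python
from typing import List, Optional, Tuple

TRANSPARENT = {"Extend", "Format", "ZWJ"}
HARD_BREAKS = {"CR", "LF", "Newline"}
AH = {"ALetter", "Hebrew_Letter"}

State = Tuple[Optional[str], Optional[str], Optional[str]]


def lookback_states(classes: List[str]) -> List[State]:
    """For each codepoint, the (raw_prev, eff_prev, eff_prev2) seen before it."""
    raw_prev = eff_prev = eff_prev2 = None
    states = []
    for cls in classes:
        states.append((raw_prev, eff_prev, eff_prev2))
        # WB4 absorbs Extend/Format/ZWJ into the preceding X, except at
        # start of text or directly after CR, LF, or Newline.
        absorbed = (
            cls in TRANSPARENT
            and raw_prev is not None
            and raw_prev not in HARD_BREAKS
        )
        if not absorbed:
            eff_prev2, eff_prev = eff_prev, cls
        raw_prev = cls
    return states


def word_starts(classes: List[str]) -> List[bool]:
    """True where a boundary precedes the codepoint (partial rule set)."""
    result = []
    for cur, (raw_prev, eff_prev, _eff_prev2) in zip(classes, lookback_states(classes)):
        if raw_prev is None:
            result.append(True)    # WB1
        elif raw_prev == "CR" and cur == "LF":
            result.append(False)   # WB3 uses the raw prev
        elif raw_prev in HARD_BREAKS or cur in HARD_BREAKS:
            result.append(True)    # WB3a, WB3b use the raw prev
        elif raw_prev == "ZWJ" and cur == "Extended_Pictographic":
            result.append(False)   # WB3c uses the raw prev
        elif cur in TRANSPARENT:
            result.append(False)   # WB4: never break before a transparent class
        elif eff_prev in AH and cur in AH:
            result.append(False)   # WB5 uses the effective prev
        else:
            result.append(True)    # default break (remaining rules omitted)
    return result
```

In `["ALetter", "Extend", "ZWJ", "ALetter"]` the last ALetter has a raw prev of ZWJ but an effective prev of ALetter, so WB5 still joins the two letters. After `LF`, a following Extend is not absorbed and stands as its own class.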

Test results on Unicode 16.0.0

| Suite | Result |
| --- | --- |
| Unit tests | 152 / 152 |
| NormalizationTest Part 1 | 19,965 / 19,965 |
| NormalizationTest Part 2 | 275,446 / 275,446 |
| GraphemeBreakTest | 1,093 / 1,093 |
| WordBreakTest | 1,826 / 1,826 |

100% UAX #29 word conformance on first complete build.
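For readers unfamiliar with the conformance data, each line of `WordBreakTest.txt` interleaves hex codepoints with `÷` (boundary) and `×` (no boundary), followed by a `#` comment. A minimal Python sketch of parsing one such line is below. `parse_test_line` is a hypothetical helper for illustration only; the actual runner in this PR is the Pony `unicode_word_conform_main`, whose internals are not shown here.

```python
from typing import List, Optional, Tuple


def parse_test_line(line: str) -> Optional[Tuple[str, List[int]]]:
    """Return (text, expected_boundaries) or None for blank/comment lines.

    Boundaries are codepoint indices into `text`, counted from 0.
    """
    body = line.split("#", 1)[0].strip()  # drop the trailing comment
    if not body:
        return None
    text = ""
    expected: List[int] = []
    for token in body.split():
        if token == "÷":
            expected.append(len(text))   # boundary before the next codepoint
        elif token == "×":
            continue                     # explicitly no boundary here
        else:
            text += chr(int(token, 16))  # hex codepoint
    return text, expected
```

A conformance check then compares `expected` against the boundaries the segmenter reports for `text`, after converting between codepoint indices and whatever offsets the segmenter uses (byte offsets, in the case of a `(byte_offset, class)` design).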

Test plan

`make ci` locally — all five suites green
PR CI runs the same

Mirrors the Graphemes infrastructure for word boundaries:

- unicode/word_break.pony: hand-written closed union with 20 primitive values
- unicode_build/word_break_codes: codegen-side name → byte map
- unicode_build/word_break_table: emits _UcdWordBreak from WordBreakProperty.txt + emoji-data (Extended_Pictographic for WB3c)
- unicode/_word_break_cursor.pony: UAX #29 word-boundary state machine; two-pass design with a precomputed (offset, class) array so WB6, WB7b, and WB12 (which need lookahead) work cleanly
- unicode/_word_iterators.pony: range + slice iterators
- unicode/words.pony: Words topical primitive (count, ranges, iter)
- unicode_word_conform_main: WordBreakTest.txt runner
- make conform-word: runs the conformance suite

All rules WB1..WB16 implemented, including the lookahead-dependent ones (WB6 AHLetter × Mid* AHLetter, WB7b Hebrew × DQuote Hebrew, WB12 Numeric × Mid* Numeric). WB4 transparency handled via "effective" prev/prev2 state that skips Extend/Format/ZWJ.

Final tally on Unicode 16.0.0:

- 152 unit tests
- NormalizationTest Part 1: 19,965 / 19,965
- NormalizationTest Part 2: 275,446 / 275,446
- GraphemeBreakTest: 1,093 / 1,093
- WordBreakTest: 1,826 / 1,826 ← new (100% UAX #29)

redvers added 2 commits May 28, 2026 03:13

Updated CHANGELOG.md

9e7f0e7

redvers merged commit 15079a2 into main May 28, 2026
5 checks passed

redvers merged 2 commits into main from segments/words
