0.2.0: UAX #29 word segmentation (Words) #3

Merged: redvers merged 2 commits into `main` from `segments/words` on May 28, 2026

redvers commented on May 28, 2026

First piece of the 0.2.0 segments theme: UAX #29 word boundaries, mirroring the Graphemes architecture.

What's new

  • `WordBreak` closed union (20 values: Other, CR, LF, Newline, Extend, ZWJ, Regional_Indicator, Format, Katakana, Hebrew_Letter, ALetter, Single_Quote, Double_Quote, MidLetter, MidNum, MidNumLet, Numeric, ExtendNumLet, WSegSpace, Extended_Pictographic).
  • Auto-generated `_UcdWordBreak` cp-range table from `WordBreakProperty.txt` and `emoji-data.txt` (Extended_Pictographic for WB3c).
  • `_WordBreakCursor` state machine implementing WB1..WB16. Two-pass design: the first pass decodes every codepoint into a `(byte_offset, class)` pair, and the second applies the rules with full lookahead. This handles the lookahead-dependent rules cleanly:
    • WB6: `AHLetter × (MidLetter | MidNumLetQ) AHLetter` — needs to peek past the Mid* to see if AHLetter follows.
    • WB7b: `Hebrew_Letter × Double_Quote Hebrew_Letter` — same shape.
    • WB12: `Numeric × (MidNum | MidNumLetQ) Numeric` — same shape.
  • `Words` topical primitive with `count`, `ranges`, `iter` over `String box`.
  • `make conform-word` runs WordBreakTest.txt conformance (now part of `make ci`).
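The two-pass idea can be illustrated with a minimal sketch. It is written in Python for brevity and is not the Pony `_WordBreakCursor`; `toy_class`, `decode`, `no_break`, and `word_ranges` are hypothetical names, the classifier is a toy stand-in for the generated `_UcdWordBreak` table, and only WB5..WB12 (without the Hebrew_Letter rules) plus WB999 are covered. Because pass 1 materializes every class up front, WB6 and WB12 can simply index one position ahead.

```python
def toy_class(ch):
    """Toy stand-in for the generated table lookup (assumption, not UCD data)."""
    if ch.isdigit():
        return "Numeric"
    if ch.isalpha():
        return "ALetter"
    table = {":": "MidLetter", ",": "MidNum", ".": "MidNumLet", "'": "Single_Quote"}
    return table.get(ch, "Other")


def decode(text):
    """Pass 1: one (utf8_byte_offset, class) pair per codepoint."""
    pairs, offset = [], 0
    for ch in text:
        pairs.append((offset, toy_class(ch)))
        offset += len(ch.encode("utf-8"))
    return pairs, offset


MID_LETTER = {"MidLetter", "MidNumLet", "Single_Quote"}  # MidLetter | MidNumLetQ
MID_NUM = {"MidNum", "MidNumLet", "Single_Quote"}        # MidNum | MidNumLetQ
WORDISH = {"ALetter", "Numeric"}


def no_break(cls, i):
    """Pass 2: True if no boundary falls between cls[i-1] and cls[i]."""
    prev, cur = cls[i - 1], cls[i]
    prev2 = cls[i - 2] if i >= 2 else None
    nxt = cls[i + 1] if i + 1 < len(cls) else None  # lookahead is a plain index
    if prev in WORDISH and cur in WORDISH:                             # WB5, WB8, WB9, WB10
        return True
    if prev == "ALetter" and cur in MID_LETTER and nxt == "ALetter":   # WB6
        return True
    if prev2 == "ALetter" and prev in MID_LETTER and cur == "ALetter": # WB7
        return True
    if prev2 == "Numeric" and prev in MID_NUM and cur == "Numeric":    # WB11
        return True
    if prev == "Numeric" and cur in MID_NUM and nxt == "Numeric":      # WB12
        return True
    return False                                                       # WB999: break


def word_ranges(text):
    """Return (start, end) byte ranges, one per segment."""
    pairs, end = decode(text)
    if not pairs:
        return []
    cls = [c for _, c in pairs]
    bounds = [0]
    for i in range(1, len(pairs)):
        if not no_break(cls, i):
            bounds.append(pairs[i][0])
    bounds.append(end)
    return list(zip(bounds, bounds[1:]))
```

With this sketch, `word_ranges("can't stop")` keeps `can't` together via WB6 and WB7, while `word_ranges("a.")` breaks before the trailing `.` because no letter follows it.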

WB4 transparency

WB4 (`X (Extend | Format | ZWJ)* → X`) is handled by tracking "effective" prev/prev2 classes that skip the transparent codepoints. The state machine maintains both the raw prev (for WB3, WB3a, WB3b, WB3c, WB3d) and the effective lookback (for WB5..WB13b).
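A minimal sketch of that bookkeeping, in Python rather than Pony and with hypothetical names (`lookback_states`, `TRANSPARENT`, `HARD_BREAK`): it walks a list of class names and records, for each position, the raw previous class alongside the effective prev/prev2 that ignore absorbed Extend/Format/ZWJ. It assumes, per WB4, that a transparent codepoint is not absorbed at start of text or directly after CR, LF, or Newline.

```python
TRANSPARENT = {"Extend", "Format", "ZWJ"}
HARD_BREAK = {"CR", "LF", "Newline"}  # WB4 does not absorb across these


def lookback_states(classes):
    """Return (raw_prev, eff_prev, eff_prev2, cur) for every position i >= 1."""
    states = []
    raw_prev = eff_prev = eff_prev2 = None
    for cur in classes:
        if raw_prev is not None:
            states.append((raw_prev, eff_prev, eff_prev2, cur))
        # A transparent codepoint is folded into the preceding X, so the
        # effective lookback stays where it was; anything else shifts it.
        absorbed = (
            cur in TRANSPARENT
            and eff_prev is not None
            and eff_prev not in HARD_BREAK
        )
        if not absorbed:
            eff_prev2, eff_prev = eff_prev, cur
        raw_prev = cur  # raw lookback always advances (used by WB3..WB3d)
    return states
```

For `["ALetter", "MidLetter", "Extend", "ALetter"]` the last state is `("Extend", "MidLetter", "ALetter", "ALetter")`: the raw prev is the Extend, but the effective lookback still shows `ALetter MidLetter`, which is what a WB7 check needs.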

Test results on Unicode 16.0.0

| Suite | Result |
| --- | --- |
| Unit tests | 152 / 152 |
| NormalizationTest Part 1 | 19,965 / 19,965 |
| NormalizationTest Part 2 | 275,446 / 275,446 |
| GraphemeBreakTest | 1,093 / 1,093 |
| WordBreakTest | 1,826 / 1,826 |

100% UAX #29 word conformance on first complete build.
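For readers unfamiliar with the conformance data, here is a rough sketch of how a `WordBreakTest.txt` line can be checked. It is Python, not the Pony runner in this PR; `parse_test_line` and `conformance` are hypothetical names, and boundaries are compared as codepoint indices, which is an assumption (the real runner may well compare byte offsets). In the test file, `÷` marks a boundary, `×` marks no boundary, the remaining tokens are hex codepoints, and everything after `#` is a comment.

```python
def parse_test_line(line):
    """Return (text, expected_boundaries) or None for blank/comment lines."""
    body = line.split("#", 1)[0].strip()
    if not body:
        return None
    chars, breaks = [], []
    for tok in body.split():
        if tok == "÷":
            breaks.append(len(chars))  # boundary before the next codepoint
        elif tok == "×":
            continue                   # explicit "no boundary here"
        else:
            chars.append(chr(int(tok, 16)))
    return "".join(chars), breaks


def conformance(lines, boundaries_fn):
    """Count how many test lines boundaries_fn reproduces exactly."""
    passed = total = 0
    for line in lines:
        case = parse_test_line(line)
        if case is None:
            continue
        text, expected = case
        total += 1
        if boundaries_fn(text) == expected:
            passed += 1
    return passed, total
```

Here `boundaries_fn` stands for whatever segmenter is under test; it takes a string and returns the sorted list of boundary indices, including 0 and the text length.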

Test plan

  • `make ci` locally — all five suites green
  • PR CI runs the same

redvers added 2 commits May 28, 2026 03:13
Mirrors the Graphemes infrastructure for word boundaries:

  unicode/word_break.pony          — hand-written closed union with
                                     20 primitive values
  unicode_build/word_break_codes   — codegen-side name → byte map
  unicode_build/word_break_table   — emits _UcdWordBreak from
                                     WordBreakProperty.txt + emoji-data
                                     (Extended_Pictographic for WB3c)
  unicode/_word_break_cursor.pony  — UAX #29 word-boundary state
                                     machine; two-pass design with a
                                     precomputed (offset, class) array
                                     so WB6, WB7b, and WB12 (which
                                     need lookahead) work cleanly
  unicode/_word_iterators.pony     — range + slice iterators
  unicode/words.pony               — Words topical primitive
                                     (count, ranges, iter)
  unicode_word_conform_main        — WordBreakTest.txt runner
  make conform-word                — runs the conformance suite

All rules WB1..WB16 implemented including the lookahead-dependent
ones (WB6 AHLetter × Mid* AHLetter, WB7b Hebrew × DQuote Hebrew,
WB12 Numeric × Mid* Numeric). WB4 transparency handled via
"effective" prev/prev2 state that skips Extend/Format/ZWJ.

Final tally on Unicode 16.0.0:

  152 unit tests
  NormalizationTest Part 1: 19,965 / 19,965
  NormalizationTest Part 2: 275,446 / 275,446
  GraphemeBreakTest:        1,093  / 1,093
  WordBreakTest:            1,826  / 1,826   ← new (100% UAX #29)
redvers merged commit 15079a2 into `main` on May 28, 2026
5 checks passed