Skip to content

Tighter and broader VS16 handling#24

Merged
clipperhouse merged 10 commits into
mainfrom
vs16-tigthen
May 17, 2026
Merged

Tighter and broader VS16 handling#24
clipperhouse merged 10 commits into
mainfrom
vs16-tigthen

Conversation

@clipperhouse
Copy link
Copy Markdown
Owner

@clipperhouse clipperhouse commented May 17, 2026

Addresses issues pointed out by @gammons in #23, with thanks. I decided to research and do my implementation.

This PR tightens VS16 width handling based on what we learned from Unicode TR51 and Unicode 17 data files.

Problem we are solving

Current behavior can misclassify grapheme width when VS16 appears later in a grapheme cluster (especially in some ZWJ sequences), and it could not represent overlapping properties on one code point.

That matters because some code points are both:

  • East Asian Ambiguous (A in EastAsianWidth.txt), and
  • valid VS16 bases (in emoji-variation-sequences.txt).

A single-property model cannot represent that overlap correctly.

What we learned about VS16 and Unicode data

  • Valid emoji variation sequences are explicitly listed in emoji-variation-sequences.txt (base + FE0F / FE0E).
  • VS16 is local to a base+VS16 pair; it is not a global “cluster-wide” switch.
  • In ZWJ data, FE0F can appear later in the cluster and still be semantically meaningful for a later component.
  • The overlap between East Asian Ambiguous and VS16-eligible bases is real (not hypothetical), so precedence-only modeling is fragile.

Implementation changes

  • Switch generated character properties to a bitmap model so one code point can carry multiple traits (notably _East_Asian_Ambiguous and _VS16_Eligible).
  • Parse and include Unicode 17 emoji-variation-sequences.txt in generation.
  • Update width logic to:
    • preserve existing fast-path handling,
    • apply immediate base+VS16 checks, and
    • handle later FE0F in a grapheme by checking local preceding-base eligibility without rune-decoding the full string.
  • Add explicit regression coverage for later-VS16 ZWJ cases.
  • Add FuzzHasEligibleVS16Pair with a UTF-8-decoding reference oracle to validate the byte-level implementation.

Test plan

  • go generate ./...
  • go test ./...
  • go test ./... -run 'TestTR51Conformance/C5'
  • go test . -run '^$' -fuzz FuzzHasEligibleVS16Pair -fuzztime=10s

only look for the VS16 cases in the variation sequences. still need to look for those with later FE0F.
Copilot AI review requested due to automatic review settings May 17, 2026 15:01
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes over-eager VS16 widening in the display-width calculator. Previously any non-wide grapheme followed by U+FE0F was promoted to width 2; now only bases that appear in Unicode's emoji-variation-sequences.txt (as <base> FE0F) are eligible, and the check also handles ZWJ sequences where FE0F appears later inside a grapheme cluster. To support multiple traits on a single code point, the generated character property is changed from an enum to a bitmap.

Changes:

  • Convert generated property from enum to bitmap and add a new _VS16_Eligible flag fed from emoji-variation-sequences.txt (Unicode 17).
  • Rewrite graphemeWidth to use bitmap checks and a new hasEligibleVS16Pair helper that locates later <base, FE0F> pairs inside a cluster without rune decoding.
  • Add a new C4/C5 conformance test and a fuzz target (FuzzHasEligibleVS16Pair) comparing the byte-level scanner to a utf8.DecodeRune-based oracle.

Reviewed changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
width.go Switches width logic to bitmap properties, adds hasEligibleVS16Pair/is helpers, restricts VS16 widening to eligible bases.
width_test.go Renames C3 test, adds C4 (VS16 ignored for invalid bases) and C5 (later VS16 in ZWJ sequences).
internal/gen/unicode.go Adds VS16Eligible map, parses emoji-variation-sequences.txt, switches property constants to 1 << iota, makes buildPropertyBitmap OR flags.
internal/gen/trie.go Emits property constants as a bitmap and updates header comment.
internal/gen/data/17.0.0/emoji-variation-sequences.txt Vendored Unicode 17 source data for VS16 eligibility.
fuzz_test.go Adds FuzzHasEligibleVS16Pair plus a UTF-8-decoding reference implementation as oracle.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread width.go Outdated
Comment thread internal/gen/trie.go
Comment thread width.go Outdated
Comment thread internal/gen/unicode.go
Skip directly to each 0xEF candidate instead of scanning byte-by-byte.
Loops past non-FE0F candidates (FE0E, fullwidth forms, etc.) so we
still find a valid pair later in the cluster.
Add C6 conformance test and fuzz seeds for the FE0E/FE20 patterns
that motivated the change.
~1% improvement on String_Mixed/EastAsian/Emoji benchmarks.
@clipperhouse clipperhouse merged commit aa2c425 into main May 17, 2026
11 checks passed
@clipperhouse clipperhouse deleted the vs16-tigthen branch May 17, 2026 21:34
@clipperhouse clipperhouse changed the title tighten VS16 width handling and add regression coverage Tighter and broader VS16 handling May 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants