Tighter and broader VS16 handling by clipperhouse · Pull Request #24 · clipperhouse/displaywidth

clipperhouse · 2026-05-17T15:01:04Z

Addresses issues pointed out by @gammons in #23, with thanks. I decided to research and do my implementation.

This PR tightens VS16 width handling based on what we learned from Unicode TR51 and Unicode 17 data files.

Problem we are solving

Current behavior can misclassify grapheme width when VS16 appears later in a grapheme cluster (especially in some ZWJ sequences), and it could not represent overlapping properties on one code point.

That matters because some code points are both:

East Asian Ambiguous (A in EastAsianWidth.txt), and
valid VS16 bases (in emoji-variation-sequences.txt).

A single-property model cannot represent that overlap correctly.

What we learned about VS16 and Unicode data

Valid emoji variation sequences are explicitly listed in emoji-variation-sequences.txt (base + FE0F / FE0E).
VS16 is local to a base+VS16 pair; it is not a global “cluster-wide” switch.
In ZWJ data, FE0F can appear later in the cluster and still be semantically meaningful for a later component.
The overlap between East Asian Ambiguous and VS16-eligible bases is real (not hypothetical), so precedence-only modeling is fragile.

Implementation changes

Switch generated character properties to a bitmap model so one code point can carry multiple traits (notably _East_Asian_Ambiguous and _VS16_Eligible).
Parse and include Unicode 17 emoji-variation-sequences.txt in generation.
Update width logic to:
- preserve existing fast-path handling,
- apply immediate base+VS16 checks, and
- handle later FE0F in a grapheme by checking local preceding-base eligibility without rune-decoding the full string.
Add explicit regression coverage for later-VS16 ZWJ cases.
Add FuzzHasEligibleVS16Pair with a UTF-8-decoding reference oracle to validate the byte-level implementation.

Test plan

go generate ./...
go test ./...
go test ./... -run 'TestTR51Conformance/C5'
go test . -run '^$' -fuzz FuzzHasEligibleVS16Pair -fuzztime=10s

only look for the VS16 cases in the variation sequences. still need to look for those with later FE0F.

Copilot

Pull request overview

This PR fixes over-eager VS16 widening in the display-width calculator. Previously any non-wide grapheme followed by U+FE0F was promoted to width 2; now only bases that appear in Unicode's emoji-variation-sequences.txt (as <base> FE0F) are eligible, and the check also handles ZWJ sequences where FE0F appears later inside a grapheme cluster. To support multiple traits on a single code point, the generated character property is changed from an enum to a bitmap.

Changes:

Convert generated property from enum to bitmap and add a new _VS16_Eligible flag fed from emoji-variation-sequences.txt (Unicode 17).
Rewrite graphemeWidth to use bitmap checks and a new hasEligibleVS16Pair helper that locates later <base, FE0F> pairs inside a cluster without rune decoding.
Add a new C4/C5 conformance test and a fuzz target (FuzzHasEligibleVS16Pair) comparing the byte-level scanner to a utf8.DecodeRune-based oracle.

Reviewed changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
width.go	Switches width logic to bitmap properties, adds `hasEligibleVS16Pair`/`is` helpers, restricts VS16 widening to eligible bases.
width_test.go	Renames C3 test, adds C4 (VS16 ignored for invalid bases) and C5 (later VS16 in ZWJ sequences).
internal/gen/unicode.go	Adds `VS16Eligible` map, parses `emoji-variation-sequences.txt`, switches property constants to `1 << iota`, makes `buildPropertyBitmap` OR flags.
internal/gen/trie.go	Emits property constants as a bitmap and updates header comment.
internal/gen/data/17.0.0/emoji-variation-sequences.txt	Vendored Unicode 17 source data for VS16 eligibility.
fuzz_test.go	Adds `FuzzHasEligibleVS16Pair` plus a UTF-8-decoding reference implementation as oracle.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Skip directly to each 0xEF candidate instead of scanning byte-by-byte. Loops past non-FE0F candidates (FE0E, fullwidth forms, etc.) so we still find a valid pair later in the cluster. Add C6 conformance test and fuzz seeds for the FE0E/FE20 patterns that motivated the change. ~1% improvement on String_Mixed/EastAsian/Emoji benchmarks.

clipperhouse added 3 commits May 16, 2026 18:41

first pass for simple case

b92f99d

only look for the VS16 cases in the variation sequences. still need to look for those with later FE0F.

handle later VS16

80d2489

fuzz and fix for hasEligibleVS16Pair

079671b

Copilot AI review requested due to automatic review settings May 17, 2026 15:01

Copilot started reviewing on behalf of clipperhouse May 17, 2026 15:01 View session

Copilot AI reviewed May 17, 2026

View reviewed changes

Comment thread width.go Outdated

Comment thread internal/gen/trie.go

Comment thread width.go Outdated

Comment thread internal/gen/unicode.go

clipperhouse added 7 commits May 17, 2026 12:29

bad parse should error

52119ba

rename & unused

96122d8

oh, a type oversight

a838e3d

benches

af90257

add fuzz to CI

ed69062

method instead of func

6922f65

clipperhouse merged commit aa2c425 into main May 17, 2026
11 checks passed

clipperhouse deleted the vs16-tigthen branch May 17, 2026 21:34

clipperhouse changed the title ~~tighten VS16 width handling and add regression coverage~~ Tighter and broader VS16 handling May 17, 2026

clipperhouse mentioned this pull request May 17, 2026

fix: detect VS16 and skin-tone modifiers anywhere in grapheme cluster #23

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tighter and broader VS16 handling#24

Tighter and broader VS16 handling#24
clipperhouse merged 10 commits into
mainfrom
vs16-tigthen

clipperhouse commented May 17, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

clipperhouse commented May 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem we are solving

What we learned about VS16 and Unicode data

Implementation changes

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

clipperhouse commented May 17, 2026 •

edited

Loading