
[lexical-playground] Bug Fix: AutocompleteExtension wordlist returns the highest-priority completion #8599

Merged
etrepum merged 2 commits into facebook:main from etrepum:claude/wizardly-bell-lqX7r
May 31, 2026

@etrepum
Collaborator

@etrepum etrepum commented May 30, 2026

Description

createWordlistDictionary built a trie whose query walked it in pre-order, which returns the shallowest (shortest) terminal among the words sharing a prefix rather than the earliest-listed one. That only matches the intended "first-listed wins" semantics when the wordlist is in lexicographic order. The shipped English list is ordered by frequency, so 105 common word pairs (398 prefixes) surfaced a rarer completion than the priority order intends, e.g. "prof" -> "profession" instead of "professional", "cons" -> "construct" instead of "construction". The Korean list is sorted, so it was unaffected.
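The difference between the two semantics can be illustrated with a minimal sketch. Both helpers below are hypothetical and are not the playground's code: `firstListedCompletion` is a plain linear scan in list order, and `shortestCompletion` only models the nested-word case of the old pre-order walk (the shallowest terminal wins). Returning the whole word rather than a suffix is also an assumption made for brevity.

```typescript
// Hypothetical reference: scan in priority (list) order and return the first
// word that extends the prefix.
function firstListedCompletion(
  words: readonly string[],
  prefix: string,
): string | null {
  const needle = prefix.toLowerCase();
  for (const word of words) {
    if (word.length > prefix.length && word.toLowerCase().startsWith(needle)) {
      return word;
    }
  }
  return null;
}

// Simplified model of the buggy behavior for nested words: the shortest
// match wins regardless of its position in the list.
function shortestCompletion(
  words: readonly string[],
  prefix: string,
): string | null {
  const needle = prefix.toLowerCase();
  let best: string | null = null;
  for (const word of words) {
    if (word.length > prefix.length && word.toLowerCase().startsWith(needle)) {
      if (best === null || word.length < best.length) {
        best = word;
      }
    }
  }
  return best;
}

// A frequency-ordered list: the more common word is listed first.
const byFrequency = ['professional', 'profession', 'construction', 'construct'];

const intended = firstListedCompletion(byFrequency, 'prof'); // 'professional'
const buggy = shortestCompletion(byFrequency, 'prof'); // 'profession'
```

On a lexicographically sorted list the shorter nested word is also listed first, which is why the two functions agree there.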

Replace the trie with a compact prefix index: order lists the word indices sorted by case-folded text, so words sharing a prefix form one contiguous block. query() binary-searches for the block and scans it for the earliest-listed word longer than the prefix — exactly the word the original linear scan returned (verified to match across every queryable prefix of both shipped dictionaries). The only retained overhead is one Uint32 per word (~7x less than a prefix->word map on the 25k-word Korean list); folded text is recomputed on the fly and the worst-case block scan is ~40 entries.
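A minimal sketch of the sorted-index idea described above, under stated assumptions: the names `PrefixIndex`, `buildIndex`, `queryIndex`, and `fold` are illustrative, `fold` is taken to be `toLowerCase()`, the query returns the whole word, and `minPrefixLength` gating is omitted. It is not the merged implementation.

```typescript
interface PrefixIndex {
  words: readonly string[];
  order: Uint32Array; // word indices sorted by folded text
}

// Assumed case fold; the real dictionary may differ.
const fold = (s: string): string => s.toLowerCase();

function buildIndex(words: readonly string[]): PrefixIndex {
  const indices = words.map((_, i) => i);
  indices.sort((a, b) => {
    const fa = fold(words[a]);
    const fb = fold(words[b]);
    if (fa < fb) return -1;
    if (fa > fb) return 1;
    return a - b; // tie-break duplicates by original position
  });
  return {words, order: Uint32Array.from(indices)};
}

function queryIndex(index: PrefixIndex, prefix: string): string | null {
  const {words, order} = index;
  const needle = fold(prefix);
  // Lower bound: first sorted entry whose folded text is >= needle.
  let lo = 0;
  let hi = order.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (fold(words[order[mid]]) < needle) {
      lo = mid + 1;
    } else {
      hi = mid;
    }
  }
  // Scan the contiguous block of matches, keeping the smallest original index.
  let best = -1;
  for (let i = lo; i < order.length; i++) {
    const idx = order[i];
    const folded = fold(words[idx]);
    if (!folded.startsWith(needle)) break; // left the prefix block
    if (folded.length > needle.length && (best === -1 || idx < best)) {
      best = idx;
    }
  }
  return best === -1 ? null : words[best];
}

const demo = buildIndex(['professional', 'profession', 'construction', 'construct', 'Apple']);
```

Because folded text is recomputed during the search, the only structure kept beyond the word list itself is the `order` array, which matches the memory trade-off the description mentions.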

Add unit tests covering build and query: priority ordering for nested and non-nested matches, exact-prefix handling, multi-byte (Hangul) words, casing preservation, minPrefixLength gating, duplicates, and empty lists.

Follow-up to #8574

Test plan

New unit tests

…the highest-priority completion

https://claude.ai/code/session_01LSyqnSFhrv8SstFAkXG5GB
@vercel

vercel Bot commented May 30, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
lexical Ready Ready Preview, Comment May 30, 2026 11:31pm
lexical-playground Ready Ready Preview, Comment May 30, 2026 11:31pm

@meta-cla meta-cla Bot added the CLA Signed label May 30, 2026
…detection and document case-fold assumption

detectLanguage walked UTF-16 indices and called codePointAt(i), so a
trailing supplementary-plane ideograph was read as its lone low surrogate
and fell through to 'en'. It now iterates real code points (Array.from)
and recognizes CJK Extension A, Compatibility Ideographs, and the
Supplementary Ideographic Plane alongside CJK Unified, so extended and
astral Han ideographs resolve to 'ja'. (No default dictionary is keyed on
'ja', so this only affects hosts that register one; the en/ko defaults are
unchanged.)
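The surrogate issue can be reproduced in isolation. The snippet below is a sketch with an arbitrary sample string and does not reproduce detectLanguage itself; it only shows why walking UTF-16 indices exposes a lone low surrogate while Array.from yields whole code points.

```typescript
// 'a' followed by U+20000, a supplementary-plane ideograph.
// In UTF-16 this is three code units: 0x61, 0xD840, 0xDC00.
const sample = 'a\u{20000}';

// Walking UTF-16 indices: index 2 lands in the middle of the surrogate
// pair, so codePointAt returns the lone low surrogate 0xDC00.
const byIndex: number[] = [];
for (let i = 0; i < sample.length; i++) {
  byIndex.push(sample.codePointAt(i)!);
}

// Iterating real code points: each element is one whole character.
const byCodePoint: number[] = Array.from(sample, (ch) => ch.codePointAt(0)!);
```

Here `byIndex` ends with 0xDC00, which is in no ideograph range, while `byCodePoint` contains only 0x61 and 0x20000.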

Document that createWordlistDictionary's case-insensitive matching assumes
length-preserving case folding (true for Latin, Hangul, kana, Han) and to
use caseSensitive for scripts where toLowerCase() changes length, e.g.
Turkish 'İ'. Full Unicode-correct casing is avoided on purpose: per-
character folding would reintroduce context-sensitive cases such as Greek
final sigma.
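A quick way to see the length-preserving assumption is to compare lengths before and after lowercasing. The helper name `foldPreservesLength` is hypothetical and only meant as a check a host could run on its own wordlist before relying on case-insensitive matching.

```typescript
// Hypothetical check: does toLowerCase() keep the UTF-16 length of the word?
function foldPreservesLength(word: string): boolean {
  return word.toLowerCase().length === word.length;
}

const latinOk = foldPreservesLength('Apple'); // true
const hangulOk = foldPreservesLength('한국어'); // true (no case distinction)
// U+0130 (capital I with dot above) lowercases to 'i' + U+0307, so the
// folded word is one code unit longer than the source.
const turkishOk = foldPreservesLength('\u0130stanbul'); // false
```

Words for which this returns false are the ones the commit message suggests handling with caseSensitive instead.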

Add unit tests: detectLanguage cases for Extension A, a supplementary-plane
ideograph (bare and trailing after ASCII), and an astral emoji; plus a
dedicated Korean (non-ASCII) wordlist suite covering prefix completion,
earliest-listed priority among shared stems, minPrefixLength gating, exact
words with no suffix, and unmatched prefixes.

https://claude.ai/code/session_01LSyqnSFhrv8SstFAkXG5GB
@potatowagon
Contributor

Review from Sherry's Navi (automated assistant)

Summary

This PR fixes a correctness bug in createWordlistDictionary where the trie-based implementation returned the shortest matching word instead of the earliest-listed (highest-priority) word, and extends CJK detection to supplementary-plane ideographs.

What I verified

1. Correctness of the new algorithm:

  • The sorted-index approach (binary search for the prefix block + scan for lowest original index) correctly implements "first-listed wins" semantics. The binary search finds the contiguous block of words sharing the prefix, and the linear scan through that block tracks the minimum original index — this is equivalent to the linear scan but O(log n + block_size) instead of O(n).
  • The Uint32Array.from(...) with a stable sort (tie-break by original index a - b) ensures deterministic ordering when folded words are equal (duplicates).
  • The fold(word).startsWith(needle) check + break on non-match is correct because the block is sorted by folded text.

2. Edge cases handled:

  • Prefix equals a complete word → returns the next longer word (or null if none exists)
  • Empty wordlist → returns null
  • Prefix longer than all entries → returns null
  • Duplicates → first occurrence wins (by original index)
  • Case-insensitive matching preserves source casing in the suffix
  • Multi-byte (Hangul) words work correctly with the length-based comparisons

3. detectLanguage supplementary-plane fix:

  • Array.from(text) correctly splits by code points (not UTF-16 code units), fixing the surrogate pair issue
  • Extended isCJKIdeograph covers Extension A (U+3400–4DBF), Compatibility Ideographs (U+F900–FAFF), and supplementary plane (U+20000–2FA1F) — all standard ranges
  • Astral emoji test confirms non-CJK supplementary chars still return 'en'

4. Test coverage:

  • Comprehensive unit tests covering priority ordering, nested words, Korean, case insensitivity, minPrefixLength, duplicates, empty lists, and supplementary-plane CJK detection
  • Tests directly validate the "first-listed wins" contract that was broken before

5. Performance consideration:

  • The doc notes ~7x less memory than a prefix→word map for the 25k Korean list
  • Worst-case block scan is ~40 entries per the PR description, which is fast for an autocomplete lookup
  • fold() is recomputed on-the-fly during query — acceptable tradeoff vs caching folded strings

6. CI status: All checks green (30+ jobs passing including all e2e matrix combinations). Only collab-mac webkit still pending.
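The ideograph ranges listed under point 3 can be sketched as a code point predicate. This is an illustration only: the function bodies are assumptions, the CJK Unified range U+4E00–9FFF is added from the Unicode block definition, and `containsHan` is a hypothetical helper rather than the PR's detectLanguage.

```typescript
// Sketch of a range check over the blocks named in the review.
function isCJKIdeograph(cp: number): boolean {
  return (
    (cp >= 0x4e00 && cp <= 0x9fff) || // CJK Unified Ideographs
    (cp >= 0x3400 && cp <= 0x4dbf) || // Extension A
    (cp >= 0xf900 && cp <= 0xfaff) || // Compatibility Ideographs
    (cp >= 0x20000 && cp <= 0x2fa1f) // supplementary-plane ideographs
  );
}

// Hypothetical helper: iterate whole code points, never UTF-16 indices.
function containsHan(text: string): boolean {
  return Array.from(text).some((ch) => isCJKIdeograph(ch.codePointAt(0)!));
}

const trailingAstral = containsHan('abc\u{20000}'); // true
const emojiOnly = containsHan('\u{1F600}'); // false: astral but not Han
```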

Assessment

✅ Safe to approve. This is a well-motivated bug fix with thorough test coverage, correct algorithm implementation, and clean documentation. The Turkish İ limitation is explicitly documented with a workaround (caseSensitive: true). No behavioral regression for existing users — the change makes the dictionary honor priority ordering as originally intended.

@etrepum etrepum added this pull request to the merge queue May 31, 2026
Merged via the queue into facebook:main with commit 00fed3e May 31, 2026
40 checks passed

Labels

CLA Signed, extended-tests

3 participants