[lexical-playground] Bug Fix: AutocompleteExtension wordlist returns the highest-priority completion by etrepum · Pull Request #8599 · facebook/lexical

etrepum · 2026-05-30T23:24:18Z

Description

createWordlistDictionary built a trie whose query walked it in pre-order, which returns the shallowest (shortest) terminal among the words sharing a prefix rather than the earliest-listed one. That only matches the intended "first-listed wins" semantics when the wordlist is in lexicographic order. The shipped English list is ordered by frequency, so 105 common word pairs (398 prefixes) surfaced a rarer completion than the priority order intends, e.g. "prof" -> "profession" instead of "professional", "cons" -> "construct" instead of "construction". The Korean list is sorted, so it was unaffected.
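The gap between the two behaviors can be illustrated with a small sketch. Both helpers below are hypothetical and are not taken from the PR: `firstListedCompletion` models the intended "first-listed wins" semantics as a plain linear scan, and `shortestCompletion` approximates what the pre-order walk yields for nested words by picking the shortest match. The sample list assumes the frequency order implied by the examples above.

```typescript
// Hypothetical reference semantics: return the earliest-listed word that
// strictly extends the prefix (case-insensitive), or null if there is none.
function firstListedCompletion(
  words: readonly string[],
  prefix: string,
): string | null {
  const needle = prefix.toLowerCase();
  for (const word of words) {
    const folded = word.toLowerCase();
    if (folded.length > needle.length && folded.startsWith(needle)) {
      return word;
    }
  }
  return null;
}

// Hypothetical model of the old behavior for nested words: the shortest
// word extending the prefix wins, regardless of list position.
function shortestCompletion(
  words: readonly string[],
  prefix: string,
): string | null {
  const needle = prefix.toLowerCase();
  let best: string | null = null;
  for (const word of words) {
    const folded = word.toLowerCase();
    if (folded.length > needle.length && folded.startsWith(needle)) {
      if (best === null || word.length < best.length) {
        best = word;
      }
    }
  }
  return best;
}

// Assumed frequency order: the more common word is listed first.
const byFrequency = ['professional', 'profession', 'construction', 'construct'];

const intended = firstListedCompletion(byFrequency, 'prof'); // 'professional'
const observed = shortestCompletion(byFrequency, 'prof'); // 'profession'
console.log(intended, observed);
```

On a lexicographically sorted list the shorter nested word is also the earlier one, which is why the two helpers agree there and the sorted Korean list never showed the problem.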

Replace the trie with a compact prefix index: the order array lists the word indices sorted by case-folded text, so words sharing a prefix form one contiguous block. query() binary-searches for the start of that block and scans it for the earliest-listed word longer than the prefix, which is exactly the word the original linear scan returned (verified to match across every queryable prefix of both shipped dictionaries). The only retained overhead is one Uint32 per word (~7x less than a prefix->word map on the 25k-word Korean list); folded text is recomputed on the fly and the worst-case block scan is ~40 entries.
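A minimal sketch of the sorted-index idea described above, under stated assumptions: the names `WordlistIndex`, `buildIndex`, `fold`, and `query` are illustrative, folding is assumed to be plain `toLowerCase()`, and `minPrefixLength` gating is left out. For simplicity it returns the whole matched word rather than a suffix. It is not the PR's actual code.

```typescript
interface WordlistIndex {
  words: readonly string[];
  // Word indices sorted by folded text; ties broken by original position.
  order: Uint32Array;
}

// Assumed folding function for this sketch.
const fold = (s: string): string => s.toLowerCase();

function buildIndex(words: readonly string[]): WordlistIndex {
  const indices = words.map((_, i) => i);
  indices.sort((a, b) => {
    const fa = fold(words[a]);
    const fb = fold(words[b]);
    if (fa < fb) return -1;
    if (fa > fb) return 1;
    return a - b; // explicit tie-break keeps duplicates in list order
  });
  return {words, order: Uint32Array.from(indices)};
}

function query(index: WordlistIndex, prefix: string): string | null {
  const {words, order} = index;
  const needle = fold(prefix);
  // Lower bound: first sorted position whose folded word is >= needle.
  let lo = 0;
  let hi = order.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (fold(words[order[mid]]) < needle) {
      lo = mid + 1;
    } else {
      hi = mid;
    }
  }
  // Scan the contiguous block of words that start with the prefix and keep
  // the smallest original index among words strictly longer than the prefix.
  let best = -1;
  for (let i = lo; i < order.length; i++) {
    const idx = order[i];
    const folded = fold(words[idx]);
    if (!folded.startsWith(needle)) break;
    if (folded.length > needle.length && (best === -1 || idx < best)) {
      best = idx;
    }
  }
  return best === -1 ? null : words[best];
}

const sampleIndex = buildIndex([
  'professional',
  'profession',
  'Construction',
  'construct',
  'profession',
]);
console.log(query(sampleIndex, 'prof')); // 'professional'
console.log(query(sampleIndex, 'construct')); // 'Construction'
```

The block is contiguous because every string that starts with the prefix sorts at or after the prefix itself and before any string that diverges from it, so the scan can stop at the first non-match.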

Add unit tests covering build and query: priority ordering for nested and non-nested matches, exact-prefix handling, multi-byte (Hangul) words, casing preservation, minPrefixLength gating, duplicates, and empty lists.

Follow-up to #8574

Test plan

New unit tests

vercel · 2026-05-30T23:24:23Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
lexical	Ready	Preview, Comment	May 30, 2026 11:31pm
lexical-playground	Ready	Preview, Comment	May 30, 2026 11:31pm

…detection and document case-fold assumption

detectLanguage walked UTF-16 indices and called codePointAt(i), so a trailing supplementary-plane ideograph was read as its lone low surrogate and fell through to 'en'. It now iterates real code points (Array.from) and recognizes CJK Extension A, Compatibility Ideographs, and the Supplementary Ideographic Plane alongside CJK Unified, so extended and astral Han ideographs resolve to 'ja'. (No default dictionary is keyed on 'ja', so this only affects hosts that register one; the en/ko defaults are unchanged.)

Document that createWordlistDictionary's case-insensitive matching assumes length-preserving case folding (true for Latin, Hangul, kana, Han) and to use caseSensitive for scripts where toLowerCase() changes length, e.g. Turkish 'İ'. Full Unicode-correct casing is avoided on purpose: per-character folding would reintroduce context-sensitive cases such as Greek final sigma.

Add unit tests: detectLanguage cases for Extension A, a supplementary-plane ideograph (bare and trailing after ascii), and an astral emoji; plus a dedicated Korean (non-ASCII) wordlist suite covering prefix completion, earliest-listed priority among shared stems, minPrefixLength gating, exact words with no suffix, and unmatched prefixes.

https://claude.ai/code/session_01LSyqnSFhrv8SstFAkXG5GB
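The length-preserving assumption can be made concrete with a short sketch. Both helpers are hypothetical and exist only for illustration: `foldPreservesLength` checks the assumption for one word, and `naiveSuffix` shows how slicing the original word by the folded prefix length goes wrong once folding changes the length.

```typescript
// Hypothetical check: does lowercasing keep the UTF-16 length unchanged?
function foldPreservesLength(word: string): boolean {
  return word.toLowerCase().length === word.length;
}

// Hypothetical suffix extraction that slices the ORIGINAL word by the
// FOLDED prefix length. This is only valid when folding preserves length.
function naiveSuffix(word: string, typedPrefix: string): string | null {
  const folded = word.toLowerCase();
  const needle = typedPrefix.toLowerCase();
  if (!folded.startsWith(needle) || folded.length <= needle.length) {
    return null;
  }
  return word.slice(needle.length);
}

const latin = 'Professional';
const hangul = '한국어';
const turkish = '\u0130stanbul'; // U+0130 is LATIN CAPITAL LETTER I WITH DOT ABOVE

console.log(foldPreservesLength(latin)); // true
console.log(foldPreservesLength(hangul)); // true
console.log(foldPreservesLength(turkish)); // false: U+0130 lowercases to 'i' + U+0307

// Source casing is preserved when the assumption holds.
console.log(naiveSuffix(latin, 'pro')); // 'fessional'

// With a length-changing fold the slice is misaligned and drops the 's'.
console.log(naiveSuffix(turkish, 'i\u0307')); // 'tanbul'
```

A word list containing such entries is the situation in which the caseSensitive option mentioned above sidesteps folding altogether.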

potatowagon · 2026-05-31T00:02:26Z

Review from Sherry's Navi (automated assistant)

Summary

This PR fixes a correctness bug in createWordlistDictionary where the trie-based implementation returned the shortest matching word instead of the earliest-listed (highest-priority) word, and extends CJK detection to supplementary-plane ideographs.

What I verified

1. Correctness of the new algorithm:

The sorted-index approach (binary search for the prefix block + scan for lowest original index) correctly implements "first-listed wins" semantics. The binary search finds the contiguous block of words sharing the prefix, and the linear scan through that block tracks the minimum original index — this is equivalent to the linear scan but O(log n + block_size) instead of O(n).
The Uint32Array.from(...) with a stable sort (tie-break by original index a - b) ensures deterministic ordering when folded words are equal (duplicates).
The fold(word).startsWith(needle) check + break on non-match is correct because the block is sorted by folded text.

2. Edge cases handled:

Prefix equals a complete word → returns the next longer word (or null if none exists)
Empty wordlist → returns null
Prefix longer than all entries → returns null
Duplicates → first occurrence wins (by original index)
Case-insensitive matching preserves source casing in the suffix
Multi-byte (Hangul) words work correctly with the length-based comparisons

3. detectLanguage supplementary-plane fix:

Array.from(text) correctly splits by code points (not UTF-16 code units), fixing the surrogate pair issue
Extended isCJKIdeograph covers Extension A (U+3400–4DBF), Compatibility Ideographs (U+F900–FAFF), and supplementary plane (U+20000–2FA1F) — all standard ranges
Astral emoji test confirms non-CJK supplementary chars still return 'en'
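The surrogate issue and the extended ranges can be sketched as follows. This is an illustration, not the PR's detectLanguage: the helper names are hypothetical, the Extension A, Compatibility, and supplementary ranges are the ones quoted in this review, and U+4E00–9FFF is assumed for the base CJK Unified Ideographs block.

```typescript
// Range check using the ranges listed above, plus the base block
// (U+4E00-9FFF, assumed here to be what "CJK Unified" refers to).
function isCJKIdeograph(cp: number): boolean {
  return (
    (cp >= 0x4e00 && cp <= 0x9fff) || // CJK Unified Ideographs
    (cp >= 0x3400 && cp <= 0x4dbf) || // Extension A
    (cp >= 0xf900 && cp <= 0xfaff) || // Compatibility Ideographs
    (cp >= 0x20000 && cp <= 0x2fa1f) // Supplementary Ideographic Plane
  );
}

// Buggy pattern: stepping by UTF-16 index means the final read of an astral
// character lands on its low surrogate half.
function lastCodePointByUnitIndex(text: string): number | undefined {
  let last: number | undefined;
  for (let i = 0; i < text.length; i++) {
    last = text.codePointAt(i);
  }
  return last;
}

// Fixed pattern: Array.from splits the string into whole code points.
function lastCodePoint(text: string): number | undefined {
  const chars = Array.from(text);
  return chars.length > 0 ? chars[chars.length - 1].codePointAt(0) : undefined;
}

// ASCII followed by a supplementary-plane ideograph (U+20000).
const trailingAstral = 'abc' + String.fromCodePoint(0x20000);

console.log(lastCodePointByUnitIndex(trailingAstral) === 0xdc00); // true: lone low surrogate
console.log(lastCodePoint(trailingAstral) === 0x20000); // true: the real code point
console.log(isCJKIdeograph(0x1f600)); // false: an astral emoji is outside every range
```

Reading the lone low surrogate (0xDC00 here) fails every range check, which matches the fall-through to 'en' described in the commit message.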

4. Test coverage:

Comprehensive unit tests covering priority ordering, nested words, Korean, case insensitivity, minPrefixLength, duplicates, empty lists, and supplementary-plane CJK detection
Tests directly validate the "first-listed wins" contract that was broken before

5. Performance consideration:

The doc notes ~7x less memory than a prefix→word map for the 25k Korean list
Worst-case block scan is ~40 entries per the PR description, which is fast for an autocomplete lookup
fold() is recomputed on-the-fly during query — acceptable tradeoff vs caching folded strings

6. CI status: All completed checks are green (30+ jobs passing, including all e2e matrix combinations). Only collab-mac webkit is still pending.

Assessment

✅ Safe to approve. This is a well-motivated bug fix with thorough test coverage, correct algorithm implementation, and clean documentation. The Turkish İ limitation is explicitly documented with a workaround (caseSensitive: true). No behavioral regression for existing users — the change makes the dictionary honor priority ordering as originally intended.

meta-cla Bot added the CLA Signed label May 30, 2026

vercel Bot deployed to Preview – lexical-playground May 30, 2026 23:25 View deployment

vercel Bot deployed to Preview – lexical May 30, 2026 23:25 View deployment

vercel Bot deployed to Preview – lexical-playground May 30, 2026 23:31 View deployment

vercel Bot deployed to Preview – lexical May 30, 2026 23:31 View deployment

etrepum added the extended-tests label May 30, 2026

etrepum marked this pull request as ready for review May 30, 2026 23:36

etrepum requested review from acywatson, fantactuka, ivailop7, potatowagon and zurfyx as code owners May 30, 2026 23:36

potatowagon approved these changes May 31, 2026

etrepum added this pull request to the merge queue May 31, 2026

Merged via the queue into facebook:main with commit 00fed3e May 31, 2026
40 checks passed

etrepum merged 2 commits into facebook:main from etrepum:claude/wizardly-bell-lqX7r
