[lexical-playground] Bug Fix: AutocompleteExtension wordlist returns the highest-priority completion#8599
…the highest-priority completion
createWordlistDictionary built a trie whose query walked it in pre-order, which returns the shallowest (shortest) terminal among the words sharing a prefix rather than the earliest-listed one. That only matches the intended "first-listed wins" semantics when the wordlist is in lexicographic order. The shipped English list is ordered by frequency, so 105 common word pairs (398 prefixes) surfaced a rarer completion than the priority order intends, e.g. "prof" -> "profession" instead of "professional", "cons" -> "construct" instead of "construction". The Korean list is sorted, so it was unaffected.
Replace the trie with a compact prefix index: `order` lists the word indices sorted by case-folded text, so words sharing a prefix form one contiguous block. query() binary-searches for the block and scans it for the earliest-listed word longer than the prefix, which is exactly the word the original linear scan returned (verified to match across every queryable prefix of both shipped dictionaries). The only retained overhead is one Uint32 per word (~7x less than a prefix->word map on the 25k-word Korean list); folded text is recomputed on the fly and the worst-case block scan is ~40 entries.
Add unit tests covering build and query: priority ordering for nested and non-nested matches, exact-prefix handling, multi-byte (Hangul) words, casing preservation, minPrefixLength gating, duplicates, and empty lists.
https://claude.ai/code/session_01LSyqnSFhrv8SstFAkXG5GB
…detection and document case-fold assumption
detectLanguage walked UTF-16 indices and called codePointAt(i), so a trailing supplementary-plane ideograph was read as its lone low surrogate and fell through to 'en'. It now iterates real code points (Array.from) and recognizes CJK Extension A, Compatibility Ideographs, and the Supplementary Ideographic Plane alongside CJK Unified, so extended and astral Han ideographs resolve to 'ja'. (No default dictionary is keyed on 'ja', so this only affects hosts that register one; the en/ko defaults are unchanged.)
Document that createWordlistDictionary's case-insensitive matching assumes length-preserving case folding (true for Latin, Hangul, kana, Han) and to use caseSensitive for scripts where toLowerCase() changes length, e.g. Turkish 'İ'. Full Unicode-correct casing is avoided on purpose: per-character folding would reintroduce context-sensitive cases such as Greek final sigma.
Add unit tests: detectLanguage cases for Extension A, a supplementary-plane ideograph (bare and trailing after ASCII), and an astral emoji; plus a dedicated Korean (non-ASCII) wordlist suite covering prefix completion, earliest-listed priority among shared stems, minPrefixLength gating, exact words with no suffix, and unmatched prefixes.
https://claude.ai/code/session_01LSyqnSFhrv8SstFAkXG5GB
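The surrogate problem and the length-preserving assumption described in this commit can be illustrated with a short TypeScript sketch. This is not the code from the PR: `isHanIdeograph`, `lastCodePointByUnitIndex`, `lastCodePoint`, and `isLengthPreservingFold` are hypothetical helpers, and the block ranges are the standard Unicode ranges for the blocks the commit message names.

```typescript
// Hypothetical helper: Unicode ranges for the blocks named in the commit.
function isHanIdeograph(cp: number): boolean {
  return (
    (cp >= 0x4e00 && cp <= 0x9fff) || // CJK Unified Ideographs
    (cp >= 0x3400 && cp <= 0x4dbf) || // CJK Extension A
    (cp >= 0xf900 && cp <= 0xfaff) || // CJK Compatibility Ideographs
    (cp >= 0x20000 && cp <= 0x2ffff) // Supplementary Ideographic Plane
  );
}

// Problematic pattern: the index counts UTF-16 code units, so the last
// unit of an astral character is its lone low surrogate.
function lastCodePointByUnitIndex(text: string): number | undefined {
  return text.codePointAt(text.length - 1);
}

// Array.from splits a string by code points, so surrogate pairs stay whole.
function lastCodePoint(text: string): number | undefined {
  const chars = Array.from(text);
  const last = chars[chars.length - 1];
  return last === undefined ? undefined : last.codePointAt(0);
}

// Checks the assumption behind case-insensitive matching: folding must not
// change the string length.
function isLengthPreservingFold(word: string): boolean {
  return word.toLowerCase().length === word.length;
}

const sample = 'a\u{20000}'; // ASCII followed by a supplementary-plane ideograph
console.log(lastCodePointByUnitIndex(sample)?.toString(16)); // dc00
console.log(lastCodePoint(sample)?.toString(16)); // 20000
console.log(isLengthPreservingFold('\u0130stanbul')); // false
```

U+0130 ('İ') lowercases to 'i' followed by a combining dot above (U+0307), which is why the check fails for that word and why caseSensitive is the suggested route for such scripts.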
Review from Sherry's Navi (automated assistant)
Summary
This PR fixes a correctness bug in …
What I verified
1. Correctness of the new algorithm:
2. Edge cases handled:
3. detectLanguage supplementary-plane fix:
4. Test coverage:
5. Performance consideration:
6. CI status: All checks green (30+ jobs passing, including all e2e matrix combinations). Only collab-mac webkit is still pending.
Assessment
✅ Safe to approve. This is a well-motivated bug fix with thorough test coverage, a correct algorithm implementation, and clean documentation. The Turkish İ limitation is explicitly documented with a workaround (…
Description
createWordlistDictionary built a trie whose query walked it in pre-order, which returns the shallowest (shortest) terminal among the words sharing a prefix rather than the earliest-listed one. That only matches the intended "first-listed wins" semantics when the wordlist is in lexicographic order. The shipped English list is ordered by frequency, so 105 common word pairs (398 prefixes) surfaced a rarer completion than the priority order intends, e.g. "prof" -> "profession" instead of "professional", "cons" -> "construct" instead of "construction". The Korean list is sorted, so it was unaffected.
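To make the mismatch concrete, here is a minimal TypeScript sketch. It is a simplified model, not the removed implementation: `buildTrie`, `preOrderCompletion`, and `firstListedCompletion` are hypothetical names, and the trie is assumed to keep children in insertion order and to ignore casing and minPrefixLength.

```typescript
interface TrieNode {
  children: Map<string, TrieNode>;
  word: string | null; // set when a listed word ends at this node
}

function newNode(): TrieNode {
  return {children: new Map(), word: null};
}

function buildTrie(words: readonly string[]): TrieNode {
  const root = newNode();
  for (const word of words) {
    let node = root;
    for (let i = 0; i < word.length; i++) {
      let next = node.children.get(word[i]);
      if (next === undefined) {
        next = newNode();
        node.children.set(word[i], next);
      }
      node = next;
    }
    if (node.word === null) node.word = word;
  }
  return root;
}

// Pre-order walk: a node is checked before its children, so a word that is
// a prefix of another word is always found first.
function firstTerminal(node: TrieNode): string | null {
  if (node.word !== null) return node.word;
  for (const child of Array.from(node.children.values())) {
    const found = firstTerminal(child);
    if (found !== null) return found;
  }
  return null;
}

function preOrderCompletion(root: TrieNode, prefix: string): string | null {
  let node = root;
  for (let i = 0; i < prefix.length; i++) {
    const next = node.children.get(prefix[i]);
    if (next === undefined) return null;
    node = next;
  }
  // Only descendants count, so the completion is longer than the prefix.
  for (const child of Array.from(node.children.values())) {
    const found = firstTerminal(child);
    if (found !== null) return found;
  }
  return null;
}

// Intended semantics: the earliest-listed word longer than the prefix wins.
function firstListedCompletion(words: readonly string[], prefix: string): string | null {
  for (const word of words) {
    if (word.length > prefix.length && word.startsWith(prefix)) return word;
  }
  return null;
}

const byFrequency = ['professional', 'profession'];
console.log(preOrderCompletion(buildTrie(byFrequency), 'prof')); // profession
console.log(firstListedCompletion(byFrequency, 'prof')); // professional
```

With the same two words in lexicographic order (`profession` first), both functions return `profession`, which is why a sorted list hides the problem.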
Replace the trie with a compact prefix index: `order` lists the word indices sorted by case-folded text, so words sharing a prefix form one contiguous block. query() binary-searches for the block and scans it for the earliest-listed word longer than the prefix, which is exactly the word the original linear scan returned (verified to match across every queryable prefix of both shipped dictionaries). The only retained overhead is one Uint32 per word (~7x less than a prefix->word map on the 25k-word Korean list); folded text is recomputed on the fly and the worst-case block scan is ~40 entries.
Add unit tests covering build and query: priority ordering for nested and non-nested matches, exact-prefix handling, multi-byte (Hangul) words, casing preservation, minPrefixLength gating, duplicates, and empty lists.
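The prefix index described above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the code in the PR: `PrefixIndex`, `buildPrefixIndex`, and `queryPrefixIndex` are hypothetical names, the query is assumed to return the whole word (or null), and minPrefixLength gating and the caseSensitive option are left out.

```typescript
// Hypothetical shape; the only retained overhead is `order`.
interface PrefixIndex {
  readonly words: readonly string[];
  readonly order: Uint32Array; // word indices sorted by case-folded text
}

// Assumes case folding preserves string length.
const fold = (text: string): string => text.toLowerCase();

function buildPrefixIndex(words: readonly string[]): PrefixIndex {
  const folded = words.map(fold); // temporary, discarded after sorting
  const order = new Uint32Array(words.length);
  for (let i = 0; i < order.length; i++) {
    order[i] = i;
  }
  // Code-unit comparison keeps words that share a prefix adjacent;
  // ties fall back to list position so the result is deterministic.
  order.sort((a, b) => {
    if (folded[a] < folded[b]) return -1;
    if (folded[a] > folded[b]) return 1;
    return a - b;
  });
  return {words, order};
}

function queryPrefixIndex(index: PrefixIndex, prefix: string): string | null {
  const {words, order} = index;
  const needle = fold(prefix);
  // Lower bound: first sorted position whose folded word is >= needle.
  let lo = 0;
  let hi = order.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (fold(words[order[mid]]) < needle) {
      lo = mid + 1;
    } else {
      hi = mid;
    }
  }
  // Scan the contiguous block of matches for the smallest list index.
  let best = -1;
  for (let pos = lo; pos < order.length; pos++) {
    const wordIndex = order[pos];
    const candidate = fold(words[wordIndex]);
    if (!candidate.startsWith(needle)) break;
    if (candidate.length > needle.length && (best === -1 || wordIndex < best)) {
      best = wordIndex;
    }
  }
  return best === -1 ? null : words[best];
}

const index = buildPrefixIndex(['professional', 'profession', 'Apple']);
console.log(queryPrefixIndex(index, 'prof')); // professional
console.log(queryPrefixIndex(index, 'APP')); // Apple
```

Folding each candidate during the search trades a little CPU per query for not storing a second copy of every word, which matches the memory argument made in the description.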
Follow-up to #8574
Test plan
New unit tests