Fix POSIX character classes for non-ASCII letters in RE (#4841) #4843
shai-almog merged 2 commits into master
RECharacter.getType() returned UNASSIGNED for any char >= 128, so [[:alpha:]], [[:alnum:]], [[:lower:]], and [[:upper:]] silently failed to match non-ASCII letters. As reported in #4841, the regex "test:\s*([[:alpha:]][[:alnum:]]*)" did not match "test: ç123", where the identifier begins with a non-ASCII letter.

Delegate to java.lang.Character.getType(c) for c >= 128 in the non-RE_UNICODE branch. The RECharacter constants (UPPERCASE_LETTER=1, LOWERCASE_LETTER=2, ...) are the Unicode general-category numeric codes and match Character.getType()'s return values exactly, so a byte cast is safe. The RE_UNICODE preprocessor branch keeps its existing table-based lookup with a UNASSIGNED fallthrough.

Add five tests covering Latin-with-cedilla, Greek, Cyrillic, CJK ideographs, vulgar fractions, and currency symbols, including a regression test for the exact failing case from the issue. Tests use \uXXXX escapes to keep sources ASCII-only (CI javac uses the platform default encoding).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
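The delegation described in this commit message can be sketched roughly as follows. This is an illustration, not the actual RECharacter source: the class name, the getTypeBefore/getTypeAfter split, and the simplified ASCII path are assumptions made for the example. Only the constant values and the call to Character.getType come from the description above.

```java
// Illustrative sketch only; class and method names are hypothetical,
// not the actual RECharacter source.
final class GetTypeDelegationSketch {
    // Unicode general-category codes, as quoted in the commit message.
    static final byte UNASSIGNED = 0;
    static final byte UPPERCASE_LETTER = 1;
    static final byte LOWERCASE_LETTER = 2;

    // Pre-fix behaviour as described: every char >= 128 is UNASSIGNED.
    // The ASCII path is simplified here to a Character.getType call.
    static byte getTypeBefore(char c) {
        return c >= 128 ? UNASSIGNED : (byte) Character.getType(c);
    }

    // Post-fix behaviour: delegate to the JDK for chars >= 128 as well.
    // Character.getType returns small non-negative ints, so the cast is lossless.
    static byte getTypeAfter(char c) {
        return (byte) Character.getType(c);
    }

    // Simplified stand-in for an [[:alpha:]] check on cased letters.
    static boolean isCasedLetter(byte type) {
        return type == UPPERCASE_LETTER || type == LOWERCASE_LETTER;
    }

    public static void main(String[] args) {
        // c-cedilla, Greek capital omega, Cyrillic zhe, CJK ideograph (ASCII-only source).
        char[] samples = { '\u00e7', '\u03a9', '\u0436', '\u4e2d' };
        for (char c : samples) {
            System.out.println("U+" + Integer.toHexString(c)
                    + " before=" + getTypeBefore(c)
                    + " after=" + getTypeAfter(c));
        }
    }
}
```

On a full JDK, the c-cedilla sample moves from 0 (UNASSIGNED) to 2 (LOWERCASE_LETTER), which is what lets [[:alpha:]] accept it.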
✅ Continuous Quality Report (generated automatically by the PR CI workflow)
Compared 86 screenshots: 86 matched.
✅ Native Android screenshot tests passed.
Compared 42 screenshots: 42 matched.
…stub

The CI Ant build sets -bootclasspath to Ports/CLDC11/dist/CLDC11.jar, whose java.lang.Character stub does not expose getType, isLetter, or isLetterOrDigit. The previous fix used Character.getType(c), which compiled fine under the maven build (full JDK rt.jar) but fails the Ant build with "cannot find symbol: method getType(char)".

Compose the same effect from the methods that the CLDC11 stub does expose: isLowerCase, isUpperCase, isDigit, isSpaceChar. This covers cased letters in Latin (with diacritics), Greek, and Cyrillic, plus decimal digits and space separators -- enough to fix the reported case from #4841 ("test: c-cedilla 123" matching "test:\\s*([[:alpha:]][[:alnum:]]*)").

Limitation: characters whose Unicode general category is OTHER_LETTER (CJK ideographs, Hebrew, Arabic, Devanagari, ...), TITLECASE_LETTER, MODIFIER_LETTER, or LETTER_NUMBER cannot be distinguished from UNASSIGNED with the CLDC11 API surface and remain unmatched by [[:alpha:]] / [[:alnum:]]. Lifting that limitation requires either the RE_UNICODE preprocessor branch or extending the CLDC11 stub -- both out of scope for this fix. Tests document the limitation by asserting only on cased scripts.

Verified: javac -bootclasspath CLDC11.jar -source 1.5 -target 1.5 compiles RECharacter and RE cleanly; mvn test from core-unittests runs all 10 RETest tests, including the regression for the exact failing input from the issue.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
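A minimal sketch of the composition this commit describes, assuming the four predicates are checked in the order listed. The class name, the isAlpha helper, and the exact ordering are hypothetical and not taken from the merged code. The sketch also applies the predicates to every char, whereas the fix only needs them for c >= 128.

```java
// Illustrative sketch only; names and predicate ordering are assumptions,
// not the merged RECharacter implementation.
final class Cldc11FallbackSketch {
    // Unicode general-category codes used as return values.
    static final byte UNASSIGNED = 0;
    static final byte UPPERCASE_LETTER = 1;
    static final byte LOWERCASE_LETTER = 2;
    static final byte DECIMAL_DIGIT_NUMBER = 9;
    static final byte SPACE_SEPARATOR = 12;

    // Approximates a general category using only the four predicates
    // the commit message says the CLDC11 stub exposes.
    static byte getType(char c) {
        if (Character.isLowerCase(c)) return LOWERCASE_LETTER;
        if (Character.isUpperCase(c)) return UPPERCASE_LETTER;
        if (Character.isDigit(c)) return DECIMAL_DIGIT_NUMBER;
        // On a full JDK, isSpaceChar also accepts line and paragraph
        // separators, so this mapping is approximate.
        if (Character.isSpaceChar(c)) return SPACE_SEPARATOR;
        // OTHER_LETTER, TITLECASE_LETTER, MODIFIER_LETTER, LETTER_NUMBER, ...
        // are indistinguishable from UNASSIGNED with these predicates.
        return UNASSIGNED;
    }

    // Simplified stand-in for an [[:alpha:]] check.
    static boolean isAlpha(char c) {
        byte t = getType(c);
        return t == UPPERCASE_LETTER || t == LOWERCASE_LETTER;
    }

    public static void main(String[] args) {
        // c-cedilla, Greek capital omega, Cyrillic zhe, CJK ideograph (ASCII-only source).
        char[] samples = { '\u00e7', '\u03a9', '\u0436', '\u4e2d' };
        for (char c : samples) {
            System.out.println("U+" + Integer.toHexString(c)
                    + " type=" + getType(c) + " alpha=" + isAlpha(c));
        }
    }
}
```

The CJK sample falls through to UNASSIGNED, which mirrors the limitation stated above: uncased scripts stay unmatched by [[:alpha:]] under this approach.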
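Pushed a follow-up commit. The fix now uses only methods the CLDC11 stub does expose: isLowerCase, isUpperCase, isDigit, and isSpaceChar. Known limitation: characters in the OTHER_LETTER, TITLECASE_LETTER, MODIFIER_LETTER, or LETTER_NUMBER categories remain unmatched by [[:alpha:]] / [[:alnum:]]. Verified locally: javac against the CLDC11.jar bootclasspath compiles RECharacter and RE cleanly, and mvn test from core-unittests runs all 10 RETest tests.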
Fixes #4841.
Summary
- RECharacter.getType() returned UNASSIGNED for every char >= 128, so [[:alpha:]], [[:alnum:]], [[:lower:]], and [[:upper:]] silently failed on non-ASCII letters. The reported regex test:\s*([[:alpha:]][[:alnum:]]*) did not match "test: ç123" because the leading ç (c-cedilla) was treated as unassigned.
- Delegate to java.lang.Character.getType(c) for c >= 128 in the non-RE_UNICODE branch. The RECharacter constants (UPPERCASE_LETTER=1, LOWERCASE_LETTER=2, ...) are the Unicode general-category numeric codes and match Character.getType()'s return values exactly, so a byte cast is safe. The RE_UNICODE preprocessor branch keeps its existing table-based lookup with a UNASSIGNED fallthrough.
- New tests keep their sources ASCII-only (CI javac uses the platform default encoding) by using \uXXXX escapes.

Test plan
- mvn -Dtest=RETest test from maven/core-unittests: 10/10 pass (5 pre-existing + 5 new).
- Reverting the RECharacter change makes 4 of the 5 new tests fail with the expected non-ASCII matching errors; the 5th (testPosixDigitIsAsciiOnlyForOtherNumbers) documents that [[:digit:]] remains decimal-digit-only, which is true in both states.
- Pre-existing tests (testNestedPosixAlphaCharacterClassSupport, testLegacyPosixAlphaCharacterClassSupport, testPosixClassesAndEscapes, etc.) continue to pass, with no behavioral change for ASCII input.

🤖 Generated with Claude Code