7.0.236

@shai-almog shai-almog released this 01 May 05:32
· 320 commits to master since this release
1750d63
Fix POSIX character classes for non-ASCII letters in RE (#4841) (#4843)

* Fix POSIX character classes for non-ASCII letters in RE (#4841)

RECharacter.getType() returned UNASSIGNED for any char >= 128, so
[[:alpha:]], [[:alnum:]], [[:lower:]], and [[:upper:]] silently failed
to match non-Latin letters. As reported in #4841, the regex
"test:\s*([[:alpha:]][[:alnum:]]*)" did not match "test: ç123" when the
identifier began with a non-ASCII letter.

Delegate to java.lang.Character.getType(c) for c >= 128 in the
non-RE_UNICODE branch. The RECharacter constants (UPPERCASE_LETTER=1,
LOWERCASE_LETTER=2, ...) are the Unicode general-category numeric codes
and match Character.getType()'s return values exactly, so a byte cast
is safe. The RE_UNICODE preprocessor branch keeps its existing
table-based lookup with a UNASSIGNED fallthrough.
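
A minimal sketch of the delegation described above, assuming a full JDK where Character.getType is available. GetTypeSketch and typeOfNonAscii are hypothetical names used for illustration; this is not the RECharacter source.

```java
// Hypothetical illustration; not the real RECharacter source.
class GetTypeSketch {
    // Values quoted in the commit message; they equal the JDK's
    // Character.UPPERCASE_LETTER and Character.LOWERCASE_LETTER.
    static final byte UPPERCASE_LETTER = 1;
    static final byte LOWERCASE_LETTER = 2;

    // Assumed shape: for c >= 128, return the JDK category narrowed to a byte.
    // General-category codes are small non-negative ints, so the cast loses nothing.
    static byte typeOfNonAscii(char c) {
        return (byte) Character.getType(c);
    }
}
```

With this shape, typeOfNonAscii('\u00E7') yields LOWERCASE_LETTER and typeOfNonAscii('\u0391') yields UPPERCASE_LETTER.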

Add five tests covering Latin-with-cedilla, Greek, Cyrillic, CJK
ideographs, vulgar fractions, and currency symbols, including a
regression test for the exact failing case from the issue. Tests use
\uXXXX escapes to keep sources ASCII-only (CI javac uses the platform
default encoding).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Avoid Character.getType in RE; the framework compiles against CLDC11 stub

The CI Ant build sets -bootclasspath to Ports/CLDC11/dist/CLDC11.jar,
whose java.lang.Character stub does not expose getType, isLetter, or
isLetterOrDigit. The previous fix used Character.getType(c), which
compiled fine under the Maven build (full JDK rt.jar) but fails the
Ant build with "cannot find symbol: method getType(char)".

Compose the same effect from the methods that the CLDC11 stub does
expose: isLowerCase, isUpperCase, isDigit, isSpaceChar. This covers
cased letters in Latin (with diacritics), Greek, and Cyrillic, plus
decimal digits and space separators -- enough to fix the reported
case from #4841 ("test: c-cedilla 123" matching
"test:\s*([[:alpha:]][[:alnum:]]*)").
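
One way the composition might look, as a sketch only. CldcTypeSketch and classify are hypothetical names, the order of the checks is an assumption, and the constants are the standard Unicode general-category codes rather than values copied from RECharacter.

```java
// Hypothetical illustration; not the real RECharacter source.
class CldcTypeSketch {
    // Standard Unicode general-category codes (same values as java.lang.Character).
    static final byte UNASSIGNED = 0;
    static final byte UPPERCASE_LETTER = 1;
    static final byte LOWERCASE_LETTER = 2;
    static final byte DECIMAL_DIGIT_NUMBER = 9;
    static final byte SPACE_SEPARATOR = 12;

    // Uses only the four predicates the commit message says the CLDC11 stub exposes.
    static byte classify(char c) {
        if (Character.isLowerCase(c)) return LOWERCASE_LETTER;
        if (Character.isUpperCase(c)) return UPPERCASE_LETTER;
        if (Character.isDigit(c)) return DECIMAL_DIGIT_NUMBER;
        // On a full JDK isSpaceChar also accepts line and paragraph separators,
        // so mapping it to SPACE_SEPARATOR is an approximation.
        if (Character.isSpaceChar(c)) return SPACE_SEPARATOR;
        // Everything else is indistinguishable from UNASSIGNED with this API surface.
        return UNASSIGNED;
    }
}
```

Under these assumptions classify('\u00E7') returns LOWERCASE_LETTER, so a leading c-cedilla would satisfy [[:alpha:]], while a CJK ideograph such as '\u4E2D' still falls through to UNASSIGNED.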

Limitation: characters whose Unicode general category is OTHER_LETTER
(CJK ideographs, Hebrew, Arabic, Devanagari, ...), TITLECASE_LETTER,
MODIFIER_LETTER, or LETTER_NUMBER cannot be distinguished from
UNASSIGNED with the CLDC11 API surface and remain unmatched by
[[:alpha:]] / [[:alnum:]]. Lifting that limitation requires either
the RE_UNICODE preprocessor branch or extending the CLDC11 stub --
both out of scope for this fix. Tests document the limitation by
asserting only on cased scripts.
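
The limitation can be illustrated with a small diagnostic, sketched here under the assumption of a full JDK. LimitationSketch and both method names are hypothetical; the code is not part of the fix and would not compile against the CLDC11 stub because it calls Character.isLetter.

```java
// Hypothetical diagnostic; runs on a full JDK only, since it calls
// Character.isLetter, which the CLDC11 stub does not expose.
class LimitationSketch {
    // True when at least one of the four CLDC11-visible predicates fires.
    static boolean seenByCldcSubset(char c) {
        return Character.isLowerCase(c) || Character.isUpperCase(c)
                || Character.isDigit(c) || Character.isSpaceChar(c);
    }

    // True for letters that the four predicates cannot detect.
    static boolean isMissedLetter(char c) {
        return Character.isLetter(c) && !seenByCldcSubset(c);
    }
}
```

For example, isMissedLetter is true for the CJK ideograph '\u4E2D', the Hebrew letter '\u05D0', and the titlecase letter '\u01C5', and false for '\u00E7'.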

Verified: javac -bootclasspath CLDC11.jar -source 1.5 -target 1.5
compiles RECharacter and RE cleanly; mvn test from core-unittests
runs all 10 RETest tests, including the regression for the exact
failing input from the issue.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Shai Almog <67850168+shai-almog@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>