Fix POSIX character classes for non-ASCII letters in RE (#4841) #4843

Merged
shai-almog merged 2 commits into master from fix-4841-posix-non-latin-letters
May 1, 2026

@liannacasper
Collaborator

Fixes #4841.

Summary

  • RECharacter.getType() returned UNASSIGNED for every char >= 128, so [[:alpha:]], [[:alnum:]], [[:lower:]], and [[:upper:]] silently failed to match non-ASCII letters. The reported regex test:\s*([[:alpha:]][[:alnum:]]*) did not match "test: ç123" because the leading ç (c-cedilla, U+00E7) was classified as UNASSIGNED.
  • Delegate to java.lang.Character.getType(c) for c >= 128 in the non-RE_UNICODE branch. The RECharacter constants (UPPERCASE_LETTER=1, LOWERCASE_LETTER=2, ...) are the Unicode general-category numeric codes and match Character.getType()'s return values exactly, so a byte cast is safe. The RE_UNICODE preprocessor branch keeps its existing table-based lookup with a UNASSIGNED fallthrough.
  • Add five tests covering Latin-with-cedilla, Greek, Cyrillic, CJK ideographs, vulgar fractions, and currency symbols. One test is a regression for the exact failing case from the issue and asserts both the match and the captured group. Sources stay ASCII-only (CI javac uses the platform default encoding) by using \uXXXX escapes.
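
A minimal sketch of the delegation described in the second bullet, for readers who want to see the before/after behavior. The class and method names are hypothetical and this is not the RECharacter source; only the constant values (UNASSIGNED=0, UPPERCASE_LETTER=1, LOWERCASE_LETTER=2) and the call to Character.getType come from the description above. It needs a full JDK, since the follow-up commit in this PR notes that the CLDC11 stub lacks Character.getType.

```java
// Hypothetical stand-in for the non-RE_UNICODE branch of RECharacter.getType().
final class GetTypeDelegationSketch {
    // Unicode general-category codes quoted in the summary; they equal the
    // corresponding java.lang.Character constants.
    static final byte UNASSIGNED = 0;
    static final byte UPPERCASE_LETTER = 1;
    static final byte LOWERCASE_LETTER = 2;

    // Behavior before the fix: every char outside ASCII is reported as UNASSIGNED.
    static byte getTypeBefore(char c) {
        if (c >= 128) {
            return UNASSIGNED;
        }
        // Assumption: Character.getType stands in for the existing ASCII path.
        return (byte) Character.getType(c);
    }

    // First version of the fix: delegate to the JDK. All category codes are
    // small non-negative ints, so narrowing to byte loses nothing.
    static byte getTypeAfter(char c) {
        return (byte) Character.getType(c);
    }

    public static void main(String[] args) {
        char cCedilla = '\u00E7'; // LATIN SMALL LETTER C WITH CEDILLA
        System.out.println(getTypeBefore(cCedilla)); // prints 0 (UNASSIGNED)
        System.out.println(getTypeAfter(cCedilla));  // prints 2 (LOWERCASE_LETTER)
    }
}
```

With the "before" variant, a class test such as "is this a letter?" can never succeed for ç, which is why [[:alpha:]] rejected it.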

Test plan

  • mvn -Dtest=RETest test from maven/core-unittests — 10/10 pass (5 pre-existing + 5 new).
  • Confirmed regression: temporarily reverting only the RECharacter change makes 4 of the 5 new tests fail with the expected non-Latin matching errors; the 5th (testPosixDigitIsAsciiOnlyForOtherNumbers) documents that [[:digit:]] remains decimal-digit-only, which is true in both states.
  • Pre-existing tests (testNestedPosixAlphaCharacterClassSupport, testLegacyPosixAlphaCharacterClassSupport, testPosixClassesAndEscapes, etc.) continue to pass — no behavioral change for ASCII input.
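
For illustration only, the shape of the regression check can be sketched with java.util.regex as a stand-in. This is not Codename One's RE class and not the test added in this PR: \p{Alpha} and \p{Alnum} substitute for [[:alpha:]] and [[:alnum:]], and the class name is hypothetical. The JDK's default ASCII-only POSIX classes happen to show the same symptom as the bug, and Pattern.UNICODE_CHARACTER_CLASS shows the intended result.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical analog of the regression test, using the JDK regex engine.
final class PosixAlphaAnalogSketch {
    // JDK spelling of the issue's pattern: \p{Alpha} and \p{Alnum} play the
    // role of [[:alpha:]] and [[:alnum:]].
    static final String REGEX = "test:\\s*(\\p{Alpha}\\p{Alnum}*)";

    // Returns capture group 1, or null when the input does not match.
    static String capture(String input, int flags) {
        Matcher m = Pattern.compile(REGEX, flags).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The escape keeps this source file ASCII-only, as the PR's tests do.
        String input = "test: \u00E7123";

        // ASCII-only classes: the identifier starting with c-cedilla is rejected.
        System.out.println(capture(input, 0) == null); // prints true

        // Unicode-aware classes: both the match and the captured group succeed.
        String group = capture(input, Pattern.UNICODE_CHARACTER_CLASS);
        System.out.println("\u00E7123".equals(group)); // prints true
    }
}
```

Asserting on the captured group as well as on the match mirrors what the summary says the regression test does.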

🤖 Generated with Claude Code

RECharacter.getType() returned UNASSIGNED for any char >= 128, so
[[:alpha:]], [[:alnum:]], [[:lower:]], and [[:upper:]] silently failed
to match non-ASCII letters. As reported in #4841, the regex
"test:\s*([[:alpha:]][[:alnum:]]*)" did not match "test: \u00E7123",
where the identifier begins with a non-ASCII letter (c-cedilla).

Delegate to java.lang.Character.getType(c) for c >= 128 in the
non-RE_UNICODE branch. The RECharacter constants (UPPERCASE_LETTER=1,
LOWERCASE_LETTER=2, ...) are the Unicode general-category numeric codes
and match Character.getType()'s return values exactly, so a byte cast
is safe. The RE_UNICODE preprocessor branch keeps its existing
table-based lookup with a UNASSIGNED fallthrough.

Add five tests covering Latin-with-cedilla, Greek, Cyrillic, CJK
ideographs, vulgar fractions, and currency symbols, including a
regression test for the exact failing case from the issue. Tests use
\uXXXX escapes to keep sources ASCII-only (CI javac uses the platform
default encoding).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Contributor

github-actions Bot commented May 1, 2026

✅ Continuous Quality Report

Test & Coverage

Static Analysis

  • SpotBugs [Report archive]
    • ByteCodeTranslator: 0 findings (no issues)
    • android: 0 findings (no issues)
    • codenameone-maven-plugin: 0 findings (no issues)
    • core-unittests: 0 findings (no issues)
    • ios: 0 findings (no issues)
  • PMD: 0 findings (no issues) [Report archive]
  • Checkstyle: 0 findings (no issues) [Report archive]

Generated automatically by the PR CI workflow.

@shai-almog
Collaborator

shai-almog commented May 1, 2026

Compared 86 screenshots: 86 matched.

Native Android coverage

  • 📊 Line coverage: 9.75% (5291/54243 lines covered) [HTML preview] (artifact android-coverage-report, jacocoAndroidReport/html/index.html)
    • Other counters: instruction 7.67% (26003/339142), branch 3.48% (1132/32522), complexity 4.52% (1410/31163), method 7.92% (1153/14567), class 12.97% (253/1950)
    • Lowest covered classes
      • kotlin.collections.kotlin.collections.ArraysKt___ArraysKt – 0.00% (0/6327 lines covered)
      • kotlin.collections.unsigned.kotlin.collections.unsigned.UArraysKt___UArraysKt – 0.00% (0/2384 lines covered)
      • org.jacoco.agent.rt.internal_b6258fc.asm.org.jacoco.agent.rt.internal_b6258fc.asm.ClassReader – 0.00% (0/1519 lines covered)
      • kotlin.collections.kotlin.collections.CollectionsKt___CollectionsKt – 0.00% (0/1148 lines covered)
      • org.jacoco.agent.rt.internal_b6258fc.asm.org.jacoco.agent.rt.internal_b6258fc.asm.MethodWriter – 0.00% (0/923 lines covered)
      • kotlin.sequences.kotlin.sequences.SequencesKt___SequencesKt – 0.00% (0/730 lines covered)
      • kotlin.text.kotlin.text.StringsKt___StringsKt – 0.00% (0/623 lines covered)
      • org.jacoco.agent.rt.internal_b6258fc.asm.org.jacoco.agent.rt.internal_b6258fc.asm.Frame – 0.00% (0/564 lines covered)
      • kotlin.collections.kotlin.collections.ArraysKt___ArraysJvmKt – 0.00% (0/495 lines covered)
      • kotlinx.coroutines.kotlinx.coroutines.JobSupport – 0.00% (0/423 lines covered)

✅ Native Android screenshot tests passed.

Benchmark Results

Detailed Performance Metrics

| Metric | Value |
| --- | --- |
| Base64 payload size | 8192 bytes |
| Base64 benchmark iterations | 6000 |
| Base64 native encode | 1115.000 ms |
| Base64 CN1 encode | 165.000 ms |
| Base64 encode ratio (CN1/native) | 0.148x (85.2% faster) |
| Base64 native decode | 734.000 ms |
| Base64 CN1 decode | 243.000 ms |
| Base64 decode ratio (CN1/native) | 0.331x (66.9% faster) |
| Image encode benchmark status | skipped (SIMD unsupported) |
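
The ratio rows follow from the raw timings: the ratio is CN1 time divided by native time, and "faster" is one minus that ratio. A short sketch of the arithmetic, assuming that is how the CI report derives them (the class and method names are hypothetical):

```java
import java.util.Locale;

// Hypothetical helper reproducing the ratio rows from the raw timings.
final class BenchmarkRatioSketch {
    // CN1 duration as a fraction of the native duration.
    static double ratio(double cn1Ms, double nativeMs) {
        return cn1Ms / nativeMs;
    }

    // Formats a row the way the report prints it.
    static String describe(double cn1Ms, double nativeMs) {
        double r = ratio(cn1Ms, nativeMs);
        return String.format(Locale.ROOT, "%.3fx (%.1f%% faster)", r, (1.0 - r) * 100.0);
    }

    public static void main(String[] args) {
        System.out.println(describe(165.0, 1115.0)); // prints 0.148x (85.2% faster)
        System.out.println(describe(243.0, 734.0));  // prints 0.331x (66.9% faster)
    }
}
```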

@shai-almog
Collaborator

shai-almog commented May 1, 2026

Compared 42 screenshots: 42 matched.
✅ Native iOS screenshot tests passed.

Benchmark Results

  • VM Translation Time: 0 seconds
  • Compilation Time: 196 seconds

Build and Run Timing

| Metric | Duration |
| --- | --- |
| Simulator Boot | 77000 ms |
| Simulator Boot (Run) | 1000 ms |
| App Install | 15000 ms |
| App Launch | 9000 ms |
| Test Execution | 301000 ms |

…stub

The CI Ant build sets -bootclasspath to Ports/CLDC11/dist/CLDC11.jar,
whose java.lang.Character stub does not expose getType, isLetter, or
isLetterOrDigit. The previous fix used Character.getType(c), which
compiled fine under the maven build (full JDK rt.jar) but fails the
Ant build with "cannot find symbol: method getType(char)".

Compose the same effect from the methods that the CLDC11 stub does
expose: isLowerCase, isUpperCase, isDigit, isSpaceChar. This covers
cased letters in Latin (with diacritics), Greek, and Cyrillic, plus
decimal digits and space separators -- enough to fix the reported
case from #4841 ("test: \u00E7123", i.e. c-cedilla followed by 123,
matching "test:\s*([[:alpha:]][[:alnum:]]*)").

Limitation: characters whose Unicode general category is OTHER_LETTER
(CJK ideographs, Hebrew, Arabic, Devanagari, ...), TITLECASE_LETTER,
MODIFIER_LETTER, or LETTER_NUMBER cannot be distinguished from
UNASSIGNED with the CLDC11 API surface and remain unmatched by
[[:alpha:]] / [[:alnum:]]. Lifting that limitation requires either
the RE_UNICODE preprocessor branch or extending the CLDC11 stub --
both out of scope for this fix. Tests document the limitation by
asserting only on cased scripts.

Verified: javac -bootclasspath CLDC11.jar -source 1.5 -target 1.5
compiles RECharacter and RE cleanly; mvn test from core-unittests
runs all 10 RETest tests, including the regression for the exact
failing input from the issue.
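
A rough sketch of the composition this commit message describes, using only isLowerCase, isUpperCase, isDigit, and isSpaceChar. It is not the merged code: the class name, the order of the checks, and the constant names DECIMAL_DIGIT_NUMBER=9 and SPACE_SEPARATOR=12 (taken from the standard Unicode general-category codes) are assumptions, and the sketch applies the same checks to every char because the ASCII path is not shown here. It also maps everything isSpaceChar accepts to SPACE_SEPARATOR, which is a simplification.

```java
// Hypothetical CLDC11-friendly classification built from four Character predicates.
final class CldcSafeTypeSketch {
    static final byte UNASSIGNED = 0;
    static final byte UPPERCASE_LETTER = 1;
    static final byte LOWERCASE_LETTER = 2;
    static final byte DECIMAL_DIGIT_NUMBER = 9; // assumed name, standard category code
    static final byte SPACE_SEPARATOR = 12;     // assumed name, standard category code

    static byte getType(char c) {
        if (Character.isLowerCase(c)) {
            return LOWERCASE_LETTER;
        }
        if (Character.isUpperCase(c)) {
            return UPPERCASE_LETTER;
        }
        if (Character.isDigit(c)) {
            return DECIMAL_DIGIT_NUMBER;
        }
        if (Character.isSpaceChar(c)) {
            return SPACE_SEPARATOR; // simplification: line/paragraph separators land here too
        }
        // Uncased letters (CJK, Hebrew, Arabic, ...) cannot be told apart from
        // unassigned code points with these four predicates.
        return UNASSIGNED;
    }

    public static void main(String[] args) {
        System.out.println(getType('\u00E7')); // c-cedilla: prints 2
        System.out.println(getType('\u0391')); // Greek capital alpha: prints 1
        System.out.println(getType('\u4E2D')); // CJK ideograph: prints 0
    }
}
```

The last line of output shows the documented limitation: a CJK ideograph is OTHER_LETTER in Unicode, but this composition has no way to recognize it.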

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@liannacasper
Collaborator Author

Pushed cf8ee6a to address the CI failure: javac complained that Character.getType(char) was missing because the framework is built with -bootclasspath ../Ports/CLDC11/dist/CLDC11.jar, and the CLDC11 stub does not expose getType / isLetter / isLetterOrDigit.

The fix now uses only methods that the stub does expose (isLowerCase, isUpperCase, isDigit, isSpaceChar) and composes the same result for the cased scripts (Latin with diacritics, Greek, Cyrillic) plus decimal digits and space separators. That is enough for the reported failing case: "test: ç123" against test:\s*([[:alpha:]][[:alnum:]]*).

Known limitation: OTHER_LETTER (CJK, Hebrew, Arabic, ...), TITLECASE_LETTER, MODIFIER_LETTER, and LETTER_NUMBER cannot be distinguished from UNASSIGNED with the CLDC11 API surface, so [[:alpha:]] / [[:alnum:]] still won't match them. Lifting that requires either the existing RE_UNICODE preprocessor branch (full Unicode tables) or extending the CLDC11 stub -- both out of scope for this issue. The tests are scoped accordingly.

Verified locally:

  • javac -bootclasspath CLDC11.jar -source 1.5 -target 1.5 src/com/codename1/util/regex/*.java -- compiles clean.
  • mvn -Dtest=RETest test -- 10/10 pass (5 pre-existing + 5 new, including the exact regression from the issue).

@shai-almog shai-almog merged commit 1750d63 into master May 1, 2026
17 checks passed
Development

Successfully merging this pull request may close these issues.

RE POSIX support in class RE (regex) does not match with POSIX characters
