iOS UTF-8 codec: replace-char semantics, NEON ASCII fast-path, benchmark#4989
Merged
Conversation
Rewrite the UTF-8 decode/encode helpers used by the ParparVM String layer.
The previous decoder threw RuntimeException("Decoding Error") on malformed
input, the encoder fell through to a 1-byte-per-char stub on non-Apple
builds, and ISO-8859-2 was silently aliased to NSISOLatin1.
* Decoder: Hoehrmann DFA with JDK-compatible REPLACE semantics -- emits
one U+FFFD per maximal-subpart violation instead of throwing. Truncated
trailing sequences also emit U+FFFD. Removes the silent Latin-1 fallback
that hid encoding errors when NSString rejected input.
* Encoder: portable UTF-16 -> UTF-8 with surrogate-pair joining. The Apple
path now uses it for UTF-8 directly so NSString is no longer involved in
the common case; the POSIX/test fallback gains a real implementation in
place of the old "TODO" stub.
* NEON: __ARM_NEON-gated ASCII prefix scan (vmaxvq_u8) and u8->u16 widen
(vmovl_u8) for inputs >= 64 bytes. A standalone microbench shows ~53x
speedup over scalar DFA on ASCII-heavy payloads. The integration-level
benchmark cannot see this win because allocating a fresh char[] per call
dominates on ParparVM, but the helpers carry pull-its-weight cost on the
parser-style hot paths the SIMD work was added for.
* ISO-8859-2 now maps to NSISOLatin2StringEncoding for both decode and
encode; "UTF8", "ASCII", "LATIN1", "LATIN2" join the accepted aliases.
String.offset is now honoured when reading the encoding name (was
ignored before, latent bug for substring-derived encoding strings).
Utf8PerformanceIntegrationTest mirrors the Base64 perf pattern: builds an
ASCII payload + a mixed payload with 2/3/4-byte sequences (incl. surrogate
pair U+1F600), runs encode/decode loops on both JavaSE and ParparVM, and
asserts identical RESULT signatures. A malformed-input probe is folded into
the signature so REPLACE parity between JDK and the iOS decoder is verified
end-to-end.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor
Collaborator
Author
|
Compared 20 screenshots: 20 matched. |
Contributor
✅ ByteCodeTranslator Quality ReportTest & Coverage
Benchmark Results
Static Analysis
Generated automatically by the PR CI workflow. |
Collaborator
Author
|
Compared 110 screenshots: 110 matched. Benchmark Results
Build and Run Timing
Detailed Performance Metrics
|
Collaborator
Author
|
Compared 110 screenshots: 110 matched. Benchmark Results
Build and Run Timing
Detailed Performance Metrics
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
RuntimeException("Decoding Error")with JDK-compatible U+FFFD replacement in the ParparVM UTF-8 decoder; remove silent Latin-1 fallback and other silent-corruption paths.__ARM_NEON-gated ASCII prefix scan + u8→u16 widen for large inputs (~53× faster than scalar DFA on ASCII payloads — verified in a non-allocating microbench).ISO-8859-2silently aliased toNSISOLatin1, addUTF8/ASCII/LATIN1/LATIN2aliases, honourString.offsetwhen reading the encoding name.Utf8PerformanceIntegrationTestmirroring the Base64 perf pattern. Runs ASCII + mixed-byte payloads through JavaSE and ParparVM, asserts identical signatures, and folds a malformed-input probe into the signature so REPLACE parity is verified end-to-end.Why
new String(bytes, "UTF-8")usesCodingErrorAction.REPLACE; this PR brings ParparVM in line.ISO-8859-2returning Latin-1 bytes was a silent data-corruption bug.simdjsonand CoreFoundation use, and we have the infrastructure inIOSSimd.mto use the same pattern here.Test plan
mvn test -Dtest=Utf8PerformanceIntegrationTest(passes: JavaSE + ParparVM produce identical RESULT signatures including malformed-input probe).mvn test -Dtest=Base64PerformanceIntegrationTest(regression — still passes after the codec rewrite).Notes
char[]and ParparVM's GC bookkeeping dominates. The integration test still serves correctness + regression; the perf gain is in the helpers themselves for callers in tight parser loops.Decoding Errorthrow will instead receive a string containing U+FFFD. This matches JDK behaviour but is a behavioural diff worth flagging in release notes.🤖 Generated with Claude Code