HDDS-15341. EC write can fail with ArrayIndexOutOfBoundsException due to CoderUtil emptyChunk resize race #10324
Open
smengcl wants to merge 3 commits into
…ion due to CoderUtil emptyChunk resize race
Generated-by: Codex (GPT-5.5)
adoroszlai (Contributor) reviewed on May 21, 2026 and left a comment:
Thanks @smengcl for the patch.
adoroszlai (Contributor) reviewed on May 21, 2026 and left a comment:
Thanks @smengcl for updating the patch, LGTM.
BTW, you might want to report/fix this in Hadoop, too.
smengcl (Contributor, Author) replied:
Thanks @adoroszlai. Good point. Let me check on Hadoop as well.
What changes were proposed in this pull request?
Fix a race in CoderUtil.getEmptyChunk() that can cause EC writes to fail with ArrayIndexOutOfBoundsException during parity encoding.
Problem
This can be hit when multiple EC key output streams in the same client JVM use the Java raw EC encoder concurrently with different encode/reset lengths. Each ECKeyOutputStream has its own encoder, but all Java encoders share the static CoderUtil.emptyChunk cache. If native ISA-L is unavailable or not selected, the Java RSRawEncoder clears parity output buffers through CoderUtil.resetOutputBuffers(). Under concurrent close/flush paths, especially with partial final stripes of different sizes, one stream can grow the shared zero buffer for a larger encode while another, smaller encode races and shrinks it, causing the larger encode's later System.arraycopy() to throw ArrayIndexOutOfBoundsException.
This issue is avoided if the native library (ISA-L) is in use. It can only be hit when the fallback built-in Java codec is being used, where you may see messages like this printed on the client:
CoderUtil.resetBuffer(byte[] buffer, int offset, int len) gets a shared zero-filled buffer from getEmptyChunk(len) and then calls System.arraycopy(empty, 0, buffer, offset, len).
The old getEmptyChunk() implementation checked emptyChunk.length before entering the synchronized block, unconditionally replaced the shared static buffer inside the lock, and returned the shared static field after leaving the lock. This allowed a smaller concurrent caller to shrink the shared cached buffer after a larger caller had grown it.
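To make the old behavior concrete, here is a minimal single-threaded sketch that replays it step by step. It is not the actual CoderUtil source: the class name RacyEmptyChunkSketch and its helper methods are hypothetical, and the old method is deliberately split into its three steps (unlocked check, unconditional replace under the lock, read of the shared field) so that one thread can play both callers in a fixed order.

```java
// Hypothetical sketch, NOT the actual CoderUtil source. The old logic is
// split into three steps so a single thread can replay the interleaving.
final class RacyEmptyChunkSketch {
  private static byte[] emptyChunk = new byte[4096];

  // Step 1: length check performed outside the lock.
  static boolean looksTooSmall(int len) {
    return emptyChunk.length < len;
  }

  // Step 2: unconditional replacement inside the lock (no re-check).
  static void replaceUnderLock(int len) {
    synchronized (RacyEmptyChunkSketch.class) {
      emptyChunk = new byte[len];
    }
  }

  // Step 3: return whatever the shared field holds after leaving the lock.
  static byte[] readSharedField() {
    return emptyChunk;
  }

  // Replays the small/large interleaving; returns what the large caller gets.
  static byte[] replayInterleaving() {
    emptyChunk = new byte[4096];
    boolean smallNeeds = looksTooSmall(4097); // small caller checks, then pauses
    boolean largeNeeds = looksTooSmall(8194); // large caller checks
    if (largeNeeds) {
      replaceUnderLock(8194);                 // large caller grows the cache
    }
    if (smallNeeds) {
      replaceUnderLock(4097);                 // small caller shrinks it again
    }
    return readSharedField();                 // large caller's final read
  }

  public static void main(String[] args) {
    byte[] src = replayInterleaving();
    System.out.println("large caller received length " + src.length);
    try {
      // The large caller still copies 8194 bytes from the shrunken source.
      System.arraycopy(src, 0, new byte[8194], 0, 8194);
    } catch (IndexOutOfBoundsException e) {
      System.out.println("copy failed: " + e.getClass().getSimpleName());
    }
  }
}
```

In a real run the two callers are separate threads and the ordering depends on scheduling, which is why the failure is intermittent.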
An interleaving that repros the issue:
This is a TOCTOU-style race on the shared emptyChunk cache.
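One common way to close a check-then-act race like this is to read the shared field once into a local, re-check under the lock, only ever grow the cache, and return the local reference. The sketch below follows the steps shown in the "After this fix" sequence in this description, but it is an illustration under assumptions and not the patch itself: the class name SafeEmptyChunkSketch, the hammer helper, and the use of volatile on the field are choices made for this example.

```java
// Hypothetical sketch of a grow-only, re-check-under-lock cache.
// NOT the actual patch; names and the volatile modifier are assumptions.
final class SafeEmptyChunkSketch {
  private static volatile byte[] emptyChunk = new byte[4096];

  static byte[] getEmptyChunk(int len) {
    byte[] chunk = emptyChunk;            // read the shared field once
    if (chunk.length >= len) {
      return chunk;                       // fast path: already large enough
    }
    synchronized (SafeEmptyChunkSketch.class) {
      chunk = emptyChunk;                 // re-read under the lock
      if (chunk.length < len) {
        chunk = new byte[len];
        emptyChunk = chunk;               // the cache only ever grows
      }
    }
    return chunk;                         // return the local, never the field
  }

  static byte[] resetBuffer(byte[] buffer, int offset, int len) {
    byte[] empty = getEmptyChunk(len);
    System.arraycopy(empty, 0, buffer, offset, len);
    return buffer;
  }

  // Repeatedly zeroes a buffer of the given length.
  private static void hammer(int len) {
    byte[] buffer = new byte[len];
    for (int i = 0; i < 100_000; i++) {
      resetBuffer(buffer, 0, len);
    }
  }

  public static void main(String[] args) throws InterruptedException {
    Thread small = new Thread(() -> hammer(4097));
    Thread large = new Thread(() -> hammer(8194));
    small.start();
    large.start();
    small.join();
    large.join();
    System.out.println("cached length is now " + getEmptyChunk(1).length);
  }
}
```

Because each caller returns the array it validated locally, a later write to the shared field by another thread cannot hand it a buffer shorter than the length it asked for.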
With buggy code
sequenceDiagram
    participant S as Small caller<br/>getEmptyChunk(4097)
    participant L as Large caller<br/>getEmptyChunk(8194)
    participant C as static emptyChunk
    Note over C: initial length = 4096
    S->>C: read length 4096 < 4097
    Note over S: pauses before synchronized block
    L->>C: read length 4096 < 8194
    L->>C: synchronized: emptyChunk = byte[8194]
    Note over L: pauses before final return emptyChunk
    S->>C: synchronized: emptyChunk = byte[4097]
    S-->>S: returns byte[4097]
    L->>C: final return reads static emptyChunk
    C-->>L: returns byte[4097]
    L->>L: resetBuffer(..., len=8194)
    L->>L: System.arraycopy(src byte[4097], len 8194)
    Note over L: ArrayIndexOutOfBoundsException
After this fix
sequenceDiagram
    participant S as Small caller<br/>getEmptyChunk(4097)
    participant L as Large caller<br/>resetBuffer(..., len=8194)
    participant C as static emptyChunk
    Note over C: initial length = 4096
    S->>C: chunk = emptyChunk<br/>length 4096 < 4097
    Note over S: pauses before synchronized block
    L->>C: getEmptyChunk(8194)<br/>chunk length 4096 < 8194
    L->>C: synchronized: re-read chunk
    L->>C: emptyChunk = byte[8194]
    C-->>L: return local chunk byte[8194]
    S->>C: synchronized: re-read chunk
    C-->>S: sees byte[8194]
    S-->>S: return byte[8194]
    L->>L: System.arraycopy(empty byte[8194], 0,<br/>buffer, offset, len 8194)
    Note over L: succeeds because source length >= len
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-15341
How was this patch tested?