
HDDS-15341. EC write can fail with ArrayIndexOutOfBoundsException due to CoderUtil emptyChunk resize race#10324

Open
smengcl wants to merge 3 commits into apache:master from smengcl:HDDS-15341-ec-client-race


@smengcl smengcl commented May 21, 2026

What changes were proposed in this pull request?

Fix a race in CoderUtil.getEmptyChunk() that can cause EC writes to fail with ArrayIndexOutOfBoundsException during parity encoding.

Problem

This can be hit when multiple EC key output streams in the same client JVM use the Java raw EC encoder concurrently with different encode/reset lengths. Each ECKeyOutputStream has its own encoder, but all Java encoders share the static CoderUtil.emptyChunk cache. If native ISA-L is unavailable or not selected, the Java RSRawEncoder clears parity output buffers through CoderUtil.resetOutputBuffers(). Under concurrent close/flush paths, especially with partial final stripes of different sizes, one stream can grow the shared zero buffer for a larger encode while a concurrent smaller encode shrinks it again. The larger encode's later System.arraycopy() then reads past the end of the shrunken buffer and throws ArrayIndexOutOfBoundsException.

The issue does not occur when the native library (ISA-L) is in use. It can only be hit when the fallback builtin-java codec is used, in which case the client may log a message like this:

W20260513 08:22:52.526697 4325 ErasureCodeNative.java:55] 854bfd7fdbf38f0c:f3db82230000000c] ISA-L support is not available in your platform... using builtin-java codec where applicable


CoderUtil.resetBuffer(byte[] buffer, int offset, int len) gets a shared zero-filled buffer from getEmptyChunk(len) and then calls:

System.arraycopy(empty, 0, buffer, offset, len);

The old getEmptyChunk() implementation checked emptyChunk.length before entering the synchronized block, unconditionally replaced the shared static buffer inside the lock, and returned the shared static field after leaving the lock. This allowed a smaller concurrent caller to shrink the shared cached buffer after a larger caller had grown it.
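The pattern described above can be sketched roughly as follows. This is a reconstruction from the description in this PR, not the actual Ozone source; the class name `RacyEmptyChunkCache` and the initial size of 4096 are assumptions for illustration.

```java
// Hypothetical reconstruction of the racy pattern; not the actual CoderUtil code.
final class RacyEmptyChunkCache {
  // Shared zero-filled buffer used as the source for clearing output buffers.
  private static byte[] emptyChunk = new byte[4096];

  static byte[] getEmptyChunk(int leastLength) {
    // Length is checked outside the lock, so the result can be stale.
    if (emptyChunk.length >= leastLength) {
      return emptyChunk;
    }
    synchronized (RacyEmptyChunkCache.class) {
      // No re-check here: a caller that saw a stale length replaces the
      // buffer even if another thread already installed a larger one.
      emptyChunk = new byte[leastLength];
    }
    // Re-reads the shared field, which another thread may have replaced
    // with a shorter array in the meantime.
    return emptyChunk;
  }

  static void resetBuffer(byte[] buffer, int offset, int len) {
    byte[] empty = getEmptyChunk(len);
    // Throws if 'empty' ended up shorter than 'len'.
    System.arraycopy(empty, 0, buffer, offset, len);
  }
}
```

Single-threaded use of this sketch behaves correctly; the failure needs two callers with different lengths to overlap.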

ArrayIndexOutOfBoundsException: java.lang.ArrayIndexOutOfBoundsException
	at java.lang.System.arraycopy(Native Method)
	at org.apache.ozone.erasurecode.rawcoder.CoderUtil.resetBuffer(CoderUtil.java:76)
	at org.apache.ozone.erasurecode.rawcoder.CoderUtil.resetOutputBuffers(CoderUtil.java:96)
	at org.apache.ozone.erasurecode.rawcoder.RSRawEncoder.doEncode(RSRawEncoder.java:69)
	at org.apache.ozone.erasurecode.rawcoder.RawErasureEncoder.encode(RawErasureEncoder.java:88)
	at org.apache.hadoop.ozone.client.io.ECKeyOutputStream.generateParityCells(ECKeyOutputStream.java:305)
	at org.apache.hadoop.ozone.client.io.ECKeyOutputStream.close(ECKeyOutputStream.java:475)
	at org.apache.hadoop.ozone.client.io.OzoneOutputStream.close(OzoneOutputStream.java:105)
	at org.apache.hadoop.fs.ozone.OzoneFSOutputStream.close(OzoneFSOutputStream.java:70)
	at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:77)
	at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)

An interleaving that reproduces the issue:

  1. emptyChunk starts as byte[4096].
  2. Thread A calls getEmptyChunk(4097) and blocks before entering the synchronized block.
  3. Thread B calls getEmptyChunk(8194), enters the synchronized block, and sets emptyChunk = byte[8194].
  4. Thread A resumes and unconditionally sets emptyChunk = byte[4097].
  5. Thread B returns the shared static emptyChunk, now byte[4097].
  6. System.arraycopy(..., len=8194) throws ArrayIndexOutOfBoundsException.

This is a TOCTOU-style race on the shared emptyChunk cache.
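To make the arithmetic concrete, the six steps can be replayed in order on a single thread. The sketch below is illustrative only: it models each thread's actions as plain statements on a stand-in field, and the class name `InterleavingReplay` is hypothetical.

```java
// Single-threaded replay of the six steps above; illustrative, not Ozone code.
final class InterleavingReplay {
  static byte[] emptyChunk;

  /** Returns true if the final arraycopy fails as described in step 6. */
  static boolean replay() {
    emptyChunk = new byte[4096];                      // step 1: initial cache
    boolean aNeedsResize = emptyChunk.length < 4097;  // step 2: A checks, then stalls
    boolean bNeedsResize = emptyChunk.length < 8194;  // step 3: B checks...
    if (bNeedsResize) {
      emptyChunk = new byte[8194];                    // ...and grows the cache
    }
    if (aNeedsResize) {
      emptyChunk = new byte[4097];                    // step 4: A acts on its stale check
    }
    byte[] seenByB = emptyChunk;                      // step 5: B reads the shared field
    byte[] parity = new byte[8194];
    try {
      System.arraycopy(seenByB, 0, parity, 0, 8194);  // step 6: source is only 4097 long
      return false;
    } catch (IndexOutOfBoundsException e) {
      // ArrayIndexOutOfBoundsException is a subclass of IndexOutOfBoundsException.
      return true;
    }
  }
}
```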

With buggy code

sequenceDiagram
    participant S as Small caller<br/>getEmptyChunk(4097)
    participant L as Large caller<br/>getEmptyChunk(8194)
    participant C as static emptyChunk

    Note over C: initial length = 4096

    S->>C: read length 4096 < 4097
    Note over S: pauses before synchronized block

    L->>C: read length 4096 < 8194
    L->>C: synchronized: emptyChunk = byte[8194]
    Note over L: pauses before final return emptyChunk

    S->>C: synchronized: emptyChunk = byte[4097]
    S-->>S: returns byte[4097]

    L->>C: final return reads static emptyChunk
    C-->>L: returns byte[4097]

    L->>L: resetBuffer(..., len=8194)
    L->>L: System.arraycopy(src byte[4097], len 8194)
    Note over L: ArrayIndexOutOfBoundsException

After this fix

sequenceDiagram
    participant S as Small caller<br/>getEmptyChunk(4097)
    participant L as Large caller<br/>resetBuffer(..., len=8194)
    participant C as static emptyChunk

    Note over C: initial length = 4096

    S->>C: chunk = emptyChunk<br/>length 4096 < 4097
    Note over S: pauses before synchronized block

    L->>C: getEmptyChunk(8194)<br/>chunk length 4096 < 8194
    L->>C: synchronized: re-read chunk
    L->>C: emptyChunk = byte[8194]
    C-->>L: return local chunk byte[8194]

    S->>C: synchronized: re-read chunk
    C-->>S: sees byte[8194]
    S-->>S: return byte[8194]

    L->>L: System.arraycopy(empty byte[8194], 0,<br/>buffer, offset, len 8194)
    Note over L: succeeds because source length >= len
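The flow in the diagram above corresponds roughly to the following shape. This is a minimal sketch based on the "re-read under the lock, return the local reference" behavior the diagram shows, not the exact patch. The class name `SafeEmptyChunkCache` is hypothetical, and marking the field `volatile` is an assumption of this sketch rather than something the PR text states.

```java
// Hypothetical sketch of a grow-only cache; not the exact code in this PR.
final class SafeEmptyChunkCache {
  // 'volatile' is an assumption made here so the unlocked read sees a
  // fully published array.
  private static volatile byte[] emptyChunk = new byte[4096];

  static byte[] getEmptyChunk(int leastLength) {
    byte[] chunk = emptyChunk;           // read the shared field once
    if (chunk.length >= leastLength) {
      return chunk;
    }
    synchronized (SafeEmptyChunkCache.class) {
      chunk = emptyChunk;                // re-read under the lock
      if (chunk.length < leastLength) {
        chunk = new byte[leastLength];
        emptyChunk = chunk;              // the cache only ever grows
      }
    }
    // Return the local reference, never a second read of the shared field,
    // so the result is always at least leastLength long.
    return chunk;
  }
}
```

Because the caller gets the local reference, a later replacement of the shared field by another thread cannot shorten the array it already holds.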

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-15341

How was this patch tested?

  • Added a deterministic unit test that reproduces the stale-check/shrink interleaving when run without the fix, and needs neither Byteman nor additional dependencies. The test blocks a smaller caller on the CoderUtil.class monitor, grows the cache with a larger caller, then releases the smaller caller and verifies the cache is not shrunk.
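One way such a test can be structured is sketched below, using a stand-in cache rather than CoderUtil itself. All names here (`ShrinkRaceCheck`, `Cache`, `cachedLength`) are hypothetical, and polling `Thread.State.BLOCKED` to detect that the smaller caller is parked on the monitor is an assumption of this sketch; the actual test in the patch may synchronize differently.

```java
// Hypothetical test sketch against a stand-in cache; not the test in this PR.
final class ShrinkRaceCheck {
  // Stand-in for the fixed cache (grow-only, re-check under the lock).
  static final class Cache {
    private static volatile byte[] emptyChunk = new byte[4096];

    static byte[] get(int leastLength) {
      byte[] chunk = emptyChunk;
      if (chunk.length >= leastLength) {
        return chunk;
      }
      synchronized (Cache.class) {
        chunk = emptyChunk;
        if (chunk.length < leastLength) {
          chunk = new byte[leastLength];
          emptyChunk = chunk;
        }
      }
      return chunk;
    }

    static int cachedLength() {
      return emptyChunk.length;
    }
  }

  /** Runs the interleaving and returns the cache length afterwards. */
  static int run() {
    Thread small = new Thread(() -> Cache.get(4097));
    try {
      synchronized (Cache.class) {
        small.start();
        // Wait until the smaller caller has passed its unlocked length check
        // and is parked on the class monitor held by this thread.
        while (small.getState() != Thread.State.BLOCKED) {
          Thread.sleep(1);
        }
        // Monitors are reentrant, so the larger caller can run here.
        Cache.get(8194);
      }
      // Leaving the block releases the smaller caller.
      small.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return Cache.cachedLength();
  }
}
```

With the racy variant substituted for `Cache.get`, the same sequence would leave the cache at 4097 bytes, which is the shrink the test is meant to catch.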

smengcl added 2 commits May 21, 2026 02:43
Generated-by: Codex (GPT-5.5)
@smengcl smengcl added the bug and AI-gen labels May 21, 2026

@adoroszlai adoroszlai left a comment


Thanks @smengcl for the patch.

@adoroszlai adoroszlai changed the title from "HDDS-15341. EC client write can fail with ArrayIndexOutOfBoundsException due to CoderUtil emptyChunk resize race" to "HDDS-15341. EC write can fail with ArrayIndexOutOfBoundsException due to CoderUtil emptyChunk resize race" May 21, 2026
@ivandika3 ivandika3 requested a review from xichen01 May 21, 2026 12:11

@adoroszlai adoroszlai left a comment


@peterxcli peterxcli self-requested a review May 21, 2026 17:39

smengcl commented May 21, 2026

> Thanks @smengcl for updating the patch, LGTM.
>
> BTW, you might want to report/fix this in Hadoop, too.
>
> apache/hadoop@71d216d/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/erasurecode/rawcoder/CoderUtil.java#L44-L53

Thanks @adoroszlai.

Good point. Let me check on Hadoop as well.

@smengcl smengcl marked this pull request as ready for review May 21, 2026 20:00
Copilot AI review requested due to automatic review settings May 21, 2026 20:00

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.


Labels

AI-gen bug Something isn't working


3 participants