[SPARK-44881][COMMON] Executor stuck retrying to fetch shuffle data when `java.lang.OutOfMemoryError: unable to create native thread` occurs #42572

Closed
hgs19921112 wants to merge 5 commits into apache:master from hgs19921112:master


@hgs19921112 hgs19921112 commented Aug 20, 2023

The executor gets stuck retrying to fetch shuffle data when a `java.lang.OutOfMemoryError: unable to create native thread` exception occurs.

What changes were proposed in this pull request?

No.

Why are the changes needed?

To keep the executor from getting stuck.
The executor gets stuck because `RetryingBlockTransferor#initiateRetry` fails to create a native thread when it calls `executorService.submit`. By that point the `currentListener` member of `RetryingBlockTransferor` has already been replaced, so the exception surfaces on the old `currentListener`. `RetryingBlockTransferor.RetryingBlockTransferListener#handleBlockTransferFailure` checks `this == currentListener`, but `this` is the old listener, so the check fails.
As a result, the `RetryingBlockTransferListener` cannot forward the failure to the `BlockFetchingListener` in `ShuffleBlockFetcherIterator`, and `ShuffleBlockFetcherIterator` never gets any data from `ShuffleBlockFetcherIterator#results`.
The executor then hangs.
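
To make the sequence above concrete, here is a minimal, self-contained sketch of the stale-listener problem. It is not Spark's code: `StaleListenerSketch`, `Listener`, `FailureSink`, `Submitter`, and the `forwardOnSubmitFailure` flag are hypothetical names used only for illustration, and the sketch assumes the error thrown by the failed submit is reported back to the same (old) listener, as the description implies. The forwarding branch shows one possible mitigation, not necessarily what this PR or #42426 does.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Simplified model of the stale-listener problem; all names are hypothetical. */
public class StaleListenerSketch {

  /** Stands in for the downstream BlockFetchingListener. */
  interface FailureSink { void onFailure(Throwable cause); }

  /** Stands in for executorService.submit. */
  interface Submitter { void submit(Runnable task); }

  private final FailureSink sink;
  private final Submitter submitter;
  private final boolean forwardOnSubmitFailure;
  private Listener currentListener = new Listener();

  StaleListenerSketch(FailureSink sink, Submitter submitter, boolean forwardOnSubmitFailure) {
    this.sink = sink;
    this.submitter = submitter;
    this.forwardOnSubmitFailure = forwardOnSubmitFailure;
  }

  Listener currentListener() { return currentListener; }

  /** Swaps in a fresh listener first, then tries to schedule the retry. */
  private void initiateRetry() {
    currentListener = new Listener();
    submitter.submit(() -> { /* a real retry would re-fetch the blocks here */ });
  }

  final class Listener {
    void handleFailure(Throwable cause) {
      // Failures reported to a listener that is no longer current are dropped.
      if (this != currentListener) {
        return;
      }
      if (forwardOnSubmitFailure) {
        // Possible mitigation: if the retry cannot be scheduled, tell the sink.
        try {
          initiateRetry();
        } catch (Throwable t) {
          sink.onFailure(t);
        }
      } else {
        // Behavior described above: the submit error escapes to the caller.
        initiateRetry();
      }
    }
  }

  /** Runs one failed fetch against a submitter that cannot create threads. */
  static List<Throwable> simulate(boolean forwardOnSubmitFailure) {
    List<Throwable> seen = new ArrayList<>();
    Submitter failing = task -> {
      throw new OutOfMemoryError("unable to create native thread");
    };
    StaleListenerSketch s = new StaleListenerSketch(seen::add, failing, forwardOnSubmitFailure);
    Listener old = s.currentListener();
    try {
      old.handleFailure(new IOException("fetch failed"));
    } catch (Throwable t) {
      // Assumption: the caller reports the new error to the listener it holds,
      // which is the old one, so the identity check silently drops it.
      old.handleFailure(t);
    }
    return seen;
  }

  public static void main(String[] args) {
    System.out.println("failures forwarded without mitigation: " + simulate(false).size());
    System.out.println("failures forwarded with mitigation: " + simulate(true).size());
  }
}
```

In this model, `simulate(false)` leaves the sink with no notification at all, which corresponds to `ShuffleBlockFetcherIterator` waiting on a result that never arrives, while `simulate(true)` delivers the `OutOfMemoryError` to the sink so the caller can fail instead of hanging.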

Does this PR introduce any user-facing change?

No.

How was this patch tested?

By the original Spark unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

… when `java.lang.OutOfMemoryError. unable to create native thread` exception occurred.
@github-actions github-actions bot added the CORE label Aug 20, 2023
@hgs19921112 hgs19921112 changed the title from "[SPARK-44881][CORE]Executor stucked on retrying to fetch shuffle data when java.lang.OutOfMemoryError. unable to create native thread exception occurred." to "[SPARK-44881[COMMON]Executor stucked on retrying to fetch shuffle data when java.lang.OutOfMemoryError. unable to create native thread exception occurred." Aug 20, 2023
Contributor

mridulm commented Aug 20, 2023

Why would this help? Please add more details to the description.

If this is still work in progress, please mark it as draft.

Author

hgs19921112 commented Aug 20, 2023

The executor gets stuck because `RetryingBlockTransferor#initiateRetry` fails to create a native thread. By that point the `currentListener` member of `RetryingBlockTransferor` has already been replaced, so the exception surfaces on the old `currentListener`. `RetryingBlockTransferor.RetryingBlockTransferListener#handleBlockTransferFailure` checks `this == currentListener`, but `this` is the old listener, so the check fails.
As a result, the `RetryingBlockTransferListener` cannot forward the failure to the `BlockFetchingListener` in `ShuffleBlockFetcherIterator`, and `ShuffleBlockFetcherIterator` never gets any data from `ShuffleBlockFetcherIterator#results`.
The executor then hangs.
@mridulm

@hdaikoku
Contributor

This seems to be the same issue as #42426

@hgs19921112
Author

> This seems to be the same issue as #42426

Yes.

@fengDianDemaNong

[0.025s][warning][os,thread] Failed to start thread "Unknown thread" - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0k, detached.
Error occurred during initialization of VM
java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
/opt/spark/bin/spark-class: line 96: CMD: bad array subscript

I have also run into this issue several times when running Spark on k8s. Is there a solution?
