[SPARK-44881][COMMON] Executor stuck retrying to fetch shuffle data when a `java.lang.OutOfMemoryError: unable to create native thread` exception occurs #42572
hgs19921112 wants to merge 5 commits into apache:master from hgs19921112:master
Why would this help? Please add more details to the description. If this is still a work in progress, please mark it as a draft.

The reason the executor gets stuck is that in

This seems to be the same issue as #42426
Yes.

[0.025s][warning][os,thread] Failed to start thread "Unknown thread" - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 0k, detached.
I have also encountered this issue several times when running Spark on k8s. Is there a solution?
The executor gets stuck retrying to fetch shuffle data when a `java.lang.OutOfMemoryError: unable to create native thread` exception occurs.
What changes were proposed in this pull request?
No.
Why are the changes needed?
To prevent executors from getting stuck.
The executor gets stuck because `RetryingBlockTransferor#initiateRetry` fails to create a native thread when it executes `executorService.submit`. By then, the `currentListener` member of `RetryingBlockTransferor` has already been replaced, because the swap happens before the retry is submitted, yet the exception is raised in the context of the old `currentListener`. `RetryingBlockTransferor.RetryingBlockTransferListener#handleBlockTransferFailure` checks `this == currentListener`, but `this` is the old listener, so `this` does not equal `currentListener`. As a result, the `RetryingBlockTransferListener` cannot forward the failure to the `BlockFetchingListener` in `ShuffleBlockFetcherIterator`, and `ShuffleBlockFetcherIterator` never gets any data from `ShuffleBlockFetcherIterator#results`. That is how the executor ends up stuck.
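The failure path described above can be reproduced in a small, self-contained model. The sketch below is not Spark's code: `ToyTransferor`, `Listener`, `FailureSink`, and `failuresForwarded` are hypothetical names, the retry limit of one is an assumption, and the `try`/`catch` around `execute` is a shortcut standing in for however the error reaches the old listener in the real code. It only illustrates how replacing `currentListener` before the retry is submitted makes the identity check drop the failure.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executor;

/** Simplified model of the stale-listener problem; all names are hypothetical. */
class StaleListenerSketch {

    /** Stand-in for the downstream BlockFetchingListener. */
    interface FailureSink {
        void onFailure(String blockId, Throwable cause);
    }

    /** Toy transferor handling one block with at most one retry (assumed limit). */
    static final class ToyTransferor {
        private static final int MAX_RETRIES = 1;
        private final Executor retryExecutor;
        private final FailureSink sink;
        private int retryCount = 0;
        private Listener currentListener = new Listener();

        ToyTransferor(Executor retryExecutor, FailureSink sink) {
            this.retryExecutor = retryExecutor;
            this.sink = sink;
        }

        /** Simulates the first fetch attempt failing with an I/O error. */
        void failFirstAttempt() {
            currentListener.onBlockFailure("block-0", new IOException("connection reset"));
        }

        private void initiateRetry(Listener oldListener, String blockId) {
            retryCount++;
            // The listener is replaced BEFORE the retry task is submitted.
            currentListener = new Listener();
            final Listener fresh = currentListener;
            try {
                // In this toy model the retry attempt also fails with an I/O error.
                retryExecutor.execute(
                    () -> fresh.onBlockFailure(blockId, new IOException("retry failed")));
            } catch (Throwable t) {
                // Modeling shortcut: the submit error is reported to the OLD listener.
                oldListener.onBlockFailure(blockId, t);
            }
        }

        final class Listener {
            void onBlockFailure(String blockId, Throwable cause) {
                if (this != currentListener) {
                    return; // stale listener: the failure is silently dropped
                }
                if (cause instanceof IOException && retryCount < MAX_RETRIES) {
                    initiateRetry(this, blockId);
                } else {
                    sink.onFailure(blockId, cause); // forwarded downstream
                }
            }
        }
    }

    /** Returns how many failures reach the sink for the given retry executor. */
    static int failuresForwarded(Executor retryExecutor) {
        List<Throwable> seen = new ArrayList<>();
        ToyTransferor transferor =
            new ToyTransferor(retryExecutor, (blockId, cause) -> seen.add(cause));
        transferor.failFirstAttempt();
        return seen.size();
    }

    public static void main(String[] args) {
        Executor healthy = Runnable::run; // runs the retry task inline
        Executor starved = task -> {
            throw new OutOfMemoryError("unable to create native thread");
        };
        System.out.println("healthy executor: " + failuresForwarded(healthy));
        System.out.println("starved executor: " + failuresForwarded(starved));
    }
}
```

With the healthy executor, the retry fails, the retry budget is exhausted, and one failure reaches the sink. With the starved executor, nothing reaches the sink, which corresponds to a consumer waiting on results that never arrive.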
Does this PR introduce any user-facing change?
No.
How was this patch tested?
By the original Spark unit tests.
Was this patch authored or co-authored using generative AI tooling?
No.