[SPARK-27510][CORE] Avoid Master falls into dead loop while launching executor failed in Worker #24408
Conversation
ping @cloud-fan @jiangxb1987 @jerryshao @srowen please help review, thanks!
Test build #104712 has finished for PR 24408 at commit
As usual I'm not super familiar with this code, but from reading your analysis and the code and the change, it does seem more correct to me.
LGTM if tests pass. cc @jiangxb1987
Jenkins, retest this please.
Test build #104742 has finished for PR 24408 at commit
Jenkins, retest this please.
One problem: if the ExecutorRunner gets shut down before calling fetchAndRunExecutor(), then it will send the wrong state to the Master. We should also update the shutdownHook.
Ideally this change is on the right track, but the synchronization between Master and Workers is extremely subtle, so I would be quite conservative about making any changes without adding some test cases.
Then the worker sends nothing, and the Master should time out this worker eventually. +1 for adding some tests.
The worker will send ExecutorState.LAUNCHING again, while it should actually send ExecutorState.FAILED in this case. See
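To illustrate the ordering concern discussed above, here is a minimal, self-contained sketch (not Spark's actual ExecutorRunner; the state enum and report callback are stand-ins): the launch path reports RUNNING only after a successful start, and the shutdown hook reports a terminal state if it fires while the runner is still launching, so the Master is never left with a stale state.

```scala
// Minimal sketch of the state-reporting order; names are hypothetical stand-ins.
object ExecutorStateSketch extends Enumeration {
  val LAUNCHING, RUNNING, FAILED, KILLED = Value
}

class RunnerSketch(report: ExecutorStateSketch.Value => Unit) {
  import ExecutorStateSketch._
  private var state: Value = LAUNCHING

  // Launch path: report RUNNING only once the process has actually started.
  def fetchAndRun(startProcess: () => Boolean): Unit = synchronized {
    state = if (startProcess()) RUNNING else FAILED
    report(state)
  }

  // Shutdown path: if we are still LAUNCHING (fetchAndRun never got to run),
  // report a terminal state instead of leaving the Master with a stale one.
  def shutdownHook(): Unit = synchronized {
    if (state == LAUNCHING) {
      state = KILLED
      report(state)
    }
  }
}
```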
Test build #104744 has finished for PR 24408 at commit
Test build #104749 has finished for PR 24408 at commit
@jiangxb1987 are you OK with this change now?
ping @jiangxb1987 |
@jiangxb1987 (maybe @cloud-fan) this is looking OK to me, but I would really like a second look at it.
LGTM |
retest this please |
Test build #105052 has finished for PR 24408 at commit
Thanks! Merged to master! |
Thanks @srowen @cloud-fan @jiangxb1987!
What changes were proposed in this pull request?
This is a long-standing issue which I've met before, and I've seen other people run into it as well:
test cases stuck on "local-cluster mode" of ReplSuite?
Spark tests hang on local machine due to "testGuavaOptional" in JavaAPISuite
When running a test under local-cluster mode with a wrong SPARK_HOME (spark.test.home), the test just gets stuck and never responds. Looking into SPARK_WORKER_DIR, I found endless executor directories under it, which explains what happens while the test is stuck.
The whole process looks like:

1. Master finds a valid Worker and sends LaunchExecutor to the Worker.
2. Worker receives LaunchExecutor and launches an ExecutorRunner asynchronously.
3. Worker sends ExecutorStateChanged(state=RUNNING) to Master immediately.
4. Master receives ExecutorStateChanged(state=RUNNING) and resets _retryCount to 0.
5. ExecutorRunner fails to launch the executor (here, due to the wrong SPARK_HOME), sends ExecutorStateChanged(state=FAILED) to Worker, and Worker forwards the msg to Master.
6. Master receives ExecutorStateChanged(state=FAILED). Since Master always resets _retryCount when it receives a RUNNING msg, even if a Worker fails to launch the executor many times in a row, _retryCount never exceeds maxExecutorRetries. So Master keeps launching the executor and falls into a dead loop, as modeled in the sketch below.
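To make the loop concrete, here is a simplified, self-contained model of the retry accounting described in steps 4 and 6 (onExecutorStateChanged, retryCount, and the event loop are illustrative stand-ins, not the Master's real code): because a premature RUNNING message arrives before every FAILED message, the counter is reset each round and can never reach maxExecutorRetries.

```scala
// Simplified model of the dead loop; all names here are hypothetical stand-ins.
object RetryLoopSketch extends App {
  val maxExecutorRetries = 10
  var retryCount = 0

  // Returns true while the Master would keep relaunching the executor.
  def onExecutorStateChanged(state: String): Boolean = state match {
    case "RUNNING" =>
      retryCount = 0                     // step 4: reset on a (premature) RUNNING
      true
    case "FAILED" =>
      retryCount += 1                    // step 6: count the failure
      retryCount < maxExecutorRetries    // give up only past the limit
    case _ => true
  }

  // The buggy sequence: a premature RUNNING before every FAILED, forever.
  // retryCount oscillates between 0 and 1, so the limit is never reached.
  val events = Iterator.continually(Seq("RUNNING", "FAILED")).flatten.take(1000)
  assert(events.map(onExecutorStateChanged).forall(identity))
  println(s"retryCount = $retryCount after 1000 events; never hit $maxExecutorRetries")
}
```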
The problem exists in step 3: Worker sends ExecutorStateChanged(state=RUNNING) to Master immediately while the executor is still launching. When Master receives that msg, it believes the executor has launched successfully and resets _retryCount accordingly. However, that's not true. This PR removes step 3 and has Worker send ExecutorStateChanged(state=RUNNING) only after the executor has really launched successfully.
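As a hedged illustration of that change (the MasterRef trait and the startAsync/startProcess callbacks below are made-up stand-ins, not Spark's actual Worker API), the RUNNING ack moves from the point where the launch is merely scheduled to the point where the process has actually started:

```scala
// Sketch only: illustrates where the RUNNING message is sent, before vs. after.
object LaunchAckSketch {
  trait MasterRef { def executorStateChanged(state: String): Unit }

  // Before the fix: Worker acks RUNNING as soon as it schedules the launch.
  def launchExecutorBefore(master: MasterRef, startAsync: () => Unit): Unit = {
    startAsync()                             // ExecutorRunner starts in the background
    master.executorStateChanged("RUNNING")   // premature: the launch may still fail
  }

  // After the fix: RUNNING is reported only once the process has really started.
  def launchExecutorAfter(master: MasterRef, startProcess: () => Boolean): Unit = {
    new Thread(() => {
      val ok = startProcess()
      master.executorStateChanged(if (ok) "RUNNING" else "FAILED")
    }).start()
  }
}
```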
How was this patch tested?
Tested Manually.