Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-27510][CORE] Avoid Master falls into dead loop while launching executor failed in Worker #24408

Closed
wants to merge 2 commits into from

Conversation

Ngone51
Copy link
Member

@Ngone51 Ngone51 commented Apr 18, 2019

What changes were proposed in this pull request?

This is a long standing issue which I met before and I've seen other people got trouble with it:
test cases stuck on "local-cluster mode" of ReplSuite?
Spark tests hang on local machine due to "testGuavaOptional" in JavaAPISuite

When running test under local-cluster mode with wrong SPARK_HOME(spark.test.home), test just get stuck and no response forever. After looking into SPARK_WORKER_DIR, I found there's endless executor directories under it. So, this explains what happens during test getting stuck.

The whole process looks like:

  1. Driver submits an app to Master and asks for N executors
  2. Master inits executor state with LAUNCHING and sends LaunchExecutor to Worker
  3. Worker receives LaunchExecutor, launches ExecutorRunner asynchronously and sends ExecutorStateChanged(state=RUNNING) to Mater immediately
  4. Master receives ExecutorStateChanged(state=RUNNING) and reset _retyCount to 0.
  5. ExecutorRunner throws exception during executor launching, sends ExecutorStateChanged(state=FAILED) to Worker, Worker forwards the msg to Master
  6. Master receives ExecutorStateChanged(state=FAILED). Since Master always reset _retyCount when it receives RUNNING msg, so, event if a Worker fails to launch executor for continuous many times, _retryCount would never exceed maxExecutorRetries. So, Master continue to launch executor and fall into the dead loop.

The problem exists in step 3. Worker sends ExecutorStateChanged(state=RUNNING) to Master immediately while executor is still launching. And, when Master receive that msg, it believes the executor has launched successfully, and reset _retryCount subsequently. However, that's not true.

This pr suggests to remove step 3 and requires Worker only send ExecutorStateChanged(state=RUNNING) after executor has really launched successfully.

How was this patch tested?

Tested Manually.

@Ngone51 Ngone51 changed the title Avoid Master falls into dead loop while launching executor failed in Worker [SPARK-27510][CORE] Avoid Master falls into dead loop while launching executor failed in Worker Apr 18, 2019
@Ngone51
Copy link
Member Author

Ngone51 commented Apr 18, 2019

ping @cloud-fan @jiangxb1987 @jerryshao @srowen

please help review, thanks!

@SparkQA
Copy link

SparkQA commented Apr 18, 2019

Test build #104712 has finished for PR 24408 at commit ea54dd4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As usual I'm not super familiar with this code, but from reading your analysis and the code and the change, it does seem more correct to me.

@cloud-fan
Copy link
Contributor

LGTM if tests pass. cc @jiangxb1987

@Ngone51
Copy link
Member Author

Ngone51 commented Apr 19, 2019

Jenkins, retest this please.

@SparkQA
Copy link

SparkQA commented Apr 19, 2019

Test build #104742 has finished for PR 24408 at commit ea54dd4.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51
Copy link
Member Author

Ngone51 commented Apr 19, 2019

Jenkins, retest this please.

Copy link
Contributor

@jiangxb1987 jiangxb1987 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One problem is if the ExecutorRunner got shutdown before calling fetchAndRunExecutor(), then it will send wrong state to the master. We shall also update the shutdownHook.

Ideally this change shall be on the right way, but the sync between Master and Workers are extremely subtle so I would be quite conservative to making any changes without adding some test cases.

@cloud-fan
Copy link
Contributor

if the ExecutorRunner got shutdown before calling fetchAndRunExecutor()

Then the worker sends nothing, and master should timeout this worker eventually. +1 for adding some tests

@jiangxb1987
Copy link
Contributor

jiangxb1987 commented Apr 19, 2019

if the ExecutorRunner got shutdown before calling fetchAndRunExecutor()

Then the worker sends nothing, and master should timeout this worker eventually. +1 for adding some tests

The worker will send ExecutorState.LAUNCHING again, while actually it shall send out ExecutorState.FAILED in this case. See

worker.send(ExecutorStateChanged(appId, execId, state, message, exitCode))

@SparkQA
Copy link

SparkQA commented Apr 19, 2019

Test build #104744 has finished for PR 24408 at commit ea54dd4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 19, 2019

Test build #104749 has finished for PR 24408 at commit 0be6b91.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Apr 24, 2019

@jiangxb1987 are you OK with this change now?

@Ngone51
Copy link
Member Author

Ngone51 commented Apr 26, 2019

ping @jiangxb1987

@srowen
Copy link
Member

srowen commented Apr 30, 2019

@jiangxb1987 (maybe @cloud-fan ) this is looking OK to me but would most like a second look at it.

@jiangxb1987
Copy link
Contributor

LGTM

@jiangxb1987
Copy link
Contributor

retest this please

@SparkQA
Copy link

SparkQA commented May 1, 2019

Test build #105052 has finished for PR 24408 at commit 0be6b91.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987
Copy link
Contributor

Thanks! Merged to master!

@Ngone51
Copy link
Member Author

Ngone51 commented May 5, 2019

Thanks @srowen @cloud-fan @jiangxb1987 !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants