[SPARK-27510][CORE] Avoid Master falls into dead loop while launching executor failed in Worker #24408
Conversation
ping @cloud-fan @jiangxb1987 @jerryshao @srowen please help review, thanks!
Test build #104712 has finished for PR 24408 at commit
As usual I'm not super familiar with this code, but from reading your analysis and the code and the change, it does seem more correct to me.
LGTM if tests pass. cc @jiangxb1987
Jenkins, retest this please.
Test build #104742 has finished for PR 24408 at commit
Jenkins, retest this please.
One problem: if the ExecutorRunner gets shut down before calling fetchAndRunExecutor(), then it will send the wrong state to the Master. We should also update the shutdownHook.
Ideally this change is on the right track, but the synchronization between Master and Workers is extremely subtle, so I would be quite conservative about making any changes without adding some test cases.
Then the worker sends nothing, and the Master should time out this worker eventually. +1 for adding some tests.
The worker will send ExecutorState.LAUNCHING again, while it should actually send ExecutorState.FAILED in this case. See
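To illustrate the ordering concern discussed above, here is a minimal, self-contained sketch (not Spark's actual ExecutorRunner; the state enum and report callback are stand-ins): the launch path reports RUNNING only after a successful start, and the shutdown hook reports a terminal state if it fires while the runner is still launching, so the Master is never left with a stale state.

```scala
// Minimal sketch of the state-reporting order; names are hypothetical stand-ins.
object ExecutorStateSketch extends Enumeration {
  val LAUNCHING, RUNNING, FAILED, KILLED = Value
}

class RunnerSketch(report: ExecutorStateSketch.Value => Unit) {
  import ExecutorStateSketch._
  private var state: Value = LAUNCHING

  // Launch path: report RUNNING only once the process has actually started.
  def fetchAndRun(startProcess: () => Boolean): Unit = synchronized {
    state = if (startProcess()) RUNNING else FAILED
    report(state)
  }

  // Shutdown path: if we are still LAUNCHING (fetchAndRun never got to run),
  // report a terminal state instead of leaving the Master with a stale one.
  def shutdownHook(): Unit = synchronized {
    if (state == LAUNCHING) {
      state = KILLED
      report(state)
    }
  }
}
```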
Test build #104744 has finished for PR 24408 at commit
Test build #104749 has finished for PR 24408 at commit
@jiangxb1987 are you OK with this change now?
ping @jiangxb1987 |
@jiangxb1987 (maybe @cloud-fan) this is looking OK to me, but I would really like a second look at it.
LGTM |
retest this please |
Test build #105052 has finished for PR 24408 at commit
Thanks! Merged to master! |
Thanks @srowen @cloud-fan @jiangxb1987!
What changes were proposed in this pull request?
This is a long-standing issue which I've met before, and I've seen other people run into it as well:
test cases stuck on "local-cluster mode" of ReplSuite?
Spark tests hang on local machine due to "testGuavaOptional" in JavaAPISuite
When running a test under local-cluster mode with a wrong SPARK_HOME (spark.test.home), the test just gets stuck and never responds. Looking into SPARK_WORKER_DIR, I found endless executor directories under it, which explains what happens while the test is stuck.
The whole process looks like:

1. Master finds a valid Worker and sends LaunchExecutor to the Worker.
2. Worker receives LaunchExecutor and launches an ExecutorRunner asynchronously.
3. Worker sends ExecutorStateChanged(state=RUNNING) to Master immediately.
4. Master receives ExecutorStateChanged(state=RUNNING) and resets _retryCount to 0.
5. ExecutorRunner fails to launch the executor (here, due to the wrong SPARK_HOME), sends ExecutorStateChanged(state=FAILED) to Worker, and Worker forwards the msg to Master.
6. Master receives ExecutorStateChanged(state=FAILED). Since Master always resets _retryCount when it receives a RUNNING msg, even if a Worker fails to launch the executor many times in a row, _retryCount never exceeds maxExecutorRetries. So Master keeps launching the executor and falls into a dead loop, as modeled in the sketch below.
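To make the loop concrete, here is a simplified, self-contained model of the retry accounting described in steps 4 and 6 (onExecutorStateChanged, retryCount, and the event loop are illustrative stand-ins, not the Master's real code): because a premature RUNNING message arrives before every FAILED message, the counter is reset each round and can never reach maxExecutorRetries.

```scala
// Simplified model of the dead loop; all names here are hypothetical stand-ins.
object RetryLoopSketch extends App {
  val maxExecutorRetries = 10
  var retryCount = 0

  // Returns true while the Master would keep relaunching the executor.
  def onExecutorStateChanged(state: String): Boolean = state match {
    case "RUNNING" =>
      retryCount = 0                     // step 4: reset on a (premature) RUNNING
      true
    case "FAILED" =>
      retryCount += 1                    // step 6: count the failure
      retryCount < maxExecutorRetries    // give up only past the limit
    case _ => true
  }

  // The buggy sequence: a premature RUNNING before every FAILED, forever.
  // retryCount oscillates between 0 and 1, so the limit is never reached.
  val events = Iterator.continually(Seq("RUNNING", "FAILED")).flatten.take(1000)
  assert(events.map(onExecutorStateChanged).forall(identity))
  println(s"retryCount = $retryCount after 1000 events; never hit $maxExecutorRetries")
}
```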
The problem exists in step 3: Worker sends ExecutorStateChanged(state=RUNNING) to Master immediately while the executor is still launching. When Master receives that msg, it believes the executor has launched successfully and resets _retryCount accordingly. However, that's not true. This PR removes step 3 and has Worker send ExecutorStateChanged(state=RUNNING) only after the executor has really launched successfully.
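As a hedged illustration of that change (the MasterRef trait and the startAsync/startProcess callbacks below are made-up stand-ins, not Spark's actual Worker API), the RUNNING ack moves from the point where the launch is merely scheduled to the point where the process has actually started:

```scala
// Sketch only: illustrates where the RUNNING message is sent, before vs. after.
object LaunchAckSketch {
  trait MasterRef { def executorStateChanged(state: String): Unit }

  // Before the fix: Worker acks RUNNING as soon as it schedules the launch.
  def launchExecutorBefore(master: MasterRef, startAsync: () => Unit): Unit = {
    startAsync()                             // ExecutorRunner starts in the background
    master.executorStateChanged("RUNNING")   // premature: the launch may still fail
  }

  // After the fix: RUNNING is reported only once the process has really started.
  def launchExecutorAfter(master: MasterRef, startProcess: () => Boolean): Unit = {
    new Thread(() => {
      val ok = startProcess()
      master.executorStateChanged(if (ok) "RUNNING" else "FAILED")
    }).start()
  }
}
```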
How was this patch tested?
Tested Manually.