
[SPARK-35011][CORE] Fix false active executor in UI caused by BlockManager reregistration #34536

Closed
wants to merge 2 commits

Conversation

Ngone51 (Member) commented Nov 9, 2021

What changes were proposed in this pull request?

Also post the SparkListenerExecutorRemoved event when removing an executor that is known to BlockManagerMaster but unknown to SchedulerBackend.
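
A rough sketch of the idea (hedged, not the exact patch: the surrounding CoarseGrainedSchedulerBackend members such as executorDataMap, listenerBus, and scheduler are assumed from context):

```scala
// Hedged sketch only -- assumes the enclosing CoarseGrainedSchedulerBackend
// provides executorDataMap, listenerBus, and scheduler.
private def removeExecutor(executorId: String, reason: ExecutorLossReason): Unit = {
  executorDataMap.get(executorId) match {
    case Some(_) =>
      // Normal path: clean up scheduler state. This path already posts
      // SparkListenerExecutorRemoved as part of the existing cleanup.
    case None =>
      // The executor is unknown to the scheduler backend, but a reregistered
      // BlockManager for it may still exist. Remove it from BlockManagerMaster,
      // and also post SparkListenerExecutorRemoved so that AppStatusListener
      // can drop the executor from the UI.
      scheduler.sc.env.blockManager.master.removeExecutorAsync(executorId)
      listenerBus.post(SparkListenerExecutorRemoved(
        System.currentTimeMillis(), executorId, reason.toString))
  }
}
```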

Why are the changes needed?

In #32114, an issue was reported where BlockManagerMaster could register a BlockManager from a dead executor due to the reregistration mechanism. The side effect is that the executor is shown in the UI as active even though it is in fact already dead.

In #32114, we tried to avoid such reregistration for a to-be-dead executor. However, I just realized that we can actually leave such reregistration alone, since HeartbeatReceiver.expireDeadHosts should clean up those BlockManagers in the end. The problem is that the corresponding executors in the UI can't be cleaned up along with the BlockManagers, because executors in the UI can only be removed by SparkListenerExecutorRemoved, while BlockManager cleanup only posts SparkListenerBlockManagerRemoved (which is ignored by AppStatusListener).
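
To make the mismatch concrete, here is a small self-contained toy model in Scala (hypothetical names throughout, not Spark's real classes): a UI-state listener that creates a live executor on both "executor added" and "block manager added" events, but removes one only on "executor removed", ends up with a stale entry.

```scala
// Toy model of the behavior described above; all names are illustrative.
object StaleExecutorDemo extends App {
  sealed trait Event
  case class ExecutorAdded(id: String) extends Event
  case class ExecutorRemoved(id: String, reason: String) extends Event
  case class BlockManagerAdded(execId: String) extends Event
  case class BlockManagerRemoved(execId: String) extends Event

  val liveExecutors = scala.collection.mutable.Set.empty[String]

  def onEvent(e: Event): Unit = e match {
    case ExecutorAdded(id)      => liveExecutors += id
    case BlockManagerAdded(id)  => liveExecutors += id  // UI also (re)creates here
    case ExecutorRemoved(id, _) => liveExecutors -= id  // the only removal path
    case BlockManagerRemoved(_) => ()                   // ignored, as in AppStatusListener
  }

  onEvent(ExecutorAdded("1"))
  onEvent(BlockManagerAdded("1"))
  onEvent(ExecutorRemoved("1", "heartbeat timed out")) // executor dies, UI cleaned
  onEvent(BlockManagerAdded("1"))                      // dead executor reregisters its BM
  onEvent(BlockManagerRemoved("1"))                    // BM cleanup fires no executor event
  println(liveExecutors)                               // still contains "1": stale executor

  onEvent(ExecutorRemoved("1", "extra event, as in this PR")) // the fix
  println(liveExecutors)                                      // now empty
}
```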

Does this PR introduce any user-facing change?

Yes, users will see the false active executor removed in the end.

How was this patch tested?

Pass existing tests.

@Ngone51 Ngone51 self-assigned this Nov 9, 2021
@github-actions github-actions bot added the CORE label Nov 9, 2021
Ngone51 (Member, Author) commented Nov 9, 2021

@sumeetgajjar @mridulm @attilapiros Please take a look, thanks!

SparkQA commented Nov 9, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49503/

SparkQA commented Nov 9, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49503/

SparkQA commented Nov 9, 2021

Test build #145032 has finished for PR 34536 at commit e66744b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

dongjoon-hyun (Member) left a comment

+1, LGTM. Thank you, @Ngone51. Since it's pending a new UT addition, I'll keep this PR open for now. BTW, I'm okay with merging without a UT in this case.

dongjoon-hyun (Member) commented

BTW, @Ngone51, could you check your repo? It seems that GitHub Actions is not being triggered.

mridulm (Contributor) commented Nov 9, 2021

For any mutation of the executorDataMap, we always fire a corresponding event currently, no?
That is, scheduler state itself is always updated via SparkListenerExecutorAdded/SparkListenerExecutorRemoved.
So why would we end up in a situation where we need SparkListenerExecutorRemoved to be fired again?

What I mean is, we have only two mutations of executorDataMap, namely:

  • in RegisterExecutor here, and
  • in removeExecutor here

In both of these cases, a companion event is fired - so why do we need to fire an event if the executor is missing?

sumeetgajjar (Contributor) commented

So why would we end up in a situation where we need SparkListenerExecutorRemoved to be fired again?

Hi @mridulm, I believe the additional SparkListenerExecutorRemoved is meant to nullify the effect of the re-registration of a BlockManager from a dead executor.

Please consider the following sequence of events:

  1. An executor stops sending heartbeats; HeartbeatReceiver expires it, the scheduler backend removes it from executorDataMap, and SparkListenerExecutorRemoved is fired.
  2. The dead executor's BlockManager then re-registers with BlockManagerMaster, which fires SparkListenerBlockManagerAdded.
  3. AppStatusListener creates a live executor entry for that event, so the UI shows the dead executor as active again.
  4. When the re-registered BlockManager is eventually removed, only SparkListenerBlockManagerRemoved is fired, which AppStatusListener ignores, so the executor stays "active" in the UI.

@Ngone51 can you please confirm if my example is valid?

Ngone51 (Member, Author) commented Nov 10, 2021

@Ngone51 can you please confirm if my example is valid?

Exactly.

In both of these cases, a companion event is fired - so why do we need to fire an event if the executor is missing?

@mridulm In the case mentioned by @sumeetgajjar, the executor is missing from the scheduler backend but still exists in BlockManagerMaster / the UI.

Ngone51 (Member, Author) commented Nov 10, 2021

BTW, @Ngone51, could you check your repo? It seems that GitHub Actions is not being triggered.

@dongjoon-hyun Thanks for the reminder. I've rebased my repo.

mridulm (Contributor) commented Nov 10, 2021

SparkListenerExecutorAdded and SparkListenerExecutorRemoved are distinct from BlockManager events.
Based on what I currently see, can you clarify why SparkListenerExecutorRemoved needs to be fired?

If there is downstream use of BlockManager and executor events interchangeably, shouldn't we fix that instead of duplicating the event? (I am assuming the reference to AppStatusListener was for this?)

SparkQA commented Nov 10, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49515/

Ngone51 (Member, Author) commented Nov 10, 2021

SparkListenerExecutorAdded and SparkListenerExecutorRemoved are distinct from BlockManager events.
Based on what I currently see, can you clarify why SparkListenerExecutorRemoved needs to be fired?

So, first of all, we should know that there's a case (reported by SPARK-35011) where the executor doesn't exist in the scheduler backend but does exist in BlockManagerMaster (in the form of a BlockManager). In this case, only a SparkListenerBlockManagerAdded event is fired, during BlockManager registration. And on the AppStatusListener side, whenever there's a SparkListenerExecutorAdded or a SparkListenerBlockManagerAdded, it creates a live executor entity for the executor. Therefore, we'd have a live executor in the UI in the SPARK-35011 case, even though the executor is in fact dead.

For such registered BlockManagers, fortunately, we have HeartbeatReceiver.expireDeadHosts to remove them in the end, which fires a SparkListenerBlockManagerRemoved during removal. Note that there won't be a SparkListenerExecutorRemoved fired, since the scheduler backend (executorDataMap) no longer contains the executor.

However, AppStatusListener only removes a live executor from the UI on SparkListenerExecutorRemoved, not on SparkListenerBlockManagerRemoved. Therefore, we need to fire a separate SparkListenerExecutorRemoved for it.
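
For reference, the relevant AppStatusListener handlers look roughly like this (a hedged, abbreviated paraphrase from my reading of the Spark source; the getOrCreateExecutor helper and liveExecutors map are as I recall them, so check the code):

```scala
// Both "added" events create a live executor entry:
override def onExecutorAdded(event: SparkListenerExecutorAdded): Unit = {
  val exec = getOrCreateExecutor(event.executorId, event.time)
  // ... populate host, cores, etc. ...
}

override def onBlockManagerAdded(event: SparkListenerBlockManagerAdded): Unit = {
  val exec = getOrCreateExecutor(event.blockManagerId.executorId, event.time)
  // ... populate memory info; this is what revives a dead executor in the UI ...
}

// ...but only SparkListenerExecutorRemoved removes a live executor:
override def onExecutorRemoved(event: SparkListenerExecutorRemoved): Unit = {
  liveExecutors.remove(event.executorId).foreach { exec =>
    // ... mark the executor inactive, record removal time and reason ...
  }
}
// There is no executor removal in response to SparkListenerBlockManagerRemoved.
```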

If there is downstream use of BlockManager and executor events interchangeably, shouldn't we fix that instead of duplicating the event? (I am assuming the reference to AppStatusListener was for this?)

Yes, it's AppStatusListener that needs the event. If we fixed this in AppStatusListener, we'd lose the exact executor loss reason in the UI (SparkListenerExecutorRemoved contains a loss-reason field but SparkListenerBlockManagerRemoved doesn't). So I chose to duplicate the event instead of fixing it in AppStatusListener.
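
Concretely, the two "removed" events carry different information (paraphrased from Spark's scheduler event definitions, abbreviated):

```scala
case class SparkListenerExecutorRemoved(
    time: Long,
    executorId: String,
    reason: String)                     // the loss reason is preserved here
  extends SparkListenerEvent

case class SparkListenerBlockManagerRemoved(
    time: Long,
    blockManagerId: BlockManagerId)    // no loss reason available
  extends SparkListenerEvent
```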

sumeetgajjar (Contributor) left a comment

Hi @Ngone51,
LGTM. In case you decide to merge this change without a UT, please let me know - maybe I can salvage a test from my previous PR to cover it.

SparkQA commented Nov 10, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49515/

Ngone51 (Member, Author) commented Nov 10, 2021

@sumeetgajjar Thanks, but I don't think the previous tests would help here, since we didn't touch the re-registration logic in this PR. I'll leave this PR without a UT, since it's hard to construct one for this case.

SparkQA commented Nov 10, 2021

Test build #145044 has finished for PR 34536 at commit ce22c09.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class MultivariateGaussian(NamedTuple):

mridulm (Contributor) commented Nov 10, 2021

My main concern is as follows - we have a SparkListenerExecutorRemoved published for a previous SparkListenerExecutorAdded: as a state transition, it is complete.

A thought exercise - how about modifying AppStatusListener to add an executor only in the case of SparkListenerExecutorAdded?
SparkListenerExecutorAdded should be fired before SparkListenerBlockManagerAdded, IIRC (except for the driver, which we special-case?).

Ngone51 (Member, Author) commented Nov 10, 2021

A thought exercise - how about modifying AppStatusListener to add an executor only in the case of SparkListenerExecutorAdded?

The problem is that SparkListenerBlockManagerAdded provides memory-related info (e.g., maxMem, maxOnHeapMem, maxOffHeapMem) that isn't provided by SparkListenerExecutorAdded. So SparkListenerBlockManagerAdded is a necessary event, unlike SparkListenerBlockManagerRemoved.
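
For context, the two "added" events look roughly like this (again a paraphrase of Spark's event definitions, abbreviated):

```scala
case class SparkListenerBlockManagerAdded(
    time: Long,
    blockManagerId: BlockManagerId,
    maxMem: Long,
    maxOnHeapMem: Option[Long] = None,   // memory info is only available here
    maxOffHeapMem: Option[Long] = None)
  extends SparkListenerEvent

case class SparkListenerExecutorAdded(
    time: Long,
    executorId: String,
    executorInfo: ExecutorInfo)          // no max-memory fields
  extends SparkListenerEvent
```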

Do you think we could add this memory-related info to SparkListenerExecutorAdded too? Though Spark events should be considered public API, adding new fields doesn't seem to break compatibility.

mridulm (Contributor) commented Nov 10, 2021

You are right, this sucks: I am not seeing an easy way forward.
Let us merge this for the time being and try to find a solution in (hopefully) 3.3; thoughts?

Ngone51 (Member, Author) commented Nov 10, 2021

Sure, sgtm!

dongjoon-hyun (Member) commented

According to the above discussion, I merged this for Apache Spark 3.3 for further development.

Ngone51 (Member, Author) commented Nov 12, 2021

Thanks all!!

@Ngone51 Ngone51 deleted the SPARK-35011 branch November 12, 2021 01:03
wankunde (Contributor) commented Nov 17, 2021

For such registered BlockManagers, fortunately, we have HeartbeatReceiver.expireDeadHosts to remove them in the end, which fires a SparkListenerBlockManagerRemoved during removal. Note that there won't be a SparkListenerExecutorRemoved fired, since the scheduler backend (executorDataMap) no longer contains the executor.

@Ngone51 @sumeetgajjar

HeartbeatReceiver.expireDeadHosts will not clean up those BlockManagers if the executor is killed with the reason "Executor heartbeat timed out". Executors could time out on heartbeats because of a network issue, or for some other reason like SPARK-20977.

Am I right?
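
If I read HeartbeatReceiver correctly, expireDeadHosts only scans executors it is still tracking, so a BlockManager that reregisters after its executor has been expired is invisible to it. A simplified paraphrase (hedged, from memory of the Spark source, not the exact code) that illustrates the concern:

```scala
// Simplified paraphrase of HeartbeatReceiver.expireDeadHosts:
private def expireDeadHosts(): Unit = {
  val now = clock.getTimeMillis()
  for ((executorId, lastSeenMs) <- executorLastSeen) {
    if (now - lastSeenMs > executorTimeoutMs) {
      scheduler.executorLost(executorId,
        ExecutorProcessLost(s"Executor heartbeat timed out after ${now - lastSeenMs} ms"))
      // (It also asks the cluster manager to kill the executor; elided here.)
      // The executor is forgotten below; if its BlockManager reregisters
      // *after* this point, nothing tracks it anymore, so this loop will
      // never expire it again.
      executorLastSeen.remove(executorId)
    }
  }
}
```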

Driver Logs

21/11/13 05:06:20,999 WARN [dispatcher-event-loop-36] spark.HeartbeatReceiver:69 : Removing executor 3056 with no recent heartbeats: 350149 ms exceeds timeout 300000 ms
21/11/13 05:06:20,999 INFO [kill-executor-thread] cluster.YarnClientSchedulerBackend:57 : Requesting to kill executor(s) 3056
21/11/13 05:06:21,000 INFO [kill-executor-thread] cluster.YarnClientSchedulerBackend:57 : Actual list of executor(s) to be killed is 3056
21/11/13 05:06:21,000 INFO [dispatcher-event-loop-8] yarn.ApplicationMaster$AMEndpoint:57 : Driver requested to kill executor(s) 3056.
21/11/13 05:06:21,000 INFO [dispatcher-CoarseGrainedScheduler] cluster.YarnSchedulerBackend$YarnDriverEndpoint:57 : Asked to remove executor 3056 with reason Executor heartbeat timed out after 350149 ms
21/11/13 05:06:21,000 ERROR [dispatcher-CoarseGrainedScheduler] cluster.YarnScheduler:73 : Lost executor 3056 on executor_host: Executor heartbeat timed out after 350149 ms
21/11/13 05:06:21,041 WARN [dispatcher-CoarseGrainedScheduler] scheduler.TaskSetManager:69 : Lost task 262191.0 in stage 452627.0 (TID 245764597, executor_host, executor 3056): ExecutorLostFailure (executor 3056 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 350149 ms
21/11/13 05:06:21,067 WARN [dispatcher-CoarseGrainedScheduler] scheduler.TaskSetManager:69 : Lost task 259149.0 in stage 452627.0 (TID 245761130, executor_host, executor 3056): ExecutorLostFailure (executor 3056 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 350149 ms
21/11/13 05:06:22,068 INFO [dispatcher-BlockManagerMaster] storage.BlockManagerMasterEndpoint:57 : Trying to remove executor 3056 from BlockManagerMaster.
21/11/13 05:06:22,072 INFO [dispatcher-BlockManagerMaster] storage.BlockManagerMasterEndpoint:57 : Removing block manager BlockManagerId(3056, executor_host, 30504, None)
21/11/13 05:06:22,073 INFO [dag-scheduler-event-loop] storage.BlockManagerMaster:57 : Removed 3056 successfully in removeExecutor
21/11/13 05:06:22,962 INFO [dispatcher-BlockManagerMaster] storage.BlockManagerMasterEndpoint:57 : Registering block manager executor_host:30504 with 88.5 GiB RAM, BlockManagerId(3056, executor_host, 30504, None)

Executor Logs

21/11/13 05:06:21,004 INFO [dispatcher-Executor] executor.YarnCoarseGrainedExecutorBackend:57 : Driver commanded a shutdown
21/11/13 05:06:22,215 INFO [block-manager-future-0] storage.BlockManagerMaster:57 : Registering BlockManager BlockManagerId(3056, executor_host, 30504, None)
