-
Notifications
You must be signed in to change notification settings - Fork 28.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-35011][CORE] Fix false active executor in UI that caused by BlockManager reregistration #34536
Conversation
@sumeetgajjar @mridulm @attilapiros Please take a look, thanks! |
Kubernetes integration test starting |
Kubernetes integration test status failure |
Test build #145032 has finished for PR 34536 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM. Thank you, @Ngone51 . Since it's pending on new UT addition, I'll keep this PR open here. BTW, I'm okay with merging without UT in this case.
BTW, @Ngone51 . Could you check your repo? It seems that GitHub Action is not triggered. |
For any mutation of the What I mean is, we have only two mutations of In both of these cases, there is a companion event which is fired - so why do we need to fire an event if executor is missing ? |
Hi @mridulm, I believe the additional Please consider the following sequence of events:
@Ngone51 can you please confirm if my example is valid? |
Exactly.
@mridulm In the case mentioned by @sumeetgajjar , the executor is missing from the scheduler backend but still exits in |
@dongjoon-hyun Thanks for the reminder. Have rebased my repo. |
If there is downstream use of blockmanager and executor events interchangably, we should fix that instead of duplicating event ? (I am assuming reference to |
Kubernetes integration test starting |
So, first of all, we should know that there's a case (reported by SPARK-35011) where the executor doesn't exist in the scheduler backend but exist in For such registered However, for
Yes, it's |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @Ngone51,
LGTM, in case you decide to merge this change without a UT, please let me know, maybe I can salvage a test from my previous PR to test it.
Kubernetes integration test status failure |
@sumeetgajjar Thanks..but I don't think the previous tests would help in this PR since we didn't touch re-registration stuff in this PR. I'd leave no UT with this PR since it's hard to make a UT to test. |
Test build #145044 has finished for PR 34536 at commit
|
My main concern is as follows - we have a A thought exercise - how about modify |
The problem is, Do you think we can add this memory-related info to |
You are right, this sucks: I am not seeing an easy way forward. |
Sure, sgtm! |
According to the above discussion, I merged this for Apache Spark 3.3 for further development. |
Thanks all!! |
Am I right ? Driver Logs
Executor Logs
|
What changes were proposed in this pull request?
Also post the event
SparkListenerExecutorRemoved
when removing an executor, which is known byBlockManagerMaster
but unknown toSchedulerBackend
.Why are the changes needed?
In #32114, it reports an issue that
BlockManagerMaster
could register aBlockManager
from a dead executor due to reregistration mechanism. The side effect is, the executor will be shown on the UI as an active one, though it's already dead indeed.In #32114, we tried to avoid such reregistration for a to-be-dead executor. However, I just realized that we can actually leave such reregistration alone since
HeartbeatReceiver.expireDeadHosts
should clean up thoseBlockManager
s in the end. The problem is, the corresponding executors in UI can't be cleaned along with theBlockManager
s cleaning. Because executors in UI can only be cleaned bySparkListenerExecutorRemoved
,while
BlockManager
s cleaning only postSparkListenerBlockManagerRemoved
(which is ignored byAppStatusListener
).Does this PR introduce any user-facing change?
Yes, users would see the false active executor be removed in the end.
How was this patch tested?
Pass existing tests.