[SPARK-14228][CORE][YARN] Lost executor of RPC disassociated, and occurs exception: Could not find CoarseGrainedScheduler or it has been stopped #19741
Conversation
exception: Could not find CoarseGrainedScheduler or it has been stopped
cc @jerryshao
From my understanding, the above exception does no harm to the Spark application; it just hits a threading corner case during stop. Am I right?
Thanks @jerryshao for looking into this.
Yeah, it doesn't cause any functional problem, but these exceptions create suspicion in the user's mind that something went wrong with the Spark application, and then they start diagnosing/debugging the cause, which wastes a lot of time or creates a wrong impression. I think this should be fixed to avoid the suspicion/confusion.
@@ -268,8 +268,13 @@ private[spark] abstract class YarnSchedulerBackend(
logWarning(reason.toString)
driverEndpoint.ask[Boolean](r).onFailure {
I wonder if using ask here is doing anything useful at all. Instead, using send would be cheaper, and the handler for RemoveExecutor can log whatever errors it runs into instead.
Then this could just be:
if (!stopped.get()) {
driverEndpoint.send(...)
}
(And the driver endpoint will probably need a minor change too.)
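For illustration, a minimal sketch of that fire-and-forget approach, assuming the stopped flag and the RemoveExecutor(executorId, reason) message discussed below; this is not the exact merged change:

// Hedged sketch: skip the message once the backend has been stopped, and
// fire-and-forget instead of asking and handling the reply here.
if (!stopped.get()) {
  // Any error while handling the message is expected to be logged by the
  // driver endpoint itself (see the handler discussion below).
  driverEndpoint.send(RemoveExecutor(executorId, reason))
}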
Thanks @vanzin for the review.
I think it is a good suggestion. There are other places where the RemoveExecutor message is sent using ask; are you suggesting to change those as well?
(And the driver endpoint will probably need a minor change too.)
You mean moving the case RemoveExecutor(executorId, reason) from receiveAndReply to receive?
I think when you use send, all the related things you mentioned above should be changed.
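A rough sketch of what moving the handler could look like in CoarseGrainedSchedulerBackend's DriverEndpoint, assuming the existing removeExecutor helper; names and surrounding cases are abbreviated, so this is not the exact merged diff:

override def receive: PartialFunction[Any, Unit] = {
  // Moved here from receiveAndReply so callers can fire-and-forget with send;
  // any failure is logged on this side instead of being reported to the sender.
  case RemoveExecutor(executorId, reason) =>
    removeExecutor(executorId, reason)
  // ... other existing cases stay unchanged ...
}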
IIRC there are two different RemoveExecutor messages in the code, as confusing as that may be.
But if this one is used in multiple places then it's probably not worth changing right now, unless you're up for verifying the return value is not needed in the other places.
I see there are multiple places where this message is being used, but all of them just log the failure. I am thinking this logging may be useful to diagnose failures in some cases.
Can you point out a couple of those? I only see this RemoveExecutor being handled in CoarseGrainedSchedulerBackend.scala. You could just log any errors there (the RPC layer will already log any communication issues).
(All the references in the block manager code are for a different RemoveExecutor.)
I see these places other than CoarseGrainedSchedulerBackend.scala and the one present in the PR.
spark/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
Line 169 in 1e82335
RemoveExecutor(executorId, new ExecutorLossReason(reason))
Line 250 in 6735433
driverEndpoint.ask[Boolean](message)
spark/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala
Line 653 in a2db5c5
driverRef.send(RemoveExecutor(eid, exitReason))
Yeah, as you said, there's only logging going on. There's only two things that can happen:
- the handler throws an exception instead of replying
- the RPC layer hits an error
The first can be logged at the handler (if not already logged by the RPC layer). The second is already logged by the RPC layer.
It sounds fine, I will update the PR with the change from ask to send for RemoveExecutor.
ok to test
Test build #83908 has finished for PR 19741 at commit
Test build #83907 has finished for PR 19741 at commit
Test build #83947 has finished for PR 19741 at commit
Test build #84007 has finished for PR 19741 at commit
retest this please
Test build #84445 has finished for PR 19741 at commit
@devaraj-kavali looks like you'll need to update the kubernetes backend now...
for sending RemoveExecutor message
Test build #84511 has finished for PR 19741 at commit
LGTM, merging to master.
What changes were proposed in this pull request?
I see two instances where the exception occurs.
Instance 1:
In CoarseGrainedSchedulerBackend.scala, the driver-revive-thread starts in DriverEndpoint.onStart() and keeps sending ReviveOffers messages periodically until it gets shut down as part of DriverEndpoint.onStop(). There is no proper coordination between the driver-revive-thread shutdown and the RpcEndpoint unregistration: the RpcEndpoint is unregistered first, and only then is the driver-revive-thread shut down in DriverEndpoint.onStop(). In between, the driver-revive-thread may try to send a ReviveOffers message, which leads to the above exception.
To fix this issue, this PR moves the shutdown of the driver-revive-thread to CoarseGrainedSchedulerBackend.stop(), which executes before the DriverEndpoint is unregistered.
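A minimal sketch of that ordering change, assuming the scheduled executor backing the driver-revive-thread is a field named reviveThread; the rest of stop() is elided:

override def stop(): Unit = {
  // Shut down the periodic ReviveOffers sender before the DriverEndpoint is
  // unregistered, so it can no longer message an already-stopped endpoint.
  reviveThread.shutdownNow()
  // ... existing stop logic (stopping executors, the driver endpoint, etc.) ...
}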
Instance 2:
Here YarnDriverEndpoint tries to send remove executor messages after the YARN scheduler backend service has stopped, which leads to the above exception. To avoid it,
this PR takes option 2), which adds a log message in the onFailure case without the exception stack trace, since option 1) would need to go through every remove executor message.
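A hedged sketch of that option 2) log-only failure handler around the existing ask call shown in the diff above (the message variable r comes from that diff; ThreadUtils.sameThread is assumed as the execution context):

driverEndpoint.ask[Boolean](r).onFailure {
  case e =>
    // Warn without the full stack trace, so this benign stop-time race does
    // not look like an application failure in the driver logs.
    logWarning(s"Attempted to remove an executor after the backend stopped: ${e.getMessage}")
}(ThreadUtils.sameThread)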
How was this patch tested?
I verified it manually; I don't see these exceptions with the PR changes.