
[SPARK-29965][core] Ensure that killed executors don't re-register with driver. #26630

Closed
wants to merge 2 commits

Conversation

vanzin
Contributor

@vanzin commented Nov 21, 2019

There are 3 different issues that cause the same underlying problem: an executor
that the driver has killed during downscaling registers back with the block
manager in the driver, and the block manager from that point on keeps trying
to contact the dead executor.

The first one is that the heartbeat receiver was asking unknown executors to
re-register when receiving a heartbeat. That code path is only really hit
after the driver has killed the executor, so there's no reason to ask it to
re-register.
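
To make the intent concrete, here's a minimal sketch of the behavior change
(illustrative names only, not Spark's actual HeartbeatReceiver code): a
heartbeat from an executor the receiver doesn't know about is acknowledged
without asking it to re-register its block manager.

```scala
import scala.collection.mutable

case class Heartbeat(executorId: String)
case class HeartbeatResponse(reregisterBlockManager: Boolean)

class SimpleHeartbeatReceiver {
  private val knownExecutors = mutable.Set[String]()

  def addExecutor(id: String): Unit = knownExecutors += id
  def removeExecutor(id: String): Unit = knownExecutors -= id

  def receiveHeartbeat(hb: Heartbeat): HeartbeatResponse = {
    if (knownExecutors.contains(hb.executorId)) {
      // Known executor: process the heartbeat normally.
      HeartbeatResponse(reregisterBlockManager = false)
    } else {
      // Unknown executor: this used to answer with reregisterBlockManager = true,
      // bringing a killed executor back to life in the block manager. Since this
      // path is only hit after the driver has killed the executor, don't ask.
      HeartbeatResponse(reregisterBlockManager = false)
    }
  }
}
```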

The second one is a race between the heartbeat receiver and the DAG scheduler.
Both received notifications of an executor's addition and removal
asynchronously (the former via the listener bus and an async local RPC, the
latter via its own separate internal message queue). This led to situations
where they disagreed about which executors were really alive. The change makes
the heartbeat receiver update first, and only once that's done does the DAG
scheduler update itself. This ensures the heartbeat receiver knows which
executors not to ask to re-register.
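
A hedged sketch of the ordering idea follows; the class and method names are
made up for illustration (the real change wires this through Spark's scheduler
and heartbeat plumbing). The point is simply that the DAG scheduler is only
notified after the heartbeat receiver's update has completed.

```scala
import scala.concurrent.{ExecutionContext, Future}

trait ExecutorLifecycleListener {
  def executorAdded(execId: String): Unit
  def executorRemoved(execId: String): Unit
}

class OrderedExecutorEvents(
    heartbeatReceiver: ExecutorLifecycleListener,
    dagScheduler: ExecutorLifecycleListener)(implicit ec: ExecutionContext) {

  // Previously both components learned about executor changes over independent
  // async channels and could observe them in different orders. Chaining the
  // notifications enforces a single order: heartbeat receiver, then DAG scheduler.
  def onExecutorRemoved(execId: String): Future[Unit] =
    Future(heartbeatReceiver.executorRemoved(execId))
      .map(_ => dagScheduler.executorRemoved(execId))

  def onExecutorAdded(execId: String): Future[Unit] =
    Future(heartbeatReceiver.executorAdded(execId))
      .map(_ => dagScheduler.executorAdded(execId))
}
```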

The third one is that the block manager couldn't differentiate between an
unknown executor (like one that's been removed) and an executor that needs to
re-register (like one the scheduler decided to unregister because of too many
fetch failures). The change adds code to the block manager master to track
which executors have been removed, so that instead of asking them to
re-register, it just ignores them.
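
A minimal sketch of the distinction the block manager master now makes
(simplified types, not the actual BlockManagerMasterEndpoint code): executors
the driver removed are remembered and ignored, while an unknown-but-alive
executor can still be asked to re-register.

```scala
import scala.collection.mutable

sealed trait HeartbeatAction
case object Accept extends HeartbeatAction           // known executor: normal path
case object Ignore extends HeartbeatAction           // removed by the driver: drop it
case object AskToReregister extends HeartbeatAction  // unknown but legitimate

class SimpleBlockManagerMaster {
  private val registered = mutable.Set[String]()
  private val removed = mutable.Set[String]()  // executors the driver has removed

  def registerExecutor(execId: String): Unit = {
    registered += execId
    removed -= execId
  }

  def removeExecutor(execId: String): Unit = {
    registered -= execId
    removed += execId
  }

  def onBlockManagerHeartbeat(execId: String): HeartbeatAction =
    if (registered.contains(execId)) Accept
    else if (removed.contains(execId)) Ignore  // the fix: never ask it to come back
    else AskToReregister                       // e.g. dropped after fetch failures
}
```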

While there I simplified the executor shutdown a bit, since it was doing some
unnecessary work.

Tested with existing unit tests, and by repeatedly running workloads on k8s
with dynamic allocation. Previously I'd hit these different issues somewhat
often; with the fixes I'm no longer able to reproduce them.

@SparkQA

SparkQA commented Nov 21, 2019

Test build #114255 has finished for PR 26630 at commit 313a6bf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class RemoveExecutor(execId: String, removedFromSpark: Boolean)

@dongjoon-hyun
Member

Retest this please.

@SparkQA

SparkQA commented Dec 15, 2019

Test build #115345 has finished for PR 26630 at commit 313a6bf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class RemoveExecutor(execId: String, removedFromSpark: Boolean)

@tgravescs
Contributor

@vanzin I see this is old but you just updated it; I assume no one got to it before and it's good to review?
If so I'll take a look early next week.

@SparkQA

SparkQA commented Jan 24, 2020

Test build #117362 has finished for PR 26630 at commit ed0b6e9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Retest this please.

@SparkQA

SparkQA commented Feb 26, 2020

Test build #118935 has finished for PR 26630 at commit ed0b6e9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

cc @dbtsai and @holdenk

@dongjoon-hyun
Member

Retest this please.

@SparkQA

SparkQA commented Apr 2, 2020

Test build #120694 has finished for PR 26630 at commit ed0b6e9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin vanzin closed this Aug 26, 2020
@AdamGS

AdamGS commented Jun 20, 2021

@dongjoon-hyun I would love to pick this back up and make sure it compiles/works with the most recent code. Is that OK?
