[SPARK-29965][core] Ensure that killed executors don't re-register with driver. #26630

…th driver.

There are 3 different issues that cause the same underlying problem: an executor that the driver has killed during downscaling registers back with the block manager in the driver, and from that point on the block manager keeps trying to contact the dead executor.

The first one is that the heartbeat receiver was asking unknown executors to re-register when receiving a heartbeat. That code path only really happens when the executor dies because the driver killed it, so there's no reason to re-register.

The second one is a race between the heartbeat receiver and the DAG scheduler. Both received notifications of an executor's addition and removal asynchronously (the first one via the listener bus *and* an async local RPC, the second via its own separate internal message queue). This led to situations where they disagreed about which executors were really alive. The change makes it so the heartbeat receiver is updated first, and only once that's done does the DAG scheduler update itself. This ensures the heartbeat receiver knows which executors not to ask to re-register.

The third one is that the block manager couldn't differentiate between an unknown executor (like one that's been removed) and an executor that needs to re-register (like one the scheduler decided to unregister because of too many fetch failures). The change adds code in the block manager master to track which executors have been removed, so that instead of asking them to re-register, it just ignores them.

While there I simplified the executor shutdown a bit since it was doing some stuff unnecessarily.

Tested with existing unit tests, and by repeatedly running workloads on k8s with dynamic allocation; previously I'd hit these different issues somewhat often, and with the fixes I'm not able to reproduce them.
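To make the third fix concrete, here is a rough toy model of the idea of tracking removed executors separately from merely unregistered ones. It is a sketch only and not the code in this PR: `ToyBlockManagerMaster`, `HeartbeatReply`, and every method name are hypothetical, and rejecting a registration attempt from a removed executor is an assumption made for illustration.

```python
from enum import Enum


class HeartbeatReply(Enum):
    OK = "ok"                  # executor is registered; nothing to do
    REREGISTER = "reregister"  # executor is unknown but may register again
    IGNORE = "ignore"          # executor was removed; do not invite it back


class ToyBlockManagerMaster:
    """Hypothetical stand-in for the driver-side bookkeeping, not Spark's class."""

    def __init__(self):
        self.registered = set()  # executors currently registered
        self.removed = set()     # executors removed for good (e.g. killed)

    def register(self, executor_id):
        # Assumption: a removed executor is never accepted back.
        if executor_id in self.removed:
            return False
        self.registered.add(executor_id)
        return True

    def unregister(self, executor_id):
        # Temporary unregistration (e.g. too many fetch failures):
        # the executor is still alive and may re-register later.
        self.registered.discard(executor_id)

    def remove(self, executor_id):
        # Permanent removal: remember the ID so later heartbeats are ignored.
        self.registered.discard(executor_id)
        self.removed.add(executor_id)

    def heartbeat(self, executor_id):
        if executor_id in self.registered:
            return HeartbeatReply.OK
        if executor_id in self.removed:
            return HeartbeatReply.IGNORE
        return HeartbeatReply.REREGISTER


# Usage: one executor is unregistered, another is removed.
master = ToyBlockManagerMaster()
master.register("exec-1")
master.register("exec-2")
master.unregister("exec-1")
master.remove("exec-2")
reply_unregistered = master.heartbeat("exec-1")  # asked to re-register
reply_removed = master.heartbeat("exec-2")       # ignored
```

Without the `removed` set, both executors in the usage example would look identical to the master (simply unknown), which is the ambiguity the description attributes to the old behaviour.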

Commits on Jan 24, 2020

Merge branch 'master' into SPARK-29965

vanzin committed Jan 24, 2020

Commit ed0b6e9

Commits on Nov 21, 2019
