[SPARK-29965][core] Ensure that killed executors don't re-register with driver. #26630

Closed
wants to merge 2 commits into from

Commits on Nov 21, 2019

  1. [SPARK-29965][core] Ensure that killed executors don't re-register with driver.
    
    There are 3 different issues that cause the same underlying problem: an executor
    that the driver has killed during downscaling registers back with the block
    manager in the driver, and the block manager from that point on keeps trying
    to contact the dead executor.
    
    The first one is that the heartbeat receiver was asking unknown executors to
    re-register when receiving a heartbeat. That code path only really happens
    when the executor dies because of a driver killing it, so there's no reason
    to re-register.
    
    The second one is a race between the heartbeat receiver and the DAG scheduler.
    Both received notifications of an executor's addition and removal
    asynchronously (the first one via the listener bus *and* an async local RPC,
    the second via its own separate internal message queue). This led to
    situations where they disagreed about which executors were really alive; the
    change makes it so the heartbeat receiver is updated first, and once that's
    done, then the DAG scheduler can update itself. This ensures the heartbeat
    receiver knows which executors not to ask to re-register.
    
    The third one is because the block manager couldn't differentiate between
    an unknown executor (like one that's been removed) and an executor that needs
    to re-register (like one the scheduler decided to unregister because of
    too many fetch failures). The change adds code in the block manager master to
    track which executors have been removed, so that instead of asking them to
    re-register, it just ignores them.
    
    While there, I simplified the executor shutdown a bit, since it was doing
    some unnecessary work.
    
    Tested with existing unit tests, and by repeatedly running workloads on k8s
    with dynamic allocation. Previously I'd hit these different issues somewhat
    often; with the fixes I'm not able to reproduce them.
    Marcelo Vanzin committed Nov 21, 2019
    313a6bf
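
The three fixes described in the commit message above can be illustrated with a small model. This is only a sketch of the stated intent, not the code in the patch: `Reply`, `BlockManagerMasterModel`, `DriverModel`, and every method name are hypothetical, and the real Spark components communicate over RPC and message queues rather than direct calls. It assumes that a removed executor is remembered in a set, that a heartbeat from an executor the heartbeat receiver does not know is ignored, and that the heartbeat receiver is always updated before the DAG scheduler.

```python
from dataclasses import dataclass, field
from enum import Enum


class Reply(Enum):
    OK = "ok"                  # known executor, nothing to do
    REREGISTER = "reregister"  # executor should register its block manager again
    IGNORE = "ignore"          # executor was removed on purpose; drop the message


@dataclass
class BlockManagerMasterModel:
    """Hypothetical stand-in for the driver-side block manager bookkeeping."""
    registered: set = field(default_factory=set)
    removed: set = field(default_factory=set)

    def register(self, executor_id: str) -> bool:
        # Third fix: an executor the driver removed must not register again.
        if executor_id in self.removed:
            return False
        self.registered.add(executor_id)
        return True

    def unregister(self, executor_id: str) -> None:
        # E.g. too many fetch failures: forget it, but allow re-registration.
        self.registered.discard(executor_id)

    def remove_executor(self, executor_id: str) -> None:
        # Killed by the driver: forget it and remember that it is gone.
        self.registered.discard(executor_id)
        self.removed.add(executor_id)

    def on_heartbeat(self, executor_id: str) -> Reply:
        if executor_id in self.removed:
            return Reply.IGNORE
        if executor_id in self.registered:
            return Reply.OK
        return Reply.REREGISTER


@dataclass
class DriverModel:
    """Hypothetical driver that updates its components in a fixed order."""
    block_manager: BlockManagerMasterModel = field(
        default_factory=BlockManagerMasterModel)
    heartbeat_known: set = field(default_factory=set)
    scheduler_known: set = field(default_factory=set)
    events: list = field(default_factory=list)

    def add_executor(self, executor_id: str) -> None:
        # Second fix: heartbeat receiver first, DAG scheduler second.
        self.heartbeat_known.add(executor_id)
        self.events.append(("heartbeat_receiver", "add", executor_id))
        self.scheduler_known.add(executor_id)
        self.events.append(("dag_scheduler", "add", executor_id))
        self.block_manager.register(executor_id)

    def kill_executor(self, executor_id: str) -> None:
        # Same ordering on removal, so the two views cannot disagree.
        self.heartbeat_known.discard(executor_id)
        self.events.append(("heartbeat_receiver", "remove", executor_id))
        self.scheduler_known.discard(executor_id)
        self.events.append(("dag_scheduler", "remove", executor_id))
        self.block_manager.remove_executor(executor_id)

    def heartbeat(self, executor_id: str) -> Reply:
        # First fix: an unknown executor is not asked to re-register.
        if executor_id not in self.heartbeat_known:
            return Reply.IGNORE
        return self.block_manager.on_heartbeat(executor_id)
```

In this model, an executor that was only unregistered (the fetch-failure case) is still told to re-register on its next heartbeat, while one removed through `kill_executor` is ignored and its later `register` call is refused.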

Commits on Jan 24, 2020

  1. ed0b6e9