
[SPARK-32613][CORE] Fix regressions in DecommissionWorkerSuite #29422

Closed
wants to merge 4 commits

Commits on Aug 17, 2020

  1. [CORE] Fix regressions in decommissioning

    The DecommissionWorkerSuite started becoming flaky, and it revealed a real regression. Recent PRs (apache#28085 and apache#29211) necessitate a small reworking of the decommissioning logic.
    
    Before getting into that, let me describe the intended behavior of decommissioning:
    
    If a fetch failure happens where the source executor was decommissioned, we want to treat that as an eager signal to clear all shuffle state associated with that executor. In addition, if we know that the host was decommissioned, we want to forget all map statuses from all other executors on that decommissioned host. This is the invariant that the test "decommission workers ensure that fetch failures lead to rerun" exercises, and it is important to ensure that decommissioning a host does not lead to multiple fetch failures that might fail the job.
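
    To make the intended invariant concrete, here is a minimal, self-contained Scala sketch. It is not the actual Spark code: ShuffleState, FetchFailure, and the helper names are illustrative assumptions (the real logic lives in the DAGScheduler and MapOutputTracker).

```scala
// Illustrative sketch only: simplified stand-ins for Spark's shuffle bookkeeping.
object DecommissionFetchFailureSketch {
  final case class FetchFailure(execId: String, host: String)

  final class ShuffleState {
    // (executorId, host) -> map output ids, a stand-in for registered MapStatuses.
    private var outputs = Map.empty[(String, String), Set[Int]]

    def register(execId: String, host: String, mapId: Int): Unit = {
      val key = (execId, host)
      outputs = outputs.updated(key, outputs.getOrElse(key, Set.empty) + mapId)
    }
    def clearExecutor(execId: String): Unit =
      outputs = outputs.filterNot { case ((e, _), _) => e == execId }
    def clearHost(host: String): Unit =
      outputs = outputs.filterNot { case ((_, h), _) => h == host }
    def size: Int = outputs.valuesIterator.map(_.size).sum
  }

  // On a fetch failure from a decommissioned executor, eagerly drop its shuffle
  // state; if the whole host was decommissioned, drop every executor on that host.
  def onFetchFailure(state: ShuffleState, failure: FetchFailure,
                     execDecommissioned: Boolean, hostDecommissioned: Boolean): Unit = {
    if (hostDecommissioned) state.clearHost(failure.host)
    else if (execDecommissioned) state.clearExecutor(failure.execId)
  }

  def main(args: Array[String]): Unit = {
    val state = new ShuffleState
    state.register("exec-1", "host-A", mapId = 0)
    state.register("exec-2", "host-A", mapId = 1)
    state.register("exec-3", "host-B", mapId = 2)
    onFetchFailure(state, FetchFailure("exec-1", "host-A"),
      execDecommissioned = true, hostDecommissioned = true)
    // Only host-B's output survives: the stage reruns once instead of hitting
    // repeated fetch failures against the decommissioned host.
    assert(state.size == 1)
  }
}
```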
    
    - Per apache#29211, executors now eagerly exit on decommissioning, so the executor is lost before the fetch failure even happens. (I verified this by waiting a few seconds before triggering the fetch failure.) When an executor is lost, we forget its decommissioning information. The fix is to keep the decommissioning information around for some time after removal, with some extra logic to finally purge it after a timeout (see the first sketch after this list).
    
    - Per apache#28085, when the executor is lost, we forget the shuffle state for just that executor and increment the shuffleFileLostEpoch. That increment later precludes clearing the state of the entire host when the fetch failure happens. This PR elects to change this code path only for the special case of decommissioning, without any other side effects. This epoch-keeping logic is complex and has effectively not changed semantically since 2013! The fix here is also simple: ignore the shuffleFileLostEpoch when the shuffle status is being cleared due to a fetch failure resulting from host decommission (see the second sketch after this list).
    
    These two fixes are local to decommissioning only and don't change other behavior.
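
    A minimal sketch of the first fix, under assumed names (DecommissionInfoCache, retentionMs, and the method names are hypothetical, not Spark's API): decommission information survives executor removal for a grace period and is only purged once a timeout elapses.

```scala
import scala.collection.mutable

// Illustrative sketch: retain decommission info past executor removal, purge on timeout.
final class DecommissionInfoCache(retentionMs: Long,
                                  clock: () => Long = () => System.currentTimeMillis()) {
  private val active = mutable.Map.empty[String, String]           // execId -> host
  private val removed = mutable.Map.empty[String, (String, Long)]  // execId -> (host, removedAt)

  def executorDecommissioned(execId: String, host: String): Unit =
    active(execId) = host

  // Instead of dropping the entry when the executor is lost, remember when it left.
  def executorRemoved(execId: String): Unit =
    active.remove(execId).foreach(host => removed(execId) = (host, clock()))

  // Still answerable after the executor is gone, so a late fetch failure can be
  // attributed to decommissioning and the host's shuffle state cleared eagerly.
  def decommissionedHostOf(execId: String): Option[String] =
    active.get(execId).orElse(removed.get(execId).map(_._1))

  // Called periodically (or lazily) to forget entries older than the timeout.
  def purgeExpired(): Unit = {
    val now = clock()
    val expired = removed.collect { case (id, (_, at)) if now - at >= retentionMs => id }
    expired.foreach(removed.remove)
  }
}
```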
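
    And a minimal sketch of the second fix, again with hypothetical names: the epoch check that normally suppresses clearing shuffle files twice for the same executor is bypassed when the clearing is driven by a fetch failure from a decommissioned host, so the remaining state for the whole host can still be dropped.

```scala
import scala.collection.mutable

// Illustrative sketch of the epoch bypass; not the real DAGScheduler bookkeeping.
final class ShuffleFileLostTracker {
  private val shuffleFileLostEpoch = mutable.Map.empty[String, Long]  // execId -> epoch handled
  private var currentEpoch = 0L

  def newEpoch(): Unit = currentEpoch += 1

  // Returns true if shuffle files for this executor should be cleared now.
  def shouldClearShuffleFiles(execId: String, ignoreEpoch: Boolean): Boolean = {
    val alreadyHandled = shuffleFileLostEpoch.get(execId).exists(_ >= currentEpoch)
    if (ignoreEpoch || !alreadyHandled) {
      shuffleFileLostEpoch(execId) = currentEpoch
      true
    } else {
      false
    }
  }
}

object ShuffleFileLostTrackerDemo extends App {
  val tracker = new ShuffleFileLostTracker
  // Executor loss already cleared this executor's files in the current epoch...
  assert(tracker.shouldClearShuffleFiles("exec-1", ignoreEpoch = false))
  assert(!tracker.shouldClearShuffleFiles("exec-1", ignoreEpoch = false))
  // ...but a fetch failure from a decommissioned host still forces a clear.
  assert(tracker.shouldClearShuffleFiles("exec-1", ignoreEpoch = true))
}
```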
    
    Also added some more tests to TaskSchedulerImpl to ensure that the decommissioning information is indeed purged after a timeout.
    
    Also hardened the test DecommissionWorkerSuite to make it wait for successful job completion.
    dagrawal3409 committed Aug 17, 2020 (dfc124c)
  2. 11e4adc
  3. a8cd1de
  4. @cloudfan's comments

    dagrawal3409 committed Aug 17, 2020 (df128e5)