[SPARK-44609][K8S] Remove executor pod from PodsAllocator if it was removed from scheduler backend #42297

Closed
alibiyeslambek wants to merge 2 commits into apache:master from alibiyeslambek:spark-44609

Conversation

@alibiyeslambek

What changes were proposed in this pull request?

Remove an executor pod from schedulerKnownNewlyCreatedExecs in ExecutorPodsAllocator if the schedulerBackend no longer knows about that executor.

Why are the changes needed?

If an executor was created and removed within a short period, it is possible that the creation was not captured by any snapshot. In this case, we should remove the executor from the schedulerKnownNewlyCreatedExecs list; otherwise it will get stuck in the list and a new executor will never be requested.

(This can often happen if the watch connection is closed for some reason and the driver has to fall back on ExecutorPodsPollingSnapshotSource.)

Please check the JIRA issue for more details.

Does this PR introduce any user-facing change?

No

How was this patch tested?

  • Unit test: test("SPARK-44609: Do not track an executor if it was removed from scheduler backend.")
  • Tested locally by disabling informers and manually killing executor pods, then observing the new ones coming up within the spark.kubernetes.executor.apiPollingInterval (30s)

```scala
schedulerKnownNewlyCreatedExecs ++=
  newlyCreatedExecutors.filterKeys(schedulerKnownExecs.contains(_)).mapValues(_._1)
newlyCreatedExecutors --= schedulerKnownNewlyCreatedExecs.keySet

// If executor was created and removed in a short period, then it is possible that the creation
```

Could you elaborate a little more about this situation?

If executor was created and removed in a short period


@dongjoon-hyun this is happening when the underlying node is decommissioned in Kubernetes. If that happens shortly after the executor is scheduled on that node, we hit the bug in question.


Hey @dongjoon-hyun, sorry for the late reply. Here's an example of a failure scenario:

  • The driver requests an executor
  • exec-1 gets created and registers with the driver
  • exec-1 is moved from newlyCreatedExecutors to schedulerKnownNewlyCreatedExecs
  • exec-1 is deleted very quickly (~1-30 sec) after registration
  • ExecutorPodsWatchSnapshotSource fails to catch the creation of the pod (e.g. the websocket connection was reset, k8s-apiserver was down, etc.)
  • ExecutorPodsPollingSnapshotSource also misses the creation because it runs every 30 secs, while the executor was removed much sooner after creation
  • exec-1 is never removed from schedulerKnownNewlyCreatedExecs
  • ExecutorPodsAllocator never requests a new executor because its slot is occupied by exec-1, since schedulerKnownNewlyCreatedExecs is never cleared
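The failure mode above suggests the bookkeeping fix: whenever the allocator reconciles against the scheduler backend, any entry in schedulerKnownNewlyCreatedExecs that the backend no longer knows about should be dropped. A minimal standalone sketch under stated assumptions — this models only the two collections discussed in the PR, not the actual ExecutorPodsAllocator class, and the Long timestamp payload and reconcile signature are simplifications:

```scala
import scala.collection.mutable

// Hypothetical, simplified model of the allocator's bookkeeping; the field
// names follow the PR discussion, the payloads and structure do not.
object AllocatorSketch {
  // execId -> creation timestamp (stand-in for the real payload)
  val newlyCreatedExecutors = mutable.Map[Long, Long]()
  // execs registered with the scheduler backend but not yet seen in a pod snapshot
  val schedulerKnownNewlyCreatedExecs = mutable.Set[Long]()

  // Called on each snapshot; schedulerKnownExecs comes from the scheduler backend.
  def reconcile(schedulerKnownExecs: Set[Long]): Unit = {
    // Existing behavior: move execs the backend knows about out of newlyCreatedExecutors.
    val registered = newlyCreatedExecutors.keySet.filter(schedulerKnownExecs.contains).toSet
    schedulerKnownNewlyCreatedExecs ++= registered
    newlyCreatedExecutors --= registered

    // The proposed fix: drop execs the backend no longer knows about, so a pod
    // created and deleted between snapshots does not occupy a slot forever.
    val stale = schedulerKnownNewlyCreatedExecs.toSet -- schedulerKnownExecs
    schedulerKnownNewlyCreatedExecs --= stale
  }
}
```

In the reported scenario, exec-1 lands in schedulerKnownNewlyCreatedExecs, is deleted before any snapshot sees it, and the next reconcile with a backend set that no longer contains exec-1 frees its slot instead of leaving it stuck.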

@LuciferYang LuciferYang changed the title [SPARK-44609][KUBERNETES] Remove executor pod from PodsAllocator if it was removed from scheduler backend [SPARK-44609][K8S] Remove executor pod from PodsAllocator if it was removed from scheduler backend Jan 29, 2024
@github-actions

github-actions bot commented May 9, 2024

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label May 9, 2024
@github-actions github-actions bot closed this May 10, 2024