
[SPARK-24432] Support dynamic allocation without external shuffle service #24083

Closed · wants to merge 8 commits

Conversation

@lwwmanning (Contributor) commented Mar 13, 2019

What changes were proposed in this pull request?

This PR adds a limited version of dynamic allocation that does not require the external shuffle service, and thus works on Kubernetes (but does not support preemption).

The basic approach is to track which executors are holding shuffle files, and of those, which have shuffle files depended on by active stages. If an executor contains shuffle files depended on by an active stage, then we treat it as "active" (i.e., prevent the ExecutorAllocationManager from marking it as "idle"). If an executor contains only shuffle files that are not dependencies of active stages, then we treat those shuffle files similarly to cached data (i.e., configurable idle timeout that defaults to "infinity").
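
A minimal sketch of the bookkeeping this implies (hypothetical names; the PR's actual classes and signatures differ, and real stage/shuffle lifecycle handling is more involved):

```scala
import scala.collection.mutable

class ShuffleFileTracker {
  // executor id -> ids of shuffles whose map output files live on that executor
  private val shufflesOnExecutor = mutable.Map.empty[String, mutable.Set[Int]]
  // ids of shuffles that some active stage still depends on
  private val activeShuffles = mutable.Set.empty[Int]

  def registerShuffleFile(executorId: String, shuffleId: Int): Unit =
    shufflesOnExecutor.getOrElseUpdate(executorId, mutable.Set.empty[Int]) += shuffleId

  def stageStarted(shuffleDeps: Seq[Int]): Unit = activeShuffles ++= shuffleDeps

  // Simplification: ignores the case of several active stages sharing a shuffle.
  def stageFinished(shuffleDeps: Seq[Int]): Unit = activeShuffles --= shuffleDeps

  // "Active": holds shuffle files an active stage depends on, so the
  // ExecutorAllocationManager must not mark this executor as idle.
  def hasActiveShuffleFiles(executorId: String): Boolean =
    shufflesOnExecutor.get(executorId).exists(_.exists(activeShuffles))

  // Holds shuffle files, but none needed by an active stage; treated like
  // cached data, i.e. subject to a configurable (default infinite) idle timeout.
  def hasOnlyInactiveShuffleFiles(executorId: String): Boolean =
    shufflesOnExecutor.get(executorId).exists(ids => ids.nonEmpty && !ids.exists(activeShuffles))
}
```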

We also introduce the concept of "shuffle biased task scheduling", a heuristic that schedules tasks to maximize the effectiveness of dynamic allocation. It attempts to minimize the number of executors holding (active) shuffle files by packing as many tasks as possible onto "active" executors first, then onto executors with only inactive shuffle files, and finally onto all remaining executors.
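
Using the tracker sketched above, the ordering amounts to a three-tier preference (again a hypothetical helper, not the PR's actual API):

```scala
// Orders candidate executors so tasks pack onto executors that are already
// pinned by active shuffle files, keeping the remaining executors free to
// be scaled down by dynamic allocation.
def shuffleBiasedOrder(tracker: ShuffleFileTracker, executors: Seq[String]): Seq[String] =
  executors.sortBy { execId =>
    if (tracker.hasActiveShuffleFiles(execId)) 0            // already pinned by active shuffles
    else if (tracker.hasOnlyInactiveShuffleFiles(execId)) 1 // holds only stale shuffle files
    else 2                                                  // shuffle-free; cheapest to release
  }
```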

This is a port of a series of PRs from Palantir's Spark fork: palantir#427 palantir#445 palantir#446 palantir#447

Partially addresses https://issues.apache.org/jira/browse/SPARK-24432

cc: @rynorris @mccheah @robert3005

How was this patch tested?

We added tests as part of the PR and did manual testing on small YARN and k8s clusters (partially documented on palantir#446). We then successfully rolled this out to a small subset of production workloads at Palantir, running entirely on Kubernetes.


lwwmanning and others added 4 commits March 13, 2019 17:07
Allows dynamically scaling executors up and down without external shuffle service. Tracks shuffle locations to know if executors can be safely scaled down.
@mcheah commented Mar 13, 2019

@mccheah

@lwwmanning (Contributor, Author)

whoops, thanks @mcheah, sorry for the mis-tag!

@mccheah (Contributor) commented Mar 13, 2019

ok to test

@mccheah (Contributor) commented Mar 13, 2019

cc @vanzin @rxin. Would a feature like this require an SPIP?

@SparkQA commented Mar 13, 2019

Test build #103446 has finished for PR 24083 at commit 42de6cf.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 13, 2019

Test build #103447 has finished for PR 24083 at commit 740632d.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 13, 2019

Test build #103449 has finished for PR 24083 at commit 0fe8cdb.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@markhamstra (Contributor)

@mccheah it does feel to me like more discussion is required than is appropriate for a PR. Whether that means a separate JIRA ticket or a SPIP, I'm undecided.

@SparkQA commented Mar 13, 2019

Test build #103451 has finished for PR 24083 at commit 9318f99.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor) commented Mar 13, 2019

I don't think this needs an SPIP (it's all internal changes), but agree that it should be its own separate bug in JIRA.

It would also be better to separate the k8s-side changes into a different bug/PR, since those fix a different problem than the other changes.

SPARK-24432 is an umbrella bug, so you could just create sub-tasks for it.

@vanzin (Contributor) commented Mar 15, 2019

Ping?

I'm seeing two or three separate PRs here; the ExecutorAllocationManager changes + tracking of active stages being the most obvious. "Shuffle biased scheduling" builds on top of the active stage tracking, but feels separate from the dynamic allocation changes. And the k8s changes definitely should be a separate PR.

Also, instead of adding YARN tests, I'd do that in core. YarnClusterSuite does not run when you don't change YARN code, so it often doesn't run when changes are made to the core code in this area. Using a local-cluster[...] for this works fine (I've been using it in some code I've been playing with), and even runs a bit faster than tests in the YARN suite.
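
A sketch of the kind of core-side test this suggests, assuming the internal `spark.dynamicAllocation.testing` flag (used by Spark's own tests to bypass the external-shuffle-service requirement when validating settings):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// local-cluster[numExecutors, coresPerExecutor, memoryPerExecutorMB] launches
// real separate executor processes, unlike plain local[N].
val conf = new SparkConf()
  .setMaster("local-cluster[2, 1, 1024]")
  .setAppName("dynamic-allocation-without-ess-test")
  .set("spark.dynamicAllocation.enabled", "true")
  // Internal flag so ExecutorAllocationManager accepts these settings
  // without an external shuffle service.
  .set("spark.dynamicAllocation.testing", "true")

val sc = new SparkContext(conf)
try {
  // A shuffle-producing job, so executors end up holding shuffle files.
  sc.parallelize(1 to 1000, 8).map(i => (i % 10, i)).reduceByKey(_ + _).count()
} finally {
  sc.stop()
}
```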

@vanzin (Contributor) commented Mar 26, 2019

If there's not going to be any activity here I'll just close this PR and let someone else handle this if they want...

@lwwmanning (Contributor, Author)

Apologies, I was on vacation last week.

I'll file separate sub-issues in JIRA for both the k8s changes and dynamic-allocation-without-ESS.

I will close this PR and split it into three separate PRs, in order: (1) k8s changes, (2) ExecutorAllocationManager changes / active stage tracking (with additional tests in core), (3) shuffle biased task scheduling.

@lwwmanning closed this Mar 26, 2019
@ScrapCodes (Member)

@lwwmanning Hi, any progress on this work? Do you want some help?

@ScrapCodes (Member)

@lwwmanning and @vanzin, there are no updates here or on the issue. IMO this approach is worth considering; it may have its downsides, but having it as an option for Spark on Kubernetes seems reasonable to me. Shall I take this work forward from here?

@vanzin (Contributor) commented May 28, 2019

I agree that this could be useful, and you're free to open a PR if you want. But I actually do have an implementation of this that is based on some code that is currently under review, and I plan to submit it when the PR it depends on is merged.

@ScrapCodes (Member) commented Aug 5, 2019

Hi @vanzin, do you have an update? Could you please link the PR you mentioned here?

@vanzin (Contributor) commented Aug 5, 2019

SPARK-27963
