
[SPARK-42261][SPARK-42260][K8S] Log Allocation Stalls and Trigger Allocation event without blocking on snapshot #39825

Closed

Conversation

@holdenk (Contributor) commented Jan 31, 2023

What changes were proposed in this pull request?

Log Allocation Stalls when we are unable to allocate any pods (but wish to) during a K8s snapshot event (see the sketch after this list).
Trigger Allocation event without blocking on snapshot provided that there is enough room in maxPendingPods.
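
As a rough illustration of the stall-logging idea in the first item, here is a minimal standalone sketch. Only the stalledStartTime name comes from the PR's test code; the class, method shape, and log messages are assumptions, not the actual ExecutorPodsAllocator change.

import java.time.{Duration, Instant}

// Standalone sketch only: the real change lives in ExecutorPodsAllocator.
class StallTrackerSketch(log: String => Unit) {
  // null while not stalled, matching the `stalledStartTime == null` assertion in the tests.
  private[this] var stalledStartTime: Instant = null

  def onSnapshot(wantedPods: Int, newlyAllocatedPods: Int): Unit = {
    if (wantedPods > 0 && newlyAllocatedPods == 0) {
      if (stalledStartTime == null) {
        stalledStartTime = Instant.now()
        log(s"Pod allocation appears stalled: want $wantedPods pods but could not allocate any.")
      }
    } else if (stalledStartTime != null) {
      log(s"Pod allocation resumed after ${Duration.between(stalledStartTime, Instant.now())}.")
      stalledStartTime = null
    }
  }
}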

Why are the changes needed?

Spark on K8s dynamic allocation can be difficult to debug and is prone to stalling in heavily loaded clusters, and waiting for snapshot events adds an unnecessary delay to pod allocation.

Does this PR introduce any user-facing change?

New log messages and faster pod scale-up.

How was this patch tested?

Modified an existing test to verify that we both trigger allocation while pods are pending and track when we are stalled.

@@ -141,9 +143,26 @@ class ExecutorPodsAllocator(
totalExpectedExecutorsPerResourceProfileId.put(rp.id, numExecs)
}
logDebug(s"Set total expected execs to $totalExpectedExecutorsPerResourceProfileId")
-    if (numOutstandingPods.get() == 0) {
+    if (numOutstandingPods.get() < maxPendingPods) {
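
In isolation, the gating change in this hunk amounts to the following sketch (assumed field values, not the real ExecutorPodsAllocator):

import java.util.concurrent.atomic.AtomicInteger

object AllocationGateSketch {
  private val numOutstandingPods = new AtomicInteger(0)
  private val maxPendingPods = Int.MaxValue // default of KUBERNETES_MAX_PENDING_PODS

  // Old behaviour: only start a new allocation round when nothing is outstanding.
  def shouldTriggerAllocationBefore(): Boolean = numOutstandingPods.get() == 0

  // Proposed behaviour: start one whenever there is headroom under maxPendingPods.
  def shouldTriggerAllocationAfter(): Boolean = numOutstandingPods.get() < maxPendingPods
}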
@attilapiros (Contributor) commented on the diff:

The default of KUBERNETES_MAX_PENDING_PODS is Int.MaxValue (to keep the old behaviour from before it was introduced), and the main intention of numOutstandingPods was to slow down upscaling at very steep peaks:
https://github.com/apache/spark/blob/b5b40113a64b4dbbcd4efe86da4409f2be8e9c56/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocator.scala#L397-L399

What about using the allocation batch size (or rather a factor of it, as a sensible lower limit)?
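
A tiny sketch of that suggestion, assuming the stock defaults (Int.MaxValue for KUBERNETES_MAX_PENDING_PODS, 5 for spark.kubernetes.allocation.batch.size); the object and method names are illustrative only:

object BatchSizeGateSketch {
  private val maxPendingPods = Int.MaxValue
  private val podAllocationSize = 5

  // Gate on the smaller of the two limits so a huge maxPendingPods default
  // does not let outstanding pods grow unbounded between snapshots.
  def hasAllocationHeadroom(numOutstandingPods: Int): Boolean =
    numOutstandingPods < math.min(maxPendingPods, podAllocationSize)
}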

@holdenk (Contributor Author) replied:

That seems reasonable to me; I'll change it over.

@dongjoon-hyun (Member) left a comment:

Sorry, but I'm -1 for SPARK-42261 ("K8s will not allocate more execs if there are any pending execs until next snapshot") because that was a good feature to prevent explosions of pending resources (pods and dependent resources like PVCs), which cause EKS control plane congestion and waste money.

@holdenk (Contributor Author) commented Feb 6, 2023

Ok @dongjoon-hyun, I hear the -1 concern around excessive scale-up, but blocking scale-up on the delivery of snapshots (which is what we do today) seems like kind of a hack. What about if we use "allocation batch size" to gate it like @attilapiros suggested?

The other option would be to add this as a feature flag that is off by default (e.g. enableAllocationWithPendingPods, set to false by default).
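
A hypothetical sketch of what such an opt-in flag could look like; the key name, doc text, and version below are placeholders rather than this PR's actual config, and since ConfigBuilder is Spark-internal the definition would live in Spark's own tree (e.g. alongside the other K8s configs in org.apache.spark.deploy.k8s.Config):

package org.apache.spark.deploy.k8s

import org.apache.spark.internal.config.ConfigBuilder

object AllocationFlagSketch {
  // Hypothetical flag, off by default so the existing snapshot-gated behaviour is unchanged.
  val ALLOCATE_WITHOUT_SNAPSHOT =
    ConfigBuilder("spark.kubernetes.allocation.allocateWithoutSnapshot")
      .doc("When false (the default), executor allocation rounds are only triggered by " +
        "incoming pod snapshots. When true, allocation may also be triggered while pods " +
        "are still pending, bounded by the pending-pod limits.")
      .version("3.5.0")
      .booleanConf
      .createWithDefault(false)
}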

The current logic in the master branch seems to be based on 0343854 / #25236 from @vanzin, which introduced hasPendingPods (master now tracks a count instead), and it seems in line with that change's stated goal ("More responsive dynamic allocation with K8S"). I don't see anything in it about reducing the number of pending resources in EKS, but maybe there was a discussion off-PR/off-list that I'm not aware of.

Do any of those work for your concern?

Just as an aside, I'm a little surprised by a veto (https://spark.apache.org/committers.html / https://www.apache.org/foundation/voting.html) this early in the conversation around a proposed change. Is there some context I'm missing? Did y'all get a stuck cluster with the previous behavior?

@holdenk force-pushed the investigate-spark-on-dynamic-scale-stalls branch from b5b4011 to 153e1df on February 6, 2023 at 23:28
@dongjoon-hyun (Member) commented:

As I mentioned in the previous comment, Apache Spark is known to get explosions of pending pods and of the EKS resources that come with them. I was surprised that SPARK-42261 was filed as a bug, @holdenk.

> Is there some context I'm missing? Did y'all get a stuck cluster with the previous behavior?

I thought you were using StatefulSet only?

@holdenk (Contributor Author) commented Feb 7, 2023

We're exploring a mixture of different ways of handling allocation depending on the job.

So I filed SPARK-42261 as a bug because (judging from the commit history) not allocating until a snapshot is triggered does not seem to be an intentional feature, and it slows down scale-up in what is (to me) an unexpected way. I'm happy to change it from "bug" to "improvement".

But more to the point: does having this as an opt-in feature make you comfortable dropping the -1? Or do you think that having this as an optional feature is bad? I think it's reasonable for us to want to support fast scale-up for non-EKS deployments (not everyone is going to use EKS, let alone PVCs on EKS).

@dongjoon-hyun (Member) commented:

Yes, last week I was only -1 on the SPARK-42261 part ("K8s will not allocate more execs if there are any pending execs until next snapshot") because it was reported as a bug. For SPARK-42260, I wasn't sure because its Target Version was 3.4.1.

For the following, I'm open to new feature approaches.

> What about if we use "allocation batch size" to gate it like @attilapiros suggested?

For the following (which is more important): I haven't caught up on this PR's latest commits yet, but I didn't mean to prevent alternatives. Even in Spark 3.4, I believe we can still backport a required K8s alternative if we need it, @holdenk.

> But more to the point: does having this as an opt-in feature make you comfortable dropping the -1? Or do you think that having this as an optional feature is bad? I think it's reasonable for us to want to support fast scale-up for non-EKS deployments (not everyone is going to use EKS, let alone PVCs on EKS).

@@ -161,7 +161,7 @@ class ExecutorPodsAllocatorSuite extends SparkFunSuite with BeforeAndAfter {
assert(ExecutorPodsAllocator.splitSlots(seq2, 4) === Seq(("a", 2), ("b", 1), ("c", 1)))
}

test("SPARK-36052: pending pod limit with multiple resource profiles") {
test("SPARK-42261: Allow allocations without snapshot up to min of max pending & alloc size.") {
@dongjoon-hyun (Member) commented on the diff:

Please add a new test case while keeping the original test coverage by using configurations.

@holdenk (Contributor Author) replied:

Sounds good, will refactor next week.

"snapshot this allows an implicit feedback from the cluster manager in that if it is too " +
"busy snapshots may be backed up. Settings this to false increases the speed at which " +
"Spark an scale up but does increase the risk of an excessive number of pending " +
s"resources in some environments. See ${KUBERNETES_MAX_PENDING_PODS.key} " +
@dongjoon-hyun (Member) commented:

Thank you for mentioning the risks.

assert(podsAllocatorUnderTest.stalledStartTime == null)
}

test("SPARK-36052: pending pod limit with multiple resource profiles & SPARK-42261") {
@dongjoon-hyun (Member) commented Feb 8, 2023:

Oh, is this the original test case? If you don't mind, could you move the following test cases after the SPARK-36052 test case?

test("SPARK-42261: Allow allocations without snapshot up to min of max pending & alloc size.") {
test("SPARK-42261: Don't allow allocations without snapshot by default (except new rpID)") {

@github-actions (bot) commented:

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions bot added the Stale label May 20, 2023
@github-actions bot closed this May 21, 2023
@holdenk reopened this Jul 26, 2023
@holdenk removed the Stale label Jul 26, 2023
… snapshot if we have headroom in max pending pods.
…kOnSnapshot to default to existing behaviour, and check allocation size and max pending size.
@holdenk force-pushed the investigate-spark-on-dynamic-scale-stalls branch from 153e1df to 24c9d0f on October 24, 2023 at 17:21
@github-actions (bot) commented Feb 2, 2024:

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions bot added the Stale label Feb 2, 2024
@09306677806 commented Feb 2, 2024 via email

@github-actions bot closed this Feb 3, 2024