
[SPARK-42261][SPARK-42260][K8S] Log Allocation Stalls and Trigger Allocation event without blocking on snapshot #39825

Closed

Conversation

@holdenk (Contributor) commented Jan 31, 2023

What changes were proposed in this pull request?

Log Allocation Stalls when we are unable to allocate any pods (but wish to) during a K8s snapshot event (see the sketch after this list).
Trigger Allocation event without blocking on snapshot provided that there is enough room in maxPendingPods.
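
As a rough illustration of the stall-logging idea in the first item, here is a minimal standalone sketch. Only the stalledStartTime name comes from the PR's test code; the class, method shape, and log messages are assumptions, not the actual ExecutorPodsAllocator change.

import java.time.{Duration, Instant}

// Standalone sketch only: the real change lives in ExecutorPodsAllocator.
class StallTrackerSketch(log: String => Unit) {
  // null while not stalled, matching the `stalledStartTime == null` assertion in the tests.
  private[this] var stalledStartTime: Instant = null

  def onSnapshot(wantedPods: Int, newlyAllocatedPods: Int): Unit = {
    if (wantedPods > 0 && newlyAllocatedPods == 0) {
      if (stalledStartTime == null) {
        stalledStartTime = Instant.now()
        log(s"Pod allocation appears stalled: want $wantedPods pods but could not allocate any.")
      }
    } else if (stalledStartTime != null) {
      log(s"Pod allocation resumed after ${Duration.between(stalledStartTime, Instant.now())}.")
      stalledStartTime = null
    }
  }
}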

Why are the changes needed?

Spark on K8s dynamic allocation can be difficult to debug and is prone to stalling in heavily loaded clusters, and waiting for snapshot events adds an unnecessary delay to pod allocation.

Does this PR introduce any user-facing change?

New log messages and faster pod scale-up.

How was this patch tested?

Modified an existing test to verify that we both trigger allocation while pods are pending and track when we are stalled.

@@ -141,9 +143,26 @@ class ExecutorPodsAllocator(
totalExpectedExecutorsPerResourceProfileId.put(rp.id, numExecs)
}
logDebug(s"Set total expected execs to $totalExpectedExecutorsPerResourceProfileId")
-    if (numOutstandingPods.get() == 0) {
+    if (numOutstandingPods.get() < maxPendingPods) {
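
In isolation, the gating change in this hunk amounts to the following sketch (assumed field values, not the real ExecutorPodsAllocator):

import java.util.concurrent.atomic.AtomicInteger

object AllocationGateSketch {
  private val numOutstandingPods = new AtomicInteger(0)
  private val maxPendingPods = Int.MaxValue // default of KUBERNETES_MAX_PENDING_PODS

  // Old behaviour: only start a new allocation round when nothing is outstanding.
  def shouldTriggerAllocationBefore(): Boolean = numOutstandingPods.get() == 0

  // Proposed behaviour: start one whenever there is headroom under maxPendingPods.
  def shouldTriggerAllocationAfter(): Boolean = numOutstandingPods.get() < maxPendingPods
}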
@attilapiros (Contributor) commented on the diff:

The default of KUBERNETES_MAX_PENDING_PODS is Int.MaxValue (to keep the old behaviour from before it was introduced), and the main intention of numOutstandingPods was to slow down upscaling at very steep peaks:
https://github.com/apache/spark/blob/b5b40113a64b4dbbcd4efe86da4409f2be8e9c56/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocator.scala#L397-L399

What about using the allocation batch size (or rather a factor of it, as a sensible lower limit)?
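
A tiny sketch of that suggestion, assuming the stock defaults (Int.MaxValue for KUBERNETES_MAX_PENDING_PODS, 5 for spark.kubernetes.allocation.batch.size); the object and method names are illustrative only:

object BatchSizeGateSketch {
  private val maxPendingPods = Int.MaxValue
  private val podAllocationSize = 5

  // Gate on the smaller of the two limits so a huge maxPendingPods default
  // does not let outstanding pods grow unbounded between snapshots.
  def hasAllocationHeadroom(numOutstandingPods: Int): Boolean =
    numOutstandingPods < math.min(maxPendingPods, podAllocationSize)
}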

@holdenk (Contributor Author) replied:

That seems reasonable to me; I'll change it over.

@dongjoon-hyun (Member) left a comment:

Sorry, but I'm -1 for SPARK-42261 ("K8s will not allocate more execs if there are any pending execs until next snapshot") because that was a good feature to prevent explosions of pending resources (pods and dependent resources like PVCs), which cause EKS control plane congestion and waste money.

@holdenk (Contributor Author) commented Feb 6, 2023

Ok @dongjoon-hyun, I hear the -1 concern around excessive scale-up, but blocking scale-up on the delivery of snapshots (which is what we do today) seems like kind of a hack. What about if we use "allocation batch size" to gate it like @attilapiros suggested?

The other option would be to add this as a feature flag that is off by default (e.g. enableAllocationWithPendingPods, set to false by default).
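
A hypothetical sketch of what such an opt-in flag could look like; the key name, doc text, and version below are placeholders rather than this PR's actual config, and since ConfigBuilder is Spark-internal the definition would live in Spark's own tree (e.g. alongside the other K8s configs in org.apache.spark.deploy.k8s.Config):

package org.apache.spark.deploy.k8s

import org.apache.spark.internal.config.ConfigBuilder

object AllocationFlagSketch {
  // Hypothetical flag, off by default so the existing snapshot-gated behaviour is unchanged.
  val ALLOCATE_WITHOUT_SNAPSHOT =
    ConfigBuilder("spark.kubernetes.allocation.allocateWithoutSnapshot")
      .doc("When false (the default), executor allocation rounds are only triggered by " +
        "incoming pod snapshots. When true, allocation may also be triggered while pods " +
        "are still pending, bounded by the pending-pod limits.")
      .version("3.5.0")
      .booleanConf
      .createWithDefault(false)
}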

The current logic in the master branch seems to be based on 0343854 / #25236 from @vanzin, which introduced hasPendingPods (master now tracks a count instead), and it seems in line with that change's stated goal ("More responsive dynamic allocation with K8S"). I don't see anything in it about reducing the number of pending resources in EKS, but maybe there was a discussion off-PR/off-list that I'm not aware of.

Do any of those work for your concern?

Just as an aside, I'm a little surprised by a veto (https://spark.apache.org/committers.html / https://www.apache.org/foundation/voting.html) this early in the conversation around a proposed change. Is there some context I'm missing? Did y'all get a stuck cluster with the previous behavior?

@holdenk force-pushed the investigate-spark-on-dynamic-scale-stalls branch from b5b4011 to 153e1df on February 6, 2023 at 23:28
@dongjoon-hyun (Member) commented:

As I mentioned in the previous comment, Apache Spark is known to get explosions of pending pods and of the EKS resources that come with them. I was surprised that SPARK-42261 was filed as a bug, @holdenk.

> Is there some context I'm missing? Did y'all get a stuck cluster with the previous behavior?

I thought you were using StatefulSet only?

@holdenk (Contributor Author) commented Feb 7, 2023

We're exploring a mixture of different ways of handling allocation depending on the job.

So I filed SPARK-42261 as a bug because (judging from the commit history) not allocating until a snapshot is triggered does not seem to be an intentional feature, and it slows down scale-up in what is (to me) an unexpected way. I'm happy to change it from "bug" to "improvement".

But more to the point: does having this as an opt-in feature make you comfortable dropping the -1? Or do you think that having this as an optional feature is bad? I think it's reasonable for us to want to support fast scale-up for non-EKS deployments (not everyone is going to use EKS, let alone PVCs on EKS).

@dongjoon-hyun (Member) commented:

Yes, last week I was only -1 on the SPARK-42261 part ("K8s will not allocate more execs if there are any pending execs until next snapshot") because it was reported as a bug. For SPARK-42260, I wasn't sure because its Target Version was 3.4.1.

For the following, I'm open to new feature approaches.

> What about if we use "allocation batch size" to gate it like @attilapiros suggested?

For the following (which is more important): I haven't caught up on this PR's latest commits yet, but I didn't mean to prevent alternatives. Even in Spark 3.4, I believe we can still backport a required K8s alternative if we need it, @holdenk.

> But more to the point: does having this as an opt-in feature make you comfortable dropping the -1? Or do you think that having this as an optional feature is bad? I think it's reasonable for us to want to support fast scale-up for non-EKS deployments (not everyone is going to use EKS, let alone PVCs on EKS).

@@ -161,7 +161,7 @@ class ExecutorPodsAllocatorSuite extends SparkFunSuite with BeforeAndAfter {
assert(ExecutorPodsAllocator.splitSlots(seq2, 4) === Seq(("a", 2), ("b", 1), ("c", 1)))
}

test("SPARK-36052: pending pod limit with multiple resource profiles") {
test("SPARK-42261: Allow allocations without snapshot up to min of max pending & alloc size.") {
@dongjoon-hyun (Member) commented on the diff:

Please add a new test case while keeping the original test coverage by using configurations.

@holdenk (Contributor Author) replied:

Sounds good, will refactor next week.

"snapshot this allows an implicit feedback from the cluster manager in that if it is too " +
"busy snapshots may be backed up. Settings this to false increases the speed at which " +
"Spark an scale up but does increase the risk of an excessive number of pending " +
s"resources in some environments. See ${KUBERNETES_MAX_PENDING_PODS.key} " +
@dongjoon-hyun (Member) commented:

Thank you for mentioning the risks.

assert(podsAllocatorUnderTest.stalledStartTime == null)
}

test("SPARK-36052: pending pod limit with multiple resource profiles & SPARK-42261") {
@dongjoon-hyun (Member) commented Feb 8, 2023:

Oh, is this the original test case? If you don't mind, could you move the following test cases after the SPARK-36052 test case?

test("SPARK-42261: Allow allocations without snapshot up to min of max pending & alloc size.") {
test("SPARK-42261: Don't allow allocations without snapshot by default (except new rpID)") {

@github-actions (bot) commented:

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions bot added the Stale label May 20, 2023
@github-actions bot closed this May 21, 2023
@holdenk reopened this Jul 26, 2023
@holdenk removed the Stale label Jul 26, 2023
… snapshot if we have headroom in max pending pods.
…kOnSnapshot to default to existing behaviour, and check allocation size and max pending size.
@holdenk force-pushed the investigate-spark-on-dynamic-scale-stalls branch from 153e1df to 24c9d0f on October 24, 2023 at 17:21
@github-actions (bot) commented Feb 2, 2024:

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions bot added the Stale label Feb 2, 2024
@09306677806 commented Feb 2, 2024 via email

@github-actions bot closed this Feb 3, 2024