[SPARK-42261][SPARK-42260][K8S] Log Allocation Stalls and Trigger Allocation event without blocking on snapshot #39825
Conversation
@@ -141,9 +143,26 @@ class ExecutorPodsAllocator(
       totalExpectedExecutorsPerResourceProfileId.put(rp.id, numExecs)
     }
     logDebug(s"Set total expected execs to $totalExpectedExecutorsPerResourceProfileId")
-    if (numOutstandingPods.get() == 0) {
+    if (numOutstandingPods.get() < maxPendingPods) {
The default of KUBERNETES_MAX_PENDING_PODS is Int.MaxValue (to keep the old behaviour from when it was introduced), and the main intention of numOutstandingPods was to slow down upscaling at very steep peaks:
https://github.com/apache/spark/blob/b5b40113a64b4dbbcd4efe86da4409f2be8e9c56/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocator.scala#L397-L399
What about using the allocation batch size (or rather a factor of it, as a good lower limit)?
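The gating suggested here can be sketched in plain Scala. This is a minimal illustration, not the actual `ExecutorPodsAllocator` code: the object name `AllocationGate`, the method `mayAllocate`, and the `batchFactor` parameter are all hypothetical, chosen only to model "allow allocation without a fresh snapshot while outstanding pods stay below both the hard cap and a small multiple of the batch size."

```scala
// Hypothetical sketch of the proposed gate; names are illustrative and do
// not match the real ExecutorPodsAllocator fields.
object AllocationGate {
  def mayAllocate(
      numOutstandingPods: Int,   // pods requested but not yet seen in a snapshot
      maxPendingPods: Int,       // hard cap (KUBERNETES_MAX_PENDING_PODS, default Int.MaxValue)
      allocationBatchSize: Int,  // pods requested per allocation round
      batchFactor: Int = 2): Boolean = {
    // Use Long arithmetic so batchFactor * allocationBatchSize cannot
    // overflow when the defaults are large.
    val softLimit = math.min(maxPendingPods.toLong, batchFactor.toLong * allocationBatchSize)
    numOutstandingPods < softLimit
  }
}
```

With the default `maxPendingPods` of Int.MaxValue, the factor of the batch size becomes the effective lower limit, which matches the intent of the suggestion.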
That seems reasonable to me; I'll change it over.
Sorry, but I'm -1 on SPARK-42261 ("K8s will not allocate more execs if there are any pending execs until next snapshot") because that was a good feature to prevent pending-resource explosions (pods and dependent resources like PVCs), which cause EKS control-plane congestion and a waste of money.
Ok @dongjoon-hyun, I hear the -1 concern around excessive scale-up, but blocking scale-up on snapshots being delivered (what we are doing today) seems like kind of a hack. What if we use the allocation batch size to gate it, like @attilapiros suggested? The other option would be to add this as a default-off feature flag (e.g. …). The current logic which exists in the master branch seems to be based on 0343854 / #25236 from @vanzin, which introduced …. Do any of those work for your concern?

Just as an aside, I'm a little surprised by a veto (https://spark.apache.org/committers.html / https://www.apache.org/foundation/voting.html) this early in the conversation around a proposed change; is there some context I'm missing? Did y'all get a stuck cluster with the previous behavior?
As I mentioned in the previous comment, Apache Spark is known to get explosions of pending pods and subsequent EKS resources. I was surprised that SPARK-42261 was filed as a bug, @holdenk. I thought you were using ….
We're exploring a mixture of different ways of handling allocation depending on the job. I filed SPARK-42261 as a bug because (judging from the commit history) not allocating until a snapshot is triggered does not seem to be an intentional feature, and it slows down scale-up in (to me) an unexpected way. I'm happy to change it from "bug" to "improvement".

But more to the point: does having this as an opt-in feature make you comfortable with dropping the -1? Or do you think that having this as an optional feature is bad? I think it's reasonable for us to want to support non-EKS deployments with fast scale-up (not everyone is going to use EKS, let alone PVCs on EKS).
Yes, last week I was only -1 on the SPARK-42261 "K8s will not allocate more execs if there are any pending execs until next snapshot" part because it was reported as a bug. For SPARK-42260, I wasn't sure because its …. For the following, I'm open to new feature approaches.

For the following (which is more important), I haven't caught up on this PR's latest commits yet, but I didn't mean to prevent alternatives. Even in Spark 3.4, I believe we can still backport a required K8s alternative if we need it, @holdenk.
@@ -161,7 +161,7 @@ class ExecutorPodsAllocatorSuite extends SparkFunSuite with BeforeAndAfter {
     assert(ExecutorPodsAllocator.splitSlots(seq2, 4) === Seq(("a", 2), ("b", 1), ("c", 1)))
   }

-  test("SPARK-36052: pending pod limit with multiple resource profiles") {
+  test("SPARK-42261: Allow allocations without snapshot up to min of max pending & alloc size.") {
Please add a new test case while keeping the original test coverage by using configurations.
Sounds good, will refactor next week.
"snapshot this allows an implicit feedback from the cluster manager in that if it is too " + | ||
"busy snapshots may be backed up. Settings this to false increases the speed at which " + | ||
"Spark an scale up but does increase the risk of an excessive number of pending " + | ||
s"resources in some environments. See ${KUBERNETES_MAX_PENDING_PODS.key} " + |
Thank you for mentioning the risks.
    assert(podsAllocatorUnderTest.stalledStartTime == null)
  }

  test("SPARK-36052: pending pod limit with multiple resource profiles & SPARK-42261") {
Oh, is this the original test case? If you don't mind, could you move the following test cases after the SPARK-36052 test case?

test("SPARK-42261: Allow allocations without snapshot up to min of max pending & alloc size.") {
test("SPARK-42261: Don't allow allocations without snapshot by default (except new rpID)") {
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
Log Allocation Stalls when we are unable to allocate any pods (but wish to) during a K8s snapshot event.
Trigger Allocation event without blocking on snapshot provided that there is enough room in maxPendingPods.
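The stall-logging idea can be modelled in a few lines of plain Scala. This is a minimal sketch, not the PR's actual code: the `StallTracker` class, its `update` method, and the injected `clock` are hypothetical, though the `stalledStartTime` name echoes the test snippet in this thread (null while not stalled, set when a round wants pods but allocates none).

```scala
// Hypothetical model of stall tracking; the real logic lives inside
// ExecutorPodsAllocator's snapshot handling.
class StallTracker(clock: () => Long = System.currentTimeMillis) {
  // Null while not stalled, otherwise the time we first failed to
  // allocate any pods despite wanting more executors.
  var stalledStartTime: java.lang.Long = null

  // Called at the end of each allocation round. Returns how long the
  // allocator has been stalled, so the caller can log it.
  def update(wantedMorePods: Boolean, allocatedAnyPods: Boolean): Option[Long] = {
    if (wantedMorePods && !allocatedAnyPods) {
      if (stalledStartTime == null) stalledStartTime = clock()
      Some(clock() - stalledStartTime)
    } else {
      stalledStartTime = null  // made progress (or nothing wanted): not stalled
      None
    }
  }
}
```

A caller would log a warning whenever `update` returns a growing duration, which is exactly the kind of breadcrumb that makes a stalled dynamic-allocation loop debuggable.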
Why are the changes needed?
Spark-on-K8s dynamic allocation can be difficult to debug and is prone to stalling in heavily loaded clusters, and waiting for snapshot events adds unnecessary delay to pod allocation.
Does this PR introduce any user-facing change?
New log messages, faster pod scale up.
How was this patch tested?
Modified existing test to verify that we are both triggering allocation with pending pods and tracking when we are stalled.