feat: rebalance cost-based autoscaler for best throughput#19646
Conversation
f5be5c2 to
8a3e6dd
Compare
kfaraz
left a comment
There was a problem hiding this comment.
In conclusion, does the auto-scaler now prefer under-provisioning by default?
| * Controls the steepness of the U-shape on the over-provisioning side. | ||
| */ | ||
| static final double OVER_PROVISIONING_PENALTY = 1.0; | ||
| static final double OVER_PROVISIONING_PENALTY = 2.5; |
There was a problem hiding this comment.
Why change this to 2.5 instead of 2.0?
There was a problem hiding this comment.
I tried 2.0, but 2.5 looked better in simulation.
There was a problem hiding this comment.
I feel 2x penalty on over-provisioning is already enough. Changing these constants very frequently will make it difficult to gather enough data to refine the constants empirically.
There was a problem hiding this comment.
Let's use 2 for now and see how it does in real clusters. This PR is already interchanging the under provisioned penalty and the over provisioning penalty, so we will have enough stuff to validate anyway.
There was a problem hiding this comment.
Okay, let's use 2, but I can assure you 2.5 looks even better on simulation. We'll see 😺
| * do not return infinitely large lag recovery times, at the expense of underestimating the lag cost. | ||
| */ | ||
| static final double MIN_PROCESSING_RATE = 1_000; | ||
| static final double MIN_PROCESSING_RATE = 5_000; |
There was a problem hiding this comment.
Why this change?
Technically, 1000 was also an arbitrary number but 5000 definitely seems to be on the higher side, especially when dealing with bulkier records with large (say JSON) column values. I would rather the user choose whether they want to prefer lag recovery or throughput by tweaking the weights.
There was a problem hiding this comment.
I decided to tweak it too and realized that 1000 is too permissible to scaleup during minimal lag. 5000 implies more strict behaviour in that critical state where we have not received metrics yet.
There was a problem hiding this comment.
If we have not received metrics yet, auto-scaling would be skipped since CostBasedAutoScaler.validateMetricsForScaling would return an error.
I have personally seen tasks in prod clusters maxing out at 5000 records/sec when dealing with large records.
So, using a large MIN_PROCESSING_RATE would cause us to always under-estimate lagRecoveryTime, irrespective of the actual avgProcessingRate.
As such, let's keep a low value of MIN_PROCESSING_RATE(maybe even as low as 100), since it is meant to be a safe-side measure that kicks in only when avgProcessingRate is very low.
The penalty for scale-up is already driven by the optimal task idleness, and can be controlled using the weights.
There was a problem hiding this comment.
In a future PR, I think we can remove the MIN_PROCESSING_RATE altogether and maybe use the window maxProcessingRate, but I haven't fully thought it through yet. It might have some unforeseen side effects.
There was a problem hiding this comment.
@kfaraz I forgot about Math.max(...) and then I reconsider my approach 😁
FrankChen021
left a comment
There was a problem hiding this comment.
| Severity | Findings |
|---|---|
| P0 | 0 |
| P1 | 0 |
| P2 | 1 |
| P3 | 0 |
| Total | 1 |
| Severity | Findings |
|---|---|
| P0 | 0 |
| P1 | 0 |
| P2 | 1 |
| P3 | 0 |
| Total | 1 |
Found 1 issue.
Reviewed 6 of 6 changed files.
This is an automated review by Codex GPT-5.5
kfaraz
left a comment
There was a problem hiding this comment.
Left a non-blocking comment.
This PR retunes the cost-based autoscaler scoring for better throughput-oriented decisions and exposes
idealIdleRatioas a configurable knob for the U-shaped idle-cost function.It also updates autoscaler logging to show the effective idle ratio and max observed processing rate used during optimal task-count calculation.
Details
Picture 1
48partitions, starting from6tasks.Picture 2
125partitions, plot title showsstartTaskCount=25.Picture 3
500partitions, starting at the maximum task count.This PR has: