[FLINK-30593][autoscaler] Determine restart time on the fly for Autoscaler #711
Conversation
Force-pushed afc0bfe to 4ff926f
I like the overall approach, but I would recommend the following changes (see the sketch below):
- Remove the newly introduced configs; this should be automatic and always on.
- Track the start/end times for the restart in memory and only record the observed_restart_time in the autoscaler state store. This way we add minimal extra state that is easy to implement.
- Instead of computing the restart time from a fixed number of samples, use a simple moving average: observed_restart_time = (prev_observed + new_observed) / 2
- During autoscaler logic use: restart_time = min(conf_restart_time, observed_restart_time)
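A rough sketch of how that proposal could look, using illustrative names of my own rather than the PR's actual API:

import java.time.Duration;

/** Illustrative sketch of the proposed scheme above; names are hypothetical. */
class ProposedRestartTimeLogic {

    /** observed_restart_time = (prev_observed + new_observed) / 2 */
    static Duration updateObserved(Duration prevObserved, Duration newObserved) {
        return prevObserved == null ? newObserved : prevObserved.plus(newObserved).dividedBy(2);
    }

    /** restart_time = min(conf_restart_time, observed_restart_time) */
    static Duration effectiveRestartTime(Duration confRestartTime, Duration observedRestartTime) {
        if (observedRestartTime == null) {
            return confRestartTime;
        }
        return confRestartTime.compareTo(observedRestartTime) <= 0
                ? confRestartTime
                : observedRestartTime;
    }
}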
Commenting on some of the requests:
I think it is fair to have an ON/OFF switch. It should be on by default, but we want to keep the ability to roll back to the old behavior.
We already have the start time of the last scaling in memory via the scaling history. We can then keep note of the end time once we detect that the scaling is over. That leaves a little bit of error in case of operator downtime, which would produce a long rescaling time. I think that should be fine though, since we cap at the maximum configured rescale time.
I think we can do an exponentially weighted average.
+1
@gyfora @mxm thanks for the feedback.
Are we talking about having a field in the
I meant a flag in the ConfigMap. I guess something like
In the past we ran into the 1MB size limit for large pipelines with hundreds of vertices. That's why we don't want to increase the state size too much. We also added compression because of this. Of course we can add a new ConfigMap; we have just not come around to doing it, and it might not be a good idea to put too much load on etcd. It would be nice to have a separate ConfigMap for the scaling metrics and the scaling history, but it is much simpler to do it in one.
Got it, this is how it actually works currently (modulo additional target vertices parallelism tracking that I will attempt to fetch from the history instead) - I thought it was more about the configmap vs in-memory discussion.
I see, I'll try to keep the size to a minimum then.
Yes, I saw that. I think Gyula is right though, that it's enough to use in-memory state to start tracking a scaling execution. When the scaling is completed, we then persist the duration in the ConfigMap. The reason is that keeping track of the start time when we begin tracking would only be useful if we could use that state to recover, e.g. in case of downtime. However, that's only useful when the job does not complete rescaling during the downtime. If it completes before, we don't know how long a scaling took when we come back up. For simplicity, I think it makes sense to just track the scaling start time in memory and then persist the actual rescale time in the ConfigMap.
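A minimal sketch of that flow, assuming hypothetical class and method names (not the operator's actual API):

import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch: keep scaling start times in memory, persist only the observed duration. */
public class RestartTimeTracker {
    // In-memory only: lost on operator restart, which is acceptable since we cap by the configured max.
    private final Map<String, Instant> scalingStartTimes = new HashMap<>();

    public void onScalingTriggered(String jobKey, Instant now) {
        scalingStartTimes.put(jobKey, now);
    }

    /** Called once the job is RUNNING at the target parallelism; returns the duration to persist. */
    public Duration onScalingCompleted(String jobKey, Instant now, Duration configuredMax) {
        Instant start = scalingStartTimes.remove(jobKey);
        if (start == null) {
            return null; // e.g. the operator restarted mid-scaling; nothing to record
        }
        Duration observed = Duration.between(start, now);
        return observed.compareTo(configuredMax) > 0 ? configuredMax : observed;
    }
}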
Sounds reasonable. If we are optimizing state size at this level, how about we also remove
Force-pushed 4ff926f to 03c28e5
@gyfora I'd like to propose to use this format:
Some reasoning:
@mxm, I removed the redundant tracking of the target parallelism from the
Force-pushed 03c28e5 to b903dae
We can file a JIRA for this. This requires more changes since those metrics are also directly reported via Prometheus and removing them would be unexpected to users.
Force-pushed 4a01731 to 6bb5dd7
@mxm @gyfora the PR is ready for review.
@gyfora Max and I briefly discussed offline and came to the conclusion that starting with evaluating the maximum restart time capped by the RESTART_TIME setting is probably good enough for the first step. It has the benefit of giving the most "conservative" evaluation, and we can add the moving average after some baseline testing. What do you think?
Thanks Alex!
flink-autoscaler/src/main/java/org/apache/flink/autoscaler/JobAutoScalerImpl.java (review thread resolved)
/**
 * Class for tracking scaling details, including time it took for the job to transition to the
 * target parallelism.
 */
@Data
@NoArgsConstructor
@AllArgsConstructor
public class ScalingRecord {
    private Instant endTime;
}
Could we just store Instant directly and remove this wrapper class?
My idea was to have structures in place that would be easily extensible. If we are sure we will not ever need to store anything else here, we can store the Instant directly.
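For illustration, storing the Instant directly would shrink the tracked structure to something like this (a hypothetical alternative, not what was merged):

import java.time.Instant;
import java.util.TreeMap;

class ScalingTrackingWithoutWrapper {
    // Keyed by the scaling start time; the value is the observed end time, no wrapper class needed.
    private final TreeMap<Instant, Instant> scalingEndTimes = new TreeMap<>();
}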
flink-autoscaler/src/main/java/org/apache/flink/autoscaler/ScalingTracking.java (review threads resolved)
I think if we are going to use Math.min(maxRestartTime, restartTimeFromConfig), we should increase the default restart time value in the config from 5 minutes to at least 15 minutes (maybe even higher, like 20-30).
With 5 minutes, almost all prod jobs with many pods, slower environments, or large state will probably take longer (let's say 10-15 minutes) and will be cut back to 5 minutes, leading to an underestimation and potentially multiple scale-ups.
What do you think?
cc @mxm
What you describe is already the case. Any deployment which currently takes longer than the configured rescale time will not scale accurately. By adding the feature in its current state, we are conservatively addressing those deployments which come back up much quicker. In our environment, the typical rescale time is 1 minute or less. I think your request requires adding a new configuration. I'm ok with adding
Force-pushed 6bb5dd7 to 1867e62
@mxm @gyfora I added the proposed option. To keep things simple for the users, I decided to only cap by it when the
Force-pushed ae79219 to d5b0c03
Thanks for following up so quickly! 🙏
public static final ConfigOption<Duration> MAX_RESTART_TIME =
        autoScalerConfig("restart.time.max")
                .durationType()
                .defaultValue(Duration.ofMinutes(30))
We have internal overrides for all configuration defaults, but I don't think this overly pessimistic value is a good default for the general public. I would set this to no more than 15 minutes; even that is still conservative.
Done.
@@ -142,9 +142,11 @@ public double getMaxRestartTimeSecondsOrDefault(Configuration conf) {
            }
        }
    }
    long restartTimeFromConfig = conf.get(AutoScalerOptions.RESTART_TIME).toSeconds();
    long maxRestartTimeFromConfig = conf.get(AutoScalerOptions.MAX_RESTART_TIME).toSeconds();
    return maxRestartTime == -1
            ? restartTimeFromConfig
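The diff excerpt is cut off here; judging by the javadoc quoted later in the thread ("the maximum observed restart time is capped by the MAX_RESTART_TIME"), the remaining branch presumably looks roughly like this (my reconstruction, not necessarily the exact merged code):

            // Hypothetical completion of the truncated expression above:
            return maxRestartTime == -1
                    ? restartTimeFromConfig
                    : Math.min(maxRestartTime, maxRestartTimeFromConfig);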
I think we should always cap the RESTART_TIME by the configured MAX_RESTART_TIME (if configured).
@gyfora I'd like to hear your thoughts on this
Synced with Max offline - we decided to rename the option to TRACKED_RESTART_TIME_LIMIT to make the scope clear.
sounds good!
LGTM. Thank you Alex!
flink-autoscaler/src/main/java/org/apache/flink/autoscaler/config/AutoScalerOptions.java (review threads resolved)
flink-autoscaler/src/main/java/org/apache/flink/autoscaler/ScalingTracking.java (review thread resolved)
Force-pushed 3f326d5 to 2ba8132
Force-pushed 2ba8132 to 93ae804
Thanks!
On Nov 22, 2023, Alexander Fedulov commented on this pull request, in flink-autoscaler/src/main/java/org/apache/flink/autoscaler/ScalingTracking.java:
                .allMatch(
                        entry -> {
                            var vertexID = entry.getKey();
                            var targetParallelism = entry.getValue();
                            var actualParallelism = actualParallelisms.getOrDefault(vertexID, -1);
                            return actualParallelism.equals(targetParallelism);
                        });
    }

    /**
     * Retrieves the maximum restart time in seconds based on the provided configuration and scaling
     * records. Defaults to the RESTART_TIME from configuration if the PREFER_TRACKED_RESTART_TIME
     * option is set to false, or if there are no tracking records available. Otherwise, the maximum
     * observed restart time is capped by the MAX_RESTART_TIME.
     */
    public double getMaxRestartTimeSecondsOrDefault(Configuration conf) {
Sorry, I did not have time to add some of these comments earlier. I wonder why we decided to keep a history instead of the exponential moving average. That would eliminate the bookkeeping.
<td><h5>job.autoscaler.restart.time.tracked.enabled</h5></td>
<td style="word-wrap: break-word;">false</td>
<td>Boolean</td>
<td>Whether to use the actually observed rescaling restart times instead of the fixed 'job.autoscaler.restart.time' configuration. If set to true, the maximum restart duration over a number of samples will be used. The value of 'job.autoscaler.restart.time' will act as an upper bound.</td>
</tr>
<tr>
<td><h5>job.autoscaler.restart.time.tracked.limit</h5></td>
Sorry for the late comment, but we could consider changing this to job.autoscaler.restart.time-tracking.enabled and job.autoscaler.restart.time-tracking.limit, so that we avoid having another config key match a prefix (and therefore make it non-YAML-compliant). Not a huge thing, but if we want to support YAML configs like in Flink, we will have to fix it eventually.
if (targetParallelismMatchesActual(
        targetParallelism, actualParallelism)) {
    value.setEndTime(now);
Should we maybe log this at debug level, so we have an overview if we want to debug this?
+1
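For illustration, the suggested debug logging could look something like this (it assumes an SLF4J LOG field in the surrounding class; the message text is made up):

if (targetParallelismMatchesActual(targetParallelism, actualParallelism)) {
    value.setEndTime(now);
    LOG.debug("Job reached target parallelism; recording scaling end time {}", now);
}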
public class ScalingTracking {

    /** Details related to recent rescaling operations. */
    private final TreeMap<Instant, ScalingRecord> scalingRecords = new TreeMap<>();
I thought we were going to keep a single record and exponential moving avg
I guess this got buried in the notifications:
@gyfora Max and I briefly discussed offline and came to the conclusion that starting with evaluating the maximum restart time capped by the RESTART_TIME setting is probably good enough for the first step. It has the benefit of giving the most "conservative" evaluation, and we can add the moving average after some baseline testing. What do you think?
I saw this, but it doesn't mention anything about the history etc. and refers to an offline discussion :)
Combined with the other comment related to the trimming issue (losing the restart info after 24h), I think the exponential moving avg is a simpler and slightly more robust initial approach.
EMA requires knowing how many previous records in the window are taken into account, because this determines the weight coefficient (smoothing factor) of the new record. The length of the observation "window" is also supposed to be fixed and not span all time from the beginning, so I am not sure we are talking about the classic definition of EMA. Maybe you could sketch the calculation you have in mind?
estimatedEMA = estimatedEMA * x + newMeasurement * (1 - x)
We could start with x = 0.5, which is pretty aggressive smoothing but should be fine given we don't have many scalings.
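A small self-contained sketch of that update rule (the class, field, and method names are mine, not the PR's):

import java.time.Duration;

/** Illustrative EMA-based restart time estimator; not part of the PR. */
class EmaRestartTimeEstimator {
    private static final double SMOOTHING = 0.5; // the x in the formula above
    private Duration estimate; // null until the first observation

    void record(Duration observedRestartTime) {
        if (estimate == null) {
            estimate = observedRestartTime;
            return;
        }
        long smoothedMillis =
                (long)
                        (estimate.toMillis() * SMOOTHING
                                + observedRestartTime.toMillis() * (1 - SMOOTHING));
        estimate = Duration.ofMillis(smoothedMillis);
    }

    Duration estimateOrDefault(Duration configuredRestartTime) {
        // Fall back to the configured restart time until something was observed.
        return estimate == null ? configuredRestartTime : estimate;
    }
}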
General question - we are currently missing a job-based data structure that keeps track of past rescaling details. Should we need to add something in the future, with the current structure it is as simple as adding data fields to the ScalingRecord. I am OK with removing the map, but the question is - are we sure we won't require something similar in the future anyway?
I guess it could be argued that we can always send statistics about previous rescalings as metrics, but why do we then keep the vertex-based scalingHistory?
var cutoffTime = now.minus(keptTimeSpan);

// Remove records older than the cutoff time
scalingRecords.headMap(cutoffTime).clear();
Wouldn't this clear the history if we don't scale for 24 hours? Then we fall back to the config?
We should keep at least 1 in the history. But given that scalings do not happen that often, the history will always have only 1-2 records. So EMA may be more robust. cc @mxm
We agreed offline to keep at least one observation to avoid having to recalibrate the restart time when the observation history expires. EMA is a good alternative as well but for now we chose to take the max of the last observed restart times.
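One way to express the "keep at least one observation" trimming on top of the TreeMap shown earlier (an illustrative sketch, not necessarily the merged implementation):

import java.time.Duration;
import java.time.Instant;
import java.util.TreeMap;

class ScalingRecordTrimming {
    /** Drops entries older than the retention window but always retains the most recent record. */
    static <V> void trim(TreeMap<Instant, V> scalingRecords, Instant now, Duration keptTimeSpan) {
        if (scalingRecords.size() <= 1) {
            return; // nothing to trim, or only the latest observation is left
        }
        Instant cutoffTime = now.minus(keptTimeSpan);
        Instant latestKey = scalingRecords.lastKey();
        // headMap(cutoffTime) is a view, so removing keys from it modifies the backing map.
        scalingRecords.headMap(cutoffTime).keySet().removeIf(key -> !key.equals(latestKey));
    }
}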
Force-pushed 12bbd8b to 70aa1cd
Force-pushed 70aa1cd to d39e01d
The docs need to be regenerated after the last config change.
Force-pushed 7350dcf to 677f76f
public static final ConfigOption<Boolean> PREFER_TRACKED_RESTART_TIME =
        autoScalerConfig("restart.time-tracking.enabled")
                .booleanType()
                .defaultValue(false)
This should probably be enabled by default. We can still change this default though.
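Since the default is currently false, users would have to opt in. Programmatically that amounts to something like the sketch below (it assumes the option keeps the constant name shown in this diff; in practice the key job.autoscaler.restart.time-tracking.enabled would simply be set in the deployment's Flink configuration):

import org.apache.flink.autoscaler.config.AutoScalerOptions;
import org.apache.flink.configuration.Configuration;

class EnableTrackedRestartTime {
    static Configuration exampleOverrides() {
        Configuration conf = new Configuration();
        // Prefer the observed restart times over the static job.autoscaler.restart.time value.
        conf.set(AutoScalerOptions.PREFER_TRACKED_RESTART_TIME, true);
        return conf;
    }
}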
Great work! Thanks @afedulov!
…aler (apache#711): Currently the autoscaler uses a preconfigured restart time for the job. This PR adds the ability to dynamically adjust it based on the observed restart times for scale operations.
This PR does not yet contain tests since I would like to first reach consensus on the general approach.
What is the purpose of the change
Currently the autoscaler uses a preconfigured restart time for the job. This PR adds the ability to dynamically adjust this based on the observed restart times for scale operations
Brief change log
Introduced ScalingTracking for tracking job-scoped scaling data.

High level approach:

The autoscaler adds a ScalingTracking object into the config map. It contains a map of applied scaling decisions together with the target topology and the endTime, sorted by the time when the scaling was applied (the map's key). When the job transitions into the RUNNING state, the latest record is fetched, the current parallelism is compared with the target one, and the endTime is set to now (only if it was not previously set).

Verifying this change
Manual. Testing implementation will follow after the general approach gets approved.
Does this pull request potentially affect one of the following parts:
CustomResourceDescriptors: (yes / no)

Documentation