[FLINK-30593][autoscaler] Determine restart time on the fly for Autoscaler #711

Merged
5 commits merged on Nov 24, 2023

Conversation

@afedulov (Contributor) commented Nov 14, 2023

This PR does not yet contain tests since I would like to first reach consensus on the general approach.

What is the purpose of the change

Currently the autoscaler uses a preconfigured restart time for the job. This PR adds the ability to dynamically adjust it based on the observed restart times of scaling operations.

Brief change log

  • Adds ScalingTracking for tracking job-scoped scaling data.
  • Adds autoscaler properties for enabling the use of observed restart times in the rescaling logic (opt-in).

High-level approach:
The autoscaler adds a ScalingTracking object into the config map in the following format:

 scalingTracking: |
   ---
   scalingRecords:
     "2023-11-13T13:36:35.240245Z":
       endTime: null
       targetParallelism:
         e44d8435128fbb8c0406b9ae78407aec: 2

It contains a map of applied scaling decisions together with the target topology and the endTime, sorted by the time when the scaling was applied (the map's key). When the job transitions into the RUNNING state, the latest record is fetched, the current parallelism is compared against the target one, and the endTime is set to now (only if it was not previously set).
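
For illustration, a minimal sketch of the bookkeeping described above. The class and method names (ScalingTrackingSketch, onJobRunning) are hypothetical; only ScalingRecord and endTime mirror the structure shown, and this is not the PR's actual code.

import java.time.Instant;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative sketch only; names are hypothetical, not the merged implementation. */
class ScalingTrackingSketch {

    static class ScalingRecord {
        Instant endTime;
        Map<String, Integer> targetParallelism; // vertex id -> target parallelism
    }

    // Keyed by the time the scaling decision was applied, as in the config map above.
    final TreeMap<Instant, ScalingRecord> scalingRecords = new TreeMap<>();

    /** Called when the job is observed RUNNING, with its current per-vertex parallelism. */
    void onJobRunning(Instant now, Map<String, Integer> actualParallelism) {
        var latest = scalingRecords.lastEntry();
        if (latest == null) {
            return; // nothing tracked yet
        }
        ScalingRecord record = latest.getValue();
        // Set endTime only once, and only when the observed parallelism matches the target.
        if (record.endTime == null && actualParallelism.equals(record.targetParallelism)) {
            record.endTime = now;
        }
    }
}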

Verifying this change

Manual. Test implementation will follow once the general approach is approved.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API, i.e., any changes to the CustomResourceDescriptors: (yes / no)
  • Core observer or reconciler logic that is regularly executed: (yes / no)

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

@gyfora (Contributor) left a comment

I like the overall approach but I would recommend the following changes:

  1. Remove the newly introduced configs, this should be automatic and always on
  2. Track the start/end times for the restart in memory and only record the observed_restart_time in the autoscaler state store. This way we add minimal extra state that is easy to implement.
  3. Instead of computing restart time from a fixed number of samples, use a simple moving average: observed_restart_time = (prev_observed + new_observed) / 2
  4. During autoscaler logic use: restart_time = min(conf_restart_time, observed_restart_time)

@mxm
Copy link
Contributor

mxm commented Nov 14, 2023

Commenting on some of the requests:

  1. Remove the newly introduced configs, this should be automatic and always on

I think it is fair to have an ON/OFF switch. It should be on by default, but we want to keep the ability to roll back to the old behavior.

  2. Track the start/end times for the restart in memory and only record the observed_restart_time in the autoscaler state store. This way we add minimal extra state that is easy to implement.

We already have the start time of the last scaling in memory via the scaling history. We can then keep note of the end time once we detect the scaling is over. That leaves a little bit of error in case of operator downtime, which would produce a long rescaling time. I think that should be fine though, since we cap at the maximum configured rescale time.

  3. Instead of computing restart time from a fixed number of samples, use a simple moving average: observed_restart_time = (prev_observed + new_observed) / 2

I think we can do an exponentially weighted average.

  4. During autoscaler logic use: restart_time = min(conf_restart_time, observed_restart_time)

+1
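
For illustration, a sketch of how points 3 and 4 could combine (an exponentially weighted average capped by the configured restart time). The class and method names are made up, and this is not what was ultimately merged.

import java.time.Duration;

/** Hypothetical sketch of the EWMA-plus-cap idea discussed above. */
class RestartTimeEstimator {

    private static final double ALPHA = 0.5; // smoothing factor; 0.5 is suggested later in this thread
    private Duration observedRestartTime;    // null until the first observation

    /** Record a newly observed restart duration. */
    void record(Duration newObservation) {
        observedRestartTime = (observedRestartTime == null)
                ? newObservation
                : Duration.ofMillis(Math.round(
                        observedRestartTime.toMillis() * ALPHA
                                + newObservation.toMillis() * (1 - ALPHA)));
    }

    /** restart_time = min(conf_restart_time, observed_restart_time). */
    Duration restartTimeForScaling(Duration configuredRestartTime) {
        return (observedRestartTime == null
                        || observedRestartTime.compareTo(configuredRestartTime) > 0)
                ? configuredRestartTime
                : observedRestartTime;
    }
}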

@afedulov (Contributor, Author) commented Nov 14, 2023

@gyfora @mxm thanks for the feedback.

We already have the start time of the last scaling in memory via the scaling history. We can then keep note of the end time once we detect the scaling is over. That leaves a little bit of error in case of downtime of the operator which will produce a long rescaling time. I think that should be fine though, since we cap at the max configured rescale time.

Are we talking about having a field in the ScalingExecutor? Because fetching scalingHistory won't be sufficient - we need some indication that a new rescaling was applied by the time we see the transition into RUNNING with the expected parallelism. This "flag" then needs to be cleaned.
General question: it feels like we are very focused on optimizing the size of this particular configmap. Can't we create a separate configmap if this is a concern? We already store so much stuff in the flink-config-*** configmap (see the amount of logging configuration alone) that obsessing over whether we store one timestamp or two for the 5-10 values we keep in state does not seem worth the significantly increased code complexity and the risk of losing restart data.

@mxm (Contributor) commented Nov 14, 2023

Are we talking about having a field in the ScalingExecutor? Because fetching scalingHistory won't be sufficient - we need some indication that a new rescaling was applied by the time we see the transition into RUNNING with the expected parallelism. This "flag" then needs to be cleaned.

I meant a flag in the ConfigMap. I guess something like startTime: endTime would be sufficient? The endTime would initially be null.

General question: it feels like we are very focused on optimizing the size of this particular configmap. Can't we create a separate configmap, if this is a concern?

In the past we ran into the 1MB size limit for large pipelines with hundreds of vertices. That's why we don't want to increase the state size too much. We also added compression because of this. Of course we can add a new configmap; we just haven't come around to doing it, and it might not be a good idea to put too much load on etcd. It would be nice to have a separate config map for the scaling metrics and the scaling history, but it is much simpler to do it in one.

@afedulov (Contributor, Author) commented Nov 14, 2023

I meant a flag in the ConfigMap. I guess something like startTime: endTime would be sufficient? The endTime would initially be null.

Got it, this is how it actually works currently (modulo additional target vertices parallelism tracking that I will attempt to fetch from the history instead) - I thought it was more about the configmap vs in-memory discussion.

In the past we ran into the 1MB size limit for large pipelines with hundreds of vertices.

I see, I'll try to keep the size to a minimum then.

@mxm (Contributor) commented Nov 14, 2023

Got it, this is how it actually works currently (modulo additional target vertices parallelism tracking that I will attempt to fetch from the history instead) - I thought it was more about the configmap vs in-memory discussion.

Yes, I saw that. I think Gyula is right though, that it's enough to use in-memory state to start tracking a scaling execution. When the scaling is completed, we then persist the duration in the ConfigMap. The reason is that keeping track of the start time when we begin tracking would only be useful if we could use that state to recover, e.g. in case of downtime. However, that's only useful when the job does not complete rescaling during the downtime. If it completes before, we don't know how long a scaling took when we come back up. For simplicity, I think it makes sense to just track the scaling start time in memory and then persist the actual rescale time in the ConfigMap.
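
A minimal sketch of the alternative described here, assuming a hypothetical tracker class: the start time lives only in memory, and only the observed duration would be persisted in the state store.

import java.time.Duration;
import java.time.Instant;

/** Hypothetical sketch; not the code that was merged. */
class InMemoryRestartTracker {

    private Instant scalingStartTime; // in-memory only; lost if the operator restarts

    void onScalingApplied(Instant now) {
        scalingStartTime = now;
    }

    /** Returns the observed restart duration to persist, or null if the start was never seen. */
    Duration onScalingCompleted(Instant now) {
        if (scalingStartTime == null) {
            return null; // e.g. the operator restarted mid-scaling; nothing to record
        }
        Duration observed = Duration.between(scalingStartTime, now);
        scalingStartTime = null;
        return observed;
    }
}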

@afedulov (Contributor, Author) commented Nov 14, 2023

Sounds reasonable. If we are optimizing state size at this level, how about we also remove RECOMMENDED_PARALLELISM and PARALLELISM from the history metrics, since they are duplicated as ScalingSummary top level fields?

@afedulov (Contributor, Author) commented Nov 14, 2023

@gyfora I'd like to propose using this format:

scalingTracking: |
  ---
  scalingRecords:
    "2023-11-14T20:20:39.250503Z":
      endTime: "2023-11-14T20:22:04.360972Z"
    "2023-11-14T20:22:04.360972Z":
      endTime: "2023-11-14T20:23:39.022220Z"

Some reasoning:

  • It allows associating the tracked restart time with the scaling history, which could be useful for debugging. Without the start timestamp, this would not be possible.
  • It lays out a data structure that is open for extension. This makes it trivial to add additional fields if we ever need some other scaling-scoped (rather than vertex-scoped) data in the future (which we currently do not anticipate). The decision to move it into a different config map in that case is a separate issue.
  • The overhead when compared with just storing the restart time is pretty minimal.

@mxm, I removed the redundant tracking of the target parallelism from the ScalingRecords. The current implementation relies entirely on the scalingHistory for this. I had to change the scaleResource signature to avoid fetching the tracking and the history twice (we need both in the JobAutoScalerImpl now). The Clock in ScalingExecutor became unnecessary.

@mxm (Contributor) commented Nov 15, 2023

Sounds reasonable. If we are optimizing state size at this level, how about we also remove RECOMMENDED_PARALLELISM and PARALLELISM from the history metrics, since they are duplicated as ScalingSummary top level fields?

We can file a JIRA for this. It requires more changes, since those metrics are also directly reported via Prometheus and removing them would be unexpected for users.

@afedulov force-pushed the 30593-restart branch 5 times, most recently from 4a01731 to 6bb5dd7, on November 16, 2023 21:54
@afedulov marked this pull request as ready for review on November 16, 2023 21:54
@afedulov (Contributor, Author) commented Nov 16, 2023

@mxm @gyfora the PR is ready for review


@gyfora Max and I briefly discussed this offline and came to the conclusion that starting by evaluating the maximum observed restart time, capped by the RESTART_TIME setting, is probably good enough for a first step. It has the benefit of giving the most "conservative" evaluation, and we can add the moving average after some baseline testing. What do you think?

@mxm (Contributor) left a comment

Thanks Alex!

Comment on lines +26 to +35
/**
* Class for tracking scaling details, including time it took for the job to transition to the
* target parallelism.
*/
@Data
@NoArgsConstructor
@AllArgsConstructor
public class ScalingRecord {
private Instant endTime;
}
Contributor:

Could we just store Instant directly and remove this wrapper class?

@afedulov (Author) commented Nov 20, 2023

My idea was to have structures in place that would be easily extensible. If we are sure we will never need to store anything else here, we can store the Instant directly.

@gyfora (Contributor) left a comment

I think if we are going to use Math.min(maxRestartTime, restartTimeFromConfig), we should increase the default restart time value in the config from 5 minutes to at least 15 minutes (maybe even higher, like 20-30).

With 5 minutes, almost all prod jobs with many pods, slower environments, or large state will probably take longer (say 10-15 minutes) and will be cut back to 5 minutes, leading to an underestimation and potentially multiple scale-ups.

What do you think?
cc @mxm

@mxm (Contributor) commented Nov 20, 2023

What you describe is already the case. Any deployment which currently takes longer than the configured rescale time will not scale accurately. By adding the feature in its current state, we are conservatively addressing those deployments which come back up much quicker. In our environment, the typical rescale time is 1 minute or less.

I think your request requires adding a new configuration MAX_RESTART_TIME because we want to (1) have a good default when we haven't yet estimated the rescale time (e.g. first scaling) via the existing setting, and (2) cap the determined rescale time for safety reasons (e.g. unexpected cluster downtime).

I'm ok with adding MAX_RESTART_TIME. It also solves the problem of adjusting rescale times of existing deployments. Most users already have a rescale time configured explicitly.

@afedulov (Contributor, Author) commented Nov 20, 2023

@mxm @gyfora I added the proposed MAX_RESTART_TIME option with the default value of 30 minutes:
ae79219#diff-c63eb3ce6a3229c2e0e664d8032f966df8e0b90a67b5e1119d8ec1d862599348

To keep things simple for users, I decided to only cap by it when the PREFER_TRACKED_RESTART_TIME option is enabled. Otherwise, the restart time is completely governed by the current RESTART_TIME setting, as before, even if it exceeds MAX_RESTART_TIME.
ae79219#diff-2241a4e55db07ff736cd174042179254aea0ca8d6884635b1146e5b5a9c17633R57
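
For illustration, a sketch of the selection logic described above. The helper below is hypothetical; the actual implementation is in the linked commit.

import java.time.Duration;

/** Hypothetical sketch of the capping behavior described above. */
class RestartTimeSelection {

    static Duration effectiveRestartTime(
            boolean preferTrackedRestartTime,   // PREFER_TRACKED_RESTART_TIME
            Duration configuredRestartTime,     // RESTART_TIME
            Duration maxRestartTime,            // MAX_RESTART_TIME
            Duration maxObservedRestartTime) {  // null if nothing has been observed yet
        if (!preferTrackedRestartTime || maxObservedRestartTime == null) {
            // Old behavior: governed solely by RESTART_TIME, even if it exceeds MAX_RESTART_TIME.
            return configuredRestartTime;
        }
        // New behavior: use the tracked maximum, capped by MAX_RESTART_TIME.
        return maxObservedRestartTime.compareTo(maxRestartTime) > 0
                ? maxRestartTime
                : maxObservedRestartTime;
    }
}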

@mxm (Contributor) left a comment

Thanks for following up so quickly! 🙏

public static final ConfigOption<Duration> MAX_RESTART_TIME =
        autoScalerConfig("restart.time.max")
                .durationType()
                .defaultValue(Duration.ofMinutes(30))
Contributor:

We have internal overrides for all configuration defaults, but I don't think this overly pessimistic value is a good default for the general public. I would set this to no more than 15 minutes; even that is still conservative.

@afedulov (Author):

Done.

@@ -142,9 +142,11 @@ public double getMaxRestartTimeSecondsOrDefault(Configuration conf) {
}
}
}
long restartTimeFromConfig = conf.get(AutoScalerOptions.RESTART_TIME).toSeconds();
long maxRestartTimeFromConfig = conf.get(AutoScalerOptions.MAX_RESTART_TIME).toSeconds();
return maxRestartTime == -1
? restartTimeFromConfig
Contributor:

I think we should always cap the RESTART_TIME by the configured MAX_RESTART_TIME (if configured).

@afedulov (Author):

@gyfora I'd like to hear your thoughts on this

@afedulov (Author):

Synced with Max offline - we decided to rename the option to TRACKED_RESTART_TIME_LIMIT to make the scope clear.

Contributor:

sounds good!

@mxm (Contributor) left a comment

LGTM. Thank you Alex!

@mxm (Contributor) commented Nov 23, 2023 via email

@gyfora (Contributor) left a comment

Sorry I did not have time to add some of these comments earlier. I wonder why we decided to keep a history instead of the exponential moving average; that would eliminate the bookkeeping.

Comment on lines 84 to 90
<td><h5>job.autoscaler.restart.time.tracked.enabled</h5></td>
<td style="word-wrap: break-word;">false</td>
<td>Boolean</td>
<td>Whether to use the actually observed rescaling restart times instead of the fixed 'job.autoscaler.restart.time' configuration. If set to true, the maximum restart duration over a number of samples will be used. The value of 'job.autoscaler.restart.time' will act as an upper bound.</td>
</tr>
<tr>
<td><h5>job.autoscaler.restart.time.tracked.limit</h5></td>
Contributor:

Sorry for the late comment, but we could consider changing this to job.autoscaler.restart.time-tracking.enabled and job.autoscaler.restart.time-tracking.limit, so that we avoid having another config key match a prefix (which would make it non-YAML-compliant).

Not a huge thing, but if we want to support YAML configs like in Flink, we will have to fix it eventually.


if (targetParallelismMatchesActual(
        targetParallelism, actualParallelism)) {
    value.setEndTime(now);
Contributor:

Should we maybe log this on debug, so we have an overview if we want to debug this?

Contributor:

+1

public class ScalingTracking {

/** Details related to recent rescaling operations. */
private final TreeMap<Instant, ScalingRecord> scalingRecords = new TreeMap<>();
Contributor:

I thought we were going to keep a single record and an exponential moving average.

@afedulov (Author):

I guess this got buried in the notifications:

@gyfora Max and I briefly discussed this offline and came to the conclusion that starting by evaluating the maximum observed restart time, capped by the RESTART_TIME setting, is probably good enough for a first step. It has the benefit of giving the most "conservative" evaluation, and we can add the moving average after some baseline testing. What do you think?

Contributor:

I saw this, but it doesn't mention anything about keeping a history etc. and refers to an offline discussion :)
Combined with the other comment about the trimming issue (losing the restart info after 24h), I think the exponential moving average is a simpler and slightly more robust initial approach.

@afedulov (Author) commented Nov 23, 2023

EMA requires knowing how many previous records in the window are taken into account, because this determines the weight coefficient of the new record (the smoothing factor). The length of the observation "window" is also supposed to be fixed rather than spanning all time from the beginning, so I am not sure we are talking about the classic definition of EMA. Maybe you could sketch the calculation you have in mind?

Contributor:

estimatedEMA = estimatedEMA * x + newMeasurement * (1 - x)
We could start with x = 0.5, which is pretty aggressive smoothing but should be fine given we don't have many scalings.

@afedulov (Author):

General question: we are missing a job-scoped data structure that keeps track of past rescaling details. Should we need to add something in the future, with the current structure it is as simple as adding data fields to the ScalingRecord. I am OK with removing the map, but the question is: are we sure we won't require something similar in the future anyway?

@afedulov (Author) commented Nov 23, 2023

I guess it could be argued that we can always send statistics about previous rescalings as metrics, but why do we then keep the vertex-based scalingHistory?

var cutoffTime = now.minus(keptTimeSpan);

// Remove records older than the cutoff time
scalingRecords.headMap(cutoffTime).clear();
Contributor:

Wouldn't this clear the history if we don't scale for 24 hours? Then we would fall back to the config?

Contributor:

We should keep at least one record in the history. But given that scalings do not happen that often, the history will only ever have 1-2 records, so EMA may be more robust. cc @mxm

Contributor:

We agreed offline to keep at least one observation, to avoid having to recalibrate the restart time when the observation history expires. EMA is a good alternative as well, but for now we chose to take the max of the last observed restart times.
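
A minimal sketch of that trimming rule (drop old records but always keep the most recent observation). The method name is hypothetical and this is not necessarily the merged code.

import java.time.Duration;
import java.time.Instant;
import java.util.TreeMap;

/** Hypothetical sketch of "trim old records but keep at least the latest one". */
class ScalingRecordTrimming {

    static <V> void trimKeepingLatest(TreeMap<Instant, V> records, Instant now, Duration keptTimeSpan) {
        if (records.size() <= 1) {
            return; // always keep at least one observation
        }
        Instant cutoff = now.minus(keptTimeSpan);
        Instant latestKey = records.lastKey();
        // Remove records older than the cutoff, but never the most recent one,
        // even if it is itself older than the cutoff.
        records.headMap(cutoff.isAfter(latestKey) ? latestKey : cutoff).clear();
    }
}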

@mxm (Contributor) commented Nov 24, 2023

The docs need to be regenerated after the last config change.

public static final ConfigOption<Boolean> PREFER_TRACKED_RESTART_TIME =
        autoScalerConfig("restart.time-tracking.enabled")
                .booleanType()
                .defaultValue(false)
Contributor:

This should probably be enabled by default. We can still change this default though.

@mxm (Contributor) commented Nov 24, 2023

Great work! Thanks @afedulov!

@mxm merged commit cdf32de into apache:main on Nov 24, 2023
119 checks passed
@afedulov (Author):
Thanks @mxm and @gyfora!

mxm pushed a commit to mxm/flink-kubernetes-operator that referenced this pull request Dec 1, 2023
…aler (apache#711)
