
[NOT FOR MERGING] Simulator for inference autoscaling #107954

Draft: wants to merge 12 commits into main
Conversation

@jan-elastic (Contributor) commented Apr 26, 2024:

[image: simulation output plot; see legend below]

Legend

Request count / inference time:

  • orange: real value
  • blue: estimated value by running average

Wait time:

  • blue: max wait time (bucketed by minute)
  • orange: average wait time (bucketed by minute)

Note: all times are in seconds

@elasticsearchmachine added the needs:triage (Requires assignment of a team area label) and v8.15.0 labels on Apr 26, 2024
@jan-elastic marked this pull request as draft on April 26, 2024 15:02
@gwbrown removed the needs:triage label on Apr 26, 2024
@jan-elastic (Contributor, Author) commented:

This is what you get with virtual cores / hyperthreading (notice the increase in latency / queue size when you start hitting hyperthreaded CPUs):
[image: simulation output plot with hyperthreaded CPUs]

@tveasey (Contributor) left a comment:

This looks good. I think the main consideration is whether to make the estimators more effective. For example, one natural choice would be to try to estimate the rate dynamics (say, the derivative). I also think we should consider rejigging the estimate of inference time to allow for change as a function of allocation count. My expectation is that this might yield significantly lower latency spikes.
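
To make the first suggestion concrete, a minimal sketch (not from this PR) of tracking the rate together with its derivative, using Holt-style double exponential smoothing rather than the script's Kalman-filter Estimator; the class name and smoothing constants are illustrative:

class TrendEstimator:
    """Illustrative sketch: track a level and a slope (the rate derivative) so
    predictions can anticipate ramps rather than lag like a plain running average."""

    def __init__(self, level=0.0, alpha=0.2, beta=0.1):
        self.level = level   # current estimate of the request rate
        self.slope = 0.0     # estimated change in rate per update
        self.alpha = alpha   # smoothing factor for the level
        self.beta = beta     # smoothing factor for the slope

    def add(self, value):
        prev_level = self.level
        self.level = self.alpha * value + (1 - self.alpha) * (prev_level + self.slope)
        self.slope = self.beta * (self.level - prev_level) + (1 - self.beta) * self.slope

    def predict(self, steps_ahead=1):
        return self.level + steps_ahead * self.slope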

Two outdated review comments on x-pack/dev-tools/simulate_autoscaling.py (resolved).
@jan-elastic (Contributor, Author) commented:

Example output:
[image: simulation output plot]

@tveasey (Contributor) left a comment:

Mainly I think it is worthwhile commenting in a few places if other people need to follow this code later. Also I have one comment regarding initialisation. Anyway, results LGTM.

def add(self, value, variance, dynamics_changed=False) -> None:
process_variance = variance / self.smoothing_factor

if dynamics_changed or abs(value - self.value)**2 / (self.variance + variance) > 100:
Review comment (Contributor):

I would add a comment here for the record...
Suggested change:
-    if dynamics_changed or abs(value - self.value)**2 / (self.variance + variance) > 100:
+    # If we know we likely had a change in the quantity we're estimating or the prediction is 10 std off
+    # we inject extra noise in the dynamics for this step.
+    if dynamics_changed or abs(value - self.value)**2 / (self.variance + variance) > 100:

@tveasey (Contributor) commented May 9, 2024:
One last thought. The estimator will behave badly with outliers: we'll immediately update the state to something in the vicinity of the outlier because of the 10 sigma rule.

For rate this is probably not so likely, but for inference time estimation there could be environmental factors which cause occasional very slow inferences. I would try simulating with occasional 10-20x inference time values. As a minimum I would probably drop the 10 sigma rule for the inference time Q estimate. The approach which estimates the constant of proportionality between average inference time and allocation count as an extra state variable will also be robust to outliers.
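
For example, such outliers could be injected into the simulation roughly like this (the function, noise width and outlier probability are hypothetical, not from the script):

import random


def sample_inference_time(mean_time, noise=0.005, outlier_prob=0.01):
    # Illustrative sketch: mean duration + uniform noise, with an occasional
    # environmental outlier that is 10-20x slower than normal.
    duration = mean_time + random.uniform(-noise, noise)
    if random.random() < outlier_prob:
        duration *= random.uniform(10.0, 20.0)
    return duration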

random.seed(170681)


class Estimator:
Review comment (Contributor):

I think it is useful to discuss the basic ideas behind this for anyone else coming to this code.
Suggested change:
- class Estimator:
+ class Estimator:
+     """
+     This implements a 1-d Kalman filter with manoeuvre detection. Rather than a derived dynamics model
+     we simply fix how much we want to smooth in the steady state.
+     """

return self.value + math.sqrt(self.variance)


class Simulator:
Review comment (Contributor):

As above, I would give some motivation for this:
Suggested change:
- class Simulator:
+ class Simulator:
+     """
+     We simulate a Poisson process for inference arrivals with a time-varying rate parameter. This models a
+     variety of rate dynamics: smooth ramp, smooth periodic, shock and steady.
+     Inference is modelled as mean duration + uniform noise. The mean duration captures the behaviour of
+     vCPUs, which is that throughput is largely constant after all physical cores on a node are occupied.
+     We assume perfect load balancing, i.e. that no allocation is ever idle whilst there are inferences to
+     be done. Inferences can only be picked off once they have arrived, and an allocation can only pick up a
+     new inference one "inference duration" after it picked its last one. We assume they are selected FIFO.
+     The key user parameters are the number of waiting inferences and the average and maximum delay to
+     receive each inference, which can be calculated from the difference between when inference calls
+     arrived and when they are available.
+     """

TIME_STEP = 0.001
time = START_TIME + TIME_STEP / 2
while time < END_TIME:
    if random.random() < self.get_request_rate(time) * TIME_STEP:
Review comment (Contributor):

This saturates if the rate is > 1 / 0.001. I guess it doesn't matter for the parameters you use for simulation, but it would be a little more robust if you generate the next time instead as t += numpy.random.exponential(scale=1/self.get_request_rate(time)).

Review comment (Contributor):

Reading on I see this will have an impact on implementation of run stat estimation. This could all be resolved, but let's not bother to do that now. I think a comment is warranted that the simulator should only use rate << 1 / TIME_STEP.
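
A sketch of the exponential inter-arrival approach mentioned above (assuming the rate is always positive and changes slowly relative to the inter-arrival gaps; the function name is illustrative):

import numpy


def generate_arrival_times(get_request_rate, start_time, end_time):
    # Draw each gap from an exponential distribution instead of flipping a coin
    # every TIME_STEP, so arrivals do not saturate as the rate approaches 1 / TIME_STEP.
    arrivals = []
    time = start_time
    while True:
        time += numpy.random.exponential(scale=1.0 / get_request_rate(time))
        if time >= end_time:
            return arrivals
        arrivals.append(time)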

Comment on lines +115 to +116
latency_estimator = Estimator(0.1, 100, 1e6)
rate_estimator = Estimator(0, 100, 1e3)
Review comment (Contributor):

It's important to tie the initial variance to the measurement variance.

Alternatively, my inclination would probably be to initialise variance as None and fix the gain for the first measurement, i.e. gain = 0.99 if variance is None else (self.variance + process_variance) / (self.variance + process_variance + variance).
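
That alternative might look roughly like this inside add() (a sketch only; manoeuvre detection omitted and the surrounding state handling assumed):

def add(self, value, variance, dynamics_changed=False) -> None:
    process_variance = variance / self.smoothing_factor
    if self.variance is None:
        # First measurement: use a fixed high gain rather than an arbitrary
        # initial variance, and seed the posterior variance from the measurement.
        gain = 0.99
        self.variance = variance
    else:
        gain = (self.variance + process_variance) / (self.variance + process_variance + variance)
    self.value += gain * (value - self.value)
    self.variance = (1 - gain) * (self.variance + process_variance)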

self.real_loads.append((time, self.get_request_rate(time) * self.get_avg_inference_time() / self.num_allocations))

# autoscaling
needed_allocations_lower = latency_estimator.lower() * rate_estimator.lower()
@tveasey (Contributor) commented May 9, 2024:

Not important here, but in the actual system, if we suddenly stop receiving inference requests after having received them at a high rate, we will never trigger downscaling if this is event-driven. You will instead need some sort of heartbeat to check scaling.
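
Not part of this PR, but such a heartbeat could be as simple as a periodic re-evaluation; the interval and function names are illustrative assumptions:

import threading


def start_scaling_heartbeat(check_scaling, interval_seconds=60.0):
    # Re-evaluate scaling on a timer as well as on inference events, so that
    # downscaling still triggers after requests stop arriving entirely.
    stop = threading.Event()

    def loop():
        while not stop.wait(interval_seconds):
            check_scaling()

    threading.Thread(target=loop, daemon=True).start()
    return stop  # call stop.set() to end the heartbeat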
