
[NOT FOR MERGING] Simulator for inference autoscaling #107954

Draft: wants to merge 12 commits into main
Conversation

@jan-elastic (Contributor) commented Apr 26, 2024:

[image: simulation output plot; see legend below]

Legend

Request count / inference time:

  • orange: real value
  • blue: estimated value by running average

Wait time:

  • blue: max wait time (bucketed by minute)
  • orange: average wait time (bucketed by minute)

Note: all times are in seconds

@elasticsearchmachine added the needs:triage (Requires assignment of a team area label) and v8.15.0 labels on Apr 26, 2024
@jan-elastic marked this pull request as draft on April 26, 2024 15:02
@gwbrown removed the needs:triage label on Apr 26, 2024
@jan-elastic (Contributor, Author) commented:

This is what you get with virtual cores / hyperthreading (notice the increase in latency / queue size when you start hitting hyperthreaded CPUs):
[image: simulation output plot with hyperthreaded CPUs]

@tveasey (Contributor) left a comment:

This looks good. I think the main consideration is whether to make the estimators more effective. For example, one natural choice would be to try to estimate the rate dynamics (say, the derivative). I also think we should consider rejigging the estimate of inference time to allow for change as a function of allocation count. My expectation is that this might yield significantly lower latency spikes.
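
To make the first suggestion concrete, a minimal sketch (not from this PR) of tracking the rate together with its derivative, using Holt-style double exponential smoothing rather than the script's Kalman-filter Estimator; the class name and smoothing constants are illustrative:

class TrendEstimator:
    """Illustrative sketch: track a level and a slope (the rate derivative) so
    predictions can anticipate ramps rather than lag like a plain running average."""

    def __init__(self, level=0.0, alpha=0.2, beta=0.1):
        self.level = level   # current estimate of the request rate
        self.slope = 0.0     # estimated change in rate per update
        self.alpha = alpha   # smoothing factor for the level
        self.beta = beta     # smoothing factor for the slope

    def add(self, value):
        prev_level = self.level
        self.level = self.alpha * value + (1 - self.alpha) * (prev_level + self.slope)
        self.slope = self.beta * (self.level - prev_level) + (1 - self.beta) * self.slope

    def predict(self, steps_ahead=1):
        return self.level + steps_ahead * self.slope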

Two outdated review comments on x-pack/dev-tools/simulate_autoscaling.py (resolved).
@jan-elastic (Contributor, Author) commented:

Example output:
[image: simulation output plot]

@tveasey (Contributor) left a comment:

Mainly I think it is worthwhile commenting in a few places if other people need to follow this code later. Also I have one comment regarding initialisation. Anyway, results LGTM.

def add(self, value, variance, dynamics_changed=False) -> None:
process_variance = variance / self.smoothing_factor

if dynamics_changed or abs(value - self.value)**2 / (self.variance + variance) > 100:
Review comment (Contributor):

I would add a comment here for the record...
Suggested change:
-    if dynamics_changed or abs(value - self.value)**2 / (self.variance + variance) > 100:
+    # If we know we likely had a change in the quantity we're estimating or the prediction is 10 std off
+    # we inject extra noise in the dynamics for this step.
+    if dynamics_changed or abs(value - self.value)**2 / (self.variance + variance) > 100:

@tveasey (Contributor) commented May 9, 2024:
One last thought. The estimator will behave badly with outliers: we'll immediately update the state to something in the vicinity of the outlier because of the 10 sigma rule.

For rate this is probably not so likely, but for inference time estimation there could be environmental factors which cause occasional very slow inferences. I would try simulating with occasional 10-20x inference time values. As a minimum I would probably drop the 10 sigma rule for the inference time Q estimate. The approach which estimates the constant of proportionality between average inference time and allocation count as an extra state variable will also be robust to outliers.
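
For example, such outliers could be injected into the simulation roughly like this (the function, noise width and outlier probability are hypothetical, not from the script):

import random


def sample_inference_time(mean_time, noise=0.005, outlier_prob=0.01):
    # Illustrative sketch: mean duration + uniform noise, with an occasional
    # environmental outlier that is 10-20x slower than normal.
    duration = mean_time + random.uniform(-noise, noise)
    if random.random() < outlier_prob:
        duration *= random.uniform(10.0, 20.0)
    return duration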

random.seed(170681)


class Estimator:
Review comment (Contributor):

I think it is useful to discuss the basic ideas behind this for anyone else coming to this code.
Suggested change:
- class Estimator:
+ class Estimator:
+     """
+     This implements a 1-d Kalman filter with manoeuvre detection. Rather than a derived dynamics model
+     we simply fix how much we want to smooth in the steady state.
+     """

return self.value + math.sqrt(self.variance)


class Simulator:
Review comment (Contributor):

As above, I would give some motivation for this:
Suggested change:
- class Simulator:
+ class Simulator:
+     """
+     We simulate a Poisson process for inference arrivals with a time-varying rate parameter. This models a
+     variety of rate dynamics: smooth ramp, smooth periodic, shock and steady.
+     Inference is modelled as mean duration + uniform noise. The mean duration captures the behaviour of
+     vCPUs, which is that throughput is largely constant after all physical cores on a node are occupied.
+     We assume perfect load balancing, i.e. that no allocation is ever idle whilst there are inferences to
+     be done. Inferences can only be picked off once they have arrived, and an allocation can only pick up a
+     new inference one "inference duration" after it picked its last one. We assume they are selected FIFO.
+     The key user parameters are the number of waiting inferences and the average and maximum delay to
+     receive each inference, which can be calculated from the difference between when inference calls
+     arrived and when they are available.
+     """

TIME_STEP = 0.001
time = START_TIME + TIME_STEP / 2
while time < END_TIME:
    if random.random() < self.get_request_rate(time) * TIME_STEP:
Review comment (Contributor):

This saturates if the rate is > 1 / 0.001. I guess it doesn't matter for the parameters you use for simulation, but it would be a little more robust if you generate the next time instead as t += numpy.random.exponential(scale=1/self.get_request_rate(time)).

Review comment (Contributor):

Reading on I see this will have an impact on implementation of run stat estimation. This could all be resolved, but let's not bother to do that now. I think a comment is warranted that the simulator should only use rate << 1 / TIME_STEP.
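
A sketch of the exponential inter-arrival approach mentioned above (assuming the rate is always positive and changes slowly relative to the inter-arrival gaps; the function name is illustrative):

import numpy


def generate_arrival_times(get_request_rate, start_time, end_time):
    # Draw each gap from an exponential distribution instead of flipping a coin
    # every TIME_STEP, so arrivals do not saturate as the rate approaches 1 / TIME_STEP.
    arrivals = []
    time = start_time
    while True:
        time += numpy.random.exponential(scale=1.0 / get_request_rate(time))
        if time >= end_time:
            return arrivals
        arrivals.append(time)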

Comment on lines +115 to +116
latency_estimator = Estimator(0.1, 100, 1e6)
rate_estimator = Estimator(0, 100, 1e3)
Review comment (Contributor):

It's important to tie the initial variance to the measurement variance.

Alternatively, my inclination would probably be to initialise variance as None and fix the gain for the first measurement, i.e. gain = 0.99 if variance is None else (self.variance + process_variance) / (self.variance + process_variance + variance).
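
That alternative might look roughly like this inside add() (a sketch only; manoeuvre detection omitted and the surrounding state handling assumed):

def add(self, value, variance, dynamics_changed=False) -> None:
    process_variance = variance / self.smoothing_factor
    if self.variance is None:
        # First measurement: use a fixed high gain rather than an arbitrary
        # initial variance, and seed the posterior variance from the measurement.
        gain = 0.99
        self.variance = variance
    else:
        gain = (self.variance + process_variance) / (self.variance + process_variance + variance)
    self.value += gain * (value - self.value)
    self.variance = (1 - gain) * (self.variance + process_variance)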

self.real_loads.append((time, self.get_request_rate(time) * self.get_avg_inference_time() / self.num_allocations))

# autoscaling
needed_allocations_lower = latency_estimator.lower() * rate_estimator.lower()
@tveasey (Contributor) commented May 9, 2024:

Not important here, but in the actual system, if we suddenly stop receiving inference requests after having received them at a high rate, we will never trigger downscaling if this is event-driven. You will instead need some sort of heartbeat to check scaling.
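
Not part of this PR, but such a heartbeat could be as simple as a periodic re-evaluation; the interval and function names are illustrative assumptions:

import threading


def start_scaling_heartbeat(check_scaling, interval_seconds=60.0):
    # Re-evaluate scaling on a timer as well as on inference events, so that
    # downscaling still triggers after requests stop arriving entirely.
    stop = threading.Event()

    def loop():
        while not stop.wait(interval_seconds):
            check_scaling()

    threading.Thread(target=loop, daemon=True).start()
    return stop  # call stop.set() to end the heartbeat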
