Dynamic Scaling #6

@pjb157

We need some occupancy-based scaling for model groups. Installing a HorizontalPodAutoscaler to watch the CPU or memory of the model-group pods is a poor scaling signal for GPU-bound workloads. We have a few options.
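
For reference, this is roughly the resource-based HPA we would get out of the box (names like `model-group` are placeholders); CPU and memory utilization simply do not track GPU saturation:

```yaml
# Baseline resource-based HPA -- the approach that falls short for GPU-bound pods.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-group-hpa          # placeholder name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-group            # hypothetical model-group Deployment
  minReplicas: 1
  maxReplicas: 8
  metrics:
    - type: Resource
      resource:
        name: cpu                # CPU says little about GPU/KV-cache pressure
        target:
          type: Utilization
          averageUtilization: 80
```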

Scaling on Proxy Metrics

Scaling on proxy metrics means we need to know nothing about the inference engine running in the ReplicaSet. The metric could be the average request length or the rate of requests, both of which are available to the routing components above the inference engine.
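
As a sketch of the proxy-metric route: if an external metrics adapter (e.g. prometheus-adapter) exposes a request-rate metric from the routing layer, a plain HPA can consume it without knowing anything about the engine. `proxy_requests_per_second` and the target value below are hypothetical:

```yaml
# HPA on an external proxy metric; assumes an external metrics adapter
# (e.g. prometheus-adapter) serving a hypothetical proxy_requests_per_second.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-group-proxy-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-group
  minReplicas: 1
  maxReplicas: 8
  metrics:
    - type: External
      external:
        metric:
          name: proxy_requests_per_second   # hypothetical metric from the router
        target:
          type: AverageValue
          averageValue: "10"                # target RPS per replica (made up)
```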

Scaling on KV Cache Occupancy

We would need to export the state of the KV cache, or at least its occupancy, to an external component (Prometheus, a custom service, ...) which would either present the metrics to the autoscaler or push decisions to it.
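
A minimal sketch of the export side, assuming the engine pods already serve a Prometheus endpoint with an occupancy gauge (vLLM, for example, exposes `vllm:gpu_cache_usage_perc`; the labels and port name here are hypothetical):

```yaml
# Scrape KV-cache occupancy from the engine pods with prometheus-operator.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: model-group-engine-metrics
spec:
  selector:
    matchLabels:
      app: model-group        # hypothetical pod label
  podMetricsEndpoints:
    - port: metrics           # assumes the engine container exposes a port named "metrics"
      interval: 5s            # occupancy moves quickly, so scrape tightly
```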

This is a decision akin to the one discussed in this issue: in both cases the question is whether we should replicate state out of the inference engine at the distributed level, or make the best decision possible from good proxies/estimations.

We also have to decide on the best method for implementing the scaling itself. KEDA is a useful technology that would give us the option of triggering on aggregated metrics or of using an external push from a watcher of the inference engines' state.
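
Both KEDA modes in one hedged sketch: a pull trigger on an aggregated Prometheus query and a push trigger from an external watcher. Every address, query, and threshold below is an assumption, not a settled design:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: model-group-scaler
spec:
  scaleTargetRef:
    name: model-group         # hypothetical Deployment
  minReplicaCount: 1
  maxReplicaCount: 8
  triggers:
    # Pull: scale when average KV-cache occupancy across the group exceeds 80%.
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090   # assumed Prometheus address
        query: avg(vllm:gpu_cache_usage_perc{app="model-group"})
        threshold: "0.8"
    # Push: a watcher of engine state implementing KEDA's external-push gRPC contract.
    - type: external-push
      metadata:
        scalerAddress: engine-watcher.monitoring:6000      # hypothetical watcher service
```

In practice we would likely settle on one trigger, but the two can coexist: the HPA that KEDA manages takes the maximum replica count proposed across triggers.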
