Description
We need some occupancy-based scaling for model groups. Installing a HorizontalPodAutoscaler that watches the CPU or memory of the model group pods gives a poor scaling signal for GPU-bound workloads. We have a few options.
Scaling on Proxy Metrics
Choosing these metrics would mean we need to know nothing about the inference engine running in the ReplicaSet. The metric could be the average request length or the request rate, both of which are already available to routing components above the inference engine.
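As a rough illustration, a routing proxy could expose these signals itself. This is a minimal sketch assuming a Python proxy that can observe each request it forwards; the metric names and port are illustrative, not a fixed interface.

```python
# Sketch: proxy-level metrics (request rate, in-flight count, prompt length)
# exposed for Prometheus or a KEDA prometheus trigger to scrape.
from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUESTS_TOTAL = Counter(
    "proxy_requests_total", "Requests forwarded to the model group", ["model_group"]
)
IN_FLIGHT = Gauge(
    "proxy_requests_in_flight", "Requests currently being served", ["model_group"]
)
PROMPT_TOKENS = Histogram(
    "proxy_request_prompt_tokens", "Prompt length of forwarded requests", ["model_group"]
)

def observe_request(model_group: str, prompt_tokens: int):
    """Wrap each proxied request; returns a context manager tracking in-flight count."""
    REQUESTS_TOTAL.labels(model_group).inc()
    PROMPT_TOKENS.labels(model_group).observe(prompt_tokens)
    return IN_FLIGHT.labels(model_group).track_inprogress()

if __name__ == "__main__":
    start_http_server(9090)  # serve /metrics
```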
Scaling on KV Cache Occupancy
We would need to export the state of the KV cache, or at least its occupancy, to an external component (Prometheus, a custom watcher, ...) which would either present the metrics or push scaling decisions to the autoscaler.
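A minimal exporter sketch, assuming each engine pod exposes its KV cache occupancy over HTTP at a /kv_cache_stats endpoint returning used/total block counts. That endpoint, the field names, and the static pod list are assumptions for illustration; a real engine would expose equivalent data through its own metrics.

```python
# Sketch: poll engine pods for KV cache occupancy and re-export it as a
# Prometheus gauge that an autoscaler can consume.
import time
import requests
from prometheus_client import Gauge, start_http_server

KV_OCCUPANCY = Gauge(
    "model_group_kv_cache_occupancy",
    "Fraction of KV cache blocks in use per engine pod",
    ["pod"],
)

ENGINE_PODS = {"model-a-0": "http://10.0.0.12:8000"}  # placeholder discovery

def scrape_once():
    for pod, base_url in ENGINE_PODS.items():
        stats = requests.get(f"{base_url}/kv_cache_stats", timeout=2).json()
        KV_OCCUPANCY.labels(pod).set(stats["used_blocks"] / stats["total_blocks"])

if __name__ == "__main__":
    start_http_server(9091)  # scraped by Prometheus / KEDA
    while True:
        scrape_once()
        time.sleep(5)
```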
This is a decision akin to the one discussed in this issue: in both cases the question is whether we should replicate state from the inference engine at the distributed level, or make the best decision we can from good proxies/estimations.
We also have to decide on the best method for implementing the scaling itself. KEDA is a useful technology here: it gives us the option of triggering on aggregated metrics or of using an external push from a watcher of the inference engines' state.
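For the metrics-driven variant, a KEDA ScaledObject with a prometheus trigger could scale the model group on the occupancy gauge sketched above. The deployment name, namespace, and threshold below are illustrative assumptions, not decided values.

```python
# Sketch: create a KEDA ScaledObject that scales a model group Deployment
# when average KV cache occupancy across its pods exceeds a threshold.
from kubernetes import client, config

scaled_object = {
    "apiVersion": "keda.sh/v1alpha1",
    "kind": "ScaledObject",
    "metadata": {"name": "model-group-a-scaler", "namespace": "inference"},
    "spec": {
        "scaleTargetRef": {"name": "model-group-a"},  # hypothetical Deployment
        "minReplicaCount": 1,
        "maxReplicaCount": 8,
        "triggers": [
            {
                "type": "prometheus",
                "metadata": {
                    "serverAddress": "http://prometheus.monitoring:9090",
                    "query": "avg(model_group_kv_cache_occupancy)",
                    "threshold": "0.8",
                },
            }
        ],
    },
}

if __name__ == "__main__":
    config.load_kube_config()
    client.CustomObjectsApi().create_namespaced_custom_object(
        group="keda.sh",
        version="v1alpha1",
        namespace="inference",
        plural="scaledobjects",
        body=scaled_object,
    )
```

The push-based alternative would instead implement a KEDA external scaler fed by the watcher of engine state, trading the Prometheus hop for tighter coupling to the engines.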