Description
We need some occupancy-based scaling for model groups. Installing a HorizontalPodAutoscaler that watches the CPU or memory of the model group pods gives a poor scaling signal for GPU-bound workloads. We have a few options.
Scaling on Proxy Metrics
Choosing these metrics would mean we need to know nothing about the inference engine running in the ReplicaSet. The metric could be the average request length or the request rate, both of which are already available to routing components above the inference engine.
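As a rough illustration, a routing proxy could expose these signals itself. This is a minimal sketch assuming a Python proxy that can observe each request it forwards; the metric names and port are illustrative, not a fixed interface.

```python
# Sketch: proxy-level metrics (request rate, in-flight count, prompt length)
# exposed for Prometheus or a KEDA prometheus trigger to scrape.
from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUESTS_TOTAL = Counter(
    "proxy_requests_total", "Requests forwarded to the model group", ["model_group"]
)
IN_FLIGHT = Gauge(
    "proxy_requests_in_flight", "Requests currently being served", ["model_group"]
)
PROMPT_TOKENS = Histogram(
    "proxy_request_prompt_tokens", "Prompt length of forwarded requests", ["model_group"]
)

def observe_request(model_group: str, prompt_tokens: int):
    """Wrap each proxied request; returns a context manager tracking in-flight count."""
    REQUESTS_TOTAL.labels(model_group).inc()
    PROMPT_TOKENS.labels(model_group).observe(prompt_tokens)
    return IN_FLIGHT.labels(model_group).track_inprogress()

if __name__ == "__main__":
    start_http_server(9090)  # serve /metrics
```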
Scaling on KV Cache Occupancy
We would need to export the state of the KV cache, or at least its occupancy, to an external component (Prometheus, a custom watcher, ...) which would either present the metrics or push scaling decisions to the autoscaler.
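A minimal exporter sketch, assuming each engine pod exposes its KV cache occupancy over HTTP at a /kv_cache_stats endpoint returning used/total block counts. That endpoint, the field names, and the static pod list are assumptions for illustration; a real engine would expose equivalent data through its own metrics.

```python
# Sketch: poll engine pods for KV cache occupancy and re-export it as a
# Prometheus gauge that an autoscaler can consume.
import time
import requests
from prometheus_client import Gauge, start_http_server

KV_OCCUPANCY = Gauge(
    "model_group_kv_cache_occupancy",
    "Fraction of KV cache blocks in use per engine pod",
    ["pod"],
)

ENGINE_PODS = {"model-a-0": "http://10.0.0.12:8000"}  # placeholder discovery

def scrape_once():
    for pod, base_url in ENGINE_PODS.items():
        stats = requests.get(f"{base_url}/kv_cache_stats", timeout=2).json()
        KV_OCCUPANCY.labels(pod).set(stats["used_blocks"] / stats["total_blocks"])

if __name__ == "__main__":
    start_http_server(9091)  # scraped by Prometheus / KEDA
    while True:
        scrape_once()
        time.sleep(5)
```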
This is a decision akin to the one discussed in this issue: in both cases the question is whether we should replicate state from the inference engine at the distributed level, or make the best decision we can from good proxies/estimations.
We also have to decide on the best method for implementing the scaling itself. KEDA is a useful technology here: it gives us the option of triggering on aggregated metrics or of using an external push from a watcher of the inference engines' state.
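For the metrics-driven variant, a KEDA ScaledObject with a prometheus trigger could scale the model group on the occupancy gauge sketched above. The deployment name, namespace, and threshold below are illustrative assumptions, not decided values.

```python
# Sketch: create a KEDA ScaledObject that scales a model group Deployment
# when average KV cache occupancy across its pods exceeds a threshold.
from kubernetes import client, config

scaled_object = {
    "apiVersion": "keda.sh/v1alpha1",
    "kind": "ScaledObject",
    "metadata": {"name": "model-group-a-scaler", "namespace": "inference"},
    "spec": {
        "scaleTargetRef": {"name": "model-group-a"},  # hypothetical Deployment
        "minReplicaCount": 1,
        "maxReplicaCount": 8,
        "triggers": [
            {
                "type": "prometheus",
                "metadata": {
                    "serverAddress": "http://prometheus.monitoring:9090",
                    "query": "avg(model_group_kv_cache_occupancy)",
                    "threshold": "0.8",
                },
            }
        ],
    },
}

if __name__ == "__main__":
    config.load_kube_config()
    client.CustomObjectsApi().create_namespaced_custom_object(
        group="keda.sh",
        version="v1alpha1",
        namespace="inference",
        plural="scaledobjects",
        body=scaled_object,
    )
```

The push-based alternative would instead implement a KEDA external scaler fed by the watcher of engine state, trading the Prometheus hop for tighter coupling to the engines.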