Add topology spread constraints to deployments #332

Conversation

Contributor

@deadlycoconuts deadlycoconuts commented Apr 12, 2023

Context

Topology spread constraints allow the distribution of pods across various topology domains to be controlled at a finer level than pod affinities/anti-affinities. Since the current implementation of the Turing API server does not allow Knative services (Turing Routers, enrichers, pyfunc/docker ensemblers) to be deployed with topology spread constraints, this PR introduces changes that allow the Turing API server operator to specify topology spread constraints in the Turing API configuration, which will then be propagated to all Knative services deployed.

In addition, in order to encourage pods of the same Knative service deployed by the Turing API server to be scheduled across as wide a variety of nodes as possible, thereby promoting high availability, an additional match label is added by default to all constraints specified in the Turing API configuration:

```yaml
matchLabels:
    app: [knative-service-name]
```

The above behaviour occurs for all constraints, even those without any other match labels or label selectors specified.
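
A minimal sketch of that injection step, assuming a hypothetical helper name and using the upstream Kubernetes API types (the actual change lives in api/turing/cluster/knative_service.go):

```go
package cluster

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// appendDefaultMatchLabel (hypothetical name) deep-copies the configured
// constraints and adds the default `app: [knative-service-name]` match label
// to each one, creating a label selector where none was specified.
func appendDefaultMatchLabel(constraints []corev1.TopologySpreadConstraint, svcName string) []corev1.TopologySpreadConstraint {
	out := make([]corev1.TopologySpreadConstraint, 0, len(constraints))
	for i := range constraints {
		c := constraints[i].DeepCopy()
		if c.LabelSelector == nil {
			c.LabelSelector = &metav1.LabelSelector{}
		}
		if c.LabelSelector.MatchLabels == nil {
			c.LabelSelector.MatchLabels = map[string]string{}
		}
		c.LabelSelector.MatchLabels["app"] = svcName
		out = append(out, *c)
	}
	return out
}
```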

Modifications

  • api/turing/cluster/knative_service.go - Addition of steps to inject the aforementioned match label into all constraints of the Knative service to be deployed
  • api/turing/cluster/servicebuilder/service_builder.go - Addition of methods to store the constraints specified in the Turing API configuration and copy them onto any Knative service to be deployed
  • api/turing/config/config.go - Addition of a TopologySpreadConstraints field to read the constraints stored in the Turing API configuration (a sketch of this field follows below)
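
For illustration, the new configuration field could be declared along these lines; the enclosing struct name (DeploymentConfig) is an assumption and may not match the actual layout of config.go:

```go
package config

import corev1 "k8s.io/api/core/v1"

// DeploymentConfig is a placeholder name for the configuration section that
// holds deployment-related settings.
type DeploymentConfig struct {
	// ... existing deployment settings ...

	// TopologySpreadConstraints are applied to every Knative service
	// (routers, enrichers, pyfunc/docker ensemblers) deployed by the
	// Turing API server.
	TopologySpreadConstraints []corev1.TopologySpreadConstraint
}
```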

@deadlycoconuts deadlycoconuts self-assigned this Apr 12, 2023
@deadlycoconuts deadlycoconuts requested a review from a team April 13, 2023 05:11
@deadlycoconuts deadlycoconuts marked this pull request as ready for review April 13, 2023 05:11
Collaborator

@krithika369 krithika369 left a comment

Thanks for the concise PR, @deadlycoconuts! LGTM. 🚀

@deadlycoconuts
Contributor Author

Thanks for the quick review! Merging this! :D

@deadlycoconuts deadlycoconuts merged commit 7390e9e into caraml-dev:main Apr 14, 2023
@deadlycoconuts deadlycoconuts deleted the add_topology_spread_constraints_to_deployments branch April 14, 2023 05:05
krithika369 pushed a commit to caraml-dev/merlin that referenced this pull request May 11, 2023
**What this PR does / why we need it**:
Similar to how caraml-dev/turing#332 allows an
operator to define topology spread constraints for pods deployed by the
API server, this PR introduces changes that allow the Merlin API server
operator to specify topology spread constraints within the
`environmentConfigs` of its chart values `.yaml` file, which will then
be propagated to all the model service pods deployed via the API
server:
```yaml
environmentConfigs:
  - name: "merlin-dev"
    is_default: true
    cluster: "dev-cluster"
    deployment_timeout: "10m"
    namespace_timeout: "2m"
    max_cpu: "8"
    max_memory: "8Gi"
    topology_spread_constraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway
   ...
```
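
For illustration only, a struct along these lines could hold the
`environmentConfigs` entry shown above; the struct and field names here
are assumptions rather than the actual Merlin code:

```go
package config

import corev1 "k8s.io/api/core/v1"

// EnvironmentConfig is a hypothetical representation of one entry under
// environmentConfigs in the chart values file.
type EnvironmentConfig struct {
	Name                      string                            `yaml:"name"`
	IsDefault                 bool                              `yaml:"is_default"`
	Cluster                   string                            `yaml:"cluster"`
	MaxCPU                    string                            `yaml:"max_cpu"`
	MaxMemory                 string                            `yaml:"max_memory"`
	TopologySpreadConstraints []corev1.TopologySpreadConstraint `yaml:"topology_spread_constraints"`
	// ... other fields such as deployment_timeout and namespace_timeout ...
}
```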

Similarly, in order to encourage pods of the same Knative service
deployed by the Merlin API server to be scheduled across as wide a
variety of nodes as possible, thereby promoting high availability, an
additional match label is added by default to all constraints specified
in the Merlin API configuration:
```yaml
matchLabels:
    app: [knative-service-name]
```
The above behaviour occurs for all constraints, even those without any
other match labels or label selectors specified.

### Some extra details
Merlin model services (both predictors and transformers) are deployed
as KServe `InferenceService`-s, which are themselves built upon Knative
services (see
[here](https://kserve.github.io/website/0.10/get_started/first_isvc/)
for more info) and can be of either the `Serverless` or the
`RawDeployment` type. The two types generate different values for the
`metadata.labels.app` label in the pod manifest, based on the
`InferenceService` name:

`Serverless` type:
- `InferenceService` name: `sklearn-sample`
  Autogenerated label: `app=sklearn-sample-1-predictor-default-00001`
  Schema: `[isvc name]-[predictor/transformer]-default-[5-digit 0-padded revision number]`

`RawDeployment` type:
- `InferenceService` name: `sklearn-sample`
  Autogenerated label: `isvc.sklearn-sample-1-predictor-default`
  Schema: `isvc.[isvc name]-[predictor/transformer]-default`

where the `isvc name` includes the version number (of the model service)
automatically
[generated](https://github.com/caraml-dev/merlin/blob/d455cad616a89093c7db81de9580324f0cf64fe5/api/models/service.go#L52)
by the API server.
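
For illustration only, the two schemas above could be reproduced with
helpers like the following; the function names and signatures are
assumptions, not Merlin's actual methods:

```go
package main

import "fmt"

// rawDeploymentAppLabel follows `isvc.[isvc name]-[predictor/transformer]-default`.
func rawDeploymentAppLabel(isvcName, component string) string {
	return fmt.Sprintf("isvc.%s-%s-default", isvcName, component)
}

// serverlessAppLabel follows
// `[isvc name]-[predictor/transformer]-default-[revision number]`, with the
// revision number zero-padded to 5 digits as in the example above.
func serverlessAppLabel(isvcName, component string, revision int) string {
	return fmt.Sprintf("%s-%s-default-%05d", isvcName, component, revision)
}

func main() {
	fmt.Println(rawDeploymentAppLabel("sklearn-sample-1", "predictor"))
	// isvc.sklearn-sample-1-predictor-default
	fmt.Println(serverlessAppLabel("sklearn-sample-1", "predictor", 1))
	// sklearn-sample-1-predictor-default-00001
}
```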

Hence, separate methods are defined to handle these two cases. Note in
particular that redeploying a `RawDeployment`-type `InferenceService`
does not change the `app` label value, whereas redeploying a
`Serverless`-type `InferenceService` does, since the Knative service
revision number is incremented. As such, whenever a `Serverless`
deployment is redeployed, we need to determine the to-be-deployed
revision number (which will be automatically created by the Knative
controller/webhook) by retrieving the `LatestReadyRevision` of the
existing deployment and incrementing it by 1.
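
A minimal sketch of that revision lookup, assuming the revision number
can be parsed from the numeric suffix of the `LatestReadyRevision` name
(the helper name is illustrative):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// nextRevisionNumber extracts the trailing number from a revision name such as
// "sklearn-sample-1-predictor-default-00001" and returns it incremented by 1.
func nextRevisionNumber(latestReadyRevision string) (int, error) {
	parts := strings.Split(latestReadyRevision, "-")
	last := parts[len(parts)-1]
	n, err := strconv.Atoi(last)
	if err != nil {
		return 0, fmt.Errorf("unexpected revision name %q: %w", latestReadyRevision, err)
	}
	return n + 1, nil
}

func main() {
	next, _ := nextRevisionNumber("sklearn-sample-1-predictor-default-00001")
	fmt.Println(next) // 2
}
```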

This differs slightly from the Turing API server, which sets the `app`
label value to be exactly the Knative service name (the Turing API
server deploys routers directly as Knative services).

**Which issue(s) this PR fixes**:

Fixes #

**Does this PR introduce a user-facing change?**:
```release-note
NONE
```

**Checklist**

- [ ] Added unit test, integration, and/or e2e tests
- [ ] Tested locally
- [ ] Updated documentation
- [ ] Updated Swagger spec if the PR introduces API changes
- [ ] Regenerated Golang and Python clients if the PR introduces API changes