Add topology spread constraints to deployments #332
Merged
deadlycoconuts merged 6 commits into caraml-dev:main from deadlycoconuts:add_topology_spread_constraints_to_deployments on Apr 14, 2023
Conversation
krithika369 approved these changes on Apr 14, 2023
Thanks for the concise PR, @deadlycoconuts! LGTM. 🚀
Thanks for the quick review! Merging this! :D
deadlycoconuts deleted the add_topology_spread_constraints_to_deployments branch on April 14, 2023 05:05
krithika369 pushed a commit to caraml-dev/merlin that referenced this pull request on May 11, 2023
**What this PR does / why we need it**: Similar to how caraml-dev/turing#332 allows an operator to define topology spread constraints for pods deployed by the API server, this PR introduces changes to allow the Merlin API server operator to specify topology spread constraints within the `environmentConfigs` of its chart values `.yaml` file, which will be propagated to all the model service pods deployed via the API server:

```yaml
environmentConfigs:
  - name: "merlin-dev"
    is_default: true
    cluster: "dev-cluster"
    deployment_timeout: "10m"
    namespace_timeout: "2m"
    max_cpu: "8"
    max_memory: "8Gi"
    topology_spread_constraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway
  ...
```

Similarly, in order to encourage pods of the same Knative service deployed by the Merlin API server to be scheduled across as wide a variety of nodes as possible for high availability, an additional match label is added by default to all constraints specified in the `environmentConfigs`:

```yaml
matchLabels:
  app: [knative-service-name]
```

The above behaviour occurs for all constraints, even those without any other match labels or label selectors specified.

### Some extra details

Merlin model service deployments (both predictors and transformers) are deployed as KServe `InferenceService`-s, which are themselves built upon Knative services (see [here](https://kserve.github.io/website/0.10/get_started/first_isvc/) for more info) and can be of either the `Serverless` or the `RawDeployment` type. The two types generate different values for the `metadata.labels.app` field (label) in the pod configuration (pod manifest file), based on the `InferenceService` names:

`Serverless` type:
- `InferenceService` name: `sklearn-sample`
- Autogenerated label: `app=sklearn-sample-1-predictor-default-00001`
- Schema: `[isvc name]-[predictor/transformer]-default-[6 digit 0-padded revision number]`

`RawDeployment` type:
- `InferenceService` name: `sklearn-sample`
- Autogenerated label: `isvc.sklearn-sample-1-predictor-default`
- Schema: `isvc.[isvc name]-[predictor/transformer]-default`

where the `isvc name` includes the version number (of the model service) automatically [generated](https://github.com/caraml-dev/merlin/blob/d455cad616a89093c7db81de9580324f0cf64fe5/api/models/service.go#L52) by the API server. Hence, separate methods are defined to handle these two cases differently.

Note in particular that a model redeployment of a `RawDeployment`-type `InferenceService` does not change the `app` label value, whereas it does change a `Serverless`-type's label via the increment in the Knative service revision number. As such, whenever a `Serverless` deployment is redeployed, we need to determine the to-be-deployed revision number (which will be automatically created by the Knative controller/webhook) by retrieving the `LatestReadyRevision` of the existing deployment and incrementing it by 1. This differs slightly from how the Turing API server sets the `app` label value to be exactly the same as the Knative service name (the Turing API server deploys routers directly as Knative services).

**Which issue(s) this PR fixes**:
<!-- *Automatically closes linked issue when PR is merged. Usage: `Fixes #<issue number>`, or `Fixes (paste link of issue)`. -->
Fixes #

**Does this PR introduce a user-facing change?**:

```release-note
NONE
```

**Checklist**
- [ ] Added unit test, integration, and/or e2e tests
- [ ] Tested locally
- [ ] Updated documentation
- [ ] Update Swagger spec if the PR introduces API changes
- [ ] Regenerated Golang and Python client if the PR introduces API changes
Context
Topology spread constraints allow the distribution of pods across various topology domains to be controlled at a finer level than with pod affinities/anti-affinities. As the current implementation of the Turing API server does not allow Knative services (Turing Routers, enrichers, pyfunc/docker ensemblers) to be deployed with topology spread constraints, this PR introduces changes to allow the Turing API server operator to specify topology spread constraints in the Turing API configuration which will get propagated to all Knative services deployed.
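As a rough illustration only, the operator-facing configuration could look something like the sketch below; the `additionalTopologySpreadConstraints` field name is taken from the Modifications section further down, but its exact nesting within the Turing API configuration file is an assumption here rather than the definitive schema.

```yaml
# Hypothetical sketch: the nesting under KnativeService is an assumption,
# not the definitive Turing API configuration layout.
KnativeService:
  additionalTopologySpreadConstraints:
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: ScheduleAnyway
```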
In addition, to encourage pods of the same Knative service deployed by the Turing API server to be scheduled across as wide a variety of nodes as possible for high availability, an additional match label is added by default to all constraints specified in the Turing API configuration:
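```yaml
matchLabels:
  app: [knative-service-name]
```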
The above behaviour occurs for all constraints, even those without any other match labels or label selectors specified.
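To make this concrete, for a hypothetical Knative service named `my-router-turing-router` (the name here is assumed purely for illustration), a constraint declared with no selectors at all would end up on the service's pods roughly as follows:

```yaml
# Illustrative only: the service name is hypothetical; the app match label
# below is the one injected by default by the API server.
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: my-router-turing-router
```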
Modifications
api/turing/cluster/knative_service.go
- Addition of steps to inject the aforementioned additional match label into all constraints of the Knative service to be deployed

api/turing/cluster/servicebuilder/service_builder.go
- Addition of methods to store the constraints specified in the Turing API configuration and copy them onto any Knative service to be deployed

api/turing/config/config.go
- Addition of an additionalTopologySpreadConstraints field to read the constraints stored in the Turing API configuration