Add topology spread constraints to deployments #332

Conversation

Contributor

@deadlycoconuts deadlycoconuts commented Apr 12, 2023

Context

Topology spread constraints allow the distribution of pods across various topology domains to be controlled at a finer level than pod affinities/anti-affinities. Since the current implementation of the Turing API server does not allow Knative services (Turing Routers, enrichers, pyfunc/docker ensemblers) to be deployed with topology spread constraints, this PR introduces changes that allow the Turing API server operator to specify topology spread constraints in the Turing API configuration, which will then be propagated to all Knative services deployed.

In addition, in order to encourage pods of the same Knative service deployed by the Turing API server to be scheduled across as wide a variety of nodes as possible, thereby promoting high availability, an additional match label is added by default to all constraints specified in the Turing API configuration:

```yaml
matchLabels:
    app: [knative-service-name]
```

The above behaviour occurs for all constraints, even those without any other match labels or label selectors specified.
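
A minimal sketch of that injection step, assuming a hypothetical helper name and using the upstream Kubernetes API types (the actual change lives in api/turing/cluster/knative_service.go):

```go
package cluster

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// appendDefaultMatchLabel (hypothetical name) deep-copies the configured
// constraints and adds the default `app: [knative-service-name]` match label
// to each one, creating a label selector where none was specified.
func appendDefaultMatchLabel(constraints []corev1.TopologySpreadConstraint, svcName string) []corev1.TopologySpreadConstraint {
	out := make([]corev1.TopologySpreadConstraint, 0, len(constraints))
	for i := range constraints {
		c := constraints[i].DeepCopy()
		if c.LabelSelector == nil {
			c.LabelSelector = &metav1.LabelSelector{}
		}
		if c.LabelSelector.MatchLabels == nil {
			c.LabelSelector.MatchLabels = map[string]string{}
		}
		c.LabelSelector.MatchLabels["app"] = svcName
		out = append(out, *c)
	}
	return out
}
```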

Modifications

  • api/turing/cluster/knative_service.go - Addition of steps to inject the aforementioned match label into all constraints of the Knative service to be deployed
  • api/turing/cluster/servicebuilder/service_builder.go - Addition of methods to store the constraints specified in the Turing API configuration and copy them onto any Knative service to be deployed
  • api/turing/config/config.go - Addition of a TopologySpreadConstraints field to read the constraints stored in the Turing API configuration (a sketch of this field follows below)
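
For illustration, the new configuration field could be declared along these lines; the enclosing struct name (DeploymentConfig) is an assumption and may not match the actual layout of config.go:

```go
package config

import corev1 "k8s.io/api/core/v1"

// DeploymentConfig is a placeholder name for the configuration section that
// holds deployment-related settings.
type DeploymentConfig struct {
	// ... existing deployment settings ...

	// TopologySpreadConstraints are applied to every Knative service
	// (routers, enrichers, pyfunc/docker ensemblers) deployed by the
	// Turing API server.
	TopologySpreadConstraints []corev1.TopologySpreadConstraint
}
```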

@deadlycoconuts deadlycoconuts self-assigned this Apr 12, 2023
@deadlycoconuts deadlycoconuts requested a review from a team April 13, 2023 05:11
@deadlycoconuts deadlycoconuts marked this pull request as ready for review April 13, 2023 05:11
Collaborator

@krithika369 krithika369 left a comment

Thanks for the concise PR, @deadlycoconuts! LGTM. 🚀

@deadlycoconuts
Contributor Author

Thanks for the quick review! Merging this! :D

@deadlycoconuts deadlycoconuts merged commit 7390e9e into caraml-dev:main Apr 14, 2023
@deadlycoconuts deadlycoconuts deleted the add_topology_spread_constraints_to_deployments branch April 14, 2023 05:05
krithika369 pushed a commit to caraml-dev/merlin that referenced this pull request May 11, 2023
**What this PR does / why we need it**:
Similar to how caraml-dev/turing#332 allows an
operator to define topology spread constraints for pods deployed by the
API server, this PR introduces changes that allow the Merlin API server
operator to specify topology spread constraints within the
`environmentConfigs` of its chart values `.yaml` file, which will then
be propagated to all the model service pods deployed via the API
server:
```yaml
environmentConfigs:
  - name: "merlin-dev"
    is_default: true
    cluster: "dev-cluster"
    deployment_timeout: "10m"
    namespace_timeout: "2m"
    max_cpu: "8"
    max_memory: "8Gi"
    topology_spread_constraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway
   ...
```
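
For illustration only, a struct along these lines could hold the
`environmentConfigs` entry shown above; the struct and field names here
are assumptions rather than the actual Merlin code:

```go
package config

import corev1 "k8s.io/api/core/v1"

// EnvironmentConfig is a hypothetical representation of one entry under
// environmentConfigs in the chart values file.
type EnvironmentConfig struct {
	Name                      string                            `yaml:"name"`
	IsDefault                 bool                              `yaml:"is_default"`
	Cluster                   string                            `yaml:"cluster"`
	MaxCPU                    string                            `yaml:"max_cpu"`
	MaxMemory                 string                            `yaml:"max_memory"`
	TopologySpreadConstraints []corev1.TopologySpreadConstraint `yaml:"topology_spread_constraints"`
	// ... other fields such as deployment_timeout and namespace_timeout ...
}
```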

Similarly, in order to encourage pods of the same Knative service
deployed by the Merlin API server to be scheduled across as wide a
variety of nodes as possible, thereby promoting high availability, an
additional match label is added by default to all constraints specified
in the Merlin API configuration:
```yaml
matchLabels:
    app: [knative-service-name]
```
The above behaviour occurs for all constraints, even those without any
other match labels or label selectors specified.

### Some extra details
Merlin model services (both predictors and transformers) are deployed
as KServe `InferenceService`-s, which are themselves built upon Knative
services (see
[here](https://kserve.github.io/website/0.10/get_started/first_isvc/)
for more info) and can be of either the `Serverless` or the
`RawDeployment` type. The two types generate different values for the
`metadata.labels.app` label in the pod manifest, based on the
`InferenceService` name:

`Serverless` type:
- `InferenceService` name: `sklearn-sample`
  Autogenerated label: `app=sklearn-sample-1-predictor-default-00001`
  Schema: `[isvc name]-[predictor/transformer]-default-[5-digit 0-padded revision number]`

`RawDeployment` type:
- `InferenceService` name: `sklearn-sample`
  Autogenerated label: `isvc.sklearn-sample-1-predictor-default`
  Schema: `isvc.[isvc name]-[predictor/transformer]-default`

where the `isvc name` includes the version number (of the model service)
automatically
[generated](https://github.com/caraml-dev/merlin/blob/d455cad616a89093c7db81de9580324f0cf64fe5/api/models/service.go#L52)
by the API server.
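
For illustration only, the two schemas above could be reproduced with
helpers like the following; the function names and signatures are
assumptions, not Merlin's actual methods:

```go
package main

import "fmt"

// rawDeploymentAppLabel follows `isvc.[isvc name]-[predictor/transformer]-default`.
func rawDeploymentAppLabel(isvcName, component string) string {
	return fmt.Sprintf("isvc.%s-%s-default", isvcName, component)
}

// serverlessAppLabel follows
// `[isvc name]-[predictor/transformer]-default-[revision number]`, with the
// revision number zero-padded to 5 digits as in the example above.
func serverlessAppLabel(isvcName, component string, revision int) string {
	return fmt.Sprintf("%s-%s-default-%05d", isvcName, component, revision)
}

func main() {
	fmt.Println(rawDeploymentAppLabel("sklearn-sample-1", "predictor"))
	// isvc.sklearn-sample-1-predictor-default
	fmt.Println(serverlessAppLabel("sklearn-sample-1", "predictor", 1))
	// sklearn-sample-1-predictor-default-00001
}
```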

Hence, separate methods are defined to handle these two cases. Note in
particular that redeploying a `RawDeployment`-type `InferenceService`
does not change the `app` label value, whereas redeploying a
`Serverless`-type `InferenceService` does, since the Knative service
revision number is incremented. As such, whenever a `Serverless`
deployment is redeployed, we need to determine the to-be-deployed
revision number (which will be automatically created by the Knative
controller/webhook) by retrieving the `LatestReadyRevision` of the
existing deployment and incrementing it by 1.
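
A minimal sketch of that revision lookup, assuming the revision number
can be parsed from the numeric suffix of the `LatestReadyRevision` name
(the helper name is illustrative):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// nextRevisionNumber extracts the trailing number from a revision name such as
// "sklearn-sample-1-predictor-default-00001" and returns it incremented by 1.
func nextRevisionNumber(latestReadyRevision string) (int, error) {
	parts := strings.Split(latestReadyRevision, "-")
	last := parts[len(parts)-1]
	n, err := strconv.Atoi(last)
	if err != nil {
		return 0, fmt.Errorf("unexpected revision name %q: %w", latestReadyRevision, err)
	}
	return n + 1, nil
}

func main() {
	next, _ := nextRevisionNumber("sklearn-sample-1-predictor-default-00001")
	fmt.Println(next) // 2
}
```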

This differs slightly from the Turing API server, which sets the `app`
label value to be exactly the Knative service name (the Turing API
server deploys routers directly as Knative services).

**Which issue(s) this PR fixes**:

Fixes #

**Does this PR introduce a user-facing change?**:
```release-note
NONE
```

**Checklist**

- [ ] Added unit test, integration, and/or e2e tests
- [ ] Tested locally
- [ ] Updated documentation
- [ ] Updated Swagger spec if the PR introduces API changes
- [ ] Regenerated Golang and Python clients if the PR introduces API changes