Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use correct tolerations value when deploying cilium-operator and cilium-etcd-operator via helm #15992

Merged
merged 2 commits into from
Aug 5, 2021

Conversation

michaelpetrov
Copy link
Contributor

Use correct tolerations value when deploying cilium-operator via helm. Without this the global tolerations value is used and the operator runs on all nodes including ones with taints.

Use correct tolerations value when deploying cilium-operator via helm.

Signed-off-by: Michael Petrov petrov.michael@gmail.com

@michaelpetrov michaelpetrov requested a review from a team as a code owner May 4, 2021 00:58
@michaelpetrov michaelpetrov requested review from a team and kaworu May 4, 2021 00:58
@maintainer-s-little-helper

This comment has been minimized.

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label May 4, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label May 4, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label May 4, 2021
@kaworu kaworu added the release-note/minor This PR changes functionality that users may find relevant to operating Cilium. label May 4, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label May 4, 2021
Copy link
Member

@kaworu kaworu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @michaelpetrov and thanks for the PR. Patch LGTM.

Without this the global tolerations value is used and the operator runs on all nodes including ones with taints.

Can you elaborate on this because I see no difference between the global tolerations and the operator's ones by default:

$ yq '.tolerations,.operator.tolerations' install/kubernetes/cilium/values.yaml
[
  {
    "operator": "Exists"
  }
]
[
  {
    "operator": "Exists"
  }
]

@michaelpetrov
Copy link
Contributor Author

@kaworu on our clusters we have cpu and gpu nodes in two node pools, and the gpu nodes have a taint to avoid scheduling non-essential work on them (because they are far more expensive nodes). cilium-operator gets consistently scheduled onto the gpu nodes in our case and there is no way to override it in Helm. If we override the default global tolerations to not use the gpu nodes then we also lose the cilium agent on them since the setting is shared between the two components.

To be honest, I think the top level .Values.tolerations is a pretty bad idea in a multi-service chart where you want just the main agent and node-init everywhere and auxiliary services only on some nodes. Ideally it can be removed in favor of something closer to .Values.agent.tolerations but that would conflict with the way agent settings are currently done.

@maintainer-s-little-helper
Copy link

Commit b3747d1ec0b048d7346d86e8ef9fafb1102f67c8 does not contain "Signed-off-by".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label May 4, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label May 4, 2021
@michaelpetrov michaelpetrov changed the title Use correct tolerations value when deploying cilium-operator via helm Use correct tolerations value when deploying cilium-operator and cilium-etcd-operator via helm May 4, 2021
@kaworu
Copy link
Member

kaworu commented May 6, 2021

@michaelpetrov thanks for the insight, all you wrote make perfect sense.

In our current Helm charts we have many top-level items that are about the agent. For example, .Values.image, .Values.affinity, .Values.tolerations (as you correctly pointed out), etc.

Copy link
Member

@nathanjsweet nathanjsweet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Thanks for the PR!

@michaelpetrov
Copy link
Contributor Author

In our current Helm charts we have many top-level items that are about the agent. For example, .Values.image, .Values.affinity, .Values.tolerations (as you correctly pointed out), etc.

I'm actually even slightly worried about this diff because some existing chart installations might depend on .Values.tolerations to schedule things onto nodes that has various taints - so this change is not strictly backwards compatible. For our use case at OpenAI I do really want this separation though. Not sure what's the best way to improve things here.

@kaworu
Copy link
Member

kaworu commented May 17, 2021

I'm actually even slightly worried about this diff because some existing chart installations might depend on .Values.tolerations to schedule things onto nodes that has various taints - so this change is not strictly backwards compatible. For our use case at OpenAI I do really want this separation though. Not sure what's the best way to improve things here.

@michaelpetrov that is a valid concern and why the release note is minor, but granted that might not be enough safety net. I'll defer to @seanmwinn about the potential breakage.

To keep some level of backward compatibility, could we could use Helm's default function? e.g.

 {{- with (.Values.etcd.tolerations | default .Values.tolerations) }}

@stale

This comment has been minimized.

@stale stale bot added the stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale. label Jun 18, 2021
@pchaigno
Copy link
Member

test-me-please

@stale stale bot removed the stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale. label Jun 29, 2021
@pchaigno pchaigno added the dont-merge/needs-rebase This PR needs to be rebased because it has merge conflicts. label Jul 7, 2021
@pchaigno
Copy link
Member

pchaigno commented Jul 7, 2021

@michaelpetrov Could you please rebase your PR on latest master? We've fixed a couple of flakes in the past few days and CI should have a better time after rebasing.

<!-- Description of change -->

Use correct tolerations value when deploying cilium-operator via helm. Without this the global `tolerations` value is used and the operator runs on all nodes including ones with taints.

```release-note
Use correct tolerations value when deploying cilium-operator via helm.
```
Signed-off-by: Michael Petrov <petrov.michael@gmail.com>
… helm

<!-- Description of change -->

Use correct tolerations value when deploying cilium-etcd-operator via helm. Without this, the global `tolerations` value is used and the operator runs on all nodes including ones with taints. Currently the `values.yaml` file claims it is available (but has been unused so far)

```release-note
Use correct tolerations value when deploying cilium-etcd-operator via helm.
```

Signed-off-by: Michael Petrov <petrov.michael@gmail.com>
@aanm
Copy link
Member

aanm commented Aug 5, 2021

test-me-please

@aanm
Copy link
Member

aanm commented Aug 5, 2021

Failures are unrelated, merging

@aanm aanm merged commit f77996c into cilium:master Aug 5, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dont-merge/needs-rebase This PR needs to be rebased because it has merge conflicts. release-note/minor This PR changes functionality that users may find relevant to operating Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants