New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Temporarily disable bpf clock probe #26981
Conversation
@@ -19,7 +19,7 @@ | |||
{{- $defaultEnableCnpStatusUpdates = "false" -}} | |||
{{- $defaultBpfMapDynamicSizeRatio = 0.0025 -}} | |||
{{- $defaultBpfMasquerade = "true" -}} | |||
{{- $defaultBpfClockProbe = "true" -}} | |||
{{- $defaultBpfClockProbe = "false" -}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure whether it's the right place to change it :-/ Maybe someone from the @cilium/helm could review?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is correct; in the daemon itself it defaults to false.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Martynas wasn't asking Helm vs. agent, but this section of the Helm template vs. the section above.
Won't a change of time measurement for existing installations cause all existing connections to be interrupted? I would expect that's why we have this here to ensure the setting is preserved is the cluster was initially deployed with it, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah this logic here is very confusing. The way this upgradeCompatibility
section here works (docs here) is that we have the old pre-v1.8 default in the top of the file, and then for each version where we change the default value, we add a new section.
So this section here is for default values changed in v1.8 (where we apparently turned bpf clock probe on by default). So I would argue it's not the right place to change.
If we just want to turn it off by default again for new installations in v1.14+, we probably should add a v1.14 section where we do $defaultBpfClockProbe = "false"
. If people then explicitly declare they want to use v1.12 default values, then bpf clock probe would still be enabled, as it was in v1.12.
If, on the other hand, we want to turn of bpf clock probe even for existing installations, then we should probably add it to the values.yaml
and document this in the upgrade impact section of the docs. That way, users can set it back to true
if they know what they are doing.
To summarize:
- If we want to turn of bpf clock probe forcefully for existing installations, regardless of
upgradeCompatibility
, then let's add it tovalues.yaml
and document it in the upgrade notes so users can opt out - If we want to turn of bpf clock probe only for new installations, then let's add a
upgradeCompatibility
section for v1.14 or v1.15 (depending on the version where we want to turn it on by default)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the current understanding is that this feature had been effectively turned off everywhere until #25795. Quoting from the PR desc:
The Cilium agent has been throwing the 'Auto-disabling "enable-bpf-clock-probe" feature since kernel doesn't expose /proc/timer_list' warning for a while now.
(admittedly, I don't know how long this "for a while now" is)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems it's already false in the values.yaml file, which I think was what Casey meant by his previous message:
cilium/install/kubernetes/cilium/values.yaml
Line 428 in 32118cc
clockProbe: false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm. So I did a quick check (helm template cilium ./install/kubernetes/cilium --namespace kube-system | grep enable-bpf-clock-probe
) and we have a bug in our Helm charts. BPF probes are currently turned on by default, because the above value in values.yaml
is not used and so we fall back on the v1.8 defaults.
The value in values.yaml
is called bpf.clockProbe
:
cilium/install/kubernetes/cilium/values.yaml
Line 428 in 32118cc
clockProbe: false |
But the way it is set in the config map is using bpfClockProbe
(no dot):
enable-bpf-clock-probe: {{ .Values.bpfClockProbe | quote }} |
I think we should rename the unused variable in values.yaml
to bpfClockProbe
, or make the configmap use both the defacto-used bpfClockProbe
and the intended bpf.clockProbe
value instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(admittedly, I don't know how long this "for a while now" is)
AFAIK, for a while is since that flag was added.
I think the current understanding is that this feature had been effectively turned off everywhere until #25795.
Unfortunately, that fix got shipped into v1.13.4, which means some customers may be running with that feature enabled now.
64e7564
to
afa7a03
Compare
/test |
@@ -19,7 +19,7 @@ | |||
{{- $defaultEnableCnpStatusUpdates = "false" -}} | |||
{{- $defaultBpfMapDynamicSizeRatio = 0.0025 -}} | |||
{{- $defaultBpfMasquerade = "true" -}} | |||
{{- $defaultBpfClockProbe = "true" -}} | |||
{{- $defaultBpfClockProbe = "false" -}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah this logic here is very confusing. The way this upgradeCompatibility
section here works (docs here) is that we have the old pre-v1.8 default in the top of the file, and then for each version where we change the default value, we add a new section.
So this section here is for default values changed in v1.8 (where we apparently turned bpf clock probe on by default). So I would argue it's not the right place to change.
If we just want to turn it off by default again for new installations in v1.14+, we probably should add a v1.14 section where we do $defaultBpfClockProbe = "false"
. If people then explicitly declare they want to use v1.12 default values, then bpf clock probe would still be enabled, as it was in v1.12.
If, on the other hand, we want to turn of bpf clock probe even for existing installations, then we should probably add it to the values.yaml
and document this in the upgrade impact section of the docs. That way, users can set it back to true
if they know what they are doing.
To summarize:
- If we want to turn of bpf clock probe forcefully for existing installations, regardless of
upgradeCompatibility
, then let's add it tovalues.yaml
and document it in the upgrade notes so users can opt out - If we want to turn of bpf clock probe only for new installations, then let's add a
upgradeCompatibility
section for v1.14 or v1.15 (depending on the version where we want to turn it on by default)
00fc2c7
to
fe91c52
Compare
Rename the Helm variable in values.yaml.tmpl as it was incorrectly named bpf.clockProbe instead of bpfClockProbe. This will cause the bpf clock probe to be disabled everywhere. Which is currently necessary as it can otherwise cause interrupted connections after an upgrade, due to the garbage collector removing valid entries. Signed-off-by: Margarita Manterola <marga@isovalent.com>
fe91c52
to
71a6f90
Compare
/test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Helm change looks good to me now. But I do think we should also document this in upgrade impact Helm section here:
https://github.com/cilium/cilium/blob/main/Documentation/operations/upgrade.rst#helm-options
I agree that we should add some docs to describe this behaviour to users, but that can be submitted separately. |
We've disabled bpfClockProbe everywhere with cilium#26981 and backports. Document that existing connections may be interrupted when upgrading to 1.13.4, the only release to ship with this enabled. On the other hand, they may linger for too long if upgrading from that release to one where bpf-clock-probe is no longer enabled. Signed-off-by: Margarita Manterola <marga@isovalent.com>
We've disabled bpfClockProbe everywhere with #26981 and backports. Document that existing connections may be interrupted when upgrading to 1.13.4, the only release to ship with this enabled. On the other hand, they may linger for too long if upgrading from that release to one where bpf-clock-probe is no longer enabled. Signed-off-by: Margarita Manterola <marga@isovalent.com>
We've disabled bpfClockProbe everywhere with #26981 and backports. Document that existing connections may be interrupted when upgrading to 1.13.4, the only release to ship with this enabled. On the other hand, they may linger for too long if upgrading from that release to one where bpf-clock-probe is no longer enabled. Signed-off-by: Margarita Manterola <marga@isovalent.com>
When enabling the bpf clock probe, the garbage collector is sometimes removing valid conntrack entries, likely due to some error in the jiffies calculation.
Until we figure out a permanent solution, let's disable the the bpf-clock-probe by default.
Fixes: #26955