
high performance hook: irqbalance: incorrect handling of short affinity mask #6145

Open
ffromani opened this issue Aug 11, 2022 · 13 comments · May be fixed by #6146
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@ffromani
Contributor

What happened?

The high performance hook supports dynamic IRQ balancing, driven by a well-known annotation.
Pods can opt out of IRQ load balancing by setting this annotation. On pod startup, CRI-O recomputes the irqbalance ban mask and reapplies it to irqbalance, to make sure the CPUs assigned to these pods will not handle IRQs (they are added to the irqbalance ban mask).
This works only for guaranteed pods requesting integral CPUs.

To support this feature, CRI-O needs to recompute the CPU ban mask. In turn, CRI-O depends on encoding/hex.DecodeString for some of the heavy lifting. That function only accepts even-length input strings (which are trivial transformations of the irqbalance ban mask).

So CRI-O left-pads the IRQ affinity mask with zeros at the beginning of the computation.

The left-padded mask is then adjusted to exclude the relevant CPUs, and finally inverted.
Here lies the problem: the padding character should always be "0" (ASCII zero), but inverting the mask also turns the (originally correct) zero padding into "f".

The outcome of this behavior is that stray "f" characters end up in the IRQ ban list in the irqbalance config file, and they are never removed.

To the best of my knowledge, this happens only when the IRQ affinity mask has an odd number of hex digits, i.e. on systems whose CPU count is a multiple of 4 but not of 8 (4, 12, 20... CPUs), for which the kernel reports an odd-length IRQ affinity mask.
Machines like this are expected to be quite rare in production environments, but they are relatively common in CI (especially with 4 and 12 CPUs).
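For illustration, here is a minimal, self-contained Go sketch of the faulty flow described above. The function name `invertMaskBuggy` is hypothetical (this is not CRI-O's actual code): the mask is left-padded to an even length so `hex.DecodeString` accepts it, and then every byte is inverted, padding nibble included.

```go
package main

import (
	"encoding/hex"
	"fmt"
)

// invertMaskBuggy is a hypothetical sketch of the faulty flow: left-pad
// the hex mask to an even length (as hex.DecodeString requires), then
// invert every byte -- padding nibble included.
func invertMaskBuggy(mask string) (string, error) {
	if len(mask)%2 != 0 {
		mask = "0" + mask // correct padding: ASCII zero
	}
	raw, err := hex.DecodeString(mask)
	if err != nil {
		return "", err
	}
	for i := range raw {
		raw[i] = ^raw[i] // inversion also flips the padding "0" into "f"
	}
	return hex.EncodeToString(raw), nil
}

func main() {
	// A 4-cpu system reports a single-nibble affinity mask, e.g. "f".
	// The ban mask (the inverse) should be "0", but the padding inverts too.
	banned, _ := invertMaskBuggy("f")
	fmt.Println(banned) // prints "f0": the stray "f" is the inverted padding
}
```

The high nibble of the result is pure padding, yet it comes out as "f" and stays in the ban mask from then on.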

What did you expect to happen?

CRI-O should emit the padding characters as ASCII zero ("0") instead of trivially inverting the whole computed mask, which was padded correctly up to that point.
In other words, the problem lies in the computation of the inverted mask (i.e. the ban list).
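Under the same assumptions (hypothetical helper, not CRI-O's code), a sketch of the expected behavior: remember the kernel-reported mask width and clamp the inverted result back to it, so the inverted padding nibble never reaches the ban mask.

```go
package main

import (
	"encoding/hex"
	"fmt"
)

// invertMask is a hypothetical sketch of a corrected flow: pad for
// hex.DecodeString as before, invert, then clamp the result back to the
// original width so the inverted padding is discarded.
func invertMask(mask string) (string, error) {
	origLen := len(mask)
	padded := mask
	if origLen%2 != 0 {
		padded = "0" + padded // hex.DecodeString needs even-length input
	}
	raw, err := hex.DecodeString(padded)
	if err != nil {
		return "", err
	}
	for i := range raw {
		raw[i] = ^raw[i]
	}
	inverted := hex.EncodeToString(raw)
	// keep only the kernel-reported number of hex digits
	return inverted[len(inverted)-origLen:], nil
}

func main() {
	banned, _ := invertMask("f") // 4-cpu affinity mask, all CPUs allowed
	fmt.Println(banned)          // prints "0": no CPU banned, no stray "f"
}
```

Clamping after inversion is one way to keep the padding out; re-padding the inverted significant nibbles with ASCII zeros would be equivalent.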

How can we reproduce it (as minimally and precisely as possible)?

Run the dynamic IRQ balancing code on a system with 4 CPUs, like the OpenShift CI: see openshift/cluster-node-tuning-operator@4821953 (e2e test added as part of openshift/cluster-node-tuning-operator#396).

Excerpt of a test run that demonstrates the behavior:

[performance] Checking IRQBalance settings Verify irqbalance configuration handling 
  Should not overwrite the banned CPU set on tuned restart
  /home/fromani/go/src/github.com/openshift/cluster-node-tuning-operator/test/e2e/performanceprofile/functests/1_performance/irqbalance.go:70
Aug 11 13:39:53.221: [INFO]: MCP "worker" is targeting 3 node(s)
Aug 11 13:39:53.786: [INFO]: MCP "worker" is targeting 3 node(s)
STEP: verifying worker node "ci-ln-2h3mjpb-72292-2wnt7-worker-a-zbxpm"
Aug 11 13:39:54.058: [INFO]: found mcd machine-config-daemon-s5g7t for node ci-ln-2h3mjpb-72292-2wnt7-worker-a-zbxpm
Aug 11 13:39:55.062: [INFO]: banned CPUs setting: "IRQBALANCE_BANNED_CPUS=\"00000000,000000f0\""
Aug 11 13:39:55.062: [INFO]: banned CPUs: "00000000,000000f0"
Aug 11 13:39:55.062: [INFO]: banned CPUs on "ci-ln-2h3mjpb-72292-2wnt7-worker-a-zbxpm" when test begins: {4-7}
Aug 11 13:39:55.200: [INFO]: found mcd machine-config-daemon-s5g7t for node ci-ln-2h3mjpb-72292-2wnt7-worker-a-zbxpm
Aug 11 13:39:56.363: [INFO]: found mcd machine-config-daemon-s5g7t for node ci-ln-2h3mjpb-72292-2wnt7-worker-a-zbxpm
Aug 11 13:39:57.456: [INFO]: using testpod:
{"metadata":{"generateName":"test-","namespace":"performance-addon-operators-testing","creationTimestamp":null,"labels":{"test":""},"annotations":{"cpu-quota.crio.io":"disable","irq-load-balancing.crio.io":"disable"}},"spec":{"containers":[{"name":"test","image":"quay.io/openshift-kni/cnf-tests:4.9","command":["sleep","10h"],"resources":{"limits":{"cpu":"2","memory":"256Mi"}}}],"nodeName":"ci-ln-2h3mjpb-72292-2wnt7-worker-a-zbxpm","runtimeClassName":"performance-perfprof"},"status":{}}
Aug 11 13:40:02.009: [WARNING]: ->  AddedInterface Add eth0 [10.128.2.19/23] from ovn-kubernetes
Aug 11 13:40:02.009: [WARNING]: ->  Pulled Container image "quay.io/openshift-kni/cnf-tests:4.9" already present on machine
Aug 11 13:40:02.009: [WARNING]: ->  Created Created container test
Aug 11 13:40:02.009: [WARNING]: ->  Started Started container test
Aug 11 13:40:02.146: [INFO]: found mcd machine-config-daemon-s5g7t for node ci-ln-2h3mjpb-72292-2wnt7-worker-a-zbxpm
Aug 11 13:40:03.405: [INFO]: banned CPUs setting: "IRQBALANCE_BANNED_CPUS=\"00000000,000000fa\""
Aug 11 13:40:03.405: [INFO]: banned CPUs: "00000000,000000fa"
Aug 11 13:40:03.405: [INFO]: banned CPUs on "ci-ln-2h3mjpb-72292-2wnt7-worker-a-zbxpm" just before the tuned restart: {1,3-7}
STEP: getting a TuneD Pod running on node ci-ln-2h3mjpb-72292-2wnt7-worker-a-zbxpm
STEP: causing a restart of the tuned pod (deleting the pod) on ci-ln-2h3mjpb-72292-2wnt7-worker-a-zbxpm
run command 'oc [delete pod --wait=true -n openshift-cluster-node-tuning-operator tuned-2cb42]':
  out=pod "tuned-2cb42" deleted
  err=
  ret=<nil>
STEP: getting again a TuneD Pod running on node ci-ln-2h3mjpb-72292-2wnt7-worker-a-zbxpm
STEP: waiting for the TuneD daemon running on node ci-ln-2h3mjpb-72292-2wnt7-worker-a-zbxpm
STEP: re-verifying worker node "ci-ln-2h3mjpb-72292-2wnt7-worker-a-zbxpm" after TuneD restart
Aug 11 13:40:13.896: [INFO]: found mcd machine-config-daemon-s5g7t for node ci-ln-2h3mjpb-72292-2wnt7-worker-a-zbxpm
Aug 11 13:40:14.880: [INFO]: banned CPUs setting: "IRQBALANCE_BANNED_CPUS=\"00000000,000000fa\""
Aug 11 13:40:14.880: [INFO]: banned CPUs: "00000000,000000fa"
Aug 11 13:40:14.880: [INFO]: banned CPUs on "ci-ln-2h3mjpb-72292-2wnt7-worker-a-zbxpm" after the tuned restart: {1,3-7}
Aug 11 13:40:14.880: [INFO]: deleting pod "test-w577h"
Aug 11 13:40:47.564: [INFO]: found mcd machine-config-daemon-s5g7t for node ci-ln-2h3mjpb-72292-2wnt7-worker-a-zbxpm
Aug 11 13:40:48.577: [INFO]: banned CPUs setting: "IRQBALANCE_BANNED_CPUS=\"00000000,000000f0\""
Aug 11 13:40:48.577: [INFO]: banned CPUs: "00000000,000000f0"
Aug 11 13:40:48.577: [INFO]: banned CPUs on "ci-ln-2h3mjpb-72292-2wnt7-worker-a-zbxpm" when test ends: {4-7}

note this:

Aug 11 13:40:48.577: [INFO]: banned CPUs setting: "IRQBALANCE_BANNED_CPUS=\"00000000,000000f0\""

There is a stray, unexpected "f" in the banned CPU mask.

Anything else we need to know?

This issue has likely been present since the introduction of the dynamic IRQ balancing feature.

CRI-O and Kubernetes version

$ crio --version
crio version 1.24.1-11.rhaos4.11.gitb0d2ef3.el8
Version:          1.24.1-11.rhaos4.11.gitb0d2ef3.el8
GoVersion:        go1.18.1
Compiler:         gc
Platform:         linux/amd64
Linkmode:         dynamic
BuildTags:        exclude_graphdriver_devicemapper, containers_image_ostree_stub, seccomp, selinux
SeccompEnabled:   true
AppArmorEnabled:  false
$ kubectl --version
$ oc version
Client Version: 4.10.9
Server Version: 4.11.0-0.nightly-2022-07-24-151159
Kubernetes Version: v1.24.0+9546431

OS version

# On Linux:
$ cat /etc/os-release
NAME="Red Hat Enterprise Linux CoreOS"
ID="rhcos"
ID_LIKE="rhel fedora"
VERSION="411.86.202207210724-0"
VERSION_ID="4.11"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 411.86.202207210724-0 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::coreos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://docs.openshift.com/container-platform/4.11/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.11"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.11"
OPENSHIFT_VERSION="4.11"
RHEL_VERSION="8.6"
OSTREE_VERSION="411.86.202207210724-0"

$ uname -a
Linux cnfdt12 4.18.0-372.16.1.el8_6.x86_64 #1 SMP Tue Jun 28 03:02:21 EDT 2022 x86_64 x86_64 x86_64 GNU/Linux

Additional environment details (AWS, VirtualBox, physical, etc.)

bare metal
@ffromani ffromani added the kind/bug Categorizes issue or PR as related to a bug. label Aug 11, 2022
ffromani added a commit to ffromani/cri-o that referenced this issue Aug 11, 2022
The high performance hook supports dynamic IRQ balancing, driven by a
well-known annotation.
Pods can opt out of IRQ load balancing by setting this annotation. On pod
startup, crio recomputes the irqbalance ban mask and reapplies it to
irqbalance, to make sure the cpus assigned to these pods will not handle
IRQs (they are added to the irqbalance ban mask).
This works only for guaranteed pods requesting integral CPUs.

To support this feature, crio needs to recompute the cpu ban mask.
In turn, crio depends on encoding/hex.DecodeString for some of the heavy lifting.
That function only accepts even-length input strings (which are trivial
transformations of the irqbalance ban mask).

So crio left-pads the irq affinity mask with zeros at the beginning
of the computation.

The left-padded mask is then adjusted to exclude the relevant cpus, and finally inverted.
Here lies the problem: the padding character should always be "0" (ASCII zero), but
inverting the mask also turns the (originally correct) zero padding into "f".

The outcome is that stray "f" characters end up in the IRQ ban list in the
irqbalance config file, and they are never removed.

Fixing the padding or the mask inversion requires relatively invasive
code changes, for example passing around the expected mask length and
acting accordingly. While not particularly challenging per se, these
changes don't fit smoothly into the current code.

Another option is to clamp the computed masks to the expected length
once the computation and the inversion are done. We pursue this
approach in this change.

Fixes: cri-o#6145
Signed-off-by: Francesco Romani <fromani@redhat.com>
@ffromani ffromani linked a pull request Aug 11, 2022 that will close this issue
ffromani added a commit to ffromani/cluster-node-tuning-operator that referenced this issue Aug 11, 2022
Testing the irqbalance handling on tuned restart highlighted
a crio bug in the handling of odd-length cpu affinity masks.
We expect this bug to have little impact on production environments,
because a CPU count that is a multiple of 4 but not of 8 is unlikely
there; still, we filed
cri-o/cri-o#6145

In order to have a backportable change, we fix our utility code
to deal with the incorrect padding.

Signed-off-by: Francesco Romani <fromani@redhat.com>
ffromani added a commit to ffromani/cri-o that referenced this issue Aug 11, 2022
ffromani added a commit to ffromani/cluster-node-tuning-operator that referenced this issue Aug 11, 2022
openshift-merge-robot pushed a commit to openshift/cluster-node-tuning-operator that referenced this issue Aug 17, 2022
* test: perfprof: utils: make sure to unquote cpus

Make sure to unquote the cpumask output to prevent false negatives

Signed-off-by: Francesco Romani <fromani@redhat.com>

* perfprof: utils: robustness fixes

Add testcases and log enhancement emerged during the local testing.

Signed-off-by: Francesco Romani <fromani@redhat.com>

* perfprof: tuned: disable the irqbalance plugin

The tuned irqbalance plugin clears the irqbalance banned CPUs list
when tuned starts. The list is then managed dynamically by the runtime
handlers.

On node restart, the tuned pod can be started AFTER the workload pods
(neither kubelet nor kubernetes offers ordering guarantees when recovering
the node state); clearing the banned CPUs list while pods are running
compromises the IRQ isolation guarantees. The same holds true if the
NTO pod restarts for whatever reason.

To prevent this disruption, we disable the irqbalance plugin entirely.
Another component in the stack must now clear the irqbalance cpu ban
list once per node reboot.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=2105123
Signed-off-by: Francesco Romani <fromani@redhat.com>

* e2e: perfprof: add tests for cpu ban list handling

Add a test to verify that a tuned restart will not clear
the irqbalance cpu ban list, which is the key reason
why we disabled the irqbalance tuned plugin earlier.

Note there is no guarantee that any component in the stack
will reset the irqbalance cpu ban list exactly once.

crio unconditionally restores from a snapshot
taken the first time the server runs, which is likely
but not guaranteed to be correct.

There's no way to declare or check the content of the value
crio would reset.

Signed-off-by: Francesco Romani <fromani@redhat.com>

* perfprof: functests: utils: fix cpu mask padding

Testing the irqbalance handling on tuned restart highlighted
a crio bug in the handling of odd-length cpu affinity masks.
We expect this bug to have little impact on production environments,
because a CPU count that is a multiple of 4 but not of 8 is unlikely
there; still, we filed
cri-o/cri-o#6145

In order to have a backportable change, we fix our utility code
to deal with the incorrect padding.

Signed-off-by: Francesco Romani <fromani@redhat.com>

Signed-off-by: Francesco Romani <fromani@redhat.com>
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/cluster-node-tuning-operator that referenced this issue Aug 18, 2022
ffromani added a commit to ffromani/cluster-node-tuning-operator that referenced this issue Aug 19, 2022
In commit c5cf0bd we added
a workaround for the incorrect crio mask padding (see
cri-o/cri-o#6145).
But that fix was partial; this change addresses the gap.

Signed-off-by: Francesco Romani <fromani@redhat.com>
ffromani added a commit to ffromani/cluster-node-tuning-operator that referenced this issue Aug 19, 2022
ffromani added a commit to ffromani/cluster-node-tuning-operator that referenced this issue Aug 19, 2022
openshift-merge-robot pushed a commit to openshift/cluster-node-tuning-operator that referenced this issue Aug 19, 2022
* podsecurity: disable OCP label sync

Per the latest OCP recommendation (internal doc), it is not sufficient
to declare the pod security labels; we also need to disable the
label sync, which we do here.

Signed-off-by: Francesco Romani <fromani@redhat.com>

* e2e: perfprof: crio mask fix workaround part 2

In commit c5cf0bd we added
a workaround for the incorrect crio mask padding (see
cri-o/cri-o#6145).
But that fix was partial; this change addresses the gap.

Signed-off-by: Francesco Romani <fromani@redhat.com>

Signed-off-by: Francesco Romani <fromani@redhat.com>
openshift-merge-robot pushed a commit to openshift/cluster-node-tuning-operator that referenced this issue Aug 25, 2022
* test: perfprof: utils: make sure to unquote cpus

Make sure to unquote the cpumask output to prevent false negatives

Signed-off-by: Francesco Romani <fromani@redhat.com>

* perfprof: utils: robustness fixes

Add testcases and log enhancement emerged during the local testing.

Signed-off-by: Francesco Romani <fromani@redhat.com>

* perfprof: tuned: disable the irqbalance plugin

The tuned irqbalance plugin clears the irqbalance banned CPUs list
when tuned starts. The list is then managed dynamically by the runtime
handlers.

On node restart, the tuned pod can be started AFTER the workload pods
(kubelet nor kubernetes offers ordering guarantees when recovering the
node state); clearing the banned CPUs list while pods are running and
compromising the IRQ isolation guarantees. Same holds true if the
NTO pod restarts for whatever reason.

To prevent this disruption, we disable the irqbalance plugin entirely.
Another component in the stack must now clear the irqbalance cpu ban
list once per node reboot.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=2105123
Signed-off-by: Francesco Romani <fromani@redhat.com>

* e2e: perfprof: add tests for cpu ban list handling

Add a test to verify that a tuned restart will not clear
the irqbalance cpu ban list, which is the key reason
why we disabled the irqbalance tuned plugin earlier.

Note there is no guarantee that any component in the stack
will reset the irqbalance copu ban list once.

crio inconditionally does a restore depending on a snapshot
taken the first time the server runs, which is likely
but not guaranteed to be correct.

There's no way to declare or check the content of the value
crio would reset.

Signed-off-by: Francesco Romani <fromani@redhat.com>

* perfprof: functests: utils: fix cpu mask padding

Testing the irqbalance handling on tuned restart highlighted
a crio bug when handling odd-numbered cpu affinity masks.
We expect this bug to have little impact on production environments
because it's unlikely they will have a number of CPUs multiple by 4
but not multiple by 8, but still we filed
cri-o/cri-o#6145

In order to have a backportable change, we fix our utility code
to deal with incorrect padding.

Signed-off-by: Francesco Romani <fromani@redhat.com>

Signed-off-by: Francesco Romani <fromani@redhat.com>
Co-authored-by: Francesco Romani <fromani@redhat.com>
@github-actions

A friendly reminder that this issue had no activity for 30 days.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 11, 2022
marioferh pushed a commit to marioferh/cluster-node-tuning-operator that referenced this issue Sep 12, 2022
openshift-merge-robot pushed a commit to openshift/cluster-node-tuning-operator that referenced this issue Sep 20, 2022
…ane (#467)

@github-actions

Closing this issue since it had no activity in the past 90 days.

@github-actions github-actions bot added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Dec 11, 2022
@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Dec 11, 2022
@ffromani
Contributor Author

/reopen

@openshift-ci openshift-ci bot reopened this Dec 11, 2022
@openshift-ci
Contributor

openshift-ci bot commented Dec 11, 2022

@fromanirh: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@github-actions github-actions bot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Dec 12, 2022
@ffromani
Contributor Author

/remove-lifecycle stale

@openshift-ci openshift-ci bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 28, 2023
@github-actions

A friendly reminder that this issue had no activity for 30 days.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 28, 2023
IlyaTyomkin pushed a commit to IlyaTyomkin/cluster-node-tuning-operator that referenced this issue May 23, 2023
IlyaTyomkin pushed a commit to IlyaTyomkin/cluster-node-tuning-operator that referenced this issue May 23, 2023
@haircommander
Member

@ffromani is this a CRI-O issue or one that should live somewhere else?

@haircommander haircommander added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 24, 2023
IlyaTyomkin pushed a commit to IlyaTyomkin/cluster-node-tuning-operator that referenced this issue Jun 13, 2023
IlyaTyomkin pushed a commit to IlyaTyomkin/cluster-node-tuning-operator that referenced this issue Jun 13, 2023
@github-actions

A friendly reminder that this issue had no activity for 30 days.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 24, 2023
@github-actions

Closing this issue since it had no activity in the past 90 days.

@github-actions github-actions bot added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Sep 23, 2023
@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Sep 23, 2023
ffromani added a commit to ffromani/cri-o that referenced this issue Oct 13, 2023
The high performance hook support the dynamic IRQ balancing thanks some
well known annotation.
Pods can opt out from IRQ handling exposing this annotation. On pod startup,
crio will recompute the irqbalance ban mask and reapply irqbalance to make
sure the cpus assigned to these pods will not handle IRQs - being added to
the irqbalance ban mask.
This works only for guaranteed pods requesting integral CPUs.

To support this feature, crio needs to recompute the cpu ban mask.
In turn, crio depends on encoding/hex.DecodeString for some of the heavy lifting.
The function expects to deal with even-sized inbound strings (which are trivial
transformations of the irqbalance ban mask).

So crio does left-aligned zero padding of the irq affinity mask at the beginning
of computation.

The left-padded mask is then adjusted excluding any relevant cpu, and then inverted.
Here lies the problem: the padding character should always be "0" (ASCII zero), not
"f", which is obtained inverting the original correct zero character.

The outcome of this behavior is that stray 'f' end up in the IRQ ban list in the
irqbalance config file, and they are never removed.

Fixing the padding or the mask inversion require relatively invasive
code changes, for example to pass around the expected mask length and
act accordingly. While not particularly challenging per se, these change
don't fit smoothly in the current code.

Another option is to clamp the computed masks to the expected length
once the computation and the inversion have been done. We pursue this
approach in this change.
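The clamping idea can be sketched as follows (clampMask is a hypothetical helper under the assumptions above, not the actual patch): after padding, exclusion, and inversion, trim the result back to the original mask length so the spurious padding digits are dropped.

```go
package main

import "fmt"

// clampMask trims a computed hex mask down to the expected number of
// hex digits, dropping any spurious left padding left over after
// inversion. Sketch of the approach, not crio's exact code.
func clampMask(mask string, wantLen int) string {
	if len(mask) > wantLen {
		return mask[len(mask)-wantLen:]
	}
	return mask
}

func main() {
	// 4-cpu machine: original mask "f" is one hex digit, but the
	// padded-and-inverted ban mask came out as "fc".
	fmt.Println(clampMask("fc", 1)) // "c": only the real cpus remain
}
```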

Fixes: cri-o#6145
Signed-off-by: Francesco Romani <fromani@redhat.com>
@ffromani
Contributor Author

ffromani commented Mar 6, 2024

/reopen

the issue, AFAIK, still exists

Contributor

openshift-ci bot commented Mar 6, 2024

@ffromani: Reopened this issue.

In response to this:

/reopen

the issue, AFAIK, still exists

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot reopened this Mar 6, 2024
ffromani added a commit to ffromani/cri-o that referenced this issue Mar 6, 2024
The high performance hook supports dynamic IRQ balancing thanks to a
well-known annotation.
Pods can opt out from IRQ handling by exposing this annotation. On pod
startup, crio will recompute the irqbalance ban mask and reapply
irqbalance, so that the cpus assigned to these pods are added to the
irqbalance ban mask and thus no longer handle IRQs.
This works only for guaranteed pods requesting integral CPUs.

To support this feature, crio needs to recompute the cpu ban mask.
In turn, crio depends on encoding/hex.DecodeString for some of the heavy lifting.
That function only accepts even-length input strings (which are trivial
transformations of the irqbalance ban mask).

So crio left-pads the irq affinity mask with zeros at the beginning of
the computation.

The left-padded mask is then adjusted to exclude the relevant cpus, and
then inverted. Here lies the problem: the padding character should always
be "0" (ASCII zero), but inverting the mask turns the originally correct
zero padding into "f".

The outcome of this behavior is that stray 'f' characters end up in the
IRQ ban list in the irqbalance config file, and they are never removed.

Fixing the padding or the mask inversion requires relatively invasive
code changes, for example passing around the expected mask length and
acting accordingly. While not particularly challenging per se, these
changes don't fit smoothly in the current code.

Another option is to clamp the computed masks to the expected length
once the computation and the inversion have been done. We pursue this
approach in this change.

Fixes: cri-o#6145
Signed-off-by: Francesco Romani <fromani@redhat.com>
@github-actions github-actions bot removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 7, 2024

github-actions bot commented Apr 6, 2024

A friendly reminder that this issue had no activity for 30 days.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 6, 2024

github-actions bot commented Jul 6, 2024

Closing this issue since it had no activity in the past 90 days.

@github-actions github-actions bot added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jul 6, 2024
@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Jul 6, 2024
@kwilczynski kwilczynski removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Jul 11, 2024
@kwilczynski kwilczynski reopened this Jul 11, 2024