New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OCPBUGS-13747,OCPBUGS-14021: [1.26] Fix bugs with high performance hooks #7013
OCPBUGS-13747,OCPBUGS-14021: [1.26] Fix bugs with high performance hooks #7013
Conversation
@haircommander: This pull request references Jira Issue OCPBUGS-13163, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. This pull request references Jira Issue OCPBUGS-13163, which is invalid:
Comment In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Skipping CI for Draft Pull Request. |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: haircommander The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## release-1.26 #7013 +/- ##
================================================
- Coverage 42.31% 42.18% -0.13%
================================================
Files 127 128 +1
Lines 15003 15047 +44
================================================
Hits 6348 6348
- Misses 7979 8023 +44
Partials 676 676 |
@haircommander: This pull request references Jira Issue OCPBUGS-13163, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. This pull request references Jira Issue OCPBUGS-14021, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
server/container_stop.go
Outdated
@@ -82,5 +80,15 @@ func (s *Server) stopContainer(ctx context.Context, ctr *oci.Container, timeout | |||
log.Warnf(ctx, "Unable to write containers %s state to disk: %v", ctr.ID(), err) | |||
} | |||
|
|||
if hooks != nil { | |||
if err := hooks.PostStop(ctx, ctr, sb); err != nil { | |||
return fmt.Errorf("failed to run post-stop hook for container %q: %w", ctr.ID(), err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This must be just a log. The container is dead and there is nothing that can be done. Otherwise it will confuse the platform. See #7032
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact.. I wonder if it should have been below the stopContainer...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I cherry-picked 93c4c56
39ce52b
to
778452c
Compare
/retest |
/jira refresh |
@sdodson: This pull request references Jira Issue OCPBUGS-13163, which is invalid:
Comment This pull request references Jira Issue OCPBUGS-14021, which is invalid:
Comment In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Spent a few minutes digging into the bug deps there's a weird situation where one of the bugs this depends on had previously gone to QA but got rejected because additional fixes were necessary and that bug is currently POST waiting on this PR to merge. Rather than try to disentangle all of that I'm just going to override the labels. We will likely need to manually move https://issues.redhat.com//browse/OCPBUGS-13747 to MODIFIED state once this merges. /label jira/valid-bug |
@sdodson: Can not set label jira/valid-bug: Must be member in one of these teams: [openshift-patch-managers openshift-staff-engineers openshift-release-oversight] In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Signed-off-by: Peter Hunt <pehunt@redhat.com>
when libcontainer's cgroup package is operating on a systemd scope, it uses the Name and ScopePrefix fields to construct the path, rather than just the Name. Account for this. Signed-off-by: Peter Hunt <pehunt@redhat.com>
they probably won't be used, but they're called for sandbox stop, so call them just in case. Signed-off-by: Peter Hunt <pehunt@redhat.com>
to consolidate code and prevent skew Signed-off-by: Peter Hunt <pehunt@redhat.com>
disabling cpu load balancing requires all overlapping cpusets to have the cpuset.sched_load_balance field to 0. There is a bug where stale containers that have stopped still have the field set to 1, and cpumanager won't attempt to load balance the cpus away from them. Thus, adding a pod that needs cpu load balancing to be disabled after other pods have stopped causes the cpu load balancing to not apply in the kernel. Fix this by adding PostStop hooks for all containers (when high performance hooks are enabled). This hook will disable sched_load_balance in the container's cgroup. This cgroup is stale, as the containers is already stopped and won't be restarted, so this won't change behavior past allowing cpu load balancing to be disabled for new containers. Signed-off-by: Peter Hunt <pehunt@redhat.com>
Signed-off-by: Peter Hunt~ <pehunt@redhat.com>
Signed-off-by: Peter Hunt <pehunt@redhat.com>
The pod is already dead when the PostStop hook is called and so there is nothing to block anyway. Just log the error. Signed-off-by: Martin Sivak <msivak@redhat.com>
93c4c56
to
9a78d02
Compare
/lgtm |
@MarSik: changing LGTM is restricted to collaborators In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@haircommander @sdodson Folks, we need to get this merged. |
I don't think this solution is complete actually. Checking the 1.27 version there are some cases where the load balance isn't set right. I have found a potential source and am working on fixing it |
I forgot cri-o doesn't require valid bug/jira labels so not blocked on that. |
Signed-off-by: Peter Hunt <pehunt@redhat.com>
@haircommander: This pull request references Jira Issue OCPBUGS-13747, which is valid. 4 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Jira (mniranja@redhat.com), skipping review request. The bug has been updated to refer to the pull request using the external bug tracker. This pull request references Jira Issue OCPBUGS-14021, which is valid. 4 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Jira (mniranja@redhat.com), skipping review request. The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@haircommander I assume we are waiting for #7059 here, right? |
No I am waiting on QE to verify 4.14 version, then am going to pull an eqivalent of #7106 into this, and then it's ready #7059 should not effect behavior only cleanup the code conceptually. I think it can be thought of as a follow on to this work but not at all required for it |
Signed-off-by: Peter Hunt <pehunt@redhat.com>
okay, I picked a version of #7120 for this branch. this should now have all of the required fixes and the 1.27 variant has been verified |
for ease of review:
|
/lgtm |
78941bf
into
cri-o:release-1.26
@haircommander: Jira Issue OCPBUGS-13747: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-13747 has been moved to the MODIFIED state. Jira Issue OCPBUGS-14021: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-14021 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
[4.13.5](https://amd64.ocp.releases.ci.openshift.org/releasestream/4-stable/release/4.13.5) has RHCOS [413.92.202307140015-0](https://releases-rhcos-art.apps.ocp-virt.prod.psi.redhat.com/diff.html?arch=x86_64&first_release=413.92.202306141213-0&first_stream=prod%2Fstreams%2F4.13-9.2&second_release=413.92.202307140015-0&second_stream=prod%2Fstreams%2F4.13-9.2) which has cri-o-0-1.26.3-11.rhaos4.13.git78941bf.el9-x86_64 and git78941bf is the merge of cri-o/cri-o#7013.
This is a manual cherry-pick of #7000 and #7012
/kind bug