OCPBUGS-13747,OCPBUGS-14021: [1.26] Fix bugs with high performance hooks #7013


haircommander
Member

This is a manual cherry-pick of #7000 and #7012
/kind bug

Fix a bug with cpu quota annotation that manifests like:
`pod with cpu-quota.crio.io: disable fails with error: set CPU CFS quota: invalid slice name: /kubepods.slice`
Fix a bug where stopped containers break cpu load balancing being disabled

@openshift-ci-robot openshift-ci-robot added the jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. label Jun 2, 2023
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 2, 2023
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 2, 2023
@openshift-ci openshift-ci bot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Jun 2, 2023
@openshift-ci-robot

@haircommander: This pull request references Jira Issue OCPBUGS-13163, which is invalid:

  • expected dependent Jira Issue OCPBUGS-13747 to be in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), but it is ON_QA instead
  • expected dependent Jira Issue OCPBUGS-13980 to be in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), but it is ON_QA instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Jun 2, 2023
@openshift-ci
Contributor

openshift-ci bot commented Jun 2, 2023

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added dco-signoff: yes Indicates the PR's author has DCO signed all their commits. kind/bug Categorizes issue or PR as related to a bug. labels Jun 2, 2023
@openshift-ci
Contributor

openshift-ci bot commented Jun 2, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: haircommander

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 2, 2023
@codecov

codecov bot commented Jun 2, 2023

Codecov Report

Merging #7013 (9a78d02) into release-1.26 (994242a) will decrease coverage by 0.13%.
The diff coverage is 7.40%.

❗ Current head 9a78d02 differs from pull request most recent head 5982014. Consider uploading reports for the commit 5982014 to get more accurate results

Additional details and impacted files
@@               Coverage Diff                @@
##           release-1.26    #7013      +/-   ##
================================================
- Coverage         42.31%   42.18%   -0.13%     
================================================
  Files               127      128       +1     
  Lines             15003    15047      +44     
================================================
  Hits               6348     6348              
- Misses             7979     8023      +44     
  Partials            676      676              

@haircommander haircommander changed the title from "OCPBUGS-13163,OCPBUGS-13163: [1.26] Fix bugs with high performance hooks" to "OCPBUGS-13163,OCPBUGS-14021: [1.26] Fix bugs with high performance hooks" on Jun 5, 2023
@openshift-ci-robot

@haircommander: This pull request references Jira Issue OCPBUGS-13163, which is invalid:

  • expected the bug to be in one of the following states: NEW, ASSIGNED, POST, but it is ON_QA instead
  • expected dependent Jira Issue OCPBUGS-13747 to be in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), but it is ON_QA instead
  • expected dependent Jira Issue OCPBUGS-13980 to be in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), but it is ON_QA instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

This pull request references Jira Issue OCPBUGS-14021, which is invalid:

  • expected Jira Issue OCPBUGS-14021 to depend on a bug in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), but no dependents were found

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.


@@ -82,5 +80,15 @@ func (s *Server) stopContainer(ctx context.Context, ctr *oci.Container, timeout
		log.Warnf(ctx, "Unable to write containers %s state to disk: %v", ctr.ID(), err)
	}

	if hooks != nil {
		if err := hooks.PostStop(ctx, ctr, sb); err != nil {
			return fmt.Errorf("failed to run post-stop hook for container %q: %w", ctr.ID(), err)
Contributor

This must be just a log. The container is dead and there is nothing that can be done. Otherwise it will confuse the platform. See #7032

Contributor

In fact.. I wonder if it should have been below the stopContainer...

Member Author

I cherry-picked 93c4c56

@haircommander haircommander marked this pull request as ready for review June 8, 2023 17:24
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 8, 2023
@openshift-ci openshift-ci bot requested review from QiWang19 and wgahnagl June 8, 2023 17:24
@haircommander
Member Author

/retest

@sdodson

sdodson commented Jun 9, 2023

/jira refresh

@openshift-ci-robot

@sdodson: This pull request references Jira Issue OCPBUGS-13163, which is invalid:

  • expected dependent Jira Issue OCPBUGS-13747 to be in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), but it is POST instead
  • expected dependent Jira Issue OCPBUGS-13980 to be in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), but it is ON_QA instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

This pull request references Jira Issue OCPBUGS-14021, which is invalid:

  • expected Jira Issue OCPBUGS-14021 to depend on a bug in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), but no dependents were found

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.


@sdodson

sdodson commented Jun 9, 2023

I spent a few minutes digging into the bug dependencies. There's a weird situation: one of the bugs this depends on had previously gone to QA but was rejected because additional fixes were necessary, and that bug is currently in POST, waiting on this PR to merge. Rather than try to disentangle all of that, I'm just going to override the labels. We will likely need to manually move https://issues.redhat.com//browse/OCPBUGS-13747 to the MODIFIED state once this merges.

/label jira/valid-bug
/unlabel jira/invalid-bug

@openshift-ci
Contributor

openshift-ci bot commented Jun 9, 2023

@sdodson: Can not set label jira/valid-bug: Must be member in one of these teams: [openshift-patch-managers openshift-staff-engineers openshift-release-oversight]


@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 13, 2023
Signed-off-by: Peter Hunt <pehunt@redhat.com>
when libcontainer's cgroup package is operating on a systemd scope, it uses the Name and
ScopePrefix fields to construct the path, rather than just the Name. Account for this.

Signed-off-by: Peter Hunt <pehunt@redhat.com>
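
A minimal sketch of the naming rule the commit message above refers to. It assumes that a systemd cgroup whose name does not end in `.slice` is addressed as `<ScopePrefix>-<Name>.scope`; the `cgroupConfig` type and `unitName` function are hypothetical stand-ins for illustration, not the libcontainer API or the code in this PR.

```go
package main

import (
	"fmt"
	"strings"
)

// cgroupConfig is a hypothetical stand-in holding only the two
// fields the commit message mentions.
type cgroupConfig struct {
	Name        string
	ScopePrefix string
}

// unitName sketches the assumed rule: slices keep their name as-is,
// while anything else is treated as a scope whose unit name combines
// the prefix and the name.
func unitName(c cgroupConfig) string {
	if strings.HasSuffix(c.Name, ".slice") {
		return c.Name
	}
	return c.ScopePrefix + "-" + c.Name + ".scope"
}

func main() {
	fmt.Println(unitName(cgroupConfig{Name: "abc123", ScopePrefix: "crio"}))
	fmt.Println(unitName(cgroupConfig{Name: "kubepods.slice"}))
}
```

Under that assumption, code that looks up a container's cgroup by `Name` alone would miss the scope, which is the mismatch the commit accounts for.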
they probably won't be used, but they're called for sandbox stop, so call them just in case.

Signed-off-by: Peter Hunt <pehunt@redhat.com>
haircommander and others added 5 commits June 13, 2023 14:13
to consolidate code and prevent skew

Signed-off-by: Peter Hunt <pehunt@redhat.com>
disabling cpu load balancing requires all overlapping cpusets to have the
cpuset.sched_load_balance field set to 0. There is a bug where stale containers
that have stopped still have the field set to 1, and cpumanager won't attempt to
load balance the cpus away from them. Thus, when a pod that needs cpu load
balancing to be disabled is added after other pods have stopped, the setting
does not take effect in the kernel.

Fix this by adding PostStop hooks for all containers (when high performance hooks are enabled).
This hook will disable sched_load_balance in the container's cgroup. This cgroup is stale, as
the container is already stopped and won't be restarted, so this won't change behavior beyond
allowing cpu load balancing to be disabled for new containers.

Signed-off-by: Peter Hunt <pehunt@redhat.com>
Signed-off-by: Peter Hunt~ <pehunt@redhat.com>
Signed-off-by: Peter Hunt <pehunt@redhat.com>
The pod is already dead when the PostStop hook is called and so
there is nothing to block anyway. Just log the error.

Signed-off-by: Martin Sivak <msivak@redhat.com>
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 13, 2023
@MarSik
Contributor

MarSik commented Jun 16, 2023

/lgtm

@openshift-ci
Contributor

openshift-ci bot commented Jun 16, 2023

@MarSik: changing LGTM is restricted to collaborators


@MarSik
Contributor

MarSik commented Jun 16, 2023

@haircommander @sdodson Folks, we need to get this merged.

@haircommander
Member Author

> @haircommander @sdodson Folks, we need to get this merged.

I don't think this solution is complete, actually. Checking the 1.27 version, there are some cases where load balancing isn't set correctly. I have found a potential source and am working on fixing it.

@sdodson

sdodson commented Jun 16, 2023

I forgot that cri-o doesn't require valid bug/Jira labels, so we're not blocked on that.

@haircommander haircommander changed the title from "OCPBUGS-13163,OCPBUGS-14021: [1.26] Fix bugs with high performance hooks" to "OCPBUGS-13747,OCPBUGS-14021: [1.26] Fix bugs with high performance hooks" on Jun 28, 2023
@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jun 28, 2023
@openshift-ci-robot

@haircommander: This pull request references Jira Issue OCPBUGS-13747, which is valid.

4 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
  • dependent bug Jira Issue OCPBUGS-13148 is in the state Verified, which is one of the valid states (VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE))
  • bug has dependents

No GitHub users were found matching the public email listed for the QA contact in Jira (mniranja@redhat.com), skipping review request.

The bug has been updated to refer to the pull request using the external bug tracker.

This pull request references Jira Issue OCPBUGS-14021, which is valid.

4 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
  • dependent bug Jira Issue OCPBUGS-14018 is in the state Verified, which is one of the valid states (VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE))
  • bug has dependents

No GitHub users were found matching the public email listed for the QA contact in Jira (mniranja@redhat.com), skipping review request.

The bug has been updated to refer to the pull request using the external bug tracker.


@MarSik
Contributor

MarSik commented Jul 10, 2023

@haircommander I assume we are waiting for #7059 here, right?

@haircommander
Member Author

> @haircommander I assume we are waiting for #7059 here, right?

No, I am waiting on QE to verify the 4.14 version. Then I am going to pull an equivalent of #7106 into this, and then it's ready.

#7059 should not affect behavior; it only cleans up the code conceptually. I think it can be thought of as a follow-on to this work, but it is not at all required for it.

Signed-off-by: Peter Hunt <pehunt@redhat.com>
@haircommander
Member Author

Okay, I picked a version of #7120 for this branch. This should now have all of the required fixes, and the 1.27 variant has been verified.

@harche
Contributor

harche commented Jul 12, 2023

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 12, 2023
@openshift-merge-robot openshift-merge-robot merged commit 78941bf into cri-o:release-1.26 Jul 12, 2023
40 of 43 checks passed
@openshift-ci-robot

@haircommander: Jira Issue OCPBUGS-13747: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-13747 has been moved to the MODIFIED state.

Jira Issue OCPBUGS-14021: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-14021 has been moved to the MODIFIED state.


Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. kind/bug Categorizes issue or PR as related to a bug. lgtm Indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes.
6 participants