Failed to update lock: Operation cannot be fulfilled on leases.coordination.k8s.io #16402

Closed · Tracked by #13359
pchaigno opened this issue Jun 2, 2021 · 10 comments · Fixed by #23334
Labels
area/CI Continuous Integration testing issue or flake · area/operator Impacts the cilium-operator component · kind/bug/CI This is a bug in the testing code. · pinned These issues are not marked stale by our issue bot.

Comments

pchaigno (Member) commented Jun 2, 2021

Found in CI, in cilium-operator logs:

2021-06-02T14:04:08.985504968Z level=info msg="Leader re-election complete" newLeader=gke-cilium-ci-14-cilium-ci-14-26c6ad68-s5w4-nhKgOahXUP operatorID=gke-cilium-ci-14-cilium-ci-14-26c6ad68-s5w4-umHEuadHYZ subsys=cilium-operator-generic
2021-06-02T14:04:26.722724693Z Failed to update lock: Operation cannot be fulfilled on leases.coordination.k8s.io "cilium-operator-resource-lock": the object has been modified; please apply your changes to the latest version and try again
2021-06-02T14:04:26.766651169Z level=error msg="Failed to update lock: Operation cannot be fulfilled on leases.coordination.k8s.io \"cilium-operator-resource-lock\": the object has been modified; please apply your changes to the latest version and try again" subsys=klog

From:
https://jenkins.cilium.io/job/Cilium-PR-K8s-GKE/5586/testReport/junit/Suite-k8s-1/18/K8sCLI_CLI_Identity_CLI_testing_Test_labelsSHA256/
K8sCLI_CLI_Identity_CLI_testing_Test_labelsSHA256.zip

pchaigno added the kind/bug (This is a bug in the Cilium logic.) and area/operator (Impacts the cilium-operator component) labels on Jun 2, 2021
aanm (Member) commented Jun 4, 2021

It took me an hour of debugging before I realized this issue was only opened because of #16395, a PR that is still marked as draft.

This is not a real issue: the message comes from the Kubernetes client library, and the leader-election process is handled automatically by that library, which Cilium can't control.
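
For reference, this is roughly how leader election is wired up with client-go; the renew loop inside the library is what retries on failure and logs these errors via klog. The values below are illustrative, not Cilium's exact configuration:

```go
package main

import (
	"context"
	"log"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Identity of this replica; the hostname is a common choice.
	id, _ := os.Hostname()

	// The Lease object all operator replicas compete for.
	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{
			Name:      "cilium-operator-resource-lock",
			Namespace: "kube-system",
		},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}

	// RunOrDie acquires the lock and keeps renewing it every RetryPeriod.
	// Update conflicts and API errors hit during acquire/renew are retried
	// internally and logged via klog; the caller never sees them.
	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second, // illustrative timings
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) { /* run operator logic */ },
			OnStoppedLeading: func() { log.Println("leader election lost") },
		},
	})
}
```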

aanm closed this as completed Jun 4, 2021
pchaigno reopened this Jun 4, 2021
pchaigno (Member, Author) commented Jun 4, 2021

Are you saying the error is expected? Should it then at least be demoted to a warning?

aanm (Member) commented Jun 4, 2021

Unfortunately, we don't control the log level of klog messages, so we can't downgrade it to a warning.
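
A small sketch of why that is: klog's verbosity flag only gates Info-level output, so an error emitted by a vendored library through klog.Errorf always comes out at error severity (assuming klog v2):

```go
package main

import (
	"flag"

	"k8s.io/klog/v2"
)

func main() {
	// klog's -v flag only gates V(n).Info messages; there is no knob that
	// turns a klog.Errorf call made inside a vendored library into a warning.
	klog.InitFlags(flag.CommandLine)
	flag.Parse()

	klog.V(4).Info("only printed with -v=4 or higher")
	klog.Errorf("always emitted at error severity: %v", "example")
}
```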

pchaigno (Member, Author) commented Jun 15, 2021

Another variation seems to be:

level=info msg="attempting to acquire leader lease kube-system/cilium-operator-resource-lock..." subsys=klog
2021-06-15T08:01:37.527364968Z error initially creating leader election record: leases.coordination.k8s.io "cilium-operator-resource-lock" already exists
2021-06-15T08:01:37.527405312Z level=error msg="error initially creating leader election record: leases.coordination.k8s.io \"cilium-operator-resource-lock\" already exists" subsys=klog
2021-06-15T08:01:41.075771452Z level=info msg="Leader re-election complete" newLeader=k8s2-PwqknlEdnn operatorID=k8s3-bWOiVcFCMx subsys=cilium-operator-generic

from https://jenkins.cilium.io/job/Cilium-PR-K8s-1.16-net-next/796/testReport/junit/Suite-k8s-1/16/K8sKafkaPolicyTest_Kafka_Policy_Tests_KafkaPolicies/
4ad4186c_K8sKafkaPolicyTest_Kafka_Policy_Tests_KafkaPolicies.zip


I understand we can't change these messages or their level, but do we know why they happen?
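
For what it's worth, the original "Operation cannot be fulfilled ... the object has been modified" variation looks like the apiserver's standard optimistic-concurrency conflict: another writer updated the Lease between this replica's read and write, so the update carried a stale resourceVersion and was rejected. Inside the leader-election library that just means a retry on the next tick. For code outside the library, the usual pattern is client-go's conflict-retry helper; a generic sketch, where the Lease mutation is illustrative and not the operator's actual code:

```go
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/util/retry"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	leases := kubernetes.NewForConfigOrDie(cfg).CoordinationV1().Leases("kube-system")

	// RetryOnConflict re-reads the object and re-applies the change whenever
	// the apiserver answers with a 409 Conflict ("the object has been
	// modified"), which is exactly the error quoted above.
	err = retry.RetryOnConflict(retry.DefaultRetry, func() error {
		lease, err := leases.Get(context.TODO(), "cilium-operator-resource-lock", metav1.GetOptions{})
		if err != nil {
			return err
		}
		now := metav1.NowMicro()
		lease.Spec.RenewTime = &now // example mutation on the freshly read object
		_, err = leases.Update(context.TODO(), lease, metav1.UpdateOptions{})
		return err
	})
	if err != nil {
		log.Fatal(err)
	}
}
```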

pchaigno added a commit to pchaigno/cilium that referenced this issue Jun 22, 2021
cilium#16477 was merged and a new error,
cilium#16402 (comment) was
discovered since the PR disallowing level=error in CI was merged.

Signed-off-by: Paul Chaignon <paul@cilium.io>
christarazi pushed a commit that referenced this issue Jun 23, 2021
#16477 was merged and a new error,
#16402 (comment) was
discovered since the PR disallowing level=error in CI was merged.

Signed-off-by: Paul Chaignon <paul@cilium.io>
pchaigno (Member, Author):

Yet another variation:

2021-06-26T15:24:04.863701081Z level=error msg="error retrieving resource lock kube-system/cilium-operator-resource-lock: Get \"https://10.96.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/cilium-operator-resource-lock\": http2: client connection lost" subsys=klog

pchaigno added a commit to pchaigno/cilium that referenced this issue Jun 30, 2021
This error message happened in CI and seems to be a less frequent
variation of known klog error messages [1].

1 - cilium#16402 (comment)
Signed-off-by: Paul Chaignon <paul@cilium.io>
pchaigno added a commit that referenced this issue Jun 30, 2021
This error message happened in CI and seems to be a less frequent
variation of known klog error messages [1].

1 - #16402 (comment)
Signed-off-by: Paul Chaignon <paul@cilium.io>
nathanjsweet (Member) commented Dec 3, 2021

Ran into another variation during this week's backport:

2021-12-03T01:02:19.719702706Z level=error msg="error retrieving resource lock kube-system/cilium-operator-resource-lock: Get \"https://10.96.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/cilium-operator-resource-lock\": context deadline exceeded" subsys=klog
2021-12-03T01:02:19.719713805Z Failed to release lock: resource name may not be empty
2021-12-03T01:02:19.719718371Z level=error msg="Failed to release lock: resource name may not be empty" subsys=klog

The "Failed to release lock: resource name may not be empty" is a new one. Perhaps a worthy candidate for adding to the exceptions list?

nathanjsweet added a commit that referenced this issue Dec 3, 2021
Occasionally the cilium-operator will run into a transient issue
where it cannot get/update/release the leaselock with K8s that
it uses to adjudicate its leader election. This error message
is part and parcel of this failure and can be ignored.

cf. #16402

Signed-off-by: Nate Sweet <nathanjsweet@pm.me>
aanm (Member) commented Dec 4, 2021

@nathanjsweet did you check the etcd logs? I think this is a failure similar to this one: #17981 (comment)

nathanjsweet (Member):

I just looked again; there were only some warnings about "apply request took too long".

pchaigno pushed a commit that referenced this issue Dec 9, 2021
Occasionally the cilium-operator will run into a transient issue
where it cannot get/update/release the leaselock with K8s that
it uses to adjudicate its leader election. This error message
is part and parcel of this failure and can be ignored.

cf. #16402

Signed-off-by: Nate Sweet <nathanjsweet@pm.me>
nbusseneau pushed a commit that referenced this issue Dec 9, 2021
Occasionally the cilium-operator will run into a transient issue
where it cannot get/update/release the leaselock with K8s that
it uses to adjudicate its leader election. This error message
is part and parcel of this failure and can be ignored.

cf. #16402

Signed-off-by: Nate Sweet <nathanjsweet@pm.me>
nbusseneau pushed a commit to nbusseneau/cilium that referenced this issue Dec 10, 2021
[ upstream commit 82d4422 ]

Occasionally the cilium-operator will run into a transient issue
where it cannot get/update/release the leaselock with K8s that
it uses to adjudicate its leader election. This error message
is part and parcel of this failure and can be ignored.

cf. cilium#16402

Signed-off-by: Nate Sweet <nathanjsweet@pm.me>
Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
nbusseneau pushed a commit to nbusseneau/cilium that referenced this issue Dec 10, 2021
[ upstream commit 82d4422 ]

[ Backport notes: had to resolve conflicts manually due to cilium#16395
  previously introducing exceptions not having been backported to v1.10.
  The changes in this PR completely supersede cilium#16395 so there should be
  no need to backport it first. ]

Occasionally the cilium-operator will run into a transient issue
where it cannot get/update/release the leaselock with K8s that
it uses to adjudicate its leader election. This error message
is part and parcel of this failure and can be ignored.

cf. cilium#16402

Signed-off-by: Nate Sweet <nathanjsweet@pm.me>
Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
nbusseneau pushed a commit that referenced this issue Dec 14, 2021
[ upstream commit 82d4422 ]

[ Backport notes: had to resolve conflicts manually due to #16395
  previously introducing exceptions not having been backported to v1.10.
  The changes in this PR completely supersede #16395 so there should be
  no need to backport it first. ]

Occasionally the cilium-operator will run into a transient issue
where it cannot get/update/release the leaselock with K8s that
it uses to adjudicate its leader election. This error message
is part and parcel of this failure and can be ignored.

cf. #16402

Signed-off-by: Nate Sweet <nathanjsweet@pm.me>
Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
tklauser pushed a commit that referenced this issue Dec 16, 2021
[ upstream commit 82d4422 ]

Occasionally the cilium-operator will run into a transient issue
where it cannot get/update/release the leaselock with K8s that
it uses to adjudicate its leader election. This error message
is part and parcel of this failure and can be ignored.

cf. #16402

Signed-off-by: Nate Sweet <nathanjsweet@pm.me>
Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com>
aanm added the area/CI (Continuous Integration testing issue or flake) and kind/bug/CI (This is a bug in the testing code.) labels and removed the kind/bug (This is a bug in the Cilium logic.) label on Jan 7, 2022
A github-actions bot comment was marked as outdated.

github-actions bot added the stale (The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale.) label on Jul 9, 2022
pchaigno added the pinned (These issues are not marked stale by our issue bot.) label and removed the stale label on Jul 9, 2022
nbusseneau (Member):

FWIW, we are still hitting variations of this in the CI:

Excerpt from one of the cilium-operator-test.log files:

2022-12-06T22:49:22.999578224Z level=error msg="error retrieving resource lock kube-system/cilium-operator-resource-lock: Get \"https://10.96.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/cilium-operator-resource-lock\": context deadline exceeded" subsys=klog
2022-12-06T22:49:22.999583686Z level=info msg="failed to renew lease kube-system/cilium-operator-resource-lock: timed out waiting for the condition" subsys=klog
2022-12-06T22:49:24.933257796Z level=info msg="Leader election lost" operator-id=k8s1-mlmgzWbRSs subsys=cilium-operator-generic
