Attempt to update status.conditions denied by cert-manager webhook #297
See also this slack thread.
Based on our certificate renewal policy, the frequency of this error should have decreased if it were a transient error. That is not the case, so I suspect this is a real bug. It seems like cert-manager/cert-manager#3735 introduced this strict validation in cert-manager: https://github.com/cert-manager/cert-manager/blob/5141dddf2c0c5e10c5d4452c99a2e260c6eb2983/internal/apis/certmanager/validation/certificaterequest.go#L202-L214. I wonder if the validation was intentionally made extremely strict (…).
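For reference, a minimal sketch of what that validation appears to do, based on the error message and the linked code; the function and helper names below are illustrative, not the actual cert-manager identifiers:

// Hedged sketch of the strict check introduced by cert-manager/cert-manager#3735:
// once an Approved or Denied condition exists on a CertificateRequest, any update
// that changes (or removes) that condition is rejected by the validating webhook.
// Names are illustrative, not copied from the cert-manager source.
package validation

import (
	"fmt"
	"reflect"

	cmapi "github.com/cert-manager/cert-manager/pkg/apis/certmanager/v1"
	"k8s.io/apimachinery/pkg/util/validation/field"
)

func findCondition(cr *cmapi.CertificateRequest, t cmapi.CertificateRequestConditionType) *cmapi.CertificateRequestCondition {
	for i := range cr.Status.Conditions {
		if cr.Status.Conditions[i].Type == t {
			return &cr.Status.Conditions[i]
		}
	}
	return nil
}

func validateApprovalConditions(oldCR, newCR *cmapi.CertificateRequest) field.ErrorList {
	var errs field.ErrorList
	for _, t := range []cmapi.CertificateRequestConditionType{
		cmapi.CertificateRequestConditionApproved,
		cmapi.CertificateRequestConditionDenied,
	} {
		oldCond, newCond := findCondition(oldCR, t), findCondition(newCR, t)
		if oldCond != nil && !reflect.DeepEqual(oldCond, newCond) {
			errs = append(errs, field.Forbidden(
				field.NewPath("status", "conditions"),
				fmt.Sprintf("'%s' condition may not be modified once set", t)))
		}
	}
	return errs
}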
Looking closer at the managedFields:
- apiVersion: cert-manager.io/v1
fieldsType: FieldsV1
fieldsV1:
'f:status':
'f:conditions':
'k:{"type":"Approved"}':
.: {}
'f:lastTransitionTime': {}
'f:message': {}
'f:reason': {}
'f:status': {}
'f:type': {}
manager: approver-policy
operation: Apply
subresource: status
time: '2023-10-30T13:45:09Z'
- apiVersion: cert-manager.io/v1
fieldsType: FieldsV1
fieldsV1:
'f:status':
'f:ca': {}
'f:certificate': {}
'f:conditions':
'k:{"type":"Approved"}':
.: {}
'f:lastTransitionTime': {}
'f:message': {}
'f:reason': {}
'f:status': {}
'f:type': {}
'k:{"type":"Ready"}':
.: {}
'f:lastTransitionTime': {}
'f:message': {}
'f:reason': {}
'f:status': {}
'f:type': {}
manager: cert-manager-certificaterequests-issuer-vault
operation: Apply
subresource: status
time: '2023-10-30T13:45:09Z'
I do not expect the cert-manager Vault issuer to approve certificate requests, but it seems like it (…).
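(Side note: the managed fields above can be dumped with something like the following; newer kubectl versions hide them unless --show-managed-fields is passed. The name and namespace are placeholders.)

# Show which field manager owns which parts of status.conditions.
# <name> and <namespace> are placeholders for the CertificateRequest in question.
kubectl get certificaterequest <name> -n <namespace> -o yaml --show-managed-fields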
Checking if cert-manager is configured correctly, since I am not expecting cert-manager to approve any requests:
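For context, and as an illustrative example only (not necessarily our exact manifests): the usual way to hand approval over to approver-policy is to disable cert-manager's built-in approver controller, e.g. via the Helm values:

# Example values for the cert-manager Helm chart: disable the built-in
# certificaterequests-approver controller so that only approver-policy
# manages the Approved/Denied conditions.
extraArgs:
  - --controllers=*,-certificaterequests-approver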
Here is a newer request from today, which I think proves that there is a bug in cert-manager: the cert-manager Vault issuer sets/owns the Approved condition, with a timestamp after the approver-policy approval. No idea why the error ends up in approver-policy, though. I cannot find any related errors/warnings in the cert-manager logs.
apiVersion: cert-manager.io/v1
kind: CertificateRequest
metadata:
annotations:
cert-manager.io/certificate-name: prefixverification-forecast-selector
cert-manager.io/certificate-revision: '7'
cert-manager.io/private-key-secret-name: prefixverification-forecast-selector-9qvxg
resourceVersion: '3086765553'
name: prefixverification-forecast-selector-7
uid: 05ae7f2b-8672-4185-87d5-bc5e6ba1cf4e
creationTimestamp: '2023-11-05T09:57:44Z'
generation: 1
managedFields:
- apiVersion: cert-manager.io/v1
fieldsType: FieldsV1
fieldsV1:
'f:status':
'f:conditions':
'k:{"type":"Approved"}':
.: {}
'f:lastTransitionTime': {}
'f:message': {}
'f:reason': {}
'f:status': {}
'f:type': {}
manager: approver-policy
operation: Apply
subresource: status
time: '2023-11-05T09:57:44Z'
- apiVersion: cert-manager.io/v1
fieldsType: FieldsV1
fieldsV1:
'f:status':
'f:ca': {}
'f:certificate': {}
'f:conditions':
'k:{"type":"Approved"}':
.: {}
'f:lastTransitionTime': {}
'f:message': {}
'f:reason': {}
'f:status': {}
'f:type': {}
'k:{"type":"Ready"}':
.: {}
'f:lastTransitionTime': {}
'f:message': {}
'f:reason': {}
'f:status': {}
'f:type': {}
manager: cert-manager-certificaterequests-issuer-vault
operation: Apply
subresource: status
time: '2023-11-05T09:57:45Z'
- apiVersion: cert-manager.io/v1
fieldsType: FieldsV1
fieldsV1:
'f:metadata':
'f:annotations':
.: {}
'f:cert-manager.io/certificate-name': {}
'f:cert-manager.io/certificate-revision': {}
'f:cert-manager.io/private-key-secret-name': {}
'f:labels':
.: {}
'f:app.kubernetes.io/managed-by': {}
'f:app.kubernetes.io/name': {}
'f:ownerReferences':
.: {}
'k:{"uid":"fc04c6eb-d67e-466c-93f2-0b0fa6a12f9f"}': {}
'f:spec':
.: {}
'f:duration': {}
'f:issuerRef':
.: {}
'f:group': {}
'f:kind': {}
'f:name': {}
'f:request': {}
'f:usages': {}
manager: cert-manager-certificates-request-manager
operation: Update
time: '2023-11-05T09:57:44Z'
namespace: preprod-fifty-for-imbalance-forecast-postprocessing
ownerReferences:
- apiVersion: cert-manager.io/v1
blockOwnerDeletion: true
controller: true
kind: Certificate
name: prefixverification-forecast-selector
uid: fc04c6eb-d67e-466c-93f2-0b0fa6a12f9f
labels:
app.kubernetes.io/managed-by: identity-provider
app.kubernetes.io/name: prefixverification-forecast-selector
spec:
duration: 144h0m0s
extra:
authentication.kubernetes.io/pod-name:
- cert-manager-5f86f8b489-6k548
authentication.kubernetes.io/pod-uid:
- f5d36458-dfb1-458e-8c79-31c78c87a0ea
groups:
- 'system:serviceaccounts'
- 'system:serviceaccounts:cert-manager'
- 'system:authenticated'
issuerRef:
group: cert-manager.io
kind: ClusterIssuer
name: vault-spiffe-issuer
request: >-
<REDACTED>
uid: 256a8935-1197-4f9a-915d-5dfd38d5f0c6
usages:
- client auth
username: 'system:serviceaccount:cert-manager:cert-manager'
status:
ca: >-
<REDACTED>
certificate: >-
<REDACTED>
conditions:
- lastTransitionTime: '2023-11-05T09:57:44Z'
message: 'Approved by CertificateRequestPolicy: "vault-spiffe"'
reason: policy.cert-manager.io
status: 'True'
type: Approved
- lastTransitionTime: '2023-11-05T09:57:45Z'
message: Certificate fetched from issuer successfully
reason: Issued
status: 'True'
type: Ready
@erikgb I looked at the code and my theory is that there are two problems:
One solution would be to modify approver-policy to expect this situation and ignore the error; let's call it …
I don't think the scenario you are describing is possible, because each resource can only be reconciled one at a time.
Yes, and that is how it is in the scenario I described. approver-policy is re-reconciling the CR at v1 before its cache has been updated with the result of its previous reconcile.
Thanks for explaining. I would expect the patch to be identical, so the webhook shouldn't complain about the condition changing, right?
The cache is only updated from watch push events, ref. kubernetes-sigs/controller-runtime#1622, so if the cache has not been updated from the previous round of reconciliation, the patch will …

Thinking a bit more about the suggested solution from @wallrj, I am more in favor of using a non-caching client when reconciling certificaterequests in approver-policy. Ignoring a custom error message will never be clean, and it is quite fragile: if anyone changes the literal error message in the cert-manager webhook, this will break again. The HTTP status code is just …
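A rough sketch of what that could look like with controller-runtime v0.15+ (assumed wiring, not the actual approver-policy code; scheme registration omitted): the manager's default client can be told to bypass the informer cache for CertificateRequest objects, so reads during reconciliation always go to the API server and never return the stale object discussed above.

// Hedged sketch: configure the manager client to bypass the cache for
// CertificateRequests (controller-runtime >= v0.15). Not the actual
// approver-policy code; scheme registration and error handling omitted.
package main

import (
	cmapi "github.com/cert-manager/cert-manager/pkg/apis/certmanager/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func newManager() (ctrl.Manager, error) {
	return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Client: client.Options{
			Cache: &client.CacheOptions{
				// Reads of CertificateRequests go straight to the API server.
				DisableFor: []client.Object{&cmapi.CertificateRequest{}},
			},
		},
	})
}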
Yes, I agree that would solve the problem.
(approver-policy/pkg/internal/controllers/certificaterequests.go, lines 174 to 177 in c8f9506)
The full error message is:
{
"kind": "Status",
"apiVersion": "v1",
"metadata": {},
"status": "Failure",
"message": "admission webhook \"webhook.cert-manager.io\" denied the request: status.conditions: Forbidden: 'Approved' condition may not be modified once set",
"reason": "NotAcceptable",
"code": 406
}
Given the following patch.json:
{
"apiVersion": "cert-manager.io/v1",
"kind": "CertificateRequest",
"status": {
"conditions": [
{
"message": "new message",
"type": "Approved"
}
]
}
}
curl 'https://127.0.0.1:44121/apis/cert-manager.io/v1/namespaces/default/certificaterequests/foo/status?fieldManager=cmctl&force=true' \
--silent \
--insecure \
--header "Accept: application/json" \
--header "Authorization: Bearer $(kubectl create token -n cert-manager cert-manager)" \
--header "Content-Type: application/apply-patch+yaml" \
--request PATCH \
--data @patch.json
So that error response might be unique and stable enough to reliably discern it from other errors.
The configuration of the non-caching client would go here, I suppose, which is a bit further away from the site of the problem:
(approver-policy/pkg/internal/cmd/cmd.go, lines 76 to 98 in c8f9506)
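Circling back to the error-matching option above, a hedged sketch of what discerning that exact response could look like (the helper name is made up, and as noted this approach is fragile by design):

// Hedged sketch only: recognise the specific webhook denial ("'Approved'
// condition may not be modified once set", reason NotAcceptable / HTTP 406)
// so a controller could choose to ignore it.
package main

import (
	"strings"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func isApprovalAlreadySetDenial(err error) bool {
	if err == nil {
		return false
	}
	if apierrors.ReasonForError(err) != metav1.StatusReasonNotAcceptable {
		return false
	}
	return strings.Contains(err.Error(), "condition may not be modified once set")
}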
Another option might be to use the cached CR's resourceVersion in the patch. E.g. the following JSON patch, applied using the previous curl command:
{
"apiVersion": "cert-manager.io/v1",
"kind": "CertificateRequest",
"metadata": {
"resourceVersion": "999"
},
"status": {
"conditions": [
{
"message": "new message",
"type": "Approved"
}
]
}
}
Gives:
{
"kind": "Status",
"apiVersion": "v1",
"metadata": {},
"status": "Failure",
"message": "Operation cannot be fulfilled on certificaterequests.cert-manager.io \"foo\": the object has been modified; please apply your changes to the latest version and try again",
"reason": "Conflict",
"details": {
"name": "foo",
"group": "cert-manager.io",
"kind": "certificaterequests"
},
"code": 409
}
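A hedged sketch of that second option in controller-runtime terms (illustrative names and fields, not approver-policy's actual code): carry the cached object's resourceVersion in the server-side apply patch, and treat the resulting 409 Conflict as a signal to requeue and retry with a fresh object.

// Hedged sketch: server-side apply of the Approved condition with the cached
// resourceVersion set, so a stale cache yields a retryable Conflict instead of
// the webhook's "may not be modified once set" error. Illustrative only.
package main

import (
	"context"

	cmapi "github.com/cert-manager/cert-manager/pkg/apis/certmanager/v1"
	cmmeta "github.com/cert-manager/cert-manager/pkg/apis/meta/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func applyApproval(ctx context.Context, c client.Client, cached *cmapi.CertificateRequest) (requeue bool, err error) {
	patch := &cmapi.CertificateRequest{
		ObjectMeta: metav1.ObjectMeta{
			Name:      cached.Name,
			Namespace: cached.Namespace,
			// Optimistic concurrency: if the cache is stale, the API server
			// responds with 409 Conflict instead of reaching the webhook.
			ResourceVersion: cached.ResourceVersion,
		},
		Status: cmapi.CertificateRequestStatus{
			Conditions: []cmapi.CertificateRequestCondition{{
				Type:   cmapi.CertificateRequestConditionApproved,
				Status: cmmeta.ConditionTrue,
				Reason: "policy.cert-manager.io",
			}},
		},
	}
	patch.SetGroupVersionKind(cmapi.SchemeGroupVersion.WithKind("CertificateRequest"))

	err = c.Status().Patch(ctx, patch, client.Apply,
		client.FieldOwner("approver-policy"), client.ForceOwnership)
	if apierrors.IsConflict(err) {
		return true, nil // cache was stale; requeue and retry with a fresh read
	}
	return false, err
}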
Wow! 👏 Thanks for your deep analysis and suggestions, @wallrj! I didn't know you could use …
I'm confused: does #303 not solve this issue? Do we still get these errors after merging that PR?
Confused? Why? I think it might solve this issue, but we won't know for sure. I do not intend to build and provision a snapshot version in our busiest cluster.
@inteon Yeah, I realise that I was over-complicating... I hadn't looked carefully at the code. I still think the situation I described can happen. It will be interesting to see whether @erikgb observes these errors after he deploys this fix (albeit fewer of them). And I still think there is a bug in the cert-manager server-side apply code which causes it to take joint field ownership of the approval conditions.
To all following this issue, I can report that it seems to work a lot better now! 🎉 I had a feeling the fix would not matter, but it has an effect…
We have been running cert-manager for a long time in multi-tenant clusters, with the default built-in approver. Yesterday we promoted approver-policy to one of our busiest clusters (TEST). While everything seems to work as it should (new certificate requests are approved by approver-policy), we notice significant log noise in the approver-policy logs. Here is an example:
CertificateRequest (after the error is logged):
I suspect this is a transitional problem that will disappear after all the certificates in our cluster are renewed, and possibly related to approver-policy attempting to modify conditions previously added by the default built-in approver. As everything seems to work as it should, I will leave the setup as it is now. As most of our certificates are only valid for a relatively short period (a.t.m. 6 days), we should see if the problems disappear by the end of this week.
But I still think this error logging is annoying and should be avoided. Please let me know if more details about our setup are needed.