
Propagation check failed when using DNS-01 #4624

Closed
wolfedale opened this issue Nov 25, 2021 · 11 comments
Labels
  • kind/bug: Categorizes issue or PR as related to a bug.
  • lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
  • priority/awaiting-more-evidence: Lowest priority. Possibly useful, but not yet enough support to actually get it done.

Comments

@wolfedale

Describe the bug:
Every time I create a new Certificate, I get this error:

E1125 10:20:11.727984 1 sync.go:186] cert-manager/controller/challenges "msg"="propagation check failed" "error"="dial tcp 172.18.44.80:53: i/o timeout" "dnsName"="test.corp.foo.bar" "resource_kind"="Challenge" "resource_name"="foo-bar-f294x-2399323752-3051153670" "resource_namespace"="cert-manager" "resource_version"="v1" "type"="DNS-01"

As far as I know, networking is working fine, and so is CoreDNS. This is not a fresh Kubernetes cluster; it's a production cluster that already runs multiple services. I'm not sure why cert-manager is trying to reach this IP on port 53 over TCP.

Can I somehow change it to UDP?
Can I disable this check?
How does this check work?

I also noticed that with the option --dns01-recursive-nameservers-only it works fine. I no longer see any logs about the "propagation check", and cert-manager can now generate Certificates.

With --dns01-recursive-nameservers-only I also noticed two new log entries:

I1125 10:43:57.287345 1 controller.go:161] cert-manager/controller/certificates-readiness "msg"="re-queuing item due to optimistic locking on resource" "key"="cert-manager/foo-bar" "error"="Operation cannot be fulfilled on certificates.cert-manager.io \"foo-bar\": the object has been modified; please apply your changes to the latest version and try again"

and:

E1125 10:43:57.300232 1 controller.go:211] cert-manager/controller/challenges "msg"="challenge in work queue no longer exists" "error"="challenge.acme.cert-manager.io \"foo-bar-sptj2-3396776443-943601772\" not found"

After the last one, I can see that my new certificate is there and working correctly. It's not clear from the logs what just happened. Does this mean I need to check the webhook logs for the actual status of my Certificate's progress? My webhook is able to create and delete (Present/CleanUp) TXT records on the provider.

Expected behaviour:
cert-manager should query my local DNS (using /etc/resolv.conf) to check whether the TXT record has been created or not.
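
For reference, this is roughly the lookup I would expect that self-check to perform (the _acme-challenge prefix is the standard ACME DNS-01 record name; the domain is the placeholder from the log above):

dig +short TXT _acme-challenge.test.corp.foo.bar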

Steps to reproduce the bug:

  • install cert-manager (1.6.1); see the install sketch after this list
  • install webhook
  • create Certificate
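
For the first step, this is roughly the install command I used (assuming the static manifests; the URL follows the usual cert-manager release layout):

kubectl apply -f https://github.com/jetstack/cert-manager/releases/download/v1.6.1/cert-manager.yaml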

Anything else we need to know?:

Environment details:

  • Kubernetes version: v1.20.8
  • cert-manager version: 1.6.1
  • Install method: kubectl

/kind bug

@jetstack-bot added the kind/bug label on Nov 25, 2021
@maxisam

maxisam commented Dec 8, 2021

Is it an internal site? Did you set 172.18.44.80:53 as --dns01-recursive-nameservers? I don't think cert-manager queries DNS to check TXT records; I think the certificate provider does.
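
You can check which flags the controller is actually running with, for example (the deployment and namespace names may differ depending on how it was installed):

kubectl -n cert-manager get deploy cert-manager -o jsonpath='{.spec.template.spec.containers[0].args}'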

@jetstack-bot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle stale

@jetstack-bot added the lifecycle/stale label on Mar 8, 2022
@jetstack-bot
Collaborator

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle rotten
/remove-lifecycle stale

@jetstack-bot added the lifecycle/rotten label and removed the lifecycle/stale label on Apr 7, 2022
@irbekrm
Collaborator

irbekrm commented Apr 28, 2022

Hi, sorry, it looks like we never had a chance to look at this issue.

Is it still a problem?

There is some documentation about debugging ACME failures: https://cert-manager.io/docs/faq/acme/
The self-check is the cert-manager controller attempting to read the challenge record itself; errors at this stage usually indicate a networking/DNS issue in the cluster.
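
To reproduce the self-check by hand, you can run dig over TCP against the same nameserver from inside the cluster, for example (any image that ships dig will do; tutum/dnsutils here is just one option):

kubectl run -it --rm dns-debug --image=tutum/dnsutils --restart=Never -- dig +tcp TXT _acme-challenge.test.corp.foo.bar @172.18.44.80

If that also times out, the problem is network reachability rather than cert-manager itself.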

@irbekrm added the priority/awaiting-more-evidence label on Apr 28, 2022
@wolfedale
Author

@irbekrm true - this issue was related to the networking problems that we had. Closing :-)

@evercast-mahesh2021

evercast-mahesh2021 commented Dec 6, 2022

@wolfedale do you mind sharing the solution? I am facing the same issue in one of my EKS clusters. The cluster was working fine, and for some reason I had to create a new cluster; now I am stuck on this cert issue. Nothing has changed (networking or other configuration).

controller.go:205] cert-manager/controller "msg"="starting controller" "controller"="certificates-issuing"
controller.go:205] cert-manager/controller "msg"="starting controller" "controller"="certificaterequests-issuer-venafi"
controller.go:205] cert-manager/controller "msg"="starting controller" "controller"="certificates-metrics"
controller.go:205] cert-manager/controller "msg"="starting controller" "controller"="clusterissuers"
controller.go:205] cert-manager/controller "msg"="starting controller" "controller"="certificaterequests-issuer-ca"
controller.go:205] cert-manager/controller "msg"="starting controller" "controller"="certificaterequests-issuer-selfsigned"
util.go:84] cert-manager/controller/orders/handleOwnedResource "msg"="owning resource not found in cache" "related_resource_kind"="Order" "related_resource_name"="test-us-new-tls-52l5d-2370060743" "related_resource_namespace"="test" "resource_kind"="Challenge" "resource_name"="test-us-new-tls-52l5d-2370060743-3367498103" "resource_namespace"="test" "resource_version"="v1"
util.go:84] cert-manager/controller/certificaterequests-issuer-acme/handleOwnedResource "msg"="owning resource not found in cache" "related_resource_kind"="CertificateRequest" "related_resource_name"="test-us-new-tls-52l5d" "related_resource_namespace"="test" "resource_kind"="Order" "resource_name"="test-us-new-tls-52l5d-2370060743" "resource_namespace"="test" "resource_version"="v1"
setup.go:202] cert-manager/clusterissuers "msg"="skipping re-verifying ACME account as cached registration details look sufficient" "related_resource_kind"="Secret" "related_resource_name"="letsencrypt-cluster-issuer-key" "related_resource_namespace"="cert-manager" "resource_kind"="ClusterIssuer" "resource_name"="letsencrypt-production" "resource_namespace"="" "resource_version"="v1"
setup.go:202] cert-manager/clusterissuers "msg"="skipping re-verifying ACME account as cached registration details look sufficient" "related_resource_kind"="Secret" "related_resource_name"="letsencrypt-cluster-issuer-key" "related_resource_namespace"="cert-manager" "resource_kind"="ClusterIssuer" "resource_name"="letsencrypt-production" "resource_namespace"="" "resource_version"="v1"
sync.go:186] cert-manager/challenges "msg"="propagation check failed" "error"="dial tcp 205.251.194.189:53: i/o timeout" "dnsName"="test-app.com" "resource_kind"="Challenge" "resource_name"="test-us-new-tls-52l5d-2370060743-3367498103" "resource_namespace"="test" "resource_version"="v1" "type"="DNS-01"
sync.go:401] cert-manager/controller/ingress-shim "msg"="certificate resource has no owner. refusing to update non-owned certificate resource for object" "related_resource_kind"="Certificate" "related_resource_name"="test-us-new-tls" "related_resource_namespace"="test" "related_resource_version"="v1" "resource_kind"="Ingress" "resource_name"="app1-ingress" "resource_namespace"="test" "resource_version"="v1"
sync.go:186] cert-manager/challenges "msg"="propagation check failed" "error"="dial tcp 205.251.193.138:53: i/o timeout" "dnsName"="test-app.com" "resource_kind"="Challenge" "resource_name"="test-us-new-tls-52l5d-2370060743-3367498103" "resource_namespace"="test" "resource_version"="v1" "type"="DNS-01"
sync.go:401] cert-manager/controller/ingress-shim "msg"="certificate resource has no owner. refusing to update non-owned certificate resource for object" "related_resource_kind"="Certificate" "related_resource_name"="test-us-new-tls" "related_resource_namespace"="test" "related_resource_version"="v1" "resource_kind"="Ingress" "resource_name"="app1-ingress" "resource_namespace"="test" "resource_version"="v1"
sync.go:186] cert-manager/challenges "msg"="propagation check failed" "error"="dial tcp 205.251.193.138:53: i/o timeout" "dnsName"="test-app.com" "resource_kind"="Challenge" "resource_name"="test-us-new-tls-52l5d-2370060743-3367498103" "resource_namespace"="test" "resource_version"="v1" "type"="DNS-01"
controller.go:161] cert-manager/challenges "msg"="re-queuing item due to optimistic locking on resource" "key"="test/test-us-new-tls-52l5d-2370060743-3367498103" "error"="Operation cannot be fulfilled on challenges.acme.cert-manager.io \"test-us-new-tls-52l5d-2370060743-3367498103\": the object has been modified; please apply your changes to the latest version and try again"
sync.go:186] cert-manager/challenges "msg"="propagation check failed" "error"="dial tcp 205.251.198.244:53: i/o timeout" "dnsName"="test-app.com" "resource_kind"="Challenge" "resource_name"="test-us-new-tls-52l5d-2370060743-3367498103" "resource_namespace"="test" "resource_version"="v1" "type"="DNS-01"
sync.go:186] cert-manager/challenges "msg"="propagation check failed" "error"="dial tcp 205.251.198.244:53: i/o timeout" "dnsName"="test-app.com" "resource_kind"="Challenge" "resource_name"="test-us-new-tls-52l5d-2370060743-3367498103" "resource_namespace"="test" "resource_version"="v1" "type"="DNS-01"
controller.go:161] cert-manager/challenges "msg"="re-queuing item due to optimistic locking on resource" "key"="test/test-us-new-tls-52l5d-2370060743-3367498103" "error"="Operation cannot be fulfilled on challenges.acme.cert-manager.io \"test-us-new-tls-52l5d-2370060743-3367498103\": the object has been modified; please apply your changes to the latest version and try again"
sync.go:186] cert-manager/challenges "msg"="propagation check failed" "error"="dial tcp 205.251.193.138:53: i/o timeout" "dnsName"="test-app.com" "resource_kind"="Challenge" "resource_name"="test-us-new-tls-52l5d-2370060743-3367498103" "resource_namespace"="test" "resource_version"="v1" "type"="DNS-01"
sync.go:186] cert-manager/challenges "msg"="propagation check failed" "error"="dial tcp 205.251.193.138:53: i/o timeout" "dnsName"="test-app.com" "resource_kind"="Challenge" "resource_name"="test-us-new-tls-52l5d-2370060743-3367498103" "resource_namespace"="test" "resource_version"="v1" "type"="DNS-01"
sync.go:186] cert-manager/challenges "msg"="propagation check failed" "error"="dial tcp 205.251.198.244:53: i/o timeout" "dnsName"="test-app.com" "resource_kind"="Challenge" "resource_name"="test-us-new-tls-52l5d-2370060743-3367498103" "resource_namespace"="test" "resource_version"="v1" "type"="DNS-01"
sync.go:186] cert-manager/challenges "msg"="propagation check failed" "error"="dial tcp 205.251.194.189:53: i/o timeout" "dnsName"="test-app.com" "resource_kind"="Challenge" "resource_name"="test-us-new-tls-52l5d-2370060743-3367498103" "resource_namespace"="test" "resource_version"="v1" "type"="DNS-01"
sync.go:186] cert-manager/challenges "msg"="propagation check failed" "error"="dial tcp 205.251.198.244:53: i/o timeout" "dnsName"="test-app.com" "resource_kind"="Challenge" "resource_name"="test-us-new-tls-52l5d-2370060743-3367498103" "resource_namespace"="test" "resource_version"="v1" "type"="DNS-01"
sync.go:186] cert-manager/challenges "msg"="propagation check failed" "error"="dial tcp 205.251.194.189:53: i/o timeout" "dnsName"="test-app.com" "resource_kind"="Challenge" "resource_name"="test-us-new-tls-52l5d-2370060743-3367498103" "resource_namespace"="test" "resource_version"="v1" "type"="DNS-01"
sync.go:186] cert-manager/challenges "msg"="propagation check failed" "error"="dial tcp 205.251.198.244:53: i/o timeout" "dnsName"="test-app.com" "resource_kind"="Challenge" "resource_name"="test-us-new-tls-52l5d-2370060743-3367498103" "resource_namespace"="test" "resource_version"="v1" "type"="DNS-01"
sync.go:186] cert-manager/challenges "msg"="propagation check failed" "error"="dial tcp 205.251.194.189:53: i/o timeout" "dnsName"="test-app.com" "resource_kind"="Challenge" "resource_name"="test-us-new-tls-52l5d-2370060743-3367498103" "resource_namespace"="test" "resource_version"="v1" "type"="DNS-01"
sync.go:186] cert-manager/challenges "msg"="propagation check failed" "error"="dial tcp 205.251.196.216:53: i/o timeout" "dnsName"="test-app.com" "resource_kind"="Challenge" "resource_name"="test-us-new-tls-52l5d-2370060743-3367498103" "resource_namespace"="test" "resource_version"="v1" "type"="DNS-01"
sync.go:401] cert-manager/controller/ingress-shim "msg"="certificate resource has no owner. refusing to update non-owned certificate resource for object" "related_resource_kind"="Certificate" "related_resource_name"="test-us-new-tls" "related_resource_namespace"="test" "related_resource_version"="v1" "resource_kind"="Ingress" "resource_name"="app2-ingress" "resource_namespace"="test" "resource_version"="v1"
sync.go:186] cert-manager/challenges "msg"="propagation check failed" "error"="dial tcp 205.251.196.216:53: i/o timeout" "dnsName"="test-app.com" "resource_kind"="Challenge" "resource_name"="test-us-new-tls-52l5d-2370060743-3367498103" "resource_namespace"="test" "resource_version"="v1" "type"="DNS-01"
sync.go:401] cert-manager/controller/ingress-shim "msg"="certificate resource has no owner. refusing to update non-owned certificate resource for object" "related_resource_kind"="Certificate" "related_resource_name"="test-us-new-tls" "related_resource_namespace"="test" "related_resource_version"="v1" "resource_kind"="Ingress" "resource_name"="app2-ingress" "resource_namespace"="test" "resource_version"="v1"
sync.go:186] cert-manager/challenges "msg"="propagation check failed" "error"="dial tcp 205.251.193.138:53: i/o timeout" "dnsName"="test-app.com" "resource_kind"="Challenge" "resource_name"="test-us-new-tls-52l5d-2370060743-3367498103" "resource_namespace"="test" "resource_version"="v1" "type"="DNS-01"
sync.go:186] cert-manager/challenges "msg"="propagation check failed" "error"="dial tcp 205.251.193.138:53: i/o timeout" "dnsName"="test-app.com" "resource_kind"="Challenge" "resource_name"="test-us-new-tls-52l5d-2370060743-3367498103" "resource_namespace"="test" "resource_version"="v1" "type"="DNS-01"
sync.go:186] cert-manager/challenges "msg"="propagation check failed" "error"="dial tcp 205.251.193.138:53: i/o timeout" "dnsName"="test-app.com" "resource_kind"="Challenge" "resource_name"="test-us-new-tls-52l5d-2370060743-3367498103" "resource_namespace"="test" "resource_version"="v1" "type"="DNS-01"
sync.go:186] cert-manager/challenges "msg"="propagation check failed" "error"="dial tcp 205.251.196.216:53: i/o timeout" "dnsName"="test-app.com" "resource_kind"="Challenge" "resource_name"="test-us-new-tls-52l5d-2370060743-3367498103" "resource_namespace"="test" "resource_version"="v1" "type"="DNS-01"
sync.go:186] cert-manager/challenges "msg"="propagation check failed" "error"="dial tcp 205.251.194.189:53: i/o timeout" "dnsName"="test-app.com" "resource_kind"="Challenge" "resource_name"="test-us-new-tls-52l5d-2370060743-3367498103" "resource_namespace"="test" "resource_version"="v1" "type"="DNS-01"
controller.go:161] cert-manager/challenges "msg"="re-queuing item due to optimistic locking on resource" "key"="test/test-us-new-tls-52l5d-2370060743-3367498103" "error"="Operation cannot be fulfilled on challenges.acme.cert-manager.io \"test-us-new-tls-52l5d-2370060743-3367498103\": the object has been modified; please apply your changes to the latest version and try again"
sync.go:186] cert-manager/challenges "msg"="propagation check failed" "error"="dial tcp 205.251.198.244:53: i/o timeout" "dnsName"="test-app.com" "resource_kind"="Challenge" "resource_name"="test-us-new-tls-52l5d-2370060743-3367498103" "resource_namespace"="test" "resource_version"="v1" "type"="DNS-01"

@github4sanjay

github4sanjay commented Dec 10, 2022

@wolfedale I am also facing the same issue with Route53. Could you please share the solution? @evercast-mahesh2021 did you find a solution?

@MageshSrinivasulu

@evercast-mahesh2021 @github4sanjay I am facing the same issue too. Do you have any solution?

@evercast-mahesh2021

I think I had to destroy and recreate my EKS cluster.

@github4sanjay

@MageshSrinivasulu
Run kubectl describe svc kube-dns -n kube-system, get the IP addresses, and put them here:

extraArgs:
- --cluster-issuer-ambient-credentials
- --dns01-recursive-nameservers-only
- --dns01-recursive-nameservers=10.0.10.221:53,10.0.19.128:53
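
If you installed cert-manager with Helm, this goes under the chart's extraArgs value; roughly (a sketch, assuming the jetstack/cert-manager chart and a values.yaml containing the block above):

helm upgrade --install cert-manager jetstack/cert-manager --namespace cert-manager --values values.yaml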

@kzap

kzap commented Jan 27, 2023

If you are getting this error and using a public Route53 zone, check whether your VPC Network ACLs are blocking inbound port 53; that was what we found. A possible workaround is setting nameservers for the DNS-01 self-check (https://cert-manager.io/docs/configuration/acme/dns01/#setting-nameservers-for-dns01-self-check), as @github4sanjay mentioned:

extraArgs:
- --dns01-recursive-nameservers-only
- --dns01-recursive-nameservers=kube-dns.kube-system.svc.cluster.local:53
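
Note that some cert-manager versions expect IP addresses rather than DNS names here; if the service name form doesn't work for you, you can substitute the kube-dns ClusterIP, which you can look up with:

kubectl get svc kube-dns -n kube-system -o jsonpath='{.spec.clusterIP}'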
