
Waiting for http-01 challenge propagation: failed to perform self check GET request #3238

Closed
dineshgupta04 opened this issue Aug 31, 2020 · 54 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. triage/support Indicates an issue that is a support question.

Comments

@dineshgupta04

Status:
Presented: true
Processing: true
Reason: Waiting for http-01 challenge propagation: failed to perform self check GET request 'http://abc.com/.well-known/acme-challenge/Oej8tloD2wuHNBWS6eVhSKmGkZNfjLRemPmpJoHOPkA': Get "http://abc.com/.well-known/acme-challenge/Oej8tloD2wuHNBWS6eVhSKmGkZNfjLRemPmpJoHOPkA": dial tcp 18.192.17.98:80: connect: connection timed out
State: pending

Installation:
I am using AWS EKS.
I cloned nginx-ingress locally and installed it on EKS using the annotation
service.beta.kubernetes.io/aws-load-balancer-type: nlb

I installed cert-manager using Helm.
I applied an Issuer and an Ingress resource. So far I haven't created any application deployment.

When I run kubectl describe challenge, I get the above error message.

I am doing nothing extra. I have tried every possible way but it's not working. Can anyone help here?
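
For context, the issuer referred to above would typically be a minimal HTTP-01 ClusterIssuer along these lines (a sketch only: the name, email, and ingress class are placeholders, and the apiVersion assumes a recent cert-manager release):

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod          # placeholder name
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: you@example.com        # placeholder email
    privateKeySecretRef:
      name: letsencrypt-prod-key  # Secret that stores the ACME account key
    solvers:
    - http01:
        ingress:
          class: nginx            # must match the ingress controller's class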

@meyskens
Contributor

Can you access https://abc.com/.well-known/acme-challenge/Oej8tloD2wuHNBWS6eVhSKmGkZNfjLRemPmpJoHOPkA from within a pod inside the cluster?

/triage support

@jetstack-bot jetstack-bot added the triage/support Indicates an issue that is a support question. label Sep 22, 2020
@gudipudipradeep

gudipudipradeep commented Sep 24, 2020

@meyskens I was also getting the same issue. Here is the result:
[screenshot]

I was using the HAProxy ingress controller, but created the Ingress with nginx-based annotations.
[screenshot]

@nabsul

nabsul commented Sep 28, 2020

Hi, I'm having the same problem (with nginx ingress). I tried curling the validation URL from a pod in the cluster and got the following response:

curl: (52) Empty reply from server

Whereas doing the same from outside the cluster returns the secret as expected.

Does this mean there's something strange going on with DNS in Kubernetes?

@nabsul

nabsul commented Sep 28, 2020

Interestingly, I can access external domains from the pod, but I can't seem to access any of the domains that are hosted inside this cluster. For example:

> curl https://nabeel.blog
curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to nabeel.blog:443

vs:

> curl https://bing.com
<html><head><title>Object moved</title></head><body>
<h2>Object moved to <a href="https://www.bing.com:443/?toWww=1&amp;redig=E2BCEB95F2954770B50A53D9BBBE3C3D">here</a>.</h2>
</body></html>

@Serrvosky

I'm facing the same problem as @nabsul

@nabsul

nabsul commented Oct 6, 2020

Based on some other threads I've been reading, this seems to be related to a bug in Kubernetes DNS, not directly to cert-manager (though it certainly affects cert-manager heavily).

In case this is helpful to others stuck on this issue, I've unblocked myself by manually generating certs and uploading them to my cluster. (Note: I prefer to use a VM or pod to do the following, because the source IP address gets logged in public records at Let's Encrypt.)

  1. Run certbot certonly --manual --preferred-challenges dns and follow the instructions. You'll need to update a TXT record in your domain settings to complete the process.

  2. Go to the directory where the certs were created and run kubectl create secret tls [secret-name] --cert=fullchain.pem --key=privkey.pem

It's tedious, but at least my certs won't expire while we're waiting for this bug to get fixed.

Docs:

https://certbot.eff.org/docs/using.html#manual
https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#-em-secret-tls-em-
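
Once the Secret exists, it is wired into the Ingress through the tls section; a sketch with hypothetical names (example.com and my-tls-secret stand in for your own):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
spec:
  tls:
  - hosts:
    - example.com
    secretName: my-tls-secret   # the [secret-name] from step 2 above
  rules:
  - host: example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: my-app        # hypothetical backend Service
            port:
              number: 80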

@nabsul

nabsul commented Oct 6, 2020

In the meantime, I wonder how hard/bad it would be to hack cert-manager itself to skip the "self check GET request" step altogether. It's a great idea to do this check, but I don't think it's absolutely necessary to the cert renewal process.

@nabsul

nabsul commented Oct 6, 2020

WARNING: This is a random idea that I haven't fully thought through. Attempt at your own risk :-).

Although I would love to, I most likely don't have time to mess with this idea, but if anyone wants to give it a shot, I would try replacing the testReachability() function here with a simple return nil.

You'd then need to build a Docker image, upload it to docker hub, and use it instead of the official image in your cluster.

Again, if this works at all, it should be considered a temporary solution until a formal fix comes out.

@meyskens
Contributor

meyskens commented Oct 8, 2020

I strongly recommend not doing that. Have you tried https://cert-manager.io/docs/usage/certificate/#temporary-certificates-whilst-issuing ?
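
For reference, the page linked above describes the cert-manager.io/issue-temporary-certificate annotation. A minimal sketch, assuming you manage the Certificate resource yourself (all names are illustrative):

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: example-com
  annotations:
    cert-manager.io/issue-temporary-certificate: "true"  # serve a self-signed cert while the real one is pending
spec:
  secretName: example-com-tls
  dnsNames:
  - example.com
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer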

@mohamedalaa33

@meyskens what do you recommend?

@nabsul

nabsul commented Oct 23, 2020

Hi all. Has there been any news on this issue? I gave up on looking for a solution and instead figured out how to manually renew Let's Encrypt certs in Kubernetes.

In case any of you are still stuck, I've shared code and instructions to do this here: https://github.com/nabsul/k8s-letsencrypt

The instructions are long for the sake of clarity, but it's actually not that bad to do it manually.

@UsernameAlvarez

Is there any new development on this issue?

@nabsul

nabsul commented Nov 13, 2020

I've lost faith in cert-manager due to the lack of progress on this issue. I'm writing a replacement that automates what I described in my previous comment.

@just1689

I am also having this issue. It's far too on-and-off.

@farahty

farahty commented Nov 24, 2020

I also have the same issue.

@nabsul

nabsul commented Nov 25, 2020

If anyone is stuck and willing to try out some experimental code, please reach out to me (LinkedIn and Twitter username is the same as here).

@just1689

just1689 commented Nov 25, 2020

@nabsul Hi there. Are you referring to https://github.com/nabsul/k8s-letsencrypt ? I'd like to give it a go. Need to find time this week. I'd be happy to move the conversation over to your repo if that works for you.

@rcomblen

Hi here,

I got exactly the same issue while deploying the GitLab Helm charts, which rely on Jetstack's cert-manager (version v0.10.1).

The log of the acme-http-solver pod kept reporting "Failed to perform self check GET request" on the challenge URL on the public domain.

I was working with the French Kubernetes-as-a-service provider Scaleway (https://www.scaleway.com/en/kubernetes-kapsule/).

The default network setup is using Cilium as container network interface (CNI).

On a cluster with the same provider using the Calico CNI, the problem is gone, certificates are properly issued.

@farahty

farahty commented Nov 25, 2020

Finally I have a solution: I followed the DigitalOcean tutorial, and step 5 solved the issue.

@nabsul

nabsul commented Nov 25, 2020

@nimerfarahty Do you know if the solution suggested by DigitalOcean works if you have multiple domains on the same load balancer?

@nabsul

nabsul commented Nov 25, 2020

@just1689 It's actually a new project based on the one you referenced. I'll be making the code public today and will tag you from there.

@chrissound

Interestingly, I can access external domains from the pod, but I can't seem to access any of the domains that are hosted inside this cluster. For example:

> curl https://nabeel.blog
curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to nabeel.blog:443

vs:

> curl https://bing.com
<html><head><title>Object moved</title></head><body>
<h2>Object moved to <a href="https://www.bing.com:443/?toWww=1&amp;redig=E2BCEB95F2954770B50A53D9BBBE3C3D">here</a>.</h2>
</body></html>

Same issue here (DigitalOcean), which probably points to a bug upstream. I'm not sure whether this issue only started occurring after installing cert-manager, though, in which case maybe it's not upstream.

@chrissound

Seems to be the same issue here: #466

@fabstu

fabstu commented Nov 27, 2020

This PR (implementation of the KEP) might help: kubernetes/kubernetes#92312
It will probably be merged for 1.21 (next cycle). So until then, maybe use the workarounds or use DNS-01 instead of HTTP-01.

@zukko78

zukko78 commented Feb 27, 2021

Can you access https://abc.com/.well-known/acme-challenge/Oej8tloD2wuHNBWS6eVhSKmGkZNfjLRemPmpJoHOPkA from within a pod inside the cluster?

/triage support

Hi @meyskens, is it possible to have cert-manager make its self-check requests over https instead of http? For example, my current issue:
E0227 03:49:29.419018 1 sync.go:182] cert-manager/controller/challenges "msg"="propagation check failed" "error"="failed to perform self check GET request 'http://xxxx.com/.well-known/acme-challenge/FD_v3Aqntwk-T5iXp6XHBiHrwBe6JgUAH75SxmjMB58': Get "http://xxxx.com/.well-known/acme-challenge/FD_v3Aqntwk-T5iXp6XHBiHrwBe6JgUAH75SxmjMB58\": dial tcp 1.x.x.0:80: connect: connection refused" "dnsName"="xxxx.com" "resource_kind"="Challenge" "resource_name"="xxxx.com-5hx2t-578086427-1035349906" "resource_namespace"="gloo" "resource_version"="v1" "type"="HTTP-01"

over https:
[@busybox:/ ]$ curl -vkL https://xxxx.com/.well-known/acme-challenge/FD_v3Aqntwk-T5iXp6XHBiHrwBe6JgUAH75SxmjMB58
* SSLv3, TLS handshake, Client hello (1):
* SSLv3, TLS handshake, Server hello (2):
* SSLv3, TLS handshake, CERT (11):
* SSLv3, TLS handshake, Server key exchange (12):
* SSLv3, TLS handshake, Server finished (14):
* SSLv3, TLS handshake, Client key exchange (16):
* SSLv3, TLS change cipher, Client hello (1):
* SSLv3, TLS handshake, Finished (20):
* SSLv3, TLS change cipher, Client hello (1):
* SSLv3, TLS handshake, Finished (20):
> GET /.well-known/acme-challenge/FD_v3Aqntwk-T5iXp6XHBiHrwBe6JgUAH75SxmjMB58 HTTP/1.1
> User-Agent: curl/7.35.0
> Host: xxxx.com
> Accept: */*
>
< HTTP/1.1 200 OK
< cache-control: no-cache, no-store, must-revalidate
< date: Sat, 27 Feb 2021 03:53:07 GMT
< content-length: 87
< content-type: text/plain; charset=utf-8
< x-envoy-upstream-service-time: 2
< server: envoy
<
FD_v3Aqntwk-T5iXp6XHBiHrwBe6JgUAH75SxmjMB58.6M5WXWjw_rw-kfYcGGo4my-CchJbGRRNuHUgAytzzJk

@joshua-hester

@nimerfarahty's fix solved my problem. I agree this needs to be addressed ASAP.

@ahmed-adly-khalil

I'm facing the same issue with Linode. Step 5 of the DigitalOcean solution above didn't solve it for me. Any ideas?

@ahmed-adly-khalil

I was able to fix this. The chain of issues started as follows:

I had the following annotations on my ingress:

nginx.ingress.kubernetes.io/use-regex: "true"

nginx.ingress.kubernetes.io/rewrite-target: /

This caused all URLs to be rewritten to /, which made cert-manager fail its self-check before ever communicating with Let's Encrypt, so certificate generation never started at all. It also caused DNS resolution from inside the cluster to fail.

Commenting out these two lines made things work; a scoped alternative is sketched below.
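
If the rewrite is still needed for the application, the capture-group pattern from the ingress-nginx docs scopes the rewrite to one path prefix instead of every URL; a sketch with hypothetical names:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
  annotations:
    nginx.ingress.kubernetes.io/use-regex: "true"
    nginx.ingress.kubernetes.io/rewrite-target: /$2   # only rewrites what the regex below captures
spec:
  ingressClassName: nginx
  rules:
  - host: example.com
    http:
      paths:
      - path: /app(/|$)(.*)                # /.well-known/acme-challenge/... no longer matches
        pathType: ImplementationSpecific
        backend:
          service:
            name: my-app
            port:
              number: 80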

@adonaicosta

Hi guys,

To fix the problem on a DOKS cluster, it's enough to add these annotations to the ingress-nginx-controller Service:

apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.kubernetes.io/do-loadbalancer-enable-proxy-protocol: "true"
    service.beta.kubernetes.io/do-loadbalancer-hostname: "prefix.mydomain.com"  # any DNS entry pointing at this load balancer

@nabsul

nabsul commented Oct 27, 2021

@bjornarhagen I'm sorry but I don't remember the exact thread. What I remember is that for some reason, DNS would fail to correctly resolve to allow a pod to access a domain that is part of the cluster you're on (see this comment: #3238 (comment))

Does this help at all? #656 (comment)

I ended up writing my own cert manager. I was interested in learning Let's Encrypt, and so far it's been running flawlessly for close to a year. It's not production quality, but at least I understand it and can fix anything that goes wrong.

My next project is to replace nginx-controller, and then my Kubernetes cluster will be perfect!

@ahmed-adly-khalil

@nabsul Thank you so much for your help on this :)

@Jille

Jille commented Oct 27, 2021

I ran into the same problem (for me the error was EOF). It was caused by our cloud load balancer not using the Proxy Protocol on connections coming from inside the cluster. Edit: I was wrong; it's caused by kube-proxy setting up an iptables rule that rewrites outgoing traffic to the Service for nginx. https://github.com/compumike/hairpin-proxy looks like a solution for that.

Some debugging tips for others:

  • Create a debugging pod (kubectl run -i --tty --rm debug --image=ubuntu:20.04 --restart=Never -- bash), and look around with ping, dig (apt install dnsutils), and curl (apt install curl). Figure out if DNS is giving the right answer and whether you can connect to the HTTP server.
  • If the connection is established, but you see an error (like EOF in my case), check the logs of your ingress controller (like nginx).
  • If it resolves to an IP inside your cluster, check if there's any firewalls / security policies preventing you. (Check Hubble, if you're using Cilium)

@Vad1mo

Vad1mo commented Nov 29, 2021

It seems a patch was provided on the Kubernetes side (kubernetes/kubernetes#92312); however, we still face the issue with 1.21.

Any ideas how to solve this with an AWS NLB, without the hairpin-proxy solution?

@afpd

afpd commented Dec 28, 2021

In my setup of bare-metal Kubernetes + MetalLB + nginx-ingress, I was able to resolve the issue by adding internalTrafficPolicy: Cluster to the service/ingress-nginx-controller, as sketched below.
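
The field sits on the controller's Service spec; a sketch (the selector and ports here are illustrative, copy them from your actual Service):

apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
spec:
  type: LoadBalancer
  internalTrafficPolicy: Cluster   # the field mentioned above; keeps in-cluster traffic flowing through the Service
  selector:
    app.kubernetes.io/name: ingress-nginx
  ports:
  - name: http
    port: 80
    targetPort: http
  - name: https
    port: 443
    targetPort: https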

@vhadianto

vhadianto commented Feb 23, 2022

I found this issue in our nginx ingress controller + cert-manager setup (in a public cloud).

The issue in our setup was the network policy. I needed to do the following:

  • allow egress on port 8089 from the nginx ingress controller to the namespace or pod where the ACME solver runs
  • allow ingress on port 8089 to the namespace/pod where the ACME solver runs, from the nginx namespace

If you install everything in a single namespace, I suspect you don't have to do this. A sketch of the ingress half follows.
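
The sketch below assumes the default acmesolver listen port (8089) and the acme.cert-manager.io/http01-solver label that cert-manager puts on solver pods; the namespace names are hypothetical:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-acme-solver
  namespace: my-app                                    # namespace where the solver pods run
spec:
  podSelector:
    matchLabels:
      acme.cert-manager.io/http01-solver: "true"
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: ingress-nginx   # the nginx controller's namespace
    ports:
    - protocol: TCP
      port: 8089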

@jetstack-bot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle stale

@jetstack-bot jetstack-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 24, 2022
@jetstack-bot
Collaborator

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to jetstack.
/lifecycle rotten
/remove-lifecycle stale

@jetstack-bot jetstack-bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 23, 2022
@jetstack-bot
Collaborator

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to jetstack.
/close

@jetstack-bot
Collaborator

@jetstack-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to jetstack.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@adiii717

adiii717 commented Sep 8, 2022

I agree with @nabsul. I am able to reach the challenge URL with curl http://test.example.com/.well-known/acme-challenge/XHazaOCbmAA8Kh2ej7hySxZudeQuayNTgxzsm2-SQr0, but cert-manager gets a timeout on AWS EKS:

E0908 07:06:30.343494       1 sync.go:190] cert-manager/challenges "msg"="propagation check failed" "error"="failed to perform self check GET request 'http://test.example.com/.well-known/acme-challenge/XHazaOCbmAA8Kh2ej7hySxZudeQuayNTgxzsm2-SQr0': Get \"http://test.example.com/.well-known/acme-challenge/XHazaOCbmAA8Kh2ej7hySxZudeQuayNTgxzsm2-SQr0\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" "dnsName"="test.example.com" "resource_kind"="Challenge" "resource_name"="app-4w57c-777272283-1392974179" "resource_namespace"="argocd" "resource_version"="v1" "type"="HTTP-01"

@mikebevz

Same problem here in a standalone Kubernetes cluster.

@magsol

magsol commented Nov 29, 2022

Same problem here, on a bare-metal k3s cluster.

Running traefik 2.9.5 and cert-manager 1.10.1, both via helm.

@nabsul

nabsul commented Nov 30, 2022

If you're up for experimenting, I've created an alternative here: https://kcert.dev

@nis130

nis130 commented Jan 6, 2023

I am also facing the same problem. I am using a GCP internal HTTPS load balancer, which only supports one type of traffic: either HTTPS or HTTP. I am using HTTPS with the following annotations:

kubernetes.io/ingress.class: "gce-internal"
kubernetes.io/ingress.allow-http: "false"
cert-manager.io/cluster-issuer: letsencrypt-prod
acme.cert-manager.io/http01-edit-in-place: "true"

Since this ingress only accepts HTTPS traffic, the self check's GET request to http://example.com doesn't work, whereas https://example.com works fine. Is there any way I can skip the self check or make it use HTTPS somehow?

@adrianimboden

adrianimboden commented Jan 16, 2023

After reading all this, it seems to me that some basic networking knowledge is missing.

I support the decision not to provide a way to disable the check, because you can very easily get blocked by Let's Encrypt when you make too many invalid requests!

If you have a standalone cluster, you probably have a problem like this. Example:

  • the domain you want the certificate for is your-page.com
  • the domain your-page.com has the IP 1.2.3.4 set as its A-record
  • your nodes are in the private network 192.168.1.0/24
  • testnode1 has 192.168.1.100
  • your router internal IP is 192.168.1.1
  • your router external IP is 1.2.3.4
  • your router does DNAT so that TCP requests to the external IP 1.2.3.4 get DNATed to 192.168.1.100 (or to a load balancer, does not matter)

Now, the NAT most likely is configured so that the address translation only gets done for traffic that comes from outside/wan.

So basically, your node cannot access http://1.2.3.4/.well-known/whatever or http://your-page.com/.well-known/whatever, but computers on the internet (such as Let's Encrypt) can.

You should be able to verify that by comparing the result of dig your-page.com and curl http://your-page.com/ on your k8s node and on a different computer (e.g. your work computer at home).

Possible solutions

  • add 127.0.0.1 or the load balancer's internal IP to every node's /etc/hosts for the domains you want to reach (your-page.com in this example)
  • do split DNS on your internal DNS server and return the load balancer's IP instead of the public IP for all your domains (a CoreDNS variant of this is sketched below)
  • reconfigure the router to also forward ports 80/443 internally (this is rather hacky, as the source IP also has to be translated)

More information: https://wiki.kptree.net/doku.php?id=linux_router:nftables#hairpin_nat
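
For in-cluster traffic specifically, the split-DNS idea can also be done inside Kubernetes by teaching CoreDNS the internal address. A sketch using the hosts plugin, assuming a distribution that honors a coredns-custom ConfigMap (the ConfigMap name and key convention vary by distribution, so check yours; the IP and domain come from the example above):

apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns-custom
  namespace: kube-system
data:
  hairpin.server: |
    your-page.com:53 {
        hosts {
            192.168.1.100 your-page.com   # internal IP of the node/load balancer
            fallthrough
        }
    }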

@tkrusterholz

After reading all this, it seems to me that some basic networking knowledge is missing.

Major 4head moment. You're absolutely right. After fighting with this all morning, the solution was to add test.domain.com > node IP to my local DNS. Worked like a charm after that.

@shadyabhi

Summary: check where the issue lies, DNS or HTTP. Many people are reporting DNS here, so I wanted to share my fix as well.

There can be many reasons for timeouts; we first need to trace where the failure lies: https://cert-manager.io/docs/troubleshooting/#troubleshooting-a-failed-certificate-request. For me, the reason was DNS requests being dropped, leading to exceeding the context deadline for the self-check.

After inspecting each object, I traced it down to the Challenge resource, and it was a timeout error, which is what this issue is about. Everyone will have a different reason for timeouts.

In my case, this timeout was actually caused by DNS. I was able to trace it by running nslookup from one of the pods.

$ kubectl exec -i -t dnsutils -- nslookup hass.home.domain.host
Server:		10.152.183.10
Address:	10.152.183.10#53

** server can't find hass.home.domain.host.home.lab: SERVFAIL

command terminated with exit code 1

Note the addition of home.lab due to my resolv.conf and k8s defaults. The SERVFAIL returned after a 5s delay.

nameserver 127.0.0.53
options edns0 trust-ad
search home.lab

As I have search home.lab in place due to DHCP settings, the cert-manager pod's resolv.conf ends up being:

$ kubectl exec -i -t dnsutils -- cat /etc/resolv.conf
search default.svc.cluster.local svc.cluster.local cluster.local home.lab
nameserver 10.152.183.10
options ndots:5

Note ndots:5 and the existence of search home.lab. My local DNS resolver was configured to drop requests for home.lab subdomain records if the records don't exist (System Domain Local Zone Type=Deny at https://docs.netgate.com/pfsense/en/latest/services/dns/resolver-config.html).

As there was a 5s delay in the DNS lookup, cert-manager's self-check ended up timing out. The fix was to return NXDOMAIN instead of dropping requests, by setting System Domain Local Zone Type=Static.

TLDR: It can't be dns.... it's always DNS! 😂
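
If fixing the upstream resolver isn't an option, the cert-manager Helm chart exposes podDnsPolicy/podDnsConfig values that let you bypass the search-suffix expansion entirely; a sketch of the values file (hedged: confirm the keys against your chart version, and the nameserver is illustrative):

# values.yaml for the cert-manager chart
podDnsPolicy: "None"
podDnsConfig:
  nameservers:
  - "10.152.183.10"   # your cluster or upstream resolver
  searches: []        # drop the home.lab search suffix so no bogus lookups are made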

@64J0

64J0 commented Jun 26, 2023

Not sure how to proceed. I'm facing this problem as well.

As in @shadyabhi's scenario, I was able to find more information related to the problem in the Challenge CRD, but the information there is not very useful for understanding the problem.

I have this reason: context deadline exceeded (Client.Timeout exceeded while awaiting headers).

And I have these Events:

Events:
  Type    Reason     Age   From                     Message
  ----    ------     ----  ----                     -------
  Normal  Started    13m   cert-manager-challenges  Challenge scheduled for processing
  Normal  Presented  13m   cert-manager-challenges  Presented challenge using HTTP-01 challenge mechanism

The problem was some netpol misconfiguration.

@vchokshi

I ran into this problem yesterday and today. After debugging, I switched over to DNS01 as a validation method and it works great.

Kubernetes 1.27.3-00
external-dns: 6.28.5
ingress-nginx: 4.8.3
cert manager 1.13.2

I tried to read through RFC 8555, and my best guess is that my situation is a race condition between the HTTP01 challenger and external-dns updates.
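
For anyone taking the same route, a minimal DNS-01 issuer looks roughly like this (a sketch: the Route53 solver is just an example, pick the entry matching your DNS provider, and all names are placeholders):

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-dns01
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: you@example.com
    privateKeySecretRef:
      name: letsencrypt-dns01-key
    solvers:
    - dns01:
        route53:
          region: us-east-1   # assumes ambient credentials (IRSA/instance profile)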

@ohemelaar

Hi, just adding my two cents in case it helps someone. I ran into this issue, with the following symptoms:

  • running curl from a haproxy ingress pod to request the challenge URL from within the cluster worked fine
  • running curl from my machine to request the challenge URL worked fine
  • running curl from a temporary alpine pod to request the challenge URL showed Connection reset by peer

In my case, it was an issue with our HAProxy ingress controller because we had enabled proxy protocol. Disabling it made the challenge solve immediately. We'll now have to look at how we can keep proxy protocol AND still solve challenges, but that's not the topic of this issue.

@MohamedHaroon98

Hi, I was facing a similar issue with EKS, and adding the following to the ingress annotations solved it:

nginx.ingress.kubernetes.io/cors-allow-origin: http://localhost,<my_domain>
