"propagation check failed" for DNS-01 ACME challenge on internal EKS-cluster #5586

Closed
kobethuwis opened this issue Nov 18, 2022 · 7 comments

kobethuwis commented Nov 18, 2022

I'm trying to configure a Let's Encrypt ClusterIssuer for our internal EKS cluster using the official Helm chart and documentation. I can access the challenge from my web browser, and DNS resolution works for both the pod requesting the certificate and the cert-manager pods. I can even see the DNS-01 challenge's TXT record in the Route53 console. However, when the certificate is created, the challenge and order stay in the 'Pending' state and the cert-manager pod logs the following:

cert-manager/challenges "msg"="propagation check failed" "error"="DNS record for \"my-dns-record.net\" not yet propagated" "dnsName"="my-dns-record.net" "resource_kind"="Challenge" "resource_name"="letsencrypt-tls-web" "resource_namespace"="airflow" "resource_version"="v1" "type"="DNS-01"
E1118 10:08:18.870843       1 sync.go:190] 

What's going wrong here? I've already tried editing the service account's permissions and specifying the extraArgs, as mentioned in #1627, without success.

Cert-manager config

resource "helm_release" "cert-manager" {
  repository = "https://charts.jetstack.io"
  chart      = "cert-manager"
  name       = "cert-manager"
  namespace  = var.namespace
  version    = var.chart_tag
  values     = [
    <<EOT
installCRDs: true
serviceAccount:
  name: ${var.service_account}
  annotations:
    eks.amazonaws.com/role-arn: ${module.cert_manager_iam_assumable_role_with_oidc.iam_role_arn}
securityContext:
  fsGroup: 1001
extraArgs:
  - --dns01-recursive-nameservers-only
  - --dns01-recursive-nameservers=8.8.8.8:53,1.1.1.1:53
  EOT
  ]
}

resource "kubernetes_manifest" "cluster_issuer_manifest" {
  manifest = {
    apiVersion = "cert-manager.io/v1"
    kind       = "ClusterIssuer"
    metadata   = {
      name = "letsencrypt-cluster-issuer"
    }
    spec = {
      acme = {
        server              = "https://acme-v02.api.letsencrypt.org/directory"
        email               = "xxxxxxxxxxxxxxxxx"
        privateKeySecretRef = {
          name = "letsencrypt-cluster-issuer-key"
        }
        solvers = [
          {
            dns01 = {
              route53       = {
                region       = "eu-west-1"
                hostedZoneID = var.zone_id
              }
            }
          }
        ]
      }
    }
  }
  depends_on = [helm_release.cert-manager]
}

module "cert_manager_iam_assumable_role_with_oidc" {
  source                        = "terraform-aws-modules/iam/aws//modules/iam-assumable-role-with-oidc"
  version                       = "5.5.0"
  role_name                     = "datalake-cert-manager-service-role"
  create_role                   = true
  provider_url                  = var.provider_url
  role_policy_arns              = [aws_iam_policy.cert_manager_policy.arn]
  oidc_fully_qualified_subjects = ["system:serviceaccount:${var.namespace}:${var.service_account}"]
}

data "aws_iam_policy_document" "cert_manager_policy_doc" {
  statement {
    effect = "Allow"

    actions = [
      "route53:ChangeResourceRecordSets",
      "route53:ListResourceRecordSets"
    ]

    resources = ["arn:aws:route53:::hostedzone/${var.zone_id}"]
  }

  statement {
    effect = "Allow"

    actions = [
      "route53:GetChange"
    ]

    resources = ["arn:aws:route53:::change/*"]
  }
}

Ingress config

    annotations:
      kubernetes.io/ingress.class: "nginx"
      cert-manager.io/cluster-issuer: "letsencrypt-cluster-issuer"
    tls:
      secretName: letsencrypt-tls-flower

kobethuwis commented Nov 25, 2022

🤷‍♂️ Turns out that it's impossible to solve the DNS-01 challenge in an AWS Route53 private hosted zone, as mentioned in #2690.
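
For readers who are unsure whether their zone is private, the following is a minimal sketch of one way to check, assuming boto3 is installed and AWS credentials are available. The helper names `is_private_zone` and `check_zone` are hypothetical and not part of cert-manager or this issue. The lookup relies on the `PrivateZone` flag in the Route 53 `GetHostedZone` response, and it needs the `route53:GetHostedZone` permission, which the policy posted above does not grant.

```python
from typing import Any, Mapping


def is_private_zone(response: Mapping[str, Any]) -> bool:
    """Return True when a Route 53 GetHostedZone response describes a private zone."""
    config = response.get("HostedZone", {}).get("Config", {})
    return bool(config.get("PrivateZone", False))


def check_zone(zone_id: str) -> bool:
    """Fetch the zone with boto3 and report whether it is private.

    boto3 is imported lazily so that is_private_zone stays dependency-free.
    """
    import boto3  # assumption: boto3 is installed and credentials are configured

    client = boto3.client("route53")
    return is_private_zone(client.get_hosted_zone(Id=zone_id))
```

Calling `check_zone` with the value passed as `hostedZoneID` in the ClusterIssuer should return `True` for a private hosted zone, in which case the DNS-01 challenge cannot be validated from the public internet.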

@MageshSrinivasulu

@kobethuwis I am using a public hosted zone with a similar configuration and am facing the same issue.

Did you ever try a public hosted zone, and did it work for you?

@kobethuwis
Author

Yes! The certificate configuration for the ELB worked, but I ran into application-level issues that delayed the SSL deployment as a whole. An ACM-generated certificate issued in a public hosted zone definitely works, though!

kzap commented Jan 27, 2023

If you are getting this error with a public Route53 zone, check whether your VPC network ACLs are blocking incoming traffic on port 53. That is what we found in our case. A possible workaround is to set the nameservers used for the DNS-01 self-check, as described at https://cert-manager.io/docs/configuration/acme/dns01/#setting-nameservers-for-dns01-self-check:

extraArgs:
- --dns01-recursive-nameservers-only
- --dns01-recursive-nameservers=kube-dns.kube-system.svc.cluster.local:53
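
The port 53 reachability check described in this comment can be sketched roughly as follows. This is an illustrative diagnostic using only the Python standard library, not cert-manager's own self-check logic. The function names are hypothetical, the probe uses UDP only, and it assumes the challenge record lives at `_acme-challenge.<domain>`.

```python
import secrets
import socket
import struct


def challenge_name(domain: str) -> str:
    """Return the record name a DNS-01 challenge is published under."""
    return "_acme-challenge." + domain.strip(".").lower()


def build_txt_query(name: str, txid: int) -> bytes:
    """Encode a minimal recursive DNS query for the TXT record of `name`."""
    # Header: id, flags (RD bit set), 1 question, 0 answer/authority/additional.
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii") for label in name.split(".")
    ) + b"\x00"
    # QTYPE 16 = TXT, QCLASS 1 = IN.
    return header + qname + struct.pack("!HH", 16, 1)


def answer_count(domain: str, resolver: str, port: int = 53, timeout: float = 3.0):
    """Query `resolver` over UDP and return the number of answer records.

    Returns None when no usable reply arrives (timeout, socket error,
    truncated header, or mismatched transaction id).
    """
    txid = secrets.randbits(16)
    query = build_txt_query(challenge_name(domain), txid)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (resolver, port))
            reply, _ = sock.recvfrom(4096)
        except OSError:
            return None
    if len(reply) < 12:
        return None
    reply_id, _flags, _qd, ancount, _ns, _ar = struct.unpack("!HHHHHH", reply[:12])
    return ancount if reply_id == txid else None
```

Run from a pod that shares the cert-manager pod's network path, `answer_count("my-dns-record.net", "8.8.8.8")` returning `None` suggests that DNS traffic on port 53 is being dropped somewhere. A result of `0` suggests the resolver is reachable but cannot see the challenge record, and a positive count means some answer is visible from that resolver.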

@droslean

@kzap @kobethuwis I have the same issue but I can't find a solution. Can you let me know how to fix this, or at least help me understand the problem? Adding those extra flags doesn't help.

@kobethuwis
Author

@droslean You need to make sure that your cert-manager deployment can reach and resolve public IPs. This is needed because a certificate can't be issued unless you 'prove' that you control the domain (think of it as a kind of handshake). As @kzap mentioned, check whether your security groups, network gateways, and even external firewalls allow traffic through port 53.

When you use a private hosted zone in AWS, you are by definition blocking public access, so the ACME server cannot see the challenge record:
-> it is impossible to issue an SSL certificate this way in a private hosted zone.

Additionally, I was trying to provision an SSL certificate through cert-manager and attach it to an ingress-nginx load balancer. Since I was using the Helm chart on an EKS-hosted cluster and wanted a single SSL certificate for all of my applications (attached to the ELB rather than to the separate Ingresses), I had to use AWS ACM to issue the certificate, which worked like a charm!

@droslean

@kobethuwis Thanks for the response! I took a look, and everything seems like it should be OK, but it's not :(. I will dig further into the problem by opening a customer support case with AWS to at least understand where the problem is.
