TLS Authentication in Kubernetes, Pulsar 2.6.1 - Broker crash loop on startup due to 401 in WorkerService.start(..) #84

Closed
devinbost opened this issue Nov 12, 2020 · 11 comments

Comments

@devinbost

devinbost commented Nov 12, 2020

Copied from the apache/pulsar GitHub issue (apache/pulsar#8536):

Describe the bug
After configuring TLS Authentication in Pulsar 2.6.1 with this helm chart: https://github.com/devinbost/pulsar-helm-chart/tree/tls-auth
the broker gets stuck in a restart loop due to the WorkerService crashing with:

21:24:45.025 [pulsar-web-48-8] WARN org.apache.pulsar.broker.web.AuthenticationFilter - [10.244.0.9] Failed to authenticate HTTP request: Client unable to authenticate with TLS certificate
21:24:45.042 [pulsar-web-48-8] INFO org.eclipse.jetty.server.RequestLog - 10.244.0.9 - - [17/Nov/2020:21:24:44 +0000] "PUT /admin/v2/persistent/public/functions/assignments HTTP/1.1" 401 0 "-" "Pulsar-Java-v2.6.1" 63
21:24:45.042 [pulsar-web-48-1] INFO org.eclipse.jetty.server.RequestLog - 10.244.0.7 - - [17/Nov/2020:21:24:44 +0000] "GET /metrics HTTP/1.1" 302 0 "-" "Prometheus/2.17.2" 63
21:24:45.098 [AsyncHttpClient-64-1] WARN org.apache.pulsar.client.admin.internal.BaseResource - [http://pulsar-ci-broker-0.pulsar-ci-broker.pulsar.svc.cluster.local:8080/admin/v2/persistent/public/functions/assignments] Failed to perform http put request: javax.ws.rs.NotAuthorizedException: HTTP 401 Unauthorized
21:24:45.115 [main] ERROR org.apache.pulsar.functions.worker.WorkerService - Error Starting up in worker
org.apache.pulsar.client.admin.PulsarAdminException$NotAuthorizedException: HTTP 401 Unauthorized

during the WorkerService.start(..) method execution.

Edit:
After debugging, the issue is that the data is still unreadable after the decrypt step, so something is misconfigured with the certs.

To Reproduce
Steps to reproduce the behavior:

  1. Clone my fork of the Pulsar helm chart and check out the tls-auth branch:
    git clone https://github.com/devinbost/pulsar-helm-chart.git
    cd pulsar-helm-chart
    git checkout tls-auth

  2. Start minikube with an appropriate number of CPUs:
    minikube start --memory=8192 --cpus=6 --cni=bridge

  3. Run the following commands to set up the Kubernetes environment, tokens, certs, and keys:
    ./scripts/cert-manager/install-cert-manager.sh
    ./scripts/pulsar/prepare_helm_release.sh -n pulsar -k pulsar-ci -c --pulsar-superusers superadmin,proxy-admin,broker-admin,client-admin,admin

  4. Install the local helm chart with the values file specified:
    helm install --values examples/values-minikube-with-tls-and-jwt.yaml pulsar-ci ./charts/pulsar/

  5. After waiting for a time, get logs from the broker:
    kubectl -n pulsar logs pulsar-ci-broker-0

The logs should demonstrate the problem.

Expected behavior
Decryption should succeed, so that the correct auth headers are passed when the broker executes a PUT on the functions/assignments topic during startup.

Environment

  • minikube v1.14.2 on Darwin 10.15.7
  • Kubernetes v1.19.2 on Docker 19.03.8 ...
  • Enabled addons: storage-provisioner, default-storageclass
  • kubectl is configured to use "minikube"

Code involved (edited)

When we create the brokerAdmin client, we use the pulsarWebServiceUrl: https://github.com/apache/pulsar/blob/master/pulsar-functions/worker/src/main/java/org/apache/pulsar/functions/worker/WorkerService.java#L146

The first PUT on the function assignment topic uses the brokerAdmin client here: https://github.com/apache/pulsar/blob/master/pulsar-functions/worker/src/main/java/org/apache/pulsar/functions/worker/WorkerService.java#L169

There must be a cert misconfiguration issue.
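
One thing worth ruling out here, as a minimal sketch rather than a confirmed diagnosis: the failing PUT in the log above goes to an http:// URL on port 8080, and a client certificate can only be presented during a TLS handshake. The check below uses only java.net.URI; the class name AdminUrlCheck, the method canPresentClientCert, and the https example URL are hypothetical and not part of Pulsar.

```java
import java.net.URI;

public class AdminUrlCheck {

    /**
     * Returns true only when the URL uses https, i.e. when a TLS handshake
     * (and therefore a client certificate) is possible at all.
     */
    public static boolean canPresentClientCert(String adminUrl) {
        String scheme = URI.create(adminUrl).getScheme();
        return "https".equalsIgnoreCase(scheme);
    }

    public static void main(String[] args) {
        // URL copied from the broker log in this issue.
        String fromLog = "http://pulsar-ci-broker-0.pulsar-ci-broker.pulsar.svc.cluster.local:8080"
                + "/admin/v2/persistent/public/functions/assignments";
        // Hypothetical TLS endpoint, for comparison only.
        String hypotheticalTls = "https://broker.example:8443/";

        System.out.println(canPresentClientCert(fromLog));         // false
        System.out.println(canPresentClientCert(hypotheticalTls)); // true
    }
}
```

If the first call prints false for the URL the worker actually uses, TLS authentication has no certificate to inspect on that request, regardless of how the certs themselves are configured.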

@devinbost
Author

After I swapped out the brokerClient auth to use token auth instead of TLS auth in the broker-configmap.yaml and proxy-configmap.yaml files, the cluster started just fine. So, it seems that there's a problem in the broker client TLS auth.

@devinbost
Author

I added TLS debugging. (I forgot to tag this issue in the commit.)

@devinbost
Author

devinbost commented Nov 18, 2020

It appears (from the debug logs) that the TLS session is established.
The exception "Client unable to authenticate with TLS certificate" is thrown in AuthenticationProviderTls from this block:

@Override
public String authenticate(AuthenticationDataSource authData) throws AuthenticationException {
    String commonName = null;

    if (authData.hasDataFromTls()) {
        Certificate[] certs = authData.getTlsCertificates();
        String distinguishedName = ((X509Certificate) certs[0]).getSubjectX500Principal().getName();
        for (String keyValueStr : distinguishedName.split(",")) {
            String[] keyValue = keyValueStr.split("=", 2);
            if (keyValue.length == 2 && "CN".equals(keyValue[0]) && !keyValue[1].isEmpty()) {
                commonName = keyValue[1];
                break;
            }
        }
    }

    if (commonName == null) {
        throw new AuthenticationException("Client unable to authenticate with TLS certificate");
    }

    return commonName;
}

(https://github.com/apache/pulsar/blob/master/pulsar-broker-common/src/main/java/org/apache/pulsar/broker/authentication/AuthenticationProviderTls.java#L86)

That implies that the CN is blank... However, the TLS logs (see attached) show that a CN is clearly present.
So, I'm not sure that I understand what is wrong here.
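
To see which branch of that method can produce the exception, here is a small standalone sketch that reproduces only the CN parsing on a plain DN string. It assumes the DN is in the comma-separated form that getSubjectX500Principal().getName() returns; the class TlsCnCheck and the sample DNs are made up for illustration. Note that commonName also stays null when authData.hasDataFromTls() is false, so a certificate with a valid CN can still end in the same exception if no TLS data reaches the provider; the null argument below stands in for that case.

```java
public class TlsCnCheck {

    /**
     * Same parsing as the quoted authenticate(..) method: split the DN on
     * commas and return the first non-empty CN value, or null if none exists.
     * A null DN models the case where no TLS certificate data is available.
     */
    public static String extractCommonName(String distinguishedName) {
        if (distinguishedName == null) {
            return null; // stands in for authData.hasDataFromTls() == false
        }
        for (String keyValueStr : distinguishedName.split(",")) {
            String[] keyValue = keyValueStr.split("=", 2);
            if (keyValue.length == 2 && "CN".equals(keyValue[0]) && !keyValue[1].isEmpty()) {
                return keyValue[1];
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Sample DNs are illustrative, not taken from the attached logs.
        System.out.println(extractCommonName("CN=superadmin,O=pulsar")); // superadmin
        System.out.println(extractCommonName("O=pulsar,C=US"));          // null
        System.out.println(extractCommonName(null));                     // null
    }
}
```

Both null results would surface as "Client unable to authenticate with TLS certificate", so the message alone does not distinguish a missing CN from missing TLS data.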

pulsarbroker.txt
Edit: There's a more recent log in a later comment that shows the CNs matching superAdmin role names. Even after making that change, I'm still getting the same 401.

@devinbost
Author

Just before the exception is thrown, the broker appears to establish a TLS session with ZooKeeper successfully, but then it logs this odd message:

Inaccessible trust store: /usr/local/openjdk-8/jre/lib/security/jssecacerts
trustStore is: /usr/local/openjdk-8/jre/lib/security/cacerts
trustStore type is: jks
trustStore provider is: 
the last modified time is: Thu Apr 16 10:21:14 UTC 2020
Reload the trust store
Reload trust certs
Reloaded 128 trust certs

and then loads a lot of certs, like:

adding as trusted cert:
  Subject: CN=Hongkong Post Root CA 1, O=Hongkong Post, C=HK
  Issuer:  CN=Hongkong Post Root CA 1, O=Hongkong Post, C=HK
  Algorithm: RSA; Serial number: 0x3e8
  Valid from Thu May 15 05:13:14 UTC 2003 until Mon May 15 04:52:29 UTC 2023

adding as trusted cert:
  Subject: CN=SecureTrust CA, O=SecureTrust Corporation, C=US
  Issuer:  CN=SecureTrust CA, O=SecureTrust Corporation, C=US
  Algorithm: RSA; Serial number: 0xcf08e5c0816a5ad427ff0eb271859d0
  Valid from Tue Nov 07 19:31:18 UTC 2006 until Mon Dec 31 19:40:55 UTC 2029
. . . 

Immediately after it loads those certs, it reports:

trigger seeding of SecureRandom
done seeding SecureRandom

and then gets the 401 with:

org.apache.pulsar.broker.web.AuthenticationFilter - [10.244.0.9] Failed to authenticate HTTP request: Client unable to authenticate with TLS certificate

devinbost pushed a commit to devinbost/pulsar-helm-chart that referenced this issue Nov 20, 2020
…match superadmin names to ensure the principals are authorized.
@devinbost
Author

I tried changing all the certs to use CNs that match roles specified as superAdmin roles, but I can't get beyond the exception:

20:26:40.155 [AsyncHttpClient-64-1] WARN  org.apache.pulsar.client.admin.internal.BaseResource - [http://pulsar-ci-broker-0.pulsar-ci-broker.pulsar.svc.cluster.local:8080/admin/v2/persistent/public/functions/assignments] Failed to perform http put request: javax.ws.rs.NotAuthorizedException: HTTP 401 Unauthorized
20:26:40.173 [main] ERROR org.apache.pulsar.functions.worker.WorkerService - Error Starting up in worker
org.apache.pulsar.client.admin.PulsarAdminException$NotAuthorizedException: HTTP 401 Unauthorized

@devinbost
Author

Here's a complete broker log.
brokerlogs.txt

@gubespam

@devinbost Did you ever find a solution for this? I am running into the same problem you described in apache/pulsar#8536. It seems to me to be related to how the function worker connects to the broker, and not to the helm chart itself.

@devinbost
Author

@gubespam I ended up putting this on the shelf to work on higher priority items, but I suspect it's a configuration issue.

@vitosans

vitosans commented Oct 1, 2021

I did something like this:

webServiceUrl: "https://{{ template "pulsar.fullname" . }}-{{ .Values.proxy.component }}:{{ .Values.proxy.ports.https }}/"
brokerServiceUrl: "pulsar+ssl://{{ template "pulsar.fullname" . }}-{{ .Values.proxy.component }}:{{ .Values.proxy.ports.pulsarssl }}/"
tlsEnabled: "true"
brokerClientTlsEnabled: "true"
brokerClientTrustCertsFilePath: "/pulsar/certs/ca/ca.crt"
useTls: true
tlsCertificateFilePath: "/pulsar/certs/broker/tls.crt"
tlsKeyFilePath: "/pulsar/certs/broker/tls.key"
tlsTrustCertsFilePath: "/pulsar/certs/ca/ca.crt"
tlsAllowInsecureConnection: false
tlsEnableHostnameVerification: false
tlsCertRefreshCheckDurationSec: 300

In broker-configmap.yaml

The broker is now able to start up when functions are enabled. The remaining problem is that when you deploy a function, the function worker that gets spawned has a default functions_worker.yml rather than the one generated by bin/gen-yml-from-env.py conf/functions_worker.yml in the StatefulSet.

So of course it now gets:

HTTP 401 Unauthorized
Reason: HTTP 401 Unauthorized

because it is trying to post to http://localhost:8080, which of course is wrong :)

I'm currently debugging this, and then plan to make a giant PR that enables mTLS.

@hyperevo

hyperevo commented Jan 10, 2023

I had a similar error. The cause was that the tokens generated by the scripts/pulsar/prepare_helm_release.sh script and stored in Kubernetes secrets were asymmetric when they should have been symmetric. This happened because I changed values.yaml to symmetric and redeployed; redeploying does not overwrite secrets that already exist. I fixed it by manually deleting all of the Kubernetes secrets, re-running the prepare script, and reinstalling the helm chart. After that, everything worked properly.
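
A quick way to check which kind of key signed a token that is already stored in a secret is to look at the alg field of its JWT header. The sketch below assumes the token is a standard compact JWT (three dot-separated base64url parts); the class JwtAlgCheck and its methods are illustrative names, not part of Pulsar or the helm chart. HS* algorithms are HMAC-based (symmetric), while RS*, ES*, and PS* use a key pair.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JwtAlgCheck {

    // Matches the "alg" member of the JSON header without a JSON library.
    private static final Pattern ALG = Pattern.compile("\"alg\"\\s*:\\s*\"([^\"]+)\"");

    /** Decodes the first JWT segment and returns the value of its "alg" field. */
    public static String algorithmOf(String jwt) {
        int dot = jwt.indexOf('.');
        if (dot <= 0) {
            throw new IllegalArgumentException("not a compact JWT");
        }
        byte[] header = Base64.getUrlDecoder().decode(jwt.substring(0, dot));
        Matcher m = ALG.matcher(new String(header, StandardCharsets.UTF_8));
        if (!m.find()) {
            throw new IllegalArgumentException("no alg field in JWT header");
        }
        return m.group(1);
    }

    /** True for HMAC-based algorithms such as HS256, which use a shared secret. */
    public static boolean isSymmetric(String jwt) {
        return algorithmOf(jwt).startsWith("HS");
    }

    public static void main(String[] args) {
        // Sample token with header {"alg":"HS256","typ":"JWT"}; the payload
        // and signature parts are placeholders and are never verified here.
        String sample = "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.e30.signature";
        System.out.println(algorithmOf(sample)); // HS256
        System.out.println(isSymmetric(sample)); // true
    }
}
```

Passing the token value read from the existing secret shows whether it still carries the old algorithm after values.yaml was changed.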

rdhabalia pushed a commit to rdhabalia/pulsar-helm-chart that referenced this issue Feb 2, 2023
@lhotari
Member

lhotari commented Feb 15, 2024

I believe that #435 addresses this issue. It was released in version 3.2.0 of the chart.

@lhotari lhotari closed this as completed Feb 15, 2024