Question about how userTokenSOPS authenticate EKS clusters #5

Open
lixmgl opened this issue Nov 2, 2022 · 16 comments

@lixmgl
Contributor

lixmgl commented Nov 2, 2022

We are getting an Unauthorized error when submitting an app to our EKS cluster at this step:
https://github.com/apple/batch-processing-gateway/blob/main/docs/GETTING_STARTED.md#submit-a-spark-app

Error message is:

{"code":500,"message":"io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: {masterurl}/apis/sparkoperator.k8s.io/v1beta2/namespaces/spark-applications/sparkapplications. Message: Unauthorized! Token may have expired! Please log-in again. Unauthorized."}

userTokenSOPS is generated by https://github.com/apple/batch-processing-gateway/blob/main/dev-setup/generate-bpg-config.sh#L17
This token comes from the Spark service account secret. Is any additional cluster-level setup needed to make this authentication work?

Thanks.

@lixmgl
Contributor Author

lixmgl commented Nov 3, 2022

Hi @tongtianqi777 @yuchaoran2011,

Any idea why userTokenSOPS doesn't work?

@tongtianqi777
Collaborator

Hi @lixmgl, just to confirm: this is a remote AWS EKS cluster you are submitting the Spark app to, correct? Can you confirm that both the service account and its secret exist? The userTokenSOPS needs to come from the secret. Be aware that the guide uses a particular K8s version (1.21.14) that automatically creates secrets for service accounts. If you are using a newer K8s version, you may need to create the secret manually.

Some more details: https://itnext.io/big-change-in-k8s-1-24-about-serviceaccounts-and-their-secrets-4b909a4af4e0

@lixmgl
Contributor Author

lixmgl commented Nov 3, 2022

Hi @tongtianqi777,
Yes, we are submitting the app to a remote AWS EKS cluster.
Yes, both the service account and the secret token exist.
Our EKS cluster is running K8s version 1.21.

@vara-bonthu

I think the issue is with the version of fabric8io/kubernetes-client used by this tool. It uses fabric8io/kubernetes-client 4.13.3, which doesn't support the token refresh feature.

As far as I can see, token refresh was added in later versions of fabric8io/kubernetes-client (>5.12.x).

I will test and raise a PR with the latest version.

WARNING
Please note that the above change will refresh the token every minute, which results in too many calls to AWS sts:GetCallerIdentity. Ideally, the solution should use the ExpirationTimestamp provided in the Kubernetes client instead of the hardcoded 1-minute refresh. This means we also need to raise another PR against fabric8io/kubernetes-client to improve this.
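
To illustrate the idea of refreshing based on the expiration timestamp rather than a fixed interval, here is a minimal sketch in plain Java. It is not fabric8io/kubernetes-client code: the class `TokenRefreshPolicy`, its methods, and the safety margin are hypothetical names chosen for illustration, and the caller is assumed to already know the token's expiration time.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper (not part of fabric8): decides when to refresh a token
// from its expiration timestamp instead of a hardcoded interval.
final class TokenRefreshPolicy {
    private final Duration safetyMargin;

    TokenRefreshPolicy(Duration safetyMargin) {
        this.safetyMargin = safetyMargin;
    }

    /** True once the current time is within the safety margin of the expiration. */
    boolean shouldRefresh(Instant expiration, Instant now) {
        return !now.isBefore(expiration.minus(safetyMargin));
    }

    /** Time to wait before the next refresh; zero if a refresh is already due. */
    Duration delayUntilRefresh(Instant expiration, Instant now) {
        Duration delay = Duration.between(now, expiration.minus(safetyMargin));
        return delay.isNegative() ? Duration.ZERO : delay;
    }
}
```

With a 1-minute margin and a token that expires 15 minutes from now (a lifetime assumed here only for the example), `delayUntilRefresh` returns 14 minutes, so the token would be fetched roughly once per lifetime instead of once per minute.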

@lixmgl
Contributor Author

lixmgl commented Nov 3, 2022

Thanks for the context, Vara.

We still want to figure out why this token works for Apple, because regardless of expiration, it doesn't work for our EKS cluster.

The userTokenSOPS comes from https://github.com/apple/batch-processing-gateway/blob/main/dev-setup/generate-bpg-config.sh#L17

@hiboyang
Contributor

hiboyang commented Nov 3, 2022

Just a basic question: do you split the string across multiple lines when setting userTokenSOPS in the gateway config file?

I remember that when I copy/pasted the token string into the config YAML file, the IntelliJ IDE automatically split it into multiple lines, and I had to manually join them back into a single very long line.
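
A quick way to rule this out is to normalize the pasted value and check its shape. The sketch below is plain Java and assumes the value is the decoded service account token, which is a JWT (three dot-separated segments); `TokenNormalizer` and its methods are hypothetical names, not part of BPG.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of BPG): repairs a token that an editor
// wrapped across several lines and checks that it has the shape of a JWT.
final class TokenNormalizer {
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    /** Removes all whitespace, including line breaks inserted by an editor. */
    static String normalize(String rawToken) {
        return WHITESPACE.matcher(rawToken).replaceAll("");
    }

    /** A JWT has exactly three non-empty, dot-separated segments. */
    static boolean looksLikeJwt(String token) {
        String[] parts = token.split("\\.", -1);
        if (parts.length != 3) {
            return false;
        }
        for (String part : parts) {
            if (part.isEmpty()) {
                return false;
            }
        }
        return true;
    }
}
```

If `looksLikeJwt(normalize(token))` is false, the value in the config file is likely truncated or otherwise not the token that was read from the secret.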

@lixmgl
Contributor Author

lixmgl commented Nov 3, 2022

Thanks for checking.
I used a single-line string, since it is automatically generated by generate-bpg-config.sh.

@hiboyang
Contributor

hiboyang commented Nov 3, 2022

Got it, it is a different issue then.

@yuchaoran2011
Collaborator

@lixmgl Could it be that your remote Spark EKS cluster is missing some security group configuration, which prevents your BPG instance from talking to it?

@lixmgl
Contributor Author

lixmgl commented Nov 3, 2022

@yuchaoran2011 What kind of security groups did you set for the EKS cluster? We use a standard EKS cluster.
Here is my .kube/config file (I replaced the actual values with placeholders):

    cat ~/.kube/config
    apiVersion: v1
    clusters:
    - cluster:
        certificate-authority-data: {certData}
        server: {master url}
      name: {cluster name}
    contexts:
    - context:
        cluster: {cluster name}
        user: {cluster name}
      name: {cluster name}
    current-context: {cluster name}
    kind: Config
    preferences: {}
    users:
    - name: {cluster name}
      user:
        exec:
          apiVersion: client.authentication.k8s.io/v1beta1
          args:
          - --region
          - us-east-1
          - eks
          - get-token
          - --cluster-name
          - {cluster name}
          - --role
          - {iam role}
          command: aws

@yuchaoran2011
Collaborator

@lixmgl I didn't mean your local kubectl config. I was thinking about the remote EKS cluster that you are submitting applications to. The Security Groups section of your EC2 Management Console should show the list of security groups currently configured. You'll want an inbound rule that lets the nodes in the cluster accept HTTPS traffic over TCP on port 443 from anywhere (in production, inbound connections would only be allowed from the EKS cluster that runs BPG).

@lixmgl
Contributor Author

lixmgl commented Nov 4, 2022

@yuchaoran2011 I see.
The security group for the remote EKS cluster does have an inbound rule that accepts all traffic, on all protocols and all port ranges.
Also, the temporary token generated by aws eks get-token works for our cluster.
I'm not sure why the long-lived secret token doesn't work.
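
One way to inspect the long-lived token is to decode its payload and look at the claims. The sketch below is plain Java and assumes the token is a service account JWT; it does not apply to the token from aws eks get-token, and it does not verify the signature. `JwtPayloadInspector` is a hypothetical name used only for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical debugging helper: prints what a service account token claims
// to be. It only decodes the payload and does NOT verify the signature.
final class JwtPayloadInspector {

    /** Returns the decoded JSON payload (second segment) of a JWT. */
    static String decodePayload(String jwt) {
        String[] parts = jwt.trim().split("\\.");
        if (parts.length != 3) {
            throw new IllegalArgumentException(
                "Expected 3 dot-separated segments, got " + parts.length);
        }
        // JWT segments are base64url-encoded, usually without padding.
        byte[] json = Base64.getUrlDecoder().decode(parts[1]);
        return new String(json, StandardCharsets.UTF_8);
    }
}
```

In the decoded JSON, the `sub` claim should read `system:serviceaccount:<namespace>:<name>` for the expected service account, and an `exp` claim, if present, shows whether the token is in fact time-limited.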

@hiboyang
Contributor

hiboyang commented Nov 4, 2022

@lixmgl, another option is to write some simple Java code that calls your EKS API server and see what happens. You could follow the code example here:

  protected Pod getPod(String podName, AppConfig.SparkCluster sparkCluster) {
    com.codahale.metrics.Timer timer =
        registry.timer(this.getClass().getSimpleName() + ".getPod.k8s-time");
    try (DefaultKubernetesClient client = KubernetesHelper.getK8sClient(sparkCluster);
        com.codahale.metrics.Timer.Context context = timer.time()) {
      Pod pod =
          client
              .pods()
              .inNamespace(sparkCluster.getSparkApplicationNamespace())
              .withName(podName)
              .get();
      context.stop();
      return pod;
    }
  }
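
The snippet above depends on BPG's own helpers. As a dependency-free alternative, here is a minimal sketch that uses only the JDK's java.net.http client (Java 11+) to call the API server with the token as a bearer credential. `ApiServerProbe` and its methods are hypothetical names; the namespace is whatever your Spark applications use (spark-applications in the error message above), and the JVM is assumed to already trust the cluster CA from certificate-authority-data.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical probe: lists pods in one namespace using a bearer token,
// so the raw HTTP status from the API server can be observed directly.
final class ApiServerProbe {

    /** Builds a GET request for the pods of a namespace, authenticated with the token. */
    static HttpRequest buildListPodsRequest(String masterUrl, String namespace, String token) {
        String base = masterUrl.endsWith("/")
            ? masterUrl.substring(0, masterUrl.length() - 1)
            : masterUrl;
        return HttpRequest.newBuilder(URI.create(base + "/api/v1/namespaces/" + namespace + "/pods"))
            .header("Authorization", "Bearer " + token)
            .header("Accept", "application/json")
            .GET()
            .build();
    }

    /** Sends the request and returns the HTTP status code. */
    static int probe(String masterUrl, String namespace, String token)
            throws IOException, InterruptedException {
        // Assumption: the cluster CA is in the JVM truststore; otherwise the TLS handshake fails.
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(buildListPodsRequest(masterUrl, namespace, token),
                  HttpResponse.BodyHandlers.ofString());
        return response.statusCode();
    }
}
```

A 401 reproduces the Unauthorized error and points at the token itself, while a 403 would mean the token was accepted but the service account lacks permission.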

@hiboyang
Contributor

hiboyang commented Nov 4, 2022

Also, could you check whether the service account is set up with the proper role and rolebinding? See the discussion here.

@lixmgl
Contributor Author

lixmgl commented Nov 6, 2022

@hiboyang Thanks for the suggestion! I will investigate more.

@Claudiazhaoya

This issue is resolved by upgrading the fabric8 Kubernetes client library to the latest version. Please confirm, @lixmgl. Thanks!
