
[aws-eks] EKS fails to deploy with CancelRequest not implemented by *exec.roundTripper\nerror: unable to recognize "/tmp/manifest.yaml" #4087

Closed
lkoniecz opened this issue Sep 16, 2019 · 19 comments · Fixed by #5540
Labels
@aws-cdk/aws-eks Related to Amazon Elastic Kubernetes Service bug This issue is a bug. language/python Related to Python bindings p1

Comments

@lkoniecz

❓ General Issue

The Question

eks_cluster = aws_eks.Cluster(
    scope=self,
    id=cluster_name,
    cluster_name=cluster_name,
    default_capacity=0,
    masters_role=cluster_masters_role,
    version='1.12',
    vpc=vpc
)

with open('assets/tiller-service-account.yml', 'r') as file:
    eks_cluster.add_resource('TillerServiceAccount', yaml.safe_load(file))

with open('assets/tiller-cluster-role-binding.yml', 'r') as file:
    eks_cluster.add_resource('TillerClusterRoleBinding', yaml.safe_load(file))

with open('assets/tiller-deployment.yml', 'r') as file:
    eks_cluster.add_resource('TillerDeployment', yaml.safe_load(file))

with open('assets/tiller-service.yml', 'r') as file:
    eks_cluster.add_resource('TillerService', yaml.safe_load(file))

The manifests are Tiller deployment files. They work when deployed manually. I can attach the files if needed.

Deployment fails with:

 36/41 | 09:24:23 | CREATE_FAILED        | Custom::AWSCDK-EKS-KubernetesResource | DevEksCluster/manifest-TillerDeployment/Resource/Default (DevEksClustermanifestTillerDeployment81AA2785) Failed to create resource. b'E0916 07:23:46.832354      13 round_trippers.go:174] CancelRequest not implemented by *exec.roundTripper\nerror: unable to recognize "/tmp/manifest.yaml": Get https://3E59AF23B9E843E291570C60FFFDAF2C.gr7.us-east-1.eks.amazonaws.com/api?timeout=32s: dial tcp 52.2.151.216:443: i/o timeout\n'
	new CustomResource (/tmp/jsii-kernel-otb6xc/node_modules/@aws-cdk/aws-cloudformation/lib/custom-resource.js:32:25)
	\_ new KubernetesResource (/tmp/jsii-kernel-otb6xc/node_modules/@aws-cdk/aws-eks/lib/k8s-resource.js:22:9)
	\_ Cluster.addResource (/tmp/jsii-kernel-otb6xc/node_modules/@aws-cdk/aws-eks/lib/cluster.js:215:16)
	\_ _wrapSandboxCode (/home/lky/Repositories/HuuugeStarsCfConfig/contrib/cloudformation/cdk/eks/cluster/.env/lib/python3.7/site-packages/jsii/_embedded/jsii/jsii-runtime.js:6498:51)
	\_ Kernel._wrapSandboxCode (/home/lky/Repositories/HuuugeStarsCfConfig/contrib/cloudformation/cdk/eks/cluster/.env/lib/python3.7/site-packages/jsii/_embedded/jsii/jsii-runtime.js:7131:20)
	\_ ret._ensureSync (/home/lky/Repositories/HuuugeStarsCfConfig/contrib/cloudformation/cdk/eks/cluster/.env/lib/python3.7/site-packages/jsii/_embedded/jsii/jsii-runtime.js:6498:25)
	\_ Kernel._ensureSync (/home/lky/Repositories/HuuugeStarsCfConfig/contrib/cloudformation/cdk/eks/cluster/.env/lib/python3.7/site-packages/jsii/_embedded/jsii/jsii-runtime.js:7102:20)

Rest of the stack trace omitted.

Environment

  • CDK CLI Version: 1.4.0 (build 175471f)
  • Module Version: 1.4
  • OS: all
  • Language: python

Other information

@lkoniecz lkoniecz added the needs-triage This issue or PR still needs to be triaged. label Sep 16, 2019
@SomayaB SomayaB added @aws-cdk/aws-eks Related to Amazon Elastic Kubernetes Service bug This issue is a bug. needs-reproduction This issue needs reproduction. labels Sep 16, 2019
@SomayaB SomayaB added the language/python Related to Python bindings label Sep 16, 2019
@SomayaB SomayaB removed the needs-reproduction This issue needs reproduction. label Sep 16, 2019
@lkoniecz
Author

Looks like some race condition. The stack sometimes deploys successfully; more often it fails on random resources.

 34/40 | 12:43:52 | CREATE_IN_PROGRESS   | Custom::AWSCDK-EKS-KubernetesResource | DevEksCluster/manifest-TillerServiceAccount/Resource/Default (DevEksClustermanifestTillerServiceAccountF8370BEF) Resource creation Initiated
 34/40 | 12:43:52 | CREATE_IN_PROGRESS   | Custom::AWSCDK-EKS-KubernetesResource | DevEksCluster/manifest-TillerDeployment/Resource/Default (DevEksClustermanifestTillerDeployment81AA2785) Resource creation Initiated
 35/40 | 12:43:52 | CREATE_FAILED        | Custom::AWSCDK-EKS-KubernetesResource | DevEksCluster/manifest-TillerServiceAccount/Resource/Default (DevEksClustermanifestTillerServiceAccountF8370BEF) Failed to create resource. b'E0918 10:43:16.159351      14 round_trippers.go:174] CancelRequest not implemented by *exec.roundTripper\nerror: unable to recognize "/tmp/manifest.yaml": Get https://B4A809FC4A5D31F7A9577E53CC42C0D7.gr7.us-east-1.eks.amazonaws.com/api?timeout=32s: dial tcp 3.210.93.152:443: i/o timeout\n'
	new CustomResource (/tmp/jsii-kernel-0sUqCd/node_modules/@aws-cdk/aws-cloudformation/lib/custom-resource.js:32:25)
	\_ new KubernetesResource (/tmp/jsii-kernel-0sUqCd/node_modules/@aws-cdk/aws-eks/lib/k8s-resource.js:22:9)
	\_ Cluster.addResource (/tmp/jsii-kernel-0sUqCd/node_modules/@aws-cdk/aws-eks/lib/cluster.js:215:16)
	\_ _wrapSandboxCode (/home/lky/Repositories/HuuugeStarsCfConfig/contrib/cloudformation/cdk/eks/cluster/.env/lib/python3.7/site-packages/jsii/_embedded/jsii/jsii-runtime.js:6498:51)
	\_ Kernel._wrapSandboxCode (/home/lky/Repositories/HuuugeStarsCfConfig/contrib/cloudformation/cdk/eks/cluster/.env/lib/python3.7/site-packages/jsii/_embedded/jsii/jsii-runtime.js:7131:20)
	\_ ret._ensureSync (/home/lky/Repositories/HuuugeStarsCfConfig/contrib/cloudformation/cdk/eks/cluster/.env/lib/python3.7/site-packages/jsii/_embedded/jsii/jsii-runtime.js:6498:25)
	\_ Kernel._ensureSync (/home/lky/Repositories/HuuugeStarsCfConfig/contrib/cloudformation/cdk/eks/cluster/.env/lib/python3.7/site-packages/jsii/_embedded/jsii/jsii-runtime.js:7102:20)
	\_ Kernel.invoke (/home/lky/Repositories/HuuugeStarsCfConfig/contrib/cloudformation/cdk/eks/cluster/.env/lib/python3.7/site-packages/jsii/_embedded/jsii/jsii-runtime.js:6497:26)

@lkoniecz
Author

This time it failed with no resources added manually. I guess the AwsAuth resource comes from the AWS IAM mapping feature.

 50/53 | 13:49:34 | CREATE_IN_PROGRESS   | Custom::AWSCDK-EKS-KubernetesResource | DevEksCluster/AwsAuth/manifest/Resource/Default (DevEksClusterAwsAuthmanifest25FB57E0) Resource creation Initiated
 51/53 | 13:49:34 | CREATE_FAILED        | Custom::AWSCDK-EKS-KubernetesResource | DevEksCluster/AwsAuth/manifest/Resource/Default (DevEksClusterAwsAuthmanifest25FB57E0) Failed to create resource. b'E0918 11:48:57.714860      13 round_trippers.go:174] CancelRequest not implemented by *exec.roundTripper\nerror: unable to recognize "/tmp/manifest.yaml": Get https://962A6CC83950262D83626C48486A6719.gr7.us-east-1.eks.amazonaws.com/api?timeout=32s: dial tcp 34.231.238.152:443: i/o timeout\n'
	new CustomResource (/tmp/jsii-kernel-sd5H2r/node_modules/@aws-cdk/aws-cloudformation/lib/custom-resource.js:32:25)

@stefanolczak

stefanolczak commented Sep 18, 2019

It looks like kubectl, launched from the Lambda, is suffering from timeouts when calling the EKS API server. I suggest implementing some kind of retry mechanism at this point (or just passing the right arguments to kubectl) to make sure kubectl doesn't fail due to an occasional timeout.

kubectl('apply', manifest_file)
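
A minimal sketch of what such a retry loop could look like in the handler's Python (the function name, kubeconfig path, and retry counts are hypothetical, not the actual handler code):

import subprocess
import time

def apply_with_retries(manifest_file, attempts=3, delay=30):
    # Hypothetical wrapper: retry `kubectl apply` a few times so a transient
    # i/o timeout against the EKS API server doesn't fail the whole resource.
    for attempt in range(1, attempts + 1):
        try:
            subprocess.check_call([
                'kubectl', 'apply',
                '--kubeconfig', '/tmp/kubeconfig',
                '-f', manifest_file,
            ])
            return
        except subprocess.CalledProcessError:
            if attempt == attempts:
                raise
            time.sleep(delay)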

@tjbaker

tjbaker commented Sep 19, 2019

I found that the Lambda has a 13-minute timeout. If EKS is having a bad day, this fails and everything rolls back. This week, as I was playing with CDK and EKS, the EKS cluster would OFTEN take longer than 13 minutes to become active.

# wait for the cluster to become active (13min timeout)
logger.info('waiting for cluster to become active...')
waiter = eks.get_waiter('cluster_active')
waiter.wait(name=cluster_name, WaiterConfig={
    'Delay': 30,
    'MaxAttempts': 26
})
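
For reference, that waiter configuration is where the 13-minute figure comes from, and it leaves only a roughly 2-minute margin under Lambda's 15-minute hard limit:

delay_seconds = 30
max_attempts = 26
waiter_timeout = delay_seconds * max_attempts  # 780 seconds = 13 minutes
lambda_hard_limit = 15 * 60                    # 900 seconds
print(waiter_timeout, lambda_hard_limit - waiter_timeout)  # 780 120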

How do you address this since a lambda can only run for 15m max? Also, why are we paying metered lambda costs for a waiter routine?

@runlevel-six

runlevel-six commented Sep 23, 2019

I can verify that whenever I deploy a CDK solution for EKS that I've been working on in us-east-1, I more often than not receive the same error on the AwsAuth resource (no custom resources are being added). Every other region I've tried (us-west-2, us-east-2, and eu-west-1) works fine with no errors. I've tested it multiple times in each region.

When it fails, I get the following error:

57/60 | 11:31:49 AM | CREATE_IN_PROGRESS   | Custom::AWSCDK-EKS-KubernetesResource | k8s-sandbox-1/AwsAuth/manifest/Resource/Default (k8ssandbox1AwsAuthmanifest8F203DE4) Resource creation Initiated
58/60 | 11:31:50 AM | CREATE_FAILED        | Custom::AWSCDK-EKS-KubernetesResource | k8s-sandbox-1/AwsAuth/manifest/Resource/Default (k8ssandbox1AwsAuthmanifest8F203DE4) Failed to create resource. b'E0923 16:31:13.504110      14 round_trippers.go:174] CancelRequest not implemented by *exec.roundTripper\nerror: unable to recognize "/tmp/manifest.yaml": Get https://B7302555A3113CB74E42C83A93928DB0.gr7.us-east-1.eks.amazonaws.com/api?timeout=32s: dial tcp 52.71.64.29:443: i/o timeout\n'
	new CustomResource (/Users/jsohl/code/adobe/k8s-platform/node_modules/@aws-cdk/aws-cloudformation/lib/custom-resource.ts:92:21)
	\_ new KubernetesResource (/Users/jsohl/code/adobe/k8s-platform/node_modules/@aws-cdk/aws-eks/lib/k8s-resource.ts:62:5)
	\_ new AwsAuth (/Users/jsohl/code/adobe/k8s-platform/node_modules/@aws-cdk/aws-eks/lib/aws-auth.ts:32:5)
	\_ Cluster.get awsAuth [as awsAuth] (/Users/jsohl/code/adobe/k8s-platform/node_modules/@aws-cdk/aws-eks/lib/cluster.ts:563:23)
	\_ new Cluster (/Users/jsohl/code/adobe/k8s-platform/node_modules/@aws-cdk/aws-eks/lib/cluster.ts:412:12)
	\_ new K8SPlatformStack (/Users/jsohl/code/adobe/k8s-platform/lib/k8s-platform-stack.ts:150:27)
	\_ Object.<anonymous> (/Users/jsohl/code/adobe/k8s-platform/bin/k8s-platform.ts:38:1)
	\_ Module._compile (internal/modules/cjs/loader.js:936:30)
	\_ Module.m._compile (/Users/jsohl/code/adobe/k8s-platform/node_modules/ts-node/src/index.ts:493:23)
	\_ Module._extensions..js (internal/modules/cjs/loader.js:947:10)
	\_ Object.require.extensions.<computed> [as .ts] (/Users/jsohl/code/adobe/k8s-platform/node_modules/ts-node/src/index.ts:496:12)
	\_ Module.load (internal/modules/cjs/loader.js:790:32)
	\_ Function.Module._load (internal/modules/cjs/loader.js:703:12)
	\_ Function.Module.runMain (internal/modules/cjs/loader.js:999:10)
	\_ Object.<anonymous> (/Users/jsohl/code/adobe/k8s-platform/node_modules/ts-node/src/bin.ts:158:12)
	\_ Module._compile (internal/modules/cjs/loader.js:936:30)
	\_ Object.Module._extensions..js (internal/modules/cjs/loader.js:947:10)
	\_ Module.load (internal/modules/cjs/loader.js:790:32)
	\_ Function.Module._load (internal/modules/cjs/loader.js:703:12)
	\_ Function.Module.runMain (internal/modules/cjs/loader.js:999:10)
	\_ /usr/local/lib/node_modules/npm/node_modules/libnpx/index.js:268:14

I'd be glad to provide further details. My code is written in TypeScript.

@eladb
Contributor

eladb commented Sep 25, 2019

Thanks @runlevel-six. We need to modify our resource to allow a much longer wait time.

@runlevel-six

@eladb:

Providing some more detail: this failure consistently occurs only after the cluster itself has been created. Here is a broader view of the activity:

k8s-cluster-playground-1: deploying...
k8s-cluster-playground-1: creating CloudFormation changeset...
  0/12 | 4:09:16 PM | CREATE_IN_PROGRESS   | AWS::CloudFormation::Stack            | kubectl-layer-8C2542BC-BF2B-4DFE-B765-E181FD30A9A0 (kubectllayer8C2542BCBF2B4DFEB765E181FD30A9A0617C4ADA)
  0/12 | 4:09:16 PM | CREATE_IN_PROGRESS   | AWS::EC2::SecurityGroup               | playground-1/ControlPlaneSecurityGroup (playground1ControlPlaneSecurityGroup9F2E8BE6)
  0/12 | 4:09:16 PM | CREATE_IN_PROGRESS   | AWS::IAM::Role                        | playground-1-us-east-1-cluster-admin-role (playground1useast1clusteradminrole64CFE07D)
  0/12 | 4:09:16 PM | CREATE_IN_PROGRESS   | AWS::IAM::Role                        | playground-1/Resource/ResourceHandler/ServiceRole (playground1ResourceHandlerServiceRole5DE06889)
  0/12 | 4:09:16 PM | CREATE_IN_PROGRESS   | AWS::IAM::Role                        | playground-1/ClusterRole (playground1ClusterRole09E5E62F)
  0/12 | 4:09:16 PM | CREATE_IN_PROGRESS   | AWS::CDK::Metadata                    | CDKMetadata
  0/12 | 4:09:16 PM | CREATE_IN_PROGRESS   | AWS::IAM::Role                        | playground-1-us-east-1-cluster-admin-role (playground1useast1clusteradminrole64CFE07D) Resource creation Initiated
  0/12 | 4:09:16 PM | CREATE_IN_PROGRESS   | AWS::CloudFormation::Stack            | kubectl-layer-8C2542BC-BF2B-4DFE-B765-E181FD30A9A0 (kubectllayer8C2542BCBF2B4DFEB765E181FD30A9A0617C4ADA) Resource creation Initiated
  0/12 | 4:09:17 PM | CREATE_IN_PROGRESS   | AWS::IAM::Role                        | playground-1/ClusterRole (playground1ClusterRole09E5E62F) Resource creation Initiated
  0/12 | 4:09:17 PM | CREATE_IN_PROGRESS   | AWS::IAM::Role                        | playground-1/Resource/ResourceHandler/ServiceRole (playground1ResourceHandlerServiceRole5DE06889) Resource creation Initiated
  0/12 | 4:09:18 PM | CREATE_IN_PROGRESS   | AWS::CDK::Metadata                    | CDKMetadata Resource creation Initiated
  1/12 | 4:09:18 PM | CREATE_COMPLETE      | AWS::CDK::Metadata                    | CDKMetadata
  1/12 | 4:09:21 PM | CREATE_IN_PROGRESS   | AWS::EC2::SecurityGroup               | playground-1/ControlPlaneSecurityGroup (playground1ControlPlaneSecurityGroup9F2E8BE6) Resource creation Initiated
  2/12 | 4:09:22 PM | CREATE_COMPLETE      | AWS::EC2::SecurityGroup               | playground-1/ControlPlaneSecurityGroup (playground1ControlPlaneSecurityGroup9F2E8BE6)
  3/12 | 4:09:34 PM | CREATE_COMPLETE      | AWS::IAM::Role                        | playground-1-us-east-1-cluster-admin-role (playground1useast1clusteradminrole64CFE07D)
  4/12 | 4:09:35 PM | CREATE_COMPLETE      | AWS::IAM::Role                        | playground-1/Resource/ResourceHandler/ServiceRole (playground1ResourceHandlerServiceRole5DE06889)
  5/12 | 4:09:35 PM | CREATE_COMPLETE      | AWS::IAM::Role                        | playground-1/ClusterRole (playground1ClusterRole09E5E62F)
  5/12 | 4:09:37 PM | CREATE_IN_PROGRESS   | AWS::IAM::Policy                      | playground-1/Resource/ResourceHandler/ServiceRole/DefaultPolicy (playground1ResourceHandlerServiceRoleDefaultPolicyF9F64556)
  5/12 | 4:09:38 PM | CREATE_IN_PROGRESS   | AWS::IAM::Policy                      | playground-1/Resource/ResourceHandler/ServiceRole/DefaultPolicy (playground1ResourceHandlerServiceRoleDefaultPolicyF9F64556) Resource creation Initiated
  6/12 | 4:09:46 PM | CREATE_COMPLETE      | AWS::IAM::Policy                      | playground-1/Resource/ResourceHandler/ServiceRole/DefaultPolicy (playground1ResourceHandlerServiceRoleDefaultPolicyF9F64556)
  7/12 | 4:09:51 PM | CREATE_COMPLETE      | AWS::CloudFormation::Stack            | kubectl-layer-8C2542BC-BF2B-4DFE-B765-E181FD30A9A0 (kubectllayer8C2542BCBF2B4DFEB765E181FD30A9A0617C4ADA)
  7/12 | 4:09:53 PM | CREATE_IN_PROGRESS   | AWS::Lambda::Function                 | playground-1/Resource/ResourceHandler (playground1ResourceHandlerFCA27D23)
  7/12 | 4:10:00 PM | CREATE_IN_PROGRESS   | AWS::Lambda::Function                 | playground-1/Resource/ResourceHandler (playground1ResourceHandlerFCA27D23) Resource creation Initiated
  8/12 | 4:10:00 PM | CREATE_COMPLETE      | AWS::Lambda::Function                 | playground-1/Resource/ResourceHandler (playground1ResourceHandlerFCA27D23)
  8/12 | 4:10:04 PM | CREATE_IN_PROGRESS   | Custom::AWSCDK-EKS-Cluster            | playground-1/Resource/Resource/Default (playground13674B29B)
 8/12 Currently in progress: playground13674B29B
  8/12 | 4:18:51 PM | CREATE_IN_PROGRESS   | Custom::AWSCDK-EKS-Cluster            | playground-1/Resource/Resource/Default (playground13674B29B) Resource creation Initiated
  9/12 | 4:18:52 PM | CREATE_COMPLETE      | Custom::AWSCDK-EKS-Cluster            | playground-1/Resource/Resource/Default (playground13674B29B)
  9/12 | 4:18:54 PM | CREATE_IN_PROGRESS   | AWS::Lambda::Function                 | playground-1/KubernetesResourceHandler (playground1KubernetesResourceHandler520AFAB4)
  9/12 | 4:19:00 PM | CREATE_IN_PROGRESS   | AWS::Lambda::Function                 | playground-1/KubernetesResourceHandler (playground1KubernetesResourceHandler520AFAB4) Resource creation Initiated
 10/12 | 4:19:00 PM | CREATE_COMPLETE      | AWS::Lambda::Function                 | playground-1/KubernetesResourceHandler (playground1KubernetesResourceHandler520AFAB4)
 10/12 | 4:19:03 PM | CREATE_IN_PROGRESS   | Custom::AWSCDK-EKS-KubernetesResource | playground-1/AwsAuth/manifest/Resource/Default (playground1AwsAuthmanifestE4865195)
10/12 Currently in progress: playground1AwsAuthmanifestE4865195
 10/12 | 4:20:20 PM | CREATE_IN_PROGRESS   | Custom::AWSCDK-EKS-KubernetesResource | playground-1/AwsAuth/manifest/Resource/Default (playground1AwsAuthmanifestE4865195) Resource creation Initiated
 11/12 | 4:20:20 PM | CREATE_FAILED        | Custom::AWSCDK-EKS-KubernetesResource | playground-1/AwsAuth/manifest/Resource/Default (playground1AwsAuthmanifestE4865195) Failed to create resource. b'E0926 21:19:45.159704      14 round_trippers.go:174] CancelRequest not implemented by *exec.roundTripper\nerror: unable to recognize "/tmp/manifest.yaml": Get https://589777DBCFCB148C747DA4CB0E49B65D.gr7.us-east-1.eks.amazonaws.com/api?timeout=32s: dial tcp 3.231.36.92:443: i/o timeout\n'
	new CustomResource (/Users/jsohl/code/adobe/k8s-platform/node_modules/@aws-cdk/aws-cloudformation/lib/custom-resource.ts:92:21)
	\_ new KubernetesResource (/Users/jsohl/code/adobe/k8s-platform/node_modules/@aws-cdk/aws-eks/lib/k8s-resource.ts:62:5)
	\_ new AwsAuth (/Users/jsohl/code/adobe/k8s-platform/node_modules/@aws-cdk/aws-eks/lib/aws-auth.ts:32:5)
	\_ Cluster.get awsAuth [as awsAuth] (/Users/jsohl/code/adobe/k8s-platform/node_modules/@aws-cdk/aws-eks/lib/cluster.ts:563:23)
	\_ new Cluster (/Users/jsohl/code/adobe/k8s-platform/node_modules/@aws-cdk/aws-eks/lib/cluster.ts:412:12)
	\_ new K8SClusterStack (/Users/jsohl/code/adobe/k8s-platform/lib/k8s-platform-stack.ts:142:22)
	\_ Object.<anonymous> (/Users/jsohl/code/adobe/k8s-platform/bin/k8s-platform.ts:34:18)
	\_ Module._compile (internal/modules/cjs/loader.js:936:30)
	\_ Module.m._compile (/Users/jsohl/code/adobe/k8s-platform/node_modules/ts-node/src/index.ts:493:23)
	\_ Module._extensions..js (internal/modules/cjs/loader.js:947:10)
	\_ Object.require.extensions.<computed> [as .ts] (/Users/jsohl/code/adobe/k8s-platform/node_modules/ts-node/src/index.ts:496:12)
	\_ Module.load (internal/modules/cjs/loader.js:790:32)
	\_ Function.Module._load (internal/modules/cjs/loader.js:703:12)
	\_ Function.Module.runMain (internal/modules/cjs/loader.js:999:10)
	\_ Object.<anonymous> (/Users/jsohl/code/adobe/k8s-platform/node_modules/ts-node/src/bin.ts:158:12)
	\_ Module._compile (internal/modules/cjs/loader.js:936:30)
	\_ Object.Module._extensions..js (internal/modules/cjs/loader.js:947:10)
	\_ Module.load (internal/modules/cjs/loader.js:790:32)
	\_ Function.Module._load (internal/modules/cjs/loader.js:703:12)
	\_ Function.Module.runMain (internal/modules/cjs/loader.js:999:10)
	\_ /usr/local/lib/node_modules/npm/node_modules/libnpx/index.js:268:14
 11/12 | 4:20:21 PM | ROLLBACK_IN_PROGRESS | AWS::CloudFormation::Stack            | k8s-cluster-playground-1 The following resource(s) failed to create: [playground1AwsAuthmanifestE4865195]. . Rollback requested by user.

I'm not sure what is timing out, because the creation of Custom::AWSCDK-EKS-KubernetesResource starts only 1 minute and 17 seconds before it fails. That should not be due to a Lambda timeout or the waiter function, unless I am misunderstanding what is going on (even the Lambda function is only created a little over 10 minutes before the failure).

Is there anything else I can provide? On the last pass I stripped my code down to just the creation of the cluster admin role and the creation of the cluster; everything else was removed. It almost always fails in us-east-1, has failed once (after dozens of creates) in us-west-2, and has never failed in the other regions I tested (dozens of creates in us-east-2 and several in eu-west-1).

@cseickel

cseickel commented Oct 7, 2019

I am also having this issue, using us-east-1. Let me know if there is any information I can provide or tests I can perform to help with troubleshooting.

38/43 | 11:38:56 | CREATE_IN_PROGRESS   | Custom::AWSCDK-EKS-KubernetesResource | EKSCluster/AwsAuth/manifest/Resource/Default (EKSClusterAwsAuthmanifestA4E0796C) Resource creation Initiated
 39/43 | 11:38:56 | CREATE_FAILED        | Custom::AWSCDK-EKS-KubernetesResource | EKSCluster/AwsAuth/manifest/Resource/Default (EKSClusterAwsAuthmanifestA4E0796C) Failed to create resource. b'error: unable to recognize "/tmp/manifest.yaml": Get https://EB01F1F7F0334B54FFF12955B4154E66.gr7.us-east-1.eks.amazonaws.com/api?timeout=32s: dial tcp 34.198.248.74:443: i/o timeout\n'
        new CustomResource (C:\local\invest-apps\alfresco-aws-cdk\node_modules\@aws-cdk\aws-cloudformation\lib\custom-resource.ts:92:21)
        \_ new KubernetesResource (C:\local\invest-apps\alfresco-aws-cdk\node_modules\@aws-cdk\aws-eks\lib\k8s-resource.ts:62:5)
        \_ new AwsAuth (C:\local\invest-apps\alfresco-aws-cdk\node_modules\@aws-cdk\aws-eks\lib\aws-auth.ts:32:5)
        \_ Cluster.get awsAuth [as awsAuth] (C:\local\invest-apps\alfresco-aws-cdk\node_modules\@aws-cdk\aws-eks\lib\cluster.ts:563:23)
        \_ new Cluster (C:\local\invest-apps\alfresco-aws-cdk\node_modules\@aws-cdk\aws-eks\lib\cluster.ts:412:12)
        \_ new AlfrescoAppStack (C:\local\invest-apps\alfresco-aws-cdk\lib\alfresco-app-stack.ts:40:22)
        \_ Object.<anonymous> (C:\local\invest-apps\alfresco-aws-cdk\bin\alfresco-aws-cdk.ts:51:18)
        \_ Module._compile (module.js:652:30)
        \_ Module.m._compile (C:\local\invest-apps\alfresco-aws-cdk\node_modules\ts-node\src\index.ts:493:23)
        \_ Module._extensions..js (module.js:663:10)
        \_ Object.require.extensions.(anonymous function) [as .ts] (C:\local\invest-apps\alfresco-aws-cdk\node_modules\ts-node\src\index.ts:496:12)
        \_ Module.load (module.js:565:32)
        \_ tryModuleLoad (module.js:505:12)
        \_ Function.Module._load (module.js:497:3)
        \_ Function.Module.runMain (module.js:693:10)
        \_ Object.<anonymous> (C:\local\invest-apps\alfresco-aws-cdk\node_modules\ts-node\src\bin.ts:158:12)
        \_ Module._compile (module.js:652:30)
        \_ Object.Module._extensions..js (module.js:663:10)
        \_ Module.load (module.js:565:32)
        \_ tryModuleLoad (module.js:505:12)
        \_ Function.Module._load (module.js:497:3)
        \_ Function.Module.runMain (module.js:693:10)
        \_ findNodeScript.then.existing (C:\Program Files\nodejs\node_modules\npm\node_modules\libnpx\index.js:268:14)
        \_ <anonymous>
 40/43 | 11:38:57 | CREATE_FAILED        | AWS::IAM::InstanceProfile             | EKSCluster/DefaultCapacity/InstanceProfile (EKSClusterDefaultCapacityInstanceProfile79FC0597) Resource creation cancelled
        new AutoScalingGroup (C:\local\invest-apps\alfresco-aws-cdk\node_modules\@aws-cdk\aws-autoscaling\lib\auto-scaling-group.ts:420:24)
        \_ Cluster.addCapacity (C:\local\invest-apps\alfresco-aws-cdk\node_modules\@aws-cdk\aws-eks\lib\cluster.ts:449:17)
        \_ new Cluster (C:\local\invest-apps\alfresco-aws-cdk\node_modules\@aws-cdk\aws-eks\lib\cluster.ts:425:35)
        \_ new AlfrescoAppStack (C:\local\invest-apps\alfresco-aws-cdk\lib\alfresco-app-stack.ts:40:22)
        \_ Object.<anonymous> (C:\local\invest-apps\alfresco-aws-cdk\bin\alfresco-aws-cdk.ts:51:18)
        \_ Module._compile (module.js:652:30)
        \_ Module.m._compile (C:\local\invest-apps\alfresco-aws-cdk\node_modules\ts-node\src\index.ts:493:23)
        \_ Module._extensions..js (module.js:663:10)
        \_ Object.require.extensions.(anonymous function) [as .ts] (C:\local\invest-apps\alfresco-aws-cdk\node_modules\ts-node\src\index.ts:496:12)
        \_ Module.load (module.js:565:32)
        \_ tryModuleLoad (module.js:505:12)
        \_ Function.Module._load (module.js:497:3)
        \_ Function.Module.runMain (module.js:693:10)
        \_ Object.<anonymous> (C:\local\invest-apps\alfresco-aws-cdk\node_modules\ts-node\src\bin.ts:158:12)
        \_ Module._compile (module.js:652:30)
        \_ Object.Module._extensions..js (module.js:663:10)
        \_ Module.load (module.js:565:32)
        \_ tryModuleLoad (module.js:505:12)
        \_ Function.Module._load (module.js:497:3)
        \_ Function.Module.runMain (module.js:693:10)
        \_ findNodeScript.then.existing (C:\Program Files\nodejs\node_modules\npm\node_modules\libnpx\index.js:268:14)
        \_ <anonymous>

@runlevel-six

For those having this issue, I got around it by not setting a masters role in the cluster stack and instead creating a second stack that builds the aws-auth manifest and applies it as a new KubernetesResource. That second stack, which I deploy after the cluster stack has finished creating, uses an export of the cluster (from the cluster stack), passed in as part of K8SExtendedStackProps. It looks something like this:

Cluster Definition (from initial stack):

this.cluster = new eks.Cluster(this, `${props.deploy}-${props.clusterNum}`, {
  clusterName: `${props.deploy}-${props.clusterNum}`,
  vpc: props.vpc,
  defaultCapacity: 0,
  vpcSubnets: [{
    subnetType: ec2.SubnetType.PRIVATE
  }, {
    subnetType: ec2.SubnetType.PUBLIC
  }],
  kubectlEnabled: true,
  role: props.clusterAdmin,
  //mastersRole: props.kubeAdminRole,
  outputConfigCommand: false,
  version: environment.k8sVersion
});

Cluster Auth Stack (hopefully temporarily):

export class K8SClusterAuthStack extends Stack {
  constructor(scope: App, id: string, props: K8SExtendedStackProps) {
    super(scope, id, props);

    try {
      const awsAwthRoles = {
        apiVersion: 'v1',
        kind: 'ConfigMap',
        metadata: {
          name: 'aws-auth',
          namespace: 'kube-system'
        },
        data: {
          mapRoles: `[{\"rolearn\": \"${props.nodeGroupRole.roleArn}\",\"username\":\"system:node:{{EC2PrivateDNSName}}\",\"groups\": [\"system:bootstrappers\",\"system:nodes\"]},{\"rolearn\": \"${props.kubeAdminRole.roleArn}\",\"groups\": [\"system:masters\"]}]`,
          mapUsers: "[]",
          mapAccounts: "[]"
        }
      };
      new eks.KubernetesResource(this, 'k8sRolesAwsAuthManifest', {
        cluster: props.cluster,
        manifest: [
          awsAwthRoles
        ]
      });
      new CfnOutput(this, `${props.deploy}-${props.clusterNum}-Kubeconfig-Command`, {
        value: `aws eks update-kubeconfig --name ${props.deploy}-${props.clusterNum} --region ${this.region} --role-arn ${props.kubeAdminRole.roleArn}`
      });
    } catch(e) {
      console.log(e);
    }
  }
}

It then rebuilds the new update-kubeconfig command and provides it as output. A few items to note, though:

  1. This is just for my own purposes of getting around this issue temporarily. I am not saying this is production-ready (or even quality) code. But it can give you an idea of a workaround.
  2. There are a lot of types that are configured, and other resources that are created and exported from previously run stacks within the app and then used in this code (their creation and export are not shown in this excerpt). I just wanted to give a basic idea of how I'm working around it for now, until this issue is resolved.

@cseickel

cseickel commented Oct 7, 2019

@runlevel-six Thanks for sharing your workaround! There is one item I don't understand, could you tell me what role the "props.nodeGroupRole" maps to in your code? Is that first role required to make the masters role mapping work or is it something unrelated?

Thanks for your time, I really appreciate it.

@runlevel-six

runlevel-six commented Oct 7, 2019

@cseickel: that role was a second role I created for the nodegroups so that they could properly join the cluster as worker nodes. If it doesn't apply to what you are doing and you just need the masters role, you could change that line to:

mapRoles: `[{\"rolearn\": \"${props.kubeAdminRole.roleArn}\",\"groups\": [\"system:masters\"]}]`,

kubeAdminRole is a role I created earlier in the app and exported so that this stack could apply it to the masters role.

@NGL321 NGL321 removed the needs-triage This issue or PR still needs to be triaged. label Oct 7, 2019
@cseickel

cseickel commented Oct 8, 2019

Unfortunately for me, the two-stage KubernetesResource workaround did not work out. It does avoid the bug, but I am left with permissions issues that I have opted not to debug any further right now.

At least for my setup, I do need to add more than just the masters role to enable the worker nodes to join properly.

@runlevel-six

In that case you need to use:

mapRoles: `[{\"rolearn\": \"${props.nodeGroupRole.roleArn}\",\"username\":\"system:node:{{EC2PrivateDNSName}}\",\"groups\": [\"system:bootstrappers\",\"system:nodes\"]},{\"rolearn\": \"${props.kubeAdminRole.roleArn}\",\"groups\": [\"system:masters\"]}]`,

Where props.nodeGroupRole.roleArn represents the ARN of the role the worker nodes are deployed with and props.kubeAdminRole.roleArn represents a role that you want to have masters capability for.
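
For readability, that escaped string is just a JSON array of role mappings. Assuming placeholder ARNs, it expands to something like the following (illustrated here with Python's json module purely to show the structure):

import json

map_roles = json.dumps([
    {
        # Hypothetical ARN of the role the worker nodes run with
        "rolearn": "arn:aws:iam::111122223333:role/NodeGroupRole",
        "username": "system:node:{{EC2PrivateDNSName}}",
        "groups": ["system:bootstrappers", "system:nodes"],
    },
    {
        # Hypothetical ARN of the role that should get cluster-admin access
        "rolearn": "arn:aws:iam::111122223333:role/KubeAdminRole",
        "groups": ["system:masters"],
    },
])
print(map_roles)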

@stefanolczak

The Lambda can also fail in another place. This is what happened to me a few times:

Cluster status not active
[ERROR] 2019-10-03T12:58:39.506Z b47a8eba-f840-4bd0-8ba3-01d776dfc47e Command '['aws', 'eks', 'update-kubeconfig', '--name', 'DevEksCluster', '--kubeconfig', '/tmp/kubeconfig']' returned non-zero exit status 255.
Traceback (most recent call last):
File "/var/task/index.py", line 44, in handler
'--kubeconfig', kubeconfig
File "/var/lang/lib/python3.7/subprocess.py", line 347, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['aws', 'eks', 'update-kubeconfig', '--name', 'DevEksCluster', '--kubeconfig', '/tmp/kubeconfig']' returned non-zero exit status 255.
[ERROR] 2019-10-03T12:58:39.532Z b47a8eba-f840-4bd0-8ba3-01d776dfc47e | cfn_error: Command '['aws', 'eks', 'update-kubeconfig', '--name', 'DevEksCluster', '--kubeconfig', '/tmp/kubeconfig']' returned non-zero exit status 255.

# "log in" to the cluster
subprocess.check_call([
    'aws', 'eks', 'update-kubeconfig',
    '--name', cluster_name,
    '--kubeconfig', kubeconfig
])

It's strange because my cluster-resource lambda received info that the cluster is active 30 seconds before that.
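
One defensive option (a hypothetical sketch, not the handler's actual code) would be to re-poll DescribeCluster from the Lambda until the control plane really reports ACTIVE, immediately before running update-kubeconfig:

import time
import boto3

eks = boto3.client('eks')

def wait_until_active(cluster_name, attempts=10, delay=30):
    # Hypothetical helper: re-check the cluster status right before
    # 'aws eks update-kubeconfig' instead of trusting an earlier signal.
    for _ in range(attempts):
        status = eks.describe_cluster(name=cluster_name)['cluster']['status']
        if status == 'ACTIVE':
            return
        time.sleep(delay)
    raise RuntimeError(f'cluster {cluster_name} did not reach ACTIVE in time')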

@lkoniecz
Author

lkoniecz commented Oct 11, 2019

In addition to what has already been reported, this is what I experienced yesterday.

30/69 | 12:34:37 | CREATE_IN_PROGRESS   | Custom::AWSCDK-EKS-Cluster            | DevEksCluster/DevEksCluster/Resource/Resource/Default (DevEksCluster6F41DD8A) Resource creation Initiated
31/69 | 12:34:38 | CREATE_FAILED        | Custom::AWSCDK-EKS-Cluster            | DevEksCluster/DevEksCluster/Resource/Resource/Default (DevEksCluster6F41DD8A) Failed to create resource. Waiter ClusterActive failed: Max attempts exceeded
    new CustomResource (/tmp/jsii-kernel-1QxloY/node_modules/@aws-cdk/aws-cloudformation/lib/custom-resource.js:32:25)
    \_ new ClusterResource (/tmp/jsii-kernel-1QxloY/node_modules/@aws-cdk/aws-eks/lib/cluster-resource.js:46:26)
    \_ new Cluster (/tmp/jsii-kernel-1QxloY/node_modules/@aws-cdk/aws-eks/lib/cluster.js:81:24)

Logs snippet from the cluster resource handler lambda

10:21:55 [INFO]  2019-10-10T10:21:55.743Z  c0bc686d-8132-470c-a022-83af9924c5a0  waiting for cluster to become active...
10:34:28 [ERROR] 2019-10-10T10:34:28.674Z  c0bc686d-8132-470c-a022-83af9924c5a0  Waiter ClusterActive failed: Max attempts exceeded
10:34:28 Traceback (most recent call last):
10:34:28   File "/var/task/index.py", line 83, in handler
10:34:28     'MaxAttempts': 26
10:34:28   File "/opt/awscli/botocore/waiter.py", line 53, in wait
10:34:28     Waiter.wait(self, **kwargs)
10:34:28   File "/opt/awscli/botocore/waiter.py", line 329, in wait
10:34:28     last_response=response
10:34:28 botocore.exceptions.WaiterError: Waiter ClusterActive failed: Max attempts exceeded

@cseickel

Can anyone else confirm whether this issue is primarily with us-east-1?

@lkoniecz
Author

No, it happens quite often in eu-west-1 too.

@eladb eladb added the p0 label Oct 23, 2019
@eladb eladb added p1 and removed p0 labels Nov 4, 2019
@lkoniecz
Author

Hello, any updates on this?

@eladb
Contributor

eladb commented Nov 28, 2019

Hello, this is still high on our priority list, but we are a bit heads down towards re:Invent next week, and will get to this as soon as possible.

@eladb eladb added the in-progress This issue is being actively worked on. label Nov 30, 2019
eladb pushed a commit that referenced this issue Nov 30, 2019
There were two causes of timeouts for EKS cluster creation: create time which is longer than the AWS Lambda timeout (15min) and lack of retry when applying kubectl after the cluster has been created.

The change fixes the first issue by leveraging the custom resource provider framework to implement the cluster resource as an async resource.
The second issue is fixed by adding 3 retries to "kubectl apply".

Fixes #4087
Fixes #4695
eladb pushed a commit that referenced this issue Dec 30, 2019
There were two causes of timeouts for EKS cluster creation: create time which is longer than the AWS Lambda timeout (15min) and lack of retry when applying kubectl after the cluster has been created.

The change fixes the first issue by leveraging the custom resource provider framework to implement the cluster resource as an async resource. The custom resource providers are now bundled as nested stacks so they don't take up too many resources from users, and are also reused by multiple clusters within the same stack. This required that the creation role will not be the same as the lambda role, so we define this role separately and assume it within the providers.

The second issue is fixed by adding 3 retries to "kubectl apply".

**Backwards compatibility**: as described in #5544, since the resource provider handler of `Cluster` and `KubernetesResource` has been changed, this change requires a replacement of existing clusters (deployment fails with "service token cannot be changed" error). Since this can be disruptive to users, this change includes an exact copy of the previous version under a new module called `@aws-cdk/aws-eks-legacy`, which can be used as a drop-in replacement until users decide to upgrade to the new version. Using the legacy cluster will emit a synthesis warning that this module will no longer be released as part of the CDK starting March 1st, 2020.

- Fixes #4087
- Fixes #4695
- Fixes #5259
- Fixes #5501

---

BREAKING CHANGE: (in experimental module) the providers behind the AWS EKS module have been rewritten to address multiple stability issues. Since this change requires cluster replacement, the old version of this module is available under `@aws-cdk/aws-eks-legacy`. Please read #5544 carefully for upgrade instructions.
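
For context, the "async resource" approach described above corresponds to the provider framework's onEvent/isComplete split: a separate completion handler is polled until the long-running operation finishes, so the wait is no longer bounded by a single Lambda invocation. A rough sketch of that pattern (hypothetical handler code, not the actual implementation from the PR):

import boto3

eks = boto3.client('eks')

def on_event(event, context):
    # Kick off the long-running operation and return immediately.
    if event['RequestType'] == 'Create':
        props = event['ResourceProperties']
        eks.create_cluster(
            name=props['Name'],
            roleArn=props['RoleArn'],
            resourcesVpcConfig=props['ResourcesVpcConfig'],
        )
        return {'PhysicalResourceId': props['Name']}
    # Update/Delete handling omitted for brevity.

def is_complete(event, context):
    # Polled by the provider framework until IsComplete is True.
    status = eks.describe_cluster(name=event['PhysicalResourceId'])['cluster']['status']
    return {'IsComplete': status == 'ACTIVE'}
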
@mergify mergify bot closed this as completed in #5540 Dec 30, 2019
mergify bot added a commit that referenced this issue Dec 30, 2019
@iliapolo iliapolo changed the title EKS fails to deploy with CancelRequest not implemented by *exec.roundTripper\nerror: unable to recognize "/tmp/manifest.yaml" [aws-eks] EKS fails to deploy with CancelRequest not implemented by *exec.roundTripper\nerror: unable to recognize "/tmp/manifest.yaml" Aug 16, 2020
@iliapolo iliapolo removed the in-progress This issue is being actively worked on. label Aug 16, 2020