cdk deploy in endless loop cause of Fargate Service cant fire up task #7746

logemann · 2020-05-01T19:06:32Z

I am deploying a CodePipeline stack that deploys to a Fargate service. The problem is that when the Fargate task fails to start, the deployment never returns, because Fargate tries to start the task over and over again (roughly every minute).

Roughly my code is:

    public createEcsDeployAction(vpc: Vpc, ecrRepo: ecr.Repository, buildOutput: Artifact): EcsDeployAction {
      return new EcsDeployAction({
        actionName: 'EcsDeployAction',
        service: this.createLoadBalancedFargateService(this, vpc, ecrRepo).service,
        input: buildOutput,
      });
    }

    createLoadBalancedFargateService(scope: Construct, vpc: Vpc, ecrRepository: ecr.Repository) {
      return new ecspatterns.ApplicationLoadBalancedFargateService(scope, 'myLbFargateService', {
        vpc: vpc,
        serviceName: "HelloWorldFargateService",
        memoryLimitMiB: 512,
        cpu: 256,
        taskImageOptions: {
          image: ecs.ContainerImage.fromEcrRepository(ecrRepository, "latest"),
        },
      });
    }

My problem could be that I define an image in the ApplicationLoadBalancedFargateService that isn't available while the stack is being deployed, because CodePipeline hasn't run yet. I don't know for sure.

The question remains whether it's wise for "cdk deploy" to never terminate because of never-ending attempts to start a task in the backend.

Reproduction Steps

Hard to reproduce out of context.

Error Log

No error is shown in the console on cdk deploy, so it is hard to find the real error. I tried to find it via the AWS console without success.

Environment

CLI Version : aws-cli/2.0.10 Python/3.8.2 Darwin/19.4.0 botocore/2.0.0dev14
Framework Version: 1.36.1 (build 4df7dac)
OS : Mac OS X

This is 🐛 Bug Report

jonny-rimek · 2020-05-01T22:10:14Z

I have the same issue with the ApplicationLoadBalancedFargateService pattern. I use fromAsset(pathName) and deploy locally via cdk deploy. On the task page there was an error message saying it couldn't download the image from ECR. To me, that sounds like a problem with the preconfigured IAM permissions. When I launched the task via the console, it successfully reached the RUNNING status.

    const vpc = new ec2.Vpc(this, 'Vpc', {
      subnetConfiguration: [{
        name: 'publicSubnet',
        subnetType: ec2.SubnetType.PUBLIC,
      }],
      natGateways: 0,
    })

    const postgres = new rds.DatabaseInstance(this, 'Postgres', {
      engine: rds.DatabaseInstanceEngine.POSTGRES,
      instanceClass: ec2.InstanceType.of(ec2.InstanceClass.BURSTABLE2, ec2.InstanceSize.MICRO),
      masterUsername: 'postgres',
      deletionProtection: false,
      vpc,
      vpcPlacement: { subnetType: ec2.SubnetType.PUBLIC }
    })

    postgres.connections.allowFromAnyIpv4(ec2.Port.tcp(5432))

    const loadBalancedFargateService = new ecsPatterns.ApplicationLoadBalancedFargateService(this, 'Service', {
      vpc,
      memoryLimitMiB: 512,
      cpu: 256,
      desiredCount: 1,
      taskImageOptions: {
        image: ecs.ContainerImage.fromAsset('services/api'),
      },
    });

CDK version 1.36.1

logemann · 2020-05-02T02:35:27Z

@jonny-rimek that's a different scenario from mine. I don't have IAM problems.

I dug deeper, and to me it looks like a chicken-and-egg problem. When I remove the ECS deployment stage from my CodePipeline and deploy the stack from scratch, everything works. Of course, this gets me a Docker image in my ECR repo (because CodePipeline runs). When I then re-add the ECS deployment stage in my code and redeploy the stack, everything works, because now there is a Docker image in the ECR repo. Subsequent CodePipeline runs triggered by changes to the GitHub repo work too, and I get full auto-deployment.

So currently I must deploy my stack in two steps: first without the deployment stage, and then with it included. That looks wrong to me.

IMO the problem is that ApplicationLoadBalancedFargateService wants to bootstrap an image directly via:

    taskImageOptions: {
      image: ecs.ContainerImage.fromEcrRepository(ecrRepository, "latest"),
    },

It doesn't know that it's embedded in an EcsDeployAction, where it should act only when there is an imagedefinitions.json in the input artifact.
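
For readers unfamiliar with it: the imagedefinitions.json file that the ECS deploy action reads from its input artifact is a JSON array that names each container and the image URI to deploy for it. Below is a minimal sketch of producing that file's contents. The helper name `buildImageDefinitions` and the account ID, region, and repository name are placeholders for illustration, not values from this thread; the container name is assumed to match the one declared in the task definition.

```typescript
// One entry per container that the pipeline should update.
interface ImageDefinition {
  name: string;     // container name as declared in the task definition
  imageUri: string; // full image URI, including the tag
}

// Hypothetical helper: builds the contents of imagedefinitions.json.
function buildImageDefinitions(containerName: string, repositoryUri: string, tag: string): string {
  const definitions: ImageDefinition[] = [
    { name: containerName, imageUri: `${repositoryUri}:${tag}` },
  ];
  return JSON.stringify(definitions);
}

// Placeholder values for illustration only.
const imageDefinitionsJson = buildImageDefinitions(
  'hello-world-webapp',
  '123456789012.dkr.ecr.eu-central-1.amazonaws.com/hello-world-webapp',
  'latest',
);
console.log(imageDefinitionsJson);
```

In a CodeBuild job this string would typically be written to a file named imagedefinitions.json and exported as the build's output artifact.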

jonny-rimek · 2020-05-02T08:39:45Z

Do you deploy via cdk deploy inside a CodeBuild step, or do you use CodeDeploy?

jonny-rimek · 2020-05-02T09:42:53Z

    const fargateTask = new ecs.FargateTaskDefinition(this, 'FargateTask', {
      cpu: 256,
      memoryLimitMiB: 512,
    })

    fargateTask.addContainer("GinContainer", {
      image: ecs.ContainerImage.fromAsset('services/api')
    })

    const cluster = new ecs.Cluster(this, 'Cluster', {
      containerInsights: true,
      vpc
    })

    const fargateService = new ecs.FargateService(this, 'FargateService', {
      cluster,
      taskDefinition: fargateTask,
      desiredCount: 1,
      assignPublicIp: true,
      platformVersion: ecs.FargatePlatformVersion.VERSION1_4
    })

I dropped the ECS pattern and built everything from scratch, and the deployment works just fine (no ALB yet).

If I remove assignPublicIp, I get the following error message and the deployment is stuck again:

Stopped reason ResourceInitializationError: unable to pull secrets or registry auth: execution resource retrieval failed: unable to retrieve ecr registry auth: service call has been retried 1 time(s): RequestError: send request failed caused by: Post https://api.ecr....

logemann · 2020-05-02T11:25:44Z

I need to correct my previous statement. It is indeed an IAM issue, but I don't understand why. This is what I've seen in the task logs:

Status reason	CannotPullECRContainerError: AccessDeniedException: User: arn:aws:sts::xxxxxxxxxxx:assumed-role/ContDeployStack-myLbFargateServiceTaskDefExecution-10XL6RI11AJ8J/f81d47b2-fa8e-4e14-ab0c-56a172e7d825 is not authorized to perform: ecr:GetAuthorizationToken

Before that, I thought it was a chicken-and-egg problem and tried to circumvent it by using:

    taskImageOptions: {
      containerName: repoName,
      //image: ecs.ContainerImage.fromEcrRepository(ecrRepository, "latest"),
      image: ecs.ContainerImage.fromRegistry("hello-world"),
      containerPort: 8080,
    },

Here "hello-world" is the tiniest image I could find on Docker Hub; it acts as a placeholder until my CodePipeline has run.

Now a clean "cdk deploy" finishes OK, but the problem is that when my CodePipeline finishes, the new image isn't pulled into ECS. The scary thing is that my pipeline worked yesterday without problems: I could easily trigger new builds, and ECS was updated accordingly.

logemann · 2020-05-02T12:20:01Z

OK, I am facing a two-headed snake here. It looks like, if I use the "placeholder" image approach with fromRegistry("hello-world"), CDK can't know that I want to pull from ECR once my CodePipeline finishes, so the task execution role doesn't get the correct permissions. I can fix that with:

    fargateService.taskDefinition.executionRole?.addManagedPolicy(ManagedPolicy.fromAwsManagedPolicyName('AmazonEC2ContainerRegistryPowerUser'));

When I use the fromEcrRepository(ecrRepository, "latest") approach, I still think the problem is that no image is available at first deploy time, which leaves me with the endless deploy loop. In that case I don't think it's a permission problem, because CDK does know that I want to interact with ECR and should create the default task execution role accordingly.

I will now try both approaches from scratch to see if my findings hold. Testing always takes ages, of course, because destroying and deploying complex stacks takes a while.
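
As an aside, the AmazonEC2ContainerRegistryPowerUser managed policy also grants push permissions, which a task execution role does not need just to pull. A narrower alternative is sketched below as a plain policy document limited to the ECR actions involved in pulling an image. The function name `ecrPullPolicy` and the repository ARN are illustrative assumptions. Note that ecr:GetAuthorizationToken cannot be scoped to a single repository, so it sits in its own statement with Resource "*".

```typescript
interface PolicyStatement {
  Effect: 'Allow';
  Action: string[];
  Resource: string;
}

interface PolicyDocument {
  Version: '2012-10-17';
  Statement: PolicyStatement[];
}

// Hypothetical helper: pull-only ECR permissions for a task execution role.
function ecrPullPolicy(repositoryArn: string): PolicyDocument {
  return {
    Version: '2012-10-17',
    Statement: [
      // Registry login; this action does not support resource-level scoping.
      { Effect: 'Allow', Action: ['ecr:GetAuthorizationToken'], Resource: '*' },
      // Layer and manifest downloads, scoped to the one repository.
      {
        Effect: 'Allow',
        Action: [
          'ecr:BatchCheckLayerAvailability',
          'ecr:GetDownloadUrlForLayer',
          'ecr:BatchGetImage',
        ],
        Resource: repositoryArn,
      },
    ],
  };
}

// Placeholder ARN for illustration only.
const pullPolicy = ecrPullPolicy('arn:aws:ecr:eu-central-1:123456789012:repository/hello-world-webapp');
console.log(JSON.stringify(pullPolicy, null, 2));
```

In CDK, the repository's grantPull() method is intended to grant this kind of pull-only access to a role, which may be simpler than attaching a policy by hand.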

logemann · 2020-05-02T15:06:06Z

OK. This is the detailed error when directly referencing a non-existing image with fromEcrRepository():

CannotPullContainerError: Error response from daemon: manifest for 985582282849.dkr.ecr.eu-central-1.amazonaws.com/hello-world-webapp:latest not found

So to me it looks like a placeholder dummy image for the first deployment is the only way to go. If you do it this way, you need to add a policy as mentioned in my previous post, because otherwise the task execution role created by CDK doesn't have enough permissions.

I hope I haven't put too much info in here, but this way other people can get an idea of what to do. To the AWS CDK dev team: is there an elegant way to solve this?
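
The three stopped reasons quoted so far in this thread point to different root causes. The sketch below collects them into a small lookup that can be applied to the stopped reason shown for a task in the ECS console. The function name `diagnoseStoppedReason` and the wording of the hints are a summary of this thread, not an official AWS mapping.

```typescript
// Maps a stopped-reason string to the likely cause discussed in this thread.
function diagnoseStoppedReason(stoppedReason: string): string {
  if (
    stoppedReason.includes('CannotPullECRContainerError') &&
    stoppedReason.includes('AccessDeniedException')
  ) {
    return 'execution role lacks ECR permissions (for example ecr:GetAuthorizationToken)';
  }
  if (stoppedReason.includes('CannotPullContainerError') && stoppedReason.includes('not found')) {
    return 'image tag does not exist in the repository yet';
  }
  if (
    stoppedReason.includes('ResourceInitializationError') &&
    stoppedReason.includes('unable to retrieve ecr registry auth')
  ) {
    return 'task has no network route to ECR (no public IP or NAT)';
  }
  return 'unknown: inspect the stopped task in the ECS console';
}

// Shortened versions of the messages quoted in this thread.
const missingImageHint = diagnoseStoppedReason(
  'CannotPullContainerError: Error response from daemon: manifest for repo:latest not found',
);
const accessDeniedHint = diagnoseStoppedReason(
  'CannotPullECRContainerError: AccessDeniedException: User is not authorized to perform: ecr:GetAuthorizationToken',
);
console.log(missingImageHint);
console.log(accessDeniedHint);
```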

skinny85 · 2020-05-04T18:26:20Z

Hey @logemann ,

Yes, this is an issue. Basically, the problem is that the CDK is currently missing a concept that represents "an image that doesn't exist yet, but will be created when the CodePipeline runs".

In a demo project we've done a long time ago, we have a class that represents exactly that. This is how it is used: [1], [2].

Would adding this class to the main CDK project solve your issue @logemann ? If so, I will convert this issue to a feature request.

Thanks,
Adam

logemann · 2020-05-04T18:59:24Z

hey @skinny85 ,

Thanks for commenting. I just checked the project and the class you mentioned. That's quite a lot of code (not only the class but also its surroundings) to solve this particular issue. I would rather use my placeholder image than go that way.

You might check the tutorial I've just finished on this issue to see how I approached it:
https://medium.com/aws-factory/continuous-delivery-of-a-java-spring-boot-webapp-with-aws-cdk-codepipeline-and-docker-56e045812bd2
(Any feedback would be awesome too, quite frankly :-))

From a dev standpoint, it would be super nice if ApplicationLoadBalancedFargateService were smart enough to know it is wrapped in an EcsDeployAction and initialized differently in that case. But I know far too little about CloudFormation's inner workings to tell whether this is even possible.

skinny85 · 2020-05-04T22:50:56Z

Well, this would solve the following issue that you talk about in your tutorial:

This is a small node express webserver i provided on Dockerhub which only displays a “Waiting for Codepipeline Docker image” webpage. But why do we need this? On the very first deployment of our stack, we cant reference our “to be build” image from our Codepipeline because it wont be there on startup of the stack. So we can’t code something like fromEcrRepository(ecrRepository, “latest”). This would result in an endless deployment loop on the console because AWS tries to start this Service unlimited times and the deployment will just wait on the console for the successful startup which will never happen.

Instead, you would simply do this:

    createLoadBalancedFargateService(scope: Construct, vpc: Vpc, ecrRepository: ecr.Repository, pipelineProject: PipelineProject) {
      const fargateService = new ecspatterns.ApplicationLoadBalancedFargateService(scope, 'myLbFargateService', {
        // ...
        taskImageOptions: {
          containerName: repoName,
          image: new PipelineContainerImage(ecrRepository),
          containerPort: 8080,
        },
      });

      fargateService.taskDefinition.executionRole?.addManagedPolicy(
        ManagedPolicy.fromAwsManagedPolicyName('AmazonEC2ContainerRegistryPowerUser'));
      return fargateService;
    }

And no weird workarounds are needed... isn't that strictly better?

logemann · 2020-05-05T00:02:52Z

Indeed... I somehow didn't fully understand the project you mentioned, because I thought you needed CloudFormationCreateUpdateStackAction and some other things to get PipelineContainerImage up and running.

Somehow I couldn't see that PipelineContainerImage alone would suffice. To me that's definitely worth a feature request, then. Can you explain in two short sentences what imageName in PipelineContainerImage gets resolved to, and why it doesn't have the same problem (not finding an image)?

Thanks.

Note: I will then update the tutorial to say that something might be coming along to make it even better. But for that I should at least understand it, haha.

Update: From what I can gather from the class, there is some lazy evaluation going on with regard to the imageName in the ECR repo. But I really don't get what PipelineParam is, and I can't figure out the purpose of the methods that are unused in my case, like paramName().

skinny85 · 2020-05-06T23:08:01Z

The trick with imageName is that it's turned into a CloudFormation parameter - that's what PipelineParam is.

As you can see, the parameter is then filled in by the CloudFormationCreateUpdateStackAction with the URL of the image that was pushed to the ECR repo inside the CodeBuild job.

If you don't want to use CloudFormationCreateUpdateStackAction, but EcsDeployAction, I guess the situation is even simpler: you don't need PipelineContainerImage; your workaround with the dummy web server image works fine (probably any other dummy image containing a server would do; I don't think you need a dedicated one).

However, there is a problem: the action will update the image "out-of-band", causing an intentional drift in the CloudFormation state (the actual image will be different from the image parameter in CloudFormation). This might prove problematic (for example, an update to your service's properties might revert the image to the original one), and in general it is bad practice.

For those reasons, I would advise against using EcsDeployAction, and would instead use CloudFormationCreateUpdateStackAction with PipelineContainerImage.

Does this make sense?

logemann · 2020-05-07T14:58:32Z

Yeah, that makes sense, and so I got it right that you can't use PipelineContainerImage in isolation. But I still don't think it's developer friendly. Using CloudFormationCreateUpdateStackAction feels like quite a big workaround too, if in the end you just want to use EcsDeployAction. I think we can close this one: adding PipelineContainerImage to the distro would only make sense if there were a ton of documentation on how to use it together with CloudFormationCreateUpdateStackAction as a kind of replacement for EcsDeployAction for this specific use case. A use case which is IMO quite mainstream.

ChrisLahaye · 2020-11-14T15:09:24Z

@logemann were you able to solve this issue?

… be used in CodePipeline (#11795) While CDK Pipelines is the idiomatic way of deploying ECS applications in CDK, it does not handle the case where the application's source code is kept in a separate source code repository from the CDK infrastructure code. This adds a new class to the ECS module, `TagParameterContainerImage`, that allows deploying a service managed that way through CodePipeline. Related to #1237 Related to #7746

ianpogi5 · 2021-07-09T01:00:56Z

jonny-rimek wrote:

    const fargateTask = new ecs.FargateTaskDefinition(this, 'FargateTask', {
      cpu: 256,
      memoryLimitMiB: 512,
    })

    fargateTask.addContainer("GinContainer", {
      image: ecs.ContainerImage.fromAsset('services/api')
    })

    const cluster = new ecs.Cluster(this, 'Cluster', {
      containerInsights: true,
      vpc
    })

    const fargateService = new ecs.FargateService(this, 'FargateService', {
      cluster,
      taskDefinition: fargateTask,
      desiredCount: 1,
      assignPublicIp: true,
      platformVersion: ecs.FargatePlatformVersion.VERSION1_4
    })

I dropped the ecs pattern and did everything from scratch and the deployment works just fine(no ALB yet)

If I remove assignPublicIp I get the following error message Stopped reason ResourceInitializationError: unable to pull secrets or registry auth: execution resource retrieval failed: unable to retrieve ecr registry auth: service call has been retried 1 time(s): RequestError: send request failed caused by: Post https://api.ecr.... and the deployment is back to being stuck

@jonny-rimek did you find a solution to this? I also dropped the ECS pattern, and now I am facing the same issue.

hansfpc · 2021-09-19T12:36:41Z

Same problem here.

dineshtrivedi · 2021-09-30T06:29:34Z

I am having a similar problem with ecs_pattern and QueueProcessingFargateService:

ResourceInitializationError: unable to pull secrets or registry auth: pull command failed: : signal: killed

yoshinorihisakawa · 2021-10-01T11:28:36Z

I can't solve this problem...

univerze · 2022-01-15T14:04:36Z

Hi, I had the same problem and managed to find a workaround that keeps the pattern. It's more code, though:

    const vpc = new ec2.Vpc(this, "myvpc", {
      maxAzs: 3
    });

    const cluster = new ecs.Cluster(this, "mycluster", {
      vpc: vpc
    });

    const repo = new ecr.Repository(this, 'apprepo', {
      repositoryName: 'app-repo',
    });

    const execRole = new iam.Role(this, 'taskexecutionrole', {
      assumedBy: new iam.ServicePrincipal('ecs-tasks.amazonaws.com'),
    });
    execRole.addManagedPolicy(iam.ManagedPolicy.fromAwsManagedPolicyName('AmazonEC2ContainerRegistryPowerUser'));

    const taskDef = new ecs.FargateTaskDefinition(this, 'mytaskdef', {
      family: 'app-taskdef',
      executionRole: execRole,
    });
    taskDef.addContainer('myappimage', {
      image: ecs.ContainerImage.fromRegistry("amazon/amazon-ecs-sample"),
      containerName: 'nodejs',
      portMappings: [
        { containerPort: 80 }
      ]
    });

    new ecs_patterns.ApplicationLoadBalancedFargateService(this, "myecsservice", {
      cluster: cluster,
      publicLoadBalancer: true,
      desiredCount: 0,
      taskDefinition: taskDef,
    });

At this point it may be better to drop the pattern and create everything manually. With this, though, I can use GitHub Actions to push to the ECR repo with no issues, and the Node.js app starts.

shresthapradip · 2022-09-22T16:12:32Z

skinny85 wrote:

The trick with imageName that it's turned into a CloudFormation parameter - that's what PipelineParam is.

If you see, then the parameter is filled in the CloudFormationCreateUpdateStackAction to be the URL of the image that was pushed to the ECR repo inside of the CodeBuild job.

If you don't want to use CloudFormationCreateUpdateStackAction, but EcsDeployAction, I guess the situation is even simpler: you don't need PipelineContainerImage, your workaround of the dummy web server image (probably any other dummy image containing a server would do, you don't need a dedicated one I think) works fine.

However, there is a problem: the action will update the image "out-of-band", causing an intentional drift in the CloudFormation state (the actual image will be something different than the image parameter in CloudFormation). This might prove problematic (for example, and update to your service's properties might make the image be reverted to its original), and in general is a bad practice.

For those reasons, I would advise against using EcsDeployAction, and would instead use CloudFormationCreateUpdateStackAction with PipelineContainerImage.

Does this make sense?

This is still happening in a newer CDK version, 2.41, and the PipelineContainerImage class is no longer compatible.

rafaelmarques7 · 2023-07-07T19:49:25Z

This is happening to me as well.

0xBradock · 2024-07-03T14:44:44Z

Hello,

I am still trying to deploy the most basic stack just to get ApplicationLoadBalancedFargateService working.
I took the example from the documentation and tried to deploy it with cdk deploy stack-app.

It hangs, and I didn't manage to get any logs.

import { Construct } from 'constructs';
import { App, Stack, StackProps } from 'aws-cdk-lib';
import { Vpc } from 'aws-cdk-lib/aws-ec2';
import { Cluster, ContainerImage } from 'aws-cdk-lib/aws-ecs';
import { ApplicationLoadBalancedFargateService } from 'aws-cdk-lib/aws-ecs-patterns';

const app = new App();

export class ApiSixStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    const vpc = Vpc.fromLookup(this, 'VPC', { isDefault: true });

    const cluster = new Cluster(this, 'Cluster', { vpc });

    new ApplicationLoadBalancedFargateService(this, 'Service', {
      cluster,
      memoryLimitMiB: 1024,
      desiredCount: 1,
      cpu: 512,
      taskImageOptions: { image: ContainerImage.fromRegistry('amazon/amazon-ecs-sample') },
      loadBalancerName: 'application-lb-name',
    });
  }
}

const env = {
  account: process.env.CDK_DEFAULT_ACCOUNT,
  region: process.env.CDK_DEFAULT_REGION,
};

new ApiSixStack(app, 'stack-app', { env });

app.synth();

Any help is appreciated,

skinny85 · 2024-07-03T20:19:33Z

@0xBradock check out the container status in ECS. I would bet the port amazon/amazon-ecs-sample listens on is different (the default is 80, and it probably listens on 8080, or something).

wjes · 2024-09-02T17:15:54Z

@0xBradock I think it's because the load balancer cannot reach the Fargate service's health checks. The default VPC you're using doesn't have private subnets, and since assignPublicIp is false by default (a.k.a. "use private subnets"), your Fargate tasks end up isolated in limbo.

You can solve it in two ways:

1. Set assignPublicIp to true in ApplicationLoadBalancedFargateService. This way Fargate will live in the public subnet and the load balancer should be able to reach it. The downside is that, besides going through the load balancer, the Fargate services can be accessed directly via that public IP. You'll also have to pay for that reserved IP.
2. Create a private subnet in your default VPC. Even better, create a new VPC with public and private subnets. The load balancer should live in the public subnet and Fargate in the private subnet. The downside is that Fargate will be completely isolated from the internet and only accessible through the load balancer. If the apps running in Fargate need to fetch from an external API, you'll have to add a NAT Gateway to the mix (which may be expensive, so make sure to set natGateways to 1 in your VPC constructor; otherwise one NAT Gateway will be created in each public subnet).
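
One way to read the two options is in terms of outbound connectivity, which a task needs in order to pull its image (compare the ResourceInitializationError reported earlier in this thread when assignPublicIp was removed). The sketch below is a deliberately simplified model, not AWS's actual routing logic: it assumes the only routes to the internet are a public IP in a public subnet or a NAT gateway behind a private subnet. The type and function names are made up for illustration.

```typescript
type SubnetKind = 'public' | 'private';

interface TaskNetworking {
  subnet: SubnetKind;      // kind of subnet the task is placed in
  assignPublicIp: boolean; // whether the task gets a public IP
  natGateways: number;     // NAT gateways available to private subnets
}

// Simplified model: can a Fargate task reach the internet to pull its image?
function canReachInternet(net: TaskNetworking): boolean {
  if (net.subnet === 'public') {
    // In a public subnet, outbound traffic needs a public IP on the task.
    return net.assignPublicIp;
  }
  // In a private subnet, outbound traffic goes through a NAT gateway.
  return net.natGateways > 0;
}

// Default VPC (public subnets only) with the default assignPublicIp: false.
const stuckCase = canReachInternet({ subnet: 'public', assignPublicIp: false, natGateways: 0 });
// Option 1: same VPC, but assignPublicIp set to true.
const optionOne = canReachInternet({ subnet: 'public', assignPublicIp: true, natGateways: 0 });
// Option 2: private subnet with a single NAT gateway.
const optionTwo = canReachInternet({ subnet: 'private', assignPublicIp: false, natGateways: 1 });
console.log({ stuckCase, optionOne, optionTwo });
```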

logemann added the bug and needs-triage labels May 1, 2020

skinny85 self-assigned this May 4, 2020

logemann closed this as completed May 7, 2020

skinny85 mentioned this issue Nov 30, 2020

feat(ecs): introduce a new Image type, TagParameterContainerImage, to be used in CodePipeline #11795

Merged
