Deployment fails with Concurrent::RejectedExecutionError #168

Closed
Raniz85 opened this issue Jun 4, 2018 · 37 comments

@Raniz85

Raniz85 commented Jun 4, 2018

We've seen similar failures on multiple servers; they seem to happen randomly.

Here's the log message:

2018-05-31 15:31:07 INFO  [codedeploy-agent(8052)]: [Aws::CodeDeployCommand::Client 200 0.030218 0 retries] put_host_command_complete(command_status:"Failed",diagnostics:{format:"JSON",payload:"{\"error_code\":5,\"script_name\":\"\",\"message\":\"Concurrent::RejectedExecutionError\",\"log\":\"\"}"},host_command_identifier:"<redacted>")

The redacted base64 string decodes into:

["com.amazon.apollo.deploycontrol.domain.HostCommandIdentifier",{"deploymentId":"CodeDeploy/eu-west-1/Prod/arn:aws:sds:eu-west-1:<accountId>:deployment/<deploymentId>","hostId":"arn:aws:ec2:eu-west-1:<accountId>:instance/<instanceId>","commandName":"AfterAllowTraffic","commandPosition":13,"commandAttempt":1}]
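For reference, a minimal sketch of how an identifier like this could be decoded, assuming it is standard base64 wrapping the two-element JSON array shown above (a type name followed by a fields object). The real value is redacted, so the snippet builds its own sample; `decode_host_command_identifier` is a hypothetical helper name, not part of the agent.

```python
import base64
import json


def decode_host_command_identifier(encoded: str) -> dict:
    """Decode a base64 host_command_identifier and return its fields object."""
    decoded = base64.b64decode(encoded).decode("utf-8")
    # Assumed layout, taken from the decoded example: [type_name, {fields}]
    _type_name, fields = json.loads(decoded)
    return fields


# Build a sample identifier, since the real one is redacted in the log.
sample = json.dumps([
    "com.amazon.apollo.deploycontrol.domain.HostCommandIdentifier",
    {"commandName": "AfterAllowTraffic", "commandPosition": 13, "commandAttempt": 1},
])
encoded = base64.b64encode(sample.encode("utf-8")).decode("ascii")

fields = decode_host_command_identifier(encoded)
print(fields["commandName"], fields["commandPosition"])
```

Reading `commandName` this way shows which lifecycle event the failed command belonged to.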
@rohkat-aws
Contributor

@Raniz85 we have a fix for this bug in the next release, but there are workarounds in the meantime. Can you tell me a little more about when it's happening, and maybe paste some agent logs from before it happens?

rohkat-aws added the bug label Jun 4, 2018
@Raniz85
Author

Raniz85 commented Jun 5, 2018

Here are the logs from the instance: cd-agent-crash.log

Note that we restart the agent 2 minutes after each deployment as a workaround to #32.

@rohkat-aws
Contributor

@Raniz85 so here is my hypothesis: when the agent restarts and starts exiting, it should not accept any new poll requests, but it does, because of a bug in the agent's thread synchronization. The fix has been added in ef65652, but it will only ship in the next release. For now, the workaround is to wait for the agent to restart properly before starting a deployment, for example by adding a wait. I know it's not ideal, but it can work as a temporary workaround.

@Raniz85
Author

Raniz85 commented Jun 7, 2018

I'm not sure I understand the proposed workaround.

Should we ensure that we're never deploying when an agent is restarting? That's not an easy workaround in an environment with more than one or two servers.

When can we expect the next release?

@rohkat-aws
Contributor

@Raniz85 I totally understand that; it's happening due to a race condition. Another workaround could be to add a wait after the restart, like a sleep for some seconds, and then start deploying. We are still working out some issues for the next release, but will post once we start releasing the next version.
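One way to read the "wait after the restart" suggestion is sketched below: poll until the agent reports that it is running, then sleep for a short settle period before triggering a deployment. This is only an illustration. `wait_for_agent`, the settle period, and the timeout are hypothetical, and the default check assumes that `service codedeploy-agent status` exits with 0 when the agent is running.

```python
import subprocess
import time
from typing import Callable


def agent_is_running() -> bool:
    """Default check. Assumption: the status command exits 0 when running."""
    result = subprocess.run(
        ["service", "codedeploy-agent", "status"],
        capture_output=True,
    )
    return result.returncode == 0


def wait_for_agent(
    is_running: Callable[[], bool] = agent_is_running,
    settle_seconds: float = 10.0,    # hypothetical value, tune as needed
    timeout_seconds: float = 120.0,  # hypothetical value, tune as needed
    poll_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once the agent is up and the settle period has passed.

    Returns False if the agent did not come up before the timeout.
    """
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        if is_running():
            # Give the restarted agent a moment before sending it work.
            sleep(settle_seconds)
            return True
        sleep(poll_seconds)
    return False
```

A restart script could call `wait_for_agent()` right after restarting the agent and only signal readiness when it returns `True`. The check and the sleep function are injectable so the logic can be exercised without a real agent.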

@Raniz85
Author

Raniz85 commented Jun 8, 2018

I still don't understand what you want me to do.

We have the agent running on about 100 instances and deployments are automated via Jenkins. Do you suggest that we add a hook in Jenkins that ensures that no agent has restarted on any of the 100 instances within the last 10 seconds (or so) before starting the deployment?
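To make the question concrete, a fleet-wide gate of that kind might look roughly like the sketch below. It assumes the pipeline can somehow obtain a last-restart timestamp per instance, which the thread does not describe; the instance IDs, the 10-second quiet period, and `recently_restarted` are all hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def recently_restarted(
    last_restart: Dict[str, datetime],
    now: datetime,
    quiet_period: timedelta = timedelta(seconds=10),
) -> List[str]:
    """Return the instance IDs whose agent restarted within the quiet period."""
    return sorted(
        instance_id
        for instance_id, restarted_at in last_restart.items()
        if now - restarted_at < quiet_period
    )


# Hypothetical fleet state: one agent restarted 3 seconds ago.
now = datetime(2018, 6, 8, 12, 0, 0, tzinfo=timezone.utc)
fleet = {
    "i-aaa": now - timedelta(seconds=3),
    "i-bbb": now - timedelta(minutes=5),
}
blocked = recently_restarted(fleet, now)
print(blocked)  # instances a pre-deploy hook would have to wait for
```

The deployment would only start once the returned list is empty, which illustrates why this is awkward to maintain across a large fleet.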

@rohkat-aws
Contributor

@Raniz85 which region are the hosts in?

@Raniz85
Author

Raniz85 commented Jun 18, 2018

eu-west-1 mostly, though some are in cn-north-1 and us-east-2.

@rohkat-aws
Contributor

In eu-west-1 we have released the new version of the agent (1518), which should fix this issue. Can you try that?

@SupportNubersia

SupportNubersia commented Jun 19, 2018

Hi @rohkat-aws, we have the same problem with the CodeDeploy agent.

We use the eu-west-1 region and we tried to update the agent (agent_version: OFFICIAL_1.0-1.1518_rpm) to the new version, and we got these messages:

sudo /opt/codedeploy-agent/bin/install auto

I, [2018-06-19T12:26:13.508928 #25330]  INFO -- : Starting Ruby version check.
I, [2018-06-19T12:26:13.509033 #25330]  INFO -- : Starting update check.
I, [2018-06-19T12:26:13.509064 #25330]  INFO -- : Attempting to automatically detect supported package manager type for system...
I, [2018-06-19T12:26:13.517030 #25330]  INFO -- : Checking AWS_REGION environment variable for region information...
I, [2018-06-19T12:26:13.517100 #25330]  INFO -- : Checking EC2 metadata service for region information...
I, [2018-06-19T12:26:13.550405 #25330]  INFO -- : Downloading version file from bucket aws-codedeploy-eu-west-1 and key latest/VERSION...
I, [2018-06-19T12:26:13.571773 #25330]  INFO -- : Running version matches target version, skipping install
I, [2018-06-19T12:26:13.571839 #25330]  INFO -- : Update check complete.
I, [2018-06-19T12:26:13.571858 #25330]  INFO -- : Stopping updater.

Can you confirm that the new version is released?

Regards,
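As a rough sketch, updater output like the above can be checked programmatically to see whether an install was skipped and which bucket the version file came from. The line pattern is inferred only from the log lines quoted in this comment; `messages`, `update_was_skipped`, and `version_source` are hypothetical helper names.

```python
import re
from typing import List, Optional, Tuple

# Pattern inferred from the quoted lines: "I, [<timestamp> #<pid>]  INFO -- : <message>"
LOG_LINE = re.compile(r"^(\w), \[(\S+) #(\d+)\]\s+(\w+) -- : (.*)$")
BUCKET = re.compile(r"from bucket (\S+) and key (\S+?)\.\.\.$")


def messages(log_text: str) -> List[str]:
    """Extract the message part of every line that matches the log pattern."""
    found = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match:
            found.append(match.group(5))
    return found


def update_was_skipped(log_text: str) -> bool:
    """True if the updater reported that the running version is already current."""
    return any("skipping install" in msg for msg in messages(log_text))


def version_source(log_text: str) -> Optional[Tuple[str, str]]:
    """Return (bucket, key) of the downloaded version file, if it was logged."""
    for msg in messages(log_text):
        match = BUCKET.search(msg)
        if match:
            return match.group(1), match.group(2)
    return None


sample_log = """
I, [2018-06-19T12:26:13.550405 #25330]  INFO -- : Downloading version file from bucket aws-codedeploy-eu-west-1 and key latest/VERSION...
I, [2018-06-19T12:26:13.571773 #25330]  INFO -- : Running version matches target version, skipping install
I, [2018-06-19T12:26:13.571839 #25330]  INFO -- : Update check complete.
"""
print(update_was_skipped(sample_log), version_source(sample_log))
```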

@SupportNubersia

SupportNubersia commented Jun 19, 2018

Hi @rohkat-aws, when you say
"could be you just add a wait after the restart like a sleep for some seconds and then start deploying"
what do you mean?
Could it be the :wait_between_runs: variable in codedeployagent.yml?
Any suggestions?
Greetings and thanks.

@woodhull

We're seeing this issue as well. Has the fix been released to the us-east-1 codedeploy install s3 bucket?

I do not understand the proposed workaround from @rohkat-aws either.

@rohkat-aws
Contributor

@SupportNubersia @woodhull 1.0-1.1518_rpm is the new version. Are you still seeing issues?

@woodhull

woodhull commented Jun 19, 2018

We've so far only seen this issue on fresh instance boot, so we're rebaking our AMIs in the hope that we get a fresher version of the codedeploy agent. We'll let you know once that is complete, we've rolled the AMI out, and we can test.

@SupportNubersia

Hi @rohkat-aws

The version agent_version: OFFICIAL_1.0-1.1518_rpm that exists in my AMIs was created more than a month ago.

Can you confirm that the new release of code-agent is 1.0-1.1518?

@SupportNubersia

To install the codedeploy-agent we used the official process.

https://docs.aws.amazon.com/codedeploy/latest/userguide/codedeploy-agent-operations-install-linux.html

@rohkat-aws
Contributor

@SupportNubersia @woodhull yes, OFFICIAL_1.0-1.1518_rpm should fix this.
And @woodhull, the workaround that was suggested was for 1458 or the previous agent version. All I meant by it was to wait until the agent has restarted completely, or started up again, before sending it a deployment request.
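A small sketch for checking whether an `agent_version` string is at or above the build said to contain the fix. It assumes the last numeric component (1518 here) is a build number that increases with each release, and that the string always has the `OFFICIAL_x.y-z.build_pkg` shape seen in this thread; the thread only confirms that 1518 has the fix and 1458 does not. `agent_build` and `has_shutdown_fix` are hypothetical names.

```python
import re

# Shape inferred from "OFFICIAL_1.0-1.1518_rpm" as quoted in this thread.
VERSION = re.compile(r"^OFFICIAL_(\d+)\.(\d+)-(\d+)\.(\d+)_(\w+)$")

# Build reported in this thread as containing the fix.
FIXED_BUILD = 1518


def agent_build(agent_version: str) -> int:
    """Extract the build number (assumed to be the last numeric component)."""
    match = VERSION.match(agent_version.strip())
    if match is None:
        raise ValueError(f"unrecognized agent version: {agent_version!r}")
    return int(match.group(4))


def has_shutdown_fix(agent_version: str) -> bool:
    """True if the build is at least the one reported to fix the race."""
    return agent_build(agent_version) >= FIXED_BUILD


print(has_shutdown_fix("OFFICIAL_1.0-1.1518_rpm"))
# The 1458 string below is a hypothetical spelling of the previous version.
print(has_shutdown_fix("OFFICIAL_1.0-1.1458_rpm"))
```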

@SupportNubersia

Hi @rohkat-aws
How can we wait until the agent has restarted completely, or wait for it to start up again, before sending it a deployment request?
The problem occurs when the autoscaling group launches new instances.

Please, we need to solve it ASAP.

@rohkat-aws
Contributor

@SupportNubersia Are you seeing the issue even with the latest version of the agent? Is it possible the AMI is pre-baked with an old version of the agent? And when you say it happens during scale-up, is it because the agent is being restarted in the launch config after the install?

@SupportNubersia

@rohkat-aws the latest version in our AMI, prepared a month ago, is OFFICIAL_1.0-1.1518_rpm, and we get the same error.
What we see is that the system runs the whole process up to the execution of the StartApplication hook (other times it works correctly).
Do you advise us to restart the code-deploy agent in the LaunchConfiguration?
We are going to remove the code-deploy agent installation and install it again.

@rohkat-aws
Contributor

rohkat-aws commented Jun 20, 2018

The 1518 version was released a week back, not a month back, @SupportNubersia. Having said that, can you please look into /tmp/codedeploy-agent.update.log and confirm whether the agent is being updated?

@SupportNubersia

Hi @rohkat-aws, we can see that the error occurs because, when the instances launch for the first time, the agent runs an update and kills our deployment.

Now we're updating the AMI with the latest codedeploy-agent version and we will try again.

How can we disable this automatic update on first execution?

@rohkat-aws
Contributor

@SupportNubersia Did that work?

@woodhull

We think that this is now fixed for us after baking a new AMI with the latest codedeploy agent version preinstalled.

@SupportNubersia

@rohkat-aws it's working well.

But with the next code-deploy update, the problem will occur again...

@rohkat-aws
Contributor

No, it should not, @SupportNubersia. This version fixes that, and this is the commit that does it:
ef65652#diff-9c9dfb7af94f7715489974ad6d37d7f3R76

@rohkat-aws
Contributor

@Raniz85 if you can also confirm, I can close the issue.

@SupportNubersia

Ok @rohkat-aws perfect!

Thanks for your help

@Raniz85
Author

Raniz85 commented Jun 20, 2018

We're preparing to upgrade our AMIs, but haven't gotten there yet. I probably won't have time tomorrow and then I'm on vacation next week, so I'll get back to you after that.

@pags

pags commented Jun 26, 2018

Updating our AMIs with the newest CodeDeploy agent fixed the issue for us.

@rohkat-aws
Contributor

@Raniz85 I think we can reopen this if you still have issues. Closing this for now.

@jgerry

jgerry commented Jun 28, 2018

I'm testing the new agents now and it seems to be working, but my issue is mostly that something changed internally on the AWS side that caused this. I bake AMIs for some applications specifically to not get new versions of the agents frequently. I have apps that are using agent versions from 4 months ago that suddenly started having problems.

@petervandoros

@jgerry We delete the crontab for the autoupdate feature when baking the AMI. A couple of our systems weren't doing this and started seeing this error when the agent was updating itself to the latest version mid-deploy. I.e., the old version was failing to shut down when the autoupdate feature triggered a shutdown mid-deploy.
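A minimal sketch of that AMI-bake step, assuming the auto-update entry lives in a cron file at `/etc/cron.d/codedeploy-agent-update`. That path is an assumption and should be verified on the image being baked; `disable_auto_update` is a hypothetical helper name. The demo runs against a throwaway file so it is safe to execute anywhere.

```python
import tempfile
from pathlib import Path

# Assumption: where the agent installer places its auto-update cron entry
# on Linux. Verify this on your own image before relying on it.
DEFAULT_CRON_PATH = Path("/etc/cron.d/codedeploy-agent-update")


def disable_auto_update(cron_path: Path = DEFAULT_CRON_PATH) -> bool:
    """Delete the auto-update cron file. Return True if a file was removed."""
    try:
        cron_path.unlink()
    except FileNotFoundError:
        return False
    return True


# Demo against a temporary file instead of the real cron directory.
with tempfile.TemporaryDirectory() as tmp:
    fake_cron = Path(tmp) / "codedeploy-agent-update"
    fake_cron.write_text("# placeholder cron entry\n")
    first = disable_auto_update(fake_cron)   # file exists, gets removed
    second = disable_auto_update(fake_cron)  # already gone, nothing to do
print(first, second)
```

With auto-update disabled, the agent version stays pinned to whatever was baked into the AMI, so picking up fixes requires baking a new image.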

@mysteriouskangaroo

mysteriouskangaroo commented Aug 21, 2018

@rohkat-aws unfortunately, we too are running into this issue about half of the time with a simple scale-up from one to two servers. We have the latest OFFICIAL_1.0-1.1518_rpm on us-east-1. On autoscale, we push out two repos concurrently that are relatively small. They do, however, call out to composer to pull in external packages.

Our solution for now is likely going to be to just bake everything into AMIs, which is cumbersome :/. Is there any ETA on a fix?

@rohkat-aws
Contributor

@falcor781 is an update happening, or was the agent that was used already at 1518? Can you please also check your update logs?

@mysteriouskangaroo

The agent that was used was already 1518 starting from the AMI. I'm not sure which 'update logs' you are referring to as it is starting from an AMI snapshot; nothing should be getting updated.

@rohkat-aws
Contributor

@falcor781 just to confirm, you are getting Concurrent::RejectedExecutionError?
