[Bug] - SSM Auto Update conflicts with cloud-init package install (e.g. userData Docker installation fails on EC2. [Errno 2] No such file or directory) #397
Comments
Well, that's interesting! Thanks for the report; this is certainly something we're going to have to dive into to understand what's going on. Out of interest, have you tried any instance sizes larger than |
Hello Stewart, thanks for the reply. I have just tried to create 5 |
cc: @stewartsmith The error message we got if we tried the Here's the relevant part of our
# Script written for Amazon Linux 2023
printf "\n*** $(date +'%Y-%m-%d %H:%M:%S') INITIAL SETUP STARTING\n"
uname -a
printf "\n\n*** $(date +'%Y-%m-%d %H:%M:%S') Checking for and applying OS updates\n"
system_release=$(rpm -q system-release --qf "%{VERSION}")
printf "\n*** Current release ${system_release}\n"
dnf check-release-update
dnf check-update --releasever=${system_release}
dnf update -y --releasever=${system_release}
# sleep, in case of this error:
# "can't create transaction lock on /var/lib/rpm/.rpm.lock (Resource temporarily unavailable)"
sleep 60
printf "\n\n*** $(date +'%Y-%m-%d %H:%M:%S') Installing additional packages\n"
dnf -y install {{AMI_PACKAGES}} |
It looks like the userData script is executed as soon as the instance is running, before it has finished initialising, instead of waiting for it to reach 'OK' status. Someone also reported a similar problem in this StackOverflow post. |
I had a similar issue and this is what I did to fix it
Fix
|
@bbenson29 Had any issues with that? A bit hesitant to delete a lock. Would it be better to add a |
Same problem here. Why is this still not fixed? |
AWS Support recommended the following. I agree that there seems to be an issue and that we shouldn't need this (we didn't in v1 or v2), but it's been working for me.
|
I've been trying to get to the bottom of this issue over the last 48 hours and I think it is caused by the SSM agent updater running at the same time as the user data scripts. I've disabled this in SSM Fleet Manager and the issue has gone away. If you do disable the automatic SSM agent updates, it's important that you know the consequences of this and implement the updates in another way! Edit: I'm specifically referring to the RPM lock issue btw... |
The malfunction is apparently triggered by the Auto update SSM agent function. The SSM automation document AWS-UpdateSSMAgent is then executed on the EC2 instance, which eventually leads to this error when other YUM/DNF commands are processed at the same time:
A quick and dirty workaround is to temporarily disable the Auto update SSM agent feature (Try it at your own risk, but it is not generally recommended).
After that, the error will no longer occur. As long as AWS does not have a bugfix here, the workaround of @supergibbs makes the most sense, but leads to multiple entries under yum history. |
We've passed this along to the SSM team and they're tracking it in an internal ticket. I'll try and keep an eye on it. |
Any update here? Having a real blast with 2023-powered EB engines. |
@stewartsmith Is there any update? Surely the low number of thumbs up here is more a reflection of those who aren't aware of the issue or have been forced to work around it. To add yet another workaround: I ended up using the MINIMAL AMI, installed aws-cfn-bootstrap via UserData, and then installed both docker and amazon-ssm-agent as part of CloudInit, making sure to start the amazon-ssm-agent last. |
In our project we were starting ssm early on minimal. We added this snippet to our user data. (White space added for readability)
systemctl enable amazon-ssm-agent.service --now
# Give service a moment to start to create its first logs
sleep 2
# Watch the logs for the completion of self-update
stdbuf -i0 -o0 -e0 tail -n +0 -f /var/log/amazon/ssm/amazon-ssm-agent.log \
| awk -e '
/"awsupdateSsmAgent":/,/("status":)/ {
if (/"(Success|Skipped)"/) {
print "awsupdateSsmAgent: " $0;
exit
}
}
'
We were using just a plain |
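The `tail -f | awk` snippet above blocks forever if the expected log line never shows up. A minimal sketch of a bounded alternative follows. It is a simplification and not the poster's method: the function name `wait_for_log_pattern` is hypothetical, it polls with `grep -E` once per second rather than following the file, and it matches any line in the file rather than only lines inside the `"awsupdateSsmAgent":` block. What the agent actually writes to its log is taken from the snippet above and is not verified here.

```shell
# wait_for_log_pattern FILE REGEX TIMEOUT_SECONDS
# Hypothetical helper: polls FILE once per second until some line matches
# REGEX (grep -E syntax). Returns 0 on a match, 1 if no match was seen
# within TIMEOUT_SECONDS checks.
wait_for_log_pattern() {
    _wlp_file=$1
    _wlp_regex=$2
    _wlp_timeout=$3
    _wlp_elapsed=0
    while [ "$_wlp_elapsed" -lt "$_wlp_timeout" ]; do
        # The file may not exist yet if the service has only just started.
        if [ -f "$_wlp_file" ] && grep -Eq "$_wlp_regex" "$_wlp_file"; then
            return 0
        fi
        sleep 1
        _wlp_elapsed=$((_wlp_elapsed + 1))
    done
    return 1
}

# Example call (log path and pattern copied from the snippet above):
#   wait_for_log_pattern /var/log/amazon/ssm/amazon-ssm-agent.log \
#       '"(Success|Skipped)"' 120 || echo "timed out waiting for SSM self-update" >&2
```

Because the match is not scoped to the update block, an unrelated line containing "Success" would also end the wait; the range-based awk match above is stricter.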
I'm having the same problem installing the CloudWatch agent. I download the rpm file, then dnf install runs while dnf clean all runs in the background, which breaks the dnf install command, and my machines never finish building. If I set a sleep before my dnf install command, the machines build, but I'd like not to need this. One key element of using an ASG is that a new machine is built quickly, without having to wait 60 seconds for no good reason. |
Fix deployment issue bug : amazonlinux/amazon-linux-2023#397
Just adding some steps to easily replicate the issue. Add userdata as cloud config.
It randomly gives the error below
Recently, this has been happening more frequently. We are using
cloud-init version is 22.2.2 |
Maybe also as a hotfix which works well for me.
|
Any update for this issue? |
We've also successfully been using the workaround posted above but something further seems to have changed in the last couple of weeks and we've had to apply it to more scripts. Fortunately it does still seem to work although our simple setup scripts are getting messier! |
@BlackDark's workaround could cause issues when rebooting the instance, as bootcmd runs on every boot, but runcmd only runs when starting the instance for the first time, meaning the SSM Agent won't start back up. This workaround worked for us, and restarts the SSM agent even when rebooting:
# Workaround for this issue: https://github.com/amazonlinux/amazon-linux-2023/issues/397
# Stops the SSM Agent from updating during the cloud-init process while we're trying to install packages, then restarts once cloud-init is complete.
bootcmd:
- systemctl stop amazon-ssm-agent
write_files:
- path: /var/lib/cloud/scripts/per-boot/startSsmAgent.sh
permissions: "0755" # Allow owner to read/write/execute, others to read/execute
content: |
#!/bin/sh
systemctl start amazon-ssm-agent |
I've had a similar issue on ubuntu for a long time; the only way for me was to wait for the rpm lock to be released and then do the rest on my own:
#cloud-config
package_update: false
package_upgrade: false
runcmd:
- while fuser /var/lib/rpm/.rpm.lock > /dev/null 2>&1 ; do sleep 1 ; done
- dnf install -y docker
- systemctl enable docker.service
- systemctl start docker.service
- usermod -a -G docker ec2-user |
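The `while fuser ...` loop above waits indefinitely if something keeps the lock file open. A hedged sketch of the same idea with an upper bound is below. The function name `wait_for_lock_release` and the timeout values are assumptions for illustration. Like the original loop, it only checks whether some process currently has the file open, so another process can still grab the lock between the check and the install.

```shell
# wait_for_lock_release LOCK_FILE TIMEOUT_SECONDS
# Hypothetical helper: waits until no process has LOCK_FILE open, checking
# once per second with fuser (exit status 0 means the file is in use).
# Returns 0 once the file is free, 1 if it is still in use after
# TIMEOUT_SECONDS.
wait_for_lock_release() {
    _wlr_lock=$1
    _wlr_timeout=$2
    _wlr_waited=0
    while fuser "$_wlr_lock" >/dev/null 2>&1; do
        if [ "$_wlr_waited" -ge "$_wlr_timeout" ]; then
            return 1
        fi
        sleep 1
        _wlr_waited=$((_wlr_waited + 1))
    done
    return 0
}

# Example call (lock path copied from the cloud-config above):
#   wait_for_lock_release /var/lib/rpm/.rpm.lock 120 && dnf install -y docker
```

Returning a non-zero status on timeout lets the calling script decide whether to fail the boot or continue anyway, instead of hanging.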
@avoidik Thank you for posting however that solution doesn't work for me on a |
@mg-alanjones since it's just a workaround you're free to experiment with it, for instance try to increase sleep timeout, wait for some other mutex, or perhaps wait for amazon guys to fix this problem, cheers |
Describe the bug
When creating an EC2 instance using userData to init Docker, 9/10 times the command
yum install docker -y
fails, giving the following error: [Errno 2] No such file or directory: '/var/cache/dnf/amazonlinux-db3877fdc20f892f/packages/libnetfilter_conntrack-1.0.8-2.amzn2023.0.2.x86_64.rpm'
The dependency name in that path also changes and it's not always the same.
Out of 30 instances created, only about 4 had their docker installation successful, so sometimes for some reason it doesn't give any error.
If I use a sleep of 10s inside the bash script before the
yum install docker -y
command, the installation works without any problem.
If I write the
yum install docker -y
command twice in the bash script, the installation works.
If I ssh into the created EC2 instance and manually execute
sudo yum install docker -y
the installation works.
To Reproduce
ImageId: ami-0f61de2873e29e866
InstanceType: "t2.micro",
userData script:
#!/bin/bash
set -x
yum update -y
yum upgrade -y
yum install -y docker
systemctl start docker
systemctl enable docker
usermod -aG docker ec2-user
wget https://github.com/docker/compose/releases/download/v2.15.1/docker-compose-$(uname -s)-$(uname -m) -O /usr/bin/docker-compose
chmod +x /usr/bin/docker-compose
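The description above notes that writing the install command twice makes it succeed. One way to express that as a bounded retry, rather than a duplicated line or a fixed sleep, is sketched below. The helper name `retry` and the attempt and delay values are assumptions, and this is only a workaround sketch, not a fix for the underlying conflict.

```shell
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it succeeds, at most MAX_ATTEMPTS
# times, sleeping DELAY_SECONDS between attempts.
# Returns 0 on the first success, 1 if every attempt failed.
retry() {
    _retry_max=$1
    _retry_delay=$2
    shift 2
    _retry_attempt=1
    while :; do
        if "$@"; then
            return 0
        fi
        if [ "$_retry_attempt" -ge "$_retry_max" ]; then
            echo "retry: giving up after $_retry_attempt attempts: $*" >&2
            return 1
        fi
        echo "retry: attempt $_retry_attempt failed, sleeping ${_retry_delay}s" >&2
        sleep "$_retry_delay"
        _retry_attempt=$((_retry_attempt + 1))
    done
}

# Example call in the userData script above, replacing the plain install line:
#   retry 5 10 yum install -y docker
```

As noted elsewhere in this thread, repeated attempts may show up as multiple entries under yum history.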
Logs
cat /var/log/cloud-init-output.log >>>>>>>>