[Bug] - SSM Auto Update conflicts with cloud-init package install (e.g. userData Docker installation fails on EC2. [Errno 2] No such file or directory) #397

Open
ambrosdavid opened this issue Jul 10, 2023 · 23 comments
Labels
aws-integration An issue integrating with an AWS Service bug Something isn't working

Comments

@ambrosdavid

Describe the bug
When creating an EC2 instance that uses userData to install Docker, the command yum install docker -y fails about 9 times out of 10 with the following error:
[Errno 2] No such file or directory: '/var/cache/dnf/amazonlinux-db3877fdc20f892f/packages/libnetfilter_conntrack-1.0.8-2.amzn2023.0.2.x86_64.rpm'
The dependency name in that path changes from run to run; it is not always the same package.

Out of 30 instances created, only about 4 installed Docker successfully, so occasionally no error occurs at all.

If I add a sleep of 10 s to the bash script before the yum install docker -y command, the installation works without any problem.
If I run the yum install docker -y command twice in the bash script, the installation works.
If I SSH into the created EC2 instance and manually run sudo yum install docker -y, the installation works.

To Reproduce
ImageId: ami-0f61de2873e29e866
InstanceType: "t2.micro",

userData script:

#!/bin/bash
set -x
yum update -y
yum upgrade -y
yum install -y docker
systemctl start docker
systemctl enable docker
usermod -aG docker ec2-user
wget https://github.com/docker/compose/releases/download/v2.15.1/docker-compose-$(uname -s)-$(uname -m) -O /usr/bin/docker-compose
chmod +x /usr/bin/docker-compose

Logs
cat /var/log/cloud-init-output.log >>>>>>>>

Cloud-init v. 22.2.2 running 'modules:config' at Mon, 10 Jul 2023 14:05:45 +0000. Up 11.41 seconds.
Cloud-init v. 22.2.2 running 'modules:final' at Mon, 10 Jul 2023 14:05:46 +0000. Up 12.60 seconds.
+ yum update -y
Amazon Linux 2023 repository                     21 MB/s |  15 MB     00:00    
Amazon Linux 2023 Kernel Livepatch repository   272 kB/s | 158 kB     00:00    
Last metadata expiration check: 0:00:01 ago on Mon Jul 10 14:05:58 2023.
Dependencies resolved.
Nothing to do.
Complete!
+ yum upgrade -y
Last metadata expiration check: 0:00:04 ago on Mon Jul 10 14:05:58 2023.
Dependencies resolved.
Nothing to do.
Complete!
+ yum install -y docker
Last metadata expiration check: 0:00:06 ago on Mon Jul 10 14:05:58 2023.
Dependencies resolved.
================================================================================
 Package                 Arch    Version                     Repository    Size
================================================================================
Installing:
 docker                  x86_64  20.10.23-1.amzn2023.0.1     amazonlinux   42 M
Installing dependencies:
 containerd              x86_64  1.6.19-1.amzn2023.0.1       amazonlinux   31 M
 iptables-libs           x86_64  1.8.8-3.amzn2023.0.2        amazonlinux  401 k
 iptables-nft            x86_64  1.8.8-3.amzn2023.0.2        amazonlinux  183 k
 libcgroup               x86_64  3.0-1.amzn2023.0.1          amazonlinux   75 k
 libnetfilter_conntrack  x86_64  1.0.8-2.amzn2023.0.2        amazonlinux   58 k
 libnfnetlink            x86_64  1.0.1-19.amzn2023.0.2       amazonlinux   30 k
 libnftnl                x86_64  1.2.2-2.amzn2023.0.2        amazonlinux   84 k
 pigz                    x86_64  2.5-1.amzn2023.0.3          amazonlinux   83 k
 runc                    x86_64  1.1.7-1.amzn2023.0.1        amazonlinux  3.0 M

Transaction Summary
================================================================================
Install  10 Packages

Total download size: 77 M
Installed size: 300 M
Downloading Packages:
(1/10): libnetfilter_conntrack-1.0.8-2.amzn2023 336 kB/s |  58 kB     00:00    
(2/10): libnftnl-1.2.2-2.amzn2023.0.2.x86_64.rp 440 kB/s |  84 kB     00:00    
(3/10): iptables-nft-1.8.8-3.amzn2023.0.2.x86_6 3.8 MB/s | 183 kB     00:00    
(4/10): libcgroup-3.0-1.amzn2023.0.1.x86_64.rpm 909 kB/s |  75 kB     00:00    
(5/10): iptables-libs-1.8.8-3.amzn2023.0.2.x86_ 4.6 MB/s | 401 kB     00:00    
(6/10): libnfnetlink-1.0.1-19.amzn2023.0.2.x86_ 616 kB/s |  30 kB     00:00    
(7/10): pigz-2.5-1.amzn2023.0.3.x86_64.rpm      1.2 MB/s |  83 kB     00:00    
(8/10): runc-1.1.7-1.amzn2023.0.1.x86_64.rpm     11 MB/s | 3.0 MB     00:00    
(9/10): docker-20.10.23-1.amzn2023.0.1.x86_64.r  29 MB/s |  42 MB     00:01    
(10/10): containerd-1.6.19-1.amzn2023.0.1.x86_6  18 MB/s |  31 MB     00:01    
--------------------------------------------------------------------------------
Total                                            32 MB/s |  77 MB     00:02     
[Errno 2] No such file or directory: '/var/cache/dnf/amazonlinux-db3877fdc20f892f/packages/libnetfilter_conntrack-1.0.8-2.amzn2023.0.2.x86_64.rpm'
The downloaded packages were saved in cache until the next successful transaction.
You can remove cached packages by executing 'yum clean packages'.
+ systemctl start docker
Failed to start docker.service: Unit docker.service not found.
+ systemctl enable docker
Failed to enable unit: Unit file docker.service does not exist.
+ usermod -aG docker ec2-user
usermod: group 'docker' does not exist
++ uname -s
++ uname -m
+ wget https://github.com/docker/compose/releases/download/v2.15.1/docker-compose-Linux-x86_64 -O /usr/bin/docker-compose
--2023-07-10 14:06:08--  https://github.com/docker/compose/releases/download/v2.15.1/docker-compose-Linux-x86_64
Resolving github.com (github.com)... 140.82.121.3
Connecting to github.com (github.com)|140.82.121.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/15045751/55771899-fdc1-4531-974a-0b71aea19e15?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20230710%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230710T140609Z&X-Amz-Expires=300&X-Amz-Signature=d3b008815385781f37ea4890bd1fc8dc49f04ae4cd27fcc6f1f5d44ebe1fbfc4&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=15045751&response-content-disposition=attachment%3B%20filename%3Ddocker-compose-linux-x86_64&response-content-type=application%2Foctet-stream [following]
--2023-07-10 14:06:09--  https://objects.githubusercontent.com/github-production-release-asset-2e65be/15045751/55771899-fdc1-4531-974a-0b71aea19e15?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20230710%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230710T140609Z&X-Amz-Expires=300&X-Amz-Signature=d3b008815385781f37ea4890bd1fc8dc49f04ae4cd27fcc6f1f5d44ebe1fbfc4&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=15045751&response-content-disposition=attachment%3B%20filename%3Ddocker-compose-linux-x86_64&response-content-type=application%2Foctet-stream
Resolving objects.githubusercontent.com (objects.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to objects.githubusercontent.com (objects.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 44953600 (43M) [application/octet-stream]
Saving to: ‘/usr/bin/docker-compose’

     0K .......... .......... .......... .......... ..........  0% 15.4M 3s
    50K .......... .......... .......... .......... ..........  0% 13.9M 3s
   100K .......... .......... .......... .......... ..........  0% 13.6M 3s
  [...]
 43850K .......... .......... .......... .......... ..........100% 83.0M 0s
 43900K                                                       100% 0.00 =1.1s

2023-07-10 14:06:10 (39.2 MB/s) - ‘/usr/bin/docker-compose’ saved [44953600/44953600]

+ chmod +x /usr/bin/docker-compose
Cloud-init v. 22.2.2 finished at Mon, 10 Jul 2023 14:06:11 +0000. Datasource DataSourceEc2.  Up 37.29 seconds
@stewartsmith stewartsmith added the bug Something isn't working label Jul 10, 2023
@stewartsmith
Member

Well that's interesting!

Thanks for the report, certainly something we're going to have to dive into and understand what's going on.

Out of interest, have you tried any instance sizes larger than t2.micro? Is there any chance there's something in dmesg about running out of memory? (No need to burn a bunch of your EC2 $ on this if you haven't tested; we can certainly test this at scale with the steps you have provided.)

@ambrosdavid
Author

Hello Stewart, thanks for the reply. I have just created 5 t3.medium instances, and Docker installed successfully on all of them without any problem and without the sleep command 👍

@mwebber

mwebber commented Jul 28, 2023

cc: @stewartsmith
Just to contribute, we had occasional failures with a related but different error message, and we found that a sleep was required and sufficient to fix the problem. This is on a t3.large.

The error message we got if we tried the dnf install too early was
"can't create transaction lock on /var/lib/rpm/.rpm.lock (Resource temporarily unavailable)"

Here's the relevant part of our userdata script, with our workaround:

# Script written for Amazon Linux 2023

printf "\n*** $(date +'%Y-%m-%d %H:%M:%S') INITIAL SETUP STARTING\n"
uname -a

printf "\n\n*** $(date +'%Y-%m-%d %H:%M:%S') Checking for and applying OS updates\n"
system_release=$(rpm -q system-release --qf "%{VERSION}")
printf "\n*** Current release ${system_release}\n"
dnf check-release-update
dnf check-update --releasever=${system_release}
dnf update -y --releasever=${system_release}

# sleep, in case of this error:
# "can't create transaction lock on /var/lib/rpm/.rpm.lock (Resource temporarily unavailable)"
sleep 60
printf "\n\n*** $(date +'%Y-%m-%d %H:%M:%S') Installing additional packages\n"
dnf -y install {{AMI_PACKAGES}}

@ambrosdavid
Author

It looks like the userData script is executed as soon as the instance is running, before it has finished initialising, instead of waiting for the instance to reach 'OK' status. Someone also reported a similar problem in this StackOverflow post.

@bbenson29

I had a similar issue, and this is what I did to fix it:

[Errno 2] No such file or directory: '/var/cache/dnf/amazonlinux-84ef13e8f4afd0b4/packages/libsepol-devel-3.4-3.amzn2023.0.3.x86_64.rpm'
The downloaded packages were saved in cache until the next successful transaction.
You can remove cached packages by executing 'dnf clean packages'.

Fix

sudo dnf upgrade --refresh rpm glibc
sudo rm /var/lib/rpm/.rpm.lock
dnf -y update
dnf install  <MY PACKAGES>

@supergibbs

@bbenson29 Had any issues with that? A bit hesitant to delete a lock. Would it be better to add a sleep 60 to ensure cloud-init is done?

@BlackDark

Same problem here. Why is this still not fixed?
This issue even happens with a simple cloud-init config that uses the packages directive directly, without any custom scripts.

@supergibbs

AWS Support recommended the following. I agree that there seems to be an issue and that we shouldn't need this (we didn't in v1 or v2), but it's been working for me.

while true; do
    dnf install --assumeyes docker && break
done

while true; do
    dnf update --assumeyes && break
done
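
For anyone adapting these loops: as written they retry immediately and never give up, so a failure that is not transient would spin forever. A bounded variant with a pause between attempts might look like the sketch below. This is an assumption on my part, not what AWS Support sent; `retry` is a hypothetical helper name, and the attempt count and delay in the usage comment are arbitrary.

```shell
#!/bin/bash
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: re-runs COMMAND until it exits 0 or MAX_ATTEMPTS
# attempts have been made, sleeping DELAY_SECONDS between attempts.
retry() {
    local max_attempts="$1"
    local delay="$2"
    shift 2
    local attempt=1
    until "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "retry: giving up on '$*' after $attempt attempts" >&2
            return 1
        fi
        echo "retry: attempt $attempt of '$*' failed; sleeping ${delay}s" >&2
        sleep "$delay"
        attempt=$((attempt + 1))
    done
}

# Possible use in userData (values are illustrative, not tuned):
#   retry 10 5 dnf install --assumeyes docker
#   retry 10 5 dnf update --assumeyes
```

Like the original loops, this still leaves one failed entry in the dnf history per unsuccessful attempt.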

@rbpltr

rbpltr commented Oct 18, 2023

I've been trying to get to the bottom of this issue over the last 48 hours and I think it is caused by the SSM agent updater running at the same time as the user data scripts.

See: https://docs.aws.amazon.com/systems-manager/latest/userguide/ssm-agent-automatic-updates.html#ssm-agent-automatic-updates-console

I've disabled this in SSM Fleet Manager and the issue has gone away.

If you do disable the automatic SSM agent updates, it's important that you know the consequences of this and implement the updates in another way!

Edit: I'm specifically referring to the RPM lock issue btw...

@AB-DBMC

AB-DBMC commented Oct 18, 2023

The malfunction is apparently triggered by the Auto update SSM agent function. The SSM automation document AWS-UpdateSSMAgent is then executed on the EC2 instance, which eventually leads to this error when other YUM/DNF commands are processed at the same time:

RPM: error: can't create transaction lock on /var/lib/rpm/.rpm.lock (Resource temporarily unavailable)
Error: Could not run transaction.

A quick and dirty workaround is to temporarily disable the Auto update SSM agent feature (Try it at your own risk, but it is not generally recommended).

Systems Manager > Fleet Manager > Settings > Auto update SSM agent -> Disable

After that, the error will no longer occur. As long as AWS does not have a bugfix here, the workaround of @supergibbs makes the most sense, but leads to multiple entries under yum history.
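
If you go the retry route, it may be worth retrying only when the failure really is the lock conflict, so that unrelated errors (a missing package, no network) still fail fast. A minimal sketch, assuming the dnf output has been captured to a file; `is_rpm_lock_error` is a hypothetical helper name and the file path in the usage comment is illustrative. It only matches the lock message quoted above, not the `[Errno 2]` variant from the original report.

```shell
#!/bin/bash
# is_rpm_lock_error FILE
# Hypothetical helper: exits 0 if FILE contains the RPM transaction lock
# error quoted in this thread, and non-zero otherwise.
is_rpm_lock_error() {
    grep -qF "can't create transaction lock on /var/lib/rpm/.rpm.lock" "$1"
}

# Possible use in userData (path and delay are illustrative):
#   out=/tmp/dnf-install.log
#   until dnf install -y docker >"$out" 2>&1; do
#       is_rpm_lock_error "$out" || { cat "$out" >&2; exit 1; }
#       sleep 5
#   done
```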

@stewartsmith stewartsmith changed the title from "[Bug] - userData Docker installation fails on EC2. [Errno 2] No such file or directory" to "[Bug] - SSM Auto Update conflicts with cloud-init package install (e.g. userData Docker installation fails on EC2. [Errno 2] No such file or directory)" Oct 20, 2023
@stewartsmith
Member

We've passed this along to the SSM team and they're tracking it in an internal ticket. I'll try and keep an eye on it.

@stewartsmith stewartsmith added the aws-integration An issue integrating with an AWS Service label Oct 20, 2023
@BenCoffeed

Any update here? Having a real blast with 2023-powered EB engines.

@hastarin

hastarin commented Jan 18, 2024

@stewartsmith Would there be any update?

Surely the low number of thumbs up here is more a reflection of those that aren't aware of the issue or have been forced to work around it.

To add yet another workaround: I ended up using the minimal AMI. I installed aws-cfn-bootstrap via UserData, then installed both docker and amazon-ssm-agent as part of CloudInit, making sure to start amazon-ssm-agent last.

@rrehbein

In our project we were starting SSM early on the minimal AMI. We added this snippet to our user data. (Whitespace added for readability.)

systemctl enable amazon-ssm-agent.service --now
# Give service a moment to start to create its first logs
sleep 2 
# Watch the logs for the completion of self-update
stdbuf -i0 -o0 -e0 tail -n +0 -f /var/log/amazon/ssm/amazon-ssm-agent.log \
    | awk -e '
        /"awsupdateSsmAgent":/,/("status":)/ {
            if (/"(Success|Skipped)"/) {
                print "awsupdateSsmAgent: " $0;
                exit
            }
        }
    '
  • stdbuf -i0 -o0 -e0 tail -f ... - unbuffered tail
  • awk ... - a multi-line string match that exits on the first match of status:Success or status:Skipped.

We originally used a plain sleep 10, but we hit random timing issues where 10 seconds was sometimes not enough. Rather than increasing the sleep to be safe, we switched to watching the logs for a key event.
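
One caveat with the tail-and-awk pipeline is that it blocks indefinitely if the expected log entry never appears. A polling variant with an upper bound could look like the sketch below. It reuses the same awk range match; `wait_for_ssm_update` is a hypothetical helper name, the timeout is counted in one-second sleeps and is therefore approximate, and the log format is assumed to be whatever the awk pattern above already matches.

```shell
#!/bin/bash
# wait_for_ssm_update LOG_FILE TIMEOUT_SECONDS
# Hypothetical helper: rescans LOG_FILE about once per second until the
# "awsupdateSsmAgent" block reports "Success" or "Skipped", or until roughly
# TIMEOUT_SECONDS have passed. Exits 0 on a match, 1 on timeout.
wait_for_ssm_update() {
    local log_file="$1"
    local timeout="$2"
    local waited=0
    while :; do
        if [ -r "$log_file" ] && awk '
            /"awsupdateSsmAgent":/,/("status":)/ {
                if (/"(Success|Skipped)"/) { found = 1; exit }
            }
            END { exit(found ? 0 : 1) }
        ' "$log_file"; then
            return 0
        fi
        if [ "$waited" -ge "$timeout" ]; then
            echo "wait_for_ssm_update: no result in $log_file after ${timeout}s" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
}

# Possible use after starting the agent (timeout value is illustrative):
#   wait_for_ssm_update /var/log/amazon/ssm/amazon-ssm-agent.log 120
```

Rescanning the whole file on every pass is wasteful for large logs, but the agent log is short this early in boot, and it avoids leaving a background tail process behind.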

@andreverheij

andreverheij commented Jan 24, 2024

I'm having the same problem installing the CloudWatch agent. I download the rpm file, then dnf install runs while dnf clean all is running in the background; this breaks the dnf install command, and my machines never finish building. With a sleep before my dnf install command, the machines build, but I'd rather not need it. One key element of using an ASG is that a new machine is built quickly, without having to wait 60 seconds for no good reason.

@singlewind

Just adding some steps to easily replicate the issue.

Add the following userdata as cloud-config:

#cloud-config

repo_update: true
repo_upgrade: true
package_reboot_if_required: true

packages:
  - docker
  - postgresql15
  - python3-boto3

This randomly gives the error below:

Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
RPM: error: can't create transaction lock on /var/lib/rpm/.rpm.lock (Resource temporarily unavailable)
The downloaded packages were saved in cache until the next successful transaction.
You can remove cached packages by executing 'dnf clean packages'.
Error: Could not run transaction.

Recently, this has been happening more frequently.

We are using

Linux ip-10-42-102-106.ap-southeast-2.compute.internal 6.1.79-99.164.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Feb 27 18:02:23 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

cloud-init version is 22.2.2

@BlackDark

Here is another possible hotfix, which works well for me:

  • stop the SSM agent in bootcmd
  • do your stuff
  • at the end of runcmd, start the SSM agent again

package_update: true
package_upgrade: true

bootcmd:
  - systemctl stop amazon-ssm-agent

runcmd:
  - ... your stuff here
  - systemctl start amazon-ssm-agent # at the end restart

@xu-lei-richard

Any update on this issue?
I had put in a hack using 'sleep 30', which used to work. Now this hack no longer works.

@rowanbeentje

We've also successfully been using the workaround posted above but something further seems to have changed in the last couple of weeks and we've had to apply it to more scripts. Fortunately it does still seem to work although our simple setup scripts are getting messier!

@Booligoosh

Booligoosh commented Apr 24, 2024

@BlackDark's workaround could cause issues when rebooting the instance, as bootcmd runs on every boot, but runcmd only runs when starting the instance for the first time, meaning the SSM Agent won't start back up. This workaround worked for us, and restarts the SSM agent even when rebooting:

# Workaround for this issue: https://github.com/amazonlinux/amazon-linux-2023/issues/397
# Stops the SSM Agent from updating during the cloud-init process while we're trying to install packages, then restarts once cloud-init is complete.
bootcmd:
  - systemctl stop amazon-ssm-agent
write_files:
  - path: /var/lib/cloud/scripts/per-boot/startSsmAgent.sh
    permissions: "0755" # Allow owner to read/write/execute, others to read/execute
    content: |
      #!/bin/sh
      systemctl start amazon-ssm-agent
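
For plain shell userData (as in the original report) rather than cloud-config, the same stop, work, restart idea could be sketched as a wrapper. `run_with_ssm_agent_stopped` is a hypothetical helper name. Note the assumption here: a userData script runs later in boot than bootcmd, so the agent may already have begun its self-update by the time this stops it, and I have not verified that this is early enough on every instance type.

```shell
#!/bin/bash
# run_with_ssm_agent_stopped COMMAND [ARGS...]
# Hypothetical helper: stops amazon-ssm-agent, runs COMMAND, then starts the
# agent again whether or not COMMAND succeeded. Returns COMMAND's exit status.
run_with_ssm_agent_stopped() {
    local status=0
    systemctl stop amazon-ssm-agent
    "$@" || status=$?
    systemctl start amazon-ssm-agent
    return "$status"
}

# Possible use in userData:
#   run_with_ssm_agent_stopped dnf install -y docker
```

Because the agent is restarted inside the same script, this variant does not depend on a per-boot script to bring it back.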

@avoidik

avoidik commented May 5, 2024

I've had a similar issue on Ubuntu for a long time. The only thing that worked for me was to wait for the rpm lock to be released and then do the rest on my own:

#cloud-config

package_update: false
package_upgrade: false

runcmd:
  - while fuser /var/lib/rpm/.rpm.lock > /dev/null 2>&1 ; do sleep 1 ; done
  - dnf install -y docker
  - systemctl enable docker.service
  - systemctl start docker.service
  - usermod -a -G docker ec2-user
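
A variant of the wait loop with an upper bound, so that a stuck lock holder cannot hang first boot forever, might look like this sketch. `wait_for_rpm_lock` is a hypothetical helper name and the timeout is counted in one-second sleeps. Two caveats: if fuser (from the psmisc package) is not installed, the check fails immediately and the function returns as if the lock were free, just like the loop above; and a lock that is free at check time can still be taken before dnf starts, which may be why waiting alone does not help in every case.

```shell
#!/bin/bash
# wait_for_rpm_lock TIMEOUT_SECONDS [LOCK_FILE]
# Hypothetical helper: waits until fuser reports that no process has
# LOCK_FILE open, or until roughly TIMEOUT_SECONDS have passed.
# Exits 0 once the file is free, 1 on timeout.
wait_for_rpm_lock() {
    local timeout="$1"
    local lock_file="${2:-/var/lib/rpm/.rpm.lock}"
    local waited=0
    while fuser "$lock_file" >/dev/null 2>&1; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "wait_for_rpm_lock: $lock_file still busy after ${timeout}s" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
}

# Possible use in a userData script (timeout value is illustrative):
#   wait_for_rpm_lock 120 && dnf install -y docker
```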

@mg-alanjones

@avoidik Thank you for posting however that solution doesn't work for me on a t3a.small.

@avoidik

avoidik commented May 10, 2024

@mg-alanjones since it's just a workaround you're free to experiment with it, for instance try to increase sleep timeout, wait for some other mutex, or perhaps wait for amazon guys to fix this problem, cheers
