[Bug] - SSM Auto Update conflicts with cloud-init package install (e.g. userData Docker installation fails on EC2. [Errno 2] No such file or directory) #397

ambrosdavid · 2023-07-10T14:34:09Z

Describe the bug
When creating an EC2 instance that uses userData to install Docker, the command yum install docker -y fails about 9 times out of 10 with the following error:
[Errno 2] No such file or directory: '/var/cache/dnf/amazonlinux-db3877fdc20f892f/packages/libnetfilter_conntrack-1.0.8-2.amzn2023.0.2.x86_64.rpm'
The dependency named in that path is not always the same; it changes from run to run.

Out of 30 instances created, only about 4 installed Docker successfully, so occasionally, for some reason, the error does not occur.

If I add a sleep of 10s to the bash script before the yum install docker -y command, the installation works without any problem.
If I write the yum install docker -y command twice in the bash script, the installation works.
If I ssh into the created EC2 instance and manually run sudo yum install docker -y, the installation works.

To Reproduce
ImageId: ami-0f61de2873e29e866
InstanceType: "t2.micro",

userData script:

#!/bin/bash
set -x
yum update -y
yum upgrade -y
yum install -y docker
systemctl start docker
systemctl enable docker
usermod -aG docker ec2-user
wget https://github.com/docker/compose/releases/download/v2.15.1/docker-compose-$(uname -s)-$(uname -m) -O /usr/bin/docker-compose
chmod +x /usr/bin/docker-compose

Logs
cat /var/log/cloud-init-output.log >>>>>>>>

Cloud-init v. 22.2.2 running 'modules:config' at Mon, 10 Jul 2023 14:05:45 +0000. Up 11.41 seconds.
Cloud-init v. 22.2.2 running 'modules:final' at Mon, 10 Jul 2023 14:05:46 +0000. Up 12.60 seconds.
+ yum update -y
Amazon Linux 2023 repository                     21 MB/s |  15 MB     00:00    
Amazon Linux 2023 Kernel Livepatch repository   272 kB/s | 158 kB     00:00    
Last metadata expiration check: 0:00:01 ago on Mon Jul 10 14:05:58 2023.
Dependencies resolved.
Nothing to do.
Complete!
+ yum upgrade -y
Last metadata expiration check: 0:00:04 ago on Mon Jul 10 14:05:58 2023.
Dependencies resolved.
Nothing to do.
Complete!
+ yum install -y docker
Last metadata expiration check: 0:00:06 ago on Mon Jul 10 14:05:58 2023.
Dependencies resolved.
================================================================================
 Package                 Arch    Version                     Repository    Size
================================================================================
Installing:
 docker                  x86_64  20.10.23-1.amzn2023.0.1     amazonlinux   42 M
Installing dependencies:
 containerd              x86_64  1.6.19-1.amzn2023.0.1       amazonlinux   31 M
 iptables-libs           x86_64  1.8.8-3.amzn2023.0.2        amazonlinux  401 k
 iptables-nft            x86_64  1.8.8-3.amzn2023.0.2        amazonlinux  183 k
 libcgroup               x86_64  3.0-1.amzn2023.0.1          amazonlinux   75 k
 libnetfilter_conntrack  x86_64  1.0.8-2.amzn2023.0.2        amazonlinux   58 k
 libnfnetlink            x86_64  1.0.1-19.amzn2023.0.2       amazonlinux   30 k
 libnftnl                x86_64  1.2.2-2.amzn2023.0.2        amazonlinux   84 k
 pigz                    x86_64  2.5-1.amzn2023.0.3          amazonlinux   83 k
 runc                    x86_64  1.1.7-1.amzn2023.0.1        amazonlinux  3.0 M

Transaction Summary
================================================================================
Install  10 Packages

Total download size: 77 M
Installed size: 300 M
Downloading Packages:
(1/10): libnetfilter_conntrack-1.0.8-2.amzn2023 336 kB/s |  58 kB     00:00    
(2/10): libnftnl-1.2.2-2.amzn2023.0.2.x86_64.rp 440 kB/s |  84 kB     00:00    
(3/10): iptables-nft-1.8.8-3.amzn2023.0.2.x86_6 3.8 MB/s | 183 kB     00:00    
(4/10): libcgroup-3.0-1.amzn2023.0.1.x86_64.rpm 909 kB/s |  75 kB     00:00    
(5/10): iptables-libs-1.8.8-3.amzn2023.0.2.x86_ 4.6 MB/s | 401 kB     00:00    
(6/10): libnfnetlink-1.0.1-19.amzn2023.0.2.x86_ 616 kB/s |  30 kB     00:00    
(7/10): pigz-2.5-1.amzn2023.0.3.x86_64.rpm      1.2 MB/s |  83 kB     00:00    
(8/10): runc-1.1.7-1.amzn2023.0.1.x86_64.rpm     11 MB/s | 3.0 MB     00:00    
(9/10): docker-20.10.23-1.amzn2023.0.1.x86_64.r  29 MB/s |  42 MB     00:01    
(10/10): containerd-1.6.19-1.amzn2023.0.1.x86_6  18 MB/s |  31 MB     00:01    
--------------------------------------------------------------------------------
Total                                            32 MB/s |  77 MB     00:02     
[Errno 2] No such file or directory: '/var/cache/dnf/amazonlinux-db3877fdc20f892f/packages/libnetfilter_conntrack-1.0.8-2.amzn2023.0.2.x86_64.rpm'
The downloaded packages were saved in cache until the next successful transaction.
You can remove cached packages by executing 'yum clean packages'.
+ systemctl start docker
Failed to start docker.service: Unit docker.service not found.
+ systemctl enable docker
Failed to enable unit: Unit file docker.service does not exist.
+ usermod -aG docker ec2-user
usermod: group 'docker' does not exist
++ uname -s
++ uname -m
+ wget https://github.com/docker/compose/releases/download/v2.15.1/docker-compose-Linux-x86_64 -O /usr/bin/docker-compose
--2023-07-10 14:06:08--  https://github.com/docker/compose/releases/download/v2.15.1/docker-compose-Linux-x86_64
Resolving github.com (github.com)... 140.82.121.3
Connecting to github.com (github.com)|140.82.121.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/15045751/55771899-fdc1-4531-974a-0b71aea19e15?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20230710%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230710T140609Z&X-Amz-Expires=300&X-Amz-Signature=d3b008815385781f37ea4890bd1fc8dc49f04ae4cd27fcc6f1f5d44ebe1fbfc4&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=15045751&response-content-disposition=attachment%3B%20filename%3Ddocker-compose-linux-x86_64&response-content-type=application%2Foctet-stream [following]
--2023-07-10 14:06:09--  https://objects.githubusercontent.com/github-production-release-asset-2e65be/15045751/55771899-fdc1-4531-974a-0b71aea19e15?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20230710%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230710T140609Z&X-Amz-Expires=300&X-Amz-Signature=d3b008815385781f37ea4890bd1fc8dc49f04ae4cd27fcc6f1f5d44ebe1fbfc4&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=15045751&response-content-disposition=attachment%3B%20filename%3Ddocker-compose-linux-x86_64&response-content-type=application%2Foctet-stream
Resolving objects.githubusercontent.com (objects.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to objects.githubusercontent.com (objects.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 44953600 (43M) [application/octet-stream]
Saving to: ‘/usr/bin/docker-compose’

     0K .......... .......... .......... .......... ..........  0% 15.4M 3s
    50K .......... .......... .......... .......... ..........  0% 13.9M 3s
   100K .......... .......... .......... .......... ..........  0% 13.6M 3s
  [...]
 43850K .......... .......... .......... .......... ..........100% 83.0M 0s
 43900K                                                       100% 0.00 =1.1s

2023-07-10 14:06:10 (39.2 MB/s) - ‘/usr/bin/docker-compose’ saved [44953600/44953600]

+ chmod +x /usr/bin/docker-compose
Cloud-init v. 22.2.2 finished at Mon, 10 Jul 2023 14:06:11 +0000. Datasource DataSourceEc2.  Up 37.29 seconds


stewartsmith · 2023-07-10T16:16:34Z

Well that's interesting!

Thanks for the report, certainly something we're going to have to dive into and understand what's going on.

Out of interest, have you tried any instance sizes larger than t2.micro? Any chance there's something in dmesg about running out of memory or anything? (No need to burn a bunch of your EC2 $ on this if you haven't tested; we can certainly test at scale for this with the steps you have provided.)

ambrosdavid · 2023-07-12T16:13:58Z

Hello Stewart, thanks for the reply. I have just tried creating 5 t3.medium instances, and Docker installed successfully on all of them without any problem and without needing the sleep command 👍

mwebber · 2023-07-28T15:14:10Z

cc: @stewartsmith
Just to contribute, we had occasional failures with a related but different error message, and we found that a sleep was required and sufficient to fix the problem. This is on a t3.large.

The error message we got if we tried the dnf install too early was
"can't create transaction lock on /var/lib/rpm/.rpm.lock (Resource temporarily unavailable)"

Here's the relevant part of our userdata script, with our workaround:

# Script written for Amazon Linux 2023

printf "\n*** $(date +'%Y-%m-%d %H:%M:%S') INITIAL SETUP STARTING\n"
uname -a

printf "\n\n*** $(date +'%Y-%m-%d %H:%M:%S') Checking for and applying OS updates\n"
system_release=$(rpm -q system-release --qf "%{VERSION}")
printf "\n*** Current release ${system_release}\n"
dnf check-release-update
dnf check-update --releasever=${system_release}
dnf update -y --releasever=${system_release}

# sleep, in case of this error:
# "can't create transaction lock on /var/lib/rpm/.rpm.lock (Resource temporarily unavailable)"
sleep 60
printf "\n\n*** $(date +'%Y-%m-%d %H:%M:%S') Installing additional packages\n"
dnf -y install {{AMI_PACKAGES}}

ambrosdavid · 2023-07-28T15:25:57Z

It looks like the userData script is executed as soon as the instance is running, before it has finished initialising, instead of waiting for it to reach 'OK' status. Someone also reported a similar problem in this StackOverflow post.

bbenson29 · 2023-08-21T01:27:40Z

I had a similar issue and this is what I did to fix it

[Errno 2] No such file or directory: '/var/cache/dnf/amazonlinux-84ef13e8f4afd0b4/packages/libsepol-devel-3.4-3.amzn2023.0.3.x86_64.rpm'
The downloaded packages were saved in cache until the next successful transaction.
You can remove cached packages by executing 'dnf clean packages'.

Fix

sudo dnf upgrade --refresh rpm glibc
sudo rm /var/lib/rpm/.rpm.lock
dnf -y update
dnf install  <MY PACKAGES>

supergibbs · 2023-09-29T18:52:54Z

@bbenson29 Have you had any issues with that? I'm a bit hesitant to delete a lock. Would it be better to add a sleep 60 to ensure cloud-init is done?

BlackDark · 2023-10-12T14:38:54Z

Same problem here. Why is this still not fixed?
This issue even happens with a simple cloud-init config where you directly use the packages directive without custom scripts.

supergibbs · 2023-10-12T18:36:55Z

AWS Support recommended the following. I agree it seems like there is an issue and we shouldn't need this (we didn't in v1 or v2), but it's been working for me.

while true; do
    dnf install --assumeyes docker && break
done

while true; do
    dnf update --assumeyes && break
done
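
The loops above retry forever, which could stall boot indefinitely if the failure turns out not to be transient. A minimal sketch of a bounded variant follows. The function name retry_cmd, the attempt count, and the delay are illustrative assumptions, not part of the AWS Support recommendation:

```bash
# Retry a command a bounded number of times with a fixed pause between attempts.
# Usage: retry_cmd MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# (retry_cmd is a hypothetical helper name, not an existing tool.)
retry_cmd() {
    local max_attempts="$1"
    local delay="$2"
    shift 2
    local attempt=1
    while true; do
        # Stop as soon as the command succeeds.
        "$@" && return 0
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "retry_cmd: giving up after $attempt attempts: $*" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Example (not run here; the values 10 and 5 are arbitrary):
#   retry_cmd 10 5 dnf install --assumeyes docker
#   retry_cmd 10 5 dnf update --assumeyes
```

As with the original loops, each failed attempt may still leave its own entry in the dnf history.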

rbpltr · 2023-10-18T15:37:17Z

I've been trying to get to the bottom of this issue over the last 48 hours and I think it is caused by the SSM agent updater running at the same time as the user data scripts.

See: https://docs.aws.amazon.com/systems-manager/latest/userguide/ssm-agent-automatic-updates.html#ssm-agent-automatic-updates-console

I've disabled this in SSM Fleet Manager and the issue has gone away.

If you do disable the automatic SSM agent updates, it's important that you know the consequences of this and implement the updates in another way!

Edit: I'm specifically referring to the RPM lock issue btw...

AB-DBMC · 2023-10-18T16:02:58Z

The malfunction is apparently triggered by the Auto update SSM agent function. The SSM automation document AWS-UpdateSSMAgent is then executed on the EC2 instance, which eventually leads to this error when other YUM/DNF commands are processed at the same time:

RPM: error: can't create transaction lock on /var/lib/rpm/.rpm.lock (Resource temporarily unavailable)
Error: Could not run transaction.

A quick and dirty workaround is to temporarily disable the Auto update SSM agent feature (try it at your own risk; it is not generally recommended).

Systems Manager > Fleet Manager > Settings > Auto update SSM agent -> Disable

After that, the error will no longer occur. As long as AWS does not have a bugfix here, the workaround of @supergibbs makes the most sense, but leads to multiple entries under yum history.

stewartsmith · 2023-10-20T17:03:56Z

We've passed this along to the SSM team and they're tracking it in an internal ticket. I'll try and keep an eye on it.

BenCoffeed · 2023-12-11T17:47:50Z

Any update here? Having a real blast with 2023-powered EB engines.

hastarin · 2024-01-18T06:13:35Z

@stewartsmith Would there be any update?

Surely the low number of thumbs up here is more a reflection of those that aren't aware of the issue or have been forced to work around it.

To add yet another workaround: I ended up using the minimal AMI, installed aws-cfn-bootstrap via UserData, and then installed both docker and amazon-ssm-agent as part of CloudInit, making sure to start amazon-ssm-agent last.

rrehbein · 2024-01-19T13:22:16Z

In our project we were starting ssm early on minimal. We added this snippet to our user data. (White space added for readability)

systemctl enable amazon-ssm-agent.service --now
# Give service a moment to start to create its first logs
sleep 2 
# Watch the logs for the completion of self-update
stdbuf -i0 -o0 -e0 tail -n +0 -f /var/log/amazon/ssm/amazon-ssm-agent.log \
    | awk -e '
        /"awsupdateSsmAgent":/,/("status":)/ {
            if (/"(Success|Skipped)"/) {
                print "awsupdateSsmAgent: " $0;
                exit
            }
        }
    '

stdbuf -i0 -o0 -e0 tail -f ... - unbuffered tail
awk ... - a multi-line string match that exits on the first match of status:Success or status:Skipped.

We were using just a plain sleep 10; however, we had some random timing issues crop up where 10 seconds was sometimes not enough. Rather than increasing the time to be safe, we changed to watching the logs for a key event.
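
The tail-based watcher above waits without an upper bound. A minimal sketch of a polling variant with a timeout follows, reusing the same awk range match on a snapshot of the log. The function names are hypothetical, and the log line format is assumed only from the patterns in the snippet above, not verified against a real agent log:

```bash
# Return 0 if the log contains a self-update block that ended in Success or Skipped.
# The patterns are copied from the awk range match in the snippet above.
ssm_update_finished() {
    awk '
        /"awsupdateSsmAgent":/,/"status":/ {
            if (/"(Success|Skipped)"/) { found = 1; exit }
        }
        END { exit !found }
    ' "$1" 2>/dev/null
}

# Poll once per second until the update is reported finished or the timeout expires.
# Usage: wait_for_ssm_update LOG_FILE TIMEOUT_SECONDS
wait_for_ssm_update() {
    local log_file="$1"
    local remaining="$2"
    until ssm_update_finished "$log_file"; do
        if [ "$remaining" -le 0 ]; then
            echo "wait_for_ssm_update: timed out" >&2
            return 1
        fi
        sleep 1
        remaining=$((remaining - 1))
    done
}

# Example (not run here; 120 seconds is an arbitrary choice):
#   wait_for_ssm_update /var/log/amazon/ssm/amazon-ssm-agent.log 120
```

On timeout the caller can decide whether to continue anyway or fail the boot script, which a plain tail -f pipeline does not allow.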

andreverheij · 2024-01-24T00:35:08Z

I'm having the same problem installing the CloudWatch agent. I download the rpm file, then dnf install runs while dnf clean all is running in the background, which breaks the dnf install command, so my machines never finish building. With a sleep before my dnf install command, the machines build, but I'd like not to need this. One key element of using an ASG is that a new machine is built quickly, without having to wait 60 seconds for no good reason.


singlewind · 2024-03-08T00:15:32Z

Just adding some steps to easily replicate the issue.

Add the following userdata as cloud-config.

#cloud-config

repo_update: true
repo_upgrade: true
package_reboot_if_required: true

packages:
  - docker
  - postgresql15
  - python3-boto3

This randomly gives the error below:

Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
RPM: error: can't create transaction lock on /var/lib/rpm/.rpm.lock (Resource temporarily unavailable)
The downloaded packages were saved in cache until the next successful transaction.
You can remove cached packages by executing 'dnf clean packages'.
Error: Could not run transaction.

Recently, this has been happening more frequently.

We are using:

Linux ip-10-42-102-106.ap-southeast-2.compute.internal 6.1.79-99.164.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Feb 27 18:02:23 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

cloud-init version is 22.2.2

BlackDark · 2024-03-08T10:14:27Z

Another possible hotfix, which works well for me:

Stop the SSM agent in bootcmd.
Do your stuff.
At the end of runcmd, start the SSM agent again.

package_update: true
package_upgrade: true

bootcmd:
  - systemctl stop amazon-ssm-agent

runcmd:
  - ... your stuff here
  - systemctl start amazon-ssm-agent # at the end restart
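
For a plain bash userData script rather than cloud-config, the same stop, work, restart sequence could be sketched as below. The helper name with_unit_stopped is hypothetical, and whether stopping the agent at this point in boot is early enough to avoid the lock conflict is an assumption this thread does not confirm:

```bash
# Stop a systemd unit, run a command, then start the unit again even if the
# command failed. Returns the command's exit status.
# Usage: with_unit_stopped UNIT COMMAND [ARGS...]
# (with_unit_stopped is a hypothetical helper name.)
with_unit_stopped() {
    local unit="$1"
    shift
    local rc=0
    systemctl stop "$unit"
    # Capture the command's status without aborting, so the unit is always restarted.
    "$@" || rc=$?
    systemctl start "$unit"
    return "$rc"
}

# Example (not run here):
#   with_unit_stopped amazon-ssm-agent dnf install -y docker
```

Because the stop and the start happen in the same script, the agent is not left stopped if the package command fails.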

xu-lei-richard · 2024-03-28T12:30:27Z

Any update on this issue?
I put in a hack with the command 'sleep 30', and it worked before. Now this hack no longer works.

rowanbeentje · 2024-04-09T08:51:39Z

We've also successfully been using the workaround posted above but something further seems to have changed in the last couple of weeks and we've had to apply it to more scripts. Fortunately it does still seem to work although our simple setup scripts are getting messier!

Booligoosh · 2024-04-24T04:02:05Z

@BlackDark's workaround could cause issues when rebooting the instance, as bootcmd runs on every boot, but runcmd only runs when starting the instance for the first time, meaning the SSM Agent won't start back up. This workaround worked for us, and restarts the SSM agent even when rebooting:

# Workaround for this issue: https://github.com/amazonlinux/amazon-linux-2023/issues/397
# Stops the SSM Agent from updating during the cloud-init process while we're trying to install packages, then restarts once cloud-init is complete.
bootcmd:
  - systemctl stop amazon-ssm-agent
write_files:
  - path: /var/lib/cloud/scripts/per-boot/startSsmAgent.sh
    permissions: "0755" # Allow owner to read/write/execute, others to read/execute
    content: |
      #!/bin/sh
      systemctl start amazon-ssm-agent

avoidik · 2024-05-05T19:10:15Z

I've had a similar issue on ubuntu for a long time. The only way that worked for me was to wait for the rpm lock to be released, and then do the rest on my own:

#cloud-config

package_update: false
package_upgrade: false

runcmd:
  - while fuser /var/lib/rpm/.rpm.lock > /dev/null 2>&1 ; do sleep 1 ; done
  - dnf install -y docker
  - systemctl enable docker.service
  - systemctl start docker.service
  - usermod -a -G docker ec2-user
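
The fuser loop above waits indefinitely if the lock is never released. A minimal sketch of the same check with an upper bound follows. The function name wait_for_unlocked and the timeout are illustrative assumptions, and fuser (from the psmisc package) is assumed to be installed:

```bash
# Wait until no process has LOCK_FILE open, or give up after TIMEOUT seconds.
# Usage: wait_for_unlocked LOCK_FILE TIMEOUT_SECONDS
# (wait_for_unlocked is a hypothetical helper name.)
wait_for_unlocked() {
    local lock_file="$1"
    local remaining="$2"
    # fuser exits 0 while at least one process is using the file.
    while fuser "$lock_file" >/dev/null 2>&1; do
        if [ "$remaining" -le 0 ]; then
            echo "wait_for_unlocked: $lock_file still in use" >&2
            return 1
        fi
        sleep 1
        remaining=$((remaining - 1))
    done
    return 0
}

# Example (not run here; 120 seconds is an arbitrary choice):
#   wait_for_unlocked /var/lib/rpm/.rpm.lock 120 && dnf install -y docker
```

Another process can still take the lock between the check and the dnf call, so this narrows the window rather than closing it.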

mg-alanjones · 2024-05-10T01:23:31Z

@avoidik Thank you for posting; however, that solution doesn't work for me on a t3a.small.

avoidik · 2024-05-10T17:03:13Z

@mg-alanjones since it's just a workaround you're free to experiment with it, for instance try to increase sleep timeout, wait for some other mutex, or perhaps wait for amazon guys to fix this problem, cheers

stewartsmith added the bug Something isn't working label Jul 10, 2023

stewartsmith changed the title ~~[Bug] - userData Docker installation fails on EC2. [Errno 2] No such file or directory~~ [Bug] - SSM Auto Update conflicts with cloud-init package install (e.g. userData Docker installation fails on EC2. [Errno 2] No such file or directory) Oct 20, 2023

stewartsmith added the aws-integration An issue integrating with an AWS Service label Oct 20, 2023

snmatus added a commit to aws-samples/db-top-monitoring that referenced this issue Jan 24, 2024

Fix deployment issue bug : amazonlinux/amazon-linux-2023#397

56b31dd

snmatus mentioned this issue Jan 24, 2024

Fix deployment issue bug : https://github.com/amazonlinux/amazon-linux-2023/issues/397 aws-samples/db-top-monitoring#10

Merged

snmatus added a commit to aws-samples/db-top-monitoring that referenced this issue Jan 24, 2024

Merge pull request #10 from aws-samples/app/upgrade/nodejs

eb6b248

Fix deployment issue bug : amazonlinux/amazon-linux-2023#397

jkruse14 mentioned this issue Jan 26, 2024

AL2023 Consistently Failing to boot philips-labs/terraform-aws-github-runner#3741

Closed

jkueloc mentioned this issue Apr 30, 2024

Update bastionhosts for ami al2023 & postgres15 client LibraryOfCongress/concordia#2363

Merged
