Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optionally mount Instance Storage on boot #557

Merged
merged 34 commits into from Sep 9, 2021
Merged

Conversation

lox
Copy link
Contributor

@lox lox commented Mar 28, 2019

Since lots of instances seem to have fancy fast instance storage now, this mounts whatever is available at /ephemeral and uses that for docker storage.

Closes #464.

Copy link
Contributor

@toolmantim toolmantim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👏🏼

@benesch
Copy link

benesch commented Jul 30, 2019

This would be a huge help to us! Anything I can do to help move this along?

(By the way, I think we'd also want the option to move /var/lib/buildkite to the local storage, so that compilation would also see the faster I/O.)

@lox
Copy link
Contributor Author

lox commented Aug 7, 2019

@benesch would be very interested to see if you actually saw performance improvements that were meaningful from this change. I guess there is only one way to find out!

@ans0600
Copy link

ans0600 commented Aug 8, 2019

Hi @lox, I am keen to try this PR. Anything I can do to help move this along? We will be trying this on c5.24xlarge with about 100 buildkite agents. I think this PR is critical for our idea to work.

@ans0600
Copy link

ans0600 commented Aug 10, 2019

Hi @lox, I tried this commit and the instance failed to start docker daemon with following error. The docker daemon.json file seems to be empty.

[root@ip-172-31-1-61 ec2-user]# tail -f /var/log/docker
Aug 10 02:41:10 ip-172-31-15-205 dockerd: unable to configure the Docker daemon with file /etc/docker/daemon.json: EOF

[root@ip-172-31-1-61 ec2-user]# ls /etc/docker/daemon.json
/etc/docker/daemon.json
[root@ip-172-31-1-61 ec2-user]# cat /etc/docker/daemon.json

I tried this branch with following steps:

  • Checkout this branch and run make
  • In packer folder, run packer build. I can see an AMI is created in my account
  • Use the "aws-stack.yml" in build directory to launch the cloudformation stack. I specified the AMI ID with the one generated in previous step.

After launch, I can see the ephemeral drive is created 👍 . The instance type in this case is m5d.24large

[root@ip-172-31-1-61 ec2-user]# df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        185G     0  185G   0% /dev
tmpfs           185G     0  185G   0% /dev/shm
tmpfs           185G  788K  185G   1% /run
tmpfs           185G     0  185G   0% /sys/fs/cgroup
/dev/nvme0n1p1   10G  1.8G  8.3G  18% /
/dev/nvme1n1    838G  888M  837G   1% /ephemeral
tmpfs            37G     0   37G   0% /run/user/1000

@lox
Copy link
Contributor Author

lox commented Aug 12, 2019

@ans0600 really appreciate the testing! I'll get this rebased and see if I can figure out what is going on.

@ans0600
Copy link

ans0600 commented Aug 16, 2019

@lox Just want to check if the PR is ready to be tested again?

@lox
Copy link
Contributor Author

lox commented Oct 18, 2019

@jradtilbrook or @ans0600 if either of you get a chance to test this out, let me know!

@lox
Copy link
Contributor Author

lox commented Nov 4, 2019

For those following along, you can test it out with https://s3.amazonaws.com/buildkite-aws-stack/add-instance-storage/aws-stack.yml. Would love feedback!

@jradtilbrook
Copy link
Member

I'm under the pump for other things at work so I don't think I'll be able to try this out any time soon sorry - just in case you were relying/waiting on me

@lox
Copy link
Contributor Author

lox commented Nov 5, 2019

No worries @jradtilbrook, some other folks asked about it!

@dgarbus
Copy link
Contributor

dgarbus commented Dec 5, 2019

I just gave this a try and confirmed that it is still failing to spin up new instances. The issue appears to be that the jq filter syntax being used does not support key names containing hyphens (alphanumeric characters and underscores only): https://github.com/buildkite/elastic-ci-stack-for-aws/pull/557/files#diff-75f0630f614eebc548f6a4beeda4f879R23

To fix it, we just need to add double quotes around the key name:

--- a/packer/conf/bin/bk-configure-docker.sh
+++ b/packer/conf/bin/bk-configure-docker.sh
@@ -20,5 +20,5 @@ fi

 # Move docker root to the ephemeral device
 if [[ "${BUILDKITE_ENABLE_INSTANCE_STORAGE:-false}" == "true" ]] ; then
-  cat <<< "$(jq '.data-root="/ephemeral/docker"' /etc/docker/daemon.json)" > /etc/docker/daemon.json
+  cat <<< "$(jq '."data-root"="/ephemeral/docker"' /etc/docker/daemon.json)" > /etc/docker/daemon.json
 fi

In fact, it looks like we do exactly that for the "userns-remap" feature in that same script.

I applied the patch above, built an AMI, and confirmed that the instance stays running. However, when running a build, I ran into the next error:

Checking docker
--
  | CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
  | Checking disk space
  | df: ‘/var/lib/docker/’: No such file or directory
  | Cleaning up docker resources older than 4h
  | Total reclaimed space: 0B
  | Checking disk space again
  | df: ‘/var/lib/docker/’: No such file or directory
  | Disk health checks failed
  | 🚨 Error: Error setting up bootstrap: The global environment hook exited with status 1
  |  

Which I fixed by changing this:

--- a/packer/conf/bin/bk-check-disk-space.sh
+++ b/packer/conf/bin/bk-check-disk-space.sh
@@ -4,7 +4,7 @@ set -euo pipefail
 DISK_MIN_AVAILABLE=${DISK_MIN_AVAILABLE:-5242880} # 5GB
 DISK_MIN_INODES=${DISK_MIN_INODES:-250000} # docker needs lots

-DOCKER_DIR="/var/lib/docker/"
+DOCKER_DIR="$(jq -r '."data-root" // "/var/lib/docker"' /etc/docker/daemon.json)"

 disk_avail=$(df -k --output=avail "$DOCKER_DIR" | tail -n1)

Now everything appears to be working.

@ouranos
Copy link
Contributor

ouranos commented Jul 5, 2020

Hi @lox,
I've tested this last week.

I've:

  • rebased this branch onto the latest master branch
  • applied @dgarbus patches

I've replaced our "builder" stack (mostly docker/yarn/webpack) by this new stack keeping most of the parameters the same and just changing the instance type: m5.4xlarge with 500 GiB EBS (1500 IOPS)
-> m5d.4xlarge

With a cold docker cache, the build went from ~60min down to ~16min. So this is a massive improvement for us!

I had to apply the following patch to fix the striped volume creation and mount:

--- a/packer/linux/conf/bin/bk-mount-instance-storage.sh
+++ b/packer/linux/conf/bin/bk-mount-instance-storage.sh
@@ -10,16 +10,13 @@ if [[ "${BUILDKITE_ENABLE_INSTANCE_STORAGE:-false}" != "true" ]] ; then
   exit 0
 fi

+# Install nvme-cli to list NVMe SSD instance store volumes
+yum -y -d0 install nvme-cli
+
 devicemount=/ephemeral
 logicalname=/dev/md0
-candidates=( '/dev/nvme1n1' )
-devices=()
+devices=($(nvme list | grep "Amazon EC2 NVMe Instance Storage"| cut -f1 -d' '))

-for candidate in "${candidates[@]}" ; do
-  if [[ -b $candidate ]] ; then
-    devices+=("$candidate")
-  fi
-done

 if [[ "${#devices[@]}" -gt 0 ]] ; then
   mkdir -p "$devicemount"
@@ -35,13 +32,13 @@ if [[ "${#devices[@]}" -eq 1 ]] ; then
   fi

 elif [[ "${#devices[@]}" -gt 1 ]] ; then
-  yes | mdadm \
+  mdadm \
     --create "$logicalname" \
     --level=0 \
     -c256 \
     --raid-devices="${#devices[@]}" "${devices[@]}"

-  echo \'DEVICE "${devices[*]}"\' > /etc/mdadm.conf
+  echo "DEVICE ${devices[*]}" > /etc/mdadm.conf

   mdadm --detail --scan >> /etc/mdadm.conf
   blockdev --setra 65536 "$logicalname"

@ouranos
Copy link
Contributor

ouranos commented Jul 6, 2020

I also have a branch based off stable-4-4 on our fork if that helps:
stable-4-4...range-me:add-instance-storage-stable

@nitrocode
Copy link
Contributor

nitrocode commented Sep 15, 2020

Great addition! i believe this will help us as well. What's the status of this pr?

Is the only change left incorporating @ouranos changes into @lox pr?

Or would it be better to use the branch with @ouranos additional commits in a separate pr?

@woodhull
Copy link

woodhull commented Dec 2, 2020

We've also tried this out. It made a modest improvement to docker image build times in our situation. Would still be good to see this merged so that it can be experimented with by consumers without forking the stack.

@ouranos
Copy link
Contributor

ouranos commented Jan 18, 2021

@yob @lox Any updates on this?
It's working great for us but we have to maintain a branch and keep it up to date with upstream (and build new AMIs).
It would be great to get it merged (any branch would work for us, eg: 5-0-stable or master)

I'd be happy to rebase/update our branch and open a new PR if that helps.

@nijave
Copy link

nijave commented Feb 2, 2021

We're using something similar for node since the massive quantity of files really drags down the default config. We've seen NVMe instances with soft RAID0 hit over 70k IOPs so there's definitely a boost here. If I remember/have some time I'll post what we have (we've been using it over a year now)

I think our main difference is mounting NVMe to /media/ephemeral then bind mounting:

/media/ephemeral/docker -> /var/lib/docker
/media/ephemeral/buildkite -> /var/lib/buildkite

@ouranos
Copy link
Contributor

ouranos commented Feb 8, 2021

@yob @lox Any updates on this?
It's working great for us but we have to maintain a branch and keep it up to date with upstream (and build new AMIs).
It would be great to get it merged (any branch would work for us, eg: 5-0-stable or master)

I'd be happy to rebase/update our branch and open a new PR if that helps.

@sj26 @pda Pinging you since I saw you were looking at #790 which fall into the same category.

Let me know what you think of this and if I can help in getting it merged 🙏

@ouranos
Copy link
Contributor

ouranos commented Sep 8, 2021

Hi all,

Just saw some activity on this today and I realised I never prepared the PR I offered in my last message.

I spent some time last week updating another of our stack to nvme storage and that went quite well.

@keithduncan let me know if you need any help on this. I have a bit more time to spend on buildkite, so I'm happy to create a PR from our fork as I offered last time (and didn't do 😅 )

@keithduncan
Copy link
Contributor

Hi @ouranos, I think I’ve incorporated all the patches mentioned here and am testing this on Linux now. If you spot anything missing from this implementation compared to yours let me know and I can look at incorporating it 😄

One area I haven’t looked into yet is adding this to our Windows AMI, and is something I’m interested in adding. Pointers in this area would be appreciated!

@keithduncan
Copy link
Contributor

it's happening gif

@keithduncan keithduncan merged commit 43fe73f into master Sep 9, 2021
@keithduncan keithduncan deleted the add-instance-storage branch September 9, 2021 02:31
@ouranos
Copy link
Contributor

ouranos commented Sep 9, 2021

Great work @keithduncan! 🙌

Hi @ouranos, I think I’ve incorporated all the patches mentioned here and am testing this on Linux now. If you spot anything missing from this implementation compared to yours let me know and I can look at incorporating it 😄

I've compared my branch with what is now in master and it looks good. The only other changes we have are unrelated (installing CloudWatch agent for memory monitoring, installing buildx, ...)

I have an isolated stack & set of pipelines for testing so I'll test the new stack and report. The branch I'm using is branched off v5.1.0 so that'll also be a good opportunity to update to v5.6.0.

One area I haven’t looked into yet is adding this to our Windows AMI, and is something I’m interested in adding. Pointers in this area would be appreciated!

Unfortunately, we're only using Linux AMI so I won't be able to help much with this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Instance Storage Support