Skip to content

Improve stash efficiency#495

Merged
dmitry-lyfar merged 10 commits into
mainfrom
feature/global-sdk-cache
Oct 13, 2025
Merged

Improve stash efficiency#495
dmitry-lyfar merged 10 commits into
mainfrom
feature/global-sdk-cache

Conversation

@jonathan-conder
Copy link
Copy Markdown
Contributor

@jonathan-conder jonathan-conder commented Sep 24, 2025

Description

Uses cloned instances instead of snapshots alone, to enable instance-only copies for stashing and unstashing. This speeds up refresh for large workshops.

This is very much a breaking change, I suggest running the remove hook and maybe deleting the state before migrating to this branch. The "before" is important because this PR modifies the hook.

I also limited workshop and SDK names to 40 characters, following SDKcraft.

Preserving workshop metadata when refresh fails

The first commit introduces a new task rebuild-workshop, which is like create-workshop but its undo handler is a no-op (and it doesn't remove the workshop if the do handler fails partway through). To make this work, UnstashWorkshop has to use lxc copy --refresh instead of lxc copy.

This removes the need to distinguish between "remove for good" and "remove but keep the DHCP lease." Instead, the workshop exists contiguously from launch to remove.

To avoid reintroducing DHCP instability, we simply give the stash a different MAC address from the main instance. For consistency, we avoid copying most volatile.* options to and from the stash.

Design

SDK snapshots and the stash are both implemented as "layers," i.e. LXD instances which live in the workshop-layers.<USER> project. These instances are never started, they just store a clone of the workshop's rootfs and associated metadata like config options and devices.

I tried to confine the changes to the LXD backend and not change the API too much. One thing I did change is that Snapshot and Restore now take an SDK name instead of a snapid.

All SDK snapshots belong to a workshop or a stashed workshop. This means the behavioural changes are limited to:

  • Snapshot is like lxc copy rather than lxc snapshot. LXD still creates a ZFS snapshot internally, but also creates a new instance based on the snapshot.
  • StashWorkshop copies the workshop and all of its snapshots, as LXD instances. This is cheaper than before because there's no need to assemble the copied snapshots into a single filesystem. It's like lxc copy --instance-only rather than lxc copy.
  • Restore removes the same snapshots as before (as does LaunchOrRebuildWorkshop). It's like lxc copy --refresh rather than lxc restore.
  • UnstashWorkshop removes snapshots created during the refresh and restores missing snapshots from the stash. It's like lxc copy --refresh rather than lxc copy, thanks to the new rebuild-workshop task.
  • RemoveWorkshop and RemoveWorkshopStash attempt to remove all snapshots owned by the workshop or stash.

I think the main downside of this approach is that most functions operate on several instances at a time. It's hard to recover if something goes wrong partway through. The global SDK cache should resolve most of these issues, but I think enabling it will require much bigger changes outside the LXD backend.

Naming

For the global SDK cache, layers will likely be named using unique hashes. We can't do that yet because the backend doesn't have enough information to compute an appropriate hash. Instead SDK layers are named like <SDK>-<RANDOM STRING>. Stashed workshops and layers are prefixed by stash-.

The random string is 16 characters long, to avoid confusing it with a project ID. The longest possible name stash-<40 chars>-<16 chars> has length 63 exactly.

The previous scheme <WORKSHOP>.<SDK> can't be used as an instance name because it contains a ., and LXD limits instance names to 63 characters. To keep track of ownership, the workshop name and project ID are stored in layers as extra config options. SDK layers also store the SDK name. All layers have a type, either sdk, stash or stash-sdk. The latter will be removed when we switch to a global SDK cache.

I renamed the workshop-stash.<USER> project to workshop-layers.<USER>. It's probably possible to store all layers in the main workshop.<USER> project, if there's a reason to consolidate them.

Performance

I tested performance using the following workshop:

name: nav2
base: ubuntu@24.04
sdks:
  - name: project-ros2
  - name: project-nav2

To make project-ros2 I extracted the ROS2 SDK, removed the setup-project hook and removed the snaps from setup-base. The other SDK is just:

$ cat .workshop/nav2/sdk.yaml 
name: nav2
$ cat .workshop/nav2/hooks/setup-base 
echo 1

In between refreshes I incremented echo 1 to echo 2, etc., so the Workshop is always restored from a ros2 snapshot.

Refresh times

Main branch:

$ time workshop refresh
"nav2" refreshed

real	0m20.821s
user	0m0.198s
sys	0m0.068s
$ time workshop refresh
"nav2" refreshed

real	0m22.594s
user	0m0.296s
sys	0m0.101s
$ time workshop refresh
"nav2" refreshed

real	0m22.598s
user	0m0.275s
sys	0m0.093s
$ time workshop refresh
"nav2" refreshed

real	0m22.879s
user	0m0.254s
sys	0m0.083s

This branch:

$ time workshop refresh 
"nav2" refreshed

real	0m18.656s
user	0m0.217s
sys	0m0.073s
$ time workshop refresh 
"nav2" refreshed

real	0m20.861s
user	0m0.248s
sys	0m0.100s
$ time workshop refresh 
"nav2" refreshed

real	0m20.086s
user	0m0.246s
sys	0m0.066s
$ time workshop refresh 
"nav2" refreshed

real	0m21.239s
user	0m0.278s
sys	0m0.080s

Looks like a small, but clear, improvement.

Space consumed

Main branch, pre-refresh:

NAME                                              USED  AVAIL  REFER  MOUNTPOINT
workshop/containers                               445M  27.7G    24K  legacy
workshop/containers/workshop.jono_nav2-bee72369   445M  27.7G   908M  legacy
workshop/deleted/containers                        24K  27.7G    24K  legacy

Main branch, mid-refresh:

NAME                                                          USED  AVAIL  REFER  MOUNTPOINT
workshop/containers                                           890M  27.2G    24K  legacy
workshop/containers/workshop-stash.jono_stash-nav2-bee72369   446M  27.2G   908M  legacy
workshop/containers/workshop.jono_nav2-bee72369               445M  27.2G   908M  legacy
workshop/deleted/containers                                    24K  27.2G    24K  legacy

Main branch, post-refresh:

NAME                                              USED  AVAIL  REFER  MOUNTPOINT
workshop/containers                               445M  27.7G    24K  legacy
workshop/containers/workshop.jono_nav2-bee72369   445M  27.7G   908M  legacy
workshop/deleted/containers                        24K  27.7G    24K  legacy

This branch, pre-refresh:

NAME                                                               USED  AVAIL  REFER  MOUNTPOINT
workshop/containers                                               1.96M  27.7G    24K  legacy
workshop/containers/workshop-layers.jono_nav2-ffee8e20be0597bb     101K  27.7G   908M  legacy
workshop/containers/workshop-layers.jono_ros2-cf4dee669ba1a0c8     101K  27.7G   907M  legacy
workshop/containers/workshop-layers.jono_system-f982ce85d78e71d8    22K  27.7G   498M  legacy
workshop/containers/workshop.jono_nav2-bee72369                   1.72M  27.7G   908M  legacy
workshop/deleted/containers                                        445M  27.7G    24K  legacy
workshop/deleted/containers/4586daf1-c8c0-4d1c-b7ae-aa4c3086634f   445M  27.7G   907M  legacy

This branch, mid-refresh:

NAME                                                                     USED  AVAIL  REFER  MOUNTPOINT
workshop/containers                                                     1.62M  27.7G    24K  legacy
workshop/containers/workshop-layers.jono_ros2-cf4dee669ba1a0c8           101K  27.7G   907M  legacy
workshop/containers/workshop-layers.jono_stash-nav2-bee72369             101K  27.7G   908M  legacy
workshop/containers/workshop-layers.jono_stash-nav2-ffee8e20be0597bb     101K  27.7G   908M  legacy
workshop/containers/workshop-layers.jono_stash-ros2-cf4dee669ba1a0c8     101K  27.7G   907M  legacy
workshop/containers/workshop-layers.jono_stash-system-f982ce85d78e71d8    22K  27.7G   498M  legacy
workshop/containers/workshop-layers.jono_system-f982ce85d78e71d8          22K  27.7G   498M  legacy
workshop/containers/workshop.jono_nav2-bee72369                         1.16M  27.7G   908M  legacy
workshop/deleted/containers                                              448M  27.7G    24K  legacy
workshop/deleted/containers/4586daf1-c8c0-4d1c-b7ae-aa4c3086634f         445M  27.7G   907M  legacy
workshop/deleted/containers/8786c42c-5f5c-4de9-a085-6abd6a15912d        2.34M  27.7G   908M  legacy
workshop/deleted/containers/da3e4e85-4a17-4c0b-ad4e-7a8229906bfb         101K  27.7G   908M  legacy

This branch, post-refresh:

NAME                                                               USED  AVAIL  REFER  MOUNTPOINT
workshop/containers                                               1.89M  27.7G    24K  legacy
workshop/containers/workshop-layers.jono_nav2-14e237260f0c5af6     101K  27.7G   908M  legacy
workshop/containers/workshop-layers.jono_ros2-cf4dee669ba1a0c8     101K  27.7G   907M  legacy
workshop/containers/workshop-layers.jono_system-f982ce85d78e71d8    22K  27.7G   498M  legacy
workshop/containers/workshop.jono_nav2-bee72369                   1.65M  27.7G   908M  legacy
workshop/deleted/containers                                        445M  27.7G    24K  legacy
workshop/deleted/containers/4586daf1-c8c0-4d1c-b7ae-aa4c3086634f   445M  27.7G   907M  legacy

In main, the stash consumes about the same space as the workshop. In this branch that overhead is almost completely gone. But there's a downside, which is that deleted containers can accumulate over time. This example isn't bad, but only because we didn't really change the workshop after installing ros2. It's quite easy to rack up a lot of wasted space.

We plan on working with the LXD team to address the issue. In the meantime, this script will remove the unnecessary filesystems:

set -eo pipefail

volumes() {
	zfs get -H -o value -d 1 -t filesystem,volume name workshop/deleted/containers
}

snapshots() {
	zfs get -H -o value -t snapshot name "$1"
}

clones() {
	zfs get -H -o value clones "$1"
}

volumes | while IFS= read -r volume; do
	[ "$volume" != workshop/deleted/containers ] || continue

	snapshots "$volume" | {
		latest=''
		while IFS= read -r snapshot; do
			clones="$(clones "$snapshot")"
			if [[ "$clones" =~ ^-?$ ]]; then
				zfs destroy "$snapshot"
			else
				latest="${clones%%,*}"
			fi
		done

		if [ -n "$latest" ]; then
			zfs promote "$latest"
		fi
		zfs destroy "$volume"
	}
done

NOTE: it assumes all snapshots are named copy-<UUID>. This is the case if they were created by Workshop. To make it more robust, it should probably rename the snapshots before calling zfs promote.

Self-review quick check

  • Make decisions that cost a lot to reverse explicit in the PR description.
  • Avoid nested conditions.
  • Delete dead code and redundant comments.
  • Normalise symmetries by sticking to doing identical things identically.
// one way to handle errors
if err := f(); err != nil {
   ...
}

// one way to handle multiple returns
val, err := f()
if err != nil {
   ...
}
...
  • Check that coupled code elements, files, and directories are adjacent. For example, test data is stored as close as possible to a test.
  • Put variable declaration and initialisation together.
  • Divide large expressions into digestable and self-explanatory ones. Use multiple variables if required.
  • Put a blank line between two logically different chunks of code.
  • Follow the style guide for new error messages.

Docs

  • I have checked and added or updated relevant documentation.
  • I have checked and added or updated relevant release notes.
  • I have included the technical author in the review.

Or:

  • I confirm the PR has no implications for documentation.

@jonathan-conder jonathan-conder self-assigned this Sep 24, 2025
@jonathan-conder jonathan-conder force-pushed the feature/global-sdk-cache branch 8 times, most recently from 6d1a291 to 4d2fe4d Compare October 7, 2025 05:22
@jonathan-conder jonathan-conder force-pushed the feature/global-sdk-cache branch from 4d2fe4d to 939e481 Compare October 8, 2025 05:47
LXD supports "incremental" or "refresh" copies, which means it replaces
the instance's root storage volume without ever removing the instance
itself. When doing this, it does not update the instance's config or
devices for some reason (the same happens when rebuilding from a base
image).

Currently, when restoring the stash, we first clear the workshop's MAC
address so that LXD doesn't try to clear its DHCP lease. This is no
longer necessary, since we now copy the stash over the existing
workshop.

However, removing the stash could potentially lead to the same issue. To
prevent this, we omit the MAC address and other volatile attributes when
creating the stash. This is handled by LXD automatically, by passing an
empty config map to the copy operation. When restoring the stash, we
manually copy over the same attributes: all the non-volatile ones plus
the base image and idmap. The corresponding function in LXD is called
InstanceIncludeWhenCopying.
SDK names are already limited to 40 characters by SDKcraft. We should
use the same limit in Workshop so that SDK layers don't exceed the LXD
instance name limit of 63 characters.

Workshops are already used in instance names. In theory the max length
is 63 - 9 = 54 but we might as well use the same limit as SDKs.
An upcoming change to SDK snapshots will consolidate them with the
stash, so it makes sense to implement them both in the same file. SDK
snapshots can be thought of as "layers," as can the stash. So the new
filename reflects this.
@jonathan-conder jonathan-conder force-pushed the feature/global-sdk-cache branch from 939e481 to 65c7f2f Compare October 8, 2025 06:00
@jonathan-conder jonathan-conder marked this pull request as ready for review October 8, 2025 06:54
Copy link
Copy Markdown
Contributor

@akcano akcano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, one minor change needed.

Comment thread docs/explanation/workshops/concepts.rst Outdated
@jonathan-conder jonathan-conder force-pushed the feature/global-sdk-cache branch from 65c7f2f to 7db55bc Compare October 8, 2025 21:22
@jonathan-conder
Copy link
Copy Markdown
Contributor Author

jonathan-conder commented Oct 9, 2025

I think we'll need a transitional version of the post-refresh hook which also removes the stash project. Pre-refresh doesn't cut it because the daemon will still be running

Comment thread internal/workshop/lxd/lxd_backend_layers.go Outdated
Comment thread internal/workshop/lxd/lxd_backend_layers.go
Comment thread internal/workshop/lxd/lxd_backend_layers.go Outdated
This will soon hold SDK layers in addition to stashed workshops. Future
iterations will bring the stash even closer to SDK layers, so it makes
sense to keep them in the same project. It might even be possible to
move them into the main project at some point.
The intent is to associate SDK layers with their parent workshop. It
doesn't hurt to set this on the workshops themselves, rather than adding
it when creating a new layer.
Also reimplements stash to preserve and remove layers when appropriate.
@jonathan-conder jonathan-conder force-pushed the feature/global-sdk-cache branch from 8a3c8ee to 12628b0 Compare October 10, 2025 04:51
@github-actions
Copy link
Copy Markdown

TICS Quality Gate

✔️ Passed

workshop

All conditions passed

See the results in the TICS Viewer

The following files have been checked for this project
  • internal/overlord/hookstate/handlers.go
  • internal/overlord/workshopstate/handlers.go
  • internal/overlord/workshopstate/manager.go
  • internal/overlord/workshopstate/request.go
  • internal/sdk/validate.go
  • internal/workshop/backend.go
  • internal/workshop/fakebackend/backend.go
  • internal/workshop/lxd/lxd_backend_layers.go
  • internal/workshop/lxd/lxd_backend_project.go
  • internal/workshop/lxd/lxd_backend.go
  • internal/workshop/workshop_file.go
  • internal/workshop/workshop.go

Automatic-tests / code-coverage / TICS GitHub Action

@dmitry-lyfar dmitry-lyfar merged commit c4fde13 into main Oct 13, 2025
13 checks passed
@dmitry-lyfar dmitry-lyfar deleted the feature/global-sdk-cache branch October 13, 2025 22:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants