Improve stash efficiency #495
Merged
LXD supports "incremental" or "refresh" copies, which means it replaces the instance's root storage volume without ever removing the instance itself. When doing this, it does not update the instance's config or devices for some reason (the same happens when rebuilding from a base image). Currently, when restoring the stash, we first clear the workshop's MAC address so that LXD doesn't try to clear its DHCP lease. This is no longer necessary, since we now copy the stash over the existing workshop. However, removing the stash could potentially lead to the same issue. To prevent this, we omit the MAC address and other volatile attributes when creating the stash. This is handled by LXD automatically, by passing an empty config map to the copy operation. When restoring the stash, we manually copy over the same attributes: all the non-volatile ones plus the base image and idmap. The corresponding function in LXD is called InstanceIncludeWhenCopying.
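The restore-time filter described above can be sketched as follows. This is only an illustration of the rule "all non-volatile keys plus the base image and idmap", not the LXD implementation. The two key names in `ALWAYS_COPIED` and the helper names are assumptions and should be checked against LXD's `InstanceIncludeWhenCopying`.

```python
VOLATILE_PREFIX = "volatile."

# Assumed key names for "the base image and idmap"; verify against LXD.
ALWAYS_COPIED = frozenset({"volatile.base_image", "volatile.last_state.idmap"})


def include_when_copying(key: str) -> bool:
    """Return True if a config key should be copied back from the stash."""
    if key in ALWAYS_COPIED:
        return True
    # Every other volatile key (MAC addresses, for example) is left out.
    return not key.startswith(VOLATILE_PREFIX)


def restorable_config(stash_config: dict[str, str]) -> dict[str, str]:
    """Filter a stash's config map down to the keys worth restoring."""
    return {k: v for k, v in stash_config.items() if include_when_copying(k)}


# Illustrative config map; the values are made up.
example = {
    "limits.cpu": "4",
    "volatile.base_image": "abc123",
    "volatile.last_state.idmap": "[]",
    "volatile.eth0.hwaddr": "00:16:3e:00:00:01",
}
print(sorted(restorable_config(example)))
```

Dropping the MAC address key is the point of the exercise: the stash never carries the workshop's network identity, so removing the stash cannot disturb the workshop's DHCP lease.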
SDK names are already limited to 40 characters by SDKcraft. We should use the same limit in Workshop so that SDK layers don't exceed the LXD instance name limit of 63 characters. Workshops are already used in instance names. In theory the max length is 63 - 9 = 54 but we might as well use the same limit as SDKs.
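A minimal sketch of that limit, assuming a hypothetical `check_name` helper (the real validation in Workshop may look different). The constants come from the text above; the 9 characters of overhead are taken from the "63 - 9 = 54" arithmetic without further interpretation.

```python
LXD_INSTANCE_NAME_MAX = 63  # LXD instance name limit
INSTANCE_NAME_OVERHEAD = 9  # from the "63 - 9 = 54" arithmetic above
NAME_MAX = 40               # limit shared with SDKcraft


def check_name(kind: str, name: str) -> None:
    """Raise ValueError if a workshop or SDK name is empty or too long."""
    if not name:
        raise ValueError(f"{kind} name must not be empty")
    if len(name) > NAME_MAX:
        raise ValueError(
            f"{kind} name {name!r} is {len(name)} characters long; "
            f"the limit is {NAME_MAX}"
        )


# The shared limit sits comfortably inside the theoretical maximum.
theoretical_max = LXD_INSTANCE_NAME_MAX - INSTANCE_NAME_OVERHEAD
print(theoretical_max, NAME_MAX <= theoretical_max)
```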
An upcoming change to SDK snapshots will consolidate them with the stash, so it makes sense to implement them both in the same file. SDK snapshots can be thought of as "layers," as can the stash. So the new filename reflects this.
akcano
requested changes
Oct 8, 2025
LGTM, one minor change needed.
Contributor
Author
|
I think we'll need a transitional version of the |
dmitry-lyfar
reviewed
Oct 9, 2025
dmitry-lyfar
approved these changes
Oct 10, 2025
This will soon hold SDK layers in addition to stashed workshops. Future iterations will bring the stash even closer to SDK layers, so it makes sense to keep them in the same project. It might even be possible to move them into the main project at some point.
The intent is to associate SDK layers with their parent workshop. It doesn't hurt to set this on the workshops themselves, rather than adding it when creating a new layer.
Also reimplements stash to preserve and remove layers when appropriate.
TICS Quality Gate: ✔️ Passed (project: workshop). All conditions passed.
Description
Uses cloned instances instead of snapshots alone, to enable instance-only copies for stashing and unstashing. This speeds up refresh for large workshops.
This is very much a breaking change. I suggest running the remove hook, and maybe deleting the state, before migrating to this branch. The "before" is important because this PR modifies the hook.
I also limited workshop and SDK names to 40 characters, following SDKcraft.
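The 40-character limit is what lets the layer names described under Naming fit within LXD's 63-character instance name limit. Here is a rough sketch of that scheme, assuming a hypothetical `layer_name` helper and an alphabet of lowercase letters and digits for the random string (the PR only specifies its length):

```python
import secrets
import string

RANDOM_LEN = 16                                    # length given under Naming
ALPHABET = string.ascii_lowercase + string.digits  # assumed alphabet
STASH_PREFIX = "stash-"


def layer_name(sdk: str, stashed: bool = False) -> str:
    """Build a name shaped like <SDK>-<RANDOM STRING>, optionally stash-prefixed."""
    suffix = "".join(secrets.choice(ALPHABET) for _ in range(RANDOM_LEN))
    name = f"{sdk}-{suffix}"
    return STASH_PREFIX + name if stashed else name


# Worst case: a 40-character SDK name inside a stashed layer.
longest = layer_name("a" * 40, stashed=True)
print(len(longest))
```

The worst case is 6 + 40 + 1 + 16 characters, which is exactly the 63 allowed.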
Preserving workshop metadata when refresh fails
The first commit introduces a new task, `rebuild-workshop`, which is like `create-workshop` except that its undo handler is a no-op (and it doesn't remove the workshop if the do handler fails partway through). To make this work, `UnstashWorkshop` has to use `lxc copy --refresh` instead of `lxc copy`.

This removes the need to distinguish between "remove for good" and "remove but keep the DHCP lease." Instead, the workshop exists continuously from `launch` to `remove`.

To avoid reintroducing DHCP instability, we simply give the stash a different MAC address from the main instance. For consistency, we avoid copying most `volatile.*` options to and from the stash.

Design
SDK snapshots and the stash are both implemented as "layers," i.e. LXD instances which live in the `workshop-layers.<USER>` project. These instances are never started, they just store a clone of the workshop's `rootfs` and associated metadata like config options and devices.

I tried to confine the changes to the LXD backend and not change the API too much. One thing I did change is that `Snapshot` and `Restore` now take an SDK name instead of a `snapid`.

All SDK snapshots belong to a workshop or a stashed workshop. This means the behavioural changes are limited to:

- `Snapshot` is like `lxc copy` rather than `lxc snapshot`. LXD still creates a ZFS snapshot internally, but also creates a new instance based on the snapshot.
- `StashWorkshop` copies the workshop and all of its snapshots, as LXD instances. This is cheaper than before because there's no need to assemble the copied snapshots into a single filesystem. It's like `lxc copy --instance-only` rather than `lxc copy`.
- `Restore` removes the same snapshots as before (as does `LaunchOrRebuildWorkshop`). It's like `lxc copy --refresh` rather than `lxc restore`.
- `UnstashWorkshop` removes snapshots created during the refresh and restores missing snapshots from the stash. It's like `lxc copy --refresh` rather than `lxc copy`, thanks to the new `rebuild-workshop` task.
- `RemoveWorkshop` and `RemoveWorkshopStash` attempt to remove all snapshots owned by the workshop or stash.

I think the main downside of this approach is that most functions operate on several instances at a time. It's hard to recover if something goes wrong partway through. The global SDK cache should resolve most of these issues, but I think enabling it will require much bigger changes outside the LXD backend.
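To make the `UnstashWorkshop` behaviour concrete, here is a simplified model of the bookkeeping, not the actual backend code. It assumes layers can be identified by SDK name alone, and `plan_unstash` and `UnstashPlan` are hypothetical names. Layers that exist only on the workshop are treated as created during the refresh and removed. Layers that exist only in the stash are treated as missing and restored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnstashPlan:
    remove: list[str]   # SDK layers created during the failed refresh
    restore: list[str]  # SDK layers missing from the workshop


def plan_unstash(current_sdks: set[str], stashed_sdks: set[str]) -> UnstashPlan:
    """Decide which SDK layers to remove and which to restore from the stash."""
    return UnstashPlan(
        remove=sorted(current_sdks - stashed_sdks),
        restore=sorted(stashed_sdks - current_sdks),
    )


# Illustrative SDK names only.
plan = plan_unstash(current_sdks={"ros2", "extra"}, stashed_sdks={"ros2", "base"})
print(plan)
```

Because the plan touches several instances, a failure partway through leaves some layers removed and others not yet restored, which is the recovery problem noted above.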
Naming
For the global SDK cache, layers will likely be named using unique hashes. We can't do that yet because the backend doesn't have enough information to compute an appropriate hash. Instead, SDK layers are named like `<SDK>-<RANDOM STRING>`. Stashed workshops and layers are prefixed by `stash-`.

The random string is 16 characters long, to avoid confusing it with a project ID. The longest possible name, `stash-<40 chars>-<16 chars>`, has length 63 exactly.

The previous scheme `<WORKSHOP>.<SDK>` can't be used as an instance name because it contains a `.`, and LXD limits instance names to 63 characters. To keep track of ownership, the workshop name and project ID are stored in layers as extra config options. SDK layers also store the SDK name. All layers have a type, either `sdk`, `stash` or `stash-sdk`. The last of these will be removed when we switch to a global SDK cache.

I renamed the `workshop-stash.<USER>` project to `workshop-layers.<USER>`. It's probably possible to store all layers in the main `workshop.<USER>` project, if there's a reason to consolidate them.

Performance
I tested performance using the following workshop:
To make `project-ros2` I extracted the ROS2 SDK, removed the `setup-project` hook and removed the snaps from `setup-base`. The other SDK is just:

In between refreshes I incremented `echo 1` to `echo 2`, etc., so the workshop is always restored from a `ros2` snapshot.

Refresh times
Main branch:
This branch:
Looks like a small, but clear, improvement.
Space consumed
Main branch, pre-refresh:
Main branch, mid-refresh:
Main branch, post-refresh:
This branch, pre-refresh:
This branch, mid-refresh:
This branch, post-refresh:
In `main`, the stash consumes about the same space as the workshop. In this branch that overhead is almost completely gone. But there's a downside, which is that deleted containers can accumulate over time. This example isn't bad, but only because we didn't really change the workshop after installing `ros2`. It's quite easy to rack up a lot of wasted space.

We plan on working with the LXD team to address the issue. In the meantime, this script will remove the unnecessary filesystems:

NOTE: it assumes all snapshots are named `copy-<UUID>`. This is the case if they were created by Workshop. To make it more robust, it should probably rename the snapshots before calling `zfs promote`.

Self-review quick check
Docs
Or: