
Only invalidate the specific workflow snapshot when re-running a workflow from a clean workspace #6303

Merged
merged 4 commits into from
Apr 18, 2024

Conversation

@maggie-lou (Contributor) commented Apr 4, 2024

Currently, when users hit "Re-run from a clean workspace" on a workflow invocation, it invalidates all snapshots for that repo. This PR changes the behavior to invalidate only the snapshot related to that workflow run.

For example, if you have multiple workflows (checkstyle and tests) and you hit the "rerun from clean" button on the checkstyle workflow, only the checkstyle snapshot will be invalidated; the tests snapshot will remain valid.

This is implemented by storing a version ID in a SnapshotVersionMetadata entry in the remote cache. For the same workflow (consistent VM configuration, action name, and platform hash), previous snapshots can be invalidated by updating the version ID.
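Conceptually, the versioning scheme works like this (a minimal sketch with hypothetical names and an in-memory map standing in for the remote cache; not the actual BuildBuddy implementation):

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// versionStore stands in for the remote cache's SnapshotVersionMetadata
// entries, keyed by the version-independent parts of the snapshot key.
type versionStore map[string]string

// metadataKey identifies a workflow's version entry. It deliberately
// excludes the git branch, so every run of the workflow (on any branch)
// reads the same entry.
func metadataKey(configHash, actionName, platformHash string) string {
	return fmt.Sprintf("%s/%s/%s", configHash, actionName, platformHash)
}

// snapshotCacheKey mixes the current version ID into the snapshot lookup
// key, so bumping the version makes all older snapshots unreachable.
func snapshotCacheKey(vs versionStore, configHash, actionName, platformHash, branch string) string {
	version := vs[metadataKey(configHash, actionName, platformHash)] // "" if absent
	sum := sha256.Sum256([]byte(fmt.Sprintf("%s/%s/%s/%s/%s",
		configHash, actionName, platformHash, branch, version)))
	return fmt.Sprintf("%x", sum)
}

// invalidate bumps the version ID for one workflow only; other workflows'
// snapshots remain reachable under their unchanged keys.
func invalidate(vs versionStore, configHash, actionName, platformHash, newVersion string) {
	vs[metadataKey(configHash, actionName, platformHash)] = newVersion
}
```

Because each workflow has its own metadata entry, invalidating the checkstyle workflow's version leaves the tests workflow's snapshot key, and therefore its snapshots, untouched.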

Related issues: N/A

I have future PRs to:

  • add a button to the workflow page to invalidate all snapshots for the repo (in case you still want the existing behavior)
  • add a button to the action details page to invalidate the snapshot for a RBE action that ran on a firecracker snapshot

@maggie-lou maggie-lou force-pushed the invalidate_button branch 17 times, most recently from f43183d to 86ccdc0 Compare April 5, 2024 17:29
@maggie-lou maggie-lou changed the title WIP snapshot invalidation Only invalidate the specific workflow snapshot when re-running a workflow from a clean workspace Apr 5, 2024
@maggie-lou maggie-lou force-pushed the invalidate_button branch 3 times, most recently from 67c62ee to c3deae1 Compare April 5, 2024 21:34
@maggie-lou maggie-lou marked this pull request as ready for review April 5, 2024 21:38
@maggie-lou maggie-lou force-pushed the invalidate_button branch 2 times, most recently from 487cfcc to 0ec6641 Compare April 7, 2024 19:51
@bduffany (Member) left a comment

There are a couple of things about the current approach in this PR that involve some tradeoffs:

  1. InvalidateSnapshot depends on the ExecuteResponse existing in cache in order to know which SnapshotKey to invalidate. If the ExecuteResponse has expired from cache, then the user will see an error when they click the "Invalidate snapshots" button. This seems like it would be frustrating, although it seems likely that in most cases the user would be clicking this on a workflow that executed recently and therefore hasn't expired. Roughly speaking, it feels error-prone that invalidating something in the cache depends on something else already existing in the cache, which has a different lifecycle and can expire sooner.
  2. The snapshot version is stored in AC, and adds a dependency on an AC lookup to know which snapshot to use. If this AC entry happens to expire before the snapshot manifest expires, then it will be as though the snapshot was not invalidated (since we default to "" for the version key). Roughly speaking, it feels a bit error-prone that we are invalidating a cache entry by adding a different entry to the cache which has a different lifecycle and may expire sooner.
  3. There is additional complexity being added, both in terms of the code being added and the amount of data dependencies being incorporated into the snapshot key, which is already somewhat complex.

What the PR is buying is the ability to invalidate snapshots at the workflow action level (e.g. "Test" vs "Checkstyle") rather than at the repo level. However, I don't think we have gotten any feedback about this yet, and I suspect that most people just use workflows for CI and are running just a single workflow action (though it would help to have the data for this).

My overall analysis of the cost/benefit is that it seems more beneficial to stick with the current approach of bumping the remote instance name since it is simpler and more robust, and probably good enough. wdyt?

@maggie-lou (Contributor Author):
> There are a couple of things about the current approach in this PR that involve some tradeoffs:

Yeah, I definitely see where you're coming from. My main motivation for wanting this series of changes was so that we can invalidate a snapshot for an RBE action (non-workflow). Currently the only way to do that is to ask our customers to reset a cache-busting platform property in their BUILD file. That's a clunky customer experience, and it makes it harder for us to debug snapshot issues. This seems like a more valuable feature to have, but let me know what you think.

I understand your concerns around the version metadata expiring earlier than the snapshot manifest in theory. In practice I'd expect that to never happen, because the version metadata key should be read more frequently than the manifests, since it doesn't include the git branch. So every time we run a workflow, no matter the branch, the access timestamp should get refreshed.

(I looked into a couple of our customers' workflows, and a common pattern seems to be having one CI workflow and one linting workflow. But I agree with your analysis that invalidating both probably isn't a huge deal right now.)

@bduffany (Member) commented Apr 8, 2024

> My main motivation for wanting this series of changes was so that we can invalidate a snapshot for an RBE action (non-workflow).

What would the changes look like for making this work for regular actions? I guess maybe adding the executor_host_id to the snapshot key (since regular actions are shareable only within an executor) and storing the snapshot version ActionResult remotely instead of locally?

I think we'd still have the same issue with the remote snapshot version having a different lifecycle than the local snapshot version, which could be confusing. E.g. you might invalidate a snapshot thinking that you're making it inaccessible so it will expire from cache, but this is technically not what's happening. Just wondering if there's a simple way that we can avoid this problem.

@maggie-lou (Contributor Author):
> What would the changes look like for making this work for regular actions?

I was just thinking of going the simplest approach of adding almost the exact same button as in this PR to the action details page and continuing to store the snapshot version remotely. It would invalidate snapshots on all executors.

Adding the executor_host_id to the snapshot key would add some complexity because we'd need to route invalidation requests to the correct executor. Our routing should try to route similar requests to the same executor anyway, so it should be okay to invalidate snapshots across multiple hosts.

In this implementation, the version ID starts as the empty string. We could change it to set a value initially. That way, if the version metadata in the cache expires, we would essentially generate a new version ID and invalidate all existing snapshots. This would lead to more snapshot invalidations (false positives), but it would prevent potentially falling back to undesirable snapshots after the snapshot version metadata has disappeared.
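The fail-closed default being proposed can be sketched as follows (hypothetical names; an in-memory map stands in for the remote cache):

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
)

// getOrCreateVersion defaults the version to a freshly generated value
// rather than "". If the SnapshotVersionMetadata entry has expired from
// cache, a new version ID is minted, so any snapshots written under the
// lost version become unreachable (a deliberate false positive) instead
// of silently matching an empty-string version.
func getOrCreateVersion(store map[string]string, key string) string {
	if v, ok := store[key]; ok {
		return v
	}
	buf := make([]byte, 16)
	rand.Read(buf) // crypto/rand; error handling omitted in this sketch
	v := hex.EncodeToString(buf)
	store[key] = v
	return v
}
```

The tradeoff: losing the metadata entry now costs a rebuild from a clean snapshot rather than risking reuse of a snapshot that was meant to be invalidated.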

@bduffany (Member) commented Apr 8, 2024

> I was just thinking of going the simplest approach of adding almost the exact same button as in this PR to the action details page and continuing to store the snapshot version remotely.

Ah ok I missed that this PR is already storing the version remotely too even for local sharing.

> Our routing should try to route similar requests to the same executor anyway, so it should be okay to invalidate snapshots across multiple hosts.

ok, this seems reasonable 👍

> We could change it to originally set a value. That way if the version metadata in the cache expires, we would essentially generate a new version ID and invalidate all existing snapshots.

I think this would be a good idea even though there is the possibility of false positives - if actions are executed frequently enough then we would expect the version to stay fresh in cache - do you want to do that in this PR?

@maggie-lou (Contributor Author):
> I think this would be a good idea even though there is the possibility of false positives - if actions are executed frequently enough then we would expect the version to stay fresh in cache - do you want to do that in this PR?

Yeah I'll add that to this PR.

@maggie-lou maggie-lou requested a review from bduffany April 8, 2024 22:42
@maggie-lou (Contributor Author):
> Yeah I'll add that to this PR.

Added


// Task and configuration hash stay consistent, so that the only thing
// changing in the snapshot key is the version
task := &repb.ExecutionTask{}
@bduffany (Member) commented on this snippet:

If remoteEnabled is true, should we also make this a CI runner task so that we exercise the remote sharing codepath? Or set --debug_force_remote_snapshots?

@maggie-lou (Contributor Author) replied:
The ci_runner check (and debug_force_remote_snapshots) is applied at the firecracker level, in order to generate the remoteEnabled boolean passed to the snaploader code. The snaploader tests shouldn't need to worry about that, because we can just directly pass the boolean.

proto/auditlog.proto (Outdated; resolved)
Comment on lines 2320 to 2341
message SnapshotKey {
// Remote instance name associated with the snapshot.
string instance_name = 1;

// SHA256 hash of the Platform proto (exec properties) associated with the VM
// snapshot.
string platform_hash = 2;

// SHA256 Hash of the VMConfiguration of the paused snapshot.
string configuration_hash = 3;

// Git ref associated with the snapshot. For workflows, this represents the
// branch that was checked out when running the workflow.
string ref = 5;

// If set, this key corresponds to a specific snapshot run.
// If not set, this key should fetch the newest snapshot matching the other
// parameters.
string snapshot_id = 6;
}
@bduffany (Member):

Should this proto have the new version field?

Relatedly, it seems like we could easily forget to update this proto if we update the one in firecracker.proto. Wdyt about removing the one in firecracker.proto and having this be the canonical one? (Also maybe un-nest it from VMMetadata, since nesting makes the generated code slightly less readable IMO.)

@maggie-lou (Contributor Author) replied:

What do you think about this change? (WIP - linking for conceptual feedback) https://github.com/buildbuddy-io/buildbuddy/pull/6341/files

I agree with your rec to consolidate the protos

@bduffany (Member):

I like it; that seems like a good use of auxiliary_metadata.

@maggie-lou maggie-lou force-pushed the invalidate_button branch 5 times, most recently from fdd15f7 to a8dd9cb Compare April 12, 2024 17:31
@maggie-lou (Contributor Author):

This PR depends on VMMetadata being set in a different place in the execute response, so I'm going to wait to deploy this until a week after #6341.

@maggie-lou maggie-lou merged commit 2fb25d5 into master Apr 18, 2024
17 of 19 checks passed
@maggie-lou maggie-lou deleted the invalidate_button branch April 18, 2024 16:35