
Only invalidate the specific workflow snapshot when re-running a workflow from a clean workspace #6303

Merged
merged 4 commits into from
Apr 18, 2024

Conversation

@maggie-lou (Contributor) commented Apr 4, 2024

Currently, when users hit "Re-run from a clean workspace" on a workflow invocation, it invalidates all snapshots for that repo. This PR changes the behavior to invalidate only the snapshot related to that workflow run.

For example, if you have multiple workflows (checkstyle and tests) and you hit the "rerun from clean" button on the checkstyle workflow, only the checkstyle snapshot will be invalidated; the tests snapshot will remain valid.

This is implemented by storing a version ID in a SnapshotVersionMetadata entry in the remote cache. For the same workflow (consistent VM configuration, action name, and platform hash), previous snapshots can be invalidated by updating the version ID.
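Conceptually, the versioning scheme works like this (a minimal sketch with hypothetical names and an in-memory map standing in for the remote cache; not the actual BuildBuddy implementation):

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// versionStore stands in for the remote cache's SnapshotVersionMetadata
// entries, keyed by the version-independent parts of the snapshot key.
type versionStore map[string]string

// metadataKey identifies a workflow's version entry. It deliberately
// excludes the git branch, so every run of the workflow (on any branch)
// reads the same entry.
func metadataKey(configHash, actionName, platformHash string) string {
	return fmt.Sprintf("%s/%s/%s", configHash, actionName, platformHash)
}

// snapshotCacheKey mixes the current version ID into the snapshot lookup
// key, so bumping the version makes all older snapshots unreachable.
func snapshotCacheKey(vs versionStore, configHash, actionName, platformHash, branch string) string {
	version := vs[metadataKey(configHash, actionName, platformHash)] // "" if absent
	sum := sha256.Sum256([]byte(fmt.Sprintf("%s/%s/%s/%s/%s",
		configHash, actionName, platformHash, branch, version)))
	return fmt.Sprintf("%x", sum)
}

// invalidate bumps the version ID for one workflow only; other workflows'
// snapshots remain reachable under their unchanged keys.
func invalidate(vs versionStore, configHash, actionName, platformHash, newVersion string) {
	vs[metadataKey(configHash, actionName, platformHash)] = newVersion
}
```

Because each workflow has its own metadata entry, invalidating the checkstyle workflow's version leaves the tests workflow's snapshot key, and therefore its snapshots, untouched.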

Related issues: N/A

I have future PRs to:

  • add a button to the workflow page to invalidate all snapshots for the repo (in case you still want the existing behavior)
  • add a button to the action details page to invalidate the snapshot for a RBE action that ran on a firecracker snapshot

@maggie-lou maggie-lou force-pushed the invalidate_button branch 17 times, most recently from f43183d to 86ccdc0 Compare April 5, 2024 17:29
@maggie-lou maggie-lou changed the title WIP snapshot invalidation Only invalidate the specific workflow snapshot when re-running a workflow from a clean workspace Apr 5, 2024
@maggie-lou maggie-lou force-pushed the invalidate_button branch 3 times, most recently from 67c62ee to c3deae1 Compare April 5, 2024 21:34
@maggie-lou maggie-lou marked this pull request as ready for review April 5, 2024 21:38
@maggie-lou maggie-lou force-pushed the invalidate_button branch 2 times, most recently from 487cfcc to 0ec6641 Compare April 7, 2024 19:51
@bduffany (Member) left a comment

There are a couple of things about the current approach in this PR that involve some tradeoffs:

  1. InvalidateSnapshot depends on the ExecuteResponse existing in cache in order to know which SnapshotKey to invalidate. If the ExecuteResponse has expired from cache, then the user will see an error when they click the "Invalidate snapshots" button. This seems like it would be frustrating, although it seems likely that in most cases the user would be clicking this on a workflow that executed recently and therefore hasn't expired. Roughly speaking, it feels error-prone that invalidating something in the cache depends on something else already existing in the cache, which has a different lifecycle and can expire sooner.
  2. The snapshot version is stored in AC, and adds a dependency on an AC lookup to know which snapshot to use. If this AC entry happens to expire before the snapshot manifest expires, then it will be as though the snapshot was not invalidated (since we default to "" for the version key). Roughly speaking, it feels a bit error-prone that we are invalidating a cache entry by adding a different entry to the cache which has a different lifecycle and may expire sooner.
  3. There is additional complexity being added, both in terms of the code being added and the amount of data dependencies being incorporated into the snapshot key, which is already somewhat complex.

What the PR is buying is the ability to invalidate snapshots at the workflow action level (e.g. "Test" vs "Checkstyle") rather than at the repo level. However, I don't think we have gotten any feedback about this yet, and I suspect that most people just use workflows for CI and are running just a single workflow action (though it would help to have the data for this).

My overall analysis of the cost/benefit is that it seems more beneficial to stick with the current approach of bumping the remote instance name since it is simpler and more robust, and probably good enough. wdyt?

@maggie-lou (Contributor Author):
> There are a couple of things about the current approach in this PR that involve some tradeoffs:

Yeah, I definitely see where you're coming from. My main motivation for wanting this series of changes was so that we can invalidate a snapshot for an RBE action (non-workflow). Currently the only way to do that is to ask our customers to reset a cache-busting platform property in their BUILD file. That's a clunky customer experience, and it makes it harder for us to debug snapshot issues. This seems like a more valuable feature to have, but let me know what you think.

I understand your concerns around the version metadata expiring earlier than the snapshot manifest in theory. In practice I'd expect that to never happen, because the version metadata key should be read more frequently than the manifests, since it doesn't include the git branch. So every time we run a workflow, no matter the branch, the access timestamp should get refreshed.

(I looked into a couple of our customers' workflows, and a common pattern seems to be having one CI workflow and one linting workflow. But I agree with your analysis that invalidating both probably isn't a huge deal right now.)

@bduffany (Member) commented Apr 8, 2024

> My main motivation for wanting this series of changes was so that we can invalidate a snapshot for an RBE action (non-workflow).

What would the changes look like for making this work for regular actions? I guess maybe adding the executor_host_id to the snapshot key (since regular actions are shareable only within an executor) and storing the snapshot version ActionResult remotely instead of locally?

I think we'd still have the same issue with the remote snapshot version having a different lifecycle than the local snapshot version, which could be confusing. E.g. you might invalidate a snapshot thinking that you're making it inaccessible so it will expire from cache, but this is technically not what's happening. Just wondering if there's a simple way that we can avoid this problem.

@maggie-lou (Contributor Author):
> What would the changes look like for making this work for regular actions?

I was just thinking of going the simplest approach of adding almost the exact same button as in this PR to the action details page and continuing to store the snapshot version remotely. It would invalidate snapshots on all executors.

Adding the executor_host_id to the snapshot key would add some complexity because we'd need to route invalidation requests to the correct executor. Our routing should try to route similar requests to the same executor anyway, so it should be okay to invalidate snapshots across multiple hosts.

In this implementation, the version ID starts as the empty string. We could change it to set a value initially. That way, if the version metadata in the cache expires, we would essentially generate a new version ID and invalidate all existing snapshots. This would lead to more snapshot invalidations (false positives), but it would prevent potentially falling back to undesirable snapshots after the snapshot version metadata has disappeared.
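The fail-closed default being proposed can be sketched as follows (hypothetical names; an in-memory map stands in for the remote cache):

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
)

// getOrCreateVersion defaults the version to a freshly generated value
// rather than "". If the SnapshotVersionMetadata entry has expired from
// cache, a new version ID is minted, so any snapshots written under the
// lost version become unreachable (a deliberate false positive) instead
// of silently matching an empty-string version.
func getOrCreateVersion(store map[string]string, key string) string {
	if v, ok := store[key]; ok {
		return v
	}
	buf := make([]byte, 16)
	rand.Read(buf) // crypto/rand; error handling omitted in this sketch
	v := hex.EncodeToString(buf)
	store[key] = v
	return v
}
```

The tradeoff: losing the metadata entry now costs a rebuild from a clean snapshot rather than risking reuse of a snapshot that was meant to be invalidated.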

@bduffany (Member) commented Apr 8, 2024

> I was just thinking of going the simplest approach of adding almost the exact same button as in this PR to the action details page and continuing to store the snapshot version remotely.

Ah ok I missed that this PR is already storing the version remotely too even for local sharing.

> Our routing should try to route similar requests to the same executor anyway, so it should be okay to invalidate snapshots across multiple hosts.

ok, this seems reasonable 👍

> We could change it to originally set a value. That way if the version metadata in the cache expires, we would essentially generate a new version ID and invalidate all existing snapshots.

I think this would be a good idea even though there is the possibility of false positives - if actions are executed frequently enough then we would expect the version to stay fresh in cache - do you want to do that in this PR?

@maggie-lou (Contributor Author):
> I think this would be a good idea even though there is the possibility of false positives - if actions are executed frequently enough then we would expect the version to stay fresh in cache - do you want to do that in this PR?

Yeah I'll add that to this PR.

@maggie-lou maggie-lou requested a review from bduffany April 8, 2024 22:42
@maggie-lou (Contributor Author):
> Yeah I'll add that to this PR.

Added


// Task and configuration hash stay consistent, so that the only thing
// changing in the snapshot key is the version
task := &repb.ExecutionTask{}
@bduffany (Member) commented on this snippet:

If remoteEnabled is true, should we also make this a CI runner task so that we exercise the remote sharing codepath? Or set --debug_force_remote_snapshots?

@maggie-lou (Contributor Author) replied:
The ci_runner check (and debug_force_remote_snapshots) is applied at the firecracker level, in order to generate the remoteEnabled boolean passed to the snaploader code. The snaploader tests shouldn't need to worry about that, because we can just directly pass the boolean.

proto/auditlog.proto (Outdated; resolved)
Comment on lines 2320 to 2341
message SnapshotKey {
// Remote instance name associated with the snapshot.
string instance_name = 1;

// SHA256 hash of the Platform proto (exec properties) associated with the VM
// snapshot.
string platform_hash = 2;

// SHA256 Hash of the VMConfiguration of the paused snapshot.
string configuration_hash = 3;

// Git ref associated with the snapshot. For workflows, this represents the
// branch that was checked out when running the workflow.
string ref = 5;

// If set, this key corresponds to a specific snapshot run.
// If not set, this key should fetch the newest snapshot matching the other
// parameters.
string snapshot_id = 6;
}
@bduffany (Member):

Should this proto have the new version field?

Relatedly, it seems like we could easily forget to update this proto if we update the one in firecracker.proto. Wdyt about removing the one in firecracker.proto and having this be the canonical one? (Also maybe un-nest it from VMMetadata, since nesting makes the generated code slightly less readable IMO.)

@maggie-lou (Contributor Author) replied:

What do you think about this change? (WIP - linking for conceptual feedback) https://github.com/buildbuddy-io/buildbuddy/pull/6341/files

I agree with your rec to consolidate the protos

@bduffany (Member):

I like it; that seems like a good use of auxiliary_metadata.

@maggie-lou maggie-lou force-pushed the invalidate_button branch 5 times, most recently from fdd15f7 to a8dd9cb Compare April 12, 2024 17:31
@maggie-lou (Contributor Author):

This PR depends on VMMetadata being set in a different place in the execute response, so I'm going to wait to deploy this until a week after #6341.

@maggie-lou maggie-lou merged commit 2fb25d5 into master Apr 18, 2024
17 of 19 checks passed
@maggie-lou maggie-lou deleted the invalidate_button branch April 18, 2024 16:35