Only invalidate the specific workflow snapshot when re-running a workflow from a clean workspace #6303
There are a couple of things about the current approach in this PR that involve some tradeoffs:
- InvalidateSnapshot depends on the ExecuteResponse existing in cache in order to know which SnapshotKey to invalidate. If the ExecuteResponse has expired from cache, then the user will see an error when they click the "Invalidate snapshots" button. This seems like it would be frustrating, although it seems likely that in most cases the user would be clicking this on a workflow that executed recently and therefore hasn't expired. Roughly speaking, it feels error-prone that invalidating something in the cache depends on something else already existing in the cache, which has a different lifecycle and can expire sooner.
- The snapshot version is stored in AC, and adds a dependency on an AC lookup to know which snapshot to use. If this AC entry happens to expire before the snapshot manifest expires, then it will be as though the snapshot was not invalidated (since we default to "" for the version key). Roughly speaking, it feels a bit error-prone that we are invalidating a cache entry by adding a different entry to the cache which has a different lifecycle and may expire sooner.
- There is additional complexity being added, both in terms of the code being added and the amount of data dependencies being incorporated into the snapshot key, which is already somewhat complex.
What the PR is buying is the ability to invalidate snapshots at the workflow action level (e.g. "Test" vs "Checkstyle") rather than at the repo level. However, I don't think we have gotten any feedback about this yet, and I suspect that most people just use workflows for CI and are running just a single workflow action (though it would help to have the data for this).
My overall analysis of the cost/benefit is that it seems more beneficial to stick with the current approach of bumping the remote instance name since it is simpler and more robust, and probably good enough. wdyt?
Yeah, I definitely see where you're coming from. My main motivation for wanting this series of changes was so that we can invalidate a snapshot for an RBE action (non-workflow). Currently the only way to do that is to ask our customers to reset a cache-busting platform property in their BUILD file. That's a clunky customer experience, and it makes it harder for us to debug snapshot issues. This seems like a more valuable feature to have, but let me know what you think.

I understand your concerns around the version metadata expiring earlier than the snapshot manifest in theory. In practice I'd expect that to never happen, because the version metadata key should be read more frequently than the manifests, since it doesn't include the git branch. So every time we run a workflow, no matter the branch, the access timestamp should get refreshed.

(I looked into a couple of our customers' workflows, and a common pattern is having one CI workflow and one linting workflow. But I agree with your analysis that invalidating both probably isn't a huge deal right now.)
What would the changes look like for making this work for regular actions? I guess maybe adding the `executor_host_id` to the snapshot key (since regular actions are shareable only within an executor) and storing the snapshot version ActionResult remotely instead of locally? I think we'd still have the same issue with the remote snapshot version having a different lifecycle than the local snapshot version, which could be confusing. E.g. you might invalidate a snapshot thinking that you're making it inaccessible so it will expire from cache, but this is technically not what's happening. Just wondering if there's a simple way that we can avoid this problem.
I was just thinking of going with the simplest approach: adding almost the exact same button as in this PR to the action details page, and continuing to store the snapshot version remotely. It would invalidate snapshots on all executors. Adding the `executor_host_id` to the snapshot key would add some complexity because we'd need to route invalidation requests to the correct executor. Our routing should try to route similar requests to the same executor anyway, so it should be okay to invalidate snapshots across multiple hosts.

In this implementation, the version ID starts as the empty string. We could change it to originally set a value. That way, if the version metadata in the cache expires, we would essentially generate a new version ID and invalidate all existing snapshots. This would lead to more snapshot invalidations (false positives), but would prevent potentially falling back to undesirable snapshots because the snapshot version metadata has disappeared.
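A minimal sketch of the fallback idea described above: if the version metadata entry is missing from the cache (e.g. it expired), mint a fresh random version ID instead of defaulting to the empty string, so older snapshots stop matching rather than being silently reused. The `versionStore` type and its in-memory map are hypothetical stand-ins for the remote cache, not BuildBuddy's actual implementation.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// versionStore is a hypothetical stand-in for the remote cache that
// holds snapshot version metadata.
type versionStore struct {
	entries map[string]string
}

// newVersionID mints a random version ID.
func newVersionID() string {
	b := make([]byte, 8)
	rand.Read(b)
	return hex.EncodeToString(b)
}

// snapshotVersion returns the stored version ID for key, or mints and
// stores a new one if the entry is missing (e.g. because it expired).
// Writing a fresh ID effectively invalidates all older snapshots,
// since their keys were computed with the previous version.
func (s *versionStore) snapshotVersion(key string) string {
	if v, ok := s.entries[key]; ok {
		return v
	}
	v := newVersionID()
	s.entries[key] = v
	return v
}

func main() {
	s := &versionStore{entries: map[string]string{}}
	v1 := s.snapshotVersion("workflow-ci")
	v2 := s.snapshotVersion("workflow-ci") // still cached: same version
	fmt.Println("stable while cached:", v1 == v2)

	delete(s.entries, "workflow-ci") // simulate cache expiry
	v3 := s.snapshotVersion("workflow-ci")
	fmt.Println("fresh after expiry:", v1 != v3)
}
```

The false-positive tradeoff is visible here: any expiry forces a new version ID and throws away snapshots that may still have been fine, in exchange for never matching against a snapshot whose version metadata was lost.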
Ah ok I missed that this PR is already storing the version remotely too even for local sharing.
ok, this seems reasonable 👍
I think this would be a good idea even though there is the possibility of false positives: if actions are executed frequently enough, then we would expect the version to stay fresh in cache. Do you want to do that in this PR?

Yeah, I'll add that to this PR.

Added.
```go
// Task and configuration hash stay consistent, so that the only thing
// changing in the snapshot key is the version
task := &repb.ExecutionTask{}
```
If `remoteEnabled` is true, should we also make this a CI runner task so that we exercise the remote sharing codepath? Or set `--debug_force_remote_snapshots`?
The `ci_runner` check (and `--debug_force_remote_snapshots`) are applied at the firecracker level, in order to generate the `remoteEnabled` boolean to pass to snaploader code. The snaploader tests shouldn't need to worry about that because we can just directly pass the boolean.
proto/remote_execution.proto
```proto
message SnapshotKey {
  // Remote instance name associated with the snapshot.
  string instance_name = 1;

  // SHA256 hash of the Platform proto (exec properties) associated with the VM
  // snapshot.
  string platform_hash = 2;

  // SHA256 hash of the VMConfiguration of the paused snapshot.
  string configuration_hash = 3;

  // Git ref associated with the snapshot. For workflows, this represents the
  // branch that was checked out when running the workflow.
  string ref = 5;

  // If set, this key corresponds to a specific snapshot run.
  // If not set, this key should fetch the newest snapshot matching the other
  // parameters.
  string snapshot_id = 6;
}
```
Should this proto have the new version field?

Relatedly, it seems like we could easily forget to update this proto if we update the one in firecracker.proto. Wdyt about removing the one in firecracker.proto and having this be the canonical one? (Also, maybe un-nest it from `VMMetadata`, since that makes the generated code slightly less readable IMO.)
What do you think about this change? (WIP - linking for conceptual feedback) https://github.com/buildbuddy-io/buildbuddy/pull/6341/files
I agree with your rec to consolidate the protos
I like it, that seems like a good use of auxiliary_metadata
This PR depends on VMMetadata being set in a different place in the execute response, so I'm going to wait to deploy this until a week after #6341.
Currently, when users hit "Re-run from a clean workspace" on a workflow invocation, it invalidates all snapshots for that repo. This PR changes the behavior to only invalidate the snapshot related to that workflow run.

For example, if you have multiple workflows (checkstyle, tests) and you hit the "rerun from clean" button on the checkstyle workflow, it will only invalidate the checkstyle snapshot. The tests snapshot will still be valid.
This is implemented by storing a version ID in a SnapshotVersionMetadata entry in the remote cache. For the same workflow (consistent vm configuration, action name, and platform hash), you can invalidate previous snapshots by updating the version ID.
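A minimal sketch of this mechanism, under the assumption that the version ID is hashed into the snapshot lookup key alongside the other key fields: bumping the version makes all previously written snapshots unreachable without needing to delete them. The `snapshotKey` struct and its `digest` method here are illustrative, not the actual snaploader code.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// snapshotKey is a hypothetical flattened version of the fields that
// identify a snapshot, plus the new version ID.
type snapshotKey struct {
	instanceName      string
	platformHash      string
	configurationHash string
	ref               string
	version           string
}

// digest computes the cache lookup key. Because the version ID is part
// of the hashed input, writing a new version ID to the
// SnapshotVersionMetadata entry changes every digest, so older
// snapshots stop matching (i.e. they are invalidated in place).
func (k snapshotKey) digest() string {
	h := sha256.Sum256([]byte(
		k.instanceName + "|" + k.platformHash + "|" +
			k.configurationHash + "|" + k.ref + "|" + k.version))
	return fmt.Sprintf("%x", h)
}

func main() {
	k := snapshotKey{
		instanceName:      "ci",
		platformHash:      "p1",
		configurationHash: "c1",
		ref:               "main",
		version:           "1",
	}
	old := k.digest()
	k.version = "2" // "invalidate" by bumping the version ID
	fmt.Println("digest changed:", old != k.digest())
}
```

Note that this keys invalidation to the fields that stay consistent across runs of the same workflow (VM configuration, platform hash), which is why only the re-run workflow's snapshot is affected and sibling workflows keep their snapshots.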
Related issues: N/A
I have future PRs to: