volume snapshot delete and restore support #10267
Conversation
/werft run 👍 started the job as gitpod-build-pavel-10259-1.14
Force-pushed from 9be3d79 to 753cbc6
Starting to review now
/werft run with-clean-slate-deployment=true 👍 started the job as gitpod-build-pavel-10259-1.23
Force-pushed from 6157db8 to d1c06db
/hold
let index = 0;

let availableClusters = allClusters.filter((c) => c.state === "available");
for (let cluster of availableClusters) {
I expected this method to connect to a specific ws-manager based on a "region" field, either retrieved from VolumeSnapshot, Workspace or WorkspaceInstance. Iterating over all currently connected clients makes it impossible to detect cases where we are no longer connected to a certain ws-manager.
Or is this due to the fact that we might have multiple ws-managers (with different names) in the same GCloud region, and we do not know how to map those? 🤔
Not quite. Imagine the following scenario:
- an EU cluster where a workspace was started from Volume Snapshot X
- a US cluster where a workspace was also started from the same Volume Snapshot X (from a prebuild, for example)
Then we want to GC that volume snapshot. We need to do two things now:
- Ensure that we delete the k8s object representing this snapshot from ALL clusters (so we do not accumulate those over time, as that would affect the k8s API server)
- Ensure we delete the volume snapshot object from the cloud provider
That is what this code ensures; that is why it attempts to talk to all existing clusters and clean the snapshot up from all of them.
If we are no longer connected to some cluster, I think that is fine, as I would assume that cluster is getting decommissioned.
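To make the two-step cleanup concrete, here is a minimal TypeScript sketch of the flow described above; the types and helper callbacks are hypothetical placeholders rather than the actual ws-manager-bridge API.

// Hypothetical shapes; the real types live elsewhere in the Gitpod codebase.
interface Cluster {
    name: string;
    state: "available" | "cordoned" | "draining";
}
interface VolumeSnapshotRef {
    id: string;
    volumeHandle: string;
}

async function gcVolumeSnapshot(
    vs: VolumeSnapshotRef,
    allClusters: Cluster[],
    deleteInCluster: (cluster: Cluster, snapshotId: string) => Promise<void>,
    deleteFromCloudProvider: (volumeHandle: string) => Promise<void>,
): Promise<void> {
    // Step 1: remove the k8s VolumeSnapshot object from every available cluster.
    // The same snapshot may exist in several clusters (e.g. a prebuild opened in both EU and US).
    const availableClusters = allClusters.filter((c) => c.state === "available");
    for (const cluster of availableClusters) {
        await deleteInCluster(cluster, vs.id);
    }
    // Step 2: remove the actual disk snapshot held by the cloud provider.
    await deleteFromCloudProvider(vs.volumeHandle);
}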
> We need to do two things now:

Ah, that explains it. Just to ensure we're on the same page: ws-manager does not remove the PVC object from the control plane when the corresponding workspace is stopped?

> Ensure we delete volume snapshot object from cloud provider

I thought we just need to do this, and also wondered if this was something we put into content-service...? 🤔
Sorry to "block" the merging of the PR here, but I feel I'm still missing some context to understand why the PVC lifecycle looks the way it does.
> Just to ensure we're on the same page: ws-manager does not remove the PVC object from the control plane when the corresponding workspace is stopped?

No, when the workspace is stopped, the PVC object is removed only after the VolumeSnapshot object is ready to use (which means the snapshot of the PVC is done).

> Ensure we delete volume snapshot object from cloud provider
> I thought we just need to do this, and also wondered if this was something we put into content-service...? 🤔

The lifecycle of PersistentVolumeClaim/VolumeSnapshot is managed by ws-manager.
> Just to ensure we're on the same page: ws-manager does not remove the PVC object from the control plane when the corresponding workspace is stopped?

It does. There are two objects in play here. The PVC is the actual storage object; it only exists while the workspace is running. When the workspace is stopped, ws-manager "converts" the PVC into a VolumeSnapshot object whose backing snapshot is stored with the cloud provider (outside of k8s control).
When we open a workspace, a PVC is created and restored from the VolumeSnapshot object.
When it is time to delete/GC a volume snapshot, we might have the same VolumeSnapshot across multiple clusters (a prebuild that was opened in different clusters, for example). So we want to clean up that VolumeSnapshot object from all clusters (otherwise we would accumulate a lot of them over the lifecycle of the cluster), and then also remove the actual snapshot from the cloud provider.
All of this work is currently done by ws-manager.
It would indeed be better to move it into content-service, but that is not possible right now, because content-service still lives in the webapp cluster and would have to be migrated into the workspace clusters first.
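For readers less familiar with the Kubernetes side of this: the restore direction uses the standard PVC dataSource mechanism. A sketch in TypeScript object-literal form; the names and sizes are made up, and the real ws-manager implements this in Go.

// A new PVC restored from a previously taken VolumeSnapshot when a workspace is (re)opened.
const restoredPvc = {
    apiVersion: "v1",
    kind: "PersistentVolumeClaim",
    metadata: { name: "ws-abc123-pvc", namespace: "default" },
    spec: {
        accessModes: ["ReadWriteOnce"],
        resources: { requests: { storage: "30Gi" } },
        // dataSource tells the CSI driver to populate the new volume from the snapshot.
        dataSource: {
            name: "ws-abc123-snapshot",
            kind: "VolumeSnapshot",
            apiGroup: "snapshot.storage.k8s.io",
        },
    },
};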
Thx @sagor999 for syncing! 👍
I understood that:
- workspace takes care of deleting the actual disk snapshots
- we keep the cluster-local VolumeSnapshots and VolumeSnapshotContent in the cluster to optimize for fast workspace starts
- but to avoid cluttering the clusters too much, we aim to remove those here whenever their disk snapshot gets deleted
It would be nice if we added some comments along those lines to the GC. 🙏
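A possible wording for such comments, merely restating the bullet points above (this text is a suggestion, not taken from the PR):

// Volume snapshot GC:
//  - the workspace side (ws-manager) deletes the actual disk snapshot in the cloud provider;
//  - cluster-local VolumeSnapshot/VolumeSnapshotContent objects are kept around to allow fast workspace starts;
//  - to avoid cluttering the clusters, we remove those k8s objects here whenever their disk snapshot gets deleted.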
@sagor999 Regarding merging of this PR: Once the initial review is through, it would be awesome to decouple this PR into smaller chunks. For instance:
components/gitpod-db/src/typeorm/migration/1654628106102-VolumeSnapshotAddWSId.ts (outdated; resolved)
LGTM 🛹 🚀
Hey Pavel, I re-ran this PR to test; I found that after the workspace stopped, I clicked …
@jenting it will be deleted when GC runs and deletes the workspace for real.
/werft run with-clean-slate-deployment=true 👍 started the job as gitpod-build-pavel-10259-1.40
req.setId(vs.id);
req.setVolumeHandle(vs.volumeHandle);

let softDelete = true;
nit: Already mentioned it, but just to highlight: Especially this part feels a tad brittle, and it would be nice to find a better way to handle/encapsulate that logic.
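One possible shape for that encapsulation, sketched under the assumption that softDelete controls whether the cloud-provider snapshot is also removed (the actual semantics in the PR may differ):

interface VolumeSnapshotRef {
    id: string;
    volumeHandle: string;
}

// Assumed rule, for illustration only: every cluster removes its local k8s object,
// and only the final cluster also triggers the cloud-provider delete.
function shouldSoftDelete(clusterIndex: number, clusterCount: number): boolean {
    return clusterIndex < clusterCount - 1;
}

function buildDeleteRequests(vs: VolumeSnapshotRef, clusterNames: string[]) {
    return clusterNames.map((cluster, index) => ({
        cluster,
        id: vs.id,
        volumeHandle: vs.volumeHandle,
        softDelete: shouldSoftDelete(index, clusterNames.length),
    }));
}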
LGTM 👍
/unhold
Description
This PR adds two features:
Fixes #10259
How to test
Loom video showing this in action:
https://www.loom.com/share/51b41802054d4de89902aaa358171624
Release Notes
Documentation