volume snapshot delete and restore support #10267

Merged
merged 4 commits into main from pavel/10259-1 on Jun 14, 2022

Conversation

Contributor

@sagor999 sagor999 commented May 25, 2022

Description

This PR adds three features:

  1. When a new volume snapshot is created for a workspace, the old volume snapshot for that workspace is deleted.
  2. When a workspace is opened in a cluster where its volume snapshot does not exist yet, the volume snapshot is restored there.
  3. When a workspace is deleted, all volume snapshots associated with that workspace are garbage collected.

Fixes #10259
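
Conceptually, the three triggers above map onto hooks roughly like the following minimal sketch. All names here are hypothetical illustrations, not functions from the actual codebase:

interface VolumeSnapshotInfo {
    id: string;           // snapshot id as stored in the DB
    volumeHandle: string; // cloud-provider handle of the disk snapshot
}

// Hypothetical helpers standing in for the real delete/restore paths.
declare function deleteVolumeSnapshot(vs: VolumeSnapshotInfo): Promise<void>;
declare function restoreVolumeSnapshot(clusterName: string, vs: VolumeSnapshotInfo): Promise<void>;
declare function clusterHasSnapshot(clusterName: string, snapshotId: string): Promise<boolean>;

// 1. A new snapshot was taken for a workspace: drop the previous one.
async function onNewSnapshot(fresh: VolumeSnapshotInfo, old?: VolumeSnapshotInfo): Promise<void> {
    if (old && old.id !== fresh.id) {
        await deleteVolumeSnapshot(old);
    }
}

// 2. A workspace starts in a cluster that has not seen its snapshot yet:
//    restore the VolumeSnapshot object there first.
async function onWorkspaceStart(clusterName: string, vs: VolumeSnapshotInfo): Promise<void> {
    if (!(await clusterHasSnapshot(clusterName, vs.id))) {
        await restoreVolumeSnapshot(clusterName, vs);
    }
}

// 3. A workspace is deleted: garbage-collect all of its snapshots.
async function onWorkspaceDeleted(snapshots: VolumeSnapshotInfo[]): Promise<void> {
    await Promise.all(snapshots.map((vs) => deleteVolumeSnapshot(vs)));
}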

How to test

Loom video showing this in action:
https://www.loom.com/share/51b41802054d4de89902aaa358171624

Release Notes

none

Documentation

@sagor999
Contributor Author

sagor999 commented May 26, 2022

/werft run

👍 started the job as gitpod-build-pavel-10259-1.14
(with .werft/ from main)

@sagor999 sagor999 changed the title from "wip" to "volume snapshot delete and restore support" May 27, 2022
@sagor999 sagor999 marked this pull request as ready for review May 27, 2022 01:08
@sagor999 sagor999 requested review from a team May 27, 2022 01:08
@sagor999 sagor999 requested a review from aledbf as a code owner May 27, 2022 01:08
@github-actions github-actions bot added the team: webapp and team: workspace labels May 27, 2022
@sagor999 sagor999 force-pushed the pavel/10259-1 branch 2 times, most recently from 9be3d79 to 753cbc6 on May 28, 2022 18:29
@geropl
Member

geropl commented May 30, 2022

Starting to review now

@geropl geropl self-assigned this May 30, 2022
@sagor999 sagor999 marked this pull request as draft June 1, 2022 19:45
@sagor999
Contributor Author

sagor999 commented Jun 7, 2022

/werft run with-clean-slate-deployment=true

👍 started the job as gitpod-build-pavel-10259-1.23
(with .werft/ from main)

@sagor999 sagor999 force-pushed the pavel/10259-1 branch 3 times, most recently from 6157db8 to d1c06db on June 8, 2022 18:51
@sagor999 sagor999 requested a review from geropl June 8, 2022 19:14
@sagor999
Contributor Author

sagor999 commented Jun 8, 2022

/hold
to ensure I get a review from the webapp team.

let index = 0;

let availableClusters = allClusters.filter((c) => c.state === "available");
for (let cluster of availableClusters) {
Member

I expected this method to connect to a specific ws-manager based on a "region" field, either retrieved from VolumeSnapshot, Workspace or WorkspaceInstance. Iterating over all currently connected clients makes it impossible to detect cases where we are no longer connected to a certain ws-manager.

Or is this due to the fact that we might have multiple ws-managers (with different names) in the same GCloud region, and we do not know how to map those? 🤔

Contributor Author

Not quite. Imagine the following scenario:
An EU cluster where a workspace was started from Volume Snapshot X.
A US cluster where a workspace was also started from the same Volume Snapshot X (from a prebuild, for example).

Then we want to GC that volume snapshot.
We need to do two things now:

  1. Ensure that we delete the k8s object representing this snapshot from ALL clusters (so that we do not accumulate those over time, as that would affect the k8s API server)
  2. Ensure we delete the volume snapshot object from the cloud provider

So this code ensures that. That is why it attempts to talk to all existing clusters and clean the snapshot up from all of them.
If we are no longer connected to some cluster, I think that is fine, as I would assume that cluster is being decommissioned.
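
A minimal sketch of that cross-cluster cleanup, assuming hypothetical client and type names; the actual RPC surface and the rule for when the cloud-provider snapshot itself is removed may differ in the PR:

interface Cluster {
    name: string;
    state: "available" | "cordoned" | "decommissioned";
}

interface VolumeSnapshot {
    id: string;
    volumeHandle: string;
}

// Hypothetical ws-manager client interface for this sketch.
interface WsManagerClient {
    deleteVolumeSnapshot(vs: VolumeSnapshot, softDelete: boolean): Promise<void>;
}

declare function clientForCluster(cluster: Cluster): WsManagerClient;

async function gcVolumeSnapshot(allClusters: Cluster[], vs: VolumeSnapshot): Promise<void> {
    let index = 0;
    const availableClusters = allClusters.filter((c) => c.state === "available");
    for (const cluster of availableClusters) {
        // Assumption: one cluster performs the "hard" delete that also removes the
        // snapshot from the cloud provider; every other cluster only drops its local
        // k8s VolumeSnapshot/VolumeSnapshotContent objects (soft delete).
        const softDelete = index !== 0;
        await clientForCluster(cluster).deleteVolumeSnapshot(vs, softDelete);
        index++;
    }
}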

Member

@geropl geropl Jun 13, 2022

We need to do two things now:

Ah, that explains it. Just to ensure we're on the same page: ws-manager does not remove the PVC object from the control plane when the corresponding workspace is stopped?

Ensure we delete the volume snapshot object from the cloud provider

I thought we just need to do this, and also wondered if this was something we put into content-service...? 🤔

Member

Sorry to "block" the merging of the PR here, but I feel I'm still missing some context to understand why the PVC lifecycle looks the way it does.

Contributor

Just to ensure we're on the same page: ws-manager does not remove the PVC object from the control plane when the corresponding workspace is stopped?

No. When the workspace stops, the PVC object is removed once the VolumeSnapshot object is ready to use (which means the snapshot of the PVC is complete).

Ensure we delete the volume snapshot object from the cloud provider

I thought we just need to do this, and also wondered if this was something we put into content-service...? 🤔

The lifecycle of PersistentVolumeClaim/VolumeSnapshot is managed by ws-manager.

Contributor Author

@geropl

Just to ensure we're on the same page: ws-manager does not remove the PVC object from the control plane when the corresponding workspace is stopped?

It does. There are two objects in play here. The PVC is the actual storage object; it only exists while the workspace is running. When the workspace is stopped, ws-manager "converts" the PVC into a VolumeSnapshot object whose data is stored with the cloud provider (outside of k8s control).

When we open the workspace again, a PVC is created and restored from the VolumeSnapshot object.

When it is time to delete/GC a volume snapshot, we might have the same VolumeSnapshot across multiple clusters (a prebuild that was opened in different clusters, for example). So we want to clean up that VolumeSnapshot object from all clusters (otherwise we would accumulate a lot of them over the lifetime of a cluster), and then also remove the actual snapshot from the cloud provider.

All of this work is currently done by ws-manager.

It would indeed be better to move that into content-service, but currently that is not possible, because content-service still lives in the webapp cluster and would have to be migrated into the workspace clusters first.
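
To summarize the lifecycle described above in a rough sketch, with hypothetical helper names standing in for what ws-manager actually does:

declare function createPVCFromSnapshot(workspaceId: string, snapshotId: string): Promise<void>;
declare function createEmptyPVC(workspaceId: string): Promise<void>;
declare function snapshotPVC(workspaceId: string): Promise<string>; // returns the VolumeSnapshot handle
declare function deletePVC(workspaceId: string): Promise<void>;

// Start: the PVC only exists while the workspace runs. If a prior VolumeSnapshot
// exists, the PVC is restored from it; otherwise the workspace starts with a fresh volume.
async function onWorkspaceStart(workspaceId: string, snapshotId?: string): Promise<void> {
    if (snapshotId) {
        await createPVCFromSnapshot(workspaceId, snapshotId);
    } else {
        await createEmptyPVC(workspaceId);
    }
}

// Stop: the PVC is "converted" into a VolumeSnapshot backed by a cloud-provider
// disk snapshot, and only then is the PVC itself removed.
async function onWorkspaceStop(workspaceId: string): Promise<string> {
    const snapshotHandle = await snapshotPVC(workspaceId);
    await deletePVC(workspaceId);
    return snapshotHandle;
}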

Member

Thx @sagor999 for syncing! 👍
I understood that:

  • workspace takes care of deleting the actual disk snapshots
  • we keep the cluster-local VolumeSnapshots and VolumeSnapshotContent in the cluster to optimize for fast workspace starts
  • but to avoid cluttering the clusters too much, we aim to remove those here whenever their disk snapshot gets deleted

It would be nice if we added some comments along those lines to the GC. 🙏
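
For example, the requested comment could sit above the cluster loop and summarize exactly those points; the wording below is only a suggestion:

// Why we iterate over *all* available clusters here:
//  - ws-manager deletes the actual cloud-provider disk snapshot.
//  - The cluster-local VolumeSnapshot/VolumeSnapshotContent objects are kept around
//    to keep subsequent workspace starts in that cluster fast.
//  - To avoid cluttering the clusters (and their k8s API servers), we remove those
//    cluster-local objects here whenever the underlying disk snapshot is deleted.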

@geropl
Member

geropl commented Jun 9, 2022

@sagor999 Regarding merging of this PR: once the initial review is through, it would be awesome to decouple this PR into smaller chunks. For instance:

  • DB changes
  • ws-manager changes
  • server+bridge changes

Contributor

@jenting jenting left a comment

LGTM 🛹 🚀

@jenting
Contributor

jenting commented Jun 14, 2022

Hey Pavel, I re-ran this PR to test it. After the workspace stopped, I clicked Delete Workspace in the GUI; however, the VolumeSnapshot is still there.
But I suspect the VolumeSnapshot should be deleted by ws-manager. Correct?
Besides that, I see the Loom video doesn't demo this part.

@sagor999
Contributor Author

sagor999 commented Jun 14, 2022

@jenting it will be deleted when the GC runs and deletes the workspace for real.
I simulated that part manually by updating the workspace deletion timestamp to be one week old; the GC then ran and cleaned everything up.
But thank you for double checking! ❤️

@sagor999
Contributor Author

sagor999 commented Jun 14, 2022

/werft run with-clean-slate-deployment=true

👍 started the job as gitpod-build-pavel-10259-1.40
(with .werft/ from main)

req.setId(vs.id);
req.setVolumeHandle(vs.volumeHandle);

let softDelete = true;
Member

nit: I already mentioned it, but just to highlight: this part especially feels a tad brittle, and it would be nice to find a better way to handle/encapsulate that logic.
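
One possible way to encapsulate that decision, sketched with hypothetical names and assuming the "one hard delete, the rest soft deletes" semantics discussed above:

interface VolumeSnapshot {
    id: string;
    volumeHandle: string;
}

interface SnapshotDeletion {
    clusterName: string;
    snapshot: VolumeSnapshot;
    softDelete: boolean;
}

// Plan one deletion per cluster: only the first cluster also removes the snapshot
// from the cloud provider; the others merely drop their local k8s objects.
function planSnapshotDeletions(clusterNames: string[], vs: VolumeSnapshot): SnapshotDeletion[] {
    return clusterNames.map((clusterName, index) => ({
        clusterName,
        snapshot: vs,
        softDelete: index !== 0,
    }));
}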

Member

@geropl geropl left a comment

LGTM 👍

@sagor999
Contributor Author

/unhold

@roboquat roboquat merged commit f0daee2 into main Jun 14, 2022
@roboquat roboquat deleted the pavel/10259-1 branch June 14, 2022 21:07
@roboquat roboquat added the deployed: webapp, deployed: workspace, and deployed labels Jun 16, 2022
Labels
deployed: webapp · deployed: workspace · deployed · release-note-none · size/XXL · team: webapp · team: workspace
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ws-manager: add support for deleting snapshotVolumes (to clean up old backups) on workspace delete
5 participants