volume snapshot delete and restore support #10267

Merged
merged 4 commits into main from pavel/10259-1 on Jun 14, 2022

Conversation

Contributor

@sagor999 sagor999 commented May 25, 2022

Description

This PR adds three features:

  1. When a new volume snapshot is created for a workspace, the old volume snapshot for that workspace is deleted.
  2. When a workspace is opened in a cluster where its volume snapshot does not exist yet, the volume snapshot is restored there.
  3. When a workspace is deleted, all volume snapshots associated with that workspace are garbage collected.

Fixes #10259
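
Conceptually, the three triggers above map onto hooks roughly like the following minimal sketch. All names here are hypothetical illustrations, not functions from the actual codebase:

interface VolumeSnapshotInfo {
    id: string;           // snapshot id as stored in the DB
    volumeHandle: string; // cloud-provider handle of the disk snapshot
}

// Hypothetical helpers standing in for the real delete/restore paths.
declare function deleteVolumeSnapshot(vs: VolumeSnapshotInfo): Promise<void>;
declare function restoreVolumeSnapshot(clusterName: string, vs: VolumeSnapshotInfo): Promise<void>;
declare function clusterHasSnapshot(clusterName: string, snapshotId: string): Promise<boolean>;

// 1. A new snapshot was taken for a workspace: drop the previous one.
async function onNewSnapshot(fresh: VolumeSnapshotInfo, old?: VolumeSnapshotInfo): Promise<void> {
    if (old && old.id !== fresh.id) {
        await deleteVolumeSnapshot(old);
    }
}

// 2. A workspace starts in a cluster that has not seen its snapshot yet:
//    restore the VolumeSnapshot object there first.
async function onWorkspaceStart(clusterName: string, vs: VolumeSnapshotInfo): Promise<void> {
    if (!(await clusterHasSnapshot(clusterName, vs.id))) {
        await restoreVolumeSnapshot(clusterName, vs);
    }
}

// 3. A workspace is deleted: garbage-collect all of its snapshots.
async function onWorkspaceDeleted(snapshots: VolumeSnapshotInfo[]): Promise<void> {
    await Promise.all(snapshots.map((vs) => deleteVolumeSnapshot(vs)));
}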

How to test

Loom video showing this in action:
https://www.loom.com/share/51b41802054d4de89902aaa358171624

Release Notes

none

Documentation

@sagor999
Contributor Author

sagor999 commented May 26, 2022

/werft run

👍 started the job as gitpod-build-pavel-10259-1.14
(with .werft/ from main)

@sagor999 sagor999 changed the title from "wip" to "volume snapshot delete and restore support" May 27, 2022
@sagor999 sagor999 marked this pull request as ready for review May 27, 2022 01:08
@sagor999 sagor999 requested review from a team May 27, 2022 01:08
@sagor999 sagor999 requested a review from aledbf as a code owner May 27, 2022 01:08
@github-actions github-actions bot added the team: webapp and team: workspace labels May 27, 2022
@sagor999 sagor999 force-pushed the pavel/10259-1 branch 2 times, most recently from 9be3d79 to 753cbc6 on May 28, 2022 18:29
@geropl
Member

geropl commented May 30, 2022

Starting to review now

@geropl geropl self-assigned this May 30, 2022
@sagor999 sagor999 marked this pull request as draft June 1, 2022 19:45
@sagor999
Contributor Author

sagor999 commented Jun 7, 2022

/werft run with-clean-slate-deployment=true

👍 started the job as gitpod-build-pavel-10259-1.23
(with .werft/ from main)

@sagor999 sagor999 force-pushed the pavel/10259-1 branch 3 times, most recently from 6157db8 to d1c06db on June 8, 2022 18:51
@sagor999 sagor999 requested a review from geropl June 8, 2022 19:14
@sagor999
Contributor Author

sagor999 commented Jun 8, 2022

/hold
to ensure I get a review from the webapp team.

let index = 0;

let availableClusters = allClusters.filter((c) => c.state === "available");
for (let cluster of availableClusters) {
Member

I expected this method to connect to a specific ws-manager based on a "region" field, either retrieved from VolumeSnapshot, Workspace or WorkspaceInstance. Iterating over all currently connected clients makes it impossible to detect cases where we are no longer connected to a certain ws-manager.

Or is this due to the fact that we might have multiple ws-managers (with different names) in the same GCloud region, and we do not know how to map those? 🤔

Contributor Author

Not quite. Imagine the following scenario:
An EU cluster where a workspace was started from Volume Snapshot X.
A US cluster where a workspace was also started from the same Volume Snapshot X (from a prebuild, for example).

Then we want to GC that volume snapshot.
We need to do two things now:

  1. Ensure that we delete the k8s object representing this snapshot from ALL clusters (so that we do not accumulate those over time, as that would affect the k8s API server)
  2. Ensure we delete the volume snapshot object from the cloud provider

So this code ensures that. That is why it attempts to talk to all existing clusters and clean the snapshot up from all of them.
If we are no longer connected to some cluster, I think that is fine, as I would assume that cluster is being decommissioned.
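
A minimal sketch of that cross-cluster cleanup, assuming hypothetical client and type names; the actual RPC surface and the rule for when the cloud-provider snapshot itself is removed may differ in the PR:

interface Cluster {
    name: string;
    state: "available" | "cordoned" | "decommissioned";
}

interface VolumeSnapshot {
    id: string;
    volumeHandle: string;
}

// Hypothetical ws-manager client interface for this sketch.
interface WsManagerClient {
    deleteVolumeSnapshot(vs: VolumeSnapshot, softDelete: boolean): Promise<void>;
}

declare function clientForCluster(cluster: Cluster): WsManagerClient;

async function gcVolumeSnapshot(allClusters: Cluster[], vs: VolumeSnapshot): Promise<void> {
    let index = 0;
    const availableClusters = allClusters.filter((c) => c.state === "available");
    for (const cluster of availableClusters) {
        // Assumption: one cluster performs the "hard" delete that also removes the
        // snapshot from the cloud provider; every other cluster only drops its local
        // k8s VolumeSnapshot/VolumeSnapshotContent objects (soft delete).
        const softDelete = index !== 0;
        await clientForCluster(cluster).deleteVolumeSnapshot(vs, softDelete);
        index++;
    }
}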

Member

@geropl geropl Jun 13, 2022

We need to do two things now:

Ah, that explains it. Just to ensure we're on the same page: ws-manager does not remove the PVC object from the control plane when the corresponding workspace is stopped?

Ensure we delete the volume snapshot object from the cloud provider

I thought we just need to do this, and also wondered if this was something we put into content-service...? 🤔

Member

Sorry to "block" the merging of the PR here, but I feel I'm still missing some context to understand why the PVC lifecycle looks the way it does.

Contributor

Just to ensure we're on the same page: ws-manager does not remove the PVC object from the control plane when the corresponding workspace is stopped?

No. When the workspace stops, the PVC object is removed once the VolumeSnapshot object is ready to use (which means the snapshot of the PVC is complete).

Ensure we delete the volume snapshot object from the cloud provider

I thought we just need to do this, and also wondered if this was something we put into content-service...? 🤔

The lifecycle of PersistentVolumeClaim/VolumeSnapshot is managed by ws-manager.

Contributor Author

@geropl

Just to ensure we're on the same page: ws-manager does not remove the PVC object from the control plane when the corresponding workspace is stopped?

It does. There are two objects in play here. The PVC is the actual storage object; it only exists while the workspace is running. When the workspace is stopped, ws-manager "converts" the PVC into a VolumeSnapshot object whose data is stored with the cloud provider (outside of k8s control).

When we open the workspace again, a PVC is created and restored from the VolumeSnapshot object.

When it is time to delete/GC a volume snapshot, we might have the same VolumeSnapshot across multiple clusters (a prebuild that was opened in different clusters, for example). So we want to clean up that VolumeSnapshot object from all clusters (otherwise we would accumulate a lot of them over the lifetime of a cluster), and then also remove the actual snapshot from the cloud provider.

All of this work is currently done by ws-manager.

It would indeed be better to move that into content-service, but currently that is not possible, because content-service still lives in the webapp cluster and would have to be migrated into the workspace clusters first.
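
To summarize the lifecycle described above in a rough sketch, with hypothetical helper names standing in for what ws-manager actually does:

declare function createPVCFromSnapshot(workspaceId: string, snapshotId: string): Promise<void>;
declare function createEmptyPVC(workspaceId: string): Promise<void>;
declare function snapshotPVC(workspaceId: string): Promise<string>; // returns the VolumeSnapshot handle
declare function deletePVC(workspaceId: string): Promise<void>;

// Start: the PVC only exists while the workspace runs. If a prior VolumeSnapshot
// exists, the PVC is restored from it; otherwise the workspace starts with a fresh volume.
async function onWorkspaceStart(workspaceId: string, snapshotId?: string): Promise<void> {
    if (snapshotId) {
        await createPVCFromSnapshot(workspaceId, snapshotId);
    } else {
        await createEmptyPVC(workspaceId);
    }
}

// Stop: the PVC is "converted" into a VolumeSnapshot backed by a cloud-provider
// disk snapshot, and only then is the PVC itself removed.
async function onWorkspaceStop(workspaceId: string): Promise<string> {
    const snapshotHandle = await snapshotPVC(workspaceId);
    await deletePVC(workspaceId);
    return snapshotHandle;
}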

Member

Thx @sagor999 for syncing! 👍
I understood that:

  • workspace takes care of deleting the actual disk snapshots
  • we keep the cluster-local VolumeSnapshots and VolumeSnapshotContent in the cluster to optimize for fast workspace starts
  • but to avoid cluttering the clusters too much, we aim to remove those here whenever their disk snapshot gets deleted

It would be nice if we added some comments along those lines to the GC. 🙏
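
For example, the requested comment could sit above the cluster loop and summarize exactly those points; the wording below is only a suggestion:

// Why we iterate over *all* available clusters here:
//  - ws-manager deletes the actual cloud-provider disk snapshot.
//  - The cluster-local VolumeSnapshot/VolumeSnapshotContent objects are kept around
//    to keep subsequent workspace starts in that cluster fast.
//  - To avoid cluttering the clusters (and their k8s API servers), we remove those
//    cluster-local objects here whenever the underlying disk snapshot is deleted.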

@geropl
Member

geropl commented Jun 9, 2022

@sagor999 Regarding merging of this PR: once the initial review is through, it would be awesome to decouple this PR into smaller chunks. For instance:

  • DB changes
  • ws-manager changes
  • server+bridge changes

Contributor

@jenting jenting left a comment

LGTM 🛹 🚀

@jenting
Contributor

jenting commented Jun 14, 2022

Hey Pavel, I re-ran this PR to test it. After the workspace stopped, I clicked Delete Workspace in the GUI; however, the VolumeSnapshot is still there.
But I suspect the VolumeSnapshot should be deleted by ws-manager. Correct?
Besides that, I see the Loom video doesn't demo this part.

@sagor999
Contributor Author

sagor999 commented Jun 14, 2022

@jenting it will be deleted when the GC runs and deletes the workspace for real.
I simulated that part manually by updating the workspace deletion timestamp to be one week old; the GC then ran and cleaned everything up.
But thank you for double checking! ❤️

@sagor999
Contributor Author

sagor999 commented Jun 14, 2022

/werft run with-clean-slate-deployment=true

👍 started the job as gitpod-build-pavel-10259-1.40
(with .werft/ from main)

req.setId(vs.id);
req.setVolumeHandle(vs.volumeHandle);

let softDelete = true;
Member

nit: I already mentioned it, but just to highlight: this part especially feels a tad brittle, and it would be nice to find a better way to handle/encapsulate that logic.
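
One possible way to encapsulate that decision, sketched with hypothetical names and assuming the "one hard delete, the rest soft deletes" semantics discussed above:

interface VolumeSnapshot {
    id: string;
    volumeHandle: string;
}

interface SnapshotDeletion {
    clusterName: string;
    snapshot: VolumeSnapshot;
    softDelete: boolean;
}

// Plan one deletion per cluster: only the first cluster also removes the snapshot
// from the cloud provider; the others merely drop their local k8s objects.
function planSnapshotDeletions(clusterNames: string[], vs: VolumeSnapshot): SnapshotDeletion[] {
    return clusterNames.map((clusterName, index) => ({
        clusterName,
        snapshot: vs,
        softDelete: index !== 0,
    }));
}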

Member

@geropl geropl left a comment

LGTM 👍

@sagor999
Contributor Author

/unhold

@roboquat roboquat merged commit f0daee2 into main Jun 14, 2022
@roboquat roboquat deleted the pavel/10259-1 branch June 14, 2022 21:07
@roboquat roboquat added the deployed: webapp, deployed: workspace, and deployed labels Jun 16, 2022
Labels
deployed: webapp · deployed: workspace · deployed · release-note-none · size/XXL · team: webapp · team: workspace
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ws-manager: add support for deleting snapshotVolumes (to clean up old backups) on workspace delete
5 participants