
[ws-daemon] Properly handle mark unmount #5897

Merged
merged 1 commit into from Sep 29, 2021

Conversation

csweichel
Contributor

Description

This PR moves the mark unmount fallback back to ws-daemon. Prior to this change we'd try to finalise workspace content even before the pod was stopped. During content finalisation we'd try to unmount the mark mount that might have been propagated to ws-daemon during a restart. If that happened, the pod would never actually stop; hence we'd try to finalise even if the pod was not stopped yet.

This change pushes all this mark mount business back into ws-daemon using the dispatch mechanism. If a pod lingers around for longer than its termination grace period, we'll try to unmount the mark mount, eventually causing the pod to stop.
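
To illustrate the idea, here is a minimal, hypothetical Go sketch of such a fallback. It is not the actual ws-daemon dispatch code: the function name, path layout, and logging are assumptions; only the shape of the mechanism (wait out the termination grace period, then detach the mark mount) follows the description above.

```go
// Hypothetical sketch only; names and paths do not match the real ws-daemon code.
package content

import (
	"context"
	"log"
	"path/filepath"
	"syscall"
	"time"
)

// markUnmountFallback is assumed to be triggered by the dispatch mechanism when a
// workspace pod starts stopping. If the pod is still around once its termination
// grace period has elapsed, it detaches the mark mount so that containerd's retries
// to remove the container can eventually succeed.
func markUnmountFallback(ctx context.Context, workspaceID string, gracePeriod time.Duration) {
	select {
	case <-ctx.Done():
		// The pod stopped (or the dispatch entry was cancelled) in time; nothing to do.
		return
	case <-time.After(gracePeriod):
	}

	// The path layout is illustrative only.
	mark := filepath.Join("/var/gitpod/workspaces", workspaceID, "mark")
	if err := syscall.Unmount(mark, syscall.MNT_DETACH); err != nil {
		log.Printf("cannot unmount mark mount %s: %v", mark, err)
		return
	}
	log.Printf("unmounted mark mount for workspace %s", workspaceID)
	// The real implementation also increments
	// gitpod_ws_daemon_markunmountfallback_active_total at this point.
}
```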

Related Issue(s)

Fixes #5689

How to test

  1. Start a workspace
  2. Restart ws-daemon
  3. Stop the workspace
  4. The workspace should stop just fine, and the gitpod_ws_daemon_markunmountfallback_active_total metric on ws-daemon should increment by one (a sketch for checking the metric follows below).
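
For step 4, one way to inspect the counter is to scrape ws-daemon's Prometheus endpoint before and after stopping the workspace. The small Go helper below is a sketch under assumptions: the endpoint URL and port are placeholders (e.g. reached via a kubectl port-forward), not documented ws-daemon configuration.

```go
// Hypothetical helper: prints the mark-unmount fallback counter lines from a
// Prometheus metrics endpoint. The URL is an assumption; adjust it to however
// you expose ws-daemon's metrics.
package main

import (
	"bufio"
	"fmt"
	"log"
	"net/http"
	"strings"
)

func main() {
	resp, err := http.Get("http://localhost:9500/metrics")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		// Print only the counter we care about, including its labels.
		if strings.HasPrefix(line, "gitpod_ws_daemon_markunmountfallback_active_total") {
			fmt.Println(line)
		}
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
}
```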

Release Notes

[workspace] Make the workspace stopping mechanism more deterministic

@codecov

codecov bot commented Sep 28, 2021

Codecov Report

Merging #5897 (61b283d) into main (60a93e7) will increase coverage by 3.35%.
The diff coverage is 0.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #5897      +/-   ##
==========================================
+ Coverage   19.04%   22.40%   +3.35%     
==========================================
  Files           2       11       +9     
  Lines         168     1933    +1765     
==========================================
+ Hits           32      433     +401     
- Misses        134     1442    +1308     
- Partials        2       58      +56     
Flag Coverage Δ
components-local-app-app-darwin ?
components-local-app-app-linux ?
components-local-app-app-windows ?
components-ws-daemon-app 22.40% <0.00%> (?)

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
components/ws-daemon/pkg/content/service.go 0.00% <0.00%> (ø)
components/local-app/pkg/auth/auth.go
components/local-app/pkg/auth/pkce.go
components/ws-daemon/pkg/content/config.go 62.50% <0.00%> (ø)
components/ws-daemon/pkg/quota/size.go 87.30% <0.00%> (ø)
components/ws-daemon/pkg/internal/session/store.go 19.38% <0.00%> (ø)
components/ws-daemon/pkg/content/initializer.go 0.00% <0.00%> (ø)
components/ws-daemon/pkg/resources/limiter.go 77.77% <0.00%> (ø)
components/ws-daemon/pkg/resources/controller.go 31.06% <0.00%> (ø)
... and 4 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 60a93e7...61b283d.

@csweichel csweichel marked this pull request as ready for review September 28, 2021 09:09
@@ -363,22 +363,9 @@ func actOnPodEvent(ctx context.Context, m actingManager, status *api.WorkspaceSt
_, gone := wso.Pod.Annotations[wsk8s.ContainerIsGoneAnnotation]

if terminated || gone {
Member

❤️

@geropl
Member

geropl commented Sep 28, 2021

I might mix this up, but: This is basically the mechanism we had in place before the "stuck in stopping" weeks, just with "unmountMarkMount" instead of kubectl delete pod .. --force, right?
But wasn't the problem with this approach that it is already too late? E.g., containerd already tried to remove the container, failed, and the workspace is in a state where retries won't help?
Or is it that we now delete the mark mount in all scenarios, so that when containerd re-tries to delete the pod it succeeds?

Just trying to completely understand this 🤔

@csweichel
Contributor Author

csweichel commented Sep 28, 2021

I might mix this up, but: This is basically the mechanism we had in place before the "stuck in stopping" weeks, just with "unmountMarkMount" instead of kubectl delete pod .. --force, right?

Indeed, the mechanism is the same. Actually I went back in the Git history and pulled out the containerd workaround code :)

But wasn't the problem with this approach that it is already too late? E.g., containerd already tried to remove the container, failed, and the workspace is in a state where retries won't help?

The key difference between then and now is that containerd behaves differently in this situation. It continuously tries to unmount the root filesystem, and eventually (once this mechanism has kicked in) succeeds.

Or is it that we now delete the mark mount in all scenarios, so that when containerd re-tries to delete the pod it succeeds?

💯

Member

@geropl geropl left a comment

Not sure this helps, but: LGTM 🙃

@roboquat
Contributor

LGTM label has been added.

Git tree hash: 7e8e635de31c41d0b4a65276bdb33ec9ff0d6665

@aledbf
Member

aledbf commented Sep 29, 2021

/werft run

👍 started the job as gitpod-build-cw-fix-5689.5

@aledbf
Member

aledbf commented Sep 29, 2021

@csweichel I see two things with this change:

  • stopping workspaces takes more time (UI)
  • the counter gitpod_ws_daemon_markunmountfallback_active_total{successful="true"} increases by two

@csweichel
Contributor Author

@csweichel I see two things with this change:

  • stopping workspaces takes more time (UI)

That makes sense, as we now wait for containerd to realise that the rootfs was unmounted.

  • the counter gitpod_ws_daemon_markunmountfallback_active_total{successful="true"} increases by two

Did it increment by two for one workspace stop?

@aledbf
Member

aledbf commented Sep 29, 2021

Did it increment by two for one workspace stop?

Yes

@csweichel
Contributor Author

Testing this I saw only a single increase. Beware that preview environments also have a single ghost running, whose stop would be affected by this behaviour as well.

@aledbf
Member

aledbf commented Sep 29, 2021

/lgtm
/approve

@roboquat
Contributor

LGTM label has been added.

Git tree hash: 7b03f9fcb851847126a83209fa202093976f1dde

@roboquat
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aledbf, geropl

Associated issue: #5689

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@roboquat roboquat merged commit da1919f into main Sep 29, 2021
@roboquat roboquat deleted the cw/fix-5689 branch September 29, 2021 12:07