
[ws-daemon] Properly handle mark unmount #5897

Merged
merged 1 commit into from Sep 29, 2021

Conversation

csweichel
Contributor

Description

This PR moves the mark unmount fallback back to ws-daemon. Prior to this change we'd try to finalise workspace content even before the pod was stopped. During content finalisation we'd try to unmount the mark mount that might have been propagated to ws-daemon during a restart. If that happened, the pod would never actually stop; hence we'd try to finalise even if the pod was not stopped yet.

This change pushes all this mark mount business back into ws-daemon using the dispatch mechanism. If a pod lingers around for longer than its termination grace period, we'll try to unmount the mark mount, eventually causing the pod to stop.
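
To illustrate the idea, here is a minimal, hypothetical Go sketch of such a fallback. It is not the actual ws-daemon dispatch code: the function name, path layout, and logging are assumptions; only the shape of the mechanism (wait out the termination grace period, then detach the mark mount) follows the description above.

```go
// Hypothetical sketch only; names and paths do not match the real ws-daemon code.
package content

import (
	"context"
	"log"
	"path/filepath"
	"syscall"
	"time"
)

// markUnmountFallback is assumed to be triggered by the dispatch mechanism when a
// workspace pod starts stopping. If the pod is still around once its termination
// grace period has elapsed, it detaches the mark mount so that containerd's retries
// to remove the container can eventually succeed.
func markUnmountFallback(ctx context.Context, workspaceID string, gracePeriod time.Duration) {
	select {
	case <-ctx.Done():
		// The pod stopped (or the dispatch entry was cancelled) in time; nothing to do.
		return
	case <-time.After(gracePeriod):
	}

	// The path layout is illustrative only.
	mark := filepath.Join("/var/gitpod/workspaces", workspaceID, "mark")
	if err := syscall.Unmount(mark, syscall.MNT_DETACH); err != nil {
		log.Printf("cannot unmount mark mount %s: %v", mark, err)
		return
	}
	log.Printf("unmounted mark mount for workspace %s", workspaceID)
	// The real implementation also increments
	// gitpod_ws_daemon_markunmountfallback_active_total at this point.
}
```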

Related Issue(s)

Fixes #5689

How to test

  1. Start a workspace
  2. Restart ws-daemon
  3. Stop the workspace
  4. The workspace should stop just fine, and the gitpod_ws_daemon_markunmountfallback_active_total metric on ws-daemon should increment by one (a sketch for checking the metric follows below).
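
For step 4, one way to inspect the counter is to scrape ws-daemon's Prometheus endpoint before and after stopping the workspace. The small Go helper below is a sketch under assumptions: the endpoint URL and port are placeholders (e.g. reached via a kubectl port-forward), not documented ws-daemon configuration.

```go
// Hypothetical helper: prints the mark-unmount fallback counter lines from a
// Prometheus metrics endpoint. The URL is an assumption; adjust it to however
// you expose ws-daemon's metrics.
package main

import (
	"bufio"
	"fmt"
	"log"
	"net/http"
	"strings"
)

func main() {
	resp, err := http.Get("http://localhost:9500/metrics")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		// Print only the counter we care about, including its labels.
		if strings.HasPrefix(line, "gitpod_ws_daemon_markunmountfallback_active_total") {
			fmt.Println(line)
		}
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
}
```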

Release Notes

[workspace] Make the workspace stopping mechanism more deterministic

@codecov

codecov bot commented Sep 28, 2021

Codecov Report

Merging #5897 (61b283d) into main (60a93e7) will increase coverage by 3.35%.
The diff coverage is 0.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #5897      +/-   ##
==========================================
+ Coverage   19.04%   22.40%   +3.35%     
==========================================
  Files           2       11       +9     
  Lines         168     1933    +1765     
==========================================
+ Hits           32      433     +401     
- Misses        134     1442    +1308     
- Partials        2       58      +56     
Flag Coverage Δ
components-local-app-app-darwin ?
components-local-app-app-linux ?
components-local-app-app-windows ?
components-ws-daemon-app 22.40% <0.00%> (?)

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
components/ws-daemon/pkg/content/service.go 0.00% <0.00%> (ø)
components/local-app/pkg/auth/auth.go
components/local-app/pkg/auth/pkce.go
components/ws-daemon/pkg/content/config.go 62.50% <0.00%> (ø)
components/ws-daemon/pkg/quota/size.go 87.30% <0.00%> (ø)
components/ws-daemon/pkg/internal/session/store.go 19.38% <0.00%> (ø)
components/ws-daemon/pkg/content/initializer.go 0.00% <0.00%> (ø)
components/ws-daemon/pkg/resources/limiter.go 77.77% <0.00%> (ø)
components/ws-daemon/pkg/resources/controller.go 31.06% <0.00%> (ø)
... and 4 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 60a93e7...61b283d.

@csweichel csweichel marked this pull request as ready for review September 28, 2021 09:09
@@ -363,22 +363,9 @@ func actOnPodEvent(ctx context.Context, m actingManager, status *api.WorkspaceSt
_, gone := wso.Pod.Annotations[wsk8s.ContainerIsGoneAnnotation]

if terminated || gone {
Member

❤️

@geropl
Member

geropl commented Sep 28, 2021

I might mix this up, but: This is basically the mechanism we had in place before the "stuck in stopping" weeks, just with "unmountMarkMount" instead of kubectl delete pod .. --force, right?
But wasn't the problem with this approach that it is already too late? E.g., containerd already tried to remove the container, failed, and the workspace is in a state where retries won't help?
Or is it that we now delete the mark mount in all scenarios, so that when containerd re-tries to delete the pod it succeeds?

Just trying to completely understand this 🤔

@csweichel
Contributor Author

csweichel commented Sep 28, 2021

I might mix this up, but: This is basically the mechanism we had in place before the "stuck in stopping" weeks, just with "unmountMarkMount" instead of kubectl delete pod .. --force, right?

Indeed, the mechanism is the same. Actually I went back in the Git history and pulled out the containerd workaround code :)

But wasn't the problem with this approach that it is already too late? E.g., containerd already tried to remove the container, failed, and the workspace is in a state where retries won't help?

The key difference between then and now is that containerd behaves differently in this situation. It continuously tries to unmount the root filesystem, and eventually (once this mechanism has kicked in) succeeds.

Or is it that we now delete the mark mount in all scenarios, so that when containerd re-tries to delete the pod it succeeds?

💯

Member

@geropl geropl left a comment

Not sure this helps, but: LGTM 🙃

@roboquat
Contributor

LGTM label has been added.

Git tree hash: 7e8e635de31c41d0b4a65276bdb33ec9ff0d6665

@aledbf
Member

aledbf commented Sep 29, 2021

/werft run

👍 started the job as gitpod-build-cw-fix-5689.5

@aledbf
Member

aledbf commented Sep 29, 2021

@csweichel I see two things with this change:

  • stopping workspaces takes more time (UI)
  • the counter gitpod_ws_daemon_markunmountfallback_active_total{successful="true"} increases by two

@csweichel
Contributor Author

@csweichel I see two things with this change:

  • stopping workspaces takes more time (UI)

That makes sense, as we now wait for containerd to realise that the rootfs was unmounted.

  • the counter gitpod_ws_daemon_markunmountfallback_active_total{successful="true"} increases by two

Did it increment by two for one workspace stop?

@aledbf
Member

aledbf commented Sep 29, 2021

Did it increment by two for one workspace stop?

Yes

@csweichel
Contributor Author

Testing this I saw only a single increase. Beware that preview environments also have a single ghost running, whose stop would be affected by this behaviour as well.

@aledbf
Member

aledbf commented Sep 29, 2021

/lgtm
/approve

@roboquat
Contributor

LGTM label has been added.

Git tree hash: 7b03f9fcb851847126a83209fa202093976f1dde

@roboquat
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aledbf, geropl

Associated issue: #5689

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@roboquat roboquat merged commit da1919f into main Sep 29, 2021
@roboquat roboquat deleted the cw/fix-5689 branch September 29, 2021 12:07