[ws-manager] Stop workspaces timing out during backup #4943

Merged: roboquat merged 2 commits into main from csweichel/ws-manager-properly-stop-4937 on Jul 26, 2021

csweichel
Contributor

Prior to this change, workspaces that timed out during backup would never actually stop. In addition, the contentFinalization timeout was not applied during backup.
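
As a rough illustration of the intended behaviour (not the actual ws-manager code), the sketch below bounds a backup with a timeout and marks the workspace as stopped regardless of how the backup ends. All names here (`finalizeThenStop`, `stopOutcome`, `hangingBackup`, `quickBackup`) are hypothetical.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// stopOutcome records what happened while shutting down a (hypothetical) workspace.
type stopOutcome struct {
	BackupTimedOut bool  // true if the backup exceeded the timeout
	Stopped        bool  // true once the workspace was marked as stopped
	BackupErr      error // error returned by the backup, if any
}

// finalizeThenStop runs backup with a deadline and marks the workspace as
// stopped afterwards, whether the backup succeeded, failed, or timed out.
func finalizeThenStop(timeout time.Duration, backup func(context.Context) error) stopOutcome {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	done := make(chan error, 1) // buffered so the goroutine never blocks on send
	go func() { done <- backup(ctx) }()

	var out stopOutcome
	select {
	case err := <-done:
		out.BackupErr = err
		out.BackupTimedOut = errors.Is(err, context.DeadlineExceeded)
	case <-ctx.Done():
		out.BackupErr = ctx.Err()
		out.BackupTimedOut = true
	}
	// The key point: a timed-out backup must not keep the workspace from stopping.
	out.Stopped = true
	return out
}

// hangingBackup simulates a backup that never completes, e.g. because ws-daemon is gone.
func hangingBackup(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

// quickBackup simulates a backup that succeeds immediately.
func quickBackup(_ context.Context) error { return nil }

func main() {
	fmt.Printf("%+v\n", finalizeThenStop(50*time.Millisecond, hangingBackup))
	fmt.Printf("%+v\n", finalizeThenStop(50*time.Millisecond, quickBackup))
}
```

In both calls `Stopped` ends up true; only `BackupTimedOut` and `BackupErr` differ.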

How to test

  1. Start a workspace
  2. Change the contentFinalization timeout in ws-manager-config to 1s (quicker testing)
  3. Restart ws-manager
  4. Once the workspace is running, remove ws-daemon (e.g. by editing the daemonSet's affinity)
  5. Stop the workspace

You should see the workspace actually stopping, e.g. [image]
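
For step 2 of the test instructions, here is a minimal sketch of how a duration string such as `1s` could be read from the configuration. The JSON layout (`timeouts.contentFinalization`) and the Go types are assumptions for illustration; check the actual ws-manager config for the real structure.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// duration wraps time.Duration so it can be read from a JSON string like "1s".
type duration time.Duration

// UnmarshalJSON parses a quoted Go duration string ("1s", "15m", "1h").
func (d *duration) UnmarshalJSON(b []byte) error {
	var s string
	if err := json.Unmarshal(b, &s); err != nil {
		return err
	}
	parsed, err := time.ParseDuration(s)
	if err != nil {
		return err
	}
	*d = duration(parsed)
	return nil
}

// timeouts is a hypothetical subset of the manager configuration.
type timeouts struct {
	ContentFinalization duration `json:"contentFinalization"`
}

// parseTimeouts extracts the (assumed) "timeouts" section from raw JSON.
func parseTimeouts(raw string) (timeouts, error) {
	var cfg struct {
		Timeouts timeouts `json:"timeouts"`
	}
	err := json.Unmarshal([]byte(raw), &cfg)
	return cfg.Timeouts, err
}

func main() {
	cfg, err := parseTimeouts(`{"timeouts": {"contentFinalization": "1s"}}`)
	if err != nil {
		panic(err)
	}
	fmt.Println(time.Duration(cfg.ContentFinalization)) // prints 1s
}
```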

fixes #4937

Note: this change will affect prod configuration, as we're now using the correct timeout for content finalization/backup (15m instead of 1h).
/cc @meysholdt

@codecov

codecov bot commented Jul 25, 2021

Codecov Report

Merging #4943 (cf75edc) into main (d035b70) will increase coverage by 36.37%.
The diff coverage is 0.00%.

❗ Current head cf75edc differs from pull request most recent head 1dad52b. Consider uploading reports for the commit 1dad52b to get more accurate results

@@            Coverage Diff            @@
##           main    #4943       +/-   ##
=========================================
+ Coverage      0   36.37%   +36.37%     
=========================================
  Files         0       13       +13     
  Lines         0     3736     +3736     
=========================================
+ Hits          0     1359     +1359     
- Misses        0     2261     +2261     
- Partials      0      116      +116     
Flag Coverage Δ
components-ws-manager-app 36.37% <0.00%> (?)

Impacted Files Coverage Δ
components/ws-manager/pkg/manager/monitor.go 0.00% <0.00%> (ø)
components/ws-manager/pkg/manager/status.go 71.69% <0.00%> (ø)
components/ws-manager/pkg/manager/manager.go 24.83% <0.00%> (ø)
components/ws-manager/pkg/manager/metrics.go 11.26% <0.00%> (ø)
components/ws-manager/pkg/manager/create.go 78.79% <0.00%> (ø)
components/ws-manager/pkg/manager/probe.go 0.00% <0.00%> (ø)
...s/ws-manager/pkg/manager/internal/grpcpool/pool.go 74.46% <0.00%> (ø)
...omponents/ws-manager/pkg/manager/pod_controller.go 0.00% <0.00%> (ø)
... and 5 more

@csweichel
Contributor Author

/auto-cc

@roboquat roboquat requested a review from rl-gitpod July 25, 2021 21:47
@csweichel csweichel requested review from geropl and removed request for rl-gitpod July 25, 2021 21:47
@geropl
Member

geropl commented Jul 26, 2021

/werft run

👍 started the job as gitpod-build-csweichel-ws-manager-properly-stop-4937.3

@geropl
Member

geropl commented Jul 26, 2021

While testing I see the screen above (timeout) but the pod is still dangling in "Terminating". Last line of the ws log is:

{"exitCode":0,"level":"debug","message":"supervisor exit","serviceContext":{"service":"supervisor","version":""},"severity":"DEBUG","time":"2021-07-26T08:22:31Z"}
{"@type":"type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent","error":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /.workspace/daemon.sock: connect: connection refused\"","level":"error","message":"cannot trigger teardown","ring":0,"serviceContext":{"service":"workspacekit","version":""},"severity":"ERROR","time":"2021-07-26T08:22:31Z"}
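
To pick the relevant entries out of such structured log lines, something like the following could help. It only assumes the fields visible in the lines above (`level`, `message`, `error`, `serviceContext.service`); the type and function names are made up for this sketch.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// logEntry holds the fields of interest from one JSON log line.
type logEntry struct {
	Level          string `json:"level"`
	Message        string `json:"message"`
	Error          string `json:"error"`
	ServiceContext struct {
		Service string `json:"service"`
	} `json:"serviceContext"`
}

// parseLogLine decodes a single JSON log line; unknown fields are ignored.
func parseLogLine(line string) (logEntry, error) {
	var e logEntry
	err := json.Unmarshal([]byte(line), &e)
	return e, err
}

// summarize renders an entry as "service [level] message", adding the error text when present.
func summarize(e logEntry) string {
	s := fmt.Sprintf("%s [%s] %s", e.ServiceContext.Service, e.Level, e.Message)
	if e.Error != "" {
		s += ": " + e.Error
	}
	return s
}

func main() {
	line := `{"exitCode":0,"level":"debug","message":"supervisor exit","serviceContext":{"service":"supervisor","version":""},"severity":"DEBUG","time":"2021-07-26T08:22:31Z"}`
	e, err := parseLogLine(line)
	if err != nil {
		panic(err)
	}
	fmt.Println(summarize(e)) // supervisor [debug] supervisor exit
}
```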

@csweichel
Contributor Author

During an offline discussion with @meysholdt, we decided that for now we should match the prior 1h timeout for contentFinalization.

@csweichel
Contributor Author

> While testing I see the screen above (timeout) but the pod is still dangling in "Terminating". Last line of the ws log is:
>
> {"exitCode":0,"level":"debug","message":"supervisor exit","serviceContext":{"service":"supervisor","version":""},"severity":"DEBUG","time":"2021-07-26T08:22:31Z"}
> {"@type":"type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent","error":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /.workspace/daemon.sock: connect: connection refused\"","level":"error","message":"cannot trigger teardown","ring":0,"serviceContext":{"service":"workspacekit","version":""},"severity":"ERROR","time":"2021-07-26T08:22:31Z"}

This strikes me as a separate issue because this PR does not change that behaviour. I have filed #4955.

@csweichel
Contributor Author

/hold

Need to update the timeout.

@csweichel
Contributor Author

/hold cancel

Member

@geropl geropl left a comment

Code LGTM, tested and works as expected.

@roboquat
Contributor

LGTM label has been added.

Git tree hash: 955eee5239067876259569ec38bbd79235991c16

@roboquat
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: csweichel, geropl

@roboquat roboquat merged commit 9d5713a into main Jul 26, 2021
@roboquat roboquat deleted the csweichel/ws-manager-properly-stop-4937 branch July 26, 2021 14:14

Successfully merging this pull request may close these issues.

[ws-manager] Properly stop workspaces that time out during backup