Headless Log II: ws-manager state machine #4351

Merged
merged 1 commit into from
Jul 8, 2021

geropl
Member

@geropl geropl commented May 31, 2021

Note: don't merge before #4262

This PR is part of the "Revamp Headless Logs" effort: https://www.notion.so/gitpod/Revamp-Headless-Logs-90d9d08d12e344a49bce442cad89200b#065b7b935c4d4d2d86dd8bd1aa3b840e

It changes the "done" signal for headless workspaces from "the last line of the log is DONE" to "the container has stopped".
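To illustrate the difference between the two signals, here is a minimal Go sketch. It is not code from this PR: `containerStatus`, `doneByLog`, and `doneByContainer` are hypothetical names, and the real ws-manager works on Kubernetes pod status rather than this simplified struct. The point is that a task which crashes never prints DONE, so the log-based check never fires, while the container-based check does.

```go
package main

import (
	"fmt"
	"strings"
)

// containerStatus is a hypothetical, simplified view of a workspace container.
type containerStatus struct {
	Terminated bool // true once the container has stopped
	ExitCode   int  // only meaningful when Terminated is true
}

// doneByLog models the old signal: done when the last log line is "DONE".
func doneByLog(log string) bool {
	lines := strings.Split(strings.TrimRight(log, "\n"), "\n")
	return lines[len(lines)-1] == "DONE"
}

// doneByContainer models the new signal: done when the container has stopped.
func doneByContainer(s containerStatus) bool {
	return s.Terminated
}

func main() {
	// A task that crashed before it could print the DONE marker.
	crashedLog := "installing dependencies\nerror: build failed\n"
	crashed := containerStatus{Terminated: true, ExitCode: 1}

	fmt.Println(doneByLog(crashedLog))    // false: no marker was printed
	fmt.Println(doneByContainer(crashed)) // true: the container did stop
}
```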

This contains:

  • STOPPING no longer implies "container has already been stopped"; that is now enforced by monitor.go
  • headless workspaces that fail because one of their tasks returned a non-zero exit code now exit with code 222
  • workspacekit needs to propagate that exit code
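
The exit-code items above could look roughly like the following Go sketch. Only the value 222 comes from this PR; `exitCodeForTasks` and `propagatedExitCode` are hypothetical helpers, not the actual supervisor or workspacekit code. The second helper uses the standard library's `exec.ExitError` to pass a child's exit code through a wrapper process.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// headlessTaskFailedExitCode is the exit code named in the PR description
// for headless workspaces whose tasks failed.
const headlessTaskFailedExitCode = 222

// exitCodeForTasks (hypothetical) maps the exit codes of all tasks to the
// exit code of the headless workspace: 222 if any task failed, 0 otherwise.
func exitCodeForTasks(taskExitCodes []int) int {
	for _, code := range taskExitCodes {
		if code != 0 {
			return headlessTaskFailedExitCode
		}
	}
	return 0
}

// propagatedExitCode (hypothetical) extracts the exit code a wrapper process
// should exit with, given the error returned by (*exec.Cmd).Run or Wait.
func propagatedExitCode(err error) int {
	if err == nil {
		return 0
	}
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		return exitErr.ExitCode()
	}
	// The child could not be started or waited for; report a generic failure.
	return 1
}

func main() {
	fmt.Println(exitCodeForTasks([]int{0, 0}))    // 0
	fmt.Println(exitCodeForTasks([]int{0, 3, 0})) // 222
	fmt.Println(propagatedExitCode(nil))          // 0
}
```

A wrapper would then call `os.Exit(propagatedExitCode(cmd.Run()))` so that the 222 reaches whatever is watching the container.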

Test

Unit tests are passing ✔️

These integration tests are passing ✔️

cd test/tests/workspace
go test -run ^TestPrebuild.*$ -v .
go test -run ^TestGhost.*$ -v .

Start a prebuild which succeeds

Start a prebuild which fails

@geropl geropl marked this pull request as ready for review June 7, 2021 07:42
Contributor

@csweichel csweichel left a comment

The changes by and large make sense to me.
The only thing I did not understand is the many fixture tests: most of them do not seem to exercise prebuild-specific paths.

@csweichel
Contributor

csweichel commented Jun 9, 2021

/werft run

👍 started the job as gitpod-build-gpl-headless-log-wsman.31

@csweichel
Contributor

csweichel commented Jun 9, 2021

I first started the prebuild when ws-daemon wasn't ready yet, hence the build failed. Now I'm left in a permanent "prebuild is running" state.

{"@type":"type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent","addr":"10.132.0.23:10482","error":"context deadline exceeded","level":"error","message":"cannot connect to ws-daemon","serviceContext":{"service":"ws-manager","version":""},"severity":"ERROR","time":"2021-06-09T06:57:59Z"}
{"@type":"type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent","error":"cannot mark workspace 53ac2e9b-d49c-4e13-92f7-9dd7870d30f6 with +gitpod/explicitFail:\n    github.com/gitpod-io/gitpod/ws-manager/pkg/manager.(*Manager).markWorkspace\n        github.com/gitpod-io/gitpod/ws-manager/pkg/manager/annotations.go:118\n  - cannot find workspace 53ac2e9b-d49c-4e13-92f7-9dd7870d30f6:\n    github.com/gitpod-io/gitpod/ws-manager/pkg/manager.(*Manager).markWorkspace.func1\n        github.com/gitpod-io/gitpod/ws-manager/pkg/manager/annotations.go:90\n  - pod for workspace 53ac2e9b-d49c-4e13-92f7-9dd7870d30f6 not found","instanceId":"53ac2e9b-d49c-4e13-92f7-9dd7870d30f6","level":"warning","message":"was unable to mark workspace as failed","serviceContext":{"service":"ws-manager","version":""},"severity":"WARNING","time":"2021-06-09T06:57:59Z","userId":"9fc1317c-eb7f-4123-b3aa-0d5a15e63ccd","workspaceId":"maroon-hyena-nsyf8tqj"}

Once ws-daemon was up and running things worked as expected.

@csweichel
Contributor

I really like this change. It finally does away with that fickle headless stop mechanism and replaces it with a much stronger signal!

@geropl geropl force-pushed the gpl/headless-log-wsman branch 3 times, most recently from d94e366 to d5d7481 Compare June 9, 2021 12:38
@geropl geropl force-pushed the gpl/headless-log-wsman branch 2 times, most recently from 7f32420 to aceb552 Compare June 23, 2021 09:02
@codecov

codecov bot commented Jun 23, 2021

Codecov Report

Merging #4351 (bffd808) into main (eb0f0c3) will increase coverage by 4.95%.
The diff coverage is 14.91%.

❗ Current head bffd808 differs from pull request most recent head 1f06391. Consider uploading reports for the commit 1f06391 to get more accurate results

@@            Coverage Diff             @@
##             main    #4351      +/-   ##
==========================================
+ Coverage   36.38%   41.33%   +4.95%     
==========================================
  Files          14       37      +23     
  Lines        3881     9288    +5407     
==========================================
+ Hits         1412     3839    +2427     
- Misses       2349     5145    +2796     
- Partials      120      304     +184     
| Flag | Coverage Δ |
|---|---|
| components-ee-ws-scheduler-app | 62.19% <ø> (?) |
| components-local-app-app-darwin | ? |
| components-local-app-app-linux | ? |
| components-local-app-app-windows | ? |
| components-supervisor-app | 36.17% <6.06%> (?) |
| components-ws-manager-app | 36.64% <18.51%> (+0.26%) ⬆️ |

| Impacted Files | Coverage Δ |
|---|---|
| components/supervisor/pkg/supervisor/supervisor.go | 0.00% <0.00%> (ø) |
| components/ws-manager/pkg/manager/manager.go | 26.05% <0.00%> (+0.17%) ⬆️ |
| components/ws-manager/pkg/manager/monitor.go | 0.00% <0.00%> (ø) |
| components/ws-manager/pkg/manager/status.go | 70.84% <43.75%> (-1.49%) ⬇️ |
| components/supervisor/pkg/supervisor/tasks.go | 47.98% <50.00%> (ø) |
| components/ws-manager/pkg/manager/create.go | 78.53% <100.00%> (ø) |
| components/ee/ws-scheduler/pkg/scheduler/config.go | 62.50% <0.00%> (ø) |
| components/supervisor/pkg/ports/served-ports.go | 76.00% <0.00%> (ø) |
| components/supervisor/pkg/terminal/service.go | 29.09% <0.00%> (ø) |
| ... and 21 more | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update eb0f0c3...1f06391.

@geropl geropl force-pushed the gpl/headless-log-wsman branch 9 times, most recently from f5afc4f to 95e9442 Compare June 24, 2021 12:33
@geropl
Member Author

geropl commented Jun 24, 2021

Now we also have integration tests. ✔️

@csweichel

@geropl geropl requested a review from csweichel June 24, 2021 12:45
@geropl geropl force-pushed the gpl/headless-log-wsman branch 4 times, most recently from 8ce0b5f to b0ca0f2 Compare June 25, 2021 12:41
@geropl
Member Author

geropl commented Jun 25, 2021

@csweichel Finally rebased and deployed again. After that, and after fixing the nits, the codecov/patch check is now red 🙈

@csweichel
Contributor

Code does look good now indeed. After running the integration test I had a look at the ws-manager logs and saw this:

{"instanceId":"800f0a98-aa49-43bb-bd59-db8bc3b1c35b","level":"error","message":"workspace failed","serviceContext":{"service":"ws-manager","version":""},"severity":"ERROR","status":{"id":"800f0a98-aa49-43bb-bd59-db8bc3b1c35b","metadata":{"owner":"builtin-user-workspace-probe-0000000","meta_id":"apricot-leopard-hugdrrvy","started_at":{"seconds":1624644359}},"spec":{"workspace_image":"eu.gcr.io/gitpod-core-dev/registry/workspace-images:0d70f6a2cf7aa97cb7d4408ccbf775c01c05dd7297de3439b325396bbb8155dd","ide_image":"eu.gcr.io/gitpod-core-dev/build/ide/code:commit-6d0bc3279e942a39aef78c9db5aa8ec9629a9017","headless":true,"url":"https://apricot-leopard-hugdrrvy.ws-dev.gpl-headless-log-wsman.staging.gitpod-dev.com","type":1,"timeout":"30m"},"phase":5,"conditions":{"failed":"unexpected exit: exit status 222","service_exists":1,"deployed":1},"message":"headless workspace is stopping","runtime":{"node_name":"gke-dev-workload-1-49d27f81-6n3t","pod_name":"prebuild-800f0a98-aa49-43bb-bd59-db8bc3b1c35b","node_ip":"10.132.15.201"},"auth":{"owner_token":"QkJqm7OfT_YGbyous1HGtinYPQpmDtyX"}},"time":"2021-06-25T18:06:30Z","userId":"builtin-user-workspace-probe-0000000","workspaceId":"apricot-leopard-hugdrrvy"}
{"instanceId":"800f0a98-aa49-43bb-bd59-db8bc3b1c35b","level":"error","message":"workspace failed","serviceContext":{"service":"ws-manager","version":""},"severity":"ERROR","status":{"id":"800f0a98-aa49-43bb-bd59-db8bc3b1c35b","metadata":{"owner":"builtin-user-workspace-probe-0000000","meta_id":"apricot-leopard-hugdrrvy","started_at":{"seconds":1624644359}},"spec":{"workspace_image":"eu.gcr.io/gitpod-core-dev/registry/workspace-images:0d70f6a2cf7aa97cb7d4408ccbf775c01c05dd7297de3439b325396bbb8155dd","ide_image":"eu.gcr.io/gitpod-core-dev/build/ide/code:commit-6d0bc3279e942a39aef78c9db5aa8ec9629a9017","headless":true,"url":"https://apricot-leopard-hugdrrvy.ws-dev.gpl-headless-log-wsman.staging.gitpod-dev.com","type":1,"timeout":"30m"},"phase":5,"conditions":{"failed":"unexpected exit: exit status 222","service_exists":1,"deployed":1},"message":"headless workspace is stopping","runtime":{"node_name":"gke-dev-workload-1-49d27f81-6n3t","pod_name":"prebuild-800f0a98-aa49-43bb-bd59-db8bc3b1c35b","node_ip":"10.132.15.201"},"auth":{"owner_token":"QkJqm7OfT_YGbyous1HGtinYPQpmDtyX"}},"time":"2021-06-25T18:06:30Z","userId":"builtin-user-workspace-probe-0000000","workspaceId":"apricot-leopard-hugdrrvy"}
{"instanceId":"800f0a98-aa49-43bb-bd59-db8bc3b1c35b","level":"error","message":"workspace failed","serviceContext":{"service":"ws-manager","version":""},"severity":"ERROR","status":{"id":"800f0a98-aa49-43bb-bd59-db8bc3b1c35b","metadata":{"owner":"builtin-user-workspace-probe-0000000","meta_id":"apricot-leopard-hugdrrvy","started_at":{"seconds":1624644359}},"spec":{"workspace_image":"eu.gcr.io/gitpod-core-dev/registry/workspace-images:0d70f6a2cf7aa97cb7d4408ccbf775c01c05dd7297de3439b325396bbb8155dd","ide_image":"eu.gcr.io/gitpod-core-dev/build/ide/code:commit-6d0bc3279e942a39aef78c9db5aa8ec9629a9017","headless":true,"url":"https://apricot-leopard-hugdrrvy.ws-dev.gpl-headless-log-wsman.staging.gitpod-dev.com","type":1,"timeout":"30m"},"phase":5,"conditions":{"failed":"unexpected exit: exit status 222","service_exists":1,"deployed":1},"runtime":{"node_name":"gke-dev-workload-1-49d27f81-6n3t","pod_name":"prebuild-800f0a98-aa49-43bb-bd59-db8bc3b1c35b","node_ip":"10.132.15.201"},"auth":{"owner_token":"QkJqm7OfT_YGbyous1HGtinYPQpmDtyX"}},"time":"2021-06-25T18:06:30Z","userId":"builtin-user-workspace-probe-0000000","workspaceId":"apricot-leopard-hugdrrvy"}
{"instanceId":"800f0a98-aa49-43bb-bd59-db8bc3b1c35b","level":"error","message":"workspace failed","serviceContext":{"service":"ws-manager","version":""},"severity":"ERROR","status":{"id":"800f0a98-aa49-43bb-bd59-db8bc3b1c35b","metadata":{"owner":"builtin-user-workspace-probe-0000000","meta_id":"apricot-leopard-hugdrrvy","started_at":{"seconds":1624644359}},"spec":{"workspace_image":"eu.gcr.io/gitpod-core-dev/registry/workspace-images:0d70f6a2cf7aa97cb7d4408ccbf775c01c05dd7297de3439b325396bbb8155dd","ide_image":"eu.gcr.io/gitpod-core-dev/build/ide/code:commit-6d0bc3279e942a39aef78c9db5aa8ec9629a9017","headless":true,"url":"https://apricot-leopard-hugdrrvy.ws-dev.gpl-headless-log-wsman.staging.gitpod-dev.com","type":1,"timeout":"30m"},"phase":5,"conditions":{"failed":"unexpected exit: exit status 222","service_exists":1,"snapshot":"workspaces/apricot-leopard-hugdrrvy/snapshot-1624644390307551254.tar@gitpod-user-builtin-user-workspace-probe-0000000","deployed":1},"runtime":{"node_name":"gke-dev-workload-1-49d27f81-6n3t","pod_name":"prebuild-800f0a98-aa49-43bb-bd59-db8bc3b1c35b","node_ip":"10.132.15.201"},"auth":{"owner_token":"QkJqm7OfT_YGbyous1HGtinYPQpmDtyX"}},"time":"2021-06-25T18:06:30Z","userId":"builtin-user-workspace-probe-0000000","workspaceId":"apricot-leopard-hugdrrvy"}
{"@type":"type.googleapis.com/google.devtools.clouderrorreporting.v1beta1.ReportedErrorEvent","error":"rpc error: code = NotFound desc = workspace does not exist","instanceId":"800f0a98-aa49-43bb-bd59-db8bc3b1c35b","level":"warning","message":"cannot take snapshot","serviceContext":{"service":"ws-manager","version":""},"severity":"WARNING","time":"2021-06-25T18:06:30Z","userId":"builtin-user-workspace-probe-0000000","workspaceId":"apricot-leopard-hugdrrvy"}
{"instanceId":"800f0a98-aa49-43bb-bd59-db8bc3b1c35b","level":"error","message":"workspace failed","serviceContext":{"service":"ws-manager","version":""},"severity":"ERROR","status":{"id":"800f0a98-aa49-43bb-bd59-db8bc3b1c35b","metadata":{"owner":"builtin-user-workspace-probe-0000000","meta_id":"apricot-leopard-hugdrrvy","started_at":{"seconds":1624644359}},"spec":{"workspace_image":"eu.gcr.io/gitpod-core-dev/registry/workspace-images:0d70f6a2cf7aa97cb7d4408ccbf775c01c05dd7297de3439b325396bbb8155dd","ide_image":"eu.gcr.io/gitpod-core-dev/build/ide/code:commit-6d0bc3279e942a39aef78c9db5aa8ec9629a9017","headless":true,"url":"https://apricot-leopard-hugdrrvy.ws-dev.gpl-headless-log-wsman.staging.gitpod-dev.com","type":1,"timeout":"30m"},"phase":6,"conditions":{"failed":"unexpected exit: exit status 222","service_exists":1,"snapshot":"workspaces/apricot-leopard-hugdrrvy/snapshot-1624644390307551254.tar@gitpod-user-builtin-user-workspace-probe-0000000","final_backup_complete":1,"deployed":1},"runtime":{"node_name":"gke-dev-workload-1-49d27f81-6n3t","pod_name":"prebuild-800f0a98-aa49-43bb-bd59-db8bc3b1c35b","node_ip":"10.132.15.201"},"auth":{"owner_token":"QkJqm7OfT_YGbyous1HGtinYPQpmDtyX"}},"time":"2021-06-25T18:06:30Z","userId":"builtin-user-workspace-probe-0000000","workspaceId":"apricot-leopard-hugdrrvy"}
{"instanceId":"800f0a98-aa49-43bb-bd59-db8bc3b1c35b","level":"error","message":"workspace failed","serviceContext":{"service":"ws-manager","version":""},"severity":"ERROR","status":{"id":"800f0a98-aa49-43bb-bd59-db8bc3b1c35b","metadata":{"owner":"builtin-user-workspace-probe-0000000","meta_id":"apricot-leopard-hugdrrvy","started_at":{"seconds":1624644359}},"spec":{"workspace_image":"eu.gcr.io/gitpod-core-dev/registry/workspace-images:0d70f6a2cf7aa97cb7d4408ccbf775c01c05dd7297de3439b325396bbb8155dd","ide_image":"eu.gcr.io/gitpod-core-dev/build/ide/code:commit-6d0bc3279e942a39aef78c9db5aa8ec9629a9017","headless":true,"url":"https://apricot-leopard-hugdrrvy.ws-dev.gpl-headless-log-wsman.staging.gitpod-dev.com","type":1,"timeout":"30m"},"phase":6,"conditions":{"failed":"unexpected exit: exit status 222","service_exists":1,"snapshot":"workspaces/apricot-leopard-hugdrrvy/snapshot-1624644390307551254.tar@gitpod-user-builtin-user-workspace-probe-0000000","final_backup_complete":1,"deployed":1},"runtime":{"node_name":"gke-dev-workload-1-49d27f81-6n3t","pod_name":"prebuild-800f0a98-aa49-43bb-bd59-db8bc3b1c35b","node_ip":"10.132.15.201"},"auth":{"owner_token":"QkJqm7OfT_YGbyous1HGtinYPQpmDtyX"}},"time":"2021-06-25T18:06:30Z","userId":"builtin-user-workspace-probe-0000000","workspaceId":"apricot-leopard-hugdrrvy"}
{"instanceId":"800f0a98-aa49-43bb-bd59-db8bc3b1c35b","level":"error","message":"workspace failed","serviceContext":{"service":"ws-manager","version":""},"severity":"ERROR","status":{"id":"800f0a98-aa49-43bb-bd59-db8bc3b1c35b","metadata":{"owner":"builtin-user-workspace-probe-0000000","meta_id":"apricot-leopard-hugdrrvy","started_at":{"seconds":1624644359}},"spec":{"workspace_image":"eu.gcr.io/gitpod-core-dev/registry/workspace-images:0d70f6a2cf7aa97cb7d4408ccbf775c01c05dd7297de3439b325396bbb8155dd","ide_image":"eu.gcr.io/gitpod-core-dev/build/ide/code:commit-6d0bc3279e942a39aef78c9db5aa8ec9629a9017","headless":true,"url":"https://apricot-leopard-hugdrrvy.ws-dev.gpl-headless-log-wsman.staging.gitpod-dev.com","type":1,"timeout":"30m"},"phase":6,"conditions":{"failed":"unexpected exit: exit status 222; last backup failed: workspace does not exist. Please contact support if you need the workspace data.","service_exists":1,"snapshot":"workspaces/apricot-leopard-hugdrrvy/snapshot-1624644390307551254.tar@gitpod-user-builtin-user-workspace-probe-0000000","final_backup_complete":1,"deployed":1},"runtime":{"node_name":"gke-dev-workload-1-49d27f81-6n3t","pod_name":"prebuild-800f0a98-aa49-43bb-bd59-db8bc3b1c35b","node_ip":"10.132.15.201"},"auth":{"owner_token":"QkJqm7OfT_YGbyous1HGtinYPQpmDtyX"}},"time":"2021-06-25T18:06:30Z","userId":"builtin-user-workspace-probe-0000000","workspaceId":"apricot-leopard-hugdrrvy"}

Even though the functionality itself seems to be there (snapshot is created and attached to the workspace), these logs are far too loud and might be hiding other failure modes. Unfortunately these issues would need to be fixed before merging this.

If we were not to merge this PR prior to deploying, would that break prebuilds?

@geropl
Member Author

geropl commented Jun 28, 2021

> Even though the functionality itself seems to be there (snapshot is created and attached to the workspace), these logs are far too loud and might be hiding other failure modes. Unfortunately these issues would need to be fixed before merging this.

Hm, I'm not sure what to do about this. The PR has not touched that area at all. The only thing I can think of is to switch the ERROR to an INFO, because it's a prebuild task that failed, not a workspace.

> If we were not to merge this PR prior to deploying, would that break prebuilds?

No, there is no dependency here. Though we should double check on staging, of course.

@geropl
Member Author

geropl commented Jun 28, 2021

> Hm, I'm not sure what to do about this. The PR has not touched that area at all. The only thing I can think of is to switch the ERROR to an INFO, because it's a prebuild task that failed, not a workspace.

Ok, after looking at the code there is one distinct line where we try to be loud about things we deem urgent. I introduced a condition PREBUILD_TASK_FAILED (35e1a77) and excluded those from that line - seems to work nicely.
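
Roughly, the exclusion works like the sketch below. This is an illustration, not the code from 35e1a77: the `conditions` struct, its field names, and `shouldLogLoudly` are hypothetical, and only the idea (a failed prebuild task should not hit the loud error line) is taken from the comment above. Whether the real code logs such failures at a lower level or not at all is not stated here.

```go
package main

import "fmt"

// conditions is a hypothetical, simplified stand-in for the workspace conditions.
type conditions struct {
	Failed             string // non-empty when the workspace is considered failed
	PrebuildTaskFailed string // non-empty when a prebuild task failed (hypothetical field)
}

// shouldLogLoudly reports whether a status update belongs on the loud
// "workspace failed" error line. Failures caused by a prebuild task are
// expected behaviour of user code and are therefore excluded.
func shouldLogLoudly(c conditions) bool {
	return c.Failed != "" && c.PrebuildTaskFailed == ""
}

func main() {
	infraFailure := conditions{Failed: "cannot connect to ws-daemon"}
	taskFailure := conditions{
		Failed:             "unexpected exit: exit status 222",
		PrebuildTaskFailed: "task returned a non-zero exit code",
	}

	fmt.Println(shouldLogLoudly(infraFailure)) // true
	fmt.Println(shouldLogLoudly(taskFailure))  // false
}
```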

@aledbf Do you have time to review?

@aledbf
Member

aledbf commented Jun 28, 2021

/werft run

👍 started the job as gitpod-build-gpl-headless-log-wsman.64

@geropl geropl force-pushed the gpl/headless-log-wsman branch 3 times, most recently from bffd808 to 1f06391 Compare July 1, 2021 14:35
@geropl geropl requested a review from csweichel July 5, 2021 08:02
@geropl
Member Author

geropl commented Jul 5, 2021

/werft run

👍 started the job as gitpod-build-gpl-headless-log-wsman.71

@geropl geropl requested a review from a team as a code owner July 6, 2021 13:45
@geropl geropl requested review from a team and AlexTugarev and removed request for a team July 6, 2021 13:45
@csweichel
Contributor

csweichel commented Jul 7, 2021

/werft run

👍 started the job as gitpod-build-gpl-headless-log-wsman.74

Contributor

@csweichel csweichel left a comment

This one took way too long to review/rework/approve - I am very sorry for that :/

@csweichel
Contributor

csweichel commented Jul 7, 2021

/werft run

👍 started the job as gitpod-build-gpl-headless-log-wsman.75

@csweichel
Contributor

csweichel commented Jul 8, 2021

/werft run

👍 started the job as gitpod-build-gpl-headless-log-wsman.76

@geropl geropl merged commit 8f0c24a into main Jul 8, 2021
@geropl geropl deleted the gpl/headless-log-wsman branch July 8, 2021 07:27
3 participants