Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[supervisor] Make resource status request more resilient #12103

Merged
merged 1 commit into from
Aug 16, 2022

Conversation

andreafalzetti
Copy link
Contributor

@andreafalzetti andreafalzetti commented Aug 12, 2022

Description

Makes the resource status request more resilient by avoiding calling syncronously upstream to fetch resource status at every requests. With more clients adding functionalities such as gp top or workspace cpu/memory in jetbrains ide control center, the risk of getting rate-limited is guaranteed.

With the approached used in this PR, we retrieve data from upstream at every second and when clients request it, we return the latest available data. Additional, an exponential backoff strategy is in place to be more resilient.

Related Issue(s)

Fixes #12083

How to test

Compare how this prev env behaves in comparison with gitpod.io by running a workspace and run watch -n 0.1 gp top. In gitpod.io (prod) it will almost immediately fail by returning gRPC errors because ResourceStatus endpoint is pulled too frequently and we hit the rate-limit.

In the prev env, you should be able to run multiple watch -n 0.1 gp top without any errors.

Release Notes

NONE

Documentation

Werft options:

  • /werft with-preview

@andreafalzetti andreafalzetti changed the title feat(supervisor): add top service [supervisor] Make resource status request more resilient Aug 12, 2022
@gitpod-io gitpod-io deleted a comment from roboquat Aug 12, 2022
@andreafalzetti andreafalzetti force-pushed the afalz/12083-supervisor-extend-rate-limiting branch 5 times, most recently from a5b8337 to 2bda963 Compare August 12, 2022 17:15
@andreafalzetti andreafalzetti marked this pull request as ready for review August 15, 2022 06:09
@andreafalzetti andreafalzetti requested a review from a team August 15, 2022 06:09
@andreafalzetti andreafalzetti marked this pull request as draft August 15, 2022 06:10
@andreafalzetti andreafalzetti force-pushed the afalz/12083-supervisor-extend-rate-limiting branch from 2bda963 to 0f2d58f Compare August 15, 2022 06:37
@andreafalzetti andreafalzetti marked this pull request as ready for review August 15, 2022 06:39
@akosyakov
Copy link
Member

@andreafalzetti How do you test failure case?

components/supervisor/pkg/supervisor/services.go Outdated Show resolved Hide resolved
components/supervisor/pkg/supervisor/top.go Outdated Show resolved Hide resolved
components/supervisor/pkg/supervisor/top.go Show resolved Hide resolved
components/supervisor/pkg/supervisor/top.go Outdated Show resolved Hide resolved
@andreafalzetti andreafalzetti force-pushed the afalz/12083-supervisor-extend-rate-limiting branch from 0f2d58f to 954ad36 Compare August 15, 2022 13:14
@roboquat roboquat added size/L and removed size/M labels Aug 15, 2022
@andreafalzetti
Copy link
Contributor Author

andreafalzetti commented Aug 15, 2022

How do you test failure case?

@akosyakov That's what is mainly left now to get this merged. If you have any suggestions or can point me to some similar tests, I would appreciate it. In the meantime I will attempt some ideas.

@akosyakov
Copy link
Member

Run several watch gp top in parallel and then manipulate with sock file.

@akosyakov
Copy link
Member

akosyakov commented Aug 15, 2022

@andreafalzetti I actually cannot start a workspace it gets stopped immediately.

@andreafalzetti
Copy link
Contributor Author

@andreafalzetti I actually cannot start a workspace if gets stopped immediately.

@akosyakov Weird. I was able to reproduce. Before today's changes it was working fine. Maybe it's the prev env, I will try to do a clean deployment

@andreafalzetti
Copy link
Contributor Author

andreafalzetti commented Aug 15, 2022

/werft run with-clean-slate-deployment

👍 started the job as gitpod-build-afalz-12083-supervisor-extend-rate-limiting.9
(with .werft/ from main)

@akosyakov
Copy link
Member

akosyakov commented Aug 15, 2022

You can check workspace logs, maybe supervisor panics somehow after last change, i.e. not initialised channel and then nil pointer exception.

@andreafalzetti
Copy link
Contributor Author

andreafalzetti commented Aug 15, 2022

@akosyakov If I look for the file pathes used to read memory/cpu, in my workspace, I don't see them. What am I missing?

e.g. /sys/fs/cgroup/memory/memory.limit_in_bytes -> https://github.com/gitpod-io/gitpod/blob/main/components/supervisor/pkg/supervisor/top.go#L67

cat: /sys/fs/cgroup/memory/memory.limit_in_bytes: No such file or directory

the sock file /.supervisor/info.sock is there, but I am not sure how to mock/alter it yet

@akosyakov
Copy link
Member

akosyakov commented Aug 15, 2022

If I look for the file pathes used to read memory/cpu, in my workspace, I don't see them. What am I missing?

for cgroup v2 there is not such files only via /.supervisor/info.sock It is expected if sock is missing then we try to compute with cgroup v1 if not possible we fail completely, one cannot compute such info for v2 from within of the workspace.

@andreafalzetti andreafalzetti force-pushed the afalz/12083-supervisor-extend-rate-limiting branch from 954ad36 to 13603a6 Compare August 15, 2022 14:14
@andreafalzetti
Copy link
Contributor Author

@akosyakov workspace starts now, it was re-closing a closed channel

Copy link
Member

@akosyakov akosyakov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

works as advertised

/hold
if you want to clean up something

@andreafalzetti andreafalzetti force-pushed the afalz/12083-supervisor-extend-rate-limiting branch from 13603a6 to 97f6c60 Compare August 16, 2022 10:23
@@ -257,6 +257,9 @@ func Run(options ...RunOption) {
go analyseConfigChanges(ctx, cfg, analytics, gitpodConfigService, gitpodService)
}

topService := NewTopService(Top)
Copy link
Member

@akosyakov akosyakov Aug 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit) I think for usual clients it should be just NewTopService()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@akosyakov do you mean without parameters?

@akosyakov
Copy link
Member

akosyakov commented Aug 16, 2022

I left some nits, can be merged without addressing them as well. Not sure about the test, using timeouts can lead to flakiness.

@andreafalzetti andreafalzetti force-pushed the afalz/12083-supervisor-extend-rate-limiting branch from 97f6c60 to a47d1ea Compare August 16, 2022 13:32
@andreafalzetti andreafalzetti self-assigned this Aug 16, 2022
@andreafalzetti
Copy link
Contributor Author

/unhold

@roboquat roboquat merged commit 47c64d4 into main Aug 16, 2022
@roboquat roboquat deleted the afalz/12083-supervisor-extend-rate-limiting branch August 16, 2022 13:50
@roboquat roboquat added deployed: IDE IDE change is running in production deployed Change is completely running in production labels Aug 17, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component: supervisor deployed: IDE IDE change is running in production deployed Change is completely running in production release-note-none size/L team: IDE
Projects
None yet
Development

Successfully merging this pull request may close these issues.

extend supervisor rate-limiting on ResourceStatus
4 participants