
[supervisor] Trigger instance update if missing after port exposure #7058

Closed
wants to merge 1 commit

Conversation

corneliusludmann (Contributor)

Description

When the server stops sending instance updates for some reason, ports get stuck in “detecting”. This PR checks, 1 minute after port exposure, whether we got the port exposure information from the server; if not, it actively asks for an instance update.

@akosyakov @csweichel After implementing this, I'm not sure whether this is 100% what you had in mind. I'm very open to any suggestions for other implementations.
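A minimal sketch of the fallback described above; the helper names (awaitExposureUpdate, exposureSeen, requestUpdate) and the timeout parameter are placeholders for illustration, not the actual supervisor code in this PR:

package ports

import (
	"context"
	"log"
	"time"
)

// awaitExposureUpdate illustrates the idea of the PR: after a port has been
// exposed, wait (one minute in the PR) for the server to send an instance
// update carrying the exposure info; if none arrives, actively request an
// update instead of leaving the port stuck in "detecting".
func awaitExposureUpdate(ctx context.Context, timeout time.Duration, exposureSeen <-chan struct{}, requestUpdate func(context.Context)) {
	select {
	case <-ctx.Done():
		return
	case <-exposureSeen:
		// the regular instance update arrived; nothing to do
		return
	case <-time.After(timeout):
		log.Println("we haven't seen an instance update with port exposure info, requesting one")
		requestUpdate(ctx)
	}
}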

Related Issue(s)

Mitigate/Fixes #6778

See also: #7054

How to test

I tested this PR by adding this change:

diff --git a/components/supervisor/pkg/ports/exposed-ports.go b/components/supervisor/pkg/ports/exposed-ports.go
index 5408a5b3..001cbcbc 100644
--- a/components/supervisor/pkg/ports/exposed-ports.go
+++ b/components/supervisor/pkg/ports/exposed-ports.go
@@ -120,6 +120,13 @@ func (g *GitpodExposedPorts) Observe(ctx context.Context) (<-chan []ExposedPort,
                for {
                        select {
                        case u := <-updates:
+
+                               // TODO: ignoring all updates just for testing purposes
+                               // remove the following block before merging
+                               if u != nil {
+                                       break
+                               }
+
                                res := getExposedPorts(u)
                                if res == nil {
                                        return

This change ignores all instance updates from the server. With it we see the same behavior as in prod: the ports get stuck in “detecting”. After 1 minute, supervisor logs “we haven't seen an instance update with port exposure info after 1 minute” and “detecting” disappears.

Release Notes

Mitigate ports getting stuck in detecting

@roboquat commented Dec 3, 2021

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please ask for approval from corneliusludmann after the PR has been reviewed.

Associated issue: #6778

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@roboquat added the size/L label on Dec 3, 2021
@codecov bot commented Dec 3, 2021

Codecov Report

Merging #7058 (3530a57) into main (81a86c2) will increase coverage by 17.89%.
The diff coverage is 0.00%.

Impacted file tree graph

@@             Coverage Diff             @@
##             main    #7058       +/-   ##
===========================================
+ Coverage   19.04%   36.94%   +17.89%     
===========================================
  Files           2       19       +17     
  Lines         168     4623     +4455     
===========================================
+ Hits           32     1708     +1676     
- Misses        134     2781     +2647     
- Partials        2      134      +132     
Flag                                     Coverage Δ
components-local-app-app-linux-amd64    ?
components-local-app-app-linux-arm64    ?
components-local-app-app-windows-386    ?
components-local-app-app-windows-amd64  ?
components-local-app-app-windows-arm64  ?
components-supervisor-app               36.94% <0.00%> (?)

Flags with carried forward coverage won't be shown.

Impacted Files                                         Coverage Δ
components/supervisor/pkg/ports/exposed-ports.go       0.00% <0.00%> (ø)
components/supervisor/pkg/ports/ports.go               60.60% <0.00%> (ø)
components/local-app/pkg/auth/pkce.go
components/local-app/pkg/auth/auth.go
components/supervisor/pkg/supervisor/tasks.go          58.56% <0.00%> (ø)
...mponents/supervisor/pkg/supervisor/notification.go  83.95% <0.00%> (ø)
components/supervisor/pkg/terminal/ring-buffer.go      45.65% <0.00%> (ø)
components/supervisor/pkg/dropwriter/dropwriter.go     73.46% <0.00%> (ø)
components/supervisor/pkg/terminal/terminal.go         64.52% <0.00%> (ø)
components/supervisor/pkg/ports/slirp4netns.go         0.00% <0.00%> (ø)
... and 13 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 81a86c2...3530a57. Read the comment docs.

@akosyakov (Member) left a review comment:

It would be nice to add a couple of tests for it by mocking the API service in ports_test.go.
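A rough sketch of what such a test could look like, reusing the hypothetical awaitExposureUpdate helper sketched in the description above (not the actual ports_test.go fixtures); the mocked server simply never delivers the exposure update:

package ports

import (
	"context"
	"testing"
	"time"
)

func TestRequestsUpdateWhenExposureNeverArrives(t *testing.T) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// Simulate a server that stops sending instance updates: the exposure
	// channel never fires, so the fallback has to kick in.
	exposureSeen := make(chan struct{})
	requested := make(chan struct{}, 1)

	go awaitExposureUpdate(ctx, 10*time.Millisecond, exposureSeen, func(context.Context) {
		requested <- struct{}{}
	})

	select {
	case <-requested:
		// success: supervisor actively asked for an instance update
	case <-time.After(time.Second):
		t.Fatal("expected an instance update to be requested after the timeout")
	}
}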

On these lines of the change:

    errchan = make(chan error, 1)
    )
    errchan := make(chan error, 1)
    g.exposedPorts = make(chan []ExposedPort)

@akosyakov: It would imply that now we can have only one listener.

On these lines:

    if err != nil {
        return err
    }
    res := getExposedPorts(wsInfo.LatestInstance)

@akosyakov (Dec 6, 2021): I think it is better if we do the sync at the level of APIoverJSONRPC; that way we don't need to worry about changing the semantics of its clients.

I.e. add APIoverJSONRPC.SyncInstance; you can look at how we apply similar logic in the frontend. There are 2 important things:

  • Avoid multiple concurrent syncs, i.e. the last one wins, for instance like here:
    private sync(): void {
        this.cancelSync();
        this.syncTokenSource = new CancellationTokenSource();
        const token = this.syncTokenSource.token;
        this.syncQueue = this.syncQueue.then(async () => {
            if (token.isCancellationRequested) {
                return;
            }
            try {
                const info = await this.service.server.getWorkspace(this._info.workspace.id);
                if (token.isCancellationRequested) {
                    return;
                }
                this._info = info;
                this.source = 'sync';
                this.onDidChangeEmitter.fire(undefined);
            } catch (e) {
                console.error('failed to sync workspace instance:', e)
            }
        })
    }
  • Resolve conflicts between instance updates and sync: if the sync has seen newer data, then we should ignore the new instance update. It can be done based on phases, like here:
    if (instance.id !== this.info.latestInstance?.id) {
        return false;
    }
    return phasesOrder[instance.status.phase] < phasesOrder[this.info.latestInstance.status.phase];
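A rough Go sketch of how such an APIoverJSONRPC.SyncInstance could apply the first point (last one wins) in supervisor; all names, fields, and callbacks here are assumptions for illustration, and the phase-based conflict resolution from the second point is left out:

package serverapi

import (
	"context"
	"log"
	"sync"
)

// WorkspaceInfo stands in for the instance data returned by the server.
type WorkspaceInfo struct {
	InstanceID string
	Phase      string
}

// instanceSyncer sketches a "last one wins" sync: starting a new sync cancels
// the one still in flight, so only the most recent result reaches onUpdate.
type instanceSyncer struct {
	mu         sync.Mutex
	cancelPrev context.CancelFunc
	getInfo    func(ctx context.Context) (*WorkspaceInfo, error) // e.g. a getWorkspace call
	onUpdate   func(info *WorkspaceInfo)
}

func (s *instanceSyncer) SyncInstance(ctx context.Context) {
	s.mu.Lock()
	if s.cancelPrev != nil {
		s.cancelPrev() // last one wins: abort the sync that is still running
	}
	ctx, cancel := context.WithCancel(ctx)
	s.cancelPrev = cancel
	s.mu.Unlock()

	go func() {
		info, err := s.getInfo(ctx)
		if err != nil {
			log.Printf("failed to sync workspace instance: %v", err)
			return
		}
		if ctx.Err() != nil {
			// a newer sync has started in the meantime; drop this result
			return
		}
		s.onUpdate(info)
	}()
}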

@corneliusludmann (Author)

Anton and I discussed that we want to postpone this change for now and start with adding more logging (#7083).

@akosyakov (Member)

@iQQBot Could you pick this up as a mitigation for #6778, while @geropl is trying to solve the root cause in server (#7054)?

@geropl (Member) commented Dec 10, 2021

@akosyakov I had a look into supervisor, and it seems we're not using a reconnect handler. That could help make sure we see all the events we're waiting for, even across reconnects.

@akosyakov (Member)

> @akosyakov I had a look into supervisor, and it seems we're not using a reconnect handler. That could help make sure we see all the events we're waiting for, even across reconnects.

Do you mean to force a getWorkspace on reconnect, similar to how we do it in the local companion?
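A minimal sketch of what such a reconnect hook could look like; onReconnect and the sync callback are assumed names, not the actual supervisor or local-companion API:

package serverapi

import (
	"context"
	"log"
)

// watchReconnects forces a workspace sync whenever the connection to the
// Gitpod server is re-established, so updates missed while disconnected
// (e.g. port exposure info) are picked up again.
func watchReconnects(ctx context.Context, onReconnect <-chan struct{}, sync func(ctx context.Context) error) {
	for {
		select {
		case <-ctx.Done():
			return
		case <-onReconnect:
			if err := sync(ctx); err != nil {
				log.Printf("failed to sync workspace after reconnect: %v", err)
			}
		}
	}
}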

@geropl (Member) commented Dec 10, 2021

Still not 100% sure this caused the “port exposure” troubles, but it qualifies as the general error handling we talked about here.

@stale bot commented Dec 20, 2021

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

The stale bot added the “meta: stale” label (this issue/PR is stale and will be closed soon) on Dec 20, 2021.
@akosyakov (Member)

I'm closing it for now. We would like to add some syncing, but on reconnection to the Gitpod server.
