che-code - Open IDE redirects too fast: Bad Gateway #21099

Closed
nils-mosbach opened this issue Jan 31, 2022 · 17 comments · Fixed by che-incubator/che-code#26 or eclipse-che/che-operator#1309
Labels
area/che-operator Issues and PRs related to Eclipse Che Kubernetes Operator area/gateway kind/bug Outline of a bug - must adhere to the bug report template. severity/P1 Has a major impact to usage or development of the system.

Comments

@nils-mosbach

Describe the bug

When starting a new workspace, opening the IDE (e.g. che-code) is triggered too fast and the request gets a Bad Gateway response. Clicking refresh in the browser after 2 seconds loads everything as expected. Maybe che-code or the Traefik configuration is not yet ready?

So basically the redirect from here

image

to the IDE happens too fast.

image

Startup times of workspaces are incredibly fast after we switched to devworkspaces. Great improvement! :)

Che version

next (development version)

Steps to reproduce

Start che-code environment.

Expected behavior

che-code should be opened as soon as it's ready.

Runtime

Kubernetes (vanilla)

Screenshots

No response

Installation method

chectl/latest

Environment

Linux

Eclipse Che Logs

No response

Additional context

No response

@nils-mosbach nils-mosbach added the kind/bug Outline of a bug - must adhere to the bug report template. label Jan 31, 2022
@che-bot che-bot added the status/need-triage An issue that needs to be prioritized by the curator responsible for the triage. See https://github. label Jan 31, 2022
@benoitf
Contributor

benoitf commented Jan 31, 2022

might be an issue related to #20481

I'm also facing the issue

@benoitf benoitf changed the title from "che-code - Open IDE redirects to fast: Bad Gateway" to "che-code - Open IDE redirects too fast: Bad Gateway" Jan 31, 2022
@azatsarynnyy azatsarynnyy added area/editor/vscode Issues related to the Code OSS editor of Che severity/P2 Has a minor but important impact to the usage or development of the system. and removed status/need-triage An issue that needs to be prioritized by the curator responsible for the triage. See https://github. labels Feb 1, 2022
@benoitf benoitf reopened this Feb 1, 2022
@benoitf benoitf added area/che-operator Issues and PRs related to Eclipse Che Kubernetes Operator severity/P1 Has a major impact to usage or development of the system. and removed area/editor/vscode Issues related to the Code OSS editor of Che severity/P2 Has a minor but important impact to the usage or development of the system. labels Feb 1, 2022
@benoitf
Contributor

benoitf commented Feb 1, 2022

The issue is not inside the editor, because after a short time the route works fine.

It's related to route availability.

@benoitf
Contributor

benoitf commented Feb 1, 2022

@amisevsk @sparkoo how is the route flagged as ready? It seems that the route is flagged as ready before traefik is able to reach the endpoint.

I got rid of the connectionToken parameter, so the che-code endpoint should return either 200 or nothing if it has not started yet.

For Red Hatters, it can be reproduced with https://che-dogfooding.apps.che-dev.x6e0.p1.openshiftapps.com/#https://github.com/che-incubator/che-code

@amisevsk
Contributor

amisevsk commented Feb 1, 2022

On the DWO side, the DevWorkspace is not marked as "started" until <mainUrl>/healthz returns a 2xx (or 4xx) status code (4xx is assumed to mean "unimplemented").

Testing locally, I'm seeing that curl <che-gateway>/<workspaceID>/dev/3100/healthz returns 404 not found -- does Che Code serve this endpoint?
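
The readiness rule described in this comment can be sketched in a few lines. This is only an illustration in Python, not the DevWorkspace Operator's actual code; the names `is_ready` and `wait_until_ready`, the `probe` callback, and the retry parameters are hypothetical, and treating every status outside 2xx/4xx as "not ready" is an assumption.

```python
import time
from typing import Callable


def is_ready(status_code: int) -> bool:
    """Apply the rule from the comment above: 2xx means healthy, and 4xx is
    assumed to mean "/healthz is unimplemented", so it also counts as ready."""
    return 200 <= status_code < 300 or 400 <= status_code < 500


def wait_until_ready(probe: Callable[[], int], attempts: int = 10,
                     delay_seconds: float = 0.0) -> bool:
    """Call probe() (hypothetical: returns the HTTP status of
    <mainUrl>/healthz) until is_ready() accepts it or attempts run out."""
    for _ in range(attempts):
        if is_ready(probe()):
            return True
        time.sleep(delay_seconds)
    return False


# Simulated statuses: the gateway answers 502 twice, then a 404 from an
# editor without a /healthz route is accepted as "ready".
responses = iter([502, 502, 404])
print(wait_until_ready(lambda: next(responses)))
```

Because a 404 passes the check, a probe like this cannot distinguish "editor is up but has no /healthz route" from "something else answered with 404".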

@l0rd
Contributor

l0rd commented Feb 1, 2022

I also get that with Che Theia from time to time (it just happened).

@benoitf
Contributor

benoitf commented Feb 2, 2022

Of course che-code is serving this endpoint (hitting refresh a few seconds after the 'bad gateway' message just works fine).

@benoitf
Contributor

benoitf commented Feb 2, 2022

It looks like healthz may report that che-code is OK to receive requests while traefik has not yet been updated to handle that route, so the request fails.
The error appears to come from traefik (plain-text "Bad Gateway" message, error 502).

@amisevsk
Contributor

amisevsk commented Feb 2, 2022

che-code is serving this endpoint (hitting refresh few seconds later after the 'bad gateway' message just works fine)

I don't know if it's relevant in this case, but note that issues with /healthz are easily obscured since we have to accept 4xx status codes as successes. Testing it with che-code, I've seen 404 and 401 response codes even in up-and-running workspaces, so it does seem like something somewhere is going wrong.

@benoitf
Contributor

benoitf commented Feb 2, 2022

The 401 response is gone in the latest builds (previous builds required a token), so a 401 could happen before but not anymore, since the token is no longer required.
And Mario reproduced the issue in Theia, where there is definitely no token/401 error.

@benoitf
Contributor

benoitf commented Feb 2, 2022

BTW, it still looks odd that 404 is accepted as a success, because if Theia or che-code has not launched yet, the endpoint will return a 404 during startup.

It looks like a race condition: something reports the workspace as ready while the traefik route is not.

@amisevsk
Contributor

amisevsk commented Feb 2, 2022

PR eclipse-che/che-operator#1309 should resolve this issue (tested briefly, at least). Changes are available in image quay.io/amisevsk/che-operator:issue-21099 (replaces quay.io/eclipse/che-operator:next)

However, on the topic of /healthz endpoints, from what I can tell both Theia and che-code return 404 status codes:

Theia:

bash-5.0$ curl http://localhost:3100/healthz -I
HTTP/1.1 404 Not Found

Che-code:

bash-4.4$ curl http://localhost:3100/healthz -i 
HTTP/1.1 404 Not Found
Content-Type: text/plain
Date: Wed, 02 Feb 2022 20:33:22 GMT
Connection: keep-alive
Keep-Alive: timeout=5
Transfer-Encoding: chunked

Not found.

c.f. Web terminal, for example:

bash-4.4 ~ $ curl -k https://localhost:4444/healthz -i
HTTP/2 200 
content-length: 0
date: Wed, 02 Feb 2022 20:33:02 GMT

@benoitf
Contributor

benoitf commented Feb 2, 2022

But healthz is not managed by theia or code. There is no such endpoint

@amisevsk
Contributor

amisevsk commented Feb 2, 2022

But healthz is not managed by theia or code. There is no such endpoint

Ah, sorry -- I was confused. It seemed like you were saying that Theia/Che-code were serving /healthz here. If this is the case, and we stop treating 404 as success (as in #21099 (comment)) then DevWorkspaces with Theia and Che-code will never enter a running state.

A more proper solution for readiness/liveness checks would be devfile/api#168
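
To make the trade-off concrete, here is a small sketch comparing the current lenient rule with a hypothetical strict rule that accepts only 2xx. The status values are the ones from the curl output earlier in this thread (404 for Theia and che-code, 200 for the web terminal); the function names and the dictionary are illustrative and not part of any Che component.

```python
def lenient_ready(status: int) -> bool:
    # Behaviour as described in this thread: 2xx or 4xx counts as ready.
    return 200 <= status < 300 or 400 <= status < 500


def strict_ready(status: int) -> bool:
    # Hypothetical alternative: only 2xx counts as ready.
    return 200 <= status < 300


# /healthz status codes observed in the curl output in this thread.
observed = {"theia": 404, "che-code": 404, "web-terminal": 200}

for name, status in observed.items():
    print(f"{name}: lenient={lenient_ready(status)} strict={strict_ready(status)}")
```

Under the strict rule, the two editors that answer 404 would never be reported as ready, which is the "never enter a running state" outcome mentioned above.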

@benoitf
Contributor

benoitf commented Feb 2, 2022

But, for example, why is it not checking the path of the main endpoint?

Like / instead of /healthz?

@benoitf
Contributor

benoitf commented Feb 2, 2022

HEAD request on the main endpoint / for both theia and vscode

@amisevsk
Contributor

amisevsk commented Feb 2, 2022

The history of the decision is found in eclipse-che/che-machine-exec#105 and devfile/devworkspace-operator#78. In a general sense, checking / on the main endpoint is not exactly reliable either, and /healthz is/was more-or-less a Kubernetes standard.

Ironically, doing a HEAD request on / would result in precisely the bug reported here, since it would redirect to oauth as in #20481:

curl -IL https://che-<snip>/workspaced642721e6fb04d37/dev/3100/
HTTP/1.1 302 Found
<snip>

HTTP/1.1 200 OK
<snip>

In fact, the reason this bug happened is because excluding /healthz from auth was failing for some workspaces (see the description to eclipse-che/che-operator#1309 for an explanation).
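
The false positive from following redirects can be illustrated as follows. This sketch assumes a probe that records each hop as a (URL, status) pair; the hostnames and paths are hypothetical placeholders rather than the real Che or OAuth URLs, and the function names are made up for the example.

```python
from urllib.parse import urlsplit


def naive_ready(hops):
    """Look only at the final status, as the `curl -IL` output would suggest."""
    return 200 <= hops[-1][1] < 300


def answered_by_target(hops):
    """Require that the final 2xx response comes from the same host and path
    that was probed, so a redirect (for example to a login page) does not
    count as the workspace being up."""
    first, last = urlsplit(hops[0][0]), urlsplit(hops[-1][0])
    same_place = (first.netloc, first.path) == (last.netloc, last.path)
    return same_place and 200 <= hops[-1][1] < 300


# Hypothetical hop list mirroring the 302 -> 200 sequence shown above.
hops = [
    ("https://che.example.test/workspace-id/dev/3100/", 302),
    ("https://oauth.example.test/login", 200),
]
print(naive_ready(hops), answered_by_target(hops))
```

The naive check reports ready because the login page answered 200, even though the editor itself was never reached.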

@benoitf
Contributor

benoitf commented Feb 2, 2022

Well, the gateway component is intercepting the calls, so it could handle /healthz itself and check whether the underlying service is returning 200 or not.

Why put a gateway in every container if it doesn't bring value?

It seems strange to patch Theia and the VS Code server to add a /healthz endpoint.

6 participants