Error is not meaningful when workspace startup fails because route quota is exceeded #22068

Closed
l0rd opened this issue Mar 17, 2023 · 12 comments · Fixed by eclipse-che/che-operator#1805
Assignees
Labels
area/che-operator Issues and PRs related to Eclipse Che Kubernetes Operator area/devworkspace-operator kind/bug Outline of a bug - must adhere to the bug report template. severity/P1 Has a major impact to usage or development of the system. sprint/next team/A This team is responsible for the Che Operator and all its operands as well as chectl and Hosted Che

Comments

@l0rd
Contributor

l0rd commented Mar 17, 2023

Describe the bug

The default workspace has 3 routes. If a route count quota is configured in the namespace of the workspace pod and creating 3 new routes would exceed that quota, workspace creation fails.
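The arithmetic behind this failure can be sketched as follows. This is only an illustration, not code from Che or the DevWorkspace Operator: the function name and the quota of 10 routes are assumptions chosen for the example, and only the figure of 3 routes per default workspace comes from the description above.

```python
ROUTES_PER_DEFAULT_WORKSPACE = 3  # figure taken from the issue description


def routes_fit_in_quota(used: int, hard: int,
                        requested: int = ROUTES_PER_DEFAULT_WORKSPACE) -> bool:
    """Return True if creating `requested` more routes stays within the quota."""
    return used + requested <= hard


# Hypothetical namespace with a quota of 10 routes (assumed value).
hard_limit = 10
used = 0
started = 0
while routes_fit_in_quota(used, hard_limit):
    used += ROUTES_PER_DEFAULT_WORKSPACE
    started += 1

print(f"{started} workspaces fit; the next one needs "
      f"{ROUTES_PER_DEFAULT_WORKSPACE} routes but only {hard_limit - used} remain")
```

Note that in this sketch `used` never decreases, which mirrors the case where routes belonging to existing workspaces keep counting against the quota.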

The problem is that workspace creation fails only on timeout (after waiting 5 minutes), and the developer has no clue how to address the problem. They may spend hours trying to start a workspace without success, even though deleting an existing workspace would solve the problem:

image

This is even more problematic as there is no Kubernetes event during the startup:

image

Che version

7.61@latest

Steps to reproduce

Start 4 workspaces on Developer Sandbox. The 4th will fail. (The quota is going to be updated, so you may need to start more workspaces.)

image

Expected behavior

We should catch the error and present the following message: "The workspace cannot be started because it requires the creation of a number of routes () that exceeds the quota () for the current namespace (). Deleting an existing workspace may solve the problem."
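A minimal sketch of how such a message could be assembled is shown below. The issue leaves the parenthesized values empty, so the function name, the parameter names, and which values go in each slot are all assumptions made for illustration.

```python
def quota_exceeded_message(requested: int, quota: int, namespace: str) -> str:
    """Build the proposed user-facing message.

    All three parameters are hypothetical: the issue does not say which
    values should fill the empty parentheses.
    """
    return (
        "The workspace cannot be started because it requires the creation of "
        f"a number of routes ({requested}) that exceeds the quota ({quota}) "
        f"for the current namespace ({namespace}). "
        "Deleting an existing workspace may solve the problem."
    )


# Example values are invented for demonstration only.
print(quota_exceeded_message(requested=3, quota=10, namespace="example-dev"))
```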

Runtime

OpenShift

Screenshots

No response

Installation method

OperatorHub, other (please specify in additional context)

Environment

Dev Sandbox (workspaces.openshift.com)

Eclipse Che Logs

$ kubectl get dw
NAME                      DEVWORKSPACE ID             PHASE      INFO
python-hello-world        workspace85995a207cb641a2   Stopped    Stopped
python-hello-world-4iqo   workspacee0385883730b4ef8   Stopped    Stopped
python-hello-world-kqii   workspacec9211b3e57f1461f   Stopped    Stopped
python-hello-world-qvbu   workspacedbf3bb93a7294cf4   Starting   Preparing routes

$ kubectl get dwr
NAME                                DEVWORKSPACE ID             PHASE       INFO
routing-workspace85995a207cb641a2   workspace85995a207cb641a2   Ready       DevWorkspaceRouting prepared
routing-workspacec9211b3e57f1461f   workspacec9211b3e57f1461f   Ready       DevWorkspaceRouting prepared
routing-workspacedbf3bb93a7294cf4   workspacedbf3bb93a7294cf4   Preparing   Preparing routes
routing-workspacee0385883730b4ef8   workspacee0385883730b4ef8   Ready       DevWorkspaceRouting prepared

Additional context

There is no Kubernetes event during the startup (zero).

@l0rd l0rd added the kind/bug Outline of a bug - must adhere to the bug report template. label Mar 17, 2023
@ibuziuk
Member

ibuziuk commented Mar 17, 2023

@che-bot che-bot added the status/need-triage An issue that needs to be prioritized by the curator responsible for the triage. See https://github. label Mar 17, 2023
@l0rd l0rd added severity/P1 Has a major impact to usage or development of the system. area/che-operator Issues and PRs related to Eclipse Che Kubernetes Operator area/devworkspace-operator and removed status/need-triage An issue that needs to be prioritized by the curator responsible for the triage. See https://github. labels Mar 17, 2023
@l0rd
Contributor Author

l0rd commented Mar 17, 2023

@ibuziuk this is good, users will be able to have 4 workspaces.

But workspace creation will fail (with no clue why) if the user has 9 stopped workspaces. And it's fewer than 9 if workspaces have endpoints (and usually they do). I think we have the opportunity to limit the workspace count now (because of this issue, users cannot have created many workspaces lately), and I have created #22069.

@amisevsk
Contributor

There is also the DevWorkspaceOperatorConfig .config.workspace.cleanupOnStop setting, which configures DWO to delete owned resources of the DevWorkspace when it is stopped. This would delete the DevWorkspace's DevWorkspaceRouting CR, which should clean up any routes created for the workspace.

@amisevsk
Contributor

To resolve this bug, the Che Operator should detect this condition (cannot create routes due to quota) and set the DWR to a failed state. This would trigger workspace failure.
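As a rough sketch of the detection step described above (not the actual fix in either operator): a quota rejection from the Kubernetes API is reported as a `Status` object with code 403, reason `Forbidden`, and a message containing "exceeded quota". The `RoutingStatus` type, the function name, and the `"Failed"` phase string are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RoutingStatus:
    """Hypothetical stand-in for the status a routing reconciler would set."""
    phase: str
    message: str


def routing_status_for_error(status: dict) -> Optional[RoutingStatus]:
    """Map a Kubernetes API Status object to a failed routing status.

    Returns None for errors that are not quota rejections, so the caller
    can keep retrying as before.
    """
    is_forbidden = status.get("code") == 403 and status.get("reason") == "Forbidden"
    message = status.get("message", "")
    if is_forbidden and "exceeded quota" in message:
        return RoutingStatus(phase="Failed",
                             message=f"Cannot create routes: {message}")
    return None


# Hand-written sample error, not captured from a real cluster.
sample = {
    "kind": "Status",
    "status": "Failure",
    "reason": "Forbidden",
    "code": 403,
    "message": 'routes.route.openshift.io "example" is forbidden: '
               "exceeded quota: example-quota",
}
print(routing_status_for_error(sample))
```

Failing fast on this condition would let the workspace report the cause immediately instead of waiting for the startup timeout.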

@l0rd
Contributor Author

l0rd commented Mar 18, 2023

> There is also the DevWorkspaceOperatorConfig .config.workspace.cleanupOnStop setting, which configures DWO to delete owned resources of the DevWorkspace when it is stopped. This would delete the DevWorkspace's DevWorkspaceRouting CR, which should clean up any routes created for the workspace.

@amisevsk this is good to know, but why isn't it the default then? Does it have an impact on the restart time of the workspace?

@l0rd l0rd mentioned this issue Mar 22, 2023
50 tasks
@amisevsk
Contributor

When the setting was added, the default was set to the earlier behaviour as we didn't have bandwidth to test all edge cases -- if all resources are removed on stop, they need to be re-created when the workspace is restarted and this could have implications on scalability. For instance, this would require the Che gateway to un-expose and re-expose all workspace routes on every stop/start.

It also would have some impact on restart time, though the size of the effect is not clear. Enabling the setting means workspace starts require as much work as new workspaces on the cluster (everything needs to be re-created).

We can test enabling it in our development environments to see if it impacts availability, and make it a default in a future release if there are no issues.

@ibuziuk
Member

ibuziuk commented Apr 5, 2023

enabled on the dogfooding instance

config:
  workspace:
    cleanupOnStop: true
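
For anyone scripting this change, the same setting can be expressed as a JSON merge patch body. This is a small sketch: the helper name is hypothetical, and which DevWorkspaceOperatorConfig object to patch, and how, is left out because the thread does not specify it.

```python
import json


def cleanup_on_stop_patch(enabled: bool = True) -> str:
    """Return a JSON merge patch that sets .config.workspace.cleanupOnStop."""
    patch = {"config": {"workspace": {"cleanupOnStop": enabled}}}
    return json.dumps(patch)


print(cleanup_on_stop_patch())
# prints {"config": {"workspace": {"cleanupOnStop": true}}}
```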

@ibuziuk ibuziuk added the team/B This team is responsible for the Web Terminal, the DevWorkspace Operator and the IDEs. label Apr 19, 2023
@che-bot
Contributor

che-bot commented Oct 16, 2023

Issues go stale after 180 days of inactivity. lifecycle/stale issues rot after an additional 7 days of inactivity and eventually close.

Mark the issue as fresh with /remove-lifecycle stale in a new comment.

If this issue is safe to close now please do so.

Moderators: Add lifecycle/frozen label to avoid stale mode.

@che-bot che-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 16, 2023
@l0rd l0rd removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 16, 2023
@l0rd
Contributor Author

l0rd commented Oct 16, 2023

/remove-lifecycle stale

@ibuziuk ibuziuk added team/A This team is responsible for the Che Operator and all its operands as well as chectl and Hosted Che and removed team/B This team is responsible for the Web Terminal, the DevWorkspace Operator and the IDEs. labels Oct 17, 2023
@amisevsk
Contributor

I've opened a PR in the DWO repository that should fix this issue: devfile/devworkspace-operator#1199

However, since the Che Operator runs its own DevWorkspaceRouting reconciler, fixing this problem in Che will require the Che Operator to pull in a newer DevWorkspaceOperator dependency, so I'm leaving this issue open until that is completed.

@ibuziuk
Member

ibuziuk commented Jan 31, 2024

@amisevsk do you think we can close the issue?

@amisevsk
Contributor

@ibuziuk thanks for the reminder -- I've opened eclipse-che/che-operator#1805 to update the Che Operator to DevWorkspace v0.25.0, which will close this issue.
