
Che operator fails to reconcile che after failed installation of devworkspace #19243

Closed
metlos opened this issue Mar 9, 2021 · 8 comments
Labels
  • area/che-operator: Issues and PRs related to Eclipse Che Kubernetes Operator
  • kind/bug: Outline of a bug - must adhere to the bug report template.
  • severity/P1: Has a major impact to usage or development of the system.

Comments

@metlos
Contributor

metlos commented Mar 9, 2021

Describe the bug

I've had some issues during the installation of the che-operator with devworkspaces enabled and wanted to reinstall. I deleted the eclipse-che and devworkspace-controller namespaces (by running kubectl delete namespace ... and deleting the finalizers on resources where necessary).
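For reference, the cleanup was roughly along these lines (a sketch only, not the exact commands from the report; the concrete resources stuck on finalizers will differ per cluster):

# Sketch: delete the namespaces, then clear finalizers on anything stuck in Terminating
kubectl delete namespace eclipse-che devworkspace-controller --wait=false
kubectl patch checluster eclipse-che -n eclipse-che --type=merge \
  -p '{"metadata":{"finalizers":[]}}'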

I then wanted to install Che using chectl. It never succeeded, with these errors in the che-operator log:

time="2021-03-09T20:41:31Z" level=info msg="Running exec for 'create Keycloak DB, user, privileges' in the pod 'postgres-7d794f7b58-h2xkm'"
time="2021-03-09T20:41:31Z" level=error msg="Error running exec: Internal error occurred: failed calling webhook \"validate-exec.devworkspace-controller.svc\": Post \"https://devworkspace-webhookserver.devworkspace-controller.svc:443/validate?timeout=30s\": service \"devworkspace-webhookserver\" not found, command: [/bin/bash -c OUT=$(psql postgres -tAc \"SELECT 1 FROM pg_roles WHERE rolname='keycloak'\"); if [ $OUT -eq 1 ]; then echo \"DB exists\"; exit 0; fi && psql -c \"CREATE USER keycloak WITH PASSWORD 'is8NVHQ7r0GL'\" && psql -c \"CREATE DATABASE keycloak\" && psql -c \"GRANT ALL PRIVILEGES ON DATABASE keycloak TO keycloak\" && psql -c \"ALTER USER ${POSTGRESQL_USER} WITH SUPERUSER\"]"
time="2021-03-09T20:41:31Z" level=error msg="Stderr: "

Notice the reference to the devworkspace-webhookserver during the installation of the Postgres DB for the Che server.
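The webhook configuration behind that error is cluster-scoped, so it survives the namespace deletion; it can be spotted with something like the following (an assumed inspection step, not part of the original report):

# The 'validate-exec.devworkspace-controller.svc' webhook from the log lives in a
# cluster-scoped ValidatingWebhookConfiguration, which 'kubectl delete namespace' does not remove
kubectl get validatingwebhookconfigurations | grep devworkspace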

Because setting spec.devworkspace.enable: false does not actually uninstall the DevWorkspace operator from the cluster in any way, the user has no way of making the installation work again.

Note that I was able to make the installation work again by running make uninstall from the devworkspace-operator sources, but that is something we might not want users to have to do.

Che version

  • latest

Steps to reproduce

  1. Install Che using chectl server:deploy -p openshift -n eclipse-che -a operator
  2. kubectl edit checluster eclipse-che -n eclipse-che and set spec.devworkspace.enable: true
  3. Let the installation finish
  4. kubectl delete namespace eclipse-che devworkspace-controller
  5. Remove finalizers on resources blocking the deletion
  6. Wait for the namespaces to be deleted
  7. Try to install Che using chectl server:deploy -p openshift -n eclipse-che -a operator again

The installation never finishes, and the che-operator log shows the error described above.

Expected behavior

We should have a documented way of cleaning up the cluster so that repeated installations are possible.

Runtime

  • OpenShift 4.6

Screenshots

N/A

Installation method

  • see repro steps

Environment

  • RHPDS with OpenShift 4.7
@metlos metlos added the kind/bug, severity/P1, and area/che-operator labels on Mar 9, 2021
@amisevsk
Contributor

amisevsk commented Mar 9, 2021

The uninstall process for the DevWorkspaceOperator is complicated and manual (e.g. the WTO docs or makefile uninstall rule).

An improper uninstall will break pods/exec across the cluster (kind of by design, but mostly to fill a gap in OLM, which doesn't allow us to force removal of all CRDs when an operator is uninstalled).

@sleshchenko I seem to recall we couldn't set an ObjectSelector on the validating webhook (hence causing this issue), but I can't find anything in the docs forbidding it. Could we just add a labelSelector to our validating webhook to at least partially avoid this problem?
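A rough sketch of what that could look like (the webhook configuration name and label are placeholders, not the real DevWorkspace Operator names; as the follow-up comments explain, this does not actually help for pods/exec):

# Hypothetical: add an objectSelector so the validating webhook only fires for labelled objects
kubectl patch validatingwebhookconfiguration <devworkspace-webhook-config> --type=json \
  -p '[{"op":"add","path":"/webhooks/0/objectSelector","value":{"matchLabels":{"devworkspace.example/managed":"true"}}}]'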

@tolusha
Contributor

tolusha commented Mar 10, 2021

Switching spec.devworkspace.enable: false should not uninstall the DevWorkspace operator, since it might be used by another Eclipse Che deployment.
To remove the DevWorkspace operator you can try the chectl server:delete command.
Besides removing Eclipse Che, it removes the DevWorkspace operator only if there is no other instance of Eclipse Che on the cluster.
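For example (the namespace flag mirrors the repro steps above; exact flags may vary between chectl versions):

# Removes Eclipse Che; the DevWorkspace operator is removed only if no other Che instance remains
chectl server:delete -n eclipse-che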

@sleshchenko
Member

(by running kubectl delete namespace ... and deleting the finalizers on resources where necessary).

You went a very non-optimal way; in addition to deleting the finalizers, you also need to clean up all the cluster-scoped resources, of which the webhook configurations are the most critical.
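For illustration, the cluster-scoped cleanup meant here might look like this (resource names are placeholders and vary by release):

# Delete the leftover DevWorkspace webhook configurations (these are what break pods/exec)
kubectl delete validatingwebhookconfiguration <devworkspace-webhook-config>
kubectl delete mutatingwebhookconfiguration <devworkspace-webhook-config>
# CRDs, ClusterRoles and ClusterRoleBindings installed by the operator need the same treatment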

Could we just add a labelSelector to our validating webhook to at least partially avoid this problem?

Pod labels are not propagated to pods/exec subresource requests; here is an issue on the K8s side that should unblock us: kubernetes/kubernetes#91732
Maybe that has changed, but the issue is still open, so I doubt it.

@sleshchenko
Member

Switching spec.devworkspace.enable: false should not uninstall the DevWorkspace operator, since it might be used by another Eclipse Che deployment.

Yes and no. If it does not uninstall it, it must provide instructions on how to remove it properly.
I doubt that a production cluster is expected to have more than one Che/CRW installed.

If we care about development environments, I think the Che Operator could store somewhere the information about which CheClusters use the DevWorkspace Operator, and the last one could uninstall DWO and the DevWorkspace Che Operator. It could be a ConfigMap in the DWO namespace.
Or CheCluster could get some testing field that would allow disabling DWO uninstalling.
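A purely hypothetical sketch of that bookkeeping ConfigMap (all names invented for illustration):

# Each CheCluster registers itself here; the operator removing the last entry also uninstalls DWO
kubectl create configmap che-dwo-consumers -n devworkspace-controller \
  --from-literal=eclipse-che.eclipse-che=in-use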

@amisevsk
Contributor

Pod labels are not propagated to pods/exec subresource requests; here is an issue on the K8s side that should unblock us

That was the roadblock I was remembering :)

@che-bot
Contributor

che-bot commented Sep 7, 2021

Issues go stale after 180 days of inactivity. lifecycle/stale issues rot after an additional 7 days of inactivity and eventually close.

Mark the issue as fresh with /remove-lifecycle stale in a new comment.

If this issue is safe to close now please do so.

Moderators: Add lifecycle/frozen label to avoid stale mode.

@che-bot che-bot added the lifecycle/stale label on Sep 7, 2021
@mmorhun mmorhun removed the lifecycle/stale label on Sep 17, 2021
@mmorhun
Contributor

mmorhun commented Sep 17, 2021

Still relevant. I see the same error, which now prevents me from debugging the Che operator.

@tolusha
Contributor

tolusha commented May 13, 2022

We can now use chectl server:delete to remove Eclipse Che and clean up DevWorkspace resources.

@tolusha tolusha closed this as completed May 13, 2022