
Che-operator fails to manage che failing after database outage #20337

Closed
Tracked by #20404
guydog28 opened this issue Aug 19, 2021 · 9 comments
Labels: area/che-operator, area/doc, kind/enhancement, severity/P1

Comments

guydog28 commented Aug 19, 2021

Describe the bug

Our kubernetes cluster's primary purpose is to serve che as the development environment for our team. The cluster is managed by kops. As a cost-saving measure, our leadership has requested that the cluster be shut down completely overnight and on weekends.

When the cluster comes back online in the morning, cluster state is restored from etcd backups by etcd-manager. This means the pods that were running at shutdown are brought straight back up, so this time the operator isn't bringing them online and can't guarantee their startup order.

This results in a race condition where che comes up before postgres (and sometimes keycloak as well), leaving che non-functional every morning. To work around this we have a cronjob that kills the che pod 15 minutes after the cluster comes online so that it comes back up after postgres (sketched below).
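For context, a minimal sketch of that kind of workaround job; the schedule, namespace, label selector, and service account below are assumptions for illustration, not the actual cluster configuration:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: restart-che
  namespace: eclipse-che               # assumed namespace
spec:
  schedule: "15 6 * * 1-5"             # assumed: ~15 minutes after the weekday cluster start
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: che-restarter   # hypothetical service account allowed to delete pods here
          restartPolicy: OnFailure
          containers:
            - name: kubectl
              image: bitnami/kubectl
              # delete the che server pod so its deployment recreates it after postgres is up
              command: ['sh', '-c', 'kubectl delete pod -n eclipse-che -l app=che,component=che']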

This process has left the team with hard feelings toward che (not great for perceived reliability). I'm sure this use case isn't a typical one, but it is more common than you'd think.

I would request that the operator have more robust health checking for postgres, keycloak, and che. When one of them is failing (for example, che gave up connecting to postgres), restart them in the proper order to get everything back to a functional state. The purpose of an operator is to take over this sort of mundane but necessary task.

Che version

7.34@latest

Steps to reproduce

  1. create a cluster with kops
  2. deploy che via the operator
  3. terminate all machines simultaneously
  4. bring machines back online simultaneously (likely done automatically via an AWS ASG)
  5. when the cluster comes online, verify that che is non-functional

Expected behavior

The operator would better monitor postgres, keycloak, and che to detect issues, and restart them accordingly and in the proper order.

Runtime

Kubernetes (vanilla)

Screenshots

(Screenshot: Che server error page showing a failed request to /api/)

Installation method

chectl/latest

Environment

Linux

Eclipse Che Logs

No response

Additional context

No response

@guydog28 guydog28 added the kind/bug Outline of a bug - must adhere to the bug report template. label Aug 19, 2021
@che-bot che-bot added the status/need-triage An issue that needs to be prioritized by the curator responsible for the triage. See https://github. label Aug 19, 2021
@tolusha tolusha added sprint/next team/deploy severity/P1 Has a major impact to usage or development of the system. kind/enhancement A feature request - must adhere to the feature request template. and removed kind/bug Outline of a bug - must adhere to the bug report template. labels Aug 19, 2021

tolusha commented Aug 19, 2021

@guydog28
The request makes sense. We will look into it.


guydog28 commented Aug 20, 2021

It might not hurt to also put a postgres init container on the che and keycloak deployments that blocks them from starting until postgres is up. This also helps with non-operator-managed postgres, since the operator can't control startup order there.

        initContainers:
          - name: wait-for-postgres
            image: postgres
            # block the main container until postgres accepts connections
            command: ['sh', '-c', 'until pg_isready -h {pg url} -p {pg port}; do echo waiting for database; sleep 2; done']

Additionally, since an external postgres could go down at any time, there should be a livenessProbe on the Che deployment, maybe one that looks for a 200 response from the URL in the error screenshot (OPTIONS to /api/). If that fails, the container would restart and then wait on the initContainer above.
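A minimal sketch of what such a probe could look like on the che server container; note that a Kubernetes httpGet probe issues a GET rather than OPTIONS, and the path, port, and timings below are assumptions, not the operator's actual configuration:

livenessProbe:
  httpGet:
    path: /api/            # assumed endpoint, based on the error screenshot
    port: 8080             # assumed che server container port
  initialDelaySeconds: 60
  periodSeconds: 10
  failureThreshold: 3      # restart the pod after roughly 30s of consecutive failures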

@ericwill ericwill removed the status/need-triage An issue that needs to be prioritized by the curator responsible for the triage. See https://github. label Aug 23, 2021
@tolusha tolusha added the area/che-operator Issues and PRs related to Eclipse Che Kubernetes Operator label Sep 2, 2021
@tolusha tolusha mentioned this issue Sep 3, 2021
@tolusha tolusha added this to the 7.37 milestone Sep 8, 2021
@tolusha tolusha mentioned this issue Sep 27, 2021
@tolusha tolusha modified the milestones: 7.37, 7.38 Sep 28, 2021
@mmorhun mmorhun self-assigned this Oct 13, 2021

mmorhun commented Oct 13, 2021

Fixed by adding init containers. However, the feature is disabled by default. See the PR for how to enable the init containers.
Special thanks to @guydog28.

@mmorhun mmorhun closed this as completed Oct 13, 2021

tolusha commented Oct 18, 2021

Doc in progress...

@guydog28

@mmorhun what is the proper way to set this in the Che operator yaml?


mmorhun commented Nov 30, 2021

@guydog28 just add the ADD_COMPONENT_READINESS_INIT_CONTAINERS environment variable with the value true to the Che Operator deployment:

$ kubectl edit deployment che-operator -n eclipse-che

and add the following under env of the Operator container:

- name: ADD_COMPONENT_READINESS_INIT_CONTAINERS
  value: "true"

After that, the Che Operator pod should restart and add init containers to the Keycloak and Che Server deployments.
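Not from the original reply, but as a possible shortcut, the same variable could be set in one step with kubectl set env (assuming the default eclipse-che namespace), which likewise triggers an operator pod restart:

$ kubectl set env deployment/che-operator -n eclipse-che ADD_COMPONENT_READINESS_INIT_CONTAINERS=true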


guydog28 commented Dec 1, 2021

@mmorhun will this environment variable survive a chectl update? I was thinking there would be something on the CheCluster CRD that would create this env var on the operator deployment.
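For illustration only, the kind of CheCluster field being asked about might look something like this; the addComponentReadinessInitContainers name is hypothetical and does not exist in the CRD at the time of this discussion:

apiVersion: org.eclipse.che/v1
kind: CheCluster
metadata:
  name: eclipse-che
spec:
  server:
    # hypothetical field, not part of the actual CheCluster API
    addComponentReadinessInitContainers: true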


guydog28 commented Dec 1, 2021

also, @mmorhun, this solves one of the issues (waiting for postgres/keycloak on start with an init container) but does not add liveness probes to che for the case where one of those goes down later, after a successful start. With a livenessProbe on the che deployment, it would see that che is in an error state from postgres going down and terminate the pod; then, when the deployment creates a new pod, the initContainer you created would block the new pod from starting until postgres comes back online.


mmorhun commented Dec 2, 2021

@guydog28 I am +1 for a field in the CR, but there are other opinions (cc @tolusha please explain your thoughts).
