
Che-operator fails to manage che failing after database outage #20337

Closed
Tracked by #20404
guydog28 opened this issue Aug 19, 2021 · 9 comments
Labels: area/che-operator, area/doc, kind/enhancement, severity/P1

Comments

guydog28 commented Aug 19, 2021

Describe the bug

Our kubernetes cluster's primary purpose is to serve che as the development environment for our team. The cluster is managed by kops. As a cost-saving measure, our leadership has requested that the cluster be shut down completely overnight and on weekends.

When the cluster comes back online in the morning, cluster state is restored from etcd backups by etcd-manager. This means the pods that were running at shutdown are brought straight back up, so this time the operator isn't bringing them online and can't guarantee their startup order.

This results in a race condition where che comes up before postgres (and sometimes keycloak as well), leaving che non-functional every morning. To work around this we have a cronjob that kills the che pod 15 minutes after the cluster comes online so that it comes back up after postgres (sketched below).
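For context, a minimal sketch of that kind of workaround job; the schedule, namespace, label selector, and service account below are assumptions for illustration, not the actual cluster configuration:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: restart-che
  namespace: eclipse-che               # assumed namespace
spec:
  schedule: "15 6 * * 1-5"             # assumed: ~15 minutes after the weekday cluster start
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: che-restarter   # hypothetical service account allowed to delete pods here
          restartPolicy: OnFailure
          containers:
            - name: kubectl
              image: bitnami/kubectl
              # delete the che server pod so its deployment recreates it after postgres is up
              command: ['sh', '-c', 'kubectl delete pod -n eclipse-che -l app=che,component=che']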

This process has left the team with hard feelings toward che (not great for perceived reliability). I'm sure this use case isn't a typical one, but it is more common than you'd think.

I would request that the operator have more robust health checking for postgres, keycloak, and che. When one of them is failing (for example, che gave up connecting to postgres), restart them in the proper order to get everything back to a functional state. The purpose of an operator is to take over this sort of mundane but necessary task.

Che version

7.34@latest

Steps to reproduce

  1. create a cluster with kops
  2. deploy che via the operator
  3. terminate all machines simultaneously
  4. bring machines back online simultaneously (likely done automatically via an AWS ASG)
  5. when the cluster comes online, verify that che is non-functional

Expected behavior

The operator would better monitor postgres, keycloak, and che to detect issues, and restart them accordingly and in the proper order.

Runtime

Kubernetes (vanilla)

Screenshots

(Screenshot: Che server error page showing a failed request to /api/)

Installation method

chectl/latest

Environment

Linux

Eclipse Che Logs

No response

Additional context

No response

@guydog28 guydog28 added the kind/bug Outline of a bug - must adhere to the bug report template. label Aug 19, 2021
@che-bot che-bot added the status/need-triage An issue that needs to be prioritized by the curator responsible for the triage. See https://github. label Aug 19, 2021
@tolusha tolusha added sprint/next team/deploy severity/P1 Has a major impact to usage or development of the system. kind/enhancement A feature request - must adhere to the feature request template. and removed kind/bug Outline of a bug - must adhere to the bug report template. labels Aug 19, 2021

tolusha commented Aug 19, 2021

@guydog28
The request makes sense. We will look into it.


guydog28 commented Aug 20, 2021

It might not hurt to also put a postgres init container on the che and keycloak deployments that blocks them from starting until postgres is up. This also helps with non-operator-managed postgres, since the operator can't control startup order there.

        initContainers:
          - name: wait-for-postgres
            image: postgres
            # block the main container until postgres accepts connections
            command: ['sh', '-c', 'until pg_isready -h {pg url} -p {pg port}; do echo waiting for database; sleep 2; done']

Additionally, since an external postgres could go down at any time, there should be a livenessProbe on the Che deployment, maybe one that looks for a 200 response from the URL in the error screenshot (OPTIONS to /api/). If that fails, the container would restart and then wait on the initContainer above.
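A minimal sketch of what such a probe could look like on the che server container; note that a Kubernetes httpGet probe issues a GET rather than OPTIONS, and the path, port, and timings below are assumptions, not the operator's actual configuration:

livenessProbe:
  httpGet:
    path: /api/            # assumed endpoint, based on the error screenshot
    port: 8080             # assumed che server container port
  initialDelaySeconds: 60
  periodSeconds: 10
  failureThreshold: 3      # restart the pod after roughly 30s of consecutive failures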

@ericwill ericwill removed the status/need-triage An issue that needs to be prioritized by the curator responsible for the triage. See https://github. label Aug 23, 2021
@tolusha tolusha added the area/che-operator Issues and PRs related to Eclipse Che Kubernetes Operator label Sep 2, 2021
@tolusha tolusha mentioned this issue Sep 3, 2021
@tolusha tolusha added this to the 7.37 milestone Sep 8, 2021
@tolusha tolusha mentioned this issue Sep 27, 2021
@tolusha tolusha modified the milestones: 7.37, 7.38 Sep 28, 2021
@mmorhun mmorhun self-assigned this Oct 13, 2021

mmorhun commented Oct 13, 2021

Fixed by adding init containers. However, the feature is disabled by default. See the PR for how to enable the init containers.
Special thanks to @guydog28.

@mmorhun mmorhun closed this as completed Oct 13, 2021

tolusha commented Oct 18, 2021

Doc in progress...

@guydog28

@mmorhun what is the proper way to set this in the Che operator yaml?


mmorhun commented Nov 30, 2021

@guydog28 just add the ADD_COMPONENT_READINESS_INIT_CONTAINERS environment variable with the value true to the Che Operator deployment:

$ kubectl edit deployment che-operator -n eclipse-che

and add the following under env of the Operator container:

- name: ADD_COMPONENT_READINESS_INIT_CONTAINERS
  value: "true"

After that, the Che Operator pod should restart and add init containers to the Keycloak and Che Server deployments.
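Not from the original reply, but as a possible shortcut, the same variable could be set in one step with kubectl set env (assuming the default eclipse-che namespace), which likewise triggers an operator pod restart:

$ kubectl set env deployment/che-operator -n eclipse-che ADD_COMPONENT_READINESS_INIT_CONTAINERS=true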


guydog28 commented Dec 1, 2021

@mmorhun will this environment variable survive a chectl update? I was thinking there would be something on the CheCluster CRD that would create this env var on the operator deployment.
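For illustration only, the kind of CheCluster field being asked about might look something like this; the addComponentReadinessInitContainers name is hypothetical and does not exist in the CRD at the time of this discussion:

apiVersion: org.eclipse.che/v1
kind: CheCluster
metadata:
  name: eclipse-che
spec:
  server:
    # hypothetical field, not part of the actual CheCluster API
    addComponentReadinessInitContainers: true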


guydog28 commented Dec 1, 2021

also, @mmorhun, this solves one of the issues (waiting for postgres/keycloak on start with an init container) but does not add liveness probes to che for the case where one of those goes down later, after a successful start. With a livenessProbe on the che deployment, it would see that che is in an error state from postgres going down and terminate the pod; then, when the deployment creates a new pod, the initContainer you created would block the new pod from starting until postgres comes back online.


mmorhun commented Dec 2, 2021

@guydog28 I am +1 for a field in the CR, but there are other opinions (cc @tolusha please explain your thoughts).
