Avoid reporting outdated ES health on reconciliation error that prevents getting the real one #5349

thbkrkr · 2022-02-09T14:08:54Z

The first two commits are more of an optimization:

Split HTTP and transport certs reconciliation in order to be able to retrieve the cluster health despite an issue with the transport certs.
Start the observer as soon as possible.
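To illustrate why the ordering matters, here is a minimal sketch, not the operator's actual code. The names `reconcile`, `ok`, `broken` and `observeGreen` are hypothetical stand-ins, and the sketch assumes, as described under Testing, that the observer only depends on the HTTP service:

```go
package main

import (
	"errors"
	"fmt"
)

// Health is a simplified stand-in for the ES health stored in the resource status.
type Health string

const (
	Unknown Health = "unknown"
	Green   Health = "green"
)

// Hypothetical reconciliation steps; the real operator reconciles Secrets and Services.
func ok() error     { return nil }
func broken() error { return errors.New("CA secret not found") }

// observeGreen is a hypothetical observer that always sees a green cluster.
func observeGreen() Health { return Green }

// reconcile runs the HTTP certs step, then observes the health, and only then
// runs the transport certs step, so a transport failure cannot hide the health.
func reconcile(httpCerts, transportCerts func() error, observe func() Health) (Health, error) {
	health := Unknown
	if err := httpCerts(); err != nil {
		// The observer needs the HTTP service, so the health stays unknown.
		return health, err
	}
	health = observe()
	if err := transportCerts(); err != nil {
		// The health was already observed and can still be reported.
		return health, err
	}
	return health, nil
}

func main() {
	h, err := reconcile(ok, broken, observeGreen)
	fmt.Println(h, err) // green CA secret not found
	h, err = reconcile(broken, ok, observeGreen)
	fmt.Println(h, err) // unknown CA secret not found
}
```

With a single combined certs step, the first case would return before the observer had a chance to run.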

The commit after those could be enough on its own:

Initialize health state to unknown in the NewState constructor
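A minimal sketch of the idea, using simplified types rather than the operator's real ones (`Status`, `State` and `UpdateHealth` here are hypothetical and only mirror the shape of the change): the constructor carries over the previous status but resets the health, so a reconciliation that fails early reports unknown instead of a stale value.

```go
package main

import "fmt"

// Health is a simplified stand-in for the ES health stored in the resource status.
type Health string

const (
	UnknownHealth Health = "unknown"
	GreenHealth   Health = "green"
)

// Status is a simplified stand-in for the Elasticsearch status subresource.
type Status struct {
	Health Health
	Phase  string
}

// State is a hypothetical reconciliation state that accumulates the status to report.
type State struct {
	status Status
}

// NewState copies the previous status but resets the health to unknown, so an
// early reconciliation error cannot leave an outdated health in place.
func NewState(previous Status) *State {
	s := &State{status: previous}
	s.status.Health = UnknownHealth
	return s
}

// UpdateHealth records a health value that was actually observed.
func (s *State) UpdateHealth(h Health) {
	s.status.Health = h
}

func main() {
	s := NewState(Status{Health: GreenHealth, Phase: "Ready"})
	fmt.Println(s.status.Health) // unknown
	s.UpdateHealth(GreenHealth)
	fmt.Println(s.status.Health) // green
}
```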

Relates to #5330.

Testing

Create an ES cluster

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: eight
spec:
  version: 8.0.0
  # http:
  # transport:
  #   tls:
  #     certificate:
  #       secretName: ca-that-does-not-exist
  nodeSets:
  - name: master
    count: 1
    config:
      node.store.allow_mmap: false

Make cluster health yellow

# create an index with 1 replica
-XPUT /my-index-1 -d '{"settings":{"number_of_shards":1,"number_of_replicas":1}}'

Break the transport service by configuring a CA Secret that doesn't exist in the cluster

Uncomment transport.tls.certificate.secretName and reapply.

Make cluster health green again

# delete the index
-XDELETE /my-index-1

Cluster should be reported as green and not stuck in the yellow state, with phase ApplyingChanges.
Do the same by breaking the http service. This time the operator will report an unknown state as the observer depends on this service, still in phase ApplyingChanges.

mtparet · 2022-02-10T12:46:42Z

Quick question: does it mean the operator will also set the status to unknown when we exclude an Elasticsearch resource from being managed by the operator using https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-upgrading-eck.html#k8s-beta-to-ga-rolling-restart ?

Because today, if we temporarily exclude a resource, the reported status is kept even though it does not reflect reality.

thbkrkr · 2022-02-10T16:45:55Z

does it mean the operator will also set the status to unknown when we exclude an Elasticsearch resource from being managed by the operator

No, it is independent.

Because today, if we temporarily exclude a resource, the reported status is kept even though it does not reflect reality.

Could you elaborate on that (maybe in #5330 or a dedicated issue)? What part of the status isn't updated?

mtparet · 2022-02-10T16:54:57Z

Could you elaborate on that (maybe in #5330 or a dedicated issue)? What part of the status isn't updated?

I think it's a different issue.
If we set the annotation eck.k8s.elastic.co/managed=false on a resource, the operator keeps the previous status (say "green") and no longer updates it, whatever happens to the Elasticsearch cluster.
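A minimal sketch of that behavior, assuming the operator simply skips reconciliation for unmanaged resources (`Resource`, `isUnmanaged` and `reconcile` are hypothetical names used for illustration, only the annotation key comes from this thread):

```go
package main

import "fmt"

// managedAnnotation is the annotation mentioned above.
const managedAnnotation = "eck.k8s.elastic.co/managed"

// Resource is a simplified stand-in for an Elasticsearch custom resource.
type Resource struct {
	Annotations  map[string]string
	StatusHealth string
}

// isUnmanaged reports whether the resource is excluded from reconciliation.
func isUnmanaged(r *Resource) bool {
	return r.Annotations[managedAnnotation] == "false"
}

// reconcile updates the status only for managed resources, which is why the
// last reported health is frozen once the annotation is set.
func reconcile(r *Resource, observed string) {
	if isUnmanaged(r) {
		return // status left untouched
	}
	r.StatusHealth = observed
}

func main() {
	es := &Resource{Annotations: map[string]string{}, StatusHealth: "unknown"}
	reconcile(es, "green")
	es.Annotations[managedAnnotation] = "false"
	reconcile(es, "red")         // the cluster degraded, but the resource is unmanaged
	fmt.Println(es.StatusHealth) // green
}
```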

thbkrkr · 2022-02-15T11:23:35Z

I realized that we can reset the State with an empty ResourcesState, that is, before the ResourcesState is created and before we have reconciled the services on which it depends. c370d1e

pebrc

I am not sure the split in certificate reconciliation is worth the effort (but I am also not opposed to it).

Regarding the reset to unknown I think we could instead just initialise the reconciliation state with unknown health in the reconcile state constructor function.

Another possible improvement could be to update the status sub resource even if the cluster is unmanaged but that should probably be discussed in an issue first.

pkg/controller/elasticsearch/driver/driver.go

thbkrkr · 2022-02-15T18:01:03Z

I am not sure the split in certificate reconciliation is worth the effort (but I am also not opposed to it).

I discussed with barkbay and he is rather in favor of it but will share his opinion after verifying that it goes well with #5328.

barkbay

LGTM, I think I'm fine with the proposal of having certificates reconciled in a "two-phase" approach. Also, it should be easy to merge with #5328.

👍 to create an issue to discuss whether or not the status subresource should be updated if the resource itself is not supposed to be managed.

pkg/controller/elasticsearch/reconcile/state.go

pkg/controller/elasticsearch/certificates/reconcile.go

pkg/controller/common/certificates/reconcile.go

The Elasticsearch health reported by the Operator in the Elasticsearch resource status subresource may never be updated if the Operator encounters an error during the reconciliation loop. This commit improves that by:
* Initializing the ES health to 'unknown' in the NewState constructor, so that we stop reporting a health that may be out of date
* Splitting HTTP and transport certs reconciliation in order to be able to retrieve the ES health despite an issue with the transport certs
* Starting the observer as soon as possible and then updating the ES state with the latest state

Split http and transport certs reconciliation

6b9f1af

thbkrkr added the >enhancement Enhancement of existing functionality label Feb 9, 2022

Start the observer as soon as possible

3112e3c

thbkrkr force-pushed the improve-status-health-update branch from ab96f7b to 3112e3c February 9, 2022 21:29

Reset ES health state as soon as possible

c370d1e

pebrc reviewed Feb 15, 2022

pkg/controller/elasticsearch/driver/driver.go (outdated)

thbkrkr added 2 commits February 15, 2022 16:16

Initialize health state to unknown in the NewState constructor

919bf5f

Adjust unit test

0027a17

barkbay approved these changes Feb 16, 2022

thbkrkr mentioned this pull request Feb 16, 2022

Update the status of unmanaged resource #5389

Open

thbkrkr added 2 commits February 16, 2022 18:19

Remove dead code

2680f80

Revert span deletion in ReconcileCAAndHTTPCerts

3ca36c4

thbkrkr added the v2.1.0 label Feb 16, 2022

thbkrkr merged commit dd31f06 into elastic:main Feb 16, 2022

thbkrkr changed the title ~~Improve ES status health reporting~~ Avoid reporting outdated ES health on reconciliation error that prevents getting the real one Feb 17, 2022

thbkrkr added >bug Something isn't working and removed >enhancement Enhancement of existing functionality labels Feb 17, 2022

thbkrkr deleted the improve-status-health-update branch March 22, 2022 16:36

thbkrkr mentioned this pull request Sep 26, 2022

ES status health reported by the operator not updated due to a reconciliation error #5330

Closed
