Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid reporting outdated ES health on reconciliation error that prevents getting the real one #5349

Merged
merged 7 commits into from
Feb 16, 2022

Conversation

thbkrkr
Copy link
Contributor

@thbkrkr thbkrkr commented Feb 9, 2022

2 first commits are more an optimization:

  • Split HTTP and transport certs reconciliation in order to be able to retrieve the cluster health despite an issue with the transport certs.
  • Start the observer and start it as soon as possible.

Those after could be enough:

  • Initialize health state to unknown in the NewState constructor

Relates to #5330.

Testing

  • Create an ES cluster
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: eight
spec:
  version: 8.0.0
  # http:
  # transport:
  #   tls:
  #     certificate:
  #       secretName: ca-that-does-not-exist
  nodeSets:
  - name: master
    count: 1
    config:
      node.store.allow_mmap: false
  • Make cluster health yellow
# create an index with 1 replica
-XPUT /my-index-1 -d '{"settings":{"number_of_shards":1,"number_of_replicas":1}}'
  • Break the transport service by configuring a CA Secret that doesn't exist in the cluster

Uncomment transport.tls.certificate.secretName and reapply.

  • Make cluster health green Again
# delete the index
-XDELETE /my-index-1
  • Cluster should be reported as green and not stuck in the yellow state, with phase ApplyingChanges.

  • Do the same by breaking the http service. This time the operator will report an unknown state as the observer depends on this service, still in phase ApplyingChanges.

@thbkrkr thbkrkr added the >enhancement Enhancement of existing functionality label Feb 9, 2022
@thbkrkr thbkrkr force-pushed the improve-status-health-update branch from ab96f7b to 3112e3c Compare February 9, 2022 21:29
@mtparet
Copy link

mtparet commented Feb 10, 2022

Quick question, does it means the operator will also set the status to unknown when we excluding an elasticsearch from being managed by the operator using https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-upgrading-eck.html#k8s-beta-to-ga-rolling-restart ?

Because today, if we exclude temporary a ressources, the status is reported is kept also it does not reflect the reality.

@thbkrkr
Copy link
Contributor Author

thbkrkr commented Feb 10, 2022

does it means the operator will also set the status to unknown when we excluding an elasticsearch from being managed by the operator

No, it is independent.

Because today, if we exclude temporary a ressources, the status is reported is kept also it does not reflect the reality.

Could you elaborate on that (maybe in #5330 or a dedicated issue)? What part of the status isn't updated?

@mtparet
Copy link

mtparet commented Feb 10, 2022

Could you elaborate on that (maybe in #5330 or a dedicated issue)? What part of the status isn't updated?

I think it's a different issue.
If we set the annotation eck.k8s.elastic.co/managed=false to a resource, the operator will keep the status as previously (say "green" for example) although it does not update anymore the status whatever happens to the elasticsearch cluster.

@thbkrkr
Copy link
Contributor Author

thbkrkr commented Feb 15, 2022

I realized that we can reset the State with an empty ResourcesState, so before the ResourcesState is created and before we have reconciled the services on which it depends on. c370d1e

Copy link
Collaborator

@pebrc pebrc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure the split in certificate reconciliation is worth the effort (but I am also not opposed to it).

Regarding the reset to unknown I think we could instead just initialise the reconciliation state with unknown health in the reconcile state constructor function.

Another possible improvement could be to update the status sub resource even if the cluster is unmanaged but that should probably be discussed in an issue first.

pkg/controller/elasticsearch/driver/driver.go Outdated Show resolved Hide resolved
@thbkrkr
Copy link
Contributor Author

thbkrkr commented Feb 15, 2022

I am not sure the split in certificate reconciliation is worth the effort (but I am also not opposed to it).

I discussed with barkbay and he is rather in favor of it but will share his opinion after verifying that it goes well with #5328.

Copy link
Contributor

@barkbay barkbay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I think I'm fine with the proposal of having certificates reconciled in a "two phases" approach. Also it should be easy to merge it with #5328

👍 to create an issue to discuss whether or not the status subresource should be updated if the resource itself is not supposed to be managed.

pkg/controller/elasticsearch/reconcile/state.go Outdated Show resolved Hide resolved
pkg/controller/elasticsearch/reconcile/state.go Outdated Show resolved Hide resolved
pkg/controller/elasticsearch/certificates/reconcile.go Outdated Show resolved Hide resolved
pkg/controller/elasticsearch/certificates/reconcile.go Outdated Show resolved Hide resolved
pkg/controller/common/certificates/reconcile.go Outdated Show resolved Hide resolved
pkg/controller/elasticsearch/certificates/reconcile.go Outdated Show resolved Hide resolved
pkg/controller/elasticsearch/certificates/reconcile.go Outdated Show resolved Hide resolved
pkg/controller/elasticsearch/certificates/reconcile.go Outdated Show resolved Hide resolved
@thbkrkr thbkrkr added the v2.1.0 label Feb 16, 2022
@thbkrkr thbkrkr merged commit dd31f06 into elastic:main Feb 16, 2022
@thbkrkr thbkrkr changed the title Improve ES status health reporting Avoid reporting outdated ES health on reconciliation error that prevents getting the real one Feb 17, 2022
@thbkrkr thbkrkr added >bug Something isn't working and removed >enhancement Enhancement of existing functionality labels Feb 17, 2022
@thbkrkr thbkrkr deleted the improve-status-health-update branch March 22, 2022 16:36
fantapsody pushed a commit to fantapsody/cloud-on-k8s that referenced this pull request Feb 7, 2023
The Elasticsearch health reported by the Operator in the Elasticsearch resource status subresource may never be updated if the Operator encounters an error during the reconciliation loop. This commits improves that by:
* Initializing the ES health to 'unknown' in the NewState constructor, so that we stop to report a health that may be out of date
* Splitting HTTP and transport certs reconciliation in order to be able to retrieve the ES health despite an issue with the transport certs
* Starting the observer as soon as possible and then updating the ES state with the latest state
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug Something isn't working v2.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants