Upgrading from ECK 2.4.0 to latest version may fail #5979

Closed
barkbay opened this issue Aug 29, 2022 · 4 comments · Fixed by #6037
Assignees
Labels
>bug Something isn't working v2.5.0

Comments

@barkbay
Contributor

barkbay commented Aug 29, 2022

Got this error for one of my Elasticsearch clusters while upgrading from 2.4.0 to main:

400 Bad Request: {Status:400 Error:{CausedBy:{Reason: Type:}
Reason:
Desired nodes with history [d7bf9e8e-47a0-40ad-8156-400ae519eb6c] and version [2] already exists with a different definition

I think (not 💯 sure yet) this is because of #5950: the Elasticsearch configuration has changed, but the metadata.generation of the Elasticsearch resource, used as the version field for the desired nodes API, is still the same.
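A toy model may make the suspected failure mode easier to see. The sketch below is not the Elasticsearch client or the operator code: `FakeDesiredNodesStore`, the shortened history ID, and the example settings are all hypothetical. It only mimics the one rule visible in the error message above, namely that the same history and version cannot be resubmitted with a different definition.

```python
class DesiredNodesConflict(Exception):
    """Stands in for the 400 Bad Request returned by Elasticsearch."""


class FakeDesiredNodesStore:
    """Toy in-memory stand-in for the desired nodes API (hypothetical)."""

    def __init__(self):
        self.history_id = None
        self.version = None
        self.definition = None

    def put(self, history_id, version, definition):
        same_key = (history_id, version) == (self.history_id, self.version)
        if same_key and definition != self.definition:
            # Same history and version, different content: rejected.
            raise DesiredNodesConflict(
                f"Desired nodes with history [{history_id}] and version "
                f"[{version}] already exists with a different definition"
            )
        self.history_id = history_id
        self.version = version
        self.definition = definition


store = FakeDesiredNodesStore()
generation = 2  # metadata.generation, unchanged across the operator upgrade

# The old operator submits the topology it rendered (settings are made up).
store.put("d7bf9e8e", generation, {"settings": {"node.store.allow_mmap": False}})

# The upgraded operator renders a different configuration for the same
# generation, so the version is reused with a different definition.
try:
    store.put(
        "d7bf9e8e",
        generation,
        {"settings": {"node.store.allow_mmap": False, "some.new.setting": "x"}},
    )
    conflict = False
except DesiredNodesConflict as err:
    conflict = True
    print(err)
```

In this model, resubmitting an identical definition with the same version is accepted, which is why the problem would only show up when the rendered configuration changes without a generation bump.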

I wonder if we should have an "upgrade" e2e pipeline, something like our upgrade-test-harness, that should be run automatically when submitting a PR in order to detect this kind of issue.

@pebrc
Collaborator

pebrc commented Sep 21, 2022

Similar problem in #6027: there it is a PVC resize that leads to multiple updates with an unchanging spec.

@pebrc
Collaborator

pebrc commented Sep 22, 2022

We discussed potential solutions and @barkbay suggested two approaches:

  • disabling desired_nodes for the next release
  • switching to a conditional PUT after GET approach:
  1. GET the _latest desired nodes topology from Elasticsearch and compare it with the expected desired nodes
  2. If the topologies are the same, stop
  3. If they differ, take the version returned from the GET call, increment it, and PUT the new topology
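The conditional PUT after GET steps could be sketched roughly as follows. `get_latest` and `put` are hypothetical callables standing in for the Elasticsearch calls, and starting at version 1 when no desired nodes exist yet is an assumption that the discussion does not cover.

```python
def reconcile_desired_nodes(get_latest, put, history_id, expected_nodes):
    """GET, compare, and only PUT with an incremented version on a difference.

    get_latest() returns (version, nodes) or None when nothing is stored.
    put(history_id, version, nodes) submits a topology.
    Both are hypothetical callables, not a real client.
    Returns the version that was PUT, or None if nothing was sent.
    """
    latest = get_latest()
    if latest is None:
        # Assumption: start at version 1 when no desired nodes exist yet.
        put(history_id, 1, expected_nodes)
        return 1
    version, nodes = latest
    if nodes == expected_nodes:
        return None  # step 2: same topology, stop
    new_version = version + 1  # step 3: increment the version from the GET
    put(history_id, new_version, expected_nodes)
    return new_version


# In-memory stand-in for Elasticsearch, for illustration only.
state = {"latest": None}


def fake_get_latest():
    return state["latest"]


def fake_put(history_id, version, nodes):
    state["latest"] = (version, nodes)


topology = [{"memory": "3gb", "storage": "4gb"}]
first = reconcile_desired_nodes(fake_get_latest, fake_put, "history-1", topology)
second = reconcile_desired_nodes(fake_get_latest, fake_put, "history-1", topology)
third = reconcile_desired_nodes(
    fake_get_latest, fake_put, "history-1", [{"memory": "4gb", "storage": "4gb"}]
)
```

The plain `==` comparison in step 2 is the part that depends on both sides using the same representation of the topology.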

@pebrc
Collaborator

pebrc commented Sep 23, 2022

Comparing the _latest desired nodes returned via the Elasticsearch API with the expected values turns out to be trickier than I thought:

{"service.version": "2.5.0-SNAPSHOT+dff6b534", "iteration": "1", "namespace": "default", "es_name": "autoscaling-sample", "diff": ["slice[0].Settings.map[node].map[store].map[allow_mmap]: string != bool", "slice[0].Settings.map[xpack].map[security].map[http].map[ssl].map[enabled]: string != bool", "slice[0].Settings.map[xpack].map[security].map[authc].map[realms].map[native].map[native1].map[order]: string != int64", "slice[0].Settings.map[xpack].map[security].map[authc].map[realms].map[file].map[file1].map[order]: string != int64", "slice[0].Memory: 3gb != 3221225472b", "slice[0].Storage: 4gb != 4294967296b", "slice[1].Settings.map[xpack].map[security].map[http].map[ssl].map[enabled]: string != bool", "slice[1].Settings.map[xpack].map[security].map[authc].map[realms].map[native].map[native1].map[order]: string != int64", "slice[1].Settings.map[xpack].map[security].map[authc].map[realms].map[file].map[file1].map[order]: string != int64", "slice[1].Settings.map[node].map[store].map[allow_mmap]: string != bool"]}

Elasticsearch transforms the submitted data: it stringifies all the booleans and integers, and it also converts the resource units to the largest applicable unit, i.e. instead of bytes of memory it returns gigabytes.

My fear is that implementing a comparison after mirroring the same transformations might be brittle, and I am thinking we should just stick to the current approach of updating at each reconciliation with an incremented version number.
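To give a feel for what mirroring these transformations would involve, here is a minimal sketch that normalizes both sides before comparing. The field names (`settings`, `memory`, `storage`), the lowercase `"true"`/`"false"` rendering of booleans, and the supported unit suffixes are assumptions made for illustration. Fractional sizes such as `1.5gb` are deliberately not handled, which hints at the brittleness mentioned above.

```python
import re

# Binary multipliers, consistent with 3gb == 3221225472b in the diff above.
_UNITS = {"b": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3, "tb": 1024**4}


def to_bytes(size):
    """Convert a size string such as '3gb' or '3221225472b' to bytes."""
    match = re.fullmatch(r"(\d+)(b|kb|mb|gb|tb)", size.strip().lower())
    if match is None:
        raise ValueError(f"unsupported size: {size!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def stringify(value):
    """Recursively turn booleans and numbers into strings."""
    if isinstance(value, dict):
        return {key: stringify(val) for key, val in value.items()}
    if isinstance(value, list):
        return [stringify(val) for val in value]
    if isinstance(value, bool):
        return "true" if value else "false"  # assumed rendering
    return str(value)


def normalize(node):
    """Bring a desired node into a comparable form (hypothetical field names)."""
    return {
        "settings": stringify(node["settings"]),
        "memory": to_bytes(node["memory"]),
        "storage": to_bytes(node["storage"]),
    }


submitted = {
    "settings": {"node": {"store": {"allow_mmap": False}}, "order": 1},
    "memory": "3221225472b",
    "storage": "4294967296b",
}
returned = {
    "settings": {"node": {"store": {"allow_mmap": "false"}}, "order": "1"},
    "memory": "3gb",
    "storage": "4gb",
}
same = normalize(submitted) == normalize(returned)
```

Every transformation Elasticsearch applies would need a counterpart here, and any future change on the Elasticsearch side could silently break the comparison.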

@pebrc
Collaborator

pebrc commented Sep 23, 2022

I am thinking about ways to optimise this, but it is quite involved. One idea follows below.

First iteration:

  1. PUT the desired nodes topology and calculate the hash of the submitted request payload
  2. Store it in an annotation together with the version, for example in the orchestration hints annotation

Subsequent iterations:

  1. GET the _latest desired nodes topology from Elasticsearch and read the stored version and hash from the annotation
  2. Calculate the hash of the expected desired nodes and compare hash and version. If the hash and version are the same, stop
  3. If they differ, take the version returned from the GET call, increment it, and PUT the new topology
  4. Update the orchestration hints annotation with the new version and hash

This would address the following concerns:

  • reduce the number of updates to the Elasticsearch API with identical topologies
  • handle the case where a third party changes or deletes the desired nodes, by doing a GET before the PUT
  • in steady state no updates are posted to Elasticsearch (e.g. if reconciliation is triggered by a cache refresh, an operator restart, or spec changes that have no relevance for the desired nodes API)

This comes with the downside of additional complexity and annotation updates to the ES resource.
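A rough sketch of the hash-based flow is shown below. The annotation is modelled as a plain dict with `version` and `hash` keys, `get_latest_version` and `put` are hypothetical callables, and SHA-256 over canonical JSON is just one possible hash choice; none of these come from the operator code.

```python
import hashlib
import json


def payload_hash(nodes):
    """Hash the request payload in a key-order independent way."""
    canonical = json.dumps(nodes, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def reconcile_with_hash(get_latest_version, put, annotation, history_id, expected):
    """Return True if a PUT was issued, False if nothing needed to change.

    get_latest_version() returns the version from the GET call, or None if
    the desired nodes were never set or were deleted (hypothetical callable).
    annotation is a dict standing in for the orchestration hints annotation.
    """
    latest_version = get_latest_version()
    expected_hash = payload_hash(expected)
    unchanged = (
        latest_version is not None
        and annotation.get("version") == latest_version
        and annotation.get("hash") == expected_hash
    )
    if unchanged:
        return False  # steady state: no update posted
    new_version = (latest_version or 0) + 1
    put(history_id, new_version, expected)
    annotation["version"] = new_version
    annotation["hash"] = expected_hash
    return True


# In-memory stand-in for Elasticsearch, for illustration only.
es_state = {"version": None}


def fake_get_latest_version():
    return es_state["version"]


def fake_put(history_id, version, nodes):
    es_state["version"] = version


hints = {}
nodes = [{"settings": {"node.store.allow_mmap": False}, "memory": "3221225472b"}]
first = reconcile_with_hash(fake_get_latest_version, fake_put, hints, "h1", nodes)
second = reconcile_with_hash(fake_get_latest_version, fake_put, hints, "h1", nodes)

es_state["version"] = 5  # a third party updated the desired nodes
third = reconcile_with_hash(fake_get_latest_version, fake_put, hints, "h1", nodes)
```

Because the hash is computed on the payload the operator submits, this avoids having to mirror the transformations Elasticsearch applies to the returned topology.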


2 participants