Upgrading from ECK 2.4.0 to latest version may fail #5979

barkbay · 2022-08-29T15:31:02Z

Got this error for one of my Elasticsearch clusters while upgrading from 2.4.0 to main:

400 Bad Request: {Status:400 Error:{CausedBy:{Reason: Type:}
Reason:
Desired nodes with history [d7bf9e8e-47a0-40ad-8156-400ae519eb6c] and version [2] already exists with a different definition

I think (not 💯 sure yet) this is because of #5950: the Elasticsearch configuration has changed, but the metadata.generation of the Elasticsearch resource, used as the version field for the desired nodes API, is still the same.
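To make the suspected failure mode concrete, here is a toy model of the version check, written as a sketch only. `ToyDesiredNodesStore`, `try_put`, the history ID `h1` and the `example.setting` key are all made up for illustration; the only behaviour taken from the error above is that re-submitting the same history and version with a different definition is rejected.

```python
class DesiredNodesConflict(Exception):
    """Same history ID and version submitted with a different definition."""


class ToyDesiredNodesStore:
    """In-memory stand-in for the desired nodes API (hypothetical, not the real client)."""

    def __init__(self):
        self.current = None  # (history_id, version, nodes) of the last accepted PUT

    def put(self, history_id, version, nodes):
        if self.current is not None:
            cur_history, cur_version, cur_nodes = self.current
            same_slot = (history_id, version) == (cur_history, cur_version)
            if same_slot and nodes != cur_nodes:
                raise DesiredNodesConflict(
                    f"Desired nodes with history [{history_id}] and version [{version}] "
                    "already exists with a different definition"
                )
        self.current = (history_id, version, nodes)


def try_put(store, history_id, version, nodes):
    """Return 'accepted' or 'conflict' instead of raising, to keep the demo short."""
    try:
        store.put(history_id, version, nodes)
        return "accepted"
    except DesiredNodesConflict:
        return "conflict"


store = ToyDesiredNodesStore()
generation = 2  # metadata.generation only changes when the resource spec is edited

# Placeholder payloads: the operator upgrade changes what gets rendered,
# while the Elasticsearch resource itself is untouched.
before_upgrade = [{"settings": {"example.setting": "old"}}]
after_upgrade = [{"settings": {"example.setting": "new"}}]

first = try_put(store, "h1", generation, before_upgrade)
repeat = try_put(store, "h1", generation, before_upgrade)    # identical payload
upgraded = try_put(store, "h1", generation, after_upgrade)   # same version, new payload
bumped = try_put(store, "h1", generation + 1, after_upgrade)
```

In this model only the `upgraded` call is rejected, which matches the hypothesis: the payload changed but the version derived from metadata.generation did not.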

I wonder if we should have an "upgrade" e2e pipeline, something like our upgrade-test-harness, that should be run automatically when submitting a PR in order to detect this kind of issue.

pebrc · 2022-09-21T14:03:03Z

Similar problem in #6027: there, a PVC resize leads to multiple updates with an unchanging spec.

pebrc · 2022-09-22T15:04:00Z

We discussed potential solutions and @barkbay suggested two approaches:

disabling desired_nodes for the next release
switching to a conditional PUT after GET approach:

GET the _latest desired nodes topology from Elasticsearch and compare it with the expected desired nodes
If the topologies are the same, stop
If they differ, take the version returned from the GET call and increment it
PUT the new topology
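A minimal sketch of the conditional PUT after GET flow, assuming a hypothetical client with `get_latest()` and `put()` methods. `InMemoryDesiredNodes` and `reconcile_desired_nodes` are illustrative names, not operator code, and starting at version 1 when nothing exists yet is an assumption.

```python
class InMemoryDesiredNodes:
    """Hypothetical client that stores the last PUT and returns it unchanged."""

    def __init__(self):
        self.latest = None
        self.put_calls = 0

    def get_latest(self):
        return self.latest

    def put(self, history_id, version, nodes):
        self.put_calls += 1
        self.latest = {"history_id": history_id, "version": version, "nodes": nodes}


def reconcile_desired_nodes(client, history_id, expected_nodes):
    """Return True if a PUT was issued, False if the topology was already current."""
    latest = client.get_latest()
    if latest is not None and latest["nodes"] == expected_nodes:
        return False  # same topology: stop
    # Assumption: start at version 1 when no desired nodes exist yet.
    next_version = 1 if latest is None else latest["version"] + 1
    client.put(history_id, next_version, expected_nodes)
    return True


client = InMemoryDesiredNodes()
topology = [{"memory": "3gb", "storage": "4gb"}]
resized = [{"memory": "4gb", "storage": "4gb"}]

created = reconcile_desired_nodes(client, "h1", topology)
unchanged = reconcile_desired_nodes(client, "h1", topology)
updated = reconcile_desired_nodes(client, "h1", resized)
```

The equality check only works here because the fake client echoes back exactly what was submitted; a real cluster is not guaranteed to do that.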

pebrc · 2022-09-23T07:45:30Z

Comparing the _latest desired nodes returned via the Elasticsearch API with the expected values turns out to be trickier than I thought:

{"service.version": "2.5.0-SNAPSHOT+dff6b534", "iteration": "1", "namespace": "default", "es_name": "autoscaling-sample", "diff": ["slice[0].Settings.map[node].map[store].map[allow_mmap]: string != bool", "slice[0].Settings.map[xpack].map[security].map[http].map[ssl].map[enabled]: string != bool", "slice[0].Settings.map[xpack].map[security].map[authc].map[realms].map[native].map[native1].map[order]: string != int64", "slice[0].Settings.map[xpack].map[security].map[authc].map[realms].map[file].map[file1].map[order]: string != int64", "slice[0].Memory: 3gb != 3221225472b", "slice[0].Storage: 4gb != 4294967296b", "slice[1].Settings.map[xpack].map[security].map[http].map[ssl].map[enabled]: string != bool", "slice[1].Settings.map[xpack].map[security].map[authc].map[realms].map[native].map[native1].map[order]: string != int64", "slice[1].Settings.map[xpack].map[security].map[authc].map[realms].map[file].map[file1].map[order]: string != int64", "slice[1].Settings.map[node].map[store].map[allow_mmap]: string != bool"]}

Elasticsearch transforms the submitted data: it stringifies all booleans and integers, and it converts resource quantities to the largest applicable unit, i.e. instead of bytes of memory it returns gigabytes.
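For illustration, mirroring those transformations on the operator side could look roughly like the sketch below. The rules are guesses inferred from the diff above (lowercase `true`/`false`, binary units, conversion only when the byte count divides evenly), not the documented behaviour of Elasticsearch, and the function names and the `order` value are made up.

```python
# Binary units, largest first. Assumption: the API uses 1024-based units.
_UNITS = [("pb", 1024**5), ("tb", 1024**4), ("gb", 1024**3),
          ("mb", 1024**2), ("kb", 1024), ("b", 1)]


def stringify_settings(value):
    """Recursively turn booleans and numbers into strings, as the diff suggests."""
    if isinstance(value, dict):
        return {key: stringify_settings(item) for key, item in value.items()}
    if isinstance(value, list):
        return [stringify_settings(item) for item in value]
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def largest_unit(size):
    """Convert e.g. '3221225472b' to '3gb'; leave anything else untouched."""
    if not size.endswith("b") or not size[:-1].isdigit():
        return size  # already in a larger unit, or not a plain byte count
    byte_count = int(size[:-1])
    for suffix, factor in _UNITS:
        if byte_count >= factor and byte_count % factor == 0:
            return f"{byte_count // factor}{suffix}"
    return size


def normalize_node(node):
    """Apply both transformations to one desired node entry."""
    return {
        "settings": stringify_settings(node["settings"]),
        "memory": largest_unit(node["memory"]),
        "storage": largest_unit(node["storage"]),
    }


submitted = {
    "settings": {"node": {"store": {"allow_mmap": False}}, "order": 1},
    "memory": "3221225472b",
    "storage": "4294967296b",
}
returned = {
    "settings": {"node": {"store": {"allow_mmap": "false"}}, "order": "1"},
    "memory": "3gb",
    "storage": "4gb",
}
matches = normalize_node(submitted) == returned
```

Every rule in this sketch is a place where the operator and Elasticsearch could drift apart, for example on values that do not divide evenly into a larger unit.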

My fear is that implementing a comparison after mirroring the same transformations might be brittle, and I am thinking we should just stick to the current approach of updating at each reconciliation with an incremented version number.

pebrc · 2022-09-23T09:05:41Z

I am thinking about ways to optimise this, but it is quite involved. One idea follows below.

First iteration:

PUT desired nodes topology and calculate the hash of the submitted request payload
Store it in an annotation together with the version, for example in the orchestration hints annotation

Subsequent iterations:

GET the _latest desired nodes topology from Elasticsearch, and read the stored version and hash from the annotation
Calculate the hash of the expected desired nodes and compare hash and version. If both are the same, stop
If they differ, take the version returned from the GET call and increment it
PUT the new topology
Update the orchestration hints annotation with the new version and hash

This would address the following concerns:

reduces the number of updates to the Elasticsearch API with identical topologies
handles the case where a third party changes or deletes the desired nodes, by doing a GET before the PUT
in steady state no updates are posted to Elasticsearch (e.g. if reconciliation is triggered by a cache refresh, an operator restart, or spec changes that have no relevance for the desired nodes API)

This comes with the downside of additional complexity and annotation updates to the ES resource.
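A rough sketch of the hash and annotation idea, under stated assumptions: the annotation key `example.invalid/desired-nodes-hint`, the JSON layout of its value, the fake client and the function names are all hypothetical, and SHA-256 over canonical JSON is just one possible hash.

```python
import hashlib
import json

HINT_ANNOTATION = "example.invalid/desired-nodes-hint"  # hypothetical annotation key


class InMemoryDesiredNodes:
    """Hypothetical client; a real one would call the Elasticsearch API."""

    def __init__(self):
        self.latest = None
        self.put_calls = 0

    def get_latest(self):
        return self.latest

    def put(self, history_id, version, nodes):
        self.put_calls += 1
        self.latest = {"history_id": history_id, "version": version, "nodes": nodes}


def payload_hash(nodes):
    """Hash the request payload as submitted, before Elasticsearch rewrites it."""
    canonical = json.dumps(nodes, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def reconcile(client, annotations, history_id, expected_nodes):
    """Return True if a PUT was issued, False in steady state."""
    latest = client.get_latest()
    hint = json.loads(annotations.get(HINT_ANNOTATION, "null"))
    expected_hash = payload_hash(expected_nodes)
    if latest is not None and hint == {"version": latest["version"], "hash": expected_hash}:
        return False  # same version and same payload hash: stop
    # Assumption: start at version 1 when no desired nodes exist.
    next_version = 1 if latest is None else latest["version"] + 1
    client.put(history_id, next_version, expected_nodes)
    annotations[HINT_ANNOTATION] = json.dumps(
        {"version": next_version, "hash": expected_hash}
    )
    return True


client = InMemoryDesiredNodes()
annotations = {}
nodes = [{"settings": {"node.store.allow_mmap": False},
          "memory": "3221225472b", "storage": "4294967296b"}]

first = reconcile(client, annotations, "h1", nodes)
steady = reconcile(client, annotations, "h1", nodes)

# Simulate a third party replacing the desired nodes behind the operator's back.
client.latest = {"history_id": "h1", "version": 7, "nodes": []}
repaired = reconcile(client, annotations, "h1", nodes)
```

Because the comparison uses only the version and the locally computed hash, it never has to interpret the transformed node definitions returned by the GET call.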

barkbay added >bug Something isn't working v2.5.0 labels Aug 29, 2022

barkbay mentioned this issue Sep 21, 2022

Autoscaling Elasticsearch: Introduce a dedicated custom resource #5978

Merged

pebrc self-assigned this Sep 22, 2022

pebrc mentioned this issue Sep 22, 2022

Desired nodes API errors on volume resizes #6027

Closed

pebrc mentioned this issue Sep 23, 2022

Increment desired nodes version on each call #6037

Merged

pebrc closed this as completed in #6037 Sep 27, 2022
