Unable to restart uncompleted downsampling tasks in ES 8.13 and above #106880

Closed
salvatore-campagna opened this issue Mar 28, 2024 · 2 comments
Labels: >bug, :StorageEngine/Downsampling (Downsampling, replacement for rollups: turn fine-grained time-based data into coarser-grained data), Team:StorageEngine

Comments

@salvatore-campagna
Contributor

salvatore-campagna commented Mar 28, 2024

Elasticsearch Version

8.13 and above

Installed Plugins

No response

Java Version

bundled

OS Version

all

Problem Description

PR #97557 introduced DownsampleShardTaskParams, a data structure used by our persistent task framework to store task-specific data. In this case, it holds the data for tasks started when a downsampling operation is carried out.

PR #98023 introduced an array of strings, dimensions, which stores the set of dimensions defined for the original index that the downsampling task operates on. This is required because with TSID hashing we lose the ability to recover dimensions just by decoding the _tsid field, so we need to store them unencoded somewhere else to support resuming interrupted persistent tasks.

Adding the new dimensions string array changes the format of the wire protocol that we use when serialising and deserialising instances of objects like DownsampleShardTaskParams. This kind of change requires code that handles backward compatibility with nodes running older versions of Elasticsearch, which "speak" a different version of the wire protocol. That check is missing (this is the bug!). As a result, newer versions of Elasticsearch unconditionally try to read a boolean and then, if the boolean is true, an array of strings (dimensions), ignoring the fact that the boolean and the string array might not be there. Older versions of Elasticsearch do not serialise the boolean or the string array, since neither existed when those versions were released. This is why newer versions of Elasticsearch need to check the wire protocol version and implement backward-compatible behaviour.
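A version-gated read and write of an optional string array can be sketched in plain Java as follows. This is only an illustration under stated assumptions: it does not use Elasticsearch's actual stream classes, and the class name, the numeric version constants, the single `shardId` field, and the empty-array fallback are all hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class VersionGatedParams {
    // Hypothetical wire version ids; real transport versions look different.
    static final int V_WITHOUT_DIMENSIONS = 1;
    static final int V_WITH_DIMENSIONS = 2;

    // Serialises a simplified params object in the format of the given wire version.
    static byte[] write(long shardId, String[] dimensions, int version) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeLong(shardId);
            // Older formats know nothing about the presence flag or the array.
            if (version >= V_WITH_DIMENSIONS) {
                out.writeBoolean(dimensions != null);
                if (dimensions != null) {
                    out.writeInt(dimensions.length);
                    for (String dimension : dimensions) {
                        out.writeUTF(dimension);
                    }
                }
            }
        }
        return bytes.toByteArray();
    }

    // Reads the dimensions back, consulting the version before touching the optional part.
    static String[] readDimensions(byte[] data, int version) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            in.readLong(); // shardId, present in every version
            if (version < V_WITH_DIMENSIONS) {
                // Assumption of this sketch: fall back to an empty array for old formats.
                return new String[0];
            }
            if (!in.readBoolean()) {
                return new String[0];
            }
            String[] dimensions = new String[in.readInt()];
            for (int i = 0; i < dimensions.length; i++) {
                dimensions[i] = in.readUTF();
            }
            return dimensions;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] oldBytes = write(7L, new String[] {"host"}, V_WITHOUT_DIMENSIONS);
        byte[] newBytes = write(7L, new String[] {"host"}, V_WITH_DIMENSIONS);
        System.out.println(readDimensions(oldBytes, V_WITHOUT_DIMENSIONS).length);
        System.out.println(readDimensions(newBytes, V_WITH_DIMENSIONS).length);
    }
}
```

The point of the sketch is that both the writer and the reader branch on the same version, so bytes produced in the older format are never interpreted as containing the presence flag.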

Moreover, instances of DownsampleShardTaskParams are serialised as part of the cluster state, which is written and read by nodes in the cluster and which must remain readable by nodes running a newer version of Elasticsearch after an upgrade. This is why the upgrade process is affected.

The issue happens because a node running a version of Elasticsearch older than 8.13 (8.10.x-8.12.x) writes a cluster state in which DownsampleShardTaskParams does not include the dimensions string array. Then, once nodes start moving to the new version during an upgrade to 8.13, deserialising the cluster state fails on the node running version 8.13 because the dimensions array is missing.
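The failure mode itself can be reproduced in miniature with plain Java streams. The sketch below is hypothetical and does not touch Elasticsearch code: an "old" writer emits only a long, and a "new" reader then tries to read a presence flag without checking any version. Because nothing follows the long in this toy stream, the read fails with an EOFException; in a real cluster state, where other data follows, the reader would instead misinterpret the next bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;

public class UnconditionalReadDemo {
    // Hypothetical "old" format: only the shard id, no flag and no dimensions.
    static byte[] writeOldFormat(long shardId) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeLong(shardId);
        }
        return bytes.toByteArray();
    }

    // Mimics a reader that ignores the wire version and always expects the flag.
    // Returns true if the unconditional read ran off the end of the data.
    static boolean unconditionalReadFails(byte[] data) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            in.readLong();
            try {
                in.readBoolean(); // the old writer never wrote this byte
                return false;
            } catch (EOFException e) {
                return true;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(unconditionalReadFails(writeOldFormat(7L)));
    }
}
```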

(NOTE: hopefully, a failure to deserialise the cluster state means the node running version 8.13 will never be able to join the cluster.)

Steps to Reproduce

In principle, this can be reproduced by starting at least one downsampling task and then upgrading to version 8.13 while the task is still running. Note also that the executor will not restart such tasks, because the failure is unrecoverable.

Logs (if relevant)

No response

@salvatore-campagna salvatore-campagna added >bug :StorageEngine/Downsampling Downsampling (replacement for rollups) - Turn fine-grained time-based data into coarser-grained data labels Mar 28, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

dnhatn pushed a commit that referenced this issue Mar 29, 2024
A missing check on the transport version results in an unreadable cluster state
if it includes a serialized instance of DownsampleShardTaskParams.
#98023 introduced an optional string array containing the dimensions used by time
series indices.
Reading an optional array requires first reading a boolean that indicates
whether an array of values exists in serialized form. From 8.13 on we try to
read such a boolean, which is not there because older versions write neither
a boolean nor a string array. Here we add the version check for backward
compatibility, skipping the read of the boolean and the array whenever they cannot be present.

Customers using downsampling might have cluster states that include such serialized
objects and would be unable to upgrade to version 8.13. They will be able to
upgrade to any version that includes this fix.

This fix has a side effect #106880
dnhatn pushed a commit to dnhatn/elasticsearch that referenced this issue Mar 29, 2024
elasticsearchmachine pushed a commit that referenced this issue Mar 29, 2024
…06896)


Co-authored-by: Salvatore Campagna <93581129+salvatore-campagna@users.noreply.github.com>
@martijnvg
Member

This was fixed by #106878
