Searchable snapshot version compatibility during upgrade #77007

Open
Leaf-Lin opened this issue Aug 30, 2021 · 6 comments
Assignees: original-brownbear
Labels: >bug, :Distributed/Snapshot/Restore, Team:Distributed


@Leaf-Lin (Contributor)

Elasticsearch version (bin/elasticsearch --version): During an upgrade from 7.14.0 --> 7.15.0

Plugins installed: []

JVM version (java -version): on ESS

OS version (uname -a if on a Unix-like system): on ESS

Description of the problem including expected versus actual behavior:
The ILM searchable_snapshot action and the _mount API should continue to work during an upgrade, yet if the master node is on a higher version than the data nodes, users may encounter the following version compatibility issue with the error message: node version [x0.y0.z0] is older than the snapshot version [x1.y1.z1]
Steps to reproduce:

There are two different ways to reproduce the red (unassigned) searchable snapshot index:

Prerequisites:

  • Step 0. To reproduce this issue, I created a cluster with 2 hot_content nodes + 1 data_frozen node on Elastic Cloud.
  • Step 1. For the reproduction, I added the kibana_sample_data_flights sample index.
  • Step 2. Take a snapshot prior to the upgrade (see the request sketch after this list).
  • Step 3. Create a mixed-version cluster. This can happen if an upgrade is unsuccessful, leaving the cluster with nodes on mixed versions where the master is on a higher version than the data nodes. You can cancel the upgrade and manually stop/start nodes to simulate this scenario and force the master onto the newer version.
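For step 2, a minimal sketch of a snapshot request, assuming the ESS default found-snapshots repository (the snapshot name my_pre_upgrade_snapshot is illustrative):

PUT /_snapshot/found-snapshots/my_pre_upgrade_snapshot?wait_for_completion=true
{
  "indices": "kibana_sample_data_flights",
  "include_global_state": false
}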

At this point, I have one master on 7.15.0 and all the remaining nodes on 7.14.0.
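You can confirm the mixed-version state with the cat nodes API, for example:

GET _cat/nodes?v&h=name,version,node.role,master&s=version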

Method 1: via _mount API:

  • Step 4. Clone the snapshot taken in step 2 and use the clone to mount a partial searchable snapshot index.
PUT /_snapshot/found-snapshots/<insert_a_valid_snapshot_from_step2>/_clone/my_snapshot
{
  "indices":"kibana_sample_data_flights"
}
POST /_snapshot/found-snapshots/my_snapshot/_mount?storage=shared_cache&wait_for_completion=true
{
  "index": "kibana_sample_data_flights", 
  "renamed_index": "my_manually_mounted_frozen_index",
  "index_settings": { 
    "index.number_of_replicas": 0
  },
  "ignored_index_settings": [ "index.refresh_interval" ] 
}

The step above results in a failed, unassigned shard:

{
  "snapshot" : {
    "snapshot" : "keep_snapshot",
    "indices" : [
      "my_manually_mounted_frozen_index"
    ],
    "shards" : {
      "total" : 1,
      "failed" : 1,   --> failed
      "successful" : 0
    }
  }
}

If you check with GET _cluster/allocation/explain (the request is sketched after this output), you would see that the 7.14.0 data_frozen node gives:

{
      "node_id" : "mEYMgdXeTgynW86xIlVC1A",
      "node_name" : "instance-0000000002",
      "transport_address" : "10.43.0.17:19519",
      "node_attributes" : {
        "logical_availability_zone" : "zone-0",
        "server_name" : "instance-0000000002.3d36b179416e482daaf8a6deac458706",
        "availability_zone" : "europe-west1-b",
        "xpack.installed" : "true",
        "data" : "frozen",
        "instance_configuration" : "gcp.es.datafrozen.n2.68x10x95",
        "transform.node" : "false",
        "region" : "unknown-region"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "node_version",
          "decision" : "NO",
          "explanation" : "node version [7.14.0] is older than the snapshot version [7.15.0]"
        }
      ]
    }
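For reference, the allocation explain output above can be requested like this, mirroring the method 2 call further down and using the index name from the mount request:

GET _cluster/allocation/explain
{
  "index": "my_manually_mounted_frozen_index",
  "primary": true,
  "shard": 0
}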

Method 2. via ILM:

  • Step 5. Create an ILM policy (and lower the ILM poll interval so the phase transition happens sooner), block writes on the sample index, then clone it with the policy attached (a way to check ILM progress is sketched after these requests):
PUT _ilm/policy/searchable_snapshot_to_frozen_policy
{
  "policy": {
    "phases": {
      "frozen": {
        "min_age" : "0ms",
        "actions": {
          "searchable_snapshot" : {
            "snapshot_repository" : "found-snapshots"
          }
        }
      }
    }
  }
}
PUT _cluster/settings
{
  "transient": {
    "indices.lifecycle.poll_interval": "10s"
  }
}
PUT kibana_sample_data_flights/_settings
{
  "index.blocks.write": true
}
POST kibana_sample_data_flights/_clone/my_ilm_frozen_index
{
  "settings": {
    "index.lifecycle.name":"searchable_snapshot_to_frozen_policy"
  }
}
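To watch ILM move the clone into the frozen phase, the ILM explain API can be used; the leading wildcard is there so the same request should also match the partial- prefixed index that the searchable_snapshot action mounts (a sketch, using the index name from the clone request above):

GET *my_ilm_frozen_index/_ilm/explain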

At this point, you can see that partial-my_ilm_frozen_index stays red (unassigned):

GET _cluster/allocation/explain
{
  "index":"partial-my_ilm_frozen_index",
  "primary": true,
  "shard":0
}

gives:

{
      "node_id" : "mEYMgdXeTgynW86xIlVC1A",
      "node_name" : "instance-0000000002",
      "transport_address" : "10.43.0.17:19519",
      "node_attributes" : {
        "logical_availability_zone" : "zone-0",
        "server_name" : "instance-0000000002.3d36b179416e482daaf8a6deac458706",
        "availability_zone" : "europe-west1-b",
        "xpack.installed" : "true",
        "data" : "frozen",
        "instance_configuration" : "gcp.es.datafrozen.n2.68x10x95",
        "transform.node" : "false",
        "region" : "unknown-region"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "node_version",
          "decision" : "NO",
          "explanation" : "node version [7.14.0] is older than the snapshot version [7.15.0]"
        }
      ]
    }


Leaf-Lin added the >bug and needs:triage labels on Aug 30, 2021
original-brownbear self-assigned this on Aug 31, 2021
original-brownbear added the :Distributed/Snapshot/Restore label and removed the needs:triage label on Aug 31, 2021
elasticmachine added the Team:Distributed label on Aug 31, 2021
@elasticmachine (Collaborator)

Pinging @elastic/es-distributed (Team:Distributed)

@original-brownbear (Member)

This is not just a searchable snapshot issue I guess. We record the master version as the version of a snapshot which isn't great or even correct. We should instead record the min-node version in the cluster to make sure we don't run into this kind of thing.
Will create a fix for this.

@original-brownbear (Member) commented Aug 31, 2021

Hmm, on second thought this isn't as straightforward as I thought. In the exact case mentioned here, we can solve the problem by setting the snapshot version to the min-node version in the cluster, and it goes away.
But say we run a mixed cluster where all the hot nodes are 7.15.0 and all the frozen or cold nodes are 7.14.0. Now the frozen or cold nodes potentially won't be able to mount the snapshots anyway, because the Lucene files won't be compatible (especially considering that crossing the tier boundary often comes with a force-merge that will definitely change the Lucene file version).

=> Don't we have the same problem with other ILM actions that move data across tiers in mixed-version clusters, if there is a mismatch between the versions of the hot and colder-than-hot nodes?

@DaveCTurner (Contributor)

don't we have the same problem with other ILM actions that move data across tiers in in mixed-version clusters

Yes, moves like that would be blocked by the NodeVersionAllocationDecider. Do we need to recommend upgrading the tiers one-by-one (frozen first, then cold, then warm, then hot, then masters)?

Leaf-Lin added a commit that referenced this issue Sep 9, 2021
As discussed in #77007 (comment), it was decided that documentation on rolling upgrade should explicitly mention upgrading by tiers.
@Leaf-Lin (Contributor, Author) commented Sep 9, 2021

It has been discussed and agreed that documenting the rolling upgrade by tier (frozen, cold, warm, then hot) is safer to ensure ILM backwards compatibility. This will be followed up in #77491.

Leaf-Lin added a commit that referenced this issue Oct 21, 2021
* Update rolling_upgrade.asciidoc

As discussed in #77007 (comment), it was decided that documentation on rolling upgrade should explicitly mention upgrading by tiers.

* Update docs/reference/upgrade/rolling_upgrade.asciidoc

Co-authored-by: Henning Andersen <33268011+henningandersen@users.noreply.github.com>
@jakommo (Contributor) commented May 16, 2022

@Leaf-Lin @DaveCTurner I helped a user today who ran into this during the upgrade from 7.16 to 7.17.
It looks like the original doc PR to add this to the 7.x docs was reverted again, and then #79617 only made it into 8.x.
Should we backport this to 7.x, or was there a reason it only made it into 8.x?
