Searchable snapshot version compatibility during upgrade #77007

Open
Leaf-Lin opened this issue Aug 30, 2021 · 6 comments
Assignees: original-brownbear
Labels: >bug, :Distributed/Snapshot/Restore, Team:Distributed


@Leaf-Lin (Contributor)

Elasticsearch version (bin/elasticsearch --version): During an upgrade from 7.14.0 --> 7.15.0

Plugins installed: []

JVM version (java -version): on ESS

OS version (uname -a if on a Unix-like system): on ESS

Description of the problem including expected versus actual behavior:
The ILM searchable_snapshot action and the _mount API should continue to work during an upgrade, yet if the master node is on a higher version than the data nodes, users may encounter the following version compatibility issue with the error message: node version [x0.y0.z0] is older than the snapshot version [x1.y1.z1]
Steps to reproduce:

There are two different ways to reproduce the red (unassigned) searchable snapshot index:

Prerequisites:

  • Step 0. To reproduce this issue, I created a cluster with 2 hot_content nodes + 1 data_frozen node on Elastic Cloud.
  • Step 1. For the reproduction, I added the kibana_sample_data_flights sample index.
  • Step 2. Take a snapshot prior to the upgrade (see the request sketch after this list).
  • Step 3. Create a mixed-version cluster. This can happen if an upgrade is unsuccessful, leaving the cluster with nodes on mixed versions where the master is on a higher version than the data nodes. You can cancel the upgrade and manually stop/start nodes to simulate this scenario and force the master onto the newer version.
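For step 2, a minimal sketch of a snapshot request, assuming the ESS default found-snapshots repository (the snapshot name my_pre_upgrade_snapshot is illustrative):

PUT /_snapshot/found-snapshots/my_pre_upgrade_snapshot?wait_for_completion=true
{
  "indices": "kibana_sample_data_flights",
  "include_global_state": false
}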

At this point, I have one master on 7.15.0 and all the remaining nodes on 7.14.0.
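You can confirm the mixed-version state with the cat nodes API, for example:

GET _cat/nodes?v&h=name,version,node.role,master&s=version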

Method 1: via _mount API:

  • Step 4. Clone the snapshot taken in step 2 and use the clone to mount a partial searchable snapshot index.
PUT /_snapshot/found-snapshots/<insert_a_valid_snapshot_from_step2>/_clone/my_snapshot
{
  "indices":"kibana_sample_data_flights"
}
POST /_snapshot/found-snapshots/my_snapshot/_mount?storage=shared_cache&wait_for_completion=true
{
  "index": "kibana_sample_data_flights", 
  "renamed_index": "my_manually_mounted_frozen_index",
  "index_settings": { 
    "index.number_of_replicas": 0
  },
  "ignored_index_settings": [ "index.refresh_interval" ] 
}

The step above results in a failed, unassigned shard:

{
  "snapshot" : {
    "snapshot" : "keep_snapshot",
    "indices" : [
      "my_manually_mounted_frozen_index"
    ],
    "shards" : {
      "total" : 1,
      "failed" : 1,   --> failed
      "successful" : 0
    }
  }
}

If you check with GET _cluster/allocation/explain (the request is sketched after this output), you would see that the 7.14.0 data_frozen node gives:

{
      "node_id" : "mEYMgdXeTgynW86xIlVC1A",
      "node_name" : "instance-0000000002",
      "transport_address" : "10.43.0.17:19519",
      "node_attributes" : {
        "logical_availability_zone" : "zone-0",
        "server_name" : "instance-0000000002.3d36b179416e482daaf8a6deac458706",
        "availability_zone" : "europe-west1-b",
        "xpack.installed" : "true",
        "data" : "frozen",
        "instance_configuration" : "gcp.es.datafrozen.n2.68x10x95",
        "transform.node" : "false",
        "region" : "unknown-region"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "node_version",
          "decision" : "NO",
          "explanation" : "node version [7.14.0] is older than the snapshot version [7.15.0]"
        }
      ]
    }
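For reference, the allocation explain output above can be requested like this, mirroring the method 2 call further down and using the index name from the mount request:

GET _cluster/allocation/explain
{
  "index": "my_manually_mounted_frozen_index",
  "primary": true,
  "shard": 0
}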

Method 2. via ILM:

  • Step 5. Create an ILM policy (and lower the ILM poll interval so the phase transition happens sooner), block writes on the sample index, then clone it with the policy attached (a way to check ILM progress is sketched after these requests):
PUT _ilm/policy/searchable_snapshot_to_frozen_policy
{
  "policy": {
    "phases": {
      "frozen": {
        "min_age" : "0ms",
        "actions": {
          "searchable_snapshot" : {
            "snapshot_repository" : "found-snapshots"
          }
        }
      }
    }
  }
}
PUT _cluster/settings
{
  "transient": {
    "indices.lifecycle.poll_interval": "10s"
  }
}
PUT kibana_sample_data_flights/_settings
{
  "index.blocks.write": true
}
POST kibana_sample_data_flights/_clone/my_ilm_frozen_index
{
  "settings": {
    "index.lifecycle.name":"searchable_snapshot_to_frozen_policy"
  }
}
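To watch ILM move the clone into the frozen phase, the ILM explain API can be used; the leading wildcard is there so the same request should also match the partial- prefixed index that the searchable_snapshot action mounts (a sketch, using the index name from the clone request above):

GET *my_ilm_frozen_index/_ilm/explain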

At this point, you can see that partial-my_ilm_frozen_index stays red (unassigned):

GET _cluster/allocation/explain
{
  "index":"partial-my_ilm_frozen_index",
  "primary": true,
  "shard":0
}

gives:

{
      "node_id" : "mEYMgdXeTgynW86xIlVC1A",
      "node_name" : "instance-0000000002",
      "transport_address" : "10.43.0.17:19519",
      "node_attributes" : {
        "logical_availability_zone" : "zone-0",
        "server_name" : "instance-0000000002.3d36b179416e482daaf8a6deac458706",
        "availability_zone" : "europe-west1-b",
        "xpack.installed" : "true",
        "data" : "frozen",
        "instance_configuration" : "gcp.es.datafrozen.n2.68x10x95",
        "transform.node" : "false",
        "region" : "unknown-region"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "node_version",
          "decision" : "NO",
          "explanation" : "node version [7.14.0] is older than the snapshot version [7.15.0]"
        }
      ]
    }


Leaf-Lin added the >bug and needs:triage labels on Aug 30, 2021
original-brownbear self-assigned this on Aug 31, 2021
original-brownbear added the :Distributed/Snapshot/Restore label and removed the needs:triage label on Aug 31, 2021
elasticmachine added the Team:Distributed label on Aug 31, 2021
@elasticmachine (Collaborator)

Pinging @elastic/es-distributed (Team:Distributed)

@original-brownbear (Member)

This is not just a searchable snapshot issue I guess. We record the master version as the version of a snapshot which isn't great or even correct. We should instead record the min-node version in the cluster to make sure we don't run into this kind of thing.
Will create a fix for this.

@original-brownbear (Member) commented Aug 31, 2021

Hmm, on second thought this isn't as straightforward as I thought. In the exact case mentioned here, we can solve the problem by setting the snapshot version to the min-node version in the cluster, and it goes away.
But say we run a mixed cluster where all the hot nodes are 7.15.0 and all the frozen or cold nodes are 7.14.0. Now the frozen or cold nodes potentially won't be able to mount the snapshots anyway, because the Lucene files won't be compatible (especially considering that crossing the tier boundary often comes with a force-merge that will definitely change the Lucene file version).

=> Don't we have the same problem with other ILM actions that move data across tiers in mixed-version clusters, if there is a mismatch between the versions of the hot and colder-than-hot nodes?

@DaveCTurner (Contributor)

don't we have the same problem with other ILM actions that move data across tiers in in mixed-version clusters

Yes, moves like that would be blocked by the NodeVersionAllocationDecider. Do we need to recommend upgrading the tiers one-by-one (frozen first, then cold, then warm, then hot, then masters)?

Leaf-Lin added a commit that referenced this issue Sep 9, 2021
As discussed in #77007 (comment), it was decided that documentation on rolling upgrade should explicitly mention upgrading by tiers.
@Leaf-Lin (Contributor, Author) commented Sep 9, 2021

It has been discussed and agreed that documenting the rolling upgrade by tier (frozen, cold, warm, then hot) is safer to ensure ILM backwards compatibility. This will be followed up in #77491.

Leaf-Lin added a commit that referenced this issue Oct 21, 2021
* Update rolling_upgrade.asciidoc

As discussed in #77007 (comment), it was decided that documentation on rolling upgrade should explicitly mention upgrading by tiers.

* Update docs/reference/upgrade/rolling_upgrade.asciidoc

Co-authored-by: Henning Andersen <33268011+henningandersen@users.noreply.github.com>
@jakommo (Contributor) commented May 16, 2022

@Leaf-Lin @DaveCTurner I helped a user today who ran into this during the upgrade from 7.16 to 7.17.
It looks like the original doc PR to add this to the 7.x docs was reverted again, and then #79617 only made it into 8.x.
Should we backport this to 7.x, or was there a reason it only made it into 8.x?
