
Snapshot and restore and shrink do not play well together #24257

Closed · nik9000 opened this issue Apr 21, 2017 · 2 comments · Fixed by #24322
Assignees: abeyad
Labels: >bug, :Distributed/Distributed, :Distributed/Snapshot/Restore

Comments

nik9000 (Member) commented Apr 21, 2017

The simplest way to reproduce this locally is to start a cluster with `gradle run` and then issue the following requests:

POST /test/doc
{ "test": "test" }


PUT /test/_settings
{
  "settings": {
    "index.routing.allocation.require._name": "shrink_node_name", 
    "index.blocks.write": true 
  }
}

POST /test/_shrink/shrunk

PUT /_snapshot/test_repo
{
  "type": "fs",
  "settings": {
    "compress": true,
    "location": "/Users/manybubbles/Workspaces/Elasticsearch/master/elasticsearch/distribution/build/cluster/shared/repo/test_repo"
  }
}

POST /_snapshot/test_repo/test_snapshot?wait_for_completion=true

POST /_snapshot/test_repo/test_snapshot/_restore?wait_for_completion=true
{
  "indices": ["shrunk"]
}
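
As an aside, `GET /shrunk/_settings?flat_settings=true` is a quick way to see the internal settings that the shrink operation added to the new index. The exact response varies, but it should contain something like (UUID and node id elided):

{
  "shrunk": {
    "settings": {
      "index.shrink.source.name": "test",
      "index.shrink.source.uuid": "...",
      "index.routing.allocation.initial_recovery._id": "...",
      ...
    }
  }
}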

Kibana will give you a "socket hang up" error because Elasticsearch has tripped an assertion and killed itself:

[elasticsearch] java.lang.AssertionError: all settings must have been upgraded before
[elasticsearch]         at org.elasticsearch.cluster.metadata.MetaDataIndexUpgradeService.upgradeIndexMetaData(MetaDataIndexUpgradeService.java:77) ~[elasticsearch-6.0.0-alpha1-SNAPSHOT.jar:6.0.0-alpha1-SNAPSHOT]

If you run without -ea then the shrunken index won't allocate properly because of the index.routing.allocation.initial_recovery setting. That setting cannot be cleared.
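
When that happens, the allocation explain API should show why the restored primary stays unassigned; a sketch of the request, assuming the restored index is named shrunk:

GET /_cluster/allocation/explain
{
  "index": "shrunk",
  "shard": 0,
  "primary": true
}

The decision output should point at the index.routing.allocation.initial_recovery._id filter rejecting every node in the cluster.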

nik9000 added the :Allocation, :Distributed/Snapshot/Restore, and >bug labels Apr 21, 2017
nik9000 (Member, Author) commented Apr 21, 2017

> If you run without -ea then the shrunken index won't allocate properly because of the index.routing.allocation.initial_recovery setting. That setting cannot be cleared.

Clarification: if you run without -ea it won't recover properly unless the old node is around. But that is still a problem because then you can't snapshot in one cluster and restore to another. Or you can't restore after that node has been decommissioned.

abeyad self-assigned this Apr 21, 2017
nik9000 (Member, Author) commented Apr 21, 2017

> abeyad self-assigned this 3 minutes ago

Good luck!

abeyad pushed a commit to abeyad/elasticsearch that referenced this issue Apr 25, 2017
When an index is shrunk using the shrink APIs, the shrink operation adds
some internal index settings to the shrink index, for example
`index.shrink.source.name|uuid` to denote the source index, as well as
`index.routing.allocation.initial_recovery._id` to denote the node on
which all shards for the source index resided when the shrunken index
was created.  However, this presents a problem when taking a snapshot of
the shrunken index and restoring it to a cluster where the initial
recovery node is not present, or restoring to the same cluster where the
initial recovery node is offline or decommissioned.  The restore
operation fails to allocate the shard in the shrunken index to a node
when the initial recovery node is not present, and a restore type of
recovery will *not* go through the PrimaryShardAllocator, meaning that
it will not have the chance to force allocate the primary to a node in
the cluster.  Rather, restore-initiated shard allocation goes through
the BalancedShardAllocator which does not attempt to force allocate a
primary.

This commit fixes the aforementioned problem by not requiring allocation
to occur on the initial recovery node when the recovery type is a
restore of a snapshot.  This commit also ensures that the internal
shrink index settings are recognized and not archived (which can trip an
assertion in the restore scenario).

Closes elastic#24257
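
For context on the archiving half of that fix: settings that Elasticsearch does not recognize on a restored index are normally renamed with an archived. prefix rather than failing the restore. On a build with the fix, re-running the repro above and then checking the restored index (a hypothetical check, same index name as above):

GET /shrunk/_settings?flat_settings=true

should list index.shrink.source.name, index.shrink.source.uuid, and index.routing.allocation.initial_recovery._id under their original names, with no archived. prefix.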
abeyad pushed a commit that referenced this issue Apr 26, 2017
…24322)
lcawl added the :Distributed/Distributed label and removed the :Allocation label Feb 13, 2018