Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gracefully handles pre 2.x compressed snapshots #22267

Merged
merged 1 commit into from Jan 6, 2017

Conversation

abeyad
Copy link

@abeyad abeyad commented Dec 19, 2016

In pre 2.x versions, if the repository was set to compress snapshots,
then snapshots would be compressed with the LZF algorithm. In 5.x,
Elasticsearch no longer supports the LZF compression algorithm. This
presents an issue when retrieving snapshots in a repository or upgrading
repository data to the 5.x version, because Elasticsearch throws an
exception when it tries to read the snapshot metadata because it was
compressed using LZF.

This commit gracefully handles the situation by introducing a new
incompatible-snapshots blob to the repository. For any pre-2.x snapshot
that cannot be read, that snapshot is removed from the list of active
snapshots, because the snapshot could not be restored anyway. Instead,
the snapshot is recorded in the incompatible-snapshots blob. When
listing snapshots, both active snapshots and incompatible snapshots will
be listed, with incompatible snapshots showing a INCOMPATIBLE state.
Any attempt to restore an incompatible snapshot will result in an
exception.

@abeyad abeyad added :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >bug v5.1.2 v5.2.0 labels Dec 19, 2016
@abeyad abeyad requested a review from imotov December 19, 2016 18:04
Copy link
Contributor

@imotov imotov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not really sure what the right solution here, but this approach seems dangerous to me. Since we are not adding this snapshot to the list of snapshots in 5.x we will not consider files that belong to this snapshot as needed, which means that we might remove these files during consecutive snapshots, right?

I am thinking about a scenario where 1.x index was upgraded to 2.x so we have both 1.x and 2.x snapshots pointing to the same data directory. So, in a sense we might make this snapshot non-restorable even in earlier versions of elasticsearch.

I think a better approach might be to identify a list of snapshots that cannot be brought over to 5.x and throw an exception with the list of these snapshots asking user to either delete these snapshots from the existing repository or switch to a new repository.

@abeyad
Copy link
Author

abeyad commented Dec 19, 2016

@imotov

I am not really sure what the right solution here, but this approach seems dangerous to me. Since we are not adding this snapshot to the list of snapshots in 5.x we will not consider files that belong to this snapshot as needed, which means that we might remove these files during consecutive snapshots, right?

What happens in the case where there are no compressed snapshots, when a 5.x cluster is touching the repository for the first time, it updates the snapshots index file to the 5.x version. In doing so, it is able to read the legacy pre 2.x format and get the snapshot metadata and proceed accordingly. When you GET all the snapshots, it lists all the snapshots including the pre 2.x ones. It is only when you try to restore the snapshot that you get an error saying "snapshot contains index data that is too old" (something to that effect). It is only when dealing with compressed snapshots that we have an issue. The reason is we can't even read the snapshot metadata via LegacyBlobFormat because the compression mode is no longer supported and throws the IllegalStateException. This means when we try to upgrade the snapshot repository data to the 5.x version, we get exceptions and can't proceed. That means we can't even delete the snapshot because being able to read the snapshot metadata is a prerequisite to updating the view of what the current snapshots are (not to mention the getSnapshotInfo call fails too). So we really can't proceed with the repository in 5.x until we either drop the snapshot or have some way to read its metadata, which is now prohibited due to its compressed format.

I am thinking about a scenario where 1.x index was upgraded to 2.x so we have both 1.x and 2.x snapshots pointing to the same data directory. So, in a sense we might make this snapshot non-restorable even in earlier versions of elasticsearch.

The upgrade process will only remove the snapshot from the new index-N file, so a 2.x cluster can still read the snapshot from the index file. Unless I misunderstood what you intended?

I think a better approach might be to identify a list of snapshots that cannot be brought over to 5.x and throw an exception with the list of these snapshots asking user to either delete these snapshots from the existing repository or switch to a new repository.

The issue is that we can't even list the snapshots to delete until we are able to load the repository in 5.x We can, and indeed do, do this for uncompressed snapshots, as mentioned above, but the same strategy does not work for snapshots compressed in 1.x. The reason I considered it acceptable to remove the snapshot from the index-N file, with a warning, is that the snapshot would never be restorable in 5.x anyway (because it will have 1.x index data).

Not sure if this makes sense, but this is the rationale for this approach and I'm not sure if it can be done differently without adding a new API that operates on a "non-updated" repository so one can pick and chose what snapshots they want to keep.

@imotov
Copy link
Contributor

imotov commented Dec 19, 2016

The upgrade process will only remove the snapshot from the new index-N file, so a 2.x cluster can still read the snapshot from the index file. Unless I misunderstood what you intended?

@abeyad I was trying to describe a scenario, where this might lead to loosing data in the 1.x snapshots without a warning (I don't consider logging on master a fair warning in this case). Might be it's me how is missing something here though. Can we chat on zoom about it tomorrow morning, I will try to explain the scenario in more details.

@abeyad abeyad force-pushed the snapshot_unsupported_compression_5x branch 2 times, most recently from 64a0ba6 to 2a3faf7 Compare December 26, 2016 04:17
@abeyad
Copy link
Author

abeyad commented Dec 26, 2016

@imotov I pushed 2a3faf7daf723b6c5ab52862aecc9277979f8dbc that implements the incompatible-snapshots blob that we discussed.

Copy link
Contributor

@imotov imotov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is much better, but I have a couple of questions. How are we going to handle this in 6.x? Would it be possible to improve the delete snapshot experience for the unsupported snapshots or they are pretty much stuck in this repository forever since we cannot really clean them?

@abeyad abeyad removed the v5.1.2 label Jan 6, 2017
@abeyad
Copy link
Author

abeyad commented Jan 6, 2017

How are we going to handle this in 6.x?

I will forward port this code to master, with the only omission being the actual creation of the incompatible-snapshots blob (because as of now, only 5.x would need to create it in its updateRepositoryData method that 6.x won't have). So 6.x clusters will also be able to read the incompatible snapshots and display them just as 5.x does.

Would it be possible to improve the delete snapshot experience for the unsupported snapshots or they are pretty much stuck in this repository forever since we cannot really clean them?

Its impossible to know from the repository data which data files can be deleted, because the snap-*.dat blob can't be read at all, which is the starting point to figure out which files in the repository belong to the snapshot.

One (not so great) option is that we could iterate over all the data files in the repository and if they throw a similar compression exception, we can know they are outdated (belonging to 1.x) and delete them. But the overhead of iterating over all repository files like that could be huge.

WDYT?

Copy link
Contributor

@imotov imotov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I guess this is the best we can do in this case. LGTM.

In pre 2.x versions, if the repository was set to compress snapshots,
then snapshots would be compressed with the LZF algorithm.  In 5.x,
Elasticsearch no longer supports the LZF compression algorithm.  This
presents an issue when retrieving snapshots in a repository or upgrading
repository data to the 5.x version, because Elasticsearch throws an
exception when it tries to read the snapshot metadata because it was
compressed using LZF.

This commit gracefully handles the situation by introducing a new
incompatible-snapshots blob to the repository.  For any pre-2.x snapshot
that cannot be read, that snapshot is removed from the list of active
snapshots, because the snapshot could not be restored anyway.  Instead,
the snapshot is recorded in the incompatible-snapshots blob.  When
listing snapshots, both active snapshots and incompatible snapshots will
be listed, with incompatible snapshots showing a `INCOMPATIBLE` state.
Any attempt to restore an incompatible snapshot will result in an
exception.
@abeyad abeyad force-pushed the snapshot_unsupported_compression_5x branch from 2a3faf7 to 959ad35 Compare January 6, 2017 21:30
@abeyad
Copy link
Author

abeyad commented Jan 6, 2017

retest this please

@abeyad
Copy link
Author

abeyad commented Jan 6, 2017

thanks for the review @imotov

@abeyad abeyad merged commit 714e748 into elastic:5.x Jan 6, 2017
@abeyad abeyad deleted the snapshot_unsupported_compression_5x branch January 6, 2017 22:04
@abeyad
Copy link
Author

abeyad commented Jan 7, 2017

master: b0c009a

abeyad pushed a commit that referenced this pull request Jan 7, 2017
In pre 2.x versions, if the repository was set to compress snapshots,
then snapshots would be compressed with the LZF algorithm.  In 5.x,
Elasticsearch no longer supports the LZF compression algorithm.  This
presents an issue when retrieving snapshots in a repository or upgrading
repository data to the 5.x version, because Elasticsearch throws an
exception when it tries to read the snapshot metadata because it was
compressed using LZF.

This commit gracefully handles the situation by introducing a new
incompatible-snapshots blob to the repository.  For any pre-2.x snapshot
that cannot be read, that snapshot is removed from the list of active
snapshots, because the snapshot could not be restored anyway.  Instead,
the snapshot is recorded in the incompatible-snapshots blob.  When
listing snapshots, both active snapshots and incompatible snapshots will
be listed, with incompatible snapshots showing a `INCOMPATIBLE` state.
Any attempt to restore an incompatible snapshot will result in an
exception.
@abeyad
Copy link
Author

abeyad commented Jan 7, 2017

5.2 commit: 29b777a

abeyad pushed a commit to abeyad/elasticsearch that referenced this pull request Apr 29, 2017
In elastic#22267, we introduced the notion of incompatible snapshots in a
repository, and they were stored in a root-level blob named
`incompatible-snapshots`.  If there were no incompatible snapshots in
the repository, then there was no `incompatible-snapshots` blob.

However, this causes a problem for some cloud-based repositories,
because if the blob does not exist, the cloud-based repositories may
attempt to keep retrying the read of a non-existent blob with
expontential backoff until giving up.  This causes performance issues
(and potential timeouts) on snapshot operations because getting the
`incompatible-snapshots` is part of getting the repository data (see
RepositoryData#getRepositoryData()).

This commit fixes the issue by creating an empty
`incompatible-snapshots` blob in the repository if one does not exist.
abeyad pushed a commit that referenced this pull request May 1, 2017
In #22267, we introduced the notion of incompatible snapshots in a
repository, and they were stored in a root-level blob named
`incompatible-snapshots`.  If there were no incompatible snapshots in
the repository, then there was no `incompatible-snapshots` blob.

However, this causes a problem for some cloud-based repositories,
because if the blob does not exist, the cloud-based repositories may
attempt to keep retrying the read of a non-existent blob with
expontential backoff until giving up.  This causes performance issues
(and potential timeouts) on snapshot operations because getting the
`incompatible-snapshots` is part of getting the repository data (see
RepositoryData#getRepositoryData()).

This commit fixes the issue by creating an empty
`incompatible-snapshots` blob in the repository if one does not exist.
abeyad pushed a commit that referenced this pull request May 1, 2017
In #22267, we introduced the notion of incompatible snapshots in a
repository, and they were stored in a root-level blob named
`incompatible-snapshots`.  If there were no incompatible snapshots in
the repository, then there was no `incompatible-snapshots` blob.

However, this causes a problem for some cloud-based repositories,
because if the blob does not exist, the cloud-based repositories may
attempt to keep retrying the read of a non-existent blob with
expontential backoff until giving up.  This causes performance issues
(and potential timeouts) on snapshot operations because getting the
`incompatible-snapshots` is part of getting the repository data (see
RepositoryData#getRepositoryData()).

This commit fixes the issue by creating an empty
`incompatible-snapshots` blob in the repository if one does not exist.
abeyad pushed a commit that referenced this pull request May 1, 2017
In #22267, we introduced the notion of incompatible snapshots in a
repository, and they were stored in a root-level blob named
`incompatible-snapshots`.  If there were no incompatible snapshots in
the repository, then there was no `incompatible-snapshots` blob.

However, this causes a problem for some cloud-based repositories,
because if the blob does not exist, the cloud-based repositories may
attempt to keep retrying the read of a non-existent blob with
expontential backoff until giving up.  This causes performance issues
(and potential timeouts) on snapshot operations because getting the
`incompatible-snapshots` is part of getting the repository data (see
RepositoryData#getRepositoryData()).

This commit fixes the issue by creating an empty
`incompatible-snapshots` blob in the repository if one does not exist.
abeyad pushed a commit that referenced this pull request May 8, 2017
This commit fixes inefficient (worst case exponential) loading of 
snapshot repository data when checking for incompatible snapshots,
that was introduced in #22267.  When getting snapshot information,
getRepositoryData() was called on every snapshot, so if there are
a large number of snapshots in the repository and _all snapshots
were requested, the performance degraded exponentially.  This
commit fixes the issue by only calling getRepositoryData once and
using the data from it in all subsequent calls to get snapshot 
information.

Closes #24509
abeyad pushed a commit that referenced this pull request May 8, 2017
This commit fixes inefficient (worst case exponential) loading of 
snapshot repository data when checking for incompatible snapshots,
that was introduced in #22267.  When getting snapshot information,
getRepositoryData() was called on every snapshot, so if there are
a large number of snapshots in the repository and _all snapshots
were requested, the performance degraded exponentially.  This
commit fixes the issue by only calling getRepositoryData once and
using the data from it in all subsequent calls to get snapshot 
information.

Closes #24509
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v5.2.0 v6.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants