Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recovery from snapshot may leave seq# gaps #27850

Merged
merged 1 commit into from
Dec 18, 2017

Conversation

bleskes
Copy link
Contributor

@bleskes bleskes commented Dec 16, 2017

When snapshotting the primary we capture a lucene commit at an arbitrary moment from a sequence number perspective. This means that it is possible that the commit misses operations and that there is a gap between the local checkpoint in the commit and the maximum sequence number.

When we restore, this will create a primary that "misses" operations and currently will mean that the sequence number system is stuck (i.e., the local checkpoint will be stuck). To fix this we should fill in gaps when we restore, in a similar fashion to normal store recovery.

Copy link
Member

@jasontedor jasontedor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

final Index index = resolveIndex(indexName);
final IndexShard primary = internalCluster().getInstance(IndicesService.class, dataNode).getShardOrNull(new ShardId(index, 0));
// create a gap in the sequence numbers
getEngineFromShard(primary).seqNoService().generateSeqNo();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@bleskes bleskes merged commit 9cd69e7 into elastic:master Dec 18, 2017
@bleskes
Copy link
Contributor Author

bleskes commented Dec 18, 2017

thx @jasontedor

@bleskes bleskes deleted the snapshot_restore_seq_nos branch December 18, 2017 12:58
bleskes added a commit that referenced this pull request Dec 18, 2017
When snapshotting the primary we capture a lucene commit at an arbitrary moment from a sequence number perspective. This means that it is possible that the commit misses operations and that there is a gap between the local checkpoint in the commit and the maximum sequence number.

When we restore, this will create a primary that "misses" operations and currently will mean that the sequence number system is stuck (i.e., the local checkpoint will be stuck). To fix this we should fill in gaps when we restore, in a similar fashion to normal store recovery.
bleskes added a commit that referenced this pull request Dec 18, 2017
When snapshotting the primary we capture a lucene commit at an arbitrary moment from a sequence number perspective. This means that it is possible that the commit misses operations and that there is a gap between the local checkpoint in the commit and the maximum sequence number.

When we restore, this will create a primary that "misses" operations and currently will mean that the sequence number system is stuck (i.e., the local checkpoint will be stuck). To fix this we should fill in gaps when we restore, in a similar fashion to normal store recovery.
martijnvg added a commit that referenced this pull request Dec 18, 2017
* es/6.x: (170 commits)
  Allow TrimFilter to be used in custom normalizers (#27758)
  recovery from snapshot should fill gaps (#27850)
  Remove unused class PreBuiltTokenFilters (#27839)
  Reject scroll query if size is 0 (#22552) (#27842)
  Mutes ‘Rollover no condition matched’ YAML test
  Make randomNonNegativeLong() draw from a uniform distribution (#27856)
  Adapt rest test after backport. Relates #27833
  Handle case where the hole vertex is south of the containing polygon(s) (#27685)
  Move range field mapper back to core
  Fix publication of elasticsearch-cli to Maven
  Do not use system properties when building the HttpAsyncClient (#27829)
  Optimize version map for append-only indexing (#27752)
  Add NioGroup for use in different transports (#27737)
  adapt field collapsing skip test version. relates #27833
  Add version support for inner hits in field collapsing (#27822) (#27833)
  Clarify that number of threads is set by packages
  Register HTTP read timeout setting
  Fixes Checkstyle
  Remove `operationThreaded` from Java API (#27836)
  Fixes failing BytesSizeValues tests
  ...
@clintongormley clintongormley added :Distributed/Engine Anything around managing Lucene and the Translog in an open shard. and removed :Sequence IDs labels Feb 14, 2018
@jpountz jpountz removed the :Distributed/Engine Anything around managing Lucene and the Translog in an open shard. label Jan 29, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v6.1.1 v6.2.0 v7.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants