Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exposed engine must include all operations below global checkpoint during rollback #36159

Merged
merged 4 commits into from Dec 9, 2018

Conversation

dnhatn
Copy link
Member

@dnhatn dnhatn commented Dec 3, 2018

Today we expose a new engine immediately during Lucene rollback. The new engine is started with a safe commit which might not include all acknowledged operation. With this change, we won't expose the new engine until it has recovered from the local translog.

Note that this solution is not complete since it's able to reserve only acknowledged operations before the global checkpoint. This is because we replay translog up to the global checkpoint during rollback. A per-doc Lucene rollback would solve this issue entirely.

Relates #32867

…ring rollback

Today we expose a new engine immediately during Lucene rollback. The new
engine is started with a safe commit which might not include all
acknowledged operation. With this change, we won't expose the new engine
until it has recovered from the local translog.

Note that this solution is not complete since it's able to reserve only
acknowledged operations before the global checkpoint. This is because we
replay translog up to the global checkpoint during rollback. A per-doc
Lucene rollback would solve this issue entirely.
@dnhatn dnhatn added :Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. v7.0.0 v6.6.0 labels Dec 3, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left 2 smaller comments around the order of things and mutexes. Looking good otherwise

@dnhatn
Copy link
Member Author

dnhatn commented Dec 7, 2018

@ywelsch Two very good catches. It's ready for another round. Would you please have another look? Thank you!

@dnhatn dnhatn requested a review from ywelsch December 7, 2018 08:12
Copy link
Contributor

@s1monw s1monw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left 2 comments. LGTM otherwise

synchronized (mutex) {
// we must create a new engine under mutex (see IndexShard#snapshotStoreMetadata).
newEngine = engineFactory.newReadWriteEngine(newEngineConfig());
onNewEngine(newEngine);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

onNewEngine does't publish the new engine right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

onNewEngine does not expose the engine itself but exposes only the last refresh translog location to RefreshListeners.

Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@dnhatn
Copy link
Member Author

dnhatn commented Dec 9, 2018

Thanks everyone :)

@dnhatn dnhatn merged commit 902d6f5 into elastic:master Dec 9, 2018
@dnhatn dnhatn deleted the rollback-acked-writes branch December 9, 2018 02:27
dnhatn added a commit that referenced this pull request Dec 9, 2018
Today we expose a new engine immediately during Lucene rollback. The new
engine is started with a safe commit which might not include all
acknowledged operation. With this change, we won't expose the new engine
until it has recovered from the local translog.

Note that this solution is not complete since it's able to reserve only
acknowledged operations before the global checkpoint. This is because we
replay translog up to the global checkpoint during rollback. A per-doc
Lucene rollback would solve this issue entirely.

Relates #32867
jasontedor added a commit to liketic/elasticsearch that referenced this pull request Dec 9, 2018
* elastic/6.x: (37 commits)
  [HLRC] Added support for Follow Stats API (elastic#36253)
  Exposed engine must have all ops below gcp during rollback (elastic#36159)
  TEST: Always enable soft-deletes in ShardChangesTests
  Use delCount of SegmentInfos to calculate numDocs (elastic#36323)
  Add soft-deletes upgrade tests (elastic#36286)
  Remove LocalCheckpointTracker#resetCheckpoint (elastic#34667)
  Option to use endpoints starting with _security (elastic#36379)
  [CCR] Restructured QA modules (elastic#36404)
  RestClient: on retry timeout add root exception (elastic#25576)
  [HLRC] Add support for put privileges API (elastic#35679)
  HLRC: Add rollup search (elastic#36334)
  Explicitly recommend to forceMerge before freezing (elastic#36376)
  Rename internal repository actions to be internal (elastic#36377)
  Core: Remove parseDefaulting from DateFormatter (elastic#36386)
  [ML] Prevent stack overflow while copying ML jobs and datafeeds (elastic#36370)
  Docs: Fix Jackson reference (elastic#36366)
  [ILM] Fix issue where index may not yet be in 'hot' phase (elastic#35716)
  Undeprecate /_watcher endpoints (elastic#36269)
  Docs: Fix typo in bool query (elastic#36350)
  HLRC: Add delete template API (elastic#36320)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. >enhancement v6.6.0 v7.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants