
Recover retention leases during peer recovery #38435

Merged: jasontedor merged 6 commits into elastic:master from jasontedor:retention-leases-recovery on Feb 5, 2019

Conversation

jasontedor (Member)

This commit integrates retention leases with recovery. With this change, we copy the current retention leases on the primary to the replica during phase two of recovery. At this point, the replica has already been added to the replication group and so is already receiving retention lease sync requests from the primary. This means that if any retention lease syncs are triggered on the primary after we sample the retention leases during phase two, that sync request will also arrive on the replica, ensuring that the replica is, from this point on, up to date with the retention leases on the primary. We have to copy the leases during phase two because we will be applying indexing operations, potentially triggering merges, and therefore must ensure that the correct retention leases are in place beforehand.

Relates #37165
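
To make the ordering described above concrete, here is a rough replica-side sketch. It is only an illustration, not the code in this change: the method names `updateRetentionLeasesOnReplica` and `applyTranslogOperation`, and the overall shape, are assumptions for the sake of the example. The point is simply that the sampled leases are installed before any phase-two operations are replayed, so merges triggered by that replay already see the correct leases.

```java
import java.io.IOException;
import java.util.List;

import org.elasticsearch.index.engine.Engine;
import org.elasticsearch.index.seqno.RetentionLeases;
import org.elasticsearch.index.shard.IndexShard;
import org.elasticsearch.index.translog.Translog;

// Sketch only (assumed method names): install the leases sampled on the primary
// before replaying the phase-two operations on the replica.
final class ReplicaPhaseTwoSketch {
    static void applyPhaseTwo(final IndexShard replica,
                              final RetentionLeases leasesSampledOnPrimary,
                              final List<Translog.Operation> operations) throws IOException {
        // leases first, so any merges triggered by the replay below retain the right history
        replica.updateRetentionLeasesOnReplica(leasesSampledOnPrimary);
        for (final Translog.Operation operation : operations) {
            replica.applyTranslogOperation(operation, Engine.Operation.Origin.PEER_RECOVERY);
        }
    }
}
```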

@jasontedor added the >enhancement, :Distributed/Distributed, v7.0.0, and v6.7.0 labels on Feb 5, 2019
elasticmachine (Collaborator)

Pinging @elastic/es-distributed

@jasontedor changed the title from "Integrate retention leases with recovery" to "Recover retention leases during peer recovery" on Feb 5, 2019
@jasontedor mentioned this pull request on Feb 5, 2019
jasontedor (Member Author)

@elasticmachine run elasticsearch-ci/default-distro

jasontedor (Member Author)

@elasticmachine run elasticsearch-ci/default-distro

jasontedor (Member Author)

@elasticmachine run elasticsearch-ci/1

dnhatn (Member) left a comment

I prefer not to sync RetentionLeases multiple times since syncing the leases once during recovery is enough. However, we can't do that without introducing a new step, because we can't piggyback the leases on either the prepareTranslog step or the finalize step. Moreover, we expect to have only a few leases, so this choice makes a lot of sense to me.

@@ -39,18 +40,26 @@
     private int totalTranslogOps = RecoveryState.Translog.UNKNOWN;
     private long maxSeenAutoIdTimestampOnPrimary;
     private long maxSeqNoOfUpdatesOrDeletesOnPrimary;
+    private RetentionLeases retentionLeases;
dnhatn (Member)

I think we need to initialize retentionLeases with empty leases in a mixed cluster?

jasontedor (Member Author)

Great catch. I pushed 4d9fa1a.
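
For context on the mixed-cluster concern, here is a minimal sketch of the usual shape of such a guard during deserialization. The surrounding request class, the exact version constant, and the `RetentionLeases(StreamInput)` constructor are assumptions for illustration; this is not a reproduction of 4d9fa1a.

```java
import java.io.IOException;

import org.elasticsearch.Version;
import org.elasticsearch.common.io.stream.StreamInput;
import org.elasticsearch.index.seqno.RetentionLeases;

// Sketch only: read the leases when the sender is new enough to ship them,
// otherwise fall back to an explicit empty default so the field is never null
// in a mixed cluster.
final class MixedClusterReadSketch {
    static RetentionLeases readRetentionLeases(final StreamInput in) throws IOException {
        if (in.getVersion().onOrAfter(Version.V_6_7_0)) {    // assumed version gate
            return new RetentionLeases(in);                  // assumed Writeable-style constructor
        } else {
            return RetentionLeases.EMPTY;                    // older primaries do not send leases
        }
    }
}
```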

jasontedor (Member Author)

> I prefer not to sync RetentionLeases multiple times since syncing the leases once during recovery is enough. However, we can't do that without introducing a new step, because we can't piggyback the leases on either the prepareTranslog step or the finalize step. Moreover, we expect to have only a few leases, so this choice makes a lot of sense to me.

Indeed, I went through exactly the same dilemma. We could introduce a new step, but I am not sure that it's worth it. Any thoughts, @ywelsch, before I go ahead with this?
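
To illustrate the trade-off being discussed, here is a toy model in plain Java with invented names and types; it is not Elasticsearch code. As I read the discussion, the leases ride along with the translog-operations requests, which are sent in chunks, so they are effectively synced more than once; a dedicated step would transfer them exactly once at the cost of a new step in the recovery protocol.

```java
import java.util.List;

// Toy model of the piggyback-vs-dedicated-step trade-off; all names are invented.
final class LeaseSyncTradeOffSketch {

    record Lease(String id, long retainingSeqNo, String source) {}
    record Operation(long seqNo) {}
    record ChunkRequest(List<Operation> operations, List<Lease> leases) {}

    interface RecoveryTransport {
        void sendChunk(ChunkRequest request);   // ships a batch of translog operations
        void sendLeases(List<Lease> leases);    // hypothetical dedicated lease step
    }

    // Piggybacked variant: the same sampled leases travel with every chunk,
    // i.e. they are synced multiple times during a single recovery.
    static void phaseTwoPiggybacked(RecoveryTransport transport, List<Operation> ops,
                                    List<Lease> sampledLeases, int chunkSize) {
        for (int from = 0; from < ops.size(); from += chunkSize) {
            List<Operation> chunk = ops.subList(from, Math.min(from + chunkSize, ops.size()));
            transport.sendChunk(new ChunkRequest(chunk, sampledLeases));
        }
    }

    // Dedicated-step variant: the leases are transferred exactly once, up front,
    // and the chunks carry only operations.
    static void phaseTwoWithDedicatedStep(RecoveryTransport transport, List<Operation> ops,
                                          List<Lease> sampledLeases, int chunkSize) {
        transport.sendLeases(sampledLeases);
        for (int from = 0; from < ops.size(); from += chunkSize) {
            transport.sendChunk(new ChunkRequest(
                    ops.subList(from, Math.min(from + chunkSize, ops.size())), List.of()));
        }
    }
}
```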

* master: (23 commits)
  Lift retention lease expiration to index shard (elastic#38380)
  Make Ccr recovery file chunk size configurable (elastic#38370)
  Prevent CCR recovery from missing documents (elastic#38237)
  re-enables awaitsfixed datemath tests (elastic#38376)
  Types removal fix FullClusterRestartIT warnings (elastic#38445)
  Make sure to reject mappings with type _doc when include_type_name is false. (elastic#38270)
  Updates the grok patterns to be consistent with logstash (elastic#27181)
  Ignore type-removal warnings in XPackRestTestHelper (elastic#38431)
  testHlrcFromXContent() should respect assertToXContentEquivalence() (elastic#38232)
  add basic REST test for geohash_grid (elastic#37996)
  Remove DiscoveryPlugin#getDiscoveryTypes (elastic#38414)
  Fix the clock resolution to millis in GetWatchResponseTests (elastic#38405)
  Throw AssertionError when no master (elastic#38432)
  `if_seq_no` and `if_primary_term` parameters aren't wired correctly in REST Client's CRUD API (elastic#38411)
  Enable CronEvalToolTest.testEnsureDateIsShownInRootLocale (elastic#38394)
  Fix failures in BulkProcessorIT#testGlobalParametersAndBulkProcessor. (elastic#38129)
  SQL: Implement CURRENT_DATE (elastic#38175)
  Mute testReadRequestsReturnLatestMappingVersion (elastic#38438)
  [ML] Report index unavailable instead of waiting for lazy node (elastic#38423)
  Update Rollup Caps to allow unknown fields (elastic#38339)
  ...
@jasontedor merged commit 79a45b4 into elastic:master on Feb 5, 2019
@jasontedor deleted the retention-leases-recovery branch on February 5, 2019 at 22:43
jasontedor (Member Author)

@dnhatn and @ywelsch: if we want a new step, we can do it in a follow-up; I am going to keep moving for now.

jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Feb 11, 2019
* master:
  Add an authentication cache for API keys (elastic#38469)
  Fix exit code in certutil packaging test (elastic#38393)
  Enable logs for intermittent test failure (elastic#38426)
  Disable BWC to backport recovering retention leases (elastic#38477)
  Enable bwc tests now that elastic#38443 is backported. (elastic#38462)
  Fix Master Failover and DataNode Leave Blocking Snapshot (elastic#38460)
  Recover retention leases during peer recovery (elastic#38435)
  Set update mappings master node timeout to 30 min (elastic#38439)
  Assert job is not null in FullClusterRestartIT (elastic#38218)
  Update ilm-api.asciidoc, point to REMOVE policy (elastic#38235) (elastic#38463)
  SQL: Fix esType for DATETIME/DATE and INTERVALS (elastic#38179)
  Handle deprecation header-AbstractUpgradeTestCase (elastic#38396)
  XPack: core/ccr/Security-cli migration to java-time (elastic#38415)
  Disable bwc tests for elastic#38443 (elastic#38456)
  Bubble-up exceptions from scheduler (elastic#38317)
  Re-enable TasksClientDocumentationIT.testCancelTasks (elastic#38234)
  Allow custom authorization with an authorization engine  (elastic#38358)
  CRUDDocumentationIT fix documentation references
  Remove support for internal versioning for concurrency control (elastic#38254)
Labels: >enhancement, :Distributed/Distributed, v6.7.0, v7.0.0-beta1

4 participants