-
Notifications
You must be signed in to change notification settings - Fork 24.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Recover retention leases during peer recovery #38435
Recover retention leases during peer recovery #38435
Conversation
This commit integrates retention leases with recovery. With this change, we copy the current retention leases on primary to the replica during phase two of recovery. At this point, the replica will have been added to the replication group and so is already receiving retention lease sync requests from the primary. This means that if any retention lease syncs are triggered on the primary after we sample the retention leases here during phase two, that sync request will also arrive on the replica ensuring that the replica is from this point on up to date with the retention leases on the primary. We have to copy these during phase two since we will be applying indexing operations, potentially triggering merges, and therefore must ensure the correct retention leases are in place beforehand.
Pinging @elastic/es-distributed |
@elasticmachine run elasticsearch-ci/default-distro |
1 similar comment
@elasticmachine run elasticsearch-ci/default-distro |
@elasticmachine run elasticsearch-ci/1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer not to sync RetentionLeases multiple times since syncing the leases once during recovery is enough. However, we can't do it without introducing a new step because we can't piggyback the leases in the prepareTranslog step nor the finalize step. Moreover, we expect to have a few leases, thus this choice makes a lot of sense to me.
@@ -39,18 +40,26 @@ | |||
private int totalTranslogOps = RecoveryState.Translog.UNKNOWN; | |||
private long maxSeenAutoIdTimestampOnPrimary; | |||
private long maxSeqNoOfUpdatesOrDeletesOnPrimary; | |||
private RetentionLeases retentionLeases; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to initialize the retentionLeases with an empty leases in a mixed cluster?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great catch. I pushed 4d9fa1a.
Indeed, I went through exactly the same dilemma. We could introduce a new step but I am not sure that it's worth it. Any thoughts @ywelsch before I push go on this? |
* master: (23 commits) Lift retention lease expiration to index shard (elastic#38380) Make Ccr recovery file chunk size configurable (elastic#38370) Prevent CCR recovery from missing documents (elastic#38237) re-enables awaitsfixed datemath tests (elastic#38376) Types removal fix FullClusterRestartIT warnings (elastic#38445) Make sure to reject mappings with type _doc when include_type_name is false. (elastic#38270) Updates the grok patterns to be consistent with logstash (elastic#27181) Ignore type-removal warnings in XPackRestTestHelper (elastic#38431) testHlrcFromXContent() should respect assertToXContentEquivalence() (elastic#38232) add basic REST test for geohash_grid (elastic#37996) Remove DiscoveryPlugin#getDiscoveryTypes (elastic#38414) Fix the clock resolution to millis in GetWatchResponseTests (elastic#38405) Throw AssertionError when no master (elastic#38432) `if_seq_no` and `if_primary_term` parameters aren't wired correctly in REST Client's CRUD API (elastic#38411) Enable CronEvalToolTest.testEnsureDateIsShownInRootLocale (elastic#38394) Fix failures in BulkProcessorIT#testGlobalParametersAndBulkProcessor. (elastic#38129) SQL: Implement CURRENT_DATE (elastic#38175) Mute testReadRequestsReturnLatestMappingVersion (elastic#38438) [ML] Report index unavailable instead of waiting for lazy node (elastic#38423) Update Rollup Caps to allow unknown fields (elastic#38339) ...
* master: Add an authentication cache for API keys (elastic#38469) Fix exit code in certutil packaging test (elastic#38393) Enable logs for intermittent test failure (elastic#38426) Disable BWC to backport recovering retention leases (elastic#38477) Enable bwc tests now that elastic#38443 is backported. (elastic#38462) Fix Master Failover and DataNode Leave Blocking Snapshot (elastic#38460) Recover retention leases during peer recovery (elastic#38435) Set update mappings mater node timeout to 30 min (elastic#38439) Assert job is not null in FullClusterRestartIT (elastic#38218) Update ilm-api.asciidoc, point to REMOVE policy (elastic#38235) (elastic#38463) SQL: Fix esType for DATETIME/DATE and INTERVALS (elastic#38179) Handle deprecation header-AbstractUpgradeTestCase (elastic#38396) XPack: core/ccr/Security-cli migration to java-time (elastic#38415) Disable bwc tests for elastic#38443 (elastic#38456) Bubble-up exceptions from scheduler (elastic#38317) Re-enable TasksClientDocumentationIT.testCancelTasks (elastic#38234) Allow custom authorization with an authorization engine (elastic#38358) CRUDDocumentationIT fix documentation references Remove support for internal versioning for concurrency control (elastic#38254)
This commit integrates retention leases with recovery. With this change, we copy the current retention leases on primary to the replica during phase two of recovery. At this point, the replica will have been added to the replication group and so is already receiving retention lease sync requests from the primary. This means that if any retention lease syncs are triggered on the primary after we sample the retention leases here during phase two, that sync request will also arrive on the replica ensuring that the replica is from this point on up to date with the retention leases on the primary. We have to copy these during phase two since we will be applying indexing operations, potentially triggering merges, and therefore must ensure the correct retention leases are in place beforehand.
Relates #37165