
Prevent CCR recovery from missing documents #38237

Merged: 19 commits merged into elastic:master on Feb 5, 2019

@tbrooks8 (Contributor) commented Feb 1, 2019

Currently the snapshot/restore process manually sets the global
checkpoint to the max sequence number from the restored segments. This
does not work for CCR, because it prevents documents that would otherwise
be recovered by the normal following operation from being recovered.

This commit fixes the issue by setting the initial global checkpoint to
the existing local checkpoint.
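
For illustration, a minimal, self-contained sketch of the failure mode (the values are hypothetical and this is not the actual restore code):

    // Hypothetical illustration of the bug, not the actual restore code.
    public class CheckpointExample {
        public static void main(String[] args) {
            // Suppose the restored commit contains ops 0..5 contiguously,
            // plus op 9, so ops 6..8 are gaps in the history.
            final long localCheckpoint = 5; // highest contiguous seq no
            final long maxSeqNo = 9;        // highest seq no present at all

            // Old behavior: global checkpoint = maxSeqNo. A CCR follower
            // considers everything up to 9 replicated and never fetches 6..8.
            final long oldGlobalCheckpoint = maxSeqNo;

            // Fixed behavior: global checkpoint = localCheckpoint. The normal
            // following operation still replays ops 6..9 from the leader.
            final long newGlobalCheckpoint = localCheckpoint;

            System.out.println("old = " + oldGlobalCheckpoint
                + ", new = " + newGlobalCheckpoint);
        }
    }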

tbrooks8 added some commits Feb 1, 2019

WIP
WIP
WIP

@tbrooks8 (Contributor, Author) commented Feb 1, 2019

This is not ready for production. I pushed it up here so we can discuss.

This commit:

  1. Adds a test that will occasionally fail because the local checkpoint != the max sequence number. Currently the test is set to run 5 times to guarantee consistent failures; that repetition will be removed eventually.

  2. Comments out store.bootstrapNewHistory(). This method manually sets the local checkpoint == the max sequence number. We need to discuss how to deal with this, as I think the bootstrap-new-history method probably offers some behavior we want to retain (even if the sequence-number part does not work for CCR).

  3. Attempts to advance the global checkpoint to the local checkpoint after putting no-ops (for the normal, non-CCR restore process). However, that cannot happen here: the replication tracker is not yet in primary mode (replicationTracker.isPrimaryMode() is false), which triggers an assertion. I'm not sure we need to advance the global checkpoint in StoreRecovery#restore at all, since when the replication tracker goes into primary mode, it advances the global checkpoint there (see below). But this is something we should discuss.

    /**
     * Initializes the global checkpoint tracker in primary mode (see {@link #primaryMode}. Called on primary activation or promotion.
     */
    public synchronized void activatePrimaryMode(final long localCheckpoint) {
        assert invariant();
        assert primaryMode == false;
        assert checkpoints.get(shardAllocationId) != null && checkpoints.get(shardAllocationId).inSync &&
            checkpoints.get(shardAllocationId).localCheckpoint == SequenceNumbers.UNASSIGNED_SEQ_NO :
            "expected " + shardAllocationId + " to have initialized entry in " + checkpoints + " when activating primary";
        assert localCheckpoint >= SequenceNumbers.NO_OPS_PERFORMED;
        primaryMode = true;
        updateLocalCheckpoint(shardAllocationId, checkpoints.get(shardAllocationId), localCheckpoint);
        updateGlobalCheckpointOnPrimary();
        assert invariant();
    }

tbrooks8 and others added some commits Feb 2, 2019

@ywelsch (Contributor) commented Feb 4, 2019

@tbrooks8 I've fixed things up according to how I think we should handle this, and I've added a unit test. The integration test testFollowIndexWithConcurrentMappingChanges fails sometimes, unrelated to the changes I've made here; it has problems with concurrent mapping updates (it tries to map the same field both as text and as long).

While this fixes the problem here, I think it exposes another issue: the primary will start off (i.e. be marked as started) with a history that contains gaps, i.e. local checkpoint != max sequence number. This can be problematic for replicas, because peer recovery only completes once all gaps are filled on the primary (see the call to cancellableThreads.execute(() -> shard.waitForOpsToComplete(endingSeqNo)); in RecoverySourceHandler). This means that if someone does a PUT FOLLOW with wait_for_active_shards > 1 and we restore an IndexCommit with gaps, the wait condition will time out, because we only start the following (which fills the gaps) once the wait condition has passed. I think initiateFollowing should first resume the following and then execute the wait condition, as sketched below. This problem should be verifiable by extending the existing test with a wait_for_active_shards > 1 condition.
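
A hedged, self-contained sketch of the proposed ordering (all names here are illustrative stand-ins, not the actual CCR code):

    import java.util.concurrent.CompletableFuture;

    // Resume the following first, then evaluate the wait_for_active_shards
    // condition: replicas can only activate once the following has filled
    // the seq-no gaps on the primary, so the reverse order can time out.
    public class InitiateFollowingOrder {

        static CompletableFuture<Void> resumeFollowing() {
            // Stand-in: starts replaying leader operations, filling the
            // seq-no gaps above the restored local checkpoint.
            return CompletableFuture.completedFuture(null);
        }

        static CompletableFuture<Void> waitForActiveShards() {
            // Stand-in: completes once the requested number of shard
            // copies is active; replica peer recovery waits on the
            // primary's gaps being filled.
            return CompletableFuture.completedFuture(null);
        }

        public static void main(String[] args) {
            // Fixed order: follow first, then wait on active shards.
            resumeFollowing()
                .thenCompose(ignored -> waitForActiveShards())
                .join();
        }
    }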

ywelsch changed the title from "WIP: Prevent Ccr recovery from missing documents" to "Prevent CCR recovery from missing documents" Feb 4, 2019

ywelsch approved these changes Feb 4, 2019

LGTM

tbrooks8 added some commits Feb 4, 2019

WIP
@tbrooks8 (Contributor, Author) commented Feb 5, 2019

@ywelsch your changes look good to me. There is a test failing (reproducibly); it looks like we assert that the sequence numbers match between Lucene and the translog. Did you mean for this to be the assertion in testRestoreShard:

closeShard(target, false);
closeShards(source);

instead of:

closeShard(source, false);
closeShards(target);

I assume we should not be asserting that the ops are the same for the target, since there will be some no-ops? I can update the PR; I just wanted to check whether that is what you intended.

ywelsch added some commits Feb 5, 2019

@ywelsch (Contributor) commented Feb 5, 2019

No, I only wanted the source index to be leniently closed (because we forcefully inject a gap). The problem was that the test did not ensure that the index is restored with soft-deletes enabled when it was snapshotted with soft-deletes enabled. I've fixed this now; a rough sketch of the idea is below.
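
A hedged sketch of that kind of fix (the setting constant is the real Elasticsearch index setting; the wiring around it is illustrative and assumes the Elasticsearch server library on the classpath):

    import org.elasticsearch.common.settings.Settings;
    import org.elasticsearch.index.IndexSettings;

    // Not the actual test change: when creating the restore target,
    // inherit the soft-deletes setting from the snapshotted source.
    public class SoftDeletesRestoreSketch {
        public static void main(String[] args) {
            // Assumed stand-in for the snapshotted index's settings.
            Settings sourceSettings = Settings.builder()
                .put(IndexSettings.INDEX_SOFT_DELETES_SETTING.getKey(), true)
                .build();

            // Carry the setting over instead of letting it default to
            // false, so the restored engine keeps a soft-deletes history.
            Settings restoreSettings = Settings.builder()
                .put(IndexSettings.INDEX_SOFT_DELETES_SETTING.getKey(),
                    sourceSettings.getAsBoolean(
                        IndexSettings.INDEX_SOFT_DELETES_SETTING.getKey(), false))
                .build();

            System.out.println(restoreSettings);
        }
    }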

@ywelsch (Contributor) commented Feb 5, 2019

@elasticmachine run elasticsearch-ci/default-distro

@ywelsch (Contributor) commented Feb 5, 2019

@elasticmachine run elasticsearch-ci/2

tbrooks8 added some commits Feb 5, 2019

Fix

tbrooks8 merged commit c2a8fe1 into elastic:master Feb 5, 2019

7 checks passed

- CLA: Commit author is a member of Elasticsearch
- elasticsearch-ci/1: Build finished.
- elasticsearch-ci/2: Build finished.
- elasticsearch-ci/default-distro: Build finished.
- elasticsearch-ci/docbldesx: Build finished.
- elasticsearch-ci/oss-distro-docs: Build finished.
- elasticsearch-ci/packaging-sample: Build finished.

jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Feb 5, 2019

Merge branch 'master' into retention-leases-recovery
* master: (23 commits)
  Lift retention lease expiration to index shard (elastic#38380)
  Make Ccr recovery file chunk size configurable (elastic#38370)
  Prevent CCR recovery from missing documents (elastic#38237)
  re-enables awaitsfixed datemath tests (elastic#38376)
  Types removal fix FullClusterRestartIT warnings (elastic#38445)
  Make sure to reject mappings with type _doc when include_type_name is false. (elastic#38270)
  Updates the grok patterns to be consistent with logstash (elastic#27181)
  Ignore type-removal warnings in XPackRestTestHelper (elastic#38431)
  testHlrcFromXContent() should respect assertToXContentEquivalence() (elastic#38232)
  add basic REST test for geohash_grid (elastic#37996)
  Remove DiscoveryPlugin#getDiscoveryTypes (elastic#38414)
  Fix the clock resolution to millis in GetWatchResponseTests (elastic#38405)
  Throw AssertionError when no master (elastic#38432)
  `if_seq_no` and `if_primary_term` parameters aren't wired correctly in REST Client's CRUD API (elastic#38411)
  Enable CronEvalToolTest.testEnsureDateIsShownInRootLocale (elastic#38394)
  Fix failures in BulkProcessorIT#testGlobalParametersAndBulkProcessor. (elastic#38129)
  SQL: Implement CURRENT_DATE (elastic#38175)
  Mute testReadRequestsReturnLatestMappingVersion (elastic#38438)
  [ML] Report index unavailable instead of waiting for lazy node (elastic#38423)
  Update Rollup Caps to allow unknown fields (elastic#38339)
  ...

tbrooks8 added a commit to tbrooks8/elasticsearch that referenced this pull request Feb 5, 2019

Prevent CCR recovery from missing documents (elastic#38237)

colings86 added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019

dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this pull request Feb 12, 2019

Prevent CCR recovery from missing documents (elastic#38237)