
Uses auto generated timestamp with soft-deletes #33656

Closed
dnhatn opened this issue Sep 13, 2018 · 5 comments
Labels: :Distributed/CCR Issues around the Cross Cluster State Replication features

dnhatn (Member) commented Sep 13, 2018

1. Peer-recovery

Today we don't store the auto-generated timestamp of indexing operations in Lucene, and we always assign -1 to index operations read from LuceneChangesSnapshot. This looks innocent, but it produces duplicate documents on a replica, as the following test demonstrates.

public void testRetryAppendOnlyInRecoveryAndReplication() throws Exception {
    Settings settings = Settings.builder()
        .put(IndexSettings.INDEX_SOFT_DELETES_SETTING.getKey(), true)
        .build();
    try (ReplicationGroup shards = createGroup(0, settings)) {
        shards.startAll();
        final IndexRequest originalRequest = new IndexRequest(
            index.getName(), "type").source("{}", XContentType.JSON);
        // process() assigns the auto-generated id and timestamp
        originalRequest.process(Version.CURRENT, null, index.getName());
        // clone the request via the wire format, then mark it as a retry
        IndexRequest retryRequest = new IndexRequest();
        try (BytesStreamOutput out = new BytesStreamOutput()) {
            originalRequest.writeTo(out);
            try (StreamInput in = out.bytes().streamInput()) {
                retryRequest.readFrom(in);
            }
        }
        retryRequest.onRetry();
        shards.index(retryRequest);
        IndexShard replica = shards.addReplica();
        shards.recoverReplica(replica); // timestamp on replica is -1
        shards.assertAllEqual(1);
        shards.index(originalRequest); // we optimize this request on replica
        shards.assertAllEqual(1);      // fails today: the replica indexes a duplicate
    }
}

To fix this, we need to assign each index operation read from LuceneChangesSnapshot a value that is at least the (original) auto-generated timestamp of its corresponding index request. Here we can use the latest auto-generated timestamp of the Engine.
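
For context, here is a minimal sketch of the append-only dedup check a replica performs; the class and method names are illustrative, not the actual InternalEngine API. It shows why a -1 timestamp from LuceneChangesSnapshot defeats the check: a retry delivered via recovery never raises the watermark, so the original request still looks optimizable.

// Illustrative sketch only; names are hypothetical, not the engine API.
class AppendOnlyDedupSketch {
    // highest auto-generated timestamp seen from any retried request
    private long maxUnsafeAutoIdTimestamp = -1;

    /** Returns true if the replica may skip the version-map lookup. */
    boolean canOptimize(long autoIdTimestamp, boolean isRetry) {
        if (isRetry) {
            // remember the retry's timestamp so the original request is
            // de-optimized when it arrives later
            maxUnsafeAutoIdTimestamp = Math.max(maxUnsafeAutoIdTimestamp, autoIdTimestamp);
            return false;
        }
        // ops replayed from LuceneChangesSnapshot carry -1 here, so a retry
        // delivered via peer recovery leaves the watermark at -1 and the
        // original request is (incorrectly) optimized
        return autoIdTimestamp > maxUnsafeAutoIdTimestamp;
    }
}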

2. Optimize indexing on a FollowingEngine in CCR

We disable the append-only optimization for index requests whose origin is recovery (retry is always true). To enable this optimization in CCR:

  1. We need to make sure that a FollowingEngine processes an append-only operation only once. This can be done using LocalCheckpointTracker (see the sketch after this list).

  2. We need to store the retry flag in the Lucene index and extend Translog#Index to include this flag. This should be fast with a single-valued doc-values field.
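
Here is a minimal sketch of point 1, assuming the proposed LocalCheckpointTracker#contains method; the surrounding class and method names are stand-ins rather than the final engine API.

// Sketch of exactly-once append-only processing keyed by seq_no.
class FollowingEngineSketch {
    interface Tracker {                        // stand-in for LocalCheckpointTracker
        boolean contains(long seqNo);          // has this seq_no been processed?
        void markSeqNoAsCompleted(long seqNo);
    }

    private final Tracker tracker;

    FollowingEngineSketch(Tracker tracker) {
        this.tracker = tracker;
    }

    void indexAppendOnly(long seqNo, Runnable addToLucene) {
        if (tracker.contains(seqNo)) {
            return;                            // already processed, e.g. a re-fetch
        }
        addToLucene.run();                     // plain addDocument, no version-map lookup
        tracker.markSeqNoAsCompleted(seqNo);
    }
}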

@s1monw WDYT? /cc @bleskes

This is a subtask of #30086.

dnhatn added the :Distributed/CCR and team-discuss labels Sep 13, 2018
elasticmachine (Collaborator) commented

Pinging @elastic/es-distributed

s1monw (Contributor) commented Sep 13, 2018

To fix this, we need to assign each index operation read from LuceneChangesSnapshot a value that is at least the (original) auto-generated timestamp of its corresponding index request. Here we can use the latest auto-generated timestamp of the Engine.

I don't think we can do that. We don't know whether the original request had such a timestamp or not; if we assign the wrong value, we might miss an update. I think we should rather not try to optimize the peer-recovery case and instead make sure we propagate the maxUnsafeAutoIdTimestamp from the primary to the replica before we replay the ops to it. That way we can be sure that if the retry or original request arrives we de-optimize it, and ops that come through peer recovery will be de-optimized as well.
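
A rough sketch of that proposal, with a stubbed Engine interface since the exact accessor names are an assumption here:

// Hypothetical recovery-side fix: ship the primary's timestamp watermark
// to the replica before any translog op is replayed.
class RecoverySketch {
    interface Engine {
        long getMaxUnsafeAutoIdTimestamp();            // assumed accessor
        void updateMaxUnsafeAutoIdTimestamp(long ts);  // assumed mutator
    }

    static void prepareForTranslogReplay(Engine primary, Engine replica) {
        // raise the replica's watermark first, so every append-only op
        // replayed from history takes the safe (non-optimized) path
        replica.updateMaxUnsafeAutoIdTimestamp(primary.getMaxUnsafeAutoIdTimestamp());
    }
}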

We disable the append-only optimization for index requests whose origin is recovery (retry is always true). To enable this optimization in CCR:

I wonder if we can be smarter here. Today we can only apply the append-only optimization if the ID was auto-generated. Yet for the following engine it's a different game. At the moment we fetch docs from the leader, we could (if we had the information about the highest seq_no that actually did a delete or update in the engine) also optimize documents with non-auto-generated IDs as append-only. We can then use the seq_no as the timestamp. Once the leader tells us that there is a seq_no X that caused an update or delete, we mark all ops with seq_no <= X as retries. This will give us the right behavior. We also need to make sure that if we fetch docs a second time we mark them as retries. Lemme know if I am missing something.
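
A sketch of that rule, assuming the leader exposes the highest seq_no that performed an update or delete (the value later called max_seq_no_of_updates); the names are illustrative:

// An op fetched from the leader is safe to index as a plain append iff
// its seq_no is above every update/delete the leader has performed.
class SeqNoOptimizationSketch {
    private long maxSeqNoOfUpdates;      // replicated from the leader

    boolean canIndexAsAppendOnly(long seqNo, boolean reFetched) {
        if (reFetched) {
            return false;                // a second fetch is treated as a retry
        }
        // ops at or below max_seq_no_of_updates may have been updated or
        // deleted on the leader, so they must take the safe path
        return seqNo > maxSeqNoOfUpdates;
    }
}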

dnhatn (Member, Author) commented Sep 13, 2018

I think we should rather not try to optimize the peer-recovery case and instead make sure we propagate the maxUnsafeAutoIdTimestamp from the primary to the replica before we replay the ops to it.

Yes, I am good with this since we don't have to worry about correctness.

At the moment we fetch docs from the leader, we could (if we had the information about the highest seq_no that actually did a delete or update in the engine) also optimize documents with non-auto-generated IDs as append-only.

Wow, superb idea. Thanks so much @s1monw.

dnhatn (Member, Author) commented Sep 18, 2018

We will implement an optimization using sequence numbers in the following engine, with these sub-tasks:

dnhatn added a commit that referenced this issue Sep 20, 2018
This change adds a "contains" method to LocalCheckpointTracker.
One use case is to check whether a given operation has been processed
in an engine by looking up its seq_no in the LocalCheckpointTracker.

Relates #33656
dnhatn added a commit that referenced this issue Sep 20, 2018
Today we don't store the auto-generated timestamp of append-only
operations in Lucene and assign -1 to every index operation
constructed from LuceneChangesSnapshot. This looks innocent, but it
generates duplicate documents on a replica if a retry append-only
arrives first via peer recovery and the original append-only then
arrives via replication. Since the retry append-only (delivered via
recovery) does not carry a timestamp, the replica will happily
optimize the original request when it should not.

This change transmits the max auto-generated timestamp from the
primary to replicas before the translog phase in peer recovery. This
timestamp prevents replicas from optimizing append-only requests whose
retry counterparts have already been processed.

Relates #33656 
Relates #33222
dnhatn added a commit that referenced this issue Sep 22, 2018
This PR is the first step in using seq_no to optimize indexing
operations. The idea is to track the max seq_no of update or delete
ops on a primary, transfer this information to replicas, and have
replicas use it to optimize the indexing plan for index operations
with an assigned seq_no.

The max_seq_no_of_updates on the primary is initialized once, when a
primary finishes its local recovery or its peer recovery during
relocation or promotion. After that, the max_seq_no_of_updates is only
advanced internally inside an engine when processing update or delete
operations.

Relates #33656
dnhatn added a commit that referenced this issue Sep 25, 2018
We started tracking max_seq_no_of_updates on the primary in #33842.
This commit replicates that value from a primary to its replicas in
replication requests and in the translog phase of peer recovery.

With this change, we guarantee that the value of max_seq_no_of_updates
on a replica, when any index/delete operation is performed, is at
least the max_seq_no_of_updates on the primary when that operation was
executed.

Relates #33656
dnhatn added a commit that referenced this issue Sep 26, 2018
This commit replicates the max_seq_no_of_updates on the leading index
to the primaries of the following index via ShardFollowNodeTask. The
max_seq_no_of_updates is then transmitted to the replicas of the
follower via replication requests (i.e. BulkShardOperationsRequest).

Relates #33656
dnhatn added a commit that referenced this issue Sep 29, 2018
This change introduces the indexing optimization using sequence
numbers in the FollowingEngine. The optimization uses the
max_seq_no_of_updates value, which is tracked on the primary of the
leader and replicated to replicas and followers.

Relates #33656
dnhatn closed this as completed Oct 4, 2018
dnhatn (Member, Author) commented Oct 4, 2018

All subtasks have been integrated.

dnhatn added a commit that referenced this issue Nov 7, 2018
A CCR test failure shows that the approach in #34474 is flawed.
Restoring the LocalCheckpointTracker from an index commit can cause
both FollowingEngine and InternalEngine to incorrectly ignore some
deletes.

Here is a small scenario illustrating the problem:

1. Delete a doc with seq_no=1 => the engine adds a delete tombstone to
Lucene.

2. Flush a commit consisting of only the delete tombstone.

3. Index a doc with seq_no=0 => the engine adds that doc to Lucene,
but soft-deleted.

4. Restart the engine from the commit of step 2; the engine fills its
LocalCheckpointTracker with the delete tombstone found in the commit.

5. Replay the local translog in reverse order: index#0, then delete#1.

6. When processing index#0, the engine adds it to Lucene as a live doc
and advances the local checkpoint to 1 (seq_no=1 was restored from the
commit in step 4).

7. When processing delete#1, the engine skips it because seq_no=1 is
less than or equal to the local checkpoint.

We should have zero documents after recovering from the translog, but
here we have one.

Since all operations after the local checkpoint of the safe commit are
retained, we can find them if the look-up also considers soft-deleted
documents. This PR closes the gap between the version map and the
local checkpoint tracker by taking soft-deleted documents into account
when resolving the indexing strategy for engine operations.

Relates #34474
Relates #33656
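
A minimal sketch of that resolution fix; the helper names are hypothetical, not the actual InternalEngine methods:

// Consult soft-deleted documents when deciding whether an operation has
// already been processed, instead of trusting the restored checkpoint.
class ResolveSketch {
    interface Searcher {
        // max seq_no indexed for this id, including soft-deleted docs; -1 if none
        long maxSeqNoForId(String id);
    }

    static boolean alreadyProcessed(Searcher searcher, String id, long opSeqNo) {
        // comparing against per-document history keeps out-of-order
        // replays (index#0 then delete#1 above) correct
        return opSeqNo <= searcher.maxSeqNoForId(id);
    }
}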