
Dedup translog operations by reading in reverse #27268

Merged
merged 19 commits into from Nov 26, 2017

Conversation


@dnhatn dnhatn commented Nov 4, 2017

Currently, translog operations are read and processed one by one. This
may be a problem, as stale operations in translogs may suddenly reappear
in recoveries. To make sure that stale operations won't be processed, we
read the translog files in reverse order (i.e. from the most recent file
to the oldest file) and only process an operation if its sequence number
has not been seen before.

Relates to #10708
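The dedup idea described above can be sketched as a minimal, self-contained illustration. The `Operation` record and `dedup` method below are hypothetical names for illustration only (the real change lives in `MultiSnapshot`): iterate translog generations from newest to oldest and keep only the first occurrence of each sequence number, so stale copies in older generations are skipped.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Minimal sketch of dedup-by-reading-in-reverse; not the Elasticsearch API.
class TranslogDedupSketch {
    record Operation(long seqNo, String payload) {}

    // Generations are ordered newest-first; the first occurrence of a seqNo
    // wins, so stale copies in older generations are skipped.
    static List<Operation> dedup(List<List<Operation>> generationsNewestFirst) {
        Set<Long> seenSeqNos = new HashSet<>();
        List<Operation> kept = new ArrayList<>();
        for (List<Operation> generation : generationsNewestFirst) {
            for (Operation op : generation) {
                // a negative seqNo models UNASSIGNED_SEQ_NO: never deduped
                if (op.seqNo() < 0 || seenSeqNos.add(op.seqNo())) {
                    kept.add(op);
                }
            }
        }
        return kept;
    }
}
```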

@bleskes bleskes left a comment


I like this. I left some minor comments. I would also love to see a replication group test that creates collisions between old primary operations and new primary operations with the same sequence number, then performs a recovery and verifies that they don't re-appear.

You can look at RecoveryTests to see how to do these things. Another example is IndexLevelReplicationTests#testConflictingOpsOnReplica. I'm thinking something like:

  1. 3-shard group with some initial data.
  2. Remove a replica from the group and index an operation.
  3. Fail the primary and add the removed replica back into the group. Promote the replica just added to a primary.
  4. Index one more op to the group. This will create a collision.
  5. Make sure that peer recovery from the 3rd replica (the one that was never removed) works. Note that you'll need to make that shard a primary.

PS: we have a bug in how replicas deal with the translog when a replica detects that there's a new primary when an operation acquires a replica lock (as opposed to the cluster state coming in), causing duplicates to be in the same translog gen. @jasontedor is working on a fix, but it would be good to see that your test runs into that bug without his fix.

Feel free to reach out. The test I suggested can be complex to set up.

Translog.Operation op;
while ((op = current.next()) != null) {
    if (op.seqNo() < 0 || seenSeqNo.getAndSet(op.seqNo()) == false) {
        return op;
Contributor

Can we check that the seqNo() is not set (UNASSIGNED_SEQ_NO) instead of using a generic < 0?

Member Author

Done.

*/
static final class SeqNumSet {
    static final short BIT_SET_SIZE = 1024;
    private final LongSet topTier = new LongHashSet();
Contributor

These are not always the top tier; can we instead call it something like completedBitSets?

Member Author

I was implementing a multi-level bitset structure: when a bitset at the lower level is completed, we move it to the higher level as a single bit in another bitset, and so on. However, I preferred a simpler solution. These names were derived from that structure and are no longer valid. I pushed 5935009.
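The two-tier idea discussed here can be sketched roughly as follows. All names below are hypothetical, and `java.util.BitSet` with a JDK map stands in for the PR's CountedBitSet and HPPC collections: seen sequence numbers are tracked in fixed-size bitsets, and once a bitset is completely filled it is collapsed into a single key in a "completed" set so its storage can be freed.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the two-tier seen-set; not the PR's implementation.
class SeqNoSetSketch {
    static final int BIT_SET_SIZE = 1024;
    private final Set<Long> completedKeys = new HashSet<>();    // fully-seen 1024-ranges
    private final Map<Long, BitSet> ongoing = new HashMap<>();  // partially-seen ranges
    private final Map<Long, Integer> setCounts = new HashMap<>();

    /** Marks seqNo (assumed non-negative) as seen; returns whether it was seen before. */
    boolean getAndSet(long seqNo) {
        long key = seqNo / BIT_SET_SIZE;
        int bit = (int) (seqNo % BIT_SET_SIZE);
        if (completedKeys.contains(key)) {
            return true; // the whole 1024-range was already seen
        }
        BitSet bits = ongoing.computeIfAbsent(key, k -> new BitSet(BIT_SET_SIZE));
        boolean wasSet = bits.get(bit);
        if (wasSet == false) {
            bits.set(bit);
            if (setCounts.merge(key, 1, Integer::sum) == BIT_SET_SIZE) {
                // bitset is full: collapse it to one key and free the bitset
                ongoing.remove(key);
                setCounts.remove(key);
                completedKeys.add(key);
            }
        }
        return wasSet;
    }

    int completedSets() { return completedKeys.size(); }
    int ongoingSets() { return ongoing.size(); }
}
```

Exposing `completedSets()`/`ongoingSets()` mirrors the suggestion later in this review to assert on the number of completed sets in tests.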

static final class SeqNumSet {
    static final short BIT_SET_SIZE = 1024;
    private final LongSet topTier = new LongHashSet();
    private final LongObjectHashMap<CountedBitSet> bottomTier = new LongObjectHashMap<>();
Contributor

ongoingSets?

Contributor

incompleteBitSets?

Member Author

Done

* The number of operations that have been skipped in the snapshot so far.
* Unlike {@link #totalOperations()}, this value is updated each time {@link #next()} is called.
*/
default int skippedOperations() {
Contributor

I personally find skippedOperations a bit tricky, as we don't really skip operations; rather, we remove operations that have been overridden. Maybe "overriddenOperations", and update the javadocs to explain what overridden means?

Member Author

I pushed 5cc33bb

});
}

public void testTrackSeqNumDenseRanges() throws Exception {
Contributor

What does this test add compared to testTrackSeqNumRandomRanges? I think we can drop testTrackSeqNumRandomRanges and keep this one; it's better to have more collisions in testing.

public void testTrackSeqNumDenseRanges() throws Exception {
    final MultiSnapshot.SeqNumSet bitSet = new MultiSnapshot.SeqNumSet();
    final LongSet normalSet = new LongHashSet();
    IntStream.range(0, between(20_000, 50_000)).forEach(i -> {
Contributor

I think we can run up to 10K all the time? It's enough to show both collisions and non-collisions.

long currentSeq = between(10_000_000, 1_000_000_000);
final int iterations = between(100, 2000);
for (long i = 0; i < iterations; i++) {
    List<Long> batch = LongStream.range(currentSeq, currentSeq + between(1, 1000))
Contributor

Can we make sure batches sometimes cross the underlying 1024-bit bitset boundary, to check that we handle completed sets correctly? Maybe we can also expose the number of completed sets in SeqNumSet and test that we see the right number.

Contributor

👍 I'd also quite like to see some non-random tests that deliberately hit the corners, rather than relying on the random ones to find everything, and I also think it'd be useful to assert that there are the expected number of complete/incomplete sets.

Member Author

I pushed ea11a2a

final MultiSnapshot.SeqNumSet bitSet = new MultiSnapshot.SeqNumSet();
final LongSet normalSet = new LongHashSet();
long currentSeq = between(10_000_000, 1_000_000_000);
final int iterations = between(100, 2000);
Contributor

Maybe use scaledRandomIntBetween? That means faster runs in IntelliJ and better coverage in CI.

Member Author

It's good to know scaledRandomIntBetween. Thank you.


bleskes commented Nov 7, 2017

PS: we have a bug in how replicas deal with the translog when a replica detects that there's a new primary when an operation acquires a replica lock (as opposed to the cluster state coming in), causing duplicates to be in the same translog gen. @jasontedor is working on a fix, but it would be good to see that your test runs into that bug without his fix.

Scratch this PS. Jason looked more into this and concluded that the missing roll generation when processing a new term from the cluster state can't cause duplicates after all. (I know this is vague; I'll explain once we're both online.)

/**
* Sequence numbers from translog are likely to form contiguous ranges, thus using two tiers can reduce memory usage.
*/
static final class SeqNumSet {
Contributor

Could we call this SequenceNumberSet instead? Also, this idea feels more widely applicable. Can it be used elsewhere?

Member Author

This should be used by MultiSnapshot only.

}

/**
* Sequence numbers from translog are likely to form contiguous ranges, thus using two tiers can reduce memory usage.
Contributor

I'm not familiar with the scale of the numbers here, but I take it that memory usage is an important constraint? Without looking at how the sets all work in great detail, I guess this takes ~10x less memory; does that sound about right? Is that enough? Could you expand the comment to answer some of these questions?

Member Author

I have adjusted the comment in 5935009, but did not go into such detail.


@Override
public void describeTo(Description description) {

Contributor

Should this be empty?

Member Author

I pushed ebe92b6. Thank you.

while ((op = current.next()) != null) {
    if (op.seqNo() < 0 || seenSeqNo.getAndSet(op.seqNo()) == false) {
        return op;
    }else {
Contributor

Nit: spacing :)

Member Author

Done.


dnhatn commented Nov 8, 2017

@bleskes, @DaveCTurner I have added a replication test and addressed your feedback. Could you please take another look? Thank you!

import static org.hamcrest.CoreMatchers.equalTo;
import static org.hamcrest.Matchers.lessThanOrEqualTo;

public class MultiSnapshotTests extends ESTestCase {
Contributor

I'd still like to see a deterministic (i.e. no randomness) test that exercises the corners of SeqNumSet.

Member Author

@dnhatn dnhatn Nov 22, 2017


I've added 7d3482b

* Sequence numbers from translog are likely to form contiguous ranges,
* thus collapsing a completed bitset into a single entry will reduce memory usage.
*/
static final class SeqNumSet {
Contributor

Could we call this SequenceNumberSet instead?

Member Author

@dnhatn dnhatn Nov 22, 2017


This class is used only here, and a concise name works in a private context. Anyway, I renamed it to SeqNoSet in a3994f8, as we use SeqNo*, not SeqNum*, elsewhere.

@DaveCTurner
Contributor

Looks OK to me - I added a couple more comments as I think they were lost in your changes last time round. I don't feel qualified to give a full LGTM on this as I'm not familiar enough with this area yet. @ywelsch can you?


dnhatn commented Nov 16, 2017

@jasontedor, @bleskes With this PR we will read translog operations in reverse; however, this may cause the LocalCheckpointTracker to consume more memory than before when executing a peer recovery. Is it ok if I address this later in a follow-up PR using the seenSeqNo of the MultiSnapshot?

final LocalCheckpointTracker tracker = new LocalCheckpointTracker(startingSeqNo, startingSeqNo - 1);
try (Translog.Snapshot snapshot = shard.getTranslog().newSnapshotFromMinSeqNo(startingSeqNo)) {
    Translog.Operation operation;
    while ((operation = snapshot.next()) != null) {
        if (operation.seqNo() != SequenceNumbers.UNASSIGNED_SEQ_NO) {
            tracker.markSeqNoAsCompleted(operation.seqNo());
        }
    }
}
return tracker.getCheckpoint() >= endingSeqNo;
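The snippet above marks every seen seqNo and then reads back the tracker's checkpoint, i.e. the highest seqNo such that every lower seqNo was also marked. A minimal, hypothetical stand-in for that contract is sketched below; the real LocalCheckpointTracker uses bitset windows for memory efficiency (which is exactly the concern raised here), while this sketch uses a plain set for clarity.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical minimal checkpoint tracker: the checkpoint is the highest
// seqNo s such that every seqNo <= s has been marked as completed.
class CheckpointTrackerSketch {
    private final Set<Long> marked = new HashSet<>();
    private long checkpoint;

    CheckpointTrackerSketch(long initialCheckpoint) {
        this.checkpoint = initialCheckpoint; // typically startingSeqNo - 1
    }

    void markSeqNoAsCompleted(long seqNo) {
        marked.add(seqNo);
        // advance the checkpoint across any contiguous run of marked seqNos
        while (marked.contains(checkpoint + 1)) {
            checkpoint++;
        }
    }

    long getCheckpoint() {
        return checkpoint;
    }
}
```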


bleskes commented Nov 22, 2017

we will read translog operations in reverse; however, this may cause the LocalCheckpointTracker to consume more memory than before when executing a peer recovery.

We can indeed optimize as a follow-up. I do wonder if we should keep things simpler and use CountedBitSet until it's full, then free the underlying FixedBitSet and treat this case as "all set". WDYT? Maybe this is good enough for this PR too?
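A rough sketch of this suggestion, with hypothetical names and `java.util.BitSet` standing in for Lucene's FixedBitSet: count the set bits, and once the counter reaches the size, drop the backing bitset and answer "all set" from the counter alone.

```java
import java.util.BitSet;

// Hypothetical sketch: a counting bitset that releases its internal storage
// once every bit is set, afterwards answering "all set" without any bitset.
class CountedBitSetSketch {
    private final int size;
    private int setCount;
    private BitSet bits; // nulled once the set is full

    CountedBitSetSketch(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    boolean get(int index) {
        return bits == null || bits.get(index); // null means "all set"
    }

    void set(int index) {
        if (bits != null && bits.get(index) == false) {
            bits.set(index);
            if (++setCount == size) {
                bits = null; // all bits set: free the underlying storage
            }
        }
    }

    boolean isFull() {
        return bits == null;
    }
}
```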

@dnhatn dnhatn added v6.2.0 and removed v6.1.0 labels Nov 22, 2017
@dnhatn dnhatn merged commit a4b4e14 into elastic:master Nov 26, 2017
@dnhatn dnhatn deleted the dedup-translog branch November 26, 2017 21:44

dnhatn commented Nov 26, 2017

Thanks @bleskes and @DaveCTurner. I will add a follow-up to simplify the SeqNoSet.

dnhatn added a commit that referenced this pull request Nov 26, 2017
Currently, translog operations are read and processed one by one. This
may be a problem, as stale operations in translogs may suddenly reappear
in recoveries. To make sure that stale operations won't be processed, we
read the translog files in reverse order (i.e. from the most recent file
to the oldest file) and only process an operation if its sequence number
has not been seen before.

Relates to #10708
jasontedor added a commit that referenced this pull request Nov 27, 2017
* master:
  Skip shard refreshes if shard is `search idle` (#27500)
  Remove workaround in translog rest test (#27530)
  inner_hits: Return an empty _source for nested inner hit when filtering on a field that doesn't exist.
  percolator: Avoid TooManyClauses exception if number of terms / ranges is exactly equal to 1024
  Dedup translog operations by reading in reverse (#27268)
  Ensure logging is configured for CLI commands
  Ensure `doc_stats` are changing even if refresh is disabled (#27505)
  Fix classes that can exit
  Revert "Adjust CombinedDeletionPolicy for multiple commits (#27456)"
  Transpose expected and actual, and remove duplicate info from message. (#27515)
  [DOCS] Fixed broken link in breaking changes
jasontedor added a commit that referenced this pull request Nov 27, 2017
* 6.x:
  [DOCS] s/Spitting/Splitting in split index docs
  inner_hits: Return an empty _source for nested inner hit when filtering on a field that doesn't exist.
  percolator: Avoid TooManyClauses exception if number of terms / ranges is exactly equal to 1024
  Dedup translog operations by reading in reverse (#27268)
  Ensure logging is configured for CLI commands
  Ensure `doc_stats` are changing even if refresh is disabled (#27505)
  Fix classes that can exit
  Revert "Adjust CombinedDeletionPolicy for multiple commits (#27456)"
  Transpose expected and actual, and remove duplicate info from message.
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Nov 27, 2017
Today, we maintain two sets in a SeqNoSet: ongoing sets and completed
sets. We can remove the completed sets by releasing the internal bitset
of a CountedBitSet when all its bits are set.

Relates elastic#27268
dnhatn added a commit that referenced this pull request Dec 3, 2017
Today, we maintain two sets in a SeqNoSet: ongoing sets and completed
sets. We can remove the completed sets and use only the ongoing sets by
releasing the internal bitset of a CountedBitSet when all its bits are
set. This behaves like two sets, but is simpler. This commit also makes
CountedBitSet a drop-in replacement for BitSet.

Relates #27268
dnhatn added a commit that referenced this pull request Dec 3, 2017
Today, we maintain two sets in a SeqNoSet: ongoing sets and completed
sets. We can remove the completed sets and use only the ongoing sets by
releasing the internal bitset of a CountedBitSet when all its bits are
set. This behaves like two sets, but is simpler. This commit also makes
CountedBitSet a drop-in replacement for BitSet.

Relates #27268
@clintongormley clintongormley added :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. :Distributed/Engine Anything around managing Lucene and the Translog in an open shard. and removed :Translog :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. labels Feb 13, 2018
Labels
:Distributed/Engine Anything around managing Lucene and the Translog in an open shard. >enhancement v6.2.0 v7.0.0-beta1