
Dedup translog operations by reading in reverse #27268

Merged
merged 19 commits into from Nov 26, 2017

Conversation


@dnhatn dnhatn commented Nov 4, 2017

Currently, translog operations are read and processed one by one. This
may be a problem, as stale operations in translogs may suddenly reappear
in recoveries. To make sure that stale operations won't be processed, we
read the translog files in reverse order (i.e. from the most recent file
to the oldest file) and only process an operation if its sequence number
has not been seen before.

Relates to #10708
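The dedup idea described above can be sketched as a minimal, self-contained illustration. The `Operation` record and `dedup` method below are hypothetical names for illustration only (the real change lives in `MultiSnapshot`): iterate translog generations from newest to oldest and keep only the first occurrence of each sequence number, so stale copies in older generations are skipped.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Minimal sketch of dedup-by-reading-in-reverse; not the Elasticsearch API.
class TranslogDedupSketch {
    record Operation(long seqNo, String payload) {}

    // Generations are ordered newest-first; the first occurrence of a seqNo
    // wins, so stale copies in older generations are skipped.
    static List<Operation> dedup(List<List<Operation>> generationsNewestFirst) {
        Set<Long> seenSeqNos = new HashSet<>();
        List<Operation> kept = new ArrayList<>();
        for (List<Operation> generation : generationsNewestFirst) {
            for (Operation op : generation) {
                // a negative seqNo models UNASSIGNED_SEQ_NO: never deduped
                if (op.seqNo() < 0 || seenSeqNos.add(op.seqNo())) {
                    kept.add(op);
                }
            }
        }
        return kept;
    }
}
```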

@bleskes bleskes left a comment


I like this. I left some minor comments. I would also love to see a replication group test that creates collisions between old primary operations and new primary operations with the same sequence number, then performs a recovery and verifies that they don't re-appear.

You can look at RecoveryTests to see how to do these things. Another example is IndexLevelReplicationTests#testConflictingOpsOnReplica. I'm thinking something like:

  1. 3-shard group with some initial data.
  2. Remove a replica from the group and index an operation.
  3. Fail the primary and add the removed replica back into the group. Promote the replica just added to a primary.
  4. Index one more op to the group. This will create a collision.
  5. Make sure that peer recovery from the 3rd replica (the one that was never removed) works. Note that you'll need to make that shard a primary.

PS: we have a bug in how replicas deal with the translog when a replica detects that there's a new primary when an operation acquires a replica lock (as opposed to the cluster state coming in), causing duplicates to be in the same translog gen. @jasontedor is working on a fix, but it would be good to see that your test runs into that bug without his fix.

Feel free to reach out. The test I suggested can be complex to set up.

Translog.Operation op;
while ((op = current.next()) != null) {
    if (op.seqNo() < 0 || seenSeqNo.getAndSet(op.seqNo()) == false) {
        return op;
Contributor

Can we check that the seqNo() is not set (UNASSIGNED_SEQ_NO) instead of using a generic < 0?

Member Author

Done.

*/
static final class SeqNumSet {
    static final short BIT_SET_SIZE = 1024;
    private final LongSet topTier = new LongHashSet();
Contributor

These are not always the top tier; can we instead call it something like completedBitSets?

Member Author

I was implementing a multi-level bitset structure: when a bitset at the lower level is completed, we move it to the higher level as a single bit in another bitset, and so on. However, I preferred a simpler solution. These names were derived from that structure and are no longer valid. I pushed 5935009.
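The two-tier idea discussed here can be sketched roughly as follows. All names below are hypothetical, and `java.util.BitSet` with a JDK map stands in for the PR's CountedBitSet and HPPC collections: seen sequence numbers are tracked in fixed-size bitsets, and once a bitset is completely filled it is collapsed into a single key in a "completed" set so its storage can be freed.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the two-tier seen-set; not the PR's implementation.
class SeqNoSetSketch {
    static final int BIT_SET_SIZE = 1024;
    private final Set<Long> completedKeys = new HashSet<>();    // fully-seen 1024-ranges
    private final Map<Long, BitSet> ongoing = new HashMap<>();  // partially-seen ranges
    private final Map<Long, Integer> setCounts = new HashMap<>();

    /** Marks seqNo (assumed non-negative) as seen; returns whether it was seen before. */
    boolean getAndSet(long seqNo) {
        long key = seqNo / BIT_SET_SIZE;
        int bit = (int) (seqNo % BIT_SET_SIZE);
        if (completedKeys.contains(key)) {
            return true; // the whole 1024-range was already seen
        }
        BitSet bits = ongoing.computeIfAbsent(key, k -> new BitSet(BIT_SET_SIZE));
        boolean wasSet = bits.get(bit);
        if (wasSet == false) {
            bits.set(bit);
            if (setCounts.merge(key, 1, Integer::sum) == BIT_SET_SIZE) {
                // bitset is full: collapse it to one key and free the bitset
                ongoing.remove(key);
                setCounts.remove(key);
                completedKeys.add(key);
            }
        }
        return wasSet;
    }

    int completedSets() { return completedKeys.size(); }
    int ongoingSets() { return ongoing.size(); }
}
```

Exposing `completedSets()`/`ongoingSets()` mirrors the suggestion later in this review to assert on the number of completed sets in tests.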

static final class SeqNumSet {
    static final short BIT_SET_SIZE = 1024;
    private final LongSet topTier = new LongHashSet();
    private final LongObjectHashMap<CountedBitSet> bottomTier = new LongObjectHashMap<>();
Contributor

ongoingSets?

Contributor

incompleteBitSets?

Member Author

Done

* The number of operations that have been skipped in the snapshot so far.
* Unlike {@link #totalOperations()}, this value is updated each time {@link #next()} is called.
*/
default int skippedOperations() {
Contributor

I personally find skippedOperations a bit tricky, as we don't really skip operations; rather, we remove operations that have been overridden. Maybe "overriddenOperations", and update the javadocs to explain what overridden means?

Member Author

I pushed 5cc33bb

});
}

public void testTrackSeqNumDenseRanges() throws Exception {
Contributor

What does this test add compared to testTrackSeqNumRandomRanges? I think we can drop testTrackSeqNumRandomRanges and keep this one; it's better to have more collisions in testing.

public void testTrackSeqNumDenseRanges() throws Exception {
    final MultiSnapshot.SeqNumSet bitSet = new MultiSnapshot.SeqNumSet();
    final LongSet normalSet = new LongHashSet();
    IntStream.range(0, between(20_000, 50_000)).forEach(i -> {
Contributor

I think we can run up to 10K all the time? It's enough to show both collisions and non-collisions.

long currentSeq = between(10_000_000, 1_000_000_000);
final int iterations = between(100, 2000);
for (long i = 0; i < iterations; i++) {
    List<Long> batch = LongStream.range(currentSeq, currentSeq + between(1, 1000))
Contributor

Can we make sure batches sometimes cross the underlying 1024-bit bitset boundary, to check that we handle completed sets correctly? Maybe we can also expose the number of completed sets in SeqNumSet and test that we see the right number.

Contributor

👍 I'd also quite like to see some non-random tests that deliberately hit the corners, rather than relying on the random ones to find everything, and I also think it'd be useful to assert that there are the expected number of complete/incomplete sets.

Member Author

I pushed ea11a2a

final MultiSnapshot.SeqNumSet bitSet = new MultiSnapshot.SeqNumSet();
final LongSet normalSet = new LongHashSet();
long currentSeq = between(10_000_000, 1_000_000_000);
final int iterations = between(100, 2000);
Contributor

Maybe use scaledRandomIntBetween? That means faster runs in IntelliJ and better coverage in CI.

Member Author

It's good to know scaledRandomIntBetween. Thank you.


bleskes commented Nov 7, 2017

PS: we have a bug in how replicas deal with the translog when a replica detects that there's a new primary when an operation acquires a replica lock (as opposed to the cluster state coming in), causing duplicates to be in the same translog gen. @jasontedor is working on a fix, but it would be good to see that your test runs into that bug without his fix.

Scratch this PS. Jason looked more into this and concluded that the missing roll generation when processing a new term from the cluster state can't cause duplicates after all. (I know this is vague; I'll explain once we're both online.)

/**
* Sequence numbers from translog are likely to form contiguous ranges, thus using two tiers can reduce memory usage.
*/
static final class SeqNumSet {
Contributor

Could we call this SequenceNumberSet instead? Also, this idea feels more widely applicable. Can it be used elsewhere?

Member Author

This should be used by MultiSnapshot only.

}

/**
* Sequence numbers from translog are likely to form contiguous ranges, thus using two tiers can reduce memory usage.
Contributor

I'm not familiar with the scale of the numbers here, but I take it that memory usage is an important constraint? Without looking at how the sets all work in great detail, I guess this takes ~10x less memory; does that sound about right? Is that enough? Could you expand the comment to answer some of these questions?

Member Author

I have adjusted the comment in 5935009, but did not go into such detail.


@Override
public void describeTo(Description description) {

Contributor

Should this be empty?

Member Author

I pushed ebe92b6. Thank you.

while ((op = current.next()) != null) {
    if (op.seqNo() < 0 || seenSeqNo.getAndSet(op.seqNo()) == false) {
        return op;
    }else {
Contributor

Nit: spacing :)

Member Author

Done.


dnhatn commented Nov 8, 2017

@bleskes, @DaveCTurner I have added a replication test and addressed your feedback. Could you please take another look? Thank you!

import static org.hamcrest.CoreMatchers.equalTo;
import static org.hamcrest.Matchers.lessThanOrEqualTo;

public class MultiSnapshotTests extends ESTestCase {
Contributor

I'd still like to see a deterministic (i.e. no randomness) test that exercises the corners of SeqNumSet.

Member Author

@dnhatn dnhatn Nov 22, 2017


I've added 7d3482b

* Sequence numbers from translog are likely to form contiguous ranges,
* thus collapsing a completed bitset into a single entry will reduce memory usage.
*/
static final class SeqNumSet {
Contributor

Could we call this SequenceNumberSet instead?

Member Author

@dnhatn dnhatn Nov 22, 2017


This class is used only here, and a concise name works in a private context. Anyway, I renamed it to SeqNoSet in a3994f8, as we use SeqNo*, not SeqNum*, elsewhere.

@DaveCTurner
Contributor

Looks OK to me - I added a couple more comments as I think they were lost in your changes last time round. I don't feel qualified to give a full LGTM on this as I'm not familiar enough with this area yet. @ywelsch can you?


dnhatn commented Nov 16, 2017

@jasontedor, @bleskes With this PR we will read translog operations in reverse; however, this may cause the LocalCheckpointTracker to consume more memory than before when executing a peer recovery. Is it ok if I address this later in a follow-up PR using the seenSeqNo of the MultiSnapshot?

final LocalCheckpointTracker tracker = new LocalCheckpointTracker(startingSeqNo, startingSeqNo - 1);
try (Translog.Snapshot snapshot = shard.getTranslog().newSnapshotFromMinSeqNo(startingSeqNo)) {
    Translog.Operation operation;
    while ((operation = snapshot.next()) != null) {
        if (operation.seqNo() != SequenceNumbers.UNASSIGNED_SEQ_NO) {
            tracker.markSeqNoAsCompleted(operation.seqNo());
        }
    }
}
return tracker.getCheckpoint() >= endingSeqNo;
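The snippet above marks every seen seqNo and then reads back the tracker's checkpoint, i.e. the highest seqNo such that every lower seqNo was also marked. A minimal, hypothetical stand-in for that contract is sketched below; the real LocalCheckpointTracker uses bitset windows for memory efficiency (which is exactly the concern raised here), while this sketch uses a plain set for clarity.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical minimal checkpoint tracker: the checkpoint is the highest
// seqNo s such that every seqNo <= s has been marked as completed.
class CheckpointTrackerSketch {
    private final Set<Long> marked = new HashSet<>();
    private long checkpoint;

    CheckpointTrackerSketch(long initialCheckpoint) {
        this.checkpoint = initialCheckpoint; // typically startingSeqNo - 1
    }

    void markSeqNoAsCompleted(long seqNo) {
        marked.add(seqNo);
        // advance the checkpoint across any contiguous run of marked seqNos
        while (marked.contains(checkpoint + 1)) {
            checkpoint++;
        }
    }

    long getCheckpoint() {
        return checkpoint;
    }
}
```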


bleskes commented Nov 22, 2017

we will read translog operations in reverse; however, this may cause the LocalCheckpointTracker to consume more memory than before when executing a peer recovery.

We can indeed optimize as a follow-up. I do wonder if we should keep things simpler and use CountedBitSet until it's full, then free the underlying FixedBitSet and treat this case as "all set". WDYT? Maybe this is good enough for this PR too?
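A rough sketch of this suggestion, with hypothetical names and `java.util.BitSet` standing in for Lucene's FixedBitSet: count the set bits, and once the counter reaches the size, drop the backing bitset and answer "all set" from the counter alone.

```java
import java.util.BitSet;

// Hypothetical sketch: a counting bitset that releases its internal storage
// once every bit is set, afterwards answering "all set" without any bitset.
class CountedBitSetSketch {
    private final int size;
    private int setCount;
    private BitSet bits; // nulled once the set is full

    CountedBitSetSketch(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    boolean get(int index) {
        return bits == null || bits.get(index); // null means "all set"
    }

    void set(int index) {
        if (bits != null && bits.get(index) == false) {
            bits.set(index);
            if (++setCount == size) {
                bits = null; // all bits set: free the underlying storage
            }
        }
    }

    boolean isFull() {
        return bits == null;
    }
}
```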

@dnhatn dnhatn added v6.2.0 and removed v6.1.0 labels Nov 22, 2017
@dnhatn dnhatn merged commit a4b4e14 into elastic:master Nov 26, 2017
@dnhatn dnhatn deleted the dedup-translog branch November 26, 2017 21:44

dnhatn commented Nov 26, 2017

Thanks @bleskes and @DaveCTurner. I will add a follow-up to simplify the SeqNoSet.

dnhatn added a commit that referenced this pull request Nov 26, 2017
Currently, translog operations are read and processed one by one. This
may be a problem, as stale operations in translogs may suddenly reappear
in recoveries. To make sure that stale operations won't be processed, we
read the translog files in reverse order (i.e. from the most recent file
to the oldest file) and only process an operation if its sequence number
has not been seen before.

Relates to #10708
jasontedor added a commit that referenced this pull request Nov 27, 2017
* master:
  Skip shard refreshes if shard is `search idle` (#27500)
  Remove workaround in translog rest test (#27530)
  inner_hits: Return an empty _source for nested inner hit when filtering on a field that doesn't exist.
  percolator: Avoid TooManyClauses exception if number of terms / ranges is exactly equal to 1024
  Dedup translog operations by reading in reverse (#27268)
  Ensure logging is configured for CLI commands
  Ensure `doc_stats` are changing even if refresh is disabled (#27505)
  Fix classes that can exit
  Revert "Adjust CombinedDeletionPolicy for multiple commits (#27456)"
  Transpose expected and actual, and remove duplicate info from message. (#27515)
  [DOCS] Fixed broken link in breaking changes
jasontedor added a commit that referenced this pull request Nov 27, 2017
* 6.x:
  [DOCS] s/Spitting/Splitting in split index docs
  inner_hits: Return an empty _source for nested inner hit when filtering on a field that doesn't exist.
  percolator: Avoid TooManyClauses exception if number of terms / ranges is exactly equal to 1024
  Dedup translog operations by reading in reverse (#27268)
  Ensure logging is configured for CLI commands
  Ensure `doc_stats` are changing even if refresh is disabled (#27505)
  Fix classes that can exit
  Revert "Adjust CombinedDeletionPolicy for multiple commits (#27456)"
  Transpose expected and actual, and remove duplicate info from message.
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Nov 27, 2017
Today, we maintain two sets in a SeqNoSet: ongoing sets and completed
sets. We can remove the completed sets by releasing the internal bitset
of a CountedBitSet when all its bits are set.

Relates elastic#27268
dnhatn added a commit that referenced this pull request Dec 3, 2017
Today, we maintain two sets in a SeqNoSet: ongoing sets and completed
sets. We can remove the completed sets and use only the ongoing sets by
releasing the internal bitset of a CountedBitSet when all its bits are
set. This behaves like two sets, but is simpler. This commit also makes
CountedBitSet a drop-in replacement for BitSet.

Relates #27268
dnhatn added a commit that referenced this pull request Dec 3, 2017
Today, we maintain two sets in a SeqNoSet: ongoing sets and completed
sets. We can remove the completed sets and use only the ongoing sets by
releasing the internal bitset of a CountedBitSet when all its bits are
set. This behaves like two sets, but is simpler. This commit also makes
CountedBitSet a drop-in replacement for BitSet.

Relates #27268
@clintongormley clintongormley added :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. :Distributed/Engine Anything around managing Lucene and the Translog in an open shard. and removed :Translog :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. labels Feb 13, 2018
Labels
:Distributed/Engine Anything around managing Lucene and the Translog in an open shard. >enhancement v6.2.0 v7.0.0-beta1