Introduce random-access translog #28205

Closed
wants to merge 1 commit into base: ccr from jasontedor:random-access-translog

Conversation

6 participants
@jasontedor
Member

jasontedor commented Jan 13, 2018

Today, the only way of reading the translog is by taking a snapshot and consuming the snapshot using a forward-only iterator. If the consumer only wants a portion of the translog, they might read large portions of the translog off of disk until they find the portion that they are looking for. If this consumer later returns to the translog to consume a following portion, they will be forced to read the head of the translog over and over. This can be expensive and is unfriendly to caches. Thus, it appears that some consumers of the translog would benefit from optimized access to specific operations in the translog.

For example, one use case could be recovery where the recovery source only wants to send operations above a certain sequence number to the recovery target. Yet, a snapshot covering this range of operations might be prefixed by a large number of operations that they would have to read before they can start sending to the target.

This commit introduces a random-access snapshot of the translog. This allows a consumer to read from disk only the operations that they are concerned with, ignoring all other operations. While this does not give the optimal (sequential) disk read pattern, it will be fairly close as the translog tends to be roughly ordered. To take a random-access snapshot of the translog, we lazily build per-generation indices over the open readers mapping sequence numbers to their position in the generation. For the current writer, we maintain this index as we go, and this index is folded over when the writer is folded into a reader. These per-generation maps are trimmed when a reader is trimmed.
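To make the idea concrete, here is a minimal sketch of such a per-generation index. The class and method names are illustrative only, not the actual Elasticsearch implementation: the writer records each operation's file position as it goes, lookups answer random-access reads, and entries are dropped when a reader is trimmed.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a per-generation index mapping sequence numbers to
// file positions. Illustrative names; not the actual Elasticsearch classes.
final class GenerationIndex {
    private final Map<Long, Long> seqNoToPosition = new HashMap<>();

    // Called by the writer as each operation is appended to the generation.
    void index(long seqNo, long position) {
        seqNoToPosition.put(seqNo, position);
    }

    // Random access: the file position of the operation with the given
    // sequence number, or null if this generation does not contain it.
    Long position(long seqNo) {
        return seqNoToPosition.get(seqNo);
    }

    // When the translog is trimmed above a sequence number, drop stale entries.
    void trimAbove(long maxSeqNo) {
        seqNoToPosition.keySet().removeIf(seqNo -> seqNo > maxSeqNo);
    }
}
```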

We are not attempting an optimal implementation in this pull request. For example, it seems reasonable to believe that such an index could benefit from delta compression due to the rough ordering of the translog by sequence number. We will consider such optimizations in follow-ups. Additionally, we do not make any applications of the random-access translog in this changeset; we leave that for follow-ups.
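As a hedged illustration of the delta-compression idea mentioned above (not part of this changeset): because the translog is roughly ordered by sequence number, consecutive file positions are close together, so the deltas between them are small and would lend themselves to variable-length encoding. A toy round-trip codec:

```java
// Illustrative only: delta-encode an array of file positions sorted by
// sequence number. Small deltas are cheap to variable-length encode.
final class DeltaCodec {
    static long[] encode(long[] positions) {
        long[] deltas = new long[positions.length];
        long previous = 0L;
        for (int i = 0; i < positions.length; i++) {
            deltas[i] = positions[i] - previous; // small when positions cluster
            previous = positions[i];
        }
        return deltas;
    }

    static long[] decode(long[] deltas) {
        long[] positions = new long[deltas.length];
        long previous = 0L;
        for (int i = 0; i < deltas.length; i++) {
            previous += deltas[i];
            positions[i] = previous;
        }
        return positions;
    }
}
```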

@jasontedor jasontedor requested review from martijnvg and s1monw Jan 13, 2018

@s1monw

I don't think we should make something like RandomAccessSnapshot a first-class citizen. I'd rather build the most basic primitive for this into the TranslogSnapshot, something like this:

public void seek(Translog.Location location);

and make the TranslogSnapshot#position part of the TranslogOperation so we can call Translog.Operation#getLocation(). That means we can always seek to an operation and we can be sure that it's valid for this transaction log. The caller side can then decide (not part of this PR IMO) how to maintain a map from whatever (in this case SeqId) to the Translog.Location. Instead of a dense map, a caller might, for instance, hold an LRU map of locations in order to resume reading the transaction log.

Yet, a dense map might be still possible. If it turns out to be too heavy to hold Location objects we can still look into making it more efficient. What do you think?
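A rough sketch of the caller-side cache described in this proposal, under the assumption that the translog only exposes seek(Translog.Location). The Location record here is a stand-in for Translog.Location, and LocationCache is a hypothetical name; the LRU behavior comes from an access-ordered LinkedHashMap.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical consumer-side cache: a bounded LRU map from sequence number
// to a Location-like handle, used to seek the snapshot when resuming.
final class LocationCache {
    // Stand-in for Translog.Location: generation + position within it.
    record Location(long generation, long position) {}

    private final Map<Long, Location> cache;

    LocationCache(final int capacity) {
        // Access-ordered LinkedHashMap that evicts the least recently used entry.
        this.cache = new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, Location> eldest) {
                return size() > capacity;
            }
        };
    }

    void remember(long seqNo, Location location) {
        cache.put(seqNo, location);
    }

    // Best effort: a remembered location to seek to, or null on a cache miss.
    Location lookup(long seqNo) {
        return cache.get(seqNo);
    }
}
```

On a miss the consumer would fall back to scanning forward from the nearest remembered location, which is why a sparse LRU map can suffice where a dense map would be heavy.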

@bleskes

Member

bleskes commented Jan 15, 2018

@s1monw what you say sounds appealing, but I think it will be tricky (I don't see how to do it in a sane way) to maintain the collision-repair-on-read logic we have in the snapshot. Let me explain: it is possible that the translog contains two operations with the same seq no. This can happen after a primary failure where the old primary generated an op, managed to get it to some replicas, and then failed. If the new primary didn't have the said op, it may generate a different op with the same seq no and send it to the replicas. The new op (with a higher term) should override the old op. We currently do it as follows:

  1. We guarantee that no collision can happen in a single translog generation file.
  2. We repair this on read (because we rarely need to and it's never performance sensitive) by letting the translog snapshot read the generations in reverse.
  3. We maintain a bit array of all seq# seen in higher generations and use that to drop the old op when we encounter it.

Introducing a seek command violates the validity of the bit array, and this makes the snapshot dangerous to use (you need to guarantee that you maintain the right location for each seq no, and you should never call next more than once).
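The three steps above can be sketched as follows. This is an illustrative toy (the Op record, class name, and int-sized seq nos are assumptions for the example, not the actual Elasticsearch code): generations are consumed newest-first, a bit set records every seq no already seen, and an older operation whose seq no was already seen is dropped.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Illustrative sketch of repair-on-read deduplication across generations.
final class TranslogDedup {
    record Op(long seqNo, String payload) {}

    // generationsNewestFirst: one list of operations per generation file,
    // ordered from the newest generation to the oldest.
    static List<Op> dedup(List<List<Op>> generationsNewestFirst) {
        BitSet seen = new BitSet();
        List<Op> live = new ArrayList<>();
        for (List<Op> generation : generationsNewestFirst) {
            for (Op op : generation) {
                int seqNo = Math.toIntExact(op.seqNo()); // sketch assumes small seq nos
                if (!seen.get(seqNo)) {
                    seen.set(seqNo); // first sighting wins, i.e. the newest copy
                    live.add(op);
                }
            }
        }
        return live;
    }
}
```

Seeking arbitrarily breaks this scheme because the bit set is only valid if every higher generation was fully consumed first.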

@s1monw

Contributor

s1monw commented Jan 16, 2018

I did look closer at some of the guarantees our transaction log gives the API consumer, and I think we have some serious flaws committed to this part of the codebase. We literally turn a List into an ordered Set, which is totally outside of the responsibilities of the transaction log. The guarantees that this deduplication provides have such a big impact on the transaction log that I think we need to move them out of it ASAP. This is literally a blocker for me to move forward on any related code. If there is a consumer of the transaction log that needs these guarantees, then it should be built on top of the stream of changes.

@bleskes

Member

bleskes commented Jan 16, 2018

Thanks @s1monw for the review. I agree we should resolve this before proceeding.

We literally turns a List into an ordered Set

A small correction here - 6.0 turned an append-only tape into a limited list where you can write multiple times into the same position and still only read in an unordered way.

This is literally a blocker for me to move forwards on any related code

I clarified with Simon on a different channel. All work scheduled for 6.2 can still proceed as planned. New features beyond that should wait until we resolve this.

@martijnvg

LGTM

@s1monw

Contributor

s1monw commented Jan 17, 2018

A small correction here - 6.0 turned an append-only tape into a limited list where you can write multiple times into the same position and still only read in an unordered way.

I am sorry about the confusion; when I compared Set to List I was comparing the uniqueness properties of the data structure. It's really a stream that is now deduplicated, which I think needs to be fixed on a different level.

@jasontedor

Member

jasontedor commented Mar 6, 2018

We are going to take a different approach, closing.

@jasontedor jasontedor closed this Mar 6, 2018

@jasontedor jasontedor deleted the jasontedor:random-access-translog branch Mar 6, 2018

@danopia

danopia commented Apr 11, 2018

Is this different approach tracked by another issue/PR?

@bleskes

Member

bleskes commented Apr 12, 2018

@danopia it's going to take multiple PRs and relies on storing historical document copies in Lucene for a limited time, using a new Lucene feature called soft deletes. See #29458 for the first step of many.
