Introduce random-access translog #28205

Closed
wants to merge 1 commit into base: ccr from jasontedor:random-access-translog

Conversation

6 participants
@jasontedor
Member

jasontedor commented Jan 13, 2018

Today, the only way of reading the translog is by taking a snapshot and consuming the snapshot using a forward-only iterator. If the consumer only wants a portion of the translog, they might read large portions of the translog off of disk until they find the portion that they are looking for. If this consumer later returns to the translog to consume a following portion, they will be forced to read the head of the translog over and over. This can be expensive and is unfriendly to caches. Thus, it appears that some consumers of the translog would benefit from optimized access to specific operations in the translog.

For example, one use case could be recovery where the recovery source only wants to send operations above a certain sequence number to the recovery target. Yet, a snapshot covering this range of operations might be prefixed by a large number of operations that they would have to read before they can start sending to the target.

This commit introduces a random-access snapshot of the translog. This allows a consumer to read from disk only the operations that they are concerned with, ignoring all other operations. While this does not give the optimal (sequential) disk read pattern, it will be fairly close as the translog tends to be roughly ordered. To take a random-access snapshot of the translog, we lazily build per-generation indices over the open readers mapping sequence numbers to their position in the generation. For the current writer, we maintain this index as we go, and this index is folded over when the writer is folded into a reader. These per-generation maps are trimmed when a reader is trimmed.
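To make the idea concrete, here is a minimal sketch of such a per-generation index. The class and method names are illustrative only, not the actual Elasticsearch implementation: the writer records each operation's file position as it goes, lookups answer random-access reads, and entries are dropped when a reader is trimmed.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a per-generation index mapping sequence numbers to
// file positions. Illustrative names; not the actual Elasticsearch classes.
final class GenerationIndex {
    private final Map<Long, Long> seqNoToPosition = new HashMap<>();

    // Called by the writer as each operation is appended to the generation.
    void index(long seqNo, long position) {
        seqNoToPosition.put(seqNo, position);
    }

    // Random access: the file position of the operation with the given
    // sequence number, or null if this generation does not contain it.
    Long position(long seqNo) {
        return seqNoToPosition.get(seqNo);
    }

    // When the translog is trimmed above a sequence number, drop stale entries.
    void trimAbove(long maxSeqNo) {
        seqNoToPosition.keySet().removeIf(seqNo -> seqNo > maxSeqNo);
    }
}
```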

We are not attempting an optimal implementation in this pull request. For example, it seems reasonable to believe that such an index could benefit from delta compression due to the rough ordering of the translog by sequence number. We will consider such optimizations in follow-ups. Additionally, we do not make any applications of the random-access translog in this changeset; we leave that for follow-ups.
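As a hedged illustration of the delta-compression idea mentioned above (not part of this changeset): because the translog is roughly ordered by sequence number, consecutive file positions are close together, so the deltas between them are small and would lend themselves to variable-length encoding. A toy round-trip codec:

```java
// Illustrative only: delta-encode an array of file positions sorted by
// sequence number. Small deltas are cheap to variable-length encode.
final class DeltaCodec {
    static long[] encode(long[] positions) {
        long[] deltas = new long[positions.length];
        long previous = 0L;
        for (int i = 0; i < positions.length; i++) {
            deltas[i] = positions[i] - previous; // small when positions cluster
            previous = positions[i];
        }
        return deltas;
    }

    static long[] decode(long[] deltas) {
        long[] positions = new long[deltas.length];
        long previous = 0L;
        for (int i = 0; i < deltas.length; i++) {
            previous += deltas[i];
            positions[i] = previous;
        }
        return positions;
    }
}
```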

@jasontedor jasontedor requested review from martijnvg and s1monw Jan 13, 2018

@s1monw

I don't think we should make something like RandomAccessSnapshot a first-class citizen. I'd rather build the most basic primitive for this into the TranslogSnapshot, something like this:

public void seek(Translog.Location location);

and make the TranslogSnapshot#position part of the TranslogOperation so we can call Translog.Operation#getLocation(). That means we can always seek to an operation and we can be sure that it's valid for this transaction log. The caller side can then decide (not part of this PR IMO) how to maintain a map from whatever (in this case SeqId) to the Translog.Location. Instead of a dense map, a caller might, for instance, hold an LRU map of locations in order to resume reading the transaction log.

Yet, a dense map might be still possible. If it turns out to be too heavy to hold Location objects we can still look into making it more efficient. What do you think?
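A rough sketch of the caller-side cache described in this proposal, under the assumption that the translog only exposes seek(Translog.Location). The Location record here is a stand-in for Translog.Location, and LocationCache is a hypothetical name; the LRU behavior comes from an access-ordered LinkedHashMap.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical consumer-side cache: a bounded LRU map from sequence number
// to a Location-like handle, used to seek the snapshot when resuming.
final class LocationCache {
    // Stand-in for Translog.Location: generation + position within it.
    record Location(long generation, long position) {}

    private final Map<Long, Location> cache;

    LocationCache(final int capacity) {
        // Access-ordered LinkedHashMap that evicts the least recently used entry.
        this.cache = new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, Location> eldest) {
                return size() > capacity;
            }
        };
    }

    void remember(long seqNo, Location location) {
        cache.put(seqNo, location);
    }

    // Best effort: a remembered location to seek to, or null on a cache miss.
    Location lookup(long seqNo) {
        return cache.get(seqNo);
    }
}
```

On a miss the consumer would fall back to scanning forward from the nearest remembered location, which is why a sparse LRU map can suffice where a dense map would be heavy.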

@bleskes

Member

bleskes commented Jan 15, 2018

@s1monw what you say sounds appealing, but I think it will be tricky (I don't see how to do it in a sane way) to maintain the collision-repair-on-read logic we have in the snapshot. Let me explain: it is possible that the translog contains two operations with the same seq no. This can happen after a primary failure where the old primary generated an op, managed to get it to some replicas, and then failed. If the new primary didn't have the said op, it may generate a different op with the same seq no and send it to the replicas. The new op (with a higher term) should override the old op. We currently do it as follows:

  1. We guarantee that no collision can happen in a single translog generation file.
  2. We repair this on read (because we rarely need to and it's never performance sensitive) by letting the translog snapshot read the generations in reverse.
  3. We maintain a bit array of all seq# seen in higher generations and use that to drop the old op when we encounter it.

Introducing a seek command violates the validity of the bit array, and this makes the snapshot dangerous to use (you need to guarantee that you maintain the right location for each seq no, and you should never call next more than once).
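The three steps above can be sketched as follows. This is an illustrative toy (the Op record, class name, and int-sized seq nos are assumptions for the example, not the actual Elasticsearch code): generations are consumed newest-first, a bit set records every seq no already seen, and an older operation whose seq no was already seen is dropped.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Illustrative sketch of repair-on-read deduplication across generations.
final class TranslogDedup {
    record Op(long seqNo, String payload) {}

    // generationsNewestFirst: one list of operations per generation file,
    // ordered from the newest generation to the oldest.
    static List<Op> dedup(List<List<Op>> generationsNewestFirst) {
        BitSet seen = new BitSet();
        List<Op> live = new ArrayList<>();
        for (List<Op> generation : generationsNewestFirst) {
            for (Op op : generation) {
                int seqNo = Math.toIntExact(op.seqNo()); // sketch assumes small seq nos
                if (!seen.get(seqNo)) {
                    seen.set(seqNo); // first sighting wins, i.e. the newest copy
                    live.add(op);
                }
            }
        }
        return live;
    }
}
```

Seeking arbitrarily breaks this scheme because the bit set is only valid if every higher generation was fully consumed first.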

@s1monw

Contributor

s1monw commented Jan 16, 2018

I did look closer at some of the guarantees our transaction log gives the API consumer, and I think we have some serious flaws committed to this part of the codebase. We literally turn a List into an ordered Set, which is totally outside of the responsibilities of the transaction log. The guarantees that this deduplication provides have such a big impact on the transaction log that I think we need to move them out of it ASAP. This is literally a blocker for me to move forward on any related code. If there is a consumer of the transaction log that needs these guarantees, then it should be built on top of the stream of changes.

@bleskes

Member

bleskes commented Jan 16, 2018

Thanks @s1monw for the review. I agree we should resolve this before proceeding.

We literally turns a List into an ordered Set

A small correction here - 6.0 turned an append-only tape into a limited list where you can write multiple times into the same position and still only read in an unordered way.

This is literally a blocker for me to move forwards on any related code

I clarified with Simon on a different channel. All work scheduled for 6.2 can still proceed as planned. New features beyond that should wait until we resolve this.

@martijnvg

LGTM

@s1monw

Contributor

s1monw commented Jan 17, 2018

A small correction here - 6.0 turned an append-only tape into a limited list where you can write multiple times into the same position and still only read in an unordered way.

I am sorry about the confusion; when I compared Set to List I was comparing the uniqueness properties of the data structure. It's really a stream that is now deduplicated, which I think needs to be fixed on a different level.

@jasontedor

Member

jasontedor commented Mar 6, 2018

We are going to take a different approach, closing.

@jasontedor jasontedor closed this Mar 6, 2018

@jasontedor jasontedor deleted the jasontedor:random-access-translog branch Mar 6, 2018

@danopia

danopia commented Apr 11, 2018

Is this different approach tracked by another issue/PR?

@bleskes

Member

bleskes commented Apr 12, 2018

@danopia it's going to take multiple PRs and relies on storing historical document copies in Lucene for a limited time, using a new Lucene feature called soft deletes. See #29458 for the first step of many.
