Join GitHub today
GitHub is home to over 28 million developers working together to host and review code, manage projects, and build software together.Sign up
Introduce random-access translog #28205
Today, the only way of reading the translog is by taking a snapshot and consuming the snapshot using a forward-only iterator. If the consumer only wants a portion of the translog, they might read large portions of the translog off of disk until they find the portion that they are looking for. If this consumer returns to the translog to later consume a following portion of the translog, they will be forced to repeatedly read the head of the translog over and over. This can be expensive, and is unfriendly to caches. Thus, it appears that some consumers of the translog would benefit from optimized access to specific operations in the translog.
For example, one use case could be recovery where the recovery source only wants to send operations above a certain sequence number to the recovery target. Yet, a snapshot covering this range of operations might be prefixed by a large number of operations that they would have to read before they can start sending to the target.
This commit introduces a random-access snapshot of the translog. This allows a consumer to read from disk only the operations that they are concerned with, ignoring all other operations. While this does not give the most optimal disk read pattern (sequential), it will be fairly close as the translog tends to be roughly ordered. To take a random-access snapshot of the translog, we lazily build per-generation indices over the open readers mapping sequence numbers to their position in the generation. For the current writer, we maintain this index as we go and this index is folded over when the writer is folded into a reader. These per-generation maps are trimmed when a reader is trimmed.
We are not attempting to make the most optimal implementation in this pull request. For example, it seems reasonable to believe that such an index could benefit from delta compression due to the rough ordering of the translog by sequence number. We will consider such optimizations in follow-ups. Additionally, we do not make any applications of the random-access translog in this changeset; we leave that for follow-ups.
I don't think we should make something like
RandomAccessSnapshot a first class citizen. I'd rather build the most basic primitive for this into the
TranslogSnapshot something like this:
public void seek(Translog.Location location);
and make the
TranslogSnapshot#position part of the
TranslogOperation so we can call
Translog.Operation#getLocation(). That means we can always seek to an operation and we can be sure that it's valid for this transaction log. The caller side can then decide (not part of this PR IMO) how to maintain a map from whatever (in this case SeqId) to the
Translog.Location Instead of a dense map a caller might hold a map of LRU operations for instance in order to resume reading the transaction log etc.
Yet, a dense map might be still possible. If it turns out to be too heavy to hold
Location objects we can still look into making it more efficient. What do you think?
@s1monw what you say sound appealing, but I think it will be tricky (I don't see how to do it in a sane way) to maintain the collision repair on read logic we have in the snapshot. Let me explain: it is possible that the translog contains two operations with the same seq no. This can happen after primary failure where the old primary generated an op, managed to get it to some replicas and then failed. If the new primary didn't have the said op, it may generate a different op with the same seq no and send it to the replicas. The new op (with a higher term) should override the old op. We currently do it as follows:
Introducing a seek command violates the validity of the bit array and this makes the snapshot dangerous to use (you need to no guarantee you maintain the right location for each seq no, and you should never call next more than once)
I did look closer at some of the guarantees our tansaction log give the API consumer and I think we have some serious flaws committed to this part of the codebase. We literally turns a
Thanks @s1monw for the review. I agree we should resolve this before proceeding.
A small correction here - 6.0 turned an append only tape to a limited list were you can write multiple times into the same position and still only read in an unordered way.
I clarified with Simon on a different channel. All work scheduled for 6.2 can still proceed as planned. New features beyond that should wait until we resolve this.
I am sorry about the confusion when I compared