(1/3) Commitlog: Base implementation "sans I/O" #919
Conversation
If so desired, this patch could be split further into (perhaps) write path, read path, tests. I felt like the tests might be quite helpful to see, so both paths needed to be included.
force-pushed from 802f385 to c518f99
Probably don't merge this without Tyler's review, but I'm hitting the button to signify that I've read the code, I feel like I understand it, and I'm comfortable with it.
One small nit about a use of Option<T> where I feel Result<(), T> is more idiomatic.
This is really nice, clear code, and was very easy to review. Thanks!
force-pushed from f1cd904 to 9e13cd5
This looks good. I have one question about how errors at the end of segments are handled, aside from that I'm happy.
/// If only a (non-empty) prefix of the segment could be read due to a failure
/// to decode a [`Commit`], the segment [`Metadata`] read up to the faulty
/// commit is returned in an `Err`. In this case, a new segment should be
/// created for writing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the segment also be truncated? I'm asking because of the language in the proposal here.
Good catch, I did indeed deviate from this approach. Probably should submit an amendment to the spec.
I realized that, unless write(2) can lie and return an error even though it has written everything, we can fairly easily skip over partial data (not least thanks to checksumming).
Because a new segment is started when a write error occurs, we'll retain data potentially useful for forensics in case of a bad disk.
Explicit truncation is still needed for consensus-based replication, though.
Made an issue for amending the spec, no rush on that though.
write(2) presumably can't lie in this way:

> unless write(2) can lie and return an error even though it has written everything

but it can lie in this way:

> write(2) can return no error even though it hasn't written everything
How do we know that the next commit is the actual next commit? i.e. we wrote commit 1, wrote partial commit 2, and wrote commit 3?
I think it is possible to write(2) a commit, have it return no error, only partially write that commit to disk due to a disk write error, and then have us write another commit to disk without ever seeing an error.
It's possible I'm missing some other context, like: we check that all of the commit offsets are sequential or something (and if not, we truncate).
Sync'd with Kim. We are indeed checking for sequential commit numbers when reloading a database. Thus if we encounter this type of situation, it will require manual intervention to recover from (if that is even possible). This error must have occurred silently as illustrated in my scenario above, so there's nothing we could have really done to detect it when it happened anyway.
Could we expand the comment above to add the stipulation that this behavior of skipping corrupted messages is only valid if sequential ordering of commits is verified at startup, and provide my example as an illustration of why that is necessary?
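For illustration, that startup check might look roughly like this (a sketch only; the `Commit` shape, field names, and the function are assumptions, not the actual implementation):

```rust
/// Hypothetical shape of a decoded commit, just for this sketch.
struct Commit {
    min_tx_offset: u64,
    n: u16, // number of records in the commit
}

/// Walk the decoded commits of a segment and verify that the offsets are gapless.
/// On success, returns the offset at which the next commit is expected to start;
/// on a gap or overlap, returns the offset that was expected but not found.
fn verify_sequential(commits: &[Commit], mut expected: u64) -> Result<u64, u64> {
    for commit in commits {
        if commit.min_tx_offset != expected {
            // A partially written commit slipped through: manual intervention needed.
            return Err(expected);
        }
        expected = commit.min_tx_offset + commit.n as u64;
    }
    Ok(expected)
}
```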
> but it can lie in this way:
> write(2) can return no error even though it hasn't written everything
Wait... we cannot assume write(2) is behaving arbitrarily. Namely, if we assume that write calls can be reordered arbitrarily, an append-only log is impossible.
We also must assume that fsync(2) won't return successfully if it hasn't flushed to disk.
What is indeed not correctly handled here is that an fsync(2) error should cause the active segment to be discarded. And because everything is happening asynchronously, failure to open a fresh segment must result in a panic (i.e. we must not keep writing to the old segment).
I'll make a new patch implementing this.
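A minimal sketch of that policy, assuming hypothetical `Segment` and `open_fresh_segment` stand-ins (the actual patch may look quite different):

```rust
use std::io;

// Hypothetical segment handle, just to make the sketch self-contained.
struct Segment;

impl Segment {
    fn fsync(&mut self) -> io::Result<()> {
        Ok(())
    }
}

fn open_fresh_segment() -> io::Result<Segment> {
    Ok(Segment)
}

/// If fsync(2) fails, discard the active segment and switch to a fresh one.
/// If opening the fresh segment fails as well, we must not keep writing to
/// the old segment, so panicking is the only safe option.
fn sync_or_rotate(active: &mut Segment) {
    if active.fsync().is_err() {
        *active = open_fresh_segment()
            .expect("cannot open a fresh segment after fsync failure; refusing to keep writing");
    }
}
```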
> Could we expand the comment above
I don't think a module which is not even exported from the crate is the right place to explain how different parts of the crate need to play together. Once we're overall confident with the implementation, I'd rather add some extensive crate-level docs, also summarizing the spec which is not accessible for everyone.
> Wait... we cannot assume write(2) is behaving arbitrarily.
With regards to this, you're right, it's definitely not arbitrary. It is for sure possible for write(2) to "return no error even though it hasn't written everything", though. I guess I'm just saying we want to make sure the following scenario is handled:
- write(2) a commit
- it returns no error
- only partially write that commit to disk
- crash
- restart
- write a new commit
I believe this is handled by the CRC, but I suppose my point is that it does need to be handled.
This seems fine to me, as long as we don't forget to add that comment in at the end.
The commentary states what happens, namely that one cannot resume a segment when it contains data which cannot be decoded as a commit. A new segment is created then, from the offset of the last good commit.
This can go wrong due to caching effects: re-reading from disk might actually be reading from the page cache. There is no remedy for this, but the next best thing is to bypass the page cache.
I don't know what else should be added here.
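One way to bypass the page cache on Linux is O_DIRECT; a rough sketch, assuming a Unix target and the libc crate, and ignoring the buffer alignment requirements that O_DIRECT imposes:

```rust
use std::fs::{File, OpenOptions};
use std::io;
use std::os::unix::fs::OpenOptionsExt;

/// Open a segment file for re-reading while asking the kernel to bypass the
/// page cache (Linux only). Note that O_DIRECT requires suitably aligned
/// buffers and offsets, which this sketch does not handle.
fn open_bypassing_page_cache(path: &str) -> io::Result<File> {
    OpenOptions::new()
        .read(true)
        .custom_flags(libc::O_DIRECT)
        .open(path)
}
```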
/// Verifies the checksum of the commit. If it doesn't match, an error of
/// kind [`io::ErrorKind::InvalidData`] with an inner error downcastable to
/// [`ChecksumMismatch`] is returned.
pub fn decode<R: Read>(reader: R) -> io::Result<Option<Self>> {
We could try to support reading into a user-supplied buffer / zero-copy reads with mmap in the future, but that should be left for later.
True. Gotta get some benchmarks in place...
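For reference, a caller distinguishing a checksum failure from other I/O errors might look roughly like this, assuming the `decode` signature quoted above; everything else is an illustrative stand-in, and whether to treat the mismatch as end-of-readable-data is a separate policy decision:

```rust
use std::error::Error;
use std::fmt;
use std::io::{self, Read};

// Hypothetical stand-ins so the sketch compiles on its own; the real crate
// defines `ChecksumMismatch` and `Commit::decode` as quoted above.
#[derive(Debug)]
struct ChecksumMismatch;

impl fmt::Display for ChecksumMismatch {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "checksum mismatch")
    }
}

impl Error for ChecksumMismatch {}

struct Commit;

impl Commit {
    fn decode<R: Read>(_reader: R) -> io::Result<Option<Self>> {
        Ok(None)
    }
}

/// Treat a checksum mismatch as "end of readable data" rather than a hard error.
fn read_next<R: Read>(reader: R) -> io::Result<Option<Commit>> {
    match Commit::decode(reader) {
        Err(e)
            if e.kind() == io::ErrorKind::InvalidData
                && e.get_ref().map_or(false, |inner| inner.is::<ChecksumMismatch>()) =>
        {
            Ok(None)
        }
        other => other,
    }
}
```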
/// A **datatype** which can be encoded.
///
/// The transaction payload of the commitlog (i.e. individual records in the log)
This sentence confuses me; what is the requirement of this trait?
What exactly is unclear? Familiarity with the format is obviously assumed: the log is composed of commits, which are composed of records, which we also refer to as "transactions" or "transaction payload". Each one of those must be [`Encode`]-able.
force-pushed from d4e4dc5 to a7f8dfd
My biggest concern with this PR is that I think max_tx_offset is improperly calculated. I thought I had already written this a few weeks ago, but I can't find the comment; I may have accidentally not submitted it.
I think some naming and comments could be cleaned up a bit, but I'm happy to defer that until later.
};

#[derive(Debug)]
pub struct Generic<R: Repo, T> {
I'd really love to name this something other than "Generic", but not enough to block this.
crates/commitlog/src/commit.rs
Outdated
) -> Self {
    Self {
        min_tx_offset,
        max_tx_offset: min_tx_offset + n as u64,
I could have sworn I wrote this somewhere before, but I can't find it. This seems wrong to me. If the min_tx_offset is 0 and there are 5 things in the list, then the max_tx_offset should be 4, not 5.
I'm a little unclear on the purpose of the metadata type generally, since you could just compute from the commit, but I'll reserve judgement for later.
The offsets are zero-based, which means that the max offset of one commit is equal to the min offset of the next commit. Because it has a potential to trip up the reader, this is spelled out explicitly in the spec.
Also note that the metadata types are merely internal helpers. You know, like semigroups.
I realize that they are zero based, that's why I'm making this comment. The maximum offset in this commit cannot be equal to the minimum offset in the next commit because offsets are unique.
The max offset cannot exist in this commit. If you mean something like next offset, I strongly suggest calling it that instead. This number as calculated does not represent the maximum transaction offset in this commit. There is no transaction with that offset in this commit.
Consider calling these upper and lower bound if what you mean to specify here is a half open range of tx offsets.
I changed this to be represented by a Range<u64> throughout the series.
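To illustrate the half-open convention with Range<u64> (a sketch; `tx_range` is a hypothetical helper, not the actual field or method):

```rust
use std::ops::Range;

// With zero-based offsets, a commit holding 5 transactions starting at offset 0
// covers the half-open range 0..5: offsets 0 through 4 are in this commit, and
// the next commit starts at offset 5.
fn tx_range(min_tx_offset: u64, n: u16) -> Range<u64> {
    min_tx_offset..min_tx_offset + n as u64
}

fn main() {
    let range = tx_range(0, 5);
    assert_eq!(range, 0..5);
    assert!(range.contains(&4));
    assert!(!range.contains(&5)); // 5 is the start of the next commit
    assert_eq!(range.end, 5); // `end` is an exclusive upper bound, not a max
}
```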
crates/commitlog/src/commit.rs
Outdated
Self {
    min_tx_offset,
    max_tx_offset: min_tx_offset + n as u64,
    size_in_bytes: Header::LEN as u64 + records.len() as u64 + /* crc32 */ 4,
Shouldn't this just be a function on the Commit?
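For illustration, such a method might be sketched like this (the name `encoded_len` and the placeholder constants are assumptions, not the crate's actual API):

```rust
// Placeholder constants for the sketch; the real code would use Header::LEN
// and the crc32 trailer width shown in the snippet above.
const HEADER_LEN: usize = 16; // placeholder, not the actual header size
const CRC32_LEN: usize = 4;

struct Commit {
    records: Vec<u8>,
}

impl Commit {
    /// Encoded size of the commit: framing (header + crc32) plus the payload bytes.
    fn encoded_len(&self) -> usize {
        HEADER_LEN + self.records.len() + CRC32_LEN
    }
}
```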
crates/commitlog/src/commit.rs
Outdated
pub const FRAMING_LEN: usize = Header::LEN + /* crc32 */ 4;
pub const CHECKSUM_ALGORITHM: u8 = CHECKSUM_ALGORITHM_CRC32C;

/// The largest transaction offset in this commit.
This comment is wrong AFAICT. This is the min_tx_offset of the next segment, not the max_tx_offset of this segment.
Both are true. But if it's a method on commit, then it's the max offset of this commit.
force-pushed from ecdb130 to 03a645e
This LGTM
First in a series of patches to implement the new commitlog format.
This patch implements the base format, leaving the transaction payload generic. Segment handling, writing and reading is implemented based on an in-memory backend, which greatly simplifies testing.
As a notable deviation from the previous implementation, segments are never implicitly trimmed. Instead, faulty commits are ignored if and only if the next commit in the log sequence is valid and has the right offset. On the write path, this entails closing the active segment when an (I/O) error occurs, but retaining the commit in memory such that it is written to the next segment.
Note that this patch does not define the final public API.
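As a rough sketch of the write-path behavior described above (the `Segment` type and function names are illustrative, not the actual API):

```rust
use std::io;

// Hypothetical segment handle for this sketch.
struct Segment;

impl Segment {
    fn append(&mut self, _commit: &[u8]) -> io::Result<()> {
        Ok(())
    }
}

fn start_new_segment() -> io::Result<Segment> {
    Ok(Segment)
}

/// On an I/O error the active segment is closed (never trimmed), and the commit
/// that failed to be written is retained and appended to a freshly started
/// segment instead, so it is never silently dropped.
fn append_commit(active: &mut Segment, commit: &[u8]) -> io::Result<()> {
    match active.append(commit) {
        Ok(()) => Ok(()),
        Err(_) => {
            let mut fresh = start_new_segment()?;
            fresh.append(commit)?;
            *active = fresh;
            Ok(())
        }
    }
}
```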
Expected complexity level and risk: 5