(1/3) Commitlog: Base implementation "sans I/O" #919
Conversation
If so desired, this patch could be split further into (perhaps) write path, read path, tests. I felt like the tests might be quite helpful to see, so both paths needed to be included.
force-pushed from 802f385 to c518f99
Probably don't merge this without Tyler's review, but I'm hitting the button to signify that I've read the code, I feel like I understand it, and I'm comfortable with it.
One small nit about a use of Option<T> where I feel Result<(), T> is more idiomatic.
This is really nice, clear code, and was very easy to review. Thanks!
force-pushed from f1cd904 to 9e13cd5
This looks good. I have one question about how errors at the end of segments are handled, aside from that I'm happy.
/// If only a (non-empty) prefix of the segment could be read due to a failure
/// to decode a [`Commit`], the segment [`Metadata`] read up to the faulty
/// commit is returned in an `Err`. In this case, a new segment should be
/// created for writing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the segment also be truncated? I'm asking because of the language in the proposal here.
Good catch, I did indeed deviate from this approach. Probably should submit an amendment to the spec.
I realized that, unless write(2) can lie and return an error even though it has written everything, we can fairly easily skip over partial data (not least thanks to checksumming).
Because a new segment is started when a write error occurs, we'll retain data potentially useful for forensics in case of a bad disk.
Explicit truncation is still needed for consensus-based replication, though.
Made an issue for amending the spec, no rush on that though.
write(2) presumably can't lie in this way:

> unless write(2) can lie and return an error even though it has written everything

but it can lie in this way:

> write(2) can return no error even though it hasn't written everything
How do we know that the next commit is the actual next commit? i.e. we wrote commit 1, wrote partial commit 2, and wrote commit 3?
I think it is possible to write(2) a commit, have it return no error, only partially write that commit to disk due to a disk write error, and then have us write another commit to disk without ever seeing an error.
It's possible I'm missing some other context, like: we check that all of the commit offsets are sequential or something (and if not, we truncate).
Sync'd with Kim. We are indeed checking for sequential commit numbers when reloading a database. Thus if we encounter this type of situation, it will require manual intervention to recover from (if that is even possible). This error must have occurred silently as illustrated in my scenario above, so there's nothing we could have really done to detect it when it happened anyway.
Could we expand the comment above to add the stipulation that this behavior of skipping corrupted messages is only valid if sequential ordering of commits is verified at startup, and provide my example as an illustration of why that is necessary?
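For illustration, that startup check might look roughly like this (a sketch only; the `Commit` shape, field names, and the function are assumptions, not the actual implementation):

```rust
/// Hypothetical shape of a decoded commit, just for this sketch.
struct Commit {
    min_tx_offset: u64,
    n: u16, // number of records in the commit
}

/// Walk the decoded commits of a segment and verify that the offsets are gapless.
/// On success, returns the offset at which the next commit is expected to start;
/// on a gap or overlap, returns the offset that was expected but not found.
fn verify_sequential(commits: &[Commit], mut expected: u64) -> Result<u64, u64> {
    for commit in commits {
        if commit.min_tx_offset != expected {
            // A partially written commit slipped through: manual intervention needed.
            return Err(expected);
        }
        expected = commit.min_tx_offset + commit.n as u64;
    }
    Ok(expected)
}
```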
> but it can lie in this way:
> write(2) can return no error even though it hasn't written everything
Wait... we cannot assume write(2) is behaving arbitrarily. Namely, if we assume that write calls can be reordered arbitrarily, an append-only log is impossible.
We also must assume that fsync(2) won't return successfully if it hasn't flushed to disk.
What is indeed not correctly handled here is that an fsync(2) error should cause the active segment to be discarded. And because everything is happening asynchronously, failure to open a fresh segment must result in a panic (i.e. we must not keep writing to the old segment).
I'll make a new patch implementing this.
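A minimal sketch of that policy, assuming hypothetical `Segment` and `open_fresh_segment` stand-ins (the actual patch may look quite different):

```rust
use std::io;

// Hypothetical segment handle, just to make the sketch self-contained.
struct Segment;

impl Segment {
    fn fsync(&mut self) -> io::Result<()> {
        Ok(())
    }
}

fn open_fresh_segment() -> io::Result<Segment> {
    Ok(Segment)
}

/// If fsync(2) fails, discard the active segment and switch to a fresh one.
/// If opening the fresh segment fails as well, we must not keep writing to
/// the old segment, so panicking is the only safe option.
fn sync_or_rotate(active: &mut Segment) {
    if active.fsync().is_err() {
        *active = open_fresh_segment()
            .expect("cannot open a fresh segment after fsync failure; refusing to keep writing");
    }
}
```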
> Could we expand the comment above
I don't think a module which is not even exported from the crate is the right place to explain how different parts of the crate need to play together. Once we're overall confident with the implementation, I'd rather add some extensive crate-level docs, also summarizing the spec which is not accessible for everyone.
> Wait... we cannot assume write(2) is behaving arbitrarily.
With regards to this, you're right, it's definitely not arbitrary. It is for sure possible for write(2) to "return no error even though it hasn't written everything", though. I guess I'm just saying we want to make sure the following scenario is handled:
- write(2) a commit
- it returns no error
- only partially write that commit to disk
- crash
- restart
- write a new commit
I believe this is handled by the CRC, but I suppose my point is that it does need to be handled.
This seems fine to me, as long as we don't forget to add that comment in at the end.
The commentary states what happens, namely that one cannot resume a segment when it contains data which cannot be decoded as a commit. A new segment is created then, from the offset of the last good commit.
This can go wrong due to caching effects: re-reading from disk might actually be reading from the page cache. There is no remedy for this, but the next best thing is to bypass the page cache.
I don't know what else should be added here.
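One way to bypass the page cache on Linux is O_DIRECT; a rough sketch, assuming a Unix target and the libc crate, and ignoring the buffer alignment requirements that O_DIRECT imposes:

```rust
use std::fs::{File, OpenOptions};
use std::io;
use std::os::unix::fs::OpenOptionsExt;

/// Open a segment file for re-reading while asking the kernel to bypass the
/// page cache (Linux only). Note that O_DIRECT requires suitably aligned
/// buffers and offsets, which this sketch does not handle.
fn open_bypassing_page_cache(path: &str) -> io::Result<File> {
    OpenOptions::new()
        .read(true)
        .custom_flags(libc::O_DIRECT)
        .open(path)
}
```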
/// Verifies the checksum of the commit. If it doesn't match, an error of
/// kind [`io::ErrorKind::InvalidData`] with an inner error downcastable to
/// [`ChecksumMismatch`] is returned.
pub fn decode<R: Read>(reader: R) -> io::Result<Option<Self>> {
We could try to support reading into a user-supplied buffer / zero-copy reads with mmap in the future, but that should be left for later.
True. Gotta get some benchmarks in place...
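For reference, a caller distinguishing a checksum failure from other I/O errors might look roughly like this, assuming the `decode` signature quoted above; everything else is an illustrative stand-in, and whether to treat the mismatch as end-of-readable-data is a separate policy decision:

```rust
use std::error::Error;
use std::fmt;
use std::io::{self, Read};

// Hypothetical stand-ins so the sketch compiles on its own; the real crate
// defines `ChecksumMismatch` and `Commit::decode` as quoted above.
#[derive(Debug)]
struct ChecksumMismatch;

impl fmt::Display for ChecksumMismatch {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "checksum mismatch")
    }
}

impl Error for ChecksumMismatch {}

struct Commit;

impl Commit {
    fn decode<R: Read>(_reader: R) -> io::Result<Option<Self>> {
        Ok(None)
    }
}

/// Treat a checksum mismatch as "end of readable data" rather than a hard error.
fn read_next<R: Read>(reader: R) -> io::Result<Option<Commit>> {
    match Commit::decode(reader) {
        Err(e)
            if e.kind() == io::ErrorKind::InvalidData
                && e.get_ref().map_or(false, |inner| inner.is::<ChecksumMismatch>()) =>
        {
            Ok(None)
        }
        other => other,
    }
}
```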
/// A **datatype** which can be encoded.
///
/// The transaction payload of the commitlog (i.e. individual records in the log)
This sentence confuses me; what is the requirement of this trait?
What exactly is unclear? Familiarity with the format is obviously assumed: the log is composed of commits, which are composed of records, which we also refer to as "transactions" or "transaction payload". Each one of those must be [`Encode`]-able.
force-pushed from d4e4dc5 to a7f8dfd
My biggest concern with this PR is that I think max_tx_offset is improperly calculated. I thought I had already written this a few weeks ago, but I can't find the comment; I may have accidentally not submitted it.
I think some naming and comments could be cleaned up a bit, but I'm happy to defer that until later.
};

#[derive(Debug)]
pub struct Generic<R: Repo, T> {
I'd really love to name this something other than "Generic", but not enough to block this.
crates/commitlog/src/commit.rs
Outdated
) -> Self {
    Self {
        min_tx_offset,
        max_tx_offset: min_tx_offset + n as u64,
I could have sworn I wrote this somewhere before, but I can't find it. This seems wrong to me. If the min_tx_offset is 0 and there are 5 things in the list, then the max_tx_offset should be 4, not 5.
I'm a little unclear on the purpose of the metadata type generally, since you could just compute from the commit, but I'll reserve judgement for later.
The offsets are zero-based, which means that the max offset of one commit is equal to the min offset of the next commit. Because it has a potential to trip up the reader, this is spelled out explicitly in the spec.
Also note that the metadata types are merely internal helpers. You know, like semigroups.
I realize that they are zero based, that's why I'm making this comment. The maximum offset in this commit cannot be equal to the minimum offset in the next commit because offsets are unique.
The max offset cannot exist in this commit. If you mean something like next offset, I strongly suggest calling it that instead. This number as calculated does not represent the maximum transaction offset in this commit. There is no transaction with that offset in this commit.
Consider calling these upper and lower bound if what you mean to specify here is a half open range of tx offsets.
I changed this to be represented by a Range<u64> throughout the series.
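To illustrate the half-open convention with Range<u64> (a sketch; `tx_range` is a hypothetical helper, not the actual field or method):

```rust
use std::ops::Range;

// With zero-based offsets, a commit holding 5 transactions starting at offset 0
// covers the half-open range 0..5: offsets 0 through 4 are in this commit, and
// the next commit starts at offset 5.
fn tx_range(min_tx_offset: u64, n: u16) -> Range<u64> {
    min_tx_offset..min_tx_offset + n as u64
}

fn main() {
    let range = tx_range(0, 5);
    assert_eq!(range, 0..5);
    assert!(range.contains(&4));
    assert!(!range.contains(&5)); // 5 is the start of the next commit
    assert_eq!(range.end, 5); // `end` is an exclusive upper bound, not a max
}
```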
crates/commitlog/src/commit.rs
Outdated
Self {
    min_tx_offset,
    max_tx_offset: min_tx_offset + n as u64,
    size_in_bytes: Header::LEN as u64 + records.len() as u64 + /* crc32 */ 4,
Shouldn't this just be a function on the Commit?
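For illustration, such a method might be sketched like this (the name `encoded_len` and the placeholder constants are assumptions, not the crate's actual API):

```rust
// Placeholder constants for the sketch; the real code would use Header::LEN
// and the crc32 trailer width shown in the snippet above.
const HEADER_LEN: usize = 16; // placeholder, not the actual header size
const CRC32_LEN: usize = 4;

struct Commit {
    records: Vec<u8>,
}

impl Commit {
    /// Encoded size of the commit: framing (header + crc32) plus the payload bytes.
    fn encoded_len(&self) -> usize {
        HEADER_LEN + self.records.len() + CRC32_LEN
    }
}
```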
crates/commitlog/src/commit.rs
Outdated
pub const FRAMING_LEN: usize = Header::LEN + /* crc32 */ 4;
pub const CHECKSUM_ALGORITHM: u8 = CHECKSUM_ALGORITHM_CRC32C;

/// The largest transaction offset in this commit.
This comment is wrong AFAICT. This is the min_tx_offset of the next segment, not the max_tx_offset of this segment.
Both are true. But if it's a method on commit, then it's the max offset of this commit.
force-pushed from ecdb130 to 03a645e
This LGTM
First in a series of patches to implement the new commitlog format.
This patch implements the base format, leaving the transaction payload generic. Segment handling, writing and reading is implemented based on an in-memory backend, which greatly simplifies testing.
As a notable deviation from the previous implementation, segments are never implicitly trimmed. Instead, faulty commits are ignored if and only if the next commit in the log sequence is valid and has the right offset. On the write path, this entails closing the active segment when an (I/O) error occurs, but retaining the commit in memory such that it is written to the next segment.
Note that this patch does not define the final public API.
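As a rough sketch of the write-path behavior described above (the `Segment` type and function names are illustrative, not the actual API):

```rust
use std::io;

// Hypothetical segment handle for this sketch.
struct Segment;

impl Segment {
    fn append(&mut self, _commit: &[u8]) -> io::Result<()> {
        Ok(())
    }
}

fn start_new_segment() -> io::Result<Segment> {
    Ok(Segment)
}

/// On an I/O error the active segment is closed (never trimmed), and the commit
/// that failed to be written is retained and appended to a freshly started
/// segment instead, so it is never silently dropped.
fn append_commit(active: &mut Segment, commit: &[u8]) -> io::Result<()> {
    match active.append(commit) {
        Ok(()) => Ok(()),
        Err(_) => {
            let mut fresh = start_new_segment()?;
            fresh.append(commit)?;
            *active = fresh;
            Ok(())
        }
    }
}
```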
Expected complexity level and risk: 5