Integrate WAL compression into log reader/writer. #9642

Closed
wants to merge 7 commits

Conversation

sidroyc (Contributor) commented Mar 1, 2022

Integrate the streaming compress/uncompress API into WAL compression.
The streaming compression object is stored in the log_writer along with a reusable output buffer to store the compressed buffer(s).
The streaming uncompress object is stored in the log_reader along with a reusable output buffer to store the uncompressed buffer(s).
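
A minimal sketch of the layout this describes (the class and member names are illustrative, not the PR's actual code): the writer owns one streaming compression context plus a reusable output buffer sized to a log block, so compressing a record does not allocate per call; the reader mirrors this with a streaming decompression context and a reusable uncompressed buffer.

  #include <zstd.h>
  #include <cstddef>
  #include <memory>

  class WalWriterSketch {
   public:
    static constexpr size_t kBlockSize = 32 * 1024;  // RocksDB log block size
    WalWriterSketch()
        : cstream_(ZSTD_createCStream(), ZSTD_freeCStream),
          compressed_buffer_(new char[kBlockSize]) {}

   private:
    // Streaming compressor, reused across AddRecord() calls.
    std::unique_ptr<ZSTD_CStream, decltype(&ZSTD_freeCStream)> cstream_;
    // Reusable scratch buffer for the compressed fragment(s) of a block.
    std::unique_ptr<char[]> compressed_buffer_;
  };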

Test Plan:
Added unit tests to verify different scenarios - large buffers, split compressed buffers, etc.

Future optimizations:
The overhead for small records is quite high, so it makes sense to compress only buffers above a certain threshold and use a separate record type to indicate that those records are compressed.
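
One possible shape of that future optimization, as a hedged sketch (the threshold value, names, and the separate record type are hypothetical, not part of this PR):

  #include <cstddef>

  // Hypothetical cutoff below which compression overhead outweighs the savings.
  constexpr size_t kMinRecordSizeToCompress = 256;

  inline bool ShouldCompressRecord(size_t record_size) {
    // Small records would keep the existing (uncompressed) record types;
    // larger ones would use a separate "compressed" record type.
    return record_size >= kMinRecordSizeToCompress;
  }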

anand1976 (Contributor) left a comment

Thanks for the PR. I didn't fully understand the logic, so I've left some questions inline.

const size_t max_output_buffer_len =
    kBlockSize - (recycle_log_files_ ? kRecyclableHeaderSize : kHeaderSize);
CompressionOptions opts;
constexpr uint32_t compression_format_version = 2;
Contributor

Use a descriptive constant instead of 2

Contributor Author

I looked at other call sites for the existing CompressData()/UncompressData() and the format seems to be hardcoded to 2.
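
For illustration, the suggestion would amount to something like the following (the constant name is hypothetical; 2 is the format version the existing CompressData()/UncompressData() call sites use, per the comment above):

  #include <cstdint>

  // Format version expected by CompressData()/UncompressData(); named so the
  // bare "2" is self-describing at the call site.
  constexpr uint32_t kCompressedWalFormatVersion = 2;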

@@ -85,10 +100,31 @@ IOStatus Writer::AddRecord(const Slice& slice) {
assert(static_cast<int64_t>(kBlockSize - block_offset_) >= header_size);

const size_t avail = kBlockSize - block_offset_ - header_size;

// Compress the record
if (compress_ && (compress_start || left == 0)) {
Contributor

This is kinda confusing. IIUC, we try to compress as much as will fit in a block. However, the first physical record for the logical record, and the start of the block may not be aligned. So the compressed buffer may be split into multiple physical records. Is that correct? Would the records have a dependency?

It might be simpler and easier to reason about if we have nested while loops.

Contributor Author

Yes, that is correct. There is no dependency other than that all the compressed records need to be uncompressed in order to recover the original record. I tried changing the compress API to pass the available physical block size, and it doesn't make a difference.

I can add some comments to make it more readable. The advantage of having a single loop is that less code is needed for the compressed vs. uncompressed paths: once a chunk is available, compressed or not, the rest of the code that generates the physical record is the same.
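
To make the single-loop shape concrete, here is a simplified, self-contained sketch (the function, the emit() callback, and the fragment sizing are illustrative assumptions, not the PR's code): each iteration produces one chunk, either a slice of the raw payload or the next piece of ZSTD_compressStream2() output that fits the reusable buffer, and the same emit step turns it into a physical record, so the compressed and uncompressed cases share that code.

  #include <zstd.h>
  #include <algorithm>
  #include <functional>
  #include <string>
  #include <vector>

  void WriteRecordSketch(const std::string& payload, bool compress,
                         size_t max_fragment,
                         const std::function<void(const char*, size_t)>& emit) {
    std::vector<char> scratch(max_fragment);  // reusable compressed-output buffer
    ZSTD_CStream* cs = compress ? ZSTD_createCStream() : nullptr;
    if (cs != nullptr) {
      ZSTD_initCStream(cs, /*compressionLevel=*/3);
    }
    ZSTD_inBuffer in = {payload.data(), payload.size(), 0};
    bool done = false;
    while (!done) {
      const char* chunk = nullptr;
      size_t chunk_len = 0;
      if (compress) {
        // Pull at most max_fragment bytes of compressed output; zstd keeps any
        // unflushed state internally until the next iteration.
        ZSTD_outBuffer out = {scratch.data(), scratch.size(), 0};
        size_t remaining = ZSTD_compressStream2(cs, &out, &in, ZSTD_e_end);
        chunk = scratch.data();
        chunk_len = out.pos;
        done = (remaining == 0);  // frame fully flushed (error checks omitted)
      } else {
        chunk_len = std::min(max_fragment, payload.size() - in.pos);
        chunk = payload.data() + in.pos;
        in.pos += chunk_len;
        done = (in.pos == payload.size());
      }
      // Shared path: one physical record per chunk, compressed or not.
      emit(chunk, chunk_len);
    }
    if (cs != nullptr) {
      ZSTD_freeCStream(cs);
    }
  }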

Contributor Author

Added comments.

db/log_writer.h (resolved, outdated)
db/log_reader.cc (resolved)
db/log_reader.cc (resolved)
db/log_reader.cc (outdated)
uncompressed_record_.append(uncompressed_buffer_.get(),
uncompressed_size);
}
} while (remaining > 0);
anand1976 (Contributor) commented Mar 3, 2022

Is remaining guaranteed to go down to 0 in all cases? I'm still not sure how it works in the case of a chunk compressed by ZSTD_compressStream2() spanning 2 physical records. It seems hard to believe that the uncompression algorithm can completely consume input up to arbitrary boundaries. Would it be safer to concat all the physical records belonging to a logical one and then uncompress?

To make it more concrete, consider an example of a logical record compressed by 2 calls to ZSTD_compressStream, and written to 3 physical records {p1, p2} and {p3}, where the output of the first call is split into p1 and p2. Is there a possibility of a literal spanning the boundary between p1/p2, since the compress call would have been unaware of such a boundary?

Contributor Author

That case works. The uncompression algorithm doesn't rely on knowing the boundaries, since the input reader doesn't know them either in the case of file compression.

Contributor

Yeah, now that I think about it, I guess in the example I mentioned above the uncompression might keep any partial input data in its internal buffers and flush it to output when it sees the rest. The ZSTD streaming documentation is rather sparse, hence the doubt.

Contributor Author

If you look at the example here, the outer loop just reads chunks from the compressed file. The inner loop uses input.pos/input.size to figure out whether the chunk can be decompressed any further.

https://github.com/facebook/zstd/blob/dev/examples/streaming_decompression.c#L47
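
For reference, a condensed, self-contained version of that pattern (the chunking and names here are illustrative; only the zstd calls follow the linked example): the outer loop feeds whatever chunk sizes happen to arrive, and ZSTD_decompressStream() buffers partial data internally, so compressed output split across arbitrary physical-record boundaries still decompresses correctly.

  #include <zstd.h>
  #include <string>
  #include <vector>

  std::string DecompressChunksSketch(const std::vector<std::string>& chunks) {
    ZSTD_DStream* ds = ZSTD_createDStream();
    ZSTD_initDStream(ds);
    std::string result;
    std::vector<char> out(ZSTD_DStreamOutSize());
    for (const std::string& chunk : chunks) {
      // A chunk may end in the middle of a compressed sequence; zstd holds the
      // partial data internally and completes it with the next chunk.
      ZSTD_inBuffer in = {chunk.data(), chunk.size(), 0};
      while (in.pos < in.size) {
        ZSTD_outBuffer ob = {out.data(), out.size(), 0};
        size_t ret = ZSTD_decompressStream(ds, &ob, &in);  // error checks omitted
        (void)ret;
        result.append(out.data(), ob.pos);
        // Note: if ob.pos == ob.size here, zstd may still hold data that did
        // not fit in the output buffer, so a caller that loops on remaining
        // input alone would also need to re-check that case.
      }
    }
    ZSTD_freeDStream(ds);
    return result;
  }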

db/log_reader.cc (outdated)
uncompressed_record_.append(uncompressed_buffer_.get(),
uncompressed_size);
}
} while (remaining > 0);
Contributor

Also, I think your original check of remaining > 0 || uncompressed_size == kBlockSize was probably correct. The documentation states that if output.pos == output.size, there may be some data left in internal buffers.

Contributor Author

I think that's why I had added it, although removing it didn't cause the tests to fail.

Contributor Author

Added it back.

anand1976 (Contributor) left a comment

Thanks for the explanation. LGTM. Can we run db_bench to verify there's no regression? The same benchmarks as the previous PR for adding the compression type record should suffice.

sidroyc (Contributor Author) commented Mar 4, 2022

Unfortunately, I'm seeing a regression.

Base commit average 421886.00 ops/sec.
commit 9ed9670 (HEAD -> main, origin/main, origin/HEAD)
Author: Changneng Chen changneng@fb.com
Date: Fri Feb 25 23:13:11 2022 -0800

fillseq : 2.287 micros/op 437242 ops/sec; 48.4 MB/s
fillseq : 2.437 micros/op 410349 ops/sec; 45.4 MB/s
fillseq : 2.313 micros/op 432308 ops/sec; 47.8 MB/s
fillseq : 2.325 micros/op 430129 ops/sec; 47.6 MB/s
fillseq : 2.388 micros/op 418700 ops/sec; 46.3 MB/s
fillseq : 2.402 micros/op 416285 ops/sec; 46.1 MB/s
fillseq : 2.333 micros/op 428575 ops/sec; 47.4 MB/s
fillseq : 2.372 micros/op 421545 ops/sec; 46.6 MB/s
fillseq : 2.475 micros/op 404033 ops/sec; 44.7 MB/s
fillseq : 2.372 micros/op 421497 ops/sec; 46.6 MB/s
fillseq : 2.317 micros/op 431614 ops/sec; 47.7 MB/s
fillseq : 2.375 micros/op 420984 ops/sec; 46.6 MB/s
fillseq : 2.446 micros/op 408778 ops/sec; 45.2 MB/s
fillseq : 2.366 micros/op 422571 ops/sec; 46.7 MB/s
fillseq : 2.359 micros/op 423872 ops/sec; 46.9 MB/s
fillseq : 2.479 micros/op 403446 ops/sec; 44.6 MB/s
fillseq : 2.327 micros/op 429682 ops/sec; 47.5 MB/s
fillseq : 2.339 micros/op 427508 ops/sec; 47.3 MB/s
fillseq : 2.388 micros/op 418651 ops/sec; 46.3 MB/s
fillseq : 2.326 micros/op 429951 ops/sec; 47.6 MB/s

With the changes average 393841.50 ops/sec -
fillseq : 2.437 micros/op 410271 ops/sec; 45.4 MB/s
fillseq : 2.510 micros/op 398398 ops/sec; 44.1 MB/s
fillseq : 2.471 micros/op 404757 ops/sec; 44.8 MB/s
fillseq : 2.489 micros/op 401692 ops/sec; 44.4 MB/s
fillseq : 2.512 micros/op 398019 ops/sec; 44.0 MB/s
fillseq : 2.485 micros/op 402422 ops/sec; 44.5 MB/s
fillseq : 2.527 micros/op 395758 ops/sec; 43.8 MB/s
fillseq : 2.707 micros/op 369345 ops/sec; 40.9 MB/s
fillseq : 2.494 micros/op 400881 ops/sec; 44.3 MB/s
fillseq : 2.551 micros/op 392021 ops/sec; 43.4 MB/s
fillseq : 2.526 micros/op 395939 ops/sec; 43.8 MB/s
fillseq : 2.667 micros/op 374932 ops/sec; 41.5 MB/s
fillseq : 2.555 micros/op 391340 ops/sec; 43.3 MB/s
fillseq : 2.483 micros/op 402803 ops/sec; 44.6 MB/s
fillseq : 2.557 micros/op 391149 ops/sec; 43.3 MB/s
fillseq : 2.456 micros/op 407110 ops/sec; 45.0 MB/s
fillseq : 2.592 micros/op 385738 ops/sec; 42.7 MB/s
fillseq : 2.493 micros/op 401088 ops/sec; 44.4 MB/s
fillseq : 2.546 micros/op 392761 ops/sec; 43.4 MB/s
fillseq : 2.775 micros/op 360406 ops/sec; 39.9 MB/s

Will run it a few more times to confirm.

sidroyc (Contributor Author) commented Mar 5, 2022

Comparing release builds -
Base - 483819.50 ops/sec
With PR - 455515.90 ops/sec

Benchmark results with ZSTD WAL compression enabled -
189432 ops/sec.

db/log_reader.cc (outdated)
*fragment = Slice(header + header_size, length);
*fragment_type_or_err = type;
return true;
if (uncompress_ && type != kSetCompressionType) {
Contributor

It's possible that re-arranging this if (negating the condition and swapping the "then" and "else" code blocks) could win back some of the regression. Why? Default branch prediction generally assumes forward branches are not taken, favoring the "then" block over the "else", and code locality is possibly better because both "then" and "else" end in a "return". You can also put LIKELY() around type != kSetCompressionType.
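
A self-contained illustration of that suggestion (the enum, the helper, and all names are illustrative, not the PR's code; LIKELY is shown as the usual __builtin_expect wrapper, similar to RocksDB's port/likely.h):

  #if defined(__GNUC__) || defined(__clang__)
  #define LIKELY(x) (__builtin_expect((x), 1))
  #else
  #define LIKELY(x) (x)
  #endif

  enum RecordTypeSketch { kDataRecordSketch, kSetCompressionTypeSketch };

  bool UseFragmentDirectlySketch(bool uncompress_enabled, RecordTypeSketch type) {
    // Negated condition with the branches swapped: the path taken when WAL
    // compression is disabled is now the fall-through "then" block.
    if (!uncompress_enabled || type == kSetCompressionTypeSketch) {
      return true;  // hand the fragment back untouched (hot path in fillseq)
    }
    // Compression enabled: the payload goes through streaming uncompression.
    return false;
  }

  // The other suggestion, keeping the original arrangement but hinting that the
  // compression-type record is rare:
  //   if (uncompress_ && LIKELY(type != kSetCompressionType)) { ... }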

Contributor Author

That seemed to help. I'll run it one more time to be sure.

Without PR -
Avg ops/sec 478815

fillseq : 2.042 micros/op 489809 ops/sec; 54.2 MB/s
fillseq : 2.134 micros/op 468617 ops/sec; 51.8 MB/s
fillseq : 2.066 micros/op 484099 ops/sec; 53.6 MB/s
fillseq : 2.140 micros/op 467203 ops/sec; 51.7 MB/s
fillseq : 2.146 micros/op 465934 ops/sec; 51.5 MB/s
fillseq : 2.038 micros/op 490768 ops/sec; 54.3 MB/s
fillseq : 2.194 micros/op 455855 ops/sec; 50.4 MB/s
fillseq : 2.051 micros/op 487570 ops/sec; 53.9 MB/s
fillseq : 2.075 micros/op 481840 ops/sec; 53.3 MB/s
fillseq : 2.150 micros/op 465151 ops/sec; 51.5 MB/s
fillseq : 2.062 micros/op 484875 ops/sec; 53.6 MB/s
fillseq : 2.054 micros/op 486925 ops/sec; 53.9 MB/s
fillseq : 2.095 micros/op 477330 ops/sec; 52.8 MB/s
fillseq : 2.063 micros/op 484727 ops/sec; 53.6 MB/s
fillseq : 2.020 micros/op 495089 ops/sec; 54.8 MB/s
fillseq : 2.093 micros/op 477879 ops/sec; 52.9 MB/s
fillseq : 2.133 micros/op 468929 ops/sec; 51.9 MB/s
fillseq : 2.014 micros/op 496462 ops/sec; 54.9 MB/s
fillseq : 2.151 micros/op 464845 ops/sec; 51.4 MB/s
fillseq : 2.073 micros/op 482397 ops/sec; 53.4 MB/s

With PR -
Avg ops/sec 470565

fillseq : 2.162 micros/op 462585 ops/sec; 51.2 MB/s
fillseq : 2.092 micros/op 478039 ops/sec; 52.9 MB/s
fillseq : 2.103 micros/op 475427 ops/sec; 52.6 MB/s
fillseq : 2.081 micros/op 480608 ops/sec; 53.2 MB/s
fillseq : 2.147 micros/op 465735 ops/sec; 51.5 MB/s
fillseq : 2.117 micros/op 472425 ops/sec; 52.3 MB/s
fillseq : 2.092 micros/op 478105 ops/sec; 52.9 MB/s
fillseq : 2.117 micros/op 472352 ops/sec; 52.3 MB/s
fillseq : 2.116 micros/op 472519 ops/sec; 52.3 MB/s
fillseq : 2.131 micros/op 469370 ops/sec; 51.9 MB/s
fillseq : 2.154 micros/op 464144 ops/sec; 51.3 MB/s
fillseq : 2.098 micros/op 476660 ops/sec; 52.7 MB/s
fillseq : 2.139 micros/op 467601 ops/sec; 51.7 MB/s
fillseq : 2.170 micros/op 460933 ops/sec; 51.0 MB/s
fillseq : 2.124 micros/op 470743 ops/sec; 52.1 MB/s
fillseq : 2.120 micros/op 471664 ops/sec; 52.2 MB/s
fillseq : 2.115 micros/op 472834 ops/sec; 52.3 MB/s
fillseq : 2.174 micros/op 460034 ops/sec; 50.9 MB/s
fillseq : 2.123 micros/op 471077 ops/sec; 52.1 MB/s
fillseq : 2.135 micros/op 468455 ops/sec; 51.8 MB/s

sidroyc (Contributor Author) commented Mar 7, 2022

In subsequent runs I still see some regression -
490107 vs 473037 ops/sec

My guess is that something similar needs to be done for log_writer as well but because the else block is embedded in the loop, it'll probably need a bigger refactoring.

Contributor

I think it's noise. The variation without your changes (478815 vs 490107) suggests so. Also, PGO should take care of the rearrangement if necessary?

Contributor Author

Ok. Let me submit a diff.

facebook-github-bot (Contributor)

@sidroyc has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

anand1976 (Contributor) left a comment

LGTM. Thanks!
