[TechDebt] Clarify status-handling logic in BlockBasedTableBuilder::WriteRawBlock #9393
Conversation
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Wow, the existing code is nightmarish. Your changes are a small improvement.
This sort of approach should greatly reduce the nesting depth (if not also the cyclomatic complexity). The early-return idiom might not be appropriate where there's significant cleanup not easily handled with RAII, but here it seems appropriate.
Good suggestion! Fixed, although there are two cases where we can't return early because some existing behavior continues despite errors (noted in the PR comments). I don't know why, but that's the current behavior.
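To make the suggestion concrete, here is a minimal sketch (hypothetical types and functions, not the actual RocksDB code) contrasting the nested status-checking style with the early-return style being adopted:

```cpp
#include <cassert>

// Stand-ins for RocksDB's Status and the builder's Rep; illustrative only.
struct Status {
  bool ok_ = true;
  static Status OK() { return Status(); }
  static Status IOError() { Status s; s.ok_ = false; return s; }
  bool ok() const { return ok_; }
};

struct Rep {
  Status status;
  // First error wins; later errors don't overwrite it.
  void SetStatus(const Status& s) { if (!s.ok() && status.ok()) status = s; }
};

// Nested style: every failure check adds a level of indentation.
void WriteNested(Rep* r, bool append_ok, bool pad_ok) {
  Status s = append_ok ? Status::OK() : Status::IOError();
  if (s.ok()) {
    s = pad_ok ? Status::OK() : Status::IOError();
    if (s.ok()) {
      // ... more work, more nesting ...
    } else {
      r->SetStatus(s);
    }
  } else {
    r->SetStatus(s);
  }
}

// Early-return style: failure paths exit immediately; the happy path stays flat.
void WriteEarlyReturn(Rep* r, bool append_ok, bool pad_ok) {
  Status s = append_ok ? Status::OK() : Status::IOError();
  if (!s.ok()) { r->SetStatus(s); return; }
  s = pad_ok ? Status::OK() : Status::IOError();
  if (!s.ok()) { r->SetStatus(s); return; }
  // ... more work at the same indentation level ...
}
```

Both functions record the same status; only the control-flow shape differs, which is why the refactor can be mostly cosmetic.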
Looked into the history and found #6487. It seems IOStatus carries more info than Status, "such as error scope, error retryable attribute"; it is used in the bottom I/O layer and passed upward into the write path (including the block-based table). That seems like a legitimate reason to have two separate statuses and two separate SetXX functions.
I am not strongly against the idea of a nested call to SetStatus inside SetIOStatus, although I like them being independent so that there is no side effect when calling SetIOStatus, unless we want to explicitly bless this "side effect" as correct and document it as an invariant in the API, since the name "SetIOStatus" does not make it obvious that it will also SetStatus.
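The two designs under discussion can be sketched as follows (hypothetical classes, not RocksDB's actual `Status`/`IOStatus` implementation): independent setters with no cross-status side effect, versus SetIOStatus enforcing the "non-OK IOStatus implies non-OK Status" invariant via a nested call:

```cpp
#include <cassert>

struct Status {
  bool ok_ = true;
  bool ok() const { return ok_; }
};

struct IOStatus {
  bool ok_ = true;
  bool ok() const { return ok_; }
};

// Design A: independent setters; callers must set both statuses explicitly.
struct RepIndependent {
  Status status;
  IOStatus io_status;
  void SetStatus(const Status& s) { if (status.ok()) status = s; }
  void SetIOStatus(const IOStatus& ios) { if (io_status.ok()) io_status = ios; }
};

// Design B: documented invariant "non-OK IOStatus implies non-OK Status",
// enforced by a nested call inside SetIOStatus.
struct RepCoupled {
  Status status;
  IOStatus io_status;
  void SetStatus(const Status& s) { if (status.ok()) status = s; }
  void SetIOStatus(const IOStatus& ios) {
    if (io_status.ok()) io_status = ios;
    if (!ios.ok()) {
      Status s;
      s.ok_ = false;
      SetStatus(s);  // side effect: the general status degrades too
    }
  }
};
```

Design A keeps each call side-effect-free but risks callers forgetting one of the two calls; Design B trades that for a side effect that must be documented as an API invariant.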
Force-pushed 88e1759 to e0372eb
@hx235 has updated the pull request. You must reimport the pull request before landing.
```cpp
  }
  // TODO:: Should InsertBlockInCompressedCache take into account error from
  // InsertBlockInCache or ignore and overwrite it.
  s = InsertBlockInCompressedCache(block_contents, type, handle);
  if (!s.ok()) {
    r->SetStatus(s);
```
Note: This is one of the two places where we don't do an early return: we move on to `s = InsertBlockInCompressedCache(block_contents, type, handle);` even when `!s.ok()`, and someone left a TODO about it.
Can we resolve the TODO? I think there's no point caching a block for a TableBuilder where non-OK SetStatus() has already happened (i.e., table building already failed).
Yep - let me look into the blame for why the TODO was left.
Update:
```cpp
  } else {
    r->SetIOStatus(io_s);
    r->SetStatus(io_s);
```
Note: This is the second of the two places where we don't do an early return, since the original behavior proceeds to parallel compression despite errors, mostly because when `file->Pad()` fails, parallel compression relies on `r->get_offset()` with no padding added and can still work.
Still, it seems suspicious to me: the original parallel-compression PR may have overlooked the need to stay compatible with our block_align feature and therefore ignored this failure status.
Note: Also, I call SetStatus() before doing parallel compression instead of after, as in the existing behavior. This should be fine, since the subsequent `r->pc_rep->file_size_estimator.ReapBlock` and `r->pc_rep->file_size_estimator.SetEstimatedFileSize(r->get_offset());` touch neither `r->status` nor any condition variables that might trigger subsequent behavior touching `r->status`.
I don't think we need to update file size estimation when padding fails; the whole table building has already been determined to fail.
Yeah - I agree with your note on this. As noted in my PR comment, I suspect proceeding to parallel compression despite a padding error can cause incompatibility with the block_align feature. Will update this.
Right. And higher up in the write path, we always consider the IOStatus, but Status is ignored. So I guess non-IO failures are not considered critical and hence the need to keep them separate.
Usually we use `s` to carry the general status in a function's logic and `io_s` to carry the I/O-related status; `io_s` is a subset of `s`. I'm not sure having multiple `IOStatus io_s` or `Status s` variables in one function helps the general logic.
@zhichao-cao I can indeed see this pattern in other places. But the existing logic in WriteRawBlock of reusing the same `s` and `io_s` gives us many nested branches and makes it hard to reason about or add code on top of. What do you think of naming each `io_s` and `s` differently, as in my revision?
Might be good for the "local" Status or IOStatus. Maybe a shorter name, though; five words like "s_insert_block_in_cache" is too long.
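The naming trade-off being discussed might look like this (a hypothetical sketch; the function, the failure conditions, and the names are illustrative, not RocksDB code): short but step-specific local status names instead of one reused `s`.

```cpp
#include <string>

// Minimal stand-in status: empty message means OK.
struct Status {
  std::string msg;
  bool ok() const { return msg.empty(); }
};

Status CacheBlock(bool cache_ok, bool compressed_cache_ok) {
  // `s_cache` / `s_comp`: short names, yet each failure's origin stays obvious,
  // unlike one `s` reassigned across unrelated steps.
  Status s_cache = cache_ok ? Status{} : Status{"cache insert failed"};
  if (!s_cache.ok()) return s_cache;
  Status s_comp = compressed_cache_ok
                      ? Status{}
                      : Status{"compressed cache insert failed"};
  return s_comp;
}
```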
The API doc for the getters suggests a non-OK IOStatus implies a non-OK Status:
@anand1976 To me this justifies IOStatus but not Status. Hypothetical question: if IOStatus were named something non-IO-specific, would we still need two?
I think the existing changes look good, and didn't see a reason why not to continue the early-return pattern to the remaining cases.
edit: Normally I don't request changes for something that's already an improvement. But the inconsistency in early-return or not - without a reason I could understand - felt too apparent after the changes.
edit 2: Also I might end up too scared to accept in any case because the risk might be too high for me. This code path writes every block, after all...
TODO:
db_bench might not be enough; I agree with @ajkr that this code path is critical, and any logical change might result in a corner-case failure.
Force-pushed a0cbdd0 → c4c7e7d → 5200ebe → 44b622f → 65ec3e5
Update:
As for any performance concern on the very hot write path WriteRawBlock: is there any test I could run other than db_bench, @ajkr @pdillinger? By the way, this PR is considered non-blocking to my #9342; it's just easier to read and write #9342's change to WriteRawBlock on top of this refactoring.
```cpp
  if (!s.ok()) {
    r->SetStatus(s);
    return;
```
@akankshamahajan15 I saw you worked on [cache warming](#8242) and left the "// TODO:: Should InsertBlockInCompressedCache take into account error from .. ".
Do you foresee any problem with removing the TODO and returning earlier on `InsertBlockInCacheHelper`'s error, as in my change here, since we are pushing for an early-return-on-error coding paradigm in WriteRawBlock in this PR? Thanks!
LGTM, with suggestion for follow-up change (not high priority)
```cpp
  }
  // TODO:: Should InsertBlockInCompressedCache take into account error from
  // InsertBlockInCache or ignore and overwrite it.
  s = InsertBlockInCompressedCache(block_contents, type, handle);
  if (!s.ok()) {
    r->SetStatus(s);
```
IMO we should not be failing table construction on failure to insert in cache, but that is the old behavior and OK not to change it here and now. We wouldn't want transient memory pressure to abort a giant compaction.
(So IMO, whether compressed cache insertion is conditional on regular cache insertion is unimportant. We should proceed despite cache insertion errors, ideally logging max 1 warning per table builder about failed cache insertion.)
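The suggested follow-up behavior can be sketched roughly as follows (hypothetical code, not the actual RocksDB change): treat block-cache insertion failure as non-fatal and log at most one warning per table builder instead of failing the whole build.

```cpp
#include <string>
#include <vector>

// Minimal stand-in status.
struct Status {
  bool ok_ = true;
  bool ok() const { return ok_; }
};

struct TableBuilderSketch {
  bool warned_cache_insert_failure = false;
  std::vector<std::string> warnings;

  // The block is already written to the file; the cache is only an
  // optimization, so a failed insertion should not abort table construction.
  void OnCacheInsert(const Status& s) {
    if (!s.ok() && !warned_cache_insert_failure) {
      warned_cache_insert_failure = true;  // warn at most once per builder
      warnings.push_back("block cache insertion failed");
    }
    // ... continue building the table regardless of s ...
  }
};
```

This keeps transient memory pressure from aborting a large compaction while still surfacing the condition once in the logs.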
Opened task
You can run a db_bench fillseq test to be safe, but the changes are mostly cosmetic. The set of code paths is nearly unchanged.
Ran it and saw a slight improvement after the change.
Summary: Note: rebase on and merge after #9349, #9345, (optional) #9393

**Context:** (Quoted from pdillinger) Layers of information during new Bloom/Ribbon filter construction in building block-based tables include the following:

a) set of keys to add to filter
b) set of hashes to add to filter (64-bit hash applied to each key)
c) set of Bloom indices to set in filter, with duplicates
d) set of Bloom indices to set in filter, deduplicated
e) final filter and its checksum

This PR aims to detect corruption (e.g., unexpected hardware/software corruption of data structures residing in memory for a long time) from b) to e), and leaves a) as future work for the application level.

- b)'s corruption is detected by verifying the xor checksum of the hash entries, calculated as the entries accumulate before being added to the filter (i.e., `XXPH3FilterBitsBuilder::MaybeVerifyHashEntriesChecksum()`).
- c)-e)'s corruption is detected by verifying that the hash entries indeed exist in the constructed filter, by re-querying these hash entries in the filter (i.e., `FilterBitsBuilder::MaybePostVerify()`) after computing the block checksum (except for PartitionFilter, where it is done right after each `FilterBitsBuilder::Finish` for impl simplicity; see code comment for more). For this stage of detection, we assume hash entries are not corrupted after the check in b), since the time interval from b) to c) is relatively short.

The option to enable this detection is `BlockBasedTableOptions::detect_filter_construct_corruption`, which is false by default.

**Summary:**
- Implemented new functions `XXPH3FilterBitsBuilder::MaybeVerifyHashEntriesChecksum()` and `FilterBitsBuilder::MaybePostVerify()`
- Ensured hash entries, the final filter, the banding, and their [cache reservation](#9073) are released properly despite corruption
  - See [Filter.construction.artifacts.release.point.pdf](https://github.com/facebook/rocksdb/files/7923487/Design.Filter.construction.artifacts.release.point.pdf) for the high-level design
  - Bundled and refactored the hash-entry-related artifacts in XXPH3FilterBitsBuilder into `HashEntriesInfo` for better control of their lifetime during `SwapEntires` and `ResetEntries`
- Ensured the RocksDB block-based table builder calls `FilterBitsBuilder::MaybePostVerify()` after constructing the filter via `FilterBitsBuilder::Finish()`
- When encountering such filter construction corruption, stop writing the filter content to files and mark the block-based table build non-OK by storing the corruption status in the builder

Pull Request resolved: #9342

Test Plan:
- Added new unit test `DBFilterConstructionCorruptionTestWithParam.DetectCorruption`
- Included this new feature in `DBFilterConstructionReserveMemoryTestWithParam.ReserveMemory`, as this feature heavily touches ReserveMemory's impl
- For the fallback case, ran `./filter_bench -impl=3 -detect_filter_construct_corruption=true -reserve_table_builder_memory=true -strict_capacity_limit=true -quick -runs 10 | grep 'Build avg'` to make sure nothing breaks
- Added to `filter_bench`: increased filter construction time by **30%**, mostly from `MaybePostVerify()`
  - FastLocalBloom
    - Before change: `./filter_bench -impl=2 -quick -runs 10 | grep 'Build avg'`: **28.86643s**
    - After change:
      - `./filter_bench -impl=2 -detect_filter_construct_corruption=false -quick -runs 10 | grep 'Build avg'` (expect a tiny increase since MaybePostVerify is always called regardless): **27.6644s** (the -4% perf improvement might be because we now drop bloom hash entries in bulk rather than along the iteration in `AddAllEntries`; same for the bypassing-MaybePostVerify case below)
      - `./filter_bench -impl=2 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'` (expect an acceptable increase): **34.41159s** (+20%)
      - `./filter_bench -impl=2 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'` (bypassing MaybePostVerify, expect a minor increase): **27.13431s** (-6%)
  - Standard128Ribbon
    - Before change: `./filter_bench -impl=3 -quick -runs 10 | grep 'Build avg'`: **122.5384s**
    - After change:
      - `./filter_bench -impl=3 -detect_filter_construct_corruption=false -quick -runs 10 | grep 'Build avg'` (expect a tiny increase; verified by removing MaybePostVerify under this case and finding only a ±1 ns difference): **124.3588s** (+2%)
      - `./filter_bench -impl=3 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'` (expect an acceptable increase): **159.4946s** (+30%)
      - `./filter_bench -impl=3 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'` (bypassing MaybePostVerify, expect a minor increase): **125.258s** (+2%)
- Added to `db_stress`: `make crash_test`, `./db_stress --detect_filter_construct_corruption=true`
- Manually smoke-tested: manually corrupted the filter construction in some DB-level tests with basic Put and background flush; as expected, the error did get returned to users in subsequent Put and Flush statuses

Reviewed By: pdillinger
Differential Revision: D33746928
Pulled By: hx235
fbshipit-source-id: cb056426be5a7debc1cd16f23bc250f36a08ca57
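The xor-checksum idea for stage b) above can be sketched as follows (hypothetical code, not RocksDB's `XXPH3FilterBitsBuilder`): XOR all 64-bit hash entries together as they accumulate, then verify that the stored entries still XOR to the same value before building the filter.

```cpp
#include <cstdint>
#include <vector>

class HashEntryCollector {
 public:
  void AddHash(uint64_t h) {
    entries_.push_back(h);
    xor_checksum_ ^= h;  // cheap running checksum, order-independent
  }

  // Detects in-memory corruption of the accumulated entries: recompute the
  // XOR over the stored entries and compare with the running checksum.
  bool VerifyChecksum() const {
    uint64_t x = 0;
    for (uint64_t h : entries_) x ^= h;
    return x == xor_checksum_;
  }

  std::vector<uint64_t>& entries() { return entries_; }

 private:
  std::vector<uint64_t> entries_;
  uint64_t xor_checksum_ = 0;
};
```

An XOR checksum catches any single-bit flip in the stored entries; it costs one XOR per added entry, which fits the hot filter-construction path.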
Context:
Inside `BlockBasedTableBuilder::WriteRawBlock`, there are multiple places that change the local variables `io_s` and `s` while also depending on them. This PR attempts to clarify the relevant logic so that it is easier to read, and so that future changes to these local variables (like #9342) can be added without changing the current behavior.

Summary:
- Consolidated the handling of `io_s` and `s` as much as possible, avoiding if-else branches by returning early

Test:
`./db_bench --benchmarks="fillseq"` x 5