
[TechDebt] Clarify status-handling logic in BlockBasedTableBuilder::WriteRawBlock #9393

Closed
wants to merge 2 commits

Conversation

hx235
Contributor

@hx235 hx235 commented Jan 17, 2022

Context:
Inside BlockBasedTableBuilder::WriteRawBlock, there are multiple places that modify the local variables io_s and s while also depending on them. This PR attempts to clarify the relevant logic so that it is easier to read and to add new modifications of these local variables later (as in #9342) without changing the current behavior.

Summary:

  • Shorten the lifetimes of the local variables io_s and s as much as possible, using early returns to avoid nested if-else branches
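The lifetime-shortening pattern can be sketched as follows. This is an illustration only: Status, Rep, and WriteSteps are minimal stand-ins, not RocksDB's real classes.

```cpp
#include <cassert>
#include <string>

// Minimal stand-ins for illustration only; not RocksDB's real classes.
struct Status {
  bool ok_;
  std::string msg_;
  bool ok() const { return ok_; }
  static Status OK() { return Status{true, ""}; }
  static Status IOError(const std::string& m) { return Status{false, m}; }
};

// Hypothetical sink mirroring the builder's SetStatus (keeps the first error).
struct Rep {
  Status saved = Status::OK();
  void SetStatus(const Status& s) {
    if (saved.ok()) {
      saved = s;
    }
  }
};

// Each step owns a short-lived status in its own scope and returns early
// on failure, so no later branch depends on stale state.
void WriteSteps(Rep* r, bool fail_append, bool fail_flush) {
  {
    Status s = fail_append ? Status::IOError("append") : Status::OK();
    if (!s.ok()) {
      r->SetStatus(s);
      return;  // early return: `s` never leaks into later steps
    }
  }
  {
    Status s = fail_flush ? Status::IOError("flush") : Status::OK();
    if (!s.ok()) {
      r->SetStatus(s);
      return;
    }
  }
}
```

Each scoped block replaces one level of the original if-else nesting, and the first failure wins, mirroring the keep-first-error convention.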

Test

  • Reasoned against the original behavior to verify that the new changes do not break existing behavior.
  • Rely on CI tests since we are not changing the current behavior.
  • db_bench ./db_bench --benchmarks="fillseq" X 5:
    • before the change: fillseq: 3.6086 micros/op, 282474.6 ops/sec, 31.24 MB/s
    • after the change: fillseq: 3.4078 micros/op, 293929.2 ops/sec, 32.52 MB/s

@hx235 hx235 marked this pull request as draft January 17, 2022 23:15
@facebook-github-bot
Contributor

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@hx235 hx235 changed the title [Draft] Clarify status-handling logic in BlockBasedTableBuilder::WriteRawBlock [TechDebt] Clarify status-handling logic in BlockBasedTableBuilder::WriteRawBlock Jan 18, 2022
@hx235 hx235 requested a review from pdillinger January 18, 2022 00:13
@hx235 hx235 marked this pull request as ready for review January 18, 2022 00:32
@pdillinger
Contributor

Wow, the existing code is nightmarish. Your changes are a small improvement.

  • It looks like all callers to SetIOStatus also call SetStatus immediately after. Why not instead make SetIOStatus call SetStatus? I don't really know why we even have separate state for them. @anand1976 do you know?
  • Why do we need Status s and IOStatus io_s variables that live for the whole function? Can they be temporaries with minimum lifetime, as in
{
  IOStatus io_s = r->file->Append(block_contents);
  if (!io_s.ok()) {
    SetIOStatus(io_s);
    // SetStatus would go here if not implied by SetIOStatus
    return;
  }
}

This sort of approach should greatly reduce the nesting depth (if not also cyclomatic complexity). The early return idiom might not be appropriate in cases where there's significant "clean up" not easily handled with RAII, but here it seems appropriate.
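For the cases with nontrivial cleanup mentioned above, a small scope guard can keep the early-return idiom workable. This is a generic sketch under that assumption, not something from the RocksDB codebase:

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Generic sketch: runs a cleanup action on every exit path unless dismissed,
// so early returns cannot accidentally skip the cleanup.
class ScopeGuard {
 public:
  explicit ScopeGuard(std::function<void()> f) : f_(std::move(f)) {}
  ~ScopeGuard() {
    if (armed_) f_();
  }
  void Dismiss() { armed_ = false; }  // cancel cleanup, e.g. on success
  ScopeGuard(const ScopeGuard&) = delete;
  ScopeGuard& operator=(const ScopeGuard&) = delete;

 private:
  std::function<void()> f_;
  bool armed_ = true;
};

int cleanup_runs = 0;  // counts how many times the cleanup action ran

bool DoWork(bool fail_midway) {
  ScopeGuard guard([] { ++cleanup_runs; });
  if (fail_midway) {
    return false;  // early return; guard still performs the cleanup
  }
  return true;  // guard also runs here (cleanup needed on all paths)
}
```

This is the same idea as `gsl::finally` from the C++ Core Guidelines; with RAII-style guards in place, the "significant clean up" caveat above mostly disappears.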

@hx235
Contributor Author

hx235 commented Jan 19, 2022

@pdillinger

Why do we need Status s and IOStatus io_s variables that live for the whole function? Can they be temporaries with minimum lifetime, as in

Good suggestion! Fixed this, although there are two cases where we can't do an early return because some existing behavior continues despite errors (noted in PR comments). I don't know why, but that's the current behavior.

I don't really know why we even have separate state for them. @anand1976 do you know?

Looked into the history and found #6487. It seems IOStatus carries more info than Status, "such as error scope, error retryable attribute", and it was used in the bottom IO layer and passed upward into the write path (including the block-based table). That seems to be a legitimate reason to have two separate statuses and two separate SetXX functions.

It looks like all callers to SetIOStatus also call SetStatus immediately after. Why not instead make SetIOStatus call SetStatus?

I am not strongly against the idea of a nested call to SetStatus inside SetIOStatus, although I like them being independent so that there is no side effect when calling SetIOStatus, unless we want to explicitly make this "side effect" a correct effect and document it as an invariant in the API, since "SetIOStatus" does not make it obvious that it will also SetStatus.
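If the team did decide to fold SetStatus into SetIOStatus, the documented invariant could look like this hypothetical sketch, where boolean flags stand in for the real Status/IOStatus objects:

```cpp
#include <cassert>

// Hypothetical sketch only; not RocksDB's actual classes or API.
struct IOStatus {
  bool ok_;
  bool ok() const { return ok_; }
};

struct Builder {
  bool status_ok = true;     // stands in for the general Status
  bool io_status_ok = true;  // stands in for the IOStatus

  // Documented invariant: SetIOStatus with a non-OK status also marks
  // the general Status non-OK, so callers no longer pair it with SetStatus.
  void SetIOStatus(const IOStatus& io_s) {
    if (!io_s.ok()) {
      io_status_ok = false;
      status_ok = false;  // the explicit, documented "side effect"
    }
  }
};
```

Making the coupling part of the API contract, rather than an implicit side effect, addresses the discoverability concern raised above.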

@hx235 hx235 force-pushed the write_raw_block_status_clarify branch from 88e1759 to e0372eb Compare January 19, 2022 04:34
@facebook-github-bot
Contributor

@hx235 has updated the pull request. You must reimport the pull request before landing.

}
// TODO:: Should InsertBlockInCompressedCache take into account error from
// InsertBlockInCache or ignore and overwrite it.
s = InsertBlockInCompressedCache(block_contents, type, handle);
if (!s.ok()) {
r->SetStatus(s);
Contributor Author

@hx235 hx235 Jan 19, 2022

Note:
This is one of the two places where we don't do an early return, since we move on to s = InsertBlockInCompressedCache(block_contents, type, handle); even when !s.ok(), and someone left a TODO about it.

Contributor

Can we resolve the TODO? I think there's no point caching a block for a TableBuilder where non-OK SetStatus() has already happened (i.e., table building already failed).

Contributor Author

Yep, and let me look into the blame to see why the TODO was left.

@facebook-github-bot
Contributor

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@hx235
Contributor Author

hx235 commented Jan 19, 2022

Update:

  • Adopted suggestion in shortening the lifetime of local variable io_s and s
  • Left some PR comment and note responding to Peter's questions
  • Ready for review again

@facebook-github-bot
Contributor

@hx235 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Contributor

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

} else {
r->SetIOStatus(io_s);
r->SetStatus(io_s);
Contributor Author

Note:
This is the second of the two places where we don't do an early return, as the original behavior proceeds to parallel compression despite errors, mostly because in the case where file->Pad() fails, parallel compression relies on r->get_offset() with no padding added and can still work.

Although it smells suspicious to me that the original parallel-compression PR might have overlooked the need to be compatible with our block_align feature and thus ignored this failure status.

Note:
Also, I call SetStatus() before doing parallel compression instead of after, unlike the existing behavior. This should be fine because the subsequent r->pc_rep->file_size_estimator.ReapBlock and r->pc_rep->file_size_estimator.SetEstimatedFileSize(r->get_offset()) do not touch r->status, nor any condition variables that might trigger subsequent behavior touching r->status.

Contributor

I don't think we need to update file size estimation when padding fails; the whole table building has already been determined to fail.

Contributor Author

Yeah, I agree with your note on this. As noted in my PR comment, I suspect proceeding to parallel compression despite a padding error can cause incompatibility with the block_align feature. Will update this.

@facebook-github-bot
Contributor

@hx235 has updated the pull request. You must reimport the pull request before landing.

@anand1976
Contributor

I don't really know why we even have separate state for them. @anand1976 do you know?

Looked into the history and found #6487. It seems IOStatus carries more info than Status, "such as error scope, error retryable attribute", and it was used in the bottom IO layer and passed upward into the write path (including the block-based table). That seems to be a legitimate reason to have two separate statuses and two separate SetXX functions.

Right. And higher up in the write path, we always consider the IOStatus, but Status is ignored. So I guess non-IO failures are not considered critical and hence the need to keep them separate.

It looks like all callers to SetIOStatus also call SetStatus immediately after. Why not instead make SetIOStatus call SetStatus?

I am not strongly against the idea of a nested call to SetStatus inside SetIOStatus, although I like them being independent so that there is no side effect when calling SetIOStatus, unless we want to explicitly make this "side effect" a correct effect and document it as an invariant in the API, since "SetIOStatus" does not make it obvious that it will also SetStatus.

@facebook-github-bot
Contributor

@hx235 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Contributor

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Contributor

@zhichao-cao zhichao-cao left a comment

Usually, we use s to carry the general status in a function's logic and io_s to carry the I/O-related status; io_s is a subset of s. I'm not sure whether having multiple IOStatus io_s or Status s variables in one function helps the general logic.

@hx235
Contributor Author

hx235 commented Jan 20, 2022

@zhichao-cao I can indeed see this pattern in other places. But the existing logic in WriteRawBlock of reusing the same s and io_s gives us many nested branches and makes it hard to reason about, or add code on top of, the existing ifs.

What do you think of naming each io_s and s differently in my revision to WriteRawBlock, refraining from reusing the same names s and io_s, like:
IOStatus io_s_append_trailer = r->file->Append(Slice(trailer.data(), trailer.size()));
Status s_insert_block_in_cache = InsertBlockInCacheHelper(block_contents, handle, block_type, is_top_level_filter_block);

@zhichao-cao
Contributor

What do you think of naming each io_s and s differently in my revision to WriteRawBlock and refrain from using same naming of s and io_s, like:
IOStatus io_s_append_trailer = r->file->Append(Slice(trailer.data(), trailer.size()));
Status s_insert_block_in_cache = InsertBlockInCacheHelper(block_contents, handle, block_type, is_top_level_filter_block);

Might be good for the "local" Status or IOStatus. Maybe a shorter name; five words like "s_insert_block_in_cache" is too long.

@ajkr
Contributor

ajkr commented Jan 20, 2022

I am not strongly against the idea of a nested call to SetStatus inside SetIOStatus, although I like them being independent so that there is no side effect when calling SetIOStatus, unless we want to explicitly make this "side effect" a correct effect and document it as an invariant in the API, since "SetIOStatus" does not make it obvious that it will also SetStatus.

The API docs for the getters suggest that a non-OK IOStatus implies a non-OK Status:

  // Return non-ok iff some error has been detected.
  Status status() const override;

  // Return non-ok iff some error happens during IO.
  IOStatus io_status() const override;

Right. And higher up in the write path, we always consider the IOStatus, but Status is ignored. So I guess non-IO failures are not considered critical and hence the need to keep them separate.

@anand1976 To me this justifies IOStatus but not Status. Hypothetical question: if IOStatus were named something non-IO-specific, would we still need two?

Contributor

@ajkr ajkr left a comment

I think the existing changes look good, and I didn't see a reason not to continue the early-return pattern into the remaining cases.

edit: Normally I don't request changes for something that's already an improvement. But the inconsistency in whether to early-return, without a reason I could understand, felt too apparent after the changes.

edit 2: Also, I might end up too scared to accept in any case because the risk might be too high for me. This code path writes every block, after all...


@hx235
Contributor Author

hx235 commented Jan 20, 2022

TODO:

  • Investigate the remaining two non-early-return cases
  • Run db_bench for performance since the change is in the write path
  • Address the rest of the comments

@zhichao-cao
Contributor

TODO:

  • Investigate the remaining two non-early-return cases
  • Run db_bench for performance since the change is in the write path
  • Address the rest of the comments

db_bench might not be enough; agree with @ajkr, this code path is critical, and any logical change might result in a corner-case failure...

@hx235 hx235 force-pushed the write_raw_block_status_clarify branch from a0cbdd0 to c4c7e7d Compare January 20, 2022 23:13
@facebook-github-bot
Contributor

@hx235 has updated the pull request. You must reimport the pull request before landing.

@hx235 hx235 force-pushed the write_raw_block_status_clarify branch from c4c7e7d to 5200ebe Compare January 22, 2022 14:52
@facebook-github-bot
Contributor

@hx235 has updated the pull request. You must reimport the pull request before landing.

@hx235 hx235 force-pushed the write_raw_block_status_clarify branch from 5200ebe to 44b622f Compare January 24, 2022 17:53
@facebook-github-bot
Contributor

@hx235 has updated the pull request. You must reimport the pull request before landing.

@hx235 hx235 force-pushed the write_raw_block_status_clarify branch from 44b622f to 65ec3e5 Compare January 24, 2022 19:02
@facebook-github-bot
Contributor

@hx235 has updated the pull request. You must reimport the pull request before landing.

@hx235
Contributor Author

hx235 commented Jan 24, 2022

Update:

  • Early-returned the remaining two cases that I didn't handle in my last update, as doing so changes the existing behavior; tagged @akankshamahajan15 to review the cache-warming case
  • Set Status in SetIOStatus and added a comment

As for any performance concern on the very hot write path WriteRawBlock, is there any test I could do other than db_bench @ajkr @pdillinger? By the way, this PR is considered non-blocking to my #9342; it's just easier to read/write #9342's change to WriteRawBlock on top of this refactoring.

if (!s.ok()) {
r->SetStatus(s);
return;
Contributor Author

@hx235 hx235 Jan 24, 2022

@akankshamahajan15 I saw you worked on cache warming (#8242) and left the "// TODO:: Should InsertBlockInCompressedCache take into account error from .." comment.

Do you foresee any problem with removing the TODO and returning early on InsertBlockInCacheHelper's error, as in my change here, since we are pushing for an early-return-on-error coding paradigm in WriteRawBlock in this PR? Thanks!

@facebook-github-bot
Contributor

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@hx235 hx235 requested a review from ajkr January 24, 2022 19:08
Contributor

@pdillinger pdillinger left a comment

LGTM, with suggestion for follow-up change (not high priority)

}
// TODO:: Should InsertBlockInCompressedCache take into account error from
// InsertBlockInCache or ignore and overwrite it.
s = InsertBlockInCompressedCache(block_contents, type, handle);
if (!s.ok()) {
r->SetStatus(s);
Contributor

IMO we should not be failing table construction on a failure to insert into the cache, but that is the old behavior and OK not to change here and now. We wouldn't want transient memory pressure to abort a giant compaction.

(So IMO, whether compressed-cache insertion is conditional on regular cache insertion is unimportant. We should proceed despite cache-insertion errors, ideally logging at most one warning per table builder about failed cache insertion.)

Contributor Author

Opened task

@pdillinger
Contributor

As for any performance concern on the very hot write path WriteRawBlock, is there any test I could do other than db_bench @ajkr @pdillinger?

You can run a db_bench fillseq test to be safe, but the changes are mostly cosmetic. The set of code paths is nearly unchanged.

@hx235
Contributor Author

hx235 commented Jan 26, 2022

As for any performance concern on the very hot write path WriteRawBlock, is there any test I could do other than db_bench @ajkr @pdillinger?

You can run a db_bench fillseq test to be safe, but the changes are mostly cosmetic. The set of code paths is nearly unchanged.

Ran it and saw a slight improvement after the change.

facebook-github-bot pushed a commit that referenced this pull request Feb 2, 2022
Summary:
Note: rebase on and merge after #9349, #9345, (optional) #9393
**Context:**
(Quoted from pdillinger) Layers of information during new Bloom/Ribbon Filter construction in building block-based tables includes the following:
a) set of keys to add to filter
b) set of hashes to add to filter (64-bit hash applied to each key)
c) set of Bloom indices to set in filter, with duplicates
d) set of Bloom indices to set in filter, deduplicated
e) final filter and its checksum

This PR aims to detect corruption (e.g., unexpected hardware/software corruption of data structures residing in memory for a long time) from b) to e), and leaves a) as future work at the application level.
- b)'s corruption is detected by verifying the XOR checksum of the hash entries, calculated as the entries accumulate before being added to the filter (i.e., `XXPH3FilterBitsBuilder::MaybeVerifyHashEntriesChecksum()`).
- c) - e)'s corruption is detected by verifying that the hash entries indeed exist in the constructed filter by re-querying them in the filter (i.e., `FilterBitsBuilder::MaybePostVerify()`) after computing the block checksum (except for PartitionFilter, where it is done right after each `FilterBitsBuilder::Finish` for implementation simplicity; see the code comment for more). For this stage of detection, we assume hash entries are not corrupted after the check in b), since the time interval from b) to c) is relatively short IMO.

Option to enable this feature of detection is `BlockBasedTableOptions::detect_filter_construct_corruption` which is false by default.

**Summary:**
- Implemented new functions `XXPH3FilterBitsBuilder::MaybeVerifyHashEntriesChecksum()` and `FilterBitsBuilder::MaybePostVerify()`
- Ensured hash entries, final filter and banding and their [cache reservation ](#9073) are released properly despite corruption
   - See [Filter.construction.artifacts.release.point.pdf ](https://github.com/facebook/rocksdb/files/7923487/Design.Filter.construction.artifacts.release.point.pdf) for high-level design
   -  Bundled and refactored the hash-entry-related artifacts in XXPH3FilterBitsBuilder into `HashEntriesInfo` for better control of the lifetime of these artifacts during `SwapEntires`, `ResetEntries`
- Ensured RocksDB block-based table builder calls `FilterBitsBuilder::MaybePostVerify()` after constructing the filter by `FilterBitsBuilder::Finish()`
- When encountering such filter construction corruption, stop writing the filter content to files and mark such a block-based table building non-ok by storing the corruption status in the builder.

Pull Request resolved: #9342

Test Plan:
- Added new unit test `DBFilterConstructionCorruptionTestWithParam.DetectCorruption`
- Included this new feature in `DBFilterConstructionReserveMemoryTestWithParam.ReserveMemory` as this feature heavily touch ReserveMemory's impl
   - For the fallback case, I ran `./filter_bench -impl=3 -detect_filter_construct_corruption=true -reserve_table_builder_memory=true -strict_capacity_limit=true  -quick -runs 10 | grep 'Build avg'` to make sure nothing breaks.
- Added to `filter_bench`: increased filter construction time by **30%**, mostly by `MaybePostVerify()`
   -  FastLocalBloom
       - Before change: `./filter_bench -impl=2 -quick -runs 10 | grep 'Build avg'`: **28.86643s**
       - After change:
          -  `./filter_bench -impl=2 -detect_filter_construct_corruption=false -quick -runs 10 | grep 'Build avg'` (expect a tiny increase due to MaybePostVerify is always called regardless): **27.6644s (-4% perf improvement might be due to now we don't drop bloom hash entry in `AddAllEntries` along iteration but in bulk later, same with the bypassing-MaybePostVerify case below)**
          - `./filter_bench -impl=2 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'` (expect acceptable increase): **34.41159s (+20%)**
          - `./filter_bench -impl=2 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'` (by-passing MaybePostVerify, expect minor increase): **27.13431s (-6%)**
    -  Standard128Ribbon
       - Before change: `./filter_bench -impl=3 -quick -runs 10 | grep 'Build avg'`: **122.5384s**
       - After change:
          - `./filter_bench -impl=3 -detect_filter_construct_corruption=false -quick -runs 10 | grep 'Build avg'` (expect a tiny increase due to MaybePostVerify is always called regardless - verified by removing MaybePostVerify under this case and found only +-1ns difference): **124.3588s (+2%)**
          - `./filter_bench -impl=3 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'`(expect acceptable increase): **159.4946s (+30%)**
          - `./filter_bench -impl=3 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'`(by-passing MaybePostVerify, expect minor increase) : **125.258s (+2%)**
- Added to `db_stress`: `make crash_test`, `./db_stress --detect_filter_construct_corruption=true`
- Manually smoke-tested: manually corrupted the filter construction in some db level tests with basic PUT and background flush. As expected, the error did get returned to users in subsequent PUT and Flush status.

Reviewed By: pdillinger

Differential Revision: D33746928

Pulled By: hx235

fbshipit-source-id: cb056426be5a7debc1cd16f23bc250f36a08ca57