[TechDebt] Clarify status-handling logic in BlockBasedTableBuilder::WriteRawBlock #9393
Conversation
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Wow, the existing code is nightmarish. Your changes are a small improvement.
This sort of approach should greatly reduce the nesting depth (if not also the cyclomatic complexity). The early-return idiom might not be appropriate where there's significant cleanup not easily handled with RAII, but here it seems appropriate.
Good suggestion! Fixed, although there are two cases where we can't return early because some existing behavior continues despite errors (noted in the PR comments). I don't know why, but that's the current behavior.
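To make the suggestion concrete, here is a minimal sketch (hypothetical types and functions, not the actual RocksDB code) contrasting the nested status-checking style with the early-return style being adopted:

```cpp
#include <cassert>

// Stand-ins for RocksDB's Status and the builder's Rep; illustrative only.
struct Status {
  bool ok_ = true;
  static Status OK() { return Status(); }
  static Status IOError() { Status s; s.ok_ = false; return s; }
  bool ok() const { return ok_; }
};

struct Rep {
  Status status;
  // First error wins; later errors don't overwrite it.
  void SetStatus(const Status& s) { if (!s.ok() && status.ok()) status = s; }
};

// Nested style: every failure check adds a level of indentation.
void WriteNested(Rep* r, bool append_ok, bool pad_ok) {
  Status s = append_ok ? Status::OK() : Status::IOError();
  if (s.ok()) {
    s = pad_ok ? Status::OK() : Status::IOError();
    if (s.ok()) {
      // ... more work, more nesting ...
    } else {
      r->SetStatus(s);
    }
  } else {
    r->SetStatus(s);
  }
}

// Early-return style: failure paths exit immediately; the happy path stays flat.
void WriteEarlyReturn(Rep* r, bool append_ok, bool pad_ok) {
  Status s = append_ok ? Status::OK() : Status::IOError();
  if (!s.ok()) { r->SetStatus(s); return; }
  s = pad_ok ? Status::OK() : Status::IOError();
  if (!s.ok()) { r->SetStatus(s); return; }
  // ... more work at the same indentation level ...
}
```

Both functions record the same status; only the control-flow shape differs, which is why the refactor can be mostly cosmetic.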
Looked into the history and found #6487. It seems IOStatus carries more info than Status, "such as error scope, error retryable attribute"; it is used in the bottom I/O layer and passed upward into the write path (including the block-based table). That seems like a legitimate reason to have two separate statuses and two separate SetXX functions.
I am not strongly against the idea of a nested call to SetStatus inside SetIOStatus, although I like them being independent so that there is no side effect when calling SetIOStatus, unless we want to explicitly bless this "side effect" as correct and document it as an invariant in the API, since the name "SetIOStatus" does not make it obvious that it will also SetStatus.
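The two designs under discussion can be sketched as follows (hypothetical classes, not RocksDB's actual `Status`/`IOStatus` implementation): independent setters with no cross-status side effect, versus SetIOStatus enforcing the "non-OK IOStatus implies non-OK Status" invariant via a nested call:

```cpp
#include <cassert>

struct Status {
  bool ok_ = true;
  bool ok() const { return ok_; }
};

struct IOStatus {
  bool ok_ = true;
  bool ok() const { return ok_; }
};

// Design A: independent setters; callers must set both statuses explicitly.
struct RepIndependent {
  Status status;
  IOStatus io_status;
  void SetStatus(const Status& s) { if (status.ok()) status = s; }
  void SetIOStatus(const IOStatus& ios) { if (io_status.ok()) io_status = ios; }
};

// Design B: documented invariant "non-OK IOStatus implies non-OK Status",
// enforced by a nested call inside SetIOStatus.
struct RepCoupled {
  Status status;
  IOStatus io_status;
  void SetStatus(const Status& s) { if (status.ok()) status = s; }
  void SetIOStatus(const IOStatus& ios) {
    if (io_status.ok()) io_status = ios;
    if (!ios.ok()) {
      Status s;
      s.ok_ = false;
      SetStatus(s);  // side effect: the general status degrades too
    }
  }
};
```

Design A keeps each call side-effect-free but risks callers forgetting one of the two calls; Design B trades that for a side effect that must be documented as an API invariant.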
Force-pushed 88e1759 to e0372eb
@hx235 has updated the pull request. You must reimport the pull request before landing.
```cpp
  }
  // TODO:: Should InsertBlockInCompressedCache take into account error from
  // InsertBlockInCache or ignore and overwrite it.
  s = InsertBlockInCompressedCache(block_contents, type, handle);
  if (!s.ok()) {
    r->SetStatus(s);
```
Note: This is one of the two places where we don't do an early return: we move on to `s = InsertBlockInCompressedCache(block_contents, type, handle);` even when `!s.ok()`, and someone left a TODO about it.
Can we resolve the TODO? I think there's no point caching a block for a TableBuilder where non-OK SetStatus() has already happened (i.e., table building already failed).
Yep - let me look into the blame for why the TODO was left.
Update:
```cpp
  } else {
    r->SetIOStatus(io_s);
    r->SetStatus(io_s);
```
Note: This is the second of the two places where we don't do an early return, since the original behavior proceeds to parallel compression despite errors, mostly because when `file->Pad()` fails, parallel compression relies on `r->get_offset()` with no padding added and can still work.
Still, it seems suspicious to me: the original parallel-compression PR may have overlooked the need to stay compatible with our block_align feature and therefore ignored this failure status.
Note: Also, I call SetStatus() before doing parallel compression instead of after, as in the existing behavior. This should be fine, since the subsequent `r->pc_rep->file_size_estimator.ReapBlock` and `r->pc_rep->file_size_estimator.SetEstimatedFileSize(r->get_offset());` touch neither `r->status` nor any condition variables that might trigger subsequent behavior touching `r->status`.
I don't think we need to update file size estimation when padding fails; the whole table building has already been determined to fail.
Yeah - I agree with your note on this. As noted in my PR comment, I suspect proceeding to parallel compression despite a padding error can cause incompatibility with the block_align feature. Will update this.
Right. And higher up in the write path, we always consider the IOStatus, but Status is ignored. So I guess non-IO failures are not considered critical and hence the need to keep them separate.
Usually we use `s` to carry the general status in a function's logic and `io_s` to carry the I/O-related status; `io_s` is a subset of `s`. I'm not sure having multiple `IOStatus io_s` or `Status s` variables in one function helps the general logic.
@zhichao-cao I can indeed see this pattern in other places. But the existing logic in WriteRawBlock of reusing the same `s` and `io_s` gives us many nested branches and makes it hard to reason about or add code on top of. What do you think of naming each `io_s` and `s` differently, as in my revision?
Might be good for the "local" Status or IOStatus. Maybe a shorter name, though; five words like "s_insert_block_in_cache" is too long.
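The naming trade-off being discussed might look like this (a hypothetical sketch; the function, the failure conditions, and the names are illustrative, not RocksDB code): short but step-specific local status names instead of one reused `s`.

```cpp
#include <string>

// Minimal stand-in status: empty message means OK.
struct Status {
  std::string msg;
  bool ok() const { return msg.empty(); }
};

Status CacheBlock(bool cache_ok, bool compressed_cache_ok) {
  // `s_cache` / `s_comp`: short names, yet each failure's origin stays obvious,
  // unlike one `s` reassigned across unrelated steps.
  Status s_cache = cache_ok ? Status{} : Status{"cache insert failed"};
  if (!s_cache.ok()) return s_cache;
  Status s_comp = compressed_cache_ok
                      ? Status{}
                      : Status{"compressed cache insert failed"};
  return s_comp;
}
```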
The API doc for the getters suggests a non-OK IOStatus implies a non-OK Status:
@anand1976 To me this justifies IOStatus but not Status. Hypothetical question: if IOStatus were named something non-IO-specific, would we still need two?
I think the existing changes look good, and didn't see a reason why not to continue the early-return pattern to the remaining cases.
edit: Normally I don't request changes for something that's already an improvement. But the inconsistency in early-return or not - without a reason I could understand - felt too apparent after the changes.
edit 2: Also I might end up too scared to accept in any case because the risk might be too high for me. This code path writes every block, after all...
TODO:
db_bench might not be enough; I agree with @ajkr that this code path is critical, and any logical change might result in a corner-case failure.
Force-pushed a0cbdd0 → c4c7e7d → 5200ebe → 44b622f → 65ec3e5
Update:
As for any performance concern on the very hot write path WriteRawBlock: is there any test I could run other than db_bench, @ajkr @pdillinger? By the way, this PR is considered non-blocking to my #9342; it's just easier to read and write #9342's change to WriteRawBlock on top of this refactoring.
```cpp
  if (!s.ok()) {
    r->SetStatus(s);
    return;
```
@akankshamahajan15 I saw you worked on [cache warming](#8242) and left the "// TODO:: Should InsertBlockInCompressedCache take into account error from .. ".
Do you foresee any problem with removing the TODO and returning earlier on `InsertBlockInCacheHelper`'s error, as in my change here, since we are pushing for an early-return-on-error coding paradigm in WriteRawBlock in this PR? Thanks!
LGTM, with suggestion for follow-up change (not high priority)
```cpp
  }
  // TODO:: Should InsertBlockInCompressedCache take into account error from
  // InsertBlockInCache or ignore and overwrite it.
  s = InsertBlockInCompressedCache(block_contents, type, handle);
  if (!s.ok()) {
    r->SetStatus(s);
```
IMO we should not be failing table construction on failure to insert in cache, but that is the old behavior and OK not to change it here and now. We wouldn't want transient memory pressure to abort a giant compaction.
(So IMO, whether compressed cache insertion is conditional on regular cache insertion is unimportant. We should proceed despite cache insertion errors, ideally logging max 1 warning per table builder about failed cache insertion.)
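The suggested follow-up behavior can be sketched roughly as follows (hypothetical code, not the actual RocksDB change): treat block-cache insertion failure as non-fatal and log at most one warning per table builder instead of failing the whole build.

```cpp
#include <string>
#include <vector>

// Minimal stand-in status.
struct Status {
  bool ok_ = true;
  bool ok() const { return ok_; }
};

struct TableBuilderSketch {
  bool warned_cache_insert_failure = false;
  std::vector<std::string> warnings;

  // The block is already written to the file; the cache is only an
  // optimization, so a failed insertion should not abort table construction.
  void OnCacheInsert(const Status& s) {
    if (!s.ok() && !warned_cache_insert_failure) {
      warned_cache_insert_failure = true;  // warn at most once per builder
      warnings.push_back("block cache insertion failed");
    }
    // ... continue building the table regardless of s ...
  }
};
```

This keeps transient memory pressure from aborting a large compaction while still surfacing the condition once in the logs.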
Opened task
You can run a db_bench fillseq test to be safe, but the changes are mostly cosmetic. The set of code paths is nearly unchanged.
Ran it and saw a slight improvement after the change.
Summary: Note: rebase on and merge after #9349, #9345, (optional) #9393

**Context:** (Quoted from pdillinger) Layers of information during new Bloom/Ribbon filter construction in building block-based tables include the following:

a) set of keys to add to filter
b) set of hashes to add to filter (64-bit hash applied to each key)
c) set of Bloom indices to set in filter, with duplicates
d) set of Bloom indices to set in filter, deduplicated
e) final filter and its checksum

This PR aims to detect corruption (e.g., unexpected hardware/software corruption of data structures residing in memory for a long time) from b) to e), and leaves a) as future work for the application level.

- b)'s corruption is detected by verifying the xor checksum of the hash entries, calculated as the entries accumulate before being added to the filter (i.e., `XXPH3FilterBitsBuilder::MaybeVerifyHashEntriesChecksum()`).
- c)-e)'s corruption is detected by verifying that the hash entries indeed exist in the constructed filter, by re-querying these hash entries in the filter (i.e., `FilterBitsBuilder::MaybePostVerify()`) after computing the block checksum (except for PartitionFilter, where it is done right after each `FilterBitsBuilder::Finish` for impl simplicity; see code comment for more). For this stage of detection, we assume hash entries are not corrupted after the check in b), since the time interval from b) to c) is relatively short.

The option to enable this detection is `BlockBasedTableOptions::detect_filter_construct_corruption`, which is false by default.

**Summary:**
- Implemented new functions `XXPH3FilterBitsBuilder::MaybeVerifyHashEntriesChecksum()` and `FilterBitsBuilder::MaybePostVerify()`
- Ensured hash entries, the final filter, the banding, and their [cache reservation](#9073) are released properly despite corruption
  - See [Filter.construction.artifacts.release.point.pdf](https://github.com/facebook/rocksdb/files/7923487/Design.Filter.construction.artifacts.release.point.pdf) for the high-level design
  - Bundled and refactored the hash-entry-related artifacts in XXPH3FilterBitsBuilder into `HashEntriesInfo` for better control of their lifetime during `SwapEntires` and `ResetEntries`
- Ensured the RocksDB block-based table builder calls `FilterBitsBuilder::MaybePostVerify()` after constructing the filter via `FilterBitsBuilder::Finish()`
- When encountering such filter construction corruption, stop writing the filter content to files and mark the block-based table build non-OK by storing the corruption status in the builder

Pull Request resolved: #9342

Test Plan:
- Added new unit test `DBFilterConstructionCorruptionTestWithParam.DetectCorruption`
- Included this new feature in `DBFilterConstructionReserveMemoryTestWithParam.ReserveMemory`, as this feature heavily touches ReserveMemory's impl
- For the fallback case, ran `./filter_bench -impl=3 -detect_filter_construct_corruption=true -reserve_table_builder_memory=true -strict_capacity_limit=true -quick -runs 10 | grep 'Build avg'` to make sure nothing breaks
- Added to `filter_bench`: increased filter construction time by **30%**, mostly from `MaybePostVerify()`
  - FastLocalBloom
    - Before change: `./filter_bench -impl=2 -quick -runs 10 | grep 'Build avg'`: **28.86643s**
    - After change:
      - `./filter_bench -impl=2 -detect_filter_construct_corruption=false -quick -runs 10 | grep 'Build avg'` (expect a tiny increase since MaybePostVerify is always called regardless): **27.6644s** (the -4% perf improvement might be because we now drop bloom hash entries in bulk rather than along the iteration in `AddAllEntries`; same for the bypassing-MaybePostVerify case below)
      - `./filter_bench -impl=2 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'` (expect an acceptable increase): **34.41159s** (+20%)
      - `./filter_bench -impl=2 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'` (bypassing MaybePostVerify, expect a minor increase): **27.13431s** (-6%)
  - Standard128Ribbon
    - Before change: `./filter_bench -impl=3 -quick -runs 10 | grep 'Build avg'`: **122.5384s**
    - After change:
      - `./filter_bench -impl=3 -detect_filter_construct_corruption=false -quick -runs 10 | grep 'Build avg'` (expect a tiny increase; verified by removing MaybePostVerify under this case and finding only a ±1 ns difference): **124.3588s** (+2%)
      - `./filter_bench -impl=3 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'` (expect an acceptable increase): **159.4946s** (+30%)
      - `./filter_bench -impl=3 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'` (bypassing MaybePostVerify, expect a minor increase): **125.258s** (+2%)
- Added to `db_stress`: `make crash_test`, `./db_stress --detect_filter_construct_corruption=true`
- Manually smoke-tested: manually corrupted the filter construction in some DB-level tests with basic Put and background flush; as expected, the error did get returned to users in subsequent Put and Flush statuses

Reviewed By: pdillinger
Differential Revision: D33746928
Pulled By: hx235
fbshipit-source-id: cb056426be5a7debc1cd16f23bc250f36a08ca57
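The xor-checksum idea for stage b) above can be sketched as follows (hypothetical code, not RocksDB's `XXPH3FilterBitsBuilder`): XOR all 64-bit hash entries together as they accumulate, then verify that the stored entries still XOR to the same value before building the filter.

```cpp
#include <cstdint>
#include <vector>

class HashEntryCollector {
 public:
  void AddHash(uint64_t h) {
    entries_.push_back(h);
    xor_checksum_ ^= h;  // cheap running checksum, order-independent
  }

  // Detects in-memory corruption of the accumulated entries: recompute the
  // XOR over the stored entries and compare with the running checksum.
  bool VerifyChecksum() const {
    uint64_t x = 0;
    for (uint64_t h : entries_) x ^= h;
    return x == xor_checksum_;
  }

  std::vector<uint64_t>& entries() { return entries_; }

 private:
  std::vector<uint64_t> entries_;
  uint64_t xor_checksum_ = 0;
};
```

An XOR checksum catches any single-bit flip in the stored entries; it costs one XOR per added entry, which fits the hot filter-construction path.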
Context:
Inside `BlockBasedTableBuilder::WriteRawBlock`, there are multiple places that change the local variables `io_s` and `s` while also depending on them. This PR attempts to clarify the relevant logic so that it is easier to read, and so that future changes to these local variables (like #9342) can be added without changing the current behavior.

Summary:
- Consolidated the handling of `io_s` and `s` as much as possible, avoiding if-else branches by returning early

Test:
`./db_bench --benchmarks="fillseq"` x 5