Rate-limit automatic WAL flush after each user write #9607
Conversation
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Planning to review this one only as it's one functionality and easier to understand the whole picture by looking at one PR. Do you want to merge the descriptions from the other PRs here if needed?
Yep - I will do that tomorrow morning.
Sorry, there is some issue I did not think of before, and I am still thinking about how to solve it.
db/db_rate_limiter_test.cc
Outdated

    EXPECT_EQ(flush_rate_limiter_request,
              options_.rate_limiter->GetTotalRequests(Env::IO_HIGH));
This is slightly confusing because it assumes compaction has not happened yet when `flush_rate_limiter_request` was initialized to `GetTotalRequests(Env::IO_HIGH)` (if compaction had already happened, then compaction requests could have been charged at high-pri and this assertion wouldn't notice). But it is non-deterministic whether compaction has happened at that point.
> (if compaction had already happened, then compaction requests could have been charged at high-pri and this assertion wouldn't notice)

Would the above

    // Init() is setup in a way such that we flush per file
    ASSERT_EQ(flush_rate_limiter_request, kNumFiles);

be sufficient to notice that compaction is not charged at high-pri, even in the case where compaction happened before `std::int64_t flush_rate_limiter_request = options_.rate_limiter->GetTotalRequests(Env::IO_HIGH);`?

I was trying to use the following two assertions

    // Init() is setup in a way such that we flush per file
    ASSERT_EQ(flush_rate_limiter_request, kNumFiles);
    ...
    EXPECT_EQ(flush_rate_limiter_request,
              options_.rate_limiter->GetTotalRequests(Env::IO_HIGH));

to make sure compactions are not charged at `Env::IO_HIGH`; otherwise one of these two assertions would catch it.

But you did remind me that I did not check whether compaction is charged at another wrong priority like "MID" or "USER". Actually I have this at the end:

    EXPECT_EQ(compaction_rate_limiter_request + flush_rate_limiter_request,
              options_.rate_limiter->GetTotalRequests(Env::IO_TOTAL));

But given that the current way is confusing, let me think of a way to clarify/write it better.
Fixed
db/db_rate_limiter_test.cc
Outdated

    EXPECT_EQ(compaction_rate_limiter_request,
              kNumFiles - options_.level0_file_num_compaction_trigger);
Hm, this seems riskier than the above. It is non-deterministic whether the (N+1)th flush happens before the Nth compaction is picked. In this case the symptom would be a test flake rather than an unlikely race condition that causes the assertion to miss a bug.
> It is non-deterministic whether the (N+1)th flush happens before the Nth compaction is picked

Good point - I overlooked this. Let me think more about it.
Fixed
db/db_rate_limiter_test.cc
Outdated

    10 /* fairness */, RateLimiter::Mode::kWritesOnly));
    options.table_factory.reset(
        NewBlockBasedTableFactory(BlockBasedTableOptions()));
    options.disable_auto_compactions = GetParam();
This feels kind of redundant with having distinct "Flush" and "Compact" tests.
I will combine them. Thanks!
db/db_impl/db_impl_write.cc
Outdated

    io_s = log.writer->file()->Sync(immutable_db_options_.use_fsync);
    io_s =
        log.writer->file()->Sync(immutable_db_options_.use_fsync,
                                 write_group.leader->rate_limiter_priority);
Data that may be written out here is unrelated to `write_group.leader`, so it shouldn't use its `rate_limiter_priority`. I see the problem you ran into, though. For per-request rate limiting we only know the precise `rate_limiter_priority` at `WritableFileWriter::Append()`. However, the existing rate limiting code was in `WritableFileWriter::Write*()`.

This `Sync()` would not call the `WritableFileWriter::Write*()` functions in the common case. I think it requires the user to enable manual WAL flush PLUS not call `FlushWAL()` for a while in order for `WritableFileWriter` to have a nonempty buffer that will be written out during this `Sync()`.

The manual WAL flush case has a more likely issue, though, which is we aren't passing any `rate_limiter_priority` during `FlushWAL()`.

Will think about it more, but here are some possible options -

(1) Track the number of bytes `Append()`ed at `IO_USER` priority and charge that amount to the rate limiter during `Flush()`
(2) Make `WriteOptions::rate_limiter_priority` explicitly not supported together with manual WAL flush
(3) Give up on per-request tracking and make it a `DBOptions`
(Still reading your arguments)
> Data that may be written out here is unrelated to write_group.leader so shouldn't use its rate_limiter_priority

This is the case I haven't considered thoroughly - I need to study this code path more.
Discussed offline that for now we only support custom rate-limiting priority for WAL writes per-operation. Fixed.
@hx235 has updated the pull request. You must reimport the pull request before landing.
Compared with the last update, I made some further improvements:
@@ -814,12 +828,13 @@ IOStatus WritableFileWriter::WriteDirectWithChecksum(

    // TODO: need to be improved since it sort of defeats the purpose of the rate
    // limiter
    size_t data_size = left;
    if (rate_limiter_ != nullptr) {
      Env::IOPriority rate_limiter_priority_used =
One consequence of bypassing the rate limiter's charging, instead of charging it with `Env::IO_TOTAL`, is that we need to decide `rate_limiter_priority_used` before `size = rate_limiter_->RequestToken`.

It's not a big deal as long as the same `writable_file_`'s `io_priority_` is not shared between threads. Otherwise we might have `writable_file_`'s `io_priority_` being changed in the loop `while (data_size > 0) {}` below.

I don't think this is the case, and I am double-checking with Anand.
Hey @anand1976, another quick question related to concurrency of FileSystem.
Is it guaranteed that `writable_file_`'s `io_priority_` is not shared between threads? I don't think so from briefly inspecting the code, but figured I should double-check with you.
If you need more context for my question, see the PR comment right above.
Yes. We never have multiple threads writing to the same `WritableFile`.
@hx235 has updated the pull request. You must reimport the pull request before landing.
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@ajkr Not sure how far you were in your previous review, so here is a summary of the incremental changes I made since the last update: #9607 (comment), #9607 (comment). Ready for review.
Nice progress! A few smaller comments this time.
HISTORY.md
Outdated

@@ -10,6 +10,7 @@

### Public API changes
* Remove BlockBasedTableOptions.hash_index_allow_collision which already takes no effect.
* Added `WriteOptions::rate_limiter_priority`. When set to something other than `Env::IO_TOTAL`, the internal rate limiter (`DBOptions::rate_limiter`) will be charged at the specified priority for automatic WAL flush (`Options::manual_wal_flush` == false) associated with the API to which the `WriteOptions` was provided.
Sorry to be pedantic. It looks correct, but "automatic WAL flush" could be elaborated slightly and decoupled from the option's purpose. Such as:
Added `WriteOptions::rate_limiter_priority`. When set to something other than `Env::IO_TOTAL`, the internal rate limiter (`DBOptions::rate_limiter`) will be charged at the specified priority for writes associated with the API to which the `WriteOptions` was provided. Currently the support covers automatic WAL flushes, which happen during live updates (`Put()`, `Write()`, `Delete()`, etc.) when `WriteOptions::disableWAL == false` and `DBOptions::manual_wal_flush == false`.
Got you! I can see your point here - will fix this and the API comment too!
Fixed
db/db_impl/db_impl_write.cc
Outdated

    // See `WriteOptions::rate_limiter_priority` for this constraint
    if (manual_wal_flush_ || rate_limiter_priority != Env::IO_USER) {
      rate_limiter_priority = Env::IO_TOTAL;
    }
Can the conditions be checked at the top of `WriteImpl()` and return `Status::InvalidArgument` when violated? I know we discussed this, but at that time I was under the impression `manual_wal_flush` is dynamically changeable, in which case it'd only be accessible with the lock held, like in `PreprocessWrite()`, where returning failure would cause the DB to enter read-only mode. But it turns out it's not dynamically changeable, and `manual_wal_flush_` can be accessed anywhere, like on entry to `WriteImpl()`.

Also, for the `rate_limiter_priority != Env::IO_USER` case I don't see a reason to overwrite invalid values here, as opposed to validating on entry to `WriteImpl()`.
Good point - I forgot to leave you a note on this to ask for clarification on what you meant by "entering read-only mode". Will fix this.
Fixed
db/db_rate_limiter_test.cc
Outdated
TEST_F(DBRateLimiterOnWriteTest, Compact) { | ||
Init(); | ||
|
||
// files_per_level_pre_compaction: 1,1,...,1 (in total kNumFiles levels) | ||
#ifndef ROCKSDB_LITE | ||
std::string files_per_level_pre_compaction = | ||
CreateSimpleFilesPerLevelString("1", "1"); | ||
ASSERT_EQ(files_per_level_pre_compaction, FilesPerLevel(0 /* cf */)); | ||
#endif // !ROCKSDB_LITE | ||
|
||
std::int64_t prev_total_request = | ||
options_.rate_limiter->GetTotalRequests(Env::IO_TOTAL); | ||
ASSERT_EQ(0, options_.rate_limiter->GetTotalRequests(Env::IO_LOW)); | ||
|
||
Compact(kStartKey, kEndKey); | ||
|
||
std::int64_t actual_compaction_request = | ||
options_.rate_limiter->GetTotalRequests(Env::IO_TOTAL) - | ||
prev_total_request; | ||
|
||
// files_per_level_post_compaction: 0,0,...,1 (in total kNumFiles levels) | ||
#ifndef ROCKSDB_LITE | ||
std::string files_per_level_post_compaction = | ||
CreateSimpleFilesPerLevelString("0", "1"); | ||
ASSERT_EQ(files_per_level_post_compaction, FilesPerLevel(0 /* cf */)); | ||
#endif // !ROCKSDB_LITE | ||
|
||
std::int64_t exepcted_compaction_request = kNumFiles - 1; | ||
EXPECT_EQ(actual_compaction_request, exepcted_compaction_request); | ||
EXPECT_EQ(actual_compaction_request, | ||
options_.rate_limiter->GetTotalRequests(Env::IO_LOW)); | ||
} |
This is hard to understand. Is covering compaction of multiple successive levels (e.g., L0->L1 then L1->L2) in the most flexible way worth the complexity? I'd just flush a few overlapping files with disable_auto_compactions=true, then call `CompactRange()`.
Will try `CompactRange()` and a simpler compaction case to see how it reads.
Fixed
file/writable_file_writer.h
Outdated
    IOStatus Append(const Slice& data, uint32_t crc32c_checksum = 0);
    IOStatus Append(const Slice& data, uint32_t crc32c_checksum = 0,
                    Env::IOPriority op_rate_limiter_priority = Env::IO_TOTAL,
                    bool op_override_file_priority = false);
Can a provided op-level priority (i.e., non-`Env::IO_TOTAL`) always override the file-level priority? It feels expected that the finer-granularity setting overrides.
> Can a provided op-level priority (i.e., non-Env::IO_TOTAL) always override the file-level priority?

Yes, I believe this is how `WritableFileWriter::DecideRateLimiterPriority` works, except that it is written more verbosely (and I can change that).

But the issue `op_override_file_priority` was trying to solve is when op-level priority == `Env::IO_TOTAL`, where it does not override the file-level priority for compaction/flush but "does" override it for the WAL (more for a future hypothetical case of "what if" the WAL's IO priority is set to non-`Env::IO_TOTAL` and developers forget to test the compatibility of rate-limiting auto WAL flush with a non-`Env::IO_TOTAL` WAL IO priority).

I am also open to discussing whether it is worth the extra parameter for the "hypothetical case".
> future hypothetical case where "what-if" WAL's io priority is set to non-Env::IO_TOTAL and developers forget to test the compatibility of rate-limiting auto WAL flush and non-Env::IO_TOTAL WAL's io pri

Non-`Env::IO_TOTAL` in the WAL plus rate-limiting auto WAL flush means the auto WAL flush priority should override. And it does, because auto-flush callers set this value to true. Whereas other WAL users leave this as false - but they also don't pass an op-level priority, so it'd take the file-level priority regardless of whether this is false/true. So actually I still don't really understand the problem we're solving.
Oh I see, this doesn't behave the way I thought it would. Setting this flag forces op-level priority to override in every case. OK let me rethink it...
Here is my new understanding: we set this flag to true in auto-flushes; then somebody who comes in the future and adds a file-level-priority WAL feature must notice there are op-level priorities too, because their new feature won't work at all. Then they make sure it's compatible and set it to false. Well, the logic makes sense now. I wouldn't personally add code complexity to give hypothetical feature implementors a smooth path to not learning the surrounding code, but maybe I'm just mean. So will leave it up to you.
edit: Although considering SetIOPriority() and GetIOPriority() are public APIs, it's not entirely our choice whether there's a file-level priority on a WAL. An application developer who calls SetIOPriority() on a WAL might not be interested in changing RocksDB and waiting for a release for it to work properly.
Discussed offline and concluded that we can clarify `FSWritableFile::SetIOPriority`, which addressed my concern behind adding the extra parameter `op_override_file_priority`.
Fixed.
This would've been extra difficult if we had continued trying to keep #9606 separate.
TIL!
@hx235 has updated the pull request. You must reimport the pull request before landing.
@hx235 has updated the pull request. You must reimport the pull request before landing.
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Excellent PR - thanks!
    ::testing::Values(std::make_tuple(false, false, Env::IO_TOTAL),
                      std::make_tuple(false, false, Env::IO_USER),
                      std::make_tuple(false, false, Env::IO_HIGH),
                      std::make_tuple(false, true, Env::IO_USER),
                      std::make_tuple(true, false, Env::IO_USER)),
Nice parameterization - thanks for testing the error cases.
Context:

WAL flush is currently not rate-limited by `Options::rate_limiter`. This PR provides rate-limiting for the auto WAL flush, the one that automatically happens after each user write operation (i.e., `Options::manual_wal_flush == false`), by adding `WriteOptions::rate_limiter_priority`.

Note that we are NOT rate-limiting WAL flushes that do NOT automatically happen after each user write, such as `Options::manual_wal_flush == true` + manual `FlushWAL()` (rate-limiting multiple WAL flushes).

`WriteOptions::rate_limiter_priority` only accepts `Env::IO_USER` and `Env::IO_TOTAL` currently due to an implementation constraint: if we allowed `Env::IO_HIGH/MID/LOW`, and writes specified with lower priorities occurred before ones specified with higher priorities (even just by a tiny bit in arrival time), the former would have blocked the latter, leading to a "priority inversion" issue contradictory to what we promise for rate-limiting priority. Therefore we only allow `Env::IO_USER` and `Env::IO_TOTAL` right now, before improving that scheduling.

A pre-requisite to this feature is to support operation-level rate limiting in `WritableFileWriter`, which is also included in this PR.

Summary:

* Renamed `DBRateLimiterTest` to `DBRateLimiterOnReadTest` for adding a new test suite
* Added `rate_limiter_priority` to `WritableFileWriter`'s private and public write functions
* Passed `WriteOptions::rate_limiter_priority` to `WritableFileWriter` in the path of automatic WAL flush

Test:
* `DBTest, RateLimitingTest` is disabled, and current db-level rate-limiting tests focus on reads only (e.g., `db_rate_limiter_test`, `DBTest2, RateLimitedCompactionReads`).
* New unit test: `DBRateLimiterOnWriteWALTest, AutoWalFlush`
* Manual check: `strace -ftt -e trace=write ./db_bench -benchmarks=fillseq -db=/dev/shm/testdb -rate_limit_auto_wal_flush=1 -rate_limiter_bytes_per_sec=15 -rate_limiter_refill_period_us=1000000 -write_buffer_size=100000000 -disable_auto_compactions=1 -num=100`, compared against `-rate_limit_auto_wal_flush=0`
* Crash test: `python3 tools/db_crashtest.py blackbox --disable_wal=0 --rate_limit_auto_wal_flush=1 --rate_limiter_bytes_per_sec=10485760 --interval=10`, killed as normal

Benchmarked on flush/compaction to ensure no performance regression:
907350 micros/op (improved by 0.106%)

[table 1 - compact with rate-limiting]
[table 2 - compact w/o rate-limiting]
[table 3 - flush with rate-limiting]
[table 4 - flush w/o rate-limiting]