Add rate limiter priority to ReadOptions #9424
Conversation
Early review: I looked at the code change/testing for compaction and VerifyChecksum and believe it's sufficient for delivery 👍 with a few nits.

Will look at the other read-related changes in the second round. If there is a way to split them into two PRs, I am good with approving the compaction/VerifyChecksum-related one.

One thing though is we might want to work on a better API doc for both `DBOptions::rate_limiter` and `ReadOptions::priority` since (1) we expand the scope of things being rate limited by `DBOptions::rate_limiter`, (2) we open a way to bypass that rate limiting using `ReadOptions::priority`, and (3) some lower-level operations will ignore `ReadOptions::priority` no matter what.
I'm thinking of the following comment:

(a) For `DBOptions::rate_limiter`:

API: If not null, `bytes_per_sync` is set to 1MB by default and it will be used to rate-limit the following (the list is subject to future expansion):

Read:
- Compaction read (at `IOPriority::LOW`)
- `DB::VerifyChecksum()` (at the specified priority, or bypassed; see `ReadOptions::priority` for more)

Write:
- Compaction write (at `IOPriority::LOW`)
- Flush write (at `IOPriority::HIGH`)

Default: nullptr (rate limiting disabled)

(b) For `ReadOptions::rate_limiter`: see PR comment
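For concreteness, here is a minimal sketch of attaching such a rate limiter, assuming the public headers `rocksdb/options.h` and `rocksdb/rate_limiter.h` (the path and rate are placeholders, not from this PR):

```cpp
#include "rocksdb/db.h"
#include "rocksdb/options.h"
#include "rocksdb/rate_limiter.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  // Shared limiter for the operations listed above, capped at 10 MB/s.
  // Note: NewGenericRateLimiter() defaults to writes-only mode, so reads
  // are only limited when Mode::kReadsOnly or Mode::kAllIo is passed.
  options.rate_limiter.reset(rocksdb::NewGenericRateLimiter(
      10 << 20 /* rate_bytes_per_sec */, 100 * 1000 /* refill_period_us */,
      10 /* fairness */, rocksdb::RateLimiter::Mode::kAllIo));

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/testdb", &db);
  delete db;
  return s.ok() ? 0 : 1;
}
```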
Thanks for the review!
> Early review: I looked at the code change/testing for compaction and VerifyChecksum and believe it's sufficient for delivery 👍 with a few nits.
>
> Will look at the other read-related changes in the second round. If there is a way to split them into two PRs, I am good with approving the compaction/VerifyChecksum-related one.

Sorry, the main idea of this PR requires them to be quite entangled (see below).

> One thing though is we might want to work on a better API doc for both `DBOptions::rate_limiter` and `ReadOptions::priority` since (1) we expand the scope of things being rate limited by `DBOptions::rate_limiter`, (2) we open a way to bypass that rate limiting using `ReadOptions::priority`, and (3) some lower-level operations will ignore `ReadOptions::priority` no matter what.
>
> I'm thinking of the following comment: (a) For `DBOptions::rate_limiter`: API: If not null, `bytes_per_sync` is set to 1MB by default and it will be used to rate-limit the following (the list is subject to future expansion): Read: Compaction read (at `IOPriority::LOW`); `DB::VerifyChecksum()` (at the specified priority, or bypassed; see `ReadOptions::priority` for more). Write: Compaction write (at `IOPriority::LOW`); Flush write (at `IOPriority::HIGH`). Default: nullptr (rate limiting disabled).
>
> (b) For `ReadOptions::rate_limiter`: see PR comment
The goal of this PR is to generally support `ReadOptions::priority`, except in unimportant cases and a handful of uncommon cases.

It is easier to describe where support is missing than where support is available. Describing either one should be acceptable, IMO, since it's the complement of the other. I am happy to reexamine the list of missing support, both to make sure it is at the proper level of abstraction (for example, the reference to `MultiRead()` is too low-level) and to make sure it is complete (for example, I have not checked for APIs accepting `ReadOptions` that do their reads outside `RandomAccessFileReader`).

It is difficult to list available support because this PR made changes at a low level to support most user APIs (without needing to look at what exactly they are) at once. For example, a lot of APIs (`Get()`, iterator, compaction, etc.) already passed their `ReadOptions` all the way down to `BlockFetcher`. I only made a small code change to `BlockFetcher` to make rate limiting available to all those APIs. That is also why this approach requires we do it all at once -- if we wanted to only support compaction/VerifyChecksum, we'd have to go into `BlockFetcher` and explicitly remove support for all other cases, which seems like extra code for an unnecessary limitation.

Meanwhile, it was actually fairly easy to identify the user APIs corresponding to missing support. I guess it's because that case is (luckily) much less common, where the user can provide `ReadOptions` that are inaccessible to `RandomAccessFileReader`.
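To illustrate the shape of that low-level change, here is a simplified sketch (the helper name is made up and this is not the actual `BlockFetcher` code; it assumes the `RandomAccessFileReader::Read()` signature this PR introduces, with a trailing `Env::IOPriority` parameter, and the field's eventual name `rate_limiter_priority`):

```cpp
#include "file/random_access_file_reader.h"
#include "rocksdb/options.h"

namespace ROCKSDB_NAMESPACE {

// One low-level read path honors the priority carried by ReadOptions, so
// every API that already routes its ReadOptions down here (Get(), iterators,
// compaction, VerifyChecksum(), ...) gains rate limiting without per-API
// changes.
IOStatus BlockFetcherLikeRead(RandomAccessFileReader* file, uint64_t offset,
                              size_t n, Slice* result, char* scratch,
                              const ReadOptions& read_options) {
  return file->Read(IOOptions(), offset, n, result, scratch,
                    nullptr /* aligned_buf */,
                    read_options.rate_limiter_priority);
}

}  // namespace ROCKSDB_NAMESPACE
```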
Aha - I could see that now.

I will also update the comment as I get `SequentialFileReader` and the write side of things done.
This will be great. I'm good with listing unsupported user read APIs (such as `DB::MultiGet()`, old BlobDB reads, general reads under plain/cuckoo table configuration) in the `DBOptions::rate_limiter` API doc so that users know what NOT to expect for rate limiting support. It will also be good to briefly mention there something like "see `ReadOptions::priority` for more info on specifying the rate-limiting priority of each read or bypassing the rate limiting" so that users know how to correctly add such rate-limiting support.
I put the list in one place (`ReadOptions::priority`) and referred to it from `DBOptions::rate_limiter`. The list of scenarios that ignore/do not support `ReadOptions::priority` is shorter than the list of scenarios where read rate limiting is unavailable. Scenarios like reading MANIFEST/WALs during `DB::Open()` do not support read rate limiting, but I do not think we need to explicitly list cases where the API offers no way to enable rate limiting (there is no `ReadOptions` passed to `DB::Open()`, for example).

Also done:
- Went through all clients of `SequentialFileReader` and did not find any where the user-facing API offers a way to enable rate limiting.
- Updated to specify which `MultiGet()` APIs are supported at the proper level of abstraction.
- Removed mention of old `BlobDB` since it was never actually exposed in any public header.
- Elaborated on "cuckoo/plain table" so people can identify whether it is relevant to their use case.
> For internal reads/writes of compaction and writes of flush, contrary to the user read API, their priorities are fixed and will not be bypassed using ReadOptions::priority, which is basically what the existing DBOptions::rate_limiter API describes.

Right. Keep in mind there is no way for a user to supply the `ReadOptions` for flush or compaction, so we don't really have a choice about bypassing `ReadOptions::priority` or not.
Sounds good - thanks for the clarification. Sounds like we now make `ReadOptions` a necessary condition for having rate limiting support for user reads. Then we either take `ReadOptions::priority` into account (i.e., the user specifies a priority, or RocksDB internals set `IO_TOTAL` for not-supported) or just ignore it. Good with me! 👍

No rush to finish the PR, but are we still going to change the name of `ReadOptions::priority`? Since it's a public API, it would be good to finalize the naming as early as possible. Other changes in the update look good!
```
@@ -349,6 +349,8 @@ bool StressTest::VerifySecondaries() {
  fprintf(stderr, "Secondary failed to catch up with primary\n");
  return false;
}
// This `ReadOptions` is for validation purposes. Ignore
```
👍
Do we also need to add it to db_crashtest.py for automating the test?
Others LGTM!
It might be worth adding an option FLAG_read_rate_limiter_priority to the stress test, considering this touches many aspects of our reads. Hopefully it wouldn't be hard, as a quick search in db_stress_test_base reveals 11 `ReadOptions`.
Done. Although, keep in mind the crash test wrapper does not use the rate limiter yet, so the support is a bit superficial for now. We will enable rate limiting eventually, but not in the near future (`IngestExternalFile()` is my next step to enable). I at least ran it manually and added the command to the test plan to show the new feature can work.
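For illustration, a hypothetical sketch of that wiring (the flag name follows the suggestion above; none of this is the actual db_stress code):

```cpp
#include <gflags/gflags.h>

#include "rocksdb/env.h"
#include "rocksdb/options.h"

// Hypothetical flag; -1 keeps the default Env::IO_TOTAL (no rate limiting).
DEFINE_int32(read_rate_limiter_priority, -1,
             "Env::IOPriority to tag stress-test user reads with");

// Would be applied wherever db_stress builds a ReadOptions for a user read.
void MaybeSetReadPriority(rocksdb::ReadOptions* read_opts) {
  if (FLAGS_read_rate_limiter_priority >= 0) {
    read_opts->rate_limiter_priority = static_cast<rocksdb::Env::IOPriority>(
        FLAGS_read_rate_limiter_priority);
  }
}
```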
Sorry, I wrote my last comment without refreshing. Basically yes, but it's non-trivial to enable it without slowing things down too much, so I hope to defer it.

edit: Another reason I'm seeing it's nontrivial is that, as it runs, it's printing a lot of these lines:
Sure! I was actually just referring to something like

Got you! Might be good to just leave a comment
Not sure, I think the person to enable rate limiting in CI will need to do a tour of all the possibilities. There's a bunch of things I would do if we did it right now, like renaming rate_limit_bg_reads, making it support kAllIo (currently we only support reads or writes, never both), picking a rate_limiter_bytes_per_sec that doesn't slow things down, dynamically changing the rate limit, enabling IO_USER, etc. But I guess whoever eventually does it will have to explore a bit to figure out the right way. I kind of don't think they'll miss the flags with "rate_limit" in the name :).
Sounds good! I would likely check back on this comment if I happen to be that person to enable rate limiting in CI.
Summary:

**Context/Summary:** #9424 added rate-limiting support for user reads, which does not include batched `MultiGet()`s that call `RandomAccessFileReader::MultiRead()`. The reason is that it's harder (compared with `RandomAccessFileReader::Read()`) to implement the ideal rate limiting, where we first call `RateLimiter::RequestToken()` for the allowed bytes to multi-read and then consume those bytes by satisfying as many requests in `MultiRead()` as possible. For example, it can be tricky to decide whether we want partially fulfilled requests within one `MultiRead()` or not.

However, due to a recent urgent user request, we decided to pursue an elementary (but conditionally ineffective) solution where we accumulate enough rate limiter requests toward the total bytes needed by one `MultiRead()` before doing that `MultiRead()`. This is not ideal when the total bytes are huge, as we will actually consume a huge amount of bandwidth from the rate limiter, causing a burst on disk. This is not what we ultimately want with the rate limiter, so follow-up work is noted through TODO comments.

Pull Request resolved: #10159

Test Plan:
- Modified existing unit test `DBRateLimiterOnReadTest/DBRateLimiterOnReadTest.NewMultiGet`
- Traced the underlying system calls `io_uring_enter` and verified they are correctly 10 seconds apart from each other under the setting of `strace -ftt -e trace=io_uring_enter ./db_bench -benchmarks=multireadrandom -db=/dev/shm/testdb2 -readonly -num=50 -threads=1 -multiread_batched=1 -batch_size=100 -duration=10 -rate_limiter_bytes_per_sec=200 -rate_limiter_refill_period_us=1000000 -rate_limit_bg_reads=1 -disable_auto_compactions=1 -rate_limit_user_ops=1`, where each `MultiRead()` reads about 2000 bytes (inspected by debugger) and the rate limiter grants 200 bytes per second.
- Stress test: verified `./db_stress (-test_cf_consistency=1/test_batches_snapshots=1) -use_multiget=1 -cache_size=1048576 -rate_limiter_bytes_per_sec=10241024 -rate_limit_bg_reads=1 -rate_limit_user_ops=1` works

Reviewed By: ajkr, anand1976

Differential Revision: D37135172

Pulled By: hx235

fbshipit-source-id: 73b8e8f14761e5d4b77235dfe5d41f4eea968bcd
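A sketch of the elementary approach described above, using the public `RateLimiter` interface from `rocksdb/rate_limiter.h` (a simplification, not the code from #10159):

```cpp
#include <algorithm>

#include "rocksdb/env.h"
#include "rocksdb/rate_limiter.h"

// Accumulate grants covering the whole batch before issuing MultiRead().
// This is the burst noted above: a large batch may wait across several
// refill periods and then hit the disk all at once.
void WaitForMultiReadBudget(rocksdb::RateLimiter* limiter,
                            rocksdb::Env::IOPriority pri,
                            int64_t total_bytes_to_read) {
  if (limiter == nullptr || pri == rocksdb::Env::IO_TOTAL) {
    return;  // rate limiting disabled or bypassed
  }
  int64_t remaining = total_bytes_to_read;
  const int64_t burst = limiter->GetSingleBurstBytes();
  while (remaining > 0) {
    // Request() blocks until the limiter grants this chunk.
    const int64_t chunk = std::min(remaining, burst);
    limiter->Request(chunk, pri, nullptr /* stats */,
                     rocksdb::RateLimiter::OpType::kRead);
    remaining -= chunk;
  }
  // ...then perform the actual MultiRead() for all requests...
}
```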
…+ misc (#11444)

Summary:

**Context/Summary:**
- Similar to #11288 but for user reads such as `Get()`, `MultiGet()`, `DBIterator::XXX()`, and `Verify(File)Checksum()`.
- For this, I refactored some user-facing `MultiGet` calls in `TransactionBase` and various types of `DB` so that they do not call a user-facing `Get()` but `GetImpl()`, in order to pass the `ReadOptions::io_activity` check (see PR conversation).
- New user read stats breakdowns are guarded by `kExceptDetailedTimers` since measurement shows they have a 4-5% regression relative to upstream/main.
- Misc:
  - More refactoring: with #11288, we completed passing `ReadOptions`/`IOOptions` down to the FS level, so we can now replace the previously [added](#9424) `rate_limiter_priority` parameter in `RandomAccessFileReader`'s `Read()`/`MultiRead()`/`Prefetch()` with `IOOptions::rate_limiter_priority`.
  - Also, `ReadAsync()` call time is now measured in `SST_READ_MICRO`.

Pull Request resolved: #11444

Test Plan:
- CI fake db crash/stress test
- Microbenchmarking

**Build:** `make clean && ROCKSDB_NO_FBCODE=1 DEBUG_LEVEL=0 make -jN db_basic_bench`
- google benchmark version: google/benchmark@604f6fd
- db_basic_bench_base: upstream
- db_basic_bench_pr: db_basic_bench_base + this PR
- asyncread_db_basic_bench_base: upstream + [db basic bench patch for IteratorNext](main...hx235:rocksdb:micro_bench_async_read)
- asyncread_db_basic_bench_pr: asyncread_db_basic_bench_base + this PR

**Test:**

Get: `TEST_TMPDIR=/dev/shm ./db_basic_bench_{null_stat|base|pr} --benchmark_filter=DBGet/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1/negative_query:0/enable_filter:0/mmap:1/threads:1 --benchmark_repetitions=1000`

Result: Coming soon

AsyncRead: `TEST_TMPDIR=/dev/shm ./asyncread_db_basic_bench_{base|pr} --benchmark_filter=IteratorNext/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1/async_io:1/include_detailed_timers:0 --benchmark_repetitions=1000 > syncread_db_basic_bench_{base|pr}.out`

Result:
- Base: 1956,1956,1968,1977,1979,1986,1988,1988,1988,1990,1991,1991,1993,1993,1993,1993,1994,1996,1997,1997,1997,1998,1999,2001,2001,2002,2004,2007,2007,2008,
- PR (2.3% regression, due to measuring `SST_READ_MICRO` that wasn't measured before): 1993,2014,2016,2022,2024,2027,2027,2028,2028,2030,2031,2031,2032,2032,2038,2039,2042,2044,2044,2047,2047,2047,2048,2049,2050,2052,2052,2052,2053,2053,

Reviewed By: ajkr

Differential Revision: D45918925

Pulled By: hx235

fbshipit-source-id: 58a54560d9ebeb3a59b6d807639692614dad058a
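The refactor it mentions boils down to moving the priority onto the options object; a simplified sketch, assuming the `IOOptions::rate_limiter_priority` field this PR describes:

```cpp
#include "rocksdb/file_system.h"  // IOOptions
#include "rocksdb/options.h"      // ReadOptions

// After #11444, the priority rides on IOOptions (which now reaches the FS
// layer) instead of being a separate trailing Read()/MultiRead() parameter.
rocksdb::IOOptions MakeIOOptions(const rocksdb::ReadOptions& read_options) {
  rocksdb::IOOptions io_opts;
  io_opts.rate_limiter_priority = read_options.rate_limiter_priority;
  return io_opts;
}
```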
Users can set the priority for file reads associated with their operation by setting `ReadOptions::rate_limiter_priority` to something other than `Env::IO_TOTAL`. Rate limiting `VerifyChecksum()` and `VerifyFileChecksums()` is the motivation for this PR, so it also includes benchmarks and minor bug fixes to get that working.

`RandomAccessFileReader::Read()` already had support for rate limiting compaction reads. I changed that rate limiting to be non-specific to compaction, but rather performed according to the passed-in `Env::IOPriority`. Now the compaction read rate limiting is supported by setting `rate_limiter_priority = Env::IO_LOW` on its `ReadOptions`.
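Usage then looks like the following minimal sketch (assuming `db` was opened with `DBOptions::rate_limiter` set, as in the earlier example):

```cpp
#include "rocksdb/db.h"
#include "rocksdb/env.h"
#include "rocksdb/options.h"

// Tag the checksum verification's file reads so DBOptions::rate_limiter
// throttles them; leaving the default Env::IO_TOTAL bypasses rate limiting.
rocksdb::Status RateLimitedVerify(rocksdb::DB* db) {
  rocksdb::ReadOptions read_options;
  read_options.rate_limiter_priority = rocksdb::Env::IO_USER;
  return db->VerifyChecksum(read_options);
}
```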
There is no default value for the new `Env::IOPriority` parameter to `RandomAccessFileReader::Read()`. That means this PR goes through all callers (in some cases multiple layers up the call stack) to find a `ReadOptions` to provide the priority. There are TODOs for cases where I believe it would be good to let the user control the priority some day (e.g., file footer reads), and no TODO in cases where I believe it doesn't matter (e.g., trace file reads).

The API doc only lists the missing cases where a file read associated with a provided `ReadOptions` cannot be rate limited. For cases like file ingestion checksum calculation, there is no API to provide `ReadOptions` or `Env::IOPriority`, so I didn't count that as missing.

Test Plan:
```
./db_bench -benchmarks=fillrandom,compact -db=/tmp/testdb -target_file_size_base=1048576 -disable_auto_compactions=true -file_checksum=true

strace -ttfe pread64 ./db_bench -benchmarks=verifychecksum,verifyfilechecksums -use_existing_db=true -db=/tmp/testdb -rate_limiter_bytes_per_sec=1048576 -rate_limit_bg_reads=1 -rate_limit_user_ops=true -file_checksum=true

python3 tools/db_crashtest.py blackbox --max_key=1000000 --write_buffer_size=524288 --target_file_size_base=524288 --level_compaction_dynamic_level_bytes=true --duration=3600 --rate_limit_bg_reads=true --rate_limit_user_ops=true --rate_limiter_bytes_per_sec=10485760 --interval=10
```