
Rate-limit automatic WAL flush after each user write #9607

Closed
hx235 wants to merge 5 commits

Conversation


@hx235 hx235 commented Feb 19, 2022

Context:
WAL flush is currently not rate-limited by `Options::rate_limiter`. This PR provides rate limiting for automatic WAL flush, the flush that happens after each user write when `Options::manual_wal_flush == false`, by adding `WriteOptions::rate_limiter_priority`.

Note that we are NOT rate-limiting WAL flushes that do NOT happen automatically after each user write, such as `Options::manual_wal_flush == true` + manual `FlushWAL()` (rate-limiting multiple WAL flushes at once). The per-write granularity has the benefits of:

  • being consistent with ReadOptions::rate_limiter_priority
  • being able to turn off rate-limiting for some WAL flushes but not all (e.g., turning it off for the WAL flush of a critical user write like a service's heartbeat)

`WriteOptions::rate_limiter_priority` currently accepts only `Env::IO_USER` and `Env::IO_TOTAL` due to an implementation constraint.

  • The constraint is that parallel writes (including WAL writes) are currently queued under a FIFO policy that does not factor rate limiter priority into scheduling at this layer. If we allowed lower priorities such as `Env::IO_HIGH/MID/LOW`, and writes specified with lower priorities occurred before ones specified with higher priorities (even by a tiny margin in arrival time), the former would block the latter, leading to a "priority inversion" issue that contradicts what we promise for rate-limiting priority. Therefore we only allow `Env::IO_USER` and `Env::IO_TOTAL` for now, until that scheduling is improved.

A prerequisite for this feature is supporting operation-level rate limiting in `WritableFileWriter`, which is also included in this PR.

Summary:

  • Renamed test suite `DBRateLimiterTest` to `DBRateLimiterOnReadTest` to make room for a new test suite
  • Accepted `rate_limiter_priority` in `WritableFileWriter`'s private and public write functions
  • Passed `WriteOptions::rate_limiter_priority` to `WritableFileWriter` in the path of automatic WAL flush

Test:

  • Added a new unit test to verify that existing flush/compaction rate-limiting does not break, since `DBTest.RateLimitingTest` is disabled and current db-level rate-limiting tests focus on reads only (e.g., `db_rate_limiter_test`, `DBTest2.RateLimitedCompactionReads`)
  • Added new unit test `DBRateLimiterOnWriteWALTest.AutoWalFlush`
  • `strace -ftt -e trace=write ./db_bench -benchmarks=fillseq -db=/dev/shm/testdb -rate_limit_auto_wal_flush=1 -rate_limiter_bytes_per_sec=15 -rate_limiter_refill_period_us=1000000 -write_buffer_size=100000000 -disable_auto_compactions=1 -num=100`
    • verified that WAL flushes (i.e., system calls to `write`) were chunked into 15 bytes each, with the writes roughly 1 second apart
    • verified the chunking disappeared with `-rate_limit_auto_wal_flush=0`
  • crash test: `python3 tools/db_crashtest.py blackbox --disable_wal=0 --rate_limit_auto_wal_flush=1 --rate_limiter_bytes_per_sec=10485760 --interval=10`; killed as normal

Benchmarked flush/compaction to ensure no performance regression:

  • compaction with rate-limiting (see table 1, avg over 1280 runs): pre-change: 908316 micros/op; post-change: 907350 micros/op (improved by 0.106%)
```bash
#!/bin/bash
TEST_TMPDIR=/dev/shm/testdb
START=1
NUM_DATA_ENTRY=8
N=10

rm -f compact_bmk_output.txt compact_bmk_output_2.txt dont_care_output.txt
for i in $(eval echo "{$START..$NUM_DATA_ENTRY}")
do
    NUM_RUN=$(($N*(2**($i-1))))
    for j in $(eval echo "{$START..$NUM_RUN}")
    do
       ./db_bench --benchmarks=fillrandom -db=$TEST_TMPDIR -disable_auto_compactions=1 -write_buffer_size=6710886 > dont_care_output.txt && ./db_bench --benchmarks=compact -use_existing_db=1 -db=$TEST_TMPDIR -level0_file_num_compaction_trigger=1 -rate_limiter_bytes_per_sec=100000000 | egrep 'compact'
    done > compact_bmk_output.txt && awk -v NUM_RUN=$NUM_RUN '{sum+=$3;sum_sqrt+=$3^2}END{print sum/NUM_RUN, sqrt(sum_sqrt/NUM_RUN-(sum/NUM_RUN)^2)}' compact_bmk_output.txt >> compact_bmk_output_2.txt
done
```
  • compaction w/o rate-limiting (see table 2, avg over 640 runs): pre-change: 822197 micros/op; post-change: 823148 micros/op (regressed by 0.12%)
    Same as the above script, except with `-rate_limiter_bytes_per_sec=0`
  • flush with rate-limiting (see table 3, avg over 320 runs, run on the patch to augment current db_bench): pre-change: 745752 micros/op; post-change: 745331 micros/op (improved by 0.06%)
```bash
#!/bin/bash
TEST_TMPDIR=/dev/shm/testdb
START=1
NUM_DATA_ENTRY=8
N=10

rm -f flush_bmk_output.txt flush_bmk_output_2.txt

for i in $(eval echo "{$START..$NUM_DATA_ENTRY}")
do
    NUM_RUN=$(($N*(2**($i-1))))
    for j in $(eval echo "{$START..$NUM_RUN}")
    do
       ./db_bench -db=$TEST_TMPDIR -write_buffer_size=1048576000 -num=1000000 -rate_limiter_bytes_per_sec=100000000 -benchmarks=fillseq,flush | egrep 'flush'
    done > flush_bmk_output.txt && awk -v NUM_RUN=$NUM_RUN '{sum+=$3;sum_sqrt+=$3^2}END{print sum/NUM_RUN, sqrt(sum_sqrt/NUM_RUN-(sum/NUM_RUN)^2)}' flush_bmk_output.txt >> flush_bmk_output_2.txt
done
```

  • flush w/o rate-limiting (see table 4, avg over 320 runs, run on the patch to augment current db_bench): pre-change: 487512 micros/op; post-change: 485856 micros/op (improved by 0.34%)
    Same as the above script, except with `-rate_limiter_bytes_per_sec=0`

**table 1 - compaction with rate-limiting**

| #-run | (pre-change) avg micros/op | std micros/op | (post-change) avg micros/op | std micros/op | change in avg micros/op (%) |
|---|---|---|---|---|---|
| 10 | 896978 | 16046.9 | 901242 | 15670.9 | 0.475373978 |
| 20 | 893718 | 15813 | 886505 | 17544.7 | -0.8070778478 |
| 40 | 900426 | 23882.2 | 894958 | 15104.5 | -0.6072681153 |
| 80 | 906635 | 21761.5 | 903332 | 23948.3 | -0.3643141948 |
| 160 | 898632 | 21098.9 | 907583 | 21145 | 0.9960695813 |
| 320 | 905252 | 22785.5 | 908106 | 25325.5 | 0.3152713278 |
| 640 | 905213 | 23598.6 | 906741 | 21370.5 | 0.1688000504 |
| 1280 | 908316 | 23533.1 | 907350 | 24626.8 | -0.1063506533 |
| average over #-run | 901896.25 | 21064.9625 | 901977.125 | 20592.025 | 0.008967217682 |

**table 2 - compaction w/o rate-limiting**

| #-run | (pre-change) avg micros/op | std micros/op | (post-change) avg micros/op | std micros/op | change in avg micros/op (%) |
|---|---|---|---|---|---|
| 10 | 811211 | 26996.7 | 807586 | 28456.4 | -0.4468627768 |
| 20 | 815465 | 14803.7 | 814608 | 28719.7 | -0.105093413 |
| 40 | 809203 | 26187.1 | 797835 | 25492.1 | -1.404839082 |
| 80 | 822088 | 28765.3 | 822192 | 32840.4 | 0.01265071379 |
| 160 | 821719 | 36344.7 | 821664 | 29544.9 | -0.006693285661 |
| 320 | 820921 | 27756.4 | 821403 | 28347.7 | 0.05871454135 |
| 640 | 822197 | 28960.6 | 823148 | 30055.1 | 0.1156657103 |
| average over #-run | 8.18E+05 | 2.71E+04 | 8.15E+05 | 2.91E+04 | -0.25 |

**table 3 - flush with rate-limiting**

| #-run | (pre-change) avg micros/op | std micros/op | (post-change) avg micros/op | std micros/op | change in avg micros/op (%) |
|---|---|---|---|---|---|
| 10 | 741721 | 11770.8 | 740345 | 5949.76 | -0.1855144994 |
| 20 | 735169 | 3561.83 | 743199 | 9755.77 | 1.09226586 |
| 40 | 743368 | 8891.03 | 742102 | 8683.22 | -0.1703059588 |
| 80 | 742129 | 8148.51 | 743417 | 9631.58 | 0.1735547324 |
| 160 | 749045 | 9757.21 | 746256 | 9191.86 | -0.3723407806 |
| 320 | 745752 | 9819.65 | 745331 | 9840.62 | -0.0564530836 |
| 640 | 749006 | 11080.5 | 748173 | 10578.7 | -0.1112140624 |
| average over #-run | 743741.4286 | 9004.218571 | 744117.5714 | 9090.215714 | 0.05057441238 |

**table 4 - flush w/o rate-limiting**

| #-run | (pre-change) avg micros/op | std micros/op | (post-change) avg micros/op | std micros/op | change in avg micros/op (%) |
|---|---|---|---|---|---|
| 10 | 477283 | 24719.6 | 473864 | 12379 | -0.7163464863 |
| 20 | 486743 | 20175.2 | 502296 | 23931.3 | 3.195320734 |
| 40 | 482846 | 15309.2 | 489820 | 22259.5 | 1.444352858 |
| 80 | 491490 | 21883.1 | 490071 | 23085.7 | -0.2887139108 |
| 160 | 493347 | 28074.3 | 483609 | 21211.7 | -1.973864238 |
| 320 | 487512 | 21401.5 | 485856 | 22195.2 | -0.3396839462 |
| 640 | 490307 | 25418.6 | 485435 | 22405.2 | -0.9936631539 |
| average over #-run | 4.87E+05 | 2.24E+04 | 4.87E+05 | 2.11E+04 | 0.00E+00 |

@hx235 hx235 added the WIP Work in progress label Feb 19, 2022
@hx235 hx235 marked this pull request as draft February 19, 2022 03:30
@hx235 hx235 force-pushed the rl_wal branch 7 times, most recently from d54271b to 426792d on February 23, 2022 07:16
@hx235 (Contributor Author) commented Feb 23, 2022

TODO:

  • Add HISTORY.md after rebase
  • db bench/stress test/trace

@hx235 hx235 force-pushed the rl_wal branch 2 times, most recently from 4d65392 to 9f2c805 on February 23, 2022 09:19
@hx235 (Contributor Author) commented Feb 24, 2022

Update:

  • More testing is done; ready for review @ajkr
  • Will add to HISTORY.md after rebase

@hx235 hx235 marked this pull request as ready for review February 24, 2022 06:47
@hx235 hx235 removed the WIP Work in progress label Feb 24, 2022
@hx235 hx235 requested a review from ajkr February 24, 2022 06:47
@facebook-github-bot

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@ajkr (Contributor) commented Mar 2, 2022

Planning to review this one only as it's one functionality and easier to understand the whole picture by looking at one PR. Do you want to merge the descriptions from the other PRs here if needed?

@hx235 (Contributor Author) commented Mar 2, 2022

> Planning to review this one only as it's one functionality and easier to understand the whole picture by looking at one PR. Do you want to merge the descriptions from the other PRs here if needed?

Yep - I will do that tomorrow morning.

@ajkr (Contributor) left a comment:

Sorry, there is an issue I did not think of before, and I am still thinking about how to solve it.

Comment on lines 325 to 326:

```cpp
EXPECT_EQ(flush_rate_limiter_request,
          options_.rate_limiter->GetTotalRequests(Env::IO_HIGH));
```
@ajkr (Contributor):

This is slightly confusing because it assumes compaction has not happened yet when flush_rate_limiter_request was initialized to GetTotalRequests(Env::IO_HIGH) (if compaction had already happened, then compaction requests could have been charged at high-pri and this assertion wouldn't notice). But it is non-deterministic whether compaction has happened at that point.

@hx235 (Contributor Author) commented Mar 2, 2022:

> (if compaction had already happened, then compaction requests could have been charged at high-pri and this assertion wouldn't notice)

Would the above

```cpp
// Init() is set up in a way such that we flush per file
ASSERT_EQ(flush_rate_limiter_request, kNumFiles);
```

be sufficient to notice compaction being charged at high-pri, even in the case where compaction happened before `std::int64_t flush_rate_limiter_request = options_.rate_limiter->GetTotalRequests(Env::IO_HIGH);`?

I was trying to use the following two assertions

```cpp
// Init() is set up in a way such that we flush per file
ASSERT_EQ(flush_rate_limiter_request, kNumFiles);
...
EXPECT_EQ(flush_rate_limiter_request,
          options_.rate_limiter->GetTotalRequests(Env::IO_HIGH));
```

to make sure compaction is not charged at Env::IO_HIGH; otherwise one of the two assertions would catch it.

But you did remind me that I did not check whether compaction is charged at another wrong priority like "MID" or "USER". Actually I have this at the end:

```cpp
EXPECT_EQ(compaction_rate_limiter_request + flush_rate_limiter_request,
          options_.rate_limiter->GetTotalRequests(Env::IO_TOTAL));
```

But given that the current way is confusing, let me think of a way to clarify/write it better.

@hx235 (Contributor Author):

Fixed

Comment on lines 330 to 331:

```cpp
EXPECT_EQ(compaction_rate_limiter_request,
          kNumFiles - options_.level0_file_num_compaction_trigger);
```
@ajkr (Contributor):

Hm, this seems riskier than the above. It is non-deterministic whether the (N+1)th flush happens before the Nth compaction is picked. In this case the symptom would be a test flake rather than an unlikely race condition that causes the assertion to miss a bug.

@hx235 (Contributor Author):

> It is non-deterministic whether the N+1th flush happens before the Nth compaction is picked

Good point - I overlooked this. Let me think more about it.

@hx235 (Contributor Author):

Fixed

```cpp
    10 /* fairness */, RateLimiter::Mode::kWritesOnly));
options.table_factory.reset(
    NewBlockBasedTableFactory(BlockBasedTableOptions()));
options.disable_auto_compactions = GetParam();
```
@ajkr (Contributor):

This feels kind of redundant with having distinct "Flush" and "Compact" tests.

@hx235 (Contributor Author):

I will combine them. Thanks!

```diff
- io_s = log.writer->file()->Sync(immutable_db_options_.use_fsync);
+ io_s =
+     log.writer->file()->Sync(immutable_db_options_.use_fsync,
+                              write_group.leader->rate_limiter_priority);
```
@ajkr (Contributor) Mar 2, 2022:

Data that may be written out here is unrelated to write_group.leader so shouldn't use its rate_limiter_priority. I see the problem you ran into though. For per-request rate limiting we only know the precise rate_limiter_priority until WritableFileWriter::Append(). However the existing rate limiting code was in WritableFileWriter::Write*().

This Sync() would not call the WritableFileWriter::Write*() functions in the common case. I think it requires the user to enable manual WAL flush PLUS not call FlushWAL() for a while in order for WritableFileWriter to have a nonempty buffer that will be written out during this Sync().

The manual WAL flush case has a more likely issue, though, which is we aren't passing any rate_limiter_priority during FlushWAL().

Will think about it more but here are some possible options -

(1) Track number of bytes Append()d at IO_USER priority and charge that amount to rate limiter during Flush()
(2) Make WriteOptions::rate_limiter_priority explicitly not supported together with manual WAL flush
(3) Give up on per-request tracking and make it a DBOptions

@hx235 (Contributor Author) Mar 2, 2022:

(Still reading your arguments)

@hx235 (Contributor Author) Mar 2, 2022:

> Data that may be written out here is unrelated to write_group.leader so shouldn't use its rate_limiter_priority

This is the case I haven't considered thoroughly - need to study more into this code path.

@hx235 (Contributor Author):

Discussed offline that for now we only support custom rate-limiting priority for WAL writes per-operation. Fixed.

@facebook-github-bot

@hx235 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot

@hx235 has updated the pull request. You must reimport the pull request before landing.

@hx235 hx235 changed the title Rate-limit WAL writes Rate-limit WAL writes at granularity of per user write Mar 3, 2022
@facebook-github-bot

@hx235 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot

@hx235 has updated the pull request. You must reimport the pull request before landing.

@hx235 (Contributor Author) commented Mar 4, 2022

Compared with the last update, I made some further improvements:

  • Clarified in the API the cases where Options::manual_wal_flush == true or the priority is set to something other than Env::IO_USER and Env::IO_TOTAL; sanitized WriteOptions::rate_limiter_priority in WriteToWAL.
  • In WritableFileWriter, added a new bool parameter op_override_file_priority and renamed the newly added parameter rate_limiter_priority to op_rate_limiter_priority, for the following reasons:
    • My previous implementation of WritableFileWriter::DecideRateLimiterPriority assumed that, in order to turn off rate-limiting for WAL flush via the operation-level rate limiter priority, the WAL file's io_priority would always be Env::IO_TOTAL. Although this is currently the case, I find it hard to assert it will always hold in the future, considering the presence of SetIOPriority.
    • Therefore I introduced a bool op_override_file_priority to unconditionally override the file's io_priority for the WAL case.
    • To emphasize the concept of op-level vs file-level priority, I renamed rate_limiter_priority to op_rate_limiter_priority in WritableFileWriter.
  • In this improvement, WritableFileWriter::Close() and WritableFileWriter::Sync() no longer accept rate_limiter_priority (now renamed op_rate_limiter_priority), since we don't consider op-level rate-limiting priority in these two functions, as discussed offline.
  • In this improvement, if WritableFileWriter::DecideRateLimiterPriority returns Env::IO_TOTAL, the rate limiter is bypassed instead of being charged at Env::IO_TOTAL. This has 3 consequences:
    • We now need to call WritableFileWriter::DecideRateLimiterPriority regardless of whether rate_limiter_ != nullptr, due to the conditional if (rate_limiter_ != nullptr && rate_limiter_priority_used != Env::IO_TOTAL). Therefore a no-rate-limiter-path benchmark is provided (no regression).
    • WritableFileWriter no longer calls RateLimiter::RequestToken() with Env::IO_TOTAL as before, which I think is fine based on internal search.
    • See PR comment
  • Fixed a typo "priorty"
  • Updated benchmark

```diff
@@ -814,12 +828,13 @@ IOStatus WritableFileWriter::WriteDirectWithChecksum(
   // TODO: need to be improved since it sort of defeats the purpose of the rate
   // limiter
   size_t data_size = left;
-  if (rate_limiter_ != nullptr) {
+  Env::IOPriority rate_limiter_priority_used =
```
@hx235 (Contributor Author) Mar 4, 2022:

One consequence of bypassing the rate limiter instead of charging it at Env::IO_TOTAL is that we need to decide rate_limiter_priority_used before size = rate_limiter_->RequestToken(...).

It's not a big deal as long as the same writable_file_'s io_priority_ is not shared between threads. Otherwise we might have the writable_file_'s io_priority_ change inside the while (data_size > 0) {} loop below.

I don't think this is the case, and I am double-checking with Anand.

@hx235 (Contributor Author):

Hey @anand1976 , another quick question related to concurrency of FileSystem.

Is it guaranteed that writable_file_'s io_priority_ is not shared between threads? I don't think so by briefly inspecting the code but figured I should double check with you.

If you need more context for my question, see the PR comment right above.

Contributor:

Yes. We never have multiple threads writing to the same WritableFile.

@facebook-github-bot

@hx235 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@hx235 (Contributor Author) commented Mar 5, 2022

@ajkr Not sure how far you got in your previous review, so here is a summary of the incremental changes I made since the last update: #9607 (comment), #9607 (comment)

Ready for review.

@hx235 hx235 marked this pull request as ready for review March 5, 2022 07:50
@hx235 hx235 removed the WIP Work in progress label Mar 5, 2022
@ajkr (Contributor) left a comment:

Nice progress! A few smaller comments this time.

HISTORY.md (outdated):

```diff
@@ -10,6 +10,7 @@

 ### Public API changes
 * Remove BlockBasedTableOptions.hash_index_allow_collision which already takes no effect.
+* Added `WriteOptions::rate_limiter_priority`. When set to something other than `Env::IO_TOTAL`, the internal rate limiter (`DBOptions::rate_limiter`) will be charged at the specified priority for automatic WAL flush (`Options::manual_wal_flush` == false) associated with the API to which the `WriteOptions` was provided.
```
@ajkr (Contributor):

Sorry to be pedantic. It looks correct but automatic WAL flush could be elaborated slightly and decoupled from the option's purpose. Such as:

> Added `WriteOptions::rate_limiter_priority`. When set to something other than `Env::IO_TOTAL`, the internal rate limiter (`DBOptions::rate_limiter`) will be charged at the specified priority for writes associated with the API to which the `WriteOptions` was provided. Currently the support covers automatic WAL flushes, which happen during live updates (`Put()`, `Write()`, `Delete()`, etc.) when `WriteOptions::disableWAL == false` and `DBOptions::manual_wal_flush == false`.

@hx235 (Contributor Author):

Got you! I can see your point here - will fix this and the API comment too!

@hx235 (Contributor Author):

Fixed

Comment on lines 1154 to 1157:

```cpp
// See `WriteOptions::rate_limiter_priority` for this constraint
if (manual_wal_flush_ || rate_limiter_priority != Env::IO_USER) {
  rate_limiter_priority = Env::IO_TOTAL;
}
```
@ajkr (Contributor):

Can the conditions be checked at the top of WriteImpl() and return Status::InvalidArgument when violated? I know we discussed this, but at that time I was under the impression manual_wal_flush is dynamically changeable, in which case it'd only be accessible with lock held like PreprocessWrite() where returning failure would cause the DB to enter read-only mode. But it turns out it's not dynamically changeable and manual_wal_flush_ can be accessed anywhere, like on entry to WriteImpl().

Also for the rate_limiter_priority != Env::IO_USER case I don't see a reason to overwrite invalid values here, as opposed to validating on entry to WriteImpl().

@hx235 (Contributor Author):

Good point - I forgot to leave you a note asking for clarification on what you meant by "entering read-only mode". Will fix this.

@hx235 (Contributor Author):

Fixed

Comment on lines 314 to 344:

```cpp
TEST_F(DBRateLimiterOnWriteTest, Compact) {
  Init();

  // files_per_level_pre_compaction: 1,1,...,1 (in total kNumFiles levels)
#ifndef ROCKSDB_LITE
  std::string files_per_level_pre_compaction =
      CreateSimpleFilesPerLevelString("1", "1");
  ASSERT_EQ(files_per_level_pre_compaction, FilesPerLevel(0 /* cf */));
#endif  // !ROCKSDB_LITE

  std::int64_t prev_total_request =
      options_.rate_limiter->GetTotalRequests(Env::IO_TOTAL);
  ASSERT_EQ(0, options_.rate_limiter->GetTotalRequests(Env::IO_LOW));

  Compact(kStartKey, kEndKey);

  std::int64_t actual_compaction_request =
      options_.rate_limiter->GetTotalRequests(Env::IO_TOTAL) -
      prev_total_request;

  // files_per_level_post_compaction: 0,0,...,1 (in total kNumFiles levels)
#ifndef ROCKSDB_LITE
  std::string files_per_level_post_compaction =
      CreateSimpleFilesPerLevelString("0", "1");
  ASSERT_EQ(files_per_level_post_compaction, FilesPerLevel(0 /* cf */));
#endif  // !ROCKSDB_LITE

  std::int64_t expected_compaction_request = kNumFiles - 1;
  EXPECT_EQ(actual_compaction_request, expected_compaction_request);
  EXPECT_EQ(actual_compaction_request,
            options_.rate_limiter->GetTotalRequests(Env::IO_LOW));
}
```
@ajkr (Contributor):

This is hard to understand. Is covering compaction of multiple successive levels (e.g., L0->L1 then L1->L2) in the most flexible way worth the complexity? I'd just flush a few overlapping files with disable_auto_compactions=true then call CompactRange()

@hx235 (Contributor Author) Mar 7, 2022:

Will try CompactRange() and simpler compaction case to see how it reads.

@hx235 (Contributor Author):

Fixed

```diff
- IOStatus Append(const Slice& data, uint32_t crc32c_checksum = 0);
+ IOStatus Append(const Slice& data, uint32_t crc32c_checksum = 0,
+                 Env::IOPriority op_rate_limiter_priority = Env::IO_TOTAL,
+                 bool op_override_file_priority = false);
```
@ajkr (Contributor):

Can a provided op-level priority (i.e., non-Env::IO_TOTAL) always override the file-level priority? It feels expected that the finer granularity setting overrides.

@hx235 (Contributor Author) Mar 7, 2022:

> Can a provided op-level priority (i.e., non-Env::IO_TOTAL) always override the file-level priority?

Yes, I believe this is how WritableFileWriter::DecideRateLimiterPriority works, except that it is written more verbosely (and I can change that).

But the issue op_override_file_priority was trying to solve is when op-level priority == Env::IO_TOTAL, where it does not override the file-level priority for compaction/flush but "does" override for the WAL (more for a future hypothetical case: what if the WAL's io priority is set to non-Env::IO_TOTAL and developers forget to test the compatibility of rate-limiting auto WAL flush with a non-Env::IO_TOTAL WAL io priority).

I am also open to discussing whether it is worth the extra parameter for the "hypothetical case".

@ajkr (Contributor):

> future hypothetical case where "what-if" WAL's io priority is set to non-Env::IO_TOTAL and developers forget to test the compatibility of rate-limiting auto WAL flush and non-Env::IO_TOTAL WAL's io pri

non-Env::IO_TOTAL in the WAL and rate-limiting auto WAL flush means the auto WAL flush priority should override. And it does, because auto-flush callers set this value to true. Whereas other WAL users leave this as false - but they also don't pass an op-level priority, so they'd take the file-level priority regardless of whether this is false/true. So actually I still don't really understand the problem we're solving.

@ajkr (Contributor):

Oh I see, this doesn't behave the way I thought it would. Setting this flag forces op-level priority to override in every case. OK let me rethink it...

@ajkr (Contributor) Mar 7, 2022:

Here is my new understanding: we set this flag to true in auto-flushes, then somebody who comes in the future and gives a file-level priority WAL feature must notice there are op-level priorities too because their new feature won't work at all. Then they make sure it's compatible and set it to false. Well, the logic makes sense now. I wouldn't personally add code complexity to give hypothetical feature implementors a smooth path to not learning the surrounding code, but maybe I'm just mean. So will leave it up to you.

edit: Although, considering SetIOPriority() and GetIOPriority() are public APIs, it's not entirely our choice whether there's a file-level priority on a WAL. An application developer who calls SetIOPriority() on a WAL might not be interested in changing RocksDB and waiting for a release for it to work properly.

@hx235 (Contributor Author):

Discussed offline; we concluded that clarifying FSWritableFile::SetIOPriority addresses my concern behind adding the extra parameter op_override_file_priority.

Fixed.

@ajkr (Contributor):

This would've been extra difficult if we had continued trying to keep #9606 separate

@hx235 (Contributor Author):

TIL!

@facebook-github-bot

@hx235 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot

@hx235 has updated the pull request. You must reimport the pull request before landing.

@hx235 (Contributor Author) commented Mar 8, 2022

Update:

  • Addressed comment
  • Clarified the test a bit more

@facebook-github-bot

@hx235 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@hx235 hx235 requested a review from ajkr March 8, 2022 07:24
@ajkr (Contributor) left a comment:

Excellent PR - thanks!

Comment on lines +402 to +406:

```cpp
::testing::Values(std::make_tuple(false, false, Env::IO_TOTAL),
                  std::make_tuple(false, false, Env::IO_USER),
                  std::make_tuple(false, false, Env::IO_HIGH),
                  std::make_tuple(false, true, Env::IO_USER),
                  std::make_tuple(true, false, Env::IO_USER)),
```
@ajkr (Contributor):

Nice parameterization - thanks for testing the error cases.

facebook-github-bot pushed a commit that referenced this pull request Mar 25, 2022
…WAL flush after each user write (#9745)

Summary:
As title for #9607

Pull Request resolved: #9745

Test Plan: No code change

Reviewed By: ajkr

Differential Revision: D35096901

Pulled By: hx235

fbshipit-source-id: 6bd3671baecfdc04579b0a81a957bfaa7bed81e1