Fix bug in HyperClockCache ApplyToEntries; cleanup #10768

pdillinger · 2022-10-03T20:00:01Z

Summary: We have seen some rare crash test failures in HyperClockCache, and the source could certainly be a bug fixed in this change, in ClockHandleTable::ConstApplyToEntriesRange. It wasn't properly accounting for the fact that incrementing the acquire counter could be ineffective, due to parallel updates. (When incrementing the acquire counter is ineffective, it is incorrect to then decrement it.)

This change includes some other minor clean-up in HyperClockCache, and adds stats_dump_period_sec with a much lower period to the crash test. The periodic stats dump should be the primary caller of ApplyToEntries, since it collects cache entry stats.

Test Plan: haven't been able to reproduce the failure, but should be in a better state (bug fix and improved crash test)
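To make the failure mode in the summary concrete, here is a minimal single-threaded model of it, written in Python to match the crash test script touched by this PR. It is only a sketch: the real code is C++ and uses atomic operations, and everything named here (`Slot`, `SHAREABLE_BIT`, `ACQUIRE_INCREMENT`, `apply_to_slot`, the `interleave` callbacks, and the bit layout) is a hypothetical stand-in rather than the actual ClockHandle layout or API. The `interleave` callbacks model what another thread might do between the reader's steps.

```python
MASK64 = (1 << 64) - 1
SHAREABLE_BIT = 1 << 63   # hypothetical "entry is shareable" state flag
ACQUIRE_INCREMENT = 1     # hypothetical: acquire counter lives in the low bits


class Slot:
    """Stand-in for one table slot holding a 64-bit meta word."""

    def __init__(self, meta=0):
        self.meta = meta

    def fetch_add(self, delta):
        """Model of an atomic fetch_add: returns the value before the update."""
        old = self.meta
        self.meta = (old + delta) & MASK64
        return old

    def fetch_sub(self, delta):
        """Model of an atomic fetch_sub: returns the value before the update."""
        old = self.meta
        self.meta = (old - delta) & MASK64
        return old


def erase(slot):
    """Other thread: the entry goes away and the slot becomes empty."""
    slot.meta = 0


def reinsert(slot):
    """Other thread: a new entry is published with a zeroed counter."""
    slot.meta = SHAREABLE_BIT


def apply_to_slot(slot, func, undo_unconditionally=False, interleave=(None, None)):
    """Visit one slot, calling func only if a reference was really acquired.

    interleave = (after_load, after_increment) are optional callbacks that
    model another thread running at those two points.
    """
    after_load, after_increment = interleave
    if not slot.meta & SHAREABLE_BIT:  # optimistic pre-check
        return
    if after_load:
        after_load(slot)
    old_meta = slot.fetch_add(ACQUIRE_INCREMENT)
    # The increment only counts as a reference if the slot was still
    # shareable at the moment of the increment.
    acquired = bool(old_meta & SHAREABLE_BIT)
    if acquired:
        func(slot)
    if after_increment:
        after_increment(slot)
    # Undo only an effective increment; undo_unconditionally models the
    # kind of mistake described in the summary.
    if acquired or undo_unconditionally:
        slot.fetch_sub(ACQUIRE_INCREMENT)
```

Without a race, the increment and decrement cancel out and the meta word is unchanged. With `interleave=(erase, reinsert)`, the increment lands on an empty slot and is ineffective: the conditional undo leaves the newly published meta word intact, while `undo_unconditionally=True` subtracts from a counter that is already zero and borrows out of the state flag, which corrupts the slot in this model.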


facebook-github-bot · 2022-10-03T20:04:37Z

@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

anand1976

LGTM. One minor comment. Also, HISTORY.md needs to be updated.

anand1976 · 2022-10-06T17:07:39Z

tools/db_crashtest.py

@@ -137,6 +137,7 @@
    "index_block_restart_interval": lambda: random.choice(range(1, 16)),
    "use_multiget": lambda: random.randint(0, 1),
    "periodic_compaction_seconds": lambda: random.choice([0, 0, 1, 2, 10, 100, 1000]),
+    "stats_dump_period_sec": lambda: random.choice([0, 30]),


Is there any reason to set it to 0 (disable)? Maybe use the default of 600 instead of 0.

pdillinger · 2022-10-06

I know we have at least one internal user setting it to 0. Hypothetically, if the stats dump resets some data and thereby prevents an overflow when run periodically, always enabling it could hide a bug. That is hypothetical, but I think it sufficiently motivates keeping 0. Perhaps I'll do [0, 10, 600].
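For readers unfamiliar with the parameter table in the diff above, here is a small sketch of how a dict of lambdas like this can be resolved into concrete option values for one run. The helper name `resolve_params` is hypothetical, and the `stats_dump_period_sec` candidates are the ones floated in the reply, not a confirmed final value.

```python
import random

# Entries follow the style of the diff above. The stats_dump_period_sec
# candidates come from the reply in this thread and are an assumption here.
example_params = {
    "use_multiget": lambda: random.randint(0, 1),
    "stats_dump_period_sec": lambda: random.choice([0, 10, 600]),
}


def resolve_params(params, seed=None):
    """Call each callable entry once; plain values pass through unchanged."""
    if seed is not None:
        random.seed(seed)  # optional, for reproducible sampling
    return {k: (v() if callable(v) else v) for k, v in params.items()}
```

Each call to `resolve_params(example_params)` picks one value per option, so across many crash test runs the stats dump is sometimes disabled (0), sometimes frequent (10), and sometimes at the default period (600).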


facebook-github-bot · 2022-10-06T18:05:20Z

@pdillinger has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot · 2022-10-06T18:10:19Z

@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Summary: We have seen some rare crash test failures in HyperClockCache, and the source could certainly be a bug fixed in this change, in ClockHandleTable::ConstApplyToEntriesRange. It wasn't properly accounting for the fact that incrementing the acquire counter could be ineffective, due to parallel updates. (When incrementing the acquire counter is ineffective, it is incorrect to then decrement it.)

This change includes some other minor clean-up in HyperClockCache, and adds stats_dump_period_sec with a much lower period to the crash test. This should be the primary caller of ApplyToEntries, in collecting cache entry stats.

Pull Request resolved: #10768

Test Plan: haven't been able to reproduce the failure, but should be in a better state (bug fix and improved crash test)

Reviewed By: anand1976
Differential Revision: D40034747
Pulled By: anand1976
fbshipit-source-id: a06fcefe146e17ee35001984445cedcf3b63eb68

facebook-github-bot added the CLA Signed label Oct 3, 2022

Merge branch 'main' of github.com:facebook/rocksdb into hyper_clock_cache_apply_entries_bug

4adf2b1

pdillinger requested review from anand1976 and hx235 October 3, 2022 20:00

anand1976 approved these changes Oct 6, 2022


pdillinger added 2 commits October 6, 2022 10:45

Merge remote-tracking branch 'origin/main' into hyper_clock_cache_apply_entries_bug

fabf90c

HISTORY and db_crashtest.py updates

4a55a78

facebook-github-bot closed this in b205c6d Oct 6, 2022
