Fix a major performance bug in 6.21 for cache entry stats #8369

pdillinger · 2021-06-08T00:10:56Z

Summary: In final polishing of #8297 (after most manual testing), I
broke my own caching layer by sanitizing an input parameter with
std::min(0, x) instead of std::max(0, x). I resisted unit testing the
timing part of the result caching because historically, these tests
are either flaky or difficult to write, and this was not a correctness
issue. This bug is essentially unnoticeable with a small number
of column families but can explode background work with a
large number of column families.
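
To illustrate why the min/max swap defeats an age-limited cache, here is a minimal sketch. The function names and the seconds-based age parameter are hypothetical, chosen for illustration only; they are not taken from the RocksDB source. With `std::min(0, x)`, any positive age limit collapses to 0, so a cached result older than zero seconds is never reused and every lookup triggers a fresh scan.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical sanitizers for a "maximum age" parameter, in seconds.
// Names are illustrative only and are not taken from the RocksDB source.
inline int64_t SanitizeAgeBuggy(int64_t max_age_s) {
  // Never returns a positive value, so any positive limit becomes 0.
  return std::min<int64_t>(0, max_age_s);
}

inline int64_t SanitizeAgeFixed(int64_t max_age_s) {
  // Clamps negative input to 0 and passes positive limits through.
  return std::max<int64_t>(0, max_age_s);
}

// A cached result may be reused only while its age is within the limit.
inline bool CanReuseCached(int64_t now_s, int64_t last_scan_s,
                           int64_t limit_s) {
  return now_s - last_scan_s <= limit_s;
}
```

With a requested limit of 180 seconds and a result that is 5 seconds old, the buggy sanitizer yields a limit of 0 and forces a rescan, while the fixed one allows the cache hit. Repeated once per column family, that extra rescan is how the background work multiplies.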

This change fixes the logical error, removes some unnecessary related
optimization, and adds mock time/sleeps to the unit test to ensure we
can cache hit within the age limit.

Test Plan: added time testing logic to existing unit test
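
The kind of time-based test described here can be sketched with a manually advanced clock, which avoids the flakiness of real sleeps. `FakeClock` and `CachedScan` below are hypothetical stand-ins invented for this sketch; the actual unit test uses RocksDB's own mock time facilities, which are not reproduced here.

```cpp
#include <cstdint>

// Hypothetical manually advanced clock; stands in for a mock clock in a test.
struct FakeClock {
  int64_t now_s = 0;
  // "Sleeping" only moves the fake time forward; no real waiting happens.
  void Sleep(int64_t seconds) { now_s += seconds; }
};

// Hypothetical result cache that counts how often the expensive scan runs.
class CachedScan {
 public:
  explicit CachedScan(const FakeClock* clock) : clock_(clock) {}

  // Returns the number of scans performed so far, rescanning only when there
  // is no cached result or the cached result is older than max_age_s.
  int Get(int64_t max_age_s) {
    if (!has_result_ || clock_->now_s - last_scan_s_ > max_age_s) {
      ++scan_count_;
      last_scan_s_ = clock_->now_s;
      has_result_ = true;
    }
    return scan_count_;
  }

 private:
  const FakeClock* clock_;
  int64_t last_scan_s_ = 0;
  int scan_count_ = 0;
  bool has_result_ = false;
};
```

A test built on this would call `Get(180)`, advance the clock by less than the limit, and assert that the scan count is unchanged (a cache hit), then advance past the limit and assert that the count increases.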


pdillinger · 2021-06-08T00:13:30Z

I'm assuming this doesn't need a HISTORY.md entry in master since it will be back-ported to 6.21, where the bug was introduced.

facebook-github-bot · 2021-06-08T00:13:50Z

@pdillinger has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

ajkr

LGTM.

facebook-github-bot · 2021-06-08T12:03:43Z

@pdillinger merged this pull request in 2f93a3b.

Summary: In final polishing of #8297 (after most manual testing), I broke my own caching layer by sanitizing an input parameter with std::min(0, x) instead of std::max(0, x). I resisted unit testing the timing part of the result caching because historically, these tests are either flaky or difficult to write, and this was not a correctness issue. This bug is essentially unnoticeable with a small number of column families but can explode background work with a large number of column families.

This change fixes the logical error, removes some unnecessary related optimization, and adds mock time/sleeps to the unit test to ensure we can cache hit within the age limit.

Pull Request resolved: #8369

Test Plan: added time testing logic to existing unit test

Reviewed By: ajkr

Differential Revision: D28950892

Pulled By: pdillinger

fbshipit-source-id: e79cd4ff3eec68fd0119d994f1ed468c38026c3b


Summary: If the block Cache is full with strict_capacity_limit=false, then our CacheEntryStatsCollector could be immediately evicted on release, so iterating through column families with shared block cache could trigger re-scan for each CF. This change fixes that problem by pinning the CacheEntryStatsCollector from InternalStats so that it's not evicted.

I had originally thought that this object could participate in LRU like everything else, but even though a re-load+re-scan only touches memory, it can be orders of magnitude more expensive than other cache misses. One service in Facebook has scans that take ~20s over 100GB block cache that is mostly 4KB entries. (The up-side of this bug and #8369 is that we had a natural experiment on the effect on some service metrics even with block cache scans running continuously in the background--a kind of worst case scenario. Metrics like latency were not affected enough to trigger warnings.)

Other smaller fixes:

20s is already a sizable portion of 600s stats dump period, or 180s default max age to force re-scan, so added logic to ensure that (for each block cache) we don't spend more than 0.2% of our background thread time scanning it. Nevertheless, "foreground" requests for cache entry stats (calls to `db->GetMapProperty(DB::Properties::kBlockCacheEntryStats)`) are permitted to consume more CPU.

Renamed field to cache_entry_stats_ to match code style.

This change is intended for patching in 6.21 release.

Pull Request resolved: #8385

Test Plan: unit test expanded to cover new logic (detect regression), some manual testing with db_bench

Reviewed By: ajkr

Differential Revision: D29042759

Pulled By: pdillinger

fbshipit-source-id: 236faa902397f50038c618f50fbc8cf3f277308c
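
One way to read the 0.2% budget: if a scan takes d seconds, spacing background scans at least 500 × d seconds apart keeps scanning to at most 1/500 = 0.2% of the time. The helper below is a sketch under that assumption; the function name, the microsecond units, and the way the budget is combined with the max age are hypothetical, and the actual logic in the change may differ.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper: the smallest age (in microseconds) a cached result
// must reach before a background rescan is allowed. Assumes the budget is
// enforced by spacing scans at least `duty_cycle_inverse` scan-durations
// apart (500 corresponds to a 0.2% duty cycle).
inline uint64_t MinBackgroundRescanAgeMicros(
    uint64_t max_age_micros, uint64_t last_scan_duration_micros,
    uint64_t duty_cycle_inverse = 500) {
  return std::max(max_age_micros,
                  last_scan_duration_micros * duty_cycle_inverse);
}
```

Under these assumptions, a 20s scan with the 180s default max age would not be repeated in the background for 10,000s, while a 0.1s scan would still be governed by the 180s max age.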

pdillinger requested a review from ltamasi June 8, 2021 00:10

facebook-github-bot added the CLA Signed label Jun 8, 2021

ajkr approved these changes Jun 8, 2021

facebook-github-bot closed this in 2f93a3b Jun 8, 2021

facebook-github-bot added the Merged label Jun 8, 2021

pdillinger mentioned this pull request Jun 10, 2021

Pin CacheEntryStatsCollector to fix performance bug #8385

Closed
