-
Notifications
You must be signed in to change notification settings - Fork 6.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor (Hyper)ClockCache code for upcoming changes #11572
Conversation
@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Summary: Separate out some functionality that will be common to both static and dynamic HCC into BaseClockTable. Table::InsertState and GrowIfNeeded will be used by the dynamic HCC so don't make much sense right now. One small functional change in insertion has the process abort as soon as it finds a pre-existing match (already_matches & FindSlot). This wasn't strictly needed for performance but cleans up the calling of TryInsert as we now don't try to call it after we've already seen a match. Test Plan: existing tests. TODO: performance tests
8285618
to
65dcd46
Compare
@pdillinger has updated the pull request. You must reimport the pull request before landing. |
@pdillinger has updated the pull request. You must reimport the pull request before landing. |
Summary: Stacked on facebook#11572 * Minimize use of std::function and lambdas to minimize chances of compiler heap-allocating closures (unnecessary stress on allocator). It appears that converting FindSlot to a template enables inlining the lambda parameters, avoiding heap allocations. * Clean up some logic with FindSlot * Fix handling of rare case of probing all slots, with new unit test. (Previously Insert would not roll back displacements in that case, which would kill performance if it were to happen.) * Add an -early_exit option to cache_bench for gathering memory stats before deallocation. Test Plan: unit test added for probing all slots ## Seeing heap allocations Run `MALLOC_CONF="stats_print:true" ./cache_bench -cache_type=hyper_clock_cache` before facebook#11572 vs. after this change. Before, we see this in the interesting bin statistics: ``` size nrequests ---- --------- 32 578460 64 24340 8192 578460 ``` And after: ``` size nrequests ---- --------- 32 (insignificant) 64 24370 8192 579130 ``` ## Performance test Build with `make USE_CLANG=1 PORTABLE=0 DEBUG_LEVEL=0 -j32 cache_bench` Run `./cache_bench -cache_type=hyper_clock_cache -ops_per_thread=5000000` in before and after configurations, simultaneously: Before: Complete in 33.244 s; Rough parallel ops/sec = 2406442 After: Complete in 32.773 s; Rough parallel ops/sec = 2441019
@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@pdillinger has updated the pull request. You must reimport the pull request before landing. |
@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Summary: Stacked on facebook#11572 * Minimize use of std::function and lambdas to minimize chances of compiler heap-allocating closures (unnecessary stress on allocator). It appears that converting FindSlot to a template enables inlining the lambda parameters, avoiding heap allocations. * Clean up some logic with FindSlot * Fix handling of rare case of probing all slots, with new unit test. (Previously Insert would not roll back displacements in that case, which would kill performance if it were to happen.) * Add an -early_exit option to cache_bench for gathering memory stats before deallocation. Test Plan: unit test added for probing all slots ## Seeing heap allocations Run `MALLOC_CONF="stats_print:true" ./cache_bench -cache_type=hyper_clock_cache` before facebook#11572 vs. after this change. Before, we see this in the interesting bin statistics: ``` size nrequests ---- --------- 32 578460 64 24340 8192 578460 ``` And after: ``` size nrequests ---- --------- 32 (insignificant) 64 24370 8192 579130 ``` ## Performance test Build with `make USE_CLANG=1 PORTABLE=0 DEBUG_LEVEL=0 -j32 cache_bench` Run `./cache_bench -cache_type=hyper_clock_cache -ops_per_thread=5000000` in before and after configurations, simultaneously: Before: Complete in 33.244 s; Rough parallel ops/sec = 2406442 After: Complete in 32.773 s; Rough parallel ops/sec = 2441019
Summary: ... ahead of planned dynamic HCC variant. This changes simplifies some logic while still enabling future code sharing between implementation variants. Stacked on facebook#11572 Test Plan: existing tests
// required, and the operation should fail if not possible. | ||
// NOTE: Otherwise, occupancy_ is not managed in this function | ||
template <class Table> | ||
Status ChargeUsageMaybeEvictStrict(size_t total_charge, size_t capacity, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this and below ChargeUsageMaybeEvictNonStrict
were private before, should they continue to have more restricted access.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I'll fix this in #11609
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for the PR!
@pdillinger merged this pull request in c3c84b3. |
Summary: ... ahead of planned dynamic HCC variant. This changes simplifies some logic while still enabling future code sharing between implementation variants. Stacked on facebook#11572 Test Plan: existing tests
Summary: Stacked on facebook#11572 * Minimize use of std::function and lambdas to minimize chances of compiler heap-allocating closures (unnecessary stress on allocator). It appears that converting FindSlot to a template enables inlining the lambda parameters, avoiding heap allocations. * Clean up some logic with FindSlot * Fix handling of rare case of probing all slots, with new unit test. (Previously Insert would not roll back displacements in that case, which would kill performance if it were to happen.) * Add an -early_exit option to cache_bench for gathering memory stats before deallocation. Test Plan: unit test added for probing all slots ## Seeing heap allocations Run `MALLOC_CONF="stats_print:true" ./cache_bench -cache_type=hyper_clock_cache` before facebook#11572 vs. after this change. Before, we see this in the interesting bin statistics: ``` size nrequests ---- --------- 32 578460 64 24340 8192 578460 ``` And after: ``` size nrequests ---- --------- 32 (insignificant) 64 24370 8192 579130 ``` ## Performance test Build with `make USE_CLANG=1 PORTABLE=0 DEBUG_LEVEL=0 -j32 cache_bench` Run `./cache_bench -cache_type=hyper_clock_cache -ops_per_thread=5000000` in before and after configurations, simultaneously: Before: Complete in 33.244 s; Rough parallel ops/sec = 2406442 After: Complete in 32.773 s; Rough parallel ops/sec = 2441019
Summary: Stacked on #11572 * Minimize use of std::function and lambdas to minimize chances of compiler heap-allocating closures (unnecessary stress on allocator). It appears that converting FindSlot to a template enables inlining the lambda parameters, avoiding heap allocations. * Clean up some logic with FindSlot (FIXMEs from #11572) * Fix handling of rare case of probing all slots, with new unit test. (Previously Insert would not roll back displacements in that case, which would kill performance if it were to happen.) * Add an -early_exit option to cache_bench for gathering memory stats before deallocation. Pull Request resolved: #11601 Test Plan: unit test added for probing all slots ## Seeing heap allocations Run `MALLOC_CONF="stats_print:true" ./cache_bench -cache_type=hyper_clock_cache` before #11572 vs. after this change. Before, we see this in the interesting bin statistics: ``` size nrequests ---- --------- 32 578460 64 24340 8192 578460 ``` And after: ``` size nrequests ---- --------- 32 (insignificant) 64 24370 8192 579130 ``` ## Performance test Build with `make USE_CLANG=1 PORTABLE=0 DEBUG_LEVEL=0 -j32 cache_bench` Run `./cache_bench -cache_type=hyper_clock_cache -ops_per_thread=5000000` in before and after configurations, simultaneously: ``` Before: Complete in 33.244 s; Rough parallel ops/sec = 2406442 After: Complete in 32.773 s; Rough parallel ops/sec = 2441019 ``` Reviewed By: jowlyzhang Differential Revision: D47375092 Pulled By: pdillinger fbshipit-source-id: 46f0f57257ddb374290a0a38c651764ea60ba410
Summary: Separate out some functionality that will be common to both static and dynamic HCC into BaseClockTable. Table::InsertState and GrowIfNeeded will be used by the dynamic HCC so don't make much sense right now. Pull Request resolved: facebook#11572 Test Plan: existing tests. No functional changes intended. Performance test in subsequent PR facebook#11601 Reviewed By: jowlyzhang Differential Revision: D47110496 Pulled By: pdillinger fbshipit-source-id: 379bd433322a42ea28c0043b41ec24956d21e7aa
Summary: ... ahead of planned dynamic HCC variant. This changes simplifies some logic while still enabling future code sharing between implementation variants. Stacked on facebook#11572 Test Plan: existing tests
Summary: ... ahead of planned dynamic HCC variant. This changes simplifies some logic while still enabling future code sharing between implementation variants. Detail: For complicated reasons, using a std::function parameter to `ConstApplyToEntriesRange` with a lambda argument does not play nice with templated HandleImpl. An explicit conversion to std::function would be needed for it to compile. Templating the function type is the easy work-around. Also made some functions from #11572 private as recommended Pull Request resolved: #11609 Test Plan: existing tests Reviewed By: jowlyzhang Differential Revision: D47407415 Pulled By: pdillinger fbshipit-source-id: 0f65954db16335999b78fb7d2563ec627624cef0
Summary: Separate out some functionality that will be common to both static and dynamic HCC into BaseClockTable. Table::InsertState and GrowIfNeeded will be used by the dynamic HCC so don't make much sense right now.
Test Plan: existing tests. No functional changes intended.
Performance test in subsequent PR #11601