content-cache: general cleanup, small bug fixes, and test improvement #3645
I don't see a lot of opportunity to increase the diff coverage here since most of it is unlikely error paths.
overall, LGTM, although one comment below
t/t0012-content-sqlite.t
@@ -14,14 +14,14 @@ RPC=${FLUX_BUILD_DIR}/t/request/rpc

HASHFUN=`flux getattr content.hash`

test_expect_success 'load heartbeat module with fast rate to drive purge' '
	flux module load heartbeat period=1s
skimming the heartbeat module, could we make the period < 1s? 1s seems long in the unit tests (especially w/ the sleep 1 below).
The sync callback rate is forced to be within sync_min=1 and sync_max=10 seconds, so it doesn't help to make the heartbeat < 1s. However, that min value was set to avoid triggering the heavyweight cache purge too frequently, and now with the LRU, the purge doesn't need to scan for eligible items and should exit immediately if there is nothing to do. So maybe we can just eliminate the min value and crank the heartbeat down as you say. I'll go ahead and do that.
OK, made that change. Did you want to have another look @chu11? I see you already approved and I'll set MWP if you're satisfied.
@garlick LGTM
Thanks!
Problem: the content.dropcache rpc handler walks the entire cache, but it only needs to walk the LRU now that all LRU entries are neither dirty nor invalid. Simplify the content.dropcache RPC handler.
Problem: when a cache entry is used and moved to the front of the LRU, the lastused timestamp is also updated; however, if the entry is already at the front, it is not. Update entry->lastused when entry is already at the front as well.
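The lastused fix above can be illustrated with a minimal sketch. It assumes a hypothetical doubly linked LRU list; `struct entry`, `struct lru`, `lru_push_front()`, and `lru_touch()` are illustrative names, not the actual content-cache code. The point is only that the timestamp is refreshed on every use, including when the entry is already at the front.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cache entry on a doubly linked LRU list. */
struct entry {
    struct entry *prev;
    struct entry *next;
    double lastused;        /* time of most recent use */
};

struct lru {
    struct entry *head;     /* most recently used */
    struct entry *tail;     /* least recently used */
};

/* Unlink an entry that is currently on the list. */
static void lru_unlink (struct lru *lru, struct entry *e)
{
    if (e->prev)
        e->prev->next = e->next;
    else
        lru->head = e->next;
    if (e->next)
        e->next->prev = e->prev;
    else
        lru->tail = e->prev;
    e->prev = e->next = NULL;
}

/* Insert an entry that is not currently on the list at the front. */
static void lru_push_front (struct lru *lru, struct entry *e)
{
    e->prev = NULL;
    e->next = lru->head;
    if (lru->head)
        lru->head->prev = e;
    else
        lru->tail = e;
    lru->head = e;
}

/* Mark an entry (already on the list) as used at time 'now'.
 * The timestamp is updated unconditionally, so an entry that is
 * already at the front still gets a fresh lastused value.
 */
static void lru_touch (struct lru *lru, struct entry *e, double now)
{
    assert (lru != NULL && e != NULL);
    if (lru->head != e) {
        lru_unlink (lru, e);
        lru_push_front (lru, e);
    }
    e->lastused = now;
}
```

Keeping the timestamp update outside the `if` keeps the list ordered by lastused from front to back, which is what allows a purge to inspect only the tail.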
Problem: if flux_future_aux_set() fails in cache_store(), a future is leaked. Simplify that function so there is only one error path and thereby stop the leak.
Problem: if flux_future_aux_set() fails in cache_load(), a future is leaked. Simplify that function so there is only one error path and thereby stop the leak.
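The two leak fixes above follow the common C single-exit idiom. Below is a minimal sketch under stated assumptions: `struct future`, `future_create()`, `future_destroy()`, `future_aux_set()`, `start_op()`, and the `live_futures` counter are hypothetical stand-ins that only mimic the shape of the real calls, and are not the flux-core API or the actual cache_store()/cache_load() code.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for a future object. */
struct future {
    void *aux;
};

/* Counts futures that have been created but not yet destroyed. */
static int live_futures = 0;

static struct future *future_create (void)
{
    struct future *f = calloc (1, sizeof (*f));
    if (f)
        live_futures++;
    return f;
}

/* Destroy a future, preserving errno so it is safe in error paths. */
static void future_destroy (struct future *f)
{
    if (f) {
        int saved_errno = errno;
        assert (live_futures > 0);
        free (f);
        live_futures--;
        errno = saved_errno;
    }
}

/* Attach aux data; fails with EINVAL if aux is NULL (assumed behavior). */
static int future_aux_set (struct future *f, void *aux)
{
    if (!aux) {
        errno = EINVAL;
        return -1;
    }
    f->aux = aux;
    return 0;
}

/* Every failure jumps to the same label, so the future is released
 * no matter which step failed.
 */
static struct future *start_op (void *aux)
{
    struct future *f = NULL;

    if (!(f = future_create ()))
        goto error;
    if (future_aux_set (f, aux) < 0)
        goto error;
    return f;
error:
    future_destroy (f);
    return NULL;
}
```

With a single `error:` label there is no early `return` that could skip the cleanup, which is the property the commits rely on to stop the leak.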
Problem: comment misuses semicolon. Fix semicolon usage.
Problem: Some request handlers pass the request message through flux_request_decode() with NULL arguments, which accomplishes nothing since the message dispatcher will have already verified the message type and topic string. Drop the extra checks.
Problem: cache_load() contains handling for ENOSYS and ENOENT errors but those errors cannot occur until the continuation. Drop dead error handling code.
Problem: t0012-content-sqlite.t has inconsistent indent and use of tabs vs spaces. Convert to single tab indent.
Problem: the codecov report shows that the test meant to exercise content-cache store batching on rank 0 when a backing store is loaded is not actually exercising that code. Probably the synchronous stores from the test take about as long as the stores from the cache to sqlite, so no backlog is created by the test. Solution: overlap the content store RPCs from the test.
Problem: the content.flush request handler contains dead code and emits useless debug log messages. Drop the dead code and the logs.
Problem: content.dropcache has no test coverage. Drop the cache once in the t0012-content-sqlite sharness test.
Problem: cache entries are purged every heartbeat with the period bounded by sync_min=1 and sync_max=10 seconds. In test we would like to crank down the heartbeat period to make the test run faster but setting it less than sync_min doesn't help. sync_min was established to avoid triggering the heavyweight cache purge too frequently, for example when heartbeat messages "bunch up" in the message queue. Now that purging uses an LRU, it doesn't need to scan for eligible items, and exits immediately if there is nothing to do. Eliminate the sync_min lower bound on the cache purge period.
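Why an LRU-based purge is cheap enough to run on every heartbeat can be sketched as follows. This is an illustration under assumptions, not the content-cache implementation: `struct entry`, `struct lru`, `lru_purge()`, and the `max_age` parameter are hypothetical, and the list is assumed to be ordered by lastused with the oldest entry at the tail.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical LRU entry; only the link toward the front is needed here. */
struct entry {
    struct entry *prev;     /* next more recently used entry */
    double lastused;        /* time of most recent use */
};

struct lru {
    struct entry *tail;     /* least recently used entry */
    int count;
};

/* Evict entries that have been idle for more than max_age seconds.
 * Because the tail is the oldest entry, the loop stops at the first
 * entry that is too young: if the tail is not eligible, nothing is,
 * and the call returns after a single comparison.
 * Returns the number of entries evicted.
 */
static int lru_purge (struct lru *lru, double now, double max_age)
{
    int evicted = 0;

    assert (lru != NULL && max_age >= 0);
    while (lru->tail && now - lru->tail->lastused > max_age) {
        struct entry *e = lru->tail;
        lru->tail = e->prev;
        e->prev = NULL;
        lru->count--;
        evicted++;
    }
    return evicted;
}
```

A purge that scans the whole cache costs time proportional to the cache size on every call, which is what a lower bound like sync_min guards against; the tail-only check above costs constant time when there is nothing to evict.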
Problem: test coverage for purging the content-cache in front of a backing store is minimal. Add some tests to t0012-content-sqlite.t that ought to improve coverage.
a63d788 to c149c03
Codecov Report
@@            Coverage Diff             @@
##           master    #3645      +/-   ##
==========================================
+ Coverage   82.65%   82.78%   +0.13%
==========================================
  Files         325      325
  Lines       49076    49041      -35
==========================================
+ Hits        40562    40597      +35
+ Misses       8514     8444      -70
Split from PR #3639 (this should go in first).