Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

content-cache: general cleanup, small bug fixes, and test improvement #3645

Merged
merged 13 commits into from May 9, 2021

Conversation

garlick
Copy link
Member

@garlick garlick commented May 9, 2021

Split from PR #3639 (this should go in first).

@garlick
Copy link
Member Author

garlick commented May 9, 2021

I don't see a lot of opportunity to increase the diff coverage here since most of it is unlikely error paths.

Copy link
Member

@chu11 chu11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall, LGTM, although one comment below

@@ -14,14 +14,14 @@ RPC=${FLUX_BUILD_DIR}/t/request/rpc

HASHFUN=`flux getattr content.hash`

test_expect_success 'load heartbeat module with fast rate to drive purge' '
flux module load heartbeat period=1s
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

skimming the heartbeat module, could we make the period < 1s? 1s seems long in the unit tests (especially w/ the sleep 1 below).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sync callback rate is forced to be within sync_min=1 and sync_max=10 seconds, so it doesn't help to make the heartbeat < 1s. However, that min value was set to avoid triggering the heavyweight cache purge too frequently, and now with the LRU, the purge doesn't need to scan for eligible items and should exit immediately if there is nothing to do. So maybe we can just eliminate the min value and crank the heartbeat down as you say. I'll go ahead and do that.

@garlick
Copy link
Member Author

garlick commented May 9, 2021

OK, made that change. DId you want to have another look @chu11? I see you already approved and I'll set MWP if you're satisfied.

@chu11
Copy link
Member

chu11 commented May 9, 2021

@garlick LGTM

@garlick
Copy link
Member Author

garlick commented May 9, 2021

Thanks!

garlick added 13 commits May 9, 2021 19:20
Problem: the content.dropcache rpc handler walks the entire cache,
but it only needs to walk the LRU now that all LRU entries are neither
dirty nor invalid.

Simplify the content.dropcache RPC handler.
Problem: when a cache entry is used and moved to the front of the LRU,
the lastused timestamp is also updated; however, if the entry is already
at the front, it is not.

Update entry->lastused when entry is already at the front as well.
Problem: if flux_future_aux_set() fails in cache_store(), a
future is leaked.

Simplify that function so there is only one error path and
thereby stop the leak.
Problem: if flux_future_aux_set() fails in cache_load(), a
future is leaked.

Simplify that function so there is only one error path and
thereby stop the leak.
Problem: comment misuses semicolon.

Fix semicolon usage.
Problem: Some request handlers pass the request message through
flux_request_decode() with NULL arguments, which accomplishes
nothing since the message dispatcher will have already verified
the message type and topic string.

Drop the extra checks.
Problem: cache_load() contains handling for ENOSYS and
ENOENT errors but those errors cannot occur until the
continuation.

Drop dead error handling code.
Problem: t0012-content-sqlite.t has inconsistent indent
and use of tabs vs spaces.

Convert to single tab indent.
Problem: codecov report shows that test to supposedly exercise
the content-cache store batching on rank 0 when a backing store
is loaded is not actually exercising that code.

Probably the synchronous stores from the test take about as long
as the stores from the cache to sqlite, so no backlog is created
by the test.

Solution: overlap the content store RPCs from the test.
Problem: the content.flush request handler contains dead
code and emits useless debug log messages.

Drop the dead code and the logs.
Problem: content.dropcache has no test coverage.

Drop the cache once in the t0012-content-sqlite sharness test.
Problem: cache entries are purged every heartbeat with the period
bounded by sync_min=1 and sync_max=10 seconds.  In test we would
like to crank down the heartbeat period to make the test run faster
but setting it less than sync_min doesn't help.

sync_min was established to avoid triggering the heavyweight cache
purge too frequently, for example when heartbeat messages "bunch up"
in the message queue.  Now that purging uses an LRU, it doesn't need
to scan for eligible items, and exits immediately if there is nothing to do.

Eliminate the sync_min lower bound on the cache purge period.
Problem: test coverage for purging the content-cache in front of
a backing store is minimal.

Add some tests to t0012-content-sqlite.t that ought to improve coverage.
@codecov
Copy link

codecov bot commented May 9, 2021

Codecov Report

Merging #3645 (c149c03) into master (642fa1a) will increase coverage by 0.13%.
The diff coverage is 86.20%.

@@            Coverage Diff             @@
##           master    #3645      +/-   ##
==========================================
+ Coverage   82.65%   82.78%   +0.13%     
==========================================
  Files         325      325              
  Lines       49076    49041      -35     
==========================================
+ Hits        40562    40597      +35     
+ Misses       8514     8444      -70     
Impacted Files Coverage Δ
src/broker/content-cache.c 83.70% <80.00%> (+12.93%) ⬆️
src/cmd/flux-kvs.c 81.58% <100.00%> (-0.09%) ⬇️
src/modules/kvs/kvs.c 67.78% <100.00%> (+0.07%) ⬆️
src/broker/module.c 76.30% <0.00%> (-1.25%) ⬇️
src/broker/state_machine.c 78.59% <0.00%> (-0.85%) ⬇️
src/common/libflux/message.c 84.04% <0.00%> (+0.12%) ⬆️
src/modules/job-info/guest_watch.c 76.61% <0.00%> (+0.61%) ⬆️
src/broker/overlay.c 88.86% <0.00%> (+0.75%) ⬆️
src/cmd/builtin/content.c 73.87% <0.00%> (+8.10%) ⬆️

@mergify mergify bot merged commit 560293e into flux-framework:master May 9, 2021
@garlick garlick deleted the content_cleanup branch May 9, 2021 21:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants