Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

content-{sqlite,files,s3}: route checkpoint-get and checkpoint-put through broker #4463

Merged

Conversation

chu11
Copy link
Member

@chu11 chu11 commented Aug 3, 2022

As per-requisite support for issue #4267, we will need to route checkpoints through the broker into the content backing modules.

There is a nice chunk of cleanup to do before this can happen, mostly surrounding topic names. So this is a bit of a pre-requisite PR I'm splitting off from the main PR since it doesn't really do much but touches a lot of lines of code.

Note that in the last commit, there is some code that is "common" and would look like I should add a common function. But they will be modified shortly so I left them separate. I could clean that up if desired.

@chu11 chu11 force-pushed the issue4267_route_checkpoint_broker branch 3 times, most recently from 2c5741d to 049da09 Compare August 5, 2022 15:30
Copy link
Member

@garlick garlick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that the KVS checkpiont functionality is being folded into the content service, does it make sense to prefix the functions with content_ instead of kvs_ and add a flags option to allow the cache to be bypassed like the content functions?

The dump/restore commands have a --no-cache option for bypassing the broker cache. This would let them trivially follow suit with the checkpoint functions.

I'll stop short of suggesting that libcontent be merged into libkvs since it only contains non-public helpers and the content service now more or less exists only for the kvs, but maybe that's food for thought for a future cleanup.

Comment on lines -348 to -349
if (e->dirty)
cache->acct_dirty--;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sorry, dead code there.

Note "decrement" is misspelled in the commit description.

"kvs-checkpoint.put",
"content-backing.checkpoint-put",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah good cleanup. This was initially created as a separate service since it seemed like it ought to be independent of the "content" service, but that's not how it has played out so this change make sense and in fact has been bugging me.


if (flux_respond (cache->h, msg, s) < 0) {
flux_log_error (cache->h, "%s: flux_respond", __FUNCTION__);
goto error;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't call flux_respond_error() to handle a flux_respond() error (it is likely to fail).

Comment on lines 633 to 638
/* flux_free_f type */
static void flux_msg_decref_wrapper (void *arg)
{
const flux_msg_t *msg = arg;
flux_msg_decref (msg);
}
Copy link
Member

@garlick garlick Aug 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just cast flux_msg_decref() to flux_free_f instead?

Comment on lines 662 to 663
|| flux_future_then (f, -1, checkpoint_get_continuation, cache) < 0) {
flux_log_error (h, "%s: checkpoint backing get", __FUNCTION__);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message would be "broker: content_checkpoint_get_request: checkpoint backing get: ...".

Maybe just have it be "error starting checkpoint-get RPC: ..."

Also: this leaks msg if the rpc succeeds but the aux_set fails.

Comment on lines 686 to 688
if (flux_respond (cache->h, msg, s) < 0) {
flux_log_error (cache->h, "%s: flux_respond", __FUNCTION__);
goto error;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When flux_respond() fails, don't try to send an error response (it is likely to fail). Logging the unlikely error is sufficient.


error:
if (flux_respond_error (h, msg, errno, NULL) < 0)
flux_log_error (h, "flux_respond_error");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of "broker: flux_respond_error..." which would be infuriatingly non specific, maybe "error responding to checkpoint-get request" or similar? I realize there's a lot of examples of the former in the code base but I think in this case it's better to be inconsistent and start doing better rather rather than defer fixing these until we can fix all of them.

Comment on lines 717 to 714
if (!(f = flux_rpc (h, topic, s, 0, 0))
|| flux_future_aux_set (f,
"msg",
(void *)flux_msg_incref (msg),
flux_msg_decref_wrapper) < 0
|| flux_future_then (f, -1, checkpoint_put_continuation, cache) < 0) {
flux_log_error (h, "%s: checkpoint backing put", __FUNCTION__);
goto error;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comments as other function: leaks msg on error, tries to send an error response in response to a response error, unnecessarily vague log message if flux_respond_error() fails.

@chu11
Copy link
Member Author

chu11 commented Aug 8, 2022

Given that the KVS checkpiont functionality is being folded into the content service, does it make sense to prefix the functions with content_ instead of kvs_ and add a flags option to allow the cache to be bypassed like the content functions?

You're referring to the libkvs checkpoint functions, right?

Hmmm. I like the idea of adding the bypass flag, but wondering if we should keep them prefixed as kvs_, solely b/c of the KVS_DEFAULT_CHECKPOINT key. Effectively they act as a wrapper for this special case checkpointing.

@chu11
Copy link
Member Author

chu11 commented Aug 8, 2022

Hmmm. I like the idea of adding the bypass flag, but wondering if we should keep them prefixed as kvs_, solely b/c of the KVS_DEFAULT_CHECKPOINT key. Effectively they act as a wrapper for this special case checkpointing.

I'm going to actually do this in a separate PR leaving this one isolated to what it currently does (and adding flags would break libkvs ABI). I'll write up an issue.

Problem: Several lines of code were > 80 characters, which violates
RFC7.

Correct the long lines.
Problem: The test output file names rootref3.{exp,out} is duplicated
in two tests.  On an error, we can't debug a failure in the first
test using these filenames.

Solution: Update filenames in tests to be unique.
Problem: In cache_entry_remove(), the `acct_dirty` variable is
decremented if the entry is dirty.  However, cache_entry_remove()
should never be called if a cache entry is dirty.

Solution: Remove the decrement of `acct_dirty` if a cache entry
is dirty.  Instead add an assert() check that dirty is false.
Problem: In the near future we would like the KVS checkpoint
to content modules to be routed through the broker, instead
of communicating directly to the modules.  The topic names
of kvs-checkpoint.get and kvs-checkpoint.put should be renamed
similarly to other topics in the content modules so readers
will know they all behave in similar fashion.

Solution: Rename kvs-checkpoint.get and kvs-checkpoint.put to
content-backing.checkpoint-get and content-backing.checkpoint-put
respectively.  Update all callers and tests accordingly.  Due to
the length of the topic name, adjust some text from
"content-backing.checkpoint" to just "checkpoint" when appropriate.
Problem: In the future, we would like to cache checkpoint
data in the broker in the event a content-backing module is
not loaded.  But this requires checkpoint data to be routed
through the broker and not sent directly to the content-backing
modules.

Solution: Add new content.checkpoint-get and content-checkpoint-put
handlers in the broker which will route to content-backing.checkpoint-get
and content-backing.checkpoint-put if they are available.  Update
callers to call content.checkpoint-{get,put}.

Update t0018-content-files.t tests that had previously assumed no
registration with the broker.

Update t2807-dump-cmd.t test that will get different error message
as a result.
@chu11 chu11 force-pushed the issue4267_route_checkpoint_broker branch from e47c80c to ff1da9e Compare August 8, 2022 19:28
@chu11
Copy link
Member Author

chu11 commented Aug 8, 2022

re-pushed with fixes for comments from above, went ahead and squashed them since they were all relatively minor

@codecov
Copy link

codecov bot commented Aug 8, 2022

Codecov Report

Merging #4463 (ff1da9e) into master (92921b3) will increase coverage by 0.01%.
The diff coverage is 80.76%.

@@            Coverage Diff             @@
##           master    #4463      +/-   ##
==========================================
+ Coverage   83.33%   83.35%   +0.01%     
==========================================
  Files         400      400              
  Lines       67273    67342      +69     
==========================================
+ Hits        56062    56130      +68     
- Misses      11211    11212       +1     
Impacted Files Coverage Δ
src/cmd/builtin/restore.c 87.78% <0.00%> (ø)
src/common/libkvs/kvs_checkpoint.c 88.37% <ø> (ø)
src/modules/content-files/content-files.c 79.26% <0.00%> (-0.13%) ⬇️
src/modules/content-sqlite/content-sqlite.c 67.61% <ø> (-0.08%) ⬇️
src/broker/content-cache.c 85.68% <86.30%> (+0.25%) ⬆️
src/common/libsdprocess/sdprocess.c 69.13% <0.00%> (-0.13%) ⬇️
src/broker/overlay.c 86.79% <0.00%> (+0.10%) ⬆️
src/cmd/flux-job.c 87.42% <0.00%> (+0.13%) ⬆️
src/common/libterminus/terminus.c 86.06% <0.00%> (+0.24%) ⬆️
src/common/libflux/handle.c 83.87% <0.00%> (+0.38%) ⬆️
... and 2 more

Copy link
Member

@garlick garlick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, should have approved last time! LGTM!

@mergify mergify bot merged commit f9bc5da into flux-framework:master Aug 8, 2022
@chu11 chu11 deleted the issue4267_route_checkpoint_broker branch August 8, 2022 23:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants