content-{sqlite,files,s3}: route checkpoint-get and checkpoint-put through broker #4463

chu11 · 2022-08-03T19:09:05Z

As per-requisite support for issue #4267, we will need to route checkpoints through the broker into the content backing modules.

There is a nice chunk of cleanup to do before this can happen, mostly surrounding topic names. So this is a bit of a pre-requisite PR I'm splitting off from the main PR since it doesn't really do much but touches a lot of lines of code.

Note that in the last commit, there is some code that is "common" and would look like I should add a common function. But they will be modified shortly so I left them separate. I could clean that up if desired.

garlick

Given that the KVS checkpiont functionality is being folded into the content service, does it make sense to prefix the functions with content_ instead of kvs_ and add a flags option to allow the cache to be bypassed like the content functions?

The dump/restore commands have a --no-cache option for bypassing the broker cache. This would let them trivially follow suit with the checkpoint functions.

I'll stop short of suggesting that libcontent be merged into libkvs since it only contains non-public helpers and the content service now more or less exists only for the kvs, but maybe that's food for thought for a future cleanup.

garlick · 2022-08-08T15:16:54Z

src/broker/content-cache.c

-    if (e->dirty)
-        cache->acct_dirty--;


Ah sorry, dead code there.

Note "decrement" is misspelled in the commit description.

garlick · 2022-08-08T15:24:18Z

src/common/libkvs/kvs_checkpoint.c

-                             "kvs-checkpoint.put",
+                             "content-backing.checkpoint-put",


Ah good cleanup. This was initially created as a separate service since it seemed like it ought to be independent of the "content" service, but that's not how it has played out so this change make sense and in fact has been bugging me.

garlick · 2022-08-08T15:29:43Z

src/broker/content-cache.c

+
+    if (flux_respond (cache->h, msg, s) < 0) {
+        flux_log_error (cache->h, "%s: flux_respond", __FUNCTION__);
+        goto error;


Don't call flux_respond_error() to handle a flux_respond() error (it is likely to fail).

garlick · 2022-08-08T15:31:27Z

src/broker/content-cache.c

+/* flux_free_f type */
+static void flux_msg_decref_wrapper (void *arg)
+{
+    const flux_msg_t *msg = arg;
+    flux_msg_decref (msg);
+}


Just cast flux_msg_decref() to flux_free_f instead?

garlick · 2022-08-08T15:34:27Z

src/broker/content-cache.c

+        || flux_future_then (f, -1, checkpoint_get_continuation, cache) < 0) {
+        flux_log_error (h, "%s: checkpoint backing get", __FUNCTION__);


The error message would be "broker: content_checkpoint_get_request: checkpoint backing get: ...".

Maybe just have it be "error starting checkpoint-get RPC: ..."

Also: this leaks msg if the rpc succeeds but the aux_set fails.

garlick · 2022-08-08T15:36:08Z

src/broker/content-cache.c

+    if (flux_respond (cache->h, msg, s) < 0) {
+        flux_log_error (cache->h, "%s: flux_respond", __FUNCTION__);
+        goto error;


When flux_respond() fails, don't try to send an error response (it is likely to fail). Logging the unlikely error is sufficient.

garlick · 2022-08-08T15:40:45Z

src/broker/content-cache.c

+
+error:
+    if (flux_respond_error (h, msg, errno, NULL) < 0)
+        flux_log_error (h, "flux_respond_error");


Instead of "broker: flux_respond_error..." which would be infuriatingly non specific, maybe "error responding to checkpoint-get request" or similar? I realize there's a lot of examples of the former in the code base but I think in this case it's better to be inconsistent and start doing better rather rather than defer fixing these until we can fix all of them.

garlick · 2022-08-08T15:43:18Z

src/broker/content-cache.c

+    if (!(f = flux_rpc (h, topic, s, 0, 0))
+        || flux_future_aux_set (f,
+                             "msg",
+                             (void *)flux_msg_incref (msg),
+                             flux_msg_decref_wrapper) < 0
+        || flux_future_then (f, -1, checkpoint_put_continuation, cache) < 0) {
+        flux_log_error (h, "%s: checkpoint backing put", __FUNCTION__);
+        goto error;


Same comments as other function: leaks msg on error, tries to send an error response in response to a response error, unnecessarily vague log message if flux_respond_error() fails.

chu11 · 2022-08-08T18:32:52Z

Given that the KVS checkpiont functionality is being folded into the content service, does it make sense to prefix the functions with content_ instead of kvs_ and add a flags option to allow the cache to be bypassed like the content functions?

You're referring to the libkvs checkpoint functions, right?

Hmmm. I like the idea of adding the bypass flag, but wondering if we should keep them prefixed as kvs_, solely b/c of the KVS_DEFAULT_CHECKPOINT key. Effectively they act as a wrapper for this special case checkpointing.

chu11 · 2022-08-08T18:40:24Z

Hmmm. I like the idea of adding the bypass flag, but wondering if we should keep them prefixed as kvs_, solely b/c of the KVS_DEFAULT_CHECKPOINT key. Effectively they act as a wrapper for this special case checkpointing.

I'm going to actually do this in a separate PR leaving this one isolated to what it currently does (and adding flags would break libkvs ABI). I'll write up an issue.

Problem: Several lines of code were > 80 characters, which violates RFC7. Correct the long lines.

Problem: The test output file names rootref3.{exp,out} is duplicated in two tests. On an error, we can't debug a failure in the first test using these filenames. Solution: Update filenames in tests to be unique.

Problem: In cache_entry_remove(), the `acct_dirty` variable is decremented if the entry is dirty. However, cache_entry_remove() should never be called if a cache entry is dirty. Solution: Remove the decrement of `acct_dirty` if a cache entry is dirty. Instead add an assert() check that dirty is false.

Problem: In the near future we would like the KVS checkpoint to content modules to be routed through the broker, instead of communicating directly to the modules. The topic names of kvs-checkpoint.get and kvs-checkpoint.put should be renamed similarly to other topics in the content modules so readers will know they all behave in similar fashion. Solution: Rename kvs-checkpoint.get and kvs-checkpoint.put to content-backing.checkpoint-get and content-backing.checkpoint-put respectively. Update all callers and tests accordingly. Due to the length of the topic name, adjust some text from "content-backing.checkpoint" to just "checkpoint" when appropriate.

Problem: In the future, we would like to cache checkpoint data in the broker in the event a content-backing module is not loaded. But this requires checkpoint data to be routed through the broker and not sent directly to the content-backing modules. Solution: Add new content.checkpoint-get and content-checkpoint-put handlers in the broker which will route to content-backing.checkpoint-get and content-backing.checkpoint-put if they are available. Update callers to call content.checkpoint-{get,put}. Update t0018-content-files.t tests that had previously assumed no registration with the broker. Update t2807-dump-cmd.t test that will get different error message as a result.

chu11 · 2022-08-08T19:29:10Z

re-pushed with fixes for comments from above, went ahead and squashed them since they were all relatively minor

codecov · 2022-08-08T20:13:05Z

Codecov Report

Merging #4463 (ff1da9e) into master (92921b3) will increase coverage by 0.01%.
The diff coverage is 80.76%.

@@            Coverage Diff             @@
##           master    #4463      +/-   ##
==========================================
+ Coverage   83.33%   83.35%   +0.01%     
==========================================
  Files         400      400              
  Lines       67273    67342      +69     
==========================================
+ Hits        56062    56130      +68     
- Misses      11211    11212       +1

Impacted Files	Coverage Δ
src/cmd/builtin/restore.c	`87.78% <0.00%> (ø)`
src/common/libkvs/kvs_checkpoint.c	`88.37% <ø> (ø)`
src/modules/content-files/content-files.c	`79.26% <0.00%> (-0.13%)`	⬇️
src/modules/content-sqlite/content-sqlite.c	`67.61% <ø> (-0.08%)`	⬇️
src/broker/content-cache.c	`85.68% <86.30%> (+0.25%)`	⬆️
src/common/libsdprocess/sdprocess.c	`69.13% <0.00%> (-0.13%)`	⬇️
src/broker/overlay.c	`86.79% <0.00%> (+0.10%)`	⬆️
src/cmd/flux-job.c	`87.42% <0.00%> (+0.13%)`	⬆️
src/common/libterminus/terminus.c	`86.06% <0.00%> (+0.24%)`	⬆️
src/common/libflux/handle.c	`83.87% <0.00%> (+0.38%)`	⬆️
... and 2 more

garlick

Sorry, should have approved last time! LGTM!

chu11 mentioned this pull request Aug 4, 2022

support content.backing-module=none #4267

Closed

chu11 force-pushed the issue4267_route_checkpoint_broker branch 3 times, most recently from 2c5741d to 049da09 Compare August 5, 2022 15:30

garlick reviewed Aug 8, 2022

View reviewed changes

chu11 mentioned this pull request Aug 8, 2022

libkvs: add cache bypass option to kvs_checkpoint functions #4476

Closed

chu11 force-pushed the issue4267_route_checkpoint_broker branch from 049da09 to e47c80c Compare August 8, 2022 19:28

chu11 added 7 commits August 8, 2022 12:28

content-s3: fix code line length

177e52c

Problem: Several lines of code were > 80 characters, which violates RFC7. Correct the long lines.

cmd/restore: fix invalid code indent

5ffc761

testsuite: update output test file names

6a62325

Problem: The test output file names rootref3.{exp,out} is duplicated in two tests. On an error, we can't debug a failure in the first test using these filenames. Solution: Update filenames in tests to be unique.

testsuite: fix cut & paste typo in test message

2e75e43

chu11 force-pushed the issue4267_route_checkpoint_broker branch from e47c80c to ff1da9e Compare August 8, 2022 19:28

garlick approved these changes Aug 8, 2022

View reviewed changes

chu11 added the merge-when-passing label Aug 8, 2022

mergify bot merged commit f9bc5da into flux-framework:master Aug 8, 2022

chu11 deleted the issue4267_route_checkpoint_broker branch August 8, 2022 23:24

This was referenced Aug 22, 2022

etc: support content.backing-module=none #4492

Merged

broker: forward content.checkpoint-{get,put} RPCs to rank 0 #4519

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

content-{sqlite,files,s3}: route checkpoint-get and checkpoint-put through broker #4463

content-{sqlite,files,s3}: route checkpoint-get and checkpoint-put through broker #4463

chu11 commented Aug 3, 2022

garlick left a comment

garlick Aug 8, 2022

garlick Aug 8, 2022

garlick Aug 8, 2022

garlick Aug 8, 2022 •

edited

garlick Aug 8, 2022

garlick Aug 8, 2022

garlick Aug 8, 2022

garlick Aug 8, 2022

chu11 commented Aug 8, 2022

chu11 commented Aug 8, 2022 •

edited

chu11 commented Aug 8, 2022

codecov bot commented Aug 8, 2022

garlick left a comment

		\|\| flux_future_then (f, -1, checkpoint_get_continuation, cache) < 0) {
		flux_log_error (h, "%s: checkpoint backing get", __FUNCTION__);

content-{sqlite,files,s3}: route checkpoint-get and checkpoint-put through broker #4463

content-{sqlite,files,s3}: route checkpoint-get and checkpoint-put through broker #4463

Conversation

chu11 commented Aug 3, 2022

garlick left a comment

Choose a reason for hiding this comment

garlick Aug 8, 2022

Choose a reason for hiding this comment

garlick Aug 8, 2022

Choose a reason for hiding this comment

garlick Aug 8, 2022

Choose a reason for hiding this comment

garlick Aug 8, 2022 • edited

Choose a reason for hiding this comment

garlick Aug 8, 2022

Choose a reason for hiding this comment

garlick Aug 8, 2022

Choose a reason for hiding this comment

garlick Aug 8, 2022

Choose a reason for hiding this comment

garlick Aug 8, 2022

Choose a reason for hiding this comment

chu11 commented Aug 8, 2022

chu11 commented Aug 8, 2022 • edited

chu11 commented Aug 8, 2022

codecov bot commented Aug 8, 2022

Codecov Report

garlick left a comment

Choose a reason for hiding this comment

garlick Aug 8, 2022 •

edited

chu11 commented Aug 8, 2022 •

edited