Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

broker: forward content.checkpoint-{get,put} RPCs to rank 0 #4519

Merged

Conversation

chu11
Copy link
Member

@chu11 chu11 commented Aug 24, 2022

Per a discussion in #4492, there was a mistake in PR #4463 were non-rank 0 brokers did not forward checkpoint RPCs to rank 0. Instead, they were forwarded to rank 0's backing module. It has little effect / meaning before the work in #4492, but felt it wise to split it out into its own PR as it solves a specific thing. I also put some of the work from #4492 into this PR, since it made sense. Most notably, the movement of the content-checkpoint code out of content-cache and into its own content-checkpoint files.

Also added a helper lib for the content tests, b/c I was getting irritated constantly modifying all three content tests all of the time.

@chu11 chu11 force-pushed the issue4267_content_checkpoint_forward branch 4 times, most recently from aac39d2 to 903a887 Compare August 24, 2022 17:45
Copy link
Member

@garlick garlick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of nits, otherwise good.

const flux_msg_t *msgcpy = flux_msg_incref (msg);
const flux_msg_t *msgref = flux_msg_incref (msg);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest instead of renaming the variable, we get rid of it and fix the refcounting problem on the error paths, as discussed in the other PR.

While you're at it, may as well fix the future leak that occurs when aux_set or then fails.

Comment on lines 130 to 143
test_expect_success HAVE_JQ 'checkpoint-put on rank 1 forwards to rank 0' '
o=$(checkpoint_put_msg rankone rankref) &&
jq -j -c -n ${o} | flux exec -r 1 ${RPC} content.checkpoint-put
'

test_expect_success HAVE_JQ 'checkpoint-get on rank 1 forwards to rank 0' '
echo rankref >rankref.exp &&
o=$(checkpoint_get_msg rankone) &&
jq -j -c -n ${o} \
| flux exec -r 1 ${RPC} content.checkpoint-get \
| jq -r .value | jq -r .rootref > rankref.out &&
test_cmp rankref.exp rankref.out
'

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests really don't need to be repeated in every backing store since the code they are testing is in the broker not the backing stores... Just pick one?

Problem: The checkpoint_put() and checkpoint_get() helper
functions are copied between three files.

Solution: Create a new helper library content/content-helper.sh
to put common functions associated with content checkpointing.
Problem: In the near future, code in the broker outside of the
content cache will need to determine if a backing module is loaded.
However, there is no way to tell if one is loaded outside of the
content cache.

Solution: Add a new function content_cache_backing_loaded() that
will return true if a backing module is loaded.
Problem: Including the file content-cache.h assumes you have
already included attr.h and that struct content_cache has been
definied elsewhere.

Solution: Include attr.h and add forward declaration of struct
content_cache to avoid compiler failures.
Problem: The content-cache code is becoming large and the content
checkpoint code is largely independent of the content-cache.  In
addition, some upcoming features will bloat the checkpoint code
even more.

Solution: Split out the content-checkpoint code into its own file.
Problem: The content.checkpoint-{get,put} rpcs are only
handled locally.  This can lead to unexpected ENOSYS
errors.

Solution: Forward content.checkpoint-{get,put} rpgs to rank 0
on rank > 0 brokers.  Only forward the RPC to the backing module on
rank 0.
Problem: In several functions in the checkpoint code, a double
decref of a message reference can occur.  This occurs due to
a direct call to flux_msg_decref() and another which can
occur when the future is destroyed.

Solution: Correct the error and make cleanup easier by placing
forwarding requests into their own functions.  In addition,
set an errstr with an explanation of the error that can be
returned to callers.

of a msg reference pointer could occur in an error path.
Problem: There were no content checkpoint tests on rank > 0
to ensure RPCs are forwarded to rank 0 appropriately.

Solution: Add some non-rank 0 checkpoint tests to
t0012-content-sqlite.t.
@chu11 chu11 force-pushed the issue4267_content_checkpoint_forward branch from 903a887 to 8a70430 Compare August 25, 2022 14:55
@chu11
Copy link
Member Author

chu11 commented Aug 25, 2022

re-pushed, removed those extra tests per comments above. Instead of the msgcpy -> msgref commit, I just added a commit that refactored things to remove the temp variable and fixup the flux future double decref issue.

Copy link
Member

@garlick garlick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@mergify mergify bot merged commit b4cc5e0 into flux-framework:master Aug 25, 2022
@chu11 chu11 deleted the issue4267_content_checkpoint_forward branch August 25, 2022 17:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants