New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
broker: forward content.checkpoint-{get,put} RPCs to rank 0 #4519
broker: forward content.checkpoint-{get,put} RPCs to rank 0 #4519
Conversation
aac39d2
to
903a887
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple of nits, otherwise good.
src/broker/content-cache.c
Outdated
const flux_msg_t *msgcpy = flux_msg_incref (msg); | ||
const flux_msg_t *msgref = flux_msg_incref (msg); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest instead of renaming the variable, we get rid of it and fix the refcounting problem on the error paths, as discussed in the other PR.
While you're at it, may as well fix the future leak that occurs when aux_set or then fails.
t/t0024-content-s3.t
Outdated
test_expect_success HAVE_JQ 'checkpoint-put on rank 1 forwards to rank 0' ' | ||
o=$(checkpoint_put_msg rankone rankref) && | ||
jq -j -c -n ${o} | flux exec -r 1 ${RPC} content.checkpoint-put | ||
' | ||
|
||
test_expect_success HAVE_JQ 'checkpoint-get on rank 1 forwards to rank 0' ' | ||
echo rankref >rankref.exp && | ||
o=$(checkpoint_get_msg rankone) && | ||
jq -j -c -n ${o} \ | ||
| flux exec -r 1 ${RPC} content.checkpoint-get \ | ||
| jq -r .value | jq -r .rootref > rankref.out && | ||
test_cmp rankref.exp rankref.out | ||
' | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These tests really don't need to be repeated in every backing store since the code they are testing is in the broker not the backing stores... Just pick one?
Problem: The checkpoint_put() and checkpoint_get() helper functions are copied between three files. Solution: Create a new helper library content/content-helper.sh to put common functions associated with content checkpointing.
Problem: In the near future, code in the broker outside of the content cache will need to determine if a backing module is loaded. However, there is no way to tell if one is loaded outside of the content cache. Solution: Add a new function content_cache_backing_loaded() that will return true if a backing module is loaded.
Problem: Including the file content-cache.h assumes you have already included attr.h and that struct content_cache has been definied elsewhere. Solution: Include attr.h and add forward declaration of struct content_cache to avoid compiler failures.
Problem: The content-cache code is becoming large and the content checkpoint code is largely independent of the content-cache. In addition, some upcoming features will bloat the checkpoint code even more. Solution: Split out the content-checkpoint code into its own file.
Problem: The content.checkpoint-{get,put} rpcs are only handled locally. This can lead to unexpected ENOSYS errors. Solution: Forward content.checkpoint-{get,put} rpgs to rank 0 on rank > 0 brokers. Only forward the RPC to the backing module on rank 0.
Problem: In several functions in the checkpoint code, a double decref of a message reference can occur. This occurs due to a direct call to flux_msg_decref() and another which can occur when the future is destroyed. Solution: Correct the error and make cleanup easier by placing forwarding requests into their own functions. In addition, set an errstr with an explanation of the error that can be returned to callers. of a msg reference pointer could occur in an error path.
Problem: There were no content checkpoint tests on rank > 0 to ensure RPCs are forwarded to rank 0 appropriately. Solution: Add some non-rank 0 checkpoint tests to t0012-content-sqlite.t.
903a887
to
8a70430
Compare
re-pushed, removed those extra tests per comments above. Instead of the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Per a discussion in #4492, there was a mistake in PR #4463 were non-rank 0 brokers did not forward checkpoint RPCs to rank 0. Instead, they were forwarded to rank 0's backing module. It has little effect / meaning before the work in #4492, but felt it wise to split it out into its own PR as it solves a specific thing. I also put some of the work from #4492 into this PR, since it made sense. Most notably, the movement of the content-checkpoint code out of
content-cache
and into its owncontent-checkpoint
files.Also added a helper lib for the content tests, b/c I was getting irritated constantly modifying all three content tests all of the time.