Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvs: have two request handlers for commits & fences #1349

Merged
merged 9 commits into from Mar 6, 2018

Conversation

chu11
Copy link
Member

@chu11 chu11 commented Mar 5, 2018

Instead of sending all commit and fence RPCs to the same kvs handler, have one for commits (kvs.commit) and one for fences (kvs.fence).

This makes the code far clearer on how commits and fences are different.

A) commits do not require "nprocs" and "name" to be sent in their RPCs, so they are now removed.

B) a nice chunk of code is not needed for commits b/c nprocs is always 1 (i.e. don't need to do a lookup), so all that code is now removed in the commit path.

This completes issue #1344.

I was curious if this would improve performance, as commits are now far simpler than fences (less allocations, etc.). A soak test showed:

before

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.210   0.230   0.240   0.238   0.240   0.300 

after

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.2100  0.2300  0.2400  0.2364  0.2400  0.3000 

Granted only 1 run over 1000 jobs, but mean is a tad better.

In relaycommit_request_cb(), fix corner case in which errno was
not set correctly in error path.
In order to differentiate commits from fences better, create
a new set of request callbacks kvs.fence and kvs.relayfence.

In libkvs, have commits call kvs.commit and have fences call
kvs.fence appropriately.
As nprocs is always 1 with commits, remove it from commit RPCs.

Adjust callers in libkvs appropriately.
Commit requests are each unique, therefore there is
no need to lookup if a commit already exists.  In addition, there
is no need to check that flags & nprocs are identical to prior
calls.
@chu11 chu11 added this to To do in multi-user parallel jobs via automation Mar 5, 2018
@chu11 chu11 removed this from To do in multi-user parallel jobs Mar 5, 2018
@chu11 chu11 added this to To do in multi-user parallel jobs via automation Mar 5, 2018
@chu11 chu11 moved this from To do to In progress in multi-user parallel jobs Mar 5, 2018
@chu11 chu11 self-assigned this Mar 5, 2018
@chu11
Copy link
Member Author

chu11 commented Mar 6, 2018

Hmmm, saw a

PASS: t0015-cron.t 21 - module load with sync
No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.

and also hit #1025. Restarted builders.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.05%) to 78.813% when pulling 7b38b2e on chu11:issue1344-part3 into 3e73946 on flux-framework:master.

@chu11
Copy link
Member Author

chu11 commented Mar 6, 2018

Diff coverage doesn't look to good (63.82%) but looking through the coverage, its predominantly "impossible" error path coverage being missed. Not sure I'll be able to eek out better coverage with unit tests.

@garlick
Copy link
Member

garlick commented Mar 6, 2018

Good cleanup!

I'd probably drop the commit that diverts fence to commit if fence nprocs=1, since that left turn isn't likely justified by any real world benefit.

In the commit request callback, it looks like a uuid is still being created, but on the server side. Is that necessary? If it is, it might actually be better to do it on the client side to distribute the workload rather than concentrating it on the leader rank.

Slightly unrelated to this PR, but the duplication of the rpc call in the client (my fault) looks to be just to avoid creating an empty ops array? I'm not sure that works out to be helpful, since jansson probably just creates it internally from [] before serializing anyway. It's a bit ugly and might be useful to fix while you're in there.

@chu11
Copy link
Member Author

chu11 commented Mar 6, 2018

In the commit request callback, it looks like a uuid is still being created, but on the server side. Is that
necessary

Yeah. It's b/c there is shared code for commits & fences returning success/error codes to the original callers. I need a name to attach to an RPC message for later lookup. This is all in the finalize_transactions_bynames() path.

If it is, it might actually be better to do it on the client side to distribute the workload rather
than concentrating it on the leader rank.

That sounds like a good idea.

Slightly unrelated to this PR, but the duplication of the rpc call in the client (my fault) looks to be just to avoid creating an empty ops array?

Yup.

I'm not sure that works out to be helpful, since jansson probably just creates it internally from [] before serializing anyway. It's a bit ugly and might be useful to fix while you're in there.

Are you suggesting we create an empty ops array if the user passes in txn == NULL? I suppose that would clean it up a bit overall.

@garlick
Copy link
Member

garlick commented Mar 6, 2018

Are you suggesting we create an empty ops array if the user passes in txn == NULL?

I was, but now that you mention it, why not require txn != NULL? If the user wants to commit/fence an empty transaction, requiring them to create it doesn't seem unreasonable...

@chu11
Copy link
Member Author

chu11 commented Mar 6, 2018

I was, but now that you mention it, why not require txn != NULL? If the user wants to commit/fence an empty transaction, requiring them to create it doesn't seem unreasonable...

Sounds good, although I'll leave that to a different PR. I'll make an issue.

In internal function of kvstxn, refactor kvstxn_create() to take
parameters instead of a treq_t.
In kvstxn_mgr_process_transaction_request(), do not check/set
processing flag and do not check if transaction count has
reached nprocs.  Make that something the caller is required
to do.

This separates treq logic out of the kvstxn API far more.

Adjust core KVS file appropriately.  Most notably, in fences
calls to treq_count_reached() must be checked, but in commits
they do not.
Refactor kvstxn_mgr_process_transaction_request() into
kvstxn_mgr_add_transaction().  The new function takes direct
parameters needed for a ready transaction (i.e. name, ops, flags).

With this refactoring, treq_t transactions are now completely
decoupled from the kvstxn_t API.  So remove now unnecessary header
includes and object linking.
In relaycommit_request_cb(), do not create a treq data structure.
As commits are individually unique, it is not necessary.  Instead,
pass ops, name, and flags to kvstxn_mgr_add_transaction()
directly.

In commit_request_cb(), ops do not need to be stored as commits
are unique.
@chu11
Copy link
Member Author

chu11 commented Mar 6, 2018

Just re-pushed with recommended changes. To my surprise, no conflicts when deleting the two patches implementing the issues listed above.

@chu11
Copy link
Member Author

chu11 commented Mar 6, 2018

hit #1311, restart builder

@codecov-io
Copy link

Codecov Report

Merging #1349 into master will decrease coverage by 0.03%.
The diff coverage is 59.13%.

@@            Coverage Diff             @@
##           master    #1349      +/-   ##
==========================================
- Coverage   78.55%   78.51%   -0.04%     
==========================================
  Files         162      162              
  Lines       29619    29686      +67     
==========================================
+ Hits        23266    23307      +41     
- Misses       6353     6379      +26
Impacted Files Coverage Δ
src/modules/kvs/kvs.c 64.92% <53.22%> (-0.56%) ⬇️
src/common/libkvs/kvs_commit.c 73.52% <64.7%> (-12.19%) ⬇️
src/modules/kvs/kvstxn.c 78.07% <78.57%> (+0.04%) ⬆️
src/common/libflux/content.c 86.66% <0%> (-3.34%) ⬇️
src/common/libflux/request.c 87.17% <0%> (-1.29%) ⬇️
src/common/libflux/message.c 81.36% <0%> (-0.36%) ⬇️
src/broker/overlay.c 74.45% <0%> (ø) ⬆️
src/modules/connector-local/local.c 74.59% <0%> (+0.2%) ⬆️
src/broker/module.c 84.07% <0%> (+0.27%) ⬆️
src/common/libkvs/kvs_watch.c 91.41% <0%> (+0.42%) ⬆️
... and 5 more

@garlick
Copy link
Member

garlick commented Mar 6, 2018

Merging. Thanks!

@garlick garlick merged commit 4d30efa into flux-framework:master Mar 6, 2018
multi-user parallel jobs automation moved this from In progress to Done Mar 6, 2018
@grondo grondo mentioned this pull request May 10, 2018
@chu11 chu11 deleted the issue1344-part3 branch June 5, 2021 17:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

None yet

4 participants