Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

modules/barrier: cleanup and add guest support #2215

Merged
merged 15 commits into from Jul 11, 2019

Conversation

@garlick
Copy link
Member

commented Jun 30, 2019

Peeled off of job shell work (and not required by it):

This PR removes exit-on-ENOMEM behavior in the barrier module, and adds support for guests using the service (such as from the job shell). Other minor cleanup and an explanatory block comment at the top of the module source thrown in also.

Users can make overlapping use of names without contention (internally the barrier state is now hashed by "userid:barrier_name", not just "barrier_name")

There is also one broker patch for an unchecked return code that seemed important, however unlikely.

Copy link
Contributor

left a comment

overall LGTM, just two small nits I noticed

free (sender);
return;
error:
flux_respond_error (ctx->h, msg, errno, NULL);
}

This comment has been minimized.

Copy link
@chu11

chu11 Jul 8, 2019

Contributor

need to free (sender) here too. (Later noticed you have added it in a8103f8, but should really be in commit b2c4613)

* is started to open a short window in time, within which concurrent
* requests may be batched. After expiration, a combined request is
* sent upstream.
* - The timer is re-armed if another received is received. This process

This comment has been minimized.

Copy link
@chu11

chu11 Jul 8, 2019

Contributor

typo, "received is received"

@grondo
grondo approved these changes Jul 8, 2019
Copy link
Contributor

left a comment

This looks like a great improvement to me. I didn't see anything beyond what @chu11 already caught.

FOREACH_ZHASH (ctx->barriers, key, b) {
if (zhash_lookup (b->clients, sender)) {
if (exit_event_send (h, b->name, ECONNABORTED) < 0)
if (b->owner != userid) {

This comment has been minimized.

Copy link
@grondo

grondo Jul 8, 2019

Contributor

Just noticed this: does the disconnect handler really need to check that the userid of the barrier.disconnect msg matches the disconnecting barrier "owner" userid? Doesn't the api module send all disconnect messages, and thus disconnect msgs will have userid of the instance owner?

I'm not sure if this is a valid test but ctrl-c out of a barrier with as a non-owner (the overloaded owner here might get confusing) does seem to fail, leaving the barrier "active":

$ FLUX_HANDLE_ROLEMASK=0x02 FLUX_HANDLE_USERID=1000 t/barrier/tbarrier --nprocs=2 guest-test 
^C
2019-07-08T22:23:57.853232Z barrier.err[0]: client userid mismatch 6885 != 1000

This comment has been minimized.

Copy link
@garlick

garlick Jul 11, 2019

Author Member

Good catch! Yeah, it should always allow the owner role, but if that's not set, it still probably should check the userid since a disconnect message is not special and could be sent by anyone.

garlick added 10 commits Jun 29, 2019
Problem: barrier module contains gratuitous typedefs
for structures.

These internal typedefs add nothing so drop them.
Problem: Barrier context creation exits on ENOMEM error.
Also, it is stored in the aux hash, but never accessed
outside of mod_main().

Make the context creation and destruction more explicitly
the job of mod_main(), and handle the ENOMEM error path.
Problem: barrier module sets a timer for each barrier
so that it can print something if it still
exists after one second.  This adds complexity and
doesn't seem all that useful.

Drop debug code.
Problem: there is a single timer for the barrier module,
started when a barrier is entered (if not yet armed),
cleared when all counts have been sent upstream.  The
purpose is to batch up multiple barrier counts to send
upstream at once.  Sharing the timer between multiple
barriers is confusing and causes unpredictable latencies.

Create a per-barrier timer instead.  Now when multiple
barriers are active, their timings are independent.
Problem: modservice_register() is the first use of a
module's "implicit" reactor, thus triggers its creation.
If that fails, it will not be detected.

Check for flux_get_reactor () == NULL.
If zhash_insert() fails in barrier_add_client(), duplicate
message is leaked.

Free the message on exit.
Problem: barrier creation exits on out of memory error.

Restructure barrier.enter handler so that an ENOMEM
error is returned to the user.

Also, move the hash insertion out of the barrier creation
function for clarity.

This is the last use of oom(), xzmalloc(), and xstrdup(),
so drop those includes.
Problem: internal and external barrier.enter are
combined in one message handler, but one must be
available to users and the other restricted to
instance owner.

Make that two message handlers, and drop the
"internal" flag from the request.
Problem: only the instance owner can access the barrier
service.

Allow FLUX_ROLE_USER to access the barrier.enter and
barrier.disconnect methods.

Use the "owner:name" as the key into the global barrier
hash, so different users can independently use the same
barrier names.
@garlick garlick force-pushed the garlick:barrier_cleanup branch from 511214c to 1a64eef Jul 11, 2019
@garlick

This comment has been minimized.

Copy link
Member Author

commented Jul 11, 2019

Addressed review comments, squashed, and rebased on current master.

garlick added 3 commits Jul 11, 2019
Ensure that the barrier.exit event is generated when
a client that has entered the barrier disconnects
prematurely.

Repeat the test as a guest user to ensure the disconnect
credential check allows the owner-generated disconnect
message (from connector-local module) to abort a barrier
owned by the guest user.
@garlick

This comment has been minimized.

Copy link
Member Author

commented Jul 11, 2019

I added a couple more tests to try to cover disconnects, including the guest case that @grondo caught.

@grondo

This comment has been minimized.

Copy link
Contributor

commented Jul 11, 2019

I added a couple more tests to try to cover disconnects, including the guest case that @grondo caught.

Nice!

If you want to bump coverage a little more you could add a test for the "abort due to double entry" case. However, not necessary in my opinion.

@garlick

This comment has been minimized.

Copy link
Member Author

commented Jul 11, 2019

Good idea, i'm on it.

@codecov-io

This comment has been minimized.

Copy link

commented Jul 11, 2019

Codecov Report

Merging #2215 into master will increase coverage by 0.02%.
The diff coverage is 83.21%.

@@            Coverage Diff             @@
##           master    #2215      +/-   ##
==========================================
+ Coverage   80.71%   80.74%   +0.02%     
==========================================
  Files         202      202              
  Lines       32259    32295      +36     
==========================================
+ Hits        26039    26075      +36     
  Misses       6220     6220
Impacted Files Coverage Δ
src/common/libflux/barrier.c 81.08% <100%> (ø) ⬆️
src/broker/modservice.c 70.67% <100%> (ø) ⬆️
src/modules/barrier/barrier.c 80% <82.96%> (-0.54%) ⬇️
src/broker/module.c 75.17% <0%> (+0.23%) ⬆️
src/modules/connector-local/local.c 74.85% <0%> (+1.02%) ⬆️
@garlick

This comment has been minimized.

Copy link
Member Author

commented Jul 11, 2019

Ok, maybe this is ready?

@grondo

This comment has been minimized.

Copy link
Contributor

commented Jul 11, 2019

Yes, I think so. Nice work!

@grondo grondo merged commit 3496084 into flux-framework:master Jul 11, 2019
4 checks passed
4 checks passed
Summary 1 potential rule
Details
codecov/patch 83.21% of diff hit (target 80.71%)
Details
codecov/project 80.74% (+0.02%) compared to fe713fe
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@garlick garlick deleted the garlick:barrier_cleanup branch Jul 11, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.