modules/barrier: cleanup and add guest support #2215
Peeled off of job shell work (and not required by it):
This PR removes exit-on-ENOMEM behavior in the barrier module, and adds support for guests using the service (such as from the job shell). It also includes other minor cleanup and an explanatory block comment at the top of the module source.
Users can now use overlapping barrier names without contention: internally, the barrier state is hashed by "userid:barrier_name", not just "barrier_name".
There is also one broker patch for an unchecked return code that seemed important to check, however unlikely the failure.
Problem: barrier module contains gratuitous typedefs for structures. These internal typedefs add nothing so drop them.
Problem: Barrier context creation exits on ENOMEM error. Also, it is stored in the aux hash, but never accessed outside of mod_main(). Make the context creation and destruction more explicitly the job of mod_main(), and handle the ENOMEM error path.
Problem: barrier module sets a timer for each barrier so that it can print something if it still exists after one second. This adds complexity and doesn't seem all that useful. Drop debug code.
Problem: there is a single timer for the barrier module, started when a barrier is entered (if not yet armed), cleared when all counts have been sent upstream. The purpose is to batch up multiple barrier counts to send upstream at once. Sharing the timer between multiple barriers is confusing and causes unpredictable latencies. Create a per-barrier timer instead. Now when multiple barriers are active, their timings are independent.
Problem: modservice_register() is the first use of a module's "implicit" reactor, thus triggers its creation. If that fails, it will not be detected. Check for flux_get_reactor () == NULL.
Problem: if zhash_insert() fails in barrier_add_client(), the duplicated message is leaked. Free the message on the error path.
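The ownership rule behind this fix can be sketched in plain C. This is only an illustration, not the module's code: `struct client_table`, `table_insert()` and `add_client()` are hypothetical stand-ins (the real code uses zhash and a copied message), and the fixed capacity exists only to force an insertion failure.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the client hash: holds at most two entries. */
struct client_table {
    char *items[2];
    int count;
};

/* Hypothetical insert: takes ownership of item only on success. */
static int table_insert (struct client_table *t, char *item)
{
    if (t->count == 2) {
        errno = ENOMEM;
        return -1;
    }
    t->items[t->count++] = item;
    return 0;
}

/* Copy msg and store the copy. If the insert fails, the copy is still
 * owned by this function, so it must be freed before returning.
 */
static int add_client (struct client_table *t, const char *msg)
{
    size_t len = strlen (msg) + 1;
    char *cpy;

    if (!(cpy = malloc (len))) {
        errno = ENOMEM;
        return -1;
    }
    memcpy (cpy, msg, len);
    if (table_insert (t, cpy) < 0) {
        int saved_errno = errno;
        free (cpy);             /* table did not take ownership */
        errno = saved_errno;    /* free() must not clobber the error */
        return -1;
    }
    return 0;
}
```

Saving and restoring errno around free() keeps the original failure reason visible to the caller.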
Problem: barrier creation exits on out of memory error. Restructure barrier.enter handler so that an ENOMEM error is returned to the user. Also, move the hash insertion out of the barrier creation function for clarity. This is the last use of oom(), xzmalloc(), and xstrdup(), so drop those includes.
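As a rough sketch of the pattern (not the code in this PR), a constructor that reports ENOMEM to its caller instead of exiting might look like the following. The `struct barrier` fields and both function names are assumptions made for illustration. The caller would then turn the NULL return into an error response to the user.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical barrier state, for illustration only. */
struct barrier {
    char *name;
    int nprocs;
    int count;
};

/* Safe to call with NULL or a partially constructed barrier.
 * Preserves errno so it can be used on error paths.
 */
static void barrier_destroy (struct barrier *b)
{
    if (b) {
        int saved_errno = errno;
        free (b->name);
        free (b);
        errno = saved_errno;
    }
}

/* Returns a new barrier, or NULL with errno set (EINVAL or ENOMEM).
 * No exit on allocation failure: the caller decides how to respond.
 */
static struct barrier *barrier_create (const char *name, int nprocs)
{
    struct barrier *b;
    size_t len;

    if (!name || nprocs <= 0) {
        errno = EINVAL;
        return NULL;
    }
    if (!(b = calloc (1, sizeof (*b)))) {
        errno = ENOMEM;
        return NULL;
    }
    len = strlen (name) + 1;
    if (!(b->name = malloc (len))) {
        errno = ENOMEM;
        barrier_destroy (b);
        return NULL;
    }
    memcpy (b->name, name, len);
    b->nprocs = nprocs;
    return b;
}
```

Keeping hash insertion out of the constructor, as the commit describes, means the constructor has a single failure mode to report and the handler owns the cleanup decision.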
Problem: internal and external barrier.enter are combined in one message handler, but one must be available to users and the other restricted to instance owner. Make that two message handlers, and drop the "internal" flag from the request.
Problem: only the instance owner can access the barrier service. Allow FLUX_ROLE_USER to access the barrier.enter and barrier.disconnect methods. Use the "owner:name" as the key into the global barrier hash, so different users can independently use the same barrier names.
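To illustrate how an "owner:name" key keeps users' barriers separate, here is a minimal sketch. The helper name `barrier_key_create()` and the choice of a decimal userid are assumptions, not necessarily what the module does.

```c
#include <errno.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: build a "userid:name" hash key.
 * Returns a malloc'd string the caller must free, or NULL with errno set.
 */
static char *barrier_key_create (uint32_t userid, const char *name)
{
    char *key;
    int len;

    if (!name) {
        errno = EINVAL;
        return NULL;
    }
    /* First pass computes the required length, excluding the NUL. */
    len = snprintf (NULL, 0, "%" PRIu32 ":%s", userid, name);
    if (len < 0) {
        errno = EINVAL;
        return NULL;
    }
    if (!(key = malloc ((size_t)len + 1))) {
        errno = ENOMEM;
        return NULL;
    }
    snprintf (key, (size_t)len + 1, "%" PRIu32 ":%s", userid, name);
    return key;
}
```

With this scheme, userid 1000 and userid 1001 entering a barrier named "sync" hash to "1000:sync" and "1001:sync", so neither can join or disturb the other's barrier.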
Ensure that the barrier.exit event is generated when a client that has entered the barrier disconnects prematurely. Repeat the test as a guest user to ensure the disconnect credential check allows the owner-generated disconnect message (from connector-local module) to abort a barrier owned by the guest user.
@@            Coverage Diff             @@
##           master    #2215      +/-   ##
==========================================
+ Coverage   80.71%   80.74%   +0.02%
==========================================
  Files         202      202
  Lines       32259    32295      +36
==========================================
+ Hits        26039    26075      +36
  Misses       6220     6220