libflux: reclaim leaked matchtags from late arriving responses #2153
If an RPC future is destroyed before it has received all of its responses, its matchtag is leaked. This PR reclaims the matchtag from the late-arriving response after the future is destroyed.
This code cannot reclaim matchtags from abandoned mrpcs (but they are few), nor will it reclaim matchtags from late-arriving responses after the dispatcher is destroyed, which happens when the last message handler is unregistered. (I'm trying to figure out how to close that last gap.)
To make this work, I had to add a new message flag to distinguish streaming responses from non-streaming ones, since the future's message handler, which previously held that state, was destroyed earlier. I think this might also be useful later on if we need to hold state in the broker for communicating endpoints for recovery purposes.
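A minimal sketch of the reclaim decision, assuming the accessor names from the commits below; `flux_matchtag_group()`'s prototype in particular is an assumption:

```c
#include <flux/core.h>

/* Hypothetical sketch: decide whether an orphaned response allows its
 * matchtag to be retired.  flux_msg_is_streaming() and
 * flux_matchtag_group() are named in the commit messages below, but
 * their exact prototypes are assumed here.
 */
static void reclaim_matchtag (flux_t *h, const flux_msg_t *msg)
{
    uint32_t matchtag;
    int errnum = 0;

    if (flux_msg_get_matchtag (msg, &matchtag) < 0
            || matchtag == FLUX_MATCHTAG_NONE)
        return;
    if (flux_matchtag_group (h, matchtag))
        return;              /* mrpc: cannot safely retire (yet) */
    if (flux_msg_is_streaming (msg)) {
        if (flux_msg_get_errnum (msg, &errnum) < 0 || errnum == 0)
            return;          /* streaming RPC: only an error terminates */
    }
    flux_matchtag_free (h, matchtag);  /* regular RPC: any response terminates */
}
```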
While I was mucking around with message flags, I also cleaned up the nodeid accessors in messages, which included limited flags access, which seemed a bit silly.
In a future PR, if I can get some issues sorted out with the design, the destructor for an RPC will issue a generic cancellation request if it is destroyed with outstanding requests, and this code would then mop up the leaked matchtag once the final response is received. At least, that is my plan.
This still needs some tests, but feedback welcome!
Add unit test coverage for the dispatch code that recovers lost matchtags.
This was not strictly true. If another message handler is registered after the dispatcher is destroyed, the dispatcher is recreated and lost matchtags from before are handled properly.
(The dispatcher is destroyed on last use so that it can unregister its "handle watcher" and allow the reactor to terminate naturally when the last watcher is stopped/destroyed.)
With the changes proposed thus far, there is an increased likelihood that someone might encounter an improperly recycled matchtag, due to the increased aggressiveness in reclaiming them.
To counter that, I've added checks to the streaming RPC services in kvs-watch and job-info that fail a request with EPROTO if it doesn't include the FLUX_MSGFLAG_STREAMING flag in the request. An immediate error return ensures that if the response is "orphaned", the matchtag reclaim logic will do the right thing even without the STREAMING flag set in the response.
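A minimal sketch of that check in a multi-response service handler; the handler name and topic are illustrative, and the NULL final argument relies on `flux_respond_error()`'s `(h, request, errnum, fmt, ...)` shape:

```c
#include <errno.h>
#include <flux/core.h>

/* Illustrative handler for a streaming service method.  The early
 * EPROTO check rejects requests that lack FLUX_MSGFLAG_STREAMING
 * before any non-error response can be sent, so an orphaned error
 * response terminates the RPC either way.
 */
static void watch_cb (flux_t *h, flux_msg_handler_t *mh,
                      const flux_msg_t *msg, void *arg)
{
    if (!flux_msg_is_streaming (msg)) {
        errno = EPROTO;
        goto error;
    }
    /* ... send zero or more non-error responses, then ENODATA ... */
    return;
error:
    if (flux_respond_error (h, msg, errno, NULL) < 0)
        flux_log_error (h, "%s: flux_respond_error", __FUNCTION__);
}
```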
To make checking the flag a bit easier I added the flux_msg_is_streaming() accessor.
In our face-to-face discussion earlier today, it was suggested that we might want to add a flag that would allow message handlers to be registered with built-in error handling for a missing FLUX_MSGFLAG_STREAMING. It's a good idea, but I'm not sure it makes the problem that much easier: there were only two live cases of services needing this, and one of them required testing a conditional to determine whether it was streaming or not, so I elected to go with the simpler approach for now. (This does not apply to the "streaming" responses in the job manager interfaces, as those do not use matchtags.)
I added a bit of manpage commentary also.
Hmm, got a valgrind timeout in Travis:
I'm guessing that's got nothing to do with this PR...
Problem: flux_msg_get_nodeid() and flux_msg_set_nodeid() accept both nodeid and flags, but there are now separate accessors for flags. The flags were added to the nodeid accessors to support FLUX_MSGFLAG_UPSTREAM before the general flags accessors existed. Drop the flags argument from the nodeid accessors. Update users and tests to use the flags accessors where flags were formerly accessed via the nodeid accessors.
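For orientation, the accessor cleanup as implied by the commit (exact prototypes are assumptions):

```c
/* Before (assumed): flags piggybacked on the nodeid accessors to
 * support FLUX_MSGFLAG_UPSTREAM */
int flux_msg_set_nodeid (flux_msg_t *msg, uint32_t nodeid, int flags);
int flux_msg_get_nodeid (const flux_msg_t *msg, uint32_t *nodeid, int *flags);

/* After: nodeid accessors carry only the nodeid ... */
int flux_msg_set_nodeid (flux_msg_t *msg, uint32_t nodeid);
int flux_msg_get_nodeid (const flux_msg_t *msg, uint32_t *nodeid);

/* ... and flags go through the general flags accessors */
int flux_msg_set_flags (flux_msg_t *msg, uint8_t flags);
int flux_msg_get_flags (const flux_msg_t *msg, uint8_t *flags);
```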
Problem: when looking at an orphaned response message, one cannot decide whether it is safe to retire the matchtag without knowing whether it came from a "regular" RPC (any response terminates) or a "streaming" RPC (only an error response terminates). Set FLUX_MSGFLAG_STREAMING in the request/response message if it is from a streaming RPC. Add convenient accessors for FLUX_MSGFLAG_STREAMING: flux_msg_is_streaming() and flux_msg_set_streaming(). Add checks for bad parameters in flux_msg_get_flags() and flux_msg_set_flags(). Add unit test coverage.
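The accessor shapes, as I read them from the commit (the bool return for the predicate is an assumption):

```c
/* Accessors named in the commit; exact prototypes are assumptions */
int flux_msg_set_streaming (flux_msg_t *msg);
bool flux_msg_is_streaming (const flux_msg_t *msg);
```

On the caller side, presumably passing FLUX_RPC_STREAMING to flux_rpc() is what causes the rpc code to set the flag in the request, so existing streaming callers need no change.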
Problem: we need a way to test whether a matchtag belongs to an mrpc or a regular RPC to decide whether an orphaned one can be retired, but there is no way to ask what type it is because the details are hidden in the tagpool "class". Add tagpool_group(), wrapped with flux_matchtag_group(). Add unit tests.
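A guess at the shape of the new query, for orientation (both prototypes are assumptions; only the names come from the commit):

```c
/* Returns true if matchtag was allocated from the group (mrpc) pool.
 * tagpool_group() queries the pool directly; flux_matchtag_group()
 * wraps it for use through a handle. */
bool tagpool_group (struct tagpool *t, uint32_t matchtag);
bool flux_matchtag_group (flux_t *h, uint32_t matchtag);
```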
Problem: if an RPC future is destroyed before its last response is received, the RPC code leaks the matchtag, and if a response is later received, no attempt is made to reclaim it. Examine unclaimed response messages before dropping them to see if they contain a live matchtag. If there is enough info in the response to determine that it is the last one, free the matchtag.
Problem: there are two (unlikely) cases not handled in the current code where a matchtag is leaked unnecessarily: 1) the RPC is destroyed before the request is sent, due to a failure between matchtag allocation and flux_send(); 2) the future's fulfillment backlog contains a terminating response that has not yet been consumed. Handle these cases.
Add unit test coverage for the dispatcher reclaiming lost matchtags from orphaned responses from both normal and streaming RPCs. In addition:
- ensure that an orphaned response to a streaming RPC does not trigger reclaim unless it is an error response
- ensure that an orphaned response to an mrpc does not trigger reclaim, since that is currently impossible to do safely
Problem: if a request is sent to a "streaming" RPC service without the FLUX_MSGFLAG_STREAMING flag, an orphaned, non-error response might prematurely trigger matchtag reclamation. Services that send multiple responses should check for the flag early (before sending non-error responses), and respond with an EPROTO error response if the check fails. An error response terminates both streaming and non-streaming RPCs, so the fact that the STREAMING flag is not set in the EPROTO response causes no confusion. Add this check to the following services:
- rpc unit test (rpctest.multi)
- job-info.eventlog-watch
- kvs-watch.lookup (with FLUX_KVS_WATCH only)
```
@@            Coverage Diff             @@
##           master    #2153      +/-   ##
==========================================
+ Coverage   80.43%   80.47%   +0.03%
==========================================
  Files         200      200
  Lines       31763    31817      +54
==========================================
+ Hits        25549    25605      +56
+ Misses       6214     6212       -2
```