Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

modules/kvs: Fix cut & paste error #969

Merged
merged 1 commit into from Feb 3, 2017

Conversation

Projects
None yet
5 participants
@chu11
Copy link
Contributor

chu11 commented Feb 3, 2017

Initialize check watcher with flux_check_watcher_create() instead
of flux_prepare_watcher_create().

modules/kvs: Fix cut & paste error
Initialize check watcher with flux_check_watcher_create() instead
of flux_prepare_watcher_create().
@chu11

This comment has been minimized.

Copy link
Contributor Author

chu11 commented Feb 3, 2017

Saw this while looking at #813 , seems like a cut & paste error?

@garlick

This comment has been minimized.

Copy link
Member

garlick commented Feb 3, 2017

I take it that means the KVS's check watcher was being executed at prepare time. Ugh, I guess that would mean every time the KVS wakes up to do anything, it enters the commit handler at prepare time. Then when there was commit work to do, it also enters the commit handler at check time.

Potentially a very good catch, if I understand!

@coveralls

This comment has been minimized.

Copy link

coveralls commented Feb 3, 2017

Coverage Status

Coverage increased (+0.01%) to 76.175% when pulling 5e5ee99 on chu11:kvswatchertypo into 2163e13 on flux-framework:master.

@codecov-io

This comment has been minimized.

Copy link

codecov-io commented Feb 3, 2017

Codecov Report

Merging #969 into master will increase coverage by 0.01%.

@@            Coverage Diff            @@
##           master    #969      +/-   ##
=========================================
+ Coverage   75.89%   75.9%   +0.01%     
=========================================
  Files         152     152              
  Lines       25937   25937              
=========================================
+ Hits        19684   19687       +3     
+ Misses       6253    6250       -3
Impacted Files Coverage Δ
src/modules/kvs/kvs.c 80.09% <100%> (+0.12%)
src/common/libflux/message.c 83.57% <ø> (ø)
src/common/libflux/handle.c 85.67% <ø> (+0.26%)
src/common/libutil/veb.c 98.86% <ø> (+0.56%)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2163e13...5e5ee99. Read the comment docs.

@garlick

This comment has been minimized.

Copy link
Member

garlick commented Feb 3, 2017

Tests are passing so I say this one goes in! Thanks @chu11 for finding a nasty one.

@garlick garlick merged commit c44b98b into flux-framework:master Feb 3, 2017

4 checks passed

codecov/patch 100% of diff hit (target 75.89%)
Details
codecov/project 75.9% (+0.01%) compared to 2163e13
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
coverage/coveralls Coverage increased (+0.01%) to 76.175%
Details
@chu11

This comment has been minimized.

Copy link
Contributor Author

chu11 commented Feb 3, 2017

Ugh, I guess that would mean every time the KVS wakes up to do anything, it enters the commit handler at prepare time.

Wouldn't it be that it enters the commit handler after the KVS wakes up to do anything? i.e. after the reactor has processed recently incoming KVS messages and starts the next event-loop cycle, at which the prepare callback is called?

Then when there was commit work to do, it also enters the commit handler at check time.

Not sure what you mean by this, sort of conflicts with the prior sentence?

Actually, I'm a little confused now looking at the code again. From my understanding, in the primary event loop (via flux_reactor_run/ev_run) prepare callbacks are invoked before waiting for events. Check callbacks are invoked after waiting for events, but before event callbacks are invoked (assuming all equal priority, which I think that's what the reactor code is doing).

Since the event callbacks would process an actual kvs commit message, the check callback wouldn't accomplish anything until the next time through the event loop. So perhaps we want this to be a prepare watcher so it gets executed a little bit earlier? Renaming the functions would help to clear this up (I saw the "check" variables names and assumed it was supposed to be a check watcher).

Hmm, if the primary commit handler is a "prepare" watcher, I suppose we don't need both prepare watchers.

@chu11

This comment has been minimized.

Copy link
Contributor Author

chu11 commented Feb 3, 2017

Oops, my response was sitting here when I went to lunch, then I came back and clicked comment. Didn't know you already merged :-) I should say that I think it can go either way. What I said above may only be a minor possible optimization.

@garlick

This comment has been minimized.

Copy link
Member

garlick commented Feb 3, 2017

I may have misspoken and/or been unclear above. We do need both a prepare and a check watcher.

The way it is supposed to work is

  1. prepare watcher (executing at "end" of event loop, just before blocking) checks if a commit is ready and if so starts idle watcher
  2. if idle watcher is enabled, event loop immediately unblocks and starts processing events again
  3. check watcher (executing after unblocking) processes the commit and stops idle watcher

This prepare/check/idle pattern is described in the libev manual.

I think what you are proposing is to do the commit handling in one prepare watcher, which is effectively what we had going on with "check" initialized as a prepare watcher, so it clearly works. It seems like it doesn't really matter whether the commit is handled before or after the unblocking but I think I'm forgetting something important (pertaining to keeping the event loop responsive to other events) and don't have time to delve in right now.

Maybe an experiment could shed some light?

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Feb 3, 2017

@garlick

This comment has been minimized.

Copy link
Member

garlick commented Feb 3, 2017

Related, since the code was restructured in this way to make that happen. I think the question is just whether it makes any difference to finalize the commit in a prepare versus a check watcher. It would be a little simpler to do it the way @chu11 is proposing and since (due to error) that's how it was working anyway when throughput was increased, maybe it's OK?

In fact now that it's been "fixed" I wonder if that's affected performance at all? That might be an experiment to try with the soak test or something.

@chu11

This comment has been minimized.

Copy link
Contributor Author

chu11 commented Feb 3, 2017

@garlick I think we're on the same page but perhaps looking at it differently. Here's the loop logic from libev's documentation about how ev_run() works. From my perspective I viewed the prepare function called at the beginning of the loop and not the end.

So a kvs message comes in and is noticed at [A] and the callback handler for it is queued at [B]. The check watcher is called at [C] before the kvs message has been processed. Because of that, we need to go through one more loop iteration for the check watcher to complete the processing.

As an experiment I commented out all the other watchers and set the check watcher to be a prepare one and flux did pass the unit tests. So it does seem to work in principle.

   - Increment loop depth.
   - Reset the ev_break status.
   - Before the first iteration, call any pending watchers.
   LOOP:
   - If EVFLAG_FORKCHECK was used, check for a fork.
   - If a fork was detected (by any means), queue and call all fork watchers.
   - Queue and call all prepare watchers.
   - If ev_break was called, goto FINISH.
   - If we have been forked, detach and recreate the kernel state
     as to not disturb the other process.
   - Update the kernel state with all outstanding changes.
   - Update the "event loop time" (ev_now ()).
   - Calculate for how long to sleep or block, if at all
     (active idle watchers, EVRUN_NOWAIT or not having
     any active watchers at all will result in not sleeping). [D]
   - Sleep if the I/O and timer collect interval say so.
   - Increment loop iteration counter.
   - Block the process, waiting for any events. [A]
   - Queue all outstanding I/O (fd) events. [B]
   - Update the "event loop time" (ev_now ()), and do time jump adjustments.
   - Queue all expired timers.
   - Queue all expired periodics.
   - Queue all idle watchers with priority higher than that of pending events.
   - Queue all check watchers.
   - Call all queued watchers in reverse order (i.e. check watchers first). [C]
     Signals and child watchers are implemented as I/O watchers, and will
     be handled here by queueing them when their watcher gets executed. 
   - If ev_break has been called, or EVRUN_ONCE or EVRUN_NOWAIT
     were used, or there are no active watchers, goto FINISH, otherwise
     continue with step LOOP.
   FINISH:
   - Reset the ev_break status iff it was EVBREAK_ONE.
   - Decrement the loop depth.
   - Return.
@garlick

This comment has been minimized.

Copy link
Member

garlick commented Feb 4, 2017

Uh oh, that's not what I thought was happening. It seems like if prepare comes first then there would be a tendency to get blocked in the event loop before the commit happens... Since kvs subscribes to the heartbeat and is otherwise generally busy, maybe this wouldn't be noticed? Still, that's quite bad.

Almost seems like the test that was in prepare ought to be in the code called from message handlers (e.g. start idle process) and then check should finalize the commit as was originally designed?

I need some time to stare at the code before I say that's right and I'm about to head out, but it does seem like you may have uncovered a problem.

@chu11

This comment has been minimized.

Copy link
Contributor Author

chu11 commented Feb 4, 2017

@garlick On the next loop iteration the prepare callback will set the idle watcher because it notices that a kvs commit message came in earlier (on the ready list), so we're guaranteed not to block (per point [D] which I've added above).

So I think everything works per design, it's just a matter of whether the errored way I fixed happened to be faster/better. I'll start a new issue we can track this from there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.