Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

libschedutil: destroy pending futures on scheduler unload #2226

Merged
merged 6 commits into from Sep 14, 2019

Conversation

@SteVwonder
Copy link
Member

commented Jul 12, 2019

Refactor schedutil to track outstanding messages and automatically respond with ENOSYS when the schedutil_t is destroyed.

Having a count of the outstanding messages will also be useful for the ongoing simulator work.

Closes #2193

Still TODO:

  • Track the future associated with the respond and destroy that too
  • Comments
  • Squash the refactoring commit mess that I've made
  • Rebase onto master
    • Will fix python3.4 failure in Travis
  • Improve commit messages (will do on rebase)
  • switch to using flux module debug
@SteVwonder SteVwonder force-pushed the SteVwonder:schedutil-refactor branch from ea84fd7 to ed7bf6a Jul 12, 2019
Copy link
Contributor

left a comment

Just found a couple minor issues when testing on my system.

One other question: I probably missed it, but how is the scheduler "unpaused" after a pause request?

src/common/libschedutil/ops.c Outdated Show resolved Hide resolved
src/common/libschedutil/init.c Outdated Show resolved Hide resolved
src/common/libschedutil/init.c Outdated Show resolved Hide resolved
@grondo

This comment has been minimized.

Copy link
Contributor

commented Jul 12, 2019

With this patch the pr passes make check:

diff --git a/src/common/libschedutil/alloc.c b/src/common/libschedutil/alloc.c
index 7974a81d2..eec1200a3 100644
--- a/src/common/libschedutil/alloc.c
+++ b/src/common/libschedutil/alloc.c
@@ -125,7 +125,6 @@ static void alloc_continuation (flux_future_t *f, void *arg)
         flux_log_error (h, "alloc response");
         goto error;
     }
-    alloc_destroy (ctx);
     flux_future_destroy (f);
     return;
 error:
diff --git a/src/common/libschedutil/init.c b/src/common/libschedutil/init.c
index 1ddde84fe..8333df6bc 100644
--- a/src/common/libschedutil/init.c
+++ b/src/common/libschedutil/init.c
@@ -106,6 +106,8 @@ int schedutil_add_outstanding_msg (schedutil_t *s, const flux_msg_t *msg)
 
 int schedutil_remove_outstanding_msg (schedutil_t *s, const flux_msg_t *msg)
 {
-    zlistx_detach (s->outstanding_msgs, (void*)msg);
+    if (!zlistx_find (s->outstanding_msgs, (void *) msg))
+       return -1;
+    zlistx_detach_cur (s->outstanding_msgs);
     return 0;
 }
Copy link
Member

left a comment

Some comments inline.

I see you have a TODO for comments, but to be sure we're on the same page, expanded commit messages would be helpful.

src/common/libschedutil/init.h Show resolved Hide resolved
src/common/libschedutil/hello.c Outdated Show resolved Hide resolved
src/common/libschedutil/init.c Outdated Show resolved Hide resolved
src/common/libschedutil/ops.c Outdated Show resolved Hide resolved
src/common/libschedutil/ops.c Outdated Show resolved Hide resolved
src/common/libschedutil/init.c Outdated Show resolved Hide resolved
src/common/libschedutil/ops.c Outdated Show resolved Hide resolved
src/common/libschedutil/ops.c Outdated Show resolved Hide resolved
src/common/libschedutil/ops.c Outdated Show resolved Hide resolved
src/common/libschedutil/ops.c Outdated Show resolved Hide resolved
@SteVwonder

This comment has been minimized.

Copy link
Member Author

commented Jul 17, 2019

One other question: I probably missed it, but how is the scheduler "unpaused" after a pause request?

This interface was purely for testing, to force the outstanding messages so that the test is deterministic. So I wasn't planning on supporting an "unpause". Maybe "pause" is the wrong term. Maybe "test" or "hang" is better? If this is too heavy-handed, I'd be happy to remove it. The associated test will just have to be tweaked to accept any response as valid (ENOSYS or the actual response) since it'll be racy otherwise.

@garlick

This comment has been minimized.

Copy link
Member

commented Jul 21, 2019

I think it would be OK to squash the incremental development and ask for another review when you're feeling good about this one.

You can take or leave this suggestion as I'm not really sure has a compelling benefit, and it's a bit late(!): One of the changes here is to replace a flux_t argument in the schedutil API with a context. Could that be avoided and thus reduce the impact on sched if it remained flux_t and the context was stored in its aux hash?

@SteVwonder SteVwonder force-pushed the SteVwonder:schedutil-refactor branch 6 times, most recently from 2ee57ef to a460d97 Jul 25, 2019
@SteVwonder SteVwonder marked this pull request as ready for review Jul 31, 2019
@SteVwonder SteVwonder changed the title [WIP] schedutil: refactor to track outsanding messages schedutil: refactor to track outsanding messages Jul 31, 2019
@garlick

This comment has been minimized.

Copy link
Member

commented Jul 31, 2019

"outstanding" is misspelled in the title, but maybe the title should be slightly more descriptive like "libschedutil: destroy pending futures on scheduler unload"?

@SteVwonder SteVwonder changed the title schedutil: refactor to track outsanding messages schedutil: refactor to track outstanding messages Aug 1, 2019
@SteVwonder SteVwonder changed the title schedutil: refactor to track outstanding messages libschedutil: destroy pending futures on scheduler unload Aug 1, 2019
@SteVwonder SteVwonder force-pushed the SteVwonder:schedutil-refactor branch from a460d97 to b09266e Aug 1, 2019
Copy link
Member

left a comment

Looks like things went a little haywire in one function, unless i'm misreading it after a beer!
I'll have a closer look tomorrow.

src/common/libschedutil/init.c Outdated Show resolved Hide resolved
src/common/libschedutil/init.c Show resolved Hide resolved
src/common/libschedutil/init.c Outdated Show resolved Hide resolved
src/common/libschedutil/init.c Outdated Show resolved Hide resolved
src/common/libschedutil/init.c Outdated Show resolved Hide resolved
src/common/libschedutil/init.c Outdated Show resolved Hide resolved
src/common/libschedutil/init.c Outdated Show resolved Hide resolved
Copy link
Member

left a comment

Looks like things went a little haywire in one function, unless i'm misreading it after a beer!
I'll have a closer look tomorrow.

Edit: not haywire, I misread! Sorry!

@SteVwonder SteVwonder force-pushed the SteVwonder:schedutil-refactor branch 2 times, most recently from 8af3ffc to 42c81c6 Aug 28, 2019
@SteVwonder

This comment has been minimized.

Copy link
Member Author

commented Aug 28, 2019

Rebased and addressed @garlick most recent comments. Once Travis goes green, I think this is ready for another review.

SteVwonder added 5 commits Jul 10, 2019
replace hardcoded `#!/usr/bin/lua` with `#!/usr/bin/env lua` to avoid
using multiple Lua interpreters and encountering the following error:
`/usr/bin/lua: multiple Lua VMs detected`
add a schedutil_t struct, which will enable the tracking of outstanding
messages/futures, which, in turn, will enable automatic replies of
ENOSYS on unload and pave the way for simulators to determine when the
schedutil library is idle.

remove ops_context struct and move members into the schedutil_t struct.

refactor all of the public schedutil interfaces to take a schedutil_t
rather than a flux_t as their first argument.

refactor the schedutil library to automatically register services when
the schedutil_t ctx is created. That is, schedutil users no longer need
to call schedutil_ops_register after initializing the library.
track outstanding requests to the scheduler that are handled
asynchronously by schedutil (currently only alloc and free requests)
along with the futures that these request responses are waiting on.

respond to oustanding requests with ENOSYS and destroy the associated
futures automatically on schedutil destruction (e.g., on a sched unload)
ensure that outstandings requests receive an ENOSYS response when the
scheduler is unloaded.

add a debug bit to schedutil (0x8000).  When the bit is set, no
continuations are registered for alloc and free requests serviced by
schedutil, forcing outstanding messages to exist.
@SteVwonder SteVwonder force-pushed the SteVwonder:schedutil-refactor branch from 42c81c6 to 2d70331 Sep 14, 2019
@SteVwonder

This comment has been minimized.

Copy link
Member Author

commented Sep 14, 2019

@garlick: I rebased onto master, squashed incremental work, used a higher-order bit for the new debug flag, addressed the >80 character lines, and changed the multi-line // comments to the /* style. Hopefully Travis goes green and then this is ready for your re-review.

@codecov-io

This comment has been minimized.

Copy link

commented Sep 14, 2019

Codecov Report

Merging #2226 into master will increase coverage by <.01%.
The diff coverage is 80.46%.

@@            Coverage Diff             @@
##           master    #2226      +/-   ##
==========================================
+ Coverage   80.84%   80.85%   +<.01%     
==========================================
  Files         218      216       -2     
  Lines       34540    34369     -171     
==========================================
- Hits        27925    27790     -135     
+ Misses       6615     6579      -36
Impacted Files Coverage Δ
src/common/libschedutil/ready.c 61.53% <100%> (ø) ⬆️
src/common/libschedutil/alloc.c 73.01% <100%> (+3.37%) ⬆️
src/common/libschedutil/free.c 100% <100%> (ø) ⬆️
src/common/libschedutil/ops.c 56.89% <64.44%> (-2.04%) ⬇️
src/common/libschedutil/hello.c 53.12% <71.42%> (-2.05%) ⬇️
src/common/libschedutil/init.c 86.95% <86.95%> (ø)
src/modules/sched-simple/sched.c 73.26% <90.9%> (ø) ⬆️
src/shell/pmi.c 75.69% <0%> (-1.43%) ⬇️
src/broker/boot_pmi.c 64.61% <0%> (-1.06%) ⬇️
src/modules/connector-local/local.c 73.41% <0%> (-0.87%) ⬇️
... and 31 more
Copy link
Member

left a comment

Thanks for all the changes! Looks good! I think this should go in now.

I think sched will need a PR for the API changes. I'll open an issue over there if there isn't one already.

@garlick garlick merged commit e7c28a2 into flux-framework:master Sep 14, 2019
2 checks passed
2 checks passed
Summary 1 potential rule
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@garlick

This comment has been minimized.

Copy link
Member

commented Sep 14, 2019

I think sched will need a PR for the API changes. I'll open an issue over there if there isn't one already.

Oh duh, not yet since this library isn't currently exported.

@SteVwonder

This comment has been minimized.

Copy link
Member Author

commented Sep 15, 2019

Thanks @garlick!

@garlick garlick referenced this pull request Sep 30, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.