Improve efficiency of asynchronous futures #1840

Merged
merged 6 commits into from Nov 21, 2018

4 participants
@grondo
Contributor

grondo commented Nov 16, 2018

As described in #1839, this PR improves efficiency of asynchronous use of flux_future_t by eliminating the prepare watcher and only starting the check and idle watchers at the time of fulfillment instead of immediately when flux_future_then(3) is called. This reduces the number of active watchers significantly when there are many unfulfilled futures associated with the reactor loop.
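
To make the effect concrete, here is a toy model of the idea in self-contained C. It is only a sketch under stated assumptions: `struct toy_future`, `toy_then`, `toy_fulfill`, `toy_reactor_iteration` and `toy_overhead` are hypothetical names invented for illustration and are not the libflux implementation. The model counts how many watcher callbacks one reactor iteration has to run when watchers are started at `then` time versus at fulfillment time. It also assumes that a future which is already fulfilled when the continuation is registered gets its watchers started right away, which is one plausible reading of the synchronous-then-asynchronous case discussed later in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX 4096

/* Hypothetical stand-in for a future; NOT the real flux_future_t layout. */
struct toy_future {
    bool ready;            /* fulfilled with a result or fatal error */
    bool has_then;         /* a continuation has been registered */
    bool watchers_active;  /* watchers are started on the (toy) reactor */
    int callbacks_run;     /* number of times the continuation ran */
};

enum start_policy { START_AT_THEN, START_AT_FULFILL };

/* Register a continuation. Under START_AT_THEN the watchers start
 * immediately; under START_AT_FULFILL they start here only if the
 * future is already fulfilled (an assumption of this sketch). */
void toy_then (struct toy_future *f, enum start_policy p)
{
    f->has_then = true;
    if (p == START_AT_THEN || f->ready)
        f->watchers_active = true;
}

/* Fulfill the future and start watchers if a continuation is waiting. */
void toy_fulfill (struct toy_future *f)
{
    f->ready = true;
    if (f->has_then)
        f->watchers_active = true;
}

/* One reactor iteration: every active watcher costs one callback.
 * Returns the number of watcher callbacks invoked. */
size_t toy_reactor_iteration (struct toy_future *futures, size_t n)
{
    size_t invoked = 0;
    for (size_t i = 0; i < n; i++) {
        struct toy_future *f = &futures[i];
        if (!f->watchers_active)
            continue;
        invoked++;
        if (f->ready) {
            f->callbacks_run++;         /* run the continuation */
            f->ready = false;           /* result consumed */
            f->watchers_active = false; /* nothing left to watch */
        }
    }
    return invoked;
}

/* Watcher callbacks run in one iteration with n outstanding futures,
 * of which the first `fulfilled` have been fulfilled. */
size_t toy_overhead (enum start_policy p, size_t n, size_t fulfilled)
{
    struct toy_future futures[TOY_MAX] = { 0 };
    assert (n <= TOY_MAX && fulfilled <= n);
    for (size_t i = 0; i < n; i++)
        toy_then (&futures[i], p);
    for (size_t i = 0; i < fulfilled; i++)
        toy_fulfill (&futures[i]);
    return toy_reactor_iteration (futures, n);
}
```

In this model, the per-iteration cost under `START_AT_THEN` grows with the number of outstanding futures, while under `START_AT_FULFILL` it grows only with the number of futures that actually have a result to deliver.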

This PR should be carefully examined and tested to ensure I haven't missed some subtle use case that is not covered in our testsuite. During development, I did find one case that was missed by the unit tests and luckily caught by another test in make check. I'll see if I can figure out what that particular use case was, and codify it in the future_t unit tests.

@grondo grondo force-pushed the grondo:future-efficiency branch from 5e17ad8 to 1159c30 Nov 16, 2018

@codecov-io

codecov-io commented Nov 16, 2018

Codecov Report

Merging #1840 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #1840      +/-   ##
==========================================
+ Coverage   79.91%   79.91%   +<.01%     
==========================================
  Files         196      196              
  Lines       35267    35263       -4     
==========================================
- Hits        28185    28182       -3     
+ Misses       7082     7081       -1

| Impacted Files | Coverage Δ |
|----------------|------------|
| src/common/libflux/future.c | 87.29% <100%> (-0.17%) ⬇️ |
| src/common/libflux/response.c | 79.62% <0%> (-1.24%) ⬇️ |
| src/common/libflux/message.c | 81.51% <0%> (-0.13%) ⬇️ |
| src/broker/module.c | 83.83% <0%> (+0.27%) ⬆️ |
| src/common/libflux/mrpc.c | 87.89% <0%> (+1.17%) ⬆️ |

@grondo

Contributor Author

grondo commented Nov 16, 2018

Ok, I've pushed some updates to libflux/test/future.c that I think exercise the case I hit in development of this PR. The main case, iirc, was a multiple result future where the first result is obtained synchronously. In one version of this PR the subsequent async continuation was never called because watchers were not started (I don't remember the exact reason why, sorry). This case was luckily exercised by t/kvs/commit_order.c.

@garlick

Member

garlick commented Nov 17, 2018

Here's a little test that indicates this PR has a positive impact on scaling of concurrent RPCs, versus current master (results are wall clock, based on one sample, run on my single-core Ubuntu VM, no flux-security):

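$ time flux job submitbench --fanout=FANOUT --repeat=4096 basic.yaml

| fanout | master 8c23603 (sec) | future-efficiency (sec) |
|-------:|---------------------:|------------------------:|
| 256    | 14.156               | 13.101                  |
| 512    | 14.844               | 12.116                  |
| 1024   | 15.781               | 12.772                  |
| 2048   | 16.235               | 11.336                  |
| 4096   | 16.813               | 10.101                  |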

(each run was in a fresh instance, so KVS content was not cumulative)
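
For readers who want the relative change rather than raw seconds, the arithmetic can be sketched as below. The timings are copied from the table above; `struct bench_row`, `bench_rows` and `pct_reduction` are hypothetical names introduced only for this illustration. Since each figure is a single wall-clock sample, the percentages should be read as rough indications, not precise measurements.

```c
#include <assert.h>
#include <stddef.h>

/* One row of the benchmark table above (wall clock seconds). */
struct bench_row {
    int fanout;
    double master_sec;   /* master 8c23603 */
    double branch_sec;   /* future-efficiency branch */
};

const struct bench_row bench_rows[] = {
    {  256, 14.156, 13.101 },
    {  512, 14.844, 12.116 },
    { 1024, 15.781, 12.772 },
    { 2048, 16.235, 11.336 },
    { 4096, 16.813, 10.101 },
};
const size_t bench_row_count = sizeof (bench_rows) / sizeof (bench_rows[0]);

/* Percent reduction in wall clock time going from `before` to `after`. */
double pct_reduction (double before, double after)
{
    assert (before > 0.);
    return 100. * (before - after) / before;
}
```

Calling `pct_reduction (row.master_sec, row.branch_sec)` for each row gives a reduction of roughly 7% at fanout 256 and roughly 40% at fanout 4096, consistent with the benefit growing as more RPCs are outstanding at once.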

@garlick

Member

garlick commented Nov 17, 2018

My vote is to put this in. It might be good to get one more set of eyes on it though first - @chu11?

@grondo

Contributor Author

grondo commented Nov 17, 2018

Thanks for taking an extra careful look @garlick, @chu11!

@grondo grondo force-pushed the grondo:future-efficiency branch from a63fe70 to 873cc7f Nov 19, 2018

@garlick garlick requested a review from chu11 Nov 20, 2018

@grondo grondo force-pushed the grondo:future-efficiency branch from 873cc7f to d457dc9 Nov 20, 2018

@chu11

Contributor

chu11 commented Nov 20, 2018

took a look and everything LGTM

@chu11

Contributor

chu11 commented Nov 20, 2018

restarted a builder that hit

  python/t0009-security.py:  PASS: N=2   PASS=2   FAIL=0 SKIP=0 XPASS=0 XFAIL=0
No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.
Check the details on how to adjust your build configuration on: https://docs.travis-ci.com/user/common-build-problems/#Build-times-out-because-no-output-was-received

@grondo if you're happy with it i can hit the button

@grondo

Contributor Author

grondo commented Nov 20, 2018

History might look cleaner if #1850 goes in first, so we don't have a future sandwich between two kvs improvements. ;-)

@garlick

Member

garlick commented Nov 20, 2018

Mmm, sandwich. One builder hit this valgrind error. I'll go ahead and restart it.

==1624== HEAP SUMMARY:
==1624==     in use at exit: 6,346,975 bytes in 182 blocks
==1624==   total heap usage: 952,580 allocs, 952,398 frees, 223,263,756 bytes allocated
==1624== 
==1624== 1,048,593 bytes in 1 blocks are possibly lost in loss record 99 of 102
==1624==    at 0x4C2FB0F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==1624==    by 0x4E74898: cbuf_create (cbuf.c:233)
==1624==    by 0x4E639FC: flux_buffer_create (buffer.c:80)
==1624==    by 0x4E6F0AC: remote_channel_setup (remote.c:354)
==1624==    by 0x4E6F69B: remote_setup_stdio (remote.c:443)
==1624==    by 0x4E6F69B: subprocess_remote_setup (remote.c:493)
==1624==    by 0x4E72C1A: flux_rexec (subprocess.c:677)
==1624==    by 0xB7CD6CB: spawn_exec_handler (job.c:694)
==1624==    by 0xB7CD6CB: runevent_continuation (job.c:757)
==1624==    by 0x4E88E12: ev_invoke_pending (ev.c:3314)
==1624==    by 0x4E8C3D8: ev_run (ev.c:3717)
==1624==    by 0x4E589E2: flux_reactor_run (reactor.c:140)
==1624==    by 0xB7CE10F: mod_main (job.c:938)
==1624==    by 0x1144EB: module_thread (module.c:157)
==1624==    by 0x55BC6DA: start_thread (pthread_create.c:463)
==1624==    by 0x636388E: clone (clone.S:95)
@chu11

Contributor

chu11 commented Nov 20, 2018

@garlick hmmm, appears to be new. Don't know if it's a new variant of #1641

grondo added some commits Nov 15, 2018

libflux: don't run prep/check for unready futures

Problem: futures run in asynchronous mode have their prepare and
check watchers started immediately when `flux_future_then(3)`
is called. This means that the `prepare_cb` and `check_cb` are
run for every unfulfilled future on every reactor loop iteration.
In a process with many futures (e.g. thousands of outstanding
RPCs) this can result in a large slowdown.

Instead of starting the prepare and check watchers at the time
`flux_future_then` is called, start the watchers only after the
future has been fulfilled (with result or fatal error) by
calling `then_context_start` from `post_fulfill`.

Fixes #1839

libflux: abstract ready test for futures

Problem: several places in libflux/future.c test if a future
is ready or not ready by checking both f->result_valid *and*
f->fatal_errnum_valid. This requirement could too easily lead to a
future maintainer (hah) forgetting one of these checks, so abstract
this simple test into a convenience function and use it throughout
the code.

This change also cleans up `flux_future_is_ready()` to use the new
function. Though the function handily used `flux_future_wait_for (f, 0.)`
to test for readiness, in the end that amounted to the same check
implemented in the new `future_is_ready`, and use of that function
is clearer.

libflux: eliminate prepare watcher for futures

The flux_future_t prepare watcher callback is currently used only
to start the idle watcher. Eliminate the middle man and start
the idle watcher directly in `then_context_start`.

test: libflux: test fatal errors on futures in async mode

Add unit tests to ensure fatal errors in flux_future_t are
handled in asynchronous mode (then context) both before and after
a synchronous get of the error.

test: libflux: free reactor in future unit tests

Clean up leaked flux_reactor_t in libflux/test/future.c: test_simple().

test: libflux: cover queued result futures in async mode

Ensure a case where a multiple-result future is used first
synchronously then asynchronously is covered in the unit tests.
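
As an illustration of the "abstract ready test for futures" commit above, the helper might look roughly like the following. This is a minimal sketch, not the code from libflux/future.c: `struct ready_flags` and `toy_future_is_ready` are hypothetical stand-ins, and only the two flags named in the commit message are modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced stand-in for the future struct. The real
 * struct has more members; only the two validity flags mentioned
 * in the commit message appear here. */
struct ready_flags {
    bool result_valid;        /* a result is available */
    bool fatal_errnum_valid;  /* a fatal error has been set */
};

/* A future counts as ready if it holds either a result or a fatal
 * error. Centralizing the test means callers cannot forget one of
 * the two checks. */
bool toy_future_is_ready (const struct ready_flags *f)
{
    assert (f != NULL);
    return f->result_valid || f->fatal_errnum_valid;
}
```

The design point is simply that call sites ask one question ("is it ready?") instead of repeating a two-flag condition that is easy to get half right.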

@grondo grondo force-pushed the grondo:future-efficiency branch from d457dc9 to 21cf90a Nov 20, 2018

@chu11

chu11 approved these changes Nov 20, 2018

@grondo

Contributor Author

grondo commented Nov 21, 2018

Hit another "no output received" timeout after python/t0009-security.py and restarted

@chu11

Contributor

chu11 commented Nov 21, 2018

man, another hang, restarted

@chu11

Contributor

chu11 commented Nov 21, 2018

finally it all passed!

@chu11 chu11 merged commit bbe885e into flux-framework:master Nov 21, 2018

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed

@grondo grondo deleted the grondo:future-efficiency branch Feb 8, 2019
