Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

increase KVS commit window #780

Merged
merged 11 commits into from Aug 23, 2016

Conversation

Projects
None yet
4 participants
@garlick
Copy link
Member

garlick commented Aug 23, 2016

This PR increases the KVS commit aggregation window from 1ms to 50ms and switches the timeout logic to use flux_reactor_now() instead of clock_gettime().

There's still a problem with the lua kvs test failing on my desktop, and @grondo offered to have a look, so I'm posting this PR earlier than I normally would.

The failure is the kvs watcher test:

PASS: lua/t1002-kvs.t 103 - kvswatcher: 2nd callback has correct result
FAIL: lua/t1002-kvs.t 104 - reactor exited normally with Timed out after 1.5s!
PASS: lua/t1002-kvs.t 105 - Can remove kvswatcher without error

it's odd because the KVS watcher worked as expected, but the 1.5s timer fires also before the reactor exits, causing the error.

libflux/reactor: add flux_reactor_now()
Add interface to libev's ev_now() function.

@garlick garlick added the review label Aug 23, 2016

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Aug 23, 2016

I'm able to reproduce the lua timer failure mentioned above on hype, so I'm taking a look now.

@coveralls

This comment has been minimized.

Copy link

coveralls commented Aug 23, 2016

Coverage Status

Coverage increased (+0.04%) to 75.139% when pulling 67fbace on garlick:kvs_commit_window into 1df7925 on flux-framework:master.

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Aug 23, 2016

Ok, this is something to be aware of -- libev bases timer offsets on the current loop's ev_now timestamp, not the actual current time (the ev_time()). In the case of the lua kvs tests, the same reactor is used multiple times in the test, and at the time the timer is registered, the reactor has not been active for a bit, therefore ev_now has not been recently updated.

To test, I changed the libcompat version of the timer to create the timer with

    double after = (1E-3 * msec) + (ev_time () - flux_reactor_now (r));

and the test now passes. I'm guessing a longer kvs_commit window caused the test script as a whole to run just enough longer to cause the timer to fire before that specific test has a change to complete. (?)

Actually, I suggest that since you've added flux_reactor_now (r), you also add flux_reactor_now_update() wrapper for ev_now_update() so the ev_loop time can be updated to current time if needed.

In this case I'm not sure what the correct fix is. Probably to destroy the original reactor in the test script and create a new one just before registering the timer.

I can propose something if you like

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Aug 23, 2016

This allows t1002-kvs.lua test to pass.

In the two places timers are used we force creation of a new handle for the test, to get a most recent timestamp for ev_loop as possible.

diff --git a/t/lua/t1002-kvs.t b/t/lua/t1002-kvs.t
index 3de306e..bbca047 100755
--- a/t/lua/t1002-kvs.t
+++ b/t/lua/t1002-kvs.t
@@ -146,6 +146,13 @@ end
 is (n, #keys, "keys() iterator returned correct number of keys")

 -- KVS watcher creation
+
+--
+-- Force creation of  new handle and reactor here so that
+--  reactor time is guarateed to be updated, and our timeout
+--  used below is relative to now and not last active reactor time.
+--
+local f = check_userdata_return ("flux.new", flux.new())
 local data = { key = "testkey", value = "testvalue" }

 local count = 0
@@ -187,6 +194,13 @@ ok (kw:remove(), "Can remove kvswatcher without error")


 --
+-- Again, force creation of  new handle and reactor here so that
+--  reactor time is guarateed to be updated, and our timeout
+--  used below is relative to now and not last active reactor time.
+--
+local f = check_userdata_return ("flux.new", flux.new())
+
+--
 -- Again, but this time ensure callback is not called more than the
 --  expected number of times.
@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Aug 23, 2016

I do wonder, though, if this is confusing enough behavior for timers that we should always base flux timer watchers off current ev_time() as above?

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Aug 23, 2016

Excellent find, thank you for the help!

I do wonder, though, if this is confusing enough behavior for timers that we should always base flux timer watchers off current ev_time() as above?

Hmm, that's a tricky one. In general adopting libev's semantics has been good policy in the past, and I also worry that introducing subtle deviations may throw people off that are consulting the libev docs to see how flux reactor works in finer detail than we're presenting in man pages.

What about just adding a warning to flux_timer_watcher_create (3)?

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Aug 23, 2016

What about just adding a warning to flux_timer_watcher_create (3)?

Yeah, that is fine with me. I wasn't arguing one way or the other but just wondering aloud. I think you make a good point to follow libev semantics. Perhaps reference the libev(3) manpage, specifically the section "The special problem of time updates"

garlick added some commits Aug 23, 2016

test/reactor: add tests for early timers
Ensure that reactor timers fire at or after the scheduled timeout.

Use the flux_reactor_now() and flux_reactor_now_update() interfaces.

Clean up some test output and internal variables.
doc/flux_timer_watcher_create(3): add warning
Warn that timer "after" values are referenced to "reactor time"
and cite the libev manual and flux_reactor_now(3).

@garlick garlick force-pushed the garlick:kvs_commit_window branch from 67fbace to 483bad8 Aug 23, 2016

@codecov-io

This comment has been minimized.

Copy link

codecov-io commented Aug 23, 2016

Current coverage is 74.80% (diff: 90.47%)

Merging #780 into master will increase coverage by 0.02%

@@             master       #780   diff @@
==========================================
  Files           145        145          
  Lines         25022      25032    +10   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          18712      18726    +14   
+ Misses         6310       6306     -4   
  Partials          0          0          
Diff Coverage File Path
•••••• 66% src/common/libflux/reactor.c
•••••••••• 100% src/broker/hello.c
•••••••••• 100% src/broker/heartbeat.c
•••••••••• 100% src/modules/kvs/kvs.c

Powered by Codecov. Last update 1df7925...8514b36

garlick added some commits Aug 22, 2016

test/wreck: ls test allows out of order listing
flux-wreck ls sometimes returns results out of numerical
jobid order.  Sort the output numerically before checking
that the test job appears last in the list.

Note: this was failing intermittently when the KVS commit
interval was increased, which may have affected timing of
KVS state updates for job run in the previous test.
test/lua/kvs: ensure accuracy of timer offsets
Recreate reactor before tests that create timer watchers,
so that the generated offsets (which are relative to ev_now)
are not already old by the time the reactor starts again.

Thanks grondo -- see #780.
modules/kvs: increase commit aggregation window + cleanup
Change the commit aggregation window from 1ms to 50ms,
to make a better compromise between commit latency and
scalability.  Commits arriving within the window will
be aggregated and applied as one commit.

Change the logic so that commits that don't generate an event
still update the commit timestamp.

Fix a case where the "coalesced N commits" LOG_DEBUG message
was not generated even though N > 1.

Use flux_reactor_now() rather than monotime() to measure time
since last commit.  This is updated once per event loop iteration
so should be lighter weight compared to monotime's use of the
clock_gettime() system call.

Rename the ctx->commit_time to ctx->commit_timestamp for clarity.
@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Aug 23, 2016

Just forced an update with the following changes

  • @grondo's fix to lua test (thanks!)
  • add flux_reactor_now_update()
  • add flux_reactor_now(3) and flux_reactor_now_update(3) man page
  • update flux_timer_watcher_create(3) man page with "reactor time" gotcha
  • slightly improved reactor timer testing, and use (if not exactly coverage) of new functions

By the way I didn't explain in the description that the point of increasing the KVS commit aggregation window is to reduce the load when many commits arrive at rank 0 simultaneously. If they arrive within the window, they are aggregated into one commit, which only requires one event and root update, and reduces the amount of intermediate state written to the content store. The flip side is that some commits will gain a small amount (up to 50ms) of latency.

@coveralls

This comment has been minimized.

Copy link

coveralls commented Aug 23, 2016

Coverage Status

Coverage decreased (-0.02%) to 75.083% when pulling 483bad8 on garlick:kvs_commit_window into 1df7925 on flux-framework:master.

garlick added some commits Aug 23, 2016

broker/hello: ensure accuracy of timer values
Call flux_reactor_now_update() before creating timer watcher
that will report updates to the nodeset as the session wires up.
If we don't do this, there may be updates reported prematurely.

Also, use flux_reactor_now() to calculate elapsed time rather
than clock_gettime(2).
broker/heartbeat: ensure accuracy of timer
Call flux_reactor_now_update() prior to creating the timer
that will generate heartbeats.  If we don't do this the first
heartbeat may be generated prematurely, although doing so
shouldn't have any adverse effects.
@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Aug 23, 2016

I just added a couple of other minor fixes where timers were used incorrectly.

The "hello" update timer which prints messages during wireup had me puzzled the other day. When testing at scale I noticed that it seem to fire too early but couldn't figure out why.

@coveralls

This comment has been minimized.

Copy link

coveralls commented Aug 23, 2016

Coverage Status

Coverage decreased (-0.003%) to 75.097% when pulling 18ac9f1 on garlick:kvs_commit_window into 1df7925 on flux-framework:master.

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Aug 23, 2016

For completeness, should we also add a wrapper for ev_time ()? e.g.

double flux_reactor_time (void);

This would be required, for instance, to take the libflux-equivalent approach suggested in the libev(3) manpage to avoid updating the loop now time, for example:

 flux_watcher_t *w = flux_timer_watcher_create (r, timeout + (flux_reactor_time () - flux_reactor_now (r), 0., callback, NULL);
@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Aug 23, 2016

Sure, I'll add that.

@coveralls

This comment has been minimized.

Copy link

coveralls commented Aug 23, 2016

Coverage Status

Coverage increased (+0.03%) to 75.126% when pulling 8514b36 on garlick:kvs_commit_window into 1df7925 on flux-framework:master.

@grondo grondo merged commit 752bcbf into flux-framework:master Aug 23, 2016

4 checks passed

codecov/patch 90.47% of diff hit (target 74.78%)
Details
codecov/project 74.80% (+0.02%) compared to 1df7925
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
coverage/coveralls Coverage increased (+0.03%) to 75.126%
Details

@grondo grondo removed the review label Aug 23, 2016

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Aug 23, 2016

Thanks! Merged!

@garlick garlick referenced this pull request Oct 26, 2016

Closed

0.5.0 release notes #879

@garlick garlick deleted the garlick:kvs_commit_window branch Oct 26, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.