need heartbeat RFC #129

garlick · 2018-06-20T15:02:25Z

As noted in #128, we should add an RFC for the instance heartbeat, and consider expanding the API to include time since last heartbeat, heartbeat period, and heartbeat start time.

Since FLUIDs may make use of a synchronized clock, we should consider how one can be derived from the heartbeat, and explore its properties and constraints.

grondo · 2018-06-20T16:15:59Z

Very good idea!

garlick · 2018-06-21T14:17:50Z

Just thinking through the clock synchronization idea:

Assuming the heartbeat period is fixed and well known, the tick value (let's not call it epoch to avoid confusion with UNIX epoch) is a very low resolution, synchronized, monotonic clock.

To get a high resolution clock value that is synchronized, each rank should record a CLOCK_MONOTONIC timestamp for the last event received. The high res clock value is just

(tick * period) + (now - timestamp)

where now is the current CLOCK_MONOTONIC value.
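A minimal sketch of that formula, for illustration only. The function name `synced_clock` and its parameters are hypothetical, and Python's `time.monotonic()` stands in for a CLOCK_MONOTONIC read; all values are assumed to be in seconds.

```python
import time


def synced_clock(tick, period, timestamp, now=None):
    """Return (tick * period) + (now - timestamp), in seconds.

    tick      -- heartbeat tick value from the last event received
    period    -- fixed, well-known heartbeat period
    timestamp -- local monotonic time recorded when that event arrived
    now       -- current local monotonic time (read here if omitted)
    """
    if now is None:
        now = time.monotonic()
    return tick * period + (now - timestamp)
```

Passing `now` explicitly is only there so the arithmetic can be checked without a real clock, e.g. `synced_clock(10, 2.0, 100.0, now=100.5)` is 20.5.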

If tick, period, and timestamp were made available via RPC, a module or command running on the same rank can obtain an accurate high res clock value unaffected by the RPC round-trip time (but affected by system call latency) if it reads CLOCK_MONOTONIC locally rather than asking for it in the RPC.

To avoid awkward situations when a late-joining broker first starts up and no heartbeat event has been received yet, the current heartbeat state should be obtained as part of the "hello" bootstrap protocol.

In cases where the RPC latency is undesirable, one could establish a "heartbeat follower" in a module or command that subscribes to the heartbeat event and allows the clock or high res time info to be obtained without an RPC (at least after the first heartbeat event is received).

Finally, all this could be wrapped in an API that can be used with a local heartbeat follower or a remote one.
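A rough sketch of what a local heartbeat follower might look like. Everything here is hypothetical (the class `HeartbeatFollower`, its methods, and the injectable `monotonic` callable); it does not use any real Flux API, and the event subscription itself is left out: the caller is assumed to invoke `on_heartbeat()` whenever a heartbeat event arrives.

```python
import time


class HeartbeatFollower:
    """Tracks the last heartbeat locally so clock reads need no RPC."""

    def __init__(self, period, monotonic=time.monotonic):
        self.period = period          # fixed heartbeat period, seconds
        self.tick = None              # tick of the last event seen
        self.timestamp = None         # local monotonic time of that event
        self._monotonic = monotonic   # stand-in for CLOCK_MONOTONIC

    def on_heartbeat(self, tick):
        # Record the local monotonic time at which the event was received.
        self.tick = tick
        self.timestamp = self._monotonic()

    def clock(self):
        # No state until the first heartbeat (or a seed from bootstrap).
        if self.tick is None:
            raise RuntimeError("no heartbeat state yet")
        return self.tick * self.period + (self._monotonic() - self.timestamp)
```

The `RuntimeError` case is the "no heartbeat received yet" situation; seeding `tick` and `timestamp` from the bootstrap exchange would avoid it.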

garlick · 2018-06-21T14:33:34Z

Couple of other random thoughts:

RFC should cover security implications of spoofing a heartbeat event. (Follower should check that sender is instance owner)
Nested instances should sync heartbeats with the enclosing instance when the nested instance's heartbeat period is a multiple of the enclosing instance's period, to reduce system noise.
An instance with EPGM event distribution will have superior synchronization compared to TBON distribution.
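Two of these checks are simple enough to sketch. The function names and the idea of comparing numeric user IDs are assumptions for illustration, not an existing API; periods are taken as integer milliseconds to avoid floating-point remainders.

```python
def is_trusted_heartbeat(sender_userid, owner_userid):
    """Accept a heartbeat event only if it was sent by the instance owner."""
    return sender_userid == owner_userid


def can_sync_with_parent(period_ms, parent_period_ms):
    """True if the nested period is a whole multiple of the enclosing one."""
    return parent_period_ms > 0 and period_ms % parent_period_ms == 0
```

A follower would drop any event for which `is_trusted_heartbeat()` is false before updating its tick or timestamp.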

garlick · 2018-06-21T19:53:52Z

Of course, if the clock is "corrected" at each heartbeat, one has to keep the result of the last query in order to maintain monotonicity. For example, if a heartbeat is "late", (old_tick * period) + (now - old_timestamp) might be greater than (new_tick * period) + (now - new_timestamp).
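One way to keep the last result, sketched with a hypothetical `MonotonicClock` wrapper: it remembers the largest value handed out so far and never returns anything smaller, so a late heartbeat makes the clock hold still briefly instead of stepping backward. Clamping is just one possible policy, not necessarily what the RFC should specify.

```python
class MonotonicClock:
    """Clamps raw clock readings so returned values never decrease."""

    def __init__(self):
        self._last = float("-inf")  # nothing returned yet

    def read(self, raw):
        # raw is the freshly computed clock value; return at least the
        # previous result so callers never observe time going backward.
        self._last = max(self._last, raw)
        return self._last
```

For example, if a query just before a late heartbeat returned 22.3 and the corrected value right after it is 22.0, `read(22.0)` still returns 22.3 until the raw value catches up.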
