need heartbeat RFC #129

Open
garlick opened this issue Jun 20, 2018 · 4 comments

Comments

@garlick
Member

garlick commented Jun 20, 2018

As noted in #128, we should add an RFC for the instance heartbeat, and consider expanding the API to include time since last heartbeat, heartbeat period, and heartbeat start time.

Since FLUIDs may make use of a synchronized clock, we should consider how one can be derived from the heartbeat, and explore its properties and constraints.

@grondo
Contributor

grondo commented Jun 20, 2018

Very good idea!

@garlick
Member Author

garlick commented Jun 21, 2018

Just thinking through the clock synchronization idea:

Assuming the heartbeat period is fixed and well known, the tick value (let's not call it epoch to avoid confusion with UNIX epoch) is a very low resolution, synchronized, monotonic clock.

To get a high resolution clock value that is synchronized, each rank should record a CLOCK_MONOTONIC timestamp for the last event received. The high res clock value is just

(tick * period) + (now - timestamp)

where now is the current CLOCK_MONOTONIC value.
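
A minimal sketch of that formula in Python, for illustration only. `time.monotonic()` stands in for a CLOCK_MONOTONIC read, and the function name `synced_clock` and its parameters are hypothetical, not part of any existing API.

```python
import time


def synced_clock(tick, period, timestamp, now=None):
    """Return (tick * period) + (now - timestamp), in seconds.

    tick      -- sequence number of the last heartbeat received
    period    -- fixed, well-known heartbeat period in seconds
    timestamp -- local monotonic time recorded when that heartbeat arrived
    now       -- current local monotonic time (defaults to time.monotonic())
    """
    if now is None:
        now = time.monotonic()
    return tick * period + (now - timestamp)


# Made-up numbers: tick 10 with a 2 s period, queried 0.5 s after arrival.
value = synced_clock(tick=10, period=2.0, timestamp=100.0, now=100.5)
```

Only `now - timestamp` depends on the local clock, so ranks need not agree on the absolute CLOCK_MONOTONIC value, only on the tick and the period.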

If tick, period, and timestamp were made available via RPC, a module or command running on the same rank could obtain an accurate high res clock value unaffected by the RPC round-trip time (but still affected by system call latency), provided it reads CLOCK_MONOTONIC locally rather than asking for it in the RPC.

To avoid awkward situations when a late-joining broker first starts up and no heartbeat event has been received yet, the current heartbeat state should be obtained as part of the "hello" bootstrap protocol.

In cases where the RPC latency is undesirable, one could establish a "heartbeat follower" in a module or command that subscribes to the heartbeat event and allows the clock or high res time info to be obtained without an RPC (at least after the first heartbeat event is received).

Finally, all this could be wrapped in an API that can be used with a local heartbeat follower or a remote one.
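
A rough sketch of what such a follower might look like, in Python. Everything here is hypothetical: the class name, the `on_heartbeat` callback, and the `seed` method are illustrative, and the event subscription itself is left out. `seed` assumes the bootstrap state is delivered as a tick plus the time since that heartbeat, because a CLOCK_MONOTONIC timestamp taken on another rank is not meaningful locally.

```python
import time


class HeartbeatFollower:
    """Tracks the last heartbeat seen on this rank (hypothetical sketch)."""

    def __init__(self, period, monotonic=time.monotonic):
        self.period = period          # fixed heartbeat period, seconds
        self._monotonic = monotonic   # stand-in for a CLOCK_MONOTONIC read
        self.tick = None              # no heartbeat state yet
        self.timestamp = None

    def seed(self, tick, age):
        """Adopt state learned at bootstrap: last tick and seconds since it."""
        self.tick = tick
        self.timestamp = self._monotonic() - age

    def on_heartbeat(self, tick):
        """Call from the heartbeat event handler; records local arrival time."""
        self.tick = tick
        self.timestamp = self._monotonic()

    def clock(self):
        """Return the synchronized clock value, or None if no state yet."""
        if self.tick is None:
            return None
        return self.tick * self.period + (self._monotonic() - self.timestamp)
```

The `monotonic` argument exists only so the clock source can be replaced in tests. A remote variant could expose the same `clock()` method while fetching tick, period, and timestamp over an RPC.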

@garlick
Member Author

garlick commented Jun 21, 2018

Couple of other random thoughts:

  • RFC should cover security implications of spoofing a heartbeat event. (Follower should check that sender is instance owner)

  • Nested instances should sync heartbeats, if instance heartbeat period is a multiple of the enclosing instance's period, to reduce system noise.

  • An instance with EPGM event distribution will have superior synchronization compared to TBON distribution.
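
The first two points above could be checked along these lines. This is only a sketch: both function names are hypothetical, the way a follower learns the sender's and the owner's userid is assumed rather than taken from any existing API, and periods are expressed in integer milliseconds to avoid floating-point remainder issues.

```python
def is_trusted_heartbeat(sender_userid, owner_userid):
    """Accept a heartbeat event only if it was sent by the instance owner."""
    return sender_userid == owner_userid


def can_sync_with_parent(period_ms, parent_period_ms):
    """True if this instance's period is a whole multiple of the parent's."""
    if period_ms <= 0 or parent_period_ms <= 0:
        return False
    return period_ms % parent_period_ms == 0
```

For example, a nested instance with a 4000 ms period could align with a 2000 ms enclosing instance, while one with a 3000 ms period could not.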

@garlick
Member Author

garlick commented Jun 21, 2018

Of course, if the clock is "corrected" at each heartbeat, one has to keep the result of the last query in order to maintain monotonicity. For example, if a heartbeat is "late", (old_tick * period) + (now - old_timestamp) might be greater than (new_tick * period) + (now - new_timestamp).
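
One way to keep the result monotonic is to remember the last value returned and never report anything smaller. The Python sketch below illustrates that idea. The class and method names are hypothetical, and `time.monotonic()` stands in for CLOCK_MONOTONIC.

```python
import time


class MonotonicHeartbeatClock:
    """Heartbeat-derived clock that never runs backwards (sketch)."""

    def __init__(self, period, monotonic=time.monotonic):
        self.period = period
        self._monotonic = monotonic   # injectable clock source for testing
        self.tick = None
        self.timestamp = None
        self._last = None             # result of the previous query

    def on_heartbeat(self, tick):
        """Record the new tick and its local arrival time."""
        self.tick = tick
        self.timestamp = self._monotonic()

    def clock(self):
        """Return the clock value, clamped to the last value returned."""
        if self.tick is None:
            raise RuntimeError("no heartbeat received yet")
        raw = self.tick * self.period + (self._monotonic() - self.timestamp)
        if self._last is not None and raw < self._last:
            raw = self._last          # a late heartbeat pulled the clock back
        self._last = raw
        return raw
```

With a 2 s period, suppose tick 5 arrives at local time 100.0 and a query at 102.5 returns 12.5. If tick 6 then arrives late at 102.75, a query at 103.0 would compute 12.25, so the clamp returns 12.5 again until the raw value catches up.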
