Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

expand broker internal documentation to cover bootstrap phase #3618

Merged
merged 3 commits into from Apr 27, 2021
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
159 changes: 2 additions & 157 deletions src/broker/doc/README.md
@@ -1,159 +1,4 @@
## Broker Design notes

### Bootstrap

Each broker must determine its place in the Flux instance: rank,
size, and URI of TBON parent. It may determine this by reading
a static set of config files, or using the Process Management Interface (PMI).

### PMI

When Flux is launched by Flux, by another resource manager, or by
`flux start [--testsize=N] ...`, PMI provides the broker rank and
size straight away, and the PMI KVS is used to share broker URIs via
global exchange.

Each broker:
1) binds to TCP connection, claiming a random port number,
2) writes the URI to the PMI KVS using its rank as the key,
3) executes PMI barrier,
4) calculates the rank of its TBON parent from rank and branching factor,
5) reads the parent URI from PMI KVS using the parent rank as the key.

#### Config File

When Flux is launched by systemd, a TOML array of host entries is consulted.
The identical configuration is assumed to be replicated across the cluster.

Each broker:
1) locates its own entry by matching its hostname,
2) determines size and rank from array size and entry index,
3) binds to the URI specified in its entry,
4) calculates the rank of its TBON parent from rank and branching factor,
5) reads the parent URI from the array entry at parent rank index

### State Machine

After bootstrap, each broker comprising a Flux instance begins executing an
identical state machine. Although there is synchronization between brokers
in some states, there is no distributed agreement on a global state for the
instance.

_Events_ drive the state machine.

Entry to a new state triggers an _action_.

Actions may differ across broker ranks. For example, entering CLEANUP state
on rank 0 launches a process and an event is generated upon process termination,
while on rank > 0, entering CLEANUP does not launch a process, and immediately
generates an event.

![broker state machine picture](states.png)

#### States

**abbrev** | **state** | **action when transitioning into state**
:-- | :-- | :--
J | JOIN | wait for parent to enter QUORUM state
1 | INIT | run rc1 script
B | QUORUM | wait for quorum of brokers to reach this point
2 | RUN | run initial program (rank 0)
C | CLEANUP | run cleanup (rank 0)
S | SHUTDOWN | wait for children to finalize and exit
3 | FINALIZE | run rc3 script
E | EXIT | exit broker

### Normal State Transitions

It may be helpful to walk through the state transitions that occur when
a Flux instance runs to completion without encountering exceptional conditions.

![broker state machine picture - normal](states_norm.png)

green = common path; blue = rank 0 deviation; red = leaf node deviation

#### startup

The broker ranks > 0 wait for the parent to enter QUORUM state (_parent-ready_)
then enters INIT state. Rank 0 immediately enters INIT (_parent-none_).
Upon enering INIT, the rc1 script is executed, then on completion, QUORUM
state is entered (_rc1-success_). Because each TBON tree level waits for the
upstream level to enter QUORUM state before entering INIT state, rc1 executes
in upstream-to-downstream order. This ensures upstream service leaders are
loaded before downstream followers.

Once a configured set of brokers have reached QUORUM state (default all),
RUN state is entered (_quorum-full_). Rank 0 then starts the initial program.

All ranks remain in RUN state until the initial program completes.

#### shutdown

When the initial program completes, rank 0 transitions to CLEANUP state
(_rc2-success_) and runs any cleanup script(s). Cleanups execute while the
other broker ranks remain in RUN state. Upon completion of cleanups, rank 0
enters SHUTDOWN state (_cleanup-success_).

The broker ranks > 0 monitor parent state transitions. The parent
transitioning to SHUTDOWN causes a transition from RUN to CLEANUP
(_shutdown-abort_). They transition through CLEANUP (_cleanup-none_)
to SHUTDOWN state.

All brokers with children remain in SHUTDOWN until their children disconnect
(_children-complete_). If they have no children (leaf node), they transition
out of SHUTDOWN immediately (_children-none_). The next state is FINALIZE,
where the rc3 script is executed. Upon completion of rc3 (_rc3-success_),
brokers transition to EXIT and disconnect from the parent.

Because each TBON tree level waits for the downstream level to disconnect
before entering FINALIZE state, rc3 executes in downstream-to-upstream order.
This ensures downstream service followers are unloaded before upstream leaders.

The rank 0 broker is the last to exit.

#### variation: no rc2 script (initial program)

A system instance does not define an initial program. In this case, the
rank 0 broker transitions through the RUN state like other ranks. Instead
of publishing the shutdown message when rc2 completes, rank 0 subscribes
to the shutdown message and remains in RUN until it transitions out with
_shutdown-abort_.

The instance runs until it is administratively shut down. The shutdown
message is published by the `flux admin shutdown` command.

#### variation: no rc1, rc3, or cleanup scripts

In test sometimes we eliminate the rc1, cleanup, and/or rc3 scripts to simplify
or speed up a test environment. In these cases, entry into INIT, CLEANUP,
and FINALIZE states generates a _rc1-none_, _cleanup-none_, or _rc3-none_ event,
which causes an immediate transition to the next state.

### Events

**event** | **description**
:-- | :--
_parent-ready_ | parent has entered BARRIER state
_parent-none_ | this broker has no parent
_parent-fail_ | parent has ended communication with this broker
_parent-timeout_ | parent has not responded within timeout period
_rc1-none_ | rc1 script is defined on this broker
_rc1-success_ | rc1 script completed successfully
_rc1-fail_ | rc1 script completed with errors
_quorum-full_ | configured quorum of brokers reached
_quorum-timeout_ | configured quorum not reached within timeout period
_rc2-none_ | no rc2 script (initial program) is defined on this broker
_rc2-success_ | rc2 script completed successfully
_rc2-fail_ | rc2 script completed with errors
_shutdown-abort_ | broker received shutdown event
_signal-abort_ | broker received terminating signal
_cleanup-none_ | no cleanup script is defined on this broker
_cleanup-success_ | cleanup script completed successfully
_cleanup-fail_ | cleanup script completed with errors
_children-complete_ | all children have disconnected from this broker
_children-none_ | this broker has no children
_children-timeout_ | children did not disconnected within timeout period
_rc3-none_ | no rc3 script is defined on this broker
_rc3-success_ | rc3 script completed successfully
_rc3-fail_ | rc3 script completed with errors

* [Broker Bootstrap Sequence](bootstrap.md)
* [Broker State Machine](state_machine.md)
111 changes: 111 additions & 0 deletions src/broker/doc/bootstrap.md
@@ -0,0 +1,111 @@
## Broker Bootstrap Sequence

Each broker executes a bootstrap sequence to
1. get the size of the Flux instance
2. get its rank within the Flux instance
3. get its level within the hierarchy of Flux instances
4. get the mapping of broker ranks to nodes
5. compute the broker ranks of its peers (TBON parent and children)
6. get URI(s) of peers
7. get public key(s) of peers to initialize secure communication

An instance bootstraps using one of two mechansims: PMI or Config File.

### PMI

When Flux is launched by Flux, by another resource manager, or as a
standalone test instance, PMI is used for bootstrap.

The broker PMI client uses a subset of PMI capabilities, abstracted for
different server implementations in `pmiutil.c`. The bootstrap sequence
itself is implemented in `boot_pmi.c`. The sequence roughly follows the
steps outlined above.

Steps 1-4 involve accessing parameters directly provided by the PMI server.

In step 5, the fixed tree topology (rank, size, k) is used to compute the
ranks of the TBON parent and TBON children, if any.

If a broker has TBON children, it binds to a 0MQ URI that the children will
connect to. If step 4 indicates that all brokers mapped to a single node,
the socket is bound to a local IPC path in the broker rundir. If brokers are
mapped to multiple nodes, the socket is bound to a TCP address on the interface
hosting the default route and a randomly assigned port. These URIs cannot
be predicted by peers, so they must be exchanged via PMI.

In addition, each broker generates a unique public, private CURVE keypair.
The public keys must be shared with peers to enable secure communication.

In step 6-7, each broker rank stores a "business card" containing its 0MQ URI
and public key to the PMI KVS under its rank. A PMI barrier is executed.
Finally, the business cards for any peers are loaded from the PMI KVS.
The bootstrap process is complete and overlay initialization may commence.

Debugging: set `FLUX_PMI_DEBUG=1` in the broker's environment for a trace of
the broker's client PMI calls on stderr.

#### Flux booting Flux as a job

When Flux launches Flux in the job environment, Flux is providing both the
PMI server via the Flux shell `pmi` plugin, and the PMI client in the broker.

The shell `pmi` plugin offers the PMI-1 wire protocol to the broker, by
passing an open file descriptor to each broker via the `PMI_FD` environment
variable. The broker client uses the PMI-1 wire protocol client code in
`src/common/libpmi/simple_client.c` to execute the bootstrap sequence.

Debugging: set the shell option `verbose=2` for a server side trace on stderr
from the shell `pmi` plugin.

#### Booting Flux as a standalone test instance

An instance of size N may be launched on a single node using
`flux-start --test-size=N`. In this case, the PMI server is embedded in
the start command, and the PMI-1 wire protocol is used as described above.

Debugging: use the `flux start --trace-pmi-server` option for a server side
trace on stderr from the start command.

#### Booting Flux as a job in a foreign resource manager

The code in `pmiutil.c` attempts to adapt the broker's PMI client to different
situations if the PMI-1 wire protocol is not available. Its logic is roughly:
1. if `PMI_FD` is set, use the PMI-1 wire protocol (for bootstrap under Flux)
2. if configured with `--enable-pmix-bootstrap`, and `PMIX_SERVER_URI`
or `PMIX_SERVER_URI2` is set, use PMIx
3. if `libpmi.so` can be dynamically loaded, use PMI-1 (a specific library
name/path may be forced by setting `PMI_LIBRARY`)
4. assume singleton (rank = 0, size = 1)

The following symbols must be defined in the PMI library for option 3 to be
successful:
* `PMI_Init()`, `PMI_Finalize()`
* `PMI_Get_size()`, `PMI_Get_rank()`
* `PMI_Barrier()`
* `PMI_KVS_Get_my_name()`,
* `PMI_KVS_Put()`, `PMI_KVS_Commit()`, `PMI_KVS_Get()`

### Config File

When Flux is launched by systemd, the brokers go through a similar bootstrap
process as under PMI, except that information is read from a set of identical
TOML configuration files replicated across the cluster. The TOML configuration
contains a host array.

In step 1-2, the broker scans the host array for an entry with a matching
hostname. The array index of the matching entry is the broker's rank,
and also contains its URI. The array size is the instance size.

Steps 3-4 are satisfied by assuming that in this mode, there is one broker
per node, and the instance is at the top (level 0) of the instance hierarchy.

In step 5, as above, the fixed tree topology (rank, size, k) is used to
compute the ranks of the TBON parent and TBON children, if any.

Instead of generating a unique CURVE keypair per broker, an instance
bootstrapped in this way shares a single CURVE keypair replicated across
the cluster.

So steps 6-7 are satisfied by simply accessing the CURVE key certificate
on disk, and looking up the peer rank indices in the hosts array.
The bootstrap process is complete and overlay initialization may commence.
125 changes: 125 additions & 0 deletions src/broker/doc/state_machine.md
@@ -0,0 +1,125 @@
## Broker State Machine

After bootstrap, each broker comprising a Flux instance begins executing an
identical state machine. Although there is synchronization between brokers
in some states, there is no distributed agreement on a global state for the
instance.

_Events_ drive the state machine.

Entry to a new state triggers an _action_.

Actions may differ across broker ranks. For example, entering CLEANUP state
on rank 0 launches a process and an event is generated upon process termination,
while on rank > 0, entering CLEANUP does not launch a process, and immediately
generates an event.

![broker state machine picture](states.png)

#### States

**abbrev** | **state** | **action when transitioning into state**
:-- | :-- | :--
J | JOIN | wait for parent to enter QUORUM state
1 | INIT | run rc1 script
B | QUORUM | wait for quorum of brokers to reach this point
2 | RUN | run initial program (rank 0)
C | CLEANUP | run cleanup (rank 0)
S | SHUTDOWN | wait for children to finalize and exit
3 | FINALIZE | run rc3 script
E | EXIT | exit broker

### Normal State Transitions

It may be helpful to walk through the state transitions that occur when
a Flux instance runs to completion without encountering exceptional conditions.

![broker state machine picture - normal](states_norm.png)

green = common path; blue = rank 0 deviation from common path; red = leaf node deviation from common path

#### startup

The broker ranks > 0 wait for the parent to enter QUORUM state (_parent-ready_)
then enters INIT state. Rank 0 immediately enters INIT (_parent-none_).
Upon enering INIT, the rc1 script is executed, then on completion, QUORUM
state is entered (_rc1-success_). Because each TBON tree level waits for the
upstream level to enter QUORUM state before entering INIT state, rc1 executes
in upstream-to-downstream order. This ensures upstream service leaders are
loaded before downstream followers.

Once a configured set of brokers have reached QUORUM state (default all),
RUN state is entered (_quorum-full_). Rank 0 then starts the initial program.

All ranks remain in RUN state until the initial program completes.

#### shutdown

When the initial program completes, rank 0 transitions to CLEANUP state
(_rc2-success_) and runs any cleanup script(s). Cleanups execute while the
other broker ranks remain in RUN state. Upon completion of cleanups, rank 0
enters SHUTDOWN state (_cleanup-success_).

The broker ranks > 0 monitor parent state transitions. The parent
transitioning to SHUTDOWN causes a transition from RUN to CLEANUP
(_shutdown-abort_). They transition through CLEANUP (_cleanup-none_)
to SHUTDOWN state.

All brokers with children remain in SHUTDOWN until their children disconnect
(_children-complete_). If they have no children (leaf node), they transition
out of SHUTDOWN immediately (_children-none_). The next state is FINALIZE,
where the rc3 script is executed. Upon completion of rc3 (_rc3-success_),
brokers transition to EXIT and disconnect from the parent.

Because each TBON tree level waits for the downstream level to disconnect
before entering FINALIZE state, rc3 executes in downstream-to-upstream order.
This ensures downstream service followers are unloaded before upstream leaders.

The rank 0 broker is the last to exit.

#### variation: no rc2 script (initial program)

A system instance does not define an initial program. In this case, the
rank 0 broker transitions through the RUN state like other ranks. Instead
of publishing the shutdown message when rc2 completes, rank 0 subscribes
to the shutdown message and remains in RUN until it transitions out with
_shutdown-abort_.

The instance runs until it is administratively shut down. The shutdown
message is published by the `flux admin shutdown` command.

#### variation: no rc1, rc3, or cleanup scripts

In test sometimes we eliminate the rc1, cleanup, and/or rc3 scripts to simplify
or speed up a test environment. In these cases, entry into INIT, CLEANUP,
and FINALIZE states generates a _rc1-none_, _cleanup-none_, or _rc3-none_ event,
which causes an immediate transition to the next state.

### Events

**event** | **description**
:-- | :--
_parent-ready_ | parent has entered BARRIER state
_parent-none_ | this broker has no parent
_parent-fail_ | parent has ended communication with this broker
_parent-timeout_ | parent has not responded within timeout period
_rc1-none_ | rc1 script is defined on this broker
_rc1-success_ | rc1 script completed successfully
_rc1-fail_ | rc1 script completed with errors
_quorum-full_ | configured quorum of brokers reached
_quorum-timeout_ | configured quorum not reached within timeout period
_rc2-none_ | no rc2 script (initial program) is defined on this broker
_rc2-success_ | rc2 script completed successfully
_rc2-fail_ | rc2 script completed with errors
_shutdown-abort_ | broker received shutdown event
_signal-abort_ | broker received terminating signal
_cleanup-none_ | no cleanup script is defined on this broker
_cleanup-success_ | cleanup script completed successfully
_cleanup-fail_ | cleanup script completed with errors
_children-complete_ | all children have disconnected from this broker
_children-none_ | this broker has no children
_children-timeout_ | children did not disconnected within timeout period
_rc3-none_ | no rc3 script is defined on this broker
_rc3-success_ | rc3 script completed successfully
_rc3-fail_ | rc3 script completed with errors