Every running system needs to be managed.
On the service level, this is the job of systemd, which starts application services in the right order, keeps track of logs and restarts services if they crash.
On top of this default service management, an application is needed that implements custom logic for the many intricacies of the various application components.
The BitBox Base Supervisor
`bbbsupervisor` is custom-built to monitor application logs and other system metrics. It watches for very specific application messages, knows how to interpret them, and can take the required action.
The Base Supervisor combines many small monitoring tasks. Unlike the Middleware, its job is not to relay application communication, but to keep the running system in an operational state without user interaction.
See the full documentation at https://base.shiftcrypto.ch for handled events.
The application is written in Go, compiled within Docker when using the top-level `make` command, and the resulting binary is copied into the Armbian image during build.
The Base Supervisor is started and managed using a simple systemd unit file:
```
[Unit]
Description=BitBox Base Supervisor
After=local-fs.target

[Service]
Type=simple
ExecStart=/usr/local/sbin/bbbsupervisor
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
```
Currently, two so-called watchers are implemented. A watcher observes a specific resource and triggers events; for some events, actions are defined that are then taken. These two watchers are implemented right now:

- `logWatcher`: watches the systemd log of a specific service
- `prometheusWatcher`: watches a specific measurement exposed via the Prometheus API
For each systemd service, a `logWatcher` is started in its own goroutine. It follows the systemd log of that unit via `journalctl --follow`. The stdout output is written to an `eventWriter`, which parses a line (sometimes also multiple lines, a known issue) into an event by performing string matching on it. Each `watcherEvent` gets assigned a trigger. The event is passed into an event channel called `events`. The stderr output is written to an `errWriter`, which passes all lines read into an error channel called `errs`.
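The mechanism above can be sketched in Go. This is a simplified illustration rather than the actual `bbbsupervisor` code: the trigger names, the matched log patterns and the exact shape of `watcherEvent` are assumptions made for the example.

```go
package main

import (
	"bufio"
	"fmt"
	"os/exec"
	"strings"
)

// watcherEvent is a simplified stand-in for the event passed into the
// events channel; the real struct carries more context.
type watcherEvent struct {
	unit    string // systemd unit the line came from
	trigger string // trigger derived from string matching
	line    string // raw log line
}

// parseEvent matches a single log line against known patterns and returns
// ok=false when no pattern matches. The patterns here are placeholders,
// not the real log messages bbbsupervisor watches for.
func parseEvent(unit, line string) (watcherEvent, bool) {
	switch {
	case strings.Contains(line, "finished full sync"):
		return watcherEvent{unit: unit, trigger: "triggerFullySynced", line: line}, true
	case strings.Contains(line, "connection lost"):
		return watcherEvent{unit: unit, trigger: "triggerConnectionLost", line: line}, true
	}
	return watcherEvent{}, false
}

// logWatcher follows the journal of one unit and forwards matched events.
// It is meant to run in its own goroutine, one per watched service.
func logWatcher(unit string, events chan<- watcherEvent, errs chan<- error) {
	cmd := exec.Command("journalctl", "--follow", "--unit", unit, "--output", "cat")
	stdout, err := cmd.StdoutPipe()
	if err != nil {
		errs <- err
		return
	}
	if err := cmd.Start(); err != nil {
		errs <- err
		return
	}
	scanner := bufio.NewScanner(stdout) // splits the stream at newlines
	for scanner.Scan() {
		if ev, ok := parseEvent(unit, scanner.Text()); ok {
			events <- ev
		}
	}
	if err := scanner.Err(); err != nil {
		errs <- err
	}
}

func main() {
	// Demonstrate only the parsing step, without touching journalctl.
	ev, ok := parseEvent("electrs.service", "index: finished full sync")
	fmt.Println(ok, ev.trigger)
}
```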
For each Prometheus value to watch, a `prometheusWatcher` is started in its own goroutine. The `prometheusWatcher` queries a specific expression and passes a `watcherEvent` into the `events` channel with the measure and the measured value. The watcher then sleeps and queries again after waking up. The query interval can be set for each watcher.
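A sketch of this watcher in Go, assuming the standard Prometheus HTTP API (`/api/v1/query`); the channel types and server address are illustrative, and the real watcher wraps the value in a `watcherEvent` with a trigger rather than sending a bare float:

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/url"
	"strconv"
	"time"
)

// queryResult mirrors the relevant part of Prometheus' /api/v1/query response.
type queryResult struct {
	Data struct {
		Result []struct {
			Value [2]interface{} `json:"value"` // [timestamp, "value-as-string"]
		} `json:"result"`
	} `json:"data"`
}

// parseFirstValue extracts the first sample value from a raw query response.
func parseFirstValue(raw []byte) (float64, error) {
	var r queryResult
	if err := json.Unmarshal(raw, &r); err != nil {
		return 0, err
	}
	if len(r.Data.Result) == 0 {
		return 0, fmt.Errorf("empty result")
	}
	s, ok := r.Data.Result[0].Value[1].(string)
	if !ok {
		return 0, fmt.Errorf("unexpected value type")
	}
	return strconv.ParseFloat(s, 64)
}

// prometheusWatcher periodically evaluates one expression and forwards the
// measured value; it sleeps for the configured interval between queries.
func prometheusWatcher(server, expression string, interval time.Duration, values chan<- float64, errs chan<- error) {
	for {
		resp, err := http.Get(server + "/api/v1/query?query=" + url.QueryEscape(expression))
		if err != nil {
			errs <- err
		} else {
			body, readErr := io.ReadAll(resp.Body)
			resp.Body.Close()
			if readErr != nil {
				errs <- readErr
			} else if v, parseErr := parseFirstValue(body); parseErr != nil {
				errs <- parseErr
			} else {
				values <- v
			}
		}
		time.Sleep(interval) // sleep, wake up, query again
	}
}

func main() {
	// Demonstrate parsing on a canned Prometheus response.
	raw := []byte(`{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1600000000,"1"]}]}}`)
	v, err := parseFirstValue(raw)
	fmt.Println(v, err)
}
```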
Events are read indefinitely from the channels (`errs` and `events`) in the `eventLoop()` function. First, errors from the `errs` channel are read (if any) and a panic is thrown (currently not recovered). Then `events` is read and the triggers are handled in their respective handle functions, after which the event handling loop restarts.
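The loop described above could look roughly like this in Go. The trigger names and handler actions are placeholders, and giving errors priority via a non-blocking `select` is one possible reading of "errors are read first":

```go
package main

import "fmt"

// watcherEvent is a simplified stand-in for the events sent by the watchers.
type watcherEvent struct {
	trigger string
	value   float64
}

// handleEvent dispatches one event to the matching handler and returns a
// description of the action taken. The trigger names are placeholders.
func handleEvent(ev watcherEvent) string {
	switch ev.trigger {
	case "triggerServiceRestart":
		return "restarting service"
	case "triggerValueChanged":
		return fmt.Sprintf("updating configuration with value %v", ev.value)
	default:
		return "unhandled trigger: " + ev.trigger
	}
}

// eventLoop reads from the channels indefinitely. Errors are checked first
// and currently cause a panic (not recovered, as described above); events
// are dispatched to their handlers, then the loop restarts.
func eventLoop(events <-chan watcherEvent, errs <-chan error) {
	for {
		// Give pending errors priority over pending events.
		select {
		case err := <-errs:
			panic(fmt.Sprintf("watcher error: %v", err))
		default:
		}
		select {
		case err := <-errs:
			panic(fmt.Sprintf("watcher error: %v", err))
		case ev := <-events:
			fmt.Println(handleEvent(ev))
		}
	}
}

func main() {
	// Demonstrate the dispatch step directly.
	fmt.Println(handleEvent(watcherEvent{trigger: "triggerServiceRestart"}))
}
```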
Currently handled event triggers (business logic)
The currently handled triggers, each with the action performed and the rationale, are documented in the following table:
| trigger | fired when | action performed | rationale |
| --- | --- | --- | --- |
| | Electrs log reports a finished initial sync | restart Electrs | free memory after the initial full sync |
| | Electrs log reports a lost connection | restart Electrs | lost connection to bitcoind |
| | Middleware log reports a lost connection | restart the Middleware | lost connection to bitcoind |
| `triggerPrometheusBitcoindIDB` | the Prometheus measure for the initial block download is read | on the initial trigger, or when the value has changed: run `bbbconfig.sh set bitcoin_idb <true\|false>`; when unchanged: nothing | keep the system configuration in sync with the initial block download state |
For some triggers, a (previous) state is needed. For example, `triggerPrometheusBitcoindIDB` needs the previous measurement to detect a change from IDB to no-IDB. For `logWatcher` triggers, a flood control is implemented, i.e. a trigger is only handled again after a definable `minDelay`, to prevent multiple handling actions from being executed at roughly the same time.
Adding a new trigger
To add a new trigger, this procedure can be followed:

- Add the trigger to the constants and register its name.
- When adding a new trigger for a Prometheus measurement, a new `prometheusWatcher` has to be set up.
- When adding a new trigger for an existing `logWatcher`, a new string matcher has to be added in `parseEvent()`.
- Add a switch case for the new trigger in `eventLoop()` and add a handling function.
- Handling functions shouldn't hardcode multiple (more than two) commands. Consider writing a shell script that is run by the handling function.
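The first and last steps could look like the following sketch. The constant and map names, the new trigger, and the script path are made-up examples, not the project's actual identifiers:

```go
package main

import (
	"fmt"
	"os/exec"
)

// trigger constants; adding a trigger starts with a new constant here.
type trigger int

const (
	triggerElectrsFullySynced trigger = iota + 1
	triggerBitcoindIDB
	// triggerMyNewThing would be added here.
)

// triggerNames registers a human-readable name for each trigger.
var triggerNames = map[trigger]string{
	triggerElectrsFullySynced: "triggerElectrsFullySynced",
	triggerBitcoindIDB:        "triggerBitcoindIDB",
}

// handleMyNewThing shows the recommended handler shape: instead of
// hardcoding several commands, it delegates to a single shell script
// (the path is a hypothetical example).
func handleMyNewThing() error {
	out, err := exec.Command("/opt/shift/scripts/handle-my-new-thing.sh").CombinedOutput()
	if err != nil {
		return fmt.Errorf("handler script failed: %v (%s)", err, out)
	}
	return nil
}

func main() {
	fmt.Println(triggerNames[triggerBitcoindIDB])
}
```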
Next steps for the Supervisor could be (in no particular order):

- If needed, create a watcher that checks whether a service is not running
- Properly split incoming stdout lines at newlines
- As `bbbsupervisor.go` grows, refactor it into multiple files
- Implement proper logging
- Write unit tests
- Read `minDelay` for the flood control, the query intervals etc. from a config file (maybe a JSON file, as in the other Shift projects)
- Implement proper error handling and panic recovery (`bbbsupervisor` should not crash on an error)
- Handle system signals that stop execution (e.g. SIGINT, SIGQUIT, SIGTERM)
- Extend `prometheusWatcher.query()` to query for strings, ints etc.