The tradeoff we make for having all our process management centralized is dealing with failures of that central management point.
Below is a description of several failure scenarios and how to protect against them.
Trond frequently writes its internal state to disk. When it starts up, that state is loaded so we can continue from where we left off. As long as the state file is current, tron will preserve its knowledge of future scheduled jobs and the history of past runs.
If trond dies, from whatever cause (bug, machine failure, manually killing), any running process will be marked as
We don't know whether the process succeeded or failed. If each and every run is important to you, having separate logging and monitoring of your processes is important.
Tron runs your processes through SSH, so when tron dies, so does the SSH connection. Thankfully, sshd will continue to run your process even when the connection closes, so a trond failure shouldn't kill any running processes.
If the management node dies hard, you have two potential issues:
These are hard problems. There are potential solutions, but they don't exist in tron yet.
Preserving your config is probably the most important consideration. Thankfully, tronfig is very flexible. You can always store your configuration outside of tron under a revision control system and apply it non-interactively whenever it changes.
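One way to apply a version-controlled configuration only when it changes is sketched below. This is a minimal illustration, not part of tron: the function name, the stamp file, and the idea of feeding the config on standard input are all assumptions. The exact non-interactive tronfig invocation is deliberately left as a parameter and should be checked against the tronfig documentation for your installed version.

```python
import hashlib
import subprocess
from pathlib import Path


def apply_if_changed(config_path, stamp_path, apply_command):
    """Run apply_command with the config on stdin, but only if the config changed.

    apply_command is a hypothetical placeholder: pass whatever non-interactive
    command your tron installation uses to load a configuration.
    Returns True if the command was run, False if the config was unchanged.
    """
    config_bytes = Path(config_path).read_bytes()
    digest = hashlib.sha256(config_bytes).hexdigest()

    stamp = Path(stamp_path)
    if stamp.exists() and stamp.read_text().strip() == digest:
        return False  # this exact config was already applied

    # check=True raises CalledProcessError if the command exits non-zero,
    # so the stamp is only updated after a successful apply.
    subprocess.run(apply_command, input=config_bytes, check=True)
    stamp.write_text(digest)
    return True
```

Run from a checkout of the repository (for example after each pull), this keeps the revision-controlled file as the source of truth, so a rebuilt management node can be reconfigured from it.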
The best mitigation for job recovery is to ensure your jobs are recoverable manually. If you have a job that absolutely can't be run twice, it's wise to control that state outside of tron through some other highly available mechanism.
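As a rough sketch of controlling run-once state outside of tron, a job can claim a marker before doing any work and refuse to proceed if the marker already exists. The function name and marker layout here are hypothetical, and a local directory only stands in for the highly available store you would actually want (a replicated database or lock service, for instance).

```python
import os


def claim_run(marker_dir, run_id):
    """Return True only for the first caller that claims run_id."""
    os.makedirs(marker_dir, exist_ok=True)
    marker = os.path.join(marker_dir, run_id + ".claimed")
    try:
        # O_CREAT | O_EXCL makes creation atomic: it fails if the file exists.
        fd = os.open(marker, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # someone already claimed this run
    os.close(fd)
    return True
```

A job would call `claim_run` with an identifier for the unit of work (say, the date being processed) and exit early when it returns False, so a manual re-run during recovery cannot repeat work that already started.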
Your processes should have enough logging to recover what happened on the actual working nodes. Having your logs centralized in a highly available manner is also important.
Backing up the management node is a good idea. Whether it's a nightly full-machine backup or just an rsync job for tron's working directory (run out of tron, of course), saving your state to another machine would likely be very helpful during recovery.
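The rsync approach could look something like the sketch below. The helper names are made up for illustration, and both the working directory path and the destination are assumptions to be replaced with the values for your installation.

```python
import subprocess


def build_rsync_command(source_dir, destination):
    """Build an rsync command that copies the contents of source_dir."""
    # A trailing slash tells rsync to copy the directory's contents
    # rather than the directory itself.
    src = source_dir.rstrip("/") + "/"
    # -a (archive) preserves permissions, timestamps, and symlinks.
    return ["rsync", "-a", src, destination]


def backup(source_dir, destination):
    """Run the backup; raises CalledProcessError if rsync fails."""
    subprocess.run(build_rsync_command(source_dir, destination), check=True)


if __name__ == "__main__":
    # Hypothetical paths: adjust to your working directory and backup host.
    backup("/var/lib/tron", "backup-host:/srv/backups/tron/")
```

Because the script exits non-zero when rsync fails, scheduling it as a job lets a failed backup show up as a failed run.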