HighAvailability

Rhett Garber edited this page · 1 revision

High Availability Tron

The tradeoff we make for having all our process management centralized is dealing with failures of that central management point.

Below is a description of several failure scenarios and how to protect against them.

Trond failure and state

Trond frequently writes its internal state to disk. When it starts up, it loads that state so it can continue from where it left off. As long as the state file is current, tron preserves its knowledge of future scheduled jobs and the history of past runs.

Trond failure while a process is running

If trond dies for any reason (a bug, a machine failure, or being killed manually), any running process will be marked as being in the UNKNOWN state.

We don't know whether the process succeeded or failed. If each and every run is important to you, you should have separate logging and monitoring for your processes.
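One way to get an outcome record that does not depend on trond is to wrap the job's command so it logs its own exit status. The following is only a minimal sketch: `run_and_record` and the record file are hypothetical names, not part of tron, and the record path is assumed to live somewhere your monitoring can read.

```python
import json
import subprocess
import time


def run_and_record(command, record_path):
    """Run command and append one JSON line describing how it ended.

    Hypothetical helper, not a tron API. record_path is an assumed
    location that is collected independently of trond.
    """
    started = time.time()
    result = subprocess.run(command)
    record = {
        "command": command,
        "started": started,
        "finished": time.time(),
        "exit_code": result.returncode,
    }
    # Append so that records of earlier runs stay in the file.
    with open(record_path, "a") as handle:
        handle.write(json.dumps(record) + "\n")
    return result.returncode
```

If a run shows up as UNKNOWN in tron, the last line of the record file tells you whether the process actually finished and with which exit code.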

Tron runs your process through SSH, so when tron dies, so does the SSH connection. Thankfully, sshd will continue to run your process even when the connection closes, so a trond failure shouldn't kill any running processes.

Management node hard machine failure

If the management node dies hard, you have two potential issues:

  • Tron events are not ACID, so there is no guarantee that changes to internal state (job completion, job start, etc.) have been written to disk yet.
  • If you lose the disks on which a tron management node runs, configuration, state, and everything else go with them.

These are hard problems. There are potential solutions, but they don't exist in tron yet.

Preserving your config is likely the most important consideration. Thankfully, tronfig is very flexible. You can always store your configuration outside of tron in a revision control system and apply it non-interactively whenever it changes.
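As a rough sketch of applying a version-controlled config only when it changes, the helper below hashes the checked-out file and pipes it to an apply command. Everything named here is an assumption: `apply_if_changed` and the stamp file are hypothetical, and `apply_command` must be whatever tronfig invocation accepts a config on standard input in your installation (check tronfig's own help for that).

```python
import hashlib
import subprocess
from pathlib import Path


def apply_if_changed(config_path, stamp_path, apply_command):
    """Pipe the config to apply_command if it changed since the last apply.

    apply_command is an assumed argument list for a command that reads
    the config from stdin; it is not a documented tronfig invocation.
    """
    config = Path(config_path).read_bytes()
    digest = hashlib.sha256(config).hexdigest()
    stamp = Path(stamp_path)
    if stamp.exists() and stamp.read_text() == digest:
        return False  # Same content as the last successful apply.
    # check=True raises CalledProcessError on a non-zero exit, so the
    # stamp below is only written after a successful apply.
    subprocess.run(apply_command, input=config, check=True)
    stamp.write_text(digest)
    return True
```

Run from a post-checkout hook or a periodic job, this keeps the management node in step with the repository without anyone opening an editor.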

The best mitigation for job recovery is to ensure your jobs are recoverable manually. If you have a job that absolutely can't be run twice, it's wise to control that state outside of tron through some other highly available mechanism.
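To illustrate keeping run-once state outside of tron, here is a minimal sketch of a guard that a job could call before doing its work. `claim_run`, the marker directory, and the run identifier are all hypothetical. A plain directory only stands in for the highly available store; a directory on the local disk of a single machine would not be highly available by itself.

```python
import os


def claim_run(marker_dir, run_id):
    """Return True only for the first caller that claims run_id.

    Hypothetical helper. marker_dir is assumed to be on storage that
    survives the loss of the management node.
    """
    os.makedirs(marker_dir, exist_ok=True)
    marker = os.path.join(marker_dir, run_id + ".claimed")
    try:
        # O_CREAT | O_EXCL fails if the marker already exists, so only
        # one claim per run_id can succeed.
        fd = os.open(marker, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    return True
```

A job would exit early when `claim_run` returns False, so re-running it by hand during recovery cannot repeat work that was already claimed.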

Your processes should have enough logging to recover what happened on the actual working nodes. Having your logs centralized in a highly available manner is also important.

Backing up the management node is a good idea. Whether it's a nightly full-machine backup or just an rsync job for tron's working directory (run out of tron, of course), saving your state to another machine would likely be very helpful during recovery.
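The rsync job could be as small as the sketch below. The function names are hypothetical, and the working directory and destination shown in the usage note are placeholders, since the actual paths depend on how tron is installed.

```python
import subprocess


def build_backup_command(working_dir, destination):
    """Build an rsync argument list that copies working_dir to destination."""
    # A trailing slash makes rsync copy the directory's contents rather
    # than nesting the directory itself under the destination.
    source = working_dir.rstrip("/") + "/"
    # -a (archive) recurses and preserves permissions and timestamps.
    return ["rsync", "-a", source, destination]


def backup(working_dir, destination):
    """Run the backup; raises CalledProcessError if rsync fails."""
    subprocess.run(build_backup_command(working_dir, destination), check=True)
```

For example, `backup("/var/lib/tron", "backup-host:/backups/tron/")` would copy an assumed working directory to an assumed remote host over rsync's default remote shell.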
