High availability is based on a quorum model and the low-level replication of write cache blocks across a pipeline of services. A highly available service exposes an RMI interface using Apache River and establishes watchers (that reflect) and actors (that influence) the distributed quorum state in Apache zookeeper. Sockets are used for efficient transfer of write cache blocks along the write pipeline. The services publish themselves through zookeeper. Services register with the quorum for a given logical service. A majority of services must form a consensus around the last commit point on the database. One of those services is elected as the leader and the others are elected as followers (collectively, these are referred to as the joined services – the services that are joined with the met quorum). Once a quorum meets, the leader services write requests while reads may be served by the leader or any of the followers.
Write replication occurs at the level of 1MB cache blocks. Each cache blocks typically contain many records, as well as indicating records that have been released. Writes are coalesced in the cache on the leader, leading to a very significant reduction in disk and network IO. Followers receive and relay write cache blocks and also lay them down on the local backing store. In addition, both the leaders and the followers write the cache blocks onto a HALog file. The write pipeline is flushed before each commit to ensure that all services are synchronized at each commit point. A 2-phase commit protocol is used. If a majority of the joined services votes for a commit, then the root blocks are applied. Otherwise the write set is discarded. This provides an ACID guarantee for the highly available replication cluster.
HALog files play an important role in the HA architecture. Each HALog file contains the entire write set for a commit point, together with the opening and closing of root blocks for that commit point. HALog files provide the basis for both incremental backup, online resynchronization of services after a temporary disconnect, and online disaster recovery of a service from the other services in a quorum. HALog files are retained until the later of (a) their capture by an online backup mechanism, and (b) a fully met quorum.
Online resynchronization is achieved by replaying the HALog files from the leader for the missing commit points. The service will go through a local commit point for each HALog file it replays. Once it catches up it will join the already met quorum. If any HALog files are unavailable or corrupt, then an online rebuild replicates the leader’s committed state and then enters the resynchronization protocol. These processes are automatic.
Online backup uses the same mechanisms. Incremental backups request any new HALog files, and write them into a locally accessible directory. Full backups request a copy of the leader’s backing store. The replication cluster remains online during backups. Restore is an offline process. The desired full backup and any subsequent HALog files are copied into the data directory of the service. When the service starts, it will apply all HALog files for commit points more recent than the last commit point on the Journal. Once the HALog files have been replayed, the service will seek a consensus (if no quorum is met) or attempt to resynchronize and join an already met quorum.