
Service Stability

As software projects mature, questions about how reliable and available a service is become more critical. There are many facets to this question, but they all come down to a set of core principles for implementing services designed with High Availability in mind (wikipedia).

Core Concepts

In order for a service to be considered highly available, the application must ensure availability and accessibility at an agreed level of operational performance. This can be achieved in many ways in a containerized environment. For some concrete examples, such as an HA Postgres DB, refer to High Availability.

The core concepts of making a service highly available are as follows:

  • Fault Tolerance - Applications should be able to reasonably catch unforeseen issues and fail gracefully so as not to produce erroneous results.
  • Fail Fast - Services must be able to detect when they have entered an invalid application state and quickly terminate themselves, allowing the container orchestrator to restore the service (see the sketch after this list).
  • Multiple Instances - A deployment config or stateful set should have at least 2 replicas so that if one instance fails, the remaining instances can continue to service requests.
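
As a loose illustration of the fail-fast concept in a Node application, the sketch below (assuming Express; the dependency check is a hypothetical placeholder) exposes a liveness endpoint that starts failing once the application detects an invalid state, so the container platform can restart the pod:

```js
const express = require('express');
const app = express();

let healthy = true;

// Hypothetical dependency check; a real service would ping Redis, Postgres, SMTP, etc.
async function checkDependencies() {
  // ...placeholder: throw if any dependency is unreachable
}

// Re-check every 10 seconds and flip the flag on failure (fail fast).
setInterval(async () => {
  try {
    await checkDependencies();
  } catch (err) {
    healthy = false;
  }
}, 10000);

// Liveness endpoint: repeated 503s cause the orchestrator to restart the pod.
app.get('/healthz', (_req, res) => res.status(healthy ? 200 : 503).end());

app.listen(3000);
```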

Common Service Stability Approaches

Our common services are designed with the above concepts at their core. Below we outline the engineering work we have done to adhere to these core concepts of high availability.

The Common Hosted Email Service logically consists of the main node application, a Postgres database, and Redis. In order to make CHES fault tolerant, we have designed our email queueing system to run at a best-attempt level. Queue status can be determined through the appropriate API call, so clients can immediately tell whether their email has been processed and/or dispatched correctly. These transactions have been designed to be as atomic and "stateless" as possible in order to ensure that unique messages do not interfere with each other.
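
As a rough sketch of what such a status lookup can look like, the snippet below assumes a Bull-style Redis-backed queue; the queue name and helper function are illustrative rather than the exact CHES internals:

```js
// Sketch of a queue status lookup, assuming a Bull-style Redis-backed queue.
const Queue = require('bull');

const emailQueue = new Queue('email', process.env.REDIS_URL);

// Given a message/job id, report whether it has been processed or dispatched.
async function getMessageStatus(jobId) {
  const job = await emailQueue.getJob(jobId);
  if (!job) return 'not found';
  return job.getState(); // e.g. 'waiting', 'active', 'completed', 'failed'
}
```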

At this time, both the main node application and Postgres database are implemented in high availability mode on Openshift, with at least 2 concurrently running pods at any given time. Postgres leverages Patroni in order to internally manage master-slave database history and timeline states. However, as of August 20, 2020, Redis has not yet been made highly available; it currently runs as a single instance on Openshift. Making Redis highly available is on our upcoming roadmap.

Just because our components are highly available does not mean that the connections between them are reliable. There are instances where a component can intermittently drop a connection or be in the middle of switching master instances. While these situations are infrequent, they can be the source of strange edge cases in application behavior. In order to mitigate these potential hiccups in service stability, the CHES node application has been designed to fail fast.

In order to achieve fail-fast behavior above and beyond the standard Kubernetes/Openshift health check and readiness check behavior, the node application also periodically polls its connections to Postgres, Redis, and the SMTP email server. Specifically, we check every ten seconds, the same rate as the container platform. To check the Redis connection, we ensure that our existing Redis network connection is still valid and is flagged as ready. To check our email SMTP connection, we ping the SMTP server to verify that it is reachable.
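
A minimal sketch of the Redis and SMTP checks, assuming the ioredis client (whose status property reports when the connection is ready) and a Nodemailer SMTP transport (whose verify() call connects to the server), might look like the following; the environment variable names are illustrative:

```js
const Redis = require('ioredis');
const nodemailer = require('nodemailer');

const redis = new Redis(process.env.REDIS_URL);
const smtp = nodemailer.createTransport({ host: process.env.SMTP_HOST, port: 25 });

function checkRedis() {
  // The connection object must exist and be flagged as ready.
  return Boolean(redis) && redis.status === 'ready';
}

async function checkSmtp() {
  try {
    await smtp.verify(); // resolves only if the SMTP server is reachable
    return true;
  } catch (err) {
    return false;
  }
}

// Poll at the same ten-second cadence as the platform health checks.
setInterval(async () => {
  const ok = checkRedis() && await checkSmtp();
  if (!ok) {
    // hand off to the fail-fast shutdown path (see below)
  }
}, 10000);
```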

To check the database connection, we ensure that the existing Postgres connection is valid, and that the instance we are connected to is the intended master instance. This is important because Patroni will automatically promote and demote discrete Postgres instances to serve as the master instance to communicate with. All other instances of Postgres are flagged as read-only and only listen to the timeline specified by the master through its internal WAL history propagation.
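
One way to express this check, assuming the pg client, is to ask Postgres itself whether the current connection is in recovery; pg_is_in_recovery() returns true on a read-only replica, so a false result indicates we are connected to the writable master:

```js
const { Pool } = require('pg');

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Returns true only when the connection is valid and points at the master.
async function checkDatabase() {
  try {
    const result = await pool.query('SELECT pg_is_in_recovery() AS readonly');
    return result.rows[0].readonly === false;
  } catch (err) {
    return false; // the connection itself is no longer valid
  }
}
```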

If the application detects that the email server is unreachable, that the Redis connection is no longer ready, or that the database connection now routes to a read-only database or is outright unavailable, the probe tells the application to gracefully shut down, because CHES can no longer guarantee error-free behavior. We then defer to the container orchestrator to reinitialize a new instance of the CHES application, allowing the service to self-heal.
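
A graceful shutdown handler along these lines, assuming an HTTP server object like the one in the earlier sketch, might look like this:

```js
// Once a probe fails, stop taking new requests, then exit so the orchestrator
// can schedule a fresh, healthy pod.
function failFastShutdown(server, reason) {
  console.error(`Shutting down: ${reason}`);
  server.close(() => process.exit(1));              // exit once in-flight requests finish
  setTimeout(() => process.exit(1), 10000).unref(); // hard deadline if close hangs
}
```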

The Common Document Generation Service logically consists of just a main node application. In order to make CDOGS fault tolerant, all document generation transaction requests are designed to be atomic in nature. At this time, CDOGS is implemented in high availability mode on Openshift, with at least 2 concurrently running pods at any given time. Both pods share a PVC-backed cache directory in order to handle file caching. Because multiple servers use the same shared directory, we have a semaphore lock system in place to prevent file race conditions.
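
As an illustration of that locking pattern, the sketch below uses the proper-lockfile package as one possible way to take an exclusive lock on a file in the shared cache directory; the cache path and helper name are illustrative, not the exact CDOGS implementation:

```js
const fs = require('fs');
const lockfile = require('proper-lockfile');

// Guard a shared cache entry with a file lock so that two pods writing to the
// same PVC path do not race each other.
async function writeCachedFile(cacheDir, contents) {
  fs.mkdirSync(cacheDir, { recursive: true });
  const target = `${cacheDir}/template.cache`;
  fs.closeSync(fs.openSync(target, 'a'));       // ensure the file exists before locking
  const release = await lockfile.lock(target);  // acquire the exclusive lock
  try {
    fs.writeFileSync(target, contents);         // critical section
  } finally {
    await release();                            // always release the lock
  }
}
```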
