
High availability (HA) deployment with scale-out query execution #2235

Open
nblumhardt opened this issue Jul 17, 2024 · 0 comments

The next major Seq version will introduce multi-node high-availability deployment, including the ability to use all available cluster resources when executing queries. This builds on the currently-available disaster recovery features (#1102) by removing the need for manual fail-over, supporting more than two cluster nodes, and improving query throughput as more machines are added.

Development is currently in progress and the final shape of this feature is subject to change. With that in mind :-) I've included some information below on our design goals and rationale. Feedback is welcome!

HA

The goal of HA is to keep Seq available in the face of one or more node failures. In practice, we're aiming to support clusters of two to seven nodes, using a separate highly-available database such as PostgreSQL or SQL Server to mediate leader election for purposes such as ingestion processing, alerting, and running output apps.
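To illustrate how a shared database can mediate leader election, here's a minimal lease-based sketch. It uses SQLite purely as a stand-in for PostgreSQL or SQL Server, and the `leader_lease` table, lease duration, and function names are hypothetical, not Seq's actual schema or mechanism:

```python
import sqlite3

LEASE_SECONDS = 15.0  # assumed lease duration; Seq's real value is not published

def try_acquire_leadership(conn: sqlite3.Connection, node_id: str, now: float) -> bool:
    """Attempt to take (or renew) the cluster's single leader lease.

    One row in `leader_lease` records the current leader and lease expiry;
    any node may claim leadership once the previous lease has expired, so a
    failed leader is replaced without manual fail-over.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS leader_lease "
        "(id INTEGER PRIMARY KEY CHECK (id = 1), node_id TEXT, expires REAL)"
    )
    row = conn.execute("SELECT node_id, expires FROM leader_lease WHERE id = 1").fetchone()
    if row is None:
        # No lease yet: claim it.
        conn.execute("INSERT INTO leader_lease VALUES (1, ?, ?)",
                     (node_id, now + LEASE_SECONDS))
        conn.commit()
        return True
    holder, expires = row
    if holder == node_id or expires <= now:
        # We already hold the lease (renew), or it has lapsed (take over).
        conn.execute("UPDATE leader_lease SET node_id = ?, expires = ? WHERE id = 1",
                     (node_id, now + LEASE_SECONDS))
        conn.commit()
        return True
    return False
```

The elected leader would then be the only node performing ingestion processing, alerting, and app execution, while other nodes stand by to take over when its lease lapses.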

Scale-out

The goal of scale-out is to make running a Seq cluster cost-effective: nodes belonging to the cluster should be able to participate in query execution so that the available query throughput is proportional to the cluster size rather than the size of a single node.
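The scatter-gather pattern behind that goal can be sketched as follows. This is an illustrative model only (the shard layout, `count_matching`, and `cluster_count` are hypothetical, not Seq's API): each node evaluates the query over its local events in parallel, and the coordinator merges the partial results, so adding nodes adds query throughput.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def count_matching(shard: list[dict], predicate: Callable[[dict], bool]) -> int:
    """Partial result computed on one node, over its locally stored events."""
    return sum(1 for event in shard if predicate(event))

def cluster_count(shards: list[list[dict]], predicate: Callable[[dict], bool]) -> int:
    """Fan a count query out to every node, then merge the partial counts."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = pool.map(lambda shard: count_matching(shard, predicate), shards)
        return sum(partials)
```

Counting merges trivially; other aggregates (means, percentiles) need mergeable intermediate states, but the fan-out/merge shape is the same.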

In the initial implementation we don't expect to scale out ingest, alerting, or app execution, which will place some limits on possible cluster sizes/stored data volume.

DR

Clustering will provide a similar improvement in the durability of stored log data to the one DR provides today. We don't plan to default to, or encourage, synchronous replication, because we don't expect the increase in ingestion latency and reduction in capacity to be a desirable trade-off for the majority of customers.

As with DR today, this means there will be a short window between ingestion of an event batch and its durability in the face of a leader failure. We plan to make this delay readily monitorable, and to proactively alert when the replication delay exceeds reasonable tolerances.
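That monitoring check can be sketched in a few lines. The function names and the 30-second tolerance below are assumptions for illustration, not Seq's actual defaults: the worst-case exposure under asynchronous replication is the gap between the leader's latest committed batch and the oldest follower acknowledgement.

```python
def max_replication_delay(leader_committed_at: float,
                          follower_acked_at: dict[str, float]) -> float:
    """Worst-case seconds of ingested data a leader failure could lose:
    the lag of the furthest-behind follower."""
    return max(leader_committed_at - acked for acked in follower_acked_at.values())

def should_alert(delay_seconds: float, tolerance_seconds: float = 30.0) -> bool:
    """Raise a proactive alert when replication lag exceeds tolerance."""
    return delay_seconds > tolerance_seconds
```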

Continued from #861
