[Doc] Clarify recommended topology for same-city dual-site RocketMQ deployments #10427

@liumingjian

Description

Search before creation

  • I have searched the issues and found no similar issues.

Documentation Related

I would like to request clearer official documentation for RocketMQ deployments across two same-city availability zones or data centers.

This is not intended as environment-specific consulting. The goal is to make the official documentation more explicit about the supported production topology, minimum node count, failure recovery boundaries, and active-active limitations for same-city dual-site deployments.

The scenario is:

  • two availability zones or data centers in the same city;
  • both sites are expected to serve production traffic;
  • producers and consumers may connect to either site;
  • the deployment should tolerate single-node failures and, where possible, one-site failures;
  • the architecture should avoid split-brain, ambiguous failover behavior, or message availability assumptions that are not officially supported.

It would be very helpful if the documentation could clarify the recommended approach for this scenario, especially:

  1. Whether RocketMQ recommends a single logical cluster stretched across the two sites, or separate RocketMQ clusters with replication/application-level routing.
  2. The minimum production-ready node count for NameServer, Broker, Controller, and/or DLedger-based deployments in this scenario.
  3. Whether a third failure domain, arbitration node, or witness-like deployment is required for quorum and split-brain avoidance.
  4. How Broker master/slave replicas, Controller nodes, DLedger groups, and NameServer nodes should be distributed across the two sites.
  5. Which failure scenarios can recover automatically, for example single Broker failure, Controller leader failure, one-site failure, NameServer failure, cross-site network partition, or loss of an arbitration node.
  6. Whether active-active writes to the same logical topic from both sites are supported, discouraged, or intentionally out of scope.
  7. If active-active writes are not recommended, what the official alternative is, such as active-passive disaster recovery, dual clusters with application-level routing, or another documented pattern.
  8. Whether there are special limitations for ordered messages, transactional messages, delayed messages, consumer offset consistency, and message duplication during failover.
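
To illustrate why question 3 matters, here is a minimal sketch of majority-quorum arithmetic. It assumes a plain majority rule of the kind used by Raft-style components; it does not model RocketMQ's actual Controller or DLedger election logic. The function names, site names, and node layouts are hypothetical and chosen only for illustration.

```python
# Sketch only: plain majority-quorum arithmetic, not RocketMQ code.
# All names and layouts below are hypothetical.

def majority(total_nodes: int) -> int:
    """Smallest number of voters that forms a strict majority."""
    return total_nodes // 2 + 1


def survives_site_loss(layout: dict[str, int], lost_site: str) -> bool:
    """True if the voters outside lost_site still form a majority."""
    total = sum(layout.values())
    remaining = total - layout[lost_site]
    return remaining >= majority(total)


def tolerates_any_single_site_loss(layout: dict[str, int]) -> bool:
    """True if quorum survives the loss of any one failure domain."""
    return all(survives_site_loss(layout, site) for site in layout)


# Three voters split across two sites: losing the larger site loses quorum.
two_sites = {"site_a": 2, "site_b": 1}

# One voter per failure domain, with a hypothetical third "witness" domain.
three_domains = {"site_a": 1, "site_b": 1, "witness": 1}

for name, layout in [("two_sites", two_sites), ("three_domains", three_domains)]:
    print(name, tolerates_any_single_site_loss(layout))
```

Under this assumption, no split of voters across exactly two sites keeps a majority after losing either site, because one site always holds at least half of the voters. That is why the documentation should state explicitly whether a third failure domain is required.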

A reference architecture or decision matrix in the documentation would be valuable. For example, it could compare:

  • a single RocketMQ cluster deployed across two same-city sites;
  • a two-site deployment plus a third quorum or arbitration failure domain;
  • two independent RocketMQ clusters with replication or application-level routing;
  • active-passive disaster recovery;
  • patterns that are not recommended, such as unsupported active-active writes to the same logical topic.

This clarification would help production users avoid incorrect assumptions about quorum, failover, data consistency, and message availability when designing same-city dual-site RocketMQ architectures.

Are you willing to submit PR?

  • Yes, I am willing to submit a PR!
