Skip to content

[Bug] TOCTOU race condition in MQClientInstance.brokerVersionTable access #10214

@daguimu

Description

@daguimu

Before Creating the Bug Report

  • I found a bug, not just asking a question, which should be created in GitHub Discussions.

  • I have searched the GitHub Issues and GitHub Discussions of this repository and believe that this is not a duplicate.

  • I have confirmed that this bug belongs to the current repository, not other repositories of RocketMQ.

Runtime platform environment

All platforms

RocketMQ version

develop branch (latest)

JDK Version

JDK 8+

Describe the Bug

In MQClientInstance.java, the brokerVersionTable (a ConcurrentHashMap) is accessed using a non-atomic check-then-act pattern in multiple locations:

Location 1 - sendHeartbeatToBroker() (line ~657):

if (!this.brokerVersionTable.containsKey(brokerName)) {
    this.brokerVersionTable.put(brokerName, new ConcurrentHashMap<>(4));
}
this.brokerVersionTable.get(brokerName).put(addr, version);

Location 2 - sendHeartbeatToAllBrokerV2() (line ~737):
Same pattern as Location 1.

Location 3 - findBrokerVersion() (line ~1304):

if (this.brokerVersionTable.containsKey(brokerName)) {
    if (this.brokerVersionTable.get(brokerName).containsKey(brokerAddr)) {
        return this.brokerVersionTable.get(brokerName).get(brokerAddr);
    }
}

Between containsKey() and get(), another thread could remove the entry, causing a NullPointerException when the result of get() is dereferenced.

Notably, a similar operation in the same class at line ~814 already uses the correct pattern:

ConcurrentHashMap<String, Integer> inner = MQClientInstance.this.brokerVersionTable
    .computeIfAbsent(brokerName, k -> new ConcurrentHashMap<>(4));
inner.put(brokerAddr, version);

Steps to Reproduce

The race condition occurs under concurrent heartbeat sending and broker disconnection. When a broker is removed from the version table concurrently with a heartbeat send or version lookup, a NullPointerException can occur.

What Did You Expect to See?

Thread-safe access to brokerVersionTable using atomic operations like computeIfAbsent and local variable caching.

What Did You See Instead?

Non-atomic check-then-act pattern that can result in NullPointerException under concurrent access.

Additional Context

Related to previously reported #8743 (closed by stale bot without fix).

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions