[Bug] ConcurrentModificationException on TransactionState.publishVersionTasks under parallel publish version #63169

@Larborator

Search before asking

  • I have searched the issues and found no similar issues.

Version

4.0/4.1 (any version with enable_parallel_publish_version enabled; default is true).

What's Wrong?

TransactionState.publishVersionTasks is declared as a plain HashMap:

private Map<Long, List<PublishVersionTask>> publishVersionTasks;

this.publishVersionTasks = Maps.newHashMap();

Before 4.0, publish ran entirely on the single-threaded MasterDaemon, so a non-thread-safe map was fine. Starting in 4.0, PublishVersionDaemon runs publish in parallel through a set of per-db pools (dbExecutors, sized by Config.publish_thread_pool_num, default 128; each pool has corePoolSize = 1). When parallel publish is enabled, tryFinishOneTxn hands tryFinishTxnSync off to a worker while the master loop keeps going. The same map is then touched concurrently by:

  • Master daemon thread — iterates via forEach / reads keySet() in PublishVersionDaemon.tryFinishOneTxn.
  • PUBLISH_VERSION_EXEC worker for that txn's db — routed by dbId % publish_thread_pool_num to a single-thread pool, iterates values().forEach in PublishVersionDaemon.tryFinishTxnSync, and calls clear() in TransactionState.pruneAfterVisible after the txn becomes VISIBLE.
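
To make the threading model above concrete, here is a minimal stand-alone sketch of routing work by dbId modulo the pool count onto single-thread pools. It is not the Doris implementation: the class PerDbRoutingSketch, its methods, and the thread names are hypothetical, and it assumes non-negative db ids. It only shows why all worker-side access for one db lands on one thread, which is still a different thread from the master daemon.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical stand-alone sketch; names here are illustrative, not Doris code.
class PerDbRoutingSketch {
    private final ExecutorService[] pools;

    PerDbRoutingSketch(int poolNum) {
        pools = new ExecutorService[poolNum];
        for (int i = 0; i < poolNum; i++) {
            final String name = "publish-worker-" + i;
            // One thread per pool, so all work routed to a slot is serialized.
            pools[i] = Executors.newSingleThreadExecutor(r -> new Thread(r, name));
        }
    }

    // Assumes non-negative db ids, mirroring dbId % publish_thread_pool_num.
    int slotFor(long dbId) {
        return (int) (dbId % pools.length);
    }

    // Runs a trivial task on the db's pool and reports which thread ran it.
    String workerNameFor(long dbId) {
        Callable<String> task = () -> Thread.currentThread().getName();
        try {
            return pools[slotFor(dbId)].submit(task).get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    void shutdown() {
        for (ExecutorService pool : pools) {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        PerDbRoutingSketch router = new PerDbRoutingSketch(4);
        try {
            // Both ids map to the same slot, so the same worker thread runs them.
            System.out.println(router.workerNameFor(5L));
            System.out.println(router.workerNameFor(9L));
        } finally {
            router.shutdown();
        }
    }
}
```

Serializing per db removes worker-vs-worker races on a txn's map, but the submitting thread still runs in parallel with the worker, which is the pair involved in this issue.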

(addPublishVersionTask is also called by the master daemon, but it runs once per txn during the initial dispatch in traverseReadyTxnAndDispatchPublishVersionTask, guarded by hasSendTask, strictly before any worker iteration — so it does not participate in this race.)

The race: the master daemon iterates one txn's map while the worker (from the previous round) runs pruneAfterVisible() -> clear() on the same map. The HashMap fail-fast iterator detects the modCount change and throws ConcurrentModificationException.
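
The fail-fast behaviour can be reproduced outside Doris with a small stand-alone program. The sketch below is not Doris code: the class name HashMapRaceDemo and the String task values are stand-ins, and latches are used to force the interleaving (iterator created, then clear() on another thread, then next()) that only happens occasionally in production.

```java
import java.util.Arrays;
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CountDownLatch;

// Hypothetical stand-alone demo; String stands in for PublishVersionTask.
class HashMapRaceDemo {

    // Returns true if next() throws CME after another thread cleared the map.
    static boolean iterationFailsAfterConcurrentClear() {
        Map<Long, List<String>> tasks = new HashMap<>();
        tasks.put(1L, Arrays.asList("task-a"));
        tasks.put(2L, Arrays.asList("task-b"));

        CountDownLatch iteratorCreated = new CountDownLatch(1);
        CountDownLatch cleared = new CountDownLatch(1);

        // Plays the role of the publish worker calling clear().
        Thread worker = new Thread(() -> {
            try {
                iteratorCreated.await();
                tasks.clear();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                cleared.countDown();
            }
        }, "worker");
        worker.start();

        // Plays the role of the master daemon iterating the same map.
        try {
            Iterator<Map.Entry<Long, List<String>>> it = tasks.entrySet().iterator();
            iteratorCreated.countDown();
            cleared.await();
            it.next(); // modCount changed since the iterator was created
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("CME observed: " + iterationFailsAfterConcurrentClear());
    }
}
```

The latches exist only to make the demo deterministic; without them the same failure appears intermittently, which matches the timing-dependent behaviour described in this issue.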

The CME is caught at an outer layer, so FE does not crash, but that publish round aborts and the txn stays in COMMITTED until a later daemon round re-publishes it successfully. Recurring CMEs increase publish latency; combined with other factors (e.g. table-lock contention or executor saturation), they can grow into a larger publish backlog, but those secondary effects are out of scope for this issue.

Sample stack from a production FE:

java.util.ConcurrentModificationException
    at java.util.HashMap$HashIterator.nextNode(HashMap.java:1597)
    at java.util.HashMap$EntryIterator.next(HashMap.java:1630)
    at org.apache.doris.transaction.PublishVersionDaemon.tryFinishOneTxn(PublishVersionDaemon.java:191)

What You Expected?

Publish should not abort the daemon round due to CME. publishVersionTasks must be safe for concurrent access by the master daemon and the per-db publish worker.
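
One possible direction, shown only as a sketch and not as the fix the PR will necessarily take, is to back the map with ConcurrentHashMap, whose iterators are weakly consistent and never throw ConcurrentModificationException. The class ConcurrentMapSketch and the String task values below are hypothetical stand-ins. The sketch also assumes no null keys or values are stored (ConcurrentHashMap rejects them) and does not address thread safety of the List values themselves.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

// Hypothetical stand-alone sketch; String stands in for PublishVersionTask.
class ConcurrentMapSketch {

    // Iterates across a concurrent clear(); returns how many entries were
    // seen, or -1 if the calling thread was interrupted.
    static int iterateAcrossConcurrentClear() {
        Map<Long, List<String>> tasks = new ConcurrentHashMap<>();
        tasks.put(1L, Arrays.asList("task-a"));
        tasks.put(2L, Arrays.asList("task-b"));

        CountDownLatch iteratorCreated = new CountDownLatch(1);
        CountDownLatch cleared = new CountDownLatch(1);

        // Plays the role of the publish worker calling clear().
        Thread worker = new Thread(() -> {
            try {
                iteratorCreated.await();
                tasks.clear();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                cleared.countDown();
            }
        }, "worker");
        worker.start();

        try {
            Iterator<Map.Entry<Long, List<String>>> it = tasks.entrySet().iterator();
            iteratorCreated.countDown();
            cleared.await();
            int seen = 0;
            // Weakly consistent: may or may not reflect the clear, never throws CME.
            while (it.hasNext()) {
                it.next();
                seen++;
            }
            return seen;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return -1;
        }
    }

    public static void main(String[] args) {
        int seen = iterateAcrossConcurrentClear();
        System.out.println("iteration finished without CME, entries seen: " + seen);
    }
}
```

The trade-off is that a reader may observe a partially cleared map, so any caller that relies on seeing a complete snapshot of the tasks would need its own check or a lock instead.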

How to Reproduce?

The race depends on timing and is most easily observed when:

  • enable_parallel_publish_version = true (4.0 default);
  • heavy publish workload (many concurrent stream loads across many dbs/tables);
  • several publish-ready txns per daemon round, so the master is iterating one txn's map at the same moment a worker is inside pruneAfterVisible() on the same txn.

On a busy cluster under steady load, CME typically appears in fe.warn.log within hours.

Anything Else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
