[Bug] ConcurrentModificationException on TransactionState.publishVersionTasks under parallel publish version #63169

### Search before asking

- [x] I had searched in the [issues](https://github.com/apache/doris/issues?q=is%3Aissue) and found no similar issues.


### Version

4.0/4.1 (any version with enable_parallel_publish_version enabled; default is true).

### What's Wrong?

  `TransactionState.publishVersionTasks` is declared as a plain `HashMap`:

  ```java
  private Map<Long, List<PublishVersionTask>> publishVersionTasks;

  this.publishVersionTasks = Maps.newHashMap();
  ```

  Before 4.0, publish ran entirely on the single-threaded MasterDaemon, so a non-thread-safe map was fine. Starting in 4.0, `PublishVersionDaemon` runs publish in parallel through a per-db pool (`dbExecutors`, sized by `Config.publish_thread_pool_num`, default 128; each pool has `corePoolSize = 1`). When parallel publish is enabled, `tryFinishOneTxn` hands `tryFinishTxnSync` off to a worker while the master loop keeps going. The same map is then touched concurrently by:

  - Master daemon thread: iterates via `forEach` and reads `keySet()` in `PublishVersionDaemon.tryFinishOneTxn`.
  - `PUBLISH_VERSION_EXEC` worker for that txn's db: routed by `dbId % publish_thread_pool_num` to a single-thread pool, it iterates `values().forEach` in `PublishVersionDaemon.tryFinishTxnSync` and calls `clear()` in `TransactionState.pruneAfterVisible` after the txn becomes VISIBLE.
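
The routing described above can be sketched as follows. This is only an illustration of "one single-thread executor per slot, chosen by `dbId % poolNum`", not the actual `PublishVersionDaemon` code; the class name, `executorIndex`, and the pool size of 4 are hypothetical. It shows why two workers never race each other for the same db, which leaves the master daemon thread as the other party in the race.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PerDbRoutingSketch {
    // Hypothetical helper: picks the executor slot for a db (assumes dbId >= 0).
    public static int executorIndex(long dbId, int poolNum) {
        return (int) (dbId % poolNum);
    }

    public static void main(String[] args) throws Exception {
        int poolNum = 4; // stand-in for publish_thread_pool_num
        ExecutorService[] dbExecutors = new ExecutorService[poolNum];
        for (int i = 0; i < poolNum; i++) {
            dbExecutors[i] = Executors.newSingleThreadExecutor();
        }

        long dbId = 6L;
        ExecutorService pool = dbExecutors[executorIndex(dbId, poolNum)];
        // Two txns of the same db land on the same single worker thread,
        // so they run one after the other, never in parallel.
        Future<String> first = pool.submit(() -> Thread.currentThread().getName());
        Future<String> second = pool.submit(() -> Thread.currentThread().getName());
        System.out.println("same worker: " + first.get().equals(second.get()));

        for (ExecutorService e : dbExecutors) {
            e.shutdown();
        }
    }
}
```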

  (`addPublishVersionTask` is also called by the master daemon, but it runs once per txn during the initial dispatch in `traverseReadyTxnAndDispatchPublishVersionTask`, guarded by `hasSendTask`, strictly before any worker iteration, so it does not participate in this race.)

  The race: the master daemon iterates one txn's map while the worker (from the previous round) runs pruneAfterVisible() -> clear() on the same map. The HashMap fail-fast iterator detects the modCount change and throws ConcurrentModificationException.
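
The failure mode can be shown deterministically on a single thread, without Doris. The sketch below is a stand-in, not the FE code: `String` replaces `PublishVersionTask`, the keys are arbitrary, and the "master" and "worker" steps are simply interleaved by hand in the order the race would produce.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class HashMapClearDemo {
    // Returns true if clearing the map mid-iteration makes the iterator throw CME.
    public static boolean clearDuringIterationThrows() {
        // Same shape as publishVersionTasks: Long key -> list of tasks.
        Map<Long, List<String>> tasks = new HashMap<>();
        tasks.put(10001L, new ArrayList<>());
        tasks.put(10002L, new ArrayList<>());

        Iterator<Map.Entry<Long, List<String>>> it = tasks.entrySet().iterator();
        it.next();      // "master daemon" is part-way through its iteration
        tasks.clear();  // "worker" reaches pruneAfterVisible() -> clear(), bumping modCount
        try {
            it.next();  // fail-fast check notices the changed modCount
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("CME thrown: " + clearDuringIterationThrows());
    }
}
```

In the real race the two steps run on different threads, so whether the iterator observes the change depends on timing, which matches the intermittent nature of the report.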

  The CME is caught at an outer layer, so FE does not crash, but that publish round aborts and the txn stays in COMMITTED until a later daemon round re-publishes it successfully. Recurring CMEs increase publish latency; combined with other factors (e.g. table-lock contention or executor saturation), this can grow into a larger publish backlog, but those secondary effects are out of scope for this issue.

  Sample stack from a production FE:

  java.util.ConcurrentModificationException
      at java.util.HashMap$HashIterator.nextNode(HashMap.java:1597)
      at java.util.HashMap$EntryIterator.next(HashMap.java:1630)
      at org.apache.doris.transaction.PublishVersionDaemon.tryFinishOneTxn(PublishVersionDaemon.java:191)

### What You Expected?

Publish should not abort the daemon round due to CME. publishVersionTasks must be safe for concurrent access by the master daemon and the per-db publish worker.
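
One possible direction, offered as an assumption rather than the agreed fix, is to back the field with a `ConcurrentHashMap`, whose iterators are weakly consistent and do not throw `ConcurrentModificationException`. The sketch below uses `String` in place of `PublishVersionTask` and a hypothetical class name; it only shows that iterating while the map is cleared completes without an exception.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ConcurrentMapSketch {
    // Iterates the map while clearing it and returns how many entries were visited.
    public static int iterateWhileClearing() {
        Map<Long, List<String>> tasks = new ConcurrentHashMap<>();
        tasks.put(10001L, new ArrayList<>());
        tasks.put(10002L, new ArrayList<>());

        int visited = 0;
        for (Map.Entry<Long, List<String>> entry : tasks.entrySet()) {
            visited++;
            // Simulates pruneAfterVisible() -> clear() landing mid-iteration.
            // The weakly consistent iterator tolerates this; the exact number
            // of entries it still sees is not guaranteed.
            tasks.clear();
        }
        return visited;
    }

    public static void main(String[] args) {
        System.out.println("visited without CME: " + iterateWhileClearing());
    }
}
```

Two caveats apply to this sketch: `ConcurrentHashMap` rejects null keys and values, and it protects only the map itself, so the `List` values would need separate care if they were ever mutated concurrently.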

### How to Reproduce?

  The race depends on timing and is most easily observed when:

  - enable_parallel_publish_version = true (4.0 default);
  - heavy publish workload (many concurrent stream loads across many dbs/tables);
  - several publish-ready txns per daemon round, so the master is iterating one txn's map at the same moment a worker is inside pruneAfterVisible() on the same txn.

  On a busy cluster under steady load, CME typically appears in fe.warn.log within hours.

### Anything Else?

_No response_

### Are you willing to submit PR?

- [x] Yes I am willing to submit a PR!

### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)

