fix(fe): clean DynamicPartitionScheduler.runtimeInfos on DROP TABLE #62884

Open
horus-leonardo wants to merge 1 commit into apache:master from horus-leonardo:fix/dynamic-partition-scheduler-runtimeinfos-leak

@horus-leonardo horus-leonardo commented Apr 27, 2026

What problem does this PR solve?

Issue Number: close #62883

Related PR: none

Problem Summary:

DynamicPartitionScheduler.runtimeInfos accumulates entries indefinitely. The map is keyed by tableId and gets a new entry every time the scheduler runs against a table with dynamic_partition.enable=true or partitionRetentionCount > 0.

removeRuntimeInfo(long tableId) is called in exactly one place: ShowDynamicPartitionCommand.doRun(), which only fires when a user issues SHOW DYNAMIC PARTITION and only for tables still present in the catalog that have lost their dynamic_partition property. No catalog mutation path calls it — DROP TABLE, DROP DATABASE, and tables that turn off dynamic_partition or zero out partitionRetentionCount all leave permanent entries. In automated ETL workloads where nobody runs SHOW, the map grows unbounded.
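The leak pattern described above can be sketched in isolation. This is a minimal model, not the Doris code: `LeakySchedulerSketch`, `recordRun`, `churn`, and the `"lastRun"` key are hypothetical names, and only `removeRuntimeInfo` and the tableId-keyed map mirror what the PR describes.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the scheduler; not the real Doris class.
class LeakySchedulerSketch {
    // tableId -> per-table runtime info, modeled here as a small string map.
    private final Map<Long, Map<String, String>> runtimeInfos = new ConcurrentHashMap<>();

    // Models a scheduler pass touching an eligible table: creates the entry on first use.
    void recordRun(long tableId, String key, String value) {
        runtimeInfos.computeIfAbsent(tableId, id -> new ConcurrentHashMap<>()).put(key, value);
    }

    // The only removal hook. In the leaky variant nothing on the DROP path calls it.
    void removeRuntimeInfo(long tableId) {
        runtimeInfos.remove(tableId);
    }

    int size() {
        return runtimeInfos.size();
    }

    // Simulates CREATE -> schedule -> DROP churn and returns the entry count afterwards.
    static int churn(int tables, boolean cleanOnDrop) {
        LeakySchedulerSketch scheduler = new LeakySchedulerSketch();
        for (long id = 0; id < tables; id++) {
            scheduler.recordRun(id, "lastRun", "t" + id);
            if (cleanOnDrop) {
                scheduler.removeRuntimeInfo(id); // cleanup on the (simulated) DROP TABLE
            }
        }
        return scheduler.size();
    }
}
```

Under these assumptions, `churn(n, false)` leaves one entry per dropped table, while `churn(n, true)` leaves the map empty.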

This patch wires removeRuntimeInfo() into the three canonical cleanup points:

  1. InternalCatalog.unprotectDropTable() — alongside db.unregisterTable().
  2. executeDynamicPartition() db == null branch — after iterator.remove().
  3. executeDynamicPartition() olapTable invalid/lost-properties branch — after iterator.remove().
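Cleanup points 2 and 3 share one shape: wherever the scheduler loop calls `iterator.remove()`, it also drops the runtime info for that table. A minimal sketch, assuming a simplified scheduler in which `liveDbs` and `eligibleTables` are hypothetical stand-ins for the real catalog lookups and eligibility checks:

```java
import java.util.AbstractMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of the scheduler loop; not the real executeDynamicPartition().
class SchedulerPruneSketch {
    final Set<Map.Entry<Long, Long>> scheduled = new HashSet<>(); // (dbId, tableId) pairs
    final Map<Long, String> runtimeInfos = new ConcurrentHashMap<>();
    final Set<Long> liveDbs = new HashSet<>();        // assumption: stands in for the db lookup
    final Set<Long> eligibleTables = new HashSet<>(); // assumption: stands in for table checks

    void register(long dbId, long tableId) {
        scheduled.add(new AbstractMap.SimpleImmutableEntry<>(dbId, tableId));
    }

    void removeRuntimeInfo(long tableId) {
        runtimeInfos.remove(tableId);
    }

    void runOnce() {
        Iterator<Map.Entry<Long, Long>> iterator = scheduled.iterator();
        while (iterator.hasNext()) {
            Map.Entry<Long, Long> entry = iterator.next();
            long dbId = entry.getKey();
            long tableId = entry.getValue();
            // Missing db or non-eligible table: evict from the working set
            // and clear its runtime info in the same step.
            if (!liveDbs.contains(dbId) || !eligibleTables.contains(tableId)) {
                iterator.remove();
                removeRuntimeInfo(tableId);
                continue;
            }
            runtimeInfos.put(tableId, "scheduled");
        }
    }
}
```

Pairing the two calls keeps the working set and the runtime-info map from drifting apart, but only for entries the loop actually visits.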

Found via heap dump analysis after an FE OOM on 4.0.5-rc01 today (2026-04-27) in a high-DDL-churn ETL workload. The map had reached ~1.5M entries / 554 MB retained heap. We are rolling out a patched build to production now and will follow up on the issue thread with steady-state retention numbers after a week of uptime.

Full bug report and heap dump details in #62883.

Release note

Fix FE memory leak in DynamicPartitionScheduler.runtimeInfos for tables that are dropped, lose their dynamic_partition.enable property, or have partitionRetentionCount reset to 0.

Check List (For Author)

  • Test
    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:

Manual test: heap dump analysis on a 4.0.5-rc01 FE that OOMed under an ETL workload doing ~24K DDL/hour against dynamic_partition tables. The dump showed runtimeInfos holding ~1M–1.5M stale entries (2,097,152-bucket ConcurrentHashMap$Node[], 554 MB retained on DynamicPartitionScheduler, 17% of live heap post-GC walk). The patched build is being deployed today; I will report steady-state heap numbers in the issue thread after a week of production uptime.

A unit test reproducing the leak would need to drive the dynamic-partition scheduler against a synthetic catalog and assert runtimeInfos.size() after DROP. Happy to add one if maintainers prefer that over the production validation.

  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

DynamicPartitionScheduler.runtimeInfos accumulates entries indefinitely
when tables are dropped or lose their dynamic_partition properties.
removeRuntimeInfo(tableId) is called from ShowDynamicPartitionCommand
but only opportunistically: it requires a user to issue
SHOW DYNAMIC PARTITION and only catches tables still present in the
catalog that have lost their dynamic_partition property. No catalog
mutation path calls it.

Fix:
- Call removeRuntimeInfo() in InternalCatalog.unprotectDropTable() so
  the entry is cleared when a table is dropped.
- Call removeRuntimeInfo() in executeDynamicPartition() at the two
  cleanup points where the iterator removes a table from the scheduling
  set (db gone, olapTable null/MTMV/no-dynamic-partition).

In a high-DDL-churn workload (CREATE/DROP loops on tables with
dynamic_partition.enable=true or partitionRetentionCount > 0) this map
can grow unbounded and cause FE OOM after extended uptime.

Closes apache#62883

Signed-off-by: Leonardo Constanski <leonardo@horusbi.com.br>
@zclllyybb zclllyybb requested a review from Copilot April 29, 2026 06:51
@zclllyybb
Contributor

/review

Contributor

@zclllyybb zclllyybb left a comment

Thanks for the meaningful fix! Please add a regression test to lock in the behaviour.

Contributor

Copilot AI left a comment

Pull request overview

This PR addresses an FE memory leak where DynamicPartitionScheduler.runtimeInfos retains per-table runtime entries long after tables are no longer eligible for dynamic partition scheduling (notably after DROP TABLE / missing DB / invalid table cases), which can lead to unbounded heap growth in high-DDL-churn workloads.

Changes:

  • Remove runtimeInfos entries during DROP TABLE in InternalCatalog.unprotectDropTable().
  • Remove runtimeInfos entries when executeDynamicPartition() prunes tables due to missing DB or invalid/non-eligible tables.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| fe/fe-core/src/main/java/org/apache/doris/datasource/InternalCatalog.java | Removes dynamic-partition runtime info when a table is dropped from the catalog. |
| fe/fe-core/src/main/java/org/apache/doris/clone/DynamicPartitionScheduler.java | Removes runtime info when the scheduler evicts a table from its working set due to missing DB or invalid/non-eligible table state. |

Comment on lines 1045 to +1047
db.unregisterTable(table.getId());
// Fix DynamicPartitionScheduler.runtimeInfos leak on DROP TABLE.
Env.getCurrentEnv().getDynamicPartitionScheduler().removeRuntimeInfo(table.getId());

Copilot AI Apr 29, 2026

There are existing FE unit tests around dynamic partition scheduling/runtime info (e.g. DynamicPartitionTableTest#testRuntimeInfo). Since this change is intended to prevent a memory leak, it would be good to add a unit test that creates runtime info for a table, drops the table via the catalog path, and asserts the relevant runtime info keys no longer return the previously stored values (i.e. the entry was actually removed).
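The shape of such a test can be sketched without the FE test harness. Everything below is a hypothetical model: `Scheduler`, `Catalog`, `putRuntimeInfo`, `getRuntimeInfo`, and the `"N/A"` placeholder are illustrative names, and a real test would go through the existing Doris test utilities instead.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of "store runtime info, drop the table, assert it is gone".
class DropTableCleanupSketch {
    static final String ABSENT = "N/A"; // assumption: placeholder returned when no value exists

    static class Scheduler {
        private final Map<Long, Map<String, String>> runtimeInfos = new ConcurrentHashMap<>();

        void putRuntimeInfo(long tableId, String key, String value) {
            runtimeInfos.computeIfAbsent(tableId, id -> new ConcurrentHashMap<>()).put(key, value);
        }

        String getRuntimeInfo(long tableId, String key) {
            Map<String, String> info = runtimeInfos.get(tableId);
            return info == null ? ABSENT : info.getOrDefault(key, ABSENT);
        }

        void removeRuntimeInfo(long tableId) {
            runtimeInfos.remove(tableId);
        }
    }

    static class Catalog {
        private final Map<Long, String> tables = new HashMap<>();
        private final Scheduler scheduler;

        Catalog(Scheduler scheduler) {
            this.scheduler = scheduler;
        }

        void createTable(long tableId, String name) {
            tables.put(tableId, name);
        }

        void dropTable(long tableId) {
            tables.remove(tableId);              // stands in for db.unregisterTable(...)
            scheduler.removeRuntimeInfo(tableId); // the cleanup this PR adds
        }
    }

    // Returns true when the stored value is visible before the drop and gone after it.
    static boolean dropClearsRuntimeInfo() {
        Scheduler scheduler = new Scheduler();
        Catalog catalog = new Catalog(scheduler);
        catalog.createTable(42L, "t");
        scheduler.putRuntimeInfo(42L, "lastRun", "ok");
        boolean storedBefore = "ok".equals(scheduler.getRuntimeInfo(42L, "lastRun"));
        catalog.dropTable(42L);
        boolean goneAfter = ABSENT.equals(scheduler.getRuntimeInfo(42L, "lastRun"));
        return storedBefore && goneAfter;
    }
}
```

Asserting on the returned value rather than on the map size keeps the check valid even if other tables have entries in the same scheduler.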

Comment on lines 692 to 696
|| !olapTable.getTableProperty().getDynamicPartitionProperty().getEnable())
&& olapTable.getPartitionRetentionCount() <= 0) {
iterator.remove();
removeRuntimeInfo(tableId);
continue;

Copilot AI Apr 29, 2026

The new cleanup in executeDynamicPartition() only runs when the scheduler itself iterates a (dbId, tableId) entry and decides to iterator.remove(). However, the normal DDL path for disabling dynamic partition / setting partition.retention_count back to 0 calls DynamicPartitionUtil.registerOrRemoveDynamicPartitionTable(...), which removes the table from dynamicPartitionTableInfo via removeDynamicPartitionTable(...) without clearing runtimeInfos. Once removed from the set, executeDynamicPartition() will never visit it again, so its runtimeInfos entry can still become permanent. Consider clearing runtimeInfos as part of the removal path as well (e.g., in removeDynamicPartitionTable(...) or in DynamicPartitionUtil.registerOrRemoveDynamicPartitionTable() when unregistering).
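One way to close this gap is to clear the runtime info in the same method that unregisters the table, so no caller can do one without the other. A minimal sketch under that assumption: the class and the `recordRun` helper are hypothetical, and the method names only echo the ones mentioned above rather than reproducing the real Doris signatures.

```java
import java.util.AbstractMap;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model: unregistering a table also clears its runtime info.
class RemovalPathSketch {
    private final Set<Map.Entry<Long, Long>> dynamicPartitionTableInfo = ConcurrentHashMap.newKeySet();
    private final Map<Long, String> runtimeInfos = new ConcurrentHashMap<>();

    void registerDynamicPartitionTable(long dbId, long tableId) {
        dynamicPartitionTableInfo.add(new AbstractMap.SimpleImmutableEntry<>(dbId, tableId));
    }

    // Models a scheduler pass writing runtime info for a registered table.
    void recordRun(long tableId) {
        runtimeInfos.put(tableId, "scheduled");
    }

    // Removal path used when dynamic partition is disabled through DDL.
    // Clearing runtimeInfos here covers tables the scheduler loop will never revisit.
    void removeDynamicPartitionTable(long dbId, long tableId) {
        dynamicPartitionTableInfo.remove(new AbstractMap.SimpleImmutableEntry<>(dbId, tableId));
        runtimeInfos.remove(tableId);
    }

    boolean hasRuntimeInfo(long tableId) {
        return runtimeInfos.containsKey(tableId);
    }

    int scheduledCount() {
        return dynamicPartitionTableInfo.size();
    }
}
```

Whether the cleanup belongs in the scheduler's removal method or in the statement handler that triggers it is a design choice for the maintainers; the sketch only shows that the two removals need to happen together.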

Contributor

This issue should be considered. It may be better to handle it at the end of the relevant DDL statements, rather than within the scheduler.


Development

Successfully merging this pull request may close these issues.

[Bug] (dynamic-partition) DynamicPartitionScheduler.runtimeInfos leaks entries on DROP TABLE, causing FE OOM
