Conversation

XiaoHongbo-Hope (Contributor) commented Nov 6, 2025

This PR

Supports combined mode processing when cleaning orphan files:

  • Processes multiple tables within a single DataStream during job graph construction, instead of creating one DataStream per table. This significantly reduces JobGraph construction time and complexity, avoiding timeouts, stack overflows, and resource allocation failures
  • Only applies when --mode combined is specified

Adds configuration:

  • [--mode <divided|combined>]: Processing mode (default: combined)
    • divided: Create one DataStream per table (original behavior)
    • combined: Process all tables in a single DataStream
  • [--tables <table1>] [--tables <table2>]: repeated flag, one table name per occurrence
  • --table and --tables cannot be used together
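The mutual exclusion between --table and --tables can be sketched with a small argument check (ArgCheck is a hypothetical helper for illustration; the actual validation lives in RemoveOrphanFilesActionFactory):

```java
import java.util.ArrayList;
import java.util.List;

public class ArgCheck {
    // Collect table names from repeated --table / --tables flags,
    // rejecting a mix of the two, as described above.
    static List<String> parseTables(String[] args) {
        List<String> single = new ArrayList<>();
        List<String> multi = new ArrayList<>();
        for (int i = 0; i < args.length - 1; i++) {
            if ("--table".equals(args[i])) {
                single.add(args[++i]);
            } else if ("--tables".equals(args[i])) {
                multi.add(args[++i]);
            }
        }
        if (!single.isEmpty() && !multi.isEmpty()) {
            throw new IllegalArgumentException("--table and --tables cannot be used together");
        }
        return single.isEmpty() ? multi : single;
    }

    public static void main(String[] args) {
        System.out.println(parseTables(new String[] {"--tables", "t1", "--tables", "t2"}));
    }
}
```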

Tests:

  • testCombinedMode: Combined mode with multiple tables
  • testCombinedModeWithBranch: Combined mode with multiple branches

XiaoHongbo-Hope marked this pull request as ready for review November 9, 2025 17:56
XiaoHongbo-Hope marked this pull request as draft November 9, 2025 17:57
XiaoHongbo-Hope changed the title to "[flink] support combined mode for orphan files clean to process multiple tables in a single DataStream" Nov 13, 2025
XiaoHongbo-Hope changed the title to "[flink] support combined mode for orphan files clean to process multiple tables" Nov 13, 2025
XiaoHongbo-Hope changed the title to "[flink] support combined mode for orphan files clean" Nov 13, 2025
+ "--database <database_name> \\\n"
+ "--table <table_name> \\\n"
+ "[--table <table_name>] \\\n"
+ "[--tables <table1,table2,...>] \\\n"
Contributor commented:

The usage of --tables is confusing. Is this a single parameter or a multi-parameter?

XiaoHongbo-Hope (Contributor, Author) replied:

> The usage of --tables is confusing. Is this a single parameter or a multi-parameter?

Updated.

for (T cleaner : cleaners) {
FileStoreTable table = cleaner.getTable();
Identifier identifier = table.catalogEnvironment().identifier();
if (identifier == null) {

Contributor commented:

Why not pass identifier from ActionFactory?


yuzelin (Contributor) left a comment:

+1

JingsongLi (Contributor) commented Nov 14, 2025

Does combined mode have any bad cases? Why not just enable it by default?

XiaoHongbo-Hope (Contributor, Author) replied:

> Does combined mode have any bad cases? Why not just enable it by default?

Sure, we can enable it by default.

XiaoHongbo-Hope (Contributor, Author) commented:

> Does combined mode have any bad cases? Why not just enable it by default?

Updated and tested in our application.

Copilot AI left a comment:

Pull Request Overview

This PR adds support for combined mode when cleaning orphan files in Flink, allowing multiple tables to be processed within a single DataStream during job graph construction instead of creating one DataStream per table. This significantly reduces JobGraph construction time and complexity when processing thousands of tables.

Key changes:

  • Introduces CombinedFlinkOrphanFilesClean class to process multiple tables in a single DataStream
  • Adds --mode parameter with divided (original behavior) and combined (new behavior) options, defaulting to combined
  • Adds --tables parameter to specify multiple tables explicitly
  • Refactors existing code to extract common utilities into OrphanFilesCleanUtil

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.

Summary per file:

  • RemoveOrphanFilesActionITCaseBase.java: Adds comprehensive tests for combined mode including multi-table, branch, and external path scenarios
  • OrphanFilesCleanUtil.java: New utility class extracting common Flink environment configuration and input handling logic
  • FlinkOrphanFilesClean.java: Refactored to extract common logic and expose methods for combined mode usage
  • CombinedFlinkOrphanFilesClean.java: New implementation for combined mode orphan file cleaning across multiple tables
  • RemoveOrphanFilesActionFactory.java: Updated to support --tables and --mode parameters with validation
  • RemoveOrphanFilesAction.java: Modified to support both single and multi-table modes with configurable processing mode
  • OrphanFilesClean.java: Changed visibility of methods from protected to public for combined mode access



private String olderThan = null;
private boolean dryRun = false;
private MultiTablesSinkMode mode = COMBINED;
Copilot AI commented Nov 18, 2025:

The default mode is set to COMBINED in the code, but the PR description and the help text in RemoveOrphanFilesActionFactory state the default is combined. However, MultiTablesSinkMode.fromString(null) returns DIVIDED, not COMBINED. This creates an inconsistency: when the --mode parameter is not provided, fromString(null) will return DIVIDED, but the field default is COMBINED. This means the actual default depends on whether the parameter is provided. Consider either: (1) removing the field initializer and letting mode be null by default, then using mode == null || mode == COMBINED in the condition, or (2) changing the help text and PR description to accurately reflect that when --mode is not specified, the behavior defaults to DIVIDED per the fromString method.

Suggested change
private MultiTablesSinkMode mode = COMBINED;
@Nullable private MultiTablesSinkMode mode;
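
A null-safe way to resolve the default, as the review suggests, could look like the sketch below (MultiTablesSinkMode here is a stand-in enum for illustration, not the actual Paimon class):

```java
public class ModeDefault {
    // Stand-in for Paimon's MultiTablesSinkMode (assumption for illustration).
    enum MultiTablesSinkMode { DIVIDED, COMBINED }

    // Resolve the --mode argument: an absent flag means COMBINED,
    // avoiding the fromString(null) -> DIVIDED inconsistency noted above.
    static MultiTablesSinkMode resolve(String arg) {
        return arg == null
                ? MultiTablesSinkMode.COMBINED
                : MultiTablesSinkMode.valueOf(arg.toUpperCase(java.util.Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(resolve(null)); // prints COMBINED
    }
}
```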

Comment on lines 141 to 144
+ "during job graph construction, instead of creating one dataStream per table. "
+ "This significantly reduces job graph construction time, when processing "
+ "thousands of tables (jobs may fail to start within timeout limits). "
+ "It also reduces JobGraph complexity and avoids stack over flow issue and resource allocation failures during job running. "
Copilot AI commented Nov 18, 2025:

Corrected spelling of 'dataStream' to 'DataStream' for consistency with Flink terminology and corrected 'stack over flow' to 'stack overflow'.

Suggested change
+ "during job graph construction, instead of creating one dataStream per table. "
+ "This significantly reduces job graph construction time, when processing "
+ "thousands of tables (jobs may fail to start within timeout limits). "
+ "It also reduces JobGraph complexity and avoids stack over flow issue and resource allocation failures during job running. "
+ "during job graph construction, instead of creating one DataStream per table. "
+ "This significantly reduces job graph construction time, when processing "
+ "thousands of tables (jobs may fail to start within timeout limits). "
+ "It also reduces JobGraph complexity and avoids stack overflow issue and resource allocation failures during job running. "

+ "It also reduces JobGraph complexity and avoids stack over flow issue and resource allocation failures during job running. "
Copilot AI commented Nov 18, 2025:

Corrected spelling of 'stack over flow' to 'stack overflow'.

Suggested change
+ "It also reduces JobGraph complexity and avoids stack over flow issue and resource allocation failures during job running. "
+ "It also reduces JobGraph complexity and avoids stack overflow issue and resource allocation failures during job running. "

Comment on lines 135 to 141
/**
* Stringify the given {@link InternalRow}. This is a simplified version that handles basic
* types. For complex types (Array, Map, Row), it falls back to toString().
*
* <p>This method is implemented locally to avoid dependency on paimon-common's test-jar, which
* may not be available in CI environments.
*/
Copilot AI commented Nov 18, 2025:

The documentation states the method handles complex types by falling back to toString(), but the actual implementation doesn't handle complex types at all—it only handles basic types using FieldGetter. The comment is misleading and should be updated to accurately reflect what the method actually does: 'This is a simplified version that handles basic types only. Complex types (Array, Map, Row) are not explicitly handled and rely on the default object representation from FieldGetter.'

DataStream<String> usedFiles =
usedManifestFiles
.getSideOutput(manifestOutputTag)
.keyBy(tuple2 -> tuple2.f0) // Use Identifier object directly as key
Copilot AI commented Nov 18, 2025:

[nitpick] The comment 'Use Identifier object directly as key' is misleading. The code uses tuple2.f0 which is indeed an Identifier, but the comment should clarify that Identifier objects are used as keys for grouping by table. Consider: 'Group by table identifier to process manifests per table'.

Suggested change
.keyBy(tuple2 -> tuple2.f0) // Use Identifier object directly as key
.keyBy(tuple2 -> tuple2.f0) // Group by table identifier to process manifests per table
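
Outside Flink, the per-table grouping that keyBy performs can be illustrated with a plain Collectors.groupingBy (illustrative sketch only; the real code keys a Flink DataStream of Tuple2<Identifier, String> pairs):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class GroupByTable {
    // Group (tableIdentifier, manifestFile) pairs by table, mirroring
    // keyBy(tuple2 -> tuple2.f0) in the combined-mode stream.
    static Map<String, List<String>> byTable(List<SimpleEntry<String, String>> entries) {
        return entries.stream()
                .collect(Collectors.groupingBy(
                        SimpleEntry::getKey,
                        Collectors.mapping(SimpleEntry::getValue, Collectors.toList())));
    }

    public static void main(String[] args) {
        System.out.println(byTable(List.of(
                new SimpleEntry<>("db.t1", "m1"),
                new SimpleEntry<>("db.t1", "m2"),
                new SimpleEntry<>("db.t2", "m3"))));
    }
}
```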

writeToBranch(branchTable2, GenericRow.of(3L, BinaryString.fromString("World"), 30));

// Create orphan files in both branch snapshot directories
// This is key: same table, multiple branches - will trigger bug in
Copilot AI commented Nov 18, 2025:

The comment 'will trigger bug in' is incomplete and unclear. It appears to be referring to a bug that was fixed, but the comment doesn't explain what bug or what the expected behavior is. This should either be completed or removed, e.g., 'This tests that combined mode correctly handles multiple branches within the same table'.

Suggested change
// This is key: same table, multiple branches - will trigger bug in
// This tests that combined mode correctly handles multiple branches within the same table and removes orphan files from each branch.

// Create orphan files in both branch snapshot directories
// This is key: same table, multiple branches - will trigger bug in
Contributor commented:
Is this comment fixed?

XiaoHongbo-Hope (Contributor, Author) replied:
> Is this comment fixed?

Fixed and removed the outdated comment.

/** Utility class for orphan files clean operations in Flink. */
public class OrphanFilesCleanUtil {

protected static final Logger LOG = LoggerFactory.getLogger(FlinkOrphanFilesClean.class);
Contributor commented:
The logger should reference OrphanFilesCleanUtil.class, not FlinkOrphanFilesClean.class.
