[flink] support combined mode for orphan files clean #6551
Conversation
| + "--database <database_name> \\\n" | ||
| + "--table <table_name> \\\n" | ||
| + "[--table <table_name>] \\\n" | ||
| + "[--tables <table1,table2,...>] \\\n" |
The usage of --tables is confusing. Is this a single parameter or a repeatable parameter?
updated
for (T cleaner : cleaners) {
    FileStoreTable table = cleaner.getTable();
    Identifier identifier = table.catalogEnvironment().identifier();
    if (identifier == null) {
Why not pass the identifier from ActionFactory?
yuzelin left a comment:
+1
Does combined mode have any bad cases? Why not just enable it by default?

Sure, we can enable it by default.

Updated and tested in our application.
Pull Request Overview
This PR adds support for combined mode when cleaning orphan files in Flink, allowing multiple tables to be processed within a single DataStream during job graph construction instead of creating one DataStream per table. This significantly reduces JobGraph construction time and complexity when processing thousands of tables.
Key changes:
- Introduces `CombinedFlinkOrphanFilesClean` class to process multiple tables in a single DataStream
- Adds `--mode` parameter with `divided` (original behavior) and `combined` (new behavior) options, defaulting to `combined`
- Adds `--tables` parameter to specify multiple tables explicitly
- Refactors existing code to extract common utilities into `OrphanFilesCleanUtil`
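To make the divided-versus-combined distinction concrete, here is a minimal, self-contained Flink sketch, not Paimon's actual code: the table names and the map step are illustrative stand-ins for the real cleaning pipeline. It shows why combined mode yields a smaller JobGraph, since divided mode adds one source pipeline per table while combined mode funnels all tables through one.

```java
import java.util.Arrays;
import java.util.List;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CombinedVsDividedSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        List<String> tables = Arrays.asList("db.t1", "db.t2", "db.t3");

        // Divided mode: N tables -> N independent pipelines, i.e. N sources in the JobGraph.
        for (String t : tables) {
            env.fromElements(t).map(id -> "cleaned " + id).print();
        }

        // Combined mode: N tables -> one pipeline, i.e. a single source in the JobGraph.
        env.fromCollection(tables).map(id -> "cleaned " + id).print();

        env.execute("combined-vs-divided-sketch");
    }
}
```

With thousands of tables, the divided loop above produces thousands of graph nodes, which is exactly the construction-time and complexity cost the PR description cites.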
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| RemoveOrphanFilesActionITCaseBase.java | Adds comprehensive tests for combined mode including multi-table, branch, and external path scenarios |
| OrphanFilesCleanUtil.java | New utility class extracting common Flink environment configuration and input handling logic |
| FlinkOrphanFilesClean.java | Refactored to extract common logic and expose methods for combined mode usage |
| CombinedFlinkOrphanFilesClean.java | New implementation for combined mode orphan file cleaning across multiple tables |
| RemoveOrphanFilesActionFactory.java | Updated to support --tables and --mode parameters with validation |
| RemoveOrphanFilesAction.java | Modified to support both single and multi-table modes with configurable processing mode |
| OrphanFilesClean.java | Changed visibility of methods from protected to public for combined mode access |
private String olderThan = null;
private boolean dryRun = false;
private MultiTablesSinkMode mode = COMBINED;
Copilot AI · Nov 18, 2025
The field default is COMBINED, and the help text in RemoveOrphanFilesActionFactory and the PR description both state the default is combined. However, MultiTablesSinkMode.fromString(null) returns DIVIDED, not COMBINED, so when the --mode parameter is not provided, the parsed value overwrites the field default with DIVIDED. The actual default therefore depends on whether the parameter is provided. Consider either: (1) removing the field initializer, letting mode be null by default, and using mode == null || mode == COMBINED in the condition, or (2) changing the help text and PR description to state that when --mode is not specified, the behavior defaults to DIVIDED per the fromString method.
Suggested change:
- private MultiTablesSinkMode mode = COMBINED;
+ @Nullable private MultiTablesSinkMode mode;
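To see the inconsistency concretely, here is a minimal, self-contained sketch. The enum and fromString below only model the behavior described in the comment; the real MultiTablesSinkMode lives in Paimon and may differ in detail.

```java
public class ModeDefaultSketch {
    enum MultiTablesSinkMode {
        DIVIDED,
        COMBINED;

        // Models the behavior described above: a null argument falls back to DIVIDED.
        static MultiTablesSinkMode fromString(String s) {
            return s == null ? DIVIDED : valueOf(s.toUpperCase());
        }
    }

    private MultiTablesSinkMode mode = MultiTablesSinkMode.COMBINED; // field default

    void parseArgs(String modeArg) {
        // Unconditionally assigning the parsed value means a missing --mode flag
        // (modeArg == null) silently flips the effective default to DIVIDED.
        mode = MultiTablesSinkMode.fromString(modeArg);
    }

    public static void main(String[] args) {
        ModeDefaultSketch action = new ModeDefaultSketch();
        action.parseArgs(null); // --mode not provided
        System.out.println(action.mode); // prints DIVIDED, not COMBINED
    }
}
```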
| + "during job graph construction, instead of creating one dataStream per table. " | ||
| + "This significantly reduces job graph construction time, when processing " | ||
| + "thousands of tables (jobs may fail to start within timeout limits). " | ||
| + "It also reduces JobGraph complexity and avoids stack over flow issue and resource allocation failures during job running. " |
Copilot AI · Nov 18, 2025
Corrected spelling of 'dataStream' to 'DataStream' for consistency with Flink terminology and corrected 'stack over flow' to 'stack overflow'.
| + "during job graph construction, instead of creating one dataStream per table. " | |
| + "This significantly reduces job graph construction time, when processing " | |
| + "thousands of tables (jobs may fail to start within timeout limits). " | |
| + "It also reduces JobGraph complexity and avoids stack over flow issue and resource allocation failures during job running. " | |
| + "during job graph construction, instead of creating one DataStream per table. " | |
| + "This significantly reduces job graph construction time, when processing " | |
| + "thousands of tables (jobs may fail to start within timeout limits). " | |
| + "It also reduces JobGraph complexity and avoids stack overflow issue and resource allocation failures during job running. " |
| + "during job graph construction, instead of creating one dataStream per table. " | ||
| + "This significantly reduces job graph construction time, when processing " | ||
| + "thousands of tables (jobs may fail to start within timeout limits). " | ||
| + "It also reduces JobGraph complexity and avoids stack over flow issue and resource allocation failures during job running. " |
Copilot AI · Nov 18, 2025
Corrected spelling of 'stack over flow' to 'stack overflow'.
| + "It also reduces JobGraph complexity and avoids stack over flow issue and resource allocation failures during job running. " | |
| + "It also reduces JobGraph complexity and avoids stack overflow issue and resource allocation failures during job running. " |
/**
 * Stringify the given {@link InternalRow}. This is a simplified version that handles basic
 * types. For complex types (Array, Map, Row), it falls back to toString().
 *
 * <p>This method is implemented locally to avoid dependency on paimon-common's test-jar, which
 * may not be available in CI environments.
 */
Copilot AI · Nov 18, 2025
The documentation states the method handles complex types by falling back to toString(), but the actual implementation doesn't handle complex types at all—it only handles basic types using FieldGetter. The comment is misleading and should be updated to accurately reflect what the method actually does: 'This is a simplified version that handles basic types only. Complex types (Array, Map, Row) are not explicitly handled and rely on the default object representation from FieldGetter.'
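For illustration, here is a minimal, self-contained sketch of the fallback the javadoc describes, with an Object[] standing in for fields already read from an InternalRow via FieldGetter. This is a hypothetical simplification, not the PR's actual method.

```java
import java.util.StringJoiner;

public class StringifySketch {
    // Stand-in for a row of already-extracted field values; the real method
    // reads fields from Paimon's InternalRow via FieldGetter instances.
    static String stringify(Object[] fields) {
        StringJoiner joiner = new StringJoiner(", ", "[", "]");
        for (Object field : fields) {
            // Basic types print naturally; arrays/maps/nested rows get whatever
            // their toString() yields -- the "fallback" the javadoc mentions.
            joiner.add(String.valueOf(field));
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        System.out.println(stringify(new Object[] {1L, "Hello", 20})); // [1, Hello, 20]
    }
}
```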
DataStream<String> usedFiles =
        usedManifestFiles
                .getSideOutput(manifestOutputTag)
                .keyBy(tuple2 -> tuple2.f0) // Use Identifier object directly as key
Copilot AI · Nov 18, 2025
[nitpick] The comment 'Use Identifier object directly as key' is misleading. The code uses tuple2.f0 which is indeed an Identifier, but the comment should clarify that Identifier objects are used as keys for grouping by table. Consider: 'Group by table identifier to process manifests per table'.
Suggested change:
- .keyBy(tuple2 -> tuple2.f0) // Use Identifier object directly as key
+ .keyBy(tuple2 -> tuple2.f0) // Group by table identifier to process manifests per table
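As a minimal runnable illustration of the pattern the renamed comment describes (grouping a stream of (identifier, file) pairs by table before per-table processing), here is a sketch that uses String in place of Paimon's Identifier; the element values are invented for the example.

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KeyByIdentifierSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<Tuple2<String, String>> manifests =
                env.fromElements(
                        Tuple2.of("db.t1", "manifest-1"),
                        Tuple2.of("db.t2", "manifest-2"),
                        Tuple2.of("db.t1", "manifest-3"));
        // Group by table identifier (f0) so each table's manifests are processed
        // together; String stands in for Paimon's Identifier in this sketch.
        manifests.keyBy(t -> t.f0).print();
        env.execute("key-by-identifier-sketch");
    }
}
```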
writeToBranch(branchTable2, GenericRow.of(3L, BinaryString.fromString("World"), 30));

// Create orphan files in both branch snapshot directories
// This is key: same table, multiple branches - will trigger bug in
Copilot AI · Nov 18, 2025
The comment 'will trigger bug in' is incomplete and unclear. It appears to be referring to a bug that was fixed, but the comment doesn't explain what bug or what the expected behavior is. This should either be completed or removed, e.g., 'This tests that combined mode correctly handles multiple branches within the same table'.
Suggested change:
- // This is key: same table, multiple branches - will trigger bug in
+ // This tests that combined mode correctly handles multiple branches within the same table and removes orphan files from each branch.
writeToBranch(branchTable2, GenericRow.of(3L, BinaryString.fromString("World"), 30));

// Create orphan files in both branch snapshot directories
// This is key: same table, multiple branches - will trigger bug in
Is this comment fixed?
Fixed, and removed the outdated comment.
/** Utility class for orphan files clean operations in Flink. */
public class OrphanFilesCleanUtil {

    protected static final Logger LOG = LoggerFactory.getLogger(FlinkOrphanFilesClean.class);
Should be OrphanFilesCleanUtil.class.
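The reviewer's point, spelled out as a minimal sketch (only the logger line is at issue; the rest of the class body is elided):

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/** Utility class for orphan files clean operations in Flink. */
public class OrphanFilesCleanUtil {
    // Log under this class's own name, not FlinkOrphanFilesClean.class,
    // so log output is attributed to the class that actually emitted it.
    protected static final Logger LOG = LoggerFactory.getLogger(OrphanFilesCleanUtil.class);
}
```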
This PR

Supports combined mode processing for orphan files clean when `--mode combined` is specified.

Adds configuration:
- `[--mode <divided|combined>]`: Processing mode (default: `combined`)
  - `divided`: Create one DataStream per table (original behavior)
  - `combined`: Process all tables in a single DataStream
- `[--tables <table1>] [--tables <table2>]`: multiple parameters for table names
- `--table` and `--tables` cannot be used together

Tests:
- `testCombinedMode`: Combined mode with multiple tables
- `testCombinedModeWithBranch`: Combined mode with multiple branches
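As a closing illustration of the mutual-exclusion rule above, here is a hypothetical sketch of the validation; the real RemoveOrphanFilesActionFactory may structure its argument handling differently, and the table names are invented.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class RemoveOrphanFilesArgsSketch {
    // --table and --tables cannot be used together.
    static void validate(String table, List<String> tables) {
        if (table != null && !tables.isEmpty()) {
            throw new IllegalArgumentException("--table and --tables cannot be used together.");
        }
    }

    public static void main(String[] args) {
        validate("db.t1", Collections.emptyList());      // OK: single table
        validate(null, Arrays.asList("db.t1", "db.t2")); // OK: combined over multiple tables
        try {
            validate("db.t1", Arrays.asList("db.t2"));   // both flags: rejected
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```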