[HUDI-9653] Add Spark Procedures for showing requested, completed cleans #13719
Merged
vinothchandar merged 11 commits into apache:master on Aug 21, 2025
- Implement ShowCleansProcedure for displaying completed clean operations
- Implement ShowCleansPlanProcedure for displaying requested clean plans and schedules
- Add ShowCleansPartitionMetadataProcedure for partition-level clean metadata
- Register new procedures in HoodieProcedures.scala
- Add comprehensive test suite TestShowCleansProcedures.scala with edge cases
- Support limit parameter for result pagination
- Include proper error handling for non-existent tables
- Follow existing procedure patterns and coding standards
rahil-c
reviewed
Aug 18, 2025
...-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/ShowCleansProcedure.scala
rahil-c
reviewed
Aug 18, 2025
...-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/ShowCleansProcedure.scala
rahil-c
reviewed
Aug 18, 2025
...-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/ShowCleansProcedure.scala
Collaborator
@vamshikrishnakyatham can you check the GitHub CI for issues when you get a chance?
…rk-procedures-for-showing-requested-completed-cleans
This commit introduces SQL filter expression support for Hudi Spark procedures, enabling users to apply standard SQL expressions to filter timeline data.

Key features:
- Added filter parameter to show_clean_plans, show_cleans, and show_cleans_metadata procedures
- Supports standard SQL expressions for filtering procedure results (e.g., plan_time > '20250101000000')
- Automatic expression parsing and validation using Spark's SQL parser
- Proper column binding and type conversion for expression evaluation
- Comprehensive error handling with descriptive error messages

Examples:
- call show_clean_plans(table => 'my_table', filter => 'plan_time > 20250101000000')
- call show_cleans(table => 'my_table', filter => 'clean_time >= 20250101 AND state = COMPLETED')
- call show_clean_plans(table => 'my_table', filter => 'total_partitions_to_clean > 0')

Implementation:
- Created HoodieProcedureFilterUtils with SQL expression parsing and evaluation
- Added filter validation to prevent invalid column references and syntax errors
- Enhanced existing procedures with optional filter parameter (default empty)
- Added comprehensive test coverage for time-based and state-based filtering
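A minimal sketch of how the filter parameter might be used together with the other parameters this PR describes. The procedure name, the parameter names (table, filter, showArchived, limit), and the plan_time column all come from the commit messages in this PR; my_table is a placeholder, and it is an assumption here that all three optional parameters can be passed in a single call.

```sql
-- Clean plans with a plan_time later than the given instant time,
-- read from both the active and the archived timeline, capped at 10 rows.
-- Assumption: filter, showArchived, and limit can be combined in one call.
call show_clean_plans(
  table => 'my_table',
  filter => 'plan_time > 20250101000000',
  showArchived => true,
  limit => 10
);
```

The filter here uses a numeric comparison only, so it does not depend on how string literals are quoted inside the filter expression.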
rahil-c
reviewed
Aug 19, 2025
...rk/src/main/scala/org/apache/spark/sql/hudi/command/procedures/ShowCleansPlanProcedure.scala
Outdated
Collaborator
@yihua @nsivabalan @dannyhchen @jonvex I did an initial pass of the PR and the core logic looks good to me. Could a committer take a look as well, since I will not be able to merge it?
- Implement ShowCleansProcedure for displaying completed clean operations
- Implement ShowCleansPlanProcedure for displaying requested clean plans and schedules
- Add ShowCleansPartitionMetadataProcedure for partition-level clean metadata
- Register new procedures in HoodieProcedures.scala
- Add comprehensive test suite TestShowCleansProcedures.scala with edge cases
- Support limit parameter for result pagination
- Support showArchived parameter to support querying both active and archived cleans
- Include proper error handling for non-existent tables
- Follow existing procedure patterns and coding standards
vinothchandar
approved these changes
Aug 20, 2025
Member
vinothchandar
left a comment
Thanks for the review, @rahil-c. Waiting on CI.
Good job, @vamshikrishnakyatham!
Collaborator
@vamshikrishnakyatham I think the CI failures are likely flaky tests. You will need to either ask one of the Hudi committers to rerun the failed jobs or push another change to trigger CI.
Member
Rekicked CI.
rahil-c
approved these changes
Aug 21, 2025
Collaborator
Thanks @vamshikrishnakyatham for this change! I have approved from my side.
alexr17 pushed a commit to alexr17/hudi that referenced this pull request on Aug 25, 2025
…ans (apache#13719)
* [HUDI-9653] Add show_cleans and show_clean_plans SQL procedures
* [HUDI-9653] Add support for showArchived in the clean procedures
* [HUDI-9653] Add SQL filter expression support for Spark procedures
* [HUDI-9653] Add SQL filter expression support for Spark procedures
* [HUDI-9653] Add show_cleans and show_clean_plans SQL procedures
* [HUDI-9653] Add show_cleans and show_clean_plans SQL procedures
* [HUDI-9653] Add show_cleans and show_clean_plans SQL procedures
* [HUDI-9653] Add show_cleans and show_clean_plans SQL procedures
What is the purpose of the pull request
This PR implements new SQL procedures for viewing Hudi cleaning operations to provide comprehensive visibility into table maintenance activities:
- show_cleans: Display completed clean operations with timing and file deletion statistics
- show_clean_plans: Display clean plans (both requested and completed) with retention policies and state information
- show_cleans_metadata: Display partition-level clean metadata with detailed statistics
- showArchived parameter: Access both active and archived timeline data

These procedures help users monitor cleaning performance across the entire timeline history, debug storage issues with precise filtering, and understand table maintenance operations using familiar SQL syntax.
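As a rough illustration of the three procedures listed above, the calls below use only the procedure names and the table, limit, and showArchived parameters named in this PR. The table name my_table is a placeholder, and the limit value is arbitrary.

```sql
-- Completed clean operations (timing and file deletion statistics).
call show_cleans(table => 'my_table', limit => 10);

-- Clean plans, both requested and completed, with retention policy and state.
call show_clean_plans(table => 'my_table', limit => 10);

-- Partition-level clean metadata, also reading the archived timeline.
call show_cleans_metadata(table => 'my_table', showArchived => true, limit => 10);
```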
Brief change log
- showArchived parameter to all procedures for accessing archived timeline data
Change Logs
Context: Added three new SQL procedures to provide comprehensive visibility into Hudi's cleaning operations, including historical data from archived timelines, which was previously only available through CLI tools or direct timeline inspection.
Summary:
- show_cleans procedure displays completed cleaning operations with metadata like timing, files deleted, and retention policies
- show_clean_plans procedure shows clean operations in all states (REQUESTED, INFLIGHT, COMPLETED) with a state field
- show_cleans_metadata procedure provides partition-level cleaning details for debugging
- showArchived parameter: Access both active and archived timeline data using metaClient.getArchivedTimeline.mergeTimeline(metaClient.getActiveTimeline)
Impact
Public API Changes:
- show_cleans, show_clean_plans, show_cleans_metadata
- limit parameter for pagination
- showArchived parameter for accessing archived timeline data
- CALL statements in Spark SQL
User-Facing Features:
- call show_clean_plans(table => 'my_table', limit => 5)
- call show_cleans(table => 'my_table', showArchived => true, limit => 5)
Performance Impact:
Risk level (write none, low medium or high below)
Low
This is a new feature addition that only adds SQL procedures without modifying existing functionality. The changes are isolated, well-tested, and follow established patterns. No existing APIs or behaviors are modified. The SQL filter expressions use Spark's built-in SQL parser for safety and consistency.
Documentation Update
Website Update Required:
Config Changes:
Contributor's checklist