[HUDI-9653] Add Spark Procedures for showing requested, completed cleans #13719

Merged
vinothchandar merged 11 commits into apache:master from vamshikrishnakyatham:HUDI-9653-Add-Spark-procedures-for-showing-requested-completed-cleans
Aug 21, 2025

vamshikrishnakyatham (Contributor) commented Aug 14, 2025

What is the purpose of the pull request

This PR implements new SQL procedures for viewing Hudi cleaning operations to provide comprehensive visibility into table maintenance activities:

  • show_cleans: Display completed clean operations with timing and file deletion statistics
  • show_clean_plans: Display clean plans (both requested and completed) with retention policies and state information
  • show_cleans_metadata: Display partition-level clean metadata with detailed statistics
  • Enhanced with showArchived parameter: Access both active and archived timeline data

These procedures help users monitor cleaning performance across the entire timeline history, debug storage issues with precise filtering, and understand table maintenance operations using familiar SQL syntax.

Brief change log

  • Implement ShowCleansProcedure for displaying completed clean operations
  • Implement ShowCleansPlanProcedure for displaying clean plans with state information (REQUESTED, INFLIGHT, COMPLETED)
  • Add ShowCleansPartitionMetadataProcedure for partition-level clean metadata
    • Support limit parameter for result pagination
  • Add showArchived parameter to all procedures for accessing archived timeline data
  • Register new procedures in HoodieProcedures.scala
  • Add comprehensive test suite TestShowCleansProcedures.scala
  • Include proper error handling for non-existent tables and invalid filter expressions
  • Follow existing procedure patterns and coding standards

Change Logs

Context: Added three new SQL procedures to provide comprehensive visibility into Hudi's cleaning operations, including historical data from archived timelines, which was previously only available through CLI tools or direct timeline inspection.

Summary:

  • New show_cleans procedure displays completed cleaning operations with metadata like timing, files deleted, and retention policies
  • New show_clean_plans procedure shows clean operations in all states (REQUESTED, INFLIGHT, COMPLETED) with state field
  • New show_cleans_metadata procedure provides partition-level cleaning details for debugging
  • showArchived parameter: Access both active and archived timeline data using metaClient.getArchivedTimeline.mergeTimeline(metaClient.getActiveTimeline)
  • All procedures follow existing Hudi procedure patterns and include comprehensive error handling
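
To make the showArchived and limit semantics above concrete, here is a minimal, self-contained sketch. It does not use Hudi's classes: `Instant`, `completedCleans`, and the newest-first ordering are assumptions introduced only for illustration, and plain sequences stand in for the active and archived timelines.

```scala
object MergedTimelineSketch {
  // Hypothetical stand-in for a timeline instant; not Hudi's HoodieInstant.
  final case class Instant(timestamp: String, action: String, state: String)

  // Returns completed clean instants, newest first, capped at `limit`.
  // With showArchived = true the archived and active instants are combined,
  // loosely mirroring the mergeTimeline call mentioned in the summary.
  // The newest-first ordering is an assumption of this sketch.
  def completedCleans(
      active: Seq[Instant],
      archived: Seq[Instant],
      showArchived: Boolean,
      limit: Int): Seq[Instant] = {
    val timeline = if (showArchived) archived ++ active else active
    timeline
      .filter(i => i.action == "clean" && i.state == "COMPLETED")
      .sortBy(_.timestamp)(Ordering[String].reverse)
      .take(limit)
  }
}
```

In this model, calling `completedCleans` with `showArchived = false` only ever sees instants from the active sequence, which is the behavior the default procedure call is described as having.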

Impact

Public API Changes:

  • Adds three new SQL procedures: show_cleans, show_clean_plans, show_cleans_metadata
  • All procedures accept optional parameters:
    • limit parameter for pagination
    • showArchived parameter for accessing archived timeline data
    • filter parameter for restricting results with a SQL expression
  • New procedures are registered and available via CALL statements in Spark SQL

User-Facing Features:

  • Users can now monitor cleaning operations directly via SQL across entire timeline history
  • Active data access: call show_clean_plans(table => 'my_table', limit => 5)
  • Archived data access: call show_cleans(table => 'my_table', showArchived => true, limit => 5)
  • Better debugging capabilities for storage and maintenance issues
  • Consistent interface with other Hudi procedures

Performance Impact:

  • Minimal performance impact: procedures only read timeline metadata
  • No impact on write operations or table performance
  • Procedures use existing timeline APIs and add no I/O beyond reading timeline metadata
  • With showArchived enabled, reading the archived timeline may be slower for tables with a large archive

Risk level (write none, low medium or high below)

Low

This is a new feature addition that only adds SQL procedures without modifying existing functionality. The changes are isolated, well-tested, and follow established patterns. No existing APIs or behaviors are modified. The SQL filter expressions use Spark's built-in SQL parser for safety and consistency.

Documentation Update

Website Update Required:

  • New SQL procedures need to be documented on Hudi website under SQL procedures section
  • Will create follow-up JIRA ticket for website documentation update
  • Code includes comprehensive ScalaDoc documentation for all new classes and methods

Config Changes:

  • None - no new configurations added or existing defaults changed

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable including showArchived and filtering capabilities
  • CI passed

- Implement ShowCleansProcedure for displaying completed clean operations
- Implement ShowCleansPlanProcedure for displaying requested clean plans and schedules
- Add ShowCleansPartitionMetadataProcedure for partition-level clean metadata
- Register new procedures in HoodieProcedures.scala
- Add comprehensive test suite TestShowCleansProcedures.scala with edge cases
- Support limit parameter for result pagination
- Include proper error handling for non-existent tables
- Follow existing procedure patterns and coding standards
github-actions bot added the size:L (PR with lines of changes in (300, 1000]) label Aug 14, 2025
yihua self-assigned this Aug 14, 2025

rahil-c (Collaborator) commented Aug 18, 2025

@vamshikrishnakyatham can you check the GitHub CI for issues when you get a chance?

Error:  /home/runner/work/hudi/hudi/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/ShowCleansPlanProcedure.scala:83: error: type mismatch;
Error:   found   : scala.collection.mutable.Buffer[org.apache.hudi.common.table.timeline.HoodieInstant]
Error:   required: Seq[org.apache.hudi.common.table.timeline.HoodieInstant]
Error:      cleanInstants.sortWith((a, b) => comparator.compare(a, b) < 0)
Error:                            ^
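
The error above is consistent with a Scala 2.13 build, where `Seq` is an alias for `immutable.Seq` and a `mutable.Buffer` (what `asScala` returns for a Java list) no longer conforms to it. The page does not show how the PR resolved it; the sketch below shows one common fix, an explicit `.toSeq` before sorting. `Instant`, `byTimestamp`, and `sorted` are hypothetical names, not Hudi code.

```scala
import java.util.Comparator
import scala.collection.JavaConverters._

object SortInstantsSketch {
  // Hypothetical stand-in for HoodieInstant.
  final case class Instant(timestamp: String)

  // Orders instants by their timestamp string.
  val byTimestamp: Comparator[Instant] = new Comparator[Instant] {
    override def compare(a: Instant, b: Instant): Int =
      a.timestamp.compareTo(b.timestamp)
  }

  // asScala yields a mutable.Buffer; toSeq converts it so the declared
  // Seq return type is satisfied on both Scala 2.12 and 2.13.
  def sorted(
      instants: java.util.List[Instant],
      comparator: Comparator[Instant]): Seq[Instant] =
    instants.asScala.toSeq.sortWith((a, b) => comparator.compare(a, b) < 0)
}
```

`scala.collection.JavaConverters` is deprecated on 2.13 in favor of `scala.jdk.CollectionConverters`, but it is used here because it compiles on both versions.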

This commit introduces SQL filter expression support for Hudi Spark procedures,
enabling users to apply standard SQL expressions to filter timeline data.

Key features:
- Added filter parameter to show_clean_plans, show_cleans, and show_cleans_metadata procedures
- Supports standard SQL expressions for filtering procedure results (e.g., plan_time > '20250101000000')
- Automatic expression parsing and validation using Spark's SQL parser
- Proper column binding and type conversion for expression evaluation
- Comprehensive error handling with descriptive error messages

Examples:
- call show_clean_plans(table => 'my_table', filter => 'plan_time > 20250101000000')
- call show_cleans(table => 'my_table', filter => 'clean_time >= 20250101 AND state = COMPLETED')
- call show_clean_plans(table => 'my_table', filter => 'total_partitions_to_clean > 0')

Implementation:
- Created HoodieProcedureFilterUtils with SQL expression parsing and evaluation
- Added filter validation to prevent invalid column references and syntax errors
- Enhanced existing procedures with optional filter parameter (default empty)
- Added comprehensive test coverage for time-based and state-based filtering
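
The validate-then-evaluate flow described in this commit message can be pictured with a much simplified model. The commit states that the real code uses Spark's SQL parser; the sketch below instead uses a small hand-built expression tree, so `Expr`, `Gt`, `Eq`, `And`, `validate`, and `eval` are all hypothetical and only illustrate rejecting unknown columns before rows are filtered.

```scala
object FilterSketch {
  // Tiny expression tree standing in for a parsed SQL filter.
  sealed trait Expr
  final case class Gt(column: String, value: Long) extends Expr
  final case class Eq(column: String, value: String) extends Expr
  final case class And(left: Expr, right: Expr) extends Expr

  // Collects every column name the expression refers to.
  def referencedColumns(e: Expr): Set[String] = e match {
    case Gt(c, _)  => Set(c)
    case Eq(c, _)  => Set(c)
    case And(l, r) => referencedColumns(l) ++ referencedColumns(r)
  }

  // Rejects expressions that mention columns missing from the schema.
  def validate(e: Expr, schema: Set[String]): Either[String, Expr] = {
    val unknown = referencedColumns(e) -- schema
    if (unknown.isEmpty) Right(e)
    else Left(s"Unknown column(s) in filter: ${unknown.toSeq.sorted.mkString(", ")}")
  }

  // Evaluates the expression against one row; type mismatches yield false.
  def eval(e: Expr, row: Map[String, Any]): Boolean = e match {
    case Gt(c, v)  => row.get(c).exists { case n: Long => n > v; case _ => false }
    case Eq(c, v)  => row.get(c).contains(v)
    case And(l, r) => eval(l, row) && eval(r, row)
  }
}
```

Validating first means a typo in a column name surfaces as a descriptive error instead of silently filtering out every row.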
…rk-procedures-for-showing-requested-completed-cleans

rahil-c (Collaborator) commented Aug 19, 2025

@yihua @nsivabalan @dannyhchen @jonvex I did an initial pass of the PR and the core logic looks good to me. Was wondering if I can have a committer take a look as well since I will not have the ability to merge it.

- Implement ShowCleansProcedure for displaying completed clean operations
- Implement ShowCleansPlanProcedure for displaying requested clean plans and schedules
- Add ShowCleansPartitionMetadataProcedure for partition-level clean metadata
- Register new procedures in HoodieProcedures.scala
- Add comprehensive test suite TestShowCleansProcedures.scala with edge cases
- Support limit parameter for result pagination
- Support showArchived parameter for querying both active and archived cleans
- Include proper error handling for non-existent tables
- Follow existing procedure patterns and coding standards
vinothchandar (Member) left a comment

Thanks for the review, @rahil-c. Waiting on CI.

Good job, @vamshikrishnakyatham!

vinothchandar assigned vinothchandar and unassigned yihua Aug 20, 2025

rahil-c (Collaborator) commented Aug 20, 2025

@vamshikrishnakyatham I think the CI failures are likely flaky tests. You will need to either ask one of the Hudi committers to rerun the failed jobs or push another change to trigger CI.

…rk-procedures-for-showing-requested-completed-cleans
vinothchandar (Member) commented

Rekicked CI.

rahil-c (Collaborator) left a comment

Thanks @vamshikrishnakyatham for this change! I have approved from my side.

apache deleted a comment from hudi-bot Aug 21, 2025

hudi-bot (Collaborator) commented

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

vinothchandar merged commit e11ee97 into apache:master Aug 21, 2025
61 checks passed
alexr17 pushed a commit to alexr17/hudi that referenced this pull request Aug 25, 2025
…ans (apache#13719)

* [HUDI-9653] Add show_cleans and show_clean_plans SQL procedures

- Implement ShowCleansProcedure for displaying completed clean operations
- Implement ShowCleansPlanProcedure for displaying requested clean plans and schedules
- Add ShowCleansPartitionMetadataProcedure for partition-level clean metadata
- Register new procedures in HoodieProcedures.scala
- Add comprehensive test suite TestShowCleansProcedures.scala with edge cases
- Support limit parameter for result pagination
- Include proper error handling for non-existent tables
- Follow existing procedure patterns and coding standards

* [HUDI-9653] Add support for showArchived in the clean procedures

* [HUDI-9653] Add SQL filter expression support for Spark procedures

This commit introduces SQL filter expression support for Hudi Spark procedures,
enabling users to apply standard SQL expressions to filter timeline data.

Key features:
- Added filter parameter to show_clean_plans, show_cleans, and show_cleans_metadata procedures
- Supports standard SQL expressions for filtering procedure results (e.g., plan_time > '20250101000000')
- Automatic expression parsing and validation using Spark's SQL parser
- Proper column binding and type conversion for expression evaluation
- Comprehensive error handling with descriptive error messages

Examples:
- call show_clean_plans(table => 'my_table', filter => 'plan_time > 20250101000000')
- call show_cleans(table => 'my_table', filter => 'clean_time >= 20250101 AND state = COMPLETED')
- call show_clean_plans(table => 'my_table', filter => 'total_partitions_to_clean > 0')

Implementation:
- Created HoodieProcedureFilterUtils with SQL expression parsing and evaluation
- Added filter validation to prevent invalid column references and syntax errors
- Enhanced existing procedures with optional filter parameter (default empty)
- Added comprehensive test coverage for time-based and state-based filtering

* [HUDI-9653] Add show_cleans and show_clean_plans SQL procedures
- Implement ShowCleansProcedure for displaying completed clean operations
- Implement ShowCleansPlanProcedure for displaying requested clean plans and schedules
- Add ShowCleansPartitionMetadataProcedure for partition-level clean metadata
- Register new procedures in HoodieProcedures.scala
- Add comprehensive test suite TestShowCleansProcedures.scala with edge cases
- Support limit parameter for result pagination
- Support showArchived parameter to support querying both active and archived cleans
- Include proper error handling for non-existent tables
- Follow existing procedure patterns and coding standards
