[SPARK-56521][SQL] Support PartitionPredicate in runtime filters#55382

Closed
szehon-ho wants to merge 8 commits into apache:master from szehon-ho:partition-predicate-runtime-filter

Conversation

Member

@szehon-ho szehon-ho commented Apr 17, 2026

What changes were proposed in this pull request?

This PR introduces PartitionPredicate support in runtime filters for DataSource V2 scans. Currently, PartitionPredicate is only used in static filter pushdown and metadata-only delete paths. This extends the same mechanism to runtime filters (Dynamic Partition Pruning and scalar subqueries).

Changes:

  • SupportsRuntimeV2Filtering: Added supportsIterativeFiltering() and pushedPredicates() default methods. When a scan returns true from supportsIterativeFiltering(), Spark may call filter() multiple times — first with translated V2 predicates, then with PartitionPredicate instances derived from runtime filters. The pushedPredicates() method (mirroring SupportsPushDownV2Filters) allows Spark to determine which predicates were already accepted in the first pass, avoiding duplicate pushdown.
  • BatchScanExec: After the existing runtime filter pushdown, if the scan supports iterative filtering, derives PartitionPredicate instances from DPP expressions and literalized scalar subqueries and pushes them in a second filter() call.
  • PushDownUtils: Refactored pushRuntimeFilters() to track which runtime filter expressions were translated to V2 predicates. Uses pushedPredicates() to exclude filters already accepted in the first pass from PartitionPredicate derivation. Candidates are further gated by filterAttributes() — only runtime filters whose referenced columns are declared in filterAttributes() are eligible for PartitionPredicate derivation, consistent with PartitionPruning's planning-time check.
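The dedup step in PushDownUtils can be sketched in plain Scala (a simplified stand-in: predicates are modeled as strings rather than real V2 Predicate objects, and the filterAttributes() gate is omitted; filtersToTranslated and pushed are names taken from the PR):

```scala
// Runtime filter expressions and their (optional) V2 translations.
// "p = 1" translates cleanly; the other two have no V2 translation.
val runtimeFilters = Seq("p = 1", "f(p) = 2", "q RLIKE 'x'")
val filtersToTranslated: Map[String, String] = Map("p = 1" -> "p = 1")

// Predicates the connector reported as accepted in the first filter() pass.
val pushed: Set[String] = Set("p = 1")

// A filter is a PartitionPredicate candidate only if its translation was
// NOT already accepted; untranslatable or translated-but-rejected filters
// remain candidates for the second pass.
val candidates = runtimeFilters.filter { f =>
  !filtersToTranslated.get(f).exists(pushed.contains)
}
```

Here `candidates` keeps only the RLIKE and function filters, so the already-accepted `p = 1` is never pushed a second time as a PartitionPredicate.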

Why are the changes needed?

Runtime filters (DPP and scalar subqueries) currently push V2 predicates to connectors, but connectors have no way to receive partition-level predicates with evaluable functions. PartitionPredicate wraps a Catalyst expression that connectors can evaluate directly against partition keys, enabling more efficient partition pruning at runtime without needing to translate expressions into the connector's native predicate format.

The pushedPredicates() method is needed to prevent the same logical filter from being pushed twice — once as a translated V2 predicate and again as a PartitionPredicate. The filterAttributes() gate ensures that only filters on declared filterable columns are considered, aligning runtime behavior with the static planning-time checks in PartitionPruning.

This is a sub-task of the DSV2 Enhanced Partition Stats Filtering umbrella (SPARK-55596).

Does this PR introduce any user-facing change?

Yes. Connectors implementing SupportsRuntimeV2Filtering can now:

  • Override supportsIterativeFiltering() to return true and receive PartitionPredicate instances via filter() during runtime filtering.
  • Override pushedPredicates() to report which predicates were accepted, so Spark avoids redundant pushdown.

This is an opt-in API addition; existing connectors are unaffected.
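A connector opting in might look roughly like the following. This is a sketch against a locally mocked trait, since the real SupportsRuntimeV2Filtering interface lives in Spark; predicates are simplified to strings:

```scala
// Minimal local stand-in for Spark's SupportsRuntimeV2Filtering,
// for illustration only.
trait RuntimeV2Filtering {
  def supportsIterativeFiltering(): Boolean = false
  def pushedPredicates(): Array[String] = Array.empty
  def filter(predicates: Array[String]): Unit
}

// A scan that opts in: it accepts every predicate it is given and
// reports them back so Spark can skip duplicate pushdown.
class MyScan extends RuntimeV2Filtering {
  private var accepted: Array[String] = Array.empty

  override def supportsIterativeFiltering(): Boolean = true
  override def pushedPredicates(): Array[String] = accepted

  override def filter(predicates: Array[String]): Unit = {
    // Prune internal partition state here; record what was accepted.
    accepted ++= predicates
  }
}
```

A connector that leaves both defaults alone keeps today's single-pass behavior, which is what makes the addition opt-in.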

How was this patch tested?

Added DataSourceV2EnhancedRuntimePartitionFilterSuite with 12 numbered test cases (and subcases) covering all combinations of:

  • DPP: single partition column, non-first partition column (multi-column)
  • Scalar subquery: translatable, untranslatable (RLIKE, UDF), complex expressions, multiple partition columns
  • First-pass acceptance: DPP translated+accepted, scalar translated+accepted (no PartitionPredicate generated)
  • Mixed scenarios: one filter accepted in first pass + one untranslatable leading to PartitionPredicate
  • Negative tests: scalar subquery on data column (no PartitionPredicate), iterative filtering disabled, partition column not in filterAttributes()

Supporting test infrastructure: InMemoryEnhancedRuntimePartitionFilterTable (with configurable accept-v2-predicates and filter-attributes table properties) and InMemoryTableEnhancedRuntimePartitionFilterCatalog.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor (Claude Opus 4)

@szehon-ho szehon-ho force-pushed the partition-predicate-runtime-filter branch 2 times, most recently from c9c87bd to 939e128 Compare April 17, 2026 01:55
@szehon-ho szehon-ho force-pushed the partition-predicate-runtime-filter branch from 939e128 to 0ba0445 Compare April 17, 2026 01:58
…panion object

Extract runtime filter pushing logic from filteredPartitions into a
companion object method with a pattern match guard, removing the
asInstanceOf cast.
Move the runtime filter pushing logic from the BatchScanExec companion
object to PushDownUtils, co-locating it with the related partition
predicate helpers.
Add a pushedPredicates() API to SupportsRuntimeV2Filtering, mirroring
SupportsPushDownV2Filters. Use it in pushRuntimeFilters to exclude
already-pushed predicates from the second pass and to determine whether
replanning is needed.
…pushedPredicates dedup, and comprehensive tests

- Use pushedPredicates() to avoid deriving PartitionPredicates from
  runtime filters whose V2 translation was already accepted in the
  first filter() pass, preventing duplicate pushdown.
- Gate PartitionPredicate candidates on filterAttributes(), consistent
  with PartitionPruning's planning-time check, using ExprId-based
  AttributeSet.subsetOf comparison.
- Reorganize test suite into 12 numbered cases (with subcases)
  covering all combinations of DPP/scalar, translated/untranslatable,
  accepted/rejected, partition/data column, and filterAttributes.
- Add configurable test table properties (accept-v2-predicates,
  filter-attributes) for targeted scenario testing.
Contributor

@cloud-fan cloud-fan left a comment

I think there's a regression here that will affect every existing SupportsRuntimeV2Filtering / SupportsRuntimeFiltering implementation. The PR description says "existing connectors are unaffected", but reading pushRuntimeFilters and BatchScanExec.filteredPartitions together, the new code path decides whether to re-plan partitions based on pushedPredicates().nonEmpty — and pushedPredicates() has a default that returns an empty array. Existing connectors don't override it, so filtered is false, scan.toBatch.planInputPartitions() is never re-called, and inputPartitions (lazily evaluated before filter()) is what BatchScanExec scans. The connector's filter() side-effect is effectively dropped. Details inline on F1.

Existing DPP V2 test suites don't catch this because they check runtimeFilters on BatchScanExec and final query answers — post-scan FilterExec preserves correctness. Partition pruning effectiveness is not asserted for V2. Case 11 in the new suite looks like it covers this, but the assertPushedPartitionPredicates helper falls through to Seq.empty for any scan class other than the new test table, so its == 0 assertion is trivially true (F5).

Other findings inline are smaller: a redundant/weaker runtime filterAttributes re-check (F2), a naming inconsistency with the sibling static interface (F3), a missing contract detail in the class Javadoc (F4), a scan-local conf leak in the test (F7), a Scaladoc imprecision (F6), and a minor double-call (F8).

}
}

filterableScan.pushedPredicates().nonEmpty
Contributor

[BLOCKING] The old BatchScanExec always re-called scan.toBatch.planInputPartitions() whenever filter() had been invoked; this return value is meant to replace that signal. But pushedPredicates() is a newly added default-returning-empty method. Every existing SupportsRuntimeV2Filtering / SupportsRuntimeFiltering implementation that doesn't override it will have this return false, causing BatchScanExec.filteredPartitions to fall into the else branch and use the pre-filter inputPartitions — the connector's filter() side effect (partitions pruned in its internal state) is then invisible to Spark. This breaks runtime partition pruning for every existing V2 runtime-filter implementation (including the in-repo InMemoryBatchScan and InMemoryV2FilterBatchScan).

The signal should be "was filter() actually called", not "did the scan self-report". Something like:

var filterCalled = false
if (filtersToTranslated.nonEmpty) {
  filterableScan.filter(filtersToTranslated.values.toArray)
  filterCalled = true
}
if (filterableScan.supportsIterativeFiltering()) {
  // ...
  if (partPredicates.nonEmpty) {
    filterableScan.filter(partPredicates.toArray)
    filterCalled = true
  }
}
filterCalled

Please also add a regression test: a scan that doesn't override pushedPredicates() and asserts its filter()-driven partition pruning still takes effect (e.g., via partition count on the resulting batch).
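The suggested signal can be exercised with plain-Scala stand-ins (filtersToTranslated and partPredicates are mocked as string collections; this is a sketch of the control flow, not the real PushDownUtils signature):

```scala
// Returns true iff filter() was actually invoked at least once,
// independent of whatever the scan self-reports via pushedPredicates().
def pushRuntimeFilters(
    filtersToTranslated: Map[String, String],
    partPredicates: Seq[String],
    supportsIterative: Boolean,
    filter: Array[String] => Unit): Boolean = {
  var filterCalled = false
  if (filtersToTranslated.nonEmpty) {
    filter(filtersToTranslated.values.toArray)  // first pass: V2 predicates
    filterCalled = true
  }
  if (supportsIterative && partPredicates.nonEmpty) {
    filter(partPredicates.toArray)              // second pass: PartitionPredicates
    filterCalled = true
  }
  filterCalled
}
```

With this shape, a legacy connector that never overrides pushedPredicates() still returns true after the first pass, so BatchScanExec re-plans its input partitions.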

Member Author

I think this is overstated; there's actually an API, 'supportsIterativeFiltering', that defaults to false, so it should not affect existing connectors.

The rule is now: if the connector returns 'true', it needs to maintain pushedPredicates().

Member Author

That being said, I did add a test for such a connector (one that overrides supportsIterativeFiltering to return true but doesn't implement pushedPredicates()). The expected behavior is that it then gets duplicate predicates in the second round.

val pushed = filterableScan.pushedPredicates().toSet
val candidates = runtimeFilters.filter { f =>
  !filtersToTranslated.get(f).exists(pushed.contains) &&
    f.references.subsetOf(filterAttrs)
Contributor

These candidates are already constrained at planning time: DPP filters in PartitionPruning.scala:82-89 require resExp.references.subsetOf(filterAttrs) via V2ExpressionUtils.resolveRefs; scalar subquery filters in DataSourceV2Strategy.scala:168-173 require f.references.subsetOf(relation.runtimeFilterAttrs) — both using the proper resolver. The runtime reconstruction here uses r.fieldNames.head + output.find(resolver), which drops multi-part paths for nested partition fields and is redundant with planning-time filtering. Can we drop this re-check and rely on the planning-time filters? If it's defense-in-depth, please use V2ExpressionUtils.resolveRefs for consistency with the planning-time path.
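The fieldNames.head concern can be illustrated with a plain-Scala mock of the reference type (FieldReference here is a local stand-in for Spark's NamedReference; the real resolution goes through the session resolver):

```scala
// A stand-in for Spark's NamedReference: a possibly multi-part column path.
case class FieldReference(fieldNames: Seq[String])

// Top-level attribute names of the scan output.
val output = Seq("part", "data")

// Nested partition field, e.g. partitioning by struct_col.region.
val ref = FieldReference(Seq("struct_col", "region"))

// Matching on fieldNames.head alone looks for a top-level "struct_col"
// attribute, silently dropping the nested path: the reference resolves
// to nothing and the filter is never considered a candidate.
val resolvedByHead = output.find(_.equalsIgnoreCase(ref.fieldNames.head))

// A resolver that handles the full multi-part name would have to consider
// the whole path, which is what the planning-time code paths already do.
val fullName = ref.fieldNames.mkString(".")
```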

Member Author

Done. Extracted V2ExpressionUtils.resolveAttributeRefs and use it in both PartitionPruning and PushDownUtils for consistency.

*
* @since 4.2.0
*/
default boolean supportsIterativeFiltering() {
Contributor

SupportsPushDownV2Filters already has supportsIterativePushdown() for the same concept. Two names for one capability is a consistency hazard for connectors that implement both interfaces. Worth aligning — supportsIterativeFiltering matches the local filter() verb, supportsIterativePushdown matches the sibling interface. Happy either way, just ideally one name.

Member Author

@szehon-ho szehon-ho Apr 22, 2026

Yeah, to add naming context: on the runtime-filter side the API is 'filter', so there's no mention of pushdown.

So supportsIterativePushdown may not make sense. But I'm open as well, cc @aokolnychyi

Member Author

@szehon-ho szehon-ho Apr 22, 2026

Actually, never mind: it does make sense, since the other new method is 'pushedPredicates()'. I changed the name to match.

* and only one of them should be implemented by the data sources.
*
* <p>
* <b>Iterative filtering:</b> When {@link #supportsIterativeFiltering()} returns true,
Contributor

The description says filter() "may be called multiple times" but doesn't state the call order. The implementation pushes translated V2 predicates first and PartitionPredicate instances second, and the in-repo InMemoryEnhancedRuntimePartitionFilterBatchScan already relies on that ordering. Worth documenting explicitly so implementations know what to expect in each call.

Member Author

Done. Updated Javadoc to document the two-pass call order and that the second pass excludes filters already accepted via pushedPredicates().

checkAnswer(df, Row(3, 3))

assertHasRuntimeFilters(df)
assertPushedPartitionPredicates(df, 0)
Contributor

This assertion is effectively a no-op here. getPushedPartitionPredicates pattern-matches specifically on InMemoryEnhancedRuntimePartitionFilterBatchScan; for any other scan class (case 11 uses InMemoryV2FilterBatchScan) it returns Seq.empty. So assertPushedPartitionPredicates(df, 0) is trivially true regardless of what actually happened — it would pass even if a PartitionPredicate did get pushed. More reliable options: track raw filter() argument types on the test scan and assert no PartitionPredicate was seen, or add a variant of the new test table that returns supportsIterativeFiltering=false so the existing helper can still inspect it.
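The "track raw filter() argument types" option could look like this in a test scan (a sketch with locally mocked predicate types; PartitionPred stands in for the real PartitionPredicate class):

```scala
// Local stand-ins for the two predicate kinds a test scan might receive.
sealed trait Pred
case class V2Pred(sql: String) extends Pred
case class PartitionPred(sql: String) extends Pred

// A test scan that records every predicate passed to filter(), so a test
// can assert on what actually arrived rather than pattern-matching on a
// specific scan class.
class RecordingScan {
  var seen: Seq[Pred] = Seq.empty
  def filter(preds: Array[Pred]): Unit = { seen ++= preds }
}

val scan = new RecordingScan
scan.filter(Array(V2Pred("p = 1")))

// The assertion a test would make: no PartitionPredicate reached the scan.
val sawPartitionPred = scan.seen.exists(_.isInstanceOf[PartitionPred])
```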

Member Author

Done. Case 11 now uses the enhanced catalog with supports-iterative-filtering=false table property, so the assertPushedPartitionPredicates helper can properly inspect the scan.

Comment on lines +137 to +138
* Only runtime filters that were not already translated are used to derive PartitionPredicates
* in the second pass, avoiding duplicate pushdown.
Contributor

The code filter is !filtersToTranslated.get(f).exists(pushed.contains) — i.e., exclude filters whose translation was already accepted (present in pushedPredicates()). Translated-but-rejected filters are still candidates. The inline comment on line 161-163 already states this correctly.

Suggested change
* Only runtime filters that were not already translated are used to derive PartitionPredicates
* in the second pass, avoiding duplicate pushdown.
* Only runtime filters whose translated form was not already accepted by the data source in
* the first pass are used to derive PartitionPredicates in the second pass, avoiding duplicate
* pushdown.

Member Author

Done. Updated scaladoc to clarify "accepted" vs "rejected".


test("case 11: supportsIterativeFiltering is false -> no PartitionPredicate") {
  val baseCatalog = "testv2filterNoIterative"
  spark.conf.set(s"spark.sql.catalog.$baseCatalog",
Contributor

This conf isn't unset; only spark.sessionState.catalogManager.reset() runs in after. Consider wrapping with withSQLConf (or unsetting at the end) so the conf is cleaned up between tests.

Member Author

Done. Wrapped with withSQLConf.

if (filterableScan.supportsIterativeFiltering()) {
  val filterAttrs = AttributeSet(filterableScan.filterAttributes()
    .flatMap(r => output.find(a => SQLConf.get.resolver(a.name, r.fieldNames.head))))
  val pushed = filterableScan.pushedPredicates().toSet
Contributor

Minor — pushedPredicates() is invoked twice (here and again on line 180 for the return signal). For a connector with a non-trivial implementation, one call would do:

val before = filterableScan.pushedPredicates().toSet
// ... second-pass logic uses `before` ...
// at the end, reuse `filterCalled` (from the F1 suggestion) instead of re-querying pushedPredicates

Member Author

Done. Refactored to val approach — pushedPredicates() is now called only once, and the return value uses translatedFiltersPushed || partPredicatesPushed.

…test improvements

- Return translatedFiltersPushed || partPredicatesPushed instead of
  pushedPredicates().nonEmpty, so filter() side effects are visible
  even if the connector does not override pushedPredicates().
- Extract V2ExpressionUtils.resolveAttributeRefs to share resolution
  logic between PartitionPruning and PushDownUtils.
- Clarify SupportsRuntimeV2Filtering javadoc: document two-pass call
  order and that the second pass excludes already-accepted filters.
- Refactor case 11 to use the enhanced catalog with
  supports-iterative-filtering=false property and withSQLConf.
- Add regression test for buggy connector that omits first-pass
  filters from pushedPredicates().
- Code tidying in InMemoryEnhancedRuntimePartitionFilterTable.
Contributor

@cloud-fan cloud-fan left a comment

Re-review of 6b07f10: 7 addressed, 1 remaining, 0 new.

Addressed:

  • F1 (BLOCKING): pushRuntimeFilters now returns translatedFiltersPushed || partPredicatesPushed — a direct signal of whether filter() was invoked, independent of pushedPredicates(). Added a regression test (pushedPredicates() omits first-pass filters -> second round still prunes) that covers the buggy-connector case.
  • F2: the runtime filterAttributes re-check now uses the new shared V2ExpressionUtils.resolveAttributeRefs helper, so PushDownUtils and PartitionPruning use the same resolver.
  • F4: class-level Javadoc on SupportsRuntimeV2Filtering now documents the two-pass call order (translated V2 predicates first, PartitionPredicate second).
  • F5: case 11 now uses the enhanced catalog with a new supports-iterative-filtering=false table property, so assertPushedPartitionPredicates actually inspects the scan instead of trivially matching Seq.empty.
  • F6: scaladoc on pushRuntimeFilters now distinguishes "not already accepted" from "not already translated".
  • F7: case 11 now wraps its scan-local conf in withSQLConf for proper cleanup.
  • F8: pushedPredicates() is invoked once per pushRuntimeFilters call (via a val that is reused for both the candidate filter and the return signal).

Remaining:

  • F3: supportsIterativeFiltering vs. the sibling SupportsPushDownV2Filters#supportsIterativePushdown naming is still open pending @aokolnychyi's input (#55382 (comment)). No blocker from my side either way — just one name.

New: none.

…tivePushdown

Align with SupportsPushDownV2Filters.supportsIterativePushdown() naming.
Contributor

@cloud-fan cloud-fan left a comment

Re-review of f426fbc: 1 addressed, 0 remaining, 0 new — F3 is now resolved.

Addressed this round:

  • F3: supportsIterativeFiltering renamed to supportsIterativePushdown, aligning with SupportsPushDownV2Filters#supportsIterativePushdown(). Test property key and test case names updated consistently; no stale references remain.

All eight findings from the original review are now resolved. No new concerns on my side.

@cloud-fan
Contributor

thanks. merging to master!

@cloud-fan cloud-fan closed this in 4c56079 Apr 23, 2026