
[WIP][SPARK-56479][SQL] df.cache() with DSv2 interfaces to enable V2ScanRelationPushDown optimizer rules #55017

Draft
dbtsai wants to merge 2 commits into apache:master from dbtsai:dfCache

Conversation


@dbtsai dbtsai commented Mar 25, 2026

What changes were proposed in this pull request?

df.cache() is backed by InMemoryRelation, a pre-DSv2 logical plan node. Queries on cached
DataFrames skip the entire V2ScanRelationPushDown optimizer batch, losing column pruning, filter
pushdown, sort-order propagation, and standard statistics reporting that all DSv2-backed sources
benefit from.

This PR introduces a thin DSv2 wrapper layer — InMemoryCacheTable, InMemoryScanBuilder, and
InMemoryCacheScan — in a new file InMemoryCacheTable.scala. CacheManager.useCachedData() now
substitutes matching plan fragments with DataSourceV2Relation(InMemoryCacheTable) instead of bare
InMemoryRelation, so the V2ScanRelationPushDown batch fires on cached DataFrames just as it does
for any other DSv2 source.

The DSv2 interfaces implemented:

| Interface | Benefit |
| --- | --- |
| SupportsPushDownRequiredColumns | Column pruning: InMemoryTableScanExec deserializes only the requested columns |
| SupportsPushDownV2Filters | Filter pushdown: predicates are recorded for per-batch min/max pruning via CachedBatchSerializer.buildFilter; all predicates are returned (category-2) so a post-scan FilterExec is always added for row-level evaluation |
| SupportsReportOrdering | Sort-order propagation: cached sort order is visible to V2ScanPartitioningAndOrdering, eliminating redundant sorts on ordered cached data |
| SupportsReportStatistics | Statistics: accurate row count, size, and column statistics (including those gathered via ANALYZE) are reported to the optimizer and AQE via the standard V2 path |
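To make the pruning handshake concrete, here is a minimal self-contained Scala sketch of the SupportsPushDownRequiredColumns flow; Row, CacheScanBuilder, and PrunedScan are simplified stand-ins invented for illustration, not Spark's actual DSv2 interfaces:

```scala
// Simplified stand-ins for the DSv2 column-pruning handshake.
// A "row" keeps named column values; the builder narrows the schema.
final case class Row(values: Map[String, Any])

final case class PrunedScan(schema: Seq[String], rows: Seq[Row]) {
  // Only the requested columns are "deserialized" per row.
  def read(): Seq[Seq[Any]] = rows.map(r => schema.map(r.values))
}

final class CacheScanBuilder(fullSchema: Seq[String], rows: Seq[Row]) {
  private var required: Seq[String] = fullSchema

  // Analogue of SupportsPushDownRequiredColumns.pruneColumns: the optimizer
  // tells the source which columns the query actually needs.
  def pruneColumns(cols: Seq[String]): Unit =
    required = fullSchema.filter(cols.contains) // keep the cache's column order

  def build(): PrunedScan = PrunedScan(required, rows)
}
```

The point of the sketch is only the shape of the contract: the optimizer calls pruneColumns during V2ScanRelationPushDown, and the resulting scan's schema drives how much work the serializer does.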

Physical execution is fully preserved. A new case in DataSourceV2Strategy intercepts
DataSourceV2ScanRelation(InMemoryCacheScan) before the generic BatchScanExec path and routes it
back to InMemoryTableScanExec. No change to the columnar scan hot path.
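The interception relies on case ordering in the strategy: the cache-specific pattern must be tried before the generic BatchScanExec case. A toy sketch with stand-in types (not Spark's planner API):

```scala
// Stand-in plan and exec node types, invented for illustration.
sealed trait Scan
case object InMemoryCacheScanStub extends Scan
case object OtherScan extends Scan

final case class ScanRelation(scan: Scan)

sealed trait Exec
case object InMemoryTableScanExecStub extends Exec
case object BatchScanExecStub extends Exec

// The cache-specific case is listed first, so cached scans keep their
// existing columnar exec node; everything else falls through to the
// generic V2 path.
val planScan: ScanRelation => Exec = {
  case ScanRelation(InMemoryCacheScanStub) => InMemoryTableScanExecStub
  case ScanRelation(_)                     => BatchScanExecStub
}
```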

A CachedRelation extractor object matches any of the three plan forms a cached relation takes
across query stages — InMemoryRelation, DataSourceV2Relation(InMemoryCacheTable), and
DataSourceV2ScanRelation(InMemoryCacheScan) — providing backward compatibility for all existing
pattern-match sites.
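The extractor idea can be sketched with stand-in plan nodes; the real CachedRelation matches Spark's actual logical-plan classes, but the shape of the unapply is the same:

```scala
// Simplified stand-ins for the three plan shapes a cached relation takes.
sealed trait LogicalPlan
final case class InMemoryRelationStub(name: String) extends LogicalPlan
final case class DSv2Relation(table: Any) extends LogicalPlan
final case class DSv2ScanRelation(scan: Any) extends LogicalPlan
final case class CacheTableStub(cached: InMemoryRelationStub)
final case class CacheScanStub(cached: InMemoryRelationStub)

object CachedRelation {
  // One unapply covers all three forms a cached relation can take across
  // analysis/optimization, so existing pattern-match sites keep working.
  def unapply(plan: LogicalPlan): Option[InMemoryRelationStub] = plan match {
    case r: InMemoryRelationStub            => Some(r)
    case DSv2Relation(CacheTableStub(r))    => Some(r)
    case DSv2ScanRelation(CacheScanStub(r)) => Some(r)
    case _                                  => None
  }
}
```

Call sites can then write `case CachedRelation(r) => ...` regardless of which stage the plan is in.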

Why are the changes needed?

Without this change, every query on a cached DataFrame ignores standard optimizer rules that every
DSv2 source benefits from. The most impactful gap is column pruning: even
SELECT one_col FROM cached_wide_table causes all columns to be deserialized from the in-memory
columnar store.

Does this PR introduce any user-facing change?

Yes, in a positive way:

  • Queries that access a subset of columns from a cached DataFrame will be significantly faster.
  • Queries on sorted cached DataFrames may eliminate redundant sort operations.
  • The ANALYZE column statistics API for cached queries now correctly propagates stats through the
    optimized plan.

The InMemoryRelation type is still used internally and visible in CachedData.cachedRepresentation.
The change is transparent: df.cache(), spark.catalog.cacheTable(), and related APIs behave
identically from the user's perspective.

How was this patch tested?

All existing tests in CachedTableSuite (104 tests), InMemoryColumnarQuerySuite,
DatasetCacheSuite, UDFSuite, DataSourceV2SQLSuite, and LogicalPlanTagInSparkPlanSuite pass
without modification to test logic (only callsite updates to use the new CachedRelation extractor).

A new benchmark InMemoryCacheDSv2Benchmark was added. Results on 1M rows (AQE off, single
partition):

Column pruning - 1000000 rows, 10 cols, select 2:

    sum 2 of 10 cols (column pruning via DSv2)    19 ms    1.0X
    sum all 10 cols (no pruning - pre-DSv2)       54 ms    0.3X   (~2.8x speedup from pruning)

Planning overhead - 1000 plan-only iterations:

    optimizedPlan (DSv2 path, V2ScanRelationPushDown)    254 ms total (~0.25 ms/plan)

Column pruning yields a ~3x speedup when accessing 2 of 10 cached columns. Planning overhead
from the additional optimizer rules is ~0.25ms per query, negligible in practice.


viirya commented Mar 25, 2026

> The most impactful gap is column pruning: even SELECT one_col FROM cached_wide_table causes all columns to be deserialized from the in-memory columnar store.

Currently InMemoryTableScanExec already has column pruning. It only deserializes the necessary columns.

@dongjoon-hyun dongjoon-hyun marked this pull request as draft March 30, 2026 23:30
@dbtsai dbtsai changed the title [WIP][SPARK-XXXXX][SQL] df.cache() with DSv2 interfaces to enable V2ScanRelationPushDown optimizer rules [WIP][SPARK-56479][SQL] df.cache() with DSv2 interfaces to enable V2ScanRelationPushDown optimizer rules Apr 14, 2026

@szehon-ho szehon-ho left a comment


Thanks for the PR! I spent some time studying the approach. Before going deeper into implementation details, I'd like to step back and discuss whether this is the right architecture.

I mapped out which capabilities are genuinely new vs. already provided by the existing InMemoryScans + InMemoryTableScanExec path:

| Capability | Already works today? | How it works today |
| --- | --- | --- |
| Column pruning | Yes | InMemoryScans uses pruneFilterProject → InMemoryTableScanExec(attributes = prunedCols) → serializer only deserializes requested columns |
| Batch-level filter pruning (min/max) | Yes | Filters passed to InMemoryTableScanExec.predicates → CachedBatchSerializer.buildFilter |
| Sort order propagation | Yes | InMemoryTableScanExec.outputOrdering derives from cachedPlan.outputOrdering |
| Statistics reporting | Yes | InMemoryRelation.computeStats() reports accumulator-based size/rowCount after materialization |
| Dynamic Partition Pruning | No | PartitionPruning.getFilterableTableScan has no InMemoryRelation case |
| Per-partition limit pushdown | No | Not supported today |
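The batch-level min/max pruning row can be illustrated with a toy analogue of CachedBatchSerializer.buildFilter; the Batch type and the single lower-bound predicate are invented for this sketch:

```scala
// A cached batch carries per-batch min/max stats alongside its rows.
final case class Batch(min: Int, max: Int, rows: List[Int])

// Toy buildFilter for a predicate of the form "col >= lowerBound":
// batches whose max can't satisfy it are skipped entirely; surviving
// batches still need row-level filtering afterwards (hence the
// post-scan FilterExec mentioned in the PR description).
def buildFilter(lowerBound: Int): Seq[Batch] => Seq[Batch] =
  batches => batches.filter(_.max >= lowerBound)
```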

The DSv2 wrapping adds the last two, but comes with tradeoffs:

  • Stats behavioral change: InMemoryCacheScan.estimateStatistics() applies a column-ratio scaling heuristic (sizeInBytes * prunedAttrs.size / output.size) that doesn't exist today. CBO sees different statistics, which could affect join strategy selection.
  • EXPLAIN / plan shape change: withCachedData shows DataSourceV2Relation instead of InMemoryRelation; optimizedPlan shows DataSourceV2ScanRelation. This is user-visible and could affect tooling or downstream code that inspects plans.

Given that the two genuinely new capabilities (DPP, limit pushdown) could alternatively be added directly to InMemoryScans / InMemoryTableScanExec with much less surface area, do we want to proceed with the DSv2 wrapping approach? The main argument for it would be future-proofing (new V2 capabilities auto-apply), but the tradeoff is coupling the in-memory cache to the full V2 contract surface area.

Would it make sense to either:

  1. Add DPP + limit support directly to the existing path (targeted, low-risk), or
  2. If we prefer the DSv2 approach for extensibility, add a config toggle so InMemoryScans stays alive as a fallback?


/**
* Fallback strategy for cached in-memory tables when the DSv2 cache path is disabled
* (spark.sql.inMemoryColumnarStorage.useDataSourceV2 = false).

This config spark.sql.inMemoryColumnarStorage.useDataSourceV2 doesn't exist anywhere in the codebase. Also, after this PR InMemoryScans becomes unreachable dead code since CacheManager always wraps in DataSourceV2Relation — should we add a config toggle, or remove this strategy?

-  @transient relation: InMemoryRelation)
+  @transient relation: InMemoryRelation,
+  limit: Option[Int] = None,
+  runtimeFilters: Seq[Expression] = Nil)

The case class now has 5 fields — sql/hive/src/test/scala/org/apache/spark/sql/hive/CachedTableSuite.scala line 42 still has InMemoryTableScanExec(_, _, relation) which won't compile.

* The DPP pipeline injects [[DynamicPruning]] expressions into the plan, which
* [[DataSourceV2Strategy]] separates and passes as runtimeFilters to the exec node.
*/
override def filter(predicates: Array[V2Predicate]): Unit = {}

filter() is a no-op — runtime filtering is actually routed through InMemoryTableScanExec.runtimeFilters instead. This deviates from the V2 contract where filter() is expected to prune InputPartitions.
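For reference, the contract being described expects filter() to prune InputPartitions using runtime predicate values. A self-contained toy version of that expectation (all names are stand-ins, and the runtime predicate is modeled as an IN-set of partition keys):

```scala
// Stand-in for a DSv2 InputPartition keyed by a partition value.
final case class InputPartitionStub(partitionKey: Int)

final class RuntimeFilterableScan(private var parts: Seq[InputPartitionStub]) {
  def partitions: Seq[InputPartitionStub] = parts

  // Contract-style filter(): drop partitions that cannot match the runtime
  // predicate (here, a set of allowed keys produced by a DPP subquery),
  // rather than leaving pruning entirely to the exec node.
  def filter(allowedKeys: Set[Int]): Unit =
    parts = parts.filter(p => allowedKeys.contains(p.partitionKey))
}
```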

DataSourceV2Strategy.withProjectAndFilter(
project, filters, localScanExec, needsUnsafeConversion = false) :: Nil

case PhysicalOperation(project, filters,

InMemoryCacheScan implements SupportsRuntimeV2Filtering and reports filterAttributes(), but this special case doesn't extract scalar subquery filters on those attributes the way the generic DataSourceV2ScanRelation case does (lines 168-183 on master). Should these be handled consistently?

@szehon-ho

Just to clarify: this looks like a migration, and I'm mainly pointing out some risks. (I'm OK with doing it if we feel it's worth doing.)


dbtsai commented Apr 21, 2026

Thanks @szehon-ho and @viirya for the detailed review — you're both right that my original description overstated the gaps.

After double-checking with git blame and reading InMemoryScans carefully:

| Capability | Already works today? | Via |
| --- | --- | --- |
| Column pruning | Yes | InMemoryScans uses pruneFilterProject → InMemoryTableScanExec(prunedCols, ...) |
| Filter pushdown (batch min/max) | Yes | InMemoryTableScanExec.predicates → CachedBatchSerializer.buildFilter |
| Sort-order propagation | Yes | InMemoryTableScanExec.outputOrdering from cachedPlan.outputOrdering |
| Statistics | Yes | InMemoryRelation.computeStats() |
| Dynamic Partition Pruning | No | New via SupportsRuntimeV2Filtering |
| Per-partition LIMIT pushdown | No | New via SupportsPushDownLimit |

I've pushed a follow-up commit that:

  1. Adds a config toggle spark.sql.inMemoryColumnarStorage.enableDatasourceV2 (default true). When set to false, CacheManager.useCachedData returns InMemoryRelation directly, falling back to the existing InMemoryScans planner path. This addresses the concern about plan shape and stats behavioral changes.

  2. Fixes the ScalaDoc in InMemoryCacheTable and InMemoryScanBuilder to accurately state that the genuinely new capabilities are DPP and per-partition LIMIT; the other interfaces mirror what InMemoryScans/pruneFilterProject already provides.

  3. Updates the InMemoryScans comment to reference the actual config key.
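The toggle in item 1 amounts to a conf-gated branch in the substitution step. A simplified sketch with stand-in plan types (only the branching semantics mirror the change described; nothing here is Spark's actual API):

```scala
// Stand-in plan nodes for the two substitution outcomes.
sealed trait Plan
final case class CachedRel(name: String) extends Plan      // InMemoryRelation analogue
final case class V2Rel(wrapped: CachedRel) extends Plan    // DataSourceV2Relation analogue

// With the flag on, the cached relation is wrapped so the
// V2ScanRelationPushDown batch applies; with it off, the bare relation is
// returned and the legacy InMemoryScans planner path handles it.
def useCachedData(cached: CachedRel, dsv2Enabled: Boolean): Plan =
  if (dsv2Enabled) V2Rel(cached) else cached
```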

Regarding the two tradeoffs you flagged:

  • Stats behavioral change: The column-ratio scaling heuristic in estimateStatistics() is indeed new behavior. I'll add a note to the PR description and consider whether to make it match the existing behavior more closely.
  • Plan shape change: With the config toggle users can opt out if tooling depends on InMemoryRelation appearing in EXPLAIN output.

Does this approach address the architecture concern, or would you prefer I pursue option 1 (add DPP + limit directly to InMemoryScans/InMemoryTableScanExec without DSv2 wrapping)?
