
[WIP][SPARK-56479][SQL] df.cache() with DSv2 interfaces to enable V2ScanRelationPushDown optimizer rules #55017

Draft
dbtsai wants to merge 2 commits into apache:master from dbtsai:dfCache

Conversation


@dbtsai dbtsai commented Mar 25, 2026

What changes were proposed in this pull request?

df.cache() is backed by InMemoryRelation, a pre-DSv2 logical plan node. Queries on cached
DataFrames skip the entire V2ScanRelationPushDown optimizer batch, losing column pruning, filter
pushdown, sort-order propagation, and standard statistics reporting that all DSv2-backed sources
benefit from.

This PR introduces a thin DSv2 wrapper layer — InMemoryCacheTable, InMemoryScanBuilder, and
InMemoryCacheScan — in a new file InMemoryCacheTable.scala. CacheManager.useCachedData() now
substitutes matching plan fragments with DataSourceV2Relation(InMemoryCacheTable) instead of bare
InMemoryRelation, so the V2ScanRelationPushDown batch fires on cached DataFrames just as it does
for any other DSv2 source.

The DSv2 interfaces implemented:

| Interface | Benefit |
| --- | --- |
| SupportsPushDownRequiredColumns | Column pruning: InMemoryTableScanExec deserializes only the requested columns |
| SupportsPushDownV2Filters | Filter pushdown: predicates are recorded for per-batch min/max pruning via CachedBatchSerializer.buildFilter; all predicates are returned (category-2) so a post-scan FilterExec is always added for row-level evaluation |
| SupportsReportOrdering | Sort-order propagation: cached sort order is visible to V2ScanPartitioningAndOrdering, eliminating redundant sorts on ordered cached data |
| SupportsReportStatistics | Statistics: accurate row count, size, and column statistics (including those gathered via ANALYZE) are reported to the optimizer and AQE via the standard V2 path |
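To make the pruning handshake concrete, here is a minimal self-contained Scala sketch of the SupportsPushDownRequiredColumns flow; Row, CacheScanBuilder, and PrunedScan are simplified stand-ins invented for illustration, not Spark's actual DSv2 interfaces:

```scala
// Simplified stand-ins for the DSv2 column-pruning handshake.
// A "row" keeps named column values; the builder narrows the schema.
final case class Row(values: Map[String, Any])

final case class PrunedScan(schema: Seq[String], rows: Seq[Row]) {
  // Only the requested columns are "deserialized" per row.
  def read(): Seq[Seq[Any]] = rows.map(r => schema.map(r.values))
}

final class CacheScanBuilder(fullSchema: Seq[String], rows: Seq[Row]) {
  private var required: Seq[String] = fullSchema

  // Analogue of SupportsPushDownRequiredColumns.pruneColumns: the optimizer
  // tells the source which columns the query actually needs.
  def pruneColumns(cols: Seq[String]): Unit =
    required = fullSchema.filter(cols.contains) // keep the cache's column order

  def build(): PrunedScan = PrunedScan(required, rows)
}
```

The point of the sketch is only the shape of the contract: the optimizer calls pruneColumns during V2ScanRelationPushDown, and the resulting scan's schema drives how much work the serializer does.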

Physical execution is fully preserved. A new case in DataSourceV2Strategy intercepts
DataSourceV2ScanRelation(InMemoryCacheScan) before the generic BatchScanExec path and routes it
back to InMemoryTableScanExec. No change to the columnar scan hot path.
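The interception relies on case ordering in the strategy: the cache-specific pattern must be tried before the generic BatchScanExec case. A toy sketch with stand-in types (not Spark's planner API):

```scala
// Stand-in plan and exec node types, invented for illustration.
sealed trait Scan
case object InMemoryCacheScanStub extends Scan
case object OtherScan extends Scan

final case class ScanRelation(scan: Scan)

sealed trait Exec
case object InMemoryTableScanExecStub extends Exec
case object BatchScanExecStub extends Exec

// The cache-specific case is listed first, so cached scans keep their
// existing columnar exec node; everything else falls through to the
// generic V2 path.
val planScan: ScanRelation => Exec = {
  case ScanRelation(InMemoryCacheScanStub) => InMemoryTableScanExecStub
  case ScanRelation(_)                     => BatchScanExecStub
}
```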

A CachedRelation extractor object matches any of the three plan forms a cached relation takes
across query stages — InMemoryRelation, DataSourceV2Relation(InMemoryCacheTable), and
DataSourceV2ScanRelation(InMemoryCacheScan) — providing backward compatibility for all existing
pattern-match sites.
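The extractor idea can be sketched with stand-in plan nodes; the real CachedRelation matches Spark's actual logical-plan classes, but the shape of the unapply is the same:

```scala
// Simplified stand-ins for the three plan shapes a cached relation takes.
sealed trait LogicalPlan
final case class InMemoryRelationStub(name: String) extends LogicalPlan
final case class DSv2Relation(table: Any) extends LogicalPlan
final case class DSv2ScanRelation(scan: Any) extends LogicalPlan
final case class CacheTableStub(cached: InMemoryRelationStub)
final case class CacheScanStub(cached: InMemoryRelationStub)

object CachedRelation {
  // One unapply covers all three forms a cached relation can take across
  // analysis/optimization, so existing pattern-match sites keep working.
  def unapply(plan: LogicalPlan): Option[InMemoryRelationStub] = plan match {
    case r: InMemoryRelationStub            => Some(r)
    case DSv2Relation(CacheTableStub(r))    => Some(r)
    case DSv2ScanRelation(CacheScanStub(r)) => Some(r)
    case _                                  => None
  }
}
```

Call sites can then write `case CachedRelation(r) => ...` regardless of which stage the plan is in.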

Why are the changes needed?

Without this change, every query on a cached DataFrame ignores standard optimizer rules that every
DSv2 source benefits from. The most impactful gap is column pruning: even
SELECT one_col FROM cached_wide_table causes all columns to be deserialized from the in-memory
columnar store.

Does this PR introduce any user-facing change?

Yes, in a positive way:

  • Queries that access a subset of columns from a cached DataFrame will be significantly faster.
  • Queries on sorted cached DataFrames may eliminate redundant sort operations.
  • The ANALYZE column statistics API for cached queries now correctly propagates stats through the
    optimized plan.

The InMemoryRelation type is still used internally and visible in CachedData.cachedRepresentation.
The change is transparent: df.cache(), spark.catalog.cacheTable(), and related APIs behave
identically from the user's perspective.

How was this patch tested?

All existing tests in CachedTableSuite (104 tests), InMemoryColumnarQuerySuite,
DatasetCacheSuite, UDFSuite, DataSourceV2SQLSuite, and LogicalPlanTagInSparkPlanSuite pass
without modification to test logic (only callsite updates to use the new CachedRelation extractor).

A new benchmark InMemoryCacheDSv2Benchmark was added. Results on 1M rows (AQE off, single
partition):

Column pruning - 1000000 rows, 10 cols, select 2:

    sum 2 of 10 cols (column pruning via DSv2)    19 ms    1.0X
    sum all 10 cols (no pruning - pre-DSv2)       54 ms    0.3X   (~2.8x speedup from pruning)

Planning overhead - 1000 plan-only iterations:

    optimizedPlan (DSv2 path, V2ScanRelationPushDown)    254 ms total (~0.25 ms/plan)

Column pruning yields a ~3x speedup when accessing 2 of 10 cached columns. Planning overhead
from the additional optimizer rules is ~0.25ms per query, negligible in practice.


viirya commented Mar 25, 2026

> The most impactful gap is column pruning: even SELECT one_col FROM cached_wide_table causes all columns to be deserialized from the in-memory columnar store.

Currently InMemoryTableScanExec already has column pruning. It only deserializes the necessary columns.

@dongjoon-hyun dongjoon-hyun marked this pull request as draft March 30, 2026 23:30
@dbtsai dbtsai changed the title [WIP][SPARK-XXXXX][SQL] df.cache() with DSv2 interfaces to enable V2ScanRelationPushDown optimizer rules [WIP][SPARK-56479][SQL] df.cache() with DSv2 interfaces to enable V2ScanRelationPushDown optimizer rules Apr 14, 2026

@szehon-ho szehon-ho left a comment


Thanks for the PR! I spent some time studying the approach. Before going deeper into implementation details, I'd like to step back and discuss whether this is the right architecture.

I mapped out which capabilities are genuinely new vs. already provided by the existing InMemoryScans + InMemoryTableScanExec path:

| Capability | Already works today? | How it works today |
| --- | --- | --- |
| Column pruning | Yes | InMemoryScans uses pruneFilterProject → InMemoryTableScanExec(attributes = prunedCols) → serializer only deserializes requested columns |
| Batch-level filter pruning (min/max) | Yes | Filters passed to InMemoryTableScanExec.predicates → CachedBatchSerializer.buildFilter |
| Sort order propagation | Yes | InMemoryTableScanExec.outputOrdering derives from cachedPlan.outputOrdering |
| Statistics reporting | Yes | InMemoryRelation.computeStats() reports accumulator-based size/rowCount after materialization |
| Dynamic Partition Pruning | No | PartitionPruning.getFilterableTableScan has no InMemoryRelation case |
| Per-partition limit pushdown | No | Not supported today |
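The batch-level min/max pruning row can be illustrated with a toy analogue of CachedBatchSerializer.buildFilter; the Batch type and the single lower-bound predicate are invented for this sketch:

```scala
// A cached batch carries per-batch min/max stats alongside its rows.
final case class Batch(min: Int, max: Int, rows: List[Int])

// Toy buildFilter for a predicate of the form "col >= lowerBound":
// batches whose max can't satisfy it are skipped entirely; surviving
// batches still need row-level filtering afterwards (hence the
// post-scan FilterExec mentioned in the PR description).
def buildFilter(lowerBound: Int): Seq[Batch] => Seq[Batch] =
  batches => batches.filter(_.max >= lowerBound)
```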

The DSv2 wrapping adds the last two, but comes with tradeoffs:

  • Stats behavioral change: InMemoryCacheScan.estimateStatistics() applies a column-ratio scaling heuristic (sizeInBytes * prunedAttrs.size / output.size) that doesn't exist today. CBO sees different statistics, which could affect join strategy selection.
  • EXPLAIN / plan shape change: withCachedData shows DataSourceV2Relation instead of InMemoryRelation; optimizedPlan shows DataSourceV2ScanRelation. This is user-visible and could affect tooling or downstream code that inspects plans.

Given that the two genuinely new capabilities (DPP, limit pushdown) could alternatively be added directly to InMemoryScans / InMemoryTableScanExec with much less surface area, do we want to proceed with the DSv2 wrapping approach? The main argument for it would be future-proofing (new V2 capabilities auto-apply), but the tradeoff is coupling the in-memory cache to the full V2 contract surface area.

Would it make sense to either:

  1. Add DPP + limit support directly to the existing path (targeted, low-risk), or
  2. If we prefer the DSv2 approach for extensibility, add a config toggle so InMemoryScans stays alive as a fallback?


/**
* Fallback strategy for cached in-memory tables when the DSv2 cache path is disabled
* (spark.sql.inMemoryColumnarStorage.useDataSourceV2 = false).

This config spark.sql.inMemoryColumnarStorage.useDataSourceV2 doesn't exist anywhere in the codebase. Also, after this PR InMemoryScans becomes unreachable dead code since CacheManager always wraps in DataSourceV2Relation — should we add a config toggle, or remove this strategy?

-  @transient relation: InMemoryRelation)
+  @transient relation: InMemoryRelation,
+  limit: Option[Int] = None,
+  runtimeFilters: Seq[Expression] = Nil)

The case class now has 5 fields — sql/hive/src/test/scala/org/apache/spark/sql/hive/CachedTableSuite.scala line 42 still has InMemoryTableScanExec(_, _, relation) which won't compile.

* The DPP pipeline injects [[DynamicPruning]] expressions into the plan, which
* [[DataSourceV2Strategy]] separates and passes as runtimeFilters to the exec node.
*/
override def filter(predicates: Array[V2Predicate]): Unit = {}

filter() is a no-op — runtime filtering is actually routed through InMemoryTableScanExec.runtimeFilters instead. This deviates from the V2 contract where filter() is expected to prune InputPartitions.
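For reference, the contract being described expects filter() to prune InputPartitions using runtime predicate values. A self-contained toy version of that expectation (all names are stand-ins, and the runtime predicate is modeled as an IN-set of partition keys):

```scala
// Stand-in for a DSv2 InputPartition keyed by a partition value.
final case class InputPartitionStub(partitionKey: Int)

final class RuntimeFilterableScan(private var parts: Seq[InputPartitionStub]) {
  def partitions: Seq[InputPartitionStub] = parts

  // Contract-style filter(): drop partitions that cannot match the runtime
  // predicate (here, a set of allowed keys produced by a DPP subquery),
  // rather than leaving pruning entirely to the exec node.
  def filter(allowedKeys: Set[Int]): Unit =
    parts = parts.filter(p => allowedKeys.contains(p.partitionKey))
}
```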

DataSourceV2Strategy.withProjectAndFilter(
project, filters, localScanExec, needsUnsafeConversion = false) :: Nil

case PhysicalOperation(project, filters,

InMemoryCacheScan implements SupportsRuntimeV2Filtering and reports filterAttributes(), but this special case doesn't extract scalar subquery filters on those attributes the way the generic DataSourceV2ScanRelation case does (lines 168-183 on master). Should these be handled consistently?

@szehon-ho

Just to clarify: this looks like a migration, and I'm mainly pointing out some risks. (I'm OK with doing it if we feel it's worth doing.)


dbtsai commented Apr 21, 2026

Thanks @szehon-ho and @viirya for the detailed review — you're both right that my original description overstated the gaps.

After double-checking with git blame and reading InMemoryScans carefully:

| Capability | Already works today? | Via |
| --- | --- | --- |
| Column pruning | Yes | InMemoryScans uses pruneFilterProject → InMemoryTableScanExec(prunedCols, ...) |
| Filter pushdown (batch min/max) | Yes | InMemoryTableScanExec.predicates → CachedBatchSerializer.buildFilter |
| Sort-order propagation | Yes | InMemoryTableScanExec.outputOrdering from cachedPlan.outputOrdering |
| Statistics | Yes | InMemoryRelation.computeStats() |
| Dynamic Partition Pruning | No | New via SupportsRuntimeV2Filtering |
| Per-partition LIMIT pushdown | No | New via SupportsPushDownLimit |

I've pushed a follow-up commit that:

  1. Adds a config toggle spark.sql.inMemoryColumnarStorage.enableDatasourceV2 (default true). When set to false, CacheManager.useCachedData returns InMemoryRelation directly, falling back to the existing InMemoryScans planner path. This addresses the concern about plan shape and stats behavioral changes.

  2. Fixes the ScalaDoc in InMemoryCacheTable and InMemoryScanBuilder to accurately state that the genuinely new capabilities are DPP and per-partition LIMIT; the other interfaces mirror what InMemoryScans/pruneFilterProject already provides.

  3. Updates the InMemoryScans comment to reference the actual config key.
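The toggle in item 1 amounts to a conf-gated branch in the substitution step. A simplified sketch with stand-in plan types (only the branching semantics mirror the change described; nothing here is Spark's actual API):

```scala
// Stand-in plan nodes for the two substitution outcomes.
sealed trait Plan
final case class CachedRel(name: String) extends Plan      // InMemoryRelation analogue
final case class V2Rel(wrapped: CachedRel) extends Plan    // DataSourceV2Relation analogue

// With the flag on, the cached relation is wrapped so the
// V2ScanRelationPushDown batch applies; with it off, the bare relation is
// returned and the legacy InMemoryScans planner path handles it.
def useCachedData(cached: CachedRel, dsv2Enabled: Boolean): Plan =
  if (dsv2Enabled) V2Rel(cached) else cached
```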

Regarding the two tradeoffs you flagged:

  • Stats behavioral change: The column-ratio scaling heuristic in estimateStatistics() is indeed new behavior. I'll add a note to the PR description and consider whether to make it match the existing behavior more closely.
  • Plan shape change: With the config toggle users can opt out if tooling depends on InMemoryRelation appearing in EXPLAIN output.

Does this approach address the architecture concern, or would you prefer I pursue option 1 (add DPP + limit directly to InMemoryScans/InMemoryTableScanExec without DSv2 wrapping)?
