Conversation

@aokolnychyi (Contributor) commented Oct 28, 2025

What changes were proposed in this pull request?

This PR makes DSv2 table resolution aware of cached tables via CACHE TABLE t or spark.table("t").cache() commands.

Why are the changes needed?

These changes are needed to avoid silent cache misses for DSv2 tables. Cache lookups depend on DSv2 Table instance equality. If each query loads a fresh Table instance from the metastore, connectors can pick up external changes, leading to unexpected cache misses. This contradicts the behavior we had for built-in tables and some DSv1 connectors such as Delta. Historically, the expected behavior of CACHE TABLE t and spark.table("t").cache() has been to cache the table state.

Does this PR introduce any user-facing change?

Yes. The PR fixes the resolution for DSv2 so that CACHE TABLE t behaves correctly and reliably (see the short sketch after the list below).

  • caching table via Dataset API will now pin table state
  • caching table via CACHE TABLE will now pin table state
  • caching a query via Dataset API will continue to simply cache the query plan as before
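
A minimal sketch of the resulting behavior, assuming a DSv2 table t already exists (the table name and queries are hypothetical):

```scala
// Pin the current state of the DSv2 table in the relation cache.
spark.sql("CACHE TABLE t")

// External changes to `t` (e.g. commits made by another engine) no longer cause
// silent cache misses: name resolution reuses the cached relation instead of
// loading a fresh Table instance from the metastore.
spark.table("t").count()

// Dropping the cache entry makes the latest table state visible again.
spark.sql("UNCACHE TABLE t")
```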

How was this patch tested?

This PR comes with tests.

Was this patch authored or co-authored using generative AI tooling?

No.

github-actions bot added the SQL label Oct 28, 2025
newRelation.copyTagsFrom(multi)
newRelation
}
val r = u.getTagValue(LogicalPlan.PLAN_ID_TAG)
Contributor Author

I have a separate PR that refactors this rule a bit; I'll rebase before merging.
There is a bit of duplicated logic here now.

Contributor Author

Here is the PR for refactoring: #52781

}
CacheManager.logCacheOperation(
log"Relation cache hit for table ${MDC(TABLE_NAME, nameWithTimeTravel)}")
Some(cachedRelation)
Contributor

hmm we just return the first match? Shall we use the scan with the latest table version?

@aokolnychyi (Contributor Author) Oct 29, 2025

Well, we shouldn't have multiple matching relations after this change, but cachedData is an IndexedSeq to which we always prepend entries (so newer entries are at the beginning of the sequence). We don't know which version of the table is newer because the versions are strings; in Iceberg, for instance, they are random UUIDs. That said, this piece should always take the newest matching entry, but we expect to only have one.
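
For illustration, a rough sketch of the prepend-and-first-match behavior described above, with simplified stand-in types (not the actual CacheManager code):

```scala
object CachedLookupSketch {
  // Stand-in for a cached relation entry keyed by a multi-part table name.
  case class Entry(nameParts: Seq[String], tableVersion: String)

  // Stand-in for CacheManager.cachedData: new entries are always prepended,
  // so the newest entry for a name sits at the front of the sequence.
  private var cachedData: IndexedSeq[Entry] = IndexedSeq.empty

  def cache(entry: Entry): Unit = {
    cachedData = entry +: cachedData
  }

  def lookup(nameParts: Seq[String]): Option[Entry] = {
    // find scans from the front, so the first hit is the newest matching entry;
    // after this change there should be at most one matching entry anyway
    cachedData.find(_.nameParts == nameParts)
  }
}
```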

}
}

private[sql] def lookupCachedTable(
Member

let's make it more straightforward: lookupCachedTableByName

Member

@aokolnychyi This is my last comment for this PR

Contributor Author

@gengliangwang, I tested the rename locally, but I'm not sure it would make the code any clearer. In fact, it only creates inconsistencies with other methods that accept a name in their args but don't have an xxxByName suffix.

I feel lookupCachedTable(name, resolver) is already pretty clear. What do you think?

@aokolnychyi (Contributor Author)

@gengliangwang @cloud-fan, I made some updates to this PR. Please take another look.

}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

private[sql] trait RelationCache {
@aokolnychyi (Contributor Author) Nov 19, 2025

This can be any relation cache in the future.
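
Not the exact definition from this PR, but a hedged sketch of the shape such a pluggable relation cache could take, matching the lookup(nameParts, conf.resolver) call site in the snippet below:

```scala
import org.apache.spark.sql.catalyst.analysis.Resolver
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Hypothetical sketch: implementations answer whether a relation is cached for a
// multi-part name, using the session's resolver for case-sensitivity handling.
trait RelationCache {
  def lookup(nameParts: Seq[String], resolver: Resolver): Option[LogicalPlan]
}
```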

val tableName = V2TableUtil.toQualifiedName(catalog, ident)
logDebug(s"Refreshing table metadata for $tableName")
catalog.loadTable(ident)
val cachedTable = sharedRelationCache.lookup(nameParts, conf.resolver).collect {
Member

nit: indent

logDebug(s"Refreshing table metadata for $tableName")
catalog.loadTable(ident)
val cachedTable = sharedRelationCache.lookup(nameParts, conf.resolver).collect {
case cached: DataSourceV2Relation
}
}

test("SPARK-54022: caching table via Dataset API should pin table state") {
Member

Let's mention the behavior change before / after this PR in the PR description:

  • caching table via Dataset API
  • caching a query via Dataset API
  • caching table via CACHE TABLE

Contributor Author

Done.

override def markAsAnalyzed(ac: AnalysisContext): LogicalPlan = {
// RTAS may drop and recreate table before query execution, breaking self-references
// refresh and pin versions here to read from original table versions instead of
// newly created empty table that is meant to serve as target for append/overwrite
Contributor

Why can we remove this now?

@aokolnychyi (Contributor Author) Nov 20, 2025

We simply moved this to the exec node, as I no longer have access to SparkSession in catalyst. We can't do this refresh without checking cached relations, as it may potentially hit the metastore and move the version inconsistently.

// 2. Writing to the new table fails,
// 3. The table returned by catalog.createTable doesn't support writing.
//
// RTAS must refresh and pin versions in query to read from original table versions instead of
Contributor Author

@cloud-fan, this is the new place for this RTAS refresh.

@gengliangwang (Member) left a comment

+1 for the proposal.

@dongjoon-hyun (Member) left a comment

Could you resolve the conflict, @aokolnychyi?

@dongjoon-hyun (Member) left a comment

I've been monitoring this PR for the last month as a kind of bug fix, as described in the PR description (and JIRA). Thank you, @aokolnychyi, @cloud-fan, @gengliangwang.

Merged to master/4.1 for Apache Spark 4.1.0.

dongjoon-hyun pushed a commit that referenced this pull request Nov 23, 2025

Closes #52764 from aokolnychyi/spark-54022.

Lead-authored-by: Anton Okolnychyi <aokolnychyi@apache.org>
Co-authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 92c948f)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>