
Conversation

@aokolnychyi (Contributor) commented Nov 4, 2025

What changes were proposed in this pull request?

This PR makes Spark reload DSv2 tables in views created using plans on each view access.

Why are the changes needed?

The current problem is that the view definition in the session catalog captures the analyzed plan, which references a Table instance (meant to pin the table version). If a connector doesn't have an internal cache and produces a new Table object on each load, the table referenced in the view becomes orphaned and there is no way to refresh it, unless that Table instance auto-refreshes on each scan (which is very dangerous).
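The failure mode can be sketched with a minimal, self-contained Scala model; the classes below are hypothetical stand-ins, not Spark's actual connector API:

```scala
// Hypothetical model of the problem, not Spark's real classes.
case class Table(name: String, version: Long)

// A connector without an internal cache: every load returns a new Table.
class Connector {
  private var currentVersion = 0L
  def commit(): Unit = currentVersion += 1
  def loadTable(name: String): Table = Table(name, currentVersion)
}

object StaleViewDemo extends App {
  val connector = new Connector

  // Creating the view captures the analyzed plan, pinning this Table instance.
  val pinned = connector.loadTable("t")

  connector.commit() // new data is committed to the table

  // The pinned instance is now orphaned: nothing refreshes it.
  assert(pinned.version == 0L)

  // Reloading on each view access (what this PR does) sees the new version.
  val reloaded = connector.loadTable("t")
  assert(reloaded.version == 1L)
}
```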

Does this PR introduce any user-facing change?

Yes, but it restores the correct behavior without requiring hacks in connectors.

How was this patch tested?

This PR comes with tests.

Was this patch authored or co-authored using generative AI tooling?

No.

extends LeafNode with MultiInstanceRelation with NamedRelation {

  override def name: String = {
    s"${catalog.name()}.${identifier.quoted}"
@cloud-fan (Contributor) commented Nov 4, 2025:

let's follow DataSourceV2RelationBase or create a base trait/util function

  override def name: String = {
    import org.apache.spark.sql.connector.catalog.CatalogV2Implicits._
    (catalog, identifier) match {
      case (Some(cat), Some(ident)) => s"${quoteIfNeeded(cat.name())}.${ident.quoted}"
      case _ => table.name()
    }
  }

@aokolnychyi (Author):

Updated.


object TableReference {

case class TableInfo(columns: Seq[Column])
Contributor:

shall we just use StructType?

@aokolnychyi (Author):

I actually need Columns here to compare original metadata like types (e.g. char/varchar).

@aokolnychyi aokolnychyi changed the title [WIP][SPARK-53924] Reload DSv2 tables in views created using plans on each view access [SPARK-53924] Reload DSv2 tables in views created using plans on each view access Nov 14, 2025
@aokolnychyi aokolnychyi force-pushed the spark-53924 branch 3 times, most recently from 706ade0 to a9251b6 on November 14, 2025 18:56

import org.apache.spark.sql.util.CaseInsensitiveStringMap
import org.apache.spark.util.ArrayImplicits._

case class TableReference private (
Member:

let's add code comments

Member:

Also, since it is DSV2 only, shall we rename as DSV2TableReference?

test("SPARK-53924: insert into DSv2 table invalidates cache of SQL temp views with plans") {
  val t = "testcat.tbl"
  withTable(t, "v") {
    withSQLConf(SQLConf.STORE_ANALYZED_PLAN_FOR_VIEW.key -> "true") {
Member:

Let's also test when the config is false

test("SPARK-53924: uncache DSv2 table using SQL uncaches SQL temp views with plans") {
  val t = "testcat.tbl"
  withTable(t, "v") {
    withSQLConf(SQLConf.STORE_ANALYZED_PLAN_FOR_VIEW.key -> "true") {
Member:

Let's also test when the config is false

private def adaptCachedRelation(cached: LogicalPlan, ref: TableReference): LogicalPlan = {
  cached transform {
    case r: DataSourceV2Relation if matchesReference(r, ref) =>
      r.copy(output = ref.output, options = ref.options)
Member:

Can you explain why this is needed? When resolveReference is called, all the cached relations will be updated.

@aokolnychyi (Author):

Not sure I got the question. This code exists because the cached output attributes may differ from the reference's. This method replaces TableReference (which is resolved and has its own output) with a relation from the cache, but makes that relation use the output and options from TableReference.

Does it make sense?
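The rebinding described above can be illustrated with a hypothetical model; Attr and Relation below are illustrative stand-ins, not Spark's actual plan classes:

```scala
// Hypothetical model of what adaptCachedRelation does, not the real classes.
case class Attr(name: String, exprId: Int)
case class Relation(table: String, output: Seq[Attr], options: Map[String, String])

object AdaptDemo extends App {
  // The cached plan may carry different expression IDs than the freshly
  // resolved reference, so we reuse the cached relation but expose the
  // reference's output attributes and options.
  def adapt(cached: Relation, refOutput: Seq[Attr],
      refOptions: Map[String, String]): Relation =
    cached.copy(output = refOutput, options = refOptions)

  val cached = Relation("t", Seq(Attr("id", 1), Attr("data", 2)), Map.empty)
  val adapted = adapt(cached, Seq(Attr("id", 7), Attr("data", 8)), Map("opt" -> "v"))
  assert(adapted.output.map(_.exprId) == Seq(7, 8))
}
```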

Member:

QQ: do we need to call TableReferenceUtils.validateLoadedTable if it is cached?

@aokolnychyi (Author):

Actually, that's a good call. Let me add it.

* For instance, temporary views with fully resolved logical plans don't allow schema changes
* in underlying tables.
*/
case class V2TableReference private(
Member:

Let's add private[sql] to make it internal

@aokolnychyi (Author):

I thought everything in catalyst is considered internal, but let me update.

}
}

object V2TableReference {
Member:

Let's add private[sql] to make it internal

}
}

object V2TableReferenceUtils extends SQLConfHelper {
Member:

Let's add private[sql] to make it internal

},
"INCOMPATIBLE_COLUMN_CHANGES_AFTER_VIEW_WITH_PLAN_CREATION" : {
"message" : [
"View <viewName> plan references table <tableName> whose <colType> columns changed since the view plan was initially captured.",
Member:

nit. Maybe, <colType> columns -> <colType> column?

@aokolnychyi (Author):

The rest of the error message says "column changes" in plural, so I think using columns here should be OK.
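The validation this error message describes can be sketched as follows; Col and validateLoadedTable are hypothetical names, not the actual Spark implementation:

```scala
// Hypothetical sketch: compare the columns captured at view creation with
// the freshly loaded ones, and fail on any mismatch.
case class Col(name: String, dataType: String)

object ValidateDemo extends App {
  def validateLoadedTable(viewName: String, tableName: String,
      captured: Seq[Col], loaded: Seq[Col]): Unit = {
    if (captured != loaded) {
      throw new IllegalStateException(
        s"View $viewName plan references table $tableName whose columns " +
          "changed since the view plan was initially captured.")
    }
  }

  val captured = Seq(Col("id", "INT"), Col("data", "VARCHAR(10)"))
  validateLoadedTable("v", "t", captured, captured) // unchanged schema: OK

  // A type change in the reloaded table is rejected.
  try {
    validateLoadedTable("v", "t", captured,
      Seq(Col("id", "BIGINT"), Col("data", "VARCHAR(10)")))
  } catch {
    case e: IllegalStateException => println(e.getMessage)
  }
}
```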

@aokolnychyi (Author):

@dongjoon-hyun @gengliangwang, could you folks help me merge this into master/4.1? All tests are passing.

@dongjoon-hyun (Member) left a comment:

+1, LGTM again. Thank you, @aokolnychyi .

dongjoon-hyun pushed a commit that referenced this pull request Nov 16, 2025
… view access

### What changes were proposed in this pull request?

This PR makes Spark reload DSv2 tables in views created using plans on each view access.

### Why are the changes needed?

The current problem is that the view definition in the session catalog captures the analyzed plan that references `Table` (that is supposed to pin the version). If a connector doesn’t have an internal cache and produces a new `Table` object on each load, the table referenced in the view will become orphan and there will be no way to refresh it unless that `Table` instance auto refreshes on each scan (super dangerous).

### Does this PR introduce _any_ user-facing change?

Yes, but it restores the correct behavior without requiring hacks in connectors.

### How was this patch tested?

This PR comes with tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #52876 from aokolnychyi/spark-53924.

Authored-by: Anton Okolnychyi <aokolnychyi@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 407e79c)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@dongjoon-hyun (Member):

Merged to master/4.1 for Apache Spark 4.1.0.

@aokolnychyi (Author):

Thanks, @dongjoon-hyun @gengliangwang!

gengliangwang added a commit that referenced this pull request Nov 18, 2025
…ng schema changes

### What changes were proposed in this pull request?

Follow-up of #52876, add tests for cached temp view detecting schema changes

### Why are the changes needed?

There is no test coverage after comment #52876 (comment) is addressed. This PR is to add a test case for it.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

New test case

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #53103 from gengliangwang/SPARK-53924-test.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
gengliangwang added a commit that referenced this pull request Nov 18, 2025

(cherry picked from commit fd683ce)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
@manuzhang (Member):

@aokolnychyi I'm testing Spark 4.1 support for Iceberg in apache/iceberg#14155. It looks like this PR has broken this Iceberg test with the following error. Can you help check?

Exception in thread "test-extra-commit-message-writer-thread" java.lang.RuntimeException: org.apache.spark.SparkException: [INTERNAL_ERROR] Found the unresolved operator: 'InsertIntoStatement TableReference[id#5, data#6] default_iceberg.`/var/folders/pv/9kgp4f8j685fqb28n83cdb800000gq/T/junit-16494569145098127962`.`iceberg-table`, false, false, false SQLSTATE: XX000
== SQL (line 1, position 1) ==
INSERT INTO target VALUES (3, 'c'), (4, 'd')
^^^^^^^^^^^^^^^^^^

@aokolnychyi (Author):

@manuzhang, ack, let me check.

dongjoon-hyun pushed a commit that referenced this pull request Nov 25, 2025
### What changes were proposed in this pull request?
Resolve `V2TableReference` for table in `InsertIntoStatement`.

### Why are the changes needed?
#52876 brought in `V2TableReference` which broke relation resolution for insert into temp view on DSv2 table.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Add UT.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #53196 from manuzhang/FIX-SPARK-54491.

Authored-by: manuzhang <owenzhang1990@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
dongjoon-hyun pushed a commit that referenced this pull request Nov 25, 2025

(cherry picked from commit 3f5a2b9)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
huangxiaopingRD pushed 3 commits to huangxiaopingRD/spark that referenced this pull request Nov 25, 2025