
[SPARK-42398][SQL] Refine default column value DS v2 interface #40049

Closed
wants to merge 15 commits into from

Conversation

cloud-fan (Contributor)

What changes were proposed in this pull request?

The current default value DS v2 API is inconsistent. The createTable API only takes a StructType, so implementations must know the special metadata key of the default value in order to access it, while the TableChange API carries the default value as an individual field.

This PR adds a new Column interface, which holds both the current default (as a SQL string) and the exist default (as a v2 literal). The createTable API now takes Column. This removes the need for a special metadata key and is more extensible when adding more special columns such as generated columns. It is also type-safe, guaranteeing that the exist default is a literal. Implementations remain free to decide how to encode and store default values. Note: backward compatibility is preserved.
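To make the shape of the change concrete, here is a minimal sketch. These are simplified stand-ins, not the real interfaces (which live in org.apache.spark.sql.connector.catalog); the names follow the PR description but everything else is illustrative:

```scala
// Hypothetical, simplified stand-ins for the v2 API shapes described above.
final case class LiteralValue[T](value: T, dataType: String)

// Holds both parts of a default value: the original SQL text (current default)
// and the pre-evaluated literal (exist default) used to back-fill fields
// missing from storage.
final case class ColumnDefaultValue(sql: String, value: LiteralValue[_])

final case class Column(
    name: String,
    dataType: String,
    nullable: Boolean = true,
    defaultValue: Option[ColumnDefaultValue] = None)

// createTable now receives Column directly, so implementations no longer need
// to dig a magic metadata key out of a StructType.
def describeCreateTable(columns: Seq[Column]): String =
  columns.map { c =>
    val d = c.defaultValue.map(dv => s" DEFAULT ${dv.sql}").getOrElse("")
    s"${c.name} ${c.dataType}$d"
  }.mkString(", ")
```

For example, a column created as `col INT DEFAULT 42` would carry both the string "42" and the literal 42 in one typed object instead of stringly-typed metadata.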

Why are the changes needed?

A better DS v2 API for default values.

Does this PR introduce any user-facing change?

no

How was this patch tested?

existing tests

@dtenedor (Contributor) left a comment:

This is looking pretty close to ready.

@@ -3108,7 +3108,7 @@ object SQLConf {
       "provided values when the corresponding fields are not present in storage.")
     .version("3.4.0")
     .stringConf
-    .createWithDefault("csv,json,orc,parquet")
+    .createWithDefault("csv,json,orc,parquet,hive")
Contributor:

Is this safe? What data source operator implements the hive provider? Does it support filling in the existence default values? Do we have any default-value test cases in this PR where the table is using hive?

Member:

+1

import org.apache.spark.sql.connector.catalog.{Column, ColumnDefaultValue}
import org.apache.spark.sql.types.DataType

// The default implementation of v2 column.
Contributor:

Suggested change
// The default implementation of v2 column.
// The standard concrete implementation of data source V2 column.

*/
public class ColumnDefaultValue {
private String sql;
private Literal<?> value;
Member:

The name value is confusing. Shall we rename it to initialValue? Or rename sql and value to currentDefault and existingDefault?

cloud-fan (Contributor, Author):

A default value has two parts: the SQL string and the evaluated literal value. I don't think currentDefault and existingDefault are easier for data source developers to understand.

cloud-fan (Contributor, Author):

Can you also read the classdoc? If you still think the name is confusing, let's figure out a better one.

Member:

Actually, I had read the classdoc before commenting... I don't have a better suggestion. Let's enhance the doc later.

@dtenedor (Contributor), Feb 17, 2023:

Data source developers only have to think about the existence default value. For any column where the corresponding field is not present in storage, the data source is responsible for filling this in instead of NULL.

On the other hand, the current default value is for DML only. The analyzer inserts this expression for any explicit reference to DEFAULT, or for a small subset of implicit cases.

For these fields we could clarify with comments, e.g.

  // This is the original string contents of the SQL expression specified at the
  // time the column was created in a CREATE TABLE, REPLACE TABLE, or ALTER TABLE
  // ADD COLUMN command. For example, for "CREATE TABLE t (col INT DEFAULT 42)",
  // this field is equal to the string literal "42" (without quotation marks).
  private String sql;
  // This is the literal value corresponding to the above SQL string. For the above
  // example, this would be a literal integer with a value of 42.
  private Literal<?> value;
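The reader-side contract described above (fill in the existence default instead of NULL when a stored row predates the column) can be sketched as follows. This is an illustrative model, not Spark's actual scan API; the type and method names are made up:

```scala
// Hypothetical column spec: the existence default is the pre-evaluated
// literal a data source must use for fields absent from storage.
final case class ColumnSpec(name: String, existenceDefault: Option[Any])

// Back-fill a stored row against the current table schema: fields present
// in storage win; absent fields get the existence default, else NULL.
def backfillRow(stored: Map[String, Any], schema: Seq[ColumnSpec]): Seq[Any] =
  schema.map { col =>
    stored.get(col.name) match {
      case Some(v) => v                            // field present in storage
      case None    => col.existenceDefault.orNull  // absent: exist default, else NULL
    }
  }
```

The current default, by contrast, never reaches this code path: the analyzer substitutes it into DML before execution.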


createTable(ident, CatalogV2Util.v2ColumnsToStructType(columns), partitions, properties)
}

// TODO: remove it when no tests calling this deprecated method.
Member:

Is there a follow-up ticket for this?

cloud-fan (Contributor, Author):

I haven't created one yet. This is a test-only change, so the priority is low.
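The backward-compatibility shim under discussion (the deprecated StructType-based createTable delegating to the new Column-based one) can be sketched like this. The helper name in Spark is CatalogV2Util.v2ColumnsToStructType; everything below is a simplified stand-in:

```scala
// Hypothetical, simplified shapes: a v2 column and a legacy struct field
// whose metadata carries the default value under the "special metadata key".
final case class V2Column(name: String, dataType: String, defaultSql: Option[String])
final case class Field(name: String, dataType: String, metadata: Map[String, String])

val CurrentDefaultKey = "CURRENT_DEFAULT" // illustrative metadata key name

// Convert v2 columns back to legacy fields so the deprecated createTable
// overload keeps working for old implementations and tests.
def v2ColumnsToFields(cols: Seq[V2Column]): Seq[Field] =
  cols.map { c =>
    val meta = c.defaultSql
      .map(sql => Map(CurrentDefaultKey -> sql))
      .getOrElse(Map.empty[String, String])
    Field(c.name, c.dataType, meta)
  }
```

This direction (Column to StructType) is lossless for the SQL string, which is why the deprecated path can be kept alive for tests until they migrate.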

cloud-fan and others added 3 commits February 18, 2023 01:51
…s/logical/statements.scala

Co-authored-by: Daniel Tenedorio <daniel.tenedorio@databricks.com>
…alog/CatalogV2Util.scala

Co-authored-by: Daniel Tenedorio <daniel.tenedorio@databricks.com>

@cloud-fan (Contributor, Author):

thanks for the review, merging to master/3.4!

cloud-fan added a commit that referenced this pull request Feb 20, 2023

Closes #40049 from cloud-fan/table2.

Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 70a098c)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@cloud-fan cloud-fan closed this in 70a098c Feb 20, 2023
@dongjoon-hyun (Member) left a comment:

/**
* Create a table in the catalog.
* <p>
* This is deprecated. Please override
Member:

Could you add a deprecation version explicitly?
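As requested above, a deprecation should carry an explicit version. In Scala this is the second argument of the @deprecated annotation (Java APIs would pair @Deprecated with an @deprecated javadoc tag instead). A minimal sketch with a made-up class:

```scala
// Hypothetical catalog class, purely to show the annotation shape.
class Catalog {
  // The "since" version documents when the old overload was deprecated.
  @deprecated("Use createTable with Column[] instead", "3.4.0")
  def createTable(schemaDdl: String): String = s"created: $schemaDdl"
}
```

Compilers can then report both the replacement and the version in the deprecation warning.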

@dongjoon-hyun (Member):

Interestingly, it passed locally while GitHub Action jobs keep failing.

$ build/sbt "sql/testOnly *.OrcSourceV1Suite -- -z SPARK-11412"
...
[info] All tests passed.
[success] Total time: 23 s, completed Feb 20, 2023, 12:54:28 PM

@dongjoon-hyun (Member):

Since this currently happens only in GitHub Actions, I made a WIP PR for further investigation. If it's valid, I'll convert it into an official PR separate from this one.

@dongjoon-hyun (Member):

I closed my PR because the failure seems to start earlier than this commit.

dongjoon-hyun pushed a commit that referenced this pull request Mar 10, 2023
…de the new createTable method

### What changes were proposed in this pull request?

This is a followup of #40049 to fix a small issue: `DelegatingCatalogExtension` should also override the new `createTable` function and call the session catalog, instead of using the default implementation.

### Why are the changes needed?

bug fix

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

N/A, too trivial.

Closes #40369 from cloud-fan/api.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
dongjoon-hyun pushed a commit that referenced this pull request Mar 10, 2023
…de the new createTable method


Closes #40369 from cloud-fan/api.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 061bd92)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023

Closes apache#40049 from cloud-fan/table2.

Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 70a098c)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023
…de the new createTable method


Closes apache#40369 from cloud-fan/api.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 061bd92)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
cloud-fan pushed a commit that referenced this pull request Jun 6, 2024
… Add user facing error

### What changes were proposed in this pull request?

FIRST CHANGE

Pass correct parameter list to `org.apache.spark.sql.catalyst.util.ResolveDefaultColumns#analyze` when it is invoked from `org.apache.spark.sql.connector.catalog.CatalogV2Util#structFieldToV2Column`.

The `org.apache.spark.sql.catalyst.util.ResolveDefaultColumns#analyze` method accepts 3 parameters:

1) Field to analyze
2) Statement type - String
3) Metadata key - CURRENT_DEFAULT or EXISTS_DEFAULT

The method `org.apache.spark.sql.connector.catalog.CatalogV2Util#structFieldToV2Column` passes `fieldToAnalyze` and `EXISTS_DEFAULT`, so `EXISTS_DEFAULT` lands in the second parameter: it is treated as the statement type rather than the metadata key, and a different expression is analyzed.

Pull requests where the original change was introduced:
#40049 - Initial commit
#44876 - Refactor that did not touch the issue
#44935 - Another refactor that did not touch the issue
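The bug pattern described above can be reconstructed in miniature (the real API is ResolveDefaultColumns#analyze; the method below is a simplified stand-in): several same-typed positional parameters let a caller put an argument in the wrong slot without any compile error.

```scala
// Hypothetical analyze method mirroring the parameter list described above:
// a defaulted metadata key makes the positional mix-up silent.
def analyze(
    fieldName: String,
    statementType: String,
    metadataKey: String = "CURRENT_DEFAULT"): String =
  s"analyzing $metadataKey of $fieldName for $statementType"

// Buggy call: the caller meant EXISTS_DEFAULT as the metadata key, but it
// lands in statementType, so CURRENT_DEFAULT is analyzed instead.
val buggy = analyze("col", "EXISTS_DEFAULT")

// Fixed call: named arguments make each slot explicit.
val fixed = analyze(fieldName = "col", statementType = "ALTER TABLE",
  metadataKey = "EXISTS_DEFAULT")
```

Named (or strongly typed) parameters are the usual defense against this class of bug, which is exactly why it survived two refactors unnoticed.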

SECOND CHANGE

Add a user-facing exception when the default value is not foldable or resolved. Otherwise, the user would see the message "You hit a bug in Spark ...".

### Why are the changes needed?
The correct value needs to be passed to the `Column` object.

### Does this PR introduce _any_ user-facing change?
Yes, this is a bug fix: the existence default value now has the proper expression, whereas before this change it was actually the current default value of the column.

### How was this patch tested?
Unit test

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46594 from urosstan-db/SPARK-48286-Analyze-exists-default-expression-instead-of-current-default-expression.

Lead-authored-by: Uros Stankovic <uros.stankovic@databricks.com>
Co-authored-by: Uros Stankovic <155642965+urosstan-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit that referenced this pull request Jun 6, 2024
… Add user facing error


Closes #46594 from urosstan-db/SPARK-48286-Analyze-exists-default-expression-instead-of-current-default-expression.

Lead-authored-by: Uros Stankovic <uros.stankovic@databricks.com>
Co-authored-by: Uros Stankovic <155642965+urosstan-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 0f21df0)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>