[SPARK-42398][SQL] Refine default column value DS v2 interface #40049
Conversation
This is looking pretty close to ready
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statements.scala
sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala
@@ -3108,7 +3108,7 @@ object SQLConf {
       "provided values when the corresponding fields are not present in storage.")
     .version("3.4.0")
     .stringConf
-    .createWithDefault("csv,json,orc,parquet")
+    .createWithDefault("csv,json,orc,parquet,hive")
Is this safe? What data source operator implements the `hive` provider? Does it support filling in the existence default values? Do we have any default-value test cases in this PR where the table is using `hive`?
+1
import org.apache.spark.sql.connector.catalog.{Column, ColumnDefaultValue}
import org.apache.spark.sql.types.DataType

// The default implementation of v2 column.
Suggested change:
- // The default implementation of v2 column.
+ // The standard concrete implementation of data source V2 column.
sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/Column.java
 */
public class ColumnDefaultValue {
  private String sql;
  private Literal<?> value;
The naming `value` is confusing. Shall we rename it as `initialValue`? Or change `sql` and `value` to `currentDefault` and `existingDefault`.
A default value has two parts: the SQL string and the evaluated literal value. I don't think current default and exist default are easier to understand for data source developers.
Can you also read the classdoc? If you still think the name is confusing, let's figure out a better one.
Actually, I had read the classdoc before commenting... I don't have a better suggestion. Let's enhance the doc later.
Data source developers only have to think about the existence default value. For any column where the corresponding field is not present in storage, the data source is responsible for filling this in instead of NULL.
On the other hand, the current default value is for DML only. The analyzer inserts this expression for any explicit reference to `DEFAULT`, or for a small subset of implicit cases.
For these fields we could clarify with comments, e.g.
// This is the original string contents of the SQL expression specified at the
// time the column was created in a CREATE TABLE, REPLACE TABLE, or ALTER TABLE
// ADD COLUMN command. For example, for "CREATE TABLE t (col INT DEFAULT 42)",
// this field is equal to the string literal "42" (without quotation marks).
private String sql;
// This is the literal value corresponding to the above SQL string. For the above
// example, this would be a literal integer with a value of 42.
private Literal<?> value;
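To make the two parts concrete, here is a minimal, self-contained sketch of how a reader could apply the existence default when a field is absent from storage. The class and method names are hypothetical stand-ins for illustration, not Spark's actual `ColumnDefaultValue` API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for a v2 column default: it pairs the original
// SQL text of the DEFAULT expression with its pre-evaluated literal value.
final class SimpleDefaultValue {
    private final String sql;   // e.g. "42" for "CREATE TABLE t (col INT DEFAULT 42)"
    private final Object value; // the evaluated literal, e.g. Integer 42

    SimpleDefaultValue(String sql, Object value) {
        this.sql = sql;
        this.value = value;
    }

    String getSql() { return sql; }
    Object getValue() { return value; }
}

class ExistenceDefaultDemo {
    // For any column whose field is not present in a stored row, the data source
    // fills in the existence default instead of NULL.
    static Object readField(Map<String, Object> storedRow, String column,
                            SimpleDefaultValue existenceDefault) {
        if (storedRow.containsKey(column)) {
            return storedRow.get(column);
        }
        return existenceDefault == null ? null : existenceDefault.getValue();
    }

    public static void main(String[] args) {
        SimpleDefaultValue def = new SimpleDefaultValue("42", 42);
        Map<String, Object> oldRow = new HashMap<>();
        oldRow.put("id", 1); // row written before "col" was added to the table
        System.out.println(readField(oldRow, "col", def)); // fills in 42, not null
    }
}
```

Note that only the evaluated literal is consulted at read time; the SQL string exists so the original `DEFAULT` expression can be shown back to users and re-stored.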
  createTable(ident, CatalogV2Util.v2ColumnsToStructType(columns), partitions, properties)
}

// TODO: remove it when no tests calling this deprecated method.
Is there a follow-up ticket for this?
I haven't created it. This is a test-only change so the priority is low.
…s/logical/statements.scala Co-authored-by: Daniel Tenedorio <daniel.tenedorio@databricks.com>
…alog/CatalogV2Util.scala Co-authored-by: Daniel Tenedorio <daniel.tenedorio@databricks.com>
thanks for the review, merging to master/3.4!
### What changes were proposed in this pull request?

The current default value DS V2 API is a bit inconsistent. The `createTable` API only takes `StructType`, so implementations must know the special metadata key of the default value to access it. The `TableChange` API has the default value as an individual field.

This API adds a new `Column` interface, which holds both current default (as a SQL string) and exist default (as a v2 literal). The `createTable` API now takes `Column`. This avoids the need for a special metadata key and is also more extensible when adding more special cols like generated cols. This is also type-safe and makes sure the exist default is a literal. The implementation is free to decide how to encode and store default values.

Note: backward compatibility is taken care of.

### Why are the changes needed?

better DS v2 API for default value

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes #40049 from cloud-fan/table2.

Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 70a098c)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Hi, @cloud-fan and all. This seems to break `branch-3.4` CI while the `master` branch is okay. Could you check `branch-3.4` status once more?
/**
 * Create a table in the catalog.
 * <p>
 * This is deprecated. Please override
Could you add a deprecation version explicitly?
Interestingly, it passed locally while GitHub Action jobs keep failing.
Since this happens in GitHub Actions currently, I made a WIP PR for further investigation. If it's valid, I'll convert it to the official PR separately from this PR.
I closed my PR because the failure seems to happen earlier than this commit.
…de the new createTable method

### What changes were proposed in this pull request?

This is a followup of #40049 to fix a small issue: `DelegatingCatalogExtension` should also override the new `createTable` function and call the session catalog, instead of using the default implementation.

### Why are the changes needed?

bug fix

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

N/A, too trivial.

Closes #40369 from cloud-fan/api.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
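The pitfall this follow-up fixes generalizes to any hand-written delegating wrapper: a new interface method with a default implementation silently bypasses the delegate unless the wrapper overrides it too. A minimal, hypothetical Java illustration of the pattern (these are not Spark's actual interfaces):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

interface SimpleCatalog {
    void createTable(String name);

    // A newer overload added with a default body that forwards to the old method,
    // dropping the extra argument for implementations that predate it.
    default void createTable(String name, List<String> columns) {
        createTable(name);
    }
}

// Records what it was asked to do, standing in for the real session catalog.
class RecordingCatalog implements SimpleCatalog {
    final List<String> calls = new ArrayList<>();
    @Override public void createTable(String name) {
        calls.add("old:" + name);
    }
    @Override public void createTable(String name, List<String> columns) {
        calls.add("new:" + name + ":" + columns.size());
    }
}

// A delegating wrapper, analogous in spirit to DelegatingCatalogExtension.
class DelegatingCatalog implements SimpleCatalog {
    private final SimpleCatalog delegate;
    DelegatingCatalog(SimpleCatalog delegate) { this.delegate = delegate; }

    @Override public void createTable(String name) {
        delegate.createTable(name);
    }

    // Without this override, calling the new overload on the wrapper would run the
    // interface's default body and reach the delegate through the OLD method only,
    // losing the column list the delegate was supposed to receive.
    @Override public void createTable(String name, List<String> columns) {
        delegate.createTable(name, columns);
    }
}

class DelegationDemo {
    public static void main(String[] args) {
        RecordingCatalog session = new RecordingCatalog();
        new DelegatingCatalog(session).createTable("t", Arrays.asList("a", "b"));
        System.out.println(session.calls); // [new:t:2]
    }
}
```

Deleting the second override in `DelegatingCatalog` would make the delegate record `old:t` instead of `new:t:2`, which is exactly the shape of the bug fixed here.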
… Add user facing error

### What changes were proposed in this pull request?

FIRST CHANGE

Pass the correct parameter list to `org.apache.spark.sql.catalyst.util.ResolveDefaultColumns#analyze` when it is invoked from `org.apache.spark.sql.connector.catalog.CatalogV2Util#structFieldToV2Column`.

The `analyze` method accepts 3 parameters:
1) Field to analyze
2) Statement type - String
3) Metadata key - CURRENT_DEFAULT or EXISTS_DEFAULT

`structFieldToV2Column` passes `fieldToAnalyze` and `EXISTS_DEFAULT` as the second parameter, so it is not the metadata key; instead it is the statement type, so a different expression is analyzed.

Pull requests where the original change was introduced:
#40049 - Initial commit
#44876 - Refactor that did not touch the issue
#44935 - Another refactor that did not touch the issue

SECOND CHANGE

Add a user-facing exception when the default value is not foldable or resolved. Otherwise, the user would see the message "You hit a bug in Spark ...".

### Why are the changes needed?

It is needed to pass the correct value to the `Column` object.

### Does this PR introduce _any_ user-facing change?

Yes, this is a bug fix: the existence default value now has the proper expression, but before this change, the existence default value was actually the current default value of the column.

### How was this patch tested?

Unit test

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #46594 from urosstan-db/SPARK-48286-Analyze-exists-default-expression-instead-of-current-default-expression.

Lead-authored-by: Uros Stankovic <uros.stankovic@databricks.com>
Co-authored-by: Uros Stankovic <155642965+urosstan-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
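The first change is an instance of a classic hazard: adjacent parameters of the same type can be transposed, or an argument can slide into the wrong slot when a trailing parameter has a default, without any compiler complaint. A hypothetical sketch with the same parameter shape (this is not the real `analyze` signature):

```java
class ParameterOrderDemo {
    static final String CURRENT_DEFAULT = "CURRENT_DEFAULT";
    static final String EXISTS_DEFAULT = "EXISTS_DEFAULT";

    // Mirrors the shape of the real method: (field, statementType, metadataKey).
    // Both trailing parameters are Strings, so a misplaced argument still compiles.
    static String analyze(String field, String statementType, String metadataKey) {
        return metadataKey + ":" + field;
    }

    // Stands in for a Scala-style default parameter: metadataKey = CURRENT_DEFAULT.
    static String analyze(String field, String statementType) {
        return analyze(field, statementType, CURRENT_DEFAULT);
    }

    public static void main(String[] args) {
        // Buggy call shape: EXISTS_DEFAULT is absorbed as the statement type, so
        // the metadata key silently defaults to CURRENT_DEFAULT and the wrong
        // expression gets analyzed.
        System.out.println(analyze("col", EXISTS_DEFAULT));                // CURRENT_DEFAULT:col
        // Fixed call shape: the key is passed in the metadataKey slot.
        System.out.println(analyze("col", "ALTER TABLE", EXISTS_DEFAULT)); // EXISTS_DEFAULT:col
    }
}
```

Distinct types (e.g. an enum for the metadata key) or named arguments would have turned this into a compile-time error rather than a silent behavior bug.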
What changes were proposed in this pull request?

The current default value DS V2 API is a bit inconsistent. The `createTable` API only takes `StructType`, so implementations must know the special metadata key of the default value to access it. The `TableChange` API has the default value as an individual field.

This API adds a new `Column` interface, which holds both current default (as a SQL string) and exist default (as a v2 literal). The `createTable` API now takes `Column`. This avoids the need for a special metadata key and is also more extensible when adding more special cols like generated cols. This is also type-safe and makes sure the exist default is a literal. The implementation is free to decide how to encode and store default values.

Note: backward compatibility is taken care of.

Why are the changes needed?

better DS v2 API for default value

Does this PR introduce any user-facing change?

no

How was this patch tested?

existing tests
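The contrast the description draws, a magic metadata key on a `StructField` versus a typed `Column`, can be sketched as follows. The classes here are simplified hypothetical stand-ins, not Spark's actual API.

```java
import java.util.HashMap;
import java.util.Map;

// Old style: the default value rides along as a stringly-typed metadata entry,
// so every implementation must know the exact key and parse the value itself.
class MetadataField {
    final String name;
    final Map<String, String> metadata = new HashMap<>();
    MetadataField(String name) { this.name = name; }
}

// New style: the column carries its default as a typed member; no magic key,
// and the exist default is guaranteed to be an already-evaluated literal.
class TypedColumn {
    final String name;
    final Object existenceDefault; // a v2 literal in the real API
    TypedColumn(String name, Object existenceDefault) {
        this.name = name;
        this.existenceDefault = existenceDefault;
    }
}

class ApiContrast {
    static final String MAGIC_KEY = "EXISTS_DEFAULT"; // must match exactly everywhere

    static Object defaultFromMetadata(MetadataField f) {
        String raw = f.metadata.get(MAGIC_KEY);
        return raw == null ? null : Integer.valueOf(raw); // parsing can fail at runtime
    }

    static Object defaultFromColumn(TypedColumn c) {
        return c.existenceDefault; // type-safe, no key lookup or parsing
    }

    public static void main(String[] args) {
        MetadataField f = new MetadataField("col");
        f.metadata.put(MAGIC_KEY, "42");
        System.out.println(defaultFromMetadata(f)); // 42
        System.out.println(defaultFromColumn(new TypedColumn("col", 42))); // 42
    }
}
```

The typed form also extends naturally: adding another special column kind (e.g. generated columns) becomes another member on the column abstraction rather than another well-known metadata key string.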