
[SPARK-42398][SQL] Refine default column value DS v2 interface #40049

Closed
wants to merge 15 commits into from

Conversation

cloud-fan (Contributor)

What changes were proposed in this pull request?

The current default value DS v2 API is inconsistent. The createTable API only takes a StructType, so implementations must know the special metadata key of the default value in order to access it, while the TableChange API carries the default value as an individual field.

This PR adds a new Column interface, which holds both the current default (as a SQL string) and the exist default (as a v2 literal). The createTable API now takes Column. This removes the need for a special metadata key and is more extensible when adding more special columns such as generated columns. It is also type-safe, guaranteeing that the exist default is a literal. Implementations remain free to decide how to encode and store default values. Note: backward compatibility is preserved.
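To make the shape of the change concrete, here is a minimal sketch. These are simplified stand-ins, not the real interfaces (which live in org.apache.spark.sql.connector.catalog); the names follow the PR description but everything else is illustrative:

```scala
// Hypothetical, simplified stand-ins for the v2 API shapes described above.
final case class LiteralValue[T](value: T, dataType: String)

// Holds both parts of a default value: the original SQL text (current default)
// and the pre-evaluated literal (exist default) used to back-fill fields
// missing from storage.
final case class ColumnDefaultValue(sql: String, value: LiteralValue[_])

final case class Column(
    name: String,
    dataType: String,
    nullable: Boolean = true,
    defaultValue: Option[ColumnDefaultValue] = None)

// createTable now receives Column directly, so implementations no longer need
// to dig a magic metadata key out of a StructType.
def describeCreateTable(columns: Seq[Column]): String =
  columns.map { c =>
    val d = c.defaultValue.map(dv => s" DEFAULT ${dv.sql}").getOrElse("")
    s"${c.name} ${c.dataType}$d"
  }.mkString(", ")
```

For example, a column created as `col INT DEFAULT 42` would carry both the string "42" and the literal 42 in one typed object instead of stringly-typed metadata.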

Why are the changes needed?

A better DS v2 API for default values.

Does this PR introduce any user-facing change?

no

How was this patch tested?

existing tests

@dtenedor (Contributor) left a comment:

This is looking pretty close to ready.

@@ -3108,7 +3108,7 @@ object SQLConf {
       "provided values when the corresponding fields are not present in storage.")
     .version("3.4.0")
     .stringConf
-    .createWithDefault("csv,json,orc,parquet")
+    .createWithDefault("csv,json,orc,parquet,hive")
Contributor:

Is this safe? What data source operator implements the hive provider? Does it support filling in the existence default values? Do we have any default-value test cases in this PR where the table is using hive?

Member:

+1

import org.apache.spark.sql.connector.catalog.{Column, ColumnDefaultValue}
import org.apache.spark.sql.types.DataType

// The default implementation of v2 column.
Contributor:

Suggested change
// The default implementation of v2 column.
// The standard concrete implementation of data source V2 column.

*/
public class ColumnDefaultValue {
private String sql;
private Literal<?> value;
Member:

The name value is confusing. Shall we rename it to initialValue? Or rename sql and value to currentDefault and existingDefault?

cloud-fan (Contributor, Author):

A default value has two parts: the SQL string and the evaluated literal value. I don't think currentDefault and existingDefault are easier for data source developers to understand.

cloud-fan (Contributor, Author):

Can you also read the classdoc? If you still think the name is confusing, let's figure out a better one.

Member:

Actually, I had read the classdoc before commenting... I don't have a better suggestion. Let's enhance the doc later.

@dtenedor (Contributor), Feb 17, 2023:

Data source developers only have to think about the existence default value. For any column where the corresponding field is not present in storage, the data source is responsible for filling this in instead of NULL.

On the other hand, the current default value is for DML only. The analyzer inserts this expression for any explicit reference to DEFAULT, or for a small subset of implicit cases.

For these fields we could clarify with comments, e.g.

  // This is the original string contents of the SQL expression specified at the
  // time the column was created in a CREATE TABLE, REPLACE TABLE, or ALTER TABLE
  // ADD COLUMN command. For example, for "CREATE TABLE t (col INT DEFAULT 42)",
  // this field is equal to the string literal "42" (without quotation marks).
  private String sql;
  // This is the literal value corresponding to the above SQL string. For the above
  // example, this would be a literal integer with a value of 42.
  private Literal<?> value;
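The reader-side contract described above (fill in the existence default instead of NULL when a stored row predates the column) can be sketched as follows. This is an illustrative model, not Spark's actual scan API; the type and method names are made up:

```scala
// Hypothetical column spec: the existence default is the pre-evaluated
// literal a data source must use for fields absent from storage.
final case class ColumnSpec(name: String, existenceDefault: Option[Any])

// Back-fill a stored row against the current table schema: fields present
// in storage win; absent fields get the existence default, else NULL.
def backfillRow(stored: Map[String, Any], schema: Seq[ColumnSpec]): Seq[Any] =
  schema.map { col =>
    stored.get(col.name) match {
      case Some(v) => v                            // field present in storage
      case None    => col.existenceDefault.orNull  // absent: exist default, else NULL
    }
  }
```

The current default, by contrast, never reaches this code path: the analyzer substitutes it into DML before execution.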


createTable(ident, CatalogV2Util.v2ColumnsToStructType(columns), partitions, properties)
}

// TODO: remove it when no tests calling this deprecated method.
Member:

Is there a follow-up ticket for this?

cloud-fan (Contributor, Author):

I haven't created one yet. This is a test-only change, so the priority is low.
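The backward-compatibility shim under discussion (the deprecated StructType-based createTable delegating to the new Column-based one) can be sketched like this. The helper name in Spark is CatalogV2Util.v2ColumnsToStructType; everything below is a simplified stand-in:

```scala
// Hypothetical, simplified shapes: a v2 column and a legacy struct field
// whose metadata carries the default value under the "special metadata key".
final case class V2Column(name: String, dataType: String, defaultSql: Option[String])
final case class Field(name: String, dataType: String, metadata: Map[String, String])

val CurrentDefaultKey = "CURRENT_DEFAULT" // illustrative metadata key name

// Convert v2 columns back to legacy fields so the deprecated createTable
// overload keeps working for old implementations and tests.
def v2ColumnsToFields(cols: Seq[V2Column]): Seq[Field] =
  cols.map { c =>
    val meta = c.defaultSql
      .map(sql => Map(CurrentDefaultKey -> sql))
      .getOrElse(Map.empty[String, String])
    Field(c.name, c.dataType, meta)
  }
```

This direction (Column to StructType) is lossless for the SQL string, which is why the deprecated path can be kept alive for tests until they migrate.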

cloud-fan and others added 3 commits February 18, 2023 01:51
…s/logical/statements.scala

Co-authored-by: Daniel Tenedorio <daniel.tenedorio@databricks.com>
…alog/CatalogV2Util.scala

Co-authored-by: Daniel Tenedorio <daniel.tenedorio@databricks.com>

@cloud-fan (Contributor, Author):

thanks for the review, merging to master/3.4!

cloud-fan added a commit that referenced this pull request Feb 20, 2023

Closes #40049 from cloud-fan/table2.

Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 70a098c)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@cloud-fan cloud-fan closed this in 70a098c Feb 20, 2023
@dongjoon-hyun (Member) left a comment:

/**
* Create a table in the catalog.
* <p>
* This is deprecated. Please override
Member:

Could you add a deprecation version explicitly?
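As requested above, a deprecation should carry an explicit version. In Scala this is the second argument of the @deprecated annotation (Java APIs would pair @Deprecated with an @deprecated javadoc tag instead). A minimal sketch with a made-up class:

```scala
// Hypothetical catalog class, purely to show the annotation shape.
class Catalog {
  // The "since" version documents when the old overload was deprecated.
  @deprecated("Use createTable with Column[] instead", "3.4.0")
  def createTable(schemaDdl: String): String = s"created: $schemaDdl"
}
```

Compilers can then report both the replacement and the version in the deprecation warning.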

@dongjoon-hyun (Member):

Interestingly, it passed locally while GitHub Action jobs keep failing.

$ build/sbt "sql/testOnly *.OrcSourceV1Suite -- -z SPARK-11412"
...
[info] All tests passed.
[success] Total time: 23 s, completed Feb 20, 2023, 12:54:28 PM

@dongjoon-hyun (Member):

Since this currently happens only in GitHub Actions, I made a WIP PR for further investigation. If it's valid, I'll convert it into an official PR separate from this one.

@dongjoon-hyun (Member):

I closed my PR because the failure seems to start earlier than this commit.

dongjoon-hyun pushed a commit that referenced this pull request Mar 10, 2023
…de the new createTable method

### What changes were proposed in this pull request?

This is a followup of #40049 to fix a small issue: `DelegatingCatalogExtension` should also override the new `createTable` function and call the session catalog, instead of using the default implementation.

### Why are the changes needed?

bug fix

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

N/A, too trivial.

Closes #40369 from cloud-fan/api.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
dongjoon-hyun pushed a commit that referenced this pull request Mar 10, 2023
…de the new createTable method


Closes #40369 from cloud-fan/api.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 061bd92)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023

Closes apache#40049 from cloud-fan/table2.

Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 70a098c)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023
…de the new createTable method


Closes apache#40369 from cloud-fan/api.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 061bd92)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
cloud-fan pushed a commit that referenced this pull request Jun 6, 2024
… Add user facing error

### What changes were proposed in this pull request?

FIRST CHANGE

Pass correct parameter list to `org.apache.spark.sql.catalyst.util.ResolveDefaultColumns#analyze` when it is invoked from `org.apache.spark.sql.connector.catalog.CatalogV2Util#structFieldToV2Column`.

The `org.apache.spark.sql.catalyst.util.ResolveDefaultColumns#analyze` method accepts 3 parameters:

1) Field to analyze
2) Statement type - String
3) Metadata key - CURRENT_DEFAULT or EXISTS_DEFAULT

The method `org.apache.spark.sql.connector.catalog.CatalogV2Util#structFieldToV2Column` passes `fieldToAnalyze` and `EXISTS_DEFAULT`, so `EXISTS_DEFAULT` lands in the second parameter: it is treated as the statement type rather than the metadata key, and a different expression is analyzed.

Pull requests where the original change was introduced:
#40049 - Initial commit
#44876 - Refactor that did not touch the issue
#44935 - Another refactor that did not touch the issue
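The bug pattern described above can be reconstructed in miniature (the real API is ResolveDefaultColumns#analyze; the method below is a simplified stand-in): several same-typed positional parameters let a caller put an argument in the wrong slot without any compile error.

```scala
// Hypothetical analyze method mirroring the parameter list described above:
// a defaulted metadata key makes the positional mix-up silent.
def analyze(
    fieldName: String,
    statementType: String,
    metadataKey: String = "CURRENT_DEFAULT"): String =
  s"analyzing $metadataKey of $fieldName for $statementType"

// Buggy call: the caller meant EXISTS_DEFAULT as the metadata key, but it
// lands in statementType, so CURRENT_DEFAULT is analyzed instead.
val buggy = analyze("col", "EXISTS_DEFAULT")

// Fixed call: named arguments make each slot explicit.
val fixed = analyze(fieldName = "col", statementType = "ALTER TABLE",
  metadataKey = "EXISTS_DEFAULT")
```

Named (or strongly typed) parameters are the usual defense against this class of bug, which is exactly why it survived two refactors unnoticed.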

SECOND CHANGE

Add a user-facing exception when the default value is not foldable or resolved. Otherwise, the user would see the message "You hit a bug in Spark ...".

### Why are the changes needed?
The correct value needs to be passed to the `Column` object.

### Does this PR introduce _any_ user-facing change?
Yes, this is a bug fix: the existence default value now has the proper expression, whereas before this change it was actually the current default value of the column.

### How was this patch tested?
Unit test

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46594 from urosstan-db/SPARK-48286-Analyze-exists-default-expression-instead-of-current-default-expression.

Lead-authored-by: Uros Stankovic <uros.stankovic@databricks.com>
Co-authored-by: Uros Stankovic <155642965+urosstan-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit that referenced this pull request Jun 6, 2024
… Add user facing error


Closes #46594 from urosstan-db/SPARK-48286-Analyze-exists-default-expression-instead-of-current-default-expression.

Lead-authored-by: Uros Stankovic <uros.stankovic@databricks.com>
Co-authored-by: Uros Stankovic <155642965+urosstan-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 0f21df0)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>