Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-48067][SQL] Fix variant default columns #46312

Closed

Conversation

richardc-db
Copy link
Contributor

@richardc-db richardc-db commented May 1, 2024

What changes were proposed in this pull request?

Changes the literal sql representation of a variant value to parse_json(variant.toJson). This is because there is no other representation of a literal variant.

This allows variant default columns to work because default columns store a literal string representation in the schema struct fields metadata as the default value.

Why are the changes needed?

previously we could not set a variant default column like

create table t(
        v6 variant default parse_json('{\"k\": \"v\"}')
)

Does this PR introduce any user-facing change?

no

How was this patch tested?

added UT

Was this patch authored or co-authored using generative AI tooling?

no

@github-actions github-actions bot added the SQL label May 1, 2024
@@ -84,9 +84,16 @@ object ResolveDefaultColumns extends QueryErrorsBase
if (SQLConf.get.enableDefaultColumns) {
val newFields: Seq[StructField] = tableSchema.fields.map { field =>
if (field.metadata.contains(CURRENT_DEFAULT_COLUMN_METADATA_KEY)) {
val analyzed: Expression = analyze(field, statementType)
val defaultSql: String = if (field.dataType.isInstanceOf[VariantType]) {
// A variant's SQL/string representation is its JSON string which cannot be directly
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sounds like we should implement the type coercion instead of doing this approach. Then we can correctly store the variant value in as the column default expression rather than a string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, i believe all default values are already stored as string sql expressions (usually literal expressions). I.e. the existing values (as a string sql expression) is retrieved here and is analyzed using analyze and eventually coerced into the target type here. We could "coerce" the variant string into a variant by wrapping it with parse_json as mentioned in the PR description. However, this feels inefficient. We would evaluate the default expression to produce a variant, transform it to its string representation (by expr.sql) and then transform it back to its variant value later. Instead, we could just leave the expression as is and lazily evaluate it to avoid the extra variant-> string and string -> variant conversions.

Regarding your other comment, I believe this doesn't work even if ParseJson can be constant folded (I quickly tried it) because analyze attempts to analyze the expression as a variant {"k": "v"}, which isn't valid sql syntax (there aren't even quotations around it). The root of the problem here is that there isn't a way to represent variant types as a literal

@dtenedor
Copy link
Contributor

dtenedor commented May 1, 2024

previously we could not set a variant default column like

create table t(
        v6 variant default parse_json('{\"k\": \"v\"}')
)

@richardc-db this would work if the ParseJson expression was constant-foldable. Is there a reason we cannot do that?

Here's an example of marking an expression as constant-foldable:

override def foldable: Boolean = deterministic && children.forall(_.foldable)

@gengliangwang
Copy link
Member

Sync with @richardc-db offline. I suggest changing the .sql method of variant expressions instead.

@@ -549,6 +549,7 @@ case class Literal (value: Any, dataType: DataType) extends LeafExpression {
s"${Literal(kv._1, mapType.keyType).sql}, ${Literal(kv._2, mapType.valueType).sql}"
}
s"MAP(${keysAndValues.mkString(", ")})"
case (v: VariantVal, variantType: VariantType) => s"PARSE_JSON('${v.toJson(timeZoneId)}')"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add some test cases for the change here.

@gengliangwang
Copy link
Member

Thanks, merging to master

JacobZheng0927 pushed a commit to JacobZheng0927/spark that referenced this pull request May 11, 2024
### What changes were proposed in this pull request?

Changes the literal `sql` representation of a variant value to `parse_json(variant.toJson)`. This is because there is no other representation of a literal variant.

This allows variant default columns to work because default columns store a literal string representation in the schema struct fields metadata as the default value.

### Why are the changes needed?

previously we could not set a variant default column like
```
create table t(
        v6 variant default parse_json('{\"k\": \"v\"}')
)
```

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

added UT

### Was this patch authored or co-authored using generative AI tooling?

no

Closes apache#46312 from richardc-db/fix_variant_default_cols.

Authored-by: Richard Chen <r.chen@databricks.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
3 participants