[SPARK-47681][SQL] Add schema_of_variant expression. #45806

Closed · wants to merge 5 commits

Conversation

@chenhao-db (Contributor) commented Apr 2, 2024

What changes were proposed in this pull request?

This PR adds a new SchemaOfVariant expression. It returns the schema of a variant value in SQL DDL format.

Usage examples:

> SELECT schema_of_variant(parse_json('null'));
 VOID
> SELECT schema_of_variant(parse_json('[{"b":true,"a":0}]'));
 ARRAY<STRUCT<a: BIGINT, b: BOOLEAN>>
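
For reference, a minimal Scala sketch that exercises the expression through SparkSession.sql (assumes a Spark build containing this PR; the expected result is copied from the example above):

import org.apache.spark.sql.SparkSession

object SchemaOfVariantDemo extends App {
  val spark = SparkSession.builder()
    .master("local[1]")
    .appName("schema_of_variant demo")
    .getOrCreate()
  // Same query as the second usage example above.
  spark.sql("""SELECT schema_of_variant(parse_json('[{"b":true,"a":0}]'))""")
    .show(truncate = false)
  // Expected single-row result: ARRAY<STRUCT<a: BIGINT, b: BOOLEAN>>
  spark.stop()
}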

Why are the changes needed?

This expression helps users explore the contents of variant values.

Does this PR introduce any user-facing change?

Yes. A new SQL expression is added.

How was this patch tested?

Unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

github-actions bot added the SQL label Apr 2, 2024
chenhao-db changed the title from [SPARK-47680][SQL] Add variant_schema expression. to [SPARK-47681][SQL] Add variant_schema expression. on Apr 2, 2024
chenhao-db changed the title from [SPARK-47681][SQL] Add variant_schema expression. to [SPARK-47681][SQL] Add schema_of_variant expression. on Apr 2, 2024
 * be sorted alphabetically.
 */
def mergeSchema(t1: DataType, t2: DataType): DataType = (t1, t2) match {
  case (t1, t2) if t1 == t2 => t1
Member commented:

Can we reuse the TypeCoercion rules here?

@chenhao-db (Contributor, Author) commented Apr 3, 2024

I'd personally like to keep the current code. I don't think TypeCoercion contains a suitable rule that can be used directly here. If we use findTightestCommonType, we still need most of the code that handles decimal/struct/array, so it can hardly simplify the code. If we use any function that calls findWiderTypeForDecimal (like findWiderTypeForTwo), its semantics are undesirable: if the wider decimal type exceeds the system limit, the rule truncates the decimal type (and we still need custom code for struct/array). Using these rules may also fruitlessly visit the whole type object, forcing a second visiting pass. Since this function is used during expression evaluation, I think we do care about its efficiency.

Essentially, mergeSchema only needs to handle the results of mergeSchema and schemaOf, and we have better control over them if all the type-resolution logic lives inside this function and avoids calling any library functions. See the sketch below.
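
For illustration, a rough sketch of the kind of hand-rolled merge being defended here (names mirror the snippet above; the struct and array branches are simplified, and VariantType is assumed available on this PR's branch — the actual code also merges structs with differing field sets and, per the doc comment, sorts fields alphabetically):

import org.apache.spark.sql.types._

def mergeSchemaSketch(t1: DataType, t2: DataType): DataType = (t1, t2) match {
  case (t1, t2) if t1 == t2 => t1
  case (NullType, t) => t
  case (t, NullType) => t
  case (ArrayType(e1, n1), ArrayType(e2, n2)) =>
    // Merge element types recursively; the result allows nulls if either side does.
    ArrayType(mergeSchemaSketch(e1, e2), n1 || n2)
  case (s1: StructType, s2: StructType) if s1.fieldNames.sameElements(s2.fieldNames) =>
    // Simplified: assumes both structs already have the same, sorted field names.
    StructType(s1.fields.zip(s2.fields).map { case (f1, f2) =>
      StructField(f1.name, mergeSchemaSketch(f1.dataType, f2.dataType),
        f1.nullable || f2.nullable)
    })
  // (Decimal widening is handled by a dedicated branch, shown in a later hunk.)
  case _ =>
    // No common type: fall back to VARIANT so the value can still be cast
    // to the inferred schema.
    VariantType
}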

case Type.ARRAY =>
  var elementType: DataType = NullType
  for (i <- 0 until v.arraySize()) {
    elementType = mergeSchema(elementType, schemaOf(v.getElementAtIndex(i)))
Contributor commented:

Interesting, so the variant format spec does not require array elements to have the same data type.

@chenhao-db (Contributor, Author) commented Apr 8, 2024

This is true. It is necessary because JSON allows such flexibility.
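
To illustrate (hypothetical Scala snippet against an existing `spark` session; the printed schemas are what I would expect from the merge rules in this PR, not verified output):

// Mixed numeric element types merge into a common numeric type, while
// fully incompatible element types fall back to VARIANT.
spark.sql("SELECT schema_of_variant(parse_json('[1, 2.5]'))").show(false)
// expected along the lines of: ARRAY<DECIMAL(21,1)>
spark.sql("""SELECT schema_of_variant(parse_json('[1, "a"]'))""").show(false)
// expected along the lines of: ARRAY<VARIANT>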

case (t1: DecimalType, t2: DecimalType) =>
  val scale = math.max(t1.scale, t2.scale)
  val range = math.max(t1.precision - t1.scale, t2.precision - t2.scale)
  if (range + scale > DecimalType.MAX_PRECISION) {
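    // Continuation sketch (mirroring JsonInferSchema, not the verbatim diff):
    // the merged decimal would overflow, so fall back to double; the cast to
    // the inferred schema then still succeeds, at the cost of precision.
    DoubleType
  } else {
    // e.g. DECIMAL(20,5) + DECIMAL(25,20): scale = 20, range = max(15, 5) = 15,
    // yielding DECIMAL(35,20); DECIMAL(38,0) + DECIMAL(38,38) would need
    // DECIMAL(76,38) and therefore falls back to DOUBLE above.
    DecimalType(range + scale, scale)
  }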
Contributor commented:

The problem with not reusing existing code is that we need to reason about the behavior difference. Why do you think it's better to return double than to round the decimal? Double is less accurate than a rounded decimal.

Contributor commented:

Does AnsiTypeCoercion.findWiderTypeForTwo(...).getOrElse(VariantType) work?

@chenhao-db (Contributor, Author) commented:

I think it will be more intuitive if the variant can always be successfully cast to the inferred schema (even at the cost of precision loss). If we return a truncated decimal here, the cast will deterministically fail.

Actually, this mergeSchema function, including the fallback-to-double logic, is largely adapted from the existing JSON schema inference code: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala#L364. I found it is not too difficult to reuse that code, so I changed the implementation to depend on it instead.
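
Concretely, the reuse could look something like this (hypothetical call site; the incompatibleType parameter comes from the signature change shown in the next hunk):

import org.apache.spark.sql.catalyst.json.JsonInferSchema
import org.apache.spark.sql.types.{DataType, VariantType}

// Delegate all type merging to the shared JSON inference helper, but fall
// back to VARIANT instead of STRING when no common type exists, so the
// variant value can always be cast to the inferred schema.
def mergeSchema(t1: DataType, t2: DataType): DataType =
  JsonInferSchema.compatibleType(t1, t2, VariantType)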

 */
-def compatibleType(t1: DataType, t2: DataType): DataType = {
+def compatibleType(
+    t1: DataType, t2: DataType, incompatibleType: DataType = StringType): DataType = {
@cloud-fan (Contributor) commented Apr 8, 2024

shall we call it defaultDataType?

@cloud-fan (Contributor) commented:
thanks, merging to master!
