[SPARK-47681][SQL] Add schema_of_variant expression. #45806

Closed · wants to merge 5 commits

Conversation

@chenhao-db (Contributor) commented Apr 2, 2024

What changes were proposed in this pull request?

This PR adds a new SchemaOfVariant expression. It returns the schema of a variant value in SQL DDL format.

Usage examples:

> SELECT schema_of_variant(parse_json('null'));
 VOID
> SELECT schema_of_variant(parse_json('[{"b":true,"a":0}]'));
 ARRAY<STRUCT<a: BIGINT, b: BOOLEAN>>
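
For reference, a minimal Scala sketch that exercises the expression through SparkSession.sql (assumes a Spark build containing this PR; the expected result is copied from the example above):

import org.apache.spark.sql.SparkSession

object SchemaOfVariantDemo extends App {
  val spark = SparkSession.builder()
    .master("local[1]")
    .appName("schema_of_variant demo")
    .getOrCreate()
  // Same query as the second usage example above.
  spark.sql("""SELECT schema_of_variant(parse_json('[{"b":true,"a":0}]'))""")
    .show(truncate = false)
  // Expected single-row result: ARRAY<STRUCT<a: BIGINT, b: BOOLEAN>>
  spark.stop()
}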

Why are the changes needed?

This expression helps users explore the contents of variant values.

Does this PR introduce any user-facing change?

Yes. A new SQL expression is added.

How was this patch tested?

Unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

github-actions bot added the SQL label Apr 2, 2024
chenhao-db changed the title from [SPARK-47680][SQL] Add variant_schema expression. to [SPARK-47681][SQL] Add variant_schema expression. on Apr 2, 2024
chenhao-db changed the title from [SPARK-47681][SQL] Add variant_schema expression. to [SPARK-47681][SQL] Add schema_of_variant expression. on Apr 2, 2024
 * be sorted alphabetically.
 */
def mergeSchema(t1: DataType, t2: DataType): DataType = (t1, t2) match {
  case (t1, t2) if t1 == t2 => t1
Member commented:

Can we reuse the TypeCoercion rules here?

@chenhao-db (Contributor, Author) commented Apr 3, 2024

I'd personally like to keep the current code. I don't think TypeCoercion contains a suitable rule that can be used directly here. If we use findTightestCommonType, we still need most of the code that handles decimal/struct/array, so it can hardly simplify the code. If we use any function that calls findWiderTypeForDecimal (like findWiderTypeForTwo), its semantics are undesirable: if the wider decimal type exceeds the system limit, the rule truncates the decimal type (and we still need custom code for struct/array). Using these rules may also fruitlessly visit the whole type object, forcing a second visiting pass. Since this function is used during expression evaluation, I think we do care about its efficiency.

Essentially, mergeSchema only needs to handle the results of mergeSchema and schemaOf, and we have better control over them if all the type-resolution logic lives inside this function and avoids calling any library functions. See the sketch below.
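
For illustration, a rough sketch of the kind of hand-rolled merge being defended here (names mirror the snippet above; the struct and array branches are simplified, and VariantType is assumed available on this PR's branch — the actual code also merges structs with differing field sets and, per the doc comment, sorts fields alphabetically):

import org.apache.spark.sql.types._

def mergeSchemaSketch(t1: DataType, t2: DataType): DataType = (t1, t2) match {
  case (t1, t2) if t1 == t2 => t1
  case (NullType, t) => t
  case (t, NullType) => t
  case (ArrayType(e1, n1), ArrayType(e2, n2)) =>
    // Merge element types recursively; the result allows nulls if either side does.
    ArrayType(mergeSchemaSketch(e1, e2), n1 || n2)
  case (s1: StructType, s2: StructType) if s1.fieldNames.sameElements(s2.fieldNames) =>
    // Simplified: assumes both structs already have the same, sorted field names.
    StructType(s1.fields.zip(s2.fields).map { case (f1, f2) =>
      StructField(f1.name, mergeSchemaSketch(f1.dataType, f2.dataType),
        f1.nullable || f2.nullable)
    })
  // (Decimal widening is handled by a dedicated branch, shown in a later hunk.)
  case _ =>
    // No common type: fall back to VARIANT so the value can still be cast
    // to the inferred schema.
    VariantType
}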

case Type.ARRAY =>
  var elementType: DataType = NullType
  for (i <- 0 until v.arraySize()) {
    elementType = mergeSchema(elementType, schemaOf(v.getElementAtIndex(i)))
Contributor commented:

Interesting, so the variant format spec does not require array elements to have the same data type.

@chenhao-db (Contributor, Author) commented Apr 8, 2024

This is true. It is necessary because JSON allows such flexibility.
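
To illustrate (hypothetical Scala snippet against an existing `spark` session; the printed schemas are what I would expect from the merge rules in this PR, not verified output):

// Mixed numeric element types merge into a common numeric type, while
// fully incompatible element types fall back to VARIANT.
spark.sql("SELECT schema_of_variant(parse_json('[1, 2.5]'))").show(false)
// expected along the lines of: ARRAY<DECIMAL(21,1)>
spark.sql("""SELECT schema_of_variant(parse_json('[1, "a"]'))""").show(false)
// expected along the lines of: ARRAY<VARIANT>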

case (t1: DecimalType, t2: DecimalType) =>
  val scale = math.max(t1.scale, t2.scale)
  val range = math.max(t1.precision - t1.scale, t2.precision - t2.scale)
  if (range + scale > DecimalType.MAX_PRECISION) {
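    // Continuation sketch (mirroring JsonInferSchema, not the verbatim diff):
    // the merged decimal would overflow, so fall back to double; the cast to
    // the inferred schema then still succeeds, at the cost of precision.
    DoubleType
  } else {
    // e.g. DECIMAL(20,5) + DECIMAL(25,20): scale = 20, range = max(15, 5) = 15,
    // yielding DECIMAL(35,20); DECIMAL(38,0) + DECIMAL(38,38) would need
    // DECIMAL(76,38) and therefore falls back to DOUBLE above.
    DecimalType(range + scale, scale)
  }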
Contributor commented:

The problem with not reusing existing code is that we need to reason about the behavior difference. Why do you think it's better to return double than to round the decimal? Double is less accurate than a rounded decimal.

Contributor commented:

Does AnsiTypeCoercion.findWiderTypeForTwo(...).getOrElse(VariantType) work?

@chenhao-db (Contributor, Author) commented:

I think it will be more intuitive if the variant can always be successfully cast to the inferred schema (even at the cost of precision loss). If we return a truncated decimal here, the cast will deterministically fail.

Actually, this mergeSchema function, including the fallback-to-double logic, is largely adapted from the existing JSON schema inference code: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala#L364. I found it is not too difficult to reuse that code, so I changed the implementation to depend on it instead.
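
Concretely, the reuse could look something like this (hypothetical call site; the incompatibleType parameter comes from the signature change shown in the next hunk):

import org.apache.spark.sql.catalyst.json.JsonInferSchema
import org.apache.spark.sql.types.{DataType, VariantType}

// Delegate all type merging to the shared JSON inference helper, but fall
// back to VARIANT instead of STRING when no common type exists, so the
// variant value can always be cast to the inferred schema.
def mergeSchema(t1: DataType, t2: DataType): DataType =
  JsonInferSchema.compatibleType(t1, t2, VariantType)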

 */
-def compatibleType(t1: DataType, t2: DataType): DataType = {
+def compatibleType(
+    t1: DataType, t2: DataType, incompatibleType: DataType = StringType): DataType = {
@cloud-fan (Contributor) commented Apr 8, 2024

shall we call it defaultDataType?

@cloud-fan (Contributor) commented:
thanks, merging to master!
