[SPARK-47681][SQL] Add schema_of_variant expression. #45806
Conversation
 * be sorted alphabetically.
 */
def mergeSchema(t1: DataType, t2: DataType): DataType = (t1, t2) match {
  case (t1, t2) if t1 == t2 => t1
Can we reuse the TypeCoercion rules here?
I personally prefer to keep the current code. I don't think TypeCoercion contains a suitable rule that can be directly reused here. If we use findTightestCommonType, we still need most of the code that handles decimal/struct/array, so it can hardly simplify anything. If we use any function that calls findWiderTypeForDecimal (such as findWiderTypeForTwo), its semantics will be undesirable: if the wider decimal type exceeds the system limit, that rule truncates the decimal type (and we still need custom code for struct/array). Using these rules may also fruitlessly visit the whole type object and then require a second pass. Since this function is used during expression evaluation, I think we do care about its efficiency.
Essentially, mergeSchema only needs to handle the results of mergeSchema and schemaOf, and we have better control over them if all the type-resolution logic lives inside this function and we avoid calling any library functions.
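To illustrate the "single recursive pass, no library coercion calls" argument above, here is a minimal self-contained sketch. The types (Ty, NullTy, ArrayTy, VariantTy, etc.) are simplified stand-ins for Spark's DataType hierarchy, not the actual classes in the PR:

```scala
// Simplified stand-ins for Spark's DataType hierarchy (illustration only).
sealed trait Ty
case object NullTy extends Ty
case object LongTy extends Ty
case object StringTy extends Ty
case class ArrayTy(element: Ty) extends Ty
case object VariantTy extends Ty // fallback for incompatible types

// One recursive visit over both types; no coercion-rule library calls,
// so no second pass over the type object is needed.
def mergeSchema(t1: Ty, t2: Ty): Ty = (t1, t2) match {
  case (a, b) if a == b           => a
  case (NullTy, t)                => t
  case (t, NullTy)                => t
  case (ArrayTy(e1), ArrayTy(e2)) => ArrayTy(mergeSchema(e1, e2))
  case _                          => VariantTy
}
```

The key property is that incompatible branches collapse to the variant type instead of failing, so the merge is total over any pair of inputs.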
case Type.ARRAY =>
  var elementType: DataType = NullType
  for (i <- 0 until v.arraySize()) {
    elementType = mergeSchema(elementType, schemaOf(v.getElementAtIndex(i)))
Interesting, so the variant format spec does not require the array elements to have the same data type.
This is true. It is necessary because JSON allows such flexibility.
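Because elements may have mixed types, the loop in the diff is effectively a fold of the merge function over the element schemas, starting from the null type. A self-contained sketch of that fold, using hypothetical stand-in element types rather than Spark's DataType:

```scala
// Hypothetical element types for illustration; not Spark's classes.
sealed trait ElemTy
case object NullE extends ElemTy
case object LongE extends ElemTy
case object DoubleE extends ElemTy
case object VariantE extends ElemTy // fallback when elements are incompatible

def merge(a: ElemTy, b: ElemTy): ElemTy = (a, b) match {
  case (x, y) if x == y                    => x
  case (NullE, t)                          => t
  case (t, NullE)                          => t
  case (LongE, DoubleE) | (DoubleE, LongE) => DoubleE // integers widen to double
  case _                                   => VariantE
}

// Mirrors the loop above: start from the null type and merge each element's schema.
def arrayElementType(elems: Seq[ElemTy]): ElemTy = elems.foldLeft(NullE: ElemTy)(merge)
```

So a JSON array like [1, 2.5] would infer a double element type, while a fully heterogeneous array collapses to the variant fallback.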
case (t1: DecimalType, t2: DecimalType) =>
  val scale = math.max(t1.scale, t2.scale)
  val range = math.max(t1.precision - t1.scale, t2.precision - t2.scale)
  if (range + scale > DecimalType.MAX_PRECISION) {
The problem with not reusing existing code is that we need to reason about the behavior differences. Why do you think it's better to return double than to round the decimal? Double is less accurate than a rounded decimal.
Does AnsiTypeCoercion.findWiderTypeForTwo(...).getOrElse(VariantType) work?
I think it will be more intuitive if the variant can always be successfully cast into the inferred schema (even at the cost of a precision loss). If we return a truncated decimal here, the cast will deterministically fail.
Actually, this mergeSchema function, including the fallback-to-double logic, is largely adapted from the existing JSON schema inference code: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala#L364. I found it was not too difficult to reuse that code, so I changed the implementation to depend on it instead.
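The decimal arithmetic under discussion can be sketched as follows. MaxPrecision stands in for DecimalType.MAX_PRECISION (38), and widenDecimal is a hypothetical helper, not Spark API; returning None models the fall-back-to-double branch instead of truncating the decimal:

```scala
// Assumed constant mirroring DecimalType.MAX_PRECISION in Spark.
val MaxPrecision = 38

// Returns Some((precision, scale)) for the widened decimal, or None when the
// result would exceed the system limit, in which case inference falls back to
// double so that the original values can still be cast (with precision loss).
def widenDecimal(p1: Int, s1: Int, p2: Int, s2: Int): Option[(Int, Int)] = {
  val scale = math.max(s1, s2)           // keep the larger fractional part
  val range = math.max(p1 - s1, p2 - s2) // keep the larger integral part
  if (range + scale > MaxPrecision) None else Some((range + scale, scale))
}
```

For example, merging decimal(5, 2) with decimal(10, 0) needs 10 integral digits and 2 fractional digits, which fits, while merging decimal(38, 0) with decimal(10, 10) would need 48 digits and overflows.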
 */
-def compatibleType(t1: DataType, t2: DataType): DataType = {
+def compatibleType(
+    t1: DataType, t2: DataType, incompatibleType: DataType = StringType): DataType = {
shall we call it defaultDataType?
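The point of the new parameter is that JSON schema inference can keep its historical behavior through the default argument (string as the incompatible-type fallback), while variant schema inference passes the variant type explicitly, letting both share one implementation. A toy sketch of that pattern, with illustrative stand-in types rather than Spark's DataType:

```scala
// Illustrative stand-in types; not Spark's classes.
sealed trait Ty
case object LongTy extends Ty
case object BooleanTy extends Ty
case object StringTy extends Ty
case object VariantTy extends Ty

// The default argument preserves the JSON-inference behavior (fall back to
// string); variant inference overrides it with the variant type.
def compatibleType(t1: Ty, t2: Ty, incompatibleType: Ty = StringTy): Ty =
  if (t1 == t2) t1 else incompatibleType
```

Usage: JSON inference calls compatibleType(LongTy, BooleanTy) and gets StringTy via the default, while variant inference calls compatibleType(LongTy, BooleanTy, VariantTy) and gets VariantTy.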
thanks, merging to master!
What changes were proposed in this pull request?
This PR adds a new SchemaOfVariant expression. It returns the schema of a variant value in SQL format.
Usage examples:
Why are the changes needed?
This expression can help the user explore the content of variant values.
Does this PR introduce any user-facing change?
Yes. A new SQL expression is added.
How was this patch tested?
Unit tests.
Was this patch authored or co-authored using generative AI tooling?
No.