[SPARK-30004][SQL] Allow merge UserDefinedType into a native DataType #26644
Conversation
ok to test

Thank you for making this PR, @Fokko .

My pleasure @dongjoon-hyun

Test build #114320 has finished for PR 26644 at commit
@@ -592,6 +592,9 @@ object StructType extends AbstractDataType {
    case (leftUdt: UserDefinedType[_], rightUdt: UserDefinedType[_])
      if leftUdt.userClass == rightUdt.userClass => leftUdt

    case (leftType, rightUdt: UserDefinedType[_])
Since this is beyond our existing rule, shall we update the function description accordingly?
I've added a Scaladoc to the private merge function. I think the Javadoc you're pointing to describes the function at a different level; for example, it doesn't mention UDTs at all. Let me know what you think.
val right = StructType(
  StructField("a", new CustomXMLGregorianCalendarType) :: Nil)

assert(left.merge(right) === left)
Shall we check the opposite case, too?
The current implementation isn't symmetrical: we can convert a UserDefinedType into a native DataType, but not the other way around. I'm hesitant to add that functionality because I don't see any obvious applications; please let me know if you think it should be added as well. I've added a test to check the opposite case too, including some additional comments to clarify the idea and how it works.
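To make the asymmetry concrete for readers following along, here is a self-contained sketch using simplified stand-in types (CalendarUDT and mergeSketch are illustrative names, not Spark's actual classes): merging a UDT with the native type it is stored as yields the native type, regardless of argument order, and the result is never promoted back into a UDT.

```scala
// Self-contained stand-ins for Spark's type hierarchy (illustrative only;
// the real classes live in org.apache.spark.sql.types).
sealed trait DataType
case object TimestampType extends DataType

// A UDT declares the native type it is stored as via `sqlType`.
trait UserDefinedType extends DataType { def sqlType: DataType }
case object CalendarUDT extends UserDefinedType {
  val sqlType: DataType = TimestampType
}

// Sketch of the rule under discussion: a UDT on either side collapses
// into its native sqlType; the result is never a UDT.
def mergeSketch(left: DataType, right: DataType): DataType = (left, right) match {
  case (l, r: UserDefinedType) if l == r.sqlType => l
  case (l: UserDefinedType, r) if l.sqlType == r => r
  case (l, r) if l == r => l
  case (l, r) => sys.error(s"Failed to merge incompatible data types $l and $r")
}
```

This mirrors the two test directions discussed here: both argument orders resolve to the native field type.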
sql/catalyst/src/test/scala/org/apache/spark/sql/types/DataTypeSuite.scala
Could you add end-to-end tests somewhere, e.g.,
Force-pushed from 2f2d676 to dd4c6c4
@dongjoon-hyun @HyukjinKwon @maropu Any further thoughts?

Rebased onto master
It looks like apache.org is unreachable:
In case you write a UDT, you always need to read it with the UDT registered. In many cases you want to write it and then convert it into a native DataType. In the case of Delta, or when appending a partition, you can write to the same table, and it then needs to be able to merge the UDT into the native type again. * Add a test to DataTypeSuite.scala
      if leftType == rightUdt.sqlType => leftType

    case (leftUdt: UserDefinedType[_], rightType)
      if leftUdt.sqlType == rightType => rightType
@Fokko, sorry for my late response. I doubt we should allow this case. Currently, merge only allows identical types, but a UDT and its SQL type are not the same type. I think it makes less sense to allow this case alone.

Also, this https://github.com/apache/spark/pull/26644/files#r350486670 looks weird. jsonValue seems like it should hold the JSON-serialized value of its own type.
Well, UserDefinedType extends DataType, similar to TimestampType, StringType, and any other type. The thing is that a UserDefinedType can be compatible with any other type. For example, it is allowed to merge an int into a long; this is an explicit choice by the developer.
it is allowed to merge an int into a long.
But this StructType.merge does not allow such type merging. Given that, it looks weird to allow only UDTs.
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._
scala> StructType.merge(LongType, IntegerType)
org.apache.spark.SparkException: Failed to merge incompatible data types bigint and int
at org.apache.spark.sql.types.StructType$.merge(StructType.scala:600)
... 49 elided
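The inconsistency being pointed out can be sketched with simplified stand-in types (illustrative names, not Spark's implementation): under the proposed rule a UDT merges with its declared sqlType, while two distinct native types such as long and int still fail to merge.

```scala
// Self-contained stand-ins (illustrative only; not Spark's actual code).
sealed trait DataType
case object LongType extends DataType
case object IntegerType extends DataType
case object TimestampType extends DataType

trait UserDefinedType extends DataType { def sqlType: DataType }
case object CalendarUDT extends UserDefinedType {
  val sqlType: DataType = TimestampType
}

// Stand-in for merge after the proposed change: identical types merge,
// a UDT merges with its declared sqlType, and everything else fails,
// including widening int into long, which merge has never allowed.
def mergeSketch(left: DataType, right: DataType): DataType = (left, right) match {
  case (l, r) if l == r => l
  case (l, r: UserDefinedType) if l == r.sqlType => l
  case (l: UserDefinedType, r) if l.sqlType == r => r
  case (l, r) => sys.error(s"Failed to merge incompatible data types $l and $r")
}
```

So the UDT case would carve out the only non-identical pair that merge accepts, which is the asymmetry of the objection.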
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way; it's just a way of keeping the PR queue manageable.

I've been thinking about this a lot, but could not come up with a clean solution. I'll leave it for now.
What changes were proposed in this pull request?
When you write a UDT, you always need to read it with the UDT registered. In many cases, you want to write it and then convert it into a native DataType.
In the case of Delta, or when appending a partition, you can write to the same table, and it then needs to be able to merge the UDT into the native type again.
Why are the changes needed?
When appending data to the table, I get the exception:
Does this PR introduce any user-facing change?
How was this patch tested?
https://jira.apache.org/jira/browse/SPARK-30004