[SPARK-30004][SQL] Allow merge UserDefinedType into a native DataType #26644

Closed
wants to merge 6 commits into from

@Fokko Fokko commented Nov 23, 2019

What changes were proposed in this pull request?

If you write data as a UDT, you always need to read it back with that UDT registered. In many cases, you want to write the data and then convert it into a native DataType.

With Delta, or when appending a partition, you can write to the same table again, and the schema merge then needs to be able to merge the UDT into the native type.

Why are the changes needed?

When appending data to the table, I get the exception:

Failed to merge fields 'START_DATE_MAINTENANCE_FLPL' and 'START_DATE_MAINTENANCE_FLPL'.
Failed to merge incompatible data types TimestampType and org.apache.spark.sql.types.CustomXMLGregorianCalendarType@5ff12345;;
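
To illustrate where this kind of failure comes from, here is a minimal sketch of a strict merge rule in plain Scala. It is a toy model, not Spark's StructType code: the small DataType hierarchy, the Udt case class, and the StrictMergeSketch object are hypothetical names made up for this example, and it assumes, as a simplification, that two leaf types merge only when they are equal.

```scala
// Toy model only: none of these names are Spark's classes.
object StrictMergeSketch {
  sealed trait DataType
  case object TimestampType extends DataType
  case object StringType extends DataType

  // Hypothetical stand-in for a UDT: it records the native type it is stored as.
  final case class Udt(name: String, sqlType: DataType) extends DataType

  // Strict rule: two types merge only if they are equal.
  def merge(left: DataType, right: DataType): Either[String, DataType] =
    if (left == right) Right(left)
    else Left(s"Failed to merge incompatible data types $left and $right")
}
```

Under this rule, merging TimestampType with a Udt whose sqlType is TimestampType returns a Left, because the two values are not equal even though they share the same storage type.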

Does this PR introduce any user-facing change?

How was this patch tested?

  • Add a unit test to the DataTypeSuite.scala
  • Add an integration test to UserDefinedTypeSuite.scala

https://jira.apache.org/jira/browse/SPARK-30004

@Fokko Fokko changed the title SPARK-30004: Allow merge UserDefinedType into normal DataType SPARK-30004: Allow merge UserDefinedType into a native DataType Nov 23, 2019
@dongjoon-hyun dongjoon-hyun changed the title SPARK-30004: Allow merge UserDefinedType into a native DataType [SPARK-30004][SQL] Allow merge UserDefinedType into a native DataType Nov 23, 2019
@dongjoon-hyun
Member

ok to test

@dongjoon-hyun
Member

Thank you for making this PR, @Fokko .

@Fokko
Contributor Author

Fokko commented Nov 23, 2019

My pleasure @dongjoon-hyun

SparkQA commented Nov 23, 2019

Test build #114320 has finished for PR 26644 at commit 741a070.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -592,6 +592,9 @@ object StructType extends AbstractDataType {
      case (leftUdt: UserDefinedType[_], rightUdt: UserDefinedType[_])
        if leftUdt.userClass == rightUdt.userClass => leftUdt

      case (leftType, rightUdt: UserDefinedType[_])
Member

Since this is beyond our existing rule, shall we update the function description accordingly?

Contributor Author

I've added a Scaladoc to the private merge function. I think the Javadoc that you're pointing to describes the function at a different level. For example, it doesn't mention UDTs at all. Let me know what you think.

    val right = StructType(
      StructField("a", new CustomXMLGregorianCalendarType) :: Nil)

    assert(left.merge(right) === left)
Member

@dongjoon-hyun dongjoon-hyun Nov 23, 2019

Shall we check the opposite case, too?

Contributor Author

The current implementation isn't symmetrical: we can convert a UserDefinedType to a native DataType, but not the other way around. I'm hesitant to add the reverse direction because I don't see any obvious applications. Please let me know if you think it should be added as well. I've added a test for the opposite case, along with some additional comments to clarify the idea and how it works.
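
For illustration, the rule under discussion can be modelled in plain Scala. This is a simplified sketch based on the UDT cases in the diff hunks quoted in this conversation, not the PR's actual code: the DataType hierarchy, the Udt case class, and the UdtMergeSketch object are hypothetical names made up for this example. It assumes that a UDT merges with a native type only when that native type equals the UDT's sqlType, and that the result is always the native type, never the UDT.

```scala
// Toy model only: none of these names are Spark's classes.
object UdtMergeSketch {
  sealed trait DataType
  case object TimestampType extends DataType
  case object StringType extends DataType

  // Hypothetical stand-in for a UDT: it records the native type it is stored as.
  final case class Udt(name: String, sqlType: DataType) extends DataType

  def merge(left: DataType, right: DataType): Either[String, DataType] =
    (left, right) match {
      // Equal types (including the same UDT on both sides) merge to themselves.
      case (l, r) if l == r => Right(l)
      // Native type on the left, UDT stored as that type on the right: keep the native type.
      case (l, Udt(_, sql)) if l == sql => Right(l)
      // UDT on the left, its storage type on the right: also fall back to the native type.
      case (Udt(_, sql), r) if sql == r => Right(r)
      // Everything else stays incompatible.
      case (l, r) => Left(s"Failed to merge incompatible data types $l and $r")
    }
}
```

In this model, merging TimestampType with a UDT stored as TimestampType yields TimestampType in either argument order, while two different UDTs that happen to share a storage type, or a UDT and an unrelated native type, are still rejected.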

@maropu
Member

maropu commented Nov 24, 2019

Could you add end-to-end tests somewhere, e.g., UserDefinedTypeSuite, SQLQuerySuite, ...?

SparkQA commented Nov 24, 2019

Test build #114344 has finished for PR 26644 at commit bc9a3c3.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Nov 24, 2019

Test build #114346 has finished for PR 26644 at commit 0da1628.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Nov 24, 2019

Test build #114347 has finished for PR 26644 at commit e7b449d.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Fokko Fokko force-pushed the SPARK-30004 branch 2 times, most recently from 2f2d676 to dd4c6c4 Compare November 24, 2019 13:28

SparkQA commented Nov 24, 2019

Test build #114350 has finished for PR 26644 at commit dd4c6c4.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Nov 24, 2019

Test build #114348 has finished for PR 26644 at commit 2f2d676.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Nov 24, 2019

Test build #114352 has finished for PR 26644 at commit 74f6951.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Nov 24, 2019

Test build #114353 has finished for PR 26644 at commit d1ca92a.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Nov 24, 2019

Test build #114351 has finished for PR 26644 at commit a1088d0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Nov 24, 2019

Test build #114359 has finished for PR 26644 at commit 8f07b78.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Nov 24, 2019

Test build #114360 has finished for PR 26644 at commit 8c324e3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Nov 25, 2019

Test build #114377 has finished for PR 26644 at commit 8538eed.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Dec 2, 2019

Test build #114716 has finished for PR 26644 at commit 7aaaee2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Dec 2, 2019

Test build #114736 has finished for PR 26644 at commit a33ed0b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Dec 3, 2019

Test build #114757 has finished for PR 26644 at commit 3d68a75.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Dec 3, 2019

Test build #114759 has finished for PR 26644 at commit 8f64856.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Dec 3, 2019

Test build #114774 has finished for PR 26644 at commit 12a2c93.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Dec 3, 2019

Test build #114784 has finished for PR 26644 at commit 539ac96.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Fokko
Contributor Author

Fokko commented Dec 5, 2019

@dongjoon-hyun @HyukjinKwon @maropu Any further thoughts?

@Fokko
Contributor Author

Fokko commented Dec 11, 2019

Rebased onto master

SparkQA commented Dec 11, 2019

Test build #115158 has finished for PR 26644 at commit e976297.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Fokko
Contributor Author

Fokko commented Dec 11, 2019

It looks like apache.org is unreachable:

curl: (7) Failed to connect to www.apache.org port 443: Connection timed out

gzip: stdin: unexpected end of file
tar: Child returned status 1
tar: Error is not recoverable: exiting now
Using `mvn` from path: /home/runner/work/spark/spark/build/apache-maven-3.6.3/bin/mvn
./build/mvn: line 172: /home/runner/work/spark/spark/build/apache-maven-3.6.3/bin/mvn: No such file or directory


SparkQA commented Dec 11, 2019

Test build #115165 has finished for PR 26644 at commit b378628.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Dec 30, 2019

Test build #115961 has finished for PR 26644 at commit dd3c913.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

        if leftType == rightUdt.sqlType => leftType

      case (leftUdt: UserDefinedType[_], rightType)
        if leftUdt.sqlType == rightType => rightType
Member

@Fokko, sorry for my late response. I doubt we should allow this case.

Currently, merge only allows identical types, but a UDT and its SQL type are not the same type. I think it makes little sense to allow this case alone.

Also, https://github.com/apache/spark/pull/26644/files#r350486670 looks weird: it seems jsonValue should hold the JSON-serialized value of its own type.

Contributor Author

Well, the UserDefinedType extends DataType, similar to TimestampType, StringType, and any other type. The thing is that a UserDefinedType can be compatible with any other type. For example, it is allowed to merge an int into a long. This is an explicit choice by the developer.

Member

"it is allowed to merge an int into a long."

But StructType.merge does not allow such type merging. Given that, it looks weird to allow it only for UDTs.

scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> StructType.merge(LongType, IntegerType)
org.apache.spark.SparkException: Failed to merge incompatible data types bigint and int
  at org.apache.spark.sql.types.StructType$.merge(StructType.scala:600)
  ... 49 elided

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Apr 16, 2020
@github-actions github-actions bot closed this Apr 17, 2020
@Fokko
Contributor Author

Fokko commented Apr 26, 2020

I've been thinking about this a lot, but could not come up with a clean solution. I'll leave it for now.
