[SPARK-3572] [sql] [mllib] User-Defined Types and MLlib Datasets #2919
Conversation
QA tests have started for PR 2919 at commit
QA tests have finished for PR 2919 at commit
Test FAILed.
Test build #22148 has started for PR 2919 at commit
Test build #22150 has started for PR 2919 at commit
Test build #22148 has finished for PR 2919 at commit
Test FAILed.
Test build #22150 has finished for PR 2919 at commit
Test PASSed.
 * ::DeveloperApi::
 * The data type for User Defined Types.
 */
@DeveloperApi
Can this have some extra documentation about what its purpose is and when a user might want to define one?
Will do
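For readers following along, the purpose of a UDT is to describe how a user class maps to and from a SQL-serializable representation, so that Catalyst can store and query values of the user's type. A minimal self-contained sketch of that contract, with illustrative names only (this is not Spark's actual API, and `MyDenseVector`'s real definition lives in the PR's test code):

```scala
// Hypothetical sketch of the UDT contract. A UDT pairs a user class with a
// SQL-friendly representation (here, a plain Array[Double]) and provides the
// two conversions Catalyst needs. Names are illustrative, not Spark's API.
abstract class UserDefinedTypeSketch[UserType] {
  // Convert the user type into its SQL-serializable form.
  def serialize(obj: UserType): Array[Double]
  // Reconstruct the user type from its SQL form.
  def deserialize(datum: Array[Double]): UserType
}

// A toy user class, standing in for the PR's MyDenseVector example.
case class DenseVectorExample(values: Array[Double])

object DenseVectorUDTExample extends UserDefinedTypeSketch[DenseVectorExample] {
  def serialize(obj: DenseVectorExample): Array[Double] = obj.values
  def deserialize(datum: Array[Double]): DenseVectorExample =
    DenseVectorExample(datum)
}
```

A user would define one of these whenever they want instances of their own class (a vector, a point, a matrix) to be usable as a column type, with the engine seeing only the serialized form.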
Test build #22315 has started for PR 2919 at commit
Test build #22315 has finished for PR 2919 at commit
Test PASSed.
Test build #22375 has started for PR 2919 at commit
Test build #22376 has started for PR 2919 at commit
Test build #22375 has finished for PR 2919 at commit
Test PASSed.
Test build #22376 has finished for PR 2919 at commit
Test PASSed.
Following @jkbradley's suggestion, I've moved this comment over to the JIRA: https://issues.apache.org/jira/browse/SPARK-3573
@etrain Thanks for your thoughts! This sounds like a discussion which would fit better on the Dataset JIRA. Could we please move it there? This PR is meant to give a standard SQL UDT implementation; I am OK with removing the MLlib Dataset example if that needs to be discussed more. I'll post some thoughts on the JIRA once you move the comment there (to keep a record). Thanks!
I'm about to remove the mllib/ part of this PR; that can be added back after further discussion and any needed modifications.
Test build #22462 has started for PR 2919 at commit
Test build #22462 has finished for PR 2919 at commit
Test PASSed.
Test build #22557 has started for PR 2919 at commit
@marmbrus Just pushed a WIP update to include Java support, but there is currently an issue with accessing the Scala UserDefinedType (in catalyst) from the Java side. The goal is to use a UDT defined in Scala (MyDenseVector) from Java, but the Java user needs to be able to convert the Scala UDT to a Java UDT. It is hard to write a (public) conversion method in Java, since it needs to take a Scala UDT as an argument, and Java does not recognize the Scala UserDefinedType alias from package.scala.

Proposal: write the conversion methods in Scala and have Java users call them. Specifically, expose UDTWrappers.wrapAsJava() and wrapAsScala(). Thoughts? Thanks!
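The wrapper proposal above amounts to adapter methods, defined on the Scala side, that forward between the two UDT hierarchies so Java callers never need to name the Scala type alias. A hedged, self-contained sketch of the idea, with stand-in types (the real UDTWrappers would bridge Spark's actual Scala and Java UserDefinedType classes):

```scala
// Stand-ins for the two UDT hierarchies; illustrative only.
abstract class ScalaUDTSketch[T] { def serialize(obj: T): Any }
abstract class JavaUDTSketch[T]  { def serialize(obj: T): Any }

// Sketch of the proposed UDTWrappers object: conversions are written in
// Scala, so Java users just call these methods and never have to name the
// Scala UserDefinedType alias from package.scala directly.
object UDTWrappersSketch {
  def wrapAsJava[T](udt: ScalaUDTSketch[T]): JavaUDTSketch[T] =
    new JavaUDTSketch[T] {
      // Delegate every operation to the wrapped Scala UDT.
      def serialize(obj: T): Any = udt.serialize(obj)
    }

  def wrapAsScala[T](udt: JavaUDTSketch[T]): ScalaUDTSketch[T] =
    new ScalaUDTSketch[T] {
      def serialize(obj: T): Any = udt.serialize(obj)
    }
}
```

From Java, the call site would then be a plain static-style method call such as `UDTWrappers.wrapAsJava(myScalaUdt)`, sidestepping the type-alias problem entirely.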
Test build #22557 has finished for PR 2919 at commit
…ved SQL UDT examples from mllib.
…isterUDT takes only the udt argument. Mid-process adding Java support for UDTs.
… Extended JavaUserDefinedTypeSuite
…hen creating schema from Java Bean
Test build #22776 has started for PR 2919 at commit
Test build #22776 has finished for PR 2919 at commit
Test FAILed.
Following #2919, this PR adds Python UDT (for internal use only) with tests under "pyspark.tests". Before `SQLContext.applySchema`, we check whether we need to convert user-type instances into SQL recognizable data. In the current implementation, a Python UDT must be paired with a Scala UDT for serialization on the JVM side. A following PR will add VectorUDT in MLlib for both Scala and Python.

marmbrus jkbradley davies

Author: Xiangrui Meng <meng@databricks.com>

Closes #3068 from mengxr/SPARK-4192-sql and squashes the following commits:

acff637 [Xiangrui Meng] merge master
dba5ea7 [Xiangrui Meng] only use pyClass for Python UDT output sqlType as well
2c9d7e4 [Xiangrui Meng] move import to global setup; update needsConversion
7c4a6a9 [Xiangrui Meng] address comments
75223db [Xiangrui Meng] minor update
f740379 [Xiangrui Meng] remove UDT from default imports
e98d9d0 [Xiangrui Meng] fix py style
4e84fce [Xiangrui Meng] remove local hive tests and add more tests
39f19e0 [Xiangrui Meng] add tests
b7f666d [Xiangrui Meng] add Python UDT

(cherry picked from commit 04450d1)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
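The "check whether we need to convert" step mentioned above can be pictured as a recursive walk over the incoming values: user-type instances (and any collections containing them) get translated into plain SQL-recognizable data before the schema is applied. A hedged, self-contained sketch of that pattern, with illustrative names (the real logic lives in pyspark and is not reproduced here):

```scala
// Toy user type standing in for a Python-side UDT instance.
case class PyPointExample(x: Double, y: Double)

// Sketch of the pre-applySchema conversion check. Names and structure are
// illustrative, not the actual pyspark implementation.
object ConversionSketch {
  // Does this value (or anything nested inside it) need conversion
  // before it can cross into SQL?
  def needsConversion(v: Any): Boolean = v match {
    case _: PyPointExample                        => true
    case s: Seq[_]                                => s.exists(needsConversion)
    case _                                        => false
  }

  // Translate user-type instances into SQL-recognizable data
  // (here, a point becomes a plain Seq of its coordinates).
  def convert(v: Any): Any = v match {
    case p: PyPointExample => Seq(p.x, p.y)
    case s: Seq[_]         => s.map(convert)
    case other             => other
  }
}
```

Skipping the walk when `needsConversion` is false keeps the common case (rows with only primitive values) on the fast path.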
This has been subsumed by other PRs, right?
Yes, I'll close it.
This PR adds User-Defined Types (UDTs) to SQL. It is a precursor to using SchemaRDD as a Dataset for the new MLlib API. Currently, the UDT API is private since there is incomplete support (e.g., no Java or Python support yet).
Main additions
Private SQL API
ScalaReflection
Unit Tests
Design decisions
Items left for future PRs
CC: @mengxr @marmbrus