[SPARK-3572] [sql] [mllib] User-Defined Types and MLlib Datasets #2919

jkbradley · 2014-10-24T03:42:52Z

This PR adds User-Defined Types (UDTs) to SQL. It is a precursor to using SchemaRDD as a Dataset for the new MLlib API. Currently, the UDT API is private since there is incomplete support (e.g., no Java or Python support yet).

Main additions

Private SQL API

- Added annotation SQLUserDefinedType (DeveloperApi)
- Added abstract class UserDefinedType

ScalaReflection

- Methods for converting between Scala and Catalyst types now take DataType.
  - convertRowToScala added in several locations in SQL
- schemaFor checks for SQLUserDefinedType annotation

Unit Tests

- UserDefinedTypeSuite.scala: Tests fake version of DenseVector
- JavaUserDefinedTypeSuite.java: Tests fake version of DenseVector (defined in Scala)
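As a rough illustration of what a UDT for a fake DenseVector has to provide, here is a minimal standalone sketch in plain Python. It does not use Spark. The names `UserDefinedType`, `sql_type`, `serialize`, `deserialize`, `MyDenseVector`, and `MyDenseVectorUDT` are stand-ins chosen for this example, and storing the vector as an array of doubles is an assumption, not necessarily what the PR does.

```python
from abc import ABC, abstractmethod


class UserDefinedType(ABC):
    """Hypothetical stand-in for the abstract UDT class; not Spark's API."""

    @abstractmethod
    def sql_type(self):
        """Describe the built-in representation used for storage."""

    @abstractmethod
    def serialize(self, obj):
        """Convert a user object into the built-in representation."""

    @abstractmethod
    def deserialize(self, datum):
        """Rebuild a user object from the built-in representation."""


class MyDenseVector:
    """Toy user class, loosely modeled on the fake DenseVector in the tests."""

    def __init__(self, values):
        self.values = [float(v) for v in values]

    def __eq__(self, other):
        return isinstance(other, MyDenseVector) and self.values == other.values


class MyDenseVectorUDT(UserDefinedType):
    def sql_type(self):
        # Assumption: a vector is stored as an array of doubles.
        return "array<double>"

    def serialize(self, obj):
        # Copy the values so the stored form is independent of the user object.
        return list(obj.values)

    def deserialize(self, datum):
        return MyDenseVector(datum)


# Round trip: user object -> stored form -> user object.
udt = MyDenseVectorUDT()
stored = udt.serialize(MyDenseVector([1, 2.5, 3]))
restored = udt.deserialize(stored)
```

A round-trip check of this kind (serialize, then deserialize, then compare) is the sort of property a UDT test suite would likely assert.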

Design decisions

UDTs override types natively recognized by SQL.
Question: Should users be able to override primitive or built-in types?
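The "UDTs override natively recognized types" decision can be pictured as an ordering rule in schema inference: look for a UDT marker first, and only then fall back to built-in types. The sketch below is plain Python under stated assumptions. The `__UDT__` class attribute stands in for the SQLUserDefinedType annotation, and `BUILTIN_TYPES`, `Point`, `PointUDT`, and `schema_for` are hypothetical names that do not mirror the Catalyst code.

```python
# Hypothetical built-in mapping from Python classes to SQL type names.
BUILTIN_TYPES = {int: "bigint", float: "double", str: "string", list: "array"}


class PointUDT:
    """Hypothetical UDT that stores a point as an array of doubles."""

    def sql_type(self):
        return "array<double>"


class Point(list):
    """A user class that derives from a built-in type yet carries its own UDT."""

    __UDT__ = PointUDT()  # stand-in for the annotation on the user class


def schema_for(cls):
    # Check for a user-defined type first so it overrides any built-in match.
    udt = getattr(cls, "__UDT__", None)
    if udt is not None:
        return "udt(" + udt.sql_type() + ")"
    # Otherwise fall back to the first built-in type the class derives from.
    for base in cls.__mro__:
        if base in BUILTIN_TYPES:
            return BUILTIN_TYPES[base]
    raise TypeError("no schema for " + cls.__name__)
```

Here `Point` would match the built-in `list` entry, but because the UDT check runs first, `schema_for(Point)` resolves to the UDT. Swapping the two checks would silently treat `Point` as a plain array, which is the behavior the open question about primitive and built-in types is concerned with.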

Items left for future PRs

- Java and Python APIs
- Serialization (Parquet, etc.)

CC: @mengxr @marmbrus

SparkQA · 2014-10-24T03:49:49Z

QA tests have started for PR 2919 at commit 3de3d76.

This patch merges cleanly.

SparkQA · 2014-10-24T05:07:25Z

QA tests have finished for PR 2919 at commit 3de3d76.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Params(
- // in some cases, such as when a class is enclosed in an object (in which case
- abstract class UserDefinedType[UserType] extends DataType with Serializable

AmplabJenkins · 2014-10-24T05:07:28Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22112/
Test FAILed.

SparkQA · 2014-10-24T18:14:48Z

Test build #22148 has started for PR 2919 at commit 716c19f.

This patch merges cleanly.

SparkQA · 2014-10-24T18:29:57Z

Test build #22150 has started for PR 2919 at commit 8ca2339.

This patch merges cleanly.

SparkQA · 2014-10-24T19:26:30Z

Test build #22148 has finished for PR 2919 at commit 716c19f.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Params(
- // in some cases, such as when a class is enclosed in an object (in which case
- abstract class UserDefinedType[UserType] extends DataType with Serializable

AmplabJenkins · 2014-10-24T19:26:34Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22148/
Test FAILed.

SparkQA · 2014-10-24T20:01:10Z

Test build #22150 has finished for PR 2919 at commit 8ca2339.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Params(
- // in some cases, such as when a class is enclosed in an object (in which case
- abstract class UserDefinedType[UserType] extends DataType with Serializable

AmplabJenkins · 2014-10-24T20:01:14Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22150/
Test PASSed.

sryza · 2014-10-25T02:44:35Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/dataTypes.scala

+ * ::DeveloperApi::
+ * The data type for User Defined Types.
+ */
+@DeveloperApi


Can this have some extra documentation about what its purpose is and when a user might want to define one?

SparkQA · 2014-10-28T01:09:44Z

Test build #22315 has started for PR 2919 at commit 7dd045a.

This patch merges cleanly.

SparkQA · 2014-10-28T02:52:16Z

Test build #22315 has finished for PR 2919 at commit 7dd045a.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Params(
- // in some cases, such as when a class is enclosed in an object (in which case
- abstract class UserDefinedType[UserType] extends DataType with Serializable

AmplabJenkins · 2014-10-28T02:52:19Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22315/
Test PASSed.

SparkQA · 2014-10-28T20:42:29Z

Test build #22375 has started for PR 2919 at commit bbb862a.

This patch merges cleanly.

jkbradley · 2014-10-28T20:57:29Z

@marmbrus Parquet support has been added by @mengxr, so this should be ready for a pass. Thanks both!

SparkQA · 2014-10-28T21:00:01Z

Test build #22376 has started for PR 2919 at commit b74251d.

This patch merges cleanly.

SparkQA · 2014-10-28T22:11:32Z

Test build #22375 has finished for PR 2919 at commit bbb862a.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Params(
- // in some cases, such as when a class is enclosed in an object (in which case
- abstract class UserDefinedType[UserType] extends DataType with Serializable

AmplabJenkins · 2014-10-28T22:11:35Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22375/
Test PASSed.

SparkQA · 2014-10-28T22:30:39Z

Test build #22376 has finished for PR 2919 at commit b74251d.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Params(
- // in some cases, such as when a class is enclosed in an object (in which case
- abstract class UserDefinedType[UserType] extends DataType with Serializable

AmplabJenkins · 2014-10-28T22:30:42Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22376/
Test PASSed.

etrain · 2014-10-29T17:39:17Z

Following @jkbradley's suggestion, I've moved this comment over to the JIRA: https://issues.apache.org/jira/browse/SPARK-3573

jkbradley · 2014-10-29T18:43:11Z

@etrain Thanks for your thoughts! This sounds like a discussion that would fit better on the Dataset JIRA. Could we please move it there? This PR is meant to give a standard SQL UDT implementation; I am OK with removing the MLlib Dataset example if that needs to be discussed more. I'll post some thoughts on the JIRA once you move the comment there (to keep a record). Thanks!

jkbradley · 2014-10-29T18:48:07Z

I'm about to remove the mllib/ part of this PR; it can be added back after more discussion and whatever modifications are needed.

SparkQA · 2014-10-29T19:00:10Z

Test build #22462 has started for PR 2919 at commit a459956.

This patch merges cleanly.

SparkQA · 2014-10-29T19:50:40Z

Test build #22462 has finished for PR 2919 at commit a459956.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-10-29T19:50:43Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22462/
Test PASSed.

SparkQA · 2014-10-30T20:27:29Z

Test build #22557 has started for PR 2919 at commit 9c175e9.

This patch merges cleanly.

jkbradley · 2014-10-30T20:31:29Z

@marmbrus I just pushed a WIP update to include Java support, but I currently have an issue accessing the Scala UserDefinedType (in catalyst) from the Java side. The goal is to use a UDT defined in Scala (MyDenseVector) in Java, but the Java user needs to be able to convert the Scala UDT to a Java UDT. It is hard to write a (public) conversion method in Java since it needs to take a Scala UDT as an argument, and Java does not recognize the Scala UserDefinedType alias from package.scala.

Proposal: Write a conversion method in Scala, and have Java users call it. Specifically, expose UDTWrappers.wrapAsJava() and wrapAsScala().

Thoughts? Thanks!

SparkQA · 2014-10-30T20:34:00Z

Test build #22557 has finished for PR 2919 at commit 9c175e9.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class RegressionMetrics(predictionAndObservations: RDD[(Double, Double)]) extends Logging
- // in some cases, such as when a class is enclosed in an object (in which case
- abstract class UserDefinedType[UserType] extends DataType with Serializable
- public abstract class UserDefinedType<UserType> extends DataType

SparkQA · 2014-11-02T23:35:00Z

Test build #22776 has started for PR 2919 at commit e13cd8a.

This patch does not merge cleanly.

SparkQA · 2014-11-03T01:05:30Z

Test build #22776 has finished for PR 2919 at commit e13cd8a.

This patch fails Spark unit tests.
This patch does not merge cleanly.
This patch adds the following public classes (experimental):
- case class Params(
- // in some cases, such as when a class is enclosed in an object (in which case
- abstract class UserDefinedType[UserType] extends DataType with Serializable
- public abstract class UserDefinedType<UserType> extends DataType implements Serializable

AmplabJenkins · 2014-11-03T01:05:33Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22776/
Test FAILed.

Following #2919, this PR adds Python UDT (for internal use only) with tests under "pyspark.tests". Before `SQLContext.applySchema`, we check whether we need to convert user-type instances into SQL recognizable data. In the current implementation, a Python UDT must be paired with a Scala UDT for serialization on the JVM side. A following PR will add VectorUDT in MLlib for both Scala and Python.

marmbrus jkbradley davies

Author: Xiangrui Meng <meng@databricks.com>

Closes #3068 from mengxr/SPARK-4192-sql and squashes the following commits:

acff637 [Xiangrui Meng] merge master
dba5ea7 [Xiangrui Meng] only use pyClass for Python UDT output sqlType as well
2c9d7e4 [Xiangrui Meng] move import to global setup; update needsConversion
7c4a6a9 [Xiangrui Meng] address comments
75223db [Xiangrui Meng] minor update
f740379 [Xiangrui Meng] remove UDT from default imports
e98d9d0 [Xiangrui Meng] fix py style
4e84fce [Xiangrui Meng] remove local hive tests and add more tests
39f19e0 [Xiangrui Meng] add tests
b7f666d [Xiangrui Meng] add Python UDT

(cherry picked from commit 04450d1)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
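The conversion check that the commit message above describes as running before `SQLContext.applySchema` might look roughly like the following. This is a standalone sketch in plain Python, not the pyspark implementation. `UserDefinedType`, `ExamplePoint`, `ExamplePointUDT`, `needs_conversion`, `convert_row`, and `prepare_rows` are hypothetical names, the schema is simplified to a flat list of column types, and nested UDTs are ignored.

```python
class UserDefinedType:
    """Hypothetical base class for this sketch; not the pyspark class."""

    def serialize(self, obj):
        raise NotImplementedError


class ExamplePoint:
    def __init__(self, x, y):
        self.x, self.y = x, y


class ExamplePointUDT(UserDefinedType):
    def serialize(self, obj):
        # Assumption: a point is stored as a two-element list of doubles.
        return [obj.x, obj.y]


def needs_conversion(field_types):
    # Conversion is needed only if at least one column is a UDT.
    return any(isinstance(t, UserDefinedType) for t in field_types)


def convert_row(row, field_types):
    # Serialize non-null UDT values; pass everything else through unchanged.
    return tuple(
        t.serialize(v) if isinstance(t, UserDefinedType) and v is not None else v
        for v, t in zip(row, field_types)
    )


def prepare_rows(rows, field_types):
    # Skip the per-row pass entirely when the schema has no UDT columns.
    if not needs_conversion(field_types):
        return list(rows)
    return [convert_row(r, field_types) for r in rows]


schema = ["double", ExamplePointUDT()]
rows = [(1.0, ExamplePoint(1.0, 2.0)), (0.0, None)]
prepared = prepare_rows(rows, schema)
```

Checking the schema once up front keeps the common case (no UDT columns) free of any per-row overhead, which is one plausible reason to separate the check from the conversion itself.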

marmbrus · 2014-11-04T18:54:57Z

This has been subsumed by other PRs, right?

jkbradley · 2014-11-04T19:12:03Z

Yes, I'll close it.

sryza reviewed Oct 25, 2014

jkbradley and others added 18 commits November 2, 2014 11:26

Added more doc to UserDefineType

759af7a

Added more doc for UserDefinedType. Removed unused code in Suite

db16139

support UDT in parquet

cfbc321

remove unnecessary changes

3143ac3

remove debug code

87264a5

update example code

4500d8a

allow any type in UDT

b028675

Moved udt case to top of all matches. Small cleanups

7f29656

Fixed merge error after last merge. Note: Last merge commit also removed SQL UDT examples from mllib.

8b242ea

Modified UserDefinedType to store Java class of user type so that registerUDT takes only the udt argument. Mid-process adding Java support for UDTs.

8de957c

Removed Java UserDefinedType, and made UDTs private[spark] for now

fa86b20

fixed scalastyle

20630bc

Made MyLabeledPoint into a Java Bean

6fddc1c

Removed old UDT code (registry and Java UDTs). Cleaned up other code. Extended JavaUserDefinedTypeSuite

a571bb6

Cleaned up Java UDT Suite, and added warning about element ordering when creating schema from Java Bean

d063380

updates based on code review

30ce5b2

style edits

5817b2b

Removed Vector UDTs

e13cd8a

jkbradley force-pushed the sql-udt branch from f8002b4 to e13cd8a on November 2, 2014 23:30

mengxr mentioned this pull request Nov 3, 2014

[SPARK-4192][SQL] Internal API for Python UDT #3068

Closed

jkbradley closed this Nov 4, 2014

jkbradley deleted the sql-udt branch December 4, 2014 20:30
