Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-3572] [sql] [mllib] User-Defined Types and MLlib Datasets #2919

Closed
wants to merge 42 commits into from

Conversation

jkbradley
Copy link
Member

This PR adds User-Defined Types (UDTs) to SQL. It is a precursor to using SchemaRDD as a Dataset for the new MLlib API. Currently, the UDT API is private since there is incomplete support (e.g., no Java or Python support yet).

Main additions

Private SQL API

  • Added annotation SQLUserDefinedType (DeveloperApi)
  • Added abstract class UserDefinedType

ScalaReflection

  • Methods for converting between Scala and Catalyst types now take DataType.
    • convertRowToScala added in several locations in SQL
  • schemaFor checks for SQLUserDefinedType annotation

Unit Tests

  • UserDefinedTypeSuite.scala: Tests fake version of DenseVector
  • JavaUserDefinedTypeSuite.java: Tests fake version of DenseVector (defined in Scala)

Design decisions

  • UDTs override types natively recognized by SQL.
  • Question: Should users be able to override primitive or built-in types?

Items left for future PRs

  • Java and Python APIs
  • Serialization (Parquet, etc.)

CC: @mengxr @marmbrus

@SparkQA
Copy link

SparkQA commented Oct 24, 2014

QA tests have started for PR 2919 at commit 3de3d76.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 24, 2014

QA tests have finished for PR 2919 at commit 3de3d76.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Params(
    • // in some cases, such as when a class is enclosed in an object (in which case
    • abstract class UserDefinedType[UserType] extends DataType with Serializable

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22112/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Oct 24, 2014

Test build #22148 has started for PR 2919 at commit 716c19f.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 24, 2014

Test build #22150 has started for PR 2919 at commit 8ca2339.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 24, 2014

Test build #22148 has finished for PR 2919 at commit 716c19f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Params(
    • // in some cases, such as when a class is enclosed in an object (in which case
    • abstract class UserDefinedType[UserType] extends DataType with Serializable

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22148/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Oct 24, 2014

Test build #22150 has finished for PR 2919 at commit 8ca2339.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Params(
    • // in some cases, such as when a class is enclosed in an object (in which case
    • abstract class UserDefinedType[UserType] extends DataType with Serializable

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22150/
Test PASSed.

* ::DeveloperApi::
* The data type for User Defined Types.
*/
@DeveloperApi
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this have some extra documentation about what it's purpose is and when a user might want to define one?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do

@SparkQA
Copy link

SparkQA commented Oct 28, 2014

Test build #22315 has started for PR 2919 at commit 7dd045a.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 28, 2014

Test build #22315 has finished for PR 2919 at commit 7dd045a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Params(
    • // in some cases, such as when a class is enclosed in an object (in which case
    • abstract class UserDefinedType[UserType] extends DataType with Serializable

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22315/
Test PASSed.

@SparkQA
Copy link

SparkQA commented Oct 28, 2014

Test build #22375 has started for PR 2919 at commit bbb862a.

  • This patch merges cleanly.

@jkbradley
Copy link
Member Author

@marmbrus Parquet support added by @mengxr so this should be ready for a pass. Thanks both!

@SparkQA
Copy link

SparkQA commented Oct 28, 2014

Test build #22376 has started for PR 2919 at commit b74251d.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 28, 2014

Test build #22375 has finished for PR 2919 at commit bbb862a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Params(
    • // in some cases, such as when a class is enclosed in an object (in which case
    • abstract class UserDefinedType[UserType] extends DataType with Serializable

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22375/
Test PASSed.

@SparkQA
Copy link

SparkQA commented Oct 28, 2014

Test build #22376 has finished for PR 2919 at commit b74251d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Params(
    • // in some cases, such as when a class is enclosed in an object (in which case
    • abstract class UserDefinedType[UserType] extends DataType with Serializable

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22376/
Test PASSed.

@etrain
Copy link
Contributor

etrain commented Oct 29, 2014

Following @jkbradley's suggestion - I've moved this comment over to the JIRA - https://issues.apache.org/jira/browse/SPARK-3573

@jkbradley
Copy link
Member Author

@etrain Thanks for your thoughts! This sounds like a discussion which would fit better on the Dataset JIRA. Could we please move it to there? This PR is meant to give a standard SQL UDT implementation; I am OK with removing the MLlib Dataset example if that needs to be discussed more. I'll post some thoughts on the JIRA once you move the comment there (for keeping a record). Thanks!

@jkbradley
Copy link
Member Author

I'm about to remove the mllib/ part of this PR; that can be put in after more discussions and whatever modifications.

@SparkQA
Copy link

SparkQA commented Oct 29, 2014

Test build #22462 has started for PR 2919 at commit a459956.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 29, 2014

Test build #22462 has finished for PR 2919 at commit a459956.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22462/
Test PASSed.

@SparkQA
Copy link

SparkQA commented Oct 30, 2014

Test build #22557 has started for PR 2919 at commit 9c175e9.

  • This patch merges cleanly.

@jkbradley
Copy link
Member Author

@marmbrus Just pushed WIP update to include Java support, but currently have issue with accessing Scala UserDefinedType (in catalyst) from Java side. The goal is to use a UDT defined in Scala (MyDenseVector) in Java, but the Java user needs to be able to convert the Scala UDT to a Java UDT. It is hard to write a (public) conversion method in Java since it needs to take a Scala UDT as an argument (and it does not recognize the Scala UserDefinedType alias from package.scala).

Proposal: Write a conversion method in Scala, and have Java users call it. Specifically, expose UDTWrappers.wrapAsJava() and wrapAsScala().

Thoughts? Thanks!

@SparkQA
Copy link

SparkQA commented Oct 30, 2014

Test build #22557 has finished for PR 2919 at commit 9c175e9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class RegressionMetrics(predictionAndObservations: RDD[(Double, Double)]) extends Logging
    • // in some cases, such as when a class is enclosed in an object (in which case
    • abstract class UserDefinedType[UserType] extends DataType with Serializable
    • public abstract class UserDefinedType<UserType> extends DataType

@SparkQA
Copy link

SparkQA commented Nov 2, 2014

Test build #22776 has started for PR 2919 at commit e13cd8a.

  • This patch does not merge cleanly.

@SparkQA
Copy link

SparkQA commented Nov 3, 2014

Test build #22776 has finished for PR 2919 at commit e13cd8a.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • case class Params(
    • // in some cases, such as when a class is enclosed in an object (in which case
    • abstract class UserDefinedType[UserType] extends DataType with Serializable
    • public abstract class UserDefinedType<UserType> extends DataType implements Serializable

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22776/
Test FAILed.

asfgit pushed a commit that referenced this pull request Nov 4, 2014
Following #2919, this PR adds Python UDT (for internal use only) with tests under "pyspark.tests". Before `SQLContext.applySchema`, we check whether we need to convert user-type instances into SQL recognizable data. In the current implementation, a Python UDT must be paired with a Scala UDT for serialization on the JVM side. A following PR will add VectorUDT in MLlib for both Scala and Python.

marmbrus jkbradley davies

Author: Xiangrui Meng <meng@databricks.com>

Closes #3068 from mengxr/SPARK-4192-sql and squashes the following commits:

acff637 [Xiangrui Meng] merge master
dba5ea7 [Xiangrui Meng] only use pyClass for Python UDT output sqlType as well
2c9d7e4 [Xiangrui Meng] move import to global setup; update needsConversion
7c4a6a9 [Xiangrui Meng] address comments
75223db [Xiangrui Meng] minor update
f740379 [Xiangrui Meng] remove UDT from default imports
e98d9d0 [Xiangrui Meng] fix py style
4e84fce [Xiangrui Meng] remove local hive tests and add more tests
39f19e0 [Xiangrui Meng] add tests
b7f666d [Xiangrui Meng] add Python UDT

(cherry picked from commit 04450d1)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
asfgit pushed a commit that referenced this pull request Nov 4, 2014
Following #2919, this PR adds Python UDT (for internal use only) with tests under "pyspark.tests". Before `SQLContext.applySchema`, we check whether we need to convert user-type instances into SQL recognizable data. In the current implementation, a Python UDT must be paired with a Scala UDT for serialization on the JVM side. A following PR will add VectorUDT in MLlib for both Scala and Python.

marmbrus jkbradley davies

Author: Xiangrui Meng <meng@databricks.com>

Closes #3068 from mengxr/SPARK-4192-sql and squashes the following commits:

acff637 [Xiangrui Meng] merge master
dba5ea7 [Xiangrui Meng] only use pyClass for Python UDT output sqlType as well
2c9d7e4 [Xiangrui Meng] move import to global setup; update needsConversion
7c4a6a9 [Xiangrui Meng] address comments
75223db [Xiangrui Meng] minor update
f740379 [Xiangrui Meng] remove UDT from default imports
e98d9d0 [Xiangrui Meng] fix py style
4e84fce [Xiangrui Meng] remove local hive tests and add more tests
39f19e0 [Xiangrui Meng] add tests
b7f666d [Xiangrui Meng] add Python UDT
@marmbrus
Copy link
Contributor

marmbrus commented Nov 4, 2014

This has been subsumed by other PRs right?

@jkbradley
Copy link
Member Author

Yes, I'll close it.

@jkbradley jkbradley closed this Nov 4, 2014
@jkbradley jkbradley deleted the sql-udt branch December 4, 2014 20:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants