
Conversation

JDrit
Contributor

@JDrit JDrit commented Jul 30, 2015

Users currently have to provide the full class name for external data sources, for example:

`sqlContext.read.format("com.databricks.spark.avro").load(path)`

This change allows external data source packages to register themselves through a `ServiceLoader`, so they can expose a custom alias such as:

`sqlContext.read.format("avro").load(path)`

With this, external data source packages are used the same way as the built-in data sources like parquet, json, etc.
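For illustration, a minimal sketch of what an external package does on its side to expose such an alias. The trait is the one this PR ends up introducing (`DataSourceRegister`, per the test output further down); the method name `shortName` follows later Spark releases and may differ slightly in this patch, and the Avro body is elided.

```scala
package com.databricks.spark.avro

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.{BaseRelation, DataSourceRegister, RelationProvider}

// The provider mixes in the registration trait so Spark can map the short
// alias "avro" to this class instead of requiring the full class name.
class DefaultSource extends RelationProvider with DataSourceRegister {

  // Alias users can pass to .format(...); method name assumed from later Spark APIs.
  override def shortName(): String = "avro"

  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation = {
    ??? // build the Avro relation here (elided)
  }
}
```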

@SparkQA

SparkQA commented Aug 3, 2015

Test build #1306 has finished for PR 7802 at commit 208a2a8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait DataSourceProvider
    • trait RelationProvider extends DataSourceProvider
    • trait SchemaRelationProvider extends DataSourceProvider
    • trait HadoopFsRelationProvider extends DataSourceProvider
    • trait CreatableRelationProvider extends DataSourceProvider

.rat-excludes Outdated
Contributor

Perhaps just add `META-INF/services/` instead? For future-proofing.
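For reference, this is the ServiceLoader convention the exclude covers: a resource file named after the fully-qualified trait being registered, listing one implementation class per line. The path below is illustrative, and the trait name assumes the `DataSourceRegister` form this PR converges on.

```
# src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
# One fully-qualified provider class per line; '#' starts a comment.
com.databricks.spark.avro.DefaultSource
```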

@vanzin
Contributor

vanzin commented Aug 3, 2015

Looks OK as far as I can tell. I generally find it weird to use traits for public APIs (too easy to break compatibility), but all the API here is Scala, so maybe it's not a big deal. I also wonder if there's a test that can be written to ensure that we're not mistakenly registering two sources with the same name.

And finally, ServiceLoader may have some funny semantics when used with userClassPathFirst, given the current implementation of ChildFirstURLClassLoader. Might be worth it to make a note to look at that behavior in more detail.
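To make the concern concrete, a rough sketch of the lookup in question, assuming the registered trait is `DataSourceRegister` and Spark's internal `Utils.getContextOrSparkClassLoader` helper; the loader passed here is exactly what `userClassPathFirst` / `ChildFirstURLClassLoader` would change.

```scala
import java.util.ServiceLoader
import scala.collection.JavaConverters._

import org.apache.spark.sql.sources.DataSourceRegister
import org.apache.spark.util.Utils

// ServiceLoader resolves providers via loader.getResources("META-INF/services/..."),
// so a child-first classloader can change which registrations are visible.
def loadRegisteredSources(): Seq[DataSourceRegister] = {
  val loader = Utils.getContextOrSparkClassLoader
  ServiceLoader.load(classOf[DataSourceRegister], loader).asScala.toSeq
}
```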

@JDrit
Contributor Author

JDrit commented Aug 3, 2015

I looked into userClassPathFirst and it did not seem to interfere with anything. I turned it on for both the driver and the executor, and all the tests still passed.

@JDrit
Contributor Author

JDrit commented Aug 3, 2015

I moved the classloader out to a lazy val as well.

@JDrit
Contributor Author

JDrit commented Aug 3, 2015

There are also tests to ensure correct behavior when two or more data sources are registered under the same name.
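A sketch of the kind of check those tests exercise, under the same naming assumptions as above: if more than one service-loaded provider claims an alias, resolution should fail rather than pick one arbitrarily.

```scala
import org.apache.spark.sql.sources.DataSourceRegister

// Illustrative resolution step: exactly one provider may own a given alias.
def resolveAlias(alias: String, registered: Seq[DataSourceRegister]): DataSourceRegister = {
  registered.filter(_.shortName().equalsIgnoreCase(alias)) match {
    case Seq(single) => single
    case Seq() =>
      sys.error(s"Failed to load class for data source: $alias")
    case multiple =>
      sys.error(s"Multiple sources found for $alias: " +
        multiple.map(_.getClass.getName).mkString(", "))
  }
}
```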

@rxin
Contributor

rxin commented Aug 5, 2015

@JDrit what's still WIP about this patch?

@rxin
Contributor

rxin commented Aug 5, 2015

Actually I think the current API breaks binary compatibility for data sources, so we can't merge it as is.

In Java (and for Scala binary compatibility), RelationProvider now extends an extra interface whose method has no default implementation. We need to find a workaround to provide this information.

@SparkQA

SparkQA commented Aug 5, 2015

Test build #1345 has finished for PR 7802 at commit 74db85e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • sys.error(s"Failed to load class for data source: $provider")
    • trait DataSourceProvider
    • trait RelationProvider extends DataSourceProvider
    • trait SchemaRelationProvider extends DataSourceProvider
    • trait HadoopFsRelationProvider extends DataSourceProvider
    • trait CreatableRelationProvider extends DataSourceProvider

@JDrit
Contributor Author

JDrit commented Aug 5, 2015

@rxin I changed the interface that provides the alias to be a mixin used by the different data sources, which should fix the binary compatibility problem. Data sources now mix in this trait if they want to provide an alias for themselves. Let me know if this addresses your concerns.
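A sketch of the difference, to make the binary-compatibility point concrete (the method name is illustrative):

```scala
// Earlier approach: a new required parent trait for all provider traits.
// Adding an abstract member forces every already-compiled external data
// source to implement it, which is the binary-compatibility break above.
trait DataSourceProvider {
  def shortName(): String
}
// trait RelationProvider extends DataSourceProvider { ... }

// Adopted approach: a standalone, optional mixin. Existing providers stay
// untouched; a package mixes this in only if it wants to expose an alias.
trait DataSourceRegister {
  def shortName(): String
}
// class DefaultSource extends RelationProvider with DataSourceRegister { ... }
```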

@JDrit JDrit changed the title [SPARK-9486][SQL][WIP] Add data source aliasing for external packages [SPARK-9486][SQL] Add data source aliasing for external packages Aug 5, 2015
@SparkQA

SparkQA commented Aug 5, 2015

Test build #1372 has finished for PR 7802 at commit 87b7f1c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class In(value: Expression, list: Seq[Expression]) extends Predicate
    • case class InSet(child: Expression, hset: Set[Any]) extends UnaryExpression with Predicate
    • .getOrElse(sys.error(s"Failed to load class for data source: $provider"))
    • trait DataSourceRegister

@SparkQA

SparkQA commented Aug 6, 2015

Test build #1378 has finished for PR 7802 at commit 87b7f1c.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • .getOrElse(sys.error(s"Failed to load class for data source: $provider"))
    • trait DataSourceRegister

@rxin
Contributor

rxin commented Aug 6, 2015

@JDrit still failing orc.

@JDrit
Contributor Author

JDrit commented Aug 6, 2015

It was an issue with the class loader not being re-resolved on every call of lookupDataSource; this latest change should fix that.
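A sketch of the shape of that fix, assuming Spark's internal `Utils.getContextOrSparkClassLoader` helper: the loader is resolved inside `lookupDataSource` on every call rather than cached in a lazy val, so calls made under a different context classloader (as in the ORC/Hive tests) see the right service registrations.

```scala
import org.apache.spark.util.Utils

def lookupDataSource(provider: String): Class[_] = {
  // Re-resolve the classloader on every call instead of capturing it once.
  val loader = Utils.getContextOrSparkClassLoader
  // ... ServiceLoader alias lookup and class-name fallback elided ...
  loader.loadClass(provider)
}
```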

.rat-excludes Outdated
Contributor

Can this be META-INF/services/*? I can see someone creating a package with actual source files called services.

@SparkQA

SparkQA commented Aug 7, 2015

Test build #1395 has finished for PR 7802 at commit 72b349a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • sys.error(s"Failed to load class for data source: $provider")
    • trait DataSourceRegister

@SparkQA

SparkQA commented Aug 7, 2015

Test build #1398 has finished for PR 7802 at commit e5e93b2.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • class BlockMatrix(DistributedMatrix):
    • case class In(value: Expression, list: Seq[Expression]) extends Predicate
    • case class InSet(child: Expression, hset: Set[Any]) extends UnaryExpression with Predicate
    • sys.error(s"Failed to load class for data source: $provider")
    • trait DataSourceRegister

Contributor

`tryLoad(loader, s"$provider.DefaultSource")` => `tryLoad(loader, s"$provider")`?

Contributor

Oh, sorry, both are supported.
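For clarity, a sketch of the two spellings being discussed, both of which the lookup accepts (the Try-based fallback here is illustrative):

```scala
import scala.util.Try

val loader = Thread.currentThread().getContextClassLoader
val provider = "com.databricks.spark.avro"

// Either the provider class itself, or a package containing a DefaultSource class.
val candidates = Seq(provider, s"$provider.DefaultSource")
val loaded: Option[Class[_]] =
  candidates.flatMap(name => Try(loader.loadClass(name)).toOption).headOption
```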

@SparkQA

SparkQA commented Aug 7, 2015

Test build #1402 timed out for PR 7802 at commit e5e93b2 after a configured wait of 175m.

@SparkQA

SparkQA commented Aug 8, 2015

Test build #1411 has finished for PR 7802 at commit e5e93b2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • sys.error(s"Failed to load class for data source: $provider")
    • trait DataSourceRegister

@rxin
Contributor

rxin commented Aug 8, 2015

I'm going to merge this. I will submit a PR later to change the API slightly.

asfgit pushed a commit that referenced this pull request Aug 8, 2015
Users currently have to provide the full class name for external data sources, for example:

`sqlContext.read.format("com.databricks.spark.avro").load(path)`

This allows external data source packages to register themselves through a `ServiceLoader`, so they can expose a custom alias such as:

`sqlContext.read.format("avro").load(path)`

With this, external data source packages are used the same way as the built-in data sources like parquet, json, etc.

Author: Joseph Batchik <joseph.batchik@cloudera.com>
Author: Joseph Batchik <josephbatchik@gmail.com>

Closes #7802 from JDrit/service_loader and squashes the following commits:

49a01ec [Joseph Batchik] fixed a couple of format / error bugs
e5e93b2 [Joseph Batchik] modified rat file to only excluded added services
72b349a [Joseph Batchik] fixed error with orc data source actually
9f93ea7 [Joseph Batchik] fixed error with orc data source
87b7f1c [Joseph Batchik] fixed typo
101cd22 [Joseph Batchik] removing unneeded changes
8f3cf43 [Joseph Batchik] merged in changes
b63d337 [Joseph Batchik] merged in master
95ae030 [Joseph Batchik] changed the new trait to be used as a mixin for data source to register themselves
74db85e [Joseph Batchik] reformatted class loader
ac2270d [Joseph Batchik] removing some added test
a6926db [Joseph Batchik] added test cases for data source loader
208a2a8 [Joseph Batchik] changes to do error catching if there are multiple data sources
946186e [Joseph Batchik] started working on service loader

(cherry picked from commit a3aec91)
Signed-off-by: Reynold Xin <rxin@databricks.com>
@asfgit asfgit closed this in a3aec91 Aug 8, 2015
Contributor

Should orc be added as well? I see a change to OrcRelation.scala below.

Contributor Author

Orc is added in the other resource file, since Hive is a separate package.
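For reference, a sketch of what that separate resource file would contain; the path and the ORC provider class name are assumptions based on this discussion, not copied from the patch.

```
# sql/hive/src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
org.apache.spark.sql.hive.orc.DefaultSource
```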

CodingCat pushed a commit to CodingCat/spark that referenced this pull request Aug 17, 2015