
Conversation

JDrit
Contributor

@JDrit JDrit commented Jul 30, 2015

Users currently have to provide the full class name for external data sources, for example:

`sqlContext.read.format("com.databricks.spark.avro").load(path)`

This change allows external data source packages to register themselves through a `ServiceLoader`, so they can expose a custom alias such as:

`sqlContext.read.format("avro").load(path)`

With this, external data source packages are used the same way as the built-in data sources like parquet, json, etc.
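For illustration, a minimal sketch of what an external package does on its side to expose such an alias. The trait is the one this PR ends up introducing (`DataSourceRegister`, per the test output further down); the method name `shortName` follows later Spark releases and may differ slightly in this patch, and the Avro body is elided.

```scala
package com.databricks.spark.avro

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.{BaseRelation, DataSourceRegister, RelationProvider}

// The provider mixes in the registration trait so Spark can map the short
// alias "avro" to this class instead of requiring the full class name.
class DefaultSource extends RelationProvider with DataSourceRegister {

  // Alias users can pass to .format(...); method name assumed from later Spark APIs.
  override def shortName(): String = "avro"

  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation = {
    ??? // build the Avro relation here (elided)
  }
}
```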

@SparkQA

SparkQA commented Aug 3, 2015

Test build #1306 has finished for PR 7802 at commit 208a2a8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait DataSourceProvider
    • trait RelationProvider extends DataSourceProvider
    • trait SchemaRelationProvider extends DataSourceProvider
    • trait HadoopFsRelationProvider extends DataSourceProvider
    • trait CreatableRelationProvider extends DataSourceProvider

.rat-excludes Outdated
Contributor

Perhaps just add `META-INF/services/` instead? For future-proofing.
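For reference, this is the ServiceLoader convention the exclude covers: a resource file named after the fully-qualified trait being registered, listing one implementation class per line. The path below is illustrative, and the trait name assumes the `DataSourceRegister` form this PR converges on.

```
# src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
# One fully-qualified provider class per line; '#' starts a comment.
com.databricks.spark.avro.DefaultSource
```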

@vanzin
Contributor

vanzin commented Aug 3, 2015

Looks OK as far as I can tell. I generally find it weird to use traits for public APIs (too easy to break compatibility), but all the API here is Scala, so maybe it's not a big deal. I also wonder if there's a test that can be written to ensure that we're not mistakenly registering two sources with the same name.

And finally, ServiceLoader may have some funny semantics when used with userClassPathFirst, given the current implementation of ChildFirstURLClassLoader. Might be worth it to make a note to look at that behavior in more detail.
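To make the concern concrete, a rough sketch of the lookup in question, assuming the registered trait is `DataSourceRegister` and Spark's internal `Utils.getContextOrSparkClassLoader` helper; the loader passed here is exactly what `userClassPathFirst` / `ChildFirstURLClassLoader` would change.

```scala
import java.util.ServiceLoader
import scala.collection.JavaConverters._

import org.apache.spark.sql.sources.DataSourceRegister
import org.apache.spark.util.Utils

// ServiceLoader resolves providers via loader.getResources("META-INF/services/..."),
// so a child-first classloader can change which registrations are visible.
def loadRegisteredSources(): Seq[DataSourceRegister] = {
  val loader = Utils.getContextOrSparkClassLoader
  ServiceLoader.load(classOf[DataSourceRegister], loader).asScala.toSeq
}
```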

@JDrit
Contributor Author

JDrit commented Aug 3, 2015

I looked into userClassPathFirst and it did not seem to interfere with anything. I turned it on for both the driver and the executor, and all the tests still passed.

@JDrit
Contributor Author

JDrit commented Aug 3, 2015

I moved the classloader out to a lazy val as well.

@JDrit
Contributor Author

JDrit commented Aug 3, 2015

There are also tests to ensure correct behavior when two or more data sources are registered under the same name.
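A sketch of the kind of check those tests exercise, under the same naming assumptions as above: if more than one service-loaded provider claims an alias, resolution should fail rather than pick one arbitrarily.

```scala
import org.apache.spark.sql.sources.DataSourceRegister

// Illustrative resolution step: exactly one provider may own a given alias.
def resolveAlias(alias: String, registered: Seq[DataSourceRegister]): DataSourceRegister = {
  registered.filter(_.shortName().equalsIgnoreCase(alias)) match {
    case Seq(single) => single
    case Seq() =>
      sys.error(s"Failed to load class for data source: $alias")
    case multiple =>
      sys.error(s"Multiple sources found for $alias: " +
        multiple.map(_.getClass.getName).mkString(", "))
  }
}
```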

@rxin
Contributor

rxin commented Aug 5, 2015

@JDrit what's still WIP about this patch?

@rxin
Contributor

rxin commented Aug 5, 2015

Actually I think the current API breaks binary compatibility for data sources, so we can't merge it as is.

In Java (and for Scala binary compatibility), RelationProvider now extends an extra interface whose method has no default implementation. We need to find a workaround to provide this information.

@SparkQA

SparkQA commented Aug 5, 2015

Test build #1345 has finished for PR 7802 at commit 74db85e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • sys.error(s"Failed to load class for data source: $provider")
    • trait DataSourceProvider
    • trait RelationProvider extends DataSourceProvider
    • trait SchemaRelationProvider extends DataSourceProvider
    • trait HadoopFsRelationProvider extends DataSourceProvider
    • trait CreatableRelationProvider extends DataSourceProvider

@JDrit
Contributor Author

JDrit commented Aug 5, 2015

@rxin I changed the interface that provides the alias to be a mixin used by the different data sources, which should fix the binary compatibility problem. Data sources now mix in this trait if they want to provide an alias for themselves. Let me know if this addresses your concerns.
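A sketch of the difference, to make the binary-compatibility point concrete (the method name is illustrative):

```scala
// Earlier approach: a new required parent trait for all provider traits.
// Adding an abstract member forces every already-compiled external data
// source to implement it, which is the binary-compatibility break above.
trait DataSourceProvider {
  def shortName(): String
}
// trait RelationProvider extends DataSourceProvider { ... }

// Adopted approach: a standalone, optional mixin. Existing providers stay
// untouched; a package mixes this in only if it wants to expose an alias.
trait DataSourceRegister {
  def shortName(): String
}
// class DefaultSource extends RelationProvider with DataSourceRegister { ... }
```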

@JDrit JDrit changed the title [SPARK-9486][SQL][WIP] Add data source aliasing for external packages [SPARK-9486][SQL] Add data source aliasing for external packages Aug 5, 2015
@SparkQA

SparkQA commented Aug 5, 2015

Test build #1372 has finished for PR 7802 at commit 87b7f1c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class In(value: Expression, list: Seq[Expression]) extends Predicate
    • case class InSet(child: Expression, hset: Set[Any]) extends UnaryExpression with Predicate
    • .getOrElse(sys.error(s"Failed to load class for data source: $provider"))
    • trait DataSourceRegister

@SparkQA

SparkQA commented Aug 6, 2015

Test build #1378 has finished for PR 7802 at commit 87b7f1c.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • .getOrElse(sys.error(s"Failed to load class for data source: $provider"))
    • trait DataSourceRegister

@rxin
Contributor

rxin commented Aug 6, 2015

@JDrit still failing orc.

@JDrit
Contributor Author

JDrit commented Aug 6, 2015

It was an issue with the class loader not being re-resolved on every call of lookupDataSource; this latest change should fix that.
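A sketch of the shape of that fix, assuming Spark's internal `Utils.getContextOrSparkClassLoader` helper: the loader is resolved inside `lookupDataSource` on every call rather than cached in a lazy val, so calls made under a different context classloader (as in the ORC/Hive tests) see the right service registrations.

```scala
import org.apache.spark.util.Utils

def lookupDataSource(provider: String): Class[_] = {
  // Re-resolve the classloader on every call instead of capturing it once.
  val loader = Utils.getContextOrSparkClassLoader
  // ... ServiceLoader alias lookup and class-name fallback elided ...
  loader.loadClass(provider)
}
```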

.rat-excludes Outdated
Contributor

Can this be META-INF/services/*? I can see someone creating a package with actual source files called services.

@SparkQA

SparkQA commented Aug 7, 2015

Test build #1395 has finished for PR 7802 at commit 72b349a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • sys.error(s"Failed to load class for data source: $provider")
    • trait DataSourceRegister

@SparkQA

SparkQA commented Aug 7, 2015

Test build #1398 has finished for PR 7802 at commit e5e93b2.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • class BlockMatrix(DistributedMatrix):
    • case class In(value: Expression, list: Seq[Expression]) extends Predicate
    • case class InSet(child: Expression, hset: Set[Any]) extends UnaryExpression with Predicate
    • sys.error(s"Failed to load class for data source: $provider")
    • trait DataSourceRegister

Contributor

`tryLoad(loader, s"$provider.DefaultSource")` => `tryLoad(loader, s"$provider")`?

Contributor

Oh, sorry, both are supported.
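For clarity, a sketch of the two spellings being discussed, both of which the lookup accepts (the Try-based fallback here is illustrative):

```scala
import scala.util.Try

val loader = Thread.currentThread().getContextClassLoader
val provider = "com.databricks.spark.avro"

// Either the provider class itself, or a package containing a DefaultSource class.
val candidates = Seq(provider, s"$provider.DefaultSource")
val loaded: Option[Class[_]] =
  candidates.flatMap(name => Try(loader.loadClass(name)).toOption).headOption
```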

@SparkQA

SparkQA commented Aug 7, 2015

Test build #1402 timed out for PR 7802 at commit e5e93b2 after a configured wait of 175m.

@SparkQA

SparkQA commented Aug 8, 2015

Test build #1411 has finished for PR 7802 at commit e5e93b2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • sys.error(s"Failed to load class for data source: $provider")
    • trait DataSourceRegister

@rxin
Contributor

rxin commented Aug 8, 2015

I'm going to merge this. I will submit a PR later to change the API slightly.

asfgit pushed a commit that referenced this pull request Aug 8, 2015
Users currently have to provide the full class name for external data sources, for example:

`sqlContext.read.format("com.databricks.spark.avro").load(path)`

This allows external data source packages to register themselves through a `ServiceLoader`, so they can expose a custom alias such as:

`sqlContext.read.format("avro").load(path)`

With this, external data source packages are used the same way as the built-in data sources like parquet, json, etc.

Author: Joseph Batchik <joseph.batchik@cloudera.com>
Author: Joseph Batchik <josephbatchik@gmail.com>

Closes #7802 from JDrit/service_loader and squashes the following commits:

49a01ec [Joseph Batchik] fixed a couple of format / error bugs
e5e93b2 [Joseph Batchik] modified rat file to only excluded added services
72b349a [Joseph Batchik] fixed error with orc data source actually
9f93ea7 [Joseph Batchik] fixed error with orc data source
87b7f1c [Joseph Batchik] fixed typo
101cd22 [Joseph Batchik] removing unneeded changes
8f3cf43 [Joseph Batchik] merged in changes
b63d337 [Joseph Batchik] merged in master
95ae030 [Joseph Batchik] changed the new trait to be used as a mixin for data source to register themselves
74db85e [Joseph Batchik] reformatted class loader
ac2270d [Joseph Batchik] removing some added test
a6926db [Joseph Batchik] added test cases for data source loader
208a2a8 [Joseph Batchik] changes to do error catching if there are multiple data sources
946186e [Joseph Batchik] started working on service loader

(cherry picked from commit a3aec91)
Signed-off-by: Reynold Xin <rxin@databricks.com>
@asfgit asfgit closed this in a3aec91 Aug 8, 2015
Contributor

Should orc be added as well? I see a change to OrcRelation.scala below.

Contributor Author

Orc is added in the other resource file, since Hive is a separate package.
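For reference, a sketch of what that separate resource file would contain; the path and the ORC provider class name are assumptions based on this discussion, not copied from the patch.

```
# sql/hive/src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
org.apache.spark.sql.hive.orc.DefaultSource
```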

CodingCat pushed a commit to CodingCat/spark that referenced this pull request Aug 17, 2015