[SPARK-9486][SQL] Add data source aliasing for external packages #7802
Conversation
Test build #1306 has finished for PR 7802 at commit
.rat-excludes (outdated)
Perhaps just add `META-INF/services/` instead, for future-proofing?
Looks ok as far as I can tell. I generally find it weird to use traits for public APIs (too easy to break compatibility), but all the API here is Scala, so maybe it's not a big deal. I also wonder if there's a test that can be written to ensure that we're not mistakenly registering two sources with the same name. And finally,
I looked into the
I moved the classloader out to a lazy val as well.
There are also tests to ensure correct behavior when two or more data sources are registered under the same name.
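The duplicate-registration check those tests exercise can be sketched as follows. The trait and method names here are illustrative, not the exact Spark API: a lookup by alias succeeds only when exactly one registered source claims that alias.

```scala
// Hypothetical alias-carrying trait; the real Spark trait differs in name.
trait Registered { def shortName(): String }

// Resolve an alias against the discovered providers, failing loudly when
// zero or more than one provider claims the same short name.
def lookupDataSource(alias: String, providers: Seq[Registered]): Registered =
  providers.filter(_.shortName() == alias) match {
    case Seq(single) => single
    case Seq() =>
      throw new ClassNotFoundException(s"Failed to find data source: $alias")
    case multiple =>
      throw new RuntimeException(
        s"Multiple sources found for $alias: " +
          multiple.map(_.getClass.getName).mkString(", "))
  }
```

Erroring out rather than silently picking one provider is the safer design when two packages on the classpath collide on an alias.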
@JDrit what's still WIP about this patch?
Actually I think the current API breaks binary compatibility for data sources, so we can't merge it as is. In Java (or Scala binary compatibility), RelationProvider now has an extra interface that has no default implementation. We need to find a workaround to provide this information.
Test build #1345 has finished for PR 7802 at commit
@rxin I changed the interface that provided the alias to be a mixin used in the different data sources, so that should fix the binary compatibility problem. Data sources now mix in this trait if they want to provide an alias for themselves. Let me know if this satisfies your concerns.
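The mixin approach described above avoids the compatibility break because no existing interface gains a new abstract method; opting in is purely additive. A minimal sketch, with illustrative names (`AliasedSource`, `AvroProvider`) rather than the exact Spark API:

```scala
// Sources opt in to aliasing by mixing in this trait; sources that don't
// remain binary compatible, since RelationProvider itself is untouched.
trait AliasedSource {
  def shortName(): String
}

// An external package's provider registers an alias for itself:
class AvroProvider extends AliasedSource {
  override def shortName(): String = "avro"
}
```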
Test build #1372 has finished for PR 7802 at commit
Test build #1378 has finished for PR 7802 at commit
@JDrit still failing orc.
It was an issue with the class loader not being reloaded on every call of
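The class-loader staleness described above is typically avoided by resolving the context class loader inside the lookup method rather than caching one captured at initialization. A sketch under that assumption, with a hypothetical `Provider` trait standing in for the registration interface:

```scala
import java.util.ServiceLoader

// Hypothetical registration trait; the name is illustrative.
trait Provider { def shortName(): String }

// Resolving the loader per call picks up the current context class loader
// every time, instead of reusing one captured when the object initialized.
def loadProviders(): Seq[Provider] = {
  val loader = Option(Thread.currentThread().getContextClassLoader)
    .getOrElse(ClassLoader.getSystemClassLoader)
  val it = ServiceLoader.load(classOf[Provider], loader).iterator()
  val buf = scala.collection.mutable.ArrayBuffer.empty[Provider]
  while (it.hasNext) buf += it.next()
  buf.toSeq
}
```

This matters in environments like Hive integration, where different threads can carry different context class loaders.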
.rat-excludes (outdated)
Can this be `META-INF/services/*`? I can see someone creating a package with actual source files called `services`.
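For context, ServiceLoader discovers implementations through a resource file named after the fully qualified service interface, with one implementation class per line; that is the file the `.rat-excludes` pattern needs to cover. A hypothetical registration for an external Avro package might look like this (the trait name in the path and the implementation class are illustrative):

```
# src/main/resources/META-INF/services/<fully.qualified.RegistrationTrait>
com.databricks.spark.avro.DefaultSource
```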
Test build #1395 has finished for PR 7802 at commit
Test build #1398 has finished for PR 7802 at commit
`tryLoad(loader, s"$provider.DefaultSource")` => `tryLoad(loader, s"$provider")`?
Oh, sorry, both are supported.
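The two lookups being discussed can be sketched as a fallback chain: try the provider name exactly as given, then fall back to the `<name>.DefaultSource` convention. `tryLoad` and `resolveProvider` here are hypothetical helpers, not the exact Spark code:

```scala
// Attempt to load a class by name, returning None if it isn't on the classpath.
def tryLoad(loader: ClassLoader, name: String): Option[Class[_]] =
  try Some(Class.forName(name, true, loader))
  catch { case _: ClassNotFoundException => None }

// Supports both spellings: a fully qualified provider class, or a package
// containing a DefaultSource class.
def resolveProvider(loader: ClassLoader, provider: String): Option[Class[_]] =
  tryLoad(loader, provider).orElse(tryLoad(loader, s"$provider.DefaultSource"))
```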
Test build #1402 timed out for PR 7802 at commit
Test build #1411 has finished for PR 7802 at commit
I'm going to merge this - I will submit a PR later to change the API slightly.
Users currently have to provide the full class name for external data sources, like: `sqlContext.read.format("com.databricks.spark.avro").load(path)`. This allows external data source packages to register themselves using a Service Loader so that they can add a custom alias like: `sqlContext.read.format("avro").load(path)`. This makes using external data source packages follow the same format as internal data sources like parquet, json, etc.

Author: Joseph Batchik <joseph.batchik@cloudera.com>
Author: Joseph Batchik <josephbatchik@gmail.com>

Closes #7802 from JDrit/service_loader and squashes the following commits:

49a01ec [Joseph Batchik] fixed a couple of format / error bugs
e5e93b2 [Joseph Batchik] modified rat file to only excluded added services
72b349a [Joseph Batchik] fixed error with orc data source actually
9f93ea7 [Joseph Batchik] fixed error with orc data source
87b7f1c [Joseph Batchik] fixed typo
101cd22 [Joseph Batchik] removing unneeded changes
8f3cf43 [Joseph Batchik] merged in changes
b63d337 [Joseph Batchik] merged in master
95ae030 [Joseph Batchik] changed the new trait to be used as a mixin for data source to register themselves
74db85e [Joseph Batchik] reformatted class loader
ac2270d [Joseph Batchik] removing some added test
a6926db [Joseph Batchik] added test cases for data source loader
208a2a8 [Joseph Batchik] changes to do error catching if there are multiple data sources
946186e [Joseph Batchik] started working on service loader

(cherry picked from commit a3aec91)
Signed-off-by: Reynold Xin <rxin@databricks.com>
Should orc be added as well? I see a change to OrcRelation.scala below.
Orc is added in the other resource file since hive is a separate package.