[WIP][SPARK-2883][SQL]initial support ORC in spark sql #2576
Conversation
Can one of the admins verify this patch?
Have you considered cooperating with #2475?
@cloud-fan, no, since #2475 has not been merged to master yet. @marmbrus can you take a look at this?
I guess we should revisit it after #2475 is in.
@@ -54,6 +54,21 @@
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.spark-project.hive</groupId>
      <artifactId>hive-exec</artifactId>
Is there a way we can depend only on the ORC file support and not on the entire hive-exec package? I think we'll need to move this functionality into the hive package if it requires all of the hive dependencies.
Yeah it appears ORC has deep dependencies on Hive types. So I think this will need to be moved into the Hive project.
Actually I also want that, but unfortunately I have not found a solution.
The hive project already supports ORC; this PR enables the sql project to support it as well, just as it supports Parquet.
The whole point of having hive as a separate project is to avoid pulling in all of the dependencies of hive for all Spark users. Thus, as Patrick said, this will have to go in the hive sub-project. That said, I think this patch is still quite valuable. We generally recommend that all users use the HiveContext if they can tolerate hive's dependencies, so it'll still get use. Also, this patch provides better programmatic API support in addition to the existing SQL support.
Hey @scwf, thanks for working on this! This will be a pretty awesome feature that people have been asking for. I did a quick pass and made some comments. One higher-level comment: we'll want to change this to use an API similar to #2475, as we are going to stop adding new data source methods directly to SQLContext (and we'll probably try to deprecate parquet and json there as well eventually). I haven't had enough time to finish that PR yet (it doesn't yet support inserting data), but will try to get that merged in soon.
Once #2616 is in, can we reuse stuff in
ok to test
Test FAILed. |
QA tests have started for PR 2576 at commit
Test FAILed. |
QA tests have started for PR 2576 at commit
Test FAILed. |
This reverts commit 5f5fda8.
Added ORC support with the new data source API. Since there is currently no sink interface in the data source API, I did not remove the old version.
Test build #23756 has started for PR 2576 at commit
Test FAILed. |
* Allows creation of orc based tables using the syntax
* `CREATE TEMPORARY TABLE ... USING org.apache.spark.sql.orc`.
* Currently the only option required is `path`, which should be the location of a collection of,
* optionally partitioned, parquet files.
Orc files, instead of parquet files.
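To make the `USING` syntax from the doc comment concrete, here is a minimal sketch of how such a table could be registered and queried. The table name and path are hypothetical, and this assumes a context with the ORC data source on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("OrcTableExample"))
val hiveContext = new HiveContext(sc)

// Register a temporary table backed by a directory of (optionally
// partitioned) ORC files. `path` is the only required option.
hiveContext.sql(
  """CREATE TEMPORARY TABLE orcTable
    |USING org.apache.spark.sql.orc
    |OPTIONS (path '/data/people.orc')
  """.stripMargin)

// Query it like any other table.
hiveContext.sql("SELECT * FROM orcTable").collect()
```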
predicate pushdown support
Test build #24014 has started for PR 2576 at commit
Test FAILed. |
@marmbrus, I am fixing the test failure and refactoring the code based on the data source API. One question: should I keep the sink part (write interface) here, or just provide the ability to read ORC files based on the data source API?
We are adding support for writing data in the next version of the API, so it's probably better to wait until that is available.
OK, so I made #3753 to support ORC based on the new data source API. @zhzhan, that PR supports reading partitioned ORC files but not predicate pushdown (PPD also needs refactoring based on the new data source API). You can make a follow-up PR for PPD after it is merged, or send the PPD support to my branch; either way is fine.
We can close this issue, right?
Yes, closed!
Provide initial support for the ORC file format in Spark SQL, with support for both reading and writing ORC files. Users can use ORC files just like Parquet, as follows:
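The original description's example is not included above, so here is a hedged sketch of what usage mirroring the existing Parquet API might look like. The method names `saveAsOrcFile` and `orcFile` are assumed to parallel `saveAsParquetFile` and `parquetFile`; the table and path names are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("OrcExample"))
val hiveContext = new HiveContext(sc)

// Write the result of a query out as ORC files
// (assumed analogue of `saveAsParquetFile`).
val people = hiveContext.sql("SELECT name, age FROM src")
people.saveAsOrcFile("people.orc")

// Read the ORC files back in; the schema is preserved
// (assumed analogue of `parquetFile`).
val loaded = hiveContext.orcFile("people.orc")
loaded.registerTempTable("people")
hiveContext.sql("SELECT name FROM people WHERE age > 21").collect()
```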