
[WIP][SPARK-2883][SQL]initial support ORC in spark sql #2576

Closed
wants to merge 45 commits

Conversation

@scwf (Contributor) commented Sep 29, 2014

This provides initial support for the ORC file format in Spark SQL, covering both reading and writing ORC files. Users can work with ORC files just like Parquet, as follows:

... ...
  rdd.registerTempTable("records")
  rdd.saveAsOrcFile("pair.orc")
  val orcFile = sqlContext.orcFile("pair.orc")
  orcFile.registerTempTable("orcFile")
  sql("SELECT * FROM orcFile").collect().foreach(println)

@AmplabJenkins

Can one of the admins verify this patch?

@cloud-fan (Contributor)

Have you considered cooperating with #2475?

@scwf (Contributor, Author) commented Sep 30, 2014

@cloud-fan, no, since #2475 has not been merged to master yet. @marmbrus, can you take a look at this?

@yhuai (Contributor) commented Sep 30, 2014

I guess we should revisit it after #2475 is in.

@@ -54,6 +54,21 @@
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.spark-project.hive</groupId>
      <artifactId>hive-exec</artifactId>
Contributor

Is there a way we can depend only on ORC and not on the entire hive-exec package? I think we'll need to move this functionality into the hive package if it requires all of the Hive dependencies.

Contributor

Yeah, it appears ORC has deep dependencies on Hive types, so I think this will need to be moved into the Hive project.

Contributor Author

Actually, I also want that, but unfortunately I have not found a solution.

Contributor Author

The hive project already supports ORC; this PR enables the sql project to support it, just as it supports Parquet.

Contributor

The whole point of having hive as a separate project is to avoid pulling in all of Hive's dependencies for all Spark users. Thus, as Patrick said, this will have to go in the hive subproject. That said, I think this patch is still quite valuable. We generally recommend that all users use the HiveContext if they can tolerate Hive's dependencies, so it'll still get use. Also, this patch provides better programmatic API support in addition to the existing SQL support.
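
To make that recommendation concrete, a minimal sketch of the same read path through HiveContext, assuming the ORC methods from this patch end up there after the move to the hive subproject (the placement is an assumption; HiveContext itself is real):

  import org.apache.spark.sql.hive.HiveContext

  val hiveContext = new HiveContext(sc)  // assumes an existing SparkContext `sc`
  import hiveContext._

  // Hypothetical placement of this PR's ORC reader after the move:
  val orcFile = hiveContext.orcFile("pair.orc")
  orcFile.registerTempTable("orcFile")
  sql("SELECT * FROM orcFile").collect().foreach(println)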

@marmbrus (Contributor) commented Oct 1, 2014

Hey @scwf, thanks for working on this! This will be a pretty awesome feature that people have been asking for. I did a quick pass and made some comments.

One higher-level comment: we'll want to change this to use an API similar to #2475, as we are going to stop adding new data source methods directly to SQLContext (and we'll probably try to deprecate parquet and json there as well eventually). I haven't had enough time to finish that PR yet (it doesn't yet support inserting data), but will try to get it merged in soon.
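
For context on what "an API similar to #2475" means: the external data sources API centers on a relation provider plus a scannable relation. Below is a hedged sketch of an ORC source against those interfaces, using the trait shapes the API eventually stabilized on; the OrcRelation internals are elided and hypothetical, and import paths shifted across early releases:

  import org.apache.spark.rdd.RDD
  import org.apache.spark.sql.{Row, SQLContext}
  import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
  import org.apache.spark.sql.types.StructType  // path moved across early releases

  // Hypothetical ORC source plugged into the data sources API.
  class DefaultSource extends RelationProvider {
    override def createRelation(
        sqlContext: SQLContext,
        parameters: Map[String, String]): BaseRelation = {
      val path = parameters.getOrElse("path", sys.error("'path' must be specified"))
      OrcRelation(path)(sqlContext)
    }
  }

  case class OrcRelation(path: String)(@transient val sqlContext: SQLContext)
    extends BaseRelation with TableScan {
    override def schema: StructType = ???    // derive from ORC file metadata
    override def buildScan(): RDD[Row] = ??? // scan ORC files under `path`
  }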

@yhuai (Contributor) commented Oct 1, 2014

Once #2616 is in, can we reuse stuff in hiveWriterContainers.scala and InsertIntoHiveTable.scala to avoid duplicating code?

@marmbrus (Contributor) commented Oct 1, 2014

ok to test

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21149/

@SparkQA

SparkQA commented Oct 4, 2014

QA tests have started for PR 2576 at commit f928657.

  • This patch merges cleanly.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21288/

@SparkQA

SparkQA commented Oct 4, 2014

QA tests have started for PR 2576 at commit 1505af4.

  • This patch merges cleanly.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21290/

@scwf changed the title [WIP][SPARK-3720][SQL]initial support ORC in spark sql [WIP][SPARK-2883][SQL]initial support ORC in spark sql Nov 18, 2014
@scwf (Contributor, Author) commented Nov 23, 2014

Added ORC support with the new data source API. Since the data source API currently has no sink interface, the old version is not removed.
TODO:
fix the issue with partitioned tables

@SparkQA

SparkQA commented Nov 23, 2014

Test build #23756 has started for PR 2576 at commit 1e0c1d9.

  • This patch merges cleanly.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23756/

* Allows creation of orc based tables using the syntax
* `CREATE TEMPORARY TABLE ... USING org.apache.spark.sql.orc`.
* Currently the only option required is `path`, which should be the location of a collection of,
* optionally partitioned, parquet files.
Contributor

Orc files, instead of parquet files
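
As a usage illustration of the documented `CREATE TEMPORARY TABLE ... USING` syntax; the table name and path are placeholders:

  sql("""
    CREATE TEMPORARY TABLE orcTable
    USING org.apache.spark.sql.orc
    OPTIONS (path '/path/to/orc')
  """)
  sql("SELECT * FROM orcTable").collect().foreach(println)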

@zhzhan (Contributor) commented Dec 1, 2014

@scwf I sent a pull request to your orc1 branch with predicate pushdown support. Please take a look at it and merge it. Let me know if you have any concerns. scwf#13

predicate pushdown support
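
For readers unfamiliar with the mechanism: ORC predicate pushdown hands the reader a Hive SearchArgument describing the filter. A rough sketch of building and wiring one, assuming a Hive-0.13-era reader; the configuration keys and exact wiring are assumptions here, not necessarily what scwf#13 does:

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.hive.ql.io.sarg.SearchArgumentFactory

  // Build a SearchArgument equivalent to: age < 25 (column name illustrative).
  val sarg = SearchArgumentFactory.newBuilder()
    .startAnd()
    .lessThan("age", 25)
    .end()
    .build()

  // Hand the serialized predicate to the ORC reader; "sarg.pushdown" is the
  // key consulted by Hive-0.13-era OrcInputFormat (treat it as an assumption).
  val conf = new Configuration()
  conf.set("sarg.pushdown", sarg.toKryo())
  conf.set("hive.io.file.readcolumn.names", "age")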
@SparkQA

SparkQA commented Dec 2, 2014

Test build #24014 has started for PR 2576 at commit 601d242.

  • This patch merges cleanly.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24014/

@scwf (Contributor, Author) commented Dec 15, 2014

@marmbrus, I am fixing the test failure and refactoring the code based on the data source API. One question: should I keep the sink part (write interface) here, or just provide the ability to read ORC files based on the data source API?

@marmbrus (Contributor)

We are adding support for writing data in the next version of the API so probably better to wait until that is available.
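
For forward-looking context only: the write-side hook that eventually landed in the data sources API has roughly this shape. It did not exist at the time of this exchange, so treat it purely as an illustration of where the sink part would plug in:

  import org.apache.spark.sql.DataFrame

  // Shape of the write hook as it later landed in org.apache.spark.sql.sources:
  // a relation that can accept a DataFrame, optionally overwriting existing data.
  trait InsertableRelation {
    def insert(data: DataFrame, overwrite: Boolean): Unit
  }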

@scwf (Contributor, Author) commented Dec 21, 2014

OK, so I made #3753 to support ORC based on the new data source API. @zhzhan, that PR supports reading partitioned ORC files but not PPD (PPD also needs refactoring based on the new data source API). You can make a follow-up PR for PPD after it is merged, or send the PPD support to my branch; either way is fine.

@marmbrus (Contributor)

We can close this issue right?

@scwf (Contributor, Author) commented Dec 30, 2014

yes, closed!

@scwf closed this Dec 30, 2014