
[WIP][SPARK-2883][SQL]initial support ORC in spark sql #2576

Closed
wants to merge 45 commits

Conversation

@scwf (Contributor) commented Sep 29, 2014

This provides initial support for the ORC file format in Spark SQL, covering both reading and writing ORC files. Users can work with ORC files just like Parquet, as follows:

... ...
  rdd.registerTempTable("records")
  rdd.saveAsOrcFile("pair.orc")
  val orcFile = sqlContext.orcFile("pair.orc")
  orcFile.registerTempTable("orcFile")
  sql("SELECT * FROM orcFile").collect().foreach(println)

@AmplabJenkins

Can one of the admins verify this patch?

@cloud-fan (Contributor)

Have you considered cooperating with #2475?

@scwf (Contributor, Author) commented Sep 30, 2014

@cloud-fan, no, since #2475 has not been merged to master yet. @marmbrus, can you take a look at this?

@yhuai (Contributor) commented Sep 30, 2014

I guess we should revisit it after #2475 is in.

@@ -54,6 +54,21 @@
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.spark-project.hive</groupId>
      <artifactId>hive-exec</artifactId>
Contributor

Is there a way we can depend only on ORC and not on the entire hive-exec package? I think we'll need to move this functionality into the hive package if it requires all of the Hive dependencies.

Contributor

Yeah, it appears ORC has deep dependencies on Hive types, so I think this will need to be moved into the Hive project.

Contributor Author

Actually, I also want that, but unfortunately I have not found a solution.

Contributor Author

The hive project already supports ORC; this PR enables the sql project to support it, just as it supports Parquet.

Contributor

The whole point of having hive as a separate project is to avoid pulling in all of Hive's dependencies for all Spark users. Thus, as Patrick said, this will have to go in the hive subproject. That said, I think this patch is still quite valuable. We generally recommend that all users use the HiveContext if they can tolerate Hive's dependencies, so it'll still get use. Also, this patch provides better programmatic API support in addition to the existing SQL support.
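
To make that recommendation concrete, a minimal sketch of the same read path through HiveContext, assuming the ORC methods from this patch end up there after the move to the hive subproject (the placement is an assumption; HiveContext itself is real):

  import org.apache.spark.sql.hive.HiveContext

  val hiveContext = new HiveContext(sc)  // assumes an existing SparkContext `sc`
  import hiveContext._

  // Hypothetical placement of this PR's ORC reader after the move:
  val orcFile = hiveContext.orcFile("pair.orc")
  orcFile.registerTempTable("orcFile")
  sql("SELECT * FROM orcFile").collect().foreach(println)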

@marmbrus (Contributor) commented Oct 1, 2014

Hey @scwf, thanks for working on this! This will be a pretty awesome feature that people have been asking for. I did a quick pass and made some comments.

One higher-level comment: we'll want to change this to use an API similar to #2475, as we are going to stop adding new data source methods directly to SQLContext (and we'll probably try to deprecate parquet and json there as well eventually). I haven't had enough time to finish that PR yet (it doesn't yet support inserting data), but will try to get it merged in soon.
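
For context on what "an API similar to #2475" means: the external data sources API centers on a relation provider plus a scannable relation. Below is a hedged sketch of an ORC source against those interfaces, using the trait shapes the API eventually stabilized on; the OrcRelation internals are elided and hypothetical, and import paths shifted across early releases:

  import org.apache.spark.rdd.RDD
  import org.apache.spark.sql.{Row, SQLContext}
  import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
  import org.apache.spark.sql.types.StructType  // path moved across early releases

  // Hypothetical ORC source plugged into the data sources API.
  class DefaultSource extends RelationProvider {
    override def createRelation(
        sqlContext: SQLContext,
        parameters: Map[String, String]): BaseRelation = {
      val path = parameters.getOrElse("path", sys.error("'path' must be specified"))
      OrcRelation(path)(sqlContext)
    }
  }

  case class OrcRelation(path: String)(@transient val sqlContext: SQLContext)
    extends BaseRelation with TableScan {
    override def schema: StructType = ???    // derive from ORC file metadata
    override def buildScan(): RDD[Row] = ??? // scan ORC files under `path`
  }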

@yhuai (Contributor) commented Oct 1, 2014

Once #2616 is in, can we reuse stuff in hiveWriterContainers.scala and InsertIntoHiveTable.scala to avoid duplicating code?

@marmbrus (Contributor) commented Oct 1, 2014

ok to test

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21149/

@SparkQA

SparkQA commented Oct 4, 2014

QA tests have started for PR 2576 at commit f928657.

  • This patch merges cleanly.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21288/

@SparkQA

SparkQA commented Oct 4, 2014

QA tests have started for PR 2576 at commit 1505af4.

  • This patch merges cleanly.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21290/

@scwf changed the title [WIP][SPARK-3720][SQL]initial support ORC in spark sql [WIP][SPARK-2883][SQL]initial support ORC in spark sql Nov 18, 2014
@scwf (Contributor, Author) commented Nov 23, 2014

Added ORC support with the new data source API. Since the data source API currently has no sink interface, the old version is not removed.
TODO:
fix the issue with partitioned tables

@SparkQA

SparkQA commented Nov 23, 2014

Test build #23756 has started for PR 2576 at commit 1e0c1d9.

  • This patch merges cleanly.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23756/

* Allows creation of orc based tables using the syntax
* `CREATE TEMPORARY TABLE ... USING org.apache.spark.sql.orc`.
* Currently the only option required is `path`, which should be the location of a collection of,
* optionally partitioned, parquet files.
Contributor

Orc files, instead of parquet files
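
As a usage illustration of the documented `CREATE TEMPORARY TABLE ... USING` syntax; the table name and path are placeholders:

  sql("""
    CREATE TEMPORARY TABLE orcTable
    USING org.apache.spark.sql.orc
    OPTIONS (path '/path/to/orc')
  """)
  sql("SELECT * FROM orcTable").collect().foreach(println)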

@zhzhan (Contributor) commented Dec 1, 2014

@scwf I sent a pull request to your orc1 branch with predicate pushdown support. Please take a look at it and merge it. Let me know if you have any concerns. scwf#13

predicate pushdown support
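
For readers unfamiliar with the mechanism: ORC predicate pushdown hands the reader a Hive SearchArgument describing the filter. A rough sketch of building and wiring one, assuming a Hive-0.13-era reader; the configuration keys and exact wiring are assumptions here, not necessarily what scwf#13 does:

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.hive.ql.io.sarg.SearchArgumentFactory

  // Build a SearchArgument equivalent to: age < 25 (column name illustrative).
  val sarg = SearchArgumentFactory.newBuilder()
    .startAnd()
    .lessThan("age", 25)
    .end()
    .build()

  // Hand the serialized predicate to the ORC reader; "sarg.pushdown" is the
  // key consulted by Hive-0.13-era OrcInputFormat (treat it as an assumption).
  val conf = new Configuration()
  conf.set("sarg.pushdown", sarg.toKryo())
  conf.set("hive.io.file.readcolumn.names", "age")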
@SparkQA

SparkQA commented Dec 2, 2014

Test build #24014 has started for PR 2576 at commit 601d242.

  • This patch merges cleanly.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24014/

@scwf (Contributor, Author) commented Dec 15, 2014

@marmbrus, I am fixing the test failure and refactoring the code based on the data source API. One question: should I keep the sink part (write interface) here, or just provide the ability to read ORC files based on the data source API?

@marmbrus (Contributor)

We are adding support for writing data in the next version of the API so probably better to wait until that is available.
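
For forward-looking context only: the write-side hook that eventually landed in the data sources API has roughly this shape. It did not exist at the time of this exchange, so treat it purely as an illustration of where the sink part would plug in:

  import org.apache.spark.sql.DataFrame

  // Shape of the write hook as it later landed in org.apache.spark.sql.sources:
  // a relation that can accept a DataFrame, optionally overwriting existing data.
  trait InsertableRelation {
    def insert(data: DataFrame, overwrite: Boolean): Unit
  }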

@scwf (Contributor, Author) commented Dec 21, 2014

OK, so I made #3753 to support ORC based on the new data source API. @zhzhan, that PR supports reading partitioned ORC files but not PPD (PPD also needs refactoring based on the new data source API). You can make a follow-up PR for PPD after it is merged, or send the PPD support to my branch; either way is fine.

@marmbrus (Contributor)

We can close this issue right?

@scwf (Contributor, Author) commented Dec 30, 2014

yes, closed!

@scwf closed this Dec 30, 2014