
[SPARK-14116][SQL] Implements buildReader() for ORC data source #11936

Closed

Conversation

liancheng
Contributor

What changes were proposed in this pull request?

This PR implements FileFormat.buildReader() for our ORC data source. It also fixes several minor styling issues in the HadoopFsRelation planning code path.

Note that OrcNewInputFormat doesn't rely on OrcNewSplit for creating OrcRecordReaders; a plain FileSplit is just fine. That's why we can simply create the record reader using OrcNewInputFormat and a FileSplit.
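
For reference, here is a minimal sketch of that construction, assuming Hadoop's new MapReduce API and Hive's OrcNewInputFormat; the helper name createOrcReader and its conf/filePath/start/length parameters are hypothetical stand-ins for values the scan supplies, not the PR's exact code:

import java.net.URI

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hive.ql.io.orc.{OrcNewInputFormat, OrcStruct}
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapreduce.{JobID, RecordReader, TaskAttemptID, TaskID, TaskType}
import org.apache.hadoop.mapreduce.lib.input.FileSplit
import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl

// Hypothetical helper: builds an ORC record reader for one slice of a file.
def createOrcReader(
    conf: Configuration,
    filePath: String,
    start: Long,
    length: Long): RecordReader[NullWritable, OrcStruct] = {
  // A plain FileSplit carries everything OrcNewInputFormat needs;
  // no OrcNewSplit is required.
  val fileSplit =
    new FileSplit(new Path(new URI(filePath)), start, length, Array.empty[String])
  val attemptId = new TaskAttemptID(new TaskID(new JobID(), TaskType.MAP, 0), 0)
  val attemptContext = new TaskAttemptContextImpl(conf, attemptId)
  new OrcNewInputFormat().createRecordReader(fileSplit, attemptContext)
}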

How was this patch tested?

Existing test cases should cover this change.

@liancheng liancheng force-pushed the spark-14116-build-reader-for-orc branch from e219295 to 057b6f2 on March 24, 2016 15:58
@SparkQA

SparkQA commented Mar 24, 2016

Test build #54048 has finished for PR 11936 at commit 057b6f2.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class FilePartition(index: Int, files: Seq[PartitionedFile]) extends Partition

@liancheng liancheng changed the title [SPARK-14116][SQL][WIP] Implements buildReader() for ORC data source [SPARK-14116][SQL] Implements buildReader() for ORC data source Mar 25, 2016
@SparkQA

SparkQA commented Mar 25, 2016

Test build #54169 has finished for PR 11936 at commit b1d630b.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor Author

cc @cloud-fan @yhuai @marmbrus

@@ -81,10 +81,10 @@ private[sql] object FileSourceStrategy extends Strategy with Logging {
val bucketColumns =
Contributor

seems this variable is never used?

Contributor Author

Yeah... I didn't remove it since I haven't gone through that part thoroughly yet.

@cloud-fan
Contributor

LGTM

@SparkQA

SparkQA commented Mar 25, 2016

Test build #54172 has finished for PR 11936 at commit 7d628ed.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor Author

@marmbrus Unfortunately, our current strategy for generating Spark partitions doesn't play well with the ORC version of buildReader(), or more specifically, with OrcRecordReader. The following Spark shell snippet shows the problem:

import org.apache.spark.sql.types._

// Just creates a random file that is large enough
val df = sqlContext.range(1000000).select(
  'id cast StringType as 'a,
  'id cast StringType as 'b,
  'id cast StringType as 'c
)

val path = "/tmp/large.orc"
df.write.mode("overwrite").orc(path)

sqlContext.sql(s"SET spark.sql.files.maxPartitionBytes=${1024 * 1024}")
sqlContext.sql(s"SET spark.sql.parquet.enableVectorizedReader=false")

sqlContext.read.orc(path).count() // <-- Gives 500,000 instead of 1,000,000

Please refer to the discussion here for details.

@SparkQA

SparkQA commented Mar 25, 2016

Test build #54182 has finished for PR 11936 at commit 2cbb56d.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor Author

Just realized the problem while washing dishes...

The above comment is a false alarm. The real problem is that I ignored the start position and length of the PartitionedFile when creating ORC splits with OrcNewInputFormat. After digging for a while, I found that although instantiating an OrcNewSplit is pretty complicated, OrcNewInputFormat doesn't actually require one; a plain FileSplit is enough. So now the OrcRecordReader is simply created from OrcNewInputFormat and a FileSplit, which works well.

// Appends partition values
val fullOutput = dataSchema.toAttributes ++ partitionSchema.toAttributes
val joinedRow = new JoinedRow()
val appendPartitionColumns = GenerateUnsafeProjection.generate(fullOutput, fullOutput)
Contributor

Should we move this out of the function? Then we don't need to create them every time the function is invoked.

Contributor Author

Yeah, we should. Michael also mentioned this earlier in this thread. We can do it in a follow-up PR after finishing the other data sources, to avoid conflicts.
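
For illustration, a sketch of what that follow-up could look like, hoisting the one-time setup out of the per-file closure; buildReaderSketch and readFile are hypothetical stand-ins, and in a distributed setting the generated projection captured by the closure would still need to be constructible on the executor side:

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.JoinedRow
import org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection
import org.apache.spark.sql.execution.datasources.PartitionedFile
import org.apache.spark.sql.types.StructType

// Hypothetical shape of the suggested refactoring: `readFile` stands in for
// the per-file ORC iterator built elsewhere in buildReader().
def buildReaderSketch(
    dataSchema: StructType,
    partitionSchema: StructType,
    readFile: PartitionedFile => Iterator[InternalRow])
  : PartitionedFile => Iterator[InternalRow] = {
  // Built once per buildReader() call rather than once per file.
  val fullOutput = dataSchema.toAttributes ++ partitionSchema.toAttributes
  val appendPartitionColumns = GenerateUnsafeProjection.generate(fullOutput, fullOutput)

  (file: PartitionedFile) => {
    val joinedRow = new JoinedRow()
    readFile(file).map { dataRow =>
      appendPartitionColumns(joinedRow(dataRow, file.partitionValues))
    }
  }
}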

@SparkQA

SparkQA commented Mar 25, 2016

Test build #54183 has finished for PR 11936 at commit a26973e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// iterator.
val maybePhysicalSchema = OrcFileOperator.readSchema(Seq(file.filePath), Some(conf))

maybePhysicalSchema.fold(Iterator.empty: Iterator[InternalRow]) { physicalSchema =>
Contributor

I feel that using a fold here makes the code harder to understand.

Contributor

How about we use a more straightforward way?

Contributor Author

Done.
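
For reference, the "more straightforward way" can be an explicit pattern match instead of Option.fold; rowsOrEmpty and rowsFor are hypothetical stand-ins for the actual reader body built around the result of OrcFileOperator.readSchema:

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.types.StructType

// Sketch only: an explicit match reads more directly than Option.fold here.
def rowsOrEmpty(
    maybePhysicalSchema: Option[StructType],
    rowsFor: StructType => Iterator[InternalRow]): Iterator[InternalRow] =
  maybePhysicalSchema match {
    case Some(physicalSchema) => rowsFor(physicalSchema) // normal read path
    case None => Iterator.empty // empty ORC file without a footer schema
  }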

@SparkQA

SparkQA commented Mar 26, 2016

Test build #54254 has finished for PR 11936 at commit 25d894f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

  if (files.fileFormat.toString == "TestFileFormat" ||
-     files.fileFormat.isInstanceOf[parquet.DefaultSource]) &&
+     files.fileFormat.isInstanceOf[parquet.DefaultSource] ||
+     files.fileFormat.toString == "ORC") &&
      files.sqlContext.conf.parquetFileScan =>
Contributor

Let's rename this conf.

@yhuai
Contributor

yhuai commented Mar 26, 2016

Thanks. I am merging this to master.

@liancheng @cloud-fan Let's address https://github.com/apache/spark/pull/11936/files#r57520723 in either of your PRs for the other formats.

@asfgit asfgit closed this in b547de8 Mar 26, 2016
@liancheng liancheng deleted the spark-14116-build-reader-for-orc branch March 29, 2016 16:58