
[SPARK-13883][SQL] Parquet Implementation of FileFormat.buildReader #11709

Closed
marmbrus wants to merge 18 commits into master from marmbrus/parquetReader

Conversation

marmbrus
Contributor

This PR implements the new buildReader interface for the Parquet FileFormat. A simple implementation of FileScanRDD is also included.

This code is exercised by the many existing Parquet tests.
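
For orientation, here is a rough sketch (not the exact code in this PR) of the shape being introduced: buildReader produces a function from a single file split to an iterator of rows, and a FileScanRDD-style compute step simply applies that function to each file it is assigned. The object and method names below are illustrative, and the import paths are approximate.

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.datasources.PartitionedFile

object FileScanSketch {
  // What buildReader hands back, roughly: one file split in, rows out.
  type ReadFunction = PartitionedFile => Iterator[InternalRow]

  // Minimal FileScanRDD-like compute: read every file assigned to this
  // Spark partition and concatenate the resulting row iterators.
  def readPartition(
      files: Seq[PartitionedFile],
      readFunction: ReadFunction): Iterator[InternalRow] = {
    files.iterator.flatMap(readFunction)
  }
}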

sqlContext.conf.getConf(SQLConf.PARQUET_INT96_AS_TIMESTAMP))

// Try to push down filters when filter push-down is enabled.
if (sqlContext.getConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key).toBoolean) {
Contributor Author

@liancheng any idea why this isn't working?
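
As an aside, a minimal sketch of reading this flag through the typed conf entry, mirroring how PARQUET_INT96_AS_TIMESTAMP is read a few lines above, rather than fetching the string by key and parsing it:

// Hypothetical alternative for the check above: typed access avoids the
// string lookup plus .toBoolean parse.
val parquetFilterPushDown: Boolean =
  sqlContext.conf.getConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED)
if (parquetFilterPushDown) {
  // ... build and push down the supported Parquet filters ...
}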

@SparkQA

SparkQA commented Mar 15, 2016

Test build #53150 has finished for PR 11709 at commit aad7bd1.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class RecordReaderIterator[T](rowReader: RecordReader[_, T]) extends Iterator[T]
    • // the type in next() and we get a class cast exception. If we make that function return

@SparkQA

SparkQA commented Mar 15, 2016

Test build #53212 has finished for PR 11709 at commit d3a39e7.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class RecordReaderIterator[T](rowReader: RecordReader[_, T]) extends Iterator[T]
    • // the type in next() and we get a class cast exception. If we make that function return

@SparkQA

SparkQA commented Mar 15, 2016

Test build #53229 has finished for PR 11709 at commit c9229a9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class RecordReaderIterator[T](rowReader: RecordReader[_, T]) extends Iterator[T]
    • // the type in next() and we get a class cast exception. If we make that function return

@SparkQA

SparkQA commented Mar 15, 2016

Test build #53237 has finished for PR 11709 at commit 9b720cd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • // the type in next() and we get a class cast exception. If we make that function return

@SparkQA

SparkQA commented Mar 16, 2016

Test build #53262 has finished for PR 11709 at commit 122f572.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 16, 2016

Test build #53367 has finished for PR 11709 at commit 8ccdd77.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

val iter = new RecordReaderIterator(parquetReader)
val fullSchema = dataSchema.toAttributes ++ partitionSchema.toAttributes
val joinedRow = new JoinedRow()
val appendPartitionColumns = GenerateUnsafeProjection.generate(fullSchema, fullSchema)
Contributor

Can we append the partition values as batches?

Contributor Author

We do; this only happens when we fall back to the old parquet-mr reader. Otherwise columns are appended on line 370. I can move this into the if to make that clearer.
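
For reference, a minimal sketch of that fallback append, assuming the iter, joinedRow, and appendPartitionColumns values from the snippet above and the partition values of the current file split:

// parquet-mr fallback only: glue the partition values onto each data row,
// then project the joined row down to a single UnsafeRow.
iter.map { dataRow =>
  appendPartitionColumns(joinedRow(dataRow, file.partitionValues))
}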

val appendPartitionColumns = GenerateUnsafeProjection.generate(fullSchema, fullSchema)

// UnsafeRowParquetRecordReader appends the columns internally to avoid another copy.
if (parquetReader.isInstanceOf[UnsafeRowParquetRecordReader]) {
Contributor

UnsafeRowParquetRecordReader could still produce UnsafeRow (without partition values) if enableVectorizedParquetReader is false.

@SparkQA

SparkQA commented Mar 16, 2016

Test build #53369 has finished for PR 11709 at commit 19e455c.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

marmbrus changed the title from [WIP][SPARK-13883][SQL] Parquet Implementation of FileFormat.buildReader to [SPARK-13883][SQL] Parquet Implementation of FileFormat.buildReader on Mar 16, 2016
@SparkQA

SparkQA commented Mar 17, 2016

Test build #53372 has finished for PR 11709 at commit ed60f3d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -807,6 +809,11 @@ public final int appendStruct(boolean isNull) {
public final boolean isArray() { return resultArray != null; }

/**
 * Marks this column as being constant.
 */
public final void setIsConstant() { isConstant = true; }
Contributor

Should we have a special ColumnVector for constants (one that always returns the same value)?

Contributor

I think that's a good suggestion. Let's do it as a follow-up, though.
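
Purely as illustration of the follow-up idea (this is not Spark's ColumnVector API; the class and methods are hypothetical): a constant column stores a single value and answers every row id with it.

// Toy sketch: one stored value, returned for any rowId; no per-row storage.
final class ConstantIntColumnSketch(value: Int, nullValue: Boolean = false) {
  def getInt(rowId: Int): Int = value
  def isNullAt(rowId: Int): Boolean = nullValue
}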

@SparkQA

SparkQA commented Mar 18, 2016

Test build #53558 has finished for PR 11709 at commit c7ca5fe.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

sameeragarwal and others added 3 commits March 18, 2016 14:56
resolve merge conflicts in vectorized parquet reader
@SparkQA

SparkQA commented Mar 18, 2016

Test build #53571 has finished for PR 11709 at commit 40037e6.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 18, 2016

Test build #53565 has finished for PR 11709 at commit b0cd621.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 19, 2016

Test build #53575 has finished for PR 11709 at commit e5725f8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Conflicts:
	sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java
@SparkQA

SparkQA commented Mar 22, 2016

Test build #53719 has finished for PR 11709 at commit 2e21c7a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nongli
Contributor

nongli commented Mar 22, 2016

LGTM

@marmbrus
Contributor Author

Thanks, merging to master!

asfgit closed this in 8014a51 on Mar 22, 2016
if (fileSchema.containsPath(colPath)) {
ColumnDescriptor fd = fileSchema.getColumnDescription(colPath);
if (!fd.equals(requestedSchema.getColumns().get(i))) {
throw new IOException("Schema evolution not supported.");
Contributor

It would be helpful to include fd in the exception message

roygao94 pushed a commit to roygao94/spark that referenced this pull request Mar 22, 2016
This PR implements the new `buildReader` interface for the Parquet `FileFormat`. A simple implementation of `FileScanRDD` is also included.

This code is exercised by the many existing Parquet tests.

Author: Michael Armbrust <michael@databricks.com>
Author: Sameer Agarwal <sameer@databricks.com>
Author: Nong Li <nong@databricks.com>

Closes apache#11709 from marmbrus/parquetReader.
val vectorizedReader = new VectorizedParquetRecordReader()
vectorizedReader.initialize(split, hadoopAttemptContext)
logDebug(s"Appending $partitionSchema ${file.partitionValues}")
vectorizedReader.initBatch(partitionSchema, file.partitionValues)
Contributor

@marmbrus Not quite sure about the intention of this line. Are we "reserving" column batches for partition columns here so that partition values can be filled later after data columns are fetched?

Contributor

I was wondering why we need to pass the partition schema and values to buildReader, since partitioning should have already been handled during the planning phase. Then I found they are only used here. It seems this is pretty much a use case specific to the Parquet vectorized reader.

Contributor Author

Yeah, this perhaps belongs at a higher layer (probably in FileScanRDD), but that would require us to vectorize everything or take a giant performance hit. This line is telling the vectorized reader to append the partition columns as static columns. This allows us to avoid an extra copy to append them for the optimized path.

Contributor Author

Basically, I'd just take the non-vectorized version below, put it in a utility function and use it everywhere. If we vectorize all the sources, that will be the only part we have to remove and then this can be done in FileScanRDD.

I think you do not want to do the actual partition appending in the planner as we did before, because then you can't easily have Spark partitions (splits) that read from different table partitions. That is what made the bucketing logic so convoluted in the old code path. This approach makes bucketing and collapsing small files into a single partition much simpler.
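
A minimal sketch of the utility suggested here, reusing the JoinedRow plus GenerateUnsafeProjection pattern from the non-vectorized branch of this PR; the object and function names, and their placement, are hypothetical:

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.JoinedRow
import org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection
import org.apache.spark.sql.types.StructType

object PartitionValuesUtil {
  // Hypothetical helper: append static partition values to every data row.
  def appendPartitionValues(
      rows: Iterator[InternalRow],
      dataSchema: StructType,
      partitionSchema: StructType,
      partitionValues: InternalRow): Iterator[InternalRow] = {
    val fullSchema = dataSchema.toAttributes ++ partitionSchema.toAttributes
    val joinedRow = new JoinedRow()
    val appendColumns = GenerateUnsafeProjection.generate(fullSchema, fullSchema)
    rows.map(row => appendColumns(joinedRow(row, partitionValues)))
  }
}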

Contributor

Sorry for the confusion. When I mentioned "planning phase" what I really meant was that ideally the data source implementation shouldn't care about partitioning at all. But I mixed up partition discovery and partition value appending. I agree with your comments. Thanks for the explanations.
