
[SPARK-13883][SQL] Parquet Implementation of FileFormat.buildReader #11709

Closed
marmbrus wants to merge 18 commits into master from marmbrus/parquetReader

Conversation

marmbrus
Contributor

This PR implements the new buildReader interface for the Parquet FileFormat. A simple implementation of FileScanRDD is also included.

This code is exercised by the many existing Parquet tests.
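
For orientation, here is a rough sketch (not the exact code in this PR) of the shape being introduced: buildReader produces a function from a single file split to an iterator of rows, and a FileScanRDD-style compute step simply applies that function to each file it is assigned. The object and method names below are illustrative, and the import paths are approximate.

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.datasources.PartitionedFile

object FileScanSketch {
  // What buildReader hands back, roughly: one file split in, rows out.
  type ReadFunction = PartitionedFile => Iterator[InternalRow]

  // Minimal FileScanRDD-like compute: read every file assigned to this
  // Spark partition and concatenate the resulting row iterators.
  def readPartition(
      files: Seq[PartitionedFile],
      readFunction: ReadFunction): Iterator[InternalRow] = {
    files.iterator.flatMap(readFunction)
  }
}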

sqlContext.conf.getConf(SQLConf.PARQUET_INT96_AS_TIMESTAMP))

// Try to push down filters when filter push-down is enabled.
if (sqlContext.getConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key).toBoolean) {
Contributor Author

@liancheng any idea why this isn't working?
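
As an aside, a minimal sketch of reading this flag through the typed conf entry, mirroring how PARQUET_INT96_AS_TIMESTAMP is read a few lines above, rather than fetching the string by key and parsing it:

// Hypothetical alternative for the check above: typed access avoids the
// string lookup plus .toBoolean parse.
val parquetFilterPushDown: Boolean =
  sqlContext.conf.getConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED)
if (parquetFilterPushDown) {
  // ... build and push down the supported Parquet filters ...
}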

@SparkQA

SparkQA commented Mar 15, 2016

Test build #53150 has finished for PR 11709 at commit aad7bd1.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class RecordReaderIterator[T](rowReader: RecordReader[_, T]) extends Iterator[T]
    • // the type in next() and we get a class cast exception. If we make that function return

@SparkQA

SparkQA commented Mar 15, 2016

Test build #53212 has finished for PR 11709 at commit d3a39e7.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class RecordReaderIterator[T](rowReader: RecordReader[_, T]) extends Iterator[T]
    • // the type in next() and we get a class cast exception. If we make that function return

@SparkQA

SparkQA commented Mar 15, 2016

Test build #53229 has finished for PR 11709 at commit c9229a9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class RecordReaderIterator[T](rowReader: RecordReader[_, T]) extends Iterator[T]
    • // the type in next() and we get a class cast exception. If we make that function return

@SparkQA

SparkQA commented Mar 15, 2016

Test build #53237 has finished for PR 11709 at commit 9b720cd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • // the type in next() and we get a class cast exception. If we make that function return

@SparkQA

SparkQA commented Mar 16, 2016

Test build #53262 has finished for PR 11709 at commit 122f572.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 16, 2016

Test build #53367 has finished for PR 11709 at commit 8ccdd77.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

val iter = new RecordReaderIterator(parquetReader)
val fullSchema = dataSchema.toAttributes ++ partitionSchema.toAttributes
val joinedRow = new JoinedRow()
val appendPartitionColumns = GenerateUnsafeProjection.generate(fullSchema, fullSchema)
Contributor

Can we append the partition values as batches?

Contributor Author

We do; this only happens when we fall back to the old parquet-mr reader. Otherwise columns are appended on line 370. I can move this into the if to make that clearer.
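
For reference, a minimal sketch of that fallback append, assuming the iter, joinedRow, and appendPartitionColumns values from the snippet above and the partition values of the current file split:

// parquet-mr fallback only: glue the partition values onto each data row,
// then project the joined row down to a single UnsafeRow.
iter.map { dataRow =>
  appendPartitionColumns(joinedRow(dataRow, file.partitionValues))
}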

val appendPartitionColumns = GenerateUnsafeProjection.generate(fullSchema, fullSchema)

// UnsafeRowParquetRecordReader appends the columns internally to avoid another copy.
if (parquetReader.isInstanceOf[UnsafeRowParquetRecordReader]) {
Contributor

UnsafeRowParquetRecordReader could still produce UnsafeRow (without partition values) if enableVectorizedParquetReader is false.

@SparkQA

SparkQA commented Mar 16, 2016

Test build #53369 has finished for PR 11709 at commit 19e455c.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

marmbrus changed the title from [WIP][SPARK-13883][SQL] Parquet Implementation of FileFormat.buildReader to [SPARK-13883][SQL] Parquet Implementation of FileFormat.buildReader on Mar 16, 2016
@SparkQA

SparkQA commented Mar 17, 2016

Test build #53372 has finished for PR 11709 at commit ed60f3d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -807,6 +809,11 @@ public final int appendStruct(boolean isNull) {
public final boolean isArray() { return resultArray != null; }

/**
 * Marks this column as being constant.
 */
public final void setIsConstant() { isConstant = true; }
Contributor

Should we have a special ColumnVector for constants (one that always returns the same value)?

Contributor

I think that's a good suggestion. Let's do it as a follow-up, though.
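
Purely as illustration of the follow-up idea (this is not Spark's ColumnVector API; the class and methods are hypothetical): a constant column stores a single value and answers every row id with it.

// Toy sketch: one stored value, returned for any rowId; no per-row storage.
final class ConstantIntColumnSketch(value: Int, nullValue: Boolean = false) {
  def getInt(rowId: Int): Int = value
  def isNullAt(rowId: Int): Boolean = nullValue
}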

@SparkQA

SparkQA commented Mar 18, 2016

Test build #53558 has finished for PR 11709 at commit c7ca5fe.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

sameeragarwal and others added 3 commits March 18, 2016 14:56
resolve merge conflicts in vectorized parquet reader
@SparkQA

SparkQA commented Mar 18, 2016

Test build #53571 has finished for PR 11709 at commit 40037e6.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 18, 2016

Test build #53565 has finished for PR 11709 at commit b0cd621.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 19, 2016

Test build #53575 has finished for PR 11709 at commit e5725f8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Conflicts:
	sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java
@SparkQA

SparkQA commented Mar 22, 2016

Test build #53719 has finished for PR 11709 at commit 2e21c7a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nongli
Contributor

nongli commented Mar 22, 2016

LGTM

@marmbrus
Contributor Author

Thanks, merging to master!

asfgit closed this in 8014a51 on Mar 22, 2016
if (fileSchema.containsPath(colPath)) {
ColumnDescriptor fd = fileSchema.getColumnDescription(colPath);
if (!fd.equals(requestedSchema.getColumns().get(i))) {
throw new IOException("Schema evolution not supported.");
Contributor

It would be helpful to include fd in the exception message

roygao94 pushed a commit to roygao94/spark that referenced this pull request Mar 22, 2016
This PR implements the new `buildReader` interface for the Parquet `FileFormat`. A simple implementation of `FileScanRDD` is also included.

This code is exercised by the many existing Parquet tests.

Author: Michael Armbrust <michael@databricks.com>
Author: Sameer Agarwal <sameer@databricks.com>
Author: Nong Li <nong@databricks.com>

Closes apache#11709 from marmbrus/parquetReader.
val vectorizedReader = new VectorizedParquetRecordReader()
vectorizedReader.initialize(split, hadoopAttemptContext)
logDebug(s"Appending $partitionSchema ${file.partitionValues}")
vectorizedReader.initBatch(partitionSchema, file.partitionValues)
Contributor

@marmbrus Not quite sure about the intention of this line. Are we "reserving" column batches for partition columns here so that partition values can be filled later after data columns are fetched?

Contributor

I was wondering why we need to pass the partition schema and values to buildReader, since partitioning should have already been handled during the planning phase. Then I found they are only used here. It seems this is pretty much a use case specific to the Parquet vectorized reader.

Contributor Author

Yeah, this perhaps belongs at a higher layer (probably in FileScanRDD), but that would require us to vectorize everything or take a giant performance hit. This line is telling the vectorized reader to append the partition columns as static columns. This allows us to avoid an extra copy to append them for the optimized path.

Contributor Author

Basically, I'd just take the non-vectorized version below, put it in a utility function and use it everywhere. If we vectorize all the sources, that will be the only part we have to remove and then this can be done in FileScanRDD.

I think you do not want to do the actual partition appending in the planner as we did before, because then you can't easily have Spark partitions (splits) that read from different table partitions. That is what made the bucketing logic so convoluted in the old code path. This approach makes bucketing and collapsing small files into a single partition much simpler.
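
A minimal sketch of the utility suggested here, reusing the JoinedRow plus GenerateUnsafeProjection pattern from the non-vectorized branch of this PR; the object and function names, and their placement, are hypothetical:

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.JoinedRow
import org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection
import org.apache.spark.sql.types.StructType

object PartitionValuesUtil {
  // Hypothetical helper: append static partition values to every data row.
  def appendPartitionValues(
      rows: Iterator[InternalRow],
      dataSchema: StructType,
      partitionSchema: StructType,
      partitionValues: InternalRow): Iterator[InternalRow] = {
    val fullSchema = dataSchema.toAttributes ++ partitionSchema.toAttributes
    val joinedRow = new JoinedRow()
    val appendColumns = GenerateUnsafeProjection.generate(fullSchema, fullSchema)
    rows.map(row => appendColumns(joinedRow(row, partitionValues)))
  }
}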

Contributor

Sorry for the confusion. When I mentioned "planning phase" what I really meant was that ideally the data source implementation shouldn't care about partitioning at all. But I mixed up partition discovery and partition value appending. I agree with your comments. Thanks for the explanations.
