[SPARK-13255][SQL] Integrate vectorized parquet scanner with whole stage codegen.#11146

Closed
nongli wants to merge 1 commit into apache:master from nongli:spark-13255

@nongli
Contributor

@nongli nongli commented Feb 9, 2016

This patch integrates the vectorized Parquet scanner with whole-stage codegen so that
the codegen'ed function runs over batches of rows. This removes all of the per-row
iterator calls for the code paths that support this. Unfortunately, this patch combines
a few different things.

  1. Refactor some of SqlNewHadoopRDD/ParquetRelation to produce RDDs of ColumnarBatch
    in addition to RDDs of InternalRow. This adds a new level of class hierarchy to minimize
    code duplication. This could use further refactoring so that the low-level
    components are not required to return RDDs.
  2. Minor refactoring of WholeStageCodegen to handle leaf operators that produce batches.
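
To illustrate the difference between the two execution styles, here is a minimal Java sketch. It is not code from this patch: `BatchScanSketch`, `IntBatch`, `sumPerRow`, and `sumPerBatch` are hypothetical names, and `IntBatch` is only a simplified stand-in for a ColumnarBatch with a single int column. The row-at-a-time version pays one `hasNext()`/`next()` pair per row, while the batch version pays it once per batch and then runs a tight loop over the column.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class BatchScanSketch {
    /** Hypothetical stand-in for a columnar batch: one int column plus a row count. */
    static final class IntBatch {
        final int[] values;
        final int numRows;

        IntBatch(int[] values) {
            this.values = values;
            this.numRows = values.length;
        }
    }

    /** Row-at-a-time: one hasNext()/next() pair for every row. */
    static long sumPerRow(Iterator<Integer> rows) {
        long sum = 0;
        while (rows.hasNext()) {
            sum += rows.next();
        }
        return sum;
    }

    /** Batch-at-a-time: one iterator call per batch, then a tight loop over the column. */
    static long sumPerBatch(Iterator<IntBatch> batches) {
        long sum = 0;
        while (batches.hasNext()) {
            IntBatch batch = batches.next();
            for (int i = 0; i < batch.numRows; i++) {
                sum += batch.values[i];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Integer> rows = Arrays.asList(1, 2, 3, 4, 5, 6);
        List<IntBatch> batches = Arrays.asList(
            new IntBatch(new int[] {1, 2, 3}),
            new IntBatch(new int[] {4, 5, 6}));
        // Both styles compute the same result; only the number of iterator calls differs.
        System.out.println(sumPerRow(rows.iterator()));
        System.out.println(sumPerBatch(batches.iterator()));
    }
}
```

In the sketch, six rows cost six `next()` calls in the first method but only two in the second; the inner `for` loop is the part that a codegen'ed function would take over in the batch case.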

@SparkQA

SparkQA commented Feb 10, 2016

Test build #51007 has finished for PR 11146 at commit 5717783.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class HadoopIterator(theSplit: SparkPartition, context: TaskContext)

@hvanhovell
Contributor

@nongli is there some kind of design doc on the envisaged use of ColumnarBatches? I am curious whether we are also going to use them for joins (and kill JoinedRow in the process), aggregates, and other operators.

@kiszk
Member

kiszk commented Feb 10, 2016

@nongli Is there some kind of design doc on the ColumnarBatch? I am planning to make PRs for columnar storage and its computations with DataFrame/Dataset.
We are curious whether ColumnarBatch and ColumnVector are designed only for Parquet or for general use.

@nongli
Contributor Author

nongli commented Feb 11, 2016

I'm going to close this since it might not be necessary and has some other issues. I'll address your comments in follow-ups.

@nongli nongli closed this Feb 11, 2016