[SPARK-13255][SQL] Integrate vectorized parquet scanner with whole stage codegen. #11146
Closed
nongli wants to merge 1 commit into apache:master from
…age codegen. This patch integrates these two so that the codegen'ed function runs over batches of rows, which removes all per-row iterator calls on the code paths that support it. Unfortunately, this patch combines a few different things:
1. Refactor some of SqlNewHadoopRDD/ParquetRelation to produce RDDs of ColumnarBatch in addition to RDDs of InternalRow. This adds a new level of class hierarchy to minimize code duplication. This could use further refactoring so that the low-level components are not required to return RDDs.
2. Minor refactoring of WholeStageCodegen to handle leaf operators that produce batches.
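To illustrate the per-row versus per-batch distinction described above, here is a minimal, self-contained sketch. It is not the code from this patch: `BatchScanSketch`, `IntColumnBatch`, `sumPerRow`, and `sumPerBatch` are hypothetical names made up for illustration, and the batch type is a plain array wrapper rather than Spark's ColumnarBatch. It only shows the general idea that consuming batches replaces one iterator call per row with one iterator call per batch plus a tight inner loop.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class BatchScanSketch {

    /** Hypothetical stand-in for a columnar batch: a single int column and a row count. */
    static final class IntColumnBatch {
        final int[] values;
        final int numRows;

        IntColumnBatch(int[] values) {
            this.values = values;
            this.numRows = values.length;
        }
    }

    /** Row-at-a-time consumer: one hasNext()/next() pair per row. */
    static long sumPerRow(Iterator<Integer> rows) {
        long sum = 0;
        while (rows.hasNext()) {
            sum += rows.next();
        }
        return sum;
    }

    /** Batch-at-a-time consumer: one hasNext()/next() pair per batch, then a tight loop. */
    static long sumPerBatch(Iterator<IntColumnBatch> batches) {
        long sum = 0;
        while (batches.hasNext()) {
            IntColumnBatch batch = batches.next();
            for (int i = 0; i < batch.numRows; i++) {
                sum += batch.values[i];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Integer> rows = Arrays.asList(1, 2, 3, 4, 5);
        List<IntColumnBatch> batches = Arrays.asList(
            new IntColumnBatch(new int[] {1, 2, 3}),
            new IntColumnBatch(new int[] {4, 5}));

        // Both consumers see the same data and should agree on the result.
        System.out.println(sumPerRow(rows.iterator()));
        System.out.println(sumPerBatch(batches.iterator()));
    }
}
```

In this sketch the five rows cost five iterator round trips in `sumPerRow` but only two in `sumPerBatch`; the rest of the work happens in a plain array loop.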
Test build #51007 has finished for PR 11146 at commit
Contributor
@nongli is there some kind of design doc on the envisaged use of ColumnarBatches? I am curious whether we are also going to use them for joins (and kill JoinedRow in the process), aggregates, and other operators.
Member
@nongli Is there some kind of design doc on ColumnarBatch? I am planning to make PRs for columnar storage and its computations with DataFrame/Dataset.
Contributor
Author
I'm going to close this since it might not be necessary and has some other issues. I'll address your comments in follow-ups.