[SPARK-37933][SQL] Limit push down for parquet vectorized reader #35256
Conversation
also cc @sunchao
}

public VectorizedParquetRecordReader(
    ZoneId convertTz,
4-space indentation
fixed
checkEndOfRowGroup();

-    int num = (int) Math.min((long) capacity, totalCountLoadedSoFar - rowsReturned);
+    int num = (int) Math.min((long) capacity,
+      Math.min((long) limit - rowsReturned, totalCountLoadedSoFar - rowsReturned));
indentation is off
fixed
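To make the effect of the new computation concrete, here is a minimal, self-contained sketch with assumed values (the `capacity`, `limit`, and counter variables are stand-ins for the reader's fields, not taken from the PR):

```java
// Standalone illustration of the limit-aware batch sizing above;
// all values are assumed for the example.
public class BatchSizeExample {
  public static void main(String[] args) {
    int capacity = 4096;                   // rows per ColumnarBatch
    long limit = 10;                       // pushed-down LIMIT
    long rowsReturned = 0;                 // rows already handed back to Spark
    long totalCountLoadedSoFar = 100_000;  // rows available in loaded row groups

    // Without the limit term, num would be 4096; with it, only 10 rows
    // are materialized in the final batch.
    int num = (int) Math.min((long) capacity,
        Math.min(limit - rowsReturned, totalCountLoadedSoFar - rowsReturned));
    System.out.println(num); // prints 10
  }
}
```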
val df = spark.read.parquet(path.getPath).limit(pushedLimit)
val sparkPlan = df.queryExecution.sparkPlan
sparkPlan foreachUp {
  case r@ BatchScanExec(_, f: ParquetScan, _) =>
space between r and @
fixed
@stczwd Thanks for working on this! The changes look reasonable to me. I left a couple of nit comments on coding style.
Thanks @stczwd.
I'm curious whether this will be very useful. To my knowledge, Spark already pushes the local limit to each task, see LimitExec.doExecute:
protected override def doExecute(): RDD[InternalRow] = {
  val childRDD = child.execute()
  if (childRDD.getNumPartitions == 0) {
    new ParallelCollectionRDD(sparkContext, Seq.empty[InternalRow], 1, Map.empty)
  } else {
    val singlePartitionRDD = if (childRDD.getNumPartitions == 1) {
      childRDD
    } else {
      val locallyLimited = childRDD.mapPartitionsInternal(_.take(limit))
      new ShuffledRowRDD(
        ShuffleExchangeExec.prepareShuffleDependency(
          locallyLimited,
          child.output,
          SinglePartition,
          serializer,
          writeMetrics),
        readMetrics)
    }
    singlePartitionRDD.mapPartitionsInternal(_.take(limit))
  }
}
Since the vectorized Parquet reader behaves like an iterator of ColumnarBatches, it will stop reading more batches when the local limit is reached.
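To illustrate the point with a hedged sketch (plain Java with a toy batch type, not Spark's actual ColumnarBatch or reader API): a consumer that takes rows up to a local limit simply stops calling next(), so a lazy batch iterator never loads the remaining batches.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration: consuming batches up to a local limit.
class LimitedBatchConsumer {
  // Each "batch" is just a list of rows for this sketch.
  static int consumeUpToLimit(Iterator<List<String>> batches, int limit) {
    int rowsReturned = 0;
    while (rowsReturned < limit && batches.hasNext()) {
      List<String> batch = batches.next(); // the reader loads one batch lazily
      int take = Math.min(limit - rowsReturned, batch.size());
      rowsReturned += take; // once the limit is hit, no further batch is requested
    }
    return rowsReturned;
  }
}
```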
Also, have you done any benchmarks with this feature?
this.convertTz = convertTz;
this.datetimeRebaseMode = datetimeRebaseMode;
this.datetimeRebaseTz = datetimeRebaseTz;
this.int96RebaseMode = int96RebaseMode;
this.int96RebaseTz = int96RebaseTz;
MEMORY_MODE = useOffHeap ? MemoryMode.OFF_HEAP : MemoryMode.ON_HEAP;
this.capacity = capacity;
this.limit = Integer.MAX_VALUE;
we can use:
this(convertTz, datetimeRebaseMode, datetimeRebaseTz, int96RebaseMode, int96RebaseTz,
  useOffHeap, capacity, Integer.MAX_VALUE);
fixed
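For reference, a minimal sketch of the constructor-chaining pattern suggested above (a hypothetical class, not the actual VectorizedParquetRecordReader):

```java
// Hypothetical sketch: the no-limit constructor delegates to the full one
// with Integer.MAX_VALUE, which effectively means "no limit pushed".
class Reader {
  private final int capacity;
  private final int limit;

  Reader(int capacity) {
    this(capacity, Integer.MAX_VALUE); // no limit pushed: read everything
  }

  Reader(int capacity, int limit) {
    this.capacity = capacity;
    this.limit = limit;
  }
}
```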
Thanks for your reply, @sunchao.
I have the same concern as @sunchao. IMO the improvement might not be significant, as we already have a limit that avoids reading extra batches during execution. On the other hand, the DSv2 ORC vectorized reader has no control over skipping row groups the way the Parquet reader does, so limit push down cannot be implemented for the ORC reader in the future. And here we have to introduce code complexity in the common interface
Thanks for your reply @c21.

In essence, the improvement of this PR is to reduce the amount of data read for the final batch, especially in the following scenarios. For example, with the default 4096-row batch capacity and limit(10), the last batch materializes only 10 rows instead of 4096. Based on the above performance improvement, I personally think it is still necessary, but it depends on your opinions. Thank you.
Can you resolve the conflict first, @stczwd?
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
Why are the changes needed?
Based on #34291, we can support limit push down to the Parquet DataSource V2 reader, which can stop scanning Parquet early and reduce network and disk I/O.
Currently, only the vectorized reader is supported in this PR. Limit pushdown for the row-based reader needs to be supported in parquet-hadoop first, so it will be added in a follow-up PR.
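As a hedged, self-contained sketch of what limit push down means at the scan level (hypothetical names, not the actual Spark connector API): the planner offers the limit to the scan builder, and the reader then uses it to cap batch sizes and stop early.

```java
// Hypothetical sketch of a scan builder accepting a pushed-down limit.
// Names are illustrative; the real DSv2 interfaces differ.
class LimitPushdownSketch {
  private int pushedLimit = Integer.MAX_VALUE; // MAX_VALUE means "no limit"

  // The planner calls this when a LIMIT sits directly above the scan.
  // Returning true signals that the source will honor the limit.
  boolean pushLimit(int limit) {
    this.pushedLimit = limit;
    return true;
  }

  // The reader caps every batch at the remaining limit and stops once
  // pushedLimit rows have been returned, skipping later row groups.
  int rowsToRead(int capacity, long rowsReturned, long rowsAvailable) {
    return (int) Math.min((long) capacity,
        Math.min((long) pushedLimit - rowsReturned, rowsAvailable - rowsReturned));
  }
}
```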
Limit parse status for parquet: before and after (plan output omitted).
Does this PR introduce any user-facing change?
No
How was this patch tested?
Existing tests and new tests.