
[SPARK-37933][SQL] Limit push down for parquet vectorized reader #35256

Closed

Conversation

jackylee-ch (Contributor)

Why are the changes needed?

Based on 34291, we can support limit push down to the Parquet DataSource V2 reader, which lets us stop scanning Parquet early and reduce network and disk IO.
Currently, only the vectorized reader is supported in this PR. The row-based reader needs limit pushdown support in parquet-hadoop first, so it will be handled in a follow-up PR.
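For context, a minimal sketch of the kind of query the plans below come from; the path and variable names are illustrative, not the actual test code:

  // Hypothetical reproduction; assumes a Parquet dataset at an illustrative path.
  val df = spark.read.parquet("/path/to/test_push_down").limit(10)
  // With this change, the BatchScan node in the physical plan reports PushedLimit: Some(10).
  df.explain()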

Limit pushdown status in the Parquet physical plan
Before

== Physical Plan ==
CollectLimit 10
+- *(1) ColumnarToRow
   +- BatchScan[a#0, b#1] ParquetScan DataFilters: [], Format: parquet, Location: InMemoryFileIndex(1 paths)[file:/datasources.db/test_push_down/par..., PartitionFilters: [], PushedAggregation: [], PushedFilters: [], PushedGroupBy: [], ReadSchema: struct<a:int,b:int>, PushedFilters: [], PushedAggregation: [], PushedGroupBy: [] RuntimeFilters: [] 

After

== Physical Plan ==
CollectLimit 10
+- *(1) ColumnarToRow
   +- BatchScan[a#0, b#1] ParquetScan DataFilters: [], Format: parquet, Location: InMemoryFileIndex(1 paths)[file:/datasources.db/test_push_down/par..., PartitionFilters: [], PushedAggregation: [], PushedFilters: [], PushedGroupBy: [], ReadSchema: struct<a:int,b:int>, PushedFilters: [], PushedAggregation: [], PushedGroupBy: [], PushedLimit: Some(10) RuntimeFilters: [] 

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing tests and new tests.

@github-actions github-actions bot added the SQL label Jan 20, 2022
@jackylee-ch jackylee-ch changed the title [SPARK-37831][SQL] Limit push down for parquet vectorized reader [SPARK-37933][SQL] Limit push down for parquet vectorized reader Jan 20, 2022
jackylee-ch (Contributor, Author)

cc @huaxingao @cloud-fan

jackylee-ch (Contributor, Author)

also cc @sunchao

}

public VectorizedParquetRecordReader(
ZoneId convertTz,
Contributor:

4-space indentation

Contributor Author:

fixed

checkEndOfRowGroup();

int num = (int) Math.min((long) capacity, totalCountLoadedSoFar - rowsReturned);
int num = (int) Math.min((long) capacity,
Math.min((long) limit - rowsReturned, totalCountLoadedSoFar - rowsReturned));
Contributor:

indentation is off

Contributor Author:

fixed

val df = spark.read.parquet(path.getPath).limit(pushedLimit)
val sparkPlan = df.queryExecution.sparkPlan
sparkPlan foreachUp {
case r@ BatchScanExec(_, f: ParquetScan, _) =>
Contributor:

space between r and @

Contributor Author:

fixed

huaxingao (Contributor)

@stczwd Thanks for working on this! The changes look reasonable to me. I left a couple of nit comments for coding style.

sunchao (Member) left a comment:

Thanks @stczwd.

I'm curious whether this will be very useful. To my knowledge, Spark already pushes local limit to each task, see LimitExec.doExecute:

  protected override def doExecute(): RDD[InternalRow] = {
    val childRDD = child.execute()
    if (childRDD.getNumPartitions == 0) {
      new ParallelCollectionRDD(sparkContext, Seq.empty[InternalRow], 1, Map.empty)
    } else {
      val singlePartitionRDD = if (childRDD.getNumPartitions == 1) {
        childRDD
      } else {
        val locallyLimited = childRDD.mapPartitionsInternal(_.take(limit))
        new ShuffledRowRDD(
          ShuffleExchangeExec.prepareShuffleDependency(
            locallyLimited,
            child.output,
            SinglePartition,
            serializer,
            writeMetrics),
          readMetrics)
      }
      singlePartitionRDD.mapPartitionsInternal(_.take(limit))
    }
  }

Since the vectorized Parquet reader behaves like an iterator of ColumnarBatches, it will stop reading more batches when the local limit is reached.
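For illustration only (this is a toy model, not Spark's actual classes): because batches are produced lazily, a downstream take(limit) never pulls batches beyond the one that satisfies the limit.

  // Each next() "reads" one batch; nothing is read until pulled.
  val batchesRead = scala.collection.mutable.ArrayBuffer.empty[Int]
  val batchSize = 4096
  val batches = Iterator.tabulate(100) { i =>
    batchesRead += i              // stands in for the I/O of reading a batch
    Seq.fill(batchSize)(i)        // one "ColumnarBatch" worth of rows
  }
  val limited = batches.flatMap(_.iterator).take(10)  // local limit of 10 rows
  limited.foreach(_ => ())        // consume the limited iterator
  assert(batchesRead.size == 1)   // only the first batch was ever read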

Also have you done any benchmark with this feature?

this.convertTz = convertTz;
this.datetimeRebaseMode = datetimeRebaseMode;
this.datetimeRebaseTz = datetimeRebaseTz;
this.int96RebaseMode = int96RebaseMode;
this.int96RebaseTz = int96RebaseTz;
MEMORY_MODE = useOffHeap ? MemoryMode.OFF_HEAP : MemoryMode.ON_HEAP;
this.capacity = capacity;
this.limit = Integer.MAX_VALUE;
Member:

we can use:

    this(convertTz, datetimeRebaseMode, datetimeRebaseTz, int96RebaseMode, int96RebaseTz,
      useOffHeap, capacity, Integer.MAX_VALUE);

Contributor Author:

fixed

jackylee-ch (Contributor, Author)

Thanks for your reply, @sunchao.
Limit pushdown can terminate the reading of a capacity-sized batch early. It helps especially when a batch would otherwise span row groups, since it can avoid reading one extra row group.
I tested limitBenchMark in ParquetNestedSchemaPruningBenchmark. Unfortunately, it didn't show a clear advantage.
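To make that concrete, a rough sketch of the arithmetic; the numbers are illustrative and mirror the Math.min change in the diff above, not the actual reader internals:

  // Illustrative numbers only.
  val capacity = 10240L                 // rows per vectorized batch
  val limit = 12500L                    // pushed-down limit
  val rowsReturned = 10240L             // rows already produced by earlier batches
  val totalCountLoadedSoFar = 20480L    // rows available in the row groups loaded so far

  // Without the limit, the reader fills a whole batch if rows are available.
  val numWithoutLimit = math.min(capacity, totalCountLoadedSoFar - rowsReturned)   // 10240

  // With the pushed limit, the final batch is capped at the rows still needed,
  // which can also avoid loading the next row group at all.
  val numWithLimit = math.min(capacity,
    math.min(limit - rowsReturned, totalCountLoadedSoFar - rowsReturned))          // 2260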

c21 (Contributor) commented Jan 21, 2022

I have the same concern as @sunchao. IMO the improvement might not be significant, since we already apply the limit to avoid reading extra batches during execution. On the other hand, the DSv2 ORC vectorized reader does not have the same control over skipping row groups as the Parquet reader, so limit push down cannot be implemented for the ORC reader in the future. And here we have to introduce code complexity in the common interface FileScanBuilder. I feel this is overkill unless we can showcase an obvious improvement on the current benchmark.

jackylee-ch (Contributor, Author)

Thanks for your reply, @c21.
Sorry, my last test didn't adjust the limit amount, so the limit didn't actually take effect.
This time I again tested limitBenchMark in ParquetNestedSchemaPruningBenchmark, but set the batch capacity to 10240 and the limit to 12500. Now the performance improved by 1.3x.

OpenJDK 64-Bit Server VM 1.8.0_322-b06 on Linux 5.11.0-1025-azure
Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
Limiting:                                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Top-level column with out limit                      83            101          21         12.1          82.7       1.0X
Nested column with out limit                         77             95          11         12.9          77.4       1.1X
Nested column in array with out limit               105            121          19          9.5         105.1       0.8X
Top-level column with limit                          61             69           5         16.3          61.3       1.3X
Nested column with limit                             66             73           7         15.2          65.8       1.3X
Nested column in array with limit                   101            113          20          9.9         101.2       0.8X

jackylee-ch (Contributor, Author) commented Jan 24, 2022

In essence, the improvement of this PR is to reduce the amount of data read for the final batch, especially in the following scenarios.

  1. The capacity is set relatively large, but with the limit we only need to read a small final batch.
  2. The last batch would otherwise span 2 row groups, but with the limit we only need to read one.

Based on the above performance improvement, I personally think it is still worthwhile, but it depends on your opinions, thank you.
@c21 @sunchao @cloud-fan @huaxingao

LuciferYang (Contributor)

Can you resolve the conflict first, @stczwd?

github-actions (bot)

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Jun 13, 2022
@github-actions github-actions bot closed this Jun 14, 2022