
[SPARK-34859][SQL] Handle column index when using vectorized Parquet reader #32753

Closed
sunchao wants to merge 12 commits

Conversation

@sunchao (Member) commented Jun 2, 2021

What changes were proposed in this pull request?

Make the current vectorized Parquet reader work with the column index feature introduced in Parquet 1.11. In particular, this PR makes the following changes:

  1. in ParquetReadState, track the row ranges returned via PageReadStore.getRowIndexes as well as the first row index of each page via DataPage.getFirstRowIndex.
  2. introduced a new API ParquetVectorUpdater.skipValues which skips a batch of values from a Parquet value reader. As part of the process, also renamed the existing updateBatch to readValues and update to readValue to keep the method names consistent.
  3. correspondingly, introduced new VectorizedValuesReader.skipXXX APIs for the different data types, along with their implementations. These are useful when the reader knows that a given batch of values can be skipped, for instance because the batch is not covered by the row ranges generated by column index filtering.
  4. changed VectorizedRleValuesReader to handle column index filtering. This is done by comparing the range that is going to be read next within the current RLE/PACKED block (let's call this the block range) against the current row range. There are three cases (see the sketch after this list):
    • if the block range is entirely before the current row range, skip all the values in the block range;
    • if the block range is entirely after the current row range, advance the row range and repeat the comparison;
    • if the block range overlaps with the current row range, read only the values within the overlapping area and skip the rest.
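
A minimal sketch of the three-way comparison in item 4, under the assumption that helpers like skipValues/readValues and the range accessors exist; all names here are illustrative placeholders, not the PR's exact identifiers:

// Hedged sketch of handling one RLE/PACKED block against the current row range.
while (valuesLeftInBlock > 0 && leftInBatch > 0) {
  int n = Math.min(leftInBatch, valuesLeftInBlock);
  long blockEnd = rowId + n - 1;
  if (blockEnd < currentRange.start) {
    // Case 1: block range entirely before the row range -> skip everything in it.
    skipValues(n);
    rowId += n;
    valuesLeftInBlock -= n;
  } else if (rowId > currentRange.end) {
    // Case 2: block range entirely after the row range -> advance to the next
    // row range and compare again.
    currentRange = nextRange();
  } else {
    // Case 3: overlap -> skip the values before the overlap, read the overlap.
    long start = Math.max(currentRange.start, rowId);
    long end = Math.min(currentRange.end, blockEnd);
    skipValues((int) (start - rowId));
    readValues((int) (end - start + 1));
    leftInBatch -= (int) (end - start + 1);       // only read values fill the batch
    valuesLeftInBlock -= (int) (end - rowId + 1); // skipped + read values consumed
    rowId = end + 1;
  }
}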

Why are the changes needed?

Parquet Column Index is a new feature in Parquet 1.11 which allows very efficient filtering at the page level (some benchmark numbers can be found here), especially when data is sorted. The feature is largely implemented in parquet-mr (via classes such as ColumnIndex and ColumnIndexFilter). In Spark, the non-vectorized Parquet reader automatically benefits from the feature after upgrading to Parquet 1.11.x, without any code change. However, the same is not true for the vectorized Parquet reader, since Spark implements its own logic for reading Parquet pages, handling definition levels, reading values into columnar batches, and so on.

Previously, SPARK-26345 / (#31393) updated Spark to only scan pages filtered by column index on the parquet-mr side. This is done by calling the ParquetFileReader.readNextFilteredRowGroup and ParquetFileReader.getFilteredRecordCount APIs. The implementation, however, only works in a few limited cases: in the scenario where there are multiple columns whose type widths differ (e.g., int and bigint), it could return incorrect results. For this issue, please see SPARK-34859 for a detailed description.

In order to fix the above, Spark needs to leverage the APIs PageReadStore.getRowIndexes and DataPage.getFirstRowIndex. The former returns the indexes of all rows remaining after filtering within a Parquet row group (note the difference between rows and values: for a flat schema there is no difference between the two, but for a nested schema they differ). The latter returns the first row index within a single data page. With the combination of the two, one is able to know which rows/values should be filtered while scanning a Parquet page.
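
For illustration, a hedged sketch of how these two parquet-mr APIs might be consulted when starting to read a filtered row group; the surrounding class and method names are assumptions, not the PR's code:

import java.util.PrimitiveIterator;
import org.apache.parquet.column.page.DataPage;
import org.apache.parquet.column.page.PageReadStore;

class RowIndexProbe {
  void inspect(PageReadStore rowGroup, DataPage page) {
    // Indexes of the rows that survive column index filtering in this row
    // group; an empty Optional means no filtering information is available.
    PrimitiveIterator.OfLong rowIndexes = rowGroup.getRowIndexes().orElse(null);

    // Index of the first row stored in this page, relative to the row group;
    // defaults to 0 when the writer recorded no offset index information.
    long firstRowIndex = page.getFirstRowIndex().orElse(0L);
  }
}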

Does this PR introduce any user-facing change?

Yes. Now the vectorized Parquet reader should work correctly with column index.

How was this patch tested?

Borrowed tests from #31998 and added a few more tests.

@SparkQA commented Jun 2, 2021

Test build #139237 has finished for PR 32753 at commit 89f90e9.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • final class ParquetReadState

@SparkQA commented Jun 2, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43760/

@SparkQA commented Jun 2, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43760/

@sunchao force-pushed the SPARK-34859 branch 2 times, most recently from f4ce616 to 64cfc59 on June 16, 2021 17:37
@SparkQA commented Jun 16, 2021

Test build #139887 has started for PR 32753 at commit 64cfc59.

@SparkQA commented Jun 16, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44417/

@SparkQA commented Jun 16, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44417/

@dongjoon-hyun (Member)

Thank you, @sunchao !

@SparkQA commented Jun 16, 2021

Test build #139891 has finished for PR 32753 at commit 9cfc00d.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 16, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44421/

@SparkQA commented Jun 17, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44421/

@SparkQA commented Jun 17, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44472/

@SparkQA commented Jun 17, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44472/

@sunchao (Member, Author) commented Jun 17, 2021

cc @viirya @dongjoon-hyun @cloud-fan @lxian @gszadovszky @shangxinli

@sunchao changed the title from "[WIP][SPARK-34859][SQL] Handle column index when using vectorized Parquet reader" to "[SPARK-34859][SQL] Handle column index when using vectorized Parquet reader" on Jun 17, 2021
@dongjoon-hyun (Member)

Thank you for the updates, @sunchao!

@SparkQA commented Jun 17, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44476/

@SparkQA commented Jun 17, 2021

Test build #139952 has finished for PR 32753 at commit e11e47f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 17, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44479/

@SparkQA commented Jun 17, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44478/

@SparkQA commented Jun 17, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44476/

@SparkQA commented Jun 17, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44479/

@SparkQA commented Jun 17, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44478/


/**
* Called at the beginning of reading a new batch.
* Construct a list of row ranges from the given `rowIndexes`. For example, suppose the
* `rowIndexes` are `[0, 1, 2, 4, 5, 7, 8, 9]`, it will be converted into 3 row ranges:
@cloud-fan (Contributor) commented Jun 28, 2021

interesting, does the parquet reader lib give you a big array containing these indexes, or it uses an algorithm to generate the indexes on the fly?

Contributor:

And how fast or slow is the parquet reader lib at generating the indexes?

@sunchao (Member, Author):

It gives you an iterator, so yes, it generates them on the fly: https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/internal/filter2/columnindex/RowRanges.java#L253. The indexes are generated from Range, which is very similar to what we defined here. I'm planning to file a JIRA in parquet-mr to just return the original Ranges so we don't have to do this step in Spark.
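
For concreteness, a hedged sketch of the conversion described in the javadoc quoted above, turning the sorted row indexes [0, 1, 2, 4, 5, 7, 8, 9] into the ranges [0, 2], [4, 5], [7, 9]; this is an illustration, not the PR's constructRanges verbatim:

import java.util.ArrayList;
import java.util.List;
import java.util.PrimitiveIterator;

static List<long[]> constructRanges(PrimitiveIterator.OfLong rowIndexes) {
  List<long[]> ranges = new ArrayList<>();
  long start = -1, end = -1;
  while (rowIndexes.hasNext()) {
    long idx = rowIndexes.nextLong();
    if (start < 0) {
      start = end = idx;                    // first index opens a range
    } else if (idx == end + 1) {
      end = idx;                            // contiguous: extend the range
    } else {
      ranges.add(new long[] {start, end});  // gap: close the current range
      start = end = idx;
    }
  }
  if (start >= 0) ranges.add(new long[] {start, end});
  return ranges;
}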

Comment on lines 123 to +124
  valuesToReadInBatch -= (newOffset - offset);
- valuesToReadInPage -= (newOffset - offset);
+ valuesToReadInPage -= (newRowId - rowId);
Member:
Should we assert newOffset - offset == newRowId - rowId?

@sunchao (Member, Author):

I think this is not necessarily true: rowId tracks all the values that could be either read or skipped, while offset only tracks values that are read into the result column vector.
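
In other words, a hedged illustration of the two counters, with names following the discussion rather than the PR's exact code:

// rowId counts every value consumed from the page, read or skipped;
// offset counts only the values materialized into the column vector.
long rowId = 0;
int offset = 0;

// 100 values filtered out by column index: only rowId advances.
rowId += 100;

// 50 values actually read into the vector: both counters advance.
rowId += 50;
offset += 50;

// Hence the deltas need not match: rowId advanced by 150 while offset
// advanced by 50, so the proposed assertion would not hold in general.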


// Read and decode dictionary ids.
defColumn.readIntegers(readState, dictionaryIds, column,
(VectorizedValuesReader) dataColumn);

// TIMESTAMP_MILLIS encoded as INT64 can't be lazily decoded as we need to post process
// the values to add microseconds precision.
- if (column.hasDictionary() || (startOffset == 0 && isLazyDecodingSupported(typeName))) {
+ if (column.hasDictionary() || (startRowId == pageFirstRowIndex &&
Member:
Hmm, based on the comment (rowId != 0) below, do we need to update it?

@sunchao (Member, Author):
Oh yeah, I need to update the comment too

Comment on lines 309 to 310
private int readPageV2(DataPageV2 page) throws IOException {
this.pageFirstRowIndex = page.getFirstRowIndex().orElse(0L);
Member:
Maybe move to readPage? Looks like readPageV1 and readPageV2 both need it.

@sunchao (Member, Author):
Good point. Let me do that.

for (int i = 0; i < n; ++i) {
  if (currentBuffer[currentBufferIdx++] == state.maxDefinitionLevel) {
    updater.update(offset + i, values, valueReader);

int n = Math.min(leftInBatch, Math.min(leftInPage, this.currentCount));
Member:
This is different to what I commented (#32753 (comment)) before. This looks more straightforward.

@viirya (Member) left a comment

A few comments. Looks good otherwise.

@SparkQA commented Jun 29, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44905/

@SparkQA commented Jun 30, 2021

Test build #140382 has finished for PR 32753 at commit 8396b5f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

final int maxDefinitionLevel;

/** The current index overall all rows within the column chunk. This is used to check if the
Member:
Maybe, overall all -> over all?

this.maxDefinitionLevel = maxDefinitionLevel;
this.rowRanges = rowIndexes == null ? null : constructRanges(rowIndexes);
@dongjoon-hyun (Member) commented Jun 30, 2021

If we move the rowIndexes == null ? null ... part into the constructRanges method, that would be better because it would protect both this constructor and constructRanges at the same time. For now, constructRanges seems to assume that the argument is not null, but there is no assertion for that. WDYT?

@sunchao (Member, Author):
Yes that's a fair point. Will do.

* @param total total number of values to skip
* @param valuesReader reader to skip values from
*/
void skipValues(int total, VectorizedValuesReader valuesReader);
Member:
Since this is renamed, please update the following PR description accordingly.

introduced a new API ParquetVectorUpdater.skipBatch which skips a batch of values from a Parquet value reader.

And, maybe, we had better update line 40 in this file.

Skip a batch of ...

@sunchao (Member, Author):

Updated the PR description. Regarding the comment, IMO even though the method name is changed, the comment is still accurate in expressing what the method does.


// skip the part [rowId, start)
int toSkip = (int) (start - rowId);
if (toSkip > 0) {
@dongjoon-hyun (Member) commented Jun 30, 2021

Just a question. We may have two negative value cases.

  1. start < rowId
  2. (start - rowId) > Int.MaxValue

Are we considering both? Or, there is no change for case (2)?

@dongjoon-hyun (Member) commented Jun 30, 2021

To be safe, I'd recommend moving the type cast (int) inside this if statement. For the if (toSkip > 0) check, we had better use long. If the ranges are protected by line 191 ~ 192, then ignore this comment.

@sunchao (Member, Author):

start must be >= rowId because it is defined as long start = Math.max(rangeStart, rowId). Therefore, case 1 (start < rowId) will never happen.

The second case, (start - rowId) > Int.MaxValue, could only occur if start were equal to rangeStart. In that case we also know that rangeStart <= rowId + n (from line 183), and n is Math.min(leftInBatch, Math.min(leftInPage, this.currentCount)), which is guaranteed to be within integer range. Therefore, the cast is safe.
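
Restating that bound as a hedged sketch (variable names follow the discussion above, not the PR's code verbatim):

// start >= rowId by construction:
long start = Math.max(rangeStart, rowId);

// If start == rowId, then start - rowId == 0.
// Otherwise start == rangeStart, and the loop guarantees rangeStart <= rowId + n
// where n <= Integer.MAX_VALUE, so start - rowId <= n fits in an int.
int toSkip = (int) (start - rowId);  // safe narrowing under the bound above
if (toSkip > 0) {
  // skip the part [rowId, start)
}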

@dongjoon-hyun (Member) left a comment

+1, LGTM from my side (with minor questions).

@SparkQA commented Jun 30, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44973/

@SparkQA commented Jun 30, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44973/

@dongjoon-hyun (Member)

Merged to master for Apache Spark 3.2.0. Thank you, @sunchao, @viirya, @cloud-fan.

cc @aokolnychyi , @RussellSpitzer

@dongjoon-hyun (Member)

Also, cc @gengliangwang since this was the release blocker for Apache Spark 3.2.0.

@SparkQA commented Jun 30, 2021

Test build #140459 has finished for PR 32753 at commit 6541d99.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
