[SPARK-44805][SQL] getBytes/getShorts/getInts/etc. should work in a column vector that has a dictionary #42850

bersprockets · 2023-09-07T15:47:07Z

What changes were proposed in this pull request?

Change getBytes/getShorts/getInts/getLongs/getFloats/getDoubles in OnHeapColumnVector and OffHeapColumnVector to use the dictionary, if present.

Why are the changes needed?

The following query gets incorrect results:

drop table if exists t1;

create table t1 using parquet as
select * from values
(named_struct('f1', array(1, 2, 3), 'f2', array(1, 1, 2)))
as (value);

select cast(value as struct<f1:array<double>,f2:array<int>>) AS value from t1;

{"f1":[1.0,2.0,3.0],"f2":[0,0,0]}

The result should be:

{"f1":[1.0,2.0,3.0],"f2":[1,1,2]}

The cast operation copies the second array by calling ColumnarArray#copy, which in turn calls ColumnarArray#toIntArray, which in turn calls ColumnVector#getInts on the underlying column vector (which is either an OnHeapColumnVector or an OffHeapColumnVector). The implementation of getInts in either concrete class assumes there is no dictionary and does not use it if it is present (in fact, it even asserts that there is no dictionary). However, in the above example, the column vector associated with the second array does have a dictionary:

java -cp ~/github/parquet-mr/parquet-tools/target/parquet-tools-1.10.1.jar org.apache.parquet.tools.Main meta ./spark-warehouse/t1/part-00000-122fdd53-8166-407b-aec5-08e0c2845c3d-c000.snappy.parquet
...
row group 1: RC:1 TS:112 OFFSET:4 
-------------------------------------------------------------------------------------------------------------------------------------------------------
value:       
.f1:         
..list:      
...element:   INT32 SNAPPY DO:0 FPO:4 SZ:47/47/1.00 VC:3 ENC:RLE,PLAIN ST:[min: 1, max: 3, num_nulls: 0]
.f2:         
..list:      
...element:   INT32 SNAPPY DO:51 FPO:80 SZ:69/65/0.94 VC:3 ENC:RLE,PLAIN_DICTIONARY ST:[min: 1, max: 2, num_nulls: 0]

The same bug also occurs when field f2 is a map. This PR fixes that case as well.
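To make the failure mode concrete, here is a minimal sketch of the behaviour described above. It is a simplified model, not Spark's code: `TinyIntVector`, `dictionaryEncoded`, and `getIntsIgnoringDictionary` are hypothetical names, and the assumption that the plain data array stays zero-filled when a dictionary is in use is made only to mirror the `[0,0,0]` output from the query.

```java
import java.util.Arrays;

/** Simplified stand-in for a dictionary-aware int column vector (hypothetical, not Spark's class). */
final class TinyIntVector {
    private final int[] data;          // plain values; assumed to stay zero-filled when dictionary-encoded
    private final int[] dictionaryIds; // per-row index into the dictionary, or null
    private final int[] dictionary;    // distinct values, or null when there is no dictionary

    private TinyIntVector(int[] data, int[] dictionaryIds, int[] dictionary) {
        this.data = data;
        this.dictionaryIds = dictionaryIds;
        this.dictionary = dictionary;
    }

    static TinyIntVector plain(int[] values) {
        return new TinyIntVector(values.clone(), null, null);
    }

    static TinyIntVector dictionaryEncoded(int[] dictionary, int[] ids) {
        return new TinyIntVector(new int[ids.length], ids.clone(), dictionary.clone());
    }

    /** Single-value read: already dictionary-aware. */
    int getInt(int rowId) {
        return dictionary == null ? data[rowId] : dictionary[dictionaryIds[rowId]];
    }

    /** Models the buggy bulk read: copies the plain data and never consults the dictionary. */
    int[] getIntsIgnoringDictionary(int rowId, int count) {
        return Arrays.copyOfRange(data, rowId, rowId + count);
    }

    /** Models the fixed bulk read: decodes through the dictionary when one is present. */
    int[] getInts(int rowId, int count) {
        if (dictionary == null) {
            return Arrays.copyOfRange(data, rowId, rowId + count);
        }
        int[] out = new int[count];
        for (int i = 0; i < count; i++) {
            out[i] = getInt(rowId + i);
        }
        return out;
    }
}

class DictionaryDemo {
    public static void main(String[] args) {
        // array(1, 1, 2) modelled as dictionary [1, 2] plus ids [0, 0, 1]
        TinyIntVector f2 = TinyIntVector.dictionaryEncoded(new int[] {1, 2}, new int[] {0, 0, 1});
        System.out.println(Arrays.toString(f2.getIntsIgnoringDictionary(0, 3))); // [0, 0, 0]
        System.out.println(Arrays.toString(f2.getInts(0, 3)));                   // [1, 1, 2]
    }
}
```

In this model the single-value getter was always correct; only the bulk getter skipped the dictionary, which is why the problem surfaces on a path such as `ColumnarArray#copy` that reads a whole array at once.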

Does this PR introduce any user-facing change?

No, except for fixing the correctness issue.

How was this patch tested?

New tests.

Was this patch authored or co-authored using generative AI tooling?

No.


HyukjinKwon

Making sense to me .. but cc @cloud-fan FYI

...re/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala

wangyum · 2023-09-08T14:31:54Z

cc @cloud-fan

dongjoon-hyun

+1, LGTM.

dongjoon-hyun · 2023-09-08T19:54:39Z

cc @sunchao, @viirya , too

…olumn vector that has a dictionary

Closes #42850 from bersprockets/vector_oddity.

Authored-by: Bruce Robbins <bersprockets@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit fac236e)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

dongjoon-hyun · 2023-09-08T20:11:55Z

Merged to master/3.5/3.4/3.3.

viirya

Good catch!

sunchao · 2023-09-08T20:21:31Z

late LGTM, thanks @bersprockets !

…ector`

### What changes were proposed in this pull request?

This is a small followup of #42850. `getBytes` checks whether the `dictionary` is null, then calls `getByte`, which performs the same null check again. This PR avoids the repeated checks by copying one line of code from `getByte` into `getBytes`. The same applies to the other `getXXX` methods.

### Why are the changes needed?

Make the perf-critical path more efficient.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #42903 from cloud-fan/vector.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
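The follow-up's idea of checking the dictionary once per bulk call, instead of once per element, might look roughly like the sketch below. This is an illustration under stated assumptions rather than the actual patch: `TinyByteVector` and `getBytesWithRepeatedChecks` are hypothetical names, and dictionary values are modelled as plain `int`s that are narrowed to `byte`.

```java
import java.util.Arrays;

/** Hypothetical simplified byte vector; names are illustrative, not Spark's API. */
final class TinyByteVector {
    private final byte[] data;         // plain values, used when there is no dictionary
    private final int[] dictionaryIds; // per-row index into the dictionary, or null
    private final int[] dictionary;    // dictionary values (modelled as ints), or null

    TinyByteVector(byte[] data, int[] dictionaryIds, int[] dictionary) {
        this.data = data;
        this.dictionaryIds = dictionaryIds;
        this.dictionary = dictionary;
    }

    /** Single-value read with its own null check. */
    byte getByte(int rowId) {
        if (dictionary == null) {
            return data[rowId];
        }
        return (byte) dictionary[dictionaryIds[rowId]];
    }

    /** Before: one null check here, plus another inside every getByte call. */
    byte[] getBytesWithRepeatedChecks(int rowId, int count) {
        byte[] out = new byte[count];
        if (dictionary == null) {
            System.arraycopy(data, rowId, out, 0, count);
        } else {
            for (int i = 0; i < count; i++) {
                out[i] = getByte(rowId + i);
            }
        }
        return out;
    }

    /** After: the dictionary branch of getByte is inlined, so the check runs once per call. */
    byte[] getBytes(int rowId, int count) {
        byte[] out = new byte[count];
        if (dictionary == null) {
            System.arraycopy(data, rowId, out, 0, count);
        } else {
            for (int i = 0; i < count; i++) {
                out[i] = (byte) dictionary[dictionaryIds[rowId + i]];
            }
        }
        return out;
    }
}

class RepeatedCheckDemo {
    public static void main(String[] args) {
        TinyByteVector v = new TinyByteVector(new byte[3], new int[] {1, 0, 1}, new int[] {7, 9});
        // Both variants return the same values; only the number of null checks differs.
        System.out.println(Arrays.toString(v.getBytesWithRepeatedChecks(0, 3))); // [9, 7, 9]
        System.out.println(Arrays.toString(v.getBytes(0, 3)));                   // [9, 7, 9]
    }
}
```

Both methods are behaviourally identical; the second simply trades a method call for one duplicated line, which is the trade-off the follow-up describes for a perf-critical path.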


bersprockets added 5 commits September 6, 2023 17:54

testing

9199bad

Revert "testing"

e7ae684

This reverts commit 5a1f7259f0c524f6fc09585da42a9dc44d6e5639.

Add failing tests

fec0e0c

fix

7d0f1f0

Bug fix to test

a481bea

github-actions bot added the SQL label Sep 7, 2023

HyukjinKwon approved these changes Sep 7, 2023


wangyum reviewed Sep 8, 2023


Review feedback

264b776

wangyum approved these changes Sep 8, 2023


dongjoon-hyun approved these changes Sep 8, 2023


dongjoon-hyun closed this in fac236e Sep 8, 2023

viirya reviewed Sep 8, 2023


cloud-fan mentioned this pull request Sep 13, 2023

[SPARK-45157][SQL] Avoid repeated if checks in `[On|Off]HeapColumnVector` #42903

Closed

bersprockets deleted the vector_oddity branch September 14, 2023 16:12
