Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-44805][SQL] getBytes/getShorts/getInts/etc. should work in a column vector that has a dictionary #42850

Closed
wants to merge 6 commits into from

Conversation

bersprockets
Copy link
Contributor

What changes were proposed in this pull request?

Change getBytes/getShorts/getInts/getLongs/getFloats/getDoubles in OnHeapColumnVector and OffHeapColumnVector to use the dictionary, if present.

Why are the changes needed?

The following query gets incorrect results:

drop table if exists t1;

create table t1 using parquet as
select * from values
(named_struct('f1', array(1, 2, 3), 'f2', array(1, 1, 2)))
as (value);

select cast(value as struct<f1:array<double>,f2:array<int>>) AS value from t1;

{"f1":[1.0,2.0,3.0],"f2":[0,0,0]}

The result should be:

{"f1":[1.0,2.0,3.0],"f2":[1,2,3]}

The cast operation copies the second array by calling ColumnarArray#copy, which in turn calls ColumnarArray#toIntArray, which in turn calls ColumnVector#getInts on the underlying column vector (which is either an OnHeapColumnVector or an OffHeapColumnVector). The implementation of getInts in either concrete class assumes there is no dictionary and does not use it if it is present (in fact, it even asserts that there is no dictionary). However, in the above example, the column vector associated with the second array does have a dictionary:

java -cp ~/github/parquet-mr/parquet-tools/target/parquet-tools-1.10.1.jar org.apache.parquet.tools.Main meta ./spark-warehouse/t1/part-00000-122fdd53-8166-407b-aec5-08e0c2845c3d-c000.snappy.parquet
...
row group 1: RC:1 TS:112 OFFSET:4 
-------------------------------------------------------------------------------------------------------------------------------------------------------
value:       
.f1:         
..list:      
...element:   INT32 SNAPPY DO:0 FPO:4 SZ:47/47/1.00 VC:3 ENC:RLE,PLAIN ST:[min: 1, max: 3, num_nulls: 0]
.f2:         
..list:      
...element:   INT32 SNAPPY DO:51 FPO:80 SZ:69/65/0.94 VC:3 ENC:RLE,PLAIN_DICTIONARY ST:[min: 1, max: 2, num_nulls: 0]

The same bug also occurs when field f2 is a map. This PR fixes that case as well.

Does this PR introduce any user-facing change?

No, except for fixing the correctness issue.

How was this patch tested?

New tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Sep 7, 2023
Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making sense to me .. but cc @cloud-fan FYI

@wangyum
Copy link
Member

wangyum commented Sep 8, 2023

cc @cloud-fan

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM.

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Sep 8, 2023

cc @sunchao, @viirya , too

dongjoon-hyun pushed a commit that referenced this pull request Sep 8, 2023
…olumn vector that has a dictionary

Change getBytes/getShorts/getInts/getLongs/getFloats/getDoubles in `OnHeapColumnVector` and `OffHeapColumnVector` to use the dictionary, if present.

The following query gets incorrect results:
```
drop table if exists t1;

create table t1 using parquet as
select * from values
(named_struct('f1', array(1, 2, 3), 'f2', array(1, 1, 2)))
as (value);

select cast(value as struct<f1:array<double>,f2:array<int>>) AS value from t1;

{"f1":[1.0,2.0,3.0],"f2":[0,0,0]}

```
The result should be:
```
{"f1":[1.0,2.0,3.0],"f2":[1,2,3]}
```
The cast operation copies the second array by calling `ColumnarArray#copy`, which in turn calls `ColumnarArray#toIntArray`, which in turn calls `ColumnVector#getInts` on the underlying column vector (which is either an `OnHeapColumnVector` or an `OffHeapColumnVector`). The implementation of `getInts` in either concrete class assumes there is no dictionary and does not use it if it is present (in fact, it even asserts that there is no dictionary). However, in the above example, the column vector associated with the second array does have a dictionary:
```
java -cp ~/github/parquet-mr/parquet-tools/target/parquet-tools-1.10.1.jar org.apache.parquet.tools.Main meta ./spark-warehouse/t1/part-00000-122fdd53-8166-407b-aec5-08e0c2845c3d-c000.snappy.parquet
...
row group 1: RC:1 TS:112 OFFSET:4
-------------------------------------------------------------------------------------------------------------------------------------------------------
value:
.f1:
..list:
...element:   INT32 SNAPPY DO:0 FPO:4 SZ:47/47/1.00 VC:3 ENC:RLE,PLAIN ST:[min: 1, max: 3, num_nulls: 0]
.f2:
..list:
...element:   INT32 SNAPPY DO:51 FPO:80 SZ:69/65/0.94 VC:3 ENC:RLE,PLAIN_DICTIONARY ST:[min: 1, max: 2, num_nulls: 0]

```
The same bug also occurs when field f2 is a map. This PR fixes that case as well.

No, except for fixing the correctness issue.

New tests.

No.

Closes #42850 from bersprockets/vector_oddity.

Authored-by: Bruce Robbins <bersprockets@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit fac236e)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
dongjoon-hyun pushed a commit that referenced this pull request Sep 8, 2023
…olumn vector that has a dictionary

Change getBytes/getShorts/getInts/getLongs/getFloats/getDoubles in `OnHeapColumnVector` and `OffHeapColumnVector` to use the dictionary, if present.

The following query gets incorrect results:
```
drop table if exists t1;

create table t1 using parquet as
select * from values
(named_struct('f1', array(1, 2, 3), 'f2', array(1, 1, 2)))
as (value);

select cast(value as struct<f1:array<double>,f2:array<int>>) AS value from t1;

{"f1":[1.0,2.0,3.0],"f2":[0,0,0]}

```
The result should be:
```
{"f1":[1.0,2.0,3.0],"f2":[1,2,3]}
```
The cast operation copies the second array by calling `ColumnarArray#copy`, which in turn calls `ColumnarArray#toIntArray`, which in turn calls `ColumnVector#getInts` on the underlying column vector (which is either an `OnHeapColumnVector` or an `OffHeapColumnVector`). The implementation of `getInts` in either concrete class assumes there is no dictionary and does not use it if it is present (in fact, it even asserts that there is no dictionary). However, in the above example, the column vector associated with the second array does have a dictionary:
```
java -cp ~/github/parquet-mr/parquet-tools/target/parquet-tools-1.10.1.jar org.apache.parquet.tools.Main meta ./spark-warehouse/t1/part-00000-122fdd53-8166-407b-aec5-08e0c2845c3d-c000.snappy.parquet
...
row group 1: RC:1 TS:112 OFFSET:4
-------------------------------------------------------------------------------------------------------------------------------------------------------
value:
.f1:
..list:
...element:   INT32 SNAPPY DO:0 FPO:4 SZ:47/47/1.00 VC:3 ENC:RLE,PLAIN ST:[min: 1, max: 3, num_nulls: 0]
.f2:
..list:
...element:   INT32 SNAPPY DO:51 FPO:80 SZ:69/65/0.94 VC:3 ENC:RLE,PLAIN_DICTIONARY ST:[min: 1, max: 2, num_nulls: 0]

```
The same bug also occurs when field f2 is a map. This PR fixes that case as well.

No, except for fixing the correctness issue.

New tests.

No.

Closes #42850 from bersprockets/vector_oddity.

Authored-by: Bruce Robbins <bersprockets@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit fac236e)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
dongjoon-hyun pushed a commit that referenced this pull request Sep 8, 2023
…olumn vector that has a dictionary

Change getBytes/getShorts/getInts/getLongs/getFloats/getDoubles in `OnHeapColumnVector` and `OffHeapColumnVector` to use the dictionary, if present.

The following query gets incorrect results:
```
drop table if exists t1;

create table t1 using parquet as
select * from values
(named_struct('f1', array(1, 2, 3), 'f2', array(1, 1, 2)))
as (value);

select cast(value as struct<f1:array<double>,f2:array<int>>) AS value from t1;

{"f1":[1.0,2.0,3.0],"f2":[0,0,0]}

```
The result should be:
```
{"f1":[1.0,2.0,3.0],"f2":[1,2,3]}
```
The cast operation copies the second array by calling `ColumnarArray#copy`, which in turn calls `ColumnarArray#toIntArray`, which in turn calls `ColumnVector#getInts` on the underlying column vector (which is either an `OnHeapColumnVector` or an `OffHeapColumnVector`). The implementation of `getInts` in either concrete class assumes there is no dictionary and does not use it if it is present (in fact, it even asserts that there is no dictionary). However, in the above example, the column vector associated with the second array does have a dictionary:
```
java -cp ~/github/parquet-mr/parquet-tools/target/parquet-tools-1.10.1.jar org.apache.parquet.tools.Main meta ./spark-warehouse/t1/part-00000-122fdd53-8166-407b-aec5-08e0c2845c3d-c000.snappy.parquet
...
row group 1: RC:1 TS:112 OFFSET:4
-------------------------------------------------------------------------------------------------------------------------------------------------------
value:
.f1:
..list:
...element:   INT32 SNAPPY DO:0 FPO:4 SZ:47/47/1.00 VC:3 ENC:RLE,PLAIN ST:[min: 1, max: 3, num_nulls: 0]
.f2:
..list:
...element:   INT32 SNAPPY DO:51 FPO:80 SZ:69/65/0.94 VC:3 ENC:RLE,PLAIN_DICTIONARY ST:[min: 1, max: 2, num_nulls: 0]

```
The same bug also occurs when field f2 is a map. This PR fixes that case as well.

No, except for fixing the correctness issue.

New tests.

No.

Closes #42850 from bersprockets/vector_oddity.

Authored-by: Bruce Robbins <bersprockets@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit fac236e)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
@dongjoon-hyun
Copy link
Member

Merged to master/3.5/3.4/3.3.

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!

@sunchao
Copy link
Member

sunchao commented Sep 8, 2023

late LGTM, thanks @bersprockets !

dongjoon-hyun pushed a commit that referenced this pull request Sep 13, 2023
…ector`

### What changes were proposed in this pull request?

This is a small followup of #42850. `getBytes` checks if the `dictionary` is null or not, then call `getByte` which also checks if the `dictionary` is null or not. This PR avoids the repeated if checks by copying one line code from `getByte` to `getBytes`. The same applies to other `getXXX` methods.

### Why are the changes needed?

Make the perf-critical path more efficient.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #42903 from cloud-fan/vector.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
@bersprockets bersprockets deleted the vector_oddity branch September 14, 2023 16:12
viirya pushed a commit to viirya/spark-1 that referenced this pull request Oct 19, 2023
…olumn vector that has a dictionary

Change getBytes/getShorts/getInts/getLongs/getFloats/getDoubles in `OnHeapColumnVector` and `OffHeapColumnVector` to use the dictionary, if present.

The following query gets incorrect results:
```
drop table if exists t1;

create table t1 using parquet as
select * from values
(named_struct('f1', array(1, 2, 3), 'f2', array(1, 1, 2)))
as (value);

select cast(value as struct<f1:array<double>,f2:array<int>>) AS value from t1;

{"f1":[1.0,2.0,3.0],"f2":[0,0,0]}

```
The result should be:
```
{"f1":[1.0,2.0,3.0],"f2":[1,2,3]}
```
The cast operation copies the second array by calling `ColumnarArray#copy`, which in turn calls `ColumnarArray#toIntArray`, which in turn calls `ColumnVector#getInts` on the underlying column vector (which is either an `OnHeapColumnVector` or an `OffHeapColumnVector`). The implementation of `getInts` in either concrete class assumes there is no dictionary and does not use it if it is present (in fact, it even asserts that there is no dictionary). However, in the above example, the column vector associated with the second array does have a dictionary:
```
java -cp ~/github/parquet-mr/parquet-tools/target/parquet-tools-1.10.1.jar org.apache.parquet.tools.Main meta ./spark-warehouse/t1/part-00000-122fdd53-8166-407b-aec5-08e0c2845c3d-c000.snappy.parquet
...
row group 1: RC:1 TS:112 OFFSET:4
-------------------------------------------------------------------------------------------------------------------------------------------------------
value:
.f1:
..list:
...element:   INT32 SNAPPY DO:0 FPO:4 SZ:47/47/1.00 VC:3 ENC:RLE,PLAIN ST:[min: 1, max: 3, num_nulls: 0]
.f2:
..list:
...element:   INT32 SNAPPY DO:51 FPO:80 SZ:69/65/0.94 VC:3 ENC:RLE,PLAIN_DICTIONARY ST:[min: 1, max: 2, num_nulls: 0]

```
The same bug also occurs when field f2 is a map. This PR fixes that case as well.

No, except for fixing the correctness issue.

New tests.

No.

Closes apache#42850 from bersprockets/vector_oddity.

Authored-by: Bruce Robbins <bersprockets@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit fac236e)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
6 participants