Skip to content

[SPARK-33726][SQL][2.4] Fix for Duplicate field names during Aggregation#31327

Closed
yliou wants to merge 1 commit intoapache:branch-2.4from
yliou:SPARK-33726_2.4
Closed

[SPARK-33726][SQL][2.4] Fix for Duplicate field names during Aggregation#31327
yliou wants to merge 1 commit intoapache:branch-2.4from
yliou:SPARK-33726_2.4

Conversation

@yliou
Copy link
Contributor

@yliou yliou commented Jan 25, 2021

What changes were proposed in this pull request?

The RowBasedKeyValueBatch has two different implementations depending on whether the aggregation key and value uses only fixed length data types (FixedLengthRowBasedKeyValueBatch) or not (VariableLengthRowBasedKeyValueBatch).

Before this PR the decision about the used implementation was based on by accessing the schema fields by their name.
But if two fields has the same name and one with variable length and the other with fixed length type (and all the other fields are with fixed length types) a bad decision could be made.

When FixedLengthRowBasedKeyValueBatch is chosen but there is a variable length field then an aggregation function could calculate with invalid values. This case is illustrated by the example used in the unit test:

with T as (select id as a, -id as x from range(3)), U as (select id as b, cast(id as string) as x from range(3)) select T.x, U.x, min(a) as ma, min(b) as mb from T join U on a=b group by U.x, T.x
where the 'x' column in the left side of the join is a Long but on the right side is a String.

Why are the changes needed?

Fixes the issue where duplicate field name aggregation has null values in the dataframe.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added UT, tested manually on spark shell.

The `RowBasedKeyValueBatch` has two different implementations depending on whether the aggregation key and value uses only fixed length data types (`FixedLengthRowBasedKeyValueBatch`) or not (`VariableLengthRowBasedKeyValueBatch`).

Before this PR the decision about the used implementation was based on by accessing the schema fields by their name.
But if two fields has the same name and one with variable length and the other with fixed length type (and all the other fields are with fixed length types) a bad decision could be made.

When `FixedLengthRowBasedKeyValueBatch` is chosen but there is a variable length field then an aggregation function could calculate with invalid values. This case is illustrated by the example used in the unit test:

`with T as (select id as a, -id as x from range(3)),
        U as (select id as b, cast(id as string) as x from range(3))
select T.x, U.x, min(a) as ma, min(b) as mb from T join U on a=b group by U.x, T.x`
where the 'x' column in the left side of the join is a Long but on the right side is a String.

Fixes the issue where duplicate field name aggregation has null values in the dataframe.

No

Added UT, tested manually on spark shell.

Closes apache#30788 from yliou/SPARK-33726.

Authored-by: yliou <yliou@berkeley.edu>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@SparkQA
Copy link

SparkQA commented Jan 25, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39045/

@SparkQA
Copy link

SparkQA commented Jan 25, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39045/

@yliou
Copy link
Contributor Author

yliou commented Jan 25, 2021

cc @cloud-fan

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-33726][SQL][2.4-Backport] Fix for Duplicate field names during Aggregation [SPARK-33726][SQL][2.4] Fix for Duplicate field names during Aggregation Jan 25, 2021
@SparkQA
Copy link

SparkQA commented Jan 25, 2021

Test build #134459 has finished for PR 31327 at commit 72eda55.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

retest this please

@SparkQA
Copy link

SparkQA commented Jan 26, 2021

Test build #134505 has finished for PR 31327 at commit 72eda55.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

thanks, merging to 2.4!

@cloud-fan cloud-fan closed this Jan 26, 2021
cloud-fan pushed a commit that referenced this pull request Jan 26, 2021
### What changes were proposed in this pull request?
The `RowBasedKeyValueBatch` has two different implementations depending on whether the aggregation key and value uses only fixed length data types (`FixedLengthRowBasedKeyValueBatch`) or not (`VariableLengthRowBasedKeyValueBatch`).

Before this PR the decision about the used implementation was based on by accessing the schema fields by their name.
But if two fields has the same name and one with variable length and the other with fixed length type (and all the other fields are with fixed length types) a bad decision could be made.

When `FixedLengthRowBasedKeyValueBatch` is chosen but there is a variable length field then an aggregation function could calculate with invalid values. This case is illustrated by the example used in the unit test:

`with T as (select id as a, -id as x from range(3)),
        U as (select id as b, cast(id as string) as x from range(3))
select T.x, U.x, min(a) as ma, min(b) as mb from T join U on a=b group by U.x, T.x`
where the 'x' column in the left side of the join is a Long but on the right side is a String.

### Why are the changes needed?
Fixes the issue where duplicate field name aggregation has null values in the dataframe.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT, tested manually on spark shell.

Closes #31327 from yliou/SPARK-33726_2.4.

Authored-by: yliou <yliou@berkeley.edu>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@yliou yliou deleted the SPARK-33726_2.4 branch April 3, 2023 22:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants