[SPARK-33726][SQL][2.4] Fix for Duplicate field names during Aggregation by yliou · Pull Request #31327 · apache/spark

yliou · 2021-01-25T18:34:03Z

What changes were proposed in this pull request?

The RowBasedKeyValueBatch has two different implementations depending on whether the aggregation key and value uses only fixed length data types (FixedLengthRowBasedKeyValueBatch) or not (VariableLengthRowBasedKeyValueBatch).

Before this PR the decision about the used implementation was based on by accessing the schema fields by their name.
But if two fields has the same name and one with variable length and the other with fixed length type (and all the other fields are with fixed length types) a bad decision could be made.

When FixedLengthRowBasedKeyValueBatch is chosen but there is a variable length field then an aggregation function could calculate with invalid values. This case is illustrated by the example used in the unit test:

with T as (select id as a, -id as x from range(3)), U as (select id as b, cast(id as string) as x from range(3)) select T.x, U.x, min(a) as ma, min(b) as mb from T join U on a=b group by U.x, T.x
where the 'x' column in the left side of the join is a Long but on the right side is a String.

Why are the changes needed?

Fixes the issue where duplicate field name aggregation has null values in the dataframe.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added UT, tested manually on spark shell.

The `RowBasedKeyValueBatch` has two different implementations depending on whether the aggregation key and value uses only fixed length data types (`FixedLengthRowBasedKeyValueBatch`) or not (`VariableLengthRowBasedKeyValueBatch`). Before this PR the decision about the used implementation was based on by accessing the schema fields by their name. But if two fields has the same name and one with variable length and the other with fixed length type (and all the other fields are with fixed length types) a bad decision could be made. When `FixedLengthRowBasedKeyValueBatch` is chosen but there is a variable length field then an aggregation function could calculate with invalid values. This case is illustrated by the example used in the unit test: `with T as (select id as a, -id as x from range(3)), U as (select id as b, cast(id as string) as x from range(3)) select T.x, U.x, min(a) as ma, min(b) as mb from T join U on a=b group by U.x, T.x` where the 'x' column in the left side of the join is a Long but on the right side is a String. Fixes the issue where duplicate field name aggregation has null values in the dataframe. No Added UT, tested manually on spark shell. Closes apache#30788 from yliou/SPARK-33726. Authored-by: yliou <yliou@berkeley.edu> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

SparkQA · 2021-01-25T19:48:28Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39045/

SparkQA · 2021-01-25T20:03:50Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39045/

yliou · 2021-01-25T20:20:37Z

cc @cloud-fan

SparkQA · 2021-01-25T21:47:50Z

Test build #134459 has finished for PR 31327 at commit 72eda55.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2021-01-26T13:18:27Z

retest this please

SparkQA · 2021-01-26T16:32:33Z

Test build #134505 has finished for PR 31327 at commit 72eda55.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2021-01-26T16:38:18Z

thanks, merging to 2.4!

### What changes were proposed in this pull request? The `RowBasedKeyValueBatch` has two different implementations depending on whether the aggregation key and value uses only fixed length data types (`FixedLengthRowBasedKeyValueBatch`) or not (`VariableLengthRowBasedKeyValueBatch`). Before this PR the decision about the used implementation was based on by accessing the schema fields by their name. But if two fields has the same name and one with variable length and the other with fixed length type (and all the other fields are with fixed length types) a bad decision could be made. When `FixedLengthRowBasedKeyValueBatch` is chosen but there is a variable length field then an aggregation function could calculate with invalid values. This case is illustrated by the example used in the unit test: `with T as (select id as a, -id as x from range(3)), U as (select id as b, cast(id as string) as x from range(3)) select T.x, U.x, min(a) as ma, min(b) as mb from T join U on a=b group by U.x, T.x` where the 'x' column in the left side of the join is a Long but on the right side is a String. ### Why are the changes needed? Fixes the issue where duplicate field name aggregation has null values in the dataframe. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added UT, tested manually on spark shell. Closes #31327 from yliou/SPARK-33726_2.4. Authored-by: yliou <yliou@berkeley.edu> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

dongjoon-hyun changed the title ~~[SPARK-33726][SQL][2.4-Backport] Fix for Duplicate field names during Aggregation~~ [SPARK-33726][SQL][2.4] Fix for Duplicate field names during Aggregation Jan 25, 2021

cloud-fan closed this Jan 26, 2021

yliou deleted the SPARK-33726_2.4 branch April 3, 2023 22:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-33726][SQL][2.4] Fix for Duplicate field names during Aggregation#31327

[SPARK-33726][SQL][2.4] Fix for Duplicate field names during Aggregation#31327
yliou wants to merge 1 commit intoapache:branch-2.4from
yliou:SPARK-33726_2.4

yliou commented Jan 25, 2021

Uh oh!

SparkQA commented Jan 25, 2021

Uh oh!

SparkQA commented Jan 25, 2021

Uh oh!

yliou commented Jan 25, 2021

Uh oh!

SparkQA commented Jan 25, 2021

Uh oh!

cloud-fan commented Jan 26, 2021

Uh oh!

SparkQA commented Jan 26, 2021

Uh oh!

cloud-fan commented Jan 26, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

yliou commented Jan 25, 2021

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Uh oh!

SparkQA commented Jan 25, 2021

Uh oh!

SparkQA commented Jan 25, 2021

Uh oh!

yliou commented Jan 25, 2021

Uh oh!

SparkQA commented Jan 25, 2021

Uh oh!

cloud-fan commented Jan 26, 2021

Uh oh!

SparkQA commented Jan 26, 2021

Uh oh!

cloud-fan commented Jan 26, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants