Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-43113][SQL][FOLLOWUP] Add comment about copying steam-side variables #40881

Closed

Conversation

bersprockets
Copy link
Contributor

@bersprockets bersprockets commented Apr 20, 2023

What changes were proposed in this pull request?

Add a comment explaining a tricky situation involving the evaluation of stream-side variables.

This is a follow-up to #40766.

Why are the changes needed?

Make the code more clear.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

N/A

@github-actions github-actions bot added the SQL label Apr 20, 2023
@HyukjinKwon
Copy link
Member

will leave it to @cloud-fan though.

@cloud-fan
Copy link
Contributor

thanks, merging to master/3.4!

@cloud-fan cloud-fan closed this in 6d4ed13 Apr 21, 2023
cloud-fan pushed a commit that referenced this pull request Apr 21, 2023
…iables

### What changes were proposed in this pull request?

Add a comment explaining a tricky situation involving the evaluation of stream-side variables.

This is a follow-up to #40766.

### Why are the changes needed?

Make the code more clear.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A

Closes #40881 from bersprockets/SPARK-43113_followup.

Authored-by: Bruce Robbins <bersprockets@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 6d4ed13)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
bersprockets added a commit to bersprockets/spark that referenced this pull request Apr 23, 2023
…iables

### What changes were proposed in this pull request?

Add a comment explaining a tricky situation involving the evaluation of stream-side variables.

This is a follow-up to apache#40766.

### Why are the changes needed?

Make the code more clear.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A

Closes apache#40881 from bersprockets/SPARK-43113_followup.

Authored-by: Bruce Robbins <bersprockets@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
HyukjinKwon pushed a commit that referenced this pull request Apr 24, 2023
…g code for a bound condition

### What changes were proposed in this pull request?

This is a back-port of #40766 and #40881.

In `JoinCodegenSupport#getJoinCondition`, evaluate any referenced stream-side variables before using them in the generated code.

This patch doesn't evaluate the passed stream-side variables directly, but instead evaluates a copy (`streamVars2`). This is because `SortMergeJoin#codegenFullOuter` will want to evaluate the stream-side vars within a different scope than the condition check, so we mustn't delete the initialization code from the original `ExprCode` instances.

### Why are the changes needed?

When a bound condition of a full outer join references the same stream-side column more than once, wholestage codegen generates bad code.

For example, the following query fails with a compilation error:

```
create or replace temp view v1 as
select * from values
(1, 1),
(2, 2),
(3, 1)
as v1(key, value);

create or replace temp view v2 as
select * from values
(1, 22, 22),
(3, -1, -1),
(7, null, null)
as v2(a, b, c);

select *
from v1
full outer join v2
on key = a
and value > b
and value > c;
```
The error is:
```
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 277, Column 9: Redefinition of local variable "smj_isNull_7"
```
The same error occurs with code generated from ShuffleHashJoinExec:
```
select /*+ SHUFFLE_HASH(v2) */ *
from v1
full outer join v2
on key = a
and value > b
and value > c;
```
In this case, the error is:
```
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 174, Column 5: Redefinition of local variable "shj_value_1"
```
Neither `SortMergeJoin#codegenFullOuter` nor `ShuffledHashJoinExec#doProduce` evaluate the stream-side variables before calling `consumeFullOuterJoinRow#getJoinCondition`. As a result, `getJoinCondition` generates definition/initialization code for each referenced stream-side variable at the point of use. If a stream-side variable is used more than once in the bound condition, the definition/initialization code is generated more than once, resulting in the "Redefinition of local variable" error.

In the end, the query succeeds, since Spark disables wholestage codegen and tries again.

(In the case other join-type/strategy pairs, either the implementations don't call `JoinCodegenSupport#getJoinCondition`, or the stream-side variables are pre-evaluated before the call is made, so no error happens in those cases).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New unit tests.

Closes #40917 from bersprockets/full_join_codegen_issue_br33.

Authored-by: Bruce Robbins <bersprockets@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
@bersprockets bersprockets deleted the SPARK-43113_followup branch April 27, 2023 22:23
snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023
…iables

### What changes were proposed in this pull request?

Add a comment explaining a tricky situation involving the evaluation of stream-side variables.

This is a follow-up to apache#40766.

### Why are the changes needed?

Make the code more clear.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A

Closes apache#40881 from bersprockets/SPARK-43113_followup.

Authored-by: Bruce Robbins <bersprockets@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 6d4ed13)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
GladwinLee pushed a commit to lyft/spark that referenced this pull request Oct 10, 2023
…iables

### What changes were proposed in this pull request?

Add a comment explaining a tricky situation involving the evaluation of stream-side variables.

This is a follow-up to apache#40766.

### Why are the changes needed?

Make the code more clear.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A

Closes apache#40881 from bersprockets/SPARK-43113_followup.

Authored-by: Bruce Robbins <bersprockets@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 6d4ed13)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
catalinii pushed a commit to lyft/spark that referenced this pull request Oct 10, 2023
…iables

### What changes were proposed in this pull request?

Add a comment explaining a tricky situation involving the evaluation of stream-side variables.

This is a follow-up to apache#40766.

### Why are the changes needed?

Make the code more clear.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A

Closes apache#40881 from bersprockets/SPARK-43113_followup.

Authored-by: Bruce Robbins <bersprockets@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 6d4ed13)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
3 participants