[SPARK-16633] [SPARK-16642] [SPARK-16721] [SQL] Fixes three issues related to lead and lag functions #14284
…s explained below:
* When the offset row does not exist, default values will be used.
* lead/lag always respect null input values.
@@ -382,7 +382,7 @@ abstract class OffsetWindowFunction
 *
 * @param input expression to evaluate 'offset' rows after the current row.
 * @param offset rows to jump ahead in the partition.
- * @param default to use when the input value is null or when the offset is larger than the window.
+ * @param default to use when the offset is larger than the window.
@hvanhovell what was the reason for changing whether lead and lag respect null values?
Without a good reason, and without providing a way to make lead and lag respect nulls, we should not change the behavior.
Test build #62593 has finished for PR 14284 at commit
test this please
Test build #62609 has finished for PR 14284 at commit
Test build #62611 has finished for PR 14284 at commit
@@ -367,4 +367,50 @@ class SQLWindowFunctionSuite extends QueryTest with SQLTestUtils with TestHiveSi
| select * from v2 order by key limit 1
""".stripMargin), Row(0, 3))
}

test("lead/lag should return the default value if the offset row does not exist") {
hm why is this file in hive?
can you move it in a separate pr?
Oh, originally we used window functions from Hive. That is probably why this file was put here. Let me move it.
Done
@@ -357,14 +356,59 @@ class SQLWindowFunctionSuite extends QueryTest with SQLTestUtils with TestHiveSi
}

test("SPARK-7595: Window will cause resolve failed with self join") {
sql("SELECT * FROM src") // Force loading of src table.
For this test, I disabled the fix (https://github.com/apache/spark/pull/6114/files) and checked that analysis does fail, because the analyzer fails to resolve conflicting references in Join. So, this test is still valid after my change.
but why do we remove it?
It is not in Hive, so there is no table called src.
Test build #62626 has finished for PR 14284 at commit
Row(2, 2))
}

test("lead/lag should be able to handle null input value correctly") {
I don't think "correctly" is needed here.
Can you explain?
It's best if the test case name specifies the behavior rather than saying "correctly", since obviously we don't want anything to behave "incorrectly".
ok I see what @jaceklaskowski meant. I thought he was questioning the behavior of lead/lag.
Test build #62690 has finished for PR 14284 at commit
@@ -625,10 +643,12 @@ private[execution] final class OffsetWindowFunctionFrame(
if (inputIndex >= 0 && inputIndex < input.size) {
This is a more general comment, which does not necessarily apply to this line. Since we are breaking the code up into two separate code paths (with row/without row), we might as well get rid of the joined row and the logic needed to set it up (like: Seq.fill(ordinal)(NoOp)).
Yea, we can improve this part in master.
Ok, let's improve this in a follow-up PR :).
test this please
should we also update the doc for
retest this please
yea. that's a good point.
LGTM
Test build #62854 has finished for PR 14284 at commit
Thanks for the review. I am merging this to master and branch 2.0.
…ed to lead and lag functions

## What changes were proposed in this pull request?

This PR contains three changes.

First, this PR changes the behavior of lead/lag back to Spark 1.6's behavior, which is described as below:
1. lead/lag respect null input values, which means that if the offset row exists and the input value is null, the result will be null instead of the default value.
2. If the offset row does not exist, the default value will be used.
3. OffsetWindowFunction's nullable setting also considers the nullability of its input (because of the first change).

Second, this PR fixes the evaluation of lead/lag when the input expression is a literal. This fix is a result of the first change. In current master, if a literal is used as the input expression of a lead or lag function, the result will be this literal even if the offset row does not exist.

Third, this PR makes ResolveWindowFrame not fire if a window function is not resolved.

## How was this patch tested?

New tests in SQLWindowFunctionSuite

Author: Yin Huai <yhuai@databricks.com>

Closes #14284 from yhuai/lead-lag.

(cherry picked from commit 815f3ee)
Signed-off-by: Yin Huai <yhuai@databricks.com>
Are we able to enable the ignore nulls feature in Spark 2.1?
Is it possible to add a feature to enable ignore nulls? @yhuai thanks
We should keep both IGNORE NULLS | RESPECT NULLS as features.
@chengat1314 this has never been supported, and is a new feature. There was some discussion on the JIRA a while back: https://issues.apache.org/jira/browse/SPARK-17423. Let's move the discussion there.
@hvanhovell Nice, thank you very much!
What changes were proposed in this pull request?
This PR contains three changes.
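First, this PR changes the behavior of lead/lag back to Spark 1.6's behavior, which is described below:
1. lead/lag respect null input values, which means that if the offset row exists and the input value is null, the result will be null instead of the default value.
2. If the offset row does not exist, the default value will be used.
3. OffsetWindowFunction's nullable setting also considers the nullability of its input (because of the first change).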
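To illustrate the restored behavior (a null input on an existing offset row stays null, and the default is used only when the offset row does not exist), here is a minimal sketch in plain Scala with no Spark dependency. `LeadLagModel`, `lead` and `lag` are hypothetical names chosen for this illustration, and `None` stands in for SQL NULL. It models the semantics described in this PR over a single partition and is not Spark's implementation.

```scala
object LeadLagModel {
  // Hypothetical model: one ordered partition, one input value per row.
  // None stands in for SQL NULL.
  def lead[A](values: Vector[Option[A]], offset: Int, default: Option[A]): Vector[Option[A]] =
    values.indices.toVector.map { i =>
      val j = i + offset
      if (j >= 0 && j < values.length) values(j) // offset row exists: keep its value, even if it is NULL
      else default                               // offset row does not exist: fall back to the default
    }

  // lag looks backwards by the same number of rows.
  def lag[A](values: Vector[Option[A]], offset: Int, default: Option[A]): Vector[Option[A]] =
    lead(values, -offset, default)
}
```

For example, with the values `Vector(Some(1), None, Some(3))`, an offset of 1 and a default of `Some(-1)`, the first row's lead is `None` (the next row exists and holds a null), while the last row's lead is `Some(-1)` (there is no next row).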
Second, this PR fixes the evaluation of lead/lag when the input expression is a literal. This fix is a result of the first change. In current master, if a literal is used as the input expression of a lead or lag function, the result will be this literal even if the offset row does not exist.
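The literal case can be sketched in plain Scala as follows. This is an assumption-laden model, not Spark code: `LiteralLeadModel`, `Row`, `Expr` and `literalOne` are hypothetical names, an expression is modeled as a function of a row, and `None` stands in for SQL NULL. The point is that a literal input is only evaluated when the offset row exists; otherwise the default is returned.

```scala
object LiteralLeadModel {
  // Hypothetical stand-ins for rows and expressions.
  type Row = Map[String, Int]
  type Expr = Row => Option[Int]

  def lead(rows: Vector[Row], input: Expr, offset: Int, default: Option[Int]): Vector[Option[Int]] =
    rows.indices.toVector.map { i =>
      val j = i + offset
      if (j >= 0 && j < rows.length) input(rows(j)) // offset row exists: evaluate the input on it
      else default                                  // no offset row: the default wins, even for a literal input
    }

  val rows: Vector[Row] = Vector(Map("key" -> 1), Map("key" -> 2), Map("key" -> 3))

  // A literal ignores the row it is evaluated against.
  val literalOne: Expr = _ => Some(1)

  // The last row has no following row, so it gets the default rather than the literal.
  val result: Vector[Option[Int]] = lead(rows, literalOne, 1, Some(0))
}
```

Under the behavior described for master before this fix, the last element would have been the literal instead of the default.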
Third, this PR makes ResolveWindowFrame not fire if a window function is not resolved.
How was this patch tested?
New tests in SQLWindowFunctionSuite