Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-16633] [SPARK-16642] [SPARK-16721] [SQL] Fixes three issues related to lead and lag functions #14284

Closed
wants to merge 9 commits into from

Conversation

yhuai
Copy link
Contributor

@yhuai yhuai commented Jul 20, 2016

What changes were proposed in this pull request?

This PR contains three changes.

First, this PR changes the behavior of lead/lag back to Spark 1.6's behavior, which is described as below:

  1. lead/lag respect null input values, which means that if the offset row exists and the input value is null, the result will be null instead of the default value.
  2. If the offset row does not exist, the default value will be used.
  3. OffsetWindowFunction's nullable setting also considers the nullability of its input (because of the first change).

Second, this PR fixes the evaluation of lead/lag when the input expression is a literal. This fix is a result of the first change. In current master, if a literal is used as the input expression of a lead or lag function, the result will be this literal even if the offset row does not exist.

Third, this PR makes ResolveWindowFrame not fire if a window function is not resolved.

How was this patch tested?

New tests in SQLWindowFunctionSuite

…s explained below:

* When the offset row does not exits, default values will be used.
* lead/lag always respect null input values.
@@ -382,7 +382,7 @@ abstract class OffsetWindowFunction
*
* @param input expression to evaluate 'offset' rows after the current row.
* @param offset rows to jump ahead in the partition.
* @param default to use when the input value is null or when the offset is larger than the window.
* @param default to use when the offset is larger than the window.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hvanhovell what was the reason that we changed the behavior of lead and lag on if they respect null values?

@yhuai
Copy link
Contributor Author

yhuai commented Jul 20, 2016

Without a good reason and providing a way  to make lead and lag respect Bulls, we should not change the behavior.

@yhuai yhuai changed the title [SPARK-16633] [SPARK-16642] Fixes three issues related to window functions [SPARK-16633] [SPARK-16642] Fixes three issues related to lead and lag functions Jul 20, 2016
@SparkQA
Copy link

SparkQA commented Jul 20, 2016

Test build #62593 has finished for PR 14284 at commit c2fd2d7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai
Copy link
Contributor Author

yhuai commented Jul 20, 2016

test this please

@SparkQA
Copy link

SparkQA commented Jul 20, 2016

Test build #62609 has finished for PR 14284 at commit c2fd2d7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 20, 2016

Test build #62611 has finished for PR 14284 at commit 51e6937.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -367,4 +367,50 @@ class SQLWindowFunctionSuite extends QueryTest with SQLTestUtils with TestHiveSi
| select * from v2 order by key limit 1
""".stripMargin), Row(0, 3))
}

test("lead/lag should return the default value if the offset row does not exist") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm why is this file in hive?

can you move it in a separate pr?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, originally we used window functions from Hive. It should be reason that we put this file at here. Let me move it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -357,14 +356,59 @@ class SQLWindowFunctionSuite extends QueryTest with SQLTestUtils with TestHiveSi
}

test("SPARK-7595: Window will cause resolve failed with self join") {
sql("SELECT * FROM src") // Force loading of src table.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this test, I disabled the fix (https://github.com/apache/spark/pull/6114/files) and checked that it does fail the analysis because analyzer fails to resolve conflicting references in Join. So, this test is still valid after my change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but why we remove it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is not in Hive. So there is no table called src.

@SparkQA
Copy link

SparkQA commented Jul 20, 2016

Test build #62626 has finished for PR 14284 at commit faa7d89.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class WindowData(month: Int, area: String, product: Int)
    • class SQLWindowFunctionSuite extends QueryTest with SharedSQLContext

Row(2, 2))
}

test("lead/lag should be able to handle null input value correctly") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think "correctly" is needed here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's best if the test case name should specify the behavior, rather than saying "correctly".

Since obviously we don't want anything to be "incorrectly"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok I see what @jaceklaskowski meant. I thought he was questioning the behavior of lead/lag.

@SparkQA
Copy link

SparkQA commented Jul 21, 2016

Test build #62690 has finished for PR 14284 at commit 073ac94.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -625,10 +643,12 @@ private[execution] final class OffsetWindowFunctionFrame(
if (inputIndex >= 0 && inputIndex < input.size) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a more general comment, which does not necessarily apply to this line. Since we are breaking the code up into to separate code paths (with row/without row), we might as well get rid of the joined row and the logic needed to set this up (like: Seq.fill(ordinal)(NoOp))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, we can improve this part in master.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, lets improve this in a follow-up PR :).

@yhuai yhuai changed the title [SPARK-16633] [SPARK-16642] Fixes three issues related to lead and lag functions [SPARK-16633] [SPARK-16642] [SPARK-16721] [SQL] Fixes three issues related to lead and lag functions Jul 25, 2016
@yhuai
Copy link
Contributor Author

yhuai commented Jul 26, 2016

this this please

@cloud-fan
Copy link
Contributor

should we also update the doc for OffsetWindowFunction.default?

@cloud-fan
Copy link
Contributor

retest this please

@yhuai
Copy link
Contributor Author

yhuai commented Jul 26, 2016

yea. that's a good point.

@cloud-fan
Copy link
Contributor

LGTM

@SparkQA
Copy link

SparkQA commented Jul 26, 2016

Test build #62854 has finished for PR 14284 at commit ff3029e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai
Copy link
Contributor Author

yhuai commented Jul 26, 2016

Thanks for review. I am merging this to master and branch 2.0.

asfgit pushed a commit that referenced this pull request Jul 26, 2016
…ed to lead and lag functions

## What changes were proposed in this pull request?
This PR contains three changes.

First, this PR changes the behavior of lead/lag back to Spark 1.6's behavior, which is described as below:
1. lead/lag respect null input values, which means that if the offset row exists and the input value is null, the result will be null instead of the default value.
2. If the offset row does not exist, the default value will be used.
3. OffsetWindowFunction's nullable setting also considers the nullability of its input (because of the first change).

Second, this PR fixes the evaluation of lead/lag when the input expression is a literal. This fix is a result of the first change. In current master, if a literal is used as the input expression of a lead or lag function, the result will be this literal even if the offset row does not exist.

Third, this PR makes ResolveWindowFrame not fire if a window function is not resolved.

## How was this patch tested?
New tests in SQLWindowFunctionSuite

Author: Yin Huai <yhuai@databricks.com>

Closes #14284 from yhuai/lead-lag.

(cherry picked from commit 815f3ee)
Signed-off-by: Yin Huai <yhuai@databricks.com>
@asfgit asfgit closed this in 815f3ee Jul 26, 2016
@chengat1314
Copy link

Are we able to enable ignore null feature in Spark 2.1?
like lag(comm ignore nulls) over (order by empno) prev_comm.
thx

@chengat1314
Copy link

chengat1314 commented Jan 4, 2017

Is possible add feature to enable ignore nulls? @yhuai
for example:
LAG (value_expr [, offset ])
[ IGNORE NULLS | RESPECT NULLS ]
OVER ( [ PARTITION BY window_partition ] ORDER BY window_ordering )

thanks
Cheng Feng

Copy link

@chengat1314 chengat1314 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should keep both IGNORE NULLS | RESPECT NULLS as feature.

@hvanhovell
Copy link
Contributor

@chengat1314 this has never been supported, and is a new feature. There was some discussion on the JIRA a while back: https://issues.apache.org/jira/browse/SPARK-17423. Lets move the discussion there.

@chengat1314
Copy link

@hvanhovell Nice, thank you very much!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants