Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-27951][SQL] Support ANSI SQL NTH_VALUE window function #27440

Closed

Conversation

beliefer
Copy link
Contributor

@beliefer beliefer commented Feb 3, 2020

What changes were proposed in this pull request?

The NTH_VALUE function is an ANSI SQL.
For examples:

CREATE TEMPORARY TABLE empsalary (
    depname varchar,
    empno bigint,
    salary int,
    enroll_date date
);

INSERT INTO empsalary VALUES
('develop', 10, 5200, '2007-08-01'),
('sales', 1, 5000, '2006-10-01'),
('personnel', 5, 3500, '2007-12-10'),
('sales', 4, 4800, '2007-08-08'),
('personnel', 2, 3900, '2006-12-23'),
('develop', 7, 4200, '2008-01-01'),
('develop', 9, 4500, '2008-01-01'),
('sales', 3, 4800, '2007-08-01'),
('develop', 8, 6000, '2006-10-01'),
('develop', 11, 5200, '2007-08-15');

select first_value(salary) over(order by salary range between 1000 preceding and 1000 following),
	lead(salary) over(order by salary range between 1000 preceding and 1000 following),
	nth_value(salary, 1) over(order by salary range between 1000 preceding and 1000 following),
	salary from empsalary;
 first_value | lead | nth_value | salary 
-------------+------+-----------+--------
        3500 | 3900 |      3500 |   3500
        3500 | 4200 |      3500 |   3900
        3500 | 4500 |      3500 |   4200
        3500 | 4800 |      3500 |   4500
        3900 | 4800 |      3900 |   4800
        3900 | 5000 |      3900 |   4800
        4200 | 5200 |      4200 |   5000
        4200 | 5200 |      4200 |   5200
        4200 | 6000 |      4200 |   5200
        5000 |      |      5000 |   6000
(10 rows)

There are some mainstream database support the syntax.

PostgreSQL:
https://www.postgresql.org/docs/8.4/functions-window.html

Vertica:
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/Analytic/NTH_VALUEAnalytic.htm?tocpath=SQL%20Reference%20Manual%7CSQL%20Functions%7CAnalytic%20Functions%7C_____23

Oracle:
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/NTH_VALUE.html#GUID-F8A0E88C-67E5-4AA6-9515-95D03A7F9EA0

Redshift
https://docs.aws.amazon.com/redshift/latest/dg/r_WF_NTH.html

Presto
https://prestodb.io/docs/current/functions/window.html

Why are the changes needed?

The NTH_VALUE function is an ANSI SQL.
The NTH_VALUE function is very useful.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Exists and new UT.

@SparkQA
Copy link

SparkQA commented Feb 3, 2020

Test build #117768 has finished for PR 27440 at commit 2ee523b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class NthValue(input: Expression, offset: Expression)
  • class OffsetWindowFunctionFrame(
  • class FixedOffsetWindowFunctionFrame(

@SparkQA
Copy link

SparkQA commented Feb 4, 2020

Test build #117793 has finished for PR 27440 at commit c87069f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@beliefer
Copy link
Contributor Author

beliefer commented Feb 4, 2020

cc @gatorsmile @cloud-fan

inputIndex = 0
}

override def write(index: Int, current: InternalRow): Unit = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NTH_VALUE should return the same value for all rows in the window partition right? So why are you doing so much heavy lifting here? Everything can be computed in prepare.

If you think about it, then this could also be treated as an unbounded window frame. You could even move this into the UnboundedWindowFunctionFrame if you add the update logic to the NTH_VALUE aggregate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the behavior of NTH_VALUE/FIRST_VALUE/LAST_VALUE more similar to LEAD/LAG, so I make FixedOffsetWindowFunctionFrame extends OffsetWindowFunctionFrame. OffsetWindowFunctionFrame only to fetch projects but UnboundedWindowFunctionFrame will update and execute expressions.
If you think so, I can abstract a base class for FixedOffsetWindowFunctionFrame and OffsetWindowFunctionFrame.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have optimized the code so that the skip operator runs once in prepare.

/**
* Whether the offset is based on the entire frame.
*/
val isWholeBased: Boolean = false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just special case the NTH_VALUE aggregate function instead of doing this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want NTH_VALUE/FIRST_VALUE/LAST_VALUE use this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would we use this for FIRST and LAST? How would you deal with IGNORE NULLS?

Copy link
Contributor Author

@beliefer beliefer Feb 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FIRST and LAST is aggregate function, there were PRs who used them as FIRST_VALUE/LAST_VALUE.
You can reference #25082.
The above mentioned PR has been reverted.
All of LEAD/LAG/NTH_VALUE/FIRST_VALUE/LAST_VALUE need IGNORE NULLS, I think we should handle in a consistent way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LEAD/LAG not support IGNORE NULLS now. Maybe the current implement only considered postgreSQL. So I think as long as NTH_VALUE/FIRST_VALUE/LAST_VALUE use the same way as LEAD/LAG, once the time is right, we can uniformly provide them with support for IGNORE NULLS.

} else {
offset.eval() match {
case i: Int if i <= 0 => TypeCheckFailure(
s"The 'offset' argument of nth_value must be greater than zero but it is $i.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

greater or equal to?

Copy link
Contributor Author

@beliefer beliefer Feb 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, 'offset' must greater than zero.
The description of Redshift

offset
Determines the row number relative to the first row in the window for which to return the expression. The offset can be a constant or an expression and must be a positive integer that is greater than 0.

The description of Vertica
row‑number | Specifies the row to evaluate, where row‑number evaluates to an integer ≥ 1.
The description of Presto
It is an error for the offset to be zero or negative.
The description of Oracle
n determines the nth row for which the measure value is to be returned. n can be a constant, bind variable, column, or an expression involving them, as long as it resolves to a positive integer. The function returns NULL if the data source window has fewer than n rows. If n is null, then the function returns an error.
The description of Postgresql
returns value evaluated at the row that is the nth row of the window frame (counting from 1); null if no such row

if (check.isFailure) {
check
} else {
offset.eval() match {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a check to make sure it is foldable

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Thanks.

inputSchema: Seq[Attribute],
newMutableProjection: (Seq[Expression], Seq[Attribute]) => MutableProjection,
offset: Int)
extends OffsetWindowFunctionFrame(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comment below. I don't think you should extend this, if you need some functionality just put in in a common base class.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see.

@@ -157,6 +157,38 @@ final class OffsetWindowFunctionFrame(
override def currentUpperBound(): Int = throw new UnsupportedOperationException()
}

class FixedOffsetWindowFunctionFrame(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add documentation.

I would like to see a unit test for this.

Copy link
Contributor Author

@beliefer beliefer Feb 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want reference the unit test for OffsetWindowFunctionFrame, can you tell me where they are? Thanks.

@SparkQA
Copy link

SparkQA commented Feb 10, 2020

Test build #118128 has finished for PR 27440 at commit 5b3caa9.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@beliefer
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Feb 10, 2020

Test build #118146 has finished for PR 27440 at commit 5b3caa9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 11, 2020

Test build #118229 has finished for PR 27440 at commit 4b3d80b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 12, 2020

Test build #118301 has finished for PR 27440 at commit 591c146.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 12, 2020

Test build #118290 has finished for PR 27440 at commit 714b6d3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • abstract class OffsetWindowFunctionFrameBase(
  • class OffsetWindowFunctionFrame(

@beliefer
Copy link
Contributor Author

beliefer commented Feb 19, 2020

@hvanhovell Could you take a look again? Thanks.

@github-actions
Copy link

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label May 30, 2020
@beliefer
Copy link
Contributor Author

retest this please

@beliefer beliefer closed this May 30, 2020
@beliefer beliefer reopened this May 30, 2020
@SparkQA
Copy link

SparkQA commented May 30, 2020

Test build #123308 has finished for PR 27440 at commit 591c146.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 30, 2020

Test build #123319 has finished for PR 27440 at commit f3930f6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
5 participants