-
Notifications
You must be signed in to change notification settings - Fork 28.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-27951][SQL] Support ANSI SQL NTH_VALUE window function #27440
Conversation
Test build #117768 has finished for PR 27440 at commit
|
Test build #117793 has finished for PR 27440 at commit
|
inputIndex = 0 | ||
} | ||
|
||
override def write(index: Int, current: InternalRow): Unit = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NTH_VALUE should return the same value for all rows in the window partition right? So why are you doing so much heavy lifting here? Everything can be computed in prepare.
If you think about it, then this could also be treated as an unbounded window frame. You could even move this into the UnboundedWindowFunctionFrame
if you add the update logic to the NTH_VALUE aggregate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the behavior of NTH_VALUE/FIRST_VALUE/LAST_VALUE
more similar to LEAD/LAG
, so I make FixedOffsetWindowFunctionFrame
extends OffsetWindowFunctionFrame
. OffsetWindowFunctionFrame
only to fetch projects but UnboundedWindowFunctionFrame
will update
and execute
expressions.
If you think so, I can abstract a base class for FixedOffsetWindowFunctionFrame
and OffsetWindowFunctionFrame
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have optimized the code so that the skip operator runs once in prepare
.
/** | ||
* Whether the offset is based on the entire frame. | ||
*/ | ||
val isWholeBased: Boolean = false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just special case the NTH_VALUE aggregate function instead of doing this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want NTH_VALUE/FIRST_VALUE/LAST_VALUE
use this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would we use this for FIRST
and LAST
? How would you deal with IGNORE NULLS
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FIRST
and LAST
is aggregate function, there were PRs who used them as FIRST_VALUE/LAST_VALUE
.
You can reference #25082.
The above mentioned PR has been reverted.
All of LEAD/LAG/NTH_VALUE/FIRST_VALUE/LAST_VALUE
need IGNORE NULLS
, I think we should handle in a consistent way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LEAD/LAG
not support IGNORE NULLS
now. Maybe the current implement only considered postgreSQL
. So I think as long as NTH_VALUE/FIRST_VALUE/LAST_VALUE
use the same way as LEAD/LAG
, once the time is right, we can uniformly provide them with support for IGNORE NULLS
.
} else { | ||
offset.eval() match { | ||
case i: Int if i <= 0 => TypeCheckFailure( | ||
s"The 'offset' argument of nth_value must be greater than zero but it is $i.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
greater or equal to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, 'offset' must greater than zero.
The description of Redshift
offset
Determines the row number relative to the first row in the window for which to return the expression. The offset can be a constant or an expression and must be a positive integer that is greater than 0.
The description of Vertica
row‑number | Specifies the row to evaluate, where row‑number evaluates to an integer ≥ 1.
The description of Presto
It is an error for the offset to be zero or negative.
The description of Oracle
n determines the nth row for which the measure value is to be returned. n can be a constant, bind variable, column, or an expression involving them, as long as it resolves to a positive integer. The function returns NULL if the data source window has fewer than n rows. If n is null, then the function returns an error.
The description of Postgresql
returns value evaluated at the row that is the nth row of the window frame (counting from 1); null if no such row
if (check.isFailure) { | ||
check | ||
} else { | ||
offset.eval() match { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a check to make sure it is foldable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. Thanks.
inputSchema: Seq[Attribute], | ||
newMutableProjection: (Seq[Expression], Seq[Attribute]) => MutableProjection, | ||
offset: Int) | ||
extends OffsetWindowFunctionFrame( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my comment below. I don't think you should extend this, if you need some functionality just put in in a common base class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see.
@@ -157,6 +157,38 @@ final class OffsetWindowFunctionFrame( | |||
override def currentUpperBound(): Int = throw new UnsupportedOperationException() | |||
} | |||
|
|||
class FixedOffsetWindowFunctionFrame( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add documentation.
I would like to see a unit test for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want reference the unit test for OffsetWindowFunctionFrame
, can you tell me where they are? Thanks.
Test build #118128 has finished for PR 27440 at commit
|
retest this please |
Test build #118146 has finished for PR 27440 at commit
|
Test build #118229 has finished for PR 27440 at commit
|
Test build #118301 has finished for PR 27440 at commit
|
Test build #118290 has finished for PR 27440 at commit
|
@hvanhovell Could you take a look again? Thanks. |
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
retest this please |
Test build #123308 has finished for PR 27440 at commit
|
Test build #123319 has finished for PR 27440 at commit
|
What changes were proposed in this pull request?
The
NTH_VALUE
function is an ANSI SQL.For examples:
There are some mainstream database support the syntax.
PostgreSQL:
https://www.postgresql.org/docs/8.4/functions-window.html
Vertica:
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/Analytic/NTH_VALUEAnalytic.htm?tocpath=SQL%20Reference%20Manual%7CSQL%20Functions%7CAnalytic%20Functions%7C_____23
Oracle:
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/NTH_VALUE.html#GUID-F8A0E88C-67E5-4AA6-9515-95D03A7F9EA0
Redshift
https://docs.aws.amazon.com/redshift/latest/dg/r_WF_NTH.html
Presto
https://prestodb.io/docs/current/functions/window.html
Why are the changes needed?
The
NTH_VALUE
function is an ANSI SQL.The
NTH_VALUE
function is very useful.Does this PR introduce any user-facing change?
No
How was this patch tested?
Exists and new UT.