-
Notifications
You must be signed in to change notification settings - Fork 28.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-19451][SQL][Core] Underlying integer overflow in Window function #16818
Conversation
Test build #72432 has finished for PR 16818 at commit
|
@uncleGen I think we should limit this to allowing long values for range frames only; row frames should not get larger than Just make sure we can construct a range frame that respects longs, and throw an error for row frames. |
@hvanhovell Thanks for your suggestions, it is just what I failed to notice or consider. |
This seems totally reasonable |
Test build #72569 has started for PR 16818 at commit |
@hvanhovell After dug deeply into code, I found the range scale has nothing to do with |
cc @cloud-fan also |
Test build #72571 has finished for PR 16818 at commit
|
retest this please. |
@uncleGen could you undo the changes to the WindowFrame's and the BoundOrdering classes? Just change createBoundOrdering and make it convert the long to an int for row frames (after checking that this is possible). |
Test build #72574 has finished for PR 16818 at commit
|
} | ||
|
||
/** | ||
* Compare the input index to the bound of the output index. | ||
*/ | ||
private[window] final case class RowBoundOrdering(offset: Int) extends BoundOrdering { | ||
private[window] final case class RowBoundOrdering(offset: Long) extends BoundOrdering { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hvanhovell do you mean this is unnecessary as we can only support int anyway?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are going to support 64-bit values in a row frame then we also need to support a buffer that can store that many rows. WindowExec
in its current form assumes that a buffer contains less than (1 << 31) - 1
values (which is actually smaller than an 32-bit range can be). I have yet to see a use case where the buffer needs to be larger.
The current PR does not make all the necessary changes to make WindowExec
support a 64-bit buffer (see RowBuffer.size for instance), and I am slightly worried about overflows. It will also be a daunting task to test this properly (you will need to create a buffer with more than 2 billion elements). So I prefer to keep this change local to range frames only.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hvanhovell rowBuffer
is used to buffer all the rows of one partition. As you said, we only support 32-bit values (less than (1 << 31) - 1
). But the row frame or range frame is just to restrict the range to fetch proper value from rowBuffer
. This PR changes the type of offset from Int
to Long
, but will not make the rowBuffer
overflows. Besides, how to support a 64-bit buffer is another topic. Let me know if I understand the wrong. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does not make any sense to make the offsets longs. This is an execution detail, buffer indexes (which are integer bound), and you really should not be messing with those.
Try to keep your change more local, and only modify WindowExec.createBoundOrdering
and the code generating the WindowExec.windowFrameExpressionFactoryPairs
. That should be enough.
268ba58
to
7ae4e48
Compare
cc @hvanhovell and @cloud-fan |
Test build #72844 has finished for PR 16818 at commit
|
case Long.MinValue => UnboundedPreceding | ||
case x if x < 0 => ValuePreceding(-start.toInt) | ||
case x if x > 0 => ValueFollowing(start.toInt) | ||
case x if x < Int.MinValue => UnboundedPreceding |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we throw an exception if x < Int.MinValue
and x > Long.MinValue
? @hvanhovell what do you think?
BTW I remember we have document to explain this behavior, we should update that too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea, the doc is in rangeBetween
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact, the type of start
and end
should not be Long
here, but we can not change it for compatibility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @hvanhovell any ideas?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @hvanhovell and @gatorsmile
Test build #72911 has finished for PR 16818 at commit
|
Could you bring this up-to-date? @uncleGen |
Test build #78498 has finished for PR 16818 at commit
|
ping @hvanhovell |
…undary ## What changes were proposed in this pull request? Long values can be passed to `rangeBetween` as range frame boundaries, but we silently convert it to Int values, this can cause wrong results and we should fix this. Further more, we should accept any legal literal values as range frame boundaries. In this PR, we make it possible for Long values, and make accepting other DataTypes really easy to add. This PR is mostly based on Herman's previous amazing work: hvanhovell@596f53c After this been merged, we can close #16818 . ## How was this patch tested? Add new tests in `DataFrameWindowFunctionsSuite` and `TypeCoercionSuite`. Author: Xingbo Jiang <xingbo.jiang@databricks.com> Closes #18540 from jiangxb1987/rangeFrame. (cherry picked from commit 92d8563) Signed-off-by: gatorsmile <gatorsmile@gmail.com>
…undary ## What changes were proposed in this pull request? Long values can be passed to `rangeBetween` as range frame boundaries, but we silently convert it to Int values, this can cause wrong results and we should fix this. Further more, we should accept any legal literal values as range frame boundaries. In this PR, we make it possible for Long values, and make accepting other DataTypes really easy to add. This PR is mostly based on Herman's previous amazing work: hvanhovell@596f53c After this been merged, we can close apache#16818 . ## How was this patch tested? Add new tests in `DataFrameWindowFunctionsSuite` and `TypeCoercionSuite`. Author: Xingbo Jiang <xingbo.jiang@databricks.com> Closes apache#18540 from jiangxb1987/rangeFrame. (cherry picked from commit 92d8563) Signed-off-by: gatorsmile <gatorsmile@gmail.com>
What changes were proposed in this pull request?
reproduce code:
Everything seems ok, while from value is not too large... Even if the rangeBetween() method supports Long parameters.
But.... If i set -2160000000L value to from it does not work !
It seems like there is an underlying integer overflow issue here, i.e. convert Long to Int:
This pr changes the type of index from Int to Long.
BTW: Is there any reason why the type of index is Int? I do not find any strong point to set like this.
How was this patch tested?
add new unit test