[SPARK-19451][SQL][Core] Underlying integer overflow in Window function #16818

uncleGen · 2017-02-06T09:43:34Z

What changes were proposed in this pull request?

Code to reproduce:

val tw =  Window.orderBy("date")
      .partitionBy("id")
      .rangeBetween( from , 0)

Everything works as long as the value of from is not too large. However, even though the rangeBetween() method accepts Long parameters, setting from to -2160000000L does not work.

There appears to be an underlying integer overflow here: the Long boundaries are converted to Int in between():

private def between(typ: FrameType, start: Long, end: Long): WindowSpec = {
  val boundaryStart = start match {
    case 0 => CurrentRow
    case Long.MinValue => UnboundedPreceding
    case x if x < 0 => ValuePreceding(-start.toInt)
    case x if x > 0 => ValueFollowing(start.toInt)
  }

  val boundaryEnd = end match {
    case 0 => CurrentRow
    case Long.MaxValue => UnboundedFollowing
    case x if x < 0 => ValuePreceding(-end.toInt)
    case x if x > 0 => ValueFollowing(end.toInt)
  }

  new WindowSpec(
    partitionSpec,
    orderSpec,
    SpecifiedWindowFrame(typ, boundaryStart, boundaryEnd))
}
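
To make the failure concrete, here is a small sketch that traces what the conversion in between() does to the value from the report. The object and value names are made up for this illustration; only the `.toInt` call and the negation mirror the quoted code.

```scala
object OverflowDemo {
  // The lower bound from the report; it does not fit in a 32-bit Int.
  val from: Long = -2160000000L

  // Long.toInt keeps only the low 32 bits, so the sign flips here.
  val truncated: Int = from.toInt

  // Mirrors `ValuePreceding(-start.toInt)`: the method call binds tighter
  // than the unary minus, so this is -(from.toInt).
  val precedingOffset: Int = -from.toInt

  def main(args: Array[String]): Unit = {
    println(s"from        = $from")
    println(s"from.toInt  = $truncated")
    println(s"-from.toInt = $precedingOffset")
  }
}
```

With this input the truncated value is 2134967296, so the "preceding" offset becomes -2134967296, which is not the frame the caller asked for.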

This PR changes the type of the index from Int to Long.

BTW, is there any reason why the type of the index is Int? I could not find a strong reason for it.

How was this patch tested?

Added a new unit test.

SparkQA · 2017-02-06T12:10:52Z

Test build #72432 has finished for PR 16818 at commit ea1f440.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class ValuePreceding(value: Long) extends FrameBoundary
case class ValueFollowing(value: Long) extends FrameBoundary

hvanhovell · 2017-02-06T13:02:06Z

@uncleGen I think we should limit this to allowing long values for range frames only; row frames should not get larger than (1 << 31) - 1. The reason is that we would also need to be able to buffer that many rows, which is currently not practical (I have yet to see someone hit this limit), and that WindowExec assumes the buffers are integer bound (see RowBuffer.size for instance). Testing this would also be a total PITA.

Just make sure we can construct a range frame that respects longs, and throw an error for row frames.
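
One way to read this suggestion is as a check that depends on the frame type. The sketch below is only an illustration of that idea: `FrameKind`, `RowFrame`, `RangeFrame` and `checkedOffset` are hypothetical names, not Spark classes or methods.

```scala
object FrameBoundsSketch {
  // Hypothetical stand-ins for the two frame types; not the real Spark classes.
  sealed trait FrameKind
  case object RowFrame extends FrameKind
  case object RangeFrame extends FrameKind

  /**
   * Range frames keep the full Long offset. Row frames are rejected with an
   * IllegalArgumentException when the offset does not fit in a 32-bit Int.
   */
  def checkedOffset(kind: FrameKind, offset: Long): Long = kind match {
    case RangeFrame => offset
    case RowFrame =>
      require(
        offset >= Int.MinValue && offset <= Int.MaxValue,
        s"Row frame offset $offset does not fit in a 32-bit integer")
      offset
  }
}
```

Under these assumptions, `checkedOffset(RangeFrame, -2160000000L)` returns the value unchanged, while the same offset with `RowFrame` fails fast instead of being truncated silently.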

uncleGen · 2017-02-06T14:02:18Z

@hvanhovell Thanks for your suggestions; this is exactly what I had failed to consider.

julienchamp · 2017-02-06T14:20:56Z

Just make sure we can construct a range frame that respects longs, and throw an error for row frames.

This seems totally reasonable

SparkQA · 2017-02-08T07:58:38Z

Test build #72569 has started for PR 16818 at commit 268ba58.

uncleGen · 2017-02-08T08:01:55Z

@hvanhovell After digging deeper into the code, I found that the range scale has nothing to do with RowBuffer, so there is no need to restrict long values to range frames only. This PR also works well with row frames, at least according to my manual tests. I have updated the unit test; could you please review it? Any suggestions are appreciated.

uncleGen · 2017-02-08T08:03:50Z

cc @cloud-fan also

SparkQA · 2017-02-08T08:21:20Z

Test build #72571 has finished for PR 16818 at commit 268ba58.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

uncleGen · 2017-02-08T08:23:20Z

retest this please.

hvanhovell · 2017-02-08T09:37:50Z

@uncleGen could you undo the changes to the WindowFrame and BoundOrdering classes? Just change createBoundOrdering and make it convert the long to an int for row frames (after checking that this is possible).

SparkQA · 2017-02-08T10:59:50Z

Test build #72574 has finished for PR 16818 at commit 268ba58.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-02-08T11:22:11Z

sql/core/src/main/scala/org/apache/spark/sql/execution/window/BoundOrdering.scala

 }

 /**
 * Compare the input index to the bound of the output index.
 */
-private[window] final case class RowBoundOrdering(offset: Int) extends BoundOrdering {
+private[window] final case class RowBoundOrdering(offset: Long) extends BoundOrdering {


@hvanhovell do you mean this is unnecessary as we can only support int anyway?

hvanhovell · 2017-02-08

If we are going to support 64-bit values in a row frame, then we also need to support a buffer that can store that many rows. WindowExec in its current form assumes that a buffer contains fewer than (1 << 31) - 1 values (which is actually smaller than a 32-bit range can be). I have yet to see a use case where the buffer needs to be larger.

The current PR does not make all the changes needed for WindowExec to support a 64-bit buffer (see RowBuffer.size for instance), and I am slightly worried about overflows. It will also be a daunting task to test this properly (you would need to create a buffer with more than 2 billion elements). So I prefer to keep this change local to range frames only.

uncleGen · 2017-02-08

@hvanhovell rowBuffer is used to buffer all the rows of one partition. As you said, we only support 32-bit values (less than (1 << 31) - 1). But the row frame or range frame only restricts the range used to fetch the proper values from rowBuffer. This PR changes the type of offset from Int to Long, but that will not make the rowBuffer overflow. Besides, how to support a 64-bit buffer is another topic. Let me know if I have misunderstood. Thanks!

hvanhovell · 2017-02-10

It does not make any sense to make the offsets longs. These are an execution detail: they are buffer indexes (which are integer bound), and you really should not be messing with those.

Try to keep your change more local, and only modify WindowExec.createBoundOrdering and the code generating the WindowExec.windowFrameExpressionFactoryPairs. That should be enough.
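
The point about buffer indexes can be illustrated with a toy model. The sketch below is not how WindowExec is implemented; `boundIndex` is a hypothetical helper that assumes a partition buffer of `size` rows (with `size` > 0) and a current row index `current` inside it.

```scala
object RowFrameIndexSketch {
  /**
   * Index of a row-frame bound for the row at `current` in a buffer of `size` rows.
   * The offset is first clamped to [-size, size] so that the addition below
   * cannot overflow, then the result is clamped to a valid buffer index.
   */
  def boundIndex(current: Int, offset: Long, size: Int): Int = {
    val clampedOffset = math.max(-size.toLong, math.min(offset, size.toLong))
    val raw = current.toLong + clampedOffset
    math.max(0L, math.min(raw, size.toLong - 1L)).toInt
  }
}
```

In this model, any row offset whose magnitude exceeds the buffer size lands on the first or last row, exactly as an unbounded frame would. Because `size` is an Int, a Long row offset cannot select anything an Int offset could not.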

uncleGen · 2017-02-14T02:58:15Z

cc @hvanhovell and @cloud-fan

SparkQA · 2017-02-14T05:13:28Z

Test build #72844 has finished for PR 16818 at commit 7ae4e48.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class ValuePreceding(value: Int) extends FrameBoundary
case class ValueFollowing(value: Int) extends FrameBoundary

cloud-fan · 2017-02-14T19:27:05Z

sql/core/src/main/scala/org/apache/spark/sql/expressions/WindowSpec.scala

-      case Long.MinValue => UnboundedPreceding
-      case x if x < 0 => ValuePreceding(-start.toInt)
-      case x if x > 0 => ValueFollowing(start.toInt)
+      case x if x < Int.MinValue => UnboundedPreceding


Shall we throw an exception if x < Int.MinValue and x > Long.MinValue? @hvanhovell what do you think?

BTW, I remember we have documentation explaining this behavior; we should update that too.

Yes, the doc is in rangeBetween.

In fact, the type of start and end should not be Long here, but we cannot change it, for compatibility reasons.

cc @hvanhovell any ideas?

cc @hvanhovell and @gatorsmile
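
The alternative raised in this review thread, throwing for values between Long.MinValue and Int.MinValue instead of treating them as unbounded, could look roughly like the sketch below. The boundary classes are simplified stand-ins for the ones quoted in this thread, and `strictStart` is a hypothetical name; none of this is the code that was eventually merged.

```scala
object BoundarySketch {
  // Simplified stand-ins for the frame boundary classes quoted in this thread.
  sealed trait Boundary
  case object CurrentRow extends Boundary
  case object UnboundedPreceding extends Boundary
  final case class ValuePreceding(value: Int) extends Boundary
  final case class ValueFollowing(value: Int) extends Boundary

  /** Strict variant: only Long.MinValue means "unbounded"; other out-of-range values fail. */
  def strictStart(start: Long): Boundary = start match {
    case 0L => CurrentRow
    case Long.MinValue => UnboundedPreceding
    // Int.MinValue itself is rejected too, because negating it would overflow.
    case x if x < -Int.MaxValue || x > Int.MaxValue =>
      throw new IllegalArgumentException(s"Boundary start $x is out of Int range")
    case x if x < 0 => ValuePreceding(-x.toInt)
    case x => ValueFollowing(x.toInt)
  }
}
```

With this variant, -2160000000L raises an error rather than silently widening the frame to the whole partition, at the cost of rejecting input that the Long-typed signature appears to allow.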

SparkQA · 2017-02-15T04:06:56Z

Test build #72911 has finished for PR 16818 at commit c65de9a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jiangxb1987 · 2017-06-20T15:19:17Z

Could you bring this up-to-date? @uncleGen

SparkQA · 2017-06-23T05:58:51Z

Test build #78498 has finished for PR 16818 at commit 8722d43.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jiangxb1987 · 2017-06-23T09:38:42Z

ping @hvanhovell

…undary

## What changes were proposed in this pull request?

Long values can be passed to `rangeBetween` as range frame boundaries, but we silently convert them to Int values. This can cause wrong results, and we should fix it.

Furthermore, we should accept any legal literal values as range frame boundaries. In this PR, we make it possible to use Long values, and make accepting other DataTypes really easy to add.

This PR is mostly based on Herman's previous amazing work: hvanhovell@596f53c

After this has been merged, we can close #16818.

## How was this patch tested?

Add new tests in `DataFrameWindowFunctionsSuite` and `TypeCoercionSuite`.

Author: Xingbo Jiang <xingbo.jiang@databricks.com>

Closes #18540 from jiangxb1987/rangeFrame.

(cherry picked from commit 92d8563)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>

cloud-fan reviewed Feb 8, 2017

uncleGen added 3 commits February 13, 2017 18:13

change the type of index from Int to Long

8750485

fix and add unit test

942d8f8

address the comment from hvanhovell

7ae4e48

uncleGen force-pushed the SPARK-19451 branch from 268ba58 to 7ae4e48 on February 14, 2017 02:56

cloud-fan reviewed Feb 14, 2017

update doc

c65de9a

Merge branch 'master' into SPARK-19451

8722d43

jiangxb1987 mentioned this pull request Jul 5, 2017

[SPARK-19451][SQL] rangeBetween method should accept Long value as boundary #18540

Closed

asfgit closed this in 92d8563 Jul 29, 2017


