Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-19451][SQL][Core] Underlying integer overflow in Window function #16818

Closed
wants to merge 5 commits into from

Conversation

uncleGen
Copy link
Contributor

@uncleGen uncleGen commented Feb 6, 2017

What changes were proposed in this pull request?

reproduce code:

val tw =  Window.orderBy("date")
      .partitionBy("id")
      .rangeBetween( from , 0)

Everything seems ok, while from value is not too large... Even if the rangeBetween() method supports Long parameters.
But.... If i set -2160000000L value to from it does not work !

It seems like there is an underlying integer overflow issue here, i.e. convert Long to Int:

private def between(typ: FrameType, start: Long, end: Long): WindowSpec = {
    val boundaryStart = start match {
      case 0 => CurrentRow
      case Long.MinValue => UnboundedPreceding
      case x if x < 0 => ValuePreceding(-start.toInt)
      case x if x > 0 => ValueFollowing(start.toInt)
    }

    val boundaryEnd = end match {
      case 0 => CurrentRow
      case Long.MaxValue => UnboundedFollowing
      case x if x < 0 => ValuePreceding(-end.toInt)
      case x if x > 0 => ValueFollowing(end.toInt)
    }

    new WindowSpec(
      partitionSpec,
      orderSpec,
      SpecifiedWindowFrame(typ, boundaryStart, boundaryEnd))
  }

This pr changes the type of index from Int to Long.

BTW: Is there any reason why the type of index is Int? I do not find any strong point to set like this.

How was this patch tested?

add new unit test

@SparkQA
Copy link

SparkQA commented Feb 6, 2017

Test build #72432 has finished for PR 16818 at commit ea1f440.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ValuePreceding(value: Long) extends FrameBoundary
  • case class ValueFollowing(value: Long) extends FrameBoundary

@hvanhovell
Copy link
Contributor

@uncleGen I think we should limit this to allowing long values for range frames only; row frames should not get larger than 1 << 31 + 1. The reason for this is that we also need to be able to buffer that many rows and that this currently both not practical (I have yet too see someone hitting this limit), and that WindowExec assumes that the buffers are integer bound (see RowBuffer.size for instance). Also testing this will be a total PITA.

Just make sure we can construct a range frame that respects longs, and throw an error for row frames.

@uncleGen
Copy link
Contributor Author

uncleGen commented Feb 6, 2017

@hvanhovell Thanks for your suggestions, it is just what I failed to notice or consider.

@julienchamp
Copy link

Just make sure we can construct a range frame that respects longs, and throw an error for row frames.

This seems totally reasonable

@SparkQA
Copy link

SparkQA commented Feb 8, 2017

Test build #72569 has started for PR 16818 at commit 268ba58.

@uncleGen
Copy link
Contributor Author

uncleGen commented Feb 8, 2017

@hvanhovell After dug deeply into code, I found the range scale has nothing to do with RowBuffer, so there is no need to limit this to allowing long values for range frames only, and this pr works well with row frames, at least depends on what I test manually. I have updated the unit test, could you please take a review? Any suggestion is appreciated.

@uncleGen
Copy link
Contributor Author

uncleGen commented Feb 8, 2017

cc @cloud-fan also

@SparkQA
Copy link

SparkQA commented Feb 8, 2017

Test build #72571 has finished for PR 16818 at commit 268ba58.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@uncleGen
Copy link
Contributor Author

uncleGen commented Feb 8, 2017

retest this please.

@hvanhovell
Copy link
Contributor

@uncleGen could you undo the changes to the WindowFrame's and the BoundOrdering classes? Just change createBoundOrdering and make it convert the long to an int for row frames (after checking that this is possible).

@SparkQA
Copy link

SparkQA commented Feb 8, 2017

Test build #72574 has finished for PR 16818 at commit 268ba58.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}

/**
* Compare the input index to the bound of the output index.
*/
private[window] final case class RowBoundOrdering(offset: Int) extends BoundOrdering {
private[window] final case class RowBoundOrdering(offset: Long) extends BoundOrdering {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hvanhovell do you mean this is unnecessary as we can only support int anyway?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are going to support 64-bit values in a row frame then we also need to support a buffer that can store that many rows. WindowExec in its current form assumes that a buffer contains less than (1 << 31) - 1 values (which is actually smaller than an 32-bit range can be). I have yet to see a use case where the buffer needs to be larger.

The current PR does not make all the necessary changes to make WindowExec support a 64-bit buffer (see RowBuffer.size for instance), and I am slightly worried about overflows. It will also be a daunting task to test this properly (you will need to create a buffer with more than 2 billion elements). So I prefer to keep this change local to range frames only.

Copy link
Contributor Author

@uncleGen uncleGen Feb 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hvanhovell rowBuffer is used to buffer all the rows of one partition. As you said, we only support 32-bit values (less than (1 << 31) - 1). But the row frame or range frame is just to restrict the range to fetch proper value from rowBuffer. This PR changes the type of offset from Int to Long, but will not make the rowBuffer overflows. Besides, how to support a 64-bit buffer is another topic. Let me know if I understand the wrong. Thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not make any sense to make the offsets longs. This is an execution detail, buffer indexes (which are integer bound), and you really should not be messing with those.

Try to keep your change more local, and only modify WindowExec.createBoundOrdering and the code generating the WindowExec.windowFrameExpressionFactoryPairs. That should be enough.

@uncleGen
Copy link
Contributor Author

cc @hvanhovell and @cloud-fan

@SparkQA
Copy link

SparkQA commented Feb 14, 2017

Test build #72844 has finished for PR 16818 at commit 7ae4e48.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ValuePreceding(value: Int) extends FrameBoundary
  • case class ValueFollowing(value: Int) extends FrameBoundary

case Long.MinValue => UnboundedPreceding
case x if x < 0 => ValuePreceding(-start.toInt)
case x if x > 0 => ValueFollowing(start.toInt)
case x if x < Int.MinValue => UnboundedPreceding
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we throw an exception if x < Int.MinValue and x > Long.MinValue? @hvanhovell what do you think?

BTW I remember we have document to explain this behavior, we should update that too

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea, the doc is in rangeBetween

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, the type of start and end should not be Long here, but we can not change it for compatibility.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @hvanhovell any ideas?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SparkQA
Copy link

SparkQA commented Feb 15, 2017

Test build #72911 has finished for PR 16818 at commit c65de9a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987
Copy link
Contributor

Could you bring this up-to-date? @uncleGen

@SparkQA
Copy link

SparkQA commented Jun 23, 2017

Test build #78498 has finished for PR 16818 at commit 8722d43.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987
Copy link
Contributor

ping @hvanhovell

asfgit pushed a commit that referenced this pull request Jul 29, 2017
…undary

## What changes were proposed in this pull request?

Long values can be passed to `rangeBetween` as range frame boundaries, but we silently convert it to Int values, this can cause wrong results and we should fix this.

Further more, we should accept any legal literal values as range frame boundaries. In this PR, we make it possible for Long values, and make accepting other DataTypes really easy to add.

This PR is mostly based on Herman's previous amazing work: hvanhovell@596f53c

After this been merged, we can close #16818 .

## How was this patch tested?

Add new tests in `DataFrameWindowFunctionsSuite` and `TypeCoercionSuite`.

Author: Xingbo Jiang <xingbo.jiang@databricks.com>

Closes #18540 from jiangxb1987/rangeFrame.

(cherry picked from commit 92d8563)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
@asfgit asfgit closed this in 92d8563 Jul 29, 2017
MatthewRBruce pushed a commit to Shopify/spark that referenced this pull request Jul 31, 2018
…undary

## What changes were proposed in this pull request?

Long values can be passed to `rangeBetween` as range frame boundaries, but we silently convert it to Int values, this can cause wrong results and we should fix this.

Further more, we should accept any legal literal values as range frame boundaries. In this PR, we make it possible for Long values, and make accepting other DataTypes really easy to add.

This PR is mostly based on Herman's previous amazing work: hvanhovell@596f53c

After this been merged, we can close apache#16818 .

## How was this patch tested?

Add new tests in `DataFrameWindowFunctionsSuite` and `TypeCoercionSuite`.

Author: Xingbo Jiang <xingbo.jiang@databricks.com>

Closes apache#18540 from jiangxb1987/rangeFrame.

(cherry picked from commit 92d8563)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants