
[SPARK-19727][SQL] Fix for round function that modifies original column #17075

Closed
wants to merge 7 commits into from

Conversation

@wojtek-szymanski (Contributor) commented Feb 27, 2017

What changes were proposed in this pull request?

Fix for the SQL round function, which modifies the original column when the underlying DataFrame is created from a local Product.

import org.apache.spark.sql.functions._

case class NumericRow(value: BigDecimal)

val df = spark.createDataFrame(Seq(NumericRow(BigDecimal("1.23456789"))))

df.show()
+--------------------+
|               value|
+--------------------+
|1.234567890000000000|
+--------------------+

df.withColumn("value_rounded", round('value)).show()

// before
+--------------------+-------------+
|               value|value_rounded|
+--------------------+-------------+
|1.000000000000000000|            1|
+--------------------+-------------+

// after
+--------------------+-------------+
|               value|value_rounded|
+--------------------+-------------+
|1.234567890000000000|            1|
+--------------------+-------------+

How was this patch tested?

New unit test added to existing suite org.apache.spark.sql.MathFunctionsSuite

@srowen (Member) commented Feb 27, 2017

I don't know the code well enough to really evaluate this, but I see that .clone() is called in a similar context in decimalExpressions. There are also similar usages of changePrecision in UnsafeArrayWriter and UnsafeRowWriter; I wonder if they are affected too?

CC maybe @cloud-fan or @yjshen ?

@cloud-fan (Contributor)

I think we should fix changePrecision to return a new instance instead of updating itself.

@wojtek-szymanski (Contributor, Author)

Good idea @cloud-fan. I will look for usages of changePrecision then.

@wojtek-szymanski (Contributor, Author)

I have just started refactoring changePrecision in order to make it immutable.
My idea was to change the signature from:
def changePrecision(precision: Int, scale: Int, mode: Int): Boolean
into
def changePrecision(precision: Int, scale: Int, mode: Int): Option[Decimal]

Here are my first thoughts:

  • org.apache.spark.sql.types.Decimal is mutable by definition, so making one method immutable makes its contract very inconsistent

  • I am afraid of performance degradation in micro-benchmarks since in some use cases, an instance needs to be created twice

  • changePrecision is called 10 times in Scala, 10 times in code-gen functions, and 3 times in Java unsafe writers (UnsafeArrayWriter, UnsafeRowWriter)

I would be grateful if you could confirm if it's the right way to go.

@cloud-fan (Contributor)

How about we add a new method toPrecision that returns Option[Decimal]? Most of the time we should call toPrecision, but for some performance-critical paths we should call changePrecision.
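The contract being proposed could be sketched as follows. This is a self-contained toy model, not Spark's actual Decimal class: changePrecision mutates the receiver in place, while toPrecision clones first and returns an Option, leaving the receiver untouched.

```scala
import java.math.{BigDecimal => JBigDecimal, RoundingMode}

// Toy stand-in for Spark's Decimal, just enough to show the two contracts.
final class Dec(private var value: JBigDecimal) {
  // Mutates the receiver; returns false if the target precision would overflow.
  def changePrecision(precision: Int, scale: Int, mode: RoundingMode): Boolean = {
    val rounded = value.setScale(scale, mode)
    if (rounded.precision > precision) false
    else { value = rounded; true }
  }

  // Immutable variant: copy first, mutate the copy, leave `this` unchanged.
  def toPrecision(precision: Int, scale: Int, mode: RoundingMode): Option[Dec] = {
    val copy = new Dec(value)
    if (copy.changePrecision(precision, scale, mode)) Some(copy) else None
  }

  override def toString: String = value.toPlainString
}

val d = new Dec(new JBigDecimal("1.23456789"))
val rounded = d.toPrecision(10, 0, RoundingMode.HALF_UP)
println(d)        // original is unchanged: 1.23456789
println(rounded)  // Some(1)
```

The key design point is that the expensive mutating path stays available for hot loops, while callers who hold a shared reference get a safe, copying variant.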

@wojtek-szymanski (Contributor, Author)

I think it makes more sense now; please have a look.


private[this] def castToDecimal(from: DataType, target: DecimalType): Any => Any = from match {
  case StringType =>
    buildCast[UTF8String](_, s => try {
-     changePrecision(Decimal(new JavaBigDecimal(s.toString)), target)
+     toPrecision(Decimal(new JavaBigDecimal(s.toString)), target)
Contributor:

Looks like here we don't need to create a new instance?

Contributor Author:

agree

case dt: DecimalType =>
b => changePrecision(b.asInstanceOf[Decimal].clone(), target)
Contributor:

I think this is the only case where we need toPrecision

Contributor Author:

Nope, there is one more here:

case BooleanType =>
  buildCast[Boolean](_, b => toPrecision(if (b) Decimal.ONE else Decimal.ZERO, target))

Both ONE and ZERO are singletons, so changing precision on them in place is not a good idea.
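Why mutating a shared singleton is dangerous can be shown with a minimal sketch. The MutableDec class and names below are illustrative, not Spark's API: every later reader of the singleton observes the "local" change.

```scala
// Hypothetical mutable decimal; illustrative only, not Spark's Decimal.
final class MutableDec(var scale: Int) {
  def changeScale(s: Int): Unit = { scale = s }  // mutates in place
}

object Consts { val ONE = new MutableDec(0) }    // shared singleton

val x = Consts.ONE         // `x` aliases the singleton, it is not a copy
x.changeScale(18)          // intended as a local adjustment...
println(Consts.ONE.scale)  // ...but prints 18: the singleton itself changed
```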

*
* @return `Some(decimal)` if successful or `None` if overflow would occur
*/
private[sql] def toPrecision(precision: Int, scale: Int,
@cloud-fan (Contributor) Mar 1, 2017:

nit: the style should be

def xxx(
    para1: xxx,
    para2: xxx): T = {

Contributor Author:

Fixed, thanks

- value.changePrecision(
-   DecimalType.bounded(precision - scale + 1, 0).precision, 0, ROUND_FLOOR)
- value
+ toPrecision(DecimalType.bounded(precision - scale + 1, 0).precision, 0, ROUND_FLOOR)
Contributor:

shall we assume toPrecision will always return Some here?

Contributor Author:

Theoretically, it should be Some. On the other hand, if something goes wrong when setting the new precision in floor or ceil, I would raise an internal error:

def floor: Decimal = if (scale == 0) this else {
  val newPrecision = DecimalType.bounded(precision - scale + 1, 0).precision
  toPrecision(newPrecision, 0, ROUND_FLOOR).getOrElse(
    throw new AnalysisException(s"Overflow when setting precision to $newPrecision"))
}

@@ -233,6 +233,18 @@ class MathFunctionsSuite extends QueryTest with SharedSQLContext {
)
}

test("round/bround with data frame from a local Seq of Product") {
val df = spark.createDataFrame(Seq(NumericRow(BigDecimal("5.9"))))
Contributor:

we don't need to create NumericRow, try Seq(BigDecimal("5.9")).toDF("value")

Contributor Author:

Actually, the problem occurs only when creating a DataFrame from a Product. I was unable to reproduce the issue with Seq(BigDecimal("5.9")).toDF("value").

Contributor:

this is weird, can you look into it? spark.createDataset(Seq(BigDecimal("5.9"))) should produce the same result.

Contributor Author:

Sure, I will try to investigate where the difference is.

Contributor Author:

The fundamental difference is in the underlying row type assigned to the DataFrame/Dataset. A DataFrame is based on GenericInternalRow, while a Dataset uses UnsafeRow. During evaluation of the round expression, the method getDecimal is called on a row, see BoundAttribute.scala#L52. As a result, GenericInternalRow returns just an element of an array, which points to the reference of the original column, see rows.scala#L200. The strategy used in UnsafeRow is completely different: a new decimal instance is created, see UnsafeRow.java#L399.
I hope this helps explain why only the DataFrame is affected.
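The aliasing described above can be sketched with a toy row (illustrative names, not Spark's classes): a GenericInternalRow-style accessor hands out the stored reference itself, while an UnsafeRow-style accessor reconstructs a fresh object on every read.

```scala
// Toy row backed by a plain array of mutable values.
final class ToyRow(values: Array[StringBuilder]) {
  // GenericInternalRow-style: returns the stored reference (aliased).
  def get(i: Int): StringBuilder = values(i)
  // UnsafeRow-style: materializes a fresh instance on every read.
  def getCopy(i: Int): StringBuilder = new StringBuilder(values(i).toString)
}

val row = new ToyRow(Array(new StringBuilder("1.23456789")))

row.getCopy(0).setLength(1)  // mutating a fresh copy is harmless
println(row.get(0))          // still 1.23456789

row.get(0).setLength(1)      // "rounding" the aliased reference...
println(row.get(0))          // ...corrupts the row itself: 1
```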

-   DecimalType.bounded(precision - scale + 1, 0).precision, 0, ROUND_FLOOR)
- value
+ toPrecision(DecimalType.bounded(precision - scale + 1, 0).precision, 0, ROUND_FLOOR)
+   .getOrElse(clone())
Contributor:

This might end up creating two copies of the object in the worst case:

  • once in toPrecision
  • second time on this line

Old logic would guarantee a single copy.

Contributor Author:

You're right, thanks. My suggestion is to raise an internal error if setting new precision in floor or ceil would fail.

-   DecimalType.bounded(precision - scale + 1, 0).precision, 0, ROUND_CEILING)
- value
+ toPrecision(DecimalType.bounded(precision - scale + 1, 0).precision, 0, ROUND_CEILING)
+   .getOrElse(clone())
Contributor:

same as above

Contributor Author:

See my comment above

@@ -193,7 +193,7 @@ class DecimalSuite extends SparkFunSuite with PrivateMethodTester {
assert(Decimal(Long.MaxValue, 100, 0).toUnscaledLong === Long.MaxValue)
}

- test("changePrecision() on compact decimal should respect rounding mode") {
+ test("changePrecision/toPrecission on compact decimal should respect rounding mode") {
Contributor:

nit: typo in toPrecission

Contributor Author:

Thanks, fixed

* @return `Some(decimal)` if successful or `None` if overflow would occur
*/
private[sql] def toPrecision(
precision: Int, scale: Int,
Contributor:

code style...

Contributor Author:

Fixed, thanks

)
checkAnswer(
df.withColumn("value_rounded", bround('value)),
Seq(Row(BigDecimal("5.9"), BigDecimal("6")))
Contributor:

why test it twice?

Contributor Author:

The bround function is also affected. Column value_rounded was renamed to value_brounded.

*
* @return `Some(decimal)` if successful or `None` if overflow would occur
*/
private[sql] def toPrecision(
Contributor:

style:

def xxx(
    para1: xxx,
    para2: xxx): XXX

Contributor Author:

fixed

@@ -422,3 +434,4 @@ class MathFunctionsSuite extends QueryTest with SharedSQLContext {
checkAnswer(df.selectExpr("positive(b)"), Row(-1))
}
}
case class NumericRow(value : BigDecimal)
Contributor:

let's just use Tuple1 instead of creating this class

Contributor Author:

replaced with Tuple1

* @return `Some(decimal)` if successful or `None` if overflow would occur
*/
private[sql] def toPrecision(
precision: Int,
Contributor:

4-space indentation here. Please take a look at other methods in Spark and follow the code style.

Contributor Author:

Actually I did, but I saw so many different styles that I had no idea which one was correct. Thanks again for your patience.

@cloud-fan (Contributor)

LGTM, pending tests

@wojtek-szymanski (Contributor, Author)

@cloud-fan could you please give the green light to tests?

@cloud-fan (Contributor)

ok to test

@cloud-fan (Contributor)

Sorry, I forgot to trigger the test...

@SparkQA commented Mar 7, 2017

Test build #74135 has finished for PR 17075 at commit fc0f2d1.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

retest this please

@SparkQA commented Mar 8, 2017

Test build #74152 has finished for PR 17075 at commit fc0f2d1.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

retest this please

@SparkQA commented Mar 8, 2017

Test build #74181 has finished for PR 17075 at commit fc0f2d1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

retest this please

@SparkQA commented Mar 8, 2017

Test build #74186 has started for PR 17075 at commit fc0f2d1.

@cloud-fan (Contributor)

retest this please

@SparkQA commented Mar 8, 2017

Test build #74195 has finished for PR 17075 at commit fc0f2d1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

thanks, merging to master!

@asfgit asfgit closed this in e9e2c61 Mar 8, 2017
asfgit pushed a commit that referenced this pull request Oct 29, 2017
…ginal column

## What changes were proposed in this pull request?

This is a followup of #17075 , to fix the bug in codegen path.

## How was this patch tested?

new regression test

Author: Wenchen Fan <wenchen@databricks.com>

Closes #19576 from cloud-fan/bug.

(cherry picked from commit 7fdacbc)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
MatthewRBruce pushed a commit to Shopify/spark that referenced this pull request Jul 31, 2018