-
Notifications
You must be signed in to change notification settings - Fork 28.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-19727][SQL] Fix for round function that modifies original column #17075
Conversation
I dont' know the code well enough to really evaluate this, but, I see that CC maybe @cloud-fan or @yjshen ? |
I think we should fix |
Good idea @cloud-fan. I will look for usages of |
I have just started refactoring of Here are my first thoughts:
I would be grateful if you could confirm if it's the right way to go. |
how about we add a new method |
It seems it makes more sense now, please have a look. |
|
||
private[this] def castToDecimal(from: DataType, target: DecimalType): Any => Any = from match { | ||
case StringType => | ||
buildCast[UTF8String](_, s => try { | ||
changePrecision(Decimal(new JavaBigDecimal(s.toString)), target) | ||
toPrecision(Decimal(new JavaBigDecimal(s.toString)), target) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like here we don't need to create a new instance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree
case dt: DecimalType => | ||
b => changePrecision(b.asInstanceOf[Decimal].clone(), target) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is the only case we need toPewcision
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, there is one more here:
case BooleanType =>
buildCast[Boolean](_, b => toPrecision(if (b) Decimal.ONE else Decimal.ZERO, target))
Both, ONE
and ZERO
are singletons so changing precision on themselves is not a good idea.
* | ||
* @return `Some(decimal)` if successful or `None` if overflow would occur | ||
*/ | ||
private[sql] def toPrecision(precision: Int, scale: Int, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: the style should be
def xxx(
para1: xxx,
para2: xxx): T = {
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed, thanks
value.changePrecision( | ||
DecimalType.bounded(precision - scale + 1, 0).precision, 0, ROUND_FLOOR) | ||
value | ||
toPrecision(DecimalType.bounded(precision - scale + 1, 0).precision, 0, ROUND_FLOOR) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we assume toPrecision
will always return Some here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Theoretically, it should be Some
. On the other hand if something goes wrong when setting new precision in floor
or ceil
, I would raise an internal error:
def floor: Decimal = if (scale == 0) this else {
val newPrecision = DecimalType.bounded(precision - scale + 1, 0).precision
toPrecision(newPrecision, 0, ROUND_FLOOR).getOrElse(
throw new AnalysisException(s"Overflow when setting precision to $newPrecision"))
}
@@ -233,6 +233,18 @@ class MathFunctionsSuite extends QueryTest with SharedSQLContext { | |||
) | |||
} | |||
|
|||
test("round/bround with data frame from a local Seq of Product") { | |||
val df = spark.createDataFrame(Seq(NumericRow(BigDecimal("5.9")))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't need to create NumericRow
, try Seq(BigDecimal("5.9")).toDF("value")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, the problem occurs only when creating data frame from Product
. Unable to reproduce the issue with Seq(BigDecimal("5.9")).toDF("value")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is weird, can you look into it? spark.createDataset(Seq(BigDecimal("5.9")))
should produce the same result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I will try to investigate where is the difference
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fundamental difference is in the underlying row type assigned to dataframe/dataset. Dataframe is based on GenericInternalRow
, while Dataset uses UnsafeRow
. During evaluation of round expression, method getDecimal
is called on a row, see BoundAttribute.scala#L52. As a result GenericInternalRow
returns just an element of an array, which points to the reference of the original column, see rows.scala#L200. Strategy used in UnsafeRow
is completely different, so new decimal instance is created, see UnsafeRow.java#L399.
I hope it helps to explain why only dataframe is affected.
DecimalType.bounded(precision - scale + 1, 0).precision, 0, ROUND_FLOOR) | ||
value | ||
toPrecision(DecimalType.bounded(precision - scale + 1, 0).precision, 0, ROUND_FLOOR) | ||
.getOrElse(clone()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might end up creating two copies of the object in worst case :
- once in
toPrecision
- second time on this line
Old logic would guarantee a single copy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, thanks. My suggestion is to raise an internal error if setting new precision in floor
or ceil
would fail.
DecimalType.bounded(precision - scale + 1, 0).precision, 0, ROUND_CEILING) | ||
value | ||
toPrecision(DecimalType.bounded(precision - scale + 1, 0).precision, 0, ROUND_CEILING) | ||
.getOrElse(clone()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my comment above
@@ -193,7 +193,7 @@ class DecimalSuite extends SparkFunSuite with PrivateMethodTester { | |||
assert(Decimal(Long.MaxValue, 100, 0).toUnscaledLong === Long.MaxValue) | |||
} | |||
|
|||
test("changePrecision() on compact decimal should respect rounding mode") { | |||
test("changePrecision/toPrecission on compact decimal should respect rounding mode") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: typo in toPrecission
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, fixed
* @return `Some(decimal)` if successful or `None` if overflow would occur | ||
*/ | ||
private[sql] def toPrecision( | ||
precision: Int, scale: Int, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
code style...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed, thanks
) | ||
checkAnswer( | ||
df.withColumn("value_rounded", bround('value)), | ||
Seq(Row(BigDecimal("5.9"), BigDecimal("6"))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why test it twice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bround function is also affected. Column value_rounded
renamed to value_brounded
* | ||
* @return `Some(decimal)` if successful or `None` if overflow would occur | ||
*/ | ||
private[sql] def toPrecision( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style:
def xxx(
para1: xxx,
para2: xxx): XXX
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
@@ -422,3 +434,4 @@ class MathFunctionsSuite extends QueryTest with SharedSQLContext { | |||
checkAnswer(df.selectExpr("positive(b)"), Row(-1)) | |||
} | |||
} | |||
case class NumericRow(value : BigDecimal) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's just use Tuple1
instead of creating this class
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
replaced with Tuple1
* @return `Some(decimal)` if successful or `None` if overflow would occur | ||
*/ | ||
private[sql] def toPrecision( | ||
precision: Int, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
4 spaces indention here. please take a look at other methods in spark and follow the code style.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I did, but I saw so many different styles and I had no idea which one is correct. Thanks again for your patience
LGTM, pending tests |
@cloud-fan could you please give the green light to tests? |
ok to test |
sorry forgot the trigger the test... |
Test build #74135 has finished for PR 17075 at commit
|
retest this please |
Test build #74152 has finished for PR 17075 at commit
|
retest this please |
Test build #74181 has finished for PR 17075 at commit
|
retest this please |
Test build #74186 has started for PR 17075 at commit |
retest this please |
Test build #74195 has finished for PR 17075 at commit
|
thanks, merging to master! |
…ginal column ## What changes were proposed in this pull request? This is a followup of #17075 , to fix the bug in codegen path. ## How was this patch tested? new regression test Author: Wenchen Fan <wenchen@databricks.com> Closes #19576 from cloud-fan/bug. (cherry picked from commit 7fdacbc) Signed-off-by: gatorsmile <gatorsmile@gmail.com>
…ginal column ## What changes were proposed in this pull request? This is a followup of apache#17075 , to fix the bug in codegen path. ## How was this patch tested? new regression test Author: Wenchen Fan <wenchen@databricks.com> Closes apache#19576 from cloud-fan/bug. (cherry picked from commit 7fdacbc) Signed-off-by: gatorsmile <gatorsmile@gmail.com>
What changes were proposed in this pull request?
Fix for SQL round function that modifies original column when underlying data frame is created from a local product.
How was this patch tested?
New unit test added to existing suite
org.apache.spark.sql.MathFunctionsSuite