
[SPARK-19691][SQL] Fix ClassCastException when calculating percentile of decimal column #17028

Closed. Wants to merge 4 commits.

Conversation

@maropu (Member) commented Feb 22, 2017

What changes were proposed in this pull request?

This PR fixes the ClassCastException below:

```
scala> spark.range(10).selectExpr("cast (id as decimal) as x").selectExpr("percentile(x, 0.5)").collect()
java.lang.ClassCastException: org.apache.spark.sql.types.Decimal cannot be cast to java.lang.Number
	at org.apache.spark.sql.catalyst.expressions.aggregate.Percentile.update(Percentile.scala:141)
	at org.apache.spark.sql.catalyst.expressions.aggregate.Percentile.update(Percentile.scala:58)
	at org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.update(interfaces.scala:514)
	at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:171)
	at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:171)
	at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:187)
	at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:181)
	at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.processInputs(ObjectAggregationIterator.scala:151)
	at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.<init>(ObjectAggregationIterator.scala:78)
	at org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:109)
	at
```

This fix simply converts Catalyst values (i.e., `Decimal`) into Scala ones by using `CatalystTypeConverters`.
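The root cause can be illustrated in plain Scala, without a Spark dependency: Spark's internal `Decimal` does not extend `java.lang.Number`, so the old `asInstanceOf[Number]` cast fails, whereas `scala.math.BigDecimal` (the converted Scala value) does extend it via `ScalaNumber`. The sketch below is a hedged stand-in; `MiniDecimal` and `toScala` are hypothetical analogues of `Decimal` and `CatalystTypeConverters.convertToScala`, not Spark code.

```scala
// Sketch only: MiniDecimal stands in for Spark's internal Decimal, which does
// not extend java.lang.Number; scala.math.BigDecimal does (via ScalaNumber).
object DecimalCastDemo {
  final class MiniDecimal(val underlying: java.math.BigDecimal)

  // Hypothetical analogue of CatalystTypeConverters.convertToScala for decimals.
  def toScala(v: Any): Any = v match {
    case d: MiniDecimal => scala.math.BigDecimal(d.underlying)
    case other          => other
  }

  def main(args: Array[String]): Unit = {
    val raw: Any = new MiniDecimal(new java.math.BigDecimal("5"))
    val castFailed =
      try { raw.asInstanceOf[Number]; false }
      catch { case _: ClassCastException => true }
    println(s"direct cast failed: $castFailed")  // reproduces the reported bug
    println(s"after conversion: ${toScala(raw).asInstanceOf[Number].longValue}")
  }
}
```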

How was this patch tested?

Added a test in `DataFrameSuite`.

```diff
@@ -138,7 +138,8 @@ case class Percentile(
   override def update(
       buffer: OpenHashMap[Number, Long],
       input: InternalRow): OpenHashMap[Number, Long] = {
-    val key = child.eval(input).asInstanceOf[Number]
+    val scalaValue = CatalystTypeConverters.convertToScala(child.eval(input), child.dataType)
```
Contributor commented:

I think it is better to open up the signature of the `OpenHashMap` and use `Ordered` or `AnyRef` as its key type.

maropu (Member, Author) replied:

Okay, I'll fix it that way. Thanks!

@HyukjinKwon (Member) commented Feb 22, 2017:

Should we create a converter with `createToScalaConverter(...)` and reuse it, rather than type-dispatching on every call?

maropu (Member, Author) replied:

I'm not 100% sure, but the cost of converting `Decimal` to `BigDecimal` every time seems somewhat higher than using the Catalyst values as they are.
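The reuse pattern suggested above can be sketched in plain Scala: dispatch on the data type once when the converter is built (as `CatalystTypeConverters.createToScalaConverter` does), then apply the returned function per row. This is a hedged illustration; `SimpleType` and its cases are hypothetical stand-ins, not Spark types.

```scala
// Sketch of "build the converter once" versus pattern-matching per value.
object ConverterReuse {
  sealed trait SimpleType
  case object SimpleDecimalType extends SimpleType
  case object SimpleLongType    extends SimpleType

  // Dispatch on the data type once; reuse the returned function for every row.
  def createToScalaConverter(dt: SimpleType): Any => Any = dt match {
    case SimpleDecimalType => v => scala.math.BigDecimal(v.asInstanceOf[java.math.BigDecimal])
    case SimpleLongType    => identity
  }

  def main(args: Array[String]): Unit = {
    val convert = createToScalaConverter(SimpleDecimalType) // built once, outside the loop
    val rows = Seq(new java.math.BigDecimal("1.5"), new java.math.BigDecimal("2.5"))
    val total = rows.map(convert(_).asInstanceOf[Number].doubleValue).sum
    println(total) // no per-row type dispatch
  }
}
```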

@SparkQA commented Feb 22, 2017

Test build #73280 has finished for PR 17028 at commit b216fa1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu (Member, Author) commented Feb 22, 2017

Just a sec; I'll apply @hvanhovell's suggestion...

@SparkQA commented Feb 22, 2017

Test build #73290 has finished for PR 17028 at commit 325c95d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 23, 2017

Test build #73320 has finished for PR 17028 at commit ef26f26.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu (Member, Author) commented Feb 23, 2017

@HyukjinKwon @hvanhovell How about the latest fix?

```diff
     val frqValue = frequencyExpression.eval(input)

     // Null values are ignored in counts map.
     if (key != null && frqValue != null) {
-      val frqLong = frqValue.asInstanceOf[Number].longValue()
+      val frqLong = toLongValue(frqValue)
```
Contributor commented:

`frqValue` is guaranteed to be an integral value, so this is not needed. We could also force it to be a `Long`; that would make this even simpler.

maropu (Member, Author) replied:

I'll revert this part.

```diff
@@ -274,7 +283,8 @@ case class Percentile(
       val row = new UnsafeRow(2)
       row.pointTo(bs, sizeOfNextRow)
       // Insert the pairs into counts map.
-      val key = row.get(0, child.dataType).asInstanceOf[Number]
+      val catalystValue = row.get(0, child.dataType)
```
Contributor commented:

NIT: Just change the cast in the old code.

maropu (Member, Author) replied:

Oh, I'll fix that.

```diff
     assert(compareEquals(agg.deserialize(agg.serialize(buffer)), buffer))

     // Check non-empty buffer serialize and deserialize.
     data.foreach { key =>
-      buffer.changeValue(key, 1L, _ + 1L)
+      buffer.changeValue(new Integer(key), 1L, _ + 1L)
```
Contributor commented:

Do we need to explicitly box this? I thought Scala boxed automatically.

maropu (Member, Author) replied:

Without the explicit boxing, it throws the compile error below:

```
[error] /Users/maropu/IdeaProjects/spark/spark-master/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/aggregate/PercentileSuite.scala:46: the result type of an implicit conversion must be more specific than AnyRef
```

```diff
     }
     assert(compareEquals(agg.deserialize(agg.serialize(buffer)), buffer))
   }

   test("class Percentile, high level interface, update, merge, eval...") {
     val count = 10000
     val percentages = Seq(0, 0.25, 0.5, 0.75, 1)
-    val expectedPercentiles = Seq(1, 2500.75, 5000.5, 7500.25, 10000)
+    val expectedPercentiles = Seq[Double](1, 2500.75, 5000.5, 7500.25, 10000)
```
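The expected values in this test follow percentile by linear interpolation over the sorted input (1 to 10000). A minimal standalone sketch of that computation, not Spark's actual implementation:

```scala
// Linear-interpolation percentile over a sorted sequence, matching the
// expectations in the test above. A sketch, not Spark's implementation.
object PercentileSketch {
  def percentile(sorted: IndexedSeq[Long], p: Double): Double = {
    require(sorted.nonEmpty && p >= 0.0 && p <= 1.0)
    val pos  = p * (sorted.length - 1)  // fractional rank in [0, n-1]
    val lo   = math.floor(pos).toInt
    val hi   = math.ceil(pos).toInt
    val frac = pos - lo
    // Interpolate between the two neighboring ranks.
    sorted(lo) * (1.0 - frac) + sorted(hi) * frac
  }

  def main(args: Array[String]): Unit = {
    val data = (1L to 10000L).toIndexedSeq
    val ps = Seq(0.0, 0.25, 0.5, 0.75, 1.0)
    println(ps.map(percentile(data, _)))  // 1.0, 2500.75, 5000.5, 7500.25, 10000.0
  }
}
```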
Contributor commented:

Do we need to type this?

maropu (Member, Author) replied:

I added this because my IntelliJ flagged it, but you're right that it's not necessary. I reverted it.

@hvanhovell (Contributor) commented:

@maropu This looks pretty good. I left a few minor comments/questions.

@maropu (Member, Author) commented Feb 23, 2017

Thanks for your review! I'm fixing now.

@maropu (Member, Author) commented Feb 23, 2017

Done. I'll wait for the tests to finish.

@SparkQA commented Feb 23, 2017

Test build #73342 has finished for PR 17028 at commit 88f4f47.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell (Contributor) commented:

LGTM - merging to master.

@asfgit asfgit closed this in 93aa427 Feb 23, 2017
@maropu (Member, Author) commented Feb 23, 2017

Thanks!

@hvanhovell (Contributor) commented:

@maropu can you open a backport if you feel we should also put this in 2.1?

@maropu (Member, Author) commented Feb 23, 2017

@hvanhovell Okay, I'll open one soon.

Yunni pushed a commit to Yunni/spark that referenced this pull request Feb 27, 2017
… of decimal column


Author: Takeshi Yamamuro <yamamuro@apache.org>

Closes apache#17028 from maropu/SPARK-19691.