[SPARK-8635][SQL] improve performance of CatalystTypeConverters #7018
Conversation
The benchmark (ScalaUdf will convert from catalyst to scala and back again):

```scala
// Imports assumed for the internal Catalyst APIs this PR touches:
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.types._

case class Floor(child: Expression) extends UnaryExpression with Predicate {
  override def toString = s"Floor $child"

  override def eval(input: InternalRow): Any = {
    child.eval(input) match {
      case null => null
      // The element type is erased at runtime, so match on Seq[_] and cast.
      case s: Seq[_] => s.asInstanceOf[Seq[Int]].sum
    }
  }
}

object T {
  def benchmark(count: Int, expr: Expression): Unit = {
    var i = 0
    val row = new GenericRow(Array[Any]((1 to 10).toSeq))
    val s = System.currentTimeMillis()
    while (i < count) {
      expr.eval(row)
      i += 1
    }
    val e = System.currentTimeMillis()
    println(s"${expr.getClass.getSimpleName} -- ${e - s} ms")
  }

  def main(args: Array[String]): Unit = {
    def func(s: Seq[Int]) = s.sum
    val attr = BoundReference(0, ArrayType(IntegerType), true)
    val udf0 = ScalaUdf(func _, IntegerType, attr :: Nil)
    val udf1 = Floor(attr)
    benchmark(1000000, udf0)
    benchmark(1000000, udf0)
    benchmark(1000000, udf0)
    benchmark(1000000, udf1)
    benchmark(1000000, udf1)
    benchmark(1000000, udf1)
  }
}
```

before:

after:
```diff
@@ -258,16 +273,13 @@ object CatalystTypeConverters {
       toScala(row(column).asInstanceOf[InternalRow])
   }

-  private object StringConverter extends CatalystTypeConverter[Any, String, Any] {
+  private object StringConverter extends CatalystTypeConverter[Any, String, UTF8String] {
```
The internal type of StringType should always be UTF8String.
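To illustrate why the tighter type parameter helps, here is a minimal self-contained sketch, assuming a simplified converter hierarchy (SketchConverter and the UTF8String stand-in below are illustrative, not Spark's actual classes): fixing the catalyst-side type to the concrete UTF8String encodes the invariant in the signature and removes casts from Any on the catalyst side.

```scala
// Simplified sketch, not the real CatalystTypeConverter: a converter is
// parameterized on [ScalaInput, ScalaOutput, CatalystType], so a concrete
// CatalystType replaces Any and the casts that come with it.
abstract class SketchConverter[ScalaInput, ScalaOutput, CatalystType] {
  def toCatalyst(scalaValue: ScalaInput): CatalystType
  def toScala(catalystValue: CatalystType): ScalaOutput
}

// Hypothetical stand-in for org.apache.spark.unsafe.types.UTF8String.
final case class UTF8String(bytes: Array[Byte]) {
  override def toString: String = new String(bytes, "UTF-8")
}

object StringConverterSketch extends SketchConverter[Any, String, UTF8String] {
  // The internal representation of StringType is always UTF8String.
  override def toCatalyst(scalaValue: Any): UTF8String =
    UTF8String(scalaValue.toString.getBytes("UTF-8"))
  override def toScala(catalystValue: UTF8String): String =
    catalystValue.toString
}
```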
```diff
@@ -90,7 +90,7 @@ private[sql] object FrequentItems extends Logging {
       (name, originalSchema.fields(index).dataType)
     }

-    val freqItems = df.select(cols.map(Column(_)) : _*).rdd.aggregate(countMaps)(
+    val freqItems = df.select(cols.map(Column(_)) : _*).internalRowRdd.aggregate(countMaps)(
```
When we calculate singlePassFreqItems, we don't need to convert catalyst types to Scala types before the calculation. DataFrame.rdd is RDD[Row], and we need RDD[InternalRow] here.
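As a hedged, self-contained sketch of the point (the types below are toy stand-ins for Spark's Row and InternalRow, not the real classes): when the aggregation only reads raw field values, materializing an external row for every internal row is pure overhead, so operating on the internal representation directly skips that per-row conversion.

```scala
// Toy stand-ins: "internal" rows hold raw values; "external" rows are the
// user-facing representation that DataFrame.rdd would hand back.
final case class InternalRowLike(values: Array[Any])
final case class RowLike(values: Seq[Any])

// Slow path: convert every internal row to an external row, then aggregate.
def sumFirstColumnViaExternal(rows: Seq[InternalRowLike]): Int =
  rows.map(r => RowLike(r.values.toSeq))   // per-row conversion we want to avoid
      .map(_.values.head.asInstanceOf[Int]).sum

// Fast path: aggregate over the internal representation directly.
def sumFirstColumnViaInternal(rows: Seq[InternalRowLike]): Int =
  rows.map(_.values(0).asInstanceOf[Int]).sum
```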
Test build #35787 has finished for PR 7018 at commit
Test build #35789 has finished for PR 7018 at commit
```scala
 * Typical use case would be converting a collection of rows that have the same schema. You will
 * call this function once to get a converter, and apply it to every row.
 */
private[sql] def createToScalaConverter(dataType: DataType): Any => Any = {
```
Could you update the StructConverter to use createToScalaConverter?
Never mind, the StructConverter also needs toCatalystConverter.
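A hedged sketch of the usage pattern the doc comment describes (makeToScalaConverter below is a hypothetical stand-in for the private[sql] createToScalaConverter, with only enough cases to show the shape): the DataType is inspected once to build a closure, and that closure is applied to every value, so the per-row work avoids re-dispatching on the type.

```scala
import org.apache.spark.sql.types._

// Hypothetical stand-in: dispatch on the DataType once, return a closure
// that does the per-value work.
def makeToScalaConverter(dataType: DataType): Any => Any = dataType match {
  case BooleanType | ByteType | ShortType | IntegerType |
       LongType | FloatType | DoubleType =>
    identity // primitives are already in their Scala representation
  case ArrayType(elementType, _) =>
    // Build the element converter once, outside the per-value closure.
    val convertElement = makeToScalaConverter(elementType)
    (v: Any) => if (v == null) null else v.asInstanceOf[Seq[Any]].map(convertElement)
  case _ =>
    identity // sketch only; the real converter handles all Catalyst types
}

// Usage: resolve the converter once per column, then apply it to every row.
val convert = makeToScalaConverter(ArrayType(IntegerType, containsNull = false))
val converted = Seq(Seq(1, 2, 3), null).map(convert)
```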
LGTM, only one minor comment.
Merged this into master, thanks!
In CatalystTypeConverters.createToCatalystConverter, we add special handling for primitive types. We can apply this strategy to more places to improve performance.
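For concreteness, here is a minimal sketch of what such a primitive fast path can look like on the toCatalyst side, under the same caveat that the function below is illustrative rather than Spark's actual code: a primitive value has one representation on both sides, so its converter degenerates to identity, while a type like DateType (externally a java.sql.Date, internally an Int of days since the epoch) pays for a real conversion closure.

```scala
import java.sql.Date
import org.apache.spark.sql.types._

// Illustrative sketch of the strategy, not the Spark implementation:
// primitives pass through unchanged; only types whose external and internal
// representations differ get a real conversion closure.
def makeToCatalystConverter(dataType: DataType): Any => Any = dataType match {
  case BooleanType | ByteType | ShortType | IntegerType |
       LongType | FloatType | DoubleType =>
    identity // fast path: nothing to convert, no converter object involved
  case DateType =>
    // Slow path example: days since epoch (rough sketch, time zones ignored).
    (v: Any) =>
      if (v == null) null
      else (v.asInstanceOf[Date].getTime / (24L * 60 * 60 * 1000)).toInt
  case _ =>
    identity // sketch only
}
```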