[SPARK-9526][SQL] Utilize randomized tests to reveal potential bugs in sql expressions #7855

yjshen · 2015-08-01T16:11:42Z

JIRA: https://issues.apache.org/jira/browse/SPARK-9526

This PR is a follow-up to #7830. It aims to use randomized tests to reveal more potential bugs in SQL expressions.

yjshen · 2015-08-01T16:27:30Z

Opening this early to get high-level feedback ASAP.

Note: The current merge build should fail due to ~~three~~ two bugs:

- UnaryMinus's codegen version would fail to compile when the input is Long.MinValue.
- Remainder would fail due to codegen and interpreted mode returning different results for the same input.
- ~~MaxOf/MinOf would fail due to ClassCastException: BinaryType's ordering needs Array[Byte] as input but GenericArrayData is given.~~ Not a problem

These bugs are not fixed yet since I just finished prototyping.

yjshen · 2015-08-01T16:39:42Z

cc @rxin @davies

JoshRosen · 2015-08-01T16:53:57Z

For remainder, my hunch is that it's probably failing for extreme floating point values (e.g. take the remainder of a giant float by another giant float). I found a similar failure in #7625, an experimental branch of mine which contains some code for using reflection to write tests against all Expression subclasses.

The code in my branch lags a bit behind what I have locally (e.g. it may be missing some of the interpreted vs. codegen comparison code), so I can see about pushing the rest of my changes later. The approach in my branch definitely isn't the right one for unit testing; it was more intended to be an experiment to see whether it would be possible to do this all via reflection.
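
As an illustration of the hunch about extreme values, here is a minimal, self-contained sketch showing how two remainder computations that agree on ordinary inputs can diverge on giant ones. It is not Spark's code, and the thread does not say which computation each mode actually used; `naiveRemainder` is a hypothetical stand-in for any implementation that goes through a rounded quotient.

```scala
object RemainderSketch {
  // JVM remainder on doubles: truncating, computed without intermediate rounding.
  def builtinRemainder(a: Double, b: Double): Double = a % b

  // Hypothetical alternative: a - b * trunc(a / b).
  // The quotient and the product are each rounded to the nearest double.
  def naiveRemainder(a: Double, b: Double): Double = {
    val q = a / b
    val truncated = if (q < 0) math.ceil(q) else math.floor(q)
    a - b * truncated
  }

  def main(args: Array[String]): Unit = {
    // Ordinary inputs: both computations agree.
    println(builtinRemainder(7.0, 3.0) == naiveRemainder(7.0, 3.0))

    // 2^1000 is exactly representable, and 2^1000 mod 3 is 1.
    val giant = math.pow(2.0, 1000)
    println(builtinRemainder(giant, 3.0))

    // At this magnitude adjacent doubles are far more than 1 apart,
    // so the subtraction below cannot produce 1.0.
    println(naiveRemainder(giant, 3.0))
  }
}
```

A randomized test that only samples moderate values would never see the two disagree, which is one reason to include boundary values among the generated inputs.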

SparkQA · 2015-08-01T18:15:19Z

Test build #39363 has finished for PR 7855 at commit daffd80.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class StopWordsRemover(override val uid: String)

yjshen · 2015-08-02T02:52:11Z

@JoshRosen, thanks for the information about #7625, it's great!
I'll read that in detail and see how I can refine my implementation accordingly.

SparkQA · 2015-08-02T08:30:00Z

Test build #39409 has finished for PR 7855 at commit e3bbe4c.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class RequestExecutors(appId: String, requestedTotal: Int)
- case class KillExecutors(appId: String, executorIds: Seq[String])
- class SpecificSafeProjection extends $
- case class FromUTCTimestamp(left: Expression, right: Expression)
- case class ToUTCTimestamp(left: Expression, right: Expression)
- case class DateDiff(endDate: Expression, startDate: Expression)
- case class InitCap(child: Expression) extends UnaryExpression with ImplicitCastInputTypes

SparkQA · 2015-08-02T10:05:56Z

Test build #39417 has finished for PR 7855 at commit 42769b0.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class RequestExecutors(appId: String, requestedTotal: Int)
- case class KillExecutors(appId: String, executorIds: Seq[String])
- class SpecificSafeProjection extends $
- case class FromUTCTimestamp(left: Expression, right: Expression)
- case class ToUTCTimestamp(left: Expression, right: Expression)
- case class DateDiff(endDate: Expression, startDate: Expression)
- case class InitCap(child: Expression) extends UnaryExpression with ImplicitCastInputTypes

JoshRosen · 2015-08-02T19:37:13Z

Did this end up finding any new bugs?

yjshen · 2015-08-03T00:18:46Z

All bugs revealed so far:

- UnaryMinus's codegen version would fail to compile when the input is Long.MinValue.
- Remainder would fail due to codegen and interpreted mode returning different results for the same input (yes, when taking the remainder of giant values).
- BinaryComparison would fail to compile in codegen mode when comparing Boolean types.
- AddMonth would fail if passed a huge negative month, which would lead to accessing a negative index of the monthDays array.

I also fixed Nanvl by upcasting its operands if they are of different types.
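
Two of the inputs in this list sit on well-known JVM arithmetic boundaries, which the sketch below demonstrates in isolation. It does not reproduce Spark's implementation: `daysPerMonth`, `unsafeMonthIndex`, and `safeMonthIndex` are hypothetical names, and the thread does not show how the generated code failed to compile or how the fixes were written.

```scala
object EdgeCaseSketch {
  // Hypothetical lookup table indexed by month 0..11 (non-leap year).
  val daysPerMonth: Array[Int] = Array(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)

  // `%` keeps the sign of the dividend, so a negative month count
  // yields a negative result that is unusable as an array index.
  def unsafeMonthIndex(totalMonths: Int): Int = totalMonths % 12

  // floorMod returns a value in 0..11 for a positive modulus.
  def safeMonthIndex(totalMonths: Int): Int = java.lang.Math.floorMod(totalMonths, 12)

  def main(args: Array[String]): Unit = {
    println(unsafeMonthIndex(-25))               // -1
    println(safeMonthIndex(-25))                 // 11
    println(daysPerMonth(safeMonthIndex(-25)))   // 31

    // Long.MinValue has no positive counterpart: negation wraps around.
    println(-Long.MinValue == Long.MinValue)     // true
  }
}
```

Values such as Long.MinValue and very large negative counts are rarely written into hand-picked test cases, which is why a generator that covers the full range of each type tends to surface them.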

rxin · 2015-08-03T01:10:57Z

sql/catalyst/src/test/scala/org/apache/spark/sql/types/DataTypeTestUtils.scala

+  val numericTypeWithoutDecimal: Set[DataType] = integralType ++ Set(DoubleType, FloatType)
+
+  /**
+   * Instances of all [[NumericType]]s and CalendarIntervalType


Put [[ ]] around CalendarIntervalType so IntelliJ can find it during refactoring.

rxin · 2015-08-03T01:53:39Z

@yjshen

To help with reviewing, and to separate important fixes from nice-to-have tests, can you submit a separate pull request that includes all the bug fixes, along with deterministic unit tests that would trigger those cases?

Then this pull request can be just about the randomized tests.

JoshRosen · 2015-08-03T15:05:11Z

Bugfixes were done in #7882, so this should be ready for rebasing.

yjshen · 2015-08-04T04:00:00Z

Ah, forgot the scaladoc on property check, will do now.

JoshRosen · 2015-08-04T04:17:22Z

This is on my review queue for tomorrow.

SparkQA · 2015-08-04T05:52:30Z

Test build #39650 has finished for PR 7855 at commit b2c6543.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- public static final class SortedIterator extends UnsafeSorterIterator
- public class KVSorterIterator extends KVIterator<UnsafeRow, UnsafeRow>

SparkQA · 2015-08-04T07:08:21Z

Test build #39676 has finished for PR 7855 at commit 5301891.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

yjshen · 2015-08-04T07:18:09Z

Jenkins, retest this please.

SparkQA · 2015-08-04T08:12:02Z

Test build #199 has finished for PR 7855 at commit 5301891.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

yjshen · 2015-08-04T08:13:19Z

Unrelated failure again and again.
org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.(It is not a test)
org.apache.spark.sql.hive.thriftserver.HiveThriftHttpServerSuite.(It is not a test)

yjshen · 2015-08-04T09:13:40Z

Jenkins, retest this please.

SparkQA · 2015-08-04T10:05:59Z

Test build #39696 has finished for PR 7855 at commit 5301891.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-08-04T10:06:07Z

Test build #203 has finished for PR 7855 at commit 5301891.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

yjshen · 2015-08-04T10:23:52Z

Jenkins, retest this please.

SparkQA · 2015-08-04T12:22:44Z

Test build #204 has finished for PR 7855 at commit 5301891.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-08-04T12:57:32Z

Test build #39702 has finished for PR 7855 at commit 5301891.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

JoshRosen · 2015-08-14T20:56:07Z

...catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvalHelper.scala

@@ -211,4 +215,80 @@ trait ExpressionEvalHelper {
      plan(inputRow)).get(0, expression.dataType)
    assert(checkResult(actual, expected))
  }
+
+  def checkConsistency(dt: DataType, clazz: Class[_]): Unit = {


What do you think about giving this a more specific name, such as checkConsistencyBetweenInterpretedAndCodegen? It would also be good to add Scaladoc to these methods to explain what they're doing, since the use of reflection might be non-obvious.

For instance, this method's Scaladoc could explain that it tests the expression's one-argument constructor with randomized literals of the given data type.

Also, I think that we might be able to clean up the code slightly by adding a type parameter to this method:

def checkConsistency[E <: Expression: ClassTag](dt: DataType)

to let callers write something like

checkConsistencyBetweenInterpretedAndCodegen[Sinh](DoubleType)

Actually, do we even need reflection for this? Can we do something like https://github.com/yjshen/spark/blob/property_check/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/MathFunctionsSuite.scala#L59 instead?
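
For readers following along, a reflection-free check of this kind might look roughly like the sketch below, where the caller passes the expression's constructor as a function. This is only an illustration under stated assumptions: `Expr`, `Lit`, `Sinh`, `BadAbs`, and `findMismatches` are toy stand-ins defined here, not Spark's `Expression` classes or the `ExpressionEvalHelper` API, and the code behind the link above is not reproduced in this thread.

```scala
import scala.util.Random

object ConsistencySketch {
  // Toy expression with two evaluation paths (hypothetical, not Spark's Expression).
  trait Expr {
    def interpreted: Double
    def compiled: Double
  }

  case class Lit(value: Double) extends Expr {
    def interpreted: Double = value
    def compiled: Double = value
  }

  // Both paths use the same computation, so they always agree.
  case class Sinh(child: Expr) extends Expr {
    def interpreted: Double = math.sinh(child.interpreted)
    def compiled: Double = math.sinh(child.compiled)
  }

  // Deliberately inconsistent: the "compiled" path overflows for huge inputs.
  case class BadAbs(child: Expr) extends Expr {
    def interpreted: Double = math.abs(child.interpreted)
    def compiled: Double = { val x = child.compiled; math.sqrt(x * x) }
  }

  // Boundary values that purely random sampling is unlikely to hit.
  private val edgeCases: Seq[Double] = Seq(
    0.0, -1.0, Double.MaxValue, Double.MinValue, Double.NaN,
    Double.PositiveInfinity, Double.NegativeInfinity)

  // Builds the expression over each input literal and returns the inputs
  // on which the two evaluation paths disagree. No reflection is needed
  // because the constructor is passed in as an ordinary function.
  def findMismatches(build: Expr => Expr, samples: Int = 100, seed: Long = 42L): Seq[Double] = {
    val rng = new Random(seed)
    val randomInputs = Seq.fill(samples)((rng.nextDouble() - 0.5) * 2000.0)
    (edgeCases ++ randomInputs).filter { x =>
      val e = build(Lit(x))
      // compare treats NaN as equal to NaN, unlike ==.
      java.lang.Double.compare(e.interpreted, e.compiled) != 0
    }
  }

  def main(args: Array[String]): Unit = {
    println(findMismatches(Sinh(_)).isEmpty)
    println(findMismatches(BadAbs(_)).contains(Double.MaxValue))
  }
}
```

Passing the constructor as a function keeps the call site about as short as the `ClassTag` version while letting the compiler check that the expression really has a constructor of the expected shape.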

JoshRosen · 2015-08-14T21:15:05Z

The basic approach here seems reasonable to me but I left a couple of comments regarding whether we need to use reflection and RE: some documentation / naming issues.

SparkQA · 2015-08-15T08:06:34Z

Test build #40944 has finished for PR 7855 at commit 0a5bdc9.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class FilterNode(condition: Expression, child: LocalNode) extends UnaryLocalNode
- abstract class LocalNode extends TreeNode[LocalNode]
- abstract class LeafLocalNode extends LocalNode
- abstract class UnaryLocalNode extends LocalNode
- case class ProjectNode(projectList: Seq[NamedExpression], child: LocalNode) extends UnaryLocalNode
- case class SeqScanNode(output: Seq[Attribute], data: Seq[InternalRow]) extends LeafLocalNode

yjshen · 2015-08-15T08:17:28Z

@JoshRosen, I've changed my implementation; do you mind reviewing this again?

JoshRosen · 2015-08-16T21:07:40Z

LGTM pending Jenkins; thanks!

SparkQA · 2015-08-16T22:47:05Z

Test build #1627 has finished for PR 7855 at commit 0a5bdc9.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class FilterNode(condition: Expression, child: LocalNode) extends UnaryLocalNode
- abstract class LocalNode extends TreeNode[LocalNode]
- abstract class LeafLocalNode extends LocalNode
- abstract class UnaryLocalNode extends LocalNode
- case class ProjectNode(projectList: Seq[NamedExpression], child: LocalNode) extends UnaryLocalNode
- case class SeqScanNode(output: Seq[Attribute], data: Seq[InternalRow]) extends LeafLocalNode

yjshen · 2015-08-17T00:20:08Z

jenkins, retest this please.

yjshen · 2015-08-17T00:27:16Z

unrelated failure, org.apache.spark.sql.hive.HiveSparkSubmitSuite.SPARK-8368: includes jars passed in through --jars

yjshen · 2015-08-17T01:29:41Z

jenkins, retest this please.

SparkQA · 2015-08-17T03:58:46Z

Test build #41004 has finished for PR 7855 at commit 0a5bdc9.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- trait ExpressionEvalHelper extends GeneratorDrivenPropertyChecks

rxin · 2015-08-17T04:25:35Z

@JoshRosen I will let you merge this one.

JoshRosen · 2015-08-17T17:13:15Z

Will merge provided that this still compiles.

JoshRosen · 2015-08-17T18:39:20Z

Jenkins, retest this please.

SparkQA · 2015-08-17T21:00:45Z

Test build #41043 has finished for PR 7855 at commit 0a5bdc9.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- trait ExpressionEvalHelper extends GeneratorDrivenPropertyChecks

JoshRosen · 2015-08-17T21:09:22Z

Alright, merging this to master and branch-1.5. Thanks!

…in sql expressions

JIRA: https://issues.apache.org/jira/browse/SPARK-9526

This PR is a follow up of #7830, aiming at utilizing randomized tests to reveal more potential bugs in sql expression.

Author: Yijie Shen <henry.yijieshen@gmail.com>

Closes #7855 from yjshen/property_check.

(cherry picked from commit b265e28)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>

yjshen force-pushed the property_check branch from daffd80 to e3bbe4c Compare August 2, 2015 06:46

yjshen changed the title ~~[SPARK-9526][SQL][WIP] Utilize randomized tests to reveal potential bugs in sql expressions~~ [SPARK-9526][SQL] Utilize randomized tests to reveal potential bugs in sql expressions Aug 2, 2015

rxin reviewed Aug 3, 2015

yjshen force-pushed the property_check branch from 42769b0 to b2c6543 Compare August 4, 2015 03:57

JoshRosen reviewed Aug 14, 2015

yjshen added 7 commits August 15, 2015 12:34

[WIP] Utilize ScalaCheck to reveal potential bugs in sql expressions

0d3bb3c

property check more expressions

e05bbd0

Finish first pass of property check

2100600

address comments

4e36204

rename & add javadoc

645df77

typo fix

963af5f

address comments

0a5bdc9

yjshen force-pushed the property_check branch from 2149f31 to 0a5bdc9 Compare August 15, 2015 05:44

asfgit closed this in b265e28 Aug 17, 2015
