[WIP] [skip ci] Fuzz testing in Spark SQL #7625

Closed
wants to merge 87 commits into from

Conversation

JoshRosen
Contributor

@JoshRosen JoshRosen commented Jul 23, 2015

[skip ci]

This is a WIP pull request for some expression fuzz testing code that I'm working on as part of a hackathon. I'm creating this pull request now in order to share the code and to have a pull request that I can reference from my other pull requests for fixing bugs that were found using this tester.

Features on my TODO list

  • Better logging to aid debuggability.
  • "Continuous" mode which dumps all results to files and keeps going when errors occur (designed to run overnight).
  • Validator which asserts that random queries return equivalent answers when run under different configuration modes (safe vs. unsafe vs. safe w/o codegen, plus a few other permutations).
  • Plan transformer which takes valid logical query plans and transforms them into equivalent ones, then checks that both the original and transformed plans produce equivalent answers. This style of test is used in MySQL's testing tools.
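
As a rough illustration of the last two TODO items, here is a minimal sketch on a toy expression language. It is not code from this PR and does not use Catalyst: `DifferentialFuzzSketch`, `Expr`, `interpret`, `compile`, `commute`, and `findMismatches` are all hypothetical names. The idea is to evaluate each random expression along two independent paths (a tree-walking interpreter and a closure "compiler"), and also to evaluate a rewritten but equivalent expression, then report any disagreement.

```scala
import scala.util.Random

object DifferentialFuzzSketch {
  // Toy expression tree (illustrative only; these are not Catalyst classes).
  sealed trait Expr
  case object X extends Expr
  case class Lit(value: Int) extends Expr
  case class Add(left: Expr, right: Expr) extends Expr
  case class Mul(left: Expr, right: Expr) extends Expr

  // Path 1: tree-walking interpreter.
  def interpret(e: Expr, x: Int): Int = e match {
    case X         => x
    case Lit(v)    => v
    case Add(l, r) => interpret(l, x) + interpret(r, x)
    case Mul(l, r) => interpret(l, x) * interpret(r, x)
  }

  // Path 2: build a closure once, then run it (stands in for codegen).
  def compile(e: Expr): Int => Int = e match {
    case X      => (x: Int) => x
    case Lit(v) => (_: Int) => v
    case Add(l, r) =>
      val cl = compile(l)
      val cr = compile(r)
      x => cl(x) + cr(x)
    case Mul(l, r) =>
      val cl = compile(l)
      val cr = compile(r)
      x => cl(x) * cr(x)
  }

  // Equivalence-preserving rewrite: swap the operands of every Add and Mul.
  def commute(e: Expr): Expr = e match {
    case Add(l, r) => Add(commute(r), commute(l))
    case Mul(l, r) => Mul(commute(r), commute(l))
    case leaf      => leaf
  }

  // Random expression generator with bounded depth.
  def randomExpr(rng: Random, depth: Int): Expr = {
    if (depth == 0) {
      if (rng.nextBoolean()) X else Lit(rng.nextInt(201) - 100)
    } else {
      rng.nextInt(3) match {
        case 0 => Add(randomExpr(rng, depth - 1), randomExpr(rng, depth - 1))
        case 1 => Mul(randomExpr(rng, depth - 1), randomExpr(rng, depth - 1))
        case _ => randomExpr(rng, 0)
      }
    }
  }

  // True when both evaluation paths and the rewritten expression agree.
  def agree(e: Expr, x: Int): Boolean = {
    val expected = interpret(e, x)
    compile(e)(x) == expected && interpret(commute(e), x) == expected
  }

  // Returns the random expressions on which some pair of paths disagrees.
  def findMismatches(seed: Long, trials: Int): Seq[Expr] = {
    val rng = new Random(seed)
    val cases = Seq.fill(trials)((randomExpr(rng, 4), rng.nextInt(201) - 100))
    cases.collect { case (e, x) if !agree(e, x) => e }
  }
}
```

Calling `DifferentialFuzzSketch.findMismatches(42L, 200)` should return an empty sequence for this toy language. Note that this only detects disagreement between paths; it is still not an oracle for the correct answer.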

List of potential bugs found during this testing

Note that most of these bugs are problems in analysis error reporting and not legitimate bugs in query execution. This tool isn't really capable of finding "wrong answer" bugs yet because it lacks an oracle for determining what the proper query answers are.

(✅ indicates fixed, 🚧 indicates a fix in progress)

Analysis issues:

  • The createDataFrame() methods should guard against null values being passed in (e.g. the user passes null instead of Row).

  • ✅ The analyzer should check that join conditions have BooleanType: [SPARK-9292] Analysis should check that join conditions' data types are BooleanType #7630.

  • ✅ The analyzer should ensure that set operations (union, intersect, and except) are only performed on tables that have the same number of columns: [SPARK-9293] [SPARK-9813] Analysis should check that set operations are only performed on tables with equal numbers of columns #7631

  • ✅ Sorting based on array-typed columns should print an error at analysis time, not runtime. [SPARK-9295] Analysis should detect sorting on unsupported column types #7633

  • ✅ DataFrame.orderBy gives confusing analysis errors when ordering based on nested columns: https://issues.apache.org/jira/browse/SPARK-9323

  • The DATAFRAME_EAGER_ANALYSIS configuration flag does not work properly in all cases: there are still many corner-cases where invalid queries will eagerly throw analysis errors.

  • Type mismatches in joins are sometimes confusing. Let's say that we have two DataFrames with columns that have the same name, but where one column is a struct and the other is a boolean. If we try to join on a nested field, this can result in a confusing "Can't extract value" message instead of a more informative message that explains that the types are mismatched:

    val df = sqlContext.read.json(sqlContext.sparkContext.makeRDD("""{"a": {"b": 1}}""" :: Nil))
    val df2 = sqlContext.read.json(sqlContext.sparkContext.makeRDD("""{"a": false}""" :: Nil))
    df.join(df2, "a.b")
    
    org.apache.spark.sql.AnalysisException: Can't extract value from a#26607;
    at org.apache.spark.sql.catalyst.expressions.ExtractValue$.apply(ExtractValue.scala:63)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$3.apply(LogicalPlan.scala:264)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$3.apply(LogicalPlan.scala:263)
    at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
    at scala.collection.immutable.List.foldLeft(List.scala:84)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:263)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:127)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:137)
    at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:158)
    at org.apache.spark.sql.DataFrame.join(DataFrame.scala:404)
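
To make the "fail at analysis time, not at runtime" idea behind the items above concrete, here is a small sketch on a toy plan representation. It is not the Catalyst analyzer: `AnalysisCheckSketch`, its `DataType` objects, the `Plan` nodes, and `check` are hypothetical stand-ins whose names only mirror the discussion. It walks a plan and collects errors for non-boolean join conditions, set operations with mismatched column counts, and sorting on an array-typed column.

```scala
object AnalysisCheckSketch {
  // Toy data types (illustrative only; these are not Spark SQL's types).
  sealed trait DataType
  case object BooleanType extends DataType
  case object IntegerType extends DataType
  case object ArrayType extends DataType

  // Toy logical plan nodes; output lists the column types a node produces.
  sealed trait Plan { def output: Seq[DataType] }
  case class Relation(output: Seq[DataType]) extends Plan
  case class Join(left: Plan, right: Plan, conditionType: DataType) extends Plan {
    def output: Seq[DataType] = left.output ++ right.output
  }
  case class Union(left: Plan, right: Plan) extends Plan {
    def output: Seq[DataType] = left.output
  }
  case class Sort(child: Plan, sortColumn: Int) extends Plan {
    def output: Seq[DataType] = child.output
  }

  // Collects every analysis error in the plan, children first.
  def check(plan: Plan): Seq[String] = plan match {
    case Relation(_) =>
      Nil
    case Join(l, r, condType) =>
      val own: Seq[String] =
        if (condType == BooleanType) Nil
        else Seq(s"join condition has type $condType, expected BooleanType")
      check(l) ++ check(r) ++ own
    case Union(l, r) =>
      val own: Seq[String] =
        if (l.output.size == r.output.size) Nil
        else Seq(s"union of ${l.output.size} columns with ${r.output.size} columns")
      check(l) ++ check(r) ++ own
    case Sort(child, col) =>
      val own: Seq[String] = child.output.lift(col) match {
        case Some(ArrayType) => Seq(s"cannot sort on array-typed column $col")
        case None            => Seq(s"sort column $col does not exist")
        case Some(_)         => Nil
      }
      check(child) ++ own
  }
}
```

For example, `check(Union(Relation(Seq(IntegerType, IntegerType)), Relation(Seq(IntegerType))))` reports one error before anything is executed, which is the behavior the linked fixes aim for in the real analyzer.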

Execution issues:

Expression issues:

  • UTF8String.repeat can throw NegativeArraySizeException when applied to random bytes which have been cast to a string. This is caused by extreme array sizes which overflow the maximum int value.

  • UTF8String.reverse can throw ArrayIndexOutOfBoundsException when applied to random bytes which have been cast to a string.

  • ✅ The methods in the Unevaluable trait should be final and some of the new aggregate functions should inherit from this trait ([SPARK-9286] [SQL] Methods in Unevaluable should be final and AlgebraicAggregate should extend Unevaluable. #7627).

  • For extremely small inputs, the results of the Remainder expression can differ in the codegen and non-codegen paths:

    (CAST(-2147483648, FloatType) % -1.8938038E-30) (types: List(FloatType, FloatType) [-4.0832423E-31] did not equal [-8.263847E-31]
    

    This is most likely a numeric stability issue.

  • Code generation frequently crashes for expressions containing null literals, but this isn't a problem that will impact users due to our codegen fallback path.

  • NaNvl should check that its two arguments are of the same floating point type: [SPARK-9549][SQL] fix bugs in expressions #7882

  • ✅ Code-generated numeric comparison expressions may fail to compile for Boolean types: [SPARK-9549][SQL] fix bugs in expressions #7882
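
For the UTF8String.repeat item above, the following sketch shows how an unchecked size computation can produce a NegativeArraySizeException. It assumes the output size is computed as an int product of the byte length and the repeat count, which I have not confirmed against the UTF8String source here; `RepeatOverflowSketch` and its methods are hypothetical stand-ins, not Spark code.

```scala
object RepeatOverflowSketch {
  // Unchecked size computation: silently wraps around on Int overflow.
  def uncheckedSize(numBytes: Int, times: Int): Int = numBytes * times

  // Checked variant: None when the product does not fit in an Int.
  def checkedSize(numBytes: Int, times: Int): Option[Int] =
    try Some(Math.multiplyExact(numBytes, times))
    catch { case _: ArithmeticException => None }

  // Naive repeat that allocates its output using the unchecked size.
  // A wrapped-around negative size makes the allocation throw
  // NegativeArraySizeException before any bytes are copied.
  def naiveRepeat(bytes: Array[Byte], times: Int): Array[Byte] = {
    val out = new Array[Byte](uncheckedSize(bytes.length, times))
    var i = 0
    while (i < times) {
      System.arraycopy(bytes, 0, out, i * bytes.length, bytes.length)
      i += 1
    }
    out
  }
}
```

With a 3-byte input and a repeat count of 1000000000, `uncheckedSize` wraps to -1294967296, so `naiveRepeat` fails at allocation. A guard along the lines of `checkedSize` would let the expression report a clear error instead.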

Minor UX issues:

  • The ORC writer could log a more informative error message when the user isn't using a HiveContext:

    java.lang.ClassNotFoundException: org.apache.spark.sql.hive.orc.DefaultSource
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ddl.scala:206)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ddl.scala:313)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:144)
    
  • Confusing unresolved alias errors are thrown later in analysis than I'd expect. Ideally we would never see UnresolvedException: Invalid call to dataType on unresolved object, since resolution should be checked before data types are inspected.

    • dropDuplicates seems especially prone to this problem.

@SparkQA

SparkQA commented Jul 23, 2015

Test build #38263 has finished for PR 7625 at commit 4c5dc9c.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class ExpressionFuzzingSuite extends SparkFunSuite with Logging

@JoshRosen JoshRosen changed the title from "[WIP][CI-SKIP] Expression fuzz testing in Spark SQL" to "[WIP] [skip ci] Expression fuzz testing in Spark SQL" Jul 23, 2015
@JoshRosen JoshRosen changed the title from "[WIP] [skip ci] Expression fuzz testing in Spark SQL" to "[WIP] [skip ci] Fuzz testing in Spark SQL" Jul 23, 2015
@JoshRosen
Contributor Author

I just pushed a commit which adds some randomized tests of the DataFrame API, and it appears to have uncovered runtime crashes for some simple queries. I'm going to investigate and try to find deterministic minimal reproductions.

@SparkQA

SparkQA commented Jul 24, 2015

Test build #38283 has finished for PR 7625 at commit 133b27a.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class AlgebraicAggregate extends AggregateFunction2 with Serializable with Unevaluable
    • abstract class AggregateFunction1 extends LeafExpression with Serializable

@SparkQA

SparkQA commented Jul 24, 2015

Test build #38289 has finished for PR 7625 at commit dd16f4d.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class AlgebraicAggregate extends AggregateFunction2 with Serializable with Unevaluable
    • abstract class AggregateFunction1 extends LeafExpression with Serializable

@SparkQA

SparkQA commented Jul 24, 2015

Test build #38302 has finished for PR 7625 at commit 37e4ce8.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class ChangePrecision(child: Expression) extends UnaryExpression
    • abstract class AlgebraicAggregate extends AggregateFunction2 with Serializable with Unevaluable
    • abstract class AggregateFunction1 extends LeafExpression with Serializable
    • abstract class SetOperation(left: LogicalPlan, right: LogicalPlan) extends BinaryNode
    • case class Union(left: LogicalPlan, right: LogicalPlan) extends SetOperation(left, right)
    • case class Intersect(left: LogicalPlan, right: LogicalPlan) extends SetOperation(left, right)
    • case class Except(left: LogicalPlan, right: LogicalPlan) extends SetOperation(left, right)
    • case class DecimalType(precision: Int, scale: Int) extends FractionalType
    • case class DecimalConversion(precision: Int, scale: Int) extends JDBCConversion

@SparkQA

SparkQA commented Aug 16, 2015

Test build #41000 has finished for PR 7625 at commit 0c7e9d0.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class SetOperation(left: LogicalPlan, right: LogicalPlan) extends BinaryNode
    • case class Union(left: LogicalPlan, right: LogicalPlan) extends SetOperation(left, right)
    • case class Intersect(left: LogicalPlan, right: LogicalPlan) extends SetOperation(left, right)
    • case class Except(left: LogicalPlan, right: LogicalPlan) extends SetOperation(left, right)

@JoshRosen JoshRosen closed this Sep 2, 2015
@SparkQA

SparkQA commented Aug 16, 2016

Test build #63813 has finished for PR 7625 at commit 7664e37.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 16, 2016

Test build #63815 has finished for PR 7625 at commit d1d3d53.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen JoshRosen closed this Aug 18, 2016
@JoshRosen JoshRosen deleted the fuzz-test branch August 18, 2016 22:27
@joshrosen-stripe joshrosen-stripe restored the fuzz-test branch August 30, 2019 00:42
4 participants