
[SPARK-16866][SQL] Infrastructure for file-based SQL end-to-end tests #14472

Closed
Wants to merge 11 commits.

Conversation

@petermaxlee (Contributor) commented Aug 3, 2016

What changes were proposed in this pull request?

This patch introduces SQLQueryTestSuite, a basic framework for end-to-end SQL test cases defined in spark/sql/core/src/test/resources/sql-tests. This is the more standard way to test SQL queries end-to-end in other open source database systems, because test cases kept in files are easier to manage.

This is inspired by HiveCompatibilitySuite, but simplified for general Spark SQL tests. Once this is merged, I can work towards porting SQLQuerySuite over, and eventually also move the existing HiveCompatibilitySuite to use this framework.

Unlike HiveCompatibilitySuite, SQLQueryTestSuite compares both the output schema and the output data (in string form).

When there is a mismatch, the error message looks like the following:

[info] - blacklist.sql !!! IGNORED !!!
[info] - number-format.sql *** FAILED *** (2 seconds, 405 milliseconds)
[info]   Expected "...147483648 -214748364[8]", but got "...147483648   -214748364[9]" Result should match for query #1 (SQLQueryTestSuite.scala:171)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495)
[info]   at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
[info]   at org.scalatest.Assertions$class.assertResult(Assertions.scala:1171)

How was this patch tested?

This is a test infrastructure change.

private def listTestCases(): Seq[TestCase] = {
  listFilesRecursively(new File(inputFilePath)).map { file =>
@petermaxlee (author) commented:

This might not work for Maven; I will look into this later.

@petermaxlee (author) commented:

Now it should work for Maven.

@petermaxlee (author) commented:

cc @cloud-fan and @rxin for feedback.

@SparkQA commented Aug 3, 2016

Test build #63148 has finished for PR 14472 at commit ba9b678.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class SQLQueryTestSuite extends QueryTest with SharedSQLContext

QueryOutput(
  sql = sql,
  schema = df.schema.map(_.dataType.simpleString).mkString(", "),
  output = df.showString(_numRows = 10000, truncate = 10000).trim)
@petermaxlee (author) commented:

Using showString might not be the friendliest option when there is a mismatch and the output is huge, but it should work very well with smaller outputs.

A reviewer (Contributor) commented:

org.apache.spark.sql.catalyst.util.sideBySide can show which line is mismatched. Can we borrow this idea?
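A minimal sketch of that suggestion (illustrative only, not part of this patch; the two input strings are made up):

```scala
import org.apache.spark.sql.catalyst.util.sideBySide

// sideBySide pairs up the lines of two multi-line strings and prefixes
// each mismatching pair with "!", so the diverging line stands out.
val expected = "one\ntwo\nthree"
val actual = "one\nTWO\nthree"
println(sideBySide(expected, actual).mkString("\n"))
```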

@rxin (Contributor) commented Aug 3, 2016

Seems like a good idea.

@cloud-fan can you review?

@SparkQA commented Aug 3, 2016

Test build #63149 has finished for PR 14472 at commit 9b360da.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


/** A single SQL query's output. */
private case class QueryOutput(sql: String, schema: String, output: String) {
  def toXML: String = {
A reviewer (Contributor) commented:

Is it a convention to use XML as the golden file format? It's really hard to read...

@petermaxlee (author) commented:

For the output file, one other way I thought about was to use:

-- query 1
query
-- query 1: schema
schema
-- query 1: result
result
-- query 2
query
-- query 2: schema
schema
-- query 2: result
result

We can do string parsing there but it'd be less rigorous than XML.
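As a rough illustration, that format could be parsed by splitting on the marker lines and grouping the remaining segments in threes (a hypothetical sketch; goldenFileText stands for the file's contents):

```scala
// Hypothetical sketch: drop the "-- query ..." marker lines, then group
// the remaining segments in threes: query text, schema, result.
val blocks = goldenFileText
  .split("(?m)^-- query .*$")
  .map(_.trim)
  .filter(_.nonEmpty)
val parsed: Seq[(String, String, String)] = blocks.grouped(3).collect {
  case Array(query, schema, result) => (query, schema, result)
}.toSeq
```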

@cloud-fan (Contributor) commented:

Is there any case where people need to write a golden file themselves? If we wanna use a standard format, I prefer JSON over XML.

@petermaxlee (author) commented:

I don't think these golden files should be manually created. They should always be generated. I tried JSON earlier and it was not very friendly either (worse than XML in this case).

Do you like the format I showed above?

@cloud-fan (Contributor) commented:

Yea, it looks much better. Although golden files are always generated, we should make them easy to read and to verify for correctness.

@petermaxlee (author) commented:

I have updated this to use a custom format that is more readable.

@SparkQA commented Aug 4, 2016

Test build #63197 has finished for PR 14472 at commit a1e1b57.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 4, 2016

Test build #3201 has finished for PR 14472 at commit a1e1b57.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -99,4 +99,5 @@ spark-deps-.*
.*tsv
org.apache.spark.scheduler.ExternalClusterManager
.*\.sql
.*\.sql\.xml
A reviewer (Contributor) commented:

Should we add a rule for .out?

@SparkQA commented Aug 6, 2016

Test build #63297 has finished for PR 14472 at commit 2352d6f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -99,4 +99,5 @@ spark-deps-.*
.*tsv
org.apache.spark.scheduler.ExternalClusterManager
.*\.sql
.*\.sql\.out
A reviewer (Contributor) commented:

This is not needed; we already have a rule: .*out

@petermaxlee (author) commented:

I have updated this based on review feedback. The harness is now using hiveResultString() rather than show(), similar to the existing Hive compatibility test.
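For context, the difference looks roughly like this (a sketch assuming Spark 2.x, where QueryExecution exposes hiveResultString(); spark is a SparkSession):

```scala
// hiveResultString() renders each result row as one tab-separated,
// Hive-style line instead of show()'s ASCII table, which keeps the
// golden files compact and diff-friendly.
val df = spark.sql("SELECT 1 AS a, 'x' AS b")
val lines: Seq[String] = df.queryExecution.hiveResultString()
lines.foreach(println) // prints: 1	x
```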

@SparkQA commented Aug 9, 2016

Test build #63471 has finished for PR 14472 at commit 0359756.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 10, 2016

Test build #63476 has finished for PR 14472 at commit 5eb01fe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// val cleaned = input.split("\n").filterNot(_.matches("--.*(?<=[^\\\\]);")).mkString("\n")
val cleaned = input.split("\n").filterNot(_.startsWith("--")).mkString("\n")
// note: this is not a robust way to split queries using semicolon, but works for now.
cleaned.split("(?<=[^\\\\]);").map(_.trim).filterNot(q => q == "").toSeq
A reviewer (Contributor) commented:

.filter(_.nonEmpty)

@petermaxlee (author) commented:

Isn't it strictly less clear? It is more obvious that this is a string this way.

@petermaxlee (author) commented:

I changed it to filter(_ != "")


// List of SQL queries to run
val queries: Seq[String] = {
  // val cleaned = input.split("\n").filterNot(_.matches("--.*(?<=[^\\\\]);")).mkString("\n")
A reviewer (Contributor) commented:

remove this line?

@petermaxlee (author) commented:

done

val queries: Seq[String] = {
  val cleaned = input.split("\n").filterNot(_.startsWith("--")).mkString("\n")
  // note: this is not a robust way to split queries using semicolon, but works for now.
  cleaned.split("(?<=[^\\\\]);").map(_.trim).filter(_ != "").toSeq
A reviewer (Contributor) commented:

can you explain a bit more about this regex? Why can't we just use ; here?

@petermaxlee (author) commented:

I copied this from HiveComparisonTest. I think it is done to avoid splitting on an escaped ";". It is really not very robust, but seems fine for now.
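To illustrate what the lookbehind buys (a standalone sketch, not from the patch):

```scala
// "(?<=[^\\\\]);" splits on ';' only when the preceding character is not
// a backslash, so an escaped "\;" inside a query survives the split.
val input = "select 1; select 'a\\;b'; select 2"
input.split("(?<=[^\\\\]);").map(_.trim).foreach(println)
// select 1
// select 'a\;b'
// select 2
```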

@cloud-fan (Contributor) commented:

Mostly LGTM. Can you try some error cases and put the error message in the PR description? Then we can have a better understanding of how this framework reports errors. Thanks!

@petermaxlee (author) commented:

Updated the description. It might make sense to switch over to the way HiveCompatibilitySuite reports mismatches, but we should decide that after porting a few tests that have larger outputs.

@cloud-fan (Contributor) commented:

number-format.sql *** FAILED *** (2 seconds, 405 milliseconds)

do you have any idea why the test is so slow?

@petermaxlee (author) commented Aug 10, 2016:

It was the first time any query was run, so the JIT hadn't really kicked in yet, and many classes still needed to be loaded into the JVM.

val expectedOutputs: Seq[QueryOutput] = {
  val goldenOutput = fileToString(new File(testCase.resultFile))
  val segments = goldenOutput.split("-- !query.+\n")
  assert(segments.size == outputs.size * 3 + 1) // each query has 3 segments, plus the header
A reviewer (Contributor) commented:

assert((segments.size - 1) % 3 == 0)? Or we will never hit this https://github.com/apache/spark/pull/14472/files#diff-432455394ca50800d5de508861984ca5R164

@petermaxlee (author) commented:

Is it a problem that some asserts are never hit? Some logic might change later and then one of the asserts would fail. I prefer to be more conservative with asserts.

A reviewer (Contributor) commented:

But this assert doesn't match the logic in this branch. What we do here is skip the first segment and group the remaining segments by 3; each group corresponds to one QueryOutput. We are not comparing the real output with the expected output in this branch.

@petermaxlee (author) commented:

But then why would % 3 be better? Are you arguing for a better message when it fails?

@petermaxlee (author) commented:

I don't get "this assert doesn't match the logic in this branch". There is no logic that dictates we cannot verify the number of blocks here.

A reviewer (Contributor) commented:

Even if segments.size == outputs.size * 3 + 1 fails, we can still finish this branch, right? However, if (segments.size - 1) % 3 == 0 fails, we will throw an ArrayIndexOutOfBoundsException in this branch.

Anyway, it's bad to have dead code; if we wanna keep this assert, we should remove https://github.com/apache/spark/pull/14472/files#diff-432455394ca50800d5de508861984ca5R164 and move its error message here.

@petermaxlee (author) commented:

I will add a better error message, but I'm afraid I disagree with you on removing the other assert. It is not dead code, because it is exercised at runtime. The two asserts encode different assumptions at different places in the code. We could change the way we arrange blocks in the future, and then the other assert would be useful.

Anyway I am not sure why you are nitpicking on this. It seems very minor and we are simply wasting time.

@petermaxlee (author) commented:

Basically asserts are used as defensive guards against program errors. By your definition almost all asserts are "dead code".

@petermaxlee (author) commented:

I think I get where you are coming from. You think of the assert as something that verifies correctness for a test case (in the ScalaTest sense). I was using assert as a defensive guard to catch errors (basic invariants that shouldn't be violated in this tiny block).
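For readers following the thread, this is roughly the logic being debated (a sketch reusing the names from the snippet above; the assert message is illustrative):

```scala
// Skip the header segment, then group the rest in threes
// (query, schema, result); one group per QueryOutput. The assert guards
// the invariant that the golden file has exactly three segments per query.
assert(segments.size == outputs.size * 3 + 1,
  s"Expected ${outputs.size * 3 + 1} segments, but found ${segments.size}")
val expectedOutputs: Seq[QueryOutput] = segments.drop(1).grouped(3).map {
  case Array(sql, schema, result) => QueryOutput(sql.trim, schema.trim, result.trim)
}.toSeq
```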

@SparkQA commented Aug 10, 2016

Test build #63506 has finished for PR 14472 at commit 7497742.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 10, 2016

Test build #63507 has finished for PR 14472 at commit 14f4959.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit closed this in b9f8a11 on Aug 10, 2016
@cloud-fan (Contributor) commented:

thanks, merging to master!

@SparkQA commented Aug 10, 2016

Test build #63515 has finished for PR 14472 at commit 288b699.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@petermaxlee (author) commented:

Thanks, @cloud-fan !

asfgit pushed a commit that referenced this pull request Aug 11, 2016

Author: petermaxlee <petermaxlee@gmail.com>

Closes #14472 from petermaxlee/SPARK-16866.

(cherry picked from commit b9f8a11)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@cloud-fan (Contributor) commented:

backport to 2.0!
