
[SPARK-16866][SQL] Infrastructure for file-based SQL end-to-end tests #14472

Closed
Wants to merge 11 commits.

Conversation

@petermaxlee (Contributor) commented Aug 3, 2016

What changes were proposed in this pull request?

This patch introduces SQLQueryTestSuite, a basic framework for end-to-end SQL test cases defined in spark/sql/core/src/test/resources/sql-tests. This is the more standard way to test SQL queries end-to-end in other open source database systems, because test cases kept in files are easier to manage.

This is inspired by HiveCompatibilitySuite, but simplified for general Spark SQL tests. Once this is merged, I can work towards porting SQLQuerySuite over, and eventually also move the existing HiveCompatibilitySuite to use this framework.

Unlike HiveCompatibilitySuite, SQLQueryTestSuite compares both the output schema and the output data (in string form).

When there is a mismatch, the error message looks like the following:

[info] - blacklist.sql !!! IGNORED !!!
[info] - number-format.sql *** FAILED *** (2 seconds, 405 milliseconds)
[info]   Expected "...147483648 -214748364[8]", but got "...147483648   -214748364[9]" Result should match for query #1 (SQLQueryTestSuite.scala:171)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495)
[info]   at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
[info]   at org.scalatest.Assertions$class.assertResult(Assertions.scala:1171)

How was this patch tested?

This is a test infrastructure change.

private def listTestCases(): Seq[TestCase] = {
  listFilesRecursively(new File(inputFilePath)).map { file =>
@petermaxlee (author) commented:

This might not work for Maven; I will look into this later.

@petermaxlee (author) commented:

Now it should work for Maven.

@petermaxlee (author) commented:

cc @cloud-fan and @rxin for feedback.

@SparkQA commented Aug 3, 2016

Test build #63148 has finished for PR 14472 at commit ba9b678.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class SQLQueryTestSuite extends QueryTest with SharedSQLContext

QueryOutput(
  sql = sql,
  schema = df.schema.map(_.dataType.simpleString).mkString(", "),
  output = df.showString(_numRows = 10000, truncate = 10000).trim)
@petermaxlee (author) commented:

Using showString might not be the friendliest option when there is a mismatch and the output is huge, but it should work very well with smaller outputs.

A reviewer (Contributor) commented:

org.apache.spark.sql.catalyst.util.sideBySide can show which line is mismatched. Can we borrow this idea?
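A minimal sketch of that suggestion (illustrative only, not part of this patch; the two input strings are made up):

```scala
import org.apache.spark.sql.catalyst.util.sideBySide

// sideBySide pairs up the lines of two multi-line strings and prefixes
// each mismatching pair with "!", so the diverging line stands out.
val expected = "one\ntwo\nthree"
val actual = "one\nTWO\nthree"
println(sideBySide(expected, actual).mkString("\n"))
```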

@rxin (Contributor) commented Aug 3, 2016

Seems like a good idea.

@cloud-fan can you review?

@SparkQA commented Aug 3, 2016

Test build #63149 has finished for PR 14472 at commit 9b360da.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


/** A single SQL query's output. */
private case class QueryOutput(sql: String, schema: String, output: String) {
  def toXML: String = {
A reviewer (Contributor) commented:

Is it a convention to use XML as the golden file format? It's really hard to read...

@petermaxlee (author) commented:

For the output file, one other way I thought about was to use:

-- query 1
query
-- query 1: schema
schema
-- query 1: result
result
-- query 2
query
-- query 2: schema
schema
-- query 2: result
result

We can do string parsing there but it'd be less rigorous than XML.
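As a rough illustration, that format could be parsed by splitting on the marker lines and grouping the remaining segments in threes (a hypothetical sketch; goldenFileText stands for the file's contents):

```scala
// Hypothetical sketch: drop the "-- query ..." marker lines, then group
// the remaining segments in threes: query text, schema, result.
val blocks = goldenFileText
  .split("(?m)^-- query .*$")
  .map(_.trim)
  .filter(_.nonEmpty)
val parsed: Seq[(String, String, String)] = blocks.grouped(3).collect {
  case Array(query, schema, result) => (query, schema, result)
}.toSeq
```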

@cloud-fan (Contributor) commented:

Is there any case where people need to write a golden file themselves? If we wanna use a standard format, I prefer JSON over XML.

@petermaxlee (author) commented:

I don't think these golden files should be manually created. They should always be generated. I tried JSON earlier and it was not very friendly either (worse than XML in this case).

Do you like the format I showed above?

@cloud-fan (Contributor) commented:

Yea, it looks much better. Although golden files are always generated, we should make them easy to read and to verify for correctness.

@petermaxlee (author) commented:

I have updated this to use a custom format that is more readable.

@SparkQA commented Aug 4, 2016

Test build #63197 has finished for PR 14472 at commit a1e1b57.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 4, 2016

Test build #3201 has finished for PR 14472 at commit a1e1b57.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -99,4 +99,5 @@ spark-deps-.*
.*tsv
org.apache.spark.scheduler.ExternalClusterManager
.*\.sql
.*\.sql\.xml
A reviewer (Contributor) commented:

Should we add a rule for .out?

@SparkQA commented Aug 6, 2016

Test build #63297 has finished for PR 14472 at commit 2352d6f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -99,4 +99,5 @@ spark-deps-.*
.*tsv
org.apache.spark.scheduler.ExternalClusterManager
.*\.sql
.*\.sql\.out
A reviewer (Contributor) commented:

This is not needed; we already have a rule: .*out

@petermaxlee (author) commented:

I have updated this based on review feedback. The harness is now using hiveResultString() rather than show(), similar to the existing Hive compatibility test.
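For context, the difference looks roughly like this (a sketch assuming Spark 2.x, where QueryExecution exposes hiveResultString(); spark is a SparkSession):

```scala
// hiveResultString() renders each result row as one tab-separated,
// Hive-style line instead of show()'s ASCII table, which keeps the
// golden files compact and diff-friendly.
val df = spark.sql("SELECT 1 AS a, 'x' AS b")
val lines: Seq[String] = df.queryExecution.hiveResultString()
lines.foreach(println) // prints: 1	x
```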

@SparkQA commented Aug 9, 2016

Test build #63471 has finished for PR 14472 at commit 0359756.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 10, 2016

Test build #63476 has finished for PR 14472 at commit 5eb01fe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// val cleaned = input.split("\n").filterNot(_.matches("--.*(?<=[^\\\\]);")).mkString("\n")
val cleaned = input.split("\n").filterNot(_.startsWith("--")).mkString("\n")
// note: this is not a robust way to split queries using semicolon, but works for now.
cleaned.split("(?<=[^\\\\]);").map(_.trim).filterNot(q => q == "").toSeq
A reviewer (Contributor) commented:

.filter(_.nonEmpty)

@petermaxlee (author) commented:

Isn't it strictly less clear? It is more obvious that this is a string this way.

@petermaxlee (author) commented:

I changed it to filter(_ != "")


// List of SQL queries to run
val queries: Seq[String] = {
  // val cleaned = input.split("\n").filterNot(_.matches("--.*(?<=[^\\\\]);")).mkString("\n")
A reviewer (Contributor) commented:

remove this line?

@petermaxlee (author) commented:

done

val queries: Seq[String] = {
  val cleaned = input.split("\n").filterNot(_.startsWith("--")).mkString("\n")
  // note: this is not a robust way to split queries using semicolon, but works for now.
  cleaned.split("(?<=[^\\\\]);").map(_.trim).filter(_ != "").toSeq
A reviewer (Contributor) commented:

can you explain a bit more about this regex? Why can't we just use ; here?

@petermaxlee (author) commented:

I copied this from HiveComparisonTest. I think it is done to avoid splitting on an escaped ";". It is really not very robust, but seems fine for now.
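To illustrate what the lookbehind buys (a standalone sketch, not from the patch):

```scala
// "(?<=[^\\\\]);" splits on ';' only when the preceding character is not
// a backslash, so an escaped "\;" inside a query survives the split.
val input = "select 1; select 'a\\;b'; select 2"
input.split("(?<=[^\\\\]);").map(_.trim).foreach(println)
// select 1
// select 'a\;b'
// select 2
```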

@cloud-fan (Contributor) commented:

Mostly LGTM. Can you try some error cases and put the error message in the PR description? Then we can have a better understanding of how this framework reports errors. Thanks!

@petermaxlee (author) commented:

Updated the description. It might make sense to switch over to the way HiveCompatibilitySuite reports mismatches, but we should decide that after porting a few tests that have larger outputs.

@cloud-fan (Contributor) commented:

number-format.sql *** FAILED *** (2 seconds, 405 milliseconds)

do you have any idea why the test is so slow?

@petermaxlee (author) commented Aug 10, 2016:

It was the first time any query was run, so the JIT hadn't really kicked in yet, and many classes still needed to be loaded into the JVM.

val expectedOutputs: Seq[QueryOutput] = {
  val goldenOutput = fileToString(new File(testCase.resultFile))
  val segments = goldenOutput.split("-- !query.+\n")
  assert(segments.size == outputs.size * 3 + 1) // each query has 3 segments, plus the header
A reviewer (Contributor) commented:

assert((segments.size - 1) % 3 == 0)? Or we will never hit this https://github.com/apache/spark/pull/14472/files#diff-432455394ca50800d5de508861984ca5R164

@petermaxlee (author) commented:

Is it a problem that some asserts are never hit? Some logic might change later and then one of the asserts would fail. I prefer to be more conservative with asserts.

A reviewer (Contributor) commented:

But this assert doesn't match the logic in this branch. What we do here is skip the first segment and group the remaining segments by 3; each group corresponds to one QueryOutput. We are not comparing the real output with the expected output in this branch.

@petermaxlee (author) commented:

But then why would % 3 be better? Are you arguing for a better message when it fails?

@petermaxlee (author) commented:

I don't get "this assert doesn't match the logic in this branch". There is no logic that dictates we cannot verify the number of blocks here.

A reviewer (Contributor) commented:

Even if segments.size == outputs.size * 3 + 1 fails, we can still finish this branch, right? However, if (segments.size - 1) % 3 == 0 fails, we will throw an ArrayIndexOutOfBoundsException in this branch.

Anyway, it's bad to have dead code; if we wanna keep this assert, we should remove https://github.com/apache/spark/pull/14472/files#diff-432455394ca50800d5de508861984ca5R164 and move its error message here.

@petermaxlee (author) commented:

I will add a better error message, but I'm afraid I disagree with you on removing the other assert. It is not dead code, because it is exercised at runtime. The two asserts encode different assumptions at different places in the code. We could change the way we arrange blocks in the future, and then the other assert would be useful.

Anyway I am not sure why you are nitpicking on this. It seems very minor and we are simply wasting time.

@petermaxlee (author) commented:

Basically asserts are used as defensive guards against program errors. By your definition almost all asserts are "dead code".

@petermaxlee (author) commented:

I think I get where you are coming from. You think of the assert as something that verifies correctness for a test case (in the ScalaTest sense). I was using assert as a defensive guard to catch errors (basic invariants that shouldn't be violated in this tiny block).
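For readers following the thread, this is roughly the logic being debated (a sketch reusing the names from the snippet above; the assert message is illustrative):

```scala
// Skip the header segment, then group the rest in threes
// (query, schema, result); one group per QueryOutput. The assert guards
// the invariant that the golden file has exactly three segments per query.
assert(segments.size == outputs.size * 3 + 1,
  s"Expected ${outputs.size * 3 + 1} segments, but found ${segments.size}")
val expectedOutputs: Seq[QueryOutput] = segments.drop(1).grouped(3).map {
  case Array(sql, schema, result) => QueryOutput(sql.trim, schema.trim, result.trim)
}.toSeq
```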

@SparkQA commented Aug 10, 2016

Test build #63506 has finished for PR 14472 at commit 7497742.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 10, 2016

Test build #63507 has finished for PR 14472 at commit 14f4959.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit closed this in b9f8a11 on Aug 10, 2016
@cloud-fan (Contributor) commented:

thanks, merging to master!

@SparkQA commented Aug 10, 2016

Test build #63515 has finished for PR 14472 at commit 288b699.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@petermaxlee (author) commented:

Thanks, @cloud-fan !

asfgit pushed a commit that referenced this pull request Aug 11, 2016

Author: petermaxlee <petermaxlee@gmail.com>

Closes #14472 from petermaxlee/SPARK-16866.

(cherry picked from commit b9f8a11)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@cloud-fan (Contributor) commented:

backport to 2.0!
