
[SPARK-31372][SQL][TEST] Display expression schema for double check. #28194

Closed
wants to merge 24 commits

Conversation

beliefer
Contributor

@beliefer beliefer commented Apr 12, 2020

What changes were proposed in this pull request?

Although SPARK-30184 implemented a helper method for aliasing functions, developers often forget to use it.
We need stronger guarantees that the aliases output by built-in functions are correct.
This PR extracts the SQL from each expression's examples and writes the SQL together with its schema into one golden file.
By checking the golden file, we can find the expressions whose aliases are not displayed correctly, and then fix them.
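To make the mechanism concrete, here is a minimal, self-contained sketch of the golden-file idea (all names hypothetical; the real suite derives the (SQL, schema) pairs from the examples registered in `FunctionRegistry` and runs them through `spark.sql`): on the first run the captured output is recorded, and on later runs it is compared against the recorded file.

```scala
import java.nio.file.{Files, Paths}

object GoldenFileSketch {
  // Hypothetical stand-in for the (sql, schema) pairs the suite captures.
  case class QueryOutput(sql: String, schema: String)

  def main(args: Array[String]): Unit = {
    val outputs = Seq(
      QueryOutput("SELECT abs(-1)", "struct<abs(-1):int>"),
      QueryOutput("SELECT 1 + 2", "struct<(1 + 2):int>"))

    // Render each pair as one Markdown table row, as the PR's golden file does.
    val rendered = outputs
      .map(o => s"| ${o.sql} | ${o.schema} |")
      .mkString("\n")

    val golden = Paths.get(sys.props("java.io.tmpdir"), "expression-schema-demo.md")
    if (!Files.exists(golden)) {
      // Regeneration mode: record the current output as the golden file.
      Files.write(golden, rendered.getBytes("UTF-8"))
    } else {
      // Check mode: any alias change shows up as a mismatch against the golden file.
      val expected = new String(Files.readAllBytes(golden), "UTF-8")
      assert(expected == rendered,
        "Schema drift detected; regenerate the golden file and review the diff")
    }
  }
}
```

The second run takes the check path, so an accidental alias change fails the test instead of silently reaching users.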

Why are the changes needed?

Ensures that the output aliases are correct.

Does this PR introduce any user-facing change?

'No'.

How was this patch tested?

Jenkins test.

@SparkQA

SparkQA commented Apr 12, 2020

Test build #121141 has finished for PR 28194 at commit 8d976a2.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 12, 2020

Test build #121140 has finished for PR 28194 at commit 36a7ce8.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class ExpressionsSchemaSuite extends QueryTest with SharedSparkSession
  • protected case class QueryOutput(sql: String, schema: String)

@beliefer beliefer changed the title [SPARK-31372][SQL][Test] Display expression schema for double check. [SPARK-31372][SQL][TEST] Display expression schema for double check. Apr 12, 2020
@beliefer
Contributor Author

retest this please

@SparkQA

SparkQA commented Apr 12, 2020

Test build #121145 has finished for PR 28194 at commit 8d976a2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

-- !query
SELECT forall(array(2, null, 8), x -> x % 2 == 0)
-- !query schema
struct<forall(array(2, CAST(NULL AS INT), 8), lambdafunction(((namedlambdavariable() % 2) = 0), namedlambdavariable())):boolean>
Member

Actually, we do not need all of them. How about just showing the first one?

Contributor Author

OK, we only need the first one; the other SQL statements and schemas add no more value. Thanks for the reminder.

@SparkQA

SparkQA commented Apr 13, 2020

Test build #121165 has finished for PR 28194 at commit 31c7984.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@beliefer
Contributor Author

retest this please

@SparkQA

SparkQA commented Apr 13, 2020

Test build #121194 has finished for PR 28194 at commit c7b565b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@beliefer
Contributor Author

retest this please

@SparkQA

SparkQA commented Apr 13, 2020

Test build #121177 has finished for PR 28194 at commit 31c7984.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 13, 2020

Test build #121205 has finished for PR 28194 at commit c7b565b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 14, 2020

Test build #121245 has finished for PR 28194 at commit 42910c2.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@beliefer
Contributor Author

retest this please

@SparkQA

SparkQA commented Apr 14, 2020

Test build #121258 has finished for PR 28194 at commit 42910c2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@beliefer
Contributor Author

cc @gatorsmile @cloud-fan @maropu

@gatorsmile
Member

I discussed this with @beliefer offline. Let us generate an MD file as the golden file. We can build an MD table to present the schema info.

@SparkQA

SparkQA commented Apr 16, 2020

Test build #121359 has finished for PR 28194 at commit c6ed125.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • protected case class QueryOutput(

@beliefer
Contributor Author

retest this please

sys.props.getOrElse("spark.test.home", sys.env("SPARK_HOME"))
}

java.nio.file.Paths.get(sparkHome,
Contributor

In BenchmarkBase.main, we generate the path by simply doing

val file = new File(s"benchmarks/$resultFileName")
if (!file.exists()) {
  file.createNewFile()
}

Does it work here?

Contributor Author

It works too.

Contributor

Then why do we write such complex code here to generate the path?

Contributor Author
@beliefer beliefer Apr 30, 2020

We need to create the parent dir sql-functions first.
I can replace the code below

    java.nio.file.Paths.get(sparkHome,
      "sql", "core", "src", "test", "resources", "sql-functions").toFile

as

val file = new File(s"$sparkHome/sql/core/src/test/resources/sql-functions")
if (!file.exists()) {
  file.mkdir()
}

But I am neutral about this change.
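For what it's worth, `java.io.File.mkdirs()` (note the plural) creates any missing parent directories in one call, so neither variant needs a separate step for a missing parent. A small standalone sketch under a temp path (directory names hypothetical):

```scala
import java.io.File

object MkdirsSketch {
  def main(args: Array[String]): Unit = {
    // mkdir() fails when the parent is missing; mkdirs() creates the whole chain.
    val dir = new File(sys.props("java.io.tmpdir"), "mkdirs-demo/sql-functions")
    if (!dir.exists()) {
      dir.mkdirs()
    }
    assert(dir.exists() && dir.isDirectory)
  }
}
```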


/** A single SQL query's SQL and schema. */
protected case class QueryOutput(
className: String,
Contributor

nit: 4-space indentation for parameters.

Contributor Author

OK.

}

// Compare results.
assertResult(expectedOutputs.size, s"Number of queries should be ${expectedOutputs.size}") {
Contributor

nit: isn't it simply assert(expectedOutputs.size == outputs.size, error message ...)?

Contributor Author

OK
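For illustration, the two assertion styles express the same size check. A plain-Scala sketch (names hypothetical; `assertResult` is shown only in a comment since it is a ScalaTest API):

```scala
object AssertSketch {
  def main(args: Array[String]): Unit = {
    val expectedOutputs = Seq("q1", "q2", "q3")
    val outputs = Seq("q1", "q2", "q3")

    // ScalaTest form used in the PR:
    //   assertResult(expectedOutputs.size,
    //       s"Number of queries should be ${expectedOutputs.size}") {
    //     outputs.size
    //   }
    // The simpler equivalent suggested in the review:
    assert(expectedOutputs.size == outputs.size,
      s"Number of queries should be ${expectedOutputs.size}")
  }
}
```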

@SparkQA

SparkQA commented Apr 29, 2020

Test build #122044 has finished for PR 28194 at commit 460da00.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 29, 2020

Test build #122049 has finished for PR 28194 at commit 133456d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 29, 2020

Test build #122065 has finished for PR 28194 at commit a4d4de9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 29, 2020

Test build #122063 has finished for PR 28194 at commit e571667.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

| org.apache.spark.sql.catalyst.expressions.Abs | abs | SELECT abs(-1) | struct<abs(-1):int> |
| org.apache.spark.sql.catalyst.expressions.Acos | acos | SELECT acos(1) | struct<ACOS(CAST(1 AS DOUBLE)):double> |
| org.apache.spark.sql.catalyst.expressions.Acosh | acosh | SELECT acosh(1) | struct<ACOSH(CAST(1 AS DOUBLE)):double> |
| org.apache.spark.sql.catalyst.expressions.Add | + | SELECT 1 + 2 | struct<(1 + 2):int> |
Contributor
@cloud-fan cloud-fan Apr 30, 2020

One case where running multiple examples is useful: date + interval will be replaced by DateAddInterval during analysis, and it's better to test that as well.

We can fix it in a followup.

Contributor Author
@beliefer beliefer Apr 30, 2020

I checked the code of DateAddInterval:
override def sql: String = s"${left.sql} + ${right.sql}"
The alias of DateAddInterval is consistent with Add.
If we treat DateAddInterval as an internal implementation of Add, this does not affect users.

@cloud-fan
Contributor

thanks, merging to master/3.0!

@cloud-fan cloud-fan closed this in 1d1bb79 Apr 30, 2020
cloud-fan pushed a commit that referenced this pull request Apr 30, 2020
### What changes were proposed in this pull request?
Although SPARK-30184 implemented a helper method for aliasing functions, developers often forget to use it.
We need stronger guarantees that the aliases output by built-in functions are correct.
This PR extracts the SQL from each expression's examples and writes the SQL together with its schema into one golden file.
By checking the golden file, we can find the expressions whose aliases are not displayed correctly, and then fix them.

### Why are the changes needed?
Ensures that the output aliases are correct.

### Does this PR introduce any user-facing change?
'No'.

### How was this patch tested?
Jenkins test.

Closes #28194 from beliefer/check-expression-schema.

Lead-authored-by: beliefer <beliefer@163.com>
Co-authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 1d1bb79)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@beliefer
Contributor Author

@cloud-fan @gatorsmile @maropu Thanks for all your help!

@dongjoon-hyun
Member

dongjoon-hyun commented May 1, 2020

Hi, guys.
This seems to break branch-3.0 UT. All Jenkins jobs on branch-3.0 are failing. Could you take a look?

  • org.apache.spark.sql.ExpressionsSchemaSuite.Check schemas for expression examples

(Screenshots: Jenkins failures on branch-3.0, taken Apr 30, 2020 at 8:04 PM and 8:07 PM.)

@maropu
Member

maropu commented May 1, 2020

@beliefer could you do a follow-up? It seems we just need to update the golden file for 3.0.

@maropu
Member

maropu commented May 1, 2020

#28427

@maropu
Member

maropu commented May 1, 2020

To recover 3.0 asap, I opened a PR to fix it.

if (regenerateGoldenFiles) {
  val missingExampleStr = missingExamples.mkString(",")
  val goldenOutput = {
    s"<!-- Automatically generated by${getClass.getSimpleName} -->\n" +
Member

nit: Needs a space before ${getClass.getSimpleName}. We can fix it when updating this file next time.

Contributor Author
@beliefer beliefer May 1, 2020

Let me do it.
#28164

Member

Ah, I see. Looks okay.

@@ -0,0 +1,341 @@
<!-- Automatically generated byExpressionsSchemaSuite -->
Member

Why is this file md specifically?

Contributor

See #28194 (comment)

I guess it's easier for humans to read the table.

Member

Hopefully we can improve the diff here ...

Contributor

This should be a row-by-row diff; @beliefer can you help to fix it?

Contributor Author

@cloud-fan @HyukjinKwon
This exception is thrown when the number of expected SQL statements and the number of actual SQL statements are not equal.
Could I avoid outputting all the SQL in the exception?

HyukjinKwon pushed a commit that referenced this pull request May 5, 2020
… that easy to track the diff

### What changes were proposed in this pull request?
This PR follows up #28194.
As discussed at https://github.com/apache/spark/pull/28194/files#r418418796.
This PR improves `ExpressionsSchemaSuite` so that the diff is easy to track.
Although `ExpressionsSchemaSuite` at line
https://github.com/apache/spark/blob/b7cde42b04b21c9bfee6535199cf385855c15853/sql/core/src/test/scala/org/apache/spark/sql/ExpressionsSchemaSuite.scala#L165
only wants to compare the expected output size with the newest output size, the scalatest framework outputs extra information containing the full content of both the expected and the newest output.
This PR avoids that issue.
After this PR, the exception looks like below:
```
[info] - Check schemas for expression examples *** FAILED *** (7 seconds, 336 milliseconds)
[info]   340 did not equal 341 Expected 332 blocks in result file but got 333. Try regenerate the result files. (ExpressionsSchemaSuite.scala:167)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530)
[info]   at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529)
[info]   at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
[info]   at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:503)
[info]   at org.apache.spark.sql.ExpressionsSchemaSuite.$anonfun$new$1(ExpressionsSchemaSuite.scala:167)
```

### Why are the changes needed?
Make the exception more concise and clear.

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
Jenkins test.

Closes #28430 from beliefer/improve-expressions-schema-suite.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
HyukjinKwon pushed a commit that referenced this pull request May 5, 2020
… that easy to track the diff

(Same commit message as the one above, cherry-picked to branch-3.0.)

Closes #28430 from beliefer/improve-expressions-schema-suite.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit b949420)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
jdcasale pushed a commit to palantir/spark that referenced this pull request Jun 22, 2021
…ment in output schema

This PR intends to update `sql` in `Rand`/`Randn` with no argument to make a column name deterministic.

Before this PR (a column name changes run-by-run):
```
scala> sql("select rand()").show()
+-------------------------+
|rand(7986133828002692830)|
+-------------------------+
|       0.9524061403696937|
+-------------------------+
```
After this PR (a column name fixed):
```
scala> sql("select rand()").show()
+------------------+
|            rand()|
+------------------+
|0.7137935639522275|
+------------------+

// If a seed given, it is still shown in a column name
// (the same with the current behaviour)
scala> sql("select rand(1)").show()
+------------------+
|           rand(1)|
+------------------+
|0.6363787615254752|
+------------------+

// We can still check a seed in explain output:
scala> sql("select rand()").explain()
== Physical Plan ==
*(1) Project [rand(-2282124938778456838) AS rand()#0]
+- *(1) Scan OneRowRelation[]
```

Note: This fix comes from apache#28194; the ongoing PR tests the output schema of expressions, so their schemas must be deterministic for the tests.

To make output schema deterministic.

No.

Added unit tests.

Closes apache#28392 from maropu/SPARK-31594.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@beliefer beliefer deleted the check-expression-schema branch April 23, 2024 07:23