
[SPARK-31372][SQL][TEST] Display expression schema for double check. #28194

Closed
wants to merge 24 commits

Conversation

beliefer
Contributor

@beliefer beliefer commented Apr 12, 2020

What changes were proposed in this pull request?

Although SPARK-30184 implemented a helper method for aliasing functions, developers often forget to use it.
We need stronger guarantees that the aliases output by built-in functions are correct.
This PR extracts the SQL from each expression's examples and writes the SQL together with its schema into one golden file.
By checking the golden file, we can find the expressions whose aliases are not displayed correctly, and then fix them.
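To make the mechanism concrete, here is a minimal, self-contained sketch of the golden-file idea (all names hypothetical; the real suite derives the (SQL, schema) pairs from the examples registered in `FunctionRegistry` and runs them through `spark.sql`): on the first run the captured output is recorded, and on later runs it is compared against the recorded file.

```scala
import java.nio.file.{Files, Paths}

object GoldenFileSketch {
  // Hypothetical stand-in for the (sql, schema) pairs the suite captures.
  case class QueryOutput(sql: String, schema: String)

  def main(args: Array[String]): Unit = {
    val outputs = Seq(
      QueryOutput("SELECT abs(-1)", "struct<abs(-1):int>"),
      QueryOutput("SELECT 1 + 2", "struct<(1 + 2):int>"))

    // Render each pair as one Markdown table row, as the PR's golden file does.
    val rendered = outputs
      .map(o => s"| ${o.sql} | ${o.schema} |")
      .mkString("\n")

    val golden = Paths.get(sys.props("java.io.tmpdir"), "expression-schema-demo.md")
    if (!Files.exists(golden)) {
      // Regeneration mode: record the current output as the golden file.
      Files.write(golden, rendered.getBytes("UTF-8"))
    } else {
      // Check mode: any alias change shows up as a mismatch against the golden file.
      val expected = new String(Files.readAllBytes(golden), "UTF-8")
      assert(expected == rendered,
        "Schema drift detected; regenerate the golden file and review the diff")
    }
  }
}
```

The second run takes the check path, so an accidental alias change fails the test instead of silently reaching users.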

Why are the changes needed?

Ensures that the output aliases are correct.

Does this PR introduce any user-facing change?

'No'.

How was this patch tested?

Jenkins test.

@SparkQA

SparkQA commented Apr 12, 2020

Test build #121141 has finished for PR 28194 at commit 8d976a2.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 12, 2020

Test build #121140 has finished for PR 28194 at commit 36a7ce8.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class ExpressionsSchemaSuite extends QueryTest with SharedSparkSession
  • protected case class QueryOutput(sql: String, schema: String)

@beliefer beliefer changed the title [SPARK-31372][SQL][Test] Display expression schema for double check. [SPARK-31372][SQL][TEST] Display expression schema for double check. Apr 12, 2020
@beliefer
Contributor Author

retest this please

@SparkQA

SparkQA commented Apr 12, 2020

Test build #121145 has finished for PR 28194 at commit 8d976a2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

-- !query
SELECT forall(array(2, null, 8), x -> x % 2 == 0)
-- !query schema
struct<forall(array(2, CAST(NULL AS INT), 8), lambdafunction(((namedlambdavariable() % 2) = 0), namedlambdavariable())):boolean>
Member

Actually, we do not need all of them. How about just showing the first one?

Contributor Author

OK, we only need the first one; the other SQL statements and schemas add no more value. Thanks for the reminder.

@SparkQA

SparkQA commented Apr 13, 2020

Test build #121165 has finished for PR 28194 at commit 31c7984.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@beliefer
Contributor Author

retest this please

@SparkQA

SparkQA commented Apr 13, 2020

Test build #121194 has finished for PR 28194 at commit c7b565b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@beliefer
Contributor Author

retest this please

@SparkQA

SparkQA commented Apr 13, 2020

Test build #121177 has finished for PR 28194 at commit 31c7984.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 13, 2020

Test build #121205 has finished for PR 28194 at commit c7b565b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 14, 2020

Test build #121245 has finished for PR 28194 at commit 42910c2.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@beliefer
Contributor Author

retest this please

@SparkQA

SparkQA commented Apr 14, 2020

Test build #121258 has finished for PR 28194 at commit 42910c2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@beliefer
Contributor Author

cc @gatorsmile @cloud-fan @maropu

@gatorsmile
Member

I discussed this with @beliefer offline. Let us generate an MD file as the golden file. We can build an MD table to present the schema info.

@SparkQA

SparkQA commented Apr 16, 2020

Test build #121359 has finished for PR 28194 at commit c6ed125.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • protected case class QueryOutput(

@beliefer
Contributor Author

retest this please

sys.props.getOrElse("spark.test.home", sys.env("SPARK_HOME"))
}

java.nio.file.Paths.get(sparkHome,
Contributor

In BenchmarkBase.main, we generate the path by simply doing

val file = new File(s"benchmarks/$resultFileName")
if (!file.exists()) {
  file.createNewFile()
}

Does it work here?

Contributor Author

It works too.

Contributor

Then why do we write such complex code here to generate the path?

Contributor Author
@beliefer beliefer Apr 30, 2020

We need to create the parent dir sql-functions first.
I can replace the code below

    java.nio.file.Paths.get(sparkHome,
      "sql", "core", "src", "test", "resources", "sql-functions").toFile

as

val file = new File(s"$sparkHome/sql/core/src/test/resources/sql-functions")
if (!file.exists()) {
  file.mkdir()
}

But I am neutral about this change.
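For what it's worth, `java.io.File.mkdirs()` (note the plural) creates any missing parent directories in one call, so neither variant needs a separate step for a missing parent. A small standalone sketch under a temp path (directory names hypothetical):

```scala
import java.io.File

object MkdirsSketch {
  def main(args: Array[String]): Unit = {
    // mkdir() fails when the parent is missing; mkdirs() creates the whole chain.
    val dir = new File(sys.props("java.io.tmpdir"), "mkdirs-demo/sql-functions")
    if (!dir.exists()) {
      dir.mkdirs()
    }
    assert(dir.exists() && dir.isDirectory)
  }
}
```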


/** A single SQL query's SQL and schema. */
protected case class QueryOutput(
className: String,
Contributor

nit: 4-space indentation for parameters.

Contributor Author

OK.

}

// Compare results.
assertResult(expectedOutputs.size, s"Number of queries should be ${expectedOutputs.size}") {
Contributor

nit: isn't it simply assert(expectedOutputs.size == outputs.size, error message ...)?

Contributor Author

OK
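For illustration, the two assertion styles express the same size check. A plain-Scala sketch (names hypothetical; `assertResult` is shown only in a comment since it is a ScalaTest API):

```scala
object AssertSketch {
  def main(args: Array[String]): Unit = {
    val expectedOutputs = Seq("q1", "q2", "q3")
    val outputs = Seq("q1", "q2", "q3")

    // ScalaTest form used in the PR:
    //   assertResult(expectedOutputs.size,
    //       s"Number of queries should be ${expectedOutputs.size}") {
    //     outputs.size
    //   }
    // The simpler equivalent suggested in the review:
    assert(expectedOutputs.size == outputs.size,
      s"Number of queries should be ${expectedOutputs.size}")
  }
}
```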

@SparkQA

SparkQA commented Apr 29, 2020

Test build #122044 has finished for PR 28194 at commit 460da00.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 29, 2020

Test build #122049 has finished for PR 28194 at commit 133456d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 29, 2020

Test build #122065 has finished for PR 28194 at commit a4d4de9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 29, 2020

Test build #122063 has finished for PR 28194 at commit e571667.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

| org.apache.spark.sql.catalyst.expressions.Abs | abs | SELECT abs(-1) | struct<abs(-1):int> |
| org.apache.spark.sql.catalyst.expressions.Acos | acos | SELECT acos(1) | struct<ACOS(CAST(1 AS DOUBLE)):double> |
| org.apache.spark.sql.catalyst.expressions.Acosh | acosh | SELECT acosh(1) | struct<ACOSH(CAST(1 AS DOUBLE)):double> |
| org.apache.spark.sql.catalyst.expressions.Add | + | SELECT 1 + 2 | struct<(1 + 2):int> |
Contributor
@cloud-fan cloud-fan Apr 30, 2020

One case where running multiple examples is useful: date + interval will be replaced by DateAddInterval during analysis, and it's better to test that as well.

We can fix it in a followup.

Contributor Author
@beliefer beliefer Apr 30, 2020

I checked the code of DateAddInterval:
override def sql: String = s"${left.sql} + ${right.sql}"
The alias of DateAddInterval is consistent with Add.
If we treat DateAddInterval as an internal implementation of Add, this does not affect users.

@cloud-fan
Contributor

thanks, merging to master/3.0!

@cloud-fan cloud-fan closed this in 1d1bb79 Apr 30, 2020
cloud-fan pushed a commit that referenced this pull request Apr 30, 2020
### What changes were proposed in this pull request?
Although SPARK-30184 implemented a helper method for aliasing functions, developers often forget to use it.
We need stronger guarantees that the aliases output by built-in functions are correct.
This PR extracts the SQL from each expression's examples and writes the SQL together with its schema into one golden file.
By checking the golden file, we can find the expressions whose aliases are not displayed correctly, and then fix them.

### Why are the changes needed?
Ensures that the output aliases are correct.

### Does this PR introduce any user-facing change?
'No'.

### How was this patch tested?
Jenkins test.

Closes #28194 from beliefer/check-expression-schema.

Lead-authored-by: beliefer <beliefer@163.com>
Co-authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 1d1bb79)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@beliefer
Contributor Author

@cloud-fan @gatorsmile @maropu Thanks for all your help!

@dongjoon-hyun
Member

dongjoon-hyun commented May 1, 2020

Hi, guys.
This seems to break branch-3.0 UT. All Jenkins jobs on branch-3.0 are failing. Could you take a look?

  • org.apache.spark.sql.ExpressionsSchemaSuite.Check schemas for expression examples

(Screenshots: Jenkins failures on branch-3.0, taken Apr 30, 2020 at 8:04 PM and 8:07 PM.)

@maropu
Member

maropu commented May 1, 2020

@beliefer could you do a follow-up? It seems we just need to update the golden file for 3.0.

@maropu
Member

maropu commented May 1, 2020

#28427

@maropu
Member

maropu commented May 1, 2020

To recover 3.0 asap, I opened a PR to fix it.

if (regenerateGoldenFiles) {
  val missingExampleStr = missingExamples.mkString(",")
  val goldenOutput = {
    s"<!-- Automatically generated by${getClass.getSimpleName} -->\n" +
Member

nit: Needs a space before ${getClass.getSimpleName}. We can fix it when updating this file next time.

Contributor Author
@beliefer beliefer May 1, 2020

Let me do it.
#28164

Member

Ah, I see. Looks okay.

@@ -0,0 +1,341 @@
<!-- Automatically generated byExpressionsSchemaSuite -->
Member

Why is this file md specifically?

Contributor

See #28194 (comment)

I guess it's easier for humans to read the table.

Member

Hopefully we can improve the diff here ...

Contributor

This should be a row-by-row diff; @beliefer can you help to fix it?

Contributor Author

@cloud-fan @HyukjinKwon
This exception is thrown when the number of expected SQL statements and the number of actual SQL statements are not equal.
Could I avoid outputting all the SQL in the exception?

HyukjinKwon pushed a commit that referenced this pull request May 5, 2020
… that easy to track the diff

### What changes were proposed in this pull request?
This PR follows up #28194.
As discussed at https://github.com/apache/spark/pull/28194/files#r418418796.
This PR improves `ExpressionsSchemaSuite` so that the diff is easy to track.
Although `ExpressionsSchemaSuite` at line
https://github.com/apache/spark/blob/b7cde42b04b21c9bfee6535199cf385855c15853/sql/core/src/test/scala/org/apache/spark/sql/ExpressionsSchemaSuite.scala#L165
only wants to compare the expected output size with the newest output size, the scalatest framework outputs extra information containing the full content of both the expected and the newest output.
This PR avoids that issue.
After this PR, the exception looks like below:
```
[info] - Check schemas for expression examples *** FAILED *** (7 seconds, 336 milliseconds)
[info]   340 did not equal 341 Expected 332 blocks in result file but got 333. Try regenerate the result files. (ExpressionsSchemaSuite.scala:167)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530)
[info]   at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529)
[info]   at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
[info]   at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:503)
[info]   at org.apache.spark.sql.ExpressionsSchemaSuite.$anonfun$new$1(ExpressionsSchemaSuite.scala:167)
```

### Why are the changes needed?
Make the exception more concise and clear.

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
Jenkins test.

Closes #28430 from beliefer/improve-expressions-schema-suite.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
HyukjinKwon pushed a commit that referenced this pull request May 5, 2020
… that easy to track the diff

(Same commit message as the one above, cherry-picked to branch-3.0.)

Closes #28430 from beliefer/improve-expressions-schema-suite.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit b949420)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
jdcasale pushed a commit to palantir/spark that referenced this pull request Jun 22, 2021
…ment in output schema

This PR intends to update `sql` in `Rand`/`Randn` with no argument to make a column name deterministic.

Before this PR (a column name changes run-by-run):
```
scala> sql("select rand()").show()
+-------------------------+
|rand(7986133828002692830)|
+-------------------------+
|       0.9524061403696937|
+-------------------------+
```
After this PR (a column name fixed):
```
scala> sql("select rand()").show()
+------------------+
|            rand()|
+------------------+
|0.7137935639522275|
+------------------+

// If a seed given, it is still shown in a column name
// (the same with the current behaviour)
scala> sql("select rand(1)").show()
+------------------+
|           rand(1)|
+------------------+
|0.6363787615254752|
+------------------+

// We can still check a seed in explain output:
scala> sql("select rand()").explain()
== Physical Plan ==
*(1) Project [rand(-2282124938778456838) AS rand()#0]
+- *(1) Scan OneRowRelation[]
```

Note: This fix comes from apache#28194; the ongoing PR tests the output schema of expressions, so their schemas must be deterministic for the tests.

To make output schema deterministic.

No.

Added unit tests.

Closes apache#28392 from maropu/SPARK-31594.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@beliefer beliefer deleted the check-expression-schema branch April 23, 2024 07:23