[SPARK-31291][SQL][TEST] SQLQueryTestSuite: Sharing test data and test tables among multiple test cases #28060

Closed

Conversation

beliefer
Contributor

@beliefer beliefer commented Mar 28, 2020

What changes were proposed in this pull request?

SQLQueryTestSuite takes 35 minutes to run.
I've listed the 10 test classes that took the longest time in the SQL module below.

Class | Time spent | Failure | Skip | Pass | Total test cases
-- | -- | -- | -- | -- | --
SQLQueryTestSuite | 35 minutes | 0 | 1 | 230 | 231
TPCDSQuerySuite | 3 minutes 8 seconds | 0 | 0 | 156 | 156
SQLQuerySuite | 2 minutes 52 seconds | 0 | 0 | 185 | 185
DynamicPartitionPruningSuiteAEOff | 1 minute 52 seconds | 0 | 0 | 22 | 22
DataFrameFunctionsSuite | 1 minute 37 seconds | 0 | 0 | 102 | 102
DynamicPartitionPruningSuiteAEOn | 1 minute 24 seconds | 0 | 0 | 22 | 22
DataFrameSuite | 1 minute 14 seconds | 0 | 2 | 157 | 159
SubquerySuite | 1 minute 12 seconds | 0 | 1 | 70 | 71
SingleLevelAggregateHashMapSuite | 1 minute 1 second | 0 | 0 | 50 | 50
DataFrameAggregateSuite | 59 seconds | 0 | 0 | 50 | 50

I checked the code of SQLQueryTestSuite and found that it loads the test data repeatedly.
This PR improves the performance of SQLQueryTestSuite by sharing the test data and test tables among test cases.

The total running time of SQLQueryTestSuite before and after this PR is shown below.
Before

No | Time
-- | --
1 | 20 minutes, 22 seconds
2 | 23 minutes, 21 seconds
3 | 21 minutes, 19 seconds
4 | 22 minutes, 26 seconds
5 | 20 minutes, 8 seconds

After

No | Time
-- | --
1 | 20 minutes, 52 seconds
2 | 20 minutes, 47 seconds
3 | 20 minutes, 7 seconds
4 | 21 minutes, 10 seconds
5 | 20 minutes, 4 seconds

Why are the changes needed?

Improve the performance of SQLQueryTestSuite.

Does this PR introduce any user-facing change?

'No'.

How was this patch tested?

Jenkins test

@SparkQA

SparkQA commented Mar 28, 2020

Test build #120521 has finished for PR 28060 at commit 0511d99.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 28, 2020

Test build #120522 has finished for PR 28060 at commit 8eac8d6.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum
Member

wangyum commented Mar 28, 2020

retest this please

@maropu
Member

maropu commented Mar 28, 2020

Looks cool, thanks for the work, @beliefer ! btw, how long will SQLQueryTestSuite take with this fix? I just want to know the total running time.

@SparkQA

SparkQA commented Mar 28, 2020

Test build #120525 has finished for PR 28060 at commit 8eac8d6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@beliefer
Contributor Author

Looks cool, thanks for the work, @beliefer ! btw, how long will SQLQueryTestSuite take with this fix? I just want to know the total running time.

After the optimization, the total time is nearly one minute shorter than before.
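As a rough check of that figure, the sketch below averages the five "Before" and five "After" timings listed in the PR description. The object and method names are made up for illustration, and a mean over five runs says little about run-to-run variance.

```scala
object BenchmarkAverages {
  // Timings copied from the "Before" and "After" tables, as (minutes, seconds).
  val before: Seq[(Int, Int)] = Seq((20, 22), (23, 21), (21, 19), (22, 26), (20, 8))
  val after: Seq[(Int, Int)] = Seq((20, 52), (20, 47), (20, 7), (21, 10), (20, 4))

  // Convert each run to seconds and take the arithmetic mean.
  def meanSeconds(runs: Seq[(Int, Int)]): Double =
    runs.map { case (m, s) => m * 60 + s }.sum.toDouble / runs.size

  def main(args: Array[String]): Unit = {
    val b = meanSeconds(before)
    val a = meanSeconds(after)
    println(s"before: $b s, after: $a s, saved: ${b - a} s")
  }
}
```

The means come out to 1291.2 seconds before and 1236 seconds after, a difference of roughly 55 seconds, which is consistent with "nearly one minute".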

@SparkQA

SparkQA commented Mar 28, 2020

Test build #120530 has finished for PR 28060 at commit c392c03.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

if (testTables.contains("arraydata")) {
  ((Seq(1, 2, 3), Seq(Seq(1, 2, 3))) :: (Seq(2, 3, 4), Seq(Seq(2, 3, 4))) :: Nil)
    .toDF("arraycol", "nestedarraycol")
    .createOrReplaceTempView("arraydata")
Member

To avoid the overhead of per-session initialization, can't we just move these local temp views into a session-independent place, e.g., global temp views?

Contributor Author

That is not the main source of conflict. For example, test case A creates a view named testdata, and test case B also creates a view named testdata, but the two views have different schemas. If the same session is shared globally, they will conflict, especially under parallel execution.

Contributor

If the same session is shared globally

I think that's not what @maropu means. We still create a fresh session for each test file, but the test views are created as global temp views, which are shared between all sessions.

Contributor Author

@cloud-fan Thanks. I understand @maropu's suggestion now. I will try to use createGlobalTempView and share these views between all sessions.

Member

Sorry, but I missed your reply, @beliefer. Yea, that's what I wanted to say, thanks, @cloud-fan . I'll check later.

Contributor Author

@beliefer beliefer Apr 8, 2020

@cloud-fan
If I use sparkSession.newSession(), test cases fail, for example:

23:25:05.078 ERROR org.apache.spark.sql.SQLQueryTestSuite: Error using configs: 
[info] - operators.sql *** FAILED *** (5 seconds, 722 milliseconds)
[info]   operators.sql
[info]   Expected "struct<[(- key):int,(+ key):int]>", but got "struct<[]>" Schema did not match for query #4
[info]   select -key, +key from testdata where key = 2: -- !query
[info]   select -key, +key from testdata where key = 2
[info]   -- !query schema
[info]   struct<>
[info]   -- !query output
[info]   org.apache.spark.sql.AnalysisException
[info]   Table or view not found: testdata; line 1 pos 23 (SQLQueryTestSuite.scala:464)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530)

Contributor Author

@beliefer beliefer Apr 8, 2020

val a = spark.sql("show views;")
a.show();

+---------+---------+-----------+
|namespace| viewName|isTemporary|
+---------+---------+-----------+
|         |  aggtest|       true|
|         |arraydata|       true|
|         |  mapdata|       true|
|         |     onek|       true|
|         |    tenk1|       true|
+---------+---------+-----------+

val localSparkSession = spark.newSession()
val a2 = localSparkSession.sql("show views;")
a2.show();

+---------+--------+-----------+
|namespace|viewName|isTemporary|
+---------+--------+-----------+
+---------+--------+-----------+

Maybe I am missing something?
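The empty result from the second session matches how temporary views are scoped: a view made with createOrReplaceTempView belongs to the session that created it, while a global temp view is registered in the `global_temp` database (the default name) and can be read from other sessions of the same application. The following is a minimal sketch of that difference, assuming Spark SQL is on the classpath and a local master is acceptable. The object, method and view names are hypothetical and not taken from the PR.

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Try

object TempViewScopeSketch {
  // True if `name` resolves to a table or view in `session`.
  def resolves(session: SparkSession, name: String): Boolean =
    Try(session.table(name)).isSuccess

  // Returns (local view seen by its creator,
  //          local view seen by a new session,
  //          global temp view seen by a new session).
  def check(spark: SparkSession): (Boolean, Boolean, Boolean) = {
    import spark.implicits._
    val df = Seq((1, "1"), (2, "2")).toDF("key", "value")

    // Local temp view: registered only in the creating session.
    df.createOrReplaceTempView("testdata_local")
    // Global temp view: registered in the shared `global_temp` database.
    df.createOrReplaceGlobalTempView("testdata_global")

    val fresh = spark.newSession()
    val result = (
      resolves(spark, "testdata_local"),
      resolves(fresh, "testdata_local"),
      resolves(fresh, "global_temp.testdata_global")
    )
    // Clean up so the sketch can be run repeatedly.
    spark.catalog.dropTempView("testdata_local")
    spark.catalog.dropGlobalTempView("testdata_global")
    result
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("temp-view-scope-sketch")
      .getOrCreate()
    try println(check(spark)) finally spark.stop()
  }
}
```

With a regular local session this should print `(true,false,true)`. Note that queries have to qualify a global temp view with the `global_temp` prefix, so the existing test SQL files would need to change, which may be part of why this route was considered too invasive.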

Member

Contributor Author

master...maropu:SPARK-31291

That approach works, but it would require too many changes. After an offline discussion with @cloud-fan, I will try replacing df.createTempView with df.write.saveAsTable.

Contributor Author

@cloud-fan @maropu I have replaced df.createTempView with df.write.saveAsTable.
Because the original temp views are now tables, I had to regenerate some golden files.
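A table written with saveAsTable is registered in the catalog shared by all sessions of the application, so a fresh session can read it without reloading the data. Below is a minimal sketch of that behavior, assuming a local Spark session without Hive support. The table name, object name and temporary warehouse directory are illustrative assumptions, not the suite's actual setup.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object SharedTableSketch {
  // Saves a small DataFrame as a parquet table, then counts its rows from a
  // brand-new session to show that the table is visible there as well.
  def rowsSeenByFreshSession(spark: SparkSession): Long = {
    import spark.implicits._
    Seq((1, "1"), (2, "2")).toDF("key", "value")
      .write
      .format("parquet")
      .mode("overwrite")
      .saveAsTable("testdata_shared_demo")
    try spark.newSession().table("testdata_shared_demo").count()
    finally spark.sql("DROP TABLE IF EXISTS testdata_shared_demo")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("shared-table-sketch")
      // Keep the demo table out of the current working directory.
      .config("spark.sql.warehouse.dir", Files.createTempDirectory("warehouse").toString)
      .getOrCreate()
    try println(rowsSeenByFreshSession(spark)) finally spark.stop()
  }
}
```

The trade-off is that the data is read back from files rather than from an in-memory collection, so the row order of an unsorted scan may differ from what a temp view produced. That is one plausible reason golden files have to be regenerated after such a change.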

@dongjoon-hyun
Member

dongjoon-hyun commented Mar 30, 2020

Hi, All.
I converted this issue into a subtask of SPARK-25604.
FYI, technically, SPARK-25604 is already resolved: the test framework was enhanced to run the slow test suites in parallel, and SQLQueryTestSuite and ThriftServerQueryTestSuite.scala are already parallelized in all SBT builds and tests (including PRBuilder). So this doesn't improve the total testing time in the SBT environment at all. The benefit of this PR is limited to the Maven environment.

cc @gatorsmile and @gengliangwang

@gatorsmile
Member

This was assigned to @beliefer after our offline talk. He is trying to find out why SQLQueryTestSuite took 35 minutes to finish. The time cost of each step/phase can help us locate the root cause. It would be interesting to know whether our compiler overhead is too big for these short queries.

@beliefer
Contributor Author

beliefer commented Apr 3, 2020

@dongjoon-hyun I think even in parallel execution, this PR will still help.

@beliefer
Contributor Author

beliefer commented Apr 7, 2020

cc @cloud-fan

@beliefer beliefer force-pushed the avoid-load-test-data-repeatedly branch from c392c03 to 45bba4c Compare April 8, 2020 15:03
@SparkQA

SparkQA commented Apr 8, 2020

Test build #120977 has finished for PR 28060 at commit 45bba4c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -7,8 +7,8 @@ SELECT * FROM testdata LIMIT 2
-- !query schema
struct<key:int,value:string>
-- !query output
1 1
2 2
51 51
Contributor

can we add a sort before limit?

Contributor Author

I resolved the issue by using repartition.
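For context on the reviewer's question: a LIMIT without an ORDER BY may return any rows, so the golden output can change whenever the physical layout of the data changes. The sketch below illustrates the sort-before-limit suggestion, assuming a local Spark session. The view and object names are hypothetical, and this is not the change made in the PR, where the author reports using repartition instead.

```scala
import org.apache.spark.sql.SparkSession

object SortBeforeLimitSketch {
  // Returns the two rows with the smallest keys. Thanks to the ORDER BY, the
  // result does not depend on how the rows are spread across partitions.
  def firstTwo(spark: SparkSession): Seq[(Int, String)] = {
    import spark.implicits._
    (1 to 100).map(i => (i, i.toString)).toDF("key", "value")
      .repartition(4) // shuffle rows so the physical order is not the insertion order
      .createOrReplaceTempView("testdata_sorted_demo")
    try {
      spark.sql("SELECT key, value FROM testdata_sorted_demo ORDER BY key LIMIT 2")
        .as[(Int, String)]
        .collect()
        .toSeq
    } finally {
      spark.catalog.dropTempView("testdata_sorted_demo")
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("sort-before-limit-sketch")
      .getOrCreate()
    try println(firstTwo(spark)) finally spark.stop()
  }
}
```

Sorting costs extra work per query but makes the expected output independent of how the table was written.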

@@ -668,6 +690,7 @@ class SQLQueryTestSuite extends QueryTest with SharedSparkSession {
try {
TimeZone.setDefault(originalTimeZone)
Locale.setDefault(originalLocale)
unloadTestData(spark)
Contributor

I'd prefer createTestTables and removeTestTables

Contributor Author

OK.

@cloud-fan
Contributor

the change LGTM, can you regenerate the benchmark numbers?

@beliefer
Contributor Author

beliefer commented Apr 9, 2020

the change LGTM, can you regenerate the benchmark numbers?

OK.

@SparkQA

SparkQA commented Apr 9, 2020

Test build #121005 has finished for PR 28060 at commit af42f50.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@beliefer
Contributor Author

beliefer commented Apr 9, 2020

retest this please

@beliefer beliefer changed the title [SPARK-31291][SQL][TEST] Avoid load test data if test case not uses them [SPARK-31291][SQL][TEST] Sharing test data and test tables among multiple test cases Apr 9, 2020
@beliefer beliefer changed the title [SPARK-31291][SQL][TEST] Sharing test data and test tables among multiple test cases [SPARK-31291][SQL][TEST] SQLQueryTestSuite: Sharing test data and test tables among multiple test cases Apr 9, 2020
@SparkQA

SparkQA commented Apr 9, 2020

Test build #121015 has finished for PR 28060 at commit af42f50.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master/3.0!

@cloud-fan cloud-fan closed this in 014d335 Apr 9, 2020
cloud-fan pushed a commit that referenced this pull request Apr 9, 2020
…t tables among multiple test cases


Closes #28060 from beliefer/avoid-load-test-data-repeatedly.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 014d335)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@beliefer
Contributor Author

beliefer commented Apr 9, 2020

@cloud-fan @maropu Thanks for reviewing this PR.
@dongjoon-hyun @gatorsmile @wangyum Thanks for your help!

cloud-fan pushed a commit that referenced this pull request Apr 10, 2020
…a and test tables among multiple test cases

### What changes were proposed in this pull request?
This PR is related to #28060.
`ThriftServerQueryTestSuite` takes 17 minutes to run.
I checked the code and found that `ThriftServerQueryTestSuite` loads the test data repeatedly.
The test classes of the `hive-thriftserver` module are listed below, ordered by time in descending order.

Class | Spend time  ↑ | Failure | Skip | Pass | Total test case
-- | -- | -- | -- | -- | --
ThriftServerQueryTestSuite | 17 minutes | 0 | 15 | 140 | 155
CliSuite | 8 minutes 24 seconds | 0 | 0 | 24 | 24
SparkThriftServerProtocolVersionsSuite | 59 seconds | 0 | 0 | 210 | 210
HiveThriftBinaryServerSuite | 36 seconds | 0 | 1 | 21 | 22
SparkMetadataOperationSuite | 19 seconds | 0 | 0 | 7 | 7
HiveCliSessionStateSuite | 16 seconds | 0 | 0 | 2 | 2
SparkSQLEnvSuite | 16 seconds | 0 | 0 | 1 | 1
HiveThriftHttpServerSuite | 15 seconds | 0 | 0 | 3 | 3
SingleSessionSuite | 14 seconds | 0 | 0 | 3 | 3
JdbcConnectionUriSuite | 2.1 seconds | 0 | 0 | 1 | 1
ThriftServerWithSparkContextSuite | 1.4 seconds | 0 | 0 | 1 | 1
SparkExecuteStatementOperationSuite | 63 milliseconds | 0 | 0 | 2 | 2
UISeleniumSuite | -1 milliseconds | 0 | 1 | 0 | 1

This PR improves the performance of `ThriftServerQueryTestSuite`.
Because #28060 provides [`createTestTables`](https://github.com/apache/spark/blob/e42a3945acd614a26c7941a9eed161b500fb4520/sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala#L574) and [`removeTestTables`](https://github.com/apache/spark/blob/e42a3945acd614a26c7941a9eed161b500fb4520/sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala#L666), this PR reuses them.
The total running time of `ThriftServerQueryTestSuite` before and after this PR is shown below.
Before
No | Time
-- | --
1 | 18 minutes, 8 seconds
2 | 22 minutes, 44 seconds
3 | 17 minutes, 48 seconds
4 | 18 minutes, 30 seconds

After
No | Time
-- | --
1 | 16 minutes, 11 seconds
2 | 17 minutes, 19 seconds
3 | 18 minutes, 15 seconds
4 | 17 minutes, 27 seconds

### Why are the changes needed?
Improve the performance of `ThriftServerQueryTestSuite`.

### Does this PR introduce any user-facing change?
'No'.

### How was this patch tested?
Jenkins test

Closes #28180 from beliefer/avoid-load-thrift-test-data-repeatedly.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit that referenced this pull request Apr 10, 2020
…a and test tables among multiple test cases

(cherry picked from commit 2d3692e)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@dongjoon-hyun
Member

dongjoon-hyun commented Apr 11, 2020

Hi, all.
This seems to break all Maven Jenkins jobs in both master and branch-3.0. The following is an example.

org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite *** ABORTED ***

Could you take a look?

.createOrReplaceTempView("mapdata")
.write
.format("parquet")
.saveAsTable("mapdata")

session
.read
.format("csv")
.options(Map("delimiter" -> "\t", "header" -> "false"))
.schema("a int, b float")
.load(testFile("test-data/postgresql/agg.data"))
Member

This seems to be the root cause of failure.

 java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: jar:file:/home/jenkins/workspace/spark-branch-3.0-test-maven-hadoop-2.7-hive-2.3/sql/core/target/spark-sql_2.12-3.0.1-SNAPSHOT-tests.jar!/test-data/postgresql/agg.data

@dongjoon-hyun
Member

Since I found the root cause, I'll make a follow-up PR soon.

@dongjoon-hyun
Member

One quick fix is to copy the test file from jar:file:/home/jenkins/workspace/spark-branch-3.0-test-maven-hadoop-2.7-hive-2.3/sql/core/target/spark-sql_2.12-3.0.1-SNAPSHOT-tests.jar!/test-data/postgresql/agg.data to a local file.

Since this PR is about performance, note that the fix will slightly increase the test time because of the copy.
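The idea of copying a resource out of a jar can be sketched with the JDK alone. This is an illustration under stated assumptions, not the code of the follow-up PR: `ResourceToTempFile` and `copyToTemp` are hypothetical names, and the demo in `main` uses a plain `file:` URL as a stand-in for a `jar:file:...!/...` resource URL.

```scala
import java.net.URL
import java.nio.file.{Files, Path, StandardCopyOption}

object ResourceToTempFile {
  // Copies whatever `resource` points at into a fresh temp file and returns
  // its path. The stream-based copy also works for `jar:` URLs, whose entries
  // are not files on the local file system.
  def copyToTemp(resource: URL, prefix: String = "test-data-", suffix: String = ".data"): Path = {
    val target = Files.createTempFile(prefix, suffix)
    target.toFile.deleteOnExit()
    val in = resource.openStream()
    try Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING)
    finally in.close()
    target
  }

  def main(args: Array[String]): Unit = {
    // Stand-in for a classpath resource; a real test would obtain the URL from
    // its class loader instead of creating a file first.
    val source = Files.createTempFile("agg", ".data")
    Files.write(source, "1\t1.5\n".getBytes("UTF-8"))
    val local = copyToTemp(source.toUri.toURL)
    print(new String(Files.readAllBytes(local), "UTF-8"))
  }
}
```

The returned path is an ordinary local file, so passing `local.toString` to a file-based reader avoids the `Relative path in absolute URI` error shown above, at the cost of one extra copy per resource.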

@dongjoon-hyun
Member

dongjoon-hyun commented Apr 11, 2020

I made a follow-up PR to recover master and branch-3.0.

dongjoon-hyun added a commit that referenced this pull request Apr 11, 2020
…ftServerQueryTestSuite

### What changes were proposed in this pull request?

[SPARK-31291](#28060) broke `ThriftServerQueryTestSuite` in the Maven environment. This PR fixes it by copying the resource file from the jar to a local temporary file.

### Why are the changes needed?

To recover the Jenkins jobs in `master` and `branch-3.0`.
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-3.0-test-maven-hadoop-2.7-hive-2.3/211/
```
org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite *** ABORTED ***
...
java.lang.IllegalArgumentException: java.net.URISyntaxException:
Relative path in absolute URI: jar:file:/home/jenkins/workspace/spark-branch-3.0-test-maven-hadoop-2.7-hive-2.3/sql/core/target/
spark-sql_2.12-3.0.1-SNAPSHOT-tests.jar!/test-data/postgresql/agg.data
```

![Screen Shot 2020-04-10 at 9 54 28 PM](https://user-images.githubusercontent.com/9700541/79035702-f03ad900-7b75-11ea-9eee-0c1581a28838.png)

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the Jenkins with SBT and Maven.
- [x] Sbt (`Test build #121117` #28186 (comment))
- [x] Maven (`Test build #121118` #28186 (comment))

Closes #28186 from dongjoon-hyun/SPARK-31291.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
dongjoon-hyun added a commit that referenced this pull request Apr 11, 2020
…ftServerQueryTestSuite

(cherry picked from commit b4c438a)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
…t tables among multiple test cases

sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
…a and test tables among multiple test cases

sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
…ftServerQueryTestSuite

@beliefer beliefer deleted the avoid-load-test-data-repeatedly branch April 23, 2024 07:18