[SPARK-44045][SQL][TESTS] Mark WholeStageCodegenSparkSubmitSuite as ExtendedSQLTest #41579

Closed
dongjoon-hyun wants to merge 1 commit into apache:master from dongjoon-hyun:SPARK-44045

@dongjoon-hyun (Member) commented Jun 13, 2023

What changes were proposed in this pull request?

This PR aims to move `WholeStageCodegenSparkSubmitSuite` to the `sql - slow` pipeline to mitigate the recent flakiness of the `sql - others` pipeline.

Why are the changes needed?

`WholeStageCodegenSparkSubmitSuite` is the only test suite using `SparkSubmitTestUtils` in the `sql` module.

$ git grep 'SparkSubmitTestUtils' | grep sql/core
sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSparkSubmitSuite.scala:import org.apache.spark.deploy.SparkSubmitTestUtils
sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSparkSubmitSuite.scala:class WholeStageCodegenSparkSubmitSuite extends SparkSubmitTestUtils

As the following log shows, this test suite contributes to the flakiness.

- https://github.com/wangyum/spark/actions/runs/5253058423/jobs/9489919333

2023-06-13T11:05:31.3387316Z [info] WholeStageCodegenSparkSubmitSuite:
2023-06-13T11:05:36.6680896Z 2023-06-13 04:05:36.667 - stderr> 23/06/13 11:05:36 INFO SparkContext: Running Spark version 3.5.0-SNAPSHOT
...
2023-06-13T11:06:47.4402222Z 2023-06-13 04:06:47.408 - stderr> 23/06/13 11:06:47 INFO TaskSetManager: Finished task 52.0 in stage 2.0 (TID 63) in 148 ms on 127.0.0.1 (executor 0) (60/200)
2023-06-13T11:06:48.1484169Z 
2023-06-13T11:06:48.8633864Z ##[error]The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.
2023-06-13T11:06:48.9660849Z Session terminated, killing shell...
2023-06-13T11:06:49.2756183Z ##[error]The operation was canceled.
2023-06-13T11:06:49.4597252Z Cleaning up orphan processes
2023-06-13T11:06:49.6684941Z Terminate orphan process: pid (4061) (java)
2023-06-13T11:06:49.7698091Z Terminate orphan process: pid (661115) (java)
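
One way to check which suite was running when the runner was killed is to scan the job log for the last suite header printed before the shutdown message. The following is a minimal shell sketch and not part of this PR: the function name `last_suite_before_shutdown` is hypothetical, and it assumes the raw log arrives on stdin and that suite headers have the form `[info] SuiteName:`, as in the excerpt above once the color codes are stripped.

```shell
# Hypothetical helper (not from the PR): read a GitHub Actions job log on
# stdin and print the last suite header seen before the runner shutdown message.
last_suite_before_shutdown() {
  # Literal ESC byte, used to strip ANSI color sequences such as ESC[0m.
  esc=$(printf '\033')
  sed "s/${esc}\[[0-9;]*m//g" |
    awk '
      # Assumed header shape, as in the excerpt: "[info] SomeSuite:".
      /\[info\] [A-Za-z0-9_]+:$/ { suite = $NF; sub(/:$/, "", suite) }
      # Report only the first shutdown message.
      !found && /The runner has received a shutdown signal/ {
        print (suite == "" ? "(unknown)" : suite)
        found = 1
      }
      # Exit non-zero when the log contains no shutdown message.
      END { exit (found ? 0 : 1) }
    '
}
```

For example, `last_suite_before_shutdown < job.log` (with `job.log` being a hypothetical downloaded copy of the job log) would report the suite that was active at shutdown. That shows only which suite was running at the time, not that the suite caused the shutdown.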

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Pass the CIs.

github-actions bot added the SQL label Jun 13, 2023

@dongjoon-hyun (Member, Author)

cc @HyukjinKwon , @LuciferYang , @viirya

@viirya (Member) left a comment

I'm wondering why moving it to the `sql - slow` pipeline can mitigate the flakiness?

@dongjoon-hyun (Member, Author) commented Jun 13, 2023

Thank you for the review. The pipeline is flaky due to `The runner has received a shutdown signal.` We have recently been suspecting spiky tests, and a SparkSubmit-based test is one of the known heavy ones.

@dongjoon-hyun (Member, Author)

#41533 is one example of what we tried, and a broader approach is in #41552.

@dongjoon-hyun (Member, Author)

BTW, I must say that this is not the only reason why the `sql - others` pipeline is flaky. That is why I wrote 'mitigate': this is a best-effort approach.

@viirya (Member) left a comment

Got it. Okay, I think we can try this and see if it can mitigate the issue.

@dongjoon-hyun (Member, Author)

Thank you!

@dongjoon-hyun (Member, Author)

I verified that it has been moved into `sql - slow` correctly.

https://github.com/dongjoon-hyun/spark/actions/runs/5259819811/jobs/9505917572

2023-06-13T21:43:35.3930154Z [info] WholeStageCodegenSparkSubmitSuite:
2023-06-13T21:43:40.9666468Z 2023-06-13 14:43:40.965 - stderr> 23/06/13 21:43:40 INFO SparkContext: Running Spark version 3.5.0-SNAPSHOT
2023-06-13T21:43:41.0897608Z 2023-06-13 14:43:41.089 - stderr> 23/06/13 21:43:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

@dongjoon-hyun (Member, Author)

Merged to master.

@LuciferYang (Contributor)

late LGTM

@HyukjinKwon (Member)

LGTM2

@dongjoon-hyun (Member, Author)

Thank you, @LuciferYang and @HyukjinKwon !

dongjoon-hyun deleted the SPARK-44045 branch June 14, 2023 01:12

@panbingkun (Contributor)

late LGTM

@dongjoon-hyun (Member, Author) commented Jun 14, 2023

Now, it seems to be much better, although it's still too early to say. Two commits (SPARK-44045 and SPARK-44021) passed the tests without flakiness.

[Image: Screenshot 2023-06-13 at 9 40 11 PM]

czxm pushed a commit to czxm/spark that referenced this pull request Jun 19, 2023
snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023