[SPARK-45511][SS] Fix state reader suite flakiness by clean up resources after each test run #43831
chaoqin-li1123 wants to merge 4 commits into apache:master from
Conversation
HeartSaVioR
left a comment
+1 Nice finding! Pending CI.
  import org.apache.spark.sql.streaming.util.StreamManualClock

- trait StateDataSourceTestBase extends StreamTest with StateStoreMetricsTest {
+ trait StateDataSourceTestBase extends StreamTest with BeforeAndAfter with StateStoreMetricsTest {
StreamTest already has BeforeAndAfterAll. Can we use beforeAll and afterAll instead, @chaoqin-li1123 ?
The reason we have to clean up StateStore per test is the maintenance task. When we run the streaming query, the state store is initialized in the executor, and registration is performed against the coordinator in the driver. The lifecycle of the state store provider is not strictly tied to the lifecycle of the streaming query - the executor closes the state store provider only when the coordinator indicates that the provider is no longer valid, which is not immediately after the streaming query has stopped. The lifecycle of the state store provider can therefore overlap among tests.
This means the maintenance task against the provider can still run after test A has completed. We clear the temp directory of test A after it completes, which can break operations still being performed against the state store provider used in test A - e.g. the directory no longer exists while the maintenance task is running.
(It's far more problematic because an exception in the maintenance task will unload all state store providers whose corresponding tasks may be running concurrently, leading to failures on running queries, or even a JVM crash for the RocksDB state store provider. That's scary, but it happens rarely, so we have time to revisit it later.)
This won't be an issue in practice because we do not expect the checkpoint location to be temporary, but it is indeed an issue for how we set up and clean up the environment for tests.
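The overlap described above can be sketched without Spark at all. A minimal, hypothetical simulation (`MaintenanceOverlapDemo` and `simulate` are made-up names, not Spark APIs): a maintenance task fires only after the test's temp checkpoint directory has already been deleted, so it no longer finds the directory it expects.

```scala
import java.nio.file.{Files, Path}

object MaintenanceOverlapDemo {
  // Returns whether a late "maintenance task" still finds the directory.
  def simulate(): Boolean = {
    // The test creates a temp checkpoint directory for its query.
    val checkpointDir: Path = Files.createTempDirectory("state-checkpoint")

    // "Test A" completes and eagerly deletes its temp checkpoint dir...
    Files.delete(checkpointDir)

    // ...then the provider's maintenance task fires and looks for it.
    Files.exists(checkpointDir)
  }
}
```

Cleaning up the state store in `afterEach` unloads the provider before the temp directory is removed, so this race cannot occur between tests.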
Arguably we can defer the cleanup of temp directory till VM shutdown or test suite cleanup and move this to afterAll, but we do this for stream-stream join test suite already, so would like to be consistent with existing practice.
Thank you. Got it. Then, shall we use beforeEach and afterEach because SparkFunSuite has BeforeAndAfterEach?
I've just updated the description of PR to have full context on the problem.
This follows the pattern we used in the existing suite.
But I'm fine either way. Good to know we already extend BeforeAndAfterEach by default and no need to extend others.
The reason why I requested is that BeforeAndAfter is deprecated, @HeartSaVioR . I hope we follow the scalatest recommendation in Apache Spark 4.0.0 and avoid adding more instances of this deprecated class.
This trait has been deprecated and will be removed in a future version of ScalaTest. If you are only using its beforeEach and/or afterEach methods, mix in BeforeAndAfterEach instead.
Ah, great point. Thanks for the context! @chaoqin-li1123 Shall we update the trait?
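For illustration, a minimal sketch of the per-test hook pattern that `BeforeAndAfterEach` provides, written here without a ScalaTest dependency (`PerTestCleanupDemo`, `providerLoaded`, and `runTest` are hypothetical names): the cleanup hook runs after every test body, even one that throws.

```scala
object PerTestCleanupDemo {
  // Stand-in for "a state store provider is loaded in this JVM".
  var providerLoaded = false

  private def beforeEach(): Unit = providerLoaded = true   // e.g. query starts, provider loads
  private def afterEach(): Unit = providerLoaded = false   // e.g. StateStore.stop() per test

  // Runs one test body between the hooks; cleanup happens even on failure.
  def runTest(body: => Unit): Unit = {
    beforeEach()
    try body
    finally afterEach()
  }
}
```

With this shape, each test starts from a clean slate and no provider (or its maintenance task) survives into the next test's temp directories.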
dongjoon-hyun
left a comment
+1, LGTM (Pending CIs). Thank you, @chaoqin-li1123 and @HeartSaVioR .
Thanks! Merging to master.
Ah I missed the title - @chaoqin-li1123 let's add
What changes were proposed in this pull request?
Fix state reader suite flakiness by cleaning up resources after each test.
The reason we have to clean up StateStore per test is the maintenance task. When we run the streaming query, the state store is initialized in the executor, and registration is performed against the coordinator in the driver. The lifecycle of the state store provider is not strictly tied to the lifecycle of the streaming query - the executor closes the state store provider only when the coordinator indicates that the provider is no longer valid, which is not immediately after the streaming query has stopped. The lifecycle of the state store provider can therefore overlap among tests.
This means the maintenance task against the provider can still run after test A has completed. We clear the temp directory of test A after it completes, which can break operations still being performed against the state store provider used in test A - e.g. the directory no longer exists while the maintenance task is running.
This won't be an issue in practice because we do not expect the checkpoint location to be temporary, but it is indeed an issue for how we set up and clean up the environment for tests.
Why are the changes needed?
To deflake the test.