Skip to content

Comments

[SPARK-45511][SS] Fix state reader suite flakiness by clean up resources after each test run#43831

Closed
chaoqin-li1123 wants to merge 4 commits intoapache:masterfrom
chaoqin-li1123:fix_state_reader_suite
Closed

[SPARK-45511][SS] Fix state reader suite flakiness by clean up resources after each test run#43831
chaoqin-li1123 wants to merge 4 commits intoapache:masterfrom
chaoqin-li1123:fix_state_reader_suite

Conversation

@chaoqin-li1123
Copy link
Contributor

@chaoqin-li1123 chaoqin-li1123 commented Nov 16, 2023

What changes were proposed in this pull request?

Fix state reader suite flakiness by clean up resources after each test.

The reason we have to clean up StateStore per test is due to maintenance task. When we run the streaming query, state store is being initialized in to the executor, and registration is performed against the coordinator in driver. The lifecycle of the state store provider is not strictly tied to the the lifecycle of the streaming query - the executor closes the state store provider when coordinator indicates to the executor that the state store provider is no longer valid, which is not immediately after the streaming query has stopped. The lifecycle of the state store provider can overlap among tests.

This means maintenance task against the provider can run after test A. We are clearing the temp directory in test A after the test A has completed, which can break the operation being performed against state store provider being used in test A. E.g. directory no longer exists while maintenance task is running.

This won't be an issue in practice because we do not expect the checkpoint location to be temporary, but it is indeed an issue for how we setup and cleanup env for tests.

Why are the changes needed?

To deflake the test.

@github-actions github-actions bot added the SQL label Nov 16, 2023
@chaoqin-li1123 chaoqin-li1123 changed the title [SPARK-45511] fix state reader suite flakiness by clean up resources after each test run [SPARK-45511] Fix state reader suite flakiness by clean up resources after each test run Nov 16, 2023
@chaoqin-li1123
Copy link
Contributor Author

@HeartSaVioR

@chaoqin-li1123 chaoqin-li1123 changed the title [SPARK-45511] Fix state reader suite flakiness by clean up resources after each test run [SPARK-45511][SS] Fix state reader suite flakiness by clean up resources after each test run Nov 16, 2023
Copy link
Contributor

@HeartSaVioR HeartSaVioR left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 Nice finding! Pending CI.

import org.apache.spark.sql.streaming.util.StreamManualClock

trait StateDataSourceTestBase extends StreamTest with StateStoreMetricsTest {
trait StateDataSourceTestBase extends StreamTest with BeforeAndAfter with StateStoreMetricsTest {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StreamTest already has BeforeAndAfterAll. Can we use beforeAll and afterAll instead, @chaoqin-li1123 ?

Copy link
Contributor

@HeartSaVioR HeartSaVioR Nov 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason we have to clean up StateStore per test is due to maintenance task. When we run the streaming query, state store is being initialized in to the executor, and registration is performed against the coordinator in driver. The lifecycle of the state store provider is not strictly tied to the the lifecycle of the streaming query - the executor closes the state store provider when coordinator indicates to the executor that the state store provider is no longer valid, which is not immediately after the streaming query has stopped. The lifecycle of the state store provider can overlap among tests.

This means maintenance task against the provider can run after test A. We are clearing the temp directory in test A after the test A has completed, which can break the operation being performed against state store provider being used in test A. E.g. directory no longer exists while maintenance task is running.
(It's far more problematic as exception in maintenance task will unload all state store providers which corresponding tasks may run concurrently, leading failures on running queries, or even JVM crash for RocksDB state store provider. That's scary, but it happens seldom/rarely so we can have time to revisit later.)

This won't be an issue in practice because we do not expect the checkpoint location to be temporary, but it is indeed an issue for how we setup and cleanup env for tests.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arguably we can defer the cleanup of temp directory till VM shutdown or test suite cleanup and move this to afterAll, but we do this for stream-stream join test suite already, so would like to be consistent with existing practice.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you. Got it. Then, shall we use beforeEach and afterEach because SparkFunSuite has BeforeAndAfterEach?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've just updated the description of PR to have full context on the problem.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This follows the pattern we used in the existing suite.

https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingJoinSuite.scala#L42-L54

But I'm fine either way. Good to know we already extend BeforeAndAfterEach by default and no need to extend others.

Copy link
Member

@dongjoon-hyun dongjoon-hyun Nov 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason why I requested is that BeforeAndAfter is deprecated, @HeartSaVioR . I hope we follow the scalatest recommendation in Apache Spark 4.0.0 and avoid adding more instances of this deprecated class.

This trait has been deprecated and will be removed in a future version of ScalaTest. If you are only using its beforeEach and/or afterEach methods, mix in BeforeAndAfterEach instead.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, great point. Thanks for the context! @chaoqin-li1123 Shall we update the trait?

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM (Pending CIs). Thank you, @chaoqin-li1123 and @HeartSaVioR .

@HeartSaVioR
Copy link
Contributor

Thanks! Merging to master.

@HeartSaVioR
Copy link
Contributor

Ah I missed the title - @chaoqin-li1123 let's add [FOLLOWUP] tag in the PR title when you add a small change on top of resolved JIRA and want to reuse an already resolved JIRA.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants