[SPARK-28214][STREAMING][TESTS] CheckpointSuite: wait for batch to be fully processed before accessing DStreamCheckpointData #25731

Closed
HeartSaVioR wants to merge 1 commit into master from HeartSaVioR:SPARK-28214

Conversation

HeartSaVioR
Contributor

@HeartSaVioR HeartSaVioR commented Sep 9, 2019

What changes were proposed in this pull request?

This patch fixes a bug where `DStreamCheckpointData.currentCheckpointFiles` is accessed without any guarding, which makes the test `basic rdd checkpoints + dstream graph checkpoint recovery` flaky.

There are two points that can make the test fail:

  1. The checkpoint logic is too slow, so the checkpoint cannot be completed within the real (wall-clock) delay the test allows.
  2. `DStreamCheckpointData.update` is not thread-safe: it clears `currentCheckpointFiles` and then adds the new checkpoint files in two separate steps, so a race condition can occur between the test's main thread and the JobGenerator's event loop thread (see the sketch just below this list).
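
To make the second point concrete, here is a minimal Scala sketch of the racy pattern. The class shape and method signatures are simplified for illustration and are not the actual Spark code; the real map is keyed by batch `Time`:

```scala
import scala.collection.mutable

// Simplified model of the race described above: update() clears the map and then
// repopulates it in two separate steps, so a reader on another thread can observe
// the window where the map is empty or only partially refilled.
class CheckpointDataSketch {
  val currentCheckpointFiles = new mutable.HashMap[Long, String]()

  // Runs on the JobGenerator's event loop thread.
  def update(checkpointFiles: Map[Long, String]): Unit = {
    currentCheckpointFiles.clear()             // <-- window opens here
    currentCheckpointFiles ++= checkpointFiles // <-- window closes here
  }

  // Runs on the test's main thread in the flaky scenario.
  def filesSeenByTest: Set[String] = currentCheckpointFiles.values.toSet
}
```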

`lastProcessedBatch` guarantees that all events for a given time have been processed, as its comment states:
`// last batch whose completion,checkpointing and metadata cleanup has been completed`. That means that if we wait until `lastProcessedBatch` catches up to exactly the time we advanced the clock to in the test (a multiple of both the checkpoint interval and the batch duration), nothing further will happen in `DStreamCheckpointData` until we advance the clock again.

This patch applies the observation above.
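
As an illustration of how that wait can be expressed, here is a rough sketch of a helper method that would live inside the test suite. The helper name `advanceClockAndWait`, the use of ScalaTest's `eventually`, and the way the clock and `lastProcessedBatch` are reached through the scheduler are assumptions of this sketch, not necessarily the exact code in the patch:

```scala
import org.apache.spark.streaming.{StreamingContext, Time}
import org.apache.spark.util.ManualClock
import org.scalatest.concurrent.Eventually._
import org.scalatest.time.SpanSugar._

// Advance the manual clock by a multiple of the batch duration / checkpoint
// interval, then block until the batch for the target time has been fully
// processed (completion, checkpointing and metadata cleanup). Only after that
// is it safe to read DStreamCheckpointData.currentCheckpointFiles, because
// nothing will touch it again until the clock is advanced.
def advanceClockAndWait(ssc: StreamingContext, increment: Long): Unit = {
  val clock = ssc.scheduler.clock.asInstanceOf[ManualClock] // access path assumed for this sketch
  clock.advance(increment)
  val targetTime = Time(clock.getTimeMillis())
  eventually(timeout(10.seconds)) {
    // lastProcessedBatch lives on the JobGenerator; how it is exposed here is assumed.
    assert(ssc.scheduler.jobGenerator.lastProcessedBatch >= targetTime)
  }
}
```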

Why are the changes needed?

The test is reported as flaky in [SPARK-28214](https://issues.apache.org/jira/browse/SPARK-28214), and the test code is not thread-safe as written.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Modified UT. I added some debug messages and confirmed that no method of `DStreamCheckpointData` is called between "waiting for `lastProcessedBatch`" and "advancing the clock", even with a large sleep inserted between the two, which shows the race condition is avoided.

I was also able to make the existing test fail artificially (not 100% of the time, but with high probability) by adding a sleep between `currentCheckpointFiles.clear()` and `currentCheckpointFiles ++= checkpointFiles` in `DStreamCheckpointData.update`, and confirmed that the modified test does not fail across multiple runs.

…processed before accessing DStreamCheckpointData
@HeartSaVioR
Contributor Author

We could guard `DStreamCheckpointData.currentCheckpointFiles` so that it is synchronized across all methods of `DStreamCheckpointData`, but that seems like overkill just for testing: it is only accessed "concurrently" when test code reads it directly.
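
For contrast, the guarded alternative being ruled out here would look roughly like the following. This is a hedged sketch of the idea (again using a simplified class keyed by `Long` instead of batch `Time`), not code proposed in the PR:

```scala
import scala.collection.mutable

// Hypothetical fully-synchronized variant: every access goes through the same
// monitor, so a reader on any thread never sees the half-updated map. The cost
// is adding locking to production code purely for the benefit of the test,
// which is why it is treated as overkill here.
class GuardedCheckpointDataSketch {
  private val currentCheckpointFiles = new mutable.HashMap[Long, String]()

  def update(checkpointFiles: Map[Long, String]): Unit = synchronized {
    currentCheckpointFiles.clear()
    currentCheckpointFiles ++= checkpointFiles
  }

  def snapshot(): Map[Long, String] = synchronized {
    currentCheckpointFiles.toMap
  }
}
```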

@HeartSaVioR HeartSaVioR changed the title [SPARK-28214][STREAMING] CheckpointSuite: wait for batch to be fully processed before accessing DStreamCheckpointData [SPARK-28214][STREAMING][TESTS] CheckpointSuite: wait for batch to be fully processed before accessing DStreamCheckpointData Sep 9, 2019
@SparkQA

SparkQA commented Sep 9, 2019

Test build #110349 has finished for PR 25731 at commit bd8470f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor

vanzin commented Sep 9, 2019

Looks good, merging to master.

@vanzin vanzin closed this in 8018ded Sep 9, 2019
@HeartSaVioR
Contributor Author

Thanks for the quick review and merge!

@HeartSaVioR HeartSaVioR deleted the SPARK-28214 branch September 9, 2019 23:20
PavithraRamachandran pushed a commit to PavithraRamachandran/spark that referenced this pull request Sep 15, 2019
… fully processed before accessing DStreamCheckpointData

Closes apache#25731 from HeartSaVioR/SPARK-28214.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>