[SPARK-28214][STREAMING][TESTS] CheckpointSuite: wait for batch to be fully processed before accessing DStreamCheckpointData #25731

Closed
HeartSaVioR wants to merge 1 commit into master from HeartSaVioR:SPARK-28214

Conversation

HeartSaVioR
Contributor

@HeartSaVioR HeartSaVioR commented Sep 9, 2019

What changes were proposed in this pull request?

This patch fixes a bug where `DStreamCheckpointData.currentCheckpointFiles` is accessed without any guarding, which makes the test `basic rdd checkpoints + dstream graph checkpoint recovery` flaky.

There are two points that can make the test fail:

  1. The checkpoint logic is too slow, so the checkpoint cannot be completed within the real (wall-clock) delay the test allows.
  2. `DStreamCheckpointData.update` is not thread-safe: it clears `currentCheckpointFiles` and then adds the new checkpoint files in two separate steps, so a race condition can occur between the test's main thread and the JobGenerator's event loop thread (see the sketch just below this list).
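
To make the second point concrete, here is a minimal Scala sketch of the racy pattern. The class shape and method signatures are simplified for illustration and are not the actual Spark code; the real map is keyed by batch `Time`:

```scala
import scala.collection.mutable

// Simplified model of the race described above: update() clears the map and then
// repopulates it in two separate steps, so a reader on another thread can observe
// the window where the map is empty or only partially refilled.
class CheckpointDataSketch {
  val currentCheckpointFiles = new mutable.HashMap[Long, String]()

  // Runs on the JobGenerator's event loop thread.
  def update(checkpointFiles: Map[Long, String]): Unit = {
    currentCheckpointFiles.clear()             // <-- window opens here
    currentCheckpointFiles ++= checkpointFiles // <-- window closes here
  }

  // Runs on the test's main thread in the flaky scenario.
  def filesSeenByTest: Set[String] = currentCheckpointFiles.values.toSet
}
```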

`lastProcessedBatch` guarantees that all events for a given time have been processed, as its comment states:
`// last batch whose completion,checkpointing and metadata cleanup has been completed`. That means that if we wait until `lastProcessedBatch` catches up to exactly the time we advanced the clock to in the test (a multiple of both the checkpoint interval and the batch duration), nothing further will happen in `DStreamCheckpointData` until we advance the clock again.

This patch applies the observation above.
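
As an illustration of how that wait can be expressed, here is a rough sketch of a helper method that would live inside the test suite. The helper name `advanceClockAndWait`, the use of ScalaTest's `eventually`, and the way the clock and `lastProcessedBatch` are reached through the scheduler are assumptions of this sketch, not necessarily the exact code in the patch:

```scala
import org.apache.spark.streaming.{StreamingContext, Time}
import org.apache.spark.util.ManualClock
import org.scalatest.concurrent.Eventually._
import org.scalatest.time.SpanSugar._

// Advance the manual clock by a multiple of the batch duration / checkpoint
// interval, then block until the batch for the target time has been fully
// processed (completion, checkpointing and metadata cleanup). Only after that
// is it safe to read DStreamCheckpointData.currentCheckpointFiles, because
// nothing will touch it again until the clock is advanced.
def advanceClockAndWait(ssc: StreamingContext, increment: Long): Unit = {
  val clock = ssc.scheduler.clock.asInstanceOf[ManualClock] // access path assumed for this sketch
  clock.advance(increment)
  val targetTime = Time(clock.getTimeMillis())
  eventually(timeout(10.seconds)) {
    // lastProcessedBatch lives on the JobGenerator; how it is exposed here is assumed.
    assert(ssc.scheduler.jobGenerator.lastProcessedBatch >= targetTime)
  }
}
```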

Why are the changes needed?

The test is reported as flaky in [SPARK-28214](https://issues.apache.org/jira/browse/SPARK-28214), and the test code is not thread-safe as written.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Modified UT. I added some debug messages and confirmed that no method of `DStreamCheckpointData` is called between "waiting for `lastProcessedBatch`" and "advancing the clock", even with a large sleep inserted between the two, which shows the race condition is avoided.

I was also able to make the existing test fail artificially (not 100% of the time, but with high probability) by adding a sleep between `currentCheckpointFiles.clear()` and `currentCheckpointFiles ++= checkpointFiles` in `DStreamCheckpointData.update`, and confirmed that the modified test does not fail across multiple runs.

…processed before accessing DStreamCheckpointData
@HeartSaVioR
Contributor Author

We could guard `DStreamCheckpointData.currentCheckpointFiles` so that it is synchronized across all methods of `DStreamCheckpointData`, but that seems like overkill just for testing: it is only accessed "concurrently" when test code reads it directly.
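
For contrast, the guarded alternative being ruled out here would look roughly like the following. This is a hedged sketch of the idea (again using a simplified class keyed by `Long` instead of batch `Time`), not code proposed in the PR:

```scala
import scala.collection.mutable

// Hypothetical fully-synchronized variant: every access goes through the same
// monitor, so a reader on any thread never sees the half-updated map. The cost
// is adding locking to production code purely for the benefit of the test,
// which is why it is treated as overkill here.
class GuardedCheckpointDataSketch {
  private val currentCheckpointFiles = new mutable.HashMap[Long, String]()

  def update(checkpointFiles: Map[Long, String]): Unit = synchronized {
    currentCheckpointFiles.clear()
    currentCheckpointFiles ++= checkpointFiles
  }

  def snapshot(): Map[Long, String] = synchronized {
    currentCheckpointFiles.toMap
  }
}
```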

@HeartSaVioR HeartSaVioR changed the title [SPARK-28214][STREAMING] CheckpointSuite: wait for batch to be fully processed before accessing DStreamCheckpointData [SPARK-28214][STREAMING][TESTS] CheckpointSuite: wait for batch to be fully processed before accessing DStreamCheckpointData Sep 9, 2019
@SparkQA

SparkQA commented Sep 9, 2019

Test build #110349 has finished for PR 25731 at commit bd8470f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor

vanzin commented Sep 9, 2019

Looks good, merging to master.

@vanzin vanzin closed this in 8018ded Sep 9, 2019
@HeartSaVioR
Contributor Author

Thanks for the quick review and merge!

@HeartSaVioR HeartSaVioR deleted the SPARK-28214 branch September 9, 2019 23:20
PavithraRamachandran pushed a commit to PavithraRamachandran/spark that referenced this pull request Sep 15, 2019
… fully processed before accessing DStreamCheckpointData

Closes apache#25731 from HeartSaVioR/SPARK-28214.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>