[SPARK-32734][SPARK-32735][DSTREAM] Fix batch submission delay caused by actions in dstream transform #29578
Can one of the admins verify this patch?
Thanks for the work, @Olwn. Btw, have you checked the guide? https://spark.apache.org/contributing.html You need to set correct tags in the title like
Thanks for correcting the title, @maropu. I have read the contributing guide but missed this point.
Hi @tdas, could you help review this pull request? It seems you are the main contributor to Spark Streaming.
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
Currently, dstream.getOrCompute runs in JobGenerator, which has a single-threaded event loop.
This pull request moves that call to JobScheduler.
Why are the changes needed?
Some of our Spark applications show a batch creation delay after running for some time. For instance, the 10:03 batch is submitted at 10:06. In the Spark UI, the latest batch doesn't match the current time.
We observed that these applications have one thing in common: RDD actions inside dstream.transform. Those actions are executed in dstream.compute, which is called by JobGenerator. JobGenerator runs a single-threaded event loop, so any synchronous (blocking) operation stalls event processing and delays the submission of later batches.
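The effect can be illustrated with a toy model. This is not Spark code: the function name `submission_delays` and its parameters are hypothetical, and it only assumes a timer that fires once per batch interval and a single loop that handles one event at a time. With `inline=True` the loop itself pays the cost of the blocking action (as when an RDD action runs inside transform); with `inline=False` the loop only hands the batch off, which roughly corresponds to moving the work out of the generator loop.

```python
def submission_delays(num_batches, interval, compute_cost, inline):
    """Toy model of a single-threaded generator loop (not Spark code).

    Returns, for each batch, how long after its scheduled tick the batch
    was actually handed off for execution. All times are in seconds.

    inline=True  : the loop runs the blocking work itself before handing off.
    inline=False : the loop only enqueues the batch; the blocking work is
                   assumed to run elsewhere and costs the loop nothing.
    """
    loop_free_at = 0  # time at which the loop can pick up the next event
    delays = []
    for i in range(num_batches):
        tick = i * interval               # when the timer fires for batch i
        start = max(tick, loop_free_at)   # loop may still be busy with batch i-1
        cost = compute_cost if inline else 0
        loop_free_at = start + cost       # loop is blocked until the work ends
        delays.append(loop_free_at - tick)
    return delays


# 60 s batch interval, 90 s of blocking work per batch.
print(submission_delays(4, 60, 90, inline=True))   # [90, 120, 150, 180]
print(submission_delays(4, 60, 90, inline=False))  # [0, 0, 0, 0]
```

When the blocking work takes longer than the batch interval, the delay grows by the difference on every batch, which matches the symptom of the latest batch in the UI drifting behind the current time.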
Does this PR introduce any user-facing change?
No
Reproduce
How was this patch tested?
I created a test, ForEachDStreamSuite, to make sure batch execution doesn't block batch submission.
I ran a streaming application and confirmed that all jobs appear on the batch page. A test, JobSchedulerSuite, is added to make sure all jobs in a batch are associated with the BatchTime and displayed in the Spark UI.
JIRAs
https://issues.apache.org/jira/browse/SPARK-32734
https://issues.apache.org/jira/browse/SPARK-32735