Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-32734][SPARK-32735][DSTREAM] Fix batch submission delay caused by actions in dstream transform #29578

Closed
wants to merge 12 commits into from

Conversation

Olwn
Copy link

@Olwn Olwn commented Aug 29, 2020

What changes were proposed in this pull request?

Currently dstream.getOrCompute runs at JobGenerator, which has a single thread event loop.
This pull request moves that to JobScheduler.

Why are the changes needed?

Some of our spark applications have batch creation delay after running for some time. For instance, Batch 10:03 is submitted at 10:06. In spark UI, the latest batch doesn't match current time.
We observe such applications share a commonality that rdd actions exist in dstream.transfrom. Those actions will be executed in dstream.compute, which is called by JobGenerator. JobGenerator runs with a single thread event loop so any synchronized operations will block event processing.

Does this PR introduce any user-facing change?

No

Reproduce

// run this at Spark shell. Access Spark UI and check batch submission
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
val mapped = words.transform(rdd => { Thread.sleep(10000); rdd; } )
mapped.foreachRDD(rdd => rdd.foreach(x => println(x)))
// run this at Spark shell. Access Spark UI and look streaming output and associated jobs
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
val mapped = words.transform(rdd => { val c=rdd.count(); rdd.map(x => s"$c x")} )
mapped.foreachRDD(rdd => rdd.foreach(x => println(x)))

How was this patch tested?

  • For the batch submission delay issue,
    I created a test ForEachDStreamSuite to make sure batch execution doesn't block batch submission.
  • For Spark UI issue,
    I ran a streaming application and saw all jobs showing at batch page. A test JobSchedulerSuite is added to make sure all jobs in a batch can be associated with the BatchTime and display at Spark UI

JIRAs

https://issues.apache.org/jira/browse/SPARK-32734
https://issues.apache.org/jira/browse/SPARK-32735

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@maropu maropu changed the title Fix batch submission delay caused by actions in dstream transform [SPARK-32734][SPARK-32735][DSTREAM] Fix batch submission delay caused by actions in dstream transform Sep 2, 2020
@maropu
Copy link
Member

maropu commented Sep 2, 2020

Thanks for the work, @Olwn. Btw, have you checked the guide? https://spark.apache.org/contributing.html You need to set correct tags in the title like [SPARK-XXXXX][some_component_tag], so I've fixed it. And, do you target to fix both issue SPARK-32734 and SPARK-32735 in a single PR? Basically, we need to make a PR for each issue.

@Olwn
Copy link
Author

Olwn commented Sep 3, 2020

Thanks for correcting the title @maropu. I have read the contributing guide but missed this point.
Regarding SPARK-32734 and SPARK-32735, yes this PR is a small change but fixes both issues.

@Olwn
Copy link
Author

Olwn commented Sep 8, 2020

hi @tdas , could you help review this pull request? Seems you are the main contributor of spark streaming.

@github-actions
Copy link

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Dec 18, 2020
@github-actions github-actions bot closed this Dec 19, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants