[SPARK-32857][CORE] Fix flaky o.a.s.s.BarrierTaskContextSuite.throw exception if the number of barrier() calls are not the same on every task #29732
@jiangxb1987 Please take a look, thanks!
Test build #128576 has finished for PR 29732 at commit
retest this please
Test build #128612 has finished for PR 29732 at commit
cc @jiangxb1987
core/src/test/scala/org/apache/spark/scheduler/BarrierTaskContextSuite.scala
Can we simply resolve the issue by setting a longer barrier sync timeout?
604e061 to 6e7044c
Increased the timeout to 5s. I think that should be long enough.
Test build #128847 has finished for PR 29732 at commit
retest this please
Test build #128947 has finished for PR 29732 at commit
cc @jiangxb1987
LGTM
retest this please
Kubernetes integration test starting
Kubernetes integration test status success
Test build #129194 has finished for PR 29732 at commit
+1, LGTM. Thank you, @Ngone51 and @jiangxb1987 .
I also agree with the above analysis. Merged to master for Apache Spark 3.1.0 in December 2020.
What changes were proposed in this pull request?
Fix the flaky test.
Why are the changes needed?
The test is flaky:
Expected exception org.apache.spark.SparkException to be thrown, but no exception was thrown.
Check the full error stack here.
By analyzing the log below, I found that task 0 hadn't reached the second context.barrier() when the other three tasks had already raised sync timeout exceptions from the first context.barrier(). The timeout exceptions were caught by the try...catch block. Then, each task started another round of barrier sync from the second context.barrier() and completed the sync successfully.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Updated the test and ran it a hundred times without failure (previously, there could be several failures).
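For readers following the analysis under "Why are the changes needed?", the race can be sketched as a tiny timing model. This is only an illustration under stated assumptions, not Spark's coordinator and not the test code: BarrierRaceSketch, Outcome, run and the millisecond values are hypothetical names and numbers, and the model assumes that task 0 skips the guarded barrier() call while the other three tasks reach it at time 0.

```scala
// A simplified timing model of the flaky test; it does not use Spark at all.
object BarrierRaceSketch {
  sealed trait Outcome
  case object ExceptionPropagated extends Outcome // what the test expects
  case object NoException extends Outcome         // the flaky result

  /**
   * Assumed setup (hypothetical, for illustration only):
   *  - tasks 1..3 call barrier() at time 0 inside try...catch, then call
   *    barrier() a second time outside it;
   *  - task 0 makes only the unguarded barrier() call, at `task0ArrivalMs`.
   */
  def run(task0ArrivalMs: Long, timeoutMs: Long): Outcome =
    if (task0ArrivalMs <= timeoutMs) {
      // Task 0 joins the round the other three are still waiting in, so that
      // round completes. The other three then start a round that task 0 never
      // joins; it times out outside try...catch and the exception propagates.
      ExceptionPropagated
    } else {
      // The first round already timed out and try...catch swallowed the
      // exceptions. All four tasks now meet at the unguarded barrier() and
      // the sync succeeds, so no exception reaches the driver.
      NoException
    }

  def main(args: Array[String]): Unit = {
    val task0ArrivalMs = 1500L // hypothetical delay of a slow task 0
    println(run(task0ArrivalMs, timeoutMs = 1000L)) // short timeout
    println(run(task0ArrivalMs, timeoutMs = 5000L)) // timeout raised to 5s
  }
}
```

In this model a slow task 0 only produces the flaky outcome when it arrives after the sync timeout, which is why raising the timeout to 5s makes the failure much less likely rather than strictly impossible.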