-
Notifications
You must be signed in to change notification settings - Fork 4.5k
[BEAM-6002] Rollbacks PR #6927 #6970
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Dataflow Runner Nexmark Tests |
|
Run beam_PostCommit_Java_Nexmark_Dataflow |
|
Run Dataflow ValidatesRunner |
|
cc: @reuvenlax @akedin |
|
What is breaking? I'm not sure I understand this.
…On Wed, Nov 7, 2018, 10:06 AM Chamikara Jayalath ***@***.*** wrote:
cc: @reuvenlax <https://github.com/reuvenlax> @akedin
<https://github.com/akedin>
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#6970 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AUGE1fDnX6trxJC6lcehb9qmYauoavEaks5uskAPgaJpZM4YRlWm>
.
|
|
https://builds.apache.org/view/A-D/view/Beam/job/beam_PostCommit_Java_Nexmark_Dataflow/ I confirmed that nexmark for DataflowRunner passes with this skipped. |
|
The errors I see in the link you sent are rate-limited exceeds to GCS. Is
there another error I should be looking at?
I'm asking because I can't even imagine how changing HashMap ->
ConcurrentHashMap (which is lock free) could cause failures.
…On Wed, Nov 7, 2018 at 10:12 AM Chamikara Jayalath ***@***.***> wrote:
https://builds.apache.org/view/A-D/view/Beam/job/beam_PostCommit_Java_Nexmark_Dataflow/
I confirmed that nexmark for DataflowRunner passes with this skipped.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#6970 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AUGE1UbMFxaRmmX3XCMCvPFeqFTgt03dks5uskGQgaJpZM4YRlWm>
.
|
|
I think the issue is that tests are timing out since yesterday. Could this be due to the performance difference between HashMap and ConcurrentHashMap: https://stackoverflow.com/questions/1378310/performance-concurrenthashmap-vs-hashmap |
|
This code path is only called at graph-generation time (not in the data
path), and is called once per coder in the pipeline. Seems unlikely that
performance is an issue here.
…On Wed, Nov 7, 2018 at 10:34 AM Chamikara Jayalath ***@***.***> wrote:
I think the issue is that tests are timing out since yesterday.
Could this be due to the performance difference between HashMap and
ConcurrentHashMap:
https://stackoverflow.com/questions/1378310/performance-concurrenthashmap-vs-hashmap
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#6970 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AUGE1Vv0Vvr5r2fSIvh8uJYbV144ywa3ks5uskbRgaJpZM4YRlWm>
.
|
|
I'm not convinced it is a performance issue. The issue is https://issues.apache.org/jira/projects/BEAM/issues/BEAM-6002. The timeouts look more like the Nexmark suites being healthy for ~10 minutes and then straight up hanging for 4 hours. This is a deadlock of some sort I would assume. |
|
The Dataflow runner stuff is a rate limit for writing to a single jar, the DF worker. The limit on GCS is one write per second so this is pretty obvious. The root cause is the Nexmark launcher sharing the staging location for all jobs. TBH not sure why it is surfacing now. |
|
Can you modify the Nexmark to have a PR phrase trigger so we can verify? Or you could post a gradle scan. |
|
Also FWIW ValidatesRunner is not broken, so no need to wait for that. |
|
I can confirm that the instructions at https://beam.apache.org/documentation/sdks/java/testing/nexmark/ for smoke test on DirectRunner definitely repro on FYI I have moved them to https://cwiki.apache.org/confluence/display/BEAM/Running+Nexmark because it doesn't really make sense for end-users. I can't post a scan because it deadlocks somewhat at random. And on this PR it succeeds, and here is a scan: https://gradle.com/s/xnx6gqfwnirxo |
|
LGTM. Please work to make it easy to confirm that the proposed fix works. I has to be something that someone looking at the PR can check. |
|
@apilloud on adding a trigger for running beam_PostCommit_Java_Nexmark_Dataflow on PRs. Possibly this is not included intentionally to prevent results from experimental runs from being published as perf data ? |
|
I believe I see the problem. The body of computeIfAbsent might make a
recursive call back to generate which will call computeIfAbsent again. This
is explicitly disallowed by ConcurrentHashMap (which is a bit confusing,
since it is allowed by the base class Map::computeIfAbsent).
…On Wed, Nov 7, 2018 at 1:03 PM Chamikara Jayalath ***@***.***> wrote:
@apilloud <https://github.com/apilloud> on adding a trigger for running
beam_PostCommit_Java_Nexmark_Dataflow on PRs. Possibly this is not included
intentionally to prevent results from experimental runs from being
published as perf data ?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#6970 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AUGE1VrpeApsqVAKgiGP--GSkl7Uhu7Qks5usmmcgaJpZM4YRlWm>
.
|
Rollback for PR #6927 since that seems to be breaking Nexmark tests.