@chamikaramj
Contributor

Rolls back PR #6927, since it seems to be breaking the Nexmark tests.

@chamikaramj
Contributor Author

Dataflow Runner Nexmark Tests

@chamikaramj
Contributor Author

Run beam_PostCommit_Java_Nexmark_Dataflow

@chamikaramj
Contributor Author

Run Dataflow ValidatesRunner

@chamikaramj chamikaramj changed the title Rollbacks PR #6927 [BEAM-6002] Rollbacks PR #6927 Nov 7, 2018
@chamikaramj
Contributor Author

cc: @reuvenlax @akedin

@reuvenlax
Contributor

reuvenlax commented Nov 7, 2018 via email

@chamikaramj
Contributor Author

https://builds.apache.org/view/A-D/view/Beam/job/beam_PostCommit_Java_Nexmark_Dataflow/

I confirmed that Nexmark for DataflowRunner passes with this skipped.

@reuvenlax
Contributor

reuvenlax commented Nov 7, 2018 via email

@chamikaramj
Contributor Author

I think the issue is that the tests have been timing out since yesterday.

Could this be due to the performance difference between HashMap and ConcurrentHashMap? See https://stackoverflow.com/questions/1378310/performance-concurrenthashmap-vs-hashmap

@reuvenlax
Contributor

reuvenlax commented Nov 7, 2018 via email

@kennknowles
Member

I'm not convinced it is a performance issue. The issue is tracked at https://issues.apache.org/jira/projects/BEAM/issues/BEAM-6002. The timeouts look more like the Nexmark suites being healthy for ~10 minutes and then hanging outright for 4 hours, so I would assume this is a deadlock of some sort.
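
For readers who want to tell a lock-cycle deadlock apart from a slow run, here is a minimal sketch of asking the JVM directly through the standard `ThreadMXBean` API. It is not taken from Beam or Nexmark: the class `DeadlockCheck` and its methods are hypothetical names invented for illustration, and the two-thread deadlock is staged deliberately so there is something to detect. It only finds cycles on `synchronized` monitors; a hang with another cause (for example a thread waiting forever on a queue) would not show up.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

// Hypothetical illustration class; not part of Beam or Nexmark.
class DeadlockCheck {

    /** Returns how many threads the JVM reports as stuck in a monitor deadlock (0 if none). */
    public static int countDeadlockedThreads() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        long[] ids = bean.findMonitorDeadlockedThreads(); // null when no deadlock is found
        return ids == null ? 0 : ids.length;
    }

    /** Starts two daemon threads that acquire the same two locks in opposite order. */
    public static void startDeadlockedPair() {
        final Object lockA = new Object();
        final Object lockB = new Object();
        final CountDownLatch bothHoldFirstLock = new CountDownLatch(2);
        Thread t1 = new Thread(() -> lockInOrder(lockA, lockB, bothHoldFirstLock));
        Thread t2 = new Thread(() -> lockInOrder(lockB, lockA, bothHoldFirstLock));
        t1.setDaemon(true); // daemon threads do not keep the JVM alive
        t2.setDaemon(true);
        t1.start();
        t2.start();
    }

    private static void lockInOrder(Object first, Object second, CountDownLatch latch) {
        synchronized (first) {
            latch.countDown();
            try {
                latch.await(); // wait until the other thread also holds its first lock
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            synchronized (second) {
                // Never reached: each thread waits for the lock the other one holds.
            }
        }
    }

    /** Polls until a deadlock is reported or the timeout expires; returns the thread count. */
    public static int waitForDeadlock(long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        int count = countDeadlockedThreads();
        while (count == 0 && System.currentTimeMillis() < deadline) {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
            count = countDeadlockedThreads();
        }
        return count;
    }

    public static void main(String[] args) {
        startDeadlockedPair();
        System.out.println("deadlocked threads: " + waitForDeadlock(5000));
    }
}
```

On a hung test JVM, a thread dump from `jstack` reports the same kind of monitor cycle without any code changes.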

@kennknowles
Member

kennknowles commented Nov 7, 2018

The Dataflow runner problem is a rate limit on writes to a single jar, the Dataflow worker jar. GCS limits writes to the same object to one per second, so this is pretty obvious. The root cause is the Nexmark launcher sharing the staging location across all jobs. TBH, I'm not sure why it is surfacing now.
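
One possible way to avoid concurrent writes to the same staged object, sketched under the assumption that each job can be pointed at its own staging directory, is to derive a unique path per job beneath a shared base. The class `StagingLocations`, the method `perJobStagingLocation`, and the bucket name are all hypothetical and are not part of the Nexmark launcher.

```java
import java.util.UUID;

// Hypothetical illustration class; not part of the Nexmark launcher.
class StagingLocations {

    /**
     * Builds a staging directory that is unique to one job, so that jobs
     * launched at the same time do not write the same object name.
     */
    public static String perJobStagingLocation(String baseLocation, String jobName) {
        // Drop a trailing slash so the result never contains "//" after the base.
        String base = baseLocation.endsWith("/")
            ? baseLocation.substring(0, baseLocation.length() - 1)
            : baseLocation;
        return base + "/" + jobName + "-" + UUID.randomUUID();
    }

    public static void main(String[] args) {
        // "gs://example-bucket/staging" is a placeholder, not a real bucket.
        System.out.println(
            perJobStagingLocation("gs://example-bucket/staging", "nexmark-query0"));
    }
}
```

The resulting string would then be supplied as the job's staging location (for Dataflow, presumably via the `stagingLocation` pipeline option). The trade-off is that each job uploads its own copy of the jars instead of reusing previously staged ones.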

@kennknowles
Member

Can you modify the Nexmark job to have a PR phrase trigger so we can verify? Or you could post a Gradle scan.

@kennknowles
Member

Also FWIW ValidatesRunner is not broken, so no need to wait for that.

@kennknowles
Member

I can confirm that the DirectRunner smoke test instructions at https://beam.apache.org/documentation/sdks/java/testing/nexmark/ definitely reproduce the problem on master:

./gradlew :beam-sdks-java-nexmark:run \
    -Pnexmark.runner=":beam-runners-direct-java" \
    -Pnexmark.args="
        --runner=DirectRunner
        --streaming=false
        --suite=SMOKE
        --manageResources=false
        --monitorJobs=true
        --enforceEncodability=true
        --enforceImmutability=true"

FYI, I have moved those instructions to https://cwiki.apache.org/confluence/display/BEAM/Running+Nexmark because they don't really make sense for end users.

I can't post a scan because it deadlocks somewhat at random.

On this PR it succeeds; here is a scan: https://gradle.com/s/xnx6gqfwnirxo

@kennknowles
Member

LGTM. Please work to make it easy to confirm that the proposed fix works. It has to be something that someone looking at the PR can check.

@kennknowles kennknowles merged commit a2fc15a into apache:master Nov 7, 2018
@chamikaramj
Contributor Author

@apilloud, any thoughts on adding a trigger for running beam_PostCommit_Java_Nexmark_Dataflow on PRs? Possibly it was left out intentionally, to prevent results from experimental runs from being published as perf data?

@reuvenlax
Contributor

reuvenlax commented Nov 7, 2018 via email
