Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Minor Framework speed improvements #20447

Merged
merged 6 commits into from Sep 12, 2017
Merged

Conversation

Dr15Jones
Copy link
Contributor

While testing the limits of the framework's event throughput using a 'hyper fast' configuration, a few performance bottlenecks were found. With these changes, and running cmsRun on moderate hardware, one can get 33,000 event/sec with one thread and get a max throughput of 180,000 events/sec using 9 threads.

While doing high speed configuration tests, the unnecessary calculation of parentageID (which is a slow hash calculation) showed up in the performance profiling.
The call to processEventAsync is made in the source's serial queue task. The processEventAsync call does a lot of work, all of which can actually be done as a different task. Now processEventAsync immediately launched a new tbb task which just does the work of the original processEventAsync but this time no longer within the timing of the serial queue task. This doubled the scalability of a simple hyper fast configuration.
@cmsbuild
Copy link
Contributor

cmsbuild commented Sep 9, 2017

The code-checks are being triggered in jenkins.

@cmsbuild
Copy link
Contributor

cmsbuild commented Sep 9, 2017

A new Pull Request was created by @Dr15Jones (Chris Jones) for master.

It involves the following packages:

FWCore/Framework

@cmsbuild, @smuzaffar, @Dr15Jones can you please review it and eventually sign? Thanks.
@Martin-Grunewald, @wddgit this is something you requested to watch as well.
@davidlange6, @slava77 you are the release manager for this.

cms-bot commands are listed here

@Dr15Jones
Copy link
Contributor Author

@fwyzard here are some further improvements to drive the hyper fast configurations.

@Dr15Jones
Copy link
Contributor Author

please test

@cmsbuild
Copy link
Contributor

cmsbuild commented Sep 9, 2017

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-any-integration/22834/console Started: 2017/09/09 18:34

@cmsbuild
Copy link
Contributor

cmsbuild commented Sep 9, 2017

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/PR-20447/602

Code check has found code style and quality issues which could be resolved by applying a patch in https://cmssdt.cern.ch/SDT/code-checks/PR-20447/602/git-diff.patch
e.g. curl https://cmssdt.cern.ch/SDT/code-checks/PR-20447/602/git-diff.patch | patch -p1

You can run scram build code-checks to apply code checks directly (this will soon be required for PRs to be considered)

@cmsbuild
Copy link
Contributor

cmsbuild commented Sep 9, 2017

-1

Tested at: 45ad3bc

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:
3b9aa09
You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-20447/22834/git-log-recent-commits
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-20447/22834/git-merge-result

You can see the results of the tests here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-20447/22834/summary.html

I found follow errors while testing this PR

Failed tests: UnitTests RelVals

  • Unit Tests:

I found errors in the following unit tests:

---> test TestFWCoreFrameworkGlobalStreamOne had ERRORS

  • RelVals:

When I ran the RelVals I found an error in the following worklfows:
1306.0 step4

runTheMatrix-results/1306.0_SingleMuPt1_UP15+SingleMuPt1_UP15+DIGIUP15+RECOUP15+HARVESTUP15/step4_SingleMuPt1_UP15+SingleMuPt1_UP15+DIGIUP15+RECOUP15+HARVESTUP15.log

10024.0 step5
runTheMatrix-results/10024.0_TTbar_13+TTbar_13TeV_TuneCUETP8M1_2017_GenSimFull+DigiFull_2017+RecoFull_2017+ALCAFull_2017+HARVESTFull_2017/step5_TTbar_13+TTbar_13TeV_TuneCUETP8M1_2017_GenSimFull+DigiFull_2017+RecoFull_2017+ALCAFull_2017+HARVESTFull_2017.log

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:
3b9aa09
You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-20447/22834/git-log-recent-commits
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-20447/22834/git-merge-result

@cmsbuild
Copy link
Contributor

cmsbuild commented Sep 9, 2017

Comparison not run due to runTheMatrix errors (RelVals and Igprof tests were also skipped)

@Dr15Jones
Copy link
Contributor Author

Please test

@cmsbuild
Copy link
Contributor

cmsbuild commented Sep 9, 2017

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-any-integration/22837/console Started: 2017/09/09 21:59

@Dr15Jones
Copy link
Contributor Author

@smuzaffar is there a time limit on how long the tests can run? I can not see why the tests for this fail in a step4 job running single threaded since the harvesting step never calls the code I changed!

@cmsbuild
Copy link
Contributor

@cmsbuild
Copy link
Contributor

Comparison job queued.

@cmsbuild
Copy link
Contributor

Comparison is ready
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-20447/22843/summary.html

Comparison Summary:

  • No significant changes to the logs found
  • Reco comparison results: 16 differences found in the comparisons
  • DQMHistoTests: Total files compared: 26
  • DQMHistoTests: Total histograms compared: 2642440
  • DQMHistoTests: Total failures: 238
  • DQMHistoTests: Total nulls: 1
  • DQMHistoTests: Total successes: 2642012
  • DQMHistoTests: Total skipped: 189
  • DQMHistoTests: Total Missing objects: 0
  • Checked 107 log files, 14 edm output root files, 26 DQM output files

@smuzaffar
Copy link
Contributor

@Dr15Jones , although we do run with timeout 10800 for full runTheMatrix but I see that this was OOM-KILL. For PR testing we are running 8 workflows in parallel on a 8 core machine with 15GB memory. May be we should run cpu-1 workflows in parallel for PR testing.

@davidlange6
Copy link
Contributor

please test
(or is there some reason to think this PR increases RSS by some significant amount?

@cmsbuild
Copy link
Contributor

cmsbuild commented Sep 11, 2017

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-any-integration/22847/console Started: 2017/09/11 12:16

@Dr15Jones
Copy link
Contributor Author

@davidlange6 I can't think of any reason why this pull request would increase RSS in any noticable way.

@cmsbuild
Copy link
Contributor

@cmsbuild
Copy link
Contributor

Comparison job queued.

@cmsbuild
Copy link
Contributor

Comparison is ready
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-20447/22847/summary.html

Comparison Summary:

  • No significant changes to the logs found
  • Reco comparison results: 16 differences found in the comparisons
  • DQMHistoTests: Total files compared: 26
  • DQMHistoTests: Total histograms compared: 2642440
  • DQMHistoTests: Total failures: 222
  • DQMHistoTests: Total nulls: 1
  • DQMHistoTests: Total successes: 2642028
  • DQMHistoTests: Total skipped: 189
  • DQMHistoTests: Total Missing objects: 0
  • Checked 107 log files, 14 edm output root files, 26 DQM output files

@davidlange6
Copy link
Contributor

+1

@cmsbuild cmsbuild merged commit a9cc7d9 into cms-sw:master Sep 12, 2017
@Dr15Jones Dr15Jones deleted the speedImprovements branch September 12, 2017 13:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants