Improve Eigen compatibility with CUDA (10.4.x) #4474

fwyzard commented Nov 3, 2018

Update Eigen to the master branch as of Tue Sep 25 20:26:16 2018 +0200

  • hg hash 66ba78bf7efa93f69a075830a87a010ed1b1fe30
  • git hash 01ae86b9aad30b1e65cf1b749fd6cd9a645ac00d

Patch TensorFlow to follow Eigen internal changes

  • cherry-pick changes from upstream repository
  • add local changes for the latest updates

Improve Eigen compatibility with CUDA

  • add support for cache-size queries on CUDA devices;
  • extend support for matrix inversion on CUDA devices to matrices larger than 4x4;
    the largest matrix that can be inverted is limited at runtime by the per-thread stack size;
  • extend support for diagonal matrices on CUDA devices;
  • fix a deprecation warning in CUDA 10.0.


fwyzard commented Nov 3, 2018

@cmsbuild, please test


cmsbuild commented Nov 3, 2018

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-any-integration/31462/console


cmsbuild commented Nov 3, 2018

A new Pull Request was created by @fwyzard (Andrea Bocci) for branch IB/CMSSW_10_4_X/gcc700.

@cmsbuild, @smuzaffar, @gudrutis, @mrodozov can you please review it and eventually sign? Thanks.
You can sign-off by replying to this message having '+1' in the first line of your reply.
You can reject by replying to this message having '-1' in the first line of your reply.


cmsbuild commented Nov 3, 2018

Comparison job queued.


cmsbuild commented Nov 3, 2018

Comparison is ready
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-4474/31462/summary.html

Comparison Summary:

  • No significant changes to the logs found
  • Reco comparison results: 2 differences found in the comparisons
  • DQMHistoTests: Total files compared: 32
  • DQMHistoTests: Total histograms compared: 2993155
  • DQMHistoTests: Total failures: 1
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 2992957
  • DQMHistoTests: Total skipped: 197
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 31 files compared)
  • Checked 134 log files, 14 edm output root files, 32 DQM output files

smuzaffar merged commit 5587d10 into cms-sw:IB/CMSSW_10_4_X/gcc700 Nov 14, 2018

fwyzard commented Nov 15, 2018

When I build all the externals including these changes, I run into a problem with the tensorflow-python3-sources package, where the build fails with a cryptic message:

Server terminated abruptly (error code: 14, error message: '', log file: '/data/user/fwyzard/patatrack/build/slc7_amd64_gcc700.patatrack/BUILD/slc7_amd64_gcc700/external/tensorflow-python3-sources/1.6.0-patatrack/build/72fcdeb2f560249cbc23c63d6d0200b0/server/jvm.out')

but jvm.out is empty.
If I re-run the same build command in the same build area, the second time it succeeds.

@smuzaffar, @davidlange6 have you seen this before? Do you have any suggestions?


davidlange6 commented Nov 15, 2018 via email


fwyzard commented Nov 15, 2018

Mhm, the strange thing is that it always goes like this:

  • it builds the python 2 version successfully
  • it fails on the python 3 version
  • at the second attempt, it builds the python 3 version successfully as well

I saw that bazel leaves some build files under $HOME/.cache/bazel. I did not check how much space it uses during a build. Is it possible that it runs out of space when doing multiple builds?
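
One way to check the disk-space hypothesis would be to compare the size of the bazel cache with the free space left on the same filesystem, before and after a build. The following is a minimal sketch, assuming a POSIX shell and the $HOME/.cache/bazel location mentioned above; the helper names cache_usage_kb and free_space_kb are hypothetical, not part of bazel or of the CMS build tools.

```shell
#!/bin/sh
# Hypothetical helpers (not part of bazel): measure cache size and free space.

# Print the disk usage, in KiB, of a directory (default: the bazel cache
# directory mentioned in this thread). Prints 0 if the directory is missing.
cache_usage_kb() {
    dir="${1:-$HOME/.cache/bazel}"
    if [ -d "$dir" ]; then
        # du -sk prints "<kibibytes><TAB><path>"; keep only the number.
        du -sk "$dir" | cut -f1
    else
        echo 0
    fi
}

# Print the free space, in KiB, on the filesystem holding a directory
# (default: $HOME).
free_space_kb() {
    # df -Pk prints one header line, then one line per filesystem;
    # the fourth column is the available space in 1024-byte blocks.
    df -Pk "${1:-$HOME}" | awk 'NR == 2 { print $4 }'
}
```

Calling `cache_usage_kb` and `free_space_kb` right after the failing python 3 build, and again after the successful retry, should show whether the cache gets anywhere near the available space.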

smuzaffar commented

Could it be that the first time we started two bazel servers (one for python2 and the other for python3), and these two servers stepped on each other? Try adding

BuildRequires: tensorflow-python2-sources

to tensorflow-python3-sources to make sure that only one of them runs at a time.


davidlange6 commented Nov 15, 2018 via email


fwyzard commented Nov 15, 2018

> Try adding
>
> BuildRequires: tensorflow-python2-sources
>
> in tensorflow-python3-sources to make sure that only one of these run at one time

Thanks, I will give it a try.

I have run into bazel filling my $HOME - but then you get an error message telling you that you are out of space...

Is there any way to tell bazel to use a different temp directory?
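
For reference, bazel has a startup option, --output_user_root, that moves the directory it uses by default under $HOME/.cache/bazel to another location. The sketch below only prints the resulting command line rather than running it; the helper name bazel_cmdline, the BAZEL_SCRATCH variable, the scratch path and the build target are all placeholders, and whether this option fits the tensorflow spec files has not been checked here.

```shell
#!/bin/sh
# Sketch: compose a bazel command line that keeps bazel's files out of $HOME.
# --output_user_root is a bazel startup option, so it has to come before
# the command (build, test, ...). bazel_cmdline is a hypothetical helper.
bazel_cmdline() {
    root="$1"
    shift
    echo "bazel --output_user_root=$root $*"
}

# Placeholder location: any directory with enough free space would do.
SCRATCH="${BAZEL_SCRATCH:-/tmp/bazel-$(id -un)}"

# Print the command instead of running it, so the sketch is safe to try.
# The target below is a placeholder, not a real tensorflow target.
bazel_cmdline "$SCRATCH" build //some/hypothetical:target
```

Since startup options are part of how the bazel server is launched, the same value would have to be passed consistently to every bazel invocation in a given build.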


davidlange6 commented Nov 15, 2018 via email

fwyzard deleted the IB/CMSSW_10_4_X/gcc700_update_Eigen branch May 21, 2019 05:29