
CMSSW supporting MPI #18174

Closed
perrozzi opened this issue Apr 3, 2017 · 60 comments

@perrozzi
Contributor

perrozzi commented Apr 3, 2017

As detailed in the talk during the O&C week (https://indico.cern.ch/event/624140/contributions/2533506/attachments/1438011/2212147/Long_GENComputingReport_2017_04_03.pdf),
we would like to have MPI support inside CMSSW, to boost the usage of the Sherpa MC generator.
From what I understood during the discussion, this should be relatively straightforward.
@kdlong @vciulli @bendavid can comment further

@cmsbuild
Contributor

cmsbuild commented Apr 3, 2017

A new Issue was created by @perrozzi.

@davidlange6, @Dr15Jones, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@davidlange6
Contributor

please propose the appropriate set of configure arguments for OpenMPI

@kdlong
Contributor

kdlong commented Apr 3, 2017

I've been working on a local installation of Sherpa compiled against OpenMPI. I don't quite have it working at the moment, but I hope to have more info soon. One thing I'm sure of is that the C++ bindings, which are deprecated in the MPI standard and disabled in OpenMPI by default, are necessary. This requires the configure option --enable-mpi-cxx.

In general, is there any reason to prefer OpenMPI vs. MPICH2? If the program is compiled against OpenMPI, and a cluster supports MPICH2, will it work?
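
For concreteness, a minimal sketch of how I'd configure such a build (the prefix is just a placeholder, not a cmsdist path), together with a quick check that the bindings really ended up in the installation:

./configure --prefix=$HOME/openmpi-install --enable-mpi-cxx
make -j8 && make install
# ompi_info reports whether the C++ bindings were compiled in
$HOME/openmpi-install/bin/ompi_info | grep "C++ bindings"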

@davidlange6
Contributor

davidlange6 commented Apr 3, 2017 via email

@bbockelm
Contributor

bbockelm commented Apr 3, 2017

@kdlong - one question is of expectations management. How many cores would you like to target for the Sherpa use case?

I ask because the approach here is completely different if Sherpa can run on a single host (8-64 cores) versus multiple hosts.

For a single host, one can just utilize OpenMPI with the shmem (shared memory) backend and disable all others; in such a case, it's irrelevant what the underlying cluster uses.

For multiple host runs (i.e., using the shared fabric), there's no expectation of portability between clusters. After all, MPI stands for Message Passing Interface - i.e., the API is standardized but there's no portability of executables.

@kdlong
Contributor

kdlong commented Apr 4, 2017

@bbockelm - my impression is that we should first target this single-host workflow. Without MPI the jobs can run on exactly 1 core, so 8-64 cores would already be a huge improvement, even if it isn't sufficient for the most intensive processes. A version of Sherpa + OpenMPI compiled in CMSSW would be a huge help for this type of workflow. I'm working on testing this with a local installation and will hopefully have feedback in about a week.

@bbockelm
Contributor

bbockelm commented Apr 4, 2017

Gotcha - in that case, you simply want the sm backend (there is a new one, vader, but it requires a newer kernel than is available at most sites). See this page for instructions on how to configure it and which flags are needed:

https://www.open-mpi.org/faq/?category=sm

One only needs to consider site-specific concerns if you need to go between nodes - otherwise, you really want to avoid the site's MPI stack.
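
For illustration, restricting a run to the shared-memory and loopback transports looks something like this (the executable name and core count are just placeholders):

# only use the shared-memory (sm) and self BTLs, never the network fabric
mpirun --mca btl self,sm -n 16 ./standalone_sherpa_job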

Next question - does Sherpa run inside CMSSW or as an ExternalLHEProducer? Embedded MPI library calls into CMSSW might require some careful thought. If it's an ExternalLHEProducer, then this should be straightforward.

@perrozzi
Contributor Author

perrozzi commented Apr 5, 2017

@kdlong
Contributor

kdlong commented Apr 5, 2017

But my understanding is that the event generation won't use MPI, just the gridpack generation, which is done externally. We need to confirm/test this, but I think it can work. @vciulli, does this seem right?

@perrozzi
Contributor Author

perrozzi commented Apr 5, 2017

Yes, indeed, MPI is needed only to create the sherpack. In fact, what is currently done is to use a standalone version to create the sherpack somewhere like DESY, do some manipulation to make it comply with the "native CMSSW" sherpack, and then use it as if it had been made with Sherpa inside CMSSW; cf.
https://indico.cern.ch/event/615514/contributions/2530154/attachments/1434198/2204645/2017-03-27__GEN_technical_meeting__Sherpa_Status.pdf

@bendavid
Contributor

bendavid commented Apr 5, 2017

It might be reasonable to continue running Sherpa standalone to produce the sherpacks, but just do so directly with the compiled version shipped with CMSSW.

@perrozzi
Contributor Author

perrozzi commented Apr 6, 2017

Yes, I agree, to avoid any inconsistency.

@vciulli
Contributor

vciulli commented Apr 6, 2017

@pmillet can comment but I think the command cmsRun is not invoked to create the sherpack, even if one uses the scripts prepared for using Sherpa inside CMSSW

MPI is certainly not needed to generate the events

I agree that using multiple cores in a single machine is already a good starting point.
Maybe @pmillet already has a recipe for building Sherpa with MPI inside CMSSW.

@pmillet
Contributor

pmillet commented Apr 6, 2017

cmsRun is not invoked when creating the sherpack.
I do not have a recipe ready, but I think specifying '--enable-mpi' when building Sherpa should be sufficient (plus possibly specifying the path to OpenMPI).
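
A rough sketch of what that could look like for a standalone build (the OpenMPI path and prefix are placeholders; this is not a tested cmsdist recipe):

./configure --enable-mpi \
            CC=/path/to/openmpi/bin/mpicc \
            CXX=/path/to/openmpi/bin/mpic++ \
            --prefix=$HOME/sherpa-mpi
make -j8 && make install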

@pmillet
Contributor

pmillet commented Apr 6, 2017

apparently this was also tested in the past, see
https://github.com/cms-sw/cmsdist/blob/IB/CMSSW_5_3_X/slc5_amd64_gcc462-sherpa2/sherpampi.spec

@perrozzi
Contributor Author

perrozzi commented Apr 6, 2017

(@fabiocos, please comment if you have anything to add)

@fabiocos
Contributor

fabiocos commented Apr 6, 2017

This was indeed tested and used for private production (supporting SMP-12-017) in the branch IB/CMSSW_5_3_X/slc5_amd64_gcc462-sherpa2 with Sherpa version 2.0.beta2, which stayed around for about one year. You may find in the same branch the openmpi.spec I used at that time. I never tested the setup on multiple nodes simultaneously, but I used a 16-core node for production of the sherpacks, and it worked.

@bbockelm
Contributor

bbockelm commented Apr 6, 2017

@perrozzi - in case you guys take this approach with CMS Connect, I've been working to get more hosts available with >8 cores.

The maximum currently is 56 (and you may end up in quite a long line for these). At 8 cores, you should get as many cores as you need.

HTH!

@perrozzi
Contributor Author

perrozzi commented Apr 7, 2017

@bbockelm thanks for the info, this possibility will definitely be taken into account once MPI is integrated in CMSSW

@perrozzi
Contributor Author

perrozzi commented Apr 7, 2017

@fabiocos thanks a lot, we should simply replicate what was done in 53x then

@perrozzi
Contributor Author

@pmillet could you try to make a PR to copy what was in 53x?

@pmillet
Contributor

pmillet commented Apr 11, 2017

ok

@perrozzi
Contributor Author

any news on this?

@pmillet
Contributor

pmillet commented May 11, 2017

So far I tried simply copying what was done in 53X to a recent release, but got errors due to missing libraries. I will try again at the beginning of next week. Sorry for the delay.

@bbockelm
Contributor

Hi @pmillet -

If you get stuck, feel free to post here and let us know!

Brian

@pmillet
Contributor

pmillet commented May 15, 2017

Ok, thanks! So this is what I did: pmillet/cmsdist@5621b0b
When I try to build Sherpa it fails while building OpenMPI. The error message is the following:

RpmInstallFailed: Failed to install package openmpi. Reason:
error: Failed dependencies:
    /bin/perl is needed by external+openmpi+1.6.5-1-1.x86_64
    libbat.so()(64bit) is needed by external+openmpi+1.6.5-1-1.x86_64
    liblsf.so()(64bit) is needed by external+openmpi+1.6.5-1-1.x86_64
Does anybody know what to include to get those files?
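
For what it's worth, libbat.so and liblsf.so come from LSF, which OpenMPI links against when its configure detects an LSF installation on the build host. One possible way around it (just a guess on my side, not necessarily the right fix for cmsdist) is to disable that support explicitly:

# build OpenMPI without LSF support, so the packaged binaries never
# pick up the libbat/liblsf automatic RPM dependencies
./configure --without-lsf [other options]
make -j8 && make install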

@davidlange6
Contributor

davidlange6 commented May 15, 2017 via email

@pmillet
Contributor

pmillet commented May 16, 2017

Thanks. I added the following lines to the openmpi spec file (https://github.com/pmillet/cmsdist/blob/sherpa_openmpi/openmpi.spec#L6-L9) and updated the version to a more recent one.
After building I now get the following error:

Checking local path dependency for rpm package external+openmpi+2.1.0-cms just build.
Requested to quit.
The action "install-external+openmpi+2.1.0-cms" was not completed successfully because Traceback (most recent call last):
  File "/build/millet/ext/CMSSW_9_1_X/20170515_1016/PKGTOOLS/scheduler.py", line 199, in doSerial
    result = commandSpec0
  File "PKGTOOLS/cmsBuild", line 3017, in installPackage
  File "PKGTOOLS/cmsBuild", line 2829, in installRpm
RpmInstallFailed: Failed to install package openmpi. Reason:
error: Failed dependencies:
    libbat.so()(64bit) is needed by external+openmpi+2.1.0-cms-1-1.x86_64
    liblsf.so()(64bit) is needed by external+openmpi+2.1.0-cms-1-1.x86_64

@davidlange6
Contributor

davidlange6 commented May 16, 2017 via email

@smuzaffar
Contributor

please check /build/millet/ext/CMSSW_9_1_X/20170515_1016/a/BUILD/slc6_amd64_gcc530/external/openmpi/2.1.1/log file for more details

@smuzaffar
Contributor

@pmillet, or make a PR with your changes and we will try to build/debug the issue

@davidlange6
Contributor

davidlange6 commented May 17, 2017 via email

@pmillet
Contributor

pmillet commented May 18, 2017

Thanks again for the suggestions. The issue @davidlange6 linked seems to be similar to the one I encountered. I opened an issue in the openmpi repo open-mpi/ompi#3546.

@pmillet
Contributor

pmillet commented May 22, 2017

It looks like the fix the Open-MPI people propose needs a more recent version of autotools. I tried building with pmillet/cmsdist@fce2afc and it fails with:

++ export M4=/build/millet/ext/CMSSW_9_1_X/20170515_1016/a/slc6_amd64_gcc530/external/autotools/1.2-oenich/bin/m4
++ M4=/build/millet/ext/CMSSW_9_1_X/20170515_1016/a/slc6_amd64_gcc530/external/autotools/1.2-oenich/bin/m4
+ for x in external/gcc/5.3.0 external/autotools/1.2-oenich .
+ i=/build/millet/ext/CMSSW_9_1_X/20170515_1016/a/slc6_amd64_gcc530/./etc/profile.d/init.sh
+ '[' -f /build/millet/ext/CMSSW_9_1_X/20170515_1016/a/slc6_amd64_gcc530/./etc/profile.d/init.sh ']'
+ make -j 24
CDPATH="${ZSH_VERSION+.}:" && cd . && /bin/sh /build/millet/ext/CMSSW_9_1_X/20170515_1016/a/BUILD/slc6_amd64_gcc530/external/openmpi/2.1.1/openmpi-2.1.1/config/missing aclocal-1.15 -I config
/build/millet/ext/CMSSW_9_1_X/20170515_1016/a/BUILD/slc6_amd64_gcc530/external/openmpi/2.1.1/openmpi-2.1.1/config/missing: line 81: aclocal-1.15: command not found
WARNING: 'aclocal-1.15' is missing on your system.
         You should only need it if you modified 'acinclude.m4' or
         'configure.ac' or m4 files included by 'configure.ac'.
         The 'aclocal' program is part of the GNU Automake package:
         <http://www.gnu.org/software/automake>
         It also requires GNU Autoconf, GNU m4 and Perl in order to run:
         <http://www.gnu.org/software/autoconf>
         <http://www.gnu.org/software/m4/>
         <http://www.perl.org/>
make: *** [aclocal.m4] Error 127
error: Bad exit status from /build/millet/ext/CMSSW_9_1_X/20170515_1016/a/tmp/rpm-tmp.RgaH8G (%build)

Should I try asking for a patch which does not need a newer version of autotools? Or is updating autotools a possibility? I attached the full log to this post. Thanks again for your help.
openmpi_build.txt

@davidlt
Contributor

davidlt commented May 22, 2017

We could bump automake in DEVEL IBs. It might need some minor cleaning up in the worst-case scenario.

The fix does not directly require automake 1.15, but once you modify the file the build scripts need to be regenerated. They were originally generated with automake 1.15, which was released more than two years ago.

@pmillet
Contributor

pmillet commented Jun 7, 2017

How should we proceed with this? I tried what is suggested in open-mpi/ompi#3546, which is to apply the patch on a system which has a recent automake, run

./autogen.pl
./configure
make dist

and use this tarball on the SLC6 machine. However, I still get the same error as above. Will there be an automake update? Or should I keep trying to use the method from above? Thanks a lot for your advice.

@davidlange6
Contributor

davidlange6 commented Jun 7, 2017 via email

@pmillet
Contributor

pmillet commented Jun 7, 2017

OK, I'll try the gcc700 branch

@davidlt
Contributor

davidlt commented Jun 7, 2017

You are getting this error because you are probably applying the patch to a release tarball, which has all build scripts already generated. Use ./autogen.pl --force. According to the script, automake 1.12 is fine. I added 1.15 this long weekend, but only to GCC 7.1.1 builds.
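
Putting the pieces together, the regeneration sequence on a machine with new enough autotools would look roughly like this (the patch file name is a placeholder):

cd openmpi-2.1.1
# hypothetical file name for the fix proposed in open-mpi/ompi#3546
patch -p1 < ompi-3546-fix.patch
# regenerate the build scripts even though this is a release tarball
./autogen.pl --force
./configure
# produce a fresh tarball that no longer needs autotools at build time
make dist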

@pmillet
Contributor

pmillet commented Jun 7, 2017

I used the --force option (sorry, it is missing in the commands above), since autogen.pl refused to run without it (for the reason you explained).

@perrozzi
Contributor Author

ping: any news?

@smuzaffar
Contributor

@pmillet, did you try it with the gcc700 cmsdist branch?

@pmillet
Contributor

pmillet commented Jun 14, 2017

Yes. OpenMPI builds fine with the patch and the updated automake. Yesterday I just had the problem that Sherpa would not build anymore. I modified the build file and it is currently building again. I'll report the outcome as soon as it is ready.

@smuzaffar
Contributor

Once you have it working, please make a CMSDIST PR for the gcc700 branch.

@pmillet
Contributor

pmillet commented Jun 14, 2017

So I still get the same error as yesterday when building Sherpa:

checking whether the C compiler works... no
configure: error: in `/build/millet/ext/CMSSW_9_2_ROOT6_X/20170613_1014/a/BUILD/slc6_amd64_gcc700/external/sherpa/2.2.2-cms/sherpa-2.2.2':
configure: error: C compiler cannot create executables
See `config.log' for more details
error: Bad exit status from /build/millet/ext/CMSSW_9_2_ROOT6_X/20170613_1014/a/tmp/rpm-tmp.PBy4ZF (%build)

logs:
https://cernbox.cern.ch/index.php/s/rNwleeKfmlDnehI
https://cernbox.cern.ch/index.php/s/3l4XFxJiFeeWmIE
cmsdist: https://github.com/pmillet/cmsdist/tree/IB/CMSSW_9_2_X/gcc700

@davidlt
Contributor

davidlt commented Jun 14, 2017

Sherpa cannot find libmpi_cxx.*

Check if it exists under /build/millet/ext/CMSSW_9_2_ROOT6_X/20170613_1014/a/slc6_amd64_gcc700/external/openmpi/2.1.1/lib/

@davidlt
Contributor

davidlt commented Jun 14, 2017

From ./configure --help:

  --enable-mpi-cxx        enable C++ MPI bindings (default: disabled)
  --enable-mpi-cxx-seek   enable support for MPI::SEEK_SET, MPI::SEEK_END, and
                          MPI::SEEK_POS in C++ bindings (default: enabled)

If both are needed by Sherpa, enable them explicitly.
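
A minimal sketch of the corresponding configure call, plus a check that the C++ binding library actually got built (the prefix is just a placeholder):

./configure --prefix=$HOME/openmpi-install \
            --enable-mpi-cxx \
            --enable-mpi-cxx-seek
make -j8 && make install
# this is the library Sherpa's configure is looking for
ls $HOME/openmpi-install/lib/libmpi_cxx.*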

@pmillet
Contributor

pmillet commented Jun 18, 2017

It finally seems to have worked. PR (cms-sw/cmsdist#3108). Thanks for all the comments and suggestions.

@perrozzi
Contributor Author

@pmillet

  • can you please remind me which releases this feature is in?
  • is there any documentation for people willing to try it?
  • can you please report at a GEN meeting? (even next Monday, although briefly)
  • can this be backported to CMSSW 71X to be tested with 2016 MC conditions?

@pmillet
Contributor

pmillet commented Jul 17, 2017

Somehow my answer did not make it here:

@pmillet

can you please remind me which releases have this feature in?

currently 93X

is there any documentation for people willing to try?

There is no dedicated documentation for this yet. In principle it should be enough to add -m 'mpirun -n NCORES' to the MakeSherpaLibs.sh call.
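
For example, running the library creation on 8 cores would look roughly like this (a hypothetical invocation; everything apart from -m is whatever you normally pass):

./MakeSherpaLibs.sh [usual options] -m 'mpirun -n 8'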

can you please report at a GEN meeting? (even next monday, although brief)

So far the only thing to report would be that it is included in 93X. I could make a single slide for this if you want.

can this be backported to CMSSW 71X to be tested with 2016 MC conditions?

From the Sherpa side I do not see an issue with backporting it.

@perrozzi
Contributor Author

Hi Philipp,
thanks a lot.

I kindly ask you to please backport the feature to 71X, add a line with the instructions to the Sherpa twiki, and prepare a slide for today's GEN meeting.

Thanks
Luca

@pmillet
Contributor

pmillet commented Jul 18, 2017

When trying to build OpenMPI in 71X, the following message appears:

 + ./autogen.pl --force
Open MPI autogen (buckle up!)
1. Checking tool versions
   Searching for autoconf
     Found autoconf version 2.68; checking version...
       Found version component 2 -- need 2
       Found version component 68 -- need 69
     ==> Too low!  Skipping this version
=================================================================
I could not find a recent enough copy of autoconf.
I need at least 2.69, but only found the following versions:
    autoconf: 2.68
I am gonna abort.  :-(
Please make sure you are using at least the following versions of the
tools:
    GNU Autoconf: 2.69
    GNU Automake: 1.12.2
    GNU Libtool: 2.4.2
=================================================================
error: Bad exit status from /build/millet/ext/CMSSW_7_1_X/20170717_2053/a/tmp/rpm-tmp.jJ3QkI (%prep)

Can we get the newer autoconf also in 71X?

@smuzaffar
Copy link
Contributor

Just get the latest autotools.spec from https://github.com/cms-sw/cmsdist/blob/IB/CMSSW_9_3_X/gcc630/autotools.spec into your 71X cmsdist. Once everything builds, make a pull request.
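
For example, one way to pull that file into a local 71X cmsdist checkout (a sketch; adjust paths to your setup):

cd cmsdist   # your CMSSW_7_1_X cmsdist checkout
curl -L -o autotools.spec \
  https://raw.githubusercontent.com/cms-sw/cmsdist/IB/CMSSW_9_3_X/gcc630/autotools.spec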

@makortel
Contributor

Can this issue be closed?

@smuzaffar
Contributor

Yes, this is complete.
