Collective effects with MPI + improve interface to passmethod impedance #320

Merged: 58 commits merged into master on Dec 16, 2021

Conversation

@swhite2401 (Contributor) commented Oct 19, 2021

This branch introduces several changes:

-WakefieldPass and atimplib: comprehensive passmethod and library to perform multi-turn tracking in the presence of a wakefield. MPI capabilities were added, and setup.py was modified to compile with mpicc
-collective/wake_object: python object defining a wakefield
-collective/wake_elements: classes for wake element creation, to be completed with other specific cases
-collective/wake_functions: collection of analytical wake functions, to be completed

The wake can be accumulated over multiple turns, which allows beam loading to be modelled using the LongitudinalResonatorElement.
Full multi-bunch implementation is possible once this is validated.

Single-bunch/single-turn results are identical to the previous ImpedanceTablePass.c implementation.
Multi-turn results have been benchmarked by @lcarver against BLonD and pyHEADTAIL: this needs to be re-validated with the latest developments.

A C macro was added to setup.py to enable or disable compilation of the MPI-related code.

The default behavior of pyAT is preserved: no MPI code is compiled unless the macro is activated, and the collective effects modules have to be explicitly imported.

Build with MPI:
rm -rf build (seems to be needed, for an unknown reason)
MPI=1 python ./setup.py develop
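
A quick sanity check that the MPI build and mpi4py are working could be something like (assuming mpi4py is installed):

mpirun -n 2 python -c "from mpi4py import MPI; c = MPI.COMM_WORLD; print(c.Get_rank(), 'of', c.Get_size())"

which should print one line per process.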

Several additions are needed in future PRs:
-multi-bunch implementation
-silent atpass, to avoid allocating large memory space for multi-particle tracking when only the center of mass is of interest
-class to wrap MPI functionalities such as gather, broadcast etc., to simplify actions on numpy arrays such as particle generation
-passmethod compatible with MPI that writes useful quantities to file

@lfarv (Contributor) left a comment

OK for me, I trust you for the collective effects.
Unfortunately, at the moment, one has to choose between OpenMP for tracking (except WakefieldPass) and MPI for wakefields. So:

  • since there is no overlap between the two, can one activate both together?
  • an OpenMP option for Wakefields should be easy to add (one directive for each loop). For those who don't have a cluster, this should be very efficient. But then it's one OR the other.

@swhite2401 (Contributor, Author)

In fact I thought about it and wanted to discuss it with you, because it is not obvious what will happen if both are activated.
I think it is very unlikely that both the OPENMP and MPI macros would be activated at the same time, but who knows... and in that case I am worried that something bad could happen.

So what I wanted to propose is:
-provide options for both implementations (MPI + OPENMP)
-in setup.py (or at runtime?) verify that only one is activated

This should prevent unwanted combinations, any suggestions?
(We are benchmarking it with @lcarver: single-turn results are identical to the previous implementation, multi-turn benchmarks against other codes are ongoing, and we will post figures in this thread for reference.)
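
A minimal sketch of the setup.py check proposed above (assuming the OpenMP option is also exposed as an environment variable, here called OPENMP, in the same way as the MPI=1 flag):

import os

use_mpi = os.environ.get('MPI', '0') != '0'
use_omp = os.environ.get('OPENMP', '0') != '0'
if use_mpi and use_omp:
    raise SystemExit('Please activate either MPI or OPENMP, not both')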

@lfarv (Contributor) commented Nov 15, 2021

I tried to activate both. No problem up to now, but I have not checked OpenMP performance yet, and I am missing data to test wakefields. Code outside the OpenMP directives is unchanged, and MPI is apparently limited to library calls (but I don't know anything about MPI), so I don't expect problems, but it needs more checks.

@lfarv (Contributor) commented Nov 16, 2021

Hi @swhite2401. I started looking at the implementation of collective effects: very impressive! I think it deserves some documentation (possibly separate from this PR). If you have anything, I can take care of the integration into the Web site.

I have a question about the python part: as far as I could see, there is no interaction between python and MPI. It seems that python can totally ignore whether at.c is compiled with or without MPI. Did I miss something? If not, I do not see the reason for importing mpi4py. The imported variable MPI is not used anywhere.

Concerning the C part, I have to look briefly at MPI…

@swhite2401 (Contributor, Author)

You are right, but for the MPI commands in C to work, MPI needs to be initialized. This can be done in C by calling MPI_Init(), which has to be followed by MPI_Finalize() for clean termination.

from mpi4py import MPI does both, so I chose this option. The main reason is that this import will be needed in any case when running the code with mpirun, to exchange data between processes during particle generation for instance.
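
For reference, a minimal illustration of what the import provides (mpi4py calls MPI_Init at import time and registers MPI_Finalize to run at interpreter exit):

from mpi4py import MPI   # MPI_Init is called here; MPI_Finalize is registered to run at exit

comm = MPI.COMM_WORLD
print('process', comm.Get_rank(), 'of', comm.Get_size())
# the rest of the script, and the C code in atimplib.c, can rely on MPI being initialized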

This is why I have on my todo list to write an MPI wrapper class to handle all this.

I can write some documentation with benchmarks added, let's wait for @lcarver's study.

Very good that you are looking into this! Any suggestions are more than welcome, my MPI skills are certainly not that great. Do you need example scripts using mpirun?

@lfarv (Contributor) commented Nov 16, 2021

So if I understand, a number of python processes will be started simultaneously by mpirun, and will wait, except the one with the keyboard control. But wait until when? Where does effective parallel work start, in the code as it is now?

And yes, I am interested in example scripts!

@lcarver (Contributor) commented Nov 16, 2021

example_resonator.txt

(I had to upload the python script as a txt file...you will have to change the extension yourself)

Hi @lfarv, I converted one of the scripts I had and simplified it as much as I could, as an example of a longitudinal resonator wake function with MPI. While it uses the same parameters as the plot I showed above for the benchmarking, there are subtle differences between this script and the script that generated the figure above (for the figure I used a wake potential identical to the one used in the Haissinski solver, with more particles and slices and some post-processing; here we use a simple longitudinal resonator wake function with no post-processing).

This script can be run with (for 5 cores)

mpirun -n 5 python example_resonator.py

For me on grappa, with these settings 1 core takes 83.9 seconds and 10 cores take 11.5 seconds (for the tracking part, of course).

@swhite2401 (Contributor, Author)

@lfarv mpirun will launch n instances of the code running in parallel, which you can identify by their rank (see Lee's example).
They will not know about each other until they reach an MPI command, as found in atimplib.c. You explicitly set a blocking point (all processes stop there until all have reached this point) using the MPI_Barrier command.
Some MPI commands have implicit synchronization points, but I tend to put a barrier in all MPI blocks just for safety (not sure if that is standard practice).
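
In mpi4py terms, the pattern looks roughly like this (in the PR itself the barriers are the MPI_Barrier calls in the C code of atimplib.c):

from mpi4py import MPI

comm = MPI.COMM_WORLD
# ... per-process work on the local particles ...
comm.Barrier()   # every process blocks here until all processes have reached this line
# ... code that relies on all processes being synchronized ...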

@lfarv (Contributor) commented Nov 17, 2021

Hi @swhite2401, can you send me your S28F.mat? Thanks

mpirun -n 5 python example_resonator.py

@lcarver (Contributor) commented Nov 17, 2021

S28F.txt
Please change the .txt to .mat

@lfarv (Contributor) commented Nov 17, 2021

@swhite2401: I ran your test on my Mac (4 cores), and here is what I get:

  • with 4 processes (-n 4): 24 s
  • with a single process (-n 1): 83 s (ratio 3.4, which looks good)
  • without using mpirun: 81 s, consistent with the previous result

But then I modified the script to remove the Wakefield element (I just commented out #fring.append(welem)). Then there is no longer any call to MPI_something in the code. And here are the results:

  • with 4 processes: 13 s
  • with 1 process: 49 s (ratio 3.8)

The shape of the result is correct (6, 50000) and the bunch shape is symmetrical, so it looks correct (but I'm not sure it is). So does this mean that without any MPI call, tracking is magically parallelised? Or are the 4 processes doing exactly the same thing, and then why is the time reduced? And where is the sharing of particles between the processes done?

For me, there are still a number of things to be clarified about MPI…

@lcarver (Contributor) commented Nov 17, 2021

The number of particles used in the tracking is 50000/size, where size is the number of cores you specified. For the MPI case, you specify 4 cores, each core has 50000/4 particles, and then they are brought together at the end. Without the wakefield element, you are simply tracking fewer particles per core, so it is faster.

EDIT to add: in the no-wakefield case, what you are probably doing is splitting the particles among the cores; they are still tracked independently with no cross-talk, but at the end the gather still collects them, so you return to a shape of (6, 50000).
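
Schematically, the pattern in the example script is something like this (variable names and the sigma matrix are placeholders, not the exact script):

from mpi4py import MPI
import numpy
import at

comm = MPI.COMM_WORLD
size = comm.Get_size()

sigm = numpy.diag([1.0e-6] * 6)      # placeholder sigma matrix, just for illustration
npart = int(50000 / size)            # each process gets its own share of the particles
rin = at.beam(npart, sigm)           # generated independently on each process
# ... track the local particles; the wake computation in WakefieldPass communicates between processes via MPI ...
chunks = comm.gather(rin, root=0)    # rank 0 receives the list of per-process (6, npart) arrays
if comm.Get_rank() == 0:
    rout = numpy.hstack(chunks)      # back to the full (6, 50000) shape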

@swhite2401 (Contributor, Author) commented Nov 17, 2021

mpirun will basically launch four copies of your code, whether MPI commands are present or not.
These MPI commands are just used to communicate between processes.

@lfarv (Contributor) commented Nov 17, 2021

EDIT to add: in the no-wakefield case, what you are probably doing is splitting the particles among the cores; they are still tracked independently with no cross-talk, but at the end the gather still collects them, so you return to a shape of (6, 50000).

This is what I understand, and crosstalk is not necessary in that case, but the question is: where does this splitting occur? Is the generation of particles by atbeam done in one process, or duplicated in the 4 processes? Does the splitting occur in the python part or in the C part? Why are the 4 processes each taking 1/4 of the particles instead of running all 50000? There is nowhere an instruction indicating how the splitting should occur…

@swhite2401 (Contributor, Author)

Ah, the splitting is implicitly done in this line:

Npart = int(50000/size), where size is the number of processes defined by the -n option of mpirun.
Each process will run at.beam(Npart, sigm) in parallel (in your case Npart = 50000/4) and then track it, etc., fully independently until the gather command is reached.

Another possible way is to generate the full beam on rank 0 and then split:

if rank == 0:
    chunks = numpy.hsplit(at.beam(50000, sigm), size)   # one block of 50000/size particles per process
else:
    chunks = None
part = comm.scatter(chunks, root=0)   # I am not sure about the exact syntax

Then at the end you gather the full array on rank 0 (the default) using the gather command, as done in Lee's example.

By the way, running the code without mpirun, with just python ./example.py, is called a singleton and will behave like the standard single-process compilation.

Not sure if that answers your questions...

@lcarver (Contributor) commented Nov 17, 2021

I think the answer to Laurent's question is that the splitting is handled by mpirun, not by python. By launching with mpirun, you run the same script on each thread, in an environment where you can communicate with the other threads using the comm object.

@lfarv (Contributor) commented Nov 17, 2021

Thanks @swhite2401 and @lcarver, it gets clearer! So I understand that:

  • it solves the general problem of parallel tracking, almost for free: no MPI directive anywhere in integrators, except for special cases like impedances,
  • in your example, particles are generated by a random generator, so they are different for each thread (unless the generators start from the same seed). In other cases, the particles must be explicitly distributed among the threads, so we need the "scatter" directive,
  • in a standard AT session, a large part of the work (optics, matching…) should be restricted to rank 0 to avoid running the same thing in multiple threads (see the sketch below). Only the tracking needs to be parallelised, with possibly other things like dynamic aperture etc. There is indeed a need for classes to handle parallelisation where it is needed.
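
A minimal sketch of that last point:

from mpi4py import MPI

comm = MPI.COMM_WORLD
if comm.Get_rank() == 0:
    pass  # optics, matching, post-processing: run once on rank 0 only
# tracking: every rank works on its own share of the particles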

So in my opinion, this looks very promising, though it's not yet ready for general use. Concerning the Wakefields, it's ok for me, merge it when you are confident.

@swhite2401 (Contributor, Author)

@lfarv, I think that is a good summary!
To complement: I would not recommend this at all for general use; MPI has a large overhead that rules it out for light computations such as optics or orbit.
It is nevertheless perfectly suited for DA, lifetime or RM computations (or any CPU-intensive simulation that one would like to run on a cluster), but it will require a special class to handle "parallelized" data and outputs.
For moderate CPU needs I would still recommend OpenMP or patpass.

@lcarver (Contributor) commented Dec 10, 2021

I made a comparison of the TMCI threshold and horizontal tune shift. Frustratingly, I was not able to recover the exact longitudinal wake potential that I used in my original simulations. Nonetheless, the TMCI threshold remains the same.

Wakefield pass: [image]

ImpedanceTablePass: [image]

As it shows, there are small differences in the tune shift but the threshold remains the same.

For the horizontal tune shift:

Wakefield pass: [image]

ImpedanceTablePass: [image]

For me this is perfectly fine. I know there are small differences in the longitudinal wake between the two cases, which explains why the tune shifts are not identical. The fact that the horizontal tune shift is close also shows that the quadrupolar wake is cancelling the dipolar wake in horizontal. For me there are no major issues.

I have a few other benchmarks ongoing: I am checking the resistive-wall wake against theory and PyHEADTAIL, and also a transverse broadband resonator. But I think this can be merged before those benchmarks are complete, as this passmethod at least does the same as ImpedanceTablePass.

Edit: I also remind you of the previous comment on this PR that the longitudinal broadband resonator agreed perfectly with the Haissinski formula.

@swhite2401 (Contributor, Author)

I set this one as ready for review. Please comment in case you see something odd; if not, I think we can merge this one.

@swhite2401 swhite2401 marked this pull request as ready for review December 10, 2021 13:46
@swhite2401 swhite2401 merged commit e24ca2c into master Dec 16, 2021
@swhite2401 swhite2401 deleted the mp_collective branch December 16, 2021 10:26