Collective effects with MPI + improve interface to passmethod impedance #320

Merged: 58 commits merged into master on Dec 16, 2021

Conversation

@swhite2401 (Contributor) commented Oct 19, 2021

This branch introduces several changes:

-WakefieldPass and atimplib: comprehensive passmethod and library to perform multi-turn tracking in the presence of a wakefield. MPI capabilities were added, and setup.py was modified to compile with mpicc
-collective/wake_object: python object defining a wakefield
-collective/wake_elements: classes for wake element creation, to be completed with other specific cases
-collective/wake_functions: collection of analytical wake functions, to be completed

The wake can be accumulated over multiple turns, which allows beam loading to be modelled using the LongitudinalResonatorElement.
Full multi-bunch implementation is possible once this is validated.

Single-bunch/single-turn results are identical to the previous ImpedanceTablePass.c implementation.
Multi-turn results have been benchmarked by @lcarver against BLonD and pyHEADTAIL: this needs to be re-validated with the latest developments.

A C macro was added to setup.py to enable or disable compilation of the MPI-related code.

The default behavior of pyAT is preserved: no MPI code is compiled unless the macro is activated, and the collective effects modules have to be explicitly imported.

Build with MPI:
rm -rf build (seems to be needed, for an unknown reason)
MPI=1 python ./setup.py develop
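
A quick sanity check that the MPI build and mpi4py are working could be something like (assuming mpi4py is installed):

mpirun -n 2 python -c "from mpi4py import MPI; c = MPI.COMM_WORLD; print(c.Get_rank(), 'of', c.Get_size())"

which should print one line per process.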

Several additions are needed in future PRs:
-multi-bunch implementation
-silent atpass, to avoid allocating large memory space for multi-particle tracking when only the center of mass is of interest
-class to wrap MPI functionalities such as gather, broadcast etc., to simplify actions on numpy arrays such as particle generation
-passmethod compatible with MPI that writes useful quantities to file

@lfarv (Contributor) left a comment

OK for me, I trust you for the collective effects.
Unfortunately, at the moment, one has to choose between OpenMP for tracking (except WakefieldPass) and MPI for wakefields. So:

  • since there is no overlap between the two, can one activate both together?
  • an OpenMP option for Wakefields should be easy to add (one directive for each loop). For those who don't have a cluster, this should be very efficient. But then it's one OR the other.

@swhite2401 (Contributor, Author)

In fact I thought about it and wanted to discuss it with you, because it is not obvious what will happen if both are activated.
I think it is very unlikely that both the OPENMP and MPI macros would be activated at the same time, but who knows... and in that case I am worried that something bad could happen.

So what I wanted to propose is:
-provide options for both implementations (MPI + OPENMP)
-in setup.py (or at runtime?) verify that only one is activated

This should prevent unwanted combinations, any suggestions?
(We are benchmarking it with @lcarver: single-turn results are identical to the previous implementation, multi-turn benchmarks against other codes are ongoing, and we will post figures in this thread for reference.)
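
A minimal sketch of the setup.py check proposed above (assuming the OpenMP option is also exposed as an environment variable, here called OPENMP, in the same way as the MPI=1 flag):

import os

use_mpi = os.environ.get('MPI', '0') != '0'
use_omp = os.environ.get('OPENMP', '0') != '0'
if use_mpi and use_omp:
    raise SystemExit('Please activate either MPI or OPENMP, not both')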

@lfarv (Contributor) commented Nov 15, 2021

I tried to activate both. No problem up to now, but I have not checked OpenMP performance yet, and I am missing data to test wakefields. Code outside the OpenMP directives is unchanged, and MPI is apparently limited to library calls (but I don't know anything about MPI), so I don't expect problems, but it needs more checks.

@lfarv (Contributor) commented Nov 16, 2021

Hi @swhite2401. I started looking at the implementation of collective effects: very impressive! I think it deserves some documentation (possibly separate from this PR). If you have anything, I can take care of the integration into the Web site.

I have a question about the python part: as far as I could see, there is no interaction between python and MPI. It seems that python can totally ignore whether at.c is compiled with or without MPI. Did I miss something? If not, I do not see the reason for importing mpi4py. The imported variable MPI is not used anywhere.

Concerning the C part, I have to look briefly at MPI…

@swhite2401 (Contributor, Author)

You are right, but for the MPI commands in C to work, MPI needs to be initialized. This can be done in C by calling MPI_Init(), which has to be followed by MPI_Finalize() for clean termination.

from mpi4py import MPI does both, so I chose this option. The main reason is that this import will be needed in any case when running the code with mpirun, to exchange data between processes during particle generation for instance.
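
For reference, a minimal illustration of what the import provides (mpi4py calls MPI_Init at import time and registers MPI_Finalize to run at interpreter exit):

from mpi4py import MPI   # MPI_Init is called here; MPI_Finalize is registered to run at exit

comm = MPI.COMM_WORLD
print('process', comm.Get_rank(), 'of', comm.Get_size())
# the rest of the script, and the C code in atimplib.c, can rely on MPI being initialized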

This is why I have on my todo list to write an MPI wrapper class to handle all this.

I can write some documentation with benchmarks added, let's wait for @lcarver's study.

Very good that you are looking into this! Any suggestions are more than welcome, my MPI skills are certainly not that great. Do you need example scripts using mpirun?

@lfarv (Contributor) commented Nov 16, 2021

So if I understand, a number of python processes will be started simultaneously by mpirun, and will wait, except the one with the keyboard control. But wait until when? Where does effective parallel work start, in the code as it is now?

And yes, I am interested in example scripts!

@lcarver (Contributor) commented Nov 16, 2021

example_resonator.txt

(I had to upload the python script as a txt file...you will have to change the extension yourself)

Hi @lfarv, I converted one of the scripts I had and simplified it as much as I could, as an example of a longitudinal resonator wake function with MPI. While it uses the same parameters as the plot I showed above for the benchmarking, there are subtle differences between this script and the script that generated the figure above (for the figure I used a wake potential identical to the one used in the Haissinski solver, with more particles and slices and some post-processing; here we use a simple longitudinal resonator wake function with no post-processing).

This script can be run with (for 5 cores)

mpirun -n 5 python example_resonator.py

For me on grappa, with these settings 1 core takes 83.9 seconds and 10 cores take 11.5 seconds (for the tracking part, of course).

@swhite2401 (Contributor, Author)

@lfarv mpirun will launch n instances of the code running in parallel, which you can identify by their rank (see Lee's example).
They will not know about each other until they reach an MPI command, as found in atimplib.c. You explicitly set a blocking point (all processes stop there until all have reached this point) using the MPI_Barrier command.
Some MPI commands have implicit synchronization points, but I tend to put a barrier in all MPI blocks just for safety (not sure if that is standard practice).
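
In mpi4py terms, the pattern looks roughly like this (in the PR itself the barriers are the MPI_Barrier calls in the C code of atimplib.c):

from mpi4py import MPI

comm = MPI.COMM_WORLD
# ... per-process work on the local particles ...
comm.Barrier()   # every process blocks here until all processes have reached this line
# ... code that relies on all processes being synchronized ...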

@lfarv (Contributor) commented Nov 17, 2021

Hi @swhite2401, can you send me your S28F.mat? Thanks

mpirun -n 5 python example_resonator.py

@lcarver (Contributor) commented Nov 17, 2021

S28F.txt
Please change the .txt to .mat

@lfarv (Contributor) commented Nov 17, 2021

@swhite2401: I ran your test on my Mac (4 cores), and here is what I get:

  • with 4 processes (-n 4): 24 s
  • with a single process (-n 1): 83 s (ratio 3.4, which looks good)
  • without using mpirun: 81 s, consistent with the previous result

But then I modified the script to remove the Wakefield element (I just commented out #fring.append(welem)). Then there is no longer any call to MPI_something in the code. And here are the results:

  • with 4 processes: 13 s
  • with 1 process: 49 s (ratio 3.8)

The shape of the result is correct (6, 50000) and the bunch shape is symmetrical, so it looks correct (but I'm not sure it is). So does this mean that without any MPI call, tracking is magically parallelised? Or are the 4 processes doing exactly the same thing, and then why is the time reduced? And where is the sharing of particles between the processes done?

For me, there are still a number of things to be clarified about MPI…

@lcarver (Contributor) commented Nov 17, 2021

The number of particles used in the tracking is 50000/size, where size is the number of cores you specified. For the MPI case, you specify 4 cores, each core has 50000/4 particles, and then they are brought together at the end. Without the wakefield element, you are simply tracking fewer particles per core, so it is faster.

EDIT to add: in the no-wakefield case, what you are probably doing is splitting the particles among the cores; they are still tracked independently with no cross-talk, but at the end the gather still collects them, so you return to a shape of (6, 50000).
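
Schematically, the pattern in the example script is something like this (variable names and the sigma matrix are placeholders, not the exact script):

from mpi4py import MPI
import numpy
import at

comm = MPI.COMM_WORLD
size = comm.Get_size()

sigm = numpy.diag([1.0e-6] * 6)      # placeholder sigma matrix, just for illustration
npart = int(50000 / size)            # each process gets its own share of the particles
rin = at.beam(npart, sigm)           # generated independently on each process
# ... track the local particles; the wake computation in WakefieldPass communicates between processes via MPI ...
chunks = comm.gather(rin, root=0)    # rank 0 receives the list of per-process (6, npart) arrays
if comm.Get_rank() == 0:
    rout = numpy.hstack(chunks)      # back to the full (6, 50000) shape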

@swhite2401 (Contributor, Author) commented Nov 17, 2021

mpirun will basically launch four copies of your code, whether MPI commands are present or not.
These MPI commands are just used to communicate between processes.

@lfarv (Contributor) commented Nov 17, 2021

EDIT to add: in the no-wakefield case, what you are probably doing is splitting the particles among the cores; they are still tracked independently with no cross-talk, but at the end the gather still collects them, so you return to a shape of (6, 50000).

This is what I understand, and crosstalk is not necessary in that case, but the question is: where does this splitting occur? Is the generation of particles by atbeam done in one process, or duplicated in the 4 processes? Does the splitting occur in the python part or in the C part? Why are the 4 processes each taking 1/4 of the particles instead of running all 50000? There is nowhere an instruction indicating how the splitting should occur…

@swhite2401 (Contributor, Author)

Ah, the splitting is implicitly done in this line:

Npart = int(50000/size), where size is the number of processes defined by the -n option of mpirun.
Each process will run at.beam(Npart, sigm) in parallel (in your case Npart = 50000/4) and then track it, etc., fully independently until the gather command is reached.

Another possible way is to generate the full beam on rank 0 and then split:

if rank == 0:
    chunks = numpy.hsplit(at.beam(50000, sigm), size)   # one block of 50000/size particles per process
else:
    chunks = None
part = comm.scatter(chunks, root=0)   # I am not sure about the exact syntax

Then at the end you gather the full array on rank 0 (the default) using the gather command, as done in Lee's example.

By the way, running the code without mpirun, with just python ./example.py, is called a singleton and will behave like the standard single-process compilation.

Not sure if that answers your questions...

@lcarver (Contributor) commented Nov 17, 2021

I think the answer to Laurent's question is that the splitting is handled by mpirun, not by python. By launching with mpirun, you run the same script on each thread, in an environment where you can communicate with the other threads using the comm object.

@lfarv (Contributor) commented Nov 17, 2021

Thanks @swhite2401 and @lcarver, it gets clearer! So I understand that:

  • it solves the general problem of parallel tracking, almost for free: no MPI directive anywhere in integrators, except for special cases like impedances,
  • in your example, particles are generated by a random generator, so they are different for each thread (unless the generators start from the same seed). In other cases, the particles must be explicitly distributed among the threads, so we need the "scatter" directive,
  • in a standard AT session, a large part of the work (optics, matching…) should be restricted to rank 0 to avoid running the same thing in multiple threads (see the sketch below). Only the tracking needs to be parallelised, with possibly other things like dynamic aperture etc. There is indeed a need for classes to handle parallelisation where it is needed.
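
A minimal sketch of that last point:

from mpi4py import MPI

comm = MPI.COMM_WORLD
if comm.Get_rank() == 0:
    pass  # optics, matching, post-processing: run once on rank 0 only
# tracking: every rank works on its own share of the particles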

So in my opinion, this looks very promising, though it's not yet ready for general use. Concerning the Wakefields, it's ok for me, merge it when you are confident.

@swhite2401 (Contributor, Author)

@lfarv, I think that is a good summary!
To complement: I would not recommend this at all for general use; MPI has a large overhead that rules it out for light computations such as optics or orbit.
It is nevertheless perfectly suited for DA, lifetime or RM computations (or any CPU-intensive simulation that one would like to run on a cluster), but it will require a special class to handle "parallelized" data and outputs.
For moderate CPU needs I would still recommend OpenMP or patpass.

@lcarver (Contributor) commented Dec 10, 2021

I made a comparison of the TMCI threshold and horizontal tune shift. Frustratingly, I was not able to recover the exact longitudinal wake potential that I used in my original simulations. Nonetheless, the TMCI threshold remains the same.

Wakefield pass: [image]

ImpedanceTablePass: [image]

As it shows, there are small differences in the tune shift but the threshold remains the same.

For the horizontal tune shift:

Wakefield pass: [image]

ImpedanceTablePass: [image]

For me this is perfectly fine. I know there are small differences in the longitudinal wake between the two cases, which explains why the tune shifts are not identical. The fact that the horizontal tune shift is close also shows that the quadrupolar wake is cancelling the dipolar wake in horizontal. For me there are no major issues.

I have a few other benchmarks ongoing: I am checking the resistive-wall wake against theory and PyHEADTAIL, and also a transverse broadband resonator. But I think this can be merged before those benchmarks are complete, as this passmethod at least does the same as ImpedanceTablePass.

Edit: I also remind you of the previous comment on this PR that the longitudinal broadband resonator agreed perfectly with the Haissinski formula.

@swhite2401 (Contributor, Author)

I set this one as ready for review. Please comment in case you see something odd; if not, I think we can merge this one.

@swhite2401 swhite2401 marked this pull request as ready for review December 10, 2021 13:46
@swhite2401 swhite2401 merged commit e24ca2c into master Dec 16, 2021
@swhite2401 swhite2401 deleted the mp_collective branch December 16, 2021 10:26