
libsharp discontinued, alternatives for MPI smoothing? #92

Closed
zonca opened this issue Oct 21, 2021 · 10 comments
zonca commented Oct 21, 2021

PySM relies on libsharp to smooth maps over MPI.

Libsharp is not maintained anymore: https://gitlab.mpcdf.mpg.de/mtr/libsharp

Is anyone using PySM over MPI? Should we simplify the code and remove MPI support, or find an alternative to libsharp (@mreineck, suggestions?)?

@NicolettaK @giuspugl @bthorne93

zonca self-assigned this Oct 21, 2021
@mreineck

If you are currently using libsharp and it is doing what you need, I don't think there is a reason to worry; if any bugs are found in the code, I'm still happy to fix them. My main reason to archive the repo was that I don't want people to start new projects using this code.
That said, if you have a genuine need for SHTs with MPI, I'd be very interested to hear about the use case. My impression currently is that single compute nodes are strong enough to compute SHTs of any "typical" size very quickly, and that using MPI within a single node is a waste of resources (at least for SHTs). I might of course be wrong, and if so, I'd like to hear about it :-)

zonca commented Oct 21, 2021

We are doing foreground modelling with PySM at N_side 8192. In memory, a single map component in double precision is 6.5 GB; the most demanding model has 6 layers, each with 1 IQU map and 2 auxiliary maps, so the inputs alone are about 200 GB. And that is a single model: a complete simulation would have 4 galactic models, 3 extragalactic ones, and 2 or 3 CMB components.

It's not doable yet on standard compute nodes, and we might need N_side 16384.
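
As a quick sanity check of those numbers, a minimal sketch (the layer and component counts are the ones quoted above):

import numpy as np

nside = 8192
npix = 12 * nside**2                              # HEALPix pixel count
bytes_per_map = npix * np.dtype(np.float64).itemsize

n_layers = 6                                      # most demanding model
maps_per_layer = 3 + 2                            # I, Q, U plus 2 auxiliary maps

print(f"one component: {bytes_per_map / 1e9:.1f} GB")   # ~6.4 GB
print(f"model inputs: {n_layers * maps_per_layer * bytes_per_map / 1e9:.0f} GB")  # ~190 GB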

@mreineck

Thanks for the data, that is very helpful!
The approach I would suggest (for optimizing SHT performance) in this situation is to

  • distribute the components over the compute nodes in a round-robin fashion
  • let every node do the necessary SHT steps (analysis+smoothing+synthesis) for the components assigned to it, using multithreading with all threads on the node
  • communicate the results back to the appropriate MPI tasks

This should be the fastest way of carrying out this operation. Of course the additional communication makes things more complicated, so I can perfectly understand if this is not your preferred solution. However, doing an SHT, even if it is nside=8192, lmax=16384, with hybrid MPI/OpenMP (or even worse with pure MPI) parallelization over >100 threads is quite inefficient. If you have the chance to do several SHTs simultaneously on fewer threads each, this would be preferable.
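
A minimal sketch of that layout with mpi4py and healpy (the component list, the load_component helper, and the beam width are placeholders for illustration, not PySM code):

from mpi4py import MPI
import healpy as hp
import numpy as np

comm = MPI.COMM_WORLD

# placeholder component names; in practice these would be the model layers
components = ["dust", "synchrotron", "freefree", "ame", "cmb"]

# round-robin assignment: task r handles components r, r+size, r+2*size, ...
my_components = components[comm.rank::comm.size]

smoothed = {}
for name in my_components:
    m = load_component(name)  # hypothetical I/O helper returning an IQU map
    # analysis + beam smoothing + synthesis, multithreaded on this node
    smoothed[name] = hp.smoothing(m, fwhm=np.radians(0.5))

# send the results back to whatever layout the rest of the pipeline needs
all_smoothed = comm.gather(smoothed, root=0)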

mreineck commented Oct 21, 2021

PS. Out of curiosity, if you go to map space for intermediate computations (not for the final data product), have you considered using grids other than Healpix, e.g. Gauss-Legendre? Depending on the operations you have to carry out, this could be advantageous, since the SHTs are potentially significantly faster and also exact.
If you require equal-area properties, this won't work of course.

zonca commented Oct 21, 2021

@mreineck I don't know much about Gauss-Legendre, do you have a reference I could look into?

mreineck commented Oct 21, 2021

If your band limit is lmax, the minimal corresponding Gauss-Legendre grid has (lmax+1) nodes in theta (non-equidistant) times (2*lmax+1) nodes in phi (equidistant). SHTs are exact in both directions as long as the band limit holds. This grid has the fastest SHTs you can get, roughly twice as fast as using a Healpix grid with 2*nside=lmax. Basics are briefly mentioned in, e.g., https://hal-insu.archives-ouvertes.fr/file/index/docid/762867/filename/sht.pdf.
libsharp and my other SHT libraries support this out of the box.

[Edit: fixed the nside expression]
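
For illustration, a minimal sketch of these grid dimensions (the lmax value is arbitrary), using numpy's Gauss-Legendre quadrature for the ring positions:

import numpy as np

lmax = 2047                  # arbitrary band limit for illustration

# minimal exact Gauss-Legendre grid for this band limit
nlat = lmax + 1              # rings, non-equidistant in theta
nlon = 2 * lmax + 1          # pixels per ring, equidistant in phi

# Gauss-Legendre nodes on [-1, 1]; each node is cos(theta) of a ring
x, _ = np.polynomial.legendre.leggauss(nlat)
theta = np.arccos(x)         # ring colatitudes in radians

print("GL grid: ", nlat, "x", nlon, "=", nlat * nlon, "pixels")
# Healpix grid with 2*nside = lmax, for comparison
nside = (lmax + 1) // 2
print("Healpix: ", 12 * nside**2, "pixels")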

zonca commented Oct 29, 2021

Thanks @mreineck, do you have an example of using:

https://mtr.pages.mpcdf.de/ducc/sht.html#ducc0.sht.experimental.synthesis

to transform a set of IQU Alms into a map in GL pixels and then use analysis to go back to Alms?

It's really hard to understand how to use it from the API reference.

@mreineck

The good thing about the GL grid is that it is "regular", i.e. every ring has the same number of pixels, so each map component can be stored as a simple 2D (ntheta, nphi) array, which is what the *_2d functions expect.

Its usage is demonstrated, e.g., in https://gitlab.mpcdf.mpg.de/mtr/ducc/-/blob/ducc0/python/demos/sht_demo.py.
That demo covers only the unpolarised case; I'll add polarisation and post the modified code here soon!

@mreineck

Here it is. (Sorry, GitHub doesn't appear to let me attach it.)

import ducc0
import numpy as np
from time import time

rng = np.random.default_rng(48)

def nalm(lmax, mmax):
    return ((mmax+1)*(mmax+2))//2 + (mmax+1)*(lmax-mmax)

def random_alm(lmax, mmax, spin):
    spin = list(spin)
    ncomp = len(spin)
    res = rng.uniform(-1., 1., (ncomp, nalm(lmax, mmax))) \
     + 1j*rng.uniform(-1., 1., (ncomp, nalm(lmax, mmax)))
    # make a_lm with m==0 real-valued
    res[:, 0:lmax+1].imag = 0.
    # zero a few other values dependent on spin
    for i in range(ncomp):
        ofs=0
        for s in range(spin[i]):
            res[i, ofs:ofs+spin[i]-s] = 0.
            ofs += lmax+1-s
    return res

# just run on one thread
nthreads = 1

# set maximum multipole moment
lmax = 2047
# maximum m.
mmax = lmax

# Number of pixels per ring. Must be >=2*lmax+1, but I'm choosing a larger
# number for which the FFT is faster.
nlon = 2*lmax+2

alm = random_alm(lmax, mmax, [0, 2, 2])
print("testing Gauss-Legendre grid with lmax+1 rings")

# Number of iso-latitude rings required for Gauss-Legendre grid
nlat = lmax+1

# go from a_lm to map
t0 = time()
map = np.empty((3, nlat, nlon))
# unpolarised component
ducc0.sht.experimental.synthesis_2d(
    alm=alm[0:1], ntheta=nlat, nphi=nlon, lmax=lmax, mmax=mmax, spin=0,
    geometry="GL", nthreads=nthreads, map=map[0:1])
# polarised component
ducc0.sht.experimental.synthesis_2d(
    alm=alm[1:3], ntheta=nlat, nphi=nlon, lmax=lmax, mmax=mmax, spin=2,
    geometry="GL", nthreads=nthreads, map=map[1:3])
print("time for map synthesis: {}s".format(time()-t0))

# transform back to a_lm

t0 = time()
alm2 = np.empty_like(alm)
# unpolarised component
ducc0.sht.experimental.analysis_2d(
    map=map[0:1], lmax=lmax, mmax=mmax, spin=0, geometry="GL", nthreads=nthreads, alm=alm2[0:1])
# polarised component
ducc0.sht.experimental.analysis_2d(
    map=map[1:3], lmax=lmax, mmax=mmax, spin=2, geometry="GL", nthreads=nthreads, alm=alm2[1:3])
print("time for map analysis: {}s".format(time()-t0))

# make sure input was recovered accurately
print("L2 error: ", ducc0.misc.l2error(alm, alm2))

zonca commented Nov 4, 2021

The discussion about libsharp is complete; the discussion about Gauss-Legendre pixelization continues in #91.

zonca closed this as completed Nov 4, 2021