Please tell us about your use of FINUFFT! #398
Replies: 5 comments 3 replies
-
As an example of a GPU use case with all of the parameters listed: #323 (reply in thread)
-
Hello (again)! We (the team doing Compressed Sensing for MRI at Neurospin, France) use both finufft and cufinufft in our pipeline for iterative reconstruction of non-Cartesian MRI. In practice, we work with data that is:
NB: A key aspect of the non-Cartesian trajectories is that they are composed of "shots", each of which passes through the center of k-space. Due to the number of coils and the potential use of extra interpolators (to correct for static field inhomogeneities [3]), the number of calls to the type 1 (adjoint) and type 2 (forward) NUFFT is in the hundreds, if not thousands. As there are many NUFFT implementations available, we have developed mri-nufft (in Python, with numpy, cupy, and torch backends):
In terms of parametrization:
finufft and cufinufft are the most stable implementations we could find. There are some performance challengers (well, just gpuNUFFT in some cases [4], but the recent merging of #330 calls for a new study). In terms of feature requests, the discussion in #306 and #308 is a first point: getting preconditioning weights (similar to what is done in astronomy) helps iterative reconstruction, and is essential for deep-learning-based approaches [5][6]. @chaithyagr, @philouc, @matthieutrs
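The preconditioning/density-compensation weights mentioned above can be estimated with a Pipe-Menon-style fixed-point iteration. A small self-contained NumPy sketch of the idea (toy 1D sample locations and a Gaussian stand-in for the gridding kernel; all names and values here are hypothetical, not mri-nufft's actual implementation):

```python
import numpy as np

# Toy 1D nonuniform sample locations (clustered near the origin, loosely
# mimicking center-out MRI trajectories) and a Gaussian stand-in for the
# gridding kernel.
x = np.array([0.00, 0.06, 0.12, 0.35, 0.55, 0.58, 0.85])
sigma = 0.07
K = np.exp(-((x[:, None] - x[None, :]) / sigma) ** 2)  # kernel Gram matrix

# Pipe-Menon-style fixed point: find w > 0 such that (K @ w) ~ 1 everywhere,
# i.e. the kernel-smoothed, weighted sampling density becomes flat.
w = np.ones_like(x)
for _ in range(200):
    w = w / (K @ w)

# Densely sampled regions end up with small weights, isolated samples
# with weights near 1.
print(np.abs(K @ w - 1.0).max())  # residual deviation from flat density
```

In an iterative reconstruction these weights would then be applied to the data before the adjoint NUFFT, which is exactly the W discussed in #306/#308.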
-
I missed this discussion post and therefore accidentally created an issue explaining how finufft is used in MRI! I also see @paquiteau has already answered quite thoroughly. Many of the things he describes also apply here, where I do my research in Compressed Sensing at Umeå University. We mainly work with 4D-flow MRI data (3D + time) rather than 4D functional MRI, but I imagine the library is used in much the same way. We also, due to multiple coils and multiple velocity/phase encodings, have to do potentially hundreds or thousands of NUFFTs every iteration. As I didn't take the time to rewrite my issue post for this discussion, I have simply pasted what I wrote there and added some extra information.

ISSUE POST

I created this issue as a discussion on what other extensions the finufft library could provide that are still within the grasp of what it tries to achieve. Mainly, I hope it can shed some light on the reply @ahbarnett provided in #306. As such, the listed points are written as if answering that comment, so it is probably best to read that comment first, and perhaps also the expertly written inverse NUFFT tutorial.

First, I see how exposing the spreading/interpolation operations is problematic. I agree that the NUFFT is the crucial feature of the library, and if other spreaders/interpolators were found that would make the NUFFT faster, they should be prioritized! I think exposing just the spreader and interpolator in C/C++ would still be useful, but users would have to be careful, or at least aware, that they may change between versions. If you ask me, by no means make this a priority. However, as mentioned in @ahbarnett's reply, the sinc^2 weights of Greengard, Inati, et al. would be very useful! So, if it would be feasible to implement an interface to sinc^2 quadrature weights, that would be great! If exposing the fast sinc^2 in the process seems reasonable, that would also be nice.

As pointed out, in MRI applications one often tries to minimize ||Ax-b||, where A represents a NUFFT; A may be a 2D or a 3D NUFFT depending on the particular problem. Iterative solvers have to repeatedly apply A^H A, the normal operator. Sometimes, to improve convergence, one instead solves ||W(Ax-b)||, where W is a diagonal matrix, so the iterations compute A^H W A instead, trading away the best (maximum-likelihood) estimate for convergence speed. Preconditioning might seem like a better alternative, and indeed sometimes it is, but sometimes not: preconditioning makes some of the proximal algorithms different and more difficult. The W used for convergence speed-up is also often the same as the "density compensation weights" mentioned. Sometimes the density compensation weights are also used to perform a "gridding reconstruction": one finds W such that, instead of solving ||Ax-b|| with an iterative solver, one just computes x = A^H(Wb). This is approximate, but fast!

As also mentioned in the inverse tutorial (and often exploited in MRI), the Toeplitz approach can greatly increase the speed of applying A^H A. But this approach, too, is sometimes not viable. For some problems, multiple different A^H A (with differing coordinates) have to be applied in each iteration, and storing the Toeplitz kernels for all of them takes too much memory, so A^H A has to be computed via NUFFTs instead. Other modifications to the minimization problem, such as accounting for off-resonance effects, also make the Toeplitz approach less feasible. These problems are in fact discussed in the Fessler article referenced in the inversion tutorial.

After all this text, I would like to list some features that would be really useful for MRI users and that hopefully are still within finufft's grasp. I list them in order of importance.
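To make the notation above concrete, here is a tiny self-contained NumPy sketch (1D, explicit matrices, hypothetical sizes; a stand-in for the real NUFFT operators) of the weighted normal equations A^H W A x = A^H W b and the gridding shortcut x = A^H(Wb):

```python
import numpy as np

def nudft_type2(x, N):
    """Explicit type-2 NUDFT matrix: (A f)_j = sum_k f_k exp(i k x_j)."""
    k = np.arange(-(N // 2), N // 2)
    return np.exp(1j * np.outer(x, k))

rng = np.random.default_rng(0)
N, M = 8, 32
f = rng.standard_normal(N) + 1j * rng.standard_normal(N)   # ground-truth modes
x = np.sort(rng.uniform(0.0, 2 * np.pi, M))                # nonuniform samples
A = nudft_type2(x, N)
b = A @ f                                                  # simulated data

# Crude density-compensation weights: local sample spacing.
w = np.gradient(x)

# Weighted normal equations (A^H W A) x = A^H W b, solved directly here;
# an iterative solver (CG) applied to A^H W A converges to the same answer.
AhWA = A.conj().T @ (w[:, None] * A)
AhWb = A.conj().T @ (w * b)
f_iterative = np.linalg.solve(AhWA, AhWb)

# Gridding reconstruction: a single adjoint application, approximate but fast.
f_gridding = A.conj().T @ (w * b)
```

With noiseless, consistent data the weighted solve recovers f exactly regardless of W, while the gridding shortcut is only as good as the weights; that gap is exactly why good density-compensation weights matter.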
As I see it, these features are all within the scope of what finufft strives to achieve, especially the first one; at least if one considers the inverse NUFFT a goal. I also think that changes to the inner workings of how the NUFFT is performed could be made without this interface becoming unstable.

EXTRA INFO
Another feature I didn't list, but that would be appreciated, is the 1.25 upsampling factor mentioned in the cufinufft repo (issue 126).
-
The way we use it is quite different. Our entire 4D-MRI reconstruction is based on XD-GRASP [1]. We implemented everything in C++. Given the structure of the algorithm, we do some pre-processing and then split the reconstruction into independent conjugate-gradient problems along the Z-dimension. This has two advantages:
We further split the problem into multiple respiration phases, which allows us to do multiple 2D transforms. Since these transforms do not saturate the GPU, we queue several of them in parallel. At each reconstruction we do thousands of 2D reconstructions. To further minimize overhead we use only the guru interface with FFTW_MEASURE, and we create the plans once per thread, re-using them until the end of the program. Details:
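The split along Z relies on the acquisition being uniform in that direction: an inverse FFT along z turns one coupled 3D problem into Nz independent in-plane problems. A toy NumPy sketch of that separability (a 1D explicit transform stands in for the 2D NUFFT; all sizes hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
Nz, N, M = 6, 8, 20                      # slices, in-plane modes, samples/slice
u = rng.standard_normal((Nz, N)) + 1j * rng.standard_normal((Nz, N))  # image stack

# Same nonuniform in-plane sampling for every slice (explicit toy NUDFT matrix).
x = np.sort(rng.uniform(0.0, 2 * np.pi, M))
k = np.arange(-(N // 2), N // 2)
A = np.exp(1j * np.outer(x, k))

# Forward model: nonuniform transform in-plane, uniform FFT along z.
d = np.fft.fft(u @ A.T, axis=0)          # d[kz, j], shape (Nz, M)

# Reconstruction side: one inverse FFT along z decouples the slices...
y = np.fft.ifft(d, axis=0)               # y[z] = A @ u[z] for each slice

# ...so each slice can be solved independently (least squares here; in the
# real pipeline each slice is an independent conjugate-gradient problem).
u_rec = np.stack([np.linalg.lstsq(A, y[z], rcond=None)[0] for z in range(Nz)])
```

Since each slice is independent, the per-slice solves can run on separate threads, each re-using its own pre-built plan, which is the scheme described above.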
-
Our use case is CMB lensing, and we work on iterative lensing reconstruction [1]. The Wiener filtering is done with conjugate-gradient (CG) descent. For each CG iteration, we have to perform the lensing operation. To do the lensing operation using non-uniform FFTs, we:
This approach is tremendously faster than previous approaches, as discussed in our lenspyx paper [2], which implements a CPU code. We are currently exploring how much faster the lensing operation is on a GPU using cufinufft. All the enumerated steps above have to be put on the GPU. We use SHTns for the GPU SHT calculation. The spin-0 implementation of the lensing operation is almost done, and the speed-ups look quite good at the moment! The plan is to also include spin-2 (or spin-n, really), and eventually to integrate this into delensalot.
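The heart of the lensing step is evaluating the unlensed field at deflected positions, which in a flat 1D toy reduces to evaluating a Fourier series at nonuniform points, i.e. exactly a type-2 transform. A NumPy sketch under that toy assumption (direct sum instead of a fast NUFFT; the real pipeline uses SHTs plus cufinufft, and all names here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 16
a = rng.standard_normal(N) + 1j * rng.standard_normal(N)   # toy "harmonic" coefficients

def synthesize(a, theta):
    """Evaluate m(theta) = sum_l a_l exp(i l theta) at arbitrary angles (direct sum)."""
    l = np.arange(len(a))
    return np.exp(1j * np.outer(theta, l)) @ a

# Undeflected case on the uniform grid: the direct sum must agree with a
# plain scaled inverse FFT.
theta = 2 * np.pi * np.arange(N) / N
m_uniform = synthesize(a, theta)

# Lensing: evaluate the *same* series at deflected angles theta + d(theta).
# These target points are nonuniform, which is why a type-2 NUFFT is the
# natural fast algorithm for this step.
d = 0.02 * np.sin(theta)                 # toy deflection field
m_lensed = synthesize(a, theta + d)
```

On the sphere the synthesis step is an SHT rather than an FFT, but the nonuniform-evaluation structure of the deflection is the same.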
-
This is a discussion thread for users to post how they use FINUFFT. This will help us optimize and extend the software.
The main questions to answer are:
what type (1,2,3) and dimension?
what number of modes N = (N1,N2,...), i.e. uniform grid shape? (if type 1 or 2)
what number of nonuniform pts M? And are these points clustered or quasi-uniform?
what tolerance? (and do you use single- or double-precision library?)
what ntransf (if you use vectorized interface)?
Do you use CPU, GPU, or both?
For CPU, how many threads do you use? Do you use single-threaded calls in parallel?
Do you use the guru or the simple interfaces?
What language wrappers do you use?
What is your application area? (possibly link to paper or your package)
What feature requests do you have?
If you want to give a link to the part of your code that uses FINUFFT, that might also be useful.
Thanks so much! Alex
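For anyone unsure how to answer the first question, a tiny NumPy illustration of what "type 1" means (a direct O(NM) sum, not the library's fast algorithm; in the special case of uniform points it must agree with a plain FFT under numpy's conventions):

```python
import numpy as np

rng = np.random.default_rng(3)
N, M = 8, 50                     # modes, nonuniform points

def nufft_type1_direct(x, c, N):
    """Direct type-1 sum f_k = sum_j c_j exp(i k x_j), k = -N/2 .. N/2-1."""
    k = np.arange(-(N // 2), N // 2)
    return np.exp(1j * np.outer(k, x)) @ c

# General case: nonuniform points in [0, 2*pi), strengths c_j.
x = rng.uniform(0.0, 2 * np.pi, M)
c = rng.standard_normal(M) + 1j * rng.standard_normal(M)
f = nufft_type1_direct(x, c, N)

# Sanity check: with M = N uniform points the sum reduces to a scaled,
# shifted inverse FFT.
xu = 2 * np.pi * np.arange(N) / N
cu = rng.standard_normal(N) + 1j * rng.standard_normal(N)
fu = nufft_type1_direct(xu, cu, N)       # == fftshift(N * ifft(cu))
```

Type 2 is the transpose direction (uniform modes evaluated at nonuniform points), and type 3 is nonuniform-to-nonuniform; the questions about N, M, tolerance, and ntransf above are exactly the knobs of these transforms.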