Phase winding? #4
I haven't sat down to try to understand the trick yet, but I did notice that the Gowanlock+ paper appears to address it (last paragraph of Section 4.5, page 8):
I think I understand the trick now, which is an application of the angle addition formula. If we want to compute sin(f + df), we have:
sin(f + df) = sin(f) cos(df) + cos(f) sin(df).
(sin|cos)(df) are constant, and (sin|cos)(f) can just come from the previous iteration. df doesn't have to be small; it's just the frequency grid spacing.
So it's true that this algorithm doesn't naively parallelize. But I think there's an opportunity for a hybrid approach, where each GPU thread does O(10) frequency bins instead of 1. Each thread starts with a real sin/cos call, but then the subsequent 9 iterations use the angle-addition formula. We don't want each GPU thread doing too many frequency bins, because then that exposes less overall parallelism. But it can be a free parameter based on the total number of frequency bins. Could be a big speedup!
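To make the hybrid concrete, here is a minimal C++ sketch of one thread's worth of work (the function name, argument list, and chunking interface are invented for illustration, not taken from any existing code; on the GPU, each thread would run one such chunk for one NU point t):

```cpp
#include <cmath>

// One chunk of the hybrid scheme (illustrative sketch only).
// Evaluates cos(k*df*t) and sin(k*df*t) for k = k0 .. k0+n-1 at one NU point t,
// using a single true sin/cos call and then the angle-addition recurrence.
void wind_chunk(double t, double df, int k0, int n, double* cs, double* sn)
{
    const double dphi = df * t;        // per-bin phase step for this point
    const double cd = std::cos(dphi);  // the stored "phase" pair:
    const double sd = std::sin(dphi);  //   (cos(df*t), sin(df*t))
    double c = std::cos(k0 * dphi);    // the one real sin/cos per chunk...
    double s = std::sin(k0 * dphi);
    for (int i = 0; i < n; ++i) {
        cs[i] = c;
        sn[i] = s;
        const double cnew = c * cd - s * sd;  // ...then wind: angle addition,
        const double snew = s * cd + c * sd;  // 4 mults + 2 adds per bin
        c = cnew;
        s = snew;
    }
}
```

The chunk length n is the free parameter: larger n amortizes the true sin/cos over more bins, but exposes less parallelism and lets rounding drift accumulate further.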
Hi Lehman,
Re addition formula - bingo. (2x2 matrix mult is another way to see it; complex arith is not needed.)
Only one "phase" (a cos(df), sin(df) pair) needs to be stored per NU pt.
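Spelled out in the thread's plain-text notation, that 2x2-matrix view of the same step is

  [cos(f+df)]   [cos(df)  -sin(df)] [cos(f)]
  [sin(f+df)] = [sin(df)   cos(df)] [sin(f)]

i.e. each winding step multiplies the running (cos, sin) vector by a fixed rotation matrix built from the stored per-point pair: 4 mults + 2 adds.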
Nice to see you came up with this hybrid, which is exactly what I was thinking when I read the "does not parallelize" claim in your previous msg :) It will destroy the cost of sincos and be as parallel as you want.
I suggest you make each thread do more than 10... more like 1e2-1e3? Or
(Nmodes * NUpts) / Nthreads, which would be the ideal, right?
But since you *can* parallelize over NU pts (of which they have ~4k), their "does not parallelize" claim makes little sense to me.
An OpenMP CPU test first would be great. How soon could you get a GPU version going?
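A minimal sketch of such a CPU test (sizes, NU times, and df below are invented for illustration; each chunk restarts from a true sin/cos exactly as a GPU thread would, and the check measures the rounding drift accumulated across one chunk):

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

int main()
{
    const int    Npts   = 4000;     // ~4k NU points, as in the paper
    const int    Nmodes = 1 << 18;  // number of frequency bins (invented)
    const int    chunk  = 256;      // bins wound per "thread": the free parameter
    const double df     = 1e-3;     // frequency grid spacing (invented)

    std::vector<double> t(Npts);
    for (int j = 0; j < Npts; ++j) t[j] = 0.5 + 1e-4 * j;  // fake NU times

    double maxerr = 0.0;
    // Parallelize over NU points; chunks restart independently, GPU-style.
    #pragma omp parallel for reduction(max : maxerr)
    for (int j = 0; j < Npts; ++j) {
        const double dphi = df * t[j];  // per-point phase step
        const double cd = std::cos(dphi), sd = std::sin(dphi);
        for (int k0 = 0; k0 < Nmodes; k0 += chunk) {
            double c = std::cos(k0 * dphi), s = std::sin(k0 * dphi);
            for (int i = 0; i < chunk; ++i) {  // wind across the chunk
                const double cnew = c * cd - s * sd;
                const double snew = s * cd + c * sd;
                c = cnew;
                s = snew;
            }
            // drift after `chunk` windings vs. a direct libm evaluation
            maxerr = std::fmax(maxerr,
                               std::fabs(c - std::cos((k0 + chunk) * dphi)));
        }
    }
    std::printf("max winding drift per %d-bin chunk: %.3g\n", chunk, maxerr);
    return 0;
}
```

Compile with e.g. g++ -O3 -fopenmp; timing the wound loop against a plain sincos loop gives the CPU-side speedup, and the drift number indicates how long a chunk can get before accuracy becomes the limit.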
However, before we get too excited about a GPU beating theirs, I first suggest figuring out the rate at which a simple FMA can beat a sincos... it might be that making that swap does not actually speed up the GPU that much (i.e. can a GPU do 1 Teraflop of plain FMAs, which is the pair rate for sincos they seem to claim?)
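A rough sanity check on that rate question (round numbers below are assumptions, not measurements):

  winding cost  ~ 4 mults + 2 adds ~ 6 flops per (cos, sin) pair
  GPU FP32 peak ~ 1e13 flop/s  =>  ~1.5e12 wound pairs/s

i.e. only a modest multiple of the ~1e12 sincos pairs/s apparently claimed, so the swap may buy less than it first appears; worth measuring.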
Cheers, Alex
Over in #2, @ahbarnett says:
A few comments on this: