Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Phase winding? #4

Open
dfm opened this issue Feb 7, 2022 · 3 comments
Open

Phase winding? #4

dfm opened this issue Feb 7, 2022 · 3 comments

Comments

@dfm
Copy link
Owner

dfm commented Feb 7, 2022

Over in #2, @ahbarnett says:

But this reminds me of a hugely important fact:
Recall that since their freq modes are equispaced then you do not need
to do sincos for each!
The addition formula should be used, turning it all into adds&mults.

Eg check my code for summing a Fourier series here (where it's called
"phase winding" and done w/ complex #s):
https://github.com/flatironinstitute/finufft/blob/69aff52bd5eebb28ea0521bcc303cab1e201715b/src/finufft.cpp#L203

I imagine the 1tera-pair/sec they report on GPU is using this trick.

A few comments on this:

  1. I'm not sure that I totally understand the trick. Perhaps one of you (@lgarrison, @ahbarnett) could explain it to me?
  2. Gowanlock+ don't seem to use any tricks like this. They say that their implementation is directly ported from scipy, which is implemented here: https://github.com/scipy/scipy/blob/b9881c85530f42c19c97ed38c052af8100caf250/scipy/signal/_spectral.pyx#L20-L97, it looks pretty much identical to our baseline, unless I'm missing something.
@lgarrison
Copy link
Collaborator

I haven't sat down to try to understand the trick yet, but I did notice that the Gowanlock+ paper appears to address it (last paragraph of Section 4.5, page 8):

The sequential LSP algorithm iterates over frequen-
cies, which means that the calculation of sin and cos can
be simply computed using the ∆f between frequency n
and n + 1, thus eliminating costly sin and cos calcula-
tions (i.e., the sincos at iteration n + 1 in the loop in
Listing 1 could be expressed as the difference between the
sincos at iteration n). Unfortunately, this optimization
eliminates the possibility of parallelization, as it introduces
inter-iteration dependencies. Therefore, we do not employ
this optimization in lsp-c, as we would not be able to
execute the algorithm in parallel.

@lgarrison
Copy link
Collaborator

I think I understand the trick now, which is an application of the angle addition formula. If we want to compute sin(f + df), we have:

sin(f + df) = sin(f) cos(df) + cos(f) sin(df).

(sin|cos)(df) are constant, and (sin|cos)(f) can just come from the previous iteration. df doesn't have to be small; it's just the frequency grid spacing.

So it's true that this algorithm doesn't naively parallelize. But I think there's an opportunity for a hybrid approach, where each GPU thread does O(10) frequency bins, instead of 1. Each thread starts with a real sin/cos call, but then the subsequent 9 iterations use the angle-addition formula.

We don't want each GPU thread doing too many frequency bins, because then that exposes less overall parallelism. But it can be a free parameter based on the total number of frequency bins. Could be a big speedup!

@ahbarnett
Copy link

ahbarnett commented Feb 8, 2022 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants