Phase winding? #4
I haven't sat down to try to understand the trick yet, but I did notice that the Gowanlock+ paper appears to address it (last paragraph of Section 4.5, page 8):
I think I understand the trick now, which is an application of the angle addition formula. If we want to compute sin(f + df), we have:
sin(f + df) = sin(f) cos(df) + cos(f) sin(df).
(sin|cos)(df) are constant, and (sin|cos)(f) can just come from the previous iteration. df doesn't have to be small; it's just the frequency grid spacing.
So it's true that this algorithm doesn't naively parallelize. But I think there's an opportunity for a hybrid approach, where each GPU thread does O(10) frequency bins instead of 1. Each thread starts with a real sin/cos call, but then the subsequent 9 iterations use the angle-addition formula. We don't want each GPU thread doing too many frequency bins, because then that exposes less overall parallelism. But it can be a free parameter based on the total number of frequency bins. Could be a big speedup!
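To make the hybrid concrete, here is a minimal C++ sketch of one thread's worth of work (the function name, argument list, and chunking interface are invented for illustration, not taken from any existing code; on the GPU, each thread would run one such chunk for one NU point t):

```cpp
#include <cmath>

// One chunk of the hybrid scheme (illustrative sketch only).
// Evaluates cos(k*df*t) and sin(k*df*t) for k = k0 .. k0+n-1 at one NU point t,
// using a single true sin/cos call and then the angle-addition recurrence.
void wind_chunk(double t, double df, int k0, int n, double* cs, double* sn)
{
    const double dphi = df * t;        // per-bin phase step for this point
    const double cd = std::cos(dphi);  // the stored "phase" pair:
    const double sd = std::sin(dphi);  //   (cos(df*t), sin(df*t))
    double c = std::cos(k0 * dphi);    // the one real sin/cos per chunk...
    double s = std::sin(k0 * dphi);
    for (int i = 0; i < n; ++i) {
        cs[i] = c;
        sn[i] = s;
        const double cnew = c * cd - s * sd;  // ...then wind: angle addition,
        const double snew = s * cd + c * sd;  // 4 mults + 2 adds per bin
        c = cnew;
        s = snew;
    }
}
```

The chunk length n is the free parameter: larger n amortizes the true sin/cos over more bins, but exposes less parallelism and lets rounding drift accumulate further.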
Hi Lehman,
Re addition formula - bingo. (2x2 matrix mult is another way to see it; complex arith is not needed.)
Only one "phase" (a cos(df), sin(df) pair) needs to be stored per NU pt.
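Spelled out in the thread's plain-text notation, that 2x2-matrix view of the same step is

  [cos(f+df)]   [cos(df)  -sin(df)] [cos(f)]
  [sin(f+df)] = [sin(df)   cos(df)] [sin(f)]

i.e. each winding step multiplies the running (cos, sin) vector by a fixed rotation matrix built from the stored per-point pair: 4 mults + 2 adds.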
Nice to see you came up with this hybrid, which is exactly what I was thinking when I read the "does not parallelize" claim in your previous msg :) It will destroy the cost of sincos and be as parallel as you want.
I suggest you make each thread do more than 10... more like 1e2-1e3? Or
(Nmodes * NUpts) / Nthreads, which would be the ideal, right?
But since you *can* parallelize over NU pts (of which they have ~4k), their "does not parallelize" claim makes little sense to me.
An OpenMP CPU test first would be great. How soon could you get a GPU version going?
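A minimal sketch of such a CPU test (sizes, NU times, and df below are invented for illustration; each chunk restarts from a true sin/cos exactly as a GPU thread would, and the check measures the rounding drift accumulated across one chunk):

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

int main()
{
    const int    Npts   = 4000;     // ~4k NU points, as in the paper
    const int    Nmodes = 1 << 18;  // number of frequency bins (invented)
    const int    chunk  = 256;      // bins wound per "thread": the free parameter
    const double df     = 1e-3;     // frequency grid spacing (invented)

    std::vector<double> t(Npts);
    for (int j = 0; j < Npts; ++j) t[j] = 0.5 + 1e-4 * j;  // fake NU times

    double maxerr = 0.0;
    // Parallelize over NU points; chunks restart independently, GPU-style.
    #pragma omp parallel for reduction(max : maxerr)
    for (int j = 0; j < Npts; ++j) {
        const double dphi = df * t[j];  // per-point phase step
        const double cd = std::cos(dphi), sd = std::sin(dphi);
        for (int k0 = 0; k0 < Nmodes; k0 += chunk) {
            double c = std::cos(k0 * dphi), s = std::sin(k0 * dphi);
            for (int i = 0; i < chunk; ++i) {  // wind across the chunk
                const double cnew = c * cd - s * sd;
                const double snew = s * cd + c * sd;
                c = cnew;
                s = snew;
            }
            // drift after `chunk` windings vs. a direct libm evaluation
            maxerr = std::fmax(maxerr,
                               std::fabs(c - std::cos((k0 + chunk) * dphi)));
        }
    }
    std::printf("max winding drift per %d-bin chunk: %.3g\n", chunk, maxerr);
    return 0;
}
```

Compile with e.g. g++ -O3 -fopenmp; timing the wound loop against a plain sincos loop gives the CPU-side speedup, and the drift number indicates how long a chunk can get before accuracy becomes the limit.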
However, before we get too excited about a GPU beating theirs, I first suggest figuring out the rate at which a simple FMA can beat a sincos... it might be that making that swap does not actually speed up the GPU that much (i.e. can a GPU do 1 Teraflop of plain FMAs, which is the pair rate for sincos they seem to claim?)
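A rough sanity check on that rate question (round numbers below are assumptions, not measurements):

  winding cost  ~ 4 mults + 2 adds ~ 6 flops per (cos, sin) pair
  GPU FP32 peak ~ 1e13 flop/s  =>  ~1.5e12 wound pairs/s

i.e. only a modest multiple of the ~1e12 sincos pairs/s apparently claimed, so the swap may buy less than it first appears; worth measuring.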
Cheers, Alex
Over in #2, @ahbarnett says:
A few comments on this: