nmod_vec_dot_product enhancements (including avx2 for small moduli) #2018
Conversation
The current commit does not bring AVX and does not change the "bound on number of limbs" logic. In my experiments it is mostly slightly faster (in the above-mentioned "new" cases), and sometimes marginally slower, than the existing code. For some reason, with clang on zen4 it brings a good improvement for small moduli; I did not check carefully, but I suppose the loop unrolling in the "small modulus" case helps clang auto-vectorize. Question: for a small modulus (< 32 bits) and an unsigned long (possibly up to full 64 bits), is there something faster than […]?
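For context, a minimal sketch of one standard baseline for this kind of reduction: a precomputed-inverse division via flint's ulong_extras (this may or may not be the helper the question had in mind; `reduce_full_word` is a name invented for this sketch):

```c
#include <flint/ulong_extras.h>

/* Reduce a full 64-bit word a modulo a small n using a precomputed
 * inverse: n_preinvert_limb(n) is computed once per modulus, and
 * n_mod2_preinv then avoids a hardware division per reduction. */
ulong reduce_full_word(ulong a, ulong n, ulong ninv /* = n_preinvert_limb(n) */)
{
    return n_mod2_preinv(a, n, ninv);
}
```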
I definitely consider […]. Apart from […]. Edit: also […].
I don't think so. IIRC, I tried at some point to do a 32-bit version of […].

BTW, we definitely want to leverage the generics system to have dedicated 8-bit, 16-bit and 32-bit Z/nZ implementations (some code is already in place). Of course, that goal shouldn't obstruct optimizing the existing […].

Regarding doubles for modular reduction: I suspect with […]
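For readers, the double-based trick usually looks like the following minimal sketch (the same idea as flint's n_mulmod_precomp; the function here is invented for illustration). It assumes a, b < n < 2**26, so a*b < 2**52 is exactly representable as a double:

```c
#include <flint/flint.h>

/* Classic double-precision modular multiplication (the idea behind
 * flint's n_mulmod_precomp).  With a, b < n < 2^26, the product a*b
 * is exact in a double, so the quotient estimate below is off by at
 * most one; the two corrections fix it. */
ulong mulmod_double(ulong a, ulong b, ulong n, double ninv /* = 1.0 / n */)
{
    ulong quot = (ulong) ((double) a * (double) b * ninv);
    slong rem = (slong) (a * b - quot * n);  /* computed mod 2^64, lands in (-n, 2n) */

    if (rem < 0)
        rem += n;
    else if (rem >= (slong) n)
        rem -= n;
    return rem;
}
```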
Thanks for your input!
@fredrik-johansson thanks for your fix on profile/p-dot.c.
Ok, I will look into this. This is about lines 864--898 of gr/nmod.c, I suppose. This was among my list of things to clarify before finishing this PR. For short dot products, and for code organization in general, it could be nicer to avoid computing […].

One approach: once the modulus is fixed, we could "precompute" ranges of dot product lengths for each number of limbs. For example, say up to length 30 use 2 limbs, and beyond that, 3 limbs; or up to length 115 use 1 limb, and beyond that, 2 limbs. If I'm not mistaken, for a given modulus we cannot have lengths (> 0) that are done with 1 limb and others with 3 limbs. So the precomputation could simply yield one length threshold, separating 1->2 limbs (if the modulus is <= 2**32) or 2->3 limbs (if the modulus is > 2**32, forbidding any length > 0 to be done with a single limb). One would have to store this threshold somewhere. Could it be simply in […]? Is there any downside to adding a third field to this […]?
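As a minimal sketch of this single-threshold idea, restricted to the n <= 2**32 case (the function name and the standalone-helper form are assumptions for illustration, not the PR's actual code):

```c
#include <flint/flint.h>
#include <flint/nmod.h>

/* Largest length for which the unreduced dot product is guaranteed to
 * fit in a single limb, for 2 <= mod.n <= 2^32.  Each term is at most
 * (n-1)^2, which itself fits in one limb here, so len terms fit iff
 * len * (n-1)^2 <= 2^64 - 1.  For n > 2^32, the analogous 2->3 limb
 * threshold would be computed with two-limb arithmetic instead. */
static slong dot_one_limb_threshold(nmod_t mod)
{
    ulong t = mod.n - 1;
    return (slong) (UWORD_MAX / (t * t));
}
```

Storing this single threshold next to the modulus would let each dot product pick its kernel with one comparison instead of recomputing a limb bound.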
Some early, non-final comparisons (new time / old time) via p-dot.c, on an IceLake server. The first line, 0, is for the modulus 2**63. Altogether, there is a nice speed-up where expected, and sometimes some regression.
What about splitting the dot product into different functions? Regardless, it looks good! I'm a little concerned about the slowdown around length 1000, though. Do you know why that is?
Yes, we would still need the general function which calls the specific algorithms, but I agree it could make the code clearer. And it could be better performance-wise for some purposes: e.g. in […]
Unfortunately not. And it is not a general, easily reproducible fact. On my laptop (zen4) it looks like this:
Both this one and the previous timings are on machines with avx512, which can play a role in particular for the 12-bit case where auto-vectorization works well. Here, on another machine without avx512, which does not have the "length 1000" issue either:
Looks like it still has the same problem, just that it is faster than the previous version and does not slow down as much. It is not important IMO; it could be a profiler thing. Just curious.
Update on the state of this:
This feels close to complete to me, so if you have comments/suggestions, please let me know.
I have reworked the parameter selection a bit, so that it does less work than it used to for moduli of 32 bits or less. For the small-length case, I made several attempts, measuring what this could gain, and it did not help as much as I expected. For example, I thought it would speed up many 2 x 2 matrix multiplications done in succession, since this calls […]
Maybe I misunderstood your point: I thought you meant length 1000 specifically, and the slowdown we observed compared to the previous version (mat_mul has some < 1.00 ratios at length 1000 in the first timings above). If your remark was rather that for lengths around 1000 and more the ratio new version / old version becomes less interesting, then yes, this looks sensible, due to memory accesses. The dot product --- in particular in single-limb cases, when the modulus is < 32 bits --- is quite memory intensive: for each product you load two coefficients, and you never re-use them (if doing avx2, the same goes for each group of 4 products computed via a single […]).
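To make the load/arithmetic ratio concrete, here is a hedged sketch of such a single-limb inner loop (not the PR's kernel; it assumes the entries already fit in 32 bits and that the length is small enough for the unreduced sum to fit in 64 bits):

```c
#include <immintrin.h>
#include <flint/flint.h>

/* Single-limb dot product kernel: per 4 products, two 32-byte loads
 * against one multiply and one add, so memory traffic dominates once
 * the vectors leave cache.  Reduction mod n is left to the caller. */
ulong dot_single_limb_avx2(const ulong * a, const ulong * b, slong len)
{
    __m256i acc = _mm256_setzero_si256();
    ulong s[4], r;
    slong i;

    for (i = 0; i + 4 <= len; i += 4)
    {
        __m256i va = _mm256_loadu_si256((const __m256i *) (a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *) (b + i));
        /* multiply the low 32 bits of each 64-bit lane, full 64-bit results */
        acc = _mm256_add_epi64(acc, _mm256_mul_epu32(va, vb));
    }
    _mm256_storeu_si256((__m256i *) s, acc);
    r = s[0] + s[1] + s[2] + s[3];
    for (; i < len; i++)   /* scalar tail */
        r += a[i] * b[i];
    return r;
}
```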
Current comparisons between this branch and the main branch (ratio new / old), on 3 machines. Using gcc in all cases. The last two machines have avx512; the first one does not, and it seems that gcc does not vectorize the current dot product well on that machine (this concerns the single-limb case only: gcc does vectorize it reasonably well on the avx512 machines).
Comparison after the last changes (ratios of new times vs. main branch times). The better handling of short dot products does help for polynomial routines. I also inserted comparisons of dot product times including the call to […]
Perhaps we should keep […]
For the documentation, does it sound reasonable to not go into the details of the implementation (specific algorithms, choices, ...), and only mention the main interfaces such as […]? In […]. Currently this […]
(I suppose the documentation should still contain a word about the types […].)
Updated exp_series comparisons (timing tables for Ryzen 7 7840U and Xeon E7-4820).
This now seems ready for review. |
It looks very good! Would be nice to get more consistent timings for the small cases though. I will give the code a careful review soon.
Thanks!
Yes, I agree... fortunately this is only for the exp_series bench (all other functions are pretty consistent, including for the smallest lengths). This has been consistently inconsistent since the beginning of my work on this PR. You can see below the timings for the current main branch, which are already inconsistent:
By the way, here are the timings (before taking ratios), in case that can help. And the script I used to compute the ratios. |
Sorry for the delay. There are no obvious problems with the code, tests pass, and I reproduce uniform speedups on my machine (except for 63/64-bit moduli where performance is unchanged). So let's get this in!
The goal is to introduce the following, for nmod vecs:
Good avx2 performance when the modulus is "small". This roughly means moduli of at most 32 bits, but for efficiency reasons the limit would be slightly lower, a bit above 31 bits. Based on draft code, the target of this PR is roughly up to 2**30.5 (going beyond that, it is difficult to maintain the same efficiency).
Better performance when the dot product fits in 3 limbs but we still have enough room to accumulate a few 2-limb products before using add_sssaaaaaa (see the sketch after this list). Basically the target is to avoid any performance drop when we are in the region of 50- to 62-bit moduli (depending on the length).
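The sketch mentioned above: a hedged illustration of delayed 3-limb accumulation (not the PR's actual kernel), assuming the modulus is below 2**62 so that each product's high limb is below 2**60 and a chunk of eight two-limb products can be summed without overflow:

```c
#include <flint/flint.h>
#include <flint/longlong.h>
#include <flint/nmod.h>

/* Dot product with delayed 3-limb accumulation, assuming n < 2^62.
 * Each umul_ppmm product has high limb < 2^60, so 8 products fit in a
 * two-limb accumulator (high limb stays < 2^63), and only one 3-limb
 * addition is needed per chunk of 8. */
ulong dot_acc2_sketch(const ulong * a, const ulong * b, slong len, nmod_t mod)
{
    ulong s2 = 0, s1 = 0, s0 = 0, u1, u0, p1, p0, res;
    slong i, j;

    for (i = 0; i + 8 <= len; i += 8)
    {
        u1 = 0; u0 = 0;
        for (j = 0; j < 8; j++)
        {
            umul_ppmm(p1, p0, a[i + j], b[i + j]);
            add_ssaaaa(u1, u0, u1, u0, p1, p0);  /* cannot overflow two limbs */
        }
        add_sssaaaaaa(s2, s1, s0, s2, s1, s0, UWORD(0), u1, u0);
    }
    for (; i < len; i++)  /* tail, one 3-limb add per product */
    {
        umul_ppmm(p1, p0, a[i], b[i]);
        add_sssaaaaaa(s2, s1, s0, s2, s1, s0, UWORD(0), p1, p0);
    }
    /* reduce the 3-limb sum; s2 stays far below n for realistic lengths */
    NMOD_RED3(res, s2, s1, s0, mod);
    return res;
}
```

Compared with issuing add_sssaaaaaa once per product, this replaces seven of every eight 3-limb additions with cheaper 2-limb ones.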
One foreseeable issue is that we do not want more "complex" code (e.g. more branching) to penalize small instances, since this is code which is actually used a lot for small instances. For example an indirect target is to accelerate matrix multiplication or matrix-vector products (when the modulus is close to 32 bits or close to 64 bits): there one computes many dot products of the same type (same length, same modulus), and the new code should avoid any performance decrease for small-ish matrices (e.g. 50x50).
Questions, written here for help and suggestions:
Regarding benchmarks: measuring the performance of dot products for small lengths is not that trivial, and in fact maybe not that meaningful: what matters most is the performance within code that uses such products. Apart from matrix multiplication, could someone suggest a few additional flint routines that could be targeted, to gain confidence that no performance regression is introduced?
The current flint NMOD_VEC_DOT macro already has 4 or 5 cases (based on the number of limbs of the product, but also other things). The new code has a couple more. The plan would be to slightly change the logic, which currently computes the number of limbs, separates cases on it, and then separates some additional cases within each. One could instead directly determine a "case number" ("one limb", "small modulus", "two limbs", "three limbs but room for accumulation", etc.). In my draft experiments, this helps performance for small dot products. This means some code changes here and there. Something that is not clear to me: should a change in _nmod_vec_dot_bound_limbs be done via some deprecation? (I'm wondering whether this helper function is considered as, say, "for internal use, subject to change".)

In addition, for a "small" modulus, the new code would require(*) computing some integer like 2**56 mod n, which depends only on the modulus n. Recomputing this for every new dot product seems wasteful for repeated dot products of the same length/modulus, so this computation may as well be incorporated into the new "case distinguishing" approach: if we see we are in the "small modulus" case, compute this number and return it somehow.
(*) the non-avx version should not really require it, but the avx version does.
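A hedged sketch combining the "case number" idea with the per-modulus precomputation of 2**56 mod n; the enum, struct, function names, and threshold values here are invented for illustration, not the PR's API:

```c
#include <flint/flint.h>
#include <flint/nmod.h>

/* Hypothetical dispatch data, computed once per (length, modulus) pair
 * and reused across many dot products of the same shape. */
typedef enum
{
    DOT_ONE_LIMB,        /* unreduced sum fits in one limb */
    DOT_SMALL_MOD_AVX,   /* small modulus: vectorized kernel */
    DOT_TWO_LIMBS,
    DOT_THREE_LIMBS_ACC, /* 3 limbs, with room for 2-limb accumulation */
    DOT_THREE_LIMBS
} dot_method_t;

typedef struct
{
    dot_method_t method;
    ulong red_pow;       /* 2^56 mod n, used only by the avx kernel */
} dot_params_t;

dot_params_t dot_params(slong len, nmod_t mod)
{
    dot_params_t par = { DOT_THREE_LIMBS, UWORD(0) };

    /* illustrative threshold only -- the real selection logic would
     * depend on measured crossover points and on len */
    if (mod.n <= (UWORD(1) << 30))
    {
        par.method = DOT_SMALL_MOD_AVX;
        par.red_pow = (UWORD(1) << 56) % mod.n;  /* depends on n only */
    }
    /* ... remaining cases selected from len and mod.n ... */

    return par;
}
```

Code performing many dot products of the same shape (e.g. matrix multiplication) would call dot_params once and then dispatch on par.method directly, so 2**56 mod n is computed only once.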