nmod_vec_dot_product enhancements (including avx2 for small moduli) #2018
Conversation
The current commit does not bring AVX and does not change the "bound on number of limbs" logic. In my experiments it is mostly slightly faster (in the above-mentioned "new" cases), and sometimes marginally slower, than the existing code. For some reason, with clang on zen4 it brings a good improvement for small moduli; I did not check carefully, but I suppose the loop unrolling in the "small modulus" case helps clang auto-vectorize. Question: for a small modulus (< 32 bits) and an unsigned long (possibly up to full 64 bits), is there something faster than […]?
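For context, a minimal sketch of one standard baseline for this kind of reduction: a precomputed-inverse division via flint's ulong_extras (this may or may not be the helper the question had in mind; `reduce_full_word` is a name invented for this sketch):

```c
#include <flint/ulong_extras.h>

/* Reduce a full 64-bit word a modulo a small n using a precomputed
 * inverse: n_preinvert_limb(n) is computed once per modulus, and
 * n_mod2_preinv then avoids a hardware division per reduction. */
ulong reduce_full_word(ulong a, ulong n, ulong ninv /* = n_preinvert_limb(n) */)
{
    return n_mod2_preinv(a, n, ninv);
}
```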
I definitely consider […]. Apart from […]. Edit: also […].
I don't think so. IIRC, I tried at some point to do a 32-bit version of […].

BTW, we definitely want to leverage the generics system to have dedicated 8-bit, 16-bit and 32-bit Z/nZ implementations (some code is already in place). Of course, that goal shouldn't obstruct optimizing the existing […].

Regarding doubles for modular reduction: I suspect with […]
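For readers, the double-based trick usually looks like the following minimal sketch (the same idea as flint's n_mulmod_precomp; the function here is invented for illustration). It assumes a, b < n < 2**26, so a*b < 2**52 is exactly representable as a double:

```c
#include <flint/flint.h>

/* Classic double-precision modular multiplication (the idea behind
 * flint's n_mulmod_precomp).  With a, b < n < 2^26, the product a*b
 * is exact in a double, so the quotient estimate below is off by at
 * most one; the two corrections fix it. */
ulong mulmod_double(ulong a, ulong b, ulong n, double ninv /* = 1.0 / n */)
{
    ulong quot = (ulong) ((double) a * (double) b * ninv);
    slong rem = (slong) (a * b - quot * n);  /* computed mod 2^64, lands in (-n, 2n) */

    if (rem < 0)
        rem += n;
    else if (rem >= (slong) n)
        rem -= n;
    return rem;
}
```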
Thanks for your input!
@fredrik-johansson thanks for your fix on profile/p-dot.c.
Ok, I will look into this. This is about lines 864--898 of gr/nmod.c, I suppose. This was among my list of things to clarify before finishing this PR. For short dot products, and for code organization in general, it could be nicer to avoid computing […].

One approach: once the modulus is fixed, we could "precompute" ranges of dot product lengths for each number of limbs. For example, say up to length 30 use 2 limbs, and beyond that, 3 limbs; or up to length 115 use 1 limb, and beyond that, 2 limbs. If I'm not mistaken, for a given modulus we cannot have lengths (> 0) that are done with 1 limb and others with 3 limbs. So the precomputation could simply yield one length threshold, separating 1->2 limbs (if the modulus is <= 2**32) or 2->3 limbs (if the modulus is > 2**32, forbidding any length > 0 to be done with a single limb). One would have to store this threshold somewhere. Could it be simply in […]? Is there any downside to adding a third field to this […]?
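As a minimal sketch of this single-threshold idea, restricted to the n <= 2**32 case (the function name and the standalone-helper form are assumptions for illustration, not the PR's actual code):

```c
#include <flint/flint.h>
#include <flint/nmod.h>

/* Largest length for which the unreduced dot product is guaranteed to
 * fit in a single limb, for 2 <= mod.n <= 2^32.  Each term is at most
 * (n-1)^2, which itself fits in one limb here, so len terms fit iff
 * len * (n-1)^2 <= 2^64 - 1.  For n > 2^32, the analogous 2->3 limb
 * threshold would be computed with two-limb arithmetic instead. */
static slong dot_one_limb_threshold(nmod_t mod)
{
    ulong t = mod.n - 1;
    return (slong) (UWORD_MAX / (t * t));
}
```

Storing this single threshold next to the modulus would let each dot product pick its kernel with one comparison instead of recomputing a limb bound.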
Some early, non-final comparisons (new time / old time) via p-dot.c, on an IceLake server. The first line, 0, is for the modulus 2**63. Altogether, there is a nice speed-up where expected, and sometimes some regression.
What about splitting the dot product into different functions? Regardless, it looks good! I'm a little concerned about the slowdown around length 1000, though. Do you know why that is?
Yes, we would still need the general function which calls the specific algorithms, but I agree it could make the code clearer. And it could be better performance-wise for some purposes: e.g. in […]
Unfortunately not. And it is not a general, easily reproducible fact. On my laptop (zen4) it looks like this:
Both this one and the previous timings are on machines with avx512, which can play a role in particular for the 12-bit case where auto-vectorization works well. Here, on another machine without avx512, which does not have the "length 1000" issue either:
Looks like it still has the same problem, just that it is faster than the previous version and does not slow down as much. It is not important IMO; it could be a profiler thing. Just curious.
Update on the state of this:
This feels close to complete to me, so if you have comments/suggestions, please let me know.
I have reworked the parameter selection a bit, so that it does less work than it used to for moduli of 32 bits or less. For the small-length case, I made several attempts, measuring what this could gain, and it did not help as much as I expected. For example, I thought it would speed up many 2 x 2 matrix multiplications done in succession, since this calls […]
Maybe I misunderstood your point: I thought you meant length 1000 specifically, and the slowdown we observed compared to the previous version (mat_mul has some < 1.00 ratios at length 1000 in the first timings above). If your remark was rather that for lengths around 1000 and more the ratio new version / old version becomes less interesting, then yes, this looks sensible, due to memory accesses. The dot product --- in particular in single-limb cases, when the modulus is < 32 bits --- is quite memory intensive: for each product you load two coefficients, and you never re-use them (if doing avx2, the same goes for each group of 4 products computed via a single […]).
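To make the load/arithmetic ratio concrete, here is a hedged sketch of such a single-limb inner loop (not the PR's kernel; it assumes the entries already fit in 32 bits and that the length is small enough for the unreduced sum to fit in 64 bits):

```c
#include <immintrin.h>
#include <flint/flint.h>

/* Single-limb dot product kernel: per 4 products, two 32-byte loads
 * against one multiply and one add, so memory traffic dominates once
 * the vectors leave cache.  Reduction mod n is left to the caller. */
ulong dot_single_limb_avx2(const ulong * a, const ulong * b, slong len)
{
    __m256i acc = _mm256_setzero_si256();
    ulong s[4], r;
    slong i;

    for (i = 0; i + 4 <= len; i += 4)
    {
        __m256i va = _mm256_loadu_si256((const __m256i *) (a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *) (b + i));
        /* multiply the low 32 bits of each 64-bit lane, full 64-bit results */
        acc = _mm256_add_epi64(acc, _mm256_mul_epu32(va, vb));
    }
    _mm256_storeu_si256((__m256i *) s, acc);
    r = s[0] + s[1] + s[2] + s[3];
    for (; i < len; i++)   /* scalar tail */
        r += a[i] * b[i];
    return r;
}
```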
Current comparisons between this branch and the main branch (ratio new / old), on 3 machines. Using gcc in all cases. The last two machines have avx512; the first one does not, and it seems that gcc does not vectorize the current dot product well on that machine (this concerns the single-limb case only: gcc does vectorize it reasonably well on the avx512 machines).
Comparison after the last changes (ratios of new times vs. main branch times). The better handling of short dot products does help for polynomial routines. I also inserted comparisons of dot product times including the call to […]
Perhaps we should keep […]
For the documentation, does it sound reasonable to not go into the details of the implementation (specific algorithms, choices, ...), and only mention the main interfaces such as […]? In […]. Currently this […]
(I suppose the documentation should still contain a word about the types […].)
Updated exp_series comparisons (timing tables for Ryzen 7 7840U and Xeon E7-4820).
This now seems ready for review. |
It looks very good! Would be nice to get more consistent timings for the small cases though. I will give the code a careful review soon.
Thanks!
Yes, I agree... fortunately this is only for the exp_series bench (all other functions are pretty consistent, including for the smallest lengths). This has been consistently inconsistent since the beginning of my work on this PR. You can see below the timings for the current main branch, which are already inconsistent:
By the way, here are the timings (before taking ratios), in case that can help. And the script I used to compute the ratios. |
Sorry for the delay. There are no obvious problems with the code, tests pass, and I reproduce uniform speedups on my machine (except for 63/64-bit moduli where performance is unchanged). So let's get this in!
The goal is to introduce the following, for nmod vecs:
Good avx2 performance when the modulus is "small". This roughly means moduli of at most 32 bits, but for efficiency reasons the limit would be slightly lower, a bit above 31 bits. Based on draft code, the target of this PR is roughly up to 2**30.5 (going beyond that, it is difficult to maintain the same efficiency).
Better performance when the dot product fits in 3 limbs but we still have enough room to accumulate a few 2-limb products before using add_sssaaaaaa (see the sketch after this list). Basically the target is to avoid any performance drop when we are in the region of 50- to 62-bit moduli (depending on the length).
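The sketch mentioned above: a hedged illustration of delayed 3-limb accumulation (not the PR's actual kernel), assuming the modulus is below 2**62 so that each product's high limb is below 2**60 and a chunk of eight two-limb products can be summed without overflow:

```c
#include <flint/flint.h>
#include <flint/longlong.h>
#include <flint/nmod.h>

/* Dot product with delayed 3-limb accumulation, assuming n < 2^62.
 * Each umul_ppmm product has high limb < 2^60, so 8 products fit in a
 * two-limb accumulator (high limb stays < 2^63), and only one 3-limb
 * addition is needed per chunk of 8. */
ulong dot_acc2_sketch(const ulong * a, const ulong * b, slong len, nmod_t mod)
{
    ulong s2 = 0, s1 = 0, s0 = 0, u1, u0, p1, p0, res;
    slong i, j;

    for (i = 0; i + 8 <= len; i += 8)
    {
        u1 = 0; u0 = 0;
        for (j = 0; j < 8; j++)
        {
            umul_ppmm(p1, p0, a[i + j], b[i + j]);
            add_ssaaaa(u1, u0, u1, u0, p1, p0);  /* cannot overflow two limbs */
        }
        add_sssaaaaaa(s2, s1, s0, s2, s1, s0, UWORD(0), u1, u0);
    }
    for (; i < len; i++)  /* tail, one 3-limb add per product */
    {
        umul_ppmm(p1, p0, a[i], b[i]);
        add_sssaaaaaa(s2, s1, s0, s2, s1, s0, UWORD(0), p1, p0);
    }
    /* reduce the 3-limb sum; s2 stays far below n for realistic lengths */
    NMOD_RED3(res, s2, s1, s0, mod);
    return res;
}
```

Compared with issuing add_sssaaaaaa once per product, this replaces seven of every eight 3-limb additions with cheaper 2-limb ones.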
One foreseeable issue is that we do not want more "complex" code (e.g. more branching) to penalize small instances, since this is code which is actually used a lot for small instances. For example an indirect target is to accelerate matrix multiplication or matrix-vector products (when the modulus is close to 32 bits or close to 64 bits): there one computes many dot products of the same type (same length, same modulus), and the new code should avoid any performance decrease for small-ish matrices (e.g. 50x50).
Questions, written here for help and suggestions:
Regarding benchmarks: measuring the performance of dot products for small lengths is not that trivial, and in fact maybe not that meaningful: what matters most is the performance within code that uses such products. Apart from matrix multiplication, could someone suggest a few additional flint routines that could be targeted, to gain confidence that no performance regression is introduced?
The current flint NMOD_VEC_DOT macro already has 4 or 5 cases (based on the number of limbs of the product, but also other things). The new code has a couple more. The plan would be to slightly change the logic, which currently computes the number of limbs, separates cases on it, and then separates some additional cases within each. One could instead directly determine a "case number" ("one limb", "small modulus", "two limbs", "three limbs but room for accumulation", etc.). In my draft experiments, this helps performance for small dot products. This means some code changes here and there. Something that is not clear to me: should a change in _nmod_vec_dot_bound_limbs be done via some deprecation? (I'm wondering whether this helper function is considered as, say, "for internal use, subject to change".)

In addition, for a "small" modulus, the new code would require(*) computing some integer like 2**56 mod n, which depends only on the modulus n. Recomputing this for every new dot product seems wasteful for repeated dot products of the same length/modulus, so this computation may as well be incorporated into the new "case distinguishing" approach: if we see we are in the "small modulus" case, compute this number and return it somehow.
(*) the non-avx version should not really require it, but the avx version does.
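A hedged sketch combining the "case number" idea with the per-modulus precomputation of 2**56 mod n; the enum, struct, function names, and threshold values here are invented for illustration, not the PR's API:

```c
#include <flint/flint.h>
#include <flint/nmod.h>

/* Hypothetical dispatch data, computed once per (length, modulus) pair
 * and reused across many dot products of the same shape. */
typedef enum
{
    DOT_ONE_LIMB,        /* unreduced sum fits in one limb */
    DOT_SMALL_MOD_AVX,   /* small modulus: vectorized kernel */
    DOT_TWO_LIMBS,
    DOT_THREE_LIMBS_ACC, /* 3 limbs, with room for 2-limb accumulation */
    DOT_THREE_LIMBS
} dot_method_t;

typedef struct
{
    dot_method_t method;
    ulong red_pow;       /* 2^56 mod n, used only by the avx kernel */
} dot_params_t;

dot_params_t dot_params(slong len, nmod_t mod)
{
    dot_params_t par = { DOT_THREE_LIMBS, UWORD(0) };

    /* illustrative threshold only -- the real selection logic would
     * depend on measured crossover points and on len */
    if (mod.n <= (UWORD(1) << 30))
    {
        par.method = DOT_SMALL_MOD_AVX;
        par.red_pow = (UWORD(1) << 56) % mod.n;  /* depends on n only */
    }
    /* ... remaining cases selected from len and mod.n ... */

    return par;
}
```

Code performing many dot products of the same shape (e.g. matrix multiplication) would call dot_params once and then dispatch on par.method directly, so 2**56 mod n is computed only once.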