Optimize output modular reductions in fft_small nmod_poly_mul by fredrik-johansson · Pull Request #2635 · flintlib/flint

fredrik-johansson · 2026-04-12T22:18:35Z

Optimize the final output reductions mod $n$ in the fft_small-based nmod_poly multiplication when there are 2 or 3 CRT primes (i.e. when coefficients of the product over $\mathbb{Z}$ have between 50 and 150 bits). Follow-up to #2634, which did the single-prime case. The 4-prime case could be done too, but it only affects multiplications so large that the speedup would be only about 2%, so I'm not going to bother with it here.
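To make the "2 or 3 CRT primes" condition concrete, here is a rough sketch of how the coefficient size of the integer product, and hence the number of primes, could be estimated. It assumes each CRT prime contributes about 50 bits, which is only inferred from the 50 to 150 bit range quoted above; the helper names are hypothetical and the actual thresholds used inside fft_small may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Number of bits needed to represent x (0 for x == 0). */
static int bit_length(uint64_t x)
{
    int b = 0;
    while (x != 0)
    {
        b++;
        x >>= 1;
    }
    return b;
}

/* Upper bound on the bit size of a coefficient of the product over Z of
   two polynomials with coefficients in [0, n), where the shorter operand
   has len terms: each coefficient is a sum of at most len products, each
   at most (n-1)^2. */
static int product_coeff_bits(uint64_t n, uint64_t len)
{
    assert(n >= 2 && len >= 1);
    return 2 * bit_length(n - 1) + bit_length(len - 1);
}

/* ASSUMPTION: roughly 50 bits per CRT prime (not taken from the FLINT source). */
#define ASSUMED_PRIME_BITS 50

/* Estimated number of CRT primes needed to recover the integer product. */
static int estimated_crt_primes(uint64_t n, uint64_t len)
{
    int bits = product_coeff_bits(n, len);
    return (bits + ASSUMED_PRIME_BITS - 1) / ASSUMED_PRIME_BITS;
}
```

For example, under these assumptions a 16-bit modulus at length 1024 needs a single prime, while a full 64-bit modulus at the same length needs three.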

In essence, we replace use of the slow macros NMOD2_RED2 and NMOD_RED3 with the faster algorithms discussed in #2597. I'm doing this ad hoc in fft_small for now to test things before rolling this out to other nmod modules. The main complication is that one needs different precomputations with some case distinctions.
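The algorithms from #2597 are not reproduced in this description, so the following is not that code. It is only a generic sketch of the underlying idea: a three-word reduction done by chained divisions can be replaced by a fold that uses constants precomputed once per modulus. The type `red3_pre_t` and all function names are hypothetical, the sketch is restricted to $n < 2^{62}$ to avoid overflow (one example of the kind of case distinction that arises), and it relies on the GCC/Clang `__uint128_t` extension.

```c
#include <assert.h>
#include <stdint.h>

typedef __uint128_t u128;

/* Hypothetical container for constants that depend only on n. */
typedef struct
{
    uint64_t n;
    uint64_t c1;  /* 2^64 mod n */
    uint64_t c2;  /* 2^128 mod n */
} red3_pre_t;

static red3_pre_t red3_precompute(uint64_t n)
{
    red3_pre_t pre;
    assert(n >= 1 && n < (UINT64_C(1) << 62));  /* keeps the fold below 2^128 */
    pre.n = n;
    pre.c1 = (uint64_t) ((((u128) 1) << 64) % n);
    pre.c2 = (uint64_t) (((u128) pre.c1 * pre.c1) % n);
    return pre;
}

/* Baseline: reduce hi*2^128 + mid*2^64 + lo word by word (three divisions). */
static uint64_t red3_reference(uint64_t hi, uint64_t mid, uint64_t lo, uint64_t n)
{
    u128 r = hi % n;
    r = ((r << 64) | mid) % n;
    r = ((r << 64) | lo) % n;
    return (uint64_t) r;
}

/* Folded variant: two multiplications by precomputed constants, then a
   single reduction of a value that is congruent to the input mod n. */
static uint64_t red3_folded(uint64_t hi, uint64_t mid, uint64_t lo,
                            const red3_pre_t * pre)
{
    u128 t = (u128) hi * pre->c2 + (u128) mid * pre->c1 + lo;
    return (uint64_t) (t % pre->n);
}
```

In a tuned implementation the remaining `%` would itself be replaced by a multiplication with a precomputed inverse; it is left as a plain division here to keep the sketch short.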

In the 2 CRT prime case, we get an additional 3% speedup by using __uint128_t for the CRT step instead of the generic macros. (Ideally we'd do the CRT and modular reduction directly in the SIMD representation, but that looks very tricky to implement.)
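As an illustration of what a two-prime CRT step with `__uint128_t` can look like, here is a minimal scalar sketch using Garner's formula followed by a plain reduction mod $n$. It is not the code from this PR: the function names are hypothetical, the final `%` stands in for the faster reduction described above, and the usage example picks two well-known 30-bit primes purely for convenience (fft_small uses its own, larger primes). `__uint128_t` is a GCC/Clang extension.

```c
#include <assert.h>
#include <stdint.h>

typedef __uint128_t u128;

/* a * b mod p via a 128-bit intermediate product. */
static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t p)
{
    return (uint64_t) (((u128) a * b) % p);
}

/* a^e mod p by binary exponentiation (used to get p1^(-1) mod p2). */
static uint64_t powmod(uint64_t a, uint64_t e, uint64_t p)
{
    uint64_t r = 1 % p;
    a %= p;
    while (e != 0)
    {
        if (e & 1)
            r = mulmod(r, a, p);
        a = mulmod(a, a, p);
        e >>= 1;
    }
    return r;
}

/* Garner's formula: given r1 = x mod p1 and r2 = x mod p2 with
   0 <= x < p1*p2, return x = r1 + p1 * ((r2 - r1) * p1^(-1) mod p2). */
static u128 crt2(uint64_t r1, uint64_t r2, uint64_t p1, uint64_t p2,
                 uint64_t p1_inv_mod_p2)
{
    assert(p2 < (UINT64_C(1) << 63) && r1 < p1 && r2 < p2);
    uint64_t r1m = r1 % p2;
    uint64_t d = (r2 >= r1m) ? r2 - r1m : r2 + p2 - r1m;  /* (r2 - r1) mod p2 */
    uint64_t t = mulmod(d, p1_inv_mod_p2, p2);
    return (u128) r1 + (u128) p1 * t;
}

/* Reconstruct the integer coefficient, then reduce it mod n. */
static uint64_t crt2_reduce(uint64_t r1, uint64_t r2, uint64_t p1, uint64_t p2,
                            uint64_t p1_inv_mod_p2, uint64_t n)
{
    return (uint64_t) (crt2(r1, r2, p1, p2, p1_inv_mod_p2) % n);
}
```

With `p1 = 1000000007` and `p2 = 998244353`, the inverse can be obtained once as `powmod(p1, p2 - 2, p2)`, after which `crt2_reduce(x % p1, x % p2, p1, p2, inv, n)` returns `x % n` for any `x < p1 * p2`.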

Speedup ratios for nmod_poly_mul:

                            bits in modulus
     length     16      24      32      40      48      56      64
        64   0.992   1.001   1.000   1.000   0.995   1.000   0.993  
       128   1.000   1.004   0.997   1.186   1.002   0.998   1.044  
       256   1.004   1.002   1.177   1.185   1.116   1.119   1.055  
       512   1.000   1.179   1.187   1.179   1.121   1.112   1.048  
      1024   0.991   1.173   1.161   1.161   1.112   1.112   1.045  
      2048   1.004   1.164   1.162   1.160   1.106   1.107   1.043  
      4096   1.000   1.153   1.153   1.153   1.108   1.094   1.044  
      8192   1.009   1.143   1.138   1.139   1.092   1.065   1.024  
     16384   1.000   1.137   1.135   1.131   1.071   1.058   1.026  
     32768   1.000   1.128   1.139   1.128   1.074   1.057   1.020  
     65536   1.010   1.127   1.122   1.122   1.061   1.052   1.021  
    131072   1.000   1.119   1.120   1.119   1.057   1.048   1.025  
    262144   1.111   1.111   1.120   1.110   1.048   1.026   1.026  
    524288   1.101   1.097   1.097   1.096   1.038   1.024   1.019  
   1048576   1.087   1.089   1.089   1.043   1.036   1.020   1.018  
   2097152   1.094   1.094   1.085   1.041   1.036   1.020   1.015  
   4194304   1.083   1.079   1.083   1.036   1.029   1.012   1.005  
   8388608   1.080   1.061   1.086   1.036   1.040   1.020   1.008  
  16777216   1.073   1.077   1.071   1.043   1.030   1.004   1.012  
  33554432   1.072   1.066   1.066   1.038   1.025   1.012   1.004

Speedup when squaring (slightly higher, as relatively more time is spent on output reconstruction):

                            bits in modulus
     length     16      24      32      40      48      56      64
        64   0.998   0.999   1.000   1.000   1.006   1.000   1.000  
       128   1.000   1.000   1.000   0.997   1.000   1.000   1.000  
       256   0.994   1.000   1.256   1.258   1.161   1.158   1.062  
       512   0.994   1.000   1.244   1.242   1.146   1.150   1.063  
      1024   1.001   1.242   1.232   1.226   1.143   1.145   1.057  
      2048   1.000   1.227   1.222   1.224   1.137   1.140   1.056  
      4096   1.000   1.220   1.198   1.209   1.161   1.104   1.045  
      8192   0.997   1.209   1.198   1.198   1.130   1.083   1.034  
     16384   1.000   1.199   1.194   1.194   1.096   1.075   1.032  
     32768   1.000   1.191   1.187   1.186   1.098   1.068   1.039  
     65536   1.002   1.190   1.184   1.184   1.084   1.072   1.028  
    131072   1.000   1.182   1.173   1.169   1.081   1.071   1.029  
    262144   1.165   1.168   1.164   1.158   1.067   1.029   1.035  
    524288   1.152   1.145   1.139   1.144   1.054   1.033   1.023  
   1048576   1.137   1.130   1.134   1.062   1.053   1.029   1.026  
   2097152   1.125   1.123   1.120   1.058   1.051   1.028   1.014  
   4194304   1.110   1.103   1.109   1.051   1.051   1.013   1.002  
   8388608   1.116   1.112   1.113   1.053   1.044   1.018   1.004  
  16777216   1.109   1.101   1.104   1.049   1.044   1.017   1.002  
  33554432   1.100   1.098   1.100   1.046   1.041   1.016   1.001

In summary,

- Up to a 19% speedup for small (e.g. 30-bit) moduli (26% when squaring), diminishing with longer products.
- Up to a 12% speedup for near-wordsize moduli (16% when squaring).
- Up to a 6% speedup for 64-bit moduli.

fredrik-johansson added 2 commits April 12, 2026 22:37

Optimize output modular reductions in fft_small nmod_poly_mul

f694ba8

MSVC fix

fa0e3d4

fredrik-johansson force-pushed the fft2 branch from 62af663 to fa0e3d4 April 12, 2026 23:05

fredrik-johansson merged commit b0cca3c into flintlib:main Apr 13, 2026
12 checks passed
