Optimize output modular reductions in fft_small nmod_poly_mul#2635

Merged
fredrik-johansson merged 2 commits into flintlib:main from fredrik-johansson:fft2
Apr 13, 2026

@fredrik-johansson
Collaborator

Optimize the final output reductions mod $n$ in the fft_small-based nmod_poly multiplication when there are 2 or 3 CRT primes (i.e. coefficients in the product over $\mathbb{Z}$ have between 50 and 150 bits). Follow-up to #2634, which did the single-prime case. The 4 CRT prime case could be done too, but it only affects multiplications so large that one would get only about a 2% speedup, so I'm not going to bother with it here.

In essence, we replace use of the slow macros NMOD2_RED2 and NMOD_RED3 with the faster algorithms discussed in #2597. I'm doing this ad hoc in fft_small for now to test things before rolling this out to other nmod modules. The main complication is that one needs different precomputations with some case distinctions.
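
To illustrate why precomputations enter the picture, here is a minimal sketch in C of the identity that a precomputation-based three-word reduction can rely on: $a_2 2^{128} + a_1 2^{64} + a_0 \equiv a_2 b_2 + a_1 b_1 + a_0 \pmod n$ with $b_1 = 2^{64} \bmod n$ and $b_2 = 2^{128} \bmod n$. This is not the algorithm from #2597 and not FLINT's code; the names `red3_pre_t`, `red3_pre_init`, `red3_reference` and `red3_with_pre` are hypothetical, and the sketch assumes the `unsigned __int128` extension of GCC/Clang. A real implementation would also replace each `%` below with a multiplication-based reduction.

```c
#include <assert.h>
#include <stdint.h>

typedef unsigned __int128 u128;   /* GCC/Clang extension */

/* Hypothetical precomputed data for reducing three-word values mod n. */
typedef struct
{
    uint64_t n;    /* modulus, nonzero */
    uint64_t b1;   /* 2^64 mod n */
    uint64_t b2;   /* 2^128 mod n */
} red3_pre_t;

static red3_pre_t red3_pre_init(uint64_t n)
{
    red3_pre_t pre;
    assert(n != 0);
    pre.n = n;
    pre.b1 = (uint64_t) ((((u128) 1) << 64) % n);
    pre.b2 = (uint64_t) (((u128) pre.b1 * pre.b1) % n);
    return pre;
}

/* Reference: Horner evaluation, one 128-by-64-bit remainder per word. */
static uint64_t red3_reference(uint64_t a2, uint64_t a1, uint64_t a0, uint64_t n)
{
    uint64_t r = a2 % n;
    r = (uint64_t) ((((u128) r << 64) | a1) % n);
    r = (uint64_t) ((((u128) r << 64) | a0) % n);
    return r;
}

/* Same result via the precomputed powers of 2^64. Each partial term is
   reduced below n first, so the final sum is below 3n and fits in u128. */
static uint64_t red3_with_pre(uint64_t a2, uint64_t a1, uint64_t a0,
                              const red3_pre_t * pre)
{
    u128 t2 = ((u128) a2 * pre->b2) % pre->n;
    u128 t1 = ((u128) a1 * pre->b1) % pre->n;
    u128 t0 = a0 % pre->n;
    return (uint64_t) ((t2 + t1 + t0) % pre->n);
}
```

For example, with $n = 7$ one has $b_1 = 2$ and $b_2 = 4$, so the three-word value with words $(1, 1, 1)$ reduces to $4 + 2 + 1 \equiv 0$.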

In the 2 CRT prime case, we get 3% additional speedup using __uint128_t for the CRT step instead of the generic macros. (Ideally we'd do the CRT and modular reduction directly in the SIMD representation, but that looks very tricky to implement.)
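
As a rough illustration of a two-prime CRT step written with 128-bit integers, the sketch below reconstructs $x \in [0, p_1 p_2)$ from its residues using Garner's formula $x = r_1 + p_1 \cdot ((r_2 - r_1) p_1^{-1} \bmod p_2)$ and then reduces mod $n$. It is an assumption-laden sketch, not the code in this PR: the names `mulmod`, `powmod`, `crt2_u128` and `crt2_mod_n` are hypothetical, the final `%` stands in for the optimized reduction, and `unsigned __int128` is a GCC/Clang extension.

```c
#include <assert.h>
#include <stdint.h>

typedef unsigned __int128 u128;   /* GCC/Clang extension */

/* a * b mod p through a 128-bit intermediate product. */
static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t p)
{
    return (uint64_t) (((u128) a * b) % p);
}

/* a^e mod p by binary exponentiation; used here only to obtain
   p1^(-1) mod p2 as p1^(p2-2) mod p2 for prime p2. */
static uint64_t powmod(uint64_t a, uint64_t e, uint64_t p)
{
    uint64_t r = 1 % p;
    a %= p;
    while (e != 0)
    {
        if (e & 1)
            r = mulmod(r, a, p);
        a = mulmod(a, a, p);
        e >>= 1;
    }
    return r;
}

/* Garner's formula: the unique x in [0, p1*p2) with x = r1 mod p1 and
   x = r2 mod p2, given p1_inv = p1^(-1) mod p2. */
static u128 crt2_u128(uint64_t r1, uint64_t r2,
                      uint64_t p1, uint64_t p2, uint64_t p1_inv)
{
    uint64_t d, t;
    assert(r1 < p1 && r2 < p2 && p2 < (UINT64_C(1) << 63));
    d = (r2 + p2 - r1 % p2) % p2;     /* (r2 - r1) mod p2, no overflow */
    t = mulmod(d, p1_inv, p2);
    return (u128) r1 + (u128) p1 * t; /* at most p1*p2 - 1 */
}

/* CRT followed by a plain 128-bit remainder mod n. */
static uint64_t crt2_mod_n(uint64_t r1, uint64_t r2, uint64_t p1,
                           uint64_t p2, uint64_t p1_inv, uint64_t n)
{
    return (uint64_t) (crt2_u128(r1, r2, p1, p2, p1_inv) % n);
}
```

The inverse `p1_inv` depends only on the primes, so it would be computed once up front, for instance as `powmod(p1, p2 - 2, p2)`, and reused for every coefficient.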

Speedup ratios for nmod_poly_mul:

                            bits in modulus
     length     16      24      32      40      48      56      64
        64   0.992   1.001   1.000   1.000   0.995   1.000   0.993  
       128   1.000   1.004   0.997   1.186   1.002   0.998   1.044  
       256   1.004   1.002   1.177   1.185   1.116   1.119   1.055  
       512   1.000   1.179   1.187   1.179   1.121   1.112   1.048  
      1024   0.991   1.173   1.161   1.161   1.112   1.112   1.045  
      2048   1.004   1.164   1.162   1.160   1.106   1.107   1.043  
      4096   1.000   1.153   1.153   1.153   1.108   1.094   1.044  
      8192   1.009   1.143   1.138   1.139   1.092   1.065   1.024  
     16384   1.000   1.137   1.135   1.131   1.071   1.058   1.026  
     32768   1.000   1.128   1.139   1.128   1.074   1.057   1.020  
     65536   1.010   1.127   1.122   1.122   1.061   1.052   1.021  
    131072   1.000   1.119   1.120   1.119   1.057   1.048   1.025  
    262144   1.111   1.111   1.120   1.110   1.048   1.026   1.026  
    524288   1.101   1.097   1.097   1.096   1.038   1.024   1.019  
   1048576   1.087   1.089   1.089   1.043   1.036   1.020   1.018  
   2097152   1.094   1.094   1.085   1.041   1.036   1.020   1.015  
   4194304   1.083   1.079   1.083   1.036   1.029   1.012   1.005  
   8388608   1.080   1.061   1.086   1.036   1.040   1.020   1.008  
  16777216   1.073   1.077   1.071   1.043   1.030   1.004   1.012  
  33554432   1.072   1.066   1.066   1.038   1.025   1.012   1.004  

Speedup when squaring (slightly higher, as relatively more time is spent on output reconstruction):

                            bits in modulus
     length     16      24      32      40      48      56      64
        64   0.998   0.999   1.000   1.000   1.006   1.000   1.000  
       128   1.000   1.000   1.000   0.997   1.000   1.000   1.000  
       256   0.994   1.000   1.256   1.258   1.161   1.158   1.062  
       512   0.994   1.000   1.244   1.242   1.146   1.150   1.063  
      1024   1.001   1.242   1.232   1.226   1.143   1.145   1.057  
      2048   1.000   1.227   1.222   1.224   1.137   1.140   1.056  
      4096   1.000   1.220   1.198   1.209   1.161   1.104   1.045  
      8192   0.997   1.209   1.198   1.198   1.130   1.083   1.034  
     16384   1.000   1.199   1.194   1.194   1.096   1.075   1.032  
     32768   1.000   1.191   1.187   1.186   1.098   1.068   1.039  
     65536   1.002   1.190   1.184   1.184   1.084   1.072   1.028  
    131072   1.000   1.182   1.173   1.169   1.081   1.071   1.029  
    262144   1.165   1.168   1.164   1.158   1.067   1.029   1.035  
    524288   1.152   1.145   1.139   1.144   1.054   1.033   1.023  
   1048576   1.137   1.130   1.134   1.062   1.053   1.029   1.026  
   2097152   1.125   1.123   1.120   1.058   1.051   1.028   1.014  
   4194304   1.110   1.103   1.109   1.051   1.051   1.013   1.002  
   8388608   1.116   1.112   1.113   1.053   1.044   1.018   1.004  
  16777216   1.109   1.101   1.104   1.049   1.044   1.017   1.002  
  33554432   1.100   1.098   1.100   1.046   1.041   1.016   1.001  

In summary,

  • Up to a 19% speedup for small (e.g. 30 bit) moduli (26% when squaring), diminishing with longer products.
  • Up to a 12% speedup for near-wordsize moduli (16% when squaring).
  • Up to a 6% speedup for 64-bit moduli.

fredrik-johansson merged commit b0cca3c into flintlib:main on Apr 13, 2026
12 checks passed