
Optimize nmod_poly_divrem_basecase #2637

Merged
fredrik-johansson merged 1 commit into flintlib:main from fredrik-johansson:n3
Apr 21, 2026
Conversation

@fredrik-johansson (Collaborator) commented Apr 20, 2026

Basecase polynomial division with remainder (nmod_poly_divrem_basecase) is optimized in several ways:

  • Don't use mpn_addmul_1 for the unreduced vector addmuls. We don't need carry propagation between entries, so plain loops suffice and benefit from instruction-level parallelism (verified to be significantly faster on Zen 3; I presume this will also hold on other modern architectures)
  • Use AVX2 when the unreduced sums fit in 64 bits
  • Skip a multiplication when the divisor is monic
  • Replace a negmod with a plain subtraction
  • Specialize modular reduction for different cases
  • Simplify some of the logic
  • Optimize temporary allocations
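
To illustrate the first and fourth bullets, here is a minimal sketch of an unreduced addmul written as plain loops. The function names are hypothetical and are not the helpers added in this PR; `unsigned __int128` is a GCC/Clang extension assumed to be available. The point is that each entry is an independent accumulator, so no carry travels from one entry to the next, whereas mpn_addmul_1 treats the whole vector as one multi-limb integer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not FLINT's API): res[i] += c * vec[i] with no
   modular reduction. The caller must guarantee that every sum fits in
   64 bits, e.g. by bounding the modulus and the number of addmuls. */
static void nored_addmul_1limb(uint64_t *res, const uint64_t *vec,
                               size_t len, uint64_t c)
{
    for (size_t i = 0; i < len; i++)
        res[i] += c * vec[i];   /* entries are independent: no carry chain */
}

/* Two-limb variant: entry i is a 128-bit accumulator (hi[i], lo[i]).
   A carry moves from lo[i] into hi[i] only, never into entry i + 1. */
static void nored_addmul_2limb(uint64_t *lo, uint64_t *hi,
                               const uint64_t *vec, size_t len, uint64_t c)
{
    for (size_t i = 0; i < len; i++)
    {
        unsigned __int128 acc = ((unsigned __int128) hi[i] << 64) | lo[i];
        acc += (unsigned __int128) c * vec[i];
        lo[i] = (uint64_t) acc;
        hi[i] = (uint64_t) (acc >> 64);
    }
}

/* Reduce one two-limb accumulator modulo n, done once after all addmuls. */
static uint64_t reduce_2limb(uint64_t lo, uint64_t hi, uint64_t n)
{
    assert(n != 0);
    unsigned __int128 acc = ((unsigned __int128) hi << 64) | lo;
    return (uint64_t) (acc % n);
}
```

When subtracting q times the divisor, the multiplier can be taken as n - q instead of a full negmod: if q is known to be nonzero this is exactly -q mod n, and in an unreduced accumulation even the value n would only add a multiple of n. That reading of the negmod bullet is an assumption on my part, not something stated in the PR.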

The basecase->Newton thresholds in _nmod_poly_divrem have also been increased, but I have not touched tuning values in higher algorithms that may benefit indirectly. Note that this speeds up nmod_poly_rem but does not affect nmod_poly_div since division without remainder uses a different algorithm.
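
For readers unfamiliar with the algorithm, the following is a textbook sketch of basecase division with remainder over Z/nZ, showing where the monic shortcut and the zero check fit. It is not the code in this PR: it reduces after every step instead of accumulating unreduced sums, it assumes n < 2^32 so that products fit in 64 bits, and the function name, argument order, and scratch buffer W are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference sketch: A = Q*B + R in (Z/nZ)[x], coefficients stored low to high.
   Assumes lenA >= lenB >= 1, 2 <= n < 2^32, all coefficients of B already
   reduced mod n, and lead_inv = inverse of B[lenB - 1] mod n.
   Q has lenA - lenB + 1 entries, R has lenB - 1 entries, W is scratch of
   length lenA. */
static void divrem_schoolbook(uint64_t *Q, uint64_t *R, uint64_t *W,
                              const uint64_t *A, size_t lenA,
                              const uint64_t *B, size_t lenB,
                              uint64_t lead_inv, uint64_t n)
{
    assert(lenA >= lenB && lenB >= 1);
    assert(n >= 2 && n < (UINT64_C(1) << 32));

    int monic = (B[lenB - 1] == 1);

    for (size_t i = 0; i < lenA; i++)
        W[i] = A[i] % n;

    for (size_t iQ = lenA - lenB + 1; iQ-- > 0; )
    {
        uint64_t r = W[iQ + lenB - 1];      /* current leading coefficient */

        if (r == 0)
        {
            Q[iQ] = 0;                      /* nothing to subtract */
            continue;
        }

        /* Monic divisor: the quotient coefficient is r itself. */
        uint64_t q = monic ? r : (r * lead_inv) % n;
        Q[iQ] = q;

        /* q != 0 here, so n - q equals -q mod n without a negmod. */
        uint64_t c = n - q;

        for (size_t j = 0; j + 1 < lenB; j++)
            W[iQ + j] = (W[iQ + j] + c * B[j]) % n;
    }

    for (size_t i = 0; i + 1 < lenB; i++)
        R[i] = W[i];
}
```

For example, dividing x^3 + 2x + 6 by x + 3 modulo 7 with this sketch gives quotient x^2 + 4x + 4 and remainder 1.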

Speedup on Zen 3 for nmod_poly_divrem when dividing a polynomial of length 2·len by one of length len, for moduli of different bit sizes (values above 1.00 mean the new code is faster). Monic divisor:

len \ bits     2    16    27    28    32    48    56    63    64    64 (close to UWORD_MAX)

       2    1.73  1.92  1.80  1.81  1.63  1.62  1.62  1.64  1.64  1.52
       3    2.24  1.93  1.91  1.92  1.59  1.57  1.57  1.42  1.69  1.54
       4    1.83  1.75  1.75  1.74  1.59  1.57  1.58  1.49  1.71  1.59
       6    1.99  1.73  1.73  1.71  1.58  1.55  1.52  1.53  1.71  1.62
       8    2.40  1.82  1.84  1.80  1.58  1.54  1.53  1.63  1.83  1.70
      12    2.79  1.88  1.86  1.85  1.73  1.56  1.57  1.64  1.98  1.91
      16    3.07  2.10  2.15  2.10  1.65  1.56  1.56  1.62  1.77  1.66
      24    3.25  2.29  2.32  2.31  1.69  1.49  1.48  1.48  1.56  1.50
      32    3.79  2.47  2.48  2.44  1.54  1.25  1.28  1.27  1.34  1.31
      48    3.75  3.30  3.08  2.97  1.39  1.15  1.16  1.13  1.20  1.13
      64    3.54  3.28  3.00  3.04  1.29  1.06  1.08  1.06  1.11  1.09
      96    2.67  3.18  4.20  1.00  1.00  1.00  1.00  1.00  1.00  1.00
     128    2.30  2.48  4.09  1.00  1.00  1.00  1.00  1.01  1.02  1.00
     192    1.63  2.16  3.19  0.99  1.00  1.01  1.00  1.00  1.00  1.01
     256    1.53  1.93  2.48  1.00  1.00  1.00  1.00  1.01  1.00  1.00
     384    1.10  1.18  2.26  1.01  1.00  1.00  1.00  1.00  1.00  1.00
     512    1.01  1.00  0.99  1.00  1.00  1.00  0.99  1.00  1.00  1.00

Non-monic divisor:

len \ bits     2    16    27    28    32    48    56    63    64    64 (close to UWORD_MAX)

       2    1.30  1.39  1.29  1.23  1.22  1.15  1.18  1.13  1.20  1.20
       3    1.75  1.46  1.32  1.21  1.17  1.20  1.14  1.12  1.22  1.18
       4    1.56  1.43  1.30  1.26  1.20  1.16  1.14  1.14  1.26  1.19
       6    1.71  1.49  1.40  1.37  1.22  1.15  1.17  1.20  1.28  1.23
       8    1.80  1.48  1.33  1.33  1.21  1.15  1.14  1.26  1.38  1.29
      12    2.58  1.64  1.43  1.36  1.26  1.16  1.16  1.46  1.73  1.70
      16    2.30  1.82  1.61  1.53  1.25  1.20  1.20  1.51  1.60  1.54
      24    3.08  2.13  1.82  1.79  1.54  1.37  1.36  1.39  1.47  1.40
      32    3.06  2.21  2.13  2.12  1.44  1.19  1.22  1.22  1.31  1.27
      48    3.89  3.51  3.09  3.07  1.36  1.15  1.14  1.12  1.17  1.15
      64    4.59  3.49  3.06  3.03  1.27  1.08  1.08  1.06  1.09  1.04
      96    2.54  3.52  4.22  1.00  1.00  1.00  1.00  1.00  1.00  0.99
     128    2.38  2.71  4.18  1.00  1.00  1.01  1.00  1.01  1.00  1.00
     192    1.90  2.28  2.96  1.02  1.00  0.99  1.00  1.00  1.00  1.00
     256    1.51  2.01  2.42  1.00  1.00  1.00  1.00  1.00  1.01  1.00
     384    1.14  1.21  2.35  1.00  1.00  1.00  1.00  1.00  1.00  1.01
     512    1.01  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00

The added helper functions _nmod_vec_nored_scalar_addmul... can be useful elsewhere, notably in basecase LU factorization, which I'll handle in a separate PR.

Possible further improvements which I did not implement:

  • The AVX2 code goes up to ~27 bits; specifically, it handles the case where unreduced sums are guaranteed to fit in 64 bits. It shouldn't be too difficult to handle all 32-bit moduli: doing 128-bit additions with carry-handling sounds terrible, but you could rearrange the data to accumulate the high and low 32 bits of the products into two independent 64-bit sums.

  • Neon version of the AVX2 code.

  • For really small moduli I guess you could get another 2x speedup by repacking to work on vectors of 32-bit integers instead of vectors of 64-bit integers holding 32-bit values.

  • At each iteration, we reduce R[iR], check if it is zero, and then multiply by the leading inverse. It would sometimes be better to multiply by the unreduced value of R[iR], and then check for zero, thus skipping a modular reduction.

  • Vectorized modular reduction (not specific to polynomial division).
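
The first idea in the list above (accumulating the high and low 32 bits of each product into two independent 64-bit sums) can be sketched in scalar C as follows. This is only an illustration of the arithmetic, not AVX2 code and not part of the PR; the function names are hypothetical, and `unsigned __int128` (a GCC/Clang extension) is used only for the final recombination.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Accumulate sum of a[i] * b[i] for 32-bit inputs without any 128-bit
   additions: the low and high halves of each 64-bit product go into
   separate 64-bit sums that never interact inside the loop. */
static void dot_split(uint64_t *lo_sum, uint64_t *hi_sum,
                      const uint32_t *a, const uint32_t *b, size_t len)
{
    /* Each half contributes less than 2^32 per term, so up to 2^32 terms
       cannot overflow a 64-bit sum. */
    assert(len <= (UINT64_C(1) << 32));

    uint64_t lo = 0, hi = 0;

    for (size_t i = 0; i < len; i++)
    {
        uint64_t p = (uint64_t) a[i] * b[i];
        lo += p & UINT64_C(0xffffffff);
        hi += p >> 32;
    }

    *lo_sum = lo;
    *hi_sum = hi;
}

/* Recombine once at the end: total = hi * 2^32 + lo, then reduce mod n. */
static uint64_t combine_mod(uint64_t lo, uint64_t hi, uint64_t n)
{
    assert(n != 0);
    unsigned __int128 total = ((unsigned __int128) hi << 32) + lo;
    return (uint64_t) (total % n);
}
```

In a vectorized version the two running sums would presumably live in separate vector registers, with the recombination and reduction done once per output entry.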

@fredrik-johansson fredrik-johansson merged commit 5c74e17 into flintlib:main Apr 21, 2026
13 checks passed