Optimize `nmod_poly_divrem_basecase` by fredrik-johansson · Pull Request #2637 · flintlib/flint

fredrik-johansson · 2026-04-20T07:50:09Z

Basecase polynomial division with remainder (nmod_poly_divrem_basecase) is optimized in several ways:

- Don't use mpn_addmul_1 for the unreduced vector addmuls; we don't need carry propagation, so do it with plain loops and benefit from instruction-level parallelism (verified to be significantly faster on Zen 3; I presume that this will be the case for other modern architectures)
- Use AVX2 when the unreduced sums fit in 64 bits
- Skip a multiplication when the divisor is monic
- Replace a negmod with a plain subtraction
- Specialize modular reduction for different cases
- Simplify some of the logic
- Optimize temporary allocations
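
To make the first few items concrete, here is a minimal sketch of a basecase division loop for the single-word case, where every unreduced sum fits in 64 bits. This is not the code in this PR: `divrem_sketch`, its argument layout, and the scratch array `W` are hypothetical, and it assumes lenA >= lenB >= 1, reduced inputs, and lenB * n^2 < 2^64. The use of `n - q` is one plausible reading of "replace a negmod with a plain subtraction".

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch, not FLINT's implementation.
   Divides A (lenA coefficients) by B (lenB coefficients) modulo n.
   lead_inv is the inverse of B[lenB - 1] modulo n (1 if B is monic).
   W is scratch space of lenA words. On exit Q holds lenA - lenB + 1
   coefficients and R holds lenB - 1 coefficients.
   Assumption: lenB * n^2 < 2^64, so no unreduced sum overflows. */
static void divrem_sketch(uint64_t *Q, uint64_t *R, uint64_t *W,
                          const uint64_t *A, size_t lenA,
                          const uint64_t *B, size_t lenB,
                          uint64_t lead_inv, uint64_t n)
{
    size_t i, j, k;

    for (i = 0; i < lenA; i++)
        W[i] = A[i];

    for (k = lenA - lenB + 1; k-- > 0; )
    {
        /* Only the current leading entry is reduced. */
        uint64_t r = W[k + lenB - 1] % n;
        /* Monic divisor: the quotient coefficient is r itself. */
        uint64_t q = (lead_inv == 1) ? r : (r * lead_inv) % n;
        Q[k] = q;

        if (q != 0)
        {
            /* q is nonzero, so n - q is congruent to -q with no
               special case needed. */
            uint64_t c = n - q;
            /* Plain loop: no reduction and no carry between entries,
               so the iterations are independent. */
            for (j = 0; j + 1 < lenB; j++)
                W[k + j] += c * B[j];
        }
    }

    /* Reduce the remainder once at the end. */
    for (j = 0; j + 1 < lenB; j++)
        R[j] = W[j] % n;
}
```

For example, dividing x^3 + 2x + 6 by x + 3 modulo 7 with `lead_inv = 1` gives the quotient x^2 + 4x + 4 and remainder 1. Because the inner loop has no dependency between entries, a compiler is free to vectorize it, which is the property the plain-loop and AVX2 items rely on.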

The basecase->Newton thresholds in _nmod_poly_divrem have also been increased, but I have not touched tuning values in higher algorithms that may benefit indirectly. Note that this speeds up nmod_poly_rem but does not affect nmod_poly_div since division without remainder uses a different algorithm.

Speedup on Zen 3 for nmod_poly_divrem when dividing a polynomial of length 2*len by one of length len, for moduli of different bit sizes. Monic divisor:

  len \ bits   2    16    27    28    32    48    56    63    64    64(close to UWORD_MAX)

       2    1.73  1.92  1.80  1.81  1.63  1.62  1.62  1.64  1.64  1.52
       3    2.24  1.93  1.91  1.92  1.59  1.57  1.57  1.42  1.69  1.54
       4    1.83  1.75  1.75  1.74  1.59  1.57  1.58  1.49  1.71  1.59
       6    1.99  1.73  1.73  1.71  1.58  1.55  1.52  1.53  1.71  1.62
       8    2.40  1.82  1.84  1.80  1.58  1.54  1.53  1.63  1.83  1.70
      12    2.79  1.88  1.86  1.85  1.73  1.56  1.57  1.64  1.98  1.91
      16    3.07  2.10  2.15  2.10  1.65  1.56  1.56  1.62  1.77  1.66
      24    3.25  2.29  2.32  2.31  1.69  1.49  1.48  1.48  1.56  1.50
      32    3.79  2.47  2.48  2.44  1.54  1.25  1.28  1.27  1.34  1.31
      48    3.75  3.30  3.08  2.97  1.39  1.15  1.16  1.13  1.20  1.13
      64    3.54  3.28  3.00  3.04  1.29  1.06  1.08  1.06  1.11  1.09
      96    2.67  3.18  4.20  1.00  1.00  1.00  1.00  1.00  1.00  1.00
     128    2.30  2.48  4.09  1.00  1.00  1.00  1.00  1.01  1.02  1.00
     192    1.63  2.16  3.19  0.99  1.00  1.01  1.00  1.00  1.00  1.01
     256    1.53  1.93  2.48  1.00  1.00  1.00  1.00  1.01  1.00  1.00
     384    1.10  1.18  2.26  1.01  1.00  1.00  1.00  1.00  1.00  1.00
     512    1.01  1.00  0.99  1.00  1.00  1.00  0.99  1.00  1.00  1.00

Non-monic divisor:

  len \ bits   2    16    27    28    32    48    56    63    64    64(close to UWORD_MAX)
       2    1.30  1.39  1.29  1.23  1.22  1.15  1.18  1.13  1.20  1.20
       3    1.75  1.46  1.32  1.21  1.17  1.20  1.14  1.12  1.22  1.18
       4    1.56  1.43  1.30  1.26  1.20  1.16  1.14  1.14  1.26  1.19
       6    1.71  1.49  1.40  1.37  1.22  1.15  1.17  1.20  1.28  1.23
       8    1.80  1.48  1.33  1.33  1.21  1.15  1.14  1.26  1.38  1.29
      12    2.58  1.64  1.43  1.36  1.26  1.16  1.16  1.46  1.73  1.70
      16    2.30  1.82  1.61  1.53  1.25  1.20  1.20  1.51  1.60  1.54
      24    3.08  2.13  1.82  1.79  1.54  1.37  1.36  1.39  1.47  1.40
      32    3.06  2.21  2.13  2.12  1.44  1.19  1.22  1.22  1.31  1.27
      48    3.89  3.51  3.09  3.07  1.36  1.15  1.14  1.12  1.17  1.15
      64    4.59  3.49  3.06  3.03  1.27  1.08  1.08  1.06  1.09  1.04
      96    2.54  3.52  4.22  1.00  1.00  1.00  1.00  1.00  1.00  0.99
     128    2.38  2.71  4.18  1.00  1.00  1.01  1.00  1.01  1.00  1.00
     192    1.90  2.28  2.96  1.02  1.00  0.99  1.00  1.00  1.00  1.00
     256    1.51  2.01  2.42  1.00  1.00  1.00  1.00  1.00  1.01  1.00
     384    1.14  1.21  2.35  1.00  1.00  1.00  1.00  1.00  1.00  1.01
     512    1.01  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00

The added helper functions _nmod_vec_nored_scalar_addmul... can be useful elsewhere, notably in basecase LU factorization which I'll do in a separate PR.

Possible further improvements which I did not implement:

- The AVX2 code goes up to ~27 bits; specifically, it handles the case where unreduced sums are guaranteed to fit in 64 bits. It shouldn't be too difficult to handle all 32-bit moduli: doing 128-bit additions with carry-handling sounds terrible, but you could rearrange the data to accumulate the high and low 32 bits of the products into two independent 64-bit sums.
- Neon version of the AVX2 code.
- For really small moduli I guess you could get another 2x speedup by repacking to work on vectors of 32-bit integers instead of vectors of 64-bit integers holding 32-bit values.
- At each iteration, we reduce R[iR], check if it is zero, and then multiply by the leading inverse. It would sometimes be better to multiply by the unreduced value of R[iR], and then check for zero, thus skipping a modular reduction.
- Vectorized modular reduction (not specific to polynomial division).
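
The first idea in this list (two independent 64-bit sums for 32-bit moduli) can be sketched in scalar C as follows. This is only an illustration of the arithmetic, not AVX2 code and not something implemented in this PR; `split_addmul` and `split_reduce` are hypothetical names, and the sketch assumes n < 2^32 with all operands reduced modulo n.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: accumulate c * b[i] without reduction, keeping
   the low and high 32 bits of each 64-bit product in two separate
   64-bit sums. Requires c, b[i] < 2^32. Each sum can absorb up to
   2^32 products before it could overflow. */
static void split_addmul(uint64_t *lo, uint64_t *hi,
                         const uint64_t *b, uint64_t c, size_t len)
{
    for (size_t i = 0; i < len; i++)
    {
        uint64_t p = c * b[i];
        lo[i] += p & UINT64_C(0xffffffff);
        hi[i] += p >> 32;
    }
}

/* Recombine one entry: the accumulated value is hi * 2^32 + lo.
   Returns that value reduced modulo n, for n < 2^32. */
static uint64_t split_reduce(uint64_t lo, uint64_t hi, uint64_t n)
{
    uint64_t two32 = (UINT64_C(1) << 32) % n;
    /* (hi % n) * two32 + lo % n < n^2 < 2^64, so no overflow. */
    return ((hi % n) * two32 + lo % n) % n;
}
```

Neither sum ever needs a carry into the other during accumulation, which is what would make this layout friendlier to vector code than 128-bit additions; the cost is moved to the final recombination step.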

Optimize nmod_poly_divrem_basecase

8ae366e

fredrik-johansson merged commit 5c74e17 into flintlib:main Apr 21, 2026
13 checks passed

fredrik-johansson mentioned this pull request Apr 21, 2026

Optimize nmod_mat_lu_classical_delayed #2640

Merged

fredrik-johansson merged 1 commit into flintlib:main from fredrik-johansson:n3
