Optimize nmod_poly_divrem_basecase #2637
Merged
fredrik-johansson merged 1 commit into flintlib:main on Apr 21, 2026
Basecase polynomial division with remainder (`nmod_poly_divrem_basecase`) is optimized in several ways:

- Avoid `mpn_addmul_1` for the unreduced vector addmuls; we don't need carry propagation, so do it with plain loops and benefit from instruction-level parallelism (verified to be significantly faster on Zen 3; I presume that this will be the case for other modern architectures).
- Replace `negmod` with a plain subtraction.

The basecase->Newton thresholds in `_nmod_poly_divrem` have also been increased, but I have not touched tuning values in higher algorithms that may benefit indirectly. Note that this speeds up `nmod_poly_rem` but does not affect `nmod_poly_div`, since division without remainder uses a different algorithm.

Speedup on Zen 3 for `nmod_poly_divrem` performing a 2 len by len division with moduli of different bit sizes, monic divisor:

Non-monic divisor:

The added helper functions `_nmod_vec_nored_scalar_addmul...` can be useful elsewhere, notably in basecase LU factorization, which I'll do in a separate PR.

Possible further improvements which I did not implement:
- The AVX2 code goes up to ~27 bits; specifically, it handles the case where unreduced sums are guaranteed to fit in 64 bits. It shouldn't be too difficult to handle all 32-bit moduli: doing 128-bit additions with carry-handling sounds terrible, but you could rearrange the data to accumulate the high and low 32 bits of the products into two independent 64-bit sums.
- Neon version of the AVX2 code.
- For really small moduli, I guess you could get another 2x speedup by repacking to work on vectors of 32-bit integers instead of vectors of 64-bit integers holding 32-bit values.
- At each iteration, we reduce `R[iR]`, check if it is zero, and then multiply by the leading inverse. It would sometimes be better to multiply by the unreduced value of `R[iR]` and then check for zero, thus skipping a modular reduction.
- Vectorized modular reduction (not specific to polynomial division).
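
To make the high/low split from the first item of the list of possible improvements concrete, here is a minimal scalar sketch. It is not code from this PR and not AVX2: the names `addmul_split` and `combine_mod` are hypothetical, and it assumes all vector entries, the scalar and the modulus are below 2^32.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of FLINT): accumulate vec[i] * c into two
   64-bit sums per entry, one holding the low 32 bits of each product and
   one holding the high 32 bits. Requires vec[i] < 2^32 and c < 2^32.
   Each call adds less than 2^32 to every sum, so fewer than 2^32 calls
   on zero-initialized accumulators cannot overflow. */
static void addmul_split(uint64_t *lo, uint64_t *hi,
                         const uint64_t *vec, size_t len, uint64_t c)
{
    for (size_t i = 0; i < len; i++)
    {
        uint64_t p = vec[i] * c;          /* exact: both factors < 2^32 */
        lo[i] += p & UINT64_C(0xFFFFFFFF);
        hi[i] += p >> 32;
    }
}

/* Recombine one entry: the accumulated value is hi * 2^32 + lo, which is
   reduced modulo n here. Requires 0 < n < 2^32, so that every
   intermediate value below fits in 64 bits. */
static uint64_t combine_mod(uint64_t lo, uint64_t hi, uint64_t n)
{
    assert(n != 0 && n <= UINT32_MAX);
    uint64_t two32 = (UINT64_C(1) << 32) % n;   /* 2^32 mod n */
    return ((hi % n) * two32 + lo % n) % n;
}
```

The two sums never interact until the final recombination, so no carry has to travel between them inside the loop, which is the property that would let a vectorized version avoid 128-bit additions.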
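
For readers unfamiliar with the unreduced addmul idea described at the top of this PR, the following is a rough sketch in plain C. The function names `nored_scalar_addmul` and `reduce_vec` are hypothetical stand-ins, not the signatures of the helpers added here, and the sketch assumes the caller has already checked that every accumulated sum stays below 2^64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not FLINT's API): res[i] += vec[i] * c with no
   modular reduction and no carry propagation between entries. The caller
   must guarantee that no accumulated sum exceeds 2^64 - 1. */
static void nored_scalar_addmul(uint64_t *res, const uint64_t *vec,
                                size_t len, uint64_t c)
{
    /* Each entry is updated independently, so the compiler and CPU are
       free to overlap or vectorize the iterations. */
    for (size_t i = 0; i < len; i++)
        res[i] += vec[i] * c;
}

/* Reduce every accumulated entry modulo n once, after all addmuls. */
static void reduce_vec(uint64_t *res, size_t len, uint64_t n)
{
    assert(n != 0);
    for (size_t i = 0; i < len; i++)
        res[i] %= n;
}
```

In contrast to a multiprecision addmul, there is no dependency chain from one entry to the next, and the cost of reduction is paid once per entry rather than once per multiplication.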