Optimized Danilevsky charpoly; strided dot products #2684
Merged
Main contribution of this PR: optimize `nmod_mat_charpoly` and `gr_mat_charpoly` over fields by replacing looped scalar operations in the Danilevsky algorithm with vector operations.

To this end, we introduce strided dot product functions for the most common types, including generics. In the case of `nmod_mat_charpoly_danilevsky`, each column was copied into a contiguous temporary buffer in order to use the normal `_nmod_vec_dot`; doing it directly with `_nmod_vec_dot_strided` turns out to be a bit faster. The copying method is retained as a fallback for the new `_gr_vec_dot_strided` in cases where a fast `_gr_vec_dot` exists but `_gr_vec_dot_strided` itself is not overloaded.

An alternative I've also considered is to implicitly transpose the matrix in the Danilevsky algorithm so that the dot products are contiguous and the `vec_addmul_scalar` operations are noncontiguous. This may be better or worse depending on the ring. The strided dot product has other potential uses regardless, so it doesn't hurt to try the current version first.

Other changes:

- `nmod_mat_charpoly_danilevsky` gains a return value, allowing it to fail gracefully when encountering an impossible inverse. This means that we can use it even when the modulus is not prime, with the O(n^4) Berkowitz fallback only when it fails.
- `nmod_mat_charpoly` refers to the `nmod8` implementation for prime moduli <= 255, which helps for large matrices since Danilevsky has poor locality.
- Change some charpoly algorithm cutoffs.
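To illustrate the two ways of taking a dot product against a matrix column discussed above, here is a minimal standalone sketch. It does not use FLINT's API: the names `dot_mod`, `dot_mod_strided` and `dot_mod_strided_via_copy` are hypothetical, and the arithmetic assumes a modulus below 2^32 with entries already reduced, so that plain 64-bit accumulation is enough. The actual `_nmod_vec_dot_strided` and `_gr_vec_dot_strided` signatures and reduction strategy may differ.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Contiguous dot product of a[0..len) and b[0..len) modulo n.
   Assumption for this sketch: 0 < n < 2^32, entries reduced mod n,
   so s + a[i] * b[i] never overflows 64 bits. */
static uint32_t dot_mod(const uint32_t *a, const uint32_t *b,
                        size_t len, uint32_t n)
{
    uint64_t s = 0;
    assert(n != 0);
    for (size_t i = 0; i < len; i++)
        s = (s + (uint64_t) a[i] * b[i]) % n;
    return (uint32_t) s;
}

/* Strided dot product: reads a[i * stride_a] and b[i * stride_b]
   directly, with no temporary buffer. */
static uint32_t dot_mod_strided(const uint32_t *a, size_t stride_a,
                                const uint32_t *b, size_t stride_b,
                                size_t len, uint32_t n)
{
    uint64_t s = 0;
    assert(n != 0);
    for (size_t i = 0; i < len; i++)
        s = (s + (uint64_t) a[i * stride_a] * b[i * stride_b]) % n;
    return (uint32_t) s;
}

/* Copying fallback: gather the strided operand into a contiguous
   buffer, then reuse the contiguous kernel.
   Returns 1 on success, 0 if the temporary allocation fails. */
static int dot_mod_strided_via_copy(uint32_t *res, const uint32_t *a,
                                    const uint32_t *b, size_t stride_b,
                                    size_t len, uint32_t n)
{
    uint32_t *tmp;

    if (len == 0)
    {
        *res = 0;
        return 1;
    }

    tmp = malloc(len * sizeof(uint32_t));
    if (tmp == NULL)
        return 0;

    for (size_t i = 0; i < len; i++)
        tmp[i] = b[i * stride_b];

    *res = dot_mod(a, tmp, len, n);
    free(tmp);
    return 1;
}
```

For a row-major 3 x 3 matrix stored in `m`, the product of row 0 with column 1 would be `dot_mod_strided(m, 1, m + 1, 3, 3, n)`. The copying variant computes the same value at the cost of one extra pass over the column, which is the overhead the strided version is meant to avoid.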
Selection of benchmark results:
- `nmod_mat_charpoly`, mod = nextprime(2^63), random input matrix
- `fmpz_mat_charpoly`, randbits(10) entries; the multimodular algorithm benefits directly from the faster `nmod_mat_charpoly`
- `nmod_mat_charpoly`, mod = 17; demonstrating the added cache benefits of switching to `nmod8` internally
- `nmod_mat_charpoly`, mod = nextprime(1000) ^ 2; demonstrating the speedup for composite modulus when Danilevsky succeeds and we don't need to resort to Berkowitz
- `gr_mat_charpoly` with `mpn_mod` entries, mod = nextprime(2^200)
- `fq_nmod_mat_charpoly` (or `gr_mat_charpoly` with `fq_nmod` entries) for GF(7^16)