Optimized Danilevsky charpoly; strided dot products by fredrik-johansson · Pull Request #2684 · flintlib/flint

fredrik-johansson · 2026-05-17T18:44:34Z

Main contribution of this PR: optimize nmod_mat_charpoly and gr_mat_charpoly over fields by replacing looped scalar operations in the Danilevsky algorithm with vector operations.

To this end, we introduce strided dot product functions for the most common types, including generics. In the case of nmod_mat_charpoly_danilevsky, each column was copied into a contiguous temporary buffer in order to use the normal _nmod_vec_dot; doing it directly with _nmod_vec_dot_strided turns out to be a bit faster. The copying method is retained as a fallback for the new _gr_vec_dot_strided in cases where a fast _gr_vec_dot exists but _gr_vec_dot_strided itself is not overloaded.

An alternative I've also considered is to implicitly transpose the matrix in the Danilevsky algorithm so that the dot products are contiguous and the vec_addmul_scalar operations are noncontiguous. This may be better or worse depending on the ring. The strided dot product has other potential uses regardless, so it doesn't hurt to try the current version first.

Other changes:

nmod_mat_charpoly_danilevsky gains a return value, allowing it to fail gracefully when encountering an impossible inverse. This means that we can use it even when the modulus is not prime, with the O(n^4) Berkowitz fallback only when it fails.
nmod_mat_charpoly refers to the nmod8 implementation for prime moduli <= 255, which helps for large matrices since Danilevsky has poor locality.
Change some charpoly algorithm cutoffs.

Selection of benchmark results:

nmod_mat_charpoly, mod = nextprime(2^63), random input matrix:

       n        old       new  speedup
       4   2.04e-07  2.05e-07  0.995
       8   1.74e-06  1.75e-06  0.994
      16   1.06e-05  7.91e-06  1.340
      32   6.55e-05  4.64e-05  1.412
      64   0.000467  0.000316  1.478
     128     0.0042   0.00287  1.463
     256     0.0395     0.026  1.519
     512      0.343     0.264  1.299
    1024      2.976     2.258  1.318

fmpz_mat_charpoly, randbits(10) entries; the multimodular algorithm benefits directly from the faster nmod_mat_charpoly:

       n        old       new  speedup
       4   7.38e-07  7.27e-07  1.015
       8   5.64e-06  5.59e-06  1.009
      16   5.01e-05  4.06e-05  1.234
      32   0.000501  0.000374  1.340
      64    0.00747   0.00535  1.396
     128       0.12    0.0865  1.387
     256      2.239     1.615  1.386
     512     42.961    33.829  1.270

nmod_mat_charpoly, mod = 17; demonstrating the added cache benefits of switching to nmod8 internally:

       4   1.07e-07  1.12e-07  0.955
       8   8.16e-07  8.18e-07  0.998
      16   8.42e-06  5.32e-06  1.583
      32   5.63e-05  2.78e-05  2.025
      64   0.000418  0.000167  2.503
     128    0.00363   0.00106  3.425
     256     0.0348   0.00874  3.982
     512       0.33     0.101  3.267
    1024      2.808     0.902  3.113
    2048     30.177    13.179  2.290

nmod_mat_charpoly, mod = nextprime(1000) ^ 2; demonstrating the speedup for composite modulus when Danilevsky succeeds and we don't need to resort to Berkowitz:

       n        old       new  speedup
       4   1.06e-07  1.08e-07  0.981
       8   7.93e-07  7.91e-07  1.003
      16   8.57e-06  5.71e-06  1.501
      32   8.17e-05  3.51e-05  2.328
      64   0.000779   0.00025  3.116
     128    0.00959   0.00259  3.703
     256      0.141    0.0264  5.341
     512      2.475     0.251  9.861
    1024     40.665     2.234  18.203

gr_mat_charpoly with mpn_mod entries, mod = nextprime(2^200):

       n        old       new  speedup
       2   9.68e-07  9.83e-07  0.985
       4   4.41e-06  4.04e-06  1.092
       8   2.56e-05  1.79e-05  1.430
      16   0.000207  0.000105  1.971
      32    0.00166  0.000734  2.262
      64     0.0133   0.00534  2.491
     128      0.108    0.0426  2.535
     256       0.86     0.329  2.614
     512      6.973     2.706  2.577

fq_nmod_mat_charpoly (or gr_mat_charpoly with fq_nmod entries) for GF(7^16):

       n        old       new  speedup
       2   3.97e-06  4.08e-06  0.973
       4   1.56e-05  1.62e-05  0.963
       8     0.0001  9.02e-05  1.109
      16    0.00128  0.000975  1.313
      32     0.0109   0.00795  1.371
      64     0.0895    0.0637  1.405
     128      0.729     0.511  1.427
     256      5.875     4.133  1.421
     512     49.228    37.639  1.308

fredrik-johansson added 2 commits May 17, 2026 19:43

Strided dot products; use vector ops in charpoly_danilevsky

de3d1de

Improve test coverage

35793b2

fredrik-johansson merged commit 2196bbe into flintlib:main May 18, 2026
13 checks passed

fredrik-johansson deleted the strided branch May 18, 2026 11:23

fredrik-johansson mentioned this pull request May 20, 2026

Speed up fmpz_mat_charpoly #2691

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimized Danilevsky charpoly; strided dot products#2684

Optimized Danilevsky charpoly; strided dot products#2684
fredrik-johansson merged 2 commits into
flintlib:mainfrom
fredrik-johansson:strided

fredrik-johansson commented May 17, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

fredrik-johansson commented May 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

fredrik-johansson commented May 17, 2026 •

edited

Loading