Skip to content

Optimized Danilevsky charpoly; strided dot products#2684

Merged
fredrik-johansson merged 2 commits into
flintlib:mainfrom
fredrik-johansson:strided
May 18, 2026
Merged

Optimized Danilevsky charpoly; strided dot products#2684
fredrik-johansson merged 2 commits into
flintlib:mainfrom
fredrik-johansson:strided

Conversation

@fredrik-johansson
Copy link
Copy Markdown
Collaborator

@fredrik-johansson fredrik-johansson commented May 17, 2026

Main contribution of this PR: optimize nmod_mat_charpoly and gr_mat_charpoly over fields by replacing looped scalar operations in the Danilevsky algorithm with vector operations.

To this end, we introduce strided dot product functions for the most common types, including generics. In the case of nmod_mat_charpoly_danilevsky, each column was copied into a contiguous temporary buffer in order to use the normal _nmod_vec_dot; doing it directly with _nmod_vec_dot_strided turns out to be a bit faster. The copying method is retained as a fallback for the new _gr_vec_dot_strided in cases where a fast _gr_vec_dot exists but _gr_vec_dot_strided itself is not overloaded.

An alternative I've also considered is to implicitly transpose the matrix in the Danilevsky algorithm so that the dot products are contiguous and the vec_addmul_scalar operations are noncontiguous. This may be better or worse depending on the ring. The strided dot product has other potential uses regardless, so it doesn't hurt to try the current version first.

Other changes:

  • nmod_mat_charpoly_danilevsky gains a return value, allowing it to fail gracefully when encountering an impossible inverse. This means that we can use it even when the modulus is not prime, with the O(n^4) Berkowitz fallback only when it fails.

  • nmod_mat_charpoly refers to the nmod8 implementation for prime moduli <= 255, which helps for large matrices since Danilevsky has poor locality.

  • Change some charpoly algorithm cutoffs.

Selection of benchmark results:

  • nmod_mat_charpoly, mod = nextprime(2^63), random input matrix:
       n        old       new  speedup
       4   2.04e-07  2.05e-07  0.995
       8   1.74e-06  1.75e-06  0.994
      16   1.06e-05  7.91e-06  1.340
      32   6.55e-05  4.64e-05  1.412
      64   0.000467  0.000316  1.478
     128     0.0042   0.00287  1.463
     256     0.0395     0.026  1.519
     512      0.343     0.264  1.299
    1024      2.976     2.258  1.318
  • fmpz_mat_charpoly, randbits(10) entries; the multimodular algorithm benefits directly from the faster nmod_mat_charpoly:
       n        old       new  speedup
       4   7.38e-07  7.27e-07  1.015
       8   5.64e-06  5.59e-06  1.009
      16   5.01e-05  4.06e-05  1.234
      32   0.000501  0.000374  1.340
      64    0.00747   0.00535  1.396
     128       0.12    0.0865  1.387
     256      2.239     1.615  1.386
     512     42.961    33.829  1.270
  • nmod_mat_charpoly, mod = 17; demonstrating the added cache benefits of switching to nmod8 internally:
       4   1.07e-07  1.12e-07  0.955
       8   8.16e-07  8.18e-07  0.998
      16   8.42e-06  5.32e-06  1.583
      32   5.63e-05  2.78e-05  2.025
      64   0.000418  0.000167  2.503
     128    0.00363   0.00106  3.425
     256     0.0348   0.00874  3.982
     512       0.33     0.101  3.267
    1024      2.808     0.902  3.113
    2048     30.177    13.179  2.290
  • nmod_mat_charpoly, mod = nextprime(1000) ^ 2; demonstrating the speedup for composite modulus when Danilevsky succeeds and we don't need to resort to Berkowitz:
       n        old       new  speedup
       4   1.06e-07  1.08e-07  0.981
       8   7.93e-07  7.91e-07  1.003
      16   8.57e-06  5.71e-06  1.501
      32   8.17e-05  3.51e-05  2.328
      64   0.000779   0.00025  3.116
     128    0.00959   0.00259  3.703
     256      0.141    0.0264  5.341
     512      2.475     0.251  9.861
    1024     40.665     2.234  18.203
  • gr_mat_charpoly with mpn_mod entries, mod = nextprime(2^200):
       n        old       new  speedup
       2   9.68e-07  9.83e-07  0.985
       4   4.41e-06  4.04e-06  1.092
       8   2.56e-05  1.79e-05  1.430
      16   0.000207  0.000105  1.971
      32    0.00166  0.000734  2.262
      64     0.0133   0.00534  2.491
     128      0.108    0.0426  2.535
     256       0.86     0.329  2.614
     512      6.973     2.706  2.577
  • fq_nmod_mat_charpoly (or gr_mat_charpoly with fq_nmod entries) for GF(7^16):
       n        old       new  speedup
       2   3.97e-06  4.08e-06  0.973
       4   1.56e-05  1.62e-05  0.963
       8     0.0001  9.02e-05  1.109
      16    0.00128  0.000975  1.313
      32     0.0109   0.00795  1.371
      64     0.0895    0.0637  1.405
     128      0.729     0.511  1.427
     256      5.875     4.133  1.421
     512     49.228    37.639  1.308

@fredrik-johansson fredrik-johansson merged commit 2196bbe into flintlib:main May 18, 2026
13 checks passed
@fredrik-johansson fredrik-johansson deleted the strided branch May 18, 2026 11:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant