Toom-Cook multiplication with arbitrary number of interpolation points #2536

Merged
fredrik-johansson merged 3 commits into flintlib:main from fredrik-johansson:toom
Dec 25, 2025

@fredrik-johansson fredrik-johansson commented Dec 23, 2025

Adds `_gr_poly_mullow_toom_serial` / `gr_poly_mullow_toom_serial` for Toom-Cook polynomial multiplication with $r$ evaluation/interpolation points for arbitrary $r$.

This can be used to multiply in time $O(N^{1+\varepsilon})$ in rings without a preexisting fast multiplication routine based on FFT or Kronecker substitution, at least when the characteristic is large enough. However, the main intended application is to support memory efficient multiplication in rings which already have fast multiplication by splitting into smaller chunks, as discussed in #2535.
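As a rough illustration of the idea (not the code added in this PR), the sketch below does one round of Toom-Cook over the integers in plain Python. It assumes that $r$ points correspond to splitting each operand into $(r+1)/2$ chunks, picks the points $0, \pm 1, \pm 2, \ldots$, uses schoolbook multiplication for the pointwise products, and interpolates with exact rationals. All function names are hypothetical, and it computes the full product rather than a truncated `mullow`.

```python
from fractions import Fraction


def mul_schoolbook(a, b):
    """Quadratic-time product of two nonempty coefficient lists."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def invert_matrix(M):
    """Gauss-Jordan inverse over the rationals (M must be invertible)."""
    n = len(M)
    A = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
         for i, row in enumerate(M)]
    for col in range(n):
        piv = next(row for row in range(col, n) if A[row][col] != 0)
        A[col], A[piv] = A[piv], A[col]
        inv = 1 / A[col][col]
        A[col] = [x * inv for x in A[col]]
        for row in range(n):
            if row != col and A[row][col] != 0:
                f = A[row][col]
                A[row] = [x - f * y for x, y in zip(A[row], A[col])]
    return [row[n:] for row in A]


def toom_mul(a, b, r):
    """One Toom-Cook round with r points (assumption: r odd, m = (r+1)/2 chunks)."""
    assert r >= 1 and r % 2 == 1
    m = (r + 1) // 2
    n = -(-max(len(a), len(b)) // m)  # chunk length, rounded up

    def chunks(p):
        p = list(p) + [0] * (m * n - len(p))
        return [p[i * n:(i + 1) * n] for i in range(m)]

    def evaluate(parts, t):
        # Value at X = t of sum_k parts[k] * X^k, a polynomial of length n.
        return [sum(c[i] * t**k for k, c in enumerate(parts)) for i in range(n)]

    A, B = chunks(a), chunks(b)
    points = [0] + [s * k for k in range(1, m) for s in (1, -1)]  # r points

    # Pointwise products, one per evaluation point.
    products = [mul_schoolbook(evaluate(A, t), evaluate(B, t)) for t in points]

    # Interpolation: apply the inverse Vandermonde matrix coefficient-wise.
    Vinv = invert_matrix([[t**j for j in range(r)] for t in points])
    out = [0] * (2 * m * n - 1)
    for j in range(r):
        for i in range(2 * n - 1):
            out[j * n + i] += sum(Vinv[j][k] * products[k][i] for k in range(r))
    return [int(x) for x in out[:len(a) + len(b) - 1]]


print(toom_mul([1, 2], [3, 4], 3))  # → [3, 10, 8]
```

The divisions in the interpolation step are why the characteristic needs to be large enough: the differences between evaluation points must be invertible in the coefficient ring.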

Some sample results below. Explanations:

  • $N$ is the length of the inputs
  • $r$ is the Toom-Cook order; the baseline row, labeled "-" in the first two tables and "1" in the others, corresponds to a direct fast (FFT-based) multiplication with `_gr_poly_mul`, which is essentially equivalent to taking $r = 1$.
  • "Time, first" is the time in seconds for doing a single multiplication; the ratio in parentheses shows the slowdown of Toom-Cook vs `_gr_poly_mul`, i.e. lower ratio is better.
  • "Time, second" is the time in seconds for a followup multiplication, which is generally faster as FLINT has done some caching (roots of unity for `fft_small`, etc.).
  • For peak memory usage, the ratio in parentheses shows the savings of Toom-Cook over `_gr_poly_mul`, i.e. higher ratio is better.
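To make the ratio conventions concrete, here is a small sketch that recomputes the parenthesized values for the $r = 3$ row of the first table from its baseline row. The helper names are hypothetical. Ratios recomputed from the rounded figures printed in the tables may differ slightly from the published ones, which were presumably derived from unrounded measurements.

```python
def slowdown(toom_seconds, direct_seconds):
    """Time ratio: Toom-Cook time over direct time (lower is better)."""
    return toom_seconds / direct_seconds


def memory_saving(direct_memory, toom_memory):
    """Memory ratio: direct peak over Toom-Cook peak (higher is better)."""
    return direct_memory / toom_memory


# Baseline row: 6.703 s, 5.174 s, 7.86 GB; r = 3 row: 8.094 s, 7.335 s, 4.30 GB.
print(f"{slowdown(8.094, 6.703):.2f}x")     # → 1.21x
print(f"{slowdown(7.335, 5.174):.2f}x")     # → 1.42x
print(f"{memory_saving(7.86, 4.30):.2f}x")  # → 1.83x
```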

$N = 10^8$, integers mod 251 (nmod8):

| $r$ | Time, first (s) | Time, second (s) | Peak memory |
|---|---|---|---|
| - | 6.703 (1.00x) | 5.174 (1.00x) | 7.86 GB (1.00x) |
| 3 | 8.094 (1.21x) | 7.335 (1.42x) | 4.30 GB (1.83x) |
| 5 | 8.476 (1.26x) | 8.094 (1.56x) | 2.75 GB (2.86x) |
| 7 | 9.459 (1.41x) | 9.093 (1.76x) | 2.34 GB (3.36x) |
| 11 | 10.252 (1.53x) | 10.069 (1.95x) | 1.56 GB (5.04x) |
| 15 | 11.669 (1.74x) | 11.501 (2.22x) | 1.36 GB (5.78x) |
| 21 | 14.215 (2.12x) | 14.030 (2.71x) | 1.24 GB (6.34x) |
| 41 | 19.642 (2.93x) | 19.645 (3.80x) | 838 MB (9.60x) |
| 81 | 31.112 (4.64x) | 31.413 (6.07x) | 617 MB (13.04x) |

$N = 10^7$, integers mod 9223372036854775837 (nmod):

| $r$ | Time, first (s) | Time, second (s) | Peak memory |
|---|---|---|---|
| - | 2.137 (1.00x) | 1.520 (1.00x) | 1.99 GB (1.00x) |
| 3 | 2.659 (1.24x) | 2.328 (1.53x) | 1.30 GB (1.53x) |
| 5 | 2.747 (1.29x) | 2.575 (1.69x) | 854 MB (2.39x) |
| 7 | 3.163 (1.48x) | 3.005 (1.98x) | 822 MB (2.48x) |
| 11 | 2.589 (1.21x) | 2.541 (1.67x) | 534 MB (3.82x) |
| 15 | 3.083 (1.44x) | 3.004 (1.98x) | 518 MB (3.93x) |
| 21 | 3.316 (1.55x) | 3.278 (2.16x) | 425 MB (4.79x) |
| 41 | 4.839 (2.26x) | 4.807 (3.16x) | 369 MB (5.52x) |
| 81 | 8.090 (3.79x) | 8.217 (5.41x) | 340 MB (5.99x) |

$N = 10^8$, integers mod 1108307720798209 (nmod) (this is an FFT prime):

| $r$ | Time, first (s) | Time, second (s) | Peak memory |
|---|---|---|---|
| 1 | 5.313 (1.00x) | 3.781 (1.00x) | 7.49 GB (1.00x) |
| 3 | 7.274 (1.37x) | 6.538 (1.73x) | 6.73 GB (1.11x) |
| 5 | 8.042 (1.51x) | 7.681 (2.03x) | 5.23 GB (1.43x) |
| 7 | 9.749 (1.83x) | 9.572 (2.53x) | 4.86 GB (1.54x) |
| 11 | 12.456 (2.34x) | 12.292 (3.25x) | 4.11 GB (1.82x) |
| 15 | 15.947 (3.00x) | 15.850 (4.19x) | 3.92 GB (1.91x) |
| 21 | 21.538 (4.05x) | 21.457 (5.67x) | 3.82 GB (1.96x) |
| 41 | 39.950 (7.52x) | 39.375 (10.41x) | 3.41 GB (2.20x) |
| 81 | 71.582 (13.47x) | 72.004 (19.04x) | 3.20 GB (2.34x) |

$N = 10^5$, integers with 10000-bit coefficients (fmpz):

| $r$ | Time, first (s) | Time, second (s) | Peak memory |
|---|---|---|---|
| 1 | 5.986 (1.00x) | 5.291 (1.00x) | 5.40 GB (1.00x) |
| 3 | 7.259 (1.21x) | 6.408 (1.21x) | 3.42 GB (1.58x) |
| 5 | 7.457 (1.25x) | 7.105 (1.34x) | 2.37 GB (2.28x) |
| 7 | 8.111 (1.35x) | 7.982 (1.51x) | 2.07 GB (2.61x) |
| 11 | 8.925 (1.49x) | 9.378 (1.77x) | 1.54 GB (3.51x) |
| 15 | 10.791 (1.80x) | 11.858 (2.24x) | 1.40 GB (3.86x) |
| 21 | 11.042 (1.84x) | 12.107 (2.29x) | 1.24 GB (4.35x) |
| 41 | 17.378 (2.90x) | 19.815 (3.75x) | 1.03 GB (5.24x) |
| 81 | 44.022 (7.35x) | 47.56 (8.99x) | 982 MB (5.63x) |

$N = 10^6$, integers with 1000-bit coefficients (fmpz):

| $r$ | Time, first (s) | Time, second (s) | Peak memory |
|---|---|---|---|
| 1 | 6.213 (1.00x) | 5.416 (1.00x) | 5.56 GB (1.00x) |
| 3 | 7.669 (1.23x) | 6.533 (1.21x) | 3.67 GB (1.51x) |
| 5 | 8.392 (1.35x) | 7.583 (1.40x) | 2.74 GB (2.03x) |
| 7 | 8.536 (1.37x) | 7.946 (1.47x) | 2.27 GB (2.45x) |
| 11 | 10.31 (1.66x) | 9.955 (1.84x) | 1.82 GB (3.05x) |
| 15 | 12.633 (2.03x) | 12.711 (2.35x) | 1.65 GB (3.37x) |
| 21 | 12.688 (2.04x) | 13.374 (2.47x) | 1.40 GB (3.97x) |
| 41 | 22.8 (3.67x) | 26.986 (4.98x) | 1.26 GB (4.41x) |
| 81 | 82.393 (13.26x) | 103.886 (19.18x) | 1.27 GB (4.38x) |

Observations:

This works especially well for nmod8. It makes the least sense to use when working with an FFT prime since the memory saving is at most going to be ~2x. In most other cases memory can be reduced at least 5x with just a 2-4x slowdown. It's also worth noting that small $r$ in many cases can give a ~2x reduction in memory with <1.5x slowdown.

In practice, if you want to minimize memory usage, you wouldn't use something like $r = 81$; it should be more efficient to use smaller $r$ together with another recursive round of Toom-Cook.
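To see what a recursive scheme looks like structurally, the sketch below counts chunks and pointwise products for a sequence of rounds. It assumes the textbook relation for a full product, where a round with $r$ points splits each operand into $(r+1)/2$ chunks and produces $r$ pointwise products; whether the functions in this PR follow exactly that relation is an assumption. The helper `toom_plan` is hypothetical, and the count says nothing about actual time or memory.

```python
def toom_plan(orders):
    """Return (chunks per operand, number of base products) for nested rounds.

    Assumption: a round with r points splits into (r + 1) // 2 chunks.
    """
    chunks, products = 1, 1
    for r in orders:
        if r < 1:
            raise ValueError("each order must be at least 1")
        chunks *= (r + 1) // 2
        products *= r
    return chunks, products


print(toom_plan([81]))    # → (41, 81)
print(toom_plan([9, 9]))  # → (25, 81)
```

Under this assumption, two rounds with $r = 9$ end in the same number of base multiplications as a single round with $r = 81$, while each round only evaluates and interpolates at 9 points.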

For fmpz with small coefficients, this algorithm isn't ideal as the evaluation/interpolation coefficients cause some blowup (with 1000-bit coefficients, you can see that $r = 81$ uses more memory than $r = 41$). A better alternative there apart from recursive Toom-Cook would be to do incremental CRT.
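One possible reading of "incremental CRT" is sketched below: multiply the integer polynomials modulo one prime at a time and fold each modular product into a running result via the Chinese remainder theorem, so only one modular product is live at any moment. This is an assumption about what is meant, not FLINT code. The function names are hypothetical, the modular product is a schoolbook placeholder for a fast routine, and the caller must supply pairwise coprime moduli whose product exceeds twice the largest absolute coefficient of the result.

```python
def mul_mod(a, b, p):
    """Schoolbook product of coefficient lists modulo p (stand-in for a fast routine)."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] = (out[i + j] + x * y) % p
    return out


def crt_step(x, M, y, p):
    """Combine x mod M with y mod p into the unique residue mod M*p."""
    t = ((y - x) * pow(M, -1, p)) % p  # modular inverse needs Python 3.8+
    return x + M * t


def mul_incremental_crt(a, b, primes):
    """Integer polynomial product assembled one prime at a time."""
    result = [0] * (len(a) + len(b) - 1)
    M = 1
    for p in primes:
        prod = mul_mod(a, b, p)  # only this modular product is held
        result = [crt_step(x, M, y, p) for x, y in zip(result, prod)]
        M *= p
    # Map residues in [0, M) to the symmetric range to recover signs.
    return [x - M if x > M // 2 else x for x in result]


# Three Mersenne primes; their product is far larger than any coefficient here.
print(mul_incremental_crt([1, -2], [3, 4], [8191, 131071, 524287]))  # → [3, -2, -8]
```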

@fredrik-johansson fredrik-johansson merged commit bbde32c into flintlib:main Dec 25, 2025
13 checks passed
@fredrik-johansson fredrik-johansson deleted the toom branch December 25, 2025 00:15