
Implement Kinoshita-Li power series composition #2658

Merged
fredrik-johansson merged 4 commits into flintlib:main from fredrik-johansson:compose
May 5, 2026

Conversation

@fredrik-johansson
Collaborator

Adds _gr_poly_compose_series_kinoshita_li / gr_poly_compose_series_kinoshita_li for power series composition in $O(M(n) \log n)$ over generic rings using the Kinoshita-Li algorithm (by @EntropyIncreaser). See discussion in #1911.

The threshold where this starts to beat the Brent-Kung baby-step giant-step algorithm seems to be between 100 and 1000, obviously depending on the ring and input; see below for benchmarks. Note that this algorithm might be preferable to Brent-Kung even in cases where it is not faster, as it uses less memory asymptotically ($O(n \log n)$ vs $O(n^{1.5})$ coefficients).

I have not yet switched over any other series composition and reversion functions (e.g. fmpz_poly_compose_series) to use this algorithm.
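For reference, the operation in question is truncated composition: given $f$ and $g$ with $g(0) = 0$, compute $h = f(g(x)) \bmod x^n$. The following is a minimal Python sketch of the semantics only (a naive Horner-style oracle, nothing like either algorithm discussed here), useful as a correctness reference:

```python
# Naive reference for truncated composition h = f(g(x)) mod x^n with
# g[0] == 0.  Coefficient lists of ints; illustrative only -- both
# Brent-Kung and Kinoshita-Li are asymptotically much faster.

def mullow(a, b, n):
    """Product of coefficient lists a and b, truncated to length n."""
    c = [0] * min(n, len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            if i + j < n:
                c[i + j] += ai * bj
    return c

def compose_series_naive(f, g, n):
    """f(g(x)) mod x^n by Horner's rule; requires g[0] == 0."""
    assert g[0] == 0
    h = [0]
    for c in reversed(f):
        h = mullow(h, g, n)
        h[0] += c
    return (h + [0] * n)[:n]
```

For example, with $f = 1 + 2x + 3x^2$ and $g = x + x^2$ one gets $f(g) \equiv 1 + 2x + 5x^2 + 6x^3 \pmod{x^4}$.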

This code was written using Claude Sonnet. Methodology: I first asked Claude to read the paper and the comments in #1911 and implement the algorithm following FLINT coding conventions, using gr_poly_compose_series_brent_kung as a guideline for the API. I explicitly instructed it to represent the bivariate polynomials in flat format and to implement KS multiplication. The first draft looked OK structurally but failed the tests. When I asked Claude to debug, it asked me for a copy of @vneiger's Sage implementation. With this as a blueprint, Claude was able to fix the bugs. After that, I reviewed the code and prompted Claude to make a series of improvements:

  • Convert from recursive to iterative form
  • Avoid unnecessary temporary vectors and copying
  • Fix an unsafe pointer swap of temporary vectors
  • Avoid excessive zero-padding and tighten allocations
  • Switch the Kronecker substitution format for bivariate multiplication from Claude's initial choice of x-major to y-major. I did extensive benchmarking to verify that y-major consistently performs better, sometimes dramatically so with arb, whose univariate polynomial multiplication is designed for typical univariate input and is poorly tuned for coefficients in the scrambled KS order
  • (Not kept in the final version): use Harvey's multipoint KS instead of standard KS; this performed worse on every example I tested, though that may have been due to a suboptimal implementation, and it may be worth retrying in the future
  • Fix errors in comments
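To illustrate what the x-major/y-major choice refers to: Kronecker substitution flattens a bivariate polynomial into a univariate one, and the two formats differ only in which variable becomes the slow-varying index of the flattened array. A small Python sketch follows; note that taking "y-major" to mean position $j \cdot D_X + i$ for the coefficient of $x^i y^j$ is my reading of the terminology, not necessarily FLINT's convention.

```python
# Illustration of Kronecker substitution (KS) for bivariate
# multiplication.  DX and DY must bound the x- and y-degrees of the
# *product* so that no flattened positions collide.  "y_major" here
# means y is the slow-varying index (position j*DX + i) -- an assumed
# reading of the PR's terminology, not necessarily FLINT's convention.

def ks_pack(P, DX, DY, y_major=True):
    """Flatten coefficients P[i][j] of x^i y^j into a 1D array."""
    u = [0] * (DX * DY)
    for i, row in enumerate(P):
        for j, c in enumerate(row):
            u[j * DX + i if y_major else i * DY + j] = c
    return u

def mul(a, b):
    """Plain univariate multiplication."""
    c = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] += ai * bj
    return c

def ks_mul(P, Q, DX, DY, y_major=True):
    """Multiply bivariate P and Q through one univariate product."""
    u = mul(ks_pack(P, DX, DY, y_major), ks_pack(Q, DX, DY, y_major))
    u += [0] * max(0, DX * DY - len(u))
    if y_major:
        return [[u[j * DX + i] for j in range(DY)] for i in range(DX)]
    return [[u[i * DY + j] for j in range(DY)] for i in range(DX)]
```

Both orders produce the same bivariate product; the performance difference observed in this PR comes from how the underlying univariate multiplication handles the resulting coefficient layout.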

nmod example

Timings in seconds to compute $f(g(x)) \bmod x^n$ with random $f$ and $g$, nmod coefficients mod $p = 2^{63} + 29$. Speedup ratios are relative to Brent-Kung.

        n     Brent-Kung Kinoshita-Li (speedup)
       100      7.79e-05    0.000162  (0.481)
       200      0.000285    0.000347  (0.821)
       400      0.000869    0.000839  (1.036)
       800       0.00254     0.00193  (1.316)
      1600       0.00763     0.00434  (1.758)
      3200        0.0238     0.00974  (2.444)
      6400        0.0776      0.0218  (3.560)
     12800         0.257      0.0487  (5.277)
     25600         0.882       0.115  (7.670)
     51200         3.036       0.266  (11.414)
    102400         10.82       0.646  (16.749)
    204800         37.76       1.556  (24.267)
    409600       140.573       3.736  (37.627)
    819200       497.694      10.274  (48.442)

Excellent speedup with evident quasilinear asymptotics!

fmpz example

Example timings with fmpz coefficients, $f = g = \sum_{k \ge 0} F_k x^k$ (Fibonacci generating function):

        n     Brent-Kung Kinoshita-Li (speedup)
       100      0.000298    0.000443  (0.673)
       200       0.00243     0.00284  (0.856)
       400        0.0153      0.0144  (1.062)
       800         0.108      0.0683  (1.581)
      1600         0.682       0.339  (2.012)
      3200         4.651       1.716  (2.710)
      6400        33.263       8.462  (3.931)
     12800        266.59      43.331  (6.152)

fmpq example

Example with fmpq coefficients, $f = g = \log(1+x)$:

        n     Brent-Kung Kinoshita-Li (speedup)
       100       0.00516      0.0053  (0.974)
       200        0.0333      0.0268  (1.243)
       400         0.235       0.169  (1.391)
       800         1.821       1.046  (1.741)
      1600        15.511       6.986  (2.220)
      3200       153.219      48.735  (3.144)

Note that the implementation has not been optimized specifically for fraction fields; one can presumably gain something by handling denominators less generically.

arb example

Example with arb coefficients, prec = 128, $f = g = \log(1+x)$:

        n     Brent-Kung Kinoshita-Li (speedup)      [x^(n-1)]  (Brent-Kung)      [x^(n-1)]  (Kinoshita-Li)
       100      0.000866     0.00178  (0.487)           [5e+17 +/- 9.13e-20]           [5e+17 +/- 8.90e-20] 
       200       0.00392     0.00606  (0.647)               [2e+37 +/- 7.35]               [2e+37 +/- 6.40] 
       400        0.0144       0.017  (0.847)           [8e+76 +/- 4.50e+40]           [8e+76 +/- 4.14e+40] 
       800        0.0597      0.0603  (0.990)         [2e+156 +/- 2.22e+120]         [2e+156 +/- 1.91e+120] 
      1600         0.218       0.174  (1.253)         [2e+315 +/- 4.61e+279]         [2e+315 +/- 4.32e+279] 
      3200         0.946       0.545  (1.736)         [5e+633 +/- 2.55e+598]         [5e+633 +/- 2.26e+598] 
      6400         4.441       2.008  (2.212)       [8e+1270 +/- 6.44e+1235]       [8e+1270 +/- 6.22e+1235] 
     12800        23.467       8.449  (2.777)       [3e+2545 +/- 5.20e+2510]       [3e+2545 +/- 4.75e+2510] 
     25600       130.642      37.419  (3.491)       [8e+5094 +/- 2.77e+5060]       [8e+5094 +/- 2.74e+5060] 

Same polynomials with prec = 3333:

        n     Brent-Kung Kinoshita-Li (speedup)      [x^(n-1)]  (Brent-Kung)      [x^(n-1)]  (Kinoshita-Li)
       100        0.0175       0.023  (0.761)          [5e+17 +/- 1.53e-984]          [5e+17 +/- 1.61e-984] 
       200        0.0621      0.0575  (1.080)          [2e+37 +/- 1.23e-964]          [2e+37 +/- 1.17e-964] 
       400         0.225       0.154  (1.461)          [8e+76 +/- 8.11e-925]          [8e+76 +/- 7.63e-925] 
       800         0.827        0.39  (2.121)         [2e+156 +/- 3.75e-845]         [2e+156 +/- 3.55e-845] 
      1600         2.538       1.008  (2.518)         [2e+315 +/- 8.40e-686]         [2e+315 +/- 8.05e-686] 
      3200         8.114       2.715  (2.989)         [5e+633 +/- 4.32e-367]         [5e+633 +/- 4.20e-367] 
      6400         27.16       8.036  (3.380)        [8e+1270 +/- 1.18e+271]        [8e+1270 +/- 1.15e+271] 
     12800       100.991      24.848  (4.064)       [3e+2545 +/- 8.82e+1545]       [3e+2545 +/- 8.76e+1545] 

Same polynomials with prec = 333333:

        n     Brent-Kung Kinoshita-Li (speedup)      [x^(n-1)]  (Brent-Kung)      [x^(n-1)]  (Kinoshita-Li)
       100         3.965       3.735  (1.062)       [5e+17 +/- 1.93e-100324]       [5e+17 +/- 2.04e-100324] 
       200        15.886       9.082  (1.749)       [2e+37 +/- 1.56e-100304]       [2e+37 +/- 1.48e-100304] 
       400        62.491      22.944  (2.724)       [8e+76 +/- 1.02e-100264]       [8e+76 +/- 9.63e-100265] 
       800       244.042      63.805  (3.825)      [2e+156 +/- 4.74e-100185]      [2e+156 +/- 4.48e-100185] 

Numerical stability with arb coefficients seems to be quite similar to that of Brent-Kung. I was worried in advance that Kinoshita-Li might be numerically unstable, but it turns out to be very well behaved! The only catch is that it isn't consistently faster than Brent-Kung at low precision (where numerically stable polynomial multiplication is not necessarily quasilinear), but it clearly wins when the precision grows.

qqbar example

Example in a ring where we currently only use schoolbook multiplication: $f = g = \sum_{k \ge 1} (\sqrt{2}+k) x^k$ with qqbar coefficients.

        n     Brent-Kung Kinoshita-Li (speedup)
        25          0.13       0.138  (0.942)
        50          1.05       0.852  (1.232)
       100         7.355       4.968  (1.480)
       200        54.726      29.232  (1.872)
       400       393.888     179.693  (2.192)

Multivariate example

Example with gr_poly(nmod) coefficients, $f = g = \sum_k (k+t) x^k \in ((\mathbb{Z}/17 \mathbb{Z})[t])[x]$.

        n     Brent-Kung Kinoshita-Li (speedup)
        25      7.29e-05    0.000201  (0.363)
        50      0.000329    0.000852  (0.386)
       100       0.00151      0.0038  (0.397)
       200       0.00757      0.0183  (0.414)
       400        0.0419       0.134  (0.313)
       800         0.305       0.699  (0.436)
      1600         2.068       3.821  (0.541)
      3200         13.34      17.828  (0.748)
      6400        81.608      89.788  (0.909)

Here Kinoshita-Li seems to underperform. We might want some heuristic to detect such rings before using it by default in gr_poly_compose_series.

@fredrik-johansson
Collaborator Author

fredrik-johansson commented May 3, 2026

I have now added fmpq_poly_compose_series_kinoshita_li which is much faster than the gr version with fmpq coefficients as it works with polynomials on common denominators. Claude was able to generate this translation of the gr version essentially in one shot, requiring only minor manual cleanup.
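The common-denominator representation (an integer-coefficient polynomial paired with a single denominator, as in fmpq_poly) can be sketched in Python with exact rationals; the function names below are hypothetical, not FLINT API:

```python
from fractions import Fraction
from functools import reduce
from math import gcd

# Hypothetical sketch (not FLINT API) of a common-denominator
# representation: integer coefficients over a single denominator.
# Arithmetic on (coeffs, den) pairs replaces per-coefficient rational
# arithmetic with integer arithmetic plus one final gcd reduction.

def to_common_den(fracs):
    """Convert a list of Fractions to (integer coeffs, common den)."""
    den = reduce(lambda a, f: a * f.denominator // gcd(a, f.denominator),
                 fracs, 1)
    return [f.numerator * (den // f.denominator) for f in fracs], den

def mul_common_den(a, da, b, db):
    """Multiply polynomials a/da and b/db, reducing to lowest terms."""
    c = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] += ai * bj
    g = reduce(gcd, c, da * db)
    return [x // g for x in c], da * db // g
```

E.g. $1/2 + x/3$ becomes $(3 + 2x)/6$, and squaring it stays entirely in integer arithmetic until the final reduction.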

In addition, I have optimized fmpq_poly_compose_series_brent_kung by using fmpz_mat with separate denominators instead of fmpq_mat, which speeds up composition by roughly 1.5x for small to moderate $n$.

The main function fmpq_poly_compose_series has been updated to choose Kinoshita-Li instead of Brent-Kung for $n \ge 250$.

Also, fmpq_poly_revert_series has been updated to use Newton iteration except for small $n$ now that composition is significantly faster.
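Newton iteration for reversion solves $f(g(x)) = x$ by iterating $g \leftarrow g - (f(g) - x)/f'(g)$ with precision doubling, so each step costs essentially one composition plus one series division; with quasilinear composition the total is dominated by the last step. A self-contained Python sketch with exact rationals (illustrative only, not the FLINT implementation):

```python
from fractions import Fraction as Q

# Sketch of series reversion by Newton iteration over the rationals.
# Each Newton step doubles the working precision and costs one
# composition f(g) plus one series division.

def mullow(a, b, n):
    """a*b mod x^n."""
    c = [Q(0)] * n
    for i, ai in enumerate(a[:n]):
        for j, bj in enumerate(b[:n - i]):
            c[i + j] += ai * bj
    return c

def compose(f, g, n):
    """f(g(x)) mod x^n by Horner's rule; requires g[0] == 0."""
    h = [Q(0)] * n
    for c in reversed(f):
        h = mullow(h, g, n)
        h[0] += c
    return h

def inv_series(a, n):
    """1/a mod x^n by Newton iteration; requires a[0] != 0."""
    b = [Q(1) / a[0]]
    while len(b) < n:
        m = min(2 * len(b), n)
        e = mullow(a, b, m)                       # a*b = 1 + error
        b = [2 * bi - ci for bi, ci in
             zip(b + [Q(0)] * (m - len(b)), mullow(b, e, m))]
    return b

def derivative(a):
    return [i * ai for i, ai in enumerate(a)][1:]

def revert_series(f, n):
    """g with f(g(x)) = x mod x^n; requires f[0] == 0, f[1] != 0."""
    g = [Q(0), Q(1) / f[1]]
    m = 2
    while m < n:
        m = min(2 * m, n)
        fg = compose(f, g, m)
        fg[1] -= 1                                # f(g) - x
        fpg = compose(derivative(f), g, m)        # f'(g)
        step = mullow(fg, inv_series(fpg, m), m)
        g = [gi - si for gi, si in zip(g + [Q(0)] * (m - len(g)), step)]
    return g[:n]
```

As a sanity check, the reversion of $f = x/(1-x)$ is $x/(1+x) = x - x^2 + x^3 - \cdots$.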

Some timing results for fmpq_poly, old code vs new (speedup for small $n$ due to optimized Brent-Kung; for larger $n$ due to using Kinoshita-Li):

Composition, $f = g = \log(1+x)$:

       n            old           new    (speedup)
      12       6.12e-06       2.5e-06    (2.448)
      25       4.33e-05      2.76e-05    (1.569)
      50       0.000417       0.00024    (1.738)
     100         0.0034       0.00184    (1.848)
     200         0.0217        0.0134    (1.619)
     400          0.155        0.0769    (2.016)
     800          1.195         0.512    (2.334)
    1600          9.823         3.053    (3.217)
    3200         97.485        18.235    (5.346)

Composition, $f = g = \exp(x)-1$:

       n            old           new    (speedup)
      12       5.47e-06      2.16e-06    (2.532)
      25        7.5e-05      4.89e-05    (1.534)
      50       0.000592      0.000357    (1.658)
     100        0.00402        0.0026    (1.546)
     200         0.0265        0.0195    (1.359)
     400          0.205         0.136    (1.507)
     800           1.64         0.768    (2.135)
    1600         13.877         4.818    (2.880)

Composition, $f = g = \sqrt{1+x}-1$:

       n            old           new    (speedup)
      12        5.7e-06       2.1e-06    (2.714)
      25        3.7e-05      2.24e-05    (1.652)
      50       0.000333       0.00019    (1.753)
     100        0.00158        0.0011    (1.436)
     200        0.00824       0.00762    (1.081)
     400         0.0482        0.0346    (1.393)
     800          0.321         0.177    (1.814)
    1600            2.3         0.859    (2.678)
    3200         17.427         4.327    (4.028)

Composition, $f = \log(1+x), g = x/(1-x)/3$:

       n            old           new    (speedup)
      12       5.45e-06      1.81e-06    (3.011)
      25       1.85e-05      6.78e-06    (2.729)
      50       8.65e-05      6.42e-05    (1.347)
     100       0.000389       0.00028    (1.389)
     200        0.00265       0.00203    (1.305)
     400         0.0151        0.0144    (1.049)
     800         0.0809        0.0753    (1.074)
    1600          0.472         0.395    (1.195)
    3200          2.568         1.888    (1.360)
    6400         15.632         9.701    (1.611)

Composition, $f = g = x/(1-x)/3$:

       n            old           new    (speedup)
      12       4.63e-06      1.46e-06    (3.171)
      25       1.45e-05      4.11e-06    (3.528)
      50       5.44e-05      2.44e-05    (2.230)
     100       0.000202      0.000113    (1.788)
     200         0.0012      0.000851    (1.410)
     400        0.00678       0.00902    (0.752)
     800         0.0357        0.0416    (0.858)
    1600          0.219         0.231    (0.948)
    3200          1.279         1.116    (1.146)
    6400          7.179         5.902    (1.216)

Reversion, $f = \log(1+x)$:

       n            old           new    (speedup)
      12       4.42e-06      4.46e-06    (0.991)
      25       6.59e-05      3.62e-05    (1.820)
      50        0.00039      0.000349    (1.117)
     100        0.00259       0.00277    (0.935)
     200         0.0195        0.0172    (1.134)
     400          0.184         0.134    (1.373)
     800          1.928         0.814    (2.369)
    1600         21.251         4.987    (4.261)
    3200        265.237        33.305    (7.964)

Reversion, $f = \exp(x)-1$:

       n            old           new    (speedup)
      12       3.62e-06      3.59e-06    (1.008)
      25       5.91e-05      3.58e-05    (1.651)
      50       0.000384      0.000305    (1.259)
     100        0.00264       0.00213    (1.239)
     200         0.0198        0.0148    (1.338)
     400          0.195         0.113    (1.726)
     800          1.903          0.63    (3.021)
    1600         20.784         3.826    (5.432)
    3200        251.899        23.058    (10.925)

Reversion, $f = \sqrt{1+x}-1$:

       n            old           new    (speedup)
      12       2.59e-06      2.58e-06    (1.004)
      25       2.21e-05      9.22e-06    (2.397)
      50       0.000165      6.34e-05    (2.603)
     100         0.0012      0.000262    (4.580)
     200        0.00575        0.0015    (3.833)
     400         0.0316       0.00642    (4.922)
     800          0.219        0.0321    (6.822)
    1600          1.669         0.418    (3.993)
    3200         13.752         3.536    (3.889)
    6400         132.18        19.758    (6.690)

Reversion, $f = x/(1-x)/3$:

       n            old           new    (speedup)
      12       1.38e-06      1.43e-06    (0.965)
      25       3.66e-06      6.47e-06    (0.566)
      50       1.96e-05       0.00011    (0.178)
     100       7.59e-05      0.000949    (0.080)
     200       0.000516       0.00525    (0.098)
     400        0.00257        0.0273    (0.094)
     800         0.0108         0.173    (0.062)
    1600          0.053         1.193    (0.044)
    3200          0.351         7.977    (0.044)
    6400          2.061        56.239    (0.037)

(The last example has a special form for which the previous default of fast Lagrange inversion is particularly efficient; it would be possible to detect such special input, but I'm not sure it's worth it.)

@fredrik-johansson
Collaborator Author

Now enabled by default for all series composition functions, and reversion functions have been updated to favor Newton iteration for sufficiently large $n$ now that composition is asymptotically fast.

Algorithm cutoffs are always going to be suboptimal for some inputs, but hopefully this will improve performance in most cases.

@fredrik-johansson fredrik-johansson merged commit 65198c1 into flintlib:main May 5, 2026
13 checks passed
