Skip to content
trunk
Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
flint2/src/fft_small/
flint2/src/fft_small/

Latest commit

 

Git stats

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
doc
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
FFT over F_p
===============================

    Elements of Z/nZ are represented as integral doubles. Floating point numbers
    in general in these notes will be viewed at exact rationals, and operations
    on them that produce a rounded (to some nearest) answer will be denoted by
    add, mul, ... etc. The + , *, / always denote the true exact rational value.


The reduction mod n
-------------------

    For arithmetic mod n, the product a*b is represented exactly as the usual
    double-double format h + l where h and l are prec-bit floating point
    numbers. a*b will be roughly the size of n^2, but since one or both of a or
    b may not be reduced modulo n, it is common for the product a*b to be a
    reasonable multiple of n^2. Since this is calculated as

        h = mul(a, b)
        l = fmsub(a, b, h),

    if we have |a*b| < 2^e <= 2^(2*prec), then

        |l| <= 2^(e - prec - 1).

    First, ninv is an floating approximation of 1/n with some error epsilon

        ninv := 1/n + epsilon.

    Next, we would like to approximate the true quotient a*b/n as

        q = round(mul(h, ninv)),

    so that h + l can be reduced mod n. Since

        h*ninv = (a*b - l)*(1/n + epsilon)
               = a*b/n + a*b*epsilon - l*ninv,

    we have

        |a*b/n - q| <= |a*b*epsilon| + 2^(e - prec - 1)*ninv + 1/2.

    When the quantity on the rhs is <= k, we can guarantee that

        |h + l - q*n| <= k*n,

    while h + l - q*n can be calculated exactly as add(l, fnmadd(q, n, h)).


Analysis for 2^(prec - 4) < n < 2^(prec - 3) (aka 50 bit primes)
----------------------------------------------------------------

    In this case we have |epsilon| < 2^(3 - 2*prec) and ninv < 2^(4 - prec).

    For |a*b| < 2*n^2, we can take e = 2*prec - 5. Then,

        |a*b*epsilon| + 2^(e - prec - 1)*ninv + 1/2 < 2^-2 + 2^-2 + 1/2 = 1

    Similarly for |a*b| < 4*n^2, we can take e = 2*prec - 4 and find that

        |a*b*epsilon| + 2^(e - prec - 1)*ninv + 1/2 < 2^-1 + 2^-1 + 1/2 = 3/2

    Therefore,

(1)  Products in the range (-2*n^2, 2*n^2) are reduced to the range (-n, n).
(2)  Products in the range (-4*n^2, 4*n^2) are reduced to the range (-3/2*n, 3/2*n).


Analysis of the forward butterfiles for 50 bit primes
-----------------------------------------------------

        b0 = 1*a0 + w^2*a2 +   w*(a1 + w^2*a3)
        b1 = 1*a0 + w^2*a2 -   w*(a1 + w^2*a3)
        b2 = 1*a0 - w^2*a2 + i*w*(a1 - w^2*a3)
        b3 = 1*a0 - w^2*a2 - i*w*(a1 - w^2*a3)

    Each of w^2, w, and i*w is in (-n/2, n/2). As part of this operation, the
    multiplication 1*a0 reduces a0 to the range (-n, n) and the other
    multiplications use the above. In practice, a0 does not get large even
    without this reduction, but this is needed to prove things stay bounded.
    The following claim is trivial from (1):

(*)     If the ai in (-3*n, 3*n), then the bi are also in (-3*n, 3*n).


Analysis of the truncated forward butterfiles for 50 bit primes
---------------------------------------------------------------

    Trivial because some of the ai are just 0.


Analysis of the pointwise muls
------------------------------

    The output of one fft in the range (-3*n, 3*n) is multiplied with a
    multiplier (2^-depth) in the range (-n/2, n/2). This produces a product x
    in the range (-n, n). Then, x is multiplied by the output of another fft
    in the range (-3*n, 3*n). Thus,

(*)     The outputs of the pointwise muls are in the range (-3/2*n, 3/2*n).


Analysis of the reverse butterflies for 50 bit primes
-----------------------------------------------------

        4a0 =    1*(     (b0 + b1) +        (b2 + b3))
        4a2 = w^-2*(     (b0 + b1) -        (b2 + b3))
        4a1 =       w^-1*(b0 - b1) - i*w^-1*(b2 - b3)
        4a3 = w^-2*(w^-1*(b0 - b1) + i*w^-1*(b2 - b3))

    As before, each of w^-2, w^-1, and i*w^-1 is in (-n/2, n/2). As part of
    this operation, 1*(b0 + b1 + b2 + b3) reduces 4a0 to the range (-n, n),
    and the other multiplications are as before.
    The following claim is trivial from (1) and (2):

(*)     If the bi are in (-2*n, 2*n), then the 4ai are also in (-2*n, 2*n).

    The bound n < 2^50 guarantees that |(b0 + b1) +- (b2 + b3)| < 2^53 so that
    no bits are lost.


Analysis of the truncated reverse butterflies for 50 bit primes
---------------------------------------------------------------

    *** TODO ***
    A bit tedious because there are so many formulas, but we just need make
    sure that output is in the range (-2*n, 2*n) if input is.


Specific bounds for the chosen primes
-------------------------------------

The bounds (1) and (2) are actually a bit pessimistic.
For the primes here, epsilon is a bit smaller, and we have

for n = 0x0003f00000000001, epsilon = 2.484444775828405e-32
products in the range +-1*n^2 are reduced to the range +-0.5940096416170644*n
products in the range +-2*n^2 are reduced to the range +-0.6880192832341288*n
products in the range +-3*n^2 are reduced to the range +-0.8455209883432566*n
products in the range +-4*n^2 are reduced to the range +-0.8760385664682575*n

for n = 0x0002580000000001, epsilon = 2.843405307262943e-32
products in the range +-1*n^2 are reduced to the range +-0.5657082112630193*n
products in the range +-2*n^2 are reduced to the range +-0.6314164225260386*n
products in the range +-3*n^2 are reduced to the range +-0.7504579671223911*n
products in the range +-4*n^2 are reduced to the range +-0.762832845052077*n

for n = 0x0003dc0000000001, epsilon = 6.811649508604047e-32
products in the range +-1*n^2 are reduced to the range +-0.6451606287164718*n
products in the range +-2*n^2 are reduced to the range +-0.7903212574329437*n
products in the range +-3*n^2 are reduced to the range +-1.0002592140846382*n
products in the range +-4*n^2 are reduced to the range +-1.0806425148658874*n

for n = 0x00033c0000000001, epsilon = 2.9157237243432387e-32
products in the range +-1*n^2 are reduced to the range +-0.6014607931680735*n
products in the range +-2*n^2 are reduced to the range +-0.7029215863361472*n
products in the range +-3*n^2 are reduced to the range +-0.7270876935138827*n
products in the range +-4*n^2 are reduced to the range +-0.9058431726722944*n

for n = 0x00027c0000000001, epsilon = 4.7499842291553306e-32
products in the range +-1*n^2 are reduced to the range +-0.5735421570591971*n
products in the range +-2*n^2 are reduced to the range +-0.6470843141183942*n
products in the range +-3*n^2 are reduced to the range +-0.7709409365863962*n
products in the range +-4*n^2 are reduced to the range +-0.7941686282367884*n

for n = 0x0003a20000000001, epsilon = 3.053563908647828e-32
products in the range +-1*n^2 are reduced to the range +-0.600745266740041*n
products in the range +-2*n^2 are reduced to the range +-0.7014905334800822*n
products in the range +-3*n^2 are reduced to the range +-0.8710530045211985*n
products in the range +-4*n^2 are reduced to the range +-0.9029810669601643*n

for n = 0x00039a0000000001, epsilon = 1.649948345672754e-32
products in the range +-1*n^2 are reduced to the range +-0.5863706460485221*n
products in the range +-2*n^2 are reduced to the range +-0.6727412920970441*n
products in the range +-3*n^2 are reduced to the range +-0.828526254848386*n
products in the range +-4*n^2 are reduced to the range +-0.8454825841940883*n

for n = 0x0003160000000001, epsilon = 3.3640032256934724e-32
products in the range +-1*n^2 are reduced to the range +-0.6063937464846851*n
products in the range +-2*n^2 are reduced to the range +-0.7127874929693703*n
products in the range +-3*n^2 are reduced to the range +-0.7381685812262074*n
products in the range +-4*n^2 are reduced to the range +-0.9255749859387407*n


==========================

8x8 transpose with avx512

x0[0] ... x0[7]
...       ...
x7[0] ... x7[7]

t0 = unpacklo(x0, x1) // x0[0],x1[0], x0[2],x1[2], x0[4],x1[4], x0[6],x1[6]
t1 = unpacklo(x2, x3) // x2[0],x3[0], x2[2],x3[2], x2[4],x3[4], x2[6],x3[6]
t2 = unpacklo(x4, x5) // x4[0],x5[0], x4[2],x5[2], x4[4],x5[4], x4[6],x5[6]
t3 = unpacklo(x6, x7) // x6[0],x7[0], x6[2],x7[2], x6[4],x7[4], x6[6],x7[6]
t4 = unpackhi(x0, x1) // x0[1],x1[1], x0[3],x1[3], x0[5],x1[5], x0[7],x1[7]
t5 = unpackhi(x2, x3) // x2[1],x3[1], x2[3],x3[3], x2[5],x3[5], x2[7],x3[7]
t6 = unpackhi(x4, x5) // x4[1],x5[1], x4[3],x5[3], x4[5],x5[5], x4[7],x5[7]
t7 = unpackhi(x6, x7) // x6[1],x7[1], x6[3],x7[3], x6[5],x7[5], x6[7],x7[7]


s0 = (t0, t1) // x0[0],x1[0],x2[0],x3[0],