fix(nmod): Add nmod_ctx to store is_prime #179

Open · wants to merge 31 commits into main

Conversation

@oscarbenjamin (Collaborator) commented Aug 9, 2024

Work in progress...

Fixes gh-124
Fixes gh-73

Have nmod use a context object so that we can store the additional information of whether the modulus is prime, to avoid operations that might fail for non-prime modulus (gh-124).

@oscarbenjamin (Collaborator Author)

This should actually be working now, unlike the previous commits.

I haven't implemented the prime checks yet but I have changed it so that nmod, nmod_poly and nmod_mat all use context objects to store the modulus.

This gives:

In [3]: sys.getsizeof(nmod(3, 7))
Out[3]: 48

That is the same as on master. What has happened is that rather than storing a 64 bit nmod_t it stores a 64 bit pointer to an nmod_ctx. The nmod_ctx then holds an nmod_t along with a bint _is_prime.

When calling a C function this means that it looks like:

__pyx_v_r->val = nmod_mul(__pyx_v_val, __pyx_v_s2->val, __pyx_v_s2->ctx->mod);

Whereas on master it looks like:

__pyx_v_r->val = nmod_mul(__pyx_v_val, __pyx_v_s->val, __pyx_v_r->mod);

That has an extra indirection to dereference two pointers rather than one. In the context of all of the surrounding code though I doubt it is significant. It is very likely the context is going to be high in the memory cache if this indirection happens a lot.

A simple timing with the PR:

In [3]: %timeit a*a
333 ns ± 8.45 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

With master that is:

In [3]: %timeit a*a
202 ns ± 3.71 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

So it looks like there is a significant slowdown. I will investigate further but I doubt that the pointer indirection is the cause. More INCREF/DECREF is involved since the context object is reference-counted.

Possibly this is more significant:

any_as_nmod(&val, t, (<nmod>s).mod)

vs:

s2.ctx.any_as_nmod(&val, t)

It might be better to inline that a bit more somehow rather than calling a method on ctx so maybe the Cython code needs to look like:

any_as_nmod(&val, t, (<nmod>s).ctx.mod)

Actually I just tried that and it brings the time back down to:

In [3]: %timeit a*a
244 ns ± 2.18 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

I'll push that approach up...

So clearly we need to be a bit careful here because subtle differences in exactly how the code is called can make a big difference to overall timings. I want to try to get back down to at least the same time if not faster than master.

@oscarbenjamin (Collaborator Author)

I want to try to get back down to at least the same time if not faster than master.

It is potentially not worth slowing down nmod for prime modulus just to have better error handling for non-prime modulus...

@oscarbenjamin (Collaborator Author) commented Aug 22, 2024

This gives:

In [3]: sys.getsizeof(nmod(3, 7))
Out[3]: 48

That is the same as on master.

I'm not sure where that comes from. General Python object overhead is 16 bytes. An fmpz_t is 8 bytes. Hence:

In [3]: sys.getsizeof(fmpz(3))
Out[3]: 24

An fmpq_t is two fmpz_t so 16 bytes. Hence:

In [4]: sys.getsizeof(fmpq(3))
Out[4]: 32

An nmod has an mp_limb_t (8 bytes) and then an nmod_t (on master) and, oh, an nmod_t is more than 8 bytes. It is:

    ctypedef struct nmod_t:
        mp_limb_t n
        mp_limb_t ninv
        flint_bitcnt_t norm

according to flint.pxd. That would be 3 lots of 8 bytes, i.e. 24 bytes for the nmod_t, so with the 16 byte PyObject_HEAD and the 8 byte mp_limb_t value that is 16 + 8 + 24 = 48 bytes.

Here in this PR though an nmod has an mp_limb_t and an nmod_ctx which is a Python object so a pointer in the struct. I would have thought that would be 16 + 8 + 8 = 32 bytes...

Maybe adding an attribute that is a Python object adds some other overhead...

The struct in the generated C code is:

struct __pyx_obj_5flint_5types_4nmod_nmod {
  struct __pyx_obj_5flint_10flint_base_10flint_base_flint_scalar __pyx_base;
  mp_limb_t val;
  struct __pyx_obj_5flint_5types_4nmod_nmod_ctx *ctx;
};

So it is just PyObject_HEAD plus an mp_limb_t and a pointer.

It would be nice if we could get it down to 32 bytes rather than 48...

@oscarbenjamin (Collaborator Author)

After playing around with this a bit I think that it is not possible to store an nmod_ctx as a Python object attached to an nmod without incurring some overhead. I haven't found any way to match let alone beat the current code as long as there is an nmod_ctx attribute on nmod.

I show a lot of timings below and they vary quite a bit from run to run (e.g. if I quit ipython and restart it). The general trends are as described though. This time I am building with -Dbuildtype=release i.e. C compiler optimisations on for the Cython code.

Timings for master:

In [1]: from flint import *

In [2]: ai = 10   # CPython caches small ints

In [3]: %timeit ai * ai
27 ns ± 0.169 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [4]: ai = 1000  # Larger ints not cached

In [5]: %timeit ai * ai
49.4 ns ± 1.36 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [7]: ag = gmpy2.mpz(10)  # gmpy2 slower for small ints

In [8]: %timeit ag * ag
74.2 ns ± 1.15 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [9]: af = nmod(10, 17)  # nmod similar

In [10]: %timeit af * af
75.3 ns ± 0.255 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

So nmod takes about the same time as gmpy2.mpz (for small ints). Both are about 50% slower than CPython's int. I'm not sure if that is because int somehow gets a built-in advantage or whether there is some way for both gmpy2 and python-flint to be faster here.

For very small integers CPython caches them in memory which avoids the heap allocation. That makes it 3x faster than nmod. We could contemplate doing the same thing for nmod when the modulus is small: just store every element in an array. That wouldn't work for large modulus though and we would need a separate array for each modulus. I'm not sure how worthwhile it is to try to optimise the small modulus case for this.
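
A rough sketch of what that could look like at the Python level (the helper name, the cutoff and the cache are all made up here, and it would only help calls routed through the helper; results of arithmetic would still be freshly allocated):

from flint import nmod

_SMALL_MOD_CUTOFF = 256   # hypothetical cutoff below which all residues are cached
_small_nmod_cache = {}    # modulus -> tuple of all nmod elements for that modulus

def cached_nmod(val, mod):
    # Mimic CPython's small-int cache: return a shared nmod instance for
    # small moduli, falling back to a fresh instance for larger moduli.
    if mod <= _SMALL_MOD_CUTOFF:
        elements = _small_nmod_cache.get(mod)
        if elements is None:
            elements = _small_nmod_cache[mod] = tuple(nmod(i, mod) for i in range(mod))
        return elements[val % mod]
    return nmod(val, mod)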

By contrast with my best efforts (not pushed) to micro-optimise the approach in this PR I get:

In [3]: %timeit af * af
104 ns ± 1.56 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

That's a 30% slowdown. I believe that this is purely due to the unavoidable overhead of needing to assign a Python object in Cython like:

(<nmod>result).ctx = (<nmod>self).ctx

This happens for every operation that creates a new nmod and because ctx is an nmod_ctx Python object it results in INCREF/DECREF etc. The Cython-generated C code looks like:

  /* "flint/types/nmod.pyx":403
 *             raise ValueError("cannot coerce integers mod n with different n")
 *         r = nmod.__new__(nmod)
 *         r.ctx = s.ctx             # <<<<<<<<<<<<<<
 *         r.val = nmod_mul(s.val, t.val, s.ctx.mod)
 *         return r
 */
  __pyx_t_2 = ((PyObject *)__pyx_v_s->ctx);
  __Pyx_INCREF(__pyx_t_2);
  __Pyx_GIVEREF(__pyx_t_2);
  __Pyx_GOTREF((PyObject *)__pyx_v_r->ctx);
  __Pyx_DECREF((PyObject *)__pyx_v_r->ctx);
  __pyx_v_r->ctx = ((struct __pyx_obj_5flint_5types_4nmod_nmod_ctx *)__pyx_t_2);
  __pyx_t_2 = 0;

I don't know what all these macros are doing but the analogous code when we store an nmod_t for the modulus is just one line that copies the struct.

Also here are timings for a more macro benchmark where SymPy uses python-flint's nmod for GF(p) matrices. This is SymPy's pure Python matrix implementation but with nmod as the elementary matrix elements. We can invert a 100x100 matrix over GF(17) in 11 milliseconds with python-flint master:

In [13]: from sympy import *

In [14]: M = randMatrix(100)

In [15]: dM = M.to_DM(GF(17))

In [26]: %timeit dM.inv()
11.2 ms ± 164 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Converting that to a Flint nmod_mat we get:

In [27]: fM = dM.to_dense().rep.rep

In [28]: type(fM)
Out[28]: flint.types.nmod_mat.nmod_mat

In [30]: %timeit fM.inv()
1.16 ms ± 35.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

So SymPy's pure Python matrix implementation operating with python-flint's nmod is about 10x slower than FLINT's nmod_mat_inv function. Obviously we expect a pure Python implementation to be slower but it is good to get within a factor of 10 of C here. That depends critically on python-flint's nmod speed though.

With this PR instead of python-flint master we get:

In [8]: %timeit dM.inv()
15.9 ms ± 158 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

That's 16 milliseconds rather than 11 milliseconds so roughly a 40% slowdown.

To control all extraneous factors I made these two simpler demo types:

from cpython.object cimport PyTypeObject, PyObject_TypeCheck

cdef class int_mod1:
    cdef unsigned long long val
    cdef unsigned long long mod

    def __init__(self, val, mod):
        self.val = val
        self.mod = mod

    def __repr__(self):
        return f"{self.val} mod {self.mod}"

    def __mul__(self, other):
        cdef int_mod1 result
        if not PyObject_TypeCheck(other, <PyTypeObject*>int_mod1):
            return NotImplemented
        if self.mod != (<int_mod1>other).mod:
            raise ValueError("cannot multiply integers mod n with different n")
        result = int_mod1.__new__(int_mod1)
        result.val = (self.val * (<int_mod1>other).val) % self.mod
        result.mod = self.mod
        return result

cdef class int_mod2_ctx:
    cdef unsigned long long mod

    def __init__(self, mod):
        self.mod = mod

cdef class int_mod2:
    cdef unsigned long long val
    cdef int_mod2_ctx ctx

    def __init__(self, val, mod):
        self.val = val
        self.ctx = int_mod2_ctx(mod)

    def __repr__(self):
        return f"{self.val} mod {self.ctx.mod}"

    def __mul__(self, other):
        cdef int_mod2 result
        if not PyObject_TypeCheck(other, <PyTypeObject*>int_mod2):
            return NotImplemented
        if self.ctx is not (<int_mod2>other).ctx:
            raise ValueError("cannot multiply integers mod n with different n")
        result = int_mod2.__new__(int_mod2)
        result.val = (self.val * (<int_mod2>other).val) % self.ctx.mod
        result.ctx = self.ctx
        return result
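
For reference, a minimal build script for these two classes looks something like this (the file name int_mod_demo.pyx is just what I am assuming here):

# setup.py - compile the int_mod1/int_mod2 demo classes (file name assumed)
from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("int_mod_demo.pyx",
                            compiler_directives={"language_level": "3"}))

Built with python setup.py build_ext --inplace and then imported in IPython for the timings below.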

The only difference between these two is whether we have a mod attribute that is an unsigned long long or an int_mod2_ctx attribute that is a Python object storing the mod. These are the timings:

In [10]: a1 = int_mod1(1000, 10000)

In [11]: a2 = int_mod2(1000, 10000)

In [12]: %timeit a1 * a1
91.4 ns ± 2.16 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [13]: %timeit a2 * a2
142 ns ± 0.38 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

So that one difference (using an nmod_ctx-style Python object rather than an unsigned long long modulus) brings a 50% slowdown. Timings on this vary from run to run but int_mod2 is always slower. In absolute terms here it is about 50 nanoseconds which seems a lot but for my best efforts with nmod the difference was more like 20 nanoseconds. Clock speed here is apparently 1.8GHz so 20ns is roughly 35 clock cycles. It seems plausible that the INCREF/DECREF operations amount to that...

Note also:

In [17]: sys.getsizeof(a1)
Out[17]: 32

In [18]: sys.getsizeof(a2)
Out[18]: 48

That sort of explains why in this PR nmod is still 48 bytes rather than 32. It is because the nmod_ctx pointer somehow cannot just increase the size from 24 bytes to 32. I suppose this is some alignment issue or to do with cyclic GC but it is just what happens with Cython when we swap an 8 byte object for a PyObject* apparently...

I think that proves then that we can't have an nmod_ctx attribute attached to an individual nmod, so along with fmpz and fmpq an nmod will need to not have a context attribute. There can still be a context from an interface/API perspective; it is just that we cannot store the context as an attribute on each nmod instance.

For nmod_poly and nmod_mat though I think that this overhead is likely acceptable e.g. the time to invert a 100x100 matrix is not measurably affected:

In [15]: %timeit fM.inv()
1.12 ms ± 12.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

That is because a 30 nanosecond overhead just doesn't register in the context of a 1 millisecond operation.

The conclusion then is this: we have nmod_ctx but it is only used by nmod_poly and nmod_mat (nmod_mpoly already has a context). All other types besides fmpz* and fmpq* already have contexts. We should still add contexts for fmpz and fmpq but they can just be global variables. An nmod.get_ctx() function would need to retrieve a context object from some global cache.

Part of the reason for wanting to have contexts (apart from uniformity across types) is to store whether or not the modulus is prime. This is actually not needed in the case of nmod because the only operations that would have needed it use n_gcdinv and n_sqrtmod which don't crash and which can detect whether the result exists. It is only nmod_poly and nmod_mat that need to store the is_prime attribute and so we can add contexts for them.
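
As a pure Python illustration of the same point (this is not the FLINT code path, just the principle that an extended-gcd based inverse detects non-invertible elements by itself, so no primality information is needed up front):

def invmod(a, n):
    # Extended-gcd based inverse: fails cleanly when gcd(a, n) != 1,
    # so knowing whether n is prime is not required in advance.
    g, x = _half_xgcd(a % n, n)
    if g != 1:
        raise ZeroDivisionError(f"{a} is not invertible modulo {n}")
    return x % n

def _half_xgcd(a, b):
    # Return (g, x) with g = gcd(a, b) and a*x congruent to g modulo b.
    x0, x1 = 1, 0
    while b:
        q, a, b = a // b, b, a % b
        x0, x1 = x1, x0 - q * x1
    return a, x0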

In conclusion then I need to revert a lot of the changes to nmod here but not those to nmod_poly and nmod_mat...

@oscarbenjamin (Collaborator Author)

That sort of explains why in this PR nmod is still 48 bytes rather than 32. It is because the nmod_ctx pointer somehow cannot just increase the size from 24 bytes to 32. I suppose this is some alignment issue or to do with cyclic GC but it is just what happens with Cython when we swap an 8 byte object for a PyObject* apparently...

I asked on the Cython mailing list and this is apparently due to cyclic GC:
https://groups.google.com/g/cython-users/c/b_LGThvSj8M/m/pPw8c_IRCQAJ?utm_medium=email&utm_source=footer
Basically having any reference to a Python object adds 16 bytes to the memory footprint for the doubly linked list used by the cyclic garbage collector.

@oscarbenjamin (Collaborator Author)

I asked on the Cython mailing list and this is apparently due to cyclic GC:

Using @cython.no_gc seems to bring some speedups for some benchmarks.

Also here are timings for a more macro benchmark where SymPy uses python-flint's nmod for GF(p) matrices. This is SymPy's pure Python matrix implementation but with nmod as the elementary matrix elements. We can invert a 100x100 matrix over GF(17) in 11 milliseconds with python-flint master:

In [13]: from sympy import *

In [14]: M = randMatrix(100)

In [15]: dM = M.to_DM(GF(17))

In [26]: %timeit dM.inv()
11.2 ms ± 164 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Turns out I was completely wrong about this. I forgot that in this particular case what happens is that SymPy's sparse implementation converts the matrix to the dense representation which basically means it uses nmod_mat_inv to compute the inverse. The fact that it slowed down from 11 milliseconds to 16 milliseconds is still a problem, but it is due to slower conversions rather than slower arithmetic.

This is a better comparison of these using python-flint master:

In [1]: from sympy import *

In [2]: M = randMatrix(100)

In [3]: dM_sympy_sparse = M.to_DM(GF(17)).to_sdm()

In [4]: dM_sympy_dense = M.to_DM(GF(17)).to_ddm()

In [5]: dM_nmod_mat = M.to_DM(GF(17)).to_dfm().rep

In [6]: %timeit dM_sympy_sparse.rref()
136 ms ± 181 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [7]: %timeit dM_sympy_dense.rref()
81.6 ms ± 694 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [8]: %timeit dM_nmod_mat.rref()
313 µs ± 2.19 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

So in this PR with no_gc the pure Python implementation using nmod for the arithmetic is about 300-500x slower than Flint's nmod_mat_rref function depending on whether the dense or sparse implementation is used.

If I remove the no_gc here:

diff --git a/src/flint/types/nmod.pyx b/src/flint/types/nmod.pyx
index 3b99b54..145b15d 100644
--- a/src/flint/types/nmod.pyx
+++ b/src/flint/types/nmod.pyx
@@ -195,7 +195,6 @@ cdef class nmod_ctx:
         return self._new(&v)
 
 
-@cython.no_gc
 cdef class nmod(flint_scalar):
     """
     The nmod type represents elements of Z/nZ for word-size n.

Then timings are:

With no_gc:

In [4]: %timeit dM_sympy_dense.rref()
93.2 ms ± 1.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [5]: %timeit dM_sympy_dense.rref()
91.6 ms ± 203 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [6]: %timeit dM_sympy_dense.rref()
94.6 ms ± 2.86 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Without no_gc:

In [8]: %timeit dM_sympy_dense.rref()
110 ms ± 1.95 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [9]: %timeit dM_sympy_dense.rref()
112 ms ± 2.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [10]: %timeit dM_sympy_dense.rref()
113 ms ± 4.67 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

With master:

In [4]: %timeit dM_sympy_dense.rref()
80.1 ms ± 280 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [5]: %timeit dM_sympy_dense.rref()
82.1 ms ± 2.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [6]: %timeit dM_sympy_dense.rref()
84 ms ± 5.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

I've repeated this a few times (different matrices, restart process, etc) and it is consistent so no_gc makes a difference of 110 ms vs 90 ms but master is still 80 ms. That means that this PR measures about 10% slower than master but it was 30% slower before adding @cython.no_gc so clearly no_gc makes a big difference.

s2 = s
ctx = s2.ctx
sval = s2.val
if not ctx.any_as_nmod(&tval, t):
    return NotImplemented
@oscarbenjamin (Collaborator Author)

Calling the ctx.any_as_nmod method is slower than calling the any_as_nmod function...

@oscarbenjamin (Collaborator Author)

If we declare them as @cython.final to avoid the virtual method table then it is just as fast I think...

Comment on lines 533 to 687
res = nmod_poly.__new__(nmod_poly)
nmod_poly_init(res.val, self.val.mod.n)
# XXX: don't create a new context for the polynomial
res = nmod_poly_new_init(any_as_nmod_poly_ctx(self.ctx.mod.n))
nmod_mat_charpoly(res.val, self.val)
@oscarbenjamin (Collaborator Author)

Need to avoid creating new contexts. I think we need an nmod_mat_ctx that holds an nmod_ctx and an nmod_poly_ctx as attributes.

@oscarbenjamin (Collaborator Author)

Now an nmod_mat_ctx holds an nmod_poly_ctx which holds an nmod_ctx.

Comment on lines 51 to 29
cdef class nmod_poly_ctx:
    """
    Context object for creating :class:`~.nmod_poly` initialised
    with modulus :math:`N`.

    >>> nmod_poly_ctx(17)
    nmod_poly_ctx(17)
@oscarbenjamin (Collaborator Author)

We should have __init__ raise an exception so you have to use something like nmod_poly_ctx.new(). Then we can make all contexts unique on construction.

@oscarbenjamin (Collaborator Author)

Done

Comment on lines 714 to 813
v = nmod_poly.__new__(nmod_poly)
nmod_poly_init(v.val, self.val.mod.n)
v = nmod_poly_new_init(self.ctx)
nmod_poly_deflate(v.val, self.val, n)
v.ctx = self.ctx
@oscarbenjamin (Collaborator Author)

Should have a new_nmod_poly(self.ctx) function.

@oscarbenjamin (Collaborator Author)

Rather self.ctx.new_nmod_poly() as a @cython.final method.

@oscarbenjamin (Collaborator Author) commented Aug 28, 2024

@GiacomoPope @Jake-Moss I don't know if either of you wants to review this.

I ended up doing a bunch of somewhat unrelated things like improving the coverage plugin, adding tests for full coverage, and then needing to make various types consistent with error handling etc. I also found that some of the newly added methods like inverse_series_trunc could crash under some conditions.

The main change here though is that nmod, nmod_poly and nmod_mat all have associated context types nmod_ctx, nmod_poly_ctx, and nmod_mat_ctx. Apart from meaning that we can handle the non-prime modulus case gracefully (no crashes) I am hoping that the pattern here is something that can work for all domains in python-flint.

On master an nmod has an nmod_t struct as part of the PyObject struct so it ends up being 48 bytes which is 16 bytes for PyObject_HEAD (ref count and pointer to type object) + 8 bytes for the value + 24 bytes for the nmod_t adding up to 48 bytes. Working in C you would not attach the nmod_t to each instance, which is otherwise just an 8 byte integer, but it is potentially the simplest way in Python/Cython.

Here in the PR each nmod instance actually holds a pointer to an nmod_ctx instance which holds the nmod_t as well as an _is_prime attribute that is computed when the context is created. This exchanges the 24 byte nmod_t struct for an 8 byte pointer bringing the size down from 48 bytes to 32.

Initially I found that the size was still 48 bytes but that is because Cython implicitly decided that an nmod should be GC-tracked if it has a PyObject pointer as one of its fields. The GC-tracking added not just 16 bytes but also significant time overhead when creating nmod instances. After discussing on the Cython mailing list I see that the @cython.no_gc decorator can disable this removing all of the overhead as far as I can tell from my measurements.

I also moved functions like any_as_nmod to be methods on the nmod_ctx context object but this also brought measurable overhead because of the virtual method table dispatch mechanism. Marking all cdef methods on the nmod_ctx either staticmethod or cython.final means that they carry no overhead compared to a plain function call (I think...).

If you have e.g.:

cdef some_func(nmod_ctx ctx, val):
   ...

cdef class nmod_ctx:
    cdef some_func(self, val):
        ...

then from my measurements calling the method will be significantly slower than calling the function:

ctx.some_func(val) # slow
some_func(ctx, val) # fast

The slowdown is due to the virtual method table because a subclass might override the some_func method. Declaring the method to be @staticmethod or @cython.final turns it back into a plain function call at the C level which seems to eliminate the overhead. I don't think that the actual lookup time of the virtual method table is the main issue but rather the fact that it prevents the C compiler from just completely inlining everything so that there is no actual function call overhead in the machine code. This distinction is important because it means that you really need to build with -Dbuildtype=release for timings to see the full effect of the compiler inlining everything.
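
Concretely the pattern is something like this (a sketch only; the real method names and signatures in the PR differ):

cimport cython

@cython.final                      # no subclasses, so no vtable entry is needed
cdef class nmod_ctx:

    cdef some_func(self, val):
        # With the class (or this method) declared final, calling
        # ctx.some_func(val) compiles to a direct C call that the C
        # compiler can inline, just like the plain function.
        ...

    @staticmethod
    cdef some_static_func(nmod_ctx ctx, val):
        # A cdef @staticmethod is literally a plain C function too.
        ...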

I believe that attaching nmod_ctx instead of nmod_t to all nmod instances does not cause any slowdowns although it is hard to be sure because the timing measurements I get vary a lot from run to run and after rebuilding etc. Likewise I believe that all overhead of using context methods is also eliminated by using @staticmethod and @cython.final as is done here. I don't think that there are slowdowns here then but I would be interested to hear any timing comparisons measured by others on different machines etc.

I wanted the context objects to be unique in memory which is not possible if using the ordinary constructor like:

ctx = nmod_ctx(17)

I have disabled that so constructing a context that way gives an error requiring instead that a context be created like:

ctx = nmod_ctx.new(17)

This (static) method first tries to look up the context from a dict that holds all contexts previously created and if not found then it creates a new context. This ensures that the contexts are always unique in memory and so do not need __eq__ or __hash__ methods and can be compared with a pointer comparison. Ensuring that the context never gets created twice means that we can absorb the cost of checking whether the modulus is prime up front when the context is created (it takes about 50x longer to check if a large modulus is prime than it does to look up a context from the dict).
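
Roughly, the construction path looks like this (a sketch, not the literal code in the PR; the exception type and _nmod_ctx_create are assumptions standing in for whatever actually initialises the context):

cdef dict _nmod_ctx_cache = {}   # modulus -> nmod_ctx, never cleared

cdef class nmod_ctx:

    def __init__(self, *args, **kwargs):
        raise TypeError("use nmod_ctx.new(mod) to get a context")

    @staticmethod
    def new(mod):
        # Return the unique context for this modulus.  The primality check
        # (and any other setup) is paid once, when the context is first made.
        ctx = _nmod_ctx_cache.get(mod)
        if ctx is None:
            ctx = _nmod_ctx_cache[mod] = _nmod_ctx_create(mod)   # hypothetical helper
        return ctx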

This does all mean that the contexts are never freed from memory though, so if someone uses many different moduli then they will gradually consume more and more memory because of context objects that remain attached to the dict in the nmod module. I don't believe it is possible to solve that using weakref etc without incurring precisely the GC-type overhead that I referred to above, so if a solution is needed that does not just store all the contexts forever then I think it means that the contexts cannot be attached to the nmod instances as literal fields. It wouldn't be hard to change this in future, but it would mean reverting some of the changes I made to nmod here.

Regardless of whether the nmod instances literally have a field that is a pointer to an nmod_ctx context object I still think that it is good from a UI perspective at some level to have an nmod_ctx context and that there should also be contexts for fmpz and fmpq and all types so that there is not a distinction between types that have contexts and types that don't. Ideally all the contexts would have some uniform interface that would be consistent across types and domains so that the types can be used interchangeably in generic code rather than this sort of awkwardness:

Z = flint.fmpz
Q = flint.fmpq
F17 = lambda x: flint.nmod(x, 17)
ctx = flint.fmpz_mod_ctx(163)
F163 = lambda a: flint.fmpz_mod(a, ctx)

I would want to rewrite that as:

 Z = flint.fmpz_ctx()
 Q = flint.fmpq_ctx()
 F17 = flint.nmod_ctx(17)
 F163 = flint.fmpz_mod_ctx(163) 

Then ideally you can use them all the same way like F17(2) * F17(3) or Z(2) * Z(3) etc.

I changed the nmod constructor so that the second argument can be either an int or an nmod_ctx i.e.:

ctx = nmod_ctx.new(17)
a = nmod(3, 17)
b = nmod(3, ctx)

Internally nmod(3, 17) will go and look up the context which means a dict-lookup that adds some overhead but note that in Python just calling nmod(...) requires looking up the name nmod in a dict... Ideally the way to create an instance would be through the context like

c = ctx(3)

in which case there does not need to be any overhead. In any case the cost of a call like nmod(3, 17) is much less important than the cost of something like a * b which I think does not incur any overhead here.

I also added contexts for nmod_poly and nmod_mat that work in the same way: all @no_gc and static methods and contexts stored in a dict etc. An nmod_poly_ctx sometimes needs to be able to create an nmod so it internally holds a pointer to an nmod_ctx which is retrieved when the nmod_poly_ctx is created. This means that it is never necessary to create a context on the fly. An nmod_mat_ctx needs to be able to create both nmod and nmod_poly so it internally holds both an nmod_ctx and an nmod_poly_ctx and there are no circular references.
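
Schematically the ownership is one-directional, something like this (field names here are illustrative rather than the literal ones in the PR; cimports omitted):

cdef class nmod_ctx:
    cdef nmod_t mod              # the FLINT modulus data
    cdef bint _is_prime          # computed once when the context is created

cdef class nmod_poly_ctx:
    cdef nmod_ctx ctx            # looked up once when this context is created

cdef class nmod_mat_ctx:
    cdef nmod_ctx ctx
    cdef nmod_poly_ctx poly_ctx  # e.g. for charpoly; nothing points back, so no cycles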

At the higher level in future I would envisage having an object like

Z17 = Zmod(17)

which would internally hold references to all three kinds of context and then you could do e.g.:

a = Z17(2)
x = Z17.poly([0, 1])
M = Z17.matrix([[1, 2], [3, 4]])
mat17 = Z17.mat_ctx()
M2 = mat17([[3], [4]])

It might make most sense for the dict management to take place at that higher Zmod level and for Zmod to be a GC-managed object that can be in a weakref dict. The individual nmod and nmod_poly instances would not hold a pointer to Z17 but rather to the no_gc context objects added here. Then having Z17 be GC-tracked would not slow down basic operations but we would be able to free the memory for contexts no longer in use.
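
A rough sketch of that arrangement (Zmod and Zmod_cached are hypothetical names; the .new constructors are the ones added in this PR, though in this scheme their permanent module-level caches would presumably go away so that the contexts are owned by the Zmod):

import weakref

from flint import nmod, nmod_ctx, nmod_poly_ctx, nmod_mat_ctx   # types from this PR

_zmod_cache = weakref.WeakValueDictionary()   # modulus -> Zmod, freed when unused

class Zmod:
    # Ordinary GC-tracked Python object.  Elements keep pointers to the
    # no_gc context objects below, never to the Zmod itself, so a Zmod can
    # be collected even while elements made from it are still alive.
    def __init__(self, n):
        self.ctx = nmod_ctx.new(n)
        self.poly_ctx = nmod_poly_ctx.new(n)
        self.mat_ctx = nmod_mat_ctx.new(n)

    def __call__(self, val):
        return nmod(val, self.ctx)

def Zmod_cached(n):
    Z = _zmod_cache.get(n)
    if Z is None:
        Z = _zmod_cache[n] = Zmod(n)
    return Z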

I am hoping that the basic structure of the context objects here is a reasonable design that we can use for all types and domains which is why I put some time into micro-optimising and measuring timings etc to see exactly how to organise methods like new_nmod, any_as_nmod etc in a clean way but without incurring runtime overhead (I hope!).

A separate consideration is how to give these contexts a uniform Python-level interface. I don't think that we can afford the cost of a virtual method table lookup for something like ctx.new_nmod but there should be some more generic methods that can be used by Python or Cython code in situations where it is not quite so important that the compiler inlines everything.

@Jake-Moss (Contributor)

@GiacomoPope @Jake-Moss I don't know if either of you wants to review this.

I've just had a quick read over everything here + the diff but won't have the time this week for a proper review.

The main change here though is that nmod, nmod_poly and nmod_mat all have associated context types nmod_ctx, nmod_poly_ctx, and nmod_mat_ctx. Apart from meaning that we can handle the non-prime modulus case gracefully (no crashes) I am hoping that the pattern here is something that can work for all domains in python-flint.

🎉

On master an nmod has an nmod_t struct as part of the PyObject struct so it ends up being 48 bytes which is 16 bytes for PyObject_HEAD (ref count and pointer to type object) + 8 bytes for the value + 24 bytes for the nmod_t adding up to 48 bytes. Working in C you would not attach the nmod_t to each instance, which is otherwise just an 8 byte integer, but it is potentially the simplest way in Python/Cython.

Here in the PR each nmod instance actually holds a pointer to an nmod_ctx instance which holds the nmod_t as well as an _is_prime attribute that is computed when the context is created. This exchanges the 24 byte nmod_t struct for an 8 byte pointer bringing the size down from 48 bytes to 32.

Initially I found that the size was still 48 bytes but that is because Cython implicitly decided that an nmod should be GC-tracked if it has a PyObject pointer as one of its fields. The GC-tracking added not just 16 bytes but also significant time overhead when creating nmod instances. After discussing on the Cython mailing list I see that the @cython.no_gc decorator can disable this removing all of the overhead as far as I can tell from my measurements.

That is pretty interesting, I was unaware of that decorator. Seems like it might also be applicable to other types. I can't imagine any of the python-flint types forming cycles, but its benefits might not be as noticeable.

I also moved functions like any_as_nmod to be methods on the nmod_ctx context object but this also brought measurable overhead because of the virtual method table dispatch mechanism. Marking all cdef methods on the nmod_ctx either staticmethod or cython.final means that they carry no overhead compared to a plain function call (I think...).

This is what I thought #192 would achieve. AFAIK, unfortunately for the arithmetic operators that virtual function lookup is required.

This does all mean that the contexts are never freed from memory though so if someone uses many different moduli then they will gradually consume more and more memory because of context objects that remain attached to the dict in the nmod module.

I think despite this it's certainly something worth doing. AFAIK this is the case with all mpoly contexts ATM. These objects are small enough that millions would have to be made before anyone would look at it and notice something is off. It should be possible to reuse the C implementation of functools.lru_cache if we wanted something that was fixed size, but it would be a little pre-emptive IMO. We could also expose some cache busting functions. If a user was to create a ton of contexts, and was aware of it, they could remove them when they're done, explicitly freeing them.

Regardless of whether the nmod instances literally have a field that is a pointer to an nmod_ctx context object I still think that it is good from a UI perspective at some level to have an nmod_ctx context and that there should also be contexts for fmpz and fmpq and all types so that there is not a distinction between types that have contexts and types that don't. Ideally all the contexts would have some uniform interface that would be consistent across types and domains so that the types can be used interchangeably in generic code rather than this sort of awkwardness:

I agree.

It might make most sense for the dict management to take place at that higher Zmod level and for Zmod to be a GC-managed object that can be in a weakref dict. The individual nmod and nmod_poly instances would not hold a pointer to Z17 but rather to the no_gc context objects added here. Then having Z17 be GC-tracked would not slow down basic operations but we would be able to free the memory for contexts no longer in use.

I think this is also a good idea.

I am hoping that the basic structure of the context objects here is a reasonable design that we can use for all types and domains which is why I put some time into micro-optimising and measuring timings etc to see exactly how to organise methods like new_nmod, any_as_nmod etc in a clean way but without incurring runtime overhead (I hope!).

This is quite interesting to me, I haven't put a lot of time into micro-optimisations and benchmarks for Cython because my work outside of python-flint has always been numeric where we just get down to C++ stdlib calls and call it a day. Hopefully I get the time to expand my suite to include some micro-benchmarks. I should be able to measure the GC impacts once this is merged and SymPy is able to make use of it. I already can measure GC differences but something is wrong with my post-processing ATM.

@oscarbenjamin (Collaborator Author)

I've just had a quick read over everything here + the diff but won't have the time this week for a proper review.

No problem. There is no rush so I will leave this here if you intend to review it later.

I can't imagine any of the python-flint types forming cycles, but its benefits might not be as noticeable.

The benefits are going to be most extreme for nmod as compared to anything else in python-flint. It is the smallest possible object that in C would only ever be a 64 bit integer so any overhead adds up. That's not the case for other python-flint types e.g. nmod_mat for a large matrix has much smaller relative overhead compared to working in C. The ~10 nanoseconds that I worried about for nmod are not significant for nmod_mat unless perhaps we have very small matrices like 2 x 2 or something. Of course ideally someone does not need to do heavy calculations in Python with nmod rather than nmod_mat or nmod_poly but it is nice if someone can do that and if python-flint makes this as fast as possible for them.

Also it is worth trying to minimise the overhead for trivial cases like the zero polynomial. Consider e.g.:

In [1]: M = randMatrix(10, percent=10) + randMatrix(10, percent=10)*x

In [3]: M.to_DM().to_ddm()
Out[3]: DDM([[7*x, 0, 0, 0, 0, 0, 0, 5*x, 0, 0], [0, 69, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 82*x, 0], [69, 96*x, 0, 0, 33, 0, 0, 0, 0, 0], [0, 0, 15*x + 21, 0, 0, 0, 0, 0, 0, 24*x], [14*x, 0, 0, 0, 0, 36, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 65*x, 0, 0, 35], [23, 0, 0, 0, 0, 0, 25*x, 0, 31, 14*x], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 55, 0, 87, 0, 0, 0, 0]], (10, 10), ZZ[x])

Here you have a dense matrix that is full of zeros but those zeros are actually zero polynomials. The speed of e.g. matrix multiplication with this representation depends predominantly on how fast you can multiply and add mostly zero polynomials so the overhead in trivial cases does matter.

As for the reference cycles, consider this: an nmod_mat sometimes needs to create an nmod_poly e.g. for charpoly. To make that work an nmod_mat_ctx holds a reference to an nmod_poly_ctx. If we wanted to add a method to nmod_poly that would create a matrix like p.companion_matrix() or something then the nmod_poly_ctx would somehow need to get the nmod_mat_ctx. If the nmod_poly_ctx stored a reference to the nmod_mat_ctx we would have a reference cycle. If it didn't store the nmod_mat_ctx then it would need to use some slower way to get it when the method is called. One fix is to keep things acyclic if possible e.g. we add the companion_matrix method to nmod_mat instead like:

M = p.companion_matrix() # cyclic
M = nmod_mat.companion_matrix(p) # acyclic

Hopefully that acyclic structure:

nmod_mat_ctx -> nmod_poly_ctx -> nmod_ctx

is something that can reasonably be used for all types and domains.

@oscarbenjamin (Collaborator Author)

The ~10 nanoseconds that I worried about for nmod are not significant for nmod_mat unless perhaps we have very small matrices like 2 x 2 or something.

Here is a timing comparison of that. With master:

In [2]: M = nmod_mat([[1, 2], [3, 4]], 17)

In [3]: %timeit M*M
404 ns ± 2.52 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

With the PR:

In [4]: %timeit M*M
425 ns ± 16.7 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

So maybe there is some slowdown...

Oh wait, it is because I forgot to add the @cython.no_gc decorator to nmod_mat. Another point: the decorator only works if used in the .pyx file. If it is only in the .pxd file it seems to be ignored...

If I fix that by adding @cython.no_gc then:

In [3]: %timeit M*M
413 ns ± 12.6 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In [4]: %timeit M*M
405 ns ± 15.6 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

The timings can vary a lot though so it is difficult to be conclusive. They usually seem consistent if I run them repeatedly in the same ipython session but if I exit ipython and then start it again they might suddenly change. They also can change significantly if I check out a different branch and rebuild, so I guess measuring timings this short is just tricky because it depends on so many things. The timings I show here are just the ones that I get as I write the message but I have seen significantly slower and faster timings as well...

It is possible that timeit is not a good measure. It doesn't work well for SymPy for example because lots of things are cached there. I also think that trying to measure the time of a single fast operation is difficult. Here is how I would more often time things: by making a single slow operation out of many fast ones. This is the PR:

In [9]: import math

In [10]: matrices = [M] * 10 ** 6

In [11]: %time math.prod(matrices)
CPU times: user 417 ms, sys: 107 µs, total: 417 ms
Wall time: 416 ms
Out[11]: 
[1, 0]
[0, 1]

And this is master:

In [6]: %time math.prod(matrices)
CPU times: user 413 ms, sys: 0 ns, total: 413 ms
Wall time: 412 ms
Out[6]: 
[1, 0]
[0, 1]

So basically the same time.

@oscarbenjamin (Collaborator Author)

It should be possible to reuse the C implementation of functools.lru_cache if we wanted something that was fixed size, but it would be a little pre-emptive IMO. We could also expose some cache busting functions.

The question is whether we guarantee uniqueness of contexts or not. A cache that has a maximum size or that can be cleared cannot guarantee uniqueness.

It is not necessarily critical that we should guarantee uniqueness but just it has some other implications for how you compare the contexts. Suppose for example that we have a gr_ type with some big hairy complicated context. The gr.__add__(a, b) method needs to check that the arguments have matching contexts as a first step in every operation. With unique contexts that comparison is always just a pointer comparison like a.ctx is b.ctx. Otherwise the comparison needs to compare by value using the context's __eq__ methods.

On the other hand if we don't guarantee uniqueness then it becomes a lot easier to ensure that memory is cleared etc. The idea I suggested above about moving the dict management up to the Zmod level would basically mean that we compare contexts by value at this lower level and don't ensure uniqueness, but then at the Zmod level we make it very unlikely that the contexts are not unique and can ensure that the cost of creating them is low, perhaps by using lru_cache or similar.
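
i.e. something like this at the higher level (a sketch; get_zmod_contexts and make_zmod_contexts are placeholder names for whatever wrapper ends up holding the three contexts):

from functools import lru_cache

@lru_cache(maxsize=256)        # bounded cache: rarely used moduli eventually get evicted
def get_zmod_contexts(n):
    # Hypothetical factory for the object holding the nmod/poly/mat contexts
    # for modulus n.  With eviction possible, uniqueness is only "very likely",
    # so the lower-level contexts would compare by value rather than identity.
    return make_zmod_contexts(n)   # hypothetical constructor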

The approach used in fmpz_mod_ctx is that uniqueness is not guaranteed but identity is checked as a first step to handle the expected common case:

def __eq__(ctx1, ctx2):
    if ctx1 is ctx2:
        return True
    else:
        return _compare_by_value(ctx1, ctx2)

Then in the happy fast path where the contexts match it is not much worse than a pointer comparison. I don't think any timings have been measured to see how that compares with having guaranteed unique contexts and not defining __eq__ at all as I did for nmod_ctx here.

@GiacomoPope (Contributor)

I had a look through this and it seems good -- I don't think I have anything big or urgent to add to what has already been discussed.

@oscarbenjamin changed the base branch from master to main on September 2, 2024 at 15:41
@oscarbenjamin (Collaborator Author)

The last two rebases were just because of merge conflicts after all the linting PRs and this time I pushed a commit to handle some remaining lint complaints in this branch.

@GiacomoPope (Contributor)

IMO I think we can include this, with the only "maybe" being the caching of the context, because I'm not sure what's best here: whether nmod_ctx should be cached or just some wrapper Zmod later down the line.

@oscarbenjamin (Collaborator Author)

It is simple enough to change the caching of the context later. The basic question that needs resolving is whether the context can be attached to an nmod as a field (attribute). There are two kinds of type in python-flint:

  1. Types like fmpz where every instance of the type is compatible with every other.
  2. Types like nmod, fmpz_mod etc where some additional data (e.g. modulus) is needed to know if the elements are compatible.

In the first case a context can always just be a global object and so no data needs to be attached to instances to identify a context. In the second case we either need to attach a pointer to the context object or some data that identifies the context and could be used to construct a context object on demand.

There is at least in principle some overhead in attaching a pointer to the context object because of reference counting. I don't know whether it might also be problematic in a multithreaded scenario to have all instances sharing a context across multiple threads:

https://peps.python.org/pep-0703/#deferred-reference-counting

Some types like gr will definitely need to have pointers to some sort of context object. For other types like nmod we can possibly have a pointer but we could also choose not to if it seems to have unacceptable overhead and instead we just attach an nmod_t directly to each instance which simplifies memory management for instances.

Since nmod is the smallest type, I think we can say that if the overhead is acceptable for nmod then it is acceptable for all types. If that is the case then every type either uses a global context or has the context attached as a direct attribute. That is a simple and clean model in which retrieving a context from an object is always a quick operation in either Python or Cython code. Then many low-level operations can quickly retrieve the context and use it to do something such as performing coercions or accessing cached data about whether an operation is well defined. The same patterns can be applied across the codebase consistently for all types and some code can be shared generically in context superclasses.

If we can't attach the context object directly at the C level then we need to have some way to get a context from an object but it potentially needs to be created on demand (or looked up from a cache). Then for e.g. nmod retrieving the context as an object is not a quick operation any more at the C level. Then we need to have special case code for some types to handle things that would have been handled by the context objects.

Then again we need to special case the types that don't have a context object anyway. Otherwise we could just do:

cdef class flint_elem:
    cdef flint_ctx ctx

Then it would always be fast in Cython to access the ctx field of any Flint type directly from the instance struct. That would make it possible to have fast Cython code that operates with different types generically. For this to work though fmpz and fmpq etc would need to have the ctx field in the instance struct as well.

@GiacomoPope (Contributor)

If all elements have an associated context, then we can use these context objects for the coercion. E.g. we can have a ZZ and QQ (global) context which handle coercion to integers, in the same way that our nmod_ctx and fmpz_mod contexts work. So we would change any_to_fmpz to something like ZZ.any_to_fmpz

@oscarbenjamin (Collaborator Author)

So we would change any_to_fmpz to something like ZZ.any_to_fmpz

Yes, exactly. There can be a global fmpz_ctx object with a static any_as_fmpz method.
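
i.e. roughly this (a sketch only; the signature shown is made up, the point is just that the existing any_as_fmpz body moves onto a context object as a static method):

cdef class fmpz_ctx:

    @staticmethod
    cdef any_as_fmpz(obj):
        # Same body as the current module-level any_as_fmpz helper;
        # being a cdef @staticmethod it is still a plain C function.
        ...

cdef fmpz_ctx ZZ_ctx = fmpz_ctx()   # single global context shared by all fmpz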

Cases where the context is not just a global object can use a @cython.final method for the same purpose (as nmod_ctx does in this PR). If you look at the C code generated here you can see that this final method ends up just being a normal C function as do any static methods:

static int __pyx_f_5flint_5types_4nmod_8nmod_ctx_any_as_nmod(
    struct __pyx_obj_5flint_5types_4nmod_nmod_ctx *__pyx_v_ctx, 
    mp_limb_t *__pyx_v_val,
    PyObject *__pyx_v_obj
) 

In itself attaching those methods to the context objects does not give any particular benefit because at the Cython level you still need to statically type what sort of context object is being called and what sort of type (fmpz, nmod etc) is being created. I just tried to find a way of implementing that here that hopefully does not slow anything down.

The benefit of attaching all of this to the contexts is that you can then have more generic Cython or Python code that uses the context objects to do conversions and coercions. You need the context objects so that you have a way of saying what it is you are converting to or from in a generic context like:

ctx.convert(element)
ctx1.convert_to(element, ctx2)
ctx1.convert_from(element, ctx2)
ctx3 = ctx1.unify(ctx2)
poly2 = poly1.convert_to(ctx2)
mat2 = mat1.convert_to(ctx2)
...

Also once you have the context objects they can start to share some generic code.

@GiacomoPope (Contributor)

Yeah, I really like the idea of this design decision. Having the contexts done like this will help with coercion, which is one of the bottlenecks in Sage due to all the various parsing needed between libraries. By solving this for python-flint we should have something significantly faster.

@oscarbenjamin (Collaborator Author)

If you look at the C code generated here you can see that this final method ends up just being a normal C function as do any static methods:

static int __pyx_f_5flint_5types_4nmod_8nmod_ctx_any_as_nmod(
    struct __pyx_obj_5flint_5types_4nmod_nmod_ctx *__pyx_v_ctx, 
    mp_limb_t *__pyx_v_val,
    PyObject *__pyx_v_obj
) 

Actually this is only true in the nmod.pyx module itself. In e.g. nmod_poly.pyx the code accesses this through a struct containing function pointers.

Looking at it, if you want it to be a plain function call everywhere then you need to move the body of these methods into the .pxd file and declare them cdef inline.

I've just pushed a commit to make all nmod_ctx, nmod_poly_ctx, nmod_mat_ctx methods cdef inline so that the compiler can fully inline these functions in all modules.
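
For example the .pxd ends up looking roughly like this (a sketch; cimports omitted and the except value is an assumption):

# nmod.pxd -- method bodies live here so that other modules can inline them

cdef class nmod_ctx:
    cdef nmod_t mod
    cdef bint _is_prime

    cdef inline int any_as_nmod(self, mp_limb_t *val, obj) except -1:
        # Body defined in the .pxd and marked inline, so nmod_poly.pyx,
        # nmod_mat.pyx etc. get a real inlinable C function rather than a
        # call through a vtable struct of function pointers.
        ...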
