Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

__Pyx_GetItemInt_Fast should prefer mapping over sequence protocol #1807

Closed
jdemeyer opened this issue Aug 7, 2017 · 12 comments
Closed

__Pyx_GetItemInt_Fast should prefer mapping over sequence protocol #1807

jdemeyer opened this issue Aug 7, 2017 · 12 comments

Comments

@jdemeyer
Copy link
Contributor

jdemeyer commented Aug 7, 2017

The Cython code obj[i] where i is a cdef int is translated to a call to __Pyx_GetItemInt_Fast.

This will use the sequence protocol to get obj[i]. More precisely, Py_TYPE(obj)->tp_as_sequence->sq_item(obj, i) will be called.

However, Python always tries the mapping protocol first. Because of this, it can happen that obj[i] in Cython behaves differently from Python. There is nothing in the Python docs that says that the mapping and sequence protocols should agree, so this is a bug.

Simple repro:

def get(obj, int i):
    return obj[i]

class D(dict):
    pass 

This gives:

>>> d = D([(-1, "hello")]); d
{-1: 'hello'}
>>> d[-1]
'hello' 
>>> get(d, -1)
KeyError: 0
@scoder
Copy link
Contributor

scoder commented Aug 7, 2017

I'm sure this was done because the sequence protocol accepts C integer indices (which makes it more efficient if the index is passed as a C integer), but I agree that this is a bug.

@embray
Copy link
Contributor

embray commented Aug 7, 2017

IIRC many types won't have tp_as_mapping anyways so it can still be skipped over easily in that case. But otherwise it should be checked first.

@jdemeyer
Copy link
Contributor Author

jdemeyer commented Aug 8, 2017

The main difference will be for Python (non-extension) types, since __getitem__ is put both in the mapping protocol as well as the sequence protocol. That is also why class D(dict): pass behaves differently from a plain dict.

@embray
Copy link
Contributor

embray commented Aug 8, 2017

Well, yes, for classes with __getitem__ that's definitely true.

@scoder
Copy link
Contributor

scoder commented Aug 8, 2017

But types that use the mapping protocol will require Python conversion of the 'index' value anyway, whether it's called through the sequence or mapping protocol.

There's a tiny performance impact for types that only use the sequence protocol, because Cython would now have to check the mapping protocol first. As long as the work done inside of the getitem method is non-trivial, it won't make a difference, but getitem might actually be really trivial in many important sequence cases.

I think the only thing we can do is: implement, then measure. Open for pull requests.

@seberg
Copy link
Contributor

seberg commented Nov 8, 2023

This seems to have a major impact on speed for indexing NumPy arrays in cython, see pandas-dev/pandas#55179 (comment)

I suppose, NumPy could have an integer fast path earlier on, not quite sure how much that will help. Is that needed?

@da-woods
Copy link
Contributor

da-woods commented Nov 8, 2023

One option would be from cpython.sequence cimport PySequence_GetItem then to use PySequence_GetItem(arr, i) instead of arr[i] where you're certain that the types justify it.

We could definitely look at other options though - I guess in principle a compiler directive might be a possibility. Would be good to see the use-case to think about if there's anything cleverer we should be doing.

@scoder
Copy link
Contributor

scoder commented Nov 8, 2023 via email

@WillAyd
Copy link
Contributor

WillAyd commented Nov 8, 2023

I think @seberg meant to link to pandas-dev/pandas#55179 (comment) above

One option would be from cpython.sequence cimport PySequence_GetItem then to use PySequence_GetItem(arr, i) instead of arr[i] where you're certain that the types justify it.

Thanks for the workaround. FWIW from a pandas perspective where we often turn off boundschecking I have pretty much always expected arr[i] to work just like a pointer offset. The idea of treating that like a mapping has never really crossed my mind

@da-woods
Copy link
Contributor

da-woods commented Nov 8, 2023

boundschecking I have pretty much always expected arr[i] to work just like a pointer offset.

boundschecking only really applies when Cython knows the type well enough to have special address to the internals. That's mainly for memoryviews and a few other known Python types like list, tuple, bytes (but only if Cython knows the variable type). That's a non-exhaustive list but hopefully gives the idea.

If Cython doesn't know the variable type then it just goes through normal Python __getitem__ lookup. In this case boundschecking doesn't do anything because __getitem__ generates the exception.

@seberg
Copy link
Contributor

seberg commented Nov 9, 2023

In the pandas case, we have typed as ndarray, so a custom method just for cython would also work. It isn't yet quite clear to me if this regression is big enough to worry much (at least in most places).
Also since indexing has a lot of additional overheads that are harder to avoid if you don't go down to the buffer level. (something Brock was wondering about whether we could add more API, but it seemed a bit like we need to target it to the use-case and see what we actually need.)

@WillAyd
Copy link
Contributor

WillAyd commented Nov 9, 2023

Here is a pretty trivial MRE:

from numpy cimport ndarray


def check_timing(ndarray arr):
    cdef:
        Py_ssize_t n = len(arr), i
        object val

    for i in range(n):
        val = arr[i]

Compiling with with -O2 optimization with 0.29.34 yields the following timing:

In [1]: import numpy as np
In [2]: from cython_perf import check_timing
In [3]: arr = np.arange(1000)
In [4]: %timeit check_timing(arr)
18.6 µs ± 512 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

Versus 3.0.5

In [1]: import numpy as np
In [2]: from cython_perf import check_timing
In [3]: arr = np.arange(1000)
In [4]: %timeit check_timing(arr)
35.9 µs ± 1.59 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants