New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Infer value type from parametrized dict type? #1979

Open
mrocklin opened this Issue Nov 5, 2017 · 9 comments

Comments

Projects
None yet
3 participants
@mrocklin

mrocklin commented Nov 5, 2017

Consider the following example code that sums the values of a dictionary that maps strings to floats:

def f():
    d: Dict[str, float]
    d = {}

    total: float
    total = 0.0

    # k: str  # I would like to avoid declaring these
    # v: float

    for k in d:
        v = d[k]  # Secondarily, I am also curious on how to accelerate this
        total += v

I find that if I manually declare v to be a float that Cython will nicely remove some Python object code. However, this might also be inferred from the declaration d: Dict[str, float]. I suspect that this kind of type inference is just out of scope for Cython, but I thought I'd ask to see if this was possible somehow.

Also, any additional tips for accelerating repeated dict access like this would be very welcome.

@robertwb

This comment has been minimized.

Show comment
Hide comment
@robertwb

robertwb Nov 5, 2017

Contributor
Contributor

robertwb commented Nov 5, 2017

@scoder

This comment has been minimized.

Show comment
Hide comment
@scoder

scoder Nov 5, 2017

Contributor

First of all, I agree that inferring the item type(s) of containers (especially when they are declared via PEP-484 typing) would be nice, and there's probably already a hugely outdated ticket for that which was written years before PEP-484. Cython's support for PEP-484 typing is definitely still young and in flux. It will have to mature over time. Feedback is always appreciated, as it allows us to see actual use cases that we can improve.

Regarding your example, inferring float would be helpful, since it can be safely aliased to C double. But that's the special case. If the type was int, it wouldn't help, since Cython cannot safely map Python ints to any fixed size C integer type. You could use cython.int or cython.long for that, though, which request specific C integer semantics.

Regarding str keys, there is no speed advantage to be gained for the loop you showed, and as Robert suggested, iterating only over d.values() generates much more efficient code already - if that suits your real use case. But also in other cases, str means "str or a subtype" in Python, but Cython can only make use of "exactly str" for optimisation, as subtypes can override the builtin type behaviour in arbitrary and unpredictable ways. If you type a variable as cdef str k in Cython, it will actually reject subtypes on assignment, for exactly that reason. That's why we currently ignore PEP-484 style annotations of type str. They simply don't have sufficiently tight semantics, similar to the int issue above.

The same subtyping issue applies also to a Dict annotation vs. cdef dict d, but this is again a special case that could be exploited at least for looping. The runtime special casing code is already mostly there as we apply an optimistic optimisation whenever we see code that loops over something.items() etc., which has a very high probability of something turning out to be exactly a dict in practice.

Contributor

scoder commented Nov 5, 2017

First of all, I agree that inferring the item type(s) of containers (especially when they are declared via PEP-484 typing) would be nice, and there's probably already a hugely outdated ticket for that which was written years before PEP-484. Cython's support for PEP-484 typing is definitely still young and in flux. It will have to mature over time. Feedback is always appreciated, as it allows us to see actual use cases that we can improve.

Regarding your example, inferring float would be helpful, since it can be safely aliased to C double. But that's the special case. If the type was int, it wouldn't help, since Cython cannot safely map Python ints to any fixed size C integer type. You could use cython.int or cython.long for that, though, which request specific C integer semantics.

Regarding str keys, there is no speed advantage to be gained for the loop you showed, and as Robert suggested, iterating only over d.values() generates much more efficient code already - if that suits your real use case. But also in other cases, str means "str or a subtype" in Python, but Cython can only make use of "exactly str" for optimisation, as subtypes can override the builtin type behaviour in arbitrary and unpredictable ways. If you type a variable as cdef str k in Cython, it will actually reject subtypes on assignment, for exactly that reason. That's why we currently ignore PEP-484 style annotations of type str. They simply don't have sufficiently tight semantics, similar to the int issue above.

The same subtyping issue applies also to a Dict annotation vs. cdef dict d, but this is again a special case that could be exploited at least for looping. The runtime special casing code is already mostly there as we apply an optimistic optimisation whenever we see code that loops over something.items() etc., which has a very high probability of something turning out to be exactly a dict in practice.

@scoder

This comment has been minimized.

Show comment
Hide comment
@scoder

scoder Nov 5, 2017

Contributor

I enabled the iteration optimisation for given Dict type declarations. It's a bit hackish, but it's a start.

Contributor

scoder commented Nov 5, 2017

I enabled the iteration optimisation for given Dict type declarations. It's a bit hackish, but it's a start.

@mrocklin

This comment has been minimized.

Show comment
Hide comment
@mrocklin

mrocklin Nov 5, 2017

Regarding str keys, there is no speed advantage to be gained for the loop you showed, and as Robert suggested, iterating only over d.values() generates much more efficient code already - if that suits your real use case. But also in other cases, str means "str or a subtype" in Python, but Cython can only make use of "exactly str" for optimisation, as subtypes can override the builtin type behaviour in arbitrary and unpredictable ways. If you type a variable as cdef str k in Cython, it will actually reject subtypes on assignment, for exactly that reason. That's why we currently ignore PEP-484 style annotations of type str. They simply don't have sufficiently tight semantics, similar to the int issue above.

That's interesting. In my particular case I'm quite willing to be restrictive in my use of types in order to gain performance. My objective is to retain Python interpretability while still gaining some Cython speedups, so the PEP-484 style annotations are quite attractive. I think I read that for cython.int I can also accomplish this by using int and providing a declaration as a file-level comment. Are other similar options available for other similar situations?

mrocklin commented Nov 5, 2017

Regarding str keys, there is no speed advantage to be gained for the loop you showed, and as Robert suggested, iterating only over d.values() generates much more efficient code already - if that suits your real use case. But also in other cases, str means "str or a subtype" in Python, but Cython can only make use of "exactly str" for optimisation, as subtypes can override the builtin type behaviour in arbitrary and unpredictable ways. If you type a variable as cdef str k in Cython, it will actually reject subtypes on assignment, for exactly that reason. That's why we currently ignore PEP-484 style annotations of type str. They simply don't have sufficiently tight semantics, similar to the int issue above.

That's interesting. In my particular case I'm quite willing to be restrictive in my use of types in order to gain performance. My objective is to retain Python interpretability while still gaining some Cython speedups, so the PEP-484 style annotations are quite attractive. I think I read that for cython.int I can also accomplish this by using int and providing a declaration as a file-level comment. Are other similar options available for other similar situations?

@scoder

This comment has been minimized.

Show comment
Hide comment
@scoder

scoder Nov 5, 2017

Contributor

My current assumption for PEP-484 annotations is that they are primarily targeted at static type checking instead of compilation, as that's what they were designed for. Their interpretation by Cython should not conflict with their meaning in Python code.

I think, what we could do at some point, is to automatically generate fused types for builtin Python types, so that int would be aliased as a fused type of all (relevant) C integer types and object, and str etc. would be an alias for "str or object". Meaning, we would optimise the code for the expected case, and generate a generic fallback implementation for the case of arbitrary subtypes. That would give you quite a bit of code explosion (because int wouldn't necessarily be the same for different variables, so you'd end up with the cross product of all declared types), and (sadly) also slower function calls due to the intermediate dispatch, but it would allow to optimise the code in functions while still keeping up full Python compatibility.

We could also extend the meaning of the infer_types=True compiler directive to infer PEP-484/526 annotations with C semantics where possible. Without that option, we only infer safe types that do not break Python semantics. Currently, if this option is enabled, Cython is already free to assume that an inferred expression like C int + C int does not overflow and should result in a C int. Cython could then infer an int annotation as meaning C long.

Contributor

scoder commented Nov 5, 2017

My current assumption for PEP-484 annotations is that they are primarily targeted at static type checking instead of compilation, as that's what they were designed for. Their interpretation by Cython should not conflict with their meaning in Python code.

I think, what we could do at some point, is to automatically generate fused types for builtin Python types, so that int would be aliased as a fused type of all (relevant) C integer types and object, and str etc. would be an alias for "str or object". Meaning, we would optimise the code for the expected case, and generate a generic fallback implementation for the case of arbitrary subtypes. That would give you quite a bit of code explosion (because int wouldn't necessarily be the same for different variables, so you'd end up with the cross product of all declared types), and (sadly) also slower function calls due to the intermediate dispatch, but it would allow to optimise the code in functions while still keeping up full Python compatibility.

We could also extend the meaning of the infer_types=True compiler directive to infer PEP-484/526 annotations with C semantics where possible. Without that option, we only infer safe types that do not break Python semantics. Currently, if this option is enabled, Cython is already free to assume that an inferred expression like C int + C int does not overflow and should result in a C int. Cython could then infer an int annotation as meaning C long.

@mrocklin

This comment has been minimized.

Show comment
Hide comment
@mrocklin

mrocklin Nov 5, 2017

My current assumption for PEP-484 annotations is that they are primarily targeted at static type checking instead of compilation, as that's what they were designed for. Their interpretation by Cython should not conflict with their meaning in Python code.

I'm inclined to rewrite this as "Python types should not be misinterpreted for performance reasons".

Type annotations were, I think, built without a defining use case. Static type checking certainly came first, but my understanding was that the annotations themselves were intended to be general purpose and used by a variety of projects later on in the language's lifetime. Cython's use here seems like a potentially excellent example. If I'm able to use Cython without deviating from the Python language then that will significantly impact the way and the extent to which I integrate Cython into all of my projects.

That being said, I appreciate that my request to interpret int -> cython.int is dangerous. Maybe if I want to operate in this way then I need to accept cython as a runtime dependency.

mrocklin commented Nov 5, 2017

My current assumption for PEP-484 annotations is that they are primarily targeted at static type checking instead of compilation, as that's what they were designed for. Their interpretation by Cython should not conflict with their meaning in Python code.

I'm inclined to rewrite this as "Python types should not be misinterpreted for performance reasons".

Type annotations were, I think, built without a defining use case. Static type checking certainly came first, but my understanding was that the annotations themselves were intended to be general purpose and used by a variety of projects later on in the language's lifetime. Cython's use here seems like a potentially excellent example. If I'm able to use Cython without deviating from the Python language then that will significantly impact the way and the extent to which I integrate Cython into all of my projects.

That being said, I appreciate that my request to interpret int -> cython.int is dangerous. Maybe if I want to operate in this way then I need to accept cython as a runtime dependency.

@scoder

This comment has been minimized.

Show comment
Hide comment
@scoder

scoder Nov 5, 2017

Contributor

accept cython as a runtime dependency

And we should finally make it straight forward for user packages to depend on Cython's shadow module (cython.py a.k.a. Cython/Shadow.py) if they support optional Cython compilation. #1981

Contributor

scoder commented Nov 5, 2017

accept cython as a runtime dependency

And we should finally make it straight forward for user packages to depend on Cython's shadow module (cython.py a.k.a. Cython/Shadow.py) if they support optional Cython compilation. #1981

@mrocklin

This comment has been minimized.

Show comment
Hide comment
@mrocklin

mrocklin Nov 5, 2017

Regarding str keys, there is no speed advantage to be gained for the loop you showed, and as Robert suggested, iterating only over d.values() generates much more efficient code already - if that suits your real use case. But also in other cases, str means "str or a subtype" in Python, but Cython can only make use of "exactly str" for optimisation, as subtypes can override the builtin type behaviour in arbitrary and unpredictable ways. If you type a variable as cdef str k in Cython, it will actually reject subtypes on assignment, for exactly that reason. That's why we currently ignore PEP-484 style annotations of type str. They simply don't have sufficiently tight semantics, similar to the int issue above.

I actually do care about accelerating dict.__getitem__ or dict.get, and not just iterating over dict values. In my case my objects are definitely Python strings, and not subtypes. I would be happy to annotate with k: cython.str if necessary.

mrocklin commented Nov 5, 2017

Regarding str keys, there is no speed advantage to be gained for the loop you showed, and as Robert suggested, iterating only over d.values() generates much more efficient code already - if that suits your real use case. But also in other cases, str means "str or a subtype" in Python, but Cython can only make use of "exactly str" for optimisation, as subtypes can override the builtin type behaviour in arbitrary and unpredictable ways. If you type a variable as cdef str k in Cython, it will actually reject subtypes on assignment, for exactly that reason. That's why we currently ignore PEP-484 style annotations of type str. They simply don't have sufficiently tight semantics, similar to the int issue above.

I actually do care about accelerating dict.__getitem__ or dict.get, and not just iterating over dict values. In my case my objects are definitely Python strings, and not subtypes. I would be happy to annotate with k: cython.str if necessary.

@scoder

This comment has been minimized.

Show comment
Hide comment
@scoder

scoder Nov 5, 2017

Contributor

I tuned a couple of percent off the item lookup. You can try the latest master.

There is also an intermediate commit here, where I tried to special case dict lookups of str and unicode keys in Py3.5+. But it turned out to be slower for me, so undid the change later. Can't say why, so if you want to give it a try on your side, you could at least confirm that this is not worth pursuing. You'll have to use cdef typing on the string key variable or the decorator @cython.locals(xyzkey=str) to trigger it (you should see calls to ..._GetItemUnicode in the C file). It suggests to me that typing as str is probably not worth it for lookups.

Contributor

scoder commented Nov 5, 2017

I tuned a couple of percent off the item lookup. You can try the latest master.

There is also an intermediate commit here, where I tried to special case dict lookups of str and unicode keys in Py3.5+. But it turned out to be slower for me, so undid the change later. Can't say why, so if you want to give it a try on your side, you could at least confirm that this is not worth pursuing. You'll have to use cdef typing on the string key variable or the decorator @cython.locals(xyzkey=str) to trigger it (you should see calls to ..._GetItemUnicode in the C file). It suggests to me that typing as str is probably not worth it for lookups.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment