New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

python3 str to std::string conversion is not automatic #2819

Closed
justinfx opened this Issue Jan 31, 2019 · 5 comments

Comments

Projects
None yet
2 participants
@justinfx
Copy link

justinfx commented Jan 31, 2019

Refs: https://groups.google.com/d/msg/cython-users/oqk3GQ2pJ8M/-oBEvfWXDgAJ

I have a python2 project where the pyx files contain the following directive:

# cython: c_string_type=unicode, c_string_encoding=utf8

In the process of converting to python3, I am finding that even with these directive, the conversion from a python3 str is not automatically encoded to "utf8" bytes when converted to a C++ std::string:

Reproduction:
https://gist.github.com/justinfx/8023d341becc8a1092e5beacd7a249eb

In python3, this results in the following exception:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "test.pyx", line 6, in test.test
  File "stringsource", line 15, in string.from_py.__pyx_convert_string_from_py_std__in_string
TypeError: expected bytes, str found

I have tested this behaviour both in cython 0.28.5 as well as master, using all available language level values.

My expected results would be that given the directives, any implicit assignment/conversion to std::string would automatically encode to 'utf8' bytes.

My current workaround in dealing with the ton of locations where a python string is assigned to a std::string or passed to an argument, or even part of implicit map or list conversions, is to explicitly wrap each site in a conversion helper:

    # cython: c_string_type=unicode, c_string_encoding=utf8

    from libcpp.string cimport string
    from cpython.version cimport PY_MAJOR_VERSION

    cdef unicode _text(s):
        if type(s) is unicode:
            return <unicode>s

        elif PY_MAJOR_VERSION < 3 and isinstance(s, bytes):
            return (<bytes>s).decode('ascii')
        
        elif isinstance(s, unicode):
            return unicode(s)
        
        else:
            raise TypeError("Could not convert to unicode.")

    cdef string _string(basestring s) except *:
        cdef string c_str = _text(s).encode("utf-8")
        return c_str

    # ...
    self.field = _string(s)

This has been error prone since I keep overlooking hard to spot type conversions. It would be amazing for the behaviour in Cython to be updated to support automatic conversions based on my directives.

@scoder

This comment has been minimized.

Copy link
Contributor

scoder commented Feb 1, 2019

Just guessing, but it might be the setting of __PYX_DEFAULT_STRING_ENCODING_IS_DEFAULT in the generated code. For the c_string_encoding names UTF-8/utf8/…, it should be true in Py3, where the default encoding is always utf8.

Could you try changing __PYX_DEFAULT_STRING_ENCODING_IS_DEFAULT to 1 (it should be 0 now) in the generated C file and try that directly under Py3, without running Cython again?

robertwb added a commit that referenced this issue Feb 3, 2019

@justinfx

This comment has been minimized.

Copy link
Author

justinfx commented Feb 4, 2019

@scoder scoder added this to the 3.0 milestone Feb 16, 2019

@scoder scoder closed this Feb 16, 2019

@justinfx

This comment has been minimized.

Copy link
Author

justinfx commented Feb 16, 2019

@scoder, thanks for this! Is this going to be released in a 0.29.6 patch or is it really waiting for the 3.0.0 tag? Just wondering since I am blocked on this fix.

@scoder

This comment has been minimized.

Copy link
Contributor

scoder commented Feb 16, 2019

I think it's on the edge. It sort-of falls into the "new feature" bucket, and I find it a bit difficult to decide whether it's safe to throw at users in a point release, since it changes the behaviour of existing code in a potentially unexpected way. There might be code out there that deals with this use case in its own way already, and I do not know what such code would do with the new auto-conversion.

@justinfx

This comment has been minimized.

Copy link
Author

justinfx commented Feb 16, 2019

I can see from my current approach of manually handling this problem that I am preventing the std::string from receiving a direct assignment of a python string in all cases to avoid a crash. So if this new patch were introduced to my code, I would end up continuing to step around it until I removed the conversion utility. Maybe I am missing something, but without this fix it feels like a bug when you go from working behavior in py2 to crashing in py3, even when the directives imply that the conversion should work. Seems like a pretty important fix for anyone converting to py3 and hitting tons and tons of string conversions. Before this fix, I am still scared that I have missed implicit conversions with args being passed around. I went and removed all the explicit std::string static argument annotations to avoid accidentally passing a py str before its manually converted.

If it's a 3.0.0 thing, then I just have to manually handle my dependency against a commit. My company is behind a proxy and we request local packaging from 3rd party releases that track a semver.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment