Status: Accepted and implemented, Cython 0.12. Later changes are listed below.
The handling of strings in Cython has been a topic of much discussion. [Link to some of the more significant thread postings...] Opinions range from wanting things to be as trivial as possible for the user (with implicit encoding and decoding of unicode happening under the hood) to the attitude that if you're not explicitly dealing with encodings (and prefixing Py2 strings with 'u' for unicode) your code is already broken. The shift from byte strings to unicode strings in Python 3 further complicates matters.
This proposal only treats how string literals, and the str builtin, are handled in Python 3.
Note that it does not impact the behaviour of the unicode_literals future import, which will instruct the parser to read all unprefixed string literals as if they were prefixed with a u. When the future import is used, this CEP no longer applies.
Unmarked string literals, when used in a Python context, would be of the environment's str type (byte strings in Py2, unicode strings in Py3). Both the u"some unicode string" and b"some byte string" syntaxes would be supported in Cython.
Also, the str builtin in Cython would get mapped to the str builtin in both Python 2 and Python 3. The builtin names bytes and unicode would be provided by Cython under both environments (with the obvious identification with str in each case).
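The identification of the three builtin names can be pictured from plain Python. This is only a minimal sketch of the intended mapping, not Cython's implementation; the variable names bytes_type, unicode_type and native_str_type are made up for illustration.

```python
import sys

# Illustrative only: which runtime type each builtin name would refer to,
# depending on the Python major version the compiled module runs under.
if sys.version_info[0] >= 3:
    bytes_type = bytes      # Py3: bytes is a type of its own
    unicode_type = str      # Py3: unicode is identified with str
else:
    bytes_type = str        # Py2: bytes is identified with str
    unicode_type = unicode  # noqa: F821 (this name only exists on Python 2)

# str always maps to the str builtin of the environment.
native_str_type = str
```

In either environment exactly one of bytes_type and unicode_type coincides with the native str type.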
String literals used in a C context would remain as they are now.
The transition would happen with a warning emitted for unmarked string literals for the 0.11.3 release, with a changeover in 0.12.
Conversion to/from char* can only happen via the bytes type. An attempted assignment between str/unicode and char* should lead to a meaningful error message, such as "Coercing str to char* is not portable to Python 3, please use the bytes type instead" and "Coercing unicode to char* requires an explicit encoding step".
Assignments between the three string types will be forbidden, i.e. bytes != str != unicode != bytes. The only exception is an assignment from a literal str to a bytes or char* type such as in
    cdef bytes b = "abcdefg"
    cdef char* s = "abcdefg"
This makes sense considering that the bytes type is more or less the Python representation of the char* type.
It will not be allowed to do this:
    cdef unicode u = "abcdefg"
Assignments to a unicode type will require a unicode literal on the right.
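The assignment rules above can be summarised as a small decision function. The following Python sketch is a model of the rules for illustration, not compiler code: the function name check_assignment, its string-based type tags and the generic fallback message are assumptions, while the two specific messages are the ones suggested in this proposal.

```python
def check_assignment(target, source, source_is_literal=False):
    """Return None if the assignment is allowed, else an error message.

    target and source are one of 'bytes', 'str', 'unicode' or 'char*'.
    """
    if target == source:
        return None
    # char* converts to and from the bytes type only.
    if {target, source} == {'bytes', 'char*'}:
        return None
    # The single exception: a literal str may initialise bytes or char*.
    if source == 'str' and source_is_literal and target in ('bytes', 'char*'):
        return None
    if source == 'str' and target == 'char*':
        return ("Coercing str to char* is not portable to Python 3, "
                "please use the bytes type instead")
    if source == 'unicode' and target == 'char*':
        return "Coercing unicode to char* requires an explicit encoding step"
    # Hypothetical fallback wording for all other forbidden combinations.
    return "Cannot assign %s to %s" % (source, target)
```

For example, check_assignment('bytes', 'str', source_is_literal=True) returns None, whereas check_assignment('unicode', 'str', source_is_literal=True) returns an error message.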
Letting non-identifier (i.e. non-ASCII) strings switch their type at runtime introduces various issues with encodings. Following PEP 3120, the default source code encoding in Cython is UTF-8, and users can override this encoding as described in PEP 263.
Problems arise when users enter plain strings containing escaped characters that are not representable in the current source code encoding. In this case, there is no way to determine what bytes should be used in the Python 2 byte string. Example:
    # encoding: ASCII
    b = b'abc\xFF'
    s = 'abc\xFF'
    u = u'abc\xFF'
Both the byte string (b) and the unicode string (u) have well defined content. If \xFF is interpreted as byte value, the str value (s) has well defined byte content in Python 2, but it does not have a meaningful content under Python 3, as there is no indication on how to interpret \xFF as a unicode character. If, on the other hand, \xFF is interpreted as unicode character value, the str value becomes well defined in Python 3, but cannot be represented as a byte sequence in Python 2.
Depending on whether the string is read as unicode string or byte string by the parser, further issues can arise. Imagine this case:
    # encoding: UTF-8
    s = 'abc\xFF'
If \xFF is interpreted as byte sequence here, the string cannot be decoded as UTF-8. If it is interpreted as a unicode character, i.e. the string is read as a unicode string, the character can be encoded to UTF-8 to result in a valid two byte sequence. However, it is not clear if this is what a user expects. Also, Python 2.6 reads the above string as a four byte string, not a five byte UTF-8 string.
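The two readings of the escape can be compared directly in plain Python. This snippet only illustrates the byte counts discussed above; the variable names are chosen for the example.

```python
# Reading 1: \xFF is a byte value, giving a four byte string.
as_bytes = b'abc\xFF'
assert len(as_bytes) == 4

# Reading 2: \xFF is the unicode character U+00FF, which encodes to
# two bytes in UTF-8, giving a five byte string.
as_text = u'abc\xFF'
utf8 = as_text.encode('utf-8')
assert utf8 == b'abc\xc3\xbf'
assert len(utf8) == 5

# The four byte reading is not valid UTF-8, so it cannot be decoded.
try:
    as_bytes.decode('utf-8')
    decodable = True
except UnicodeDecodeError:
    decodable = False
assert not decodable
```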
The only obvious way to prevent both confusion and runtime errors is to disallow unprefixed strings that cannot be decoded under the current source code encoding. Following Python 2 string semantics, Cython will therefore read str literals as byte strings, and then try to decode them using the source code encoding at compile time. All strings that fail to decode will be rejected.
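The compile time check described here amounts to a decode attempt on the byte string form of the literal. A minimal sketch in Python, assuming a hypothetical helper named check_str_literal (the actual check inside the Cython compiler may be organised differently):

```python
def check_str_literal(raw_bytes, source_encoding='utf-8'):
    """Decode an unprefixed literal, or reject it at compile time.

    raw_bytes is the literal read as a byte string, with escapes such
    as \\xFF already resolved to byte values.
    """
    try:
        return raw_bytes.decode(source_encoding)
    except UnicodeDecodeError:
        raise ValueError(
            "string literal cannot be decoded using the source "
            "encoding %r" % source_encoding)
```

With this rule, b'abc' passes under any ASCII compatible encoding, while b'abc\xFF' is rejected under both ASCII and UTF-8.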
Currently, one has the mapping
and str, when used in a Python 3 context, gets mapped to the bytes type.
Most users, when they type "some string", do not want to be bothered with details such as encodings and unicode--they just want a chunk of text. It is most convenient for this string to be in the "native" string format of the environment. This is especially true when the string in question is only used in a Python context and only interacts with other str objects from the surrounding environment, and equally applies to less obvious literals such as docstrings and keyword argument dictionary keys.
The bytes object is a poor substitute for str. It has very different semantics and (by design) does not mix with Py3 strings without explicit encoding and decoding. In most use cases, it is also very rare that a bytes object was actually intended. Though it makes sense to have to deal with explicit encodings and bytes objects when converting to and from the char* type, it is cumbersome and unnecessary to require dealing with them otherwise. Though the Py2 str type has no exact equivalent in Python 3, it is, from an API perspective, more like the Py3 str than the Py3 bytes.
- Simplicity -- When a user types a string literal, they usually just want "some text" and don't want to be bothered with details such as encoding, especially if it's only going to interact with other str objects from the environment.
- Consistency -- currently all other builtins (range, zip, etc.) from Python 3 get mapped to their Python 3 equivalents, despite any semantic changes.
- Completeness -- currently there is no way to get a str object in both Python 2 and Python 3.
- Cleanliness -- currently the 'b' prefix carries no meaning (despite missing the above case).
- Reduce confusion -- The current semantics of "abc", "%s" % x and str(10), when compiled with Cython and built against Python 3, are both confusing and unlike either Python 2 or Python 3.
- Prevent future breakage -- Currently, if a Cython module is written with unprefixed literals, errors may not ever show up when used with Python 2, but it is prone to breaking when others port/try to use it from Python 3.
- Backwards incompatible -- people relying on getting bytes objects for their string literals under Python 3 will have to add b prefixes.
- More lenient -- currently, to get something working nicely with both Python 2 and Python 3, one is forced to use unicode prefixes explicitly in the Cython sources. (Though this is often good coding practice, in many cases it may be unnecessary, and it's not really Cython's place to enforce it.)
    def charpoly(self, algorithm='a', **kwds):
        """
        Compute the charpoly of self.
        """
        if 'message' in kwds:
            print "Computing charpoly... ", kwds['message']
        if algorithm == 'a':
            ...
        elif algorithm == 'b':
            ...
        else:
            raise ValueError, "Unknown algorithm: %s" % algorithm
Cython 0.13 introduced a syntax mode that is compatible with Python 3 ("-3" option to cython). Amongst other things, this mode changes unprefixed string literals to Unicode strings. However, unprefixed literals are often used for C (char*) strings as well. It was not considered user friendly to require prefixing all C strings with 'b', the Python byte string prefix. It was therefore decided to keep support for unprefixed C string literals as follows.
- The parser was changed to parse unprefixed literals as both Unicode strings and byte strings in parallel.
- When such a literal is used in a C string context, it is replaced by the byte string literal.
- When Python 3 mode is enabled, byte strings that contain non-ASCII literal characters are rejected, following Python 3 syntax. Escaped byte values are not affected. This effectively disables support for automatically coercing non-ASCII Unicode literals to byte strings in Python 3 mode.
- [TODO] When Unicode character escapes are found in unprefixed string literals used in a C string context, a compile time warning is issued, as these will not be resolved in the resulting byte string. It is therefore likely that the escape sequence was written down in error. This warning is easy to work around by either escaping the leading backslash of the escape sequence or by prefixing the string literal with a 'b', thus clearly marking it as an intended byte string.
The result in Python 2 mode is that unprefixed string literals that are used in a C context continue to be ported over to the C code in their exact byte sequence, just like Python byte strings.
The result in Python 3 mode is that unprefixed plain ASCII strings can be used as literals in a C string context. They may contain arbitrary escaped byte values but no non-ASCII literal characters. Their C string representation is the exact byte string equivalent of the original literal.
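The parallel parsing of unprefixed literals can be modelled in a few lines of Python 3. This is a rough sketch under simplifying assumptions, not the Cython parser: the function names parse_unprefixed and c_string are hypothetical, only \xHH escapes are handled, and escaped backslashes are ignored.

```python
import re

_HEX_ESCAPE = re.compile(r'\\x([0-9A-Fa-f]{2})')


def parse_unprefixed(body, py3_mode, source_encoding='utf-8'):
    """Return (unicode_value, bytes_value) for the body of a literal.

    bytes_value is None when no byte string form is permitted.
    """
    text_parts, byte_parts = [], []
    bytes_ok = True
    # split() with a capturing group alternates literal text (even
    # indices) and the hex digits of each escape (odd indices).
    for index, piece in enumerate(_HEX_ESCAPE.split(body)):
        if index % 2:
            value = int(piece, 16)
            text_parts.append(chr(value))      # a unicode character
            byte_parts.append(bytes([value]))  # a single byte value
        else:
            text_parts.append(piece)
            if py3_mode and any(ord(ch) > 127 for ch in piece):
                bytes_ok = False               # non-ASCII literal character
            else:
                byte_parts.append(piece.encode(source_encoding))
    data = b''.join(byte_parts) if bytes_ok else None
    return ''.join(text_parts), data


def c_string(body, py3_mode):
    """Pick the byte string form when the literal is used as a C string."""
    _, data = parse_unprefixed(body, py3_mode)
    if data is None:
        raise ValueError("non-ASCII literal characters are not allowed "
                         "in a byte string")
    return data
```

In this model, c_string('abc\\xFF', py3_mode=True) yields the four byte string b'abc\xff', while a literal containing a non-ASCII character such as 'ä' is only rejected once it is used as a C string in Python 3 mode.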