For the time being, this page collects use cases for string data handling that Cython may be able to support in a simpler way.
For handling binary data, the best way is to use the bytes data type, which is supported in CPython 2.x (str), 3.x (bytes) and Cython (bytes). The bytes type coerces freely from and to a C char*.
The remaining use cases deal with non-binary data, i.e. text data, names, numbers, etc.
It is expected that the C library only accepts text as a char*.
In Python 2.x, the code needs to deal with both str (bytes) and unicode input, whereas in Python 3 it would only accept unicode strings (str).
For input, the code needs to encode any unicode input (Py2/Py3) to a byte string (bytes) which can be converted to a char*.
For output, the code must either convert char* values to a bytes object (in Py2, if applicable) or decode it to a unicode object (in Py2 and Py3).
The encoding is determined by the requirements of the C library being called, often UTF-8 or plain ASCII.
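As a rough illustration of the input and output handling described above, the following plain-Python sketch normalises strings at the boundary to the C library. The helper names `to_c_bytes` and `from_c_bytes` and the UTF-8 default are assumptions made for this example; they are not part of Cython or of any particular C library.

```python
# Hypothetical helpers; the names and the default encoding are assumptions.
text_type = type(u"")  # unicode in Python 2, str in Python 3


def to_c_bytes(value, encoding="utf-8"):
    """Return a byte string that could be passed on as a char*."""
    if isinstance(value, text_type):
        # Unicode input (Py2/Py3) is encoded explicitly.
        return value.encode(encoding)
    if isinstance(value, bytes):
        # Byte string input (Python 2 str) is assumed to be encoded already.
        return value
    raise TypeError("expected a string, got %s" % type(value).__name__)


def from_c_bytes(data, encoding="utf-8"):
    """Decode a byte string received from C into a unicode string."""
    return data.decode(encoding)
```

Code that targets Python 3 only might prefer to reject byte strings in `to_c_bytes` instead of passing them through.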
When strings are only operated on at the C level (usually as a char*), byte strings are most useful for performance reasons, especially when using a single-byte encoding like ASCII or Latin-1. For input, the same as above applies in this case.
For output, data would likely be held in Python byte string objects already, in which case it may or may not be required to decode to a unicode string, depending on data semantics and platform (Py2/Py3).
Code that only uses Python operations on strings can work equally well with bytes and unicode strings. In Python 2, both types (bytes/unicode) must be supported, whereas in Python 3, only unicode (str) would be supported. No recoding is required on the way in or out, as both Python 2 string types have interfaces that are compatible with the Python 3 unicode type.
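A small sketch of such type-agnostic code, using only operations that byte strings and unicode strings both provide. The function name `first_word_upper` is made up for this example.

```python
def first_word_upper(s):
    """Return the first whitespace-separated word of s in upper case.

    Works on byte strings and unicode strings alike, and returns the
    same string type that was passed in.
    """
    words = s.split()
    # s[:0] is an empty string of the same type as the input.
    return words[0].upper() if words else s[:0]
```

No encoding or decoding step appears anywhere, so the caller gets back whatever string type was passed in.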
If strings are treated with both Python and C operations, it is best to encode them as byte strings. However, this has the disadvantage in Python 3 that some Python operations on byte strings behave differently from the same operations on unicode strings or on Python 2 byte strings. This may or may not be a problem for a given piece of code. The alternative is to use a unicode string instead and encode it to a byte string explicitly if needed.
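One well-known example of such a behavioural difference is indexing. The following sketch assumes it runs under Python 3; the variable names are only illustrative.

```python
data = b"abc"
text = u"abc"

# Indexing a unicode string yields a one-character string.
first_char = text[0]

# In Python 3, indexing a byte string yields an integer instead.
# (Python 2 would return a one-byte string here.)
first_byte = data[0]

# Slicing keeps the string type on both versions, so it is the
# more portable spelling when a one-byte string is wanted.
first_slice = data[0:1]
```

Iterating over a byte string shows the same difference: Python 3 yields integers, Python 2 yields one-byte strings.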
If byte strings are used, the code must accept both byte strings (str) and unicode strings in Python 2. In Python 3, only unicode strings would be handled. Unicode strings must be encoded on input.
If unicode strings are used, the code must additionally accept and decode byte strings in Python 2.
In any case, either encoding or decoding is required on input.
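For the unicode-based variant, the decoding step on input might look like the following sketch. The helper name `as_text` and the UTF-8 default are assumptions for illustration only.

```python
text_type = type(u"")  # unicode in Python 2, str in Python 3


def as_text(value, encoding="utf-8"):
    """Return value as a unicode string, decoding byte strings if needed."""
    if isinstance(value, text_type):
        return value
    if isinstance(value, bytes):
        # Byte string input (Python 2 str) is decoded on the way in.
        return value.decode(encoding)
    raise TypeError("expected a string, got %s" % type(value).__name__)
```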
It seems that the conversion from char* to bytes/unicode is easily handled using either implicit coercion from char* to bytes or explicit decoding using cstring.decode(enc) or cstring[:length].decode(enc). These patterns are rather intuitive and the decoding is explicit.
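The difference between the NUL-terminated and the length-based conversion can be imitated in plain Python with ctypes. This is only an analogy: the ctypes buffer stands in for a char* in Cython code, and the buffer contents and the length are made up for the example.

```python
import ctypes

# A C buffer holding "abc", an embedded NUL byte, then "def".
buf = ctypes.create_string_buffer(b"abc\x00def")

# Reading up to the first NUL byte, similar to coercing a char* to bytes.
until_nul = buf.value

# Decoding that byte string explicitly, similar to cstring.decode(enc).
decoded = until_nul.decode("ascii")

# Reading an explicit length, similar to cstring[:length].decode(enc).
# The embedded NUL byte is kept in this case.
length = 7
sliced = buf.raw[:length].decode("ascii")
```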
If users want C strings to become either bytes or unicode in Python 2 to improve efficiency on that platform (as opposed to always using unicode), the choice of which strings should be represented as byte strings instead of unicode strings is likely application and context specific, so it is rather unlikely that Cython can automate anything here. There is a proposal to add a new builtin function cython.str(char* [, enc]) that would convert a char* to a byte string in Python 2 and decode it to a unicode string in Python 3, based on a C compile time switch.
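The intended behaviour of that proposal can be approximated in plain Python as follows. This is a sketch only: `native_str` is a hypothetical stand-in, not an existing Cython function, and it switches on the interpreter version at run time, whereas the proposal describes a C compile time switch.

```python
import sys


def native_str(data, encoding="utf-8"):
    """Hypothetical stand-in for the proposed cython.str().

    Returns the platform's native str type: the byte string unchanged
    in Python 2, a decoded unicode string in Python 3.
    """
    if sys.version_info[0] < 3:
        return data
    return data.decode(encoding)
```

On either version the result is an instance of the builtin `str` type, which is the point of the proposal.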
It has been proposed to support some kind of implicit decoding from char* to unicode as an optional alternative to the coercion to bytes. In this case, the original coercion of char* to bytes would be spelled using an explicit cast: <bytes>cstring. The encoding used for the automatic coercion would be provided as a compiler directive.
Coercion can be required in various places: when sending Python strings into a C function, when receiving string input from Python space, or when assigning Python strings to char* variables. In all of these situations, users have to handle the bytes/unicode ambiguity in Python 2. In Python 3, text strings are always represented as unicode strings.