Document CKAN Unicode handling.
torfsen committed Jun 6, 2016
1 parent 8e6aab9 commit a0a8f68
Showing 3 changed files with 174 additions and 0 deletions.
1 change: 1 addition & 0 deletions doc/contributing/index.rst
@@ -34,6 +34,7 @@ of contributions to CKAN:
javascript
python
string-i18n
unicode
testing
frontend/index

13 changes: 13 additions & 0 deletions doc/contributing/python.rst
@@ -108,6 +108,19 @@ replacement field, for example::

.. _new .format() method: http://docs.python.org/2/library/stdtypes.html#str.format


Unicode handling
----------------
CKAN strives to only use Unicode internally (via the ``unicode`` type) and to
convert to/from ASCII at the interface to other systems and libraries if
necessary.

.. seealso::

   :doc:`unicode`
     Details on Unicode handling in CKAN


.. _docstrings:

Docstrings
160 changes: 160 additions & 0 deletions doc/contributing/unicode.rst
@@ -0,0 +1,160 @@
================
Unicode handling
================
This document explains how Unicode and related issues are handled in CKAN.
For a general introduction to Unicode and Unicode handling in Python 2 please
read the `Python 2 Unicode HOWTO`_. Since Unicode handling differs greatly
between Python 2 and Python 3 you might also be interested in the
`Python 3 Unicode HOWTO`_.

.. _Python 2 Unicode HOWTO: https://docs.python.org/2/howto/unicode.html
.. _Python 3 Unicode HOWTO: https://docs.python.org/3/howto/unicode.html

.. note::

    This document describes the intended future state of Unicode handling in
    CKAN. For historic reasons, some existing code does not yet follow the
    rules described here.

    *New code should always comply with the rules in this document. Exceptions
    must be documented.*


Overall Strategy
----------------
CKAN only uses Unicode internally (``unicode`` on Python 2). Conversion to/from
ASCII strings happens at the boundary to other systems/libraries if necessary.
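
The strategy above can be illustrated with a minimal sketch. The helper names
``to_unicode`` and ``to_bytes`` are hypothetical (they are not part of CKAN),
and the UTF-8 default is an assumption made for the example; the real encoding
depends on the system on the other side of the boundary.

```python
# encoding: utf-8


def to_unicode(value, encoding=u'utf-8'):
    """Decode a byte string received from outside; pass text through."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def to_bytes(value, encoding=u'utf-8'):
    """Encode text just before handing it to a byte-oriented interface."""
    if isinstance(value, bytes):
        return value
    return value.encode(encoding)


# Bytes as they might arrive from a file, socket or third-party library.
raw = b'Z\xc3\xbcrich'

# Decode once on the way in and work with text from then on.
title = to_unicode(raw)

# Encode again only when the value leaves the program.
outgoing = to_bytes(title)
```

Keeping both conversions in two small functions makes the boundary easy to
find when reading the code; everything between them only ever sees text.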


Encoding of Python files
------------------------
Files containing Python source code (``*.py``) must be encoded using UTF-8, and
the encoding must be declared using the following header::

    # encoding: utf-8

This line must be the first or second line in the file. See `PEP 263`_ for
details.

.. _PEP 263: https://www.python.org/dev/peps/pep-0263/
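
As a rough sketch of how such a header could be checked automatically, the
following hypothetical helper (``declared_encoding`` is not a CKAN function)
applies the pattern given in PEP 263 to the first two lines of a file. It is a
simplification: it expects the lines as byte strings and ignores the other
details of PEP 263, such as byte order marks.

```python
# encoding: utf-8
import re

# Pattern from PEP 263 for recognising an encoding declaration.
CODING_RE = re.compile(br'^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)')


def declared_encoding(lines):
    """Return the encoding declared in the first two lines, or None.

    ``lines`` is a list of byte strings, one per source line.
    """
    for line in lines[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1).decode(u'ascii')
    return None


with_header = [b'#!/usr/bin/env python', b'# encoding: utf-8', b'import os']
too_late = [b'', b'', b'# encoding: utf-8']

found = declared_encoding(with_header)    # declaration on the second line
missing = declared_encoding(too_late)     # third line is not considered
```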


String literals
---------------
String literals are string values given directly in the source code (as opposed
to strings read from a file, received as an argument, etc.). In Python 2,
string literals have type ``str`` by default. Adding a ``u`` prefix turns them
into ``unicode``. In addition, the ``b`` prefix can be used to explicitly mark
a literal as ``str``::

    x = "I'm a str literal"
    y = u"I'm a unicode literal"
    z = b"I'm also a str literal"

In CKAN, every string literal must carry either a ``u`` or a ``b`` prefix.
While the latter is redundant in Python 2, it makes the developer's intention
explicit and eases a future migration to Python 3.

This rule also holds for *raw strings*, which are created using an ``r``
prefix. Simply use ``ur`` instead::

    m = re.match(ur'A\s+Unicode\s+pattern', u'A Unicode pattern')

For more information on string prefixes please refer to the
`Python documentation`_.

.. _Python documentation: https://docs.python.org/2.7/reference/lexical_analysis.html#string-literals

.. note::

    The ``unicode_literals`` `future statement`_ is *not* used in CKAN.

.. _future statement: https://docs.python.org/2/reference/simple_stmts.html#future


Best Practices
--------------

Use ``io.open`` to open text files
```````````````````````````````````
When opening text (not binary) files, use `io.open`_ instead of ``open``. It
lets you specify the file's encoding, and reads then return ``unicode`` instead
of ``str``::

    import io

    with io.open(b'my_file.txt', u'r', encoding=u'utf-8') as f:
        text = f.read()  # contents are automatically decoded
                         # to unicode using UTF-8

.. _io.open: https://docs.python.org/2/library/io.html#io.open

Text files should be encoded using UTF-8 if possible.
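
The same applies to writing. The sketch below writes text through ``io.open``
and reads it back in both binary and text mode to show where the encoding and
decoding happen. The temporary file is only a stand-in created for this
illustration.

```python
# encoding: utf-8
import io
import os
import tempfile

text = u'Gr\xfc\xdfe aus K\xf6ln'    # contains three non-ASCII characters

# Scratch file used only for this example.
handle, path = tempfile.mkstemp(suffix=u'.txt')
os.close(handle)
try:
    with io.open(path, u'w', encoding=u'utf-8') as f:
        f.write(text)                # text is encoded to UTF-8 on write

    with io.open(path, u'rb') as f:
        raw = f.read()               # binary mode: raw bytes, no decoding

    with io.open(path, u'r', encoding=u'utf-8') as f:
        decoded = f.read()           # text mode: decoded back to Unicode
finally:
    os.remove(path)
```

Note that in text mode ``f.write`` expects Unicode text; passing a byte string
raises a ``TypeError``.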


Normalize strings before comparing them
```````````````````````````````````````
For many characters, Unicode offers multiple representations. For example, a
Latin small letter ``e`` with an acute accent (``é``) can be specified either
using its dedicated code point (`U+00E9`_) or by combining the code points for
``e`` (`U+0065`_) and the accent (`U+0301`_). Both variants look the same but
are different sequences of code points::

    >>> x = u'\N{LATIN SMALL LETTER E WITH ACUTE}'
    >>> y = u'\N{LATIN SMALL LETTER E}\N{COMBINING ACUTE ACCENT}'
    >>> print x, y
    é é
    >>> print repr(x), repr(y)
    u'\xe9' u'e\u0301'
    >>> x == y
    False

.. _U+00E9: http://www.fileformat.info/info/unicode/char/e9
.. _U+0065: http://www.fileformat.info/info/unicode/char/0065
.. _U+0301: http://www.fileformat.info/info/unicode/char/0301

Therefore, if you want to compare two Unicode strings based on their characters
you need to normalize them first using `unicodedata.normalize`_::

    >>> from unicodedata import normalize
    >>> x_norm = normalize(u'NFC', x)
    >>> y_norm = normalize(u'NFC', y)
    >>> print x_norm, y_norm
    é é
    >>> print repr(x_norm), repr(y_norm)
    u'\xe9' u'\xe9'
    >>> x_norm == y_norm
    True

.. _unicodedata.normalize: https://docs.python.org/2/library/unicodedata.html#unicodedata.normalize
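
If such comparisons occur in several places, the normalization step can be
wrapped in a small helper. The function ``same_text`` below is a hypothetical
name used for illustration, not an existing CKAN utility. NFC is used as the
default because it is the form shown above; whether a compatibility form such
as NFKC is more appropriate depends on the use case.

```python
# encoding: utf-8
from unicodedata import normalize


def same_text(a, b, form=u'NFC'):
    """Compare two Unicode strings after normalizing both to ``form``."""
    return normalize(form, a) == normalize(form, b)


composed = u'\xe9'            # dedicated code point U+00E9
decomposed = u'e\u0301'       # U+0065 followed by U+0301

plain_equal = composed == decomposed            # False: code points differ
normalized_equal = same_text(composed, decomposed)

# NFC keeps compatibility characters such as the "fi" ligature (U+FB01)
# distinct from the two letters "fi"; NFKC folds them together.
ligature_nfc = same_text(u'\ufb01', u'fi')
ligature_nfkc = same_text(u'\ufb01', u'fi', form=u'NFKC')
```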


Use the Unicode flag in regular expressions
```````````````````````````````````````````
In Python 2, the character classes of the `re`_ module (``\w``, ``\d``, ...)
only match ASCII characters by default. For example, ``\w`` (alphanumeric
character) does not match ``ö`` unless told otherwise::

    >>> import re
    >>> print re.match(ur'^\w$', u'ö')
    None

Therefore, you need to explicitly activate Unicode mode by passing the `re.U`_
flag::

    >>> print re.match(ur'^\w$', u'ö', re.U)
    <_sre.SRE_Match object at 0xb60ea2f8>

The type of the values returned by ``re.split``, ``re.MatchObject.group``, etc.
depends on the type of the input string::

    >>> re.split(ur'\W+', b'Just a string!', flags=re.U)
    ['Just', 'a', 'string', '']

    >>> re.split(ur'\W+', u'Just some Unicode!', flags=re.U)
    [u'Just', u'some', u'Unicode', u'']

Note that the type of the *pattern string* does not influence the return type.
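
One way to avoid forgetting the flag is to compile a pattern once with
``re.U`` and reuse it. The sketch below does this for a hypothetical ``words``
helper. It spells the pattern with a doubled backslash instead of the ``ur``
prefix so that the snippet also parses on Python 3, where ``ur`` is a syntax
error and the flag is redundant for text patterns.

```python
# encoding: utf-8
import re

# Compiled once with the Unicode flag so that \w also matches
# non-ASCII letters such as "ø" and "ü".
WORD_RE = re.compile(u'\\w+', re.U)


def words(text):
    """Return all runs of word characters found in a Unicode string."""
    return WORD_RE.findall(text)


tokens = words(u'Sm\xf8rrebr\xf8d f\xfcr M\xfcnchen!')
```

Because the input is a Unicode string, every element of ``tokens`` is a
Unicode string as well, in line with the rule on return types above.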

.. _re: https://docs.python.org/2/library/re.html
.. _re.U: https://docs.python.org/2/library/re.html#re.U
