# Unicode

## 1 The World and Unicode

There are five unavoidable facts to take into consideration:

**Computers are built on Bytes**

It's not possible to send Unicode to someone: the only thing possible is to send bytes. You don't send text: you send bytes.

Because bytes are meaningless on their own, to send text to people we decided to assign meaning to bytes. The best-known convention is the *ASCII* code, which takes 95 graphic symbols and maps them to byte values. Other codings assigned more values: *ISO 8859-1* added 96 more characters and *Windows-1252* added 27 more characters.
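As a small illustration of how one byte can mean different things under different single-byte tables, the sketch below decodes two bytes with Python's `ascii`, `latin-1` (ISO 8859-1) and `cp1252` (Windows-1252) codecs. The variable names are illustrative only, and the `b''` and `u''` prefixes are used on the assumption that the snippet should behave the same on Python 2.7 and Python 3.

```python
# One byte, several tables: the meaning depends on the table you pick.
e_acute = b'\xe9'   # a value in the range ISO 8859-1 added on top of ASCII
euro = b'\x80'      # a value in the range Windows-1252 filled with extra symbols

latin1_e = e_acute.decode('latin-1')   # u'\xe9', LATIN SMALL LETTER E WITH ACUTE
cp1252_e = e_acute.decode('cp1252')    # same character: the two tables agree here
cp1252_euro = euro.decode('cp1252')    # u'\u20ac', EURO SIGN
latin1_ctrl = euro.decode('latin-1')   # u'\x80', an invisible control character

# Plain ASCII assigns no meaning at all to byte values above 127.
try:
    e_acute.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
```

The byte `0x80` is where the two extended tables disagree, which is exactly the kind of mismatch that produces garbled text when the wrong table is assumed.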

**256 Symbols are not enough for the world to communicate text**

There are a lot of languages in the world and it's not possible to ask everyone to speak English!

To deal with more than 256 symbols the first idea was to create new single-byte codes. This was infeasible because many different codes were created at the same time, not to mention alphabets with more than 256 symbols. The next idea was to use two-byte codes, but this caused problems too.

We ended up creating *Unicode*. Unicode is fundamentally a giant catalogue of symbols assigned to integers (called *code points*). The structure of Unicode allows 1.1M code points, of which 110K have been assigned so far, so we have used nearly 10% of the available space and covered all the world's languages.
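To make the "symbols assigned to integers" idea concrete, here is a minimal sketch using the built-in `ord()` and the standard `unicodedata` module. The two sample characters are arbitrary picks from the example string used later in this notebook, and the variable names are only illustrative.

```python
import unicodedata

# Unicode assigns each symbol an integer; ord() reveals it.
euro = u'\u20ac'
eta = u'\u03b7'

euro_number = ord(euro)               # 8364, usually written U+20AC
euro_name = unicodedata.name(euro)    # 'EURO SIGN'
eta_name = unicodedata.name(eta)      # 'GREEK SMALL LETTER ETA'

# The integers run from 0 to 0x10FFFF, which gives the "1.1M" slots.
code_space_size = 0x10FFFF + 1        # 1114112

print(euro_number)
print(euro_name)
print(code_space_size)
```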

What we have to do is figure out how to map these 1.1M integers to bytes. This is done by encodings: an encoding is a way to map integers to bytes and vice versa.

There are a number of different encodings (UTF-8, UTF-16, UTF-32, UCS-2, UCS-4) but the king of all encodings is *UTF-8*.

**UTF-8** is the most important encoding you must know about: it is variable-length (from one to four bytes per code point) and ASCII characters are still represented by one byte. This means that ASCII strings are valid UTF-8 strings.
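The variable-length property can be checked directly. The sketch below encodes four sample characters (chosen here only because they need one, two, three and four UTF-8 bytes respectively) and compares the sizes with UTF-16 and UTF-32. The `-le` codec variants are used so that no byte-order mark is added to the counts.

```python
# Hypothetical samples, written with escapes so the source stays pure ASCII.
samples = [
    u'A',            # U+0041, plain ASCII
    u'\u03b7',       # U+03B7, Greek eta
    u'\u20ac',       # U+20AC, euro sign
    u'\U0001F600',   # U+1F600, beyond the first 65536 code points
]

utf8_lengths = [len(ch.encode('utf-8')) for ch in samples]       # [1, 2, 3, 4]
utf16_lengths = [len(ch.encode('utf-16-le')) for ch in samples]  # [2, 2, 2, 4]
utf32_lengths = [len(ch.encode('utf-32-le')) for ch in samples]  # [4, 4, 4, 4]

# A pure-ASCII byte string decodes to the same text as ASCII and as UTF-8.
ascii_bytes = b'Hello World!'
same_text = ascii_bytes.decode('ascii') == ascii_bytes.decode('utf-8')  # True
```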

## 2 Encoding and decoding in Python 2.x

In Python 2.x there are two distinct datatypes: one for storing *bytes*, which is called `str`, and one for storing *Unicode code points*, called `unicode`.

In [3]:
my_string = 'Hello World!'
print type(my_string)

<type 'str'>


In [4]:
my_unicode_codepoints = u'Ṳηїcod€ †ℯ✖t'
print type(my_unicode_codepoints)
print repr(my_unicode_codepoints)
print my_unicode_codepoints

<type 'unicode'>
u'\u1e72\u03b7\u0457cod\u20ac \u2020\u212f\u2716t'
Ṳηїcod€ †ℯ✖t


A *Unicode* string has an `encode` method that turns code points into bytes, and a byte string has a `decode` method that turns bytes into code points:

* **decode()** BYTES -> CODEPOINTS
* **encode()** CODEPOINTS -> BYTES


In [5]:
my_utf8_bytes = my_unicode_codepoints.encode('utf-8') # bytes
print repr(my_utf8_bytes)

'\xe1\xb9\xb2\xce\xb7\xd1\x97cod\xe2\x82\xac \xe2\x80\xa0\xe2\x84\xaf\xe2\x9c\x96t'


In [6]:
print repr(my_utf8_bytes.decode('utf-8')) # Codepoints
print my_utf8_bytes.decode('utf-8')

u'\u1e72\u03b7\u0457cod\u20ac \u2020\u212f\u2716t'
Ṳηїcod€ †ℯ✖t
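A quick way to convince yourself that the two methods are inverses, and that code points and bytes are counted differently, is a round trip. This is a minimal sketch that reuses the sample string from the cells above, written with escapes; the variable names are illustrative.

```python
# The sample text from the cells above, written with escapes.
text = u'\u1e72\u03b7\u0457cod\u20ac \u2020\u212f\u2716t'

utf8_bytes = text.encode('utf-8')         # CODEPOINTS -> BYTES
round_trip = utf8_bytes.decode('utf-8')   # BYTES -> CODEPOINTS

n_code_points = len(text)     # 12
n_bytes = len(utf8_bytes)     # 24
lossless = round_trip == text # True: UTF-8 can represent every code point
```

The two lengths differ because most of the characters in the sample need two or three bytes each in UTF-8.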


Most errors occur because the codec can't convert the data. In this case the ASCII encoder finds that some code points are outside the 0-127 range:

In [7]:
my_unicode_codepoints.encode('ASCII')

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
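If you need to handle the failure in code rather than read the traceback, the exception object carries the same details. The sketch below catches the error and reads its standard attributes (`encoding`, `start`, `end`, `reason`, `object`); the variable names are only illustrative.

```python
text = u'\u1e72\u03b7\u0457cod\u20ac \u2020\u212f\u2716t'

try:
    text.encode('ascii')
    failure = None
except UnicodeEncodeError as exc:
    # The exception records which codec failed, where, and why.
    failure = (exc.encoding, exc.start, exc.end, exc.reason)
    # exc.end is exclusive, so this slice holds the offending characters.
    bad_slice = exc.object[exc.start:exc.end]

print(failure)
```

Here `start` is 0 and `end` is 3, which matches the "position 0-2" reported in the message.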

A common workaround is to tell the encoder to replace, ignore, or convert the unencodable characters to XML character references:

In [8]:
print my_unicode_codepoints.encode('ASCII', 'replace')
print my_unicode_codepoints.encode('ASCII', 'ignore')
print my_unicode_codepoints.encode('ASCII', 'xmlcharrefreplace')

???cod? ???t
cod t
&#7794;&#951;&#1111;cod&#8364; &#8224;&#8495;&#10006;t
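The three handlers are not equivalent in what they preserve. As a rough sketch (variable names are illustrative), the code below checks that `replace` and `ignore` discard information, while each `xmlcharrefreplace` reference is simply the decimal code point of the original character.

```python
text = u'\u1e72\u03b7\u0457cod\u20ac \u2020\u212f\u2716t'

replaced = text.encode('ascii', 'replace')            # b'???cod? ???t'
ignored = text.encode('ascii', 'ignore')              # b'cod t'
xml_refs = text.encode('ascii', 'xmlcharrefreplace')

# 'replace' and 'ignore' are lossy: the original text cannot be recovered.
lossy = replaced.decode('ascii') != text and ignored.decode('ascii') != text

# Each XML reference is the decimal code point of the character it replaces.
first_ref = '&#%d;' % ord(text[0])                    # '&#7794;'
starts_with_ref = xml_refs.decode('ascii').startswith(first_ref)
```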


The same error handlers are available when decoding:

In [9]:
print my_utf8_bytes.decode('ASCII', 'replace')

�������cod��� ���������t
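When decoding with `replace`, every byte the ASCII codec cannot interpret becomes the replacement character U+FFFD. The following sketch counts those markers for the sample bytes; it assumes the same sample text as above and uses illustrative variable names.

```python
utf8_bytes = u'\u1e72\u03b7\u0457cod\u20ac \u2020\u212f\u2716t'.encode('utf-8')

replaced = utf8_bytes.decode('ascii', 'replace')
ignored = utf8_bytes.decode('ascii', 'ignore')    # u'cod t'

# One U+FFFD per undecodable byte: 24 bytes in total, 5 of them ASCII.
n_markers = replaced.count(u'\ufffd')             # 19

try:
    utf8_bytes.decode('ascii')                    # the default handler is 'strict'
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

The marker count is per byte, not per character, which is why it is larger than the number of non-ASCII characters in the original text.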


## 3 Implicit conversion in Python 2.x

Python 2.x implicitly decodes byte strings when bytes and Unicode are mixed together. For example, if we concatenate a Unicode string with a byte string, the byte string is decoded using the default encoding, which in this case is ASCII:

In [10]:
import sys
print sys.getdefaultencoding()
u'Hello ' + 'World!'

ascii


u'Hello World!'

This is equivalent to:

In [11]:
u'Hello ' + ('World!'.decode('ascii'))

u'Hello World!'

This automatic decoding only works when every byte string is pure ASCII (the usual system default). In all other cases it fails in ways that are hard to track down. Printing has a related pitfall: `print` performs no decoding on a byte string and writes the bytes out as they are, so the result looks right only if the terminal or notebook happens to interpret them as UTF-8 (as in the output here). Decoding explicitly removes that dependency:

In [12]:
my_unicode_codepoints = u'Ṳηїcod€ †ℯ✖t'
print my_unicode_codepoints                # Unicode (codepoints) print correctly
my_utf8_bytes = my_unicode_codepoints.encode('utf-8')
print my_utf8_bytes                        # Bytes look right only if the output expects UTF-8
print my_utf8_bytes.decode('utf-8')        # Force the right decoding

Ṳηїcod€ †ℯ✖t
Ṳηїcod€ †ℯ✖t
Ṳηїcod€ †ℯ✖t
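The failure mode of implicit decoding can be reproduced explicitly. In Python 2, `u'Hello ' + some_bytes` quietly runs `some_bytes.decode('ascii')`; the sketch below writes that step out by hand so it behaves the same on Python 2.7 and Python 3 (where mixing the two types is refused outright). The variable names are illustrative.

```python
utf8_bytes = u'\u1e72\u03b7\u0457'.encode('utf-8')

# This is the decode Python 2 would attempt implicitly when concatenating.
try:
    combined = u'Hello ' + utf8_bytes.decode('ascii')
except UnicodeDecodeError as exc:
    combined = None
    error_position = exc.start   # 0: the very first byte is not ASCII

# Decoding explicitly with the real encoding avoids the problem.
fixed = u'Hello ' + utf8_bytes.decode('utf-8')
```

The error appears far from where the byte string was created, which is what makes implicit conversion so hard to debug.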


## 4 Bytes Outside, Unicode Inside

The most important rule when dealing with Unicode is to convert incoming data to Unicode as early as possible (as soon as we receive the raw bytes) and to convert back to bytes only just before sending data out of our program. In other words:

* **do a decode()** bytes -> codepoints **as soon as possible**
* **do an encode()** codepoints -> bytes **as late as possible**
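The two rules above can be sketched as a tiny pipeline. Everything here is an assumption made for illustration: the `shout` function stands in for whatever your program really does, the `io.BytesIO` objects stand in for files or sockets, and UTF-8 is assumed to be the encoding agreed with the outside world.

```python
import io

def shout(text):
    """Hypothetical processing step: works on code points, never on bytes."""
    return text.upper()

# Simulated input and output streams; a real program would use files or sockets.
incoming = io.BytesIO(u'caf\xe9 costs \u20ac5'.encode('utf-8'))
outgoing = io.BytesIO()

raw = incoming.read()                    # bytes arrive from outside
text = raw.decode('utf-8')               # decode as early as possible
result = shout(text)                     # all internal work happens on Unicode
outgoing.write(result.encode('utf-8'))   # encode as late as possible
```

Keeping the decode and encode calls at the edges means the processing function never has to know which encoding was used.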

*But remember: you can't infer encoding from bytes: someone has to tell you the encoding*
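To see why the encoding cannot be inferred, consider that the same bytes decode without any error under several codecs and simply yield different text. A minimal sketch, with illustrative variable names:

```python
raw = b'\xc3\xa9'

as_utf8 = raw.decode('utf-8')       # u'\xe9': one character, e with acute accent
as_latin1 = raw.decode('latin-1')   # u'\xc3\xa9': two characters, A-tilde and copyright sign
as_cp1252 = raw.decode('cp1252')    # the same two characters as latin-1

# None of the calls failed, so the bytes alone cannot tell us which is right.
candidates = set([as_utf8, as_latin1, as_cp1252])
```

Only out-of-band information (a header, a file format specification, or a person) can say which of the candidates was intended.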

*TEST with text generators:* the best way to check if your code works well is to feed it with unicode strings and not just with ASCII strings. If you want to find a Text Generator just type *"fancy text generators"* on Google.

If you want to get an idea of what's inside Unicode, just use the Python `unichr()` function. Have fun!

In [13]:
start = 10000
for i in xrange(start, start+500):
    print unichr(i),

✐ ✑ ✒ ✓ ✔ ✕ ✖ ✗ ✘ ✙ ✚ ✛ ✜ ✝ ✞ ✟ ✠ ✡ ✢ ✣ ✤ ✥ ✦ ✧ ✨ ✩ ✪ ✫ ✬ ✭ ✮ ✯ ✰ ✱ ✲ ✳ ✴ ✵ ✶ ✷ ✸ ✹ ✺ ✻ ✼ ✽ ✾ ✿ ❀ ❁ ❂ ❃ ❄ ❅ ❆ ❇ ❈ ❉ ❊ ❋ ❌ ❍ ❎ ❏ ❐ ❑ ❒ ❓ ❔ ❕ ❖ ❗ ❘ ❙ ❚ ❛ ❜ ❝ ❞ ❟ ❠ ❡ ❢ ❣ ❤ ❥ ❦ ❧ ❨ ❩ ❪ ❫ ❬ ❭ ❮ ❯ ❰ ❱ ❲ ❳ ❴ ❵ ❶ ❷ ❸ ❹ ❺ ❻ ❼ ❽ ❾ ❿ ➀ ➁ ➂ ➃ ➄ ➅ ➆ ➇ ➈ ➉ ➊ ➋ ➌ ➍ ➎ ➏ ➐ ➑ ➒ ➓ ➔ ➕ ➖ ➗ ➘ ➙ ➚ ➛ ➜ ➝ ➞ ➟ ➠ ➡ ➢ ➣ ➤ ➥ ➦ ➧ ➨ ➩ ➪ ➫ ➬ ➭ ➮ ➯ ➰ ➱ ➲ ➳ ➴ ➵ ➶ ➷ ➸ ➹ ➺ ➻ ➼ ➽ ➾ ➿ ⟀ ⟁ ⟂ ⟃ ⟄ ⟅ ⟆ ⟇ ⟈ ⟉ ⟊ ⟋ ⟌ ⟍ ⟎ ⟏ ⟐ ⟑ ⟒ ⟓ ⟔ ⟕ ⟖ ⟗ ⟘ ⟙ ⟚ ⟛ ⟜ ⟝ ⟞ ⟟ ⟠ ⟡ ⟢ ⟣ ⟤ ⟥ ⟦ ⟧ ⟨ ⟩ ⟪ ⟫ ⟬ ⟭ ⟮ ⟯ ⟰ ⟱ ⟲ ⟳ ⟴ ⟵ ⟶ ⟷ ⟸ ⟹ ⟺ ⟻ ⟼ ⟽ ⟾ ⟿ ⠀ ⠁ ⠂ ⠃ ⠄ ⠅ ⠆ ⠇ ⠈ ⠉ ⠊ ⠋ ⠌ ⠍ ⠎ ⠏ ⠐ ⠑ ⠒ ⠓ ⠔ ⠕ ⠖ ⠗ ⠘ ⠙ ⠚ ⠛ ⠜ ⠝ ⠞ ⠟ ⠠ ⠡ ⠢ ⠣ ⠤ ⠥ ⠦ ⠧ ⠨ ⠩ ⠪ ⠫ ⠬ ⠭ ⠮ ⠯ ⠰ ⠱ ⠲ ⠳ ⠴ ⠵ ⠶ ⠷ ⠸ ⠹ ⠺ ⠻ ⠼ ⠽ ⠾ ⠿ ⡀ ⡁ ⡂ ⡃ ⡄ ⡅ ⡆ ⡇ ⡈ ⡉ ⡊ ⡋ ⡌ ⡍ ⡎ ⡏ ⡐ ⡑ ⡒ ⡓ ⡔ ⡕ ⡖ ⡗ ⡘ ⡙ ⡚ ⡛ ⡜ ⡝ ⡞ ⡟ ⡠ ⡡ ⡢ ⡣ ⡤ ⡥ ⡦ ⡧ ⡨ ⡩ ⡪ ⡫ ⡬ ⡭ ⡮ ⡯ ⡰ ⡱ ⡲ ⡳ ⡴ ⡵ ⡶ ⡷ ⡸ ⡹ ⡺ ⡻ ⡼ ⡽ ⡾ ⡿ ⢀ ⢁ ⢂ ⢃ ⢄ ⢅ ⢆ ⢇ ⢈ ⢉ ⢊ ⢋ ⢌ ⢍ ⢎ ⢏ ⢐ ⢑ ⢒ ⢓ ⢔ ⢕ ⢖ ⢗ ⢘ ⢙ ⢚ ⢛ ⢜ ⢝ ⢞ ⢟ ⢠ ⢡ ⢢ ⢣ ⢤ ⢥ ⢦ ⢧ ⢨ ⢩ ⢪ ⢫ ⢬ ⢭ ⢮ ⢯ ⢰ ⢱ ⢲ ⢳ ⢴ ⢵ ⢶ ⢷ ⢸ ⢹ ⢺ ⢻ ⢼ ⢽ ⢾ ⢿ ⣀ ⣁ ⣂ ⣃ ⣄ ⣅ ⣆ ⣇ ⣈ ⣉ ⣊ ⣋ ⣌ ⣍ ⣎ ⣏ ⣐ ⣑ ⣒ ⣓ ⣔ ⣕ ⣖ ⣗ ⣘ ⣙ ⣚ ⣛ ⣜ ⣝ ⣞ ⣟ ⣠ ⣡ ⣢ ⣣ ⣤ ⣥ ⣦ ⣧ ⣨ ⣩ ⣪ ⣫ ⣬ ⣭ ⣮ ⣯ ⣰ ⣱ ⣲ ⣳ ⣴ ⣵ ⣶ ⣷ ⣸ ⣹ ⣺ ⣻ ⣼ ⣽ ⣾ ⣿ ⤀ ⤁ ⤂ ⤃


For more information check this video from PyCon 2012:

In [14]:
# Video credit: Ned Batchelder
from IPython.display import YouTubeVideo
YouTubeVideo('sgHbC6udIqc', width=640, height=480)

---

Visit [www.add-for.com](<http://www.add-for.com/IT>) for more tutorials and updates
This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.