# Chapter 4 — Unicode Text Versus Bytes

## Character Issues

#### Example 4-1. Encoding and decoding

In [1]:
s = 'café'
len(s)

4

In [2]:
b = s.encode('utf8')
b

b'caf\xc3\xa9'

In [3]:
len(b)

5

In [4]:
b.decode('utf8')

'café'

## Byte Essentials

#### Example 4-2. A five-byte sequence as `bytes` and as `bytearray`

In [5]:
cafe = bytes('café', encoding='utf_8')
cafe

b'caf\xc3\xa9'

In [6]:
cafe[0]

99

In [7]:
cafe[:1]

b'c'

In [8]:
cafe_arr = bytearray(cafe)
cafe_arr

bytearray(b'caf\xc3\xa9')

In [9]:
cafe_arr[-1:]

bytearray(b'\xa9')

In [10]:
bytes.fromhex('31 4B CE A9')

b'1K\xce\xa9'

#### Example 4-3. Initializing bytes from the raw data of an array

In [11]:
import array
numbers = array.array('h', [-2, -1, 0, 1, 2])
octets = bytes(numbers)
octets

b'\xfe\xff\xff\xff\x00\x00\x01\x00\x02\x00'
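The byte sequence in Example 4-3 reflects the machine's native byte order, because `array.array` stores its items in native order. As a minimal sketch, assuming you want the same bytes regardless of platform, the standard `struct` module can request a byte order explicitly. The variable names here are illustrative only.

```python
import array
import struct
import sys

values = [-2, -1, 0, 1, 2]

# array.array stores items in the machine's native byte order.
native = bytes(array.array('h', values))

# struct lets us choose the byte order: '<' is little-endian,
# '>' is big-endian; '5h' means five signed 16-bit integers.
little = struct.pack('<5h', *values)
big = struct.pack('>5h', *values)

print(sys.byteorder)
print(little)
print(big)
# On common platforms, where a C short is 2 bytes, this prints True.
print(native == (little if sys.byteorder == 'little' else big))
```

The output in Example 4-3 matches `little`, which suggests it was produced on a little-endian machine.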

### Basic Encoders/Decoders

#### Example 4-4. The string “El Niño” encoded with three codecs producing very different byte sequences

In [12]:
for codec in ['latin_1', 'utf_8', 'utf_16']:
    print(codec, 'El Niño'.encode(codec), sep='\t')

latin_1	b'El Ni\xf1o'
utf_8	b'El Ni\xc3\xb1o'
utf_16	b'\xff\xfeE\x00l\x00 \x00N\x00i\x00\xf1\x00o\x00'


## Understanding Encode/Decode Problems

### Coping with UnicodeEncodeError

#### Example 4-5. Encoding to bytes: success and error handling

In [13]:
city = 'São Paulo'
city.encode('utf_8')

b'S\xc3\xa3o Paulo'

In [14]:
city.encode('utf_16')

b'\xff\xfeS\x00\xe3\x00o\x00 \x00P\x00a\x00u\x00l\x00o\x00'

In [15]:
city.encode('iso8859_1')

b'S\xe3o Paulo'

In [16]:
city.encode('cp437')

UnicodeEncodeError: 'charmap' codec can't encode character '\xe3' in position 1: character maps to <undefined>

In [17]:
city.encode('cp437', errors='ignore')

b'So Paulo'

In [18]:
city.encode('cp437', errors='replace')

b'S?o Paulo'

In [19]:
city.encode('cp437', errors='xmlcharrefreplace')

b'S&#227;o Paulo'
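The cells in Example 4-5 try one error handler at a time. The following sketch collects the same outcomes side by side, and adds the standard `backslashreplace` handler, which the example does not show. The helper name `encode_with_handlers` is hypothetical.

```python
city = 'S\u00e3o Paulo'  # 'São Paulo'


def encode_with_handlers(text, codec, handlers):
    """Return a dict mapping each error handler to bytes or to the exception."""
    results = {}
    for handler in handlers:
        try:
            results[handler] = text.encode(codec, errors=handler)
        except UnicodeEncodeError as exc:
            # 'strict' (the default) raises instead of substituting.
            results[handler] = exc
    return results


handlers = ['strict', 'ignore', 'replace',
            'xmlcharrefreplace', 'backslashreplace']
results = encode_with_handlers(city, 'cp437', handlers)
for handler, outcome in results.items():
    print(f'{handler:>18} -> {outcome!r}')
```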

### Coping with UnicodeDecodeError

#### Example 4-6. Decoding from bytes to str: success and error handling

In [20]:
octets = b'Montr\xe9al'
octets.decode('cp1252')

'Montréal'

In [21]:
octets.decode('iso8859_7')

'Montrιal'

In [22]:
octets.decode('koi8_r')

'MontrИal'

In [23]:
octets.decode('utf_8')

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 5: invalid continuation byte

In [24]:
octets.decode('utf_8', errors='replace')

'Montr�al'

### SyntaxError when Loading Modules with Unexpected Encoding

#### Example 4-7. [ola.py](ola.py): “Hello, World!” in Portuguese

In [25]:
!python3 ola.py

Olá, Mundo!


### BOM: A Useful Gremlin

In [26]:
u16 = 'El Niño'.encode('utf_16')
u16

b'\xff\xfeE\x00l\x00 \x00N\x00i\x00\xf1\x00o\x00'

In [27]:
list(u16)

[255, 254, 69, 0, 108, 0, 32, 0, 78, 0, 105, 0, 241, 0, 111, 0]

In [28]:
u16le = 'El Niño'.encode('utf_16le')
list(u16le)

[69, 0, 108, 0, 32, 0, 78, 0, 105, 0, 241, 0, 111, 0]

In [29]:
u16be = 'El Niño'.encode('utf_16be')
list(u16be)

[0, 69, 0, 108, 0, 32, 0, 78, 0, 105, 0, 241, 0, 111]
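The cells above show the BOM on the encoding side. As a small sketch of the decoding side, the generic `utf_16` codec consumes a leading BOM, while the explicit `utf_16le` codec keeps it as a U+FEFF character. The `utf_8_sig` codec does the same job for UTF-8 files that start with a BOM. The BOM is built by hand here from `codecs.BOM_UTF16_LE` so the result does not depend on the platform's byte order.

```python
import codecs

text = 'El Ni\u00f1o'  # 'El Niño'

# Little-endian UTF-16 with an explicit BOM prepended.
u16le_with_bom = codecs.BOM_UTF16_LE + text.encode('utf_16le')

# The generic codec reads the BOM to pick the byte order, then drops it.
decoded_generic = u16le_with_bom.decode('utf_16')

# The explicit little-endian codec treats the BOM as ordinary data.
decoded_le = u16le_with_bom.decode('utf_16le')

# utf_8_sig writes a BOM when encoding and skips it when decoding.
u8_sig = text.encode('utf_8_sig')

print(repr(decoded_generic))
print(repr(decoded_le))
print(u8_sig)
print(repr(u8_sig.decode('utf_8_sig')))
```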

## Handling Text Files

#### Example 4-8. A platform encoding issue

In [30]:
open('cafe.txt', 'w', encoding='utf_8').write('café')

4

In [31]:
# Note: passing encoding='cp1252' explicitly reproduces the bug on any platform
open('cafe.txt', encoding='cp1252').read()

'cafÃ©'

#### Example 4-9. Closer inspection of Example 4-8 running on Windows reveals the bug and how to fix it

In [32]:
fp = open('cafe.txt', 'w', encoding='utf_8')
fp

<_io.TextIOWrapper name='cafe.txt' mode='w' encoding='utf_8'>

In [33]:
fp.write('café')

4

In [34]:
fp.close()
import os
os.stat('cafe.txt').st_size

5

In [35]:
# Note: passing encoding='cp1252' explicitly reproduces the issue on any platform
fp2 = open('cafe.txt', encoding='cp1252')
fp2

<_io.TextIOWrapper name='cafe.txt' mode='r' encoding='cp1252'>

In [36]:
fp2.encoding

'cp1252'

In [37]:
fp3 = open('cafe.txt', encoding='utf_8')
fp3

<_io.TextIOWrapper name='cafe.txt' mode='r' encoding='utf_8'>

In [38]:
fp3.read()

'café'

In [39]:
fp4 = open('cafe.txt', 'rb')
fp4

<_io.BufferedReader name='cafe.txt'>

In [40]:
fp4.read()

b'caf\xc3\xa9'
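Examples 4-8 and 4-9 can be condensed into one script. This is a sketch under two assumptions that are not in the original cells: it writes to a temporary directory rather than the current one, and it uses `with` blocks so every file is closed promptly.

```python
import os
import tempfile

text = 'caf\u00e9'  # 'café'

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'cafe.txt')

    # write() returns the number of characters, not bytes.
    with open(path, 'w', encoding='utf_8') as fp:
        chars_written = fp.write(text)

    # The file holds 5 bytes because 'é' takes two bytes in UTF-8.
    size_on_disk = os.stat(path).st_size

    # Reading with the encoding used for writing round-trips the text.
    with open(path, encoding='utf_8') as fp:
        right = fp.read()

    # Reading with a different encoding silently produces mojibake.
    with open(path, encoding='cp1252') as fp:
        wrong = fp.read()

print(chars_written, size_on_disk)
print(repr(right), repr(wrong))
```

Passing `encoding=` on every `open` call for text keeps the result independent of the platform default.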

### Beware of Encoding Defaults

#### Example 4-10. Exploring encoding defaults: [default_encodings.py](default_encodings.py)

In [41]:
!python default_encodings.py  # Unix machine output

 locale.getpreferredencoding() -> 'UTF-8'
                 type(my_file) -> <class '_io.TextIOWrapper'>
              my_file.encoding -> 'UTF-8'
           sys.stdout.isatty() -> True
           sys.stdout.encoding -> 'utf-8'
            sys.stdin.isatty() -> True
            sys.stdin.encoding -> 'utf-8'
           sys.stderr.isatty() -> True
           sys.stderr.encoding -> 'utf-8'
      sys.getdefaultencoding() -> 'utf-8'
   sys.getfilesystemencoding() -> 'utf-8'
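The script `default_encodings.py` is not listed in this notebook. The following is a rough sketch of how a similar report could be produced. It is not the script's actual source: the function name `encoding_defaults` is hypothetical, and it uses a temporary file to find the default encoding that `open` picks for text.

```python
import locale
import sys
import tempfile


def encoding_defaults():
    """Collect the encoding-related defaults of the running interpreter."""
    # Open a text file without an encoding argument to see the default.
    with tempfile.TemporaryFile('w') as my_file:
        file_encoding = my_file.encoding

    report = {
        'locale.getpreferredencoding()': locale.getpreferredencoding(),
        'my_file.encoding': file_encoding,
    }
    for stream_name in ('stdout', 'stdin', 'stderr'):
        # A standard stream may be replaced or missing in some environments.
        stream = getattr(sys, stream_name)
        report[f'sys.{stream_name}.encoding'] = getattr(stream, 'encoding', None)
    report['sys.getdefaultencoding()'] = sys.getdefaultencoding()
    report['sys.getfilesystemencoding()'] = sys.getfilesystemencoding()
    return report


for expression, value in encoding_defaults().items():
    print(f'{expression:>30} -> {value!r}')
```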


#### Example 4-12. [stdout_check.py](stdout_check.py)

In [42]:
!python stdout_check.py  # Unix machine output

3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0]

sys.stdout.isatty(): True
sys.stdout.encoding: utf-8

Trying to output HORIZONTAL ELLIPSIS:
…
Trying to output INFINITY:
∞
Trying to output CIRCLED NUMBER FORTY TWO:
㊷


## Normalizing Unicode for Reliable Comparisons

In [43]:
s1 = 'café'
s2 = 'cafe\N{COMBINING ACUTE ACCENT}'
s1, s2

('café', 'café')

In [44]:
len(s1), len(s2)

(4, 5)

In [45]:
s1 == s2

False

In [46]:
from unicodedata import normalize
s1 = 'café'
s2 = 'cafe\N{COMBINING ACUTE ACCENT}'
len(s1), len(s2)

(4, 5)

In [47]:
len(normalize('NFC', s1)), len(normalize('NFC', s2))

(4, 4)

In [48]:
len(normalize('NFD', s1)), len(normalize('NFD', s2))

(5, 5)

In [49]:
normalize('NFC', s1) == normalize('NFC', s2)

True

In [50]:
normalize('NFD', s1) == normalize('NFD', s2)

True

In [51]:
from unicodedata import name
ohm = '\u2126'
name(ohm)

'OHM SIGN'

In [52]:
ohm_c = normalize('NFC', ohm)
name(ohm_c)

'GREEK CAPITAL LETTER OMEGA'

In [53]:
ohm == ohm_c

False

In [54]:
normalize('NFC', ohm) == normalize('NFC', ohm_c)

True

In [55]:
half = '\N{VULGAR FRACTION ONE HALF}'
print(half)

½


In [56]:
normalize('NFKC', half)

'1⁄2'

In [57]:
for char in normalize('NFKC', half):
    print(char, name(char), sep='\t')

1	DIGIT ONE
⁄	FRACTION SLASH
2	DIGIT TWO


In [58]:
four_squared = '4²'
normalize('NFKC', four_squared)

'42'

In [59]:
micro = 'µ'
micro_kc = normalize('NFKC', micro)
micro, micro_kc

('µ', 'μ')

In [60]:
ord(micro), ord(micro_kc)

(181, 956)

In [61]:
name(micro), name(micro_kc)

('MICRO SIGN', 'GREEK SMALL LETTER MU')

### Case Folding

In [62]:
micro = 'µ'
name(micro)

'MICRO SIGN'

In [63]:
micro_cf = micro.casefold()
name(micro_cf)

'GREEK SMALL LETTER MU'

In [64]:
micro, micro_cf

('µ', 'μ')

In [65]:
eszett = 'ß'
name(eszett)

'LATIN SMALL LETTER SHARP S'

In [66]:
eszett_cf = eszett.casefold()
eszett, eszett_cf

('ß', 'ss')

### Utility Functions for Normalized Text Matching

#### Example 4-13. [normeq.py](normeq.py): normalized Unicode string comparison

In [67]:
from normeq import nfc_equal
s1 = 'café'
s2 = 'cafe\u0301'
s1 == s2

False

In [68]:
nfc_equal(s1, s2)

True

In [69]:
nfc_equal('A', 'a')

False

In [70]:
from normeq import fold_equal
s3 = 'Straße'
s4 = 'strasse'
s3 == s4

False

In [71]:
nfc_equal(s3, s4)

False

In [72]:
fold_equal(s3, s4)

True

In [73]:
fold_equal(s1, s2)

True

In [74]:
fold_equal('A', 'a')

True
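The module `normeq.py` is imported above but not listed. A minimal sketch that is consistent with the results shown, assuming both functions normalize with NFC and that `fold_equal` additionally applies `str.casefold`, could look like this:

```python
from unicodedata import normalize


def nfc_equal(str1, str2):
    """Compare two strings after NFC normalization (case-sensitive)."""
    return normalize('NFC', str1) == normalize('NFC', str2)


def fold_equal(str1, str2):
    """Compare two strings after NFC normalization and case folding."""
    return (normalize('NFC', str1).casefold() ==
            normalize('NFC', str2).casefold())


print(nfc_equal('caf\u00e9', 'cafe\u0301'))
print(fold_equal('Stra\u00dfe', 'strasse'))
```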

### Extreme “Normalization”: Taking Out Diacritics

#### Example 4-14. [simplify.py](simplify.py): function to remove all combining marks

In [75]:
from simplify import shave_marks
order = '“Herr Voß: • ½ cup of Œtker™ caffè latte • bowl of açaí.”'
shave_marks(order)

'“Herr Voß: • ½ cup of Œtker™ caffe latte • bowl of acai.”'

In [76]:
Greek = 'Ζέφυρος, Zéfiro'
shave_marks(Greek)

'Ζεφυρος, Zefiro'
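The function `shave_marks` comes from `simplify.py`, which is not listed here. One way to get the behavior shown, sketched on the assumption that the function decomposes with NFD, drops every combining character, and recomposes with NFC:

```python
import unicodedata


def shave_marks(txt):
    """Remove all combining marks (diacritics) from txt."""
    # NFD splits each accented character into base character + marks.
    norm_txt = unicodedata.normalize('NFD', txt)
    # combining() returns 0 for characters that are not combining marks.
    shaved = ''.join(c for c in norm_txt if not unicodedata.combining(c))
    # Recompose whatever can still be composed.
    return unicodedata.normalize('NFC', shaved)


print(shave_marks('caff\u00e8 latte, a\u00e7a\u00ed'))
```

Characters such as `ß`, `½`, and `™` have no canonical decomposition into a base plus combining marks, which is why they pass through unchanged in the output above.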

#### Example 4-16. Function to remove combining marks from Latin characters

In [77]:
from simplify import shave_marks_latin
shave_marks_latin(order)

'“Herr Voß: • ½ cup of Œtker™ caffe latte • bowl of acai.”'

In [78]:
shave_marks_latin(Greek)

'Ζέφυρος, Zefiro'
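The results above show `shave_marks_latin` stripping the accent from `é` but leaving the Greek `έ` alone. A sketch of one way to do that, assuming "Latin" means an ASCII letter as the base character, is to track whether the most recent base character was in `string.ascii_letters`:

```python
import string
import unicodedata


def shave_marks_latin(txt):
    """Remove combining marks only when they follow a Latin base character."""
    norm_txt = unicodedata.normalize('NFD', txt)
    latin_base = False
    preserve = []
    for c in norm_txt:
        if unicodedata.combining(c) and latin_base:
            continue  # drop the mark attached to a Latin base character
        preserve.append(c)
        # A non-combining character starts a new base character.
        if not unicodedata.combining(c):
            latin_base = c in string.ascii_letters
    return unicodedata.normalize('NFC', ''.join(preserve))


print(shave_marks_latin('\u0396\u03ad\u03c6\u03c5\u03c1\u03bf\u03c2, Z\u00e9firo'))
```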

#### Example 4-17. Transform some Western typographical symbols into ASCII

In [79]:
from simplify import dewinize, asciize
dewinize(order)

'"Herr Voß: - ½ cup of OEtker(TM) caffè latte - bowl of açaí."'

In [80]:
asciize(order)

'"Herr Voss: - 1⁄2 cup of OEtker(TM) caffe latte - bowl of acai."'
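The functions `dewinize` and `asciize` are also imported from `simplify.py` without a listing. The sketch below reproduces the two results shown, under stated assumptions: the translation table covers only the symbols that occur in `order` (a real table would cover many more), and `asciize` is assumed to chain symbol replacement, Latin mark removal, an `ß` to `ss` substitution, and NFKC normalization. The helper `_shave_marks_latin` is a local stand-in so the block runs on its own.

```python
import string
import unicodedata

# Illustrative subset of a symbol table: curly quotes, bullet, ligature, TM.
SYMBOL_MAP = str.maketrans({
    '\u201c': '"', '\u201d': '"',   # left and right double quotation marks
    '\u2022': '-',                  # bullet
    '\u0152': 'OE',                 # Latin capital ligature OE
    '\u2122': '(TM)',               # trade mark sign
})


def dewinize(txt):
    """Replace some typographical symbols with ASCII look-alikes."""
    return txt.translate(SYMBOL_MAP)


def _shave_marks_latin(txt):
    """Drop combining marks that follow an ASCII letter."""
    latin_base = False
    preserve = []
    for c in unicodedata.normalize('NFD', txt):
        if unicodedata.combining(c) and latin_base:
            continue
        preserve.append(c)
        if not unicodedata.combining(c):
            latin_base = c in string.ascii_letters
    return unicodedata.normalize('NFC', ''.join(preserve))


def asciize(txt):
    """Apply dewinize, remove Latin diacritics, then NFKC-normalize."""
    no_marks = _shave_marks_latin(dewinize(txt))
    no_marks = no_marks.replace('\u00df', 'ss')
    # NFKC replaces compatibility characters such as the one-half fraction.
    return unicodedata.normalize('NFKC', no_marks)


order = ('\u201cHerr Vo\u00df: \u2022 \u00bd cup of \u0152tker\u2122 '
         'caff\u00e8 latte \u2022 bowl of a\u00e7a\u00ed.\u201d')
print(dewinize(order))
print(asciize(order))
```

Note that the result of `asciize` is not pure ASCII here: NFKC turns `½` into `1⁄2` with a FRACTION SLASH (U+2044), as the output above also shows.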

## Sorting Unicode Text

In [81]:
fruits = ['caju', 'atemoia', 'cajá', 'açaí', 'acerola']
sorted(fruits)
# Ideal result should be: ['açaí', 'acerola', 'atemoia', 'cajá', 'caju']

['acerola', 'atemoia', 'açaí', 'caju', 'cajá']


#### Example 4-19. [locale_sort.py](locale_sort.py): using the `locale.strxfrm` function as the sort key

**Note**: Requires the `pt_BR.UTF-8` locale to be installed.

In [82]:
# !python locale_sort.py
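Because the cell above is commented out, here is a sketch of the technique named in the caption: using `locale.strxfrm` as the sort key. It is not the content of `locale_sort.py`. The function name `sort_with_locale` is hypothetical, and the function returns `None` when the requested locale is not installed instead of raising.

```python
import locale


def sort_with_locale(words, locale_name='pt_BR.UTF-8'):
    """Sort words by the collation rules of locale_name, or return None."""
    # Calling setlocale with one argument returns the current setting.
    previous = locale.setlocale(locale.LC_COLLATE)
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None  # the locale is not available on this system
    try:
        return sorted(words, key=locale.strxfrm)
    finally:
        # setlocale changes global state, so restore the old setting.
        locale.setlocale(locale.LC_COLLATE, previous)


fruits = ['caju', 'atemoia', 'caj\u00e1', 'a\u00e7a\u00ed', 'acerola']
sorted_fruits = sort_with_locale(fruits)
print(sorted_fruits)
```

The actual ordering depends on the locale implementation of the operating system, so results may differ between platforms even when the locale is installed.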

### Sorting with the Unicode Collation Algorithm

#### Example 4-20. Using the pyuca.Collator.sort_key method

In [83]:
import pyuca
coll = pyuca.Collator()
sorted_fruits = sorted(fruits, key=coll.sort_key)
sorted_fruits

['açaí', 'acerola', 'atemoia', 'cajá', 'caju']

## The Unicode Database

### Finding Characters by Name

#### Example 4-21. [cf.py](cf.py): the character finder utility

In [84]:
!python3 charfinder/cf.py smiling cat

U+1F638	😸	GRINNING CAT FACE WITH SMILING EYES
U+1F63A	😺	SMILING CAT FACE WITH OPEN MOUTH
U+1F63B	😻	SMILING CAT FACE WITH HEART-SHAPED EYES
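The source of `cf.py` is not included in this notebook. A minimal sketch of a character finder that yields similar results, assuming a match means every query word appears as a whole word in the character's Unicode name, might be written as follows. The `start` and `stop` parameters are additions for illustration.

```python
import sys
import unicodedata


def find(*query_words, start=32, stop=sys.maxunicode + 1):
    """Yield (code point, char, name) where the name has all query words."""
    query = {word.upper() for word in query_words}
    for code in range(start, stop):
        char = chr(code)
        # name() returns the default for unnamed or unassigned code points.
        name = unicodedata.name(char, '')
        if name and query.issubset(name.split()):
            yield code, char, name


# Scan only the Emoticons block to keep the demo fast.
for code, char, name in find('smiling', 'cat', start=0x1F600, stop=0x1F650):
    print(f'U+{code:04X}\t{char}\t{name}')
```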


### Numeric Meaning of Characters

#### Example 4-22. [numerics_demo.py](numerics_demo.py): Demo of Unicode database numerical character metadata (callouts describe each column in the output)

In [85]:
!python3 numerics_demo.py

U+0031	  1   	re_dig	isdig	isnum	 1.00	DIGIT ONE
U+00bc	  ¼   	-	-	isnum	 0.25	VULGAR FRACTION ONE QUARTER
U+00b2	  ²   	-	isdig	isnum	 2.00	SUPERSCRIPT TWO
U+0969	  ३   	re_dig	isdig	isnum	 3.00	DEVANAGARI DIGIT THREE
U+136b	  ፫   	-	isdig	isnum	 3.00	ETHIOPIC DIGIT THREE
U+216b	  Ⅻ   	-	-	isnum	12.00	ROMAN NUMERAL TWELVE
U+2466	  ⑦   	-	isdig	isnum	 7.00	CIRCLED DIGIT SEVEN
U+2480	  ⒀   	-	-	isnum	13.00	PARENTHESIZED NUMBER THIRTEEN
U+3285	  ㊅   	-	-	isnum	 6.00	CIRCLED IDEOGRAPH SIX
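The columns in the output above can be reproduced with a few standard calls. This sketch is not the source of `numerics_demo.py`. It assumes the columns are, in order: the code point, whether the regular expression `\d` matches, `str.isdigit`, `str.isnumeric`, `unicodedata.numeric`, and `unicodedata.name`. The function name `numeric_profile` is hypothetical.

```python
import re
import unicodedata

RE_DIGIT = re.compile(r'\d')


def numeric_profile(char):
    """Return numeric metadata for a single character as a tuple."""
    return (
        f'U+{ord(char):04x}',
        bool(RE_DIGIT.match(char)),   # does the str pattern \d match?
        char.isdigit(),
        char.isnumeric(),
        unicodedata.numeric(char),    # numeric value as a float
        unicodedata.name(char),
    )


sample = '1\u00bc\u00b2\u0969\u216b'
for char in sample:
    code, re_dig, isdig, isnum, value, name = numeric_profile(char)
    print(code, char.center(6),
          're_dig' if re_dig else '-',
          'isdig' if isdig else '-',
          'isnum' if isnum else '-',
          f'{value:5.2f}', name, sep='\t')
```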


## Dual-Mode str and bytes APIs

### str Versus bytes in Regular Expressions

#### Example 4-23. [ramanujan.py](ramanujan.py): compare behavior of simple str and bytes regular expressions

In [86]:
!python3 ramanujan.py

Text
  'Ramanujan saw ௧௭௨௯ as 1729 = 1³ + 12³ = 9³ + 10³.'
Numbers
  str  : ['௧௭௨௯', '1729', '1', '12', '9', '10']
  bytes: [b'1729', b'1', b'12', b'9', b'10']
Words
  str  : ['Ramanujan', 'saw', '௧௭௨௯', 'as', '1729', '1³', '12³', '9³', '10³']
  bytes: [b'Ramanujan', b'saw', b'as', b'1729', b'1', b'12', b'9', b'10']
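The difference shown above comes from how `\d` and `\w` are defined for each pattern type: `str` patterns match Unicode digits and word characters, while `bytes` patterns match ASCII only. The sketch below is written from the output rather than copied from `ramanujan.py`, and reproduces the comparison with `re.findall`:

```python
import re

# Tamil digits for 1729, then the same number in ASCII digits.
text_str = ('Ramanujan saw \u0be7\u0bed\u0be8\u0bef'
            ' as 1729 = 1\u00b3 + 12\u00b3 = 9\u00b3 + 10\u00b3.')
text_bytes = text_str.encode('utf_8')

# str patterns: \d and \w follow Unicode character properties.
numbers_str = re.findall(r'\d+', text_str)
words_str = re.findall(r'\w+', text_str)

# bytes patterns: \d and \w match ASCII digits and word characters only.
numbers_bytes = re.findall(rb'\d+', text_bytes)
words_bytes = re.findall(rb'\w+', text_bytes)

print(numbers_str)
print(numbers_bytes)
print(words_str)
print(words_bytes)
```

For `str` patterns, passing `re.ASCII` restricts `\d` and `\w` to ASCII characters.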
