# Unicode Text Vs Bytes

- A Python `str` is a sequence of Unicode code points. Encoding converts a `str` to `bytes` using a codec such as UTF-8; decoding converts `bytes` back to a human-readable `str`.

In [3]:
s = "café"
s.encode('utf-8')

b'caf\xc3\xa9'

In [4]:
len(s)

4

In [5]:
len(s.encode('utf-8'))

5
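The difference between the two lengths comes from `é` taking two bytes in UTF-8. A minimal sketch of the round trip, where the names `data`, `restored` and `per_char` are made up for illustration:

```python
s = "caf\u00e9"  # 'café' with a precomposed é (U+00E9)

data = s.encode("utf-8")         # str -> bytes
restored = data.decode("utf-8")  # bytes -> str

# Number of UTF-8 bytes needed for each character
per_char = [len(ch.encode("utf-8")) for ch in s]

print(data, restored == s, per_char)
# b'caf\xc3\xa9' True [1, 1, 1, 2]
```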

### Bytes Essentials

Python has two built-in binary sequence types: `bytes`, which is immutable, and `bytearray`, which is mutable.

In [8]:
cafe = bytes('café', encoding='utf-8')
cafe

b'caf\xc3\xa9'

In [22]:
cafe[1]

97

cafe_arr = bytearray(cafe)
cafe_arr

In [28]:
import array

# An array stores items of a single type. The typecode 'd' (the first argument) means
# each item is stored as a C double, which typically occupies 8 bytes
numbers = array.array('d', [1, 2, 3])
numbers
bytes(numbers)

b'\x00\x00\x00\x00\x00\x00\xf0?\x00\x00\x00\x00\x00\x00\x00@\x00\x00\x00\x00\x00\x00\x08@'
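To make the immutable/mutable distinction concrete, here is a small sketch. The variable names `first_item`, `first_slice` and `bytes_mutable` are illustrative only.

```python
cafe = bytes("caf\u00e9", encoding="utf-8")
cafe_arr = bytearray(cafe)

# Indexing yields an int; slicing yields an object of the same binary type
first_item = cafe[0]    # 99, the code for 'c'
first_slice = cafe[:1]  # b'c'

# bytes is immutable: item assignment raises TypeError
try:
    cafe[0] = 67
    bytes_mutable = True
except TypeError:
    bytes_mutable = False

# bytearray is mutable: the same assignment works in place
cafe_arr[0] = 67        # 67 is the code for 'C'

print(first_item, first_slice, bytes_mutable, cafe_arr)
# 99 b'c' False bytearray(b'Caf\xc3\xa9')
```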

### Encoders and Decoders

- Python ships with many codecs (encoders and decoders) for converting `str` to `bytes` and vice versa, e.g. `utf_8`. Each codec has several aliases, e.g. `utf-8`, `U8`, `utf8`.
- Codec names can be passed to functions such as `open()`, `str.encode()`, and `bytes.decode()`.

In [29]:
# Encoding a string using different codecs

codecs = ['latin_1', 'utf_8', 'utf_16']

for codec in codecs:
    print(codec, 'El Niño'.encode(codec), sep='\t')

latin_1	b'El Ni\xf1o'
utf_8	b'El Ni\xc3\xb1o'
utf_16	b'\xff\xfeE\x00l\x00 \x00N\x00i\x00\xf1\x00o\x00'
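As a rough illustration of the two points about codecs, the sketch below resolves a few aliases with `codecs.lookup()` and passes a codec name to `open()`. The file name `nino.txt` and the temporary directory are placeholders chosen for this example.

```python
import codecs
import os
import tempfile

# All of these aliases resolve to the same codec
aliases = ["utf_8", "utf-8", "U8", "utf8"]
canonical = {codecs.lookup(alias).name for alias in aliases}
print(canonical)  # {'utf-8'}

# The same codec name can be given to open() when writing and reading text
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "nino.txt")  # hypothetical file name
    with open(path, "w", encoding="utf_8") as f:
        f.write("El Ni\u00f1o")
    size = os.path.getsize(path)  # 7 characters, but 8 bytes on disk
    with open(path, encoding="utf_8") as f:
        text = f.read()

print(size, text)
```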


### Encoding Errors

In [34]:
# UnicodeEncodeError is raised when the string contains a character the target codec cannot represent
city = 'São Paulo'
city.encode('ascii')

UnicodeEncodeError: 'ascii' codec can't encode character '\xe3' in position 1: ordinal not in range(128)

In [36]:
city.encode('cp437')

UnicodeEncodeError: 'charmap' codec can't encode character '\xe3' in position 1: character maps to <undefined>

In [39]:
# Passing errors='ignore' silently drops the characters that cannot be encoded
city.encode('cp437', errors='ignore')

b'So Paulo'

In [41]:
city.encode('cp437', errors='xmlcharrefreplace')

b'S&#227;o Paulo'
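For comparison, the sketch below runs the same string through several of the built-in error handlers side by side. The `handlers` list and `results` dict are names made up for this example; `replace` and `backslashreplace` are standard handlers that the cells above do not show.

```python
city = "S\u00e3o Paulo"

handlers = ["ignore", "replace", "xmlcharrefreplace", "backslashreplace"]
results = {h: city.encode("cp437", errors=h) for h in handlers}

for handler, encoded in results.items():
    print(handler, encoded, sep="\t")
# ignore	b'So Paulo'
# replace	b'S?o Paulo'
# xmlcharrefreplace	b'S&#227;o Paulo'
# backslashreplace	b'S\\xe3o Paulo'
```

Only `xmlcharrefreplace` and `backslashreplace` keep enough information to recover the original character later; `ignore` and `replace` lose it.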

### Decoding Errors

- Decoding bytes with the wrong codec may raise an error or silently produce garbage characters, known as 'gremlins' or 'mojibake' (Japanese for 'transformed text').

In [44]:
octets = b'Montr\xe9al'
octets.decode('cp1252')

'Montréal'

In [45]:
octets.decode('iso8859_7')

'Montrιal'

In [46]:
octets.decode('koi8_r')

'MontrИal'

In [47]:
octets.decode('utf')

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 5: invalid continuation byte

In [49]:
# errors='replace' substitutes each undecodable byte with U+FFFD, the official REPLACEMENT CHARACTER
octets.decode('utf8', errors='replace')

'Montr�al'
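Mojibake also appears in the opposite direction, when UTF-8 bytes are decoded with a legacy single-byte codec and no error is raised. A minimal sketch, assuming the wrong codec is known to be `cp1252` (in practice it usually has to be guessed):

```python
text = "Montr\u00e9al"

# UTF-8 bytes decoded with the wrong codec: no exception, just garbage
garbled = text.encode("utf_8").decode("cp1252")
print(garbled)  # MontrÃ©al

# If the wrong codec is known, reversing the two steps repairs the text
repaired = garbled.encode("cp1252").decode("utf_8")
print(repaired == text)  # True
```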

In [50]:
name = "Wilson"

name.encode('utf8')

b'Wilson'

In [52]:
name.encode('utf16')

b'\xff\xfeW\x00i\x00l\x00s\x00o\x00n\x00'

In [53]:
len(name.encode('utf8'))

6

In [60]:
for b in name.encode('utf16'):
    print(b, end=" ")

255 254 87 0 105 0 108 0 115 0 111 0 110 0 

In [62]:
for b in name.encode('utf8'):
    print(b, end=" ")

87 105 108 115 111 110 
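The leading `255 254` in the UTF-16 output is the byte order mark (BOM). As a hedged sketch, the explicit `utf_16le` and `utf_16be` codecs show the same text without a BOM; the variable names here are illustrative.

```python
import codecs

name = "Wilson"

with_bom = name.encode("utf_16")  # BOM + native byte order
le = name.encode("utf_16le")      # little-endian, no BOM
be = name.encode("utf_16be")      # big-endian, no BOM

# The first two bytes of the generic codec's output are one of the two BOMs
starts_with_bom = with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

print(len(with_bom), len(le), len(be), starts_with_bom)
# 14 12 12 True
print(le)
# b'W\x00i\x00l\x00s\x00o\x00n\x00'
```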

In [66]:
import sys

In [69]:
sys.stdout.encoding

'UTF-8'

### Normalizing Unicode

In [71]:
s1 = 'café'
s1

'café'

In [74]:
s2 = 'cafe\N{COMBINING ACUTE ACCENT}'
s2

'café'

In [77]:
# The two strings look identical but have different lengths, so they do not compare equal
len(s1), len(s2)

(4, 5)

In [87]:
# Normalizing makes them equal

from unicodedata import normalize

# NFC composes code points to produce the shortest equivalent string
print(normalize('NFC', s1), len(normalize('NFC', s1)))
print(normalize('NFC', s2), len(normalize('NFC', s2)))


café 4
café 4


In [88]:
# NFD decomposes composed characters into base characters and combining marks, resulting in a longer representation
print(normalize('NFD', s1), len(normalize('NFD', s1)))
print(normalize('NFD', s2), len(normalize('NFD', s2)))

café 5
café 5


___It is good practice to normalize text before saving it.___
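One way to apply this is a small comparison helper that normalizes both sides first. The function name `nfc_equal` is hypothetical, and choosing NFC rather than NFD is an assumption of this sketch.

```python
from unicodedata import normalize


def nfc_equal(a, b):
    """Compare two strings after normalizing both to NFC."""
    return normalize("NFC", a) == normalize("NFC", b)


s1 = "caf\u00e9"   # precomposed é
s2 = "cafe\u0301"  # 'e' followed by COMBINING ACUTE ACCENT

print(s1 == s2)           # False
print(nfc_equal(s1, s2))  # True
```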

The `unicodedata` module can give you the official Unicode name of a character.

In [90]:
from unicodedata import name

name("&"), name("^")

('AMPERSAND', 'CIRCUMFLEX ACCENT')
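`name()` raises `ValueError` for characters without a name, such as control characters, unless a default is passed as the second argument. A short sketch, where the default string `'<unnamed>'` is an arbitrary choice:

```python
from unicodedata import name

# Look up each character, falling back to a default for unnamed ones
names = [name(ch, "<unnamed>") for ch in "f\u00e9\n"]
print(names)
# ['LATIN SMALL LETTER F', 'LATIN SMALL LETTER E WITH ACUTE', '<unnamed>']

# The combining mark used in the decomposed form of é has its own name
accent = name("\u0301")
print(accent)
# COMBINING ACUTE ACCENT
```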