# Chapter 4. Unicode Text Versus Bytes
---

## ToC

[Objectives](#objectives)

1. [Character Issues](#character-issues)
2. [Byte Essentials](#byte-essentials)
---

## Objectives
- Characters, code points, and byte representations
- Unique features of binary sequences: `bytes`, `bytearray`, and `memoryview`
- Encodings for full Unicode and legacy character sets
- Avoiding and dealing with encoding errors
- Best practices when handling text files
- The default encoding trap and standard I/O issues
- Safe Unicode text comparisons with normalization
- Utility functions for normalization, case folding, and brute-force diacritic removal
- Proper sorting of Unicode text with locale and the pyuca library
- Character metadata in the Unicode database
- Dual-mode APIs that handle `str` and `bytes`

## Character Issues

![Figure 50](https://raw.githubusercontent.com/berserkhmdvhb/Training-Python/main/figures/Part_I/50.PNG)

**Links:**  

[Parsing binary records with struct](https://fpy.li/4-3)  
[Building Multi-character Emojis](https://fpy.li/4-4)



The concept of “string” is simple enough: a string is a sequence of characters. The problem lies in the definition of “character.” In 2021, the best definition of “character” we have is a Unicode character. The Unicode standard explicitly separates the identity of characters from specific byte representations:



- The identity of a character, its *code point*, is a number from 0 to 1,114,111
(base 10), shown in the Unicode standard as 4 to 6 hex digits with a “U+” prefix,
from U+0000 to U+10FFFF. For example, the code point for the letter A is
U+0041, the Euro sign is U+20AC, and the musical symbol G clef is assigned to
code point U+1D11E. About 13% of the valid code points have characters
assigned to them in Unicode 13.0.0, the standard used in Python 3.10.0b4.

- The actual bytes that represent a character depend on the *encoding* in use. An
encoding is an algorithm that converts code points to byte sequences and vice
versa. The code point for the letter A (U+0041) is encoded as the single byte `\x41`
in the UTF-8 encoding, or as the bytes `\x41\x00` in UTF-16LE encoding. As
another example, UTF-8 requires three bytes (`\xe2\x82\xac`) to encode the
Euro sign (U+20AC), but in UTF-16LE the same code point is encoded as two
bytes: `\xac\x20`.
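
The byte sequences quoted above can be checked directly in Python. The sketch below is only an illustration: `encoded_forms` is a hypothetical helper name, not something from the book or the standard library, and the three characters are written as escapes so the snippet does not depend on the source file's own encoding.

```python
def encoded_forms(char):
    """Return the code point label and two encodings of one character.

    `encoded_forms` is a hypothetical helper written for this illustration.
    """
    return {
        'code_point': f'U+{ord(char):04X}',
        'utf-8': char.encode('utf-8'),
        'utf-16le': char.encode('utf-16-le'),
    }


# The letter A, the Euro sign, and the musical symbol G clef.
for char in ('\u0041', '\u20ac', '\U0001d11e'):
    forms = encoded_forms(char)
    print(forms['code_point'], forms['utf-8'], forms['utf-16le'])

# The highest valid code point is U+10FFFF, which is 1,114,111 in base 10.
assert 0x10FFFF == 1_114_111
last = chr(0x10FFFF)  # accepted by chr()
try:
    chr(0x110000)  # one past the end of the code point range
except ValueError:
    beyond_range_rejected = True
else:
    beyond_range_rejected = False
```

The G clef lies outside the first 65,536 code points, so it takes four bytes in both encodings: a four-byte sequence in UTF-8 and a surrogate pair in UTF-16LE.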

#### Refresher on ASCII, UTF-8, Unicode

![Figure 51](https://raw.githubusercontent.com/berserkhmdvhb/Training-Python/main/figures/Part_I/51.PNG)

![Figure 52](https://raw.githubusercontent.com/berserkhmdvhb/Training-Python/main/figures/Part_I/52.PNG)

![Figure 53](https://raw.githubusercontent.com/berserkhmdvhb/Training-Python/main/figures/Part_I/53.PNG)

![Figure 54](https://raw.githubusercontent.com/berserkhmdvhb/Training-Python/main/figures/Part_I/54.PNG)

![Figure 55](https://raw.githubusercontent.com/berserkhmdvhb/Training-Python/main/figures/Part_I/55.PNG)

Converting from code points to bytes is encoding; converting from bytes to code
points is decoding.

    >>> s = 'café'
    >>> len(s)
    4
    >>> b = s.encode('utf8')
    >>> b
    b'caf\xc3\xa9'
    >>> len(b)
    5
    >>> b.decode('utf8')
    'café'

Display Unicode code points:

    >>> for c in b.decode('utf8'):
    ...     print(f"{c} -> U+{ord(c):04X}")
    ...
    c -> U+0063
    a -> U+0061
    f -> U+0066
    é -> U+00E9
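
The string `'café'` has 4 characters but encodes to 5 bytes in UTF-8. A rough way to see where the extra byte comes from is to encode each character on its own. The names `per_char` and `total_bytes` below are made up for this sketch.

```python
s = 'café'

# Pair each character with its own UTF-8 byte sequence.
per_char = [(c, c.encode('utf-8')) for c in s]

for c, encoded in per_char:
    # bytes.hex() with a separator argument needs Python 3.8 or later.
    print(f"{c} -> U+{ord(c):04X} -> {encoded.hex(' ')} ({len(encoded)} byte(s))")

# Summing the per-character lengths gives the length of the encoded string.
total_bytes = sum(len(encoded) for _, encoded in per_char)
```

Code points up to U+007F take one byte each in UTF-8, while `é` (U+00E9) takes two, which accounts for the fifth byte.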


![Figure 56](https://raw.githubusercontent.com/berserkhmdvhb/Training-Python/main/figures/Part_I/56.PNG)

## Byte Essentials

There are two basic built-in types for binary sequences: the immutable `bytes` type introduced in Python 3 and the mutable `bytearray`, added way back in Python 2.6. Each item in `bytes` or `bytearray` is an integer from 0 to 255, and not a one-character string as in the Python 2 `str`. However, a slice of a binary sequence always produces a binary sequence of the same type, including slices of length 1.

    >>> cafe = bytes('café', encoding='utf_8')
    >>> cafe
    b'caf\xc3\xa9'
    >>> # Each item is an integer in range(256).
    >>> cafe[0]
    99
    >>> ord('café'[0])
    99
    >>> cafe[:1]
    b'c'
    >>> cafe[2:3]
    b'f'
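
The examples above use only `bytes`. As a small sketch of the same rules applied to the mutable `bytearray`, the snippet below repeats the item and slice checks, then shows item assignment succeeding on a `bytearray` and failing on `bytes`. All variable names are chosen for this illustration.

```python
cafe = bytes('café', encoding='utf_8')
cafe_arr = bytearray(cafe)

first_item = cafe_arr[0]   # an int, exactly as with bytes
tail = cafe_arr[-1:]       # a slice of a bytearray is a bytearray

# bytearray is mutable: item assignment takes an int in range(256).
cafe_arr[0] = ord('C')

# bytes is immutable: the same assignment raises TypeError.
try:
    cafe[0] = ord('C')
except TypeError as exc:
    bytes_error = exc
else:
    bytes_error = None

# For str, an item and a length-1 slice are equal; for bytes they are not,
# because the item is an int and the slice is a bytes object.
text = 'café'
str_item_equals_slice = text[0] == text[:1]
bytes_item_equals_slice = cafe[0] == cafe[:1]
```

Here `str_item_equals_slice` is `True` and `bytes_item_equals_slice` is `False`. Among Python's sequence types, `str` is the unusual one in having `s[0] == s[:1]`.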

![Figure 57](https://raw.githubusercontent.com/berserkhmdvhb/Training-Python/main/figures/Part_I/57.PNG)