# Chapter 4. Unicode Text Versus Bytes
---

## ToC

[Objectives](#objectives)

1. [Character Issues](#character-issues)
2. [Byte Essentials](#byte-essentials)
---

## Objectives
- Characters, code points, and byte representations
- Unique features of binary sequences: `bytes`, `bytearray`, and `memoryview`
- Encodings for full Unicode and legacy character sets
- Avoiding and dealing with encoding errors
- Best practices when handling text files
- The default encoding trap and standard I/O issues
- Safe Unicode text comparisons with normalization
- Utility functions for normalization, case folding, and brute-force diacritic
removal
- Proper sorting of Unicode text with locale and the pyuca library
- Character metadata in the Unicode database
- Dual-mode APIs that handle str and bytes

## Character Issues

![Figure 50](https://raw.githubusercontent.com/berserkhmdvhb/Training-Python/main/figures/Part_I/50.PNG)

**Links:**  

[Parsing binary records with struct](https://fpy.li/4-3)  
[Building Multi-character Emojis](https://fpy.li/4-4)


The concept of “string” is simple enough: a string is a sequence of characters. The problem lies in the definition of “character.” In 2021, the best definition of “character” we have is a Unicode character. The Unicode standard explicitly separates the identity of characters from specific byte representations:



- The identity of a character—its *code point*—is a number from 0 to 1,114,111
(base 10), shown in the Unicode standard as 4 to 6 hex digits with a “U+” prefix,
from U+0000 to U+10FFFF. For example, the code point for the letter A is U+0041,
the Euro sign is U+20AC, and the musical symbol G clef is assigned to
code point U+1D11E. About 13% of the valid code points have characters
assigned to them in Unicode 13.0.0, the standard used in Python 3.10.0b4.

- The actual bytes that represent a character depend on the *encoding* in use. An
encoding is an algorithm that converts code points to byte sequences and vice
versa. The code point for the letter A (U+0041) is encoded as the single byte \x41
in the UTF-8 encoding, or as the bytes \x41\x00 in UTF-16LE encoding. As
another example, UTF-8 requires three bytes—\xe2\x82\xac—to encode the
Euro sign (U+20AC), but in UTF-16LE the same code point is encoded as two
bytes: \xac\x20.
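
The byte values quoted above can be checked directly. The following is a minimal sketch, not part of the original notes; the `chars` and `encoded` names are chosen here for illustration only.

```python
# Encode the same characters with two encodings and show the bytes as hex.
# `chars` and `encoded` are illustrative names, not from the original text.
chars = {'A': 'letter A', '€': 'Euro sign', '\U0001D11E': 'G clef'}
encoded = {}

for char, name in chars.items():
    utf8 = char.encode('utf_8')
    utf16le = char.encode('utf_16le')
    encoded[name] = (utf8, utf16le)
    print(f"U+{ord(char):04X} {name}: "
          f"utf-8={utf8.hex(' ')}  utf-16le={utf16le.hex(' ')}")

# U+0041 letter A: utf-8=41  utf-16le=41 00
# U+20AC Euro sign: utf-8=e2 82 ac  utf-16le=ac 20
# U+1D11E G clef: utf-8=f0 9d 84 9e  utf-16le=34 d8 1e dd
```

The code point is the same in every row; only the byte sequence changes with the encoding.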

#### Refresher on ASCII, UTF-8, and Unicode

![Figure 51](https://raw.githubusercontent.com/berserkhmdvhb/Training-Python/main/figures/Part_I/51.PNG)

![Figure 52](https://raw.githubusercontent.com/berserkhmdvhb/Training-Python/main/figures/Part_I/52.PNG)

![Figure 53](https://raw.githubusercontent.com/berserkhmdvhb/Training-Python/main/figures/Part_I/53.PNG)

![Figure 54](https://raw.githubusercontent.com/berserkhmdvhb/Training-Python/main/figures/Part_I/54.PNG)

![Figure 55](https://raw.githubusercontent.com/berserkhmdvhb/Training-Python/main/figures/Part_I/55.PNG)

Converting from code points to bytes is encoding; converting from bytes to code
points is decoding.

In [4]:
s = 'café'
len(s)

4

In [5]:
b = s.encode('utf8')
b

b'caf\xc3\xa9'

In [3]:
len(b)

5

In [6]:
b.decode('utf8')

'café'

Display Unicode code points:

In [9]:
for c in b.decode('utf8'):
    print(f"{c} -> U+{ord(c):04X}")


c -> U+0063
a -> U+0061
f -> U+0066
é -> U+00E9


![Figure 56](https://raw.githubusercontent.com/berserkhmdvhb/Training-Python/main/figures/Part_I/56.PNG)

## Byte Essentials

There are two basic built-in types for binary sequences: the immutable `bytes` type introduced in Python 3 and the mutable `bytearray`, added way back in Python 2.6. Each item in `bytes` or `bytearray` is an integer from 0 to 255, and not a one-character string like in the Python 2 `str`. However, a slice of a binary sequence always produces a binary sequence of the same type—including slices of length 1. 

In [18]:
cafe = bytes('café', encoding='utf_8')
cafe

b'caf\xc3\xa9'

In the result above, the first three bytes (`b'caf'`) are in the printable ASCII range; the last two are not.

In [24]:
# Each item is an integer in range(256).
cafe[0]

99

In [20]:
ord('café'[0])

99

In [21]:
cafe[:1]

b'c'

In [23]:
cafe[2:3]

b'f'
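
The cells above use the immutable `bytes` type. As a hedged sketch of the mutable counterpart mentioned at the start of this section, the same indexing and slicing rules apply to `bytearray`, which also allows in-place changes. The variable names are illustrative.

```python
# bytearray follows the same item/slice rules as bytes, but is mutable.
cafe_arr = bytearray('café', encoding='utf_8')

first = cafe_arr[0]    # an item is an int
head = cafe_arr[:1]    # a slice is a bytearray, even at length 1
print(first, head)     # 99 bytearray(b'c')

cafe_arr[0] = ord('C')  # item assignment takes an int in range(256)
cafe_arr.extend(b'!')   # grow in place
print(cafe_arr)         # bytearray(b'Caf\xc3\xa9!')
```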

![Figure 57](https://raw.githubusercontent.com/berserkhmdvhb/Training-Python/main/figures/Part_I/57.PNG)

Although binary sequences are really sequences of integers, their literal notation reflects the fact that ASCII text is often embedded in them. Therefore, four different displays are used, depending on each byte value:
- For bytes with decimal codes 32 to 126—from space to `~` (tilde)—the ASCII character
itself is used.
- For bytes corresponding to tab, newline, carriage return, and `\`, the escape
sequences `\t`, `\n`, `\r`, and `\\` are used.
- If both string delimiters `'` and `"` appear in the byte sequence, the whole sequence
is delimited by `'`, and any `'` inside are escaped as `\'`.
- For other byte values, a hexadecimal escape sequence is used (e.g., `\x00` is the
null byte).
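
The four display rules above can be observed with `repr()`. This is a small illustrative sketch; the `samples` dictionary and its labels are made up for the demonstration.

```python
# One sample per display rule; labels are illustrative only.
samples = {
    "printable ASCII": bytes([72, 105, 126]),               # 'H', 'i', '~'
    "tab, newline, CR, backslash": bytes([9, 10, 13, 92]),
    "both quote characters": b"'" + b'"',
    "other values": bytes([0, 127, 255]),
}

for label, value in samples.items():
    print(f"{label}: {value!r}")

# printable ASCII: b'Hi~'
# tab, newline, CR, backslash: b'\t\n\r\\'
# both quote characters: b'\'"'
# other values: b'\x00\x7f\xff'
```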

Both `bytes` and `bytearray` support every `str` method except those that do formatting
(`format`, `format_map`) and those that depend on Unicode data, including `casefold`,
`isdecimal`, `isidentifier`, `isnumeric`, `isprintable`, and `encode`.
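
A quick way to check the statement above is to call a few shared methods on a `bytes` object and then probe for the excluded ones with `hasattr`. This is only a sketch; the `excluded` list simply restates the names from the paragraph.

```python
data = b'Hello, World'

# Familiar str-like methods work on bytes (arguments must be bytes too).
print(data.upper())               # b'HELLO, WORLD'
print(data.split(b', '))          # [b'Hello', b'World']
print(data.startswith(b'Hello'))  # True

# Names listed in the text as unavailable on binary sequences.
excluded = ['format', 'format_map', 'casefold', 'isdecimal',
            'isidentifier', 'isnumeric', 'isprintable', 'encode']
missing = [name for name in excluded if not hasattr(bytes, name)]
print(missing == excluded)        # True
```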

In [45]:
bytes.fromhex('31 4B CE A9')

b'1K\xce\xa9'

In [None]:
text = "hello"      # str
binary = b"hello"   # bytes

print(text, type(text))
print(binary, type(binary))

hello <class 'str'>
b'hello' <class 'bytes'>


In [None]:
import re

# Compile a regex for binary data
binary_pattern = re.compile(b'foo')

# Binary data
data = b'barfooqux1'

# Search for the pattern in the binary data
match = binary_pattern.search(data)

if match:
    print("Match found:", match.group())
else:
    print("No match found.")


Match found: b'foo'


In [44]:
import re

data = b"user:john|id:1234;user:jane|id:5678;user:bob|id:9012,user:L\xc3\xa9o|id:2222"

pattern = re.compile(rb'user:(\w+)\|id:(\d+)')

matches = pattern.findall(data)

for idx, (username, user_id) in enumerate(matches):
    print(f"{idx} - Byte Version -> Username: {username}, ID: {user_id}")
    print(f"{idx} - Str Version  -> Username: {username.decode('utf-8')}, ID: {user_id.decode()}")
    
    # Show raw bytes as lists of integers
    print(f"{idx} - Username Bytes (Decimal): {[b for b in username]}")
    print(f"{idx} - User ID Bytes (Decimal): {[b for b in user_id]}")
    print(f"{idx} - Username Bytes (Hexadecimal): {[format(b, '02x') for b in username]}")
    print(f"{idx} - User ID Bytes (Hexadecimal): {[format(b, '02x') for b in user_id]}")
    print("-" * 40)

0 - Byte Version -> Username: b'john', ID: b'1234'
0 - Str Version  -> Username: john, ID: 1234
0 - Username Bytes (Decimal): [106, 111, 104, 110]
0 - User ID Bytes (Decimal): [49, 50, 51, 52]
0 - Username Bytes (Hexadecimal): ['6a', '6f', '68', '6e']
0 - User ID Bytes (Hexadecimal): ['31', '32', '33', '34']
----------------------------------------
1 - Byte Version -> Username: b'jane', ID: b'5678'
1 - Str Version  -> Username: jane, ID: 5678
1 - Username Bytes (Decimal): [106, 97, 110, 101]
1 - User ID Bytes (Decimal): [53, 54, 55, 56]
1 - Username Bytes (Hexadecimal): ['6a', '61', '6e', '65']
1 - User ID Bytes (Hexadecimal): ['35', '36', '37', '38']
----------------------------------------
2 - Byte Version -> Username: b'bob', ID: b'9012'
2 - Str Version  -> Username: bob, ID: 9012
2 - Username Bytes (Decimal): [98, 111, 98]
2 - User ID Bytes (Decimal): [57, 48, 49, 50]
2 - Username Bytes (Hexadecimal): ['62', '6f', '62']
2 - User ID Bytes (Hexadecimal): ['39', '30', '31', '32']
----------------------------------------

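The fourth record is not matched here: in a `bytes` pattern, `\w` matches only ASCII word characters, so the username `L\xc3\xa9o` (which is `Léo` when decoded) stops matching at the first non-ASCII byte. Decoding the data to `str` before searching fixes this: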

In [41]:
import re

data = b"user:john|id:1234;user:jane|id:5678;user:bob|id:9012,user:L\xc3\xa9o|id:2222"
text_data = data.decode('utf-8')

pattern = re.compile(r'user:(\w+)\|id:(\d+)')

matches = pattern.findall(text_data)

for idx, (username, user_id) in enumerate(matches):
    print(f"{idx} - str Version -> Username: {username}, ID: {user_id}")
    
    print("-" * 40)

0 - str Version -> Username: john, ID: 1234
----------------------------------------
1 - str Version -> Username: jane, ID: 5678
----------------------------------------
2 - str Version -> Username: bob, ID: 9012
----------------------------------------
3 - str Version -> Username: Léo, ID: 2222
----------------------------------------


**Example:** Initializing bytes from the raw data of an array

In [49]:
import array
numbers = array.array('h', [-2, -1, 0, 1, 2])
octets = bytes(numbers)
octets

b'\xfe\xff\xff\xff\x00\x00\x01\x00\x02\x00'

Creating a `bytes` or `bytearray` object from any buffer-like source will always copy the bytes. In contrast, `memoryview` objects let you share memory between binary data structures.
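
The copying behavior can be seen by changing the source after building `bytes` from it. A minimal sketch, using a small array chosen for illustration:

```python
import array

numbers = array.array('h', [1, 2, 3])
octets = bytes(numbers)   # copies the array's raw bytes

numbers[0] = 100          # mutate the source afterwards

# The bytes object still holds the original values...
print(octets == bytes(array.array('h', [1, 2, 3])))  # True
# ...and no longer matches the modified array.
print(octets == bytes(numbers))                      # False
```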

In [50]:
import array

# Create an array
numbers = array.array('h', [-2, -1, 0, 1, 2])

# Create a memoryview
memv = memoryview(numbers)

print("Original numbers:", numbers)
print("Memoryview (as list):", memv.tolist())

# Modify memoryview
memv[2] = 9999  # Change the 3rd element (index 2)

print("After changing memoryview:")
print("numbers:", numbers)
print("memv:", memv.tolist())


Original numbers: array('h', [-2, -1, 0, 1, 2])
Memoryview (as list): [-2, -1, 0, 1, 2]
After changing memoryview:
numbers: array('h', [-2, -1, 9999, 1, 2])
memv: [-2, -1, 9999, 1, 2]


**Compare memory**

In [52]:
import array
import sys

# 'i' is a signed C int (typically 4 bytes per item)
numbers = array.array('i', range(1_000_000))

memv = memoryview(numbers)

list_from_array = list(numbers)
list_from_memv = list(memv)

print(f"Array size: {sys.getsizeof(numbers)} bytes")
print(f"Memoryview size: {sys.getsizeof(memv)} bytes")
print(f"List from array size: {sys.getsizeof(list_from_array)} bytes")
print(f"List from memoryview size: {sys.getsizeof(list_from_memv)} bytes")


Array size: 4091948 bytes
Memoryview size: 184 bytes
List from array size: 8000056 bytes
List from memoryview size: 8000056 bytes
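
When reading these numbers, note that `sys.getsizeof` reports only the size of the object itself: for a `memoryview` that is the small view object, not the buffer it exposes, and for a list it does not include the `int` objects the list refers to. The following sketch, on a smaller array chosen for illustration, uses `memoryview.nbytes` to get the size of the underlying data.

```python
import array
import sys

numbers = array.array('i', range(1000))
memv = memoryview(numbers)

# Size of the raw data the view exposes, without copying it.
payload = len(numbers) * numbers.itemsize
print(memv.nbytes == payload)            # True

# The view object itself is much smaller than the data it points to.
print(sys.getsizeof(memv) < payload)     # True
```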
