# Chapter 4. Unicode Text Versus Bytes
---

## ToC

1. [Basic Encoders/Decoders](#basic-encodersdecoders)  
2. [Understanding Encode/Decode Problems](#understanding-encodedecode-problems)  
    2.1. [Coping with UnicodeEncodeError](#coping-with-unicodeencodeerror)  
    2.2. [Coping with UnicodeDecodeError](#coping-with-unicodedecodeerror)  
    2.3. [SyntaxError When Loading Modules with Unexpected Encoding](#syntaxerror-when-loading-modules-with-unexpected-encoding)  
    2.4. [How to Discover the Encoding of a Byte Sequence](#how-to-discover-the-encoding-of-a-byte-sequence)
---

## Basic Encoders/Decoders

The Python distribution bundles more than 100 *codecs* (encoder/decoders) for
text-to-byte conversion and vice versa. Each codec has a name, like `utf_8`, and often
aliases, such as `utf8`, `utf-8`, and `U8`, which you can use as the encoding argument
in functions like `open()`, `str.encode()`, `bytes.decode()`, and so on.
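
As a quick check of how aliases behave, the standard `codecs.lookup()` function can resolve each alias to its codec. This is a minimal sketch; the `aliases` list simply repeats the names mentioned above, and `canonical_names` and `encoded` are illustrative variable names.

```python
import codecs

# Names taken from the paragraph above; all should resolve to one codec.
aliases = ['utf_8', 'utf8', 'utf-8', 'U8']

# CodecInfo.name holds the canonical name of the codec that was found.
canonical_names = {codecs.lookup(alias).name for alias in aliases}
print(canonical_names)

# Any alias is accepted wherever an encoding argument is expected.
encoded = {alias: 'El Niño'.encode(alias) for alias in aliases}
print(encoded)
```

Every alias yields the same bytes because they all point to the same codec.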

**Example:** the string `'El Niño'` encoded with three codecs, producing very different byte sequences.

In [1]:
for codec in ['latin_1', 'utf_8', 'utf_16']:
    print(codec, 'El Niño'.encode(codec), sep='\t')

latin_1	b'El Ni\xf1o'
utf_8	b'El Ni\xc3\xb1o'
utf_16	b'\xff\xfeE\x00l\x00 \x00N\x00i\x00\xf1\x00o\x00'


![Figure 58](https://raw.githubusercontent.com/berserkhmdvhb/Training-Python/main/figures/Part_I/58.PNG)

## Understanding Encode/Decode Problems

Although there is a generic `UnicodeError` exception, the error reported by Python is
usually more specific: either a `UnicodeEncodeError` (when converting `str` to binary
sequences) or a `UnicodeDecodeError` (when reading binary sequences into `str`).
Loading Python modules may also raise `SyntaxError` when the source encoding is
unexpected.
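
Because both specific exceptions derive from `UnicodeError`, a single `except` clause can cover either one. The sketch below illustrates this; `try_encode` is a hypothetical helper, not something defined elsewhere in these notes.

```python
def try_encode(text, encoding):
    """Return the encoded bytes, or None if the text cannot be encoded."""
    try:
        return text.encode(encoding)
    except UnicodeError as exc:  # also catches UnicodeEncodeError
        print(f'{type(exc).__name__}: {exc.reason}')
        return None


# Both specific exceptions are subclasses of the generic UnicodeError.
print(issubclass(UnicodeEncodeError, UnicodeError))
print(issubclass(UnicodeDecodeError, UnicodeError))

print(try_encode('São Paulo', 'ascii'))   # 'ã' is not ASCII
print(try_encode('Sao Paulo', 'ascii'))
```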

### Coping with UnicodeEncodeError

In [2]:
city = 'São Paulo'
city.encode('utf_8')

b'S\xc3\xa3o Paulo'

In [3]:
city.encode('utf_16')

b'\xff\xfeS\x00\xe3\x00o\x00 \x00P\x00a\x00u\x00l\x00o\x00'

In [4]:
city.encode('iso8859_1')

b'S\xe3o Paulo'

`cp437` can’t encode the `ã` (`a` with tilde). The default error handler
—`strict`—raises `UnicodeEncodeError`.

In [5]:
# cp437 can’t encode the 'ã'
city.encode('cp437')

UnicodeEncodeError: 'charmap' codec can't encode character '\xe3' in position 1: character maps to <undefined>

In [6]:
# skips characters that cannot be encoded
city.encode('cp437', errors='ignore')

b'So Paulo'

In [7]:
# substitutes unencodable characters with '?'
city.encode('cp437', errors='replace')

b'S?o Paulo'

In [8]:
# replaces unencodable characters with an XML entity
city.encode('cp437', errors='xmlcharrefreplace')

b'S&#227;o Paulo'

![Figure 59](https://raw.githubusercontent.com/berserkhmdvhb/Training-Python/main/figures/Part_I/59.PNG)

Link: [`codecs.register_error` documentation](https://docs.python.org/3/library/codecs.html#codecs.register_error)
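
To illustrate what `codecs.register_error` allows, here is a minimal sketch of a custom error handler. The handler name `'underscore'` and the function `underscore_errors` are hypothetical choices made for this example; only `codecs.register_error` itself comes from the standard library.

```python
import codecs


def underscore_errors(exc):
    """Hypothetical handler: swap each unencodable character for '_'."""
    if isinstance(exc, UnicodeEncodeError):
        # exc.start and exc.end delimit the offending slice of exc.object.
        replacement = '_' * (exc.end - exc.start)
        # Return the replacement and the position where encoding resumes.
        return replacement, exc.end
    raise exc  # this sketch does not handle decoding errors


# Make the handler available under a name usable in the errors argument.
codecs.register_error('underscore', underscore_errors)

city = 'São Paulo'
encoded_city = city.encode('cp437', errors='underscore')
print(encoded_city)
```

Once registered, the name can be passed as `errors=` just like the built-in `ignore`, `replace`, and `xmlcharrefreplace` handlers.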

### Coping with UnicodeDecodeError

Not every byte holds a valid ASCII character, and not every byte sequence is valid
UTF-8 or UTF-16; therefore, when you assume one of these encodings while converting
a binary sequence to text, you will get a `UnicodeDecodeError` if unexpected bytes
are found.

On the other hand, many legacy 8-bit encodings like `iso8859_1` and `koi8_r` are able
to decode any stream of bytes, including random noise, without reporting errors, and
`cp1252` rejects only the five byte values it leaves unassigned. Therefore, if your
program assumes the wrong 8-bit encoding, it will silently decode garbage.
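
A small sketch of that difference, using a byte string that contains every possible byte value as a stand-in for random noise (the variable names are illustrative):

```python
noise = bytes(range(256))  # every possible byte value, once

# These 8-bit codecs assign a character to all 256 byte values,
# so decoding never fails, whatever the bytes really meant.
for codec in ('iso8859_1', 'koi8_r'):
    text = noise.decode(codec)
    print(codec, 'decoded', len(text), 'characters')

# UTF-8 has structural rules, so the same bytes are rejected.
try:
    noise.decode('utf_8')
    utf8_failed = False
except UnicodeDecodeError as exc:
    utf8_failed = True
    print('utf_8 failed:', exc.reason)
```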

![Figure 60](https://raw.githubusercontent.com/berserkhmdvhb/Training-Python/main/figures/Part_I/60.PNG)

**Example:** using the wrong codec may produce gremlins or a `UnicodeDecodeError`.

In [9]:
# '\xe9' is the byte for 'é'.
octets = b'Montr\xe9al'

# Decoding with Windows 1252 works because it is a superset of latin1.
octets.decode('cp1252')

'Montréal'

In [10]:
# ISO-8859-7 is intended for Greek
octets.decode('iso8859_7')

'Montrιal'

In [11]:
# KOI8-R is for Russian
octets.decode('koi8_r')

'MontrИal'

In [12]:
octets.decode('utf_8')

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 5: invalid continuation byte

In [13]:
octets.decode('utf_8', errors='replace')

'Montr�al'

### SyntaxError When Loading Modules with Unexpected Encoding

If you load a *.py* module containing non-UTF-8 data and no encoding declaration, you get a message like this:

```
SyntaxError: Non-UTF-8 code starting with '\xe1' in file ola.py on line
1, but no encoding declared; see https://python.org/dev/peps/pep-0263/
for details
```

Because UTF-8 is widely deployed in GNU/Linux and macOS systems, a likely scenario is opening a *.py* file created on Windows with cp1252. Note that this error happens even in Python for Windows, because the default encoding for Python 3 source is UTF-8 across all platforms.

In [14]:
import os
import subprocess
import sys

# Make sure the output directory exists before writing to it
os.makedirs('materials', exist_ok=True)

# Step 1: Create the broken file (without coding declaration)
broken_code = "print('Olá, Mundo!')"

with open('materials/ola_broken.py', 'w', encoding='cp1252') as f:
    f.write(broken_code)

# Step 2: Create the fixed file (with coding declaration)
fixed_code = "# coding: cp1252\nprint('Olá, Mundo!')"

with open('materials/ola_fixed.py', 'w', encoding='cp1252') as f:
    f.write(fixed_code)

# Step 3: Function to run a .py file and capture output/errors
def run_python_file(filename):
    try:
        result = subprocess.run(
            # sys.executable is the interpreter running this notebook,
            # so the call does not depend on 'python' being on PATH
            [sys.executable, filename],
            capture_output=True,
            text=True
        )
        if result.returncode == 0:
            print(f"Output from {filename}:\n{result.stdout}")
        else:
            print(f"Error from {filename}:\n{result.stderr}")
    except Exception as e:
        print(f"Failed to run {filename}: {e}")

# Step 4: Run both files from the materials directory
print("\n--- Trying ola_broken.py ---")
run_python_file(os.path.join('materials', 'ola_broken.py'))

print("\n--- Trying ola_fixed.py ---")
run_python_file(os.path.join('materials', 'ola_fixed.py'))


--- Trying ola_broken.py ---
Error from materials\ola_broken.py:
SyntaxError: Non-UTF-8 code starting with '\xe1' in file c:\Users\HamedVAHEB\Documents\Training\Python\FluentPython\repo\Training-Python\src\Part_I\Chapter_04_UnicodeTextsVSBytes\materials\ola_broken.py on line 1, but no encoding declared; see https://peps.python.org/pep-0263/ for details


--- Trying ola_fixed.py ---
Output from materials\ola_fixed.py:
OlÃ¡, Mundo!

The garbled `OlÃ¡` above is mojibake, not a fault in `ola_fixed.py`: it is exactly what appears when the UTF-8 bytes for `á` (`b'\xc3\xa1'`) are decoded as cp1252. This suggests that the child process wrote UTF-8 to the pipe while `subprocess.run(..., text=True)` decoded it with the Windows locale encoding. Passing an `encoding` argument to `subprocess.run()` that matches the child's output encoding avoids the mismatch.



![Figure 61](https://raw.githubusercontent.com/berserkhmdvhb/Training-Python/main/figures/Part_I/61.PNG)
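
The encoding that Python would pick for a source file can also be checked without running it. The sketch below uses the standard `tokenize.detect_encoding()` function on in-memory copies of the two files; `declared_encoding` is a hypothetical helper written for this example.

```python
import io
import tokenize


def declared_encoding(source: bytes) -> str:
    """Return the source encoding Python would use for these bytes."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source).readline)
    return encoding


# In-memory equivalents of ola_fixed.py and ola_broken.py.
fixed = "# coding: cp1252\nprint('Olá, Mundo!')\n".encode('cp1252')
broken = "print('Olá, Mundo!')\n".encode('cp1252')

print(declared_encoding(fixed))

try:
    declared_encoding(broken)
except SyntaxError as exc:
    # No declaration and the bytes are not valid UTF-8.
    print('SyntaxError:', exc)
```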

### How to Discover the Encoding of a Byte Sequence

How do you find the encoding of a byte sequence? Short answer: you can’t. You must
be told.

Some communication protocols and file formats, like HTTP and XML, contain headers
that explicitly tell us how the content is encoded.
You can be sure that some byte streams are not ASCII because they contain byte values over 127, and the way UTF-8 and UTF-16 are built also limits the possible byte sequences.

![Figure 62](https://raw.githubusercontent.com/berserkhmdvhb/Training-Python/main/figures/Part_I/62.PNG)
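
As an illustration of being told the encoding, the sketch below reads the `charset` parameter from an HTTP-style `Content-Type` header value with the standard `email.message.Message` class. The helper name `charset_from_content_type` and the sample header values are assumptions made for this example.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the declared charset in lowercase, or None if absent."""
    msg = Message()
    msg['Content-Type'] = header_value
    return msg.get_content_charset()


body = b'Montr\xe9al'
charset = charset_from_content_type('text/plain; charset=ISO-8859-1')
print(charset)
print(body.decode(charset))

# Without a charset parameter there is nothing to go on.
print(charset_from_content_type('text/plain'))
```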

However, considering that human languages also have their rules and restrictions,
once you assume that a stream of bytes is human *plain text*, it may be possible to sniff
out its encoding using heuristics and statistics. For example, if `b'\x00'` bytes are
common, it is probably a 16- or 32-bit encoding, and not an 8-bit scheme, because
null characters in plain text are bugs. When the byte sequence `b'\x20\x00'` appears
often, it is more likely to be the space character (`U+0020`) in a UTF-16LE encoding
than the obscure `U+2000` (EN QUAD) character.
#### Package Chardet

The [Chardet package](https://pypi.org/project/chardet/) tries to guess which of more than 30 supported encodings a byte sequence uses.

In [15]:
# Example: Simulate a real-world text with Portuguese accents
long_text = """
O Brasil é o maior país da América do Sul e o quinto maior do mundo em área territorial. 
Com uma população de mais de 210 milhões de habitantes, é também o sexto país mais populoso. 
A língua oficial é o português, sendo o único país da América a falar esse idioma predominantemente.
As principais cidades são São Paulo, Rio de Janeiro, Brasília (a capital), Salvador e Fortaleza.
A diversidade cultural, as florestas tropicais como a Amazônia, e as belas praias são alguns dos seus atrativos.
"""

# Save the text with Windows-1252 encoding (cp1252)
with open("materials/text-byte.asciidoc.txt", "w", encoding="cp1252") as f:
    f.write(long_text)


In [16]:
import chardet

# Read the raw bytes of the file
with open('materials/text-byte.asciidoc.txt', 'rb') as f:
    raw_data = f.read()

# Detect encoding
result = chardet.detect(raw_data)

print(result)


{'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}
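
Chardet reported ISO-8859-1 although the file was written as cp1252. A small sketch of why that guess is still usable here: the two encodings agree on the accented letters in the sample and differ only for byte values 0x80 to 0x9F. The `result` dictionary below is copied from the output above, and `raw_sample` is a shorter, hypothetical stand-in for the file contents.

```python
# Copied from the chardet output shown above.
result = {'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}

# Hypothetical stand-in for the bytes read from the file.
raw_sample = 'São Paulo, Brasília, Amazônia'.encode('cp1252')

# Accented Latin letters sit at the same byte values in both encodings.
same_text = raw_sample.decode(result['encoding']) == raw_sample.decode('cp1252')
print(same_text)

# The euro sign is one of the characters where they disagree.
euro = '€'.encode('cp1252')
print(euro, repr(euro.decode('cp1252')), repr(euro.decode(result['encoding'])))
```

Treat the guess as a starting point: a text containing curly quotes or the euro sign would decode differently under the two encodings.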


#### BOM: A Useful Gremlin

In [17]:
u16 = 'El Niño'.encode('utf_16')
u16

b'\xff\xfeE\x00l\x00 \x00N\x00i\x00\xf1\x00o\x00'

The first two bytes are `b'\xff\xfe'`. That is a BOM (byte-order mark) denoting the little-endian
byte ordering of the Intel CPU where the encoding was performed.
A byte-order mark is an optional signature at the start of a text file that tells readers:

- "Hey, I'm UTF-16"

- "Hey, my bytes are little-endian" (little-endian = lower byte first)

Because, by design, there is no `U+FFFE` character in Unicode, the byte sequence `b'\xff\xfe'` must mean the `ZERO WIDTH NO-BREAK SPACE` (`U+FEFF`) in a little-endian encoding, so the codec knows which byte ordering to use.

In [18]:
list(u16)

[255, 254, 69, 0, 108, 0, 32, 0, 78, 0, 105, 0, 241, 0, 111, 0]

There is a variant of UTF-16—UTF-16LE—that is explicitly little-endian, and
another one explicitly big-endian, UTF-16BE. If you use them, a BOM is not
generated:

In [19]:
u16le = 'El Niño'.encode('utf_16le')

In [20]:
list(u16le)

[69, 0, 108, 0, 32, 0, 78, 0, 105, 0, 241, 0, 111, 0]

In [21]:
u16be = 'El Niño'.encode('utf_16be')

In [22]:
list(u16be)

[0, 69, 0, 108, 0, 32, 0, 78, 0, 105, 0, 241, 0, 111]
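
The sketch below shows how the BOM steers decoding. It builds both byte orders explicitly with the standard `codecs.BOM_UTF16_LE` and `codecs.BOM_UTF16_BE` constants, so the result does not depend on the machine running it; the variable names are illustrative.

```python
import codecs

text = 'El Niño'

# Prepend the matching BOM to each explicit-endianness encoding.
le_with_bom = codecs.BOM_UTF16_LE + text.encode('utf_16le')
be_with_bom = codecs.BOM_UTF16_BE + text.encode('utf_16be')

# The generic utf_16 codec reads the BOM, picks the byte order,
# and leaves the BOM out of the decoded text.
decoded_le = le_with_bom.decode('utf_16')
decoded_be = be_with_bom.decode('utf_16')
print(decoded_le, decoded_be)

# The explicit variants do not treat the BOM specially, so it
# survives as a U+FEFF character at the start of the string.
kept_bom = le_with_bom.decode('utf_16le')
print(repr(kept_bom))
```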

One big advantage of UTF-8 is that it produces the same byte sequence regardless of machine endianness, so no BOM is needed.

![Figure 63](https://raw.githubusercontent.com/berserkhmdvhb/Training-Python/main/figures/Part_I/63.PNG)