# Chapter 4. Unicode Text Versus Bytes
---

## ToC

1. [Handling Text Files](#handling-text-files)
2. [Normalizing Unicode for Reliable Comparisons](#normalizing-unicode-for-reliable-comparisons)  
    2.1. [Case Folding](#case-folding)  
    2.2. [Utility Functions for Normalized Text Matching](#utility-functions-for-normalized-text-matching)  
    2.3. [Extreme “Normalization”: Taking Out Diacritics](#extreme-normalization-taking-out-diacritics)  
---

## Handling Text Files

The best practice for handling text I/O is the “Unicode sandwich”:

![Figure 64](https://raw.githubusercontent.com/berserkhmdvhb/Training-Python/main/figures/Part_I/64.PNG)

This means that `bytes` should be decoded to `str` as early as possible on input (e.g., when opening a file for reading). The “filling” of the sandwich is the business logic of your program, where text handling is done exclusively on `str` objects. You should never encode or decode in the middle of other processing. On output, `str` objects are encoded to `bytes` as late as possible. Most web frameworks work like that, and we rarely touch `bytes` when using them. In Django, for example, your views should output Unicode `str`; Django itself takes care of encoding the response to `bytes`, using UTF-8 by default.


Python 3 makes it easier to follow the advice of the Unicode sandwich, because the `open()` built-in does the necessary decoding when reading and encoding when writing files in text mode, so all you get from `my_file.read()` and pass to `my_file.write(text)` are `str` objects.
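The sandwich can be sketched with plain `bytes` in and `bytes` out. The names `shout` and `handle` below are hypothetical and exist only for illustration; the point is that decoding and encoding happen once each, at the edges.

```python
def shout(text: str) -> str:
    """Business logic (the filling): works on str only, never on bytes."""
    return text.upper()


def handle(raw: bytes, encoding: str = 'utf_8') -> bytes:
    text = raw.decode(encoding)      # decode as early as possible on input
    result = shout(text)             # all processing happens on str
    return result.encode(encoding)   # encode as late as possible on output


incoming = b'caf\xc3\xa9'            # 'café' encoded as UTF-8
outgoing = handle(incoming)
print(outgoing)                      # b'CAF\xc3\x89'
```

Keeping the codec calls out of `shout` means the business logic never needs to know which encoding the outside world uses.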

**Example:** A platform encoding issue

In [6]:
open('cafe.txt', 'w', encoding='utf_8').write('café')

4

In [7]:
open('cafe.txt').read()

'cafÃ©'

**The bug:** I specified UTF-8 encoding when writing the file but failed to do so when
reading it, so Python fell back on the Windows default file encoding (code page 1252),
and the two trailing bytes in the file were decoded as the characters `Ã©` instead of `é`.
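The same mismatch can be reproduced on any platform without touching the filesystem, by encoding and decoding explicitly. This is only a sketch of the mechanism; the variable names are illustrative.

```python
data = 'café'.encode('utf_8')     # what ends up in the file: 5 bytes
wrong = data.decode('cp1252')     # what a cp1252 default produces on reading
right = data.decode('utf_8')      # what the matching codec produces

print(data)    # b'caf\xc3\xa9'
print(wrong)   # cafÃ©
print(right)   # café
```

The byte `0xC3` is `Ã` and the byte `0xA9` is `©` in code page 1252, which explains the two-character garbage at the end.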

In [45]:
fp = open('cafe.txt', 'w', encoding='utf_8')
fp

<_io.TextIOWrapper name='cafe.txt' mode='w' encoding='utf_8'>

In [46]:
fp.encoding

'utf_8'

In [47]:
fp.write('café')

4

In [48]:
fp.close()

In [49]:
import os
os.stat('cafe.txt').st_size

5

In [50]:
fp2 = open('cafe.txt')

In [51]:
fp2

<_io.TextIOWrapper name='cafe.txt' mode='r' encoding='cp1252'>

In [52]:
fp2.encoding

'cp1252'

In [53]:
fp2.read()

'cafÃ©'

In [54]:
fp3 = open('cafe.txt', encoding='utf_8')
fp3

<_io.TextIOWrapper name='cafe.txt' mode='r' encoding='utf_8'>

In [55]:
fp3.read()

'café'

In [56]:
fp4 = open('cafe.txt', 'rb')
fp4

<_io.BufferedReader name='cafe.txt'>

In [57]:
fp4.read()

b'caf\xc3\xa9'

![Figure 65](https://raw.githubusercontent.com/berserkhmdvhb/Training-Python/main/figures/Part_I/65.PNG)

![Figure 66](https://raw.githubusercontent.com/berserkhmdvhb/Training-Python/main/figures/Part_I/66.PNG)

### Beware of Encoding Defaults

The following output is from a Jupyter Notebook, executed in VS Code on Windows 10. The first cell collects details about the environment:

In [8]:
import sys
import os
import platform
import locale

def get_environment_info():
    info = {
        "OS": platform.system(),
        "OS Version": platform.version(),
        "Platform": platform.platform(),
        "Architecture": platform.architecture(),
        "Machine": platform.machine(),
        "Processor": platform.processor(),
        "Python Version": sys.version,
        "Python Executable": sys.executable,
        "Python Encoding (Preferred)": locale.getpreferredencoding(),
        "Filesystem Encoding": sys.getfilesystemencoding(),
        "Default Encoding": sys.getdefaultencoding(),
       #"Environment Variables": dict(os.environ),
        "Terminal stdout is TTY": sys.stdout.isatty(),
        "Terminal stdout Encoding": sys.stdout.encoding,
        "Terminal stdin is TTY": sys.stdin.isatty(),
        "Terminal stdin Encoding": sys.stdin.encoding,
        "Terminal stderr is TTY": sys.stderr.isatty(),
        "Terminal stderr Encoding": sys.stderr.encoding,
    }
    return info

if __name__ == "__main__":
    env_info = get_environment_info()
    for key, value in env_info.items():
        print(f"{key:>25}: {value}")


                       OS: Windows
               OS Version: 10.0.19045
                 Platform: Windows-10-10.0.19045-SP0
             Architecture: ('64bit', 'WindowsPE')
                  Machine: AMD64
                Processor: Intel64 Family 6 Model 186 Stepping 3, GenuineIntel
           Python Version: 3.13.0 (tags/v3.13.0:60403a5, Oct  7 2024, 09:38:07) [MSC v.1941 64 bit (AMD64)]
        Python Executable: c:\Users\HamedVAHEB\Documents\Training\Python\FluentPython\repo\Training-Python\env_train\Scripts\python.exe
Python Encoding (Preferred): cp1252
      Filesystem Encoding: utf-8
         Default Encoding: utf-8
   Terminal stdout is TTY: False
 Terminal stdout Encoding: UTF-8
    Terminal stdin is TTY: False
  Terminal stdin Encoding: utf-8
   Terminal stderr is TTY: False
 Terminal stderr Encoding: UTF-8


In [22]:
import locale
import sys
expressions = """
locale.getpreferredencoding()
type(my_file)
my_file.encoding
sys.stdout.isatty()
sys.stdout.encoding
sys.stdin.isatty()
sys.stdin.encoding
sys.stderr.isatty()
sys.stderr.encoding
sys.getdefaultencoding()
sys.getfilesystemencoding()

"""
my_file = open('dummy', 'w')
for expression in expressions.split():
    value = eval(expression)
    print(f'{expression:>30} -> {value!r}')

 locale.getpreferredencoding() -> 'cp1252'
                 type(my_file) -> <class '_io.TextIOWrapper'>
              my_file.encoding -> 'cp1252'
           sys.stdout.isatty() -> False
           sys.stdout.encoding -> 'UTF-8'
            sys.stdin.isatty() -> False
            sys.stdin.encoding -> 'utf-8'
           sys.stderr.isatty() -> False
           sys.stderr.encoding -> 'UTF-8'
      sys.getdefaultencoding() -> 'utf-8'
   sys.getfilesystemencoding() -> 'utf-8'


The output of the previous example on GNU/Linux (Ubuntu 14.04 to 19.10) and macOS (10.9 to 10.14) is identical, showing that UTF-8 is used everywhere in these systems:

```python
$ python3 default_encodings.py
locale.getpreferredencoding() -> 'UTF-8'
type(my_file) -> <class '_io.TextIOWrapper'>
my_file.encoding -> 'UTF-8'
sys.stdout.isatty() -> True
sys.stdout.encoding -> 'utf-8'
sys.stdin.isatty() -> True
sys.stdin.encoding -> 'utf-8'
sys.stderr.isatty() -> True
sys.stderr.encoding -> 'utf-8'
sys.getdefaultencoding() -> 'utf-8'
sys.getfilesystemencoding() -> 'utf-8'
```
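Because the default differs between machines, a defensive habit is to pass `encoding=` explicitly on both the write and the read. The following is a minimal sketch using `pathlib` and a temporary directory, so it does not depend on any file from the examples above.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'cafe.txt'
    # Explicit encoding on write: returns the number of characters written.
    written = path.write_text('café', encoding='utf_8')
    # Size on disk is counted in bytes, not characters.
    size = path.stat().st_size
    # Explicit encoding on read: independent of the locale default.
    text = path.read_text(encoding='utf_8')

print(written, size, text)   # 4 5 café
```

With both sides explicit, the result is the same whether `locale.getpreferredencoding()` reports `'cp1252'` or `'UTF-8'`.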

## Normalizing Unicode for Reliable Comparisons

String comparisons are complicated by the fact that Unicode has combining characters:
diacritics and other marks that attach to the preceding character, appearing as
one when printed.

In [11]:
s1 = 'café'
s2 = 'cafe\N{COMBINING ACUTE ACCENT}'
s1, s2

('café', 'café')

In [12]:
len(s1), len(s2)

(4, 5)

In [13]:
s1 == s2

False

Placing `COMBINING ACUTE ACCENT` (U+0301) after `e` renders `é`. In the Unicode standard, sequences like `é` and `e\u0301` are called *canonical equivalents* and applications are supposed to treat them as the same. But Python sees two different sequences of code points, and considers them not equal.

The solution is `unicodedata.normalize()`. The first argument to that function is one of four strings: `'NFC'`, `'NFD'`, `'NFKC'`, and `'NFKD'`.

In [63]:
from unicodedata import normalize
s1 = 'café'
s2 = 'cafe\N{COMBINING ACUTE ACCENT}'
len(s1), len(s2)

(4, 5)

In [64]:
len(normalize('NFC', s1)), len(normalize('NFC', s2))

(4, 4)

In [65]:
len(normalize('NFD', s1)), len(normalize('NFD', s2))

(5, 5)

In [66]:
normalize('NFC', s1) == normalize('NFC', s2)

True

NFC is also the normalization form recommended by the W3C in ["Character Model for the World Wide Web: String
Matching and Searching"](https://www.w3.org/TR/charmod-norm/).

Some single characters are normalized by NFC into another single character. The
symbol for the ohm (`Ω`) unit of electrical resistance is normalized to the Greek uppercase
omega. They are visually identical, but they compare as unequal, so it is essential
to normalize to avoid surprises:

In [21]:
from unicodedata import normalize, name
ohm = '\u2126'
name(ohm)

'OHM SIGN'

In [22]:
ohm_c = normalize('NFC', ohm)
name(ohm_c)

'GREEK CAPITAL LETTER OMEGA'

In [23]:
ohm == ohm_c

False

In [24]:
normalize('NFC', ohm) == normalize('NFC', ohm_c)

True

The other two normalization forms are NFKC and NFKD, where the letter K stands
for “compatibility.” These are stronger forms of normalization, affecting the so-called
“compatibility characters.”

In [49]:
from unicodedata import normalize, name
half = '\N{VULGAR FRACTION ONE HALF}'
print(half)

½


In [50]:
normalize('NFKC', half)

'1⁄2'

In [51]:
for char in normalize('NFKC', half):
    print(char, name(char), sep='\t')

1	DIGIT ONE
⁄	FRACTION SLASH
2	DIGIT TWO


In [52]:
four_squared = '4²'
normalize('NFKC', four_squared)

'42'

In [53]:
micro = '\N{MICRO SIGN}'
micro_kc = normalize('NFKC', micro)
micro, micro_kc

('µ', 'μ')

In [54]:
ord(micro), ord(micro_kc)

(181, 956)

In [48]:
name(micro), name(micro_kc)

('MICRO SIGN', 'GREEK SMALL LETTER MU')
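To compare the four forms side by side, the sketch below tabulates the length of each normalized result for two of the strings used above. The `samples` dictionary and its labels are illustrative choices, not part of the original examples.

```python
from unicodedata import normalize

FORMS = ('NFC', 'NFD', 'NFKC', 'NFKD')

samples = {
    'cafe + accent': 'cafe\N{COMBINING ACUTE ACCENT}',
    'one half': '\N{VULGAR FRACTION ONE HALF}',
}

# Length of each sample after applying each normalization form.
lengths = {
    label: {form: len(normalize(form, text)) for form in FORMS}
    for label, text in samples.items()
}

for label, by_form in lengths.items():
    print(f'{label:>14}: {by_form}')
```

The canonical forms (NFC, NFD) only compose or decompose the accent and leave `½` alone, while the compatibility forms (NFKC, NFKD) also expand `½` into three characters.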

![Figure 67](https://raw.githubusercontent.com/berserkhmdvhb/Training-Python/main/figures/Part_I/67.PNG)

### Case Folding

Case folding is essentially converting all text to lowercase, with some additional transformations. It is supported by the `str.casefold()` method.

For any string `s` containing only `latin1` characters, `s.casefold()` produces the same result as `s.lower()`, with only two exceptions: the micro sign `'µ'` (U+00B5) is changed to the Greek lowercase mu `'μ'` (U+03BC), which looks the same in most fonts, and the German Eszett or
"sharp s" (`ß`) becomes `"ss"`:

In [58]:
str1 = 'A'
str2 = str1.casefold()
str2

'a'

In [30]:
micro = '\N{MICRO SIGN}'
name(micro)

'MICRO SIGN'

In [31]:
micro_cf = micro.casefold()
name(micro_cf)

'GREEK SMALL LETTER MU'

In [32]:
micro, micro_cf

('µ', 'μ')

In [33]:
eszett = 'ß'
name(eszett)

'LATIN SMALL LETTER SHARP S'

In [34]:
eszett_cf = eszett.casefold()
eszett, eszett_cf

('ß', 'ss')

There are nearly 300 code points for which `str.casefold()` and `str.lower()` return different results.

NFC is the best normalized form for most applications. `str.casefold()` is the way to go for case-insensitive comparisons.
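The code points on which the two methods disagree can be listed by brute force. This is a sketch only: the exact count depends on the Unicode database bundled with the running Python version, so no number is asserted here.

```python
import sys

# Every code point whose case-folded form differs from its lowercase form.
differing = [
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if chr(cp).casefold() != chr(cp).lower()
]

# Restrict to the latin1 range (code points below 256).
latin1_differing = [c for c in differing if ord(c) < 256]

print(len(differing))
print(latin1_differing)   # ['µ', 'ß']
```

The latin1 subset contains exactly the micro sign and the Eszett, matching the two exceptions described above.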

### Utility Functions for Normalized Text Matching

In [57]:
"""
Utility functions for normalized Unicode string comparison.
Using Normal Form C, case sensitive:
>>> s1 = 'café'
>>> s2 = 'cafe\u0301'
>>> s1 == s2
False
>>> nfc_equal(s1, s2)
True
>>> nfc_equal('A', 'a')
False
Using Normal Form C with case folding:
>>> s3 = 'Straße'
>>> s4 = 'strasse'
>>> s3 == s4
False
>>> nfc_equal(s3, s4)
False
>>> fold_equal(s3, s4)
True
>>> fold_equal(s1, s2)
True
>>> fold_equal('A', 'a')
True
"""
from unicodedata import normalize

def nfc_equal(str1, str2):
    return normalize('NFC', str1) == normalize('NFC', str2)

def fold_equal(str1, str2):
    return (normalize('NFC', str1).casefold() ==
            normalize('NFC', str2).casefold())
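One way these comparisons might be put to work is as a key function for deduplication. The helper `fold_key` and the sample list below are hypothetical; the sketch applies the same NFC-then-casefold recipe as `fold_equal`.

```python
from unicodedata import normalize


def fold_key(text: str) -> str:
    """Key under which NFC-equivalent, case-insensitive matches collide."""
    return normalize('NFC', text).casefold()


names = ['Straße', 'STRASSE', 'café', 'cafe\u0301', 'Cafe']

# Keep the first spelling seen for each normalized key.
unique = {}
for item in names:
    unique.setdefault(fold_key(item), item)

print(list(unique.values()))   # ['Straße', 'café', 'Cafe']
```

Note that `'Cafe'` survives as a separate entry: normalization and case folding do not remove the accent, so `'cafe'` and `'café'` remain different keys.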

### Extreme “Normalization”: Taking Out Diacritics

The Google Search secret sauce involves many tricks, but one of them apparently is
ignoring diacritics (e.g., accents, cedillas, etc.), at least in some contexts. Removing
diacritics is not a proper form of normalization because it often changes the meaning
of words and may produce false positives when searching. But it helps in coping with
some facts of life: people are sometimes lazy or ignorant about the correct use of diacritics,
and spelling rules change over time, meaning that accents come and go in living
languages.

**Diacritics vs. Combining Marks**

Diacritics are marks added to letters to alter their pronunciation or to distinguish between similar words.

Combining marks are Unicode characters that combine with the preceding base character to visually form a single glyph with a diacritic.

| Aspect             | Diacritics                                 | Combining Marks                             |
|--------------------|---------------------------------------------|---------------------------------------------|
| Concept            | Linguistic / Orthographic                  | Encoding / Digital representation           |
| Role               | Modify pronunciation or meaning            | Modify appearance of base character         |
| Unicode            | May be a **precomposed character** (e.g., `é` = U+00E9) | Typically a **separate character** that follows the base (e.g., `e` + U+0301) |
| Usage              | Used in natural languages                  | Used in digital text to represent diacritics |
| Example            | `é`, `ñ` (single character)                | `e` + `́`, `n` + `̃` (two characters visually combined) |
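The distinction in the table can be inspected with the standard `unicodedata` module. The sketch below decomposes a precomposed `é` and reports, for each resulting code point, its general category and canonical combining class (0 means "not a combining mark").

```python
import unicodedata

precomposed = '\N{LATIN SMALL LETTER E WITH ACUTE}'      # U+00E9, one code point
decomposed = unicodedata.normalize('NFD', precomposed)   # 'e' followed by U+0301

# (code point, general category, canonical combining class)
info = [
    (f'U+{ord(c):04X}', unicodedata.category(c), unicodedata.combining(c))
    for c in decomposed
]
print(info)   # [('U+0065', 'Ll', 0), ('U+0301', 'Mn', 230)]

# The canonical decomposition recorded in the Unicode database.
print(unicodedata.decomposition(precomposed))   # 0065 0301
```

A nonzero value from `unicodedata.combining()` is what the `shave_marks` function below relies on to detect the marks it drops.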


**Example:** function to remove all combining marks

In [2]:
s = 'e\u0301'

print(s)

é


In [3]:
s = '\u00e9'

print(s)

é


In [36]:
import unicodedata
import string
def shave_marks(txt):
    """Remove all diacritic marks"""
    # Decompose all characters into base characters and combining marks.
    norm_txt = unicodedata.normalize('NFD', txt)
    # Filter out all combining marks.
    shaved = ''.join(c for c in norm_txt
        if not unicodedata.combining(c))
    # Recompose all characters.
    return unicodedata.normalize('NFC', shaved)

In [37]:
order = '“Herr Voß: • ½ cup of OEtker™ caffè latte • bowl of açaí.”'
order

'“Herr Voß: • ½ cup of OEtker™ caffè latte • bowl of açaí.”'

In [38]:
shave_marks(order)

'“Herr Voß: • ½ cup of OEtker™ caffe latte • bowl of acai.”'

In [39]:
order_normed = unicodedata.normalize('NFD', order)
order_normed

'“Herr Voß: • ½ cup of OEtker™ caffè latte • bowl of açaí.”'

In [40]:
def show_differences(s1: str, s2: str) -> list:
    return [(i, c1, c2) for i, (c1, c2) in enumerate(zip(s1, s2)) if c1 != c2]

In [None]:
show_differences(order, order_normed)

[(34, 'è', 'e'),
 (35, ' ', '̀'),
 (36, 'l', ' '),
 (37, 'a', 'l'),
 (38, 't', 'a'),
 (40, 'e', 't'),
 (41, ' ', 'e'),
 (42, '•', ' '),
 (43, ' ', '•'),
 (44, 'b', ' '),
 (45, 'o', 'b'),
 (46, 'w', 'o'),
 (47, 'l', 'w'),
 (48, ' ', 'l'),
 (49, 'o', ' '),
 (50, 'f', 'o'),
 (51, ' ', 'f'),
 (52, 'a', ' '),
 (53, 'ç', 'a'),
 (54, 'a', 'c'),
 (55, 'í', '̧'),
 (56, '.', 'a'),
 (57, '”', 'i')]

In [41]:
show_differences(order, shave_marks(order))

[(34, 'è', 'e'), (53, 'ç', 'c'), (55, 'í', 'i')]

In [42]:
import regex
import unicodedata

def show_grapheme_differences(s1: str, s2: str):
    g1 = regex.findall(r'\X', s1)
    g2 = regex.findall(r'\X', s2)

    for i, (c1, c2) in enumerate(zip(g1, g2)):
        if c1 != c2:
            print(f"Position {i}:")
            print(f"  Original   : {repr(c1)} → {[f'U+{ord(ch):04X}' for ch in c1]} ({[unicodedata.name(ch, 'UNKNOWN') for ch in c1]})")
            print(f"  Normalized : {repr(c2)} → {[f'U+{ord(ch):04X}' for ch in c2]} ({[unicodedata.name(ch, 'UNKNOWN') for ch in c2]})\n")

In [43]:
show_grapheme_differences(order, order_normed)

Position 34:
  Original   : 'è' → ['U+00E8'] (['LATIN SMALL LETTER E WITH GRAVE'])
  Normalized : 'è' → ['U+0065', 'U+0300'] (['LATIN SMALL LETTER E', 'COMBINING GRAVE ACCENT'])

Position 53:
  Original   : 'ç' → ['U+00E7'] (['LATIN SMALL LETTER C WITH CEDILLA'])
  Normalized : 'ç' → ['U+0063', 'U+0327'] (['LATIN SMALL LETTER C', 'COMBINING CEDILLA'])

Position 55:
  Original   : 'í' → ['U+00ED'] (['LATIN SMALL LETTER I WITH ACUTE'])
  Normalized : 'í' → ['U+0069', 'U+0301'] (['LATIN SMALL LETTER I', 'COMBINING ACUTE ACCENT'])



In [45]:
show_grapheme_differences(order_normed, shave_marks(order))

Position 34:
  Original   : 'è' → ['U+0065', 'U+0300'] (['LATIN SMALL LETTER E', 'COMBINING GRAVE ACCENT'])
  Normalized : 'e' → ['U+0065'] (['LATIN SMALL LETTER E'])

Position 53:
  Original   : 'ç' → ['U+0063', 'U+0327'] (['LATIN SMALL LETTER C', 'COMBINING CEDILLA'])
  Normalized : 'c' → ['U+0063'] (['LATIN SMALL LETTER C'])

Position 55:
  Original   : 'í' → ['U+0069', 'U+0301'] (['LATIN SMALL LETTER I', 'COMBINING ACUTE ACCENT'])
  Normalized : 'i' → ['U+0069'] (['LATIN SMALL LETTER I'])



In [21]:
Greek = 'Ζέφυρος, Zéfiro'
shave_marks(Greek)

'Ζεφυρος, Zefiro'

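Note that `shave_marks` also changes non-Latin characters, such as the Greek letters above, which will never become ASCII just by losing their accents. The `shave_marks_latin` function below therefore removes combining marks only when the base character is a Latin letter.

An even more radical step is to replace common symbols in Western texts
(e.g., curly quotes, em dashes, bullets, etc.) with ASCII equivalents:

**Example:** Remove marks from Latin characters only, then transform some Western typographical symbols into ASCII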

In [64]:
def shave_marks_latin(txt):
    """Remove all diacritic marks from Latin base characters"""
    norm_txt = unicodedata.normalize('NFD', txt)
    latin_base = False
    preserve = []
    for c in norm_txt:
        if unicodedata.combining(c) and latin_base:
            continue # ignore diacritic on Latin base char
        preserve.append(c)
        # if it isn't a combining char, it's a new base char
        if not unicodedata.combining(c):
            latin_base = c in string.ascii_letters
    shaved = ''.join(preserve)
    return unicodedata.normalize('NFC', shaved)

In [None]:
# Build mapping table for char-to-char replacement.
single_map = str.maketrans("""‚ƒ„ˆ‹‘’“”•–—˜›""",
"""'f"^<''""---~>""")

# Build mapping table for char-to-string replacement.
multi_map = str.maketrans({
    '€': 'EUR',
    '…': '...',
    'Æ': 'AE',
    'æ': 'ae',
    'Œ': 'OE',
    'œ': 'oe',
    '™': '(TM)',
    '‰': '<per mille>',
    '†': '**',
    '‡': '***', 
})

In [None]:
# Merge mapping tables.
multi_map.update(single_map)

In [None]:
def dewinize(txt):
    """Replace Win1252 symbols with ASCII chars or sequences"""
    return txt.translate(multi_map)

`dewinize` does not affect ASCII or latin1 text, only the Microsoft additions to latin1 in cp1252.
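The mechanism behind `dewinize` is `str.maketrans()` plus `str.translate()`. The reduced table below is a sketch with a handful of symbols chosen for illustration; it shows that both call styles of `maketrans` yield a plain `dict` keyed by code point, which is why the two tables can be merged with `update()`.

```python
# Char-to-char mapping: two strings of equal length.
table = str.maketrans('“”–', '""-')

# Char-to-string mapping: a dict with single-character keys.
table.update(str.maketrans({'…': '...', '€': 'EUR'}))

# All keys are integer code points, regardless of how the table was built.
keys_are_ints = all(isinstance(key, int) for key in table)

sample = '“Wait… only 5€ – really?”'
result = sample.translate(table)
print(result)   # "Wait... only 5EUR - really?"
```

Characters that are not keys in the table pass through `translate()` unchanged.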

In [70]:
def asciize(txt):
    # Apply dewinize and remove diacritical marks.
    no_marks = shave_marks_latin(dewinize(txt))
    # Replace the Eszett with “ss” (we are not using case fold here because we want to preserve the case).
    no_marks = no_marks.replace('ß', 'ss')
    # Apply NFKC normalization to compose characters with their compatibility code points.
    return unicodedata.normalize('NFKC', no_marks)

In [71]:
order = '“Herr Voß: • ½ cup of OEtker™ caffè latte • bowl of açaí.”'
dewinize(order)

'"Herr Voß: - ½ cup of OEtker(TM) caffè latte - bowl of açaí."'

In [72]:
asciize(order)

'"Herr Voss: - 1⁄2 cup of OEtker(TM) caffe latte - bowl of acai."'

![Figure 68](https://raw.githubusercontent.com/berserkhmdvhb/Training-Python/main/figures/Part_I/68.PNG)
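One caveat worth checking is that the result of `asciize` above is not pure ASCII: NFKC turns `½` into `1⁄2`, and the middle character is FRACTION SLASH (U+2044), not the ASCII solidus. The sketch below isolates that step; replacing the slash afterwards is an assumption about what a caller might want, not part of the original function.

```python
from unicodedata import name, normalize

half = normalize('NFKC', '\N{VULGAR FRACTION ONE HALF}')

# Characters that survive NFKC but are still outside ASCII.
leftovers = [(c, name(c)) for c in half if not c.isascii()]
print(leftovers)   # [('⁄', 'FRACTION SLASH')]

# Hypothetical extra step: map the fraction slash to the ASCII solidus.
ascii_half = half.replace('\N{FRACTION SLASH}', '/')
print(ascii_half, ascii_half.isascii())   # 1/2 True
```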