# Chapter 4. Unicode Text Versus Bytes
---

## ToC

1. [Sorting Unicode Text](#sorting-unicode-text)  
    1.1. [Sorting with the Unicode Collation Algorithm](#sorting-with-the-unicode-collation-algorithm)  
2. [The Unicode Database](#the-unicode-database)  
    2.1. [Finding Characters by Name](#finding-characters-by-name)  
    2.2. [My Version of char_finder](#my-version-of-char_finder)  
    2.3. [Numeric Meaning of Characters](#numeric-meaning-of-characters)

---

## Sorting Unicode Text

Python sorts sequences of any type by comparing the items in each sequence one by
one. For strings, this means comparing the code points. Unfortunately, this produces
unacceptable results for anyone who uses non-ASCII characters.

In [8]:
fruits = ['caju', 'atemoia', 'cajá', 'açaí', 'acerola']
sorted(fruits)

['acerola', 'atemoia', 'açaí', 'caju', 'cajá']

Sorting rules vary for different locales, but in Portuguese and many languages that use the Latin alphabet, accents and cedillas rarely make a difference when sorting. So “cajá” is sorted as “caja,” and must come before “caju.”

The sorted `fruits` list should be:

```python
['açaí', 'acerola', 'atemoia', 'cajá', 'caju']
```
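
To make the "accents rarely make a difference" rule concrete, here is a minimal sketch of a sort key that simply drops diacritics before comparing. The helper name `strip_accents` is hypothetical, and this is only a rough approximation: it is not a real collation algorithm and ignores locale-specific rules.

```python
import unicodedata

def strip_accents(text):
    """Return text without combining marks, e.g. 'açaí' -> 'acai'."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize('NFD', text)
    # combining() returns 0 for base characters and non-zero for the marks.
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

fruits = ['caju', 'atemoia', 'cajá', 'açaí', 'acerola']
print(sorted(fruits, key=strip_accents))
```

For this particular list the result matches the expected order above, because "cajá" is compared as "caja" and "açaí" as "acai".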

The standard way to sort non-ASCII text in Python is to use the `locale.strxfrm`
function which, according to the [`locale` module docs](https://docs.python.org/3/library/locale.html#locale.strxfrm), “transforms a string to one
that can be used in locale-aware comparisons.”
To enable `locale.strxfrm`, you must first set a suitable locale for your application,
and pray that the OS supports it.

In [6]:
import locale
my_locale = locale.setlocale(locale.LC_COLLATE, 'pt_BR.UTF-8')
print(my_locale)

pt_BR.UTF-8


In [7]:
fruits = ['caju', 'atemoia', 'cajá', 'açaí', 'acerola']
sorted_fruits = sorted(fruits, key=locale.strxfrm)
print(sorted_fruits)

['açaí', 'acerola', 'atemoia', 'cajá', 'caju']


You need to call `setlocale(LC_COLLATE, «your_locale»)` before using `locale.strxfrm` as the key when sorting.

The standard library solution based on `locale` works, but depending on locale settings creates deployment headaches: the locale must be installed on the OS, locale names are spelled differently across platforms, and `setlocale` changes global state for the whole process. Fortunately, there is a simpler solution, presented in the next section.
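
One way to soften those deployment problems is to treat the locale as optional. The sketch below is an assumption of mine, not part of the original notes: the helper name `sort_with_locale` is hypothetical, it falls back to plain code point order when the OS lacks the requested locale, and it restores the previous `LC_COLLATE` setting afterwards.

```python
import locale

def sort_with_locale(words, locale_name='pt_BR.UTF-8'):
    """Sort words with locale-aware collation if the locale is available."""
    # Calling setlocale without a second argument only queries the current value.
    previous = locale.setlocale(locale.LC_COLLATE)
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        # The OS does not provide this locale: fall back to code point order.
        return sorted(words)
    try:
        return sorted(words, key=locale.strxfrm)
    finally:
        # setlocale affects the whole process, so put the old value back.
        locale.setlocale(locale.LC_COLLATE, previous)

fruits = ['caju', 'atemoia', 'cajá', 'açaí', 'acerola']
print(sort_with_locale(fruits))
```

The output depends on the machine: with the locale installed it should match the locale-aware order, otherwise it is the same as plain `sorted(fruits)`. Because `setlocale` is process-wide, this pattern is still not safe to use from several threads at once.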

### Sorting with the Unicode Collation Algorithm

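The third-party `pyuca` library is a pure-Python implementation of the Unicode Collation Algorithm (UCA), so it does not depend on the OS locale:

In [9]: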
import pyuca
coll = pyuca.Collator()
fruits = ['caju', 'atemoia', 'cajá', 'açaí', 'acerola']
sorted_fruits = sorted(fruits, key=coll.sort_key)
sorted_fruits

['açaí', 'acerola', 'atemoia', 'cajá', 'caju']

![Figure 69](https://raw.githubusercontent.com/berserkhmdvhb/Training-Python/main/figures/Part_I/69.PNG)

## The Unicode Database

The Unicode standard provides an entire database (in the form of several structured text files) that includes not only the table mapping code points to character names, but also metadata about the individual characters and how they are related. For example, the Unicode database records whether a character is printable, is a letter, is a decimal digit, or is some other numeric symbol. That’s how the `str` methods `isalpha`, `isprintable`, `isdecimal`, and `isnumeric` work. `str.casefold` also uses information from a Unicode table.

![Figure 70](https://raw.githubusercontent.com/berserkhmdvhb/Training-Python/main/figures/Part_I/70.PNG)

[Link: Unicode_character_property#General_Category](https://en.wikipedia.org/wiki/Unicode_character_property#General_Category)
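
The metadata described above can be inspected directly. The following is a small sketch using the standard `unicodedata` module and the `str` methods named in the text; the helper name `describe` and the sample characters are my own choices for illustration.

```python
import unicodedata

def describe(char):
    """Collect a few Unicode database facts about one character."""
    return {
        'char': char,
        # name() raises ValueError for unnamed characters unless a default is given.
        'name': unicodedata.name(char, '(unnamed)'),
        'category': unicodedata.category(char),  # General Category, e.g. 'Lu'
        'isalpha': char.isalpha(),
        'isdecimal': char.isdecimal(),
        'isnumeric': char.isnumeric(),
        'isprintable': char.isprintable(),
    }

for char in 'A5½€\n':
    print(describe(char))

# casefold() also relies on Unicode data: the German sharp s folds to 'ss'.
print('Straße'.casefold())
```

Note how `'½'` is numeric but not a decimal digit (category `No`), while the newline is a control character (category `Cc`) and therefore not printable.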

### Finding Characters by Name

The `unicodedata` module has functions to retrieve character metadata, including `unicodedata.name()`, which returns a character’s official name in the standard. You can use the `name()` function to build apps that let users search for characters by name.

In [12]:
from unicodedata import name
print(name('A'))
print(name('ã'))
print(name('😸'))
print(name('♛'))

LATIN CAPITAL LETTER A
LATIN SMALL LETTER A WITH TILDE
GRINNING CAT FACE WITH SMILING EYES
BLACK CHESS QUEEN
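
The source of `cf.py` is not listed in these notes, so here is a minimal sketch of how a name-based finder can be built on `unicodedata.name()`. It assumes whole-word matching (every query word must appear as a word in the character name); the function name `find` and its `start` and `end` parameters are illustrative choices.

```python
import sys
import unicodedata

def find(*query_words, start=32, end=sys.maxunicode + 1):
    """Yield 'U+XXXX<TAB>char<TAB>NAME' for names containing all query words."""
    query = {word.upper() for word in query_words}
    for code in range(start, end):
        char = chr(code)
        # The second argument is returned for code points that have no name.
        name = unicodedata.name(char, None)
        if name and query.issubset(name.split()):
            yield f'U+{code:04X}\t{char}\t{name}'

for line in find('cat', 'smiling'):
    print(line)
```

Scanning the full code point range takes a moment; passing `start` and `end` restricts the search to one block when that is enough.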


In [26]:
import sys
import os

# Add the folder containing 'cf.py' to sys.path
module_path = os.path.abspath(os.path.join("..", "charfinder"))
if module_path not in sys.path:
    sys.path.append(module_path)

#import cf

In [24]:
!python ./charfinder/cf.py cat smiling

U+1F638	😸	GRINNING CAT FACE WITH SMILING EYES
U+1F63A	😺	SMILING CAT FACE WITH OPEN MOUTH
U+1F63B	😻	SMILING CAT FACE WITH HEART-SHAPED EYES


In [51]:
!python ./charfinder/cf.py arrow

U+02FF	˿	MODIFIER LETTER LOW LEFT ARROW
U+034D	͍	COMBINING LEFT RIGHT ARROW BELOW
U+034E	͎	COMBINING UPWARDS ARROW BELOW
U+0362	͢	COMBINING DOUBLE RIGHTWARDS ARROW BELOW
U+1AB3	᪳	COMBINING DOWNWARDS ARROW
U+20D4	⃔	COMBINING ANTICLOCKWISE ARROW ABOVE
U+20D5	⃕	COMBINING CLOCKWISE ARROW ABOVE
U+20D6	⃖	COMBINING LEFT ARROW ABOVE
U+20D7	⃗	COMBINING RIGHT ARROW ABOVE
U+20E1	⃡	COMBINING LEFT RIGHT ARROW ABOVE
U+20EA	⃪	COMBINING LEFTWARDS ARROW OVERLAY
U+20EE	⃮	COMBINING LEFT ARROW BELOW
U+20EF	⃯	COMBINING RIGHT ARROW BELOW
U+2190	←	LEFTWARDS ARROW
U+2191	↑	UPWARDS ARROW
U+2192	→	RIGHTWARDS ARROW
U+2193	↓	DOWNWARDS ARROW
U+2194	↔	LEFT RIGHT ARROW
U+2195	↕	UP DOWN ARROW
U+2196	↖	NORTH WEST ARROW
U+2197	↗	NORTH EAST ARROW
U+2198	↘	SOUTH EAST ARROW
U+2199	↙	SOUTH WEST ARROW
U+219A	↚	LEFTWARDS ARROW WITH STROKE
U+219B	↛	RIGHTWARDS ARROW WITH STROKE
U+219C	↜	LEFTWARDS WAVE ARROW
U+219D	↝	RIGHTWARDS WAVE ARROW
U+219E	↞	LEFTWARDS TWO HEADED ARROW
U+

### My Version of char_finder

I added several features and followed best practices to extend the original character finder script, `cf.py`.

**Setup**

In [21]:
import sys
import os

# Go two levels down to reach the charfinder_hmd directory
module_path = os.path.abspath(os.path.join("charfinder", "charfinder_hmd"))
if module_path not in sys.path:
    sys.path.insert(0, module_path)

from cf_lib import find_chars
for line in find_chars("snowman"):
    print(line)

[INFO] Loaded Unicode name cache from: unicode_name_cache.json
[INFO] Found 3 match(es) for query: 'snowman'
U+2603	☃	SNOWMAN  (\u2603)
U+26C4	⛄	SNOWMAN WITHOUT SNOW  (\u26c4)
U+26C7	⛇	BLACK SNOWMAN  (\u26c7)


In [22]:
for line in list(find_chars("heart", verbose=False))[:3]:
    print(repr(line))

'U+2619\t☙\tREVERSED ROTATED FLORAL HEART BULLET  (\\u2619)'
'U+2661\t♡\tWHITE HEART SUIT  (\\u2661)'
'U+2665\t♥\tBLACK HEART SUIT  (\\u2665)'


In [23]:
import os
import subprocess
import sys

# Step 1: Change into the project root, unless we are already there
if os.path.basename(os.getcwd()) != "charfinder_hmd":
    os.chdir(os.path.join(os.getcwd(), "charfinder", "charfinder_hmd"))

# Step 2: Confirm script path
CLI_SCRIPT = os.path.abspath("cf_cli.py")
assert os.path.exists(CLI_SCRIPT), f"cf_cli.py not found at {CLI_SCRIPT}"


In [24]:
def run_example(title, args, show_exit_code=True):
    print(f"\n=== {title} ===")

    result = subprocess.run(
        [sys.executable, CLI_SCRIPT] + args,
        capture_output=True,
        text=True,
        encoding="utf-8"
    )

    out = result.stdout.strip()
    err = result.stderr.strip()

    if out:
        print("📤 STDOUT:\n" + out)
    else:
        print("📤 STDOUT: (no output)")

    if err:
        print("⚠️ STDERR:\n" + err)

    if show_exit_code:
        print(f"🔚 EXIT CODE: {result.returncode}")

**Unit Tests**

In [25]:
import pytest
pytest.main(["tests", "-v", "--maxfail=1", "--disable-warnings", "--color=yes"])

platform win32 -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- c:\Users\HamedVAHEB\Documents\Training\Python\FluentPython\repo\Training-Python\env_train\Scripts\python.exe
cachedir: .pytest_cache
rootdir: c:\Users\HamedVAHEB\Documents\Training\Python\FluentPython\repo\Training-Python\src\Part_I\Chapter_04_UnicodeTextsVSBytes\charfinder\charfinder_hmd
collecting ... collected 26 items

tests/test_cli.py::test_cli_strict_match PASSED                          [  3%]
tests/test_cli.py::test_cli_fuzzy_match PASSED                           [  7%]
tests/test_cli.py::test_cli_threshold_loose PASSED                       [ 11%]
tests/test_cli.py::test_cli_threshold_strict FAILED                      [ 15%]

__________________________ test_cli_threshold_strict __________________________

    def test_cli_threshold_strict():
        out, err, code = run_cli(['-q', 'grnning', '--fuzzy', '--threshold', '0.95'])
>       assert code == 0
E       assert 2 == 0

tests\test_cli.py:34: AssertionError
FA

<ExitCode.TESTS_FAILED: 1>

#### Examples

**Example 0:** Comparison with original

In [12]:
run_example("Basic match: arrow", ["-q", "arrow"])


=== Basic match: arrow ===
📤 STDOUT:
[INFO] Loaded Unicode name cache from: unicode_name_cache.json
[INFO] Found 626 match(es) for query: 'arrow'
U+02C2	˂	MODIFIER LETTER LEFT ARROWHEAD  (\u02c2)
U+02C3	˃	MODIFIER LETTER RIGHT ARROWHEAD  (\u02c3)
U+02C4	˄	MODIFIER LETTER UP ARROWHEAD  (\u02c4)
U+02C5	˅	MODIFIER LETTER DOWN ARROWHEAD  (\u02c5)
U+02EF	˯	MODIFIER LETTER LOW DOWN ARROWHEAD  (\u02ef)
U+02F0	˰	MODIFIER LETTER LOW UP ARROWHEAD  (\u02f0)
U+02F1	˱	MODIFIER LETTER LOW LEFT ARROWHEAD  (\u02f1)
U+02F2	˲	MODIFIER LETTER LOW RIGHT ARROWHEAD  (\u02f2)
U+02FF	˿	MODIFIER LETTER LOW LEFT ARROW  (\u02ff)
U+034D	͍	COMBINING LEFT RIGHT ARROW BELOW  (\u034d)
U+034E	͎	COMBINING UPWARDS ARROW BELOW  (\u034e)
U+0350	͐	COMBINING RIGHT ARROWHEAD ABOVE  (\u0350)
U+0354	͔	COMBINING LEFT ARROWHEAD BELOW  (\u0354)
U+0355	͕	COMBINING RIGHT ARROWHEAD BELOW  (\u0355)
U+0356	͖	COMBINING RIGHT ARROWHEAD AND UP ARROWHEAD BELOW  (\u0356)
U+0362	͢	COMBINING DOUBLE RIGHTWARDS ARROW BELOW

**Example 1:** Basic strict match: "heart"

In [13]:
run_example("Example 1: Strict Match - 'heart'", ["-q", "heart"])


=== Example 1: Strict Match - 'heart' ===
📤 STDOUT:
[INFO] Loaded Unicode name cache from: unicode_name_cache.json
[INFO] Found 52 match(es) for query: 'heart'
U+2619	☙	REVERSED ROTATED FLORAL HEART BULLET  (\u2619)
U+2661	♡	WHITE HEART SUIT  (\u2661)
U+2665	♥	BLACK HEART SUIT  (\u2665)
U+2763	❣	HEAVY HEART EXCLAMATION MARK ORNAMENT  (\u2763)
U+2764	❤	HEAVY BLACK HEART  (\u2764)
U+2765	❥	ROTATED HEAVY BLACK HEART BULLET  (\u2765)
U+2766	❦	FLORAL HEART  (\u2766)
U+2767	❧	ROTATED FLORAL HEART BULLET  (\u2767)
U+2E96	⺖	CJK RADICAL HEART ONE  (\u2e96)
U+2E97	⺗	CJK RADICAL HEART TWO  (\u2e97)
U+2F3C	⼼	KANGXI RADICAL HEART  (\u2f3c)
U+1F0B1	🂱	PLAYING CARD ACE OF HEARTS  (\u1f0b1)
U+1F0B2	🂲	PLAYING CARD TWO OF HEARTS  (\u1f0b2)
U+1F0B3	🂳	PLAYING CARD THREE OF HEARTS  (\u1f0b3)
U+1F0B4	🂴	PLAYING CARD FOUR OF HEARTS  (\u1f0b4)
U+1F0B5	🂵	PLAYING CARD FIVE OF HEARTS  (\u1f0b5)
U+1F0B6	🂶	PLAYING CARD SIX OF HEARTS  (\u1f0b6)
U+1F0B7	🂷	PLAYING CARD SEV

**Example 2:** Fuzzy Match with Typo: "grnning" (intended: "grinning")

In [14]:
run_example("Example 2: Fuzzy Match with Typo - 'grnning' (intended: 'grinning')", ["-q", "grnning", "--fuzzy"])


=== Example 2: Fuzzy Match with Typo - 'grnning' (intended: 'grinning') ===
📤 STDOUT:
[INFO] Loaded Unicode name cache from: unicode_name_cache.json
[INFO] No exact match found for 'grnning', trying fuzzy matching (threshold=0.7)...
[INFO] Found 2 match(es) for query: 'grnning'
U+1F48D	💍	RING  (\u1f48d)
U+1F600	😀	GRINNING FACE  (\u1f600)
🔚 EXIT CODE: 0


**Example 3:** Unicode with Diacritics: "acute"

In [15]:
run_example("Example 3: Diacritics - 'acute'", ["-q", "acute"])


=== Example 3: Diacritics - 'acute' ===
📤 STDOUT:
[INFO] Loaded Unicode name cache from: unicode_name_cache.json
[INFO] Found 98 match(es) for query: 'acute'
U+00B4	´	ACUTE ACCENT  (\u00b4)
U+00C1	Á	LATIN CAPITAL LETTER A WITH ACUTE  (\u00c1)
U+00C9	É	LATIN CAPITAL LETTER E WITH ACUTE  (\u00c9)
U+00CD	Í	LATIN CAPITAL LETTER I WITH ACUTE  (\u00cd)
U+00D3	Ó	LATIN CAPITAL LETTER O WITH ACUTE  (\u00d3)
U+00DA	Ú	LATIN CAPITAL LETTER U WITH ACUTE  (\u00da)
U+00DD	Ý	LATIN CAPITAL LETTER Y WITH ACUTE  (\u00dd)
U+00E1	á	LATIN SMALL LETTER A WITH ACUTE  (\u00e1)
U+00E9	é	LATIN SMALL LETTER E WITH ACUTE  (\u00e9)
U+00ED	í	LATIN SMALL LETTER I WITH ACUTE  (\u00ed)
U+00F3	ó	LATIN SMALL LETTER O WITH ACUTE  (\u00f3)
U+00FA	ú	LATIN SMALL LETTER U WITH ACUTE  (\u00fa)
U+00FD	ý	LATIN SMALL LETTER Y WITH ACUTE  (\u00fd)
U+0106	Ć	LATIN CAPITAL LETTER C WITH ACUTE  (\u0106)
U+0107	ć	LATIN SMALL LETTER C WITH ACUTE  (\u0107)
U+0139	Ĺ	LATIN CAPITAL LETTER L WITH ACUTE  (\u0139)
U+013A	ĺ

**Example 4:** Partial Word Match: "snow"

In [16]:
run_example("Example 4: Partial Word - 'snow'", ["-q", "snow"])


=== Example 4: Partial Word - 'snow' ===
📤 STDOUT:
[INFO] Loaded Unicode name cache from: unicode_name_cache.json
[INFO] Found 9 match(es) for query: 'snow'
U+2603	☃	SNOWMAN  (\u2603)
U+26C4	⛄	SNOWMAN WITHOUT SNOW  (\u26c4)
U+26C7	⛇	BLACK SNOWMAN  (\u26c7)
U+2744	❄	SNOWFLAKE  (\u2744)
U+2745	❅	TIGHT TRIFOLIATE SNOWFLAKE  (\u2745)
U+2746	❆	HEAVY CHEVRON SNOWFLAKE  (\u2746)
U+1F328	🌨	CLOUD WITH SNOW  (\u1f328)
U+1F3C2	🏂	SNOWBOARDER  (\u1f3c2)
U+1F3D4	🏔	SNOW CAPPED MOUNTAIN  (\u1f3d4)
🔚 EXIT CODE: 0
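
The outputs suggest the two tools match differently: the original `cf.py` run for "arrow" lists only names where ARROW is a whole word, while this version also returns ARROWHEAD and SNOWMAN. The sketch below contrasts the two strategies. The helper names are hypothetical and the code is an illustration of the idea, not an excerpt from `cf_lib`.

```python
import unicodedata

def matches_words(query, name):
    """Whole-word match: every query word must be a word of the name."""
    return set(query.upper().split()).issubset(name.split())

def matches_substring(query, name):
    """Partial match: the query may appear anywhere inside the name."""
    return query.upper() in name

for char in '\u2603\u2744\U0001F328':
    name = unicodedata.name(char)
    print(f'{name:<16} words={matches_words("snow", name)!s:<6}'
          f' substring={matches_substring("snow", name)}')
```

Only "CLOUD WITH SNOW" passes the whole-word test, while all three names pass the substring test.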


**Example 5:** Tweaking Fuzzy Matching Threshold

In [17]:
# Moderate Threshold (Default)
run_example("Example 5a: Fuzzy Match (threshold 0.7)", ["-q", "grnning", "--fuzzy", "--threshold", "0.7"])


=== Example 5a: Fuzzy Match (threshold 0.7) ===
📤 STDOUT:
[INFO] Loaded Unicode name cache from: unicode_name_cache.json
[INFO] No exact match found for 'grnning', trying fuzzy matching (threshold=0.7)...
[INFO] Found 2 match(es) for query: 'grnning'
U+1F48D	💍	RING  (\u1f48d)
U+1F600	😀	GRINNING FACE  (\u1f600)
🔚 EXIT CODE: 0


In [18]:
# Loose Threshold
run_example("Example 5b: Fuzzy Match (threshold 0.6)", ["-q", "grnning", "--fuzzy", "--threshold", "0.6"])


=== Example 5b: Fuzzy Match (threshold 0.6) ===
📤 STDOUT:
[INFO] Loaded Unicode name cache from: unicode_name_cache.json
[INFO] No exact match found for 'grnning', trying fuzzy matching (threshold=0.6)...
[INFO] Found 3 match(es) for query: 'grnning'
U+2607	☇	LIGHTNING  (\u2607)
U+1F48D	💍	RING  (\u1f48d)
U+1F600	😀	GRINNING FACE  (\u1f600)
🔚 EXIT CODE: 0


In [19]:
#  Strict Threshold
run_example("Example 5c: Fuzzy Match (threshold 0.71)", ["-q", "grnning", "--fuzzy", "--threshold", "0.71"])


=== Example 5c: Fuzzy Match (threshold 0.71) ===
📤 STDOUT:
[INFO] Loaded Unicode name cache from: unicode_name_cache.json
[INFO] No exact match found for 'grnning', trying fuzzy matching (threshold=0.71)...
[INFO] Found 1 match(es) for query: 'grnning'
U+1F48D	💍	RING  (\u1f48d)
🔚 EXIT CODE: 0
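
The way the result set shrinks as the threshold rises can be reproduced with `difflib.SequenceMatcher` from the standard library. This is an assumption on my part: the notes do not show how `cf_lib` scores candidates, but comparing the lowercased query with the lowercased full character name gives ratios that are consistent with the three runs above. The helper name `similarity` is hypothetical.

```python
from difflib import SequenceMatcher

def similarity(query, name):
    """Return a 0..1 similarity ratio between the query and a character name."""
    return SequenceMatcher(None, query.lower(), name.lower()).ratio()

query = 'grnning'
for name in ['RING', 'GRINNING FACE', 'LIGHTNING']:
    print(f'{name:<15}{similarity(query, name):.3f}')
```

Under this scoring, RING gets about 0.727, GRINNING FACE exactly 0.700, and LIGHTNING 0.625. That would explain why a threshold of 0.6 returns all three, 0.7 drops LIGHTNING, and 0.71 keeps only RING, even though GRINNING FACE is the intended match.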


### Numeric Meaning of Characters