# Chapter 4. Unicode Text Versus Bytes
---

## ToC

1. [Sorting Unicode Text](#sorting-unicode-text)  
    1.1. [Sorting with the Unicode Collation Algorithm](#sorting-with-the-unicode-collation-algorithm)  
2. [The Unicode Database](#the-unicode-database)  
    2.1. [Finding Characters by Name](#finding-characters-by-name)  
    2.2. [My Version of char_finder](#my-version-of-char_finder)  
    2.3. [Numeric Meaning of Characters](#numeric-meaning-of-characters)
---

## Sorting Unicode Text

Python sorts sequences of any type by comparing the items in each sequence one by
one. For strings, this means comparing the code points. Unfortunately, this produces
unacceptable results for anyone who uses non-ASCII characters.

In [8]:
fruits = ['caju', 'atemoia', 'cajá', 'açaí', 'acerola']
sorted(fruits)

['acerola', 'atemoia', 'açaí', 'caju', 'cajá']

Sorting rules vary for different locales, but in Portuguese and many languages that use the Latin alphabet, accents and cedillas rarely make a difference when sorting. So “cajá” is sorted as “caja,” and must come before “caju.”

The sorted `fruits` list should be:

```python
['açaí', 'acerola', 'atemoia', 'cajá', 'caju']
```
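
To see why the default order comes out wrong, it helps to look at the code points involved. The snippet below is only an illustration using the same `fruits` list; the variable name `by_code_point` is mine.

```python
# Print the code points that decide the default ordering.
for ch in "tçuá":
    print(f"{ch!r} -> U+{ord(ch):04X} ({ord(ch)})")

# 'ç' (231) is greater than 't' (116), so 'açaí' lands after 'atemoia';
# 'á' (225) is greater than 'u' (117), so 'cajá' lands after 'caju'.
fruits = ['caju', 'atemoia', 'cajá', 'açaí', 'acerola']
by_code_point = sorted(fruits)
print(by_code_point)
```

Every accented letter used here has a higher code point than any ASCII letter, which is why plain `sorted` pushes the accented words toward the end of each group.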

The standard way to sort non-ASCII text in Python is to use the `locale.strxfrm` function which, according to the [`locale` module docs](https://docs.python.org/3/library/locale.html#locale.strxfrm), “transforms a string to one that can be used in locale-aware comparisons.”
To enable `locale.strxfrm`, you must first set a suitable locale for your application, and hope that the OS supports it.

In [6]:
import locale
my_locale = locale.setlocale(locale.LC_COLLATE, 'pt_BR.UTF-8')
print(my_locale)

pt_BR.UTF-8


In [7]:
fruits = ['caju', 'atemoia', 'cajá', 'açaí', 'acerola']
sorted_fruits = sorted(fruits, key=locale.strxfrm)
print(sorted_fruits)

['açaí', 'acerola', 'atemoia', 'cajá', 'caju']


You need to call `setlocale(LC_COLLATE, «your_locale»)` before using `locale.strxfrm` as the key when sorting.

The standard library solution based on `locale` works, but it relies on a process-wide setting and on locales that must be installed in the OS, which creates deployment headaches. Fortunately, there is a simpler solution, presented in the next section.
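
One way to soften the deployment problem is to treat a missing locale as an expected case. The sketch below is my own helper, not part of the `locale` module: the name `sorted_for_locale` and the fallback to plain code point order are assumptions made for illustration.

```python
import locale

def sorted_for_locale(words, locale_name):
    """Sort words with locale-aware collation, falling back to code point order."""
    previous = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        # The OS does not provide this locale: fall back to plain sorting.
        return sorted(words)
    try:
        return sorted(words, key=locale.strxfrm)
    finally:
        # setlocale changes process-wide state, so restore the old value.
        locale.setlocale(locale.LC_COLLATE, previous)

fruits = ['caju', 'atemoia', 'cajá', 'açaí', 'acerola']
print(sorted_for_locale(fruits, 'pt_BR.UTF-8'))
```

The result depends on the machine: where `pt_BR.UTF-8` is installed the list should come out in Portuguese order, otherwise it is sorted by code point.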

### Sorting with the Unicode Collation Algorithm

The `pyuca` library is a pure-Python implementation of the Unicode Collation Algorithm (UCA). It needs no locale setup:

In [9]:
import pyuca
coll = pyuca.Collator()
fruits = ['caju', 'atemoia', 'cajá', 'açaí', 'acerola']
sorted_fruits = sorted(fruits, key=coll.sort_key)
sorted_fruits

['açaí', 'acerola', 'atemoia', 'cajá', 'caju']

![Figure 69](https://raw.githubusercontent.com/berserkhmdvhb/Training-Python/main/figures/Part_I/69.PNG)
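
When neither a suitable locale nor `pyuca` is available, a rough approximation is to sort on a key with the diacritics removed. This is not the Unicode Collation Algorithm, only a simpler technique based on NFD normalization, and `accent_insensitive_key` is a name I made up for this sketch.

```python
import unicodedata

def accent_insensitive_key(text):
    """Return (text without combining marks, original text) for use as a sort key."""
    decomposed = unicodedata.normalize('NFD', text)
    stripped = ''.join(c for c in decomposed if not unicodedata.combining(c))
    # The original text breaks ties between words that differ only in accents.
    return stripped.casefold(), text

fruits = ['caju', 'atemoia', 'cajá', 'açaí', 'acerola']
print(sorted(fruits, key=accent_insensitive_key))
# ['açaí', 'acerola', 'atemoia', 'cajá', 'caju']
```

This gives the expected order for the `fruits` list, but it ignores language-specific rules, so it should not be treated as a general replacement for proper collation.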

## The Unicode Database

The Unicode standard provides an entire database, in the form of several structured text files, that includes not only the table mapping code points to character names, but also metadata about the individual characters and how they are related. For example, the Unicode database records whether a character is printable, is a letter, is a decimal digit, or is some other numeric symbol. That’s how the `str` methods `isalpha`, `isprintable`, `isdecimal`, and `isnumeric` work. `str.casefold` also uses information from a Unicode table.

![Figure 70](https://raw.githubusercontent.com/berserkhmdvhb/Training-Python/main/figures/Part_I/70.PNG)

[Link: Unicode_character_property#General_Category](https://en.wikipedia.org/wiki/Unicode_character_property#General_Category)
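
A quick way to look at this metadata is `unicodedata.category`, which returns the two-letter General Category code. The sample characters below are my own picks for illustration.

```python
import unicodedata

samples = ['A', 'a', '1', '½', '٣', ' ', '€']
for ch in samples:
    print(f"U+{ord(ch):04X}  {unicodedata.category(ch)}  "
          f"isalpha={ch.isalpha()!s:5}  isdecimal={ch.isdecimal()!s:5}  "
          f"isnumeric={ch.isnumeric()!s:5}  {unicodedata.name(ch)}")
```

For example, `'½'` has category `No` (other number), so `isnumeric()` is true for it while `isdecimal()` is false, whereas the Arabic-Indic digit `'٣'` has category `Nd` and passes both.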

### Finding Characters by Name

The `unicodedata` module has functions to retrieve character metadata, including `unicodedata.name()`, which returns a character’s official name in the standard. You can use the `name()` function to build apps that let users search for characters by name.

In [12]:
from unicodedata import name
print(name('A'))
print(name('ã'))
print(name('😸'))
print(name('♛'))

LATIN CAPITAL LETTER A
LATIN SMALL LETTER A WITH TILDE
GRINNING CAT FACE WITH SMILING EYES
BLACK CHESS QUEEN


In [26]:
import sys
import os

# Add the folder containing 'cf.py' to sys.path
module_path = os.path.abspath(os.path.join("..", "charfinder"))
if module_path not in sys.path:
    sys.path.append(module_path)

#import cf

In [24]:
!python ./charfinder/cf.py cat smiling

U+1F638	😸	GRINNING CAT FACE WITH SMILING EYES
U+1F63A	😺	SMILING CAT FACE WITH OPEN MOUTH
U+1F63B	😻	SMILING CAT FACE WITH HEART-SHAPED EYES


In [51]:
!python ./charfinder/cf.py arrow

U+02FF	˿	MODIFIER LETTER LOW LEFT ARROW
U+034D	͍	COMBINING LEFT RIGHT ARROW BELOW
U+034E	͎	COMBINING UPWARDS ARROW BELOW
U+0362	͢	COMBINING DOUBLE RIGHTWARDS ARROW BELOW
U+1AB3	᪳	COMBINING DOWNWARDS ARROW
U+20D4	⃔	COMBINING ANTICLOCKWISE ARROW ABOVE
U+20D5	⃕	COMBINING CLOCKWISE ARROW ABOVE
U+20D6	⃖	COMBINING LEFT ARROW ABOVE
U+20D7	⃗	COMBINING RIGHT ARROW ABOVE
U+20E1	⃡	COMBINING LEFT RIGHT ARROW ABOVE
U+20EA	⃪	COMBINING LEFTWARDS ARROW OVERLAY
U+20EE	⃮	COMBINING LEFT ARROW BELOW
U+20EF	⃯	COMBINING RIGHT ARROW BELOW
U+2190	←	LEFTWARDS ARROW
U+2191	↑	UPWARDS ARROW
U+2192	→	RIGHTWARDS ARROW
U+2193	↓	DOWNWARDS ARROW
U+2194	↔	LEFT RIGHT ARROW
U+2195	↕	UP DOWN ARROW
U+2196	↖	NORTH WEST ARROW
U+2197	↗	NORTH EAST ARROW
U+2198	↘	SOUTH EAST ARROW
U+2199	↙	SOUTH WEST ARROW
U+219A	↚	LEFTWARDS ARROW WITH STROKE
U+219B	↛	RIGHTWARDS ARROW WITH STROKE
U+219C	↜	LEFTWARDS WAVE ARROW
U+219D	↝	RIGHTWARDS WAVE ARROW
U+219E	↞	LEFTWARDS TWO HEADED ARROW
U+219F	↟	UPWARDS TWO HEADED ARROW
U+21A0	↠	RIGHTWARDS 

### My Version of char_finder

I added several features and followed best practices to extend the original character finder script, `cf.py`.

**Setup**

In [112]:
import os
import subprocess
import sys

# Step 1: Change into the project root unless we are already there
if os.path.basename(os.getcwd()) != "charfinder_hmd":
    os.chdir(os.path.join(os.getcwd(), "charfinder", "charfinder_hmd"))

# Step 2: Confirm script path
CLI_SCRIPT = os.path.abspath("cf_cli.py")
assert os.path.exists(CLI_SCRIPT), f"cf_cli.py not found at {CLI_SCRIPT}"


In [135]:
def run_example(title, args, show_exit_code=True):
    print(f"\n=== {title} ===")

    result = subprocess.run(
        [sys.executable, CLI_SCRIPT] + args,
        capture_output=True,
        text=True,
        encoding="utf-8"
    )

    out = result.stdout.strip()
    err = result.stderr.strip()

    if out:
        print("📤 STDOUT:\n" + out)
    else:
        print("📤 STDOUT: (no output)")

    if err:
        print("⚠️ STDERR:\n" + err)

    if show_exit_code:
        print(f"🔚 EXIT CODE: {result.returncode}")

**Unit Tests**

In [126]:
import pytest
pytest.main(["tests", "-v", "--maxfail=1", "--disable-warnings"])

platform win32 -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- c:\Users\HamedVAHEB\Documents\Training\Python\FluentPython\repo\Training-Python\env_train\Scripts\python.exe
cachedir: .pytest_cache
rootdir: c:\Users\HamedVAHEB\Documents\Training\Python\FluentPython\repo\Training-Python\src\Part_I\Chapter_04_UnicodeTextsVSBytes\charfinder\charfinder_hmd
collecting ... collected 25 items

tests/test_cli.py::test_cli_strict_match PASSED                          [  4%]
tests/test_cli.py::test_cli_fuzzy_match PASSED                           [  8%]
tests/test_cli.py::test_cli_threshold_loose PASSED                       [ 12%]
tests/test_cli.py::test_cli_threshold_strict PASSED                      [ 16%]
tests/test_cli.py::test_cli_invalid_threshold PASSED                     [ 20%]
tests/test_cli.py::test_cli_empty_query PASSED                           [ 24%]
tests/test_cli.p

<ExitCode.OK: 0>

#### Examples

**Example 0:** Comparison with original

In [137]:
run_example("Basic match: arrow", ["-q", "arrow"])


=== Basic match: arrow ===
📤 STDOUT:
U+02C2      ˂     MODIFIER LETTER LEFT ARROWHEAD                    (\u02c2)
U+02C3      ˃     MODIFIER LETTER RIGHT ARROWHEAD                   (\u02c3)
U+02C4      ˄     MODIFIER LETTER UP ARROWHEAD                      (\u02c4)
U+02C5      ˅     MODIFIER LETTER DOWN ARROWHEAD                    (\u02c5)
U+02EF      ˯     MODIFIER LETTER LOW DOWN ARROWHEAD                (\u02ef)
U+02F0      ˰     MODIFIER LETTER LOW UP ARROWHEAD                  (\u02f0)
U+02F1      ˱     MODIFIER LETTER LOW LEFT ARROWHEAD                (\u02f1)
U+02F2      ˲     MODIFIER LETTER LOW RIGHT ARROWHEAD               (\u02f2)
U+02FF      ˿     MODIFIER LETTER LOW LEFT ARROW                    (\u02ff)
U+034D      ͍     COMBINING LEFT RIGHT ARROW BELOW                  (\u034d)
U+034E      ͎     COMBINING UPWARDS ARROW BELOW                     (\u034e)
U+0350      ͐     COMBINING RIGHT ARROWHEAD ABOVE                   (\u0350)
U+0354      ͔     COMBINING LEFT ARROW

**Example 1:** Basic strict match — "heart"

In [138]:
run_example("Example 1: Strict Match - 'heart'", ["-q", "heart"])


=== Example 1: Strict Match - 'heart' ===
📤 STDOUT:
U+2619      ☙     REVERSED ROTATED FLORAL HEART BULLET              (\u2619)
U+2661      ♡     WHITE HEART SUIT                                  (\u2661)
U+2665      ♥     BLACK HEART SUIT                                  (\u2665)
U+2763      ❣     HEAVY HEART EXCLAMATION MARK ORNAMENT             (\u2763)
U+2764      ❤     HEAVY BLACK HEART                                 (\u2764)
U+2765      ❥     ROTATED HEAVY BLACK HEART BULLET                  (\u2765)
U+2766      ❦     FLORAL HEART                                      (\u2766)
U+2767      ❧     ROTATED FLORAL HEART BULLET                       (\u2767)
U+2E96      ⺖     CJK RADICAL HEART ONE                             (\u2e96)
U+2E97      ⺗     CJK RADICAL HEART TWO                             (\u2e97)
U+2F3C      ⼼     KANGXI RADICAL HEART                              (\u2f3c)
U+1F0B1     🂱     PLAYING CARD ACE OF HEARTS                        (\u1f0b1)
U+1F0B2     🂲     PLAY

**Example 2:** Fuzzy Match with Typo: "grnning" (intended: "grinning")

In [139]:
run_example("Example 2: Fuzzy Match with Typo - 'grnning' (intended: 'grinning')", ["-q", "grnning", "--fuzzy"])


=== Example 2: Fuzzy Match with Typo - 'grnning' (intended: 'grinning') ===
📤 STDOUT:
U+1F48D     💍     RING                                              (\u1f48d)
U+1F600     😀     GRINNING FACE                                     (\u1f600)
🔚 EXIT CODE: 0


**Example 3:** Unicode with Diacritics: "acute"

In [140]:
run_example("Example 3: Diacritics - 'acute'", ["-q", "acute"])


=== Example 3: Diacritics - 'acute' ===
📤 STDOUT:
U+00B4      ´     ACUTE ACCENT                                      (\u00b4)
U+00C1      Á     LATIN CAPITAL LETTER A WITH ACUTE                 (\u00c1)
U+00C9      É     LATIN CAPITAL LETTER E WITH ACUTE                 (\u00c9)
U+00CD      Í     LATIN CAPITAL LETTER I WITH ACUTE                 (\u00cd)
U+00D3      Ó     LATIN CAPITAL LETTER O WITH ACUTE                 (\u00d3)
U+00DA      Ú     LATIN CAPITAL LETTER U WITH ACUTE                 (\u00da)
U+00DD      Ý     LATIN CAPITAL LETTER Y WITH ACUTE                 (\u00dd)
U+00E1      á     LATIN SMALL LETTER A WITH ACUTE                   (\u00e1)
U+00E9      é     LATIN SMALL LETTER E WITH ACUTE                   (\u00e9)
U+00ED      í     LATIN SMALL LETTER I WITH ACUTE                   (\u00ed)
U+00F3      ó     LATIN SMALL LETTER O WITH ACUTE                   (\u00f3)
U+00FA      ú     LATIN SMALL LETTER U WITH ACUTE                   (\u00fa)
U+00FD      ý     LATIN S

**Example 4:** Partial Word Match: "snow"

In [141]:
run_example("Example 4: Partial Word - 'snow'", ["-q", "snow"])


=== Example 4: Partial Word - 'snow' ===
📤 STDOUT:
U+2603      ☃     SNOWMAN                                           (\u2603)
U+26C4      ⛄     SNOWMAN WITHOUT SNOW                              (\u26c4)
U+26C7      ⛇     BLACK SNOWMAN                                     (\u26c7)
U+2744      ❄     SNOWFLAKE                                         (\u2744)
U+2745      ❅     TIGHT TRIFOLIATE SNOWFLAKE                        (\u2745)
U+2746      ❆     HEAVY CHEVRON SNOWFLAKE                           (\u2746)
U+1F328     🌨     CLOUD WITH SNOW                                   (\u1f328)
U+1F3C2     🏂     SNOWBOARDER                                       (\u1f3c2)
U+1F3D4     🏔     SNOW CAPPED MOUNTAIN                              (\u1f3d4)
🔚 EXIT CODE: 0


**Example 5:** Tweaking Fuzzy Matching Threshold

In [142]:
# Moderate Threshold (Default)
run_example("Example 5a: Fuzzy Match (threshold 0.7)", ["-q", "grnning", "--fuzzy", "--threshold", "0.7"])


=== Example 5a: Fuzzy Match (threshold 0.7) ===
📤 STDOUT:
U+1F48D     💍     RING                                              (\u1f48d)
U+1F600     😀     GRINNING FACE                                     (\u1f600)
🔚 EXIT CODE: 0


In [134]:
# Loose Threshold
run_example("Example 5b: Fuzzy Match (threshold 0.6)", ["-q", "grnning", "--fuzzy", "--threshold", "0.6"])


=== Example 5b: Fuzzy Match (threshold 0.6) ===
📤 STDOUT:
U+2607      ☇     LIGHTNING  (\u2607)
U+1F48D     💍     RING  (\u1f48d)
U+1F600     😀     GRINNING FACE  (\u1f600)
🔚 EXIT CODE: 0


In [143]:
#  Strict Threshold
run_example("Example 5c: Fuzzy Match (threshold 0.71)", ["-q", "grnning", "--fuzzy", "--threshold", "0.71"])


=== Example 5c: Fuzzy Match (threshold 0.71) ===
📤 STDOUT:
U+1F48D     💍     RING                                              (\u1f48d)
🔚 EXIT CODE: 0
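
The three runs above show how a small change in the threshold adds or drops matches. The source of `cf_cli.py` is not shown here, so the sketch below only assumes a similarity ratio computed with `difflib.SequenceMatcher` from the standard library; `similarity` and `fuzzy_filter` are hypothetical helper names, and the three candidate names are taken from the outputs above.

```python
from difflib import SequenceMatcher

def similarity(query, name):
    """Return a similarity ratio between 0 and 1, ignoring case."""
    return SequenceMatcher(None, query.casefold(), name.casefold()).ratio()

def fuzzy_filter(query, names, threshold=0.7):
    """Keep the names whose similarity to the query reaches the threshold."""
    return [name for name in names if similarity(query, name) >= threshold]

names = ['LIGHTNING', 'RING', 'GRINNING FACE']
for threshold in (0.6, 0.7, 0.71):
    print(threshold, fuzzy_filter('grnning', names, threshold))
# 0.6 ['LIGHTNING', 'RING', 'GRINNING FACE']
# 0.7 ['RING', 'GRINNING FACE']
# 0.71 ['RING']
```

With this ratio, "GRINNING FACE" scores exactly 0.7 against "grnning" because the whole name is compared, not just its closest word, which would explain why it disappears at 0.71 while the shorter "RING" (about 0.73) survives.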


### Numeric Meaning of Characters