# Chapter 4. Unicode Text Versus Bytes
---

## ToC

1. [Sorting Unicode Text](#sorting-unicode-text)  
    1.1. [Sorting with the Unicode Collation Algorithm](#sorting-with-the-unicode-collation-algorithm)  
2. [The Unicode Database](#the-unicode-database)  
    2.1. [Finding Characters by Name](#finding-characters-by-name)  
    2.2. [My Version of char_finder](#my-version-of-char_finder-pypi-package)  
    2.3. [Numeric Meaning of Characters](#numeric-meaning-of-characters)
---

## Sorting Unicode Text

Python sorts sequences of any type by comparing the items in each sequence one by
one. For strings, this means comparing the code points. Unfortunately, this produces
unacceptable results for anyone who uses non-ASCII characters.

In [1]:
fruits = ['caju', 'atemoia', 'cajá', 'açaí', 'acerola']
sorted(fruits)

['acerola', 'atemoia', 'açaí', 'caju', 'cajá']
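
The ordering above follows directly from the code points. As a small sketch (the helper name `code_points` is made up for this illustration), the snippet below prints the numbers that `sorted` effectively compares. `'ç'` is U+00E7 (231), which is greater than any ASCII letter, so “açaí” lands after “atemoia”.

```python
# Accented letters are written as escapes so the precomposed forms are unambiguous.
fruits = ['caju', 'atemoia', 'caj\u00e1', 'a\u00e7a\u00ed', 'acerola']

def code_points(word):
    """Return the list of code points that string comparison works through."""
    return [ord(char) for char in word]

for word in sorted(fruits):
    print(f'{word:8}', code_points(word))

# Sorting by the explicit code point lists gives the same order as plain sorted().
by_code_points = sorted(fruits, key=code_points)
```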

Sorting rules vary for different locales, but in Portuguese and many languages that use the Latin alphabet, accents and cedillas rarely make a difference when sorting. So “cajá” is sorted as “caja,” and must come before “caju.”

The sorted `fruits` list should be:

```python
['açaí', 'acerola', 'atemoia', 'cajá', 'caju']
```

The standard way to sort non-ASCII text in Python is to use the `locale.strxfrm`
function which, according to the [`locale` module docs](https://docs.python.org/3/library/locale.html#locale.strxfrm), “transforms a string to one
that can be used in locale-aware comparisons.”
To enable `locale.strxfrm`, you must first set a suitable locale for your application,
and pray that the OS supports it.

In [2]:
import locale
my_locale = locale.setlocale(locale.LC_COLLATE, 'pt_BR.UTF-8')
print(my_locale)

pt_BR.UTF-8


In [3]:
fruits = ['caju', 'atemoia', 'cajá', 'açaí', 'acerola']
sorted_fruits = sorted(fruits, key=locale.strxfrm)
print(sorted_fruits)

['açaí', 'acerola', 'atemoia', 'cajá', 'caju']


You need to call `setlocale(LC_COLLATE, «your_locale»)` before using `locale.strxfrm` as the key when sorting.

The `locale`-based standard library solution to internationalized sorting works, but it creates deployment headaches: the locale must be installed on the OS, its name is spelled differently across platforms, and `setlocale` changes a process-wide setting. Fortunately, there is a simpler solution, presented in the next section.
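
One way to cope with a locale that may not be installed is to fall back to plain code point sorting. The following is only a sketch under that assumption: `locale_sorted` is a hypothetical helper, and `'pt_BR.UTF-8'` is the locale name used in the cells above, which may be spelled differently on other systems.

```python
import locale

def locale_sorted(words, locale_name='pt_BR.UTF-8'):
    """Sort with locale.strxfrm if locale_name is available, else by code point.

    Returns a (sorted_list, used_locale) pair so the caller can tell which path ran.
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # query only, changes nothing
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        # The locale is not installed, or has another name on this OS.
        return sorted(words), False
    try:
        return sorted(words, key=locale.strxfrm), True
    finally:
        # setlocale changes process-wide state, so restore the earlier setting.
        locale.setlocale(locale.LC_COLLATE, previous)

fruits = ['caju', 'atemoia', 'caj\u00e1', 'a\u00e7a\u00ed', 'acerola']
result, used_locale = locale_sorted(fruits)
print(used_locale, result)
```

Even when `used_locale` is `True`, the resulting order depends on how the OS implements the locale, so it is worth checking the output on each target platform.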

### Sorting with the Unicode Collation Algorithm

The `pyuca` library is a pure-Python implementation of the Unicode Collation Algorithm (UCA), so it does not depend on the locale settings of the OS.

In [4]:
import pyuca
coll = pyuca.Collator()
fruits = ['caju', 'atemoia', 'cajá', 'açaí', 'acerola']
sorted_fruits = sorted(fruits, key=coll.sort_key)
sorted_fruits

['açaí', 'acerola', 'atemoia', 'cajá', 'caju']

![Figure 69](https://raw.githubusercontent.com/berserkhmdvhb/Training-Python/main/figures/Part_I/69.PNG)
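
If installing `pyuca` is not an option, a rough stdlib-only approximation is to sort on the text with its combining marks removed. This is not the Unicode Collation Algorithm and not what `pyuca` does internally. It is a simplification that happens to work for this list, and the helper names `shave_marks` and `rough_sort_key` are assumptions made for the sketch.

```python
import unicodedata

def shave_marks(text):
    """Remove combining marks (accents, cedillas) after NFD decomposition."""
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

def rough_sort_key(text):
    # Compare the unaccented form first; the original string breaks ties.
    return shave_marks(text), text

fruits = ['caju', 'atemoia', 'caj\u00e1', 'a\u00e7a\u00ed', 'acerola']
approx_sorted = sorted(fruits, key=rough_sort_key)
print(approx_sorted)
```

For languages where accented letters sort as separate letters, this shortcut gives wrong results, which is the kind of case a real collation algorithm handles.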

## The Unicode Database

The Unicode standard provides an entire database (in the form of several structured text files) that includes not only the table mapping code points to character names, but also metadata about the individual characters and how they are related. For example, the Unicode database records whether a character is printable, is a letter, is a decimal digit, or is some other numeric symbol. That’s how the `str` methods `isalpha`, `isprintable`, `isdecimal`, and `isnumeric` work. `str.casefold` also uses information from a Unicode table.

![Figure 70](https://raw.githubusercontent.com/berserkhmdvhb/Training-Python/main/figures/Part_I/70.PNG)

See also: [Unicode character property: General Category (Wikipedia)](https://en.wikipedia.org/wiki/Unicode_character_property#General_Category)
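
The general category of a character can be read with `unicodedata.category`, and the `str` methods mentioned above can be checked next to it. The sample characters below are chosen only for illustration.

```python
import unicodedata

# A letter pair, a decimal digit, a fraction, a Roman numeral, a space, c-cedilla
samples = ['A', 'a', '1', '\u00bd', '\u216b', ' ', '\u00e7']

print('CODE', 'CHAR', 'CAT', 'alpha', 'decimal', 'numeric', 'printable', sep='\t')
for char in samples:
    print(f'U+{ord(char):04X}',
          repr(char),
          unicodedata.category(char),   # two-letter general category, e.g. 'Lu'
          char.isalpha(),
          char.isdecimal(),
          char.isnumeric(),
          char.isprintable(),
          sep='\t')
```

The two-letter codes are the abbreviations listed in the General Category table, for example `Lu` for an uppercase letter and `Nd` for a decimal digit.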

### Finding Characters by Name

The `unicodedata` module has functions to retrieve character metadata, including `unicodedata.name()`, which returns a character’s official name in the standard. You can use the `name()` function to build apps that let users search for characters by
name.

In [5]:
from unicodedata import name
print(name('A'))
print(name('ã'))
print(name('😸'))
print(name('♛'))

LATIN CAPITAL LETTER A
LATIN SMALL LETTER A WITH TILDE
GRINNING CAT FACE WITH SMILING EYES
BLACK CHESS QUEEN
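
A name search of the kind described above can be sketched in a few lines. This is a minimal illustration, not the code of any particular tool: `find_chars` is a hypothetical function that scans every code point and keeps the characters whose official name contains all the query words.

```python
import sys
import unicodedata

def find_chars(*query_words, start=32, stop=sys.maxunicode + 1):
    """Yield (code, char, name) for characters whose name has all query words."""
    query = {word.upper() for word in query_words}
    for code in range(start, stop):
        char = chr(code)
        # The second argument is returned for unnamed code points
        # instead of raising ValueError.
        name = unicodedata.name(char, None)
        if name and query.issubset(name.split()):
            yield code, char, name

for code, char, name in find_chars('cat', 'smiling'):
    print(f'U+{code:04X}\t{char}\t{name}')
```

The exact matches depend on the Unicode version bundled with the Python interpreter, available as `unicodedata.unidata_version`.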


In [6]:
import sys
import os

# Add the folder containing 'cf.py' to sys.path
module_path = os.path.abspath(os.path.join("..", "charfinder"))
if module_path not in sys.path:
    sys.path.append(module_path)

#import cf

In [7]:
!python ./charfinder/cf.py cat smiling

U+1F638	😸	GRINNING CAT FACE WITH SMILING EYES
U+1F63A	😺	SMILING CAT FACE WITH OPEN MOUTH
U+1F63B	😻	SMILING CAT FACE WITH HEART-SHAPED EYES


In [8]:
!python ./charfinder/cf.py arrow

U+02FF	˿	MODIFIER LETTER LOW LEFT ARROW
U+034D	͍	COMBINING LEFT RIGHT ARROW BELOW
U+034E	͎	COMBINING UPWARDS ARROW BELOW
U+0362	͢	COMBINING DOUBLE RIGHTWARDS ARROW BELOW
U+1AB3	᪳	COMBINING DOWNWARDS ARROW
U+20D4	⃔	COMBINING ANTICLOCKWISE ARROW ABOVE
U+20D5	⃕	COMBINING CLOCKWISE ARROW ABOVE
U+20D6	⃖	COMBINING LEFT ARROW ABOVE
U+20D7	⃗	COMBINING RIGHT ARROW ABOVE
U+20E1	⃡	COMBINING LEFT RIGHT ARROW ABOVE
U+20EA	⃪	COMBINING LEFTWARDS ARROW OVERLAY
U+20EE	⃮	COMBINING LEFT ARROW BELOW
U+20EF	⃯	COMBINING RIGHT ARROW BELOW
U+2190	←	LEFTWARDS ARROW
U+2191	↑	UPWARDS ARROW
U+2192	→	RIGHTWARDS ARROW
U+2193	↓	DOWNWARDS ARROW
U+2194	↔	LEFT RIGHT ARROW
U+2195	↕	UP DOWN ARROW
U+2196	↖	NORTH WEST ARROW
U+2197	↗	NORTH EAST ARROW
U+2198	↘	SOUTH EAST ARROW
U+2199	↙	SOUTH WEST ARROW
U+219A	↚	LEFTWARDS ARROW WITH STROKE
U+219B	↛	RIGHTWARDS ARROW WITH STROKE
U+219C	↜	LEFTWARDS WAVE ARROW
U+219D	↝	RIGHTWARDS WAVE ARROW
U+219E	↞	LEFTWARDS TWO HEADED ARROW
U+219F	↟	UPWARDS TWO HEADED ARROW
U+21A0	↠	RIGHTWARDS 

### My Version of char_finder: PyPI Package

I implemented a complete package and published it on PyPI. It extends the original `cf.py`, adding various features and best practices. You can find it on [PyPI: charfinder](https://pypi.org/project/charfinder/).

### Numeric Meaning of Characters

The `unicodedata` module includes functions to check whether a Unicode character represents a number and, if so, its numeric value for humans—as opposed to its code point number.

**Example:** Unicode database numerical character metadata

In [9]:
import unicodedata
import re

re_digit = re.compile(r'\d')
sample = '1\xbc\xb2\u0969\u136b\u216b\u2466\u2480\u3285'
divider = "-" * 60

print("CODE\tCHAR\t re?\t isdig\t isnum\tNUM  \tNAME")
print(divider)
for char in sample:
    print(f'U+{ord(char):04x}',
        char.center(6),
        're_dig' if re_digit.match(char) else '-',
        'isdig' if char.isdigit() else '-',
        'isnum' if char.isnumeric() else '-',
        f'{unicodedata.numeric(char):5.2f}',
        unicodedata.name(char),
        sep='\t')


CODE	CHAR	 re?	 isdig	 isnum	NUM  	NAME
------------------------------------------------------------
U+0031	  1   	re_dig	isdig	isnum	 1.00	DIGIT ONE
U+00bc	  ¼   	-	-	isnum	 0.25	VULGAR FRACTION ONE QUARTER
U+00b2	  ²   	-	isdig	isnum	 2.00	SUPERSCRIPT TWO
U+0969	  ३   	re_dig	isdig	isnum	 3.00	DEVANAGARI DIGIT THREE
U+136b	  ፫   	-	isdig	isnum	 3.00	ETHIOPIC DIGIT THREE
U+216b	  Ⅻ   	-	-	isnum	12.00	ROMAN NUMERAL TWELVE
U+2466	  ⑦   	-	isdig	isnum	 7.00	CIRCLED DIGIT SEVEN
U+2480	  ⒀   	-	-	isnum	13.00	PARENTHESIZED NUMBER THIRTEEN
U+3285	  ㊅   	-	-	isnum	 6.00	CIRCLED IDEOGRAPH SIX


The example shows that the regular expression `r'\d'` matches the digit “1” and the
Devanagari digit 3, but not some other characters that are considered digits by the
`isdigit` method. The `re` module is not as savvy about Unicode as it could be. The
`regex` module available on PyPI was designed to eventually replace `re` and provides
better Unicode support.
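
The difference can be isolated with three of the sample characters. For `str` patterns, the `re` documentation says `\d` matches any Unicode decimal digit (category `Nd`), so it agrees with `isdecimal` rather than `isdigit`. With the `re.ASCII` flag it matches only `[0-9]`. The dictionary labels below are the official character names from the table above.

```python
import re

samples = {
    'DIGIT ONE': '1',
    'DEVANAGARI DIGIT THREE': '\u0969',
    'SUPERSCRIPT TWO': '\u00b2',
}

re_digit = re.compile(r'\d')                  # Unicode-aware for str patterns
re_ascii_digit = re.compile(r'\d', re.ASCII)  # restricted to [0-9]

print('NAME'.ljust(24), 're', 're.ASCII', 'isdecimal', 'isdigit', sep='\t')
for label, char in samples.items():
    print(label.ljust(24),
          bool(re_digit.fullmatch(char)),
          bool(re_ascii_digit.fullmatch(char)),
          char.isdecimal(),
          char.isdigit(),
          sep='\t')
```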

[`unicodedata` module docs](https://docs.python.org/3/library/unicodedata.html)