# Introduction

Python lists have a built-in `.sort()` method that modifies the list in place. The built-in `sorted()` function builds a new sorted list from any iterable.

In [24]:
a = ["A", "E", "Z", "a", "e", "é", "z"]
print(sorted(a))
print(a)
a.sort()
print(a)

['A', 'E', 'Z', 'a', 'e', 'z', 'é']
['A', 'E', 'Z', 'a', 'e', 'é', 'z']
['A', 'E', 'Z', 'a', 'e', 'z', 'é']


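By default, `.sort()` and `sorted()` order strings by Unicode code point, which rarely matches the alphabetical order expected for a given language: in the output above, `é` sorts after `z`.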
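
The code point ordering can be checked directly with the built-in `ord()` function. The following sketch is illustrative only: it prints the code point of each string in the example list, which shows why `é` (U+00E9) lands after `z` (U+007A).

```python
# Inspect the code points that drive the default sort order.
a = ["A", "E", "Z", "a", "e", "é", "z"]

for ch in sorted(a):
    # ord() returns the Unicode code point of a single character.
    print(f"{ch!r}: U+{ord(ch):04X}")

# 'é' sorts last because its code point (U+00E9) is numerically
# greater than that of 'z' (U+007A).
assert ord("z") < ord("\u00e9")
```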


# Using the locale module for collation

For information on using the `locale` module in Google Colab, refer to the code snippets in [locale_module_colab.ipynb](https://github.com/enabling-languages/python-i18n/blob/main/colab/locale_module_colab.ipynb).

In [25]:
import locale
locale.setlocale(locale.LC_ALL, '')

'en_AU.UTF-8'

## Available locales

It is possible to use the `locale` module and the key parameter of `list.sort()` and `sorted()` to obtain a tailored sort using a supported locale.

_It is important to note that use of the `locale` module will not produce platform-independent code. Each server or workstation may have a different set of locales available; this is most noticeable on Linux and Unix-based servers and workstations._

To list the locale names known to the `locale` module on Windows:

```py
for lang in locale.windows_locale.values():
    print(lang)
```

To list the locale names known to the `locale` module on other operating systems:

```py
for lang in locale.locale_alias.values():
    print(lang)
```

To use a locale, it must also be installed on the operating system. To see which locales are available on Linux and macOS workstations and servers:

```shell
locale -a
```

_A Python script will use the system default encoding, unless the locale is explicitly changed. It is important to note that changing the locale can affect the execution of Python code in multiple ways. A script should set the locale once and not change it. If more than one language and set of collation rules is needed, alternative solutions should be used._


## Identifying the current locale

Use `locale.getlocale()` to identify the current locale. `locale.getdefaultlocale()`, used below for comparison, has been deprecated since Python 3.11:

In [26]:
loc = locale.getlocale()
default_loc = locale.getdefaultlocale()

print("Current locale: " + str(loc))
print("Default locale: " + str(default_loc))

Current locale: ('en_AU', 'UTF-8')
Default locale: ('en_AU', 'UTF-8')
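
`locale.getlocale()` reports the `LC_CTYPE` category unless told otherwise, whereas collation is governed by `LC_COLLATE`. As a small sketch (the helper name `describe_locale_categories` is made up for this illustration), the categories can be inspected individually:

```python
import locale

def describe_locale_categories():
    """Return the current setting of several locale categories."""
    names = ["LC_CTYPE", "LC_COLLATE", "LC_NUMERIC", "LC_TIME", "LC_MONETARY"]
    # getlocale() accepts a category constant and returns a
    # (language code, encoding) tuple for that category.
    return {name: locale.getlocale(getattr(locale, name)) for name in names}

for name, value in describe_locale_categories().items():
    print(f"{name}: {value}")
```

The values printed depend entirely on the system and on any earlier `setlocale()` calls, so no output is shown here.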



## Setting a locale

Use `locale.setlocale()` to set the locale. To set only the collation category (`LC_COLLATE`), leaving the other locale categories unchanged, use:

```py
locale.setlocale(locale.LC_COLLATE, 'de_DE.UTF-8')
```

You can set the current locale to your default settings:

```py
locale.setlocale(locale.LC_ALL, '')
```

Changing locales is not necessarily thread safe and is not recommended. Ideally, the locale is set once at the beginning of the script or program:

```py
import locale
locale.setlocale(locale.LC_ALL, '')
```
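
Because a requested locale may not be installed on a given machine, `locale.setlocale()` can raise `locale.Error`. The following is a minimal sketch of one way to handle this, assuming it is acceptable to try several candidate names in turn; the helper `set_collation_locale` and the candidate names are illustrative, not part of the `locale` module.

```python
import locale

def set_collation_locale(candidates):
    """Try each locale name in turn; return the one that was set, or None."""
    for name in candidates:
        try:
            # setlocale() returns the new locale string on success.
            return locale.setlocale(locale.LC_COLLATE, name)
        except locale.Error:
            # Raised when the operating system does not provide this locale.
            continue
    return None

# The candidate names are examples only; availability varies by system.
chosen = set_collation_locale(["de_DE.UTF-8", "de_DE", "German_Germany.1252"])
print(chosen or "No German locale available; collation locale left unchanged")
```

In keeping with the advice above, such a helper would be called once, at start-up, rather than repeatedly during execution.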


## Locale specific sorting

For locale aware sorting, use [locale.strxfrm()](https://docs.python.org/3/library/locale.html#locale.strxfrm) for a key function or [locale.strcoll()](https://docs.python.org/3/library/locale.html#locale.strcoll) for a comparison function.
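
`sorted()` and `list.sort()` only accept a key function, so `locale.strcoll()` has to be wrapped with `functools.cmp_to_key()` before it can be used for sorting. A minimal sketch, which relies on whatever `LC_COLLATE` locale is currently active and therefore shows no fixed output:

```python
import functools
import locale

lastnames = ["Bange", "Änger", "Amman", "Zelch", "Ösbach"]

# cmp_to_key() adapts a two-argument comparison function such as
# locale.strcoll() into a key function that sorted() can use.
by_strcoll = sorted(lastnames, key=functools.cmp_to_key(locale.strcoll))
print(by_strcoll)
```

For sorting, `locale.strxfrm()` is usually the more convenient of the two, since it can be passed as the key directly.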


In [27]:
locale.setlocale(locale.LC_COLLATE, "fr_FR.UTF-8")
a = ["A", "E", "Z", "a", "e", "é", "z"]
print(sorted(a))
print(sorted(a, key=locale.strxfrm))

['A', 'E', 'Z', 'a', 'e', 'z', 'é']
['A', 'E', 'Z', 'a', 'e', 'é', 'z']


In [28]:
corpus = ["Art", "Älg", "Ved", "Wasa"]
locale.setlocale(locale.LC_COLLATE, "sv_SE.UTF-8")

print(corpus)
corpus.sort(key=locale.strxfrm)
print(corpus)

['Art', 'Älg', 'Ved', 'Wasa']
['Art', 'Ved', 'Wasa', 'Älg']


In [29]:
locale.setlocale(locale.LC_COLLATE, "de_DE.UTF-8")
lastnames = ["Bange", "Änger", "Amman", "Änger", "Zelch", "Ösbach"]
print(sorted(lastnames))
print(sorted(lastnames, key=locale.strxfrm)) 

['Amman', 'Bange', 'Zelch', 'Änger', 'Änger', 'Ösbach']
['Amman', 'Änger', 'Änger', 'Bange', 'Ösbach', 'Zelch']


In [30]:
# Reset
locale.setlocale(locale.LC_ALL, '')

'en_AU.UTF-8'

# PyUCA

[pyuca](https://github.com/jtauber/pyuca) implements the _Default Unicode Collation Element Table_ (DUCET). These are the collation rules of the [Unicode collation algorithm](https://en.wikipedia.org/wiki/Unicode_collation_algorithm) that are language and locale independent.

Unfortunately `pyuca` only supports up to Unicode 10.
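
As a brief sketch of typical usage, assuming `pyuca` has been installed (for example with `pip install pyuca`): a `pyuca.Collator` object provides a `sort_key()` method that can be passed as the `key` parameter. The import is guarded here so that the snippet does nothing where the package is missing.

```python
try:
    import pyuca  # third-party package
except ImportError:
    pyuca = None  # pyuca is optional for this sketch

a = ["A", "E", "Z", "a", "e", "é", "z"]

if pyuca is not None:
    # Collator() loads the DUCET data bundled with pyuca.
    uca_collator = pyuca.Collator()
    # sort_key() returns a sort key for a string.
    uca_sorted = sorted(a, key=uca_collator.sort_key)
    print(uca_sorted)
```

Unlike code point ordering, DUCET places `é` next to `e` rather than after `z`.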

# Using PyICU for collation

PyICU is a Python extension wrapping ICU4C (the [ICU](http://site.icu-project.org/) C/C++ libraries).

To use PyICU, ICU must be installed. A version of ICU may already be available on some systems. Prebuilt packages are available for a number of operating systems; alternatively, ICU4C can be built from source code. Please use the approach best suited to your operating system. Prepackaged versions may contain older versions of ICU4C.

The following command, run in a Jupyter Notebook or Google Colab, will provide information on the version of ICU4C installed:

```py
!icuinfo
```

In [31]:
# Test if running on Google Colab, if yes install required 
# software and Python packages.

try:
  import google.colab
  IN_COLAB = True
except ImportError:
  IN_COLAB = False
if IN_COLAB:
  !pip install -q pyicu

In [32]:
# Import only the names used in this notebook rather than using a wildcard import
from icu import Locale, Collator, RuleBasedCollator, ICU_VERSION, VERSION, UCollAttribute, UCollAttributeValue

print("ICU version: " + ICU_VERSION)
print("PyICU version: " + VERSION)

ICU version: 69.1
PyICU version: 2.7.4


Various software and libraries that implement collation do so in a number of different ways. Python itself uses a simple codepoint or byte sequence order for collation. The PyUCA package implements the [Default Unicode Collation Element Table](https://unicode.org/reports/tr10/#Default_Unicode_Collation_Element_Table) (DUCET) of the [Unicode Collation Algorithm](https://unicode.org/reports/tr10/).

ICU collation is based on the CLDR Collation Algorithm, which is based on DUCET. ICU provides access to both the root collation rules and to language-specific tailorings where required. The ICU4C API provides mechanisms for creating collator instances using DUCET, the CLDR Collation Algorithm, or the European ordering rules (EOR / EN 13710).

To use DUCET for the collator instance:

In [33]:
ducet_loc = Locale.createCanonical("und-u-co-ducet")
ducet_collator = Collator.createInstance(ducet_loc)

DUCET is useful when you wish your code to be compatible with software and libraries that use DUCET for collation. The CLDR Collation Algorithm should be the preferred way of handling collation when a single language-neutral collator needs to be used.

To create a collator instance using the root CLDR Collation Algorithm:

In [34]:
root_collator = Collator.createInstance(Locale.getRoot())

## `icu.Collator` vs `icu.RuleBasedCollator`

PyICU exposes two ways of handling collation in ICU. The usual approach is to use `icu.Collator` with a supported locale. The second approach is to write your own custom collation rules and use them with `icu.RuleBasedCollator`.

The second approach is useful for languages with no locale support, or in situations where you wish to modify the rules for a supported language.

To get a list of supported locales for collation use `Collator.getAvailableLocales()`:

In [35]:
#for key in Collator.getAvailableLocales().keys():
#  print(key)

" ".join(list(Collator.getAvailableLocales().keys()))

'af am ar ar_SA as az be bg bn bo br bs bs_Cyrl ca ceb chr cs cy da de de_AT dsb dz ee el en en_US en_US_POSIX eo es et fa fa_AF ff ff_Adlm fi fil fo fr fr_CA ga gl gu ha haw he he_IL hi hr hsb hu hy id id_ID ig is it ja ka kk kl km kn ko kok ku ky lb lkt ln lo lt lv mk ml mn mr ms mt my nb nb_NO ne nl nn no om or pa pa_Guru pa_Guru_IN pl ps pt ro ru sa se si sk sl smn sq sr sr_Cyrl sr_Cyrl_BA sr_Cyrl_ME sr_Cyrl_RS sr_Latn sr_Latn_BA sr_Latn_RS sv sw ta te th tk to tr ug uk ur uz vi wae wo xh yi yo zh zh_Hans zh_Hans_CN zh_Hans_SG zh_Hant zh_Hant_HK zh_Hant_MO zh_Hant_TW zu'

The available locales will differ between versions of ICU4C being used. With each new version of ICU, additional locales are made available. ICU locales are based on the [Common Locale Data Repository](http://cldr.unicode.org/) (CLDR) datasets.
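
Since the set of locales depends on the installed ICU version, it can be useful to check for a locale before relying on it. The following sketch assumes PyICU may or may not be installed; the helper name `has_collation_data` is made up for this illustration.

```python
try:
    from icu import Collator
except ImportError:
    Collator = None  # PyICU is not installed

def has_collation_data(locale_id):
    """True/False if ICU lists (or does not list) locale_id; None without PyICU."""
    if Collator is None:
        return None
    # getAvailableLocales() is keyed by locale identifier, as shown above.
    return locale_id in Collator.getAvailableLocales().keys()

for locale_id in ("cy", "din"):
    print(locale_id, has_collation_data(locale_id))
```

A locale that is not listed can still be passed to `Collator.createInstance()`; ICU generally falls back to a parent locale or the root collation in that case, so a check like this is only a hint that custom rules may be needed.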

It is possible to tailor existing collation rules by:

1. Use the BCP47 Unicode and CLDR extensions to specify options to apply to the collation rules when the Collator instance is created.
2. Change the collation attributes of a Collator instance using the `.setAttribute()` method.
3. Retrieve the rules for an existing locale, prepend options to the rules, then use `icu.RuleBasedCollator` to create a collator instance.

## Using `icu.Collator`

1. Define a collator instance using `icu.Collator`
2. Pass the collator's `getSortKey` method as the `key` parameter of `list.sort()` or `sorted()`.

A Turkish example:

In [36]:
tr_list = ["İstanbul", "Ankara", "İzmir", "Bursa", "Antalya", "Adana", "Şanlıurfa", "Konya", "Ceyhan", "Gaziantep", "Kayseri", "Samsun", "Çanakkale", "Iğdır"]
#loc = 'tr_TR.UTF-8'
loc = Locale("tr")
tr_collator = Collator.createInstance(loc)
print(sorted(tr_list, key=tr_collator.getSortKey))
#tr_list.sort(key=tr_collator.getSortKey)

['Adana', 'Ankara', 'Antalya', 'Bursa', 'Ceyhan', 'Çanakkale', 'Gaziantep', 'Iğdır', 'İstanbul', 'İzmir', 'Kayseri', 'Konya', 'Samsun', 'Şanlıurfa']


A Thai collation example:

In [37]:
thai_words = ["หมู", "เห็ด", "เป็ด", "ไก่", "ช้าง", "ม้า", "วัว", "ควาย"]
th_loc = Locale.createCanonical("th-TH") 
th_collator = Collator.createInstance(th_loc)
print(sorted(thai_words, key=th_collator.getSortKey))
# ['ไก่', 'ควาย', 'ช้าง', 'เป็ด', 'ม้า', 'วัว', 'หมู', 'เห็ด']

['ไก่', 'ควาย', 'ช้าง', 'เป็ด', 'ม้า', 'วัว', 'หมู', 'เห็ด']


### Customising `icu.Collator` using BCP47 extensions

It is possible to use extensions to BCP47 to tailor collation. Some locales have more than one set of collation rules available. For instance German has a default linguistic collation. There is also a phonebook collation, used for names, which orders letters differently.

Given the following list of names:

In [38]:
de_names = ['Hafermann, Ulrich', 'Hacker, Simon', 'Hackmann, Gustav', 'Häcker, Emil', 'Haecker, Manfred', 'Häcker, Xaver', 'Assemann, Simon', 'Aßmann, Erika', 'Astmann, Manfred', 'Assmann, Frank']

The default German collation rules sort these as:

In [39]:
de_loc = Locale("de")
de_collator = Collator.createInstance(de_loc)
print(sorted(de_names, key=de_collator.getSortKey))

['Assemann, Simon', 'Aßmann, Erika', 'Assmann, Frank', 'Astmann, Manfred', 'Häcker, Emil', 'Hacker, Simon', 'Häcker, Xaver', 'Hackmann, Gustav', 'Haecker, Manfred', 'Hafermann, Ulrich']


Alternatively, this could be written as:

In [None]:
de_collator = Collator.createInstance(Locale("de"))
print(sorted(de_names, key=de_collator.getSortKey))

The German Phonebook collation rules have the following changes to the German collation rules:

Ä/ä = Ae/ae \
Ö/ö = Oe/oe \
Ü/ü = Ue/ue \
ß = ss

This will sort the list of names as follows:

In [40]:
de_phbk_loc = Locale.createCanonical("de-u-co-phonebk") 
de_phbk_collator = Collator.createInstance(de_phbk_loc)
print(sorted(de_names, key=de_phbk_collator.getSortKey))

['Assemann, Simon', 'Aßmann, Erika', 'Assmann, Frank', 'Astmann, Manfred', 'Hacker, Simon', 'Hackmann, Gustav', 'Häcker, Emil', 'Haecker, Manfred', 'Häcker, Xaver', 'Hafermann, Ulrich']


The basic structure of a BCP47 tag is a _language subtag_, followed by _script_ and _region_ subtags as required. This is then followed by a _u_ subtag indicating that the following subtags are Unicode extension key and value pairs, e.g.

```
en-u-ks-level2
de-u-co-phonebk-ks-level2
fr-u-ks-level3-ka-shifted
```

When the value of a key-value pair is `true`, the value is omitted from the tag and only the key is included.
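
To make the structure concrete, the following sketch assembles such tags from a language subtag and a set of key-value pairs. The helper `build_collation_tag` is hypothetical, written only for this illustration; it performs no validation of keys or values.

```python
def build_collation_tag(language, **settings):
    """Assemble a BCP47 tag with a Unicode (-u-) extension.

    Hypothetical helper for illustration; keys and values are not validated.
    """
    subtags = [language]
    if settings:
        # The "u" singleton introduces the Unicode extension.
        subtags.append("u")
    for key, value in settings.items():
        subtags.append(key)
        # A value of "true" is left out of the tag; only the key is kept.
        if value is not True and value != "true":
            subtags.append(str(value))
    return "-".join(subtags)

print(build_collation_tag("de", co="phonebk", ks="level2"))  # de-u-co-phonebk-ks-level2
print(build_collation_tag("en", ks="level1", kc=True))       # en-u-ks-level1-kc
```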

Some commonly used collation settings:

* __Ignore accents and case__: `-ks-level1`
* __Ignore accents, but take case into account__: `-ks-level1-kc`
* __Ignore case, but take accents into account__: `-ks-level2`
* __Ignore punctuation (completely)__: `-ks-level3-ka-shifted`
* __Ignore punctuation, but distinguish among punctuation marks__: `-ks-level4-ka-shifted`
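
As a sketch of one of these settings in use, the snippet below compares two strings with a default English collator and with one created from the tag `en-u-ks-level2`. It assumes PyICU is installed (the import is guarded) and an ICU version that honours collation keywords in locale identifiers.

```python
try:
    from icu import Collator, Locale
except ImportError:
    Collator = Locale = None  # PyICU is not installed

if Collator is not None:
    # Default strength (tertiary): case differences are significant.
    default_collator = Collator.createInstance(Locale.createCanonical("en"))
    # -ks-level2 compares at secondary strength, so case is ignored.
    nocase_collator = Collator.createInstance(Locale.createCanonical("en-u-ks-level2"))

    # compare() returns a negative number, zero or a positive number.
    print(default_collator.compare("abc", "ABC"))
    print(nocase_collator.compare("abc", "ABC"))
```

The second comparison is expected to report the two strings as equal, because only primary and secondary differences are considered.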


### BCP 47 subtags that can be used to tailor collation

Refer to [cldr/common/bcp47/collation.xml](https://github.com/unicode-org/cldr/blob/master/common/bcp47/collation.xml).

|BCP47 subtags  |Old syntax           |Description   |Locales  |
|-------------  |-------------------- |------------- |-------- |
|-co-big5han |collation=big5han  |Pinyin ordering for Latin, big5 charset ordering for CJK characters (used in Chinese)  |zh, zh_Hans, zh_Hans_CN, zh_Hans_SG, zh_Hant, zh_Hant_HK, zh_Hant_MO, zh_Hant_TW |
|-co-compat |collation=compat  |A previous version of the ordering, for compatibility  |ar, ar_SA |
|-co-dict |collation=dictionary  |Dictionary style ordering (such as in Sinhala)  |si |
|-co-direct |collation=direct  |DEPRECATED. Binary code point order (used in Hindi)  | |
|-co-ducet |collation=ducet  |The default Unicode collation element table order  | |
|-co-emoji |collation=emoji  |Recommended ordering for emoji characters  |af, am, ar, ar_SA, as, az, be, bg, bn, bo, br, bs, bs_Cyrl, ca, ceb, chr, cs, cy, da, de, de_AT, dsb, dz, ee, el, en, en_US, en_US_POSIX, eo, es, et, fa, fa_AF, ff, ff_Adlm, fi, fil, fo, fr, fr_CA, ga, gl, gu, ha, haw, he, he_IL, hi, hr, hsb, hu, hy, id, id_ID, ig, is, it, ja, ka, kk, kl, km, kn, ko, kok, ku, ky, lb, lkt, ln, lo, lt, lv, mk, ml, mn, mr, ms, mt, my, nb, nb_NO, ne, nl, nn, no, om, or, pa, pa_Guru, pa_Guru_IN, pl, ps, pt, ro, ru, sa, se, si, sk, sl, smn, sq, sr, sr_Cyrl, sr_Cyrl_BA, sr_Cyrl_ME, sr_Cyrl_RS, sr_Latn, sr_Latn_BA, sr_Latn_RS, sv, sw, ta, te, th, tk, to, tr, ug, uk, ur, uz, vi, wae, wo, xh, yi, yo, zh, zh_Hans, zh_Hans_CN, zh_Hans_SG, zh_Hant, zh_Hant_HK, zh_Hant_MO, zh_Hant_TW, zu |
|-co-eor |collation=eor  |European ordering rules  |af, am, ar, ar_SA, as, az, be, bg, bn, bo, br, bs, bs_Cyrl, ca, ceb, chr, cs, cy, da, de, de_AT, dsb, dz, ee, el, en, en_US, en_US_POSIX, eo, es, et, fa, fa_AF, ff, ff_Adlm, fi, fil, fo, fr, fr_CA, ga, gl, gu, ha, haw, he, he_IL, hi, hr, hsb, hu, hy, id, id_ID, ig, is, it, ja, ka, kk, kl, km, kn, ko, kok, ku, ky, lb, lkt, ln, lo, lt, lv, mk, ml, mn, mr, ms, mt, my, nb, nb_NO, ne, nl, nn, no, om, or, pa, pa_Guru, pa_Guru_IN, pl, ps, pt, ro, ru, sa, se, si, sk, sl, smn, sq, sr, sr_Cyrl, sr_Cyrl_BA, sr_Cyrl_ME, sr_Cyrl_RS, sr_Latn, sr_Latn_BA, sr_Latn_RS, sv, sw, ta, te, th, tk, to, tr, ug, uk, ur, uz, vi, wae, wo, xh, yi, yo, zh, zh_Hans, zh_Hans_CN, zh_Hans_SG, zh_Hant, zh_Hant_HK, zh_Hant_MO, zh_Hant_TW, zu |
|-co-gb2312 |collation=gb2312han  |Pinyin ordering for Latin, gb2312han charset ordering for CJK characters (used in Chinese)  |zh, zh_Hans, zh_Hans_CN, zh_Hans_SG, zh_Hant, zh_Hant_HK, zh_Hant_MO, zh_Hant_TW |
|-co-phonebk |collation=phonebook  |Phonebook style ordering (such as in German)  |de, de_AT |
|-co-phonetic |collation=phonetic  |Phonetic ordering (sorting based on pronunciation)  |ln |
|-co-pinyin |collation=pinyin  |Pinyin ordering for Latin and for CJK characters (used in Chinese)  |zh, zh_Hans, zh_Hans_CN, zh_Hans_SG, zh_Hant, zh_Hant_HK, zh_Hant_MO, zh_Hant_TW |
|-co-reformed |collation=reformed  |Reformed ordering (such as in Swedish)  |sv |
|-co-search |collation=search  |Special collation type for string search  |af, am, ar, ar_SA, as, az, be, bg, bn, bo, br, bs, bs_Cyrl, ca, ceb, chr, cs, cy, da, de, de_AT, dsb, dz, ee, el, en, en_US, en_US_POSIX, eo, es, et, fa, fa_AF, ff, ff_Adlm, fi, fil, fo, fr, fr_CA, ga, gl, gu, ha, haw, he, he_IL, hi, hr, hsb, hu, hy, id, id_ID, ig, is, it, ja, ka, kk, kl, km, kn, ko, kok, ku, ky, lb, lkt, ln, lo, lt, lv, mk, ml, mn, mr, ms, mt, my, nb, nb_NO, ne, nl, nn, no, om, or, pa, pa_Guru, pa_Guru_IN, pl, ps, pt, ro, ru, sa, se, si, sk, sl, smn, sq, sr, sr_Cyrl, sr_Cyrl_BA, sr_Cyrl_ME, sr_Cyrl_RS, sr_Latn, sr_Latn_BA, sr_Latn_RS, sv, sw, ta, te, th, tk, to, tr, ug, uk, ur, uz, vi, wae, wo, xh, yi, yo, zh, zh_Hans, zh_Hans_CN, zh_Hans_SG, zh_Hant, zh_Hant_HK, zh_Hant_MO, zh_Hant_TW, zu |
|-co-searchjl |collation=searchjl  |Special collation type for Korean initial consonant search  |ko |
|-co-standard |collation=standard  |Default ordering for each language  |af, am, ar, ar_SA, as, az, be, bg, bn, bo, br, bs, bs_Cyrl, ca, ceb, chr, cs, cy, da, de, de_AT, dsb, dz, ee, el, en, en_US, en_US_POSIX, eo, es, et, fa, fa_AF, ff, ff_Adlm, fi, fil, fo, fr, fr_CA, ga, gl, gu, ha, haw, he, he_IL, hi, hr, hsb, hu, hy, id, id_ID, ig, is, it, ja, ka, kk, kl, km, kn, ko, kok, ku, ky, lb, lkt, ln, lo, lt, lv, mk, ml, mn, mr, ms, mt, my, nb, nb_NO, ne, nl, nn, no, om, or, pa, pa_Guru, pa_Guru_IN, pl, ps, pt, ro, ru, sa, se, si, sk, sl, smn, sq, sr, sr_Cyrl, sr_Cyrl_BA, sr_Cyrl_ME, sr_Cyrl_RS, sr_Latn, sr_Latn_BA, sr_Latn_RS, sv, sw, ta, te, th, tk, to, tr, ug, uk, ur, uz, vi, wae, wo, xh, yi, yo, zh, zh_Hans, zh_Hans_CN, zh_Hans_SG, zh_Hant, zh_Hant_HK, zh_Hant_MO, zh_Hant_TW, zu |
|-co-stroke |collation=stroke  |Pinyin ordering for Latin, stroke order for CJK characters (used in Chinese)  |zh, zh_Hans, zh_Hans_CN, zh_Hans_SG, zh_Hant, zh_Hant_HK, zh_Hant_MO, zh_Hant_TW |
|-co-trad |collation=traditional  |Traditional style ordering (such as in Spanish)  |bn, es, fi, kn, vi |
|-co-unihan |collation=unihan  |Pinyin ordering for Latin, Unihan radical-stroke ordering for CJK characters (used in Chinese)  |ja, ko, zh, zh_Hans, zh_Hans_CN, zh_Hans_SG, zh_Hant, zh_Hant_HK, zh_Hant_MO, zh_Hant_TW |
|-co-zhuyin |collation=zhuyin  |Pinyin ordering for Latin, zhuyin order for Bopomofo and CJK characters (used in Chinese)  |zh, zh_Hans, zh_Hans_CN, zh_Hans_SG, zh_Hant, zh_Hant_HK, zh_Hant_MO, zh_Hant_TW |

#### Collation parameter key for alternate handling

|BCP47 subtags  |Old syntax            |Description   |
|-------------  |-------------------- |------------- |
|-ka-noignore |colAlternate=non-ignorable  |Variable collation elements are not reset to ignorable  |
|-ka-shifted |colAlternate=shifted  |Variable collation elements are reset to zero at levels one through three  |

#### Collation parameter key for backward collation weight

|BCP47 subtags  |Old syntax            |Description   |
|-------------  |-------------------- |------------- |
|-kb-true |colBackwards=yes  |The second level to be backwards  |
|-kb-false |colBackwards=no  |No backwards (the second level to be forwards)  |

#### Collation parameter key for case level

|BCP47 subtags  |Old syntax            |Description   |
|-------------  |-------------------- |------------- |
|-kc-true |colCaseLevel=yes  |The case level is inserted in front of tertiary  |
|-kc-false |colCaseLevel=no  |No special case level handling  |

#### Collation parameter key for ordering by case

|BCP47 subtags  |Old syntax            |Description   |
|-------------  |-------------------- |------------- |
|-kf-upper |colCaseFirst=upper  |Upper case to be sorted before lower case  |
|-kf-lower |colCaseFirst=lower  |Lower case to be sorted before upper case  |
|-kf-false |colCaseFirst=false  |No special case ordering  |

#### <span style="color:red;">DEPRECATED:</span> Collation parameter key for special Hiragana handling

|BCP47 subtags  |Old syntax            |Description   |
|-------------  |-------------------- |------------- |
|-kh-true |colHiraganaQuaternary=yes  |Hiragana to be sorted before all non-variable on quaternary level  |
|-kh-false |colHiraganaQuaternary=no  |No special handling for Hiragana  |

#### Collation parameter key for normalization

|BCP47 subtags  |Old syntax            |Description   |
|-------------  |-------------------- |------------- |
|-kk-true |colNormalization=yes  |Convert text into Normalization Form D before calculating collation weights  |
|-kk-false |colNormalization=no  |Skip normalization  |

#### Collation parameter key for numeric handling

|BCP47 subtags  |Old syntax            |Description   |
|-------------  |-------------------- |------------- |
|-kn-true |colNumeric=yes  |A sequence of decimal digits is sorted at primary level with its numeric value  |
|-kn-false |colNumeric=no |No special handling for numeric ordering  |

#### Collation reorder codes

|BCP47 subtags  |Old syntax            |Description   |
|-------------  |-------------------- |------------- |
|-kr-space |colReorder=space  |Whitespace reordering code, see LDML Part 5: Collation  |
|-kr-punct |colReorder=punct  |Punctuation reordering code, see LDML Part 5: Collation  |
|-kr-symbol |colReorder=symbol  |Symbol reordering code (other than currency), see LDML Part 5: Collation  |
|-kr-currency |colReorder=currency  |Currency reordering code, see LDML Part 5: Collation  |
|-kr-digit |colReorder=digit  |Digit (number) reordering code, see LDML Part 5: Collation  |
|-kr-REORDER_CODE |colReorder=REORDER_CODE  |Other collation reorder code — for script, see LDML Part 5: Collation  |

#### Collation parameter key for collation strength

|BCP47 subtags  |Old syntax            |Description   |
|-------------  |-------------------- |------------- |
|-ks-level1 |colStrength=primary  |The primary level  |
|-ks-level2 |colStrength=secondary  |The secondary level  |
|-ks-level3 |colStrength=tertiary  |The tertiary level  |
|-ks-level4 |colStrength=quaternary <br>colStrength=quarternary  |The quaternary level  |
|-ks-identic |colStrength=identical  |The identical level  |

#### Collation parameter key for maxVariable, the last reordering group to be affected by ka-shifted

|BCP47 subtags  |Old syntax            |Description   |
|-------------  |-------------------- |------------- |
|-kv-space |  |Only spaces are affected by ka-shifted  |
|-kv-punct |  |Spaces and punctuation are affected by ka-shifted (CLDR default)  |
|-kv-symbol |  |Spaces, punctuation and symbols except for currency symbols are affected by ka-shifted (UCA default)  |
|-kv-currency |  |Spaces, punctuation and all symbols are affected by ka-shifted  |

#### <span style="color:red;">DEPRECATED:</span> Collation parameter key for variable top

|BCP47 subtags  |Old syntax            |Description   |
|-------------  |-------------------- |------------- |
|-vt-CODEPOINTS |variableTop=CODEPOINTS  |The variable top (one or more Unicode code points: LDML Appendix Q)  |


### Customising `icu.Collator` using collator attributes

Attributes can be used to modify various aspects of the collation rules being used.

1. Define a collator instance using `icu.Collator`
2. Update the attributes of the collator instance, using `.setAttribute()`
3. Pass the collator's `getSortKey` method as the `key` parameter of `list.sort()` or `sorted()`.

|Attribute  |Values  |Notes  |
|---------- |------- |------ |
|UCollAttribute.ALTERNATE_HANDLING |UCollAttributeValue.NON_IGNORABLE, UCollAttributeValue.SHIFTED |For handling variable elements. __NON_IGNORABLE__ (default) treats all the codepoints with non-ignorable primary weights in the same way. __SHIFTED__ causes codepoints with primary weights equal to or below the variable top value to be ignored at the primary level and moved to the quaternary level.  |
|UCollAttribute.CASE_FIRST |UCollAttributeValue.OFF, UCollAttributeValue.LOWER_FIRST, UCollAttributeValue.UPPER_FIRST | Ordering of upper and lower case letters. __OFF__ (default), which orders upper and lower case letters in accordance to their tertiary weights. __UPPER_FIRST__ which forces upper case letters to sort before lower case letters, and __LOWER_FIRST__ which forces lower case letters to sort before upper case letters. |
|UCollAttribute.CASE_LEVEL |UCollAttributeValue.ON, UCollAttributeValue.OFF |__ON__: case sensitive (a case level is generated). __OFF__: case insensitive (no case level is generated). |
|UCollAttribute.DECOMPOSITION_MODE |UCollAttributeValue.ON, UCollAttributeValue.OFF  |Alias for UCollAttribute.NORMALIZATION_MODE attribute.  |
|UCollAttribute.FRENCH_COLLATION |UCollAttributeValue.ON, UCollAttributeValue.OFF |Direction of secondary weights (Canadian French).  __ON__: secondary weights considered backwards. __OFF__: secondary weights considered in the order they appear.   |
|UCollAttribute.HIRAGANA_QUATERNARY_MODE |UCollAttributeValue.ON, UCollAttributeValue.OFF |When turned on, this attribute positions Hiragana before all non-ignorables on quaternary level. Deprecated. |
|UCollAttribute.NORMALIZATION_MODE |UCollAttributeValue.ON, UCollAttributeValue.OFF |Controls whether the normalization check and necessary normalizations are performed. __OFF__: (default) no normalization check is performed. __ON__: check whether the data is in FCD form; if not, NFD normalization is performed.   |
|UCollAttribute.NUMERIC_COLLATION |UCollAttributeValue.ON, UCollAttributeValue.OFF  |When turned on, this attribute makes substrings of digits sort according to their numeric values.  |
|UCollAttribute.STRENGTH |UCollAttributeValue.PRIMARY, UCollAttributeValue.SECONDARY, UCollAttributeValue.TERTIARY, UCollAttributeValue.QUATERNARY, UCollAttributeValue.IDENTICAL |The strength attribute. The usual strength for most locales (except Japanese) is tertiary.  |

See [UColAttribute](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/ucol_8h.html#a583fbe7fc4a850e2fcc692e766d2826c) in [ICU4C API](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/) reference.


In the following code snippet, `.setAttribute()` is used to modify the collation, allowing digits to be sorted as numbers, rather than being sorted as characters.

In [44]:
en_list = ['3 oranges', '1 apple', '10 strawberries', '2 pears']
print(sorted(en_list))
en_collator = Collator.createInstance(Locale('en_AU'))
en_collator.setAttribute(UCollAttribute.NUMERIC_COLLATION, UCollAttributeValue.ON)


print(sorted(en_list, key=en_collator.getSortKey))

['1 apple', '10 strawberries', '2 pears', '3 oranges']
['1 apple', '2 pears', '3 oranges', '10 strawberries']


To enable case-insensitive collation:

`coll.setAttribute(UCollAttribute.STRENGTH, UCollAttributeValue.SECONDARY)` or `coll.setStrength(UCollAttributeValue.SECONDARY)`

To enable case and diacritic insensitive collation:

`coll.setAttribute(UCollAttribute.STRENGTH, UCollAttributeValue.PRIMARY)` or `coll.setStrength(UCollAttributeValue.PRIMARY)`


In [45]:
def str_compare(col, s1, s2):
    # Call compare() once: it returns a negative number, zero, or a positive number.
    result = col.compare(s1, s2)
    if result == 0:
        print(f"{s1} equals {s2}")
    elif result < 0:
        print(f"{s1} is less than {s2}")
    else:
        print(f"{s1} is greater than {s2}")

str_compare(en_collator, "abc", "ABC")

# Case insensitive collation
en_collator.setAttribute(UCollAttribute.STRENGTH, UCollAttributeValue.SECONDARY)
str_compare(en_collator, "abc", "ABC")


abc is less than ABC
abc equals ABC


## Using `icu.RuleBasedCollator`

1. Define the [collation rules](https://unicode-org.github.io/icu/userguide/collation/customization/)
2. Define a collator instance using `icu.RuleBasedCollator`
3. Pass the collator's `getSortKey` method as the `key` parameter of `list.sort()` or `sorted()`.

The following code snippet sorts a list of Dinka lexemes:

In [46]:
din_rules = "[normalization on]&a<<<A<<aa<<<Aa<<<AA<<ä<<<Ä<<ää<<<Ää<<<ÄÄ&d<<<D<dh<<<Dh<<<DH<e<<<E<<ee<<<Ee<<<EE<<ë<<<Ë<<ëë<<<Ëë<<<ËË<ɛ<<<Ɛ<<ɛɛ<<<Ɛɛ<<<ƐƐ<<ɛ̈<<<Ɛ̈<<ɛ̈ɛ̈<<<Ɛ̈ɛ̈<<<Ɛ̈Ɛ̈&g<<<G<ɣ<<<Ɣ&i<<<I<<ii<<<II<<ï<<<Ï<<ïï<<<Ïï<<<ÏÏ&n<<<N<nh<<<Nh<<<NH<ny<<<Ny<<<NY<ŋ<<<Ŋ<o<<<O<<oo<<<Oo<<<OO<<ö<<<Ö<<öö<<<Öö<<<ÖÖ<ɔ<<<Ɔ<<ɔɔ<<<Ɔɔ<<<ƆƆ<<ɔ̈<<<Ɔ̈<<ɔ̈ɔ̈<<<Ɔ̈ɔ̈<<<Ɔ̈Ɔ̈<t<<<T<th<<<Th<<<TH<u<<<U<<uu<<<UU"
din_list = ['agook', 'abeeric', 'abuɔk', 'agol', 'acuut', 'abany', 'abur', 'abëër', 'abaany', 'aber', 'akɔ̈r', 'acut', 'acuth', 'abenh', 'abeer', 'akuny', 'ago', 'akɔrcok', 'akuŋɛŋ', 'abuɔ̈k', 'aberŋic', 'abuɔɔk', 'abeŋ', 'abuɔ̈c', 'abaŋ']
din_collator = RuleBasedCollator(din_rules)
print(sorted(din_list, key=din_collator.getSortKey))

['abany', 'abaany', 'abaŋ', 'abenh', 'abeŋ', 'aber', 'abeer', 'abëër', 'abeeric', 'aberŋic', 'abuɔ̈c', 'abuɔk', 'abuɔɔk', 'abuɔ̈k', 'abur', 'acut', 'acuut', 'acuth', 'ago', 'agook', 'agol', 'akɔ̈r', 'akɔrcok', 'akuny', 'akuŋɛŋ']


### Modifying existing rules

It is possible to make changes to collation rules to tailor the collation sequence. In Welsh, __*Ch*__ is a separate letter of the alphabet that comes after __*C*__. The default collation rules for Welsh sort majuscules after minuscules:

In [47]:
cy_list = ["chwaeth", "Cyflym", "Clust", "cyflym", "Chwaeth", "chwerw", "Chwerw", "clust"]
cy_collator = Collator.createInstance(Locale("cy"))
print(sorted(cy_list, key=cy_collator.getSortKey))

['clust', 'Clust', 'cyflym', 'Cyflym', 'chwaeth', 'Chwaeth', 'chwerw', 'Chwerw']


Use `getRules()` to retrieve the collation rules from a collator instance. You can then modify the existing rules. The following snippet retrieves the Welsh collation rules and prepends [an option](https://unicode-org.github.io/icu/userguide/collation/customization/#default-options) that changes the collation order of majuscules and minuscules. These rules will sort majuscules before minuscules:

In [48]:
cy_rules = cy_collator.getRules()   
#cy_rules = Collator.createInstance(Locale('cy')).getRules()
cy_rules = '[caseFirst upper]' + cy_rules
cy_collator_alt = RuleBasedCollator(cy_rules)
print(sorted(cy_list, key=cy_collator_alt.getSortKey))


['Clust', 'clust', 'Cyflym', 'cyflym', 'Chwaeth', 'chwaeth', 'Chwerw', 'chwerw']


# Moving beyond lists

Sorting lists using PyICU is straightforward: both the `.sort()` method and the `sorted()` function accept the `key` parameter. The `key` parameter is used to specify a function to be called on each list item before items are compared. The `icu.Collator.getSortKey` method returns a sort key for a string as an array of bytes.
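
To illustrate how `key` works without requiring PyICU, the sketch below uses a stand-in key function (`demo_sort_key`, a made-up name) that returns bytes, as `getSortKey` does. It is not a real collation key; the point is only that `sorted()` computes one key per item and then compares the keys rather than the original strings.

```python
calls = 0

def demo_sort_key(text):
    """Stand-in for Collator.getSortKey: returns bytes and counts invocations."""
    global calls
    calls += 1
    return text.casefold().encode("utf-8")

words = ["pear", "Apple", "banana", "apple"]
result = sorted(words, key=demo_sort_key)

print(result)   # ['Apple', 'apple', 'banana', 'pear']
print(calls)    # 4: one key computation per list item
```

Items whose keys are equal (`Apple` and `apple` here) keep their original relative order, because Python's sort is stable.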

To sort a list of Akan words:

In [49]:
ak_rules = "&E<ɛ<<<Ɛ&O<ɔ<<<Ɔ"
ak_collator = RuleBasedCollator(ak_rules)
data_lst = ["yoma", "ɔwɔ", "koterɛ", "ananse", "apɛsɛ", "nantwibaa", "nkaboa", "wowa", "susono", "pɔnkɔ"]
sorted_data_lst = sorted(data_lst, key=ak_collator.getSortKey)
sorted_data_lst

['ananse',
 'apɛsɛ',
 'koterɛ',
 'nantwibaa',
 'nkaboa',
 'ɔwɔ',
 'pɔnkɔ',
 'susono',
 'wowa',
 'yoma']

## Tuples

Tuples can also be sorted using the `sorted()` function with a `key` parameter calling the `icu.Collator.getSortKey` method. It is important to remember that `sorted()` returns a list. If you require a tuple, it will be necessary to convert the list to a tuple.


In [50]:
data_tup = ("yoma", "ɔwɔ", "koterɛ", "ananse", "apɛsɛ", "nantwibaa", "nkaboa", "wowa", "susono", "pɔnkɔ")
sorted_data_tup = tuple(sorted(data_tup, key=ak_collator.getSortKey))
sorted_data_tup

('ananse',
 'apɛsɛ',
 'koterɛ',
 'nantwibaa',
 'nkaboa',
 'ɔwɔ',
 'pɔnkɔ',
 'susono',
 'wowa',
 'yoma')

## Lambda function: dictionaries and objects

For more complex data structures it is necessary to use a lambda function with the key parameter.

In the following example we will sort a dictionary that maps the Estonian exonym of each language to the associated BCP47 language subtag. We first declare the dictionary and create a Collator instance.


In [51]:
et_langs = {
    "šveitsisaksa": "gsw",
    "ukraina": "uk",
    "vietnami": "vi",
    "zarma": "dje",
    "rootsi": "sv",
    "ruanda": "rw",
    "volofi": "wo",
    "uusnorra": "nn",
    "tšehhi": "cs",
    "ülemsorbi": "hsb",
    "türgi": "tr",
    "suulu": "zu"
}

# Create a collator instance for Estonian
et_loc = Locale("et")
et_collator = Collator.createInstance(et_loc)


To sort the dictionary, we create a lambda function that utilises `icu.Collator.getSortKey` and specifies what we are sorting by.


In [52]:
# Sort by Estonian exonym name
et_sorted_lang = sorted(et_langs.items(), key=lambda x: et_collator.getSortKey(x[0]))
print(et_sorted_lang)

# Sort by language code
et_sorted_code = sorted(et_langs.items(), key=lambda x: et_collator.getSortKey(x[1]))
print(et_sorted_code)

[('rootsi', 'sv'), ('ruanda', 'rw'), ('suulu', 'zu'), ('šveitsisaksa', 'gsw'), ('zarma', 'dje'), ('tšehhi', 'cs'), ('türgi', 'tr'), ('ukraina', 'uk'), ('uusnorra', 'nn'), ('vietnami', 'vi'), ('volofi', 'wo'), ('ülemsorbi', 'hsb')]
[('tšehhi', 'cs'), ('zarma', 'dje'), ('šveitsisaksa', 'gsw'), ('ülemsorbi', 'hsb'), ('uusnorra', 'nn'), ('ruanda', 'rw'), ('rootsi', 'sv'), ('suulu', 'zu'), ('türgi', 'tr'), ('ukraina', 'uk'), ('vietnami', 'vi'), ('volofi', 'wo')]
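
Note that `sorted()` applied to `dict.items()` returns a list of `(key, value)` tuples rather than a dictionary. Since dictionaries preserve insertion order (Python 3.7 and later), the sorted pairs can be passed to `dict()` when a dictionary is needed. The sketch below uses `str.casefold` as a stand-in key so that it runs without PyICU; it therefore does not apply Estonian collation.

```python
langs = {"ukraina": "uk", "rootsi": "sv", "zarma": "dje", "ruanda": "rw"}

# Stand-in key; with PyICU this would be a collator's getSortKey method.
sort_key = str.casefold

# Sort the (name, code) pairs by name, then rebuild a dictionary.
by_name = dict(sorted(langs.items(), key=lambda item: sort_key(item[0])))
print(by_name)
# {'rootsi': 'sv', 'ruanda': 'rw', 'ukraina': 'uk', 'zarma': 'dje'}
```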


When we sorted by the language subtag, we used a collator instance created for the Estonian language. This is not necessarily ideal when sorting codes or identifiers: in the output above, `zu` sorts before `tr` because Estonian orders _z_ before _t_. A better way of sorting by language subtag would be to use the root collation tables:


In [53]:
# Initiate a collator instance using the Root collation table:
root_collator = Collator.createInstance(Locale.getRoot())

# Sort the language subtags using the Root collator instance:
sorted_subtag = sorted(et_langs.items(), key=lambda x: root_collator.getSortKey(x[1]))
print(sorted_subtag)


[('tšehhi', 'cs'), ('zarma', 'dje'), ('šveitsisaksa', 'gsw'), ('ülemsorbi', 'hsb'), ('uusnorra', 'nn'), ('ruanda', 'rw'), ('rootsi', 'sv'), ('türgi', 'tr'), ('ukraina', 'uk'), ('vietnami', 'vi'), ('volofi', 'wo'), ('suulu', 'zu')]


Sorting a list of objects of a user-defined class also utilises a lambda function.


In [54]:
# Define a simple class that stores a province name and the population of the province
class Province:
    def __init__(self, name, population):
        self.name = name
        self.population = population
    def __repr__(self):
        return repr((self.name, self.population))

# Define a list of Province objects for a set of Turkish provinces
tr_province = [
    Province("Iğdır", 199442),
    Province("Samsun", 1348542),
    Province("Isparta", 444914),
    Province("Şanlıurfa", 2073614),
    Province("Bursa", 3056120),
    Province("İstanbul", 15519267),
    Province("Ankara", 5639076)
]

# Create a collator instance for Turkish
tr_collator = Collator.createInstance(Locale("tr"))

# Sort by province name:
tr_province_sorted = sorted(tr_province, key=lambda p: tr_collator.getSortKey(p.name))
print(tr_province_sorted)


[('Ankara', 5639076), ('Bursa', 3056120), ('Iğdır', 199442), ('Isparta', 444914), ('İstanbul', 15519267), ('Samsun', 1348542), ('Şanlıurfa', 2073614)]
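
When objects need to be ordered by more than one attribute, the lambda can return a tuple of sort keys. The following is a minimal sketch with a hypothetical `Person` class and `str.casefold` standing in for a collator's `getSortKey`, so that it runs without PyICU; the resulting order is code point based, not German collation.

```python
from dataclasses import dataclass

@dataclass
class Person:
    family: str
    given: str

# Stand-in key; with PyICU, a collator's getSortKey method would go here.
sort_key = str.casefold

people = [
    Person("Häcker", "Xaver"),
    Person("Hacker", "Simon"),
    Person("Häcker", "Emil"),
]

# A tuple key sorts by family name first, then by given name.
ordered = sorted(people, key=lambda p: (sort_key(p.family), sort_key(p.given)))
print([(p.family, p.given) for p in ordered])
# [('Hacker', 'Simon'), ('Häcker', 'Emil'), ('Häcker', 'Xaver')]
```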


## Pandas Dataframes and Series


Copyright © 2021 [Enabling Languages](https://enabling-languages.github.io/). <br>This notebook is made available under the [MIT licence](https://github.com/enabling-languages/python-i18n/blob/main/LICENSE).