# Introduction

Python lists have a built-in `.sort()` method that modifies the list in place. The built-in `sorted()` function builds a new sorted list from any iterable.

In [164]:
a = ["A", "E", "Z", "a", "e", "é", "z"]
print(sorted(a))
print(a)
a.sort()
print(a)

['A', 'E', 'Z', 'a', 'e', 'z', 'é']
['A', 'E', 'Z', 'a', 'e', 'é', 'z']
['A', 'E', 'Z', 'a', 'e', 'z', 'é']


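The built-in `.sort()` and `sorted()` order strings by Unicode code point, which rarely matches the order a reader of a given language expects. In the output above, `é` sorts after `z`.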
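To see why the accented letter ends up last, it can help to print the code point of each character. This is a small illustrative sketch using only built-ins; `é` is written as the escape `\u00e9` so that the precomposed character is used:

```python
a = ["A", "E", "Z", "a", "e", "\u00e9", "z"]

# sorted() compares strings code point by code point, so the numeric
# value of each character decides its position.
codepoint_sorted = sorted(a)
for ch in codepoint_sorted:
    print(ch, f"U+{ord(ch):04X}")
```

Upper case ASCII letters have the lowest values, lower case ASCII letters follow, and `é` (U+00E9) lies above both ranges.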

# Using the locale module for collation

For information on using the `locale` module in Google Colab, refer to the code snippets in [locale_module_colab.ipynb](https://github.com/enabling-languages/python-i18n/blob/main/colab/locale_module_colab.ipynb).

In [165]:
import locale
locale.setlocale(locale.LC_ALL, '')

## Available locales

It is possible to use the `locale` module and the key parameter of `list.sort()` and `sorted()` to obtain a tailored sort using a supported locale.

_Note that code using the `locale` module is not platform independent. Each server or workstation may have a different set of locales available; this is most noticeable on Linux and Unix based servers and workstations._

To list the locale names known to the `locale` module on Windows:

```py
for lang in locale.windows_locale.values():
    print(lang)
```

To list the locale names known to the `locale` module on other operating systems:

```py
for lang in locale.locale_alias.values():
    print(lang)
```

To use a locale, it must be installed on the operating system. To see which locales are available on Linux and macOS workstations and servers:

```shell
locale -a
```
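The same check can be made from Python by asking `locale.setlocale()` to accept a name and catching `locale.Error`. The following is a minimal sketch; `is_locale_available` is a hypothetical helper name, not part of the `locale` module, and it is intended for a start-up check rather than for repeated use:

```python
import locale

def is_locale_available(name, category=locale.LC_COLLATE):
    """Return True if setlocale() accepts `name` on this system (hypothetical helper)."""
    saved = locale.setlocale(category)       # query the current setting
    try:
        locale.setlocale(category, name)
    except locale.Error:
        return False
    finally:
        locale.setlocale(category, saved)    # always restore the previous setting
    return True

for name in ("fr_FR.UTF-8", "sv_SE.UTF-8", "de_DE.UTF-8"):
    print(name, is_locale_available(name))
```

The results depend entirely on which locales are installed on the machine running the code.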

_A Python script will use the system default encoding, unless the locale is explicitly changed. It is important to note that changing the locale can affect the execution of Python code in multiple ways. A script should set the locale once and not change it. If more than one language and set of collation rules is needed, alternative solutions should be used._


## Identifying the current locale

Use `locale.getlocale()` to identify the current locale:

In [166]:
loc = locale.getlocale()
# Note: locale.getdefaultlocale() is deprecated since Python 3.11
default_loc = locale.getdefaultlocale()

print("Current locale: " + str(loc))
print("Default locale: " + str(default_loc))

Current locale: ('en_AU', 'UTF-8')
Default locale: ('en_AU', 'UTF-8')
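When called without arguments, `locale.getlocale()` reports the `LC_CTYPE` category. Since sorting depends on `LC_COLLATE`, it can be useful to query that category explicitly. A short sketch (the variable names are illustrative):

```python
import locale

ctype_loc = locale.getlocale()                     # defaults to LC_CTYPE
collate_loc = locale.getlocale(locale.LC_COLLATE)  # category used by strxfrm()/strcoll()

print("LC_CTYPE:   " + str(ctype_loc))
print("LC_COLLATE: " + str(collate_loc))
```

The two values can differ, for example after a call that changes only `LC_COLLATE`.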



## Setting a locale

Use `locale.setlocale()` to set the locale. To change only the collation category (`LC_COLLATE`):

```py
locale.setlocale(locale.LC_COLLATE, 'de_DE.UTF-8')
```

You can set the current locale to your default settings:

```py
locale.setlocale(locale.LC_ALL, '')
```

Changing locales is not necessarily thread safe, and is not recommended. Ideally, the locale is set once at the beginning of the script or program:

```py
import locale
locale.setlocale(locale.LC_ALL, '')
```


## Locale specific sorting

For locale aware sorting, use [locale.strxfrm()](https://docs.python.org/3/library/locale.html#locale.strxfrm) for a key function or [locale.strcoll()](https://docs.python.org/3/library/locale.html#locale.strcoll) for a comparison function.


In [168]:
locale.setlocale(locale.LC_COLLATE, "fr_FR.UTF-8")
a = ["A", "E", "Z", "a", "e", "é", "z"]
print(sorted(a))
print(sorted(a, key=locale.strxfrm))

['A', 'E', 'Z', 'a', 'e', 'z', 'é']
['A', 'E', 'Z', 'a', 'e', 'é', 'z']
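`locale.strcoll()` is a comparison function rather than a key function, so it cannot be passed to `key` directly. A minimal sketch, assuming the standard library's `functools.cmp_to_key()` is used as the adapter. The `fr_FR.UTF-8` locale may not be installed everywhere, in which case the snippet keeps whatever collation locale is current:

```python
import locale
from functools import cmp_to_key

try:
    locale.setlocale(locale.LC_COLLATE, "fr_FR.UTF-8")
except locale.Error:
    pass  # locale not installed; the current collation locale is kept

a = ["A", "E", "Z", "a", "e", "\u00e9", "z"]

by_key = sorted(a, key=locale.strxfrm)              # key function
by_cmp = sorted(a, key=cmp_to_key(locale.strcoll))  # comparison function, adapted

print(by_key)
print(by_cmp)
```

Both forms should produce the same order. `strxfrm()` is usually the better choice for long lists, because each string is transformed once instead of being compared pairwise.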


In [169]:
corpus = ["Art", "Älg", "Ved", "Wasa"]
locale.setlocale(locale.LC_COLLATE, "sv_SE.UTF-8")

print(sorted(corpus))    # code point order
print(corpus)            # original order
corpus.sort(key=locale.strxfrm)
print(corpus)            # Swedish order: Ä sorts after Z

['Art', 'Ved', 'Wasa', 'Älg']
['Art', 'Älg', 'Ved', 'Wasa']
['Art', 'Ved', 'Wasa', 'Älg']


In [170]:
locale.setlocale(locale.LC_COLLATE, "de_DE.UTF-8")
lastnames = ["Bange", "Änger", "Amman", "Änger", "Zelch", "Ösbach"]
print(sorted(lastnames))
print(sorted(lastnames, key=locale.strxfrm)) 

['Amman', 'Bange', 'Zelch', 'Änger', 'Änger', 'Ösbach']
['Amman', 'Änger', 'Änger', 'Bange', 'Ösbach', 'Zelch']


In [188]:
# Reset
locale.setlocale(locale.LC_ALL, '')

'en_AU.UTF-8'
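Resetting with `''` returns to the user's default locale, which is not necessarily the locale that was active before the change. If a temporary switch is unavoidable, one option is to save and restore the previous setting. The sketch below uses a hypothetical `collation_locale` context manager. It is still not thread safe, and it uses the `C` locale only because that locale is always present:

```python
import locale
from contextlib import contextmanager

@contextmanager
def collation_locale(name):
    """Temporarily switch LC_COLLATE, then restore it (hypothetical helper)."""
    saved = locale.setlocale(locale.LC_COLLATE)    # query the current setting
    try:
        yield locale.setlocale(locale.LC_COLLATE, name)
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)

before = locale.setlocale(locale.LC_COLLATE)
with collation_locale("C"):
    c_sorted = sorted(["b", "A", "a"], key=locale.strxfrm)
    print(c_sorted)
after = locale.setlocale(locale.LC_COLLATE)
print(before == after)
```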

# PyUCA

[pyuca](https://github.com/jtauber/pyuca) implements the _Default Unicode Collation Element Table_ (DUCET). These are the collation rules from the [Unicode collation algorithm](https://en.wikipedia.org/wiki/Unicode_collation_algorithm) that are language and locale independent.

Unfortunately `pyuca` only supports up to Unicode 10.
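A minimal usage sketch, assuming `pyuca` has been installed (for example with `pip install pyuca`). The class is imported under the alias `UCACollator`, a name chosen here so that it cannot be confused with `icu.Collator` used later in this notebook:

```python
words = ["A", "E", "Z", "a", "e", "\u00e9", "z"]

try:
    # Alias chosen for this example; pyuca itself names the class Collator.
    from pyuca import Collator as UCACollator
except ImportError:
    UCACollator = None  # pyuca is not installed

if UCACollator is not None:
    uca = UCACollator()                            # loads the bundled DUCET table
    uca_sorted = sorted(words, key=uca.sort_key)   # sort_key() builds the UCA sort key
    print(uca_sorted)
else:
    uca_sorted = None
    print("pyuca is not installed")
```

Because DUCET is not tailored for any language, the result is a reasonable default order rather than the order a specific locale would require.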

# Using PyICU for collation

PyICU is a Python extension wrapping ICU4C (the [ICU](http://site.icu-project.org/) C/C++ libraries).

To use PyICU, ICU must be installed. A version of ICU may already be available on some systems. Prebuilt packages are available for a number of operating systems; alternatively, ICU4C can be built from source code. Please use the approach best suited to your operating system. Prepackaged versions may contain older versions of ICU4C.

In [None]:
# Test if running on Google Colab, if yes install required 
# software and Python packages.

try:
  import google.colab
  IN_COLAB = True
except ImportError:
  IN_COLAB = False
if IN_COLAB:
  !pip install -q pyicu

In [1]:
# Import only the names used in this notebook, rather than `from icu import *`
from icu import Locale, Collator, RuleBasedCollator, ICU_VERSION, VERSION, UCollAttribute, UCollAttributeValue

print("ICU version: " + ICU_VERSION)
print("PyICU version: " + VERSION)

ICU version: 69.1
PyICU version: 2.7.4


## `icu.Collator` vs `icu.RuleBasedCollator`

PyICU exposes two ways to handle collation in ICU. The usual approach is to use `icu.Collator` with a supported locale. The second approach is to write your own custom collation rules and use them with `icu.RuleBasedCollator`.

The second approach is useful for languages with no locale support, or in situations where you wish to modify the rules for a supported language.

To get a list of supported locales for collation, use `Collator.getAvailableLocales()`:

In [175]:
#for key in Collator.getAvailableLocales().keys():
#  print(key)

" ".join(list(Collator.getAvailableLocales().keys()))

'af am ar ar_SA as az be bg bn bo br bs bs_Cyrl ca ceb chr cs cy da de de_AT dsb dz ee el en en_US en_US_POSIX eo es et fa fa_AF ff ff_Adlm fi fil fo fr fr_CA ga gl gu ha haw he he_IL hi hr hsb hu hy id id_ID ig is it ja ka kk kl km kn ko kok ku ky lb lkt ln lo lt lv mk ml mn mr ms mt my nb nb_NO ne nl nn no om or pa pa_Guru pa_Guru_IN pl ps pt ro ru sa se si sk sl smn sq sr sr_Cyrl sr_Cyrl_BA sr_Cyrl_ME sr_Cyrl_RS sr_Latn sr_Latn_BA sr_Latn_RS sv sw ta te th tk to tr ug uk ur uz vi wae wo xh yi yo zh zh_Hans zh_Hans_CN zh_Hans_SG zh_Hant zh_Hant_HK zh_Hant_MO zh_Hant_TW zu'

The available locales depend on the version of ICU4C in use. With each new version of ICU, additional locales are made available. ICU locales are based on the [Common Locale Data Repository](http://cldr.unicode.org/) (CLDR) datasets.

### Using `icu.Collator`

1. Define a collator instance using `icu.Collator`
2. Use the `key` parameter of `list.sort()` or `sorted()` to access the collator's SortKey.

A Turkish example:

In [176]:
tr_list = ["İstanbul", "Ankara", "İzmir", "Bursa", "Antalya", "Adana", "Şanlıurfa", "Konya", "Ceyhan", "Gaziantep", "Kayseri", "Samsun", "Çanakkale", "Iğdır"]
#loc = 'tr_TR.UTF-8'
loc = Locale("tr")
tr_collator = Collator.createInstance(loc)
print(sorted(tr_list, key=tr_collator.getSortKey))
#tr_list.sort(key=tr_collator.getSortKey)

['Adana', 'Ankara', 'Antalya', 'Bursa', 'Ceyhan', 'Çanakkale', 'Gaziantep', 'Iğdır', 'İstanbul', 'İzmir', 'Kayseri', 'Konya', 'Samsun', 'Şanlıurfa']


A Thai collation example:

In [177]:
thai_words = ["หมู", "เห็ด", "เป็ด", "ไก่", "ช้าง", "ม้า", "วัว", "ควาย"]
th_loc = Locale.createCanonical("th-TH") 
th_collator = Collator.createInstance(th_loc)
print(sorted(thai_words, key=th_collator.getSortKey))
# ['ไก่', 'ควาย', 'ช้าง', 'เป็ด', 'ม้า', 'วัว', 'หมู', 'เห็ด']

['ไก่', 'ควาย', 'ช้าง', 'เป็ด', 'ม้า', 'วัว', 'หมู', 'เห็ด']


Some locales have more than one set of collation rules available. For instance, German has a default linguistic collation. There is also a phonebook collation which orders names differently.

Given the following list of names:

In [179]:
de_names = ['Hafermann, Ulrich', 'Hacker, Simon', 'Hackmann, Gustav', 'Häcker, Emil', 'Haecker, Manfred', 'Häcker, Xaver', 'Assemann, Simon', 'Aßmann, Erika', 'Astmann, Manfred', 'Assmann, Frank']

The default German collation rules sort these as:

In [180]:
de_loc = Locale.createCanonical("de-DE") 
de_collator = Collator.createInstance(de_loc)
print(sorted(de_names, key=de_collator.getSortKey))

['Assemann, Simon', 'Aßmann, Erika', 'Assmann, Frank', 'Astmann, Manfred', 'Häcker, Emil', 'Hacker, Simon', 'Häcker, Xaver', 'Hackmann, Gustav', 'Haecker, Manfred', 'Hafermann, Ulrich']


The German phonebook collation rules make the following changes to the default German collation rules:

Ä/ä = Ae/ae \
Ö/ö = Oe/oe \
Ü/ü = Ue/ue \
ß = ss

This will sort the list of names as follows:

In [181]:
de_phbk_loc = Locale.createCanonical("de-u-co-phonebk")  # de@collation=phonebook
de_phbk_collator = Collator.createInstance(de_phbk_loc)
print(sorted(de_names, key=de_phbk_collator.getSortKey))

['Assemann, Simon', 'Aßmann, Erika', 'Assmann, Frank', 'Astmann, Manfred', 'Hacker, Simon', 'Hackmann, Gustav', 'Häcker, Emil', 'Haecker, Manfred', 'Häcker, Xaver', 'Hafermann, Ulrich']


## Customising `icu.Collator`

Attributes can be used to modify various aspects of the collation rules being used.

1. Define a collator instance using `icu.Collator`
2. Update the attributes of the collator instance, using `.setAttribute()`
3. Use the `key` parameter of `list.sort()` or `sorted()` to access the collator's SortKey.

|Attribute  |Values  |Notes  |
|---------- |------- |------ |
|UCollAttribute.ALTERNATE_HANDLING |UCollAttributeValue.NON_IGNORABLE, UCollAttributeValue.SHIFTED |For handling variable elements. __NON_IGNORABLE__ (default): all codepoints with non-ignorable primary weights are treated in the same way. __SHIFTED__: codepoints with primary weights equal to or below the variable top value are ignored on the primary level and moved to the quaternary level. |
|UCollAttribute.CASE_FIRST |UCollAttributeValue.OFF, UCollAttributeValue.LOWER_FIRST, UCollAttributeValue.UPPER_FIRST |Ordering of upper and lower case letters. __OFF__ (default): upper and lower case letters are ordered according to their tertiary weights. __UPPER_FIRST__: upper case letters sort before lower case letters. __LOWER_FIRST__: lower case letters sort before upper case letters. |
|UCollAttribute.CASE_LEVEL |UCollAttributeValue.ON, UCollAttributeValue.OFF |__On__ if case sensitive (case level is generated), __off__ is case insensitive (case level is not generated). |
|UCollAttribute.DECOMPOSITION_MODE |UCollAttributeValue.ON, UCollAttributeValue.OFF  |Alias for UCollAttribute.NORMALIZATION_MODE attribute.  |
|UCollAttribute.FRENCH_COLLATION |UCollAttributeValue.ON, UCollAttributeValue.OFF |Direction of secondary weights (Canadian French).  __ON__: secondary weights considered backwards. __OFF__: secondary weights considered in the order they appear.   |
|UCollAttribute.HIRAGANA_QUATERNARY_MODE |UCollAttributeValue.ON, UCollAttributeValue.OFF |When turned on, this attribute positions Hiragana before all non-ignorables on the quaternary level. Deprecated. |
|UCollAttribute.NORMALIZATION_MODE |UCollAttributeValue.ON, UCollAttributeValue.OFF |Controls whether the normalization check and necessary normalizations are performed. __OFF__ (default): no normalization check is performed. __ON__: checks whether the data is in FCD form; if not, NFD normalization is performed. |
|UCollAttribute.NUMERIC_COLLATION |UCollAttributeValue.ON, UCollAttributeValue.OFF  |When turned on, this attribute makes substrings of digits sort according to their numeric values.  |
|UCollAttribute.STRENGTH |UCollAttributeValue.PRIMARY, UCollAttributeValue.SECONDARY, UCollAttributeValue.TERTIARY, UCollAttributeValue.QUATERNARY, UCollAttributeValue.IDENTICAL |The strength attribute. The usual strength for most locales (except Japanese) is tertiary.  |

See [UColAttribute](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/ucol_8h.html#a583fbe7fc4a850e2fcc692e766d2826c) in the [ICU4C API](https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/) reference.


In the following code snippet, `.setAttribute()` is used to modify the collation, allowing digits to be sorted as numbers, rather than being sorted as characters.

In [182]:
en_list = ['3 oranges', '1 apple', '10 strawberries', '2 pears']
print(sorted(en_list))
en_collator = Collator.createInstance(Locale('en_AU'))
en_collator.setAttribute(UCollAttribute.NUMERIC_COLLATION, UCollAttributeValue.ON)


print(sorted(en_list, key=en_collator.getSortKey))

['1 apple', '10 strawberries', '2 pears', '3 oranges']
['1 apple', '2 pears', '3 oranges', '10 strawberries']


To enable case insensitive collation:

`coll.setAttribute(UCollAttribute.STRENGTH, UCollAttributeValue.SECONDARY)` or `coll.setStrength(UCollAttributeValue.SECONDARY)`

To enable case and diacritic insensitive collation:

`coll.setAttribute(UCollAttribute.STRENGTH, UCollAttributeValue.PRIMARY)` or `coll.setStrength(UCollAttributeValue.PRIMARY)`


In [183]:
def str_compare(col, s1, s2):
    if col.compare(s1, s2) == 0:
        print( str(s1) + " equals " + str(s2))
    elif col.compare(s1, s2) < 0:
        print(str(s1) +  " is less than " + str(s2))
    else:
        print(str(s1) +  " is greater than " + str(s2))

str_compare(en_collator, "abc", "ABC")

# Case insensitive collation
en_collator.setAttribute(UCollAttribute.STRENGTH, UCollAttributeValue.SECONDARY)
str_compare(en_collator, "abc", "ABC")


abc is less than ABC
abc equals ABC
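The effect of each strength level can be compared side by side. The sketch below is illustrative only: the import guard and the `HAVE_ICU` flag are additions for this example, so that the snippet does nothing when PyICU is missing, and the word pairs are made up for the demonstration:

```python
try:
    from icu import Collator, Locale, UCollAttribute, UCollAttributeValue
    HAVE_ICU = True
except ImportError:
    HAVE_ICU = False  # PyICU is not installed; nothing below runs

# One pair differs only in case, the other only in a diacritic.
pairs = [("abc", "ABC"), ("abc", "\u00e4bc")]
results = {}

if HAVE_ICU:
    coll = Collator.createInstance(Locale("en_AU"))
    strengths = {
        "TERTIARY": UCollAttributeValue.TERTIARY,
        "SECONDARY": UCollAttributeValue.SECONDARY,
        "PRIMARY": UCollAttributeValue.PRIMARY,
    }
    for name, value in strengths.items():
        coll.setAttribute(UCollAttribute.STRENGTH, value)
        # compare() returns a negative number, 0, or a positive number
        results[name] = [coll.compare(s1, s2) for s1, s2 in pairs]
        print(name, results[name])
```

At secondary strength the case difference is ignored but the diacritic still counts. At primary strength both pairs compare as equal.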


## Using `icu.RuleBasedCollator`

1. Define the [collation rules](https://unicode-org.github.io/icu/userguide/collation/customization/)
2. Define a collator instance using `icu.RuleBasedCollator`
3. Use the `key` parameter of `list.sort()` or `sorted()` to access the collator's SortKey.

The following code snippet sorts a list of Dinka lexemes:

In [184]:
din_rules = "[normalization on]&a<<<A<<aa<<<Aa<<<AA<<ä<<<Ä<<ää<<<Ää<<<ÄÄ&d<<<D<dh<<<Dh<<<DH<e<<<E<<ee<<<Ee<<<EE<<ë<<<Ë<<ëë<<<Ëë<<<ËË<ɛ<<<Ɛ<<ɛɛ<<<Ɛɛ<<<ƐƐ<<ɛ̈<<<Ɛ̈<<ɛ̈ɛ̈<<<Ɛ̈ɛ̈<<<Ɛ̈Ɛ̈&g<<<G<ɣ<<<Ɣ&i<<<I<<ii<<<II<<ï<<<Ï<<ïï<<<Ïï<<<ÏÏ&n<<<N<nh<<<Nh<<<NH<ny<<<Ny<<<NY<ŋ<<<Ŋ<o<<<O<<oo<<<Oo<<<OO<<ö<<<Ö<<öö<<<Öö<<<ÖÖ<ɔ<<<Ɔ<<ɔɔ<<<Ɔɔ<<<ƆƆ<<ɔ̈<<<Ɔ̈<<ɔ̈ɔ̈<<<Ɔ̈ɔ̈<<<Ɔ̈Ɔ̈<t<<<T<th<<<Th<<<TH<u<<<U<<uu<<<UU"
din_list = ['agook', 'abeeric', 'abuɔk', 'agol', 'acuut', 'abany', 'abur', 'abëër', 'abaany', 'aber', 'akɔ̈r', 'acut', 'acuth', 'abenh', 'abeer', 'akuny', 'ago', 'akɔrcok', 'akuŋɛŋ', 'abuɔ̈k', 'aberŋic', 'abuɔɔk', 'abeŋ', 'abuɔ̈c', 'abaŋ']
din_collator = RuleBasedCollator(din_rules)
print(sorted(din_list, key=din_collator.getSortKey))

['abany', 'abaany', 'abaŋ', 'abenh', 'abeŋ', 'aber', 'abeer', 'abëër', 'abeeric', 'aberŋic', 'abuɔ̈c', 'abuɔk', 'abuɔɔk', 'abuɔ̈k', 'abur', 'acut', 'acuut', 'acuth', 'ago', 'agook', 'agol', 'akɔ̈r', 'akɔrcok', 'akuny', 'akuŋɛŋ']


## Modifying existing rules

It is possible to make changes to collation rules to tailor the collation sequence. In Welsh __*Ch*__ is a separate letter of the alphabet that comes after __*C*__. The default collation rules for Welsh sort majuscules after minuscules:

In [185]:
cy_list = ["chwaeth", "Cyflym", "Clust", "cyflym", "Chwaeth", "chwerw", "Chwerw", "clust"]
cy_collator = Collator.createInstance(Locale("cy"))
print(sorted(cy_list, key=cy_collator.getSortKey))

['clust', 'Clust', 'cyflym', 'Cyflym', 'chwaeth', 'Chwaeth', 'chwerw', 'Chwerw']


Use `getRules()` to retrieve the collation rules from a collator instance. You can then modify the existing rules. The following snippet retrieves the Welsh collation rules and prepends [an option](https://unicode-org.github.io/icu/userguide/collation/customization/#default-options) that changes the collation order of majuscules and minuscules. These rules will sort majuscules before minuscules:

In [186]:
cy_rules = cy_collator.getRules()   
#cy_rules = Collator.createInstance(Locale('cy')).getRules()
cy_rules = '[caseFirst upper]' + cy_rules
cy_collator_alt = RuleBasedCollator(cy_rules)
print(sorted(cy_list, key=cy_collator_alt.getSortKey))


['Clust', 'clust', 'Cyflym', 'cyflym', 'Chwaeth', 'chwaeth', 'Chwerw', 'chwerw']
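When the only change needed is the case order, the `CASE_FIRST` attribute from the attribute table may be a simpler alternative to rebuilding the rules. A hedged sketch, which should give the same ordering as the rule-based collator. The import guard and the `HAVE_ICU` flag are additions for this example, as is the `cy_upper_first` name:

```python
try:
    from icu import Collator, Locale, UCollAttribute, UCollAttributeValue
    HAVE_ICU = True
except ImportError:
    HAVE_ICU = False  # PyICU is not installed; nothing below runs

cy_list = ["chwaeth", "Cyflym", "Clust", "cyflym", "Chwaeth", "chwerw", "Chwerw", "clust"]
cy_sorted = None

if HAVE_ICU:
    cy_upper_first = Collator.createInstance(Locale("cy"))
    # Same setting as the '[caseFirst upper]' rule option, applied as an attribute
    cy_upper_first.setAttribute(UCollAttribute.CASE_FIRST, UCollAttributeValue.UPPER_FIRST)
    cy_sorted = sorted(cy_list, key=cy_upper_first.getSortKey)
    print(cy_sorted)
```

Retrieving and editing the rules remains the more general technique, since it also allows letters to be reordered.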


Copyright © 2021 [Enabling Languages](https://enabling-languages.github.io/). <br>This notebook is made available under the [MIT licence](https://github.com/enabling-languages/python-i18n/blob/main/LICENSE).