# Collation

An alternative to `sorted()` built on `icu.Collator` and `icu.RuleBasedCollator`. It currently supports lists, tuples, strings, and pandas dataframes and series.

Custom collation rules are defined for Dinka and Akan.

All locales supported by `icu::Collator` are also available.

__TODO:__

* add support for numpy arrays
* add support for dicts [[1]](https://stackoverflow.com/questions/38793694/python-sort-a-list-of-objects-dictionaries-with-a-given-sortkey-function)
* allow user to define collation rules and pass them to `el_collation.sorted_`
* allow user to modify collation rules provided by ICU locales
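Dict support is still on the list above, so the following is only one possible shape for it, not the module's API. The helper name `sorted_records`, the `key_func` parameter, and the sample records are all hypothetical, and `str.casefold` stands in for a real collation key.

```python
from typing import Any, Callable, Iterable


def sorted_records(
    records: Iterable[dict[str, Any]],
    field: str,
    key_func: Callable[[str], Any] = str.casefold,
) -> list[dict[str, Any]]:
    """Return the records ordered by key_func applied to record[field]."""
    return sorted(records, key=lambda record: key_func(record[field]))


# Hypothetical sample data; str.casefold is a stand-in for a collation key.
entries = [
    {"Form": "abur", "id": 1},
    {"Form": "Acut", "id": 2},
    {"Form": "abany", "id": 3},
]
ordered = sorted_records(entries, "Form")
forms = [entry["Form"] for entry in ordered]
print(forms)
```

With PyICU available, a collator's `getSortKey` method could be passed as `key_func` in place of `str.casefold`.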

In [2]:
import pandas as pd
import el_collation as elcol
import random


## Custom Collation Rules (unsupported locales)

The following examples use the predefined collation rules for Dinka.

In [3]:
# Set language
lang = "din-SS"
# Provide Dinka lexemes
ordered_lexemes_tuple = (
    'abany',
    'abaany',
    'abaŋ',
    'abenh',
    'abeŋ',
    'aber',
    'abeer',
    'abëër',
    'abeeric',
    'aberŋic',
    'abuɔ̈c',
    'abuɔk',
    'abuɔɔk',
    'abuɔ̈k',
    'abur',
    'acut',
    'acuut',
    'acuth',
    'ago',
    'agook',
    'agol',
    'akɔ̈r',
    'akɔrcok',
    'akuny',
    'akuŋɛŋ'
)

# Ensure lexeme order is randomised
random.seed(5)
random_lexemes = tuple(random.sample(ordered_lexemes_tuple, len(ordered_lexemes_tuple)))
random_lexemes

('agook',
 'abeeric',
 'abuɔk',
 'agol',
 'acuut',
 'abany',
 'abur',
 'abëër',
 'abaany',
 'aber',
 'akɔ̈r',
 'acut',
 'acuth',
 'abenh',
 'abeer',
 'akuny',
 'ago',
 'akɔrcok',
 'akuŋɛŋ',
 'abuɔ̈k',
 'aberŋic',
 'abuɔɔk',
 'abeŋ',
 'abuɔ̈c',
 'abaŋ')

In [4]:
# Sort randomised tuple of Dinka lexemes
sorted_lexemes = elcol.sorted_(random_lexemes, lang)
sorted_lexemes

('abany',
 'abaany',
 'abaŋ',
 'abenh',
 'abeŋ',
 'aber',
 'abeer',
 'abëër',
 'abeeric',
 'aberŋic',
 'abuɔ̈c',
 'abuɔk',
 'abuɔɔk',
 'abuɔ̈k',
 'abur',
 'acut',
 'acuut',
 'acuth',
 'ago',
 'agook',
 'agol',
 'akɔ̈r',
 'akɔrcok',
 'akuny',
 'akuŋɛŋ')

### Pandas dataframes



In [4]:
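# Load the Dinka word list, skipping file lines 2 to 4 (0-indexed)
ddf = pd.read_csv("../word_frequency/unilex/din.txt", sep='\t', skiprows=range(2, 5))
# Shuffle the rows, then sort the dataframe using the 'Form' column
random_ddf = ddf.sample(frac=1)
sorted_ddf = elcol.sorted_(random_ddf, lang, random_ddf['Form'])
sorted_ddf.head(30)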

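| Unnamed: 0 | Form | Frequency |
|---|---|---|
| 0 | a | 1043535 |
| 1 | aa | 10106358 |
| 22 | aba | 498635 |
| 23 | abä | 210763 |
| 2 | aaba | 102811 |
| 3 | aabä | 46265 |
| 25 | abak | 77108 |
| 26 | abäk | 128514 |
| 27 | abäkke | 20562 |
| 28 | abaŋ | 154217 |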


### Pandas series

In [None]:
# Sort the shuffled 'Form' column on its own, as a pandas Series
random_words = random_ddf['Form']
sorted_words = elcol.sorted_(random_words, lang, random_words)
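
For comparison, pandas can apply a key function itself through the `key` argument of `sort_values`. The sketch below is not how `el_collation` is implemented. The helper names and sample frame are hypothetical, and `str.casefold` stands in for a collation key such as a PyICU collator's `getSortKey`.

```python
import pandas as pd


def sort_frame_by(df, column, key_func=str.casefold):
    """Sort a dataframe by one column, comparing key_func(value)."""
    return df.sort_values(column, key=lambda col: col.map(key_func))


def sort_series_by(series, key_func=str.casefold):
    """Sort a series, comparing key_func(value)."""
    return series.sort_values(key=lambda values: values.map(key_func))


# Hypothetical sample data for the demonstration.
frame = pd.DataFrame({"Form": ["abur", "Acut", "abany"]})
sorted_frame = sort_frame_by(frame, "Form")
sorted_series = sort_series_by(frame["Form"])
print(sorted_frame)
```

Both helpers keep the original index labels, so rows can still be traced back to their position in the unsorted data.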