# Sorting Sorani Kurdish data

In the following examples, we sort Sorani Kurdish data according to the requirements of the Kurdish Academy. Sorting with the `locale` module does not give a cross-platform solution, and there are no 'ckb' or 'ckb_IQ' collation rules in `icu4c`. We will therefore use custom rules with `icu.RuleBasedCollator()` to sort Sorani text.

The following code illustrates two possible approaches:

1. Read in an LDML file containing the rules, and
2. Provide the rules as a string that can be passed directly to `icu.RuleBasedCollator()`.

The rules used are:

```
[normalization on]
[reorder Arab]
&\u0695 < \u0632
&\u0648 < \u06C6 < \u0648\u0648
```
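As a quick sanity check of what these rules are meant to do, the sketch below sorts just the five affected letters. It assumes PyICU is installed; the import is guarded so that the check is simply skipped when the `icu` module is missing. The variable names (`rules`, `letters`, `ordered`) are illustrative only.

```python
# Minimal check of the tailoring on a handful of letters.
# Assumption: PyICU is installed; if it is not, the check is skipped.
try:
    from icu import RuleBasedCollator
except ImportError:
    RuleBasedCollator = None

rules = (
    "[normalization on]"
    "[reorder Arab]"
    "&\u0695 < \u0632"                  # ز sorts directly after ڕ
    "&\u0648 < \u06C6 < \u0648\u0648"   # و, then ۆ, then the digraph وو
)

# وو, ز, ۆ, ڕ, و in deliberately scrambled order
letters = ["\u0648\u0648", "\u0632", "\u06C6", "\u0695", "\u0648"]

if RuleBasedCollator is not None:
    collator = RuleBasedCollator(rules)
    ordered = sorted(letters, key=collator.getSortKey)
    print(ordered)
else:
    ordered = None
    print("PyICU not available, skipping check")
```

With the tailoring applied, the expected order is ڕ, ز, و, ۆ, وو.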


## Setup

In [2]:
import sys
# import regex as re
import re
from icu import Locale, Collator, RuleBasedCollator
import xml.etree.ElementTree as ET
from random import sample
import locale
from pathlib import Path
import wget

ModuleNotFoundError: No module named 'icu'

In [None]:
# Retrieve data
data_file = "../data/wordlists/kurdi_words.txt"
url = "https://raw.githubusercontent.com/0xdolan/kurdi/master/corpus/kurdi_words.txt"
path = Path(data_file)
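if not path.is_file():
    print("No file, downloading ...")
    output_directory = "../data/wordlists/"
    # Ensure the target directory exists before downloading into it
    Path(output_directory).mkdir(parents=True, exist_ok=True)
    filename = wget.download(url, out=output_directory)

# Read in data; the word list is UTF-8, so don't rely on the platform default
with open(data_file, 'r', encoding='utf-8') as fh:
    data = fh.read().splitlines()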

random_data = sample(data, len(data))


## Using rules in a LDML file

The Common Locale Data Repository (CLDR) provides locale data as XML files using the Locale Data Markup Language (LDML) schema; a JSON version of the CLDR data is also available. We will read and parse the XML file, extract the collation rules, and then build a collator instance to sort our text.
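To make the extraction step concrete, here is a minimal sketch that parses an inline LDML fragment with the standard library only. The fragment is hypothetical: it is shaped like a CLDR collation file, but it is not the actual `ckb.xml` used in this notebook. The clean-up regular expression is the same one used in the function in the next cell.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical LDML fragment (assumption: the real ckb.xml has this shape).
# A raw string keeps the \uXXXX escapes literal, as they would be in a file.
SAMPLE_LDML = r"""<ldml>
  <identity><language type="ckb"/></identity>
  <collations>
    <collation type="standard">
      <cr><![CDATA[
        [normalization on]
        [reorder Arab]
        &\u0695 < \u0632 # zain after reh with small v below
        &\u0648 < \u06C6 < \u0648\u0648
      ]]></cr>
    </collation>
  </collations>
</ldml>
"""

root = ET.fromstring(SAMPLE_LDML)
cr = root.find('./collations/collation[@type="standard"]/cr')

# Drop indentation runs and '#' comments, then join everything on one line
cleaned = re.sub(r'[ \t]{2,}|[ ]*#.+\n', '', cr.text)
rules = ''.join(cleaned.splitlines())
print(rules)
```

The resulting single-line string still contains literal `\uXXXX` escapes, which the ICU rule syntax accepts, so it can be handed to `RuleBasedCollator()` as is.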

In [None]:
def create_ldml_collator_instance(lang: str):
    def get_custom_rules(rules_file: str) -> str:
        rules: str = ''
        doc = ET.parse(rules_file)
        r = doc.find('./collations/collation[@type="standard"]')
        if r is None:
            r = doc.find('./collations/collation')
        if r is None:
            # No collation element at all: stop here rather than fail on r.find() below
            raise ValueError(f"Can't find collation element in {rules_file}")
        pattern = re.compile(r'[ \t]{2,}|[ ]*#.+\n')
        rules = re.sub(pattern, '', r.find('./cr').text)
        # return rules.replace("\n", "")
        return ''.join(rules.splitlines())
        
    # custom_rules = {
    #     "din": "../rules/collation/din.xml",
    #     "din-SS": "../rules/collation/din.xml",
    #     "ckb": "../rules/collation/ckb.xml",
    #     "ckb-IQ": "../rules/collation/ckb.xml",
    #     "ckb-IR": "../rules/collation/ckb.xml"
    # }
    
    # if lang in custom_rules:
        # return RuleBasedCollator(get_custom_rules(custom_rules[lang]))
    return RuleBasedCollator(get_custom_rules("../rules/collation/ckb.xml"))
    # return Collator.createInstance(Locale.forLanguageTag(lang))

In [None]:
ldml_collator = create_ldml_collator_instance("ckb")
sorted_ldml_data = sorted(random_data, key=ldml_collator.getSortKey)

In [None]:
print(f'{sorted_ldml_data == random_data}, {sorted_ldml_data == data}')

False, False


## Collation rules as an embedded string in Python

In [None]:
ckb_rules = (
    "[normalization on]"
    "[reorder Arab]"
    "&\u0695 < \u0632"
    "&\u0648 < \u06C6 < \u0648\u0648"
)

rb_collator = RuleBasedCollator(ckb_rules)
sorted_rb_data = sorted(random_data, key=rb_collator.getSortKey)

In [None]:
print(f'{sorted_rb_data == random_data}, {sorted_rb_data == data}, {sorted_rb_data == sorted_ldml_data}')

False, False, True


## Using glibc: ckb_IQ

In [None]:
locale.setlocale(locale.LC_COLLATE, "ckb_IQ.UTF-8")

'ckb_IQ.UTF-8'
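The call above only succeeds on systems where the `ckb_IQ.UTF-8` locale is installed, which is one reason the `locale` approach is not portable. A minimal sketch of a guarded version follows; `try_collation_locale` is a hypothetical helper name, not part of the `locale` module.

```python
import locale

def try_collation_locale(name: str) -> bool:
    """Try to switch LC_COLLATE to `name`; return False if it is unavailable."""
    try:
        locale.setlocale(locale.LC_COLLATE, name)
    except locale.Error:
        # Raised when the requested locale is not installed on this system
        return False
    return True

if try_collation_locale("ckb_IQ.UTF-8"):
    print("Using glibc collation for ckb_IQ")
else:
    print("ckb_IQ.UTF-8 is not installed; locale.strxfrm would use another collation")
```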

In [None]:
sorted_glibc_data = sorted(random_data, key=locale.strxfrm)

In [None]:
print(f'{sorted_glibc_data == random_data}, {sorted_glibc_data == data}, {sorted_glibc_data == sorted_rb_data}, {sorted_glibc_data == sorted_ldml_data}')

False, False, False, False
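The boolean comparisons show that the glibc and ICU results differ, but not where. A small helper like the following sketch can locate the first point of divergence. `first_difference` is a hypothetical helper, and the toy lists stand in for the sorted word lists above.

```python
from itertools import zip_longest

def first_difference(a, b):
    """Return (index, item_a, item_b) for the first position where the two
    sequences differ, or None if they are identical."""
    for i, (x, y) in enumerate(zip_longest(a, b)):
        if x != y:
            return i, x, y
    return None

# Toy lists standing in for two differently sorted word lists
icu_order = ["a", "b", "c", "d"]
glibc_order = ["a", "c", "b", "d"]
print(first_difference(icu_order, glibc_order))
```

In the notebook, this could be called as `first_difference(sorted_glibc_data, sorted_rb_data)` to see which words the two collations order differently.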


Glibc uses the following rules for the Central Kurdish locale:

```
LC_COLLATE
% Copy the template from ISO/IEC 14651
copy "iso14651_t1"

reorder-after <S0631> % ر
<S0695> % ڕ

reorder-after <S0646> % ن
<S0648> % و
<S06C6> % ۆ

reorder-end

END LC_COLLATE
```

This could be converted to CLDR rules as:

```
&\u0631 < \u0695
&\u0646 < \u0648 < \u06C6
```

A full set of rules would, however, need to account for all the differences between the CLDR collation algorithm and the default ISO/IEC 14651 template.
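The conversion shown above can be sketched mechanically. The function below is a rough illustration only, under several assumptions: `%` is the comment character, every collating symbol has the form `<SXXXX>` with four hexadecimal digits, and each reordered symbol becomes a primary (`<`) difference. `glibc_to_cldr` is a hypothetical helper and does not handle the general glibc locale syntax.

```python
import re

GLIBC_RULES = """\
LC_COLLATE
% Copy the template from ISO/IEC 14651
copy "iso14651_t1"

reorder-after <S0631> % ر
<S0695> % ڕ

reorder-after <S0646> % ن
<S0648> % و
<S06C6> % ۆ

reorder-end

END LC_COLLATE
"""

def glibc_to_cldr(lc_collate: str) -> list:
    """Turn simple reorder-after blocks into CLDR-style rule strings."""
    rules = []
    current = None
    for line in lc_collate.splitlines():
        line = line.split('%', 1)[0].strip()   # strip '%' comments
        anchor = re.fullmatch(r'reorder-after\s+<S([0-9A-F]{4})>', line)
        symbol = re.fullmatch(r'<S([0-9A-F]{4})>', line)
        if anchor:
            # Start a new rule anchored at this code point
            current = [f"&\\u{anchor.group(1)}"]
            rules.append(current)
        elif symbol and current is not None:
            # Each listed symbol follows the previous one at primary level
            current.append(f"\\u{symbol.group(1)}")
        elif line == 'reorder-end':
            current = None
    return [' < '.join(parts) for parts in rules]

for rule in glibc_to_cldr(GLIBC_RULES):
    print(rule)
```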

In [None]:
ckb_glibc_rules = (
    "&\u0631 < \u0695"
    "&\u0646 < \u0648 < \u06C6"
)

rb_glibc_collator = RuleBasedCollator(ckb_glibc_rules)
sorted_rb_glibc_data = sorted(random_data, key=rb_glibc_collator.getSortKey)

In [None]:
print(f'{sorted_rb_glibc_data == random_data}, {sorted_rb_glibc_data == data}, {sorted_rb_glibc_data == sorted_rb_data}, {sorted_rb_glibc_data == sorted_ldml_data}, {sorted_rb_glibc_data == sorted_glibc_data}')

False, False, False, False, True


## Custom glibc locale: ckb_IQ@academy

In [None]:
locale.setlocale(locale.LC_COLLATE, "ckb_IQ.UTF-8@academy")

NameError: name 'locale' is not defined

In [None]:
sorted_glibc_custom_data = sorted(random_data, key=locale.strxfrm)

In [None]:
print(f'{sorted_glibc_custom_data == random_data}, {sorted_glibc_custom_data == data}, {sorted_glibc_custom_data == sorted_rb_data}, {sorted_glibc_custom_data == sorted_ldml_data}, {sorted_glibc_custom_data == sorted_glibc_data}')