# Sorting Sorani Kurdish data

In the following examples, we will be sorting Sorani Kurdish data, following the requirements of the Kurdish Academy, based on ...

We will look at two approaches for Central Kurdish collation:

1. Use _PyICU_, and
2. Use the _locale_ module.

_PyICU_ is a Python wrapper for `icu4c`, but `icu4c` does not contain collation rules for the `ckb` locales. It is possible, however, to define custom rules and pass them to `icu.RuleBasedCollator()`. This approach provides a cross-platform solution on systems that have `icu4c` installed.

Using the _locale_ module for sorting is more problematic. Different operating systems use different underlying infrastructure, and macOS and Windows provide no way to use custom locales. GLIBC-based systems provide a `ckb_IQ` locale, but its collation rules do not match the order used by the Kurdish Academy, so it is necessary to define and install a custom locale on those systems.

As a result, sorting with the _locale_ module is not a cross-platform solution.

## Setup

In [1]:
import sys
# import regex as re
import re
import locale
import random
from icu import RuleBasedCollator
from pathlib import Path
import wget
# from el_internationalisation import bidi_envelope

In [2]:
# Retrieve Sorani Kurdish wordlist
data_file = "../data/wordlists/kurdi_words.txt"
url = "https://raw.githubusercontent.com/0xdolan/kurdi/master/corpus/kurdi_words.txt"
path = Path(data_file)
if not path.is_file():
    print("Downloading file ...")
    # Make sure the target directory exists before downloading into it
    path.parent.mkdir(parents=True, exist_ok=True)
    filename = wget.download(url, out=str(path.parent))

# Read in data as UTF-8 rather than relying on the platform default encoding
with open(data_file, 'r', encoding='utf-8') as fh:
    data = fh.read().splitlines()

# Fix the seed so the same 40-word sample is drawn on every run
random.seed(10)
# random_data = sample(data, len(data))
random_data = random.sample(data, 40)


In [3]:
#print("\n".join(random_data))
print(*random_data, sep="\n")

هەمیل
ئێرانییش
لەخاڵیس
مۆخت
هەڵدەکشێتەوە
ئاڵاکانیاندا
حاجت
مارکۆماس
مەترسیانەی
دەرمانانەیی
پەناهندەلە
تاوانکارەمان
ئێوەیەکی
نوشستیەی
مێژنەم
سۆمالچ
بلبلێیە
دروستدەبو
کەئادەمیزاد
شیعرییەکەش
ئەلیح
لێیانناون
بەمڵایەوە
وهەشتاکان
شمەکخۆریدا
فرەزمانیی
لەئامێزم
دەستگیرکردنانەشەوە
ڕاناوەستیت
دکتوری
لەکتێبخانەگشتییەکانی
تیرۆردە
ڕووسوورانەی
زۆرناخۆش
چایچیت
شێخونیان
بەشتگەلێکەوە
لەکریس
گشتیدەکرد
دارانەتان


## Using icu.RuleBasedCollator()

There are two possible approaches to supplying collation rules to `icu.RuleBasedCollator()`:

1. Read in an LDML file and extract the collation rules, or
2. Provide the rules as a string that can be directly used by `icu.RuleBasedCollator()`.

We will use the second approach.
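The first approach is not used here, but it could be sketched roughly as follows using only the standard library. The LDML fragment, the `LDML_SAMPLE` name and the `extract_collation_rules()` helper are assumptions made for illustration. The fragment simply mirrors the rules used in this section and is not a file shipped with CLDR or `icu4c`.

```python
import xml.etree.ElementTree as ET

# Hypothetical LDML fragment for illustration only; it mirrors the rules used
# in this section. The \u escapes are resolved by Python before XML parsing.
LDML_SAMPLE = """<ldml>
  <identity><language type="ckb"/></identity>
  <collations>
    <collation type="standard">
      <cr><![CDATA[
        [normalization on]
        [reorder Arab]
        &\u0631 < \u0695
        &\u0648 < \u0648\u0648
      ]]></cr>
    </collation>
  </collations>
</ldml>"""


def extract_collation_rules(ldml_text, collation_type="standard"):
    """Return the rule string stored in the <cr> element of an LDML document."""
    root = ET.fromstring(ldml_text)
    cr = root.find(f"./collations/collation[@type='{collation_type}']/cr")
    if cr is None or cr.text is None:
        raise ValueError(f"no '{collation_type}' collation rules found")
    # Drop indentation and blank lines left over from the XML layout
    lines = [line.strip() for line in cr.text.splitlines()]
    return "\n".join(line for line in lines if line)


ldml_rules = extract_collation_rules(LDML_SAMPLE)
print(ldml_rules)
```

The extracted string could then be passed to `icu.RuleBasedCollator()` in the same way as a hand-written rule string.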

The rules used are:

```
"[normalization on]"
"[reorder Arab]"
"&\u0631 < \u0695"
"&\u0648 < \u0648\u0648"
```

You need to:

1. Define your collation rules
2. Initialise a collator instance by passing the rules to `icu.RuleBasedCollator`
3. Use the collator instance's `getSortKey()` method as the `key` parameter of `sorted()` or `list.sort()`.

The collator can be reused within a script to perform all necessary sorting.
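As a minimal sketch of that reuse, the snippet below builds the sort-key function once and applies it to a list of records rather than plain strings. The `make_ckb_sort_key()` helper, the `records` list and its `id` values are hypothetical. The import is guarded so that the script degrades gracefully when PyICU is not installed.

```python
# The rules from this section, kept in one module-level constant
CKB_RULES = (
    "[normalization on]"
    "[reorder Arab]"
    "&\u0631 < \u0695"
    "&\u0648 < \u0648\u0648"
)


def make_ckb_sort_key():
    """Return a reusable sort-key function, or None if PyICU is missing."""
    try:
        from icu import RuleBasedCollator
    except ImportError:
        return None
    # The bound method keeps the collator alive for as long as it is used
    return RuleBasedCollator(CKB_RULES).getSortKey


# Hypothetical records; the words are taken from the sample above
records = [
    {"id": 1, "word": "زۆرناخۆش"},
    {"id": 2, "word": "ڕاناوەستیت"},
    {"id": 3, "word": "دکتوری"},
]

sort_key = make_ckb_sort_key()
if sort_key is None:
    print("PyICU is not installed")
    by_word = None
else:
    # Sort the records on their "word" field with the shared key function
    by_word = sorted(records, key=lambda record: sort_key(record["word"]))
    print(*(record["word"] for record in by_word), sep="\n")
```

Building the key function once and passing it around avoids recompiling the rules for every sort.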

In [4]:
ckb_rules = (
    "[normalization on]"
    "[reorder Arab]"
    "&\u0631 < \u0695"
    "&\u0648 < \u0648\u0648"
)

collator = RuleBasedCollator(ckb_rules)
sorted_data_icu = sorted(random_data, key=collator.getSortKey)

In [5]:
print(*sorted_data_icu, sep="\n")


ئاڵاکانیاندا
ئەلیح
ئێرانییش
ئێوەیەکی
بلبلێیە
بەشتگەلێکەوە
بەمڵایەوە
پەناهندەلە
تاوانکارەمان
تیرۆردە
چایچیت
حاجت
دارانەتان
دروستدەبو
دکتوری
دەرمانانەیی
دەستگیرکردنانەشەوە
ڕاناوەستیت
ڕووسوورانەی
زۆرناخۆش
سۆمالچ
شمەکخۆریدا
شیعرییەکەش
شێخونیان
فرەزمانیی
کەئادەمیزاد
گشتیدەکرد
لەئامێزم
لەخاڵیس
لەکتێبخانەگشتییەکانی
لەکریس
لێیانناون
مارکۆماس
مەترسیانەی
مۆخت
مێژنەم
نوشستیەی
هەڵدەکشێتەوە
هەمیل
وهەشتاکان


## Using a GLIBC locale

It is possible to install a custom GLIBC locale on Linux systems that use GLIBC. This approach will not work on Linux systems that use _musl libc_. Likewise, it will not work on systems such as macOS that are based on _BSD libc_.
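A script could make a rough guess at whether it is running on a GLIBC system before attempting to use a custom locale. The sketch below relies on `platform.libc_ver()` from the standard library. The `uses_glibc()` helper is a hypothetical name, and the check is only a heuristic, since `libc_ver()` inspects the Python executable and may return empty strings on some builds.

```python
import platform


def uses_glibc():
    """Heuristically report whether the Python executable links against glibc."""
    lib, _version = platform.libc_ver()
    return lib == "glibc"


if uses_glibc():
    print("GLIBC detected: a custom GLIBC locale may be usable")
else:
    print("GLIBC not detected: prefer the PyICU approach")
```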

In the code below, we are using a custom GLIBC locale: [ckb_IQ.UTF-8@academy](https://github.com/enabling-languages/python-i18n/blob/main/rules/collation/glibc/ckb_IQ%40academy).

It uses the following GLIBC collation rules:

```
LC_COLLATE

copy "iso14651_t1"

collating-element <U0648_0648> from "<U0648><U0648>"  % Double ARABIC LETTER WAW

reorder-after <S0631> % ARABIC LETTER REH
<S0695> % ARABIC LETTER REH WITH SMALL V BELOW

reorder-after <S0648> % ARABIC LETTER WAW
<U0648_0648> % Double ARABIC LETTER WAW

reorder-end

END LC_COLLATE
```

In [7]:
try:
    locale.setlocale(locale.LC_COLLATE, "ckb_IQ.UTF-8@academy")
    sorted_data_glibc = sorted(random_data, key=locale.strxfrm)
except locale.Error:
    print("Required locale is unavailable")



Required locale is unavailable
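Because `locale.setlocale()` changes state for the whole process, a script may prefer to switch the collation locale only for the duration of a sort and then restore the previous setting. The following is a minimal sketch of that idea. The `collation_locale()` context manager is a hypothetical helper, not part of the _locale_ module.

```python
import locale
from contextlib import contextmanager


@contextmanager
def collation_locale(name):
    """Temporarily set LC_COLLATE; yield True if the locale was available."""
    previous = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, name)
    except locale.Error:
        # Locale not installed: nothing was changed, so nothing to restore
        yield False
        return
    try:
        yield True
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)


words = ["زۆرناخۆش", "ڕاناوەستیت", "دکتوری"]
original_setting = locale.setlocale(locale.LC_COLLATE)

with collation_locale("ckb_IQ.UTF-8@academy") as available:
    if available:
        result = sorted(words, key=locale.strxfrm)
    else:
        print("Required locale is unavailable")
        result = None
```

When the custom locale is not installed, `result` stays `None`, and the caller can fall back to the PyICU collator instead.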
