# Sorting Sorani Kurdish data

In the following examples, we will sort Sorani Kurdish data according to the requirements of the Kurdish Academy. Sorting with Python's `locale` module is not a cross-platform solution, and there are no 'ckb' or 'ckb_IQ' collation rules in `icu4c`. We will therefore use custom rules with `icu.RuleBasedCollator()` to sort Sorani text.

There are two possible approaches to adding collation rules:

1. Read in an LDML file containing the rules, and
2. Provide the rules as a string that can be directly used by `icu.RuleBasedCollator()`.

We will use the second approach.
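For comparison, the first approach could be sketched roughly as follows. This is a minimal illustration, not part of the notebook: it assumes the rules sit in a `<cr>` element inside `<collation type="standard">`, as in CLDR collation files, and the names `LDML_SAMPLE` and `read_ldml_rules` are hypothetical. The sample embeds the same rule string this notebook uses, and only the standard library is needed to extract it.

```python
import xml.etree.ElementTree as ET

# Hypothetical in-memory LDML document; a real one would be read from disk.
LDML_SAMPLE = (
    "<ldml><collations>"
    '<collation type="standard"><cr><![CDATA[\n'
    "[normalization on]\n"
    "[reorder Arab]\n"
    "&\u0631 < \u0695 < \u0632\n"
    "&\u0648 < \u06C6 < \u0648\u0648\n"
    "]]></cr></collation>"
    "</collations></ldml>"
)


def read_ldml_rules(xml_text, collation_type="standard"):
    """Return the rule text of the first matching <collation> element."""
    root = ET.fromstring(xml_text)
    for collation in root.iter("collation"):
        if collation.get("type") == collation_type:
            cr = collation.find("cr")
            if cr is not None and cr.text:
                return cr.text.strip()
    raise ValueError(f"no rules found for collation type {collation_type!r}")


ldml_rules = read_ldml_rules(LDML_SAMPLE)
print(ldml_rules)
```

The extracted string could then be passed to `icu.RuleBasedCollator()` in the same way as a hand-written rule string, assuming the file stores the rules as literal characters.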

The rules used are:

```
"[normalization on]"
"[reorder Arab]"
"&\u0631 < \u0695 < \u0632"
"&\u0648 < \u06C6 < \u0648\u0648"
```
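The two tailoring rules place ڕ (U+0695) between ر (U+0631) and ز (U+0632), and place ۆ (U+06C6) and then the digraph وو after و (U+0648). A small check on hand-picked samples can make this concrete. The sample lists are illustrative and not taken from the word list, and the `find_spec` guard is an addition so the snippet can be pasted into an environment without PyICU.

```python
import importlib.util

# Same rule string as used in this notebook.
ckb_rules = (
    "[normalization on]"
    "[reorder Arab]"
    "&\u0631 < \u0695 < \u0632"
    "&\u0648 < \u06C6 < \u0648\u0648"
)

# Deliberately listed in reverse of the intended order.
reh_sample = ["\u0632", "\u0695", "\u0631"]        # zain, reh with small v below, reh
waw_sample = ["\u0648\u0648", "\u06C6", "\u0648"]  # waw+waw, oe, waw

if importlib.util.find_spec("icu") is not None:
    from icu import RuleBasedCollator

    collator = RuleBasedCollator(ckb_rules)
    reh_sorted = sorted(reh_sample, key=collator.getSortKey)
    waw_sorted = sorted(waw_sample, key=collator.getSortKey)
    print(reh_sorted)
    print(waw_sorted)
```

Without the `وو` contraction in the second rule, a doubled waw would compare as two ordinary waws and sort before ۆ.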


## Setup

In [5]:
import sys
# import regex as re
import re
from icu import RuleBasedCollator
from random import sample
from pathlib import Path
import wget

In [6]:
# Retrieve data
data_file = "../data/wordlists/kurdi_words.txt"
url = "https://raw.githubusercontent.com/0xdolan/kurdi/master/corpus/kurdi_words.txt"
path = Path(data_file)
if not path.is_file():
    print("No file, downloading ...")
    # Make sure the target directory exists before downloading into it
    path.parent.mkdir(parents=True, exist_ok=True)
    wget.download(url, out=str(path))

# Read in data
with open(data_file, 'r', encoding='utf-8') as fh:
    data = fh.read().splitlines()

random_data = sample(data, len(data))


## Providing collation rules to icu.RuleBasedCollator()

In [7]:
ckb_rules = (
    "[normalization on]"
    "[reorder Arab]"
    "&\u0631 < \u0695 < \u0632"
    "&\u0648 < \u06C6 < \u0648\u0648"
)

rb_collator = RuleBasedCollator(ckb_rules)
sorted_rb_data = sorted(random_data, key=rb_collator.getSortKey)

In [8]:
print(f'{sorted_rb_data == random_data}, {sorted_rb_data == data}')
print(f'{sorted_rb_data[:30]}, \n\n {sorted_rb_data[-30:]}')

False, False
['ئا', 'ئائە', 'ئائەلێرەوە', 'ئائەم', 'ئائەمانە', 'ئائەمانەن', 'ئائەمنیەتیان', 'ئائەمە', 'ئائەمەش', 'ئائەمەشە', 'ئائەمەمان', 'ئائەمەندە', 'ئائەمەندەی', 'ئائەمەنە', 'ئائەمەی', 'ئائەمەیە', 'ئائەو', 'ئائەوانە', 'ئائەوانەن', 'ئائەوانەی', 'ئائەوها', 'ئائەوە', 'ئائەوەتا', 'ئائەوەندە', 'ئائەوەها', 'ئائەوەهادا', 'ئائەوەیە', 'ئائین', 'ئائیندەیە', 'ئائینش'], 

 ['یێهو', 'یێهودا', 'یێهوش', 'یێهوشا', 'یێهوشوا', 'یێهوی', 'یێهویان', 'یێهۆڤا', 'یێهوودە', 'یێەی', 'یێومە', 'یێوە', 'یێویژدانانەی', 'یێویستیان', 'یێوێنەی', 'یێی', 'یێیان', 'یێیانکردەوە', 'یێیت', 'یێیدا', 'یێیرۆ', 'یێیکەویبوو', 'یێیل', 'یێین', 'یێیە', 'یێیەک', 'یێیەکتر', 'یێیەکە', 'یێیەکی', 'یێیەی']
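The `False, False` output only says that the collated list differs from both the shuffled list and the original file order. To see where two orderings first diverge, a small helper along these lines may be useful. It is a sketch: `first_difference` is a hypothetical name, and the demonstration uses stand-in data rather than the word list.

```python
def first_difference(left, right):
    """Return the first index where two lists differ, or None if they are equal."""
    for index, (a, b) in enumerate(zip(left, right)):
        if a != b:
            return index
    if len(left) != len(right):
        # One list is a prefix of the other.
        return min(len(left), len(right))
    return None


# Stand-in data for illustration only.
print(first_difference(["a", "b", "c"], ["a", "c", "b"]))
print(first_difference(["a", "b"], ["a", "b"]))
```

In the notebook, `first_difference(sorted(data), sorted_rb_data)` would give the first position where plain code point order and the tailored collation disagree.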
