# Sorting emoji

Python's built-in sorting functions order emoji by codepoint.

Neither codepoint order nor the default collation rules of the Unicode Collation Algorithm provide adequate [ordering and grouping](https://www.unicode.org/reports/tr51/#Sorting) of emoji.

The Unicode Common Locale Data Repository (CLDR) provides collation rules for emoji. [Conformant emoji collation](https://www.unicode.org/reports/tr51/#Collation_Conformance) is defined by the CLDR tailoring rules for the Unicode Collation Algorithm (UCA).

CLDR assigns emoji to broad conceptual categories so that related emoji sort together.

## Emoji only collation

The discussion below uses the following emoji:

|Character |Codepoint |Description |Category |
|--------- |--------- |----------- |-------- |
|🦜 |U+1F99C |Parrot |animal-bird |
|🥚 |U+1F95A |Egg |food-prepared |
|🐔 |U+1F414 |Chicken |animal-bird |

Python's default sort orders them by codepoint: U+1F414 (chicken), U+1F95A (egg), and then U+1F99C (parrot).

The CLDR ordering instead sorts the two birds together (U+1F414, then U+1F99C), followed by U+1F95A.

In [1]:
a = ['🦜', '🥚', '🐔']
sorted(a)

['🐔', '🥚', '🦜']
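
To make the codepoint ordering visible, the following minimal sketch prints each emoji next to its codepoint. It uses only the Python standard library; the variable names are illustrative and not part of the original notebook.

```python
# The three emoji from the table above
emoji = ['🦜', '🥚', '🐔']

# Map each character to its codepoint in U+XXXX notation
codepoints = {ch: f"U+{ord(ch):04X}" for ch in emoji}

# sorted() compares strings codepoint by codepoint, so for single-character
# strings it gives the same order as sorting by ord()
by_default = sorted(emoji)
by_codepoint = sorted(emoji, key=ord)

for ch in by_default:
    print(ch, codepoints[ch])
```

Both `by_default` and `by_codepoint` hold the same list, which confirms that the default sort is a plain numeric ordering of codepoints and knows nothing about emoji categories.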

Using PyICU, it is possible to sort emoji according to CLDR's collation rules for emoji. The `-u-co-emoji` Unicode BCP 47 extension enables CLDR-based emoji collation. When sorting only emoji, we can use the language subtag `und` (undetermined) as the base of the locale identifier: `und-u-co-emoji`.

In [2]:
from icu import Collator, Locale
coll = Collator.createInstance(Locale.createCanonical("und-u-co-emoji"))
print(sorted(a, key=coll.getSortKey))

['🐔', '🦜', '🥚']


The result follows the CLDR emoji collation rules: the two birds sort together, followed by the egg.

## Sorting text and emoji

A more complex scenario is sorting a mix of text and emoji.

[UTS #35](https://unicode.org/reports/tr35/tr35-collation.html#Combining_Rules) provides a discussion of tailoring and combining rules in relation to sorting emoji and text. We'll implement the example given in UTS #35 in Python.

The following characters are used:

|Character  |Codepoint  |Description  |
|---------- |---------- |------------ |
|😀 |U+1F600 |Grinning Face |
|글 |U+AE00 |Hangul Syllable Geul |
|Z |U+005A |Latin Capital Letter Z |
|ü |U+00FC |Latin Small Letter U with Diaeresis |
|, |U+002C |Comma |
|✈️️ |U+2708 U+FE0F |Airplane |
|y |U+0079 |Latin Small Letter Y |
|☹️ |U+2639 U+FE0F |White Frowning Face |
|a |U+0061 |Latin Small Letter A |

Enabling emoji collation overrides language-specific tailorings. This has no effect on text in languages that use the root collation, but it produces the wrong order for languages that require tailoring to collate correctly.

Python's built-in sort orders the content by codepoint:

In [11]:
# List to be sorted
b = ['😀', '글', 'Z', 'ü', ',', '✈️️', 'y', '☹️', 'a']

# Default Python sort
sorted(b)

[',', 'Z', 'a', 'y', 'ü', '☹️', '✈️️', '글', '😀']

The `en` locale identifier will use the CLDR root collation. Emoji are not sorted using the CLDR emoji collation rules:

In [25]:
# locale: en
en_coll = Collator.createInstance(Locale.forLanguageTag("en"))
sorted(b, key=en_coll.getSortKey)

[',', '☹️', '✈️️', '😀', 'a', 'ü', 'y', 'Z', '글']

Enabling emoji collation with the `en-u-co-emoji` locale sorts the emoji according to the emoji collation rules; the remaining characters are sorted according to the root collation.

In [24]:
# locale for en-u-co-emoji
en_emoji_coll = Collator.createInstance(Locale.forLanguageTag("en-u-co-emoji"))
sorted(b, key=en_emoji_coll.getSortKey)

[',', '😀', '☹️', '✈️️', 'a', 'ü', 'y', 'Z', '글']

`en-u-co-emoji` yields the same result as `und-u-co-emoji`: emoji are sorted according to the CLDR emoji collation order and other characters according to the root collation.

In [23]:
# locale for und-u-co-emoji
und_emoji_coll = Collator.createInstance(Locale.forLanguageTag("und-u-co-emoji"))
sorted(b, key=und_emoji_coll.getSortKey)

[',', '😀', '☹️', '✈️️', 'a', 'ü', 'y', 'Z', '글']

The `da` locale has tailored collation rules to order text in the sequence required for Danish:

In [22]:
# locale for da
da_coll = Collator.createInstance(Locale.forLanguageTag("da"))
sorted(b, key=da_coll.getSortKey)

[',', '☹️', '✈️️', '😀', 'a', 'y', 'ü', 'Z', '글']

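Adding emoji collation support overrides the Danish tailorings. Compare the position of __ü__ in the results for the `da` and `da-u-co-emoji` locales: with `da` it sorts after __y__, while with `da-u-co-emoji` it sorts before __y__, as in the root collation.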

In [20]:
# locale for da-u-co-emoji
da_emoji_coll = Collator.createInstance(Locale.forLanguageTag("da-u-co-emoji"))
sorted(b, key=da_emoji_coll.getSortKey)

[',', '😀', '☹️', '✈️️', 'a', 'ü', 'y', 'Z', '글']

To overcome this, we can combine the collation rules for the `da` and `und-u-co-emoji` locales:

1. Create a collator instance for each locale and retrieve its rules
2. Concatenate the rule sets
3. Create a collator instance from the combined rules using `RuleBasedCollator`

This will order emoji according to the emoji collation rules and order Latin script text according to Danish collation rules.

In [19]:
# Combined rules
from icu import RuleBasedCollator
# Retrieve the tailoring rules for each locale
da_rules = Collator.createInstance(Locale.forLanguageTag('da')).getRules()
emoji_rules = Collator.createInstance(Locale.forLanguageTag('und-u-co-emoji')).getRules()
# Concatenate the rule sets, Danish rules first
da_and_emoji_rules = da_rules + emoji_rules
# Build a collator from the combined rules
combined_coll = RuleBasedCollator(da_and_emoji_rules)
sorted(b, key=combined_coll.getSortKey)

[',', '😀', '☹️', '✈️️', 'a', 'y', 'ü', 'Z', '글']

The same approach is needed for any other language whose correct order is not provided by the CLDR root collation and therefore requires tailored rules.

## Resources

* [Emoji ordering chart](https://www.unicode.org/emoji/charts/emoji-ordering.html)
* [CLDR Root collation rules](https://github.com/unicode-org/cldr/blob/353527cdabf1e8870d261beb3c908de6deb1915b/common/collation/root.xml#L951)