In [1]:
import re
from unicodedata import normalize, name, combining, decomposition

from tf.core.helpers import rangesFromList

# Inspection

Meet the unicode characters:

* cLike: a c with caron and dot below, maximally composed, still two characters
* hLike: an L with macron and dot below, maximally composed, one character

We'll decompose them, recompose them in two ways, and inspect the result.

In [2]:
cLike = "č̣"
lLike = "\u1E38"
caron = "\u030c"
macron = "\u0304"
dot = "\u0323"

Functions to inspect a unicode string.

In [3]:
def show(u):
    for c in u:
        cName = name(c)
        cPadding = ' ' if cName.startswith('COMBINING') else ''
        print(f"\t{cPadding}{c} {ord(c):>04x} {name(c)}")

In [4]:
def showdc(u):
    ud = normalize("NFD", u)
    ud1 = ud[0] + ud[2] + ud[1]
    uc = normalize("NFC", ud)
    uc1 = normalize("NFC", ud1)
    print("Original")
    show(u)
    print("Decomposed")
    show(ud)
    print("Composed")
    show(uc)
    print("Composed in alternative order")
    show(uc1)

Let's inspect our two candidates.

In [5]:
showdc(cLike)

Original
	č 010d LATIN SMALL LETTER C WITH CARON
	 ̣ 0323 COMBINING DOT BELOW
Decomposed
	c 0063 LATIN SMALL LETTER C
	 ̣ 0323 COMBINING DOT BELOW
	 ̌ 030c COMBINING CARON
Composed
	č 010d LATIN SMALL LETTER C WITH CARON
	 ̣ 0323 COMBINING DOT BELOW
Composed in alternative order
	č 010d LATIN SMALL LETTER C WITH CARON
	 ̣ 0323 COMBINING DOT BELOW


This letter can be composed to a single unicode character

In [6]:
showdc(lLike)

Original
	Ḹ 1e38 LATIN CAPITAL LETTER L WITH DOT BELOW AND MACRON
Decomposed
	L 004c LATIN CAPITAL LETTER L
	 ̣ 0323 COMBINING DOT BELOW
	 ̄ 0304 COMBINING MACRON
Composed
	Ḹ 1e38 LATIN CAPITAL LETTER L WITH DOT BELOW AND MACRON
Composed in alternative order
	Ḹ 1e38 LATIN CAPITAL LETTER L WITH DOT BELOW AND MACRON


# Matching

Let's see how we can match.
It is not completely trivial, because we might need to know the order of combining characters into which characters decompose.

We create patterns for testing out various tasks and we will see which forms of the unicode string yield what results.

* the base letter
* the dot below
* the caron or macron
* the base letter plus a caron/macron (first in a naive way, and then in a good way)
* the base letter plus a dot
* the base letter plus a caron/macron and a dot (in that order)
* the base letter plus a dot and a caron/macron (in that order)
* the base letter plus a dot and a caron/macron (in any order)

## Collecting all combining characters

As a preparation we synthesize a regex that matches all combining characters in the unicode table as far as they are not letters
(some combining characters are also letters).

We do this by inspecting the names of all unicode characters, looking for the words `COMBINING` and `LETTER` in their names,
collecting the code points on the non-letter combiners, distilling a set of ranges out of that and writing it into a regex pattern.

In [7]:
CMB = 'COMBINING'
LTR = 'LETTER'

combiners = []

for x in range(1, 0xFFFF + 1):
    try:
        u = chr(x)
        n = name(u)
    except ValueError:
        continue
    if CMB in n and LTR not in n:
            combiners.append(x)
 
print(f"Combiners : {len(combiners):>5}")

charRanges = (
    chr(b) if b == e else f"{chr(b)}-{chr(e + 1)}"
    for (b, e) in rangesFromList(combiners)
)

combinerPat = f"[{''.join(charRanges)}]*"
combinerPat

Combiners :   247


'[̀-ͣ҃-Ҋ߫-ߴఀఄഀ፝-፠᩿᪰-\u1abf᭫-᭴᷀-᷊᷋-ᷓ᷵-\u1dfa᷻-Ḁ⃐-\u20f1⳯-Ⳳ゙-゛꙯-꙳꙼-꙾꛰-꛲꣠-꣪꣱︠-︰]*'

## Non-naive searching for combinations

If we want to search for say an L with a macron, we also want to find cases of an L with a macron in the presence of other
combining characters. 
In order to do so, we need the pattern that matches combiners.

In [8]:
aon = "[\u030c\u0304]"

regs = (
    ("letter", re.compile("[A-Za-z]")),
    ("dot", re.compile(dot)),
    ("aon", re.compile(aon)),
    ("letterAonNaive", re.compile(f"[A-Za-z]{aon}")),
    ("letterAon", re.compile(f"[A-Za-z]{combinerPat}{aon}")),
    ("letterDot", re.compile(f"[A-Za-z]{combinerPat}{dot}")),
    ("letterAonDot", re.compile(f"[A-Za-z]{combinerPat}{dot}{combinerPat}{aon}")),
    ("letterDotAon", re.compile(f"[A-Za-z]{combinerPat}{aon}{combinerPat}{dot}")),
    ("letterDotAonX", re.compile(f"[A-Za-z]{combinerPat}(?:(?:{aon}{combinerPat}{dot})|(?:{dot}{combinerPat}{aon}))")),
)

def matchit(u):
    show(u)
    for (label, reg) in regs:
        match = reg.search(u)
        answer = "yes" if match else "no"
        print(f"\t{label:<20} match: {answer}")

In [9]:
def showmatch(u):
    ud = normalize("NFD", u)
    ud1 = ud[0] + ud[2] + ud[1]
    uc = normalize("NFC", ud)
    uc1 = normalize("NFC", ud1)
    print("Original")
    matchit(u)
    print("Decomposed")
    matchit(ud)
    print("Composed")
    matchit(uc)
    print("Composed in alternative order")
    matchit(uc1)

In [10]:
showmatch(cLike)

Original
	č 010d LATIN SMALL LETTER C WITH CARON
	 ̣ 0323 COMBINING DOT BELOW
	letter               match: no
	dot                  match: yes
	aon                  match: no
	letterAonNaive       match: no
	letterAon            match: no
	letterDot            match: no
	letterAonDot         match: no
	letterDotAon         match: no
	letterDotAonX        match: no
Decomposed
	c 0063 LATIN SMALL LETTER C
	 ̣ 0323 COMBINING DOT BELOW
	 ̌ 030c COMBINING CARON
	letter               match: yes
	dot                  match: yes
	aon                  match: yes
	letterAonNaive       match: no
	letterAon            match: yes
	letterDot            match: yes
	letterAonDot         match: yes
	letterDotAon         match: no
	letterDotAonX        match: yes
Composed
	č 010d LATIN SMALL LETTER C WITH CARON
	 ̣ 0323 COMBINING DOT BELOW
	letter               match: no
	dot                  match: yes
	aon                  match: no
	letterAonNaive       match: no
	letterAon            match: no
	lett

In [11]:
showmatch(lLike)

Original
	Ḹ 1e38 LATIN CAPITAL LETTER L WITH DOT BELOW AND MACRON
	letter               match: no
	dot                  match: no
	aon                  match: no
	letterAonNaive       match: no
	letterAon            match: no
	letterDot            match: no
	letterAonDot         match: no
	letterDotAon         match: no
	letterDotAonX        match: no
Decomposed
	L 004c LATIN CAPITAL LETTER L
	 ̣ 0323 COMBINING DOT BELOW
	 ̄ 0304 COMBINING MACRON
	letter               match: yes
	dot                  match: yes
	aon                  match: yes
	letterAonNaive       match: no
	letterAon            match: yes
	letterDot            match: yes
	letterAonDot         match: yes
	letterDotAon         match: no
	letterDotAonX        match: yes
Composed
	Ḹ 1e38 LATIN CAPITAL LETTER L WITH DOT BELOW AND MACRON
	letter               match: no
	dot                  match: no
	aon                  match: no
	letterAonNaive       match: no
	letterAon            match: no
	letterDot            match:

# Conclusion

In order to work systematically with composed characters, it is best to always decompose them first.
After that, it is best to assume no order of the combining characters in the decomposition.
That means that some regexes will become a bit more complicated.

# Left overs

Let's try to decompose `cLike` and `lLike` step by step (not the handiest method):

In [14]:
decomposition(cLike[0])

'0063 030C'

In [15]:
decomposition(lLike)

'1E36 0304'

In [16]:
decomposition("\u1E36")

'004C 0323'

Let's inspect the `combining` information for unicode characters.

In [19]:
combining(cLike[0])

0

In [20]:
combining(lLike)

0

In [17]:
combining(caron)

230

In [18]:
combining(macron)

230