In [1]:
import json
import re

In [2]:
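key_set = set()   # keys seen so far (reused later to find newly added keys)
mappings = {}     # first-seen mapping for each key
with open("ucsc_mapping.txt", "r", encoding="utf-8") as f:
    line_count = 0
    for line in f:
        line_count += 1
        # Split the line into key and value, then drop the first and last
        # character of each (the surrounding quotes).
        fm, uni = line.strip().split(": ")
        fm = fm[1:-1]
        uni = uni[1:-1]
        if fm in key_set:
            # Duplicate key: print new value, first value, and whether they agree.
            print(fm, " -> ", uni, " # ", mappings[fm], " # ", mappings[fm] == uni)
        else:
            key_set.add(fm)
            mappings[fm] = uni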

print("Total mappings: ", line_count)
print("Unique mappings: ", len(mappings))

fjHd  ->  ව්‍යො  #  ව්‍යො  #  True
fIda  ->  ෂෝ  #  ෂෝ  #  True
fId  ->  ෂො  #  ෂො  #  True
fâ  ->  ඩේ  #  ඬේ  #  False
Ï  ->  ඐ  #  ඐ  #  True
\/  ->  රැ  #  රැ=  #  False
Ú  ->  ඵී  #  ඵී  #  True
È  ->  දි  #  දි  #  True
Ý  ->  ඵි  #  ඵී  #  False
Ü  ->  ට්  #  ට්  #  True
Ù  ->  ඩ්  #  ඕ  #  False
Ì  ->  ඏ  #  ඏ  #  True
M  ->  ඵ  #  ඵ  #  True
c  ->  ජ  #  ජ  #  True
r  ->  ර  #  ර  #  True
{  ->  ඥ  #  ඥ  #  True
|  ->  ඳ  #  ඳ  #  True
~  ->  ඬ  #  ඬ  #  True
&  ->  )  #  )  #  True
Z  ->  ’  #  ’  #  True
•  ->  x  #  x  #  True
º  ->  X  #  X  #  True
¹  ->  V  #  V  #  True
Total mappings:  546
Unique mappings:  523


Except for four keys, every duplicate repeats the same mapping. The four conflicting ones are:

"fâ"  ->  "ඩේ"  #  "ඬේ"  #  False ==>> Key for "ඬේ" should be 'få'. Added it.

"\/"  ->  "රැ"  #  "රැ="  #  False ==>> "රැ=" seems wrong

"Ý"  ->  "ඵි"  #  "ඵී"  #  False ==>> Key for 'ඵී' is 'Ú'

"Ù"  ->  "ඩ්"  #  "ඕ"  #  False ==>> Neither is correct. "Ù": "ඞ්" and "â": "ඩ්"

The rest of the duplicates are removed.
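
The review above can also be sketched as a small helper that keeps the first value seen for each key and collects only the duplicates that disagree. The function name `split_duplicates` and the sample pairs are hypothetical and only illustrate the idea; this is not the code that produced `ucsc_deduped.txt`.

```python
def split_duplicates(pairs):
    """Keep the first value per key; collect duplicates whose value differs."""
    mapping = {}
    conflicts = []
    for key, value in pairs:
        if key not in mapping:
            mapping[key] = value
        elif mapping[key] != value:
            # (key, first value seen, later conflicting value)
            conflicts.append((key, mapping[key], value))
    return mapping, conflicts


# Hypothetical sample: one harmless duplicate ("M") and one conflict ("\u00dd", i.e. Ý).
pairs = [
    ("M", "\u0db5"),
    ("\u00dd", "\u0db5\u0dd3"),
    ("M", "\u0db5"),
    ("\u00dd", "\u0db5\u0dd2"),
]
mapping, conflicts = split_duplicates(pairs)
for key, first, later in conflicts:
    print(f"{key!r}: {first!r} vs {later!r}")
```

Only the conflicting entries then need a manual decision; identical duplicates are dropped silently.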

In [3]:
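key_set_1 = set()   # keys seen in the deduped file
mappings_1 = {}     # key -> Unicode mapping after manual cleanup
with open("ucsc_deduped.txt", "r", encoding="utf-8") as f:
    line_count = 0
    for line in f:
        line_count += 1
        # Same parsing as before: split key and value, drop the surrounding quotes.
        fm, uni = line.strip().split(": ")
        fm = fm[1:-1]
        uni = uni[1:-1]
        if fm in key_set_1:
            # Nothing should be printed here if deduplication worked.
            print(fm, " -> ", uni, " # ", mappings_1[fm], " # ", mappings_1[fm] == uni)
        else:
            key_set_1.add(fm)
            mappings_1[fm] = uni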

print("Total deduped mappings: ", line_count)
print("Unique deduped mappings: ", len(mappings_1))
print("Additional key: ", list(key_set_1 - key_set))

Total deduped mappings:  524
Unique deduped mappings:  524
Additional key:  ['få']


In [4]:
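# Flag suspicious entries: a space followed by word characters in the value,
# a space followed by anything in the key, or an "=" after other text in the value.
for key, uni in mappings_1.items():
    if re.search(r" \w+", uni) or re.search(r" .+", key) or re.search(r".+=", uni):
        print(f"'{key}': '{uni}'")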

'ƒ': 'ඳැ='
'ü': 'ඤූ='
'û': 'ඤු='
'´': ' ඕ'
'P': 'ඡ='
'“': ' ර්‍ණ'
' ’': 'ී'
' ‘': 'ි'
'œ': ' ර්‍්‍ය'


Not sure about ' ’': 'ී' and ' ‘': 'ි'. Leaving them as is for now. 

Fixing the rest

```
"%a": "a%"
"f*%a": "ෆ්‍රේ"
"f.%a": "ග්‍රේ"
"fl%a": "ක්‍රේ"
"fm%a": "ප්‍රේ"
```
There is a problem with `a%`: it can be used as either `a%` or `%a`. It is therefore taken out as a normalizing replacement rule and deleted from the mapping, and the remaining entries are rewritten according to the normalizing rule "%a": "a%".
```
"f*a%": "ෆ්‍රේ"
"f.a%": "ග්‍රේ"
"fla%": "ක්‍රේ"
"fma%": "ප්‍රේ"
```
The number of unique deduped mappings should now be 523.
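
The key rewrite described above could be sketched as follows. `normalize_keys` is a hypothetical helper, and the sample dictionary holds only a few entries taken from the lists above (values written as escapes); it is an illustration under those assumptions, not the edit that was actually applied to the file.

```python
def normalize_keys(mapping, old="%a", new="a%"):
    """Drop the rule entry itself and rewrite `old` to `new` inside every other key."""
    result = {}
    for key, value in mapping.items():
        if key == old:
            continue  # this entry becomes a normalizing rule instead of a mapping
        result[key.replace(old, new)] = value
    return result


# A few entries from the lists above; values are the Unicode strings as escapes.
sample = {
    "%a": "a%",
    "fm%a": "\u0db4\u0dca\u200d\u0dbb\u0dda",
    "fl%a": "\u0d9a\u0dca\u200d\u0dbb\u0dda",
}
fixed = normalize_keys(sample)
print(sorted(fixed))
```

Removing the `"%a"` entry is what reduces the count by one. Note that this sketch would silently overwrite an entry if a rewritten key collided with an existing one.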

Variant forms of ්, ු and ූ are normalized, since Unicode text should render them correctly through OpenType rules.
```
"A": "a"
"=": "q"
"+": "Q"
```

The following rules are left as is for clarity, even though they won't be used:
```
"A": "්"
"=": "ු"
"+": "ූ"
```

In [5]:
normalizing_rules = {"%a": "a%", "A": "a", "=": "q", "+": "Q"}
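
One way these rules might be applied is to run them over the legacy-font input before the main mapping lookup. The sketch below assumes plain substring replacement in dictionary order; `normalize_input` and the sample strings are hypothetical.

```python
normalizing_rules = {"%a": "a%", "A": "a", "=": "q", "+": "Q"}


def normalize_input(text, rules):
    """Apply each rule as a plain substring replacement, in dict order."""
    for old, new in rules.items():
        text = text.replace(old, new)
    return text


print(normalize_input("fm%a", normalizing_rules))  # fma%
print(normalize_input("l=", normalizing_rules))    # lq
```

Because the rules run in insertion order, an input such as `%A` ends up as `%a` and is not reordered to `a%`. Whether that case occurs in real text is not known here; if it does, the single-character rules would need to run first.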

Mappings like 'C' and 'J' are a bit confusing since they map to a combination of the form `<char><ZWJ>`. The zero width joiner is invisible, so it might look like there is only one glyph there.
```
"F": "ත්‍"
"J": "න්‍"
"C": "ක්‍"
```
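
Since the joiner cannot be seen, listing the code points of a mapped value makes it visible. This is a minimal sketch using the standard `unicodedata` module; the string assigned to `value_for_c` is an assumption about what the "C" entry contains and should be replaced with the actual value from the mapping.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name) for every character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


# Assumed content of the "C" mapping: consonant, al-lakuna, zero width joiner.
value_for_c = "\u0d9a\u0dca\u200d"
for code, name in describe(value_for_c):
    print(code, name)
```

The same helper can be run on any other value in the mapping to check for hidden joiners.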

A few sanyaka representations that are used in some places but missing from the UCSC mapping are added. These conform to the convention of using "`" for the sanyaka lakuna.
```
"`o": "ඳ"
"`P": "ඦ"
"`v": "ඬ"
```
Dwakara (`ද්‍<ZWJ><another char>`) is also missing from the UCSC mapping. These can be viewed as sanyaka forms of "ධ" and "ව", so "`"-based sanyaka mappings are added for them as well.

```
"Š": "ද්‍ධ"
"`O": "ද්‍ධ"
"„": "ද්‍ව"
"`j": "ද්‍ව"
```


The repaya + yansaya (œ) mapping has an issue. Since repaya needs to come before the consonant and yansaya after it, simple replacement is not possible. ".œ" is getting mapped to "ග ර්‍්‍ය", where the correct mapping would be "ර්‍ග්‍ය". This needs to be fixed by generating the mappings. For now, only the special-case mapping `"hH_": "ර්ය"` is fixed, replacing it with `"hH_": "ර්‍ය්‍ය"`.
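
Generating the mappings might look roughly like the sketch below: for each consonant key, emit a combined key whose value wraps the consonant with repaya in front and yansaya behind. The helper name, the constants, and the one-entry consonant table are hypothetical; only the "." to "ග" pairing and the target form are taken from the note above.

```python
REPAYA = "\u0dbb\u0dca\u200d"    # ra + al-lakuna + ZWJ, placed before the consonant
YANSAYA = "\u0dca\u200d\u0dba"   # al-lakuna + ZWJ + ya, placed after the consonant
COMBINED_KEY = "\u0153"          # the legacy-font key "œ"


def generate_repaya_yansaya(consonants):
    """Build '<consonant key>œ' entries with the consonant wrapped in repaya and yansaya."""
    return {
        key + COMBINED_KEY: REPAYA + consonant + YANSAYA
        for key, consonant in consonants.items()
    }


# Hypothetical consonant table with a single entry: "." maps to GA (U+0D9C).
consonants = {".": "\u0d9c"}
generated = generate_repaya_yansaya(consonants)
print(generated)
```

A real table would need every consonant key in the mapping, and the generated entries would have to take precedence over the standalone "œ" entry.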