# Telugu 2 IPA Converter
This project aims to create an encoding converter for converting Telugu (ISO 15924: `Telu`) text to IPA. I might later turn it into a generic Indic-to-IPA converter.

In `stage 1` of this project, I'm only converting the subset of Telugu used in a minority language named Kuvi (ISO 639-3: `kxv`).

This project was created to convert a Kuvi-language corpus into IPA. Because of restrictions on redistributing the text, I can't share the original corpus publicly. However, the code is open source and is shared under a `CC-BY-SA`, `MIT` license. Please feel free to modify it.

The crucial part of this project is a file that maps the Telugu Unicode code points to IPA. Since I'm not a Telugu speaker, I plan to rely on publicly available mappings, and I found a [Wikipedia article](https://en.wikipedia.org/wiki/Telugu_script) with the relevant information.

## The code
We use Python 3 for this project.

### Libraries

In [27]:
import os # for file operations
import re # for text parsing
import string #for cleanup
import unicodedata # for processing the unicode text
import pprint # for formatting

### Load source files
The source files are text marked up with `Standard Format Markup` (`sfm`).

In [5]:
# kuvi_corpus_path = "C:\Users\<user_name>\Kuvi"
kuvi_corpus_path = r"E:\My Paratext 9 Projects\KUVI"  # raw string, so the backslashes are not read as escapes
bible_files = [fn for fn in os.listdir(kuvi_corpus_path) if re.match(r'[0-9]+.*\.SFM', fn)]
print(bible_files)

['01GENKUVI.SFM', '02EXOKUVI.SFM', '03LEVKUVI.SFM', '04NUMKUVI.SFM', '05DEUKUVI.SFM', '06JOSKUVI.SFM', '07JDGKUVI.SFM', '091SAKUVI.SFM', '102SAKUVI.SFM', '122KIKUVI.SFM', '131CHKUVI.SFM', '142CHKUVI.SFM', '19PSAKUVI.SFM', '20PROKUVI.SFM', '23ISAKUVI.SFM', '24JERKUVI.SFM', '28HOSKUVI.SFM', '32JONKUVI.SFM', '33MICKUVI.SFM', '38ZECKUVI.SFM', '41MATKUVI.SFM', '42MRKKUVI.SFM', '43LUKKUVI.SFM', '44JHNKUVI.SFM', '45ACTKUVI.SFM', '46ROMKUVI.SFM', '471COKUVI.SFM', '482COKUVI.SFM', '49GALKUVI.SFM', '50EPHKUVI.SFM', '51PHPKUVI.SFM', '52COLKUVI.SFM', '531THKUVI.SFM', '542THKUVI.SFM', '551TIKUVI.SFM', '562TIKUVI.SFM', '57TITKUVI.SFM', '58PHMKUVI.SFM', '59HEBKUVI.SFM', '60JASKUVI.SFM', '611PEKUVI.SFM', '622PEKUVI.SFM', '631JNKUVI.SFM', '642JNKUVI.SFM', '653JNKUVI.SFM', '66JUDKUVI.SFM', '67REVKUVI.SFM', '94XXAKUVI.SFM', '95XXBKUVI.SFM']


In [6]:
bible_text = ""

for book in bible_files:
    # Use a context manager so each file handle is closed after reading
    with open(os.path.join(kuvi_corpus_path, book), 'r', encoding="utf-8") as f:
        bible_text += f.read()


### Cleanup
Strip the markup, Latin letters, punctuation and numerals, since we are not interested in them.

In [8]:
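#bible_text = re.sub(r"\\[A-Za-z0-9]*?\\s", "", bible_text)
bible_text = re.sub(r"\\id .*\n", "", bible_text)  # drop the \id header line of each book
bible_text = re.sub(r"\\[A-Za-z0-9]*?\s", "", bible_text)  # drop SFM markers and the whitespace after them
bible_text = re.sub(r"[A-Za-z0-9]", "", bible_text)  # drop Latin letters and digits
bible_text = re.sub(r"\n", "", bible_text)  # join everything into a single line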
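To see what these substitutions do, here is a minimal sketch that applies the same four steps to a tiny made-up SFM fragment. The function name `strip_sfm` and the sample text are assumptions for illustration only; the real corpus is not reproduced here.

```python
import re

def strip_sfm(text):
    """Apply the same four cleanup substitutions used in the cell above."""
    text = re.sub(r"\\id .*\n", "", text)          # drop the \id header line
    text = re.sub(r"\\[A-Za-z0-9]*?\s", "", text)  # drop markers such as \c or \v
    text = re.sub(r"[A-Za-z0-9]", "", text)        # drop Latin letters and digits
    return re.sub(r"\n", "", text)                 # remove line breaks

# Hypothetical sample: an \id line, a chapter marker and one verse
words = "తొల్లి మూలు"
sample = "\\id GEN Kuvi\n\\c 1\n\\v 1 " + words + "\n"

cleaned = strip_sfm(sample)
print(repr(cleaned))
```

Because line breaks are replaced with an empty string rather than a space, words at the end of one line are joined to the first word of the next, which is visible at the start of the cleaned corpus output further down.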

#### Remove punctuation characters

Remove punctuation from the text to avoid clutter.

> Although there are several ways to accomplish this, using the `translate` method of strings seems to be the most efficient.

[Source: Stack Overflow](https://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string)

In [19]:
s = bible_text

In [22]:
table = str.maketrans(dict.fromkeys(string.punctuation))  # OR {key: None for key in string.punctuation}
bible_text = s.translate(table) 
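
`string.punctuation` only contains ASCII punctuation, which is why the curly quotes and the ellipsis still appear in the character inventory further down. A minimal sketch of one possible alternative is to drop every character whose Unicode general category starts with `P`. The function name `strip_unicode_punctuation` is an assumption, and this is not the approach used in this notebook.

```python
import unicodedata

def strip_unicode_punctuation(text):
    """Drop every character whose Unicode general category is a punctuation class (P*)."""
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))

# Curly quotes, the ellipsis and the full stop are removed; letters, vowel signs and spaces stay
sample = "\u201cఆతె\u201d \u2026 ఓడె."
print(strip_unicode_punctuation(sample))
```

The warning below applies here as well: this removes all punctuation, including marks that may carry meaning.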

#### Python 2
> On Python 2, you'll have to use different code to accomplish this.

```python
import string

s = "string. With. Punctuation?"
table = string.maketrans("","")
new_s = s.translate(table, string.punctuation)      # Output: string without punctuation
```

#### Warning
While it is easy to run a global cleanup like this, you don't want to inadvertently remove punctuation marks that play a `semantic role` in the text. If your text uses punctuation that way, be careful and use another approach. NLTK has modules that can do this out of the box.



In [24]:
bible_text[:500]

'ఆదిఆదిఆదికాండం తొల్లిమూలు మహపురు హాగుతి బూమితి రచ్చి కిత్తెసి బూమిత ఏనయి హిల్లఅతె ఏని వాణవ హిల్లఅన వరఅయిఎ మచ్చె ఏ హారెఎ క్డుహాని ఏయుణ అందెరి ప్డిక్హాఁచె మహపురుజీవు ఏయులెక్కొ బర్రె పొర్హాఁచ్చెసి ఉజ్జెడి ఆపె ఇంజిఁ మహపురు వెస్సలిఎ ఉజ్జెడి ఆతె ఉజ్జెడి నెహఁయి ఇంజిఁ మహపురు మెస్తెసి అందెరిటి ఉజ్జెడితి మహపురు ఏర్సితెసి ఉజ్జెడితి మద్దెన ఇంజిఁ అందెరితి లాఅఁయఁ ఇంజిఁ మహపురు దోర్క ఇట్టితెసి అందెరి ఆహఁ వేయ్యలిఎ మూలుతి దిన్న ఆతె ఓడె ఏయుటి ఏయు ఎట్క ఆనిలేఁకిఁ ఏదఅఁ మద్ది హారెఎ పడ్డ ఆతి పొరొ ఆపెదెఁ ఇంజిఁ మహపురు వె'

### Create a character inventory with count

In [25]:
char_inventory = {}

for char in bible_text:
    try:
        char_inventory[char] += 1
    except KeyError:
        # First occurrence of this character
        char_inventory[char] = 1
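
The same inventory can also be built with `collections.Counter` from the standard library. This is a minimal sketch on a six-character sample, not the code used to produce the counts shown further down; `build_inventory` is a hypothetical helper name.

```python
from collections import Counter

def build_inventory(text):
    """Count how often each character (code point) occurs in text."""
    return Counter(text)

# Sample: TELUGU LETTER AA, LETTER DA, VOWEL SIGN I, repeated twice
sample_inventory = build_inventory("\u0c06\u0c26\u0c3f\u0c06\u0c26\u0c3f")

# most_common() lists the characters from most to least frequent
for char, count in sample_inventory.most_common():
    print(repr(char), count)
```

Note that vowel signs are counted as separate code points, just as in the loop above.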


In [None]:
# One line per character: the character, its Unicode name and its code point in hex
with open("kuvi.tab", mode='w', encoding='utf-8') as o:
    for key in char_inventory:
        # The default avoids a ValueError for characters that have no Unicode name
        o.write(key + "\t" + unicodedata.name(key, "Unknown") + "\t" + hex(ord(key)) + "\n")

with open("kuvi.txt", mode='w', encoding='utf-8') as o:
    o.write(bible_text)

In [29]:
dir(unicodedata)

['UCD',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 'bidirectional',
 'category',
 'combining',
 'decimal',
 'decomposition',
 'digit',
 'east_asian_width',
 'lookup',
 'mirrored',
 'name',
 'normalize',
 'numeric',
 'ucd_3_2_0',
 'ucnhash_CAPI',
 'unidata_version']

In [37]:
char_inventory

{'ఆ': 12093,
 'ద': 17059,
 'ి': 143611,
 'క': 53492,
 'ా': 44654,
 'ం': 23042,
 'డ': 20905,
 ' ': 208303,
 'త': 63654,
 'ొ': 26510,
 'ల': 35759,
 '్': 98091,
 'మ': 44405,
 'ూ': 5062,
 'ు': 55790,
 'హ': 41078,
 'ప': 25043,
 'ర': 55449,
 'గ': 19566,
 'బ': 16136,
 'చ': 22493,
 'ె': 54776,
 'స': 38738,
 'ఏ': 14313,
 'న': 65731,
 'య': 31460,
 'అ': 15572,
 'వ': 29543,
 'ణ': 15190,
 'ఎ': 17284,
 'ఁ': 51861,
 'జ': 28416,
 'ీ': 20653,
 'ఉ': 771,
 'ఇ': 21027,
 'ట': 22586,
 'ో': 11893,
 'ే': 14527,
 'ఓ': 2510,
 'ఈ': 2392,
 '\u200d': 884,
 '\u200c': 103,
 'ఊ': 403,
 'ఒ': 3167,
 'శ': 1849,
 'ఙ': 571,
 'ష': 15,
 'ఐ': 241,
 'ై': 257,
 'ౌ': 376,
 'ఖ': 1,
 'భ': 1,
 '“': 2015,
 '‘': 176,
 '’': 176,
 '”': 2015,
 'ఔ': 2,
 '…': 644}

In [104]:
indic2IPA = {}
for codepoint in range(0x0c01,0xc7f):
    tel_cons = chr(codepoint)
    name = unicodedata.name(tel_cons)
    try:
        indic2IPA[name.lower().split()[-1]] = ""
    except:
        #print(name)
        pass
#print("char: {}, name: {}, ipa: {}".format(tel_cons, "_".join((name.lower().split())), "?"))

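ValueError: no such name

The loop stops at U+0C0D, the first unassigned code point in the range: `unicodedata.name` is called outside the `try` block, so its `ValueError` is not caught.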
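
A minimal sketch of one way to let this loop run past unassigned code points is to give `unicodedata.name` a default value and skip the code points that have no name. The function name `telugu_name_keys` and its default range are assumptions for illustration.

```python
import unicodedata

def telugu_name_keys(start=0x0C01, end=0x0C7F):
    """Collect the last word of each Unicode name in the range as an empty mapping slot."""
    keys = {}
    for codepoint in range(start, end + 1):
        # Unassigned code points have no name; the default avoids a ValueError
        name = unicodedata.name(chr(codepoint), None)
        if name is None:
            continue
        # Caveat: different characters can share a last word (for example a
        # letter and its vowel sign), so they end up under the same key
        keys[name.lower().split()[-1]] = ""
    return keys

keys = telugu_name_keys()
print(len(keys), "distinct keys")
```

Which names are found depends on the Unicode version bundled with the Python interpreter (`unicodedata.unidata_version`).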

In [83]:
tel_alphabets = {
    0x0c01: {"char":"ఁ", "name": "Telugu Sign Candrabindu","ipa": "◌̃"},
    0x0c02: {"char":"ం","name": "Telugu Sign Anusvara", "ipa":"m̃"}
                }

In [116]:
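def getIPA(char):
    # char is the last word of a Unicode character name, e.g. "ka"
    indic2IPA = {
        "ka": "k", "kha": "kʰ", "ga": "g", "gha": "gʰ", "nga": "ŋ"
    }
    # "?" marks names that have no IPA value yet
    return indic2IPA.get(char, "?")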
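
As a minimal sketch of how this lookup can be driven directly from a character, the helper below derives the key from the character's Unicode name. The name `char_to_ipa` is an assumption, and the table contains only the five entries given above.

```python
import unicodedata

# Only the five entries from getIPA above; every other name falls back to "?"
INDIC2IPA = {"ka": "k", "kha": "kʰ", "ga": "g", "gha": "gʰ", "nga": "ŋ"}

def char_to_ipa(char):
    """Look up a character by the last word of its Unicode name."""
    name = unicodedata.name(char, "")
    if not name:
        return "?"  # unassigned or unnamed code point
    return INDIC2IPA.get(name.lower().split()[-1], "?")

print(char_to_ipa("\u0c15"))  # TELUGU LETTER KA
print(char_to_ipa("\u0c05"))  # TELUGU LETTER A, not in the table yet
```

Because the key ignores the script prefix of the name, the same table also answers for the corresponding letters of other Indic scripts, such as `"\u0d15"` (Malayalam KA).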

In [102]:
indic2IPA

{'candrabindu': '',
 'anusvara': '',
 'visarga': '',
 'above': '',
 'a': '',
 'aa': '',
 'i': '',
 'ii': '',
 'u': '',
 'uu': '',
 'r': '',
 'l': ''}

In [108]:
alphabet = "\u0d15"

In [135]:
#print("char: {}, unicode name: {}, IPA: {}".format(alphabet, "_".join(unicodedata.name(alphabet).lower().split()), getIPA(unicodedata.name(alphabet).lower().split()[-1])))
print('"{}" <> {} ; {}'.format("_".join(unicodedata.name(alphabet).lower().split()), getIPA(unicodedata.name(alphabet).lower().split()[-1]), alphabet))

"malayalam_letter_ka" <> k ; ക


In [140]:
0x0c01

3073

In [145]:
"\u0c7f"

'౿'

In [188]:
# teckitize_me() is defined in cell In [176] below
for ch in range(3073, 3200):  # 0x0C01 up to, but not including, 0x0C80
    try:
        teckName = teckitize_me(ch)
        ipaChar = (teckName.split("_")[-1])
        print('{} <> "{}" ; {}'.format(teckName, ipaChar, chr(ch)))
    except ValueError:
        # unicodedata.name() raises ValueError for empty slots in the Unicode chart; skip them
        pass

telugu_sign_candrabindu <> "candrabindu" ; ఁ
telugu_sign_anusvara <> "anusvara" ; ం
telugu_sign_visarga <> "visarga" ; ః
telugu_sign_combining_anusvara_above <> "above" ; ఄ
telugu_letter_a <> "a" ; అ
telugu_letter_aa <> "aa" ; ఆ
telugu_letter_i <> "i" ; ఇ
telugu_letter_ii <> "ii" ; ఈ
telugu_letter_u <> "u" ; ఉ
telugu_letter_uu <> "uu" ; ఊ
telugu_letter_vocalic_r <> "r" ; ఋ
telugu_letter_vocalic_l <> "l" ; ఌ
telugu_letter_e <> "e" ; ఎ
telugu_letter_ee <> "ee" ; ఏ
telugu_letter_ai <> "ai" ; ఐ
telugu_letter_o <> "o" ; ఒ
telugu_letter_oo <> "oo" ; ఓ
telugu_letter_au <> "au" ; ఔ
telugu_letter_ka <> "ka" ; క
telugu_letter_kha <> "kha" ; ఖ
telugu_letter_ga <> "ga" ; గ
telugu_letter_gha <> "gha" ; ఘ
telugu_letter_nga <> "nga" ; ఙ
telugu_letter_ca <> "ca" ; చ
telugu_letter_cha <> "cha" ; ఛ
telugu_letter_ja <> "ja" ; జ
telugu_letter_jha <> "jha" ; ఝ
telugu_letter_nya <> "nya" ; ఞ
telugu_letter_tta <> "tta" ; ట
telugu_letter_ttha <> "ttha" ; ఠ
telugu_letter_dda <> "dda" ; డ
telugu_letter_ddha <> 

In [170]:
unicodedata.name(chr(3199))

'TELUGU SIGN TUUMU'

In [176]:
def teckitize_me(ch):
    return "_".join(unicodedata.name(chr(ch)).lower().split())
    

In [177]:
print(teckitize_me(3379))

malayalam_letter_lla
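
The name generation and the loop from cell `In [188]` can be combined into one function that returns the lines instead of printing them. This is a minimal sketch: `mapping_lines` is a hypothetical name, and the line layout simply mirrors the printed output above rather than any confirmed TECkit syntax.

```python
import unicodedata

def mapping_lines(start, end):
    """Build one 'name <> "placeholder" ; char' line per assigned code point in [start, end)."""
    lines = []
    for codepoint in range(start, end):
        name = unicodedata.name(chr(codepoint), None)
        if name is None:
            continue  # empty slot in the Unicode chart
        teck_name = "_".join(name.lower().split())
        placeholder = teck_name.split("_")[-1]  # last word of the name, not yet real IPA
        lines.append('{} <> "{}" ; {}'.format(teck_name, placeholder, chr(codepoint)))
    return lines

for line in mapping_lines(0x0C15, 0x0C1A):
    print(line)
```

The placeholder still has to be replaced by an actual IPA value before the lines are useful as a mapping.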


In [193]:
header_metadata = """
; This file was created by <beniza> using TECkitMappingEditorU.exe v4.0.0.0 on 12/17/2019.
;   Conversion Type = Legacy_to_from_Unicode
;   Left-hand side font = Gautami;18
;   Right-hand side font = Charis SIL;15.75
;   Main Window Position = 0,0,658,713
;   Left-hand side Character Map Window Position = 658,0,457,447
;   Right-hand side Character Map Window Position = 658,447,457,413
"""

In [192]:
header = """
EncodingName            "Telugu2IPA"
DescriptiveName         "A Telugu to IPA converter"
Version                 "1"
Contact                 "mailto:beniza@gmail.com"
RegistrationAuthority   "New Life Computer Institute"
RegistrationName        "in.nlci.encodingconverter.telu2ipa"
Copyright               "© 2019 NLCI. CC-BY-SA."
LHSFlags                ()
RHSFlags                ()
"""

In [194]:
def printUni(ch):
    # Print the name of the first assigned code point at or after ch
    try:
        print("{}, {}".format(unicodedata.name(chr(ch)), hex(ch)))
    except ValueError:
        # Unassigned code point: try the next one
        printUni(ch+1)

base = 2304 # 0x0900, Devanagari (the first of the Indic script blocks)
for scriptnum in range(10):
    ch = base + (scriptnum * 128) # Each of these Indic Unicode blocks spans 128 code points
    printUni(ch)
    


DEVANAGARI SIGN INVERTED CANDRABINDU, 0x900
BENGALI ANJI, 0x980
GURMUKHI SIGN ADAK BINDI, 0xa01
GUJARATI SIGN CANDRABINDU, 0xa81
ORIYA SIGN CANDRABINDU, 0xb01
TAMIL SIGN ANUSVARA, 0xb82
TELUGU SIGN COMBINING CANDRABINDU ABOVE, 0xc00
KANNADA SIGN SPACING CANDRABINDU, 0xc80
MALAYALAM SIGN COMBINING ANUSVARA ABOVE, 0xd00
SINHALA SIGN ANUSVARAYA, 0xd82
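
Since each of these blocks spans 128 code points, a character can be moved to the same offset in another block. The sketch below assumes that the corresponding letter sits at the same offset, which holds for many but not all code points, so it should be read as an illustration rather than a reliable transliterator. `shift_block` is a hypothetical helper name.

```python
import unicodedata

BLOCK_SIZE = 128  # size of each Indic block listed above

def shift_block(char, source_base, target_base):
    """Move char to the same offset in another block (assumes a parallel layout)."""
    offset = ord(char) - source_base
    if not 0 <= offset < BLOCK_SIZE:
        raise ValueError("character is outside the source block")
    return chr(target_base + offset)

telugu_ka = "\u0c15"
malayalam_ka = shift_block(telugu_ka, 0x0C00, 0x0D00)
devanagari_ka = shift_block(telugu_ka, 0x0C00, 0x0900)
print(unicodedata.name(malayalam_ka))
print(unicodedata.name(devanagari_ka))
```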
