Python library for IT3 to UTF conversion for Indian languages. Initial code adopted from WX to UTF converter link.
Telugu tested fully. DLI metadata is full of errors, hence the tool based conversion to unicode is not highly useful.
Download indic-it3-converter from github.
Clone the repository: git clone https://github.com/arjunaraoc/indic-it3-converter Change to the cloned directory: cd indic-it3-converter pip install -r requirements.txt python setup.py install
(works with -f text only) it3conv --f ssf --t intra --n --l hin --s utf --i hin-utf.ssf --o hin-it3.ssf -l , --language select language (3 letter ISO-639 code) Hindi : hin Telugu : tel Tamil : tam Malayalam : mal Kannada : kan Bengali : ben Oriya : ori Punjabi : pan Marathi : mar Nepali : nep Gujarati : guj Bodo : bod Konkani : kok Assamese : asm Urdu : urd -s , --source-enc {utf, it3} select input-file encoding -f , --format {text, ssf, conll, bio, tnt} select input-file format -t , --ssf-type {inter, intra} specify ssf-type if file format (-f) is ssf -n, --nested set this flag for nested ssf -m, --no-mask set this flag to keep off masking of roman strings in Indic text -i , --input <input-file> -o , --output <output-file> -z, --normalize set this flag for utf normalizations without IT3-Conversion
- ::
- Not implemented
Example for Telugu it3 to Utf-8
-------------------------------
>>> from it3conv import IT3C
>>> it3c = IT3C(lang='tel', order='it32utf')
>>> tel_it3=u'''tirumala tirupati aan\'dhrulaku pund-ya kshheitramu'''
>>> tel_utf = it3c.convert(tel_it3)
>>> print(tel_utf)
... తిరుమల తిరుపతి ఆంధ్రులకు పుణ్య క్షేత్రము
>>>
Example for Hindi it3 to Utf-8
------------------------------
>>> from it3conv import IT3C
>>> it3c = IT3C(lang='hin', order='it32utf')
>>> hin_it3=u'''Ankhiya Nihar Ke Pag-dhuri Jhar Ke'''
>>> hin_utf = it3c.convert(hin_it3)
>>> print(hin_utf)
... अन्खिय निहर् कॆ पग्-धुरि झर् कॆ
>>>