# Indic NLP Library



The goal of the _Indic NLP Library_ is to build Python-based libraries for common text processing and Natural Language Processing tasks in Indian languages. Indian languages share many similarities in script, phonology, syntax, etc., and this library attempts to provide a general solution for the toolsets most commonly required for Indian language text.

The library provides the following functionalities: 

- Text Normalization
- Script Information
- Tokenization
- Word Segmentation
- Script Conversion
- Romanization
- Indicization
- Transliteration
- Translation

The data resources required by some modules of the Indic NLP Library are hosted in a separate repository. You can download them from the [Indic NLP Resources](https://github.com/anoopkunchukuttan/indic_nlp_resources) project.

## Pre-requisites

- Python 2.7+
- [Morfessor 2.0 Python Library](http://www.cis.hut.fi/projects/morpho/morfessor2.shtml)


In [13]:
import sys
sys.path.append('/DATA/src/python/indic_nlp_library/src')
from indicnlp import common
common.INDIC_RESOURCES_PATH="/DATA/src/python/indic_nlp_resources"

## Text Normalization

Text written in Indic scripts displays a lot of quirky behaviour on account of varying input methods, multiple representations for the same character, etc.
The representation of text needs to be canonicalized so that NLP applications can handle the data in a consistent manner. The canonicalization primarily handles the following issues:

- Non-spacing characters like ZWJ/ZWNJ
- Multiple representations of Nukta-based characters
- Multiple representations of two-part dependent vowel signs
- Typing inconsistencies: e.g. use of pipe (|) for poorna virama


In [14]:
from indicnlp.normalize.indic_normalize import IndicNormalizerFactory

input_text=u"\u0958 \u0915\u093c"
remove_nuktas=False
factory=IndicNormalizerFactory()
normalizer=factory.get_normalizer("hi",remove_nuktas)
output_text=normalizer.normalize(input_text)

print output_text
print 'Length before normalization: {}'.format(len(input_text))
print 'Length after normalization: {}'.format(len(output_text))

क़ क़
Length before normalization: 4
Length after normalization: 5
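
To make the issues listed above concrete, here is a minimal sketch that uses only the Python standard library. It is not the library's implementation: the `canonicalize` function is a hypothetical helper, and it assumes that every ZWJ/ZWNJ can be dropped and every pipe stands for a poorna virama, which is too aggressive for real text.

```python
import unicodedata

ZERO_WIDTH = {"\u200c", "\u200d"}  # ZWNJ, ZWJ
POORNA_VIRAMA = "\u0964"           # the danda character

def canonicalize(text):
    """Hypothetical helper: a rough approximation of canonicalization."""
    # Drop zero-width joiner / non-joiner characters (assumption: always safe).
    text = "".join(ch for ch in text if ch not in ZERO_WIDTH)
    # Treat a typed pipe as a poorna virama (assumption: no literal pipes).
    text = text.replace("|", POORNA_VIRAMA)
    # Unicode normalization gives nukta characters a single representation:
    # the precomposed U+0958 becomes U+0915 followed by U+093C.
    return unicodedata.normalize("NFC", text)

result = canonicalize("\u0958 \u0915\u093c")
print(len(result))
```

As in the cell above, both spellings of the nukta character end up as the same two-codepoint sequence, so the string grows from 4 to 5 codepoints.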


## Script Conversion

Convert from one Indic script to another. This is a simple converter which exploits the fact that the Unicode codepoints of the various Indic scripts are at corresponding offsets from the base codepoint of each script. The following scripts are supported:

_Devanagari (Hindi, Marathi, Sanskrit, Konkani, Sindhi, Nepali), Assamese, Bengali, Oriya, Gujarati, Gurumukhi (Punjabi), Sindhi, Tamil, Telugu, Kannada, Malayalam_

In [15]:
from indicnlp.transliterate.unicode_transliterate import UnicodeIndicTransliterator
input_text=u'राजस्थान'
print UnicodeIndicTransliterator.transliterate(input_text,"hi","pa")

ਰਾਜਸ੍ਥਾਨ
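
The offset idea can be sketched in a few lines of plain Python. This is an illustration of the principle, not the code behind `UnicodeIndicTransliterator`: the `SCRIPT_BASE` table and the `convert_script` function are names assumed for this example, and the sketch ignores characters that exist in one script but not in another.

```python
# Start of each script's Unicode block, keyed by language code
# (mapping assumed for this sketch).
SCRIPT_BASE = {
    "hi": 0x0900, "bn": 0x0980, "pa": 0x0A00, "gu": 0x0A80, "or": 0x0B00,
    "ta": 0x0B80, "te": 0x0C00, "kn": 0x0C80, "ml": 0x0D00,
}
BLOCK_SIZE = 0x80
# The danda and double danda live in the Devanagari block but are shared
# by the other scripts, so they are left untouched.
SHARED_OFFSETS = {0x64, 0x65}

def convert_script(text, src, tgt):
    """Hypothetical helper: shift each character to the same offset in tgt."""
    src_base, tgt_base = SCRIPT_BASE[src], SCRIPT_BASE[tgt]
    out = []
    for ch in text:
        offset = ord(ch) - src_base
        if 0 <= offset < BLOCK_SIZE and offset not in SHARED_OFFSETS:
            out.append(chr(tgt_base + offset))
        else:
            out.append(ch)  # outside the source block: copy unchanged
    return "".join(out)

print(convert_script("\u0930\u093e\u091c\u0938\u094d\u0925\u093e\u0928", "hi", "pa"))
```

For the word used above, each Devanagari codepoint `0x09xx` simply becomes the Gurmukhi codepoint `0x0Axx`.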


## Romanization

Convert Indic script text to Roman text in the ITRANS notation.

In [16]:
from indicnlp.transliterate.unicode_transliterate import ItransTransliterator

input_text=u'राजस्थान'
lang='hi'


print ItransTransliterator.to_itrans(input_text,lang)

rAjasthAna
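
Romanization has to deal with the inherent vowel: a consonant with no vowel sign and no virama after it is read with an implicit "a". The sketch below shows that rule on a very small, hand-picked subset of the ITRANS table. The tables and the `to_roman` function are assumptions made for this illustration and are not taken from `ItransTransliterator`.

```python
# Tiny illustrative subset of an ITRANS mapping (assumed for this sketch).
CONSONANTS = {
    "\u0915": "k",   # ka
    "\u091c": "j",   # ja
    "\u0925": "th",  # tha
    "\u0928": "n",   # na
    "\u0930": "r",   # ra
    "\u0938": "s",   # sa
}
VOWEL_SIGNS = {"\u093e": "A", "\u093f": "i", "\u0940": "I"}
VIRAMA = "\u094d"

def to_roman(text):
    """Hypothetical helper: romanize text covered by the tables above."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            if nxt in VOWEL_SIGNS:
                out.append(VOWEL_SIGNS[nxt])  # explicit vowel sign
                i += 1
            elif nxt == VIRAMA:
                i += 1                        # virama suppresses the vowel
            else:
                out.append("a")               # inherent vowel
        else:
            out.append(ch)                    # not in the tables: copy as-is
        i += 1
    return "".join(out)

print(to_roman("\u0930\u093e\u091c\u0938\u094d\u0925\u093e\u0928"))
```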


## Indicization (ITRANS to Indic Script)

Let's call the conversion of ITRANS text to an Indic script **Indicization**!


In [17]:
from indicnlp.transliterate.unicode_transliterate import ItransTransliterator


# input_text=u'rajasthAna'
input_text=u'pitL^In'
lang='hi'
x=ItransTransliterator.from_itrans(input_text,lang)
print x
for y in x:
    print '{:x}'.format(ord(y))

पितॣन्
92a
93f
924
963
928
94d


## Query Script Information

Indic scripts have been designed with phonetic principles in mind, and the design and organization of the scripts make it easy to obtain phonetic information about the characters.

In [18]:
from indicnlp.langinfo import *

c=u'क'
lang='hi'

print 'Is vowel?:  {}'.format(is_vowel(c,lang))
print 'Is consonant?:  {}'.format(is_consonant(c,lang))
print 'Is velar?:  {}'.format(is_velar(c,lang))
print 'Is palatal?:  {}'.format(is_palatal(c,lang))
print 'Is aspirated?:  {}'.format(is_aspirated(c,lang))
print 'Is unvoiced?:  {}'.format(is_unvoiced(c,lang))
print 'Is nasal?:  {}'.format(is_nasal(c,lang))

Is vowel?:  False
Is consonant?:  True
Is velar?:  True
Is palatal?:  False
Is aspirated?:  False
Is unvoiced?:  True
Is nasal?:  False
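
The reason such queries are cheap is that the stop consonants are laid out in rows of five (velar, palatal, retroflex, dental, labial), and the position within a row encodes voicing, aspiration and nasality. The following sketch derives those properties for Devanagari from the codepoint alone. It is an illustration of the layout, not the `langinfo` implementation, and `describe_stop` and `ROW_START` are names assumed for this example.

```python
DEVANAGARI_BASE = 0x0900
# Offset of the first consonant of each row of five stops.
ROW_START = {
    "velar": 0x15, "palatal": 0x1A, "retroflex": 0x1F,
    "dental": 0x24, "labial": 0x2A,
}

def describe_stop(ch):
    """Hypothetical helper: phonetic features of a Devanagari stop, or None."""
    offset = ord(ch) - DEVANAGARI_BASE
    for place, start in ROW_START.items():
        pos = offset - start
        if 0 <= pos < 5:
            return {
                "place": place,
                "aspirated": pos in (1, 3),   # 2nd and 4th letter of a row
                "voiced": pos in (2, 3, 4),   # 3rd, 4th and the nasal
                "nasal": pos == 4,            # 5th letter of a row
            }
    return None  # vowels, semivowels, fricatives, signs, other scripts

print(describe_stop("\u0915"))
```

For the letter queried above (U+0915), this gives a velar, unaspirated, unvoiced, non-nasal consonant, in agreement with the output of the cell.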


## Word Segmentation

Unsupervised morphological analyzers for various Indian languages. Given a word, the analyzer returns its component morphemes.
The analyzer can recognize inflectional and derivational morphemes.

The following languages are supported:

_Hindi, Punjabi, Marathi, Konkani, Gujarati, Bengali, Kannada, Tamil, Telugu, Malayalam_

Support for more languages will be added soon.

In [11]:
from indicnlp.morph import unsupervised_morph 
from indicnlp import common

analyzer=unsupervised_morph.UnsupervisedMorphAnalyzer('mr')

indic_string=u'आपल्या हिरड्यांच्या आणि दातांच्यामध्ये जीवाणू असतात .'

analyzes_tokens=analyzer.morph_analyze_document(indic_string.split(' '))

for w in analyzes_tokens: 
    print w

आपल्या
हिरड्या
ंच्या
आणि
दाता
ंच्या
मध्ये
जीवाणू
असतात
.


## Transliteration

In [19]:
import urllib2
from django.utils.encoding import iri_to_uri
from django.utils.http import urlquote

text=iri_to_uri(urlquote('anoop, ratish kal fone par baat karenge'))
url=u'http://www.cfilt.iitb.ac.in/indicnlpweb/indicnlpws/transliterate_bulk/en/hi/{}/statistical'.format(text)

response=urllib2.urlopen(url).read()
print response

{"hi":["नूप , रेटिश काल फोन पार बाट करेंगे"]}


## Tokenization

A trivial tokenizer which simply splits on punctuation boundaries. The punctuation set also includes the marks used in Indian language scripts (the purna virama and the deergha virama). It returns a list of tokens.


In [10]:
from indicnlp.tokenize import indic_tokenize  

indic_string=u'अनूप,अनूप?।फोन'

print u'Input String: {}'.format(indic_string)
print u'Tokens: '
for t in indic_tokenize.trivial_tokenize(indic_string): 
    print t

Input String: अनूप,अनूप?।फोन
Tokens: 
अनूप
,
अनूप
?
।
फोन
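
A tokenizer of this kind can be approximated with one regular expression. The sketch below is an assumption about how such a tokenizer might work, not the code of `indic_tokenize.trivial_tokenize`. It pads every ASCII punctuation mark and the two viramas with spaces and then splits on whitespace. The `simple_tokenize` name is hypothetical.

```python
import re
import string

# ASCII punctuation plus the purna virama (U+0964) and deergha virama (U+0965).
PUNCTUATION = string.punctuation + "\u0964\u0965"
PUNCT_PATTERN = re.compile("([" + re.escape(PUNCTUATION) + "])")

def simple_tokenize(text):
    """Hypothetical helper: split text into words and punctuation marks."""
    # Surround each punctuation mark with spaces, then split on whitespace.
    return PUNCT_PATTERN.sub(r" \1 ", text).split()

for token in simple_tokenize("\u0905\u0928\u0942\u092a,\u0905\u0928\u0942\u092a?\u0964\u092b\u094b\u0928"):
    print(token)
```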
