# Indic NLP Library

The goal of the _Indic NLP Library_ is to build Python-based libraries for common text processing and Natural Language Processing tasks in Indian languages. Indian languages share many similarities in script, phonology, syntax, etc., and this library attempts to provide a general solution for the toolsets most commonly needed for Indian language text.

The library provides the following functionalities: 

- Text Normalization
- Script Conversion
- Romanization
- Indicization
- Script Information
- Phonetic Similarity
- Syllabification
- Tokenization
- Word Segmentation
- Transliteration
- Translation

The data resources required by the Indic NLP Library are hosted in a separate repository. Some modules require these resources. You can download them from the [Indic NLP Resources](https://github.com/anoopkunchukuttan/indic_nlp_resources) project.

## Pre-requisites

- Python 2.7+
- [Morfessor 2.0 Python Library](http://www.cis.hut.fi/projects/morpho/morfessor2.shtml)


## Getting Started

**Set these variables**

In [1]:
# The path to the local git repo for Indic NLP library
INDIC_NLP_LIB_HOME="/home/baks/3-1/Research/Tools/indic_nlp_library"

# The path to the local git repo for Indic NLP Resources
INDIC_NLP_RESOURCES="/home/baks/3-1/Research/Tools/indic_nlp_resources"

**Add Library to Python path**

In [2]:
import sys
sys.path.append('{}/src'.format(INDIC_NLP_LIB_HOME))

**Export environment variable**

    export INDIC_RESOURCES_PATH=<path>

OR

**set it programmatically**

We will use that method for this demo.

In [3]:
from indicnlp import common
common.set_resources_path(INDIC_NLP_RESOURCES)

**Initialize the Indic NLP library**

In [4]:
from indicnlp import loader
loader.load()

**Let's actually try out some of the API methods in the Indic NLP library**

Many of the API functions require a language code. We use 2-letter ISO 639-1 codes. Some languages do not have assigned 2-letter codes. We use the following two-letter codes for such languages: 

- Konkani: kK
- Manipuri: mP
- Bodo: bD

## Text Normalization

Text written in Indic scripts displays many quirks on account of varying input methods, multiple representations of the same character, etc. 
There is a need to canonicalize the representation of text so that NLP applications can handle the data in a consistent manner. The canonicalization primarily handles the following issues: 

- Non-spacing characters like ZWJ/ZWNJ
- Multiple representations of Nukta-based characters
- Multiple representations of two-part dependent vowel signs
- Typing inconsistencies, e.g. use of the pipe (|) for poorna virama


In [5]:
from indicnlp.normalize.indic_normalize import IndicNormalizerFactory

input_text=u"\u0958 \u0915\u093c"
remove_nuktas=False
factory=IndicNormalizerFactory()
normalizer=factory.get_normalizer("hi",remove_nuktas)
output_text=normalizer.normalize(input_text)

print output_text
print 'Length before normalization: {}'.format(len(input_text))
print 'Length after normalization: {}'.format(len(output_text))

क़ क़
Length before normalization: 4
Length after normalization: 5
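
As a rough illustration of the issues listed above, the sketch below canonicalizes text using only the Python 3 standard library. It is not the library's normalizer: the name `canonicalize` is hypothetical, and dropping ZWJ/ZWNJ outright and replacing every pipe with a danda are simplifying assumptions. It relies on the fact that Unicode NFC normalization rewrites the precomposed nukta letter U+0958 as U+0915 U+093C, which is consistent with the lengths printed above.

```python
import unicodedata

ZERO_WIDTH = {"\u200c", "\u200d"}  # ZWNJ, ZWJ
DANDA = "\u0964"                   # Devanagari danda (poorna virama)


def canonicalize(text):
    """Toy canonicalization: NFC, strip zero-width joiners, pipe -> danda."""
    # NFC leaves nukta letters in decomposed form (base consonant + U+093C),
    # so both spellings of the same character end up identical.
    text = unicodedata.normalize("NFC", text)
    # Assumption: zero-width (non-)joiners carry no information we need.
    text = "".join(ch for ch in text if ch not in ZERO_WIDTH)
    # Assumption: every ASCII pipe was typed in place of a danda.
    return text.replace("|", DANDA)


raw = "\u0958 \u0915\u093c"
clean = canonicalize(raw)
print(len(raw), len(clean))
```

A real normalizer would need script-specific rules (for example, ZWJ and ZWNJ can be meaningful in some scripts), so treat this only as a picture of the kind of rewriting involved.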


## Script Conversion

Convert text from one Indic script to another. This is a simple conversion that exploits the fact that the Unicode codepoints of corresponding characters in the various Indic scripts are at the same offset from the base codepoint of each script. The following scripts are supported:

_Devanagari (Hindi,Marathi,Sanskrit,Konkani,Sindhi,Nepali), Assamese, Bengali, Oriya, Gujarati, Gurumukhi (Punjabi), Sindhi, Tamil, Telugu, Kannada, Malayalam_

In [6]:
from indicnlp.transliterate.unicode_transliterate import UnicodeIndicTransliterator
input_text=u'राजस्थान'
print UnicodeIndicTransliterator.transliterate(input_text,"hi","pa")

ਰਾਜਸ੍ਥਾਨ
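
The offset idea described above can be sketched in a few lines of standard-library Python 3. This is an illustration rather than the library's implementation: the `SCRIPT_BASE` table and the `convert_script` function are hypothetical names, and the sketch ignores the fact that some offsets are unassigned in some scripts.

```python
# Base codepoints of some Indic Unicode blocks; each block spans 128 codepoints.
SCRIPT_BASE = {
    "hi": 0x0900,  # Devanagari
    "bn": 0x0980,  # Bengali
    "pa": 0x0A00,  # Gurmukhi
    "gu": 0x0A80,  # Gujarati
    "or": 0x0B00,  # Oriya
    "ta": 0x0B80,  # Tamil
    "te": 0x0C00,  # Telugu
    "kn": 0x0C80,  # Kannada
    "ml": 0x0D00,  # Malayalam
}
BLOCK_SIZE = 0x80


def convert_script(text, src_lang, tgt_lang):
    """Shift every character of the source block to the same offset in the target block."""
    src_base = SCRIPT_BASE[src_lang]
    tgt_base = SCRIPT_BASE[tgt_lang]
    out = []
    for ch in text:
        offset = ord(ch) - src_base
        if 0 <= offset < BLOCK_SIZE:
            out.append(chr(tgt_base + offset))
        else:
            out.append(ch)  # leave characters outside the source block untouched
    return "".join(out)


print(convert_script("राजस्थान", "hi", "pa"))
```

For this input the sketch yields the same Gurmukhi string as the library call above, since every character involved has a counterpart at the same offset.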


## Romanization

Convert Indic script text to Roman text in the ITRANS notation.

In [7]:
from indicnlp.transliterate.unicode_transliterate import ItransTransliterator

input_text=u'राजस्थान'
lang='hi'


print ItransTransliterator.to_itrans(input_text,lang)

rAjasthAna
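
To give a feel for what romanization involves, here is a toy Python 3 sketch that handles only the handful of Devanagari characters needed for the example word. The mapping tables and the `romanize` function are hypothetical and far from complete; the point is the treatment of the inherent vowel `a`, which is emitted after a consonant unless a vowel sign or a virama follows.

```python
# Tiny, incomplete mapping tables (assumption: just enough for the demo word).
CONSONANTS = {
    "\u0930": "r",   # र
    "\u091c": "j",   # ज
    "\u0938": "s",   # स
    "\u0925": "th",  # थ
    "\u0928": "n",   # न
}
VOWEL_SIGNS = {
    "\u093e": "A",   # ा
    "\u093f": "i",   # ि
    "\u0940": "I",   # ी
}
VIRAMA = "\u094d"    # ्  suppresses the inherent vowel


def romanize(text):
    out = []
    for i, ch in enumerate(text):
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            # Inherent 'a' unless an explicit vowel sign or virama follows.
            if nxt not in VOWEL_SIGNS and nxt != VIRAMA:
                out.append("a")
        elif ch in VOWEL_SIGNS:
            out.append(VOWEL_SIGNS[ch])
        elif ch == VIRAMA:
            continue
        else:
            out.append(ch)  # unknown characters pass through unchanged
    return "".join(out)


print(romanize("\u0930\u093e\u091c\u0938\u094d\u0925\u093e\u0928"))
```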


## Indicization (ITRANS to Indic Script)

Let's call the conversion of ITRANS transliteration to an Indic script **Indicization**!


In [8]:
from indicnlp.transliterate.unicode_transliterate import ItransTransliterator


# input_text=u'rajasthAna'
input_text=u'pitL^In'
lang='hi'
x=ItransTransliterator.from_itrans(input_text,lang)
print x
for y in x:
    print '{:x}'.format(ord(y))

पितॣन्
92a
93f
924
963
928
94d


## Script Information

Indic scripts have been designed with phonetic principles in mind, and the design and organization of the scripts make it easy to obtain phonetic information about the characters.

### Get Phonetic Feature Vector

A phonetic feature vector is associated with each script character and encodes the phonetic properties of the character. This is a bit vector, which can be obtained as shown below:

In [5]:
from indicnlp.script import  indic_scripts as isc

c=u'क'
lang='hi'

isc.get_phonetic_feature_vector(c,lang)

array([0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0])

The fields in this bit vector are (from left to right):

In [24]:
sorted(isc.PV_PROP_RANGES.iteritems(),key=lambda x:x[1][0])

[('basic_type', [0, 6]),
 ('vowel_length', [6, 8]),
 ('vowel_strength', [8, 11]),
 ('vowel_status', [11, 13]),
 ('consonant_type', [13, 18]),
 ('articulation_place', [18, 23]),
 ('aspiration', [23, 25]),
 ('voicing', [25, 27]),
 ('nasalization', [27, 29]),
 ('vowel_horizontal', [29, 32]),
 ('vowel_vertical', [32, 36]),
 ('vowel_roundness', [36, 38])]

You can check the phonetic information database files in Indic NLP resources to know the definition of each of the bits. 

- _For Tamil Script_: [database](https://github.com/anoopkunchukuttan/indic_nlp_resources/blob/master/script/tamil_script_phonetic_data.csv) 
- _For other Indic Scripts_: [database](https://github.com/anoopkunchukuttan/indic_nlp_resources/blob/master/script/all_script_phonetic_data.csv)
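
The following Python 3 sketch shows one way to slice the feature vector into its named fields. The vector and the ranges are copied from the outputs above; the helper `split_fields` is a hypothetical name, and the ranges are assumed to be half-open (`[start, end)`), which the adjacent boundaries in the listing suggest.

```python
# Feature vector printed above for the Devanagari letter 'ka'.
KA_VECTOR = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1,
             0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# Field boundaries as listed above.
PV_PROP_RANGES = {
    "basic_type": [0, 6],
    "vowel_length": [6, 8],
    "vowel_strength": [8, 11],
    "vowel_status": [11, 13],
    "consonant_type": [13, 18],
    "articulation_place": [18, 23],
    "aspiration": [23, 25],
    "voicing": [25, 27],
    "nasalization": [27, 29],
    "vowel_horizontal": [29, 32],
    "vowel_vertical": [32, 36],
    "vowel_roundness": [36, 38],
}


def split_fields(vector, ranges):
    """Return {field name: sub-vector}, treating each range as [start, end)."""
    return {name: vector[start:end] for name, (start, end) in ranges.items()}


fields = split_fields(KA_VECTOR, PV_PROP_RANGES)
for name in sorted(fields, key=lambda n: PV_PROP_RANGES[n][0]):
    print(name, fields[name])
```

What each individual bit means is defined in the database files linked above; the sketch only recovers the grouping.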

### Query Phonetic Properties

**Note:** _The interface below will be deprecated soon and replaced by a new interface._

In [12]:
from indicnlp.langinfo import *

c=u'क'
lang='hi'

print 'Is vowel?:  {}'.format(is_vowel(c,lang))
print 'Is consonant?:  {}'.format(is_consonant(c,lang))
print 'Is velar?:  {}'.format(is_velar(c,lang))
print 'Is palatal?:  {}'.format(is_palatal(c,lang))
print 'Is aspirated?:  {}'.format(is_aspirated(c,lang))
print 'Is unvoiced?:  {}'.format(is_unvoiced(c,lang))
print 'Is nasal?:  {}'.format(is_nasal(c,lang))

Is vowel?:  False
Is consonant?:  True
Is velar?:  True
Is palatal?:  False
Is aspirated?:  False
Is unvoiced?:  True
Is nasal?:  False


### Get Phonetic Similarity

Using the phonetic feature vectors, we can define phonetic similarity between characters (and their underlying phonemes). The library implements several such similarity measures. Since they are all defined over the phonetic feature vectors discussed earlier, users can implement additional similarity measures.

The implemented similarity measures are: 

- cosine
- dice
- jaccard
- dot_product
- sim1 (Kunchukuttan _et al._, 2016)
- softmax

**References**

Anoop Kunchukuttan, Pushpak Bhattacharyya, Mitesh Khapra. _Substring-based unsupervised transliteration with phonetic and contextual knowledge_. SIGNLL Conference on Computational Natural Language Learning **(CoNLL 2016)**. 2016.

In [14]:
from indicnlp.script import  indic_scripts as isc
from indicnlp.script import  phonetic_sim as psim

c1=u'क'
c2=u'ख'
c3=u'भ'
lang='hi'

print u'Similarity between {} and {}'.format(c1,c2)
print psim.cosine(
    isc.get_phonetic_feature_vector(c1,lang),
    isc.get_phonetic_feature_vector(c2,lang)
    )

print

print u'Similarity between {} and {}'.format(c1,c3)
print psim.cosine(
    isc.get_phonetic_feature_vector(c1,lang),
    isc.get_phonetic_feature_vector(c3,lang)
    )


Similarity between क and ख
0.833331944447

Similarity between क and भ
0.499999166668
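
Because the measures operate on plain bit vectors, it is easy to experiment with your own. The Python 3 sketch below gives textbook definitions of cosine, Dice and Jaccard similarity; these are not taken from the library's source, so its exact formulas (for example, any smoothing term) may differ. `KA` is the vector printed earlier, while `OTHER` is a made-up vector that differs from it only in the two aspiration bits.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def dice(u, v):
    common = sum(a & b for a, b in zip(u, v))
    total = sum(u) + sum(v)
    return 2.0 * common / total if total else 0.0


def jaccard(u, v):
    common = sum(a & b for a, b in zip(u, v))
    union = sum(a | b for a, b in zip(u, v))
    return common / union if union else 0.0


KA = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1,
      0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# Hypothetical second vector: same as KA except the aspiration field (bits 23-24).
OTHER = list(KA)
OTHER[23], OTHER[24] = 1, 0

for measure in (cosine, dice, jaccard):
    print(measure.__name__, measure(KA, OTHER))
```

The two vectors share five of their six set bits, so cosine and Dice both come out at 5/6 and Jaccard at 5/7.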


_You may have figured out that you can also compute similarities of characters belonging to different scripts._

You can also get a similarity matrix which contains the similarities between all pairs of characters (within the same script or across scripts).

Let's see how we can compare the characters across Devanagari and Malayalam scripts

In [19]:
from indicnlp.script import  indic_scripts as isc
from indicnlp.script import  phonetic_sim as psim


slang='hi'
tlang='ml'
sim_mat=psim.create_similarity_matrix(psim.cosine,slang,tlang,normalize=False)

c1=u'क'
c2=u'ഖ'
print u'Similarity between {} and {}'.format(c1,c2)
print sim_mat[isc.get_offset(c1,slang),isc.get_offset(c2,tlang)]

Similarity between क and ഖ
0.833331944447


Some similarity functions like `sim1` do not generate values in the range [0,1], and it may be more convenient to have the similarity values in that range. This can be achieved by setting the `normalize` parameter to `True`.

In [21]:
slang='hi'
tlang='ml'
sim_mat=psim.create_similarity_matrix(psim.sim1,slang,tlang,normalize=True)

c1=u'क'
c2=u'ഖ'
print u'Similarity between {} and {}'.format(c1,c2)
print sim_mat[isc.get_offset(c1,slang),isc.get_offset(c2,tlang)]

Similarity between क and ഖ
0.0686089400193
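
To make the idea of a normalized similarity matrix concrete, here is a small Python 3 sketch that builds a pairwise matrix with an unbounded measure (a plain dot product) and rescales it to [0,1] with min-max scaling. This is only one possible normalization, chosen for illustration; the source does not say how the library's `normalize` option is computed, and the function names and the toy vectors below are hypothetical.

```python
def dot_product(u, v):
    return sum(a * b for a, b in zip(u, v))


def similarity_matrix(sim, src_vectors, tgt_vectors):
    """Matrix whose entry [i][j] is sim(src_vectors[i], tgt_vectors[j])."""
    return [[sim(u, v) for v in tgt_vectors] for u in src_vectors]


def min_max_normalize(matrix):
    """Rescale all entries linearly so the smallest is 0 and the largest is 1."""
    values = [x for row in matrix for x in row]
    lo, hi = min(values), max(values)
    if hi == lo:
        return [[0.0 for _ in row] for row in matrix]
    return [[(x - lo) / float(hi - lo) for x in row] for row in matrix]


# Toy feature vectors standing in for characters of two scripts.
src = [[1, 1, 0], [0, 1, 1]]
tgt = [[1, 1, 0], [1, 1, 1]]

raw = similarity_matrix(dot_product, src, tgt)
print(raw)
print(min_max_normalize(raw))
```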


## Orthographic Syllabification

_Orthographic Syllabification_ is an approximate syllabification process for Indic scripts, where C+V units (one or more consonants followed by a vowel) are defined to be _orthographic syllables_.

See the following paper for details:

Anoop Kunchukuttan, Pushpak Bhattacharyya. [_Orthographic Syllable as basic unit for SMT between Related Languages_](https://arxiv.org/abs/1610.00634). Conference on Empirical Methods in Natural Language Processing **(EMNLP 2016)**. 2016. 

In [23]:
from indicnlp.syllable import  syllabifier

w=u'जगदीशचंद्र'
lang='hi'

print u' '.join(syllabifier.orthographic_syllabify(w,lang))

ज ग दी श च ंद्र
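
A minimal Python 3 sketch of the segmentation idea for Devanagari follows. It is a simplification written for this README, not the library's algorithm: a new unit starts at every consonant or independent vowel unless the previous character is a virama (which glues consonants into a cluster), and all other signs attach to the current unit. The function names are hypothetical. Note that this sketch attaches the anusvara to the preceding unit, so for the word above it ends with `चं द्र`, which differs from the library output shown.

```python
VIRAMA = "\u094d"


def is_base(ch):
    """True for Devanagari independent vowels and consonants (incl. nukta letters)."""
    return "\u0904" <= ch <= "\u0939" or "\u0958" <= ch <= "\u095f"


def orthographic_units(word):
    units = []
    for i, ch in enumerate(word):
        joins_cluster = i > 0 and word[i - 1] == VIRAMA
        if not units or (is_base(ch) and not joins_cluster):
            units.append(ch)   # start a new unit
        else:
            units[-1] += ch    # vowel sign, virama, anusvara, or clustered consonant
    return units


print(" ".join(orthographic_units("जगदीशचंद्र")))
```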


## Tokenization 

A trivial tokenizer which splits on punctuation boundaries. The punctuation set includes marks used in Indian language scripts (the purna virama and the deergha virama). It returns a list of tokens.


In [24]:
from indicnlp.tokenize import indic_tokenize  

indic_string=u'अनूप,अनूप?।फोन'

print u'Input String: {}'.format(indic_string)
print u'Tokens: '
for t in indic_tokenize.trivial_tokenize(indic_string): 
    print t

Input String: अनूप,अनूप?।फोन
Tokens: 
अनूप
,
अनूप
?
।
फोन
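
A tokenizer of this kind can be approximated with a single regular expression. The Python 3 sketch below is an assumption about how such a tokenizer might look, not a copy of the library's code: it pads ASCII punctuation plus the danda (U+0964) and double danda (U+0965) with spaces and then splits on whitespace. `simple_tokenize` is a hypothetical name.

```python
import re
import string

# ASCII punctuation plus the Devanagari danda and double danda.
PUNCTUATION = string.punctuation + "\u0964\u0965"
PUNCT_RE = re.compile("([" + re.escape(PUNCTUATION) + "])")


def simple_tokenize(text):
    """Split text into tokens, treating each punctuation mark as its own token."""
    spaced = PUNCT_RE.sub(r" \1 ", text)
    return spaced.split()


for token in simple_tokenize("अनूप,अनूप?।फोन"):
    print(token)
```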


## Word Segmentation

Unsupervised morphological analysers for various Indian languages. Given a word, the analyzer returns its component morphemes. 
The analyzer can recognize inflectional and derivational morphemes. 

The following languages are supported:

_Hindi, Punjabi, Marathi, Konkani, Gujarati, Bengali, Kannada, Tamil, Telugu, Malayalam_

Support for more languages will be added soon.

In [9]:
from indicnlp.morph import unsupervised_morph 
from indicnlp import common

analyzer=unsupervised_morph.UnsupervisedMorphAnalyzer('mr')

In [10]:
indic_string=u'आपल्या हिरड्यांच्या आणि दातांच्यामध्ये जीवाणू असतात .'

analyzes_tokens=analyzer.morph_analyze_document(indic_string.split(' '))

for w in analyzes_tokens: 
    print w

आपल्या
हिरड्या
ंच्या
आणि
दाता
ंच्या
मध्ये
जीवाणू
असतात
.


## Transliteration

We use the [_BrahmiNet_](http://www.cfilt.iitb.ac.in/brahminet/static/rest.html) REST API for transliteration.

In [None]:
import urllib2
from django.utils.encoding import * 
from django.utils.http import * 

text=iri_to_uri(urlquote('anoop, ratish kal fone par baat karenge'))
url=u'http://www.cfilt.iitb.ac.in/indicnlpweb/indicnlpws/transliterate_bulk/en/hi/{}/statistical'.format(text)

response=urllib2.urlopen(url).read()
print response
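
If Django is not available, the request URL can be built with the Python 3 standard library alone. The sketch below only constructs the URL and does not contact the service; it assumes the endpoint pattern shown above is still valid, and `build_transliteration_url` is a hypothetical helper name.

```python
from urllib.parse import quote

BASE_URL = "http://www.cfilt.iitb.ac.in/indicnlpweb/indicnlpws/transliterate_bulk"


def build_transliteration_url(text, src_lang="en", tgt_lang="hi", mode="statistical"):
    """Percent-encode the text and place it in the URL pattern used above."""
    # safe="" also encodes "/", so the text always stays a single path segment.
    encoded = quote(text, safe="")
    return "{}/{}/{}/{}/{}".format(BASE_URL, src_lang, tgt_lang, encoded, mode)


print(build_transliteration_url("anoop, ratish kal fone par baat karenge"))
```

The resulting string can then be passed to `urllib.request.urlopen` in place of the `urllib2` call.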

You can also use _BrahmiNet_ through [this](http://www.cfilt.iitb.ac.in/brahminet) web interface.  

You can read more about _BrahmiNet_ [here](http://www.cfilt.iitb.ac.in/brahminet/static/publications/brahminet_naacl2015.pdf)