
# Introduction to Indic NLP Library

Now we will get some hands-on experience with the Indic NLP Library. While the libraries discussed in the previous notebook work for English, Indian languages sometimes require additional handling for tasks such as tokenization and sentence splitting.

More details can be found at https://github.com/anoopkunchukuttan/indic_nlp_library

## Set-up

### Download the Indic NLP Library resources

In [5]:
!git clone https://github.com/anoopkunchukuttan/indic_nlp_resources.git

fatal: destination path 'indic_nlp_resources' already exists and is not an empty directory.
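The `fatal:` message above appears when the cell is re-run after the repository has already been cloned, and it is harmless. A minimal sketch of a re-runnable alternative follows; `ensure_resources` and its `runner` parameter are hypothetical names introduced here for illustration, not part of the library.

```python
import os
import subprocess

RESOURCES_URL = "https://github.com/anoopkunchukuttan/indic_nlp_resources.git"

def ensure_resources(target_dir="indic_nlp_resources", runner=subprocess.run):
    """Clone the resources repository unless target_dir already exists.

    Returns True if a clone was started, False if the folder was already there.
    `runner` is injectable (an assumption of this sketch) so the logic can be
    exercised without network access.
    """
    if os.path.isdir(target_dir):
        return False  # already present, nothing to do
    runner(["git", "clone", RESOURCES_URL, target_dir], check=True)
    return True
```

Calling `ensure_resources()` in place of the `!git clone` line would skip the clone on later runs instead of printing an error.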


Set up the path to the Indic NLP Resources folder

## Initialize the Indic NLP library

In [6]:
INDIC_NLP_RESOURCES=r"indic_nlp_resources/"

In [8]:
pip install indic-nlp-library

Collecting indic-nlp-library
  Downloading indic_nlp_library-0.81-py3-none-any.whl (40 kB)
Collecting sphinx-argparse
  Downloading sphinx_argparse-0.3.1-py2.py3-none-any.whl (12 kB)
Collecting morfessor
  Downloading Morfessor-2.0.6-py3-none-any.whl (35 kB)
Collecting sphinx-rtd-theme
  Downloading sphinx_rtd_theme-1.0.0-py2.py3-none-any.whl (2.8 MB)
Installing collected packages: sphinx-rtd-theme, sphinx-argparse, morfessor, indic-nlp-library
Successfully installed indic-nlp-library-0.81 morfessor-2.0.6 sphinx-argparse-0.3.1 sphinx-rtd-theme-1.0.0


In [9]:
import sys
from indicnlp import common
common.set_resources_path(INDIC_NLP_RESOURCES)

In [10]:
from indicnlp import loader
loader.load()

We will now try out some of the APIs provided by the library.

**NOTE:** Many APIs require us to specify the language we are working with. The language is given as a 2-letter ISO 639-1 code. The full list of codes can be found at https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes

Some languages do not have assigned 2-letter codes. The library uses the following two-letter codes for such languages:

 - Konkani : kK
 - Manipuri : mP
 - Bodo : bD

## Text Normalization

Normalization standardizes text written in Indic scripts. Some of the issues handled are:
 - Non-spacing characters
 - Multiple representations of Nukta based characters
 - Multiple representations of two part dependent vowel signs
 - Typing inconsistencies: e.g. use of pipe (|) for poorna virama

In [11]:
def normalize_text(input_text, normalizer):
    output_text=normalizer.normalize(input_text)

    print(input_text)
    print()

    print('Before normalization')
    print(' '.join([ hex(ord(c)) for c in input_text ] ))
    print('Length: {}'.format(len(input_text)))
    print()    
    print('After normalization')
    print(' '.join([ hex(ord(c)) for c in output_text ] ))
    print('Length: {}'.format(len(output_text)))    

In [12]:
from indicnlp.normalize.indic_normalize import IndicNormalizerFactory

factory=IndicNormalizerFactory()
normalizer=factory.get_normalizer("hi")
# Reference for Hindi https://unicode.org/charts/PDF/U0900.pdf

input_text="\u0958 \u0915\u093c"

normalize_text( input_text, normalizer )

input_text="\u0959 \u0916\u093c"

normalize_text( input_text, normalizer )

क़ क़

Before normalization
0x958 0x20 0x915 0x93c
Length: 4

After normalization
0x915 0x93c 0x20 0x915 0x93c
Length: 5
ख़ ख़

Before normalization
0x959 0x20 0x916 0x93c
Length: 4

After normalization
0x916 0x93c 0x20 0x916 0x93c
Length: 5
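The nukta case shown above can also be reproduced with the Python standard library alone, which helps when checking what the normalizer did. This is a minimal sketch, not the library's implementation: it relies on the fact that Unicode treats the pre-composed Devanagari nukta letters (U+0958 to U+095F) as composition exclusions, so NFC normalization leaves them in the two-code-point form. It does not cover the other issues listed earlier, such as the pipe used for poorna virama. The helper names are hypothetical.

```python
import unicodedata

def decompose_nukta(text):
    # NFC keeps the pre-composed nukta letters U+0958..U+095F decomposed,
    # because Unicode lists them as composition exclusions.
    return unicodedata.normalize("NFC", text)

def codepoints(text):
    # Same hex dump format as the normalize_text helper in this notebook.
    return " ".join(hex(ord(c)) for c in text)

sample = "\u0958 \u0915\u093c"
print(codepoints(sample))
print(codepoints(decompose_nukta(sample)))
```

Comparing this dump with the output of the library's normalizer is a quick way to see which changes come from standard Unicode normalization and which are specific to the library.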


#### To-Do: Find more instances of such issues in Hindi or your native language and print them here

In [13]:
input_text="\u095e \u092b\u093c"
normalize_text( input_text, normalizer )

फ़ फ़

Before normalization
0x95e 0x20 0x92b 0x93c
Length: 4

After normalization
0x92b 0x93c 0x20 0x92b 0x93c
Length: 5


In [14]:
input_text="\u095d \u0922\u093c"
normalize_text( input_text, normalizer )

ढ़ ढ़

Before normalization
0x95d 0x20 0x922 0x93c
Length: 4

After normalization
0x922 0x93c 0x20 0x922 0x93c
Length: 5


In [15]:
input_text="\u095b \u091c\u093c"
normalize_text( input_text, normalizer )

ज़ ज़

Before normalization
0x95b 0x20 0x91c 0x93c
Length: 4

After normalization
0x91c 0x93c 0x20 0x91c 0x93c
Length: 5


### Sentence Splitter

Rule-based system to split text into sentences

In [16]:
from indicnlp.tokenize import sentence_tokenize

indic_string="बहुत समय पहले की बात है. एक घने जंगल में एक तोता अपने दो बच्चों के साथ रहता है. उनका जीवन ख़ुशी-ख़ुशी बीत रहा था."
sentences=sentence_tokenize.sentence_split(indic_string, lang='hi')
for t in sentences:
    print(t)


बहुत समय पहले की बात है.
एक घने जंगल में एक तोता अपने दो बच्चों के साथ रहता है.
उनका जीवन ख़ुशी-ख़ुशी बीत रहा था.


#### To-Do: Experiment with sentence-splitting for your native language. Find instances where the model could fail

In [18]:
from indicnlp.tokenize import sentence_tokenize

indic_string="एक गाँव में एक दयालु दानव रहता था. वह हमेशा लोगों की मदद किया करता था. एक दिन ऐसा हुआ कि दो ख़तरनाक भालुओं का जोड़ा उस गाँव में आ गया."
sentences=sentence_tokenize.sentence_split(indic_string, lang='hi')
for t in sentences:
    print(t)

एक गाँव में एक दयालु दानव रहता था.
वह हमेशा लोगों की मदद किया करता था.
एक दिन ऐसा हुआ कि दो ख़तरनाक भालुओं का जोड़ा उस गाँव में आ गया.
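To look for failure cases, it can help to have a deliberately naive baseline to compare against. The sketch below is an assumption-laden stand-in written with the standard library only; it is not how `sentence_split` works internally. It splits after a danda, full stop, question mark, or exclamation mark that is followed by whitespace, and therefore breaks wrongly after abbreviations such as "डॉ.". Whether the library's splitter handles a given abbreviation is worth checking case by case.

```python
import re

# Split after a sentence-final mark (danda, full stop, ?, !) followed by whitespace.
_BOUNDARY = re.compile(r"(?<=[\u0964.?!])\s+")

def naive_sentence_split(text):
    """Hypothetical baseline splitter; knows nothing about abbreviations."""
    return [s for s in _BOUNDARY.split(text.strip()) if s]

for s in naive_sentence_split("यह पहला वाक्य है. यह दूसरा वाक्य है। क्या यह तीसरा है?"):
    print(s)

# The abbreviation is wrongly treated as a sentence end by this baseline.
print(naive_sentence_split("डॉ. शर्मा आए."))
```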


### Tokenization

Tokenize based on punctuation boundary

In [19]:
from indicnlp.tokenize import indic_tokenize  

indic_string='उनका जीवन ख़ुशी-ख़ुशी बीत रहा था.'

print('Input String: {}'.format(indic_string))
print('Tokens: ')
for t in indic_tokenize.trivial_tokenize(indic_string): 
    print(t)

Input String: उनका जीवन ख़ुशी-ख़ुशी बीत रहा था.
Tokens: 
उनका
जीवन
ख़ुशी
-
ख़ुशी
बीत
रहा
था
.
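The idea of tokenizing on punctuation boundaries can be sketched with the standard library: pad every punctuation mark with spaces, then split on whitespace. This is an illustrative approximation, not the library's code, and the punctuation set used here (ASCII punctuation plus danda and double danda) is an assumption. Note that a pattern such as `\w+` is unsuitable for Indic scripts in Python, because vowel signs and the virama are combining marks that `\w` does not match.

```python
import re
import string

# ASCII punctuation plus danda (U+0964) and double danda (U+0965).
_PUNCT = string.punctuation + "\u0964\u0965"
_PUNCT_RE = re.compile("([" + re.escape(_PUNCT) + "])")

def naive_tokenize(text):
    """Hypothetical helper: isolate each punctuation mark, then split on spaces."""
    return _PUNCT_RE.sub(r" \1 ", text).split()

for token in naive_tokenize("तब से वे एक-दूसरे से मिलने की कोशिश करते हैं।"):
    print(token)
```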


#### Detokenization

Punctuation is normally written without white-space separating it from the adjacent word, but tokenization splits it off. The detokenizer restores the natural spacing.

In [20]:
from indicnlp.tokenize import indic_detokenize  
indic_string = ' '.join( indic_tokenize.trivial_tokenize(indic_string) )

print('Input String: {}'.format(indic_string))
print('Detokenized String: {}'.format(indic_detokenize.trivial_detokenize(indic_string,lang='hi')))


Input String: उनका जीवन ख़ुशी - ख़ुशी बीत रहा था .
Detokenized String: उनका जीवन ख़ुशी-ख़ुशी बीत रहा था.
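The effect of detokenization can be approximated with two regular-expression rules: remove the space before sentence punctuation, and remove the spaces around a hyphen. This is a rough sketch under those two assumed rules only; the library's detokenizer takes a `lang` argument and may apply different rules, for example for quotes and brackets. `naive_detokenize` is a hypothetical name.

```python
import re

def naive_detokenize(text):
    # Rule 1: attach sentence punctuation (including danda) to the preceding word.
    text = re.sub(r"\s+([.,;:!?\u0964])", r"\1", text)
    # Rule 2: re-join hyphenated compounds such as "एक - दूसरे".
    text = re.sub(r"\s*-\s*", "-", text)
    return text

print(naive_detokenize("तब से वे एक - दूसरे से मिलने की कोशिश करते हैं ."))
```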


#### To-Do: Experiment with tokenization for your native language.

In [21]:
indic_string='तब से वे एक-दूसरे से मिलने की कोशिश करते हैं'

print('Input String: {}'.format(indic_string))
print('Tokens: ')
for t in indic_tokenize.trivial_tokenize(indic_string): 
    print(t)

Input String: तब से वे एक-दूसरे से मिलने की कोशिश करते हैं
Tokens: 
तब
से
वे
एक
-
दूसरे
से
मिलने
की
कोशिश
करते
हैं


In [22]:
from indicnlp.tokenize import indic_detokenize  
indic_string = ' '.join( indic_tokenize.trivial_tokenize(indic_string) )

print('Input String: {}'.format(indic_string))
print('Detokenized String: {}'.format(indic_detokenize.trivial_detokenize(indic_string,lang='hi')))

Input String: तब से वे एक - दूसरे से मिलने की कोशिश करते हैं
Detokenized String: तब से वे एक-दूसरे से मिलने की कोशिश करते हैं


### Script Conversion

Convert from one Indic script to another using a rule-based system

The following scripts are supported:

Devanagari (Hindi, Marathi, Sanskrit, Konkani, Sindhi, Nepali), Assamese, Bengali, Oriya, Gujarati, Gurumukhi (Punjabi), Sindhi, Tamil, Telugu, Kannada, Malayalam

In [23]:
from indicnlp.transliterate.unicode_transliterate import UnicodeIndicTransliterator
input_text = 'राजस्थान'
print(UnicodeIndicTransliterator.transliterate(input_text,"hi","kn"))

ರಾಜಸ್ಥಾನ
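Rule-based conversion between these scripts is feasible because the major Indic Unicode blocks follow a parallel layout: the same letter sits at the same offset from the start of each 128-code-point block. The sketch below illustrates that idea with the standard library only. It is an assumption that this approximates what the library does, and it ignores the exceptions a real converter must handle (for instance, some target slots are unassigned, notably in Tamil). `offset_transliterate` and `SCRIPT_BLOCK_START` are hypothetical names.

```python
# First code point of each script's Unicode block, keyed by language code.
SCRIPT_BLOCK_START = {
    "hi": 0x0900, "bn": 0x0980, "pa": 0x0A00, "gu": 0x0A80, "or": 0x0B00,
    "ta": 0x0B80, "te": 0x0C00, "kn": 0x0C80, "ml": 0x0D00,
}
BLOCK_SIZE = 0x80
# Danda and double danda live in the Devanagari block but are shared by all scripts.
SHARED_OFFSETS = (0x64, 0x65)

def offset_transliterate(text, src, tgt):
    """Map each character to the same offset in the target script's block."""
    src_start = SCRIPT_BLOCK_START[src]
    tgt_start = SCRIPT_BLOCK_START[tgt]
    out = []
    for ch in text:
        offset = ord(ch) - src_start
        if 0 <= offset < BLOCK_SIZE and offset not in SHARED_OFFSETS:
            out.append(chr(tgt_start + offset))
        else:
            out.append(ch)  # spaces, digits, Latin text, shared punctuation
    return "".join(out)

print(offset_transliterate("राजस्थान", "hi", "kn"))
```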


#### To-Do: Experiment with Script conversion between various language pairs

In [30]:
input_text = 'बरेली'
print(UnicodeIndicTransliterator.transliterate(input_text,"hi","ml"))  #Malayalam
print(UnicodeIndicTransliterator.transliterate(input_text,"hi","or"))  #Oriya
print(UnicodeIndicTransliterator.transliterate(input_text,"hi","bn"))  #Bengali

ബരേലീ
ବରେଲୀ
বরেলী


### Romanization

Convert Indic script text to Roman text in ITRANS notation

In [31]:
from indicnlp.transliterate.unicode_transliterate import ItransTransliterator

input_text = 'ರುದ್ರ ಮೂರ್ತಿ'
lang='kn'

itrans_text = ItransTransliterator.to_itrans(input_text,lang)

print(itrans_text)


rudra muurti
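The output shows two conventions of this style of romanization: a consonant carries an inherent "a" unless a vowel sign or virama follows it, and long vowels are written with doubled letters ("uu"). The toy sketch below illustrates just those two rules for a handful of Devanagari characters. The mapping tables are a tiny hand-picked subset chosen for illustration; they are not the library's tables, and `toy_romanize` is a hypothetical name.

```python
# Tiny illustrative subset of a Devanagari-to-Roman table (not the library's).
CONSONANTS = {
    "\u0915": "k",  # ka
    "\u0924": "t",  # ta
    "\u0926": "d",  # da
    "\u092e": "m",  # ma
    "\u0930": "r",  # ra
}
VOWEL_SIGNS = {
    "\u093e": "aa", "\u093f": "i", "\u0940": "ii", "\u0941": "u", "\u0942": "uu",
}
VIRAMA = "\u094d"  # suppresses the inherent vowel

def toy_romanize(text):
    out = []
    for i, ch in enumerate(text):
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            # Add the inherent vowel unless a vowel sign or virama follows.
            if nxt not in VOWEL_SIGNS and nxt != VIRAMA:
                out.append("a")
        elif ch in VOWEL_SIGNS:
            out.append(VOWEL_SIGNS[ch])
        elif ch == VIRAMA:
            continue
        else:
            out.append(ch)  # pass through anything outside the toy tables
    return "".join(out)

print(toy_romanize("\u0930\u0941\u0926\u094d\u0930 \u092e\u0942\u0930\u094d\u0924\u093f"))
```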


#### To-Do: Experiment with Romanization between your native language and English

In [33]:
input_text = 'ज़ैनब फातिमा'
lang='hi'

itrans_text = ItransTransliterator.to_itrans(input_text,lang)

print(itrans_text)

jaainaba phaatimaa


### Indicization (ITRANS to Indic Script)

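Convert Roman text in ITRANS notation to an Indic script. The cell below reuses `itrans_text` from the previous cell (romanized from Hindi) and renders it in Kannada.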

In [34]:
from indicnlp.transliterate.unicode_transliterate import ItransTransliterator

lang='kn'

x = ItransTransliterator.from_itrans(itrans_text,lang)
print(x)
for y in x:
    print('{:x}'.format(ord(y)))


ಜಾಇನಬ ಫಾತಿಮಾ
c9c
cbe
c87
ca8
cac
20
cab
cbe
ca4
cbf
cae
cbe


### Word Segmentation

Perform unsupervised word segmentation using Morfessor

The following languages are supported:

Hindi, Punjabi, Marathi, Konkani, Gujarati, Bengali, Kannada, Tamil, Telugu, Malayalam

In [35]:
from indicnlp.morph import unsupervised_morph 
from indicnlp import common

analyzer=unsupervised_morph.UnsupervisedMorphAnalyzer('kn')

In [36]:
indic_string='ರೈತನೊಬ್ಬನ ತೋಟದಲ್ಲಿದ್ದ ಸೇಬು ಮರದಲ್ಲಿಒಂದೇ ಒಂದು ಹಣ್ಣು ಬೆಳೆಯಿತು .'

analyzes_tokens=analyzer.morph_analyze_document(indic_string.split(' '))

for w in analyzes_tokens: 
    print(w)

ರೈತ
ನೊಬ್ಬನ
ತೋಟ
ದಲ್ಲಿದ್ದ
ಸೇಬು
ಮರ
ದಲ್ಲಿ
ಒಂದೇ
ಒಂದು
ಹಣ್ಣು
ಬೆಳೆಯ
ಿತು
.


#### To-Do: Experiment with Word Segmentation for your native language

In [37]:
analyzer=unsupervised_morph.UnsupervisedMorphAnalyzer('hi')

indic_string = "एक गाँव में एक दयालु दानव रहता था. वह हमेशा लोगों की मदद किया करता था. एक दिन ऐसा हुआ कि दो ख़तरनाक भालुओं का जोड़ा उस गाँव में आ गया."

# Splitting on spaces keeps sentence-final punctuation attached to the preceding word (e.g. 'था.')
analyzes_tokens=analyzer.morph_analyze_document(indic_string.split(' '))

for w in analyzes_tokens: 
    print(w)

एक
गाँव
में
एक
दयालु
दानव
रहता
था.
वह
हमेशा
लोगों
की
मदद
किया
कर
ता
था.
एक
दिन
ऐसा
हुआ
कि
दो
ख़तरनाक
भालु
ओं
का
जोड़ा
उस
गाँव
में
आ
गया.
