# Marathi with CLTK

Analyse Marathi texts using CLTK!

First, we store the path of the user's home directory, under which our corpora will reside.

In [1]:
import os
USER_PATH = os.path.expanduser('~')

Before analysing the texts, we need to download the Marathi corpora using a `CorpusImporter`. Create an importer for Marathi as follows:

In [2]:
from cltk.corpus.utils.importer import CorpusImporter
marathi_corpus_importer = CorpusImporter('marathi')

You can view the corpora available for download through the importer's `list_corpora` property:

In [3]:
marathi_corpus_importer.list_corpora

['marathi_text_wikisource']

In [4]:
marathi_corpus_importer.import_corpus('marathi_text_wikisource');

You can verify that the `marathi_text_wikisource` corpus has been downloaded to the `cltk_data/marathi/text` folder under the path given by `USER_PATH`. The texts within can now be analysed. List the available datasets as shown:

In [5]:
marathi_corpus_path = os.path.join(USER_PATH,'cltk_data/marathi/text/marathi_text_wikisource/datasets')
print(os.listdir(marathi_corpus_path))

['dnyaneshwari', 'haripath']


For this tutorial, let us analyse the Dnyaneshwari, a work by the poet Dnyaneshwar, which is located at the path shown.

In [6]:
marathi_text_path = os.path.join(marathi_corpus_path,'dnyaneshwari')
marathi_chapters = []
for filename in os.listdir(marathi_text_path):
    with open(os.path.join(marathi_text_path,filename),'r',encoding='utf-8') as file:
        chapter_text = file.read()
        marathi_chapters.append(chapter_text)
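
Because `os.listdir` returns entries in arbitrary order, which chapter ends up at index 0 can vary between machines. The following is a minimal sketch, not part of CLTK, of a helper (the name `read_chapters` is hypothetical) that reads the files in sorted filename order with an explicit UTF-8 encoding. It assumes every entry in the folder is a plain text file.

```python
import os


def read_chapters(folder):
    """Return the text of every file in `folder`, in sorted filename order."""
    chapters = []
    for filename in sorted(os.listdir(folder)):
        path = os.path.join(folder, filename)
        # Assumption: the corpus files are UTF-8 encoded plain text.
        with open(path, 'r', encoding='utf-8') as handle:
            chapters.append(handle.read())
    return chapters
```

It could be called as `read_chapters(marathi_text_path)`. Sorting is lexicographic, so the resulting order follows the filenames and need not match the chapter numbering.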

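Let us take the first 1005 characters of the first chapter in the list for the analysis. Note that this is the first file returned by `os.listdir`, not necessarily chapter one; the heading in the output below reads अध्याय अठरावा, the eighteenth chapter.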

In [7]:
marathi_text = marathi_chapters[0]
print(marathi_text[:1005])

 ॥ ॐ श्री परमात्मने नमः ॥ ॥ अथ श्रीमद्भगवद्गीता ॥ । अश्टादशोऽध्यायः - अध्याय अठरावा । । । मोक्षसंज्ञासयोगः ।
जयजय देव निर्मळ । निजजनाखिलमंगळ । जन्मजराजलदजाळ । प्रभंजन ॥ १ ॥ जयजय देव प्रबळ । विदळितामंगळकुळ । निगमागमद्रुमफळ । फलप्रद ॥ २ ॥ जयजय देव सकल । विगतविषयवत्सल । कलितकाळकौतूहल । कलातीत ॥ ३ ॥ जयजय देव निश्चळ । चलितचित्तपानतुंदिल । जगदुन्मीलनाविरल । केलिप्रिय ॥ ४ ॥ जयजय देव निष्कळ । स्फुरदमंदानंदबहळ । नित्यनिरस्ताखिलमळ । मूळभूत ॥ ५ ॥ जयजय देव स्वप्रभ । जगदंबुदगर्भनभ । भुवनोद्भवारंभस्तंभ । भवध्वंस ॥ ६ ॥ जयजय देव विशुद्ध । विदुदयोद्यानद्विरद । शमदम\-मदनमदभेद । दयार्णव ॥ ७ ॥ जयजय देवैकरूप । अतिकृतकंदर्पसर्पदर्प । भक्तभावभुवनदीप । तापापह ॥ ८ ॥ जयजय देव अद्वितीय । परीणतोपरमैकप्रिय । निजजनजित भजनीय । मायागम्य ॥ ९ ॥ जयजय देव श्रीगुरो । अकल्पनाख्यकल्पतरो । स्वसंविद्रुमबीजप्ररो । हणावनी ॥ १० ॥ हे काय एकैक ऐसैसें । नानापरीभाषावशें । स्तोत्र करूं तुजोद्देशें । निर्विशेषा ॥ ११ ॥ जिहींं विशेषणीं विशेषिजे । तें दृश्य नव्हे रूप तुझें । हें जाणें मी म्हणौनि लाजें । वानणा इहीं ॥ १२ ॥ परी मर्यादेचा सा

## Tokenizing Sentences

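Let us tokenize `marathi_text`. Although the class is named `TokenizeSentence`, the output below shows word-level tokens, with each danda (`।`) and double danda (`॥`) returned as a separate token.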

In [8]:
from cltk.tokenize.sentence import TokenizeSentence
marathi_tokenizer = TokenizeSentence('marathi')
marathi_tokens = marathi_tokenizer.tokenize(marathi_text)
print(marathi_tokens[:50])

['॥', 'ॐ', 'श्री', 'परमात्मने', 'नमः', '॥', '॥', 'अथ', 'श्रीमद्भगवद्गीता', '॥', '।', 'अश्टादशोऽध्यायः', '-', 'अध्याय', 'अठरावा', '।', '।', '।', 'मोक्षसंज्ञासयोगः', '।', '\nजयजय', 'देव', 'निर्मळ', '।', 'निजजनाखिलमंगळ', '।', 'जन्मजराजलदजाळ', '।', 'प्रभंजन', '॥', '१', '॥', 'जयजय', 'देव', 'प्रबळ', '।', 'विदळितामंगळकुळ', '।', 'निगमागमद्रुमफळ', '।', 'फलप्रद', '॥', '२', '॥', 'जयजय', 'देव', 'सकल', '।', 'विगतविषयवत्सल', '।']
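
Since the tokens above are word-level, it can be useful to recover verse boundaries separately. The sketch below is plain Python and independent of CLTK. It assumes the convention visible in the excerpt, where each verse ends with a Devanagari number between double dandas (for example `॥ १ ॥`) and its parts are separated by single dandas. The function names `split_verses` and `split_parts` are hypothetical.

```python
import re

# A verse is assumed to end with digits enclosed in double dandas, e.g. "॥ १ ॥".
# \u0965 is the double danda; \u0966-\u096f are the Devanagari digits 0-9.
VERSE_END = re.compile('\\s*\u0965\\s*[\u0966-\u096f]+\\s*\u0965\\s*')
DANDA = '\u0964'  # single danda


def split_verses(text):
    """Split text into verses at numbered double-danda markers."""
    return [verse.strip() for verse in VERSE_END.split(text) if verse.strip()]


def split_parts(verse):
    """Split one verse into its parts at single dandas."""
    return [part.strip() for part in verse.split(DANDA) if part.strip()]


sample = ('जयजय देव निर्मळ । निजजनाखिलमंगळ । जन्मजराजलदजाळ । प्रभंजन ॥ १ ॥ '
          'जयजय देव प्रबळ । विदळितामंगळकुळ । निगमागमद्रुमफळ । फलप्रद ॥ २ ॥')
verses = split_verses(sample)
print(len(verses))
print(split_parts(verses[0]))
```

Unnumbered headings such as the invocation at the start of the chapter are not split by this pattern and would need separate handling.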


## Stopword filtering

A stopword list for Marathi can be found in the `stop` module of CLTK.

In [9]:
from cltk.stop.marathi.stops import STOP_LIST
print(STOP_LIST[:10])

['न', 'तरी', 'तो', 'हें', 'तें', 'कां', 'आणि', 'जें', 'जे', 'मग']


Let us filter the `marathi_tokens` list for words that are not stop words.

In [10]:
marathi_tokens_no_stop = [token for token in marathi_tokens if token not in STOP_LIST]
print(marathi_tokens_no_stop[:5])

['॥', 'ॐ', 'श्री', 'परमात्मने', 'नमः']


In [11]:
print(len(marathi_tokens))
print(len(marathi_tokens_no_stop))

33475
27835


As one can see, `marathi_tokens` has 33475 tokens whereas `marathi_tokens_no_stop` has 27835, so filtering removed 5640 stop-word tokens.
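
The filtered list still begins with punctuation such as `॥`, and some tokens in the tokenizer output carry a leading newline (for example `'\nजयजय'`). The sketch below is plain Python and shows one possible way to strip whitespace and drop punctuation along with the stop words. The helper name `filter_tokens`, the punctuation set, and the short sample stop list (copied from the first ten entries printed above) are assumptions made for illustration; with CLTK installed, `STOP_LIST` could be passed in instead.

```python
# Sample stop words copied from the STOP_LIST output shown earlier.
SAMPLE_STOPS = frozenset(['न', 'तरी', 'तो', 'हें', 'तें', 'कां', 'आणि', 'जें', 'जे', 'मग'])
# Assumption: danda, double danda and hyphen tokens should be discarded too.
PUNCTUATION = frozenset(['\u0964', '\u0965', '-'])


def filter_tokens(tokens, stops=SAMPLE_STOPS, punctuation=PUNCTUATION):
    """Strip whitespace, then drop empty, stop-word and punctuation tokens."""
    kept = []
    for token in tokens:
        token = token.strip()
        if token and token not in stops and token not in punctuation:
            kept.append(token)
    return kept


sample_tokens = ['हें', 'जाणें', 'मी', 'म्हणौनि', 'लाजें', '।', 'वानणा', 'इहीं', '॥', '१२', '॥']
print(filter_tokens(sample_tokens))
```

Using a `frozenset` keeps each membership test at constant time, which matters when filtering tens of thousands of tokens.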

## Transliterations

We can transliterate Marathi text from Devanagari into the scripts of other Indic languages. Let us transliterate `शब्दकोश` into the Gujarati script:

In [12]:
from cltk.corpus.sanskrit.itrans.unicode_transliterate import UnicodeIndicTransliterator
UnicodeIndicTransliterator.transliterate(' शब्दकोश ',"mr","gu")

' શબ્દકોશ '
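
Transliteration between Indic scripts is feasible largely because their Unicode blocks are laid out in parallel, so a letter usually sits at the same offset within each block. The sketch below illustrates that general idea in plain Python; it is not CLTK's implementation, and the function name `devanagari_to_gujarati` is hypothetical. It assumes a simple fixed offset and leaves the danda and double danda untouched, since those marks are shared across scripts.

```python
DEVANAGARI_START, DEVANAGARI_END = 0x0900, 0x097F
GUJARATI_START = 0x0A80
# Danda and double danda are shared punctuation and are not shifted.
SHARED_PUNCTUATION = {0x0964, 0x0965}


def devanagari_to_gujarati(text):
    """Shift Devanagari code points to the same offset in the Gujarati block."""
    offset = GUJARATI_START - DEVANAGARI_START
    converted = []
    for char in text:
        code_point = ord(char)
        in_block = DEVANAGARI_START <= code_point <= DEVANAGARI_END
        if in_block and code_point not in SHARED_PUNCTUATION:
            converted.append(chr(code_point + offset))
        else:
            converted.append(char)
    return ''.join(converted)


print(devanagari_to_gujarati('शब्दकोश'))
```

Not every Devanagari code point has a Gujarati counterpart, so a naive shift like this can produce unassigned characters for some input; a full transliterator has to handle such exceptions.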

We can also romanize the text as shown:

In [13]:
from cltk.corpus.sanskrit.itrans.unicode_transliterate import ItransTransliterator
marathi_text_two = 'तुलसी औषधी वनस्पती आहे'
ItransTransliterator.to_itrans(marathi_text_two,'mr')

'tulasii aushhadhii wanaspatii aahe'

Similarly, we can convert a text given in ITRANS transliteration back to Devanagari:

In [14]:
marathi_text_itrans = 'aapaNa dararoja eka pustaka waachale paahije.'
ItransTransliterator.from_itrans(marathi_text_itrans,'mr')

'आपण दररोज एक पुस्तक वाचले पाहिजे।'

## Syllabifier

We can use the `indian_syllabifier` module to syllabify Marathi words. To do this, we first have to import the required models as follows. Importing `sanskrit_models_cltk` might take some time.

In [15]:
phonetics_model_importer = CorpusImporter('sanskrit')
phonetics_model_importer.list_corpora
phonetics_model_importer.import_corpus('sanskrit_models_cltk') 

Now we import the syllabifier and syllabify as follows:

In [16]:
%%capture
from cltk.stem.sanskrit.indian_syllabifier import Syllabifier
marathi_syllabifier = Syllabifier('marathi')
marathi_syllables = marathi_syllabifier.orthographic_syllabify('इतिहास')

The syllables of the word इतिहास will thus be:

In [17]:
print(marathi_syllables)

['इ', 'ति', 'हा', 'स']
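
The output above groups each consonant or independent vowel with the vowel signs that follow it. A minimal sketch of that grouping idea in plain Python follows; it is an illustration only and not CLTK's implementation. The function name `naive_syllabify` is hypothetical, and the rule for joining conjuncts after a virama is a simplifying assumption.

```python
import unicodedata

VIRAMA = '\u094d'  # suppresses the inherent vowel and forms conjuncts


def naive_syllabify(word):
    """Group each base letter with the combining marks that follow it."""
    clusters = []
    for char in word:
        # Vowel signs, anusvara and virama are combining marks (Mn or Mc).
        is_mark = unicodedata.category(char) in ('Mn', 'Mc')
        # A consonant directly after a virama joins the previous cluster.
        joins_conjunct = bool(clusters) and clusters[-1].endswith(VIRAMA)
        if clusters and (is_mark or joins_conjunct):
            clusters[-1] += char
        else:
            clusters.append(char)
    return clusters


print(naive_syllabify('इतिहास'))
```

This purely orthographic grouping does not model pronunciation, for example the dropping of the inherent vowel at the end of a word.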


## Marathi Alphabet

There are 13 vowels in Marathi, which can be printed out as follows:

In [18]:
from cltk.corpus.marathi.alphabet import *
print("Vowels: ", VOWELS)
print("IAST Representation of vowels: ",IAST_REPRESENTATION_VOWELS)

Vowels:  ['अ', 'आ', 'इ', 'ई', 'उ', 'ऊ', 'ऋ', 'ए', 'ऐ', 'ओ', 'औ', 'अॅ', 'ऑ']
IAST Representation of vowels:  ['a', 'ā', 'i', 'ī', 'u', 'ū', 'ṛ', 'e', 'ai', 'o', 'au', 'ae', 'ao']


There are also 25 consonants, divided into 5 groups or *vargas*:

In [19]:
print("Velar consonants:",VELAR_CONSONANTS)
print("IAST Representation of Velar consonants:",IAST_VELAR_CONSONANTS)
print("\nPalatal consonants:",PALATAL_CONSONANTS)
print("IAST Representation of Palatal consonants:",IAST_PALATAL_CONSONANTS)
print("\nRetroflex consonants:",RETROFLEX_CONSONANTS)
print("IAST Representation of Retroflex consonants:",IAST_RETROFLEX_CONSONANTS)
print("\nDental consonants:",DENTAL_CONSONANTS)
print("IAST Representation of Dental consonants:",IAST_DENTAL_CONSONANTS)
print("\nLabial consonants:",LABIAL_CONSONANTS)
print("IAST Representation of Labial consonants:",IAST_LABIAL_CONSONANTS)

Velar consonants: ['क', 'ख', 'ग', 'घ', 'ङ']
IAST Representation of Velar consonants: ['k', 'kh', 'g', 'gh', 'ṅ']

Palatal consonants: ['च', 'छ', 'ज', 'झ', 'ञ']
IAST Representation of Palatal consonants: ['c', 'ch', 'j', 'jh', 'ñ']

Retroflex consonants: ['ट', 'ठ', 'ड', 'ढ', 'ण']
IAST Representation of Retroflex consonants: ['ṭ', 'ṭh', 'ḍ', 'ḍh', 'ṇ']

Dental consonants: ['त', 'थ', 'द', 'ध', 'न']
IAST Representation of Dental consonants: ['t', 'th', 'd', 'dh', 'n']

Labial consonants: ['प', 'फ', 'ब', 'भ', 'म']
IAST Representation of Labial consonants: ['p', 'ph', 'b', 'bh', 'm']
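
Because each consonant list is aligned index by index with its IAST list, the pairs can be zipped into lookup tables. The sketch below is plain Python and copies the values from the output above; the names `VARGAS`, `TO_IAST` and `VARGA_OF` are hypothetical and not part of CLTK. With CLTK installed, the imported constants could be used in place of the literals.

```python
# Values copied from the output above: (Devanagari letters, IAST forms).
VARGAS = {
    'velar': (['क', 'ख', 'ग', 'घ', 'ङ'], ['k', 'kh', 'g', 'gh', 'ṅ']),
    'palatal': (['च', 'छ', 'ज', 'झ', 'ञ'], ['c', 'ch', 'j', 'jh', 'ñ']),
    'retroflex': (['ट', 'ठ', 'ड', 'ढ', 'ण'], ['ṭ', 'ṭh', 'ḍ', 'ḍh', 'ṇ']),
    'dental': (['त', 'थ', 'द', 'ध', 'न'], ['t', 'th', 'd', 'dh', 'n']),
    'labial': (['प', 'फ', 'ब', 'भ', 'म'], ['p', 'ph', 'b', 'bh', 'm']),
}

TO_IAST = {}   # consonant -> IAST form
VARGA_OF = {}  # consonant -> name of its varga
for varga_name, (letters, iast_forms) in VARGAS.items():
    for letter, iast in zip(letters, iast_forms):
        TO_IAST[letter] = iast
        VARGA_OF[letter] = varga_name

print(VARGA_OF['ट'], TO_IAST['ट'])
```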


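There are 4 semi-vowels, 3 sibilants, 1 fricative and 3 additional consonants in Marathi. Note that the fricative constants are spelled `FRIACTIVE_CONSONANTS` and `IAST_FRIACTIVE_CONSONANTS` in the code below.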

In [20]:
print("Semi-vowels: ",SEMI_VOWELS)
print("IAST Representation of Semi-vowels: ",IAST_SEMI_VOWELS)

print("\nSibilants",SIBILANTS)
print("IAST Representation of Sibilants",IAST_SIBILANTS)

print("\nFricative consonants:",FRIACTIVE_CONSONANTS)
print("IAST Representation of Fricative consonants:",IAST_FRIACTIVE_CONSONANTS)

print("\nAdditional consonants:",ADDITIONAL_CONSONANTS)
print("IAST Representation of Additional consonants:",IAST_ADDITIONAL_CONSONANTS)

Semi-vowels:  ['य', 'र', 'ल', 'व']
IAST Representation of Semi-vowels:  ['y', 'r', 'l', 'w']

Sibilants ['श', 'ष', 'स']
IAST Representation of Sibilants ['ś', 'ṣ', 's']

Fricative consonants: ['ह']
IAST Representation of Fricative consonants: ['h']

Additional consonants: ['ळ', 'क्ष', 'ज्ञ']
IAST Representation of Additional consonants: ['La', 'kSha', 'dnya']


The digits in the Marathi script are as follows:

In [21]:
print("Digits:",DIGITS)

Digits: ['०', '१', '२', '३', '४', '५', '६', '७', '८', '९']
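
Verse numbers in the corpus use these digits, so it can be handy to convert them to ASCII digits or integers. The sketch below is plain Python and builds the same ten characters from their Unicode code points; the helper name `to_ascii_digits` is hypothetical. It relies on Python's built-in `str.translate`, and on the fact that `int()` accepts Unicode decimal digits directly.

```python
# The ten Devanagari digits occupy U+0966 (zero) through U+096F (nine).
DEVANAGARI_DIGITS = ''.join(chr(0x0966 + value) for value in range(10))
TO_ASCII = str.maketrans(DEVANAGARI_DIGITS, '0123456789')


def to_ascii_digits(text):
    """Replace Devanagari digits in `text` with ASCII digits."""
    return text.translate(TO_ASCII)


print(to_ascii_digits('१२'))
print(int('१२') + 1)
```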
