# Sanskrit with CLTK

Sanskrit is one of the oldest Indo-European languages and a literary language of ancient and medieval South Asia that also served as a lingua franca. It is a standardised dialect of Old Indo-Aryan, originating as Vedic Sanskrit and tracing its linguistic ancestry back to Proto-Indo-Iranian and Proto-Indo-European.<br>You can use CLTK to analyse Sanskrit texts!<br><br>
First, we define the path under which our corpora will reside.

In [1]:
import os
USER_PATH = os.path.expanduser('~')

Before we begin analysing the texts, we need to download the Sanskrit corpora, for which we will use CLTK's `CorpusImporter`. Create an importer for Sanskrit as follows:

In [2]:
from cltk.corpus.utils.importer import CorpusImporter
sanskrit_corpus_importer = CorpusImporter('sanskrit')

You can view which corpora are available for download through the `list_corpora` attribute:

In [3]:
sanskrit_corpus_importer.list_corpora

['sanskrit_text_jnu',
 'sanskrit_text_dcs',
 'sanskrit_parallel_sacred_texts',
 'sanskrit_text_sacred_texts',
 'sanskrit_parallel_gitasupersite',
 'sanskrit_text_gitasupersite',
 'sanskrit_text_wikipedia',
 'sanskrit_text_sanskrit_documents',
 'sanskrit_models_cltk']

In [4]:
sanskrit_corpus_importer.import_corpus('sanskrit_text_dcs');

You can verify that the `sanskrit_text_dcs` corpus has been downloaded to the `cltk_data/sanskrit/text` folder under the path given by `USER_PATH`. It is now possible to analyse the texts within. For this tutorial, let us analyse <i>Ayurvedarasāyana</i>.

In [5]:
# Build the path to the corpus text and read it as UTF-8.
sanskrit_corpus_path = os.path.join(USER_PATH, 'cltk_data/sanskrit/text/sanskrit_text_dcs/corpora')
sanskrit_text_path = os.path.join(sanskrit_corpus_path, 'Ayurvedarasāyana.txt')
with open(sanskrit_text_path, 'r', encoding='utf-8') as text_file:
    sanskrit_text = text_file.read()

Let us see the contents of `sanskrit_text`:

In [6]:
print(sanskrit_text)

tatra dravyasya prādhānyam āha dravyameveti // (1.0)
	rasādibhyo dravyameva pradhānam // (2.0)
	hi yasmāt te rasādayaḥ tadāśrayāḥ dravyādhiṣṭhānāḥ // (3.0)
	etenaiva rasādyāśrayo dravyamityuktaṃ bhavati // (4.0)
	prādhānyakathanaṃ dravyajñāne tātparyārtham // (5.0)
	uktaṃ ca saṃgrahe / (6.1)
	yasmād dṛṣṭo yavaḥ svādur gururapyanilapradaḥ / (6.2)
	dīpanaṃ śītamapyājyaṃ vasoṣṇāpy agnisādinī // (6.3)
	kaṭupāko'pi pittaghno mudgo māṣastu pittalaḥ / (7.1)
	svādupāko 'pi cayakṛtsnigdhoṣṇaṃ guru phāṇitam // (7.2)
	rase svādau yathā caitat tathānyeṣvapi dṛśyate / (8.1)
	vātalaṃ kaphapittaghnamamlam apyākṣakīphalam // (8.2)
	kurute dadhi gurveva vahniṃ pārevataṃ na tu / (9.1)
	kapitthaṃ dāḍimaṃ cāmlaṃ grāhi nāmalakīphalam // (9.2)
	kaṣāyā grāhiṇī śītā dhātakī na harītakī / (10.1)
	apradhānāḥ pṛthak tasmādrasādyāḥ saṃśritāstu te // (10.2)
	prabhāvaśca yato dravye dravyaṃ śreṣṭham ato matam / (11.1)
	iti // (11.2)
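
Each line of the printed text ends with a `/` or `//` separator followed by a `(verse.part)` marker. If you want to work with the verses individually, the lines can be split into their components. This is only a sketch with the standard `re` module. The line shape is inferred from the output above, and `parse_line` is a hypothetical helper, not a CLTK function.

```python
import re

# Assumed line shape: "<text> / (verse.part)" or "<text> // (verse.part)".
LINE_PATTERN = re.compile(r'^\s*(.*?)\s*/{1,2}\s*\((\d+)\.(\d+)\)\s*$')

def parse_line(line):
    """Return (verse, part, text) for one corpus line, or None if it does not match."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return None
    text, verse, part = match.groups()
    return int(verse), int(part), text

# Two lines copied from the output above.
sample = "uktaṃ ca saṃgrahe / (6.1)\n\titi // (11.2)"
for line in sample.splitlines():
    print(parse_line(line))
```

Lines that do not follow the assumed shape (such as the trailing blank lines) yield `None` and can be filtered out.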
	


## Transliterations

We can transliterate Sanskrit text from Devanagari into the scripts of other Indic languages. Let us transliterate `ज्वालक` to Malayalam:

In [7]:
from cltk.corpus.sanskrit.itrans.unicode_transliterate import UnicodeIndicTransliterator
UnicodeIndicTransliterator.transliterate(' ज्वालक ',"sa","ml")

' ജ്വാലക '
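
To build some intuition for why script-to-script conversion is feasible, note that the major Indic scripts occupy parallel 128-code-point blocks in Unicode, so many characters can be mapped by shifting the code point from one block to the other. The sketch below illustrates that idea only. It is not presented as CLTK's implementation, `shift_script` is a hypothetical helper, and because not every Devanagari code point has a Malayalam counterpart, it may produce unassigned code points for some input.

```python
DEVANAGARI_START = 0x0900  # first code point of the Devanagari block
MALAYALAM_START = 0x0D00   # first code point of the Malayalam block
BLOCK_SIZE = 0x80          # each of these Unicode blocks spans 128 code points

def shift_script(text, source_start, target_start):
    """Move every character of the source block to the same offset in the target block."""
    shifted = []
    for ch in text:
        offset = ord(ch) - source_start
        if 0 <= offset < BLOCK_SIZE:
            shifted.append(chr(target_start + offset))
        else:
            shifted.append(ch)  # spaces, punctuation, other scripts stay as they are
    return ''.join(shifted)

print(shift_script('ज्वालक', DEVANAGARI_START, MALAYALAM_START))
```

For real texts, prefer the library transliterator, which can handle script-specific exceptions that a plain offset cannot.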

We can also romanize the text as shown:

In [8]:
sanskrit_text_two = 'अगद्यति'
from cltk.corpus.sanskrit.itrans.unicode_transliterate import ItransTransliterator
ItransTransliterator.to_itrans(sanskrit_text_two,'sa')

'agadyati'

Similarly, we can convert a text given in ITRANS transliteration back to Devanagari:

In [9]:
sanskrit_text_itrans = 'udghaaTayati'
ItransTransliterator.from_itrans(sanskrit_text_itrans,'sa')

'उद्घाटयति'

## Query Script Information

Phonetic information about the characters can be obtained as follows:

In [10]:
from cltk.corpus.sanskrit.itrans.langinfo import *
char = 'ट'
lang = 'sa'

In [11]:
is_vowel(char,lang)

False

In [12]:
is_consonant(char,lang)

True

In [13]:
is_velar(char,lang)

False

In [14]:
is_palatal(char,lang)

False

In [15]:
is_aspirated(char,lang)

False

In [16]:
is_unvoiced(char,lang)

True

In [17]:
is_nasal(char,lang)

False

In [18]:
is_retroflex(char,lang)

True

There are many more features in `langinfo`. Do check them out!
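
The answers above follow from the traditional arrangement of the Devanagari stops: five rows by place of articulation, each ordered as unvoiced, unvoiced aspirated, voiced, voiced aspirated, nasal. The sketch below derives the same properties from a character's position in that grid. It is an illustration only, `describe_stop` is a hypothetical helper rather than a `langinfo` function, and it covers just the 25 stops and nasals.

```python
# The 25 Devanagari stops and nasals, built from their Unicode code points.
VARGAS = {
    'velar':     [chr(c) for c in range(0x0915, 0x091A)],  # क ख ग घ ङ
    'palatal':   [chr(c) for c in range(0x091A, 0x091F)],  # च छ ज झ ञ
    'retroflex': [chr(c) for c in range(0x091F, 0x0924)],  # ट ठ ड ढ ण
    'dental':    [chr(c) for c in range(0x0924, 0x0929)],  # त थ द ध न
    'labial':    [chr(c) for c in range(0x092A, 0x092F)],  # प फ ब भ म
}

def describe_stop(char):
    """Return place, voicing, aspiration and nasality, or None if not a stop/nasal."""
    for place, members in VARGAS.items():
        if char in members:
            position = members.index(char)  # column within the row, 0 to 4
            return {
                'place': place,
                'voiced': position in (2, 3, 4),
                'aspirated': position in (1, 3),
                'nasal': position == 4,
            }
    return None

print(describe_stop('ट'))
```

For `ट` this reports a retroflex, unvoiced, unaspirated, non-nasal consonant, in line with the `langinfo` results shown above.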

## Syllabifier

We can use the `indian_syllabifier` module to syllabify Sanskrit words. To do this, we first have to import the models as follows. Importing `sanskrit_models_cltk` might take some time.

In [19]:
phonetics_model_importer = CorpusImporter('sanskrit')
phonetics_model_importer.list_corpora
phonetics_model_importer.import_corpus('sanskrit_models_cltk') 

Now we import the syllabifier and syllabify as follows:

In [20]:
%%capture
from cltk.stem.sanskrit.indian_syllabifier import Syllabifier
sanskrit_syllabifier = Syllabifier('sanskrit')
sanskrit_syllables = sanskrit_syllabifier.orthographic_syllabify('पुस्तकालय')

The syllables of the word पुस्तकालय will thus be:

In [21]:
print(sanskrit_syllables)

['पु', 'स्त', 'का', 'ल', 'य']
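
To see what orthographic syllabification amounts to, here is a simplified rule-based sketch: a consonant or independent vowel starts a new syllable unless the preceding character is a virama, and every other sign (vowel signs, virama, anusvara, visarga) attaches to the current syllable. This is an assumption-laden illustration, not the model-based CLTK syllabifier. `naive_syllabify` is a hypothetical helper and it ignores the less common letters outside U+0904 to U+0939.

```python
VIRAMA = '\u094d'  # the sign that joins consonants into a cluster

def is_base_letter(ch):
    """True for independent vowels and consonants (U+0904 to U+0939)."""
    return '\u0904' <= ch <= '\u0939'

def naive_syllabify(word):
    """Split a Devanagari word into orthographic syllables (simplified rules)."""
    syllables = []
    for ch in word:
        after_virama = bool(syllables) and syllables[-1].endswith(VIRAMA)
        if not syllables or (is_base_letter(ch) and not after_virama):
            syllables.append(ch)   # start a new syllable
        else:
            syllables[-1] += ch    # extend the current syllable
    return syllables

print(naive_syllabify('पुस्तकालय'))
```

On the word used above, these rules give the same five syllables as the CLTK output, though the two approaches may diverge on other input.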


## Tokenizing Sentences

Let us tokenize an example passage, `sanskrit_text_example`. Note that the output below consists of word and punctuation tokens.

In [22]:
sanskrit_text_example = 'वार्त्तापत्रम् अर्वाचीनकाले मानवस्य जीवनमस्ति । वार्त्तापत्रम् अस्माकं ज्ञानस्य गंङगा इव।यथा प्रभाते करदर्शनम् आवश्यकं सञ्जताम्।'
from cltk.tokenize.sentence import TokenizeSentence
sanskrit_tokenizer = TokenizeSentence('sanskrit')
sanskrit_tokens = sanskrit_tokenizer.tokenize(sanskrit_text_example)
print(sanskrit_tokens)

['वार्त्तापत्रम्', 'अर्वाचीनकाले', 'मानवस्य', 'जीवनमस्ति', '।', 'वार्त्तापत्रम्', 'अस्माकं', 'ज्ञानस्य', 'गंङगा', 'इव', '।', 'यथा', 'प्रभाते', 'करदर्शनम्', 'आवश्यकं', 'सञ्जताम्', '।']
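
The output shows that words are separated on whitespace and that each danda (`।`) becomes a token of its own, even where it is attached directly to a word. A comparable result can be sketched with a single regular expression. This is an approximation for illustration, not the CLTK tokenizer, and `simple_tokenize` is a hypothetical name.

```python
import re

# A token is either a run of characters that are neither whitespace nor a
# danda, or a single danda (U+0964) / double danda (U+0965).
TOKEN_PATTERN = re.compile(r'[^\s\u0964\u0965]+|[\u0964\u0965]')

def simple_tokenize(text):
    """Split text into word tokens and danda tokens."""
    return TOKEN_PATTERN.findall(text)

print(simple_tokenize('गंङगा इव।यथा प्रभाते'))
```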


## Stopword filtering

A list of Sanskrit stopwords can be found in the `stop` module of CLTK.

In [23]:
from cltk.stop.sanskrit.stops import STOPS_LIST
print(STOPS_LIST[:10])

['अहम्', 'आवाम्', 'वयम्', 'माम्', 'मा', 'आवाम्', 'अस्मान्', 'नः', 'मया', 'आवाभ्याम्']


Let us filter the `sanskrit_tokens` list for words that are not stop words.

In [24]:
sanskrit_tokens_no_stop = [token for token in sanskrit_tokens if token not in STOPS_LIST]
print(sanskrit_tokens_no_stop)

['वार्त्तापत्रम्', 'अर्वाचीनकाले', 'मानवस्य', 'जीवनमस्ति', '।', 'वार्त्तापत्रम्', 'अस्माकं', 'ज्ञानस्य', 'गंङगा', '।', 'प्रभाते', 'करदर्शनम्', 'आवश्यकं', 'सञ्जताम्', '।']


In [25]:
print(len(sanskrit_tokens))
print(len(sanskrit_tokens_no_stop))

17
15


As one can see, `sanskrit_tokens` had 17 tokens, whereas `sanskrit_tokens_no_stop` has 15.
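
For longer texts, testing membership against a list becomes slow, because each lookup scans the whole list. A common alternative is to convert the stopwords to a set first. The sketch below uses stand-in data so that it runs without CLTK. `remove_stopwords` is a hypothetical helper, and the two stopwords are the ones that were filtered out above.

```python
def remove_stopwords(tokens, stopwords):
    """Return the tokens that are not stopwords, preserving their order."""
    stop_set = frozenset(stopwords)  # constant-time membership tests on average
    return [token for token in tokens if token not in stop_set]

# Stand-in data for illustration only; in this tutorial these would be
# `sanskrit_tokens` and `STOPS_LIST`.
tokens = ['यथा', 'प्रभाते', 'करदर्शनम्', 'इव', '।']
stopwords = ['इव', 'यथा']
print(remove_stopwords(tokens, stopwords))
```

Note that punctuation such as `।` is not a stopword, so it stays in the filtered list and is included in the count of 15.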

## Swadesh list for Sanskrit

The Swadesh list for Sanskrit can be obtained as follows:

In [26]:
from cltk.corpus.swadesh import Swadesh
swadesh_list = Swadesh('sa')
print(swadesh_list.words())

['अहम्', 'त्वम्', 'स', 'वयम्, नस्', 'यूयम्, वस्', 'ते', 'इदम्', 'तत्', 'अत्र', 'तत्र', 'क', 'किम्', 'कुत्र', 'कदा', 'कथम्', 'न', 'सर्व', 'बहु', 'किञ्चिद्', 'अल्प', 'अन्य', 'एक', 'द्वि', 'त्रि', 'चतुर्', 'पञ्चन्', 'महत्', 'दीर्घ', 'उरु', 'घन', 'गुरु', 'अल्प', 'ह्रस्व', 'अंहु', 'तनु', 'स्त्री', 'पुरुष, नर', 'मनुष्य, मानव', 'बाल, शिशु', 'पत्नी, भार्या', 'पति', 'मातृ', 'पितृ', 'पशु', 'मत्स्य', 'वि, पक्षिन्', 'श्वन्', 'यूका', 'सर्प', 'कृमि', 'वृक्ष, तरु', 'वन', 'दण्ड', 'फल', 'बीज', 'पत्त्र', 'मूल', 'त्वच्', 'पुष्प', 'तृण', 'रज्जु', 'चर्मन्, त्वच्', 'मांस', 'रक्त, असृज्', 'अस्थि', 'पीवस्, मेदस्', 'अण्ड', 'शृङ्ग', 'पुच्छ', 'पर्ण', 'केश', 'शिरस्', 'कर्ण', 'अक्षि', 'नासा', 'वक्त्र, मुख', 'दन्त', 'जिह्वा', 'नख', 'पद', 'जङ्घ', 'जानु', 'हस्त, पाणि', 'पक्ष', 'उदर', 'अन्त्र, आन्त्र, गुद', 'गल, ग्रीवा', 'पृष्ठ', 'स्तन', 'हृदय', 'यकृत्', 'पिबति', 'खादति, अत्ति', 'दशति', 'धयति', 'ष्ठीवति', 'वमति', 'वाति', 'अनिति', 'स्मयते, हसति', 'पश्यति, √दृश्', 'शृणोति', 'जानाति', 'मन्यते, चिन्तयति', 'जिघ्रति', 'बिभे
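
Some entries in this list hold several variants in one string, separated by commas (for example `'वयम्, नस्'`). If you need one word form per item, the entries can be flattened. This is a small sketch on sample data, and `split_variants` is a hypothetical helper, not part of the `Swadesh` class.

```python
def split_variants(entries):
    """Flatten comma-separated entries into individual word forms."""
    return [variant.strip() for entry in entries for variant in entry.split(',')]

# Three entries copied from the output above.
sample_entries = ['अहम्', 'वयम्, नस्', 'पुरुष, नर']
print(split_variants(sample_entries))
```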

## Sanskrit Alphabets

You can view the Sanskrit alphabet using the constants defined in the `alphabet` module:

In [27]:
from cltk.corpus.sanskrit.alphabet import *

Vowels in Sanskrit can be obtained as shown:

In [28]:
print("Independent vowels - Simple: ", INDEPENDENT_VOWELS_SIMPLE)
print("Independent vowels - Diphthongs: ", INDEPENDENT_VOWELS_DIPTHONGS)
print("Dependent vowels: ", DEPENDENT_VOWELS)


Independent vowels - Simple:  ['अ', 'आ', 'इ', 'ई', 'उ', 'ऊ', 'ए', 'ओ', 'ऋ', 'ऌ']
Independent vowels - Diphthongs:  ['ऐ', 'औ']
Dependent vowels:  ['ा', 'ि', 'ी', 'ु', 'ू', 'े', 'ै', 'ो', 'ौ', 'ं', 'ः', 'ृ', 'ॄ']


Consonants in Sanskrit are grouped as follows:

In [29]:
print("Guttural consonants: ", CONSONANT_GUTTURALS)
print("Palatal consonants: ", CONSONANT_PALATALS)
print("Cerebral consonants: ", CONSONANT_CEREBRALS)
print("Dental consonants: ", CONSONANT_DENTALS)
print("Labial consonants: ", CONSONANT_LABIALS)
print("Semivowel consonant: ", SEMIVOWEL_CONSONANT)
print("Sibilant consonant: ", SIBILANT_CONSONANT)
print("Aspirated consonant: ", SONANT_ASPIRATE)

Guttural consonants:  ['क', 'ख', 'ग', 'घ', 'ङ']
Palatal consonants:  ['च', 'छ', 'ज', 'झ', 'ञ']
Cerebral consonants:  ['ट', 'ठ', 'ड', 'ढ', 'ण']
Dental consonants:  ['त', 'थ', 'द', 'ध', 'न']
Labial consonants:  ['प', 'फ', 'ब', 'भ', 'म']
Semivowel consonant:  ['य', 'र', 'ल', 'व']
Sibilant consonant:  ['श', 'ष', 'स']
Aspirated consonant:  ['ह']


Digits in Sanskrit:

In [30]:
print("Digits: ", DIGITS)

Digits:  ['०', '१', '२', '३', '४', '५', '६', '७', '८', '९']
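
These digits are ordinary Unicode decimal digits, so Python can convert between them and ASCII digits without any extra library. The sketch below shows both directions. `to_devanagari_digits` is a hypothetical helper, and the digit string is rebuilt from code points rather than imported from CLTK so that the example is self-contained.

```python
# Devanagari digits occupy U+0966 (zero) to U+096F (nine).
DEVANAGARI_DIGITS = ''.join(chr(0x0966 + i) for i in range(10))
TO_DEVANAGARI = str.maketrans('0123456789', DEVANAGARI_DIGITS)

def to_devanagari_digits(number):
    """Render an integer with Devanagari digits."""
    return str(number).translate(TO_DEVANAGARI)

print(to_devanagari_digits(2024))
print(int('१२३'))  # int() accepts Unicode decimal digits directly
```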


There are a few other characters too:

In [31]:
print(OTHER_ALPHABETS)

['ॐ', '्', 'ँ', 'ऽ', '़', '।', '॥']
