<a href="https://colab.research.google.com/github/defcarlos/mulamadhyamaka/blob/main/mulamadhymaka.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Word count of Mulamadhyamakakarika
The first thing I did was loop through the book and count how often each word occurs.

Before that, however, I needed a text file containing the entire text. I could have built it through web scraping, but since the book is rather short I did it by hand. I sourced the text from the [Digital Sanskrit Buddhist Canon](https://www.dsbcproject.org/canon-text/book/242).

##cltk install and import
From here down I'm installing the Classical Language Toolkit ([cltk](http://cltk.org/)) and importing its NLP pipeline for Sanskrit.

In [None]:
#installing cltk
!pip install cltk

Collecting cltk
  Downloading cltk-1.1.6-py3-none-any.whl (845 kB)
Collecting boltons<22.0.0,>=21.0.0 (from cltk)
  Downloading boltons-21.0.0-py2.py3-none-any.whl (193 kB)
Collecting gitpython<4.0,>=3.0 (from cltk)
  Downloading GitPython-3.1.40-py3-none-any.whl (190 kB)
Collecting greek-accentuation<2.0.0,>=1.2.0 (from cltk)
  Downloading greek_accentuation-1.2.0-py2.py3-none-any.whl (6.8 kB)
Collecting python-Levenshtein<0.13.0,>=0.12.0 (from cltk)
  Downloading python-Levenshtein-0.12.2.tar.gz (50 kB)

In [None]:
# importing the nlp pipeline class.
# according to the cltk docs this is the only import most users will need
from cltk import NLP

In [None]:
# Load the default Pipeline for Sanskrit
cltk_nlp = NLP(language="san")

‎𐤀 CLTK version '1.1.6'.
Pipeline for language 'Sanskrit' (ISO: 'san'): `MultilingualTokenizationProcess`, `SanskritEmbeddingsProcess`, `StopsProcess`.


##Opening text file containing Mulamadhyamakakarika

In [None]:
# Open the file in read mode and load the whole text into memory
# The text uses diacritics, so read it explicitly as UTF-8
with open("mula.txt", encoding="utf-8") as text:
  mula = text.read()

##Quick test run of document
Here I print a snippet of the doc, a character count, and an approximate token count, exactly as is done in the cltk documentation, to make sure things are working as expected.

In [None]:
print("Text snippet:", mula[:200])
print("Character count:", len(mula))
print("Approximate token count:", len(mula.split()))

Text snippet: Ch1
nāgārjuna kṛta

madhyamakaśāstram 

pratyayaparīkṣā nāma prathamaṃ prakaraṇam 

anirodhamanutpādamanucchedamaśāśvatam 
anekārthamanānārthamanāgamamanirgamam  

yaḥ pratītyasamutpādaṃ prapañcopaśam
Character count: 42266
Approximate token count: 4325


##NLP pipeline, tokenization, and initial word count
Again, I walk through the steps laid out in the cltk documentation. I then build a dictionary containing a count of each word that occurs in the text.

In [None]:
# Now execute NLP algorithms upon input text
mula_doc = cltk_nlp.analyze(text=mula)

CLTK message: This part of the CLTK depends upon word embedding models from the Fasttext project.
Do you want to download file 'https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.sa.vec' to '/root/cltk_data/san/embeddings/fasttext/wiki.sa.vec'? [Y/n] 
y


100%|██████████| 129M/129M [00:10<00:00, 11.9MiB/s]


In [None]:
# Get the list of tokens from the analyzed document
doc_token = mula_doc.tokens

# create empty dictionary
token_dic = {}

# iterate through the list of tokens, counting each occurrence of the token
# and adding the token and its count to the dictionary
for word in doc_token:
  if word in token_dic:
    token_dic[word] += 1
  else:
    token_dic[word] = 1

# sorted() on a dictionary returns a list of its keys in alphabetical order
token_dic_sort = sorted(token_dic)

# Print the contents of the dictionary
for key in token_dic_sort:
  print(key, ":", token_dic[key])

# Write results to file once, after the print loop has finished
# The with block closes the file, so no explicit close() is needed
with open('alpha-sort.txt', 'w', encoding='utf-8') as f:
  for key in token_dic_sort:
    f.write(key + ":" + str(token_dic[key]) + "\n")

, : 5
Ch1 : 1
Ch10 : 1
Ch11 : 1
Ch12 : 1
Ch13 : 1
Ch14 : 1
Ch15 : 1
Ch16 : 1
Ch17 : 1
Ch18 : 1
Ch19 : 1
Ch2 : 1
Ch20 : 1
Ch21 : 1
Ch22 : 1
Ch23 : 1
Ch24 : 1
Ch25 : 1
Ch26 : 1
Ch27 : 1
Ch3 : 1
Ch4 : 1
Ch5 : 1
Ch6 : 1
Ch7 : 1
Ch8 : 1
Ch9 : 1
[ : 1
] : 1
abhisaṃskurute : 1
abhāve : 1
abhāvāccāryasatyānāṃ : 2
abhūmatītamadhvānamityetannopapadyate : 1
abhūmatītamadhvānaṃ : 1
abrahmacaryavāsaśca : 1
adhvanyanāgate : 1
adṛśyamāna : 1
agnīndhanaparīkṣā : 1
agnīndhane : 1
agnīndhanābhyāṃ : 1
ahetukamajātasya : 1
ahetukaṃ : 1
ahetupratyayān : 1
ajyate : 2
ajātamaniruddhaṃ : 1
akāryakaṃ : 1
akṛtaṃ : 1
akṛtrimaḥ : 1
akṛtābhyāgamabhayaṃ : 1
alakṣaṇaṃ : 1
alakṣaṇo : 1
amūnyapi : 1
anapekṣya : 2
anapekṣyāśubhaṃ : 1
anekārthamanānārthamanucchedamaśāśvatam : 1
anekārthamanānārthamanāgamamanirgamam : 1
anirodhamanutpādamanucchedamaśāśvatam : 1
aniruddhamanutpannamaśūnyaṃ : 1
aniruddhamanutpannametannirvāṇamucyate : 1
anityamityapi : 1
anityamuktaṃ : 1
anityatā : 1
anitye : 2
antavaccāpyanantaṃ : 1
antav

###Results of initial word count
The first thing you might notice from the above printout is that it's kind of a mess. There are 2191 entries. This includes chapter numbers and a few punctuation marks that didn't get filtered. The actual number of unique words in the text is 2162. That's too much to make sense of.
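One way to drop the chapter markers and stray punctuation before counting unique words is to filter the dictionary. This is a minimal sketch, not part of the original notebook: the helper `is_word` and the rule it applies (reject `Ch` followed by digits, reject tokens with no letters) are assumptions, and the sample dictionary only mimics the shape of `token_dic`.

```python
import re

# Assumed pattern for the chapter markers seen in the printout ("Ch1" ... "Ch27")
CHAPTER_MARKER = re.compile(r"Ch\d+")

def is_word(token):
    """Hypothetical helper: True for tokens that look like real words."""
    if CHAPTER_MARKER.fullmatch(token):
        return False
    # Reject tokens made only of punctuation, such as "," or "["
    return any(ch.isalpha() for ch in token)

# Small sample in the same shape as token_dic (token -> count)
sample_counts = {",": 5, "Ch1": 1, "Ch27": 1, "[": 1, "]": 1, "abhāve": 1, "na": 236}

word_counts = {tok: n for tok, n in sample_counts.items() if is_word(tok)}
print(len(word_counts), "entries left:", word_counts)
```

In the notebook itself the same comprehension could be applied to `token_dic` in place of `sample_counts`.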

Another problem is that you have to scroll through what seems like an endless number of entries to find words that occur more than once.

To address these issues, I reran the above code and sorted results by the value instead of by the key.
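As an aside, Python's standard library offers `collections.Counter`, which does the counting and the sort by value in one step. The sketch below is an alternative to the hand-built dictionary, not what this notebook uses; `sample_tokens` is a made-up stand-in for `mula_doc.tokens`.

```python
from collections import Counter

# Stand-in for mula_doc.tokens; any list of strings works here
sample_tokens = ["na", "ca", "na", "vidyate", "na", "ca", "yadi"]

# Counter builds the token -> count mapping directly from the list
counts = Counter(sample_tokens)

# most_common(n) returns (token, count) pairs, highest count first
top = counts.most_common(2)
for token, n in top:
    print(token, ":", n)

# Keep only the tokens that occur more than once
repeated = {token: n for token, n in counts.items() if n > 1}
print(repeated)
```

The `repeated` dictionary addresses the scrolling problem directly, since every entry with a single occurrence is left out.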

##Second word count
I create another dictionary, this time sorted by value rather than by key.

In [None]:
# Sort the (token, count) pairs by count, highest first
value_sorted = sorted(token_dic.items(), key=lambda x: x[1], reverse=True)

# sorted() returns a list of pairs, so put them back into a dictionary
sortdict = dict(value_sorted)

# Print the contents of the dictionary
for key in sortdict:
  print(key, ":", sortdict[key])

# Write results to file once, after the print loop has finished
# The with block closes the file, so no explicit close() is needed
with open('sorted-value.txt', 'w', encoding='utf-8') as f:
  for key in sortdict:
    f.write(key + ":" + str(sortdict[key]) + "\n")

na : 236
ca : 162
vidyate : 69
yadi : 59
sa : 51
hi : 50
kathaṃ : 35
vā : 34
bhaviṣyati : 31
nāsti : 29
yujyate : 28
prakaraṇam : 27
bhavet : 24
eva : 21
kutaḥ : 20
te : 20
karma : 20
katham : 19
naiva : 19
punaḥ : 18
phalaṃ : 18
vinā : 18
prasajyate : 17
kiṃ : 17
jāyate : 17
gantā : 16
bhāvo : 16
pratītya : 14
duḥkhaṃ : 14
nopapadyate : 13
gamanaṃ : 13
saḥ : 13
evaṃ : 13
nirvāṇaṃ : 13
gacchati : 12
kasya : 11
phalam : 11
bhāvānāṃ : 10
sati : 10
tasya : 10
tu : 10
saha : 10
yaḥ : 9
yadā : 9
nirudhyate : 9
śāśvatam : 9
saṃbhavo : 9
nāma : 8
kaḥ : 8
tiṣṭhati : 8
bhāvaḥ : 8
ye : 8
janayate : 8
yathā : 8
bhāvaśca : 8
kārakaḥ : 8
yo : 8
kuta : 7
naivopapadyate : 7
caiva : 7
saṃbhavanti : 7
phalasya : 7
ceti : 7
kleśāḥ : 7
tathāgataḥ : 7
nāpi : 6
pravartate : 6
yasya : 6
darśanaṃ : 6
paśyati : 6
kāraṇaṃ : 6
sidhyati : 6
śūnyatā : 6
svabhāvena : 6
vibhavaścaiva : 6
pratibādhase : 6
avidyamāne : 5
kriyā : 5
nirodho : 5
yatra : 5
sā : 5
yena : 5
anya : 5
, : 5
sarvaṃ : 5
paśyanti : 5
taṃ : 5
bh

###Results of second word count

It might come as no surprise that the most-used word (236 times) in a book whose central theme is that everything lacks essence, nature, or *svabhava* (स्वभाव) is *na* (न), the Sanskrit word for no/not.

However, if you're familiar with Sanskrit you'll know that words change form according to their grammatical case and the role they play in the sentence. Words also change because of *sandhi* (सन्धि): sounds at word boundaries shift to make pronunciation easier, and neighboring words may combine into a single word. Take the word *nopapadyate* (नोपपद्यते). Through *sandhi*, *na* and *upapadyate* combine to mean something like not befitting/not found/not qualified. All of this is to say that there are other instances of *na* (न) in the text that are not being counted.
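To get a rough sense of how much *sandhi* hides, the fused forms can be credited back to their component words by hand. The sketch below is only an illustration: the counts are copied from the value-sorted output above, while `sandhi_splits` and `fold_sandhi` are hypothetical names and the splits are written by hand, not produced by cltk or any sandhi tool.

```python
# Counts copied from the value-sorted output above
observed = {
    "na": 236,
    "nāsti": 29,
    "naiva": 19,
    "nopapadyate": 13,
    "naivopapadyate": 7,
    "nāpi": 6,
}

# Hand-written, illustrative splits (an assumption, not library output)
sandhi_splits = {
    "nāsti": ["na", "asti"],
    "naiva": ["na", "eva"],
    "nopapadyate": ["na", "upapadyate"],
    "naivopapadyate": ["na", "eva", "upapadyate"],
    "nāpi": ["na", "api"],
}

def fold_sandhi(counts, splits):
    """Credit each fused token's count to its component words."""
    folded = {}
    for token, n in counts.items():
        # Tokens without a listed split are counted as themselves
        for part in splits.get(token, [token]):
            folded[part] = folded.get(part, 0) + n
    return folded

folded = fold_sandhi(observed, sandhi_splits)
print("na:", folded["na"])
```

With just these five fused forms, the count credited to *na* rises from 236 to 310. The table covers only forms that are easy to spot near the top of the list, so the real figure is likely higher.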

cltk includes a lemmatization tool that converts words to their root form, which would help here. After some testing, however, it doesn't seem to be available for Sanskrit, so I'll have to explore other options for accomplishing this task.

Another task cltk can handle is removing stop words from our list so we can see which words are most common without them. Stop words are words considered unimportant for analysis, typically very frequent function words.
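The idea behind stop word filtering can be sketched with plain Python. The stop list below is made up for illustration and is not cltk's Sanskrit list; `drop_stops` and `custom_stops` are hypothetical names, and the counts are copied from the value-sorted output above.

```python
# Illustrative stop list only (an assumption); cltk's own list is not reproduced here
custom_stops = {"na", "ca", "hi", "vā", "eva", "tu"}

def drop_stops(counts, stops):
    """Return a copy of counts without the tokens listed in stops."""
    return {tok: n for tok, n in counts.items() if tok not in stops}

# Counts copied from the value-sorted output above
sample_counts = {"na": 236, "ca": 162, "vidyate": 69, "yadi": 59, "hi": 50, "karma": 20}

content_counts = drop_stops(sample_counts, custom_stops)
print(content_counts)
```

Which words belong on such a list is a judgment call. In this text *na* carries much of the argument, so treating it as unimportant may not be appropriate.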



##Stop word filter

In [None]:
# Get the tokens with cltk's stop words filtered out
doc_filtered = mula_doc.tokens_stops_filtered

# create empty dictionary
filtered_dic = {}

# iterate through the list of tokens, counting each occurrence of the token
# and adding the token and its count to the dictionary
for word in doc_filtered:
  if word in filtered_dic:
    filtered_dic[word] += 1
  else:
    filtered_dic[word] = 1

# sorted() on a dictionary returns a list of its keys in alphabetical order
filtered_dic_sort = sorted(filtered_dic)

# Print the contents of the dictionary
for key in filtered_dic_sort:
  print(key, ":", filtered_dic[key])

# Write results to file
# with open('alpha-sort.txt', 'w') as f:
#   for key in filtered_dic_sort:
#     f.write(key + ":" + str(filtered_dic[key]) + "\n")

, : 5
Ch1 : 1
Ch10 : 1
Ch11 : 1
Ch12 : 1
Ch13 : 1
Ch14 : 1
Ch15 : 1
Ch16 : 1
Ch17 : 1
Ch18 : 1
Ch19 : 1
Ch2 : 1
Ch20 : 1
Ch21 : 1
Ch22 : 1
Ch23 : 1
Ch24 : 1
Ch25 : 1
Ch26 : 1
Ch27 : 1
Ch3 : 1
Ch4 : 1
Ch5 : 1
Ch6 : 1
Ch7 : 1
Ch8 : 1
Ch9 : 1
[ : 1
] : 1
abhisaṃskurute : 1
abhāve : 1
abhāvāccāryasatyānāṃ : 2
abhūmatītamadhvānamityetannopapadyate : 1
abhūmatītamadhvānaṃ : 1
abrahmacaryavāsaśca : 1
adhvanyanāgate : 1
adṛśyamāna : 1
agnīndhanaparīkṣā : 1
agnīndhane : 1
agnīndhanābhyāṃ : 1
ahetukamajātasya : 1
ahetukaṃ : 1
ahetupratyayān : 1
ajyate : 2
ajātamaniruddhaṃ : 1
akāryakaṃ : 1
akṛtaṃ : 1
akṛtrimaḥ : 1
akṛtābhyāgamabhayaṃ : 1
alakṣaṇaṃ : 1
alakṣaṇo : 1
amūnyapi : 1
anapekṣya : 2
anapekṣyāśubhaṃ : 1
anekārthamanānārthamanucchedamaśāśvatam : 1
anekārthamanānārthamanāgamamanirgamam : 1
anirodhamanutpādamanucchedamaśāśvatam : 1
aniruddhamanutpannamaśūnyaṃ : 1
aniruddhamanutpannametannirvāṇamucyate : 1
anityamityapi : 1
anityamuktaṃ : 1
anityatā : 1
anitye : 2
antavaccāpyanantaṃ : 1
antav

In [None]:
print("There are", len(filtered_dic), "items in the dictionary without stop words.")

There are 2191 items in the dictionary without stop words.


In [None]:
# Sort the (token, count) pairs by count, highest first
value_filtered = sorted(filtered_dic.items(), key=lambda x: x[1], reverse=True)

# sorted() returns a list of pairs, so put them back into a dictionary
filtereddict = dict(value_filtered)

# Print the contents of the dictionary
for key in filtereddict:
  print(key, ":", filtereddict[key])


na : 236
ca : 162
vidyate : 69
yadi : 59
sa : 51
hi : 50
kathaṃ : 35
vā : 34
bhaviṣyati : 31
nāsti : 29
yujyate : 28
prakaraṇam : 27
bhavet : 24
eva : 21
kutaḥ : 20
te : 20
karma : 20
katham : 19
naiva : 19
punaḥ : 18
phalaṃ : 18
vinā : 18
prasajyate : 17
kiṃ : 17
jāyate : 17
gantā : 16
bhāvo : 16
pratītya : 14
duḥkhaṃ : 14
nopapadyate : 13
gamanaṃ : 13
saḥ : 13
evaṃ : 13
nirvāṇaṃ : 13
gacchati : 12
kasya : 11
phalam : 11
bhāvānāṃ : 10
sati : 10
tasya : 10
tu : 10
saha : 10
yaḥ : 9
yadā : 9
nirudhyate : 9
śāśvatam : 9
saṃbhavo : 9
nāma : 8
kaḥ : 8
tiṣṭhati : 8
bhāvaḥ : 8
ye : 8
janayate : 8
yathā : 8
bhāvaśca : 8
kārakaḥ : 8
yo : 8
kuta : 7
naivopapadyate : 7
caiva : 7
saṃbhavanti : 7
phalasya : 7
ceti : 7
kleśāḥ : 7
tathāgataḥ : 7
nāpi : 6
pravartate : 6
yasya : 6
darśanaṃ : 6
paśyati : 6
kāraṇaṃ : 6
sidhyati : 6
śūnyatā : 6
svabhāvena : 6
vibhavaścaiva : 6
pratibādhase : 6
avidyamāne : 5
kriyā : 5
nirodho : 5
yatra : 5
sā : 5
yena : 5
anya : 5
, : 5
sarvaṃ : 5
paśyanti : 5
taṃ : 5
bh

###Stop word filter results
Only 4 words in the book were considered to be stop words by cltk.
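One way to check exactly which tokens the filter removed is to subtract the filtered counts from the unfiltered ones. This is a minimal sketch with made-up sample lists; `removed_by_filter` is a hypothetical helper, and in the notebook the two arguments would be `mula_doc.tokens` and `mula_doc.tokens_stops_filtered`.

```python
from collections import Counter

def removed_by_filter(all_tokens, kept_tokens):
    """Return a Counter of tokens present in all_tokens but missing from kept_tokens."""
    # Counter subtraction keeps only the positive differences
    return Counter(all_tokens) - Counter(kept_tokens)

# Made-up stand-ins for the unfiltered and filtered token lists
all_tokens = ["na", "ca", "vidyate", "ca", "iti"]
kept_tokens = ["na", "vidyate"]

removed = removed_by_filter(all_tokens, kept_tokens)
for token, n in removed.items():
    print(token, ":", n)
```

An empty result would mean the filter removed nothing, which is worth checking given that both dictionaries above report 2191 entries.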