# Classification of Buddhist Scriptures Using Unsupervised Machine Learning
## Introduction
### Problem Statement
Buddhism is one of the world's major religions. Its primary branches are Theravada, Mahayana, and Vajrayana. These branches can be distinguished by, among many other features, which scriptures they accept as legitimate. Theravada is the most conservative in this respect, accepting the smallest set of scriptures. Mahayana accepts a larger set. Vajrayana is the most liberal, accepting all of the Theravada and Mahayana scriptures as well as an additional set unique to Vajrayana.
Historians of religion and religious practitioners debate the historical origins of these branches. When archaeologists discover new texts, there is frequently a question as to which branch those texts belong. Machine learning methods can help us categorize these texts in a way that may avoid both the possible sectarian divisions of practitioners and the possibly mistaken assumptions of historians.
Thus, this project seeks to use unsupervised machine learning methods to categorize a group of texts as Theravada, Mahayana, or Vajrayana.

### Data
I have collected excerpts from texts associated with each branch, saved as theravada-excerpts.txt, mahayana-excerpts.txt, and vajrayana-excerpts.txt. These excerpts come from scriptures that domain experts regard as representative of the textual tradition of each branch. For each branch, the excerpts are taken from four separate texts in chunks of about 200 lines.

## Preprocessing
First, I load the texts below and compute a word count for each. Each set of excerpts contains roughly the same number of words (~5000), which is what we want for modeling purposes.

In [1]:
with open('theravada-excerpts.txt', 'r') as excerpts:
    theravada = excerpts.read().split(' ')

print('Theravada Excerpts Word Count:' + str(len(theravada)))

with open('mahayana-excerpts.txt', 'r') as excerpts:
    mahayana = excerpts.read().split(' ')

print('Mahayana Excerpts Word Count:' + str(len(mahayana)))

with open('vajrayana-excerpts.txt', 'r') as excerpts:
    vajrayana = excerpts.read().split(' ')

print('Vajrayana Excerpts Word Count:' + str(len(vajrayana)))

Theravada Excerpts Word Count:5094
Mahayana Excerpts Word Count:4885
Vajrayana Excerpts Word Count:4863


Next, I'll split the sets of excerpts into chunks of 200 words to act as the samples in our data. 

In [2]:
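theravada_chunks = []
mahayana_chunks = []
vajrayana_chunks = []

# Step through each word list 200 words at a time. The range bound drops any
# trailing partial chunk, and the slice i:i+200 keeps all 200 words
# (i:i+199 would silently lose the last word of every chunk).
for i in range(0, len(theravada) - 199, 200):
    theravada_chunks.append(' '.join(theravada[i:i+200]))

for i in range(0, len(mahayana) - 199, 200):
    mahayana_chunks.append(' '.join(mahayana[i:i+200]))

for i in range(0, len(vajrayana) - 199, 200):
    vajrayana_chunks.append(' '.join(vajrayana[i:i+200]))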
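
The three loops share the same logic, so they could be folded into a single helper. The following is a minimal sketch, assuming a hypothetical function name `chunk_words` that does not appear in the original notebook. It keeps only full chunks and drops any trailing remainder.

```python
def chunk_words(words, size=200):
    """Split a list of words into strings of exactly `size` words each.

    `chunk_words` is a hypothetical helper, not part of the original notebook.
    A trailing group with fewer than `size` words is discarded.
    """
    chunks = []
    for start in range(0, len(words) - size + 1, size):
        chunks.append(' '.join(words[start:start + size]))
    return chunks


# Toy input: 450 placeholder words yield two full chunks; the last 50 are dropped.
example_words = ['w{}'.format(n) for n in range(450)]
example_chunks = chunk_words(example_words)
```

With a helper like this, a call such as `chunk_words(theravada)` could stand in for the first loop, and likewise for the other two branches.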

We can now construct our initial dataframe of text excerpts and category labels. I will use 0 for Theravada, 1 for Mahayana, and 2 for Vajrayana.

In [3]:
import pandas as pd

In [31]:
th_data = {'text':theravada_chunks, 'branch': [0]*(len(theravada_chunks))}
th_df = pd.DataFrame(th_data)

ma_data = {'text': mahayana_chunks, 'branch': [1]*(len(mahayana_chunks))}
ma_df = pd.DataFrame(ma_data)

va_data = {'text':vajrayana_chunks, 'branch': [2]*(len(vajrayana_chunks))}
va_df = pd.DataFrame(va_data)

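# ignore_index=True gives every chunk a unique row label instead of
# repeating 0..n-1 once per branch
df = pd.concat([th_df, ma_df, va_df], ignore_index=True)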

df.head()

Unnamed: 0,text,branch
0,THUS HAVE I HEARD. On one occasion the Blessed...,0
1,undisciplined in their\n92 Sabbllsava Sutta: S...,0
2,in him and the arisen\ntaint ~f ignorance is a...,0
3,or the view 'I perceive not-\nself with self' ...,0
4,"""What are the things unfit for attention that ...",0


## Quantifying the Text
Now that we have a dataframe of our text, we can start turning it into something quantifiable for modeling. I will use two approaches here. The first is the frequency analysis that we are already familiar with. The second is sentiment analysis.

### Frequency Analysis

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

In [None]:
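# TODO: clean this up and tune the hyperparameters

# Weight each chunk's terms by TF-IDF; sublinear_tf=True uses 1 + log(tf) in place of the raw count
freq_matrix = TfidfVectorizer(sublinear_tf=True).fit_transform(df['text']).toarray()
# Reduce the TF-IDF matrix to 10 components per chunk
tsvd = TruncatedSVD(n_components=10).fit_transform(freq_matrix)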
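
To make the `sublinear_tf=True` setting concrete, here is a small sketch of the weighting that scikit-learn documents for `TfidfVectorizer` under its default `smooth_idf=True` and `norm='l2'` settings: term frequency becomes 1 + ln(tf), inverse document frequency is ln((1 + n) / (1 + df)) + 1, and each row is scaled to unit length. The function name `sublinear_tfidf`, the whitespace tokenizer, and the toy documents are assumptions made for illustration. The real vectorizer tokenizes differently, so this is not a drop-in replacement.

```python
import math
from collections import Counter


def sublinear_tfidf(docs):
    """Return (vocabulary, rows) of L2-normalized sublinear TF-IDF weights.

    Hypothetical helper for illustration; it tokenizes on whitespace only.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Sublinear scaling: 1 + ln(tf) for terms that occur, 0 otherwise.
        raw = [(1 + math.log(counts[t])) * idf[t] if counts[t] else 0.0
               for t in vocab]
        # Scale the row to unit Euclidean length (guarding against all zeros).
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows


# Toy documents, invented for this example.
toy_docs = ["monk monk monk path", "path of the bodhisattva"]
vocab, rows = sublinear_tfidf(toy_docs)
```

Because of the logarithm, a word that appears three times in a chunk gets a term-frequency factor of about 2.1 rather than 3, which dampens the influence of heavily repeated words.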

### Sentiment Analysis

In [None]:
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')

# TODO: this needs to be edited for purpose

def assign(messages):
    """Return one VADER polarity-score dict per text in `messages`."""
    sia = SentimentIntensityAnalyzer()
    return [sia.polarity_scores(message) for message in messages]
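
`polarity_scores` returns a dictionary with the keys `neg`, `neu`, `pos`, and `compound`, so the output of `assign` still has to be flattened before it can be used as numeric features. A minimal sketch of one way to do that, assuming a hypothetical helper `scores_to_rows`; the score values below are invented placeholders, not real VADER output.

```python
SCORE_KEYS = ('neg', 'neu', 'pos', 'compound')


def scores_to_rows(sentiments):
    """Turn a list of VADER-style score dicts into fixed-order numeric rows.

    `scores_to_rows` is a hypothetical helper, not part of the original notebook.
    """
    return [[scores[key] for key in SCORE_KEYS] for scores in sentiments]


# Placeholder dicts shaped like polarity_scores output (values are made up).
example_scores = [
    {'neg': 0.1, 'neu': 0.7, 'pos': 0.2, 'compound': 0.3},
    {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0},
]
example_rows = scores_to_rows(example_scores)
```

The resulting rows could then be wrapped with something like `pd.DataFrame(example_rows, columns=list(SCORE_KEYS))` to give one sentiment row per text chunk.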