# Notebook 4: Modeling

In this notebook I develop two separate approaches to modeling the canon and making a path through it: (1) topic analysis and (2) a sutta recommendation engine.

The logic behind the first approach is that if the model could identify clear topics, users at the front end could navigate the canon by selecting topics of interest and their corresponding suttas. As the analysis below shows, this did not pan out: the words the models use to describe each topic are not interpretable.

The second approach was far more effective, particularly with the BERT production model. With this model in hand, I can develop a front end interface such that users can input a sutta with content that is interesting to them and get another sutta with similar content. For a collection of teachings that is not formally arranged or structured (in contrast to the Bible, say), this is a serious benefit and allows a user to make a path from any entrance point.

In [24]:
import pandas as pd
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import gensim
import gensim.corpora as corpora
import nltk
nltk.download('punkt')

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
from gensim.utils import simple_preprocess
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from gensim import models
from pprint import pprint
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import euclidean_distances
from sentence_transformers import SentenceTransformer
from nltk.corpus import stopwords

[nltk_data] Downloading package punkt to /Users/ae-j/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [4]:
df = pd.read_csv('./sutta_csv/cleaned/df_all_prep.csv')

# Topic Modeling Part 1 (T1): Topic Analysis

Here I develop three topic analysis models: one using TruncatedSVD with TFIDF, one using Latent Dirichlet Allocation (LDA), and one using LDA with TFIDF. Given the large computation requirements, this was originally developed in Google Colab. The LDA models were evaluated with a coherence score.

This was unfortunately a bit of a bust. The topics the models identified were not particularly interpretable, and given the aim of the project the results were not usable (despite a decent coherence score on the LDA model with TFIDF).

### (T1.1) Topic Analysis using TFIDF and TruncatedSVD

In [5]:
#Instantiate and fit tfidfvectorizer

vectorizer = TfidfVectorizer(stop_words='english', max_features= 3000)

X = vectorizer.fit_transform(df['text_full'])

#### Using TruncatedSVD 
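As a minimal sketch of the idea behind the next two cells, assuming a tiny hand-made matrix (the `terms` and `X_toy` values are invented for illustration and do not come from the sutta corpus), the top terms of each SVD component can be read off by sorting the component weights. Plain `np.linalg.svd` is used here in place of `TruncatedSVD` to keep the example dependency-light.

```python
import numpy as np

# Hypothetical toy data: 4 documents x 5 terms with TF-IDF-like weights.
terms = np.array(["monk", "mind", "cessation", "feeling", "path"])
X_toy = np.array([
    [0.9, 0.4, 0.0, 0.0, 0.1],
    [0.8, 0.5, 0.1, 0.0, 0.0],
    [0.0, 0.1, 0.9, 0.7, 0.0],
    [0.1, 0.0, 0.8, 0.6, 0.2],
])

def top_terms(matrix, vocabulary, n_components=2, n_terms=3):
    """Return the n_terms highest-weighted terms for each SVD component."""
    # Rows of vt are component directions in term space,
    # analogous to TruncatedSVD.components_.
    _, _, vt = np.linalg.svd(matrix, full_matrices=False)
    topics = []
    for component in vt[:n_components]:
        # SVD signs are arbitrary: orient each component so that its
        # largest-magnitude weight is positive before sorting.
        if abs(component.min()) > component.max():
            component = -component
        order = np.argsort(component)[::-1][:n_terms]
        topics.append([str(vocabulary[i]) for i in order])
    return topics

toy_topics = top_terms(X_toy, terms)
for i, words in enumerate(toy_topics):
    print(f"Topic {i}: {', '.join(words)}")
```

Because every component assigns a weight to every term, frequent words such as "blessed" or "monk" can surface in many topics at once, which is one reason the lists produced this way can be hard to tell apart.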

In [6]:
# Base code developed from the following blog, modified to fit requirements here. - https://www.analyticsvidhya.com/blog/2018/10/stepwise-guide-topic-modeling-latent-semantic-analysis/

#Instantiating and fitting model
svd_model = TruncatedSVD(n_components=20, algorithm='randomized', 
                         n_iter=100, random_state=42)

svd_model.fit(X)

#Topic retrieval 
len(svd_model.components_)

20

In [7]:
## Print out the relevant topics

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
terms = vectorizer.get_feature_names_out()

for i, comp in enumerate(svd_model.components_):
    terms_comp = zip(terms, comp)
    sorted_terms = sorted(terms_comp, key= lambda x:x[1], reverse=True)[:7]
    print("Topic "+str(i)+": ")
    for t in sorted_terms:
        print(t[0])
        print(" ")

Topic 0: 
blessed
 
monk
 
mind
 
dhamma
 
said
 
monks
 
having
 
Topic 1: 
cessation
 
consciousness
 
comes
 
requisite
 
condition
 
feeling
 
form
 
Topic 2: 
comes
 
blessed
 
cessation
 
requisite
 
condition
 
brahman
 
gotama
 
Topic 3: 
inconstant
 
blessed
 
perception
 
self
 
consciousness
 
ven
 
form
 
Topic 4: 
ven
 
blessed
 
awakening
 
occasion
 
factor
 
cessation
 
mindfulness
 
Topic 5: 
mind
 
thag
 
passion
 
kn
 
desire
 
defilement
 
body
 
Topic 6: 
stress
 
path
 
thag
 
kn
 
leading
 
perception
 
truth
 
Topic 7: 
awakening
 
factor
 
inconstant
 
monk
 
qualities
 
factors
 
self
 
Topic 8: 
stress
 
mind
 
defilement
 
passion
 
noble
 
desire
 
regard
 
Topic 9: 
misconduct
 
person
 
inconstant
 
body
 
good
 
evil
 
bodily
 
Topic 10: 
awakening
 
factor
 
māra
 
brahman
 
gotama
 
master
 
dependent
 
Topic 11: 
faculty
 
monk
 
māra
 
cognizable
 
evil
 
intellect
 
eye
 
Topic 12: 
dimension
 
māra
 
evil
 
person
 
perception
 
quality
 
nun
 
Top

## (T1.2) Topic Analysis with LDA

Using TruncatedSVD with TFIDF produced fairly unintelligible topic word clusters. In this section I develop two LDA models, one using just a bag of words on tokenized text and one using TFIDF vectors. Some base code and inspiration comes from this [blog](https://highdemandskills.com/topic-modeling-lda/).
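For readers unfamiliar with the bag-of-words input, here is a minimal sketch in plain Python. The helpers `build_vocab` and `to_bow` and the sample tokens are hypothetical; gensim's `Dictionary.doc2bow` produces the same kind of `(token_id, count)` pairs but assigns ids in its own way.

```python
from collections import Counter

def build_vocab(docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Encode a token list as sorted (token_id, count) pairs."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())

# Hypothetical tokenized documents, already stripped of stop words.
toy_docs = [["blessed", "one", "said", "monks"], ["monks", "mind", "monks"]]
toy_vocab = build_vocab(toy_docs)
toy_corpus = [to_bow(doc, toy_vocab) for doc in toy_docs]
print(toy_corpus)
```

Word order is discarded in this representation; only the counts per document survive, which is all LDA needs.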

### (T1.2.1) LDA with Bag of Words

In [25]:
## Preprocessing a second time to correct string error
df['text_token_2'] = [simple_preprocess(sutta, deacc=True) for sutta in df['text_full']] 

stop_words = stopwords.words('english')

# remove stop-words
df['text_no_stop_2'] = df['text_token_2'].apply(lambda x: [item for item in x if item not in stop_words])

In [26]:
## Creating a list of all the tokenized suttas
text_for_dic = df['text_no_stop_2'].tolist()

In [27]:
text_for_dic

[['heard',
  'one',
  'occasion',
  'blessed',
  'one',
  'staying',
  'near',
  'ukkattha',
  'shade',
  'royal',
  'sal',
  'tree',
  'blessed',
  'forest',
  'addressed',
  'monks',
  'monks',
  'yes',
  'lord',
  'monks',
  'responded',
  'blessed',
  'one',
  'said',
  'monks',
  'teach',
  'sequence',
  'root',
  'phenomena',
  'root',
  'sequence',
  'phenomena',
  'listen',
  'pay',
  'close',
  'attention',
  'speak',
  'say',
  'lord',
  'responded',
  'blessed',
  'one',
  'said',
  'case',
  'monks',
  'uninstructed',
  'run',
  'mill',
  'person',
  'regard',
  'noble',
  'ones',
  'well',
  'versed',
  'disciplined',
  'dhamma',
  'regard',
  'people',
  'integrity',
  'well',
  'versed',
  'disciplined',
  'dhamma',
  'perceives',
  'earth',
  'earth',
  'perceiving',
  'earth',
  'earth',
  'supposes',
  'things',
  'earth',
  'supposes',
  'things',
  'earth',
  'supposes',
  'things',
  'coming',
  'earth',
  'supposes',
  'earth',
  'mine',
  'delights',
  'earth',
 

In [29]:
#Creating dictionary and bag of words manipulated corpus

ID2word = corpora.Dictionary(text_for_dic)
texts = df['text_no_stop_2']
corpus = [ID2word.doc2bow(sutta) for sutta in texts]

In [30]:
# Train LDA model on the corpus generated above
lda_model = gensim.models.LdaMulticore(corpus=corpus, num_topics=5, id2word=ID2word, passes=100)

# View topics from the lda model
pprint(lda_model.print_topics(num_words=5))

[(0,
  '0.028*"one" + 0.011*"said" + 0.010*"blessed" + 0.010*"pleasure" + '
  '0.009*"perception"'),
 (1,
  '0.014*"mind" + 0.014*"one" + 0.008*"would" + 0.008*"body" + 0.007*"life"'),
 (2,
  '0.019*"cessation" + 0.019*"self" + 0.016*"consciousness" + 0.015*"feeling" '
  '+ 0.015*"one"'),
 (3,
  '0.048*"one" + 0.025*"blessed" + 0.013*"dhamma" + 0.011*"said" + '
  '0.009*"monks"'),
 (4,
  '0.018*"monk" + 0.016*"one" + 0.013*"mind" + 0.011*"right" + '
  '0.010*"qualities"')]


In [31]:
#Use the coherence score to evaluate the effectiveness of the model.
coherence_model_lda = gensim.models.CoherenceModel(model=lda_model, texts=df['text_no_stop_2'], dictionary=ID2word, coherence='c_v')

#Print Coherence
coherence_lda = coherence_model_lda.get_coherence()
print('-'*50)
print('\nCoherence Score:', coherence_lda)
print('-'*50)

--------------------------------------------------

Coherence Score: 0.482213244293822
--------------------------------------------------


### (T1.2.2) LDA Bag of Words and TFIDF

This yields a coherence score of about .49, roughly equivalent to the .48 of the bag-of-words model, and topic words that are similarly difficult to interpret.
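One plausible reason the TFIDF topics are dominated by rare words is that TFIDF pushes the weight of terms appearing in most documents toward zero. The sketch below assumes a plain `tf * log2(N / df)` weighting with no normalization and uses made-up documents; the helper `tfidf_weights` is hypothetical, and gensim's `TfidfModel` additionally normalizes each document vector by default.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {token: weight} dict per document using tf * log2(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each token occur?
    doc_freq = Counter(token for doc in docs for token in set(doc))
    weighted = []
    for doc in docs:
        term_freq = Counter(doc)
        weighted.append({
            token: count * math.log2(n_docs / doc_freq[token])
            for token, count in term_freq.items()
        })
    return weighted

# Hypothetical tokenized documents.
toy_docs = [
    ["one", "monk", "mind"],
    ["one", "monk", "cessation"],
    ["one", "alavaka"],
    ["one", "mind"],
]
toy_weights = tfidf_weights(toy_docs)
print(toy_weights[2])
```

In this toy corpus "one" occurs in every document and receives a weight of zero, while the name "alavaka", which occurs once, receives the largest weight.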

In [34]:
## Use TFIDF in conjunction with the text 
corpus = [ID2word.doc2bow(sutta) for sutta in texts]
TFIDF = models.TfidfModel(corpus)
text_tfidf = TFIDF[corpus]

## Train LDA model on new TFIDF corpus
lda_modeltf = gensim.models.LdaMulticore(corpus=text_tfidf, num_topics=5, id2word=ID2word, passes=100)

## Print topics from new model
pprint(lda_modeltf.print_topics(num_words=5))

[(0,
  '0.002*"monk" + 0.002*"mind" + 0.002*"dhamma" + 0.002*"one" + '
  '0.002*"cessation"'),
 (1,
  '0.000*"cleanliness" + 0.000*"disintegrates" + 0.000*"ratthapala" + '
  '0.000*"stressfulness" + 0.000*"pancasala"'),
 (2,
  '0.001*"period" + 0.001*"thousandth" + 0.001*"brightness" + 0.000*"knock" + '
  '0.000*"prosper"'),
 (3,
  '0.001*"dearer" + 0.001*"transgression" + 0.001*"rare" + 0.000*"injurious" + '
  '0.000*"exploration"'),
 (4,
  '0.001*"aimed" + 0.001*"alavaka" + 0.000*"bahiya" + 0.000*"resilient" + '
  '0.000*"wholly"')]


In [35]:
# Set up coherence model
coherence_model_lda = gensim.models.CoherenceModel(model=lda_modeltf, texts=df['text_no_stop_2'], dictionary=ID2word, coherence='c_v')

# Print coherence
coherence_lda = coherence_model_lda.get_coherence()
print('-'*50)
print('\nCoherence Score:', coherence_lda)
print('-'*50)

--------------------------------------------------

Coherence Score: 0.49330696646548644
--------------------------------------------------


# Content Modeling (C1) - Similarities

Given the dead end from topic analysis, I started looking at an alternative method for navigating the suttas. A recommendation engine would allow a user to enter the canon at any sutta and get a sutta to follow. This could be repeated a number of times, creating a sort of 'choose your own adventure' path through the suttas. Some base code and inspiration for this modeling section comes from this [blog](https://towardsdatascience.com/calculating-document-similarities-using-bert-and-other-models-b2c1a29c9630). 
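The path idea can be sketched as repeatedly following the most similar sutta that has not yet been visited. This is a hypothetical illustration only: `build_path` and the toy similarity matrix are assumptions, not code from the project.

```python
def build_path(similarity, start, steps):
    """Greedily follow the most similar unvisited item, starting from `start`."""
    path = [start]
    visited = {start}
    current = start
    for _ in range(steps):
        candidates = [
            (score, idx)
            for idx, score in enumerate(similarity[current])
            if idx not in visited
        ]
        if not candidates:
            break  # every item has already been visited
        _, current = max(candidates)
        path.append(current)
        visited.add(current)
    return path

# Hypothetical symmetric similarity matrix for four documents.
toy_similarity = [
    [1.0, 0.8, 0.1, 0.3],
    [0.8, 1.0, 0.2, 0.6],
    [0.1, 0.2, 1.0, 0.4],
    [0.3, 0.6, 0.4, 1.0],
]
toy_path = build_path(toy_similarity, start=0, steps=3)
print(toy_path)
```

Excluding visited items keeps the path from bouncing back and forth between two suttas that are each other's nearest neighbour.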

### (C1.1) Using TFIDF

First, I use TFIDF to generate the five most similar documents by either dot product or Euclidean distance. Performance is very poor, with large Euclidean distances and cosine similarities of about .2 in most cases.
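The dot product can stand in for cosine similarity here because `TfidfVectorizer` L2-normalizes each row by default. A minimal sketch with made-up vectors (the `raw` array is a hypothetical stand-in for unnormalized document vectors) shows that the two quantities coincide once rows have unit length:

```python
import numpy as np

rng = np.random.default_rng(0)
raw = rng.random((3, 5))  # hypothetical unnormalized document vectors

# Dot products between rows scaled to unit length.
unit = raw / np.linalg.norm(raw, axis=1, keepdims=True)
dot_products = unit @ unit.T

# Cosine similarity computed directly from the raw vectors.
norms = np.linalg.norm(raw, axis=1)
cosine = (raw @ raw.T) / np.outer(norms, norms)

print(np.allclose(dot_products, cosine))
```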

In [63]:
#Instantiate and fit tfidfvectorizer

vectorizer = TfidfVectorizer(stop_words='english', max_features= 3000)
X = vectorizer.fit_transform(df['text_full'])

In [64]:
# Dot product and euclidean distance on the TFIDF vectorized text

# X is a scipy sparse matrix, so use the @ operator for the matrix product
pairwise_similarities=(X @ X.T).toarray()
pairwise_differences=euclidean_distances(X)

In [65]:
#Function to print similarities
def most_similar(ref_num, similarity_matrix, matrix):
    print (f'Document: {df.iloc[ref_num]["title"]}')
    print ('\n')
    print ('Similar Documents:')
    if matrix=='Cosine Similarity':
        similar_ix=np.argsort(similarity_matrix[ref_num])[::-1][1:6]
    elif matrix=='Euclidean Distance':
        similar_ix=np.argsort(similarity_matrix[ref_num])[1:6]
    for ix in similar_ix:
        print('\n')
        print (f'Title: {df.iloc[ix]["title"]}')
        print (f'Ref: {df.iloc[ix]["ref"]}')
        print (f'{matrix} : {similarity_matrix[ref_num][ix]}')

most_similar(0, pairwise_similarities,'Cosine Similarity')
most_similar(0, pairwise_differences,'Euclidean Distance')

Document: MN 1  Mūlapariyāya Sutta | The Root Sequence


Similar Documents:


Title: AN 10:6 Samādhi Sutta | Concentration
Ref: AN 10:6
Cosine Similarity : 0.38089015461814496


Title: AN 10:7 Sāriputta Sutta | With Sāriputta
Ref: AN 10:7
Cosine Similarity : 0.35099032923755885


Title: Ud 8:1  Nibbāna Sutta | Unbinding (1)
Ref: Ud 8:1
Cosine Similarity : 0.3159798777873003


Title: AN 4:24 Kāḷaka Sutta | At Kāḷaka’s Park
Ref: AN 4:24
Cosine Similarity : 0.2639282043281809


Title: SN 6:15  Parinibbāna Sutta | Total Unbinding
Ref: SN 6:15
Cosine Similarity : 0.2526533180612764


### (C1.2) Using Doc2Vec

Doc2Vec performs slightly better on average than TFIDF, with lower Euclidean distances and higher cosine similarities (about .5 v .3). This is still not satisfactory, nor something on which I would feel comfortable building a user interface.
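The two metrics need not rank neighbours identically, because Euclidean distance is sensitive to vector length while cosine similarity is not. A small sketch with invented 2-D vectors (not Doc2Vec output) makes the difference visible:

```python
import numpy as np

query = np.array([1.0, 0.0])
candidates = np.array([
    [10.0, 1.0],  # nearly the same direction as the query, but much longer
    [0.6, 0.6],   # different direction, similar length
])

# Cosine similarity ignores length and compares direction only.
cos = candidates @ query / (
    np.linalg.norm(candidates, axis=1) * np.linalg.norm(query)
)
# Euclidean distance depends on both direction and length.
dist = np.linalg.norm(candidates - query, axis=1)

best_by_cosine = int(np.argmax(cos))
best_by_distance = int(np.argmin(dist))
print(best_by_cosine, best_by_distance)
```

Here the first candidate wins on cosine similarity and the second wins on Euclidean distance, so the two top-five lists for one document can legitimately disagree.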

In [39]:
# Converting suttas into form that doc2vec expects
tagged_suttas = [TaggedDocument(words = word_tokenize(sutta), tags=[i]) for i, sutta in enumerate(df['text_full'])]
model_d2v = Doc2Vec(vector_size=100, alpha=0.025, min_count=1)
  
model_d2v.build_vocab(tagged_suttas)

In [67]:
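# Training model on tagged documents
# Note: each train() call runs model_d2v.epochs passes over the corpus,
# so this loop makes 100 * model_d2v.epochs passes in total
for epoch in range(100):
    model_d2v.train(tagged_suttas,
                total_examples=model_d2v.corpus_count,
                epochs=model_d2v.epochs)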

# Using doc2vec to generate document embeddings
document_embeddings=np.zeros((df.shape[0],100))

for i in range(len(document_embeddings)):
    document_embeddings[i]=model_d2v.dv[i]



In [68]:
#Using cosine_similarity and euclidean distances on d2v document embeddings
pairwise_similarities=cosine_similarity(document_embeddings)
pairwise_differences=euclidean_distances(document_embeddings)

In [69]:
# Returning d2v similarity recommendations
most_similar(0, pairwise_similarities, 'Cosine Similarity')
most_similar(0, pairwise_differences, 'Euclidean Distance')

Document: MN 1  Mūlapariyāya Sutta | The Root Sequence


Similar Documents:


Title: Thag 1:7  Bhalliya
Ref: Thag 1:7
Cosine Similarity : 0.5155920011135091


Title: SN 35:19  Abhinanda Sutta | Delight (1)
Ref: SN 35:19
Cosine Similarity : 0.5098645070675596


Title: SN 35:20  Abhinanda Sutta | Delight (2)
Ref: SN 35:20
Cosine Similarity : 0.4990225322757579


Title: Iti 94  —  On having consciousness neither externally scattered and diffused, nor internally positioned.
Ref: Iti 94
Cosine Similarity : 0.49686554426645824


Title: AN 5:140 Sotar Sutta | The Listener
Ref: AN 5:140
Cosine Similarity : 0.47250562109315414
Document: MN 1  Mūlapariyāya Sutta | The Root Sequence


Similar Documents:


Title: Thag 1:7  Bhalliya
Ref: Thag 1:7
Euclidean Distance : 29.26128388640226


Title: AN 5:140 Sotar Sutta | The Listener
Ref: AN 5:140
Euclidean Distance : 30.813677886439496


Title: Thag 1:2  Mahā Koṭṭhita
Ref: Thag 1:2
Euclidean Distance : 30.888478115246578


Title: Ud 8:6  Pāṭaligāma Sut

### (C1.3) Using BERT Model

With this final model, the input is no longer a row index but the sutta ref itself. There is a significant jump in cosine similarity (.9+ for top results) and a decrease in Euclidean distance. This is far and away the best model, and the one developed further in notebook five as the production model.

In [53]:
#Function to recommend suttas using a pretrained Sentence-BERT model (BERT base, NLI, mean pooling)
def bert_mod(sutta_ref):
    sbert_model = SentenceTransformer('bert-base-nli-mean-tokens')

    #replacing the regular space in the ref with \u2009 (thin space) so the lookup matches the refs stored in df['ref']
    ref = sutta_ref.replace(' ', '\u2009')
    sutta_index = df.index[df['ref'] == ref].tolist()
    if not sutta_index:
        raise ValueError(f'No sutta found with ref {sutta_ref!r}')

    #show_progress_bar is an argument of encode(), not of the SentenceTransformer constructor
    document_embeddings = sbert_model.encode(df['text_full'].tolist(), show_progress_bar=False)

    # Redefining with sbert generated doc embeddings
    pairwise_similarities=cosine_similarity(document_embeddings)
    pairwise_differences=euclidean_distances(document_embeddings)
    
    #bringing in most_similar function 
    most_similar(sutta_index[0], pairwise_similarities,'Cosine Similarity')
    most_similar(sutta_index[0], pairwise_differences,'Euclidean Distance')
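
Because `bert_mod` loads the model and re-encodes the whole corpus on every call, one possible refinement is to encode once and reuse the similarity matrix. The sketch below is an illustration under stated assumptions: `SuttaIndex`, `fake_encode`, and the sample refs and texts are hypothetical, and `fake_encode` merely stands in for a real encoder such as `sbert_model.encode`.

```python
import numpy as np

class SuttaIndex:
    """Precompute cosine similarities once, then answer lookups by ref."""

    def __init__(self, refs, texts, encode):
        # Replace regular spaces with thin spaces (U+2009),
        # mirroring the replacement done in bert_mod.
        self.refs = [r.replace(" ", "\u2009") for r in refs]
        embeddings = np.asarray(encode(texts), dtype=float)
        unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        self.similarity = unit @ unit.T

    def most_similar(self, ref, top_n=5):
        """Return (ref, cosine similarity) pairs, excluding the query itself."""
        idx = self.refs.index(ref.replace(" ", "\u2009"))
        order = np.argsort(self.similarity[idx])[::-1]
        return [(self.refs[i], float(self.similarity[idx, i]))
                for i in order if i != idx][:top_n]

def fake_encode(texts):
    """Hypothetical stand-in encoder: shifted counts of three letters per text."""
    return [[t.count("a") + 1, t.count("e") + 1, t.count("o") + 1] for t in texts]

index = SuttaIndex(["MN 1", "MN 2", "SN 1:1"], ["aaa", "aae", "ooo"], fake_encode)
print(index.most_similar("MN 1", top_n=2))
```

Passing the encoder in as a function keeps the expensive model load outside the lookup path and lets the ranking logic be exercised without downloading a model.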