# Notebook 1: Introduction and Cleaning Data

### **Introduction**

The Pali Canon is the foundational collection of texts of the Theravada Buddhist religion. It contains the oldest known records of the Buddha's teachings, which were maintained orally and then compiled in written form about 500 years after the Buddha's death.

The Canon is a diverse collection of works that documents teachings, stories, exclamations, quotes and poetry, grouped into 5 separate collections by later compilers. Although it is little known and little studied in the West, the Canon and the commentaries on it form the core of the religion for hundreds of millions of Theravada Buddhists, particularly in Southeast Asia.

Although translations from Pali (an ancient Sanskritic language) have existed for over one hundred years, they were often made by scholars who were not steeped in the living Buddhist monastic culture and discipline, and often by those who did not practice. As such, it is unclear how often early translators had experiential insight into the often complex phenomena and concepts represented in the Canon. Such insight is undoubtedly important for accurately rendering a dead language (one that contains many words and concepts with no direct equivalent in English) and, in turn, for outlining a path of practice to an unconditioned happiness that is as alive today as it was in the time of the Buddha.

In recent years, however, as a result of an enormous effort by several English-speaking Buddhist monks, a large portion of the Pali Canon has been translated and made available online. The suttas that are the data for this project come from www.dhammatalks.org, which hosts suttas translated by Ajahn Geoff, a monk of nearly 45 years in the Kammathana (Thai Forest Tradition) lineage. He has significant experience in translating from both Pali and Thai and is an inspiring monk in conduct and learnedness.

### **Problem Statement**

The purpose of this project is two-fold:
1. To carry out a substantial, public-facing Natural Language Processing analysis of the Pali Canon. To my knowledge, an investigation at this scale has never been conducted before. Given how recently strong English translations of the Canon became available, and how recent the Machine Learning algorithms employed here are, the absence of an existing analysis at this level is less surprising than it might initially appear. Furthermore, the overlap between lay Theravada Buddhists (non-monks) who are dedicated to reading the original texts (rather than 'Dhamma' books by lay 'Dhamma teachers') and people who understand the tools needed for this analysis is probably quite small.

2. To develop a recommendation algorithm for suttas that could be used to support the development of particular mental qualities, themes and understandings within the religion. One could consider this a sort of 'Netflix' for information on how to develop along a path to an unconditioned happiness. For a rough idea of the functionality, imagine a user who inputs a sutta on a theme they are interested in learning more about or developing further, and receives back five suttas with similar content.
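
The "input one sutta, receive similar suttas" idea can be illustrated with a minimal sketch. This is not the method used later in the project, which this notebook does not specify. It assumes plain bag-of-words counts and cosine similarity, and the function names and toy corpus are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(query_ref, corpus, k=5):
    """Return the k refs whose text is most similar to the query sutta."""
    vectors = {ref: Counter(tokenize(text)) for ref, text in corpus.items()}
    query = vectors[query_ref]
    scores = [(cosine(query, vec), ref)
              for ref, vec in vectors.items() if ref != query_ref]
    # Highest similarity first; ties keep their original order
    scores.sort(key=lambda pair: pair[0], reverse=True)
    return [ref for _, ref in scores[:k]]


# Hypothetical toy corpus; the real input would be the cleaned sutta texts
toy_corpus = {
    "A": "the eye is inconstant the ear is inconstant",
    "B": "forms are inconstant sounds are inconstant",
    "C": "goodwill for all beings without limit",
}
top = recommend("A", toy_corpus, k=2)
```

Here text "B" ranks above "C" for query "A" because it shares the word "inconstant". A real recommender would likely need term weighting and stop-word handling on top of this.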

### **Technical Introduction to Notebook One**

As mentioned above, the data for this project comes from roughly 30 scrapes of www.dhammatalks.org made with the [Octoparse](https://www.octoparse.com) web-scraping tool. A tool was used instead of a hand-coded scraper because of the structure of the website and the variety of ways in which it presents the text.

This notebook creates thirteen dataframes in total: one for each of the four 'stand-alone' compilations; one for each of the seven sub-collections of the fifth compilation (Khuddaka Nikaya); one for the entirety of the Khuddaka Nikaya; and one for all five compilations.

# **Merging the first four collections: MN, DN, SN, AN**

These are fairly straightforward to work with as they are all in roughly the same format.

#### Imports

In [1]:
import pandas as pd
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
import re

## 1. Majjhima Nikaya (MN)

In [3]:
def low_rep(df):
    df.columns = df.columns.str.lower()
    df.columns = df.columns.str.replace(" ", "_")
    return df

In [4]:
df_mnf = pd.read_csv('./sutta_csv/scraped/mn_full.csv')
df_mnt = pd.read_csv('./sutta_csv/scraped/mn_text_2.csv')

low_rep(df_mnf);
low_rep(df_mnt);
df_mn = pd.merge(left = df_mnf, right = df_mnt, on='title_url')
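
`pd.merge` performs an inner join by default, so any `title_url` present in only one of the two scraped files is dropped silently. The sketch below shows one way such rows could be surfaced before merging. The two toy frames are hypothetical stand-ins for the scraped tables.

```python
import pandas as pd

# Hypothetical stand-ins for the "full" and "text" scrapes
left = pd.DataFrame({"title_url": ["u1", "u2", "u3"], "title": ["a", "b", "c"]})
right = pd.DataFrame({"title_url": ["u1", "u2", "u4"], "sutta_text": ["x", "y", "z"]})

# An outer join with indicator=True adds a '_merge' column recording
# whether each key was found in the left frame, the right frame, or both
check = pd.merge(left, right, on="title_url", how="outer", indicator=True)

# Rows that an inner join would have discarded
unmatched = check[check["_merge"] != "both"]
unmatched_urls = sorted(unmatched["title_url"])
```

An empty `unmatched` frame would indicate that the inner join loses nothing.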

In [5]:
## Function to clean up the navigation bar that was scraped in the middle of the text

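def nav_bar(df):
    # Keep only the text after the first 'Suttas/' breadcrumb.
    # n is passed by keyword because pandas 2.x rejects it as a positional argument.
    df[['A', 'text_full']] = df['sutta_text'].str.split('Suttas/', n=1, expand=True)
    df.drop(columns=['A', 'sutta_text'], inplace=True)
    # Store the original row order in an 'index' column, used later by reorder()
    df.reset_index(inplace=True)
    df = df.dropna().copy()
    return df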
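
The behaviour of `str.split` with `n=1` and `expand=True` is easiest to see on a toy column. This is a small sketch with a hypothetical two-row Series, not the scraped data.

```python
import pandas as pd

# Hypothetical toy column: one row with the marker twice, one row without it
texts = pd.Series(["nav bar Suttas/ body Suttas/ more", "no marker here"])

# n=1 splits at the first occurrence only; expand=True returns one column per piece
parts = texts.str.split("Suttas/", n=1, expand=True)

first_row = parts.iloc[0].tolist()
# A row without the marker has a missing value in the second column
second_row_has_text = bool(pd.notna(parts.iloc[1, 1]))
```

Because a row without the marker ends up with a missing second piece, the later `dropna()` removes it. It may be worth comparing row counts before and after the cleaning functions.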

In [6]:
#Splitting introductory notes from main sutta body
def split_intro(df):
    # '*' is a regex metacharacter, so it is replaced as a literal string
    df['text_full'] = df['text_full'].str.replace('*', 'REMOVE', regex=False)
    has_intro = df['text_full'].str.contains('REMOVE', na=False)
    df_intro = df[has_intro].copy()
    df_no_intro = df[~has_intro].copy()
    try:
        df_intro['intro'] = df_intro['text_full'].str.split('REMOVE\n', n=1, expand=True)[0]
        df_intro['text_full'] = df_intro['text_full'].str.split('REMOVE', n=1, expand=True)[1]
    except Exception:
        # Fallback for when the split does not produce the expected columns
        df_intro['intro'] = 'No Introduction'
    df_no_intro['intro'] = 'No Introduction'
    df = pd.concat([df_intro, df_no_intro])
    # Remove any separator markers left over after the split
    df['text_full'] = df['text_full'].str.replace('REMOVE', '', regex=False)
    df['intro'] = df['intro'].str.replace('REMOVE', '', regex=False)
    df = df.dropna().copy()
    return df

In [7]:
# Splitting concluding notes from main sutta body
def split_notes(df):
    has_notes = df["text_full"].str.contains("Note", na=False)
    df_notes = df[has_notes].copy()
    df_no_notes = df[~has_notes].copy()
    try:
        # Everything after the first occurrence of "Note" is treated as notes
        df_notes[["text_full", "notes"]] = df_notes["text_full"].str.split("Note", n=1, expand=True)
    except Exception:
        # Fallback for when the split does not produce two columns
        df_notes['notes'] = 'No Notes'
    df_no_notes["notes"] = "No Notes"
    df = pd.concat([df_notes, df_no_notes])
    df = df.dropna().copy()
    return df

In [8]:
# Splitting the author's 'see also' recommendation from the main sutta body

def split_see(df):
    has_see = df["text_full"].str.contains("See also", na=False)
    df_see = df[has_see].copy()
    df_no_see = df[~has_see].copy()
    try:
        # Everything after the first "See also" is treated as the recommendation
        df_see[["text_full", "see_also"]] = df_see["text_full"].str.split("See also", n=1, expand=True)
    except Exception:
        # Fallback for when the split does not produce two columns
        df_see["see_also"] = 'No Additional'
    df_no_see["see_also"] = "No Additional"
    df = pd.concat([df_see, df_no_see])
    df = df.dropna().copy()
    return df

In [9]:
#Reordering in preparation for concatenation
def reorder(df):
    df = df.sort_values(by='index', ascending=True)
    df = df.drop(columns = 'index')
    df = df[['title', 'ref', 'title_url', 'summary', 
             'text_full', 'intro', 'notes', 'see_also']]
    df = df.dropna().copy()
    return df

In [10]:
#Running all functions together
def nav_int_notes(df):
    df = reorder(split_notes(split_see(split_intro(nav_bar(df)))))
    df = df.dropna().copy()
    return df

In [11]:
df_mn = nav_int_notes(df_mn);

In [12]:
df_mn.drop(index = [64, 68], inplace = True)

In [13]:
#df_mn.to_csv(f'./sutta_csv/cleaned/individuals_dfs/df_mn_clean.csv', index = False)

## 2. Digha Nikaya (DN)

In [14]:
df_dnf = pd.read_csv('./sutta_csv/scraped/dn_full.csv')
df_dnt = pd.read_csv('./sutta_csv/scraped/dn_text_2.csv')

low_rep(df_dnf);
low_rep(df_dnt);
df_dn = pd.merge(left = df_dnf, right = df_dnt, on='title_url')

In [15]:
df_dn = nav_int_notes(df_dn);

In [16]:
#df_dn.to_csv(f'./sutta_csv/cleaned/individuals_dfs/df_dn_clean.csv', index = False)

## 3. Samyutta Nikaya (SN)

In [17]:
df_snf = pd.read_csv('./sutta_csv/scraped/sn_full.csv')
df_snt = pd.read_csv('./sutta_csv/scraped/sn_text_2.csv')

low_rep(df_snf);
low_rep(df_snt);
df_sn = pd.merge(left = df_snf, right = df_snt, on='title_url')

#### Handling SN 25

SN 25:1-10 are the same text: each of the ten entries contains all the others, but they are stored as ten separate suttas. All but one are dropped.

In [18]:
df_sn = df_sn.drop(index=[199,200,201,202,203,204,205,206,207])

missing_text = (
    'The Eye\n Cakkhu Sutta  (SN 25:1)\n Near Sāvatthī. “Monks, the eye is inconstant, changeable, alterable. The ear… The nose… The tongue… The body… The mind is inconstant, changeable, alterable.\n Forms\n Rūpa Sutta  (SN 25:2)\n Near Sāvatthī. “Monks, forms are inconstant, changeable, alterable. Sounds.… Aromas.… Flavors.… Tactile sensations.… Ideas are inconstant, changeable, alterable.…\n Consciousness\n Viññāṇa Sutta  (SN 25:3)\n Near Sāvatthī. “Monks, eye-consciousness is inconstant, changeable, alterable. Ear-consciousness.… Nose-consciousness.… Tongue-consciousness.… Body-consciousness.… Intellect-consciousness is inconstant, changeable, alterable.…\n Contact\n Phassa Sutta  (SN 25:4)\n Near Sāvatthī. “Monks, eye-contact is inconstant, changeable, alterable. Ear-contact.… Nose-contact.… Tongue-contact.… Body-contact.… Intellect-contact is inconstant, changeable, alterable…\n Feeling\n Vedanā Sutta  (SN 25:5)\n Near Sāvatthī. “Monks, feeling born of eye-contact is inconstant, changeable, alterable. Feeling born of ear-contact.… Feeling born of nose-contact.… Feeling born of tongue-contact.… Feeling born of body-contact.… Feeling born of intellect-contact is inconstant, changeable, alterable…\n Perception\n Saññā Sutta  (SN 25:6)\n Near Sāvatthī. “Monks, perception of forms is inconstant, changeable, alterable. Perception of sounds.… Perception of smells.… Perception of tastes.… Perception of tactile sensations.… Perception of ideas is inconstant, changeable, alterable.…\n Intention\n Cetanā Sutta  (SN 25:7)\n Near Sāvatthī. “Monks, intention for forms is inconstant, changeable, alterable. Intention for sounds.… Intention for smells.… Intention for tastes.… Intention for tactile sensations.… Intention for ideas is inconstant, changeable, alterable.…\n Craving\n Taṇhā Sutta  (SN 25:8)\n Near Sāvatthī. “Monks, craving for forms is inconstant, changeable, alterable. '
    'Craving for sounds.… Craving for smells.… Craving for tastes.… Craving for tactile sensations.… Craving for ideas is inconstant, changeable, alterable.…\n “One who has conviction & belief that these phenomena are this way is called a faith-follower: one who has entered the orderliness of rightness, entered the plane of people of integrity, transcended the plane of the run-of-the-mill. He is incapable of doing any deed by which he might be reborn in hell, in the animal womb, or in the realm of hungry ghosts. He is incapable of passing away until he has realized the fruit of stream-entry.\n “One who, after pondering with a modicum of discernment, has accepted that these phenomena are this way is called a Dhamma-follower: one who has entered the orderliness of rightness, entered the plane of people of integrity, transcended the plane of the run-of-the-mill. He is incapable of doing any deed by which he might be reborn in hell, in the animal womb, or in the realm of hungry ghosts. He is incapable of passing away until he has realized the fruit of stream-entry.\n “One who knows and sees that these phenomena are this way is called a stream-enterer, steadfast, never again destined for states of woe, headed for self-awakening.”\n See also: MN 70\n Properties\n Dhātu Sutta  (SN 25:9)\n Near Sāvatthī. “Monks, the earth property is inconstant, changeable, alterable. The liquid property.… The fire property.… The wind property.… The space property.… The consciousness property is inconstant, changeable, alterable.…\n Aggregates\n Khandha Sutta  (SN 25:10)\n Near Sāvatthī. “Monks, form is inconstant, changeable, alterable. Feeling.… Perception.… Fabrications.… Consciousness is inconstant, changeable, alterable.'
)

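# Assign the result back; an in-place fillna on a selected column may not update df_sn in recent pandas
df_sn['sutta_text'] = df_sn['sutta_text'].fillna(missing_text)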
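
Hard-coded index labels such as 199-207 depend on the exact order of the scraped file. As a hedged alternative, repeated texts could be located programmatically. The sketch below assumes the duplicates are exact string matches, which may not hold for the real scrape, and its toy frame and refs are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame: three entries share one text, a fourth differs
toy = pd.DataFrame({
    "ref": ["SN 25:1", "SN 25:2", "SN 25:3", "SN 26:1"],
    "sutta_text": ["shared text", "shared text", "shared text", "other text"],
})

# duplicated() flags every repeat after the first occurrence of a text
dupes = toy[toy.duplicated(subset="sutta_text", keep="first")]

# drop_duplicates() keeps only the first occurrence of each text
deduped = toy.drop_duplicates(subset="sutta_text", keep="first")

dupe_refs = dupes["ref"].tolist()
kept_refs = deduped["ref"].tolist()
```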

In [19]:
df_sn = nav_int_notes(df_sn);

In [20]:
#df_sn.to_csv(f'./sutta_csv/cleaned/individuals_dfs/df_sn_clean.csv', index = False)

## 4. Anguttara Nikaya (AN)

In [21]:
df_anf = pd.read_csv('./sutta_csv/scraped/an_full.csv')
df_ant = pd.read_csv('./sutta_csv/scraped/an_text_2.csv')

low_rep(df_anf);
low_rep(df_ant);
df_an = pd.merge(left = df_anf, right = df_ant, on='title_url')

In [22]:
df_an = nav_int_notes(df_an);

In [23]:
#df_an.to_csv(f'./sutta_csv/cleaned/individuals_dfs/df_an_clean.csv', index = False)

# **Merging the Khuddaka Nikaya (KN)**

The Khuddaka Nikaya was much harder to scrape from the website. It is composed of 7 smaller collections, each with its own formatting. The content ranges from poetry to exclamations, quotations and stories.

## KN 01 - Khuddakapatha (Khp)

In [24]:
df_kkhpf = pd.read_csv('./sutta_csv/scraped/kn/01_kn_khp_full.csv')
df_kkhpt = pd.read_csv('./sutta_csv/scraped/kn/01_kn_khp_text.csv')

low_rep(df_kkhpf);
low_rep(df_kkhpt);
df_kkhp = pd.merge(left = df_kkhpf, right = df_kkhpt, on='title_url')

In [25]:
df_kkhp = nav_int_notes(df_kkhp);

In [26]:
#df_kkhp.to_csv(f'./sutta_csv/cleaned/individuals_dfs/df_kkhp_clean.csv', index = False)

## KN 02 - Dhammapada (Dhp)

In [27]:
df_kdhpf = pd.read_csv('./sutta_csv/scraped/kn/02_kn_dhp_full.csv')
df_kdhpt = pd.read_csv('./sutta_csv/scraped/kn/02_kn_dhp_text.csv')

low_rep(df_kdhpf);
low_rep(df_kdhpt);
df_kdhp = pd.merge(left = df_kdhpf, right = df_kdhpt, right_on='url', left_on = 'field2_links')

In [28]:
## Dropping
df_kdhp = df_kdhp.drop(columns = 'field2_links')

## Renaming
df_kdhp = df_kdhp.rename(columns = {
                            "field1": "title",
                              'url': 'title_url',
                                'text': 'sutta_text'
                               })
## Creating summary 
df_kdhp['summary'] = 'No Summary'

In [29]:
#creating reference from the first two tokens of the title (the chapter label and its number)
# regex=False treats 'Ch.' literally; the vectorised .str accessor avoids chained assignment in a loop
ref_parts = df_kdhp['title'].str.replace('Ch.', 'Dhp', regex=False).str.split()
df_kdhp['ref'] = ref_parts.str[0] + ' ' + ref_parts.str[1]

In [30]:
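# Variant of nav_int_notes that skips split_intro; the 'intro' column is filled in directly for the Dhammapada
def nav_int_notes_2(df):
    df = reorder(split_notes(split_see(nav_bar(df))))
    df = df.dropna().copy()
    return df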

In [31]:
df_kdhp['intro'] = 'No Introduction'

In [32]:
df_kdhp['title'] = 'Dhp ' + df_kdhp['title']

In [33]:
df_kdhp = nav_int_notes_2(df_kdhp);

In [34]:
#df_kdhp.to_csv(f'./sutta_csv/cleaned/individuals_dfs/df_kdhp_clean.csv', index = False)

## KN 03 - Udana (Ud)

In [35]:
df_kudf = pd.read_csv('./sutta_csv/scraped/kn/03_kn_ud_full.csv')
df_kudt = pd.read_csv('./sutta_csv/scraped/kn/03_kn_ud_text.csv')

low_rep(df_kudf);
low_rep(df_kudt);
df_kud = pd.merge(left = df_kudf, right = df_kudt, right_on='url', left_on = 'title_url')

In [36]:
## Dropping
df_kud = df_kud.drop(columns = 'url')

## Renaming
df_kud = df_kud.rename(columns = {
                            "field": "ref",
                            'text': 'sutta_text'
                               })
## Adding Summary
df_kud['summary'] = 'No Summary'

In [37]:
df_kud = nav_int_notes(df_kud);

In [38]:
#df_kud.to_csv('./sutta_csv/cleaned/individuals_dfs/df_kud_clean.csv', index = False)

## KN 04 - Itivuttaka (Iti)

In [39]:
df_kitif = pd.read_csv('./sutta_csv/scraped/kn/04_kn_iti_full.csv')
df_kitit = pd.read_csv('./sutta_csv/scraped/kn/04_kn_iti_text.csv')

low_rep(df_kitif);
low_rep(df_kitit);
df_kiti = pd.merge(left = df_kitif, right = df_kitit, right_on='url', left_on = 'field2_links')

In [40]:
## Dropping
df_kiti = df_kiti.drop(columns = 'field2_links')

## Renaming
df_kiti = df_kiti.rename(columns = {
                            "url": "title_url",
                            "field1": "title",
                            'text': 'sutta_text'
                               })

#Creating summary
df_kiti['summary'] = 'No Summary'

In [41]:
#creating reference from the part of the title before the '—' separator
# The vectorised .str accessor avoids chained assignment in a loop
df_kiti['ref'] = df_kiti['title'].str.split('—').str[0]

In [42]:
df_kiti = nav_int_notes(df_kiti);

In [43]:
#df_kiti.to_csv('./sutta_csv/cleaned/individuals_dfs/df_kiti_clean.csv', index = False)

## KN 05 - Sutta Nipata (Stnp)

In [44]:
df_kstnpf = pd.read_csv('./sutta_csv/scraped/kn/05_kn_stnp_full.csv')
df_kstnpt = pd.read_csv('./sutta_csv/scraped/kn/05_kn_stnp_text.csv')

low_rep(df_kstnpf);
low_rep(df_kstnpt);
df_kstnp = pd.merge(left = df_kstnpf, right = df_kstnpt, right_on='url', left_on = 'title_url')

In [45]:
## Dropping
df_kstnp = df_kstnp.drop(columns = 'url')

## Renaming
df_kstnp = df_kstnp.rename(columns = {
                            "unnamed:_3": "summary",
                            "field": "ref",
                            'text': 'sutta_text'
                               })


In [46]:
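#Stnp is formatted differently, so a separate function splits the introduction from the main sutta text
def split_intro_stnp(df):
    try:
        # pandas treats this multi-character pattern as a regular expression,
        # so the leading '.' matches any character, not only a full stop
        parts = df['text_full'].str.split('.\n \n ', n=1, expand=True)
        df['intro'] = parts[0]
        df['text_full'] = parts[1]
    except Exception:
        # Fallback for when the split does not produce two columns
        df['intro'] = 'No Introduction'
    df = df.dropna().copy()
    return df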

In [47]:
#Incorporating changed introduction split function
def nav_int_notes_stnp(df):
    df = reorder(split_notes(split_see(split_intro_stnp(nav_bar(df)))))
    df = df.dropna().copy()
    return df

In [49]:
df_kstnp = nav_int_notes_stnp(df_kstnp);

In [50]:
#df_kstnp.to_csv('./sutta_csv/cleaned/individuals_dfs/df_kstnp_clean.csv', index = False)

## KN 06 - Theragatha (Thag)

In [51]:
df_kthagf = pd.read_csv('./sutta_csv/scraped/kn/06_kn_thag_full.csv')
df_kthagt = pd.read_csv('./sutta_csv/scraped/kn/06_kn_thag_text.csv')

low_rep(df_kthagf);
low_rep(df_kthagt);
df_kthag = pd.merge(left = df_kthagf, right = df_kthagt, right_on='url', left_on = 'field3_links')

In [52]:
## Dropping
df_kthag = df_kthag.drop(columns = 'field3_links')

## Renaming
df_kthag = df_kthag.rename(columns = {
                            "field1": "title",
                            "field2": "summary",
                            'text': 'sutta_text',
                            'url': 'title_url'
                               })


In [53]:
# Creating ref from the first two tokens of the title
# The vectorised .str accessor avoids chained assignment in a loop
ref_parts = df_kthag['title'].str.split()
df_kthag['ref'] = ref_parts.str[0] + ' ' + ref_parts.str[1]

In [56]:
df_kthag = nav_int_notes(df_kthag)

In [57]:
#df_kthag.to_csv('./sutta_csv/cleaned/individuals_dfs/df_kthag_clean.csv', index = False)

## KN 07 - Therigatha (Thig)

In [58]:
df_kthigf = pd.read_csv('./sutta_csv/scraped/kn/07_kn_thig_full.csv')
df_kthigt = pd.read_csv('./sutta_csv/scraped/kn/07_kn_thig_text.csv')

low_rep(df_kthigf);
low_rep(df_kthigt);
df_kthig = pd.merge(left = df_kthigf, right = df_kthigt, right_on='url', left_on = 'field3_links')

In [60]:
## Dropping
df_kthig = df_kthig.drop(columns = 'field3_links')

## Renaming
df_kthig = df_kthig.rename(columns = {
                            "field1": "title",
                            "field2": "summary",
                            'text': 'sutta_text',
                            'url': 'title_url'
                               })


In [61]:
# Creating ref from the first two tokens of the title
# The vectorised .str accessor avoids chained assignment in a loop
ref_parts = df_kthig['title'].str.split()
df_kthig['ref'] = ref_parts.str[0] + ' ' + ref_parts.str[1]

In [63]:
df_kthig = nav_int_notes(df_kthig);

In [64]:
#df_kthig.to_csv('./sutta_csv/cleaned/individuals_dfs/df_kthig_clean.csv', index = False)

## Merge of Khuddaka Nikaya

In [35]:
# Directory name matches the one used by the to_csv calls above ('individuals_dfs')
df_kkhp_c = pd.read_csv('./sutta_csv/cleaned/individuals_dfs/df_kkhp_clean.csv')
df_kdhp_c = pd.read_csv('./sutta_csv/cleaned/individuals_dfs/df_kdhp_clean.csv')
df_kud_c = pd.read_csv('./sutta_csv/cleaned/individuals_dfs/df_kud_clean.csv')
df_kiti_c = pd.read_csv('./sutta_csv/cleaned/individuals_dfs/df_kiti_clean.csv')
df_kstnp_c = pd.read_csv('./sutta_csv/cleaned/individuals_dfs/df_kstnp_clean.csv')
df_kthag_c = pd.read_csv('./sutta_csv/cleaned/individuals_dfs/df_kthag_clean.csv')
df_kthig_c = pd.read_csv('./sutta_csv/cleaned/individuals_dfs/df_kthig_clean.csv')

In [36]:
df_kn = pd.concat([df_kkhp_c, df_kdhp_c, df_kud_c, df_kiti_c, df_kstnp_c, df_kthag_c, df_kthig_c], axis=0)

In [37]:
df_kn.to_csv('./sutta_csv/cleaned/df_kn_clean.csv', index = False)

## Merge of all five Nikayas

In [38]:
df_mn_c = pd.read_csv('./sutta_csv/cleaned/individuals_dfs/df_mn_clean.csv')
df_sn_c = pd.read_csv('./sutta_csv/cleaned/individuals_dfs/df_sn_clean.csv')
df_dn_c = pd.read_csv('./sutta_csv/cleaned/individuals_dfs/df_dn_clean.csv')
df_an_c = pd.read_csv('./sutta_csv/cleaned/individuals_dfs/df_an_clean.csv')
# The combined KN file was written to './sutta_csv/cleaned/' in the previous section
df_kn_c = pd.read_csv('./sutta_csv/cleaned/df_kn_clean.csv')

In [39]:
df_mn_c['nikaya'] = 'Majjhima Nikaya'
df_sn_c['nikaya'] = 'Samyutta Nikaya'
df_dn_c['nikaya'] = 'Digha Nikaya'
df_an_c['nikaya'] = 'Anguttara Nikaya'
df_kn_c['nikaya'] = 'Khuddaka Nikaya'

In [40]:
df_all = pd.concat([df_mn_c, df_sn_c, df_dn_c, df_an_c, df_kn_c], axis=0)
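
`pd.concat` keeps each input frame's own index labels, so the combined frame can contain repeated labels. The sketch below uses two hypothetical toy frames to show the difference `ignore_index=True` makes. Whether a unique index matters depends on how the combined frame is used. Files written with `index = False` do not store the index at all.

```python
import pandas as pd

# Hypothetical toy frames standing in for two cleaned collections
a = pd.DataFrame({"ref": ["MN 1", "MN 2"]})
b = pd.DataFrame({"ref": ["DN 1", "DN 2"]})

# Default behaviour: index labels are carried over, so 0 and 1 appear twice
kept = pd.concat([a, b], axis=0)

# ignore_index=True assigns a fresh 0..n-1 index to the result
fresh = pd.concat([a, b], axis=0, ignore_index=True)

kept_index = kept.index.tolist()
fresh_index = fresh.index.tolist()
```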

In [41]:
#df_all.to_csv('./sutta_csv/cleaned/df_all_clean.csv', index = False)