# "Four Noble Truths" Processing

The raw text file has already been stripped of material related to the source website. It still contains the whole introduction, section markers like the "§ 1." that opens "§ 1.The Buddha: 'What do you think, Rahula: What is a mirror for?'", and verse citations like "-MN 61". We remove all three so that only the text itself is left.

In [1]:
# The variable 'text' will have the full raw text

with open("../Raw_Texts/fournobletruths.txt", encoding='utf8') as f:
    text = f.read()
    
#text

In [2]:
# First, we split the text into all its verses

verses = text.split("§ ")
#verses[0]  This is the introduction

# To remove the introduction, we just get rid of the first element
verses = verses[1:]

# Here's an example of the current output
min(verses, key=len)
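
To make the split step concrete, here is a minimal sketch on a toy string. The `sample` text and the names `intro` and `sample_verses` are made up for illustration; only the `"§ "` separator comes from the cell above.

```python
# Toy stand-in for the raw file; the real text is read from fournobletruths.txt
sample = (
    "Introduction text.\n"
    "§ 1. First passage.\n— MN 61\n"
    "§ 2. Second passage.\n— SN 56.11\n"
)

# Splitting on the section sign leaves the introduction as the first element
parts = sample.split("§ ")
intro, sample_verses = parts[0], parts[1:]

print(len(sample_verses))      # number of verses found
print(repr(sample_verses[0]))  # still carries its number and its citation
```

Each verse still begins with its number and ends with its citation, which is why a second trimming pass is needed.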

In [3]:
# Next, we remove the beginnings and ends of each verse, for which we will use regular expressions
import re

# Raw strings keep "\." from being read as an (invalid) Python string escape
begin = re.compile(r"([0-9]+\.[ \n]+)")
end = re.compile(r"([\n]+— [A-Za-z]+ [0-9.]+[ \n]*)")

for i, verse in enumerate(verses):
    begin_index = re.search(begin, verse).end()
    end_index = re.search(end, verse).start()
    verses[i] = verse[begin_index : end_index]

# Again, here's an example of the current output
min(verses, key=len)
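
The loop above assumes every verse contains both patterns; if either is missing, `re.search` returns `None` and the cell raises an `AttributeError`. The sketch below is one possible defensive variant, not the notebook's own code. The helper name `trim_verse` and the sample verse are hypothetical, and it anchors the number with `match` instead of `search`, on the assumption that the number always sits at the very start of a verse.

```python
import re

begin = re.compile(r"[0-9]+\.[ \n]+")
end = re.compile(r"\n+— [A-Za-z]+ [0-9.]+[ \n]*")

def trim_verse(verse):
    """Return the verse body; an end whose pattern is absent is left untouched."""
    start_match = begin.match(verse)   # verse number, only at the very start
    end_match = end.search(verse)      # first citation line, e.g. "— MN 61"
    start = start_match.end() if start_match else 0
    stop = end_match.start() if end_match else len(verse)
    return verse[start:stop]

sample_verse = "1. The Buddha: What is a mirror for?\n\n— MN 61\n\n"
print(repr(trim_verse(sample_verse)))
```

A verse without a number or citation passes through unchanged instead of stopping the whole run.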

In [4]:
# Now, we import what we need from the NLTK library to remove stopwords and stem words
import nltk

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# note that you need to download the 'stopwords' and 'punkt' resources from NLTK for this to work
# just run nltk.download('stopwords') and nltk.download('punkt')
# (newer NLTK releases may ask for 'punkt_tab' instead of 'punkt')

from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

In [5]:
stop_words = stopwords.words('english')
stemmer = SnowballStemmer('english')

# use this function to remove stop words and stem: from <one entire string> -> to <list of stems>
def cleanData(text):
    final = []
    for item in text.split():
        x = item.lower()
        # drop everything that is not a letter, so "self-affliction," becomes "selfaffliction"
        x2 = re.sub("[^a-zA-Z]+", "", x)
        tokens = word_tokenize(x2)
        # digits were already stripped above, so only the stop word check is needed
        wordlst = [stemmer.stem(word) for word in tokens if word not in stop_words]
        final.extend(wordlst)
    return final
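
To see what the cleaning step does before stemming, here is a dependency-free sketch of the same idea. It is a simplification, not the function above: it leaves out `word_tokenize` and the Snowball stemmer, so its output is unstemmed. The stop word set and the name `normalize_tokens` are hypothetical stand-ins.

```python
import re

# Hypothetical, tiny stand-in for stopwords.words('english')
SAMPLE_STOP_WORDS = {"what", "is", "a", "for", "do", "you", "the"}

def normalize_tokens(text, stop_words=SAMPLE_STOP_WORDS):
    """Lowercase each whitespace-separated item, strip non-letters, drop stop words."""
    kept = []
    for item in text.split():
        word = re.sub("[^a-zA-Z]+", "", item.lower())
        # Skip items that were pure punctuation or digits, and skip stop words
        if word and word not in stop_words:
            kept.append(word)
    return kept

print(normalize_tokens("What is a mirror for? For self-reflection, sir."))
```

Because the hyphen is deleted and not replaced with a space, "self-reflection" survives as the single token "selfreflection". The same effect explains stems such as 'selfafflict' in the final word list.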

In [8]:
# Lastly, we write to a new text file
final_text = "\n".join(verses)

with open("fournobletruths.txt", "w", encoding='utf8') as f:
    f.write(final_text)
    
final_words = cleanData(final_text)

with open("../Processed_Texts/fournobletruths_words.txt", "w", encoding='utf8') as f:
    f.write("\n".join(final_words))
    
print(final_words)

['buddha', 'think', 'rahula', 'mirror', 'rahula', 'reflect', 'sir', 'buddha', 'way', 'rahula', 'bodili', 'act', 'verbal', 'act', 'mental', 'act', 'done', 'repeat', 'reflect', 'whenev', 'want', 'perform', 'bodili', 'act', 'reflect', 'bodili', 'act', 'want', 'perform', 'would', 'lead', 'selfafflict', 'afflict', 'other', 'unskil', 'bodili', 'act', 'pain', 'consequ', 'pain', 'result', 'reflect', 'know', 'would', 'lead', 'selfafflict', 'afflict', 'other', 'would', 'unskil', 'bodili', 'act', 'pain', 'consequ', 'pain', 'result', 'bodili', 'act', 'sort', 'absolut', 'unfit', 'reflect', 'know', 'would', 'caus', 'afflict', 'would', 'skill', 'bodili', 'act', 'happi', 'consequ', 'happi', 'result', 'bodili', 'act', 'sort', 'fit', 'similar', 'verbal', 'act', 'mental', 'act', 'perform', 'bodili', 'act', 'reflect', 'bodili', 'act', 'lead', 'selfafflict', 'afflict', 'other', 'unskil', 'bodili', 'act', 'pain', 'consequ', 'pain', 'result', 'reflect', 'know', 'lead', 'selfafflict', 'afflict', 'other', 'giv