# "Lotus Sutra" Processing

The raw text file is laid out like a book. At the beginning there are a table of contents, forewords, and an introduction that we have to remove, and at the end there are a glossary and credits. In addition, each page of the original book carries a page number and a running header that we must remove.

In [1]:
# The variable 'text' will have the full raw text

with open("../Raw_Texts/lotussutra.txt", encoding='utf8') as f:
    text = f.read()

#text

In [2]:
# Importing some packages that we'll use
import re

# We'll also define a method so that we can see the beginning, middle, and end of the text at any given point
def text_peek(text, n=2000, full=True, begin=False, middle=False, end=False):
    if full:
        begin, middle, end = True, True, True
    
    if begin:
        print(text[:n])
        if middle or end:
            print("\n\n------------------------------\n\n")
    
    if middle:
        start = max(0, (len(text) - n) // 2)    # centre the window; clamp for short texts
        print(text[start:start+n])
        if end:
            print("\n\n------------------------------\n\n")
    
    if end:
        print(text[-n:])

text_peek(text)

THE LOTUS SUTRA

This digital version of the original publication is distributed according to the
Creative Commons “Attribution-Noncommercial-Share Alike 3.0” license agreement and the provisions stated on the website at http://www.numatacenter.com/.
This PDF file may be printed and distributed according to the terms of use established on the website. The file itself is distributed with certain security provisions
in place that disallow modification. However, if any Buddhist group or scholar of
Buddhism has legitimate reason to modify and/or adapt the contents of any such
file (such as for inclusion of the contents in a publically available online database
of Buddhist sources), please contact us for permission and unrestricted files.

dBET PDF Version
© 2009

BDK English Tripiṭaka Series

THE LOTUS SUTRA
(Taishō Volume 9, Number 262)

Translated from the Chinese of Kumārajiva
by
Tsugunari Kubo
and
Akira Yuyama

Numata Center
for Buddhist Translation and Research
2007

Copyright © 20

## Removing the Beginning

First, we remove the front matter of the raw text file. Looking at the whole file, we see that the part we want starts after an occurrence of "THE LOTUS SUTRA".

In [3]:
begin = re.compile(r"(THE LOTUS SUTRA[ \n]*)")
begin_matches = list(begin.finditer(text))

for m in begin_matches:
    e = m.end()
    print(text[e:e+500])
    print("\n\n------------------------------\n\n")

This digital version of the original publication is distributed according to the
Creative Commons “Attribution-Noncommercial-Share Alike 3.0” license agreement and the provisions stated on the website at http://www.numatacenter.com/.
This PDF file may be printed and distributed according to the terms of use established on the website. The file itself is distributed with certain security provisions
in place that disallow modification. However, if any Buddhist group or scholar of
Buddhism has legi


------------------------------


(Taishō Volume 9, Number 262)

Translated from the Chinese of Kumārajiva
by
Tsugunari Kubo
and
Akira Yuyama

Numata Center
for Buddhist Translation and Research
2007

Copyright © 2007 by Bukkyō Dendō Kyōkai and
Numata Center for Buddhist Translation and Research
All rights reserved. No part of this book may be reproduced, stored
in a retrieval system, or transcribed in any form or by any means
—electronic, mechanical, photocopying, recording, or otherwise—
wi

These snippets show that the text we want comes after the last occurrence, so we discard everything up to and including that match.

In [4]:
text = text[begin_matches[-1].end():]

In [5]:
text_peek(text, 200, False, begin=True)

Chapter I

Introduction
Thus have I heard. Once the Buddha was staying in the city of Rājagṛha, on
the mountain called Gṛdhrakūṭa, together with a great assembly of twelve
thousand monks, all of who
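
The same "keep everything after the last marker" step can be sketched on a toy string. The helper name `strip_before_last` and the sample text are assumptions made for illustration; they are not part of the notebook. The only behavioural difference is that the sketch raises a clear error when the marker is missing instead of failing with an `IndexError`.

```python
import re

def strip_before_last(pattern, text):
    """Return the part of `text` after the last match of `pattern`.

    Hypothetical helper, not part of the notebook.
    """
    matches = list(re.finditer(pattern, text))
    if not matches:
        raise ValueError("marker not found")
    return text[matches[-1].end():]

# Toy stand-in for the raw file: the title appears three times before the body.
sample = (
    "THE LOTUS SUTRA\n\nlicense notice\n\n"
    "THE LOTUS SUTRA\n(front matter)\n\n"
    "THE LOTUS SUTRA\n\nChapter I\n\nIntroduction\n"
)
body = strip_before_last(r"THE LOTUS SUTRA[ \n]*", sample)
print(repr(body))
```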


## Removing the End

Now, we'll remove the end of the raw text file that we don't want. The table of contents tells us that the glossary immediately follows the main text, so we'll search for that keyword.

In [6]:
ending = re.compile(r"([ \n]*[Gg]lossary[ \n]*)")
ending_matches = list(ending.finditer(text))

for m in ending_matches[:2]:
    s = m.start()
    print(text[s-800:s+200])
    print("\n\n------------------------------\n\n")

ason, O Samantabhadra, if you see anyone who holds to

316

Chapter XXVIII

this sutra, you should stand up and show your respect even from afar, just as
you would pay homage to the Buddha.”
When this chapter, “Encouragement of Bodhisattva Samantabhadra,”
was being taught, innumerable and limitless bodhisattvas equal in number
to the sands of Ganges Rivers attained hundreds of thousands of myriads of
koṭis of dhāraṇīs named āvartā, and bodhisattvas equal to the number of
atoms in the manifold cosmos mastered the path of Samantabhadra.
When the Buddha had taught this sutra, the entire great assembly of
bodhisattvas including Samantabhadra, the śrāvakas including Śāriputra,
devas, nāgas, humans, and nonhumans rejoiced greatly, accepted the Buddha’s
words, bowed to him and departed.

317

Glossary

anuttarā samyaksaṃbodhi: Complete, perfect enlightenment.
apasmāraka: A class of demonic beings.
arhat (“one who is worthy”): A saint who has completely eradicated the passions and
attained 

We see that the very first match for "Glossary" marks the point where our text ends. The remaining occurrences are merely page headers within the glossary itself. Thus, we cut the text at the first match.

In [7]:
text = text[:ending_matches[0].start()]
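
Cutting at the first match works here because the word does not occur earlier in the main text. If it did, one possible safeguard is to match only a heading that sits alone on its line. The sketch below is an assumption-laden illustration on a made-up sample, not the notebook's method; `heading` and `loose` are hypothetical names.

```python
import re

# Made-up sample in which the word also occurs inside a sentence.
sample = (
    "see the glossary for details.\n"
    "bowed to him and departed.\n\n317\n\n"
    "Glossary\n\narhat: ...\n\n318\n\nGlossary\n\nbodhisattva: ...\n"
)

# Pattern equivalent to the notebook's: matches the word anywhere.
loose = re.compile(r"[ \n]*[Gg]lossary[ \n]*")
# Stricter variant: the capitalised word alone on a line.
heading = re.compile(r"^Glossary[ \t]*$", re.MULTILINE)

first_heading = heading.search(sample)
main_text = sample[:first_heading.start()]

print(loose.search(sample).start())   # position inside the first sentence
print(first_heading.start())          # position of the real heading
print(repr(main_text[-10:]))
```

With `re.MULTILINE`, `^` and `$` match at line boundaries, so the in-sentence occurrence is skipped.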

In [8]:
text_peek(text, full=False, end=True)

ma, and will sit on the
lion seat of the Dharma in the great assembly of devas and humans.
“O Samantabhadra! Those who preserve and recite this sutra in the future
world will not be greedy for clothes, bedding, food and drink, and the necessities of life. Their aspirations will not be unrewarded, and their happy reward
will be attained in this world. If there is anyone who despises them, saying:
‘You are mad. This practice of yours is in vain and will attain nothing at the
end,’ they will have no eyes lifetime after lifetime as a retribution for this
wrongdoing. If there is anyone who pays them homage and praises them, he
will attain tangible rewards in this world. If anyone sees those who preserve
this sutra and speaks maliciously about their faults, whether true or not, such
a person will suffer from leprosy in this life. If anyone scorns them, that person’s teeth will be either loose or missing; their lips will be ugly, their nose
will be ﬂat, their limbs will be crooked; they will 

## Removing Page Numbers and Headers

Our penultimate step is to remove the page numbers and page headers. We can again use regular expressions, although we need to analyze the structure of the pages a little more.

In [9]:
text_peek(text)

Chapter I

Introduction
Thus have I heard. Once the Buddha was staying in the city of Rājagṛha, on
the mountain called Gṛdhrakūṭa, together with a great assembly of twelve
thousand monks, all of whom were arhats whose corruption was at an end,
who were free from the confusion of desire, who had achieved their own
goals, shattered the bonds of existence, and attained complete mental discipline. Their names were Ājñātakauṇḍinya, Mahākāśyapa, Uruvilvakāśyapa,
Gayākāśyapa, Nadīkāśyapa, Śāriputra, Mahāmaudgalyāyana, Mahākātyāyana, Aniruddha, Kapphiṇa, Gavāṃpati, Revata, Pilindavatsa, Bakkula,
Mahākauṣṭhila, Nanda, Sundarananda, Pūrṇamaitrāyaṇīputra, Subhūti,
Ānanda, and Rāhula. All of them were great arhats, known to the assembly.
There were in addition two thousand others, both those who had more to
learn and those who did not. The nun Mahāprajāpatī was there, together with
her six thousand attendants; and also the nun Yaśodharā, Rāhula’s mother,
together with her attendants.
There were 

We see that every odd-numbered page has a header that tells us which chapter we are in, and every even-numbered page has the title of the text ("The Lotus Sutra") as its header. There are also some strings like "32c" and "62a". These are probably margin references to the page and column of the Taishō edition named on the title page, rather than footnotes.

We have to remove all of these. First, we pad the text slightly at the beginning and end so that every page break follows the same pattern: a page number followed by a header. Then we match that pattern with a regular expression. As a sanity check, we expect 316 matches: 315 for pages 3 through 317, plus one for the header at the very beginning.

In [10]:
# Slight modification to the text: add a page number at the start and a header at the end
text = "2\n\n" + text + "\nChapter XXVIII"
text_peek(text, 200, False, begin=True, end=True)

2

Chapter I

Introduction
Thus have I heard. Once the Buddha was staying in the city of Rājagṛha, on
the mountain called Gṛdhrakūṭa, together with a great assembly of twelve
thousand monks, all of 


------------------------------


sattvas including Samantabhadra, the śrāvakas including Śāriputra,
devas, nāgas, humans, and nonhumans rejoiced greatly, accepted the Buddha’s
words, bowed to him and departed.

317


Chapter XXVIII


In [11]:
#fluff = re.compile("[A-Za-z\. \n]*([0-9]+\n[0-9a-z \n]+Chapter [IVXL]+[ \n]*)[A-Z]*")
fluff = re.compile(r"[\s]*[0-9]+[\s0-9a-z]*((Chapter [IVXL]+[\s]*)|(The Lotus Sutra)[\s]*)([0-9][a-z][\s]*)*")
fluff_matches = list(fluff.finditer(text))

print("Number of matches found:", len(fluff_matches))        
fluff_matches[:5]

Number of matches found: 305


[<re.Match object; span=(0, 16), match='2\n\n\x0c\x0cChapter I\n\n'>,
 <re.Match object; span=(2027, 2062), match='\n\n3\n\n1c\n\n2a\n\n\x0cThe Lotus Sutra\n\n2b\n\>,
 <re.Match object; span=(4530, 4547), match='\n\n4\n\n\x0cChapter I\n\n'>,
 <re.Match object; span=(6843, 6870), match='\n\n5\n\n2c\n\n\x0cThe Lotus Sutra\n\n'>,
 <re.Match object; span=(8009, 8026), match='\n\n6\n\n\x0cChapter I\n\n'>]
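
Because the pattern above is dense, it may help to see it spelled out piece by piece. The following is a sketch using `re.VERBOSE`; the names `fluff_compact` and `fluff_verbose` and the sample string are assumptions for illustration. In verbose mode, literal spaces must be written as `[ ]`, and the capturing groups are replaced with non-capturing ones, which does not change what is matched.

```python
import re

# The notebook's pattern, written as a raw string.
fluff_compact = re.compile(
    r"[\s]*[0-9]+[\s0-9a-z]*((Chapter [IVXL]+[\s]*)|(The Lotus Sutra)[\s]*)([0-9][a-z][\s]*)*"
)

# The same pattern with one component per line.
fluff_verbose = re.compile(r"""
    \s*                          # whitespace before the page number
    [0-9]+                       # the page number, e.g. 3
    [\s0-9a-z]*                  # margin references such as 1c or 2a, plus whitespace
    (?:
        Chapter[ ][IVXL]+\s*     # chapter header
      | The[ ]Lotus[ ]Sutra\s*   # title header
    )
    (?:[0-9][a-z]\s*)*           # single-digit references after the header, e.g. 2b
""", re.VERBOSE)

# Made-up sample with two page breaks.
sample = (
    "words, bowed.\n\n3\n\n1c\n\n2a\n\n\x0cThe Lotus Sutra\n\n2b\n\n"
    "Thus have I heard.\n\n4\n\n\x0cChapter I\n\nOnce the Buddha"
)

spans_compact = [m.span() for m in fluff_compact.finditer(sample)]
spans_verbose = [m.span() for m in fluff_verbose.finditer(sample)]
print(spans_compact == spans_verbose)
```

Note that the trailing group only allows a single digit before the letter, so a reference such as "32c" that follows a header is not consumed by this pattern.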

From a preview of the first few matches, we see that the pattern captures roughly what we want. However, only 305 matches were found rather than the expected 316. Let's check which page numbers are missing.

In [12]:
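# List page numbers that never appear on a line of their own in the text.
# Note: this searches the full text (fluff_matches[0].string is the string
# that was searched, i.e. text), not the matches themselves.

for i in range(3, 318):
    s = "\n" + str(i) + "\n"
    if s not in text:
        print(i)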

22
112
150
166
178
186
208
222
232
276
288


Eleven page numbers are absent. Checking manually confirms that these pages do not exist in the text: for example, page 221 is followed directly by page 223, so there is no page 222. This agrees with our expectation, since 305 + 11 = 316.
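
The count agreement is indirect evidence. A more direct check, sketched here under assumptions, is to read the leading page number out of each match and compare the result against the expected range. The helper names `pages_in_matches` and `missing_pages` are hypothetical, and the sample is made up.

```python
import re

def pages_in_matches(matches):
    """Return the leading page number of each match as a set of ints."""
    pages = set()
    for m in matches:
        number = re.match(r"\s*([0-9]+)", m.group(0))
        if number:
            pages.add(int(number.group(1)))
    return pages

def missing_pages(matches, first, last):
    """List page numbers in [first, last] that no match starts with."""
    found = pages_in_matches(matches)
    return [p for p in range(first, last + 1) if p not in found]

# Same pattern as the notebook, applied to a toy text lacking page 5.
fluff = re.compile(
    r"[\s]*[0-9]+[\s0-9a-z]*((Chapter [IVXL]+[\s]*)|(The Lotus Sutra)[\s]*)([0-9][a-z][\s]*)*"
)
sample = (
    "text a\n\n3\n\n\x0cThe Lotus Sutra\n\ntext b\n\n4\n\n\x0cChapter I\n\n"
    "text c\n\n6\n\n\x0cChapter I\n\ntext d"
)
toy_matches = list(fluff.finditer(sample))
print(missing_pages(toy_matches, 3, 6))
```

Unlike a substring search over the whole text, this inspects the matches themselves, so a page number that is present but was not captured by the pattern would also be reported.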

With this, we can be reasonably confident that our matches are correct, and we can start removing the page numbers and headers.

In [13]:
# Given a string and a list of matches found in it, this function returns the
# string with each match replaced by a single newline

def remove_matches(text, matches):
    if not matches:    # nothing to remove
        return text
    text_arr = [text[:matches[0].start()]]    # anything before the first match
    for m, nxt in zip(matches, matches[1:]):
        text_arr.append(text[m.end():nxt.start()])    # text between consecutive matches
    text_arr.append(text[matches[-1].end():])    # anything after the last match
    return "\n".join(text_arr)
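
Since every match is replaced by a single newline, the same result can be obtained with the standard `Pattern.sub`. The sketch below compares the two approaches on a made-up sample with a simplified pattern; it restates the helper so that the block runs on its own.

```python
import re

def remove_matches(text, matches):
    # Same behaviour as the notebook's helper: each match becomes one newline.
    parts = [text[:matches[0].start()]]
    for m, nxt in zip(matches, matches[1:]):
        parts.append(text[m.end():nxt.start()])
    parts.append(text[matches[-1].end():])
    return "\n".join(parts)

# Simplified toy pattern: a page number followed by the title header.
pattern = re.compile(r"\s*[0-9]+\s*The Lotus Sutra\s*")
sample = (
    "first page\n\n3\n\nThe Lotus Sutra\n\n"
    "second page\n\n5\n\nThe Lotus Sutra\n\nthird page"
)

by_hand = remove_matches(sample, list(pattern.finditer(sample)))
by_sub = pattern.sub("\n", sample)
print(by_hand == by_sub)
print(repr(by_sub))
```

Keeping the explicit helper has one advantage: the list of matches can be inspected and counted before anything is removed, which is how the notebook validates the pattern.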

In [14]:
text = remove_matches(text, fluff_matches)

In [15]:
for i in [0, 60000]:
    print(text[i:i+1000])
    print("\n\n------------------------------\n\n")


Introduction
Thus have I heard. Once the Buddha was staying in the city of Rājagṛha, on
the mountain called Gṛdhrakūṭa, together with a great assembly of twelve
thousand monks, all of whom were arhats whose corruption was at an end,
who were free from the confusion of desire, who had achieved their own
goals, shattered the bonds of existence, and attained complete mental discipline. Their names were Ājñātakauṇḍinya, Mahākāśyapa, Uruvilvakāśyapa,
Gayākāśyapa, Nadīkāśyapa, Śāriputra, Mahāmaudgalyāyana, Mahākātyāyana, Aniruddha, Kapphiṇa, Gavāṃpati, Revata, Pilindavatsa, Bakkula,
Mahākauṣṭhila, Nanda, Sundarananda, Pūrṇamaitrāyaṇīputra, Subhūti,
Ānanda, and Rāhula. All of them were great arhats, known to the assembly.
There were in addition two thousand others, both those who had more to
learn and those who did not. The nun Mahāprajāpatī was there, together with
her six thousand attendants; and also the nun Yaśodharā, Rāhula’s mother,
together with her attendants.
There were also eighty 

By this point, we are close to the final output that we want. However, there are still some leftover margin references (strings such as "32c") that our first regular expression could not capture. For these, we simply apply a second removal pass.

In [16]:
pesky = re.compile(r"([\s]*[0-9]+[a-z][\s]*)")
pesky_matches = list(pesky.finditer(text))
final_text = remove_matches(text, pesky_matches)
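
One caveat: a pattern of "digits followed by any lowercase letter" would also clip ordinals such as "1st" if they occurred in the running text. A stricter variant is sketched below. It rests on two assumptions that should be checked against the real file: that each reference sits alone on its line, and that the letter is always a, b, or c, as in the samples seen so far. The names `loose` and `strict` and the sample are made up for illustration.

```python
import re

# Equivalent to the notebook's pattern: digits followed by any lowercase letter.
loose = re.compile(r"\s*[0-9]+[a-z]\s*")
# Stricter assumption: the reference is alone on its line and ends in a, b or c.
strict = re.compile(r"^[ \t]*[0-9]+[abc][ \t]*\n", re.MULTILINE)

sample = "bowed to him and departed.\n32c\nOn the 1st day he spoke.\n"

loose_result = loose.sub("\n", sample)
strict_result = strict.sub("", sample)
print(repr(loose_result))
print(repr(strict_result))
```

Whether the stricter pattern is needed depends on the text; it trades a small risk of missing an inline reference for not touching ordinary words.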

In [17]:
text_peek(final_text)


Introduction
Thus have I heard. Once the Buddha was staying in the city of Rājagṛha, on
the mountain called Gṛdhrakūṭa, together with a great assembly of twelve
thousand monks, all of whom were arhats whose corruption was at an end,
who were free from the confusion of desire, who had achieved their own
goals, shattered the bonds of existence, and attained complete mental discipline. Their names were Ājñātakauṇḍinya, Mahākāśyapa, Uruvilvakāśyapa,
Gayākāśyapa, Nadīkāśyapa, Śāriputra, Mahāmaudgalyāyana, Mahākātyāyana, Aniruddha, Kapphiṇa, Gavāṃpati, Revata, Pilindavatsa, Bakkula,
Mahākauṣṭhila, Nanda, Sundarananda, Pūrṇamaitrāyaṇīputra, Subhūti,
Ānanda, and Rāhula. All of them were great arhats, known to the assembly.
There were in addition two thousand others, both those who had more to
learn and those who did not. The nun Mahāprajāpatī was there, together with
her six thousand attendants; and also the nun Yaśodharā, Rāhula’s mother,
together with her attendants.
There were also eighty 

## Writing to a Text File

We're finally at our last step, where we will output everything to a text file!

In [18]:
with open("lotussutra.txt", "w", encoding='utf8') as f:
    f.write(final_text)
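
Before relying on the output file, a quick scan for leftover page furniture can be useful. The following is a minimal sketch, assuming that anything suspicious sits alone on its line; `leftover_lines` is a hypothetical helper and the two sample strings are made up.

```python
import re

def leftover_lines(text):
    """Return lines that still look like page numbers, references or headers.

    Hypothetical check: flags lines that contain only digits, digits plus
    a, b or c, or a bare running header.
    """
    suspicious = re.compile(
        r"^\s*(?:[0-9]+[abc]?|The Lotus Sutra|Chapter [IVXL]+)\s*$"
    )
    return [line for line in text.splitlines() if suspicious.match(line)]

cleaned = "Introduction\nThus have I heard.\n"
dirty = "Thus have I heard.\n\n3\n\n1c\nThe Lotus Sutra\nOnce the Buddha\n"

print(leftover_lines(cleaned))
print(leftover_lines(dirty))
```

A genuine chapter heading would be flagged as well, so the result is a list to read through rather than something to delete automatically.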