## Imports

In [145]:
import pandas as pd
import numpy as np
import nltk
import os
import PyPDF2
import textract
from bs4 import BeautifulSoup
import html2text

## This Notebook: 
In this notebook, I clean up my documents to get them ready for text extraction. 

Specifically, I remove the cover pages, tables of contents, appendices, publishing information, Sanskrit-to-English dictionaries/translations included in many of the texts, etc. That way, the documents contain only the body text I actually want to analyze. 

To do this: 
* For PDFs, I used PyPDF2 to keep only the desired pages of each document. 
* For HTML files, I manually copy-pasted the body of the text into new .txt files of the same name. It was ultimately the easiest way to avoid all the links. 
* For the .txt files, I deleted the irrelevant text from the documents by hand. 

## Reading In Files

In [41]:
path = "/Users/anterra/Metis/project_4/mind_body_texts"
documents = os.listdir(path)
documents.sort()
documents

['108upanishads.pdf',
 '2019_Book_InformationConsciousnessRealit.pdf',
 'abhidhamma_Pitaka_karam_tej_sarao.pdf',
 'brahma_Sutra_commentary_swami_sivananda.pdf',
 'brahma_sutras.txt',
 'comprehensive_manual_of_abhidhamma_2.pdf',
 'four_vedas.pdf',
 'indian_philosophy_consciousness.pdf',
 'mind and consciousness in yoga.htm',
 'process_of_consciousness_and_matter.pdf',
 'quantum Approaches to Consciousness.htm',
 'quantum mind.pdf',
 'selfhood_and_Identity_in_Confucianism_Ta.pdf',
 'tao_te_ching.pdf',
 'the Abhidhamma in Practice.htm',
 'the Psychology and Philosophy of Buddhism.pdf',
 'the finer scale of consciousness_ quantum theory.htm',
 'the universe, quantum physics and consciousness.pdf',
 'yoga sutras patanjali.txt']

In [42]:
pdfs = [doc for doc in documents if doc.endswith(".pdf")]
htms = [doc for doc in documents if doc.endswith(".htm")]
txts = [doc for doc in documents if doc.endswith(".txt")]
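
The three list comprehensions above match extensions case-sensitively, so a file named `something.PDF` would be missed. A minimal sketch of a case-insensitive alternative follows; `group_by_extension` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
import os
from collections import defaultdict


def group_by_extension(filenames):
    """Group file names by lowercased extension, e.g. {'.pdf': [...], '.htm': [...]}."""
    groups = defaultdict(list)
    for name in sorted(filenames):
        # splitext returns ('name', '.ext'); files without an extension map to ''
        ext = os.path.splitext(name)[1].lower()
        groups[ext].append(name)
    return dict(groups)
```

With this, `group_by_extension(documents).get(".pdf", [])` would play the role of `pdfs`, and likewise for `.htm` and `.txt`.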


In [43]:
htms

['mind and consciousness in yoga.htm',
 'quantum Approaches to Consciousness.htm',
 'the Abhidhamma in Practice.htm',
 'the finer scale of consciousness_ quantum theory.htm']

In [44]:
txts

['brahma_sutras.txt', 'yoga sutras patanjali.txt']

## PDFs
For the PDFs, I'm first stripping off title pages, acknowledgements, table of contents pages, and indexes, so the documents contain only the desired body text. 

In [45]:
pdfs

['108upanishads.pdf',
 '2019_Book_InformationConsciousnessRealit.pdf',
 'abhidhamma_Pitaka_karam_tej_sarao.pdf',
 'brahma_Sutra_commentary_swami_sivananda.pdf',
 'comprehensive_manual_of_abhidhamma_2.pdf',
 'four_vedas.pdf',
 'indian_philosophy_consciousness.pdf',
 'process_of_consciousness_and_matter.pdf',
 'quantum mind.pdf',
 'selfhood_and_Identity_in_Confucianism_Ta.pdf',
 'tao_te_ching.pdf',
 'the Psychology and Philosophy of Buddhism.pdf',
 'the universe, quantum physics and consciousness.pdf']

In [79]:
reader = PyPDF2.PdfFileReader(path + "/" + pdfs[0])
writer = PyPDF2.PdfFileWriter()
for page_num in range(2, 842):
    writer.addPage(reader.getPage(page_num))
pdf_upanishads = "pdf_upanishads.pdf"
with open(pdf_upanishads, "wb") as f: 
    writer.write(f)

In [83]:
reader = PyPDF2.PdfFileReader(path + "/" + pdfs[1])
writer = PyPDF2.PdfFileWriter()
for page_num in range(21, 629):
    writer.addPage(reader.getPage(page_num))
pdf_information_consciousness = "pdf_information_consciousness.pdf"
with open(pdf_information_consciousness, "wb") as f:
    writer.write(f)

In [151]:
# not working??? 
reader = PyPDF2.PdfFileReader(path + "/" + pdfs[2])
writer = PyPDF2.PdfFileWriter()
for page_num in range(1, 206):
    writer.addPage(reader.getPage(page_num))
pdf_abhidhamma_ts = "pdf_abhidhamma_ts.pdf"
with open(pdf_abhidhamma_ts, "wb") as f:
    writer.write(f)

In [85]:
reader = PyPDF2.PdfFileReader(path + "/" + pdfs[3])
writer = PyPDF2.PdfFileWriter()
for page_num in range(43, 569):
    writer.addPage(reader.getPage(page_num))
pdf_brahma_sutras_sivananda = "pdf_brahma_sutras_sivananda.pdf"
with open(pdf_brahma_sutras_sivananda, "wb") as f:
    writer.write(f)

In [87]:
reader = PyPDF2.PdfFileReader(path + "/" + pdfs[4])
reader.decrypt("")
writer = PyPDF2.PdfFileWriter()
for page_num in range(27, 392):
    writer.addPage(reader.getPage(page_num))
pdf_comprehensive_abhidhamma = "pdf_abhidhamma.pdf"
with open(pdf_comprehensive_abhidhamma, "wb") as f:
    writer.write(f)

In [88]:
reader = PyPDF2.PdfFileReader(path + "/" + pdfs[5])
writer = PyPDF2.PdfFileWriter()
for page_num in range(48, 1445):
    writer.addPage(reader.getPage(page_num))
pdf_four_vedas = "pdf_four_vedas.pdf"
with open(pdf_four_vedas, "wb") as f:
    writer.write(f)

In [90]:
reader = PyPDF2.PdfFileReader(path + "/" + pdfs[6])
writer = PyPDF2.PdfFileWriter()
for page_num in range(1, 10):
    writer.addPage(reader.getPage(page_num))
pdf_indian_philosophy = "pdf_indian_philosophy.pdf"
with open(pdf_indian_philosophy, "wb") as f:
    writer.write(f)

In [91]:
reader = PyPDF2.PdfFileReader(path + "/" + pdfs[7])
writer = PyPDF2.PdfFileWriter()
for page_num in range(22, 164):
    writer.addPage(reader.getPage(page_num))
pdf_process_consciousness = "pdf_process_consciousness.pdf"
with open(pdf_process_consciousness, "wb") as f:
    writer.write(f)

In [92]:
reader = PyPDF2.PdfFileReader(path + "/" + pdfs[8])
writer = PyPDF2.PdfFileWriter()
for page_num in range(2, 157):
    writer.addPage(reader.getPage(page_num))
pdf_quantum_mind = "pdf_quantum_mind.pdf"
with open(pdf_quantum_mind, "wb") as f:
    writer.write(f)

In [93]:
reader = PyPDF2.PdfFileReader(path + "/" + pdfs[9])
writer = PyPDF2.PdfFileWriter()
for page_num in range(0, 21):
    writer.addPage(reader.getPage(page_num))
pdf_selfhood = "pdf_selfhood.pdf"
with open(pdf_selfhood, "wb") as f:
    writer.write(f)

In [94]:
reader = PyPDF2.PdfFileReader(path + "/" + pdfs[10])
writer = PyPDF2.PdfFileWriter()
for page_num in range(0, 28):
    writer.addPage(reader.getPage(page_num))
pdf_tao_te_ching = "pdf_tao_te_ching.pdf"
with open(pdf_tao_te_ching, "wb") as f:
    writer.write(f)

In [107]:
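# Originally not working: os.system returned 512 (qpdf exit status 2), most likely
# because the unquoted input path contains spaces. Quoting both paths avoids that.
import shlex
encrypted_file_path = path + "/" + pdfs[11]
output_path = path + "/" + "pdf_psychology_buddhism.pdf"
command = f"qpdf --password='' --decrypt {shlex.quote(encrypted_file_path)} {shlex.quote(output_path)}"
os.system(command)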

reader = PyPDF2.PdfFileReader(output_path)
writer = PyPDF2.PdfFileWriter()
for page_num in range(24, 355):
    writer.addPage(reader.getPage(page_num))
pdf_psychology_buddhism = "pdf_psychology_buddhism.pdf"
with open(pdf_psychology_buddhism, "wb") as f:
    writer.write(f)

512

In [106]:
reader = PyPDF2.PdfFileReader(path + "/" + pdfs[12])
writer = PyPDF2.PdfFileWriter()
for page_num in range(0, 8):
    writer.addPage(reader.getPage(page_num))
pdf_universe_quantum = "pdf_universe_quantum.pdf"
with open(pdf_universe_quantum, "wb") as f:
    writer.write(f)
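
The cells above repeat one pattern: open a source PDF, copy a range of pages, and write the result. A minimal sketch of that pattern as a single function is below. The names `page_indices` and `extract_pages` are hypothetical, and the code assumes the same legacy PyPDF2 (pre-3.0) API used in this notebook; in PyPDF2 3.x these classes and methods were renamed (`PdfReader`, `PdfWriter`, `add_page`, `reader.pages`).

```python
def page_indices(start, stop, num_pages):
    """Return the 0-based page indices in [start, stop), checking them against the document length."""
    if start < 0 or stop < start:
        raise ValueError(f"invalid page range: {start}-{stop}")
    if stop > num_pages:
        raise ValueError(f"range ends at {stop}, but the document has {num_pages} pages")
    return list(range(start, stop))


def extract_pages(src_path, dst_path, start, stop):
    """Copy pages [start, stop) of src_path into a new PDF at dst_path."""
    import PyPDF2  # imported here so the range check above is usable without PyPDF2

    reader = PyPDF2.PdfFileReader(src_path)
    if reader.isEncrypted:
        # Assumption: the file uses an empty user password, as in the decrypt("") cell above
        reader.decrypt("")
    writer = PyPDF2.PdfFileWriter()
    for page_num in page_indices(start, stop, reader.getNumPages()):
        writer.addPage(reader.getPage(page_num))
    with open(dst_path, "wb") as f:
        writer.write(f)
```

For example, the first cell would become `extract_pages(path + "/" + pdfs[0], "pdf_upanishads.pdf", 2, 842)`. Checking the range up front gives a readable error when a page range runs past the end of a document.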

In [110]:
pdf_path = "/Users/anterra/Metis/project_4/clean_pdfs"
clean_pdfs = os.listdir(pdf_path)

In [111]:
clean_pdfs

['pdf_quantum_mind.pdf',
 'pdf_information_consciousness.pdf',
 'pdf_upanishads.pdf',
 'pdf_universe_quantum.pdf',
 'pdf_brahma_sutras_sivananda.pdf',
 'pdf_abhidhamma.pdf',
 'pdf_selfhood.pdf',
 'pdf_tao_te_ching.pdf',
 'pdf_indian_philosophy.pdf',
 'pdf_abhidamma_ts.pdf',
 'pdf_process_consciousness.pdf',
 'pdf_four_vedas.pdf']

In [113]:
text = textract.process(pdf_path+"/"+clean_pdfs[0])

In [118]:
text[0:1000]

b'Introduction in quantum aspects of brain function \n\nSince  the  development  of  QM  and  relativistic  theories  in  the  first  part  of  the  20th  century, \nattempts have been made to understand and describe the mind or mental states on the basis of \nQM concepts (see Meijer, 2014, Meijer and Korf, 2013,). Quantum physics, currently seen as a \nfurther  refinement  in  the  description  of  nature,  does  not  only  describe  elementary \nmicrophysics  but  applies  to  classical  or  macro-physical  (Newtonian)  phenomena  as  well. \nHence the human brain and its mental aspects are associated to classical brain physiology and \nare  also  part  of  a  quantum  physical  universe.  Most  neurobiologists  considered  QM  mind \ntheories  irrelevant  to  understand  brain/mind  processes  (e.g.  Edelman  and  Tononi,  2000; \nKoch and Hepp, 2006).  \n \nHowever, there is no  single  theory on QM brain/mind theory.  In fact a spectrum of more or \nless  independent  models  have
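
The output above is a `bytes` object with doubled spaces and hard line breaks left over from the PDF layout. A minimal sketch of one way to tidy it before analysis is below; `normalize_extracted` is a hypothetical helper, and it assumes the bytes are UTF-8 encoded.

```python
import re


def normalize_extracted(raw, encoding="utf-8"):
    """Decode extracted bytes, collapse whitespace inside paragraphs, keep paragraph breaks."""
    text = raw.decode(encoding, errors="replace")
    # A blank (or whitespace-only) line marks a paragraph boundary
    paragraphs = re.split(r"\n\s*\n", text)
    # str.split() with no arguments splits on any run of whitespace, including newlines
    cleaned = [" ".join(p.split()) for p in paragraphs]
    return "\n\n".join(p for p in cleaned if p)
```

Calling `normalize_extracted(text)` would return a `str` with single spaces within each paragraph and one blank line between paragraphs.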

## HTMs
After the attempts below, I decided to just manually copy-paste the text from each of the 4 pages into .txt files. There are only 4 of them and they are all rather short articles, so this was the quickest option. 

In [112]:
htms

['mind and consciousness in yoga.htm',
 'quantum Approaches to Consciousness.htm',
 'the Abhidhamma in Practice.htm',
 'the finer scale of consciousness_ quantum theory.htm']

In [134]:
text2 = textract.process(path+"/"+htms[0])

In [135]:
text2[0:2000]



In [136]:
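to_text = html2text.HTML2Text()
# Note: handle() expects HTML markup as a string, not a file path,
# so this call only converts the path string itself
text3 = to_text.handle(path+"/"+htms[0])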
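
For reference, a minimal sketch of pulling the visible text out of an HTML file programmatically. It deliberately uses the standard library's `html.parser` rather than html2text or BeautifulSoup, so it is a stand-in technique, not what this notebook ended up doing. `BodyTextExtractor`, `html_to_text`, and `html_file_to_text` are hypothetical names, and the file is assumed to be UTF-8.

```python
from html.parser import HTMLParser
from pathlib import Path


class BodyTextExtractor(HTMLParser):
    """Collect visible text, skipping <head>, <script>, and <style> content."""

    SKIP = {"head", "script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Link targets live in attributes, so only the link text is kept
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(markup):
    """Return the visible text of an HTML string as one space-joined string."""
    parser = BodyTextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(parser.chunks)


def html_file_to_text(file_path):
    """Read the file first, then parse its contents (not the path)."""
    markup = Path(file_path).read_text(encoding="utf-8", errors="replace")
    return html_to_text(markup)
```

The same read-then-convert step applies to html2text: pass the file contents to `to_text.handle(markup)`, optionally after setting `to_text.ignore_links = True` to drop link targets.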


## TXTs 
For the .txt files, I'm just opening them in VSCode and deleting the table of contents and appendix text! 

In [138]:
txts

['brahma_sutras.txt', 'yoga sutras patanjali.txt']

In [142]:
text4 = textract.process(path+"/"+txts[0])
text4[0:1000]

b'The Project Gutenberg EBook of The Vedanta-Sutras with the Commentary by\nRamanuja, by Trans. George Thibaut\n\nCopyright laws are changing all over the world. Be sure to check the\ncopyright laws for your country before downloading or redistributing\nthis or any other Project Gutenberg eBook.\n\nThis header should be the first thing seen when viewing this Project\nGutenberg file.  Please do not remove it.  Do not change or edit the\nheader without written permission.\n\nPlease read the "legal small print," and other information about the\neBook and Project Gutenberg at the bottom of this file.  Included is\nimportant information about your specific rights and restrictions in\nhow the file may be used.  You can also find out about how to make a\ndonation to Project Gutenberg, and how to get involved.\n\n\n**Welcome To The World of Free Plain Vanilla Electronic Texts**\n\n**eBooks Readable By Both Humans and By Computers, Since 1971**\n\n*****These eBooks Were Prepared By Thousands of
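
The output above shows the kind of licence header that I deleted by hand. If there were many such files, the trimming could be scripted. The sketch below keeps only the text between two marker strings; `strip_boilerplate` is a hypothetical helper, and the marker strings are assumptions that would have to be checked against each file, since the exact wording varies between Project Gutenberg editions.

```python
def strip_boilerplate(text, start_marker, end_marker):
    """Return the text after the line containing start_marker and before the last end_marker."""
    body_start = 0
    start = text.find(start_marker)
    if start != -1:
        # Skip the rest of the marker line itself
        line_end = text.find("\n", start)
        body_start = len(text) if line_end == -1 else line_end + 1
    body_end = text.rfind(end_marker, body_start)
    if body_end == -1:
        # No closing marker found: keep everything to the end of the file
        body_end = len(text)
    return text[body_start:body_end].strip()
```

If either marker is missing, the function falls back to the start or end of the file, so it never discards text it cannot place.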

In [137]:
text3

'/Users/anterra/Metis/project_4/mind_body_texts/mind and consciousness in\nyoga.htm\n\n'