## Conversion of PDF files into .txt file format

The PdfConverter class converts a PDF file into a string with words separated by spaces, which can then be saved as a .txt file. The current version of the code expects the PDF to be in the same directory as the Jupyter Notebook files (e.g., my notebook path is "C:\Users\YourName\Your_Jupyter_Notebook.ipynb").

Running PdfConverter converts and saves PDF files one at a time as text files for further use. Although we did not get to it, one immediate improvement would be to automate the conversion so that many files can be converted and saved in bulk.
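One possible shape for that bulk conversion is sketched below. It is not part of the original notebook: the function name `convert_all`, its parameters, and the output folder layout are assumptions made for illustration. The conversion step is passed in as a function, so the sketch does not depend on how the text is extracted.

```python
from pathlib import Path
from typing import Callable


def convert_all(pdf_dir: str, out_dir: str, convert: Callable[[str], str]) -> list:
    """Convert every .pdf in pdf_dir and write one .txt per PDF into out_dir.

    `convert` is any function that maps a PDF path to its extracted text
    (hypothetical parameter, not part of the notebook).
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    # sorted() makes the processing order predictable
    for pdf in sorted(Path(pdf_dir).glob('*.pdf')):
        target = out_path / (pdf.stem + '.txt')
        target.write_text(convert(str(pdf)), encoding='utf-8')
        written.append(target)
    return written
```

With the class defined in the next cell, a call might look like `convert_all('.', 'txt_file', lambda p: PdfConverter(p).convert_pdf_to_txt())`, which would write one text file per PDF found in the notebook directory.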

In [1]:
import spacy
import pandas as pd
from spacy import displacy
import re
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

class PdfConverter:

    def __init__(self, file_path):
        self.file_path = file_path

    # convert the PDF file to a string with spaces between words
    def convert_pdf_to_txt(self):
        rsrcmgr = PDFResourceManager()
        retstr = StringIO()
        laparams = LAParams()
        device = TextConverter(rsrcmgr, retstr, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        password = ""
        maxpages = 0       # 0 means no page limit
        caching = True
        pagenos = set()    # an empty set means all pages
        # the with-block closes the PDF file even if a page fails to parse
        with open(self.file_path, 'rb') as fp:
            for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
                interpreter.process_page(page)
        device.close()
        text = retstr.getvalue()
        retstr.close()
        return text

    # convert the PDF text to a string and save it as text_pdf.txt
    def save_convert_pdf_to_txt(self):
        content = self.convert_pdf_to_txt()
        with open('text_pdf.txt', 'wb') as txt_pdf:
            txt_pdf.write(content.encode('utf-8'))

if __name__ == '__main__':
    # file_path is for local directory where jupyter files are located
    pdfConverter = PdfConverter(file_path='public_guideline__principles_of_climate_adaptation_and_mitigation_for_engineers.pdf')
    print(pdfConverter.convert_pdf_to_txt())


nlp = spacy.load('en_core_web_sm',disable=['ner','textcat'])

Public guideline: Principles of

climate adaptation and
mitigation for engineers

National guideline - May 2018

Notice

Disclaimer

Engineers Canada’s national guidelines and white papers were developed by engineers in collaboration with the provincial and
territorial engineering regulators. They are intended to promote consistent practices across the country. They are not regulations
or rules; they seek to define or explain discrete topics related to the practice and regulation of engineering in Canada.

The national guidelines and white papers do not establish a legal standard of care or conduct, and they do not include
or constitute legal or professional advice.   

In Canada, engineering is regulated under provincial and territorial law by the engineering regulators. The recommendations
contained in the national guidelines and white papers may be adopted by the engineering regulators in whole, in part, or not at
all. The ultimate authority regarding the propriety of any specific 

# Post-Conversion: Working with .txt files

Once a PDF file has been converted into a .txt file, we can read that file and process its text into a dataframe.

In [2]:
# read the whole text file into a single string; the with-block closes the file afterwards
with open(r"C:\Users\Austin\txt_file\public_guideline__principles_of_climate_adaptation_and_mitigation_for_engineers.txt", mode='r', encoding='utf-8-sig') as file:
    line = file.read()

In [3]:
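text_list = [line]

# one row holding the full, unprocessed text of the file
df = pd.DataFrame({'Pre-NLK Transcript': text_list})
# placeholder for the sentence-level dataframe; a list keeps the column order well defined
df2 = pd.DataFrame(columns=['Sentence', 'Length'])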

The sentences function reads through the string loaded from the specified .txt file and treats every period or question mark as the end of a sentence. Its purpose is to divide the text into sentences so that a dataframe can then be populated with one sentence per row.

In [4]:
def sentences(text):
    # split wherever a period or question mark occurs; the punctuation itself is dropped
    return re.split(r'[.?]', text)

df['sent'] = df['Pre-NLK Transcript'].apply(sentences)
df['sent']

0    [Public guideline: Principles of\n\nclimate ad...
Name: sent, dtype: object
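To make the behaviour of this split concrete, here is a small example on a made-up string. The helper `clean_sentences` and the sample text are assumptions added for illustration; they are not part of the notebook. The example also shows one limitation of splitting on punctuation alone: abbreviations such as "Fig." are cut in two.

```python
import re


def clean_sentences(text):
    """Split on '.' or '?', collapse internal whitespace, drop empty fragments."""
    fragments = re.split(r'[.?]', text)
    cleaned = [' '.join(fragment.split()) for fragment in fragments]
    return [fragment for fragment in cleaned if fragment]


sample = "Engineers design systems.\n\nWhat risks\nexist? See Fig. 2."

# the raw split keeps line breaks and ends with an empty string
print(re.split(r'[.?]', sample))

print(clean_sentences(sample))
# ['Engineers design systems', 'What risks exist', 'See Fig', '2']
```

Dropping the empty fragments and collapsing line breaks at this stage would keep blank rows and hard-wrapped text out of the sentence dataframe.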

The full string from the .txt file is placed into a dummy dataframe, and the sentences function is applied to it. A second dataframe is then populated as originally planned, with each row containing one sentence.

In [5]:
row_list = []
for i in range(len(df)):
    for sent in df.loc[i,'sent']:
        wordcount = len(sent.split())
        charcount = len(sent)
        dict1 = {'Sent':sent,'Len':wordcount,'Len_char':charcount}
        row_list.append(dict1)
    
df2 = pd.DataFrame(row_list)

To detect whether a sentence is relevant to us, we decided to work with part-of-speech (POS) tags and began by creating basic, general rules that detect simple sentence patterns.

First Rule (noun-verb-noun):
-----

The first rule below detects the presence of any noun-verb-noun sequence in a sentence.

In [6]:
# detects any noun-verb-noun sequence in a sentence and returns it if it is present

def rule1(text):
    doc = nlp(text)
    sent = []
    for token in doc:
        if (token.pos_=='VERB'):
            phrase =''
            # subject: a noun, proper noun or pronoun to the left of the verb
            for subj_tok in token.lefts:
                if (subj_tok.dep_ in ['nsubj','nsubjpass']) and (subj_tok.pos_ in ['NOUN','PROPN','PRON']):
                    phrase += subj_tok.text
                    phrase += ' '+token.lemma_
                    # object: a noun or proper noun that is a direct object of the verb
                    for obj_tok in token.rights:
                        if (obj_tok.dep_ in ['dobj']) and (obj_tok.pos_ in ['NOUN','PROPN']):
                            phrase += ' '+obj_tok.text
                            sent.append(phrase)
    return sent

row_list = []

for i in range(len(df2)):
    sent = df2.loc[i,'Sent']
    length = df2.loc[i,'Len']
    length_char = df2.loc[i, 'Len_char']
    output = rule1(sent)
    dict1 = {'Len_word':length,'Len_char':length_char,'Sent':sent,'Output':output}
    row_list.append(dict1)
    
df_rule1 = pd.DataFrame(row_list)

Second Rule (adjective-noun):
----

The second rule below detects the presence of any adjective-noun sequence in a sentence.

In [7]:
# detects any adjective-noun sequence in a sentence and returns it if it is present

def rule2(text):
    doc = nlp(text)
    pat = []
    for token in doc:
        phrase = ''
        # consider only nouns that act as a subject or an object
        if (token.pos_ == 'NOUN')\
            and (token.dep_ in ['dobj','pobj','nsubj','nsubjpass']):
            # collect adjectives and compound modifiers attached to the noun
            for subtoken in token.children:
                if (subtoken.pos_ == 'ADJ') or (subtoken.dep_ == 'compound'):
                    phrase += subtoken.text + ' '
            # keep the noun only if at least one modifier was found
            if len(phrase)!=0:
                phrase += token.text
                pat.append(phrase)
    return pat

row_list2 = []

for i in range(len(df2)):
    sent = df2.loc[i,'Sent']
    length = df2.loc[i,'Len']
    length_char = df2.loc[i, 'Len_char']
    output = rule2(sent)
    dict1 = {'Len_word':length,'Len_char':length_char,'Sent':sent,'Output':output}
    row_list2.append(dict1)
    
df_rule2 = pd.DataFrame(row_list2)

Third Rule (noun-preposition-noun):
------

The third rule below detects the presence of any noun-preposition-noun sequence in a sentence.

In [8]:
# detects any noun-preposition-noun sequence in a sentence and returns it if it is present

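def rule3(text):
    doc = nlp(text)
    sent = []
    for token in doc:
        # start from each preposition (adposition)
        if token.pos_=='ADP':
            phrase = ''
            # the preposition must be attached to a noun
            if token.head.pos_=='NOUN':
                phrase += token.head.text
                phrase += ' '+token.text
                # append any noun or proper noun to the right of the preposition
                for right_tok in token.rights:
                    if (right_tok.pos_ in ['NOUN','PROPN']):
                        phrase += ' '+right_tok.text
                # phrase already holds the head noun and the preposition here, so
                # noun-preposition pairs with no following noun are kept as well
                if len(phrase)>2:
                    sent.append(phrase)
    return sent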

row_list3 = []

for i in range(len(df2)):
    sent = df2.loc[i,'Sent']
    length = df2.loc[i,'Len']
    length_char = df2.loc[i, 'Len_char']
    output = rule3(sent)
    dict1 = {'Len_word':length,'Len_char':length_char,'Sent':sent,'Output':output}
    row_list3.append(dict1)

df_rule3 = pd.DataFrame(row_list3)
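The three loops that build df_rule1, df_rule2 and df_rule3 differ only in the rule function they call. A minimal sketch of a shared helper is given below; the name `apply_rule` and the choice to work on a list of row dictionaries are assumptions for illustration, not part of the notebook.

```python
def apply_rule(records, rule):
    """Run `rule` on each record's 'Sent' text and return one output row per record.

    `records` is a list of dicts with the keys 'Sent', 'Len' and 'Len_char',
    matching the columns of df2.
    """
    rows = []
    for record in records:
        rows.append({
            'Len_word': record['Len'],
            'Len_char': record['Len_char'],
            'Sent': record['Sent'],
            'Output': rule(record['Sent']),
        })
    return rows
```

With the dataframes above, `df_rule1 = pd.DataFrame(apply_rule(df2.to_dict('records'), rule1))` should give the same columns as the explicit loop, and the same call with rule2 or rule3 would replace the other two loops.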

All three of the rules above detect very basic sentence structures and will need to be refined and built upon. There may also be other general rules worth considering.

For example, the first rule, which detects noun-verb-noun sequences, can be refined so that the subject performing an action or the object being acted upon is more narrowly defined. We can restrict the subjects or objects to specific keywords that are relevant to us. In the same way, the verb detection can be restricted to keyword verbs.
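A rough sketch of such a refinement is shown below. It filters the phrases returned by rule1 after the fact, assuming each phrase has the form "subject verb object" with single-word subject and verb. The function name `filter_phrases` and the keyword sets in the example are hypothetical.

```python
def filter_phrases(phrases, subjects=None, verbs=None, objects=None):
    """Keep 'subject verb object' phrases whose parts match the given keyword sets.

    A keyword set of None means "accept anything" for that position.
    Keywords are expected in lowercase.
    """
    kept = []
    for phrase in phrases:
        parts = phrase.lower().split()
        if len(parts) < 3:
            continue  # not a full subject-verb-object triple
        subject, verb, rest = parts[0], parts[1], parts[2:]
        if subjects is not None and subject not in subjects:
            continue
        if verbs is not None and verb not in verbs:
            continue
        if objects is not None and not any(word in objects for word in rest):
            continue
        kept.append(phrase)
    return kept


phrases = ['engineers assess risks', 'they seek advice', 'storms damage infrastructure']
print(filter_phrases(phrases, verbs={'assess', 'damage'}))
# ['engineers assess risks', 'storms damage infrastructure']
print(filter_phrases(phrases, objects={'infrastructure'}))
# ['storms damage infrastructure']
```

Filtering after extraction keeps rule1 itself unchanged. The same constraints could instead be moved inside the rule by comparing token.lemma_ and the subject and object tokens against the keyword sets.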

Disaster Keyword Filter:
-----

The dk_check function is a disaster keyword check: it looks for any of the specified keywords in each sentence. A sentence is considered only if it contains at least one keyword and the rule produced output for it; otherwise it is skipped.

The keyword list can be extended. It represents general disaster terms that are likely to appear in many texts.

In [9]:
def dk_check(matrix):
    keywords = ['Climate change',
                'Climate breakdown',
                'Flooding',
                'Floodings',
                'Flood',
                'Floods',
                'Deluge rain event',
                'Deluge rain events',
                'Ocean acidification',
                'Ocean acidifications',
                'Natural disaster',
                'Natural disasters',
                'Winter storm',
                'Winter storms',
                'Droughts',
                'Drought',
                'Wildfire',
                'Wildfires',
                'Ice storm',
                'Ice storms',
                'Hail',
                'Hailstorm',
                'Hailstorms',
                'Heat wave',
                'Heat waves',
                'Extreme weather',
                'Hurricane',
                'Hurricanes']
    lower_keywords = [entry.lower() for entry in keywords]
    for i in range(len(matrix)):
        # lowercase the sentence once so that keyword matching is case-insensitive
        sentence = matrix['Sent'][i].lower()
        has_keyword = any(disaster_keyword in sentence for disaster_keyword in lower_keywords)
        # report only sentences that contain a keyword and produced rule output
        if has_keyword and matrix['Output'][i] != []:
            print("(((", "Sentence Number", i,')))', matrix['Sent'][i], 100*"-_")

dk_check(df_rule1)

((( Sentence Number 21 )))  Accelerated climate change presents new and evolving challenges, opportunities and risks
that will need to be considered by engineers in the fulfillment of their professional responsibilities -_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_
((( Sentence Number 45 )))  This understanding imposes a responsibility of due diligence on
the engineering profession to address the issue of climate change within engineering works -_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_
((( Sentence Number 49 ))) 
Scientific literature indicates significant departures from historical climate averages occurring globally and engineering designs
must account for an expanded range o

((( Sentence Number 383 ))) 

Engineers can also conduct sensitivity analyses to account for the potential consequences of different climate change scenarios -_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_
((( Sentence Number 414 ))) 

For this climate information, seek the advice from climate scientists and climate experts to define the:

»

Associated uncertainties with the information

15

»

Assumptions made

»

Data sources

»

Relative differences between current climate data derived from measured metrological data and projected climate
information based on modelling

»

Scientific validity of the methods and data used to derive current and future climate parameter values and frequencies

»

The criticality of the impact of the climate assumptions on the overall engineering design and function of the system

»

Assumptions and f

Readability:
---

The row number of the dataframe (i.e., the sentence number) is printed in parentheses before each reported sentence. The sentence itself follows, ending with a separator line of repeated "-_" characters that makes the output easier to read in the Jupyter Notebook environment.
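Because dk_check prints its matches, they cannot easily be reused in later cells. The sketch below is a variant that returns the matching row numbers instead. It is an illustration under assumptions, not the notebook's method: the name `find_keyword_rows` is hypothetical, and it uses whole-word, case-insensitive matching, which is stricter than the substring test in dk_check.

```python
import re


def find_keyword_rows(sentences, outputs, keywords):
    """Return indices of sentences that contain a keyword and have non-empty rule output."""
    # Longest keywords first so multi-word terms are tried before shorter ones.
    ordered = sorted(keywords, key=len, reverse=True)
    pattern = re.compile(
        r'\b(?:' + '|'.join(re.escape(k) for k in ordered) + r')\b',
        re.IGNORECASE,
    )
    matches = []
    for i, (sent, out) in enumerate(zip(sentences, outputs)):
        # collapse line breaks so that "climate\nchange" still matches "climate change"
        flat = ' '.join(sent.split())
        if out and pattern.search(flat):
            matches.append(i)
    return matches


sentences = ['Floods damage roads', 'The committee hailed the report',
             'Hail damaged crops', 'Drought is severe']
outputs = [['Floods damage roads'], ['committee hail report'], ['Hail damage crops'], []]
print(find_keyword_rows(sentences, outputs, ['Flood', 'Floods', 'Hail', 'Drought']))
# [0, 2]
```

Whole-word matching avoids false hits such as "hailed" for the keyword "Hail". The trade-off is that inflected forms are no longer matched implicitly, so plurals and variants have to be listed, as the keyword list above already does. With the dataframes in this notebook, the call would be `find_keyword_rows(df_rule1['Sent'], df_rule1['Output'], keywords)`, with the keyword list passed in explicitly.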




Findings
-----

The extracted sentences show that the approach finds relevant sentences, but it also produces some undesirable results.

Some texts contain long, run-on sentences that are only partly relevant to us. Using the rules created earlier, it may be possible to narrow an extracted sentence down to the relevant information and exclude the irrelevant parts.

Some extracted sentences are informative and good to know, but not very important in everyday use for the average citizen. It is likely best to filter out this type of information.

The preliminary extraction results rely primarily on the first rule, which detects noun-verb-noun sequences. One way of refining this extraction is to filter for disaster keywords, as the dk_check function above does. However, we think it is possible to improve the extraction process further and to incorporate the other elementary rules with their own refinements.

Issues
---

Certain rows that are classified as single "sentences" actually consist of several sentences, usually in the form of bullet points or other indented text. Splitting the text file on punctuation alone is therefore not sufficient: text that contains bullet points or other indentation may need a different approach before it can be divided into proper, single sentences.
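One possible pre-processing step for bulleted text is sketched below. It assumes that each bullet appears as a line containing only the "»" marker, as in the output shown earlier, and that a line containing only digits is a stray page number. Both are assumptions about this particular PDF, and `split_bullets` is a hypothetical name.

```python
def split_bullets(text, marker='»'):
    """Split text into items at lines that contain only the bullet marker."""
    items, current = [], []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == marker:
            # a marker line closes the previous item and starts a new one
            if current:
                items.append(' '.join(current))
            current = []
        elif stripped.isdigit():
            continue  # assumed to be a stray page number
        elif stripped:
            current.append(stripped)
    if current:
        items.append(' '.join(current))
    return items


sample = ("Seek advice to define the:\n\n»\n\nAssociated uncertainties\n\n15\n\n"
          "»\n\nAssumptions made\n\n»\n\nData\nsources")
print(split_bullets(sample))
# ['Seek advice to define the:', 'Associated uncertainties', 'Assumptions made', 'Data sources']
```

Each resulting item could then be passed to the sentence splitter on its own, so that a bulleted list no longer ends up in a single dataframe row.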

Certain PDFs are converted into irregular layouts, such as hard line breaks in the middle of sentences, that make the text difficult to work with. The conversion method may need a closer look, but it is difficult to assess how best to approach this.
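Some of these layout problems can be reduced after conversion rather than in the conversion itself. The sketch below handles only one of them, hard-wrapped lines, by joining lines within a paragraph while keeping blank lines as paragraph breaks. It is an assumption-laden illustration (`unwrap_paragraphs` is a hypothetical name) and will not repair multi-column layouts or tables.

```python
import re


def unwrap_paragraphs(text):
    """Join hard-wrapped lines within a paragraph; keep blank lines as paragraph breaks."""
    # one or more blank (or whitespace-only) lines separate paragraphs
    paragraphs = re.split(r'\n\s*\n', text)
    joined = [' '.join(paragraph.split()) for paragraph in paragraphs]
    return '\n\n'.join(paragraph for paragraph in joined if paragraph)


sample = ("Public guideline: Principles of\n\nclimate adaptation and\n"
          "mitigation for engineers\n\n\nNotice  \n")
print(unwrap_paragraphs(sample))
```

For the sample above, the second paragraph becomes the single line "climate adaptation and mitigation for engineers", and the extra blank lines and trailing spaces are removed.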

The differences among texts also pose a problem when trying to refine the rules-based extraction process further. Classifying texts into categories such as scientific or government documents may be ideal, but each category would then require its own nuanced rules.