# <center>Climate Action Tracking Prompt:</center>
## <center> PDF Data Extraction using NLP </center>
<hr>

### PDF parsing:
            
PDF is a document format developed by Adobe, based on PostScript. PDFs are highly favoured by the scientific community, and most documents and proposals related to climate action are published as PDFs. For parsing, the content of these documents can be divided into three kinds:

1. <b>Text:</b> Information can be extracted using NLP.
2. <b>Images:</b> Support for extracting images is weaker than for text. Libraries such as PyMuPDF can pull out embedded image objects, but figures drawn as vector graphics usually require custom parsing.
3. <b>Tables:</b> Libraries such as camelot and tabula can extract tables, but they do not always work and their accuracy is inconsistent.

### 1) Parsing Text in PDFs using Python and NLP:

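The methodology used is as follows: read the PDF with PyPDF2, run spaCy's named entity recognizer over the extracted page text, and collect the recognized entities into a pandas DataFrame.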


#### Reading PDF using PyPDF2:

In [177]:
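import PyPDF2

# NOTE: PdfFileReader, numPages, getPage and extractText are the PyPDF2 < 3.0 names.
# They were removed in PyPDF2 3.0 in favour of PdfReader, len(reader.pages),
# reader.pages[i] and extract_text().
pdf = open('action_example_dublin.pdf', 'rb')

pdfReader = PyPDF2.PdfFileReader(pdf)

# Create a list of page objects, one per page of the PDF
page = [pdfReader.getPage(x) for x in range(pdfReader.numPages)]

# The file is deliberately left open: page content may still be read from it
# when extractText() is called in later cells.
# Call pdf.close() once all text has been extracted.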

#### Extracting data using spaCy:

In [178]:
import spacy
import pandas as pd

nlp = spacy.load('en_core_web_md')

In [179]:
doc = nlp(page[20].extractText().replace('\n', ''))
print(doc)

Dublin City Sustainable Energy Action Plan 2010-2020 v2   19 Fiscal Incentives Congestion charges:  The costs, benefits and effects of congestion charging may be assessed, especially in terms of greenhouse gas emissions. This charge will possibly be introduced before the completion of Transport 21 Free parking for electric vehicles:  New generation electric hybrid vehicles are cleaner and more energy efficient than conventional petrol/diesel units.  On the issue of Dublin City Council providing free parking for electric vehicles, there is a need to clarify existing legislation on assigning individual parking bays for charging electrical vehicles and to examine the costs and benefits of introducing widespread charging facilities for electric vehicles National and EU incentives:  There is a variety of grants available under national and European programmes that support specific local sustainable energy projects, helping to deliver Irish and EU policy objectives for energy and climate cha
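
Whether newlines should be deleted or replaced by a space depends on how the extractor emits them, and the extracted text above still contains runs of spaces. A minimal cleanup sketch using only the standard library; the helper name `normalize_whitespace` is hypothetical and not part of PyPDF2 or spaCy:

```python
import re


def normalize_whitespace(raw_text):
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    return re.sub(r"\s+", " ", raw_text).strip()


sample = "Fiscal Incentives\nCongestion charges:  The costs,\nbenefits and effects"
print(normalize_whitespace(sample))
```

Under this assumption, the cleaned string could be passed to `nlp(...)` in place of the `replace('\n', '')` result.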

#### Entities in the text:

In [180]:
from spacy import displacy
displacy.render(doc, style='ent')

#### Analysis:

As the visualization above shows, the Named Entity Recognizer (NER) does a reasonably good job of identifying entities even though it was not trained on climate action documents, but it has gaps. For example, it fails to treat units of measurement such as CO2 and CO2/per employee as part of a QUANTITY entity. Datasets and entity relations therefore need to be created specifically for climate action proposals.
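
Until such a model exists, one stopgap is a plain regular expression that recognizes emission quantities together with their units. This is a swapped-in technique, not spaCy's NER, and only a rough sketch: the pattern `EMISSION_QUANTITY`, the function `find_emission_quantities` and the list of mass units are all assumptions made for illustration.

```python
import re

# Hypothetical pattern: a number, a mass unit, then CO2 or CO2e,
# optionally followed by an intensity suffix such as "/employee".
EMISSION_QUANTITY = re.compile(
    r"""
    (?P<value>\d+(?:,\d{3})*(?:\.\d+)?)            # 1719, 1,719 or 3.5
    \s*
    (?P<unit>
        (?:(?:kilo|mega)?tonnes?|tons?|kg|kt|Mt|t) # mass unit (illustrative list)
        \s*(?:of\s+)?
        CO2e?                                      # CO2 or CO2e
        (?:\s*(?:/|per)\s*\w+)?                    # optional "/x" or "per x"
    )
    """,
    re.VERBOSE,
)


def find_emission_quantities(text):
    """Return a (value, unit) pair for every emission quantity found in text."""
    results = []
    for match in EMISSION_QUANTITY.finditer(text):
        value = float(match.group("value").replace(",", ""))
        unit = " ".join(match.group("unit").split())  # tidy internal spacing
        results.append((value, unit))
    return results


sample = ("The department emitted 1,719 tonnes CO2 last year, "
          "or 12 tCO2/employee, and aims for 3.5 kg CO2 per employee.")
print(find_emission_quantities(sample))
```

A quantity without an explicit CO2 marker, such as "2000 tonnes", is deliberately not matched by this sketch, so it complements rather than replaces the QUANTITY entities spaCy already finds.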

The default spaCy models alone are therefore unlikely to yield good results for data extraction. For the sake of prototyping, however, let us take an idealized example text and see how information can be extracted from it into a pandas DataFrame using spaCy.

In [181]:
text = """Dublin is a great city. Dublin has proposed to reduce emissions in the city by 50% till 2025.
At this rate Dublin will thus have net emissions of minus 5% by 2040. The carbon footprint of The Dublin Finance Department
is 1719 tonnes, with the number emissions expected to rise by 2% next year in France. Apple in Dublin has reduced 
emissions by 2000 tonnes this year.
"""
text

'Dublin is a great city. Dublin has proposed to reduce emissions in the city by 50% till 2025.\nAt this rate Dublin will thus have net emissions of minus 5% by 2040. The carbon footprint of The Dublin Finance Department\nis 1719 tonnes, with the number emissions expected to rise by 2% next year in France. Apple in Dublin has reduced \nemissions by 2000 tonnes this year.\n'

In [182]:
doc = nlp(text.replace('\n', ' '))
displacy.render(doc, style='ent')

In [183]:
doc = nlp(text)
displacy.render(doc, style='dep')

In [184]:
Columns = {'Organization or Actor': [], 'Location': [], 'Action': [], 'Quantity': [], 'Date or Timeline': []}

df = pd.DataFrame(Columns)

In [185]:
df

| | Organization or Actor | Location | Action | Quantity | Date or Timeline |
|---|---|---|---|---|---|


In [186]:
rows = []

for sent in doc.sents:
    # Re-parse each sentence on its own so dependencies and entities are sentence-local
    sentence = nlp(sent.text)

    actor_list, location_list, action_list, date_list, quantity_list = [], [], [], [], []

    for token in sentence:
        # Adjectives and verbs serve as a rough proxy for the action
        if token.pos_ in ("ADJ", "VERB"):
            action_list.append(token.text)

        # A place name that is the object of a preposition ("in Dublin") is a location;
        # any other place name is treated as the actor
        if token.ent_type_ == "GPE":
            if token.dep_ == "pobj":
                location_list.append(token.text)
            else:
                actor_list.append(token.text)

    for ent in sentence.ents:
        if ent.label_ == "ORG":
            actor_list.append(ent.text)
        elif ent.label_ == "DATE":
            date_list.append(ent.text)
        elif ent.label_ in ("QUANTITY", "PERCENT"):
            quantity_list.append(ent.text)

    # Keep only sentences that name an actor, a quantity and a date
    if actor_list and quantity_list and date_list:
        rows.append({
            'Organization or Actor': ', '.join(actor_list),
            'Location': ', '.join(location_list),
            'Action': ', '.join(action_list),
            'Quantity': ', '.join(quantity_list),
            'Date or Timeline': ', '.join(date_list)
        })

# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
df = pd.DataFrame(rows, columns=df.columns)

In [187]:
df

| | Organization or Actor | Location | Action | Quantity | Date or Timeline |
|---|---|---|---|---|---|
| 0 | Dublin | | proposed, reduce | 50% | 2025 |
| 1 | Dublin | | will, net | minus 5% | 2040 |
| 2 | The Dublin Finance Department | France | expected, rise, next | 1719 tonnes, 2% | next year |
| 3 | Apple | Dublin | reduced | 2000 tonnes | this year |


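This gives the required DataFrame, which turns statements in the text into structured, quantified records. The heuristics are still crude: row 2, for example, merges two separate quantities (1719 tonnes and 2%) into a single record. Further processing with pandas and some text manipulation could improve the results.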
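
As one possible form of that text manipulation, the strings in the Quantity column can be split into a signed number and a unit. The sketch below uses only the standard library and assumes the formats seen in the table above; `parse_quantity` is a hypothetical helper, not a pandas or spaCy function.

```python
import re

# Optional "minus"/"-" sign, a number, then an optional "%" or word unit
QUANTITY_PATTERN = re.compile(
    r"(?P<sign>minus\s+|-)?(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>%|[A-Za-z]+)?"
)


def parse_quantity(quantity):
    """Split e.g. '50%', 'minus 5%' or '1719 tonnes' into (value, unit).

    Returns None when the string contains no number.
    """
    match = QUANTITY_PATTERN.search(quantity)
    if match is None:
        return None
    value = float(match.group("value"))
    if match.group("sign"):
        value = -value
    return value, match.group("unit") or ""


# The Quantity cells from the table above; a cell may hold several values
quantity_column = ["50%", "minus 5%", "1719 tonnes, 2%", "2000 tonnes"]
for cell in quantity_column:
    print([parse_quantity(part) for part in cell.split(", ")])
```

The same function could be applied to the real column with `df['Quantity'].apply(...)`, splitting each cell on `', '` first.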

## References Used:

This Jupyter notebook borrows heavily from the following sources:
- https://course.spacy.io/
- https://towardsdatascience.com/python-for-pdf-ef0fac2808b0
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5551970/

## Summary:

The information extraction code above is a heavily simplified adaptation of the approach in the research paper [Prescription Extraction Using CRFs and Word Embeddings](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5551970/). Training on context-specific datasets and entity relationships should give much better results with this approach.
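
One way to start building such a dataset is to annotate sentences with character-offset entity spans, a format commonly used for NER training data and accepted by spaCy's training examples. The sketch below uses only the standard library; the helper `annotate` and the labels `ACTOR`, `EMISSION_QUANTITY` and `TARGET_DATE` are hypothetical choices, not an established scheme.

```python
def annotate(text, labelled_phrases):
    """Build a (text, {"entities": [(start, end, label), ...]}) example.

    Each phrase is located by its first occurrence in the text.
    """
    entities = []
    for phrase, label in labelled_phrases:
        start = text.find(phrase)
        if start == -1:
            raise ValueError(f"{phrase!r} not found in text")
        entities.append((start, start + len(phrase), label))
    return text, {"entities": sorted(entities)}


example = annotate(
    "Dublin has proposed to reduce emissions in the city by 50% till 2025.",
    [("Dublin", "ACTOR"), ("50%", "EMISSION_QUANTITY"), ("2025", "TARGET_DATE")],
)
print(example)
```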

In the course of this project, it became clear that the documents used for climate action proposals still lack standardization. A standard structure would go a long way towards aiding information extraction from such PDFs, especially given the limited library support for working with PDFs.