# Extracting Paragraphs from the EU Taxonomy Document


In [8]:
!pip install textract

Defaulting to user installation because normal site-packages is not writeable


In [9]:
import re

import textract
import pandas as pd

## Objective

Process the EU sustainable finance taxonomy PDF file: extract every paragraph in the document and clean it.

## Download the EU sustainable finance taxonomy PDF from Taxonomy Report: Technical Annex.

## Load the EU sustainable finance taxonomy PDF file using the textract library and decode it. 

Inspect the decoded text to confirm that the whole document was extracted and that decoding did not produce any bad characters.
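
Part of this check can be automated by counting characters that often signal a decoding problem. The sketch below assumes that "bad characters" means the Unicode replacement character (U+FFFD) and control characters other than newline, tab, carriage return and form feed; both that definition and the helper name `find_suspicious_chars` are illustrative assumptions, not part of the notebook.

```python
import unicodedata
from collections import Counter

# Control characters that are expected in extracted text
# (form feeds mark page breaks in pdftotext output).
ALLOWED_CONTROLS = {"\n", "\t", "\r", "\x0c"}

def find_suspicious_chars(text):
    """Count characters that often indicate a decoding problem (hypothetical helper)."""
    counts = Counter()
    for ch in text:
        if ch == "\ufffd":
            # U+FFFD is inserted when bytes could not be decoded.
            counts[ch] += 1
        elif unicodedata.category(ch) == "Cc" and ch not in ALLOWED_CONTROLS:
            # Any other control character is unlikely to belong in prose.
            counts[ch] += 1
    return counts

# Made-up sample containing one replacement character and one NUL byte.
sample = "About this report\n\x0cPART A \ufffd criteria\x00"
print(find_suspicious_chars(sample))
```

An empty result on the real text would be a reasonable sign that decoding went cleanly, although it does not prove that no text is missing.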

In [20]:
text = textract.process('C:/Users/USER/Downloads/question-answering-with-deep-learning-main/question-answering-with-deep-learning-main/31.pdf')

ShellError: The command `pdftotext C:/Users/USER/Downloads/question-answering-with-deep-learning-main/question-answering-with-deep-learning-main/31.pdf -` failed with exit code 127
------------- stdout -------------
------------- stderr -------------


In [14]:
text = text.decode()

NameError: name 'text' is not defined

In [15]:
text = textract.process('31.pdf', method='pdfminer').decode()

ShellError: The command `pdf2txt.py 31.pdf` failed with exit code 127
------------- stdout -------------
------------- stderr -------------
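
Exit code 127 is the shell's conventional status for "command not found", so both errors suggest that the command-line tools textract shells out to (`pdftotext` and `pdf2txt.py`, as named in the error messages) are not on the PATH. A minimal check using only the standard library might look like this; the helper name `missing_tools` is hypothetical.

```python
import shutil

# Commands named in the two ShellError messages above.
REQUIRED_TOOLS = ["pdftotext", "pdf2txt.py"]

def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

print("Not found on PATH:", missing_tools())
```

`pdftotext` is distributed with the Poppler utilities and `pdf2txt.py` ships with pdfminer, so installing whichever is reported missing is a likely fix, though the exact steps depend on the platform.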


## Use regular expressions to split the paragraphs and clean the text. 

The loaded text is raw and must be segmented into paragraphs. Each paragraph then needs cleaning: remove newline characters and any other characters that add no semantic value, such as tabs or bullet points.
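
As a small illustration of these two steps, the sketch below splits a made-up sample on blank lines and then normalises each piece. The sample string, the set of bullet characters, and the helper names `split_paragraphs` and `normalise` are assumptions for illustration and are not taken from the notebook.

```python
import re

def split_paragraphs(raw):
    """Split raw text wherever a blank (whitespace-only) line occurs."""
    return [p for p in re.split(r"\n\s*\n", raw) if p.strip()]

def normalise(paragraph):
    """Remove bullet markers, then collapse newlines, tabs and repeated spaces."""
    without_bullets = re.sub(r"[•▪◦]", " ", paragraph)
    return re.sub(r"\s+", " ", without_bullets).strip()

# Made-up sample: two headings around a bulleted, line-wrapped paragraph.
raw = "PART A\n\n•\tFirst line of a paragraph\ncontinues here.\n\n\nPART B\n"
cleaned = [normalise(p) for p in split_paragraphs(raw)]
print(cleaned)
```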

In [4]:
len(text)

1320996

In [127]:
text[0:1000]

'Updated methodology & Updated Technical Screening Criteria\n- 1-\n\nMarch 2020\n\n\x0cAbout this report\nThis document includes an updated Part B: Methodology from the June 2019 report and an updated Part\nF: Full list of technical screening criteria. The other original sections from the June 2019 report can be\nfound as labelled in the June 2019 report.\nPART A\n\nExplanation of the Taxonomy approach. This section sets out the role and importance of\nsustainable finance in Europe from a policy and investment perspective, the rationale for\nthe development of an EU Taxonomy, the daft regulation and the mandate of the TEG.\n\nPART B\n\nMethodology. This explains the methodologies for developing technical screening\ncriteria for climate change mitigation objectives, adaptation objectives and ‘do no\nsignificant harm’ to other environmental objectives in the legislative proposal.\nThis has been updated since 2019.\n\nPART C\n\nTaxonomy user and use case analysis. This section provides pr

In [120]:
paragraphs = re.split(r"\s*?\n\s*?\n\s*?", text)

In [121]:
len(paragraphs)

8984

In [122]:
paragraphs[2]

'\x0cAbout this report\nThis document includes an updated Part B: Methodology from the June 2019 report and an updated Part\nF: Full list of technical screening criteria. The other original sections from the June 2019 report can be\nfound as labelled in the June 2019 report.\nPART A'
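
The leading `\x0c` (form feed) in this paragraph is a side effect of the lazy quantifiers in the split pattern: a trailing `\s*?` is satisfied by matching nothing, so whitespace that follows the blank line stays attached to the next paragraph. The sketch below compares the notebook's pattern with a greedy variant on a fragment of the raw text shown earlier. The greedy pattern is a suggested alternative, not what the notebook ran, and using it on the full text may change the paragraph count reported above.

```python
import re

# Fragment taken from the start of the raw text, including the page-break form feed.
fragment = "- 1-\n\nMarch 2020\n\n\x0cAbout this report"

lazy = re.split(r"\s*?\n\s*?\n\s*?", fragment)   # pattern used in the notebook
greedy = re.split(r"\s*\n\s*\n\s*", fragment)    # suggested alternative

print(lazy)    # the form feed stays at the start of the last piece
print(greedy)  # the form feed is consumed as part of the separator
```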

In [129]:
paragraphs[3]

'Explanation of the Taxonomy approach. This section sets out the role and importance of\nsustainable finance in Europe from a policy and investment perspective, the rationale for\nthe development of an EU Taxonomy, the daft regulation and the mandate of the TEG.'

In [124]:
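def clean_paragraph(text):
    # Drop punctuation and symbols first, then collapse every whitespace run
    # (newlines, tabs, form feeds, repeated spaces) into a single space.
    text = re.sub(r'[^\w\s]', '', text)
    return re.sub(r'\s+', ' ', text).strip()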

## Store the paragraphs in a DataFrame with the column “paragraph” using the pandas library and save the DataFrame.

In [130]:
df = pd.DataFrame(paragraphs, columns=['paragraph'])

In [131]:
df.head()

| | paragraph |
|---|---|
| 0 | Updated methodology & Updated Technical Screen... |
| 1 | March 2020 |
| 2 | About this report\nThis document includes an ... |
| 3 | Explanation of the Taxonomy approach. This sec... |
| 4 | PART B |


In [132]:
df['paragraph'] = df['paragraph'].apply(clean_paragraph)

In [133]:
df.head()

| | paragraph |
|---|---|
| 0 | Updated methodology Updated Technical Screeni... |
| 1 | March 2020 |
| 2 | About this report This document includes an u... |
| 3 | Explanation of the Taxonomy approach This sect... |
| 4 | PART B |


In [134]:
df.to_csv("paragraphs.csv")
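
By default `to_csv` also writes the DataFrame index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read with `pd.read_csv`. If only the `paragraph` column is wanted, passing `index=False` avoids that. The sketch below shows the round trip on a two-row stand-in DataFrame; the sample rows and the temporary file location are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the real DataFrame of cleaned paragraphs.
sample = pd.DataFrame({"paragraph": ["March 2020", "PART B"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "paragraphs.csv")
    sample.to_csv(path, index=False)   # write only the 'paragraph' column
    reloaded = pd.read_csv(path)       # read it back while the file still exists

print(list(reloaded.columns))
```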