### PDF to plaintext

Task: Demonstrate how to extract plain-text content from PDF files. This is useful for processing Climate Watch NDC PDFs (e.g. those that can be [downloaded from CAIT](http://cait.wri.org/indc/)). Examples of these PDFs are included in this repo in the *NDC_pdfs* subfolder.

We will use [pdfminer](http://euske.github.io/pdfminer/index.html), a pure-Python library that supports Python 2.7 only.

Note: I also investigated [PyPDF2](https://pythonhosted.org/PyPDF2/PdfFileReader.html) but found that it produced poorer results.

In [1]:
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from pprint import pprint
from cStringIO import StringIO

In [2]:
def clean_block(text):
    # Strip carriage returns, tabs, and newlines from the extracted text.
    for char in ('\r', '\t', '\n'):
        text = text.replace(char, '')
    return text

def convert_pdf_to_txt(path, clean=False):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0  # 0 means no page limit
    caching = True
    pagenos = set()  # an empty set means all pages
    # open() replaces the Python 2-only file() builtin; the with block closes the file
    with open(path, 'rb') as fp:
        for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,
                                      caching=caching, check_extractable=True):
            interpreter.process_page(page)
    device.close()
    text = retstr.getvalue()
    retstr.close()
    if clean:
        text = clean_block(text)
    return text
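
`clean_block` deletes line breaks outright, which can glue the last word of one line to the first word of the next. As a minimal alternative sketch, assuming you would rather keep word boundaries, each run of whitespace can be collapsed to a single space instead. The name `collapse_whitespace` is hypothetical and not part of pdfminer.

```python
import re

def collapse_whitespace(text):
    # Replace every run of spaces, tabs, and line breaks with one space,
    # then trim the ends so words from adjacent lines stay separated.
    return re.sub(r'\s+', ' ', text).strip()

sample = 'Nationally Determined \nContribution (INDC)\t and information'
print(collapse_whitespace(sample))
```

This loses paragraph breaks as well, so it suits keyword search or tokenisation better than anything that depends on document layout.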

In [8]:
fname = './NDC_pdfs/INDC_AFG_Paper_En_20150927_.docx FINAL.pdf'
test = convert_pdf_to_txt(fname, clean=False)

In [9]:
test

' \n\n \n\n \n\n \n\n \n\nISLAMIC REPUBLIC OF AFGHANISTAN \n\nIntended Nationally Determined Contribution  \n\nSubmission to the United Nations Framework Convention on Climate Change \n\n21 September 2015 \n\n***** \n\nThe Islamic Republic of Afghanistan hereby communicates its Intended Nationally Determined \nContribution (INDC) and information to facilitate understanding of the contribution.  \n\nExecutive Summary \n\nBase Year: \n\n2005 \n\nTarget Years: \n\n2020 to 2030 \n\nContribution Type:  Conditional \n\nSectors: \n\nEnergy, natural resource management, agriculture, waste management and mining \n\nGases Covered: \n\nCarbon dioxide (CO2), methane (CH4), and nitrous oxide (N2O) \n\nTarget: \n\nThere will be a 13.6% reduction in GHG emissions by 2030 compared to a business \nas usual (BAU) 2030 scenario, conditional on external support. \n\nFinancial Needs: \n\nTotal: USD 17.405 billion  \n\n\xef\x82\xb7 Adaptation: USD 10.785 billion \n\xef\x82\xb7 Mitigation: USD 6.62 billion (

Save the plain text to a file as follows.

In [None]:
with open("test.txt", "w") as text_file:
    text_file.write(test)
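
The `\xef\x82\xb7` sequences in the output above are the UTF-8 bytes of U+F0B7, a private-use code point that often appears when bullets set in Word's Symbol font are extracted. The sketch below assumes the extracted text is a UTF-8 byte string (as the output above suggests under Python 2). It decodes the text, swaps that code point for a standard bullet, and writes the result with an explicit encoding. `normalize_bullets` and the output path are hypothetical.

```python
import io
import os
import tempfile

def normalize_bullets(raw):
    # Decode UTF-8 bytes to unicode text if needed (assumption: the
    # extractor returned a UTF-8 byte string).
    text = raw.decode('utf-8') if isinstance(raw, bytes) else raw
    # U+F0B7 is a private-use code point; map it to a standard bullet.
    return text.replace(u'\uf0b7', u'\u2022')

sample = b'\xef\x82\xb7 Adaptation: USD 10.785 billion \n'
cleaned = normalize_bullets(sample)

# io.open accepts an encoding argument on both Python 2 and Python 3.
out_path = os.path.join(tempfile.gettempdir(), 'test_clean.txt')
with io.open(out_path, 'w', encoding='utf-8') as text_file:
    text_file.write(cleaned)
```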

To process multiple files, identify the target files and process them all as follows:

In [3]:
import os

In [4]:
pdf_directory = './NDC_pdfs'
files = [ filename for filename in os.listdir(pdf_directory) if filename.endswith('.pdf')]
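
`os.listdir` returns names in arbitrary order, and `endswith('.pdf')` skips files with an upper-case extension. A slightly more defensive sketch follows, assuming you want a stable, case-insensitive listing; `list_pdfs` is a hypothetical helper.

```python
import os

def list_pdfs(directory):
    # Keep regular files whose extension is .pdf in any letter case,
    # sorted so repeated runs process files in the same order.
    return sorted(
        name for name in os.listdir(directory)
        if name.lower().endswith('.pdf')
        and os.path.isfile(os.path.join(directory, name))
    )
```

It could stand in for the list comprehension above as `files = list_pdfs(pdf_directory)`.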

In [5]:
d = {}
for fname in files:
    target = os.path.join(pdf_directory, fname)
    print("Reading {}".format(target))
    try:
        d[fname] = convert_pdf_to_txt(path=target, clean=False)
    except Exception:
        # Record the failure but keep processing the remaining files.
        d[fname] = None

Reading ./NDC_pdfs/INDC-Ethiopia-100615.pdf
Reading ./NDC_pdfs/INDC_2015_of_Bangladesh.pdf
Reading ./NDC_pdfs/INDC_AFG_Paper_En_20150927_.docx FINAL.pdf
Reading ./NDC_pdfs/LV-03-06-EU INDC.pdf
Reading ./NDC_pdfs/Norway INDC 26MAR2015.pdf


In [7]:
d  # the text for each file is accessible via a key from the d object
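
Because failed conversions are stored as `None`, it helps to separate them out before using the results. Here is a minimal sketch, assuming `d` maps each filename to its text or to `None` as built above; `save_texts` and the output directory are hypothetical.

```python
import io
import os

def save_texts(results, out_dir):
    # Write each successfully extracted text to <out_dir>/<name>.txt and
    # return the names of the files whose conversion failed (value None).
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    failed = []
    for name, text in sorted(results.items()):
        if text is None:
            failed.append(name)
            continue
        if isinstance(text, bytes):
            text = text.decode('utf-8')  # assumes UTF-8 byte strings
        out_name = os.path.splitext(name)[0] + '.txt'
        with io.open(os.path.join(out_dir, out_name), 'w', encoding='utf-8') as f:
            f.write(text)
    return failed
```

A call such as `failed = save_texts(d, './NDC_txt')` would then leave one text file per converted PDF and a list of the PDFs that need a second look.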