Class,

As promised, here's a quick notebook on converting a directory full of PDFs into another directory folder full of equivalent text files.

I expect you've followed install instructions and have the following modules loaded.
> ## installation and setup for py 3.x (takes longtime to install!)
conda install -c conda-forge pdfminer.six

I have as a demo a folder with 3 DC slide deck pdf files. Let's read from and write to the same folder. 

In [1]:
## setup chunk
import os
import pdfminer
import time
import re
import io

Next we first explore and list out all directory contents (as a list of filenames).

In [2]:
# list files from target directory
from os import listdir
path1 = u"/Users/anmol/Dropbox/isb/Term1/Data Collection/Lec5/PDF in py/pdfs in py/"
filesList = os.listdir(path1)

In [3]:
# using regex to retain only .pdf files
import re
filesFiltered = [f for f in filesList if re.search(r'\.pdf',f)]
filesFiltered

['bookdown-tweepy.pdf.txt',
 'lec 2 slide deck april 2019.pdf',
 'bookdown-tweepy.pdf',
 'lec 1 slide deck for LMS.pdf']

### Write Func to convert one pdf file to text

Idea is simple. If I can convert one pdf file, I can loop over the list and convert them all.

In [4]:
# converting one pdf file to text
from io import StringIO
from io import BytesIO
from io import BytesIO as StringIO
from urllib.request import urlopen
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
import os
import sys, getopt

In [5]:
## func to convert pdf, return its text content as a string
def convert(fname, pages=None):
    if not pages:
        pagenums = set()
    else:
        pagenums = set(pages)

    output = StringIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)
    print(os.path.join(fname))
    infile = open(fname, 'rb')
    for page in PDFPage.get_pages(infile, pagenums):
        interpreter.process_page(page)
    infile.close()
    converter.close()
    text = output.getvalue()
    output.close
    return text 

Notice the exception-handling done above for when num of pages to convert is not specified.

Best practice is to always test drive a func after writing it.

Any errors that show up will have to be handled with exception-handling etc.

In [6]:
## trial the func
fname = os.path.join(path1 + filesFiltered[2])
#print(fname)
#fname = fname.replace(" ", "\\ ")
print(fname)
t1 = time.time()
text1 = convert(fname, pages=None)
time.time() - t1  # 0.89 secs

/Users/anmol/Dropbox/isb/Term1/Data Collection/Lec5/PDF in py/pdfs in py/bookdown-tweepy.pdf
/Users/anmol/Dropbox/isb/Term1/Data Collection/Lec5/PDF in py/pdfs in py/bookdown-tweepy.pdf


0.5651040077209473

In [7]:
print(text1[0:200])   # view first 200 odd chars of the output

b'Working with twitter in Python\n\nAashish Pandey\n\n2018-01-09\n\n\x0c2\n\n\x0cContents\n\n1 tweepy - A Python library for accessing the Twitter API\n\n2 - Creating the app\n\n3 - Getting Started\n\n4 - Getting all tweets '


### Looping above func over all files in Dir

In [11]:
## func to convert all pdfs in input directory (inpDir), save resulting txt files to outDir
def convertMultiple(inpDir, outDir):
    if inpDir == "": pdfDir = os.getcwd() + "\\" #if no pdfDir passed in 
    for pdf in os.listdir(inpDir): #iterate through pdfs in pdf directory
        fileExtension = pdf.split(".")[-1]
        if fileExtension == "pdf":
            pdfFilename = inpDir + pdf 
            text = convert(pdfFilename) #get string of text content of pdf
            textFilename = outDir + pdf + ".txt"
            textFile = io.open(textFilename, "wb", encoding = "utf-8") # make text file
            textFile.write(text) #write text to text file

In [12]:
# define trial folders
inpDir = path1
outDir = path1

In [13]:
# test-driving the func
t1 = time.time()
convertMultiple(inpDir, outDir)
time.time() - t1  # 3.8 secs

/Users/anmol/Dropbox/isb/Term1/Data Collection/Lec5/PDF in py/pdfs in py/lec 2 slide deck april 2019.pdf


ValueError: binary mode doesn't take an encoding argument

Done. Nice and easy.

Sudhir

In [None]:
import sys 
reload(sys)
sys.setdefaultencoding('utf-8')