In [1]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [7]:
import slate
import codecs

# Extracting plain text from text-embedded PDFs

In [Module 1](https://diging.atlassian.net/wiki/pages/viewpage.action?pageId=4358148), we briefly touched on the problem of extracting usable plain-text content from PDF documents. In this notebook, we will use a Python package called [slate](https://github.com/timClicks/slate) to extract text from PDF documents that have embedded text (see Module 1 for details on what that means).

## Start with one file

As with most things in life, it's best to work through the whole job on a single example until you're fairly certain of the procedure. So, we'll start with a single file.

First we have to open the file. The [open()](https://docs.python.org/2/library/codecs.html#codecs.open) function creates a [``file``](https://docs.python.org/2/library/stdtypes.html#bltin-file-objects) object. You can open a file like this:

In [4]:
f = codecs.open('../data/example.pdf')

But you have to remember to close the file! An unclosed file keeps holding an operating-system file handle, and when you are writing, buffered data may never reach the disk.

In [5]:
f.close()

A better way to open a file is to use a [``with``](https://docs.python.org/2/reference/compound_stmts.html#with) statement. The basic idea is that we can open the file, do some procedures (in the indented block), and then the file will be automatically closed after the end of the indented block. So:

In [32]:
with codecs.open('../data/example.pdf') as f:
    pass    # do something
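As a quick check of that claim, here is a minimal sketch that inspects the file object's ``closed`` attribute inside and after the block. It uses a throwaway temporary file (an assumption made for illustration) so that it does not depend on ``../data/example.pdf``.

```python
import os
import tempfile

# Create a small throwaway file so the example runs anywhere.
handle, path = tempfile.mkstemp(suffix='.pdf')
os.close(handle)

with open(path, 'rb') as f:
    inside = f.closed     # False: the file is still open inside the block

outside = f.closed        # True: leaving the block closed the file for us
os.remove(path)           # clean up the temporary file
```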

To read the PDF document with slate, we pass the open file to slate's ``PDF`` class.

In [8]:
with codecs.open('../data/example.pdf') as f:
    extracted_text = slate.PDF(f)

Now we have a PDF object.

In [9]:
extracted_text

['I must go down to the seas again, to the lonely sea and the sky,And all I ask is a tall ship and a star to steer her by;And the wheel\xe2\x80\x99s kick and the wind\xe2\x80\x99s song and the white sail\xe2\x80\x99s shaking,And a grey mist on the sea\xe2\x80\x99s face, and a grey dawn breaking.\x0c',
 'I must go down to the seas again, for the call of the running tideIs a wild call and a clear call that may not be denied; And all I ask is a windy day with the white clouds \xef\xac\x82ying,And the \xef\xac\x82ung spray and the blown spume, and the sea-gulls crying. \x0c',
 'I must go down to the seas again, to the vagrant gypsy life,To the gull\xe2\x80\x99s way and the whale\xe2\x80\x99s way where the wind\xe2\x80\x99s like a whetted knife;And all I ask is a merry yarn from a laughing fellow-rover,And quiet sleep and a sweet dream when the long trick\xe2\x80\x99s over.\x0c']

Hey, it's text! Notice that the text is actually represented by a list of strings. Each page is represented as a separate string; how nice! If I wanted to keep these pages separate, I could save each one as a separate text file. Or I could stitch them together into a single string, and save the document as one text file.
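If you prefer the first option, one file per page, something like the following sketch would do it. The helper name ``save_pages`` and the ``_pageN.txt`` naming scheme are assumptions made for illustration, not part of slate. The sketch also assumes the pages are already unicode strings; the byte strings shown above would need decoding first (see "Encoding issues" below).

```python
import io
import os
import tempfile

def save_pages(pages, folder, stem):
    """Write each page (a unicode string) to its own numbered UTF-8 text file."""
    paths = []
    for number, page in enumerate(pages, start=1):
        path = os.path.join(folder, u'{0}_page{1}.txt'.format(stem, number))
        with io.open(path, 'w', encoding='utf-8') as f:
            f.write(page)
        paths.append(path)
    return paths

# Stand-in pages; in the notebook these would be the decoded strings from slate.
demo_pages = [u'first page\x0c', u'second page\x0c']
demo_folder = tempfile.mkdtemp()
written = save_pages(demo_pages, demo_folder, 'example')
```

Here ``written`` holds the paths of the files that were created, in page order.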

To stitch the pages together, I could use the [join()](https://docs.python.org/2/library/stdtypes.html#str.join) method.

In [11]:
# This concatenates the pages, with two newlines intervening between them.
joined = "\n\n".join(extracted_text)    
joined

'I must go down to the seas again, to the lonely sea and the sky,And all I ask is a tall ship and a star to steer her by;And the wheel\xe2\x80\x99s kick and the wind\xe2\x80\x99s song and the white sail\xe2\x80\x99s shaking,And a grey mist on the sea\xe2\x80\x99s face, and a grey dawn breaking.\x0c\n\nI must go down to the seas again, for the call of the running tideIs a wild call and a clear call that may not be denied; And all I ask is a windy day with the white clouds \xef\xac\x82ying,And the \xef\xac\x82ung spray and the blown spume, and the sea-gulls crying. \x0c\n\nI must go down to the seas again, to the vagrant gypsy life,To the gull\xe2\x80\x99s way and the whale\xe2\x80\x99s way where the wind\xe2\x80\x99s like a whetted knife;And all I ask is a merry yarn from a laughing fellow-rover,And quiet sleep and a sweet dream when the long trick\xe2\x80\x99s over.\x0c'

## Encoding issues

Notice all of those pesky escape sequences, e.g. ``\xe2\x80\x99``? Those three bytes are the UTF-8 encoding of a fancy-looking (curly) apostrophe. That means that there was some Unicode data in the document -- that is, characters that can't be represented in plain ASCII (basically just the keys on a US keyboard). When we open a file without specifying an encoding, we get the raw bytes back, with no decoding applied. If we don't decode this byte string using its proper encoding, we'll run into Big Problems in the future.

We will decode the string that we retrieved from our document using the ``decode()`` method. We'll first try the UTF-8 encoding.

In [15]:
joined.decode('utf-8')

u'I must go down to the seas again, to the lonely sea and the sky,And all I ask is a tall ship and a star to steer her by;And the wheel\u2019s kick and the wind\u2019s song and the white sail\u2019s shaking,And a grey mist on the sea\u2019s face, and a grey dawn breaking.\x0c\n\nI must go down to the seas again, for the call of the running tideIs a wild call and a clear call that may not be denied; And all I ask is a windy day with the white clouds \ufb02ying,And the \ufb02ung spray and the blown spume, and the sea-gulls crying. \x0c\n\nI must go down to the seas again, to the vagrant gypsy life,To the gull\u2019s way and the whale\u2019s way where the wind\u2019s like a whetted knife;And all I ask is a merry yarn from a laughing fellow-rover,And quiet sleep and a sweet dream when the long trick\u2019s over.\x0c'

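Now instead of the three bytes ``\xe2\x80\x99``, we see the single character ``\u2019``. Success! If we had gotten the encoding wrong, those bytes would not have been combined into the right character. Latin-1, for example, decodes without complaint but treats each byte as a separate character: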

In [16]:
joined.decode('latin-1')

u'I must go down to the seas again, to the lonely sea and the sky,And all I ask is a tall ship and a star to steer her by;And the wheel\xe2\x80\x99s kick and the wind\xe2\x80\x99s song and the white sail\xe2\x80\x99s shaking,And a grey mist on the sea\xe2\x80\x99s face, and a grey dawn breaking.\x0c\n\nI must go down to the seas again, for the call of the running tideIs a wild call and a clear call that may not be denied; And all I ask is a windy day with the white clouds \xef\xac\x82ying,And the \xef\xac\x82ung spray and the blown spume, and the sea-gulls crying. \x0c\n\nI must go down to the seas again, to the vagrant gypsy life,To the gull\xe2\x80\x99s way and the whale\xe2\x80\x99s way where the wind\xe2\x80\x99s like a whetted knife;And all I ask is a merry yarn from a laughing fellow-rover,And quiet sleep and a sweet dream when the long trick\xe2\x80\x99s over.\x0c'
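Because Latin-1 never raises an error, it only makes sense as a last resort. The sketch below tries UTF-8 first and falls back to Latin-1 only when the bytes are not valid UTF-8. The helper ``decode_bytes`` and the order of candidate encodings are assumptions for illustration, not a general-purpose encoding detector.

```python
def decode_bytes(raw, encodings=('utf-8', 'latin-1')):
    """Return (text, encoding) for the first encoding that decodes raw without error."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue    # not valid in this encoding; try the next candidate
    raise ValueError('none of the candidate encodings worked')

curly = b'wheel\xe2\x80\x99s'    # UTF-8 bytes, as in the poem above
legacy = b'caf\xe9'              # Latin-1 bytes that are not valid UTF-8

text, used = decode_bytes(curly)               # decoded as UTF-8
fallback_text, fallback_used = decode_bytes(legacy)   # falls back to Latin-1

# Decoding the UTF-8 bytes as Latin-1 "works", but yields three characters
# where there should be one.
wrong = curly.decode('latin-1')
```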

Our last step is to write this text to a file. We can ``open()`` a new file (one that doesn't exist yet), and write to it. Note the ``'w'`` that we are passing to ``open()`` -- that means that we want to open the file for writing. We also state that we want to use UTF-8 encoding.

In [22]:
with codecs.open('../data/example.txt', 'w', encoding='utf-8') as f:
    f.write(joined.decode('utf-8'))

You should now be able to find and open the text file at the path above.

## Converting a whole bunch of PDFs

Now let's scale up. We'll assume that we have a folder containing all of our PDFs. First we have to generate a list of all of the files in that folder. Then we have to iterate over all of those files, and extract the text (as above), and write the text into a new set of files that we can use in our computational workflow.

The [os](https://docs.python.org/2/library/os.html) module has a handy function called [listdir()](https://docs.python.org/2/library/os.html#os.listdir) that will give us a list of the entries in a directory.

In [18]:
import os

In [19]:
os.listdir('../data/PDFs/')

['example2.pdf', 'example3.pdf', 'example4.pdf', 'example5.pdf']
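Note that ``listdir()`` returns every entry in the folder, not just PDFs, and in no guaranteed order. If the folder might contain other things (hidden files, subfolders, notes), a filter along these lines keeps only the PDFs. The helper name ``list_pdfs`` is made up for this sketch, which runs against a temporary folder rather than ``../data/PDFs/``.

```python
import os
import tempfile

def list_pdfs(folder):
    """Return the sorted names of the files in folder that end in .pdf (any case)."""
    return sorted(
        name for name in os.listdir(folder)
        if name.lower().endswith('.pdf')
        and os.path.isfile(os.path.join(folder, name))
    )

# Build a small demo folder with a mix of files.
demo = tempfile.mkdtemp()
for name in ['b.pdf', 'a.PDF', 'notes.txt', '.DS_Store']:
    open(os.path.join(demo, name), 'w').close()

pdf_names = list_pdfs(demo)
```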

We can generate the full paths to these files using the ``os.path.join()`` function.

In [20]:
base_path = '../data/PDFs/'
for filename in os.listdir(base_path):
    print os.path.join(base_path, filename)

../data/PDFs/example2.pdf
../data/PDFs/example3.pdf
../data/PDFs/example4.pdf
../data/PDFs/example5.pdf


So now we want to apply our procedure to each of these files: 
* open the file, 
* extract the text with ``slate``,
* join the pages together into a single string,
* create a new text file,
* and write the string into that file.

It's a good idea to create a new folder to hold the new text files. The folder must exist before we write into it; the cell below only records its path.

In [23]:
text_basepath = '../data/PDFs_extracted'
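If the folder does not exist yet, it can be created from Python as well. The following is a minimal sketch; the helper name ``ensure_folder`` is an assumption for illustration, and the demo uses a temporary location rather than ``../data/PDFs_extracted``.

```python
import os
import tempfile

def ensure_folder(path):
    """Create path (and any missing parent folders) unless it already exists."""
    if not os.path.isdir(path):
        os.makedirs(path)
    return path

demo_root = tempfile.mkdtemp()
demo_target = ensure_folder(os.path.join(demo_root, 'PDFs_extracted'))
ensure_folder(demo_target)   # calling it a second time is harmless
```

In the notebook, the equivalent call would be ``ensure_folder(text_basepath)``.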

Now the action.

In [28]:
basepath = '../data/PDFs/'   # Folder containing PDF files.
for filename in os.listdir(basepath):
    filepath = os.path.join(basepath, filename)
    # Open the file in binary mode, since a PDF is not plain text.
    with codecs.open(filepath, 'rb') as f:
        # Extract the text.
        extracted_text = slate.PDF(f)
    
    # Join the pages, and decode.
    joined_text = "\n\n".join(extracted_text).decode('utf-8')
    
    # Figure out where to put the new text file (swap only the file extension).
    textpath = os.path.join(text_basepath, os.path.splitext(filename)[0] + '.txt')
    
    # Open (create) the new text file
    with codecs.open(textpath, 'w', encoding='utf-8') as f:
        # Write the string to the file.
        f.write(joined_text)
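With a larger collection, a single unreadable PDF would stop the loop above partway through. One possible refinement, sketched here under stated assumptions, wraps the same steps in a function that skips non-PDF files and records failures instead of halting. The names ``convert_folder`` and ``fake_extractor`` are hypothetical; ``fake_extractor`` merely stands in for ``slate.PDF`` so that the sketch runs without slate installed, and it returns byte strings as slate does in the output shown earlier.

```python
import io
import os
import tempfile

def convert_folder(pdf_folder, text_folder, extract_pages):
    """Convert every .pdf in pdf_folder; return (converted, failed) filename lists."""
    converted, failed = [], []
    for filename in sorted(os.listdir(pdf_folder)):
        stem, extension = os.path.splitext(filename)
        if extension.lower() != '.pdf':
            continue                      # ignore anything that is not a PDF
        try:
            with open(os.path.join(pdf_folder, filename), 'rb') as f:
                pages = extract_pages(f)
            text = u"\n\n".join(page.decode('utf-8') for page in pages)
        except Exception:                 # deliberately broad: log and move on
            failed.append(filename)
            continue
        textpath = os.path.join(text_folder, stem + '.txt')
        with io.open(textpath, 'w', encoding='utf-8') as out:
            out.write(text)
        converted.append(filename)
    return converted, failed

def fake_extractor(f):
    """Stand-in for slate.PDF: split the raw bytes into 'pages' at form feeds."""
    data = f.read()
    if not data:
        raise ValueError('empty file')
    return data.split(b'\x0c')

# Demo with one usable file and one empty (unreadable) file.
pdf_dir, txt_dir = tempfile.mkdtemp(), tempfile.mkdtemp()
with open(os.path.join(pdf_dir, 'good.pdf'), 'wb') as f:
    f.write(b'page one\x0cpage two')
open(os.path.join(pdf_dir, 'bad.pdf'), 'wb').close()

converted, failed = convert_folder(pdf_dir, txt_dir, fake_extractor)
```

In the notebook, the call would presumably look like ``convert_folder(basepath, text_basepath, slate.PDF)``; the ``failed`` list then tells you which files to inspect by hand.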

Now we can try to open the corpus with NLTK.

In [29]:
import nltk

In [30]:
documents = nltk.corpus.PlaintextCorpusReader(text_basepath, r'.*\.txt')

In [35]:
whoops = nltk.Text(documents.words())

In [46]:
whoops.concordance('nothing')

Displaying 25 of 33 matches:
lass and a cousin , a spectacle and nothing strange a single hurt color and an 
places not empty . They see cover . NOTHING ELEGANT . A charm a single charm is
A kind of green a game in green and nothing ﬂ at nothing quite ﬂ at and more ro
en a game in green and nothing ﬂ at nothing quite ﬂ at and more round , nothing
nothing quite ﬂ at and more round , nothing a particular color strangely , noth
hing a particular color strangely , nothing breaking the losing of no little pi
between the circular side place and nothing else , nothing else . To choose it 
cular side place and nothing else , nothing else . To choose it is ended , it i
show it , that it shows it and that nothing , that there is nothing , that ther
it and that nothing , that there is nothing , that there is no more to do about
 , best to make the length tall and nothing broader , anything between the half
es and this necessarily spread into nothing . Spread into nothing . A FIRE . Wh
y spread in