# Natural language processing part 1: 
# Parsing PDFs

## Lecture objectives
* Learn how to load in text data from PDFs
* Use `regex` to clean and simplify text data

In subsequent notebooks, we'll do the actual text analysis such as topic modeling and sentiment analysis.

## Before you start
There is another Python package to install. You can do this from within the Anaconda GUI, but it's easier from the command line as follows:

`conda activate uds`

`conda install pdfminer.six --channel=conda-forge`

## Getting the text into Python
Before we process any text, we need to take a step back and figure out how to get that text into Python. Typically, plans and other policy documents come as PDFs, which are a pain to read. There are dozens of PDF readers for Python, all of which are flawed in different ways. (See some discussions [here](https://towardsdatascience.com/how-to-extract-text-from-pdf-245482a96de7), [here](https://johannesfilter.com/python-and-pdf-a-review-of-existing-tools/) and [here](https://stackoverflow.com/questions/34837707/how-to-extract-text-from-a-pdf-file).) We'll use `pdfminer.six`, which is fairly robust and is easier to install than some alternatives. YMMV.

Let's start with an adaptation of the [LA Times analysis of California High-Speed Rail](https://github.com/datadesk/hsr-document-analysis). If you look at their code, they use the `urllib` library to download files. You can do the same but with a couple of extra steps using `requests`. 
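
If you do want to download files yourself, here is a minimal sketch using only the standard library's `urllib.request`. It is not the LA Times code; the helper name `download_file` and the skip-if-present behavior are assumptions made for illustration, and some servers may require extra request headers that this sketch does not set.

```python
import shutil
from pathlib import Path
from urllib.request import urlopen


def download_file(url, dest):
    """Download url to the local path dest, skipping files that already exist."""
    dest = Path(dest)
    if dest.exists():
        return dest  # avoid re-downloading on every run
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Stream the response to disk rather than holding the whole file in memory
    with urlopen(url) as response, open(dest, "wb") as f:
        shutil.copyfileobj(response, f)
    return dest
```

You would call it as `download_file(some_url, 'data/some_file.pdf')`, where both arguments are placeholders for your own URL and destination path.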

But for now, let's work with just one of their files: [the EIR section on air quality and climate change, for the Bakersfield to Palmdale segment](https://hsr.ca.gov/wp-content/uploads/docs/programs/bakersfield-palmdale/BP_Draft_EIRS_Vol_1_CH_3.3_Air_Quality_and_Global_Climate_Change.pdf). It's in your git repository.

We'll read the text using `pdfminer.six`. Its simplest function is `extract_text`. Full documentation is [here]

Note that you will often have to experiment with other PDF parsers if you get unintelligible results. `PyPDF2` is another commonly used package.

In [1]:
from pdfminer.high_level import extract_text

fn = 'data/BP_Draft_EIRS_Vol_1_CH_3.3_Air_Quality_and_Global_Climate_Change.pdf'
eirtext = extract_text(fn)
print('Text is {} characters long'.format(len(eirtext)))

Text is 465131 characters long
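
Because a parser can return garbage without raising an error, a few rough statistics can flag problems before you read the output by eye. This is only a heuristic sketch: the function name `summarize_extraction` and the choice of statistics are assumptions, not part of `pdfminer.six`.

```python
def summarize_extraction(text):
    """Return a few rough statistics about a string of extracted text."""
    n_chars = len(text)
    n_letters = sum(ch.isalpha() for ch in text)
    n_whitespace = sum(ch.isspace() for ch in text)
    return {
        "characters": n_chars,
        "words": len(text.split()),
        # Shares are 0.0 for an empty string to avoid dividing by zero
        "share_letters": n_letters / n_chars if n_chars else 0.0,
        "share_whitespace": n_whitespace / n_chars if n_chars else 0.0,
    }


# Made-up sample; in the notebook you would pass eirtext instead
print(summarize_extraction("Emission inventory data\nfor the AVAQMD"))
```

A very low share of letters, or very few words relative to the number of characters, would be a hint to try a different parser.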


Let's look at a few random extracts. We read the file into a string, so we can use our standard string slicing syntax.

For example, let's look at 1,000 characters starting just past the 200,000th character.

In [2]:
print(eirtext[200001:201001])

y 

Bakersfield to Palmdale Project Section Draft Project EIR/EIS  

February 2020 

 Page | 3.3-55 

 
 
Section 3.3  Air Quality and Global Climate Change 

Antelope Valley Air Quality Management District 

Emission inventory data for the AVAQMD for 2012 are summarized in Table 3.3-15. In the 
AVAQMD, mobile-source emissions account for over 91 percent and 69 percent of the CO and 
NOX emissions inventory, respectively. Area sources made up over 55 percent of the particulate 
emissions, whereas stationary sources made up 45 percent of particulate emissions. Mobile 
sources were 64 percent of the SOX emissions. Stationary sources made up 43 percent of the 
areawide ROG emissions.  

Table 3.3-15 Estimated Annual Average Emissions for the Antelope Valley Air Quality 
Management District (tons per day)  

Source Category 

TOG 

ROG 

CO 

NOX 

SOX 

Particulate 
Matter 

PM10 

PM2.5 

Stationary Sources 

Fuel Combustion 

Waste Disposal 

Cleaning and Surface Coatings 

Petroleum P

And another slice.

In [3]:
print(eirtext[400001:401001])

PM2.5 hot-spot analysis, regardless of a medium or high ridership scenario. In December 
2010, the USEPA released its Transportation Conformity Guidance for Quantitative Hot-spot 
Analyses in PM2.5 and PM10 Nonattainment and Maintenance Areas (USEPA 2015b), which was 
used for this analysis. Although this analysis is normally associated with the Transportation 
Conformity Rule, this project is subject to the General Conformity Rule. The decision to use this 
analytical structure notwithstanding, additional analysis or associated activities required to comply 
with Transportation Conformity will be carried out only if discrete project elements become 
subject to those requirements in the future. In accordance with this guidance, if a project meets 
one of the following criteria, it is considered a project of air quality concern and a quantitative 
PM10/PM2.5 analysis is required.  

•  New or expanded highway projects that have a significant number of or significant 
increase in diesel 

## Cleaning up the text
So we've got a bunch of text in, but clearly the formatting leaves something to be desired. In particular, there are a lot of random line breaks. Let's use `regex` to convert each run of whitespace (spaces, tabs (`\t`), and newlines (`\n` or `\r\n`)) to a single space.

`regex` is short for "regular expression," and is essentially a pattern matching tool for text. Think of it as a souped-up version of `replace`. 

`regex` is extremely powerful and has an extremely unfriendly syntax. But there are thousands of examples online. [Here's a good place to start](https://regexone.com/) if you want to explore more. And [this website](https://regex101.com) helps you test and debug your expressions.

Let's look at an example – `r"\s+"`:
- The `r` tells Python that what follows is a "raw string," and thus the `\` character should be interpreted literally
- `\s` matches whitespace
- `+` matches one or more occurrences of the preceding element

So basically, we are matching all whitespace, however long.
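
To see what the pattern actually picks up, `re.findall` returns every matched run of whitespace. The sample string here is made up for illustration.

```python
import re

# A tab, then five spaces, then a newline
sample = "HSR\tis" + " " * 5 + "an\nexample"

# Each element of the list is one run of consecutive whitespace characters
matches = re.findall(r"\s+", sample)
print([len(m) for m in matches])  # → [1, 5, 1]
```

The five consecutive spaces come back as a single match, which is why one substitution can collapse them to one space.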

Let's then use `re.sub` to replace that whitespace. The first argument is the pattern to match. The second argument is what we replace our matched substrings with. The third argument is the string to apply the substitution to. Note that the example string below contains some spaces, some tabs (`\t`), and some newlines (`\n`).

In [4]:
import re
print(re.sub(r"\s+", " ", "HSR\tis     an\nexpensive    boondoggle"))

HSR is an expensive boondoggle


If we omit the `+` and just specify `r"\s"`, each whitespace character is matched and replaced on its own, so we don't match multiple occurrences at once. So 4 spaces are replaced with 4 spaces, rather than a single space. But the tabs and newlines are still converted to spaces.

In [5]:
print(re.sub(r"\s", " ", "HSR\twill     \ntransform     California"))

HSR will      transform     California


I won't pass judgment on the content of either of these claims.

Let's apply the `regex` to the text that we pulled out of the EIR:

In [6]:
eirtext = re.sub(r"\s+", " ", eirtext)
print(eirtext[200001:201001])

-Speed Rail Authority Bakersfield to Palmdale Project Section Draft Project EIR/EIS Section 3.3 Air Quality and Global Climate Change Figure 3.3-3 Sensitive Receptors within the High-Speed Rail Project Vicinity California High-Speed Rail Authority Bakersfield to Palmdale Project Section Draft Project EIR/EIS (Sheet 1 of 11) February 2020 Page | 3.3-59 Section 3.3 Air Quality and Global Climate Change Figure 3.3-3 Sensitive Receptors within the High-Speed Rail Project Vicinity (Sheet 2 of 11) February 2020 3.3-60 | Page California High-Speed Rail Authority Bakersfield to Palmdale Project Section Draft Project EIR/EIS Section 3.3 Air Quality and Global Climate Change Figure 3.3-3 Sensitive Receptors within the High-Speed Rail Project Vicinity California High-Speed Rail Authority Bakersfield to Palmdale Project Section Draft Project EIR/EIS (Sheet 3 of 11) February 2020 Page | 3.3-61 Section 3.3 Air Quality and Global Climate Change Figure 3.3-3 Sensitive Receptors within the High-Speed R

In [None]:
print(eirtext[400001:401001])

We can also use `regex` to get rid of punctuation, digits, etc. 

Here:
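* `[]` means match any single character listed within the brackets
* `^` (at the start of the brackets) means not
* `A-z` is intended to match any letter in any case. Strictly, this ASCII range also includes the six characters that sit between `Z` and `a` (`[`, `\`, `]`, `^`, `_` and `` ` ``), so `A-Za-z` is the more precise way to write it
* `\s` is any whitespace (which is just spaces, since we converted other whitespace like tabs to spaces)

So `[^A-z\s]` captures anything that is not a letter or whitespace (or one of those six extra characters).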
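
A small made-up example shows how `A-z` and the stricter `A-Za-z` differ on characters such as `_` and `[`. The sample string and variable names are illustrative only.

```python
import re

sample = "file_name [draft] costs $5^2"

# A-z spans ASCII 65-122, which includes [ \ ] ^ _ ` as well as letters
loose = re.sub(r"[^A-z\s]", "", sample)

# A-Za-z lists the two letter ranges explicitly
strict = re.sub(r"[^A-Za-z\s]", "", sample)

print(repr(loose))   # → 'file_name [draft] costs ^'
print(repr(strict))  # → 'filename draft costs '
```

If stray underscores or brackets show up in your cleaned text, swapping in `A-Za-z` is the likely fix.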

Since we might want the punctuation at a later date, let's assign our cleaned text to a new variable, `eirtext_wordsonly`.

In [7]:
eirtext_wordsonly = re.sub(r"[^A-z\s]", "", eirtext)
eirtext_wordsonly[400001:401001]

'd avoid microscale CO impacts and localized PMPM hotspot impacts It is also anticipated that all of the BP Build Alternatives would avoid localized air quality impacts to sensitive receptors including schools All of the BP Build Alternatives would also avoid impacts related to other emissions such as those leading to odors adversely affecting a substantial number of people All of the BP Build Alternatives would avoid impacts related to compliance with applicable air quality plans during project operation and would result in anticipated net reduction in criteria pollutant and GHG emissions within the SJVAB and the MDAB All of the BP Build Alternatives would avoid cumulative impacts during project operation and would result in anticipated net reduction in criteria pollutant and GHG emissions within the SJVAB and the MDAB  CEQA Significance Conclusions Table  provides a summary of the CEQA determination of significance for all construction and operations impacts discussed in Section  If 

Notice that removing digits and punctuation means that we now have extra spaces. For example, `Table 3.3-46 provides a summary` becomes `Table  provides a summary`.

So let's use our same process from before to remove duplicate spaces.

In [8]:
eirtext_wordsonly = re.sub(r"\s+", " ", eirtext_wordsonly)
eirtext_wordsonly[394890:395200]

'CEQA Significance Conclusions Table provides a summary of the CEQA determination of significance for all construction and operations impacts discussed in Section If there are differences in impacts before or after mitigation among the BP Build Alternatives these are noted in the table Where there is no differ'

This looks much better! We now have some clean text to analyze.
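
If you expect to repeat these steps on other documents, they can be wrapped in one function. This is a sketch rather than the notebook's own code: the name `clean_text` and the `words_only` flag are hypothetical, and it uses the explicit letter ranges `A-Za-z` rather than `A-z`.

```python
import re


def clean_text(text, words_only=False):
    """Collapse whitespace; optionally keep only letters and spaces."""
    if words_only:
        # Drop digits, punctuation, and anything else that is not a letter or whitespace
        text = re.sub(r"[^A-Za-z\s]", "", text)
    # Collapse whitespace last, so gaps left by removed characters are closed too
    return re.sub(r"\s+", " ", text)


sample = "Table 3.3-46\n provides  a summary."
print(clean_text(sample))                   # → Table 3.3-46 provides a summary.
print(clean_text(sample, words_only=True))  # → Table provides a summary
```

Doing the whitespace substitution after the character removal means a single pass handles both the original line breaks and the new double spaces.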

Let's pause here. We'll save the text to a file, so that we can load it in at the start of the next lecture.

Note here that `open` opens the file and returns the file object `f`. We then write `eirtext` to the file. The `with` syntax helps because it automatically closes the file afterwards.

In [9]:
with open('eirtext.txt', 'w') as f:
    f.write(eirtext)
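
Reading the file back in mirrors the write step. The sketch below is an assumption about how you might do it, not code from the next lecture: the helper names `save_text` and `load_text` are hypothetical, and passing `encoding="utf-8"` explicitly is a suggestion so that characters like the bullet `•` survive regardless of the platform default.

```python
def save_text(text, path):
    """Write text to path as UTF-8."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


def load_text(path):
    """Read the whole file at path back into a single string."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()
```

If you write with an explicit encoding, read with the same one; mixing an explicit encoding with the platform default is a common source of garbled characters.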

<div class="alert alert-block alert-info">
<h3>Key Takeaways</h3>
<ul>
  <li>PDFs are difficult to work with. pdfminer is a good starting point, but make sure to inspect your output.</li>
  <li>regex is a powerful tool to clean up text, e.g. removing whitespace and punctuation.</li>
</ul>
</div>