# Weeks 5 and 6. Natural language processing

## Part 1. Loading, cleaning, and processing the text
In this notebook, you'll learn to:
* Load in text data from PDFs
* Use `regex` to clean and simplify text data
* Do simple word counts
* Tokenize (split) text into words and sentences

In subsequent notebooks, we'll do the actual text analysis such as topic modeling and sentiment analysis.

You'll need to install the `nltk` and `PyPDF2` libraries for this module. `nltk` is the standard for processing text, with functionality to eliminate stopwords, split text into words and sentences, etc. `PyPDF2` reads PDF files. 

From the Terminal (Mac) or Anaconda Prompt (Windows):
   
1. Switch to your environment (skip this if you don't have a separate environment set up for class)

Windows:
`activate your_environment_name` 

Mac:
`conda activate your_environment_name` 

2. Install

`conda install nltk PyPDF2 -c conda-forge`
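
Once the install finishes, it can be worth confirming that Python can actually see both packages before going further. The snippet below is a minimal sketch of one way to check, using only the standard library. `is_installed` is a helper name made up for this example, not part of either library.

```python
import importlib.util


def is_installed(package_name):
    """Return True if Python can find the named top-level package."""
    return importlib.util.find_spec(package_name) is not None


# report on the two packages this notebook relies on
for package_name in ("nltk", "PyPDF2"):
    status = "found" if is_installed(package_name) else "NOT found"
    print(package_name, status)
```

If either package is reported as not found, the most likely cause is that the notebook is running in a different environment from the one you installed into.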

## Getting the text into Python
Before we process any text, we need to take a step back and figure out how to get that text into Python. Typically, plans and other policy documents come as PDFs, which are a pain to read. There are dozens of PDF readers for Python, all of which are flawed in different ways. (See some discussions [here](https://towardsdatascience.com/how-to-extract-text-from-pdf-245482a96de7), [here](https://johannesfilter.com/python-and-pdf-a-review-of-existing-tools/) and [here](https://stackoverflow.com/questions/34837707/how-to-extract-text-from-a-pdf-file).) We'll use `PyPDF2`, which is fairly robust and is easier to install than some alternatives. YMMV.

Let's start with an adaptation of the [LA Times analysis of California High-Speed Rail](https://github.com/datadesk/hsr-document-analysis). They use the `urllib` library to download files. You can do the same but with a couple of extra steps using `requests`. 

But for now, let's just download one file manually: [the EIR section on air quality and climate change, for the Bakersfield to Palmdale segment](https://hsr.ca.gov/wp-content/uploads/docs/programs/bakersfield-palmdale/BP_Draft_EIRS_Vol_1_CH_3.3_Air_Quality_and_Global_Climate_Change.pdf). Save it to your computer.
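
If you later want to script the download rather than clicking, something like the following might work. It is a minimal sketch using the standard library's `urllib`, not the LA Times code. `download_file` is a made-up helper name, and the example call is commented out and has not been tested against the HSR website.

```python
import shutil
import urllib.request
from pathlib import Path


def download_file(url, dest_folder):
    """Save the file at `url` into `dest_folder`, keeping its original file name."""
    dest_folder = Path(dest_folder)
    dest_folder.mkdir(parents=True, exist_ok=True)

    # use the last part of the URL as the file name
    file_name = url.rsplit("/", 1)[-1]
    dest = dest_folder / file_name

    # stream the response straight to disk
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out_file:
        shutil.copyfileobj(response, out_file)
    return dest


# example call (commented out so that nothing is downloaded by accident)
# url = "https://hsr.ca.gov/wp-content/uploads/docs/programs/bakersfield-palmdale/BP_Draft_EIRS_Vol_1_CH_3.3_Air_Quality_and_Global_Climate_Change.pdf"
# download_file(url, "/Users/adammb/Desktop/")
```

Returning the destination path makes it easy to pass the result straight to whatever opens the file next.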

In [140]:
import PyPDF2
path = '/Users/adammb/Desktop/'  # change this to the path where you downloaded the EIR file
fn = 'BP_Draft_EIRS_Vol_1_CH_3.3_Air_Quality_and_Global_Climate_Change.pdf'

f = open(path + fn, 'rb') # rb means "read binary"
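# note: PyPDF2 3.0 and later rename PdfFileReader to PdfReader,
# getNumPages() to len(pdf.pages), and extractText() to extract_text()
pdf = PyPDF2.PdfFileReader(f)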

Explore what you can do with the `pdf` object

In [141]:
type(pdf)

PyPDF2.pdf.PdfFileReader

In [142]:
pdf.getNumPages()

148

In [143]:
# This is pretty comprehensive, but personally I find it easier to look for examples on the web
help(pdf)

Help on PdfFileReader in module PyPDF2.pdf object:

class PdfFileReader(builtins.object)
 |  
 |  Initializes a PdfFileReader object.  This operation can take some time, as
 |  the PDF stream's cross-reference tables are read into memory.
 |  
 |  :param stream: A File object or an object that supports the standard read
 |      and seek methods similar to a File object. Could also be a
 |      string representing a path to a PDF file.
 |  :param bool strict: Determines whether user should be warned of all
 |      problems and also causes some correctable problems to be fatal.
 |      Defaults to ``True``.
 |      ``sys.stderr``).
 |      ``True``).
 |  
 |  Methods defined here:
 |  
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  cacheGetIndirectObject(self, generation, idnum)
 |  
 |  cacheIndirectObject(self, generation, idnum, obj)
 |  
 |  decrypt(self, password)
 |      When using an encrypted / secured PDF file with the PDF Standard
 |      encryp

We have to go page by page to extract the text. Since we don't really care about what material is on which page, let's just put all the text into a big string.

In [144]:
eirtext = ''
for page in pdf.pages:
    eirtext += page.extractText() 
print('Text is {} characters long'.format(len(eirtext)))

# remember to close the file, now we are done with it
f.close()

Text is 422597 characters long


In [145]:
# look at a few random extracts
print(eirtext[200000:201000])
print(eirtext[400000:401000])

Antelope Valley
1 
100 100 100 50 Penny Lane Family Center Clinic
2 
1,000
 1,000
 1,000
 Œ Penny Lane Centers
2 
1,000
 1,000
 1,000
 Œ 
1 
100 100 100 50 
1 
500 500 500 450 R. Rex Pa
r
1 
600 600 600 600 Doctor Robert C
. St. Clair Parkway
1 
200 200 200 200 1 Receptor type: youth, cultural, and educational facility
 
2 Receptor type: 
health
-
care facility
 
3 Receptor type: hospital
 
4 Receptor type: miscellaneous
 
  
Section 
3.3
 
Air Quality and Global Climate Change
  
 Cal
ifornia High
-
Speed Rail Authority
 
February 2020
 
Bakersfield to Palmdale Project Section Draft Project EIR/EIS 
  
Page | 
3.3
-
59
  Figure 
3.3
-3 
Sensitive Receptors 
within 
the 
High
-
Speed Rail 
Project Vicinity
 
(Sheet 1 of 11)
 
Section 
3.3
 
Air Quality and Global Climate Change
 
 February 2020
 
Cal
ifornia High
-
Speed Rail
 
Authority
 
3.3
-
60
 | Page
  
Bakersfiel
d to Palmdale Project Section Draft Project EIR/EIS
 
 Figure
 
3.
3-3 Sensitive Receptors 
within 
the 
High
-
Speed

## Cleaning up the text
So we've got a bunch of text in, but clearly the formatting leaves something to be desired. In particular, there are a lot of random line breaks. Let's use `regex` to convert all whitespace (spaces, tabs (`\t`), and newlines (`\n` or `\r\n`)) to a single space.

`regex` is short for "regular expression," and is essentially a pattern matching tool for text. Think of it as a souped-up version of `replace`. 

`regex` is extremely powerful and has an extremely unfriendly syntax. But there are thousands of examples online. [Here's a good place to start](https://regexone.com/) if you want to explore more. And [this website](https://regex101.com) helps you test and debug your expressions.
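
To make the "souped-up `replace`" idea concrete, here is a small sketch on made-up strings (they are not taken from the EIR). It compares plain `str.replace` with `re.sub`, and shows that the same pattern language can also be used to find things.

```python
import re

messy = "PM  2.5   and PM 10"

# str.replace swaps one exact piece of text for another,
# so a run of three spaces still leaves a double space behind
print(messy.replace("  ", " "))

# re.sub matches a pattern: here, one or more spaces in a row
print(re.sub(r" +", " ", messy))

# patterns can also be used for searching, e.g. for four-digit numbers
print(re.findall(r"\b\d{4}\b", "Report dated 2020, revised 2021"))
```

The pattern is the only thing that changes between tasks, which is why it pays to learn a handful of the building blocks.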

In [146]:
import re
# The "r" tells Python that what follows is a "raw string," and thus the \ character should be interpreted literally
# \s matches any whitespace character
# + matches one or more occurrences in a row
# so basically, we are replacing each run of whitespace, however long, with a single space

# compare the following
print(re.sub(r"\s+", " ", "HSR\tis     an\nexpensive    boondoggle"))

# no + sign (so each whitespace character is replaced one at a time, and runs of spaces are not collapsed)
print(re.sub(r"\s", " ", "HSR\twill     \ntransform     California"))

HSR is an expensive boondoggle
HSR will      transform     California


I won't pass judgment on the content of either of these claims.

Let's apply the `regex` to our text that we pulled out of the EIR

In [147]:
eirtext = re.sub(r"\s+", " ", eirtext)
print(eirtext[200000:201000])
print(eirtext[400000:401000])

ion Draft Project EIR/EIS Table 3.3 -18 Bakersfield to Palmdale Project Section Construction Regional Emissions ŠTotal (Tons/Construction Duration ) Alternative Emissions 1 VOC CO NOX SO2 PM102 PM2.5 2 Alternative 1 196 3,089 1,834 25 164 104 Alternative 1 with Refined CCNM 197 3,094 1,892 25 165 105 Alternative 2 191 3,471 1,860 21 167 99 Alternative 2 with Refined CCNM 191 3,476 1,918 21 168 100 Alternative 3 187 3,089 1,843 21 123 92 Alternative 3 with Refined CCNM 188 3,094 1,901 21 124 93 Alternative 5 212 3,997 2,062 21 137 109 Alternative 5 with Refined CCNM 213 4,002 2,120 21 137 109 Source: California High - Speed Rail Authority , 201 9 1 2 The PM 10 and PM 2.5 emissions CO = carbon monoxide NOX s PM 2.5 = particulate matter PM 10 = particulate matter RTP SO 2 = sulfur dioxide Details of emissions from the four B- P Build Alternatives (including the CCNM Design Option, the Refined CCNM Option and the portion of the F - B LGA alignment from the intersection of 34th Street and L

We can also use `regex` to get rid of punctuation, digits, etc. 

Here:
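* `[]` means match any one of the characters within the brackets
* `^` (at the start of the brackets) means not
* `A-Za-z` is any letter in any case (the shorter `A-z` is often used, but that range also takes in the six ASCII characters that sit between `Z` and `a`, such as `[`, `]`, and `_`)
* `\s` is any whitespace (which is just spaces, since we converted other whitespace like tabs to spaces)

So `[^A-Za-z\s]` captures anything that is not a letter or whitespace.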

Since we might want the punctuation at a later date, let's assign the result to a new variable, `eirtext_wordsonly`.
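
To see exactly what a negated character class removes, it helps to try it on a short made-up string first. The sketch below compares two ways of writing the letter range. The sample string is invented for illustration.

```python
import re

sample = "PM_2.5 [draft] costs $100"

# A-z spans every ASCII character from "A" to "z",
# which includes [ \ ] ^ _ and ` as well as the letters
print(repr(re.sub(r"[^A-z\s]", "", sample)))

# A-Za-z spells out the two letter ranges, so only letters and whitespace survive
print(repr(re.sub(r"[^A-Za-z\s]", "", sample)))
```

Wrapping the result in `repr` makes leftover spaces visible, which is useful when checking clean-up steps like this one.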

In [149]:
# A-Za-z rather than A-z, because the A-z range also matches [ \ ] ^ _ and `
eirtext_wordsonly = re.sub(r"[^A-Za-z\s]", "", eirtext)
eirtext_wordsonly[100000:101000]

' Guidance  CFR Part   These data were found to be the most representative of the conditions existing in the project vicinity  Peak  hour concentrations of CO were obtained by multiplying the highest peak  hour CO estimates by a persistence factor The persistence factor accounts for the following Over an  hour period as distinct from a single hour vehicle volumes wil l fluctuate downward from the peak hour Vehicle speeds may vary Meteorological conditions including wind speed and wind direction will vary compared with the conservative assumptions used for the single hour Section  Air Quality and Global Climate Change Cal ifornia High  Speed Rail Authority February  Bakersfield to Palmdale Project Section Draft Project EIREIS Page     A persistence factor of  was used in this analysis as recommended in the CO Protocol Caltrans  Microscale modeling is used to predict CO concentrations resulting f rom emissions from motor vehicles using roadways immediately adjacent to the locations at wh

In [150]:
# removing some digits, etc. means that we now have extra spaces
# e.g. "there are 1000 tons" becomes "there are  tons"
# so let's use our same process from before
eirtext_wordsonly = re.sub(r"\s+", " ", eirtext_wordsonly)
eirtext_wordsonly[100000:101000]

' Union Avenue monitoring station for the Bakersfield Station were used The use of these monitoring stations is conservative because while they are the closest monitors to the general study area and have a neighborhood spatial scale they are influenced by traffic related emissions The second highest monitored value was used as a background concentration In addition future CO background levels are anticipated to be lower than existing levels because of mandated emission source reductions The second highest monitored values were used as background concentrations The second highest monitored hour CO concentration based on the years to was ppm for the S Union Avenue monitoring station The second highest hour average was ppm at the S Union Avenue monitoring station Traffic data for the air quality analysis were derived from traffic counts and other information developed as part of an overall traffic analysis for the HSR project refer to Section Transportation of this document The microscale

This looks much better! We can now do something simple like count the number of words.

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> Conceptually, how would you create a dataframe with the counts of the number of occurrences of each word? Write some code if you can, but the most important thing is to think through the steps.
</div>

In [151]:
# solution
import pandas as pd

# put this in a function so we can use it again later on
def countWords(wordlist):
    counts = {} # a dictionary to hold the counts
    for word in wordlist:
        lword = word.lower() # convert to lowercase
        if lword in counts:
            counts[lword] +=1
        else:
            # doesn't exist in the dictionary
            counts[lword] = 1

    # convert the dictionary to a dataframe
    # https://stackoverflow.com/questions/18837262/convert-python-dict-into-a-dataframe
    df = pd.DataFrame.from_dict(counts, orient='index', columns=['word_count'])

    # sort it by the word_count column
    df.sort_values('word_count', ascending=False, inplace=True)
    df.index.name = 'word'
    return df

# create a list of all the words, through splitting on the spaces
wordlist = eirtext_wordsonly.split()
df = countWords(wordlist)

In [152]:
df.head(10)

word,word_count
the,3721
and,1910
of,1545
to,1332
no,1256
in,899
for,888
emissions,828
project,782
would,758
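
The loop-and-dictionary approach is worth writing out once, but the standard library can do the same counting for you. Here is a minimal sketch using `collections.Counter` on a made-up sentence (not the EIR text).

```python
from collections import Counter

text = "The project emissions and the construction emissions exceed the threshold"

# lowercase each word, then let Counter tally them
word_counts = Counter(word.lower() for word in text.split())

# most_common(n) returns the n most frequent words with their counts
print(word_counts.most_common(2))
```

If you want a dataframe at the end, something like `pd.DataFrame(word_counts.most_common(), columns=['word', 'word_count'])` should give a similar table to the one produced by `countWords`.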


## Processing text
I guess it's good that an EIR section on air quality mentions emissions. But the other words aren't particularly informative. This type of analysis might be useful in some applications, but here, we really need to push further.

Let's use the `nltk` library to get rid of the little words like "the," "for," etc. These are called *stop words* in natural language processing jargon. 

In [167]:
# nltk is a mammoth library, and has lots of submodules
# we'll use the tokenize and stopwords for now
import nltk  # needed for the nltk.download() calls below
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

# The first time we use them, we have to download the "corpus". 
# If you don't do this, you'll get a helpful error message reminding you of this
# See http://www.nltk.org/nltk_data/ for all the corpora that you can download
nltk.download('stopwords')
nltk.download('punkt')


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/adammb/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/adammb/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/adammb/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [154]:
# let's take a look. The stopwords.words just gives us a list of words
stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [158]:
# in several languages too
print(stopwords.words('spanish')[:10])
print(stopwords.words('arabic')[:10])

['de', 'la', 'que', 'el', 'en', 'y', 'a', 'los', 'del', 'se']
['إذ', 'إذا', 'إذما', 'إذن', 'أف', 'أقل', 'أكثر', 'ألا', 'إلا', 'التي']


In [None]:
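# but with limitations: not every language is included, so depending on your version
# of the stopwords corpus, this may raise an error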
print(stopwords.words('chinese')[:10])

In [190]:
# sent_tokenize splits into sentences
sent_tokenize(eirtext)   # note I used the original eirtext, not the version with punctuation stripped out

[' Section 3.3 Air Quality and Global Climate Change Cal ifornia High - Speed Rail Authority February 2020 Bakersfield to Palmdale Project Section Draft Project EIR/EIS Page | 3.3 -1 3.3 Air Quality and Global Climate Change This section provides an analysis of air quality and global climate change associated with the Bakersfield to Palmdale Project Section (B-P) of the California High - Speed Rail (HSR) System .',
 'Air Quality and Global Climate Change The Clean Air Act (CAA) is the comprehensive federal law that regulates air emissions from stationary and mobile sources.',
 'This law authorizes the U.S. Environmental Protection Agency ( US EPA) to establish National Ambient Air Quality Standards (NAAQS) to pr otect public health and public welfare and to regulate emission s of hazardous air pollutants.',
 'California has also implemented state - specific clean air requirements in order to protect the health and welfare of California citizens.',
 'Summary of Results Project construct
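
Sentence splitting looks easy until you meet abbreviations. As a rough illustration of why a purpose-built tokenizer is worth using, here is a naive splitter built from a single regular expression. The sample text is adapted from the extract above, and `naive_sentences` is a name made up for this sketch.

```python
import re


def naive_sentences(text):
    """Split wherever ., ! or ? is followed by whitespace (too simple for real use)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


sample = ("This law authorizes the U.S. Environmental Protection Agency to set standards. "
          "California has also implemented its own requirements.")

for sentence in naive_sentences(sample):
    print(repr(sentence))
```

The naive version returns three pieces instead of two, because it breaks after "U.S.". In the `sent_tokenize` output above, that abbreviation stays inside its sentence.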

In [161]:
# for our purposes, we want to split into words
wordlist = word_tokenize(eirtext_wordsonly)
df = countWords(wordlist)
df.head()

word,word_count
the,3721
and,1910
of,1545
to,1332
no,1257


<div class="alert alert-block alert-info">
<strong>Exercise:</strong> How would you exclude the stopwords from this dataframe? You could either adapt the `countWords` function, or filter the returned dataframe.
</div>

In [165]:
# solution
# method 1: modify the function
def countWords_nostop(wordlist):
    
    # this might be overkill, but we probably want a version of stopwords that excludes the punctuation  
    # this is a list comprehension - a shorthand for a loop
    swords = [re.sub(r"[^A-Za-z\s]", "", sword) for sword in stopwords.words('english')]
    
    # in longhand, that was
    # swords = []
    # for sword in stopwords.words('english'):
    #     swords.append(re.sub(r"[^A-Za-z\s]", "", sword))
    
    counts = {} # a dictionary to hold the counts
    for word in wordlist:
        lword = word.lower()
        if lword in swords:
            # skip the stop words
            continue
        elif lword in counts:
            counts[lword] +=1
        else:
            # doesn't exist in the dictionary
            counts[lword] = 1

    # convert the dictionary to a dataframe
    # https://stackoverflow.com/questions/18837262/convert-python-dict-into-a-dataframe
    df = pd.DataFrame.from_dict(counts, orient='index', columns=['word_count'])

    # sort it by the word_count column
    df.sort_values('word_count', ascending=False, inplace=True)
    df.index.name = 'word'

    return df

df = countWords_nostop(wordlist)
print(df.head())

# method 2: filter the dataframe
print('\nMethod two')
swords = [re.sub(r"[^A-Za-z\s]", "", sword) for sword in stopwords.words('english')]

df = countWords(wordlist)
df.reset_index(inplace=True) # we want to do an apply on word, so it has to be a regular column, not the index

df['stopword'] = df.word.apply(lambda x: True if x in swords else False)
print(df.head(10))

df = df[df.stopword==False] # filter the dataframe
print(df.head(10))

           word_count
word                 
emissions         828
project           782
would             758
air               711
quality           504

Method two
        word  word_count  stopword
0        the        3721      True
1        and        1910      True
2         of        1545      True
3         to        1332      True
4         no        1257      True
5         in         899      True
6        for         888      True
7  emissions         828     False
8    project         782     False
9      would         758     False
            word  word_count  stopword
7      emissions         828     False
8        project         782     False
9          would         758     False
10           air         711     False
12       quality         504     False
13            pm         454     False
14  construction         431     False
16       section         422     False
19            co         342     False
24          year         284     False
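
A third option, sketched here under some assumptions, is to filter while counting and to keep the stopwords in a `set`, since membership checks against a set are fast even when the collection is large. The `stop_set` below is a tiny hypothetical stand-in so that the example runs on its own. In the notebook you would build it from `stopwords.words('english')` instead.

```python
from collections import Counter

# tiny, hypothetical stopword set used only for this illustration
stop_set = {"the", "and", "of", "to", "in", "for"}

words = "the emissions of the project and the quality of the air".split()

# count only the words that are not stopwords
counts = Counter(word.lower() for word in words if word.lower() not in stop_set)
print(counts.most_common())
```

With a few hundred stopwords and a document of this size, the speed difference is small, but the habit pays off on larger collections of documents.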


Finally, we might want to *lemmatize* or *stem* the words. We saw lemmatization used in the [Brinkley & Stahmer](https://journals.sagepub.com/doi/abs/10.1177/0739456X21995890) paper. Both processes group related words, e.g. `highway` and `highways`, by reducing them to a common *root*. Here we use the simpler option, stemming, which strips suffixes by rule. That also groups `constructing` and `construction`, but the root it returns is not always a real word.

In [187]:
from nltk.stem import PorterStemmer
ps = PorterStemmer()
print(ps.stem('construction'))
print(ps.stem('highways'))
print(ps.stem('housingelementifcation'))  # even if it doesn't know the (made-up) word, it takes a decent guess

construct
highway
housingelementifc
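
To get a feel for what rule-based stemming does, here is a deliberately crude toy version. It is not the Porter algorithm, which has many more rules. The suffix list and the name `toy_stem` are made up for this sketch.

```python
# hypothetical, far-from-complete list of suffixes to strip
SUFFIXES = ("ing", "ion", "s")


def toy_stem(word):
    """Strip the first matching suffix, as long as at least 3 letters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for word in ["highways", "constructing", "construction", "this", "gas"]:
    print(word, "->", toy_stem(word))
```

Note what happens to `this`: blind suffix stripping turns it into `thi`, which is no longer a recognizable stopword. That is one reason to remove stopwords before stemming rather than after.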


In [189]:
# same function as before, just with one extra line

def count_roots(wordlist):
    
    # this might be overkill, but we probably want a version of stopwords that excludes the punctuation  
    # this is a list comprehension - a shorthand for a loop
    swords = [re.sub(r"[^A-Za-z\s]", "", sword) for sword in stopwords.words('english')]
        
    
    counts = {} # a dictionary to hold the counts
    for word in wordlist:
        lword = word.lower()
        if lword in swords:
            # skip the stop words (check before stemming, because stems
            # such as 'thi' for 'this' are not in the stopword list)
            continue

        # this is the extra line
        root = ps.stem(lword)

        if root in counts:
            counts[root] += 1
        else:
            # doesn't exist in the dictionary
            counts[root] = 1

    # convert the dictionary to a dataframe
    # https://stackoverflow.com/questions/18837262/convert-python-dict-into-a-dataframe
    df = pd.DataFrame.from_dict(counts, orient='index', columns=['word_count'])

    # sort it by the word_count column
    df.sort_values('word_count', ascending=False, inplace=True)
    df.index.name = 'word'

    return df

df = count_roots(wordlist)
df.head(10)

word,word_count
emiss,1044
project,871
would,758
air,711
qualiti,504
pm,454
construct,441
section,431
impact,430
co,342


Whether the roots are more useful than the original words depends on your specific task.

So now we've got the tools to get some text into a useful form. In the next notebooks, we'll interpret the text using topic modeling and sentiment analysis.

<div class="alert alert-block alert-info">
<h3>Key Takeaways</h3>
<ul>
  <li>PDFs are difficult to work with. PyPDF2 is a good starting point, but make sure to inspect your output.</li>
  <li>regex is a powerful tool to clean up text, e.g. removing whitespace and punctuation.</li>
  <li>Before analyzing a text, you will probably need to do additional clean-up such as removing stopwords, converting to lower case, and possibly stemming or lemmatizing the words.</li>
</ul>
</div>