## Module 8 Class activities
This notebook is a starting point for the exercises and activities that we'll do in class. We'll parse some PDFs, and then implement some simple "bag of words" models.

Our running example: the [City of Palm Springs General Plan update](https://www.psgeneralplan.com).

Before you attempt any of these activities, make sure to watch the video lectures for this module.

### Reading and cleaning PDFs

Let's read in [this PDF of public comments](https://www.psgeneralplan.com/_files/ugd/89af76_0b8c3cd9a25140f4a9791570af8d6ba0.pdf). It's in the `data/` folder in your GitHub repository.

I suggest using `pdfminer` (as in Lecture 17), but you could try other PDF parsing libraries if you like. If you didn't install it already while watching the video lectures, you'll need to do that first.

<div class="alert alert-block alert-info">

<strong>Exercise:</strong> Read in the public comments to a string.
</div>

In [None]:
from pdfminer.high_level import extract_text

# your code here
fn = '../classes/data/PS_VP_Survey_Results_FINAL.pdf'
txt = extract_text(fn)
print('Text is {} characters long'.format(len(txt)))


<div class="alert alert-block alert-info">

<strong>Exercise:</strong> Inspect the string. Does it look usable? Write down what you think you'd need to do to clean it up. 
    
Explain your reasoning to your neighbor.
</div>

Write down your thoughts here.

You may have come up with better ideas (implement them!).

First, note that the first 9 pages are survey responses - not comments. The first "real" comment is "It doesn't speak about the people..."

You can exclude those in a couple of ways. We could have ignored the first 8 pages when loading in via `pdfminer`. Or we can search for the text, "It doesn't speak about the people," and remove all of the earlier part of the string as follows:
* `txt.find("It doesn't")` will give you the location of where that comment starts
* `txt = txt[txt.find("It doesn't"):]` will give you a new string that starts at that location.

Once we've done that, we can convert all tabs, newlines, and multiple spaces to a single space.
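
The two steps above can be sketched on a toy string. This is only an illustration on made-up text, not the solution for the survey PDF, and `cut_before` is a hypothetical helper name. It guards against one pitfall: `str.find` returns `-1` when the marker is missing, and slicing from `-1` would silently keep only the last character. It also searches for `It doesn` rather than `It doesn't`, because a PDF may contain a curly apostrophe where you typed a straight one.

```python
import re

def cut_before(text, marker):
    """Return text starting at the first occurrence of marker (hypothetical helper)."""
    start = text.find(marker)
    if start == -1:
        # str.find signals "not found" with -1; text[-1:] would keep a single character
        raise ValueError(f"marker {marker!r} not found")
    return text[start:]

# Made-up sample with a curly apostrophe, a tab, and newlines
sample = "Survey tables...\nIt doesn\u2019t speak\tabout\n\nthe   people"

# Searching for "It doesn" sidesteps the straight-vs-curly apostrophe mismatch
trimmed = cut_before(sample, "It doesn")

# Collapse any run of whitespace (spaces, tabs, newlines) into a single space
collapsed = re.sub(r"\s+", " ", trimmed)
print(collapsed)
```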

In [None]:
# Example
my_text = 'This is the part of the string I want to exclude. We can start here.'
print(my_text.find('We can start'))
print(my_text[my_text.find('We can start'):])

# this is the same
print(my_text[50:])

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> Clean up the comments, by (i) removing the first 8 pages, (ii) removing the characters that are not letters or spaces, and (iii) replacing whitespace (e.g. newlines, tabs, and multiple spaces) with a single space.
</div>

*Hints:*
* You'll need regular expressions (Python's built-in `re` module)
* Start with the first comment - write a regex that will clean up the first element in your list
* Once you've figured that out, apply it to everything in your list with a loop or list comprehension. 
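
As a minimal sketch of these hints, here is the pattern applied to two made-up comments (the real text will be messier). `clean_comment` is a hypothetical helper name. The character class is spelled out as `[A-Za-z]` because the shorter `[A-z]` also matches the characters that sit between `Z` and `a` in ASCII, such as `_` and `^`.

```python
import re

def clean_comment(comment):
    """Keep only letters, separated by single spaces (hypothetical helper)."""
    letters_only = re.sub(r"[^A-Za-z\s]", "", comment)  # drop digits, punctuation, symbols
    return re.sub(r"\s+", " ", letters_only).strip()     # collapse and trim whitespace

# Made-up comments for illustration
toy_comments = ["Great   plan!!\n(mostly)", "More housing_now, please. 100%"]

# Get it working on the first element...
print(clean_comment(toy_comments[0]))

# ...then apply it to everything with a list comprehension
cleaned = [clean_comment(c) for c in toy_comments]
print(cleaned)

# The [A-z] pitfall: the underscore survives
print(re.sub(r"[^A-z\s]", "", "housing_now"))
```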

In [None]:
import re
# Quotes are a pain: curly and straight apostrophes are different characters,
# so search for "It doesn" rather than "It doesn't".
start = txt.find("It doesn")
print(start)
if start != -1:  # find returns -1 when the marker is missing
    txt = txt[start:]

txt = re.sub(r"\s+", " ", txt)         # collapse runs of whitespace into a single space
# remove everything except letters and whitespace
# (note: [A-z] would also keep [ \ ] ^ _ and `, so spell out A-Za-z)
txt = re.sub(r"[^A-Za-z\s]", "", txt)
txt = re.sub(r"\s+", " ", txt)         # collapse again: removing characters can leave double spaces

# look at the first few hundred characters
txt[:500]

### Word counts
Now we have our cleaned-up string. Let's look at a "bag of words" model: basically, which words are the most frequent?

First, we need to split our string into a list of words, and remove the stopwords. Hint: Think about case too!
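
To see why case matters, here is a small sketch using a made-up sentence and a tiny made-up stopword set (the names `toy_stopwords` and `sentence` are for illustration only; nltk's English list is much longer and is all lowercase).

```python
# Tiny made-up stopword set for illustration
toy_stopwords = {"the", "is", "a", "and", "we"}

sentence = "The plan is a start and WE need more housing"

# Lowercase first so that "The" and "WE" match the lowercase stopwords
words = [w for w in sentence.lower().split() if w not in toy_stopwords]
print(words)

# Without lowercasing, capitalised stopwords slip through the filter
unfiltered = [w for w in sentence.split() if w not in toy_stopwords]
print(unfiltered)
```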

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> Create a list of words, removing stopwords.</div>

In [None]:
# your code here
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

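# (if nltk reports a missing resource, download it with nltk.download as the error message suggests)
# build the stopword set once, rather than reloading the list for every word
stop_words = set(stopwords.words('english'))

wordlist = [word for word in word_tokenize(txt.lower()) if word not in stop_words]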

In the lecture video, we wrote our own custom word count function. But `nltk` actually has that built in. It counts individual words, and also N-grams. For example, a bigram (an N-gram with N = 2) could be "Palm Springs" or "affordable housing."
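
To see what an n-gram count involves, here is a hand-rolled sketch using only the standard library. `count_ngrams` is a hypothetical helper, not part of `nltk`, and the word list is made up. Like `nltk.ngrams`, it represents each n-gram as a tuple, so a single word appears as a one-element tuple.

```python
from collections import Counter

def count_ngrams(words, n):
    """Count n-grams in a list of words (hypothetical stand-in for nltk's tools)."""
    # zip over n staggered copies of the list to form consecutive n-word tuples
    grams = zip(*(words[i:] for i in range(n)))
    return Counter(grams)

# Made-up word list for illustration
toy_words = ["affordable", "housing", "in", "palm", "springs", "affordable", "housing"]

unigram_counts = count_ngrams(toy_words, 1)
bigram_counts = count_ngrams(toy_words, 2)
print(unigram_counts.most_common(2))
print(bigram_counts.most_common(1))
```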

In [None]:
# example
from nltk import ngrams, FreqDist
sample_words = ['a', 'list','of','words','some', 'repeat', 'and','repeat','themselves', 'repeat','themselves']
FreqDist(ngrams(sample_words, 1))

In [None]:
# example bigrams
FreqDist(ngrams(sample_words, 2))

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> Plot the frequencies of words (and bigrams) in your own wordlist. Experiment with the plot arguments, and what you can do with the object returned by <strong>FreqDist</strong>. Based on your results, you might want to add more words to your list of stopwords.</div>

In [None]:
# your code here
n = 1 # could change to 2 for a bigram

FreqDist(ngrams(wordlist, n)).plot(20)

# I don't like that some of the words don't help us understand the substance of the comments. Let's remove them
more_stopwords = ['pm', 'palm', 'springs', 'city', 'need', 'needs', 'would']
new_wl = [word for word in wordlist if word not in more_stopwords]
FreqDist(ngrams(new_wl, n)).plot(20)

# I'm curious what I can do with FreqDist
fd = FreqDist(ngrams(new_wl, n))

# maybe it can be a dataframe
# https://stackoverflow.com/questions/15145172/nltk-conditionalfreqdist-to-pandas-dataframe
import pandas as pd
df = pd.DataFrame(fd.items(), columns=['word', 'frequency']).sort_values(by='frequency', ascending=False)
print(df.head())

# clean up the word column - each entry is a tuple; with n = 1 we just want its first (only) element
df['word'] = df['word'].apply(lambda x: x[0])
print(df.head())

## Extensions to bags of words
One potential use of these "bag of words" models is to look at differences across space or across time. For example, how do building or land use permits change over time? When does "add garage" become less common, and "add ADU" more common?

Or we could look at how word frequency changes across space or cities.

Let's try this with the San Francisco Board of Supervisors meetings. [This webpage](http://sanfrancisco.granicus.com/ViewPublisher.php?view_id=10) gives the archived transcripts ("caption notes").

You could scrape all of the URLs (that's a good exercise!), but for now, just manually create a list with a half dozen or so of the caption notes.

Write a function that for a given URL:
* Gets the text (use `requests`)
* Cleans the text
* Gets the word counts

Then, think about how you might chart the results over time.

This is an open-ended prompt, so spend some time thinking through the steps conceptually, even if you don't get far in implementing it. For example, how will you organize the text of each document and the counts? In a list? A dataframe? Will you loop through each URL?
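
One possible way to organize the pieces is sketched below; it is a starting point under stated assumptions, not the intended answer. All function names are hypothetical, the two "transcripts" are made-up strings, and `fetch_text` assumes the `requests` package is installed. Whether the caption notes come back as plain text or as HTML that needs further stripping is something to check yourself.

```python
import re
from collections import Counter

def fetch_text(url):
    """Download the raw text at url (assumes the requests package is installed)."""
    import requests  # imported here so the rest of the sketch runs without it
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text

def count_words(text):
    """Lowercase, keep letters and whitespace only, and count the resulting words."""
    letters_only = re.sub(r"[^A-Za-z\s]", "", text.lower())
    return Counter(letters_only.split())

def term_over_documents(texts_by_label, term):
    """Return {label: count of term}, with one entry per document."""
    return {label: count_words(text)[term] for label, text in texts_by_label.items()}

# Made-up stand-ins for transcripts, keyed by meeting date.
# In practice you might build this with fetch_text(url) for each URL in your list.
toy_docs = {
    "2020-01": "Housing, housing, and transit.",
    "2021-01": "Transit and parks. Housing too.",
}
housing_counts = term_over_documents(toy_docs, "housing")
print(housing_counts)
```

A dictionary keyed by date converts easily to a pandas Series or DataFrame, which is one route to a line chart over time.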

In [None]:
# your code here

<div class="alert alert-block alert-info">
<h3>You should now be able to:</h3>
<ul>
  <li>Read a PDF into a Python string</li>
  <li>Clean the string</li>
  <li>Count words and n-grams, and plot the results</li>
</ul>
</div>