## Module 9 Class activities
This notebook is a starting point for the exercises and activities that we'll do in class. We'll estimate topic models and do some sentiment analysis.

We'll continue with our example from last class: the [City of Palm Springs General Plan update](https://www.psgeneralplan.com).

Before you attempt any of these activities, make sure to watch the video lectures for this module.

### Reading and cleaning PDFs

Let's read in [this PDF of public comments](https://www.psgeneralplan.com/_files/ugd/89af76_0b8c3cd9a25140f4a9791570af8d6ba0.pdf). It's in the `data/` folder in your GitHub repository.

This is the code from last class. We read in the text of the PDF and exclude the first 9 pages, which are survey responses rather than comments.

In [None]:
from pdfminer.high_level import extract_text

fn = 'data/PS_VP_Survey_Results_FINAL.pdf'
txt = extract_text(fn)
txt = txt[txt.find("It doesn"):]

txt[:200] # see what it looks like

Before we clean up the text further, let's split this into a list of comments. Note that each comment seems to be separated by `AM\n\n` or `PM\n\n`. So if we split on `M\n\n`, we should get a list of comments.
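As a minimal sketch of that split, here is the same idea on a made-up string rather than the real `txt`. The sample comments and timestamps are invented for illustration, and the actual PDF text may be laid out differently.

```python
# Toy stand-in for the extracted PDF text; comments and timestamps are invented
sample_txt = (
    "More bike lanes please.\n\n3/1/2021 10:15 AM\n\n"
    "Keep downtown walkable.\n\n3/2/2021 4:40 PM\n\n"
    "Protect the hillsides."
)

# Splitting on 'M\n\n' breaks the text after every 'AM' or 'PM' timestamp
sample_comments = sample_txt.split("M\n\n")

for c in sample_comments:
    print(repr(c))
```

Because the `M` is consumed by the split, each piece except possibly the last is left ending in `A` or `P`.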

Once you have a list, you can clean it up using regex. Here's my suggestion:
* write a function that takes a string, and returns a clean string (remove excess whitespace and characters that are not letters or a space)
* create a new list by applying this function to each element of your list of comments

The latter can be done with something like:

`newlist = [clean_string(comment) for comment in oldlist]`.
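Here is one possible shape for such a function, shown on invented strings. The name `clean_string_sketch` and the sample data are hypothetical, and this is only one of several reasonable ways to do the cleaning.

```python
import re

def clean_string_sketch(comment):
    """One possible cleaner: keep letters and spaces, collapse whitespace."""
    # Replace anything that is not a letter or whitespace with a space
    letters_only = re.sub(r"[^A-Za-z\s]", " ", comment)
    # Collapse runs of whitespace (including newlines) into single spaces
    return re.sub(r"\s+", " ", letters_only).strip()

# Invented raw comments, standing in for the list you get from the split
raw = ["More bike lanes,\nplease!\n\n3/1/2021 10:15 A", "  Keep   downtown walkable.  "]
cleaned = [clean_string_sketch(c) for c in raw]
print(cleaned)
```

Replacing unwanted characters with a space (rather than deleting them) keeps words separated when punctuation sat between them, at the cost of turning contractions such as `doesn't` into `doesn t`.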

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> Create a list of cleaned comments.
</div>

In [None]:
import re
def clean_string(comment):
    # your code here
    return cleaned_comment

# create a list of comments
comments = 999  # your code here

# then a list of cleaned comments

One final cleaning step: you have probably noticed that all your comments end in `P` or `A`.

Remove these terminal letters from each comment (or just delete the last two characters of each comment). Also remove comments that are just ` P` or ` A` (perhaps you can ignore all comments that are fewer than, say, 5 characters long).

*Hint*: Try another list comprehension (or two).
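A rough sketch of the two comprehensions, using invented comments. The variable names and the threshold of 5 are illustrative choices, not the only way to do it.

```python
# Invented examples of cleaned comments that still carry the trailing A/P
cleaned_sample = ["More bike lanes please A", " P", "Keep downtown walkable P", "A"]

# First comprehension: drop the last two characters (the space and the A/P)
trimmed = [c[:-2] for c in cleaned_sample]

# Second comprehension: keep only comments with at least 5 characters
min_length = 5  # threshold suggested above; adjust as needed
kept = [c for c in trimmed if len(c) >= min_length]
print(kept)
```

Slicing off two characters assumes every comment really ends in ` A` or ` P`. If some do not (the very last piece of the split, for example), a regex substitution anchored at the end of the string is a safer alternative.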

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> Clean up your comments further and drop the short ones.
</div>

In [None]:
# your code here

### Sentiment analysis
Now we have our cleaned-up list of comments. Let's do some sentiment analysis. If we create a dataframe with our comments (as one column) and the polarity score (as a second column), the analysis later on becomes easier.

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> Create a list of polarity scores, one for each comment. Then create a dataframe, with one column for the comment and one for the polarity score.</div>

In [None]:
from textblob import TextBlob
import pandas as pd

# your code here

Take a look at the comments with some of the highest and lowest polarity scores. Do the scores make sense? What words is it picking up on?

Note that a simple sentiment analyzer like TextBlob won't capture nuances, but in aggregate the results can be useful.

In [None]:
# your code here

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> Plot the sentiment scores, and try other analyses that capture sentiment on particular issues.</div>

After you plot the overall sentiment scores (a histogram?), you might want to plot the scores where people mention specific issues. For example, you could add a column that is `True` if the comment mentions housing, and then plot the scores only for those rows. Experiment!
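The flag-and-filter idea might look something like the sketch below. The dataframe, its column names (`comment`, `polarity`, `mentions_housing`), and the scores are all invented for illustration; your own dataframe will have real TextBlob scores.

```python
import pandas as pd

# Invented comments and polarity scores, standing in for your real dataframe
df_sketch = pd.DataFrame({
    "comment": [
        "We need more affordable housing",
        "Love the new park",
        "Housing costs are terrible",
    ],
    "polarity": [0.5, 0.4, -1.0],
})

# True where the comment mentions housing, ignoring case
df_sketch["mentions_housing"] = df_sketch["comment"].str.contains("housing", case=False)

# Keep only the polarity scores for rows that mention housing
housing_scores = df_sketch.loc[df_sketch["mentions_housing"], "polarity"]
print(housing_scores)
```

From here, `housing_scores.plot.hist()` would give a histogram for just those rows, assuming matplotlib is installed.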

In [None]:
# your code here

### Topic modeling
Now let's see if we can identify different topics in the list of comments.

First, we'll need to do a bit more cleanup. For each comment, turn it into a list of words, and exclude stopwords. (After you see the results of your topic modeling, you might want to add more stopwords.) You should end up with a list of lists.
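To show the target structure, here is a small sketch that uses a hand-written stopword set and a plain whitespace split. Both are stand-ins chosen so the example runs without any NLTK downloads; the set `stop_sketch` and the sample comments are invented.

```python
# Tiny hand-written stopword set, a stand-in for NLTK's English stopword list
stop_sketch = {"the", "a", "is", "we", "and", "to", "of", "in"}

# Invented cleaned comments
cleaned_sample = ["We need more housing in the city", "The downtown is walkable and safe"]

# Lowercase, split on whitespace, and drop stopwords: one word list per comment
wordlists = [
    [w for w in comment.lower().split() if w not in stop_sketch]
    for comment in cleaned_sample
]
print(wordlists)
```

With NLTK, the same comprehension should work after swapping in `word_tokenize(comment.lower())` for the split and `set(stopwords.words('english'))` for the stopword set.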

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> Turn your list of comments into a list of lists of words, excluding stopwords.</div>

In [None]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

# your code here

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> Estimate and visualize a topic model from your wordlists. Experiment with the number of topics and the other hyperparameters.</div>

Remember: 
* `alpha` controls the expected distribution of topics across documents. A higher value of `alpha` means that each document is expected to contain more of a mix of topics, rather than focusing on a few topics.
* `eta` (sometimes called beta) controls the expected distribution of words across topics. A higher value of `eta` means that topics are more similar in terms of their mixture of words.
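To build intuition for `alpha`, the sketch below draws topic mixtures from a symmetric Dirichlet distribution, which is the kind of prior LDA places on each document's topic proportions. It uses only the standard library rather than gensim, and the function name `sample_dirichlet` is made up for this illustration.

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw one point from a symmetric Dirichlet(alpha) over k topics."""
    # Normalized independent Gamma(alpha, 1) draws follow a Dirichlet distribution
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(0)  # fixed seed so the draws are repeatable
n_topics = 5

# Compare a small alpha with a large one
for alpha in (0.1, 10.0):
    mix = sample_dirichlet(alpha, n_topics, rng)
    print(alpha, [round(w, 2) for w in mix])
```

With a small `alpha`, most of the weight tends to land on one or two topics; with a large `alpha`, the weights tend to be spread more evenly. The same intuition applies to `eta`, with words in place of topics.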

In [None]:
import gensim
import pyLDAvis
import pyLDAvis.gensim_models 

# your code here

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> Explain your topic model: write a sentence or two, and explain it to your neighbor. What does each topic signify, and how did you choose alpha and eta?</div>

Your comments here.

### Extensions to topic modeling and sentiment analysis
One potential use of these models is to look at differences across space or across time. 

Or we could look at how word frequency changes across space or cities.

You could also feed the text to a generative AI API (e.g. the Google Gemini API which we experimented with in Week 2).

If you have time, try this with the San Francisco Board of Supervisors meetings. [This webpage](http://sanfrancisco.granicus.com/ViewPublisher.php?view_id=10) gives the archived transcripts ("caption notes").

You could scrape all of the URLs (that's a good exercise!), but for now, just manually create a list with 10-20 or so of the caption notes.

Write a function that for a given URL:
* Gets the text (use `requests`)
* Cleans the text
* Tokenizes (splits into words)

Then, think about how you might estimate a topic model.

Would sentiment analysis be useful here?

This is an open-ended prompt, so spend some time thinking through the steps conceptually, even if you don't get far in implementing it. For example, how will you organize the text of each document and the counts? In a list? A dataframe? Will you loop through each URL?
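As one option among several, the text and the word counts could live in dictionaries keyed by document. The sketch below uses placeholder names and invented text; in practice the keys might be the caption-note URLs and the values the cleaned text you fetched.

```python
from collections import Counter

# Hypothetical placeholders: keys would be caption-note URLs in practice,
# and values the cleaned text retrieved from each one
docs = {
    "meeting_1": "housing housing transit budget",
    "meeting_2": "transit parks budget budget",
}

# One Counter of word frequencies per document, keyed the same way
counts = {name: Counter(text.split()) for name, text in docs.items()}

for name, counter in counts.items():
    print(name, counter.most_common(2))
```

A dictionary of counters like this converts easily to a dataframe later (one row per document, one column per word) if you want to compare frequencies across meetings.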

In [None]:
# your code here

<div class="alert alert-block alert-info">
<h3>You should now be able to:</h3>
<ul>
  <li>Do further cleaning of text documents</li>
  <li>Estimate and interpret sentiments of texts</li>
  <li>Estimate and interpret topic models</li>
</ul>
</div>