## Homework 5: Natural language processing

In this assignment, we'll practice topic modeling and sentiment analysis.

Please help me grade by observing the following:
 
* Do not rename this notebook (that messes up the autograder)
* Do not include large sections of output (that makes it hard to find your code). For example, use `df.head()` to show the first few rows, rather than printing an entire dataframe. The same goes for printing long strings.

The same ChatGPT policy applies as in previous homework assignments.

We'll use a dataset of transcripts from San Francisco Planning Commission meetings from 2006-2025. To let you focus your time on the natural language processing, I did the web scraping part for you. However, I encourage you to review and tinker with the code below as a review of the earlier course material. All the transcripts are downloaded from the [Planning Commission meetings page](https://sanfrancisco.granicus.com/ViewPublisher.php?view_id=20).

In [1]:
# if 0 is always False, so this just means this part of the code won't run 
# (you can load in the file instead of scraping it yourself)
if 0:   
    import requests
    import pandas as pd 
    from bs4 import BeautifulSoup
    
    r = requests.get('https://sanfrancisco.granicus.com/ViewPublisher.php?view_id=20') # Planning Commission homepage
    soup = BeautifulSoup(r.content, 'html.parser')
    links = [link.get('href') for link in soup.find_all('a', href=True)] # get all weblinks on the page (skipping <a> tags without an href)
    links = [link for link in links if 'TranscriptViewer' in link] # restrict to those that are transcripts (there are also video files)

    # now loop over all the URLs, retrieve the transcript, and store it in a dictionary
    transcripts = {}
    for ii, link in enumerate(links):
        clipid = link.split('clip_id=')[-1] # get the clip id, just for the dictionary key
        if clipid in transcripts: continue
        print ('Fetching link {} of {}: https:{}'.format(ii+1, len(links),link))
        r = requests.get('https:'+link)
        soup = BeautifulSoup(r.content, 'html.parser')
        transcripts[clipid] = soup.text  # store the result in the dictionary with key=clip_id

    # save to a zipped csv with the clip id and the transcript
    outFn = 'sf_pc_transcripts.zip'
    pd.DataFrame.from_dict(transcripts, orient='index', columns=['transcript']).to_csv(outFn)

The transcripts are in the file `sf_pc_transcripts.zip`. Load them into a `pandas` dataframe called `meetings`. (Note that you don't have to unzip first.)

*Important:* Please use a *relative* filepath so that I can grade without editing your code. For example:
- `pd.read_csv('a_file_name.csv')` will work on any computer where the input file is in the same directory
- `pd.read_csv('/Users/adammb/Documents/homework/a_file_name.csv')` will only work on my computer 

In [None]:
# your code here
meetings = 999

### BEGIN SOLUTION
import pandas as pd
fn = 'sf_pc_transcripts.zip'
meetings = pd.read_csv(fn)
### END SOLUTION

In [None]:
# autograder tests - do not edit
print(len(meetings), type(meetings))
assert len(meetings) == 857
assert isinstance(meetings, list) or isinstance(meetings, pd.DataFrame)

Let's clean up each of these transcripts. 

First, create a function that removes from a string:
- excess whitespace, punctuation, and digits
- words of two letters or fewer
- the standard nltk stopwords
- the extra words listed in the cell below, which are common in pretty much every transcript

Your function should take a string and return a cleaned-up string in the form of a list of words.

*Hint*: You'll first want to use regex to remove the excess whitespace and punctuation (and then whitespace again). Then create a list of words using `split()` or `word_tokenize()` and remove the stopwords. 

(Don't apply the function to your transcripts yet — that's the next question.)
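
The regex part of the hint can be tried on a toy string before bringing in nltk. The sketch below uses only the standard library. Note that `toy_stopwords` is a made-up stand-in for the real stopword list, and `rough_clean` is a hypothetical helper name for illustration, not the `clean_string` function that the autograder expects.

```python
import re

# hypothetical stand-in for the nltk stopwords plus the extra stopwords
toy_stopwords = {'the', 'and', 'was'}

def rough_clean(text):
    # drop everything that is not a letter or whitespace (punctuation, digits)
    text = re.sub(r"[^A-Za-z\s]", "", text)
    # collapse runs of spaces, tabs and newlines into a single space
    text = re.sub(r"\s+", " ", text).strip()
    # lowercase, split on spaces, then filter out short words and stopwords
    return [w for w in text.lower().split() if len(w) > 2 and w not in toy_stopwords]

print(rough_clean("The  hearing\twas long,\n\nand it ended at 9:30 PM!!"))
```

Removing punctuation before collapsing whitespace means that any gaps left behind by deleted characters are cleaned up in the same pass.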

In [None]:
# a list of words for you to add to the standard nltk stopwords
extra_stopwords = ['san', 'francisco', 'board', 'supervisors', 'supervisor', 'thank', 'people', 'clerk','planning',
           'commissioners','commissioner','commission', 'still','day','maybe','something',
           'see','continue','okay','right','motion','thats','really','make','kind','could',
           'city','want','president','would','like','public','please','aye', 'good','afternoon','second','may',
          'item','next','also','think','year','time','speaker','know','one','many','much','get',
          'great','two', 'one', 'years', 'actually', 'going', 'staff', 'hearing',
        'three', 'inaudible', 'actual', 'yeah', 'look', 'said', 'work', 'theres', 'sure',]

def clean_string(text):
    # your code here
    # it turns text (a string) into a cleaned list of words
    return cleaned_list_of_words


### BEGIN SOLUTION
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import re

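# strip apostrophes from the nltk stopwords so that they match the cleaned text (e.g. "don't" becomes "dont")
swords = [re.sub(r"[^A-Za-z\s]", "", sword) for sword in stopwords.words('english')]
# build the combined stopword set once, so that each lookup is fast
all_stopwords = set(swords + extra_stopwords)


def clean_string(text):
    # remove whitespace
    text = re.sub(r"\s+", " ", text)
    
    # remove punctuation and digits
    text = re.sub(r"[^A-Za-z\s]", "", text)

    # remove whitespace again
    text = re.sub(r"\s+", " ", text)
    
    cleaned_list_of_words = [word for word in word_tokenize(text.lower()) if word not in all_stopwords and len(word)>2]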

    
    return cleaned_list_of_words

### END SOLUTION

In [None]:
# autograder tests - do not edit
newstr = clean_string('A    very dirty 934\t999 string of what the San Francisco planning Commission discusses us IS like  this')

print(newstr)
assert newstr == ['dirty', 'string', 'discusses']

Now, use your function to clean up the transcripts. Create a new column in your `meetings` dataframe called `cleaned`. Each entry should be a *list* of words.

*Hint*: the `apply` method is the simplest way to do this.

In [None]:
# your code here

### BEGIN SOLUTION
meetings['cleaned'] = meetings.transcript.apply(clean_string)
### END SOLUTION

In [None]:
# autograder tests - do not edit
print(meetings.loc[0,'cleaned'][:50])
print(len(meetings.loc[10,'cleaned']))

assert 'cleaned' in meetings.columns
assert meetings.loc[0,'cleaned'][:4] == ['county', 'thursday', 'welcome', 'thursday']
assert len(meetings.loc[10,'cleaned'])==18441


Estimate an LDA topic model on transcripts. For the hyperparameters, I'd suggest `num_topics=15`, `alpha=1` and `eta=0.05` given that most meetings are likely to discuss many different topics. The challenge problem asks you to go deeper and experiment with different values, but feel free to do so here if you are inclined.
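
To build intuition for `alpha`: LDA draws each document's topic mixture from a Dirichlet prior, and larger values of `alpha` give more even mixtures. The sketch below is only an illustration of that prior, not part of the gensim workflow. It samples from a symmetric Dirichlet using the standard library and compares the average share of the single largest topic for a small and a large `alpha`. The helper names `sample_topic_mix` and `mean_largest_share` are hypothetical.

```python
import random

def sample_topic_mix(alpha, num_topics, rng):
    # a symmetric Dirichlet draw is a set of Gamma(alpha, 1) draws, normalized to sum to 1
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_topics)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_share(alpha, num_topics=15, n_docs=2000, seed=0):
    # average, over many simulated documents, of the share taken by the biggest topic
    rng = random.Random(seed)
    largest = [max(sample_topic_mix(alpha, num_topics, rng)) for _ in range(n_docs)]
    return sum(largest) / n_docs

for alpha in (0.05, 1):
    print(alpha, round(mean_largest_share(alpha), 2))
```

With a small `alpha`, one topic tends to dominate each simulated document; with `alpha=1`, the mixture is spread over many topics, which matches the expectation that a single meeting covers many subjects.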

Visualize your topic model using `pyLDAvis`. Remember, moving the relevance slider to the left will emphasize words that are unique to that topic.
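
As I understand it, the slider sets the weight λ in the relevance measure from the LDAvis paper (Sievert and Shirley, 2014): relevance = λ · log p(word | topic) + (1 − λ) · log [p(word | topic) / p(word)]. The sketch below computes this for two made-up words, to show why a low λ favors words that are distinctive to a topic. The probabilities are invented for illustration and do not come from the transcripts.

```python
import math

def relevance(p_word_given_topic, p_word, lam):
    # lam = 1 ranks by probability within the topic; lam = 0 ranks by lift,
    # i.e. how much more likely the word is in this topic than in the corpus overall
    return lam * math.log(p_word_given_topic) + (1 - lam) * math.log(p_word_given_topic / p_word)

# made-up probabilities: (p(word | topic), p(word) across the whole corpus)
toy_words = {'project': (0.050, 0.040), 'shadow': (0.020, 0.002)}

for lam in (1.0, 0.2):
    ranked = sorted(toy_words, key=lambda w: relevance(*toy_words[w], lam), reverse=True)
    print(lam, ranked)
```

Here the frequent but generic word ranks first at λ = 1, while the rarer word that is concentrated in the topic ranks first at λ = 0.2.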

In [None]:
# your code here

### BEGIN SOLUTION

import gensim
dictionary = gensim.corpora.Dictionary(meetings.cleaned.values)
corpus = [dictionary.doc2bow(wl) for wl in meetings.cleaned.values]
model = gensim.models.LdaMulticore(corpus, id2word=dictionary, num_topics=15, alpha = 1, eta=0.05)


import pyLDAvis
import pyLDAvis.gensim_models   # note that in previous versions this was called pyLDAvis.gensim
pyLDAvis.enable_notebook()
pyLDAvis.gensim_models.prepare(model, corpus, dictionary)

### END SOLUTION

How do you interpret your topic model results, and how would you describe some of the topics? 

Explain in a few sentences or bullet points. 

Your answer here.

Now, it's time for some sentiment analysis!

Analyzing the sentiment of an entire transcript is probably not that helpful, because the positive and negative will balance each other out. So let's do this at the sentence level.

Let's go back to our original dataframe (no need to worry about the cleaned version - sentiment analysis should be robust to the inclusion of punctuation and stop words). For each transcript (a string), we'll need to split (tokenize) it into sentences, creating a list. Then, we need to "flatten" (merge) these lists into a single list.

Create a single list, where each element is a sentence. Call it `sentences`. It should look like this:

`['Sentence 1', 'Sentence 2', 'Sentence 3']`


*Hints*:
- apply the `sent_tokenize()` function to the `transcript` column in your original dataframe. You import `sent_tokenize()` in the same way as `word_tokenize()`
- you can flatten a list of lists using `itertools.chain`. For example, `import itertools` and then try `list(itertools.chain(*[[1,2],[3,4]]))`. Or Google "flatten Python list" for other options.
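
As a small illustration of the flattening hint, the sketch below uses a toy list of lists as a stand-in for the output of `sent_tokenize` on each transcript (the sentences are made up). It shows `itertools.chain.from_iterable`, which does the same job as unpacking with `*`, and an equivalent nested list comprehension.

```python
import itertools

# toy stand-in for the per-transcript lists of sentences
sentences_per_meeting = [
    ['Good afternoon.', 'Welcome to the hearing.'],
    ['Item one is continued.'],
    [],  # an empty transcript contributes nothing
]

# chain.from_iterable walks the outer list directly instead of unpacking it with *
flat = list(itertools.chain.from_iterable(sentences_per_meeting))
print(flat)

# an equivalent nested list comprehension
flat_again = [s for meeting in sentences_per_meeting for s in meeting]
```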

In [None]:
# your answer here
sentences = 999 # replace with your code

### BEGIN SOLUTION
import itertools
from nltk.tokenize import sent_tokenize

# for each transcript, tokenize the sentences
meetings['sentences'] = meetings.transcript.apply(sent_tokenize)
# flatten the list
sentences = list(itertools.chain(*meetings['sentences'].values))
### END SOLUTION

In [None]:
# autograder tests - do not edit
print(len(sentences))
print(sentences[:2])
assert len(sentences) == 1546925
assert sentences[1] == '''Okay good afternoon and welcome to the san francisco planning\n\ncommission hearing for thursday\n\nMAY 22nd, 2025 when we reach the item you're interested in speaking to we ask that you\n\nline up on the screen side of the room or to your right.'''

Write a function, `get_sentiment()`, that calculates the sentiment score (polarity) of a single sentence. 
The function should take a sentence (a string) and return a score.

In [None]:
def get_sentiment(sentence):
    # your code here
    return polarity

### BEGIN SOLUTION
from textblob import TextBlob
def get_sentiment(sentence):
    polarity = TextBlob(sentence).sentiment.polarity
    return polarity

### END SOLUTION

In [None]:
# autograder tests - do not edit
print(get_sentiment('I hate the idea of higher densities'))
assert get_sentiment('I hate the idea of higher densities')==-0.275

Now, apply your sentiment function to every string in your list (`sentences`), and assign the resulting list of polarities to `sentiment_scores`.

*Hint*: a list comprehension might be in order.

In [None]:
sentiment_scores = []  # your list here

### BEGIN SOLUTION
sentiment_scores = [get_sentiment(sentence) for sentence in sentences]
### END SOLUTION

In [None]:
# autograder tests - do not edit
import numpy as np
print(np.round(sentiment_scores[9], 2))
assert np.round(sentiment_scores[9], 2)==0
assert len(sentiment_scores) == len(sentences)

What is the mean sentiment score? Assign it to a variable, `mean_score`.

In [None]:
mean_score = 999 # replace with your code

### BEGIN SOLUTION
import numpy as np
mean_score = np.mean(sentiment_scores)
### END SOLUTION

In [None]:
# autograder tests - do not edit
print(mean_score)
import numpy as np
assert np.round(mean_score,3) == 0.081

Plot a histogram of your scores. Make sure to add axis labels where appropriate.

In [None]:
# your code here

### BEGIN SOLUTION
import seaborn as sns
ax = sns.histplot(sentiment_scores)
ax.set_xlabel('Sentiment score')
ax.set_ylabel('Number of sentences')
### END SOLUTION

Look at a couple of the highest and lowest-ranked sentences. What appears to be driving their sentiment? Comment in a couple of bullet points.

In [None]:
# your scratch code here

### BEGIN SOLUTION
# I made a dataframe for easy sorting
df = pd.DataFrame({'sentence': sentences, 'sentiment': sentiment_scores})
df.sort_values(by='sentiment', inplace=True)

# Now print the first few and last few rows
print(df.sentence.head().values)
print(df.sentence.tail().values)

### END SOLUTION

Your answer here

# Challenge Problem
Remember, you need to do at least two of these challenge problems this quarter.

This challenge problem is open-ended, so you can take it in the direction that interests you most. Here are some suggestions (do 1 of these in depth, 2 more quickly, or something analogous of your choice).

* For the topic modeling, experiment with `num_topics`, `alpha` and `eta` to get a meaningful set of topics. You might want to clean the data further as well, e.g. through lemmatizing and dropping other words
* How has the coverage of topics changed over time? (Note that the transcripts are in date order; the date of the meeting is also in the text.)
* Analyze and plot the sentiment scores for sentences that mention different keywords. Perhaps you want to include the sentence before and after too. Do you see a difference for those that mention "density," "zoning," "housing," "parking," "parks," etc.?
* Has sentiment changed over time?
* Other ideas?
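
For the questions about change over time, one possible starting point is to pull the meeting date out of the transcript text with a regular expression. The sketch below assumes that the date appears as a month name, a day, and a four-digit year, as in the sample sentence in the autograder cell above ("MAY 22nd, 2025"). Other transcripts may be formatted differently, in which case the hypothetical helper `find_meeting_date` returns `None`.

```python
import re
from datetime import date

MONTHS = ['january', 'february', 'march', 'april', 'may', 'june', 'july',
          'august', 'september', 'october', 'november', 'december']
# month name, day with an optional suffix (st/nd/rd/th), optional comma, four-digit year
DATE_RE = re.compile(r"\b(" + "|".join(MONTHS) + r")\s+(\d{1,2})(?:st|nd|rd|th)?,?\s+(\d{4})\b",
                     flags=re.IGNORECASE)

def find_meeting_date(text):
    # return the first "Month day, year" found in the text, or None if there is none
    match = DATE_RE.search(text)
    if match is None:
        return None
    month, day, year = match.groups()
    return date(int(year), MONTHS.index(month.lower()) + 1, int(day))

sample = "commission hearing for thursday\n\nMAY 22nd, 2025 when we reach the item"
print(find_meeting_date(sample))
```

The first date in a transcript is not guaranteed to be the meeting date, so it is worth spot-checking a few results against the text.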

In all cases, write some brief interpretation in a markdown cell.
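
For the keyword suggestion, a minimal sketch of selecting the sentences that mention a keyword, together with their neighbors, might look like the following. The function name `sentences_mentioning` and the toy sentences are made up for illustration.

```python
def sentences_mentioning(sentences, keyword, window=1):
    # return the sorted indices of sentences that mention the keyword,
    # plus `window` neighboring sentences on each side
    keyword = keyword.lower()
    keep = set()
    for i, sentence in enumerate(sentences):
        if keyword in sentence.lower():
            first = max(0, i - window)
            last = min(len(sentences), i + window + 1)
            keep.update(range(first, last))
    return sorted(keep)

toy = ['We are here today.', 'Parking is terrible.', 'I agree.', 'Next item.', 'Thank you.']
idx = sentences_mentioning(toy, 'parking')
print([toy[i] for i in idx])
```

This is a plain substring match, so "park" would also match "parking". If you run it on the flattened `sentences` list, the neighbors of a sentence at the start or end of a transcript can come from a different meeting.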